
This is snark. Since when has a junior-level dev managed to debug and deploy, say, a CloudFormation stack and follow up with notes in under 3 minutes?

Heard this analogy elsewhere, but worth repeating:

AI is like having the greatest developer who ever lived, but she is always on 4 beers.


personifying ai is incredibly cringe no matter how weird your comparison is

It's an analogy.

that’s the personification i’m referring to, yes. incredibly weird.

Imagine a drunk developer. Sparks of brilliance while missing the obvious.

I know of a publicly traded company which in its early years was built on beer. Literally. 3 guys in a co-working space in Cambridge, MA. Beer fueled their progress. 15 years later the software is still the backbone of the org.

weird.

They lost me at Opus 4.7

Anecdotally, OpenAI is trying tooth and nail to get into our enterprise, and has offered unlimited tokens until summer.

Gave GPT5.4 a try because of this, and honestly I don’t know if we are getting some extra treatment, but running it at extra high effort for the last 30 days I’ve barely seen it make any mistakes.

At some points even the reasoning traces brought a smile to my face, as it preemptively handled things I had forgotten to instruct it about but that were critical to getting a specific part of our data integrity 100% correct.


Same here. I feel like all of these shenanigans could be because Anthropic are compute constrained, forcing them to take reckless risks to reduce it.

Same here. I was a fervent Claude Code user at $200/mo until Opus 4.7.

Freezing your IDE version is now a thing of the past; the new reality is that we can't expect agentic dev workflows to be consistent, and I see too many people (including myself) getting burned by going the single-provider route.

On one hand I’m glad to finally see Anthropic communicate on this, but at this point all I have to say is… time to diversify?


Opus 4.7 via Claude Code has been inconsistent for me. Sometimes it feels like working with a brilliant collaborator and is as good as 4.5 and 4.6 were. Other times, it takes dumb and lazy shortcuts. It can be quite frustrating. Its response when I tell it it did something wrong is often to write a memory... which it then does not always read. The inconsistency isn't due to session length or age either. These are all new sessions. I feel like sometimes I get routed to a dumber model or some other hidden setting is applied.

My experience as well. This is even worse than just having a mediocre model, because I can work around that. The inconsistency means it produces different outputs for the same prompt, and I can't rely on that as a business tool.

They lost me a little before then - Claude Code's regressions were very obvious, and there's no sign in this article or in the comments of those who work on Claude Code on HN that they've learned their lesson. They'll continue to tweak and generally mess around with a product people are using, altering the behaviour without notice in ways that can severely impact use, for months! As a replacement, GPT5.4 has been remarkably consistent and capable. I've cancelled my Max plan.

I started using Claude heavily on the 20th after having not used it for a year. Largely Sonnet 4.6, web, cowork and code. Can confidently say it is significantly worse than this time a year ago and regret that my new employer requires we use it, and only it.

GPT-5.4 was already better than Opus 4.6 in a lot of areas, especially correctness and tricky logic. I’m eager to see if 5.5 is even better.

I’ve never been one to complain about new models, and I also didn’t experience most of the issues folks were citing about Claude Code over the last couple of months. I’ve been using it since release, happy with almost every new update.

Until Opus 4.7 - this is the first time I rolled back to a previous model.

Personality-wise it’s the worst of AI: “it’s not x, it’s y”, strong short sentences, in general a bullshitty vibe, also gaslighting me that it fixed something even though it didn’t actually check.

I’m not sure what’s up, maybe it’s tuned for harnesses like Claude Design (which is great btw) where there’s an independent judge to check it, but for now, Opus 4.6 it is.


I noticed the difference, but coming from Gemini and xAI models it wasn’t that glaring. I still find that Opus makes much better plans than anything else I’ve tried, and it’s been very good at catching my mistakes in using public-key cryptography, also finding out why my crsqlite queries were failing despite no official documentation on the topic.

I’d never use such an expensive model for coding, so that might explain why I have little to complain about.


I went back to 4.5. No regrets and it’s a bit cheaper.

Same here. 4.6 was a downgrade in thinking quality, but I appreciated the extended context at first.

Over time, I realized the extended context became randomly unreliable. That was worse to me than having to compact, where at least I knew where I was picking up.


I find that it is better at thinking broadly and at a high level, on tasks that are tangential to coding like UX flows, product management and planning of complex implementations. I have yet to see it perform better than either Opus 4.6 or 4.7 though.

Extra high burns tokens, I find. I run 5.4 on medium for 90% of tasks and switch to high if I see medium struggling, and it's very focused and makes minimal changes.

Yeah but it also then strikes the perfect balance between being meticulous and pragmatic. Also it pushes back much more often than other models in that mode.

Rework burns tokens.

Note mini-high is similar perf/latency to medium, but much cheaper

Not a problem if they're offering unlimited, lol

What's your workflow like? I'd be curious to test OpenAI out again but Claude Code is how I use the models. Does it require relearning another workflow?

Isn’t it basically the same thing? You type what you want into the input box and it does what you ask for.

Claude Code can be configured with custom /slash commands and other details that don't necessarily transfer over to Codex. /remote-control in CC is really great for walking away from my computer and continuing from my phone, for instance.

I guess I'm asking if their CLI tool is the same or if it functions differently. I've never used anything besides CC, so I wouldn't know if it's basically the same thing.

Truth

Do you have to know assembler to be able to write code in Java? The point being that you rarely know the underlying mechanics, and the same is true for vibe coding.


This is not a good analogy.


Nah, but you have to actually put the work in to get the credit. Lazily vibe coding slop and then passing it off as your work is like claiming you cooked a microwave meal.


What happens before the probability distribution? I’m assuming, say, alignment or other factors would influence it?


In microgpt, there's no alignment. It's all pretraining (learning to predict the next token). But for production systems, models go through post-training, often with some sort of reinforcement learning which modifies the model so that it produces a different probability distribution over output tokens.

But the model "shape" and computation graph itself doesn't change as a result of post-training. All that changes is the weights in the matrices.
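To make that concrete, here is a toy sketch in Python (illustrative names and numbers only, not microgpt's actual code): the forward pass is a fixed graph that turns a hidden state into a probability distribution over the next token, and post-training only moves the numbers inside the weight matrices.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size, hidden_size = 8, 16

    def next_token_probs(unembed, hidden):
        # Fixed computation graph: a linear layer, then softmax over the vocab.
        logits = hidden @ unembed                  # shape: (vocab_size,)
        exp = np.exp(logits - logits.max())
        return exp / exp.sum()                     # probability distribution over tokens

    hidden = rng.normal(size=hidden_size)
    pretrained = rng.normal(size=(hidden_size, vocab_size))
    # Post-training (e.g. RL) nudges the same matrices; the graph itself is unchanged.
    post_trained = pretrained + 0.05 * rng.normal(size=(hidden_size, vocab_size))

    print(next_token_probs(pretrained, hidden))    # distribution before post-training
    print(next_token_probs(post_trained, hidden))  # shifted distribution, same code path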


I would assume this varies from case to case, such as:

- How aligned has it been to “know” that something is true (e.g. ethical constraints)

- Statistical significance and just being able to corroborate one alternative in its training data more strongly than another

- If it’s a web search related query, is the statement from original sources vs synthesised from say third party sources

But I’m just a layman and could be totally off here.


Is your prediction that most people actually like to use software?


Do they not? Many phone functions are already available through voice assistants, and have been for a very long time, and yet the vast majority of people still prefer to use them with the UI. Clicking on the weather icon is much easier than asking a chatbot "what's the weather like?"


My elderly mother has an essential tremor (though only in one hand now due to successful ultrasound treatment!) and she would still rather suffer through all her errors with a touch interface than use voice commands.


Some people seem to think that Deckard’s speech-controlled CSI software in Blade Runner is actually something to strive for, UX-wise. As if it makes any sense to use strictly nonvisual, non-two-dimensional affordances to work with visual data.


The sad part is that while everyone is chasing new interface modalities, the traditional 2D UI is slowly getting worse thanks to questionable design trends and a lack of interest.


No, it’ll be some idea we have not developed or named yet.

The current ‘agent’ ecosystem is just hacks on top of hacks.


If you’re working in engineering, find an exit. AI is coming for you.


You can just ask AI


Hey man, take a step away from the keyboard. Instead imagine the everyday person. Would they rather click, scroll, swipe and pull out credit cards across multiple websites, or just ask their digital assistant to do it?

The defaulting to negativity will really eat some communities up from the inside.


Chat windows and voice assistants are a terrible user experience for the average person. This doesn't change that.


I think there's a difference between a typical website chat window and how many people would use ChatGPT. The latter has tables, images and links, which is enough to build up comparisons and order sheets and then ultimately have a format for confirming a purchase. I use it a lot for basic home construction comparisons (materials, volumes, etc.) and could definitely see it getting to the point where it organised an order for me to submit, and eventually to where the submission and payment happened within the chat.

A voice assistant doesn't give you that option to review, but maybe it'd work for ordering fast food. A small chat window could grow to work for simple purchases like takeaway food or small hardware, etc.


I am not so sure about that. The modern web has become complicated/unusable enough that I can see a lot of people preferring a chat interface over having to click through this unholy mess. I might be biased, as I have to deal with accessibility issues on a daily basis. However, there is a whole demographic we're currently leaving behind. There are a lot of people around who simply don't try to use the Internet to get things done, because they are overwhelmed with how it works. My mother doesn't even want to click on a YouTube link sent to her via WhatsApp, because she would leave the well-known app and have to deal with the web... However, I can imagine her using an agentic interface to get things done, although not right now, maybe in 2 years.


I don’t understand how you can make this statement in the midst of ChatGPT being the fastest-growing consumer app in history.


That’s exactly what the folks at Amazon thought when they came up with Alexa. Have you ever bought anything online by asking Alexa to do it? Have you ever seen anyone else do it?


I think the "every-day person" simply isn't wealthy enough to (persistently) care about that level of delegated convenience versus the risks of getting the wrong product or ripped off.


The fact that you're being downvoted over this is proof that people here work and live in a bubble. People value convenience and are willing to pay for it, and if OpenAI is able to advance convenience through these actions, they'll make billions.


Does the average user do this? Granted, I’m not in the USA, but do people really order that much on unusual websites?


You see negativity, I see disappointment that OpenAI isn’t trying to innovate, and instead hoping they can replay Google Search’s history for themselves


Wow this was actually blazing fast. I prompted "how can the 45th and 47th presidents of america share the same parents?"

On ChatGPT.com, o3 thought for 13 seconds; on OpenRouter, GPT OSS 120B thought for 0.7 seconds, and they both had the correct answer.


I'm not sure that's a particularly good question for concluding something positive about the "thought for 0.7 seconds" - it's such a simple answer, ChatGPT 4o (with no thinking time) immediately answered correctly. The only surprising thing in your test is that o3 wasted 13 seconds thinking about it.


A current major outstanding problem with thinking models is how to get them to think an appropriate amount.


The providers disagree. You pay per token. Verbose models are the most profitable. Have fun!


For API users, yes, but for the average person with a subscription or using the free tier it’s the inverse.


Nowadays a pretty large % of usage must be going through monthly subscriptions.


Interesting choice of prompt. None of the local models I have in Ollama (consumer mid-range GPU) were able to get it right.


When I pay attention to o3 CoT, I notice it spends a few passes thinking about my system prompt. Hard to imagine this question is hard enough to spend 13 seconds on.


Not gonna lie, I got sorta goosebumps.

I'm not kidding, such progress from a technological point of view is just fascinating!


How many people are discussing this after one person did 1 prompt with 1 data point for each model and wrote a comment?

What is being measured here? The end-to-end time for one model is:

t_total = t_network + t_queue + t_batch_wait + t_inference + t_service_overhead
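To illustrate why a single stopwatch number says little about the model itself, here is a toy breakdown with entirely invented numbers (nothing below is measured):

    # Hypothetical latency breakdowns; every value is made up for illustration.
    runs = {
        "fast run": {"t_network": 0.15, "t_queue": 0.05, "t_batch_wait": 0.10,
                     "t_inference": 0.35, "t_service_overhead": 0.05},
        "slow run": {"t_network": 0.10, "t_queue": 2.00, "t_batch_wait": 4.50,
                     "t_inference": 6.00, "t_service_overhead": 0.40},
    }

    for name, parts in runs.items():
        t_total = sum(parts.values())
        share = parts["t_inference"] / t_total
        print(f"{name}: t_total = {t_total:.2f}s, inference = {share:.0%} of it")

A single end-to-end timing conflates all of those terms, and only t_inference is really about the model.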

