> This article is describing a problem that is still two steps removed from where AI code becomes actually useful.
But it does a good job of countering the narrative you often see on LinkedIn, and to some extent on HN as well, where AI is portrayed as capable of developing enterprise software end to end. If you spend any time in discussions hyping AI, you will have seen plenty of confident claims that traditional coding is dead and that AI will replace it soon. Posts like this are useful because they show a more grounded reality.
> 90 percent of the things users want either A) dont exist or B) are impossible to find, install and run without being deeply technical. These things dont need to scale, they dont need to be well designed. They are for the most part targeted, single user, single purpose, artifacts.
Yes, that is a particular niche where AI can be applied effectively. But many AI proponents go much further and argue that AI is already capable of delivering complex, production-grade systems. They say you don't need engineers anymore. They say you only need product owners who can write down the spec. From what I have seen, that claim does not hold up, and this article supports that view.
Many users may not be interested in scalability and maintainability... But for a number of us, including the OP and myself, the real question is whether AI can handle situations where scalability, maintainability, and sound design actually DO matter. The OP understands this well.
You could use a wrapper that parses all the command-line options. Basically you loop over "$@", look for strings starting with '-' and '--' and skip those; then look for a non-option argument and store that as the subcommand; then look for more '-' and '--' options. Once that's all done you have enough to spot subcommand "reset" with subcommand option "--hard". About 50 lines of shell script.
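That loop can be sketched in a few lines. This is only a sketch: `safe_git` is a made-up name, and option arguments (e.g. `-C <dir>`) would need extra handling in a real wrapper.

```shell
# Hypothetical wrapper around the real git. Scan the arguments for the
# first non-option word (the subcommand), collect the options that
# follow it, and refuse the dangerous combination.
safe_git() {
  subcommand=""
  subcommand_opts=""
  for arg in "$@"; do
    case "$arg" in
      -*) # an option: ignore globals before the subcommand,
          # collect the ones after it
        [ -n "$subcommand" ] && subcommand_opts="$subcommand_opts $arg"
        ;;
      *)  # first bare word is the subcommand
        [ -z "$subcommand" ] && subcommand="$arg"
        ;;
    esac
  done
  if [ "$subcommand" = "reset" ] && \
     printf '%s' "$subcommand_opts" | grep -q -- "--hard"; then
    echo "refusing to run 'git reset --hard'" >&2
    return 1
  fi
  command git "$@"   # everything else passes through to the real git
}
```

Install it as a shell function or a script that shadows `git` in PATH; anything other than `reset --hard` is delegated untouched.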
Sounds like you care about data stored on your filesystem! Take one step back and solve that problem. Use a properly isolated sandbox, e.g. a GitHub workspace on an account that works with a fork.
Care about the data in that workspace? Push it first.
Otherwise it's a cat-and-mouse game of whack-a-mole.
Does any of this help me if Claude runs `git reset --hard`?
If I'm working in a sandbox, I still have uncommitted changes in that sandbox, and if Claude runs `git reset --hard` on those uncommitted changes, I've got the same problem?
> Care about the data in that workspace? Push it first.
But you're changing the problem. If I push everything, then yeah, I've got no problem. But between pushing one change and the next, you're gonna have uncommitted changes, won't you? And if Claude runs `git reset --hard` in that window, it's the same problem, isn't it?
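That window is easy to reproduce (throwaway temp repo; file names are just for illustration):

```shell
# Set up a disposable repo with one committed file.
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.email demo@example.com
git config user.name demo
echo "v1" > notes.txt
git add notes.txt
git commit -qm "initial"

# The "between pushes" state: a tracked file with unpushed edits.
echo "hours of unpushed work" > notes.txt

git reset --hard -q    # what the agent might run

cat notes.txt          # prints "v1": the uncommitted work is gone
```

Sandboxing changes nothing here; the loss happens entirely inside the sandbox.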
OK, I contest. If you are worried about it resetting its own work, then yes. Although just chuck the same prompt at it and you should get a similar result, amirite? Maybe a better one lol!
Also you can instruct it to commit and push at every step too.
Just fork git and patch that out?
Can't be that hard; just ask the agent for the patch.
Don't need to update often either, so it's ok to rebase like twice a year.
Isn't this a natural consequence of how these systems work?
The model is probabilistic and sequences like `git reset --hard` are very common in training data, so they have some probability to appear in outputs.
Whether such a command is appropriate depends on context that is not fully observable to the system, like whether a repository or its changes are disposable. Because of that, the system cannot rely purely on fixed rules and has to infer intent from incomplete information, which is also probabilistic.
With so many layers of probabilities, it seems expected that sometimes commands like this will be produced even if they are not appropriate in that specific situation.
Even a 0.01% failure rate due to context corruption, misinterpretation of intent, or guardrail errors would show up regularly at scale; that's 1 failure in every 10,000 queries.
> Just by a thing being common in training data doesn't mean it will be produced.
That's not what I said at all. I never said it will be produced. I said there is some probability of it being produced.
> False, it goes against the RL/HF and other post training goals.
It is correct that frequency in training data alone does not determine outputs, and that post-training (RLHF, policies, etc.) is meant to steer the model away from undesirable behavior.
But those mechanisms do not make such outputs impossible. They just make them less likely. The underlying system is still probabilistic and operating with incomplete context.
I am not sure how you can be so confident that a probabilistic model would never produce `git reset --hard`. There is nothing inherent in how LLMs work that makes that sequence impossible to generate.
> It is meaningless to say that because the author was able to reproduce it multiple times.
I don't know how that refutes what I'm saying.
The behaviour was reproduced multiple times, so it is clearly an observable outcome, not a one-off. It just shows that the probability of `git reset --hard` is > 0 even with RLHF and post-training.
Yes, if something is reproducible and undesirable, it is a bug, and RLHF can reduce it. I'm not disputing that. "Reduce" is the keyword here. You can't eliminate these outputs entirely.
My point is that fixing one bug does not eliminate the class of bugs. Heck, it does not even fix that one bug deterministically. You only reduce its probability like you rightly said.
With git commands, there is no system like Lean that can formally reject invalid output. Honestly, I think the mathematicians have it easier with LLMs, because a proof is either valid or invalid. It's not so clear cut with git commands: almost any command is valid in some narrow context, which makes it much harder to reject undesirable outputs entirely.
Until the underlying probability of undesirable output becomes so negligible that it is practically impossible, these kinds of issues will keep surfacing even if you address individual bugs. Will the probabilities become that low someday? Maybe. But we are not there yet. Until then, we should recalibrate our expectations and rely on deterministic safeguards outside the LLM.
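One such deterministic safeguard, sketched below: snapshot uncommitted work into the stash before each agent step. A stash entry is a real ref in git's object store, so it survives a later `git reset --hard`. The repo setup is throwaway and the file names are illustrative.

```shell
# Disposable repo with one committed file and an uncommitted edit.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo
echo "v1" > app.txt
git add app.txt
git commit -qm "initial"
echo "unpushed work" > app.txt            # uncommitted change

# Deterministic safeguard: snapshot, then keep working as before.
git stash push -q --include-untracked -m "pre-agent snapshot"
git stash apply -q                        # restore the working tree;
                                          # the snapshot stays stashed

git reset --hard -q                       # the agent misbehaves...

git stash pop -q                          # ...but the snapshot survives
cat app.txt                               # prints "unpushed work"
```

This is dumb, scriptable, and independent of anything the model does, which is exactly the point.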
When sampling from an LLM, people normally truncate the token probability distribution (top-k or nucleus sampling) so that low-probability tokens are never sampled. So the model shouldn't produce really weird outputs even if they technically have nonzero probability under the model.
> I'm with you on all points except for it being bought.
Stars get bought all the time. I've been around the startup scene, and this is basically part of the playbook now for the open-core model. You throw your code up on GitHub, call it open source, then buy your stars early so it looks like people care. Then charge for hosted or premium features.
There's a whole market for it too. You can literally pay for stars, forks, even fake activity. Big star count makes a project look legit at a glance, especially to investors or people who don't dig too deep. It feeds itself. More people check it out, more people star it just because others already did.
> But what I think is an even better solution is to do it at the content level: sign the content, like a GPG signature
How would this work in reality? With the current state of browsers it's not possible: the ISP can still insert its content into the page, and the browser will still load the modified content even though it no longer matches the signature. Nothing in current tech forces GPG signature verification.
If you mean that browsers need to be updated to verify GPG signatures, I'm not sure how realistic that is. Browsers cannot verify a GPG signature and vouch for it until you solve key revocation and key expiry, and if you try to solve those, you are back to the same problems that certificates have.
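For what it's worth, the signing half already works out of band today; assuming `gpg` is installed (the key id and file names below are illustrative), the part no browser automates looks like this:

```shell
# Throwaway keyring so nothing touches the real one.
export GNUPGHOME=$(mktemp -d)
chmod 700 "$GNUPGHOME"
gpg -q --batch --pinentry-mode loopback --passphrase '' \
    --quick-generate-key publisher@example.com default default never

# Publisher signs the page content with a detached signature.
printf '<html>page content</html>\n' > page.html
cp page.html page.orig
gpg -q --batch --pinentry-mode loopback --passphrase '' \
    --local-user publisher@example.com --detach-sign page.html

# Reader's side: verification fails once an ISP injects anything.
printf '<html>injected ad</html>\n' >> page.html
gpg --batch --verify page.html.sig page.html 2>/dev/null \
  || echo "signature mismatch: content was modified"
```

The hard parts you name (revocation, expiry, and getting browsers to run this check at all) are exactly what this snippet doesn't solve.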
Signatures do have similar problems to certificates. But Gemini doesn't avoid them either and often recommends TOFU certificates. I think the comment's point was that digital signatures ensure identity but are unsuitable for e-commerce, a leading source of enshittification.
> you are back to the same problems that certificates have.
Some of the same problems. One nice thing about verifying content rather than using an SSL connection is that plain-old HTTP caching works again.
That aside, another benefit of less-centralized and more-fine-grained trust mechanisms would be that a person can decide, on a case-by-case basis, which entities should be trusted/revoked/etc., rather than relying on root CAs that cover huge swaths of the internet. Admittedly, most people would just use "whatever's the default," which would not behave that differently from what we have now. But it would open the door to more ergonomic fine-grained decision-making for those who wish to use it.
How about in-ear earphones? They use silicone tips, right? Are there any known harmful effects of those?
The study names brands like Bose, Panasonic, Samsung, and Sennheiser. What about Apple AirPods? Does anyone know what those are made of and whether they have any harmful effects?
Silicone doesn't require plasticizers (because it's elastic on its own) or fire retardants (because it doesn't burn easily). The material itself is also considered biologically inert and is less affected by temperature, solvents, etc. So it's usually the best choice for stuff like that. The reason it's not as common is that it's more expensive and not as durable. It has relatively poor abrasion and cut resistance.
But then, I wouldn't worry about headphones at all. You probably sleep on a mattress made from polyurethane foam that contains plasticizers and fire retardants in much greater quantities. The same goes for your car seats, and they off-gas a lot more when parked in the sun. You'd probably need to eat 1,000 earbuds to match that.
Expanding a thought beyond 280 characters and publishing it somewhere other than the X outrage machine is something we should be encouraging.