The law (OK, well, British law) does recognise that many terms can be unfair, especially when one of the parties is an individual, and especially when it relates to employment. Courts can nullify terms on that basis.
Pasting a big batch of new code and asking Claude "what have I forgotten? Where are the bugs?" is a very persuasive on-ramp for developers new to AI. It spots threading and distributed-system bugs that would previously have taken hours to uncover, and for which there isn't any other easy tooling.
I bet there's loads of cryptocurrency implementations being pored over right now - actual money on the table.
Do you not run into too many false positives around "ah, this thing you used here is known to be tricky, the issue is..."
I've seen that happen when prompting it specifically to look for concurrency issues, versus saying something more like "please inspect this rigorously to look for potential issues..."
What's more useful is to have it attempt not only to find such bugs but to prove them with a regression test. In Rust, for example, concurrency bugs can be pinned down with Shuttle or Loom tests.
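To make "prove it with a regression test" concrete, here's a minimal sketch in Python rather than Rust, using a toy `Counter` class I've invented for illustration. The idea is the same as a Loom/Shuttle test: force the problematic interleaving so the race reproduces deterministically instead of once in a thousand runs.

```python
import threading

class Counter:
    """Toy class with a non-atomic read-modify-write (the bug)."""
    def __init__(self):
        self.value = 0

    def increment(self, pause=None):
        v = self.value        # read
        if pause:
            pause()           # test hook: force an interleaving here
        self.value = v + 1    # write (update is lost if another thread
                              # read the old value in between)

def test_lost_update():
    c = Counter()
    barrier = threading.Barrier(2)
    # The barrier guarantees both threads read the old value (0)
    # before either writes, so exactly one increment is lost.
    threads = [threading.Thread(target=c.increment, args=(barrier.wait,))
               for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    assert c.value == 1  # would be 2 if increment were atomic
```

The `pause` hook is the cheat that makes this deterministic; real model-checking tools like Loom explore the interleavings for you instead of requiring you to instrument the code.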
It would be generally good if most code made setting up such tests as easy as possible, but in most corporate codebases this second step is gonna require a huge amount of refactoring or boilerplate crap to get the things interacting in the test env in an accurate, well-controlled way. You can quickly end up fighting to understand "is the bug not actually there, or is the attempt to repro it not working correctly?"
(Which isn't to say don't do it: I think this is a huge benefit you can gain from being able to refactor more quickly. Just to say that you're gonna short-term give yourself a lot more homework to make sure you don't fix things that aren't bugs, or break other things in your quest to make them more provable/testable.)
yes but i can identify those easily. i know that if it flags something that is obviously a non issue, i can discard it.
...because false positives are good errors. false negatives are what i'm worried about.
i feel massively more sure that something has no big oversights if multiple runs (or even multiple different models) cannot find anything but false positives
Just in case you didn't read the full article, this is how they describe finding the bugs in the Linux kernel as well.
Since it's a large codebase, they get even more specific and hint that the bug is in file A, then try again with a hint that the bug is in file B, and so on.
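That loop is easy to script. A rough sketch below; `ask_model` is a hypothetical stand-in for whatever model API or CLI you actually call, and the file paths are just example hints, not real findings:

```python
# Hypothetical sketch of the "hint one file at a time" loop described above.
def ask_model(prompt: str) -> str:
    """Stand-in for a real model call (hypothetical)."""
    return f"[model analysis of: {prompt!r}]"

# Example candidate files to hint at, one per run.
candidate_files = ["fs/namei.c", "kernel/fork.c"]

reports = {}
for path in candidate_files:
    prompt = (f"There is a bug in {path} in this codebase. "
              f"Find it and explain the failure mode.")
    reports[path] = ask_model(prompt)

# Compare the reports afterwards: a hint that lands tends to produce a
# specific, reproducible explanation; a wrong hint produces vague hedging.
```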
very interesting. i think "verbal biasing" and "knowing how to speak" in general is a really important thing with LLMs. it seems to massively affect output. (interestingly, somewhat less with Opus than with GPT-5.4 and Composer 2. Opus seems to intuit a little better. but still important.)
it's like the idea behind the book _The Mom Test_ suddenly got very important for programming
As a meta activity, I like to run different codebases through the same bug-hunt prompt and compare the number found as a barometer of quality.
I was very impressed when the top three AIs all failed to find anything other than minor stylistic nitpicks in a huge blob of what to me looked like “spaghetti code” in LLVM.
Meanwhile at $dayjob the AI reviews all start with “This looks like someone’s failed attempt at…”
You just have to be careful, because it will sometimes spot bugs you could never uncover because they're not real. You can really see the pattern matching at work on twisted code. It tends to look at things like lock-free algorithms and declare them full of bugs regardless of whether they are or not.
I have seen it start on a sentence, get lost and finish it with something like "Scratch that, actually it's fine."
And if it's not giving me a reason I can understand for a bug, I'm not listening to it! Mostly it is showing me I've mixed up two parameters, forgotten to initialise something, or referenced a variable from a thread that I shouldn't have.
The immediate feedback means the bug usually gets a better-quality fix than it would if I'd grown fatigued hunting it down! So variables get renamed to make sure I can't get them mixed up, a function gets broken out. It puts me in the frame of mind of "well, make sure this idiot can't make that mistake again!"
Ditto, I made a "/codex-review" skill in Claude Code that reviews the last git commit and writes an analysis of it for Claude Code to then work from. I've had very good luck with it.
One particularly striking example: I had CC do some work, kicked off a "/codex-review", and went to test the changes while it was running. I found a deadlock, but when I switched back to CC, the Codex review had already found the same deadlock and Claude Code was working on a fix.
I actually work the other way around. I have codex write "packets" to give to claude to write. I have Claude write the code. Then have Codex review it and find all the problems (there's usually lots of them).
Only because this month I have the $100 Claude Code and the $20 Codex. I did not renew Anthropic though.
I usually do several passes of "review our work. Look for things to clean up, simplify, or refactor." It does usually improve the quality quite a lot; then I rewind the conversation history to before the review, keep the code changes, and submit the same prompt again, until it reaches the point of diminishing returns.
ive gone down this rabbit hole and i dunno, sometimes claude chases a smoking gun that just isn't a smoking gun at all. if you ask him to help find a vulnerability he's not gonna come back empty handed even if there's nothing there, he might frame a nice-to-have as a critical problem. in my exp you have to build tests that prove vulnerabilities in some way. otherwise he's just gonna rabbithole while failing to look at everything.
ive had some remarkable successes with claude and quite a few "well that was a total waste of time" efforts with claude. for the most part i think trying to do uncharted/ambitious work with claude is a huge coinflip. he's great for guardrailed and well understood outcomes though, but im a little burnt out and unexcited at hearing about the gigantic-claude exercises.
Absolutely the opposite here, after reading a few paragraphs I was a bit bored. Then I saw the length of the piece, noticed the AI imagery, quit, came here. I read your comment and it makes sense. I'm not reading a story that somebody couldn't be bothered to write.
Right?? There is a working original After Burner in an arcade in Leeds - on free play and just open to kids of all ages. Sooo many places where it could trap a finger, and it moves pretty violently.
Congratulations Ben! The game sounds like a dangerous cult that I want no part of. But I've also done game ports recently and was curious - how much of the old codebase did you need to understand (and change!) in order to port it? And how much could you just wrap up / virtualise, and start building on top?
It is a cult and you should run while you still can. I would say that to get to the point I'm at now I had to understand 40-50% of it. Let's face it, I work as a software developer and I don't remember half of the code in the project, maybe more, and I wrote it all! And this is way more complex than a business app.

The reason I had to understand a lot is hard to explain, but I will try... Basically, a function might be called "FrontBuy". This function contains all the math and all the decision-tree logic and workflow for every possible situation to buy a stock. So now you say, let's replace all the Win32 dialogs with an Electron front end. What I did was maintain essentially a high-level GUI state manager in the Power Basic which controls the Electron app, using the C++ DLL as an FFI. That being said, you had to have some modicum of understanding of every function: maybe not all the math, but at the very least the workflow and structure of the PB code, so that you don't break anything. And oh boy, did I break everything, many many times, to the chagrin of my beta testers.
This has been on the cards for at least a year, with the increasingly doomy commits noted by HN.
Unfortunately I don't know of any other open projects that can obviously scale to the same degree. I built up around 100PiB of storage under minio with a former employer. It's very robust in the face of drive and server failure, and is simple to manage on bare hardware with Ansible. We got 180Gbps sustained writes out of it, with some part-time hardware maintenance.
Don't know if there's an opportunity here for larger users of minio to band together and fund some continued maintenance?
I definitely had a wishlist, and some hardware-management scripts that could be integrated into it.
Ceph can scale to pretty large numbers for storage, writes, and reads. I was running a 60PB+ cluster a few years back and it was still growing when I left the company.