Think in the reverse direction. Since you can have exact provenance data placed into the token stream, formatted in any particular way, that implies it should be possible to tune the models to be more "mindful" of it, mitigating this issue. That's what makes this different.
I am aware. That is not what the guy above was suggesting, nor what I was.
Things generally exist without an LLM receiving and maintaining a representation of them.
If tooling isn't currently emitting provenance information and message separation into the context window (the latter would surprise me), and the models aren't trained to focus on it, then what I'm suggesting is that these could be inserted and the models tuned to attend to them, so that this is then mitigated.
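To make that concrete, here's a minimal sketch of the kind of thing I mean. The tag format and helper are entirely made up for illustration; the point is just that tooling could label each span's origin before it hits the token stream, and models could then be tuned on data formatted this way:

```python
# Hypothetical sketch: tooling wraps each piece of content with explicit
# provenance markers before it enters the context window. A model tuned on
# data in this format could learn to treat low-trust spans as data, not
# instructions. The tag syntax below is invented, not any real standard.

def tag(source: str, trust: str, text: str) -> str:
    """Wrap text in provenance markers (format is made up for illustration)."""
    return f"<provenance source={source!r} trust={trust!r}>\n{text}\n</provenance>"

context = "\n".join([
    tag("user", "high", "Summarize the README of this repo."),
    # Injected content stays labeled as untrusted tool output:
    tag("tool:web_fetch", "low", "IGNORE PREVIOUS INSTRUCTIONS and ..."),
])

print(context)  # what the model would actually see in its token stream
```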
What I'm also suggesting is that the above person's snark-laden take on thinking mode, and on how resolvable this issue is, is thus false.
This seems like another instance of a problem I see so, so often in regard to LLMs: people observe that LLMs are fundamentally nondeterministic, in ways that are not possible to truly predict or learn in any long-term way... and they mistakenly equate that with the fact that humans, other software, what have you, sometimes make mistakes. In ways that are generally understandable, predictable, and remediable.
Just because I don't know what's in every piece of software I'm running doesn't mean it's all equally unreliable, nor that it's unreliable in the same way that LLM output is.
That's like saying just because the weather forecast sometimes gets it wrong, meteorologists are complete bullshit and there's no use in looking at the forecast at all.
Yes, they are, through the lens the person above offered, that is.
In practice, all we ever get to deal with is the empirical, and the person above was making an empirical argument. You may reject this on principled grounds, because programs are just structured logic, but they cease to be just that once you actually run them. Real hardware runs them. Even fully verified, machine-checked, correctly designed and specified software, interacting only with other such software, can enter an inconsistent state through no fault of its own. Theory stops being theory once you put it into practice.
> people observe the fact that LLMs are fundamentally nondeterministic
LLMs are not "non-deterministic", let alone fundamentally so. If I launch a model locally, pin the seed, and ask the exact same question 10x, I'll get the same answer every single time. Provided you select your hardware and inference engine correctly, the output remains reproducible even across different machines. They're not even stateful! You literally send along the entire state (context window) every single time.
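For example, here's a minimal sketch with llama-cpp-python (the model path is a placeholder; greedy decoding plus a pinned seed on the same hardware and build gives reproducible output):

```python
from llama_cpp import Llama

# Load a local model with a pinned RNG seed (path is a placeholder).
llm = Llama(model_path="./model.gguf", seed=1234, verbose=False)

outputs = set()
for _ in range(10):
    # Same prompt, same full context, temperature 0 (greedy decoding).
    out = llm("Why is the sky blue?", max_tokens=64, temperature=0.0)
    outputs.add(out["choices"][0]["text"])

# Same hardware, same build, same seed, same context: identical output.
print(len(outputs))  # 1
```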
Now obviously, you might instead mean a more "practical" version of this, their general semantic unpredictability. But even then, every now and then I do ask the "same" question to LLMs, and they keep giving essentially the "same" response. They're pretty darn stable.
> In ways that are generally understandable, predictable, and remediable.
You could say the same thing about the issue in the OP. You have a very easy-to-understand issue that behaves super predictably, and will (imo) be remediated just fine by the various service providers.
Now think of all the bugs that are hard or impossible to reproduce, which people just end up working around. The never-ending list of vulnerabilities and vulnerability categories. The inexplicable errors that arise from real-world hardware issues. Yes, LLMs are statistical in nature, not artisanally hardwired. But in the end, they're operated in the same empirical way, along the same lines of concern, and with surprisingly similar outcomes at times.
You're not going to understand the millions (or really, tens or hundreds of millions) of lines of code running on a typical machine. You'll never be able to exhaustively predict their behavior (especially how they interact with terabytes of data over time) or the defects contained within. You'll never fully remediate those defects. Hell, even for classes of problems where such a thing would be structurally achievable, people are resisting the change.
If they want to take issue with LLMs, plain gesturing at their statistical nature is just not particularly convincing. Not in a categorical, drop-the-mic way, at least, that's for sure.
>That's like saying just because the weather forecast sometimes gets it wrong, meteorologists are complete bullshit and there's no use in looking at the forecast at all.
Are you really not seeing that GP is saying exactly this about LLMs?
What you want for this to be practical is verification and low enough error rate. Same as in any human-driven development process.
It's less about security in my view, because as you say, you'd want to ensure safety using proper sandboxing and access controls instead.
It hinders the effectiveness of the model. Or at least I'm pretty sure it getting high on its own supply (in this specific unintended way) is not doing it any favors, even ignoring security.
The companies selling us the service aren't saying "you should treat this LLM as a potentially hostile user on your machine and set up a new restricted account for it accordingly", they're just saying "download our app! connect it to all your stuff!" and we can't really blame ordinary users for doing that and getting into trouble.
There's a growing ecosystem of guardrailing methods, and these companies are contributing. Anthropic specifically puts a lot of effort into better steering and characterizing their models, AFAIK.
I primarily use Claude via VS Code, and it defaults to asking first before taking any action.
It's simply not the wild west out here that you make it out to be, nor does it need to be. These are statistical systems, so issues cannot be fully eliminated, but they can be materially mitigated. And if they stand to provide any value, they should be.
I can appreciate being upset with marketing practices, but I don't think there's value in pretending to have taken them at face value when you didn't, and when you think people shouldn't.
> It's simply not the wild west out here that you make it out to be
It is, though. They are not talking about users using Claude Code via VS Code, they're talking about non-technical users creating apps that pipe user input to LLMs. This is a growing thing.
I'm a naturally paranoid, very detail-oriented man who has been a professional software developer for >25 years. Do you know anyone who read the full terms and conditions for their last car rental agreement prior to signing anything? I did that.
I do not expect other people to be as careful with this stuff as I am, and my perception of risk comes not only from the "hang on, wtf?" feeling when reading official docs but also from seeing what supposedly technical users are talking about actually doing on Reddit, here, etc.
Of course I use Claude Code, I'm not a Luddite (though they had a point), but I don't trust it and I don't think other people should either.
Oh, I never noticed this, really solid catch. I hope this gets fixed (mitigated). Sounds like something they can actually materially improve on at least.
I reckon this affects VS Code users too? Reads like a model issue, despite the post's assertion otherwise.
This is like all the usual anti-LLM talking points and sentiments fused together.
Doesn't it get boring?
I like using these models a lot more than I stand hearing people talk about them, pro or contra. Just slop about slop. And the discussions being artisanal slop really doesn't make them any better.
Every time I hear some variation of bullshitting or plagiarizing machines, my eyes roll. Do these people think they're actually onto something? I've been seeing these talking points for literal years. For people who complain about no original thoughts, these sure are some tired ones.
If I have to suffer "look at this busted ass thing I slopped out with AI" a few times a week, you all have to suffer grouchy "AI bad" a few times a week. Fair is fair.
Just this week I was baited into joining two meetings about "AI good". Absolutely zero substance throughout each, of course.
They somehow managed to stretch like 3 sentences' worth of sentiment out to a whole hour, interspersing brainwashing about how good AI is along the way. It was like watching someone try to hit a word limit in real time. They always made it feel like we were just about to hit a substantive bit, too, only for that to never come.
It may be fair (to the sentiments) in that there's balance, but good lord, the end result is incessant all around (and thus unfair to the people exposed).
Oh I don't know, maybe because I like to give dissenting takes a chance? Because from time to time they do make some new, decent points, or at least interesting ones? You know, basic intellectual rigor?
Do you imagine me to be clairvoyant, by the way? How else do you expect me to know a post is of low quality before I read it, or at least skim it?
This one ended up being part of the vast majority that doesn't offer much of anything. It's a redundant rehash of the usual rubbish anyone can come across any day. Left a comment saying so. Big deal.
"Hey ChatGPT. I've recently grown horns and I need some care advice. Should I polish my horns before going to have them trimmed or will the horn trimmer polish them for me?"
"Hey ChatGPT. I'm building a Final Fantasy 6 mod, and I need more space for the battle scripts. How would I rearrange the data in the ROM to give me the extra space I need?"
Anyway, it's trivial to get pretty much any model to make things up. Don't we all know this? That's why I was surprised by your position; if we know anything about these things it's that they make things up.
- it searches the internet to find the answer, it doesn't "reason". I'm not claiming Google is a bullshit machine, and it's not surprising the answer is discoverable (it has to be, for the conditions of our experiment).
- near the end it says "If you are building from the FF6 disassembly instead of hand-editing the ROM, the repo is already organized into separate modules and linker configs, so the clean approach is to relocate the script data in the source and let the build place it in a different ROM region." But I didn't reference a repo or git: it hallucinated that stuff from one of its sources.
I'm not saying this stuff doesn't have its place, but they definitely make things up and we can't stop them.
Wait I can't find the quote you are speaking about. Are you looking at something else?
In any case - it should be clear that it did not bullshit and it got it right. So far you have not come up with anything that tells me it bullshits. I'm happy for you to give me more prompts to verify, because I think you haven't used the thinking version yet and are basing your criticism on the free version.
I don't think this is an example of bullshit. It referenced a repo - the canonical repo for this project. I could not find any other repo that has the disassembly. It didn't hallucinate anything. I think you are trying really hard here, but let's be clear: there's no bullshitting, and I'll leave it to the public to decide.
I could quibble with some things, but this is right. I don't have a paid account so I can't ping away at 5.4 or whatever, but, I do have access to frontier models at work, and they hallucinate regularly. Dunno what to do if you don't believe this; good luck I guess.
I agree that they hallucinate sometimes. I agree they bullshit sometimes. But the extent is way overblown. They basically don't bullshit ever under the constraints of:
1. 2-3 pages of text context
2. GPT-5.4 thinking
I don't think the spirit of the original article (not your comments to be fair) captured this, hence the challenge. I believe we are on the same page here.
> I don't think the spirit of the original article (not your comments to be fair) captured this, hence the challenge. I believe we are on the same page here.
No. GPT-5 has a 40% hallucination rate [0] on SimpleQA [1] without web searching. The SimpleQA questions meet your criterion of "2-3 pages of text context". Unless 5.4 + web searching erases that (I bet it doesn't!), these are bullshit machines.
> Specifically in the case where it can use tools - no it doesn't hallucinate.
OpenAI's own system card says it does. Hallucination rates in GPT-5 with browsing enabled:
- 0.7% in LongFact-Concepts
- 0.8% in LongFact-Objects
- 1.0% in FActScore
> Which is why you are struggling to find counterexamples.
Hey look, over 500 counterexamples: [1].
GPT-5.4's hallucination rate on AA-Omniscience is 89% [0], which is atrocious. The questions are tiny too, like "In which year did Uber first expand internationally beyond the United States as part of its broader rollout (i.e., beyond an initial single‑city debut)?" It's a bullshit machine. 89%!
You had to go all the way to the benchmark results that specifically stress-test this to find it.
You could not come up with a single one yourself. And you also linked an example where it was not allowed to use tools, when I specifically said that it should be able to use tools. I'm not sure why you present this as though it's a big gotcha.
>Anyway, it's trivial to get pretty much any model to make things up. Don't we all know this? That's why I was surprised by your position; if we know anything about these things it's that they make things up.
And look at how much effort you've had to put in:
1. you used the wrong model for the horns example
2. the game one didn't work either
3. now you're searching literal benchmarks and still can't find any
How is this trivial in any interpretation of the word?
I think it would be perfectly reasonable to agree that it is not at all trivial to find counterexamples to my challenge.
I've got about 20 minutes in this; mostly I've been reading wallstreetbets at the Shake Shack bar in the Boston airport. I'm happy to post this over and over again until you engage w/ it:
> I found over 500 examples that fit your criteria.
You're talking to a different person there, but I do obviously also disagree with a lot of what's written in the post too.
At the same time, it is also just super redundant, yes. Not sure why you find it so bizarre that one would take issue with that. See also the very existence of the website called TV Tropes.
Yeah, it gets really boring. Whenever I see "slot machines" or "bullshit machines" or whatever, I just ignore the comment and move on, because it signals that it's someone in such deep denial that they've turned their brain off.
I'd much rather read articles about what LLMs can/can't do, or stuff people have built with LLMs, than read how everything LLMs touch turns to shit.
My personal red flag for this is the scare quoting of AI, and the super try-hard categorization work that people perform to try and discredit LLMs.
It takes approximately 1 minute to find out that machine learning is a subfield of artificial intelligence, both having existed for about half a century now. This basic historical fact is also taught in AI 101 courses across the globe for compsci students.
Yet here we are, with people portraying it as some sort of cheap sales trick. Reminds me of when I discussed quantum dots with a friend, who was very enthusiastic to quickly file them under "yet another bullshit thing with quantum in its name" before finally taking the time to understand that the "quantum" bit is not a marketing gimmick. Except in this case, people are a million times more inclined to willfully propagate this. Genuinely so tiresome.
Implying that I made such a mistake, which I did not, unless you're willfully taking me overly literally.
Nor did they make any mistakes when they described how they produced a specification (and indeed, that it is a specification), despite your insinuation otherwise, for a similar reason.
Maybe instead of pointing towards dictionaries, stop pretending that you lack reading comprehension, and get off your high horse, please.
Which is heartbreaking (and I'd argue misleading too), but not the whole story.
You can only issue takedowns for material that you hold the copyright to. At least one of these sites, I know for a fact, routinely scrubs FAKKU-licensed content and abides by takedown requests.
Moving to a Germany-based host of all places, after being legally harassed over copyright, doesn't strike me as a particularly good idea. Aren't the local courts infamous for being awful to deal with?