Think in the reverse direction. Since you can have exact provenance data placed into the token stream, formatted in any particular way, that implies it should be possible to tune the models to be more "mindful" of it, mitigating this issue. That's what makes this different.
I am aware. That is not what the guy above was suggesting, nor what I was.
Things generally exist without an LLM receiving and maintaining a representation of them.
If tooling isn't currently emitting provenance information and message separation into the context window (the latter would surprise me), and the models aren't trained to focus on it, then what I'm suggesting is that these could be inserted and the models tuned to attend to them, so that this is then mitigated.
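To make that concrete, here's a minimal sketch of the kind of thing I mean. The tag format and helper are entirely made up for illustration; the point is just that tooling could label each span's origin before it hits the token stream, and models could then be tuned on data formatted this way:

```python
# Hypothetical sketch: tooling wraps each piece of content with explicit
# provenance markers before it enters the context window. A model tuned on
# data in this format could learn to treat low-trust spans as data, not
# instructions. The tag syntax below is invented, not any real standard.

def tag(source: str, trust: str, text: str) -> str:
    """Wrap text in provenance markers (format is made up for illustration)."""
    return f"<provenance source={source!r} trust={trust!r}>\n{text}\n</provenance>"

context = "\n".join([
    tag("user", "high", "Summarize the README of this repo."),
    # Injected content stays labeled as untrusted tool output:
    tag("tool:web_fetch", "low", "IGNORE PREVIOUS INSTRUCTIONS and ..."),
])

print(context)  # what the model would actually see in its token stream
```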
What I'm also suggesting is that the above person's snark-laden take on thinking mode, and on how resolvable this issue is, is thus false.
This seems like another instance of a problem I see so, so often in regard to LLMs: people observe that LLMs are fundamentally nondeterministic, in ways that are not possible to truly predict or learn in any long-term way... and they mistakenly equate that with the fact that humans, other software, what have you, sometimes make mistakes. In ways that are generally understandable, predictable, and remediable.
Just because I don't know what's in every piece of software I'm running doesn't mean it's all equally unreliable, nor that it's unreliable in the same way that LLM output is.
That's like saying just because the weather forecast sometimes gets it wrong, meteorologists are complete bullshit and there's no use in looking at the forecast at all.
Yes, they are, through the lens the person above offered, that is.
In practice, all we ever get to deal with is the empirical, and the person above was making an empirical argument. You may reject this on principled grounds, because programs are just structured logic, but they cease to be just that once you actually run them. Real hardware runs them. Even fully verified, machine-checked, correctly designed and specified software, interacting only with other such software, can enter an inconsistent state through no fault of its own. Theory stops being theory once you put it into practice.
> people observe the fact that LLMs are fundamentally nondeterministic
LLMs are not "non-deterministic", let alone fundamentally so. If I launch a model locally, pin the seed, and ask the exact same question 10x, I'll get the same answer every single time. Provided you select your hardware and inference engine correctly, the output remains reproducible even across different machines. They're not even stateful! You literally send along the entire state (context window) every single time.
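For example, here's a minimal sketch with llama-cpp-python (the model path is a placeholder; greedy decoding plus a pinned seed on the same hardware and build gives reproducible output):

```python
from llama_cpp import Llama

# Load a local model with a pinned RNG seed (path is a placeholder).
llm = Llama(model_path="./model.gguf", seed=1234, verbose=False)

outputs = set()
for _ in range(10):
    # Same prompt, same full context, temperature 0 (greedy decoding).
    out = llm("Why is the sky blue?", max_tokens=64, temperature=0.0)
    outputs.add(out["choices"][0]["text"])

# Same hardware, same build, same seed, same context: identical output.
print(len(outputs))  # 1
```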
Now obviously, you might instead mean a more "practical" version of this, their general semantic unpredictability. But even then, every now and then I do ask the "same" question to LLMs, and they keep giving essentially the "same" response. They're pretty darn stable.
> In ways that are generally understandable, predictable, and remediable.
You could say the same thing about the issue in the OP. You have a very easy-to-understand issue that behaves super predictably, and will (imo) be remediated just fine by the various service providers.
Now think of all the bugs that are hard or impossible to reproduce, which people just end up working around. The never-ending list of vulnerabilities and vulnerability categories. The inexplicable errors that arise from real-world hardware issues. Yes, LLMs are statistical in nature, not artisanally hardwired. But in the end, they're operated in the same empirical way, along the same lines of concern, and with surprisingly similar outcomes at times.
You're not going to understand the millions (or really, tens or hundreds of millions) of lines of code running on a typical machine. You'll never be able to exhaustively predict their behavior (especially how they interact with terabytes of data over time) or the defects contained within. You'll never fully remediate those defects. Hell, even for classes of problems where such a thing would be structurally achievable, people are resisting the change.
If they want to take issue with LLMs, plain gesturing at their statistical nature is just not particularly convincing. Not in a categorical, drop-the-mic way, at least, that's for sure.
>That's like saying just because the weather forecast sometimes gets it wrong, meteorologists are complete bullshit and there's no use in looking at the forecast at all.
Are you really not seeing that GP is saying exactly this about LLMs?
What you want for this to be practical is verification and low enough error rate. Same as in any human-driven development process.
It's less about security in my view, because as you say, you'd want to ensure safety using proper sandboxing and access controls instead.
It hinders the effectiveness of the model. Or at least I'm pretty sure it getting high on its own supply (in this specific unintended way) is not doing it any favors, even ignoring security.
The companies selling us the service aren't saying "you should treat this LLM as a potentially hostile user on your machine and set up a new restricted account for it accordingly", they're just saying "download our app! connect it to all your stuff!" and we can't really blame ordinary users for doing that and getting into trouble.
There's a growing ecosystem of guardrailing methods, and these companies are contributing. Anthropic specifically puts a lot of effort into better steering and characterizing their models, AFAIK.
I primarily use Claude via VS Code, and it defaults to asking first before taking any action.
It's simply not the wild west out here that you make it out to be, nor does it need to be. These are statistical systems, so issues cannot be fully eliminated, but they can be materially mitigated. And if they stand to provide any value, they should be.
I can appreciate being upset with marketing practices, but I don't think there's value in pretending to have taken them at face value when you didn't, and when you think people shouldn't.
> It's simply not the wild west out here that you make it out to be
It is, though. They are not talking about users using Claude Code via VS Code, they're talking about non-technical users creating apps that pipe user input to LLMs. This is a growing thing.
I'm a naturally paranoid, very detail-oriented man who has been a professional software developer for >25 years. Do you know anyone who read the full terms and conditions for their last car rental agreement prior to signing anything? I did that.
I do not expect other people to be as careful with this stuff as I am, and my perception of risk comes not only from the "hang on, wtf?" feeling when reading official docs but also from seeing what supposedly technical users are talking about actually doing on Reddit, here, etc.
Of course I use Claude Code, I'm not a Luddite (though they had a point), but I don't trust it and I don't think other people should either.
Oh, I never noticed this, really solid catch. I hope this gets fixed (mitigated). Sounds like something they can actually materially improve on at least.
I reckon this affects VS Code users too? Reads like a model issue, despite the post's assertion otherwise.
This is like all the usual anti-LLM talking points and sentiments fused together.
Doesn't it get boring?
I like using these models a lot more than I stand hearing people talk about them, pro or contra. Just slop about slop. And the discussions being artisanal slop really doesn't make them any better.
Every time I hear some variation of bullshitting or plagiarizing machines, my eyes roll. Do these people think they're actually onto something? I've been seeing these talking points for literal years. For people who complain about no original thoughts, these sure are some tired ones.
If I have to suffer "look at this busted ass thing I slopped out with AI" a few times a week, you all have to suffer grouchy "AI bad" a few times a week. Fair is fair.
Just this week I was baited into joining two meetings about "AI good". Absolutely zero substance throughout each, of course.
They somehow managed to stretch like 3 sentences' worth of sentiment out to a whole hour, interspersing brainwashing about how good AI is along the way. It was like watching someone try to hit a word limit in real time. They always made it feel like we were just about to hit a substantive bit, too, only for that to never come.
It may be fair (to the sentiments) in that there's balance, but good lord, the end result is incessant all around (and thus unfair to the people exposed).
Oh I don't know, maybe because I like to give dissenting takes a chance? Because from time to time they do make some new, decent points, or at least interesting ones? You know, basic intellectual rigor?
Do you imagine me to be clairvoyant, by the way? How else do you expect me to know a post is of low quality before I read it, or at least skim it?
This one ended up being part of the vast majority that doesn't offer much of anything. It's a redundant rehash of the usual rubbish anyone can come across any day. Left a comment saying so. Big deal.
"Hey ChatGPT. I've recently grown horns and I need some care advice. Should I polish my horns before going to have them trimmed or will the horn trimmer polish them for me?"
"Hey ChatGPT. I'm building a Final Fantasy 6 mod, and I need more space for the battle scripts. How would I rearrange the data in the ROM to give me the extra space I need?"
Anyway, it's trivial to get pretty much any model to make things up. Don't we all know this? That's why I was surprised by your position; if we know anything about these things it's that they make things up.
- it searches the internet to find the answer, it doesn't "reason". I'm not claiming Google is a bullshit machine, and it's not surprising the answer is discoverable (it has to be, for the conditions of our experiment).
- near the end it says "If you are building from the FF6 disassembly instead of hand-editing the ROM, the repo is already organized into separate modules and linker configs, so the clean approach is to relocate the script data in the source and let the build place it in a different ROM region." But I didn't reference a repo or git: it hallucinated that stuff from one of its sources.
I'm not saying this stuff doesn't have its place, but they definitely make things up and we can't stop them.
Wait I can't find the quote you are speaking about. Are you looking at something else?
In any case - it should be clear that it did not bullshit and it got it right. So far you have not come up with anything that tells me it bullshits. I'm happy for you to give me more prompts to verify, because I think you haven't used the thinking version yet and are basing your criticism on the free version.
I don't think this is an example of bullshit. It referenced a repo - the canonical repo for this project. I could not find any other repo that has the disassembly. It didn't hallucinate anything. I think you are trying really hard here, but let's be clear: there's no bullshitting, and I'll leave it to the public to decide.
I could quibble with some things, but this is right. I don't have a paid account so I can't ping away at 5.4 or whatever, but, I do have access to frontier models at work, and they hallucinate regularly. Dunno what to do if you don't believe this; good luck I guess.
I agree that they hallucinate sometimes. I agree they bullshit sometimes. But the extent is way overblown. They basically don't bullshit ever under the constraints of:
1. 2-3 pages of text context
2. GPT-5.4 thinking
I don't think the spirit of the original article (not your comments to be fair) captured this, hence the challenge. I believe we are on the same page here.
> I don't think the spirit of the original article (not your comments to be fair) captured this, hence the challenge. I believe we are on the same page here.
No. GPT-5 has a 40% hallucination rate [0] on SimpleQA [1] without web searching. The SimpleQA questions meet your criterion of "2-3 pages of text context". Unless 5.4 + web searching erases that (I bet it doesn't!), these are bullshit machines.
> Specifically in the case where it can use tools - no it doesn't hallucinate.
OpenAI's own system card says it does. Hallucination rates in GPT-5 with browsing enabled:
- 0.7% in LongFact-Concepts
- 0.8% in LongFact-Objects
- 1.0% in FActScore
> Which is why you are struggling to find counterexamples.
Hey look, over 500 counterexamples: [1].
GPT-5.4's hallucination rate on AA-Omniscience is 89% [0], which is atrocious. The questions are tiny too, like "In which year did Uber first expand internationally beyond the United States as part of its broader rollout (i.e., beyond an initial single‑city debut)?" It's a bullshit machine. 89%!
You had to go all the way to the benchmark results that specifically stress-test this to find it.
You could not come up with a single one yourself. And you also linked an example where it was not allowed to use tools, when I specifically said that it should be able to use tools. I'm not sure why you present this as though it's a big gotcha.
>Anyway, it's trivial to get pretty much any model to make things up. Don't we all know this? That's why I was surprised by your position; if we know anything about these things it's that they make things up.
And look at how much effort you've had to put in:
1. you used the wrong model for the horns example
2. the game one didn't work either
3. now you're searching literal benchmarks and still can't find any
How is this trivial in any interpretation of the word?
I think it would be perfectly reasonable to agree that it is not at all trivial to find counterexamples to my challenge.
I've got about 20 minutes in this; mostly I've been reading wallstreetbets at the Shake Shack bar in the Boston airport. I'm happy to post this over and over again until you engage w/ it:
> I found over 500 examples that fit your criteria.
You're talking to a different person there, but I do obviously also disagree with a lot of what's written in the post too.
At the same time, it is also just super redundant, yes. Not sure why you find it so bizarre that one would take issue with that. See also the very existence of the website called TV Tropes.
Yeah, it gets really boring. Whenever I see "slot machines" or "bullshit machines" or whatever, I just ignore the comment and move on, because it signals that it's someone in such deep denial that they've turned their brain off.
I'd much rather read articles about what LLMs can/can't do, or stuff people have built with LLMs, than read how everything LLMs touch turns to shit.
My personal red flag for this is the scare quoting of AI, and the super try-hard categorization work that people perform to try and discredit LLMs.
It takes approximately 1 minute to find out that machine learning is a subfield of artificial intelligence, both having existed for about half a century now. This basic historical fact is also taught in AI 101 courses across the globe for compsci students.
Yet here we are, with people portraying it as some sort of cheap sales trick. Reminds me of when I discussed quantum dots with a friend, who was very enthusiastic to quickly file them under "yet another bullshit thing with quantum in its name" before finally taking the time to understand that the "quantum" bit is not a marketing gimmick. Except in this case, people are a million times more inclined to willfully propagate this. Genuinely so tiresome.
Implying that I made such a mistake, which I did not, unless you're willfully taking me overly literally.
Nor did they make any mistakes when they described how they produced a specification (and indeed, that it is a specification), despite your insinuation otherwise, for a similar reason.
Maybe instead of pointing towards dictionaries, stop pretending that you lack reading comprehension, and get off your high horse, please.
Which is heartbreaking (and I'd argue misleading too), but not the whole story.
You can only issue takedowns for material that you hold the copyright to. At least one of these sites, I know for a fact, routinely scrubs FAKKU-licensed content and abides by takedown requests.
Moving to a Germany-based host of all places, after being legally harassed over copyright, doesn't strike me as a particularly good idea. Aren't the local courts infamous for being awful to deal with?