The definition used to be "passes the Turing test" .. until LLMs passed it.

davemp · 2026-02-27T23:27:34 1772234854

Extremely debatable. Especially because there is no "The Turing Test" [0] only a game and a few instances were described by Turing. I recommend reading the original paper before making bold claims about it. The bar for the interrogator has certainly be raised, but considering:

- the prevalence "How many |r|'s are in the word 'strawberry'?" esque questions that cause(d) LLMs to stumble

- context window issues

It would be naive to claim that there does not exist, or even that it would be difficult to construct/train, an interrogator that could reliably distinguish between an LLM and human chat instance.

[0]: https://archive.computerhistory.org/projects/chess/related_m...

hunterpayne · 2026-02-27T23:33:36 1772235216

Sure, when the expected monetary value was 0. Then they started claiming that investing $1,000,000,000,000.00 (that's $1T) into a 4 year old startup was a good idea. Change the valuation, change the goal. Then the goal was be better than a human employees (or at least more efficient or even just improves efficiency) because without that the value of the LLM is far lower than what it is being sold as. All the research so far says that LLMs fall far short of that goal. And if this was someone else's money, fine. But this is basically everyone's retirement savings. Again, higher valuation, higher goal. Finally, when you start losing people's retirement savings, criminal penalties start getting attached to things.

eqvinox · 2026-02-28T06:20:10 1772259610

I mean… just ask about something "naughty" and they'll fail? At the very least you'd need to use setups without safeguards to pass any Turing test…

The Turing test could also be considered equivalent to "can humans come up with questions that break the AI?" and the answer to that is still yes I'd say.

casey2 · 2026-02-28T13:51:30 1772286690

It hasn't even passed the original turning test, depending on the question. There are an unlimited number of questions that cause LLMs to give inhuman looking answers.

As for writing in general slop score is still higher than a human baseline for all models[1], so all a human tester has to do is grade it and make the human write a bunch, the interrogator is allowed to submit an arbitrarily long list of questions.

[1] https://eqbench.com/slop-score.html