Yup. That post is a typical example, symptomatic of modern technology culture, of calling for humans to change their nature in response to technology.
This is a fundamental mistake. It’s always the job of technology (indeed, its most important job) to work within the constraints of human nature, not the other way round. Being unable to do that is the defining characteristic of bad technology.
People keep throwing this phrase around in relation to LLMs, when not a single “fundamental limitation” has been rigorously demonstrated to exist, and many tasks that were claimed to be impossible for LLMs two years ago supposedly due to “fundamental limitations” (e.g. character counting or phonetics) are non-issues for them today even without tools.
The models now waste a vast number of neurons uselessly memorising the character counts of the entire English language so that people can ask how many r's are in strawberry and check a tickbox in a benchmark.
The architecture cannot efficiently or consistently represent counting letters in words. We should never have force-trained them to do it.
This goes for other, more important "skills" that are unsuited to transformer models.
Most models can now do decent arithmetic. But if you knew how that ability is encoded in their neurons, you would never, ever trust any arithmetic they output, even when they seem to "know" it (unless they called a calculator MCP to do it).
There are fundamental limitations, but we're currently brute forcing ourselves through problems we could trivially solve with a different tool.
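For example, here's a minimal sketch of the kind of trivial tool I mean; the function name and shape are made up for illustration, not taken from any particular MCP server:

    from operator import add, sub, mul, truediv

    # Hypothetical calculator "tool" an LLM could call instead of doing
    # arithmetic in its weights. Exact every time, and auditable.
    OPS = {"+": add, "-": sub, "*": mul, "/": truediv}

    def calculate(a: float, op: str, b: float) -> float:
        return OPS[op](a, b)

    print(calculate(1234, "*", 5678))  # 7006652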
> The models now waste a vast number of neurons uselessly memorising the character counts of the entire English language
No they don’t. They only need to know the character count for each token, and with typical vocabularies having around 250k entries, that’s an insignificant number for all but the tiniest LLMs.
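Back-of-the-envelope (my rough numbers, not measured from any model), even if you stored a count for every letter of the alphabet per token:

    vocab_size = 250_000          # typical modern tokenizer vocabulary
    letters = 26                  # one per-letter count for each token
    print(vocab_size * letters)   # 6500000 -- a few million values,
                                  # versus billions of weights in a frontier model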
Those "tolkens" humans "count" are translated to a ~2048 (depends on model) floating point vector.
bird => {mamal, english, noun, Vertebrate, aviant} has one r but what if you make it 20% more "french". Is is still 1 r? That could be the word "bird" in french, or it could be a french speaking bird or a bird species common in france.
If nearest neibour distance to the vocabulary of every language makes the vector no longer map to "bird"; then the amount of rs' must change, using a series of trained conditional checks (with some efficiency where languages have some general spelling patterns).
That is such an unreasonable amount of compute, that it is likley faar cheaper, easier and more reliable to train the model to memorise the output:
{"MCP":"python", "content":"len((c for c in 'strawberry' if c='r'))"}
The attention mechanism allows LLMs to learn these kinds of absurdly inefficient calculations. But we really shouldn't use LLMs where they're outperformed by trivial existing solutions.
>People keep throwing this phrase around in relation to LLMs, when not a single “fundamental limitation” has been rigorously demonstrated to exist
Some limitations haven't been rigorously demonstrated to be fundamental, but they have been continuously present since the earliest LLMs. Shouldn't the burden of proof be on those who say they can be overcome?
And some limitations are fundamental, and have been rigorously demonstrated, e.g.:
What part of "Specifically, we define a formal world where hallucination is defined as inconsistencies between a computable LLM and a computable ground truth function. By employing results from learning theory, we show that LLMs cannot learn all the computable functions and will therefore inevitably hallucinate if used as general problem solvers." fails to support the title, to put it mildly?
As with all works that use too broad a definition of an LLM, they prove too much. This work defines an "LLM" as a computable function obtained by applying a finite number of steps of a generic algorithm to an initial computable function.
What they really prove is that it's impossible to extrapolate an unconstrained, non-continuous function from a finite subset of its values. Good for them, I guess.
It's like saying that the no-free-lunch theorems prove that LLMs can't be the best optimizers, when what they actually prove (roughly) is that the best optimizer doesn't exist. That is, even people aren't the best optimizers, but we manage somehow, so LLMs can too.
So substitute another phrase, if you prefer. It doesn't change the logic.
"Specifically, we define a formal world where bungling is defined as inconsistencies between a computable LLM and a computable ground truth function. By employing results from learning theory, we show that LLMs cannot learn all the computable functions and will therefore inevitably bungle if used as general problem solvers."
Character counting remains a huge issue without tools.
Are you using only frontier models that are gated behind openai/anthropic/google APIs? Those use tools to help them out behind the scenes. It remains no less impressive, but I think we should be clear.
The literal best public models still fail to count characters consistently in practice, so I’m not sure what you mean. It’s literally a problem we’re still trying to solve at work.
What's amazing is that they can even fairly reliably appear to count characters. I mean, we're talking about systems that infer sequences, not character counters or calculators. They are amazing in unrelated ways, and we need to accept this so we can use them effectively.
I suspect character counting - counting small numbers in general in fact - is something that multimodal models will gradually learn through their visual capabilities. We have generative systems that are capable of generating an image of the word ‘strawberry’, and of counting how many strawberries are visible in an image; seems likely it’s possible for an LLM to ‘imagine’ what the word strawberry looks like and count the ‘Rs’ it can ‘see’.
That’s false. Larger LLMs learn token decompositions through their training, and in fact modern training pipelines are designed to occasionally produce uncommon tokenizations (including splitting words into individual characters) for this reason. Frontier models have no trouble spelling words even without tools. Even many mid-sized models can do that.
Wait, where can I learn more about this? I don't doubt that varying the tokenization during training improves results, but how does/would that enable token introspection?
Because LLMs can learn that different token sequences represent the same character sequence from training context. Just like they learn much more complex patterns from context.
You can try this out locally with any mid-sized current-gen LLM. You’ll find that it can spell out most atomic tokens from its input just fine. It simply learned to do so.
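To make that concrete, here's a minimal sketch of the kind of augmentation I mean, in the spirit of BPE-dropout; it's purely illustrative, and real pipelines do this inside the tokenizer rather than on whitespace-split words:

    import random

    def occasionally_spell_out(words, p=0.05):
        # With probability p, replace a word with its individual characters,
        # so the training data pairs whole-word tokens with their spellings.
        out = []
        for w in words:
            out.extend(list(w) if random.random() < p else [w])
        return out

    random.seed(0)
    print(occasionally_spell_out("how many r s are in strawberry".split(), p=0.5))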
Character counting errors are a side effect of tokenization, which is a performance optimization. If we scaled the hardware big enough we could train on raw bytes and avoid it.
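You can see what the model actually receives with a quick sketch using the tiktoken package (assuming it's installed; the exact splits depend on the encoding):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode("strawberry")
    print(ids)                              # a short list of opaque integer IDs
    print([enc.decode([i]) for i in ids])   # the sub-word pieces those IDs stand for
    # The model sees only the integer IDs, never the letters inside them,
    # so counting r's has to be memorised rather than read off the input.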
This is kind of my point: we need to get better at describing the limitations and studying them. It seems extremely clear that there are limitations, and not just temporary ones, but structural limitations that existed at the beginning and continue to persist.
When I talk about fundamental limitations, I mean limitations that can't be solved, even if they could be improved.
We have improved hallucinations significantly, and yet it seems clear that they are inherent to the technology and so will always exist to some extent.
As a general architecture, an LLM also has limitations that can't be improved unless we switch to another, fundamentally different AI design that's not LLM-based.
There are also limitations due to maths and/or physics that aren't fixable under any design. Outside science fiction, there is no technology whose limitations are all fixable.
Am I misreading that paper? They define hallucinations as anything other than the correct answer and prove that there are infinitely many questions an LLM can't answer correctly, but that's true of any architecture: there are infinitely many problems a team of geniuses with supercomputers can't answer. If an LLM can be made to reliably say "I don't know" when it doesn't know, hallucinations are solved; they contend that this doesn't matter because you can keep drawing from your pile of infinite unanswerable questions and the LLM will either never answer or will make something up. Seems like a technically true result that isn't usefully true.
I appreciate you acknowledging that this was a mistake, but as you surely know from your own experience with other people’s mistakes, some mistakes are so egregious that they cast doubt on the intentions of the people involved even if they are corrected later.
To me, “let’s add false attribution to every commit by default without informing the user” falls squarely into that category. I don’t think I’ve ever worked in an environment where something like that wouldn’t have been red-flagged in three seconds by anyone who took even a casual glance. I’d honestly be embarrassed if such a proposal even made it into a public pull request for my organization, never mind that pull request getting merged.
If what you described would make it to our PR queue, it would definitely not pass the gates.
The idea was to track AI-only changes and add the trailer when such changes were detected AND the setting was enabled. Obviously, we didn't want to attribute all changes to AI. There is a bug in change detection (which slipped through testing), which led to even non-AI changes being tracked. And thus we have this problem.
The PR linked here wasn't even implementing the feature, it was changing the default for the setting.
Yeah, I think 4+ legged bots should be more common than 2-legged variants. 2 legs is neat, but it takes far more work and processing to control and balance. It also requires much more powerful legs. A spider bot has more legs, which makes it more "complex" in some ways, but individual legs don't need to hold and maneuver its entire body weight alone, and it can keep 3 points of ground contact at all times, even when moving around, making it exceptionally stable. A bipedal robot has to be able to hold something like twice its own body weight or more on a single foot in order to balance, maneuver, walk around and navigate obstacles.
> We'll know this works when it starts replacing Amazon pickers in quantity.
That doesn’t follow. There are plenty of tasks that can be fully and reliably automated but aren’t, for the simple reason that human labor is dirt cheap compared to advanced robotics.
I disagreed, then re-read your post, then re-read the OP, and now I've come full circle to apologize; I think you make a fair point.
I work at a biotech. We spent who knows how much time and money trying to develop a 'lab technician bot' to automate one of our critical assays. Turns out, a 6-figure machine still isn't as economical as my coworker Y, one of the veteran lab technicians. Sure, she takes the occasional sick day, but even at our volume (and we do industrial-level work, with multiple clients batched into a single assay pass) it won't be economical to replace her for a very long time (if we even reach that scale).
Absolutely. I worked at a gene sequencing company, where I led the software side of making a robotic product[0] to automate the 20-30 minutes of sample preparation time. It's great for lots of uses, but it doesn't cover anything outside the exact thing it automates. For that you need an expert human.
No problem! It was probably the most fun product I've ever had the pleasure of leading the software dev of.
The company is one of the few in the world that makes gene sequencing technology - actual chemistry, biologics, protocols, hardware and software. Plasmidsaurus is a customer[0] - they use our devices and have built an incredibly successful service on top of them!
What is the point of humanoid “general” robots then?
We already have pretty reliable ways to make and train humans.
Humans are cheaper and better than robots.
I could imagine robots for some specialised tasks where you don’t want to use a human (for security reasons, say), but you don’t need general-purpose robots for that.
In natural ecosystems, nobody beats the apex predator directly, and nobody beats the hyperspecialized niche critter at their own game. The new species has some advantage that’s different than what is there.
If a humanoid robot is a slower, dumber human that is expensive, requires power, can’t get wet, falls over, and doesn’t understand stairs, is not sleeping and being radiation-tolerant enough of an advantage to be worth it?
The nature comparison doesn't work on a fundamental level because you're only getting a fraction of the human's power based on how much they're happy to sell.
They already are. The problem with humanoid robots is that people think that adding legs to the robot will somehow fundamentally make it more intelligent.
People see a robot arm attached to a stationary platform and understand it requires integration work to perform a single task.
But when those same people see a humanoid robot, they think they can just talk to it like a real human and it will do what you told it to do.
They don't think about the fact that the humanoid robot has to be programmed exactly the same way the stationary robot arm has to be programmed and that programming the legs in addition to the arms is a much more challenging problem.
Technology gets cheaper over time. If they were always going to cost what they do now, you might have a point. But they'll eventually get much cheaper.
Robots can be optimized for tasks and if they are, their benefits are greater. When cars replaced the horse, it was because they didn’t poop, and because a car designed only for transport would not suddenly have a heart attack and stop working.
It’s far less frequent, is at least recoverable, and of course there’s no immediate public sanitation issue the way poop and dead horses attract flies and disease.
But at far less frequency and severity than a temperamental horse.
In Manhattan in 1900, 400 horses would die a day, and a rotting horse carcass is a far bigger sanitation problem than a broken down car, which you can tow and fix up.
A friend who works at Amazon made the same point: "We don't really need robots in the FCs urgently [other than the Kivas], because it turns out you can just pay people $17/hour"
Mechanical picking has been too slow. It's not a problem with the robot mechanics. Here's 300 picks/minute from 2012.[1] The parts are all the same, so the vision problem is simple.
But picking arbitrary objects from fulfillment bins is still running at a few picks per minute.[2] As the speed picks up, humans become less necessary.
That's the point of the test condition. When running a robot becomes more economical than paying full-scale humans $17/h, something important about robot abilities will have changed.
I dunno, I worked in an Amazon Warehouse for a year part-time (and a couple of weeks full-time when in between jobs). On one occasion, I pulled up to a bin full of nondescript cardboard boxes near where a group of trainees were working their way through, grabbed one box, spun it around for the six-sided box check, scanned it, and confirmed it was the right one. Before I could move on to my next pick, a trainee asked, "How did you know that was the right box?", which required a several-minute explanation of how the item description and the slight differences between the boxes led to that conclusion.
The big win would be training the folks doing stowing to not create such situations and to put markedly different things in each rainbow bin.
This would be a more convincing take if reasoning LLMs didn't already exist. Given the growth in capability over the last few years alone, nothing about your description ("several minute explanation of how the item description and the slight differentiations of the boxes") seems beyond an artificial intelligence to solve by the time humanoid robots are ready to physically traverse a warehouse.
Your last point is also interesting, given that a robot is perhaps more amenable to such instruction, creating cascading savings. Each human has to be trained and could individually be a failure; a robot can essentially copy its "brain" to all the others.
Or, likely more accurately, download the latest brain trained on all the robots' aggregate experiences from the Amazon hivemind HQ.
The "Markedly Different things" in each bin was a big Amazon Warehouse advance in warehousing. Traditionally - things that were "alike" were put on shelves/bins - but (according to Amazon) it was far more efficient for pickers (at least back in the day - may have changed since then) to have random things on shelves located near each other to allow for equal access to popular items by pickers.
I was thinking this week that AI token costs are probably going to get so expensive soon that bright spark CEOs are going to realise “why am I paying for such expensive coding agents when I can pay people from the third world to code!?!” and announce outsourcing like it’s some kind of stunning and innovative revelation.
C-suite has been saying this for 30+ years. They never tire of it. Ask yourself: At this point in time, why aren't all programmers working from low cost jurisdictions?
I think you didn’t grok the hidden punchline: this is the stage after they’ve replaced all their third-world coders with AI agents, until one day a C-suiter has the revelation that humans are cheaper and better, and the company then starts touting its humanistic credentials all over LinkedIn.
But that does follow. The economics working is not some outside factor. If the robot “could do the task” but would cost more than paying a human to do the same task then the robot “does not work”. It is frequently because the robot would be too slow, or not reliable enough, or could only handle certain types of items. But ultimately all of these boil down to cost.
We have seen lab demos of robotic manipulation for decades. The reason they stay in the lab (when they do) and don’t become ubiquitous is that they are not good enough. In other words, they don’t work. The economics and “does it work” are not two separate concerns but one and the same.
It's a continuum, not binary. The same robot that doesn't financially "work" for replacing a manual scavenger sorting garbage in an African slum might be quite cost-effective sorting recycling in Switzerland, and would likely have a niche regardless of price if used to (say) sort biohazardous or radioactive materials. And there are already millions of robots out there assembling cars etc.
Not “Here’s a random guess that I just pulled out of my ass.”
LLMs have picked up, from scientists, the bad habit of trying to give an answer when no answer can be given; scientists overall don’t say “I don’t know” nearly as often as they should.
I tried asking LLMs about food before. They all say "I can't tell for certain, but this is an estimate based on the ingredients I can spot/infer/guess".
You need to write a specific prompt to avoid any warnings.
Of course a lot of people don't know what limitations LLMs have, so there's some value to a blog post about it, but it's not as black-and-white as the article might suggest with its graphs.
The prompt (documented here: https://www.diabettech.com/wp-content/uploads/2026/04/Supple...) lists specific instructions and a specific output format that doesn't allow the LLM any room for explanation or warning in processable data (only in notes fields). In fact, the prompt explicitly tells the LLM to ignore visual inferencing for some statistics and to rely on a nutrition authority instead.
Even in that intentionally restricted format, the English language output uses words like "roughly" and "estimated" in the LLMs I've tested.
Sure, if you take the numeric values and plot them in graphs, you get wildly inconsistent results, but that research method intentionally restricts the usefulness and reliability of the LLMs being researched.
What's much more troubling is this line from the preprint:
> The open-source iAPS automated insulin delivery (AID) system now offers food analysis through APIs from OpenAI, Anthropic and Google [8]
The linked app does seem to have a disclaimer, though:
> "AI nutritional estimates are approximations only. Always consult with your healthcare provider for medical decisions. Verify nutritional information whenever possible. Use at your own risk."
From the paper, they're using structured JSON schema mode as opposed to freeform answers, so it can't. Models do typically caveat their answers for questions like this, in my experience.
They'll qualify their answers in English but as the article mentions, if your prompt asks for a confidence score, that "uncertainty" doesn't translate into low numerical confidence.
Quantifying their own confidence is also something they're not good at, and which the format would prevent them from refusing to do or preceding with a caveat if that's what you'd want of them. Particularly since the response format seems backwards - giving confidence, then carbs estimate, then observations/notes, rather than being able to base carbs estimate off of observations/notes and then confidence estimate off of both of those.
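To make the ordering point concrete, here's a hypothetical before/after (the field names are mine, not the paper's actual schema). Because generation is autoregressive, later fields can condition on earlier ones but not the reverse:

    # As described: confidence comes first, so it cannot depend on the
    # reasoning or the estimate that follow it.
    as_described = {"confidence": 0.9, "carbs_g": 45, "notes": "rice portion unclear"}

    # Reordered: observations first, then the estimate, then confidence,
    # so each value can be based on what was generated before it.
    reordered = {"notes": "rice portion unclear, assumed ~150 g cooked",
                 "carbs_g": 45,
                 "confidence": 0.6}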
> They'll qualify their answers in English but [...]
That the default user-facing chat as a normal user would use it gives a warning is the key part IMO. I don't think expectations of there being no "wrong way" to use the model can necessarily extend to API usage with long custom system prompt and restricted output format.
Serious question: What exactly do they love America for? I just don’t get it. Seems like in every way that matters to the common people, the US is at best mediocre.
Could it be that they secretly subscribe to a different version of the same mythical exceptionalism as the president they despise?
People love their home sports teams, even when they're losing. They love their kids, even when they're getting mediocre grades in school. It's like that.
You're thinking of nationalism, which is when people think their country is the best one. Real patriotism is loving your country just because it's yours.