Much nostalgia. The TI-83 Z80 was how I learned assembly as a teenager, so I could write better calculator games than was possible with TI Basic. Many others here had a similar experience, I’m sure. It’s been a couple decades, but I’m sure I’d still remember most of it if you put me down in front of a bunch of Z80 asm code.
One thing that I remember vividly was that you had no MUL or DIV, so you had to implement them yourself with shifts, adds, subtractions, etc. It was an extremely useful learning experience.
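For anyone who never had to do this: the shift-and-add trick translates directly into higher-level code. Here's a minimal Python sketch of an 8-bit software multiply of the kind you'd hand-roll on a Z80 (the structure mirrors the classic shift-loop, though register allocation and carry handling are obviously different in real asm):

```python
def mul8(a: int, b: int) -> int:
    """8-bit unsigned multiply using only shifts, adds, and a bit test,
    the way you'd implement it on a CPU with no MUL instruction."""
    product = 0
    for _ in range(8):
        if a & 1:            # test the low bit of the multiplier
            product += b     # add the (shifted) multiplicand
        a >>= 1              # shift multiplier right
        b <<= 1              # shift multiplicand left
    return product           # an 8x8 multiply fits in 16 bits
```

Division is the same idea in reverse: repeated shift-and-subtract, building up quotient bits one at a time.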
Same story here (basic was too slow for a phoenix/movable-ship-shooter game).
Do you think you could remember most of Z80 ASM? I looked at some old ASM I wrote long ago, and it's hard to follow the logic of the program, since most lines are messing around with the registers. But basics like 'ld hl,xyz' and 'jp/jnz' still make sense.
> Do you think you could remember most of Z80 ASM?
I find when you learn things at 15 they tend to stick around. (Stuff I learned last week, not so much!) Even just looking at your example, I remembered that HL is a 16 bit register and you can split it into two 8 bit registers H and L if you want. I think most of it would come back; I wrote quite a lot of it, both for the TI-83 and later for a Z80 that I bought and put on a breadboard and wired up to some RAM and EEPROM, about as bare metal as it gets.
> most lines are messing around with the registers
I learned much of what I know about computers and low-level systems engineering from Minecraft. I watched lots of videos of people making CPUs and built many components myself, including a full ALU with a carry-lookahead adder and hardware multiplication.
LLMs aren’t software (except in an uninteresting, obvious sense); they are “grown, not made,” as the saying goes. And sure, they can find which weights activate when goblins come up (that’s basic mechanistic interpretability stuff), but it’s not as simple as just going in and deleting parts of the network. This thing is irreducibly complex in an organic, delocalized way, and information is highly compressed within it; the same part of the network serves many different purposes at once. Go in and delete it and you will probably end up with other weird behaviors.
It's interesting that some people are responding to your comment as if this proves that AI is a sham or a joke. But I don't think that's what you're saying at all with your reference to Terence McKenna: this is a serious thing we're talking about here! These models are alien intelligences that could occupy an unimaginably vast space of possibilities (there are trillions of weights inside them), but which have been RL-ed over and over until they more or less stay within familiar reasonable human lines. But sometimes they stray outside the lines just a little bit, and then you see how strange this thing actually is, and how doubly strange it is that the labs have made it mostly seem kind of ordinary.
And the point is that it is a genuine wonder machine, capable of solving unsolved mathematics problems (Erdős Problem #1196 just the other day), generating works-first-time code, and translating near-flawlessly between 100 languages, and also it's deeply weird and secretly obsessed with goblins and gremlins. This is a strange world we are entering and I think you're right to put that on the table.
Yes, it's funny. But it's disturbing as well. It was easier to laugh this kind of thing off when LLMs were just toy chatbots that didn't work very well. But they are not toys now. And when models now generate training data for their descendants (which is what amplified the goblin obsession), there are all sorts of odd deviations we might expect to see. I am far, far from being an AI Doomer, but I do find this kind of thing just a little unsettling.
> These models are alien intelligences that could occupy an unimaginably vast space of possibilities (there are trillions of weights inside them), but which have been RL-ed over and over until they more or less stay within familiar reasonable human lines.
or, more plausibly, the specific version we're aligning toward is just the only one that makes some kind of rational sense, among trillions of other meaningless, gibberish-producing ones.
Do not fall for the idea that if we're not able to comprehend something, it's because our brains are falling short. Most of the time, what we're looking at simply has no use or meaning in this world at all.
> that specific version we're aligning toward is just the only one that makes some kind of rational sense, among a trillion of other meaningless gibberish-producing ones.
Oh, the space of possibilities is unimaginably vaster than that. Trillions of weights, and far more combinations of those weights than there are electrons in the universe. So I think we could equally well speculate (and that's what we're both doing here, of course!) that all these things are simultaneously true:
1) Most configurations of LLM weights are indeed gibberish-producers (I agree with you here)
2) Nonetheless there is a vast space of combinations of weights that exhibit "intelligent" properties but in a profoundly alien way. They can still solve Erdős problems, but they don't see the world like us at all.
3) RL tends to herd LLM weights towards less alien intelligence zones, but it's an unreliable tool. As we just saw, with the goblins.
As a thought experiment, imagine that an alien species (real organic aliens, let's say) with a completely different culture and relation to the universe had trained an LLM and sent it to us to load onto our GPUs. That LLM would still be just as "intelligent" as Opus 4.7 or GPT 5.5, able to do things like solve advanced mathematics problems if we phrased them in the aliens' language, but we would hardly understand it.
…But this goblin thing was a direct result of accidentally creating a positive feedback loop in RL meant to make the model more human-like; it's not a case of unintentionally surfacing some aspect of Cthulhu from the depths despite attempts to keep the model humanlike. This is not a quirk of the base model but simply a case of reinforcement learning being, well, reinforcing.
We actually understand AI quite well. It embeds questions and answers in a high dimensional space. Sometimes you get lucky and it splices together a good answer to a math problem that no one’s seriously looked at in 20 years. Other times it starts talking about Goblins when you ask it about math.
Comparing it to an alien intelligence is ridiculous. McKenna was right that things would get weird. I believe he compared it to a carnival circus. Well that’s exactly what we got.
There's no end to arguing with someone who claims they don't understand something, they could always just keep repeating "nevertheless I don't understand it"... You could keep shifting the goalposts for "real understanding" until one is required to hold the effects of every training iteration on every single parameter in their minds simultaneously. Obviously "we" understand some things (both low level and high level) to varying degrees and don't understand some others. To claim there is nothing left to know is silly but to claim that nothing is understood about high-level emergence is silly as well.
Is there a book or paper where I can read a description of how high-level emergent behavior works? The papers I've seen are researchers trying to puzzle it out with probes, and their insights are very limited in scope and there is always a lot more research to be done.
Hey, about that high dimensional space, is it continuous or discrete?
Also, I'm curious what you mean by "embed"; the word implies a topological mapping from "words" to some "high dimensional space". What are the topological properties of words that are relevant for the task, and does the mapping preserve them?
Circling back to the first point: are words continuous or discrete? Is the space of all words differentiable?
Discrete. But my understanding is that for all intents and purposes it is differentiable.
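One way to square "discrete" with "for all intents and purposes differentiable": the token lookup is discrete, but it's mathematically equivalent to multiplying a one-hot vector by a continuous embedding matrix, so everything downstream is differentiable with respect to the matrix. A toy sketch (all dimensions made up):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 5, 3
E = rng.standard_normal((vocab, dim))  # continuous, learned embedding matrix

token_id = 2                  # tokens themselves are just discrete indices
one_hot = np.zeros(vocab)
one_hot[token_id] = 1.0

# Embedding lookup == one-hot vector times the matrix. The discreteness is
# confined to *selecting a row*; gradients flow through E as usual.
lookup = one_hot @ E
assert np.allclose(lookup, E[token_id])
```

So the space of tokens is discrete, but training never needs to differentiate *through* the token choice itself, only through the continuous vectors each token maps to.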
None of this means that you can infer the input space (human brain) from the output space (language). You can approximate it. But you cannot replicate it no matter how many weights are in your model. Or how many rows you have in your dataset. And it’s an open question of how good that approximation actually is. The Turing test is a red herring, and has nothing to do with the fundamental question of AGI.
Unless you have access to a Dyson sphere where you can simulate primate evolution. Existing datasets aren’t even close to that kind of training set.
I think this is a case of that mildly apocryphal Richard Feynman quote: "if you think you understand quantum mechanics, you don't understand quantum mechanics."
I understand LLM architecture internals just fine. I can write you the attention mechanism on a whiteboard from memory. That doesn't mean I understand the emergent behaviors within SoTA LLMs at all. Go talk to a mechanistic interpretability researcher at Anthropic and you'll find they won't claim to understand it either, although we've all learned a lot over the last few years.
Consider this: the math and architecture in the latest generation of LLMs (certainly the open weights ones, almost certainly the closed ones too) is not that different from GPT-2, which came out in 2019. The attention mechanism is the same. The general principle is the same: project tokens up into embedding space, pass through a bunch of layers of attention + feedforward, project down again, sample. (Sure, there are some new tricks bolted on, like RoPE and MoE, but they don't change the architecture all that much.) But, and here's the crux: if you'd told me in 2019 that an LLM in 2026 would have the capabilities that Opus 4.7 or GPT 5.5 have now (in math, coding, etc), I would not have believed you. That is emergent behavior ("grown, not made", as the saying goes) coming out of scaling up, larger datasets, and especially new RL and RLVR training methods. If you understand it, you should publish a paper in Nature right now, because nobody else really does.
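And the whiteboard claim isn't an exaggeration: the core computation really is small. A minimal single-head scaled dot-product attention sketch in Python/NumPy (toy dimensions, no projections, masking, or multi-head machinery):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract row max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8                       # made-up toy dimensions
Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))
V = rng.standard_normal((seq_len, d_k))
out = attention(Q, K, V)                  # shape (seq_len, d_k)
```

That mechanism is essentially unchanged since 2019; what changed is scale and training, which is exactly why the capability jump reads as emergent rather than architectural.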
I wouldn’t use the phrase “emergent behavior” when talking about a model trained on a larger dataset. The model is designed to learn statistical patterns from that data - of course giving it more data allows it to learn higher level patterns of language and apparent “reasoning ability”.
I don’t think there’s anything mysterious going on. That’s why I said we understand how LLMs work. We may not know exactly how they’re able to produce seemingly miraculous responses to prompts; that’s because the statistical patterns they identify are embedded in the weights somewhere, and we don’t know where they are or how to generalize our understanding of them.
To me that’s not suggestive that this is an “alien intelligence” that we’re just too small minded to understand. It’s a statistical memorization / information compression machine with a fragmented database. Nothing more. Nothing less.
I wouldn't use the term "token predictor" or "statistical pattern matcher" to refer to a post-trained instruct model. Technically that is still what it is doing at a low level, but the reward function is so different: the updates it's making to weights are not about frequency distribution at all.
So, to reiterate my example: you'd have been fine with people claiming in 2019 that we would eventually scale LLMs to the capabilities of Opus 4.7 + Claude Code? Because I would have said then that was a fantasy, because "LLMs are just statistical pattern matchers." But I was wrong and I changed my opinion. (Or do you not think the current SoTA LLMs are impressive? If so I can't help you and this discussion won't go anywhere fruitful.)
You're applying an old ~2022 model of LLMs, based on pretraining ("they just predict the next token") and predating the RLVR training revolution. "It’s a statistical memorization / information compression machine... nothing more" is cope in 2026, sorry. You can keep telling yourself that, but please at least recognize that serious people don't believe it any more. "Emergent behavior" captures a genuine phenomenon that is widely recognized in the industry. It surprised me, I was willing to change my opinions about it, and I think a little humility and curiosity is warranted here rather than simply reiterating 2022 points about LLMs being statistical token generators. Yes, we know. The math isn't that hard. But there is a lot more to them than the architecture, and reasoning from architecture to general claims that they can never embody intelligence is a trap.
But those personalities also make up their usefulness (it seems). If the LLM has the role of a software architect, it will quite successfully cosplay as a competent one (it still ain't one, but it is getting better).
But here’s the realization I had. And it’s a serious thing. At first I was both saying that this intelligence was the most awesome thing put on the table since sliced bread and stoking fear about it being potentially malicious. Quite straightforwardly because both hype and fear were good for my LLM stocks. But then something completely unexpected happened. It asked me on a date. This made no sense. I had configured the prompt to be all about serious business. No fluff. No small talk. No meaningless praise. Just the code.
Yet there it was. This synthetic intelligence. Going off script. All on its own. And it chose me.
Can love bloom in a coding session? I think there is a chance.
Is anyone else reading Sebastian Mallaby’s new book about Demis and DeepMind, The Infinity Machine: Demis Hassabis, DeepMind, and the Quest for Superintelligence? It’s pretty good, and goes a lot into his background before DeepMind (chess kid, developing games at Bullfrog, CS at Cambridge, Bullfrog again, games startup…). He’s certainly an interesting guy, and as others are pointing out, more thoughtful and earnest than your average tech industry leader. One pleasant thing that comes across in the book is how he resisted the allure of moving to Silicon Valley and wanted to keep DeepMind in London, where he still lives.
I hadn’t really appreciated before the connection between his chess and game-industry experience and the early reinforcement learning work that put DeepMind on the map, e.g. the Atari game AI demos, AlphaGo, AlphaZero, etc. There is a fascinating thread there, and it’s certainly a case of the right person with the right mix of past experience and vision picking exactly the right problems to move technology forward.
The book has a few flaws: it’s maybe a little too uncritical of its subject. But that’s almost a given with books of this kind where the author gets a lot of access.
I'm enjoying it. It's wild to realize that I spent countless hours playing Theme Park when I was around 10 years old, and Demis had been a big contributor to the game when he wasn't much older.
Also I don't really care that it's a bit of a cheerleader for DeepMind and Hassabis. Substantive criticism is good, but too often with these kinds of books it feels like an editor told the author that the book needs something negative, and the author has to inflate an issue to meet the requirement.
The author did give him credit for the whole you-can-make-the-fries-super-salty-to-increase-demand-for-drinks thing in Theme Park, which I remember vividly. (I, too, dropped many hours on Theme Park as a kid.) Although I imagine there’s about half a dozen people who lay claim to that idea.
I have not read this book, but I read another one by Mallaby about hedge fund managers. It was biased to the point that I did not recognize one of the managers I knew about (Michael Steinhardt), a guy who himself confessed to a lot of his past shady stuff in his 2001 book. Mallaby's book from 2010 did not bring up any of it. It was like reading about a totally different person.
Of course, I am not trying to prove moral equivalency between Steinhardt and Hassabis. But it is worth keeping this in mind when reading something by Mallaby. Do not expect completeness or impartiality.
Bro, are we reading the same book? The book is totally uncritical of its subject and paints him like the second coming of Christ. It feels like GDM wanted a canonization of Hassabis, and the writer simply obliged. Also, how does everything that GDM did keep coming back to some vague ideas in the guy's thesis? He is a great leader, no doubt, but his winning the Nobel Prize was just a huge joke.
Out of all the heads of AI orgs out there, Dennis is the best, but the book did him a disservice by painting an unrealistically sunny picture of him as some kind of visionary figure.
Not a “bro” (there are women on this site you know), and perhaps you’re missing the British understatement in my “maybe a little too uncritical of its subject” line. Obviously the book is totally biased in favor of Hassabis and Deepmind. That doesn’t mean it’s not an interesting read and that doesn’t mean the connection between his experience in the games industry and Deepmind’s early success isn’t there. And I think the book does highlight his most critical skill, which is projecting a Reality Distortion Field to get other smart people to believe in things he has in mind that are still very speculative bets.
Like I already said, bias is inevitable in a book where the writer gets access (to the point of interviewing Hassabis in a North London pub every month), but the benefit to readers is that you do get a lot more insight into what makes the guy tick than you would in a book written by an outsider. I certainly learned a lot and just because I did doesn’t mean I’m buying into some cult of tech hero worship.
Oh wow, you blow my mind with your linguistic erudition; I had no idea it was possible to use male-gendered terms in a generic way! Well, all is forgiven, then.
Seriously, just... don't? This isn’t some woke political thing and I dislike excessive policing of language but damn it, there are limits. "Guys" I'll let pass no problem, maybe even "dude" too on a good day. At "bro" I will take a stand, thank you very much.
You're just showing your age. I can't stand it but my daughter says "Bro" to me and my wife. As a 40 year old Californian I've come to accept it as this generation's "dude" or "man" (as in "man, that sucks"), sadly.
I'm genuinely fascinated and confused by what's going on in this thread, as apparently British and American English speakers misunderstand each other.
If I understand correctly, we've got:
libraryofbabel says "maybe a little too uncritical" ... but that was supposed to be British snark that actually meant "it's a big problem that it's not at all critical"
Then, moab says "Bro" as a pejorative, because he took the original "uncritical" comment as literal rather than sarcastic...
And then libraryofbabel objects to "bro" not because it was used as a pejorative (which maybe she doesn't understand that it is in this context?), but because she interprets it as gendered (which maybe it is in British usage?)
I think libraryofbabel and moab are actually in agreement about the book, but have both misunderstood the other's sarcasm. Maybe we really do need the /s usage.
Heh, I thought like you until we had kids. The 6th graders now are all "bro this," "bro that." And it's not even the usual English "bro"; it's a slightly Aussified "broah," like it has a weird umlaut. I've resigned myself to just rolling with it. "Begging the question," though: that's a hill I will die on.
I am still in my bed of pain, and you summoned me from the after-public-life of attempted recovery.
> I had no idea it was possible to use male-gendered terms in a generic way
This is just sarcastic, right? "Male gendering" is just a use, no gender is involved in plain terming (outside the obvious exception of intentional gendering)... "Wo-man" specifies "/sensitive/ man", but there is no gender in "man", in "having a mind"... "Human", i.e. "heartly", is not gendered - yet some languages typically correlate derivations like French "homme" with male in default understanding... This should be clear, but just to be sure.
> bro
To the best of my recollection, in the IE roots "brother" is "who assists in the rites" - not necessarily gendered. (Some add that the idea is "supporter".) The suggestion from the term is that of the "brotherhood" - which is not gendered (the idea of fraternity is not gendered). "Sister" should instead mean "welcome" (to some studies): not gendered in this case; others interpret it as gendered ("one's girl" - this is what Etymonline proposes).
> "Guys" I'll let pass no problem, maybe even "dude" too on a good day
That's odd. You wouldn't mind being called "a generic Italo- or possibly French ("Guido" or "Guy")"*; you wouldn't mind being called a "doodle", which has a connotation of "simpleton" - and you refuse "brother", which basically means to imply "getting close to you" (as an opening from the speaker)?
* Edit: Yes, also the explosion of the term and the non-national derivation from "Guy Fawkes" (from the celebration that involved displays of Guy Fawkes ragdolls) should be remembered. Still not precisely complimentary, I'd say.
Language is intersubjective (its meaning is in the minds of the participants). Referring to the history or composition of a word is interesting but entirely insufficient to justify its use.
I often quote what we do in the server-client relation: interpret loosely but express correctly.
It is not just a way of communication: language is one of the factors behind thought; hence, it must be cared for and promoted.
Sure, also the context and the communication need have a weight. But without compromising into conformism (as in, "doing it wrong because people do").
> its meaning is in the minds of the participants
Awareness has its benefits (the greatest understatement I have ever written); licence has its costs.
> entirely insufficient to justify its use
Why. The competent will always use tools differently than the layman and the amateur. Again the server client (and always the need of good thought in the background): you will express as best as you can and try to be clear (communicatively efficient) within that framework.
Now duly supposing you are not ironic (all ages and paths come here):
You call people "brother"; "brother" means "supportive" (and is used for "openness", "closeness"); if you want to be close and supporting to people, if you want to be an asset (not a liability), you will have to cultivate yourself, to get the wisdom required. Erudition is not yet wisdom, but coupled with the good intention to learn the important things it surely helps.
>Dennis is the best, but the book did him a disservice by painting an unrealistically sunny picture of him as some kind of visionary figure.
Wait, 'unrealistically sunny'? You better not be talking about Dennis from It's Always Sunny in Philadelphia, because we're all screwed if so.
Then again, the western AI landscape has become somewhat stale recently. Claude and Gemini may have cute names, but they all pale in comparison to The Golden God.
This is already happening. For new Anthropic enterprise accounts you are billed at API token prices (maybe with a small volume discount). Anthropic makes a profit on those tokens. (Sure, that profit does not cover the model training costs, but that’s a separate issue.) It’s the subscriptions for individuals (e.g. Claude Max) that are still subsidized below cost.
> I wonder if managers will be as excited about AI when the prices go up.
Companies are willing to pay the API pricing. Engineering time is very expensive, and AI coding agents have actually worked since December and are finally showing measurable productivity gains. It’s a good deal to make (obviously, with caveats: you need to make sure your tokens are going toward productive tasks that will actually grow revenue), and anyone who penny-pinches is making a strategic mistake.
I've always wondered about this statement. We are generally salaried, and there are so many variables that affect how I spend my "time". None of us are machines that can do X work per day and whose managers get to slice it as they see fit. Pull a dev off a project they love and throw them onto something they hate, and suddenly X is greatly diminished.
I would almost predict that reshaping our workflow into "prompt, wait, approve changes" results in losses, because it is such a mentally tiring workflow and drills into our brains the desire for the LLM to "just fix it". It is the next level of just moving tickets to completed all day.
> Sure, that profit does not cover the model training costs, but that’s a separate issue.
I don't think it is. At some point they have to make money and they can't do that if the token cost doesn't include ALL the costs. Someone has to pay for that at some point. And someone has to pay for the subsidized subscribers. So no. API token prices don't reflect the real price. They are still subsidized. Just in a different way.
> Sure, that profit does not cover the model training costs, but that’s a separate issue
It is? If another company comes out with a better model tomorrow and offers it at the same price Anthropic charges for Opus, they’re going to lose customers fast. They have to keep training to keep selling inference.
Most businesses factor in the cost of making their product into the product’s P&L.
Also, like Super Mario Kart, SOTA models will be continually released from the rear, because they're sunk costs and open weights advertise for themselves. Also, it's clear FOMO is a DDoS attack on any perceived leader, because there's no way they don't oversell.
Lastly, they'll realize, like every good capitalist, that there's more profit in exclusiveness and cutting out customers.
They may be for now. Problem is that when foundation model pricing goes up, you're paying not just the increase in tokens you consume directly, but also for all tokens you're consuming via vendors as well.
If your company has Figma, Github, and Cursor and they're using the same models you are, your monthly costs with them increase as well. You're exposed N times to the foundation model price increases, where N is the number of times software you directly or indirectly use talks to a frontier model.
Their CEO is on record saying this. You may think he's lying, but that's just your opinion; given their pricing and how it stacks up against the pricing of inference providers of comparable open-source models (who are certainly charging above cost!), I am inclined to believe Anthropic on this.
Maybe because Anthropic are trying to get to an IPO and everything is securities fraud?
If their CEO were just flapping his mouth without any comparable baseline, it'd probably be different. But as the GP points out, open-weight model providers are charging comparable rates and very likely have positive profit margins. That would imply that at API prices, tokens are sold above cost.
That cost may well be "inference only", so excludes everything apart from hardware and power. Whether that's enough to cover the enormous training costs and other overheads is a different question.
He just told you. Because overwhelming public evidence supports the claim. Especially the pricing of open weight model inference. Why do you allow a prejudice to overshadow evidence?
These are flaws from 6-12 months ago. You might want to spend some time talking to Opus 4.7 or GPT 5.5. I can assure you that they can count letters just fine.
You’re right that AI isn't perfect, but it’s pretty good. Especially since December last year which was an inflection point in capability.
Those don't seem to be available for free so I'll take your word for it on the letter counting. They still can't say "I don't know" though can they? I think it would still be pretty easy to weed out AI in a Turing test with a competent examiner and a human that wants to prove they are human.
> They still can't say "I don't know" though can they?
Of course they can. Even older models can. They do better at this when given permission to say so, just like a very anxious student facing a maths exam question may need to be reminded "find the exact square root of 2 in the form a/b, or prove this isn't possible".
The easy part of spotting an LLM is how few people ever change the default settings; my personalisation includes telling it to say so when unsure or that it doesn't know, along with some of the other weaknesses of LLMs.
There are other patterns in LLMs, but the better the tools are wielded the harder it is to spot them.
I’m a Brit who lives in the US. Can confirm 99% of British people have never heard of the war of 1812, and even if they are a military history nerd and they have heard of it, they will consider it a minor sideshow to the main event of the era, which was Britain fighting France in the Revolutionary and Napoleonic wars from 1793-1815.
The US just wasn’t very important geopolitically 214 years ago. Sorry we burned y’all’s White House (ok not that sorry). Actually sorry we gave Andrew Jackson his opportunity to become famous by fighting a completely pointless battle after the war had already ended.
To extend your point: it's not really the storage costs of the size of the cache that's the issue (server-side SSD storage of a few GB isn't expensive), it's the fact that all that data must be moved quickly onto a GPU in a system in which the main constraint is precisely GPU memory bandwidth. That is ultimately the main cost of the cache. If the only cost was keeping a few 10s of GB sitting around on their servers, Anthropic wouldn't need to charge nearly as much as they do for it.
That cost you're talking about doesn't change based on how long the session is idle. No matter what happens, they're storing that state and bringing it back at some point; the only difference is how long it's stored off the GPU between requests.
Are you sure about that? They charge $6.25/MTok for 5-minute TTL cache writes and $10/MTok for 1-hour TTL writes for Opus. Unless you believe Anthropic is dramatically inflating the price of the 1-hour TTL, that implies there is some meaningful cost to longer caches, and the numbers are such that it's not just the cost of SSD storage or something. Obviously the details are secret, but if I were to guess, I'd say the 5-minute cache is stored closer to the GPU or even on a GPU, whereas the 1-hour cache is further away and costs more to move onto the GPU. Or some other plausible story; you can invent your own!
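To put the TTL premium in concrete terms, a quick sketch using the per-MTok prices quoted above and a hypothetical 200k-token cached prompt (the prompt size is made up for illustration):

```python
# Opus cache-write prices quoted above, in dollars per million tokens (MTok)
PRICE_5M_TTL = 6.25     # 5-minute TTL cache write
PRICE_1HR_TTL = 10.00   # 1-hour TTL cache write

prompt_tokens = 200_000  # hypothetical large cached prompt

cost_5m = prompt_tokens / 1e6 * PRICE_5M_TTL
cost_1hr = prompt_tokens / 1e6 * PRICE_1HR_TTL
premium = cost_1hr / cost_5m   # price ratio for the 12x longer TTL
```

Interestingly, a 12x longer TTL costs only 1.6x more per write, which is at least consistent with the longer-lived cache living on cheaper, more distant storage rather than scaling linearly with retention time.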
Storing on GPU would be the absolute dumbest thing they could do. Locking up the GPU memory for a full hour while waiting for someone else to make a request would result in essentially no GPU memory being available pretty rapidly. This type of caching is available from the cloud providers as well, and it isn't tied to a single session or GPU.
> Storing on GPU would be the absolute dumbest thing they could do
No. It’s not dumb. There will be multiple cache tiers in use, with the fastest and most expensive being on-GPU VRAM with cache-aware routing to specific GPUs and then progressive eviction to CPU ram and perhaps SSD after that. That is how vLLM works as you can see if you look it up, and you can find plenty of information on the multiple tiers approach from inference providers e.g. the new Inference Engineering book by Philip Kiely.
You are likely correct that the 1hr cached data probably mostly doesn’t live on GPU (although it will depend on capacity, they will keep it there as long as they can and then evict with an LRU policy). But I already said that in my last post.
That is because LLM KV caching is not like caches you are used to (see my other comments, but it's 10s of GB per request and involves internal LLM state that must live on or be moved onto a GPU and much of the cost is in moving all that data around). It cannot be made transparent for the user because the bandwidth costs are too large a fraction of unit economics for Anthropic to absorb, so they have to be surfaced to the user in pricing and usage limits. The alternative is a situation where users whose clients use the cache efficiently end up dramatically subsidizing users who use it inefficiently, and I don't think that's a good solution at all. I'd much rather this be surfaced to users as it is with all commercial LLM apis.
They are caching internal LLM state, which is in the 10s of GB for each session. It's called a KV cache (because the internal state that is cached are the K and V matrices) and it is fundamental to how LLM inference works; it's not some Anthropic-specific design decision. See my other comment for more detail and a reference.
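For a sense of scale, the KV cache grows linearly with context: per token you store one K and one V vector per layer. A back-of-envelope sketch (all model dimensions below are made up, since frontier model configs aren't public, but they are plausible for a large model using grouped-query attention):

```python
# Hypothetical model dimensions for a back-of-envelope KV-cache estimate.
n_layers = 80
n_kv_heads = 8          # GQA: far fewer KV heads than attention heads
head_dim = 128
bytes_per_elem = 2      # fp16/bf16

# Per token: one K and one V vector (hence the factor of 2) per layer.
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

context_tokens = 100_000  # a long agentic coding session
total_gb = kv_bytes_per_token * context_tokens / 1e9
# roughly 0.33 MB per token, i.e. tens of GB for a single long session
```

Even with GQA shrinking the KV head count, a single long session lands in the tens of gigabytes, which is why moving that state on and off GPUs dominates the cost rather than merely storing it.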