Technically, a multi-node cluster with failover (or full-on active-active) will have far higher uptime than a single node.
Practically, getting a multi-node cluster (for any non-trivial workload) to work right, fail over reliably in every case, etc. is far more work and far more code (which can have more bugs), and even if you do everything right and test what you can, unexpected stuff can still kill it. Recently, for example, we had an uncorrectable memory error that hit the ceph daemon just right, so one of the OSDs misbehaved and bogged down the entire cluster...
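The theoretical gain the parent comment alludes to is simple redundancy arithmetic. A minimal sketch (the function name and numbers are illustrative; note the model assumes exactly the things the parent says break down in practice):

```python
def availability_parallel(a: float, n: int) -> float:
    """Availability of n independent redundant nodes: the system is
    down only when all n nodes are down simultaneously. This ignores
    correlated failures and the failover machinery itself failing --
    which is exactly the hard part in practice."""
    return 1.0 - (1.0 - a) ** n

single_node = 0.999                         # ~8.8 hours of downtime/year
two_node = availability_parallel(0.999, 2)  # ~= 0.999999, ~32 s/year
```

The "six nines from three nines" jump only holds while the failure-independence assumption holds, which the memory-error anecdote above shows it often doesn't.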
I agree with OP - the weights are more akin to the binary output from a compiler. You can't see how it works or how it was made; you can't freely manipulate, improve, or extend it. It's like having only the binary of a program. The source code for the model is the training data. The compiler is the tooling that can train a model on a given set of training data.

For me it is not critical that an open source model be distributed ONLY in source form. It is fine that you can also download just the weights. But it must be possible to reproduce the weights - either there should be a tarball with all the training data, or there needs to be a description/scripts for how to obtain the training data. It must be reproducible for someone willing to invest the time and compute, even if 99.999% use only the binary. This is completely analogous to what is normally understood by open source.
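The data-as-source / trainer-as-compiler analogy can be made concrete with a toy sketch. Everything here is illustrative (a real trainer is vastly more complex and only reproducible up to hardware nondeterminism); the point is just the shape of the claim: same released data + same recipe should yield the same "binary":

```python
import hashlib
import json

def train(training_data: list[str], seed: int = 0) -> bytes:
    """Stand-in 'compiler': deterministically turns training data into
    'weights'. Here the weights are faked with a hash; a real training
    run is the expensive version of the same data -> artifact mapping."""
    blob = json.dumps({"data": training_data, "seed": seed}).encode()
    return hashlib.sha256(blob).digest()  # fake weights

# Reproducibility check: two independent runs from the released
# data + recipe yield identical artifacts.
w1 = train(["the cat sat"], seed=42)
w2 = train(["the cat sat"], seed=42)
assert w1 == w2
```

Shipping only `w1` is shipping the binary; shipping the data and `train()` is shipping the source and the compiler.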
It's the flux / rosin also found in the solder that causes that. At the typical soldering temperature of 400 °C, lead evaporates 10 million times slower than ice at -40 °C.
About the axioms, not really. An axiom set is mostly there as shorthand to quickly fix the context we're talking about, but ultimately you could do away with it. E.g. if we let A be the conjunction of axioms of some theory (set theory, number theory, etc.) and you have a mathematical statement of the form X => Y within that theory, you could just as well consider the statement "(A ^ X) => Y" in the purely formal system without any axioms at all. Then it is a purely logical question (essentially, whether X => Y is a theorem of theory A), and "(A ^ X) => Y" is theory-independent in a way that the bare "X => Y" is not.
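In proof-assistant terms this is just the move of turning a global axiom into a hypothesis of the implication. A toy Lean 4 sketch, with `A`, `X`, `Y` as placeholder propositions (the theorem names are made up):

```lean
-- Proving Y from X "inside theory A" is the same as proving the
-- axiom-free statement (A ∧ X) → Y, and vice versa.
theorem pack (A X Y : Prop) (h : A → X → Y) : (A ∧ X) → Y :=
  fun ⟨ha, hx⟩ => h ha hx

theorem unpack (A X Y : Prop) (h : (A ∧ X) → Y) : A → X → Y :=
  fun ha hx => h ⟨ha, hx⟩
```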
I haven't read it all and must admit that I'm not sure I really understood the parts that I did read.
Reading the part under the headline "Why We Need the World, and How LLMs Pretend to Understand It" and its focus on 'next-token-prediction' makes me wonder how seriously to take it. It seems like just another "LLMs are not intelligent, they are merely next token predictors" - an argument that, in my view, is completely invalid and based on a misunderstanding.
The fact that they predict the next token is just the "interface", i.e. an LLM implements "predictNextToken(String prefix)". That says nothing about how it is implemented. One implementation could be a human brain. Another could be a simple lookup table that looks at the last word and selects the next from that. Or anything in between. The point is that 'next-token-prediction' says nothing about implementation, and so does not bound the capabilities, even though it is often invoked as if it did. Just because the model is only required to emit the next token (or rather, a probability distribution over it), it is still permitted to think far ahead - and indeed it has to, if it is to make a good prediction of even just the next token. As interpretability research (and common sense) shows, LLMs form a fairly good idea of what they are going to say many tokens ahead, precisely so that they can make a good prediction for the immediate next token. That's why you can get nice, coherent, well-structured, long responses from LLMs - and why you have probably never seen one get stuck in a dead end with no meaningful continuation.
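The interface/implementation split can be made literal with a toy sketch (all names are illustrative, not any real API): two implementations satisfy the same "predict next token" contract, one trivially dumb, one that plans a whole continuation internally and emits only its first token.

```python
from abc import ABC, abstractmethod

class NextTokenPredictor(ABC):
    """The 'interface': only this signature is fixed; nothing is said
    about how the prediction is computed."""
    @abstractmethod
    def predict_next_token(self, prefix: str) -> str: ...

class LookupTablePredictor(NextTokenPredictor):
    """A deliberately dumb implementation: looks only at the last word."""
    def __init__(self, table: dict[str, str]):
        self.table = table

    def predict_next_token(self, prefix: str) -> str:
        words = prefix.split()
        last_word = words[-1] if words else ""
        return self.table.get(last_word, "<unk>")

class PlanningPredictor(NextTokenPredictor):
    """An implementation that 'thinks ahead': it drafts an entire
    continuation internally, then emits only its first token."""
    def predict_next_token(self, prefix: str) -> str:
        planned = self._plan_continuation(prefix)  # arbitrary lookahead
        return planned[0]

    def _plan_continuation(self, prefix: str) -> list[str]:
        # Stand-in for arbitrarily sophisticated internal computation.
        return ["the", "quick", "brown", "fox"]
```

A caller sampling token by token cannot tell, from the interface alone, which of the two it is talking to - which is exactly why "it's a next token predictor" places no ceiling on capability.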
If you want to reason about LLM capabilities, never think in terms of "stochastic parrot" or "it's just a next token predictor": those framings contain exactly zero useful information and will just confuse you.
I think people hear "next token prediction" and think someone is saying the prediction is simple or linear, and then argue there is a possibility of "intelligence" because the prediction is complex and has some level of indirection or multiple-token-ahead planning baked into the next token.
But the thrust of the critique of next-token prediction or stochastic output is that there isn't "intelligence" because the output is based purely on syntactic relations between words, not on conceptualizing via a world model built through experience and then using language as an abstraction to describe that world. To the computer there is nothing outside tokens and their interrelations, but for people language is just a tool for describing the world - and it is the world, not the language, that we expect "intelligences" to cope with. Which is what this article is examining.
>But the thrust of the critique of next-token prediction or stochastic output is that there isn't "intelligence" because the output is based purely on syntactic relations between words, not on conceptualizing via a world model built through experience and then using language as an abstraction to describe that world. To the computer there is nothing outside tokens and their interrelations, but for people language is just a tool for describing the world - and it is the world, not the language, that we expect "intelligences" to cope with. Which is what this article is examining.
LLMs model concepts internally, and this has been demonstrated empirically many times over the years, including recently by Anthropic (again). Of course, that won't stop people from repeating it ad nauseam.
Concepts within modalities are potentially consistent, but the point the author is making is that the same "concept" vector may lead to inconsistent percepts across modalities (e.g. a conflicting image and caption).
Yes, LLMs often generate coherent, structured, multi-paragraph responses. But this coherence emerges as a side effect of learning statistical patterns in data, not because the model possesses a global plan or explicit internal narrative. There is no deliberative process analogous to human thinking or goal formation. There is no mechanism by which it consciously “decides” to think 50 tokens ahead; instead, it learns to mimic sequences that have those properties in the training data.
Planning and long-range coherence emerge from training on text written by humans who think ahead, not from intrinsic model capabilities. This distinction matters when evaluating whether an LLM is actually reasoning or simply simulating the surface structure of reasoning.
>But this coherence emerges as a side effect of learning statistical patterns in data, not because the model possesses a global plan or explicit internal narrative.
"I added a load-balancer to improve system reliability" (happy)
"Load balancer crashed" (smiling-through-the-pain)