The opera, symphony, and ballet sell out every performance where I live. My friends, my wife, and I all read multiple books per month. To me it feels like the problem is on the supply side - there's just endless content being constantly published, more than could ever be read.
I chased down what the "4x faster at AI tasks" was measuring:
> Testing conducted by Apple in January 2026 using preproduction 13-inch and 15-inch MacBook Air systems with Apple M5, 10-core CPU, 10-core GPU, 32GB of unified memory, and 4TB SSD, and production 13-inch and 15-inch MacBook Air systems with Apple M4, 10-core CPU, 10-core GPU, 32GB of unified memory, and 2TB SSD. Time to first token measured with an 8K-token prompt using a 14-billion parameter model with 4-bit quantization, and LM Studio 0.4.1 (Build 1). Performance tests are conducted using specific computer systems and reflect the approximate performance of MacBook Air.
>Time to first token measured with an 8K-token prompt using a 14-billion parameter model with 4-bit quantization
Oh dear, 14B at 4-bit quant? There are going to be a lot of embarrassed programmers who need to explain to their engineering managers why their MacBook can't reasonably run LLMs like they said it could. (This already happened at my Fortune 20 company lol)
It won’t handle serious tasks, but I have Gemma 3 installed on my M2 Mac and it is good for most of my needs, especially data I don’t want a corporation getting its hands on.
I run Qwen 3.5 30B MoE and it’s reasonable at most tasks I would use a local model for - including summarizing things. For instance, I update all my toolchains automatically in the background when I log in, and when that finishes I use my local model to summarize everything that was updated, plus any errors or issues, on the next prompt rendering. It’s quite nice b/c everything stays updated, I know what’s been updated, and I am immediately aware of issues. I also use it for a variety of “auto correct” tasks, “give me the command for,” “summarize the man page and explain X,” and a bunch of tasks where I would rather not copy and paste, etc.
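For anyone curious, the wiring for this kind of thing is simple. Here's a minimal sketch assuming LM Studio's OpenAI-compatible local server on its default port; the model name and log path are made up for illustration:

```python
# Hypothetical sketch: summarize a toolchain-update log with a local model.
# Assumes LM Studio's OpenAI-compatible server on localhost:1234 (its default);
# the model name and log path below are invented.
import json
import urllib.request

def build_summary_prompt(log_text: str) -> str:
    """Wrap the raw update log in a terse summarization instruction."""
    return (
        "Summarize the following toolchain update log. "
        "List what was updated and call out any errors or warnings:\n\n"
        + log_text
    )

def summarize(log_text: str,
              url: str = "http://localhost:1234/v1/chat/completions") -> str:
    body = json.dumps({
        "model": "qwen3.5-30b",  # whatever model is currently loaded
        "messages": [{"role": "user", "content": build_summary_prompt(log_text)}],
    }).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

if __name__ == "__main__":
    with open("/tmp/update.log") as f:
        print(summarize(f.read()))
```

A login hook can write the update output to the log file and the next shell prompt can display the summary.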
Nothing like coding, just relatively basic stuff. Idk, it's hard to explain, but I use AI so frequently for work that I have a sense for what it is capable of.
I should clarify that by small I mean in the 3-8B range. I haven't tested the 14-30B ones, my experience is only about the smaller ones.
In my experience, small models are not good for coding (except very basic tasks), and they're not good for general knowledge. So the only purpose I could see for them would be tasks where they're given the information, i.e. summarization or RAG.
But in my summarization experiments, they consistently misunderstood the information given to them. They constantly made basic errors and failed to understand the text.
So having eliminated programming, general knowledge, and summarization (and by extension RAG, because if you can't understand the information, then you can't do RAG either, by definition) -- I have eliminated all the use cases that I had in mind!
That would leave very basic tasks like classification or keywords, but I think there they would be in the awkward middle ground of being disappointing relative to big LLMs for many tasks, and cumbersome relative to small specialized models which can run fast and cheap and be fine tuned.
This wasn’t a statement about capability. It’s just a detail about what model they used to compare the speed of two chips for this purpose. You want a bigger model, run a bigger model.
Yeah, no it didn’t. If you have a fully specced-out M3/M4 MacBook with enough memory, you’re running pretty decent models locally already. But no one is using local models anyway.
I run a local model daily. I have it making tickets when certain emails come in, and I made a small button I can click to approve ticket creation.
It follows my instructions and has a nice chain of thought process trained.
Local LLMs are starting to become very useful. Not OpenClaw crap.
I’m not sure what model I’d trust locally with anything meaningful in Openclaw. The smaller/simpler the model is, the greater the chance of fluff answers is.
Technically you can get most MoE models to execute locally because RAM requirements are limited to the active experts' activations (which are on the order of active param size), everything else can be either mmap'd in (the read-only params) or cheaply swapped out (the KV cache, which grows linearly per generated token and is usually small). But that gives you absolutely terrible performance because almost everything is being bottlenecked by storage transfer bandwidth. So good performance is really a matter of "how much more do you have than just that bare minimum?"
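To put rough numbers on that bare minimum, here's some back-of-the-envelope arithmetic (the model shape below is illustrative, not any specific release):

```python
# Back-of-the-envelope memory math for running a MoE model locally.
# All numbers are illustrative, not measurements of a specific model.

def active_bytes(active_params_b: float, bytes_per_param: float = 0.5) -> float:
    """RAM for the hot path: active experts' weights at 4-bit (0.5 B/param)."""
    return active_params_b * 1e9 * bytes_per_param

def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bytes_per_elem: int = 2) -> int:
    """KV cache grows linearly in tokens (fp16 here); 2x for keys and values."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem

# E.g. a 30B-total / 3B-active MoE: only ~1.5 GB of weights are hot at 4-bit;
# the rest can stay mmap'd on disk (at a large speed cost if it thrashes).
hot = active_bytes(3)                  # active-expert weights
kv = kv_cache_bytes(48, 4, 128, 8192)  # KV cache at an 8K context
print(f"hot weights ~ {hot/1e9:.1f} GB, KV cache ~ {kv/1e9:.2f} GB")
```

Everything beyond that minimum is what keeps the run off the storage-bandwidth cliff.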
Oh sure it is! I’ve helped set up an AI cluster rack with four K2.5s.
With some custom tooling, we built our own local enterprise setup:
- Support ticketing system
- Custom chat support powered by our trained software-support model
- Resolved repository with detailed step-by-step instructions
- User-created reports and queries
- Natural language-driven report generation (my favorite — no more dragging filters into the builder; our (Secret) local model handles it for clients)
- In-application tools (C#/SQL/ASP.NET) to support users directly, since our software runs on-site and offline due to PII
- A cool repair tool: import/export “support file packet patcher” that lets us push fixes live to all clients or target niche cases
Qwen3 with LoRA fine-tuning is also incredible — we’re already seeing great results training our own models.
There’s a growing group pushing K2.5s to run on consumer PCs (with 32GB RAM + at least 9GB VRAM) — and it’s looking very promising. If this works, we’ll be retooling everything: our apps and in-house programs. Exciting times ahead!
Now that you mentioned it, these macs could theoretically also run crysis if it supported arm and such! They should add that to the marketing material :)
So it's not measuring output tokens/s, just how long it takes to start generating tokens. Seems we'll have to wait for independent benchmarks to get useful numbers.
For many workflows involving real-time human interaction, such as a voice assistant, this is the most important metric. Very few tasks are as sensitive to quality, once a certain response-quality threshold has been reached, as the software planning and writing tasks most HN readers are likely familiar with.
The way that voice assistants work, even in the age of LLMs, is:
Voice -> Speech to Text -> LLM to determine intent -> JSON -> API call -> response -> LLM -> Text to Speech.
TTFT is irrelevant; you have to process everything through the pipeline before you can generate a response. A fast model is more important than a good model.
Source: I do this kind of stuff for call centers. Yes I know modern LLMs don’t go through the voice -> text -> LLM -> text -> voice anymore. But that only works when you don’t have to call external sources
An “intent” is something that a person wants to do - set a timer, get directions, etc.
A “slot” is the variable part of an intent. For instance, “I want directions to 555 MockingBird Lane” would trigger a Directions intent that requires where you are coming from and where you are going. Of course, in that case it would assume your current location as the starting point.
Back in the pre LLM days and the way that Siri still works, someone had to manually list all of the different “utterances” that should trigger the intent - “Take me to {x}”,”I want to go to {x}” in every supported language and then had to have follow up phrases if someone just said something like “I need directions” to ask them something like “Where are you trying to go”.
Now you can do that with an LLM and some prompting and the LLM will keep going back and forth until all of the slots are filled and then tell it to create a JSON response when it has all of the information your API needs and you call your API.
This is what a prompt would look like to use a book-a-flight tool.
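Here's a hypothetical sketch of that kind of prompt, plus the harness check that decides when to call the API. The tool name, slot names, and JSON shape are all invented for illustration:

```python
# Hypothetical slot-filling prompt for a book-a-flight tool, plus the check
# the harness runs on each model reply. All names here are made up.
import json

SYSTEM_PROMPT = """You are a flight-booking assistant.
Collect these slots from the caller: origin, destination, date.
Ask one follow-up question at a time for any missing slot.
When all slots are filled, reply ONLY with JSON:
{"intent": "book_flight", "origin": ..., "destination": ..., "date": ...}"""

def ready_to_call_api(llm_reply: str) -> bool:
    """Call the booking API once the model emits valid JSON with every slot
    filled; otherwise keep the dialogue going."""
    try:
        data = json.loads(llm_reply)
    except ValueError:
        return False  # still mid-conversation (plain-text follow-up question)
    return data.get("intent") == "book_flight" and all(
        data.get(slot) for slot in ("origin", "destination", "date"))
```

The loop feeds user turns and model turns back and forth until `ready_to_call_api` returns True, then hands the JSON to the real API.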
Using LLMs for voice assistants is relatively new at scale. That's the difference between Alexa and Alexa+, and between the Gemini-powered Google Assistant and what Apple has been trying to do with Siri for two years.
It's really just using LLMs for tool calling. It's just that call centers were mostly built before the age of LLMs, and companies are slow to update.
Understood. This overlaps with a side project where I’m getting acceptable (but not polished) results, so trying to do some digging about optimizations. Thanks!
One of my niches is Amazon Connect - the AWS version of Amazon’s internal call center. It uses Amazon Lex for voice to text. Amazon Lex is still the same old intent based system I mentioned. If it doesn’t find an intent, it goes to the “FallbackIntent” and you can get the text transcription from there and feed it into a Lambda and from the Lambda call a Bedrock hosted LLM. I have found that Nova Lite is the fastest LLM. It’s much faster than Anthropic or any of the other hosted ones.
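Roughly, the Lambda side of that FallbackIntent path looks like this. This is a sketch: the Nova Lite model ID, the `converse` call shape, and the Lex event/response fields are as I recall them from the Bedrock and Lex V2 docs, so double-check for your region:

```python
# Sketch of the FallbackIntent path: a Lambda takes the Lex transcript and
# forwards it to a Bedrock-hosted model. Model ID and response plumbing are
# illustrative; verify against the Bedrock Runtime and Lex V2 docs.

def handler(event, context, client=None):
    if client is None:  # real Lambda path; injectable for local testing
        import boto3
        client = boto3.client("bedrock-runtime")
    transcript = event["inputTranscript"]  # Lex passes the raw utterance here
    resp = client.converse(
        modelId="amazon.nova-lite-v1:0",
        messages=[{"role": "user", "content": [{"text": transcript}]}],
    )
    answer = resp["output"]["message"]["content"][0]["text"]
    # Hand the text back to Lex/Connect to speak to the caller
    return {"messages": [{"contentType": "PlainText", "content": answer}]}
```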
It's going to be faster no matter what. My M3 Max prints tokens faster than I can read with the new MoE models. It's the prompt processing that kills it once the context grows beyond a threshold, which is easy to do in modern agentic loops.
This is broadly correct for currently favoured software, but in computer science optimization problems you can usually trade off compute for memory and vice versa.
Good point on speculative decoding techniques. I'd forgotten about them, and they're good. Would love to see some of these get into llama.cpp and friends, but it does require somebody to come up with a distilled draft model.
But low rank compression isn't trading off compute for memory - it's just compressing the model. And critically, that's lossy compression. That's primarily a trade-off of quality for speed/size, with a little bit of added compute. Same goals as quantization. If there was some compute-intensive lossless compression of parameters, lots of people would be happy. But those floating point values look a lot like gaussian noise, making them extremely difficult to compress.
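A quick numerical illustration of that lossiness: truncated SVD on a random (gaussian-looking) matrix shrinks the parameter count a lot, but the reconstruction error stays large, which is exactly the "weights look like noise" problem:

```python
# Low-rank (SVD) compression of a weight matrix: fewer numbers to store,
# at the cost of approximation error -- lossy, like quantization.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512)).astype(np.float32)

U, S, Vt = np.linalg.svd(W, full_matrices=False)
k = 64                                   # keep the top-64 singular values
W_approx = (U[:, :k] * S[:k]) @ Vt[:k]   # rank-k reconstruction

orig_params = W.size                                # 512*512
lowrank_params = U[:, :k].size + k + Vt[:k].size    # two thin factors + S
rel_err = np.linalg.norm(W - W_approx) / np.linalg.norm(W)
print(f"params: {orig_params} -> {lowrank_params}, rel. error {rel_err:.2f}")
```

For a trained layer with real low-rank structure the error would be smaller, but it never reaches zero, so it trades quality for size just like quantization does.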
Topical. My hobby project this week (0) has been hyper-optimizing microgpt for M5's CPU cores (and comparing to MLX performance). Wonder if anything changes under the regime I've been chasing with these new chips.
I think these aren't meant to be representative of arbitrary userland-workload LLM inferences, but rather the kinds of tasks macOS might spin up a background LLM inference for. Like the Apple Intelligence stuff, or Photos auto-tagging, etc. You wouldn't want the OS to ever be spinning up a model that uses 98% of RAM, so Apple probably considers themselves to have at most 50% of RAM as working headroom for any such workloads.
On my 24GB RAM M4 Pro MBP, some models run very quickly through LM Studio into Zed, and I was able to ask one to write some code. Of course my fan starts spinning like the world's ending, but it's still impressive what I can do 100% locally. I can't imagine what a more serious setup like the Mac Studio can do.
That's how they make loot on their 128GB MacBook Pros: by kneecapping the cheap stuff. Don't think for a second that the specs weren't chosen so that professional developers would have to shell out the 8 grand for the legit machine. They're only gonna let us do the bare minimum on a MacBook Air.
For anyone who has been watching Apple since the iPod commercials, Apple has always had a grey area in the honesty of its marketing.
And not even diehard Apple fanboys deny this.
I genuinely feel bad for people who fall for their marketing thinking they will run LLMs. Oh well, I got scammed on runescape as a child when someone said they could trim my armor... Everyone needs to learn.
Yesterday I ran qwen3.5:27b on an M1 Max with 64 GB of RAM. I have even run Llama 70B back when llama.cpp came out. These run sufficiently well, if somewhat slowly, and the improvements in the M5 Max should make for a much faster experience.
I don't know that there would be a huge overlap between the people who would fall for this type of marketing and the people who want to run LLMs locally.
There definitely are some who fit into this category, but if they're buying the latest and greatest on a whim then they've likely got money to burn and you probably don't need to feel bad for them.
Reminds me of the saying: "A fool and his money are soon parted".
In retrospect, was there a better place to learn about the cruelty of the world than runescape? Must've got scammed thrice before I lost the youthful light in my eye
my mac mini m4 is getting to be a good substitute for claude for a lot of use cases. LM Studio + qwen3.5, tailscale, and an opencode CLI harness. It doesn't do well with super long context or complexity but it has gotten production quality code out for me this week (with some fairly detailed instructions/background).
Musk is leading the build of the biggest objects we have ever sent to space. It does give him some sort of aura that is hard to dismantle, let's be honest.
He can do and say a lot of shit because he will still be viewed as real-life Iron Man, because in some ways he kind of is.
Elon Musk would have put the money Apple had sloshing about over the years to better use than failing to build a single battery electric vehicle at a cost of $1 billion a year over many years.
He doesn't have a RDF but has Kardashev Scale Intent (KSI).
The lobbyists in the political fray are out to steal his value for money lunch despite his demonstrated effectiveness, over and over again.
Jobs couldn't even engage the politicians to give away or at discount the Apple ][ to education.
Somehow Tim Cook's years-long position that the Lightning port was very important to Apple vs. USB-C fell flat as a parsec-wide pancake.
(It didn't help that they couldn't point to a single user facing feature.)
Or that the App Store lock-in is for our safety, when anyone who wanted that particular safety could choose to continue using their store exclusively.
Etc.
He just does not have it. No field. No spiraling eyes. Perhaps he should grow a beard and wave around a tobacco pipe. Works for some.
A bit strange to use time to first token instead of throughput.
Latency to the first token is not like a web page where first paint already has useful things to show. The first token is "The ", and you'll be very happy it's there in 50ms instead of 200ms... but then what you really want to know is how quickly you'll get the rest of the sentence (throughput)
As far as benchmarketing goes, they clearly went with prefill because it's much easier for Apple to improve prefill numbers (flops-dominated) than decode (bandwidth-dominated, at least for local inference); M5 unified memory bandwidth is only about 10% better than the M4's.
To add to the sibling's "good is relative": it also depends on what you're running, not just your tolerance for what counts as good. E.g. with a MoE, decode speeds up, so the prompt-processing delay becomes more noticeable for the same model size in RAM.
Not strange, for the kind of applications models at that size are often used for the prefill is the main factor in responsiveness. Large prompt, small completion.
No you don't. Not as a sticky mushy human with emotions watching tokens drip in. There's a lot of feeling and emotion not backed by hard facts and data going around, and most people would rather see something happening even if it takes longer overall. Hence spinner.gif, that doesn't actually remotely do a damned thing, but it gives users reassurance that they're waiting for something good. So human psychology makes time to first token an important metric to look at, although it's not the only one.
The 4x comes from the neural accelerators (tensor cores in NVIDIA jargon). It's 4x fp16 over the vector path (and 8x compared to M1, because at some point they 2x'd the fp16 vector path). Therefore LLM prefill (context processing/TTFT), diffusion models (image gen), and e.g. video and photo effects that make use of them can be up to 4x faster.
At fp16 that's the same speed at the same clock as NVIDIA.
But NVIDIA still has 2xfp8 and 4xnvfp4.
Batch-1 token generation, that is often quoted, does not benefit from this. It's purely RAM bandwidth-limited.
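That bandwidth limit gives a rough ceiling you can compute on the back of an envelope: every generated token has to stream all the active weights from memory, so tokens/s can't exceed bandwidth divided by active-weight bytes. Numbers below are illustrative:

```python
# Rough ceiling for batch-1 decode speed: each token streams all active
# weights from memory, so tokens/s <= bandwidth / active_weight_bytes.
# Figures are illustrative, not benchmarks.

def max_tokens_per_s(bandwidth_gb_s: float, active_params_b: float,
                     bytes_per_param: float = 0.5) -> float:
    weight_bytes = active_params_b * 1e9 * bytes_per_param  # 4-bit default
    return bandwidth_gb_s * 1e9 / weight_bytes

# A 14B dense model at 4-bit on ~150 GB/s vs ~400 GB/s memory systems:
print(round(max_tokens_per_s(150, 14), 1))  # roughly a 21 tok/s ceiling
print(round(max_tokens_per_s(400, 14), 1))  # roughly a 57 tok/s ceiling
```

Real throughput lands below these ceilings, but the ratio between machines tracks the bandwidth ratio closely.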
This is a huge step over the M4's 153GB/s memory transfer (32GB model).
For local LLMs this makes it a replacement for a DGX Spark, which offers a third of the transfer speed and is not something you toss in your backpack as your laptop. It's practically useful for a lot of local use cases, and that I think is the 4x factor (memory transfer). But the 128GB unified headroom tremendously improves the models you can run and the training you can do.
What is truly amazing is the M1 Max is 400GB/s. 5 years later and we still only hit 1.5x on memory bandwidth. It's quite fascinating how high Apple spec'd it back then with apparently little foreknowledge of how important memory bandwidth would become, and then conversely how little they've managed to improve it now when it's so obvious how important it is.
The reason for that is that most memory bandwidth bumps come with new memory generations. For example an early DDR4 platform (e.g. Intel Skylake/Core iX-6000) and a late one (e.g. AMD Zen3/Ryzen 5000) only differ by 1.5x as well, typically.
The same trend is visible in GPUs: for example, my RTX 2070 (GDDR6) has the same memory bandwidth as a 3070 and only a little bit less than a 4070 (GDDR6X). However, a 5070 does get significantly more bandwidth due to the jump to GDDR7. Lower-end cards like the 4060 even stuck to GDDR6, which gave them a bandwidth deficit compared to a 3060 due to the narrower memory buses on the 40 series.
I bought an iPad for the same reason ~everyone does, for media consumption. But if I could use the hardware to do more interesting things then I'd be willing to spend more on a more powerful model.
yes. in a similar vein, we're seeing that get standardized in coding agents as "don't have the agent use tools directly, have the agent write code to call the tools"
are non programmers actually using openclaw successfully? because even "step 1 install your API keys" requires navigating concepts that are foreign to most "civilians"
Journalists, anyway. I think I originally heard it from Casey Newton on Hard Fork, but it was a month back so not 100% sure.
But there's loads of people who would be stumped by a for loop, yet can easily work their way through a setup guide, particularly with the hype/promise and an active community.
I would say 27B matches with Sonnet 4.0, while 397B A17B matches with Opus 4.1. They are indeed nowhere near Sonnet 4.5, but getting 262144 context length at good speed with modest hardware is huge for local inference.
You mean 35B A3B? If this is shit, this is some of the best shit out I've seen yet. Never in a million years did I think I'd have an LLM running locally, actually writing code on my behalf. Accurately too.
yeah, g3p is as smart as or smarter than the other flagships, but it's just not reliable enough; it will go into "thinking loops" and burn 10s of 1000s of tokens repeating itself.
Similarly, Cursor's "Auto Mode" purports to use whichever model is best for your request, but it's only reasonable to assume it uses whatever model is best for Cursor at that moment