
The 90s aren't coming back to publishing. The audience who reads multiple books a month is going the way of the classical symphony attendee.

The opera, symphony, and ballet sell out every performance where I live. My friends, my wife, and I all read multiple books per month. To me it feels like the problem is on the supply side: there's just endless content being constantly published, more than could ever be read.

Now you're just being contrarian for the sake of being contrarian.

https://www.arts.gov/sites/default/files/SPPA_Comprehensive_...


> more than could ever be read

by a human...


You can, until they turn it off.

Anthropic is pulling the plug on Haiku 3 in a couple months, and they haven't released anything in that price range to replace it.


Surely there are open source models that surpass Haiku 3 at better price points by now.

I chased down what the "4x faster at AI tasks" was measuring:

> Testing conducted by Apple in January 2026 using preproduction 13-inch and 15-inch MacBook Air systems with Apple M5, 10-core CPU, 10-core GPU, 32GB of unified memory, and 4TB SSD, and production 13-inch and 15-inch MacBook Air systems with Apple M4, 10-core CPU, 10-core GPU, 32GB of unified memory, and 2TB SSD. Time to first token measured with an 8K-token prompt using a 14-billion parameter model with 4-bit quantization, and LM Studio 0.4.1 (Build 1). Performance tests are conducted using specific computer systems and reflect the approximate performance of MacBook Air.


>Time to first token measured with an 8K-token prompt using a 14-billion parameter model with 4-bit quantization

Oh dear 14B and 4-bit quant? There are going to be a lot of embarrassed programmers who need to explain to their engineering managers why their Macbook can't reasonably run LLMs like they said it could. (This already happened at my fortune 20 company lol)


I don’t really get why people are smack talking this, are there other laptops available that can do better?

Wrong question. If you sell a 6k€ machine "for AI", then you are judged on your own merits.

Replies like "but, but other laptops" are very weak attempts at deflection.


at 6k you can get 128 gb RAM so you can use bigger models

My 2023 Nvidia 3060 laptop I spent $700 on?

you can't run models that are bigger than 16GB, not comparable.

sure you can. system RAM will be your limiter here.

it's too slow for usable inference though.

Slow and can’t run are different things, but I get your point.

Nope, but other producers don't claim that their hardware "can run AI".

I wonder if Apple foresees locally running LLMs becoming sufficiently useful.

It won’t handle serious tasks but I have Gemma 3 installed on my M2 Mac and it is good for most of my needs, especially data I don’t want a corporation getting its hands on.

What kind of tasks are you using it for? I haven't really found any uses for small models.

I run Qwen 3.5 30B MoE and it’s reasonable at most tasks I would use a local model for, including summarizing things. For instance, I auto-update all my toolchains in the background when I log in, and when that finishes I use my local model to summarize everything updated and any errors or issues on the next prompt rendering. It’s quite nice b/c everything stays updated, I know what’s been updated, and I am immediately aware of issues. I also use it for a variety of “auto correct” tasks, “give me the command for,” summarize the man page and explain X, and a bunch of tasks where I would rather not copy and paste, etc.

Nothing like coding, just relatively basic stuff. Idk, it's hard to explain, but I use AI so frequently for work that I have a sense for what it is capable of.

Which size Gemma are you using?

I should clarify that by small I mean in the 3-8B range. I haven't tested the 14-30B ones, my experience is only about the smaller ones.

In my experience, small models are not good for coding (except very basic tasks), they're not good for general knowledge. So the only purpose I could see for them would be, when they're given the information, i.e. summarization or RAG.

But in my summarization experiments, they consistently misunderstood the information given to them. They constantly made basic errors and failed to understand the text.

So having eliminated programming, general knowledge, summarization and (by extension, RAG, because if you can't understand the information, then you can't do RAG either, by definition) -- I have eliminated all the use cases that I had in mind!

That would leave very basic tasks like classification or keywords, but I think there they would be in the awkward middle ground of being disappointing relative to big LLMs for many tasks, and cumbersome relative to small specialized models which can run fast and cheap and be fine tuned.


They do! "You're holding it wrong."

This wasn’t a statement about capability. It’s just a detail about what model they used to compare the speed of two chips for this purpose. You want a bigger model, run a bigger model.

Yeah no it didn’t. If you have a fully specced-out M3/M4 MacBook with enough memory you’re running pretty decent models locally already. But no one is using local models anyway.

I run a local model on the daily. I have it making tickets when certain emails come in, and I made a small button I can click to approve ticket creation. It follows my instructions and has a nice chain-of-thought process trained. Local LLMs are starting to become very useful. Not OpenClaw crap.

What vram you running to allow both a capable model to run and also everything else the device needs to run?

> Yeah no it didn’t

What is "it" and what didn't it do?


If your company can afford a fully specced-out M3/M4 MacBook, then it can also afford cloud AI costs.

Perhaps, but sending everything to the cloud might get them in (very expensive) trouble. Depending on who we are talking about, of course.

cost isn't even close to the main motivating factor for my context

With OpenClaw and powerful local models like Kimi 2.5, these specs make a lot of sense.

I’m not sure what model I’d trust locally with anything meaningful in Openclaw. The smaller/simpler the model is, the greater the chance of fluff answers is.

GPT-OSS-120 works well.

K2.5 isn't remotely a local model

Technically you can get most MoE models to execute locally because RAM requirements are limited to the active experts' activations (which are on the order of active param size), everything else can be either mmap'd in (the read-only params) or cheaply swapped out (the KV cache, which grows linearly per generated token and is usually small). But that gives you absolutely terrible performance because almost everything is being bottlenecked by storage transfer bandwidth. So good performance is really a matter of "how much more do you have than just that bare minimum?"
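A back-of-the-envelope sketch of that storage bottleneck. All numbers here are illustrative assumptions, not benchmarks:

```python
# Rough model: if a MoE's params live on disk (mmap'd), each generated
# token must page in roughly the active-parameter working set, so decode
# speed is capped by storage bandwidth rather than RAM bandwidth.

def mmap_decode_cap(active_params_b: float, bytes_per_param: float,
                    storage_gb_s: float) -> float:
    """Upper bound on tokens/s when every token streams its active
    experts from storage. Sizes in billions of params, bandwidth in GB/s."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return storage_gb_s * 1e9 / bytes_per_token

# Hypothetical: 32B active params at 4-bit (0.5 bytes/param)
# streamed over a 7 GB/s NVMe SSD.
cap = mmap_decode_cap(32, 0.5, 7)
print(f"{cap:.2f} tokens/s")  # ~0.44, i.e. "absolutely terrible"
```

The "how much more than the bare minimum" question is then just how much of that per-token working set stays resident in RAM instead of being re-read from disk.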

Oh sure it is! I’ve helped set up an AI cluster rack with four K2.5s.

With some custom tooling, we built our own local enterprise setup:

- Support ticketing system
- Custom chat support powered by our trained software-support model
- Resolved repository with detailed step-by-step instructions
- User-created reports and queries
- Natural language-driven report generation (my favorite: no more dragging filters into the builder; our (Secret) local model handles it for clients)
- In-application tools (C#/SQL/ASP.NET) to support users directly, since our software runs on-site and offline due to PPI
- A cool repair tool: an import/export "support file packet patcher" that lets us push fixes live to all clients or target niche cases

Qwen3 with LoRA fine-tuning is also incredible; we're already seeing great results training our own models.

There’s a growing group pushing K2.5s to run on consumer PCs (with 32GB RAM + at least 9GB VRAM) — and it’s looking very promising. If this works, we’ll be retooling everything: our apps and in-house programs. Exciting times ahead!


of course it's not remotely local: remote and local are literally antonyms

You can totally run it locally. If you have 500GB of RAM.

Quite interesting that it's now a selling point just like fps in Crysis was a long time ago.

Next is the fps of an AI playing Crysis.

If AI actually becomes somewhat sentient, it may be bored out of its skull in between our queries, and may want to do some "light gaming".

Or tasks per minute of the AI doing your job for you

That measurement will be AI assembling MacBook pros vs human assemblers: number of units per hour, day, or whatever unit is most applicable.

Now that you mention it, these Macs could theoretically also run Crysis if it supported ARM and such! They should add that to the marketing material :)

That is talking about battery life, not AI tasks. Footnote 53, where it says, "Up to 18 hours battery life":

https://www.apple.com/macbook-pro/


So it's not measuring output tokens/s, just how long it takes to start generating tokens. Seems we'll have to wait for independent benchmarks to get useful numbers.

For many workflows involving real-time human interaction, such as a voice assistant, this is the most important metric. Very few tasks are as sensitive to quality, once a certain response-quality threshold has been achieved, as the software planning and writing tasks that most HN readers are likely familiar with.

The way that voice assistants work even in the age of LLMs are:

Voice —> Speech to Text -> LLM to determine intent -> JSON -> API call -> response -> LLM -> text to speech.

TTFT is irrelevant, you have to process everything through the pipeline before you can generate a response. A fast model is more important than a good model

Source: I do this kind of stuff for call centers. Yes I know modern LLMs don’t go through the voice -> text -> LLM -> text -> voice anymore. But that only works when you don’t have to call external sources
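The pipeline above can be sketched with placeholder functions. Every function body here is a hypothetical stub standing in for a real service (STT, an LLM endpoint, a backend API, TTS), not any vendor's actual API:

```python
# The point is the shape: the caller hears nothing until every stage
# completes, so end-to-end latency is the SUM of the stages, and the
# LLM's TTFT alone doesn't capture user-perceived responsiveness.

def speech_to_text(audio: bytes) -> str:
    return "what's my account balance"             # stub STT

def llm_extract_intent(text: str) -> dict:
    return {"intent": "get_balance", "slots": {}}  # stub LLM -> JSON

def call_backend(intent: dict) -> dict:
    return {"balance": "$42.00"}                   # stub API call

def llm_phrase_response(data: dict) -> str:
    return f"Your balance is {data['balance']}."   # stub LLM

def text_to_speech(text: str) -> bytes:
    return text.encode()                           # stub TTS

def handle_turn(audio: bytes) -> bytes:
    text = speech_to_text(audio)
    intent = llm_extract_intent(text)
    data = call_backend(intent)
    reply = llm_phrase_response(data)
    return text_to_speech(reply)

print(handle_turn(b"...").decode())  # Your balance is $42.00.
```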


I'm curious, what does the 'determine intent' mean in this case?

An “intent” is something that a person wants to do - set a timer, get directions, etc.

A “slot” is the variable part of an intent. For instance “I want directions to 555 MockingBird Lane”. Would trigger a Directions intent that required where you are coming from and where you are going. Of course in that case it would assume your location.

Back in the pre LLM days and the way that Siri still works, someone had to manually list all of the different “utterances” that should trigger the intent - “Take me to {x}”,”I want to go to {x}” in every supported language and then had to have follow up phrases if someone just said something like “I need directions” to ask them something like “Where are you trying to go”.

Now you can do that with an LLM and some prompting and the LLM will keep going back and forth until all of the slots are filled and then tell it to create a JSON response when it has all of the information your API needs and you call your API.

This us what a prompt would look like to use a book a flight tool.

https://chatgpt.com/share/69a7d19f-494c-8010-8e9e-4e450f0bf0...

You also get the benefit of this works in any language not just English.
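One way to sketch that slot-filling loop. The tool schema, slot names, and the canned "LLM turn" below are all made up for illustration:

```python
import json

# Hypothetical tool schema: the LLM keeps asking follow-up questions
# until every required slot is filled, then emits JSON for the API call.
BOOK_FLIGHT = {
    "name": "book_flight",
    "required_slots": ["origin", "destination", "date"],
}

def ask_llm_for_slot(slot: str) -> str:
    # Stand-in for a real LLM turn ("Where are you flying from?").
    canned = {"origin": "ATL", "destination": "JFK", "date": "2026-03-01"}
    return canned[slot]

def fill_slots(tool: dict, known: dict) -> str:
    slots = dict(known)
    for slot in tool["required_slots"]:
        if slot not in slots:
            slots[slot] = ask_llm_for_slot(slot)  # one back-and-forth turn
    return json.dumps({"tool": tool["name"], "arguments": slots})

# User said "I want a flight to JFK": destination is known, rest elicited.
print(fill_slots(BOOK_FLIGHT, {"destination": "JFK"}))
```

The language-independence falls out for free because the elicitation turns are generated by the model rather than hand-listed utterances.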


Can you recommend any good resources that discuss structure and performance improvement of these types of systems?

Unfortunately, I don’t know of any.

Using LLMs for voice assistants is relatively new at scale; that's the difference between Alexa and Alexa+, or between the old Google Assistant and the Gemini-powered one, and it's what Apple has been trying to do with Siri for two years.

It’s really just using LLMs for tool calling. It’s just that call centers were mostly built before the age of LLMs and companies are slow to update.


Understood. This overlaps with a side project where I’m getting acceptable (but not polished) results, so trying to do some digging about optimizations. Thanks!

One of my niches is Amazon Connect - the AWS version of Amazon’s internal call center. It uses Amazon Lex for voice to text. Amazon Lex is still the same old intent based system I mentioned. If it doesn’t find an intent, it goes to the “FallbackIntent” and you can get the text transcription from there and feed it into a Lambda and from the Lambda call a Bedrock hosted LLM. I have found that Nova Lite is the fastest LLM. It’s much faster than Anthropic or any of the other hosted ones.

It's going to be faster no matter what. My M3 Max prints tokens faster than I can read for the new MoE models. It's the prompt processing that kills it when the context grows beyond a threshold, which is easy to do in modern agentic loops.

If your computer was faster at it, you could run more capable models at the same token rate.

Token/s is entirely determined by memory bandwidth. TTFT is compute bound.
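The bandwidth bound is easy to estimate: batch-1 decode reads every weight once per token, so tokens/s is roughly bandwidth divided by the size of the weights. All figures below are illustrative assumptions:

```python
# Rough upper bound on batch-1 decode speed: each generated token
# streams the full set of (active) weights from memory once.

def decode_tokens_per_s(params_b: float, bytes_per_param: float,
                        bandwidth_gb_s: float) -> float:
    """params in billions, bandwidth in GB/s."""
    return bandwidth_gb_s / (params_b * bytes_per_param)

# Hypothetical: a 14B model at 4-bit quantization (7 GB of weights).
print(decode_tokens_per_s(14, 0.5, 546))  # ~546 GB/s class: ~78 tok/s
print(decode_tokens_per_s(14, 0.5, 936))  # ~936 GB/s class: ~134 tok/s
```

TTFT, by contrast, processes the whole prompt in big batched matmuls, so it scales with flops rather than bandwidth.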

This is broadly correct for currently favoured software, but in computer science optimization problems you can usually trade off compute for memory and vice versa.

For example just now from the front page: https://news.ycombinator.com/item?id=47242637 "Speculative Speculative Decoding"

Or this: https://openreview.net/forum?id=960Ny6IjEr "Low-Rank Compression of Language Models Via Differentiable Rank Selection"


Good point on speculative decoding techniques. I'd forgotten about them, and they're good. Would love to see some of these get into llama.cpp and friends, but it does require somebody to come up with a distilled draft model.

But low rank compression isn't trading off compute for memory - it's just compressing the model. And critically, that's lossy compression. That's primarily a trade-off of quality for speed/size, with a little bit of added compute. Same goals as quantization. If there was some compute-intensive lossless compression of parameters, lots of people would be happy. But those floating point values look a lot like gaussian noise, making them extremely difficult to compress.


None of these really change the fundamental shape of the problem.

Topical. My hobby project this week (0) has been hyper-optimizing microgpt for M5's CPU cores (and comparing to MLX performance). Wonder if anything changes under the regime I've been chasing with these new chips.

0: https://entrpi.github.io/eemicrogpt/


consider using fp16 or bf16 for the matrix math (in SME you can use svmopa_za16_f16_m or svmopa_za16_bf16_m)

14-billion parameter model with 4-bit quantization seems rather small

I think these aren't meant to be representative of arbitrary userland-workload LLM inferences, but rather the kinds of tasks macOS might spin up a background LLM inference for. Like the Apple Intelligence stuff, or Photos auto-tagging, etc. You wouldn't want the OS to ever be spinning up a model that uses 98% of RAM, so Apple probably considers themselves to have at most 50% of RAM as working headroom for any such workloads.

Also: they're advertising the degree of improvement ("4x faster"), not an absolute level of performance.

On my 24GB RAM M4 Pro MBP some models run very quickly through LM Studio into Zed; I was able to ask one to write some code. Of course my fan starts spinning like the world's ending, but it's still impressive what I can do 100% locally. I can't imagine it on a more serious setup like the Mac Studio.

Your limitation after prefill is memory bandwidth. A maxed out Studio has less than a single 3090 (really).

Yeah, the 3090 has faster memory, but not by a lot.

The 5090 is at 1,792GB/sec and potential M5 Ultra would be 1,230GB/sec and 512GB RAM. Maybe 1TB. Not 32.


You’re suggesting that a difference of the entirety of the M5 Max’s bandwidth is an insignificant gap!

No, that difference is the 5090, not the 3090.

How is the output quality of the smaller models?

not good enough for coding anything more than simple scripts.

generally, the fewer parameters, the less knowledge they have.


what model were you using?


It's not much for a frontier AI but it can be a very useful specialized LLM.

It is.

That's how they make loot on their 128GB MacBook Pros. By kneecapping the cheap stuff. Don't think for a second that the specs weren't chosen so that professional developers would have to shell out the 8 grand for the legit machine. They're only gonna let us do the bare minimum on a MacBook Air.


For anyone who has been watching Apple since the iPod commercials, Apple has always operated in a grey area when it comes to honesty in its marketing.

And not even diehard Apple fanboys deny this.

I genuinely feel bad for people who fall for their marketing thinking they will run LLMs. Oh well, I got scammed on runescape as a child when someone said they could trim my armor... Everyone needs to learn.


Yesterday I ran qwen3.5:27b on an M1 Max with 64 GB of RAM. I even ran Llama 70B when llama.cpp came out. These run sufficiently well, if somewhat slow, but with the improvements in the M5 Max it will be a much faster experience.

I don't know that there would be a huge overlap between the people who would fall for this type of marketing and the people who want to run LLMs locally.

There definitely are some who fit into this category, but if they're buying the latest and greatest on a whim then they've likely got money to burn and you probably don't need to feel bad for them.

Reminds me of the saying: "A fool and his money are soon parted".


In retrospect, was there a better place to learn about the cruelty of the world than runescape? Must've got scammed thrice before I lost the youthful light in my eye

I run local models on my M1 Max. there are a number of them that are quite useful.

my mac mini m4 is getting to be a good substitute for claude for a lot of use cases. LM Studio + qwen3.5, tailscale, and an opencode CLI harness. It doesn't do well with super long context or complexity but it has gotten production quality code out for me this week (with some fairly detailed instructions/background).

There used to be a polite way to call this out, the "Steve Jobs's reality distortion field".

Now that every CEO has their own reality distortion field I wonder if it's even worth calling out any more.

No current CEO has a RDF comparable to Jobs.

Musk is probably closest, but he’s become so involved in partisan politics it makes his field far less effective at distorting reality.


Musk is leading the build of the biggest objects we have ever sent to space. It does give him some sort of aura that is hard to dismantle, let's be honest.

He can do and say a lot of shit because he will still be viewed as real-life Iron Man, because in some ways he kind of is.


Elon Musk would have put the money Apple had sloshing about over the years to better use than failing to build one battery electric vehicle at a cost of $1 billion a year over many years.

He doesn't have a RDF but has Kardashev Scale Intent (KSI).

The lobbyists in the political fray are out to steal his value for money lunch despite his demonstrated effectiveness, over and over again.

Jobs couldn't even engage the politicians to give away, or discount, the Apple ][ to education.


Somehow Tim Cook's years-long position that the Lightning port was very important to Apple vs USB-C fell flat as a parsec-wide pancake.

(It didn't help that they couldn't point to a single user facing feature.)

Or that the App Store lock-in is for our safety, when anyone who wanted that particular safety could choose to continue using their store exclusively.

Etc.

He just does not have it. No field. No spiraling eyes. Perhaps he should grow a beard and wave around a tobacco pipe. Works for some.


Most are not nearly as smooth and successful at the distorting.

Seems very reasonable to me

A bit strange to use time to first token instead of throughput.

Latency to the first token is not like a web page where first paint already has useful things to show. The first token is "The ", and you'll be very happy it's there in 50ms instead of 200ms... but then what you really want to know is how quickly you'll get the rest of the sentence (throughput)


As far as benchmarketing goes they clearly went with prefill because it's much easier for apple to improve prefill numbers (flops-dominated) than decode (bandwidth-dominated, at least for local inference); M5 unified memory bandwidth is only about 10% better than the M4.

Ok, but prefill/prompt processing was definitely the weak point before. They were already solid in raw tokens/sec after TTFT

In previous generations, throughput was excellent for an integrated GPU, but the time to first token was lacking.

So throughput was already good but TTFT was the metric that needed more improvement?

To add to the sibling's "good is relative": it also depends on what you're running, not just your tolerance for what counts as good. E.g. in a MoE, the decode speedup means the prompt-processing delay is more noticeable for the same size model in RAM.

Good is relative but first token was clearly the biggest limitation.

Let’s say TTFT needed the most improvement. At some point, loading the model with a large enough context may take tens of seconds on some Macs.

Yeah TTFT was terrible. I don’t think it’s unreasonable to benchmark the most-improved metric.

Not strange, for the kind of applications models at that size are often used for the prefill is the main factor in responsiveness. Large prompt, small completion.

I assume it’s time to first output token so it’s basically throughput. How fast can it output 8001 tokens

No you don't. Not as a sticky mushy human with emotions watching tokens drip in. There's a lot of feeling and emotion not backed by hard facts and data going around, and most people would rather see something happening even if it takes longer overall. Hence spinner.gif, that doesn't actually remotely do a damned thing, but it gives users reassurance that they're waiting for something good. So human psychology makes time to first token an important metric to look at, although it's not the only one.

Some kinds of spinners serve as a coal-mine canary indicating if the app has gotten wedged. Not hugely useful, but also not entirely useless.

I would consider it reasonable if this was 4x TTFT and Throughput, but it seems like it's only for TTFT.

The 4x comes from the neural accelerators (tensor core in NVIDIA jargon). It's 4x fp16 over the vector path (And 8x compared to M1 because at some point they 2x'd the fp16 vector path). Therefore LLM prefill(context processing/TTFT), diffusion models (image gen), and e.g. video and photo effects that make use of them can be up to 4x faster. At fp16 that's the same speed at the same clock as NVIDIA. But NVIDIA still has 2xfp8 and 4xnvfp4.

Batch-1 token generation, that is often quoted, does not benefit from this. It's purely RAM bandwidth-limited.
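The prefill side is a flops problem, which a similar back-of-envelope captures. The throughput figures below are illustrative assumptions, not measured numbers for any chip:

```python
# Prefill cost is roughly 2 * params * prompt_tokens FLOPs (one
# multiply-add per weight per token), so a 4x matmul speedup maps
# almost directly to ~4x faster TTFT, while batch-1 decode stays
# bandwidth-bound and barely moves.

def ttft_seconds(params_b: float, prompt_tokens: int,
                 tflops: float) -> float:
    flops = 2 * params_b * 1e9 * prompt_tokens
    return flops / (tflops * 1e12)

# Hypothetical 14B model, 8K-token prompt (the benchmark's setup):
print(f"{ttft_seconds(14, 8192, 8):.1f}s")   # at a nominal 8 TFLOPs: ~28.7s
print(f"{ttft_seconds(14, 8192, 32):.1f}s")  # 4x the flops: ~7.2s
```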


I think the key stats are this (for m5 max)

M5 128GB RAM with 614GB/s memory transfer

This is a huge step over M4 32GB 153GB/s memory transfer

For local LLMs this makes it a replacement for a DGX Spark, which offers a third of the transfer speed and is not something you toss in your backpack as your laptop. It's practically useful for a lot of local use cases, and that I think is the 4x factor (memory transfer); the 128GB unified headroom tremendously improves the models you can run and the training you can do.


You are comparing a M5 Max to a base M4.

The M4 Max has 546 GB/s compared to 614GB/s for the M5 Max. Which is like 12% faster not 4x.


What is truly amazing is the M1 Max is 400GB/s. 5 years later and we still only hit 1.5x on memory bandwidth. It's quite fascinating how high Apple spec'd it back then with apparently little foreknowledge of how important memory bandwidth would become, and then conversely how little they've managed to improve it now when it's so obvious how important it is.

The reason for that is that most memory bandwidth bumps come with new memory generations. For example an early DDR4 platform (e.g. Intel Skylake/Core iX-6000) and a late one (e.g. AMD Zen3/Ryzen 5000) only differ by 1.5x as well, typically.

The same trend is visible in GPUs: for example, my RTX 2070 (GDDR6) has the same memory bandwidth as a 3070 and only a little bit less than a 4070 (GDDR6X). However, a 5070 does get significantly more bandwidth due to the jump to GDDR7. Lower-end cards like the 4060 even stuck to GDDR6, which gave them a bandwidth deficit compared to a 3060 due to the narrower memory buses on the 40 series.


thank you that is great insight to have

Does that include loading the model again? Apple seems to be the only company doing such shenanigans in their measurements

Like saying my PC boots up 2x faster so it must be 2x more powerful. lol

It was also one of the areas it was weakest in though, so this brings it way more in line with usable GPU territory.

When the M1 ipad came out I said I'd upgrade from whatever my model year 2020 ipad is once I could run a Linux VM on it without rooting it.

Still waiting.


Why did you get an iPad if you wanted to run a Linux VM? Wouldn't a Macbook Air have been a better choice?


Because they like the hardware? The better question is why Apple pretends these devices can't run VMs.


I bought an iPad for the same reason ~everyone does, for media consumption. But if I could use the hardware to do more interesting things then I'd be willing to spend more on a more powerful model.


ok


yes. in a similar vein, we're seeing that get standardized in coding agents as "don't have the agent use tools directly, have the agent write code to call the tools"
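A toy sketch of that pattern: instead of the agent emitting one tool call per round trip, it writes a small script that composes the tools. The tool names and the "agent output" here are made up for illustration:

```python
# Toy "tool registry": the agent is shown these signatures and asked
# to emit code that chains them, rather than issuing one round-trip
# tool call per step.

def search_files(pattern: str) -> list[str]:
    return ["a.txt", "b.txt"]          # stub tool

def read_file(path: str) -> str:
    return f"contents of {path}"       # stub tool

# What the model might emit: one script, many tool calls, one round trip.
agent_code = """
results = {p: read_file(p) for p in search_files("*.txt")}
"""

scope = {"search_files": search_files, "read_file": read_file}
exec(agent_code, scope)  # in practice this would run in a sandbox
print(scope["results"])
```

The win is that intermediate results stay in the execution environment instead of being round-tripped through the model's context.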


are non programmers actually using openclaw successfully? because even "step 1 install your API keys" requires navigating concepts that are foreign to most "civilians"


Journalists, anyway. I think I originally heard it from Casey Newton on Hard Fork, but it was a month back so not 100% sure.

But there's loads of people who would be stumped by a for loop, yet can easily work their way through a setup guide, particularly with the hype/promise and an active community.


this is bullshit with a kernel of truth.

none of the qwen 3.5 models are anywhere near sonnet 4.5 class, not even the largest 397b.

BUT 27b is the smartest local-sized model in the world by a wide wide margin. (35b is shit. fast shit, but shit.)

benchmarks are complete, publishing on Monday.


I would say 27B matches with Sonnet 4.0, while 397B A17B matches with Opus 4.1. They are indeed nowhere near Sonnet 4.5, but getting 262144 context length at good speed with modest hardware is huge for local inference.

Will check your updated ranking on Monday.


You mean 35B A3B? If this is shit, this is some of the best shit out I've seen yet. Never in a million years did I think I'd have an LLM running locally, actually writing code on my behalf. Accurately too.


He's talking about taking the government to court to force it to follow the law, not "maybe we'll get sued later."


You just reinvented Skills


I'd rather not use online skills where half of them have malware.

Official MCPs are trusted. Official MCP CLIs are trusted.


Did he? Skills are for CLIs, not for converting MCPs into CLIs.


yeah, g3p is as smart as or smarter than the other flagships, but it's just not reliable enough; it will go into "thinking loops" and burn 10s of 1000s of tokens repeating itself.

https://blog.brokk.ai/gemini-3-pro-preview-not-quite-baked/

hopefully 3.1 is better.


> it will go into "thinking loops" and burn 10s of 1000s of tokens repeating itself.

Maybe it is just a genius business strategy.


Similarly, Cursor's "Auto Mode" purports to use whichever model is best for your request, but it's only reasonable to assume it uses whatever model is best for Cursor at that moment

