Readerium's comments | Hacker News

AI workloads are all about memory size and bandwidth, not compute

LLMs are memory-bandwidth bound, not compute bound.

LLMs are bound by both; which factor dominates depends on the hardware.

Technically true, but if we're talking about local models, you're overwhelmingly going to be bandwidth bound. You need about 2 flops per active parameter per token. An M5 chip has what, 150-200GB/s of bandwidth? But it can easily do something like 16 tflops of fp16, so you're talking like 100 flops per byte of bandwidth. Which is just to say that in a batch=1 scenario, i.e. one user, you're only gonna use a few % of the GPU while you've totally saturated your memory bandwidth. For all practical purposes at the consumer level, take your memory bandwidth, divide by the size of the model, and that gives you the max tok/s throughput you're gonna get.
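The back-of-envelope rule in the comment above can be sketched in a few lines. The numbers below are illustrative assumptions (an ~200 GB/s machine and an ~8 GB quantized model), not measured specs:

```python
# Bandwidth-bound decode: every token must stream all active weights from
# memory, so throughput tops out at (memory bandwidth) / (model size).
def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

# Assumed figures: ~200 GB/s of bandwidth, ~8 GB of quantized weights.
print(max_tokens_per_sec(200, 8))  # -> 25.0 tok/s ceiling
```

Real throughput lands below this ceiling (KV-cache reads, overhead), but it's a decent first-order estimate for batch=1 local inference.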

Even a 5090 has something like 50-60 flops per byte of bandwidth; you just can't saturate the compute without running large batches. (At least for decode; prefill is obviously more compute bound.)
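The arithmetic-intensity argument can be made concrete. All figures here are illustrative assumptions (roughly 100 TFLOPs of fp16 compute and ~1.8 TB/s of bandwidth for a 5090-class GPU, 1-byte quantized weights, ~2 flops per weight per token):

```python
# Hardware arithmetic intensity: peak flops available per byte of memory
# bandwidth. Decode only reaches compute-bound territory when the batch
# supplies enough flops per streamed weight byte.
def flops_per_byte(peak_tflops: float, bandwidth_gb_s: float) -> float:
    return (peak_tflops * 1e12) / (bandwidth_gb_s * 1e9)

def min_batch_to_saturate(peak_tflops: float, bandwidth_gb_s: float,
                          bytes_per_weight: float = 1.0) -> float:
    # At batch B, each weight byte streamed serves ~2*B/bytes_per_weight
    # flops; compute saturates once that matches the hardware ratio.
    return flops_per_byte(peak_tflops, bandwidth_gb_s) * bytes_per_weight / 2

print(round(flops_per_byte(100, 1800), 1))      # -> 55.6 flops/byte
print(round(min_batch_to_saturate(100, 1800)))  # -> 28 concurrent requests
```

So at batch=1 you're using a small fraction of the compute, which is why single-user local inference stays bandwidth bound.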


This is incorrect; prompt processing is compute bound.

This is only true for some parts of the time cost function.

Add an iPad-mini-esque screen on the trackpad of a MacBook.

The trackpad of a 16-inch MacBook Pro is humongous anyway.

Add a touchscreen display to the trackpad, and give it iPadOS.


That is true. GGUF does not support every architecture.

For the most recent example, as of April 16, 2026 (today), Turboquant still isn't added to GGUF.


Perhaps increasing repetition_penalty might be helpful.
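For context, repetition_penalty reshapes next-token logits so tokens the model has already emitted become less likely. A minimal sketch of the commonly used CTRL-style rule (function and variable names here are illustrative, not any particular library's API):

```python
# Penalize tokens that already appeared in the generated sequence.
# Positive logits are divided by the penalty, negative ones multiplied,
# so a seen token's probability drops regardless of its logit's sign.
def apply_repetition_penalty(logits, seen_token_ids, penalty=1.3):
    out = list(logits)
    for t in set(seen_token_ids):
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out

print(apply_repetition_penalty([2.0, -1.0, 0.5], [0, 1], penalty=2.0))
# -> [1.0, -2.0, 0.5]  (token 2 untouched, tokens 0 and 1 penalized)
```

Values slightly above 1.0 (e.g. 1.1-1.3) are typical; very large penalties can degrade coherence.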


In coding they are worse.

Chinese models (GLM, MiniMax) are better.


Anyway, there are a few models that are freely distributable, and that can reasonably run on consumer-grade local hardware.

It changes a number of things. Not all tasks require very high intelligence, but a lot of data may be sensitive enough to avoid sharing it with a third party.


Can someone explain if the 3D V-Cache dies are stacked on top of each other or placed side by side?

If they are stacked, then why not 9800X3D2?


The 99xx chips have two CPU dies, and one cache die is on each CPU die.


The 3D V-Cache sits underneath only one of the CCDs. See https://en.wikipedia.org/wiki/Ryzen#Ryzen_9000.


That's what's different about this one. "Enter the Ryzen 9 9950X3D2 Dual Edition, a mouthful of a chip that includes 64MB of 3D V-Cache on both processor dies, without the hybrid arrangement that has defined the other chips up until now."


Did you forget which thread we are on?


Oh heh, I thought they were asking about the X3D. My bad ><.


Qwen 3.5 4B is the goat then


Yes, he is roommates with them.


Love the interviews Dwarkesh sponsored with Sarah Paine from the Naval War College.

Also, somewhat spitefully, find it funny that he has multiple roommates.


I'm assuming he's in some sort of high-end communal housing, a trend that began emerging in SF ~15 years back; i.e. where multi-millionaire startup founders and the like choose it on purpose for the synergistic benefits.


Those ones were a bit on the nose, no?


Not sure what you mean, but I’d never heard of Sarah Paine before that. I thought she gave a very concise yet nuanced view of the modern world order in her lectures for Dwarkesh.


How so? I enjoyed them, keeping in mind that the lecturer was a professor at a US naval war college.


Would you like to know MORE?


You are pattern matching to something that doesn't really fit, I think.


99.999 percent lol

