Technically true, but if we're talking about local models, overwhelmingly you're gonna be bandwidth bound. You need about 2 flops per active parameter per token. An M5 chip has what, 150-200GB/s of bandwidth? But it can easily do something like 16 tflops of fp16, so you're talking like 100 flops per byte of bandwidth. Which is just to say that in a batch=1 scenario, i.e. one user, you're only gonna use a few % of the GPU while you've totally saturated your memory bandwidth. For all practical purposes at the consumer level, take your memory bandwidth, divide by the size of the model, and that gives you the max tok/s throughput you're gonna get.
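A minimal back-of-envelope sketch of that rule, with purely illustrative numbers (the bandwidth and model size are assumptions, not measured specs):

    # Rough decode-throughput ceiling for batch=1 local inference,
    # assuming generation is memory-bandwidth bound: every generated
    # token streams all active weights through memory once.
    bandwidth_gb_s = 150        # assumed unified-memory bandwidth, GB/s
    active_params_b = 8         # assumed active parameters, billions
    bytes_per_param = 2         # fp16/bf16 weights

    model_gb = active_params_b * bytes_per_param
    max_tok_s = bandwidth_gb_s / model_gb
    print(f"~{max_tok_s:.0f} tok/s upper bound")  # ~9 tok/s with these numbers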
Even a 5090 has something like 50-60 flops per byte of bandwidth; you just can't saturate the compute without running large batches. (At least for token generation; prefill is obviously more compute bound.)
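Same style of sketch for the compute side; the tflops and bandwidth figures are assumed round numbers, just to show why batch=1 decode only touches a few percent of the ALUs:

    # Compute-to-bandwidth ratio (flops the chip can do per byte it can
    # move) vs. what batch=1 fp16 decode actually demands.
    peak_tflops = 16            # assumed fp16 compute, TFLOP/s
    bandwidth_gb_s = 150        # assumed memory bandwidth, GB/s

    flops_per_byte = peak_tflops * 1e12 / (bandwidth_gb_s * 1e9)  # ~107
    # ~2 flops per active param per token, 2 bytes per fp16 param
    # -> ~1 flop demanded per weight byte read, per sequence.
    flops_demanded_per_byte = 1

    # Each extra concurrent sequence reuses the same weight bytes, so this
    # is roughly the batch size needed before you become compute bound.
    batch_to_saturate = flops_per_byte / flops_demanded_per_byte
    print(f"{flops_per_byte:.0f} flops/byte available; "
          f"need batch ~{batch_to_saturate:.0f} to be compute bound")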
Anyway, there are a few models that are freely distributable and that can reasonably run on consumer-grade local hardware.
It changes a number of things. Not all tasks require very high intelligence, but a lot of data may be sensitive enough that you don't want to share it with a third party.
That's what's different about this one. "Enter the Ryzen 9 9950X3D2 Dual Edition, a mouthful of a chip that includes 64MB of 3D V-Cache on both processor dies, without the hybrid arrangement that has defined the other chips up until now."
I'm assuming he's in some sort of high-end communal housing, a trend that began emerging in SF ~15 years back ... i.e. where multi-millionaire startup founders and the like choose it on purpose for the synergistic benefits.
Not sure what you mean, but I’d never heard of Sarah Paine before that. I thought she gave a very concise yet nuanced view of the modern world order in her lectures for Dwarkesh.