Memory bandwidth is quite low for the price. $3,000 for 128GB @ 273GB/s.
You can buy a Mac Studio with an M4 Max for $3,500. 128GB unified memory @ 546 GB/s.
You're also getting a much faster CPU and more usable daily computer.
I suppose if you're a CUDA developer, this thing is probably better, though I doubt you'd be training anything worthwhile on a machine this weak. Nvidia advertises the DGX Spark as mimicking very large DGX clusters, so the environment is the same. But in terms of hardware specs, it's very disappointing for $3,000.
DGX Station is another beast. It's Blackwell Ultra with 288GB HBM3e and 496GB LPDDR5X. I'm guessing $150k - $200k.
I love MacBooks and Mac machines, but they aren't cut out for specialised work.
I have an M4 Max with 128 GB RAM. If I play Civ VI (a game released in 2016) on it for a few hours without limiting the FPS, it will heat up until it turns itself off.
It's not cut out for sustained heavy loads like gaming or crypto mining. It's cut out for bursty heavy loads like an Xcode compile for a few minutes, then back to editing text.
My gaming machine, which has poorer performance (in FPS terms) and is equipped with an AMD 3700 and a 2080 Super, can play Civ VI indefinitely without breaking a sweat.
Couple of things:
This must be new to the M4? I've played Civ VI for hours and hours on an M1 Max without issue.
If someone is looking at this Nvidia box, they're likely fine with a desktop footprint, in which case they'd be looking at the Mac Studio which should not have any thermal issues whatsoever. I'm guessing you're on a laptop?
If they are insistent on a laptop format, you can alleviate overheating issues pretty easily with some thermal pads and running the laptop on a cooling base when you're running heavy operations:
That's pretty surprising and breaks my mental model of how these chips perform. Do you think that's because they don't have the raw FLOPs, something inefficient in the Apple/Metal rendering pipeline, something about Civ VI and how it was converted (i.e. x86 or DirectX emulation), or something else entirely?
I've had no trouble with a (base) M4 mini for regular dev work, though I compile remotely and haven't played any games on it.
FWIW, I also just started seeing comments today from other devs on my team about their M4 minis frequently thermally shutting down during compile runs in Android Studio.
Did you change the fan settings? Default is for fans to be quiet. If you go to energy management in settings you can choose “more power” or something instead of automatic. Your Mac will be louder but throttle less.
I’m curious if Civilization VII has the same issue. Baldur’s Gate 3 was released around the same timeframe as Civ 6 and has some pathological behavior on higher-core-count machines.
Only looking at memory bandwidth as a measure of performance gives you an incomplete picture. You also need to look at how much of that bandwidth your processor (CPU, GPU, NPU, etc.) can actually consume, because it can be far less than the memory modules are capable of.
You can also get an Epyc 9115 for $800, a motherboard for $640, and twelve 16 GiB DDR5-6400 DIMMs for $1,400. That gives you 614.4 GB/s for around $2,800. You may also want to add a small GPU to do prompt processing (inference on a CPU is memory-bandwidth bound; prompt processing is compute bound).
I was going by the number of memory channels the CPU spec says it supports (12). But apparently I was wrong, as that gets bottlenecked by the number of CCDs on the chip. In which case you would need a much higher-end Epyc processor, and then there are other limits. So much for napkin math.
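The napkin math above can be sketched like this. To be clear, this only gives the theoretical peak of the memory modules; as the follow-up notes, the achievable figure is further capped by the CPU fabric (e.g. CCD count on Epyc), so treat it as an upper bound:

```python
# Napkin math: theoretical peak DDR5 bandwidth.
# Each channel moves 8 bytes per transfer (64-bit bus), at the rated
# mega-transfers per second. Real-world throughput will be lower.

def ddr5_peak_gb_s(channels: int, mt_per_s: int, bus_bytes: int = 8) -> float:
    """Peak GB/s = channels * transfers/s * bytes per transfer."""
    return channels * mt_per_s * bus_bytes / 1000  # MT/s * bytes -> GB/s

print(ddr5_peak_gb_s(12, 6400))  # 614.4, matching the figure above
```

The same formula explains the Mac Studio number: Apple's wide unified-memory bus is equivalent to many channels running in parallel.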
The M series CPUs have very good memory bandwidth and capacity, which lets them load the billions of weights of a large LLM quickly.
Because the bottleneck to producing a single token is typically the time taken to get the weights into the FPU, Macs perform very well at producing additional tokens.
Producing the first token means processing the entire prompt first. With the prompt, you don't need to process one token before moving on to the next, because they are all given to you at once. That means loading the weights into the FPU only once for the entire prompt, rather than once for every token. So the bottleneck isn't the time to get the weights to the FPU; it's the time taken to process the tokens.
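A rough sketch of the decode-side argument (my illustrative numbers, not from the thread): since every weight must stream into the FPU once per generated token, generation speed is bounded by memory bandwidth divided by the bytes of weights read per token:

```python
# Bandwidth-bound decode: upper bound on tokens/s when generating one
# token at a time. Assumed example figures, for illustration only.

def decode_tokens_per_s(bandwidth_gb_s: float, params_b: float,
                        bytes_per_param: float) -> float:
    weight_gb = params_b * bytes_per_param  # GB of weights read per token
    return bandwidth_gb_s / weight_gb

# e.g. a 70B-parameter model at 4-bit (~0.5 bytes/param) on a
# 546 GB/s unified-memory machine:
print(round(decode_tokens_per_s(546, 70, 0.5), 1))  # 15.6 tokens/s, best case
```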
Macs have comparatively low compute performance (M4 Max runs at about 1/4 the FP16 speed of the small nvidia box in this article, which itself is roughly 1/4 the speed of a 5090 GPU).
Next token is mostly bandwidth bound, prefill/ingest can process tokens in parallel and starts becoming more compute heavy. Next token(s) with speculative decode/draft model also becomes compute heavy since it processes several in parallel and only rolls back on mispredict.
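The prefill-vs-decode contrast above can be put in numbers. This is a sketch under assumed figures (the TFLOPS and bandwidth values are placeholders I chose for illustration, not measured specs); a dense transformer does roughly 2 * params FLOPs per token, and prefill processes prompt tokens in parallel, so it hits the compute ceiling rather than the bandwidth ceiling:

```python
# Roofline-style sketch: prefill is compute bound, decode is bandwidth bound.
# All hardware numbers below are assumed placeholders for illustration.

def prefill_tokens_per_s(fp16_tflops: float, params_b: float) -> float:
    # ~2 FLOPs per parameter per token for a dense transformer forward pass
    return fp16_tflops * 1e12 / (2 * params_b * 1e9)

def decode_cap_tokens_per_s(bandwidth_gb_s: float, weight_gb: float) -> float:
    # every weight byte streams through once per generated token
    return bandwidth_gb_s / weight_gb

# Assume ~34 TFLOPS FP16 and 546 GB/s; 70B params quantized to ~35 GB:
print(round(prefill_tokens_per_s(34, 70)))      # ~243 prompt tokens/s
print(round(decode_cap_tokens_per_s(546, 35)))  # ~16 generated tokens/s
```

This is why a box with 4x the FLOPs but similar bandwidth mostly speeds up prompt ingestion (and speculative decode), not plain next-token generation.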