I currently run qwen3.5-122B (Q4) on a Strix Halo (Bosgame M5) and am pretty happy with it. It's obviously much slower than hosted models: I get ~20 t/s with an empty context and drop to about 14 t/s with 100k of context filled.
No tuning at all, just apt install rocm and rebuilding llama.cpp every week or so.
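For reference, a setup like the one described above can be reproduced roughly as follows. This is a sketch, not an exact recipe: the repo URL and the `GGML_HIP` CMake flag reflect recent llama.cpp, and the model path is a placeholder.

```shell
# install ROCm from the distro repos (Debian/Ubuntu)
sudo apt install rocm

# build llama.cpp with the HIP (ROCm) backend
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_HIP=ON
cmake --build build -j

# measure prompt-processing and generation speed for a GGUF model
./build/bin/llama-bench -m /path/to/model-Q4_K_M.gguf
```

Re-running the `git pull` / `cmake --build` steps is all the "rebuilding every week" amounts to; the generation-speed drop with long context shows up if you benchmark at different context depths.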
The privacy/data-security angle really is important in some regions and industries. Think European privacy laws, or customers demanding NDAs. In both cases hosted Anthropic and OpenAI offerings are worth zero, so they are easy to beat despite local models being dumber and slower.
1. Gemma-4 we re-uploaded 4 times - 3 of those were for 10-20 llama.cpp bug fixes, and we had to notify people to download the corrected ones. The 4th was an official Gemma chat template improvement from Google themselves.
2. Qwen3.5 - we shared our 7TB of research artifacts showing which layers not to quantize. All providers' quants were under-optimized, not broken - the ssm_out and other ssm_* tensors were the issue - and we're now the best in terms of KLD and disk space.
3. MiniMax 2.7 - we swiftly fixed a NaN PPL issue. We found it in all quants regardless of provider, so it affected everyone, not just us. We wrote a post on it and fixed it; others have taken our fix and repaired their quants, while some still haven't updated.
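For readers unfamiliar with the KLD metric mentioned in point 2: quant quality is often scored as the KL divergence between the full-precision model's next-token distribution and the quantized model's, averaged over a corpus. A minimal sketch of that computation on raw logits (toy data here; a real evaluation runs both models over actual text):

```python
import numpy as np

def kld(logits_ref, logits_quant):
    """Mean per-token KL divergence D(P_ref || P_quant) from raw logits."""
    def softmax(x):
        # subtract the row max for numerical stability
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)
    p = softmax(logits_ref)
    q = softmax(logits_quant)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

# toy check: identical logits give zero divergence,
# perturbed logits (simulated quantization error) give a positive one
rng = np.random.default_rng(0)
ref = rng.normal(size=(4, 32))  # 4 "tokens", 32-entry vocab
assert kld(ref, ref) < 1e-9
assert kld(ref, ref + rng.normal(scale=0.1, size=ref.shape)) > 0
```

Lower mean KLD at the same file size means a better quant, which is why selectively keeping sensitive tensors (like the ssm_* ones above) at higher precision can beat a uniform Q4 of every layer.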
Note we also fixed bugs in many OSS models like Gemma 1, Gemma 3, Llama chat template fixes, Mistral, and many more.
Unfortunately quants sometimes break, but we fix them quickly, and 95% of the time the cause is out of our hands.
We fix them swiftly and write up blog posts on what happened. Other providers then simply take our blogs and re-apply our fixes.
Fair enough, appreciate the detailed response! Can you elaborate why other quantizations weren't affected (e.g. bartowski)? Simply because they were straight Q4 etc. for every layer?
Thanks for all the amazing work Daniel. I remember you guys being late to OH because you were working on weights released the night before - and it's great to see you guys keep up the speed!
To add to that, the current take that the US could just walk away from the conflict is incredibly naive - Iran will decide when this is over, and it won't be before the November elections. Before the US attacked, blocking the strait was only a potential, now Trump gave Iran the chance to prove that they are capable of doing it. And why on earth would Iran now give that away for free?
In Germany there was zero investment in the electric infrastructure, but the power allowed to flow from the panels into the grid is currently limited to 800 W for this type of system. Seems to work fine. Larger systems still need a license.
They paid about $10B for inference and had about $10B in revenue in 2025. The user counts and the number of zeroes on those figures are not relevant; what matters is their ratio. They apparently are not even profitable on inference, which is the cheap part of the whole business.
And the cost of inference tripled from $3B in 2024 to $10B in 2025, so the cost of revenue grows linearly with the number of users, i.e. it does not get cheaper.
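The ratio argument in numbers, using exactly the figures cited above:

```python
# reported figures from the comment above (approximate)
revenue_2025 = 10e9      # ~$10B revenue in 2025
inference_2024 = 3e9     # ~$3B inference cost in 2024
inference_2025 = 10e9    # ~$10B inference cost in 2025

# gross margin on inference alone, before any R&D or training spend
gross_margin = (revenue_2025 - inference_2025) / revenue_2025
print(f"inference gross margin: {gross_margin:.0%}")  # 0%

# year-over-year growth of the cost of revenue
print(f"inference cost growth: {inference_2025 / inference_2024:.1f}x")  # 3.3x
```

A ~0% gross margin on the cheapest part of the business, with that cost tripling year over year, is the core of the argument; the absolute dollar amounts don't change it.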
Of course they bundle R&D with inference pricing - how else could they recoup that investment?
The interesting question is: In what scenario do you see any of the players as being able to stop spending ungodly amounts for R&D and hardware without losing out to the competitors?
But only if you ignore all the other market participants, right? How can we ever reach a point where the smaller competitors, e.g. the Chinese labs perpetually trailing SOTA with a ~9 month lag but at a tiny fraction of the cost, stop existing?
I mean, we just have to look at old discussions about Uber for the exact same arguments. Uber, after all these years, is still at a negative 10% lifetime ROI, and that company doesn't even have to meaningfully invest in hardware.
IMO this will probably develop like the railroad boom in the first half of the 19th century: All the AI-only first movers like OpenAI and Anthropic will go bust, just like most railroad companies who laid the tracks, because they can't escape the training treadmill. But the tech itself will stay, and even become a meaningful productivity booster over the next decades.
I am also thinking long term: where is the moat if this inevitably leads to price competition? It's not like a Microsoft product suite that your whole company is tied into in multiple ways - LLMs can be swapped for another quite easily.
> there was a single brain region where we saw that higher cannabis use was actually associated with lower brain volume – the posterior cingulate, which is part of the limbic system and is implicated in processes like memory, learning, and emotion. That said, some research suggests smaller posterior cingulate volume is actually associated with better working memory, so it’s a little unclear what this means.