Claude tends to default to and do best with first making ASCII diagrams in markdown files, which you can then ask it to translate into Mermaid if appropriate.
Prompts like, "Please write a comprehensive report on _____ to work with _____. Include a holistic report on architecture and meaning and purpose of all involved systems. Describe the why and how of the changes in depth and include a full glossary of terms and systems.
Write as a new .md in docs when you are sure there are no major gaps in your understanding. Include a report on the plan to _____."
Will be the rough shape you want to get it to dig through all the relevant code and make relevant architectural diagrams. Guide more or less towards specifics as appropriate. This has worked well since Opus 4.5.
Yeah, agree. This is the sort of thing you release to build brand awareness and either offer a hosted option as a bonus or integrate into a larger stack. It is not the product. Someone will just make a better OSS option if they don't do it themselves.
Unless you're looking at something like a pass@100 benchmark, the benchmarks are confounded heavily by a likelihood of a "golden path" retrieval within their capabilities. This is on top of uncertainties like how well your task within a domain maps to the relevant test sets, as well as factors like context fullness and context complexity (heavy list of relevant complex instructions can weigh on capabilities in different ways than e.g. having a history where there's prior unrelated tasks still in context).
The best tests are your own custom personal-task-relevant standardized tests (which the best models can't saturate, so aiming for less than 70% pass rate in the best case).
All this is to say that most people are not doing the latter and their vibes are heavily confounded to the point of being mostly meaningless.
The pass@100 is such a weird critique angle that is surprisingly mainstream; guess what, no one cares if the correct answer is in the top 100, it needs to be the top 1. A model with a better answer in the top 1 is a better model, full stop.
This. Plus if you want to even attempt measuring real 'intelligence' you want to run a neuro-symbolic, de-lexicalized benchmark (e.g. DL-ReasonSuite, SoLT, GSM-Symbolic) - which none of the providers releasing new models showcase.
I was an original Thunderbird pre-1.0 (from 2003) user and prior to that, Netscape Mail, and am quite certain it has had bayesian spam filtering all this time, at least since the late ‘90s. That was a headline feature in the early days. My first email account used POP3 through a shared web host for my own domain in that era.
While I understand why they used the METR data, a cleaner look would be against the current cost-optimal frontier of open models (e.g. GLM-5.1 and MiniMax-M2.7). That paints a very different picture. Comparing just the frontier models at the time of the METR report invariably leads to looking at providers who are pushing the limits of cost at the time of the report.
GPT-5 was shown as being on the costly end, surpassed by o3 at over $100/hr. I can't directly compare to METR's metrics, but a good proxy is the cost of the Artificial Analysis suite. GLM-5.1 is less than half the cost to complete the suite of GPT-5 and is dramatically more capable than both GPT-5 and o3.
So while their analysis is interesting, it points towards the frontier continuing to test the limits of acceptable pricing (as Mythos is clearly reinforcing) and the lagging 6-12 months of distillation and refinement continuing to bring the cost of comparable capabilities to much more reasonable levels.
Calculating hourly costs for these models makes me think that the decision of when to hire an SWE vs. increase use of AI may follow a similar pattern to the decision to use cloud compute vs. on-premises. I don’t cost $120/hr (incl. fringe), but my employer pays my salary all year long, no matter if I am working or on vacation. Whereas if they use an AI model to do the same work, they may be happy to pay $120/hr or more, since they may only use the model for a small fraction of 2080 hours per year, so they’d still save money, and not have a messy human to deal with.
I remain convinced we won’t look at project estimates as time based in software engineering as our primary cost estimate. And this is transition will happen rapidly. We’re going to shift to a capex/token spend model for project estimates where the business will say “ok I do want that feature for $1000 in tokens”.
I agree with you directionally that project estimates are/will be affected by this but I don't see a scenario in which time is completely removed from the equation with respects to projects & estimates to execute on them. We're all constrained by time, finite resource. It's always a factor in business.
Why is ollama so many people’s go-to? Genuinely curious, I’ve tried it but it feels overly stripped down / dumbed down vs nearly everything else I’ve used.
Lately I’ve been playing with Unsloth Studio and think that’s probably a much better “give it to a beginner” default.
Ollama is good enough to dabble with, and getting a model is as easy as ollama pull <model name> vs figuring it out by yourself on hugging face and trying to make sense on all the goofy letters and numbers between the forty different names of models, and not needing a hugging face account to download.
So you start there and eventually you want to get off the happy path, then you need to learn more about the server and it's all so much more complicated than just using ollama. You just want to try models, not learn the intricacies of hosting LLMs.
to be fair, llama.cpp has gotten much easier to use lately with llama-server -hf <model name>. That said, the need to compile it yourself is still a pretty big barrier for most people.
I started with ollama and now I'm using llama.cpp/llama-server's Router Mode that allows you to manage multiple models through a single server instance.
One thing I haven't figured out: Subjectively, it feels like ollama's model loading was nearly instant, while I feel like I'm always waiting for llama.cpp to load models, but that doesn't make sense because it's ultimately the same software. Maybe I should try ollama again to convince myself that I'm not crazy and that ollama's model loading wasn't actually instant.
Ah, 'twas a mere jest, a sarcastic jab that of all the manifold builds provided, the most useful is missing - doubtless for good and practical reasons.
Nevertheless, worth looking at the Vulkan builds. They work on all GPUs!
> That said, the need to compile it yourself is still a pretty big barrier for most people.
My distro (NixOS) has binary packages though...
And there's packages in the AUR (Arch), GURU (Gentoo), and even Debian Unstable. Now, these might be a little behind, but if you care that much you can download binaries from GitHub directly.
Ollama got some first-mover advantage at the time when actually building and git pulling llama.cpp was a bit of a moat. The devs' docker past probably made them overestimate how much they could lay claim to mindshare. However, no one really could have known how quickly things would evolve... Now I mostly recommend LM-studio to people.
LM Studio has been around longer. I’ve used it since three years ago. I’d also agree it is generally a better beginner choice then and now.
Unsloth Studio is more featureful (well integrated tool calling, web search, and code execution being headline features), and comes from the people consistently making some of the best GGUF quants of all popular models. It also is well documented, easy to setup, and also has good fine-tuning support.
I run Little Snitch[1] on my Mac, and I haven't seen LM Studio make any calls that I feel like it shouldn't be making.
Point it to a local models folder, and you can firewall the entire app if you feel like it.
Digressing, but the issue with open source software is that most OSS software don't understand UX. UX requires a strong hand and opinionated decision making on whether or not something belongs front-and-center and it's something that developers struggle with. The only counterexample I can think of is Blender and it's a rare exception and sadly not the norm.
LM Studio manages the backend well, hides its complexities and serves as a good front-end for downloading/managing models. Since I download the models to a shared common location, If I don't want to deal with the LM Studio UX, I then easily use the downloaded models with direct llama.cpp, llama-swap and mlx_lm calls.
Ollama's org had people flood various LLM/programming related Reddits and Discords and elsewhere, claiming it was an 'easy frontend for llama.cpp', and tricked people.
Only way to win is to uninstall it and switch to llama.cpp.
Ollama user with the opposite question -- why not? What am I missing out on? I'm using it as the backend for playing with other frontend stuff and it seems to work just fine.
And as someone running at 16gb card, I'm especially curious as to if I'm missing out on better performance?
Ollama has had bad defaults forever (stuck on a default CTX of 2048 for like 2 years) and they typically are late to support the latest models vs llamacpp. Absolutely no reason to use it in 2026.
> Ollama user with the opposite question -- why not? What am I missing out on? I'm using it as the backend for playing with other frontend stuff and it seems to work just fine.
Used to be an Ollama user. Everything that you cite as benefits for Ollama is what I was drawn to in the first place as well, then moved on to using llama.cpp directly. Apart from being extremely unethical, The issue is that they try to abstract away a bit too much, especially when LLM model quality is highly affected by a bunch of parameters. Hell you can't tell what quant you're downloading. Can you tell at a glance what size of model's downloaded? Can you tell if it's optimized for your arch? Or what Quant?
`ollama pull gemma4`
(Yes, I know you can add parameters etc. but the point stands because this is sold as noob-friendly. If you are going to be adding cli params to tweak this, then just do the same with llama.cpp?)
That became a big issue when Deep Seek R1 came out because everyone and their mother was making TikToks saying that you can run the full fat model without explaining that it was a distill, which Ollama had abstracted away. Running `ollama run deepseek-r1` means nothing when the quality ranges from useless to super good.
> And as someone running at 16gb card, I'm especially curious as to if I'm missing out on better performance?
I'd go so far as to say, I can *GUARANTEE* you're missing out on performance if you are using Ollama, no matter the size of your GPU VRAM. You can get significant improvement if you just run underlying llama.cpp.
Secondly, it's chock full of dark patterns (like the ones above) and anti-open source behavior. For some examples:
1. It mangles GGUF files so other apps can't use them, and you can't access them either without a bunch of work on your end (had to script a way to unmangle these long sha-hashed file names)
2. Ollama conveniently fails contribute improvements back to the original codebase (they don't have to technically thanks to MIT), but they didn't bother assisting llama.cpp in developing multimodal capabilities and features such as iSWA.
3. Any innovations to the do is just piggybacking off of llama.cpp that they try to pass off as their own without contributing back to upstream. When new models come out they post "WIP" publicly while twiddling their thumbs waiting for llama.cpp to do the actual work.
It operates in this weird "middle layer" where it is kind of user friendly but it’s not as user friendly as LM Studio.
After all this, I just couldn't continue using it. If the benefits it provides you are good, then by all means continue.
IMO just finding the most optimal parameters for a models and aliasing them in your cli would be a much better experience ngl, especially now that we have llama-server, a nice webui and hot reloading built into llama.cpp
> 1. It mangles GGUF files so other apps can't use them, and you can't access them either without a bunch of work on your end (had to script a way to unmangle these long sha-hashed file names)
This is what pushed me away from Ollama. All I wanted was to scp a model from one machine to another so I didn't have to re-download it and waste bandwidth. But Ollama makes it annoying, so I switched to llama.cpp. I did also find slightly better performance on CPU vs Ollama, likely due to compiling with -march=native.
> (they don't have to technically thanks to MIT)
Minor nit: I'm not aware of any license that requires improvements to be upstreamed. Even GPL just requires that you publish derivative source code under the GPL.
A full-module add-on in this power class is about $7 at 1,000 unit scale [0]. It would be around $3 with your own custom PCB design in terms of BoM addon at scale. That’s power only. Add another dollar or two for 10/100 PHY.
The trick is as others have said in what adding it to your design does in terms of complicating compliance design.
I never felt during this era that the information about these chips was hard to come by as the author claims. Retrospectively I appreciate that’s because I grew up living by a large, well funded library in a tech centric town, so they always had all the latest tech publications.
Prompts like, "Please write a comprehensive report on _____ to work with _____. Include a holistic report on architecture and meaning and purpose of all involved systems. Describe the why and how of the changes in depth and include a full glossary of terms and systems. Write as a new .md in docs when you are sure there are no major gaps in your understanding. Include a report on the plan to _____."
Will be the rough shape you want to get it to dig through all the relevant code and make relevant architectural diagrams. Guide more or less towards specifics as appropriate. This has worked well since Opus 4.5.
reply