That blog post really makes it look like submissions are graded by an LLM's estimate of how an OSS maintainer would review them. I see three issues:
1. That estimate could easily be wrong.
2. That estimate is, of course, usable in RL training. This isn't an inherently bad thing, and this is more or less what has improved coding models so much lately. But it does mean that other companies could and surely will do this sort of training, and Anthropic probably did too.
3. OSS maintainers are far from perfect, and there's an unfortunate uncanny valley-like effect in which a coding model can produce code that is just convincing enough to pass review even though it's actually totally wrong. I don't know whether this is a specific issue here.
There is also the possibility that an LLM judge would be happy with code that looks like LLM-generated code, while a maintainer for a specific project might not merge it for stylistic reasons.
Given it was made by Cognition (the team behind the Devin flop), who now just get to wait until Claude and GPT-5 basically do all of the work for them: not very. When you read about it, the framework is highly subjective, which very quickly becomes a problem because it's based on heuristics that will probably change a lot with a better code model.
i worked on one of the benchmarks typically found in new model releases
this benchmark looks very good on methodology. a cog researcher checking the data themselves is very high signal (not scalable, so don't take the benchmark as gospel, but directionally good)
It's a relatively new benchmark but from what I can tell it has serious cred behind it. I assume it will be picked up as part of the standard suite of CS-related benchmarks soon enough.
Yeah, right. If this benchmark were truly developed independently, and the timing just “lined up”, how did Anthropic even know to include results in their model release documentation the day after the benchmark was revealed? It seems like there must have been some collaboration or influence from Anthropic behind the scenes.
People game benchmarks for fake internet points to get their favorite web framework to the top of the list. I'm pretty sure they will do it for billions of dollars.
Cognition did well in documenting their approach [1].
TL;DR - they worked with OSS project maintainers to build tasks. They score models based on whether a PR is mergeable. All tasks are graded by a human researcher. SoTA models have hill-climbing to do which raises the bar and inspires confidence. I'd say it's legit.
Did you read the blog post? They compare to deepswe and call it out as the worst one for false positives (failed, but the benchmark assessed it as correct). It also has less language variance.
How so? It has been my daily workhorse, and in fact so was 4.5. As long as we steer it, it does a good enough job. I have not tried Mythos/Fable yet, so I do not have an opinion on it.
I'm not familiar with model pricing trends. Did they clearly state how the new pricing compares? (Note that I'm actually asking a question, and am not arguing.)
Token prices have increased, but it's not really the whole story at this point, given some models will use far more tokens to complete a task than others. One of the charts in Anthropic's blog posts shows Fable at 'low' reasoning achieving better results for less money than Opus on 'high'.
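To make the "price per token isn't the whole story" point concrete, here is a minimal sketch of the arithmetic. The rates and token counts are made up for illustration (they are not Anthropic's actual prices or measured usage), and `task_cost` is a hypothetical helper:

```python
def task_cost(tokens_used: int, price_per_million_tokens: float) -> float:
    """Cost of one task: tokens consumed times the per-token price."""
    return tokens_used / 1_000_000 * price_per_million_tokens


# Hypothetical numbers, NOT real pricing or real usage:
# an older model with a lower rate that burns many tokens on 'high' reasoning,
# and a newer model with a higher rate that needs fewer tokens on 'low'.
older_model_cost = task_cost(tokens_used=400_000, price_per_million_tokens=15.0)
newer_model_cost = task_cost(tokens_used=150_000, price_per_million_tokens=25.0)

if newer_model_cost < older_model_cost:
    print("The pricier-per-token model is cheaper per task in this scenario.")
```

What matters for the bill is cost per completed task, so a higher per-token rate can still come out cheaper if the model gets there in fewer tokens.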
Depends on the enterprise, obviously. In the Bay Area, 0% of the tech companies care in the slightest. And I'm willing to wager < 5% of enterprises would send their traffic to OpenRouter. Most of them don't even want to send traffic directly to Anthropic or OpenAI, which is why Bedrock has gotten so much traction lately.
But - these $3k-$5k/month/engineer bills are going to start to get attention soon - only question is whether the response is to slow down on the $$$ spending or reduce the # of engineers.
There are a few benchmarks out there where all existing models have abysmal scores. So it's not actually a problem if Anthropic's older models are bad, especially if the jump to the newest model is huge, and the competition is also way below it.
Huh? It's a benchmark by Cognition which (1) is building their own models and (2) offers all providers and thus has an incentive to avoid hyping up any one too much.
I agree in principle with the math. But I believe that in reality, if revenues don't show up quickly, lenders will just restructure the debt and defer the payback period. It's similar to SF commercial real estate: many buildings should've come due during the depressed COVID market, but lenders (banks) were willing to delay payment until the market picked up again.
The scale of these investments puts the lenders at substantial risk, so the lenders will do anything to make it work. If the current lenders would be damaged by extended payback periods, they can simply sell the debt to someone else who won't be.
Dario (the founder) has a PhD in biophysics, so I assume that’s why they mention biological weapons so much. It’s probably one of the things he fears the most?
Going off the recent biography of Demis Hassabis (CEO/co-founder of DeepMind, jointly won the Nobel Prize in Chemistry), it seems like he's very concerned about it as well.
it has been pretty much a benchmark for memorization for a while. there is a paper on the subject somewhere.
swe bench pro public is newer, but it's not live, so it will get slowly memorized as well. the private dataset is more interesting, as are the results there:
@deno team, how do secrets work for things like connecting to DBs over a TCP connection? The header find+replace won't work there, I assume. Is the plan to add some sort of vault capability?
I follow Dioxus and particularly blitz / #native on your Discord and I noticed the exact same thing too. There was a comment in a readme in Cursor's browser repo they linked mentioning taffy and I thought, hang on, it's definitely not from scratch, as they advertise. People really do believe everything they read on Twitter.
Great work by the way, blitz seems to be coming along nicely, and I even see you guys created a proto browser yourselves which is pretty cool, actually functional unlike Cursor's.
WASM for frontend, at least, has been held back by gaps in fundamental tooling like bundle splitting, hot-reload, debugger symbols, asset integration, etc. We spent a lot of 2025 working on improving this. Vite and friends are really good!
I've been working on a big Dioxus project recently and am pretty happy with where WASM is now. The AI tools make working with Rust code much faster. I'm hopeful people gravitate towards WASM frameworks more now that the tools are better.
- Opus 4.7 xhigh: 5.2%
- Opus 4.8 xhigh: 13.4%
- Fable 5 xhigh: 29.3%
Seems like a huge jump.
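For a sense of scale, a quick sketch of the relative jumps, using only the three scores listed above (the `relative_gain` helper is just for illustration):

```python
# Scores as quoted above, in percent.
scores = {
    "Opus 4.7 xhigh": 5.2,
    "Opus 4.8 xhigh": 13.4,
    "Fable 5 xhigh": 29.3,
}


def relative_gain(old: float, new: float) -> float:
    """How many times larger the new score is than the old one."""
    return new / old


names = list(scores)
for prev, curr in zip(names, names[1:]):
    gain = relative_gain(scores[prev], scores[curr])
    points = scores[curr] - scores[prev]
    print(f"{prev} -> {curr}: {gain:.2f}x (+{points:.1f} points)")
```

So each step is a bit more than a doubling, and the newest score is over five times the oldest, though all three are still well under 50%.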
[1] https://cognition.ai/blog/frontier-code