jkelleyrtp · 2026-06-09T17:10:00 1781025000

On the new FrontierCode [1] benchmark (ie graded from an OSS maintainer's perspective of "would I merge this code?")

- Opus 4.7 xhigh: 5.2%

- Opus 4.8 xhigh: 13.4%

- Fable 5 xhigh: 29.3%

Seems like a huge jump.

[1] https://cognition.ai/blog/frontier-code

amluto · 2026-06-09T17:39:41 1781026781

That blog post really makes it look like it's graded from an LLM's estimation of an OSS maintainer's review. I see three issues:

1. That estimate could easily be wrong.

2. That estimate is, of course, usable in RL training. This isn't an inherently bad thing, and this is more or less what has improved coding models so much lately. But it does mean that other companies could and surely will do this sort of training, and Anthropic probably did too.

3. OSS maintainers are far from perfect, and there's an unfortunate uncanny valley-like effect in which a coding model can produce code that is just convincing enough to pass review even though it's actually totally wrong. I don't know whether this is a specific issue here.

rdedev · 2026-06-10T00:48:20 1781052500

There is also the possibility that an LLM judge would be happy with some code that looks like LLM generated code. But a maintainer for a specific project might not merge it for stylistic reasons

amluto · 2026-06-10T03:42:13 1781062933

I think the intent was to specifically train an LLM to judge what a specific maintainer would consider to be good style.

zzleeper · 2026-06-09T17:25:19 1781025919

How credible is this benchmark? Does it correlate with others' real-world experience?

bfeynman · 2026-06-09T17:59:25 1781027965

Given it was made by Cognition (the team behind the Devin flop), who now just get to wait it out until Claude and GPT-5 basically do all of the work for them: not very. When you read about it, the framework is highly subjective, which very quickly becomes a problem because it's based on heuristics that probably change a bunch with a better code model.

vanuatu · 2026-06-09T18:02:15 1781028135

the subjective framework is exactly why it's good

prior benchmarks relied mostly on unit tests or synthetic judges, which are easily benchmaxxed, which leads to nobody trusting benchmarks

we need people manually checking the data for good code quality

vanuatu · 2026-06-09T18:00:26 1781028026

i worked on one of the benchmarks typically found in new model releases

this benchmark looks very good from the methodology. a cog researcher checking the data themselves is very high signal (not scalable so don't take the benchmark as gospel, but directionally good)

Catloafdev · 2026-06-09T17:29:04 1781026144

It's a relatively new benchmark but from what I can tell it has serious cred behind it. I assume it will be picked up as part of the standard suite of CS-related benchmarks soon enough.

emp17344 · 2026-06-09T17:29:56 1781026196

Seems like it literally popped up yesterday with the express purpose of building hype for this release.

osti · 2026-06-09T18:13:05 1781028785

And notable absence of DeepSWE benchmark where they do badly, but somehow a benchmark that was published yesterday is in this announcement.

zzleeper · 2026-06-09T21:27:27 1781040447

Exactly.. a bit of a red flag for me..

swyx · 2026-06-09T18:46:15 1781030775

team member here - we had been working on frontiercode for ~6-7 months. timing just lined up

emp17344 · 2026-06-09T19:46:16 1781034376

Yeah, right. If this benchmark was truly developed in an independent manner, and the timing just “lined up”, how did Anthropic even know to include results in their model release documentation the day after the benchmark is revealed? It seems like there must have been some collaboration or influence from Anthropic behind the scenes.

oblio · 2026-06-09T22:15:18 1781043318

Come on, why are you a jerk about this?

Nobody would have 800+ billion reasons to lie by commission or omission here.

vanuatu · 2026-06-09T17:57:33 1781027853

i doubt it, cog wants coding agents to be better because it directly improves their product

they aren't married to a particular lab, most of their usage is their in house model i believe

anthonypasq · 2026-06-09T17:33:47 1781026427

what incentive does Cognition have for doing this? seems like complete nonsense speculation on your part.

bel8 · 2026-06-09T17:45:35 1781027135

With billions/trillions of dollars floating around, is it hard to imagine benchmarks could be biased?

I think it's safe to assume everything AI related is heavily biased until proven otherwise. Just like in pharma.

camdenreslink · 2026-06-09T18:22:40 1781029360

People game benchmarks for fake internet points to get their favorite web framework to the top of the list. I'm pretty sure they will do it for billions of dollars.

anthonypasq · 2026-06-09T19:04:11 1781031851

you didn't answer my question. Why would Cognition be biased towards making Anthropic look good?

gloosx · 2026-06-10T07:42:14 1781077334

Because Cognition is a major customer of Anthropic?

anthonypasq · 2026-06-10T15:55:52 1781106952

they are also a major customer of OpenAI and every other model maker. what's your point?

schipperai · 2026-06-09T18:29:17 1781029757

Cognition did well in documenting their approach [1].

TL;DR - they worked with OSS project maintainers to build tasks. They score models based on whether a PR is mergeable. All tasks are graded by a human researcher. SoTA models have hill-climbing to do which raises the bar and inspires confidence. I'd say it's legit.

[1]: https://x.com/cognition/status/2064061031912288715

shimman · 2026-06-10T03:07:03 1781060823

It's an unacademic benchmark by a failed VC startup clawing for relevancy.

CSMastermind · 2026-06-09T21:17:38 1781039858

DeepSWE is the benchmark you want to actually look out for. Only one that aligns with actual user reported results from trying the models.

ryeguy · 2026-06-10T00:14:54 1781050494

Did you read the blog post? They compare to deepswe and call it out as the worst one for false positives (failed, but the benchmark assessed it as correct). It also has less language variance.

CSMastermind · 2026-06-10T05:23:52 1781069032

I mean yes that is what you'd say if you were writing a blog post about your new benchmark.

ryeguy · 2026-06-11T01:42:58 1781142178

Sure, but they at least quantified it with data. It's not like they just dropped a sentence saying the above, they showed numbers.

OtomotO · 2026-06-09T19:00:46 1781031646

Bummer! When can I finally and confidently get slopcode into Zig?

swyx · 2026-06-09T18:53:21 1781031201

jump in chart form https://x.com/swyx/status/2064414823748886591/photo/1

DonsDiscountGas · 2026-06-09T21:54:35 1781042075

I am shocked at the low scores from previous models. Maybe I just have low code standards but I've generally been vibe coding since 4.6

make3 · 2026-06-09T22:19:51 1781043591

4.6 had functional but very poor quality code

anshumankmr · 2026-06-10T10:20:11 1781086811

how so? it has been my daily workhorse, and in fact so was 4.5, but as long as we steer it, it does a good enough job. I have not tried Mythos/Fable yet, so I do not have an opinion on it.

hydra-f · 2026-06-09T17:17:27 1781025447

Yes, and the price reflects that

leecommamichael · 2026-06-09T17:22:05 1781025725

I'm not familiar with model pricing trends, did they clearly state how the new pricing compares? (Note that I'm actually asking a question, and am not arguing)

EDIT: Oh I see, this is the best link for pricing https://platform.claude.com/docs/en/about-claude/pricing

So the price is double across the board...

bhelkey · 2026-06-09T17:28:27 1781026107

>Fable 5 and Mythos 5 are being offered at $10 per million input tokens and $50 per million output tokens

From their pricing page, Opus 4.8 costs $5 per million input tokens and $25 per million output tokens [1].

[1] https://platform.claude.com/docs/en/about-claude/models/over...

wongarsu · 2026-06-09T17:47:36 1781027256

Still cheaper than Opus 4.0 and 4.1 (which were and still are $15/MTok input and $75/MTok output)

I would have expected Mythos to be much more expensive than just 2x current Opus (which is clearly cheaper to run than original Opus)

ainch · 2026-06-09T22:46:12 1781045172

Token prices have increased, but it's not really the whole story at this point, given some models will use far more tokens to complete a task than others. One of the charts in Anthropic's blog posts shows Fable at 'low' reasoning achieving better results for less money than Opus on 'high'.
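To make that concrete, here is a minimal sketch of per-task cost using the per-token prices quoted upthread. The model keys and the token counts are hypothetical placeholders picked for illustration, not measurements from Anthropic's chart.

```python
# USD per million tokens, as quoted upthread.
PRICES = {
    "opus-4.8": {"input": 5.00, "output": 25.00},
    "fable-5": {"input": 10.00, "output": 50.00},
}

def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Total USD cost of one task given its token usage."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Hypothetical usage: the pricier model at low effort burns fewer tokens.
opus_high = task_cost("opus-4.8", input_tokens=400_000, output_tokens=60_000)
fable_low = task_cost("fable-5", input_tokens=150_000, output_tokens=20_000)

print(f"Opus 4.8 (high): ${opus_high:.2f}")
print(f"Fable 5 (low):   ${fable_low:.2f}")
```

With these made-up counts, the model with 2x the token price still comes out cheaper per task, which is the only point of the sketch. Real token usage has to be measured.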

hydra-f · 2026-06-09T17:29:02 1781026142

As per OpenRouter:

Input Price $10/M tokens

Output Price $50/M tokens

Cache Read $1/M tokens

Cache Write $12.50/M tokens

2x Claude Opus 4.8, same as Claude Opus 4.8 (Fast)

Frankly, not even Opus 4.8 would be enough of an incentive to use at that price range (enterprise-wise; would not even bat an eye as a consumer)
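For anyone estimating a bill from those four numbers, a rough sketch of the arithmetic follows. The function name and the token counts are hypothetical; only the prices come from the list above.

```python
# USD per million tokens, from the OpenRouter listing above.
INPUT, OUTPUT, CACHE_READ, CACHE_WRITE = 10.00, 50.00, 1.00, 12.50

def request_cost(uncached_in: int, cache_read: int, cache_write: int, out: int) -> float:
    """USD cost of one request, split by token category."""
    total = (
        uncached_in * INPUT
        + cache_read * CACHE_READ
        + cache_write * CACHE_WRITE
        + out * OUTPUT
    )
    return total / 1_000_000

# Hypothetical request: 120k prompt tokens, 5k output tokens.
no_cache = request_cost(uncached_in=120_000, cache_read=0, cache_write=0, out=5_000)
# Same request when 100k of the prompt is served from cache.
with_cache = request_cost(uncached_in=20_000, cache_read=100_000, cache_write=0, out=5_000)

print(f"no cache:   ${no_cache:.2f}")
print(f"with cache: ${with_cache:.2f}")
```

How much of a prompt actually hits the cache depends on the workload, so treat the split here as an assumption.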

ghshephard · 2026-06-09T21:30:32 1781040632

Depends on the enterprise, obviously. In the Bay Area, 0% of the tech companies care in the slightest. And I'm willing to wager < 5% of enterprises would send their traffic to OpenRouter. Most of them don't even want to send traffic directly to Anthropic or OpenAI, which is why Bedrock has gotten so much traction lately.

But - these $3k-$5k/month/engineer bills are going to start to get attention soon - only question is whether the response is to slow down on the $$$ spending or reduce the # of engineers.

m3kw9 · 2026-06-09T17:32:38 1781026358

FrontierCode is likely paid for by anthropic.

lanthissa · 2026-06-09T17:39:14 1781026754

did they not pay them enough to get good ratings on the other 3 models?

what's the logic in claiming it's a borked metric when everything listed is an anthropic model?

Narretz · 2026-06-09T17:59:09 1781027949

There are a few benchmarks out there where all existing models have abysmal scores. So it's not actually a problem if Anthropic's older models are bad, especially if the jump to the newest model is huge, and the competition is also way below it.

reasonableklout · 2026-06-09T17:35:21 1781026521

Huh? It's a benchmark by Cognition which (1) is building their own models and (2) offers all providers and thus has an incentive to avoid hyping up any one too much.

jstummbillig · 2026-06-09T17:41:26 1781026886

But you can just say shit now. Tokens might not be too cheap to meter but saying shit increasingly is.

jkelleyrtp · 2026-05-27T19:36:55 1779910615

I agree in principle with the math. But I believe that in reality if revenues don't show up quickly, then lenders will just restructure the debt and defer the payback period. Similar to SF commercial real-estate; many buildings should've come due during the depressed covid market, but lenders (banks) were willing to delay payment until the market picked up again.

The scale of these investments put the lenders at substantial risk, so the lenders will do anything to make it work. If the current lenders will be damaged by extended payback periods, they can simply sell the debt to someone else who won't be.

jkelleyrtp · 2026-04-07T19:05:40 1775588740

Dario (the founder) has a phd in biophysics, so I assume that’s why they mention biological weapons so much - it’s probably one of the things he fears the most?

conradkay · 2026-04-07T19:37:31 1775590651

Going off the recent biography of Demis Hassabis (CEO/co-founder of Deepmind, jointly won the Nobel Prize in Chemistry) it seems like he's very concerned about it as well

jkelleyrtp · 2026-02-05T18:27:40 1770316060

claude swe-bench is 80.8 and codex is 56.8

Seems like 4.6 is still all-around better?

gizmodo59 · 2026-02-05T18:28:19 1770316099

It's SWE-bench Pro, not SWE-bench Verified. The Verified benchmark has stagnated.

joshuahedlund · 2026-02-05T18:30:36 1770316236

Any ideas why verified has stagnated? It was increasing rapidly and then basically stopped.

Snuggly73 · 2026-02-05T18:53:23 1770317603

it has been pretty much a benchmark for memorization for a while. there is a paper on the subject somewhere.

swe bench pro public is newer, but it's not live, so it will slowly get memorized as well. the private dataset is more interesting, as are the results there:

https://scale.com/leaderboard/swe_bench_pro_private

Rudybega · 2026-02-05T21:59:49 1770328789

You're comparing two different benchmarks. Pro vs Verified.

jkelleyrtp · 2026-02-03T22:42:38 1770158558

@deno team, how do secrets work for things like connecting to DBs over a tcp connection? The header find+replace won't work there, I assume. Is the plan to add some sort of vault capability?

jkelleyrtp · 2026-01-28T18:14:01 1769624041

I needed Mac / Win / Linux / iOS / Android for Dioxus dev, so I built my own in Rust.

https://skyvm.dev/

jkelleyrtp · 2026-01-27T00:50:53 1769475053

I started building something for the dioxus team to have access to mac/linux persistent and ephemeral dev envs with vnc and beefy cpu/mem.

Nobody offered multiplatform and we really needed it!

https://skyvm.dev

jkelleyrtp · 2026-01-15T07:01:58 1768460518

WGPU for render, winit for window, servo css engine, taffy for layout sounds eerily similar to our existing open source Rust browser blitz.

https://github.com/dioxuslabs/blitz

Maybe we ended up in the training data!

satvikpendem · 2026-01-15T09:48:40 1768470520

I follow Dioxus and particularly blitz / #native on your Discord and I noticed the exact same thing too. There was a comment in a readme in Cursor's browser repo they linked mentioning taffy and I thought, hang on, it's definitely not from scratch, as they advertise. People really do believe everything they read on Twitter.

Great work by the way, blitz seems to be coming along nicely, and I even see you guys created a proto browser yourselves which is pretty cool, actually functional unlike Cursor's.

jkelleyrtp · 2026-01-09T09:16:35 1767950195

I work on Dioxus (Rust WASM framework).

WASM for frontend, at least, has been held back by fundamental tools like bundle splitting, hot-reload, debugger symbols, asset integration, etc. We spent a lot of 2025 working on improving this. Vite and friends are really good!

I've been working on a big Dioxus project recently and am pretty happy with where WASM is now. The AI tools make working with Rust code much faster. I'm hopeful people gravitate towards WASM frameworks more now that the tools are better.

jkelleyrtp · 2025-12-10T09:52:23 1765360343

https://blog.cloudflare.com/incident-report-on-memory-leak-c...

better to crash than leak https keys to the internet