Grimblewald · 2026-04-28T22:18:44 1777414724

Isn't it an open secret that benchmarks are largely irrelevant at this point? Why else do we all have a personalized test battery for new models? That said, I've stopped testing ChatGPT entirely. It's still OK, but it is beaten by local models and gets thrashed by non-OAI frontier providers. I get the history, but holding up OAI outputs as equivalent is like comparing Yahoo to Google after Yahoo's collapse in search.

OAI language models are largely irrelevant at this point, imo.

Grimblewald · 2026-04-27T23:34:53 1777332893

Agreed, this seemed silly. It seems to be more a question of "would you say turquoise is blue or green?" than a question of whether our blues match. Better, imo, would be to show paired colours and ask which one is "more blue". Cool idea for a website, but imo poorly formulated.

martin- · 2026-04-28T15:09:17 1777388957

But if cyan for me is blue, and for you it's green, or neither (though that option is not available in this test), then that DOES tell us if our definitions of the word "blue" match. For me, the concept "blue" covers the cyan part of the spectrum, while for others it clearly doesn't.

Grimblewald · 2026-04-28T21:21:38 1777411298

A "neither" option would also work. Point is, half the colours I was shown fall into neither.

Grimblewald · 2026-04-20T22:27:00 1776724020

This. Even on Android, via Termux, you can run Ollama with GPU acceleration on a phone. It works, though mileage will vary.

Grimblewald · 2026-04-19T23:22:22 1776640942

Right, all valid points, but consider the scale of a game like those coming out of Rockstar. I'd understand for indie games and arcade games, but a single-player RPG that will likely never be seen in arcade settings? Seems odd to me to see it here. Rockstar has the resources to do it properly, one would think, no?

JoshTriplett · 2026-04-20T02:25:36 1776651936

Suppose you don't care as much about replays, and you're willing to use other tricks to "cheat" on multiplayer sync instead (because most AAA titles seem to have multiplayer these days). Suppose, instead, your top priority is visual fidelity and being perceived as having cutting-edge graphics. You want maximum computational effort going into letting the gamers with a top-of-the-line GPU render on their 360FPS monitor. And you want lots of objects and realistic physics.

If you run physics on a global timer, you could run it at a slower rate and try to fake some of those frames (extrapolating intermediate positions of objects), which is complex. Or you could run it at a faster rate, and every frame has real physics updates, and then it's taking time you could be using for graphics or something else that you think sells better. And there are ways around that, too, but they're complicated and your team is busy and they aren't what your engine gives you for free...
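For anyone who hasn't seen the "global timer" option spelled out, here is a minimal sketch of a fixed-timestep loop. It is not taken from any particular engine: `PHYSICS_DT`, `Body`, `step`, and `simulate` are hypothetical names, the 60 Hz rate and the gravity-only physics are assumptions, and it interpolates between the last two physics states rather than extrapolating ahead, which is the simpler of the two ways to fake in-between frames.

```python
from dataclasses import dataclass

PHYSICS_DT = 1.0 / 60.0  # assumed fixed physics step (60 Hz), purely illustrative


@dataclass
class Body:
    pos: float
    vel: float


def step(body: Body, dt: float) -> Body:
    """Advance one physics step (semi-implicit Euler, constant gravity)."""
    vel = body.vel - 9.81 * dt
    return Body(pos=body.pos + vel * dt, vel=vel)


def simulate(frame_times: list[float], start: Body) -> list[float]:
    """Return the position drawn on each frame, given per-frame durations."""
    prev, curr = start, start
    accumulator = 0.0
    rendered = []
    for frame_dt in frame_times:
        accumulator += frame_dt
        # Run as many fixed physics steps as the elapsed time allows.
        while accumulator >= PHYSICS_DT:
            prev, curr = curr, step(curr, PHYSICS_DT)
            accumulator -= PHYSICS_DT
        # Blend the last two physics states to fake an in-between position.
        alpha = accumulator / PHYSICS_DT
        rendered.append(prev.pos + (curr.pos - prev.pos) * alpha)
    return rendered
```

The cost of this variant is that what you draw lags the simulation by up to one physics step. Extrapolating avoids that lag but can overshoot when objects collide or change direction, which is part of why it gets complicated.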

Grimblewald · 2026-04-19T23:17:12 1776640632

I miss 4.5. It was gold.

lossyalgo · 2026-04-20T13:47:40 1776692860

4.5 Sonnet/Opus/Haiku are still available via GitHub Copilot plugins.

xvector · 2026-04-20T04:03:33 1776657813

Rose tinted glasses

Grimblewald · 2026-04-20T09:08:57 1776676137

Nah, until recently I still had access via the web chat interface, and I would often paste in a transcript and files for something 4.7 kept fucking up, paste the response into files as appropriate, and attempt to continue with 4.7.

I swear 4.6+ looks for reasons to ask clarifying questions sometimes, even when really not required, and this fucks flow/quality up in a big way.

I just wish there was an "I'm not stupid" checkbox you could use to get minimal-interference access to Claude. I'm starting to use local models again, which I haven't in a while because Claude was so much better, but once I fully lose access to 4.5 it might be time to go back to fully local for good. 4.6+ fails to add value for me: projects that 4.5 and earlier did a good job on at the first try now require multiple prompts and feedback, with the exact same initial prompt and project files extracted from an archive. I liked Claude because it aced those tests while local models required handholding. Now Claude requires handholding, so why use it over local? Once 4.5 leaves OpenRouter it might just be time.

nwienert · 2026-04-20T04:10:35 1776658235

4.5 was clearly better than .6 and .7. Like, clear as day.

.6 is some sort of quantized or distilled .5 with a bit more RL, and the current .5 is that same cost reduced model without the extra RL.

Grimblewald · 2026-04-19T22:56:31 1776639391

Wild that it gets billed before it is accepted.

Grimblewald · 2026-04-19T22:53:52 1776639232

Good lord, these cases are quite problematic. I was going to use Claude for some legacy stuff, but I don't feel like getting banned over something innocent like "can you identify how we can fix the slave's behaviour? They're not listening to master properly".

kay_o · 2026-04-19T22:58:15 1776639495

Imagine you are working on a hobby game, with terms like attack and equipping weapons :)

I have had Opus consume >90% of my quota in a single prompt to form a plan, then refuse to output it and tell me it's been stopped due to terms of service, please use Sonnet.

Grimblewald · 2026-04-19T22:48:32 1776638912

The beatings will continue until moral improves

salawat · 2026-04-20T02:48:02 1776653282

Don't you mean morale? Businesses are basically amoral by desi....ooooooooh. I see what you did there.

Grimblewald · 2026-04-19T21:38:58 1776634738

List is massive. Anything novel? It'll fuck up without extreme handholding. Anything for which the components aren't solved, publicly published problems? It'll fuck it up.

Basically, Claude can solve issues for you when that means implementing existing code or combining existing patterns, but novel work it cannot do.

Grimblewald · 2026-04-18T04:45:00 1776487500

The whole movie is worth a watch. It centres on the work of Teddy Katz, an Israeli student whose 1990s thesis on the massacre was discredited after legal action by veterans. That push-back was largely baseless: the thesis was well researched and strongly supported by primary evidence. The movie is called "Tantura" and was released in 2022. If you're not an AV person, the Wikipedia article on the Tantura massacre is worth a read. https://en.wikipedia.org/wiki/Tantura_massacre