Isn't it an open secret that benchmarks are largly irrelevant at this point? Why else we do all have a personalized test battery for new models? That said i've stopped testing chatgpt entierly. Its still ok but is beaten by local models and it gets thrashed by non oai frontier providers. I get the history, but holding up oai outputs as equivallent is lile comparing yahoo to google post yahoo's collapse in search domains.
Oai language models are largly irrelevant at this point imo.
Agree, this seemed silly. It seems to be more a question of "would you say turquise is blue or green?" Rather than a question of our blues match. Better imo would be to ask something like paired colours and pick the "more blue" one. Cool idea for a website, but imo poorly formulated.
But if cyan for me is blue, and for you it's green, or neither (though that option is not available in this test), then that DOES tell us if our definitions of the word "blue" match. For me, the concept "blue" covers the cyan part of the spectrum, while for others it clearly doesn't.
Right, all valid points, but consider the scale of a game like those coming out of rockstar. I'd understand for indie games and arcade games, but a single player rpg that will likely never be seen in arcade settings? Seems odd to me to see it here. Rockstar has the resources to do it properly, one would think, no?
Suppose you don't care as much about replays, and you're willing to use other tricks to "cheat" on multiplayer sync instead (because most AAA titles seem to have multiplayer these days). Suppose, instead, your top priority is visual fidelity and being perceived as having cutting-edge graphics. You want maximum computational effort going into letting the gamers with a top-of-the-line GPU render on their 360FPS monitor. And you want lots of objects and realistic physics.
If you run physics on a global timer, you could run it at a slower rate and try to fake some of those frames (extrapolating intermediate positions of objects), which is complex. Or you could run it at a faster rate, and every frame has real physics updates, and then it's taking time you could be using for graphics or something else that you think sells better. And there are ways around that, too, but they're complicated and your team is busy and they aren't what your engine gives you for free...
Nah, until recently i still had access via web chat interface, and often paste a transcript and files for somethong 4.7 keeps fucking up, paste response into files as appropriate, and attempt to continue with 4.7.
I swear 4.6+ looks for reasons to ask clarifying questions sometimes, even when really not required, and this fucks flow/quality up in a big way.
I just wish there was a "im not stupid" checkbox you can use to get a minimalistic interference access to claude. Im starting to use local models again, which I havent in a while because claude was so much better, but once i fully lose access to 4.5 it might be time to go back to fully local for good. 4.6+ fails to add value for me, projects 4.5- did good jobs on first try now require multiple prompts and feedback. Exact same initial prompt and project files extracted from archive. I liked claude because it aced those tests while local required handholding. Now claude requires handholding, so why use it over local? Once 4.5 leaves openrouter it might just be time.
Good lord, these cases are quite problematic, i was going to use claude for some legacy stuff but i don't feel like getting banned over something innocent "can you identify a how we can fix the slave's behaviour? They're not listening to master properly"
Imagine you are working on a hobby game, with terms like attack and equipping weapons :)
I have opus consume >90% of my quota in a single prompt to form a plan then refuse to output it and tell me it's been stopped due to terms of service, please use sonnet.
List is massive. Anything novel? It'll fuck up without extreme handholding. Anything for which the components arent solved public published problems? It'll fuck it up.
Basically, claude can solve issues for you where it requires the implementation of existing code or a combination of existing patterns, but novel it cannot do.
The whole movie is worth a watch. It was made by Teddy Katz, an Israeli student whose 1990s thesis on the massacre was discredited after legal action by veterans, largely baseless push-back against the thesis which was well researched and strongly supported by primary evidence. The movie is called "Tantura" and was released in 2022. IF you're not an AV person, the wiki on the tantura massacre is worth a read. https://en.wikipedia.org/wiki/Tantura_massacre
Oai language models are largly irrelevant at this point imo.
reply