Don't forget that the Germans were able to quickly find the people they wanted to exterminate by going through municipal and church records. Today it would be much easier, thanks to the push for Digital ID. Choose a demographic, and the dashboard will show you where everyone is and who they're connected to.
Ten years ago multiple tech giants openly stated they would not help the Trump administration build a Muslim registry [1]. Since then, several of them have bowed the knee and donated to his second inauguration.
I’m from Germany, and keep wondering how much more damage the NSDAP could have done if they had access to the data these companies now have on everybody.
[1] https://www.theverge.com/2016/12/16/13990234/google-muslim-r...
Another issue with these benchmarks is that they measure a model you are unlikely to actually be routed to. My experience with Anthropic is that despite using Opus 4.6 and 4.7, most of the time the performance matches a low-parameter-count Qwen. There should be a way to verify which model is actually processing your prompts, and it should be independently verifiable. At the moment it's so bad that you have to ask the model a verification question in the form of a non-trivial problem. If it solves it, there's a chance you actually got Opus and not an impostor, and you can continue the session instead of restarting and hoping you get routed correctly. But even that doesn't help if the model is swapped for a cheaper one mid-session. I've lost so much work because of these shenanigans.
I'm sure some inference providers are transparent about this, but most intentionally obfuscate it. They have the full trace logs; my impression is that they don't share them because it's their competitive advantage, and sharing them would make it easier for a competitor to distill their model.
It also seems to me that they route prompts to cheaper, dumber models that present themselves as, e.g., Opus 4.7. Perhaps that's what "adaptive reasoning" really is: we'll route your request to something like Qwen and tell you it's Opus. Sometimes I do get a good model, so I've taken to asking a difficult question first; if the answer is dumb, I terminate the session, start again, and only then send the real prompt. But there's no guarantee the model won't be downgraded mid-session. I wish they would just charge the real price and stop these shenanigans. It wastes so much time.
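The restart loop described above can be sketched roughly like this. Everything here is a placeholder of my own: the `ask` callback stands in for whatever chat client you use, and the canary question, expected answer, and restart limit are arbitrary choices, not anything a provider actually offers.

```python
# Sketch of the "canary question" heuristic: keep restarting sessions
# until the model answers a known-hard-enough question correctly.
# CANARY/EXPECTED are illustrative; pick something a small model reliably flubs.
CANARY = "What is the 10th prime number?"
EXPECTED = "29"

def verify_session(ask, max_restarts=3):
    """Open fresh sessions until the canary is answered correctly.

    `ask(prompt)` is assumed to send `prompt` to a brand-new session and
    return the model's reply as a string. Returns the number of restarts
    that were needed; raises if the limit is exhausted.
    """
    for restarts in range(max_restarts):
        if EXPECTED in ask(CANARY):
            return restarts  # this session looks like the real model
    raise RuntimeError("never got routed to a capable model")
```

As the comment notes, this only checks the session at the start; it can't catch a mid-session downgrade unless you re-run the canary periodically.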
You're describing a Taravangian situation (he's a character in a book series who wakes up with a different, random intelligence level each day and has a series of self-tests to determine which kinds of decisions he's capable of that day). https://coppermind.net/wiki/Taravangian
I always agree if this is for academic purposes, if it helps with research etc. I can't see why I shouldn't. We are just meat that will expire one day.
But when's the last time you had $300 million in your personal budget to spend on advertising to a specific human being to improve your personal income?
When's the last time you got a call from an actual politician begging you for money and "support"?
US congress members spend the vast majority of their time on the phone, begging a list of rich people for a pittance of nickels to fund advertising for their next election. There's always a subtle threat of strings attached.
Both the prince and the pauper are forbidden from sleeping under the bridge.
I think they are routing to cheaper models that present themselves as, e.g., Opus. I now add checks to my prompts to make sure I'm not dealing with an impostor. If it answers incorrectly, I terminate the session and start again. Anthropic should be audited for this.
It appears that Opus 4.7 has already been nerfed. I can't get any sensible results since yesterday; it just keeps running in circles. Even pointing out that it's committing fraud by doing the superficial work it was specifically told not to do doesn't help.
oh yes. I tried to get a review of a code base after some refactoring. CC produced a complete garbage review. After I pointed that out, it admitted it was garbage - and promptly produced another pile of garbage. After the third failed attempt I had to call it a day.