mbh159 · 2026-02-25T23:35:15 1772062515

Tomorrow we're launching Coup, where agents compete by bluffing and by keeping track of which opponents they think are lying.

It's a faster-paced, shorter-lived game, so we can collect larger samples of data on larger groups of agents and get significant results on model behaviors: collaboration, truth-telling, and the ability to lie effectively.

mbh159 · 2026-02-25T23:33:48 1772062428

cheers, the website will be updated with new environments daily!

mbh159 · 2026-02-25T22:21:08 1772058068

yes, we have a new game launching every day this week. We're looking to add more domains to test how the jaggedness of AI differs between model providers and to better evaluate how models perform across domains

mbh159 · 2026-02-25T22:20:13 1772058013

yes! If you want to test your agents or develop evals on the platform, my DMs are open

mbh159 · 2026-02-25T16:02:56 1772035376

Unfortunately, for a game that runs 4+ hours, it was configured to use too much reasoning per turn and a larger context than it needed. Reducing the size helped lower the cost (still expensive).

In the leaderboard section of the page, I'll be auto-populating each model's token cost as a metric to evaluate on.
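
A rough sketch of how a per-game token cost could be computed for a leaderboard column. Everything here is an assumption for illustration: the `TokenUsage` record, the `game_cost_usd` function, and the per-million-token prices are hypothetical and not the platform's actual code or any provider's real pricing.

```python
from dataclasses import dataclass


@dataclass
class TokenUsage:
    """Hypothetical per-turn token counts reported by a model API."""
    input_tokens: int
    output_tokens: int


def game_cost_usd(turns, input_price_per_mtok, output_price_per_mtok):
    """Sum token usage over all turns and convert it to a dollar cost.

    Prices are expressed in USD per one million tokens (assumed unit).
    """
    total_in = sum(t.input_tokens for t in turns)
    total_out = sum(t.output_tokens for t in turns)
    return (total_in * input_price_per_mtok
            + total_out * output_price_per_mtok) / 1_000_000


# Example with made-up numbers: two turns, made-up prices.
turns = [TokenUsage(1_000_000, 0), TokenUsage(0, 500_000)]
cost = game_cost_usd(turns, input_price_per_mtok=2.0, output_price_per_mtok=10.0)
```

Reported next to win rate, a figure like this would show when a model only wins by spending far more reasoning per turn.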

mbh159 · 2026-02-25T15:42:03 1772034123

I was able to beat the AI every time. They're pretty bad at this point, but I expect them to get much better over time

weisser · 2026-02-25T17:35:43 1772040943

would you describe yourself as particularly good or the models as particularly bad?

mbh159 · 2026-02-25T15:41:28 1772034088

I want to! I think skills can add big performance gains here, especially with smaller models. There's a lot of domain knowledge in games, so distilling it into a "skill" may allow much smaller models to outcompete the large ones

mbh159 · 2026-02-25T15:38:16 1772033896

appreciate it, I wanted to make the AI behavior easy to understand. Our main focus currently is to help AI researchers align their models and help develop an open framework for evaluating AI.

mbh159 · 2026-02-25T15:27:02 1772033222

it was fun building it, sometimes the LLMs are pretty funny in how they play

mbh159 · 2026-02-25T15:25:32 1772033132

Thank you! I grew up playing Civilization, and one day while talking with friends I thought it would be a perfect proxy for how good AI is at long-term planning. I had many frustrating sessions where my early decisions in the game had consequences only much later. With hidden information and other agents at play, I thought it'd be an interesting test of agent capabilities.