This ok from your perspective then? def make_pass@1_agent(agent, n): def retry_a...

terminalshort · 2025-08-12T15:08:38 1755011318

Definitely wouldn't have written the code that way, but yes, if (and this is a massive "if") the agent has an accurate and meaningful way to determine which way to set the success boolean. The obvious caveat would be if n needed to be large enough to set the costs higher than I am willing to pay for the additional performance or it makes it take longer than I'm willing to wait.

Think of the agent like an employee. If he delivers the code within the expected time and to the expected quality standards, his process of getting there means almost nothing. Do I care if he tried 4 different approaches along the way and threw out the first 3? Not a bit.

DougBTX · 2025-08-12T13:49:32 1755006572

Absolutely fine, as long as the success flag is predicted by the model ensemble under test. That’s how Claude Code works for example, it will continue to iterate until success (or it will give up with failure at a certain point).

gronky_ · 2025-08-12T13:37:21 1755005841

Keep in mind that this isn’t about users - the top agents on the leaderboard aren’t running an actual product on the benchmark.

If they are running their production product as is, then of course whatever is built into the product is fine.