This is a perfect illustration of something I noticed with llm progress. Ask them to improve an svg like this, and it never fixes the missing crossbar or disconnected limbs, it just adds more stuff. In this example they have obviously improved greatly, and it contains a ridiculous amount of detail, but they still to get the basic shape of the frame wrong. It's weird. And the pattern shows up everywhere, try it with a webpage and it will add more buttons and stuff. I've even experimented with feeding the broken pelican svgs to an image model to look for flaws, and they still fail to spot the broken elements.
When you say "improve an svg like this", how are you imagining setting that workflow up? Are you just feeding them the SVG to iterate on; or are you giving them access to a browser to look at the rendering of the SVG?
I ask because:
Insofar as the original pelican test is zero-shot, it effectively serves as a way to test for the presence of a kind of "visual imagination" component within the layers of the model, that the model would internally "paint" an SVG [or PostScript, etc] encoding of an image onto, to then extract effective features from, analyze for fitness as a solution to a stated request, etc.
But if you're trying to do a multi-shot pelican, then just feeding back in the SVG produced in the previous attempt, really doesn't correspond to any interesting human capability. Humans can't take an SVG of a pelican and iteratively improve upon it just based on our imagined version of how that SVG renders, either! Rather, a human, given the pelican, would simply load the pelican SVG in a browser; look at the browser's rendering of the pelican; note the things wrong with that rendering; and then edit the SVG to hopefully fix those flaws (and repeat.)
I imagine current (mult-modal and/or computer-use) LLMs would actually be very good at such an "iterative rendered pelican" test.
I'm talking about two type of improvement, model improving, and prompt based improving. I am noticing that the baseline output has a lot more going on, the model has improved, yet it still makes those obvious looking mistakes with the shape of the frame or disconnected limbs etc.
And I am saying that if you take one of these SVGs and ask an LLM to look for flaws, it rarely spots those obvious flaws and instead suggests adding a sunset and fish in the birds mouth.
This is also my gripe with a lot of this stuff, always evaluating models on what they can literally oneshot is completely pointless; it's not how anything works, neither for humans nor for scaffolded AIs. I guess it's neat if you want to argue that a certain level of intelligence can "never be achieved" in a single forward pass, but like, so what. No one cares about that, except people who have already decided to be anti AI.
(not that I am in any sense pro AI, but it's just a weird lack of intellectual rigor)
Asking a model to improve its output is not one-shotting tho? My observation was that asking an llm to iterate and improve a response causes it to add more stuff, rather tha repair the broken stuff. And that model progress in general has the same pattern. This new model adds more details to its responses but continues to make mistakes at about the same rate.
The question was whether you were giving it the rendered image and using the model's visual modal capability, or feeding back in the textual SVG.
It's hard to "imagine" what the rendered SVG looks like, for both humans and LLMs, so just iterating on text won't really be as useful of a test. But if you show it what it rendered, it might observe the bad-looking bicycle and be able to fix the text that way.
To a certain extent, it feels like a Sonnet 3.7 moment. Slightly overeager - you ask for a button color change, you see layout changes, new package dependencies, and the README rewritten from scratch - and not necessarily correctly.
When I ask for a pelican on a bike, I want the Platonic ideal of a pelican on a bike, not a vision of an alternative reality in which pelicans created bikes. Though, thinking about it again, maybe I should.
Slightly overeager - you ask for a button color change, you see layout changes, new package dependencies, and the README rewritten from scratch - and not necessarily correctly.
It's because LLMs are fundamentally generative (creative), not truth-seeking or logic-seeking. Simple logic has always been incredibly expensive to impossible for LLMs.
Their ability is best described as "spiky". To steal from aphyr: think kiki, more than bouba. Whats interesting is that a lot of the models seem to have similar spikes and "troughs", though there are differences.
I think its less misleading this way because every other reader would have to pay $1.3M to emulate his workflow for a similar size project. His discounted internal costs are relevent only to openai.
I don't believe anything out of these startups anymore unless its backed by evidence.
Too expensive? Why would anthropic train a model too expensive to run? I doubt they would. Let's look at the evidence: Opus 4.5 came in at double the speed and half the price of old opus. Its speed matched older sonnet models. Higher Speed + Lower price = smaller model. So they rebranded sonnet sized models to opus. Where is the og opus sized model?
Arc has no predictive power whatsoever. I always use the best models available. So far I haven't found a task that chineses models cannot solve very quickly and reasonably. Do you have any examples where they failed for you?
It's interesting how Musk has engaged in such a distracting lawsuit against openai while he also prepares for the largest deal of his life, and the largest IPO in history. Exceedingly generous of him.
I like chutes. I think I get about 5K prompts per day for $20/m, though they may have stricter limits for new customers.
This gives you practically unlimited usage of frontier models like kimi, deepseek, glm.
Their models are always fullsize, never quantised except where the lab themselves provides an 4bit or 8bit model. You can see from the model config exactly which hf model it pulls and the serving co figuration used.
Prompts are encrypted using Trusted Execution Environment (TEE). So neither a model host or neighbour can view your prompts. That's as close as you can get to local level privacy in the cloud.
I tried looking into Chutes just now. Seems like there is no easy way to just pay & start using it with OpenCode or Claude Code, right? Their docs don’t seem to mention it. Do I really have to execute code with their API in order to use the models?
No its super easy. I think the confusion is due to the serving and hosting APIs that let you add your own GPUs to a pool and earn money. But for regular inference they have an openai responses API a basic chat app. You can signup to a $3 subscription, or deposit $5 and use your api key.
The statue is in Westminster, right by Whitehall. The heart of British government. It depicts a figure in a suit, marching off a ledge, completely blinded by a flag.
Who wears a suit and marches through Westminster under a flag?
- Businessmen? No. Merchants have no country.
- Officials? They wear suits but don't march
- Old-guard politicians? Rarely march or flag-wave with any conviction.
So who are we left with? The populist. The Nigel Farage archetype. The suited firebrand who wrap themselves in nationalist fervor, stoke the rabble, and blindly march everyone right off a cliff.
Banksy isn't known for complex, multi-layered messaging. He is popular precisely because he uses visual shorthand to say plainly what the general public is already thinking. There is no hidden 4D chess; it's just blunt satire about blind patriotism.
Edit: This also explains why the government is happy to keep this particular Banksy on display.
Not op but I wrote llm-consortium to prompt multiple models and create a synthesis. And it can run on an openai endpoint using llm-model-gateway. It's expensive, naturally, but for situations where you absolutely must get max intelligence its hard to beat.
e.g.
Pelican Riding a Bicycle — Engineering Study by DeepSeek v4 Pro, Kimi K2.6, and GLM-5.1 (1 iteration in synthesis mode with DeepSeek v4 flash as judge)
reply