I guess the hope is that combining two sub-par coding models (xAI's grok + cursor's composer) and combining the data they have access to, they can build something that can compete with OpenAI / Anthropic in the coding space...
I guess I kinda see it... it makes sense from both points of view (xAI needs data + places to run their models, cursor needs to not be reliant on Anthropic/OpenAI).
I think I don't see it working out... I just don't see an Elon company sustaining a culture that leads to a high-quality AI lab, even with the data + compute.
Have to call out that comment about grok code being sub par. I used it exclusively when it was free in Cursor and have nothing bad to say about it. And that was months ago. I imagine it’s a lot better now.
Composer-2 is based on Kimi K2.5, but with extensive RL. Cursor estimated 3x more compute on their RL than the original K2.5 training run (some details in https://cursor.com/blog/composer-2-technical-report).
I used to hate on Composer 2 but I'm coming around to it. Opus for the big stuff and multi-file operations, Composer for all the small day-to-day IDE tasks works pretty good for me.
I'm going to be brutally honest: I have not found Kimi to be useful at all. It simply cannot compete with what the closed models from Codex and Claude offer. I don't want to risk using a model outside the ecosystem and introducing variables, as most of my workflow is baked into two to three large-company models.
That's interesting, Kimi K2.5 used through KimiCode was comparable to Sonnet in my tests, and is an excellent alternative to Anthropic models
That being said, I noticed that Kimi being served through OpenRouter providers was trash. Whatever they do on the backend to optimize for throughput really compromised the intelligence of the model. You have to work with Kimi directly if you want the best results, and that's also probably why they released a test suite to verify the intelligence of their new models.
Hey, thanks for responding. You're a very evocative writer!
I do want to push back on some things:
> We treat "cognitive primitives" like object constancy and causality as if they are mystical, hardwired biological modules, but they are essentially just
I don't feel like I treated them as mystical - I cite several studies that define what they are and correlate them to certain structures in the brain that developed millennia ago. I agree that ultimately they are "just" fitting to patterns in data, but the patterns they fit are really useful, and were fundamental to human intelligence.
My point is that these cognitive primitives are very much useful for reasoning, and especially the sort of reasoning that would allow us to call an intelligence general in any meaningful way.
> This "all-at-once" calculation of relationships is fundamentally more powerful than the biological need to loop signals until they stabilize into a "thought."
The argument I cite is from complexity theory. It's proof that feed-forward networks are mathematically incapable of representing certain kinds of algorithms.
> Furthermore, the obsession with "fragility"—where a model solves quantum mechanics but fails a child’s riddle—is a red herring.
AGI can solve quantum mechanics problems, but verifying that those solutions are correct still (currently) falls to humans. For the time being, we are the only ones who possess the robustness of reasoning we can rely on, and it is exactly because of this that fragility matters!
> The argument I cite is from complexity theory. It's proof that feed-forward networks are mathematically incapable of representing certain kinds of algorithms.
Claiming FFNs are mathematically incapable of certain algorithms misses the fact that an LLM in production isn't a static circuit, but a dynamic system. Once you factor in autoregression and a scratchpad (CoT), the context window effectively functions as a Turing tape, which sidesteps the TC0 complexity limits of a single forward pass.
> AGI can solve quantum mechanics problems, but verifying that those solutions are correct still (currently) falls to humans. For the time being, we are the only ones who possess the robustness of reasoning we can rely on, and it is exactly because of this that fragility matters!
We haven't "sensed" or directly verified things like quantum mechanics or deep space for over a century; we rely entirely on a chain of cognitive tools and instruments to bridge that gap. LLMs are just the next layer of epistemic mediation. If a solution is logically consistent and converges with experimental data, the "robustness" comes from the system's internal logic.
Thanks for reading, and I really appreciate your comments!
> who feed their produced tokens back as inputs, and whose tuning effectively rewards it for doing this skillfully
Ah, this is a great point, and not something that I considered. I agree that the token feedback does change the complexity, and it seems that there's even a paper by the same authors about this very thing! https://arxiv.org/abs/2310.07923
I'll have to think on how that changes things. I think it does take the wind out of the architecture argument as it's currently stated, or at least makes it a lot more challenging. I'll consider myself a victim of media hype on this, as I was pretty sold on this line of argument after reading this article https://www.wired.com/story/ai-agents-math-doesnt-add-up/ and the paper https://arxiv.org/pdf/2507.07505 ... who brush this off with:
> Can the additional think tokens provide the necessary complexity to correctly solve a problem of higher complexity? We don't believe so, for two fundamental reasons: one that the base operation in these reasoning LLMs still carries the complexity discussed above, and the computation needed to correctly carry out that very step can be one of a higher complexity (ref our examples above), and secondly, the token budget for reasoning steps is far smaller than what would be necessary to carry out many complex tasks.
In hindsight, this doesn't really address the challenge.
My immediate next thought is - even if solutions up to P can be represented within the model / CoT, do we actually feel like we are moving towards generalized solutions, or that the solution space is navigable through reinforcement learning? I'm genuinely not sure about where I stand on this.
> I don't have an opinion on this, but I'd like to hear more about this take.
I've been developing an ai coding harness https://github.com/dlants/magenta.nvim for over a year now, and I use it (and cursor and claude code) daily at work.
Fun observation - almost every coding harness (claude code, cursor, codex) uses a find/replace tool as the primary way of interacting with code. This requires the agent to fully type out the code it's trying to edit, including several lines of context around the edit. This is really inefficient, token wise! Why does it work this way? Because the LLMs are really bad at counting lines, or using other ways of describing a unique location in the file.
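To make the token cost concrete, here is a minimal Python sketch of what a find/replace edit tool boils down to. It is an illustration under my own assumptions, not the implementation of any of the harnesses mentioned: the name `find_replace` and its error messages are hypothetical. The point it shows is the uniqueness requirement, which is what forces the agent to type out surrounding context.

```python
def find_replace(text: str, find: str, replace: str) -> str:
    """Apply one search/replace edit, refusing missing or ambiguous matches."""
    if not find:
        raise ValueError("find string must not be empty")
    count = text.count(find)
    if count == 0:
        raise ValueError("find string not found")
    if count > 1:
        raise ValueError(f"find string matches {count} locations; add more context")
    return text.replace(find, replace, 1)


source = "x = 1\nprint(x)\ny = 1\nprint(y)\n"

# "= 1" alone matches two places, so the tool rejects the edit.
try:
    find_replace(source, "= 1", "= 2")
    ambiguous = False
except ValueError:
    ambiguous = True

# The agent has to spell out enough surrounding text to make the match unique.
edited = find_replace(source, "y = 1\nprint(y)", "y = 2\nprint(y)")
```

In this toy case the agent spends two full lines of output to change a single character, and the overhead grows with how repetitive the file is.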
I've experimented with providing a more robust dsl for text manipulation https://github.com/dlants/magenta.nvim/blob/main/node/tools/... , and I do think it's an improvement over just straight search/replace, but the agents do tend to struggle a lot - editing the wrong line, messing up the selection state, etc... which is probably why the major players haven't adopted something like this yet.
So I feel pretty confident in my assessment of where these models are at!
And also, I fully believe it's big. It's a huge deal! My work is unrecognizable from what it was even 2 years ago. But that's an impact / productivity argument, not an argument about intelligence. Modern programming languages, IDEs, spreadsheets, etc... also made a fundamental shift in what being a software engineer was like, but they were not generally intelligent.
> Fun observation - almost every coding harness (claude code, cursor, codex) uses a find/replace tool as the primary way of interacting with code. This requires the agent to fully type out the code it's trying to edit, including several lines of context around the edit. This is really inefficient, token wise! Why does it work this way? Because the LLMs are really bad at counting lines, or using other ways of describing a unique location in the file.
Incidentally, I saw an interesting article about exactly this subject a little ways back, using line numbers + hashes instead of typing out the full search/replace, writing patches, or doing a DSL, and it seemed to have really good success:
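For readers who want the gist of that idea, here is a rough Python sketch of line number + hash addressing. It is my own illustration and not the article's implementation: the names `line_tag`, `render_for_model`, and `replace_line`, the `number:hash|content` display format, and the 6-character hash length are all assumptions made for the example.

```python
import hashlib


def line_tag(line: str) -> str:
    """Short content hash for one line (6 hex chars is an arbitrary choice)."""
    return hashlib.sha256(line.encode("utf-8")).hexdigest()[:6]


def render_for_model(text: str) -> str:
    """Show the file as 'number:hash|content' so the model can cite lines cheaply."""
    return "\n".join(
        f"{n}:{line_tag(line)}|{line}"
        for n, line in enumerate(text.split("\n"), start=1)
    )


def replace_line(text: str, number: int, tag: str, new_line: str) -> str:
    """Replace one line, but only if its hash still matches what the model saw."""
    lines = text.split("\n")
    if not 1 <= number <= len(lines):
        raise ValueError(f"line {number} out of range")
    if line_tag(lines[number - 1]) != tag:
        raise ValueError(f"line {number} changed or was miscounted; re-read the file")
    lines[number - 1] = new_line
    return "\n".join(lines)


source = "x = 1\nprint(x)\ny = 1\nprint(y)"
edited = replace_line(source, 3, line_tag("y = 1"), "y = 2")

# An off-by-one reference points at "print(x)", so the hash check rejects it.
try:
    replace_line(source, 2, line_tag("y = 1"), "y = 2")
    rejected = False
except ValueError:
    rejected = True
```

The model only has to emit a line number, a short hash, and the new text. The hash acts as a guard: a miscounted line or a stale view of the file fails loudly instead of silently editing the wrong place.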
It's general-purpose enough to do web development. How far can you get from writing programs and seeing if you get the answers you intended? If English words are "grounded" by programming, system administration, and browsing websites, is that good enough?
$2700/mo is about 1/3 of an engineer's salary (cost to the business of a mid-level engineer in the UK)...
But, there's the time to set all of this up (which admittedly is a one-time investment and would amortize).
And there's the risk of having made a mistake in your backups or recovery system (Will you exercise it? Will you continue to regularly exercise it?).
And they're a 3-person team... is it really worth your limited time/capacity to do this, rather than do something that's likely to attract $3k/mo of new business?
If the folks who wrote the blog see this, please share how much time (how many devs, how many weeks) this took to set up, and how the ongoing maintenance burden shapes up.
One thing that is interesting about this is that I wasn't able to get good results from smaller/faster models like claude haiku, so I opted to use a larger model instead. I found that the small delay of about a second was worth it for more consistent results.
I never got the valuation. I (and many others) have built open source agent plugins that are pretty much just as good, in our free time (check out magenta nvim btw, I think it turned out neat!)
I recently finished some updates to my neovim ai plugin:
- context tracking
- multiple threads
- compaction
- sub-agents
I decided to record a video of myself using the plugin to implement another feature. Aside from being a demo of the plugin itself, I think it's a good view into what a typical workflow of using AI for development might look like, and a good illustration of what these agents can and cannot currently do.