
I guess the hope is that by combining two sub-par coding models (xAI's Grok + Cursor's Composer) and the data they have access to, they can build something that can compete with OpenAI / Anthropic in the coding space...

I guess I kinda see it... it makes sense from both points of view (xAI needs data + places to run their models, Cursor needs to not be reliant on Anthropic/OpenAI).

I think I don't see it working out... I just don't see an Elon company sustaining a culture that leads to a high-quality AI lab, even with the data + compute.


Have to call out that comment about Grok Code being sub-par. I used it exclusively when it was free in Cursor and have nothing bad to say about it. And that was months ago. I imagine it's a lot better now.

I have a lot bad to say about it. It was ass compared to OAI/Anthropic models.

It was incredibly fast, but that just meant it was writing buggy code at breakneck speed.


> writing buggy code at breakneck speed

Vibe coding in a nutshell


Wasn't Composer trained on Kimi? Has anyone had a chance to compare the latest Kimi model to Composer?

Composer-2 is based on Kimi K2.5, but with extensive RL. Cursor estimated 3x more compute on their RL than the original K2.5 training run (some details in https://cursor.com/blog/composer-2-technical-report).

Composer-2 seems very useful in Cursor, while K2.6, according to Artificial Analysis, seems to be a really useful general model: https://artificialanalysis.ai/articles/kimi-k2-6-the-new-lea...


I used to hate on Composer 2 but I'm coming around to it. Opus for the big stuff and multi-file operations, Composer for all the small day-to-day IDE tasks: that works pretty well for me.

I'm going to be brutally honest: I have not found Kimi to be useful at all. It simply cannot compete with what the closed models like Codex and Claude offer. I don't want to risk using a model outside the ecosystem and introducing variables, as most of my workflow is baked into two or three large company models.

That's interesting; Kimi K2.5 used through KimiCode was comparable to Sonnet in my tests, and is an excellent alternative to Anthropic models.

That being said, I noticed that Kimi being served through Openrouter providers was trash. Whatever they do on the backend to optimize for throughput really compromised the intelligence of the model. You have to work with Kimi directly if you want the best results, and that's also probably why they released a test suite to verify the intelligence of their new models.


On the other hand, I found MiniMax M2.7 to be a reasonable model that I could trust.

I guess it really depends on taste.


Kimi is my favorite of the Chinese models.

I found it much more consistent than GLM or MiniMax.


Which version of Kimi and served from where?

Can someone please explain: does the Cursor EULA really allow it to train on my code? I wouldn't expect Claude Code or Codex to do that either.

It does unless you opt out

They will, because there is no way to prove they didn't.

But $60B for a VS Code fork?!

Hey, thanks for responding. You're a very evocative writer!

I do want to push back on some things:

> We treat "cognitive primitives" like object constancy and causality as if they are mystical, hardwired biological modules, but they are essentially just

I don't feel like I treated them as mystical - I cite several studies that define what they are and correlate them to certain structures in the brain that developed millennia ago. I agree that ultimately they are "just" fitting to patterns in data, but the patterns they fit are really useful, and were fundamental to human intelligence.

My point is that these cognitive primitives are very much useful for reasoning, and especially the sort of reasoning that would allow us to call an intelligence general in any meaningful way.

> This "all-at-once" calculation of relationships is fundamentally more powerful than the biological need to loop signals until they stabilize into a "thought."

The argument I cite is from complexity theory. It's proof that feed-forward networks are mathematically incapable of representing certain kinds of algorithms.

> Furthermore, the obsession with "fragility"—where a model solves quantum mechanics but fails a child’s riddle—is a red herring.

AGI can solve quantum mechanics problems, but verifying that those solutions are correct still (currently) falls to humans. For the time being, we are the only ones who possess the robustness of reasoning we can rely on, and it is exactly because of this that fragility matters!


> The argument I cite is from complexity theory. It's proof that feed-forward networks are mathematically incapable of representing certain kinds of algorithms.

Claiming FFNs are mathematically incapable of certain algorithms misses the fact that an LLM in production isn't a static circuit, but a dynamic system. Once you factor in autoregression and a scratchpad (CoT), the context window effectively functions as a Turing tape, which sidesteps the TC0 complexity limits of a single forward pass.
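To make the "context window as tape" point concrete, here's a toy sketch (illustrative only, not an LLM): a fixed function that does bounded work per call, but when its output tokens are fed back into its input, the loop as a whole computes parity of an arbitrary-length bit string - a task a single bounded-depth pass can't handle for all input lengths.

```typescript
// Toy model of autoregression: `forwardPass` emits one token per call;
// the driver appends each emitted token back onto the context ("tape").
type Token = string;

// One step: read the last scratchpad token and the next unprocessed bit,
// emit the running parity so far.
function forwardPass(context: Token[]): Token {
  const scratch = context.filter(t => t === "even" || t === "odd");
  const bits = context.filter(t => t === "0" || t === "1");
  if (scratch.length >= bits.length) return "halt"; // all bits processed
  const prev = scratch.length > 0 ? scratch[scratch.length - 1] : "even";
  const bit = bits[scratch.length]; // next bit to fold in
  return bit === "1" ? (prev === "even" ? "odd" : "even") : prev;
}

// Autoregressive loop: emitted tokens become part of the next input.
function run(input: string): Token {
  const context: Token[] = input.split("");
  for (;;) {
    const tok = forwardPass(context);
    if (tok === "halt") break;
    context.push(tok);
  }
  const scratch = context.filter(t => t === "even" || t === "odd");
  return scratch.length > 0 ? scratch[scratch.length - 1] : "even";
}
```

The single `forwardPass` is weak; the expressive power comes entirely from looping it over its own output, which is the CoT/scratchpad argument in miniature.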

> AGI can solve quantum mechanics problems, but verifying that those solutions are correct still (currently) falls to humans. For the time being, we are the only ones who possess the robustness of reasoning we can rely on, and it is exactly because of this that fragility matters!

We haven't "sensed" or directly verified things like quantum mechanics or deep space for over a century; we rely entirely on a chain of cognitive tools and instruments to bridge that gap. LLMs are just the next layer of epistemic mediation. If a solution is logically consistent and converges with experimental data, the "robustness" comes from the system's internal logic.


Thanks for reading, and I really appreciate your comments!

> who feed their produced tokens back as inputs, and whose tuning effectively rewards it for doing this skillfully

Ah, this is a great point, and not something that I considered. I agree that the token feedback does change the complexity, and it seems that there's even a paper by the same authors about this very thing! https://arxiv.org/abs/2310.07923

I'll have to think on how that changes things. I think it does take the wind out of the architecture argument as it's currently stated, or at least makes it a lot more challenging. I'll consider myself a victim of media hype on this, as I was pretty sold on this line of argument after reading this article https://www.wired.com/story/ai-agents-math-doesnt-add-up/ and the paper https://arxiv.org/pdf/2507.07505 ... which brush this off with:

>Can the additional think tokens provide the necessary complexity to correctly solve a problem of higher complexity? We don't believe so, for two fundamental reasons: one that the base operation in these reasoning LLMs still carries the complexity discussed above, and the computation needed to correctly carry out that very step can be one of a higher complexity (ref our examples above), and secondly, the token budget for reasoning steps is far smaller than what would be necessary to carry out many complex tasks.

In hindsight, this doesn't really address the challenge.

My immediate next thought is: even if solutions up to P can be represented within the model / CoT, do we actually feel like we are moving towards generalized solutions, or that the solution space is navigable through reinforcement learning? I'm genuinely not sure where I stand on this.

> I don't have an opinion on this, but I'd like to hear more about this take.

I'll think about it and write some more on this.


This whole conversation is pretty much over my head, but I just wanted to give you props for the way you're engaging with challenges to your ideas!


You seem to have a lot of theoretical knowledge on this, but have you tried Claude or Codex in the past month or two?

Hands-on experience is better than reading articles.

I've been coding for 40 years and after a few months getting familiar with these tools, this feels really big. Like how the internet felt in 1994.


I've been developing an AI coding harness https://github.com/dlants/magenta.nvim for over a year now, and I use it (and Cursor and Claude Code) daily at work.

Fun observation - almost every coding harness (Claude Code, Cursor, Codex) uses a find/replace tool as the primary way of interacting with code. This requires the agent to fully type out the code it's trying to edit, including several lines of context around the edit. This is really inefficient, token-wise! Why does it work this way? Because LLMs are really bad at counting lines, or at using other ways of describing a unique location in a file.
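A minimal sketch of what such a search/replace tool boils down to (names are illustrative, not any particular harness's API). The agent has to reproduce the old text verbatim so the tool can locate one unique anchor, which is exactly where the token cost comes from:

```typescript
// Apply an agent-requested edit: find the search text, require it to be
// unique in the file, and splice in the replacement.
function applyEdit(file: string, search: string, replace: string): string {
  const first = file.indexOf(search);
  if (first === -1) {
    throw new Error("search text not found; agent must retype it exactly");
  }
  // Ambiguous snippet: the agent must include more context lines
  // until the search text identifies exactly one location.
  if (file.indexOf(search, first + 1) !== -1) {
    throw new Error("search text is not unique in the file");
  }
  return file.slice(0, first) + replace + file.slice(first + search.length);
}
```

The uniqueness check is why agents end up quoting several surrounding lines even for a one-character change.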

I've experimented with providing a more robust DSL for text manipulation https://github.com/dlants/magenta.nvim/blob/main/node/tools/... , and I do think it's an improvement over straight search/replace, but the agents do tend to struggle a lot - editing the wrong line, messing up the selection state, etc... which is probably why the major players haven't adopted something like this yet.

So I feel pretty confident in my assessment of where these models are at!

And also, I fully believe it's big. It's a huge deal! My work is unrecognizable from what it was even 2 years ago. But that's an impact / productivity argument, not an argument about intelligence. Modern programming languages, IDEs, spreadsheets, etc... also made a fundamental shift in what being a software engineer was like, but they were not generally intelligent.


> Fun observation - almost every coding harness (claude code, cursor, codex) uses a find/replace tool as the primary way of interacting with code. This requires the agent to fully type out the code it's trying to edit, including several lines of context around the edit. This is really inefficient, token wise! Why does it work this way? Because the LLMs are really bad at counting lines, or using other ways of describing a unique location in the file.

Incidentally, I saw an interesting article about exactly this subject a little ways back, using line numbers + hashes instead of typing out the full search/replace, writing patches, or doing a DSL, and it seemed to have really good success:

https://blog.can.ac/2026/02/12/the-harness-problem/
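For flavor, here's one way such an addressing scheme might look (the details below are invented for illustration; the linked post's actual format may differ). Each line carries a short content hash, the model cites the line number plus the hash, and the harness refuses the edit if the hash is stale:

```typescript
import { createHash } from "node:crypto";

// Short per-line content tag: first 4 hex chars of the line's SHA-256.
function lineTag(line: string): string {
  return createHash("sha256").update(line).digest("hex").slice(0, 4);
}

// Replace a line only if the cited tag still matches its content,
// catching stale line numbers without retyping the surrounding code.
function replaceLine(
  file: string,
  lineNo: number,
  tag: string,
  newText: string
): string {
  const lines = file.split("\n");
  const target = lines[lineNo - 1];
  if (target === undefined) throw new Error("line number out of range");
  if (lineTag(target) !== tag) {
    throw new Error("stale anchor: the file changed under the agent");
  }
  lines[lineNo - 1] = newText;
  return lines.join("\n");
}
```

The appeal is that the edit request costs a handful of tokens instead of the full old text plus context lines.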


It's general-purpose enough to do web development. How far can you get from writing programs and seeing if you get the answers you intended? If English words are "grounded" by programming, system administration, and browsing websites, is that good enough?


$2700/mo is about 1/3 of an engineer's salary (the cost to the business of a mid-level engineer in the UK)...

But, there's the time to set all of this up (which admittedly is a one-time investment and would amortize).

And there's the risk of having made a mistake in your backups or recovery system (Will you exercise it? Will you continue to regularly exercise it?).

And they're a 3-person team... is it really worth your limited time/capacity to do this, rather than do something that's likely to attract $3k/mo of new business?

If the folks who wrote the blog see this, please share how much time (how many devs, how many weeks) this took to set up, and how the ongoing maintenance burden shapes up.


You can get a decent Eastern European engineer for a $2700 (after-tax) salary.


For folks who use neovim, there's always https://github.com/dlants/magenta.nvim , which is just as good as claude code in my (very biased) opinion.


magenta nvim


magenta nvim implements a really nice integration of coding agents.


I added next-edit-prediction to my neovim plugin.

This was pretty interesting to implement!

- I used an [lsp server](https://github.com/dlants/magenta.nvim/pull/162/files#diff-3...) to track opened files and aggregate text changes to get a stream of diffs.

- I then feed that along with the [context surrounding the cursor](https://github.com/dlants/magenta.nvim/pull/162/files#diff-1...), and a [system prompt](https://github.com/dlants/magenta.nvim/pull/162/files#diff-a...) into an LLM, [forcing a tool use for a find/replace within the context window](https://github.com/dlants/magenta.nvim/pull/162/files#diff-1...)

- Finally, I show the find/replace in the buffer using [virtual text extmarks](https://github.com/dlants/magenta.nvim/pull/162/files#diff-1...), applying a comment effect to the added sections, and a strikethrough to the removed sections
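The pipeline above can be sketched roughly like this (shapes and names here are illustrative, not magenta.nvim's actual code): recent diffs plus a window of text around the cursor go into the prompt, and the model is forced to respond with a find/replace constrained to that window.

```typescript
// Hypothetical shapes for the prediction request.
interface Diff { file: string; before: string; after: string }
interface Prediction { find: string; replace: string }

// Assemble the prompt from the diff stream and the cursor context window.
function buildPrompt(recentDiffs: Diff[], cursorWindow: string): string {
  const history = recentDiffs
    .map(d => `--- ${d.file}\n-${d.before}\n+${d.after}`)
    .join("\n");
  return [
    "Recent edits:",
    history,
    "Text around cursor:",
    cursorWindow,
    "Predict the next edit as a find/replace inside the text above.",
  ].join("\n\n");
}

// Before rendering the prediction as virtual text, the harness would check
// that the forced find/replace actually lands inside the context window.
function validate(pred: Prediction, cursorWindow: string): boolean {
  return cursorWindow.includes(pred.find);
}
```

Constraining the tool call to the cursor window is what makes the result cheap to validate and render as extmarks.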

One thing that is interesting is that I wasn't able to get good results from smaller/faster models like Claude Haiku, so I opted to use a larger model instead. I found that the small delay of about a second was worth it for more consistent results.

I also opted to have this be manually triggered (Shift-Ctrl-l by default in insert or normal mode). This is a lot [less distracting](https://unstable.systems/@sop/114898566686215926).

One cool thing is that you can use a plugin parameter, or a project-level parameter to [append to the system prompt](https://github.com/dlants/magenta.nvim/blob/main/node/option...). I think by providing additional [examples of how you want it to behave](https://github.com/dlants/magenta.nvim/blob/main/node/provid...), you can have it be a lot more useful for your specific use-case.


You can see a gif of it in the readme https://github.com/dlants/magenta.nvim?tab=readme-ov-file


I never got the valuation. I (and many others) have built open source agent plugins that are pretty much just as good, in our free time (check out magenta nvim btw, I think it turned out neat!)


I recently finished some updates to my neovim AI plugin:

- context tracking

- multiple threads

- compaction

- sub-agents

I decided to record a video of myself using the plugin to implement another feature. Aside from a demo of the plugin itself, I think it's a good view into what a typical workflow of using AI for development might look like, and I think is a good illustration of what these agents can and cannot currently do.

Find out more here: https://github.com/dlants/magenta.nvim

