Hacker News | AgentMatrixAI's comments

Reading this makes me sad at how different my generation was compared to the new ones.

I remember when Counter-Strike 1.3 came out and everybody at my school was talking about it and playing it. We would line up at the computer labs before lunch, pay a toonie, and the entire room would crackle with in-game radio comms, AK-47 fire and HE grenades going off, a room full of people side by side excitedly shouting for an hour until lunch was over.

When classes finished we would head back to the lab and play endless rounds of de_dust 1 & 2, de_rats, fy_iceworld, and the occasional as_oilrig, with the rush of being the VIP and the thrill of my first headshot.

Sometimes the admin running the labs would add fun mods like no gravity and weird stuff....

It was such a memorable and social time, and it stands in complete contrast to the everything-is-gambling culture that has taken hold....


This is my first time making something 3D, and code-wise Codex has been very useful, even for creating an in-game 3D editor as well as the netcode.

The trickiest part is really the 3D itself; it comes with a lot of extra scope you normally take for granted: animation, UV texturing, rigging for humanoids, making sure stuff doesn't clip through, etc.

Still learning Blender, but it's slow going. I haven't tried the MCP for it yet, but I want to get proficient enough to produce PSX-style models and textures...


I'm kind of keen to see what mess Claude code could do with a small Unreal Engine 5 C++ project. Or what clever tricks it could actually pull off in that environment.


Amazing timing! I've been needing something like this, since Blender has been tough.

I've been working on a cybercafe simulator with Three.js and Codex, and animation is hard to get right.

https://x.com/AgentifySH/status/1981761490274464068


Anybody else just wrapping SOPS in a REST API and using that? From my experience that feels just as good. While I think Vault is useful for large companies, I just need something to encrypt and decrypt without relying on pgcrypto.
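For illustration, a minimal sketch of what "wrapping SOPS in a REST API" could look like: a tiny HTTP server that shells out to the `sops` CLI. The endpoint names and port are made up, it assumes `sops` is on PATH with keys already configured, and a real deployment would obviously need auth and TLS.

```python
# Toy REST wrapper around the sops CLI. Hypothetical endpoints:
# GET /decrypt?file=<server-side path> and GET /encrypt?file=<path>.
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

def sops_cmd(action, path):
    """Build the sops invocation; action is 'encrypt' or 'decrypt'."""
    if action not in ("encrypt", "decrypt"):
        raise ValueError(f"unknown action: {action}")
    return ["sops", "--" + action, path]

class SopsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        url = urlparse(self.path)
        qs = parse_qs(url.query)
        try:
            cmd = sops_cmd(url.path.strip("/"), qs["file"][0])
        except (ValueError, KeyError):
            self.send_error(400)
            return
        out = subprocess.run(cmd, capture_output=True)
        self.send_response(200 if out.returncode == 0 else 500)
        self.end_headers()
        self.wfile.write(out.stdout if out.returncode == 0 else out.stderr)

# To serve (blocks forever):
# HTTPServer(("127.0.0.1", 8321), SopsHandler).serve_forever()
```

That's the whole "secrets service": encryption, decryption, and key management all stay in sops/age/KMS, and the wrapper is just plumbing.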


I'm not really convinced; the benchmark blunder was strange and the demos were quite underwhelming, and this seems reflected in a huge correction in the betting markets on who will have the best AI by the end of the year.

What excites me now is that Gemini 3.0, or some answer from Google, is coming soon, and that will be the one I actually end up using. It seems like the last mover in the LLM race has the advantage.


Polymarket bettors are not impressed. Based on the market odds, OpenAI had a 35% chance of having the best model at year end, but those odds have dropped to 18% today.

(I'm mostly making this comment to document what happened for the history books.)

https://polymarket.com/event/which-company-has-best-ai-model...


After a few hours with GPT-5, I'd trade that spread. Not that I think oAI will win at end of year, but I think GPT-5 is better than it looks on the benchmark side. It is very, very good at something we don't have many benchmarks for: keeping track of where it's at. Codex is vastly better in practice than Claude Code or Gemini CLI right now.

On the chat side, it's also quite different, and I wouldn't be surprised if people need some time to develop a taste and a preference for it. I ask most models to help me build a MacBook Pro charger in 15th-century Florence, with the instruction that I start with only my laptop and can only chat for four hours before the battery dies. 5 was notable in that it thought through a bunch of second-order implications of plans and offered some unusual things, including a list of instructions for a foot-treadle-based split-ring commutator + generator, in 15th-century Florentine Italian(!). I have no way of verifying whether the Italian was correct.

Upshot - I think they did something very special with long context and iterative task management, and I would be surprised if they don't keep improving 5, based on their new branding and marketing plan.

That said, to me this is one of the first 'product release' moments in the frontier model space. 5 is not so much a model release as a polished-up, holes-fixed, annoyances-reduced/removed, 10x faster type of product launch. Google (current polymarket favorite) is remarkably bad at those product releases.

Back to betting: I bet there's a moment this year where those numbers move 10% in oAI's favor.


I would agree. I am a big fan of Claude and I've used Claude Code a bunch, but after testing Codex & GPT-5 extensively, it gets stuck in a rut far less often and is much more often able to pinpoint issues and fixes in the codebase.


How on Earth does that market have Anthropic at 2%, in a dead heat with the likes of Meta? If the market was about yesterday rather than 5 months from now I think Claude would be pretty clearly the front runner. Why does the market so confidently think they’ll drop to dead last in the next little while?


It's because those markets are based on the LLM Arena leaderboard (https://lmarena.ai/), where Claude has historically done poorly.

That eval has also become a lot less relevant (it's considered not very indicative of real-world performance), so it's unlikely Anthropic will prioritize optimizing for it in future models.


Anthropic has always been one of the best at not optimizing for stupid metrics. Rather, they spend significant energy researching weaknesses and building metrics around that. Google is also pretty on point IMO, but they can also afford to dedicate to these nonsense metrics as they are still good marketing.

Meanwhile Meta and Xai are behind the ball and largely marketing focused.


True. I'm surprised they are not based on e.g. OpenRouter usage or similar.


How is Claude doing on the benchmark that market is based on? Maybe not so good? Idk. Just because Claude is good for real world use doesn't mean it's winning the benchmark, but the benchmark is all that matters for the Polymarket.


I'm a fan of Anthropic for this reason. I use Claude and it's very good most of the time for my coding requirements.

Generally, when you have a lot of companies competing to show whose product X does the best at Y, there are a lot of monetary incentives to tune the products to perform well specifically on those types of tests.


If you think it's wrong, participate. That's the only way prediction markets end up predicting anything.


Ah, yes, if you disagree you must participate in real money gambling based on the outcome of a single user-based, single-prompt leaderboard.


Well I for example don't give a shit what prediction markets do and never participated, but if someone thinks they're wrong, they should just participate and get free money. Otherwise why complain.


I wasn't complaining per se; I was asking for (and expecting) a legitimate reason. Which I got: that the market is resolved purely based on LLM Arena, which Anthropic has never done well on (which says more about the benchmark than about Anthropic).


You got a random person saying a random thing. There's no explanation for a market. The same way the stock market doesn't move for the reason the articles say it does. Everyone on each side has their own multitude of reasons.


I think they also base their expectations on release cycles and update speed. Anthropic is known for a more conservative release cycle and incremental updates, while Google has accelerated recently. It also seems that other actors are better at benchmark cheating ;)


I find this confusing too. I dropped my OpenAI subs for Claude a while back and I don't feel like I'm missing much.

I need to spend some more time with Gemini too though. I was using that as a backend for Cursor for a while and had some good results there too.


Claude is a useful tool, IMO the most useful one even, but not a road to AGI.


I mean, if you feel strongly enough that it will be #1 at the end of year then $100 now would net you $3000 end of year... Do bear in mind what my sibling said about the specific benchmark that is being used, though.


That bet does not seem to be very illuminating. Winner is likely who happens to release closest to end of year, no?


Looking at LMArena, which Polymarket uses, I'm not surprised. Based on the little data there is (~3k duels), it's possibly worse than Gemini; it lost to Gemini 2.5 Pro more than it won in direct duels. Not sure why the Elo is still higher; possibly GPT-5 did more clearly better against bad models, which I don't care about.


The Musk effect is pretty crazy. Or is there another explanation for why x can compete with Google?


Elon's Y Combinator interview was pretty good. He seemed more in his element back among the hacker crowd (rather than dirty politics), and seemed to be doing hackery things at X, like renting generators and mobile cooling vans and just putting them in the car park outside a warehouse to train Grok, since there were no data centres available and he was told it would take 2 years to set it all up properly.

I think he's just good at attracting good talent, and letting them focus on the right things to move fast initially, while cutting the supporting infra down to zero until it's needed.


Are you talking about this:

https://futurism.com/elon-musk-memphis-illegal-generators

It's hackery but also kind of sociopathic to dump a bunch of loud, dirty generators in the middle of a low-income community. Go set your data center up on Martha's Vineyard and see how long the residents put up with it.


Thinking more cynically: political corruption and connections I'm guessing? Just a couple months ago Musk was treating the US government like his personal playground.


Because they started so late but somehow managed to make something close to SOTA?

Either that, or people think Trump will just give Elon a $500B government contract...


They have a lot of compute already and Grok 4 was pretty strong?


They've managed to acquire compute remarkably quickly, and I'm no Musk lover.


You don't actually think Polymarket odds carry any significant weight on actual outcomes, do you?


It's not that they're not impressed; it's just that Google came out with steerable video gen.


That was a few days ago. The big drop in that Polymarket I mentioned all happened today. It was a reaction to GPT-5 specifically.


> Polymarket betters are not impressed. Based upon the market odds, OpenAI had a 35% chance to have the best model (at year end)

who will decide the winner to resolve bets?


I am convinced. I've been giving it tasks the past couple hours that Opus 4.1 was failing on and it not only did them but cleaned up the mess Opus made. It's the real deal.


In that same vein, I had just tried Opus 4.1 yesterday, and it successfully completed tasks that Sonnet 4 and Opus 4 failed at.


When it came out on Tuesday I wanted to throw my laptop out of the window. I don't know what happened but results were total garbage earlier this week. It got better the past couple days but so far with gpt-5 being able to solve problems without as much correction I'm going to use it more.


Interesting, I've had the complete opposite experience. Opus 4.1 feels like a generational improvement compared to GPT-5.


It is funny how it can be like this sometimes. I think a lot depends on coding styles, languages, prompting, etc.


And it's almost 10x cheaper via flex, and in #1 position on lmarena. It's not even close.


The real last mover is Apple, because boy are they not moving.


As an iOS dev, I really hope they acquire Anthropic before it’s too expensive.


As a Linux developer, I hope they do not.


Wow would that be horrible


How so? This is not scarce technology. Claude Code has a few months on the other tools, tops.


I really don't want the already trillion dollar mega monopoly to own the world.


I would rather the already trillion dollar mega monopoly own the world than "Open"Ai


Yea, maybe it's naive, but I've started leaning towards preferring the devil I know. It also helps that Gemini is great.


Plus it's the mega monopoly that is already being scrutinized by the government. Every tech company seems to start out with too much credibility that it has to whittle down little by little before we really hold them accountable.


Google is multiple orders of magnitude closer to being 'owned by the people' than a privately held for-profit charity.


Yes, I would only prefer Gemini because google is under scrutiny, not because I think I know alphabet better than openAI. I think it’s a changing beast and no one can “know” it, it’s an illusion created by the brand, underneath it, it’s different every day.


Are we forgetting that they're getting more evil, not less?

They just removed ManifestV2.


If you think manifest v2 is related to being more evil you have to rethink your sense of ethics. Companies of that size regularly engage in business that results in the deaths of many innocent people. Overall Google does quite well by many metrics compared to its peers.


Yea we’re in Silicon Valley’s Lex Luthor era. World Coin is just really next level though compared to most Google things. Sama has kinda always been going for the Lex Luthor vibe.


Growing up in a Southern Baptist household where televangelists preached the end of the world every day at 4 PM, World Coin has some serious Antichrist and Revelation vibes. I'll give you that point.


Which betting markets were you referring to and where can they be viewed?



Polymarket has a whole AI category https://polymarket.com/search/ai?_sort=volume of markets.


The demos were awful. It felt like watching sloppy vibe coded css UIs


GPT-5 high reasoning is a big step up from o3.


I haven't tried Qwen, but I've used these "near instant token" providers like Groq, and another one that uses a diffusion model to generate code via LLaMA, and the results weren't satisfactory.

Now if something like Gemini 2.5 Pro, or even Sonnet 4, could run on Cerebras, generating tens of thousands of tokens of code in a few seconds, that could really make a difference.


Been using proxmox for a home lab and I still can't believe how much value they provide for free.

I use it with Cursor to create VM templates and clone them via a Proxmox MCP server I've been adding features to, and it's been incredibly satisfying to just prompt "create template from vm 552 and clone a full VM with it".

https://github.com/agentify-sh/cursor-proxmox-mcp
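Under the hood, an MCP tool like that mostly just hits the Proxmox VE REST API. As a rough sketch (host, node, and token values are placeholders; a lab setup would also need to handle self-signed certs), the clone step boils down to one authenticated POST:

```python
# Build the HTTP request for Proxmox VE's qemu clone endpoint:
# POST /api2/json/nodes/{node}/qemu/{vmid}/clone
import urllib.parse
import urllib.request

def clone_request(host, node, vmid, newid, token, full=True):
    """Clone VM/template `vmid` on `node` into a new guest `newid`."""
    url = f"https://{host}:8006/api2/json/nodes/{node}/qemu/{vmid}/clone"
    body = urllib.parse.urlencode({"newid": newid, "full": int(full)}).encode()
    req = urllib.request.Request(url, data=body, method="POST")
    # API tokens avoid the ticket/CSRF dance needed for password auth.
    req.add_header("Authorization", f"PVEAPIToken={token}")
    return req

# To actually fire it (needs a reachable cluster and a valid token):
# urllib.request.urlopen(clone_request("pve.local", "pve1", 552, 900,
#                                      "root@pam!mcp=SECRET"))
```

The "create template" half is a similar POST to the vmid's `template` endpoint; the MCP server is essentially a thin tool layer over calls like this.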


Yes, same here. Big shout-out to https://community-scripts.github.io/ProxmoxVE as well.


I use it for DevOps at work and it’s just wonderful. The data center features alone are worth the license fees .. but what I like most of all is how easy it makes managing ZFS pools.


To be fair, Proxmox is essentially a UX wrapper around QEMU/KVM, which is free software and the true kernel of value. If you are going the MCP route I wonder if a direct QEMU or libvirt MCP server would be much more powerful and precise.


While UI/UX is–as probably everywhere–a huge topic, we actually have spent most engineering power in the whole management stack. And of that managing QEMU/KVM–while surely significant–is by far not the biggest part of our also 100% free and open source code bases. I'd invite you to try our full feature set, from clustering, to SDN to integrated Ceph management, to containers, backups including third party and our own backup server, and many more, all accessible through a REST API with full permission management and modern access control.

And we naturally try to contribute to every project we use however possible, be it in the form of patches, QA, detailed bug reports, ... and especially for QEMU et al. we're pretty content with our impact, at least compared to resources of other companies leveraging it, I think.

If all it'd take is being "just" a simple UI wrapper, it would make lots of things way easier for us :-)


> I'd invite you to try our full feature set, from clustering, to SDN to integrated Ceph management, to containers, backups including third party and our own backup server, and many more, all accessible through a REST API with full permission management and modern access control.

This would be appealing in a world where Kubernetes doesn't exist as a mature option.

Don't the vast majority of Proxmox users use it for small VM labs, without all the bells and whistles?


> This would be appealing in a world where Kubernetes doesn't exist as a mature option.

The feature sets of Kubernetes and Proxmox VE do not really overlap completely IMO, or would need a much more hands-on approach and specialized in-house staff to set up and maintain; why go for that if you can do everything one needs with much less headache?

The former needs much more dedicated management and maintenance resources and often pulls in more complexity than most need. One also has to differentiate between those developing and releasing their own applications, who are fine with (and quite possibly want) what Kubernetes offers, and those providing infrastructure for internal use or just favoring more monolithic applications; it often boils down to taste and what one is comfortable with. Our enterprise customers are very diverse, from small shops with two or three node clusters hosting the office infra to five-digit host counts with six-digit VM counts spread out. We even know a few setups using Proxmox VE as the basis for their Kubernetes cluster.

Finally, PVE is quite a bit older than Kubernetes, we still exist and see a lot of adoption (already before the Broadcom deal, albeit there was an uptick since then), so even without some technical comparison of features or use cases or the like it seems clear that Kubernetes isn't an alternative for all, just like Proxmox VE certainly isn't one.

> Don't the vast majority of Proxmox users use it for small VM labs, without all the bells and whistles?

One should not confuse being very popular in the home lab scene due to being very approachable, simple to set up, and most importantly 100% FLOSS, not just open core or the like, with it being the main or target audience, but we're really happy with the synergies it provides.

And actually I'd not frame it as bad thing that Proxmox VE is used that way, we do not want to be a club that is hard to access, neither cost wise nor hindering scaling smaller VM labs to bigger setups, and certainly not from a complexity barrier.


Kubernetes also has off-the-shelf distros that are more apples-to-apples with Proxmox VE.

Let's not pretend that Proxmox or any of these are silver bullets that kill the (inherent) complexity demon. Anything touching SDN or clustered storage, in any ecosystem, will need dedicated in-house experts that know networking, storage, Linux and how Proxmox (or Kubernetes) approaches those domains.

Unless you are just using Proxmox for small VM labs, in which case it ought to be compared with libvirt and standalone QEMU.


Kill? No, definitely not. But (1) not adding a lot of extra and (2) applying the Pareto principle make a huge difference here, i.e. making the essential parts approachable enough to get one started without too steep an upfront learning curve, while not blocking one later by limiting one to just the predefined path.

> Anything touching SDN or clustered storage, in any ecosystem, will need dedicated in-house experts that know networking, storage, Linux and how Proxmox (or Kubernetes) approaches those domains.

There are widely different levels of expertise needed though, and setups that are often managed by admins without in-depth expertise in clustering or SDN can still get things done with Proxmox VE; and if they run out of ideas, we have our enterprise support and naturally also the very active and friendly community forum to help.

> Unless you are just using Proxmox for small VM labs, in which case it ought to be compared with libvirt and standalone QEMU.

Yeah, I really do not get that point; you basically invalidate 95% of Proxmox VE's feature set because it might not be fully leveraged by a specific user group and because there are different solutions that also allow one to do similar things. That would also invalidate Kubernetes; it's not completely unpopular in small labs either, after all.

To be honest, it feels a bit like a justification attempt for the initial post in this chain, which brushed off Proxmox VE as just some small UI/UX wrapper around QEMU/KVM with all the Real Work™ being done by others, possibly because you never actually used it. But I might be reading it the wrong way, and I'm certainly not offended in any way, I just find it a bit odd.


My OP is more about promoting understanding of underlying VM technologies. Proxmox adds value, but also complexity of its own abstractions. QEMU and libvirt don't have salespeople trolling the internet to promote their use, so there is less awareness of what these core technologies are capable of.


Note that I'm not a salesperson, and neither is the one who made the original post, as that is Olaf from the Perl foundation, who reached out to me after I made a contribution to one of his Perl projects, if you must know. Tbh. I didn't even know that he would post this on some channels like here or reddit before getting pinged by a colleague that we were mentioned on the front page of HN. And for a fact, I actually know (and enjoy!) several people from the QEMU and libvirt developer community actively posting in other sites' comment sections, or is that now a bad thing too?

And FWIW, I tried several times to point out that QEMU itself is only a small part of what we provide–even if not, just providing a good API abstraction around that is significant work, especially if it should allow two decades (and counting!) of stable upgrade paths and _without_ libvirt. And we nowhere hide the underlying technologies, we're proudly building upon–and trying to give back–to all projects we use, be it Debian, QEMU/KVM, LXC (which we co-maintain), Linux kernel, FRR, rust, or–like here–Perl...

But as you're rather dismissive and now even start to call people trolling I hardly see any need to take your writings as serious discussion, they do not seem to be done in good faith, and IMO doing it this way certainly won't help to promote FLOSS, that should be possible without being dismissive to others work.


Have you used proxmox? It makes stuff easy, that was really complicated before proxmox entered the stage.


You started with saying it was "essentially a UX wrapper", it was explained that there is a lot more to it, so you immediately shift into (paraphrased) "well there's other options and most people don't use the other features anyways".

Would have been cool of you to just say "oh neat, I didn't realize that you did all that too".


Do you really think people are using Proxmox MCP servers to administer mission-critical distributed computing systems that leverage SDN and clustered storage, rather than small VM labs? Because that was not my assumption.


That's very much not the point I was making, at all.

There could be tons! Or not many. It's completely irrelevant to my comment.


> Don't the vast majority of Proxmox users use it for small VM labs, without all the bells and whistles?

In my little world this largely was true until Broadcom bought VMware and proceeded to blow it up.

I know of a handful of rather “large” (100s of physical servers at least) VMware based products that are as quickly as possible migrating to proxmox.

Given how universal this is for my little slice of IT, I imagine this is quite pervasive for the “mid tier” VMware organization.


While you could do that, proxmox offers lot of value with its UI which I need to default to time to time. With just an API key I generate from proxmox I have a wide range of capability that I can hook up an MCP server to.

The funny thing is with Cursor I can just generate a new capability, like the clone and template actions were created after asking Sonnet 4.


Calling Proxmox a wrapper for KVM is hilarious, you're ignoring that Proxmox does all the work to make a functional cluster of VM servers including stuff like shared storage and live migrations and networking. If you only use Proxmox on a single server with local storage then I could see how you would say this but having a fleet of VMs on a cluster of servers where you can take down physical hosts to patch transparently is the "hard problem."


It's an annoying thing common to a lot of technologists. They (we?) see a product which solves a problem, then imagine some hacky way to set up existing tools to half-way solve a similar problem in an easy but incomplete way, then make fun of the product for existing when the hacky solution exists. The Hacker News comment on Docker's announcement post comes to mind, where a person makes fun of the concept when you could "just" run rsync in a cron job.


I think you might mean Dropbox, but your point still stands.

For context, here's that classic thread: https://news.ycombinator.com/item?id=8863 :)


Haha, yeah I meant Dropbox. Derp


Proxmox has a UI and a bunch of APIs so I don't need to rebuild them myself, and maintains everything quite well (all major upgrades I've done have been pretty seamless). Proxmox is definitely an easy path, and you still have root access for drawing outside the lines.


This is true for the act of launching VMs, but it’s pretty reductive towards the entire suite of important features that Proxmox provides like clustering, high availability, integration with various storage backends, backups, and more that qemu doesn’t.


I mean, that's actually not being fair... It's like saying Windows is just a UX wrapper around a microkernel. There is quite a bit of functionality provided by that wrapper.


It's not. Their SDN, built on FRR and VXLAN, is itself a complex piece that has been missing from the free space (integrated as a package).


What do you use all these VMs for in your homelab? I've dabbled with Proxmox in the past but settled on plain Ubuntu for my home server that I now treat as a pet managed with Ansible.


Take your pick. Everyone wants different things. This site/repo is pretty great.

https://community-scripts.github.io/ProxmoxVE/scripts


For me Proxmox is mainly a means to be able to have more than 1 pet (partially for simplicity's sake of not having to make everything play well together in the same install, partially because I have some things which require Windows and some things which require Ubuntu).

I guess I do also sometimes use it for ephemeral things without having to worry about cleaning up after too. E.g. I can dork around with some project I saw on GitHub for an afternoon then hit "revert to snapshot" without having to worry about what that means for the permanent stuff running at the same time.


I personally self-host a bunch of stuff for myself and my household. Nextcloud for my phone, mattermost for in-house communication, private wordpress as a multimedia diary, a bunch of experiments, wekan for organization, network storage, network printer.

I found Turnkey Linux pretty nice. They provide ready to use Linux images for different services. Proxmox integrates with them, so for example to install Nextcloud, all I needed to do is to click around a bunch on the Proxmox interface, and I'm good to go. They have around 80-90 images to choose from.


immich, n8n, openwebui, metube, hoarder, gethomepage, freshrss, tailscale, reverse proxy, on and on it goes.


I do that with containers running on a single-node Kubernetes cluster (k3s). Doing it via the Proxmox UI feels like I'm giving up "control". Maybe that's just because doing it with Kubernetes etc. is closer to how I'd do it at work.


It gives me a direct bridge from Cursor to a VM, for local dev and testing out open-source projects.

I like having a local server I can carry with me and control using just Cursor to manage it.

So basically the freedom that comes with a homelab without using proxmox UI and ssh.


So they are profiling people using Pixel phones with GrapheneOS... because it's good at what it does? Am I reading this right?


I'm not so optimistic, as someone who works on agents for businesses and creates tools for them. The leap from the low 90s to 99% is the classic last-mile problem for LLM agents. The more generic and spread out an agent is (can-do-it-all), the more likely it will fail and disappoint.

Can't help but feel many are optimizing happy paths in their demos and hiding the true reality. Doesn't mean there isn't a place for agents but rather how we view them and their potential impact needs to be separated from those that benefit from hype.

just my two cents


In general, most of the AI "breakthroughs" of the last decade were backed by proper scientific research and ideas:

- AlphaGo/AlphaZero (MCTS)

- OpenAI Five (PPO)

- GPT 1/2/3 (Transformers)

- Dall-e 1/2, Stable Diffusion (CLIP, Diffusion)

- ChatGPT (RLHF)

- SORA (Diffusion Transformers)

"Agents" is a marketing term and isn't backed by anything. There is little data available, so it's hard to have generally capable agents in the sense that LLMs are generally capable.


I disagree that there isn't an innovation.

The technology for reasoning models is the ability to do RL on verifiable tasks, with some (as-yet-unpublished, but well-known) search over reasoning chains, with a (presumably neural) reasoning-fragment proposal machine and a (presumably neural) scoring machine for those reasoning fragments.

The technology for agents is effectively the same, with some currently-in-R&D way to scale the training architecture for longer-horizon tasks. ChatGPT agent or o3/o4-mini are likely the first published models that take advantage of this research.

It's fairly obvious that this is the direction that all the AI labs are going if you go to SF house parties or listen to AI insiders like Dwarkesh Patel.


Fair enough, I guess, even though the concept of agents/agentic tasks popped up before reasoning models were really a thing.


The idea of chatbots existed before ChatGPT, does that mean it's purely marketing hype?


'Agents' are just a design pattern for applications that leverage recent proper scientific breakthroughs. We now have models that are increasingly capable of reading arbitrary text and outputting valid json/xml. It seems like if we're careful about what text we feed them and what json/xml we ask for, we can get them to string together meaningful workflows and operations.

Obviously, this is working better in some problem spaces than others; seems to mainly depend on how in-distribution the data domain is to the LLM's training set. Choices about context selection and the API surface exposed in function calls seem to have a large effect on how well these models can do useful work as well.
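That design pattern, a model reading text and emitting structured tool calls in a loop, can be sketched in a few lines. Everything here is illustrative: the tool names, the JSON shape, and the stubbed "model" (a stand-in for a real LLM call) are all made up.

```python
# Toy agent loop: a (stubbed) model emits JSON "tool calls" until it
# produces a final answer. A real agent would call an LLM API instead.
import json

TOOLS = {
    "add": lambda args: args["a"] + args["b"],
}

def stub_model(history):
    """Stand-in for an LLM: looks at the history, returns the next JSON step."""
    if not any(m["role"] == "tool" for m in history):
        return json.dumps({"tool": "add", "args": {"a": 2, "b": 3}})
    result = [m for m in history if m["role"] == "tool"][-1]["content"]
    return json.dumps({"answer": f"2 + 3 = {result}"})

def run_agent(task, model=stub_model, max_steps=5):
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = json.loads(model(history))
        if "answer" in step:                      # model is done
            return step["answer"]
        result = TOOLS[step["tool"]](step["args"])  # execute the tool call
        history.append({"role": "tool", "content": result})
    raise RuntimeError("agent did not finish within max_steps")

print(run_agent("what is 2 + 3?"))  # → 2 + 3 = 5
```

The "agent" is just this loop plus context management; all the intelligence lives in the model, which is why how in-distribution the task is matters so much.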


My personal framing of "Agents" is that they're more like software robots than they are an atomic unit of technology. Composed of many individual breakthroughs, but ultimately a feat of design and engineering to make them useful for a particular task.


Agents have been a field in AI since the 1990s.

MDP, Q-learning, TD, RL, and PPO are basically all about agents.

What we have today is still very much the same field as it was.


Yep. Agents are only powered by clever use of training data, nothing more. There hasn't been a real breakthrough in a long time.


"Long time" as in, 7 months since o1 and reasoning models were released? That was a pretty big breakthrough.


In the context of our conversation and what OP wrote, there has been no breakthrough since around 2018. What you're seeing is the harvesting of all the low-hanging fruit from a tree that was discovered years ago. But the fruit is almost gone. All top models perform at almost the same level. All the "agents" and "reasoning models" are just products of training data.

I wrote more about it here:

https://news.ycombinator.com/item?id=44426993

You may also be interested in this article, that goes into details even more:

https://blog.jxmo.io/p/there-are-no-new-ideas-in-ai-only


This "all breakthroughs are old" argument is very unsatisfying. It reminds me of when people would describe LLMs as being "just big math functions". It is technically correct, but it misses the point.

AI researchers spent years figuring out how to apply RL to LLMs without degrading their general capabilities. That's the breakthrough. Not the existence of RL, but making it work for LLMs specifically. Saying "it's just RL, we've known about that for ages" does not acknowledge the work that went into this.

Similarly, using the fact that new breakthroughs look like old research ideas is not particularly good evidence that we are going to head into a winter. First, what are the limits of RL, really? Will we just get models that are highly performant at narrow tasks? Or will the skills we train LLMs for generalise? What's the limit? This is still an open question. RL for narrow domains like Chess yielded superhuman results, and I am interested to see how far we will get with it for LLMs.

This also ignores active research that has been yielding great results, such as AlphaEvolve. This isn't a new idea either, but does that really matter? They figured out how to apply evolutionary algorithms with LLMs to improve code. So, there's another idea to add to your list of old ideas. What's to say there aren't more old ideas that will pop up when people figure out how to apply them?

Maybe we will add a search layer with MCTS on top of LLMs to allow progress on really large math problems by breaking them down into a graph of sub-problems. That wouldn't be a new idea either. Or we'll figure out how to train better reranking algorithms to sort our training data, to get better performance. That wouldn't be new either! Or we'll just develop more and better tools for LLMs to call. There's going to be a limit at some point, but I am not convinced by your argument that we have reached peak LLM.


I understand your argument. The recipe that finally let RLHF + SFT work without strip-mining base knowledge was real R&D, and GPT-4-class models wouldn't feel so "chatty but competent" without it. I just still see ceiling effects that make the whole effort look more like climbing a very tall tree than building a Saturn V.

GPT-4.1 is marketed as a "major improvement," but under the hood it's still the KL-regularised PPO loop OpenAI first stabilized in 2022, only with a longer context window and a lot more GPUs for reward-model inference.
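Concretely, that loop optimizes roughly this objective (the standard textbook formulation, not OpenAI's exact internal recipe):

```latex
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\big[\, r_\phi(x, y) \,\big]
\;-\; \beta\,
\mathrm{KL}\!\big(\, \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \,\big)
```

where r_phi is the learned reward model, pi_ref is the frozen SFT policy, and beta is the KL penalty keeping the tuned model from drifting too far from it. Longer context and more reward-model GPUs change the scale, not the objective.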

They retired GPT-4.5 after five months and told developers to fall back to 4.1. The public story is "cost to serve," not breakthroughs left on the table. When you sunset your latest flagship because the economics don't close, that's not a moon-shot trajectory; it's weight shaving on a treehouse.

Stanford's 2025 AI Index shows that model-to-model spreads on MMLU, HumanEval, and GSM8K have collapsed to low single digits; performance curves are flattening exactly where compute curves are exploding. A fresh MIT CSAIL paper modelling "Bayes slowdown" makes the same point mathematically: every extra order of magnitude of FLOPs is buying less accuracy than the one before.[1]

A survey published last week[2] catalogs the 2025 state of RLHF/RLAIF: reward hacking, preference-data scarcity, and training instability remain open problems, just mitigated by ever-heavier regularisation and bigger human-in-the-loop funnels. If our alignment patch still needs a small army of labelers and a KL muzzle to keep the model from self-lobotomising, calling it "solved" feels optimistic.

Scale, fancy sampling tricks, and patched-up RL got us to the leafy top: chatbots that can code and debate decently. But the same reports above show the branches bending under compute cost, data saturation, and alignment tax. Until we swap out the propulsion system for new architectures, richer memory, or learning paradigms that add information instead of reweighting it, we're in danger of planting a flag on a treetop and mistaking it for Mare Tranquillitatis.

Happy to climb higher together, friend, but I'm still packing a parachute, not a space suit.

1. https://arxiv.org/html/2507.07931v1

2. https://arxiv.org/html/2507.04136v1


But that's how progress works! To me it makes sense that llms first manage to do 80% of the task, then 90, then 95, then 98, then 99, then 99.5, and so on. The last part IS the hardest, and each iteration of LLMs will get a bit further.

Just because it didn't reach 100% just yet doesn't mean that LLMs as a whole are doomed. In fact, the fact that they are slowly approaching 100% shows promise that there IS a future for LLMs, and that they still have the potential to change things fundamentally, more so than they did already.


But they don’t do 80% of the task. They do 100% of the task, but 20% is wrong (and you don’t know which 20% without manually verifying all of it).

So it is really great for tasks where doing the work is a lot harder than verifying it, and mostly useless for tasks where doing the work and verifying it are similarly difficult.


Right — and I'd conjecture until LLMs get close to the accuracy of an entry-level employee, they may not have enough economic value to be viable beyond the hype/novelty phase. Why? Because companies already chose a "minimum quality to be valuable" bar when they set the bar for their most junior entry level. They could get lower-quality work for cheaper by just carving out an even lower-bar hiring tier. If they haven't, maybe it's because work below that quality level is just not a net-positive contribution at all.


I would go so far as to say that the reason people feel LLMs have stagnated is precisely because they feel like they're only progressing a few percentage points between iterations - despite the fact that these points are the hardest.


The people who feel that LLMs have stagnated are similar to the ones who feel like LLMs are not useful.


> Can't help but feel many are optimizing happy paths in their demos and hiding the true reality.

Even with the best intentions, this feels similar to when a developer hands off code directly to the customer without any review, or QA, etc. We all know that what a developer considers "done" often differs significantly from what the customer expects.


>> many are optimizing happy paths in their demos and hiding the true reality

Yep. This is literally what every AI company does nowadays.


>The more generic and spread an agent is (can-do-it-all) the more likely it will fail and disappoint.

To your point - the most impressive AI tool I have used to date (not an LLM, but bear with me), and I loathe giving Adobe any credit, is Adobe's Audio Enhance tool. It has brought back audio that, prior to it, I would have thrown out or, if the client was lucky, charged thousands of dollars and spent weeks repairing to get it half as good as what that thing spits out in minutes. Not only is it good at salvaging terrible audio, it can make mediocre Zoom audio sound almost like it was recorded in a proper studio. It is truly magic to me.

Warning: don't feed it music lol it tries to make the sounds into words. That being said, you can get some wild effects when you do it!


Not even well-optimized. The demos in the related sit-down chat livestream video showed an every-baseball-park-trip planner report that drew a map with seemingly random lines that missed the east coast entirely, leapt into the Gulf of Mexico, and was generally complete nonsense. This was a pre-recorded demo being live-streamed with Sam Altman in the room, and that’s what they chose to show.


I mostly agree with this. The goal with AI companies is not to reach 99% or 100% human-level, it's >100% (do tasks better than an average human could, or eventually an expert).

But since you can't really do that with wedding planning or whatnot, the 100% ceiling means the AI can only compete on speed and cost. And the cost will be... whatever Nvidia feels like charging per chip.


yep, the same problem as with outsourcing: getting the 90% "done" is easy; the last 10% is hard and completely depends on how the "90%" was achieved


Seen this happen many times with current agent implementations. With RL (and provided you have enough use-case data) you can get to high accuracy on many of these shortcomings. Most problems arise from the fact that prompting is not the most reliable mechanism and is brittle. Teaching a model specific tasks helps negate those issues and overall results in a better automation outcome, without devs having to make so much effort to go from 90% to 99%. Another way to do it is parallel generation, then identifying at runtime which output seems most correct (majority voting or LLM-as-a-judge).
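The majority-voting variant is easy to sketch. The `generate` stub here is hypothetical (a real version would sample the model N times at nonzero temperature); the voting logic is the actual technique:

```python
from collections import Counter

# Hypothetical stub: three canned samples keep the sketch runnable.
# In practice these would be independent model generations.
def generate(prompt, n=3):
    return ["42", "42", "41"]

def majority_vote(prompt, n=3):
    samples = generate(prompt, n)
    # Pick the most frequent answer; the agreement ratio doubles as a
    # crude confidence signal you can threshold on at runtime.
    answer, count = Counter(samples).most_common(1)[0]
    return answer, count / len(samples)

answer, confidence = majority_vote("What is 6 * 7?")
print(answer, confidence)  # → 42 0.666...
```

This only helps when answers can be compared for equality (or canonicalized first); for free-form outputs you'd swap the Counter for an LLM-as-judge ranking pass.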

I agree with you on the hype part. Unfortunately, that is the reality of current Silicon Valley. Hype gets you noticed and gets you users. Hype propels companies forward, so it is here to stay.


What % of Cloudflare's protection can this provide? I've been looking at BunkerWeb + Anubis as an alternative to Cloudflare Tunnel (I'm actually not sure if this provides a WAF).


This isn't really comparable to any of the SaaS based products.

While this offers many of the same technical capabilities as Cloudflare, a lot of Cloudflare's value is in having high-level, aggregate insight into threats.

