Hacker News | symfrog's comments

We have had LLMs for much longer than 3 years.


It took humans thousands of years, then hundreds of years, to come to terms with very basic concepts about numbers.

It's amazing to me when people talk about recombining things, or following up on things, as somehow lesser work.

People can't separate themselves from the perspective they were given when they learned the concepts, a perspective that those who developed the concepts didn't have, because the concepts didn't exist yet.

Simple things are hard, or everything simple would have been done hundreds of years ago, and that is certainly not the case. Seeing something others have not noticed is very hard, when we don't have the concepts that the "invisible" things right in front of us will teach us.


Anyone in the arts is aware that creativity is not the new, it is the repackaging of what already exists into something that is itself new.


Except for "Being John Malkovich". That movie was way out there on its own.


It's "just" a Man-vs-Self story, of the ~7 story archetypes out there.


You need to rewatch that movie.


It's why the invention of teaching has been so important. Took a long time for humans to develop calculus. A long time to then refine it and make it much more useful. But then in a year or two an average person can learn what took hundreds of years to invent. It's crazy to equate these tasks as being the same. Even incremental innovation is difficult. You have to see something billions of people haven't. But there's also paradigm shifts and well... if you're not considered crazy at first then did you really shift a paradigm?


And yet it is still taught in less than optimal form, lacking algebraic closure in ways that are completely unnecessary.

It isn't a secret, but the percentage of people who don't know that, plus the percentage of mathematicians who vaguely or more directly know that, but habitually use the broken, more difficult (i.e. less algebraic) notation is ... virtually everyone.

I am not trying to pick on calculus, this is everywhere. Important and useful concepts are right in front of all of us, that we don't see even in the context of what we are relatively fluent with.

Because we learn quickly, where we have (almost always inherited) the right preparatory perspectives (earned over lifetimes by others), we vastly overrate our ability to reason independently.


What is that algebraic calculus you are hinting at?


Were I to guess, they're talking about the different derivatives. Here's at least something that might introduce you to some of the shortcuts people take, but it's far from complete [0]. (You can probably find more if you search for things like how physicists use the derivative wrong. I make this critique as someone with a degree in physics, too.)

I often say that math is taught through a game of telephone. It's a fantastic example of the problem with "I just care that it works" type attitudes. The problem is that if that were your actual belief, you wouldn't be saying it, because you'd need to dig deeper. Caring about it working is exactly the reason people do dig deeper and bring up issues. The reason things fall apart less in math is that the language was specifically invented to make miscommunication difficult. That's why it's overly pedantic. That's why we use formal languages rather than natural ones. So we should rephrase "I just care that it works" as what it actually is: "I just care that it works for this exact case." That makes it easier to see the problem. If you don't know the subject in more detail, then you can't actually know if it breaks in your use case. The broken parts are completely invisible to you! Which undermines your own stated goal.

This goes for a lot more than math. But since math is a formal language, it's just easier to point out where and how people misunderstand. If you're an expert in any field, you've probably seen this same phenomenon in that domain, though. People's overconfidence and their refusal to get deeper knowledge actually just undermine their whole goal. I'd honestly call this a form of Gell-Mann Amnesia.

[0] https://m.youtube.com/watch?v=oIhdrMh3UJw


No, we haven't, for any reasonable definition of L.


OpenAI themselves must not have a "reasonable definition of L", then. Their own papers and press releases refer to GPT-2 (from 2019) as a "large language model".

https://openai.com/index/better-language-models/


Yes, and 1.5 billion parameters meets no reasonable current definition of large. It would be considered a tiny language model. OpenAI themselves refer to their small/fast models as small models all over their documentation.


The term doesn't change its meaning because something new comes along.

The point of the term "large" is to highlight the massive parameter count (compared to traditional statistical models, where having 1.5 billion parameters was basically unheard of). It leads to the "double descent" phenomenon that allows them to generalize in ways traditional statistical models can't.

The idea that the "large" descriptor was just a subjective exclamation, like "oh wow this model is pretty large ain't it", is revisionism.


Yes, it does. That's why OpenAI refers to its small models as small. They are just so different. The capabilities have changed dramatically. The use cases are wildly different. The architectures are quite different. Even the core idea of attention is different. Training them is materially different. Serving them is materially different. A 1.5 billion parameter model from 2019 is so different from today's LLMs that they really don't have much in common. What we have now is quite similar to what we had a couple of years ago, though.


> The term doesn't change its meaning because something new comes along.

...you're gonna flip when you hear about how language works :)


Sure we have, since Fei-Fei Li and her team created that annotated dataset, which made it possible to train the first LLMs. So LLMs have been here for more than a decade already.


You are confused by what the L and L mean in LLM, or which data set she created, or both, or in general.


Or it is you who are confused. And I want to remind you that you can't retcon historical word use.


Fei-Fei was annotating images... the second L in LLM is for "language". The first language models called LLMs at the time were trained on language data, with an objective function of predicting the next token. That had nothing to do with the ImageNet data. ImageNet data was used in... vision models.

The "Attention Is All You Need" paper never used the term LLM or "large language model", because the phrase didn't exist in industry.

Why comment on a field you know nothing about?


When people say this what they mean is that we've had plausibly useful LLMs for around three years, and I would say that is basically true. The stuff before 2023 could barely be classified above the level of an interesting toy.


When people say this what they mean is that we've had plausibly useful LLMs for around three years, and I would say that is basically true.


Fine, 8 years? That's not a long time


The closer you get to releasing software, the less useful LLMs become. They tend to go into loops of 'Fixed it!' without having fixed anything.

In my opinion, attempting to hold the hand of the LLM via prompts in English for the 'last mile' to production ready code runs into the fundamental problem of ambiguity of natural languages.

From my experience, those developers that believe LLMs are good enough for production are either building systems that are not critical (e.g. 80% is correct enough), or they do not have the experience to be able to detect how LLM generated code would fail in production beyond the 'happy path'.


> The closer you get to releasing software, the less useful LLMs become.

Which is _always_ the case with these things, honestly. Remember Ruby on Rails? Make a Twitter clone in half an hour by just writing some DSL! Of course, in reality Rails was _not_ a productivity revolution, and making _real_ software which had to be operated at scale and maintained, and work properly, in it wasn't much easier than it had been previously.


The amount of "apps" I've had dumped on my team that are everything from un-releasable to deployed on some random shit-cloud we haven't approved (vercel comes up a lot). If you needed hand holding to release things or had to throw software over the fence to others to "productionise" etc then you probably don't know what you're talking about.


This is not my experience with Claude Code. It does forget big-picture things, but if you scope your changes well it’s fine.


I would estimate that out of every 200 lines of code that Claude Code produces, I notice at least 1 issue that would cause severe problems in production.

In my opinion these discussions should include MREs (minimal reproducible examples) in the form of prompts to ground the discussion.

For example, take this prompt and put it into Claude Code, can you see the problematic ways it is handling transactions?

---

The invoicing system is being merged into the core system that uses Postgres as its database. The core system has a table for users with columns user_id, username, creation_date . The invoicing data is available in a json file with columns user_id, invoice_id, amount, description.

The data is too big to fit in memory.

Your role is to create a Python program that creates a table for the invoices in Postgres and then inserts the data from the json file. Users will be accessing the system while the invoices are being inserted.

---
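
For anyone who wants a baseline to compare the generated code against, here is a minimal sketch of one way to approach the transaction handling: stream the file and commit in small batches, so that no single long-running transaction is held open while users are active. Several things here are assumptions that are not in the prompt: the file is taken to be JSON Lines (one object per line), the `invoices` table is assumed to exist already, and the names `parse_rows`, `batches`, and `load_invoices` are hypothetical. The function accepts any DB-API connection; `placeholder="%s"` matches psycopg2-style Postgres drivers, while the demo passes `"?"` so the sketch can be tried against the standard library's `sqlite3` as a stand-in for Postgres.

```python
import json
import sqlite3
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[int, int, float, str]


def parse_rows(lines: Iterable[str]) -> Iterator[Row]:
    """Lazily parse one JSON object per line (JSON Lines is an assumption)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        rec = json.loads(line)
        yield (rec["user_id"], rec["invoice_id"], rec["amount"], rec["description"])


def batches(rows: Iterable[Row], size: int) -> Iterator[List[Row]]:
    """Yield lists of at most `size` rows without materializing the input."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def load_invoices(conn, lines: Iterable[str], batch_size: int = 1000,
                  placeholder: str = "%s") -> int:
    """Insert rows batch by batch, one short transaction per batch.

    `conn` is any DB-API connection. Returns the number of rows committed.
    """
    sql = (
        "INSERT INTO invoices (user_id, invoice_id, amount, description) "
        "VALUES ({0}, {0}, {0}, {0})".format(placeholder)
    )
    total = 0
    for chunk in batches(parse_rows(lines), batch_size):
        cur = conn.cursor()
        try:
            cur.executemany(sql, chunk)
            conn.commit()  # keep each transaction short
        except Exception:
            conn.rollback()  # earlier batches stay committed
            raise
        finally:
            cur.close()
        total += len(chunk)
    return total


if __name__ == "__main__":
    # sqlite3 is only a stand-in here; with Postgres, keep placeholder="%s".
    demo = sqlite3.connect(":memory:")
    demo.execute(
        "CREATE TABLE invoices (user_id INTEGER, invoice_id INTEGER PRIMARY KEY, "
        "amount REAL, description TEXT)"
    )
    sample = [json.dumps({"user_id": 1, "invoice_id": i, "amount": 9.5,
                          "description": "demo"}) for i in range(3)]
    print(load_invoices(demo, sample, batch_size=2, placeholder="?"))
```

The sketch deliberately leaves out things a production import would still need, such as a foreign key to the users table, safe re-runs after a crash part-way through, and a decision about invoices whose user_id does not exist.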


And that's why you ask for a high level plan for something like that before you let the agent write any code. Then you review the plan for flaws, revise it, and prompt the system to fill out more details for each step. Repeat as necessary. Yes it's slow, but it's the best way of using this "glorified autocomplete" to ease and speed up real work.


People that have never written their own code won't know what the flaws are.


Those people can ask Claude to review the flaws for them.


Then they won't know if it's accurate or missing something.


Oh good point.


What he’s saying is split this up into multiple tasks to create the table, insert the data etc


Isn’t that the hard part? If the tasks are small enough and well defined, where’s the win over just writing the code right there and then?


Well claude can also refine it into smaller tasks and that’s where you can fix those major problems in production issues.


It’s the hard part which is why these tools are so great, the writing of code was the tedious part


You can use an LLM to generate that list of tasks.


And how does a new grad that's never actually programmed know whether that list of tasks makes sense?


Yes, but knowing how to scope your changes requires a lot of expertise.


If you are trying to build something well represented in the training data, you could get a usable prototype.

If you are unfamiliar with the various ways that naive code would fail in production, you could be fooled into thinking generated code is all you need.

If you try to hold the hand of the coding agents to bring code to a point where it is production ready, be prepared for a frustrating cycle of models responding with ‘Fixed it!’ while only having introduced further issues.


On what basis are you making that prediction?


Any sufficiently complicated LLM generated program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of an open source project.


We had an effort recently where one much more experienced dev from our company ran Claude on our oldish codebase for one system, with the goal of transforming it into newer structure, newer libraries etc. while preserving various built in functionalities. Not the first time this guy did such a thing and he is supposed to be an expert.

I took a look at the result, and maybe half of the stuff is missing completely; the rest is cryptic. I know that codebase by heart since I created it. From my 20+ years of experience, correcting all this would take way more effort than a manual rewrite from scratch by a senior. Suffice to say that's not what upper management wants to hear; LLM adoption has often become one of the yearly targets they are evaluated against. So we have a hammer and are looking for nails to bend and crook.

Suffice to say this effort led nowhere since we have other high priority goals, for now. Smaller things here & there, why not. Bigger efforts, so far sawed-off 2-barrel shotgun loaded with buckshot right into both feet.


Not to take away from your experience but to offer a counterpoint.

I used Claude Code to port a Rust PDB parsing library to TypeScript.

My SumatraPDF is a large C++ app and I wanted visibility into where the size of functions / data goes, and into the layout of classes. So I wanted to build a tool to dump info out of a PDB. But I have been diagnosed with an extreme case of Rustophobiatis so I just can't touch Rust code. Hence, the port to TypeScript.

With my assistance it did the work in an afternoon and did it well. The code worked. I ran it against a large PDB from SumatraPDF and it matched the output of other tools.

In a way, porting from one language to another is an extreme case of refactoring, and Claude did it very well.

I think that in general (your experience notwithstanding) Claude Code is excellent at refactorings.

Here are 3 refactorings from SumatraPDF where I asked claude code to simplify code written by a human:

https://github.com/sumatrapdfreader/sumatrapdf/commit/a472d3... https://github.com/sumatrapdfreader/sumatrapdf/commit/5624aa... https://github.com/sumatrapdfreader/sumatrapdf/commit/a40bc9...

I hope you agree the code written by Claude is better than the code written by a human.

Granted, those are small changes, but I think it generalizes to bigger changes. I have a few refactorings in mind that I have wanted to do for a long time, and maybe with Claude they will finally be feasible (they were not feasible before only because I don't have an infinite amount of time to do everything I want to do).


“I want this thing, but in a different language” seems to be something that the current generation of cutting edge LLMs are pretty good at.

Translating a vibe is something the Ur-LLMs (GPT-3 etc.) were very good at, so it’s not entirely surprising that the current state of the art is to be found in things of a “translate thing X that already exists into context Y” nature.


All software before LLMs had a copious number of bugs, many of which were never fixed.


Was software ever a moat? Software typically only gave companies a small window of opportunity to turn a fleeting software advantage into a more resilient moat (network effects, switching costs etc.)


Yes, I would argue good (stable, fast, easy to use) software was somewhat of a moat and much harder before coding agents.

Stripe, Square, Shopify, Google, all thrived in some part because their services take a hard problem and make it easier to use. Now more people can take a hard problem and make it easier to use.

All you have to do is look around (esp 5+ years ago) and see the many many BAD, unstable, hard to use, slow, etc versions of these companies


Windows was a moat but it looks more like an anchor now.


Windows' moat was not the operating system code, but that they were able to get distribution via IBM, and then grow an ecosystem of applications that were targeted at Windows, which created a snowball effect for further applications.


Yes though still it was a big barrier to build an OS.


In what way is the long term impact of LLMs being underestimated? If anything, it seems that it has been overestimated in the past years and that something other than LLMs will be needed to reach the original scaled LLM hope of AGI.


Back when the Internet was America Online and some CGI-bin Perl scripts, there were a lot of very lofty things said about the potential of the Internet in the future. I don’t remember any of them predicting the power the tech would have over business, politics, media, and hours of every single day for billions of people. Even without AGI, it’s quite possible that we're still underestimating the effects of predictive, probabilistic computing 20 or 50 years from now.


The internet alone didn't change sh!t. Without smartphones, unified app stores, cellular network innovation, et al., internet traffic would not be so high.

Funny how people leave this stuff out. Yawn. Basic simpleton analysis and takes.


The Internet created the backbone that allowed for rapid experimentation in communications technologies, and created the ability for anyone to create and share technologies and reach a huge audience very quickly.

Without the Internet, most consumer electronics would have been far more expensive to build, and would have been strictly controlled walled gardens, but the Internet in general and the Web in particular allowed so many inventors to flourish. Ever since that Genie was let out of the bottle, corporate and government interests have been trying to put it back in, and most companies are trying to build and reinforce walled gardens under the banner of unified app stores that extract insane rents.


Wow this is like going on a medical forum and saying "medicine didn't change shit".


They were replying to this particular underestimation:

> AI is limited to probabilistic and annoying chatbots that are for entertainment and for looking up trivia questions.

That is not a rational assessment of the utility that the technology provides, even today.


If only that is what investors have figured out.

Unfortunately, it seems investors now think that all paid software will be replaced by AI-generated software, and that open source projects laundered through generative AI models will somehow finally convince enterprise customers to go with free.


EXWM is great; having the same flow to manage X applications as for emacs buffers is a huge benefit. My only concern is whether X11 will be maintained well enough into the future to keep using it, since there is currently no Wayland support in EXWM.


Emacs as a Wayland compositor has been shown to be possible. If we eventually get that and threading the future might be rather rosy.

https://emacsconf.org/2022/talks/wayland/

http://perma-curious.eu/repo-ewx/


> If we eventually get that and threading

That's a really big ask. The entire ecosystem around Emacs isn't built for multithreading.


I don't mean adding threading to existing functionality, and I mostly wouldn't want that. I very strongly prefer emacs' behaviour of queueing up my input events and processing them deterministically regardless of how long it takes to get to them over eg. the JetBrains behaviour where something can happen asynchronously with my input events that can change their meaning depending on when it happens.

What I mean is having threading capabilities available for things that want to (and should) use them. AIUI some support for that was added in emacs 26, so it might already be good enough.

The relevance is that EXWM is single threaded, so the window management blocks when emacs does. I don't find that much of a problem with EXWM but I doubt it would fly for a Wayland compositor, though perhaps the separate server used in that emacsconf talk sidesteps the problem.


I've moved to openbsd for this reason. It works well and I don't have to deal with Linux drama. Toxic slug strategy is really working well for them.


I once read a comment here or reddit explaining that the X11 developers moved to Wayland because the X11 code has turned into an unmaintainable mess that can't be worked with anymore. So the reasons are not drama, but just plain old tech debt.


This pre-packaged talking point is often repeated without evidence. The vast majority of X.org developers, including all of the original ones, simply moved to other venues at one point or another. Only a few, like Daniel Stone, have made contributions to both. And it shows in how many lessons had to be re-learned.


What is your evidence? A quick search on google (and the git commits) would show you that many wayland developers are significant former xorg developers.

1. Kristian Høgsberg, the founder of wayland, did all the DRI2 work on xorg before becoming frustrated
2. Peter Hutterer was a main xorg developer and has been behind the wayland input system
3. Adam Jackson, long time xorg maintainer, essentially called for moving on to wayland https://ajaxnwnk.blogspot.com/2020/10/on-abandoning-x-server... (I found that he was involved in wayland discussions, but I'm not sure if he contributed code)
4. you already mentioned Daniel Stone

The only main xorg developer not involved in wayland arguably could be Keith Packard, although he made a lot of the changes for xwayland so I'm not sure if it is correct to say he did not have wayland involvement.

So who are the "vast majority of X.org developers"? I think people always read about the couple of names above and then think, "well there must have been hundreds of others", because they thought xorg was like the linux kernel. AFAIK xorg always only had low 10s of active developers.


My evidence is the git commit log: https://desuarchive.org/g/thread/84460945/#q84481507

This doesn’t even include the XFree86 CVS commit history and older, which accounts for most of the code in X.org. Some of those people may actually be dead now.

>AFAIK xorg always only had low 10s of active developers.

There are 38 people with 100+ commits, which obviously counts as a major contributor.


Openbsd has brought in x11 into their own codebase: https://xenocara.org/

This is why openbsd is great.

I don't care about the drama that happens in Linux land at all.


The drama was mostly over whether or not Wayland should have been the replacement. AFAIU, everyone agreed X11 development was effectively unsustainable or at least at a dead end.


Wayland is not a solution, just a name for some protocols... It's either KDE or Gnome (with its weird quirks) or some alternative.


So is X11, though the reference implementation of X11 is also widely agreed to have some serious problems going forward on top of problems with the protocol itself.


I'm really happy with OpenBSD also. What is toxic slug strategy?



Do you have a link to some of the code that you have produced using this approach? I am yet to see a public or private repo with non-trivial generated code that is not fundamentally flawed.


This one was a huge success:

https://github.com/micahscopes/radix_immutable

I took an existing MIT licensed prefix tree crate and had Claude+Gemini rewrite it to support immutable quickly comparable views. The execution took about one day's work, following two or three weeks thinking about the problem part time. I scoured the prefix tree libraries available in rust, as well as the various existing immutable collections libraries and found that nothing like this existed. I wanted O(1) comparable views into a prefix tree. This implementation has decently comprehensive tests and benchmarks.
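
To make "O(1) comparable views" concrete, here is a minimal sketch of one common way to get them: every immutable node caches a hash of its value and its children's hashes, so two trees can be compared by comparing their root digests. This is an assumption for illustration and not a description of how radix_immutable is implemented. It is written in Python rather than Rust, it is a plain one-character-per-edge trie rather than a compressed radix tree, and `Node`, `make_node`, `insert`, and `same_contents` are hypothetical names. Equality by digest is also only as reliable as the hash's collision resistance.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Node:
    value: Optional[str]
    children: Tuple[Tuple[str, "Node"], ...]  # sorted by edge character
    digest: bytes  # hash of this node's value plus its children's digests


def make_node(value: Optional[str], children: Dict[str, Node]) -> Node:
    """Build an immutable node and compute its structural digest once."""
    items = tuple(sorted(children.items()))
    h = hashlib.sha256()
    if value is None:
        h.update(b"N")
    else:
        raw = value.encode("utf-8")
        h.update(b"V" + len(raw).to_bytes(4, "big") + raw)
    for ch, child in items:
        edge = ch.encode("utf-8")
        h.update(len(edge).to_bytes(1, "big") + edge + child.digest)
    return Node(value, items, h.digest())


EMPTY = make_node(None, {})


def insert(node: Node, key: str, value: str) -> Node:
    """Return a new tree with key set to value; the old tree is untouched.

    Only the nodes on the path to the key are rebuilt (path copying).
    """
    children = dict(node.children)
    if not key:
        return make_node(value, children)
    head, rest = key[0], key[1:]
    children[head] = insert(children.get(head, EMPTY), rest, value)
    return make_node(node.value, children)


def same_contents(a: Node, b: Node) -> bool:
    """Compare two trees by root digest, independent of tree size."""
    return a.digest == b.digest


if __name__ == "__main__":
    a = insert(insert(EMPTY, "car", "1"), "cat", "2")
    b = insert(insert(EMPTY, "cat", "2"), "car", "1")
    print(same_contents(a, b))  # same keys and values, different insert order
    print(same_contents(a, insert(a, "cat", "3")))  # one value changed
```

The design trade-off is that every insert pays for rehashing the nodes along one path, in exchange for comparisons that never have to walk the tree.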

No code for the next two but definitely results...

Tabu search guided graph layout:

https://bsky.app/profile/micahscopes.bsky.social/post/3luh4d...

https://bsky.app/profile/micahscopes.bsky.social/post/3luh4s...

Fast Gaussian blue noise with wgpu:

https://bsky.app/profile/micahscopes.bsky.social/post/3ls3bz...

In both these examples, I leaned on Claude to set up the boilerplate, the GUI, etc, which gave me more mental budget for playing with the challenging aspects of the problem. For example, the tabu graph layout is inspired by several papers, but I was able to iterate really quickly with claude on new ideas from my own creative imagination with the problem. A few of them actually turned out really well.


Not the OP, not my code. But here is Mitchell Hashimoto showing his workflow and code in Zig, created with AI agent assistance: https://youtu.be/XyQ4ZTS5dGw


I think this is still some kind of 'fight' between assisted coding and something closer to 'vibe' coding. Vibe, for me, means not reading the generated code and just trying it; the other extreme is writing everything without AI. I don't think people here are talking about assisted coding: they are talking about vibe or almost-vibe coding. And it's fairly terrible if the LLM does not have tons of info. It can loop, hang, remove tons of features, break random things, etc., all while being cheerful and saying 'this is production code now, ready to deploy'. And people believe it. When you use it to assist, it is great imho.


https://github.com/wglb/gemini-chat Almost entirely generated by gemini based on my english language description. Several rounds with me adding requirements.

(edit)

I asked it to generate a changelog: https://github.com/wglb/gemini-chat/blob/main/CHANGELOG.md


That's disingenuous or naive. Almost nobody decides to expressly highlight the sections of code (or whole files) generated by AI; they just get on with the job when there are real deadlines and it's not about coding for the sake of the art form...


If the generated implementation is not good, you're trading short-term "getting on with the job" and "real deadlines" for mid-to-long-term slowdown and missed deadlines.

In other words, it matters whether the AI is creating technical debt.


If you're creating technical debt, you're creating technical debt.

That has nothing to do with AI/LLMs.

If you can't understand what the tool spits out either; learn, throw it away, or get it to make something you can understand.


Do you want to clarify your original comment, then? I just read it again, and it really sounds like you're saying that asking to review AI-generated code is "disingenuous or naive".


I am talking about correctness, not style. Coding isn't just about being able to show activity (code produced), but rather about producing a system that correctly performs the intended task.


Yes, and frankly you should be spending time writing large integration tests correctly not microscopic tests that forgot how tools interact.

It's not about lines of code or quality it's about solving a problem. If the problem creates another problem then it's bad code. If it solves the problem without causing that then great. Move onto the next problem.


Same as pretending that vibe coding isn't producing tons of slop. "Just improve your prompt bro" doesn't work for most real codebases. The recent TEA app leak is a good example of vibe coding gone wrong. I wish I had as much copium as vibe coders, to be blind to these things, as most of them clearly think "it happened to them but surely won't happen to ME."


> The recent TEA app leak is a good example of vibe coding gone wrong

Weren't there 2 or 3 dating apps that were launched before the "vibecoding" craze that went extremely popular and got extremely hacked weeks/months in? I also distinctly remember a social network having firebase global tokens on the clientside, also a few years ago.


So that's an excuse for AI getting it wrong? It should know better if it's so much better.


LLMs are not meant to be infallible; they're meant to be faster.

Repeat after me, token prediction is not intelligence.


Not an excuse, no. I agree it should be better. And it will get better. Just pointing out that some mistakes were systematically happening before vibecoding became a thing.

We went from "this thing is a stochastic parrot that gives you poems and famous people styled text, but not much else" to "here's a fullstack app, it may have some security issues but otherwise it mainly works" in 2.5 years. People expect perfection, and move the goalposts. Give it a second. Learn what it can do today, adapt, prepare for what it can do tomorrow.


No one is moving the goalposts. There are a ton of people and companies trying to replace large swathes of workers with AI. So it's very reasonable to point out ways in which the AI's output does not measure up to that of those workers.


I thought the idea was that AI would make us collectively better off, not flood the zone with technical debt as if thousands of newly minted CS/bootcamp graduates were unleashed without any supervision.

LLMs are still stochastic parrots, though highly impressive and occasionally useful ones. LLMs are not going to solve problems like "what is the correct security model for this application given this use case".

AI might get there at some point, but it won't be solely based on LLMs.


> "what is the correct security model for this application given this use case".

Frankly I've seen LLMs answer better than people trained in security theatre so be very careful where you draw the line.

If you're trying to say they struggle with what they've not seen before: yes, provided that what is new isn't within the phase space they've been trained over. Remember, there are no photographs of cats riding dinosaurs, but SD models can generate them.


Saying that they aren't worse than an incompetent human isn't a ringing endorsement.


I've heard this multiple times (Tea being an example of problems with vibe coding) but my understanding was that the Tea app issues well predated vibe coding.

I have experimented with vibe coding. With Claude Code I could produce a useful and usable small React/TS application, but it was hard to maintain and extend beyond a fairly low level of complexity. I totally agree that vibe coding (at the moment) is producing a lot of slop code, I just don't think Tea is an example of it from what I understand.

