Hacker Newsnew | past | comments | ask | show | jobs | submit | bcrosby95's commentslogin

I've been working on a client/server game in Unity the past few years and the LLM constantly forgets to update parts of the UI when I have it make changes. The codebase isn't even particularly large, maybe around 150k LOC in total.

A single complex change (defined as 'touching many parts') can take Claude code a couple hours to do. I could probably do it in a couple hours, but I can have Claude do it (while I steer it) while I also think about other things.

My current guess is that LLMs are really good at web code because its seen a shitload of it. My experience with it in arenas where there's less open source code has been less magical.


I suspect you are not using plan mode?

This is where the old line of "LLMs are just next token predictors" actually factors in. I don't know how you get a next token predictor that user input can't break out of. The answer is for the implementer to try to split what they can, and run pre/post validation. But I highly doubt it will ever be 100%, its fundamental to the technology.

I think this is fundamental to any technology, including human brains.

Humans have a problem distinguishing "John from Microsoft" from somebody just claiming to be John from Microsoft. The reason why scamming humans is (relatively) hard is that each human is different. Discovering the perfect tactic to scam one human doesn't necessarily scale across all humans.

LLMs are the opposite; my Chat GPT is (almost) the same as your Chat GPT. It's the same model with the same system message, it's just the contexts that differ. This makes LLM jailbreaks a lot more scalable, and hence a lot more worthwhile to discover.

LLMs are also a lot more static. With people, we have the phenomenon of "banner blindness", which LLMs don't really experience.


How are you defining "banner blindness"?

The foundation of LLMs is Attention.


"Banner blindness [...] describes people’s tendency to ignore page elements that they perceive (correctly or incorrectly) to be ads." https://www.nngroup.com/articles/banner-blindness-old-and-ne...

So people can focus their attention to parts of content, specifically parts they find irrelevant or adversarial (like ads). LLMs on the other hand pay attention to everything or if they focus on something, it is hard to steer them away from irrelevant or adversarial parts.


Banner blindness is a phenomenon where humans build resistance to previously-effective ad formats, making them much less effective than they previously used to be.

You can find a "hook" to effectively manipulate people with advertising, but that hook gets less and less effective as it is exploited. LLMs don't have this property, except across training generations.


> I don't know how you get a next token predictor that user input can't break out of.

Maybe by adjusting the transformer model to have separate input layers for the control and data paths?


Maybe it's my failing but I can't imagine what that would look like.

Right now, you train an LLM by showing it lots of text, and tell it to come up with the best model for predicting the next word in any of that text, as accurately as possible across the corpus. Then you give it a chat template to make it predict what an AI assistant would say. Do some RLHF on top of that and you have Claude.

What would a model with multiple input layers look like? What is it training on, exactly?


> by showing it lots of text

When you're "showing it lots of text", where does that "show" bit happen? :)


It's hard in general, but for instruct/chat models in particular, which already assume a turn-based approach, could they not use a special token that switches control from LLM output to user input? The LLM architecture could be made so it's literally impossible for the model to even produce this token. In the example above, the LLM could then recognize this is not a legitimate user input, as it lacks the token. I'm probably overlooking something obvious.

Yes, and as you'd expect, this is how LLMs work today, in general, for control codes. But different elems use different control codes for different purposes, such as separating system prompt from user prompt.

But even if you tag inputs however your this is good, you can't force an LLM to it treat input type A as input type B, all you can do is try to weight against it! LLMs have no rules, only weights. Pre and post filters cam try to help, but they can't directly control the LLM text generation, they can only analyze and most inputs/output using their own heuristics.


I wouldn't personally do so, but arguably those tens of thousands rest at our feet considering the current government was political blowback from the US and UK regime changing Iran back in the '50s.

It's even less likely to work because Trump has already claimed, publicly, to arming the protestors. That already makes any regime change illegitimate. They're all foreign backed agitators.

I bring it up because this shit is messy.


Bad code works fine until it doesn't. In my experience, with humans, doing the right thing is worth it over doing the bad thing if your time horizon is a few months. Once you're in years, absolutely do the right thing, you're actually throwing time away if you don't. And I don't mean "big refactor", I mean at-change-time, when you think "this change feels like an icky hack."

For LLMs, I don't really know. I only have a couple years experience at that.


If you make a working and functional bad code, and put it on maintenance mode, it can keep churning for decades with no major issues.

Everything depends on context. Most code written by humans is indeed, garbage.


> Most code written by humans is indeed, garbage.

I think that this is the problem, actually.

It's similar to writing. Most people suck at writing so badly that the LLM/AI writing is almost always better when writing is "output".

Code is similar. Most programmers suck at programming so badly that LLM/AI production IS better than 90+% (possibly 99%+). Remember, a huge number of programmers couldn't pass FizzBuzz. So, if you demand "output", Claude is probably better than most of your (especially enterprise) programming team.

The problem is that the Claude usage flood is simply identifying the fact that things that work do so because there is a competent human somewhere in the review pipeline who has been rejecting the vast majority of "output" from your programming team. And he is now overwhelmed.


  >  Most programmers suck at programming so badly that LLM/AI production IS better than 90+% (possibly 99%+).
How do you know?

Because of just how many programmersI've interviewed who can't pass FizzBuzz?

I also taught upper level CS and my first assignment was always "You have 10 days. Here is a 10 line program on this sheet of paper. Type it in, check it into source control, and make the automated tests go green. Warning: start today."

1/3 of the class couldn't finish that task and would drop.


I define maintenance mode as: given over to different team, so not my problem anymore.

If you are a company founder, what scenario would you rather find yourself in?

a) a pristine, good codebase that follows the best coding practices, but it is built on top of bad specs, wrong data/domain model

b) a bad codebase but it correctly models and nails the domain model for your business case

Real life example, a fintech with:

a) a great codebase but stuck with a single-entry ledger

b) a bad codebase that perfectly implements a double-entry ledger


"Perfectly implements" is doing a lot of work there. Enterprise software is very rarely perfect out of the box, and the issue with bad code is that it can make it extraordinarily hard to solve simple problems. I have personally seen tech-debt induced scenarios where "I want a new API to edit this field in an object" and "Let's do a dependency upgrade" respectively became multi-month projects.

> Perfectly implements" is doing a lot of work there. Enterprise software is very rarely perfect out of the box

Fair, by “perfectly implements” I meant to say that it correctly implemented the core invariant of a double entry ledger (debits = credits), not that it was 100% bug free


Since most won't actually deal with fintech (I don't know the stats on HN, but I'm talking devs as one industry), your first "a" example might actually be better than your first "b" example, depending on the complexity of the software. In lots (probably most) of industries, having a good codebase would mean architecture decisions were solid, but the domain/service layer is bad. Maybe my experiences don't match most of the HN crowd, but usually I get stuck with very detailed domain/service rules, but the architecture is a problem where too much memory or CPU is being used, just to abstract away the actual rules of the application (the purpose). Usually when I've been brought in to rebuild an application, the client is fine with the results, but they are upset over performance and/or cost to run the application. For anything of actual complexity, it's usually the supporting code that is the biggest failure, because complex apps usually have decent requirements. Now, if the requirements were bad, and the architecture was bad, AND the domain/service layer is bad, I don't know if there's anything to fix that.

> Bad code works fine until it doesn't.

Who is to judge the "good" or "bad" anyway?


It is important to question "how to judge," not "who is to judge."

My answer of "how to judge?" question is the question "how easy is it to implement new unforeseen functionality with the code under scrutiny?"


And it’s perfectly okay to fix and improve the code later.

Many super talented developers I know will say “Make it work, then make it good”. I think it’s okay to do this on a bigger scale than just the commit cycle.


https://wiki.c2.com/?MakeItWorkMakeItRightMakeItFast

Make it work, make it work right, make it work fast. In that order.


But why not rewrite the app, change the name, and get shareholder value from a new product announcement? It shouldn't take a long time, the spec for the new product is the old product being rewritten.

See Google Hangouts > Google Chat


The fix time horizon changes too, don't discard that.

But tech debt with vibe coding is fixed by just throwing more magic at it. The cost of tech debt has never been lower.

Imagine thinking people losing their primary income source (usually 100% of it) is remotely comparable to the share price of a single company not going up 2%.


If you can’t lay off people then the economy won’t run and it affects everyone.

Sure you can show easy empathy for the employees but this is how economy runs. A static economy where layoffs are hard or punished will lose to a more dynamic one.


> Sure you can show easy empathy for the employees but this is how economy runs. A static economy where layoffs are hard or punished will lose to a more dynamic one.

Is that why workers are generally happier in Europe even though on paper their economy loses?


I've always been skeptical of happiness statistics. In many cases, self-reporting happiness offers an objective floor for happiness, but the ceiling is entirely relative/subjective.

The floor is universal: starvation, suffering, death.

The ceiling...

For someone who's starving & facing death, would simply be good health, easy access to food, healthy family, house & car.

But the ceiling for someone who already has these things is different. The ceiling for a billionaire is different.

The only way I can imagine not doing this type of subjective self-reporting is... maybe you can draw blood from populations and record cortisol and oxytocin levels?


Unfortunately since PG&E is a regulated utility it's not that simple.


When you compare PG&E's electricity rates to the rest of the nation (and neighbors like SMUD), you can see that the CPUC isn't doing much.


Having multiple kids can make things much harder too. A child that is easy alone may not be easy with siblings.

Parents flex for children because there's a lot of things we don't care about. Little Timmy wants to go first? That's fine. But now introduce Little Jane. Little Timmy can't always go first. Now he has to contend with Little Jane.

Turn taking is the standard here. Doing it at preschool is one thing, but in every day life, at home, from sunrise to sunset, when tired or sick, is another thing. There's also lots of corner cases to navigate. What if Little Jane wants to watch something that scares Little Timmy? What if Little Timmy missed a turn because he had soccer practice? And so on.

You also aren't in control of all turns. For example, birthday parties. Or frequency of seeing friends. What is seen as unfair parts of life by an 8 year old can be completely out of your control.

You can oftentimes rely on the older one being a little more flexible than the younger one, and that is a life saver. As a parent you can lean on it and reason with them. But there is a scenario where this falls apart: twins. Congratulations, have fun reasoning with two 1.5 year olds who want opposing things.

The corner cases from taking turns can get easier as they get older though. I usually offload it to them: you guys figure out whats fair, everyone has to agree, let me know when you figure it out. This can degenerate into the most stubborn winning though, so I still have to monitor the results and stop relying on it if that's what is happening.

I dunno, it feels kinda exhausting for being easy.


There's definitely a balance here. Not giving a shit, which seems to have been the tradition I was raised in, is not great. Recognizing problems is good. Acting on all of them is bad. You need to raise grounded kids who can grow into resilience so by the time they're adults they're able to navigate this strange world.

Especially trying to control who their friends are and what they hear. Questionable behavior from friends at younger ages can actually be a good thing: it lets you plant the seed early when your kids still listen to you.


It's difficult to translate what you're saying. You say "go to visit" which implies you don't see your friends with kids very often. If true you're in a poor position to make these judgements even by your own standards.

Everything you're observing is even more likely to occur if you don't see them that often. Your friends probably want to spend more time focusing on you. The kids are not that familiar with you and are less likely to engage with you. Which also makes the parents more likely to want to distract them with something else.

Whether or not you're bringing children with you matters too. It sounds like you don't because you're focusing on child-adult interactions. If someone has kids the kids run around and be kids with their loudness, and child-adult interactions are going to be much more likely. If you're not bringing kids in tow my kids are much more likely to just go off and do their own thing.

Much of what you're pointing out can also be down to individual child temperment. Which changes as children age. By your standards many parents turn into a bad parent once their kids become a teenager.

That is not to say that your observations are 100% wrong. But just that there's so many variables, most of which I didn't even mention here, makes trying to analyze your statement make my head spin.

Not to lean too much on anecdotes, but I have a friend who has extremely well behaved kids. He has said a couple times that he feels like he has placed too many adult responsibilities on his kids. Is he a good parent? Based upon outside observation, he seems good. But he seems to be questioning some of his own choices. Maybe that alone makes him a good parent? I don't know, who am I to judge?


you read very far into a relatively normal phrase "go to visit" meaning "when i visit them". Some people just say "visit" instead of "hang out" or "bum around with."

eta: it's like some people say they have to do errands.


I smile a bit and give a chuckle when a toddler is giving a parent a hard time. It reminds me of simpler times. The problems and consequences are so much smaller than teenage problems.


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: