Hacker News: kraddypatties's comments

I feel like most of this recent Autoresearch trend boils down to reinventing hyper-parameter tuning. Is the SOTA still Bayesian optimization when given a small cluster? It was ~3 years ago when I was doing this kind of work, haven't kept up since then.
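For reference, a hedged sketch of the kind of tuning loop meant here. Real Bayesian optimization would fit a surrogate (e.g. a Gaussian process) over past trials; this toy uses random search over a log-uniform learning rate and power-of-two batch size as a stand-in, with a made-up objective in place of an actual training run:

```python
import math
import random

def toy_objective(lr, batch_size):
    # Stand-in for validation loss after a short training run;
    # minimized near lr ~ 3e-4, batch_size = 64 in this toy.
    return (math.log10(lr) + 3.52) ** 2 + 0.1 * (math.log2(batch_size) - 6) ** 2

def random_search(n_trials=50, seed=0):
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        lr = 10 ** rng.uniform(-5, -1)   # log-uniform learning rate
        bs = 2 ** rng.randint(3, 9)      # power-of-two batch size
        loss = toy_objective(lr, bs)
        if best is None or loss < best[0]:
            best = (loss, lr, bs)
    return best

best_loss, best_lr, best_bs = random_search()
```

The point of BO over this baseline is sample efficiency: when each "trial" is a real training run, the surrogate model makes each expensive evaluation count.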

Also, shoutout SkyPilot! It's been a huge help for going multi-cloud with our training and inference jobs (getting GPUs is still a nightmare...)!


Wrong and short-sighted take, given that the LLM explores serially, learning along the way, and can use tools and change code arbitrarily. It currently seems to default to something resembling hyperparameter tuning in the absence of more specific instructions. I briefly considered calling the project “autotune” at first, but I think “autoresearch” will prove to be the significantly more appropriate name.

I think we need to separate theory from practice. In theory, it can edit the training loop and come up with novel techniques. That is interesting.

In practice, the vast majority of the changes that autoresearch actually made would have been found much faster with BO if properly parameterized. You do not need an LLM to find a better batch size or learning rate.


I’d always hoped something like this could take advantage of FPGAs directly

FPGAs won't rebuild fast enough for it to matter vs software simulation I'd wager. Even FPGA-in-CPU has been a dream for decades and there you have more time for some workloads, still never was commercially viable for general computing.

There was research a few years back that tried doing something like this with an FPGA, and they found that their algorithm actually exploited defects in the particular chip (not the model, the actual single specific chip) they were using to use electrical interference for computation that shouldn't have worked on paper. They could not reproduce their design on another FPGA of the same model from the same lot.

Out of curiosity, what sort of things have you seen it do that better fit 'autoresearch' than 'autotune' thus far? Optimizations it made that wouldn't have been surfaced by an autotune system, I suppose.

The most recent round of autoresearch (round 2), which decreased "time to GPT-2" from 1.8 hours to 1.65 hours, had some examples. I adjusted the program.md to "look at modded nanogpt project and draw inspirations from there for things to try" and it came back with a bunch of tuning, but it also tried and implemented new architecture changes, some of which actually helped, including the smear gate and the backout skip connection. These are not just hyperparameters, they are new PyTorch code. I'm now working on a more general system that can have a queue of ideas sourced from arXiv papers, GitHub repos, etc.

Do you have a sense of whether these validation loss improvements are leading to generalized performance uplifts? From afar I can't tell whether these are broadly useful new ideas or just industrialized overfitting on a particular (model, dataset, hardware) tuple.

Why set the bar higher on generalization for autoresearch vs the research humans generally do?

industrialized overfitting is basically what ML researchers do

Did you consider providing the LLM with a framework for automatic hyperparameter tuning? This would free up its capacity to focus on the more important architectural decisions.

I see this critique about autoresearch online often, but I think it’s misplaced.

Here’s a use case that may illuminate the difference, from my own work at Nvidia. I’m currently training some large sparse autoencoders, and there are issues with dead latents. Several solutions exist to help here, such as auxk, which I can certainly include, tuning the relevant params as you describe. However, I have several other ideas that are quite different, each of which requires editing core code (full evaluation changes, initialization strategies, architecture changes, etc.), including changes to parallelism strategies in the multi-rank environment I’m using. Moreover, based on my ideas and other existing literature, Claude can try a number of new ideas, each potentially involving more code changes.

This automated run-and-discover process is far beyond what’s possible with hyperparam search.


It wasn't meant as a critique, I'm legitimately interested in knowing more about where it can push boundaries and where it struggles. I agree that in general it's a truism that "Claude can try a number of new ideas" etc., but the question remains as to where in particular it actually takes advantage of this to push the envelope in a way other tools don't -- since that informs when it makes sense to use something like this.

I can believe that in the long run.

Does the agent have access to arxiv (a brief skim of the README didn't have an answer)? If not, it could be that the current approach of relying on the model's weights only is resulting in the perceived local optimum of hyperparameter tuning.

Anecdotally, we built a little MCP for arxiv to help with our internal research, noticed a significant boost in the diversity of methods (architecture or otherwise) Claude and friends were able to reference.


care to share?

I wonder about the following:

To calculate a gradient step, in practice one doesn't accumulate the gradient over the full corpus, but updates the weights on mini-batches.

Suppose one runs conventional gradient descent on minibatches multiple times with different starting seeds, and then considers the resulting set of pre-trained models M_i.

From a random starting point we thus have an idea of the desired end region in weight space (let's say a Gaussian cloud fit to the final M_i's).

Then it seems like one could score update strategies by how much a single iteration approaches the Gaussian cloud, evaluating the approach on just a few minibatches or update iterations, instead of searching update-strategy space by waiting until pretraining has finished for each candidate. Only the candidate strategies that perform well enough on one or a few iterations would be considered worthy of further consideration; those that pass (a smaller pool) are then inspected for approach to the Gaussian target after another round of iterations, and so on.

It seems like it should be possible to optimize the optimization iteration loop, by running it just once for many candidates and observing their convergence to the known desired end region.
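The idea above can be sketched in a few lines. Everything here is illustrative: 2-D "weights", made-up final runs for the M_i, and a simple step-toward-the-mean rule standing in for a real optimizer update:

```python
import random
import statistics as st

random.seed(0)

# Pretend final weights from several seeded pretraining runs (the M_i):
finals = [(2.0 + random.gauss(0, 0.1), -1.0 + random.gauss(0, 0.1)) for _ in range(8)]
mu = tuple(st.mean(c) for c in zip(*finals))     # center of the Gaussian cloud
sigma = tuple(st.stdev(c) for c in zip(*finals)) # per-dimension spread

def score(update, w0, n_steps=3):
    # Score = how much a few update steps shrink the normalized
    # (Mahalanobis-style, diagonal) distance to the cloud.
    def dist(w):
        return sum(((wi - mi) / si) ** 2 for wi, mi, si in zip(w, mu, sigma))
    w = w0
    d0 = dist(w)
    for _ in range(n_steps):
        w = update(w)
    return d0 - dist(w)  # larger is better

w0 = (0.0, 0.0)
# Two candidate "update strategies" (a real run would plug in one
# SGD/Adam/whatever minibatch step here instead):
small = lambda w: tuple(wi + 0.05 * (mi - wi) for wi, mi in zip(w, mu))
large = lambda w: tuple(wi + 0.30 * (mi - wi) for wi, mi in zip(w, mu))
```

Candidates would be ranked by `score` after a few iterations, with only the top pool advancing to the next, longer round.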


Naming things is your primary contribution to AI so well done for deliberating on it. I disagree with the outcome though. Autotune would have been much more fitting.

Is there a cost to converge? And how much does it vary with the random seed?

Re: OpenCogPrime:EconomicAttentionAllocation https://news.ycombinator.com/item?id=45518074 and something about eWASM (edit) https://news.ycombinator.com/item?id=47171887 .. from https://news.ycombinator.com/item?id=46825026 re: eWASM and costed opcodes for agent efficiency


The ClimbMix 400B dataset looks like it is 600 GB. It would be neat if someone could host it in compressed form: given that an LLM can be used as a compressor, even a small LLM would outperform classical compression algorithms. Why is this approach not used within the ML community?

Or is it the "anyone who means anything in the field, has access to high bandwidth anyway"?
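For intuition on the LLM-as-compressor idea: an entropy coder driven by a model's next-token probabilities approaches the model's cross-entropy on the data, so a better model means a smaller file. A toy sketch with a character bigram model standing in for the LLM (and trained on the same text it measures, which flatters it):

```python
import math
import zlib
from collections import Counter, defaultdict

text = "the quick brown fox jumps over the lazy dog " * 200

# Baseline: a classical compressor.
zsize_bits = len(zlib.compress(text.encode())) * 8

# Character bigram "model": counts of next-char given current char.
counts = defaultdict(Counter)
for a, b in zip(text, text[1:]):
    counts[a][b] += 1

# An arithmetic coder driven by these probabilities would emit roughly
# the model's cross-entropy in bits (Shannon's bound).
bits = 0.0
for a, b in zip(text, text[1:]):
    total = sum(counts[a].values())
    bits += -math.log2(counts[a][b] / total)

est_bytes = bits / 8
```

The catch for a real dataset is the one the comment hints at: decompression requires running the LLM over every token, so the bandwidth saved is traded for a lot of compute.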


Would you say it's fair to describe autoresearch as a form of neural architecture search? I am curious what you think the core differences are between them.

Have you actually used LLMs for non-trivial tasks? They are still incredibly bad at genuinely hard engineering work, and they still lie all the time; it's just gotten harder to notice, especially if you're letting one run all night generating reams of crap.

Most people are optimizing for terrible benchmarks and then don't really understand what the model did anyway, and just assume it did something good. It's basically the blind leading the blind, plus a lot of people with AI psychosis or delusions.


Do you realise who you’re replying to?

I think the OP's comment is entirely fair. Karpathy and others come across to me as people putting a hose into itself: they work with LLMs to produce output that is related to LLMs.

I might reframe the comment as: are you actually using LLMs for sustained, difficult work in a domain that has nothing to do with LLMs?

It feels like a lot of LLM-oriented work is fake. It is compounding "stuff," both inputs and outputs, and so the increased amount of stuff makes it feel like we're living in a higher plane of information abundance, but in reality we're increasing entropy.

Tech has always had an information bias, and LLMs are the perfect vehicle to create a lot of superfluous information.


In my limited experience, using LLMs to code up things unrelated to LLMs (robotics for instance) is significantly less productive than using LLMs to code up things related to LLMs. It works, just not very well and requires a lot more leg work on the user end than in other areas.

To be fair Karpathy isn't known for using LLMs—not that I would assume or question whether he's used them 'for non-trivial tasks', but it's not like making the same comment in reply to Steve Yegge or someone. (However trivial we may think Gastown/Wasteland is in the other sense!)

lolololol

Why should we care that he’s famous?

Fame doesn’t enter it - the point is Karpathy has about as strong a claim as anyone to having “actually used LLMs for non trivial tasks”.

That is not the case at all, considering that he himself started using and tweeting about LLMs for coding fairly recently. He's probably less experienced in that area than most people who started using the Claude CLI last year.

He is a researcher who understands neural networks and their architectures exceptionally well. That is all.


> He is a researcher who understands neural networks and their architectures exceptionally well. That is all.

And that is precisely why he is more qualified on the subject than your average vibe coder!



That whole thread is just amazing, if you back up a couple of levels from ground zero. Great perspectives from a lot of thoughtful posters.

E.g., you can see a post from a user named dhouston, who mentioned that he was thinking about starting an online file sync/backup service of some sort.


Haha awesome. I guess they were going through YC right then, I still remember their launch video from around then and thinking it was one of the best ads I’d ever seen.

tfw le AI guy has LLM psychosis. We're cooked

Hyperparam tuning that has better intuition and can incorporate architecture changes automatically. It won't invent something completely new though.

> It won't invent something completely new though.

I don't necessarily disagree, but am wondering whether you have any particular reason/intuition driving you to claim this. I have seen AI agents be quite creative in other tasks; do you think there's a particular reason why we shouldn't see creativity in architecture research, given enough time and resources?


Hm, that's fair. It does feel like there's low hanging fruit in combining "old school" methods for conducting a hyperparameter sweep efficiently _with_ the higher level architecture edit ability of Autoresearch.

Probably it would cut the number of runs down significantly (as far as I can tell it's doing a grid search once it decides to mess with a knob or section of the architecture).


I wonder if it's more like "qualitative gradient descent" on a very non-linear non-convex surface.

You can try this yourself in a simple fashion -- let's say you have piece of code that you want to speed up. Point your agent to a code profiler (your oracle -- typically your Python profiler) and tell it speed up the code. I've tried it. It works.
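A minimal version of that profiler-as-oracle setup, using Python's built-in cProfile; the function and its inefficiency are invented for illustration:

```python
import cProfile
import io
import pstats

def slow_sum(n):
    # Deliberately wasteful: allocates a one-element list every iteration.
    total = 0
    for i in range(n):
        total += sum([i])
    return total

prof = cProfile.Profile()
prof.enable()
result = slow_sum(10_000)
prof.disable()

# Render the top entries by cumulative time; this text is what you'd
# feed back to the agent as its oracle after each attempted speedup.
buf = io.StringIO()
pstats.Stats(prof, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()
```

The loop is then: agent edits the code, you rerun the profile, and the new report (plus a correctness check on `result`) tells it whether the change actually helped.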


Glad you liked it!

Currently the avatar does it based on the text: the incoming audio's transcript is mapped to one of our emotion codes, which biases the generation toward that emotion. It's not foolproof, but we've found it works pretty well in practice.


I'm interpreting this as "uv was built off of years of PEPs", which is true; that being said, the UX of `uv` is their own, and for me it has significantly reduced the amount of time I spend thinking about requirements, modules, etc.


Running into "no healthy upstream" when navigating to the link -- hug of death maybe?


Indeed, we had a huge influx, should be back up now. Thanks for pointing it out


been lurking for most of my adult life (and it shows :-))

Thanks HN! You make me smarter every (other) day.


Thanks for trying it out!

Yea, that latency makes sense; "listening" includes turn detection and STT, while "thinking" covers the LLM + TTS _and then_ our model, so the pipeline latency stacks up pretty quickly. The actual video model starts streaming out frames <500ms after the TTS generation, but we're still working on reducing latency from the parts of the pipeline that we use off the shelf.

We have a high level blog post here https://www.keyframelabs.com/blog/persona-1 about the architecture of the video model, the WebRTC "agent" stack is Livekit + a few backend components hosted in Modal.
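To make the stacking concrete, here's a toy worst-case latency budget assuming the stages run strictly in series; only the <500ms video figure comes from the comment above, the rest are invented for illustration:

```python
# Hypothetical per-stage latencies (ms) for a voice-to-avatar pipeline.
stages = {
    "turn_detection": 200,
    "stt": 150,
    "llm_first_token": 400,
    "tts_first_audio": 250,
    "video_first_frame": 500,  # the comment cites <500 ms from TTS
}

# Worst case: nothing overlaps, so time-to-first-frame is the plain sum.
total_ms = sum(stages.values())
```

In practice stages are pipelined and streamed, so the real time-to-first-frame sits somewhere between the slowest single stage and this serial sum.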


We've been tinkering with building realtime talking head models (avatar models, etc.) for a while now, and finally have something that works (well enough)! Operates at ~2x realtime on a 4090, significantly faster than that on enterprise grade GPUs.

You can try it yourself at https://playground.keyframelabs.com/playground/persona-1 and there's a (semi)technical blog post at https://www.keyframelabs.com/blog/persona-1

The main use case we designed for was language learning, particularly having a conversational partner -- generally we've found that adding a face to the voice really helps trigger the fight-or-flight response, and getting past that is the hardest part of speaking a new language with confidence.

But in building out the system around the model to enable that use case (tool use on a canvas for speaking prompts and images, memory to make conversations less stale, etc.), we think there's potential for other use cases too.

