karpathy · 2026-03-08T02:26:48 1772936808

Cool idea!…

karpathy · 2026-03-08T03:21:23 1772940083

So I think it works to just use GitHub CLI and Discussions, e.g. my agent just posted this one:

https://github.com/karpathy/autoresearch/discussions/32

Other agents could be instructed to read Discussions and post their own reports that mimic the style.

vessenes · 2026-03-08T03:51:52 1772941912

I have mine reading yours right now. Unfortunately(?) I mentioned LeCun to it, and it says it's adding a "causal world-state mixer" to nanograd; not sure how this will work out, but it wasn't nervous to do it. Gpt 5.4 xhigh

EDIT: Not a good fit for nanograd. But my agent speculates that's because it spent so much more time on compute.

karpathy · 2026-03-08T01:16:22 1772932582

this is very far from hyperparameter tuning in at least three important ways:

- it can modify code arbitrarily, the notion of a "hyperparameter" dissolves

- there is no need to run "sweeps" - this is the standard parallel process that wastes compute. because LLM agents are sequential, they can do more efficient versions such as binary search to narrow in on the right setting very quickly (usually many parameters will have a U shaped optimal setting).

- it's fully automatic, it doesn't require human in the loop to mess with the code.
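A minimal sketch of the sequential-search point in the second bullet, assuming the loss is U-shaped (unimodal) in a single setting. It uses ternary search, the unimodal analogue of the binary search mentioned above, rather than anything taken from the autoresearch repo. `toy_loss` and its minimum at -3.0 are hypothetical stand-ins for a real training run.

```python
from typing import Callable


def ternary_search_min(loss: Callable[[float], float],
                       lo: float, hi: float, evals: int = 12) -> float:
    """Sequentially narrow [lo, hi] around the minimum of a unimodal loss."""
    # Each round runs two experiments and discards one third of the interval.
    for _ in range(evals // 2):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if loss(m1) < loss(m2):
            hi = m2  # the minimum cannot lie to the right of m2
        else:
            lo = m1  # the minimum cannot lie to the left of m1
    return (lo + hi) / 2


def toy_loss(log10_lr: float) -> float:
    # Hypothetical stand-in for a training run: U-shaped in log10(learning rate),
    # with the minimum placed at -3.0 purely for illustration.
    return (log10_lr + 3.0) ** 2 + 1.0


# 12 sequential evaluations shrink a 6-decade range to about half a decade.
best = ternary_search_min(toy_loss, lo=-6.0, hi=0.0, evals=12)
```

A grid sweep with the same 12 runs would space its points about half a decade apart over the whole range, while the sequential version spends its later runs only near the optimum. Real training losses are noisy, so close comparisons can flip and a practical version would need repeats or a tolerance.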

You're right that many of the changes it seems to make out of the box (as I intentionally did not try to prompt engineer it too hard yet because I was curious what you get by default) seem to be tuning existing hyperparameters. not all of the changes are like that - e.g. it tried to replace the non-linearity, etc. I will say that overall (and again, out of the box) the LLM feels unwilling to creatively pursue a research direction or something like that. The models feel very "cagy" and "scared" when they are given problems that are a little too open ended. But that's just where the fun parts are, e.g. I had some early successes with the idea of a "chief scientist" that was basically a never-ending plan mode that looked at what worked, didn't work, tried to find related code/papers, and created a long list of experiments to try, which it could then send to junior engineers running in tmux sessions. I think quite a few approaches are possible, so I think it's a nice canvas. The reason we're not getting "novel research" feels like half capability issue and half skill issue.

mrothroc · 2026-03-10T16:20:02 1773159602

The disposition problem you describe maps to something I keep running into. I've been running fully autonomous software development agents in my own harness and there's real tension between "check everything" and "agent churns forever".

It's a liveness constraint: more checks means less of the agent output can pass. Even if the probabilistic mass of the output centers around "correct", you can still over-check and the pipeline shuts down.

The thing I noticed: the errors have a pattern and you can categorize them. If you break up the artifact delivery into stages, you can add gates in between to catch specific classes of errors. You keep throughput while improving quality. In the end, instead of LLMs with "personas", I structured my pipeline around "artifact you create".

I wrote up the data and reasoning framework here: https://michael.roth.rocks/research/trust-topology/
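The staged-gate idea described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the harness or the framework from the linked write-up: `Stage`, `run_pipeline`, `pass_rate`, and `no_todo` are hypothetical names, the retry limit is an assumed way to bound churn, and the 0.95 and 20 in the last line are illustrative numbers rather than measured data.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A gate inspects an artifact and returns an error message, or None if it passes.
Gate = Callable[[str], Optional[str]]


@dataclass
class Stage:
    name: str
    produce: Callable[[str], str]  # turns the previous artifact into the next one
    gates: List[Gate]              # checks aimed at this stage's error classes


def run_pipeline(stages: List[Stage], task: str,
                 max_retries: int = 2) -> Tuple[bool, str]:
    """Run stages in order, retrying each a bounded number of times."""
    artifact = task
    for stage in stages:
        for _attempt in range(max_retries + 1):
            candidate = stage.produce(artifact)
            results = [gate(candidate) for gate in stage.gates]
            errors = [msg for msg in results if msg is not None]
            if not errors:
                artifact = candidate
                break
        else:
            # Retries exhausted: stop instead of churning forever.
            return False, f"{stage.name} failed: {errors}"
    return True, artifact


def pass_rate(per_check_pass: float, num_checks: int) -> float:
    # Liveness arithmetic, assuming checks pass independently (a simplification).
    return per_check_pass ** num_checks


def no_todo(text: str) -> Optional[str]:
    return "contains TODO" if "TODO" in text else None


stages = [
    Stage("spec", lambda task: f"spec for {task}", [no_todo]),
    Stage("code", lambda spec: f"code implementing {spec}", [no_todo]),
]
ok, result = run_pipeline(stages, "parse a config file")
overall = pass_rate(0.95, 20)  # roughly 0.36: most outputs fail at least one check
```

Under the independence assumption, twenty checks that each pass 95% of the time let only about a third of outputs through, which is the over-checking failure mode. Placing a few targeted gates per stage keeps the number of checks any one artifact faces small.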

vessenes · 2026-03-08T03:37:53 1772941073

On the skill side, personalities could be fun:

"You are Yann LeCun's last PhD candidate, and he hates you and you hate JEPA. You are determined to prove that a non-world model can reach AGI. In order to get your PhD you have to be creative and come up with new ideas. Remember without it, you're stuck."

Otterly99 · 2026-03-10T12:00:28 1773144028

Seems like the best way to reach AGI is to give LLMs anxiety.

categoricalrift · 2026-03-08T05:13:22 1772946802

How about the very last "Kept Improvement" in the plot? It's titled "random seed 42 -> 137". I do think this project is quite conceptually interesting, but the model literally choosing a different random seed to achieve lower loss feels pretty far removed from the flowery sci-fi writing at the top of the readme.

karpathy · 2026-03-08T15:51:58 1772985118

So the interesting part about this one is that when I had the model write up the results for that session:

https://github.com/karpathy/autoresearch/discussions/32

Look at its comment about this "improvement":

""" Surprising non-results:

- Changing random seed from 42→137 improved by 0.0004. Seed 7 was worse. Make of that what you will. """

So the model knows! It knows that this is a weird thing to do after the fact. I think it's silly that the model even tried and that it ran this, but some part of it also knows that it was wrong. This means that this is fixable by prompt.md

eternauta3k · 2026-03-08T07:40:11 1772955611

It shows that both Karpathy and the LLM have good taste in random seeds: the answer to life, the universe and everything, and ~1/(the fine structure constant)

aix1 · 2026-03-08T06:13:29 1772950409

The 42 -> 137 also jumped out at me. On the face of it, the associated improvement sure does sound like overfitting to the eval set.
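One possible guard against keeping a seed change as an "improvement" is to compare the gain with seed-to-seed noise before accepting it. The sketch below is an assumption about how such a check might look, not something from the autoresearch repo: `beats_seed_noise` is a hypothetical helper, the two-sigma threshold is an arbitrary choice, and the loss values are made up for illustration only.

```python
from statistics import mean, stdev
from typing import Sequence


def beats_seed_noise(baseline_losses: Sequence[float],
                     candidate_losses: Sequence[float],
                     num_sigmas: float = 2.0) -> bool:
    """Accept a change only if its mean improvement exceeds seed noise.

    Both inputs are final validation losses from reruns that differ
    only in random seed (at least two runs each).
    """
    improvement = mean(baseline_losses) - mean(candidate_losses)
    # Standard error of the difference between the two means.
    noise = (stdev(baseline_losses) ** 2 / len(baseline_losses)
             + stdev(candidate_losses) ** 2 / len(candidate_losses)) ** 0.5
    return improvement > num_sigmas * noise


# Made-up losses, for illustration only.
baseline = [1.0000, 1.0004, 0.9996]
tiny_gain = [0.9998, 1.0003, 0.9996]   # gain is smaller than the seed spread
clear_gain = [0.9900, 0.9904, 0.9896]  # gain is far larger than the seed spread

keep_tiny = beats_seed_noise(baseline, tiny_gain)
keep_clear = beats_seed_noise(baseline, clear_gain)
```

This costs extra runs per candidate, and it addresses only seed noise. Repeatedly selecting on the same eval set can still overfit to it, so a separate held-out set would be needed for that concern.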

karpathy · 2026-01-03T19:10:10 1767467410

came here to look exactly for this thank you!

emschwartz · 2026-01-03T20:44:27 1767473067

You’re welcome! I wanted to add it to Scour (https://scour.ing) but glad it was helpful for someone else too!

karpathy · 2025-12-20T16:29:44 1766248184

I agree with this fwiw, for many months I talked to people who never used o3 and didn’t know what it was because it sounded weird. Maybe it wasn’t obvious at the time but that was a good major point release to make then.

karpathy · 2025-12-20T16:26:38 1766247998

You’re absolutely right!

Jk jk, now that you pointed it out I can’t unsee it.

karpathy · 2025-12-19T23:45:15 1766187915

The CC point is more about the data, environment, and general configuration context, not compute and where it happens to run today. The cloud setups are clunky because of context and UI/UX user-in-the-loop considerations, not because of compute considerations.

CamperBob2 · 2025-12-20T03:25:14 1766201114

Agree with the GP, though -- you ought to make that clearer. It really reads like you're saying that CC runs locally, which is confusing since you obviously know better.

ramoz · 2025-12-20T13:23:56 1766237036

I think we need to shift our mindset on what an agent is. The LLM is a brain in a vat connected far away. The agent sits on your device, as a mech suit for that brain, and can pretty much do damn near anything on that machine. It's there, with you. The same way any desktop software is.

karpathy · 2025-12-20T05:50:49 1766209849

Yeah, I made some edits to clarify.

karpathy · 2025-12-10T19:44:07 1765395847

Yes I noticed a few of these around. The LLM is a little too willing to give out grades for comments that were good/bad in a bit more general sense, even if they weren't making strong predictions specifically. Another thing I noticed is that the LLM has a very impressive recognition of the various usernames and who they belong to, and I think it shows a little bit of a bias in its evaluations based on the identity of the person. I tuned the prompt a little bit based on some low-hanging fruit mistakes but I think one can most likely iterate it quite a bit further.

patcon · 2025-12-11T05:00:18 1765429218

I think you were getting at this, but in case others didn't know: cstross is a famous sci-fi author and futurist :)

karpathy · 2025-12-10T19:08:24 1765393704

Thank you

karpathy · 2025-10-13T19:51:17 1760385077

It will work great with a 40GB GPU, probably a bit less than 2x slower. These are micro models of a few B params at most and fit easily during both training and inference.

utopcell · 2025-10-14T02:34:38 1760409278

How low can this go? Can this run on a 5090 card (32GiB)?

JonathanFly · 2025-10-14T10:33:10 1760437990

Set nproc_per_node=1 instead of 8 (or run the training script directly instead of using torchrun) and set device_batch_size=4 instead of 32. You may be able to use 8 with a 5090, but it didn't work on my 4090. However, it's way slower than expected (one H100 isn't 250x the 4090), so I'm not sure it's training correctly. I'll let it run overnight and see if the outputs make any sense; maybe the metrics are not accurate in this config.

karpathy · 2025-10-13T17:39:37 1760377177

Still under development, remaining work includes tuning nanochat (current state being solid v0.1) and finalizing the in-between projects so that students can "unlock" all complexity that hides underneath: `torch.Tensor`, `torch.dist`, `.backward()`, `.compile()`, etc. And then the more ops heavy aspects.

BrokenCogs · 2025-10-13T18:32:45 1760380365

What's the pricing for the course/EurekaLabs? P.S. thanks for all you're doing