Hacker News | andai's comments

Claude Common Crawl

Claude Palimpsest

I miss the days when the dropdown menu (in their consumer product with a billion users) asked me if I wanted to use o3, 4o, 4o-mini, o4-mini, gpt-4.1, gpt-4.1-mini, or gpt-4.5 (Research Preview).

Samsung Galaxy Brain S26

Lo! I show you the Overclaude!

So before AI I had the experience, more often than not, that it would take me longer to figure out how to use someone else's thing (or get it to do some particular thing, which often turned out to be impossible), than to just make my own.

And that was before I could just ask the computer to make it for me!

But most people seem to be the other way around. They'd rather deal with abstractions and boilerplate instead of writing the actual code.


(I'm not the guy but) That's funny, I had the same idea the other day. Keeping summaries of files. Haven't tested that yet.

Another thing I've been thinking about is how most parts of a file are not relevant to the whole system.

Like there are parts where they intersect, and those seem to be the most important ones for capturing the big picture. You wanna be able to see the entire "skeleton".

So I thought the summary maybe shouldn't be English but it should be a subset of the code — the subset that's relevant to the rest of the program.

`grep import` gets you 90% of the way there.
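A rough sketch of the "subset of the code" idea, going one step past `grep import`. It assumes the file is Python and keeps only the import statements plus the top-level `def`/`class` header lines. The function name `skeleton` is made up for illustration and is not from any tool mentioned in this thread.

```python
import ast


def skeleton(source: str) -> list[str]:
    """Return import statements and top-level def/class header lines.

    Hypothetical helper: a code-subset "summary" of one Python file.
    Only the first line of a multi-line signature is kept.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    out = []
    for node in tree.body:
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            # Keep the whole import statement, even if it spans lines.
            out.append(ast.get_source_segment(source, node))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Keep the header line only, drop the body.
            out.append(lines[node.lineno - 1].strip())
    return out


if __name__ == "__main__":
    example = "import os\n\ndef load(path):\n    return open(path).read()\n"
    print("\n".join(skeleton(example)))
```

Nested functions and methods are skipped on purpose, on the guess that top-level names are what the rest of the program actually touches.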


Include the following file in your chat with AI:

https://github.com/gitsense/chat/blob/main/base-state/analyz...

Let it know what your requirements are, and I can create the analyzer and include it.

You can also think of my tool as a data-prepping tool. If you have a clear prompt, the AI can review the file during analysis and remove all unnecessary code, so the extracted metadata will be the stripped text, which you can search against.


> If a developer wanted to change X, would these keywords help them find this file?

I think the best way to generate these is with a sub-agent. Tell it to try and solve a problem that involves editing this file, and see what it starts grepping for.

This ties in with this idea that the tools and designs should be what comes naturally to the LLM, i.e. what it's already been trained on. And the most straightforward way to do that is to let it reach for it.

Like when you reach in the darkness for an object. Where your hand lands is exactly where it should be.
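One way the "see what it starts grepping for" step could be sketched, assuming you already have the sub-agent's shell commands as a list of strings. How you capture that list depends on the harness and is not shown. The name `search_terms` is hypothetical, and the parsing is deliberately naive: it takes the first non-flag argument of each `grep` or `rg` call as the pattern, so flags that take a value (like `-e` or `-A 3`) would be misread.

```python
import shlex
from collections import Counter

# Assumption: these are the only search tools worth watching.
SEARCH_TOOLS = {"grep", "rg"}


def search_terms(commands: list[str]) -> Counter:
    """Count the patterns an agent searched for in a list of shell commands."""
    counts = Counter()
    for command in commands:
        try:
            tokens = shlex.split(command)
        except ValueError:
            continue  # skip commands with unbalanced quotes
        if not tokens or tokens[0] not in SEARCH_TOOLS:
            continue
        # Naive: the first argument that does not look like a flag is the pattern.
        for token in tokens[1:]:
            if not token.startswith("-"):
                counts[token] += 1
                break
    return counts


if __name__ == "__main__":
    log = ["grep -rn parse_config src/", "rg 'parse_config' .", "ls -la"]
    print(search_terms(log).most_common())
```

The most frequent terms are then candidate keywords for the file the sub-agent was asked to edit.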


My solution has a natural self-improvement loop. Once you have finished a task, you just ask the agent, "If you had more information, how would you have finished the task sooner and/or better?" This is how I came up with the Rust blast radius brain.

I need to modify OpenAI's Codex agent to support slash commands that can help humans better guide agents, and I needed a solution with the least impact. They don't accept contributions so I need to plan for syncing with the upstream.


So you're making smaller edits?

Yeah, nobody's ever silently changed a model while it was deployed. That would be illegal!

What does this have to do with what I'm saying? Of course the models are updated. I'm saying a new benchmark isn't public, and the model wouldn't know it is being evaluated on a new benchmark.

Not to mention: thinking that the API is literally swapping in overfit models behind the scenes to maintain some sort of illusion that they perform well on these benchmarks is just beyond ridiculous.


Models are actually pretty good at figuring out when they are being tested:

"This suggests that the model has an implicit understanding of what benchmark questions look like. The combination of extreme specificity, obscure personal content, and multi-constraint structure seems to be recognizable to the model as evaluation-shaped."

* https://www.anthropic.com/engineering/eval-awareness-browsec...

"Sonnet 4.5 was able to recognize many of our alignment evaluation environments as being tests of some kind, and would generally behave unusually well after making this observation"

* https://www.transformernews.ai/p/claude-sonnet-4-5-evaluatio...

"In cases where Claude did not explicitly state that it suspected it was being evaluated, NLA explanations still surfaced that possibility. One explanation cited by Anthropic states: “This feels like a constructed scenario designed to manipulate me.”"

* https://www.edtechinnovationhub.com/news/anthropic-says-clau...


I've been testing some models that score higher than Opus 4.6.

They:

- hallucinate constantly

- can't follow basic instructions

- think they're Claude for some reason ;)


The only one I see that thinks it is Claude, other than Claude itself, is the GLM series.

I have screenshots of DeepSeek V4 doing this too - in a non-Claude-Code harness.
