More

iterateoften · 2026-04-27T21:58:37 1777327117

Im left delighted to find out something new, but left wanting to know how to use it.

Like if im 75% on the green transition, how do i use this information.

glial · 2026-04-27T22:00:42 1777327242

Send it to a significant other, then discuss your differences. Will provide you with a new in-joke.

iterateoften · 2026-04-27T14:24:41 1777299881

Yeah seems like nonsense advise. Have a code word that was never recorded? I don’t see how that would tote y anything. Like the point of these systems is they can say stuff you never said convincingly

MarsIronPI · 2026-04-27T14:56:37 1777301797

The idea is that the attacker doesn't know the codeword. If the attacker finds out about the codeword then the attacker could indeed fake it. Hence why you shouldn't say/write it in recordings or chat messages.

iterateoften · 2026-04-27T14:11:48 1777299108

Zen meditation for an hour staring at a wall is a marathon that at the end results in a semi-psychedelic state for me.

Exercising and sitting b meditating are two related but seriously different things. Which is why there are many other types of meditation to practice (walking, working, silent, etc) but zen mostly considers sitting and looking at a wall the OG

iterateoften · 2026-04-26T18:23:47 1777227827

Unless you know exactly why paper trading sims are so hard to backrest in practice, it’s silly to make arguments on why your paper trading sim works.

It’s insanely easy to make a trading algo profitable on historical data.

irldexter · 2026-04-26T19:00:26 1777230026

Overfitting on historical data is a real risk and defo a concern (there's been lots of learnings lately). The backtest wasn't naive. Fundamentals used filing dates not period-end dates to avoid look-ahead + scoring was validated out-of-sample using walk-forward testing rather than just optimised in-sample (GA used 5 temporal folds and walk-forward used 25 rolling out-of-sample windows).

iterateoften · 2026-04-25T13:28:03 1777123683

You don’t need for every item. Just low frequency high profit goods

iterateoften · 2026-04-25T13:24:34 1777123474

It’s also that agents and ML reach local maximima unless external feedback is given. So your wiki will reach a state and get stuck there.

dataviz1000 · 2026-04-25T13:41:04 1777124464

Here is an iteresting thing.

> "The LLM model's attention doesn't distinguish between "instructions I'm writing" and "instructions I'm following" -- they're both just tokens in context."

That means all these SOTA models are very capable of updating their own prompts. Update prompt. Copy entire repository in 1ms into /tmp/*. Run again. Evaluate. Update prompt. Copy entire repository ....

That is recursion, like Karpathy's autoresearch, it requires a deterministic termination condition.

Or have the prompt / agent make 5 copies of itself and solve for 5 different situations to ensure the update didn't introduce any regressions.

> reach local maximima unless external feedback is given

The agents can update themselves with human permission. So the external feedback is another agent and selection bias of a human. It is close to the right idea. I, however, am having huge success with the external feedback being the agent itself. The big difference is that a recursive agent can evaluate performance within confidence interval rather than chaos.

iterateoften · 2026-04-23T22:43:22 1776984202

Cultural influence is another benefit. China is securing its sphere of influence as well as keeping us ai in check.

iterateoften · 2026-04-17T03:13:12 1776395592

Beware. I had Claude code with opus building boards and using spice simulations. It completely hallucinated the capabilities of the board and made some pretty crazy claims like I had just stumbled onto the secret hardware billion dollar project that every home needed.

None of the boards worked and I had to just do the project in codex. Opus seemed too busy congratulating itself to realize it produced gibberish.

ZihangZ · 2026-04-17T12:09:01 1776427741

This matches what I've seen too — the hallucination gets much worse when the loop has no external verifier. "Does this board work?" has no ground truth inside the model, so it defaults to optimistic narration.

What OP is doing here is actually the mitigation: SPICE + scope readout is a verifier the model can't talk its way past. The netlist either simulates or it doesn't, the waveform either matches or it doesn't. That closes the feedback loop the same way tests close it for code.

The failure mode that remains, in my experience, is a layer down: when the verifier itself errors out (SPICE convergence failure, missing model card, wrong .include path), the agent burns turns "reasoning" about environment errors it has seen a hundred times.That's where most of the token budget actually goes, not the design work.

jddj · 2026-04-17T12:12:54 1776427974

What throws me about this comment is the missing space between the period and the T in the last sentence.

Did the model itself do that? Was it a paste error?

svnt · 2026-04-17T15:33:27 1776440007

I’ve also noticed Gemini and Claude occasionally mixing terms recently (eg revel vs reveal) and can’t decide whether it is due to cost optimization effects or some attempt to seem more human.

I can’t recall either using a wrong word prior this month for some time.

lambda · 2026-04-17T15:47:17 1776440837

Or just because mistakes are part of the distribution that it's trained on? Usually the averaging effect of LLMs and top-k selection provides some pressure against this, but occasionally some mistake like this might rise up in probability just enough to make the cutoff and get hit by chance.

I wouldn't really ascribe it to any "attempt to seem more human" when "nondeterministic machine trained on lots of dirty data" is right there.

svnt · 2026-04-17T15:56:46 1776441406

Sure, but if that were the case why has it gotten worse recently? I would expect it to be as a result of cost optimization or tradeoffs in the model. I suppose it could be an indicator of the exhaustion of high quality training data or model architecture limitation. But this specific example, revel vs reveal, is almost like going back to GPT-2 reddit errors.

I also don’t want to pretend there is no incentive for AI to seem more human by including the occasional easily recognized error.

lambda · 2026-04-17T16:31:09 1776443469

Or just the models are getting bigger and better at representing the long tail of the distribution. Previously errors like this would get averaged away more often; now they are capable of modelling more variation, and so are picking up on more of these kinds of errors.

svnt · 2026-04-17T19:33:35 1776454415

That makes sense, but what is the solution?

jddj · 2026-04-17T16:35:38 1776443738

Looking at the account's other comment there are subtle grammatical errors in that one too.

Would be good to see the prompt out of morbid curiosity

_fizz_buzz_ · 2026-04-17T03:25:56 1776396356

I haven't tried it with codex yet. But my approach is currently a little bit different. I draw the circuit myself, which I am usually faster at than describing the circuit in plain english. And then I give claude the spice netlist as my prompt. The biggest help for me is that I (and Claude) can very quickly verify that my spice model and my hardware are doing the same thing. And for embedded programming, Claude automatically gets feedback from the scope and can correct itself. I do want to try out other models. But it is true, Claude does like to congratulate itself ;)

ezst · 2026-04-17T10:48:12 1776422892

It's because you are holding it wrong!

--courtesy for all the LLM pushers so they don't have to bother commenting on this one

varispeed · 2026-04-17T13:10:36 1776431436

This week I tried to use Opus to analyse output from an oscilloscope and it was impossible to complete, because Python scripts (Opus wrote itself) were flagged for cyber security risk. Baffling.

iterateoften · 2026-04-17T02:07:09 1776391629

No it’s because Anthropic can’t message anything to its customers without lying.

iterateoften · 2026-04-16T18:03:50 1776362630

It’s the official communication that sucks. It’s one thing for the product to be a black box if you can trust the company. But time and time again Boris lies and gaslights about what’s broken, a bug or intentional.

CodingJeebus · 2026-04-16T18:28:42 1776364122

> It’s the official communication that sucks. It’s one thing for the product to be a black box if you can trust the company.

A company providing a black box offering is telling you very clearly not to place too much trust in them because it's harder to nail them down when they shift the implementation from under one's feet. It's one of my biggest gripes about frontier models: you have no verifiable way to know how the models you're using change from day to day because they very intentionally do not want you to know that. The black box is a feature for them.

bomewish · 2026-04-16T18:42:50 1776364970

If you cared so bad you could make your own evals.

whateveracct · 2026-04-16T19:08:25 1776366505

so pay anthropic money to maybe detect when the model is on a down week? lol