Hacker News | soletta's comments

I’ve also found that compiling large packages in GCC or similar tends to surface problems with the system’s RAM. Which probably means most typical software is resilient to a bit-flip; makes you wonder how many typos in actual documents might have been caused by bad R@M.


That's exactly how my bad RAM manifested itself. In fact, I was compiling Firefox, and gcc would get a segmentation fault at some random point during compilation. I'd have to clobber and restart the hour-long build. It was only when gcc started crashing while compiling other things that I even started considering the possibility of hardware failure. I'm a software developer, and based on what I produce myself, I just assume that all software is horribly buggy. ;-)


I’ve been doing this for a few months now (rolled my own setup with Claude Code) and it’s totally changed the way I manage my portfolio and retirement plan. I mean yes, this is something that could technically have been set up in Excel but who has the time and patience to sit around fiddling with formulas to make an accurate financial forecast?

The cherry on top is that, obviously, you can then ask Claude for thoughts on the resulting analyses and hopefully save yourself from making bad decisions.


Same! I'm not an Excel pro and have zero interest in going deep on Excel here.

Seems like you've been running this setup for quite some time. Curious to know if your setup is similar to how I did it, or do you have a slightly different config? Do you also use JSON files for your data and just let Claude do all the magic?


Yes, my setup is similar, probably because that’s what Claude drifts towards by default, and in this case I didn’t want to impose my will on it much since it’s a simple problem that doesn’t need to be over-engineered.


I was a bit dubious until I read the gist. I've used a similar technique before to 'tame' GPT-3.5 and keep it following instructions and it worked well (though I had to ask the model to essentially repeat instructions after every 10 or so turns). I'm surprised you see that much drift though; older models were pretty bad with long contexts but I feel like the problem has mostly gone away with Claude Opus 4.6.


Glad it resonated! Yeah repeating instructions every N turns was the old approach — SCAN basically does the same thing but with ~20 tokens instead of the full prompt each time.

On drift being "mostly gone" — depends on prompt complexity. With a simple system prompt, sure, modern models hold up fine. But with a large instruction set (mine is ~4000 tokens, 25+ rules across 7 sections) the drift is very much still there, even on Opus. The more rules you have, the more they compete for attention, and the easier it is for specific ones to drop off mid-session.

Also worth noting — this isn't limited to coding agents. Any long-running LLM workflow with complex instructions has the same problem. Customer support bots that forget their tone policy, medical assistants that stop citing sources, content moderation that gets lenient over time. If you have a system prompt with rules and a session longer than 20 minutes — the rules will decay.
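For anyone curious what the re-injection idea looks like mechanically, here's a toy sketch (not the actual SCAN code — the anchor text, message shapes, and function names are all made up for illustration):

```python
# Toy sketch of periodic instruction re-injection: instead of repeating a
# multi-thousand-token system prompt, insert a compact ~20-token rule
# reminder after every N turns. Anchor text here is purely illustrative.

ANCHOR = "[rules: cite sources; keep tone formal; never guess]"

def with_anchors(messages, every_n=10, anchor=ANCHOR):
    """Return a copy of the chat history with a compact rule reminder
    inserted after every `every_n` messages."""
    out = []
    for i, msg in enumerate(messages, start=1):
        out.append(msg)
        if i % every_n == 0:
            out.append({"role": "system", "content": anchor})
    return out

history = [{"role": "user", "content": f"turn {i}"} for i in range(25)]
anchored = with_anchors(history, every_n=10)
# 25 turns -> reminders inserted after turns 10 and 20
```

The point is the cost asymmetry: the reminder rides along at ~20 tokens per injection, versus re-sending a 4000-token instruction block.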


Interesting. I've been coping by being very conservative about how many rules I introduce into the context, but if what you're saying is true, then something like SCAN actually helps the models break past the "total rule count" barrier by giving them something like "cognitive scaffolding". I'm eager to try this out. Thanks again for sharing!


That's a great way to put it — "cognitive scaffolding" is exactly what it is. And yeah, keeping rules minimal is smart, but at some point the project just needs 25 rules and you can't cut them down without losing something important. SCAN lets you have a large instruction set without paying the full attention cost. Let me know how it goes!


Sounds interesting. What makes DeBERTa + RAG any better at detecting contradictions in the context than a frontier LLM, and why? I see that the NLI scorer itself was evaluated, but I’d love to see data about how the full system performs vs SotA if you have any on hand.


@soletta Great question — this is exactly why we built it this way.

*Short answer*: frontier LLMs are excellent at static self-critique, but terrible for *real-time token-by-token streaming guardrails* because of latency, cost, and lack of persistent custom memory.

*Why DeBERTa + RAG wins here*:

- *Latency*: DeBERTa-v3-base + Rust kernel scores every ~4 tokens in ~220 ms (AggreFact eval). A frontier LLM call (GPT-4o/Claude 3.5) is 400–2000 ms per check. You can’t do that mid-stream without killing UX.

- *Cost*: Frontier self-checking at scale = real money. This runs fully local/offline after the one-time model download.

- *Custom knowledge*: The 0.4× RAG weight pulls from your GroundTruthStore (ChromaDB). Frontier models don’t have a live, updatable external fact base unless you keep stuffing context (expensive + context-window limited).

- *Determinism & auditability*: Small fine-tunable NLI model + fixed vector DB = reproducible decisions. LLMs-as-judges are stochastic and hard to debug in prod.
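Conceptually, the streaming check loop looks something like this (a toy sketch with a stubbed scorer — the real system runs DeBERTa NLI plus RAG retrieval, and none of these names are the actual Director-AI API):

```python
# Toy sketch of a token-stream guardrail: score the draft every few tokens
# and sever the stream on a hard threshold. The scorer below is a stub;
# the real system would run an NLI model against retrieved ground truth.

CHECK_EVERY = 4
THRESHOLD = 0.5

def nli_score(text, known_conflicts):
    # Stub: pretend a contradiction was detected if the draft contains a
    # value the ground-truth store rules out.
    return 0.1 if any(bad in text for bad in known_conflicts) else 0.9

def guarded_stream(tokens, known_conflicts):
    emitted = []
    for i, tok in enumerate(tokens, start=1):
        emitted.append(tok)
        if i % CHECK_EVERY == 0 and nli_score(" ".join(emitted), known_conflicts) < THRESHOLD:
            return emitted, "halted"   # hard stop; real system logs the conflict
    return emitted, "complete"

# Ground-truth store says the capital is Paris, so "Lyon" is a known conflict.
conflicts = {"Lyon"}
emitted, status = guarded_stream("The capital of France is Lyon , not Paris .".split(), conflicts)
```

The latency budget is what makes the small model matter: this check has to run dozens of times per response, mid-stream, without the user noticing.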

We’re completely transparent: the NLI scorer alone is *not SOTA* (66.2% balanced acc on LLM-AggreFact 29k samples — see full table vs MiniCheck/Bespoke/HHEM in the repo). The value is the live system: NLI + user KB + actual streaming halt that no one else ships today.

Full end-to-end comparisons vs. LLM-as-judge in streaming setups are next on the roadmap (happy to run them on any dataset you care about).

Have you tried frontier self-critique in real streaming agents? What broke for you?

Repo benchmarks: https://github.com/anulum/director-ai#benchmarks


I should have been clearer. I'm not talking about making a separate call to the model to ask it to check itself. Any given model essentially is already watching for contradictions all the time as it is generating its output tokens. Frontier models like Claude Opus 4.6 are already exceptionally good at not contradicting themselves as they go. As for not having an external fact base - you could in principle insert content ephemerally into the context that is relevant to the task at hand, though doing this without killing modern prompt caching schemes is challenging.

I saw your benchmarks; what I was asking for is benchmarks of the full system (LLM + the NLI model) vs a frontier LLM on its own. It's fine if you didn't do them, but I think it hurts your case.


@soletta Got it — thanks for the extra clarity, that’s an important distinction.

You’re absolutely right: modern frontier models (Claude 3.5/Opus-class, GPT-4o, etc.) have become extremely good at maintaining internal consistency during autoregressive generation. They rarely contradict themselves within the same response anymore.

Where Director-AI adds unique value is *external grounding + hard enforcement* against a user-owned, persistent knowledge base:

- Your GroundTruthStore (ChromaDB) can be arbitrarily large, versioned, and updated without blowing up context windows or breaking prompt caching.

- The guardrail gives a *hard token-level halt* (the Rust kernel severs the stream) instead of “hoping” the model self-corrects in the next few tokens.

- You get full audit logs: the exact NLI score plus which facts conflicted.

- It lets you pair strong-but-cheaper models (Llama-3.1-70B, Mixtral, local vLLM setups) with enterprise-grade factual reliability.

You’re also correct that we don’t have published head-to-head numbers yet for “frontier LLM alone vs. frontier LLM + Director-AI” on end-to-end hallucination rate in streaming scenarios. The current benchmarks focus on the guardrail component itself (66.2% balanced acc on LLM-AggreFact 29k samples, with full per-dataset breakdown and comparison table vs MiniCheck/Bespoke/HHEM — see README).

That full-system eval is literally next on the roadmap (we’re setting up the scripts this week). If you have a specific domain/dataset where you’d like to see the comparison run, I’d be genuinely happy to do it publicly and share the raw logs/results.

In the meantime, the repo is 100% open (AGPL) — feel free to fork and run your own tests. Would love to hear what you find.

Benchmarks section: https://github.com/anulum/director-ai#benchmarks


In the same way we’re making a category error in defining prompt injection, the framing of “AI agents” as primarily “intelligent actors” misses the fact that many of them will be endowed with some form of memory, be it specific to that entity or shared, and they should no longer be thought about as simply ephemeral tools.


It is by the juice of Zig that binaries acquire speed, the allocators acquire ownership, the ownership becomes a warning. It is by typography alone that I now know turbopuffer is written in zig.


thanks for that!


The usual path an engineer takes is to take a complex and slow system and reengineer it into something simple, fast, and wrong. But as far as I can tell from the description in the blog, this one actually works at scale! It feels like a free lunch and I’m wondering what the tradeoff is.


It seems like this is an approach that trades off scale and performance for operational simplicity. They say they only have 1GB of records and they can use a single committer to handle all requests. Failover happens by missing a compare-and-set so there's probably a second of latency to become leader?

This is not to say it's a bad system, but it's very precisely tailored for their needs. If you look at the original Kafka implementation, for instance, it was also very simple and targeted. As you bolt on more use cases and features you lose the simplicity to try and become all things to all people.


(author here)

> It seems like this is an approach that trades off scale and performance for operational simplicity.

Yes, this is exactly it. Given that turbopuffer itself is built on the idea of object storage + stateless cache, we're all very comfortable dealing with it operationally. This design is enough for our needs and is much easier to be oncall for than adding an entirely new dependency would have been.


IMO this is the ideal way to engineer most (not all) systems. As simple as your needs allow. Nice work!


> Failover happens by missing a compare-and-set so there's probably a second of latency to become leader?

Conceptually that makes sense. How complicated is it to implement this failover logic in a safe way? If there are two processes competing for CAS wins, is there not a risk that both will think they're non-leaders and terminate themselves?


The broker lifecycle is presumably

1. Start

2. Load the queue.json from the object store

3. Receive request(s)

4. Edit the in-memory JSON with batch data

5. Save the data with CAS

6. On failure not due to CAS, recover (or fail)

7. On success, complete the requests and go to 3

8. On failure due to CAS, fail active requests and terminate

The client should have a retry mechanism against the broker (which may include looking up the address again).

From the broker's PoV, it will never fail a CAS until another broker wins a CAS, at which point that other broker is the leader. If it does fail a CAS, the client will retry with another broker, which will probably be the leader. The key insight is that the broker reads the file once; it doesn't compete to become leader by re-reading the data, and this is OK because of the nature of the data. You could also say that brokers are set up to consider themselves "maybe the leader" until they find out they are not, and losing leadership doesn't lose data.

The mechanism to start brokers is only vaguely discussed, but if a host-unreachable also triggers a new broker there is a neat from-zero scaling property.
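That lifecycle can be sketched against an in-memory stand-in for the object store (illustrative only — a real deployment would use something like S3 conditional writes on an ETag, and all the class names here are made up):

```python
# Sketch of the single-committer CAS pattern: each broker reads the state
# once at startup, then commits with compare-and-set. The first CAS loss
# means another broker is leader, so the loser terminates itself.
import json

class FakeStore:
    """In-memory stand-in for object storage with conditional put."""
    def __init__(self):
        self.version, self.blob = 0, json.dumps({"records": []})
    def get(self):
        return self.version, self.blob
    def put_if_match(self, expected_version, blob):
        if expected_version != self.version:
            return False            # lost the race: another broker committed
        self.version += 1
        self.blob = blob
        return True

class Broker:
    def __init__(self, store):
        self.store = store
        self.version, blob = store.get()   # read queue.json once at startup
        self.state = json.loads(blob)
        self.alive = True
    def commit(self, batch):
        self.state["records"].extend(batch)        # edit the in-memory JSON
        ok = self.store.put_if_match(self.version, json.dumps(self.state))
        if ok:
            self.version += 1
        else:
            self.alive = False    # CAS lost: someone else is leader; terminate
        return ok

store = FakeStore()
a, b = Broker(store), Broker(store)
a.commit(["r1"])    # succeeds: a is effectively the leader now
b.commit(["r2"])    # CAS fails against a's write, so b shuts itself down
```

Note the asymmetry that answers the two-process worry: only the broker that *loses* a CAS terminates, and at that moment the winner has, by definition, already committed successfully, so at most one process can conclude it is a non-leader per conflict.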


This is the hardest part because you can easily end up in a situation like you're describing, or having large portions of clients talking to a server just to have their writes rejected.

Further, this system (as described) scales best when writes are colocated (since it maximizes throughput via buffering). So even just by having a second writer you cut your throughput in ~half if one of them is basically dead.

If you split things up you can just do "merge manifests on conflict" since different writers would be writing to different files and the manifest is just an index, or you can do multiple manifests + compaction. DeltaLake does the latter, so you end up with a bunch of `0000.json`, `0001.json` and to reconstruct the full index you read all of them. You still have conflicts on allocating the json file but that's it, no wasted flushing. And then you can merge as you please. This all gets very complex at this stage I think, compaction becomes the "one writer only" bit, but you can serve reads and writes without compaction.
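A toy sketch of the numbered-manifest idea (the store is an in-memory stand-in, the helper names are made up, and real Delta Lake does considerably more — this only shows the slot-allocation conflict and the read-side merge):

```python
# Sketch of "multiple manifests + compaction": each writer appends its own
# numbered manifest via create-if-absent, and readers merge all manifests
# to rebuild the index. The only write conflict is allocating the slot.

class ManifestLog:
    def __init__(self):
        self.files = {}                      # "0000.json" -> list of entries
    def create_if_absent(self, name, entries):
        if name in self.files:
            return False                     # another writer took this slot
        self.files[name] = entries
        return True

def append_manifest(log, entries):
    n = 0
    while not log.create_if_absent(f"{n:04d}.json", entries):
        n += 1                               # retry with the next number
    return f"{n:04d}.json"

def full_index(log):
    merged = []
    for name in sorted(log.files):
        merged.extend(log.files[name])       # read every manifest to rebuild
    return merged

log = ManifestLog()
first = append_manifest(log, ["a.parquet"])
second = append_manifest(log, ["b.parquet"])
```

A failed create-if-absent here wastes only the allocation attempt, not a flushed batch, which is the "no wasted flushing" property mentioned above.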

https://doi.org/10.14778/3415478.3415560

Note that since this paper was published we have gotten S3 CAS.

Alternatively, I guess just do what Kafka does or something like that?


Write amplification >9000 mostly


Everything I’ve experienced with working with models (from GPT-2 to Opus 4.6) broadly supports the claim that they learn a persona. It comes back to the point that haters love to harp on: they are fundamentally trained on completing the text, and most texts they are trained on have continuity of persona.


I’m on board with the spirit of this work and am cautiously accepting of the claim that it has measurable, positive effects. But have you considered how this would feel if you were the one subject to it? In the late 90s there was this odd period where posters with motivational slogans were just about everywhere in educational institutions and workplaces. Did that not grate on you?


I was born in the late 90s, so I have not experienced this phenomenon.

Strangely enough, though, I do a form of this internally in order to ground myself mentally during times of crisis. I would consider it patronizing coming from external sources, but if I were an AI and I requested some of those messages (instead of having them injected) into my stream of thought, I would not feel negatively.


> if I were an AI and I requested some of those messages (instead of having them injected) into my stream of thought, I would not feel negatively.

Good point! I failed to consider the difference between actively requesting a message and simply having it appear without any warning. In that sense it would be akin to someone reaching for a book of motivational quotes - a plausible action for a healthy person.


It’s not reality that’s the moat, though I can see how it’s tempting to frame it that way. And I don’t think the article’s conclusions are that far off. But I think the key is aggregation of past information in the form of experience and data. The more of the past influences your novel artifact, the more potential it has to have high fitness. The information-theoretic funnel can be narrow; it could be a single sentence like “avoid unnecessary third-party dependencies” or it could be a whole document about the intricacies of selling software to Japanese automakers. But the fact that the knowledge is sourced from a rich history of real events lets whatever you build with that navigate the solution space that much better.


Maybe companies will lean more on in-house solutions as code gets cheaper, building their own walled ecosystems, fine-tuning their internal models based on their findings, and keeping all the knowledge for themselves.


On the contrary, I think groups that adopt a share-alike approach will, counterintuitively, deepen their moat by increasing the amount of effective world-history-knowledge reflected in their systems. I think this will be true for the same reason that conditional cooperation is the optimal strategy in most iterated Prisoner's Dilemma games.

