More

zknill · 2026-04-24T08:23:11 1777018991

AI generated code, where the author doesn't understand the code, shifts the burden of checking quality and function onto the reviewer.

This post says little about that, and suggests some improvements the _reviewer_ can make.

I think that's completely the wrong end of the stick to be tackling. As the burden is still on the reviewer and not the author.

kristopherleads · 2026-04-26T09:55:58 1777197358

I 100% agree with you that in a perfect world the submitter should be doing the review work. But the reality is that we don't live in a perfect world - and just sticking our fingers in our ears and shouting it's not our job isn't going to make data breaches less common or code better quality. Accordingly, if we accept that we can't make vibe coders better stewards (although I absolutely do think we can help in that regard, and I suggest ways to do that in the post), then we have to do our part of improve it somehow.

zknill · 2026-04-23T11:46:37 1776944797

I wrote about "All your agents are going async"[1], and everyone said "Can't you just do this with SSE". So I figured I'd dig into that claim.

[1]: https://news.ycombinator.com/item?id=47832720

zknill · 2026-04-23T10:18:23 1776939503

This is great, I built a manual integration based on JMAP and CalDav cli tooling, but this is neat. Especially:

> The OAuth consent screen will give you a choice of three levels of access: read-only (see emails, contacts, calendars), write (update emails, save drafts, edit contacts and events), and send (send emails).

zknill · 2026-04-22T11:13:02 1776856382

I suspect the answer is that the AI chat-app is built so that the LLM response tokens are sent straight into the HTTP response as a SSE stream, without being stored (in their intermediate state) in a database. BUT the 'full' response _is_ stored in the database once the LLM stream is complete, just not the intermediate tokens.

If you look at the gifs of the Claude UI in this post[1], you can see how the HTTP response is broken on page refresh, but some time later the full response is available again because it's now being served 'in full' from the database.

[1]: https://zknill.io/posts/chatbots-worst-enemy-is-page-refresh...

zknill · 2026-04-22T10:46:04 1776854764

> "and which ones are no longer relevant."

This is absolutely the hardest bit.

I guess the short-cut is to include all the chat conversation history, and then if the history contains "do X" followed by "no actually do Y instead", then the LLM can figure that out. But isn't it fairly tricky for the agent harness to figure that out, to work out relevancy, and to work out what context to keep? Perhaps this is why the industry defaults to concatenating messages into a conversation stream?

edg5000 · 2026-04-23T06:16:51 1776925011

My guess (I will test this eventually) is that you set a window size (which may be the model limit, or lower to reduce input token costs), the harness then refuses to show items that don't fit. If the model emits a command to read a file, the harness then says "File hidden due to lack of context space". In the system prompt, the model is informed about the context space usage, and that it can hide files. It needs to be instructed that if files contain something noteworthy, that the agent notes this down in their notes, which should always be rendered into the context. If this fails, the agent will hide a file with relevant information and then get lost in circles. If it succeeds, the agent can work on larger tasks autonomously. So it's worth trying.

asixicle · 2026-04-22T11:24:14 1776857054

That's what the embedding model is for. It's like a tack-on LLM that works out the relevancy and context to grab.

nprateem · 2026-04-22T11:48:27 1776858507

God knows why you think this is possible. If I don't even know what might be relevant to the conversation in several turns, there's no way an agent could either.

asixicle · 2026-04-22T11:57:04 1776859024

One of us is confusing prediction with retrieval. The embedding model doesn't predict what is going to be relevant in several turns, just on the turn at hand. Each turn gets a fresh semantic search against the full body of memory/agent comms. If the conversation or prompt changes the next query surfaces different context automatically.

As you build up a "body of work" it gets better at handling massive, disparate tasks in my admittedly short experience. Been running this for two weeks. Trying to improve it.

edg5000 · 2026-04-23T06:19:59 1776925199

So the embedding model is a fixed-size view on a arbitrarily sized work history (tool calls, natural language messages)? The model is like a summarizer, but in latent space? And not aimed to summarize, but trained to hold whatever is needed for the agent to be autonomous for longer runs?

asixicle · 2026-04-23T21:34:15 1776980055

Pretty much. It's a fixed-size vector per chunk-- 1024 dims in the case of Voyager Nano. The autonomy part is entirely in how you build the vectorDB and query it, not in the model's training. That's the part I've been focusing on lately. Trying different methods and seeing what gives the best results.

At the moment I wouldn't emphasize "autonomous-ness", there's still a fair bit of human hand holding. But once I get a model on the right path it can switch back to to an old project, autonomously locate and debug 2-week old commits and the context around their development, and apply that knowledge to the task at hand.

It's only been a day but I seeing an improvement from nomite (768dims) to Voayager.

zknill · 2026-04-22T10:07:09 1776852429

Assuming LROs are "Long running operations", then you kick off some work with an API request, and get some ID back. Then you poll some endpoint for that ID until the operation is "done". This can work, but when you try and build in token-streaming to this model, you end up having to thread every token through a database (which can work), and increasing the latency experienced by the user as you poll for more tokens/completion status.

Obviously polling works, it's used in lots of systems. But I guess I am arguing that we can do better than polling, both in terms of user experience, and the complexity of what you have to build to make it work.

If your long running operations just have a single simple output, then polling for them might be a great solution. But streaming LLM responses (by nature of being made up of lots of individual tokens) makes the polling design a bit more gross than it really needs to be. Which is where the idea of 'sessions' comes in.

sudb · 2026-04-22T15:46:21 1776872781

Did you consider websockets? Curious to know if I'm missing something!

zknill · 2026-04-22T09:54:46 1776851686

I don't know Kitaru too well, but I do know Temporal a bit.

The pattern I describe in the article of 'channels' works really well for one of the hardest bits of using a durable execution tool like Temporal. If your workflow step is long running, or async, it's often hard to 'signal' the result of the step out to some frontend client. But using channels or sessions like in the article it becomes super easy because you can write the result to the channel and it's sent in realtime to the subscribed client. No HTTP polling for results, or anything like that.

htahir111 · 2026-04-22T11:30:27 1776857427

so to be clear, this should be used "instead of" rather then "on top of" durable execution engines?

zknill · 2026-04-22T09:39:36 1776850776

With the approach based on pub/sub channels, this is possible to do if you know the name of the session (i.e. know the name of the channel).

Of course the hard bit then is; how does the client know there's new information from the agent, or a new session?

Generally we'd recommend having a separate kind of 'notification' or 'control' pub/sub channel that clients always subscribe to to be notified of new 'sessions'. Then they can subscribe to the new session based purely on knowing the session name.

zknill · 2026-04-22T09:25:55 1776849955

I don't think this is quite right. I do work for a pub/sub company that's involved in this space, but this article isn't a commercial sales pitch and we do have a product that exists.

The article is about how agents are getting more and more async features, because that's what makes them useful and interesting. And how the standard HTTP based SSE streaming of response tokens is hard to make work when agents are async.

philipwhiuk · 2026-04-22T12:19:30 1776860370

> but this article isn't a commercial sales pitch

Yes it is. But it's nice you've convinced yourself I guess.

What is this, if not a product pitch:

> Because we’re building on our existing realtime messaging platform, we’re approaching the same problem that Cloudflare and Anthropic are approaching, but we’ve already got a bi-directional, durable, realtime messaging transport, which already supports multi-device and multi-user. We’re building session state and conversation history onto that existing platform to solve both halves of the problem; durable transport and durable state.

sudb · 2026-04-22T15:43:34 1776872614

If agents are async, is streaming still important? I think the useful set of interactions with an async agent are pretty limited - you'd want to stop, interrupt with a user message, maybe pause, resume, or steer with a user message?

All of those can be done without needing streams or a session abstraction I think, unless I'm misunderstanding.

zknill · 2026-03-20T11:47:43 1774007263

This is actually great for *claws. When Anthropic changed their T&Cs to disallow using claude code oauth tokens in the Anthropic Agent SDK, you had a choice between violate the terms or pay a lot more for the model inference using an API key from platform.claude.com instead of claude.ai.

With this change, it looks like an officially sanctioned version of *claws. Connecting to whatever "channels" you want via MCP.

Architecturally it's a little different, most *claws would call the Agent SDK from some orchestrator, but with claude channels the claude code binary starts the MCP server used to communicate with the channel. So it's a full inversion of control where Claude code is the driver, instead of your orchestrator code.

I updated my nanoclaw fork to start the claude code binary in a docker container on PID 1, and you can read the docker logs straight from claude code stdout, but with comms directly to/from your channel of choice. It's pretty neat.