Hacker News | mzl's comments

No, but a lot of AI-adjusted wording has the very idiosyncratic AI style that is prevalent in the AI slop that is everywhere, and that style has quickly become associated with writing that is generally devoid of content and insight. So it is natural to have gut reactions to the typical phrasings that have become associated with AI.


As others have said, this is more of a constraint programming system than Wave Function Collapse. Whatever one wants to call it, I liked it.

For guiding the search, you might want to consider search steps that select only one feature, for example that a pair of adjacent tiles should be connected by a road, and just propagate that information. That could be used as a way to guide the search on high-level features first, and then later realize the plans by doing the normal search.
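For concreteness, a minimal sketch of such a feature-level step, with a made-up tile set (the tile names and edge encoding are invented for illustration, not from the project):

```python
# Hypothetical tile set: each tile maps to the set of edges ("N","S","E","W")
# on which it carries a road.
TILES = {
    "grass": set(),
    "road_ew": {"E", "W"},
    "road_ns": {"N", "S"},
    "cross": {"N", "S", "E", "W"},
}

def require_road(dom_left, dom_right):
    """Commit the high-level feature 'these two horizontally adjacent cells
    are joined by a road' by filtering both domains: the left tile needs a
    road on its E edge, the right tile a road on its W edge."""
    left = {t for t in dom_left if "E" in TILES[t]}
    right = {t for t in dom_right if "W" in TILES[t]}
    return left, right

# Starting from full domains, the feature decision prunes both cells at once.
left, right = require_road(set(TILES), set(TILES))
```

Propagating a handful of such feature decisions first shapes the map at a high level; the ordinary tile-by-tile search can then fill in the details.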


I have a (very slight) beef with the name Algorithm X, as it is more of a data-structure to manage undo-information for the backtracking than an algorithm. It is a very fun, useful, and interesting data-structure, but it doesn't really change what steps are performed in the backtracking search.
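To illustrate the point that the search steps are independent of the undo machinery, here is exact cover solved with the same backtracking over plain Python sets (the well-known dict-of-sets formulation); select/deselect play the role that cover/uncover play in dancing links, just less efficiently:

```python
# Exact cover by backtracking. X maps each column to the set of rows that
# cover it; Y maps each row to the list of columns it covers.
def select(X, Y, r):
    cols = []
    for j in Y[r]:
        for i in X[j]:
            for k in Y[i]:
                if k != j:
                    X[k].remove(i)
        cols.append(X.pop(j))
    return cols

def deselect(X, Y, r, cols):
    for j in reversed(Y[r]):
        X[j] = cols.pop()
        for i in X[j]:
            for k in Y[i]:
                if k != j:
                    X[k].add(i)

def solve(X, Y, solution):
    if not X:
        yield list(solution)
        return
    c = min(X, key=lambda c: len(X[c]))  # smallest-domain column first
    for r in list(X[c]):
        solution.append(r)
        cols = select(X, Y, r)           # "cover": prune the matrix
        yield from solve(X, Y, solution)
        deselect(X, Y, r, cols)          # undo, restoring the exact state
        solution.pop()

# The standard small example instance (unique solution: B, D, F).
Y = {"A": [1, 4, 7], "B": [1, 4], "C": [4, 5, 7],
     "D": [3, 5, 6], "E": [2, 3, 6, 7], "F": [2, 7]}
X = {j: {r for r in Y if j in Y[r]} for j in range(1, 8)}
solutions = list(solve(X, Y, []))
```

Dancing links replaces the set surgery above with O(1) pointer splicing and unsplicing, but the sequence of choices and backtracks is identical.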


In my view, Scrum is a way to force dysfunctional teams to have some process; it is not useful for a team that is already delivering and working in a small-a agile manner.


If you were to write down a guide on how to avoid team dysfunction, it would get a name or maybe an acronym.

If it worked, someone would say: hey, let's use this in more places.

If it worked really well, others would say these aren't guidelines, they're dogma.

Now we have scrum 2.0.


Are you using the Model GPU memory snapshotting for this?


I like the intelligence-per-watt and intelligence-per-joule framing in https://arxiv.org/abs/2511.07885. It feels like a very useful measure for thinking about long-term sustainable variants of AI build-outs.


Depends on which cache you mean. The KV cache gets read on every token generated, but the prompt cache (which is what incurs the cache-read cost) is read when a conversation starts.


What's in the prompt cache?


The prompt cache caches KV Cache states based on prefixes of previous prompts and conversations. For a particular coding agent conversation, the caching might be more involved (with cache handles and so on); I'm talking about the general case here. It is a way to avoid repeating the same quadratic-cost computation over the prompt. Typically, LLM providers price reads from this cache much lower than computing again.

Since the prompt cache is keyed on a prefix of the prompt (by necessity; this is how LLMs work), if you have repeated API calls in some service, there are a lot of savings possible by organizing queries so that rarely varying content comes first and frequently varying content comes later. For example, if you included the current date and time as the first data point in your call, that would force a recomputation every time.
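As a toy illustration of that ordering effect (whitespace splitting stands in for real tokenization, and the prompts are invented):

```python
def common_prefix_len(a, b):
    """Number of leading tokens two prompts share, i.e. the cacheable part."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

system = "System: you are a support bot. Policy: answer briefly."

# Volatile data first: the shared prefix dies at the timestamp.
bad = [f"Time {t} | {system} | User: hi".split() for t in ("12:00", "12:01")]

# Stable content first, volatile content last: long shared prefix.
good = [f"{system} | Time {t} | User: hi".split() for t in ("12:00", "12:01")]
```

With the timestamp first, only one leading token is shared between the two calls; with it last, everything up to the timestamp can be served from the cache.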


> The prompt cache caches KV Cache states

Yes. The cache that caches KV cache states is called the KV cache. "Prompt cache" is just an index from string prefixes into the KV cache. It's tiny and has no computational impact. The parent was correct to question you.

The cost of using it comes from two facts: later tokens need more compute to calculate, and KV cache entries have to be kept somewhere between one user's requests while the system processes requests from other users.


Saying that it is just an index from string prefixes into the KV cache misses all the fun, interesting, and complicated parts of it. While the prompt-pointers are technically tiny compared with the data they point into, the massive scale of managing this over all users and requests, and the routing inside the compute cluster, make it an expensive thing to implement and tune. Keeping the prompt cache sufficiently responsive and storing the large KV caches somewhere costs a lot in resources as well.
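To put a rough number on "storing the large KV caches somewhere", a back-of-envelope sketch; the model dimensions are hypothetical, roughly a 7B-class dense transformer without grouped-query attention:

```python
# Back-of-envelope size of the KV cache that a prompt-cache entry points at.
# Assumed dims: 32 layers, hidden size 4096, fp16 (2 bytes), no GQA/MQA.
layers, hidden, dtype_bytes = 32, 4096, 2

per_token = 2 * layers * hidden * dtype_bytes  # K and V vectors per layer
per_token_kib = per_token / 1024               # 512 KiB per token

context = 100_000                              # a long cached prefix
total_gb = per_token * context / 1e9           # ~52 GB for one prefix
```

GQA/MQA and cache quantization shrink this considerably, but the per-user, per-prefix footprint is why eviction policies and tiered storage matter at scale.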

I think the OpenAI docs are pretty useful for an API-level understanding of how it can work (https://developers.openai.com/api/docs/guides/prompt-caching...). The vLLM docs (https://docs.vllm.ai/en/stable/design/prefix_caching/) and the SGLang RadixAttention post (https://lmsys.org/blog/2024-01-17-sglang/) are useful for insights into how to implement it locally on one compute node.


The implementation details are irrelevant to the discussion of the true cost of running the models.


The cost of running things like prompt caching is determined by the implementation, since that is what sets the infrastructure costs.


Way too much. This has got to be the most expensive and most lacking in common sense way to make software ever devised.


Technically, Cerebras' solution is really cool. However, I am skeptical that it will be economically useful for larger models, as the number of racks required scales with the size of the model in order to fit the weights in SRAM.


I find it interesting that the Spark version seems worse than the gpt-oss version (https://simonwillison.net/2025/Aug/5/gpt-oss/).


An LLM only outputs tokens, so this could be seen as an extension of tool calling, where the model has been trained on the knowledge and use-cases for calling itself as a sub-agent.


Ok, so agent swarm = tool calling where the tool is an LLM call and the argument is the prompt.


Yes largely, although they’ve trained a model specifically for this task rather than using the base model and a bit of prompting.


Sort of. It’s not necessarily a single call. In the general case it would be spinning up a long-running agent with various kinds of configuration — prompts, but also coding environment and which tools are available to it — like subagents in Claude Code.
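Pulling the thread together, a minimal sketch of "sub-agent as tool call"; llm() and the tool table here are hypothetical stand-ins, not any particular provider's API:

```python
def llm(prompt: str, model: str = "base") -> str:
    """Stand-in for a real model call; a production system would hit an API,
    and the sub-agent might loop with its own tools and environment."""
    return f"[{model} answering: {prompt}]"

# The sub-agent is registered like any other tool; its argument is a prompt,
# and (per the thread) the model behind it may be trained for sub-agent work
# rather than being the base model plus a bit of prompting.
TOOLS = {
    "sub_agent": lambda prompt: llm(prompt, model="subagent-tuned"),
}

def dispatch(name: str, arg: str) -> str:
    """Route a tool call exactly as any other tool call would be routed."""
    return TOOLS[name](arg)
```

In the long-running case, the "tool call" spins up an agent with its own configuration instead of returning from a single llm() invocation, but the dispatch shape is the same.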

