I feel like LLMs are just forcing me to realize what writing actually is. For me, writing is basically a mental cache clear. I write things down so I can process them fully and then safely forget them.
If I let an LLM generate the text, that cognitive resolution never happens. I can't offload a thought I haven't actually formed - so I struggle to safely forget about it.
Using AI for that is like hiring someone to lift weights for you and expecting to get stronger (I remember Slavoj Žižek comparing it to mechanical lovemaking in a recent talk somewhere).
The real trap isn't that we writers will be replaced; it's that we'll read the eloquent output of a model and quietly trick ourselves into believing we possess the deep comprehension it just spit out.
It reminds me of the shift from painting to photography. We thought the point of painting was to perfectly replicate reality, right up until the camera automated it. That stripped away the illusion and revealed what the art was actually for.
If the goal is just to pump out boilerplate, sure, let AI do it. But if the goal is to figure out what I actually think, I still have to do the tedious, frustrating work of writing it out myself.
No AI was used. I see no problem with using AI to write code whatsoever, but this isn't that. The formatting is my screw-up: I ran clang-format with a bad config, then tried to hand-fix the result and made it worse. The parenthesization is from defensive macro expansion that I inlined for the build and never cleaned up. The inline (smoke) test in the Makefile was a lazy hack from my local workflow that I forgot to replace before pushing; a proper test suite exists, but the names/sections are in Telugu, my native language. I'll fix both and add it.
Fair points on both - the 5ns is the L2 hit case. I should have stated the range (30-60ns?) instead of the best case. And yes, fixing the tcmalloc case is on my list - thanks for pointing that out. Also, to be clear, the goal was never to beat jemalloc or tcmalloc on raw throughput. I wanted to show that one doesn't have to give up competitive performance to get explicit heaps, hard caps, and teardown semantics.
That makes sense. I have a long-standing beef with the mimalloc-bench people because they made a bunch of claims in their paper but as recently as 2022 they were apparently not aware of the distinction, and the way they tried to shoehorn tcmalloc into their harness is plain broken. That is not a problem caused by your fine project.
I am aware of dlmalloc/mspaces and GNU Obstacks. Both were, in a way, original inspirations for Spaces. Though I hadn't looked at the mspaces source in years, I remember its inline boundary tags enabling zero overhead per allocation, with no alignment constraints on the allocator itself (and it's hardened across countless archs, not just x64 :)). Spaces uses 64KB-aligned slabs, and a metadata find is a bitop. So, potentially, a buffer overflow can corrupt the heap metadata in mspaces, while Spaces just eats a cache line on free.
mspaces used one mutex per heap for the entire task (no thread-local caching or lock-free paths). Spaces has per-thread heaps, local caches (no atomic ops on same-thread alloc/free), and a lock-free Treiber stack (with ABA tagging) for cross-thread frees. mspaces doesn't track large allocs (>= 256 or 512KB) that hit mmap, so unless one knows to explicitly call mspace_track_large_chunks(...), destroy_mspace silently leaks them all (I think Obstacks is good this way, but it's not a general fit imo). In Spaces, a chunk_destroy walks and frees all the page types unconditionally.
Another small thing that may matter is error callbacks: Spaces triggers a callback, allowing the application to shed load or degrade gracefully. Effectively, the heap walking (inspection?) in mspaces is a compile-time switch that holds the lock the whole time, doesn't track mmap (direct) allocs, and shares thresholds like mmap_threshold globally, whereas Spaces lets you tune everything per-heap. So I'd say Spaces is a better candidate for the use cases mspaces bolts on: concurrent access, hard budgets, complete heap walking, and per-heap tuning.
Thanks for taking a look, really appreciate the thoughtful feedback! You're
absolutely right about Fibonacci. It's a terrible performance example since
the work per fork is basically zero :) I included it as an 8-line API showcase,
and because it's almost pure overhead it doubles as a brutal stress test for
the runtime. The nqueens, matmul, and quicksort benchmarks in the repo are the
real performance indicators, imo. And yeah, batching and granularity control matter a lot; there's a section
in the README on tuning the serial cutoff.
On the utilization and context switch concerns, I feel
they typically stem from centralized queues or allocator contention in
child-stealing systems (like TBB). Cactus uses continuation-stealing instead:
on FORK, the child runs immediately as a normal function call, and the
parent's continuation (just 24 bytes: RBP/RSP/return address) goes onto a
per-worker deque.
If nobody steals it, the parent reclaims it at JOIN with an atomic decrement.
No allocation, no stack switch, no context switch on the fast path. A stack
switch only happens when a thief actually steals work and grabs a recycled
slab from the pool.
The other scalability killer is usually lock-based synchronization at join
points. I tried to avoid this by using atomic counters for the
worker/thief handoff instead of mutexes. You can test scaling yourself:
CACTUS_NPROCS=1 build/cc/nqueens 14 vs CACTUS_NPROCS=N build/cc/nqueens 14.
That said, I totally agree an Executor model is the right tool for flat
workloads. I built this specifically for recursive divide-and-conquer (game
trees, mergesort, mesh refinement) where you'd otherwise have to manually
flatten the recursion or risk deadlocking a fixed-size pool.
I wrote this because I wanted more explicit control over heaps when building different subsystems in C. Standard options like jemalloc and mimalloc are incredibly fast, but they act as black boxes. You can't easily cap a parser's memory at 256MB or wipe it all out in one go without writing a custom pool allocator.
Spaces takes a different approach. It uses 64KB-aligned slabs, and the metadata lookup is just a pointer mask (ptr & ~0xFFFF).
The trade-off is that every free() incurs an L1 cache miss to read the slab header, and there is a 64KB virtual memory floor per slab. But in exchange, you get zero-external-metadata regions, instant teardown of massive structures like ASTs, and performance that surprisingly keeps up with jemalloc on cross-thread workloads (I included the mimalloc-bench scripts in the repo).
It's Linux x86-64 only right now. I'm curious if systems folks think this chunk API is a pragmatic middle ground for memory management, or if the cache-miss penalty on free() makes the pointer-masking approach a dead end for general use.
When dealing with memory in C, defaulting to malloc (or some opaque structure behind it) now seems bad to me, unless you just want to allocate and forget for some one-off program that frees everything on process exit. For any kind of sophisticated system or module, you almost always want to write your own variety of allocator: slab, arena, pool, bump, whatever it may be.