Hacker News | curiouscoding's comments

Try measuring it! You'll quickly see that the latency of a random access into n bytes of memory is definitely not O(1).

See e.g. "The Myth of RAM": https://www.ilikebigbits.com/2014_04_21_myth_of_ram_1.html
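
If you want to try it yourself, a minimal sketch of the measurement: chase pointers through a random cycle over arrays of increasing size, so each load depends on the previous one and you observe latency rather than throughput. Sizes, step counts, and the RNG here are arbitrary choices for illustration.

```rust
use std::time::Instant;

fn main() {
    for shift in 10..=26 {
        let n: usize = 1 << shift;
        // Sattolo's algorithm: a random permutation that is a single
        // cycle, so the chase visits every element in random order.
        let mut next: Vec<usize> = (0..n).collect();
        let mut rng: u64 = 0x9e3779b97f4a7c15;
        for i in (1..n).rev() {
            rng ^= rng << 13; rng ^= rng >> 7; rng ^= rng << 17;
            next.swap(i, (rng as usize) % i);
        }
        let steps = 10_000_000u64;
        let mut idx = 0usize;
        let start = Instant::now();
        for _ in 0..steps {
            idx = next[idx]; // serial dependency chain: loads can't overlap
        }
        let ns = start.elapsed().as_nanos() as f64 / steps as f64;
        println!("{:>10} elements: {:.2} ns/load ({})", n, ns, idx);
    }
}
```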


100% wrong

8^(1/3) is 2

So according to OP (and you by agreement), upgrading my machine from 4GB of RAM to 32GB should double my RAM access time. Obviously, and demonstrably, nothing of the sort occurs!


But for another example, we often say array access is O(1) - but if you increased the size of your array by 8x (assuming you were already operating at the speed-of-light limit), you would see access latency double!


What nonsense are you going on about? My DRAM controller has no idea which bytes are part of my array and which aren't, so even if it tried to play along with your claims, it wouldn't know how big my array was declared, and thus which accesses to slow down and by how much.


Some questions/remarks:

- in the initial `Contains` code snippet (and all early-break variants), most performance is probably lost to branch misses on which element returns true. Probably much better is something like `table[0] == fp | table[1] == fp | table[2] == fp | table[3] == fp` (which might actually get optimized to int/SIMD anyway; who knows) - see the sketch after this list.

- What is shown in the table with benchmarks? I assume 'operations' is the number of lookups, but how large is the array? Memory latency is probably one of the biggest effects on overall runtime. It's probably more useful to show time per lookup than mean total time.

- Also, how do you run the benchmarks? If you make a fixed array of queries and query those many times, the same cachelines will be hit and you won't measure memory latency. In that case the branch predictor might also learn all the branches, so it's best to do each query only once in your benchmark.

- You don't really comment on the differences between benches with different numbers of 'operations'. Are there any takeaways from this? (Otherwise, just don't show them?)

- Maybe you could try with some `u8x8` SIMD instructions as well? Although the bitmasking is also cute as-is :)
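
To illustrate the branchless check from the first point, a rough sketch in Rust (the post's code is C#; `fp` and the 4-slot bucket layout here are assumptions). Bitwise OR `|` evaluates all four comparisons unconditionally, so there's no data-dependent branch to mispredict:

```rust
fn bucket_contains(table: &[u32; 4], fp: u32) -> bool {
    // All four comparisons execute every time; the result is combined
    // with bitwise OR instead of short-circuiting `||`.
    (table[0] == fp) | (table[1] == fp) | (table[2] == fp) | (table[3] == fp)
}
```

Compilers often turn this pattern into a SIMD compare + movemask sequence; it's worth checking the generated assembly either way.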


Hi, thank you for your questions: 1. The "one-line" boolean condition still has jumps in the produced JIT assembly, at least to short-circuit and exit as soon as possible. I guess the C# compiler is far too conservative for such optimisations.

2 and 3. The 'operations' count is both the array's size (actually the nearest power of 2) and the number of lookups. I first thought latency should be equal for both cases, but my benchmarks may be too naive - they test lookups by sequential memory access, i.e., iterating buckets from 0 to n, and iterating from the first to the last element in each bucket. I'll try to experiment with a more shuffled workload in my free time. Thank you for the question; I hadn't thought about it deeply. Also, counting mispredictions is a good idea; I'll add that metric to my benchmarks to get a complete picture.

4. I didn't pay much attention to that, because the trend was still the same and even the ratio was similar across different lookup counts for the same underlying memory layout. There was just some minor deviation, which I ignored. Maybe I'll run more granular tests with hardware counters to catch some insights.

5. SIMD, in my opinion, would add overhead there, because a single lookup in my configuration uses a single 32-bit value, and the Cuckoo filter will touch at most 64 bits (2 lookups). However, I was planning to experiment with batching lookups; I just haven't gotten to it yet.


> they test lookup by sequential memory access, i.e., iterating buckets from 0 to n, and iterating from first to last element in each bucket

Yeah; that completely eliminates the cache misses / memory latency you'd have in practice. Of course eliminating that bottleneck is useful if you want to purely optimize CPU speed, but it's probably not as representative of real workloads. It also makes sense then that different array sizes give similar results: streaming tends to be fast anyway, regardless of the array size.


The one-liner shouldn't have short-circuiting, since it uses the bitwise OR `|`, not the boolean/logical OR `||`.


Aha, you're right, I didn't look carefully.


Another remark:

If you benchmark it as something like `for q in queries { contains(q); }`, especially the branchless variants are probably executed in parallel by the CPU, and you are measuring throughput instead of latency. That may or may not be relevant depending on the application.
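
One common trick to force the serial behaviour is to feed a bit of each result into the next query, so iterations can't overlap. A sketch, with a hypothetical `contains` standing in for the real filter lookup:

```rust
fn contains(q: u32) -> bool {
    q % 7 == 0 // placeholder for the actual filter lookup
}

fn latency_bench(queries: &[u32]) -> u32 {
    let mut acc: u32 = 0;
    for &q in queries {
        // XOR-ing in one bit of the running result creates a data
        // dependency between iterations without really changing the
        // query distribution, so this measures latency, not throughput.
        acc = acc.wrapping_add(contains(q ^ (acc & 1)) as u32);
    }
    acc // return the accumulator so the loop isn't optimized away
}
```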


Hm, that actually may be true, thank you for the heads-up.


I don't think talks are being recorded, unfortunately.


I'd love to see some benchmarks/comparisons on variable-length strings. For strings with random lengths between 10 and 30, gxhash was significantly faster than xxhash for me. I assume that's because it processes chunks of up to 32 bytes at a time, avoiding branch misses.

Generally my feeling is that at these speeds, designing for branch predictability on short strings might be more important than absolute throughput.
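
Roughly the setup I mean, as a sketch (crates: `gxhash` and `xxhash-rust` with the `xxh3` feature; function names from memory, so double-check against their docs, and note gxhash needs the AES target features enabled at build time). Not a rigorous benchmark - no warmup and a toy RNG, just the shape of it:

```rust
use std::time::Instant;

fn main() {
    // Strings with pseudo-random lengths in 10..=30; the varying length
    // is what stresses branch prediction.
    let mut rng: u64 = 0x243f6a8885a308d3;
    let inputs: Vec<Vec<u8>> = (0..1_000_000)
        .map(|_| {
            rng ^= rng << 13; rng ^= rng >> 7; rng ^= rng << 17;
            vec![b'a'; 10 + (rng as usize) % 21]
        })
        .collect();

    let start = Instant::now();
    let mut acc = 0u64;
    for s in &inputs {
        acc ^= gxhash::gxhash64(s, 0);
    }
    println!("gxhash64: {:?} ({acc})", start.elapsed());

    let start = Instant::now();
    let mut acc = 0u64;
    for s in &inputs {
        acc ^= xxhash_rust::xxh3::xxh3_64(s);
    }
    println!("xxh3_64:  {:?} ({acc})", start.elapsed());
}
```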


Nice overview of sorting methods, thanks for sharing!

I also looked a bit into radix and distribution sort at some point over the past year, but in the end high-performance sorting is actually too big a thing to just do quickly on the side, as your post shows well :") In fact, I wasn't aware of the associativity issues for radix sort. That's definitely something to keep in mind and investigate.

Will definitely refer back to it once I'm looking at sorting again in more detail at some point!


I got stuck on sorting too; I was working on SingeliSort (https://github.com/mlochbaum/SingeliSort) for a while. The basic performance is there, but I need to get serious about testing before using it. The radix sort and counting sort should be very solid, though. The approach is about the same as the C code currently used in CBQN, linked below. The main complication is reducing constant overhead for shorter arrays, using a small count type and better prefix sums: interleaved SWAR in SingeliSort (since it targets a generic architecture) and shared SIMD utilities in CBQN.

Email in my Github profile, feel free to contact me any time if you'd like to talk about algorithms!

32-bit radix sort: https://github.com/dzaima/CBQN/blob/v0.8.0/src/builtins/grad... plus https://github.com/dzaima/CBQN/blob/v0.8.0/src/builtins/radi...
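
For anyone who wants the basic shape of what's being discussed, a textbook sketch in Rust (deliberately not the SingeliSort/CBQN code; those layer the optimizations above on top of this):

```rust
// Least-significant-digit radix sort for u32: four passes of 8-bit
// digits, each pass a counting histogram turned into bucket offsets by
// an exclusive prefix sum, followed by a stable scatter.
fn radix_sort_u32(data: &mut Vec<u32>) {
    let n = data.len();
    let mut buf = vec![0u32; n];
    for pass in 0..4 {
        let shift = pass * 8;
        // Counting pass: histogram of the current digit.
        let mut counts = [0usize; 256];
        for &x in data.iter() {
            counts[((x >> shift) & 0xff) as usize] += 1;
        }
        // Exclusive prefix sum: counts become starting offsets.
        let mut sum = 0;
        for c in counts.iter_mut() {
            let t = *c;
            *c = sum;
            sum += t;
        }
        // Stable scatter into the buffer in digit order.
        for &x in data.iter() {
            let d = ((x >> shift) & 0xff) as usize;
            buf[counts[d]] = x;
            counts[d] += 1;
        }
        std::mem::swap(data, &mut buf);
    }
}
```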


I've added a little motivation section on this :)

The final goal is to index DNA, say a human genome, or a bunch of them, and this is static data.

Then as new DNA comes in (e.g. as it is read by a DNA sequencer), we can efficiently query it against the index built on the reference genome, and this reference is often fixed.


Yup. I've added to the future-work section that the plan is to use this to speed up suffix array searching. Suffix arrays are pretty regularly used to build indices over, say, a human genome, which can then be queried.

And since DNA sequencers are still increasing their throughput and accuracy, it's always good to have faster algorithms as well.
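
For anyone unfamiliar, a minimal illustration of such an index: a suffix array built naively, queried by binary search. (Genome-scale tools use linear-time construction and many refinements; this only shows the idea.)

```rust
// Sort suffix start positions by the suffixes they point at. Naive
// O(n log n) comparisons of up-to-n-byte slices; fine for a sketch only.
fn build_suffix_array(text: &[u8]) -> Vec<usize> {
    let mut sa: Vec<usize> = (0..text.len()).collect();
    sa.sort_by(|&a, &b| text[a..].cmp(&text[b..]));
    sa
}

// Does `pattern` occur anywhere in `text`? Binary search over the
// sorted suffixes, comparing only the first `pattern.len()` bytes.
fn contains(text: &[u8], sa: &[usize], pattern: &[u8]) -> bool {
    sa.binary_search_by(|&s| {
        let end = s + pattern.len().min(text.len() - s);
        text[s..end].cmp(pattern)
    })
    .is_ok()
}

fn main() {
    let text = b"ACGTACGGTACGT";
    let sa = build_suffix_array(text);
    assert!(contains(text, &sa, b"GGTA"));
    assert!(!contains(text, &sa, b"TTTT"));
}
```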


Thanks! I did spend some time fiddling with CSS to make the yellow highlights so that the titles stand out a bit more, and to improve the code blocks, but otherwise it's a relatively common theme for Hugo. (I think it's linked in the footer.)


You're absolutely right. At some point I spent so long working on the plotting code that I really couldn't be bothered anymore to make it prettier. Hopefully I eventually find some motivation to fix this.


At some point I was playing around with interpolation search on the human genome dataset, and it was really, really terrible. It had some very bad worst-case behaviour when the input data has big plateaus and you're searching for something around the edges of a plateau. This can be fixed by only using interpolation for the first few iterations and then doing binary search for as long as needed afterwards, but that kills a lot of the predictability of the current approach. So while I really like it in theory, in practice I'm not quite convinced it can be made to work efficiently.
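
A sketch of that hybrid, in Rust (the 5-step cutoff and u64 keys are arbitrary choices here):

```rust
// A bounded number of interpolation steps, then fall back to plain
// binary search on whatever range remains. This caps the damage from
// plateaus at the cost of some predictability.
fn hybrid_search(data: &[u64], target: u64) -> Option<usize> {
    let (mut lo, mut hi) = (0usize, data.len());
    for _ in 0..5 {
        if lo >= hi { return None; }
        let (a, b) = (data[lo], data[hi - 1]);
        if target < a || target > b { return None; }
        if b == a { break; } // plateau: nothing left to interpolate on
        // Probe position estimated from the value range (u128 avoids
        // overflow in the intermediate product).
        let mid = lo + ((target - a) as u128 * (hi - lo - 1) as u128
            / (b - a) as u128) as usize;
        match data[mid].cmp(&target) {
            std::cmp::Ordering::Equal => return Some(mid),
            std::cmp::Ordering::Less => lo = mid + 1,
            std::cmp::Ordering::Greater => hi = mid,
        }
    }
    // Fall back: standard binary search on the narrowed range.
    data[lo..hi].binary_search(&target).ok().map(|i| lo + i)
}
```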


There are tricks: for uniformly distributed data it is unbeatable - it finds the element in 5 hops or fewer. After 5 hops, it's better to switch to binary search. If the data is not uniformly distributed, one trick is to sort by a hash of the element (store the hash with the element, or calculate it on the fly). But yeah, without those adjustments it has a worst case of O(n).
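
A sketch of that hashing trick (the splitmix64-style mixer is an arbitrary choice; any decent 64-bit mixer works):

```rust
// splitmix64 finalizer: a well-known 64-bit mixing function.
fn mix(x: u64) -> u64 {
    let mut z = x.wrapping_add(0x9e3779b97f4a7c15);
    z = (z ^ (z >> 30)).wrapping_mul(0xbf58476d1ce4e5b9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94d049bb133111eb);
    z ^ (z >> 31)
}

// Store (hash, key) pairs sorted by hash. Lookups interpolate on the
// hash field, which is near-uniform by construction regardless of the
// original key distribution.
fn build(keys: &[u64]) -> Vec<(u64, u64)> {
    let mut v: Vec<(u64, u64)> = keys.iter().map(|&k| (mix(k), k)).collect();
    v.sort_unstable();
    v
}
```

The tradeoff is that range queries over the original key order no longer work, since the table is ordered by hash.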

