curiouscoding's comments | Hacker News

You're right about the mistake in the figure. Will fix soon, thanks :)

And yeah, maybe I should just label all of them with throughput for consistency.


Assuming this is going to save someone 10ns per query in the end and I spent 150h on coding and writing this up, we only(?) need around 10^14 queries to make it worth it!
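
For anyone checking the arithmetic, here it is in code (just restating the numbers above):

    fn main() {
        let invested_ns = 150.0 * 3600.0 * 1e9; // 150 hours of work, in ns
        let saved_ns_per_query = 10.0;
        // ~5.4e13, i.e. around 10^14 queries to break even
        println!("{:e} queries", invested_ns / saved_ns_per_query);
    }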

But yes, as new technologies stack up, we achieve exponential growth in throughput in the end, which usually enables new science :)


You're right. While I wasn't very explicit about this (because algorithmica and the linked paper spend plenty of words on it already), the code snippet for Eytzinger does indeed do both the prefetching and the formula.

In fact, I'm pretty sure a similar formula could be made to work for higher branching factors, although it would surely be slower. (Probably it depends on the number of times 17 divides the final index or so, which is not great, but with B=15 it would be the number of factors of 16, which is easy again.) The more annoying issue with not storing everything in the final layer is that we have to keep a running answer for each query as we go down the tree, which I suspect will add measurable runtime overhead. But maybe that's a tradeoff one is willing to make to avoid the 6.25% overhead.
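
For reference, here's a minimal sketch of what that combination looks like (modeled on the Algorithmica version; the function name and the 4-levels-ahead prefetch distance are my own choices):

    /// Branchless lower bound over a 1-indexed Eytzinger-ordered array `a`
    /// (`a[0]` is unused). Returns the Eytzinger index of the first element
    /// >= q, or 0 if all elements are smaller.
    fn eytzinger_lower_bound(a: &[u32], q: u32) -> usize {
        let mut k = 1;
        while k < a.len() {
            // Prefetch 4 levels ahead: indices 16k..16k+16 hold 16 u32s,
            // i.e. roughly one 64-byte cache line of descendants.
            #[cfg(target_arch = "x86_64")]
            unsafe {
                use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
                if 16 * k < a.len() {
                    _mm_prefetch::<_MM_HINT_T0>(a.as_ptr().add(16 * k) as *const i8);
                }
            }
            // Branchless descent: left child is 2k, right child is 2k+1.
            k = 2 * k + (a[k] < q) as usize;
        }
        // "The formula": cancel the trailing right-turns (the trailing one
        // bits of k) plus one left-turn to recover the answer's index.
        k >> (k.trailing_ones() + 1)
    }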


Just fyi: the throughput numbers with batching are per _query_, not per _batch_, so I think the *8 is too optimistic :")

I suspect that at higher core counts we can still saturate the full RAM bandwidth with only 4-5 cores, so that the marginal gains from additional cores will be very small. That's good though, because it leaves CPU time to work on the bigger problem of determining the right queries, and on dealing with the outputs (as long as that is not too memory bound in itself, although it probably is).


Hmm, bitmaps are an interesting idea! If the data is dense enough, then yeah, I guess a quick linear scan would work.

There's also the extreme of simply storing the answer for each possible query as a u32 and just indexing into the array, but there the overhead is much larger.
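
Concretely, that extreme would look something like this (a sketch; `partition_point` is just the stdlib lower bound, and for the full u32 range the table is 4 B * 2^32 = 16 GiB, hence the overhead):

    /// Precompute the lower-bound answer for every possible query up to max_q.
    fn build_answer_table(sorted: &[u32], max_q: u32) -> Vec<u32> {
        (0..=max_q)
            .map(|q| sorted.partition_point(|&x| x < q) as u32)
            .collect()
    }

    /// Each query is then a single (data-dependent) load.
    fn lookup(table: &[u32], q: u32) -> u32 {
        table[q as usize]
    }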

Rank-select is also interesting, but I doubt it comes anywhere close in performance.

I happen to also be working on a minimal perfect hashing project that has way higher throughput than other methods (mostly because of batching); see https://curiouscoding.nl/posts/ptrhash-paper/ and the first post linked from there.


Nice work! I love the care and discussion around low level optimizations, as such things are often ignored.

There are a lot of interesting variants of rank/select and other succinct data structures, each with different tradeoffs. They are maybe most useful when compression is critical. At least my own data structures can't compete with the query performance you are showing, but they are great when the memory footprint matters.


Ah yes, I did consider sorting the queries at some point, but I guess I then forgot about it again. If the queries are random and there are far fewer of them than input elements, most queries will probably hit different cache lines in the final layer, and I suspect the benefits will be limited to maybe 2x in the best case, since the last layer is where the bottleneck is.

Also, (radix) sorting is usually very memory bound, and we probably need to sort into 16^2=256 buckets or so to get sufficient reuse of the higher layers, but I don't have numbers on what a round of radix sort costs. (My guess is on the order of 1ns per query? Maybe I'll find time to investigate and add it to the post; a sketch of the idea is below.)
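
The pre-pass I have in mind would be a single counting-sort-style pass like this (a hypothetical sketch, not benchmarked; the bucket count and bit choice are guesses):

    /// Group queries into 256 buckets by their top 8 bits, so queries in the
    /// same bucket share (most of) their path through the upper tree levels
    /// and those cache lines get reused.
    fn bucket_queries(queries: &[u32]) -> Vec<Vec<u32>> {
        let mut buckets: Vec<Vec<u32>> = vec![Vec::new(); 256];
        for &q in queries {
            buckets[(q >> 24) as usize].push(q);
        }
        buckets
    }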


High-performance sorting algos do either merging or partitioning, i.e. you either merge R input streams into one, or split one input stream into R (for quick, radix, and sample sort).

1. For merge sort of N elements, you have to perform log(N)/log(R) passes

2. For sample sort, C*log(N)/log(R) passes, where C depends on the distribution of your data; there are no strict guarantees

3. For radix sort of K-byte elements, exactly K passes (indeed, 256-way is optimal according to my tests), which is usually more than the previous two

While merge sort looks preferable, since it uses the fewest passes and a fixed number of them, the merging code itself is less efficient - there aren't many independent CPU operations, making it slower than samplesort and especially radix sort for small inputs.

So it seems that the best strategy combines two different levels - for larger blocks you absolutely need to minimize the number of passes [over memory] and employ multithreading, while for smaller blocks you need to minimize the number of CPU operations and increase ILP.

Today two well-recognized algos are IPS4O for larger blocks and Google's SIMD QuickSort for smaller ones.
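
To make the pass counts above concrete, a quick back-of-envelope sketch (my own, just applying the formulas from this comment):

    // Passes over memory when sorting n keys with R-way merging:
    fn merge_passes(n: u64, r: u64) -> u32 {
        ((n as f64).ln() / (r as f64).ln()).ceil() as u32 // log_R(n), rounded up
    }

    fn main() {
        // 10^9 32-bit keys with R = 256: 4 merge passes...
        println!("merge: {} passes", merge_passes(1_000_000_000, 256));
        // ...vs exactly K = 4 radix passes (one per key byte), so they only
        // break even here; for smaller inputs, merging needs fewer passes.
    }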


Anything in particular that threw you off? I'd be happy to add a few words to briefly explain some of the less intuitive rust syntax.


Nah just the final bits, stuff like `Simd::<u32, 8>::splat(q)`, I'm not sure what splatting is or how Rust's Simd library works or, admittedly, how SIMD works in detail at all. So eg I'm not sure what that 8 is doing there in the type parameters, details like that. Maybe this isn't a Rust thing but a SIMD thing, btw, I don't know much Rust but I also never wrote any SIMD code ever. I don't know how the article could be clearer, IMO once you go into the deep nitty gritty optimization stuff you just gotta assume the reader knows the tools you're using. I'm OK with dropping out halfway cause the core idea is there even before you squeeze out all the perf.


`Simd::<u32, 8>` is describing a vector with 8 lanes, where each lane is 32 bits wide. For targets that support it, that should get you a 256-bit register.

The `splat(q)` constructor is just saying, "populate each lane with `q`." That is, the value is "splatted" across every lane in the vector.

The main idea of SIMD is simple: you represent multiple points of data in the same register and use specialized CPU instructions to operate on those points simultaneously. The difficulty of SIMD is coming up with clever algorithms that use the available instructions in an efficient way.
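
In Rust's (nightly-only) portable SIMD, which is what the post uses, that looks roughly like this - a tiny sketch with made-up key values, and module paths that may shift between nightlies:

    #![feature(portable_simd)]
    use std::simd::{cmp::SimdPartialOrd, u32x8};

    fn main() {
        let keys = u32x8::from_array([3, 1, 4, 1, 5, 9, 2, 6]);
        let q = u32x8::splat(4);  // "splat": broadcast 4 into all 8 lanes
        let lt = keys.simd_lt(q); // one lane-wise `<` comparison
        println!("{:08b}", lt.to_bitmask()); // one bit per lane
    }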


Ahh right, I've been doing too much TypeScript lately. I thought a type parameter couldn't impact behavior, only typechecking, but clearly in Rust (like in C++) it can. Thanks for the explanation!

Ah so splat is like a memset for the entire vector to the same value. OK makes sense, I bet that wasn't actually a Rust-ism at all, just basic SIMD lingo I didn't know. Thanks again!


It's the Intel intrinsic `_mm256_set1_epi32()` except with an amusing name. Got to give Rust the win here.


wow yeah, i way prefer Rust's notation there! but i assume you could make a similar zero-cost abstraction with some mildly crazy C++ template programming.


Nice to see other real experts (and a bit of a celebrity) in here.

(For the uninitiated: Burntsushi of ripgrep fame.)


yeah tbh i was a bit starstruck that burntsushi would reply to my comment explaining what i assume is utter SIMD basics. that must be how swifties feel when she (or, likelier, her PR drones) click "like" on their insta comments.


Does burntsushi have PR drones (who can explain SIMD)? :)


The problem with pseudocode is that it's completely underspecified. And how would I ever write intrinsics in pseudocode? Much easier to use a proper language directly, so people can actually Google things and read the official documentation.


My former roommate works in a similar domain; https://dl.acm.org/doi/pdf/10.1145/3448016.3452841 is an example of a paper implementing intrinsics in pseudocode. They "unroll" the intrinsics, with a comment saying that this is implemented with an intrinsic. Of course, though, your blog isn't a paper; I don't know why people are getting up in arms about it.

Also, the other comment saying that "pseudocode is not concerned with intrinsics" is false. You can get "great" theoretical speedups (the shave-offs are tiny, but hey, they're better) with "simple" intrinsic operations - that's my roommate's entire research lol. The external memory model, for example, formalizes caching and allows all these low level optimizations to flourish. I'm not sure how intrinsics tie into it, but he's published so I'm not gonna question it :)

---

Speaking of which, I noticed that you did competitive programming. How does CP compare to research? I loved data structure manipulation problems in CP when they were clever - often because they involved noticing that you could take a "common" model and then optimize it significantly, since you only needed to support a subset of the operations, via a clever mathematical proof based on the structure of the problem. But as I got to the higher levels, it felt more and more like a lot of them became really obscure "support 413 operations on a tree, and yes, you really need to support all 413 operations on a tree", which is unfortunately kind of my opinion of data structure research as well :( I guess because solving broad general instances is more important. I'd love to hear your perspective though.


Pseudocode can be whatever you want it to be. You can do SIMD pseudocode, but most people generally don't, as that is often an implementation detail.


Yeah exactly, pseudocode just doesn't cut it when what matters is precisely the implementation itself.

Indeed, I did a bunch of competitive programming! But actually, my favourite topics there were combinatorics, graph theory, and number theory. I'd usually leave the data structure (read: segtree) problems to my teammates.

I super enjoyed that, and indeed was looking for a PhD where I could do similar things (because my time at Google was boring in comparison -- mostly just software engineering), at the intersection of new theory and practical fast code. I decided on bioinformatics, because it is exactly a field that has a lot of data, and the amount of data is growing fast, so fast algorithms & code are needed, both in theory (big-O) and in practice. Generally I've been super excited working on various problems in this domain, and I'd say it quite closely matches my compprog experience :)


Rust is great :")

But no, this is actually part of my PhD research. The next step will be to use this in a fast suffix-array search algorithm.


Great stuff. We get told big-O notation is what matters for data structures / algorithms, but low-level improvements around memory and storage, with things like Rust, are much more where gains can be made. These kinds of tricks for actual run times are so valuable, and interesting to follow.


Ohh yeah, don't get me started on big-O...

It was great while computers were not really a thing yet, but these days it's often so meaningless. We see papers getting a 2x speedup with a lot of novel algorithmic work that sell better than a 10x speedup obtained just by exploiting CPUs to the fullest.

Even I kinda think theoretical contributions are cooler, and we really need to get rid of that mindset (slightly exaggerating).


Your article led me to wonder if our b-trees would be faster if I switched the intra-node operations to use Eytzinger-ordered arrays:

https://github.com/openzfs/zfs/commit/677c6f8457943fe5b56d7a...

There are two ways to look at this, big-O wise. One is that insertions and deletions would be asymptotically faster, since memmove() is a linear operation while bubble up/down are logarithmic operations. Lookups would not be any different asymptotically, but the constant factor might improve from being able to prefetch. The other way is that N is bounded, such that it is all O(1) and the difference is just how big the constant factor is.

I imagine I could implement it and benchmark it. However, my intuition is that the end result would have lookups be marginally faster, to the point of splitting hairs, while insertions and deletions would be slower. While memmove() is technically a linear-time operation, it is a sequential operation with a very low constant factor. The bubble up and bubble down operations needed for insertions and deletions in an Eytzinger-ordered array are technically random access, which has a higher constant factor. At some point the Eytzinger-ordered array operations should win, but that point is likely well beyond the size of a b-tree node.

My point in saying this is that big-O notation still matters, but understanding when the constant factor is significant is important too.


Same with async and throwing threads at a problem. People love to do those and think it's the right answer, but you can do a ton with smarter memory management and by actually looking at what the code is doing at a lower level, rather than through abstractions.

Video about this that was very interesting to follow and somewhat related to what you're doing: https://www.youtube.com/watch?v=5rb0vvJ7NCY


You're really just exploiting hidden constant factors inside the CPU, so big-O still applies, just less visibly.


He is improving the constant factor in big-O notation. University algorithm classes tend to ignore cases where the constant factor difference is significant enough to favor an asymptotically slower algorithm. Matrix multiplication is the quintessential example of this, since a good implementation of the O(n^3) algorithm will outperform asymptotically faster algorithms such as Strassen's famous O(n^2.81) one. (The O(n * log(n) * log(log(n))) FFT-based bound sometimes cited here is for multiplying integers, not matrices.) At least, it outperforms them on the matrix multiplications people actually do in practice.


Rust is... okay. I still prefer C++'s template system to Rust generics, in part because Rust doesn't have specialization and won't for a while. Memory safety by default is a big enough win to make me overlook these nits. Really, though, most people most of the time should be writing managed code.


Yeah, I don't think python is the right tool here. C++ definitely would be an option though.

Anyway, very happy that this is also showing off what Rust can do.


As far as I can tell, the community is largely divided into three major groups. Those that can read C, those that can read Python and those that can read both. Using either of them would have dodged criticism that your code examples are not accessible to much of the community.

That said, you are right that Python is not the best language to use when you need to use intrinsics, as you would be writing it in another language and using the Python FFI to access it.


> this is also showing off what rust can do

To people who already know Rust, yes. To others, not so much

