willothy's comments

This is an interesting idea. I am gonna try this out - especially with dashmap, I think that could perform very well.


You could also look into shamelessly "taking inspiration" from async-lock.


I don't believe that waking the waker in `poll` synchronously waits / runs poll again immediately. I think it is more likely just adding the future back to the global queue to be polled. I could be wrong though, I'll look into this more. Thanks for the info!


It does immediately put itself into the queue to be polled again, but that's no different in effect from a spin-lock. If you have other tasks in your runtime, this will put excess pressure on your scheduler.
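Roughly, the pattern in question looks like this - a minimal sketch with a made-up `SpinishLock`, not the crate's actual code:

    use std::future::Future;
    use std::pin::Pin;
    use std::sync::atomic::{AtomicBool, Ordering};
    use std::task::{Context, Poll};

    // Made-up minimal lock, just to illustrate the polling pattern.
    struct SpinishLock {
        locked: AtomicBool,
    }

    impl SpinishLock {
        fn try_lock(&self) -> bool {
            self.locked
                .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
                .is_ok()
        }
    }

    struct Acquire<'a> {
        lock: &'a SpinishLock,
    }

    impl Future for Acquire<'_> {
        type Output = ();

        fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
            if self.lock.try_lock() {
                Poll::Ready(())
            } else {
                // This doesn't re-run poll() synchronously; it just pushes
                // the task straight back onto the runtime's queue, so under
                // contention it behaves like a cooperative spin-lock.
                cx.waker().wake_by_ref();
                Poll::Pending
            }
        }
    }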


Expanding on this: if you have a lot of concurrent tasks, you will overflow[0] the task-local queue and be bottlenecked by the global queue mutex[1].

[0]: https://github.com/tokio-rs/tokio/blob/8897885425bf3d89053f8... [1]: https://github.com/tokio-rs/tokio/blob/8897885425bf3d89053f8...


Oh this is really good to know, thank you!


Tokio has a task budget that will cap at 32 or 256 polls of the same task before switching to another task. So, yes a spinlock, but not likely to deadlock.


Yeah, after looking into this more I think this was a big oversight on my part. Working on a (hopefully) better way of doing this right now - I'm thinking per-shard waker queues and only falling back to spinlocking like this if the queues are full.
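Roughly what I have in mind, as a sketch - the names and the bound are made up, and it glosses over the lost-wakeup race a real version would have to handle by re-checking the lock after registering the waker:

    use std::collections::VecDeque;
    use std::sync::Mutex;
    use std::task::{Context, Poll, Waker};

    // Made-up bound, just to show the shape of the fallback.
    const MAX_WAITERS: usize = 64;

    struct Shard {
        waiters: Mutex<VecDeque<Waker>>,
    }

    impl Shard {
        // Called from the acquire future when try_lock() fails.
        fn park(&self, cx: &mut Context<'_>) -> Poll<()> {
            let mut waiters = self.waiters.lock().unwrap();
            if waiters.len() < MAX_WAITERS {
                waiters.push_back(cx.waker().clone());
            } else {
                // Queue full: fall back to the self-wake (spin) behaviour.
                cx.waker().wake_by_ref();
            }
            Poll::Pending
        }

        // Called when the shard lock is released: hand off to one waiter.
        fn unpark_one(&self) {
            if let Some(waker) = self.waiters.lock().unwrap().pop_front() {
                waker.wake();
            }
        }
    }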


Another offender is AtomicWaker [1] which does the same on contention.

[1] https://docs.rs/atomic-waker/latest/atomic_waker/


AtomicWaker is much less bad, because it will only spin in the case that another thread has spuriously called `wake` – i.e. called `wake` despite the fact that the future it’s waking is in fact unable to progress.

In the common case, the waker side will operate some logic like:

    set_flag();
    atomic_waker.wake();
and the future will run:

    atomic_waker.register(cx.waker());
    if flag_is_set() {
        Poll::Ready(())
    } else {
        Poll::Pending
    }
Thus, even if `register` decides to “spin”, the flag will already be set, and so the future will not spin in reality (it may be polled unnecessarily, but this will only happen once).

I can’t immediately think of examples where support for spurious waking is necessary.


The producers of an MPSC queue may use an AtomicWaker::wake to signal to the consumer that a new item is ready. In this case all wake-ups are necessary (not spurious).
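For example, sketched with the atomic-waker crate plus a lock-free queue (`crossbeam_queue::SegQueue` here is just a stand-in):

    use std::task::{Context, Poll};

    use atomic_waker::AtomicWaker;
    use crossbeam_queue::SegQueue;

    // Every wake corresponds to a real item, so no wake-up is spurious.
    struct Channel<T> {
        queue: SegQueue<T>,
        waker: AtomicWaker,
    }

    impl<T> Channel<T> {
        // Producer side: make the item visible first, then wake the consumer.
        fn send(&self, item: T) {
            self.queue.push(item);
            self.waker.wake();
        }

        // Consumer side: register before checking, so a send that races
        // with this poll still produces a wake-up.
        fn poll_recv(&self, cx: &mut Context<'_>) -> Poll<T> {
            self.waker.register(cx.waker());
            match self.queue.pop() {
                Some(item) => Poll::Ready(item),
                None => Poll::Pending,
            }
        }
    }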


Hmm, imo having many spinlocks running cooperatively is definitely better than directly spinlocking, but you're right that it may not be ideal. Thanks for pointing this out. I'll see if I can find a better way to coordinate the polling/waking of lock acquisition futures.


I agree with this take a lot. I think having lots of custom implementations is inevitable for systems languages - the only reason Rust is different is that Cargo/crates.io makes things available for everyone, whereas in C/C++ land you will often just DIY everything to avoid managing a bunch of dependencies by hand or with cmake/similar.


There are a few reasons. For one, I'm not sure BTreeMap is always faster in Rust - it may be sometimes, but lookups are still O(log n) due to the tree search, whereas with a HashMap they're (mostly) O(1). They both have their uses - I usually go for BTreeMap when I explicitly need the collection to be ordered.

A second reason is sharding - sharding based on a hash is quite simple to do, but sharding an ordered collection would be quite difficult since some reads would need to search across multiple shards and thus take multiple locks.

If you mean internally (like for each shard), we're using hashbrown's raw HashTable API because it allows us to manage hashing entirely ourselves, and avoid recomputing the hash when determining the shard and looking up a key within a shard.
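Roughly, the idea looks like this - a simplified sketch with made-up names, and with the per-shard locks omitted:

    use std::hash::{BuildHasher, Hash};

    use hashbrown::HashTable;

    struct ShardedMap<K, V, S> {
        // In the real thing each shard sits behind its own lock.
        shards: Vec<HashTable<(K, V)>>,
        hasher: S,
    }

    impl<K: Hash + Eq, V, S: BuildHasher> ShardedMap<K, V, S> {
        fn get(&self, key: &K) -> Option<&V> {
            // Hash once, then reuse the same hash for picking the shard
            // and for the lookup inside that shard's table.
            let hash = self.hasher.hash_one(key);
            let shard = &self.shards[(hash as usize) % self.shards.len()];
            shard.find(hash, |(k, _)| k == key).map(|(_, v)| v)
        }
    }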


I think there's a pretty big difference between committing to semantic versioning and saying "do not use this until some unspecified point in the future." Maybe I'm just not clear enough in the note - I just mean that the API could change. But as long as a consumer doesn't use `version = "*"` in their Cargo.toml, breaking changes will always be opt-in and builds won't start failing if I release something with a big API change.


Maybe I'm a bit weird, but I would never commit to using something if the person making it wasn't providing a consistent interface. It could well be different in your case, but as a general rule, a sub 1.0 version is software that isn't yet ready to be used. The vast vast majority of projects that say "you can use this, but we won't provide a consistent interface yet" end up either dying before they get to v1 or causing so much pain they weren't worth using. I can see this issue being especially bad in Rust, where small API changes can create big issues with lifetimes and such.


Tokio is just used for async tests and the examples, the crate doesn’t depend on any specific async runtime :)


Hey, I think the name is cool!

Fair point though, there would definitely be some benefit to having some of these things in the stdlib.


I use a multiple of `std::thread::available_parallelism()`. Tbh I borrowed the strategy from dashmap, but I tested others and this seemed to work quite well. I'm considering making that configurable in the future so that it can be specialized for different use-cases or used in single-threaded but cooperatively scheduled contexts.
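Concretely, something along these lines - the multiplier is illustrative, not necessarily what the crate actually ships with:

    use std::thread::available_parallelism;

    // A small multiple of the CPU count, rounded up to a power of two so
    // picking a shard can be a cheap bitmask.
    fn shard_count() -> usize {
        let cpus = available_parallelism().map(|n| n.get()).unwrap_or(1);
        (cpus * 4).next_power_of_two()
    }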


Definitely a good point. I used dashmap's benchmark suite because it was already set up to bench many popular libraries, but I definitely want to get this tested in more varied scenarios. I'll try to add a single-key-only benchmark this week.

Regarding your edit: damn I hadn't thought of that. I'll rerun the benchmarks on my Linux desktop with a Ryzen chip and update the readme.


Very fascinating results though! The sharding approach is quite a simple way of doing more fine-grained locking, making sure that writes don't block all the reads. Pretty cool to see that it actually pays off even with the overhead of Tokio scheduling everything!

There might be some fairness concerns here? Since we're polling the lock (instead of adding ourselves to a queue), it could be the case that some requests are constantly "too late" to acquire it? Could be interesting to see the min/max/median/P99 of the requests themselves. It seems that Bustle only reports the average latency[1], which honestly doesn't tell us much more than the throughput.

[1]: https://docs.rs/bustle/latest/bustle/struct.Measurement.html


This is a great point too - I definitely want to run some more varied benchmarks to get a better idea of how this performs in different settings. We'll also be using it in prod soon, so we'll see how it does in a real use setting too :)


Yep, someone suggested loom on our Reddit r/rust post as well - I'm actively working on that. Somehow I'd just never heard of loom before this.

