What do you mean by "the worst case for LSM trees"? I guess you could merge two index segments (SSTables, in Google-derived implementations) faster if their key ranges don't overlap, because you could just concatenate them without encoding and decoding their entire contents, but does anybody actually do that?
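For concreteness, the shortcut I have in mind is something like this sketch (hypothetical code, not from any real SSTable implementation; a real one would splice encoded blocks rather than Python lists):

```python
import heapq

def merge_runs(a, b):
    """Merge two sorted runs of (key, value) pairs.

    If the key ranges are disjoint, the merge degenerates into
    concatenation; an SSTable implementation could in principle
    splice the encoded blocks without decoding any entries.
    """
    if not a or not b:
        return a + b
    if a[-1][0] < b[0][0]:   # all of a sorts before b
        return a + b
    if b[-1][0] < a[0][0]:   # all of b sorts before a
        return b + a
    # Overlapping key ranges: fall back to an ordinary merge.
    return list(heapq.merge(a, b))
```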
LSM trees are a lot faster than B+trees with lots of random writes.
In a by-the-book B+tree implementation that doesn't allow MVCC, if your branching factor is 1024, your table is 256 gigabytes, and your records are 128 bytes, most random record writes require updating four tree nodes, the transaction log, and the actual record page, which would be six random seeks without a writeback cache, and probably three random seeks with a writeback cache. I think in practice on modern SSDs this means three 4-KiB blocks added to the FTL's transaction log, but without further write amplification I think that fragments the FTL's storage so that reading a tree node requires multiple random page reads instead of one. If you use a log-structured B-tree approach so MVCC is convenient, you probably want to use a lower branching factor; say you use 128. Now your tree is 5 levels of internal nodes deep, and each node is, say, 2 KiB of data, so you're appending 10 KiB of tree data to the write log for each random write, plus the 128-byte record.
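To make that arithmetic concrete, here's the log-structured B-tree calculation in Python, using only the figures stated above (a back-of-the-envelope sketch, nothing measured):

```python
import math

record_size = 128                      # bytes per record
table_size = 256 * 2**30               # 256-gigabyte table
n_records = table_size // record_size  # about 2 billion records

branching = 128                        # lower fanout for the MVCC-friendly tree
node_size = 2 * 1024                   # 2 KiB per rewritten node

# Depth of the tree of internal nodes over the leaf records.
depth = math.ceil(math.log(n_records, branching))
print(depth)  # → 5 levels of internal nodes

# Each random write appends a fresh copy of every node on the
# root-to-leaf path, plus the record itself.
bytes_per_write = depth * node_size + record_size
print(bytes_per_write)  # → 10368, i.e. 10 KiB of nodes + 128 bytes
```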
By contrast, with an LSM-tree, at the moment of commit, you just append the 128-byte updated record, perhaps after an in-RAM sort: 80 times less data. But if the data survives long enough, it eventually has to get merged up to the root through a series of merges, which involves reading it and writing it an additional logarithmic number of times. Say your merges are 8-way on average; in that case the data has to get read and written 11 more times before the table is a single 256-gigabyte segment. That's still 7 times faster than the log-structured B-tree approach.
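The LSM side of the comparison works out like this (same back-of-the-envelope assumptions: 128-byte records, a 256-gigabyte table, and the 10-KiB-per-write B-tree figure from above):

```python
import math

record_size = 128
table_size = 256 * 2**30
n_records = table_size // record_size

fanout = 8  # average merge width
# Number of 8-way merges to go from individual appended records to
# one segment covering the whole table; each merge rereads and
# rewrites the record once.
n_merges = math.ceil(math.log(n_records, fanout))
print(n_merges)  # → 11

# Lifetime bytes written per record: the initial append plus one
# copy per merge, versus the log-structured B-tree figure above.
lsm_bytes = (1 + n_merges) * record_size  # 1536 bytes
btree_bytes = 5 * 2048 + record_size      # 10368 bytes
print(btree_bytes / lsm_bytes)            # → 6.75, i.e. roughly 7x
```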
The difference gets bigger with spinning rust.
And that's why, on one machine I tested on, LevelDB could insert 300'000 key-value pairs per second while Postgres manages about 3000 and SQLite about 17000.
> What do you mean by "the worst case for LSM trees"?
For write amplification.
> If you use a log-structured B-tree approach so MVCC is convenient...
You just hamstrung your B-tree for conscience to implement a single model, but regardless: I never claimed that B-trees were faster at writes than LSM-trees. I instead claimed that I don't understand where all of these write-dominated use cases are coming from as the world seems to be full of read-dominated services.
> And that's why, on one machine I tested on, LevelDB could insert 300'000 key-value pairs per second while Postgres manages about 3000 and SQLite about 17000.
I mean, this is clearly an unfair comparison by your own math: you need to keep running this for a long enough time that you can amortize into your calculations the cost of compactions (for LevelDB) and vacuums (PostgreSQL).
Aha, thanks. I suppose that in theory you could avoid actual merging if you had disjoint key ranges, in which case random writes could conceivably have less write amplification than mostly sequential writes, but does anyone do that in practice? Does it make a difference in practice?
What do you mean by "You just hamstrung your B-tree for conscience to implement a single model"? I'm not sure what you mean by either "for conscience" or "to implement a single model".
It's probably not an extremely fair comparison, but I'm not sure which databases it's slanted in favor of, and it wouldn't be very surprising if I'd overlooked some easy speedup. Like, maybe by deferring index creation until after ETL you could get a significant boost in Postgres speed, or by changing the Postgres tuning numbers. I do think it was running long enough to incur a significant number of compactions and vacuums, though. Feel free to critique in more detail. The Postgres and SQLite numbers come from http://canonical.org/~kragen/sw/dev3/exampledb.py, but for LevelDB I inferred that the Python bindings were the bottleneck, so the 300'000 figure is from http://canonical.org/~kragen/sw/dev3/leveldb-62m.cc.
i don't think that's an unfair comparison, that is in line with similar tests i have performed for random write workloads on lsm trees and b trees that don't fit in ram, including cost of compaction. https://youtu.be/jwq_0mPNnN8?t=3730 links to a section of a talk i did which explains my benchmarks. it also includes some pretty neat visualizations of the write patterns of the two data structures that may help people understand why there is such a big difference in write throughput. my benchmarks were done on spinning disk, on ssds i'd expect lsm trees to only have around 10x higher write throughput for the same benchmark.
Presumably people care about write speed because it's typically harder to scale up than read speed, as there are just fewer tricks overall that can be applied.
Like we can assume the overwhelming majority of applications writing to a database are interested in durability. A good portion of those could tolerate eventual consistency on reads, for instance, which means there are considerably more ways outside the database to improve read performance.