What do you mean by "the worst case for LSM trees"? I guess you could merge two index segments (SSTables, in Google-derived implementations) faster if their key ranges don't overlap, because you could just concatenate them without encoding and decoding their entire contents, but does anybody actually do that?
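For concreteness, the shortcut I have in mind is something like this sketch (hypothetical code, not from any real SSTable implementation; a real one would splice encoded blocks rather than Python lists):

```python
import heapq

def merge_runs(a, b):
    """Merge two sorted runs of (key, value) pairs.

    If the key ranges are disjoint, the merge degenerates into
    concatenation; an SSTable implementation could in principle
    splice the encoded blocks without decoding any entries.
    """
    if not a or not b:
        return a + b
    if a[-1][0] < b[0][0]:   # all of a sorts before b
        return a + b
    if b[-1][0] < a[0][0]:   # all of b sorts before a
        return b + a
    # Overlapping key ranges: fall back to an ordinary merge.
    return list(heapq.merge(a, b))
```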
LSM trees are a lot faster than B+trees with lots of random writes.
In a by-the-book B+tree implementation that doesn't allow MVCC, if your branching factor is 1024, your table is 256 gigabytes, and your records are 128 bytes, most random record writes require updating four tree nodes, the transaction log, and the actual record page, which would be six random seeks without a writeback cache, and probably three random seeks with a writeback cache. I think in practice on modern SSDs this means three 4-KiB blocks added to the FTL's transaction log, but without further write amplification I think that fragments the FTL's storage so that reading a tree node requires multiple random page reads instead of one. If you use a log-structured B-tree approach so MVCC is convenient, you probably want to use a lower branching factor; say you use 128. Now your tree is 5 levels of internal nodes deep, and each node is, say, 2 KiB of data, so you're appending 10 KiB of tree data to the write log for each random write, plus the 128-byte record.
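To make that arithmetic concrete, here's the log-structured B-tree calculation in Python, using only the figures stated above (a back-of-the-envelope sketch, nothing measured):

```python
import math

record_size = 128                      # bytes per record
table_size = 256 * 2**30               # 256-gigabyte table
n_records = table_size // record_size  # about 2 billion records

branching = 128                        # lower fanout for the MVCC-friendly tree
node_size = 2 * 1024                   # 2 KiB per rewritten node

# Depth of the tree of internal nodes over the leaf records.
depth = math.ceil(math.log(n_records, branching))
print(depth)  # → 5 levels of internal nodes

# Each random write appends a fresh copy of every node on the
# root-to-leaf path, plus the record itself.
bytes_per_write = depth * node_size + record_size
print(bytes_per_write)  # → 10368, i.e. 10 KiB of nodes + 128 bytes
```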
By contrast, with an LSM-tree, at the moment of commit, you just append the 128-byte updated record, perhaps after an in-RAM sort: 80 times less data. But if the data survives long enough, it eventually has to get merged up to the root through a series of merges, which involves reading it and writing it an additional logarithmic number of times. Say your merges are 8-way on average; in that case the data has to get read and written 11 more times before the table is a single 256-gigabyte segment. That's still 7 times faster than the log-structured B-tree approach.
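The LSM side of the comparison works out like this (same back-of-the-envelope assumptions: 128-byte records, a 256-gigabyte table, and the 10-KiB-per-write B-tree figure from above):

```python
import math

record_size = 128
table_size = 256 * 2**30
n_records = table_size // record_size

fanout = 8  # average merge width
# Number of 8-way merges to go from individual appended records to
# one segment covering the whole table; each merge rereads and
# rewrites the record once.
n_merges = math.ceil(math.log(n_records, fanout))
print(n_merges)  # → 11

# Lifetime bytes written per record: the initial append plus one
# copy per merge, versus the log-structured B-tree figure above.
lsm_bytes = (1 + n_merges) * record_size  # 1536 bytes
btree_bytes = 5 * 2048 + record_size      # 10368 bytes
print(btree_bytes / lsm_bytes)            # → 6.75, i.e. roughly 7x
```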
The difference gets bigger with spinning rust.
And that's why, on one machine I tested on, LevelDB could insert 300'000 key-value pairs per second while Postgres manages about 3000 and SQLite about 17000.
> What do you mean by "the worst case for LSM trees"?
For write amplification.
> If you use a log-structured B-tree approach so MVCC is convenient...
You just hamstrung your B-tree for conscience to implement a single model, but regardless: I never claimed that B-trees were faster at writes than LSM-trees. I instead claimed that I don't understand where all of these write-dominated use cases are coming from as the world seems to be full of read-dominated services.
> And that's why, on one machine I tested on, LevelDB could insert 300'000 key-value pairs per second while Postgres manages about 3000 and SQLite about 17000.
I mean, this is clearly an unfair comparison by your own math: you need to keep running this for a long enough time that you can amortize into your calculations the cost of compactions (for LevelDB) and vacuums (PostgreSQL).
Aha, thanks. I suppose that in theory you could avoid actual merging if you had disjoint key ranges, in which case random writes could conceivably have less write amplification than mostly sequential writes, but does anyone do that in practice? Does it make a difference in practice?
What do you mean by "You just hamstrung your B-tree for conscience to implement a single model"? I'm not sure what you mean by either "for conscience" or "to implement a single model".
It's probably not an extremely fair comparison, but I'm not sure which databases it's slanted in favor of, and it wouldn't be very surprising if I'd overlooked some easy speedup. Like, maybe by deferring index creation until after ETL you could get a significant boost in Postgres speed, or by changing the Postgres tuning numbers. I do think it was running long enough to incur a significant number of compactions and vacuums, though. Feel free to critique in more detail. The Postgres and SQLite numbers come from http://canonical.org/~kragen/sw/dev3/exampledb.py, but for LevelDB I inferred that the Python bindings were the bottleneck, so the 300'000 figure is from http://canonical.org/~kragen/sw/dev3/leveldb-62m.cc.
i don't think that's an unfair comparison, that is in line with similar tests i have performed for random write workloads on lsm trees and b trees that don't fit in ram, including cost of compaction. https://youtu.be/jwq_0mPNnN8?t=3730 links to a section of a talk i did which explains my benchmarks. it also includes some pretty neat visualizations of the write patterns of the two data structures that may help people understand why there is such a big difference in write throughput. my benchmarks were done on spinning disk, on ssds i'd expect lsm trees to only have around 10x higher write throughput for the same benchmark.
Presumably people care about write speed because it's typically harder to scale up than read speed, as there are just fewer tricks overall that can be applied.
Like we can assume the overwhelming majority of applications writing to a database are interested in durability. A good portion of those could tolerate eventual consistency on reads, for instance, which means there are considerably more ways outside the database to improve read performance.