May I also mention that N log N is the total cost of compaction for a DB of size N. You don't perform a compaction on every single write. Amortised per write the cost is more like N log(N)/N == log(N).
Also, N log(N) is nowhere near exponential. O(2^N) would be exponential, and that's not what you have here.
"Amortised per write" - now you're getting down into the constant factors, which Big-O disregards. But you can't ignore them in real implementations. First the actual writes have a 2x constant factor, since you're writing to a WAL in addition to the DB itself.
The original LSM paper claims that writes to Level 0 are free because that's all in-memory. But that's not really true; if you have a stream of incoming writes then everything that goes into Level 0 must eventually be pushed out to Level 1. Buffering doesn't make writes free, it only displaces their occurrence in time.
So you have a rolling merge every M writes. As far as Big-O goes, that's N log(N) / M == N log(N) because Big-O disregards constant factors!
In the context of an implementation like LevelDB, theory and reality diverge even further. Since it chunks data into 2MB files and deletes them during each merge operation, and also writes a bunch of bookkeeping into Manifest files, the number of actual I/Os is much higher. There's a lot of wasted activity in allocating and deallocating space - filesystem metadata overhead that, on top of everything else, isn't transactionally safe.
In LevelDB a single merge reads 26MB and writes 26MB at a time to push 2MB of data from a level L to level L+1. So now instead of a single merge op costing only N, it actually costs 13*N. Again, if you're only talking about Big-O complexity you sweep this under the rug. But in reality, this is a huge cost.
Stated another way - assume you want to sustain a user workload writing 20MB/sec, and you don't do any throttling. Level 0 consists of 4 1MB files - it will fill in 1/5th of a second, and then compaction will reduce it by 1MB. After that it will be compacting continuously every 1/20th of a second. To sustain this workload for the 1st second will thus require 17 compactions to Level 1. Assuming an already populated Level 1 and worst-case key distribution that means in 1 second it will trigger compactions that read 238MB and write 238MB to store the incoming 20MB.
Level 1 is only 10MB, so if it was empty it would fill in the first 1/2 second. For the remaining 1/2 second it would trigger 5 more compactions to Level 2, reading 130MB and writing 130MB. If it started out full then this would be 260MB/260MB respectively.
So for a 20MB/sec input workload you would need a disk subsystem capable of sustaining 498MB/sec of reads concurrent with 498MB/sec of writes. And that's only for a small DB, only Level 0-2 present (smaller than 110MB), and excluding the actual cost of filesystem operations (create/delete/etc).
That's only for the 1st second of load. For every second after that, you're dumping from Level 0 to Level 1 at 280MB read and 280MB written per second, and from Level 1 to Level 2 at 260/260 as before. 540/540 - so a disk capable of 1080MB/sec of I/O is needed to sustain a 20MB/sec workload. And this is supposed to be HDD-optimized? Write-optimized? O(N log N) - what a laugh.
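For anyone who wants to check the arithmetic, here's the whole back-of-envelope calculation in one place. All figures are the assumptions stated above (LevelDB defaults, worst-case key distribution), not measurements:

```python
# Reproducing the back-of-envelope numbers above. Every figure is an
# assumption from the comment, not something measured from LevelDB.
rate = 20                       # MB/sec of incoming user writes

# A single L1->L2 merge: 26MB read and 26MB written to push 2MB down.
assert 26 / 2 == 13             # the 13x per-merge amplification

# Level 0 -> Level 1, first second: 4MB to fill L0, then a 1MB
# compaction every 1/20th second, at 14MB read + 14MB written each.
l0_io = 17 * 14
print(f"L0->L1, first second: {l0_io}MB read, {l0_io}MB written")

# Level 1 -> Level 2, first second, if L1 started out full:
# 10 compactions at 26MB each way.
l1_io = 10 * 26
print(f"L1->L2, first second: {l1_io}MB each way")
print(f"first second total: {l0_io + l1_io}MB/sec each way")

# Steady state: 20 L0 compactions/sec plus the same L1->L2 load.
steady = 20 * 14 + l1_io
print(f"steady state: {steady}MB read + {steady}MB written per second, "
      f"i.e. {2 * steady}MB/sec of disk I/O for {rate}MB/sec of input")
```

The numbers come out to 238, 260, 498 for the first second and 540/540 (1080MB/sec total) steady state, matching the figures above.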
Maybe LSMs in general can be more efficient than this. LevelDB is pretty horrible though.
It would only trigger compaction if SST tables have overlapping keys. If you only write new items, the goleveldb implementation just creates 3.7MB SST tables by default without trying to merge them into bigger chunks (what would be the point? they're all sorted and non-overlapping). With a queue-consumption workload it starts merging tombstones with SST tables, and since tombstones are also in sorted order it doesn't pick up multiple SST tables at a time - it just removes stale SST files, either completely or partially. I added some more benchmarks, including packing a queue with 200M messages of 64 bytes each, and consuming those 200M messages. The speed is sustainable. https://github.com/bogdanovich/siberite/blob/master/docs/ben...
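To illustrate why non-overlapping tables make merging pointless, here's a toy sketch (my own, not goleveldb's actual code) showing that table key ranges alone route any lookup to exactly one file:

```python
# Toy illustration: with sorted, non-overlapping SSTables, the table
# boundaries alone tell you which single file could hold a key, so
# there's nothing to gain by merging them into bigger chunks.
import bisect

# Each (hypothetical) table covers [min_key, max_key]; written in key
# order, so the ranges never overlap.
tables = [("a", "f"), ("g", "m"), ("n", "s"), ("t", "z")]

def table_for(key):
    # Find the first table whose max_key >= key, then confirm it
    # actually covers the key (it might fall into a gap).
    i = bisect.bisect_left([hi for _, hi in tables], key)
    if i < len(tables) and tables[i][0] <= key:
        return i
    return None  # key falls in a gap: no table needs to be read at all

print(table_for("h"))   # 1
print(table_for("q"))   # 2
```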
The TokuDB implementation of Fractal Trees uses a single-process, multi-threaded model, which is incompatible with PostgreSQL's multi-process model. In theory, one could patch TokuDB to be suitable, but this is a ton of work.
That's not really online, it still rebuilds the entire table, it just does it in the background and shows the results once it's done. This is pretty similar to how TokuDB hot indexing works, but TokuDB's hot column add/rename/delete is truly online, the results are visible immediately and the actual disk work is done in a lazy fashion.
Cracking is kind of similar in that it delays the "sorting work" done in the indexing structure. However, cracking is fairly heuristic and therefore hard to analyze without an intimate understanding of the workload. Fractal Trees do pretty much the same thing under all workloads (modulo optimizations for things like sequential and Zipfian workloads), so they're easier to analyze and prove bounds for.
An interesting new development is http://getkudu.io/ which applies some ideas shared with Fractal Trees and other write-optimized data structures (like LSM-trees) to column stores.
At Tokutek, we had some designs for how to implement column store-like data models on Fractal Trees (we called them "matrix stores" because they had some of the benefits of both row stores and column stores), but we didn't get around to implementing them.
This is the critical insight. TokuDB uses a write-ahead log which is synced according to the configuration, and can be made as immediate as full fsync on commit. This provides the strongest durability available on a single machine.
Where TokuDB gets its speed boost is by delaying the random reads associated with updating the indexing structure (the Fractal Tree). The buffers are written to disk on checkpoint, but because they're buffers, the potentially random writes are localized to a smaller number of nodes high in the tree, which minimizes the number of disk seeks required. Since sequential I/O is cheaper than random, the sequential writes to the write-ahead log are very fast, so even in very strict durability configurations, TokuDB can easily outperform databases which use random writes to update the indexing structures, such as the B-trees used by InnoDB and most other RDBMSes.
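A toy sketch of the buffering idea (nothing like TokuDB's actual implementation; the two-way fan-out and buffer size are made up) showing how buffers at upper nodes batch the expensive lower-level updates:

```python
# Toy model of buffered inserts: writes land in a buffer at the root,
# and only when the buffer fills is it flushed one level down. Random
# writes therefore touch a few nodes high in the tree, and the leaf
# updates happen in batches rather than one seek per write.
BUFFER_SIZE = 4

class Node:
    def __init__(self, depth, max_depth):
        self.buffer = []
        self.children = {}
        self.depth, self.max_depth = depth, max_depth
        self.leaf_data = {}   # only populated at the bottom level
        self.flushes = 0      # how often this node actually descends

    def insert(self, key, value):
        if self.depth == self.max_depth:
            self.leaf_data[key] = value
            return
        self.buffer.append((key, value))
        if len(self.buffer) >= BUFFER_SIZE:
            self.flush()

    def flush(self):
        self.flushes += 1
        for key, value in self.buffer:
            child = self.children.setdefault(
                key % 2, Node(self.depth + 1, self.max_depth))
            child.insert(key, value)
        self.buffer = []

root = Node(0, 2)
for k in range(16):
    root.insert(k, str(k))
# 16 inserts caused only 4 flushes at the root, not 16 leaf updates.
print(root.flushes)
```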
TokuDB is built for a single-process model with threads for connections. This is incompatible with PostgreSQL's multi-process model, and patching TokuDB to support such a model would be a large effort. Not impossible, but a lot of work.
A bottle is usually about 4 glasses (which is consistent with my experience, not sure how you made your calculation), so when shared, that's one to drink with the meal and one to share over conversation after. I'm far from a teetotaler, but it doesn't seem terribly excessive when phrased that way.
It may depend on your upbringing. In the modern US, particularly the more puritan regions, any alcohol at all is usually considered a luxury reserved for celebration, but in some parts of the world, wine or beer is just the drink you drink with a meal, and there's less of a stigma about it. If it's not your thing, it's not your thing, but applying the label "excess" is a pretty specifically cultural decision.
A single serving of wine is 4 to 5 ounces. Don't base it off of restaurant pours, they often over-pour to compensate for the high markup.
A bottle is at least 25 ounces. Split between two people, that's 2.5 to 3.125 servings of wine per person per bottle. Five times a week, that's 12.5 to over 15 servings.
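Spelling the arithmetic out (the totals only make sense as servings, not ounces):

```python
# A 25oz bottle split two ways, five nights a week.
bottle_oz = 25
per_person_oz = bottle_oz / 2        # 12.5 oz per person per bottle
servings_low = per_person_oz / 5     # at 5 oz per serving: 2.5
servings_high = per_person_oz / 4    # at 4 oz per serving: 3.125
weekly = (servings_low * 5, servings_high * 5)
print(weekly)   # (12.5, 15.625) servings per person per week
```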
I'm in the US, but from an Italian family. Wine was on the table often, especially around my extended family. I enjoy whiskey, and I've been into savory hard ciders since I found out beer is out of the question. Like I said, hardly a teetotaler.
Yet I'm under no illusion that regularly drinking 15 servings of alcohol a week is recommended. If someone were drinking that much, that consistently, their physician would definitely recommend they cut their consumption considerably.
You could try talking to them about this. Try "hey, I understand being in this industry can be difficult as a woman, I am not very familiar with the problem but I don't want to be part of it, and I'm concerned about how much eye contact I'm giving you during meetings. I don't want to make you uncomfortable with my eye contact or this question, but is it a problem for you?"
Most likely they'll tell you they have a lot of worse things to deal with than how much eye contact you're giving them, but it'll be a learning experience and at least you'll open up a dialogue.
3. Don't be discouraged, accept that there are things you can do to help! This is good, you're about to become a more valuable member of society.
4. Accept that you need to put in some effort. Not much, don't worry. Most of it is shutting up.
5. Read what women write about this problem. They're all experts because they've been studying it literally their entire lives.
6. Learn (from step 5 and some self-reflection) how to recognize in real time when you're being a jerk. Then stop.
Now it gets a little harder. But remember, not nearly as hard as being a woman in tech, so buck up, kid!
7. Learn (from step 5) how to recognize in real time when other people are being jerks, and good techniques for how to advocate on behalf of women. This takes practice and courage, keep at it.
8. You'll be tempted to brag about how helpful you are to women (hi @wadhwa). Resist this temptation. Whenever you feel yourself about to tell someone how great and helpful you are, instead, show them something an actual woman has said about their problems, and try to point them in the right direction.
Note that you can replace "woman/women" in the above with any minority/disenfranchised/disadvantaged group you want to help, and the same formula pretty much works verbatim.
Note that you can replace "Note that you can replace 'woman/women' in the above with any minority/disenfranchised/disadvantaged group" with "Note that you can replace 'woman/women' with any person of any group who faces hurdles or has disadvantages".
When you can do that, you have successfully transcended the "gotta get mine!" and "shut up and give me your stuff!" subtext of most of today's egalitarian rhetoric.
Thank you very much for the compliment, although it's important to remember that just because someone has a contrarian opinion, doesn't necessarily mean that they have critical thinking skills.
(Unless you were being sarcastic. Sarcasm doesn't translate well over text.)
You seem certain that you know exactly what the problem is, and exactly how to solve it, down to a list of steps. You even seem certain that a stranger on the Internet is part of the problem. Well, give me a break -- that's really presumptuous and condescending. Listening to the aggrieved seems like a good idea, but there's no guarantee that they have the answers.
>You even seem certain that a stranger on the Internet is part of the problem
The stranger on the internet said they felt like part of the problem, so I don't really know what to tell you.
If you read between the lines a bit, I never said I had any of the answers. All I said was listen to the people that are affected, try to learn about their problems, and be empathetic, and maybe you'll find the answers.
I'm not trying to be revolutionary here, I just think one of the nice features of these types of issues is that the people being affected are actual people that you can listen to and have conversations with, and I think that's a good place to start.
> All I said was listen to the people that are affected, try to learn about their problems, and be empathetic, and maybe you'll find the answers.
Are you aware that one woman will have entirely different experiences and suggestions for how to solve tech's sexism problem than the next? E.g., you implicate Vivek Wadhwa as being part of the problem (probably due to Amelia Greenhall's accusations), and yet there are an overwhelming number of women who take Vivek's side, not Amelia's. Amelia Greenhall herself, for example, is highly critical of Sheryl Sandberg, but highly supportive of women whose blogs are banned by Hacker News (Nitasha Tiku, for example). So, with all due respect, your suggestions are pretty impractical and will mostly produce confusion for the person following the advice, not much in the way of good results.
More people accepting the problem, reading up on the experiences of those facing discrimination, and thinking about what they can do to change themselves and those around them.
That seems like a pretty good starting point, no?
Whatever you read about, you're going to come up with conflicting opinions and suggestions, so that is hardly new, and is certainly not an excuse for doing nothing.
> That seems like a pretty good starting point, no?
Honestly, not at all. I really think the exact opposite is true. The debate in this arena is hellishly toxic. You read into it a little and you quickly find out how disturbing the ideas flying around really are -- especially from the prominent voices. Staying away from these debates is probably the best option for now. I really hope in time something happens and the toxicity goes away.
I agree, the mask/not/and idiom is standard within systems contexts. It need not be cognitive overhead; not to be elitist, but if you don't have a copy of Hacker's Delight or the ability to understand constructs like this, you have no business writing systems-level code.
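For anyone unfamiliar, the idiom in question is just clearing the bits selected by a mask by AND-ing with its complement:

```python
# The mask/not/and idiom: x & ~mask clears exactly the bits in `mask`
# and leaves every other bit untouched. (Flag names are made up.)
FLAG_DIRTY  = 0b0010
FLAG_LOCKED = 0b0100

flags = 0b0111                          # dirty, locked, and bit 0 set
flags &= ~(FLAG_DIRTY | FLAG_LOCKED)    # clear both flags, keep the rest
print(bin(flags))   # 0b1
```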
https://www.cs.berkeley.edu/~brewer/cs262/Aries.pdf
> Remember that merge operations are O(N). Then remember that there are N of them to do. O(N^2) is a horrible algorithmic complexity.
No. Mountains of actual math refute this. LSM-tree merges are O(N log N). This is an Actual Fact.
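Don't take my word for it - here's a toy simulation (my own sketch, not any real LSM implementation) of a leveled scheme where each level holds twice as much as the last and a full level merges down. It counts total items moved and compares against N log2 N:

```python
# Toy empirical check of the N log N claim: each item gets rewritten
# roughly once per level it passes through, and there are log2(N)
# levels, so total merge work is Theta(N log N), not Theta(N^2).
import math

def total_merge_work(n):
    levels = []          # levels[i] holds an item count; capacity 2**i
    moved = 0
    for _ in range(n):
        carry = 1        # one new item arrives at level 0
        for i in range(len(levels) + 1):
            if i == len(levels):
                levels.append(0)
            if levels[i] + carry <= 2 ** i:
                levels[i] += carry
                moved += carry
                break
            # level is full: merge its contents down along with the carry
            carry += levels[i]
            levels[i] = 0
    return moved

for n in (1024, 4096, 16384):
    print(n, total_merge_work(n), round(n * math.log2(n)))
```

The work tracks N log2 N closely and is nowhere near N^2.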
Read more, kids.