It is not always read-write-modify. There is no evidence of this pattern in Ubuntu when there is no memory pressure. Merge occurs when block is partially updated after kenel had lost state of the block, which can happen under memory pressure.
On x86, and I think every architecture, when you write to a memory mapping that is not already backed by a writable page, the kernel is notified that user code is trying to write. And the kernel needs to fill in the contents of the page, which requires a read if the page isn’t already loaded.
It has to be this way! The write could be a read-modify-write instruction. Or it could be a plain store, but I’ve never heard of hardware with write-only memory with fine enough granularity to make this work.
The sole exception is if the page in question is all zeros and the kernel can know this without needing to read the file. This might sometimes be the case for an append-only database. I don’t know exactly what QuestDB does.
Also:
> As soon as you mmap a file, the kernel allocates page table entries (PTEs) for the virtual memory to reserve an address range for your file,
Nope. It just makes a record of the existence of the mapping. This is called a VMA in Linux. No PTEs are created unless you set MAP_POPULATE.
> but it doesn't read the file contents at this point. The actual data is read into the page when you access the allocated memory, i.e. start reading (LOAD instruction in x86) or writing (STORE instruction in x86) the memory.
What are these LOAD and STORE instructions in x86? There are architectures reasonably described as load-store architectures, and x86 isn’t one of them.
> On x86, and I think every architecture, when you write to a memory mapping that is not already backed by a writable page, the kernel is notified that user code is trying to write. And the kernel needs to fill in the contents of the page, which requires a read if the page isn’t already loaded.
This is very true. Perhaps there wasn't enough context to what the article is describing. The read problem started to occur on database that is subject to constant write workload. Data is flowing in all the time at variable rate. Typically blocks are "hot" and being filled in fully within seconds if not millis.
Zeroing the file is an option to try. QuestDB allocates disk with `posix_fallocate()`, which doesn't have the required flags. We would need to explore `fallocate()`. Thanks.
If you are allocating zeroed space and then writing a whole page quickly, then you may well avoid a read. And doing this under extreme memory pressure will indeed cause the page to be written to disk while only partly written by user code, and it will need to be read back in. Reducing memory pressure is always good.
I would expect quite a bit better performance if you actually write a entire pages using pwrite(2) or the io_uring equivalent, though.
Messing with fallocate on a per-page basis is asking for trouble. It changes file metadata, and I expect that to hurt performance.
great, we are on the same page! `fallocate()` (or posix one) is called on large swades of file. 16MB default. Not too often to hurt performance. I wonder if zeroing the file with `fallocate()` will result in actual disk writes or is it ephemeral?