I understand the standard explanation for why B-trees are used in databases: they minimize disk seeks by packing many keys into each node, keeping the tree shallow (3-4 levels), and enabling efficient sequential scans.

I'm confused because the db storage engine (e.g., InnoDB) interacts with data through the OS file system abstraction. The OS makes no guarantees about where data is physically written on disk. (it has its own algorithms for minimizing disk head movement etc.)

My doubts:

  1. The OS doesn't write sequentially to disk in the order the storage engine wants it to. For example, Windows doesn't allow easy volume resizing because data placement is unpredictable. How can the storage engine's "page optimization" actually translate to fewer physical disk seeks if it has no control over where pages end up on the physical medium?

  2. The file system API abstracts away block storage details. If B-trees are designed for disk seek optimization specifically, but the storage engine can't control or even know about physical disk layout, isn't this optimization working at the wrong level of abstraction?

  3. Despite, B-trees clearly work well in practice. Is it that the OS does try to keep sequentially-written data physically close together (maybe not perfectly)? Or its more about reducing the number of I/O requests rather than their physical locality?

Related:

1 Reply 1

As far as I know, as you already assume, database engines usually do not directly interact with the OS filesystem.

First of all, this technology was developed for spinning disks, I think with today's SSDs, that have good random access, the sequential access property is much less useful.

What happens in practice is that (at least for the DBMS that I used) the database administrator (DBA) allocates a large file for the database, the database may also be split into "volumes", all depending on the DBMS. If the file seems fragmented on disk, the DBA can defragment the file (but usually you would run it before creating the database file). That way, the file is unlikely to be fragmented.

The DBMS engine may also assume a certain block size without necessarily know it. E.g. if it assumes 4096 bytes, the file system may actually have 16KB. That is not a problem because the B-Tree would just put four blocks into a single disk block, so access is still optimized.

Besides that, regardless of fragmentation and disk block size, B-Trees (B+trees) work well because they are effective at indexing large amounts of entries, so you can even use them in memory.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.