Timeline for Indexes for speeding up find by rank / counting

Current License: CC BY-SA 4.0

13 events

when toggle format	what		by	license	comment
Jun 25, 2023 at 15:45	comment	added	Rick James		@Luatic - Be sure to study MySQL's "Change Buffer" -- where it holds non-unique updates in hopes of batch-updating the indexing.
Jun 24, 2023 at 17:57	comment	added	Rick James		Columnstore is a storage engine in MariaDB: mariadb.com/kb/en/mariadb-columnstore . I suspect that insert/delete/update is very costly for columnstore -- because it must recompress columns in the affected 64K-row chunk.
Jun 24, 2023 at 16:43	vote	accept	Luatic
Jun 24, 2023 at 16:39	comment	added	Bill Karwin		In any case, I think I've answered your original question — the feature you are looking for is not implemented in the brands of RDBMS you named, and adding it would be a significant engineering effort. Some support may exist in other architectures, but to do so, they have probably sacrificed other optimizations that you would miss.
Jun 24, 2023 at 16:27	comment	added	Luatic		I'm actually considering (trying to) implement it if no one has done it before, though I realize that's a lot of work so I was hoping I wouldn't "have to" :P
Jun 24, 2023 at 15:53	comment	added	Bill Karwin		If you think it can be done in an RDBMS that supports MVCC, I encourage you to implement it. I'm sure you will learn a lot.
Jun 24, 2023 at 13:31	comment	added	Bill Karwin		But any of the rows between `x` and `y` might not be visible in a given client's transaction (e.g. were inserted or deleted after the client's transaction started). So to get the count, each row has to be checked individually. Or else you cache every possible count, which means any update to the table has to update all the cached counts.
Jun 24, 2023 at 13:24	comment	added	Luatic		In the same way; I don't see why it wouldn't work. To count the entries with `id`s between `x` and `y`, just find the rank of `x` and the rank of `y` and subtract the two. Finding the two ranks involves two logarithmic time lookups in the B-Tree.
Jun 24, 2023 at 13:18	comment	added	Bill Karwin		That's roughly how each index value supports multi-versioning. How would you do it for counts? Especially to handle the `BETWEEN x AND y` scenario you brought up.
Jun 24, 2023 at 13:16	comment	added	Luatic		One solution would be keeping the old pages around and only adding new pages for all modified nodes. This is how persistent tree data structures are usually implemented. For B-Trees, you might want to "compress" the difference between the "new node" and the "old node" since the nodes will usually be very similar.
Jun 24, 2023 at 12:46	comment	added	Bill Karwin		How do you propose counts would be stored in the index, given that many RDBMS implementations support multiple concurrent views against the data? That is, you have many clients, each of which can "see" a different count of rows because of transaction isolation.
Jun 24, 2023 at 10:35	comment	added	Luatic		Thanks for the clear answer :) I'm surprised that no RDBMS seems to have an option for augmenting B-Tree indexes with counts; it would enable many rank-based operations to run in logarithmic rather than linear time at the cost of requiring a few more (but still only O(log n) total) disk writes for each update (to update the counts of all the nodes in the path).
Jun 23, 2023 at 17:13	history	answered	Bill Karwin	CC BY-SA 4.0