Timeline for Indexes for speeding up find by rank / counting
Current License: CC BY-SA 4.0
13 events
| when toggle format | what | by | license | comment | |
|---|---|---|---|---|---|
| Jun 25, 2023 at 15:45 | comment | added | Rick James | @Luatic - Be sure to study MySQL's "Change Buffer" -- where it holds non-unique updates in hopes of batch-updating the indexing. | |
| Jun 24, 2023 at 17:57 | comment | added | Rick James | Columnstore is a storage engine in MariaDB: mariadb.com/kb/en/mariadb-columnstore . I suspect that insert/delete/update is very costly for columnstore -- because it must recompress columns in the affected 64K-row chunk. | |
| Jun 24, 2023 at 16:43 | vote | accept | Luatic | ||
| Jun 24, 2023 at 16:39 | comment | added | Bill Karwin | In any case, I think I've answered your original question — the feature you are looking for is not implemented in the brands of RDBMS you named, and adding it would be a significant engineering effort. Some support may exist in other architectures, but to do so, they have probably sacrificed other optimizations that you would miss. | |
| Jun 24, 2023 at 16:27 | comment | added | Luatic | I'm actually considering (trying to) implement it if no one has done it before, though I realize that's a lot of work so I was hoping I wouldn't "have to" :P | |
| Jun 24, 2023 at 15:53 | comment | added | Bill Karwin | If you think it can be done in an RDBMS that supports MVCC, I encourage you to implement it. I'm sure you will learn a lot. | |
| Jun 24, 2023 at 13:31 | comment | added | Bill Karwin | But any of the rows between x and y might not be visible in a given client's transaction (e.g. were inserted or deleted after the client's transaction started). So to get the count, each row has to be checked individually. Or else you cache every possible count, which means any update to the table has to update all the cached counts. | |
| Jun 24, 2023 at 13:24 | comment | added | Luatic | In the same way; I don't see why it wouldn't work. To count the entries with ids between x and y, just find the rank of x and the rank of y and subtract the two. Finding the two ranks involves two logarithmic time lookups in the B-Tree. | |
| Jun 24, 2023 at 13:18 | comment | added | Bill Karwin | That's roughly how each index value supports multi-versioning. How would you do it for counts? Especially to handle the BETWEEN x AND y scenario you brought up. | |
| Jun 24, 2023 at 13:16 | comment | added | Luatic | One solution would be keeping the old pages around and only adding new pages for all modified nodes. This is how persistent tree data structures are usually implemented. For B-Trees, you might want to "compress" the difference between the "new node" and the "old node" since the nodes will usually be very similar. | |
| Jun 24, 2023 at 12:46 | comment | added | Bill Karwin | How do you propose counts would be stored in the index, given that many RDBMS implementations support multiple concurrent views against the data? That is, you have many clients, each of which can "see" a different count of rows because of transaction isolation. | |
| Jun 24, 2023 at 10:35 | comment | added | Luatic | Thanks for the clear answer :) I'm surprised that no RDBMS seems to have an option for augmenting B-Tree indexes with counts; it would enable many rank-based operations to run in logarithmic rather than linear time at the cost of requiring a few more (but still only O(log n) total) disk writes for each update (to update the counts of all the nodes in the path). | |
| Jun 23, 2023 at 17:13 | history | answered | Bill Karwin | CC BY-SA 4.0 |