CNDB-17012: Improve TrieMemoryIndex row count estimation#2262
CNDB-17012: Improve TrieMemoryIndex row count estimation#2262
Conversation
Checklist before you submit for review
|
58dc2a8 to 3ea235f Compare test/unit/org/apache/cassandra/index/sai/disk/v6/TermsDistributionTest.java Show resolved Hide resolved
test/unit/org/apache/cassandra/index/sai/disk/v6/TermsDistributionTest.java Show resolved Hide resolved
src/java/org/apache/cassandra/index/sai/memory/TrieMemoryIndex.java Outdated Show resolved Hide resolved
| } | ||
| } | ||
| | ||
| private static final char[] CHARS = "abcdefghijklmnopqrstuvwxyz".toCharArray(); |
There was a problem hiding this comment.
Why a limited alphabet, if we are testing all of UTF-8?
There was a problem hiding this comment.
The problem is this simplified estimation isn't very accurate if the data is not distributed uniformly. This ticket really switched the estimation from a horrible algorithm to a slightly less horrible ;). And this test is just a very basic sanity check that the algorithm works fine if the main assumption (= data distributed uniformly) kinda holds.
If we use a wider character set, there will be likely huge holes between the code points, and modeling real distribution of characters used by some real text (as in some existing natural language) becomes hairy very quickly. I got already much worse results when using a set of {a-z, 0-9}, because of a hole between encodings of letters and numbers.
Btw - this test wasn't passing earlier even for the simplified character set like that.
Im afraid we can't make this test better without further improvement of the estimation algorithm itself. We'd probably need to introduce histograms similar way like we have for sstables, but that is far out of scope of this ticket.
773b3d7 to a238f83 Compare Fixes a bug in TermsDistribution#toBigDecimal which didn't preserve order for some types and broke the assertion in TrieMemoryIndex#estimateNumRowsMatchingRange. Additionally, replaces the row count estimation algorithm with a better one: - simpler - more efficient (no need to search and iterate the trie) - more accurate (especially for wider ranges)
a238f83 to d1532b7 Compare ❌ Build ds-cassandra-pr-gate/PR-2262 rejected by Butler7 regressions found Found 7 new test failures
Found 7 known test failures |
|



Fixes a bug in TermsDistribution#toBigDecimal which didn't preserve
order for some types and broke the assertion
in TrieMemoryIndex#estimateNumRowsMatchingRange.
Additionally, replaces the row count estimation algorithm with a better
one: