CNDB-17012: Improve TrieMemoryIndex row count estimation by pkolaczk · Pull Request #2262 · datastax/cassandra

pkolaczk · 2026-03-09T13:53:20Z

Fixes a bug in TermsDistribution#toBigDecimal which didn't preserve
order for some types and broke the assertion
in TrieMemoryIndex#estimateNumRowsMatchingRange.

Additionally, replaces the row count estimation algorithm with a better
one:

- simpler
- more efficient (no need to search and iterate the trie)
- more accurate (especially for wider ranges)
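
To make the idea concrete, here is a minimal sketch of a range estimate that needs no trie traversal: terms are mapped to numbers in an order-preserving way and the matching fraction is interpolated linearly between the smallest and largest indexed term. This is not the code from the PR. The class `RangeEstimateSketch`, both method signatures, and the fixed `width` parameter are hypothetical, and the sketch assumes terms are byte-comparable and roughly uniformly distributed.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.MathContext;
import java.math.RoundingMode;
import java.util.Arrays;

// Hypothetical illustration only; not the TrieMemoryIndex or TermsDistribution implementation.
class RangeEstimateSketch
{
    // Maps a term to a number so that unsigned lexicographic byte order is never reversed.
    // The term is truncated or zero-padded on the right to `width` bytes and read as an
    // unsigned big-endian integer (signum 1), so bytes >= 0x80 stay above bytes <= 0x7f.
    static BigDecimal toBigDecimal(byte[] term, int width)
    {
        byte[] fixed = Arrays.copyOf(term, width);
        return new BigDecimal(new BigInteger(1, fixed));
    }

    // Estimates how many of `totalRows` fall into [lower, upper], assuming the indexed terms
    // are spread uniformly between `min` and `max`.
    static long estimateRows(byte[] min, byte[] max, byte[] lower, byte[] upper, long totalRows, int width)
    {
        BigDecimal lo = toBigDecimal(min, width);
        BigDecimal hi = toBigDecimal(max, width);
        BigDecimal a = toBigDecimal(lower, width);
        BigDecimal b = toBigDecimal(upper, width);

        // Empty query range, or no overlap with the indexed terms.
        if (b.compareTo(a) < 0 || b.compareTo(lo) < 0 || a.compareTo(hi) > 0)
            return 0;
        // All indexed terms map to the same value and the query range covers it.
        if (hi.compareTo(lo) == 0)
            return totalRows;

        // Clamp the query range to [lo, hi] and interpolate linearly.
        BigDecimal covered = b.min(hi).subtract(a.max(lo));
        BigDecimal fraction = covered.divide(hi.subtract(lo), MathContext.DECIMAL64);
        return fraction.multiply(BigDecimal.valueOf(totalRows))
                       .setScale(0, RoundingMode.HALF_UP)
                       .longValueExact();
    }

    public static void main(String[] args)
    {
        byte[] min = { 0 }, max = { 100 }, lower = { 25 }, upper = { 75 };
        System.out.println(estimateRows(min, max, lower, upper, 1000, 1));
    }
}
```

With terms spanning 0 to 100 and a query for 25 to 75, the sketch attributes half of the rows to the range. The estimate depends only on the index's minimum, maximum and row count, which is why it is cheap, and also why it degrades when the data is not uniform.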

github-actions · 2026-03-09T13:53:39Z

Checklist before you submit for review

This PR adheres to the Definition of Done
Make sure there is a PR in the CNDB project updating the Converged Cassandra version
Use NoSpamLogger for log lines that may appear frequently in the logs
Verify test results on Butler
Test coverage for new/modified code is > 80%
Proper code formatting
Proper title for each commit starting with the project-issue number, like CNDB-1234
Each commit has a meaningful description
Each commit is not very long and contains related changes
Renames, moves and reformatting are in distinct commits
All new files should contain the DataStax copyright header instead of the Apache License one

test/unit/org/apache/cassandra/index/sai/disk/v6/TermsDistributionTest.java

src/java/org/apache/cassandra/index/sai/memory/TrieMemoryIndex.java

adelapena · 2026-03-09T17:12:32Z

test/unit/org/apache/cassandra/index/sai/memory/TrieMemoryIndexTest.java

+ }
+ }
+
+ private static final char[] CHARS = "abcdefghijklmnopqrstuvwxyz".toCharArray();


Why a limited alphabet, if we are testing all of UTF-8?

pkolaczk · 2026-03-10

The problem is that this simplified estimation isn't very accurate if the data is not distributed uniformly. This ticket really just switched the estimation from a horrible algorithm to a slightly less horrible one ;). And this test is only a very basic sanity check that the algorithm works fine if the main assumption (= data distributed uniformly) kinda holds.

If we use a wider character set, there will likely be huge holes between the code points, and modeling the real distribution of characters used in real text (as in some existing natural language) becomes hairy very quickly. I already got much worse results when using a set of {a-z, 0-9}, because of the hole between the encodings of letters and digits.

Btw - this test wasn't passing earlier even for a simplified character set like that.
I'm afraid we can't make this test better without further improving the estimation algorithm itself. We'd probably need to introduce histograms, similar to what we have for sstables, but that is far out of scope of this ticket.
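
The effect of a code-point hole described above can be reproduced with a few lines of arithmetic. The sketch below is hypothetical (the class `AlphabetHoleSketch` and its methods are not part of the PR or the test) and uses single-character terms for simplicity. It compares the true fraction of equally likely characters inside a range with what linear interpolation over the code-point range would predict.

```java
// Hypothetical illustration only; single-character terms, each character equally likely.
class AlphabetHoleSketch
{
    // True selectivity: share of the alphabet's characters that lie in [lower, upper].
    static double trueFraction(String alphabet, char lower, char upper)
    {
        long hits = alphabet.chars().filter(c -> c >= lower && c <= upper).count();
        return (double) hits / alphabet.length();
    }

    // Uniform estimate: linear interpolation between the smallest and largest code point,
    // ignoring any unused code points in between.
    static double uniformEstimate(String alphabet, char lower, char upper)
    {
        int min = alphabet.chars().min().getAsInt();
        int max = alphabet.chars().max().getAsInt();
        return (double) (upper - lower) / (max - min);
    }

    public static void main(String[] args)
    {
        String letters = "abcdefghijklmnopqrstuvwxyz";
        String mixed = "0123456789" + letters;

        System.out.printf("a-z only, range [a, m]: true=%.3f estimate=%.3f%n",
                          trueFraction(letters, 'a', 'm'), uniformEstimate(letters, 'a', 'm'));
        System.out.printf("0-9 + a-z, range [0, 9]: true=%.3f estimate=%.3f%n",
                          trueFraction(mixed, '0', '9'), uniformEstimate(mixed, '0', '9'));
    }
}
```

For the contiguous alphabet a-z the two numbers are close (13/26 versus 12/25). Once digits are added, the unused code points between '9' and 'a' are counted as if they held data, so the estimate for the digit range drops to 9/74 while the true fraction is 10/36.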


cassci-bot · 2026-03-10T15:37:17Z

❌ Build ds-cassandra-pr-gate/PR-2262 rejected by Butler

7 regressions found
See build details here

Found 7 new test failures

Test	Explanation	Runs	Upstream
o.a.c.index.sai.QueryContextTest.testWideTableScoreOrdered[ca] (compression)	REGRESSION	🔴🔵	0 / 22
o.a.c.index.sai.cql.NonNumericTermsDistributionTest.testAsciiIndexEstimates (compression)	REGRESSION	🔵🔴	0 / 22
o.a.c.index.sai.cql.NonNumericTermsDistributionTest.testTimestampIndexEstimates (compression)	REGRESSION	🔵🔴	0 / 22
o.a.c.index.sai.cql.NonNumericTermsDistributionTest.testUtf8IndexEstimates (compression)	REGRESSION	🔵🔴	0 / 22
o.a.c.index.sai.cql.VectorCompaction100dTest.testOneToManyCompaction[ec false]	REGRESSION	🔴⚪	0 / 22
o.a.c.index.sai.cql.VectorCompaction2dTest.testZeroOrOneToManyCompaction[ca false] (compression)	REGRESSION	🔴🔵	0 / 22
o.a.c.index.sai.cql.VectorSiftSmallTest.testRerankKZeroOrderMatchesFullPrecisionSimilarity[ec true false]	REGRESSION	🔴⚪	0 / 22

Found 7 known test failures

sonarqubecloud · 2026-03-11T19:34:18Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
93.5% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

pkolaczk requested a review from adelapena March 9, 2026 13:54

pkolaczk force-pushed the c17012-fix-estimate-num-rows branch from 58dc2a8 to 3ea235f March 9, 2026 16:18

adelapena reviewed Mar 9, 2026


pkolaczk force-pushed the c17012-fix-estimate-num-rows branch from 773b3d7 to a238f83 March 10, 2026 12:02

pkolaczk force-pushed the c17012-fix-estimate-num-rows branch from a238f83 to d1532b7 March 10, 2026 14:26

pkolaczk wants to merge 1 commit into main from c17012-fix-estimate-num-rows


3 participants
