Skip to content

CNDB-17012: Improve TrieMemoryIndex row count estimation#2262

Open
pkolaczk wants to merge 1 commit intomainfrom
c17012-fix-estimate-num-rows
Open

CNDB-17012: Improve TrieMemoryIndex row count estimation#2262
pkolaczk wants to merge 1 commit intomainfrom
c17012-fix-estimate-num-rows

Conversation

@pkolaczk
Copy link

@pkolaczk pkolaczk commented Mar 9, 2026

Fixes a bug in TermsDistribution#toBigDecimal which didn't preserve
order for some types and broke the assertion
in TrieMemoryIndex#estimateNumRowsMatchingRange.

Additionally, replaces the row count estimation algorithm with a better
one:

  • simpler
  • more efficient (no need to search and iterate the trie)
  • more accurate (especially for wider ranges)
@github-actions
Copy link

github-actions bot commented Mar 9, 2026

Checklist before you submit for review

  • This PR adheres to the Definition of Done
  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit staring with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits
  • All new files should contain the DataStax copyright header instead of the Apache License one
@pkolaczk pkolaczk requested a review from adelapena March 9, 2026 13:54
@pkolaczk pkolaczk force-pushed the c17012-fix-estimate-num-rows branch from 58dc2a8 to 3ea235f Compare March 9, 2026 16:18
}
}

private static final char[] CHARS = "abcdefghijklmnopqrstuvwxyz".toCharArray();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why a limited alphabet, if we are testing all of UTF-8?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is this simplified estimation isn't very accurate if the data is not distributed uniformly. This ticket really switched the estimation from a horrible algorithm to a slightly less horrible ;). And this test is just a very basic sanity check that the algorithm works fine if the main assumption (= data distributed uniformly) kinda holds.

If we use a wider character set, there will be likely huge holes between the code points, and modeling real distribution of characters used by some real text (as in some existing natural language) becomes hairy very quickly. I got already much worse results when using a set of {a-z, 0-9}, because of a hole between encodings of letters and numbers.

Btw - this test wasn't passing earlier even for the simplified character set like that.
Im afraid we can't make this test better without further improvement of the estimation algorithm itself. We'd probably need to introduce histograms similar way like we have for sstables, but that is far out of scope of this ticket.

@pkolaczk pkolaczk force-pushed the c17012-fix-estimate-num-rows branch from 773b3d7 to a238f83 Compare March 10, 2026 12:02
Fixes a bug in TermsDistribution#toBigDecimal which didn't preserve order for some types and broke the assertion in TrieMemoryIndex#estimateNumRowsMatchingRange. Additionally, replaces the row count estimation algorithm with a better one: - simpler - more efficient (no need to search and iterate the trie) - more accurate (especially for wider ranges)
@pkolaczk pkolaczk force-pushed the c17012-fix-estimate-num-rows branch from a238f83 to d1532b7 Compare March 10, 2026 14:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

3 participants