Fix Broken Lucene Query Tracking and Cancellation for OOM Protection#17884
anuragrai16 wants to merge 3 commits into apache:master
Conversation
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```diff
@@             Coverage Diff              @@
##             master   #17884      +/-   ##
============================================
+ Coverage     63.25%   63.26%    +0.01%
  Complexity     1481     1481
============================================
  Files          3190     3190
  Lines        192285   192347       +62
  Branches      29470    29470
============================================
+ Hits         121630   121689       +59
- Misses        61118    61125        +7
+ Partials       9537     9533        -4
```
Jackie-Jiang left a comment
Thanks for the fix!
```java
        "Failed while releasing the searcher manager for realtime text index for columns {}, exception {}",
        _columns, e.getMessage());
    // Propagate context to register searcher thread for CPU/memory tracking
    if (parentContext != null) {
```
`QueryThreadContext.contextAwareExecutorService()` provides a wrapper over the executor to maintain the context. You may modify `RealtimeLuceneTextIndexSearcherPool` to initialize a wrapped executor service instead.
The problem with `contextAwareExecutorService()` is that it also registers the future with `executionContext.addTask(future)` so that it can be terminated with `task.cancel(true)`. For Lucene index searchers on consuming segments (which also have an IndexWriter), there is a known Lucene issue (see this) where Lucene does not react well to thread interruption and can corrupt the underlying FSDirectory that is also used for index creation.
So, instead, we only propagate the `QueryThreadContext` explicitly to make sure tracking works, and do not register the future for cancellation. Cancellation is done manually, in a cooperative way, via `_shouldCancel`. This is only relevant for realtime-segment Lucene searches, which share the same FSDirectory object between the IndexWriter and the IndexSearcher.
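For illustration, here is a minimal, self-contained sketch of this cooperative-cancellation pattern. All names except `_shouldCancel` are hypothetical stand-ins, not Pinot's actual classes; the point is that the query killer flips a volatile flag instead of interrupting the searcher thread, so the shared FSDirectory is never hit by an interrupt.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class CooperativeSearchSketch {
  // Set by the query-killer thread; polled by the searcher thread.
  private volatile boolean _shouldCancel = false;

  public void cancel() {
    _shouldCancel = true; // no Thread.interrupt(): unsafe with a shared FSDirectory
  }

  // Simulated per-document collect loop; a real Lucene collector would poll
  // the flag inside LeafCollector.collect(int).
  public int search(int numDocs) {
    int collected = 0;
    for (int doc = 0; doc < numDocs; doc++) {
      if (_shouldCancel) {
        break; // terminate collection cleanly, leaving index files intact
      }
      collected++;
    }
    return collected;
  }

  public static void main(String[] args) throws Exception {
    CooperativeSearchSketch search = new CooperativeSearchSketch();
    ExecutorService pool = Executors.newSingleThreadExecutor();
    // Note: the future is NOT registered anywhere for future.cancel(true).
    Future<Integer> result = pool.submit(() -> search.search(1_000_000_000));
    Thread.sleep(50);
    search.cancel(); // cooperative cancellation
    System.out.println("terminated early: " + (result.get() < 1_000_000_000));
    pool.shutdown();
    pool.awaitTermination(5, TimeUnit.SECONDS);
  }
}
```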
Good call. In that case, I'd suggest adding a boolean flag to `QueryThreadContext.contextAwareExecutorService()` to control whether to skip registering the task for auto-termination. Wrapping the logic in the same place makes it easier to track and manage.
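As a rough sketch of the suggested shape (all names here are hypothetical stand-ins for `QueryThreadContext` and the executor wrapper, not the real Pinot API): the wrapper always propagates the caller's query context to the worker thread for CPU/memory tracking, but only registers the future for auto-termination when interruption is safe.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ContextAwareExecutorSketch {
  // Stand-in for the query thread context: whatever per-query state
  // (query id, resource accountant) must follow the task across threads.
  static final ThreadLocal<String> QUERY_CONTEXT = new ThreadLocal<>();

  private final ExecutorService _delegate = Executors.newFixedThreadPool(2);

  /**
   * @param registerForTermination pass false for realtime Lucene searchers,
   *        whose threads must not be interrupted (shared FSDirectory);
   *        those tasks cancel cooperatively instead.
   */
  public <T> Future<T> submit(Callable<T> task, boolean registerForTermination) {
    String parentContext = QUERY_CONTEXT.get(); // capture on the caller thread
    Future<T> future = _delegate.submit(() -> {
      QUERY_CONTEXT.set(parentContext); // restore on worker for CPU/mem tracking
      try {
        return task.call();
      } finally {
        QUERY_CONTEXT.remove();
      }
    });
    if (registerForTermination) {
      // In Pinot this would be executionContext.addTask(future), so the query
      // killer can call future.cancel(true); skipped for Lucene searchers.
    }
    return future;
  }

  public void shutdown() {
    _delegate.shutdown();
  }
}
```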
```java
import static org.testng.Assert.assertNotNull;
import static org.testng.Assert.assertTrue;

public class TextMatchOomKillingIntegrationTest extends BaseClusterIntegrationTest {
```
How long does this test run and how robust is it?
It runs for 2-4 minutes on my machine. OOM killing is inherently nondeterministic and depends on the Docker JVM used to run the test, so I've made the test accept either outcome for now:

```java
if (exceptionsNode.isEmpty()) {
  // Query completed successfully - verify resource tracking worked
  assertTrue(memAllocated > 0 || cpuTime > 0, ...);
} else {
  // Query was killed - verify it was OOM killed
  verifyOomKill(query, response);
}
```

(from `verifyOomKillOrResourceTracking()`)

So the test itself will not fail; it demonstrates either that tracking worked and the query was not killed, or, in most cases, that the query was killed, since the heap threshold for query killing is set very low (15%), with 0% for ALARMING tracking.
Adding a 2-4 minute test for this bug fix would be too much overhead. Do you see a way to make it lightweight? E.g. a unit test to ensure the memory usage is tracked.
```java
          _numDocsCollected++, "RealtimeLuceneDocIdCollector");
    } catch (RuntimeException e) {
      // Convert to CollectionTerminatedException for clean Lucene handling
      throw new CollectionTerminatedException();
```
Why do we need a separate exception? We have special handling for `TerminationException`, so it would be good to preserve it for a correct error message in the query response.
This exception is needed for the Lucene IndexSearcher to correctly close all open objects. See the documentation of the exception here:

> Throw this exception in LeafCollector.collect(int) to prematurely terminate collection of the current leaf.
> Note: IndexSearcher swallows this exception and never re-throws it.

Note that the Lucene collector is run via the IndexSearcher, and the original `TerminationException` from the interruption is preserved by `RealtimeLuceneTextIndex`/`MultiColumnRealtimeLuceneTextIndex`, which catch the interruption and retain the correct error message that the TEXT_MATCH was interrupted.
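For illustration, a self-contained sketch of this swallow-and-record pattern, using stand-in classes (the real code uses Lucene's `CollectionTerminatedException` and Pinot's collector and index classes): the collector records the real termination cause before throwing the termination exception, because the searcher swallows that exception; the index wrapper then surfaces the recorded cause with the right error message.

```java
public class TerminationPropagationSketch {
  // Stand-in for org.apache.lucene.search.CollectionTerminatedException.
  static class CollectionTerminatedException extends RuntimeException { }

  static class DocIdCollector {
    private RuntimeException _terminationCause; // remembered for the caller

    void collect(int docId) {
      try {
        checkTermination(docId); // e.g. checkTerminationAndSampleUsagePeriodically()
      } catch (RuntimeException e) {
        _terminationCause = e;
        // Lets the searcher close all open per-leaf objects cleanly.
        throw new CollectionTerminatedException();
      }
    }

    void checkTermination(int docId) {
      if (docId >= 2) { // pretend the OOM killer fired after 2 docs
        throw new RuntimeException("Query terminated: heap usage exceeded");
      }
    }

    RuntimeException getTerminationCause() {
      return _terminationCause;
    }
  }

  // Stand-in for the index search path: the searcher swallows the termination
  // exception, so the wrapper re-surfaces the recorded cause afterwards.
  static String search(DocIdCollector collector, int numDocs) {
    try {
      for (int doc = 0; doc < numDocs; doc++) {
        collector.collect(doc);
      }
    } catch (CollectionTerminatedException swallowed) {
      // IndexSearcher would swallow this internally and never re-throw it.
    }
    RuntimeException cause = collector.getTerminationCause();
    return cause == null ? "completed" : cause.getMessage();
  }
}
```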
I see. Could you please add some more comments explaining this?
Prospective fix for #17877
Summary
- Fix missing CPU/memory tracking for realtime Lucene and HNSW index searcher threads
- Enable the OOM killer to properly terminate TEXT_MATCH and VECTOR_SIMILARITY queries
Changes
- `RealtimeLuceneTextIndex`: Propagate `QueryThreadContext` to async searcher threads for resource tracking
- `MultiColumnRealtimeLuceneTextIndex`: Same fix for the multi-column text index variant
- `MutableVectorIndex`: Convert synchronous Lucene search to an async pattern using `RealtimeLuceneTextIndexSearcherPool`, preventing FSDirectory corruption on thread interrupt while enabling proper resource tracking
- `LuceneDocIdCollector`/`HnswDocIdCollector`: Add periodic `checkTerminationAndSampleUsagePeriodically()` calls in collectors to detect OOM termination during document collection

Tests
- `RealtimeLuceneTextIndexResourceTrackingTest`: Unit tests verifying context propagation, thread registration, and termination handling
- `TextMatchOomKillingIntegrationTest`: Integration test validating OOM killing for TEXT_MATCH queries