🧠 Summary
celery-worker pods are being OOMKilled repeatedly due to excessive memory usage when loading data from very large PostgreSQL tables (lightrag_vdb_relation and lightrag_vdb_entity).
🐛 Problem Description
The celeryworker pod in the aperag deployment frequently restarts because of Out of Memory (OOMKilled) events.
After investigation, it appears that the worker is loading entire tables from PostgreSQL into memory during the execution of the query_lightrag_vdb_relation_all function.
These two tables have grown significantly. lightrag_vdb_relation alone is now larger than the pod's 4 GiB memory limit, and lightrag_vdb_entity is close to it:
postgres=# SELECT pg_size_pretty(pg_total_relation_size('lightrag_vdb_relation'));
 pg_size_pretty
----------------
 4228 MB
(1 row)

postgres=# SELECT pg_size_pretty(pg_total_relation_size('lightrag_vdb_entity'));
 pg_size_pretty
----------------
 3533 MB
(1 row)

📋 Environment Details
Pod:
Name: celeryworker-799fc9f787-65jmf
Namespace: default
Node: cn-hongkong.10.231.119.38
Status: Running (frequently OOMKilled)
RestartCount: 412
Memory Limit: 4Gi
Image: apecloud-registry.cn-zhangjiakou.cr.aliyuncs.com/apecloud/aperag:v0.7.0-alpha.5
Command: /opt/venv/bin/celery -A config.celery worker -l INFO --concurrency=16 -Q 10.231.119.38,celery --pool=threads
🔍 Investigation
Using py-spy to inspect the running worker process:
root@:~# py-spy dump -p 1956455 Process 1956455: /opt/venv/bin/python3 /opt/venv/bin/celery -A config.celery worker -l INFO --concurrency=16 -Q 10.231.119.38,celery --pool=threads Python v3.11.13 (/usr/local/bin/python3.11) Thread 11 (active): "MainThread" poll (kombu/utils/eventio.py:83) create_loop (kombu/asynchronous/hub.py:317) asynloop (celery/worker/loops.py:97) start (celery/worker/consumer/consumer.py:772) start (celery/bootsteps.py:116) start (celery/worker/consumer/consumer.py:341) start (celery/bootsteps.py:365) start (celery/bootsteps.py:116) start (celery/worker/worker.py:203) worker (celery/bin/worker.py:356) caller (celery/bin/base.py:135) new_func (click/decorators.py:33) invoke (click/core.py:788) invoke (click/core.py:1443) invoke (click/core.py:1697) main (click/core.py:1082) __call__ (click/core.py:1161) main (celery/bin/celery.py:231) main (celery/__main__.py:15) <module> (celery:10) Thread 30 (active): "ThreadPoolExecutor-3_0" _read_ready__get_buffer (asyncio/selector_events.py:974) _read_ready (asyncio/selector_events.py:956) _run (asyncio/events.py:84) _run_once (asyncio/base_events.py:1936) run_forever (asyncio/base_events.py:608) run_until_complete (asyncio/base_events.py:641) _run_in_new_loop (lightrag_manager.py:216) process_document_for_celery (lightrag_manager.py:118) create_index (aperag/tasks/document.py:114) create_index_task (config/celery_tasks.py:267) run (celery/app/autoretry.py:38) __protected_call__ (celery/app/trace.py:736) trace_task (celery/app/trace.py:453) fast_trace_task (celery/app/trace.py:651) apply_target (celery/concurrency/base.py:30) run (concurrent/futures/thread.py:58) _worker (concurrent/futures/thread.py:83) run (threading.py:982) _bootstrap_inner (threading.py:1045) _bootstrap (threading.py:1002) Thread 31 (active): "ThreadPoolExecutor-3_1" read (ssl.py:1168) recv (ssl.py:1295) read (httpcore/_backends/sync.py:128) _receive_event (httpcore/_sync/http11.py:217) _receive_response_headers 
(httpcore/_sync/http11.py:177) handle_request (httpcore/_sync/http11.py:106) handle_request (httpcore/_sync/connection.py:103) handle_request (httpcore/_sync/connection_pool.py:236) handle_request (httpx/_transports/default.py:250) _send_single_request (httpx/_client.py:1014) _send_handling_redirects (httpx/_client.py:979) _send_handling_auth (httpx/_client.py:942) send (httpx/_client.py:914) post (http_handler.py:761) _make_common_sync_call (llm_http_handler.py:175) completion (llm_http_handler.py:471) completion (litellm/main.py:2626) wrapper (litellm/utils.py:1219) _completion_core (aperag/llm/completion/completion_service.py:173) generate (aperag/llm/completion/completion_service.py:211) _summarize_text (aperag/index/summary_index.py:368) _generate_document_summary (aperag/index/summary_index.py:305) create_index (aperag/index/summary_index.py:75) update_index (aperag/index/summary_index.py:196) update_index (aperag/tasks/document.py:325) update_index_task (config/celery_tasks.py:393) run (celery/app/autoretry.py:38) __protected_call__ (celery/app/trace.py:736) trace_task (celery/app/trace.py:453) fast_trace_task (celery/app/trace.py:651) apply_target (celery/concurrency/base.py:30) run (concurrent/futures/thread.py:58) _worker (concurrent/futures/thread.py:83) run (threading.py:982) _bootstrap_inner (threading.py:1045) _bootstrap (threading.py:1002) Thread 32 (idle): "ThreadPoolExecutor-3_2" select (selectors.py:468) _run_once (asyncio/base_events.py:1898) run_forever (asyncio/base_events.py:608) run_until_complete (asyncio/base_events.py:641) _run_in_new_loop (lightrag_manager.py:216) delete_document_for_celery (lightrag_manager.py:126) delete_index (aperag/tasks/document.py:216) delete_index_task (config/celery_tasks.py:334) run (celery/app/autoretry.py:38) __protected_call__ (celery/app/trace.py:736) trace_task (celery/app/trace.py:453) fast_trace_task (celery/app/trace.py:651) apply_target (celery/concurrency/base.py:30) run 
(concurrent/futures/thread.py:58) _worker (concurrent/futures/thread.py:83) run (threading.py:982) _bootstrap_inner (threading.py:1045) _bootstrap (threading.py:1002) Thread 33 (active): "ThreadPoolExecutor-3_3" acquire (logging/__init__.py:927) handle (logging/__init__.py:976) callHandlers (logging/__init__.py:1706) handle (logging/__init__.py:1644) _log (logging/__init__.py:1634) info (logging/__init__.py:1489) cleanup_expired_documents (aperag/tasks/collection.py:211) reconcile_all (aperag/tasks/reconciler.py:649) cleanup_expired_documents_task (config/celery_tasks.py:841) __protected_call__ (celery/app/trace.py:736) trace_task (celery/app/trace.py:453) fast_trace_task (celery/app/trace.py:651) apply_target (celery/concurrency/base.py:30) run (concurrent/futures/thread.py:58) _worker (concurrent/futures/thread.py:83) run (threading.py:982) _bootstrap_inner (threading.py:1045) _bootstrap (threading.py:1002) Thread 34 (idle): "ThreadPoolExecutor-3_4" select (selectors.py:468) _run_once (asyncio/base_events.py:1898) run_forever (asyncio/base_events.py:608) run_until_complete (asyncio/base_events.py:641) _run_in_new_loop (lightrag_manager.py:216) delete_document_for_celery (lightrag_manager.py:126) delete_index (aperag/tasks/document.py:216) delete_index_task (config/celery_tasks.py:334) run (celery/app/autoretry.py:38) __protected_call__ (celery/app/trace.py:736) trace_task (celery/app/trace.py:453) fast_trace_task (celery/app/trace.py:651) apply_target (celery/concurrency/base.py:30) run (concurrent/futures/thread.py:58) _worker (concurrent/futures/thread.py:83) run (threading.py:982) _bootstrap_inner (threading.py:1045) _bootstrap (threading.py:1002) Thread 35 (active): "ThreadPoolExecutor-3_5" _read_ready__get_buffer (asyncio/selector_events.py:974) _read_ready (asyncio/selector_events.py:956) _run (asyncio/events.py:84) _run_once (asyncio/base_events.py:1936) run_forever (asyncio/base_events.py:608) run_until_complete (asyncio/base_events.py:641) 
_run_in_new_loop (lightrag_manager.py:216) process_document_for_celery (lightrag_manager.py:118) create_index (aperag/tasks/document.py:114) create_index_task (config/celery_tasks.py:267) run (celery/app/autoretry.py:38) __protected_call__ (celery/app/trace.py:736) trace_task (celery/app/trace.py:453) fast_trace_task (celery/app/trace.py:651) apply_target (celery/concurrency/base.py:30) run (concurrent/futures/thread.py:58) _worker (concurrent/futures/thread.py:83) run (threading.py:982) _bootstrap_inner (threading.py:1045) _bootstrap (threading.py:1002) Thread 36 (active): "ThreadPoolExecutor-3_6" read (ssl.py:1168) recv (ssl.py:1295) read (httpcore/_backends/sync.py:128) _receive_event (httpcore/_sync/http11.py:217) _receive_response_headers (httpcore/_sync/http11.py:177) handle_request (httpcore/_sync/http11.py:106) handle_request (httpcore/_sync/connection.py:103) handle_request (httpcore/_sync/connection_pool.py:236) handle_request (httpx/_transports/default.py:250) _send_single_request (httpx/_client.py:1014) _send_handling_redirects (httpx/_client.py:979) _send_handling_auth (httpx/_client.py:942) send (httpx/_client.py:914) post (http_handler.py:761) _make_common_sync_call (llm_http_handler.py:175) completion (llm_http_handler.py:471) completion (litellm/main.py:2626) wrapper (litellm/utils.py:1219) _completion_core (aperag/llm/completion/completion_service.py:173) generate (aperag/llm/completion/completion_service.py:211) _summarize_text (aperag/index/summary_index.py:368) _generate_document_summary (aperag/index/summary_index.py:305) create_index (aperag/index/summary_index.py:75) create_index (aperag/tasks/document.py:135) create_index_task (config/celery_tasks.py:267) run (celery/app/autoretry.py:38) __protected_call__ (celery/app/trace.py:736) trace_task (celery/app/trace.py:453) fast_trace_task (celery/app/trace.py:651) apply_target (celery/concurrency/base.py:30) run (concurrent/futures/thread.py:58) _worker (concurrent/futures/thread.py:83) run 
(threading.py:982) _bootstrap_inner (threading.py:1045) _bootstrap (threading.py:1002) Thread 37 (idle): "ThreadPoolExecutor-3_7" select (selectors.py:468) _run_once (asyncio/base_events.py:1898) run_forever (asyncio/base_events.py:608) run_until_complete (asyncio/base_events.py:641) _run_in_new_loop (lightrag_manager.py:216) process_document_for_celery (lightrag_manager.py:118) create_index (aperag/tasks/document.py:114) create_index_task (config/celery_tasks.py:267) run (celery/app/autoretry.py:38) __protected_call__ (celery/app/trace.py:736) trace_task (celery/app/trace.py:453) fast_trace_task (celery/app/trace.py:651) apply_target (celery/concurrency/base.py:30) run (concurrent/futures/thread.py:58) _worker (concurrent/futures/thread.py:83) run (threading.py:982) _bootstrap_inner (threading.py:1045) _bootstrap (threading.py:1002) Thread 38 (idle): "ThreadPoolExecutor-3_8" select (selectors.py:468) _run_once (asyncio/base_events.py:1898) run_forever (asyncio/base_events.py:608) run_until_complete (asyncio/base_events.py:641) _run_in_new_loop (lightrag_manager.py:216) delete_document_for_celery (lightrag_manager.py:126) delete_index (aperag/tasks/document.py:216) delete_index_task (config/celery_tasks.py:334) run (celery/app/autoretry.py:38) __protected_call__ (celery/app/trace.py:736) trace_task (celery/app/trace.py:453) fast_trace_task (celery/app/trace.py:651) apply_target (celery/concurrency/base.py:30) run (concurrent/futures/thread.py:58) _worker (concurrent/futures/thread.py:83) run (threading.py:982) _bootstrap_inner (threading.py:1045) _bootstrap (threading.py:1002) Thread 39 (active): "ThreadPoolExecutor-3_9" read (ssl.py:1168) recv (ssl.py:1295) read (httpcore/_backends/sync.py:128) _receive_event (httpcore/_sync/http11.py:217) _receive_response_headers (httpcore/_sync/http11.py:177) handle_request (httpcore/_sync/http11.py:106) handle_request (httpcore/_sync/connection.py:103) handle_request (httpcore/_sync/connection_pool.py:236) handle_request 
(httpx/_transports/default.py:250) _send_single_request (httpx/_client.py:1014) _send_handling_redirects (httpx/_client.py:979) _send_handling_auth (httpx/_client.py:942) send (httpx/_client.py:914) post (http_handler.py:761) _make_common_sync_call (llm_http_handler.py:175) completion (llm_http_handler.py:471) completion (litellm/main.py:2626) wrapper (litellm/utils.py:1219) _completion_core (aperag/llm/completion/completion_service.py:173) generate (aperag/llm/completion/completion_service.py:211) _summarize_text (aperag/index/summary_index.py:368) _generate_document_summary (aperag/index/summary_index.py:305) create_index (aperag/index/summary_index.py:75) create_index (aperag/tasks/document.py:135) create_index_task (config/celery_tasks.py:267) run (celery/app/autoretry.py:38) __protected_call__ (celery/app/trace.py:736) trace_task (celery/app/trace.py:453) fast_trace_task (celery/app/trace.py:651) apply_target (celery/concurrency/base.py:30) run (concurrent/futures/thread.py:58) _worker (concurrent/futures/thread.py:83) run (threading.py:982) _bootstrap_inner (threading.py:1045) _bootstrap (threading.py:1002) Thread 40 (idle): "ThreadPoolExecutor-3_10" select (selectors.py:468) _run_once (asyncio/base_events.py:1898) run_forever (asyncio/base_events.py:608) run_until_complete (asyncio/base_events.py:641) _run_in_new_loop (lightrag_manager.py:216) delete_document_for_celery (lightrag_manager.py:126) delete_index (aperag/tasks/document.py:216) delete_index_task (config/celery_tasks.py:334) run (celery/app/autoretry.py:38) __protected_call__ (celery/app/trace.py:736) trace_task (celery/app/trace.py:453) fast_trace_task (celery/app/trace.py:651) apply_target (celery/concurrency/base.py:30) run (concurrent/futures/thread.py:58) _worker (concurrent/futures/thread.py:83) run (threading.py:982) _bootstrap_inner (threading.py:1045) _bootstrap (threading.py:1002) Thread 41 (active): "ThreadPoolExecutor-3_11" read (ssl.py:1168) recv (ssl.py:1295) read 
(httpcore/_backends/sync.py:128) _receive_event (httpcore/_sync/http11.py:217) _receive_response_headers (httpcore/_sync/http11.py:177) handle_request (httpcore/_sync/http11.py:106) handle_request (httpcore/_sync/connection.py:103) handle_request (httpcore/_sync/connection_pool.py:236) handle_request (httpx/_transports/default.py:250) _send_single_request (httpx/_client.py:1014) _send_handling_redirects (httpx/_client.py:979) _send_handling_auth (httpx/_client.py:942) send (httpx/_client.py:914) post (http_handler.py:761) _make_common_sync_call (llm_http_handler.py:175) completion (llm_http_handler.py:471) completion (litellm/main.py:2626) wrapper (litellm/utils.py:1219) _completion_core (aperag/llm/completion/completion_service.py:173) generate (aperag/llm/completion/completion_service.py:211) _summarize_text (aperag/index/summary_index.py:368) _generate_document_summary (aperag/index/summary_index.py:305) create_index (aperag/index/summary_index.py:75) create_index (aperag/tasks/document.py:135) create_index_task (config/celery_tasks.py:267) run (celery/app/autoretry.py:38) __protected_call__ (celery/app/trace.py:736) trace_task (celery/app/trace.py:453) fast_trace_task (celery/app/trace.py:651) apply_target (celery/concurrency/base.py:30) run (concurrent/futures/thread.py:58) _worker (concurrent/futures/thread.py:83) run (threading.py:982) _bootstrap_inner (threading.py:1045) _bootstrap (threading.py:1002) Thread 42 (idle): "ThreadPoolExecutor-3_12" select (selectors.py:468) _run_once (asyncio/base_events.py:1898) run_forever (asyncio/base_events.py:608) run_until_complete (asyncio/base_events.py:641) _run_in_new_loop (lightrag_manager.py:216) delete_document_for_celery (lightrag_manager.py:126) delete_index (aperag/tasks/document.py:216) delete_index_task (config/celery_tasks.py:334) run (celery/app/autoretry.py:38) __protected_call__ (celery/app/trace.py:736) trace_task (celery/app/trace.py:453) fast_trace_task (celery/app/trace.py:651) apply_target 
(celery/concurrency/base.py:30) run (concurrent/futures/thread.py:58) _worker (concurrent/futures/thread.py:83) run (threading.py:982) _bootstrap_inner (threading.py:1045) _bootstrap (threading.py:1002) Thread 43 (active): "ThreadPoolExecutor-3_13" emit (logging/__init__.py:1113) handle (logging/__init__.py:978) callHandlers (logging/__init__.py:1706) handle (logging/__init__.py:1644) _log (logging/__init__.py:1634) info (logging/__init__.py:1489) cleanup_expired_documents (aperag/tasks/collection.py:211) reconcile_all (aperag/tasks/reconciler.py:649) cleanup_expired_documents_task (config/celery_tasks.py:841) __protected_call__ (celery/app/trace.py:736) trace_task (celery/app/trace.py:453) fast_trace_task (celery/app/trace.py:651) apply_target (celery/concurrency/base.py:30) run (concurrent/futures/thread.py:58) _worker (concurrent/futures/thread.py:83) run (threading.py:982) _bootstrap_inner (threading.py:1045) _bootstrap (threading.py:1002) Thread 44 (idle): "ThreadPoolExecutor-3_14" select (selectors.py:468) _run_once (asyncio/base_events.py:1898) run_forever (asyncio/base_events.py:608) run_until_complete (asyncio/base_events.py:641) _run_in_new_loop (lightrag_manager.py:216) process_document_for_celery (lightrag_manager.py:118) create_index (aperag/tasks/document.py:114) create_index_task (config/celery_tasks.py:267) run (celery/app/autoretry.py:38) __protected_call__ (celery/app/trace.py:736) trace_task (celery/app/trace.py:453) fast_trace_task (celery/app/trace.py:651) apply_target (celery/concurrency/base.py:30) run (concurrent/futures/thread.py:58) _worker (concurrent/futures/thread.py:83) run (threading.py:982) _bootstrap_inner (threading.py:1045) _bootstrap (threading.py:1002) Thread 45 (idle): "ThreadPoolExecutor-3_15" select (selectors.py:468) _run_once (asyncio/base_events.py:1898) run_forever (asyncio/base_events.py:608) run_until_complete (asyncio/base_events.py:641) _run_in_new_loop (lightrag_manager.py:216) delete_document_for_celery 
(lightrag_manager.py:126) delete_index (aperag/tasks/document.py:216) delete_index_task (config/celery_tasks.py:334) run (celery/app/autoretry.py:38) __protected_call__ (celery/app/trace.py:736) trace_task (celery/app/trace.py:453) fast_trace_task (celery/app/trace.py:651) apply_target (celery/concurrency/base.py:30) run (concurrent/futures/thread.py:58) _worker (concurrent/futures/thread.py:83) run (threading.py:982) _bootstrap_inner (threading.py:1045) _bootstrap (threading.py:1002) Thread 46 (active): "ThreadPoolExecutor-1_0" _worker (concurrent/futures/thread.py:81) run (threading.py:982) _bootstrap_inner (threading.py:1045) _bootstrap (threading.py:1002) Thread 3476 (active): "asyncio_0" _worker (concurrent/futures/thread.py:81) run (threading.py:982) _bootstrap_inner (threading.py:1045) _bootstrap (threading.py:1002) Thread 3478 (active): "asyncio_0" _worker (concurrent/futures/thread.py:81) run (threading.py:982) _bootstrap_inner (threading.py:1045) _bootstrap (threading.py:1002) Thread 3480 (active): "asyncio_0" do_execute (sqlalchemy/engine/default.py:945) _exec_single_context (sqlalchemy/engine/base.py:1964) _execute_context (sqlalchemy/engine/base.py:1843) _execute_clauseelement (sqlalchemy/engine/base.py:1638) _execute_on_connection (sqlalchemy/sql/elements.py:523) execute (sqlalchemy/engine/base.py:1416) orm_execute_statement (sqlalchemy/orm/bulk_persistence.py:1294) _execute_internal (sqlalchemy/orm/session.py:2251) execute (sqlalchemy/orm/session.py:2365) _upsert_node (aperag/db/repositories/graph.py:65) _execute_transaction (aperag/db/repositories/base.py:65) upsert_graph_node (aperag/db/repositories/graph.py:69) _sync_upsert_node (lightrag/kg/pg_ops_sync_graph_storage.py:68) run (concurrent/futures/thread.py:58) _worker (concurrent/futures/thread.py:83) run (threading.py:982) _bootstrap_inner (threading.py:1045) _bootstrap (threading.py:1002) Thread 3482 (active): "asyncio_0" _worker (concurrent/futures/thread.py:81) run (threading.py:982) 
_bootstrap_inner (threading.py:1045) _bootstrap (threading.py:1002) Thread 3484 (active): "asyncio_0" do_execute (sqlalchemy/engine/default.py:945) _exec_single_context (sqlalchemy/engine/base.py:1964) _execute_context (sqlalchemy/engine/base.py:1843) _execute_clauseelement (sqlalchemy/engine/base.py:1638) _execute_on_connection (sqlalchemy/sql/elements.py:523) execute (sqlalchemy/engine/base.py:1416) orm_execute_statement (sqlalchemy/orm/context.py:306) _execute_internal (sqlalchemy/orm/session.py:2251) execute (sqlalchemy/orm/session.py:2365) _query (aperag/db/repositories/lightrag.py:576) _execute_query (aperag/db/repositories/base.py:55) query_lightrag_vdb_relation_all (aperag/db/repositories/lightrag.py:579) _sync_get_all (lightrag/kg/pg_ops_sync_vector_storage.py:100) run (concurrent/futures/thread.py:58) _worker (concurrent/futures/thread.py:83) run (threading.py:982) _bootstrap_inner (threading.py:1045) _bootstrap (threading.py:1002) Thread 3485 (active): "asyncio_0" do_execute (sqlalchemy/engine/default.py:945) _exec_single_context (sqlalchemy/engine/base.py:1964) _execute_context (sqlalchemy/engine/base.py:1843) _execute_clauseelement (sqlalchemy/engine/base.py:1638) _execute_on_connection (sqlalchemy/sql/elements.py:523) execute (sqlalchemy/engine/base.py:1416) orm_execute_statement (sqlalchemy/orm/context.py:306) _execute_internal (sqlalchemy/orm/session.py:2251) execute (sqlalchemy/orm/session.py:2365) _query (aperag/db/repositories/lightrag.py:576) _execute_query (aperag/db/repositories/base.py:55) query_lightrag_vdb_relation_all (aperag/db/repositories/lightrag.py:579) _sync_get_all (lightrag/kg/pg_ops_sync_vector_storage.py:100) run (concurrent/futures/thread.py:58) _worker (concurrent/futures/thread.py:83) run (threading.py:982) _bootstrap_inner (threading.py:1045) _bootstrap (threading.py:1002) Thread 3486 (active): "asyncio_0" <listcomp> (sqlalchemy/orm/loading.py:223) chunks (sqlalchemy/orm/loading.py:223) _fetchall_impl 
(sqlalchemy/engine/result.py:2268) _fetchall_impl (sqlalchemy/engine/result.py:1674) _allrows (sqlalchemy/engine/result.py:548) all (sqlalchemy/engine/result.py:1767) _query (aperag/db/repositories/lightrag.py:577) _execute_query (aperag/db/repositories/base.py:55) query_lightrag_vdb_relation_all (aperag/db/repositories/lightrag.py:579) _sync_get_all (lightrag/kg/pg_ops_sync_vector_storage.py:100) run (concurrent/futures/thread.py:58) _worker (concurrent/futures/thread.py:83) run (threading.py:982) _bootstrap_inner (threading.py:1045) _bootstrap (threading.py:1002) Thread 3487 (active+gil): "asyncio_0" fetchall (sqlalchemy/engine/cursor.py:1137) _fetchall_impl (sqlalchemy/engine/cursor.py:2135) _raw_all_rows (sqlalchemy/engine/result.py:540) chunks (sqlalchemy/orm/loading.py:219) _fetchall_impl (sqlalchemy/engine/result.py:2268) _fetchall_impl (sqlalchemy/engine/result.py:1674) _allrows (sqlalchemy/engine/result.py:548) all (sqlalchemy/engine/result.py:1767) _query (aperag/db/repositories/lightrag.py:577) _execute_query (aperag/db/repositories/base.py:55) query_lightrag_vdb_relation_all (aperag/db/repositories/lightrag.py:579) _sync_get_all (lightrag/kg/pg_ops_sync_vector_storage.py:100) run (concurrent/futures/thread.py:58) _worker (concurrent/futures/thread.py:83) run (threading.py:982) _bootstrap_inner (threading.py:1045) _bootstrap (threading.py:1002) Thread 3488 (active): "asyncio_0" do_execute (sqlalchemy/engine/default.py:945) _exec_single_context (sqlalchemy/engine/base.py:1964) _execute_context (sqlalchemy/engine/base.py:1843) _execute_clauseelement (sqlalchemy/engine/base.py:1638) _execute_on_connection (sqlalchemy/sql/elements.py:523) execute (sqlalchemy/engine/base.py:1416) orm_execute_statement (sqlalchemy/orm/context.py:306) _execute_internal (sqlalchemy/orm/session.py:2251) execute (sqlalchemy/orm/session.py:2365) _query (aperag/db/repositories/lightrag.py:576) _execute_query (aperag/db/repositories/base.py:55) query_lightrag_vdb_relation_all 
(aperag/db/repositories/lightrag.py:579) _sync_get_all (lightrag/kg/pg_ops_sync_vector_storage.py:100) run (concurrent/futures/thread.py:58) _worker (concurrent/futures/thread.py:83) run (threading.py:982) _bootstrap_inner (threading.py:1045) _bootstrap (threading.py:1002) Thread 3489 (active): "asyncio_0" do_execute (sqlalchemy/engine/default.py:945) _exec_single_context (sqlalchemy/engine/base.py:1964) _execute_context (sqlalchemy/engine/base.py:1843) _execute_clauseelement (sqlalchemy/engine/base.py:1638) _execute_on_connection (sqlalchemy/sql/elements.py:523) execute (sqlalchemy/engine/base.py:1416) orm_execute_statement (sqlalchemy/orm/context.py:306) _execute_internal (sqlalchemy/orm/session.py:2251) execute (sqlalchemy/orm/session.py:2365) _query (aperag/db/repositories/lightrag.py:576) _execute_query (aperag/db/repositories/base.py:55) query_lightrag_vdb_relation_all (aperag/db/repositories/lightrag.py:579) _sync_get_all (lightrag/kg/pg_ops_sync_vector_storage.py:100) run (concurrent/futures/thread.py:58) _worker (concurrent/futures/thread.py:83) run (threading.py:982) _bootstrap_inner (threading.py:1045) _bootstrap (threading.py:1002) Thread 3490 (active): "asyncio_1" _worker (concurrent/futures/thread.py:81) run (threading.py:982) _bootstrap_inner (threading.py:1045) _bootstrap (threading.py:1002) Thread 3491 (active): "asyncio_1" _worker (concurrent/futures/thread.py:81) run (threading.py:982) _bootstrap_inner (threading.py:1045) _bootstrap (threading.py:1002) Thread 3492 (active): "asyncio_1" _worker (concurrent/futures/thread.py:81) run (threading.py:982) _bootstrap_inner (threading.py:1045) _bootstrap (threading.py:1002) Thread 3493 (active): "asyncio_1" _worker (concurrent/futures/thread.py:81) run (threading.py:982) _bootstrap_inner (threading.py:1045) _bootstrap (threading.py:1002) Thread 3497 (active): "ThreadPoolExecutor-1_1" _worker (concurrent/futures/thread.py:81) run (threading.py:982) _bootstrap_inner (threading.py:1045) _bootstrap 
(threading.py:1002) Thread 3498 (active): "ThreadPoolExecutor-1_2" _worker (concurrent/futures/thread.py:81) run (threading.py:982) _bootstrap_inner (threading.py:1045) _bootstrap (threading.py:1002) Thread 3499 (active): "ThreadPoolExecutor-1_3" _worker (concurrent/futures/thread.py:81) run (threading.py:982) _bootstrap_inner (threading.py:1045) _bootstrap (threading.py:1002)
The stack traces show six threads (3484, 3485, 3486, 3487, 3488 and 3489) inside the same function at the same time:
query_lightrag_vdb_relation_all
Each call attempts to load all records from lightrag_vdb_relation (the traces pass through Result.all() and the cursor's fetchall()), so several full copies of the result set can be held in memory at once, causing high memory pressure.
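Counting how many threads sit in a given function is tedious to do by eye in a dump of this size. The following is a minimal sketch of a helper that does the count; the function name `threads_calling` and the abridged `SAMPLE` text are illustrative and not part of aperag or py-spy. It assumes the `Thread <id> (<state>): "<name>"` header format seen in the dump above.

```python
import re

# Header format as it appears in the py-spy dump above,
# e.g.: Thread 3487 (active+gil): "asyncio_0"
THREAD_HEADER = re.compile(r'Thread (\S+) \(([^)]*)\): "([^"]*)"')


def threads_calling(dump: str, function_name: str) -> list[str]:
    """Return the IDs of threads whose stack contains a frame named `function_name`."""
    headers = list(THREAD_HEADER.finditer(dump))
    # A frame is printed as "<function> (<file>:<line>)"; the lookbehind
    # prevents matching a longer name that merely ends with `function_name`.
    frame = re.compile(r"(?<![\w.])" + re.escape(function_name) + r" \(")
    hits = []
    for i, header in enumerate(headers):
        end = headers[i + 1].start() if i + 1 < len(headers) else len(dump)
        stack = dump[header.end():end]
        if frame.search(stack):
            hits.append(header.group(1))
    return hits


# Abridged excerpt of three threads from the dump above (frames omitted for brevity).
SAMPLE = '''Thread 3486 (active): "asyncio_0"
    all (sqlalchemy/engine/result.py:1767)
    _query (aperag/db/repositories/lightrag.py:577)
    query_lightrag_vdb_relation_all (aperag/db/repositories/lightrag.py:579)
Thread 3487 (active+gil): "asyncio_0"
    fetchall (sqlalchemy/engine/cursor.py:1137)
    query_lightrag_vdb_relation_all (aperag/db/repositories/lightrag.py:579)
Thread 3490 (active): "asyncio_1"
    _worker (concurrent/futures/thread.py:81)
'''

print(threads_calling(SAMPLE, "query_lightrag_vdb_relation_all"))  # ['3486', '3487']
```

Because the helper splits on thread headers rather than on line breaks, it should also work on a dump that has been pasted as a single line.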
💥 Root Cause
- The tables lightrag_vdb_relation and lightrag_vdb_entity are very large (about 4.1 GB and 3.5 GB respectively).
- The query_lightrag_vdb_relation_all function does not paginate or stream results.
- As a result, the entire dataset is loaded into memory, exceeding the Celery worker's 4 GiB memory limit and triggering OOMKilled events.
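To put rough numbers on this, here is a back-of-the-envelope estimate using only figures reported in this issue. It assumes, purely for illustration, that one full fetch occupies about as much memory as the table's on-disk size. That is a loose proxy: pg_total_relation_size includes indexes and TOAST data, while Python objects add their own overhead. The variable names are hypothetical.

```python
# Figures taken from this issue.
relation_table_mib = 4228      # pg_total_relation_size('lightrag_vdb_relation')
memory_limit_mib = 4 * 1024    # pod memory limit of 4Gi
worker_threads = 16            # celery --concurrency=16 --pool=threads
concurrent_calls = 6           # threads 3484-3489 in the py-spy dump


def worst_case_mib(table_mib: int, calls: int) -> int:
    """Memory needed if every concurrent call holds one full copy of the table."""
    return table_mib * calls


# ASSUMPTION: in-memory size of one full fetch ~= on-disk table size.
per_thread_budget = memory_limit_mib // worker_threads
needed = worst_case_mib(relation_table_mib, concurrent_calls)

print(f"even share per worker thread: {per_thread_budget} MiB")
print(f"one full fetch:               {relation_table_mib} MiB")
print(f"six concurrent full fetches:  {needed} MiB "
      f"({needed / memory_limit_mib:.1f}x the pod limit)")
```

Under that assumption a single call already exceeds the whole pod limit, and an even split of the limit across 16 worker threads leaves only 256 MiB per task.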
✅ Suggested Fixes / Improvements
- Implement pagination or streaming queries when fetching relation/entity data.
  Example: use LIMIT/OFFSET or a server-side cursor (in psycopg2, a named cursor: conn.cursor(name='cursor_name', cursor_factory=psycopg2.extras.DictCursor)).
- Avoid loading entire tables into memory for processing; process items in batches.
- Consider offloading large queries to background jobs with higher memory limits, or moving the data aggregation logic into SQL.
- Optionally, increase the worker memory limit temporarily as a mitigation (from 4Gi to 8Gi), but the core issue is unbounded memory usage.
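The batching idea can be sketched as follows. This is not aperag code: it uses the standard-library sqlite3 module as a stand-in so that it runs without a PostgreSQL server, the columns `id` and `content` are hypothetical, and `iter_batches` is an illustrative name. It also uses keyset pagination (`WHERE id > last_id ORDER BY id LIMIT n`) rather than the LIMIT/OFFSET mentioned above, since OFFSET makes the database walk past all skipped rows on every page.

```python
import sqlite3
from typing import Iterator


def iter_batches(conn: sqlite3.Connection, batch_size: int = 1000) -> Iterator[list[tuple]]:
    """Yield rows in primary-key order, holding at most `batch_size` rows at a time."""
    last_id = None  # highest id already yielded; None before the first batch
    while True:
        if last_id is None:
            cursor = conn.execute(
                "SELECT id, content FROM lightrag_vdb_relation "
                "ORDER BY id LIMIT ?",
                (batch_size,),
            )
        else:
            cursor = conn.execute(
                "SELECT id, content FROM lightrag_vdb_relation "
                "WHERE id > ? ORDER BY id LIMIT ?",
                (last_id, batch_size),
            )
        rows = cursor.fetchall()  # bounded by LIMIT, so this stays small
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]


# Demo with an in-memory table of 2500 hypothetical rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lightrag_vdb_relation (id INTEGER PRIMARY KEY, content TEXT)")
conn.executemany(
    "INSERT INTO lightrag_vdb_relation (id, content) VALUES (?, ?)",
    [(i, f"relation-{i}") for i in range(1, 2501)],
)

batch_sizes = [len(batch) for batch in iter_batches(conn, batch_size=1000)]
print(batch_sizes)  # [1000, 1000, 500]
```

With psycopg2 the placeholders would be `%s` instead of `?`. Since the stack traces show the query going through SQLAlchemy, its `yield_per` execution option together with `Result.partitions()` may be a closer fit for the actual fix, but that depends on how the repository code consumes the rows.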
📈 Impact
- Frequent Celery worker restarts.
- Tasks interrupted or retried repeatedly.
- Reduced system reliability and performance.