
I am building a celery + django + selenium application. I am running selenium-based browsers in separate processes with the help of celery. Versions:

```
celery==5.2.6
redis==3.4.1
selenium-wire==5.1.0
Django==4.0.4
djangorestframework==3.13.1
```

I found out that after several hours the application generates thousands of zombie processes. I also found that the problem is related to the celery docker container, because after sudo /usr/local/bin/docker-compose -f /data/new_app/docker-compose.yml restart celery I have 0 zombie processes.

My code

```python
from rest_framework.decorators import api_view

@api_view(['POST'])
def periodic_check_all_urls(request):
    # web-service endpoint
    ...
    check_urls.delay(parsing_results_ids)  # call celery task
```

Celery task code

```python
from celery import shared_task

@shared_task()
def check_urls(parsing_result_ids: List[int]):
    """
    Run Selenium-based parser;
    the parser extracts data and saves it in the database
    """
    try:
        logger.info(f"{datetime.now()} Start check_urls")
        parser = Parser()  # open selenium browser
        parsing_results = ParsingResult.objects.filter(pk__in=parsing_result_ids).exclude(status__in=["DONE", "FAILED"])
        parser.check_parsing_result(parsing_results)
    except Exception as e:
        full_trace = traceback.format_exc()
    finally:
        if 'parser' in locals():
            parser.stop()
```

Selenium browser stop function and destructor

```python
class Parser():
    def __init__(self):
        """ Prepare parser """
        if not USE_GUI:
            self.display = Display(visible=0, size=(800, 600))
            self.display.start()
        """
        Replaced with FireFox
        self.driver = get_chromedriver(proxy_data)
        """
        proxy_data = {
            ...
        }
        self.driver = get_firefox_driver(proxy_data=proxy_data)

    def __del__(self):
        self.stop()

    def stop(self):
        try:
            self.driver.quit()
            logger.info("Selenium driver closed")
        except:
            pass
        try:
            self.display.stop()
            logger.info("Display stopped")
        except:
            pass
```

I also tried several settings to limit celery task resources and running time (they didn't help with the zombie processes).

My celery settings in Django settings.py

```python
# celery settings (documents generation)
CELERY_BROKER_URL = os.environ.get("CELERY_BROKER", "redis://redis:6379/0")
CELERY_RESULT_BACKEND = os.environ.get("CELERY_BROKER", "redis://redis:6379/0")
CELERY_IMPORTS = ("core_app.celery",)
CELERY_TASK_TIME_LIMIT = 10 * 60
```

My celery service in docker-compose

```yaml
celery:
  build: ./project
  command: celery -A core_app worker --loglevel=info --concurrency=15 --max-memory-per-child=1000000
  volumes:
    - ./project:/usr/src/app
    - ./project/media:/project/media
    - ./project/logs:/project/logs
  env_file:
    - .env
  environment:
    # environment variables declared in the environment section override env_file
    - DJANGO_ALLOWED_HOSTS=localhost 127.0.0.1 [::1]
    - CELERY_BROKER=redis://redis:6379/0
    - CELERY_BACKEND=redis://redis:6379/0
  depends_on:
    - django
    - redis
```

I read "Django/Celery - How to kill a celery task?" but it didn't help.

I also read "Celery revoke leaving zombie ffmpeg process", but my task already contains try/except.

Example of zombie processes

```
ps aux | grep 'Z'
root     32448  0.0  0.0      0     0 ?        Z    13:45   0:00 [Utility Process] <defunct>
root     32449  0.0  0.0      0     0 ?        Z    13:09   0:00 [Utility Process] <defunct>
root     32450  0.0  0.0      0     0 ?        Z    11:13   0:00 [sh] <defunct>
root     32451  0.0  0.0      0     0 ?        Z    13:44   0:00 [Utility Process] <defunct>
root     32452  0.0  0.0      0     0 ?        Z    10:12   0:00 [Utility Process] <defunct>
root     32453  0.0  0.0      0     0 ?        Z    09:52   0:00 [sh] <defunct>
root     32454  0.0  0.0      0     0 ?        Z    10:40   0:00 [Utility Process] <defunct>
root     32455  0.0  0.0      0     0 ?        Z    09:52   0:00 [Utility Process] <defunct>
root     32456  0.0  0.0      0     0 ?        Z    10:13   0:00 [sh] <defunct>
root     32457  0.0  0.0      0     0 ?        Z    10:51   0:00 [Utility Process] <defunct>
root     32459  0.0  0.0      0     0 ?        Z    14:01   0:00 [Utility Process] <defunct>
root     32460  0.0  0.0      0     0 ?        Z    13:16   0:00 [Utility Process] <defunct>
root     32461  0.0  0.0      0     0 ?        Z    10:40   0:00 [Utility Process] <defunct>
root     32462  0.0  0.0      0     0 ?        Z    10:12   0:00 [Utility Process] <defunct>
```
  • Did you try the solution of calling return after parser.stop() in your Celery task code from the 2nd last link? Also I don't think limiting task resources would prevent zombie processes? Commented Dec 16, 2024 at 4:08
  • @keventhen4 will try to call return, but why do you think that can help? I think the end of the function is equivalent to return None Commented Dec 16, 2024 at 11:26
  • Sorry, I read it wrong, return is just for making sure the process has ended, and default end of function does indeed seem to return None. I believe Celery has a wait() equivalent to get rid of zombie processes? Solution is probably similar to this: stackoverflow.com/q/2760652/16169432 Commented Dec 16, 2024 at 17:13
  • Please modify your stop method to match @Apex862-2's answer, and show us the log data of the traceback Commented Dec 19, 2024 at 23:33

2 Answers


Use time_limit and soft_time_limit

You have already set CELERY_TASK_TIME_LIMIT, but it can be beneficial to also use soft_time_limit. When the soft limit expires, Celery raises a SoftTimeLimitExceeded exception inside the task, which you can catch to clean up resources before the task is forcefully terminated at the hard time_limit.

Here’s how you can set both:

```python
from celery.exceptions import SoftTimeLimitExceeded

@shared_task(soft_time_limit=600, time_limit=650)
def check_urls(parsing_result_ids: List[int]):
    try:
        logger.info(f"{datetime.now()} Start check_urls")
        parser = Parser()  # Open selenium browser
        parsing_results = ParsingResult.objects.filter(pk__in=parsing_result_ids).exclude(status__in=["DONE", "FAILED"])
        parser.check_parsing_result(parsing_results)
    except SoftTimeLimitExceeded:
        logger.warning("Task exceeded soft time limit, cleaning up resources.")
    except Exception as e:
        full_trace = traceback.format_exc()
        logger.error(f"Error occurred: {full_trace}")
    finally:
        if 'parser' in locals():
            parser.stop()
```

Ensure All Selenium Processes Are Cleaned Up

Make sure all subprocesses, including the Selenium driver and the virtual display's X server (in headless mode), are correctly stopped. If necessary, this can involve killing processes explicitly. For instance:

```python
import psutil
import os

class Parser():
    def __init__(self):
        if not USE_GUI:
            self.display = Display(visible=0, size=(800, 600))
            self.display.start()
        self.driver = get_firefox_driver(proxy_data=proxy_data)

    def stop(self):
        try:
            self.driver.quit()
            logger.info("Selenium driver closed")
        except Exception as e:
            logger.error(f"Error closing driver: {e}")
        try:
            self.display.stop()
            logger.info("Display stopped")
        except Exception as e:
            logger.error(f"Error stopping display: {e}")
        # Clean up any remaining subprocesses (especially related to Selenium)
        self.cleanup_selenium_processes()

    def cleanup_selenium_processes(self):
        # Check for any lingering Selenium processes
        for proc in psutil.process_iter(attrs=['pid', 'name']):
            try:
                if 'selenium' in proc.info['name'].lower():
                    logger.info(f"Killing zombie process: {proc.info['pid']}")
                    proc.terminate()
            except (psutil.NoSuchProcess, psutil.AccessDenied, psutil.ZombieProcess):
                pass
```

To summarize:

  • Implement soft_time_limit and time_limit for task termination.
  • Ensure that all Selenium resources are released (including driver and display).
  • Use psutil to clean up lingering processes.
  • Configure Docker memory limits and restart policies.
  • Use --max-tasks-per-child to automatically restart worker processes (a settings sketch follows below).
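For reference, here is a minimal sketch of how the time limits and worker recycling could be expressed in the Django settings.py, assuming the Celery app loads its configuration with the usual CELERY_ namespace (which the already-working CELERY_TASK_TIME_LIMIT suggests); the concrete values are illustrative only:

```python
# settings.py -- illustrative values, adjust to your workload
CELERY_TASK_SOFT_TIME_LIMIT = 10 * 60           # raises SoftTimeLimitExceeded inside the task
CELERY_TASK_TIME_LIMIT = 11 * 60                # hard kill a little after the soft limit
CELERY_WORKER_MAX_TASKS_PER_CHILD = 50          # recycle each worker process after 50 tasks
CELERY_WORKER_MAX_MEMORY_PER_CHILD = 1_000_000  # in KiB; worker is replaced after the current task
```

These map to Celery's task_soft_time_limit, task_time_limit, worker_max_tasks_per_child and worker_max_memory_per_child options; the last two are equivalent to the --max-tasks-per-child / --max-memory-per-child flags on the worker command line, one of which is already used in the docker-compose command.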


4 Comments

  • Added time limits and cleanup_selenium_processes but it didn't help. I see 27000 zombie processes
  • The child process (Selenium WebDriver, Xvfb, etc.) terminates but the parent (Celery worker) does not wait() for it. Celery workers are not properly handling task failures, crashes, or force-killed processes. The Celery worker itself is killed or restarted without properly cleaning up child processes.
  • Try terminating Selenium and Xvfb using subprocess.Popen tracking (see the sketch after these comments)
  • damn auto correct
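As a rough sketch of the "subprocess.Popen tracking" idea from the comment above (this is not from the original post: stop_and_reap is a hypothetical helper, and it assumes Selenium exposes the driver service's Popen handle as driver.service.process):

```python
import os

def stop_and_reap(driver):
    # grab the geckodriver Popen handle before quitting, if Selenium exposes it
    service_proc = getattr(driver.service, "process", None)
    driver.quit()
    if service_proc is not None:
        try:
            service_proc.wait(timeout=10)  # explicitly reap the driver process
        except Exception:
            pass
    # reap any other children that have already exited (browser helpers, sh wrappers, ...)
    try:
        while True:
            pid, _status = os.waitpid(-1, os.WNOHANG)
            if pid == 0:
                break
    except ChildProcessError:
        pass
```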

I'd start by turning the Parser class into a context manager:

```python
class Parser():
    def __init__(self):
        self.display = Display(visible=0, size=(800, 600))
        self.display.start()
        self.driver = get_firefox_driver(proxy_data={})

    def __enter__(self):
        # return the Parser itself so the task can call check_parsing_result on it
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.kill_driver()
        self.display.stop()
        # handle exceptions here
        # if this returns True, any exceptions will be suppressed

    def kill_driver(self):
        self.driver.close()
        self.driver.quit()
```

If an error is thrown within the with block, Parser.__exit__ will be called before the exception propagates, which gives you the chance to kill the driver and the display before the process closes.

Note that I removed the bare try/except blocks from your stop method. Swallowing every exception like that is bad practice, because you never see the traceback, which would be quite useful for debugging your question...

Now in your task:

```python
@shared_task()
def check_urls(parsing_result_ids):
    with Parser() as parser:
        parsing_results = ParsingResult.objects.filter(pk__in=parsing_result_ids).exclude(status__in=["DONE", "FAILED"])
        parser.check_parsing_result(parsing_results)
```

It's unlikely Celery is the problem. Using Selenium within a Docker container seems to be the root cause of the zombie processes. See Jimmy Engelbrecht's answer for further details.

Jimmy's solution to the zombie problem:

```python
def quit_driver_and_reap_children(driver):
    log.debug('Quitting session: %s' % driver.session_id)
    driver.quit()
    try:
        pid = True
        while pid:
            pid = os.waitpid(-1, os.WNOHANG)
            log.debug("Reaped child: %s" % str(pid))

            # Wonka's solution to avoid an infinite loop, because pid eventually becomes (0, 0)
            try:
                if pid[0] == 0:
                    pid = False
            except:
                pass
            # ---- ----
    except ChildProcessError:
        pass
```
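One way to combine this with the context manager above (a sketch, not tested against the asker's setup) is to call the reaping helper from Parser.__exit__, so that every task run reaps the children it forked before the worker moves on:

```python
class Parser():
    # __init__, __enter__ and kill_driver as above

    def __exit__(self, exc_type, exc_val, exc_tb):
        # quit the browser and reap every child this worker process has forked
        quit_driver_and_reap_children(self.driver)
        self.display.stop()
```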

If this solution doesn't work, please show us the traceback you suppressed in your Parser.stop method.
