
A custom App Engine flexible environment fails to start up, and it seems to be due to failing health checks. The app has a few custom dependencies (e.g. PostGIS, GDAL), so the image adds a few layers on top of the App Engine base image. It builds successfully and runs locally in a Docker container.

ERROR: (gcloud.app.deploy) Error Response: [4] Your deployment has failed to become healthy in the allotted time and therefore was rolled back. If you believe this was an error, try adjusting the 'app_start_timeout_sec' setting in the 'readiness_check' section. 

The Dockerfile looks as follows (note: no CMD, as the entrypoint is defined in docker-compose.yml and app.yaml):

FROM gcr.io/google-appengine/python

ENV PYTHONUNBUFFERED 1
ENV DEBIAN_FRONTEND noninteractive

RUN apt -y update && apt -y upgrade \
    && apt-get install -y software-properties-common \
    && add-apt-repository -y ppa:ubuntugis/ppa \
    && apt -y update \
    && apt-get -y install gdal-bin libgdal-dev python3-gdal \
    && apt-get autoremove -y \
    && apt-get autoclean -y \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

ADD requirements.txt /app/requirements.txt
RUN python3 -m pip install -r /app/requirements.txt
ADD . /app/
WORKDIR /app

This unfortunately creates an image of a whopping 1.58GB, but the original gcr.io python image starts at 1.05GB, so I don't think the size of the image would or should be a problem.
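(Side note, not in the original build: a quick way to sanity-check that the GDAL layer actually works inside the built image is to run a few lines of Python in the container, assuming the app relies on the osgeo bindings provided by python3-gdal:)

# Quick sanity check (not from the original post) that the GDAL stack installed
# by the Dockerfile is usable from Python 3 inside the container.
# Assumes the app uses the osgeo bindings from python3-gdal.
from osgeo import gdal, ogr

print(gdal.VersionInfo("RELEASE_NAME"))     # GDAL release pulled in from the ubuntugis PPA
print(ogr.GetDriverCount(), "OGR drivers")  # a non-zero count means the bindings are wired up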

Running this locally with the following docker-compose.yml config beautifully spins up a container in no time:

version: "3" services: web: build: . command: gunicorn gisapplication.wsgi --bind 0.0.0.0:8080 

So, I would have expected the following app.yaml would do the trick:

runtime: custom
env: flex
entrypoint: gunicorn -b :$PORT gisapplication.wsgi

beta_settings:
  cloud_sql_instances: <sql-db-connection>

runtime_config:
  python_version: 3

No luck. So, as per the error above, it seemed to have something to do with the readiness check. I tried increasing the timeout for the app to start (15 minutes!). There seem to have been issues with health checks in the past, but rolling back to legacy health checks hasn't been an option since September 2019.

readiness_check:
  path: "/readiness_check"
  check_interval_sec: 10
  timeout_sec: 10
  failure_threshold: 3
  success_threshold: 3
  app_start_timeout_sec: 900

liveness_check:
  path: "/liveness_check"
  check_interval_sec: 60
  timeout_sec: 4
  failure_threshold: 3
  success_threshold: 2
  initial_delay_sec: 30

Split health checks are definitely on. The output from gcloud beta app describe is:

authDomain: gmail.com
codeBucket: staging.proj-id-000000.appspot.com
databaseType: CLOUD_DATASTORE_COMPATIBILITY
defaultBucket: proj-id-000000.appspot.com
defaultHostname: proj-id-000000.ts.r.appspot.com
featureSettings:
  splitHealthChecks: true
  useContainerOptimizedOs: true
gcrDomain: asia.gcr.io
id: proj-id-000000
locationId: australia-southeast1
name: apps/proj-id-000000
servingStatus: SERVING

That didn't work, so I also tried increasing the resources available to the instance and allocated the maximum amount of memory for 1 CPU (6.1GB):

resources:
  cpu: 1
  memory_gb: 6.1
  disk_size_gb: 10

Just to be on the safe side, I added health check endpoints to the app (for both the legacy health check and the split health checks). It's a Django app, so this went into the project's urls.py:

from django.http import HttpResponse
from django.urls import path

# ...added to urlpatterns:
path(r'_ah/health/', lambda r: HttpResponse("OK", status=200)),
path(r'readiness_check/', lambda r: HttpResponse("OK", status=200)),
path(r'liveness_check/', lambda r: HttpResponse("OK", status=200)),

So, when I dive into the logs, there is a successful request to /liveness_check from a curl user agent, but the subsequent requests to /readiness_check from the GoogleHC agent return a 503 (Service Unavailable):

[log viewer screenshot]
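For what it's worth, the two endpoints can be probed locally the same way the health checker does, against the container spun up via docker-compose. This is just a debugging sketch, assuming the container is listening on localhost:8080:

# Debugging sketch (not from the original post): hit the health endpoints the way
# the checker would, assuming the container is listening on localhost:8080.
import urllib.error
import urllib.request

for endpoint in ("/liveness_check", "/readiness_check"):
    req = urllib.request.Request(
        "http://localhost:8080" + endpoint,
        headers={"User-Agent": "GoogleHC/1.0"},  # mimic the user agent seen in the logs
    )
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            print(endpoint, resp.status)  # expect 200
    except urllib.error.HTTPError as err:
        print(endpoint, err.code)         # a 503 here would reproduce the problem locally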

Shortly after (after 8 failed requests - why 8?) a shutdown trigger seems to be sent off:

2020-07-05 09:00:02.603 AEST Triggering app shutdown handlers. 

Any ideas on what is going on here? I think I've pretty much exhausted the options for fixing this problem and wonder whether the time wouldn't have been better invested in getting things up and running on Compute Engine/EC2.

ADDENDUM:

In addition to the SO issue linked, I've gone through the related Google issues (here and here).

  • Open an issue with Google Cloud - let's see if they can figure it out. Commented Jul 9, 2020 at 23:34

3 Answers


All right, the folks at Google could not fix it either, but after an epic journey through way too many logs I managed to figure out what the issue is: the Dockerfile needs a CMD statement. While I had assumed that's what the entrypoint in app.yaml was for, it seems App Engine spins up the container with a plain docker run, so without a CMD there is nothing to start the app. Therefore, simply adding this line to the Dockerfile fixes it:

CMD gunicorn -b :$PORT gisapplication.wsgi 

I also reverted to the default health check settings and was able to take the health check URL paths out of my app, letting the nginx instance shipped with Google's base container handle them.
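One more sanity check worth mentioning here (a side note of mine, not part of the original fix): when gunicorn is given just gisapplication.wsgi with no :name suffix, it falls back to the module-level variable called application, so it's worth confirming the module imports cleanly and exposes it:

# Sanity check (side note): confirm the WSGI entry point gunicorn loads
# ("gisapplication.wsgi", defaulting to the module-level name "application") resolves.
import importlib
import os

# Assumption: the settings module is called gisapplication.settings.
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "gisapplication.settings")

wsgi = importlib.import_module("gisapplication.wsgi")
print(callable(wsgi.application))  # should print True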




You are sending the readiness check to path: "/readiness_check", but your URL handler for it is path(r'readiness_check/', ...).

Note the trailing slash in the handler. Remove it (or add a trailing slash to the path for readiness_check) and see if that fixes it. I would expect that to give you a 404, but you are getting a 503, which tells me you may have a more serious error. Click one of the arrows to the left of a 503 in the console and see what the error message is. You may need to search the console for a traceback to see it.
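For reference, a sketch of handlers that match the probe paths exactly (no trailing slash, so Django's APPEND_SLASH redirect never enters the picture) could look something like this:

# Sketch: handlers matching the exact paths App Engine probes (no trailing slash),
# so no APPEND_SLASH/301 redirect is involved.
from django.http import HttpResponse
from django.urls import path

urlpatterns = [
    path('readiness_check', lambda r: HttpResponse("OK", status=200)),
    path('liveness_check', lambda r: HttpResponse("OK", status=200)),
]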

Comments

  • No, that isn't it. As you mention yourself, this wouldn't spit out a 503. If you omit the trailing slash, Django redirects, so a 301 Moved Permanently could show up.
  • The console just reads "DEBUG: Operation [project-id-redacted] not complete. Waiting to retry. Updating service [default] (this may take several minutes)...⠏DEBUG: Operation [project-id-redacted] not complete. Waiting to retry." With the flag --verbosity=debug it just prints this statement over and over again, so no details there. I'm going through the logs, but there are about a million of them, so it's hard to see what the issue is. There is a successful liveness check by a curl agent (200), but right after that comes a "VM shutdown initiated".

I went for a modification of my app.yaml like this:

readiness_check:
  path: "/readiness_check"
  check_interval_sec: 60
  timeout_sec: 20
  failure_threshold: 2
  success_threshold: 2
  app_start_timeout_sec: 600  # -> increase to 3600

liveness_check:
  path: "/liveness_check"
  check_interval_sec: 60
  timeout_sec: 20
  failure_threshold: 2
  success_threshold: 2

as per the doc here

Also, after getting to know the logs there and filtering on the last deployed version, I finally got this ERROR message in bash:

ERROR: (gcloud.app.deploy) Error Response: [4] An internal error occurred while processing task /app-engine-flex/flex_await_healthy/flex_await_healthy>2023-04-15T01:30:41.152Z11154.hx.2: Your deployment has failed to become healthy in the allotted time and therefore was rolled back. If you believe this was an error, try adjusting the 'app_start_timeout_sec' setting in the 'readiness_check' section.

So, as shown above, I increased that timeout. The next point concerns the build timeout, which was OK for me too; but just to be sure, you can raise it as well, without any extra step in your .yaml file, with just this in the CLI:

gcloud config set app/cloud_build_timeout 3000 --installation

