We recently set up continuous integration/deployment/delivery for a Node.js webapp on Google App Engine. The CI server (GitLab CI) runs dependency installation, build, tests, and deployment to integration/prod depending on the branch (develop/master).

Until now, the only failures we had faced were during the dependency installation step, so we didn't worry much about them. But yesterday (21/10/16) there was a wide-scale DNS outage, and the pipeline failed in the middle of the deployment step, breaking prod. Simply re-running the pipeline did the job, but the problem can recur at any time.

My questions are:

  • How can we handle this sort of network issue in the continuous deployment process?
  • Is continuous deployment on Google App Engine really a good idea?
  • If so, what is the App Engine deployment methodology? I can't find any relevant documentation about it...

For the moment we have only two versions, "dev" and "prod", which are updated after commits, but at random times I have observed strange behaviour.

Any response/suggestion/feedback is very welcome!

Example of stacktrace concerning the networking issues I am talking about:

DEBUG: Error sending result: 'MetadataServerException(HTTPError(),)'. Reason: 'PicklingError("Can't pickle <type 'cStringIO.StringO'>: attribute lookup cStringIO.StringO failed",)'
Traceback (most recent call last):
  File "/google-cloud-sdk/lib/googlecloudsdk/calliope/cli.py", line 733, in Execute
    resources = args.calliope_command.Run(cli=self, args=args)
  File "/google-cloud-sdk/lib/googlecloudsdk/calliope/backend.py", line 1630, in Run
    resources = command_instance.Run(args)
  File "/google-cloud-sdk/lib/surface/app/deploy.py", line 53, in Run
    return deploy_util.RunDeploy(self, args)
  File "/google-cloud-sdk/lib/googlecloudsdk/command_lib/app/deploy_util.py", line 387, in RunDeploy
    all_services)
  File "/google-cloud-sdk/lib/googlecloudsdk/command_lib/app/deploy_util.py", line 247, in Deploy
    manifest = _UploadFiles(service, code_bucket_ref)
  File "/google-cloud-sdk/lib/googlecloudsdk/command_lib/app/deploy_util.py", line 115, in _UploadFiles
    service, code_bucket_ref)
  File "/google-cloud-sdk/lib/googlecloudsdk/api_lib/app/deploy_app_command_util.py", line 277, in CopyFilesToCodeBucketNoGsUtil
    _UploadFiles(files_to_upload, bucket_ref)
  File "/google-cloud-sdk/lib/googlecloudsdk/api_lib/app/deploy_app_command_util.py", line 219, in _UploadFiles
    results = pool.map(_UploadFile, tasks)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get
    raise self._value
MaybeEncodingError: Error sending result: 'MetadataServerException(HTTPError(),)'. Reason: 'PicklingError("Can't pickle <type 'cStringIO.StringO'>: attribute lookup cStringIO.StringO failed",)'
DEBUG: Exception captured in Error
Traceback (most recent call last):
  File "/google-cloud-sdk/lib/googlecloudsdk/core/metrics.py", line 411, in Wrapper
    return func(*args, **kwds)
TypeError: Error() takes exactly 3 arguments (1 given)
ERROR: gcloud crashed (MaybeEncodingError): Error sending result: 'MetadataServerException(HTTPError(),)'. Reason: 'PicklingError("Can't pickle <type 'cStringIO.StringO'>: attribute lookup cStringIO.StringO failed",)'
Traceback (most recent call last):
  File "/google-cloud-sdk/lib/gcloud.py", line 65, in <module>
    main()
  File "/google-cloud-sdk/lib/gcloud.py", line 61, in main
    sys.exit(googlecloudsdk.gcloud_main.main())
  File "/google-cloud-sdk/lib/googlecloudsdk/gcloud_main.py", line 145, in main
    crash_handling.HandleGcloudCrash(err)
  File "/google-cloud-sdk/lib/googlecloudsdk/command_lib/crash_handling.py", line 107, in HandleGcloudCrash
    _ReportError(err)
  File "/google-cloud-sdk/lib/googlecloudsdk/command_lib/crash_handling.py", line 86, in _ReportError
    util.ErrorReporting().ReportEvent(error_message=stacktrace,
  File "/google-cloud-sdk/lib/googlecloudsdk/api_lib/error_reporting/util.py", line 28, in __init__
    self._API_NAME, self._API_VERSION)
  File "/google-cloud-sdk/lib/googlecloudsdk/core/apis.py", line 254, in GetClientInstance
    http_client = http.Http()
  File "/google-cloud-sdk/lib/googlecloudsdk/core/credentials/http.py", line 60, in Http
    creds = store.Load()
  File "/google-cloud-sdk/lib/googlecloudsdk/core/credentials/store.py", line 282, in Load
    if account in c_gce.Metadata().Accounts():
  File "/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce.py", line 122, in Accounts
    gce_read.GOOGLE_GCE_METADATA_ACCOUNTS_URI + '/')
  File "/google-cloud-sdk/lib/googlecloudsdk/core/util/retry.py", line 160, in TryFunc
    return func(*args, **kwargs), None
  File "/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce.py", line 45, in _ReadNoProxyWithCleanFailures
    raise MetadataServerException(e)
googlecloudsdk.core.credentials.gce.MetadataServerException: HTTP Error 503: Service Unavailable
DEBUG: Uploading [/builds/apps/webapp/lib/jinja2/defaults.pyc] to [151c77b4e5bdd2c38b6a2bf914fffa3a6ffa71a6]
INFO: Uploading [/builds/apps/webapp/lib/jinja2/defaults.pyc] to [151c77b4e5bdd2c38b6a2bf914fffa3a6ffa71a6]
INFO: Refreshing access_token

1 Answer

Good/bad? Subjective - thus off-topic for SO. Assuming the question is how to make continuous deployment reliable :)

Well, the trouble is that you're using app versions as your CI environments, which means you can't avoid breakages due to a specific version being bad. You can only hope to recover as fast as possible by re-deploying the version (when the outage ends) - this can be automated.
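Automating the re-deploy could look roughly like the retry wrapper below, which re-runs a command with exponential backoff. It is only a sketch: the `retry` helper, its argument order, and the attempt and delay values are assumptions made for illustration, not something provided by gcloud or GitLab CI.

```shell
#!/bin/sh
# retry MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or MAX_ATTEMPTS is
# reached, doubling the wait between attempts.
retry() {
  max_attempts="$1"
  delay="$2"
  shift 2
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${attempt} failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}
```

For example, `retry 5 10 gcloud app deploy --version myID` would try the deployment up to five times, waiting 10, 20, 40 and 80 seconds between attempts. Retrying only shortens the recovery time; it does not prevent the serving version from being broken while the outage lasts.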

You should not have your production site running directly off the version that the CI production pipeline overwrites; otherwise you risk a site outage on a bad deployment. Instead, you could use a new, unique version for each execution of the CI production pipeline, and only after that completes successfully switch site traffic to that version using the flow described below (which can also be used inside the CI pipelines if you use different apps instead of app versions as CI environments).
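One way to get a unique version per pipeline run is to derive the ID from CI variables. The sketch below assumes the GitLab CI predefined variables `CI_PIPELINE_ID` and `CI_COMMIT_SHORT_SHA` are available (older GitLab releases used different names), and that version IDs should contain only lowercase letters, digits and hyphens. The `make_version_id` helper and the `ci-` prefix are hypothetical.

```shell
#!/bin/sh
# make_version_id: prints a version ID such as "ci-<pipeline>-<sha>".
# Hypothetical helper; falls back to "local" and a timestamp outside CI.
make_version_id() {
  pipeline_id="${CI_PIPELINE_ID:-local}"
  short_sha="${CI_COMMIT_SHORT_SHA:-$(date +%s)}"
  # Lowercase everything, then replace any other character with a hyphen.
  printf 'ci-%s-%s' "$pipeline_id" "$short_sha" \
    | tr '[:upper:]' '[:lower:]' \
    | tr -c 'a-z0-9-' '-'
}
```

The result can then be passed to the deploy command, for instance `gcloud app deploy --version "$(make_version_id)" --no-promote`.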

From Deploying your program:

By default the deploy command automatically generates a new version ID each time that you use it and will route any traffic to the new version.

To override this behavior, you can specify the version ID with the version flag:

gcloud app deploy --version myID 

You can also specify not to send all traffic to the new version immediately with the --no-promote flag:

gcloud app deploy --no-promote 

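So never deploy a version and make it the default traffic destination in the same step (driven from the client side, that may not be atomic), especially for the production app. Instead:

  • deploy the new version without promoting it (--no-promote)
  • check that the new version works, using its version-specific URL
  • switch traffic to the new version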

This way the only critical operation is traffic switching, which (hopefully) is atomic: it either succeeds or is completely rolled back on the GAE side (if not, it's a GAE bug). If this step fails, the app should continue to work with the old version.
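A deploy-then-switch flow of this kind could be scripted roughly as follows: deploy without promoting, smoke-test the new version, and only then route traffic to it. This is a sketch under assumptions: the `deploy_and_promote` helper, the health-check URL you pass in, and the `default` service name are illustrative, and `gcloud app services set-traffic` is used here as one way to switch traffic.

```shell
#!/bin/sh
# deploy_and_promote VERSION HEALTH_URL [SERVICE]
# Hypothetical helper. HEALTH_URL should point at the new version's own URL
# (not the live site); SERVICE defaults to "default".
deploy_and_promote() {
  version="$1"
  health_url="$2"
  service="${3:-default}"

  # 1. Upload the new version without sending it any traffic.
  gcloud app deploy --version "$version" --no-promote --quiet || return 1

  # 2. Smoke-test the new version; the old one keeps serving meanwhile.
  curl --fail --silent --show-error --max-time 30 "$health_url" > /dev/null \
    || return 2

  # 3. Only now route all traffic to the new version.
  gcloud app services set-traffic "$service" --splits "${version}=1" --quiet \
    || return 3
}
```

The distinct return codes let the pipeline tell an upload failure (1) from a failed check (2) or a failed traffic switch (3). In the first two cases production has not been touched.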

Of course, this assumes the networking issues are only in between you and GAE, if they're also affecting GAE's internal ops all bets are off (but those I trust should be fixed rather timely).

6 Comments

Thank you for your very detailed response. You're probably right about using different apps as CI environments; that's a better idea and it could solve the various issues we're facing. One last question: the app is autoscaled, so I cannot start/stop versions (according to the docs). When a build creates a new version, it will incur charges; should I switch to basic scaling, or delete the previous version once the new one is created?
No need to explicitly start the new version with autoscaling. Just use the respective version's URLs for testing that the version works and GAE will start the instances itself: cloud.google.com/appengine/docs/flexible/python/…
+1000000 Good answer as it comes right there @DanCornilescu
Does gcloud app deploy --no-promote generate a version ID if none is included in the command?
@GeekGuy yes, it should - the deployment needs a version ID. That's the default behaviour.