Skip to content

fix: prevent EC2 termination when GitHub runner de-registration fails#5061

Open
shivdesh wants to merge 2 commits intogithub-aws-runners:mainfrom
shivdesh:fix/scale-down-retry-logic
Open

fix: prevent EC2 termination when GitHub runner de-registration fails#5061
shivdesh wants to merge 2 commits intogithub-aws-runners:mainfrom
shivdesh:fix/scale-down-retry-logic

Conversation

@shivdesh
Copy link
Copy Markdown

@shivdesh shivdesh commented Mar 10, 2026

Summary

This PR complements #4990 by ensuring that when GitHub runner de-registration fails (even after automatic retries), the EC2 instance is not terminated. This prevents stale runner entries in GitHub org settings.

Problem

When the scale-down Lambda fails to de-register a runner from GitHub (e.g., due to persistent API errors), the current code still terminates the EC2 instance. This leaves stale/offline runner entries in the GitHub organization settings.

We encountered this issue in production where transient 502 errors during scale-down left 117 stale runner entries in our GitHub organization.

Relationship to #4990

PR #4990 added @octokit/plugin-retry which provides automatic retries at the Octokit client level. This is great for handling transient failures. However, if de-registration ultimately fails after all retries, we still need to handle that gracefully by NOT terminating the EC2 instance.

Solution

  • Extract deleteGitHubRunner() helper that catches errors per-runner
  • Only terminate EC2 instance if all GitHub de-registrations succeed
  • If any de-registration fails, leave the instance running so the next scale-down cycle can retry

Changes

  • lambdas/functions/control-plane/src/scale-runners/scale-down.ts:
    • Added deleteGitHubRunner() helper function
    • Modified removeRunner() to only terminate EC2 if all de-registrations succeed

Testing

  • Added test verifying EC2 is NOT terminated when de-registration throws an error
  • All 124 scale-down tests pass

Why not custom retry logic?

The @octokit/plugin-retry (added in #4990) already handles automatic retries at the client level, so no custom retry logic is needed. This PR focuses solely on the failure handling aspect - what to do when de-registration fails after all retries.

@npalm
Copy link
Copy Markdown
Member

npalm commented Mar 10, 2026

@Brend-Smits do you have time to check this PR, since you digged in similar PRs

@shivdesh shivdesh force-pushed the fix/scale-down-retry-logic branch from 8b5e8ef to a5b332f Compare March 10, 2026 18:12
@Brend-Smits
Copy link
Copy Markdown
Contributor

Brend-Smits commented Mar 11, 2026

Hi @shivdesh,

Thanks for the contribution!

I recently added something similar, but for the scale-up lambda. Please see: https://github.com/github-aws-runners/terraform-aws-github-runner/pull/4990/changes
Can you have a look if you can reuse that code or make that code more reusable?

@shivdesh
Copy link
Copy Markdown
Author

Hi @Brend-Smits,

Thanks for the feedback! I reviewed PR #4990 and you're right - the @octokit/plugin-retry approach is much cleaner than custom retry logic.

I've updated this PR to:

  1. Leverage the existing @octokit/plugin-retry you added in fix: gracefully handle JIT config failures and terminate unconfigured instance #4990 (removed all custom retry code)
  2. Add the missing failure handling for scale-down - preventing EC2 termination when de-registration fails after all retries

Thanks for pointing me to your work!

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves the scale-down Lambda’s failure handling to prevent terminating an EC2 runner instance when GitHub runner de-registration fails, reducing the risk of leaving stale self-hosted runner entries behind.

Changes:

  • Extracts a deleteGitHubRunner() helper to de-register runners with per-runner error handling.
  • Updates removeRunner() to terminate the EC2 instance only when all GitHub de-registrations succeed.
  • Adds a test ensuring the instance is not terminated when the de-registration call throws.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
lambdas/functions/control-plane/src/scale-runners/scale-down.ts Adds per-runner de-registration helper and gates EC2 termination on full de-registration success.
lambdas/functions/control-plane/src/scale-runners/scale-down.test.ts Adds regression test for thrown de-registration errors preventing termination.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@shivdesh shivdesh force-pushed the fix/scale-down-retry-logic branch from 6bab611 to aa2e75b Compare March 13, 2026 18:46
@shivdesh shivdesh changed the title fix: add retry logic for GitHub API calls in scale-down Lambda fix: prevent EC2 termination when GitHub runner de-registration fails Mar 13, 2026
@shivdesh shivdesh force-pushed the fix/scale-down-retry-logic branch from aa2e75b to 1f66402 Compare March 13, 2026 21:07
@shivdesh
Copy link
Copy Markdown
Author

@Brend-Smits checking to see if anything pending from merging this PR?

When the scale-down Lambda fails to de-register a runner from GitHub (even after automatic retries via @octokit/plugin-retry), the EC2 instance should NOT be terminated. This prevents stale runner entries in GitHub org settings. This change complements PR github-aws-runners#4990 which added @octokit/plugin-retry for automatic retries. While that handles transient failures, this ensures that if de-registration ultimately fails, we don't leave orphaned GitHub runner entries by terminating the EC2 instance prematurely. Key changes: - Extract deleteGitHubRunner() helper that catches errors per-runner - Only terminate EC2 instance if ALL GitHub de-registrations succeed - If any de-registration fails, leave instance running for next cycle - Rename githubAppClient to githubInstallationClient for clarity - Refactor to split owner/repo once instead of multiple times - Fix error logging to handle non-Error objects properly The @octokit/plugin-retry (added in github-aws-runners#4990) handles automatic retries at the client level, so no custom retry logic is needed here. Tests: - Add test verifying EC2 is NOT terminated when de-registration fails
…ed maximum When pool and scale-up lambdas run concurrently, currentRunners can temporarily exceed maximumRunners. This caused the calculation `maximumRunners - currentRunners` to produce a negative value, which was then passed to EC2 CreateFleet API, resulting in: InvalidTargetCapacitySpecification: TotalTargetCapacity should not be negative. This fix wraps the calculation with Math.max(0, ...) to ensure we never attempt to create a negative number of runners. Fixes race condition between pool-lambda and scale-up-lambda.
@shivdesh shivdesh force-pushed the fix/scale-down-retry-logic branch from 1f66402 to 7a368bc Compare March 27, 2026 17:35
shivdesh added a commit to shivdesh/terraform-aws-github-runner that referenced this pull request Mar 27, 2026
- Prevent EC2 termination when GitHub runner de-registration fails (upstream PR github-aws-runners#5061) - Prevent negative TotalTargetCapacity when runners exceed maximum (upstream PR github-aws-runners#5062) Changes: - scale-down.ts: Extract deleteGitHubRunner() helper, only terminate EC2 if all de-registrations succeed - scale-up.ts: Add Math.max(0, ...) guard to prevent negative runner count
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

4 participants