autoresume state handling improvements #2196
matthewlouisbrockman wants to merge 30 commits into main
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: aa2c4735a8
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Autofix Details
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Tight spin loop on failed snapshot transition
- Added a retry backoff when sandbox state remains snapshotting after refresh so auto-resume no longer busy-spins on failed snapshot transitions.
Or push these changes by commenting:
@cursor push 8d53281ca5
Preview (8d53281ca5)
```diff
diff --git a/packages/api/internal/handlers/proxy_grpc.go b/packages/api/internal/handlers/proxy_grpc.go
--- a/packages/api/internal/handlers/proxy_grpc.go
+++ b/packages/api/internal/handlers/proxy_grpc.go
@@ -24,6 +24,8 @@
 	sharedutils "github.com/e2b-dev/infra/packages/shared/pkg/utils"
 )
 
+const snapshottingStateRetryDelay = 100 * time.Millisecond
+
 type SandboxService struct {
 	proxygrpc.UnimplementedSandboxServiceServer
 
@@ -134,6 +136,14 @@
 			updatedSandbox, getSandboxErr := getSandbox(ctx)
 			if getSandboxErr == nil {
+				if sbx.State == sandbox.StateSnapshotting && updatedSandbox.State == sandbox.StateSnapshotting {
+					select {
+					case <-time.After(snapshottingStateRetryDelay):
+					case <-ctx.Done():
+						return "", false, status.Error(codes.Internal, "error waiting for sandbox snapshot to finish")
+					}
+				}
+
 				sbx = updatedSandbox
 				continue
diff --git a/packages/api/internal/handlers/proxy_grpc_test.go b/packages/api/internal/handlers/proxy_grpc_test.go
--- a/packages/api/internal/handlers/proxy_grpc_test.go
+++ b/packages/api/internal/handlers/proxy_grpc_test.go
@@ -5,6 +5,7 @@
 	"errors"
 	"fmt"
 	"testing"
+	"time"
 
 	"github.com/google/uuid"
 	"github.com/stretchr/testify/assert"
@@ -404,6 +405,45 @@
 		assert.Equal(t, "error waiting for sandbox snapshot to finish", st.Message())
 	})
 
+	t.Run("snapshotting sandbox with no active transition does not spin", func(t *testing.T) {
+		t.Parallel()
+
+		ctx, cancel := context.WithTimeout(t.Context(), 20*time.Millisecond)
+		defer cancel()
+
+		waitCalls := 0
+		getCalls := 0
+
+		_, handled, err := handleExistingSandboxAutoResume(
+			ctx,
+			"test-sandbox",
+			testSandboxForAutoResume(sandbox.StateSnapshotting),
+			func(context.Context) error {
+				waitCalls++
+
+				return nil
+			},
+			func(context.Context) (sandbox.Sandbox, error) {
+				getCalls++
+
+				return testSandboxForAutoResume(sandbox.StateSnapshotting), nil
+			},
+			func(sandbox.Sandbox) (string, error) {
+				t.Fatal("getNodeIP should not be called while sandbox remains snapshotting")
+
+				return "", nil
+			},
+		)
+		require.Error(t, err)
+		assert.False(t, handled)
+		st, ok := status.FromError(err)
+		require.True(t, ok)
+		assert.Equal(t, codes.Internal, st.Code())
+		assert.Equal(t, "error waiting for sandbox snapshot to finish", st.Message())
+		assert.Equal(t, 1, waitCalls)
+		assert.Equal(t, 1, getCalls)
+	})
+
 	t.Run("pausing sandbox returns internal error when refreshed sandbox lookup fails unexpectedly", func(t *testing.T) {
 		t.Parallel()
```
💡 Codex Review
Reviewed commit: 964c63291e
💡 Codex Review
Reviewed commit: 99c73584aa
… on http so this falls within it
…ing against changes to failedprecondition
PR Summary: Medium Risk. Written by Cursor Bugbot for commit 724c237.
…e tests to the orch
# Conflicts:
#	packages/api/internal/handlers/proxy_grpc.go
Cursor Bugbot has reviewed your changes and found 1 potential issue.


Auto-resume did not mirror the state handling used by /resume and /connect. On a catalog miss, it could return routing for sandboxes still in a pausing/snapshotting transition, which let post-connect run traffic hit 502s.
Changes:
- `HandleExistingSandboxAutoResume` in `autoresume.go` to handle the state transitions.

Note
Medium Risk
Touches auto-resume/resume routing and orchestrator state handling, which can change when requests fall back to DB-backed resumes or return 409s; misclassification could increase resume failures or transient errors.
Overview
Auto-resume now checks orchestrator state before attempting DB-backed resume: running sandboxes return routing immediately, pausing/snapshotting sandboxes are waited on and re-checked (bounded retries/time budget), killing/unknown states fail, and snapshot metadata is reloaded before fallback to avoid resuming from stale pre-pause data. A new shared `SandboxStillTransitioningMessage` is returned as `FailedPrecondition` and is propagated through client-proxy into a dedicated `SandboxStillTransitioningError`, which the HTTP proxy renders as a 409 (JSON or browser HTML), with added tests covering the new state-machine and error-template behavior.

Written by Cursor Bugbot for commit 2eb2308.