autoresume state handling improvements #2196

Open
matthewlouisbrockman wants to merge 30 commits into main from connect-autoresume-state-handling

@matthewlouisbrockman (Contributor) commented Mar 21, 2026

Auto-resume was not mirroring the state handling used by /resume and /connect. On a catalog miss, it could return routing for sandboxes still in a pausing/snapshotting transition, which let post-connect run traffic hit 502s.

Changes:

  • orchestrator now has HandleExistingSandboxAutoResume in autoresume.go to handle the state transitions.

Note

Medium Risk
Touches auto-resume/resume routing and orchestrator state handling, which can change when requests fall back to DB-backed resumes or return 409s; misclassification could increase resume failures or transient errors.

Overview
Auto-resume now checks orchestrator state before attempting DB-backed resume: running sandboxes return routing immediately, pausing/snapshotting sandboxes are waited on and re-checked (bounded retries/time budget), killing/unknown states fail, and snapshot metadata is reloaded before fallback to avoid resuming from stale pre-pause data. A new shared SandboxStillTransitioningMessage is returned as FailedPrecondition and is propagated through client-proxy into a dedicated SandboxStillTransitioningError, which the HTTP proxy renders as a 409 (JSON or browser HTML) with added tests covering the new state-machine and error-template behavior.

Written by Cursor Bugbot for commit 2eb2308. This will update automatically on new commits. Configure here.

@matthewlouisbrockman
Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: aa2c4735a8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Tight spin loop on failed snapshot transition
    • Added a retry backoff when sandbox state remains snapshotting after refresh so auto-resume no longer busy-spins on failed snapshot transitions.

Create PR

Or push these changes by commenting:

@cursor push 8d53281ca5 
Preview (8d53281ca5)
diff --git a/packages/api/internal/handlers/proxy_grpc.go b/packages/api/internal/handlers/proxy_grpc.go
--- a/packages/api/internal/handlers/proxy_grpc.go
+++ b/packages/api/internal/handlers/proxy_grpc.go
@@ -24,6 +24,8 @@
 	sharedutils "github.com/e2b-dev/infra/packages/shared/pkg/utils"
 )

+const snapshottingStateRetryDelay = 100 * time.Millisecond
+
 type SandboxService struct {
 	proxygrpc.UnimplementedSandboxServiceServer
@@ -134,6 +136,14 @@
 	updatedSandbox, getSandboxErr := getSandbox(ctx)
 	if getSandboxErr == nil {
+		if sbx.State == sandbox.StateSnapshotting && updatedSandbox.State == sandbox.StateSnapshotting {
+			select {
+			case <-time.After(snapshottingStateRetryDelay):
+			case <-ctx.Done():
+				return "", false, status.Error(codes.Internal, "error waiting for sandbox snapshot to finish")
+			}
+		}
+
 		sbx = updatedSandbox
 		continue
diff --git a/packages/api/internal/handlers/proxy_grpc_test.go b/packages/api/internal/handlers/proxy_grpc_test.go
--- a/packages/api/internal/handlers/proxy_grpc_test.go
+++ b/packages/api/internal/handlers/proxy_grpc_test.go
@@ -5,6 +5,7 @@
 	"errors"
 	"fmt"
 	"testing"
+	"time"

 	"github.com/google/uuid"
 	"github.com/stretchr/testify/assert"
@@ -404,6 +405,45 @@
 		assert.Equal(t, "error waiting for sandbox snapshot to finish", st.Message())
 	})

+	t.Run("snapshotting sandbox with no active transition does not spin", func(t *testing.T) {
+		t.Parallel()
+
+		ctx, cancel := context.WithTimeout(t.Context(), 20*time.Millisecond)
+		defer cancel()
+
+		waitCalls := 0
+		getCalls := 0
+
+		_, handled, err := handleExistingSandboxAutoResume(
+			ctx,
+			"test-sandbox",
+			testSandboxForAutoResume(sandbox.StateSnapshotting),
+			func(context.Context) error {
+				waitCalls++
+
+				return nil
+			},
+			func(context.Context) (sandbox.Sandbox, error) {
+				getCalls++
+
+				return testSandboxForAutoResume(sandbox.StateSnapshotting), nil
+			},
+			func(sandbox.Sandbox) (string, error) {
+				t.Fatal("getNodeIP should not be called while sandbox remains snapshotting")
+
+				return "", nil
+			},
+		)
+		require.Error(t, err)
+		assert.False(t, handled)
+		st, ok := status.FromError(err)
+		require.True(t, ok)
+		assert.Equal(t, codes.Internal, st.Code())
+		assert.Equal(t, "error waiting for sandbox snapshot to finish", st.Message())
+		assert.Equal(t, 1, waitCalls)
+		assert.Equal(t, 1, getCalls)
+	})
+
 	t.Run("pausing sandbox returns internal error when refreshed sandbox lookup fails unexpectedly", func(t *testing.T) {
 		t.Parallel()

@matthewlouisbrockman matthewlouisbrockman marked this pull request as ready for review March 22, 2026 08:08
@matthewlouisbrockman matthewlouisbrockman assigned dobrac and unassigned levb Mar 22, 2026
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 964c63291e

@matthewlouisbrockman matthewlouisbrockman marked this pull request as draft March 22, 2026 08:15
@matthewlouisbrockman matthewlouisbrockman marked this pull request as ready for review March 22, 2026 09:15
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 99c73584aa

@jakubno jakubno assigned jakubno and unassigned dobrac Mar 23, 2026
@jakubno (Member) left a comment

just NITs

cursor bot commented Mar 24, 2026

PR Summary

Medium Risk
Changes auto-resume control flow and error propagation across API, orchestrator, and proxies, which could alter client-visible behavior during sandbox transitions. Risk is moderated by added unit/integration tests for the new transition and error-handling paths.

Overview
Improves sandbox auto-resume to avoid returning routing information while a sandbox is still pausing/snapshotting by adding orchestrator-side state handling with bounded waiting/retries, reloading snapshot metadata before resume, and propagating a dedicated FailedPrecondition (sandbox is still transitioning) signal end-to-end so the client proxy and shared proxy can surface a 409 "still transitioning" response (JSON or browser HTML) instead of 502s.

Written by Cursor Bugbot for commit 724c237. This will update automatically on new commits. Configure here.

# Conflicts:
#	packages/api/internal/handlers/proxy_grpc.go
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

@ValentaTomas ValentaTomas removed their request for review March 25, 2026 16:27