fix: recover crashed agents by recreating missing tmux windows #362

Open: whitmo wants to merge 1 commit into dlorenc:main from whitmo:work/cool-dolphin

whitmo commented Mar 3, 2026

Summary

  • Fixes the critical bug where crashed agents could not be restarted because handleRestartAgent refused to proceed when the tmux window was missing, returning an unhelpful error instead of recreating it
  • Health check loop now attempts window recreation + restart for persistent agents before falling back to cleanup, so agents like supervisor/workspace/merge-queue auto-recover even when their window disappears
  • Adds worktree path validation to both handleRestartAgent and restartAgent to fail fast with clear errors when the agent's working directory no longer exists

What was broken

When an agent crashed hard enough to lose its tmux window (e.g., OOM kill, manual tmux kill-window, session corruption):

  1. multiclaude agent restart <name> would fail with "tmux window does not exist - the agent may need to be recreated" — but there was no way to "recreate" it
  2. The health check loop would immediately mark the agent for cleanup and delete it from state, losing all context

What this fixes

  1. handleRestartAgent now recreates the missing tmux window (pointed at the agent's worktree) before restarting Claude with --resume
  2. Health check loop tries window recreation + restart for persistent agents before marking them for cleanup
  3. Both paths validate the worktree exists first to avoid starting Claude in a nonexistent directory

Test plan

  • Existing TestHandleRestartAgentTableDriven passes (all 8 cases)
  • New test case for missing worktree path validation
  • Full test suite passes (go test ./...)
  • Manual: kill a persistent agent's tmux window, verify multiclaude agent restart recovers it
  • Manual: kill a persistent agent's tmux window, wait 2 min, verify health check auto-recovers it

🤖 Generated with Claude Code

Previously, when an agent's tmux window disappeared (crash, manual kill, etc.), both the restart command and the health check loop would fail to recover it: the restart command returned an error saying the window needed to be recreated, and the health check immediately marked the agent for cleanup.

Now:

- handleRestartAgent recreates the tmux window when missing, then restarts Claude in it, preserving session context via --resume
- Health check loop attempts window recreation + restart for persistent agents before falling back to cleanup
- Both paths validate the agent's worktree still exists before attempting recovery
- restartAgent validates worktree existence early to fail fast with a clear error message

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

whitmo commented Mar 7, 2026

Triage Review

Priority: P1 (Agent restart - roadmap item)
Build: Pass
Tests: Pass (includes the new test case for a missing worktree path)
Merge conflicts: Conflicts with PR #364 in internal/daemon/daemon.go (both modify health check and restart handler)
Roadmap alignment: Directly addresses P1 "Agent restart" item

Changes:

  • Recreates missing tmux windows for persistent agents during health check
  • Adds worktree existence verification before restart
  • Recreates tmux window in handleRestartAgent if missing
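
As a rough illustration of the health check behavior listed above, the decision per agent might be sketched as follows. The types and names (`agentStatus`, `healthResult`, `checkAgents`, `errRecoveryFailed`) are hypothetical, and sending non-persistent agents straight to cleanup is an assumption based on the PR describing recovery only for persistent agents.

```go
package main

import "errors"

// errRecoveryFailed is a placeholder failure used only in this sketch.
var errRecoveryFailed = errors.New("window recreation or restart failed")

// agentStatus is a hypothetical snapshot of what the health check sees.
type agentStatus struct {
	Name         string
	Persistent   bool // e.g. supervisor, workspace, merge-queue
	WindowExists bool
}

// healthResult lists which agents were recovered and which need cleanup.
type healthResult struct {
	Recovered []string
	Cleanup   []string
}

// checkAgents tries recovery for persistent agents whose window is missing
// and falls back to cleanup only when recovery fails. Non-persistent agents
// with a missing window go straight to cleanup (an assumption of this sketch).
func checkAgents(agents []agentStatus, tryRecover func(agentStatus) error) healthResult {
	var res healthResult
	for _, a := range agents {
		if a.WindowExists {
			continue // healthy, nothing to do
		}
		if a.Persistent {
			if err := tryRecover(a); err == nil {
				res.Recovered = append(res.Recovered, a.Name)
				continue
			}
		}
		res.Cleanup = append(res.Cleanup, a.Name)
	}
	return res
}
```

Passing recovery in as a function keeps the loop's policy (recover first, clean up second) separate from the tmux and worktree details, which is also why this sketch is easy to compare against a change that touches the same loop.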

Recommendation: Merge before #364. Good defensive improvements.

whitmo commented Mar 12, 2026

Local CI Verification (2026-03-12)

| Check | Result |
| --- | --- |
| go build | PASS |
| go vet | PASS |
| go test ./... | PASS (all 25 packages) |

CI Status: No GitHub Actions checks are running — this is expected for first-time fork PRs. GitHub requires a maintainer to approve workflow runs for PRs from forks.

Branch is rebased on upstream/main (0 commits behind). Ready for maintainer review and CI approval.
