Fix workflow/step not found errors to fail run instead of queue retry#1452
Slack-Thread: https://vercel.slack.com/archives/C09G3EQAL84/p1773856370214769?thread_ts=1773856370.214769&cid=C09G3EQAL84 Co-authored-by: Pranay Prakash <1797812+pranaygp@users.noreply.github.com>
🦋 Changeset detected. Latest commit: 5033859. The changes in this PR will be included in the next version bump. This PR includes changesets to release 16 packages.
🧪 E2E Test Results
❌ Some tests failed
❌ Failed Tests: 🌍 Community Worlds (56 failed)
- mongodb (3 failed)
- redis (2 failed)
- turso (51 failed)
Details by Category
- ✅ ▲ Vercel Production
- ✅ 💻 Local Development
- ✅ 📦 Local Production
- ✅ 🐘 Local Postgres
- ✅ 🪟 Windows
- ❌ 🌍 Community Worlds
- ✅ 📋 Other
❌ Some E2E test jobs failed. Check the workflow run for details.
📊 Benchmark Results
Scenarios measured on 💻 Local Development and ▲ Production (Vercel), with observability links for Nitro, Express, and Next.js (Turbopack); the result tables are not included here:
- workflow with no steps
- workflow with 1 step
- workflow with 10, 25, and 50 sequential steps
- Promise.all with 10, 25, and 50 concurrent steps
- Promise.race with 10, 25, and 50 concurrent steps
- Stream Benchmarks (includes TTFB metrics): workflow with stream
Summary: Fastest Framework by World and Fastest World by Framework, each winner determined by most benchmark wins.
Pull request overview
This PR prevents infinite queue retries caused by deployment mismatches (workflow/step removed or renamed) by turning “not found” conditions into run failures that the runtime can persist via run_failed events.
Changes:
- Throw `WorkflowRuntimeError` when the workflow function can’t be resolved, so the runtime fails the run instead of retrying the queue message.
- In the step queue handler, detect missing/non-function step handlers and immediately create a `run_failed` event with `errorCode=RUNTIME_ERROR` rather than throwing.
- Update `runWorkflow` tests to assert the new error type/message for missing workflow functions.
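The distinction these changes rely on, between failures a retry might fix and failures it cannot, can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `FatalRunError`, `resolveHandler`, `processMessage`, and the `Outcome` type are hypothetical names, and a plain `Map` stands in for the real workflow/step registry.

```typescript
// Hypothetical sketch: none of these names come from the library itself.

// Marks errors that no amount of queue retries can fix.
class FatalRunError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "FatalRunError";
    // Keep instanceof working even when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

type Outcome =
  | { action: "ack" }
  | { action: "fail_run"; reason: string }
  | { action: "retry"; reason: string };

type Registry = Map<string, unknown>;

// Look up a handler; a missing or non-callable entry is treated as fatal.
function resolveHandler(registry: Registry, name: string): () => unknown {
  const fn = registry.get(name);
  if (typeof fn !== "function") {
    throw new FatalRunError(`Handler ${JSON.stringify(name)} not found`);
  }
  return fn as () => unknown;
}

// Decide what the queue should do with a message after processing it.
function processMessage(registry: Registry, name: string): Outcome {
  try {
    resolveHandler(registry, name)();
    return { action: "ack" };
  } catch (err) {
    if (err instanceof FatalRunError) {
      // Deployment mismatch: persist a failure instead of retrying.
      return { action: "fail_run", reason: err.message };
    }
    // Anything else may be transient, so let the queue retry it.
    const reason = err instanceof Error ? err.message : String(err);
    return { action: "retry", reason };
  }
}
```

In this sketch only errors thrown by the handler itself reach the retry path; a lookup failure is converted into a terminal outcome before the queue ever sees it.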
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| packages/core/src/workflow.ts | Converts “workflow not found” into WorkflowRuntimeError so runtime failure handling kicks in. |
| packages/core/src/runtime/step-handler.ts | Fails runs via run_failed when the step function is missing/non-callable, avoiding queue retry loops. |
| packages/core/src/workflow.test.ts | Updates assertions for the new workflow-not-found error type/message. |
| if (typeof workflowFn !== 'function') { | ||
| throw new ReferenceError( | ||
| // Use WorkflowRuntimeError so this is caught and properly fails the run | ||
| // rather than bubbling up to the queue for infinite retries. | ||
| // A missing workflow function indicates a code deployment mismatch that | ||
| // retries won't fix. | ||
| throw new WorkflowRuntimeError( | ||
| `Workflow ${JSON.stringify( | ||
| workflowRun.workflowName | ||
| )} must be a function, but got "${typeof workflowFn}" instead` | ||
| )} not found. This usually means the workflow was removed or renamed in a newer deployment while the run was in progress.` | ||
| ); |
| // Validate step function exists before attempting to start it. | ||
| // A missing step function indicates a code deployment mismatch that | ||
| // retries won't fix, so we fail the run immediately. | ||
| const stepFn = getStepFunction(stepName); | ||
| if (!stepFn) { | ||
| throw new Error(`Step "${stepName}" not found`); | ||
| } | ||
| if (typeof stepFn !== 'function') { | ||
| throw new Error( | ||
| `Step "${stepName}" is not a function (got ${typeof stepFn})` | ||
| ); | ||
| if (!stepFn || typeof stepFn !== 'function') { | ||
| const errorMessage = !stepFn | ||
| ? `Step "${stepName}" not found. This usually means the step was removed or renamed in a newer deployment while the run was in progress.` | ||
| : `Step "${stepName}" is not a function (got ${typeof stepFn})`; | ||
| | ||
| runtimeLogger.error('Fatal step lookup error, failing run', { | ||
| workflowRunId, | ||
| stepName, | ||
| error: errorMessage, | ||
| }); | ||
| | ||
| // Fail the run via event - this is a fatal condition that retries won't fix | ||
| try { | ||
| await world.events.create( | ||
| workflowRunId, | ||
| { | ||
| eventType: 'run_failed', | ||
| specVersion: SPEC_VERSION_CURRENT, | ||
| eventData: { | ||
| error: { | ||
| message: errorMessage, | ||
| stack: new Error(errorMessage).stack, | ||
| }, | ||
| errorCode: RUN_ERROR_CODES.RUNTIME_ERROR, | ||
| }, | ||
| }, | ||
| { requestId } | ||
| ); | ||
| } catch (failErr) { | ||
| if ( | ||
| EntityConflictError.is(failErr) || | ||
| RunExpiredError.is(failErr) | ||
| ) { | ||
| runtimeLogger.info( | ||
| 'Tried failing run for missing step, but run has already finished.', | ||
| { | ||
| workflowRunId, | ||
| stepName, | ||
| message: failErr.message, | ||
| } | ||
| ); | ||
| return; | ||
| } | ||
| throw failErr; | ||
| } | ||
| return; | ||
| } |
Move step function validation after step_started and call step_failed directly if not found. Co-authored-by: Pranay Prakash <1797812+pranaygp@users.noreply.github.com>
| // Validate step function exists AFTER step_started so we can | ||
| // properly fail the step (not the run) if the function is missing. | ||
| // This allows the workflow to handle the step failure gracefully, | ||
| // similar to how FatalError is handled. | ||
| if (!stepFn || typeof stepFn !== 'function') { | ||
| const errorMessage = !stepFn |
we need to test this behavior
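One possible shape for such a test, sketched against in-memory stand-ins rather than the real runtime: `handleStepMessage`, `StepEvent`, and the `step_completed` event name are assumptions made for illustration, and the step registry is a plain `Map`. The idea is to assert that a missing step yields `step_started` followed by `step_failed`, with nothing thrown back to the queue.

```typescript
// Hypothetical in-memory stand-ins; not the runtime's real handler or event store.
type StepEvent = {
  eventType: "step_started" | "step_completed" | "step_failed";
  stepName: string;
  message?: string;
};

function handleStepMessage(
  steps: Map<string, unknown>,
  stepName: string,
  events: StepEvent[]
): void {
  // Record the start first so a later failure attaches to the step.
  events.push({ eventType: "step_started", stepName });

  const stepFn = steps.get(stepName);
  if (typeof stepFn !== "function") {
    const message =
      stepFn === undefined
        ? `Step "${stepName}" not found`
        : `Step "${stepName}" is not a function (got ${typeof stepFn})`;
    events.push({ eventType: "step_failed", stepName, message });
    return; // nothing is thrown, so the queue has nothing to retry
  }

  stepFn();
  events.push({ eventType: "step_completed", stepName });
}

// Helper for compact assertions on event ordering.
function eventTypes(events: StepEvent[]): string[] {
  return events.map((e) => e.eventType);
}

// Example: a step that is absent from the registry.
const recorded: StepEvent[] = [];
handleStepMessage(new Map<string, unknown>(), "sendEmail", recorded);
// recorded now holds step_started followed by step_failed
```

A real test would replace the `Map` and the event array with the project's own fixtures; the assertions on event order would stay the same.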
| `Workflow ${JSON.stringify( | ||
| workflowRun.workflowName | ||
| )} must be a function, but got "${typeof workflowFn}" instead` | ||
| )} not found. This usually means the workflow was removed or renamed in a newer deployment while the run was in progress.` |
| )} not found. This usually means the workflow was removed or renamed in a newer deployment while the run was in progress.` | |
| )} must be a function but got "${typeof workflowFn}" instead. This usually means an attempt was made to start a workflow against a deployment that does not have the workflow registered` |
honestly we should probably have an error slug and an error doc explaining this. Troubleshooting this means that either
- something went wrong with the bundling/build tooling that caused the workflow to not get built correctly
- a run was started (or restarted in the UI) to target a new deployment, but the workflow was either renamed or moved to a different directory, so it is no longer accessible
- in this case, triggering a new deployment with the workflow moved back to the old location and name, and then triggering a "rerun on latest" against the run, will cause it to work again
this stuff is too long for an error message and requires an error slug + doc
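A rough sketch of what the slug + doc idea could look like: the error message stays short and points at a document that carries the long troubleshooting text. Everything here is assumed for illustration, including the `SluggedError` class, the `workflow-not-registered` slug, and the placeholder docs URL.

```typescript
// Placeholder base URL; the real docs location is not given in this thread.
const ERROR_DOCS_BASE = "https://example.com/errors";

// Hypothetical error type that carries a stable slug and a docs link.
class SluggedError extends Error {
  readonly slug: string;
  readonly docsUrl: string;

  constructor(slug: string, summary: string) {
    super(`${summary} See ${ERROR_DOCS_BASE}/${slug}`);
    this.name = "SluggedError";
    this.slug = slug;
    this.docsUrl = `${ERROR_DOCS_BASE}/${slug}`;
    // Keep instanceof working even when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Hypothetical factory for the "workflow not found" case discussed above.
function workflowNotRegistered(workflowName: string): SluggedError {
  return new SluggedError(
    "workflow-not-registered",
    `Workflow ${JSON.stringify(workflowName)} is not registered in this deployment.`
  );
}
```

The slug gives the docs page a stable address, so the troubleshooting steps can be revised without touching the error message itself.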
Previously, when a workflow or step function was not found during queue message processing, the error bubbled up to the queue handler and caused infinite retries. This is a code deployment mismatch that retries will never fix.
What changed
- Create a `run_failed` event with `RUNTIME_ERROR` code instead of throwing a plain error when a step is not found
- Change `ReferenceError` to `WorkflowRuntimeError` for workflow not found - this is already caught by the runtime and properly fails the run

Both cases now provide helpful error messages indicating this is likely a deployment mismatch (workflow/step removed or renamed in a newer deployment while run was in progress).