Skip to content

fix: break SDK iterator after result message to prevent hang in pull_request runs#1339

Open
scobbe wants to merge 1 commit into
anthropics:mainfrom
sonoma-security:fix/sdk-iterator-hang-after-result-message
Open

fix: break SDK iterator after result message to prevent hang in pull_request runs#1339
scobbe wants to merge 1 commit into
anthropics:mainfrom
sonoma-security:fix/sdk-iterator-hang-after-result-message

Conversation

@scobbe
Copy link
Copy Markdown

@scobbe scobbe commented May 22, 2026

Problem

In some workflow contexts — reliably reproducible for us on pull_request-triggered runs — the SDK query() iterator does not close after the result message is emitted. The for await loop in runClaudeWithSdk (base-action/src/run-claude-sdk.ts) blocks indefinitely after Claude has finished its work, until the workflow's timeout-minutes cap kills the job.

Symptoms observed in production (4× in our scan-reviewer workflow over the past week):

  • Claude completes successfully: SDK emits { "type": "result", "subtype": "success", ... } with cost / turns / duration set.
  • Claude's verdict has already been posted to GitHub during its turns via tools.
  • The action then sits with zero log output for the remainder of timeout-minutes (we measured 18-19 min of dead time after the result).
  • Job is cancelled at timeout. writeExecutionFile is never called → no claude-execution-output.json → downstream cost-tracker / post-steps see nothing.
  • Run shows as cancelled, even though Claude did the work correctly.

Concrete example: scan-reviewer run 26263697343 on sonoma-security/sonoma#1519:

01:48:57  Claude init (Sonnet)
01:50:05  Claude result emitted (success, 13 turns, 68 s, $0.39)
[18 min 43 s of complete silence]
02:08:48  ##[error] The operation was canceled

Author-mode (workflow_dispatch) runs from the same codebase terminate cleanly the same day (we see Log saved to .../claude-execution-output.json within ~1 s of the result message), so the hang is specific to certain event triggers — not a global SDK issue.

Fix

By SDK contract, no further messages follow a result message — it's terminal. Break out of the loop immediately after capturing it, regardless of whether the iterator itself closes cleanly.

if (message.type === "result") {
  resultMessage = message as SDKResultMessage;
+ break;
}

Defensive: eliminates the hang regardless of why the upstream iterator stays open. If the SDK is later fixed to close cleanly in all contexts, the break becomes a no-op.

Test plan

  • Author-mode runs (workflow_dispatch) that previously terminated cleanly should still terminate cleanly — no behavior change.
  • Reviewer-mode runs (pull_request) that previously hung at 20-min timeout should now exit within ~1 s of the result message.

We're running the patched version against our scan-reviewer workflow this week via a sonoma-security fork pinned in our workflow YAML; happy to report back when we have data. Glad to add a regression test (mock SDK iterator that yields result then stays open) if the project has a convention for that.

In some workflow contexts — reliably reproducible for us on
pull_request-triggered runs of this action — the Claude Agent SDK
query() async iterator does not close after the terminal result
message is emitted. The for-await loop in runClaudeWithSdk blocks
indefinitely after Claude has finished its work, until the workflow's
timeout-minutes cap kills the job.

Symptoms observed in production (4× in our scan-reviewer workflow):
- Claude completes successfully: SDK emits { type: "result",
  subtype: "success", ... } with the cost / turns / duration set.
- The action then sits with zero log output for the rest of
  timeout-minutes (we measured 18-19 min of dead time after result).
- The job is cancelled at timeout. writeExecutionFile is never
  called → no claude-execution-output.json → cost-tracker and other
  post-steps see nothing.
- Run shows as cancelled, even though Claude did its work and any
  verdict it posted via gh tools already landed.

Author-mode (workflow_dispatch) runs from the same codebase
terminate cleanly the same day, so the hang is specific to certain
event triggers.

By SDK contract the result message is terminal — no further messages
follow. Break out of the loop immediately after capturing it,
regardless of whether the upstream iterator ever closes. If the SDK
is later fixed to close cleanly in all contexts, this break becomes
a no-op.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant