fix: break SDK iterator after result message to prevent hang in pull_request runs#1339
Open
scobbe wants to merge 1 commit into
Open
Conversation
In some workflow contexts — reliably reproducible for us on
pull_request-triggered runs of this action — the Claude Agent SDK
query() async iterator does not close after the terminal result
message is emitted. The for-await loop in runClaudeWithSdk blocks
indefinitely after Claude has finished its work, until the workflow's
timeout-minutes cap kills the job.
Symptoms observed in production (4× in our scan-reviewer workflow):
- Claude completes successfully: SDK emits { type: "result",
subtype: "success", ... } with the cost / turns / duration set.
- The action then sits with zero log output for the rest of
timeout-minutes (we measured 18-19 min of dead time after result).
- The job is cancelled at timeout. writeExecutionFile is never
called → no claude-execution-output.json → cost-tracker and other
post-steps see nothing.
- Run shows as cancelled, even though Claude did its work and any
verdict it posted via gh tools already landed.
Author-mode (workflow_dispatch) runs from the same codebase
terminate cleanly the same day, so the hang is specific to certain
event triggers.
By SDK contract the result message is terminal — no further messages
follow. Break out of the loop immediately after capturing it,
regardless of whether the upstream iterator ever closes. If the SDK
is later fixed to close cleanly in all contexts, this break becomes
a no-op.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
In some workflow contexts — reliably reproducible for us on
pull_request-triggered runs — the SDKquery()iterator does not close after theresultmessage is emitted. Thefor awaitloop inrunClaudeWithSdk(base-action/src/run-claude-sdk.ts) blocks indefinitely after Claude has finished its work, until the workflow'stimeout-minutescap kills the job.Symptoms observed in production (4× in our scan-reviewer workflow over the past week):
{ "type": "result", "subtype": "success", ... }with cost / turns / duration set.timeout-minutes(we measured 18-19 min of dead time after the result).writeExecutionFileis never called → noclaude-execution-output.json→ downstream cost-tracker / post-steps see nothing.cancelled, even though Claude did the work correctly.Concrete example: scan-reviewer run
26263697343onsonoma-security/sonoma#1519:Author-mode (
workflow_dispatch) runs from the same codebase terminate cleanly the same day (we seeLog saved to .../claude-execution-output.jsonwithin ~1 s of the result message), so the hang is specific to certain event triggers — not a global SDK issue.Fix
By SDK contract, no further messages follow a
resultmessage — it's terminal. Break out of the loop immediately after capturing it, regardless of whether the iterator itself closes cleanly.Defensive: eliminates the hang regardless of why the upstream iterator stays open. If the SDK is later fixed to close cleanly in all contexts, the
breakbecomes a no-op.Test plan
workflow_dispatch) that previously terminated cleanly should still terminate cleanly — no behavior change.pull_request) that previously hung at 20-min timeout should now exit within ~1 s of the result message.We're running the patched version against our scan-reviewer workflow this week via a sonoma-security fork pinned in our workflow YAML; happy to report back when we have data. Glad to add a regression test (mock SDK iterator that yields result then stays open) if the project has a convention for that.