Skip to content

[examples] add coding_agent_rl: agent-in-sandbox RL minimal demo#1923

Open
jingshenghang wants to merge 36 commits into
THUDM:mainfrom
jingshenghang:coding-agent-rl-example
Open

[examples] add coding_agent_rl: agent-in-sandbox RL minimal demo#1923
jingshenghang wants to merge 36 commits into
THUDM:mainfrom
jingshenghang:coding-agent-rl-example

Conversation

@jingshenghang
Copy link
Copy Markdown
Collaborator

End-to-end loop for "coding agent + sandbox execution + test reward":

  1. Boot an E2B sandbox per sample (dataset metadata.image).
  2. Install Node 22 + Claude Code CLI inside; create unprivileged agent user; chown repo; write problem_statement.md.
  3. Launch claude --output-format stream-json pointed at a head-node bridge via ANTHROPIC_BASE_URL; ANTHROPIC_AUTH_TOKEN doubles as the session id for concurrent request demux.
  4. bridge.py translates each /v1/messages call into a SGLang /generate call and streams an Anthropic SSE reply back; per-session it keeps (prompt_ids, response_ids, loss_mask) so no post-hoc retokenize.
  5. After the agent finishes, a fresh eval sandbox applies the model git diff and runs the dataset's tests -> 0/1 reward.
  6. generate.py drops the bridge-collected tokens straight into Sample.

Layout under examples/coding_agent_rl/:
generate.py slime custom-generate entrypoint
sandbox.py all sandbox-side ops (boot/exec/upload, install
Node/Claude Code, run agent, git diff, eval)
bridge.py Anthropic Messages API <-> SGLang /generate
shim; model-agnostic
run_glm47_355b.sh 8-node / 64-GPU / colocate / E2B launch
script for GLM-4.7-355B-A32B
README.md walkthrough + dataset schema + swap-model howto

Model-agnostic: chat template via tokenizer.apply_chat_template; tool call parsing via sglang FunctionCallParser; reasoning parsing via sglang ReasoningParser. Swapping model = changing SWE_TOOL_PARSER / SWE_REASONING_PARSER envs (default glm47/glm45).

No new swe_rollout.py: reuses slime's default sglang_rollout outer loop via --custom-generate-function-path.

@jingshenghang jingshenghang force-pushed the coding-agent-rl-example branch from 585ef36 to f2fd320 Compare May 19, 2026 09:40
Comment thread examples/coding_agent_rl/bridge.py Outdated
# Canonical chat log. Each assistant turn we append after /generate carries
# reasoning_content so the next round's apply_chat_template re-render matches
# the tokens the model actually emitted (preserving prefix match).
glm_messages: list[dict] = dataclasses.field(default_factory=list)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

应该叫 messages 就可以了

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

在重构中,改成了 chat_messages

Comment thread examples/coding_agent_rl/bridge.py Outdated
@@ -0,0 +1,424 @@
"""Anthropic Messages API <-> SGLang /generate bridge.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

我建议改成叫 middleware 之类的东西,主要是现在 slime 里面已经有 mbridge 和 megatron bridge 了,容易混淆。

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

改成了 middleware.py 防止混淆

Comment thread examples/coding_agent_rl/middleware.py Outdated
lock: asyncio.Lock = dataclasses.field(default_factory=asyncio.Lock)


class _Store:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

python 的 asyncio 是不是单线程的?需要这个带锁的 session 吗?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

asyncio 单线程的语义只在没有 await 暂停时才成立。_run_turn 中间有 await _post_generate(...)(一次 sglang HTTP,秒级到分钟级),同一个 session_id 在这段窗口里再来一个请求(claude-code 正常不会,但 Anthropic SDK 的 retry / 客户端竞速会触发),两个协程会交错改 chat_messages / response_ids / pending_raw_tokens,TITO 校验立即崩。锁是 per-session 的,单会话只串行一个 turn;不同 session 仍然完全并发。

Comment thread examples/coding_agent_rl/middleware.py Outdated
"message": "Authorization Bearer <session_id> required",
}}, status=400)

s = await store.get(session_id)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

有可能需要加个提示,session_id 不能重复

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sid 改为 f'cagent-{instance_id}-{sample.index}-{sample.group_index}',靠 Sample 自带字段构造唯一

Comment thread examples/coding_agent_rl/middleware.py Outdated
ideal_text = tok.apply_chat_template(
s.glm_messages, tools=s.tools_schema, tokenize=False, add_generation_prompt=True,
)
ideal_ids = tok.encode(ideal_text, add_special_tokens=False)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里是不是 tokenize=True 就行?token=True 还能设置 add_special_tokens=False 吗?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

是的,已修改成 tokenize=True

    valid = {i: tup for i, tup in asst_raw_tokens.items() if 0 <= i < len(chat_messages)}
    if not valid:
        ids = tok.apply_chat_template(
            chat_messages,
            tools=tools_schema,
            tokenize=True,
            add_generation_prompt=True,
        )
        return list(ids), []

Comment thread examples/coding_agent_rl/middleware.py Outdated
else:
logger.warning("[bridge] %s template-rerender mismatch; rebaselining", session_id)
s.response_ids = ideal_ids[len(s.prompt_ids):]
s.loss_mask = [0] * len(s.response_ids)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里是按如果不匹配会 mask 掉整条 response 来做的是吗?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def _decode_and_verify(
    tok: Any,
    target: Chain,
    output_ids: list[int],
) -> str:
    """Decode output_ids; if the round-trip doesn't match, zero the loss_mask
    over this turn (cause: tokenizer ambiguity). Returns raw decoded text."""
    n = len(output_ids)
    if n == 0:
        return ""
    raw_output = tok.decode(output_ids, skip_special_tokens=False)
    if not verify_tito_for_turn(tok, raw_output, output_ids):
        target.loss_mask[-n:] = [0] * n
        logger.warning("[middleware] TITO mismatch; loss_mask zeroed (n=%d)", n)
    return raw_output

是的,会 mask 掉整条 response

"content": [], "stop_reason": None, "stop_sequence": None,
"usage": {"input_tokens": in_tokens, "output_tokens": 0}},
}))
for idx, b in enumerate(blocks):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

想确定一下,这里是会先把整个 streaming 的请求变成同步的吗?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

是的,_post_generate 走 sglang 的非流式接口拿完整 output_ids,然后 _stream_anthropic_sse 把它当一段 SSE 一次性吐回 claude-code(claude-code 协议只接受 stream,但内容可以攒齐)。

流式增量 retokenize 会让 TITO 校验在每个 chunk 边界都可能切坏 BPE,所以这里攒够 streaming ,再一次性发送

jingshenghang and others added 26 commits May 25, 2026 03:47
A minimal, readable example of coding agent + sandbox execution + test
reward in slime (~1500 LoC across 4 files). One training sample:

  spin up a sandbox -> run Claude Code inside it -> capture the
  model-produced git diff -> spin up a SECOND clean sandbox, apply the
  diff, run the dataset's tests -> 0/1 reward -> feed the actual
  generated tokens (with loss-mask) back to slime, no re-tokenization.

Wire-up is one CLI flag:

  --custom-generate-function-path examples.coding_agent_rl.generate.generate

slime's default sglang_rollout.generate_rollout outer loop is reused;
only the per-sample generate() is swapped.

Files:

* generate.py - per-sample entrypoint slime calls. Provision sandbox ->
  drop PROBLEM_STATEMENT.md -> run agent -> git diff -> eval in a fresh
  sandbox -> fill Sample.
* sandbox.py  - E2B sandbox backend. Boot/kill, exec/upload,
  install_node22 + install_claude_code, long-running agent spawn with
  done-marker poll, git_diff, fresh-sandbox eval runner (swepro /
  f2p_script / eval_cmd).
* bridge.py   - head-node aiohttp shim. Translates the agent's
  Anthropic Messages API into slime's SGLang /generate (token-native +
  logprobs) and keeps (prompt_ids, response_ids, loss_mask) per session
  so the trainer skips re-tokenization. Model-agnostic.
* run_glm47_355b.sh - reference launch script (GLM-4.7-355B-A32B,
  8 nodes / 64 GPUs, colocate, E2B). All required env vars guarded by
  \${VAR:?...}; no operator-specific paths.
* README.md   - file table, sample flow diagram, dataset schema (flat
  and remote_env_info layouts), required vs optional env knobs,
  "Swap things out" recipes (model / agent / sandbox backend), and
  design notes (no re-tokenization, reasoning round-trip, done-marker
  poll, boot semaphore + retry).
…metadata configurable

* bridge.py renamed to middleware.py (file + class BridgeHandle ->
  MiddlewareHandle + log prefix + thread name). The chat-log dataclass
  field glm_messages was also renamed chat_messages -- the example is
  model-agnostic and the GLM-specific naming was misleading.

* sandbox.py no longer hardcodes ``glm-platform/*`` metadata keys.
  Set SWE_SANDBOX_METADATA_JSON='{"my-platform/size":"lg",...}' to pass
  arbitrary routing tags into AsyncSandbox.create(metadata=...). Default
  is empty, which works for stock E2B accounts.

* generate.py defaults SWE_TOOL_PARSER / SWE_REASONING_PARSER to None
  (no hardcoded GLM-specific parser fallback). The reference launch
  script run_glm47_355b.sh still sets these to glm47/glm45.

* sandbox.run_claude_code's ``bridge_url=`` kwarg renamed to
  ``middleware_url=`` (caller in generate.py updated).

* README + run_glm47_355b.sh updated for the rename and the new
  metadata env var.
…ddleware

- middleware: track each /messages call as a `_Turn` (request snapshot,
  response, finish/stop reason, parent prefix), expose via pop_session
  return value and `record_tree` opt-in.
- middleware: detect non-linear message updates by hashing prior messages
  and rebuild the prompt when the client diverges, instead of silently
  appending. Also translate Anthropic `thinking` blocks into
  `reasoning_content` so prior reasoning is preserved across turns.
- generate: add SWE_SAVE_TRAJECTORY_TREE env knob; when set, stash the
  exported tree under sample.metadata["trajectory_tree"]. Also allow
  overriding the Claude Code prompt via SWE_CC_PROMPT.
- sandbox: pass --include-partial-messages / --include-hook-events to
  claude CLI and allow extra args via SWE_CLAUDE_EXTRA_ARGS; quote the
  trajectory path with shlex.
Captures the in-place rollout/middleware state after the 2026-05-21 work:

middleware.py:
  - Raw-token preservation: _attach_pending_raw_tokens / _render_with_raw_splice
    keep model-generated tokens verbatim across tool turns so loss_mask aligns
    with actual sglang output, eliminating the chat_template re-render drift
    that was wiping 95% of segments in earlier runs.
  - cache_control stripping in _strip_cache_control (Anthropic adds blocks
    sglang doesn't accept).
  - _classify_branch overhaul: detects compact / compact_summarization /
    sibling and falls back to linear, with linear-extend check ordered AHEAD
    of the autoCompact-marker check so post-compact linear chains do not get
    mis-flagged as a 'compact storm' purely because the resume pack floats
    around in m[0..1].
  - _is_compact_resume_request scans m[0..1] (the resume pack can land in
    either when claude-code splits a Read tool call from its result).
  - Per-turn trajectory_tree dump (id, parent_id, parent_prefix_len,
    input_len, output_len, branch_kind, request, response) under
    metadata.trajectory_tree when SWE_SAVE_TRAJECTORY_TREE=1.
  - Abort/resume hardening: poll budgets, attempt caps, raw-token replay.

generate.py:
  - Sandbox provisioning (concurrency-capped, retried) via _provision_sandbox.
  - SWE_TIME_BUDGET_SEC / SWE_MAX_RESPONSE_TOKENS / SWE_CC_PROMPT plumbing.
  - _cap_response_for_training caps the dumped response_ids to bound the
    per-sample memory pressure (entropy_input.clone()).

sandbox.py:
  - ensure_agent_user / apply_before_repo_set_cmd hooks for the GLM
    swedev/sweap-images family.
  - E2B reply-path host pin (SLIME_HEAD_HOST -> 172.27.14.209).

This commit lands the "one tree per sample" branching mode. The companion
branch list_trajectory will split the same tree into multiple trainable
trajectories (one per sub-agent dispatch / compact boundary / final chain).
…ple training

In tree_trajectory mode the middleware drops pre-wipe tokens at every chain
rebase, so only the final post-last-compact tail ever feeds the trainer.
list_trajectory mode preserves them.

middleware.py:
  - _Session.completed_trajectories: list of (prompt_ids, response_ids,
    loss_mask, meta) snapshots taken just before every wipe. meta records
    {"kind": "pre_wipe"|"diverge_reset", "completed_turns": <int>} so the
    downstream Sample fan-out can label what kind of boundary produced it.
  - pop_session_split / _pop_session_split: variant of pop_session that
    returns the snapshot list (plus the still-live final chain) and the
    same trajectory_tree export as before.

generate.py:
  - SWE_LIST_TRAJECTORY env switch. Default 0 keeps the tree_trajectory
    single-Sample behavior; 1 calls pop_session_split and emits one
    deep-copied Sample per non-empty segment.
  - Each fanned-out Sample carries the same per-instance reward
    (sample-level reward propagates verbatim) plus segment-specific metadata
    (segment_idx, num_segments, segment_kind, completed_turns_at_split).
  - trajectory_tree is attached only to the first segment to keep dumps
    compact; viz still gets the full per-instance picture.

slime's sglang_rollout already supports custom_generate_func returning
list[Sample] (see generate_and_rm:283 `if isinstance(sample, list)`), so no
training-loop changes are required.

Smoke test (/tmp/smoke_list_trajectory.py): 4 PASS — empty session, 2-segment
fan-out, empty-final drop, 4-segment multi-wipe chain.
… + nearest-parent selection

Implements MASTER_PLAN.md §2 (C1-C6) and §3 (smoke + offline reclassify):

- C1 (middleware._export_tree): bump version 2 -> 3, emit initial_prompt_len.
- C2 (middleware._is_summarization_request): also scan non-last user msgs,
  but skip ones containing _COMPACT_RESUME_MARKER (avoids false positives on
  embedded resume packs that carry prior summary text).
- C3 (middleware._classify_branch): module-level LINEAR_DEFICIT_TOK=2048;
  new near-linear tolerance channel for chat_template re-render drift. Guard
  loosened from spec's `pfx > par.in + par.out/2` (rejected sib 14/22) to
  `pfx > par.in` (consumed >=1 byte of par.output); still rejects true forks
  via the deficit > 2048 condition. Justified in RECLASSIFY_V3_REPORT.md.
- C4 (middleware._classify_branch): root-parent special case when
  initial_prompt_len is 0 (legacy dumps) but parent.id==0 + resume marker
  present + pfx nearly fills root.full -> compact.
- C5 (middleware._new_turn): two-pass parent selection. Pass 1 walks
  reversed(s.turns) and picks the most recent prev where prefix_len >=
  prev.full_len - 2048 AND prefix_len > prev.input_len. Pass 2 (if pass 1
  found none) falls back to original longest-prefix-match.
- C6 (reclassify_existing_rollout.py): explicit assertion on
  initial_prompt_len > 0; legacy-dump soft-fall to turns[0].input_len with
  a single per-dump warning.

Smoke test: 15/15 PASS (smoke_classifier_v3.py covers cases 1-6 from v2 +
the 9 new cases from spec_1 §4 table).

Reclassify deltas (full _new_turn replay):
- compact_only/rollout_0.pt   : OLD={root:16, linear:388, compact:38, sibling:5, compact_summ:14}
                                NEW={root:16, linear:404, compact:20, sibling:1, compact_summ:20}
- 100k/rollout_0.pt (full)    : OLD={root:16, linear:385, sibling:96, compact:9}
                                NEW={root:16, linear:448, compact:20, sibling:0, compact_summ:22}
- 100k/rollout_0.pt sample 11 : OLD={root:1, linear:51, sibling:22}
                                NEW={root:1, linear:65, compact:3, compact_summ:5, sibling:0}

sample-11 sibling == 0 PASS. compact_summ overshoot vs spec target traced to
real summarization requests caught by the v3 marker check (NOT regression);
see RECLASSIFY_V3_REPORT.md "Notes" section for the per-turn breakdown.
…es + segment cap

- Port the list_trajectory fan-out from list_trajectory branch into tree_trajectory
  (gated by SWE_LIST_TRAJECTORY=1, default 0 keeps single-Sample behavior).
- Add SWE_LIST_TRAJECTORY_REWARD_MODE = {uniform | copy | final_only}, default
  uniform (reward / num_segments). copy keeps the original per-segment full
  reward; final_only zeros all but the kind=='final' segment.
- Add SWE_LIST_TRAJECTORY_MAX_SEGMENTS (default 8) hard cap. When K > cap,
  keep segments[:cap-1] + segments[-1:] (head + final) to bound GBS growth
  and GRPO baseline dilution. Drops the middle on purpose; logged WARN.
- Propagate list_trajectory_reward_mode + list_trajectory_finish_reason into
  each fanned Sample's metadata for downstream tally / reward-allocation
  audit. finish_reason may be None until sub-agent B wires the field.
- Extract reward allocation into pure _segment_reward() so the smoke test
  can validate without spinning up middleware.
- New smoke_reward_modes.py: 6 cases covering uniform, copy, final_only,
  N=1, N=0 defensive, and final_only with no final segment. All PASS.

Spec: 0521/specs/MASTER_PLAN.md §6, §A; spec_3 §3 row generate.py:312, §5 item 3.
…endent segments

Implements MASTER_PLAN.md §5 (subagent segment split) for the list_trajectory
branch. Adds per-dispatch state tracking via a new `_SubSession` dataclass and
a `subagent_stack` field on `_Session`, plus dispatch / return detection logic
inside `_handle_messages`.

User decisions from MASTER_PLAN preamble:

  1. Subagent segment prefix = subagent's OWN system prompt + initial task,
     never the main-line prefix from before the dispatch. Implemented by
     letting each `_SubSession` hold its own `prompt_ids` / `chat_messages` /
     `initial_prompt_len` and renderer-state; the subagent's segment in
     `completed_trajectories` uses those rather than reusing main-line tokens.
  2. Whitelist of segment kinds: only `subagent` / `pre_wipe` / `final`.
     Removed the `diverge_reset` snapshot append at the prefix-divergence
     reset site (was middleware.py:~1126); divergence now just drops the
     current chain and the next chunk lands in the next pre_wipe / final.
     `compact_summarization` was never emitted as a segment so no change.
  3. Nested subagents: only the outermost (depth==1) pop emits a `subagent`
     segment. Deeper pops merge into the parent via `_merge_into_parent`,
     which concatenates `response_ids` and forces `loss_mask=1` for the
     merged tokens. The outer segment's metadata records `nested_depth =
     max(seen)` so post-hoc analysis can tell flat from nested subagents.
  4. Subagent identification: hash `body["system"]` (`_hash_obj`) and
     compare against the main-line cached `system_hash`. Different hash
     while a Task/Agent dispatch marker is pending -> push a new sub-session.
  5. Fallback: when identification is ambiguous (missing system, missing
     dispatch marker, etc.) route to main line - never invent a sub-session.

Detection scheme:

  * On every main-line response, scan `blocks` for `tool_use` with name in
    {Task, Agent} via `_find_dispatch_tool_use_id`; record id on
    `s._pending_dispatch_id` (or on the active sub-session for nested
    dispatches).
  * On every incoming request, before routing, scan new user messages for
    a `tool_result` with the pending dispatch id via `_has_tool_result_for`.
    Match -> pop top of `subagent_stack` and snapshot (or merge if nested).
  * Routing decision: top-of-stack `system_hash` match -> route to that
    sub-session; new `system_hash` with pending dispatch -> push a new
    sub-session; else -> main line.
  * `_pop_session_split` flushes any still-open sub-sessions at session
    pop (claude-code exits mid-dispatch) and appends `last_finish_reason`
    to the `final` segment metadata.

Smoke test: examples/coding_agent_rl/smoke_list_trajectory_v2.py covers
all 10 cases from spec_3_list_trajectory.md §4 (plus helper sanity); all
10 PASS. Case 8 (subagent + compact mix) is interpreted as 3 segments
[subagent, pre_wipe, final] (not the 5 mentioned in the spec note) -
the canonical semantics keep main-line linear continuation in one segment
unless a wipe or dispatch changes the routing target; documented in the
case 8 comment block.

Files:
  examples/coding_agent_rl/middleware.py     +320 / -87 (new _SubSession,
                                              routing helpers, dispatch/return
                                              detection, target-based handler,
                                              updated _pop_session_split)
  examples/coding_agent_rl/smoke_list_trajectory_v2.py  +new (10 cases)
Implements SPEC §9 (Visualization).
- viz/swimlane.py: per-sample swimlane Gantt (matplotlib broken_barh);
  color by segment_kind, red-hatch overlay when tito_masked_turns > 0,
  ↻N annotation when num_aborts > 0
- viz/stacked.py: 1-row-per-sample horizontal stacked bars for batch overview
- both CLIs accept rollout dump .pt globs and load via torch.load
Combines SPEC §10.1 PRs THUDM#2-THUDM#5 into one atomic rewrite because all four touch
the same single file (M1 decision: no sub-module split). Each PR's intent
preserved as a distinct §section banner.

PR THUDM#2 - §0 TOC docstring + §2 DATACLASSES (Turn, SubSession, SubSnapshot, Session)
  - Turn loses branch_kind/parent_id/parent_prefix_len/full_ids (0521 legacy)
  - Turn gains tito_masked: bool (U3)
  - SubSession gains num_aborts / last_finish_reason / tito_masked_turn_count
  - SubSnapshot adds finish_reason / num_aborts / tito_masked_turn_count
  - Session: active_subagent (flat) + completed_subagents + _emit_order (I6)
  - Session: num_aborts / num_aborts_this_turn / tito_*_turn_count
  - §3 PRIMITIVES & §4 STORE banners + Store.open_session(record_raw_dump=)

PR THUDM#3 - §5 TRANSLATE adds verify_tito_for_turn (D2)
  - pure function: retokenize(decode(output_ids)) == output_ids
  - §6 ENGINE keeps AbortCoordinator + generate_with_abort_resume (P2 verbatim)

PR THUDM#4 - §7 SEGMENTS rewrite
  - delete classifier triple: _classify_branch / _new_turn / 2-pass parent
  - delete _COMPACT_RESUME_MARKER text sniff + _SUMMARIZATION_MARKERS
  - delete _SubSession.subagent_stack nested stack -> flat active_subagent
  - new pick_target with nested-dispatch fail-safe (R2 CC3)
  - new classify_and_apply (4-condition is_append + pre_wipe snapshot)
  - new snapshot_subagent + pop_session_split (chronological replay of _emit_order)
  - segment meta key renamed kind -> segment_kind (U3 / R3)
  - U3: per-segment finish_reason + num_aborts + tito_masked_turns

PR THUDM#5 - §8 HANDLER 15-step rewrite + §9 SHELL
  - _handle_messages: 380 lines -> ~140 lines, numbered 1-15
  - step 12: I8 (n>0 guard) + I9 (abort skip TITO) + per-turn mask (only [-n:])
  - Session.lock is sole sync primitive (P6); helpers never re-acquire (I3)
  - open_session(record_tree=) renamed to open_session(record_raw_dump=)
  - pop_session() single-segment API removed (list mode always on)
  - _export_raw_dump emits v4 schema (drops full_ids, adds tito_masked/etc)

Tests (SPEC §7.1): 22 cases across 5 files, all passing
  - test_segments_classify.py    - 7 cases (pre_wipe, nested fail-safe, etc.)
  - test_translate_tito.py       - 6 cases incl. test_empty_turn_skip (I8 fix)
                                   and test_abort_skip_tito (I9)
  - test_engine_abort_resume.py  - 3 cases (concatenate / max-attempts / budget)
  - test_session_lock.py         - 2 cases (same-sid serialize, different sid)
  - test_pop_session_split.py    - 4 cases (chronological, drain, U3 fields)
  - fixtures/README.md           - schema doc
Implements SPEC §4 (generate.py) + §5 (sandbox.py -142) + §8 (5 run scripts).

generate.py:
  - drop pop_session_split() / pop_session() dual API (list mode always on)
  - drop SWE_LIST_TRAJECTORY / SWE_LIST_TRAJECTORY_REWARD_MODE /
    SWE_LIST_TRAJECTORY_MAX_SEGMENTS env (replace with reducer hook)
  - rename SWE_SAVE_TRAJECTORY_TREE -> SWE_DUMP_RAW_TRAJECTORY (Q5 / semantic
    narrowing: now only attaches to segment 0 metadata, not a tree)
  - new _default_uniform_fan_out (D4/Q7 reward/K)
  - new _load_reducer with import fail-soft fallback
  - new _fan_out_with_fail_soft (U4: 4 fail paths -> sample abort + metric)
  - shallow copy.copy per sub-sample (R2 deepcopy memory advice)
  - --swe-segment-reducer-path opt-in CLI arg (read via getattr; no slime
    arg-parser change needed; default reducer used when absent)

sandbox.py (532 -> 388 lines, SPEC §5 trim):
  - drop ~50 lines of redundant module/RPC docstrings (README is reference)
  - merge _setup_swepro_assets + _apply_pre_commands into evaluate() body flow
  - preserve P-equivalent: _rpc_retry / E2BSandbox 5-primitive contract /
    install_node22 host-decompress / run_claude_code done-marker poll /
    evaluate 3-mode branch (swepro / f2p_script / eval_cmd)

run_qwen36_base.sh + 4 scenario shells (SPEC §8):
  - run_qwen36_base.sh         : DRY common config + exec_train helper
  - run_qwen36_linear_2steps   : autoCompactWindow=1M + disable Agent/Task
  - run_qwen36_subagent_2steps : investigator agent + force dispatch prompt
  - run_qwen36_compact_2steps  : autoCompactWindow=100K + disable subagents
  - run_qwen36_mixed_2steps    : investigator + autoCompactWindow=100K
  - RUNTIME_ENV_JSON no longer forwards deleted SWE_LIST_TRAJECTORY* keys

test_reducer_failure.py (SPEC §7.1 U4): 8 cases all passing
  - import path bad -> falls back to default + logs error
  - reducer not callable -> falls back
  - reducer raises -> sample abort
  - reducer returns None / non-list -> sample abort
  - happy path passthrough

Delete obsolete 0521 smoke scripts:
  - smoke_classifier_v3.py / smoke_list_trajectory_v2.py / smoke_reward_modes.py
  - reclassify_existing_rollout.py (0521 archive replay; superseded by SPEC §7.3)

Tests: 30/30 passing across all 6 test files.
Implements SPEC §10.3 (v0.3 round 3, U5 decision).

- mv slime/utils/aiohttp_threaded.py -> examples/coding_agent_rl/aiohttp_threaded.py
- middleware.py:69 import: from slime.utils.aiohttp_threaded -> from aiohttp_threaded
  (bare import to match sub-agent's existing sys.path + bare-import style;
   examples/coding_agent_rl/ is not a package so SPEC's relative-import form
   `from .aiohttp_threaded` doesn't apply here)

Notes:
- Logically part of PR THUDM#2 (dataclass/docstring cleanup) per SPEC §10.3, but
  filed as a standalone commit because the original PR THUDM#2-THUDM#5 commit (7091189)
  is no longer HEAD (HEAD is PR THUDM#6 / 159b2b0); amending HEAD would
  semantically corrupt PR THUDM#6 scope. User may interactively rebase to fold
  this into 7091189 if desired.
- No 0521 legacy test/script files in this worktree, so only 1 import
  needed updating (vs SPEC §10.3 listing 5 files — those don't exist here).
- All 30 smoke tests still pass.
Train backward on 152K-token samples OOMs at fused_cross_entropy alloc
even with chunked log_probs (chunk_size=2) and reduced sglang
mem-fraction. Root cause: actor activation peak on the longest sample
exceeds 80 GiB despite CP=8.

Cap per-sample total tokens at SLIME_MAX_SAMPLE_TOKENS in
_get_rollout_data after Sample.from_dict (covers both load_debug and
live rollout paths). Truncates response tail first; if prompt itself
exceeds cap, collapses sample to no-gradient loss_mask=[0] (same
contract as _abort).

Verified on agent_only/rollout_0.pt (303 MiB, max 152K tokens) via
oom_repro_load_debug.sh with cap=100K: 4/16 samples capped (1
truncated, 3 collapsed), step 0 SUCCEEDED, backward microbatch 4/4
finished in 19s, GPU peak free 5.62 GB.

Default 0 = no cap (back-compat). See 0521/OOM_DIAGNOSIS.md §9.
Stage 8B real E2E run failed with ModuleNotFoundError on aiohttp_threaded:
middleware.py used a bare `import aiohttp_threaded` that only resolved when
coding_agent_rl/ itself was on sys.path -- which the in-tree smoke tests
set up but ray rollout workers (loading via importlib from external cwd)
never do.

Stage 9 fix was a dual-mode try/except. Stage 10 promoted to clean package
mode:
  * add __init__.py making examples.coding_agent_rl an explicit package
  * middleware.py: `from .aiohttp_threaded import AppHandle, ...`
  * collapse smoke-test sys.path hack to a single worktree-root insert
  * tests import via `from examples.coding_agent_rl import middleware`
  * add test_external_cwd_import.py as B2 regression guard: subprocess
    in /tmp imports the three modules to assert ray-worker-mode stays green

34/34 smoke tests pass.
Root cause (stage14_oom_rootcause_fanout.md, 2026-05-24):

0522 refactor removed the SWE_LIST_TRAJECTORY env knob and made fan-out
the unconditional path. archive-config (CP=4, SGLANG=0.75, no margin)
that the main repo runs cleanly in 32 min on this dataset failed with
GPU wake_up DENY at 6.48 GB and CE-forward OOM (73.7 GB allocated /
1 GB free / 20 MiB CE buffer fail). Investigation traced the failure
not to per-sample length but to ray rollout buffer sample-count
explosion (16 -> ~64-80 samples after fan-out, ~4-5x). Bigger ray.put +
host pinned mem footprint -> wake_up denied -> downstream CE OOM.

Fix: restore SWE_LIST_TRAJECTORY env knob (default 0); in the default
path, collapse to the FINAL segment (the reward-bearing
post-final-compact-reset segment). This matches main-repo behavior and
restores the archive baseline. Setting SWE_LIST_TRAJECTORY=1 still
opts into per-segment fan-out for users who want it.

Verified: fix_listtraj0_20260524_022842 PASS — 2 train steps in 45 min,
wake_up 7.5s / 11.97s (vs main-repo 9.5s / 10.07s baseline),
num_samples=16, ray job SUCCEEDED.

Implementation:
  * extract _collapse_to_final_segment helper (testable in isolation)
  * SWE_LIST_TRAJECTORY env constant w/ docstring explaining trade-offs
  * 2 new smoke tests (test_collapse_to_final_segment_picks_last_segment
    + test_collapse_with_single_segment_keeps_it) directly verify the
    short-arm of Plan D from the 0524 segment-per-sample research
  * 36/36 tests pass

Long-term direction (Plan D long-arm) lives on a separate branch:
per-segment fan-out + PR THUDM#1933 per-rollout reducer port. See
docs/project_memory/1.segment_per_sample_oom/README.md.
…ples

Port from THUDM/slime PR THUDM#1933 step 9 (R3 SWE-side glue):
- _default_uniform_fan_out now sets sub.rollout_id = sample_proto.index
  on every fan-out sibling. With the PR-THUDM#1933 reducer this makes the
  loss treat the K segments as one trajectory (per-rollout mean) rather
  than K independent samples (over-counting by K).

Kept intact: SWE_LIST_TRAJECTORY=0 collapse path (stage 14 OOM fix
fallback), _fan_out_with_fail_soft wrapper, sub.index (shared across
siblings so framework group_by(group_index) still works for GRPO).
…ssthrough

Two new CPU-only test files under examples/coding_agent_rl/tests/ so
they get picked up by the existing pytest config without touching
pyproject testpaths or top-level tests/.

test_dp_schedule.py (7 tests, ~120 LOC):
- static one-step-one-sample-per-rollout invariant check
- dynamic compact (2 samples/rollout) keeps same-rollout samples in
  one step
- trailing rollouts get trimmed
- < gbs rollouts raises AssertionError
- step_size < dp_size raises AssertionError
- balance_data round-robin vs KK partition

test_rollout_id_passthrough.py (3 tests):
- compact K=3 fan-out samples share rollout_id and rollout_mask_sums
- default 1-sample-per-rollout falls back to index
- SWE _default_uniform_fan_out wires rollout_id from sample_proto.index
E2 retry 2 run planD_e2_pr1933_fanout_20260524_073011 was killed at
~27 min by an external `concurrent.futures.TimeoutError` set on the
RolloutManager.generate() future (15/16 trajectories completed,
1 still in sandbox.evaluate; see r5 doc). No env-tunable timeout
(SWE_SANDBOX_LIFETIME_SEC=3600 / SWE_TIME_BUDGET_SEC=1200 /
SWE_EVAL_TIMEOUT_SEC=600 / e2b REQUEST_TIMEOUT=60s) explains the
27-min number -- root cause likely a Ray job heartbeat or an
undiscovered wrapper, not in our code.

Fix (Plan B): wrap generate() body in asyncio.wait_for(
  _generate_inner(...), timeout=SWE_GENERATE_GUARD_SEC), where
SWE_GENERATE_GUARD_SEC defaults to SWE_TIME_BUDGET_SEC +
SWE_EVAL_TIMEOUT_SEC + 180 (1980s). On timeout, sample is aborted
with reason `wall_clock_timeout` and a snapshot of pending asyncio
tasks is logged for future debugging. This makes a single hung
trajectory a recoverable per-sample abort instead of a whole-step
ray job failure.

Note: asyncio.TimeoutError IS builtin TimeoutError in Python 3.11+
(they were unified), so the except catches both.

Added test_generate_guard_aborts_on_timeout in test_reducer_failure.py.
Verified: 47/47 smoke tests pass.

This fix should also be cherry-picked to list_trajectory_0522_exp
when convenient -- the timeout path is shared between E0 (collapse)
and E2 (fan-out).
Branch list_trajectory_0522_pr1933_e2_v3 (off pr1933 HEAD @ 82135cc).
This experiment branch isolates v3 runs from the staging pr1933 branch
so we can iterate without polluting the merge candidate.

Diff vs planD_e2_pr1933_fanout.sh:
  * SCENARIO renamed to planD_e2_v3_with_guard
  * SWE_GENERATE_GUARD_SEC=1980 exported + added to runtime_env keys
    so ray workers see it (default in generate.py is the same; explicit
    here so the value is logged)
  * Updated header docstring to point at branch + previous failure

PASS criteria:
  - At least 1 train step completes (compare to E2 retry that did)
  - If any trajectory hangs > 1980s, expect `wall_clock_timeout` abort
    + run continues (vs E2 retry 2 where the whole ray job died)
…ignment

E2 v3 attempt 2 (run 20260524_110039) died at dp_schedule.py:176:
  num_mbs (23) is not a multiple of dp_size * mb_group (8); static path.

Root cause: fan-out produced 23 samples from 16 rollouts (variable K).
With static micro_batch_size=1 the static path requires num_samples %
(dp_size * mb_group) == 0, which fan-out variability cannot guarantee.

Earlier E2 retry (073011) only worked because the legacy sample-count
trim accidentally landed on 32 (33 -> trim 32, 32 % 8 == 0). After
the trim-by-rollout fix (a6af128), trim preserves 16 rollouts intact
and the 23 fan-out samples now hit this static-path assertion that
the accidental trim hid.

Fix: add --use-dynamic-batch-size so dp_schedule.py:166-174 takes the
dynamic-path branch that calls expand_bins_by_splitting to grow mbs
count to the alignment target. Token-budget packing also handles the
variable K naturally.

No code change; this is config only on the v3 experiment branch.
Verified the examples/coding_agent_rl refactor end-to-end on
archive_clone_2step_0524_fanout.sh; surfaced and fixed two latent bugs
that blocked rollout 0 / rollout 2->3 transition; Run 3 completed
10/10 rollouts in 4h03m with 0 fatal errors.

Fixes:

1) examples/coding_agent_rl/generate.py: fan-out abort path returned
   a bare Sample while the success path returns list[Sample] under
   SWE_LIST_TRAJECTORY=1. The mixed shape sneaks past
   _get_rollout_data's flatten loop and AttributeError's later in
   _save_debug_rollout_data. Added _abort_result() wrapper that
   wraps the abort Sample in a single-element list when fan-out is
   on, and routed all 5 abort return sites through it.

2) slime/ray/rollout.py: defensive normalization in
   _get_rollout_data — if any element of the live-rollout result is
   a list, wrap bare Samples to [Sample] before chain.from_iterable,
   then collapse remaining nesting.

3) slime/backends/sglang_utils/sglang_engine.py: SGLang scheduler
   asserts is_fully_idle() inside release_memory_occupation, but
   flush_cache only drains the radix cache, not in-flight forwards.
   This race crashed scheduler_0 (exit -3, unrecoverable) at the
   rollout 2 -> rollout 3 train transition. Added _wait_for_idle()
   that polls /get_load until num_reqs + num_waiting_reqs == 0 on
   every DP rank (60s timeout, warn-and-proceed on miss); call it
   between flush_cache and release_memory_occupation.

Refactor surface (already in the working tree, verified by Run 3):
- examples/coding_agent_rl/middleware.py simplified by ~1100 lines
- examples/coding_agent_rl/sandbox.py / README.md aligned

Reproducibility note: Run 3 RUN_ROOT logged at
0522/logs/archive_clone_2step_fanout_20260524_192238 (10 rollout_*.pt
dumps, 0 release_memory_occupation crashes, 0 wait_for_idle timeouts).
Reward stayed 0 because swe-train-1545-localcache.jsonl lacks
swepro/eval_cmd — pipeline-only verification, not a learning run.
Inside examples/coding_agent_rl/, keep only the modules imported by
the SWE rollout path:
  - __init__.py
  - aiohttp_threaded.py
  - generate.py
  - middleware.py
  - sandbox.py

Drop the README, all run_*.sh launchers, tests/, and viz/. Other
example subdirectories (eval_multi_task, geo3k_vlm, retool, tau-bench,
...) are untouched.
Remove the v3 reclassify report and its companion experiment launcher
script from the repo.
Add a third metadata path for sweb-style rows that ship a
self-contained pytest script under metadata.remote_env_info.f2p_script
(ends with sys.exit(pytest.main([...]))). When eval_cmd is absent,
_metadata now wraps the script via base64 into a shell one-liner that
materializes /tmp/slime_f2p.py and runs it — sidesteps all shell
quoting and lets python's exit code drive the existing reward path
without touching _run_eval_cmd.

Precedence stays the same: explicit eval_cmd wins over f2p_script,
and swepro still wins over both when present.
…ed segments

- generate.py: run pre_commands (typically `git checkout <base_sha> -f`)
  BEFORE writing PROBLEM_STATEMENT.md so `git clean -fd` doesn't wipe it,
  and so the model edits the same baseline `evaluate()` apply-checks
  against (otherwise applied_cleanly is always False).
- sandbox.py: promote `_apply_pre_commands` to public `apply_pre_commands`
  for cross-module reuse from generate.py.
- middleware.py: add SWE_MAX_SEGMENT_TOKENS (default 96000) drop cap.
  claude-code auto-compact estimates in its own tokenizer; a 100k
  autoCompactWindow can produce 130k+ Qwen-tokenized segments after
  sub-agent dispatch reads large files, OOM-ing fused CE on actor_train
  (single sample can't be split by dynamic batch).
@jingshenghang jingshenghang force-pushed the coding-agent-rl-example branch from 3bdf682 to ef8d88e Compare May 25, 2026 07:44
- generate.py: remove --swe-segment-reducer-path indirection (always use
  default uniform reducer), make SLIME_HEAD_HOST required (no hostname
  fallback that silently misroutes sandboxes), bump default
  SWE_TIME_BUDGET_SEC 900 -> 1800
- sandbox.py: make SWE_SANDBOX_IMAGE_METADATA_KEY required (no silent
  default key), hardcode RPC backoff base constant (never overridden by
  any script in the repo)
- sglang_engine: drop _wait_for_idle polling helper (no longer called)
- ray/rollout: drop _cap_sample_total_tokens token-cap hack and restore
  the simpler flatten loop; long-trajectory capping is now done at the
  segment level upstream
jingshenghang added 9 commits May 25, 2026 16:12
generate.py:
- rename _provision_sandbox -> _boot_agent_sandbox
- collapse _collapse_to_final_segment + _default_uniform_fan_out into
  _write_segment_to_sample (shared) + _fan_out_to_samples
- inline the default collapse path into generate() (4 lines)
- strip references to historical run ids from docstrings/comments

middleware.py:
- reorganize top-to-bottom into 10 numbered sections (constants /
  pure helpers / translation / splice / parsing / state / store / I/O /
  handler / entry)
- move _attach_pending_raw_tokens to Chain.attach_pending_raw_tokens
- rename segment kind pre_wipe -> wipe in docs and comments
- generate.py: build sid from (instance_id, sample.index, group_index)
  so it's unique by construction within a rollout step; fall back to
  random hex only when an index is missing.
- middleware.py: open_session now raises on duplicate sid instead of
  silently merging two rollouts' state into one chain.
- drop examples/coding_agent_rl/__init__.py; PEP 420 namespace package
  handles relative imports fine and the docstring's rationale was wrong.
Attach a uuid rid to /generate and fire-and-forget /abort_request when the
client side is cancelled or the transport raises. Without this a subsequent
release_memory_occupation can race with a still-pending generate and trip
sglang's "server is idle" assertion, killing the scheduler.
- run_qwen36_35b_a3b_swe_8node.sh: reusable baseline launcher for Qwen3.5-35B-A3B SWE RL on 8 nodes.
- run_qwen36_35b_a3b_swe_8node_agent_only.sh: agent-only variant that forces sub-agent dispatch (investigator subagent + prompt + tool allowlist) so the saved trajectory tree carries real sibling branches.
- run_qwen36_35b_a3b_swe_8node_{linear,tree}.sh: thin wrappers toggling SWE_SAVE_TRAJECTORY_TREE.
- run_dynamic_trajectory_tree_train.sh: pytest smoke for trainer accepting both linear and trajectory_tree samples.
Keep only run_qwen36_35b_a3b_swe_8node_agent_only.sh as the canonical
launcher; the linear/tree wrappers and the synthetic train smoke have been
superseded.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants