[examples] add coding_agent_rl: agent-in-sandbox RL minimal demo by jingshenghang · Pull Request #1923 · THUDM/slime

jingshenghang · 2026-05-19T08:28:35Z

End-to-end loop for "coding agent + sandbox execution + test reward":

Boot an E2B sandbox per sample (dataset metadata.image).
Install Node 22 + Claude Code CLI inside; create unprivileged agent user; chown repo; write problem_statement.md.
Launch claude --output-format stream-json pointed at a head-node bridge via ANTHROPIC_BASE_URL; ANTHROPIC_AUTH_TOKEN doubles as the session id for concurrent request demux.
bridge.py translates each /v1/messages call into a SGLang /generate call and streams an Anthropic SSE reply back; per-session it keeps (prompt_ids, response_ids, loss_mask) so no post-hoc retokenize.
After the agent finishes, a fresh eval sandbox applies the model git diff and runs the dataset's tests -> 0/1 reward.
generate.py drops the bridge-collected tokens straight into Sample.

Layout under examples/coding_agent_rl/:
generate.py slime custom-generate entrypoint
sandbox.py all sandbox-side ops (boot/exec/upload, install
Node/Claude Code, run agent, git diff, eval)
bridge.py Anthropic Messages API <-> SGLang /generate
shim; model-agnostic
run_glm47_355b.sh 8-node / 64-GPU / colocate / E2B launch
script for GLM-4.7-355B-A32B
README.md walkthrough + dataset schema + swap-model howto

Model-agnostic: chat template via tokenizer.apply_chat_template; tool call parsing via sglang FunctionCallParser; reasoning parsing via sglang ReasoningParser. Swapping model = changing SWE_TOOL_PARSER / SWE_REASONING_PARSER envs (default glm47/glm45).

No new swe_rollout.py: reuses slime's default sglang_rollout outer loop via --custom-generate-function-path.

zhuzilin · 2026-05-19T09:25:36Z

+    # Canonical chat log. Each assistant turn we append after /generate carries
+    # reasoning_content so the next round's apply_chat_template re-render matches
+    # the tokens the model actually emitted (preserving prefix match).
+    glm_messages: list[dict] = dataclasses.field(default_factory=list)


应该叫 messages 就可以了

在重构中，改成了 chat_messages

zhuzilin · 2026-05-19T09:32:44Z

@@ -0,0 +1,424 @@
+"""Anthropic Messages API <-> SGLang /generate bridge.


我建议改成叫 middleware 之类的东西，主要是现在 slime 里面已经有 mbridge 和 megatron bridge 了，容易混淆。

改成了 middleware.py 防止混淆

zhuzilin · 2026-05-19T09:33:20Z

+    lock: asyncio.Lock = dataclasses.field(default_factory=asyncio.Lock)
+
+
+class _Store:


python 的 asyncio 是不是单线程的？需要这个带锁的 session 吗？

asyncio 单线程的语义只在没有 await 暂停时才成立。_run_turn 中间有 await _post_generate(...)（一次 sglang HTTP，秒级到分钟级），同一个 session_id 在这段窗口里再来一个请求（claude-code 正常不会，但 Anthropic SDK 的 retry / 客户端竞速会触发），两个协程会交错改 chat_messages / response_ids / pending_raw_tokens，TITO 校验立即崩。锁是 per-session 的，单会话只串行一个 turn；不同 session 仍然完全并发。

zhuzilin · 2026-05-19T09:38:25Z

+            "message": "Authorization Bearer <session_id> required",
+        }}, status=400)
+
+    s = await store.get(session_id)


有可能需要加个提示，session_id 不能重复

sid 改为 f'cagent-{instance_id}-{sample.index}-{sample.group_index}'，靠 Sample 自带字段构造唯一

zhuzilin · 2026-05-19T09:39:38Z

+        ideal_text = tok.apply_chat_template(
+            s.glm_messages, tools=s.tools_schema, tokenize=False, add_generation_prompt=True,
+        )
+        ideal_ids = tok.encode(ideal_text, add_special_tokens=False)


这里是不是 tokenize=True 就行？token=True 还能设置 add_special_tokens=False 吗？

是的，已修改成 tokenize=True

valid = {i: tup for i, tup in asst_raw_tokens.items() if 0 <= i < len(chat_messages)} if not valid: ids = tok.apply_chat_template( chat_messages, tools=tools_schema, tokenize=True, add_generation_prompt=True, ) return list(ids), []

zhuzilin · 2026-05-19T09:57:49Z

+            else:
+                logger.warning("[bridge] %s template-rerender mismatch; rebaselining", session_id)
+                s.response_ids = ideal_ids[len(s.prompt_ids):]
+                s.loss_mask = [0] * len(s.response_ids)


这里是按如果不匹配会 mask 掉整条 response 来做的是吗？

def _decode_and_verify( tok: Any, target: Chain, output_ids: list[int], ) -> str: """Decode output_ids; if the round-trip doesn't match, zero the loss_mask over this turn (cause: tokenizer ambiguity). Returns raw decoded text.""" n = len(output_ids) if n == 0: return "" raw_output = tok.decode(output_ids, skip_special_tokens=False) if not verify_tito_for_turn(tok, raw_output, output_ids): target.loss_mask[-n:] = [0] * n logger.warning("[middleware] TITO mismatch; loss_mask zeroed (n=%d)", n) return raw_output

是的，会 mask 掉整条 response

zhuzilin · 2026-05-19T10:07:47Z

+                    "content": [], "stop_reason": None, "stop_sequence": None,
+                    "usage": {"input_tokens": in_tokens, "output_tokens": 0}},
+    }))
+    for idx, b in enumerate(blocks):


想确定一下，这里是会先把整个 streaming 的请求变成同步的吗？

是的，_post_generate 走 sglang 的非流式接口拿完整 output_ids，然后 _stream_anthropic_sse 把它当一段 SSE 一次性吐回 claude-code（claude-code 协议只接受 stream，但内容可以攒齐）。

流式增量 retokenize 会让 TITO 校验在每个 chunk 边界都可能切坏 BPE，所以这里攒够 streaming ，再一次性发送

A minimal, readable example of coding agent + sandbox execution + test reward in slime (~1500 LoC across 4 files). One training sample: spin up a sandbox -> run Claude Code inside it -> capture the model-produced git diff -> spin up a SECOND clean sandbox, apply the diff, run the dataset's tests -> 0/1 reward -> feed the actual generated tokens (with loss-mask) back to slime, no re-tokenization. Wire-up is one CLI flag: --custom-generate-function-path examples.coding_agent_rl.generate.generate slime's default sglang_rollout.generate_rollout outer loop is reused; only the per-sample generate() is swapped. Files: * generate.py - per-sample entrypoint slime calls. Provision sandbox -> drop PROBLEM_STATEMENT.md -> run agent -> git diff -> eval in a fresh sandbox -> fill Sample. * sandbox.py - E2B sandbox backend. Boot/kill, exec/upload, install_node22 + install_claude_code, long-running agent spawn with done-marker poll, git_diff, fresh-sandbox eval runner (swepro / f2p_script / eval_cmd). * bridge.py - head-node aiohttp shim. Translates the agent's Anthropic Messages API into slime's SGLang /generate (token-native + logprobs) and keeps (prompt_ids, response_ids, loss_mask) per session so the trainer skips re-tokenization. Model-agnostic. * run_glm47_355b.sh - reference launch script (GLM-4.7-355B-A32B, 8 nodes / 64 GPUs, colocate, E2B). All required env vars guarded by \${VAR:?...}; no operator-specific paths. * README.md - file table, sample flow diagram, dataset schema (flat and remote_env_info layouts), required vs optional env knobs, "Swap things out" recipes (model / agent / sandbox backend), and design notes (no re-tokenization, reasoning round-trip, done-marker poll, boot semaphore + retry).

…metadata configurable * bridge.py renamed to middleware.py (file + class BridgeHandle -> MiddlewareHandle + log prefix + thread name). The chat-log dataclass field glm_messages was also renamed chat_messages -- the example is model-agnostic and the GLM-specific naming was misleading. * sandbox.py no longer hardcodes ``glm-platform/*`` metadata keys. Set SWE_SANDBOX_METADATA_JSON='{"my-platform/size":"lg",...}' to pass arbitrary routing tags into AsyncSandbox.create(metadata=...). Default is empty, which works for stock E2B accounts. * generate.py defaults SWE_TOOL_PARSER / SWE_REASONING_PARSER to None (no hardcoded GLM-specific parser fallback). The reference launch script run_glm47_355b.sh still sets these to glm47/glm45. * sandbox.run_claude_code's ``bridge_url=`` kwarg renamed to ``middleware_url=`` (caller in generate.py updated). * README + run_glm47_355b.sh updated for the rename and the new metadata env var.

…ddleware - middleware: track each /messages call as a `_Turn` (request snapshot, response, finish/stop reason, parent prefix), expose via pop_session return value and `record_tree` opt-in. - middleware: detect non-linear message updates by hashing prior messages and rebuild the prompt when the client diverges, instead of silently appending. Also translate Anthropic `thinking` blocks into `reasoning_content` so prior reasoning is preserved across turns. - generate: add SWE_SAVE_TRAJECTORY_TREE env knob; when set, stash the exported tree under sample.metadata["trajectory_tree"]. Also allow overriding the Claude Code prompt via SWE_CC_PROMPT. - sandbox: pass --include-partial-messages / --include-hook-events to claude CLI and allow extra args via SWE_CLAUDE_EXTRA_ARGS; quote the trajectory path with shlex.

Captures the in-place rollout/middleware state after the 2026-05-21 work: middleware.py: - Raw-token preservation: _attach_pending_raw_tokens / _render_with_raw_splice keep model-generated tokens verbatim across tool turns so loss_mask aligns with actual sglang output, eliminating the chat_template re-render drift that was wiping 95% of segments in earlier runs. - cache_control stripping in _strip_cache_control (Anthropic adds blocks sglang doesn't accept). - _classify_branch overhaul: detects compact / compact_summarization / sibling and falls back to linear, with linear-extend check ordered AHEAD of the autoCompact-marker check so post-compact linear chains do not get mis-flagged as a 'compact storm' purely because the resume pack floats around in m[0..1]. - _is_compact_resume_request scans m[0..1] (the resume pack can land in either when claude-code splits a Read tool call from its result). - Per-turn trajectory_tree dump (id, parent_id, parent_prefix_len, input_len, output_len, branch_kind, request, response) under metadata.trajectory_tree when SWE_SAVE_TRAJECTORY_TREE=1. - Abort/resume hardening: poll budgets, attempt caps, raw-token replay. generate.py: - Sandbox provisioning (concurrency-capped, retried) via _provision_sandbox. - SWE_TIME_BUDGET_SEC / SWE_MAX_RESPONSE_TOKENS / SWE_CC_PROMPT plumbing. - _cap_response_for_training caps the dumped response_ids to bound the per-sample memory pressure (entropy_input.clone()). sandbox.py: - ensure_agent_user / apply_before_repo_set_cmd hooks for the GLM swedev/sweap-images family. - E2B reply-path host pin (SLIME_HEAD_HOST -> 172.27.14.209). This commit lands the "one tree per sample" branching mode. The companion branch list_trajectory will split the same tree into multiple trainable trajectories (one per sub-agent dispatch / compact boundary / final chain).

…ple training In tree_trajectory mode the middleware drops pre-wipe tokens at every chain rebase, so only the final post-last-compact tail ever feeds the trainer. list_trajectory mode preserves them. middleware.py: - _Session.completed_trajectories: list of (prompt_ids, response_ids, loss_mask, meta) snapshots taken just before every wipe. meta records {"kind": "pre_wipe"|"diverge_reset", "completed_turns": <int>} so the downstream Sample fan-out can label what kind of boundary produced it. - pop_session_split / _pop_session_split: variant of pop_session that returns the snapshot list (plus the still-live final chain) and the same trajectory_tree export as before. generate.py: - SWE_LIST_TRAJECTORY env switch. Default 0 keeps the tree_trajectory single-Sample behavior; 1 calls pop_session_split and emits one deep-copied Sample per non-empty segment. - Each fanned-out Sample carries the same per-instance reward (sample-level reward propagates verbatim) plus segment-specific metadata (segment_idx, num_segments, segment_kind, completed_turns_at_split). - trajectory_tree is attached only to the first segment to keep dumps compact; viz still gets the full per-instance picture. slime's sglang_rollout already supports custom_generate_func returning list[Sample] (see generate_and_rm:283 `if isinstance(sample, list)`), so no training-loop changes are required. Smoke test (/tmp/smoke_list_trajectory.py): 4 PASS — empty session, 2-segment fan-out, empty-final drop, 4-segment multi-wipe chain.

… + nearest-parent selection Implements MASTER_PLAN.md §2 (C1-C6) and §3 (smoke + offline reclassify): - C1 (middleware._export_tree): bump version 2 -> 3, emit initial_prompt_len. - C2 (middleware._is_summarization_request): also scan non-last user msgs, but skip ones containing _COMPACT_RESUME_MARKER (avoids false positives on embedded resume packs that carry prior summary text). - C3 (middleware._classify_branch): module-level LINEAR_DEFICIT_TOK=2048; new near-linear tolerance channel for chat_template re-render drift. Guard loosened from spec's `pfx > par.in + par.out/2` (rejected sib 14/22) to `pfx > par.in` (consumed >=1 byte of par.output); still rejects true forks via the deficit > 2048 condition. Justified in RECLASSIFY_V3_REPORT.md. - C4 (middleware._classify_branch): root-parent special case when initial_prompt_len is 0 (legacy dumps) but parent.id==0 + resume marker present + pfx nearly fills root.full -> compact. - C5 (middleware._new_turn): two-pass parent selection. Pass 1 walks reversed(s.turns) and picks the most recent prev where prefix_len >= prev.full_len - 2048 AND prefix_len > prev.input_len. Pass 2 (if pass 1 found none) falls back to original longest-prefix-match. - C6 (reclassify_existing_rollout.py): explicit assertion on initial_prompt_len > 0; legacy-dump soft-fall to turns[0].input_len with a single per-dump warning. Smoke test: 15/15 PASS (smoke_classifier_v3.py covers cases 1-6 from v2 + the 9 new cases from spec_1 §4 table). Reclassify deltas (full _new_turn replay): - compact_only/rollout_0.pt : OLD={root:16, linear:388, compact:38, sibling:5, compact_summ:14} NEW={root:16, linear:404, compact:20, sibling:1, compact_summ:20} - 100k/rollout_0.pt (full) : OLD={root:16, linear:385, sibling:96, compact:9} NEW={root:16, linear:448, compact:20, sibling:0, compact_summ:22} - 100k/rollout_0.pt sample 11 : OLD={root:1, linear:51, sibling:22} NEW={root:1, linear:65, compact:3, compact_summ:5, sibling:0} sample-11 sibling == 0 PASS. compact_summ overshoot vs spec target traced to real summarization requests caught by the v3 marker check (NOT regression); see RECLASSIFY_V3_REPORT.md "Notes" section for the per-turn breakdown.

…es + segment cap - Port the list_trajectory fan-out from list_trajectory branch into tree_trajectory (gated by SWE_LIST_TRAJECTORY=1, default 0 keeps single-Sample behavior). - Add SWE_LIST_TRAJECTORY_REWARD_MODE = {uniform | copy | final_only}, default uniform (reward / num_segments). copy keeps the original per-segment full reward; final_only zeros all but the kind=='final' segment. - Add SWE_LIST_TRAJECTORY_MAX_SEGMENTS (default 8) hard cap. When K > cap, keep segments[:cap-1] + segments[-1:] (head + final) to bound GBS growth and GRPO baseline dilution. Drops the middle on purpose; logged WARN. - Propagate list_trajectory_reward_mode + list_trajectory_finish_reason into each fanned Sample's metadata for downstream tally / reward-allocation audit. finish_reason may be None until sub-agent B wires the field. - Extract reward allocation into pure _segment_reward() so the smoke test can validate without spinning up middleware. - New smoke_reward_modes.py: 6 cases covering uniform, copy, final_only, N=1, N=0 defensive, and final_only with no final segment. All PASS. Spec: 0521/specs/MASTER_PLAN.md §6, §A; spec_3 §3 row generate.py:312, §5 item 3.

…endent segments Implements MASTER_PLAN.md §5 (subagent segment split) for the list_trajectory branch. Adds per-dispatch state tracking via a new `_SubSession` dataclass and a `subagent_stack` field on `_Session`, plus dispatch / return detection logic inside `_handle_messages`. User decisions from MASTER_PLAN preamble: 1. Subagent segment prefix = subagent's OWN system prompt + initial task, never the main-line prefix from before the dispatch. Implemented by letting each `_SubSession` hold its own `prompt_ids` / `chat_messages` / `initial_prompt_len` and renderer-state; the subagent's segment in `completed_trajectories` uses those rather than reusing main-line tokens. 2. Whitelist of segment kinds: only `subagent` / `pre_wipe` / `final`. Removed the `diverge_reset` snapshot append at the prefix-divergence reset site (was middleware.py:~1126); divergence now just drops the current chain and the next chunk lands in the next pre_wipe / final. `compact_summarization` was never emitted as a segment so no change. 3. Nested subagents: only the outermost (depth==1) pop emits a `subagent` segment. Deeper pops merge into the parent via `_merge_into_parent`, which concatenates `response_ids` and forces `loss_mask=1` for the merged tokens. The outer segment's metadata records `nested_depth = max(seen)` so post-hoc analysis can tell flat from nested subagents. 4. Subagent identification: hash `body["system"]` (`_hash_obj`) and compare against the main-line cached `system_hash`. Different hash while a Task/Agent dispatch marker is pending -> push a new sub-session. 5. Fallback: when identification is ambiguous (missing system, missing dispatch marker, etc.) route to main line - never invent a sub-session. Detection scheme: * On every main-line response, scan `blocks` for `tool_use` with name in {Task, Agent} via `_find_dispatch_tool_use_id`; record id on `s._pending_dispatch_id` (or on the active sub-session for nested dispatches). * On every incoming request, before routing, scan new user messages for a `tool_result` with the pending dispatch id via `_has_tool_result_for`. Match -> pop top of `subagent_stack` and snapshot (or merge if nested). * Routing decision: top-of-stack `system_hash` match -> route to that sub-session; new `system_hash` with pending dispatch -> push a new sub-session; else -> main line. * `_pop_session_split` flushes any still-open sub-sessions at session pop (claude-code exits mid-dispatch) and appends `last_finish_reason` to the `final` segment metadata. Smoke test: examples/coding_agent_rl/smoke_list_trajectory_v2.py covers all 10 cases from spec_3_list_trajectory.md §4 (plus helper sanity); all 10 PASS. Case 8 (subagent + compact mix) is interpreted as 3 segments [subagent, pre_wipe, final] (not the 5 mentioned in the spec note) - the canonical semantics keep main-line linear continuation in one segment unless a wipe or dispatch changes the routing target; documented in the case 8 comment block. Files: examples/coding_agent_rl/middleware.py +320 / -87 (new _SubSession, routing helpers, dispatch/return detection, target-based handler, updated _pop_session_split) examples/coding_agent_rl/smoke_list_trajectory_v2.py +new (10 cases)

Implements SPEC §9 (Visualization). - viz/swimlane.py: per-sample swimlane Gantt (matplotlib broken_barh); color by segment_kind, red-hatch overlay when tito_masked_turns > 0, ↻N annotation when num_aborts > 0 - viz/stacked.py: 1-row-per-sample horizontal stacked bars for batch overview - both CLIs accept rollout dump .pt globs and load via torch.load

Combines SPEC §10.1 PRs THUDM#2-THUDM#5 into one atomic rewrite because all four touch the same single file (M1 decision: no sub-module split). Each PR's intent preserved as a distinct §section banner. PR THUDM#2 - §0 TOC docstring + §2 DATACLASSES (Turn, SubSession, SubSnapshot, Session) - Turn loses branch_kind/parent_id/parent_prefix_len/full_ids (0521 legacy) - Turn gains tito_masked: bool (U3) - SubSession gains num_aborts / last_finish_reason / tito_masked_turn_count - SubSnapshot adds finish_reason / num_aborts / tito_masked_turn_count - Session: active_subagent (flat) + completed_subagents + _emit_order (I6) - Session: num_aborts / num_aborts_this_turn / tito_*_turn_count - §3 PRIMITIVES & §4 STORE banners + Store.open_session(record_raw_dump=) PR THUDM#3 - §5 TRANSLATE adds verify_tito_for_turn (D2) - pure function: retokenize(decode(output_ids)) == output_ids - §6 ENGINE keeps AbortCoordinator + generate_with_abort_resume (P2 verbatim) PR THUDM#4 - §7 SEGMENTS rewrite - delete classifier triple: _classify_branch / _new_turn / 2-pass parent - delete _COMPACT_RESUME_MARKER text sniff + _SUMMARIZATION_MARKERS - delete _SubSession.subagent_stack nested stack -> flat active_subagent - new pick_target with nested-dispatch fail-safe (R2 CC3) - new classify_and_apply (4-condition is_append + pre_wipe snapshot) - new snapshot_subagent + pop_session_split (chronological replay of _emit_order) - segment meta key renamed kind -> segment_kind (U3 / R3) - U3: per-segment finish_reason + num_aborts + tito_masked_turns PR THUDM#5 - §8 HANDLER 15-step rewrite + §9 SHELL - _handle_messages: 380 lines -> ~140 lines, numbered 1-15 - step 12: I8 (n>0 guard) + I9 (abort skip TITO) + per-turn mask (only [-n:]) - Session.lock is sole sync primitive (P6); helpers never re-acquire (I3) - open_session(record_tree=) renamed to open_session(record_raw_dump=) - pop_session() single-segment API removed (list mode always on) - _export_raw_dump emits v4 schema (drops full_ids, adds tito_masked/etc) Tests (SPEC §7.1): 22 cases across 5 files, all passing - test_segments_classify.py - 7 cases (pre_wipe, nested fail-safe, etc.) - test_translate_tito.py - 6 cases incl. test_empty_turn_skip (I8 fix) and test_abort_skip_tito (I9) - test_engine_abort_resume.py - 3 cases (concatenate / max-attempts / budget) - test_session_lock.py - 2 cases (same-sid serialize, different sid) - test_pop_session_split.py - 4 cases (chronological, drain, U3 fields) - fixtures/README.md - schema doc

Implements SPEC §4 (generate.py) + §5 (sandbox.py -142) + §8 (5 run scripts). generate.py: - drop pop_session_split() / pop_session() dual API (list mode always on) - drop SWE_LIST_TRAJECTORY / SWE_LIST_TRAJECTORY_REWARD_MODE / SWE_LIST_TRAJECTORY_MAX_SEGMENTS env (replace with reducer hook) - rename SWE_SAVE_TRAJECTORY_TREE -> SWE_DUMP_RAW_TRAJECTORY (Q5 / semantic narrowing: now only attaches to segment 0 metadata, not a tree) - new _default_uniform_fan_out (D4/Q7 reward/K) - new _load_reducer with import fail-soft fallback - new _fan_out_with_fail_soft (U4: 4 fail paths -> sample abort + metric) - shallow copy.copy per sub-sample (R2 deepcopy memory advice) - --swe-segment-reducer-path opt-in CLI arg (read via getattr; no slime arg-parser change needed; default reducer used when absent) sandbox.py (532 -> 388 lines, SPEC §5 trim): - drop ~50 lines of redundant module/RPC docstrings (README is reference) - merge _setup_swepro_assets + _apply_pre_commands into evaluate() body flow - preserve P-equivalent: _rpc_retry / E2BSandbox 5-primitive contract / install_node22 host-decompress / run_claude_code done-marker poll / evaluate 3-mode branch (swepro / f2p_script / eval_cmd) run_qwen36_base.sh + 4 scenario shells (SPEC §8): - run_qwen36_base.sh : DRY common config + exec_train helper - run_qwen36_linear_2steps : autoCompactWindow=1M + disable Agent/Task - run_qwen36_subagent_2steps : investigator agent + force dispatch prompt - run_qwen36_compact_2steps : autoCompactWindow=100K + disable subagents - run_qwen36_mixed_2steps : investigator + autoCompactWindow=100K - RUNTIME_ENV_JSON no longer forwards deleted SWE_LIST_TRAJECTORY* keys test_reducer_failure.py (SPEC §7.1 U4): 8 cases all passing - import path bad -> falls back to default + logs error - reducer not callable -> falls back - reducer raises -> sample abort - reducer returns None / non-list -> sample abort - happy path passthrough Delete obsolete 0521 smoke scripts: - smoke_classifier_v3.py / smoke_list_trajectory_v2.py / smoke_reward_modes.py - reclassify_existing_rollout.py (0521 archive replay; superseded by SPEC §7.3) Tests: 30/30 passing across all 6 test files.

Implements SPEC §10.3 (v0.3 round 3, U5 decision). - mv slime/utils/aiohttp_threaded.py -> examples/coding_agent_rl/aiohttp_threaded.py - middleware.py:69 import: from slime.utils.aiohttp_threaded -> from aiohttp_threaded (bare import to match sub-agent's existing sys.path + bare-import style; examples/coding_agent_rl/ is not a package so SPEC's relative-import form `from .aiohttp_threaded` doesn't apply here) Notes: - Logically part of PR THUDM#2 (dataclass/docstring cleanup) per SPEC §10.3, but filed as a standalone commit because the original PR THUDM#2-THUDM#5 commit (7091189) is no longer HEAD (HEAD is PR THUDM#6 / 159b2b0); amending HEAD would semantically corrupt PR THUDM#6 scope. User may interactively rebase to fold this into 7091189 if desired. - No 0521 legacy test/script files in this worktree, so only 1 import needed updating (vs SPEC §10.3 listing 5 files — those don't exist here). - All 30 smoke tests still pass.

Train backward on 152K-token samples OOMs at fused_cross_entropy alloc even with chunked log_probs (chunk_size=2) and reduced sglang mem-fraction. Root cause: actor activation peak on the longest sample exceeds 80 GiB despite CP=8. Cap per-sample total tokens at SLIME_MAX_SAMPLE_TOKENS in _get_rollout_data after Sample.from_dict (covers both load_debug and live rollout paths). Truncates response tail first; if prompt itself exceeds cap, collapses sample to no-gradient loss_mask=[0] (same contract as _abort). Verified on agent_only/rollout_0.pt (303 MiB, max 152K tokens) via oom_repro_load_debug.sh with cap=100K: 4/16 samples capped (1 truncated, 3 collapsed), step 0 SUCCEEDED, backward microbatch 4/4 finished in 19s, GPU peak free 5.62 GB. Default 0 = no cap (back-compat). See 0521/OOM_DIAGNOSIS.md §9.

Stage 8B real E2E run failed with ModuleNotFoundError on aiohttp_threaded: middleware.py used a bare `import aiohttp_threaded` that only resolved when coding_agent_rl/ itself was on sys.path -- which the in-tree smoke tests set up but ray rollout workers (loading via importlib from external cwd) never do. Stage 9 fix was a dual-mode try/except. Stage 10 promoted to clean package mode: * add __init__.py making examples.coding_agent_rl an explicit package * middleware.py: `from .aiohttp_threaded import AppHandle, ...` * collapse smoke-test sys.path hack to a single worktree-root insert * tests import via `from examples.coding_agent_rl import middleware` * add test_external_cwd_import.py as B2 regression guard: subprocess in /tmp imports the three modules to assert ray-worker-mode stays green 34/34 smoke tests pass.

Root cause (stage14_oom_rootcause_fanout.md, 2026-05-24): 0522 refactor removed the SWE_LIST_TRAJECTORY env knob and made fan-out the unconditional path. archive-config (CP=4, SGLANG=0.75, no margin) that the main repo runs cleanly in 32 min on this dataset failed with GPU wake_up DENY at 6.48 GB and CE-forward OOM (73.7 GB allocated / 1 GB free / 20 MiB CE buffer fail). Investigation traced the failure not to per-sample length but to ray rollout buffer sample-count explosion (16 -> ~64-80 samples after fan-out, ~4-5x). Bigger ray.put + host pinned mem footprint -> wake_up denied -> downstream CE OOM. Fix: restore SWE_LIST_TRAJECTORY env knob (default 0); in the default path, collapse to the FINAL segment (the reward-bearing post-final-compact-reset segment). This matches main-repo behavior and restores the archive baseline. Setting SWE_LIST_TRAJECTORY=1 still opts into per-segment fan-out for users who want it. Verified: fix_listtraj0_20260524_022842 PASS — 2 train steps in 45 min, wake_up 7.5s / 11.97s (vs main-repo 9.5s / 10.07s baseline), num_samples=16, ray job SUCCEEDED. Implementation: * extract _collapse_to_final_segment helper (testable in isolation) * SWE_LIST_TRAJECTORY env constant w/ docstring explaining trade-offs * 2 new smoke tests (test_collapse_to_final_segment_picks_last_segment + test_collapse_with_single_segment_keeps_it) directly verify the short-arm of Plan D from the 0524 segment-per-sample research * 36/36 tests pass Long-term direction (Plan D long-arm) lives on a separate branch: per-segment fan-out + PR THUDM#1933 per-rollout reducer port. See docs/project_memory/1.segment_per_sample_oom/README.md.

…ples Port from THUDM/slime PR THUDM#1933 step 9 (R3 SWE-side glue): - _default_uniform_fan_out now sets sub.rollout_id = sample_proto.index on every fan-out sibling. With the PR-THUDM#1933 reducer this makes the loss treat the K segments as one trajectory (per-rollout mean) rather than K independent samples (over-counting by K). Kept intact: SWE_LIST_TRAJECTORY=0 collapse path (stage 14 OOM fix fallback), _fan_out_with_fail_soft wrapper, sub.index (shared across siblings so framework group_by(group_index) still works for GRPO).

…ssthrough Two new CPU-only test files under examples/coding_agent_rl/tests/ so they get picked up by the existing pytest config without touching pyproject testpaths or top-level tests/. test_dp_schedule.py (7 tests, ~120 LOC): - static one-step-one-sample-per-rollout invariant check - dynamic compact (2 samples/rollout) keeps same-rollout samples in one step - trailing rollouts get trimmed - < gbs rollouts raises AssertionError - step_size < dp_size raises AssertionError - balance_data round-robin vs KK partition test_rollout_id_passthrough.py (3 tests): - compact K=3 fan-out samples share rollout_id and rollout_mask_sums - default 1-sample-per-rollout falls back to index - SWE _default_uniform_fan_out wires rollout_id from sample_proto.index

E2 retry 2 run planD_e2_pr1933_fanout_20260524_073011 was killed at ~27 min by an external `concurrent.futures.TimeoutError` set on the RolloutManager.generate() future (15/16 trajectories completed, 1 still in sandbox.evaluate; see r5 doc). No env-tunable timeout (SWE_SANDBOX_LIFETIME_SEC=3600 / SWE_TIME_BUDGET_SEC=1200 / SWE_EVAL_TIMEOUT_SEC=600 / e2b REQUEST_TIMEOUT=60s) explains the 27-min number -- root cause likely a Ray job heartbeat or an undiscovered wrapper, not in our code. Fix (Plan B): wrap generate() body in asyncio.wait_for( _generate_inner(...), timeout=SWE_GENERATE_GUARD_SEC), where SWE_GENERATE_GUARD_SEC defaults to SWE_TIME_BUDGET_SEC + SWE_EVAL_TIMEOUT_SEC + 180 (1980s). On timeout, sample is aborted with reason `wall_clock_timeout` and a snapshot of pending asyncio tasks is logged for future debugging. This makes a single hung trajectory a recoverable per-sample abort instead of a whole-step ray job failure. Note: asyncio.TimeoutError IS builtin TimeoutError in Python 3.11+ (they were unified), so the except catches both. Added test_generate_guard_aborts_on_timeout in test_reducer_failure.py. Verified: 47/47 smoke tests pass. This fix should also be cherry-picked to list_trajectory_0522_exp when convenient -- the timeout path is shared between E0 (collapse) and E2 (fan-out).

Branch list_trajectory_0522_pr1933_e2_v3 (off pr1933 HEAD @ 82135cc). This experiment branch isolates v3 runs from the staging pr1933 branch so we can iterate without polluting the merge candidate. Diff vs planD_e2_pr1933_fanout.sh: * SCENARIO renamed to planD_e2_v3_with_guard * SWE_GENERATE_GUARD_SEC=1980 exported + added to runtime_env keys so ray workers see it (default in generate.py is the same; explicit here so the value is logged) * Updated header docstring to point at branch + previous failure PASS criteria: - At least 1 train step completes (compare to E2 retry that did) - If any trajectory hangs > 1980s, expect `wall_clock_timeout` abort + run continues (vs E2 retry 2 where the whole ray job died)

…ignment E2 v3 attempt 2 (run 20260524_110039) died at dp_schedule.py:176: num_mbs (23) is not a multiple of dp_size * mb_group (8); static path. Root cause: fan-out produced 23 samples from 16 rollouts (variable K). With static micro_batch_size=1 the static path requires num_samples % (dp_size * mb_group) == 0, which fan-out variability cannot guarantee. Earlier E2 retry (073011) only worked because the legacy sample-count trim accidentally landed on 32 (33 -> trim 32, 32 % 8 == 0). After the trim-by-rollout fix (a6af128), trim preserves 16 rollouts intact and the 23 fan-out samples now hit this static-path assertion that the accidental trim hid. Fix: add --use-dynamic-batch-size so dp_schedule.py:166-174 takes the dynamic-path branch that calls expand_bins_by_splitting to grow mbs count to the alignment target. Token-budget packing also handles the variable K naturally. No code change; this is config only on the v3 experiment branch.

Verified the examples/coding_agent_rl refactor end-to-end on archive_clone_2step_0524_fanout.sh; surfaced and fixed two latent bugs that blocked rollout 0 / rollout 2->3 transition; Run 3 completed 10/10 rollouts in 4h03m with 0 fatal errors. Fixes: 1) examples/coding_agent_rl/generate.py: fan-out abort path returned a bare Sample while the success path returns list[Sample] under SWE_LIST_TRAJECTORY=1. The mixed shape sneaks past _get_rollout_data's flatten loop and AttributeError's later in _save_debug_rollout_data. Added _abort_result() wrapper that wraps the abort Sample in a single-element list when fan-out is on, and routed all 5 abort return sites through it. 2) slime/ray/rollout.py: defensive normalization in _get_rollout_data — if any element of the live-rollout result is a list, wrap bare Samples to [Sample] before chain.from_iterable, then collapse remaining nesting. 3) slime/backends/sglang_utils/sglang_engine.py: SGLang scheduler asserts is_fully_idle() inside release_memory_occupation, but flush_cache only drains the radix cache, not in-flight forwards. This race crashed scheduler_0 (exit -3, unrecoverable) at the rollout 2 -> rollout 3 train transition. Added _wait_for_idle() that polls /get_load until num_reqs + num_waiting_reqs == 0 on every DP rank (60s timeout, warn-and-proceed on miss); call it between flush_cache and release_memory_occupation. Refactor surface (already in the working tree, verified by Run 3): - examples/coding_agent_rl/middleware.py simplified by ~1100 lines - examples/coding_agent_rl/sandbox.py / README.md aligned Reproducibility note: Run 3 RUN_ROOT logged at 0522/logs/archive_clone_2step_fanout_20260524_192238 (10 rollout_*.pt dumps, 0 release_memory_occupation crashes, 0 wait_for_idle timeouts). Reward stayed 0 because swe-train-1545-localcache.jsonl lacks swepro/eval_cmd — pipeline-only verification, not a learning run.

Inside examples/coding_agent_rl/, keep only the modules imported by the SWE rollout path: - __init__.py - aiohttp_threaded.py - generate.py - middleware.py - sandbox.py Drop the README, all run_*.sh launchers, tests/, and viz/. Other example subdirectories (eval_multi_task, geo3k_vlm, retool, tau-bench, ...) are untouched.

Remove the v3 reclassify report and its companion experiment launcher script from the repo.

Add a third metadata path for sweb-style rows that ship a self-contained pytest script under metadata.remote_env_info.f2p_script (ends with sys.exit(pytest.main([...]))). When eval_cmd is absent, _metadata now wraps the script via base64 into a shell one-liner that materializes /tmp/slime_f2p.py and runs it — sidesteps all shell quoting and lets python's exit code drive the existing reward path without touching _run_eval_cmd. Precedence stays the same: explicit eval_cmd wins over f2p_script, and swepro still wins over both when present.

…ed segments - generate.py: run pre_commands (typically `git checkout <base_sha> -f`) BEFORE writing PROBLEM_STATEMENT.md so `git clean -fd` doesn't wipe it, and so the model edits the same baseline `evaluate()` apply-checks against (otherwise applied_cleanly is always False). - sandbox.py: promote `_apply_pre_commands` to public `apply_pre_commands` for cross-module reuse from generate.py. - middleware.py: add SWE_MAX_SEGMENT_TOKENS (default 96000) drop cap. claude-code auto-compact estimates in its own tokenizer; a 100k autoCompactWindow can produce 130k+ Qwen-tokenized segments after sub-agent dispatch reads large files, OOM-ing fused CE on actor_train (single sample can't be split by dynamic batch).

- generate.py: remove --swe-segment-reducer-path indirection (always use default uniform reducer), make SLIME_HEAD_HOST required (no hostname fallback that silently misroutes sandboxes), bump default SWE_TIME_BUDGET_SEC 900 -> 1800 - sandbox.py: make SWE_SANDBOX_IMAGE_METADATA_KEY required (no silent default key), hardcode RPC backoff base constant (never overridden by any script in the repo) - sglang_engine: drop _wait_for_idle polling helper (no longer called) - ray/rollout: drop _cap_sample_total_tokens token-cap hack and restore the simpler flatten loop; long-trajectory capping is now done at the segment level upstream

generate.py: - rename _provision_sandbox -> _boot_agent_sandbox - collapse _collapse_to_final_segment + _default_uniform_fan_out into _write_segment_to_sample (shared) + _fan_out_to_samples - inline the default collapse path into generate() (4 lines) - strip references to historical run ids from docstrings/comments middleware.py: - reorganize top-to-bottom into 10 numbered sections (constants / pure helpers / translation / splice / parsing / state / store / I/O / handler / entry) - move _attach_pending_raw_tokens to Chain.attach_pending_raw_tokens - rename segment kind pre_wipe -> wipe in docs and comments

- generate.py: build sid from (instance_id, sample.index, group_index) so it's unique by construction within a rollout step; fall back to random hex only when an index is missing. - middleware.py: open_session now raises on duplicate sid instead of silently merging two rollouts' state into one chain. - drop examples/coding_agent_rl/__init__.py; PEP 420 namespace package handles relative imports fine and the docstring's rationale was wrong.

Attach a uuid rid to /generate and fire-and-forget /abort_request when the client side is cancelled or the transport raises. Without this a subsequent release_memory_occupation can race with a still-pending generate and trip sglang's "server is idle" assertion, killing the scheduler.

- run_qwen36_35b_a3b_swe_8node.sh: reusable baseline launcher for Qwen3.5-35B-A3B SWE RL on 8 nodes. - run_qwen36_35b_a3b_swe_8node_agent_only.sh: agent-only variant that forces sub-agent dispatch (investigator subagent + prompt + tool allowlist) so the saved trajectory tree carries real sibling branches. - run_qwen36_35b_a3b_swe_8node_{linear,tree}.sh: thin wrappers toggling SWE_SAVE_TRAJECTORY_TREE. - run_dynamic_trajectory_tree_train.sh: pytest smoke for trainer accepting both linear and trajectory_tree samples.

Keep only run_qwen36_35b_a3b_swe_8node_agent_only.sh as the canonical launcher; the linear/tree wrappers and the synthetic train smoke have been superseded.

…odes.sh

…atch path

…kip text round-trip

jingshenghang force-pushed the coding-agent-rl-example branch from 585ef36 to f2fd320 Compare May 19, 2026 09:40

zhuzilin reviewed May 19, 2026

View reviewed changes

jingshenghang and others added 26 commits May 25, 2026 03:47

[examples] fix coding agent sandbox node install

c5bd5be

[cleanup] drop RECLASSIFY_V3_REPORT and planD_e2_v3_with_guard launcher

eab50f8

Remove the v3 reclassify report and its companion experiment launcher script from the repo.

jingshenghang force-pushed the coding-agent-rl-example branch from 3bdf682 to ef8d88e Compare May 25, 2026 07:44

jingshenghang added 9 commits May 25, 2026 16:12

[coding_agent_rl] apply pre-commit (ruff + black)

36d5821

[coding_agent_rl] drop unused 8-node launcher variants

df2d2a2

Keep only run_qwen36_35b_a3b_swe_8node_agent_only.sh as the canonical launcher; the linear/tree wrappers and the synthetic train smoke have been superseded.

[coding_agent_rl] rename 8-node launcher to run_qwen36_35b_a3b_swe_8n…

353b5b7

…odes.sh

[coding_agent_rl] clarify cross-turn TITO rationale in re-render mism…

24dc46c

…atch path

[coding_agent_rl] use tokenize=True in empty raw-splice fallback to s…

7915d46

…kip text round-trip

		@@ -0,0 +1,424 @@
		"""Anthropic Messages API <-> SGLang /generate bridge.

		lock: asyncio.Lock = dataclasses.field(default_factory=asyncio.Lock)


		class _Store:

Conversation

jingshenghang commented May 19, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants