[examples] add coding_agent_rl: agent-in-sandbox RL minimal demo#1923
[examples] add coding_agent_rl: agent-in-sandbox RL minimal demo#1923jingshenghang wants to merge 36 commits into
Conversation
585ef36 to
f2fd320
Compare
| # Canonical chat log. Each assistant turn we append after /generate carries | ||
| # reasoning_content so the next round's apply_chat_template re-render matches | ||
| # the tokens the model actually emitted (preserving prefix match). | ||
| glm_messages: list[dict] = dataclasses.field(default_factory=list) |
There was a problem hiding this comment.
在重构中,改成了 chat_messages
| @@ -0,0 +1,424 @@ | |||
| """Anthropic Messages API <-> SGLang /generate bridge. | |||
There was a problem hiding this comment.
我建议改成叫 middleware 之类的东西,主要是现在 slime 里面已经有 mbridge 和 megatron bridge 了,容易混淆。
There was a problem hiding this comment.
改成了 middleware.py 防止混淆
| lock: asyncio.Lock = dataclasses.field(default_factory=asyncio.Lock) | ||
|
|
||
|
|
||
| class _Store: |
There was a problem hiding this comment.
python 的 asyncio 是不是单线程的?需要这个带锁的 session 吗?
There was a problem hiding this comment.
asyncio 单线程的语义只在没有 await 暂停时才成立。_run_turn 中间有 await _post_generate(...)(一次 sglang HTTP,秒级到分钟级),同一个 session_id 在这段窗口里再来一个请求(claude-code 正常不会,但 Anthropic SDK 的 retry / 客户端竞速会触发),两个协程会交错改 chat_messages / response_ids / pending_raw_tokens,TITO 校验立即崩。锁是 per-session 的,单会话只串行一个 turn;不同 session 仍然完全并发。
| "message": "Authorization Bearer <session_id> required", | ||
| }}, status=400) | ||
|
|
||
| s = await store.get(session_id) |
There was a problem hiding this comment.
有可能需要加个提示,session_id 不能重复
There was a problem hiding this comment.
sid 改为 f'cagent-{instance_id}-{sample.index}-{sample.group_index}',靠 Sample 自带字段构造唯一
| ideal_text = tok.apply_chat_template( | ||
| s.glm_messages, tools=s.tools_schema, tokenize=False, add_generation_prompt=True, | ||
| ) | ||
| ideal_ids = tok.encode(ideal_text, add_special_tokens=False) |
There was a problem hiding this comment.
这里是不是 tokenize=True 就行?token=True 还能设置 add_special_tokens=False 吗?
There was a problem hiding this comment.
是的,已修改成 tokenize=True
valid = {i: tup for i, tup in asst_raw_tokens.items() if 0 <= i < len(chat_messages)}
if not valid:
ids = tok.apply_chat_template(
chat_messages,
tools=tools_schema,
tokenize=True,
add_generation_prompt=True,
)
return list(ids), []
| else: | ||
| logger.warning("[bridge] %s template-rerender mismatch; rebaselining", session_id) | ||
| s.response_ids = ideal_ids[len(s.prompt_ids):] | ||
| s.loss_mask = [0] * len(s.response_ids) |
There was a problem hiding this comment.
这里是按如果不匹配会 mask 掉整条 response 来做的是吗?
There was a problem hiding this comment.
def _decode_and_verify(
tok: Any,
target: Chain,
output_ids: list[int],
) -> str:
"""Decode output_ids; if the round-trip doesn't match, zero the loss_mask
over this turn (cause: tokenizer ambiguity). Returns raw decoded text."""
n = len(output_ids)
if n == 0:
return ""
raw_output = tok.decode(output_ids, skip_special_tokens=False)
if not verify_tito_for_turn(tok, raw_output, output_ids):
target.loss_mask[-n:] = [0] * n
logger.warning("[middleware] TITO mismatch; loss_mask zeroed (n=%d)", n)
return raw_output
是的,会 mask 掉整条 response
| "content": [], "stop_reason": None, "stop_sequence": None, | ||
| "usage": {"input_tokens": in_tokens, "output_tokens": 0}}, | ||
| })) | ||
| for idx, b in enumerate(blocks): |
There was a problem hiding this comment.
想确定一下,这里是会先把整个 streaming 的请求变成同步的吗?
There was a problem hiding this comment.
是的,_post_generate 走 sglang 的非流式接口拿完整 output_ids,然后 _stream_anthropic_sse 把它当一段 SSE 一次性吐回 claude-code(claude-code 协议只接受 stream,但内容可以攒齐)。
流式增量 retokenize 会让 TITO 校验在每个 chunk 边界都可能切坏 BPE,所以这里攒够 streaming ,再一次性发送
A minimal, readable example of coding agent + sandbox execution + test
reward in slime (~1500 LoC across 4 files). One training sample:
spin up a sandbox -> run Claude Code inside it -> capture the
model-produced git diff -> spin up a SECOND clean sandbox, apply the
diff, run the dataset's tests -> 0/1 reward -> feed the actual
generated tokens (with loss-mask) back to slime, no re-tokenization.
Wire-up is one CLI flag:
--custom-generate-function-path examples.coding_agent_rl.generate.generate
slime's default sglang_rollout.generate_rollout outer loop is reused;
only the per-sample generate() is swapped.
Files:
* generate.py - per-sample entrypoint slime calls. Provision sandbox ->
drop PROBLEM_STATEMENT.md -> run agent -> git diff -> eval in a fresh
sandbox -> fill Sample.
* sandbox.py - E2B sandbox backend. Boot/kill, exec/upload,
install_node22 + install_claude_code, long-running agent spawn with
done-marker poll, git_diff, fresh-sandbox eval runner (swepro /
f2p_script / eval_cmd).
* bridge.py - head-node aiohttp shim. Translates the agent's
Anthropic Messages API into slime's SGLang /generate (token-native +
logprobs) and keeps (prompt_ids, response_ids, loss_mask) per session
so the trainer skips re-tokenization. Model-agnostic.
* run_glm47_355b.sh - reference launch script (GLM-4.7-355B-A32B,
8 nodes / 64 GPUs, colocate, E2B). All required env vars guarded by
\${VAR:?...}; no operator-specific paths.
* README.md - file table, sample flow diagram, dataset schema (flat
and remote_env_info layouts), required vs optional env knobs,
"Swap things out" recipes (model / agent / sandbox backend), and
design notes (no re-tokenization, reasoning round-trip, done-marker
poll, boot semaphore + retry).
…metadata configurable
* bridge.py renamed to middleware.py (file + class BridgeHandle ->
MiddlewareHandle + log prefix + thread name). The chat-log dataclass
field glm_messages was also renamed chat_messages -- the example is
model-agnostic and the GLM-specific naming was misleading.
* sandbox.py no longer hardcodes ``glm-platform/*`` metadata keys.
Set SWE_SANDBOX_METADATA_JSON='{"my-platform/size":"lg",...}' to pass
arbitrary routing tags into AsyncSandbox.create(metadata=...). Default
is empty, which works for stock E2B accounts.
* generate.py defaults SWE_TOOL_PARSER / SWE_REASONING_PARSER to None
(no hardcoded GLM-specific parser fallback). The reference launch
script run_glm47_355b.sh still sets these to glm47/glm45.
* sandbox.run_claude_code's ``bridge_url=`` kwarg renamed to
``middleware_url=`` (caller in generate.py updated).
* README + run_glm47_355b.sh updated for the rename and the new
metadata env var.
…ddleware - middleware: track each /messages call as a `_Turn` (request snapshot, response, finish/stop reason, parent prefix), expose via pop_session return value and `record_tree` opt-in. - middleware: detect non-linear message updates by hashing prior messages and rebuild the prompt when the client diverges, instead of silently appending. Also translate Anthropic `thinking` blocks into `reasoning_content` so prior reasoning is preserved across turns. - generate: add SWE_SAVE_TRAJECTORY_TREE env knob; when set, stash the exported tree under sample.metadata["trajectory_tree"]. Also allow overriding the Claude Code prompt via SWE_CC_PROMPT. - sandbox: pass --include-partial-messages / --include-hook-events to claude CLI and allow extra args via SWE_CLAUDE_EXTRA_ARGS; quote the trajectory path with shlex.
Captures the in-place rollout/middleware state after the 2026-05-21 work:
middleware.py:
- Raw-token preservation: _attach_pending_raw_tokens / _render_with_raw_splice
keep model-generated tokens verbatim across tool turns so loss_mask aligns
with actual sglang output, eliminating the chat_template re-render drift
that was wiping 95% of segments in earlier runs.
- cache_control stripping in _strip_cache_control (Anthropic adds blocks
sglang doesn't accept).
- _classify_branch overhaul: detects compact / compact_summarization /
sibling and falls back to linear, with linear-extend check ordered AHEAD
of the autoCompact-marker check so post-compact linear chains do not get
mis-flagged as a 'compact storm' purely because the resume pack floats
around in m[0..1].
- _is_compact_resume_request scans m[0..1] (the resume pack can land in
either when claude-code splits a Read tool call from its result).
- Per-turn trajectory_tree dump (id, parent_id, parent_prefix_len,
input_len, output_len, branch_kind, request, response) under
metadata.trajectory_tree when SWE_SAVE_TRAJECTORY_TREE=1.
- Abort/resume hardening: poll budgets, attempt caps, raw-token replay.
generate.py:
- Sandbox provisioning (concurrency-capped, retried) via _provision_sandbox.
- SWE_TIME_BUDGET_SEC / SWE_MAX_RESPONSE_TOKENS / SWE_CC_PROMPT plumbing.
- _cap_response_for_training caps the dumped response_ids to bound the
per-sample memory pressure (entropy_input.clone()).
sandbox.py:
- ensure_agent_user / apply_before_repo_set_cmd hooks for the GLM
swedev/sweap-images family.
- E2B reply-path host pin (SLIME_HEAD_HOST -> 172.27.14.209).
This commit lands the "one tree per sample" branching mode. The companion
branch list_trajectory will split the same tree into multiple trainable
trajectories (one per sub-agent dispatch / compact boundary / final chain).
…ple training
In tree_trajectory mode the middleware drops pre-wipe tokens at every chain
rebase, so only the final post-last-compact tail ever feeds the trainer.
list_trajectory mode preserves them.
middleware.py:
- _Session.completed_trajectories: list of (prompt_ids, response_ids,
loss_mask, meta) snapshots taken just before every wipe. meta records
{"kind": "pre_wipe"|"diverge_reset", "completed_turns": <int>} so the
downstream Sample fan-out can label what kind of boundary produced it.
- pop_session_split / _pop_session_split: variant of pop_session that
returns the snapshot list (plus the still-live final chain) and the
same trajectory_tree export as before.
generate.py:
- SWE_LIST_TRAJECTORY env switch. Default 0 keeps the tree_trajectory
single-Sample behavior; 1 calls pop_session_split and emits one
deep-copied Sample per non-empty segment.
- Each fanned-out Sample carries the same per-instance reward
(sample-level reward propagates verbatim) plus segment-specific metadata
(segment_idx, num_segments, segment_kind, completed_turns_at_split).
- trajectory_tree is attached only to the first segment to keep dumps
compact; viz still gets the full per-instance picture.
slime's sglang_rollout already supports custom_generate_func returning
list[Sample] (see generate_and_rm:283 `if isinstance(sample, list)`), so no
training-loop changes are required.
Smoke test (/tmp/smoke_list_trajectory.py): 4 PASS — empty session, 2-segment
fan-out, empty-final drop, 4-segment multi-wipe chain.
… + nearest-parent selection
Implements MASTER_PLAN.md §2 (C1-C6) and §3 (smoke + offline reclassify):
- C1 (middleware._export_tree): bump version 2 -> 3, emit initial_prompt_len.
- C2 (middleware._is_summarization_request): also scan non-last user msgs,
but skip ones containing _COMPACT_RESUME_MARKER (avoids false positives on
embedded resume packs that carry prior summary text).
- C3 (middleware._classify_branch): module-level LINEAR_DEFICIT_TOK=2048;
new near-linear tolerance channel for chat_template re-render drift. Guard
loosened from spec's `pfx > par.in + par.out/2` (rejected sib 14/22) to
`pfx > par.in` (consumed >=1 byte of par.output); still rejects true forks
via the deficit > 2048 condition. Justified in RECLASSIFY_V3_REPORT.md.
- C4 (middleware._classify_branch): root-parent special case when
initial_prompt_len is 0 (legacy dumps) but parent.id==0 + resume marker
present + pfx nearly fills root.full -> compact.
- C5 (middleware._new_turn): two-pass parent selection. Pass 1 walks
reversed(s.turns) and picks the most recent prev where prefix_len >=
prev.full_len - 2048 AND prefix_len > prev.input_len. Pass 2 (if pass 1
found none) falls back to original longest-prefix-match.
- C6 (reclassify_existing_rollout.py): explicit assertion on
initial_prompt_len > 0; legacy-dump soft-fall to turns[0].input_len with
a single per-dump warning.
Smoke test: 15/15 PASS (smoke_classifier_v3.py covers cases 1-6 from v2 +
the 9 new cases from spec_1 §4 table).
Reclassify deltas (full _new_turn replay):
- compact_only/rollout_0.pt : OLD={root:16, linear:388, compact:38, sibling:5, compact_summ:14}
NEW={root:16, linear:404, compact:20, sibling:1, compact_summ:20}
- 100k/rollout_0.pt (full) : OLD={root:16, linear:385, sibling:96, compact:9}
NEW={root:16, linear:448, compact:20, sibling:0, compact_summ:22}
- 100k/rollout_0.pt sample 11 : OLD={root:1, linear:51, sibling:22}
NEW={root:1, linear:65, compact:3, compact_summ:5, sibling:0}
sample-11 sibling == 0 PASS. compact_summ overshoot vs spec target traced to
real summarization requests caught by the v3 marker check (NOT regression);
see RECLASSIFY_V3_REPORT.md "Notes" section for the per-turn breakdown.
…es + segment cap
- Port the list_trajectory fan-out from list_trajectory branch into tree_trajectory
(gated by SWE_LIST_TRAJECTORY=1, default 0 keeps single-Sample behavior).
- Add SWE_LIST_TRAJECTORY_REWARD_MODE = {uniform | copy | final_only}, default
uniform (reward / num_segments). copy keeps the original per-segment full
reward; final_only zeros all but the kind=='final' segment.
- Add SWE_LIST_TRAJECTORY_MAX_SEGMENTS (default 8) hard cap. When K > cap,
keep segments[:cap-1] + segments[-1:] (head + final) to bound GBS growth
and GRPO baseline dilution. Drops the middle on purpose; logged WARN.
- Propagate list_trajectory_reward_mode + list_trajectory_finish_reason into
each fanned Sample's metadata for downstream tally / reward-allocation
audit. finish_reason may be None until sub-agent B wires the field.
- Extract reward allocation into pure _segment_reward() so the smoke test
can validate without spinning up middleware.
- New smoke_reward_modes.py: 6 cases covering uniform, copy, final_only,
N=1, N=0 defensive, and final_only with no final segment. All PASS.
Spec: 0521/specs/MASTER_PLAN.md §6, §A; spec_3 §3 row generate.py:312, §5 item 3.
…endent segments
Implements MASTER_PLAN.md §5 (subagent segment split) for the list_trajectory
branch. Adds per-dispatch state tracking via a new `_SubSession` dataclass and
a `subagent_stack` field on `_Session`, plus dispatch / return detection logic
inside `_handle_messages`.
User decisions from MASTER_PLAN preamble:
1. Subagent segment prefix = subagent's OWN system prompt + initial task,
never the main-line prefix from before the dispatch. Implemented by
letting each `_SubSession` hold its own `prompt_ids` / `chat_messages` /
`initial_prompt_len` and renderer-state; the subagent's segment in
`completed_trajectories` uses those rather than reusing main-line tokens.
2. Whitelist of segment kinds: only `subagent` / `pre_wipe` / `final`.
Removed the `diverge_reset` snapshot append at the prefix-divergence
reset site (was middleware.py:~1126); divergence now just drops the
current chain and the next chunk lands in the next pre_wipe / final.
`compact_summarization` was never emitted as a segment so no change.
3. Nested subagents: only the outermost (depth==1) pop emits a `subagent`
segment. Deeper pops merge into the parent via `_merge_into_parent`,
which concatenates `response_ids` and forces `loss_mask=1` for the
merged tokens. The outer segment's metadata records `nested_depth =
max(seen)` so post-hoc analysis can tell flat from nested subagents.
4. Subagent identification: hash `body["system"]` (`_hash_obj`) and
compare against the main-line cached `system_hash`. Different hash
while a Task/Agent dispatch marker is pending -> push a new sub-session.
5. Fallback: when identification is ambiguous (missing system, missing
dispatch marker, etc.) route to main line - never invent a sub-session.
Detection scheme:
* On every main-line response, scan `blocks` for `tool_use` with name in
{Task, Agent} via `_find_dispatch_tool_use_id`; record id on
`s._pending_dispatch_id` (or on the active sub-session for nested
dispatches).
* On every incoming request, before routing, scan new user messages for
a `tool_result` with the pending dispatch id via `_has_tool_result_for`.
Match -> pop top of `subagent_stack` and snapshot (or merge if nested).
* Routing decision: top-of-stack `system_hash` match -> route to that
sub-session; new `system_hash` with pending dispatch -> push a new
sub-session; else -> main line.
* `_pop_session_split` flushes any still-open sub-sessions at session
pop (claude-code exits mid-dispatch) and appends `last_finish_reason`
to the `final` segment metadata.
Smoke test: examples/coding_agent_rl/smoke_list_trajectory_v2.py covers
all 10 cases from spec_3_list_trajectory.md §4 (plus helper sanity); all
10 PASS. Case 8 (subagent + compact mix) is interpreted as 3 segments
[subagent, pre_wipe, final] (not the 5 mentioned in the spec note) -
the canonical semantics keep main-line linear continuation in one segment
unless a wipe or dispatch changes the routing target; documented in the
case 8 comment block.
Files:
examples/coding_agent_rl/middleware.py +320 / -87 (new _SubSession,
routing helpers, dispatch/return
detection, target-based handler,
updated _pop_session_split)
examples/coding_agent_rl/smoke_list_trajectory_v2.py +new (10 cases)
Implements SPEC §9 (Visualization). - viz/swimlane.py: per-sample swimlane Gantt (matplotlib broken_barh); color by segment_kind, red-hatch overlay when tito_masked_turns > 0, ↻N annotation when num_aborts > 0 - viz/stacked.py: 1-row-per-sample horizontal stacked bars for batch overview - both CLIs accept rollout dump .pt globs and load via torch.load
Combines SPEC §10.1 PRs THUDM#2-THUDM#5 into one atomic rewrite because all four touch the same single file (M1 decision: no sub-module split). Each PR's intent preserved as a distinct §section banner. PR THUDM#2 - §0 TOC docstring + §2 DATACLASSES (Turn, SubSession, SubSnapshot, Session) - Turn loses branch_kind/parent_id/parent_prefix_len/full_ids (0521 legacy) - Turn gains tito_masked: bool (U3) - SubSession gains num_aborts / last_finish_reason / tito_masked_turn_count - SubSnapshot adds finish_reason / num_aborts / tito_masked_turn_count - Session: active_subagent (flat) + completed_subagents + _emit_order (I6) - Session: num_aborts / num_aborts_this_turn / tito_*_turn_count - §3 PRIMITIVES & §4 STORE banners + Store.open_session(record_raw_dump=) PR THUDM#3 - §5 TRANSLATE adds verify_tito_for_turn (D2) - pure function: retokenize(decode(output_ids)) == output_ids - §6 ENGINE keeps AbortCoordinator + generate_with_abort_resume (P2 verbatim) PR THUDM#4 - §7 SEGMENTS rewrite - delete classifier triple: _classify_branch / _new_turn / 2-pass parent - delete _COMPACT_RESUME_MARKER text sniff + _SUMMARIZATION_MARKERS - delete _SubSession.subagent_stack nested stack -> flat active_subagent - new pick_target with nested-dispatch fail-safe (R2 CC3) - new classify_and_apply (4-condition is_append + pre_wipe snapshot) - new snapshot_subagent + pop_session_split (chronological replay of _emit_order) - segment meta key renamed kind -> segment_kind (U3 / R3) - U3: per-segment finish_reason + num_aborts + tito_masked_turns PR THUDM#5 - §8 HANDLER 15-step rewrite + §9 SHELL - _handle_messages: 380 lines -> ~140 lines, numbered 1-15 - step 12: I8 (n>0 guard) + I9 (abort skip TITO) + per-turn mask (only [-n:]) - Session.lock is sole sync primitive (P6); helpers never re-acquire (I3) - open_session(record_tree=) renamed to open_session(record_raw_dump=) - pop_session() single-segment API removed (list mode always on) - _export_raw_dump emits v4 schema (drops full_ids, adds tito_masked/etc) Tests (SPEC §7.1): 22 cases across 5 files, all passing - test_segments_classify.py - 7 cases (pre_wipe, nested fail-safe, etc.) - test_translate_tito.py - 6 cases incl. test_empty_turn_skip (I8 fix) and test_abort_skip_tito (I9) - test_engine_abort_resume.py - 3 cases (concatenate / max-attempts / budget) - test_session_lock.py - 2 cases (same-sid serialize, different sid) - test_pop_session_split.py - 4 cases (chronological, drain, U3 fields) - fixtures/README.md - schema doc
Implements SPEC §4 (generate.py) + §5 (sandbox.py -142) + §8 (5 run scripts).
generate.py:
- drop pop_session_split() / pop_session() dual API (list mode always on)
- drop SWE_LIST_TRAJECTORY / SWE_LIST_TRAJECTORY_REWARD_MODE /
SWE_LIST_TRAJECTORY_MAX_SEGMENTS env (replace with reducer hook)
- rename SWE_SAVE_TRAJECTORY_TREE -> SWE_DUMP_RAW_TRAJECTORY (Q5 / semantic
narrowing: now only attaches to segment 0 metadata, not a tree)
- new _default_uniform_fan_out (D4/Q7 reward/K)
- new _load_reducer with import fail-soft fallback
- new _fan_out_with_fail_soft (U4: 4 fail paths -> sample abort + metric)
- shallow copy.copy per sub-sample (R2 deepcopy memory advice)
- --swe-segment-reducer-path opt-in CLI arg (read via getattr; no slime
arg-parser change needed; default reducer used when absent)
sandbox.py (532 -> 388 lines, SPEC §5 trim):
- drop ~50 lines of redundant module/RPC docstrings (README is reference)
- merge _setup_swepro_assets + _apply_pre_commands into evaluate() body flow
- preserve P-equivalent: _rpc_retry / E2BSandbox 5-primitive contract /
install_node22 host-decompress / run_claude_code done-marker poll /
evaluate 3-mode branch (swepro / f2p_script / eval_cmd)
run_qwen36_base.sh + 4 scenario shells (SPEC §8):
- run_qwen36_base.sh : DRY common config + exec_train helper
- run_qwen36_linear_2steps : autoCompactWindow=1M + disable Agent/Task
- run_qwen36_subagent_2steps : investigator agent + force dispatch prompt
- run_qwen36_compact_2steps : autoCompactWindow=100K + disable subagents
- run_qwen36_mixed_2steps : investigator + autoCompactWindow=100K
- RUNTIME_ENV_JSON no longer forwards deleted SWE_LIST_TRAJECTORY* keys
test_reducer_failure.py (SPEC §7.1 U4): 8 cases all passing
- import path bad -> falls back to default + logs error
- reducer not callable -> falls back
- reducer raises -> sample abort
- reducer returns None / non-list -> sample abort
- happy path passthrough
Delete obsolete 0521 smoke scripts:
- smoke_classifier_v3.py / smoke_list_trajectory_v2.py / smoke_reward_modes.py
- reclassify_existing_rollout.py (0521 archive replay; superseded by SPEC §7.3)
Tests: 30/30 passing across all 6 test files.
Implements SPEC §10.3 (v0.3 round 3, U5 decision). - mv slime/utils/aiohttp_threaded.py -> examples/coding_agent_rl/aiohttp_threaded.py - middleware.py:69 import: from slime.utils.aiohttp_threaded -> from aiohttp_threaded (bare import to match sub-agent's existing sys.path + bare-import style; examples/coding_agent_rl/ is not a package so SPEC's relative-import form `from .aiohttp_threaded` doesn't apply here) Notes: - Logically part of PR THUDM#2 (dataclass/docstring cleanup) per SPEC §10.3, but filed as a standalone commit because the original PR THUDM#2-THUDM#5 commit (7091189) is no longer HEAD (HEAD is PR THUDM#6 / 159b2b0); amending HEAD would semantically corrupt PR THUDM#6 scope. User may interactively rebase to fold this into 7091189 if desired. - No 0521 legacy test/script files in this worktree, so only 1 import needed updating (vs SPEC §10.3 listing 5 files — those don't exist here). - All 30 smoke tests still pass.
Train backward on 152K-token samples OOMs at fused_cross_entropy alloc even with chunked log_probs (chunk_size=2) and reduced sglang mem-fraction. Root cause: actor activation peak on the longest sample exceeds 80 GiB despite CP=8. Cap per-sample total tokens at SLIME_MAX_SAMPLE_TOKENS in _get_rollout_data after Sample.from_dict (covers both load_debug and live rollout paths). Truncates response tail first; if prompt itself exceeds cap, collapses sample to no-gradient loss_mask=[0] (same contract as _abort). Verified on agent_only/rollout_0.pt (303 MiB, max 152K tokens) via oom_repro_load_debug.sh with cap=100K: 4/16 samples capped (1 truncated, 3 collapsed), step 0 SUCCEEDED, backward microbatch 4/4 finished in 19s, GPU peak free 5.62 GB. Default 0 = no cap (back-compat). See 0521/OOM_DIAGNOSIS.md §9.
Stage 8B real E2E run failed with ModuleNotFoundError on aiohttp_threaded:
middleware.py used a bare `import aiohttp_threaded` that only resolved when
coding_agent_rl/ itself was on sys.path -- which the in-tree smoke tests
set up but ray rollout workers (loading via importlib from external cwd)
never do.
Stage 9 fix was a dual-mode try/except. Stage 10 promoted to clean package
mode:
* add __init__.py making examples.coding_agent_rl an explicit package
* middleware.py: `from .aiohttp_threaded import AppHandle, ...`
* collapse smoke-test sys.path hack to a single worktree-root insert
* tests import via `from examples.coding_agent_rl import middleware`
* add test_external_cwd_import.py as B2 regression guard: subprocess
in /tmp imports the three modules to assert ray-worker-mode stays green
34/34 smoke tests pass.
Root cause (stage14_oom_rootcause_fanout.md, 2026-05-24):
0522 refactor removed the SWE_LIST_TRAJECTORY env knob and made fan-out
the unconditional path. archive-config (CP=4, SGLANG=0.75, no margin)
that the main repo runs cleanly in 32 min on this dataset failed with
GPU wake_up DENY at 6.48 GB and CE-forward OOM (73.7 GB allocated /
1 GB free / 20 MiB CE buffer fail). Investigation traced the failure
not to per-sample length but to ray rollout buffer sample-count
explosion (16 -> ~64-80 samples after fan-out, ~4-5x). Bigger ray.put +
host pinned mem footprint -> wake_up denied -> downstream CE OOM.
Fix: restore SWE_LIST_TRAJECTORY env knob (default 0); in the default
path, collapse to the FINAL segment (the reward-bearing
post-final-compact-reset segment). This matches main-repo behavior and
restores the archive baseline. Setting SWE_LIST_TRAJECTORY=1 still
opts into per-segment fan-out for users who want it.
Verified: fix_listtraj0_20260524_022842 PASS — 2 train steps in 45 min,
wake_up 7.5s / 11.97s (vs main-repo 9.5s / 10.07s baseline),
num_samples=16, ray job SUCCEEDED.
Implementation:
* extract _collapse_to_final_segment helper (testable in isolation)
* SWE_LIST_TRAJECTORY env constant w/ docstring explaining trade-offs
* 2 new smoke tests (test_collapse_to_final_segment_picks_last_segment
+ test_collapse_with_single_segment_keeps_it) directly verify the
short-arm of Plan D from the 0524 segment-per-sample research
* 36/36 tests pass
Long-term direction (Plan D long-arm) lives on a separate branch:
per-segment fan-out + PR THUDM#1933 per-rollout reducer port. See
docs/project_memory/1.segment_per_sample_oom/README.md.
…ples Port from THUDM/slime PR THUDM#1933 step 9 (R3 SWE-side glue): - _default_uniform_fan_out now sets sub.rollout_id = sample_proto.index on every fan-out sibling. With the PR-THUDM#1933 reducer this makes the loss treat the K segments as one trajectory (per-rollout mean) rather than K independent samples (over-counting by K). Kept intact: SWE_LIST_TRAJECTORY=0 collapse path (stage 14 OOM fix fallback), _fan_out_with_fail_soft wrapper, sub.index (shared across siblings so framework group_by(group_index) still works for GRPO).
…ssthrough Two new CPU-only test files under examples/coding_agent_rl/tests/ so they get picked up by the existing pytest config without touching pyproject testpaths or top-level tests/. test_dp_schedule.py (7 tests, ~120 LOC): - static one-step-one-sample-per-rollout invariant check - dynamic compact (2 samples/rollout) keeps same-rollout samples in one step - trailing rollouts get trimmed - < gbs rollouts raises AssertionError - step_size < dp_size raises AssertionError - balance_data round-robin vs KK partition test_rollout_id_passthrough.py (3 tests): - compact K=3 fan-out samples share rollout_id and rollout_mask_sums - default 1-sample-per-rollout falls back to index - SWE _default_uniform_fan_out wires rollout_id from sample_proto.index
E2 retry 2 run planD_e2_pr1933_fanout_20260524_073011 was killed at ~27 min by an external `concurrent.futures.TimeoutError` set on the RolloutManager.generate() future (15/16 trajectories completed, 1 still in sandbox.evaluate; see r5 doc). No env-tunable timeout (SWE_SANDBOX_LIFETIME_SEC=3600 / SWE_TIME_BUDGET_SEC=1200 / SWE_EVAL_TIMEOUT_SEC=600 / e2b REQUEST_TIMEOUT=60s) explains the 27-min number -- root cause likely a Ray job heartbeat or an undiscovered wrapper, not in our code. Fix (Plan B): wrap generate() body in asyncio.wait_for( _generate_inner(...), timeout=SWE_GENERATE_GUARD_SEC), where SWE_GENERATE_GUARD_SEC defaults to SWE_TIME_BUDGET_SEC + SWE_EVAL_TIMEOUT_SEC + 180 (1980s). On timeout, sample is aborted with reason `wall_clock_timeout` and a snapshot of pending asyncio tasks is logged for future debugging. This makes a single hung trajectory a recoverable per-sample abort instead of a whole-step ray job failure. Note: asyncio.TimeoutError IS builtin TimeoutError in Python 3.11+ (they were unified), so the except catches both. Added test_generate_guard_aborts_on_timeout in test_reducer_failure.py. Verified: 47/47 smoke tests pass. This fix should also be cherry-picked to list_trajectory_0522_exp when convenient -- the timeout path is shared between E0 (collapse) and E2 (fan-out).
Branch list_trajectory_0522_pr1933_e2_v3 (off pr1933 HEAD @ 82135cc). This experiment branch isolates v3 runs from the staging pr1933 branch so we can iterate without polluting the merge candidate. Diff vs planD_e2_pr1933_fanout.sh: * SCENARIO renamed to planD_e2_v3_with_guard * SWE_GENERATE_GUARD_SEC=1980 exported + added to runtime_env keys so ray workers see it (default in generate.py is the same; explicit here so the value is logged) * Updated header docstring to point at branch + previous failure PASS criteria: - At least 1 train step completes (compare to E2 retry that did) - If any trajectory hangs > 1980s, expect `wall_clock_timeout` abort + run continues (vs E2 retry 2 where the whole ray job died)
…ignment E2 v3 attempt 2 (run 20260524_110039) died at dp_schedule.py:176: num_mbs (23) is not a multiple of dp_size * mb_group (8); static path. Root cause: fan-out produced 23 samples from 16 rollouts (variable K). With static micro_batch_size=1 the static path requires num_samples % (dp_size * mb_group) == 0, which fan-out variability cannot guarantee. Earlier E2 retry (073011) only worked because the legacy sample-count trim accidentally landed on 32 (33 -> trim 32, 32 % 8 == 0). After the trim-by-rollout fix (a6af128), trim preserves 16 rollouts intact and the 23 fan-out samples now hit this static-path assertion that the accidental trim hid. Fix: add --use-dynamic-batch-size so dp_schedule.py:166-174 takes the dynamic-path branch that calls expand_bins_by_splitting to grow mbs count to the alignment target. Token-budget packing also handles the variable K naturally. No code change; this is config only on the v3 experiment branch.
Verified the examples/coding_agent_rl refactor end-to-end on archive_clone_2step_0524_fanout.sh; surfaced and fixed two latent bugs that blocked rollout 0 / rollout 2->3 transition; Run 3 completed 10/10 rollouts in 4h03m with 0 fatal errors. Fixes: 1) examples/coding_agent_rl/generate.py: fan-out abort path returned a bare Sample while the success path returns list[Sample] under SWE_LIST_TRAJECTORY=1. The mixed shape sneaks past _get_rollout_data's flatten loop and AttributeError's later in _save_debug_rollout_data. Added _abort_result() wrapper that wraps the abort Sample in a single-element list when fan-out is on, and routed all 5 abort return sites through it. 2) slime/ray/rollout.py: defensive normalization in _get_rollout_data — if any element of the live-rollout result is a list, wrap bare Samples to [Sample] before chain.from_iterable, then collapse remaining nesting. 3) slime/backends/sglang_utils/sglang_engine.py: SGLang scheduler asserts is_fully_idle() inside release_memory_occupation, but flush_cache only drains the radix cache, not in-flight forwards. This race crashed scheduler_0 (exit -3, unrecoverable) at the rollout 2 -> rollout 3 train transition. Added _wait_for_idle() that polls /get_load until num_reqs + num_waiting_reqs == 0 on every DP rank (60s timeout, warn-and-proceed on miss); call it between flush_cache and release_memory_occupation. Refactor surface (already in the working tree, verified by Run 3): - examples/coding_agent_rl/middleware.py simplified by ~1100 lines - examples/coding_agent_rl/sandbox.py / README.md aligned Reproducibility note: Run 3 RUN_ROOT logged at 0522/logs/archive_clone_2step_fanout_20260524_192238 (10 rollout_*.pt dumps, 0 release_memory_occupation crashes, 0 wait_for_idle timeouts). Reward stayed 0 because swe-train-1545-localcache.jsonl lacks swepro/eval_cmd — pipeline-only verification, not a learning run.
Inside examples/coding_agent_rl/, keep only the modules imported by the SWE rollout path: - __init__.py - aiohttp_threaded.py - generate.py - middleware.py - sandbox.py Drop the README, all run_*.sh launchers, tests/, and viz/. Other example subdirectories (eval_multi_task, geo3k_vlm, retool, tau-bench, ...) are untouched.
Remove the v3 reclassify report and its companion experiment launcher script from the repo.
Add a third metadata path for sweb-style rows that ship a self-contained pytest script under metadata.remote_env_info.f2p_script (ends with sys.exit(pytest.main([...]))). When eval_cmd is absent, _metadata now wraps the script via base64 into a shell one-liner that materializes /tmp/slime_f2p.py and runs it — sidesteps all shell quoting and lets python's exit code drive the existing reward path without touching _run_eval_cmd. Precedence stays the same: explicit eval_cmd wins over f2p_script, and swepro still wins over both when present.
…ed segments - generate.py: run pre_commands (typically `git checkout <base_sha> -f`) BEFORE writing PROBLEM_STATEMENT.md so `git clean -fd` doesn't wipe it, and so the model edits the same baseline `evaluate()` apply-checks against (otherwise applied_cleanly is always False). - sandbox.py: promote `_apply_pre_commands` to public `apply_pre_commands` for cross-module reuse from generate.py. - middleware.py: add SWE_MAX_SEGMENT_TOKENS (default 96000) drop cap. claude-code auto-compact estimates in its own tokenizer; a 100k autoCompactWindow can produce 130k+ Qwen-tokenized segments after sub-agent dispatch reads large files, OOM-ing fused CE on actor_train (single sample can't be split by dynamic batch).
3bdf682 to
ef8d88e
Compare
- generate.py: remove --swe-segment-reducer-path indirection (always use default uniform reducer), make SLIME_HEAD_HOST required (no hostname fallback that silently misroutes sandboxes), bump default SWE_TIME_BUDGET_SEC 900 -> 1800 - sandbox.py: make SWE_SANDBOX_IMAGE_METADATA_KEY required (no silent default key), hardcode RPC backoff base constant (never overridden by any script in the repo) - sglang_engine: drop _wait_for_idle polling helper (no longer called) - ray/rollout: drop _cap_sample_total_tokens token-cap hack and restore the simpler flatten loop; long-trajectory capping is now done at the segment level upstream
generate.py: - rename _provision_sandbox -> _boot_agent_sandbox - collapse _collapse_to_final_segment + _default_uniform_fan_out into _write_segment_to_sample (shared) + _fan_out_to_samples - inline the default collapse path into generate() (4 lines) - strip references to historical run ids from docstrings/comments middleware.py: - reorganize top-to-bottom into 10 numbered sections (constants / pure helpers / translation / splice / parsing / state / store / I/O / handler / entry) - move _attach_pending_raw_tokens to Chain.attach_pending_raw_tokens - rename segment kind pre_wipe -> wipe in docs and comments
- generate.py: build sid from (instance_id, sample.index, group_index) so it's unique by construction within a rollout step; fall back to random hex only when an index is missing. - middleware.py: open_session now raises on duplicate sid instead of silently merging two rollouts' state into one chain. - drop examples/coding_agent_rl/__init__.py; PEP 420 namespace package handles relative imports fine and the docstring's rationale was wrong.
Attach a uuid rid to /generate and fire-and-forget /abort_request when the client side is cancelled or the transport raises. Without this a subsequent release_memory_occupation can race with a still-pending generate and trip sglang's "server is idle" assertion, killing the scheduler.
- run_qwen36_35b_a3b_swe_8node.sh: reusable baseline launcher for Qwen3.5-35B-A3B SWE RL on 8 nodes.
- run_qwen36_35b_a3b_swe_8node_agent_only.sh: agent-only variant that forces sub-agent dispatch (investigator subagent + prompt + tool allowlist) so the saved trajectory tree carries real sibling branches.
- run_qwen36_35b_a3b_swe_8node_{linear,tree}.sh: thin wrappers toggling SWE_SAVE_TRAJECTORY_TREE.
- run_dynamic_trajectory_tree_train.sh: pytest smoke for trainer accepting both linear and trajectory_tree samples.
Keep only run_qwen36_35b_a3b_swe_8node_agent_only.sh as the canonical launcher; the linear/tree wrappers and the synthetic train smoke have been superseded.
…kip text round-trip
End-to-end loop for "coding agent + sandbox execution + test reward":
agentuser; chown repo; write problem_statement.md.claude --output-format stream-jsonpointed at a head-node bridge via ANTHROPIC_BASE_URL; ANTHROPIC_AUTH_TOKEN doubles as the session id for concurrent request demux.Layout under examples/coding_agent_rl/:
generate.py slime custom-generate entrypoint
sandbox.py all sandbox-side ops (boot/exec/upload, install
Node/Claude Code, run agent, git diff, eval)
bridge.py Anthropic Messages API <-> SGLang /generate
shim; model-agnostic
run_glm47_355b.sh 8-node / 64-GPU / colocate / E2B launch
script for GLM-4.7-355B-A32B
README.md walkthrough + dataset schema + swap-model howto
Model-agnostic: chat template via tokenizer.apply_chat_template; tool call parsing via sglang FunctionCallParser; reasoning parsing via sglang ReasoningParser. Swapping model = changing SWE_TOOL_PARSER / SWE_REASONING_PARSER envs (default glm47/glm45).
No new swe_rollout.py: reuses slime's default sglang_rollout outer loop via --custom-generate-function-path.