Bug Description
raw error
timer.py:24 - Timer train_wait start
Traceback (most recent call last):
  File "/root/slime/train.py", line 110, in <module>
    train(args)
  File "/root/slime/train.py", line 24, in train
    actor_model, critic_model = create_training_models(args, pgs, rollout_manager)
  File "/root/slime/slime/ray/placement_group.py", line 152, in create_training_models
    start_rollout_ids = ray.get(
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 2822, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 930, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AssertionError): ray::MegatronTrainRayActor.init() (pid=289796, ip=33.163.45.138, actor_id=f847046dd56551e53b8fdcb002000000, repr=<slime.backends.megatron_utils.actor.MegatronTrainRayActor object at 0x7f4a53c113d0>)
  File "/root/slime/slime/utils/timer.py", line 97, in wrapper
    return fn(*args, **kwargs)
  File "/root/slime/slime/backends/megatron_utils/actor.py", line 113, in init
    self.weights_backuper.backup("actor")
  File "/root/slime/slime/utils/tensor_backper.py", line 96, in backup
    self._backup_hash_dict = _compute_hash_dict(dict(self._source_getter()))
  File "/root/slime/slime/backends/megatron_utils/update_weight/common.py", line 128, in <genexpr>
    ans = ((name, _maybe_get_cpu_backup(tensor)) for name, tensor in ans)
  File "/root/slime/slime/backends/megatron_utils/update_weight/common.py", line 136, in _maybe_get_cpu_backup
    if (cpu_tensor := torch_memory_saver.get_cpu_backup(x)) is not None:
  File "/usr/local/lib/python3.12/dist-packages/torch_memory_saver/entrypoint.py", line 86, in get_cpu_backup
    self._ensure_initialized()
  File "/usr/local/lib/python3.12/dist-packages/torch_memory_saver/entrypoint.py", line 92, in _ensure_initialized
    self._impl = _TorchMemorySaverImpl(**self._impl_ctor_kwargs)
  File "/usr/local/lib/python3.12/dist-packages/torch_memory_saver/entrypoint.py", line 100, in __init__
    self._binary_wrapper = BinaryWrapper(path_binary=self._hook_util.get_path_binary())
  File "/usr/local/lib/python3.12/dist-packages/torch_memory_saver/hooks/mode_preload.py", line 15, in get_path_binary
    assert len(interest_paths) == 1, (
AssertionError: TorchMemorySaver observes invalid LD_PRELOAD. You can use configure_subprocess() utility, or directly specify LD_PRELOAD=/path/to/torch_memory_saver_cpp.some-postfix.so python your_script.py. (LD_PRELOAD="" process_id=289796)

---------------------------------------
Job 'raysubmit_tpVuTC2CdefHwbfb' failed
---------------------------------------
Status message: Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):
  File "/usr/local/lib/python3.12/dist-packages/torch_memory_saver/entrypoint.py", line 92, in _ensure_initialized
    self._impl = _TorchMemorySaverImpl(**self._impl_ctor_kwargs)
  File "/usr/local/lib/python3.12/dist-packages/torch_memory_saver/entrypoint.py", line 100, in __init__
    self._binary_wrapper = BinaryWrapper(path_binary=self._hook_util.get_path_binary())
  File "/usr/local/lib/python3.12/dist-packages/torch_memory_saver/hooks/mode_preload.py", line 15, in get_path_binary
    assert len(interest_paths) == 1, (
AssertionError: TorchMemorySaver observes invalid LD_PRELOAD. You can use configure_subprocess() utility, or directly specify LD_PRELOAD=/path/to/torch_memory_saver_cpp.some-postfix.so python your_script.py. (LD_PRELOAD="" process_id=289796)
Steps to Reproduce
The only change is adding --disable-weights-backuper to a script that otherwise runs well:
ray job submit --address="http://127.0.0.1:8265" \
--runtime-env-json="${RUNTIME_ENV_JSON}" \
-- python3 /root/slime/train.py \
--actor-num-nodes 1 \
--actor-num-gpus-per-node 8 \
--rollout-num-gpus 8 \
--num-gpus-per-node 16 \
--sglang-log-level error \
--load-debug-rollout-data "$WORKDIR/ckpt/slime/debug/data_{rollout_id}.pt" \
--disable-weights-backuper \
${MODEL_ARGS[@]} \
${CKPT_ARGS[@]} \
${ROLLOUT_ARGS[@]} \
${OPTIMIZER_ARGS[@]} \
${GRPO_ARGS[@]} \
${WANDB_ARGS[@]} \
${PERF_ARGS[@]} \
${EVAL_ARGS[@]} \
${SGLANG_ARGS[@]} \
${MISC_ARGS[@]}
Expected Behavior
LD_PRELOAD should be found automatically.
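As a rough illustration of what "found automatically" could mean, the sketch below searches next to the installed package for a shared object whose name matches the one in the error message. The function name `find_preload_candidates` is hypothetical, and the file pattern `torch_memory_saver_cpp*.so` is an assumption taken from the error text; the real file name and location may differ between versions.

```python
import glob
import importlib.util
import os


def find_preload_candidates(package="torch_memory_saver",
                            pattern="torch_memory_saver_cpp*.so"):
    """Return shared-object files matching `pattern` near the installed package.

    Hypothetical helper: both the package layout and the file pattern are
    assumptions based on the wording of the AssertionError message.
    """
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return []  # package is not installed (or is not a package) here
    candidates = []
    for pkg_dir in spec.submodule_search_locations:
        # Check the package directory and its parent (site-packages / dist-packages).
        for base in (pkg_dir, os.path.dirname(pkg_dir)):
            candidates.extend(glob.glob(os.path.join(base, pattern)))
    return sorted(set(candidates))


if __name__ == "__main__":
    for path in find_preload_candidates():
        print(path)
```

If this prints exactly one path, that path is a plausible value for LD_PRELOAD in the failing actor process; an empty result would suggest the pattern or location assumed here is wrong for the installed version.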
Actual Behavior
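An AssertionError is raised in MegatronTrainRayActor.init(): TorchMemorySaver observes an invalid LD_PRELOAD (the value is empty, LD_PRELOAD="").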
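To confirm what a given process actually sees, a small diagnostic along these lines can be run inside it. This is only a sketch: the traceback shows `assert len(interest_paths) == 1` but not how `interest_paths` is built, so filtering on the substring `torch_memory_saver` is an assumption, and `preload_entries_of_interest` is a hypothetical name.

```python
import os


def preload_entries_of_interest(ld_preload, keyword="torch_memory_saver"):
    """Split an LD_PRELOAD value and keep entries whose path mentions `keyword`.

    The keyword filter is an assumption about what the library looks for.
    """
    # The dynamic linker accepts LD_PRELOAD entries separated by colons or spaces.
    entries = ld_preload.replace(":", " ").split()
    return [entry for entry in entries if keyword in entry]


if __name__ == "__main__":
    value = os.environ.get("LD_PRELOAD", "")
    matches = preload_entries_of_interest(value)
    print(f"pid={os.getpid()} LD_PRELOAD={value!r} matches={matches}")
```

In the failing run the log reports LD_PRELOAD="", so a check like this would find no matching entry in the actor process.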
Environment
- slime version:
- Python version:
- PyTorch version:
- CUDA/ROCm version:
- GPU type and count:
- OS:
- SGLang version (if relevant):
- Megatron-LM version (if relevant):
Logs
Additional Context
No response
Pre-submission Checklist