[Bug] TorchMemorySaver observes invalid LD_PRELOAD when adding --disable-weights-backuper #1936

@zyfzjsc988

Bug Description

Raw error

timer.py:24 - Timer train_wait start
Traceback (most recent call last):
  File "/root/slime/train.py", line 110, in <module>
    train(args)
  File "/root/slime/train.py", line 24, in train
    actor_model, critic_model = create_training_models(args, pgs, rollout_manager)
  File "/root/slime/slime/ray/placement_group.py", line 152, in create_training_models
    start_rollout_ids = ray.get(
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 2822, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 930, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AssertionError): ray::MegatronTrainRayActor.init() (pid=289796, ip=33.163.45.138, actor_id=f847046dd56551e53b8fdcb002000000, repr=<slime.backends.megatron_utils.actor.MegatronTrainRayActor object at 0x7f4a53c113d0>)
  File "/root/slime/slime/utils/timer.py", line 97, in wrapper
    return fn(*args, **kwargs)
  File "/root/slime/slime/backends/megatron_utils/actor.py", line 113, in init
    self.weights_backuper.backup("actor")
  File "/root/slime/slime/utils/tensor_backper.py", line 96, in backup
    self._backup_hash_dict = _compute_hash_dict(dict(self._source_getter()))
  File "/root/slime/slime/backends/megatron_utils/update_weight/common.py", line 128, in <genexpr>
    ans = ((name, _maybe_get_cpu_backup(tensor)) for name, tensor in ans)
  File "/root/slime/slime/backends/megatron_utils/update_weight/common.py", line 136, in _maybe_get_cpu_backup
    if (cpu_tensor := torch_memory_saver.get_cpu_backup(x)) is not None:
  File "/usr/local/lib/python3.12/dist-packages/torch_memory_saver/entrypoint.py", line 86, in get_cpu_backup
    self._ensure_initialized()
  File "/usr/local/lib/python3.12/dist-packages/torch_memory_saver/entrypoint.py", line 92, in _ensure_initialized
    self._impl = _TorchMemorySaverImpl(**self._impl_ctor_kwargs)
  File "/usr/local/lib/python3.12/dist-packages/torch_memory_saver/entrypoint.py", line 100, in __init__
    self._binary_wrapper = BinaryWrapper(path_binary=self._hook_util.get_path_binary())
  File "/usr/local/lib/python3.12/dist-packages/torch_memory_saver/hooks/mode_preload.py", line 15, in get_path_binary
    assert len(interest_paths) == 1, (
AssertionError: TorchMemorySaver observes invalid LD_PRELOAD. You can use configure_subprocess() utility, or directly specify LD_PRELOAD=/path/to/torch_memory_saver_cpp.some-postfix.so python your_script.py. (LD_PRELOAD="" process_id=289796)

---------------------------------------
Job 'raysubmit_tpVuTC2CdefHwbfb' failed
---------------------------------------

Status message: Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):
  File "/usr/local/lib/python3.12/dist-packages/torch_memory_saver/entrypoint.py", line 92, in _ensure_initialized
    self._impl = _TorchMemorySaverImpl(**self._impl_ctor_kwargs)
  File "/usr/local/lib/python3.12/dist-packages/torch_memory_saver/entrypoint.py", line 100, in __init__
    self._binary_wrapper = BinaryWrapper(path_binary=self._hook_util.get_path_binary())
  File "/usr/local/lib/python3.12/dist-packages/torch_memory_saver/hooks/mode_preload.py", line 15, in get_path_binary
    assert len(interest_paths) == 1, (
AssertionError: TorchMemorySaver observes invalid LD_PRELOAD. You can use configure_subprocess() utility, or directly specify LD_PRELOAD=/path/to/torch_memory_saver_cpp.some-postfix.so python your_script.py. (LD_PRELOAD="" process_id=289796)
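
The assertion expects exactly one matching entry in LD_PRELOAD, while the failing actor process reports LD_PRELOAD="". For triage, here is a minimal diagnostic sketch that could be called inside the actor to print what the process actually sees. The helper names and the substring match on "torch_memory_saver" are assumptions for illustration, not the library's actual implementation.

```python
import os


def find_preload_candidates(ld_preload: str, marker: str = "torch_memory_saver") -> list[str]:
    """Return the LD_PRELOAD entries whose file name contains `marker`.

    `marker` is a hypothetical filter; the real library may match differently.
    """
    # ld.so accepts both colons and whitespace as LD_PRELOAD separators.
    entries = ld_preload.replace(":", " ").split()
    return [entry for entry in entries if marker in os.path.basename(entry)]


def describe_preload(env=None) -> str:
    """Summarize LD_PRELOAD for the current process (or a supplied mapping)."""
    env = os.environ if env is None else env
    value = env.get("LD_PRELOAD", "")
    matches = find_preload_candidates(value)
    return f"pid={os.getpid()} LD_PRELOAD={value!r} matches={matches}"


if __name__ == "__main__":
    # An empty match list corresponds to the failing case in the traceback.
    print(describe_preload())
```

Logging this from both the driver and the training actor would show whether the variable is missing only in the Ray worker process.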

Steps to Reproduce

Add only --disable-weights-backuper to a script that otherwise runs fine:

ray job submit --address="http://127.0.0.1:8265" \
   --runtime-env-json="${RUNTIME_ENV_JSON}" \
   -- python3 /root/slime/train.py \
   --actor-num-nodes 1 \
   --actor-num-gpus-per-node 8 \
   --rollout-num-gpus 8 \
   --num-gpus-per-node 16 \
   --sglang-log-level error \
   --load-debug-rollout-data "$WORKDIR/ckpt/slime/debug/data_{rollout_id}.pt" \
   --disable-weights-backuper \
   ${MODEL_ARGS[@]} \
   ${CKPT_ARGS[@]} \
   ${ROLLOUT_ARGS[@]} \
   ${OPTIMIZER_ARGS[@]} \
   ${GRPO_ARGS[@]} \
   ${WANDB_ARGS[@]} \
   ${PERF_ARGS[@]} \
   ${EVAL_ARGS[@]} \
   ${SGLANG_ARGS[@]} \
   ${MISC_ARGS[@]}

Expected Behavior

LD_PRELOAD should be found automatically.

Actual Behavior

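An AssertionError is raised in MegatronTrainRayActor.init(): "TorchMemorySaver observes invalid LD_PRELOAD" (LD_PRELOAD="" in the actor process).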
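
The error message suggests specifying LD_PRELOAD directly. One possible workaround, untested here, is to pass it through the `env_vars` field of the Ray runtime environment used in `--runtime-env-json`. The sketch below only shows how such a JSON string could be built; the helper name `with_ld_preload` is hypothetical, and the `.so` path is the placeholder from the error message, not a real location. Note that this would set LD_PRELOAD for every process in the job, which may not be what the library intends.

```python
import json


def with_ld_preload(runtime_env_json: str, so_path: str) -> str:
    """Return a copy of a Ray runtime-env JSON string with LD_PRELOAD in env_vars.

    Hypothetical helper for illustration; `so_path` must point at the real
    torch_memory_saver shared object on the worker nodes.
    """
    runtime_env = json.loads(runtime_env_json) if runtime_env_json else {}
    env_vars = dict(runtime_env.get("env_vars", {}))
    existing = env_vars.get("LD_PRELOAD", "")
    # Prepend the hook library and keep any entries that were already present.
    env_vars["LD_PRELOAD"] = f"{so_path}:{existing}" if existing else so_path
    runtime_env["env_vars"] = env_vars
    return json.dumps(runtime_env)


if __name__ == "__main__":
    # Placeholder path copied from the error message; replace with the actual file.
    print(with_ld_preload("{}", "/path/to/torch_memory_saver_cpp.some-postfix.so"))
```

The printed string could then be exported as RUNTIME_ENV_JSON before running `ray job submit`.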

Environment

  • slime version:
  • Python version:
  • PyTorch version:
  • CUDA/ROCm version:
  • GPU type and count:
  • OS:
  • SGLang version (if relevant):
  • Megatron-LM version (if relevant):

Logs

Additional Context

No response

Pre-submission Checklist

  • I have read the CONTRIBUTING.md and understand the collaboration scope.
  • I have read the documentation and my issue is not addressed there.
  • I have searched for existing issues and this is not a duplicate.
  • I have provided a minimal, reproducible example.
