Miles CI gap Between ROCm & CUDA #1105

Description

@indianspeedster

Tracking ROCm CI parity with the NVIDIA PR Test workflow (.github/workflows/pr-test.yml). The goal is for the same per-commit tests to be green on AMD Instinct MI300/MI355X (ROCm) as on NVIDIA.

This is a snapshot of the currently active (enabled) per-commit tests at origin/main @ c8e85df24. A ticked box means the test is confirmed passing on MI355X; a trailing #NNNN is the PR carrying the ROCm fix, where one was needed. A blank box means the test is still outstanding: either the file was newly added or split and has not been run yet, or it is not yet green. Suites are bucketed by GPU type and count and selected at runtime by domain labels. To list a suite's contents:

python3 -m tests.ci.run_suite --hw cpu  --suite stage-a-cpu --list-only
python3 -m tests.ci.run_suite --hw cuda --suite <suite-name> --list-only
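
To list every suite in one pass, the two commands above can be generated per suite. This is a minimal sketch, not part of the repo: it assumes the suite names are exactly those in the section headings of this issue, and that `--hw cuda` is the right value for every GPU suite (the issue only shows `cpu` and `cuda`). The names `SUITES`, `list_command` and `list_all_suites` are illustrative.

```python
import subprocess

# Suite names copied from the section headings in this issue; the hardware
# value per suite is an assumption based on the two example commands.
SUITES = {
    "stage-a-cpu": "cpu",
    "stage-b-cpu": "cpu",
    "stage-b-2-gpu-h200": "cuda",
    "stage-c-2-gpu-h200": "cuda",
    "stage-c-4-gpu-h200": "cuda",
    "stage-c-8-gpu-h100": "cuda",
}


def list_command(suite: str, hw: str) -> list[str]:
    """Build the argv that lists one suite without running any tests."""
    return ["python3", "-m", "tests.ci.run_suite",
            "--hw", hw, "--suite", suite, "--list-only"]


def list_all_suites() -> None:
    """Run the listing command for every suite; call from the repo root."""
    for suite, hw in SUITES.items():
        print(f"== {suite} ==")
        subprocess.run(list_command(suite, hw), check=False)
```

Calling `list_all_suites()` from a checkout prints each suite's files under its own header, which makes it easier to diff the listing against the checklist below.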

77 / 83 passing.

stage-a-cpu (CPU) — 44 active

  • tests/fast/backends/megatron_utils/test_fp32_param_utils.py
  • tests/fast/backends/megatron_utils/test_lora_checkpoint_helpers.py
  • tests/fast/backends/megatron_utils/test_lora_hf_weight_iterator.py
  • tests/fast/backends/megatron_utils/test_lora_model_branches.py
  • tests/fast/backends/megatron_utils/test_lora_update_weight.py
  • tests/fast/backends/megatron_utils/test_lora_utils.py
  • tests/fast/backends/megatron_utils/test_lora_weight_sync_validation.py
  • tests/fast/backends/megatron_utils/test_model_provider_true_on_policy.py
  • tests/fast/backends/megatron_utils/test_qwen2_true_on_policy_conversion.py
  • tests/fast/rollout/generate_hub/test_tool_call_utils.py
  • tests/fast/rollout/generate_utils/test_openai_endpoint_utils.py
  • tests/fast/rollout/generate_utils/test_sample_utils.py
  • tests/fast/rollout/inference_rollout/test_compatibility.py
  • tests/fast/rollout/rm_hub/test_deepscaler.py
  • tests/fast/rollout/rm_hub/test_f1.py
  • tests/fast/rollout/rm_hub/test_gpqa.py
  • tests/fast/rollout/rm_hub/test_math_dapo_utils.py
  • tests/fast/rollout/rm_hub/test_math_utils.py
  • tests/fast/rollout/rm_hub/test_rm_hub.py
  • tests/fast/router/test_linear_trajectory.py
  • tests/fast/router/test_router.py
  • tests/fast/router/test_session_pretokenized_e2e.py
  • tests/fast/router/test_session_race_conditions.py
  • tests/fast/router/test_sessions.py
  • tests/fast/test_megatron_cli_flags.py
  • tests/fast/utils/chat_template_utils/test_pretokenized_chat.py
  • tests/fast/utils/chat_template_utils/test_pretokenized_via_tito.py
  • tests/fast/utils/chat_template_utils/test_template.py
  • tests/fast/utils/chat_template_utils/test_tito_tokenizer.py
  • tests/fast/utils/chat_template_utils/test_token_seq_comparator.py
  • tests/fast/utils/test_arguments.py
  • tests/fast/utils/test_async_utils.py
  • tests/fast/utils/test_dumper_utils.py
  • tests/fast/utils/test_env_report.py
  • tests/fast/utils/test_http_utils.py
  • tests/fast/utils/test_logging_utils.py
  • tests/fast/utils/test_lora_arguments.py
  • tests/fast/utils/test_mask_utils.py
  • tests/fast/utils/test_misc.py
  • tests/fast/utils/test_types.py
  • tests/fast/utils/test_utils/test_mock_sglang_server.py
  • tests/fast/utils/test_utils/test_mock_tools.py
  • tests/fast/utils/test_utils/test_session_verify_runner.py
  • tests/utils/test_sglang_config.py

stage-b-cpu (CPU)

No tests registered yet.

stage-b-2-gpu-h200 (CUDA, 2-GPU) — 15 active

  • tests/fast/rollout/generate_hub/test_multi_turn.py
  • tests/fast/rollout/generate_hub/test_single_turn.py
  • tests/fast/rollout/inference_rollout/integration/test_agent_metadata.py
  • tests/fast/rollout/inference_rollout/integration/test_basic.py
  • tests/fast/rollout/inference_rollout/integration/test_deterministic.py
  • tests/fast/rollout/inference_rollout/integration/test_dynamic_filter.py
  • tests/fast/rollout/inference_rollout/integration/test_group_rm.py
  • tests/fast/rollout/inference_rollout/integration/test_multi_sample.py
  • tests/fast/rollout/inference_rollout/integration/test_multi_turn.py
  • tests/fast/rollout/inference_rollout/integration/test_over_sampling.py
  • tests/fast/rollout/inference_rollout/integration/test_sample_filter.py
  • tests/fast/rollout/inference_rollout/integration/test_semaphore.py
  • tests/fast/utils/test_quantizer_ci.py
  • tests/fast/utils/test_mxfp8_quantizer.py
  • tests/fast/utils/test_nvfp4_quantizer.py (ROCm TE wheel has no NVFP4QuantizerRef)
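
The note on `test_nvfp4_quantizer.py` says the ROCm TE wheel has no `NVFP4QuantizerRef`. One way to detect that kind of gap before a test imports the symbol is an availability check like the sketch below. It is a generic helper, not the project's actual approach; `symbol_available` is a hypothetical name, and the demonstration uses the standard library because the issue does not name the module that would export `NVFP4QuantizerRef`.

```python
import importlib


def symbol_available(module_name: str, symbol: str) -> bool:
    """Return True if `module_name` imports cleanly and exposes `symbol`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, symbol)


# Demonstrated on the standard library so the sketch runs anywhere.
print(symbol_available("math", "sqrt"))               # True
print(symbol_available("math", "NVFP4QuantizerRef"))  # False
print(symbol_available("no_such_module_xyz", "x"))    # False
```

In a test file, the returned boolean could drive a skip condition for the ROCm run, so the suite reports a skip with a reason instead of an import error.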

stage-c-2-gpu-h200 (CUDA, 2-GPU) — 2 active

stage-c-4-gpu-h200 (CUDA, 4-GPU) — 12 active

stage-c-8-gpu-h100 (CUDA, 8-GPU) — 10 active
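
The 77 / 83 headline can be cross-checked against the per-suite active counts in the headings above. A small sketch, with the counts copied by hand from this issue (the dictionary and variable names are illustrative):

```python
# Active per-commit test counts, copied from the suite headings in this issue.
ACTIVE = {
    "stage-a-cpu": 44,
    "stage-b-cpu": 0,  # "No tests registered yet."
    "stage-b-2-gpu-h200": 15,
    "stage-c-2-gpu-h200": 2,
    "stage-c-4-gpu-h200": 12,
    "stage-c-8-gpu-h100": 10,
}
PASSING = 77  # headline figure from this issue

total = sum(ACTIVE.values())
remaining = total - PASSING
print(f"{PASSING} / {total} passing, {remaining} still open")
# prints "77 / 83 passing, 6 still open"
```

The per-suite counts sum to the 83 in the headline, leaving 6 tests still to turn green on MI355X.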
