Feat/minimax m2.5 support #1929
Add full integration for MiniMax-M2.5, a 229B MoE model with 256 experts and top-8 routing. This includes:
- Model spec plugin with custom SelfAttention for full-dimension QK Norm (RMSNorm over all heads concatenated, with TP gather/scatter)
- mbridge weight bridge (HF <-> Megatron conversion via Qwen2MoEBridge)
- Megatron-to-HF converter for saving trained checkpoints
- Shell scripts: model args, RL training launch, HF <-> Megatron weight conversion (3-script pipeline)

Key architecture differences from standard Qwen2MoE:
- block_sparse_moe prefix with w1/w2/w3 expert naming
- Full-dimension QK Norm (q_norm/k_norm, not per-head)
- Sigmoid router with e_score_correction_bias
- Partial RoPE (rotary_percent=0.5)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
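For readers unfamiliar with the sigmoid router with `e_score_correction_bias`, here is a minimal sketch in plain Python. It is not the code in this PR. It assumes the common convention where the bias only shifts which experts are selected, while the combine weights are the unbiased sigmoid scores renormalized over the chosen experts. The function name `sigmoid_topk_route` and the toy sizes are hypothetical.

```python
import math

def sigmoid_topk_route(logits, correction_bias, top_k):
    """Pick top_k experts for a single token.

    Assumption: the bias affects selection only; the returned weights are
    the unbiased sigmoid scores, renormalized to sum to 1.
    """
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    biased = [s + b for s, b in zip(scores, correction_bias)]
    # Indices of the top_k largest biased scores, best first.
    chosen = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:top_k]
    total = sum(scores[i] for i in chosen)
    weights = [scores[i] / total for i in chosen]
    return chosen, weights

# Toy example: 4 experts, top-2 (the model itself uses 256 experts, top-8).
experts, weights = sigmoid_topk_route(
    logits=[2.0, 1.0, 0.0, -1.0],
    correction_bias=[0.0, 0.0, 0.5, 0.0],
    top_k=2,
)
```

In this toy input the bias promotes expert 2 into the selection ahead of expert 1, but expert 2 still receives the smaller combine weight because the weights come from the unbiased scores.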
Thank you for the PR! Just a quick check: could you provide some evidence that the implementation is working correctly, such as W&B screenshots, logs, or validation outputs?
AIME eval results, training on dapo-math-17k
parameters:
`ROLLOUT_ARGS=(` … `EVAL_ARGS=(` …

Summary
Add full integration for MiniMax-M2.5 (256 experts, top-8 routing), including:
- Model spec plugin (`slime_plugins/models/minimax_m2.py`): Custom `SelfAttention` with full-dimension QK Norm (RMSNorm over all heads concatenated, with TP gather/scatter)
- mbridge weight bridge (`slime_plugins/mbridge/minimax_m2.py`): HF ↔ Megatron weight mapping extending `Qwen2MoEBridge`
- Megatron-to-HF converter (`slime/backends/megatron_utils/megatron_to_hf/minimax_m2.py`): Reverse conversion for saving trained checkpoints back to HF format
- Shell scripts (`scripts/`): Model architecture args, RL training launch script, and 3-script HF ↔ Megatron weight conversion pipeline

Key architecture differences from standard Qwen2MoE
- `block_sparse_moe` prefix with `w1`/`w2`/`w3` expert naming, instead of `mlp`
- Full-dimension QK Norm (`q_norm`/`k_norm`, not per-head)
- Sigmoid router with `e_score_correction_bias`
- Partial RoPE (`rotary_percent=0.5`)
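
To illustrate why full-dimension QK Norm differs from the per-head variant, here is a small sketch in plain Python. It is not the plugin code. It omits the learned scale and any tensor parallelism, and the helper names (`rms_norm`, `per_head_norm`, `full_dim_norm`) are hypothetical. The point is that the full-dimension variant computes one RMS statistic over all heads concatenated, so it needs values from every head. That would explain the TP gather/scatter when heads are sharded across ranks.

```python
import math

def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned scale: x / sqrt(mean(x^2) + eps)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def per_head_norm(heads):
    """Normalize each head independently (one RMS statistic per head)."""
    return [rms_norm(h) for h in heads]

def full_dim_norm(heads):
    """Normalize over all heads concatenated (one RMS statistic in total)."""
    head_dim = len(heads[0])
    flat = [v for h in heads for v in h]
    normed = rms_norm(flat)
    # Split the normalized vector back into heads.
    return [normed[i:i + head_dim] for i in range(0, len(normed), head_dim)]

# Two toy heads with very different magnitudes.
heads = [[1.0, 1.0], [3.0, 3.0]]
per_head = per_head_norm(heads)   # both heads end up with roughly unit scale
full = full_dim_norm(heads)       # the 3x ratio between heads is preserved
```

Per-head normalization erases the magnitude difference between the two heads, while the full-dimension variant keeps their relative scale.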