Merged
Changes from 117 commits
118 commits
dfdd382
Changed VERSION to 2.13.0.dev0
ptrendx Jan 20, 2026
27fc168
[Common] Enable determinism for cuDNN >= 9.18.1 on Blackwell (#2584)
cyanguwa Jan 20, 2026
fbb16f4
[Common] Tuned NVFP4 cast kernel (#2412)
Oleg-Goncharov Jan 21, 2026
36f4e45
Fixed the year to 2026 (#2611)
Oleg-Goncharov Jan 21, 2026
605786f
[pyTorch] CPU performance optimizations (#2439)
ptrendx Jan 21, 2026
8bf37f0
[JAX] Fix cb.CUDAOptions usage for Triton 3.6.0 (#2610)
jberchtold-nvidia Jan 22, 2026
3d46bf6
Permutation to always return group_size/tokens_per_expert (#2613)
tdophung Jan 22, 2026
0f0e229
[PyT] Update THD sink attention logic for cudnn >=9.18.0 (#2568)
cuichenx Jan 22, 2026
c6a92a4
Add support for SWA (left, right) with FusedAttention (#2477)
sudhakarsingh27 Jan 22, 2026
52ee5ea
Fix bugs in permutation custom partitioning (#2617)
tdophung Jan 23, 2026
a0a89a8
[Common] Disabled the tuned NVFP4 kernels (#2615)
Oleg-Goncharov Jan 23, 2026
7259276
[PyTorch] Support user-defined op fusions (#2597)
timmoon10 Jan 25, 2026
2dbfbc7
fix(examples): te_llama compatibility with transformers >= 4.57 (#2572)
sbhavani Jan 26, 2026
2104e4c
[JAX] Use "nyu-mll/glue" instead of "glue" for encoder datasets to fi…
jberchtold-nvidia Jan 27, 2026
f04b094
[PyTorch] ONNX test fix + export for FP8 attention (#2598)
pggPL Jan 28, 2026
b9f4013
[common] Add support for cuBLASLt GEMM for GroupedTensor (#2502)
pggPL Jan 28, 2026
f8cca8b
[Pytorch] Fix wheel test (#2635)
pggPL Jan 29, 2026
c3769cb
Fix minimum version of cublas for grouped gemm (#2631)
pggPL Jan 30, 2026
3ceb248
More detailed documentation for recipes (#2343)
pggPL Feb 2, 2026
94ba75d
Support building with headers from nvidia wheels (#2623)
vmarkovtsev Feb 3, 2026
29b84c1
[Common] Fix NVFP4 tuned-kernel numerics (#2639)
Oleg-Goncharov Feb 3, 2026
74faf7e
[PyTorch Debug] NVFP4 debug stats support (#2296)
pggPL Feb 3, 2026
59f6f38
[JAX] Update JAX container in readme (#2648)
jberchtold-nvidia Feb 4, 2026
71971e3
Fix exp2f_rcp to properly handle nan and 0xFE cases (#2647)
kainzhong Feb 6, 2026
7393947
[Common] MXFP8 kernel for grouped tensors (#2586)
Oleg-Goncharov Feb 6, 2026
dccf67e
[Common] Bucket batch size with higher granularity for THD (#2653)
cyanguwa Feb 7, 2026
c1a0c97
[PyTorch][Core][JAX] Expand troubleshooting docs (#2602)
jberchtold-nvidia Feb 9, 2026
b841243
[PyTorch Debug] Skip logging stats if unsupported (#2652)
pggPL Feb 9, 2026
2894e49
[Pytorch] Add get_backward_dw_params api for TE module (#2614)
Wohox Feb 9, 2026
b09ff7e
[pyTorch] Fix the compilation warnings (#2663)
ptrendx Feb 10, 2026
01ac7f8
[Pytorch] Make test script generate checkpoints if they don't exist (…
kainzhong Feb 10, 2026
8d15258
Fix Broken Quickstart Links (#2641)
faradawn Feb 11, 2026
8ebb47e
Fix on TE to support Mcore Vision Encoder CUDA Graph (#2657)
tomlifu Feb 11, 2026
ac81c85
[PyTorch] Python `GroupedTensor` (#2654)
ksivaman Feb 11, 2026
402ea54
[C] NVFP4 quantization for `GroupedTensor` (#2655)
ksivaman Feb 11, 2026
c4175fc
fix(build): Handle namespace packages for PyPI CUDA detection (#2580)
sbhavani Feb 12, 2026
93d51c8
[Common] Fuse pre-swizzling into grouped MXFP8 quantization kernel (#…
Oleg-Goncharov Feb 12, 2026
3774aa3
[PyTorch] Add ops for MoE grouped MLP (#2664)
timmoon10 Feb 12, 2026
33ca615
Add sigmoid GLU (#2656)
singleheart Feb 12, 2026
cd098e4
fix: correct FusedAdam copy-paste in FusedSGD error messages (#2675)
Mr-Neutr0n Feb 12, 2026
496620a
Get rid of nvshmem dependency for cuBLASMp integration (#2661)
vcherepanov-nv Feb 12, 2026
f844905
[PyTorch] Make grouped weights opt-in (#2678)
ksivaman Feb 13, 2026
5d112e3
[JAX] TE Permutation integration to Maxtext (#2672)
tdophung Feb 13, 2026
fa68781
Fix `build_tools` missing from sdist causing `uv` cached installs to …
hemildesai Feb 17, 2026
7e48fa1
[JAX] Debugging inspect utility (#2651)
jberchtold-nvidia Feb 17, 2026
f122b07
Changed VERSION to 2.14.0.dev0
ptrendx Feb 18, 2026
2d0d276
[PyT] Plumbing correct bias dims from TE to cudnn, while adding suppo…
KshitijLakhani Feb 18, 2026
63defea
Update cudnn-frontend to v1.18 (#2689)
cyanguwa Feb 20, 2026
e583222
[PyTorch] Documentation for op fuser API (#2447)
timmoon10 Feb 20, 2026
57b5b60
Fix race condition in RHT amax kernels (#2695)
ksivaman Feb 21, 2026
e8f7c5a
Add and verify support for `deterministic` fp8 dpa/mha on SM100 (#2621)
sudhakarsingh27 Feb 24, 2026
39b6dd9
[PyTorch Debug] Custom feature tutorial. (#2216)
pggPL Feb 24, 2026
7d1de30
Fix vermin pre-commit hook (#2699)
pstjohn Feb 24, 2026
459e7cf
[Common][PyTorch] Fuse scaling and unscaling of bf16 momentums into k…
yaox12 Feb 24, 2026
9eb982e
Fix incorrect MNNVL fabric check (#2626)
nvcastet Feb 24, 2026
f8b271f
[JAX] Fix FSDP when FSDP+EP is active (#2649)
jberchtold-nvidia Feb 24, 2026
7222d87
[PyTorch Debug] Support precision debug tools for fp8 model parameter…
pggPL Feb 25, 2026
df0ef6e
remove deprecated qkv/kv_packed apis (#2696)
sudhakarsingh27 Feb 25, 2026
842b770
[Common] Remove volatile keyword in fused router kernel utils (#2683)
denera Feb 26, 2026
ad56283
[CI] Cancel on concurrency (#2708)
yaox12 Feb 27, 2026
b345941
[PyTorch] `GroupedTensor` integration (#2600)
ksivaman Feb 27, 2026
a9a9b3a
[Common][PyTorch] Enhance the fused router and unify the precision (#…
yaox12 Feb 27, 2026
3ecb5bf
[PyTorch] Fix L3 FA tests (#2709)
cyanguwa Feb 28, 2026
f508e66
[PyTorch] Remove `is_first_microbatch` setting after cudagraph warmup…
buptzyb Mar 2, 2026
537f134
[Common][PyTorch] Fix normalization for `fused_score_for_moe_aux_loss…
Autumn1998 Mar 2, 2026
bba7bf6
[PyTorch] Support cuda graph capturing offloading module (#2435)
lhb8125 Mar 2, 2026
3275e1a
[JAX] CGEMM with Shardy (#2714)
phu0ngng Mar 2, 2026
9dac78e
CPU Overhead Optimizations (#2559)
vthumbe1503 Mar 3, 2026
c68ec31
Add fast_set_attr to modules not inheriting from base.py (#2724)
vthumbe1503 Mar 3, 2026
39d249b
[JAX] Remove GSPMD tests + adding guards and warning msg for GSPMD ru…
phu0ngng Mar 3, 2026
a3bc040
NVFP4 primary weight support (#2691)
WanZzzzzz Mar 3, 2026
bf3201a
[PyTorch] Support single parameter for `GroupedLinear` (#2731)
ksivaman Mar 4, 2026
00ba0b4
pass params_dtype to qk_norm creation (#2718)
pstjohn Mar 4, 2026
505b896
[JAX] GSPMD Deprecation Warning - Only trigger when the primitive is …
phu0ngng Mar 4, 2026
139c863
Add fused_adam, quantized_model_init, and fsdp2 example (#2698)
pstjohn Mar 4, 2026
56c2fa6
[JAX] Support calling MOE router kernels from JAX side (#2711)
tdophung Mar 4, 2026
d2e4755
[PyTorch] Skip `test_nvfp4_partial_cast_matches_full` test when NVFP4…
ksivaman Mar 5, 2026
145e88c
Add multi-precision training support to FSDP script (#2662)
aagallo Mar 5, 2026
d9152b0
[PyTorch] Support `GroupedTensor` torch ops for DDP and distributed o…
ksivaman Mar 5, 2026
d226ce2
[JAX] Integrate BF16 Grouped GEMM with on-device group sizes (#2680)
jberchtold-nvidia Mar 5, 2026
d40b9de
WAR sort_chunks_by_index intermittent failures in L0 JAX unitttest pa…
tdophung Mar 5, 2026
5fd5c35
Fix FP8 block scaling with sequence parallel (#2637)
cuichenx Mar 8, 2026
ab9d60e
[PyTorch] Zero-initialize learnable softmax_offset in DotProductAtten…
fjosw Mar 8, 2026
e9ea352
docs: update cuDNN sliding window attention support (#2624)
sbhavani Mar 8, 2026
6638fef
[JAX] GEMM tex and FFI cleanup (#2739)
phu0ngng Mar 8, 2026
34a6c0a
Fix Flash Attention 3 API compatibility for window size parameters (#…
jhvmhg Mar 9, 2026
6e0085a
[Common] Remove redundant grad_logits zero-initialization in fused ro…
roycho96 Mar 9, 2026
f64941a
Enable dequantization from MXFP8 tensor with only columnwise data (#2…
ptrendx Mar 10, 2026
e6d97ff
[PyTorch] Fix cross_entropy_forward stride guard for non-contiguous i…
Bias92 Mar 10, 2026
7c2aa2c
[Common] MOE Split dBias (#2674)
Oleg-Goncharov Mar 10, 2026
3846bf7
Fix deploy nightly docs issue (#2636)
pggPL Mar 10, 2026
d32f9e4
[JAX] Fix get_seqlens_and_offsets() to accept vmapped seg ids and non…
KshitijLakhani Mar 11, 2026
61d5865
[NVFP4][MOE] Add unfused quantization fallback when input shape is no…
zhongbozhu Mar 11, 2026
7545d8c
[PyTorch debug] Fix issue with tp_group=None (#2733)
pggPL Mar 11, 2026
107f558
Documentation for cpu offloading (#2520)
pggPL Mar 11, 2026
d5ce416
Add guard at lowest JAX version that still supports triton kernel cal…
tdophung Mar 11, 2026
f6001c4
Support configurable number of philox rounds for stochastic rounding …
ksivaman Mar 11, 2026
61f9594
[All] Added better error messages (#2705)
ptrendx Mar 11, 2026
c021e7e
[PyTorch] Fix fuser so it releases tensors properly (#2750)
kainzhong Mar 11, 2026
7fb10d3
[PyTorch] Add dtype information to QuantizedTensorStorage class (#2676)
ptrendx Mar 11, 2026
4c5b1a2
[JAX] Change dtype of intermediate result aval of fused_topk_and_scor…
tdophung Mar 11, 2026
06a23e3
Initial commit to pass scale as Tensor for multi_tensor_scale op (#2594)
vasunvidia Mar 12, 2026
ef703e5
[Core] MXFP8 grouped GEMM + tensor-scaled FP8 fixes (#2748)
jberchtold-nvidia Mar 12, 2026
67898a7
Cherry pick "Adds dst.dtype information in copy_ method of quantized …
ptrendx Mar 12, 2026
134304e
Fused kernel for calculating offsets from first dim splits (#2755)
ksivaman Mar 12, 2026
a5d7464
Added new users to CI (#2756)
ptrendx Mar 12, 2026
6a68c73
[PyTorch] Error out if constructing `LayerNormLinear` with row tensor…
timmoon10 Mar 12, 2026
14c29da
[JAX] Collective GEMM with FP8 and MXFP8 support (#2740)
phu0ngng Mar 13, 2026
fcceeb9
[Pytorch] Add QuantizedTensor support in FusedAdam.step for MXFP8Bloc…
jomitchellnv Mar 13, 2026
306e853
add .claude to gitignore (#2762)
pstjohn Mar 13, 2026
b7214fd
Fix for async dcp checkpointing with Float8Tensors (#2721)
pstjohn Mar 15, 2026
708d7c1
Pytorch binding for cublas grouped gemm + Grouped Bias Support + Grou…
vthumbe1503 Mar 16, 2026
2156e61
Merge commit '708d7c160ad6b2bf44c9c597083d4cbb4860f068' from upstream
ipanfilo Apr 14, 2026
268fcb7
Resovle merging errors, restore missing codepaths, fix functional and…
ipanfilo Jun 2, 2026
d4b0313
Merge with dev commit 27f4acd2 and restore functional issue after mer…
ipanfilo Jun 2, 2026
561d64c
Fix tests and address review commits
ipanfilo Jun 2, 2026
74aec94
Merge with dev commit a47087b4
ipanfilo Jun 2, 2026
6dc02f1
Merge branch 'dev' into IFU-dev-20260315-v2.14
ipanfilo Jun 2, 2026
161 changes: 161 additions & 0 deletions .github/agent-skills/copyright-check/SKILL.md
@@ -0,0 +1,161 @@
---
name: copyright-check
description: Verify AMD copyright header compliance on files modified or introduced by ROCm. Checks presence, format, and year correctness. Use whenever reviewing a PR on the ROCm TransformerEngine fork, or when asked to audit copyright headers.
argument-hint: [base-branch] [--paths <glob>...]
allowed-tools: [Read, Glob, Grep, Bash]
---

# Copyright Header Check

Audits AMD copyright headers on files that ROCm has modified or introduced. The
existing `qa/L0_license/copyright_checker.py` only validates NVIDIA headers —
this skill is the AMD-side counterpart.

## Arguments

- `<base-branch>` — diff against this ref (default: `dev`).
- `--paths <glob>...` — restrict the check to specific paths. Without this,
every file changed in the diff is checked.

## Scope detection

Determine the file set to audit:
```
base="${1:-dev}"
git fetch origin "$base" --quiet
git diff --name-status "origin/$base"...HEAD
```
- `A` = added — counts as **ROCm-introduced**.
- `M` / `R` = modified or renamed — counts as **ROCm-modified**.
- `D` = deleted — skip.
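The status mapping above can be sketched in Python. This is a minimal illustration, not part of the existing checker: the function name `group_name_status` and the bucket names are hypothetical, and statuses the skill does not name (for example `C` or `T`) are assumed to be skipped.

```python
def group_name_status(output: str) -> dict[str, list[str]]:
    """Group paths from `git diff --name-status` output by audit category."""
    groups = {"introduced": [], "modified": [], "skipped": []}
    for line in output.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        # Renames look like "R100<TAB>old<TAB>new"; the last field is the new path.
        status, path = fields[0], fields[-1]
        if status.startswith("A"):
            groups["introduced"].append(path)
        elif status[0] in ("M", "R"):
            groups["modified"].append(path)
        else:
            # "D" (deleted) and, by assumption, any status the skill does not name.
            groups["skipped"].append(path)
    return groups
```

Feed it the captured stdout of the `git diff --name-status` command shown above.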

Skip files that the existing checker also skips (see `qa/L0_license/config.json`):
binaries, `3rdparty/`, `LICENSE`, `VERSION`, `.png`, `.ipynb`, `.json`, `.md`,
`.txt`, Dockerfiles, generated files. Also skip files outside these extensions:
`c, cpp, cu, h, cuh, hpp, hip, py, sh, cmake, yml, yaml, toml, rst, cfg`,
plus `CMakeLists.txt`. If a file has none of those, note it as "unchecked
filetype" and move on — do not flag.

## Header rules

Read the **first 15 lines** of each in-scope file (headers may sit below a
shebang and a coding declaration). The current year is `$(date +%Y)` — call it
`Y`. Use that value, do not hardcode it.

### Comment-style families
| Family | Comment marker | File types |
|---|---|---|
| hash | `#` | py, sh, cmake, yml, yaml, toml, cfg, `CMakeLists.txt` |
| c-block | `/*` … `*/` or `//` | c, cpp, cu, h, cuh, hpp, hip |
| rst | `..` (indented) | rst |
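As a rough sketch, the table can be expressed as a lookup from file name to comment family. The names `FAMILIES` and `comment_family` are hypothetical, and lowercasing the extension is an assumption; a `None` result corresponds to the "unchecked filetype" note.

```python
from pathlib import PurePosixPath

# Extension sets copied from the table above.
FAMILIES = {
    "hash": {"py", "sh", "cmake", "yml", "yaml", "toml", "cfg"},
    "c-block": {"c", "cpp", "cu", "h", "cuh", "hpp", "hip"},
    "rst": {"rst"},
}


def comment_family(path: str):
    """Return the comment-style family for a path, or None if unchecked."""
    p = PurePosixPath(path)
    if p.name == "CMakeLists.txt":
        return "hash"
    ext = p.suffix.lstrip(".").lower()  # assumption: extensions compare case-insensitively
    for family, extensions in FAMILIES.items():
        if ext in extensions:
            return family
    return None  # "unchecked filetype": note it and move on
```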

### AMD copyright line — required form
```
<comment> Copyright (c) <YEARS>, Advanced Micro Devices, Inc. All rights reserved.
```
where `<YEARS>` is either `Y` (single year) or `Yfirst-Y` (range ending in
the current year). Use a regex like:

```
Copyright \(c\) (\d{4})(-(\d{4}))?, Advanced Micro Devices, Inc\. All rights reserved\.
```
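One possible way to apply that regex to the first 15 lines of a file, shown as a sketch. The helper name `find_amd_years` and its return shape are assumptions made for illustration.

```python
import re

# Same pattern as above, compiled once.
AMD_RE = re.compile(
    r"Copyright \(c\) (\d{4})(-(\d{4}))?, Advanced Micro Devices, Inc\. All rights reserved\."
)


def find_amd_years(text: str, max_lines: int = 15):
    """Return (line_number, first_year, last_year) for the AMD line, or None."""
    for lineno, line in enumerate(text.splitlines()[:max_lines], start=1):
        match = AMD_RE.search(line)
        if match:
            first = int(match.group(1))
            # A single year is treated as a range of length one.
            last = int(match.group(3)) if match.group(3) else first
            return lineno, first, last
    return None
```

The returned line number can feed the `<path>:<line>` heading of a finding.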

### NVIDIA copyright line — preserved form
```
<comment> Copyright (c) <YEARS>, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```

### File classification (in scope)

For each file, classify by its **post-change** content combined with its diff
status:

1. **ROCm-only file** — added by ROCm, no NVIDIA copyright present.
Required: AMD copyright. NVIDIA copyright must NOT be added.
2. **Modified upstream file** — already existed with an NVIDIA copyright before
this PR, ROCm is now changing it.
Required: BOTH AMD and NVIDIA copyright lines, AMD line first.
3. **Already-mixed file** — already had both AMD and NVIDIA headers, ROCm is
modifying it again.
Required: both lines remain; AMD year extended to include `Y`.
4. **Cherry-picked from upstream** — added in this PR but contains an NVIDIA
copyright (file pulled from upstream NVIDIA TE).
Required: NVIDIA copyright preserved verbatim; AMD copyright added only if
the cherry-pick included AMD-authored modifications.

To distinguish (1) from (4) on added files: check `git log` of the upstream
remote (`origin/upstream-main` or whatever upstream tracking branch exists —
fall back to checking whether the path exists on `origin/main`). If unsure,
flag as "ambiguous" and let the human decide.

To know whether a *modified* file had an NVIDIA copyright before the PR:
```
git show "origin/$base:$path" | head -15 | grep -F "NVIDIA CORPORATION"
```
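The classification rules can be summarized as a small decision function. This is a sketch under stated assumptions: the name `classify_file` and its parameters are hypothetical, `exists_upstream=None` means the upstream check could not be performed, and a modified file with no prior NVIDIA copyright is reported as ambiguous because the four classes above do not name that case.

```python
from typing import Optional


def classify_file(
    status: str,
    nvidia_now: bool,
    nvidia_before: bool = False,
    amd_before: bool = False,
    exists_upstream: Optional[bool] = None,
) -> str:
    """Map diff status and header facts to one of the classification labels."""
    kind = status[:1]
    if kind == "A":
        if exists_upstream is None:
            return "Ambiguous"  # upstream check unavailable: let a human decide
        if nvidia_now and exists_upstream:
            return "Cherry-pick"
        if not nvidia_now and not exists_upstream:
            return "ROCm-only"
        return "Ambiguous"  # header and upstream history disagree
    if kind in ("M", "R"):
        if nvidia_before and amd_before:
            return "Already-mixed"
        if nvidia_before:
            return "Modified upstream"
        return "Ambiguous"  # assumption: case not named by the rules above
    raise ValueError(f"status {status!r} is out of scope for classification")
```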

## Year correctness

For every file in scope, the AMD copyright year must include the current year
`Y`:
- Single year `YYYY`: must equal `Y`.
- Range `Yfirst-Ylast`: `Ylast` must equal `Y`. `Yfirst` must be ≤ `Ylast` and
≥ the year the AMD copyright was first added (best-effort: `git log
--diff-filter=A --follow --format=%ad --date=format:%Y -- $path | tail -1`).

For NVIDIA copyrights on modified files: the year MUST NOT have been changed
by this PR. Compare against `git show "origin/$base:$path"`. ROCm changes do
not extend NVIDIA's copyright window.
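Both year rules can be sketched as follows. The function names are hypothetical, and the NVIDIA pattern is an assumption modeled on the preserved form shown earlier; `first_added` stands for the best-effort year taken from `git log`.

```python
import re

# Assumed pattern for the preserved NVIDIA line; captures the <YEARS> text.
NVIDIA_RE = re.compile(
    r"Copyright \(c\) (\d{4}(?:-\d{4})?), NVIDIA CORPORATION & AFFILIATES\. All rights reserved\."
)


def amd_year_problems(first: int, last: int, current: int, first_added=None):
    """Return a list of problems with an AMD year or year range (empty if fine)."""
    problems = []
    if last != current:
        problems.append(f"last year {last} is not the current year {current}")
    if first > last:
        problems.append(f"range start {first} is after range end {last}")
    if first_added is not None and first < first_added:
        problems.append(f"range start {first} predates first addition in {first_added}")
    return problems


def nvidia_years(text: str, max_lines: int = 15):
    """Return the <YEARS> text of the NVIDIA line in the header, or None."""
    for line in text.splitlines()[:max_lines]:
        match = NVIDIA_RE.search(line)
        if match:
            return match.group(1)
    return None


def nvidia_years_unchanged(before: str, after: str) -> bool:
    """True when the PR left the NVIDIA copyright years untouched."""
    return nvidia_years(before) == nvidia_years(after)
```

A single year `YYYY` is passed as `first == last`. For `current`, `datetime.date.today().year` mirrors `$(date +%Y)`.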

## Recommended-but-optional marker

Files in classification (2), modified upstream files, often carry the marker line:
```
<comment> This file was modified for portability to AMDGPU
```
Treat its absence as a **note**, not a failure. Its presence makes review
easier but is not legally required.

## What NOT to flag

- Headers in files outside the diff scope (don't audit the whole tree).
- Stylistic differences (spacing, ordering of unrelated lines).
- The `License for AMD contributions = MIT.` shorthand line — it's a valid
alternative to `See LICENSE for license information.` for AMD-only files.
- Files in `exclude_copyright` from `qa/L0_license/config.json`.

## Output format

```
## Copyright Header Audit: <base>...HEAD

### Summary
| Status | Count |
|---|---:|
| OK | <n> |
| Missing AMD copyright | <n> |
| Stale AMD year | <n> |
| Altered NVIDIA copyright | <n> |
| Ambiguous (needs human review) | <n> |

### Findings

#### <path>:<line>
- **Classification:** <ROCm-only | Modified upstream | Already-mixed | Cherry-pick | Ambiguous>
- **Issue:** <one-line description>
- **Found:** `<relevant line(s) from the file>`
- **Expected:** `<the line as it should be>`
- **Fix:** <minimal patch — usually a 1-3 line header replacement>

(repeat per finding; group by file)
```

If everything passes, output `All <n> files in scope have correct AMD
copyright headers.` and stop — no per-file noise.
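For illustration only, the summary table and the all-pass shortcut could be produced like this. The names `STATUSES` and `render_summary` are hypothetical; the status labels are copied from the template above.

```python
from collections import Counter

STATUSES = [
    "OK",
    "Missing AMD copyright",
    "Stale AMD year",
    "Altered NVIDIA copyright",
    "Ambiguous (needs human review)",
]


def render_summary(per_file_status: list[str]) -> str:
    """Render the summary table, or the one-line message when all files pass."""
    total = len(per_file_status)
    counts = Counter(per_file_status)
    if counts.get("OK", 0) == total:
        return f"All {total} files in scope have correct AMD copyright headers."
    rows = ["| Status | Count |", "|---|---:|"]
    rows += [f"| {status} | {counts.get(status, 0)} |" for status in STATUSES]
    return "\n".join(rows)
```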

## Notes for re-runs

When this skill is invoked from `/claude review` on the same PR a second time,
prior findings posted by `claude[bot]` will already exist as inline comments.
The orchestrator (the calling workflow) is responsible for deduplication; this
skill should always emit the full current set of findings and let the caller
decide what to post.
104 changes: 104 additions & 0 deletions .github/agent-skills/review-pr/SKILL.md
@@ -0,0 +1,104 @@
---
name: review-pr
description: Deep code review of a branch as a PR against dev. Focus on intent, correctness, reuse, and test semantics.
argument-hint: <branch-name> [base-branch]
allowed-tools: [Read, Glob, Grep, Bash, Agent]
---

# PR Review

Review the branch `$ARGUMENTS` as a pull request. If no base branch is given after the branch name, default to `dev`.

## Setup

1. Parse arguments: first token is the feature branch, optional second token is the base branch.
2. Fetch and diff:
```
git fetch origin <branch>
git log --oneline <base>..<branch>
git diff --stat <base>..<branch>
git diff <base>..<branch>
```
3. If the diff is large (>800 lines), launch parallel Explore agents to study affected areas of the codebase. Otherwise, read the changed files and their surrounding infrastructure directly.
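The setup steps can be sketched in Python. This is a hedged illustration: the function names are hypothetical, and "large" is assumed to mean added plus deleted lines as reported by `git diff --numstat <base>..<branch>`, which is one reading of the 800-line threshold.

```python
def parse_review_args(arguments: str, default_base: str = "dev"):
    """Split the argument string into (feature_branch, base_branch)."""
    tokens = arguments.split()
    if not tokens:
        raise ValueError("a feature branch name is required")
    branch = tokens[0]
    base = tokens[1] if len(tokens) > 1 else default_base
    return branch, base


def changed_line_count(numstat: str) -> int:
    """Sum added and deleted lines from `git diff --numstat` output."""
    total = 0
    for line in numstat.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue
        added, deleted = parts[0], parts[1]
        if added == "-" or deleted == "-":
            continue  # binary files report "-" instead of counts
        total += int(added) + int(deleted)
    return total


def is_large_diff(numstat: str, threshold: int = 800) -> bool:
    """True when the diff exceeds the threshold and warrants parallel exploration."""
    return changed_line_count(numstat) > threshold
```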

## Review Methodology

Work through these phases in order. Each phase builds understanding that informs the next.

### Phase 1: Understand intent

Before evaluating any code, answer:
- **What problem does this PR solve?** Derive this from the diff, commit messages, and any linked issues.
- **What is the ideal solution to that problem?** Consider the codebase as it exists today. What is the most direct, minimal way to achieve the goal using existing infrastructure?
- **How does the PR's approach compare to the ideal?** Where does it diverge, and is each divergence justified?

This framing prevents reviewing code in isolation. Every subsequent judgment — "is this correct?", "is this redundant?" — is anchored to whether it serves the intent effectively.

### Phase 2: Correctness

Evaluate whether every changed file — production and test code alike — is correct:

- **Production code**: Does the change achieve the stated intent? Are there edge cases, off-by-one errors, or silent failures? Trace the data flow from entry point through to effect.
- **Test code**: What property does each test claim to verify? Trace from inputs through execution to assertions. Can the test produce false passes (assertions too loose, setup makes the test trivially true)? Can it produce false failures (tolerances too tight, depends on unrelated behavior)?

### Phase 3: Approach and reuse

For every changed file — production and test — evaluate whether the approach is the best way to accomplish its goal:

- **Are there simpler alternatives?** Could existing code paths, utilities, helpers, runners, or APIs be used instead of writing new code? Would a different design reduce complexity or risk?
- **Is new code justified?** When the PR introduces new functions, helpers, or patterns, determine whether existing infrastructure could serve the same purpose. Be precise: identify the specific existing code, what it does, and what gap (if any) prevents direct reuse.
- **Critical nuance on reuse**: A helper that does not return intermediate values (e.g., asserts internally but doesn't expose results) cannot be reused for code that needs those intermediates. Acknowledge structural limitations honestly. Do NOT recommend reuse that requires modifying existing APIs — that changes scope. Instead, note what existing code *does* expose and suggest the minimal composition that avoids reimplementation.

The distinction to draw:
- **Avoidable duplication**: input construction, env var setup, model instantiation, config validation — things existing code already does and exposes.
- **Unavoidable divergence**: the new code genuinely needs something the existing infrastructure wasn't designed for.

### Phase 4: Minimality and integration

- **Minimality**: Does the PR touch only what it needs to? Flag unrelated cleanups, speculative additions, or defensive code that guards against impossible states. Does it test things beyond its stated goal that are already covered elsewhere (adding fragility without coverage)?
- **Integration**: Does the new code fit surrounding conventions? Does it introduce a second way to do something that already has an established pattern?
- **Efficiency** (tests): Are test variants split across multiple functions when they could be a single parametrized function? Does the parametrization create excessive combinations relative to marginal coverage?
- **Missing coverage**: Given the production change, are there scenarios that should be tested but aren't?

### Phase 5: Upstream compatibility (ROCm fork)

This repo is a downstream fork of NVIDIA's TransformerEngine. Every change must be classified against three rules:

1. **CUDA behavior must remain unchanged.** New or divergent behavior on the ROCm path must be guarded so the CUDA execution path stays byte-identical to upstream. Acceptable guards include compile-time switches (`#ifdef USE_ROCM`, `__HIP_PLATFORM_AMD__`, `IS_HIP_COMPILE`), build-system selection (separate ROCm sources), and runtime checks (e.g., `is_rocm()`, device-type dispatch). A change to a code path that CUDA also executes is **not** ROCm-specific, even if it was motivated by a ROCm issue.
2. **Generic bug fixes to upstream code are allowed, but must be documented.** If the PR fixes a defect that exists on both CUDA and ROCm, the PR description (or a comment near the change) must explicitly call this out so it can be upstreamed. Silent "drive-by" fixes to shared code make future IFU merges harder to reason about.
3. **ROCm-specific behavior must be documented.** Any new ROCm-only code path, kernel, workaround, or divergence from upstream needs a brief comment (or PR-description entry) explaining *why* it diverges. This is what future IFU merges read to decide how to resolve conflicts.

Flag any of the following as findings:
- **Unguarded divergence**: changes to shared (CUDA-reachable) code that alter behavior, without a guard and without being declared as a generic bug fix.
- **Missing guard**: ROCm-motivated changes placed in shared code where a guard would cleanly isolate them. Suggest the appropriate guard.
- **Undocumented bug fix**: a change to shared code that looks like a generic fix but isn't called out as such — ask the author to confirm classification and document it.
- **Undocumented ROCm divergence**: a new ROCm-only code path / file / kernel with no comment or PR-description note explaining the rationale relative to upstream.
- **Over-broad guard**: a guard that wraps more code than necessary, making it harder to see what actually differs from upstream.

Do not flag pure additions of new ROCm-only files (e.g., HIP kernels, ROCm-only build glue) as "divergence" merely for existing — they're divergent by definition. The concern is whether the *rationale* is documented.

## Output Format

```
## PR Review: <branch> -> <base>

### Intent
[What the PR is trying to accomplish and whether the approach is sound]

### Correctness
[Per-file: does it work? Gaps, edge cases, false-pass/false-fail risks]

### Reuse and Approach
[What existing infrastructure could be leveraged, what was reimplemented, what's justified vs. avoidable]

### Minimality and Integration
[Scope creep, convention fit, parametrization efficiency, missing coverage]

### Upstream Compatibility
[Per change touching shared code: classify as (a) CUDA-preserving + ROCm-guarded, (b) generic bug fix (documented?), or (c) unguarded divergence (must fix). Note any ROCm-specific additions lacking rationale comments.]

### Summary
| Area | Verdict | Key Issues |
|------|---------|------------|
| ... | ... | ... |
```
20 changes: 17 additions & 3 deletions .github/scripts/Dockerfile.ci.deps
@@ -7,23 +7,37 @@ ARG BASE_DOCKER=registry-sc-harbor.amd.com/framework/compute-rocm-rel-7.2:57_ubu
 FROM $BASE_DOCKER
 WORKDIR /

+# Updated git via git-core PPA
+RUN apt-get update && apt-get install -y --no-install-recommends software-properties-common \
+    && add-apt-repository ppa:git-core/ppa -y \
+    && apt-get update \
+    && apt-get install -y --no-install-recommends git vim \
+    && rm -rf /var/lib/apt/lists/*
+
 # Build arguments
 ARG FA_VERSION=v2.8.1
 ARG ROCM_VERSION=7.2
 ARG JAX_VERSION=0.8.0
 ARG PYTHON_VERSION=311
+# AITER - Required for MXFP4 FP4 GEMM kernels.
+ARG AITER_COMMIT=77455e3ecf4f0d28756afc452e914940c45b944b

 RUN pip install setuptools wheel
 RUN pip install ipython pytest fire pydantic pybind11 ninja pandas
-RUN apt-get update && apt-get install -y vim

 # Install flash-attention
-ENV GPU_ARCHS=gfx950;gfx942
 RUN git clone --branch ${FA_VERSION} --depth 1 https://github.com/Dao-AILab/flash-attention.git \
     && cd flash-attention \
-    && FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE && FLASH_ATTENTION_SKIP_CK_BUILD=FALSE python setup.py install \
+    && GPU_ARCHS="gfx950;gfx942" FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE FLASH_ATTENTION_SKIP_CK_BUILD=FALSE python setup.py install \
     && cd ..

+# Install AITER
+RUN git clone --no-checkout https://github.com/ROCm/aiter.git \
+    && cd aiter \
+    && git checkout ${AITER_COMMIT} \
+    && git submodule update --init --recursive \
+    && pip install .
+
 # Install JAX
 RUN ROCM_MAJOR=$(echo "${ROCM_VERSION}" | cut -d. -f1) && pip install \
     https://repo.radeon.com/rocm/manylinux/rocm-rel-${ROCM_VERSION}/jax_rocm${ROCM_MAJOR}_pjrt-${JAX_VERSION}%2Brocm${ROCM_VERSION}.0-py3-none-manylinux_2_28_x86_64.whl \
44 changes: 30 additions & 14 deletions .github/scripts/aiter_prebuild_upload.sh
@@ -22,7 +22,9 @@ fi
 export ROCM_PATH
 ROCM_VER=`head -n1 "${ROCM_PATH}/.info/version" | cut -d. -f1`

-AITER_DIR="${ROOT_DIR}/3rdparty/aiter"
+QOLA_DIR="${ROOT_DIR}/3rdparty/QoLA"
+AITER_DIR="${QOLA_DIR}/3rdparty/aiter"
+QOLA_MANIFEST="${ROOT_DIR}/transformer_engine/common/ck_fused_attn/qola_manifest.toml"
 GIT_CONFIG_GLOBAL="$(mktemp /tmp/gitconfig.XXXXXX)"
 trap 'rm -f "${GIT_CONFIG_GLOBAL}"' EXIT
 git config --file "${GIT_CONFIG_GLOBAL}" --add safe.directory "${AITER_DIR}"
@@ -52,21 +54,35 @@ fi
 # Optional build stage
 if [[ "${1:-}" == "--build" ]]; then
     shift
-    GPU_ARCHS="gfx942;gfx950"
-    echo "[AITER-PREBUILT] Building aiter libs for ${GPU_ARCHS} ..."
-    bash "${ROOT_DIR}/transformer_engine/common/ck_fused_attn/aiter_build.sh" \
-        --aiter-dir "${ROOT_DIR}/3rdparty/aiter" \
-        --install-dir "${EXTRACT_DIR}" \
-        --gpu-archs "${GPU_ARCHS}"
-fi
+    GPU_ARCHS=("gfx942" "gfx950")
+    echo "[AITER-PREBUILT] Building aiter libs via QoLA for ${GPU_ARCHS[*]} ..."
+    QOLA_BUILD_DIR="${QOLA_DIR}/build"
+    arch_args=()
+    for a in "${GPU_ARCHS[@]}"; do arch_args+=(--arch "${a}"); done
+    PYTHONPATH="${QOLA_DIR}:${PYTHONPATH:-}" \
+    python3 -m qola.cli build \
+        --manifest "${QOLA_MANIFEST}" \
+        --aiter-root "${AITER_DIR}" \
+        --output-dir "${QOLA_BUILD_DIR}" \
+        "${arch_args[@]}"

-# Ensure built libs exist
-if [[ ! -f "${EXTRACT_DIR}/libmha_fwd.so" ]]; then
-    echo "[AITER-PREBUILT] Missing libmha_fwd.so in ${EXTRACT_DIR}" >&2
-    exit 1
+    # Stage QoLA outputs into the cache layout expected by aiter_prebuilt.cmake.
+    mkdir -p "${EXTRACT_DIR}/lib" "${EXTRACT_DIR}/include"
+    cp "${QOLA_BUILD_DIR}/lib/"*.so "${EXTRACT_DIR}/lib/"
+    cp "${QOLA_BUILD_DIR}/include/"*.h "${EXTRACT_DIR}/include/"
 fi
-if [[ ! -f "${EXTRACT_DIR}/libmha_bwd.so" ]]; then
-    echo "[AITER-PREBUILT] Missing libmha_bwd.so in ${EXTRACT_DIR}" >&2
+
+# Ensure built libs exist (matches aiter_prebuilt.cmake::is_aiter_cache_valid).
+for lib in te_libmha_fwd.so te_libmha_bwd.so; do
+    if [[ ! -f "${EXTRACT_DIR}/lib/${lib}" ]]; then
+        echo "[AITER-PREBUILT] Missing ${lib} in ${EXTRACT_DIR}/lib" >&2
+        exit 1
+    fi
+done
+
+# qola_config.h is the namespace-baked header; without it consumer compiles fail.
+if [[ ! -f "${EXTRACT_DIR}/include/qola_config.h" ]]; then
+    echo "[AITER-PREBUILT] Missing qola_config.h in ${EXTRACT_DIR}/include" >&2
     exit 1
 fi

6 changes: 1 addition & 5 deletions .github/workflows/aiter-prebuilt-upload.yml
@@ -13,7 +13,7 @@ on:

 jobs:
   upload:
-    runs-on: linux-te-mi325-8
+    runs-on: build-only-te
     steps:
       - name: Checkout source
         uses: actions/checkout@v6
@@ -44,11 +44,7 @@ jobs:
--rm \
--name te-aiter-upload \
--network=host \
--device=/dev/dri --device=/dev/kfd \
--shm-size=16G \
--pid=host \
--group-add $(getent group render | cut -d: -f3) \
--group-add $(getent group video | cut -d: -f3) \
-v "${{ github.workspace }}:/workspace" \
-w /workspace \
${{ steps.cfg.outputs.image }}
4 changes: 4 additions & 0 deletions .github/workflows/build.yml
@@ -7,6 +7,10 @@ name: 'Build'
 on:
   pull_request:
   workflow_dispatch:
+concurrency:
+  # Group by workflow name + PR number (for PRs) or ref (for branch/tag pushes)
+  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
+  cancel-in-progress: true
 jobs:
   core:
     name: 'Core'