ROCm · ipanfilo · Jun 2, 2026 · Jan 20, 2026 · Jan 20, 2026 · Jan 21, 2026
diff --git a/.github/agent-skills/copyright-check/SKILL.md b/.github/agent-skills/copyright-check/SKILL.md
@@ -0,0 +1,161 @@
+---
+name: copyright-check
+description: Verify AMD copyright header compliance on files modified or introduced by ROCm. Checks presence, format, and year correctness. Use whenever reviewing a PR on the ROCm TransformerEngine fork, or when asked to audit copyright headers.
+argument-hint: [base-branch] [--paths <glob>...]
+allowed-tools: [Read, Glob, Grep, Bash]
+---
+
+# Copyright Header Check
+
+Audits AMD copyright headers on files that ROCm has modified or introduced. The
+existing `qa/L0_license/copyright_checker.py` only validates NVIDIA headers —
+this skill is the AMD-side counterpart.
+
+## Arguments
+
+- `<base-branch>` — diff against this ref (default: `dev`).
+- `--paths <glob>...` — restrict the check to specific paths. Without this,
+  every file changed in the diff is checked.
+
+## Scope detection
+
+Determine the file set to audit:
+```
+base="${1:-dev}"
+git fetch origin "$base" --quiet
+git diff --name-status "origin/$base"...HEAD
+```
+- `A` = added — counts as **ROCm-introduced**.
+- `M` / `R` = modified or renamed — counts as **ROCm-modified**.
+- `D` = deleted — skip.
+
+Skip files that the existing checker also skips (see `qa/L0_license/config.json`):
+binaries, `3rdparty/`, `LICENSE`, `VERSION`, `.png`, `.ipynb`, `.json`, `.md`,
+`.txt`, Dockerfiles, generated files. Of the remaining files, check only these
+extensions: `c, cpp, cu, h, cuh, hpp, hip, py, sh, cmake, yml, yaml, toml, rst, cfg`,
+plus `CMakeLists.txt`. Note any other file as "unchecked filetype" and move
+on; do not flag it.
+
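A minimal sketch of the scope-detection step, assuming the output of `git diff --name-status` is already captured as a string. The function and constant names are hypothetical. It covers only the status mapping and the extension filter described above; the skip list from `qa/L0_license/config.json` is left out.

```python
from pathlib import PurePosixPath

# Extensions and file names the skill lists as checkable.
CHECKED_EXTENSIONS = {
    "c", "cpp", "cu", "h", "cuh", "hpp", "hip", "py", "sh",
    "cmake", "yml", "yaml", "toml", "rst", "cfg",
}
CHECKED_NAMES = {"CMakeLists.txt"}


def parse_name_status(output):
    """Turn `git diff --name-status` output into (origin, path) pairs."""
    files = []
    for line in output.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        # Rename lines look like "R100<TAB>old<TAB>new"; the last field is the new path.
        status, path = fields[0], fields[-1]
        if status.startswith("A"):
            files.append(("introduced", path))
        elif status.startswith(("M", "R")):
            files.append(("modified", path))
        # Deleted files are skipped. Other statuses (copy, type change) are not
        # covered by the skill text, so this sketch skips them as well.
    return files


def is_checked_filetype(path):
    """True when the path has one of the listed extensions or file names."""
    p = PurePosixPath(path)
    return p.name in CHECKED_NAMES or p.suffix.lstrip(".") in CHECKED_EXTENSIONS
```

Files for which `is_checked_filetype` returns `False` would be reported as "unchecked filetype" rather than flagged.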
+## Header rules
+
+Read the **first 15 lines** of each in-scope file (headers may sit below a
+shebang and a coding declaration). The current year is `$(date +%Y)` — call it
+`Y`. Use that value; do not hardcode it.
+
+### Comment-style families
+| Family | Comment marker | File types |
+|---|---|---|
+| hash | `#` | py, sh, cmake, yml, yaml, toml, cfg, `CMakeLists.txt` |
+| c-block | `/*` … `*/` or `//` | c, cpp, cu, h, cuh, hpp, hip |
+| rst | `..` (indented) | rst |
+
+### AMD copyright line — required form
+```
+<comment> Copyright (c) <YEARS>, Advanced Micro Devices, Inc. All rights reserved.
+```
+where `<YEARS>` is either `Y` (single year) or `Yfirst-Y` (range ending in
+the current year). Use a regex like:
+
+```
+Copyright \(c\) (\d{4})(-(\d{4}))?, Advanced Micro Devices, Inc\. All rights reserved\.
+```
+
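A short sketch of how the regex above could be applied to the first 15 lines of a file, assuming Python's `re` module. `find_amd_years` is a hypothetical helper name. It uses `search` rather than `match` so that any comment marker in front of the line is tolerated.

```python
import re

# The AMD pattern quoted in the skill, split over two string literals.
AMD_LINE = re.compile(
    r"Copyright \(c\) (\d{4})(-(\d{4}))?, "
    r"Advanced Micro Devices, Inc\. All rights reserved\."
)


def find_amd_years(text, max_lines=15):
    """Return (first_year, last_year) from the first AMD line in the header, else None."""
    for line in text.splitlines()[:max_lines]:
        m = AMD_LINE.search(line)
        if m:
            first = int(m.group(1))
            # A single year is treated as a one-year range.
            last = int(m.group(3)) if m.group(3) else first
            return first, last
    return None
```

A `None` result on an in-scope file corresponds to the "Missing AMD copyright" status in the output format.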
+### NVIDIA copyright line — preserved form
+```
+<comment> Copyright (c) <YEARS>, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+```
+
+### File classification (in scope)
+
+For each file, classify by its **post-change** content combined with its diff
+status:
+
+1. **ROCm-only file** — added by ROCm, no NVIDIA copyright present.
+   Required: AMD copyright. NVIDIA copyright must NOT be added.
+2. **Modified upstream file** — already existed with an NVIDIA copyright before
+   this PR, ROCm is now changing it.
+   Required: BOTH AMD and NVIDIA copyright lines, AMD line first.
+3. **Already-mixed file** — already had both AMD and NVIDIA headers, ROCm is
+   modifying it again.
+   Required: both lines remain; AMD year extended to include `Y`.
+4. **Cherry-picked from upstream** — added in this PR but contains an NVIDIA
+   copyright (file pulled from upstream NVIDIA TE).
+   Required: NVIDIA copyright preserved verbatim; AMD copyright added only if
+   the cherry-pick included AMD-authored modifications.
+
+To distinguish (1) from (4) on added files: check `git log` of the upstream
+remote (`origin/upstream-main` or whatever upstream tracking branch exists —
+fall back to checking whether the path exists on `origin/main`). If unsure,
+flag as "ambiguous" and let the human decide.
+
+To know whether a *modified* file had an NVIDIA copyright before the PR:
+```
+git show "origin/$base:$path" | head -15 | grep -F "NVIDIA CORPORATION"
+```
+
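The four classes above can be sketched as a small decision function. This is an illustration under stated assumptions, not the skill's implementation: `classify` and its parameters are hypothetical, the upstream-branch lookup for added files is omitted, and a modified file with no prior NVIDIA line (a case the list does not cover) is reported as ambiguous.

```python
def classify(status, nvidia_now, nvidia_before=False, amd_before=False):
    """Map diff status and header presence to one of the skill's classes.

    status        -- first column of `git diff --name-status` (e.g. "A", "M", "R100")
    nvidia_now    -- NVIDIA copyright present in the post-change file
    nvidia_before -- NVIDIA copyright present on the base branch
    amd_before    -- AMD copyright present on the base branch
    """
    if status.startswith("A"):
        # Added files: an NVIDIA line suggests the file came from upstream.
        return "Cherry-pick" if nvidia_now else "ROCm-only"
    if nvidia_before:
        return "Already-mixed" if amd_before else "Modified upstream"
    # Assumption: anything the four classes do not describe goes to a human.
    return "Ambiguous"
```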
+## Year correctness
+
+For every file in scope, the AMD copyright year must include the current year
+`Y`:
+- Single year `YYYY`: must equal `Y`.
+- Range `Yfirst-Ylast`: `Ylast` must equal `Y`. `Yfirst` must be ≤ `Ylast` and
+  ≥ the year the AMD copyright was first added (best-effort proxy: the year the
+  file was added, from `git log --diff-filter=A --follow --format=%ad
+  --date=format:%Y -- "$path" | tail -1`).
+
+For NVIDIA copyrights on modified files: the year MUST NOT have been changed
+by this PR. Compare against `git show "origin/$base:$path"`. ROCm changes do
+not extend NVIDIA's copyright window.
+
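The year rules above reduce to two small checks. The sketch below assumes the years have already been parsed to integers and that the header text from both the base branch and `HEAD` is at hand. Both function names are hypothetical, and the lower bound on `Yfirst` from `git log` is not modelled.

```python
from datetime import date


def amd_year_problem(first, last, current=None):
    """Return a one-line issue for an AMD year span, or None when it is fine."""
    y = current if current is not None else date.today().year
    if first > last:
        return f"year range {first}-{last} is reversed"
    if last != y:
        return f"AMD year ends in {last}, expected {y}"
    return None


def nvidia_line_changed(before_header, after_header):
    """True when the NVIDIA copyright lines differ between base and HEAD."""
    def pick(text):
        return [ln.strip() for ln in text.splitlines() if "NVIDIA CORPORATION" in ln]
    return pick(before_header) != pick(after_header)
```

Passing `current` explicitly keeps the check deterministic in tests; in normal use it falls back to the system date, which matches the instruction not to hardcode the year.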
+## Recommended-but-optional marker
+
+Files in class (2), modified upstream files, often carry the marker line:
+```
+<comment> This file was modified for portability to AMDGPU
+```
+Treat its absence as a **note**, not a failure. Its presence makes review
+easier but is not legally required.
+
+## What NOT to flag
+
+- Headers in files outside the diff scope (don't audit the whole tree).
+- Stylistic differences (spacing, ordering of unrelated lines).
+- The `License for AMD contributions = MIT.` shorthand line — it's a valid
+  alternative to `See LICENSE for license information.` for AMD-only files.
+- Files in `exclude_copyright` from `qa/L0_license/config.json`.
+
+## Output format
+
+```
+## Copyright Header Audit: <base>...HEAD
+
+### Summary
+| Status | Count |
+|---|---:|
+| OK | <n> |
+| Missing AMD copyright | <n> |
+| Stale AMD year | <n> |
+| Altered NVIDIA copyright | <n> |
+| Ambiguous (needs human review) | <n> |
+
+### Findings
+
+#### <path>:<line>
+- **Classification:** <ROCm-only | Modified upstream | Already-mixed | Cherry-pick | Ambiguous>
+- **Issue:** <one-line description>
+- **Found:**   `<relevant line(s) from the file>`
+- **Expected:** `<the line as it should be>`
+- **Fix:** <minimal patch — usually a 1-3 line header replacement>
+
+(repeat per finding; group by file)
+```
+
+If everything passes, output `All <n> files in scope have correct AMD
+copyright headers.` and stop — no per-file noise.
+
+## Notes for re-runs
+
+When this skill is invoked from `/claude review` on the same PR a second time,
+prior findings posted by `claude[bot]` will already exist as inline comments.
+The orchestrator (the calling workflow) is responsible for deduplication; this
+skill should always emit the full current set of findings and let the caller
+decide what to post.
diff --git a/.github/agent-skills/review-pr/SKILL.md b/.github/agent-skills/review-pr/SKILL.md
@@ -0,0 +1,104 @@
+---
+name: review-pr
+description: Deep code review of a branch as a PR against dev. Focus on intent, correctness, reuse, and test semantics.
+argument-hint: <branch-name> [base-branch]
+allowed-tools: [Read, Glob, Grep, Bash, Agent]
+---
+
+# PR Review
+
+Review the branch `$ARGUMENTS` as a pull request. If no base branch is given after the branch name, default to `dev`.
+
+## Setup
+
+1. Parse arguments: first token is the feature branch, optional second token is the base branch.
+2. Fetch and diff:
+   ```
+   git fetch origin <branch>
+   git log --oneline <base>..<branch>
+   git diff --stat <base>...<branch>
+   git diff <base>...<branch>
+   ```
+3. If the diff is large (>800 lines), launch parallel Explore agents to study affected areas of the codebase. Otherwise, read the changed files and their surrounding infrastructure directly.
+
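One way to decide whether a diff crosses the 800-line threshold from step 3 is to sum the counts reported by `git diff --numstat`. The sketch below assumes that "lines" means added plus deleted lines, which the skill text does not spell out. The function names are hypothetical.

```python
def changed_line_count(numstat):
    """Sum added and deleted lines from `git diff --numstat` output."""
    total = 0
    for line in numstat.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue
        added, deleted = parts[0], parts[1]
        if added == "-" or deleted == "-":
            continue  # binary files report "-" instead of counts
        total += int(added) + int(deleted)
    return total


def needs_parallel_exploration(numstat, threshold=800):
    """True when the diff is larger than the threshold used in step 3."""
    return changed_line_count(numstat) > threshold
```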
+## Review Methodology
+
+Work through these phases in order. Each phase builds understanding that informs the next.
+
+### Phase 1: Understand intent
+
+Before evaluating any code, answer:
+- **What problem does this PR solve?** Derive this from the diff, commit messages, and any linked issues.
+- **What is the ideal solution to that problem?** Consider the codebase as it exists today. What is the most direct, minimal way to achieve the goal using existing infrastructure?
+- **How does the PR's approach compare to the ideal?** Where does it diverge, and is each divergence justified?
+
+This framing prevents reviewing code in isolation. Every subsequent judgment — "is this correct?", "is this redundant?" — is anchored to whether it serves the intent effectively.
+
+### Phase 2: Correctness
+
+Evaluate whether every changed file — production and test code alike — is correct:
+
+- **Production code**: Does the change achieve the stated intent? Are there edge cases, off-by-one errors, or silent failures? Trace the data flow from entry point through to effect.
+- **Test code**: What property does each test claim to verify? Trace from inputs through execution to assertions. Can the test produce false passes (assertions too loose, setup makes the test trivially true)? Can it produce false failures (tolerances too tight, depends on unrelated behavior)?
+
+### Phase 3: Approach and reuse
+
+For every changed file — production and test — evaluate whether the approach is the best way to accomplish its goal:
+
+- **Are there simpler alternatives?** Could existing code paths, utilities, helpers, runners, or APIs be used instead of writing new code? Would a different design reduce complexity or risk?
+- **Is new code justified?** When the PR introduces new functions, helpers, or patterns, determine whether existing infrastructure could serve the same purpose. Be precise: identify the specific existing code, what it does, and what gap (if any) prevents direct reuse.
+- **Critical nuance on reuse**: A helper that does not return intermediate values (e.g., asserts internally but doesn't expose results) cannot be reused for code that needs those intermediates. Acknowledge structural limitations honestly. Do NOT recommend reuse that requires modifying existing APIs — that changes scope. Instead, note what existing code *does* expose and suggest the minimal composition that avoids reimplementation.
+
+The distinction to draw:
+- **Avoidable duplication**: input construction, env var setup, model instantiation, config validation — things existing code already does and exposes.
+- **Unavoidable divergence**: the new code genuinely needs something the existing infrastructure wasn't designed for.
+
+### Phase 4: Minimality and integration
+
+- **Minimality**: Does the PR touch only what it needs to? Flag unrelated cleanups, speculative additions, or defensive code that guards against impossible states. Does it test things beyond its stated goal that are already covered elsewhere (adding fragility without coverage)?
+- **Integration**: Does the new code fit surrounding conventions? Does it introduce a second way to do something that already has an established pattern?
+- **Efficiency** (tests): Are test variants split across multiple functions when they could be a single parametrized function? Does the parametrization create excessive combinations relative to marginal coverage?
+- **Missing coverage**: Given the production change, are there scenarios that should be tested but aren't?
+
+### Phase 5: Upstream compatibility (ROCm fork)
+
+This repo is a downstream fork of NVIDIA's TransformerEngine. Every change must be classified against three rules:
+
+1. **CUDA behavior must remain unchanged.** New or divergent behavior on the ROCm path must be guarded so the CUDA execution path stays byte-identical to upstream. Acceptable guards include compile-time switches (`#ifdef USE_ROCM`, `__HIP_PLATFORM_AMD__`, `IS_HIP_COMPILE`), build-system selection (separate ROCm sources), and runtime checks (e.g., `is_rocm()`, device-type dispatch). A change to a code path that CUDA also executes is **not** ROCm-specific, even if it was motivated by a ROCm issue.
+2. **Generic bug fixes to upstream code are allowed, but must be documented.** If the PR fixes a defect that exists on both CUDA and ROCm, the PR description (or a comment near the change) must explicitly call this out so it can be upstreamed. Silent "drive-by" fixes to shared code make future IFU merges harder to reason about.
+3. **ROCm-specific behavior must be documented.** Any new ROCm-only code path, kernel, workaround, or divergence from upstream needs a brief comment (or PR-description entry) explaining *why* it diverges. This is what future IFU merges read to decide how to resolve conflicts.
+
+Flag any of the following as findings:
+- **Unguarded divergence**: changes to shared (CUDA-reachable) code that alter behavior, without a guard and without being declared as a generic bug fix.
+- **Missing guard**: ROCm-motivated changes placed in shared code where a guard would cleanly isolate them. Suggest the appropriate guard.
+- **Undocumented bug fix**: a change to shared code that looks like a generic fix but isn't called out as such — ask the author to confirm classification and document it.
+- **Undocumented ROCm divergence**: a new ROCm-only code path / file / kernel with no comment or PR-description note explaining the rationale relative to upstream.
+- **Over-broad guard**: a guard that wraps more code than necessary, making it harder to see what actually differs from upstream.
+
+Do not flag pure additions of new ROCm-only files (e.g., HIP kernels, ROCm-only build glue) as "divergence" merely for existing — they're divergent by definition. The concern is whether the *rationale* is documented.
+
+## Output Format
+
+```
+## PR Review: <branch> -> <base>
+
+### Intent
+[What the PR is trying to accomplish and whether the approach is sound]
+
+### Correctness
+[Per-file: does it work? Gaps, edge cases, false-pass/false-fail risks]
+
+### Reuse and Approach
+[What existing infrastructure could be leveraged, what was reimplemented, what's justified vs. avoidable]
+
+### Minimality and Integration
+[Scope creep, convention fit, parametrization efficiency, missing coverage]
+
+### Upstream Compatibility
+[Per change touching shared code: classify as (a) CUDA-preserving + ROCm-guarded, (b) generic bug fix (documented?), or (c) unguarded divergence (must fix). Note any ROCm-specific additions lacking rationale comments.]
+
+### Summary
+| Area | Verdict | Key Issues |
+|------|---------|------------|
+| ...  | ...     | ...        |
+```
diff --git a/.github/scripts/Dockerfile.ci.deps b/.github/scripts/Dockerfile.ci.deps
@@ -7,23 +7,37 @@ ARG BASE_DOCKER=registry-sc-harbor.amd.com/framework/compute-rocm-rel-7.2:57_ubu
 FROM $BASE_DOCKER
 WORKDIR /
 
+# Updated git via git-core PPA
+RUN apt-get update && apt-get install -y --no-install-recommends software-properties-common \
+    && add-apt-repository ppa:git-core/ppa -y \
+    && apt-get update \
+    && apt-get install -y --no-install-recommends git vim \
+    && rm -rf /var/lib/apt/lists/*
+
 # Build arguments
 ARG FA_VERSION=v2.8.1
 ARG ROCM_VERSION=7.2
 ARG JAX_VERSION=0.8.0
 ARG PYTHON_VERSION=311
+# AITER - Required for MXFP4 FP4 GEMM kernels.
+ARG AITER_COMMIT=77455e3ecf4f0d28756afc452e914940c45b944b
 
 RUN pip install setuptools wheel
 RUN pip install ipython pytest fire pydantic pybind11 ninja pandas
-RUN apt-get update && apt-get install -y vim
 
 # Install flash-attention
-ENV GPU_ARCHS=gfx950;gfx942
 RUN git clone --branch ${FA_VERSION} --depth 1 https://github.com/Dao-AILab/flash-attention.git \
     && cd flash-attention \
-    && FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE && FLASH_ATTENTION_SKIP_CK_BUILD=FALSE python setup.py install \
+    && GPU_ARCHS="gfx950;gfx942" FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE FLASH_ATTENTION_SKIP_CK_BUILD=FALSE python setup.py install \
     && cd ..
 
+# Install AITER
+RUN git clone --no-checkout https://github.com/ROCm/aiter.git \
+    && cd aiter \
+    && git checkout ${AITER_COMMIT} \
+    && git submodule update --init --recursive \
+    && pip install .
+
 # Install JAX
 RUN ROCM_MAJOR=$(echo "${ROCM_VERSION}" | cut -d. -f1) && pip install \
     https://repo.radeon.com/rocm/manylinux/rocm-rel-${ROCM_VERSION}/jax_rocm${ROCM_MAJOR}_pjrt-${JAX_VERSION}%2Brocm${ROCM_VERSION}.0-py3-none-manylinux_2_28_x86_64.whl \

diff --git a/.github/scripts/aiter_prebuild_upload.sh b/.github/scripts/aiter_prebuild_upload.sh
@@ -22,7 +22,9 @@ fi
 export ROCM_PATH
 ROCM_VER=`head -n1 "${ROCM_PATH}/.info/version" | cut -d. -f1`
 
-AITER_DIR="${ROOT_DIR}/3rdparty/aiter"
+QOLA_DIR="${ROOT_DIR}/3rdparty/QoLA"
+AITER_DIR="${QOLA_DIR}/3rdparty/aiter"
+QOLA_MANIFEST="${ROOT_DIR}/transformer_engine/common/ck_fused_attn/qola_manifest.toml"
 GIT_CONFIG_GLOBAL="$(mktemp /tmp/gitconfig.XXXXXX)"
 trap 'rm -f "${GIT_CONFIG_GLOBAL}"' EXIT
 git config --file "${GIT_CONFIG_GLOBAL}" --add safe.directory "${AITER_DIR}"
@@ -52,21 +54,35 @@ fi
 # Optional build stage
 if [[ "${1:-}" == "--build" ]]; then
   shift
-  GPU_ARCHS="gfx942;gfx950"
-  echo "[AITER-PREBUILT] Building aiter libs for ${GPU_ARCHS} ..."
-  bash "${ROOT_DIR}/transformer_engine/common/ck_fused_attn/aiter_build.sh" \
-    --aiter-dir "${ROOT_DIR}/3rdparty/aiter" \
-    --install-dir "${EXTRACT_DIR}" \
-    --gpu-archs "${GPU_ARCHS}"
-fi
+  GPU_ARCHS=("gfx942" "gfx950")
+  echo "[AITER-PREBUILT] Building aiter libs via QoLA for ${GPU_ARCHS[*]} ..."
+  QOLA_BUILD_DIR="${QOLA_DIR}/build"
+  arch_args=()
+  for a in "${GPU_ARCHS[@]}"; do arch_args+=(--arch "${a}"); done
+  PYTHONPATH="${QOLA_DIR}:${PYTHONPATH:-}" \
+    python3 -m qola.cli build \
+      --manifest "${QOLA_MANIFEST}" \
+      --aiter-root "${AITER_DIR}" \
+      --output-dir "${QOLA_BUILD_DIR}" \
+      "${arch_args[@]}"
 
-# Ensure built libs exist
-if [[ ! -f "${EXTRACT_DIR}/libmha_fwd.so" ]]; then
-  echo "[AITER-PREBUILT] Missing libmha_fwd.so in ${EXTRACT_DIR}" >&2
-  exit 1
+  # Stage QoLA outputs into the cache layout expected by aiter_prebuilt.cmake.
+  mkdir -p "${EXTRACT_DIR}/lib" "${EXTRACT_DIR}/include"
+  cp "${QOLA_BUILD_DIR}/lib/"*.so "${EXTRACT_DIR}/lib/"
+  cp "${QOLA_BUILD_DIR}/include/"*.h "${EXTRACT_DIR}/include/"
 fi
-if [[ ! -f "${EXTRACT_DIR}/libmha_bwd.so" ]]; then
-  echo "[AITER-PREBUILT] Missing libmha_bwd.so in ${EXTRACT_DIR}" >&2
+
+# Ensure built libs exist (matches aiter_prebuilt.cmake::is_aiter_cache_valid).
+for lib in te_libmha_fwd.so te_libmha_bwd.so; do
+  if [[ ! -f "${EXTRACT_DIR}/lib/${lib}" ]]; then
+    echo "[AITER-PREBUILT] Missing ${lib} in ${EXTRACT_DIR}/lib" >&2
+    exit 1
+  fi
+done
+
+# qola_config.h is the namespace-baked header; without it consumer compiles fail.
+if [[ ! -f "${EXTRACT_DIR}/include/qola_config.h" ]]; then
+  echo "[AITER-PREBUILT] Missing qola_config.h in ${EXTRACT_DIR}/include" >&2
   exit 1
 fi
 

diff --git a/.github/workflows/aiter-prebuilt-upload.yml b/.github/workflows/aiter-prebuilt-upload.yml
@@ -13,7 +13,7 @@ on:
 
 jobs:
   upload:
-    runs-on: linux-te-mi325-8
+    runs-on: build-only-te
     steps:
       - name: Checkout source
         uses: actions/checkout@v6
@@ -44,11 +44,7 @@ jobs:
             --rm \
             --name te-aiter-upload \
             --network=host \
-            --device=/dev/dri --device=/dev/kfd \
-            --shm-size=16G \
             --pid=host \
-            --group-add $(getent group render | cut -d: -f3) \
-            --group-add $(getent group video | cut -d: -f3) \
             -v "${{ github.workspace }}:/workspace" \
             -w /workspace \
             ${{ steps.cfg.outputs.image }}

@@ -7,6 +7,10 @@ name: 'Build'
 on:
   pull_request:
   workflow_dispatch:
+concurrency:
+  # Group by workflow name + PR number (for PRs) or ref (for manually dispatched runs)
+  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
+  cancel-in-progress: true
 jobs:
   core:
     name: 'Core'