[WIP] TDM porting #558
Draft: wangye805 wants to merge 48 commits into npi_gfx1250 from yewang12/tdm_port_npi
Commits
fe9a66c [ROCm] port tdm to npi_gfx1250 (wangye805)
40f3902 [ROCm] address first round reviewer comments (wangye805)
cb421f8 [ROCm] address reviewer comments (wangye805)
0e2d24d [ROCm] address more reviewer comments (wangye805)
0a13bbc [ROCm] Address TDM review comments: remove extra params, add explanat… (wangye805)
bfb7199 [ROCm] Address remaining review comments and enable TDM flow in CI gt… (wangye805)
ba4bbb7 tdm: clamp tensorDim to avoid uint32_t underflow on OOB prefetch tiles (wangye805)
ab77fbf tdm: add HIPTensorMap descriptor struct; revert TDM from rocm_* kernels (wangye805)
a0a60fe tdm: fully revert rocm_*.cuh to branch-point state (wangye805)
506d78c tdm: extract ROCm flow into separate rocm_* launchers; TDM stays in m… (wangye805)
acc7e4f tdm: address review comments for cast_gated_kernels.cuh (wangye805)
3c85101 tdm: revert swizzled_* lines to NV upstream position (wangye805)
0007c88 tdm: fix cast_mxfp8_gated to match NV upstream structure (wangye805)
bacd226 tdm: use switch(scaling_type) for AMD TDM mxfp8 gated dispatch (wangye805)
4ba5883 tdm: hoist shared next-stage offset vars above #ifdef in cast_mxfp8_g… (wangye805)
7dbf218 tdm: hoist shared shmem computation above #ifdef in cast_fp8_gated (wangye805)
293d970 tdm: collapse duplicate switch(scaling_type) blocks in cast_mxfp8_gated (wangye805)
53b5d22 tdm: remove tma_flow namespace; prefix ROCm-specific constants with R… (wangye805)
a0a9ab6 util: apply ROCM_ prefix to ROCm-specific constants in cast/dequantiz… (wangye805)
fdb4b1a util: address PR review comments on cast_kernels.cuh and dequantize_k… (wangye805)
a732d35 util: address 4 more PR review comments on cast/dequantize kernels (wangye805)
1a004df util: hoist shared next-iter offset vars above #ifndef in cast_mxfp8_… (wangye805)
573f8d7 Revert " Remove padding from scales for hipBLASlt calls (#442)" (wangye805)
fec2de5 fix(rocm): correct double-prefixed constants in rocm_cast_gated_kerne… (wangye805)
004d59f fix(rocm): add TMA_SHMEM_ALIGNMENT alias and sigmoidf for AMD compila… (wangye805)
7c86c98 fix(rocm): fix fp8_quantize AMD flow — remove unavailable fp8_quantiz… (wangye805)
0b40533 fix(rocm): route NVTE_MXFP8_1D_SCALING through fp8_quantize_rocm on AMD (wangye805)
c89b5ff fix(rocm): use padded scales_stride in rocm_mxfp8_dequantize (wangye805)
9f55a8b fix(rocm): guard TDM flow dispatch behind __gfx1250__ on AMD (wangye805)
8338725 fix(rocm): wire up cast_mxfp8_2D_kernel launch on gfx1250 TDM path (wangye805)
14329d5 refactor(rocm): consolidate mxfp8_quantize kernel launch for TDM and TMA (wangye805)
d38c6bd fix(amd): guard cudaFuncSetAttribute and add hip_bfloat16 overloads f… (wangye805)
14a1dab chore: remove debug print statements from MXFP8 cast/dequantize kernels (wangye805)
198495a chore: restore launcher debug prints, remove only in-kernel printf st… (wangye805)
362ae53 test: add 16384x16384 matrix size to CastMXFP8_GatedAct benchmark run (wangye805)
0456492 feat: migrate benchmarks/cpp/cast from dev branch (wangye805)
573f6ea build: add rocm_utils.cmake needed by benchmarks/cpp CMakeLists (wangye805)
9f340a1 fix: suppress clang warnings in Google Benchmark for gfx1250 toolchain (wangye805)
09ed78c test: remove 16384x16384 from gated swiglu test (causes CPU ref hang) (wangye805)
b02fe76 fix(rocm): remove TDM debug prints and fix NVTE_ROCM_BENCHMARK guards… (wangye805)
186d793 fix(rocm): restore indentation lost during debug print removal (wangye805)
9862745 fix(rocm): remove segfault debug handler and restore indentation in b… (wangye805)
1afc5b2 fix(rocm): restore indentation in fp8_quantize_rocm TDM branch (wangye805)
5da5014 Remove leftover debug printf from tdm.cuh copy_2d_to_shared (wangye805)
65e92ab Fix TDM double-buffer: use wait_tensorcnt_1 after store to preserve o… (wangye805)
8ea1cbd Make TDM store wait robust via wait_tensorcnt<N>() template (wangye805)
882bea2 cast_gated_kernels: use wait_tensorcnt<TDM_PREFETCH_LOADS>() template (wangye805)
fcf1932 tdm: rename is_tdm_wave() to is_tdm_lane(), restrict to thread 0 (wangye805)