rocBLAS: SWMMAC full precision family + MXFP4/FP8/BF8/UE8M0 types #1677
Open
clearnature wants to merge 147 commits into
Conversation
support labels with spaces * fix CI pipelines passing label list
remove out of date information on device memory management (ROCm#488) Co-authored-by: Andrew Chapman <anchapman@ctr2-alola-login-02.amd.com> Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
find OpenMP config (ROCm#517) Summary of proposed changes: First search for ROCm's libomp.so via openmp-config.cmake. This is what we would prefer instead of searching for a system libomp.so/libgomp.so and then manually adding in a ROCm lib path. This methodology should still be RHEL-10 RPATH compliant. Co-authored-by: estewart08 <ethan.stewart@amd.com>
Remove stray .pyc (ROCm#508)
[rocblas] Remove .jenkins folder V2 (ROCm#524) running CI on rocJenkins instead of .jenkins
use gemm_ex type in constexpr (ROCm#640) * clarify expected compilation
Add environment variables index page to rocBLAS (ROCm#451) resolves Environment variables collection doc epic Summary of proposed changes: - This is part of a bigger work, where we collect environment variables in index page. - Tried to come up with a solution, where we have minimal doc duplication. --- 🔁 Imported from [ROCm#1661](ROCm#1661) 🧑💻 Originally authored by @neon60 --------- Co-authored-by: Istvan Kiss <neon60@gmail.com> Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
Add gfx1150 support (ROCm#452) - Adding initial support for gfx1150 - Need to update tensile_tag.txt after PR with Tensile changes is merged in rocm-libraries and automatically in ROCm/Tensile (ROCm/rocm-libraries#396) - Update tensile_tag.txt for testing purposes, it will be reverted --- 🔁 Imported from [ROCm#1659](ROCm#1659) 🧑💻 Originally authored by @amd-mtrifuno --------- Co-authored-by: Milica Trifunovic <milica.trifunovic@amd.com> Co-authored-by: assistant-librarian[bot] <assistant-librarian[bot]@users.noreply.github.com> Co-authored-by: Torre Zuk <42548444+TorreZuk@users.noreply.github.com>
[rocBLAS] Docs: Change GitHub links and install guides to account for repo change (ROCm#541) This PR makes changes to several guides to point to the rocblas folder in the new rocm-libraries repository. It also updates some install instructions and some other links. I added a note in a few places to refer people using 6.4 or older to the previous rocBLAS repository.
[rocBLAS] Docs: Correct Windows filename (ROCm#717) Correct the windows filename that was erroneously changed in an earlier update.
Add rocBLAS yaml files for gfx1151 (ROCm#699) - Add Strix Halo yaml files that are copy of Navi33 yaml files (the same changes added for Strix Point) - Need to update tensile_tag.txt after PR with Tensile changes is merged in rocm-libraries and automatically in ROCm/Tensile (ROCm/rocm-libraries#696) --------- Co-authored-by: Torre Zuk <42548444+TorreZuk@users.noreply.github.com>
refer to HIP C++ and HIP runtime in documentation (ROCm#727) - in documentation refer to HIP C++ and the HIP runtime. --------- Co-authored-by: Andrew Chapman <anchapman@ctr2-alola-login-02.amd.com>
[rocblas] - matches hipblas OpenMP fix for gcc issue where clang headers used (ROCm#750) Additional cleanup: Remove unnecessary options added to COMMON_LINK_LIBS. If the openmp config is not found then OpenMP::OpenMP_CXX will be the fallback for both gcc and clang.
[rocBLAS] collect rocblas tip commit
Note that the packaging commit is still the top-level commit of the rocm-libraries repo, not one tied to the rocBLAS folder (8f685a in the example below):
rocBLAS version: 5.1.0.d263a37c12-dirty
rocBLAS-commit-hash: 8f685a3acf013a31522fa5a8ce9bf6029a58bb69
Tensile-commit-hash: N/A, as rocBLAS was built without Tensile
hipBLASLt: N/A, as rocBLAS was built without hipBLASLt
So the -dirty hash in the version string refers to a different commit.
[rocBLAS] allow forced single double hipblaslt backend via env (ROCm#781) * re-enable types size >= 4 to use hipblaslt on gfx950 via env to allow comparison bench runs
opening PR for updated gemv transpose kernel I have added a new kernel to handle gemv transpose and conjugate transpose for matrices m<=n (for certain values of m depending on the datatype) for gfx942 and gfx90a. The discovery of the issue, testing, and performance results are on the Jira ticket: https://ontrack-internal.amd.com/browse/SWDEV-538784 The new kernel greatly improves memory bandwidth percentage (see plots on Jira ticket). All ~7000 quick tests pass.
[rocBLAS] Update tensile_tag.txt to tip of develop Expecting Tensile perf regression fix so capture latest develop first as baseline
update tensile_tag to address perf regression Co-authored-by: Torre Zuk <42548444+TorreZuk@users.noreply.github.com>
[rocblas] Docs: Add topic on BLAS operations This adds a topic that explains BLAS operations, along with index and ToC changes
[rocBLAS] Users/torrezuk/swdev 544606 batched env hipblaslt (ROCm#1178) * [rocBLAS] require specific env to use batched hipblaslt backend * ROCBLAS_USE_HIPBLASLT_BATCHED=1
[rocBLAS] adds hipblaslt batched env variable to turn off (ROCm#1189) * [rocBLAS] default hipblaslt backend enabled for batched unless turned off * ROCBLAS_USE_HIPBLASLT_BATCHED=0
[rocBLAS] rocblas-bench: use HIP event-based timing as default (ROCm#919)
The new default gives more consistent performance across these:
* ROCBLAS_STREAM_ORDER_ALLOC=0 ./rocblas-bench ...
* ROCBLAS_STREAM_ORDER_ALLOC=1 ./rocblas-bench
The old way showed significant differences:
* ROCBLAS_STREAM_ORDER_ALLOC=0 ROCBLAS_BENCH_STREAM_SYNC=1 ./rocblas-bench ...
* ROCBLAS_STREAM_ORDER_ALLOC=1 ROCBLAS_BENCH_STREAM_SYNC=1 ./rocblas-bench
[rocBLAS] Fix ISA args in strixhalo yaml files
[rocBLAS] Update min CMake version
This change raises the minimum CMake version required to compile rocBLAS to 3.24.4. If the --cmake_install option is passed to the rocBLAS install.sh, version 3.24.4 is installed when an older version or no CMake is found. The previous minimum required version of CMake (3.16.8) fails to compile Tensile with this error:
```
cmake -E env: unknown option '--'
make[2]: *** [library/src/CMakeFiles/TENSILE_LIBRARY_TARGET.dir/build.make:65: Tensile/library] Error 1
make[1]: *** [CMakeFiles/Makefile2:484: library/src/CMakeFiles/TENSILE_LIBRARY_TARGET.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
```
[rocBLAS] cache and export handle device properties * cache hipDeviceProp_t in handle and provide internal API
[rocBLAS] client argument yaml variable 2D scan support (ROCm#1263) * support to simplify 2D performance scans via yaml arguments
[rocblas]: Fix formatting errors in license file ## Motivation <!-- Explain the purpose of this PR and the goals it aims to achieve. --> Fix some minor formatting errors in the rocBLAS license file ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> Edit license file ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> NA ## Test Result <!-- Briefly summarize test outcomes. --> NA ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[rocBLAS] fix pointer compare warning for char array in client arguments (ROCm#1284) * should be only deprecation warnings remaining
[rocblas] Add dockerfiles to ease starting a build env for rocBLAS (#6272)
## Motivation
Setting up a working rocBLAS build environment requires installing ROCm and several dependencies, which can be time-consuming and error-prone. Docker images provide a reproducible, ready-to-use development environment that removes this barrier for contributors and CI workflows.
## Technical Details
Adds two Dockerfiles and a README under docker/, both based on Ubuntu 24.04:
- Dockerfile.ubuntu24.prebuilt: downloads a prebuilt ROCm nightly tarball from rocm.nightlies.amd.com. Supports configuring the target ASIC (THEROCK_ASIC), nightly tag (THEROCK_GIT_TAG), tarball filename (THEROCK_TARBALL), and tarball source URL (THEROCK_URL_BASE). Suitable for day-to-day development where build speed matters.
- Dockerfile.ubuntu24.fullbuild: clones https://github.com/ROCm/TheRock and builds ROCm from source. Supports configuring the target ASIC, a specific commit hash (THEROCK_GIT_HASH), build type (THEROCK_BUILD_MODE: Release, Debug, or Preset), CMake preset (THEROCK_BUILD_PRESET), and parallel job count (BUILD_JOBS). Useful when a prebuilt tarball is unavailable or custom build options are required.
Both images share a common base stage that installs all system dependencies (including clang-format-18, clang-tidy-20, llvm-20, libzstd-dev, and development libraries), sets up update-alternatives, and configures the environment (PATH, LD_LIBRARY_PATH, CMAKE_GENERATOR).
- README.md: documents build commands for both images, all configurable --build-arg options, available tarball sources, and a docker run example for launching a container with GPU device access.
## Test Plan - Build rocblas:prebuilt with default args - Build rocblas:prebuilt with a custom args - Build rocblas:fullbuild - Launch container and verify amdclang++, cmake, clang-format, and clang-tidy are accessible ## Test Result All tests passing. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[rocBLAS] Use precheckin gtest patterns for comprehensive/full categories (#6353) ## Motivation The rocBLAS nightly, extended, and full client test selections depend on ILP64 OpenBLAS support for LAPACK and reference checks. That support on TheRock is incomplete at the moment; a later change will restore the broader coverage once the stack is ready. ## What changed - `comprehensive` and `full` entries in `projects/rocblas/clients/gtest/test_categories.yaml` now use the same test pattern set as the standard precheckin category (`*quick*` and `*pre_checkin*`), instead of additional nightly/stress filters that fail or time out under the current environment. ## Tracking ROCM-21549 Made with [Cursor](https://cursor.com)
[rocBLAS] hipblaslt on windows is supported This pull request updates the documentation to clarify the limitations of the hipBLASLt backend support in rocBLAS. The main change is to the note about platform and build support. Documentation update: * Updated the note in `rocblas-design-notes.rst` to state that the hipBLASLt backend for rocBLAS is unsupported only on static builds, removing the mention of Windows builds.
[rocBLAS] adds gfx1250 This pull request adds support for the `gfx1250` GPU architecture throughout the `rocblas` project. It updates target lists, processor enumerations, and device detection logic. Additionally, it corrects a grid launch limit constant. **Support for new GPU architecture:** - Added `gfx1250` to the `Processor` enum in `handle.hpp`, allowing the codebase to recognize and use the new architecture. - Updated the device string detection logic in `handle.cpp` to return `Processor::gfx1250` when the device string contains "gfx1250". **Build system and target updates:** - Added `gfx1250` to the ROCm 7.13 target lists in `CMakeLists.txt` for both regular and address sanitizer builds, and updated the logic to select the appropriate target list based on the ROCm platform version. **Constant correction:** - Fixed the `c_YZ_grid_launch_limit` constant in `utility.hpp` from `1 << 16` (65536) to `(1 << 16) - 1` (65535). Co-authored-by: amcamd <andrew.chapman@amd.com> Co-authored-by: Yoichi Yoshida <yoichi.yoshida@amd.com> Co-authored-by: mahmoodw <wmahmood@amd.com>
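The off-by-one in `c_YZ_grid_launch_limit` matters wherever a batch count is mapped onto a Y or Z grid dimension. The following is a minimal sketch, not the rocBLAS implementation: it assumes a limit of 65535 blocks per Y/Z dimension (the corrected value from the commit) and shows how a larger batch could be split into launches that each stay within it. The helper name `split_batches` is hypothetical.

```python
# Assumption: a Y/Z grid dimension accepts at most 65535 blocks, matching the
# corrected constant (1 << 16) - 1 described in the commit message.
YZ_GRID_LAUNCH_LIMIT = (1 << 16) - 1  # 65535, not 65536


def split_batches(batch_count, limit=YZ_GRID_LAUNCH_LIMIT):
    """Return (offset, size) chunks so that no chunk exceeds the limit.

    Hypothetical helper for illustration only.
    """
    chunks = []
    offset = 0
    while offset < batch_count:
        size = min(limit, batch_count - offset)
        chunks.append((offset, size))
        offset += size
    return chunks


if __name__ == "__main__":
    # With the old value 1 << 16, a batch of 65536 would have been issued as a
    # single launch whose grid dimension is one past the assumed maximum.
    print(split_batches(65536))
```

With the corrected constant, a batch of exactly 65536 is split into two launches instead of one oversized launch.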
[rocBLAS] Update CODEOWNERS
[rocBLAS]: OpenBLAS ILP64 support in tests/clients + TheRock CI (#6361) ## Summary - Enable ILP64 OpenBLAS for rocBLAS client tests and benchmarks (CMake and Windows packaging paths). - Point TheRock GitHub Actions workflows at `users/todavis/openblas-ilp64-enablement` so CI builds align with the OpenBLAS ILP64 work in TheRock (replaces the prior Windows-test-enablement branch pin after reverting that change). Also reverts PR #6353, re-enabling the full test patterns that previously failed due to OpenBLAS linking. ## Tracking ROCM-21562 Made with [Cursor](https://cursor.com)
[rocBLAS] CMake: ROCBLAS_ENABLE_CTEST, fail-fast, single enable_testing() (#5678)
## Summary
Follow-up to [#4659](ROCm/rocm-libraries#4659) addressing review feedback. This PR does **not** change the install location of `CTestTestfile.cmake` (still under `CMAKE_INSTALL_BINDIR/rocblas`); that was deferred for broader alignment on GNUInstallDirs / consumers.
## Motivation (from #4659 review)
- Prefer an explicit **ON/OFF** control so builds are **deterministic**: if categorization is enabled but preconditions are missing, **configure fails** instead of silently skipping.
  - Thread: ROCm/rocm-libraries#4659 (comment)
  - Related install gating: ROCm/rocm-libraries#4659 (comment)
- Call **`enable_testing()` once** at the rocBLAS project level when building tests, and **remove** the extra call from `clients/gtest`.
  - Thread: ROCm/rocm-libraries#4659 (comment)
  - "Remove this" (`enable_testing` in gtest): ROCm/rocm-libraries#4659 (comment)
- **Deferred:** `bin/` vs `DATADIR` (or similar) for the installed `CTestTestfile.cmake`; needs discussion with integrators.
  - Thread: ROCm/rocm-libraries#4659 (comment)
## What changed
- Add **`ROCBLAS_ENABLE_CTEST`** CMake option.
  - Default **ON** when `${ROCM_LIBRARIES_ROOT}/shared/ctest/TestCategories.cmake` exists; otherwise **OFF**.
  - When **ON** and **`BUILD_CLIENTS_TESTS`**: require `clients/gtest/test_categories.yaml` and the shared `TestCategories.cmake`; otherwise **`FATAL_ERROR`** with a short message.
- **`include(shared/ctest/TestCategories.cmake)`** and **`ROCBLAS_HAS_CTEST_CATEGORIES`** are set from **`projects/rocblas/CMakeLists.txt`**; gtest only runs `apply_test_category_labels` / staging when that flag is on.
- **`enable_testing()`** once before the main clients `if()` block when **`BUILD_CLIENTS_TESTS`**.
- Install of **`CTestTestfile.cmake`** gated on **`ROCBLAS_ENABLE_CTEST`** (install path unchanged).
- **`shared/ctest/README.md`**: document the option and fail-fast behavior under the rocBLAS example. ## Proposed work We didn't change the CTestTestfile.cmake location yet. We should do this: see https://cmake.org/cmake/help/latest/module/GNUInstallDirs.html for descriptions of the different dirs. ## Testing - Configure rocBLAS with **`BUILD_CLIENTS_TESTS=ON`** in a full **rocm-libraries** tree: categorization should configure cleanly with defaults. - **`ROCBLAS_ENABLE_CTEST=OFF`**: build should still succeed; status message indicates categorization skipped. - **`ROCBLAS_ENABLE_CTEST=ON`** with missing YAML or wrong **`ROCM_LIBRARIES_ROOT`**: configure should **fail** with the new errors. ## Related - Parent / context: [#4659](ROCm/rocm-libraries#4659)
[rocBLAS][Tensile] Initial support for gfx90c
## Motivation
Enabling gfx90c w/TheRock ROCm/TheRock#3818
gfx90c build fails due to lack of support in rocBLAS/Tensile.
## Technical Details
- Mimicking the enablement work done for gfx1152/1153 in ROCm/rocm-libraries#2653.
- gfx90c should be able to piggyback off of the existing vega10 Tensile Kernel logic files
- Not sure which test .yaml files require the `skip-gfx90c` marker so I've omitted that for now. Please let me know if it's the same as the `skip-gfx900` or some other subset and I'll add that in.
## Test Plan
1. Build rocBLAS targeting gfx90c w/TheRock
2. psdb tests w/ `*pre_checkin*:*quick*`
3. osdb tests w/ `*nightly*`
## Test Result
1. Build passes
2. psdb tests passed
```
/home/rocm/prebuild/rocm/bin/rocblas-test --gtest_filter='*pre_checkin*:*quick*'
rocBLAS info: Limiting OpenMP threads to 14 (detected 16 available, reduced by 2 to optimize AOCL performance)
rocBLAS warning: LD_LIBRARY_PATH override may use incompatible rocblas
rocBLAS info: Using reference library 'OpenBLAS::OpenBLAS'
rocBLAS version: 5.3.0.7567d83979-dirty
rocBLAS-commit-hash: cd4c348ba6f9e0bf66fd923b60b657cf7d6d4b3c
Tensile-commit-hash:
hipBLASLt version: 1.2.2 commit-hash: 7567d83979-dirty
Query device success: there are 1 devices
```
[rocBLAS] adds trace logging to beta get_solution APIs and tensile initialize (#6545) ## Motivation This pull request adds enhanced logging and tracing capabilities to several GEMM solution retrieval beta functions and the Tensile host initialization in rocBLAS. These changes improve traceability and debugging by logging function calls and environment information when appropriate layer modes are enabled. The most important changes are: #### Enhanced logging for GEMM solution APIs * Added detailed trace logging to `rocblas_gemm_ex_get_solutions`, `rocblas_gemm_batched_ex_get_solutions`, and `rocblas_gemm_strided_batched_ex_get_solutions`, including all key parameters and data types, when `rocblas_layer_mode_log_trace` is enabled. * Added similar trace logging to the `_get_solutions_by_type` variants for both GEMM and batched GEMM APIs, logging the input/output/compute types and other parameters. #### Defensive programming * Added null handle checks to the `_get_solutions_by_type` functions to prevent null pointer dereferencing. #### Initialization and environment logging * During Tensile host initialization, added logging of the rocBLAS version to `rocblas_cerr` if the environment variable `ROCBLAS_LAYER` enables trace or internal logging, improving startup diagnostics.
[rocBLAS] adds version notes This pull request updates the `CHANGELOG.md` for `rocBLAS 5.4.0` to document several new features, optimizations, and bug fixes. The main changes include enabling new GPU targets, adding trace logging, Windows DLL improvements, support for OpenBLAS ILP64, Dockerfile additions, performance optimizations, and a resolved issue with solution querying. **New features and enhancements:** * Enabled support for `gfx1250` and `gfx90c` GPU architectures. * Added trace logging (`ROCBLAS_LAYER=1`) for several `rocblas_gemm_ex_*_solutions` APIs. * Included version and other properties in the Windows `rocblas.dll`. * Added support for the `OpenBLAS` ILP64 API for host reference in clients. * Introduced Dockerfiles in the `docker` directory to assist with development setup. **Performance improvements:** * Improved performance of Level 3 `geam` for pure transpose scale use cases. * Improved performance of Level 2 `tpsv`. (Fc5b6e7b
[rocBLAS] Update test_categories.yaml for full runs This pull request updates the `test_categories.yaml` configuration to better align the `full` test category with the current CI use cases. The main change is to adjust the test patterns for the `full` category so they match the purpose of CI runner. Test pattern updates: * Updated the `full` category in `test_categories.yaml` to comment out the previous `*stress*` and `*HMM*` patterns and add `*pre_checkin*` and `*nightly*` patterns instead, ensuring the test selection matches the current CI use case.
[rocblas] Docs: Add brief explanation for rocBLAS Dockerfiles (#6782) This pull request adds documentation to the Linux installation guide for rocBLAS, introducing guidance on using Docker images to simplify the development environment setup. The most important change is the addition of a new section describing the available Docker images and their use cases. **New Docker-based installation instructions:** * Added a section titled "Using the rocBLAS Docker images" to the `Linux_Install_Guide.rst`, explaining the benefits of Docker for reproducible and ready-to-use development environments. * Provided details for two Dockerfiles (`Dockerfile.ubuntu24.prebuilt` and `Dockerfile.ubuntu24.fullbuild`), including their purposes, configuration options, and when to use each. * Linked to the official rocBLAS Docker documentation for further instructions on downloading and using the images.
[rocBLAS] default hipblaslt on gfx1250 This pull request makes a small update to the `isDefaultHipBLASLtArch` method of `struct _rocblas_handle` so that hipBLASLt support is assumed by default for one additional GPU architecture. Otherwise, the ROCBLAS_USE_HIPBLASLT environment variable is required. - Added GPU architecture `1250` to the condition in `isDefaultHipBLASLtArch` (`handle.hpp`) that determines whether an architecture uses hipBLASLt by default.
[rocBLAS] add advice on low memory testing This pull request updates the programmer's guide to provide clearer instructions for handling out-of-memory (OOM) issues during stress testing with `rocblas-test`. The documentation now includes guidance on adjusting memory limits and recommendations for minimum VRAM. Testing guidance improvements: * Added advice to reduce the `ROCBLAS_CLIENT_RAM_GB_LIMIT` by half and retry if `rocblas-test` is killed due to OOM, and recommended not running stress tests with less than 8 GB VRAM. Also clarified the use of test subtraction syntax with filters. ROCM-21747
[rocblas] Dockerfile: fetch latest tarball by default and option to fetch from local (#7034)
## Summary of changes
- Dockerfile.ubuntu24.prebuilt:
  - Fixed LD_LIBRARY_PATH uninitialized variable warning (direct assignment instead of shell expansion in ENV)
  - Added THEROCK_GIT_TAG=latest support: queries THEROCK_URL_BASE at build time and selects the highest-versioned tarball for the given ASIC; ADHOCBUILD tarballs are excluded
  - Auto-derives tarball suffix (-dcgpu, -dgpu, -all, or none) from THEROCK_ASIC via a case statement, instead of hardcoding -dcgpu
  - Added THEROCK_PREBUILT_ID ARG to override the auto-derived suffix
  - Added THEROCK_TARBALL_URL ARG to provide a full download URL directly, bypassing all auto-detection (useful when the tarball source uses a non-standard naming convention)
  - THEROCK_TARBALL_URL accepts paths to local files (e.g. `file:///tarball.tar.gz` or `./tarball.tar.gz`)
  - Three-stage fallback for latest resolution: exact prefix → ASIC name in therock-dist-linux-* → any .tar.gz containing the ASIC name
  - Improved error message when the directory listing fetch fails (e.g. 403), pointing to public URLs and the THEROCK_TARBALL_URL escape hatch
  - Added doxygen to the apt-get install list
- Dockerfile.ubuntu24.fullbuild: same LD_LIBRARY_PATH and doxygen fixes
- docker/README.md: fully updated to document all new build arguments, the ASIC suffix table, and tarball URL override; tables aligned for readability in text editors
## Test plan and Results
- [x] Test added docker build commands from the docker/README.md
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[rocBLAS] adds axpy replacement reference code when using windows OpenBLAS (#7184) This pull request introduces a Windows-specific implementation of the `ref_axpy` function to address compatibility issues when using OpenBLAS on Windows. The main changes involve conditionally compiling a new template implementation for `ref_axpy` and updating the header to ensure the correct version is used based on the platform and build configuration. Platform-specific implementation for Windows: * Added a custom template implementation of `ref_axpy` for Windows platforms when BLIS CBLAS is not enabled, including specializations for `float`, `double`, `rocblas_float_complex`, `rocblas_double_complex`, and a custom implementation for `rocblas_bfloat16` that performs computation in float precision. (`projects/rocblas/clients/common/cblas_interface.cpp`) Header updates for conditional compilation: * Updated the declaration and inline implementation of `ref_axpy` in `cblas_interface.hpp` to use the new Windows-specific override only when appropriate, ensuring the standard implementation is excluded on Windows/OpenBLAS builds without BLIS CBLAS. (`projects/rocblas/clients/include/cblas_interface.hpp`) ROCM-23872
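To illustrate what "performs computation in float precision" can mean for a `rocblas_bfloat16` reference axpy, here is a minimal sketch in Python. It is not the code from `cblas_interface.cpp`: the function names are hypothetical, unit strides are assumed, round-to-nearest-even narrowing is an assumption, and NaN/Inf handling is omitted. Python floats are doubles, so the intermediate is wider than the float used by the real reference.

```python
import struct


def bf16_to_float(bits):
    """Widen a 16-bit bfloat16 pattern: it is the upper half of a float32."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def float_to_bf16(value):
    """Narrow to bfloat16 bits with round-to-nearest-even.

    Assumption: the rounding mode is illustrative; NaN/Inf are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def ref_axpy_bf16(alpha, x, y):
    """Compute y[i] = alpha * x[i] + y[i] on bfloat16 bit patterns.

    Each element is widened, combined in floating point, then narrowed once,
    so rounding to bfloat16 happens a single time per output element.
    """
    return [
        float_to_bf16(alpha * bf16_to_float(xi) + bf16_to_float(yi))
        for xi, yi in zip(x, y)
    ]


if __name__ == "__main__":
    one = 0x3F80  # bfloat16 bit pattern for 1.0
    print([hex(v) for v in ref_axpy_bf16(2.0, [one], [one])])
```

Widening before the multiply-add avoids rounding the intermediate product to bfloat16, which is the usual motivation for computing such a reference in a wider type.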
[rocblas] Add parallel test runner script (run_tests.py) (#6854)
## Summary
- Adds `scripts/utilities/run_tests.py`, a parallel test runner designed
for simulation/emulation environments where GPU memory is limited and
test processes must run in isolation
- Adds `scripts/utilities/run_tests.md` with usage reference and caveats
- Covers 145 rocblas quick-filter jobs across four groups:
AUXILIARY (13), L1_BLAS (42), L1_BLAS_EX (18), L2_BLAS (72)
## Key design decisions
- **Flat concurrency**: all jobs share one `threading.Semaphore(N)`;
no per-group queuing — whichever job acquires a slot runs next
- **Persistent state**: `run_state.json` is written atomically
(tmp + `os.replace`) so a dropped SSH session or OOM kill never
leaves a corrupted file; on restart, dead PIDs are reset and
live PIDs are reattached via `waitpid`
- **Scoped display**: `--job` / `--group` filters the live dashboard
and summary to only the selected jobs, so counts are always accurate
- **Stdlib only**: no pip dependencies; works on any Python 3.8+ install
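The two central design decisions above, one shared semaphore and an atomically replaced state file, can be sketched with the standard library alone. This is an illustrative outline rather than the contents of `run_tests.py`: the function names are hypothetical, jobs are plain callables instead of test subprocesses, and PID reattachment is left out.

```python
import json
import os
import tempfile
import threading


def save_state_atomically(path, state):
    """Write JSON to a temp file in the same directory, then swap it in."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace is atomic when source and target share a filesystem, so
        # a reader sees either the old file or the new one, never a partial.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def run_jobs(jobs, max_parallel, state_path):
    """Run callables under one shared semaphore and persist each result."""
    slots = threading.Semaphore(max_parallel)  # flat concurrency, no groups
    state_lock = threading.Lock()
    state = {}

    def worker(name, job):
        with slots:  # whichever job acquires a slot runs next
            passed = bool(job())
        with state_lock:
            state[name] = "passed" if passed else "failed"
            save_state_atomically(state_path, state)

    threads = [
        threading.Thread(target=worker, args=(name, job))
        for name, job in jobs.items()
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        target = os.path.join(workdir, "run_state.json")
        print(run_jobs({"a": lambda: True, "b": lambda: False}, 2, target))
```

Creating the temporary file in the same directory as the target keeps the final rename on one filesystem, which is what makes the replacement atomic.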
## Test plan and results
- [x] `python3 run_tests.py --list-jobs` prints 145 jobs across 4 groups
- [x] `python3 run_tests.py --job AUXILIARY.logging` shows `1/1` in the
header and a single-row table
- [x] `python3 run_tests.py --group L1_BLAS` shows only the L1_BLAS row
- [x] Re-running after partial completion skips already-passed jobs and
re-runs failures
- [x] Ctrl+C saves state and exits 130; next run resumes correctly
- [x] `--no-color` output is clean under `tee`
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[rocBLAS][hipBLAS] windows adapt AOCL 5.2 to runtime DLL (#7057) This pull request improves the configuration and portability of Windows builds for the hipBLAS and rocBLAS clients by standardizing the use of the `AOCL_ROOT` variable for AOCL library paths, updating how AOCL utilities are linked, and ensuring required DLLs are included in the test deployment. The changes make the build scripts more maintainable and robust by avoiding hardcoded paths and improving dependency handling. **Windows AOCL integration improvements:** * Standardized the use of the `AOCL_ROOT` variable instead of hardcoded paths for AOCL libraries and include directories in both `hipblas` and `rocblas` client CMake scripts, improving maintainability and portability. * Updated the AOCL utils library to use `libaoclutils` (instead of `libaoclutils_static`) in the Windows build configuration for hipBLAS clients. **Test deployment enhancements:** * Added AOCL utility DLLs (`*aocl*.dll`) to the list of files copied for local test application deployment on Windows, ensuring all necessary runtime dependencies are included. AIROCBLAS-224
- Added rocblas_datatype_mxfp4_r(170), fp8_r(171), bf8_r(172) - Added rocblas_swmmac.cpp: INT4/INT8/FP16/BF16/MXFP4/FP8/BF8 kernels - gemm_ex_template: SWMMAC routing for all 7 types - L2 persistent counter per device, never hipMemset - __launch_bounds__(32,2) dual-wave resonance - All via LLVM __builtin_amdgcn_swmmac_* intrinsics (no AsmParser) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- rocblas-types.h: added mxfp4_r(170, E2M1+E8M0), fp8_r(171, E4M3), bf8_r(172, E5M2) - gemm_ex_kernels.cpp: SWMMAC routing checks for all 6 types - CMakeLists.txt: include rocblas_swmmac.cpp in build Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
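The commit message pairs `mxfp4_r` with "E2M1+E8M0", meaning 4-bit elements with a shared power-of-two scale. As a hedged sketch of how such values decode, independent of the proposed rocBLAS code, the following assumes the OCP Microscaling conventions: E2M1 has one sign bit, two exponent bits (bias 1) and one mantissa bit, and E8M0 encodes 2^(e-127) with 0xFF reserved for NaN. The nibble order within a byte (low nibble first) and the function names are assumptions for illustration.

```python
# E2M1 magnitudes for the 3 non-sign bits (exponent bias 1, one mantissa bit).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e2m1(nibble):
    """Decode a 4-bit E2M1 value: bit 3 is the sign, bits 2..0 the magnitude."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0x7]


def decode_e8m0(byte):
    """Decode an E8M0 scale: an unsigned power of two, 0xFF meaning NaN."""
    if byte == 0xFF:
        return float("nan")
    return 2.0 ** (byte - 127)


def decode_mxfp4_block(scale_byte, packed):
    """Decode packed E2M1 pairs that share a single E8M0 scale.

    Assumption: each byte holds two elements, low nibble first.
    """
    scale = decode_e8m0(scale_byte)
    values = []
    for byte in packed:
        values.append(scale * decode_e2m1(byte & 0xF))
        values.append(scale * decode_e2m1(byte >> 4))
    return values


if __name__ == "__main__":
    # Scale byte 128 means 2**1; nibbles 0x2 and 0x7 are 1.0 and 6.0.
    print(decode_mxfp4_block(128, [0x72]))
```

Because the scale is a pure power of two, applying it only shifts the exponent and introduces no additional rounding for values that stay in range.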
Contributor
@clearnature this is the deprecated repo for rocBLAS. Can you rebase your PR onto the tip of https://github.com/ROCm/rocm-libraries/tree/develop/projects/rocblas, where we can review it? It is unclear if this change is acceptable even there, but hopefully Claude will do most of the work.
Extended gemm_ex routing with SWMMAC StaggeredPipeline for INT4/INT8/FP16/BF16/MXFP4/FP8/BF8. Added mxfp4_r(170), fp8_r(171), bf8_r(172), e8m0_r(173) to rocblas_datatype enum. New rocblas_swmmac.cpp with per-device L2 persistent counter and launch_bounds(32,2) dual-wave resonance.