
[CI] run_llama.sh: Replace llama-cli usage with llama-bench #2231

Merged: mhalk merged 1 commit into ROCm:aomp-dev from mhalk:amd/dev/mhalkenh/fix/ci-llama-use-bench-instead-cli on May 26, 2026

@mhalk mhalk commented May 26, 2026

llama-cli has a tendency to hang during the prefetch/load step.
llama-bench can handle this step in a more robust manner.

  • Note: cold and warm caches provide the same perf results

Conform to the current (changed) cache directory structure.
Support the fallback as before.

Motivation

Intermittent hangs of llama-cli processes during prefetch step.

Technical Details

This avoids the llamacpp-perf hang by removing the llama-cli -hf ... --prompt /exit -st prefetch step from the normal path.
Modern llama-bench supports -hf directly, so when no cached GGUF is available the script now lets llama-bench resolve/download and benchmark the requested HF model itself.
The existing cached-model behavior is preserved: if cached GGUF files are found, the script still benchmarks them through the existing llama-bench -m loop. The fallback also handles the current HF cache layout by finding symlinked snapshot GGUFs.
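The selection logic described above can be sketched roughly as follows. This is an illustration in Python, not the code from run_llama.sh (which is a shell script and is not shown on this page). The helper names `find_cached_ggufs` and `build_bench_commands`, the cache directory argument, and the repo string are hypothetical; only the `llama-bench -m` and `llama-bench -hf` invocations come from the description.

```python
import os
from pathlib import Path


def find_cached_ggufs(cache_dir):
    """Return sorted paths of usable *.gguf files below cache_dir.

    Symlinked files are included, which matters for a cache layout where
    snapshot entries are symlinks to the real files (assumed layout).
    """
    root = Path(cache_dir)
    if not root.is_dir():
        return []
    found = []
    # os.walk lists symlinks to files among the file names of each directory.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            # is_file() follows symlinks, so dangling links are skipped.
            if name.endswith(".gguf") and path.is_file():
                found.append(path)
    return sorted(found)


def build_bench_commands(cache_dir, hf_repo, bench="llama-bench"):
    """Build one command per cached GGUF, or a single -hf command if none exist."""
    ggufs = find_cached_ggufs(cache_dir)
    if ggufs:
        # Cached path: benchmark every cached model file with -m.
        return [[bench, "-m", str(path)] for path in ggufs]
    # No cached model: let llama-bench resolve and download the HF model itself.
    return [[bench, "-hf", hf_repo]]
```

With an empty cache this yields a single `-hf` command, so no separate llama-cli prefetch is needed; with cached files it yields one `-m` command per file, mirroring the existing loop.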

Test Plan

Run script with [warm, cold] cache x [with, without] fix; 100 times each.
Document hangs/timeouts and reported performance results.

Sporadic runs of the nightly scripts.
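A repeated-run harness along the lines of this test plan might look like the sketch below. It is not the harness actually used for this PR; `run_trials` is a hypothetical helper, and the command and timeout are left to the caller.

```python
import subprocess


def run_trials(cmd, runs, timeout_s):
    """Run cmd `runs` times and count successes, timeouts, and other failures."""
    counts = {"success": 0, "timeout": 0, "failure": 0}
    for _ in range(runs):
        try:
            result = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout_s
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising this exception.
            counts["timeout"] += 1
            continue
        if result.returncode == 0:
            counts["success"] += 1
        else:
            counts["failure"] += 1
    return counts
```

Calling it once per combination of cache state and script version, with `runs=100`, would give the per-configuration success and timeout counts the plan asks for.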

Test Result

Tested on gfx950:

| Cache state | Path | Runs | Successes | Timeouts | pp512 avg t/s | tg128 avg t/s |
|---|---|---|---|---|---|---|
| warm | with fix | 100 | 100 | 0 | 41566.36 | 358.97 |
| warm | without fix | 100 | 84 | 16 | 41645.86 | 359.22 |
| cold | with fix | 100 | 99 | 1 | 40369.79 | 359.90 |
| cold | without fix | 100 | 75 | 25 | 41833.08 | 359.30 |

Without the fix, the timed-out runs were stuck in llama-cli -hf ... --prompt /exit -st at Loading model..., before any benchmark rows were emitted. With the fix, that prefetch path is avoided, while successful benchmark output keeps the same format and shows roughly equivalent performance.
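To put "roughly equivalent" into numbers, the relative change between the averages with and without the fix can be computed from the table. The helper `relative_change_percent` is hypothetical; the input values are the reported averages.

```python
def relative_change_percent(with_fix, without_fix):
    """Relative change of with_fix against without_fix, in percent."""
    return (with_fix - without_fix) / without_fix * 100.0


# Reported averages in t/s as (with fix, without fix) pairs.
results = {
    ("warm", "pp512"): (41566.36, 41645.86),
    ("warm", "tg128"): (358.97, 359.22),
    ("cold", "pp512"): (40369.79, 41833.08),
    ("cold", "tg128"): (359.90, 359.30),
}

for (cache, metric), (with_fix, without_fix) in results.items():
    change = relative_change_percent(with_fix, without_fix)
    print(f"{cache:4s} {metric}: {change:+.2f} %")
```

By this measure the largest gap is cold-cache pp512, where the run with the fix is about 3.5 % lower; the other three differences are below 0.2 %.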

The single timeout in the cold-cache runs with the fix appears to have been caused by a slow download (the timeout was set rather narrowly, at 42 s).
Its return code was -15, whereas the return codes of all other hanging runs were -9.
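If the harness reports return codes in the convention of Python's subprocess module (an assumption; the page does not say how they were collected), a negative value -N means the process was terminated by signal N. The hypothetical helper below decodes such values.

```python
import signal


def describe_returncode(returncode):
    """Describe a subprocess-style return code (negative means killed by a signal)."""
    if returncode < 0:
        return "killed by " + signal.Signals(-returncode).name
    return "exited with status " + str(returncode)


print(describe_returncode(-15))  # killed by SIGTERM
print(describe_returncode(-9))   # killed by SIGKILL
```

Under that assumption, the one timed-out run with the fix ended on SIGTERM, while the hanging llama-cli runs had to be ended with SIGKILL.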

All manual nightly script runs (10) returned results in the expected time and range.

Submission Checklist


@jplehr jplehr left a comment

thanks!

mhalk commented May 26, 2026

Let's see how this performs :)

@mhalk mhalk merged commit be0edf9 into ROCm:aomp-dev May 26, 2026
1 check passed
@mhalk mhalk deleted the amd/dev/mhalkenh/fix/ci-llama-use-bench-instead-cli branch May 26, 2026 20:07