[CI] run_llama.sh: Replace llama-cli usage with llama-bench by mhalk · Pull Request #2231 · ROCm/aomp

mhalk · 2026-05-26T18:37:20Z

llama-cli has a tendency to hang during the prefetch/load step; llama-bench handles this step more robustly.

Note: cold and warm caches provide the same perf results.

Fixed conformity with the current (changed) cache directory structure. The fallback is supported as before.

Motivation

llama-cli processes hang intermittently during the prefetch step.

Technical Details

This avoids the llamacpp-perf hang by removing the normal-path llama-cli -hf ... --prompt /exit -st prefetch step.
Modern llama-bench supports -hf directly, so when no cached GGUF is available the script now lets llama-bench resolve/download and benchmark the requested HF model itself.
The existing cached-model behavior is preserved: if cached GGUF files are found, the script still benchmarks them through the existing llama-bench -m loop. The fallback also handles the current HF cache layout by finding symlinked snapshot GGUFs.
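
The selection logic described above can be sketched roughly as follows. This is a dry-run sketch under stated assumptions, not the code from run_llama.sh: the function names find_cached_ggufs and plan_bench_commands, the cache-root argument, and the org/name repo are hypothetical, and the commands are only printed, never executed. It assumes that cached GGUF files may be symlinks (for example, snapshot entries pointing into a blobs directory), which is why find is called with -L.

```shell
# Sketch only: function names, paths and the HF repo are illustrative
# assumptions, not taken from run_llama.sh.

# Print every cached GGUF file under a cache root, one per line.
# -L makes find follow symlinks, so a snapshot entry that is a symlink
# to a regular file still matches -type f.
find_cached_ggufs() {
    cache_root=$1
    [ -d "$cache_root" ] || return 0
    find -L "$cache_root" -type f -name '*.gguf' | sort
}

# Print the llama-bench command line(s) that would be run, one per line.
# Cached GGUFs are benchmarked one by one with -m; if none are found,
# the HF repo name is passed via -hf so llama-bench fetches the model.
plan_bench_commands() {
    cache_root=$1
    hf_repo=$2
    models=$(find_cached_ggufs "$cache_root")
    if [ -n "$models" ]; then
        printf '%s\n' "$models" | while IFS= read -r model; do
            printf 'llama-bench -m %s\n' "$model"
        done
    else
        printf 'llama-bench -hf %s\n' "$hf_repo"
    fi
}
```

Calling plan_bench_commands with a cache directory and a repo name prints one llama-bench -m line per cached GGUF, or a single llama-bench -hf line when the cache is empty or missing. Printing instead of executing keeps the sketch free of any assumptions about llama-bench output.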

Test Plan

Run the script for every combination of cache state (warm, cold) and fix (with, without), 100 times each.
Document hangs/timeouts and reported performance results.

Sporadic runs of the nightly scripts.
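
A repeated-run measurement like the one in this test plan could be scripted along the following lines. This is a minimal sketch, not the harness used for this PR: the function name run_trials and its output format are hypothetical, and it assumes GNU coreutils timeout, which exits with status 124 when the time limit is reached.

```shell
# Sketch only: run_trials is a hypothetical helper, not part of the PR.
# Usage: run_trials <trials> <limit-seconds> <command> [args...]
# Runs the command <trials> times under a wall-clock limit and tallies
# successes, timeouts, and other failures.
run_trials() {
    trials=$1
    limit=$2
    shift 2
    ok=0
    timed_out=0
    failed=0
    i=0
    while [ "$i" -lt "$trials" ]; do
        i=$((i + 1))
        status=0
        # GNU timeout exits with 124 if it had to stop the command.
        timeout "$limit" "$@" >/dev/null 2>&1 || status=$?
        case $status in
            0)   ok=$((ok + 1)) ;;
            124) timed_out=$((timed_out + 1)) ;;
            *)   failed=$((failed + 1)) ;;
        esac
    done
    printf 'runs=%d ok=%d timeouts=%d failed=%d\n' \
        "$trials" "$ok" "$timed_out" "$failed"
}
```

For example, run_trials 100 42 ./run_llama.sh would run the script 100 times with a 42-second limit per run; the script path here is only a placeholder. Benchmark output is discarded in this sketch, so collecting the pp512 and tg128 numbers would need an additional step.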

Test Result

Tested on gfx950:

Cache state	Path	Runs	Successes	Timeouts	pp512 avg t/s	tg128 avg t/s
warm	with fix	100	100	0	41566.36	358.97
warm	without fix	100	84	16	41645.86	359.22
cold	with fix	100	99	1	40369.79	359.90
cold	without fix	100	75	25	41833.08	359.30

Without the fix, the timed-out runs were stuck in llama-cli -hf ... --prompt /exit -st at "Loading model...", before any benchmark rows were emitted. With the fix, that prefetch path is avoided, while successful benchmark output keeps the same format and performance is roughly equivalent: the pp512 averages differ by about 0.2% (warm) and 3.5% (cold), and the tg128 averages by less than 0.2%.

The single timeout in the cold-cache, with-fix runs appears to have been caused by slow download speed (I chose a rather narrow timeout of 42 s).
Its return code was -15, whereas the return codes of the hanging runs were all -9.
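
The two return codes can be decoded with a small helper. This sketch assumes that the harness reports a process killed by signal N as return code -N (the convention used, for example, by Python's subprocess module); the PR does not state this explicitly, and the function name signal_name_for_returncode is hypothetical.

```shell
# Sketch only: assumes a return code of -N means "killed by signal N".
# Prints the signal name for negative codes, and a plain note otherwise.
signal_name_for_returncode() {
    rc=$1
    case $rc in
        -*) kill -l "${rc#-}" ;;   # strip the minus sign, look up the name
        *)  printf 'exit status %s (not a signal)\n' "$rc" ;;
    esac
}
```

Under that assumption, -15 corresponds to SIGTERM and -9 to SIGKILL, so the slow-download run and the hanging runs were ended by different signals.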

All manual nightly script runs (10) returned results in the expected time and range.

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

jplehr

thanks!

mhalk · 2026-05-26T20:07:06Z

Let's see how this performs :)

mhalk requested review from AMD-rranjanp, Kewen12 and jplehr May 26, 2026 18:37

mhalk requested review from carlobertolli, estewart08, gregrodgers, ronlieb and zGoldthorpe as code owners May 26, 2026 18:37

jplehr approved these changes May 26, 2026

mhalk merged commit be0edf9 into ROCm:aomp-dev May 26, 2026
1 check passed

mhalk deleted the amd/dev/mhalkenh/fix/ci-llama-use-bench-instead-cli branch May 26, 2026 20:07

mhalk merged 1 commit into ROCm:aomp-dev from mhalk:amd/dev/mhalkenh/fix/ci-llama-use-bench-instead-cli
