Skip to content

adding vllm unit tests#1517

Open
kinjalpatel27 wants to merge 4 commits into
mainfrom
kinjal/vllm_fq_unit
Open

adding vllm unit tests#1517
kinjalpatel27 wants to merge 4 commits into
mainfrom
kinjal/vllm_fq_unit

Conversation

@kinjalpatel27
Copy link
Copy Markdown
Contributor

@kinjalpatel27 kinjalpatel27 commented May 18, 2026

What does this PR do?

Type of change: new tests

This PR adds unit tests for vLLM fakequant, specifically testing code in modelopt/torch/quantization/plugins/vllm.py

Testing

pytest tests/gpu_vllm/torch/quantization/test_vllm_dynamic_modules.py -sv

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: ✅
  • Did you update Changelog?: N/A
  • Did you get Claude approval on this PR?: ✅

Additional Information

Summary by CodeRabbit

  • Tests

    • Added comprehensive GPU vLLM test suite with end-to-end quantization checks and fixtures for TinyLlama, TinyQwen3-MoE, and DeepSeek V3; includes helpers to build tiny DeepSeek V3 models.
  • Chores

    • Updated GPU CI to use explicit container image references, added a GPU-focused test session, and adjusted test-run setup for vLLM.
  • Documentation

    • Documented new GPU test directory in contributing guide.

Review Change Stack

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 18, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 18, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds GPU vLLM fakequant tests: CI matrix and nox session for containerized GPU tests, DeepSeek V3 test model helpers, pytest conftest to enable vLLM RPC serialization, and end-to-end quantization tests with fixtures asserting quantizer coverage and static _amax invariants.

Changes

vLLM Fakequant Test Suite

Layer / File(s) Summary
CI & Test Runner Setup
.github/workflows/gpu_tests.yml, noxfile.py, CONTRIBUTING.md, examples/vllm_serve/Dockerfile
GPU test matrix now stores full container image paths; job uses ${{ matrix.container_image }} directly; subprocess coverage is disabled for gpu_vllm; python3 -m pip install nox is used; new @nox.session(venv_backend="none") session runs tests/gpu_vllm; vLLM example Dockerfile image tag and flash-attn install flags updated.
Test Model Utilities
tests/_test_utils/torch/transformers_models.py
Adds DeepseekV3Config import and introduces get_tiny_deepseek_v3() and create_tiny_deepseek_v3_dir() helpers to construct and save minimal DeepSeek V3 models with fixed router/expert settings and enforced topk_method.
Pytest Environment & vLLM Configuration
tests/gpu_vllm/conftest.py
New conftest module sets VLLM_ALLOW_INSECURE_SERIALIZATION=1 at import time (before any import vllm) to enable RPC-based callable serialization for vLLM worker processes.
Quantization Worker Framework & Fixtures
tests/gpu_vllm/torch/quantization/test_vllm_dynamic_modules.py (core infra)
Adds RPC worker _quantize_and_summarize() that runs calibration within mtq.quantize() and inspects quantized modules; adds _boot_llm/_shutdown_llm, module-scoped fixtures for TinyLlama, Qwen3-MoE, and Deepseek-V3 engines, and an assertion helper enforcing static _amax on enabled quantizers.
Quantization Test Cases
tests/gpu_vllm/torch/quantization/test_vllm_dynamic_modules.py (tests)
Three end-to-end tests for TinyLlama, TinyQwen3-MoE, and TinyDeepseek-V3 validate no missing quantizers, expected quantized module counts and coverage (parallel-linear, attention, MLA, MoE), and the static _amax invariant.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 38.24% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title 'adding vllm unit tests' is vague and generic. It fails to specify which vLLM functionality is being tested or what the primary technical change involves. Consider a more descriptive title that clarifies the specific functionality being tested, such as 'Add vLLM fakequant dynamic module quantization tests' or similar.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns ✅ Passed No security anti-patterns detected: no unsafe torch.load, numpy.load, trust_remote_code, eval/exec, nosec comments, or new dependencies. Test-only env var setting is permitted.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch kinjal/vllm_fq_unit

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.50%. Comparing base (b02e888) to head (d3cfe4e).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1517      +/-   ##
==========================================
- Coverage   76.63%   76.50%   -0.14%     
==========================================
  Files         476      476              
  Lines       51813    52208     +395     
==========================================
+ Hits        39707    39940     +233     
- Misses      12106    12268     +162     
Flag Coverage Δ
examples 41.63% <ø> (+2.60%) ⬆️
gpu 59.51% <ø> (-0.59%) ⬇️
regression 15.21% <ø> (+0.07%) ⬆️
unit 52.72% <ø> (+0.08%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@kinjalpatel27 kinjalpatel27 force-pushed the kinjal/vllm_fq_unit branch from 7b708cb to 0ebeee6 Compare May 18, 2026 23:05
@kinjalpatel27 kinjalpatel27 marked this pull request as ready for review May 18, 2026 23:33
@kinjalpatel27 kinjalpatel27 requested a review from a team as a code owner May 18, 2026 23:33
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/gpu_vllm_fakequant/torch/quantization/test_vllm_dynamic_modules.py`:
- Around line 16-17: Move the test suite file
tests/gpu_vllm_fakequant/torch/quantization/test_vllm_dynamic_modules.py into
one of the approved GPU test folders (e.g., tests/gpu or the appropriate
tests/gpu_* target) so it complies with the repository test-directory policy,
then update any CI and nox configuration entries that reference the old path
(search for occurrences of "gpu_vllm_fakequant" or the filename
"test_vllm_dynamic_modules.py" in noxfiles and workflow YAMLs) to point to the
new location and ensure the test is included in the correct GPU test collection.
- Around line 352-353: Replace the broad "except Exception as exc" that
unconditionally calls pytest.skip with a targeted check that only skips for
known environment/capability failures and re-raises all other exceptions: catch
specific exception types (e.g., ImportError, OSError, RuntimeError) or inspect
exc's message for known indicators (CUDA/vLLM model download failures) and call
pytest.skip only in those matched cases; for any other exc, re-raise the
exception so real regressions in the vLLM/model loading/quantization path
surface. Ensure this logic surrounds the same try block that currently ends with
pytest.skip and continues to use pytest.skip(...) for the allowed cases.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f28e2a71-b5a2-446b-a077-041bd9668aed

📥 Commits

Reviewing files that changed from the base of the PR and between 3f88f0d and 0ebeee6.

📒 Files selected for processing (4)
  • .github/workflows/gpu_tests.yml
  • noxfile.py
  • tests/gpu_vllm_fakequant/conftest.py
  • tests/gpu_vllm_fakequant/torch/quantization/test_vllm_dynamic_modules.py

Comment thread tests/gpu_vllm/torch/quantization/test_vllm_dynamic_modules.py
Comment thread tests/gpu_vllm_fakequant/torch/quantization/test_vllm_dynamic_modules.py Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/gpu_tests.yml:
- Around line 55-56: The vLLM image tag in the workflow (the line containing
"container_image: docker.io/vllm/vllm-openai:v0.21.0") is out of sync with
examples/vllm_serve/Dockerfile which uses v0.10.2; update the workflow entry to
use the same tag as the Dockerfile (change the container_image value to
docker.io/vllm/vllm-openai:v0.10.2) so the comment "Keep in sync with
examples/vllm_serve/Dockerfile" is honored, or conversely update the Dockerfile
to v0.21.0 if you intend to standardize on the newer release—ensure both places
reference the identical image tag.
- Line 56: The workflow pins the OpenAI-server image to container_image:
docker.io/vllm/vllm-openai:v0.21.0 which may be vulnerable to
GHSA-3mwp-wvh9-7528; verify whether that v0.21.0 release includes the patch and
if not either update the tag to the patched vLLM image (reference the patched
release tag) or add mitigations: enforce input validation and a max "n"
parameter and/or enable rate-limiting in the vLLM server invocation;
specifically, locate the container_image: docker.io/vllm/vllm-openai:v0.21.0
entry and either change it to the patched image tag or add environment
variables/command-line args or ingress rate-limiting configuration to bound the
`n` parameter and throttle requests.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c96aa3eb-b472-440a-91c2-bf1b5151e78b

📥 Commits

Reviewing files that changed from the base of the PR and between 0ebeee6 and 3d4848a.

📒 Files selected for processing (3)
  • .github/workflows/gpu_tests.yml
  • noxfile.py
  • tests/gpu_vllm_fakequant/torch/quantization/test_vllm_dynamic_modules.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • noxfile.py
  • tests/gpu_vllm_fakequant/torch/quantization/test_vllm_dynamic_modules.py

Comment thread .github/workflows/gpu_tests.yml Outdated
Comment thread .github/workflows/gpu_tests.yml Outdated
@kinjalpatel27 kinjalpatel27 marked this pull request as draft May 19, 2026 00:24
Comment thread tests/gpu_vllm_fakequant/conftest.py Outdated
Comment thread tests/gpu_vllm_fakequant/conftest.py Outdated
Comment thread tests/gpu_vllm_fakequant/conftest.py Outdated
Comment thread tests/gpu_vllm_fakequant/torch/quantization/test_vllm_dynamic_modules.py Outdated
Comment thread tests/gpu_vllm_fakequant/torch/quantization/test_vllm_dynamic_modules.py Outdated
Comment thread tests/gpu_vllm_fakequant/torch/quantization/test_vllm_dynamic_modules.py Outdated
Comment thread tests/gpu_vllm_fakequant/torch/quantization/test_vllm_dynamic_modules.py Outdated
Comment thread tests/gpu_vllm_fakequant/torch/quantization/test_vllm_dynamic_modules.py Outdated
@kinjalpatel27 kinjalpatel27 marked this pull request as ready for review May 20, 2026 01:06
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/gpu_tests.yml (1)

25-33: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Expand the PR gate to cover the new vLLM lane’s actual inputs.

Right now any_changed will stay false for changes in tests/_test_utils/** or examples/vllm_serve/Dockerfile, even though this lane now depends on the shared tiny-model builders and the workflow explicitly says the image pin must stay in sync with that Dockerfile. That lets helper regressions or image drift skip gpu_vllm entirely.

Suggested patch
       files: |
         .github/workflows/gpu_tests.yml
+        examples/vllm_serve/Dockerfile
         modelopt/**
         noxfile.py
         pyproject.toml
+        tests/_test_utils/**
         tests/gpu/**
         tests/gpu_megatron/**
         tests/gpu_trtllm/**
         tests/gpu_vllm/**
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/gpu_tests.yml around lines 25 - 33, The PR gate for the
gpu_vllm lane is missing inputs so any_changed stays false for changes to shared
tiny-model builders and the vLLM Dockerfile; update the files list in
.github/workflows/gpu_tests.yml (the block that populates any_changed) to
include tests/_test_utils/** and examples/vllm_serve/Dockerfile (and any other
tiny-model builder paths if separate) so changes to those files will trigger the
gpu_vllm lane and keep the image pin in sync with the Dockerfile referenced by
the gpu_vllm job.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/gpu_vllm/torch/quantization/test_vllm_dynamic_modules.py`:
- Around line 353-355: The test unconditionally asserts VllmMLAAttention exists
but that symbol can be None on builds without MLA; update the
test_registry_has_mla_attention function to first check if VllmMLAAttention is
None and call pytest.skip("MLA not available") (or return) when absent,
otherwise perform the existing assertions against QuantModuleRegistry; reference
the VllmMLAAttention symbol and the QuantModuleRegistry in the change.

---

Outside diff comments:
In @.github/workflows/gpu_tests.yml:
- Around line 25-33: The PR gate for the gpu_vllm lane is missing inputs so
any_changed stays false for changes to shared tiny-model builders and the vLLM
Dockerfile; update the files list in .github/workflows/gpu_tests.yml (the block
that populates any_changed) to include tests/_test_utils/** and
examples/vllm_serve/Dockerfile (and any other tiny-model builder paths if
separate) so changes to those files will trigger the gpu_vllm lane and keep the
image pin in sync with the Dockerfile referenced by the gpu_vllm job.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 94d4592d-e642-4fab-aa1f-b84f904dafaa

📥 Commits

Reviewing files that changed from the base of the PR and between 3d4848a and 6bd7465.

📒 Files selected for processing (5)
  • .github/workflows/gpu_tests.yml
  • noxfile.py
  • tests/_test_utils/torch/transformers_models.py
  • tests/gpu_vllm/conftest.py
  • tests/gpu_vllm/torch/quantization/test_vllm_dynamic_modules.py

Comment thread tests/gpu_vllm/torch/quantization/test_vllm_dynamic_modules.py Outdated
Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
@kinjalpatel27
Copy link
Copy Markdown
Contributor Author

/claude review

Comment on lines +117 to +120

# Static-amax invariant: every enabled quantizer must own an ``_amax``
# after calibration. ``kv_b_proj`` is exempt — vLLM's MLA decode path
# reads its weight directly and never calls its forward.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[SUGGESTION] The kv_b_proj not in name exemption is broader than the stated rationale.

The comment says vLLM's MLA decode path reads kv_b_proj's weight directly and never calls its forward. That justifies exempting input_quantizer._amax (no observed activations), but weight_quantizer._amax is computed from the weight tensor at calibration time independently of any forward call, so it should still be present. As written, the substring match also exempts the output_quantizer and any future quantizer attached to a module whose name contains kv_b_proj.

If a future regression silently strips weight calibration for kv_b_proj, this assertion would not catch it. Consider narrowing to the specific quantizer that's expected to be missing, e.g.:

if not hasattr(module, "_amax") and not name.endswith("kv_b_proj.input_quantizer"):
    quantizers_without_amax.append(name)

(or whichever exact suffix matches the bypassed quantizer in the MLA path).

Comment thread noxfile.py
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude review passed — no blocking issues found.

Verified the new tests against the vLLM plugin: imported symbols (_VLLMParallelLinear, _QuantFusedMoEBase, VllmMLAAttention, _ATTENTION_TYPES, disable_compilation) exist in modelopt/torch/quantization/plugins/vllm.py, and the class-name strings asserted in the parallel-linear counts (QuantRowParallelLinear, QuantQKVParallelLinear, QuantMergedColumnParallelLinear) match what _DMRegistryCls._get_dynamic_class_name generates (prefix("Quant") + nn_cls.__name__). The MoE isinstance check correctly covers both _QuantVLLMFusedMoE and _QuantVLLMSharedFusedMoE via the shared _QuantFusedMoEBase. VLLM_ALLOW_INSECURE_SERIALIZATION is set in conftest before any vLLM import, which is the right place. The workflow's pythonpython3 and explicit container_image paths are sane.

Findings (all SUGGESTION, non-blocking):

  • 2 inline comments — kv_b_proj exemption is broader than its stated rationale; python3 vs python inconsistency in noxfile.

Risk: low. Pure test-only change, no production code touched.

LGTM

Copy link
Copy Markdown
Collaborator

@cjluo-nv cjluo-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bot review — DM the bot to share feedback.

Test-only PR adding tests/gpu_vllm/ with end-to-end fakequant coverage for _VLLMParallelLinear, _QuantFusedMoEBase, vLLM Attention, and MLAAttention across TinyLlama / TinyQwen3-MoE / TinyDeepseekV3. Approach (driving mtq.quantize inside the worker via LLM.collective_rpc, with _dummy_run(1) as the calibration forward loop) looks reasonable and the assertions are meaningful (per-class linear counts, per-quantizer slot presence, and _amax registration with a documented kv_b_proj exemption). Couple of points worth a human eye:

  • Dockerfile/noxfile pin mismatch. noxfile.py and .github/workflows/gpu_tests.yml both pin docker.io/vllm/vllm-openai:v0.20.0 with the comment "Keep in sync with examples/vllm_serve/Dockerfile", but examples/vllm_serve/Dockerfile is on vllm/vllm-openai:v0.10.2. Either the Dockerfile needs bumping in this PR or the comment/pin is wrong — please reconcile before merge.
  • v0.20.0 image tag. Worth double-checking the tag actually exists on Docker Hub (vLLM's mainline at the time of writing is ~0.10/0.11). If it's a pre-release or internal tag, the gpu_vllm CI job will fail on first run.
  • Subprocess coverage disabled for gpu_vllm. Reasonable workaround given the documented engine-core IPC deadlock, but it does mean the new test surface contributes zero coverage to Codecov; the gpu-tests job still uploads the (empty-ish) coverage file. Just flagging; not blocking.

No design-review concerns — this is test infra, not a new abstraction. Standard NVIDIA license headers (exception applies). Nudging rather than approving so an owner can confirm the Docker tag/pin question.

Comment thread noxfile.py
Comment thread .github/workflows/gpu_tests.yml

import gc

import pytest
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bot comment.

Top-level from vllm import LLM here means pytest collection imports vLLM before conftest.py has had a chance to set VLLM_ALLOW_INSECURE_SERIALIZATION — which works only because pytest loads conftests before test modules in the same package. Worth a brief note in the conftest docstring confirming that ordering, since the comment currently warns "must precede any import vllm" without explaining why the test-module import is OK.

Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
@kinjalpatel27 kinjalpatel27 requested a review from a team as a code owner May 22, 2026 01:10
@kinjalpatel27 kinjalpatel27 requested a review from sugunav14 May 22, 2026 01:10
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/vllm_serve/Dockerfile (1)

1-44: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Run the image as a non-root user before CMD.

The Dockerfile never sets USER, so workloads run as root by default. That weakens container isolation and matches the DS-0002 security finding.

🔐 Suggested hardening patch
 FROM vllm/vllm-openai:v0.20.0
@@
 # Allow users to run without root
 RUN chmod -R 777 /workspace
+
+# Drop root privileges for runtime
+RUN useradd --create-home --uid 10001 --shell /bin/bash appuser && \
+    chown -R appuser:appuser /workspace
+USER appuser
 
 # Override the ENTRYPOINT from the base image to allow flexible usage
 ENTRYPOINT []
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/vllm_serve/Dockerfile` around lines 1 - 44, The Dockerfile currently
leaves processes running as root (no USER set) which weakens isolation; create a
non-root user (e.g., "appuser") after installing packages and preparing files,
chown the WORKDIR (/workspace) and Model-Optimizer files to that user (update or
replace the chmod -R 777 /workspace step with chown -R appuser:appuser
/workspace and Model-Optimizer ownership), then add USER appuser before the
ENTRYPOINT/CMD lines so the container runs as that non-root user; keep root for
earlier steps (apt-get, pip installs, extensions precompile) and only switch to
the non-root account once image setup is complete.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@examples/vllm_serve/Dockerfile`:
- Around line 1-44: The Dockerfile currently leaves processes running as root
(no USER set) which weakens isolation; create a non-root user (e.g., "appuser")
after installing packages and preparing files, chown the WORKDIR (/workspace)
and Model-Optimizer files to that user (update or replace the chmod -R 777
/workspace step with chown -R appuser:appuser /workspace and Model-Optimizer
ownership), then add USER appuser before the ENTRYPOINT/CMD lines so the
container runs as that non-root user; keep root for earlier steps (apt-get, pip
installs, extensions precompile) and only switch to the non-root account once
image setup is complete.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2c9f75c4-ebe9-4c34-a2b4-4d201f036ef5

📥 Commits

Reviewing files that changed from the base of the PR and between 597ecca and 1c6c40d.

📒 Files selected for processing (2)
  • examples/vllm_serve/Dockerfile
  • tests/gpu_vllm/torch/quantization/test_vllm_dynamic_modules.py

Comment thread examples/vllm_serve/Dockerfile Outdated
Comment thread tests/_test_utils/torch/transformers_models.py Outdated
Comment thread .github/workflows/gpu_tests.yml
@@ -1,4 +1,4 @@
FROM vllm/vllm-openai:v0.10.2
FROM vllm/vllm-openai:v0.20.0
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From previous benchmarking, I know 0.20 has an accuracy bug on kimi k2.6. Not sure if it's fixed. 0.19.0 looks safer to me.

Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants