Project-HAMi · mesutoezdil · May 29, 2026
diff --git a/docs/faq/faq.md b/docs/faq/faq.md
@@ -181,3 +181,216 @@ If the official Device Plugin cannot provide the required information, HAMi deve
 
 - Ascend’s official Device Plugin requires a separate plugin for each card type. HAMi abstracts these card templates into a unified plugin for easier integration with the scheduler.
 - NVIDIA requires custom implementations to support advanced features like compute and memory limits, overcommitment, and NUMA awareness, necessitating HAMi’s custom Device Plugin.
+
+## How does HAMi enforce GPU memory and compute limits - is it a kernel driver, hardware partition, or a library?
+
+**TL;DR**
+
+HAMi enforces limits using a user-space CUDA interception library (`libvgpu.so`, part of HAMi-core). It is not a kernel driver and does not use hardware partitioning. The library intercepts CUDA API calls inside the container before they reach the GPU driver.
+
+---
+
+When a container starts on a HAMi-managed node, the device plugin mounts `libvgpu.so` and a `/etc/ld.so.preload` file into the container via hostPath during the `Allocate` call. The `/etc/ld.so.preload` file contains a single line pointing to `libvgpu.so`. The Linux dynamic linker reads this file whenever a process starts inside the container and loads `libvgpu.so` before any other library. This achieves the same effect as `LD_PRELOAD` without modifying any environment variables. Every CUDA memory allocation call (`cudaMalloc`, `cuMemAlloc`, and related functions) is then intercepted before it reaches the NVIDIA driver. The library checks the request against the remaining budget derived from the pod’s `nvidia.com/gpumem` resource limit. If the allocation would exceed the limit, the call returns an out-of-memory error to the application.
+
+Compute limits (`nvidia.com/gpucores`) use a token-bucket throttle inside the same library: compute calls are held until the slice’s compute share is available.
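+
+As a minimal sketch of how these limits are requested, the Pod below asks for one vGPU with a memory and compute budget through the `nvidia.com/gpu`, `nvidia.com/gpumem`, and `nvidia.com/gpucores` resources named in this FAQ. The pod name, image, and numeric values are illustrative assumptions, not recommendations.
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: vgpu-demo                    # hypothetical name
+spec:
+  containers:
+    - name: cuda
+      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # assumed example image
+      command: ["sleep", "infinity"]
+      resources:
+        limits:
+          nvidia.com/gpu: 1          # number of vGPUs
+          nvidia.com/gpumem: 4096    # memory budget in MiB (illustrative)
+          nvidia.com/gpucores: 30    # compute share in percent (illustrative)
+```
+
+With a spec like this, an allocation that would push the container past its 4096 MiB budget is expected to fail with an out-of-memory error inside the container.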
+
+This approach has two implications:
+
+1. **No kernel or firmware changes are required.** HAMi works on any standard Kubernetes node with NVIDIA drivers v440 or later, without needing MIG-capable hardware or privileged kernel modules.
+2. **Applications that bypass the interception library are not covered.** Enforcement does not apply if a process reaches the GPU driver without going through `libvgpu.so`, if it runs in an environment where the `/etc/ld.so.preload` mount is not effective (for example Docker-in-Docker), or if `CUDA_DISABLE_CONTROL=true` is set. See the related question on [gpumem limits not being enforced](#why-is-my-nvidiagpumem-limit-not-enforced).
+
+For a detailed diagram of the interception flow, see [GPU Virtualization](./core-concepts/gpu-virtualization).
+
+## How does HAMi vGPU differ from NVIDIA MIG? When should I use each?
+
+**TL;DR**
+
+HAMi vGPU is a software-only, flexible partition with no special hardware requirements. NVIDIA MIG is a hardware partition available only on select data-center GPUs from the Ampere generation onward (for example A100, A30, H100, H200). Use HAMi vGPU for workloads that need flexible, dynamic allocation across any NVIDIA GPU. Use MIG when hardware-level isolation and guaranteed performance are required.
+
+---
+
+| Property | HAMi vGPU | NVIDIA MIG |
+|---|---|---|
+| Hardware requirement | Any NVIDIA GPU, driver v440+ | Select data-center GPUs, Ampere or later (e.g. A100, A30, H100, H200) |
+| Isolation mechanism | User-space library interception | Hardware engine partitioning |
+| Memory enforcement | Soft (CUDA API level) | Hard (hardware-enforced) |
+| Compute enforcement | Soft (throttle inside libvgpu.so) | Hard (separate SM partitions) |
+| Partition granularity | 1 MiB memory, 1% compute | Fixed MIG profiles (e.g. 1g.10gb) |
+| Dynamic reconfiguration | Yes, no node drain needed | Requires reconfiguring the MIG profile and restarting device plugin |
+| Multi-tenant noise isolation | Best-effort | Strong |
+
+HAMi also supports dynamic MIG (`dynamic-mig`), which uses `mig-parted` to reconfigure MIG profiles on demand and then schedules through HAMi’s scheduler. See [Dynamic MIG Support](./userguide/nvidia-device/dynamic-mig-support).
+
+Choose HAMi vGPU when:
+- The GPU model does not support MIG
+- Workloads need flexible memory sizes that do not map to fixed MIG profiles
+- Dynamic repacking of GPU resources is needed without node drains
+
+Choose MIG when:
+- Strict hardware-level isolation between tenants is a compliance or SLA requirement
+- The workload benefits from guaranteed, predictable compute throughput
+
+## Why does nvidia-smi inside my container show less memory than on the host?
+
+**TL;DR**
+
+This is expected behavior. HAMi replaces the driver’s memory reporting inside the container so that `nvidia-smi` shows only the allocated limit, not the physical GPU memory. The host still sees the full physical memory.
+
+---
+
+When `libvgpu.so` intercepts CUDA driver calls, it also intercepts the query functions that `nvidia-smi` and CUDA applications use to report total and free GPU memory (`nvmlDeviceGetMemoryInfo`, `cuDeviceTotalMem`, and related calls). These calls return values derived from `nvidia.com/gpumem` rather than the physical card capacity.
+
+This design is intentional: workloads that check available GPU memory before deciding how much to allocate (for example, vLLM’s memory profiling step) will see only their budget and size accordingly.
+
+If `nvidia-smi` on the host shows more memory than the limit assigned to a running pod, that is also expected: the host view shows physical memory, not virtual limits.
+
+## Why is my nvidia.com/gpumem limit not enforced - the container uses more memory than requested? {#why-is-my-nvidiagpumem-limit-not-enforced}
+
+**TL;DR**
+
+The most common causes are: `CUDA_DISABLE_CONTROL=true` is set in the container, the workload runs inside a nested container environment (Docker-in-Docker), the application bypasses the CUDA library and calls the GPU driver directly, or `nvidia-container-runtime` is not the node’s default container runtime.
+
+---
+
+### Cause 1: CUDA_DISABLE_CONTROL is set
+
+Setting `CUDA_DISABLE_CONTROL=true` in the container environment disables the HAMi-core enforcement layer entirely. The container can then access the full physical GPU without restriction.
+
+This variable is intended for debugging only. Remove it from production workloads that need memory limits.
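+
+As a sketch of what to look for, the container spec fragment below shows the kind of `env` entry that disables enforcement. The container name is a hypothetical placeholder; only the variable name and value come from this FAQ.
+
+```yaml
+# Fragment of a Pod spec (sketch): delete the env entry to restore enforcement
+spec:
+  containers:
+    - name: app                       # hypothetical container name
+      env:
+        - name: CUDA_DISABLE_CONTROL
+          value: "true"               # disables HAMi-core limits for this container
+```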
+
+### Cause 2: Docker-in-Docker (DinD)
+
+When a container runs another container runtime inside it (DinD), that inner runtime creates a separate root filesystem for each inner container. The outer container’s `/etc/ld.so.preload` hostPath mount does not carry over to those filesystems, so CUDA calls from an inner container go directly to the driver without passing through `libvgpu.so`.
+
+HAMi enforcement does not apply inside DinD. This is a known limitation with no current workaround.
+
+### Cause 3: Direct driver API usage
+
+Some workloads reach the NVIDIA Management Library (NVML) or the CUDA Driver API through entry points that `libvgpu.so` does not hook, and so bypass it. Examples include custom CUDA kernels that use driver-level allocation or monitoring tools that query NVML directly.
+
+### Cause 4: nvidia-container-runtime not set as default
+
+If the container runtime on the node is not configured with `nvidia-container-runtime` as the default, the device plugin cannot inject `libvgpu.so` into the container environment. Verify the runtime configuration:
+
+```bash
+containerd config dump | grep default_runtime_name
+```
+
+The output must show `nvidia`. If it does not, follow the [Prerequisites](./installation/prerequisites) guide to reconfigure.
+
+## Does HAMi replace kube-scheduler or run alongside it?
+
+**TL;DR**
+
+HAMi runs alongside kube-scheduler as a scheduler extender. It does not replace kube-scheduler. All standard Kubernetes scheduling behavior is preserved.
+
+---
+
+HAMi deploys a `hami-scheduler` component that registers as an [extender](https://github.com/kubernetes/design-proposals-archive/blob/main/scheduling/scheduler_extender.md) to the standard kube-scheduler. The extender adds two callbacks, filter and score:
+
+- **Filter**: removes nodes that do not have enough vGPU resources to satisfy the pod’s request
+- **Score**: ranks the remaining nodes using the configured policy (Binpack or Spread)
+
+kube-scheduler still runs all built-in filters and priorities. HAMi’s extender runs after them. Pods that do not request any HAMi resource (`nvidia.com/gpu`, `nvidia.com/gpumem`, etc.) are never touched by the extender and follow the standard scheduling path.
+
+The MutatingWebhook sets `schedulerName: hami-scheduler` on pods that request HAMi resources. Pods without HAMi resource requests keep the default `schedulerName` and are not affected.
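+
+As an illustration of the webhook’s effect, a Pod that requests a HAMi resource would be expected to look roughly like the fragment below after admission. The container name is a hypothetical placeholder and unrelated fields are omitted.
+
+```yaml
+# Fragment of a Pod after admission (sketch; only the relevant fields are shown)
+spec:
+  schedulerName: hami-scheduler   # set by the webhook because a HAMi resource is requested
+  containers:
+    - name: app                   # hypothetical container name
+      resources:
+        limits:
+          nvidia.com/gpu: 1
+```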
+
+:::note
+HAMi supports running multiple `hami-scheduler` replicas with leader election. See the [configuration guide](./userguide/configure) for Helm values that control scheduler deployment settings.
+:::
+
+## Does HAMi work with vLLM, and what are the known limitations for multi-GPU tensor parallelism?
+
+**TL;DR**
+
+HAMi works with vLLM for single-GPU and multi-GPU workloads. Multi-GPU tensor parallelism (`tensor_parallel_size > 1`) with vLLM versions later than 0.18 requires HAMi v2.9.0 or later. Earlier HAMi versions contained partial fixes, but tensor parallelism initialization errors persisted with newer vLLM releases.
+
+---
+
+### Single-GPU vLLM
+
+Single-GPU vLLM with `nvidia.com/gpumem` works without any special configuration. The memory profiling step inside vLLM reads the memory limit from `libvgpu.so` and allocates accordingly.
+
+### Multi-GPU tensor parallelism
+
+vLLM uses NCCL for cross-GPU communication in tensor parallel mode. Earlier HAMi versions had initialization errors when multiple processes inside a container shared CUDA device memory state files. These issues were progressively addressed across v2.7.x and v2.8.x, and full support for tensor parallelism on vLLM versions later than 0.18 landed in v2.9.0.
+
+If you encounter NCCL initialization failures or `Illegal device id` segfaults with tensor parallelism, upgrade to HAMi v2.9.0 or later.
+
+### Running vLLM in a Volcano environment
+
+vLLM across multiple pods in a Volcano job environment follows the same rules. Set `tensor_parallel_size` to the number of GPUs per pod, not the total across all pods. Inter-pod communication uses standard NCCL over the pod network (RDMA or TCP), not the HAMi vGPU layer.
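+
+The per-pod sizing rule can be sketched as follows. This is only the pod portion, not a full Volcano Job, and the pod name, image tag, model placeholder, and memory value are assumptions for illustration.
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: vllm-tp2                       # hypothetical name
+spec:
+  containers:
+    - name: vllm
+      image: vllm/vllm-openai:latest   # assumed image; pin a specific tag in practice
+      args:
+        - --model
+        - "<model-name>"               # placeholder
+        - --tensor-parallel-size
+        - "2"                          # matches the GPUs in this pod, not the job total
+      resources:
+        limits:
+          nvidia.com/gpu: 2            # two vGPUs in this pod
+          nvidia.com/gpumem: 20000     # MiB per vGPU (illustrative)
+```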
+
+:::note
+vLLM’s `--enforce-eager` flag disables CUDA graph capture. Some HAMi versions have issues with CUDA graph capture due to shared memory layout differences. If you hit errors during graph capture, try `--enforce-eager` as a temporary workaround and check the release notes for your HAMi version.
+:::
+
+For more context, see issues [#1764](https://github.com/Project-HAMi/HAMi/issues/1764) and [#1853](https://github.com/Project-HAMi/HAMi/issues/1853).
+
+## Is HAMi compatible with NVIDIA GPU Operator and DCGM metrics?
+
+**TL;DR**
+
+HAMi’s device plugin conflicts with the device plugin deployed by GPU Operator. Use HAMi’s device plugin instead of GPU Operator’s if GPU sharing is needed. DCGM-based metrics work independently and are not affected.
+
+---
+
+### Device plugin conflict
+
+GPU Operator installs its own `nvidia-device-plugin` DaemonSet. HAMi installs `hami-device-plugin`. Both report `nvidia.com/gpu` resources to kubelet. Running both on the same node causes resource reporting conflicts and unpredictable scheduling behavior.
+
+Resolution: disable the NVIDIA device plugin component in GPU Operator by setting `devicePlugin.enabled=false` in the GPU Operator Helm values, then deploy HAMi’s device plugin normally.
+
+```yaml
+# GPU Operator values.yaml
+devicePlugin:
+  enabled: false
+```
+
+### DCGM metrics
+
+DCGM Exporter scrapes physical GPU metrics from the NVIDIA driver independently of the device plugin. It is not affected by HAMi’s `libvgpu.so` and continues to report physical-level counters (temperature, power, SM utilization, physical memory bandwidth) normally.
+
+HAMi’s own metrics (per-container virtual memory usage, virtual core utilization) are exposed separately. See [Prometheus and Grafana monitoring](#how-do-i-set-up-prometheus-and-grafana-monitoring-for-hami-vgpu-metrics) below.
+
+## How do I set up Prometheus and Grafana monitoring for HAMi vGPU metrics?
+
+**TL;DR**
+
+HAMi exposes per-container vGPU metrics through `hami-device-plugin-monitor`. Scrape it with Prometheus and use the bundled Grafana dashboard JSON at `static/grafana/gpu-dashboard.json`.
+
+---
+
+### Metrics endpoint
+
+The `hami-device-plugin` pod on each node exposes a metrics endpoint. The port is configurable via the `devicePlugin.monitorPort` Helm value (default: `31992`).
+
+Key metrics exposed:
+
+| Metric name | Description |
+|---|---|
+| `Device_memory_desc_of_container` | Virtual GPU memory allocated to a container |
+| `Device_utilization_desc_of_container` | GPU compute utilization reported per container |
+| `Device_memory_limit_of_container` | Memory limit set for the container |
+
+### Prometheus scrape config
+
+Add a scrape job to your Prometheus configuration:
+
+```yaml
+scrape_configs:
+  - job_name: hami-device-plugin
+    static_configs:
+      - targets:
+          - <node-ip>:31992
+```
+
+For Prometheus Operator, create a `ServiceMonitor` targeting the `hami-device-plugin` service on port `31992`.
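+
+A minimal `ServiceMonitor` sketch is shown below. The namespaces, label selector, and port name are assumptions and must be matched to the Service that your HAMi installation creates. The `/metrics` path is the ServiceMonitor default and is also an assumption here.
+
+```yaml
+apiVersion: monitoring.coreos.com/v1
+kind: ServiceMonitor
+metadata:
+  name: hami-device-plugin          # hypothetical name
+  namespace: monitoring             # assumed Prometheus namespace
+spec:
+  namespaceSelector:
+    matchNames:
+      - kube-system                 # assumed HAMi install namespace
+  selector:
+    matchLabels:
+      app.kubernetes.io/component: hami-device-plugin   # assumed label; check your Service
+  endpoints:
+    - port: monitorport             # assumed Service port name that maps to 31992
+      path: /metrics
+      interval: 30s
+```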
+
+### Grafana dashboard
+
+A pre-built Grafana dashboard JSON is included in the repository at [`static/grafana/gpu-dashboard.json`](https://github.com/Project-HAMi/website/blob/master/static/grafana/gpu-dashboard.json). Import it into Grafana via **Dashboards > Import > Upload JSON file**.
+
+The dashboard shows per-node and per-container virtual GPU memory and compute usage alongside physical GPU counters. If DCGM Exporter is also deployed, the physical counters are populated automatically; otherwise, only the HAMi virtual metrics are available.
+
+For a step-by-step walkthrough, see [Grafana Dashboard](./userguide/monitoring/grafana-dashboard).