Environment
- Hardware: Apple M4 Max (Mac16,9), 64GB unified memory
- OS: macOS 26.4.1 (Sequoia)
- Python: 3.11.15 (Homebrew)
- mlx-lm: latest (installed via pip)
- mlx: latest (bundled with mlx-lm)
- Model: mlx-community/Qwen3.5-27B-4bit
Description
mlx_lm.server crashes intermittently during inference with a Metal GPU error. The crash occurs during quantized matrix multiplication (QuantizedMatmul::eval_gpu) when the Metal allocator attempts to allocate GPU memory.
The server runs successfully for several requests before crashing, suggesting a gradual memory pressure issue rather than an immediate failure.
Crash Signature
Triggered by Thread: com.Metal.CompletionQueueDispatch
Exception Type: EXC_CRASH (SIGABRT)
Thread (crashed):
libmlx.dylib mlx::core::metal::MetalAllocator::malloc(unsigned long)
libmlx.dylib mlx::core::QuantizedMatmul::eval_gpu(...)
libmlx.dylib mlx::core::gpu::eval(mlx::core::array&)
libmlx.dylib mlx::core::eval_impl(...)
Also observed:
libmlx.dylib mlx::core::gpu::check_error(MTL::CommandBuffer*)
Steps to Reproduce
- Start server:
mlx_lm.server --model mlx-community/Qwen3.5-27B-4bit --port 8080
- Send multiple chat completion requests via OpenAI-compatible API
- Server crashes after several requests (timing varies, sometimes within minutes)
Additional Notes
- Tried
MLX_GPU_MEMORY_LIMIT=48 — does not prevent the crash
- The crash happens during
QuantizedMatmul::eval_gpu, specifically in Metal's resource residency set update (IOGPUResourceGroupUpdateResources)
- Ollama running the same model on the same hardware does not crash (but is slower)
- A watchdog script with auto-restart serves as a workaround
Expected Behavior
Server should handle sustained inference workloads without crashing.
Actual Behavior
Server crashes with SIGABRT triggered by Metal GPU command buffer error after variable number of requests.
Environment
Description
mlx_lm.servercrashes intermittently during inference with a Metal GPU error. The crash occurs during quantized matrix multiplication (QuantizedMatmul::eval_gpu) when the Metal allocator attempts to allocate GPU memory.The server runs successfully for several requests before crashing, suggesting a gradual memory pressure issue rather than an immediate failure.
Crash Signature
Triggered by Thread: com.Metal.CompletionQueueDispatch
Exception Type: EXC_CRASH (SIGABRT)
Thread (crashed):
libmlx.dylib mlx::core::metal::MetalAllocator::malloc(unsigned long)
libmlx.dylib mlx::core::QuantizedMatmul::eval_gpu(...)
libmlx.dylib mlx::core::gpu::eval(mlx::core::array&)
libmlx.dylib mlx::core::eval_impl(...)
Also observed:
libmlx.dylib mlx::core::gpu::check_error(MTL::CommandBuffer*)
Steps to Reproduce
mlx_lm.server --model mlx-community/Qwen3.5-27B-4bit --port 8080Additional Notes
MLX_GPU_MEMORY_LIMIT=48— does not prevent the crashQuantizedMatmul::eval_gpu, specifically in Metal's resource residency set update (IOGPUResourceGroupUpdateResources)Expected Behavior
Server should handle sustained inference workloads without crashing.
Actual Behavior
Server crashes with SIGABRT triggered by Metal GPU command buffer error after variable number of requests.