[WIP] feat: add shared io_uring context pool support by CLiqing · Pull Request #79 · zilliztech/milvus-common

CLiqing · 2026-04-15T03:05:53Z

Introduce UringContextPool in milvus-common and gate it behind WITH_IO_URING with liburing detection so downstream knowhere/cardinal can share io_uring contexts safely.

liliu-z · 2026-04-24T03:07:42Z

+    ASSERT_TRUE(IOContextPool::InitGlobal(cfg));
+
+    auto pool = IOContextPool::GetGlobal();
+    ASSERT_NE(pool, nullptr);


Tests share the global singleton without any reset mechanism, making results depend on GTest execution order. A ResetGlobalForTest() helper or fork()-based isolation (like the fallback test) is needed to prevent flaky failures.
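A minimal sketch of the reset-helper idea, assuming a simplified singleton. DemoPool, PoolConfig, and the member names are hypothetical stand-ins, not the types in this PR:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>

// Hypothetical stand-ins for the real config and pool types.
struct PoolConfig {
    std::size_t num_ctx = 4;
};

class DemoPool {
 public:
    explicit DemoPool(PoolConfig cfg) : cfg_(cfg) {}

    // Returns false if a global pool already exists.
    static bool InitGlobal(const PoolConfig& cfg) {
        std::scoped_lock lk(mu_);
        if (global_ != nullptr) {
            return false;
        }
        global_ = std::make_shared<DemoPool>(cfg);
        return true;
    }

    static std::shared_ptr<DemoPool> GetGlobal() {
        std::scoped_lock lk(mu_);
        return global_;
    }

    // Test-only: drop the singleton so the next test starts from a clean slate.
    static void ResetGlobalForTest() {
        std::scoped_lock lk(mu_);
        global_.reset();
    }

    std::size_t num_ctx() const { return cfg_.num_ctx; }

 private:
    PoolConfig cfg_;
    static inline std::mutex mu_;
    static inline std::shared_ptr<DemoPool> global_;
};
```

Calling the reset helper from a fixture's SetUp/TearDown would make each test independent of execution order. Handles still held by an earlier test keep the old pool alive through their shared_ptr.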

liliu-z · 2026-04-24T09:08:53Z

    static std::shared_ptr<AioContextPool>
    GetGlobalAioPool();

+    static bool


The wait predicate checks only ctx_q_.size() and never stop_. Threads blocked here will hang on destruction when the queue is empty: notify_all() wakes them, but the predicate is still false, so they go straight back to waiting.
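One possible shape for a stop-aware wait, as a sketch only. StoppableQueue and its int payload are hypothetical; the real pool stores AIO contexts:

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>

class StoppableQueue {
 public:
    void Push(int ctx) {
        {
            std::lock_guard<std::mutex> lk(mu_);
            q_.push(ctx);
        }
        cv_.notify_one();
    }

    // Blocks until an item is available or Stop() has been called.
    // Returns std::nullopt once the queue has been stopped.
    std::optional<int> Pop() {
        std::unique_lock<std::mutex> lk(mu_);
        cv_.wait(lk, [this] { return stop_ || !q_.empty(); });
        if (stop_) {
            return std::nullopt;
        }
        int ctx = q_.front();
        q_.pop();
        return ctx;
    }

    void Stop() {
        {
            std::lock_guard<std::mutex> lk(mu_);
            stop_ = true;  // written under the same mutex the waiters use
        }
        cv_.notify_all();
    }

 private:
    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<int> q_;
    bool stop_ = false;
};
```

Because the predicate includes stop_, a thread woken by notify_all() leaves the wait even when the queue is empty, and callers must handle the empty result.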

liliu-z · 2026-04-24T09:08:54Z

+        return true;
+    }
+
+    if (global_uring_pool_size != num_ctx || global_uring_max_entries != max_entries) {


Same problem as AioContextPool: config mismatch is logged but the function still returns true, silently allowing misconfiguration.
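A sketch of the stricter behavior, assuming that a repeat call with identical settings should still succeed. The demo namespace and its globals are hypothetical, not the variables in this PR:

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>

namespace demo {
std::mutex g_mu;
std::size_t g_pool_size = 0;    // 0 means "not initialized yet"
std::size_t g_max_entries = 0;

// Returns true on first initialization or on a repeat call with identical
// settings; returns false when the request conflicts with the live config.
bool InitGlobalPool(std::size_t num_ctx, std::size_t max_entries) {
    if (num_ctx == 0 || max_entries == 0) {
        return false;
    }
    std::scoped_lock lk(g_mu);
    if (g_pool_size == 0) {
        g_pool_size = num_ctx;
        g_max_entries = max_entries;
        return true;
    }
    return g_pool_size == num_ctx && g_max_entries == max_entries;
}
}  // namespace demo
```

With this contract a caller that asks for a different size learns about the conflict from the return value instead of from a log line.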

liliu-z · 2026-04-24T09:08:55Z

+    static void
+    ResetGlobalForTest();
+
    ~AioContextPool() {


stop_ is a non-atomic bool written without holding ctx_mtx_, which is a data race. Also, io_destroy is called before notify_all, so woken threads may access already-destroyed contexts.

liliu-z · 2026-04-24T09:08:56Z

+
+#include "knowhere/io_context_pool.h"
+
+#if defined(__cpp_lib_span)


IOReaderSpan resolves to different types depending on __cpp_lib_span; using it in public function signatures causes ABI incompatibility across C++17 and C++20 consumers.

liliu-z · 2026-04-24T10:40:43Z

+    size_t num_ctx = default_pool_size;
+#else
+    size_t num_ctx = 1;
+#endif


num_ctx falls back to 1 when libaio is absent, so io_uring-only builds get a single ring and a severe concurrency bottleneck. Default should be reasonable regardless of libaio availability.

liliu-z · 2026-04-27T03:08:41Z

+}
+
+bool
+UringContextPool::InitGlobalUringPool(size_t num_ctx, size_t max_entries) {


Same structural bug as the AIO variant: InitGlobalUringPool(num_ctx, max_entries) internally delegates to IOContextPool::GetGlobal() which uses the default IOContextPoolConfig, so the caller-supplied num_ctx and max_entries are silently ignored. This is a confirmed bug in the new uring API introduced by this PR.

liliu-z · 2026-04-27T08:14:12Z

+}
+
+bool
+AioContextPool::InitGlobalAioPool(size_t num_ctx, size_t max_events) {


InitGlobalAioPool() calls GetGlobal(), which uses a default IOContextPoolConfig (num_ctx=512) instead of forwarding the caller's global_aio_pool_size parameter. Any non-default num_ctx is silently ignored, which is a regression for every deployment that customizes the pool size. Compare with InitGlobalUringPool (uring_context_pool.cc L92-94), which correctly passes its parameters. Test L167 will fail in !WITH_IO_URING environments.

Fix: Construct IOContextPoolConfig from the caller's parameters and pass it directly to InitGlobal(cfg).
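The suggested fix could look roughly like the following. This is a sketch with hypothetical names (IoPoolConfigSketch, InitGlobalSketch, ActiveConfig), and the default values are placeholders rather than the library's actual defaults:

```cpp
#include <cassert>
#include <cstddef>
#include <optional>

struct IoPoolConfigSketch {
    std::size_t num_ctx = 512;     // placeholder default
    std::size_t max_events = 128;  // placeholder default
};

// Stand-in for the unified initializer; it records the config it was given.
inline std::optional<IoPoolConfigSketch>& ActiveConfig() {
    static std::optional<IoPoolConfigSketch> cfg;
    return cfg;
}

inline bool InitGlobalSketch(const IoPoolConfigSketch& cfg) {
    if (cfg.num_ctx == 0 || cfg.max_events == 0) {
        return false;
    }
    if (ActiveConfig().has_value()) {
        return false;  // already initialized
    }
    ActiveConfig() = cfg;
    return true;
}

// Legacy-style entry point: build the config from the caller's arguments
// instead of falling back to a default-constructed one.
inline bool InitGlobalAioPoolSketch(std::size_t num_ctx, std::size_t max_events) {
    IoPoolConfigSketch cfg;
    cfg.num_ctx = num_ctx;
    cfg.max_events = max_events;
    return InitGlobalSketch(cfg);
}
```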

liliu-z · 2026-04-29T11:30:19Z


 bool
-AioContextPool::InitGlobalAioPool(size_t num_ctx, size_t max_events) {
+AioContextPool::InitGlobalAioPoolWithValidation(size_t num_ctx, size_t max_events) {


The validation at line 20 only rejects max_events > default_max_events but allows max_events == 0 through. The function returns true (success) and sets global_aio_max_events = 0. Subsequent GetGlobalAioPoolDirect() constructs a pool where every io_setup(0, &ctx) call fails, producing an empty pool. Any pop() call will block indefinitely. Note: the unified entry point IOContextPool::InitGlobal at io_context_pool.cc:60 correctly rejects 0, so this only affects direct callers of the newly-exposed helper API declared in aio_context_pool.h:76.
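A sketch of a validation predicate that rejects zero as well as oversized values. IsValidAioConfig and kDefaultMaxEventsSketch are hypothetical, and the limit of 128 is a placeholder:

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kDefaultMaxEventsSketch = 128;  // placeholder limit

// Accepts only non-zero values that do not exceed the configured maximum.
constexpr bool IsValidAioConfig(std::size_t num_ctx, std::size_t max_events) {
    return num_ctx > 0 && max_events > 0 && max_events <= kDefaultMaxEventsSketch;
}

static_assert(!IsValidAioConfig(4, 0), "zero events must be rejected");
static_assert(IsValidAioConfig(4, 128), "the maximum itself is allowed");
```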

liliu-z · 2026-05-08T07:20:10Z

+UringContextPool::GetGlobalUringPoolDirect() {
+    std::scoped_lock lk(global_uring_pool_mut);
+    if (global_uring_pool_size == 0) {
+        global_uring_pool_size = 1;


The legacy GetGlobalUringPoolDirect() constructs UringContextPool with num_ctx=1 by default, while AIO pool defaults to 512 contexts. Under direct legacy usage, this severely limits uring concurrency.

Fix: Align the default num_ctx with AIO pool defaults or use a reasonable default like default_pool_size.

Use the configured IO pool defaults for direct uring initialization and count interrupted syscalls toward the retry limit so sustained signals cannot loop forever. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Bound interrupted completion waits and reset checked-out rings when submit failures leave prepared io_uring entries behind. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Invalidate dirty rings on failed completion cleanup and use fmt-style log placeholders so review-time diagnostics stay reliable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Keep already submitted completion-reader requests observable when submit cleanup resets a dirty ring, and ensure CI builds the io_uring path after dependency setup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Avoid returning dirty io_uring handles after reader failures, make global IO pool initialization explicit, and publish the C++20 requirement exposed by public headers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

liliu-z · 2026-05-26T04:46:11Z

+            }
+            continue;
+        }
+        for (size_t i = result.completed; i < result.completed + static_cast<size_t>(completed); ++i) {


The retry counter in WaitAioBatch accumulates across iterations that made forward progress. With kNumRetries = 10 and max_events = 128, an AIO batch that the kernel returns piecewise will be cut off after 11 partial completions even when nothing is wrong. The io_uring sibling WaitUringBatch does not have this bug — it just increments result.completed per CQE with no retry counter on the progress path.
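One way to keep the retry budget from penalizing progress is to reset it whenever events arrive. The sketch below uses a callback in place of io_getevents; WaitBatchSketch and kNumRetriesSketch are hypothetical names:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

constexpr int kNumRetriesSketch = 10;

// `reap` stands in for io_getevents: it is asked for up to `want` events and
// returns how many completed (0 means no progress on this call).
inline std::size_t WaitBatchSketch(std::size_t expected,
                                   const std::function<std::size_t(std::size_t)>& reap) {
    std::size_t completed = 0;
    int retry = 0;
    while (completed < expected) {
        const std::size_t got = reap(expected - completed);
        if (got == 0) {
            if (++retry > kNumRetriesSketch) {
                break;  // stalled for too many consecutive calls: give up
            }
            continue;
        }
        retry = 0;  // forward progress resets the budget
        completed += got;
    }
    return completed;
}
```

With this shape a batch of 128 events delivered one at a time still completes, while a source that never makes progress is abandoned after the retry budget is spent.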

liliu-z · 2026-05-26T04:46:13Z

+            break;
+        }
+        if (submitted == 0) {
+            if (++retry > kNumRetries) {


Same progress-penalty bug as WaitAioBatch: retry accumulates even when submitted > 0, causing premature abort of legitimate partial submissions.

liliu-z · 2026-05-26T04:46:17Z

+ private:
+    std::shared_ptr<IOContextPool> pool_;
+    IOContextHandle handle_;
+    bool active_ = true;


The WaitAioBatch function takes const std::vector<struct iocb>& cbs but uses const_cast<struct iocb*>(&cbs[i]) to insert pointers into an unordered_set<struct iocb*>. This discards const-correctness, and any write through those pointers is undefined behavior if the underlying objects are actually const.

liliu-z · 2026-05-26T04:46:19Z

+            }
+            throw std::runtime_error("io_uring_wait_cqe failed");
+        }
+        if (cqe == nullptr) {


If io_uring_wait_cqe returns success (ret == 0) but cqe is nullptr, the loop executes continue with no retry cap, no progress tracking, and no logging. If the kernel ever returns this combination, the worker thread spins in an infinite loop. Compare ProcessAvailableCompletions at line 336, which correctly breaks on a null cqe.

liliu-z · 2026-05-26T07:32:33Z

src/common/io_reader.cc:0

~IOCompletionReader calls io_uring_queue_exit(), which triggers kernel cancellation asynchronously via a workqueue (io_ring_exit_work). close(ring_fd) returns to userspace before in-flight DMA completes. If the caller then frees the target buffers, the kernel's pending writes corrupt the reallocated memory: silent data corruption, not a crash. The destructor must synchronously drain all outstanding CQEs (observing -ECANCELED) before returning, or the API contract must enforce that callers cannot free buffers until all completions are observed.

liliu-z · 2026-05-26T07:32:34Z

src/common/io_reader.cc:0

When ReadAioAsync or ReadUringAsync encounters a mid-batch failure (hardware error, short read, bad CQE), it calls state.ResetRing() or guard.ResetAio() (triggering io_uring_queue_exit / io_destroy), then returns false via the future. The caller sees the failure and frees the buffer, but kernel DMA cancellation from io_destroy/io_uring_queue_exit is fully asynchronous, so the block layer may still be writing into the now-freed memory. This occurs on any normal IO error return path, not just destruction.

liliu-z · 2026-05-26T07:32:40Z

 endif()

+set(MILVUS_COMMON_WITH_IO_URING OFF)
+find_path(URING_INCLUDE_DIR liburing.h)


The C++20 requirement is declared PUBLIC, meaning every target that depends on this library is forced to compile with C++20. This is driven by std::span usage in headers. Downstream projects that cannot adopt C++20 will fail to build. The requirement should be PRIVATE, and std::span-using headers should be isolated or guarded.

liliu-z · 2026-05-26T07:32:42Z

src/common/io_reader.cc:0

When the AIO backend is used with O_DIRECT file descriptors, the kernel requires buffer addresses and sizes to be aligned (typically 512-byte or page-aligned). IOReader::Read and IOReader::ReadAsync do not validate alignment at the API boundary. Misaligned requests silently fail with EINVAL deep in the kernel, giving callers no actionable diagnostic.
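A boundary check along these lines could surface the problem before the request reaches the kernel. This is a sketch: IsDirectIoAligned is a hypothetical helper, the 512-byte default is a placeholder (the real requirement depends on the device and filesystem), and it also checks the file offset, which O_DIRECT constrains as well:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True if `buf`, `size`, and `offset` are all multiples of `alignment`.
// `alignment` must be a non-zero power of two; anything else is rejected.
inline bool IsDirectIoAligned(const void* buf, std::size_t size, std::uint64_t offset,
                              std::size_t alignment = 512) {
    if (alignment == 0 || (alignment & (alignment - 1)) != 0) {
        return false;
    }
    const std::uintptr_t mask = alignment - 1;
    const auto addr = reinterpret_cast<std::uintptr_t>(buf);
    return (addr & mask) == 0 && (size & mask) == 0 && (offset & mask) == 0;
}
```

A reader could call such a check at the top of Read/ReadAsync and return a descriptive error naming the misaligned argument.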

liliu-z · 2026-05-26T07:32:43Z

src/common/aio_context_pool.h:151

The legacy AioContextPool::Shutdown() is public and called by tests; UringContextPool has no equivalent method. This creates an API asymmetry that forces callers to branch on backend type for lifecycle management.

liliu-z · 2026-05-26T07:32:45Z

src/common/io_context_pool.cc:0

generation_ is set once during InitGlobal() and never mutated afterward. The condition owner_generation_ != generation_ in Release is unreachable in current code: a handle's owner_ shared_ptr keeps the pool alive, so the generation always matches. The defensive ClearNoRelease branch would leak the context if it ever fired. This is a latent trap if generation_ becomes mutable in the future.

liliu-z · 2026-05-26T09:22:04Z

@@ -0,0 +1,416 @@
+#include "knowhere/io_context_pool.h"


Enabling WITH_IO_URING causes GetGlobalAioPool() to return nullptr, breaking downstream consumers (knowhere/cardinal) that rely on this API. This is an uncoordinated contract change that will cause crashes in callers that don't null-check the return value.

liliu-z · 2026-05-26T09:22:06Z

+}
+
+IOCompletionReader::RequestId
+IOCompletionReader::Submit(IOCompletionReaderSpan<std::byte* const> buffers, size_t size,


IOReader::ReadAsync validates O_DIRECT alignment via ValidateDirectIoAlignment (io_reader.cc:607), but IOCompletionReader::Submit (lines 102-187) has no such check. With an O_DIRECT fd, unaligned buffers will produce a confusing kernel-side EINVAL instead of a clear library-level error. This creates inconsistent error semantics between the two reader classes in the same library.

liliu-z · 2026-05-26T09:22:11Z

+        return false;
+    }
+
+    auto pool = UringContextPool::GetGlobalUringPoolDirect();


IOContextPool has Push(IOContextHandle&) lvalue overloads that silently move the handle's ownership. A caller passing an lvalue won't realize ownership transferred. This is a classic implicit-move footgun that can lead to double-free or use-after-move bugs.
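One way to make the ownership transfer visible at the call site is to accept only rvalues. The sketch below uses hypothetical HandleSketch and PoolSketch types, and the CanPush trait exists only to demonstrate the compile-time behavior:

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical move-only handle; -1 marks a moved-from (empty) handle.
class HandleSketch {
 public:
    explicit HandleSketch(int id) : id_(id) {}
    HandleSketch(HandleSketch&& o) noexcept : id_(std::exchange(o.id_, -1)) {}
    HandleSketch& operator=(HandleSketch&& o) noexcept {
        id_ = std::exchange(o.id_, -1);
        return *this;
    }
    HandleSketch(const HandleSketch&) = delete;
    HandleSketch& operator=(const HandleSketch&) = delete;
    bool valid() const { return id_ >= 0; }

 private:
    int id_;
};

class PoolSketch {
 public:
    void Push(HandleSketch&& h) { free_.push_back(std::move(h)); }
    void Push(HandleSketch&) = delete;  // lvalues must be passed via std::move
    std::size_t size() const { return free_.size(); }

 private:
    std::vector<HandleSketch> free_;
};

// Detects whether PoolSketch::Push accepts an argument of type Arg.
template <class Arg, class = void>
struct CanPush : std::false_type {};
template <class Arg>
struct CanPush<Arg, std::void_t<decltype(std::declval<PoolSketch&>().Push(std::declval<Arg>()))>>
    : std::true_type {};

static_assert(CanPush<HandleSketch&&>::value, "rvalues are accepted");
static_assert(!CanPush<HandleSketch&>::value, "lvalues are rejected at compile time");
```

A caller then has to write pool.Push(std::move(handle)), which documents that the handle is empty afterwards.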

liliu-z · 2026-05-26T09:22:26Z

+}
+#endif
+#endif
+


The test uses a 100ms sleep as a synchronization mechanism. Under ASan or loaded CI environments, this timing window is unreliable and will produce flaky failures. The test needs proper synchronization (e.g., condition variable or future) instead of a sleep.
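A sleep-free handshake could be built from a pair of promises, as in this sketch. RunWithHandshake is a hypothetical helper, not code from the PR's tests:

```cpp
#include <cassert>
#include <future>
#include <thread>

// Runs a worker and proceeds only after the worker has signaled that it
// reached its checkpoint, so no timing window is involved.
inline int RunWithHandshake() {
    std::promise<void> reached;  // worker -> test: "I am at the checkpoint"
    std::promise<int> release;   // test -> worker: "continue with this value"
    std::future<void> reached_f = reached.get_future();
    std::future<int> release_f = release.get_future();

    int observed = 0;
    std::thread worker([&] {
        reached.set_value();
        observed = release_f.get();  // blocks until the test releases it
    });

    reached_f.wait();  // returns once the worker has reached the checkpoint
    release.set_value(42);
    worker.join();     // join orders the write to `observed` before the read
    return observed;
}
```

The signal shows only that the worker reached the checkpoint, not that it is already blocked in the following call, so a test that needs the latter would have to expose that state from the code under test.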

liliu-z · 2026-05-26T09:22:32Z

+
+    auto owner = owner_;
+    if (owner == nullptr) {
+        LOG_WARN("IOContextHandle drops context without owner for backend {}", static_cast<int>(backend));


AIO's ResetCheckedOut and uring's reset follow different patterns for returning the new context to the pool. This asymmetry makes the code harder to reason about and creates risk of resource leaks if a maintainer assumes symmetric behavior.

liliu-z · 2026-05-26T09:22:33Z

+    }
+
+    const size_t first_batch = std::min(max_batch, buffers.size());
+    std::vector<struct iocb> first_cbs;


After io_submit, the kernel holds pointers to iocb addresses inside first_cbs. The std::move(first_cbs) into AioReadState::first_cbs_ is safe because vector move is a pointer swap with the default allocator. However, this invariant is fragile — switching to SmallVector or a custom allocator would silently break it by relocating the buffer and corrupting kernel-visible addresses.
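If the address-stability invariant should not depend on the container's move semantics, the control blocks could live in storage whose address is explicitly preserved across moves. A sketch, with hypothetical PinnedBlocks and ControlBlockSketch types standing in for the iocb storage:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>

// Hypothetical stand-in for struct iocb.
struct ControlBlockSketch {
    int fd = -1;
    long long offset = 0;
};

// Heap array owned through unique_ptr: moving the owner transfers the pointer,
// so element addresses handed to the kernel stay valid.
class PinnedBlocks {
 public:
    explicit PinnedBlocks(std::size_t n)
        : data_(std::make_unique<ControlBlockSketch[]>(n)), size_(n) {}
    PinnedBlocks(PinnedBlocks&& o) noexcept
        : data_(std::move(o.data_)), size_(std::exchange(o.size_, 0)) {}
    PinnedBlocks& operator=(PinnedBlocks&& o) noexcept {
        data_ = std::move(o.data_);
        size_ = std::exchange(o.size_, 0);
        return *this;
    }

    ControlBlockSketch* data() { return data_.get(); }
    std::size_t size() const { return size_; }

 private:
    std::unique_ptr<ControlBlockSketch[]> data_;
    std::size_t size_;
};
```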

liliu-z · 2026-05-26T09:22:39Z

+        return *this;
+    }
+
+    DrainOutstandingNoThrow();


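IOCompletionReader::operator=(&&) is not noexcept even though the move constructor is. std::vector reallocation consults only the move constructor, so it is unaffected, but the asymmetry still makes std::is_nothrow_move_assignable false for the type, which makes std::swap potentially throwing and weakens exception guarantees in generic code.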
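The noexcept contract can be pinned down with static_asserts so a later change cannot drop it silently. ReaderSketch below is a hypothetical stand-in for the reader, with an empty drain step:

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

class ReaderSketch {
 public:
    ReaderSketch() = default;
    explicit ReaderSketch(int fd) : fd_(fd) {}
    ReaderSketch(ReaderSketch&& o) noexcept : fd_(std::exchange(o.fd_, -1)) {}
    ReaderSketch& operator=(ReaderSketch&& o) noexcept {
        if (this != &o) {
            DrainNoThrow();  // release current state before taking over o's
            fd_ = std::exchange(o.fd_, -1);
        }
        return *this;
    }
    int fd() const { return fd_; }

 private:
    void DrainNoThrow() noexcept {}  // placeholder for the real drain logic
    int fd_ = -1;
};

static_assert(std::is_nothrow_move_constructible_v<ReaderSketch>);
static_assert(std::is_nothrow_move_assignable_v<ReaderSketch>);
```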

liliu-z · 2026-05-26T09:22:40Z

+    return io_uring_submit(ring);
+}
+
+size_t


PrepareUringBatch doesn't apply an explicit std::min(buffers.size() - start, max_events_per_ctx) cap like ReadAioAsync does. While io_uring_get_sqe returning null provides a natural cap via the SQ ring size, the asymmetry with the AIO path creates a readability and maintenance concern.
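The explicit cap is small enough to factor into a helper shared by both paths. A sketch, with NextBatchSize as a hypothetical name:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Number of requests to prepare in the next batch, starting at index `start`.
inline std::size_t NextBatchSize(std::size_t total, std::size_t start,
                                 std::size_t max_per_ctx) {
    if (start >= total) {
        return 0;  // also avoids unsigned underflow in total - start
    }
    return std::min(total - start, max_per_ctx);
}
```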

CLiqing changed the title ~~feat: add shared io_uring context pool support~~ [WIP] feat: add shared io_uring context pool support Apr 16, 2026