Skip to content

[WIP] feat: add shared io_uring context pool support#79

Open
CLiqing wants to merge 18 commits into
zilliztech:masterfrom
CLiqing:feat/io-uring-context-pool
Open

[WIP] feat: add shared io_uring context pool support#79
CLiqing wants to merge 18 commits into
zilliztech:masterfrom
CLiqing:feat/io-uring-context-pool

Conversation

@CLiqing
Copy link
Copy Markdown

@CLiqing CLiqing commented Apr 15, 2026

Introduce UringContextPool in milvus-common and gate it behind WITH_IO_URING with liburing detection so downstream knowhere/cardinal can share io_uring contexts safely.

@CLiqing CLiqing changed the title feat: add shared io_uring context pool support [WIP] feat: add shared io_uring context pool support Apr 16, 2026
Comment thread src/knowhere/uring_context_pool.cc
Comment thread src/knowhere/io_reader.cc Outdated
Comment thread src/knowhere/io_reader.cc Outdated
Comment thread src/knowhere/io_reader.cc Outdated
Comment thread src/knowhere/io_reader.cc Outdated
Comment thread src/knowhere/aio_context_pool.cc
Comment thread src/knowhere/uring_context_pool.cc
Comment thread src/knowhere/io_context_pool.cc
Comment thread include/knowhere/io_context_pool.h Outdated
Comment thread src/knowhere/io_reader.cc
Comment thread src/knowhere/aio_context_pool.cc
Comment thread src/knowhere/uring_context_pool.cc
Comment thread src/knowhere/io_reader.cc Outdated
Comment thread src/knowhere/io_context_pool.cc
ASSERT_TRUE(IOContextPool::InitGlobal(cfg));

auto pool = IOContextPool::GetGlobal();
ASSERT_NE(pool, nullptr);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests share the global singleton without any reset mechanism, making results depend on GTest execution order. A ResetGlobalForTest() helper or fork()-based isolation (like the fallback test) is needed to prevent flaky failures.

static std::shared_ptr<AioContextPool>
GetGlobalAioPool();

static bool
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait predicate only checks ctx_q_.size() and never checks stop_. Threads blocked here will deadlock on destruction when the queue is empty because notify_all() cannot break the wait.

return true;
}

if (global_uring_pool_size != num_ctx || global_uring_max_entries != max_entries) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same problem as AioContextPool: config mismatch is logged but the function still returns true, silently allowing misconfiguration.

static void
ResetGlobalForTest();

~AioContextPool() {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stop_ is a non-atomic bool written without holding ctx_mtx_, which is a data race. Also, io_destroy is called before notify_all, so woken threads may access already-destroyed contexts.

Comment thread include/knowhere/io_reader.h Outdated

#include "knowhere/io_context_pool.h"

#if defined(__cpp_lib_span)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IOReaderSpan resolves to different types depending on __cpp_lib_span; using it in public function signatures causes ABI incompatibility across C++17 and C++20 consumers.

size_t num_ctx = default_pool_size;
#else
size_t num_ctx = 1;
#endif
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

num_ctx falls back to 1 when libaio is absent, so io_uring-only builds get a single ring and a severe concurrency bottleneck. Default should be reasonable regardless of libaio availability.

}

bool
UringContextPool::InitGlobalUringPool(size_t num_ctx, size_t max_entries) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same structural bug as the AIO variant: InitGlobalUringPool(num_ctx, max_entries) internally delegates to IOContextPool::GetGlobal() which uses the default IOContextPoolConfig, so the caller-supplied num_ctx and max_entries are silently ignored. This is a confirmed bug in the new uring API introduced by this PR.

}

bool
AioContextPool::InitGlobalAioPool(size_t num_ctx, size_t max_events) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

InitGlobalAioPool() calls GetGlobal() which uses a default IOContextPoolConfig (num_ctx=512) instead of forwarding the caller's global_aio_pool_size parameter. Any non-default num_ctx configuration is silently ignored — 100% regression for customized deployments. Compare with InitGlobalUringPool (uring_context_pool.cc L92-94) which correctly passes parameters. Test L167 will fail in !WITH_IO_URING environments.

Fix: Construct IOContextPoolConfig from the caller's parameters and pass it directly to InitGlobal(cfg).

@CLiqing CLiqing force-pushed the feat/io-uring-context-pool branch 4 times, most recently from 55d7b49 to df88d1e Compare April 29, 2026 10:29

bool
AioContextPool::InitGlobalAioPool(size_t num_ctx, size_t max_events) {
AioContextPool::InitGlobalAioPoolWithValidation(size_t num_ctx, size_t max_events) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The validation at line 20 only rejects max_events > default_max_events but allows max_events == 0 through. The function returns true (success) and sets global_aio_max_events = 0. Subsequent GetGlobalAioPoolDirect() constructs a pool where every io_setup(0, &ctx) call fails, producing an empty pool. Any pop() call will block indefinitely. Note: the unified entry point IOContextPool::InitGlobal at io_context_pool.cc:60 correctly rejects 0, so this only affects direct callers of the newly-exposed helper API declared in aio_context_pool.h:76.

@CLiqing CLiqing force-pushed the feat/io-uring-context-pool branch from df88d1e to 83170ae Compare April 30, 2026 02:08
Comment thread src/knowhere/uring_context_pool.cc Outdated
UringContextPool::GetGlobalUringPoolDirect() {
std::scoped_lock lk(global_uring_pool_mut);
if (global_uring_pool_size == 0) {
global_uring_pool_size = 1;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The legacy GetGlobalUringPoolDirect() constructs UringContextPool with num_ctx=1 by default, while AIO pool defaults to 512 contexts. Under direct legacy usage, this severely limits uring concurrency.

Fix: Align the default num_ctx with AIO pool defaults or use a reasonable default like default_pool_size.

CLiqing and others added 6 commits May 18, 2026 08:03
Use the configured IO pool defaults for direct uring initialization and count interrupted syscalls toward the retry limit so sustained signals cannot loop forever.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bound interrupted completion waits and reset checked-out rings when submit failures leave prepared io_uring entries behind.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Invalidate dirty rings on failed completion cleanup and use fmt-style log placeholders so review-time diagnostics stay reliable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Keep already submitted completion-reader requests observable when submit cleanup resets a dirty ring, and ensure CI builds the io_uring path after dependency setup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Avoid returning dirty io_uring handles after reader failures, make global IO pool initialization explicit, and publish the C++20 requirement exposed by public headers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@CLiqing CLiqing force-pushed the feat/io-uring-context-pool branch from ef3c9a2 to 0ad2d8e Compare May 18, 2026 08:08
Comment thread src/knowhere/io_reader.cc
}
continue;
}
for (size_t i = result.completed; i < result.completed + static_cast<size_t>(completed); ++i) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The retry counter in WaitAioBatch accumulates across iterations that made forward progress. With kNumRetries = 10 and max_events = 128, an AIO batch that the kernel returns piecewise will be cut off after 11 partial completions even when nothing is wrong. The io_uring sibling WaitUringBatch does not have this bug — it just increments result.completed per CQE with no retry counter on the progress path.

Comment thread src/knowhere/io_reader.cc
break;
}
if (submitted == 0) {
if (++retry > kNumRetries) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same progress-penalty bug as WaitAioBatch: retry accumulates even when submitted > 0, causing premature abort of legitimate partial submissions.

Comment thread src/knowhere/io_reader.cc
private:
std::shared_ptr<IOContextPool> pool_;
IOContextHandle handle_;
bool active_ = true;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The WaitAioBatch function takes const std::vector<struct iocb>& cbs but uses const_cast<struct iocb*>(&cbs[i]) to insert pointers into an unordered_set<struct iocb*>. This strips const-correctness and risks undefined behavior if the underlying data is truly const.

}
throw std::runtime_error("io_uring_wait_cqe failed");
}
if (cqe == nullptr) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After io_uring_wait_cqe returns success (ret == 0), if cqe is nullptr the code does continue with no retry cap, no progress tracking, and no logging. If the kernel ever returns this combination, the worker thread hangs in an infinite loop. Compare ProcessAvailableCompletions at line 336 which correctly breaks on null cqe.

@liliu-z
Copy link
Copy Markdown

liliu-z commented May 26, 2026

src/common/io_reader.cc:0 ~IOCompletionReader calls io_uring_queue_exit(), which triggers kernel cancellation asynchronously via a workqueue (io_ring_exit_work). close(ring_fd) returns to userspace before in-flight DMA completes. If the caller then frees the target buffers, the kernel's pending writes corrupt the reallocated memory — silent data corruption, not a crash. The destructor must synchronously drain all outstanding CQEs (observing -ECANCELED) before returning, or the API contract must enforce that callers cannot free buffers until all completions are observed.

@liliu-z
Copy link
Copy Markdown

liliu-z commented May 26, 2026

src/common/io_reader.cc:0 When ReadAioAsync or ReadUringAsync encounters a mid-batch failure (hardware error, short read, bad CQE), it calls state.ResetRing() or guard.ResetAio() (triggering io_uring_queue_exit / io_destroy), then returns false via the future. The caller sees failure and frees the buffer, but kernel DMA cancellation from io_destroy/io_uring_queue_exit is fully asynchronous — the block layer may still be writing into the now-freed memory. This occurs on any normal IO error return path, not just destruction.

Comment thread CMakeLists.txt
endif()

set(MILVUS_COMMON_WITH_IO_URING OFF)
find_path(URING_INCLUDE_DIR liburing.h)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The C++20 standard is linked as PUBLIC, meaning every target that depends on this library is forced to compile with C++20. This is driven by std::span usage in headers. Downstream projects that cannot adopt C++20 will fail to build. The requirement should be PRIVATE, and std::span-using headers should be isolated or guarded.

@liliu-z
Copy link
Copy Markdown

liliu-z commented May 26, 2026

src/common/io_reader.cc:0 When the AIO backend is used with O_DIRECT file descriptors, the kernel requires buffer addresses and sizes to be aligned (typically 512-byte or page-aligned). IOReader::Read and IOReader::ReadAsync do not validate alignment at the API boundary. Misaligned requests silently fail with EINVAL deep in the kernel, giving callers no actionable diagnostic.

@liliu-z
Copy link
Copy Markdown

liliu-z commented May 26, 2026

src/common/aio_context_pool.h:151 The legacy AioContextPool::Shutdown() is public and called by tests; UringContextPool has no equivalent method. This creates an API asymmetry that forces callers to branch on backend type for lifecycle management.

@liliu-z
Copy link
Copy Markdown

liliu-z commented May 26, 2026

src/common/io_context_pool.cc:0 generation_ is set once during InitGlobal() and never mutated afterward. The condition owner_generation_ != generation_ in Release is unreachable in current code — a handle's owner_ shared_ptr keeps the pool alive, so generation always matches. The defensive ClearNoRelease branch would leak the context if it ever fired. This is a latent trap if generation_ becomes mutable in the future.

@@ -0,0 +1,416 @@
#include "knowhere/io_context_pool.h"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enabling WITH_IO_URING causes GetGlobalAioPool() to return nullptr, breaking downstream consumers (knowhere/cardinal) that rely on this API. This is an uncoordinated contract change that will cause crashes in callers that don't null-check the return value.

}

IOCompletionReader::RequestId
IOCompletionReader::Submit(IOCompletionReaderSpan<std::byte* const> buffers, size_t size,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IOReader::ReadAsync validates O_DIRECT alignment via ValidateDirectIoAlignment (io_reader.cc:607), but IOCompletionReader::Submit (lines 102-187) has no such check. With an O_DIRECT fd, unaligned buffers will produce a confusing kernel-side EINVAL instead of a clear library-level error. This creates inconsistent error semantics between the two reader classes in the same library.

return false;
}

auto pool = UringContextPool::GetGlobalUringPoolDirect();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IOContextPool has Push(IOContextHandle&) lvalue overloads that silently move the handle's ownership. A caller passing an lvalue won't realize ownership transferred. This is a classic implicit-move footgun that can lead to double-free or use-after-move bugs.

}
#endif
#endif

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test uses a 100ms sleep as a synchronization mechanism. Under ASan or loaded CI environments, this timing window is unreliable and will produce flaky failures. The test needs proper synchronization (e.g., condition variable or future) instead of a sleep.


auto owner = owner_;
if (owner == nullptr) {
LOG_WARN("IOContextHandle drops context without owner for backend {}", static_cast<int>(backend));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AIO's ResetCheckedOut and uring's reset follow different patterns for returning the new context to the pool. This asymmetry makes the code harder to reason about and creates risk of resource leaks if a maintainer assumes symmetric behavior.

Comment thread src/knowhere/io_reader.cc Outdated
}

const size_t first_batch = std::min(max_batch, buffers.size());
std::vector<struct iocb> first_cbs;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After io_submit, the kernel holds pointers to iocb addresses inside first_cbs. The std::move(first_cbs) into AioReadState::first_cbs_ is safe because vector move is a pointer swap with the default allocator. However, this invariant is fragile — switching to SmallVector or a custom allocator would silently break it by relocating the buffer and corrupting kernel-visible addresses.

return *this;
}

DrainOutstandingNoThrow();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IOCompletionReader::operator=(&&) is not noexcept even though the move constructor is. This asymmetry prevents optimal container behavior (e.g., std::vector reallocation will copy instead of move).

Comment thread src/knowhere/io_reader.cc
return io_uring_submit(ring);
}

size_t
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PrepareUringBatch doesn't apply an explicit std::min(buffers.size() - start, max_events_per_ctx) cap like ReadAioAsync does. While io_uring_get_sqe returning null provides a natural cap via the SQ ring size, the asymmetry with the AIO path creates a readability and maintenance concern.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants