
merutable-capi: C ABI layer for the merutable engine #70

Draft
jakeswenson wants to merge 2 commits into salesforce-misc:main from jakeswenson:merutable-capi


@jakeswenson
Collaborator

Summary

Introduces merutable-capi: a C ABI layer over the merutable Rust engine, generated via cbindgen. Exposes the full database lifecycle (open, read, write, close) plus a new lightweight meru_manifest_info function for read-only catalog inspection — the primary entry point for the DuckDB extension.


Commits

  1. feat: add merutable-capi — initial C ABI crate with full CRUD lifecycle, cbindgen header, and smoke test
  2. feat(capi): add meru_manifest_info — shared runtime model, read-only manifest inspection, internal column filtering

Design principles

1. The ABI surface is minimal and stable.
Only types and functions that have a clear external consumer are exported. Internal Rust types (manifests, WAL structures, compaction state) never cross the boundary. cbindgen generates include/merutable.h directly from Rust #[repr(C)] types, so the header is always in sync with the implementation.

2. Ownership is always explicit and single-sided.
Every heap allocation made by the Rust side is documented in the header and freed by a paired *_free function. The caller never calls free() directly on any pointer returned by the API. Passing NULL to any *_free function is safe. This contract is enforced by the Rust Drop / Box::from_raw / CString::from_raw discipline in the free functions.
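A minimal sketch of the paired-free contract for a single string, assuming nothing about the crate's actual source. The names `demo_make_string` and `demo_free_string` are hypothetical stand-ins; only `CString::into_raw` and `CString::from_raw` are real standard-library calls. A real export would also carry a no-mangle attribute, omitted here.

```rust
use std::ffi::{c_char, CString};

/// Hypothetical producer: transfers ownership of a heap-allocated,
/// NUL-terminated string to the C caller.
pub extern "C" fn demo_make_string() -> *mut c_char {
    CString::new("catalog/path")
        .expect("literal has no interior NUL")
        .into_raw()
}

/// Hypothetical paired free function. NULL is accepted and ignored,
/// matching the "passing NULL is safe" rule described above.
///
/// # Safety
/// `s` must be NULL or a pointer previously returned by
/// `demo_make_string` that has not been freed yet.
pub unsafe extern "C" fn demo_free_string(s: *mut c_char) {
    if s.is_null() {
        return;
    }
    // Rebuild the CString so Rust's allocator releases the memory;
    // the caller must not pass this pointer to C's free().
    drop(unsafe { CString::from_raw(s) });
}
```

The same shape applies to boxed structs, with `Box::into_raw` on the way out and `Box::from_raw` in the free function.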

3. Async I/O is driven by an explicit, caller-owned runtime.
The crate defines a MeruRuntime opaque type wrapping a tokio::runtime::Runtime behind an Arc. The caller creates one runtime with meru_runtime_new() and passes it to every function that performs I/O. Multiple database handles opened on the same runtime share one thread pool. No hidden runtime is ever allocated per-call. The Arc ensures the thread pool stays alive as long as any handle that references it is open, and shuts down naturally when the last reference is dropped.
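The sharing behaviour can be sketched with the standard library alone. This is an illustration under stated assumptions, not the crate's code: `PoolStandIn` is a hypothetical placeholder for `tokio::runtime::Runtime`, and `DemoRuntime` / `DemoHandle` are invented names standing in for `MeruRuntime` and `MeruHandle`.

```rust
use std::sync::Arc;

/// Hypothetical placeholder for tokio::runtime::Runtime, so the sketch
/// builds without external crates.
pub struct PoolStandIn {
    pub worker_threads: usize,
}

/// Stand-in for the caller-owned runtime object.
pub struct DemoRuntime {
    inner: Arc<PoolStandIn>,
}

/// Stand-in for a database handle that keeps the pool alive.
pub struct DemoHandle {
    runtime: Arc<PoolStandIn>,
}

impl DemoRuntime {
    pub fn new(worker_threads: usize) -> Self {
        DemoRuntime { inner: Arc::new(PoolStandIn { worker_threads }) }
    }

    /// Opening a handle clones the Arc; no new pool is created.
    pub fn open(&self) -> DemoHandle {
        DemoHandle { runtime: Arc::clone(&self.inner) }
    }

    /// Number of owners currently keeping the pool alive.
    pub fn live_references(&self) -> usize {
        Arc::strong_count(&self.inner)
    }
}

impl DemoHandle {
    pub fn worker_threads(&self) -> usize {
        self.runtime.worker_threads
    }
}
```

In this model, dropping the runtime object while a handle is still open leaves the pool usable through that handle; the pool is released only when the last `Arc` goes away.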

4. Internal columns are filtered before crossing the ABI boundary.
merutable's Parquet files carry bookkeeping columns (_merutable_ikey, _merutable_seq, _merutable_op, _merutable_value) that are injected by the codec layer and are invisible to the public schema. The MeruManifestInfo.columns array is explicitly filtered against these names in Rust before being handed to C, so the C++ extension does not need an IsSystemColumn() guard and should never have to add one.
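As an illustration of the filtering step, the sketch below drops the four bookkeeping names from a list of column names. The function name `user_visible_columns` and the use of plain string slices are assumptions; the real code presumably filters richer column definitions.

```rust
/// Bookkeeping column names listed in this PR description.
pub const INTERNAL_COLUMNS: [&str; 4] = [
    "_merutable_ikey",
    "_merutable_seq",
    "_merutable_op",
    "_merutable_value",
];

/// Hypothetical helper: keeps only columns that are not internal,
/// preserving the original column order.
pub fn user_visible_columns<'a>(all: &[&'a str]) -> Vec<&'a str> {
    all.iter()
        .copied()
        .filter(|name| !INTERNAL_COLUMNS.contains(name))
        .collect()
}
```

Filtering on the Rust side keeps the list of internal names in one place, so a C or C++ consumer never has to duplicate it.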

5. All file paths returned are absolute.
Manifest entries use relative paths internally. Any function that returns file paths canonicalizes the base directory and prefixes every entry, so callers receive paths that can be passed directly to a file reader.
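The canonicalize-then-prefix step might look roughly like this. `absolute_entries` is a hypothetical name and the inputs are simplified to string slices; `std::fs::canonicalize` and `Path::join` are the only real APIs involved.

```rust
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper: resolves `base_dir` to an absolute, symlink-free
/// path once, then prefixes every relative manifest entry with it.
/// Fails if `base_dir` does not exist, because canonicalize touches disk.
pub fn absolute_entries(base_dir: &Path, relative: &[&str]) -> io::Result<Vec<PathBuf>> {
    let base = std::fs::canonicalize(base_dir)?;
    // Note: Path::join replaces the base entirely if an entry is already
    // absolute; this sketch assumes all entries are relative.
    Ok(relative.iter().map(|rel| base.join(rel)).collect())
}
```

The entries themselves are not required to exist in this sketch; only the base directory is resolved against the file system.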

6. Dual-format manifest reading.
Functions that read the manifest (including meru_manifest_info and meru_open_existing) prefer v{N}.metadata.pb (the canonical protobuf format introduced in #28) and fall back to v{N}.metadata.json for catalogs committed before the migration. Format detection uses the MRUB magic prefix.
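A rough sketch of the two decisions involved: which file names to try for version N, and how to tell the formats apart from the leading bytes. Treating every payload without the MRUB prefix as JSON is an assumption of this sketch, as are the names `manifest_candidates` and `detect_format`.

```rust
/// Magic prefix named in this PR description.
pub const PB_MAGIC: &[u8; 4] = b"MRUB";

#[derive(Debug, PartialEq, Eq)]
pub enum ManifestFormat {
    Protobuf,
    Json,
}

/// Candidate file names for version `n`, in preference order:
/// protobuf first, JSON as the fallback.
pub fn manifest_candidates(n: u64) -> [String; 2] {
    [format!("v{n}.metadata.pb"), format!("v{n}.metadata.json")]
}

/// Classifies a manifest payload by its leading bytes.
/// Assumption: anything lacking the magic prefix is legacy JSON.
pub fn detect_format(bytes: &[u8]) -> ManifestFormat {
    if bytes.starts_with(PB_MAGIC) {
        ManifestFormat::Protobuf
    } else {
        ManifestFormat::Json
    }
}
```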


ABI surface

Runtime lifecycle

MeruRuntime *meru_runtime_new(uintptr_t worker_threads, char **err_out);
void         meru_runtime_free(MeruRuntime *rt);
// worker_threads = 0 → one thread per logical CPU (tokio default)

Database lifecycle

int  meru_open(const MeruOpenOptions *opts, MeruRuntime *rt,
               MeruHandle **db_out, char **err_out);
int  meru_open_existing(const char *path, uint8_t read_only, MeruRuntime *rt,
                        MeruHandle **db_out, char **err_out);
int  meru_close(MeruHandle *db, char **err_out);
void meru_free(MeruHandle *db);
int  meru_close_free(MeruHandle *db, char **err_out);   // convenience: close + free
int  meru_is_closed(const MeruHandle *db);

meru_open_existing reads the TableSchema from the manifest on disk — no schema re-supply needed. Returns MeruStatus_ErrNotFound when no catalog exists at the path.
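The status-plus-err_out convention used by these signatures can be sketched as follows. Everything here is hypothetical: the numeric values of `DemoStatus`, the `demo_check_catalog` function, the "path must be an existing directory" test, and the choice to tolerate a NULL `err_out` are illustrative assumptions, not the crate's behaviour.

```rust
use std::ffi::{c_char, CStr, CString};
use std::path::Path;

/// Hypothetical status codes; the real MeruStatus values are not shown in this PR.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DemoStatus {
    Ok = 0,
    ErrNotFound = 1,
}

/// Writes a heap-allocated message to `err_out` if the caller supplied one.
unsafe fn set_err(err_out: *mut *mut c_char, msg: &str) {
    if err_out.is_null() {
        return;
    }
    let c = CString::new(msg).unwrap_or_default();
    unsafe { *err_out = c.into_raw() };
}

/// Hypothetical check: reports ErrNotFound when `path` is not a directory.
///
/// # Safety
/// `path` must be a valid NUL-terminated string; `err_out` must be NULL
/// or point to writable storage for one pointer.
pub unsafe extern "C" fn demo_check_catalog(
    path: *const c_char,
    err_out: *mut *mut c_char,
) -> i32 {
    let path = unsafe { CStr::from_ptr(path) }.to_string_lossy().into_owned();
    if !Path::new(&path).is_dir() {
        unsafe { set_err(err_out, &format!("no catalog at {path}")) };
        return DemoStatus::ErrNotFound as i32;
    }
    DemoStatus::Ok as i32
}
```

On failure the caller owns the message and releases it through the library's string free function, never through C's `free()`.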

Read / write

int  meru_put(MeruHandle *db, const MeruRow *row, uint64_t *seq_out, char **err_out);
int  meru_put_batch(MeruHandle *db, const MeruRow *rows, uintptr_t count,
                    uint64_t *seq_out, char **err_out);
int  meru_delete(MeruHandle *db, const MeruValue *pk_values, uintptr_t pk_count,
                 uint64_t *seq_out, char **err_out);
int  meru_get(const MeruHandle *db, const MeruValue *pk_values, uintptr_t pk_count,
              int *found, MeruRow **row_out, char **err_out);
int  meru_scan(const MeruHandle *db,
               const MeruValue *start_pk, uintptr_t start_count,
               const MeruValue *end_pk,   uintptr_t end_count,
               MeruScanResult **result_out, char **err_out);

Maintenance

int  meru_flush(MeruHandle *db, char **err_out);
int  meru_compact(MeruHandle *db, char **err_out);
int  meru_refresh(MeruHandle *db, char **err_out);       // read-only replica sync
int  meru_export_iceberg(MeruHandle *db, const char *target_dir, char **err_out);
int  meru_stats(const MeruHandle *db, MeruStats *stats_out, char **err_out);
char *meru_catalog_path(const MeruHandle *db);           // free with meru_free_string

Manifest inspection (new)

typedef struct {
    char          *table_name;    // heap-allocated; free via meru_free_string
    MeruColumnDef *columns;       // heap-allocated array, column_count entries
    uintptr_t      column_count;  // user-visible columns only (internal cols filtered)
    uintptr_t     *primary_key;   // heap-allocated array of column indices
    uintptr_t      pk_count;
    char         **parquet_paths; // heap-allocated array of heap-allocated absolute paths
    uintptr_t      parquet_count; // live (non-deleted) files only
} MeruManifestInfo;

int  meru_manifest_info(MeruRuntime *rt, const char *path,
                        MeruManifestInfo **out, char **err_out);
void meru_manifest_info_free(MeruManifestInfo *info);

meru_manifest_info reads the catalog manifest at path without acquiring a write lock or initializing a MeruHandle. It is the intended first call from the DuckDB extension: inspect the schema and enumerate Parquet files cheaply, then decide whether to open a full handle. Returns MeruStatus_ErrNotFound when no catalog exists at path.
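Because MeruManifestInfo holds nested heap allocations (an array of strings, each separately allocated), its free function has to release them inside-out. The sketch below shows one way this could work for the path array alone; `DemoPathList` and both functions are hypothetical and reduced to the two path fields.

```rust
use std::ffi::{c_char, CString};

/// Hypothetical, reduced version of the path fields in MeruManifestInfo.
#[repr(C)]
pub struct DemoPathList {
    pub paths: *mut *mut c_char, // array of `count` owned C strings
    pub count: usize,
}

/// Builds the struct on the Rust side and hands ownership to the caller.
pub fn demo_path_list_new(paths: &[&str]) -> *mut DemoPathList {
    let raw: Vec<*mut c_char> = paths
        .iter()
        .map(|p| CString::new(*p).expect("no interior NUL").into_raw())
        .collect();
    // A boxed slice has exactly `count` elements, so the free function can
    // rebuild it from pointer + length alone.
    let boxed: Box<[*mut c_char]> = raw.into_boxed_slice();
    let count = boxed.len();
    let paths = Box::into_raw(boxed) as *mut *mut c_char;
    Box::into_raw(Box::new(DemoPathList { paths, count }))
}

/// Frees each string, then the array, then the struct. NULL is ignored.
///
/// # Safety
/// `list` must be NULL or an unfreed pointer from `demo_path_list_new`.
pub unsafe extern "C" fn demo_path_list_free(list: *mut DemoPathList) {
    if list.is_null() {
        return;
    }
    unsafe {
        let list = Box::from_raw(list);
        let slice = std::ptr::slice_from_raw_parts_mut(list.paths, list.count);
        let array: Box<[*mut c_char]> = Box::from_raw(slice);
        for p in array.iter() {
            drop(CString::from_raw(*p));
        }
        // `array` and `list` are dropped here, releasing the outer allocations.
    }
}
```

A single free call from the C side is enough; the caller never walks the array to free individual strings.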

Memory free helpers

void meru_row_free(MeruRow *result);
void meru_scan_result_free(MeruScanResult *result);
void meru_manifest_info_free(MeruManifestInfo *info);
void meru_free_string(char *s);

Key types

| Type | Purpose |
| --- | --- |
| MeruColumnType | Column type enum (Boolean, Int32, Int64, Float, Double, ByteArray, FixedLenBytes) |
| MeruValue | Nullable tagged union for a single field value |
| MeruRow | Array of MeruValue in schema column order |
| MeruColumnDef | Column name + type + nullability + optional defaults |
| MeruSchema / MeruOpenOptions | Schema declaration and open parameters |
| MeruScanResult | Heap-allocated row array from meru_scan |
| MeruStats | Engine statistics snapshot |
| MeruManifestInfo | Schema + live Parquet paths from a catalog, no handle required |
| MeruStatus | Integer status codes returned by all fallible functions |

Testing

  • crates/merutable-capi/tests/c_smoke.rs — Rust test that compiles examples/smoke.c against the built dylib and runs it end-to-end (open → put → get → scan → stats → close → reopen → close)
  • examples/smoke.c — can also be compiled and run manually against any catalog path

Also in this PR

  • rust-toolchain.toml — pins the workspace to stable with rustfmt and clippy components
  • .gitignore — ignores version-hint.text and metadata/*.metadata.{json,pb} at the repo root, which were being generated by test runs using relative catalog paths
