oxidize-pdf

The Rust PDF library built for AI. Parse PDFs into structure-aware, embedding-ready chunks with one line of code. Pure Rust, zero C dependencies, 99.3% parse success rate on 9,000+ real-world PDFs.

let chunks = PdfDocument::open("paper.pdf")?.rag_chunks()?;
// Each chunk: text, pages, bounding boxes, element types, heading context, token estimate

Why oxidize-pdf for RAG?

Most PDF libraries give you a wall of text. oxidize-pdf gives you structured, metadata-rich chunks ready for your vector store:

What you get	Why it matters
`chunk.full_text`	Heading context prepended -- better embeddings
`chunk.page_numbers`	Citation back to source pages
`chunk.bounding_boxes`	Spatial position for visual grounding
`chunk.element_types`	Filter by "table", "title", "paragraph"
`chunk.token_estimate`	Right-size chunks for your model's context window
`chunk.heading_context`	Section awareness without post-processing
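
As a rough illustration of how these fields can be combined downstream, the sketch below keeps only chunks of one element type and packs them, in order, under a token budget. `ChunkMeta` and `select_within_budget` are hypothetical names defined locally so the example compiles on its own. The struct mirrors three field names from the table, but the field types are assumptions, not the crate's actual `RagChunk` definition.

```rust
/// Hypothetical local stand-in for a chunk. Field names follow the table
/// above; field types are assumed for illustration only.
#[derive(Debug, Clone)]
struct ChunkMeta {
    full_text: String,
    element_types: Vec<String>,
    token_estimate: usize,
}

/// Keep chunks that contain the `wanted` element type, in document order,
/// stopping as soon as the next one would exceed `budget` tokens.
fn select_within_budget<'a>(
    chunks: &'a [ChunkMeta],
    wanted: &str,
    budget: usize,
) -> Vec<&'a ChunkMeta> {
    let mut used = 0;
    let mut picked = Vec::new();
    let matching = chunks
        .iter()
        .filter(|c| c.element_types.iter().any(|t| t.as_str() == wanted));
    for chunk in matching {
        if used + chunk.token_estimate > budget {
            break;
        }
        used += chunk.token_estimate;
        picked.push(chunk);
    }
    picked
}

/// Small helper to build sample data.
fn sample(text: &str, types: &[&str], tokens: usize) -> ChunkMeta {
    ChunkMeta {
        full_text: text.to_string(),
        element_types: types.iter().map(|t| t.to_string()).collect(),
        token_estimate: tokens,
    }
}

fn main() {
    let chunks = vec![
        sample("Intro\n\nFirst paragraph.", &["paragraph"], 120),
        sample("Results\n\nTable 1", &["table"], 300),
        sample("Results\n\nDiscussion.", &["title", "paragraph"], 200),
        sample("Outlook\n\nClosing remarks.", &["paragraph"], 100),
    ];
    for c in select_within_budget(&chunks, "paragraph", 350) {
        println!("{} tokens: {:?}", c.token_estimate, c.full_text);
    }
}
```

Stopping at the first chunk that does not fit keeps the selection contiguous; skipping it and continuing would fit more text at the cost of gaps in the context.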

Performance: Pure Rust, 3,000-4,000 pages/sec generation, 85 ms full-text extraction for a 930 KB PDF.

Quick Start

[dependencies]
oxidize-pdf = "2.3"

RAG Pipeline -- One Liner

use oxidize_pdf::parser::PdfDocument;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let doc = PdfDocument::open("document.pdf")?;

    // Structure-aware chunking with full metadata
    let chunks = doc.rag_chunks()?;

    for chunk in &chunks {
        println!("Chunk {}: pages {:?}, ~{} tokens",
            chunk.chunk_index, chunk.page_numbers, chunk.token_estimate);
        println!("  Types: {}", chunk.element_types.join(", "));
        if let Some(heading) = &chunk.heading_context {
            println!("  Section: {}", heading);
        }

        // Use chunk.full_text for embeddings (includes heading context)
        // Use chunk.text for display (content only)
    }

    Ok(())
}

Custom Chunk Size

use oxidize_pdf::parser::PdfDocument;
use oxidize_pdf::pipeline::HybridChunkConfig;

let doc = PdfDocument::open("document.pdf")?;

// Smaller chunks for more precise retrieval
let config = HybridChunkConfig {
    max_tokens: 256,
    ..HybridChunkConfig::default()
};
let chunks = doc.rag_chunks_with(config)?;
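
To give a feel for why a lower `max_tokens` yields more, smaller chunks, here is a minimal standalone sketch of greedy merging under a token cap. It is not how `HybridChunkConfig` is implemented inside the crate: `estimate_tokens` and `merge_under_cap` are hypothetical helpers, and the word-count token estimate is a deliberately crude assumption.

```rust
/// Crude token estimate: whitespace-separated word count.
/// This is an assumption for illustration; real tokenizers differ.
fn estimate_tokens(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Greedily merge consecutive elements into chunks of at most `max_tokens`.
/// An element larger than the cap becomes its own chunk (no splitting here).
fn merge_under_cap(elements: &[&str], max_tokens: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    let mut current_tokens = 0;
    for el in elements {
        let tokens = estimate_tokens(el);
        // Flush the running chunk if this element would push it over the cap.
        if !current.is_empty() && current_tokens + tokens > max_tokens {
            chunks.push(std::mem::take(&mut current));
            current_tokens = 0;
        }
        if !current.is_empty() {
            current.push('\n');
        }
        current.push_str(el);
        current_tokens += tokens;
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}

fn main() {
    let elements = ["one two three", "four five", "six seven eight nine", "ten"];
    for cap in [3, 5, 10] {
        let chunks = merge_under_cap(&elements, cap);
        println!("max_tokens = {cap}: {} chunk(s)", chunks.len());
    }
}
```

With the same input, tightening the cap only ever increases the chunk count, which is the trade-off behind choosing a smaller value for more precise retrieval.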

JSON for Vector Store Ingestion

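// `doc` is an opened PdfDocument, as in the RAG pipeline example.
// Serialize all chunks to JSON. Requires the `semantic` feature, e.g. in Cargo.toml:
// oxidize-pdf = { version = "2.3", features = ["semantic"] }
let json = doc.rag_chunks_json()?;
std::fs::write("chunks.json", json)?;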

Element Partitioning

For fine-grained control, access the typed element pipeline directly:

use oxidize_pdf::parser::PdfDocument;
use oxidize_pdf::pipeline::ExtractionProfile;

let doc = PdfDocument::open("document.pdf")?;

// Partition into typed elements
let elements = doc.partition()?;
for el in &elements {
    println!("page {} : {}", el.page(), el.text());
}

// Or with a pre-configured profile
let elements = doc.partition_with_profile(ExtractionProfile::Academic)?;

// Build a relationship graph (parent/child sections)
let (elements, graph) = doc.partition_graph(Default::default())?;
for section in graph.top_level_sections() {
    println!("Section: {}", elements[section].text());
}
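
For intuition about what a parent/child section relationship looks like, the standalone sketch below groups a flat element list under the most recent title. It is only an assumption-laden illustration and not the crate's `ElementGraph`: `Kind`, `Element`, and `group_by_title` are hypothetical names, and real documents would also need nested heading levels.

```rust
/// Hypothetical element kinds, a small subset chosen for illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    Title,
    Paragraph,
    Table,
}

/// Hypothetical flat element, standing in for a partitioned PDF element.
struct Element {
    kind: Kind,
    text: &'static str,
}

/// For each title, collect the indices of the elements that follow it up to
/// the next title. Elements before the first title are skipped.
fn group_by_title(elements: &[Element]) -> Vec<(usize, Vec<usize>)> {
    let mut sections: Vec<(usize, Vec<usize>)> = Vec::new();
    for (i, el) in elements.iter().enumerate() {
        if el.kind == Kind::Title {
            sections.push((i, Vec::new()));
        } else if let Some((_, children)) = sections.last_mut() {
            children.push(i);
        }
    }
    sections
}

fn main() {
    let elements = [
        Element { kind: Kind::Title, text: "Introduction" },
        Element { kind: Kind::Paragraph, text: "Background text." },
        Element { kind: Kind::Table, text: "Table 1" },
        Element { kind: Kind::Title, text: "Methods" },
        Element { kind: Kind::Paragraph, text: "Setup details." },
    ];
    for (title, children) in group_by_title(&elements) {
        println!("Section: {}", elements[title].text);
        for child in children {
            println!("  {:?}: {}", elements[child].kind, elements[child].text);
        }
    }
}
```

Storing indices rather than references keeps the grouping separate from the element list, similar in spirit to indexing `elements[section]` in the example above.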

Also in the box

Beyond RAG, the same crate also handles PDF parsing (99.3% success on 9,000+ real-world PDFs, CJK, lenient recovery), generation (3,000-4,000 pages/sec), encryption (RC4-40/128, AES-128, AES-256 R5/R6, read and write), digital signatures (PKCS#7 verification), PDF/A validation (8 conformance levels), JBIG2 image decoding (pure-Rust ITU-T T.88), invoice extraction (ES/EN/DE/IT), and split/merge/rotate operations. One dependency for the full pipeline.

See oxidize-pdf-core/examples/ for working samples (133 examples) and docs.rs for the API surface.

Full Feature Set

AI/RAG Pipeline

Structure-aware chunking with RagChunk metadata (pages, bboxes, types, headings)
Element partitioning: Title, Paragraph, Table, ListItem, Image, CodeBlock, KeyValue
ElementGraph for parent/child section relationships
6 extraction profiles (Standard, Academic, Form, Government, Dense, Presentation)
Reading order strategies (Simple, XYCut)
LLM-optimized export formats (Markdown, Contextual, JSON)
Invoice data extraction (ES, EN, DE, IT)

PDF Processing

Parse PDF 1.0-1.7 with 99.3% success rate (9,000+ PDFs tested)
Generate multi-page documents with text, graphics, images
Encryption: RC4-40/128, AES-128, AES-256 (R5/R6) -- read and write
Digital signatures: detection, PKCS#7 verification, certificate validation
PDF/A validation: 8 conformance levels (1a/b, 2a/b/u, 3a/b/u)
JBIG2 decoder: pure Rust (ITU-T T.88)
Split, merge, rotate operations
CJK text support (Chinese, Japanese, Korean)
Corruption recovery and lenient parsing
Decompression bomb protection

Performance

Operation	Speed
PDF generation	3,000-4,000 pages/sec
Full text extraction (930 KB)	85 ms
Page text extraction	546 µs
File loading	738 µs

Benchmarked with Criterion. Baseline: v2.0.0-profiling.

Testing

7,993 tests across unit, integration, and doc tests. 7-tier corpus (T0-T6) with 9,000+ PDFs.

cargo test --workspace         # Full test suite
cargo clippy -- -D warnings    # Lint check
cargo run --example rag_pipeline -- path/to/file.pdf

License

MIT -- see LICENSE.
