foia

FOIA document acquisition and research system. Discovers, downloads, extracts text via OCR, and annotates public records from government agencies and document repositories.

Features

Discovery — find documents via HTML crawling, API search, sitemaps, Wayback Machine, search engines, and pattern-based URL generation
Scraping — download with adaptive rate limiting, browser automation for JS-heavy sites, and optional Tor routing
Analysis — text extraction with poppler-utils, OCR via Groq, OCRS, PaddleOCR, or Tesseract
Annotation — LLM-powered summaries, tags, named entity recognition, and date detection (Ollama, Groq, OpenAI, Together.ai)
Import — ingest from WARC archives, Concordance DAT/OPT load files, URL lists, or stdin
Search — full-text search, entity-based queries, and a web interface
Storage — content-addressable file storage with SHA-256 and BLAKE3 deduplication
Privacy — Tor routing by default with pluggable transports; supports external SOCKS proxies
Database — SQLite (default) or PostgreSQL for larger deployments

Quick Start

Docker

docker run -d --name foia \
  -v ./data:/opt/foia \
  -e DATABASE_URL=sqlite:///opt/foia/foia.db \
  ghcr.io/foiacquire/foia:clearnet

With Tor:

docker run -d --name foia \
  -v ./data:/opt/foia \
  ghcr.io/foiacquire/foia:latest

With browser automation:

docker run -d --name chromium -p 9222:9222 --shm-size=2g ghcr.io/foiacquire/chromium:latest
docker run -d --name foia \
  -v ./data:/opt/foia \
  -e BROWSER_URL=ws://chromium:9222 \
  --link chromium \
  ghcr.io/foiacquire/foia:latest scrape cia_foia

From Source

Requires Rust toolchain and poppler-utils for text extraction.

cargo build --release
./target/release/foia init --config etc/example.json
./target/release/foia scrape --all

Container Images

Image	Tag	Description
`ghcr.io/foiacquire/foia`	`latest`	Tor-enabled (default)
`ghcr.io/foiacquire/foia`	`clearnet`	No Tor
`ghcr.io/foiacquire/foia`	`tesseract`	Tor + Tesseract OCR
`ghcr.io/foiacquire/foia`	`tesseract-clearnet`	Tesseract OCR, no Tor
`ghcr.io/foiacquire/chromium`	`latest`	Stealth browser + Tor (repo)
`ghcr.io/foiacquire/chromium`	`clearnet`	Standard browser

Usage

# Initialize database and import scraper configs
foia init --config config.json

# Discover and download documents
foia scrape --all
foia scrape fbi_vault --limit 100

# Run as daemon
foia scrape --all --daemon --interval 3600

# OCR and text extraction
foia analyze --daemon --interval 300

# LLM annotation (summaries, tags, entities)
foia annotate --daemon --interval 300

# Search and browse
foia search "surveillance program"
foia ls --source cia_foia
foia serve

Commands

Document Acquisition

Command	Description
`scrape <source>`	Crawl and download documents (supports `--all`, `--daemon`)
`crawl <source>`	Discover document URLs without downloading
`download [source]`	Download pending documents from queue
`refresh [source]`	Re-fetch metadata for existing documents

Discovery

Command	Description
`discover sitemap <source>`	Discover from sitemaps and robots.txt
`discover search <source>`	Search engine discovery (DuckDuckGo, Google, Bing, Brave)
`discover wayback <source>`	Wayback Machine historical snapshots
`discover pattern <source>`	Infer URL patterns from existing documents
`discover paths <source>`	Check common document paths
`discover all <source>`	Run all discovery methods

Import

Command	Description
`import warc <files>`	Import from WARC archives (.warc, .warc.gz)
`import concordance <path>`	Import from Concordance DAT/OPT load files
`import urls --file <file>`	Import URLs from a text file
`import stdin --url <url>`	Import content from stdin

Document Processing

Command	Description
`analyze [source]`	Extract text and run OCR (supports `--daemon`)
`analyze-check`	Verify OCR tools are installed
`analyze-compare <file>`	Compare OCR backends on a file
`annotate [source]`	Generate summaries/tags with LLM (supports `--daemon`)
`detect-dates [source]`	Detect publication dates in documents
`extract-entities [source]`	Extract named entities (people, orgs, locations)
`archive [source]`	Extract contents from ZIP/email attachments

Browsing & Search

Command	Description
`ls`	List documents with filtering by source, tag, type
`info <doc_id>`	Show document metadata
`read <doc_id>`	Output document content
`search <query>`	Full-text search
`search-entities <query>`	Search by extracted entities (supports spatial `--near`)
`serve [bind]`	Start web interface (default: 127.0.0.1:3030)
`status`	System status (TUI with `--live`, or `--json`)

Management

Command	Description
`init`	Initialize database and directories
`source list`	List configured sources
`source rename <old> <new>`	Rename a source
`config transfer`	Import config file into database
`config get <key>`	Get a config value
`config set <key> <value>`	Set a config value
`db migrate`	Run database migrations
`db copy <from> <to>`	Copy data between SQLite and PostgreSQL
`db deduplicate`	Deduplicate documents by content hash
`state status`	Show crawl state
`state clear <source>`	Reset crawl state

Configuration

Scraper configs define how to discover and fetch documents per source. See etc/example.json for a complete example.

{
  "scrapers": {
    "fbi_vault": {
      "discovery": {
        "type": "html_crawl",
        "base_url": "https://vault.fbi.gov",
        "start_paths": ["/alphabetical-index"],
        "document_links": ["a[href*='/vault/']"],
        "document_patterns": ["\\.pdf$"],
        "pagination": { "next_selectors": ["a[rel='next']"] }
      },
      "fetch": { "use_browser": false }
    }
  }
}

Environment Variables

Variable	Description
`DATABASE_URL`	Database connection (SQLite path or PostgreSQL URL)
`BROWSER_URL`	Chrome DevTools URL(s), comma-separated for pool
`BROWSER_SELECTION`	Pool strategy: `round-robin`, `random`, `per-domain`
`SOCKS_PROXY`	SOCKS proxy URL
`FOIA_DIRECT`	Set to `1` to disable Tor
`LLM_PROVIDER`	`groq`, `ollama`, `openai`, or `together`
`LLM_MODEL`	Model name for annotation
`GROQ_API_KEY`	Groq API key (also enables Groq OCR backend)
`ANALYSIS_OCR_BACKENDS`	OCR backends (comma-separated: `groq`, `ocrs`, `paddle`, `tesseract`)
`MIGRATE`	Set to `true` to run migrations on container startup
`USER_ID` / `GROUP_ID`	Container user/group mapping
`RUST_LOG`	Log level (`info`, `debug`, `trace`)

Architecture

Rust workspace with 7 crates:

Crate	Purpose
`foia`	Core library — models, config, storage, HTTP client, database, LLM, browser
`foia-cli`	CLI binary (`foia`)
`foia-scrape`	Web scraping and download
`foia-analysis`	OCR and text extraction
`foia-annotate`	LLM annotation, NER, date detection
`foia-import`	WARC, Concordance, URL, and stdin import
`foia-server`	Web interface and API (Axum)

Feature Flags

Feature	Description
`browser`	Chromium automation via chromiumoxide (default)
`postgres`	PostgreSQL backend
`redis-backend`	Redis-backed distributed rate limiting
`amqp-broker`	RabbitMQ job queue
`ocr-ocrs`	OCRS pure-Rust OCR
`ocr-paddle`	PaddleOCR ONNX backend
`gis`	Geographic/spatial features

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 420 Commits
.cargo		.cargo
.github		.github
bin		bin
crates		crates
docker		docker
docs		docs
etc		etc
scripts		scripts
.dockerignore		.dockerignore
.env.example		.env.example
.gitignore		.gitignore
.jscpd.json		.jscpd.json
CONTRIBUTING.md		CONTRIBUTING.md
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
README.md		README.md
clippy.toml		clippy.toml
deny.toml		deny.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

foia

Features

Quick Start

Docker

From Source

Container Images

Usage

Commands

Document Acquisition

Discovery

Import

Document Processing

Browsing & Search

Management

Configuration

Environment Variables

Architecture

Feature Flags

License

About

Uh oh!

Releases

Sponsor this project

Uh oh!

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

foia

Features

Quick Start

Docker

From Source

Container Images

Usage

Commands

Document Acquisition

Discovery

Import

Document Processing

Browsing & Search

Management

Configuration

Environment Variables

Architecture

Feature Flags

License

About

Topics

Resources

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Sponsor this project

Uh oh!

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages