Scrapamoja 🕷️

A Python framework for building reliable, extensible web scrapers — with battle-tested resilience, stealth, and observability built in.

What is Scrapamoja?

Scrapamoja blends the English word Scrape with the Swahili word Pamoja — meaning together. Scrape together. One scraper, many sites. One framework, many contributors.

It's also a quiet nod to Moja, Swahili for one — the idea that you shouldn't need a different tool for every site you want to scrape. One framework should be enough, and it should be good enough that anyone can extend it.

That philosophy shapes everything about how Scrapamoja is built. It's not a scraper — it's the infrastructure that makes scrapers reliable: handling anti-bot measures, selector drift, network failures, and browser resource leaks so you don't have to. New sites can be added by anyone, existing ones improved by the community, and the whole thing grows stronger the more people contribute to it.

Scrape together. Build together.

Core Framework Capabilities

🎯 Intelligent Selector Engine

The selector engine is the heart of Scrapamoja. Instead of brittle single-selector lookups, it uses a multi-strategy approach — CSS, XPath, and text-based selectors can all be defined for the same element. Each strategy is weighted, and the engine picks the best match with a confidence score. When a selector fails, it falls back gracefully rather than crashing. Selectors are defined in YAML, not hardcoded, making them easy to maintain without touching Python.

Site → Sport → Status → Context → Element
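The weighted, multi-strategy lookup described above can be sketched roughly as follows. This is a minimal illustration, not Scrapamoja's actual engine: `Strategy`, `Match`, `resolve`, and the `lookup` callable are hypothetical names, and using the strategy weight directly as the confidence score is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Strategy:
    kind: str        # "css", "xpath", or "text"
    selector: str
    weight: float    # how much the author trusts this strategy


@dataclass
class Match:
    strategy: Strategy
    value: str
    confidence: float


def resolve(
    strategies: Iterable[Strategy],
    lookup: Callable[[Strategy], Optional[str]],
) -> Optional[Match]:
    """Try every strategy and keep the highest-confidence match.

    `lookup` stands in for a real browser query (hypothetical). A strategy
    that raises or returns None is skipped, so a broken selector falls
    back to the next one instead of crashing the scrape.
    """
    best: Optional[Match] = None
    for strategy in strategies:
        try:
            value = lookup(strategy)
        except Exception:
            value = None
        if value is None:
            continue
        # Assumption: confidence is simply the strategy weight.
        candidate = Match(strategy, value, confidence=strategy.weight)
        if best is None or candidate.confidence > best.confidence:
            best = candidate
    return best


# A fake "page" where only the XPath selector still matches.
fake_page = {("xpath", "//div[@data-type='product']"): "Widget"}
strategies = [
    Strategy("css", ".product-card", 1.0),
    Strategy("xpath", "//div[@data-type='product']", 0.8),
]
match = resolve(strategies, lambda s: fake_page.get((s.kind, s.selector)))
print(match.strategy.kind, match.confidence)  # xpath 0.8
```

Returning `None` when nothing matches leaves the caller to decide whether a missing element is an error or just absent data.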

🛡️ Resilience System

The resilience system is built around the assumption that things will go wrong. It provides automatic retries with exponential backoff, failure classification (network vs. selector vs. parse errors), checkpoint-based recovery so long scrapes can resume, and a coordinator that ensures graceful shutdown even mid-scrape.
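As a rough illustration of retries with exponential backoff combined with failure classification, consider the sketch below. The names `FailureKind`, `classify`, `backoff_delay`, and `run_with_retries` are hypothetical, and the mapping from exception types to failure kinds is an assumption made for the example, not the framework's real classification.

```python
import time
from enum import Enum
from typing import Callable, TypeVar

T = TypeVar("T")


class FailureKind(Enum):
    NETWORK = "network"
    SELECTOR = "selector"
    PARSE = "parse"


def classify(exc: Exception) -> FailureKind:
    """Map an exception to a failure kind (assumed mapping, for illustration)."""
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return FailureKind.NETWORK
    if isinstance(exc, LookupError):
        return FailureKind.SELECTOR
    return FailureKind.PARSE


def backoff_delay(attempt: int, base: float = 2.0, cap: float = 60.0) -> float:
    """Delay doubles on each attempt (base, 2*base, 4*base, ...) up to `cap`."""
    return min(cap, base * 2 ** attempt)


def run_with_retries(
    task: Callable[[], T],
    max_retries: int = 5,
    base: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry only network failures; selector and parse errors are re-raised
    at once, since repeating the same request is unlikely to fix them."""
    for attempt in range(max_retries + 1):
        try:
            return task()
        except Exception as exc:
            if classify(exc) is not FailureKind.NETWORK or attempt == max_retries:
                raise
            sleep(backoff_delay(attempt, base))
    raise RuntimeError("unreachable")  # the loop always returns or raises
```

Passing `sleep` as a parameter keeps the function testable: a test can record the delays instead of actually waiting.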

🕵️ Stealth & Anti-Detection

A dedicated stealth module handles fingerprint randomization, human-like behavior simulation, consent popup handling, and proxy rotation. With these in place, sites that actively fight scrapers become manageable targets.

🔍 Snapshot Debugging

When a scrape fails, Scrapamoja captures a full snapshot: the page HTML, a screenshot, structured logs, and selector resolution traces — all correlated by session ID. Debugging a failure means looking at exactly what the browser saw, not guessing.
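One way to picture a snapshot bundle correlated by session ID is a directory per session holding the HTML, an optional screenshot, and a manifest with logs and selector traces. The file layout and the `save_snapshot` function below are hypothetical; the real snapshot API is documented in `snapshot_api_reference.md`.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def save_snapshot(
    root: str,
    session_id: str,
    html: str,
    logs: list,
    selector_trace: list,
    screenshot: Optional[bytes] = None,
) -> Path:
    """Write all artifacts for one failed scrape under root/<session_id>/."""
    target = Path(root) / session_id
    target.mkdir(parents=True, exist_ok=True)
    (target / "page.html").write_text(html, encoding="utf-8")
    if screenshot is not None:
        (target / "screenshot.png").write_bytes(screenshot)
    manifest = {
        "session_id": session_id,          # the key that ties everything together
        "has_screenshot": screenshot is not None,
        "logs": logs,
        "selector_trace": selector_trace,
    }
    (target / "manifest.json").write_text(
        json.dumps(manifest, indent=2), encoding="utf-8"
    )
    return target


# Usage: capture a failure into a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    out = save_snapshot(
        tmp,
        "sess-1",
        "<html></html>",
        logs=[{"event": "selector_failed"}],
        selector_trace=[{"strategy": "css", "matched": False}],
    )
    print(sorted(p.name for p in out.iterdir()))
```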

📊 Telemetry & Observability

Scrapamoja emits structured JSON logs with correlation IDs, collects built-in metrics (execution time, success rates, selector confidence distributions), and exposes alerting hooks. Production scrapers need production-grade monitoring.
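Structured JSON logging with a correlation ID can be approximated with the standard library alone. The sketch below is not Scrapamoja's logging module; `JsonFormatter`, `make_logger`, and the field names in the payload are assumptions chosen for illustration.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set by the LoggerAdapter below; None if logged without one.
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)


def make_logger(name: str, correlation_id: str) -> logging.LoggerAdapter:
    """Return a logger that stamps every record with `correlation_id`."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logging.LoggerAdapter(logger, {"correlation_id": correlation_id})


log = make_logger("scrape.demo", correlation_id="session-123")
log.info("scrape %s", "started")
```

Because every line carries the same `correlation_id`, all events from one scrape session can be filtered out of a shared log stream.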

🌐 Browser Lifecycle Management

The browser layer provides browser and page pooling, session state persistence, tab management, resource monitoring (memory, CPU), and corruption detection, so long-running scrapers do not leak memory or leave orphaned browser processes.

🔀 Hybrid Extraction Modes

Scrapamoja chooses the optimal extraction method based on each target site's architecture:

| Mode | Description | Use Case |
| --- | --- | --- |
| DOM Mode (default) | Navigate with browser, extract from HTML | Sites requiring full rendering |
| Direct API Mode | Skip browser, call APIs directly | Open APIs, millisecond latency |
| Network Interception | Capture API responses during browser sessions | Sites requiring browser initialization |
| Hybrid Mode | Browser once to harvest session, then direct HTTP | Sites requiring authenticated sessions |
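The table can be read as a small decision procedure. The sketch below encodes one plausible reading of it; `SiteProfile` and `choose_mode` are hypothetical, and how Scrapamoja actually selects a mode is not described here.

```python
from dataclasses import dataclass
from enum import Enum


class ExtractionMode(Enum):
    DOM = "dom"
    DIRECT_API = "direct_api"
    NETWORK_INTERCEPTION = "network_interception"
    HYBRID = "hybrid"


@dataclass
class SiteProfile:
    """Hypothetical description of a target site's architecture."""
    has_api: bool
    api_needs_session: bool = False        # API requires an authenticated session
    api_needs_browser_init: bool = False   # API only responds after a page load


def choose_mode(profile: SiteProfile) -> ExtractionMode:
    if not profile.has_api:
        return ExtractionMode.DOM                    # full rendering required
    if profile.api_needs_session:
        return ExtractionMode.HYBRID                 # browser once, then direct HTTP
    if profile.api_needs_browser_init:
        return ExtractionMode.NETWORK_INTERCEPTION   # capture responses in-session
    return ExtractionMode.DIRECT_API                 # open API, no browser at all


print(choose_mode(SiteProfile(has_api=True)).value)  # direct_api
```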

Architecture

scrapamoja/
├── src/
│   ├── main.py                   # Unified CLI entry point
│   ├── api/                      # FastAPI endpoints and schemas
│   ├── sites/                    # Site implementations
│   │   ├── _template/            # Full-featured template for new sites
│   │   ├── base/                 # BaseSiteScraper, registry, DI container
│   │   ├── direct/               # Direct API mode implementations
│   │   ├── flashscore/           # FlashScore scraper (Basketball, Football)
│   │   └── wikipedia/            # Wikipedia scraper
│   ├── selectors/                # Selector engine (YAML-driven, multi-strategy)
│   ├── browser/                  # Browser lifecycle, sessions, tab management
│   ├── network/                  # Network interception, HTTP client, error handling
│   ├── resilience/               # Retries, failure classification, checkpoints
│   ├── stealth/                  # Anti-detection, fingerprinting, proxies
│   ├── telemetry/                # Metrics, alerting, audit, reporting
│   ├── navigation/               # Route planning and page discovery
│   ├── extractor/                # Data extraction and transformation
│   ├── observability/            # Structured logging and event system
│   ├── interrupt_handling/       # Graceful shutdown and signal handling
│   ├── models/                   # Data models and schemas
│   ├── config/                   # Configuration management
│   ├── storage/                  # Data persistence and caching
│   ├── core/                     # Core utilities and shared components
│   ├── extraction/               # Extraction strategies and processors
│   └── utils/                    # Utility functions and helpers
├── ui/                          # Web UI for feature flag management and system monitoring
│   └── app/                      # React/Vite application with TailwindCSS for managing flags, escalations, and audit logs
├── tests/                        # Unit, integration, performance, stealth tests
├── docs/                         # Architecture docs, workflow guides, API reference
├── examples/                     # Runnable examples
└── scripts/                      # Migration and validation utilities

🚀 Quick Start

Prerequisites

Python 3.12 or higher
2GB RAM minimum (4GB recommended)
Internet connection
Git (for cloning)

Installation

Linux / macOS

# 1. Clone repository
git clone https://github.com/TisoneK/scrapamoja.git
cd scrapamoja

# 2. Create virtual environment
python3 -m venv venv
source venv/bin/activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Install Playwright browsers
playwright install chromium

# 5. Run your first scrape
python -m src.main flashscore scrape basketball live --limit 1

Windows

:: 1. Clone repository
git clone https://github.com/TisoneK/scrapamoja.git
cd scrapamoja

:: 2. Create virtual environment
python -m venv venv
venv\Scripts\activate

:: 3. Install dependencies
pip install -r requirements.txt

:: 4. Install Playwright browsers
playwright install chromium

:: 5. Run your first scrape
python -m src.main flashscore scrape basketball live --limit 1

First Results

{
  "sport": "Basketball",
  "status": "live",
  "matches": [
    {
      "home_team": "Los Angeles Lakers",
      "away_team": "Boston Celtics",
      "score": "89-87",
      "time": "4th Quarter"
    }
  ],
  "total": 1
}

Building a New Scraper

Scrapamoja ships with a full site template at src/sites/_template/ — it's not a stub, it's a working skeleton with flows, processors, validators, config management, and component wiring already in place.

1. Copy the template

cp -r src/sites/_template src/sites/mysite

2. Implement the scraper

from src.sites.base.site_scraper import BaseSiteScraper

class MySiteScraper(BaseSiteScraper):
    site_id = "mysite"
    site_name = "My Site"
    base_url = "https://example.com"

3. Define selectors in YAML

# src/sites/mysite/selectors/extraction/listings.yaml
description: "Product listing items"
strategies:
  - type: "css"
    selector: ".product-card"
    weight: 1.0
  - type: "xpath"
    selector: "//div[@data-type='product']"
    weight: 0.8

4. Register and run

# Add to SITE_CLIS in src/main.py
'mysite': ('src.sites.mysite.cli.main', 'MySiteCLI'),
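The `(module path, class name)` shape of a `SITE_CLIS` entry suggests the class is imported lazily by name. How `src/main.py` actually consumes the mapping is not shown here, so the following is only a sketch of that pattern: `load_cli_class` is a hypothetical helper, and the `demo` entry points at a standard-library class so the example runs anywhere.

```python
import importlib

# Same shape as SITE_CLIS: site key -> (module path, class name).
# "demo" is a stand-in entry that resolves to a stdlib class.
SITE_CLIS = {
    "demo": ("argparse", "ArgumentParser"),
}


def load_cli_class(site: str) -> type:
    """Import the module for `site` on demand and return the named class."""
    try:
        module_path, class_name = SITE_CLIS[site]
    except KeyError:
        raise ValueError(f"unknown site: {site!r}") from None
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


print(load_cli_class("demo").__name__)  # ArgumentParser
```

Importing on demand means a broken or heavy site module only affects runs that target that site.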

Supported Sites

| Site | Data | Sports / Topics | Status Types |
| --- | --- | --- | --- |
| FlashScore | Live scores, match stats, odds | Basketball, Football | Live, Finished, Scheduled |
| Wikipedia | Article content, tables, references | Any | N/A |

Both are production implementations — the FlashScore scraper handles live match updates, status-aware extraction, and real-time polling. The Wikipedia scraper handles table parsing, multi-language articles, and reference extraction.

Configuration

Global config (`config.yaml`)

browser:
  headless: true
  timeout: 30000
  viewport:
    width: 1920
    height: 1080

scraping:
  max_retries: 5
  retry_delay: 2.0
  rate_limit: 10  # requests per minute

logging:
  level: "INFO"
  structured: true
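To make the `scraping` values concrete, the sketch below models them as a typed object and derives the request spacing implied by `rate_limit`. `ScrapingConfig` and `from_mapping` are hypothetical names, and treating `retry_delay` as seconds is an assumption; the framework's own config loader may differ.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class ScrapingConfig:
    max_retries: int = 5
    retry_delay: float = 2.0   # assumed to be seconds
    rate_limit: int = 10       # requests per minute, per the config comment

    @classmethod
    def from_mapping(cls, data: Mapping[str, Any]) -> "ScrapingConfig":
        """Build from the parsed `scraping:` section and reject bad values."""
        cfg = cls(**data)
        if cfg.max_retries < 0 or cfg.retry_delay < 0 or cfg.rate_limit <= 0:
            raise ValueError("max_retries/retry_delay must be >= 0, rate_limit > 0")
        return cfg

    @property
    def min_interval_seconds(self) -> float:
        """Minimum spacing between requests implied by the rate limit."""
        return 60.0 / self.rate_limit


cfg = ScrapingConfig.from_mapping(
    {"max_retries": 5, "retry_delay": 2.0, "rate_limit": 10}
)
print(cfg.min_interval_seconds)  # 6.0
```

With the values above, 10 requests per minute works out to at most one request every 6 seconds.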

CLI options

| Flag | Description |
| --- | --- |
| `--limit N` | Cap the number of results |
| `--output, -o FORMAT` | Output format: `json`, `csv`, `xml` |
| `--file, -f PATH` | Write output to file |
| `--headless / --no-headless` | Browser visibility (`--no-headless` for debugging) |
| `--verbose, -v` | Detailed logs |
| `--quiet, -q` | Errors only |
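For readers who want to see how flags of this shape fit together, here is a stand-alone `argparse` sketch that mirrors the table. It is not the project's actual parser: the `json` default, the headless default, and making `--verbose` and `--quiet` mutually exclusive are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="scrapamoja-demo")
    parser.add_argument("--limit", type=int, metavar="N",
                        help="cap the number of results")
    parser.add_argument("--output", "-o", choices=["json", "csv", "xml"],
                        default="json",  # assumed default
                        help="output format")
    parser.add_argument("--file", "-f", metavar="PATH",
                        help="write output to file")
    # BooleanOptionalAction (Python 3.9+) generates --headless / --no-headless.
    parser.add_argument("--headless", action=argparse.BooleanOptionalAction,
                        default=True, help="browser visibility")
    # Assumption: verbose and quiet cannot be combined.
    noise = parser.add_mutually_exclusive_group()
    noise.add_argument("--verbose", "-v", action="store_true",
                       help="detailed logs")
    noise.add_argument("--quiet", "-q", action="store_true",
                       help="errors only")
    return parser


args = build_parser().parse_args(["--limit", "1", "-o", "csv", "--no-headless"])
print(args.limit, args.output, args.headless)  # 1 csv False
```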

Documentation

| Doc | Description |
| --- | --- |
| `features.md` | Complete feature reference |
| `modular_template_guide.md` | Guide to building new site scrapers |
| `snapshot_api_reference.md` | Snapshot debugging system API |
| `yaml-configuration.md` | YAML selector config reference |
| `browser-lifecycle-management.md` | Browser pooling and session management |
| `structured-logging-guide.md` | Logging and observability guide |

Roadmap

v1.2 (Q2 2026)

Enhanced error recovery mechanisms
GraphQL API integration
Real-time WebSocket updates
ML-based selector optimization

v1.3 (Q3 2026)

Tennis and Hockey sport support
Enhanced telemetry dashboard
Cloud deployment templates

v2.0 (Q4 2026)

ESPN support
Multi-region scraping
SaaS API offering
Advanced analytics dashboard

Contributing

1. Fork and clone the repo
2. Create a feature branch: `git checkout -b feature/my-feature`
3. Follow the existing code style (Black + Ruff)
4. Add tests for new functionality
5. Run `pytest tests/` and `ruff check src/` before submitting
6. Open a PR with a description of what changed and why

License

MIT — see LICENSE for details.

Built with ❤️ by the Scrapamoja Team
