A Python framework for building reliable, extensible web scrapers, with battle-tested resilience, stealth, and observability built in.
Scrapamoja blends the English word Scrape with the Swahili word Pamoja, meaning together. Scrape together. One scraper, many sites. One framework, many contributors.
It's also a quiet nod to Moja, Swahili for one: the idea that you shouldn't need a different tool for every site you want to scrape. One framework should be enough, and it should be good enough that anyone can extend it.
That philosophy shapes everything about how Scrapamoja is built. It's not a scraper; it's the infrastructure that makes scrapers reliable, handling anti-bot measures, selector drift, network failures, and browser resource leaks so you don't have to. New sites can be added by anyone, existing ones improved by the community, and the whole thing grows stronger the more people contribute to it.
Scrape together. Build together.
The selector engine is the heart of Scrapamoja. Instead of brittle single-selector lookups, it uses a multi-strategy approach: CSS, XPath, and text-based selectors can all be defined for the same element. Each strategy is weighted, and the engine picks the best match with a confidence score. When a selector fails, it falls back gracefully rather than crashing. Selectors are defined in YAML, not hardcoded, making them easy to maintain without touching Python.
Site → Sport → Status → Context → Element
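The weighted, multi-strategy lookup described above can be sketched roughly as follows. This is a minimal illustration, not Scrapamoja's actual engine: `Strategy`, `Resolution`, `resolve`, and the scoring rule (weight divided by the number of matches) are all assumptions made for the example, and the "page" is a plain dictionary standing in for a real browser lookup.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Strategy:
    kind: str        # "css", "xpath", or "text"
    selector: str
    weight: float    # relative trust in this strategy


@dataclass
class Resolution:
    strategy: Strategy
    match: str
    confidence: float


def resolve(strategies: List[Strategy],
            find: Callable[[Strategy], List[str]]) -> Optional[Resolution]:
    """Try every strategy and keep the highest-confidence hit (hypothetical helper)."""
    best: Optional[Resolution] = None
    for strategy in strategies:
        try:
            matches = find(strategy)
        except Exception:
            continue  # one broken selector should not abort the others
        if not matches:
            continue
        # Assumed scoring: the weight, discounted when the match is ambiguous.
        confidence = strategy.weight / len(matches)
        if best is None or confidence > best.confidence:
            best = Resolution(strategy, matches[0], confidence)
    return best  # None means "nothing matched", not a crash


# A fake page in which the CSS class has drifted but the XPath still works.
page = {"//div[@data-type='product']": ["<div>Widget</div>"]}
strategies = [
    Strategy("css", ".product-card", 1.0),
    Strategy("xpath", "//div[@data-type='product']", 0.8),
]
result = resolve(strategies, lambda s: page.get(s.selector, []))
print(result.strategy.kind, result.confidence)
```

Returning `None` instead of raising is one way to model the graceful fallback: the caller decides whether a missing element is fatal.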
The resilience layer is built around the assumption that things will go wrong. It provides automatic retries with exponential backoff, failure classification (network vs. selector vs. parse errors), checkpoint-based recovery so long scrapes can resume, and a coordinator that ensures graceful shutdown even mid-scrape.
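As a rough illustration of retries with exponential backoff combined with failure classification, consider the sketch below. The names `FailureKind`, `SelectorError`, `classify`, and `retry` are hypothetical, and the policy of retrying only network failures is an assumption of this example rather than documented Scrapamoja behavior. The defaults mirror the `max_retries: 5` and `retry_delay: 2.0` values from the sample configuration.

```python
import time
from enum import Enum


class FailureKind(Enum):
    NETWORK = "network"
    SELECTOR = "selector"
    PARSE = "parse"


class SelectorError(Exception):
    """Hypothetical error raised when no selector strategy matches."""


def classify(exc: Exception) -> FailureKind:
    # Assumed mapping; anything unrecognized is treated as a parse failure here.
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return FailureKind.NETWORK
    if isinstance(exc, SelectorError):
        return FailureKind.SELECTOR
    return FailureKind.PARSE


def retry(task, max_retries=5, base_delay=2.0, sleep=time.sleep):
    """Run task, retrying network failures with delays of 2s, 4s, 8s, ..."""
    for attempt in range(max_retries + 1):
        try:
            return task()
        except Exception as exc:
            # Only transient (network) failures are worth retrying in this sketch.
            if classify(exc) is not FailureKind.NETWORK or attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)


# Demo: a task that fails twice with a network error, then succeeds.
calls = []
delays = []


def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("connection reset")
    return "ok"


outcome = retry(flaky, sleep=delays.append)  # record delays instead of sleeping
print(outcome, delays)
```

Passing `sleep` as a parameter keeps the backoff logic testable without real waiting; a production version would usually add random jitter to the delay.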
A dedicated stealth module handles fingerprint randomization, human-like behavior simulation, consent popup handling, and proxy rotation. Sites that actively fight scrapers are manageable targets.
When a scrape fails, Scrapamoja captures a full snapshot: the page HTML, a screenshot, structured logs, and selector resolution traces, all correlated by session ID. Debugging a failure means looking at exactly what the browser saw, not guessing.
Structured JSON logging with correlation IDs, built-in metrics collection (execution time, success rates, selector confidence distributions), and alerting hooks. Production scrapers need production-grade monitoring.
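Structured JSON logging with a correlation ID can be approximated with the standard library alone. The sketch below is an assumption about the general shape of such logging, not Scrapamoja's own logger: `JsonFormatter`, `make_logger`, the logger name, and the field names are all invented for illustration.

```python
import io
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set by the LoggerAdapter below; None if logged without one.
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)


def make_logger(stream, correlation_id):
    """Hypothetical helper: a logger that stamps every entry with one ID."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    base = logging.getLogger("scrapamoja.sketch")
    base.handlers = [handler]
    base.setLevel(logging.INFO)
    base.propagate = False
    # LoggerAdapter injects the extra dict into every record it emits.
    return logging.LoggerAdapter(base, {"correlation_id": correlation_id})


buffer = io.StringIO()
session_id = uuid.uuid4().hex
log = make_logger(buffer, session_id)
log.info("selector resolved")

entry = json.loads(buffer.getvalue())
print(entry)
```

Because every entry carries the same ID, logs from one scrape session can be filtered out of an interleaved stream and matched against other artifacts of that session.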
Browser and page pooling, session state persistence, tab management, resource monitoring (memory, CPU), and corruption detection. Long-running scrapers won't leak memory or leave orphaned browser processes.
Scrapamoja chooses the optimal extraction method based on each target site's architecture:
| Mode | Description | Use Case |
|---|---|---|
| DOM Mode (default) | Navigate with browser, extract from HTML | Sites requiring full rendering |
| Direct API Mode | Skip browser, call APIs directly | Open APIs, millisecond latency |
| Network Interception | Capture API responses during browser sessions | Sites requiring browser initialization |
| Hybrid Mode | Browser once to harvest session, then direct HTTP | Sites requiring authenticated sessions |
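The decision captured in the table can be pictured as a small rule set. The sketch below is illustrative only: `Mode`, `SiteProfile`, `choose_mode`, the profile fields, and the order in which the rules are checked are assumptions, and the framework's real selection logic may weigh other factors.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    DOM = "dom"
    DIRECT_API = "direct_api"
    NETWORK_INTERCEPTION = "network_interception"
    HYBRID = "hybrid"


@dataclass
class SiteProfile:
    """Hypothetical description of a target site's architecture."""
    has_open_api: bool = False
    needs_authenticated_session: bool = False
    api_needs_browser_init: bool = False


def choose_mode(profile: SiteProfile) -> Mode:
    # Cheapest option first: an open API needs no browser at all.
    if profile.has_open_api:
        return Mode.DIRECT_API
    # Use the browser once to harvest the session, then plain HTTP.
    if profile.needs_authenticated_session:
        return Mode.HYBRID
    # The API only answers once the browser has initialized the page.
    if profile.api_needs_browser_init:
        return Mode.NETWORK_INTERCEPTION
    # Default: full rendering and extraction from the DOM.
    return Mode.DOM


print(choose_mode(SiteProfile(has_open_api=True)))
print(choose_mode(SiteProfile()))
```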
scrapamoja/
├── src/
│   ├── main.py              # Unified CLI entry point
│   ├── api/                 # FastAPI endpoints and schemas
│   ├── sites/               # Site implementations
│   │   ├── _template/       # Full-featured template for new sites
│   │   ├── base/            # BaseSiteScraper, registry, DI container
│   │   ├── direct/          # Direct API mode implementations
│   │   ├── flashscore/      # FlashScore scraper (Basketball, Football)
│   │   └── wikipedia/       # Wikipedia scraper
│   ├── selectors/           # Selector engine (YAML-driven, multi-strategy)
│   ├── browser/             # Browser lifecycle, sessions, tab management
│   ├── network/             # Network interception, HTTP client, error handling
│   ├── resilience/          # Retries, failure classification, checkpoints
│   ├── stealth/             # Anti-detection, fingerprinting, proxies
│   ├── telemetry/           # Metrics, alerting, audit, reporting
│   ├── navigation/          # Route planning and page discovery
│   ├── extractor/           # Data extraction and transformation
│   ├── observability/       # Structured logging and event system
│   ├── interrupt_handling/  # Graceful shutdown and signal handling
│   ├── models/              # Data models and schemas
│   ├── config/              # Configuration management
│   ├── storage/             # Data persistence and caching
│   ├── core/                # Core utilities and shared components
│   ├── extraction/          # Extraction strategies and processors
│   └── utils/               # Utility functions and helpers
├── ui/                      # Web UI for feature flag management and system monitoring
│   └── app/                 # React/Vite application with TailwindCSS for managing flags, escalations, and audit logs
├── tests/                   # Unit, integration, performance, stealth tests
├── docs/                    # Architecture docs, workflow guides, API reference
├── examples/                # Runnable examples
└── scripts/                 # Migration and validation utilities
- Python 3.12 or higher
- 2GB RAM minimum (4GB recommended)
- Internet connection
- Git (for cloning)
Linux / macOS
# 1. Clone repository
git clone https://github.com/TisoneK/scrapamoja.git
cd scrapamoja
# 2. Create virtual environment
python3 -m venv venv
source venv/bin/activate
# 3. Install dependencies
pip install -r requirements.txt
# 4. Install Playwright browsers
playwright install chromium
# 5. Run your first scrape
python -m src.main flashscore scrape basketball live --limit 1
Windows
:: 1. Clone repository
git clone https://github.com/TisoneK/scrapamoja.git
cd scrapamoja
:: 2. Create virtual environment
python -m venv venv
venv\Scripts\activate
:: 3. Install dependencies
pip install -r requirements.txt
:: 4. Install Playwright browsers
playwright install chromium
:: 5. Run your first scrape
python -m src.main flashscore scrape basketball live --limit 1
{
  "sport": "Basketball",
  "status": "live",
  "matches": [
    {
      "home_team": "Los Angeles Lakers",
      "away_team": "Boston Celtics",
      "score": "89-87",
      "time": "4th Quarter"
    }
  ],
  "total": 1
}
Scrapamoja ships with a full site template at src/sites/_template/. It's not a stub; it's a working skeleton with flows, processors, validators, config management, and component wiring already in place.
1. Copy the template
cp -r src/sites/_template src/sites/mysite
2. Implement the scraper
from src.sites.base.site_scraper import BaseSiteScraper

class MySiteScraper(BaseSiteScraper):
    site_id = "mysite"
    site_name = "My Site"
    base_url = "https://example.com"
3. Define selectors in YAML
# src/sites/mysite/selectors/extraction/listings.yaml
description: "Product listing items"
strategies:
  - type: "css"
    selector: ".product-card"
    weight: 1.0
  - type: "xpath"
    selector: "//div[@data-type='product']"
    weight: 0.8
4. Register and run
# Add to SITE_CLIS in src/main.py
'mysite': ('src.sites.mysite.cli.main', 'MySiteCLI'),
| Site | Data | Sports / Topics | Status Types |
|---|---|---|---|
| FlashScore | Live scores, match stats, odds | Basketball, Football | Live, Finished, Scheduled |
| Wikipedia | Article content, tables, references | Any | N/A |
Both are production implementations. The FlashScore scraper handles live match updates, status-aware extraction, and real-time polling. The Wikipedia scraper handles table parsing, multi-language articles, and reference extraction.
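The SITE_CLIS entry from step 4 pairs a site name with a module path and a class name. One plausible way for an entry point to consume a registry of that shape is to import the module lazily, only when that site is requested. The sketch below is an assumption about how such a mapping might be used, not a description of src/main.py: `load_cli_class` is a hypothetical helper, and the demo entry points at a standard-library class so the example runs anywhere.

```python
import importlib

# Same shape as the README entry: name -> (module path, class name).
SITE_CLIS = {
    # Stand-in entry for the demo; a real entry would look like
    # 'mysite': ('src.sites.mysite.cli.main', 'MySiteCLI').
    "demo": ("json", "JSONDecoder"),
}


def load_cli_class(site: str):
    """Resolve a registered site name to its class, importing on demand."""
    try:
        module_path, class_name = SITE_CLIS[site]
    except KeyError:
        raise ValueError(f"unknown site: {site!r}") from None
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


cli_class = load_cli_class("demo")
print(cli_class.__name__)
```

Lazy resolution of this kind keeps startup fast and means a broken site package only fails when that site is actually invoked.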
browser:
  headless: true
  timeout: 30000
  viewport:
    width: 1920
    height: 1080
scraping:
  max_retries: 5
  retry_delay: 2.0
  rate_limit: 10 # requests per minute
logging:
  level: "INFO"
  structured: true
| Flag | Description |
|---|---|
| --limit N | Cap the number of results |
| --output, -o FORMAT | Output format: json, csv, xml |
| --file, -f PATH | Write output to file |
| --headless / --no-headless | Browser visibility (--no-headless for debugging) |
| --verbose, -v | Detailed logs |
| --quiet, -q | Errors only |
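For readers extending the CLI, the flags in this table could be declared with `argparse` along the following lines. This is a sketch under stated assumptions, not the project's actual parser: the program name, the `json` default for `--output`, and treating `--verbose` and `--quiet` as mutually exclusive are choices made for the example. The `headless` default of true follows the sample configuration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="scrapamoja-sketch")
    parser.add_argument("--limit", type=int, metavar="N",
                        help="cap the number of results")
    parser.add_argument("--output", "-o", choices=["json", "csv", "xml"],
                        default="json", metavar="FORMAT",
                        help="output format: json, csv, xml")
    parser.add_argument("--file", "-f", metavar="PATH",
                        help="write output to file")
    # BooleanOptionalAction generates both --headless and --no-headless.
    parser.add_argument("--headless", action=argparse.BooleanOptionalAction,
                        default=True, help="browser visibility")
    # Assumption: asking for both verbose and quiet output is an error.
    noise = parser.add_mutually_exclusive_group()
    noise.add_argument("--verbose", "-v", action="store_true",
                       help="detailed logs")
    noise.add_argument("--quiet", "-q", action="store_true",
                       help="errors only")
    return parser


args = build_parser().parse_args(["--limit", "1", "-o", "csv", "--no-headless"])
print(args)
```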
| Doc | Description |
|---|---|
| features.md | Complete feature reference |
| modular_template_guide.md | Guide to building new site scrapers |
| snapshot_api_reference.md | Snapshot debugging system API |
| yaml-configuration.md | YAML selector config reference |
| browser-lifecycle-management.md | Browser pooling and session management |
| structured-logging-guide.md | Logging and observability guide |
- Enhanced error recovery mechanisms
- GraphQL API integration
- Real-time WebSocket updates
- ML-based selector optimization
- Tennis and Hockey sport support
- Enhanced telemetry dashboard
- Cloud deployment templates
- ESPN support
- Multi-region scraping
- SaaS API offering
- Advanced analytics dashboard
- Fork and clone the repo
- Create a feature branch: git checkout -b feature/my-feature
- Follow the existing code style (Black + Ruff)
- Add tests for new functionality
- Run pytest tests/ and ruff check src/ before submitting
- Open a PR with a description of what changed and why
MIT. See LICENSE for details.
Built with ❤️ by the Scrapamoja Team