TisoneK/scrapamoja


Scrapamoja 🕷️

A Python framework for building reliable, extensible web scrapers, with battle-tested resilience, stealth, and observability built in.

Python 3.12+ | Playwright | License: MIT


What is Scrapamoja?

Scrapamoja blends the English word Scrape with the Swahili word Pamoja, meaning together. Scrape together. One scraper, many sites. One framework, many contributors.

It's also a quiet nod to Moja, Swahili for one: the idea that you shouldn't need a different tool for every site you want to scrape. One framework should be enough, and it should be good enough that anyone can extend it.

That philosophy shapes everything about how Scrapamoja is built. It's not a scraper; it's the infrastructure that makes scrapers reliable: handling anti-bot measures, selector drift, network failures, and browser resource leaks so you don't have to. New sites can be added by anyone, existing ones improved by the community, and the whole thing grows stronger the more people contribute to it.

Scrape together. Build together.


Core Framework Capabilities

🎯 Intelligent Selector Engine

The selector engine is the heart of Scrapamoja. Instead of brittle single-selector lookups, it uses a multi-strategy approach β€” CSS, XPath, and text-based selectors can all be defined for the same element. Each strategy is weighted, and the engine picks the best match with a confidence score. When a selector fails, it falls back gracefully rather than crashing. Selectors are defined in YAML, not hardcoded, making them easy to maintain without touching Python.

Site β†’ Sport β†’ Status β†’ Context β†’ Element
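The weighted, multi-strategy lookup described above can be sketched roughly as follows. This is a minimal illustration, not Scrapamoja's actual engine: the `Strategy`, `Resolution`, and `resolve` names are hypothetical, the `find` callable stands in for a real CSS/XPath/text lookup, and using the strategy weight directly as the confidence score is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Strategy:
    """One way of locating an element (hypothetical shape)."""
    kind: str      # "css", "xpath", or "text"
    selector: str
    weight: float  # relative trust in this strategy, 0.0 to 1.0


@dataclass
class Resolution:
    strategy: Strategy
    matches: list
    confidence: float


def resolve(
    strategies: list[Strategy],
    find: Callable[[Strategy], list],
    min_confidence: float = 0.5,
) -> Optional[Resolution]:
    """Try strategies from highest to lowest weight; return the first usable match."""
    for strategy in sorted(strategies, key=lambda s: s.weight, reverse=True):
        try:
            matches = find(strategy)
        except Exception:
            continue  # a broken selector falls through to the next strategy
        if not matches:
            continue  # nothing matched: fall back instead of raising
        confidence = strategy.weight  # assumption: weight doubles as confidence
        if confidence >= min_confidence:
            return Resolution(strategy, matches, confidence)
    return None  # every strategy failed; the caller decides what to do
```

Returning `None` rather than raising keeps the fallback decision with the caller, which mirrors the "falls back gracefully rather than crashing" behavior described above.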

🛡️ Resilience System

Built around the assumption that things will go wrong. Automatic retries with exponential backoff, failure classification (network vs. selector vs. parse errors), checkpoint-based recovery so long scrapes can resume, and a coordinator that ensures graceful shutdown even mid-scrape.
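As a rough illustration of retries with exponential backoff combined with failure classification, consider the sketch below. The `classify` and `retry` functions are hypothetical, the mapping from exception types to categories is an assumption, and so is the choice to retry only network failures; the real resilience module may work differently.

```python
import time


def classify(exc: Exception) -> str:
    """Map an exception to a coarse failure category (assumed mapping)."""
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return "network"
    if isinstance(exc, LookupError):
        return "selector"
    if isinstance(exc, ValueError):
        return "parse"
    return "unknown"


def retry(operation, max_retries=5, base_delay=2.0, sleep=time.sleep):
    """Call operation(), retrying network failures with exponential backoff.

    The delay doubles after each failed attempt: base_delay, 2x, 4x, ...
    Non-network failures are re-raised immediately, since retrying a
    selector or parse error is unlikely to help.
    """
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception as exc:
            if classify(exc) != "network" or attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is injectable so the backoff schedule can be tested without actually waiting.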

🕵️ Stealth & Anti-Detection

A dedicated stealth module handles fingerprint randomization, human-like behavior simulation, consent popup handling, and proxy rotation. Sites that actively fight scrapers are manageable targets.

🔍 Snapshot Debugging

When a scrape fails, Scrapamoja captures a full snapshot: the page HTML, a screenshot, structured logs, and selector resolution traces, all correlated by session ID. Debugging a failure means looking at exactly what the browser saw, not guessing.

📊 Telemetry & Observability

Structured JSON logging with correlation IDs, built-in metrics collection (execution time, success rates, selector confidence distributions), and alerting hooks. Production scrapers need production-grade monitoring.
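To make "structured JSON logging with correlation IDs" concrete, here is a minimal sketch built only on the standard library `logging` module. The `JsonFormatter` and `make_logger` names and the field layout are assumptions for illustration; Scrapamoja's own log schema may differ.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set via LoggerAdapter's `extra`; None if the caller gave no ID.
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)


def make_logger(name: str, stream=None) -> logging.Logger:
    """Build a logger that writes JSON lines to `stream` (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# One ID per scrape session ties every log line from that session together.
session_id = uuid.uuid4().hex
log = logging.LoggerAdapter(make_logger("scrape-demo"), {"correlation_id": session_id})
log.info("page loaded")
```

Because every line carries the same `correlation_id`, logs from one session can be filtered out of a shared stream with a plain JSON query.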

🌐 Browser Lifecycle Management

Browser and page pooling, session state persistence, tab management, resource monitoring (memory, CPU), and corruption detection. Long-running scrapers won't leak memory or leave orphaned browser processes.

🔀 Hybrid Extraction Modes

Scrapamoja chooses the optimal extraction method based on each target site's architecture:

| Mode | Description | Use Case |
|------|-------------|----------|
| DOM Mode (default) | Navigate with browser, extract from HTML | Sites requiring full rendering |
| Direct API Mode | Skip browser, call APIs directly | Open APIs, millisecond latency |
| Network Interception | Capture API responses during browser sessions | Sites requiring browser initialization |
| Hybrid Mode | Browser once to harvest session, then direct HTTP | Sites requiring authenticated sessions |

Architecture

scrapamoja/
├── src/
│   ├── main.py                   # Unified CLI entry point
│   ├── api/                      # FastAPI endpoints and schemas
│   ├── sites/                    # Site implementations
│   │   ├── _template/            # Full-featured template for new sites
│   │   ├── base/                 # BaseSiteScraper, registry, DI container
│   │   ├── direct/               # Direct API mode implementations
│   │   ├── flashscore/           # FlashScore scraper (Basketball, Football)
│   │   └── wikipedia/            # Wikipedia scraper
│   ├── selectors/                # Selector engine (YAML-driven, multi-strategy)
│   ├── browser/                  # Browser lifecycle, sessions, tab management
│   ├── network/                  # Network interception, HTTP client, error handling
│   ├── resilience/               # Retries, failure classification, checkpoints
│   ├── stealth/                  # Anti-detection, fingerprinting, proxies
│   ├── telemetry/                # Metrics, alerting, audit, reporting
│   ├── navigation/               # Route planning and page discovery
│   ├── extractor/                # Data extraction and transformation
│   ├── observability/            # Structured logging and event system
│   ├── interrupt_handling/       # Graceful shutdown and signal handling
│   ├── models/                   # Data models and schemas
│   ├── config/                   # Configuration management
│   ├── storage/                  # Data persistence and caching
│   ├── core/                     # Core utilities and shared components
│   ├── extraction/               # Extraction strategies and processors
│   └── utils/                    # Utility functions and helpers
├── ui/                           # Web UI for feature flag management and system monitoring
│   └── app/                      # React/Vite application with TailwindCSS for managing flags, escalations, and audit logs
├── tests/                        # Unit, integration, performance, stealth tests
├── docs/                         # Architecture docs, workflow guides, API reference
├── examples/                     # Runnable examples
└── scripts/                      # Migration and validation utilities

🚀 Quick Start

Prerequisites

  • Python 3.12 or higher
  • 2GB RAM minimum (4GB recommended)
  • Internet connection
  • Git (for cloning)

Installation

Linux / macOS

# 1. Clone repository
git clone https://github.com/TisoneK/scrapamoja.git
cd scrapamoja

# 2. Create virtual environment
python3 -m venv venv
source venv/bin/activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Install Playwright browsers
playwright install chromium

# 5. Run your first scrape
python -m src.main flashscore scrape basketball live --limit 1

Windows

:: 1. Clone repository
git clone https://github.com/TisoneK/scrapamoja.git
cd scrapamoja

:: 2. Create virtual environment
python -m venv venv
venv\Scripts\activate

:: 3. Install dependencies
pip install -r requirements.txt

:: 4. Install Playwright browsers
playwright install chromium

:: 5. Run your first scrape
python -m src.main flashscore scrape basketball live --limit 1

First Results

{
  "sport": "Basketball",
  "status": "live",
  "matches": [
    {
      "home_team": "Los Angeles Lakers",
      "away_team": "Boston Celtics",
      "score": "89-87",
      "time": "4th Quarter"
    }
  ],
  "total": 1
}

Building a New Scraper

Scrapamoja ships with a full site template at src/sites/_template/. It's not a stub; it's a working skeleton with flows, processors, validators, config management, and component wiring already in place.

1. Copy the template

cp -r src/sites/_template src/sites/mysite

2. Implement the scraper

from src.sites.base.site_scraper import BaseSiteScraper

class MySiteScraper(BaseSiteScraper):
    site_id = "mysite"
    site_name = "My Site"
    base_url = "https://example.com"

3. Define selectors in YAML

# src/sites/mysite/selectors/extraction/listings.yaml
description: "Product listing items"
strategies:
  - type: "css"
    selector: ".product-card"
    weight: 1.0
  - type: "xpath"
    selector: "//div[@data-type='product']"
    weight: 0.8

4. Register and run

# Add to SITE_CLIS in src/main.py
'mysite': ('src.sites.mysite.cli.main', 'MySiteCLI'),
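A registry of `(module path, class name)` pairs like the entry above is typically resolved lazily with `importlib`, so a site's code is only imported when that site is requested. The sketch below shows the general pattern under that assumption; `load_cli` is a hypothetical helper, and how `src/main.py` actually consumes `SITE_CLIS` is not shown in this README. A standard-library class stands in for a real site CLI so the example runs anywhere.

```python
import importlib

# Hypothetical registry in the same (module path, class name) shape.
# "json"/"JSONDecoder" is a stand-in target, not a real site CLI.
SITE_CLIS = {
    "demo": ("json", "JSONDecoder"),
}


def load_cli(site: str, registry: dict = SITE_CLIS):
    """Import the module registered for `site` and return the named class."""
    try:
        module_path, class_name = registry[site]
    except KeyError:
        raise ValueError(f"Unknown site: {site!r}") from None
    module = importlib.import_module(module_path)  # imported only on demand
    return getattr(module, class_name)


cli_class = load_cli("demo")
print(cli_class.__name__)
```

Storing strings rather than imported classes means a broken or heavy site module cannot slow down or crash the CLI for other sites.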

Supported Sites

| Site | Data | Sports / Topics | Status Types |
|------|------|-----------------|--------------|
| FlashScore | Live scores, match stats, odds | Basketball, Football | Live, Finished, Scheduled |
| Wikipedia | Article content, tables, references | Any | N/A |

Both are production implementations. The FlashScore scraper handles live match updates, status-aware extraction, and real-time polling. The Wikipedia scraper handles table parsing, multi-language articles, and reference extraction.


Configuration

Global config (config.yaml)

browser:
  headless: true
  timeout: 30000
  viewport:
    width: 1920
    height: 1080

scraping:
  max_retries: 5
  retry_delay: 2.0
  rate_limit: 10  # requests per minute

logging:
  level: "INFO"
  structured: true
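To show what a setting like `rate_limit: 10` (requests per minute) implies in practice, the sketch below spaces requests at least 60 / 10 = 6 seconds apart. `MinIntervalLimiter` is a hypothetical class written for this example; whether Scrapamoja enforces the limit with fixed spacing, a token bucket, or something else is an assumption.

```python
import time


class MinIntervalLimiter:
    """Enforce a minimum gap between calls, derived from a per-minute limit."""

    def __init__(self, per_minute: float, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / per_minute  # e.g. 10 per minute -> 6.0 seconds
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # no request made yet

    def wait(self) -> None:
        """Block until the next request is allowed, then reserve the slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.interval


limiter = MinIntervalLimiter(per_minute=10)
limiter.wait()  # the first call returns immediately
```

Calling `limiter.wait()` before each request keeps the scraper under the configured rate; the clock and sleep functions are injectable so the behavior can be tested without real delays.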

CLI options

| Flag | Description |
|------|-------------|
| --limit N | Cap the number of results |
| --output, -o FORMAT | Output format: json, csv, xml |
| --file, -f PATH | Write output to file |
| --headless / --no-headless | Browser visibility (--no-headless for debugging) |
| --verbose, -v | Detailed logs |
| --quiet, -q | Errors only |
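For readers extending the CLI for a new site, the flags above can be mirrored with the standard library `argparse` module roughly as follows. This is a sketch, not the parser in `src/main.py`: the `build_parser` name, the default values, and making `--verbose` and `--quiet` mutually exclusive are all assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented flags (defaults are assumed)."""
    parser = argparse.ArgumentParser(prog="scrape-demo")
    parser.add_argument("--limit", type=int, metavar="N",
                        help="cap the number of results")
    parser.add_argument("--output", "-o", choices=["json", "csv", "xml"],
                        default="json", metavar="FORMAT", help="output format")
    parser.add_argument("--file", "-f", metavar="PATH",
                        help="write output to file")
    # BooleanOptionalAction generates both --headless and --no-headless.
    parser.add_argument("--headless", action=argparse.BooleanOptionalAction,
                        default=True, help="browser visibility")
    verbosity = parser.add_mutually_exclusive_group()
    verbosity.add_argument("--verbose", "-v", action="store_true",
                           help="detailed logs")
    verbosity.add_argument("--quiet", "-q", action="store_true",
                           help="errors only")
    return parser


args = build_parser().parse_args(["--limit", "1", "-o", "csv", "--no-headless"])
print(args.limit, args.output, args.headless)
```

`argparse.BooleanOptionalAction` requires Python 3.9 or later, which is covered by the project's Python 3.12+ requirement.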

Documentation

| Doc | Description |
|-----|-------------|
| features.md | Complete feature reference |
| modular_template_guide.md | Guide to building new site scrapers |
| snapshot_api_reference.md | Snapshot debugging system API |
| yaml-configuration.md | YAML selector config reference |
| browser-lifecycle-management.md | Browser pooling and session management |
| structured-logging-guide.md | Logging and observability guide |

Roadmap

v1.2 (Q2 2026)

  • Enhanced error recovery mechanisms
  • GraphQL API integration
  • Real-time WebSocket updates
  • ML-based selector optimization

v1.3 (Q3 2026)

  • Tennis and Hockey sport support
  • Enhanced telemetry dashboard
  • Cloud deployment templates

v2.0 (Q4 2026)

  • ESPN support
  • Multi-region scraping
  • SaaS API offering
  • Advanced analytics dashboard

Contributing

  1. Fork and clone the repo
  2. Create a feature branch: git checkout -b feature/my-feature
  3. Follow the existing code style (Black + Ruff)
  4. Add tests for new functionality
  5. Run pytest tests/ and ruff check src/ before submitting
  6. Open a PR with a description of what changed and why

License

MIT. See LICENSE for details.


Built with ❤️ by the Scrapamoja Team

About

Scrapamoja is a Python scraping framework combining Playwright browser automation with direct HTTP API extraction. Supports DOM scraping, network interception, Cloudflare bypass, protobuf decoding, and session harvesting for modern SPAs.
