A production-grade, high-throughput, fault-tolerant distributed task queue implementing the Reliable Queue Pattern. The system guarantees at-least-once delivery through atomic state transitions and a lease-based crash recovery reaper, verified under concurrent worker failures by automated chaos testing. Features include multi-tier priority queues, delayed execution, batch ingestion, and native Prometheus instrumentation.
The system decouples task ingestion from execution via Redis queues. A background dispatcher manages scheduled executions, while concurrent workers poll tasks sequentially in priority order.
flowchart TD
Client -->|POST /task or /tasks| Producer[Producer :8085]
Producer -->|ZADD| DelayedQueue[(tasks:delayed)]
Producer -->|LPUSH| PendingHigh[(tasks:pending:high)]
Producer -->|LPUSH| PendingDefault[(tasks:pending:default)]
Producer -->|LPUSH| PendingLow[(tasks:pending:low)]
Dispatcher[Delayed Dispatcher] -->|ZRANGEBYSCORE + LPUSH| PendingHigh
Dispatcher -->|ZRANGEBYSCORE + LPUSH| PendingDefault
Dispatcher -->|ZRANGEBYSCORE + LPUSH| PendingLow
DelayedQueue -.->|Read & remove| Dispatcher
Worker[Worker Pool] -->|LMOVE / BLMOVE| PendingHigh
Worker -->|LMOVE / BLMOVE| PendingDefault
Worker -->|LMOVE / BLMOVE| PendingLow
PendingHigh -.->|Popped into| Processing[(tasks:processing)]
PendingDefault -.->|Popped into| Processing
PendingLow -.->|Popped into| Processing
Worker -->|Ack: LREM| Processing
Worker -->|Fail: ZADD retry| DelayedQueue
Worker -->|Fail: LPUSH DLQ| DLQ[(tasks:dead_letter)]
Reaper[Reaper] -->|Scan expired leases| Processing
Reaper -->|Reclaim: LREM + LPUSH| PendingHigh
Reaper -->|Reclaim: LREM + LPUSH| PendingDefault
Reaper -->|Reclaim: LREM + LPUSH| PendingLow
Monitor[Monitor Dashboard] -->|LLEN / ZCARD| DelayedQueue
Monitor -->|LLEN| PendingHigh
Monitor -->|LLEN| PendingDefault
Monitor -->|LLEN| PendingLow
Monitor -->|LLEN| Processing
Monitor -->|LLEN| DLQ
Workers lease tasks during execution by setting a key in Redis (task:lease:<id>) for LEASE_TTL_SECONDS. While the task is active, a per-task goroutine refreshes this lease every LEASE_TTL_SECONDS / 3.
If a worker process dies mid-flight (due to OOM, host crash, or SIGKILL), its lease key expires. A background reaper sweeps the tasks:processing queue, detecting orphaned tasks without leases and reclaiming them back into their original priority queues atomically via a Lua script.
With default configurations (30 s lease + 5 s sweep), crashed worker tasks are safely recovered within 35 seconds with zero task loss.
Producers can opt into deduplication for non-idempotent task handlers by attaching an idempotency_key field to the task payload during enqueueing. Workers skip execution if a task with the same key has already been executed within the configured time window.
- How to use: Add
"idempotency_key": "some-unique-string"when enqueueing a task. - Guarantee: At most one successful execution per key within the TTL window.
- Config: The
IDEMPOTENCY_KEY_TTL_SECONDSenv var controls how long keys are cached (default86400= 24h). - Pattern: A claim-then-confirm pattern is used:
- Claim: On start of processing, the worker sets
task:done:<idempotency_key>to"in_flight"with a short5-minuteTTL (lease window). If the key already exists, the task is skipped and marked as duplicate. - Confirm: Upon successful execution, the key's value is set to
"done"and its TTL is extended toIDEMPOTENCY_KEY_TTL_SECONDS. - Release: If execution fails, the worker deletes the key so that subsequent retries can run successfully.
- Claim: On start of processing, the worker sets
All configuration is driven by environment variables:
| Variable | Default | Description |
|---|---|---|
REDIS_ADDR |
localhost:6379 |
Redis host and port address |
REDIS_DB |
0 |
Redis database index |
WORKER_COUNT |
1 |
Concurrency: number of worker goroutines per worker node |
LEASE_TTL_SECONDS |
30 |
Task lease TTL (seconds) before crash recovery triggers |
REAPER_ENABLED |
true |
Runs the orphaned-task recovery sweep |
REAPER_INTERVAL_SECONDS |
5 |
Latency between reaper sweeps (seconds) |
DISPATCHER_ENABLED |
true |
Runs the scheduled-tasks dispatcher |
DISPATCHER_INTERVAL_SECONDS |
1 |
Interval between scheduled-task sweeps (seconds) |
DISPATCHER_INTERVAL_MS |
0 |
Sub-second dispatcher override (milliseconds, test use only) |
WORKER_SLEEP_MS |
1000 |
Mock processing delay per task (milliseconds) |
WORKER_FAILURE_PCT |
25 |
Mock task processing failure rate (percentage: 0-100) |
WORKER_METRICS_ADDR |
:8086 |
Listen port for worker Prometheus metrics |
Each binary exposes a /metrics endpoint compatible with Prometheus:
- Producer:
http://localhost:8085/metrics - Worker:
http://localhost:8086/metrics - Monitor:
http://localhost:8082/metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
queue_depth |
Gauge | queue (pending_high, pending_default, pending_low, delayed, processing, dlq) |
Current depth of each state queue |
task_duration_seconds |
Histogram | type, status (success, failure) |
Duration of task execution |
task_enqueue_total |
Counter | type |
Total count of ingested tasks |
task_retries_total |
Counter | type |
Total count of retried tasks |
task_reclaimed_total |
Counter | — | Total count of recovered crashed tasks |
A pre-configured Grafana dashboard template is available at deploy/grafana/queue.json.
Boot up the complete Go Distributed Queue stack (Redis, Producer API, 5-Worker Pool, and Monitor Dashboard):
docker compose up -d --build- Monitor Dashboard (HTML):
http://localhost:8082 - Monitor Dashboard (JSON):
http://localhost:8082/stats
Single Task (with priority and execute_at scheduling):
curl -X POST http://localhost:8085/task \
-H "Content-Type: application/json" \
-d '{
"type": "email",
"payload": "user@example.com",
"priority": "high",
"execute_at": "2026-06-03T12:00:00Z"
}'Batch Tasks (pipelined high-throughput):
curl -X POST http://localhost:8085/tasks \
-H "Content-Type: application/json" \
-d '[
{"type": "job", "payload": "1", "priority": "high"},
{"type": "job", "payload": "2", "priority": "low"},
{"type": "job", "payload": "3"}
]'Run tests locally with a running Redis instance or using mock miniredis:
# 1. Run Unit Tests (mocked via miniredis)
go test ./... -count=1
# Proves: logical correctness of client operations, dispatcher priority routing, and parser helpers.
# 2. Run Integration Tests (requires docker-compose up redis)
go test -tags=integration ./... -count=1 -timeout 60s
# Proves: end-to-end task flows, actual recovery of crashed workers, and execution scheduling accuracy.
# 3. Run Chaos Tests (requires docker-compose up redis)
go test -tags=chaos ./internal/reaper/... -count=1 -timeout 120s
# Proves: zero task loss and zero duplicates under simulated worker crashes and restarts.- Authentication: Secure the ingestion endpoints (
/taskand/tasks) using Bearer Token middleware. - Dynamic Scaling: Implement horizontal worker scaling based on
queue_depththresholds.