dshovchko/job-ripper

job-ripper (jori)

Zero-dependency CLI that rips through CPU-heavy jobs using Node.js worker_threads.

Feed it a file list. Give it a worker script. Chain workers like Unix pipes. It saturates all your CPU cores in parallel — no config, no boilerplate, no dependencies.


Benchmarks

Benchmark scenario: process files from node_modules across three workload profiles (CPU-bound compression, Markdown rendering, JSON schema validation). Full results for the Intel Core Ultra 7 155U and AMD EPYC 9645 are in benchmarks/README.md.

Quick numbers — brotli + pbkdf2-sha256, Intel Core Ultra 7 155U, c=10:

| Approach | Mean time | vs. single-thread |
| --- | --- | --- |
| xargs (process per file) | 32 s | 3.4× slower |
| Single-threaded loop | 9.4 s | baseline |
| job-ripper | 1.6 s | 6× faster |

Concurrency starting point: 75-100% of cores for CPU-bound tasks, 50-75% for mixed workloads, 15-25% for light ones. For nearly pure I/O, 1-2 workers are enough: the worker still keeps the main thread unblocked even without parallelism.

Run your own baseline and read the full analysis in benchmarks/README.md:

# Requires hyperfine — install instructions in benchmarks/README.md
cd benchmarks
npm run bench:md-html -- -c 8

When to use

✅ CPU-bound work — this is what jori is for:

  • Transpiling / compiling files (TS → JS, SCSS → CSS)
  • Image / video encoding and resizing
  • Markdown → HTML, PDF generation
  • Hash computation, encryption, compression
  • JSON schema validation (large files or schemas), data transformation
  • Static analysis, linting, code formatting

❌ I/O-bound work, where a plain concurrency limiter is the better tool:

Spawning 8 workers to read 8 files simultaneously won't help if your bottleneck is disk throughput or a remote API rate limit. In those cases, plain Promise.all with a concurrency limiter (e.g. p-limit) is simpler and equally fast.
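
For the I/O-bound case, limiting concurrency needs no worker threads at all. The sketch below is a minimal, dependency-free stand-in for what a library such as p-limit provides. `mapWithLimit` and its parameters are illustrative names, not part of job-ripper or p-limit.

```javascript
// Run `task(item)` for every item with at most `limit` promises in flight.
// Results come back in input order. Illustrative helper only.
async function mapWithLimit(items, limit, task) {
  const results = new Array(items.length);
  let next = 0;

  // Each runner claims the next unprocessed index until none are left.
  async function runner() {
    while (next < items.length) {
      const index = next++;
      results[index] = await task(items[index], index);
    }
  }

  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}
```

A call such as `await mapWithLimit(urls, 4, (url) => fetch(url))` would keep at most four requests open at a time while everything stays on the main thread.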


Install

npm install -g job-ripper          # global CLI
# or
npm install job-ripper             # local, for programmatic use in production (see API section)
# or
npm install --save-dev job-ripper  # local, for use in build scripts / dev tooling only

Requires Node.js ≥ 22.


Quick Start

⚠️ Security note: Only run workers you trust. A worker script executes with full Node.js privileges and can read or modify any file the running user has access to. Review third-party code before use.

1. Write a worker (compress.mjs):

import { gzipSync } from 'node:zlib';
import { readFileSync, writeFileSync } from 'node:fs';

export default async function(filePath, _args) {
  const data = readFileSync(filePath);
  const compressed = gzipSync(data);
  writeFileSync(filePath + '.gz', compressed);
}

2. Run it:

$ jori "src/**/*.js" -w compress.mjs -c 50%

Using concurrency: 6

--- Processing Complete ---
Total files: 312
Success:     312
Failed:      0
Time:        2.41s

That's it. No config files, no require() wrappers, no callbacks.


How it works

                        main thread
                    ┌───────────────┐
   glob / stdin ──► │  file queue   │
                    │               │
                    │  backpressure │ ◄── maxQueue limit
                    └──────┬────────┘
                           │ dispatch
            ┌──────────────┼──────────────┐
            ▼              ▼              ▼
      ┌──────────┐   ┌──────────┐   ┌──────────┐
      │ worker 1 │   │ worker 2 │   │ worker N │  ← N = -c value (default: cpus × 0.75)
      │ (your fn)│   │ (your fn)│   │ (your fn)│
      └────┬─────┘   └────┬─────┘   └────┬─────┘
           └──────────────┼──────────────┘
                          ▼
                  result / error ──► logged to stderr + exit code

Architecture notes:

  • Workers are pre-spawned once at startup (warm pool — no per-file overhead).
  • The main thread reads files and dispatches tasks; it never runs user code.
  • An internal queue with backpressure prevents the in-memory task list from growing unbounded on slow workers.
  • Task-level errors (throw inside your function) are counted as failures and printed to stderr; processing always continues for remaining files. By default the process exits with code 1 if any task failed. Pass -k / --keep-going to exit 0 instead.
  • Fatal errors (worker crash, module not found, missing default export) halt the entire run immediately with a clear message.
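
The backpressure idea in the notes above can be illustrated with a small bounded queue. This is a sketch of the general technique, not job-ripper's internal code. The class name `BoundedQueue` and its methods are assumptions made for illustration, and `maxQueue` simply borrows the label from the diagram.

```javascript
// A bounded FIFO queue: push() waits while the queue is full, so a fast
// producer (glob / stdin) cannot outrun slow consumers (workers).
class BoundedQueue {
  constructor(maxQueue) {
    this.maxQueue = maxQueue;
    this.items = [];
    this.waitingProducers = []; // resolve callbacks of blocked push() calls
  }

  async push(item) {
    // Re-check after every wake-up: another producer may have taken the slot.
    while (this.items.length >= this.maxQueue) {
      await new Promise((resolve) => this.waitingProducers.push(resolve));
    }
    this.items.push(item);
  }

  // Called when a worker becomes free: removes one item and wakes one producer.
  shift() {
    const item = this.items.shift();
    const wake = this.waitingProducers.shift();
    if (wake) wake();
    return item;
  }

  get size() {
    return this.items.length;
  }
}
```

With a limit of 2, a third `push()` stays pending until some consumer calls `shift()`, which is what keeps memory use flat when workers are slow.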

Usage

CLI

Usage:
  jori <glob> -w <worker> [options] [-- worker_args...]
  <command> | jori -w <worker> [options] [-- worker_args...]

Arguments:
  <glob>                 File glob pattern or path to a single file

Options:
  -w, --worker <path>    Path to the worker script (required)
  -c, --concurrency <N>  Number of workers or CPU percentage (e.g., 4 or 75%, default: 75%)
  -v, --verbose          Print each processed file and detailed statistics
  -s, --silent           Suppress all non-fatal output (including worker error messages)
  -k, --keep-going       Exit 0 even if some tasks failed (default: exit 1 on any failure)
  --dry-run              Print matched files without running workers
  -h, --help             Show this help message

Concurrency formats:

| Value | Meaning |
| --- | --- |
| 4 | Exactly 4 workers |
| 50% | 50% of logical CPU cores (rounded down, min 1) |
| (omitted) | Same as 75% |
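
The rules in the table above can be written as one small function. This is a sketch of the documented behaviour, not jori's own parser. `resolveConcurrency` is a hypothetical name, and the error handling for invalid values is an assumption.

```javascript
// Turn a -c style value ("4", "50%", or undefined) into a worker count.
// `cpuCount` is passed in so the logic can be checked without a real machine.
function resolveConcurrency(value, cpuCount) {
  const spec = value === undefined ? '75%' : String(value).trim();

  if (spec.endsWith('%')) {
    const percent = Number(spec.slice(0, -1));
    if (!Number.isFinite(percent) || percent <= 0) {
      throw new RangeError(`Invalid concurrency percentage: ${spec}`); // assumed behaviour
    }
    // Rounded down, but never fewer than one worker.
    return Math.max(1, Math.floor((cpuCount * percent) / 100));
  }

  const count = Number(spec);
  if (!Number.isInteger(count) || count < 1) {
    throw new RangeError(`Invalid concurrency value: ${spec}`); // assumed behaviour
  }
  return count;
}

console.log(resolveConcurrency('50%', 12));    // 6
console.log(resolveConcurrency('4', 12));      // 4
console.log(resolveConcurrency(undefined, 8)); // 6 (default 75%)
console.log(resolveConcurrency('10%', 4));     // 1 (0.4 rounds down to 0, min 1 applies)
```

In a real script the core count could come from, for example, `os.availableParallelism()` in `node:os`.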

Glob mode

jori "src/**/*.ts" -w build.mjs
jori "images/**/*.png" -w resize.mjs -c 8

Stdin / pipeline mode

When no <glob> argument is given, jori reads file paths from stdin (one per line). This enables Unix-style pipelines:

find . -name "*.log" -mtime -7 | jori -w analyze.mjs
cat file-list.txt              | jori -w process.mjs -c 4

Worker Contract

A worker is any ESM module that exports a default function:

/**
 * @param filePath - Absolute path to the file to process.
 * @param args - Extra arguments forwarded from CLI: `jori ... -- --flag value`.
 * @returns Optional value forwarded to `onSuccess(filePath, result)` in the programmatic API.
 */
export default async function(filePath: string, args: string[]): Promise<unknown> {
  // ...
}

The type annotation above is for documentation purposes only — TypeScript is not required. A plain .mjs / .cjs module works exactly the same.

The signature is async — jori correctly awaits the result, so both sync and async bodies work. However:

Prefer sync APIs inside the body. Each worker runs in its own dedicated thread — blocking it is intentional and expected. readFileSync, gzipSync, createHash etc. avoid unnecessary Promise/microtask overhead. Reserve async for cases where you genuinely need it (e.g. calling an external HTTP API).

Minimal example:

// transform.mjs
import { readFileSync, writeFileSync } from 'node:fs';

export default async function(filePath) {
  const src = readFileSync(filePath, 'utf8');
  writeFileSync(filePath, src.toUpperCase());
}

Error handling:

| What you do | What jori does |
| --- | --- |
| throw new Error(...) | Counts as failed, printed to stderr by default (suppressed with --silent). By default the process exits with code 1 after all files are processed. With -k / --keep-going the run finishes normally and exits 0. |
| Return normally | Counts as success; return value is forwarded to onSuccess in the programmatic API |
| Module has no default export | Fatal error: the run stops immediately with a clear message |
| Module file not found | Fatal error: the run stops immediately |
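
The task-level rows of the table can be read as a small decision function. The sketch below only models throw-versus-return outcomes and the effect of `--keep-going`. `summarize` and the shape of `outcomes` are assumptions for illustration, and fatal errors are left out because their exit code is not specified here.

```javascript
// Each outcome is { filePath, error? }; an `error` property marks a thrown task.
function summarize(outcomes, { keepGoing = false } = {}) {
  const failed = outcomes.filter((o) => o.error !== undefined).length;
  const success = outcomes.length - failed;
  // Any failure means exit code 1 unless --keep-going was requested.
  const exitCode = failed > 0 && !keepGoing ? 1 : 0;
  return { total: outcomes.length, success, failed, exitCode };
}

const outcomes = [
  { filePath: '/abs/a.json' },
  { filePath: '/abs/b.json', error: new Error('Unexpected token') },
];

console.log(summarize(outcomes));
// { total: 2, success: 1, failed: 1, exitCode: 1 }
console.log(summarize(outcomes, { keepGoing: true }).exitCode); // 0
```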

Examples

Looking for more? Check out the examples/ directory in the repository for ready-to-use worker scripts and practical use cases.

Pipeline chain (Unix pipes)

After processing each file, jori echoes the resolved file path to stdout. If the input path was relative (for example from find . -name "*.md"), later pipeline stages will receive the resolved absolute path emitted by the previous stage. Worker scripts are responsible for writing derived files (e.g. .html) to disk themselves; the pipeline does not rewrite paths to derived filenames between stages.

# stage 1: md → html   (writes .html files alongside .md)
# stage 2: minify html (reads and overwrites .html files by convention in minify.mjs)
# stage 3: upload       (I/O-limited; upload.mjs derives the .html path from the .md path)
find . -name "*.md" \
  | jori -w render.mjs  -c 4 \
  | jori -w minify.mjs  -c 4 \
  | jori -w upload.mjs  -c 2
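
Because every stage receives the original .md path, a later stage has to derive the .html path itself. A helper along these lines could live inside minify.mjs or upload.mjs. The name `htmlPathFor` is hypothetical, and returning `null` for non-Markdown paths is just one possible convention.

```javascript
// Map "/abs/docs/guide.md" to "/abs/docs/guide.html".
// Returns null for paths that do not end in .md so the caller can skip them.
function htmlPathFor(mdPath) {
  if (!/\.md$/i.test(mdPath)) return null;
  return mdPath.replace(/\.md$/i, '.html');
}

console.log(htmlPathFor('/abs/docs/guide.md')); // /abs/docs/guide.html
console.log(htmlPathFor('/abs/notes.txt'));     // null
```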

With find, fdir, or fast-glob

# find
find ./src -name "*.ts" -not -path "*/node_modules/*" \
  | jori -w compile.mjs

# fdir (fastest directory crawler)
node --input-type=module << 'EOF' | jori -w compile.mjs
import { fdir } from 'fdir';
const files = new fdir().glob('**/*.ts').crawl('./src').sync();
process.stdout.write(files.join('\n'));
EOF

# fast-glob
node --input-type=module << 'EOF' | jori -w compile.mjs -c 75%
import fg from 'fast-glob';
for (const f of await fg('src/**/*.ts')) console.log(f);
EOF

Dry-run before a destructive operation

# Step 1: preview matched files
jori "logs/**/*.log" -w archive.mjs --dry-run

# Step 2: run for real
jori "logs/**/*.log" -w archive.mjs

Pass arguments to the worker

jori "data/*.json" -w transform.mjs -- --format=pretty --locale=uk

Inside the worker, args is the array of strings after --:

export default async function(filePath, args) {
  const isPretty = args.includes('--format=pretty');
  // ...
}
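
For more than one or two flags, `args.includes(...)` gets repetitive. A small parser for `--name=value` style arguments might look like the sketch below. `parseWorkerArgs` is a hypothetical helper, not something job-ripper provides. Node's built-in `util.parseArgs` is another option.

```javascript
// Convert ["--format=pretty", "--locale=uk", "--strict"] into
// { format: "pretty", locale: "uk", strict: true }.
function parseWorkerArgs(args = []) {
  const options = {};
  for (const arg of args) {
    if (!arg.startsWith('--')) continue; // positional values are ignored in this sketch
    const [name, ...rest] = arg.slice(2).split('=');
    if (name === '') continue;
    // Re-join so that values containing "=" survive intact.
    options[name] = rest.length > 0 ? rest.join('=') : true;
  }
  return options;
}

console.log(parseWorkerArgs(['--format=pretty', '--locale=uk', '--strict']));
// { format: 'pretty', locale: 'uk', strict: true }
```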

Programmatic API

CLI vs Programmatic API: The CLI prints each processed file path to stdout and ignores worker return values — it is designed for pipelines where the output is a stream of file paths. If your workers compute results that you need to collect (hashes, metadata, transformed data), use the programmatic API: the onSuccess(filePath, result) callback receives whatever the worker function returns.

import { processFiles } from 'job-ripper';

const result = await processFiles({
  files: ['a.ts', 'b.ts'],        // string[] | Iterable | AsyncIterable
  workerPath: './compile.mjs',    // path to worker module
  concurrency: 4,                 // optional, default: cpus × 0.75
  workerArgs: ['--strict'],       // forwarded to worker as args[]
  dryRun: false,                  // set to true to skip actual processing
  onSuccess: (f, result) => console.log('✓', f, result),
  onTaskError: (f, err) => console.error('✗', f, err.message),
});

console.log(result);
// { total: 2, success: 2, failed: 0, durationMs: 310, concurrency: 4 }

The files parameter accepts any iterable or async iterable — arrays, generators, fast-glob streams, fdir crawlers, database cursors, etc.
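
As an illustration of an async iterable source, the generator below turns a stream of text chunks into one path per line, much like reading a file list from stdin. It is a sketch under the assumption that blank lines should be skipped. `readPaths` is a hypothetical name.

```javascript
// Yield one non-empty line per path from an (async) iterable of chunks.
// Chunks may be strings or Uint8Array/Buffer and may split a line anywhere.
async function* readPaths(chunks) {
  const decoder = new TextDecoder();
  let pending = '';
  for await (const chunk of chunks) {
    pending += typeof chunk === 'string' ? chunk : decoder.decode(chunk, { stream: true });
    const lines = pending.split(/\r?\n/);
    pending = lines.pop(); // the last piece is an incomplete line (or '')
    for (const line of lines) {
      if (line.trim() !== '') yield line;
    }
  }
  pending += decoder.decode(); // flush any buffered bytes
  if (pending.trim() !== '') yield pending;
}
```

Since `files` accepts async iterables, a generator like this could be passed as `files: readPaths(process.stdin)`.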

Returning values from workers: When your worker function returns a value, it is serialized via postMessage (structured clone) and forwarded as the second argument of onSuccess(filePath, result). Keep returned values small and structured-clone-compatible; large objects add IPC overhead.

// hash-worker.mjs
import { readFileSync } from 'node:fs';
import { createHash } from 'node:crypto';

export default async function(filePath) {
  const hash = createHash('sha256').update(readFileSync(filePath)).digest('hex');
  return { filePath, hash };
}

// main.mjs
import { processFiles } from 'job-ripper';

const hashes = [];
await processFiles({
  files: ['a.bin', 'b.bin'],
  workerPath: './hash-worker.mjs',
  onSuccess: (filePath, result) => hashes.push(result),
});
console.log(hashes);
// [{ filePath: '/abs/a.bin', hash: '3e2b...' }, { filePath: '/abs/b.bin', hash: 'f1a0...' }]
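
To check ahead of time whether a return value is structured-clone-compatible, the global `structuredClone` can serve as a quick probe, since it applies the same algorithm that postMessage uses. `isCloneable` is a hypothetical helper meant for tests or debugging. It makes a full copy, so it is not something to call on every result in production.

```javascript
// Returns true when `value` survives the structured clone algorithm.
function isCloneable(value) {
  try {
    structuredClone(value);
    return true;
  } catch {
    return false;
  }
}

console.log(isCloneable({ hash: '3e2b', size: 1024 }));  // true
console.log(isCloneable(new Map([['a.bin', '3e2b']])));  // true
console.log(isCloneable({ toHex: () => '3e2b' }));       // false: functions cannot be cloned
```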

Error handling: Task-level errors (throws inside your worker function) are counted in result.failed and passed to onTaskError when that callback is provided; without it they stay silent. Check result.failed after the call and decide what to do:

const result = await processFiles({
  // ...
  onTaskError: (filePath, error) => {
    console.error(`Failed: ${filePath}: ${error.message}`);
  },
});
if (result.failed > 0) {
  console.error(`${result.failed} files failed`);
  process.exit(1);
}

Performance Tips

Prefer sync APIs inside worker bodies

The worker signature is async, but the code inside should be sync whenever possible. worker_threads gives your function its own OS thread — blocking it is intentional. Sync fs, zlib, and crypto calls avoid Promise/microtask overhead:

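// ✅ preferred: sync body inside an async worker
import { readFileSync, writeFileSync } from 'node:fs';
import { gzipSync } from 'node:zlib';

export default async function(filePath) {
  const data = readFileSync(filePath);
  writeFileSync(filePath + '.gz', gzipSync(data));
}

// ❌ unnecessary async overhead: the thread is already dedicated to you
// (a separate worker file, shown for contrast)
import { readFile, writeFile } from 'node:fs/promises';
import { promisify } from 'node:util';
import { gzip as gzipCallback } from 'node:zlib';
const gzip = promisify(gzipCallback); // zlib.gzip is callback-based

export default async function(filePath) {
  const data = await readFile(filePath);
  await writeFile(filePath + '.gz', await gzip(data));
}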

Pick concurrency for your task weight

| Task weight | Computation per file | Recommended -c |
| --- | --- | --- |
| Light | < 10 ms (JSON parse, regex) | 25%: tasks finish faster than IPC overhead; extra workers mostly idle |
| Medium | 10-200 ms (transpile, lint) | 50-75% (default 75%) |
| Heavy | > 200 ms (image encode, PDF) | 75-100%: long tasks justify saturating every core |
| I/O-bound | network / disk limited | use p-limit, not jori |

Why does lighter work need fewer workers? When each task completes in < 10 ms, the bottleneck shifts from CPU to the IPC round-trip between the main thread and the workers. Spawning more workers than the main thread can keep supplied with tasks adds synchronization noise without adding throughput. For heavy tasks the opposite is true: each thread stays busy for hundreds of milliseconds, so every extra core translates directly into lower wall time.
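
A back-of-the-envelope model makes the trade-off concrete. Assume dispatching one task costs the main thread a fixed `dispatchMs` and a worker needs `taskMs` to finish it. The main thread can then keep roughly `taskMs / dispatchMs` workers busy. Both parameter names and the numbers below are hypothetical, chosen only to show the shape of the effect. They are not measurements of jori.

```javascript
// Largest worker count the main thread can keep busy, assuming a fixed
// per-task dispatch cost on the main thread (a simplifying assumption).
function usefulWorkers(taskMs, dispatchMs, cpuCount) {
  const saturating = Math.ceil(taskMs / dispatchMs);
  return Math.max(1, Math.min(cpuCount, saturating));
}

console.log(usefulWorkers(2, 0.5, 16));   // 4: light tasks, extra workers would idle
console.log(usefulWorkers(300, 0.5, 16)); // 16: heavy tasks, every core stays busy
```

Measuring with your own workload remains the reliable way to pick a value; the model only explains why the recommended percentages move in the direction they do.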

Keep the main thread free

Your worker does the CPU work. The main thread only dispatches tasks. Avoid heavy computation inside onSuccess callbacks — those run on the main thread and will create a bottleneck.

Pipeline tuning

Each stage in a pipeline has its own concurrency budget. Tune them to match the weight of each step:

# Render is CPU-heavy, upload is I/O-limited — different concurrency per stage
find . -name "*.md" | jori -w render.mjs -c 75% | jori -w upload.mjs -c 2

Always preview with --dry-run

Before running a worker that modifies or deletes files, verify what files would be matched. --dry-run prints each matched path to stdout and exits without running workers:

jori "**/*.png" -w resize.mjs --dry-run          # list matched files
jori "**/*.png" -w resize.mjs --dry-run | wc -l  # count them

File discovery: glob vs. find on different platforms

jori accepts file paths in two ways: built-in glob (jori "src/**/*.ts" -w ...) or stdin pipeline (find ... | jori -w ...). The choice can have a significant impact on performance, especially on Windows.

| Method | Linux / macOS | Windows (Git Bash / MSYS2) |
| --- | --- | --- |
| Built-in glob / Node.js glob libs (fast-glob, fdir, tinyglobby) | Fast: native fs calls | Fast: native fs calls |
| find ... \| jori | Fast: native binary | Slow: MSYS2 POSIX emulation layer |
| find ... \| xargs | Fast: native binaries | Slow: both find and xargs run under MSYS2 emulation |

Why find is slow on Windows: Git Bash ships a POSIX-emulated find (via MSYS2) that translates every path and syscall through a compatibility layer, and piping through MSYS2's emulated shell adds further overhead for every path handed off to jori. Node.js glob libraries call the Windows filesystem API directly and avoid this overhead entirely.

Recommendation:

  • Cross-platform projects — use built-in glob or Node.js glob libraries. They perform consistently on all platforms.
  • Linux/macOS-only: find pipelines are fine and sometimes faster for complex filters (-mtime, -size, -user, etc.) that globs can't express.
  • Windows with complex filters — use PowerShell's Get-ChildItem or a Node.js script to produce the file list and pipe it into jori.
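
For the "Node.js script" option, a filter comparable to `find -name "*.log" -mtime -7` can be written with the standard library alone. The script below is a sketch: the file name `list-recent.mjs`, the function `listRecentFiles`, and its argument order are all hypothetical. It relies on `readdirSync` with `recursive` and on `Dirent.parentPath`, both available in the Node.js versions this tool requires.

```javascript
// list-recent.mjs (hypothetical helper script, not part of job-ripper)
import { readdirSync, statSync } from 'node:fs';
import { join, resolve } from 'node:path';

// Absolute paths of files under `root` that end in `ext` and were
// modified within the last `days` days.
function listRecentFiles(root, ext, days) {
  const cutoff = Date.now() - days * 24 * 60 * 60 * 1000;
  const entries = readdirSync(root, { recursive: true, withFileTypes: true });
  const matches = [];
  for (const entry of entries) {
    if (!entry.isFile() || !entry.name.endsWith(ext)) continue;
    const fullPath = resolve(join(entry.parentPath, entry.name));
    if (statSync(fullPath).mtimeMs >= cutoff) matches.push(fullPath);
  }
  return matches;
}

// Usage: node list-recent.mjs [root] [ext] [days]
const [root = '.', ext = '.log', days = '7'] = process.argv.slice(2);
for (const file of listRecentFiles(root, ext, Number(days))) console.log(file);
```

The output is one path per line, so it can be piped straight into jori, for example `node list-recent.mjs logs .log 7 | jori -w analyze.mjs`.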

Released under the MIT License.