
chendbox/mlis

Machine Learning Infrastructure Service


MLIS is a local-first AI infrastructure reference implementation for durable inference jobs.

It is built for people who want to study or demo the control-plane side of AI systems: job submission, scheduler and worker separation, lease-based recovery, tenant-scoped authorization, artifact-backed payloads, and operator visibility.

Instead of being "just another model-serving wrapper", MLIS tries to make the interesting infrastructure problems concrete and runnable on a laptop.

MLIS platform map

Why This Repo Is Interesting

  • It demonstrates real platform boundaries: API, scheduler, workers, storage, artifacts, auth, audit, and observability are modeled as separate concerns.
  • It shows failure recovery, not just happy-path execution: kill a worker mid-run and watch the job recover through lease expiry and reassignment.
  • It is runnable without a GPU: the default demo exercises GPU-aware scheduling in simulation mode so more people can try it.
  • It is a learning artifact as much as a codebase: the repo includes ADRs, threat model notes, runbooks, and validation docs, not just source files.

Who This Is For

  • Engineers learning how AI infrastructure works behind a /generate endpoint.
  • Platform and MLOps engineers who want a compact reference for scheduler and worker design.
  • Hiring managers, interviewers, or teammates who want a runnable systems project instead of a slide deck.
  • Builders who want a starting point for local-first inference job orchestration experiments.

Demo and Release

See the v0.1.0 release for release notes and the lease recovery demo video.

Contributing

Contributions are welcome, especially in areas that improve the repo as a learning artifact and demo:

  • quickstart clarity and platform-specific docs
  • tests and validation scripts
  • operator console polish and empty-state UX
  • observability, diagnostics, and demo tooling
  • small scheduler, worker, or auth hardening improvements

If you want to help, start with CONTRIBUTING.md and look for issues labeled good first issue, documentation, or help wanted.

What You Can See In 3 Minutes

  1. Start the stack with Docker Compose.
  2. Open the React console at /console.
  3. Submit a sleep or gpu_demo job.
  4. Kill a worker and watch the job recover.

If that sounds useful, jump to Quickstart With Docker Compose.

Highlights

  • Two distinct concepts: a job is user intent, while a job_assignment is one execution attempt.
  • Lease-based recovery: kill a worker mid-run and watch work recover through lease expiry; see the recovery demo.
  • Threat-model-first security: tenant, worker, admin, and token risks map to explicit mitigations in docs/design/threat-model.md.
  • Append-only audit trail: security-sensitive state changes and denied authorization decisions are recorded with request correlation.
  • Decisions documented as ADRs: see docs/adr/README.md.
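
The job versus job_assignment split can be pictured with a small data model. This is a minimal sketch, not the MLIS schema: the field names, the `new_attempt` helper, and the `LEASE_EXPIRED` outcome label are hypothetical. Only `latest_assignment`, `worker_uid`, and `SUCCEEDED` appear elsewhere in this README.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class JobAssignment:
    """One execution attempt of a job on a specific worker (hypothetical fields)."""
    attempt: int
    worker_uid: str
    outcome: Optional[str] = None  # illustrative labels, e.g. "SUCCEEDED"


@dataclass
class Job:
    """User intent: what should run, independent of which worker runs it."""
    job_id: str
    task_type: str
    assignments: List[JobAssignment] = field(default_factory=list)

    def new_attempt(self, worker_uid: str) -> JobAssignment:
        # A retry adds a new assignment; the job record itself is unchanged.
        assignment = JobAssignment(attempt=len(self.assignments) + 1, worker_uid=worker_uid)
        self.assignments.append(assignment)
        return assignment

    @property
    def latest_assignment(self) -> Optional[JobAssignment]:
        return self.assignments[-1] if self.assignments else None


job = Job(job_id="job-1", task_type="sleep")
first = job.new_attempt("worker-a")
first.outcome = "LEASE_EXPIRED"  # the worker disappeared mid-run (hypothetical label)
second = job.new_attempt("worker-b")
second.outcome = "SUCCEEDED"
print(job.latest_assignment.worker_uid, len(job.assignments))
```

Keeping attempts in their own records means a crash changes the assignment history while the job, as the user submitted it, stays stable.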

Design Deep Dives

Why This Project Exists

The project demonstrates practical AI infrastructure design: lifecycle state, capacity control, scheduler/worker separation, lease-based recovery, resource accounting, multi-tenant authorization, observability, and productized developer workflows.

Calling a model is only one part of serving AI workloads. A real platform also needs to answer:

  • Who submitted the work?
  • Which tenant owns the job and result?
  • What resources does the job need?
  • Which worker is allowed to execute it?
  • What happens if a worker crashes mid-run?
  • Where should large inputs and outputs live?
  • Who can read, cancel, or administer the workload?

MLIS turns those questions into concrete APIs, database state, scheduler behavior, worker protocols, tests, CLI commands, and UI flows.

Core Capabilities

  • Durable job lifecycle with 5 explicit states plus lease-based recovery.
  • Scheduler-driven assignment model that separates user intent from concrete execution attempts.
  • Worker registry, heartbeat, lease ownership, and reclaim-oriented failure recovery validated under multi-worker execution.
  • Runner framework with a focused public path: sleep, generic inference, and simulated gpu_demo.
  • Optional runner integrations, including tiny_llm, are kept as examples of how model-specific execution can plug into the worker framework without becoming the default demo path.
  • Artifact-backed inputs and outputs so large payloads do not bloat the metadata store.
  • GPU-aware scheduling foundations with worker groups, node/slot-separated capacity, resource requests, GPU ids, and placement checks.
  • Scoped JWT authorization for job submission, job reads, cancellation, worker administration, and tenant administration.
  • Tenant isolation for job submission, listing, reading, cancellation, and result access.
  • Append-only audit logging for security-sensitive state changes and denied authorization decisions, correlated with request IDs.
  • CLI and React console for local demos and operator workflows.
  • Representative validation scripts live under scripts/; generated benchmark output is intentionally kept out of Git.
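
To illustrate the artifact-backed payload idea from the list above, here is a hedged sketch of a content-addressed store in which the metadata row holds only a small reference. The function names, the reference fields, and the on-disk layout are assumptions for illustration; the actual MLIS artifact service may work differently.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def put_artifact(root: Path, payload: bytes) -> dict:
    """Write the payload to disk and return a small reference for the metadata store."""
    digest = hashlib.sha256(payload).hexdigest()
    (root / digest).write_bytes(payload)
    return {"artifact_sha256": digest, "size_bytes": len(payload)}


def get_artifact(root: Path, ref: dict) -> bytes:
    """Read the payload back and verify it still matches the stored digest."""
    data = (root / ref["artifact_sha256"]).read_bytes()
    if hashlib.sha256(data).hexdigest() != ref["artifact_sha256"]:
        raise ValueError("artifact content does not match its reference")
    return data


root = Path(tempfile.mkdtemp())
large_input = json.dumps({"prompt": "x" * 100_000}).encode()
ref = put_artifact(root, large_input)

# The metadata row stays small no matter how large the payload is.
job_row = {"task_type": "inference", "input_ref": ref}
print(len(json.dumps(job_row)) < len(large_input))
```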

Non-Goals

  • Not a model accuracy or training platform; the focus is the inference serving and compute platform layer.
  • Not optimized for peak benchmark throughput; the focus is operability, capacity control, and predictable behavior under load.
  • Not a managed multi-cloud product; it is local-first by design, with Kubernetes as the deployment target.
  • Not an identity provider; auth verifies and authorizes JWTs, but does not replace an external IdP.

Observability

MLIS exposes operational signals for debugging scheduling, capacity, tenant behavior, and security boundaries:

  • Per-tenant job rate, latency, rejection, and state counts.
  • Queue depth and active worker-slot counts by worker group.
  • Scheduler placement decisions with explanation labels such as insufficient capacity, quota pressure, and placement rejection.
  • GPU and node-level capacity signals, including advertised resources, assigned GPU ids, and worker/node accounting.
  • Optional advanced runtime diagnostics for batching and warm-pool internals, kept separate from the primary scheduler demo path.
  • Structured API logs correlated by X-Request-ID.
  • Append-only audit events with actor tenant, target tenant, action, result, and denial reason.
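
The audit event fields listed above (actor tenant, target tenant, action, result, denial reason, plus request correlation) could be modeled roughly as follows. This is a sketch under assumptions: the class names, the in-memory storage, and the `"allowed"`/`"denied"` values are illustrative, and MLIS itself stores audit rows in its own storage layer.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AuditEvent:
    request_id: str
    actor_tenant: str
    target_tenant: str
    action: str
    result: str  # "allowed" or "denied" (illustrative values)
    denial_reason: Optional[str] = None


class AuditLog:
    """Append-only in-memory log: events can be added and read, never edited."""

    def __init__(self) -> None:
        self._lines: List[str] = []

    def append(self, event: AuditEvent) -> None:
        self._lines.append(json.dumps(asdict(event), sort_keys=True))

    def by_request(self, request_id: str) -> List[dict]:
        rows = [json.loads(line) for line in self._lines]
        return [row for row in rows if row["request_id"] == request_id]


log = AuditLog()
log.append(AuditEvent("req-1", "tenant-a", "tenant-a", "jobs:cancel", "allowed"))
log.append(AuditEvent("req-2", "tenant-a", "tenant-b", "jobs:read", "denied", "cross_tenant"))
denied = log.by_request("req-2")
print(denied[0]["denial_reason"])
```

Looking events up by request ID mirrors how the X-Request-ID header ties API logs and audit events together.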

Quickstart With Docker Compose

The recommended public demo path uses Docker Compose with Postgres, one API/control-plane container, and scalable worker/data-plane containers. The API container owns HTTP intake and the push scheduler; worker containers execute jobs. The default stack does not require a physical GPU; gpu_demo runs in simulation mode while still exercising GPU resource accounting.

Prerequisites

  • Python 3.11+
  • Docker Desktop with Docker Compose

Start the Stack

docker compose build
docker compose up -d postgres
docker compose up -d --scale worker=4 app worker

The stack runs Postgres, the API in push-scheduler mode, and 4 worker containers. Each worker container starts multiple logical worker slots from jobs.worker_concurrency, so the console may show more ACTIVE worker slots than Docker containers. For example, 4 containers with concurrency 4 produce 16 logical execution slots. The public compose file sets WORKER_GPU_HEALTH_PROBE=0 because gpu_demo is a simulated GPU scheduling demo and should run without a local NVIDIA runtime.

The public process model is intentionally small:

  • MLIS_ROLE=api: HTTP API, admission, scheduler, cleanup, worker registry maintenance.
  • MLIS_ROLE=worker: runner execution, worker registration, heartbeats, assignment consumption.
  • SCHEDULER_ENABLED=1 belongs on the API container only.

GPU and Non-GPU Demo Modes

The default quickstart is the non-GPU / universally runnable path:

  • No NVIDIA GPU is required.
  • Worker containers advertise simulated GPU capacity from config.
  • WORKER_GPU_HEALTH_PROBE=0 skips nvidia-smi so the simulated gpu_demo can be scheduled.
  • gpu_demo still exercises GPU-aware placement, gpu_ids, and CUDA_VISIBLE_DEVICES, but it does not run real CUDA compute.
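
As a rough picture of what GPU-aware placement in simulation mode involves, the sketch below assigns simulated GPU ids with a first-fit policy and derives the matching CUDA_VISIBLE_DEVICES value. First-fit is chosen here for brevity and is an assumption; the class, function, and result field layout are hypothetical and do not describe the MLIS scheduler's actual policy.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class WorkerNode:
    worker_uid: str
    gpu_ids: List[int]  # simulated capacity advertised from config
    in_use: Set[int] = field(default_factory=set)

    def free_gpus(self) -> List[int]:
        return [gpu for gpu in self.gpu_ids if gpu not in self.in_use]


def place(nodes: List[WorkerNode], gpus_requested: int) -> Optional[dict]:
    """First-fit placement: use the first node with enough free simulated GPUs."""
    for node in nodes:
        free = node.free_gpus()
        if len(free) >= gpus_requested:
            chosen = free[:gpus_requested]
            node.in_use.update(chosen)
            return {
                "worker_uid": node.worker_uid,
                "gpu_ids": chosen,
                "cuda_visible_devices": ",".join(str(gpu) for gpu in chosen),
            }
    return None  # a scheduler could label this "insufficient capacity"


nodes = [WorkerNode("worker-a", [0]), WorkerNode("worker-b", [0, 1])]
print(place(nodes, 1))
print(place(nodes, 2))
print(place(nodes, 2))
```

No CUDA work happens here, which matches the simulation mode described above: only the accounting and the environment value are exercised.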

If your host has NVIDIA Docker support and you want workers to verify real GPU visibility, use the optional GPU override:

docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
docker compose -f docker-compose.yml -f docker-compose.gpu.yml up -d --scale worker=4 app worker

The override enables gpus: all for workers and sets WORKER_GPU_HEALTH_PROBE=1, so workers register with gpu_health_ok=true only if nvidia-smi -L works inside the container. The default image is still python:3.11-slim, so the public gpu_demo remains a scheduling simulation. Real PyTorch/CUDA execution requires a CUDA/PyTorch runtime image and matching dependencies.

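Useful URLs, assuming the API is published on localhost:8001 as in the commands below:

  • React console: http://localhost:8001/console
  • Liveness and readiness: http://localhost:8001/health/live and http://localhost:8001/health/ready
  • Prometheus-format metrics: http://localhost:8001/metrics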

Prometheus and Grafana are not bundled in the default compose file. The API exposes Prometheus-format metrics at /metrics; see docs/runbooks/local-demo.md for local verification steps and deploy/grafana/dashboards/mlis-gpu-platform.json for the dashboard template.
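
Without a bundled Prometheus, one quick way to look at /metrics output is to parse the text format directly. The sketch below is deliberately minimal: it assumes sample lines without trailing timestamps, and the metric names in `SAMPLE` are hypothetical placeholders, not names MLIS is known to export.

```python
from typing import Dict

# Hypothetical sample in Prometheus text exposition format.
SAMPLE = """\
# HELP mlis_queue_depth Jobs waiting per worker group (hypothetical metric).
# TYPE mlis_queue_depth gauge
mlis_queue_depth{worker_group="default"} 3
mlis_active_worker_slots{worker_group="default"} 16
"""


def parse_metrics(text: str) -> Dict[str, float]:
    """Map each series (name plus labels) to its value, skipping comments."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Assumes "series value" with no trailing timestamp.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


samples = parse_metrics(SAMPLE)
print(samples['mlis_queue_depth{worker_group="default"}'])
```

Against a running stack, the text passed to `parse_metrics` could come from `urllib.request.urlopen("http://localhost:8001/metrics").read().decode()`.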

Verify the API is ready:

curl -sS http://localhost:8001/health/live
curl -sS http://localhost:8001/health/ready
curl -sS http://localhost:8001/health/ready/full

PowerShell:

curl.exe -sS "http://localhost:8001/health/live"
curl.exe -sS "http://localhost:8001/health/ready"
curl.exe -sS "http://localhost:8001/health/ready/full"
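
For scripted demos it can help to wait for readiness instead of rerunning curl by hand. The following is a sketch using only the standard library; it assumes the readiness endpoint answers with HTTP 200 once the API is ready, which this README does not state explicitly, and the function names are hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str) -> int:
    """Return the HTTP status code for a GET, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, timeout, DNS failure, and similar


def wait_until_ready(
    url: str,
    probe: Callable[[str], int] = http_status,
    attempts: int = 30,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the probe reports 200 or the attempts run out."""
    for _ in range(attempts):
        if probe(url) == 200:
            return True
        sleep(delay)
    return False


# Example call against the local stack:
# wait_until_ready("http://localhost:8001/health/ready")
```

The `probe` and `sleep` parameters exist so the loop can be exercised without a running stack.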

Submit and inspect a CPU-only job:

curl -sS -X POST http://localhost:8001/v1/jobs \
  -H "Content-Type: application/json" \
  --data @examples/job.sleep.json

curl -sS http://localhost:8001/v1/jobs/<job_id>

PowerShell:

curl.exe -sS -X POST "http://localhost:8001/v1/jobs" -H "Content-Type: application/json" --data "@examples/job.sleep.json"
curl.exe -sS "http://localhost:8001/v1/jobs/<job_id>"

Submit and inspect the simulated GPU scheduling demo:

curl -sS -X POST http://localhost:8001/v1/jobs \
  -H "Content-Type: application/json" \
  --data @examples/job.gpu_demo.json

curl -sS http://localhost:8001/v1/jobs/<job_id>
curl -sS "http://localhost:8001/v1/jobs?limit=5"
curl -sS http://localhost:8001/ui/api/overview

PowerShell:

curl.exe -sS -X POST "http://localhost:8001/v1/jobs" -H "Content-Type: application/json" --data "@examples/job.gpu_demo.json"
curl.exe -sS "http://localhost:8001/v1/jobs/<job_id>"
curl.exe -sS "http://localhost:8001/v1/jobs?limit=5"
curl.exe -sS "http://localhost:8001/ui/api/overview"

Expected results:

  • sleep finishes with state: "SUCCEEDED" and a result such as {"slept": 1.0}.
  • gpu_demo finishes with state: "SUCCEEDED", gpu_ids: [0], and cuda_visible_devices: "0" in simulation mode.
  • /console shows the same jobs through the React operator console.
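
Checking for the expected results above usually means polling the job until it reaches a terminal state. Here is a hedged sketch of such a loop: only `SUCCEEDED` and the `state` field are named in this README, so the other terminal state names and the helper names are assumptions.

```python
import json
import time
import urllib.request
from typing import Callable

API = "http://localhost:8001"
# Only SUCCEEDED is named in this README; FAILED and CANCELED are assumed names.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELED"}


def get_job(job_id: str) -> dict:
    """Fetch one job as a dict from GET /v1/jobs/<job_id>."""
    with urllib.request.urlopen(f"{API}/v1/jobs/{job_id}", timeout=5) as response:
        return json.load(response)


def wait_for_job(
    job_id: str,
    fetch: Callable[[str], dict] = get_job,
    attempts: int = 60,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll the job until it reaches a terminal state, then return it."""
    job: dict = {}
    for _ in range(attempts):
        job = fetch(job_id)
        if job.get("state") in TERMINAL_STATES:
            return job
        sleep(delay)
    raise TimeoutError(f"job {job_id} still in state {job.get('state')!r}")


# Example call against the local stack, using an id returned by the POST:
# print(wait_for_job("<job_id>")["state"])
```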

Submit the generic inference runner if you want to verify the non-GPU model-execution path:

curl -sS -X POST http://localhost:8001/v1/jobs \
  -H "Content-Type: application/json" \
  --data @examples/job.inference.json

PowerShell:

curl.exe -sS -X POST "http://localhost:8001/v1/jobs" -H "Content-Type: application/json" --data "@examples/job.inference.json"

tiny_llm remains available as an optional runner integration, but it is intentionally not part of the default quickstart. See docs/runbooks/optional-runners.md.

Stop the Stack

docker compose down

For a fresh demo database and empty artifact volume:

docker compose down -v

Inspect Logs

docker compose logs -f postgres app worker

Windows Convenience Script

cd path\to\machine-learning-infrastructure-service\open-source-release
.\dev.ps1 start -workers 4
.\dev.ps1 logs
.\dev.ps1 stop

Try Lease-Based Recovery

# Submit a long-running job.
curl -sS -X POST http://localhost:8001/v1/jobs \
  -H "Content-Type: application/json" \
  --data '{"task_type":"sleep","params":{"seconds":80},"cpu":1,"mem_mb":256,"gpu":0}'

# Fetch the job and note latest_assignment.worker_uid.
curl -sS http://localhost:8001/v1/jobs/<job_id>

# Find which worker container logged that worker_uid; the log prefix (such as worker-2) identifies it.
docker compose logs worker | grep '<worker_uid>'

# List the worker containers to get the matching container name.
docker compose ps worker

# Kill that worker, then poll the job until it recovers after lease expiry.
docker kill <worker_container_name>
curl -sS http://localhost:8001/v1/jobs/<job_id>

This demonstrates the worker lease/heartbeat/reclaim path: work is not permanently lost when an executor disappears. In a successful recovery run, the job first appears under the killed worker's worker_uid, then later appears under a different worker_uid, and finally reaches state: "SUCCEEDED".

PowerShell:

@'
{
  "task_type": "sleep",
  "params": { "seconds": 80 },
  "cpu": 1,
  "mem_mb": 256,
  "gpu": 0
}
'@ | Set-Content long-job.json

curl.exe -sS -X POST "http://localhost:8001/v1/jobs" -H "Content-Type: application/json" --data "@long-job.json"
curl.exe -sS "http://localhost:8001/v1/jobs/<job_id>"
docker compose logs worker | Select-String -Pattern "<worker_uid>"
docker compose ps worker
docker kill <worker_container_name>
curl.exe -sS "http://localhost:8001/v1/jobs/<job_id>"

PowerShell notes:

  • Use curl.exe, not curl, because PowerShell maps curl to Invoke-WebRequest.
  • Use Select-String instead of grep: docker compose logs worker | Select-String -Pattern "<worker_uid>".
  • For JSON payloads, --data "@file.json" is more reliable than inline JSON quoting in PowerShell.
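
The lease, heartbeat, and reclaim path exercised in this section can be modeled in a few lines. This is a simplified sketch, not the MLIS implementation: the class names, the 10-second TTL, and the in-memory table are assumptions, and time is passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Lease:
    job_id: str
    worker_uid: str
    expires_at: float


class LeaseTable:
    def __init__(self, ttl_seconds: float) -> None:
        self.ttl = ttl_seconds
        self.leases: Dict[str, Lease] = {}

    def assign(self, job_id: str, worker_uid: str, now: float) -> None:
        self.leases[job_id] = Lease(job_id, worker_uid, now + self.ttl)

    def heartbeat(self, job_id: str, worker_uid: str, now: float) -> bool:
        """Extend the lease only for the current owner of a live lease."""
        lease = self.leases.get(job_id)
        if lease is None or lease.worker_uid != worker_uid or lease.expires_at <= now:
            return False
        lease.expires_at = now + self.ttl
        return True

    def reclaim_expired(self, now: float) -> List[str]:
        """Drop expired leases and return the job ids that need a new attempt."""
        expired = [job_id for job_id, lease in self.leases.items() if lease.expires_at <= now]
        for job_id in expired:
            del self.leases[job_id]
        return expired


table = LeaseTable(ttl_seconds=10)
table.assign("job-1", "worker-a", now=0)
table.heartbeat("job-1", "worker-a", now=5)        # lease now expires at t=15
# worker-a is killed here, so no further heartbeats arrive.
print(table.reclaim_expired(now=12))               # lease still valid
print(table.reclaim_expired(now=16))               # lease expired, job reclaimed
table.assign("job-1", "worker-b", now=16)
print(table.heartbeat("job-1", "worker-a", now=17))  # stale owner is rejected
```

The delay between killing the worker and seeing a new worker_uid in the demo corresponds to waiting for the lease to expire before the job becomes eligible for reassignment.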

CLI Against The Docker Stack

For local CLI development, keep the Docker stack running and point the mlis console script at the API. This keeps the public process model the same as the demo: one API/control-plane process plus separate worker/data-plane processes.

Mac/Linux/WSL:

python -m venv .venv
source .venv/bin/activate
pip install -e .
export PYTHONPATH=src

Mint a local development token:

export MLIS_TOKEN=$(mlis dev-token --tenant tenant-a --user user-a --scope jobs:submit,jobs:read,jobs:cancel)

Check identity and submit work:

mlis --token "$MLIS_TOKEN" whoami
mlis --token "$MLIS_TOKEN" submit sleep --seconds 10
mlis --token "$MLIS_TOKEN" status <job_id>
mlis --token "$MLIS_TOKEN" cancel <job_id>

Windows PowerShell:

python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -e .
$env:PYTHONPATH="src"

Mint a local development token:

$env:MLIS_TOKEN = .\.venv\Scripts\mlis.exe dev-token --tenant tenant-a --user user-a --scope jobs:submit,jobs:read,jobs:cancel

Check identity:

.\.venv\Scripts\mlis.exe --token $env:MLIS_TOKEN whoami

Submit and inspect a job:

.\.venv\Scripts\mlis.exe --token $env:MLIS_TOKEN submit sleep --seconds 10
.\.venv\Scripts\mlis.exe --token $env:MLIS_TOKEN status <job_id>
.\.venv\Scripts\mlis.exe --token $env:MLIS_TOKEN cancel <job_id>

Security Model

MLIS treats authentication and authorization separately.

  • Authentication verifies JWT issuer, audience, signature, expiry, and subject.
  • Authorization checks route-specific scopes such as jobs:submit, jobs:read, jobs:cancel, admin:workers, and admin:tenants.
  • Tenant context comes from the verified token, not from user-supplied request bodies.
  • Cross-tenant access requires both the correct scope and an explicit cross-tenant capability.
  • Workers register with bootstrap credentials, receive worker JWTs, and report assignment completion with signed dispatch tokens.
  • Audit rows are written for security-sensitive state-changing operations and denied authorization attempts.
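
The authorization rules above (route-specific scope, tenant taken from the verified token, explicit cross-tenant capability) can be sketched as a single decision function. The sketch assumes the JWT has already been verified and reduced to claims; the `Claims` fields, the `cross_tenant` flag name, and the denial reason strings are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Claims:
    """Claims from an already-verified token (signature, issuer, expiry checked upstream)."""
    tenant: str
    scopes: FrozenSet[str]
    cross_tenant: bool = False  # hypothetical name for the cross-tenant capability


def authorize(claims: Claims, required_scope: str, target_tenant: str) -> Tuple[bool, str]:
    """Return (allowed, reason); a denial reason could feed an audit event."""
    if required_scope not in claims.scopes:
        return False, "missing_scope"
    if target_tenant != claims.tenant and not claims.cross_tenant:
        return False, "cross_tenant_denied"
    return True, "ok"


user = Claims(tenant="tenant-a", scopes=frozenset({"jobs:submit", "jobs:read"}))
print(authorize(user, "jobs:read", "tenant-a"))
print(authorize(user, "jobs:cancel", "tenant-a"))
print(authorize(user, "jobs:read", "tenant-b"))
```

Note that the target tenant is compared against the token's tenant, never against a tenant value supplied in the request body.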

See docs/design/threat-model.md and docs/validation/security_test.md.

Testing

Install test dependencies:

pip install -e ".[test]"

Run the public release gate used by CI:

pytest tests/unit \
  tests/integration/test_health.py \
  tests/integration/test_auth_middleware.py \
  tests/integration/test_scoped_authorization.py \
  tests/integration/test_jobs.py \
  tests/integration/test_ui.py \
  -q

Run only the unit tests:

pytest tests/unit -q

Run the curated integration smoke tests:

pytest tests/integration/test_health.py tests/integration/test_auth_middleware.py tests/integration/test_scoped_authorization.py tests/integration/test_jobs.py tests/integration/test_ui.py -q

PowerShell equivalent:

.\.venv\Scripts\python.exe -m pip install -e ".[test]"
.\.venv\Scripts\python.exe -m pytest tests\unit tests\integration\test_health.py tests\integration\test_auth_middleware.py tests\integration\test_scoped_authorization.py tests\integration\test_jobs.py tests\integration\test_ui.py -q

The project includes unit and integration coverage for core state transitions, scheduler behavior, worker paths, artifact handling, auth boundaries, and UI/CLI behavior.

Repository Guide

src/app/api/             FastAPI routes and schemas
src/app/cli/             mlis command-line client
src/app/control_plane/   scheduler and control-plane services
src/app/data_plane/      worker execution and runner logic
src/app/storage/         Postgres storage plus test-local storage adapters
src/app/services/        service-layer modules such as audit and artifacts
src/app/ui-react/        React console source
src/app/ui-react-dist/   built console assets served by FastAPI
configs/                 committed defaults and example config overlays
db/init/                 Postgres schema initialization
deploy/                  Docker, Kubernetes, and Grafana deployment assets
docs/design/             focused design deep dives for recovery, security, and scheduling
docs/runbooks/           local demo and operations-oriented verification notes
docs/validation/         security and integration test reports
tests/                   unit and integration tests

Runtime outputs such as artifacts/, tmp/, logs/, reports/, model files, and machine-local config such as configs/real.yaml are ignored by design.

Public Release Scope

Implemented MVP

  • durable job and assignment lifecycle
  • local Docker deployment
  • Postgres-backed metadata
  • scheduler-driven placement
  • worker lease and heartbeat model
  • artifact-backed payloads and results
  • CLI and React console
  • scoped auth, tenant isolation, worker auth, and audit foundations

Future Hardening Ideas

These are explicit residual risks documented in the threat model, not unknown unknowns:

  • stronger GPU inventory and memory accounting
  • Kubernetes deployment hardening
  • real key rotation/JWKS or KMS-backed signing
  • rate limits for token minting and job submission
  • production secret management
  • audit retention/export policy

About

Distributed AI infrastructure platform with scheduler-worker separation, resource accounting, GPU-aware execution, and lease-based recovery.
