Skip to content

Latest commit

 

History

History
504 lines (428 loc) · 23.9 KB

File metadata and controls

504 lines (428 loc) · 23.9 KB

Genie - Project Plan

Vision

Genie is a cross-platform app (iOS, Android, Web) where users describe a goal and a containerized LLM agent ("Genie") is provisioned to autonomously pursue that goal. Genies run on schedules, monitor the world, and proactively push updates to their user. Users control what each Genie can access on the network via a real-time approval system.


Architecture Overview

┌─────────────────────────────────────────────────────────┐
│                    Frontend (React Native Web)           │
│              iOS / Android / Web from one codebase       │
│                                                          │
│  ┌──────────┐ ┌──────────┐ ┌────────────┐ ┌──────────┐ │
│  │  Home /  │ │  Create  │ │   Genie    │ │ Terminal │ │
│  │  List    │ │  Flow    │ │   Detail   │ │  (Debug) │ │
│  └──────────┘ └──────────┘ └────────────┘ └──────────┘ │
└──────────────────────┬──────────────────────────────────┘
                       │ REST + WebSocket
┌──────────────────────▼──────────────────────────────────┐
│                    Backend API (Node.js / TS)            │
│                                                          │
│  ┌────────────┐ ┌────────────┐ ┌───────────────────────┐│
│  │    Auth    │ │  Scheduler │ │  Container Orchestrator││
│  │  Service   │ │  Service   │ │  (Docker / Fly / ECS) ││
│  └────────────┘ └────────────┘ └───────────────────────┘│
│  ┌────────────┐ ┌────────────┐ ┌───────────────────────┐│
│  │   Chat /   │ │  Network   │ │   Notification        ││
│  │  Message Q │ │  Approval  │ │   Service (APNS/FCM)  ││
│  └────────────┘ └────────────┘ └───────────────────────┘│
└──────────────────────┬──────────────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────────────┐
│               Genie Runtime (per container)              │
│                                                          │
│  ┌─────────────────────────────────────────────────────┐│
│  │              Genie Harness (Core IP)                 ││
│  │                                                      ││
│  │  ┌──────────┐ ┌──────────┐ ┌───────────────────┐   ││
│  │  │ Planning │ │Execution │ │  Memory Manager   │   ││
│  │  │  LLM     │ │  LLM     │ │  (read/write/     │   ││
│  │  │          │ │          │ │   summarize)      │   ││
│  │  └──────────┘ └──────────┘ └───────────────────┘   ││
│  │  ┌──────────┐ ┌──────────┐ ┌───────────────────┐   ││
│  │  │  Tool    │ │ Network  │ │  Schedule         │   ││
│  │  │  Runner  │ │ Proxy    │ │  Self-Recommender │   ││
│  │  │ (shell,  │ │ (egress  │ │                   │   ││
│  │  │  files)  │ │  filter) │ │                   │   ││
│  │  └──────────┘ └──────────┘ └───────────────────┘   ││
│  │  ┌─────────────────────────────────────────────┐   ││
│  │  │  Metrics & Self-Evaluation Engine            │   ││
│  │  │  (KPIs, tracking, periodic self-review)      │   ││
│  │  └─────────────────────────────────────────────┘   ││
│  └─────────────────────────────────────────────────────┘│
│                                                          │
│  ┌──────────────────┐  ┌──────────────────────────────┐ │
│  │ Persistent Volume │  │  Debug SSH / Terminal Server │ │
│  │ (memory, state)   │  │  (for drop-in diagnostics)  │ │
│  └──────────────────┘  └──────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────────────┐
│                   Data Layer                              │
│                                                          │
│  ┌──────────┐ ┌──────────────┐ ┌───────────────────┐   │
│  │ Postgres │ │  Object Store │ │  Vector DB        │   │
│  │ (users,  │ │  (S3 - genie │ │  (genie long-term │   │
│  │  genies, │ │   artifacts)  │ │   memory)         │   │
│  │  perms)  │ │               │ │                   │   │
│  └──────────┘ └──────────────┘ └───────────────────┘   │
└─────────────────────────────────────────────────────────┘

Component Breakdown

1. Frontend (React Native Web)

Tech: React Native + react-native-web + Expo

Screens:

Screen Purpose
Auth Sign up / login
Home List of user's genies with status (running, sleeping, error)
Create Genie Multi-step: describe goal -> LLM suggests config -> user reviews/approves -> deploy
Genie Detail Chat interface, latest updates/briefings, network permissions panel, metrics dashboard, settings
Network Approvals Pending approval requests (also push notifications)
Terminal Web terminal (xterm.js) to drop into a genie's container for debugging
Settings Account, notification preferences

Key Features:

  • Push notifications via APNS (iOS) and FCM (Android)
  • WebSocket connection for real-time chat and status updates
  • Offline message queuing (messages sent while genie is asleep are queued)

2. Backend API

Tech: Node.js + TypeScript + Express/Fastify

Database: PostgreSQL

Key Services:

Auth Service

  • JWT-based auth
  • User management

Container Orchestrator

  • Provisions containers on demand (Docker on a VM cluster, or Fly.io Machines API, or AWS ECS)
  • Start/stop/destroy genie containers
  • Attaches persistent volumes for memory
  • Manages container lifecycle (spin up on schedule, spin down after idle)

Scheduler Service

  • Stores each genie's schedule (cron expressions)
  • Triggers container wake-up at scheduled times
  • Genie can recommend its own schedule during planning phase; user approves

Chat / Message Queue

  • Proxies messages between user and genie
  • Queues user messages when genie container is offline
  • On container wake-up, delivers queued messages to genie
  • Stores full conversation history in Postgres

Network Approval Service

  • Receives egress requests from genie containers (via the network proxy)
  • Creates approval requests
  • Sends push notification to user
  • On approval: updates firewall rules for that container
  • Supports "allow once" vs "allow always" (per genie, per domain)

Notification Service

  • APNS + FCM integration
  • Sends: network approval requests, genie briefings/updates, genie status changes

3. Genie Harness (Core IP)

This is the agent runtime that runs inside each container. It is the most critical component.

Tech: Python (best LLM tooling ecosystem)

3a. Planning LLM

  • Used during genie creation to analyze the user's goal
  • Suggests: container specs, schedule, model choices, required tools
  • Also used by the genie for high-level reasoning and re-planning
  • Model: configurable, suggested during creation (e.g., Claude Sonnet for simple tasks, Opus for complex)

3b. Execution LLM

  • Handles the actual task execution: web scraping, data analysis, composing briefings
  • Model: configurable, can be lighter/cheaper than the planning model
  • Runs within tool-use loops

3c. Memory Manager

The genie's persistent brain. This is the key differentiator.

Memory Architecture:

┌─────────────────────────────────────┐
│           Memory Manager            │
│                                     │
│  ┌───────────┐  ┌────────────────┐  │
│  │  Working   │  │   Long-Term    │  │
│  │  Memory    │  │   Memory       │  │
│  │            │  │                │  │
│  │ - Current  │  │ - Vector DB    │  │
│  │   task     │  │   (semantic    │  │
│  │ - Recent   │  │    search)     │  │
│  │   findings │  │ - Structured   │  │
│  │ - Session  │  │   knowledge    │  │
│  │   state    │  │   (JSON/SQLite)│  │
│  └───────────┘  └────────────────┘  │
│                                     │
│  ┌───────────────────────────────┐  │
│  │     Memory Lifecycle          │  │
│  │                               │  │
│  │ 1. After each task run:       │  │
│  │    - Summarize findings       │  │
│  │    - Extract key facts        │  │
│  │    - Store in long-term       │  │
│  │                               │  │
│  │ 2. Before each task run:      │  │
│  │    - Load relevant memories   │  │
│  │    - Reconstruct context      │  │
│  │    - Resume where left off    │  │
│  │                               │  │
│  │ 3. Periodically:              │  │
│  │    - Consolidate/compress     │  │
│  │    - Prune stale info         │  │
│  │    - Re-rank importance       │  │
│  └───────────────────────────────┘  │
└─────────────────────────────────────┘

Storage: Persistent volume mounted at /genie/memory/ survives container restarts.

  • working.json — current session state
  • knowledge.db — SQLite for structured facts
  • vectors/ — local vector index (e.g., ChromaDB) for semantic search over accumulated knowledge
  • history/ — compressed logs of past runs

3d. Tool Runner

Executes actions on behalf of the genie:

  • Shell commands (sandboxed, non-root)
  • File read/write (within the container)
  • Web requests (routed through the network proxy)
  • Data processing (Python libraries available)

3e. Network Proxy (Egress Filter)

  • All outbound HTTP(S) from the container routes through a local proxy
  • Proxy checks domain against the genie's allowlist
  • If domain not approved: blocks request, sends approval request to backend
  • If approved: forwards request
  • Implemented as a transparent proxy (e.g., mitmproxy or a lightweight custom proxy)

3f. Metrics & Self-Evaluation Engine

The genie must measure its own performance against the user's goal. This is critical — without metrics, there's no feedback loop and no improvement.

How it works:

  1. Metric Definition (at creation time): During the planning phase, the Planning LLM analyzes the user's goal and defines measurable KPIs. The user reviews and can adjust these.

    Examples by goal type:

    Goal Metrics
    "Monitor housing prices in Austin" - # of listings surfaced per week
    - % of surfaced listings user found relevant (user feedback)
    - Average time from listing appearing to user notification
    - Coverage: % of major listing sources monitored
    "Daily briefing on Iran conflict" - Briefing delivered on time (Y/N per day)
    - # of unique sources consulted
    - User engagement: did user read/respond?
    - User rating (optional thumbs up/down on briefings)
    "Monitor financial markets" - Briefing timeliness
    - # of actionable insights flagged
    - Accuracy of flagged trends (retroactive self-check)
    - Source diversity
  2. Metric Collection (each run): After every task execution, the genie records metrics to a structured store:

    /genie/memory/metrics/
    ├── definitions.json    # KPI definitions, targets, thresholds
    ├── observations.jsonl  # Append-only log of metric data points per run
    └── evaluations.jsonl   # Periodic self-evaluation summaries
    
  3. Self-Evaluation (periodic): On a configurable cadence (e.g., weekly, or every N runs), the Planning LLM reviews accumulated metrics and produces a self-evaluation:

    • What's going well vs. what's underperforming
    • Root cause analysis for missed targets
    • Proposed adjustments (change sources, adjust schedule, refine search criteria)
    • These adjustments are sent to the user for approval before being applied
  4. User Feedback Loop:

    • User can rate genie outputs (thumbs up/down, or 1-5 stars on briefings)
    • User can flag irrelevant results ("this listing isn't what I'm looking for")
    • This feedback is stored as a metric and factored into self-evaluation
    • The genie learns what the user actually values over time
  5. Metric Dashboard (in app):

    • Genie Detail screen shows a simple performance summary
    • Trend lines for key metrics over time
    • Current self-evaluation score
    • History of adjustments the genie has made

Storage: Metrics persist on the same mounted volume as memory, under /genie/memory/metrics/.

3g. Schedule Self-Recommender

  • After the planning phase, the genie suggests when it should be woken
  • Examples: "I should check every 4 hours", "Once daily at 5:30 AM user's timezone"
  • User approves/modifies the schedule
  • Genie can also request schedule changes over its lifetime

3g. Debug Terminal Server

  • Lightweight SSH or WebSocket terminal server
  • Allows the user to "drop in" to the container from the app
  • Read-only mode available for safe inspection
  • Full shell mode for debugging

4. Data Layer

PostgreSQL — primary database:

  • Users, auth tokens
  • Genies (config, status, schedule, goal, model choices)
  • Conversations (messages between user and genie)
  • Network permissions (per genie, per domain)
  • Approval requests

Object Storage (S3 or equivalent):

  • Genie artifacts (reports, generated files)
  • Exported briefings

Vector DB (per genie, local in container):

  • ChromaDB or similar embedded vector DB
  • Stores genie's accumulated knowledge embeddings
  • Persisted on the mounted volume

Genie Lifecycle

1. CREATION
   User describes goal
   → Planning LLM analyzes goal
   → Suggests: container spec, schedule, models, estimated cost
   → User reviews and approves
   → Container provisioned, harness installed, genie initialized

2. FIRST RUN
   Genie reads its goal
   → Planning LLM creates initial plan
   → Planning LLM defines KPIs/metrics for the goal
   → User reviews and approves metrics
   → Genie recommends its schedule ("wake me every morning at 5 AM")
   → User approves schedule
   → Genie begins first task execution
   → Hits network blocks, requests approvals
   → User approves domains
   → Genie completes first run, stores memories, sends first briefing
   → Container goes to sleep

3. SCHEDULED RUNS
   Scheduler triggers wake-up
   → Container starts
   → Harness loads: reads persisted memory, checks for queued user messages
   → Execution LLM runs task with context from memory
   → Metrics recorded for this run
   → Results stored, briefing pushed to user
   → If self-evaluation due: Planning LLM reviews metrics, proposes adjustments
   → Container sleeps

4. USER-INITIATED INTERACTION
   User sends message in chat
   → If container sleeping: wake it up, deliver message
   → Genie responds via chat
   → Container stays alive for a cooldown period, then sleeps

5. DEBUGGING
   User opens terminal in app
   → Backend starts container if needed
   → WebSocket terminal connects to container's shell
   → User inspects logs, memory, state

6. TERMINATION
   User deletes genie
   → Container destroyed
   → Persistent volume archived or deleted (user choice)
   → Data cleaned up

Project Structure

genie/
├── apps/
│   └── mobile/                  # React Native + Web app (Expo)
│       ├── src/
│       │   ├── screens/
│       │   ├── components/
│       │   ├── services/        # API client, WebSocket, notifications
│       │   ├── store/           # State management
│       │   └── navigation/
│       └── app.json
│
├── backend/
│   ├── src/
│   │   ├── api/                 # REST endpoints
│   │   ├── services/
│   │   │   ├── auth/
│   │   │   ├── container/       # Orchestrator
│   │   │   ├── scheduler/
│   │   │   ├── chat/
│   │   │   ├── network/         # Approval service
│   │   │   └── notifications/
│   │   ├── models/              # DB models
│   │   └── config/
│   └── package.json
│
├── harness/                     # Genie Harness (Core IP)
│   ├── genie/
│   │   ├── core/
│   │   │   ├── harness.py       # Main loop / lifecycle
│   │   │   ├── planner.py       # Planning LLM interface
│   │   │   ├── executor.py      # Execution LLM interface
│   │   │   └── scheduler.py     # Schedule self-recommender
│   │   ├── metrics/
│   │   │   ├── engine.py        # Metric collection + storage
│   │   │   ├── definitions.py   # KPI definition framework
│   │   │   ├── evaluator.py     # Periodic self-evaluation via Planning LLM
│   │   │   └── feedback.py      # User feedback ingestion
│   │   ├── memory/
│   │   │   ├── manager.py       # Memory lifecycle
│   │   │   ├── working.py       # Working memory
│   │   │   ├── longterm.py      # Long-term storage + vector search
│   │   │   └── consolidator.py  # Memory compression / pruning
│   │   ├── tools/
│   │   │   ├── shell.py         # Shell command execution
│   │   │   ├── files.py         # File operations
│   │   │   ├── web.py           # HTTP requests (via proxy)
│   │   │   └── data.py          # Data processing utilities
│   │   ├── network/
│   │   │   ├── proxy.py         # Egress proxy
│   │   │   └── firewall.py      # Allowlist management
│   │   ├── comms/
│   │   │   ├── chat.py          # Chat endpoint (WebSocket client)
│   │   │   └── terminal.py      # Debug terminal server
│   │   └── config.py
│   ├── Dockerfile
│   ├── requirements.txt
│   └── entrypoint.sh
│
├── infra/                       # Infrastructure as code
│   ├── docker-compose.yml       # Local dev
│   ├── terraform/               # Cloud provisioning
│   └── scripts/
│
└── docs/
    └── PLAN.md                  # This file (symlinked or copied)

Tech Stack Summary

Layer Technology
Mobile + Web React Native + Expo + react-native-web
Backend API Node.js + TypeScript + Fastify
Database PostgreSQL
Container Runtime Docker (dev), Fly.io Machines or AWS ECS (prod)
Genie Harness Python 3.12+
LLM Integration Anthropic API (Claude), OpenAI API (GPT), configurable
Vector DB ChromaDB (embedded, per genie)
Network Proxy mitmproxy or custom lightweight proxy
Push Notifications APNS + FCM via firebase-admin
Debug Terminal xterm.js (frontend) + WebSocket shell relay
IaC Terraform + Docker Compose

Development Phases

Phase 1: Foundation (Harness + Backend Core)

  • Genie harness: core loop, planning/execution LLM integration
  • Memory manager: working memory, long-term storage, consolidation
  • Metrics engine: KPI definition, per-run collection, self-evaluation
  • Tool runner: shell, files, web requests
  • Backend: auth, genie CRUD, container orchestrator (Docker locally)
  • Basic chat relay (WebSocket)
  • Local dev environment (docker-compose)

Phase 2: Network & Scheduling

  • Network proxy in container (egress filtering)
  • Network approval flow (backend + push notifications)
  • Scheduler service (cron-based wake/sleep)
  • Schedule self-recommendation by genie
  • Message queuing for offline genies

Phase 3: Frontend (Web First)

Build as a web app first for fast iteration, then wrap for mobile.

  • Web app (React + Vite, or Next.js) — same component library usable in React Native later
  • Auth screens
  • Genie creation flow (with LLM suggestion step)
  • Genie list / home screen
  • Chat interface
  • Network approval UI
  • Push notification integration

Phase 3b: Mobile

  • React Native app wrapping shared components
  • APNS + FCM push notifications
  • App store submission

Phase 4: Debug & Polish

  • Debug terminal (drop-in shell from app)
  • Genie status monitoring
  • Memory inspection UI
  • Error handling and recovery
  • Container health checks

Phase 5: Production Readiness

  • Cloud deployment (Fly.io or AWS)
  • Terraform IaC
  • Monitoring and logging
  • Rate limiting and abuse prevention
  • Billing infrastructure (Stripe)
  • App store submission

Next Steps

Start with Phase 1 — build the harness first since it's the core IP, then the backend to support it.

Immediate first tasks:

  1. Scaffold the harness/ Python project
  2. Implement the core harness loop (wake -> load memory -> plan -> execute -> store memory -> report -> sleep)
  3. Implement memory manager with working + long-term memory
  4. Build a simple CLI to test genies locally before the app exists