Skip to content

[2/N] [History Server Beta] [Feat] LRU cache#4796

Draft
JiangJiaWei1103 wants to merge 28 commits into
ray-project:masterfrom
JiangJiaWei1103:hs-beta-lru-cache
Draft

[2/N] [History Server Beta] [Feat] LRU cache#4796
JiangJiaWei1103 wants to merge 28 commits into
ray-project:masterfrom
JiangJiaWei1103:hs-beta-lru-cache

Conversation

@JiangJiaWei1103
Copy link
Copy Markdown
Member

@JiangJiaWei1103 JiangJiaWei1103 commented May 2, 2026

Why are these changes needed?

Alpha Problems (Continued from [1/N])

  • Unbounded memory growth: The parsed state lived in the EventHandler in-memory maps without eviction

Quoting #4795 description:
"The unbounded memory growth problem will be solved in the next PR via caching."

Beta Strategy

In this PR, we add per-replica LRU cache bounded by the session snapshot count. So, the memory usage is bounded by O(#sessions x max_snapshot_size).

Note

This PR keeps the shared EventHandler populated alongside the LRU so reviewers can focus on bounded memory via LRU eviction with behavior matching [1/N]. Overall cleanup of shared maps will be done in the next PR.

Overview

System Architecture

                        ┌──────────────────┐
                        │   Dashboard UI   │ ◄── user browser
                        └────────┬─────────┘
                                 │ HTTP
                                 ▼
                        ┌──────────────────┐
                        │   Load Balancer  │
                        └────────┬─────────┘
                                 │
                                 ▼
   ┌── Ray Cluster ────────┐  ┌── History Server (replicas, single binary) ─────┐
   │                       │  │                                                 │
   │  Head pod             │  │   HTTP Router                                   │
   │  ray-head+collector   │  │   /enter_cluster (blocking on dead)             │
   │           │           │  │     │                                           │
   │           │           │  │     ▼                                           │
   │           │           │  │   Supervisor                                    │
   │           │           │  │   (singleflight + SnapshotLoader)               │
   │           │           │  │     │                                           │
   │           │           │  │   ┌─┴───────┐                                   │
   │           │ raw       │  │   ▼         ▼                                   │     ┌─────────┐
   │           │ events    │  │ Snapshot   Pipeline ─── isDead? (RayCluster) ───┼────►│  K8s    │
   │           │ (PUT)     │  │ Loader     (parse + buildSnapshotFromHandler)   │     │  API    │
   │           ▼           │  │  (LRU)        │                                 │     └─────────┘
   └───────────┼───────────┘  │   ▲           │ Prime(snap)                     │
               │              │   └───────────┘                                 │
               │              └─────────────────────────────────────────────────┘
               │                               ▲
               ▼                               │  raw events (GET)
   ┌──────────────────────────────────────────────────────────┐
   │  Object Storage  (S3 / GCS / Azure / Aliyun OSS)         │
   │                                                          │
   │  {rootDir}/{cluster}_{ns}/{session}/                     │
   │   node_events/, job_events/, logs/, fetched_endpoints/   │
   └──────────────────────────────────────────────────────────┘

Diff vs [1/N]:

  • Supervisor's loaded set (binary present/absent) replaced by SnapshotLoader (bounded, evicting)
  • Pipeline builds the SessionSnapshot from EventHandler's per-session view
  • HTTP handlers read directly from the LRU

Request Data Flow

sequenceDiagram
  actor User as Winner
  actor User2 as Follower
  participant HS as HS handler
  participant Sup as Supervisor (singleflight)
  participant Loader as SnapshotLoader
  participant Pipe as Pipeline
  participant K8s as K8s API
  participant S3 as Object Storage
  participant EH as EventHandler maps

  User->>HS: GET /enter_cluster (cold, dead session)
  User2->>HS: GET /enter_cluster (concurrent, same session)
  HS->>Sup: Ensure (winner)
  HS->>Sup: Ensure (follower)
  Note over Sup: same singleflight key →<br/>follower joins winner's group
  Sup->>Loader: Load(clusterID, session)
  Loader-->>Sup: ErrSnapshotNotFound (LRU miss)
  Sup->>Pipe: ProcessSession(serverCtx)
  Note over Pipe,Sup: serverCtx (NOT reqCtx) —<br/>winner disconnect cannot cancel<br/>work followers still depend on
  Pipe->>K8s: Get RayCluster
  K8s-->>Pipe: NotFound (dead)
  Pipe->>S3: List + GET event files
  Pipe->>EH: ProcessSingleSession (writes maps)
  Pipe->>Pipe: buildSnapshotFromHandler(snap)
  Pipe-->>Sup: StatusProcessed, *SessionSnapshot
  Sup->>Loader: Prime(snap)
  Sup-->>HS: nil (winner)
  Sup-->>HS: nil (follower, shared result)
  HS-->>User: 200 OK
  HS-->>User2: 200 OK

  Note over User,Loader: subsequent /api/v0/* calls go loader.Load → snap<br/>missing snap on a sibling replica or after eviction →<br/>503 + Retry-After: 600 → frontend re-fires /enter_cluster
Loading

Takeaways:

  • LRU key = {clusterNameID}/{sessionName} (matches singleflight key shape)
  • Prime plants the freshly-built snapshot into the LRU so the immediate follow-up handler call is zero-IO
  • On LRU miss, handler returns 503 + Retry-After; frontend re-fires /enter_cluster

Change Summary

Component File Purpose
SessionSnapshot (new) pkg/snapshot/snapshot.go Schema for the in-memory unit held per dead session (Tasks / Actors / Jobs / Nodes / LogEvents)
SnapshotLoader (new) pkg/historyserver/cache.go Per-replica bounded LRU; Load returns ErrSnapshotNotFound on miss; Prime inserts a freshly-built snapshot
GetRawEventsByJobID (new method) pkg/eventserver/types/log_event.go Raw LogEvent view (vs. existing camelCase API view) for snapshot building
Pipeline (modified) pkg/historyserver/pipeline.go ProcessSession returns (SessionStatus, *SessionSnapshot, error); new helpers buildSnapshotFromHandler + groupTasksByID
Supervisor (modified) pkg/historyserver/supervisor.go Drop loaded set, using SnapshotLoader.Load + SnapshotLoader.Prime
HTTP handlers (modified) pkg/historyserver/router.go All snapshot-backed handlers (/tasks, /actors, /jobs, /nodes, /events, /api/cluster_status) read from loader.Load
Timeline helpers (new) pkg/historyserver/timeline.go generateTimelineFromSnapshot + helpers (taskPrefix, getChromeTraceColor, extractActorIDFromTaskID)
ServerHandler (modified) pkg/historyserver/server.go Hold a *SnapshotLoader reference for handler reads
Wiring (modified) cmd/historyserver/main.go Construct SnapshotLoader; new --snapshot-cache-size flag (default 100)

Related issue number

Part of #4709.

How to test

echo "=== Build ===" && \
go build -o /tmp/hs-bin ./cmd/historyserver/ && rm -f /tmp/hs-bin && echo "BUILD OK"

echo && echo "=== Vet ===" && go vet ./... && echo "VET OK"

# Directly affected packages
echo && echo "=== Tests in packages touched by this PR (must PASS) ===" && \
go test -race -cover ./pkg/historyserver/... ./pkg/snapshot/...

# eventserver: this PR adds one new method (GetRawEventsByJobID);
# the existing failures here are present on master and are NOT introduced by this PR.
echo && echo "=== eventserver tests (failures here are pre-existing on master) ===" && \
go test -race ./pkg/eventserver/types/... && \
go test -race ./pkg/eventserver/ 2>&1 | tail -5 || true

Note on pkg/eventserver failures. Running go test ./pkg/eventserver/ shows 5 pre-existing failures on master:

--- FAIL: TestEventProcessor
--- FAIL: TestStoreEvent
--- FAIL: TestTaskLifecycleEventDeduplication
--- FAIL: TestActorLifecycleEventDeduplication
--- FAIL: TestMultipleReprocessingCycles (panics)

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

- Add package doc
- Move flag declarations into a var() block
- Add description strings to each flag
- Add section dividers for readability
- Group imports into stdlib / third-party / project
- Remove dead initialization of runtimeClassConfigPath (overwritten by flag default)

No behavior change.

Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
- Use logrus.Fatalf(...) for consistent log formatting
- Add required-flag check on --runtime-class-name for fast fail

Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
- Replace manual sigChan + signal.Notify with signal.NotifyContext
- Add bridge goroutine that closes the legacy stop ch when serverCtx fires

No behavior change.

Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
- Add a Supervisor as /enter_cluster broker
  - Use in-process loaded set to record processed sessions
- Add a Pipeline that wraps the parse steps
- Remove eager processing on startup and ticker

Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
@JiangJiaWei1103
Copy link
Copy Markdown
Member Author

JiangJiaWei1103 commented May 2, 2026

Alpha e2e tests pass and the following demonstrates manual test:

History Server Beta - Lazy Loading + LRU Cache

Will switch to "Ready for review" once #4795 ([1/N]) is merged.

Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
…name

Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant