[1/N] [History Server Beta] [Feat] Lazy loading#4795
Conversation
- Add package doc - Move flag declarations into a var() block - Add description strings to each flag - Add section dividers for readability - Group imports into stdlib / third-party / project - Remove dead initialization of runtimeClassConfigPath (overwritten by flag default) No behavior change. Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
- Use logrus.Fatalf(...) for consistent log formatting - Add required-flag check on --runtime-class-name for fast fail Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
- Replace manual sigChan + signal.Notify with signal.NotifyContext - Add bridge goroutine that closes the legacy stop ch when serverCtx fires No behavior change. Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
- Add a Supervisor as /enter_cluster broker - Use in-process loaded set to record processed sessions - Add a Pipeline that wraps the parse steps - Remove eager processing on startup and ticker Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
|
Alpha E2E tests pass and the following demonstrates manual test: |
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fad772ebf8
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
…name Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fec4ba535c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Future-Outlier
left a comment
There was a problem hiding this comment.
will do more review today.
| if err := logEventReader.ReadLogEvents(clusterInfo, clusterSessionKey, h.ClusterLogEventMap); err != nil { | ||
| logrus.Errorf("Incomplete Log Events read for %s: %v. /events endpoint may serve partial data.", | ||
| clusterSessionKey, err) | ||
| } |
There was a problem hiding this comment.
ProcessSingleSession ignores context during log event reading
Low Severity
ProcessSingleSession calls ReadLogEvents before checking ctx.Err(), and ReadLogEvents does not accept a context parameter. If the 2-minute loadTimeout or server shutdown fires during log event reading, the cancellation won't be observed until ReadLogEvents completes and the ray events loop begins its first iteration. For sessions with many nodes and log files, this can delay graceful shutdown and exceed the intended per-load timeout contract.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 96a12be. Configure here.
There was a problem hiding this comment.
Fine-grained ctx management will be resolved in the follow-up.
| } | ||
| rayEventsRead++ | ||
|
|
||
| // Corrupt file: don't count as a hard failure since retrying won't help. Accept partial loss. |
There was a problem hiding this comment.
should this comment be inside the err != nil block below?
There was a problem hiding this comment.
Also fixed in 8caa9a1 to explain how the following two errors are handled and why.
| } | ||
| } | ||
|
|
||
| if rayEventsTotal > 0 && rayEventsRead == 0 { |
There was a problem hiding this comment.
same comment here, should this just check rayEventsTotal != rayEventsRead?
There was a problem hiding this comment.
Good question. For log events, total != read is essentially free: errors do not propagate past ProcessSingleSession, so loaded is still set, and the next /enter_cluster hits the fast-path. The only effect is better visibility.
For ray events, errors propagate to doLoadSession, which blocks loaded from being set. So rayEventsTotal != rayEventsRead here means every subsequent /enter_cluster call re-runs the full reload until the partial-failure condition heals.
The proper fix for ray events might be to surface "loaded but incomplete" as a richer marker instead of a bool. I think it is a non-trivial change because it touches loaded semantics in SessionLoader and ProcessSession's returns.
Would it make sense to defer richer error handling to follow-up? Thanks!
| } | ||
|
|
||
| // TODO(jwj): Graceful drain if needed. Currently SIGTERM immediately cancels | ||
| // in-flight work. If 500-during-shutdown becomes a customer pain point, switch |
There was a problem hiding this comment.
This comment is confusing, what is "500-during-shutdown"?
There was a problem hiding this comment.
serverCtx is signal.NotifyContext(SIGINT, SIGTERM), which cancels the moment the signal arrives (no grace window). loadCtx derives from serverCtx, so any in-flight doLoadSession cancels with it and surfaces as HTTP 500 to clients via router.go.
Clarified comments in 0b90fc9.
|
Is there unused code in eventserver.go we can delete now? Like this method? kuberay/historyserver/pkg/eventserver/eventserver.go Lines 87 to 209 in f384932 |
| const defaultSessionLoadTimeout = 2 * time.Minute | ||
|
|
There was a problem hiding this comment.
- rename defaultSessionLoadTimeout -> defaultSessionProcessTimeout
- change to 10 minutes
There was a problem hiding this comment.
Renamed and made it configurable (keeping 2-min default) in ab80886.
Future-Outlier
left a comment
There was a problem hiding this comment.
we should add a flag for user to configure their session process timeout
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
…g event obj Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
| reader: reader, | ||
| clientManager: clientManager, | ||
| eventHandler: eventHandler, | ||
| sessionLoader: sessionLoader, |
There was a problem hiding this comment.
Cold-load timeout exceeds HTTP server WriteTimeout
High Severity
The enter_cluster handler now blocks synchronously on LoadSession, which can run for up to DefaultSessionProcessTimeout (2 minutes). However, the HTTP server's WriteTimeout is only 35 seconds. Any cold-load exceeding 35 seconds causes the server to close the TCP connection before the handler writes a response, resulting in a client-visible error even though the singleflight processing completes successfully in the background.
Additional Locations (2)
Reviewed by Cursor Bugbot for commit ab80886. Configure here.
There was a problem hiding this comment.
This is intentional for now. Processing continues in the background, so users can return later to see the result.
Better UX would be showing a message in the frontend to set expectations around processing time.
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
All dead codes (including tests) are deleted in 1a55a2a! |
…ection Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
andrewsykim
left a comment
There was a problem hiding this comment.
LGTM, just some minor nits, can merge after those are resolved
| // isDead determines if the RayCluster CR is absent. | ||
| func (p *SessionProcessor) isDead(ctx context.Context, session utils.ClusterInfo) (bool, error) { | ||
| rc := &rayv1.RayCluster{} | ||
| err := p.k8sClient.Get(ctx, k8stypes.NamespacedName{ |
There was a problem hiding this comment.
This is a cache read right?
There was a problem hiding this comment.
No, it is uncached now. The K8s pressure is bounded by the number of distinct dead sessions ever accessed, not by request volume. singleflight + the loaded fast-path absorb repeated reads of the same session.
Switching to a cached client (informer) would still be cleaner (failed-load retries stop hitting the API, and reads become O(1) in-memory). Do you prefer introducing this here? Thanks!
There was a problem hiding this comment.
I'm not sure I'm following you, I assumed the controller-runtime client creates an informer by default and does a cache read from informer, is that not the case here?
If we are doing an API GET request here every time that's probably problematic
There was a problem hiding this comment.
ah, even if it's a GET request every time, it's only once per cluster when we load the session, so it's probably fine
There was a problem hiding this comment.
For the current impl, it is client.New(...) (not a Manager-backed client) so no informer, every isDead is a real API GET. Let me switch to ctrl.NewManager + mgr.GetClient().
Sorry for missing your latest reply! Would you prefer to make this switch in the current PR, or handle it separately? Either works for me. Thx.
There was a problem hiding this comment.
separate PR is fine, thanks for checking
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Future-Outlier
left a comment
There was a problem hiding this comment.
LGTM. I only reviewed the main logic and didn’t review the test code.
Please also resolve the TODOs I listed. Thank you!
| <-serverCtx.Done() | ||
| logrus.Info("Received shutdown signal, initiating graceful shutdown...") | ||
| close(stop) | ||
| }() |
There was a problem hiding this comment.
Can we add an issue to use serverCtx and replace/remove the “stop channel”?
Thank you!
| // Client returns the primary controller-runtime client. | ||
| func (c *ClientManager) Client() client.Client { | ||
| return c.clients[0] | ||
| } |
There was a problem hiding this comment.
todo: we should only have 1 client.
| serverCtx context.Context | ||
| processTimeout time.Duration |
There was a problem hiding this comment.
plz do this in a follow-up PR, thanks!
Do not store Contexts inside a struct type; instead, pass a Context explicitly to each function that needs it. This is discussed further in https://go.dev/blog/context-and-structs.
| const ( | ||
| // SessionStatusUnknown is the zero value, reserved as a defensive guard. | ||
| SessionStatusUnknown SessionStatus = iota | ||
| // SessionStatusLive means the RayCluster CR is still present and the | ||
| // session is intentionally skipped. | ||
| SessionStatusLive | ||
| // SessionStatusProcessed means events were ingested into EventHandler's | ||
| // in-memory state. | ||
| SessionStatusProcessed | ||
| // SessionStatusClusterStateUnknown means the cluster state could not be | ||
| // determined (e.g., transient API failure). | ||
| SessionStatusClusterStateUnknown | ||
| // SessionStatusEventsErr means event parsing failed. | ||
| SessionStatusEventsErr | ||
| // SessionStatusCanceled means ctx was canceled mid-processing. | ||
| SessionStatusCanceled | ||
| ) |
There was a problem hiding this comment.
SessionStatusUnknown and SessionStatusClusterStateUnknown feel
ambiguous to me, we can add a follow-up issue to solve this
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b76f5b9979
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if apierrors.IsNotFound(err) { | ||
| return true, nil | ||
| } | ||
| if err != nil { | ||
| return false, err | ||
| } | ||
| return false, nil |
There was a problem hiding this comment.
Distinguish queried session from current live session
isDead treats any existing RayCluster CR as live, but /enter_cluster/{namespace}/{name}/{session} now relies on this result to decide whether to load historical events or rewrite the cookie to live. When a cluster is still running but the user selects an older session_* from storage (common after head restarts), this path incorrectly returns live, skips ProcessSingleSession, and prevents access to that dead session’s data. The dead/live check needs to compare the requested session against the cluster’s current live session_name, not just CR existence.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
We focus on main use cases only and might solve this in the follow-up.
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 3 total unresolved issues (including 2 from previous reviews).
Reviewed by Cursor Bugbot for commit cac9a7b. Configure here.
|
lgtm |



Why are these changes needed?
Alpha Problems
EventHandlerre-parsed all raw events every hour (bounded byO(#sessions))Process restart wiped all in-memory maps and the next startup repeated the entire warm-up cycle.
Beta Strategy
In this PR, we remove startup processing/ticker and introduce lazy loading to process raw events on user requests and records processed dead sessions in in-memory maps (inherited from alpha).
Note
The unbounded memory growth problem will be solved in the next PR via caching and the third PR's dead code cleanup.
Overview
System Architecture
Request Data Flow
sequenceDiagram actor User as Winner actor User2 as Follower participant HS as HS handler participant Loader as SessionLoader (singleflight + loaded set) participant Proc as SessionProcessor participant Resolver as HTTPLiveSessionResolver participant K8s as K8s API participant Dash as Ray Dashboard participant S3 as Object Storage participant EH as EventHandler maps User->>HS: GET /enter_cluster (cold, dead session) User2->>HS: GET /enter_cluster (concurrent, same session) HS->>Loader: LoadSession (winner) HS->>Loader: LoadSession (follower) Note over Loader: same singleflight key →<br/>follower joins winner's group Loader->>Loader: loaded set check → not found Loader->>Proc: ProcessSession(serverCtx) Note over Proc,Loader: serverCtx (NOT reqCtx) —<br/>winner disconnect cannot cancel<br/>work followers still depend on Proc->>K8s: Get RayCluster alt CR NotFound K8s-->>Proc: NotFound → dead else CR exists K8s-->>Proc: RayCluster (Status.Head.ServiceName set) Proc->>Resolver: FetchSessionName(ns, headSvcName) Resolver->>Dash: GET /api/version Dash-->>Resolver: { session_name: "<live_session>" } Resolver-->>Proc: "<live_session>" Note over Proc: queried != live → dead end Proc->>S3: List + GET event files Proc->>EH: ProcessSingleSession (writes into Tasks/Actors/Jobs/Nodes/LogEvents) Proc-->>Loader: SessionStatusProcessed, nil Loader->>Loader: mark session in loaded set Loader-->>HS: nil (winner) Loader-->>HS: nil (follower, shared result) HS-->>User: 200 OK HS-->>User2: 200 OK Note over User,EH: subsequent /api/v0/* calls read from EventHandler maps directly<br/>repeat /enter_cluster returns immediately via loaded set fast-path<br/>(no re-parse on this replica)Components and Design Decisions
SessionProcessorDetermine session liveness by comparing the queried session name against the one currently running on the live RayCluster; execute
ProcessSingleSessionfor dead sessions.Design Decisions
Probe-based comparison replaces the old timestamp comparison, which could not distinguish multiple sessions in the same CR (head-pod restarts) and had a time zone alignment issue in
ParseSessionTimestamp.Trade-off: One HTTP round-trip per cold load; hot path unaffected by
loadedcache.SessionLoaderGate
/enter_clusteruntil session events are in memory.Design Decisions
singleflightper session key prevents redundant storage scans when concurrent callers race on the same dead sessionTrade-off: First-request cold-load latency; subsequent hits are O(1).
HTTPLiveSessionResolverCall
/api/versionon the Ray dashboard and return the cluster's currentsession_name.Design Decisions
Fresh implementation rather than reusing
RayDashboardClientsince the existing interface lackedGetVersion()and its factory requiredmanager.Manager, which is unavailable in HS.Trade-off: URL-construction logic is duplicated with
redirectRequest.ProcessSingleSessionProcess ray-events and log-events into in-memory maps for one session.
Design Decisions
ray-events failure blocks the
loadedmarker; log-events failure is logged but does not block. log-events feeds only/events, so coupling both error paths would let a transient log outage block all other dead-session APIs.Trade-off: Silent data loss on
/eventsif log-events fails; no retry until HS restart./enter_clusterRoute requests to
SessionLoader(dead session) or dashboard proxy (live session).Design Decisions
Session-name validation lives at the router level rather than inside
isDead, following the KubeRay caller-validates convention.Trade-off: —
Related issue number
Part of #4709.
How to test
Checks