Add http client metrics #4822

Open
marosset wants to merge 2 commits into ray-project:master from marosset:http-latency-metrics
Conversation

@marosset
Contributor

Why are these changes needed?

  • Adds Prometheus histogram metrics for outbound HTTP requests from the operator to the dashboard and server APIs, enabling p95/p99 latency tracking and error-rate monitoring.
    This will help debug timeout issues and compare performance between direct and proxied calls.
  • Adds a new row to the Grafana dashboard definition JSON file.

Related issue number

Fixes #4697

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

I also verified this in a kind cluster

Local testing

To test this out locally:

  1. Create a kind cluster.
  2. Build and load the local kuberay operator into kind.
  3. Install the local operator with `--set extraArgs='{--enable-metrics}'` and `--set metrics.serviceMonitor.enabled=true`.

To view the raw metrics:

  1. Port-forward the operator pod:
     OPERATOR_POD=$(kubectl get pods -l app.kubernetes.io/component=kuberay-operator -o jsonpath='{.items[0].metadata.name}')
     kubectl port-forward pod/$OPERATOR_POD 8080:8080 &
  2. Query the metrics:
     curl -s localhost:8080/metrics

To view the metrics in Grafana:

  1. Run `./install/prometheus/install.sh --auto-load-dashboard true` to install Prometheus and load the kuberay dashboard.
  2. Run `kubectl -n prometheus-system port-forward svc/prometheus-grafana 3000:80 &`.
  3. Open the Grafana dashboard at localhost:3000.

Signed-off-by: Mark Rossetti <marosset@microsoft.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fa774a8cc2

g.Expect(err).NotTo(HaveOccurred())
g.Expect(operatorPods.Items).NotTo(BeEmpty(), "kuberay-operator pod not found")

operatorPod := operatorPods.Items[0]

P2: Select a ready operator pod before scraping metrics

This picks operatorPods.Items[0] without checking phase/readiness, but Kubernetes list order is not stable and can include terminating or not-yet-ready pods during upgrades/restarts. In that case operatorPod.Status.PodIP may be empty or unreachable and the curl scrape becomes flaky even though a healthy operator pod exists. Please filter for a Running/Ready pod (or at least non-empty PodIP) before building the metrics URL; the same pattern also appears in rayservice_httpclient_metrics_test.go.
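
One way to express the filtering suggested above is sketched here. To keep the example runnable without Kubernetes dependencies, `pod` is a simplified, hypothetical stand-in for `corev1.Pod`, and `firstReadyPod` is an illustrative helper name, not something from this PR. In the real test the same checks would read the pod phase, the PodReady condition, the deletion timestamp, and `Status.PodIP`.

```go
package main

import (
	"errors"
	"fmt"
)

// pod is a simplified, hypothetical stand-in for corev1.Pod that keeps only
// the fields relevant to choosing a scrape target.
type pod struct {
	Name     string
	Phase    string // e.g. "Running" or "Pending"
	Ready    bool   // true when the PodReady condition is True
	Deleting bool   // true when a deletion timestamp is set
	PodIP    string
}

// firstReadyPod returns the first pod that is running, ready, not
// terminating, and has an IP assigned. It returns an error when no pod
// qualifies, so the caller can retry instead of scraping a dead address.
func firstReadyPod(pods []pod) (pod, error) {
	for _, p := range pods {
		if p.Phase == "Running" && p.Ready && !p.Deleting && p.PodIP != "" {
			return p, nil
		}
	}
	return pod{}, errors.New("no ready kuberay-operator pod found")
}

func main() {
	// List order is not stable, so the healthy pod is deliberately last.
	pods := []pod{
		{Name: "operator-old", Phase: "Running", Ready: true, Deleting: true, PodIP: "10.0.0.4"},
		{Name: "operator-starting", Phase: "Pending"},
		{Name: "operator-new", Phase: "Running", Ready: true, PodIP: "10.0.0.5"},
	}
	p, err := firstReadyPod(pods)
	if err != nil {
		panic(err)
	}
	fmt.Printf("scraping http://%s:8080/metrics from %s\n", p.PodIP, p.Name)
}
```

Returning an error rather than falling back to `Items[0]` lets the e2e test wait and retry until a healthy pod appears.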

// e.g., /api/v1/namespaces/default/services/svc:dashboard/proxy/api/jobs/id
if idx := strings.Index(urlPath, "/proxy/"); idx != -1 {
urlPath = urlPath[idx+len("/proxy"):]
}

strings.Index matches wrong /proxy/ in namespace-containing paths

Medium Severity

normalizeEndpoint uses strings.Index to find the Kubernetes API server /proxy/ prefix, but this matches the first occurrence. If the Ray cluster's namespace is literally "proxy", the URL path becomes e.g. /api/v1/namespaces/proxy/services/svc:dashboard/proxy/api/jobs/id, and the first /proxy/ match is the one after namespaces, not the actual K8s proxy delimiter. This causes the stripped path to be /services/svc:dashboard/proxy/api/jobs/id, which doesn't match any known endpoint pattern, so all metrics get labeled ray_endpoint="unknown". Using strings.LastIndex instead would correctly find the K8s proxy delimiter.
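
The difference between the two lookups can be seen in a small standalone sketch. `stripWithIndex` and `stripWithLastIndex` are hypothetical helpers written for this illustration; they mirror only the prefix-stripping step quoted above, not the PR's full normalizeEndpoint.

```go
package main

import (
	"fmt"
	"strings"
)

const proxyMarker = "/proxy/"

// stripWithIndex mirrors the reviewed snippet: it cuts at the FIRST "/proxy/".
func stripWithIndex(urlPath string) string {
	if idx := strings.Index(urlPath, proxyMarker); idx != -1 {
		return urlPath[idx+len("/proxy"):]
	}
	return urlPath
}

// stripWithLastIndex applies the suggested fix: it cuts at the LAST "/proxy/".
func stripWithLastIndex(urlPath string) string {
	if idx := strings.LastIndex(urlPath, proxyMarker); idx != -1 {
		return urlPath[idx+len("/proxy"):]
	}
	return urlPath
}

func main() {
	paths := []string{
		"/api/v1/namespaces/default/services/svc:dashboard/proxy/api/jobs/id",
		"/api/v1/namespaces/proxy/services/svc:dashboard/proxy/api/jobs/id",
		"/api/jobs/id", // direct call, no API server proxy prefix
	}
	for _, p := range paths {
		fmt.Printf("%s\n  Index:     %s\n  LastIndex: %s\n", p, stripWithIndex(p), stripWithLastIndex(p))
	}
}
```

Note that `strings.LastIndex` has the opposite blind spot: if a proxied Ray path ever contained its own `/proxy/` segment, the last match would cut too far. Anchoring on the `:dashboard/proxy/` service suffix might be more robust, though that is an assumption about the URL shape rather than something this review settles.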

Reviewed by Cursor Bugbot for commit f3b4b78.

@Future-Outlier
Member

cc @win5923 @fscnick @seanlaii to take a look, tks!

@Future-Outlier
Member

Please let the issue assignee know that you’d like to work on this issue, rather than submitting a PR directly next time. Thank you.

Contributor

@seanlaii seanlaii left a comment

Thanks for your contribution!

httpClient.Transport = mgr.GetHTTPClient().Transport
dashboardURL = fmt.Sprintf("%s/api/v1/namespaces/%s/services/%s:dashboard/proxy", mgr.GetConfig().Host, rayCluster.Namespace, headSvcName)
}
httpClient.Transport = httpclientmetrics.NewInstrumentedRoundTripper(httpClient.Transport, httpclientmetrics.ClientTypeDashboard, mode)

The instrumented round tripper unconditionally wraps the transport, even when metrics are disabled.
Although the cost is small, maybe we could wrap it only when metrics are enabled.

return metric.GetHistogram().GetSampleCount()
}

func TestDashboardClientHistogram(t *testing.T) {

nit: Consider restructuring tests around the behavior being tested rather than the client type.

The current tests are well-written and cover the right behaviors.

One suggestion for readability: since newTestInstrumentedRT() creates isolated metrics independent of the package-level dashboard/proxy vars, the test names like TestDashboardClientHistogram / TestProxyClientCounter imply a distinction that doesn't actually exist at this level. Both groups exercise the same instrumentedRoundTripper.RoundTrip() code path. Restructuring around the behavior being tested makes it easier for future contributors to understand what's covered:

  • TestInstrumentedRoundTripper_RecordsHistogram: all method/path/code combinations in one table-driven test
  • TestInstrumentedRoundTripper_Modes: verifies the mode label
  • TestInstrumentedRoundTripper_TransportError: verifies code="error" on transport failure
  • TestInstrumentedRoundTripper_CounterIncrement: verifies counter accumulates correctly

This also removes some duplication (e.g., the dashboard and proxy counter tests verify the same counter logic with different paths, which is already covered by TestNormalizeEndpoint).

Also, the "serve applications with query" case in TestNormalizeEndpoint passes a path containing ?api_type=declarative, but normalizeEndpoint receives req.URL.Path which never includes query parameters.
Therefore, suggest removing it to avoid giving the impression that query string handling is tested here.

@seanlaii
Contributor

seanlaii commented May 12, 2026

Also, could you help fix the e2e tests?
Reference: https://buildkite.com/ray-project/ray-ecosystem-ci-kuberay-ci/builds/14959/canvas?jid=019e1839-f9fc-4039-b67a-fb66e3fa02d6&tab=output#019e1839-f9fc-4039-b67a-fb66e3fa02d6/L1490
Thanks!

Future-Outlier

This comment was marked as resolved.


// NewInstrumentedRoundTripper returns a new http.RoundTripper that records
// latency and request count metrics.
func NewInstrumentedRoundTripper(inner http.RoundTripper, clientType ClientType, mode Mode) http.RoundTripper {

Would it be better to make the constructor independent of the client type? The InstrumentedRoundTripper measures elapsed time and counts requests, and that logic does not differ by client type.

For example:

func NewInstrumentedRoundTripper(inner http.RoundTripper, histogram *prometheus.HistogramVec, counter *prometheus.CounterVec, mode Mode) http.RoundTripper 

@marosset
Contributor Author

My apologies! I will close this PR. @JiangJiaWei1103 let me know if you would like me to re-open this. If not feel free to continue working on the issue.

@marosset marosset closed this May 13, 2026
@JiangJiaWei1103
Member

My apologies! I will close this PR. @JiangJiaWei1103 let me know if you would like me to re-open this. If not feel free to continue working on the issue.

No worries at all! I've been a bit bandwidth-limited lately, so please feel free to go ahead and finish it up. Thank you!

@Future-Outlier
Member

I reopened this PR. Please keep contributing to this project, thank you!

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit f3b4b78.

httpClient.Transport = mgr.GetHTTPClient().Transport
dashboardURL = fmt.Sprintf("%s/api/v1/namespaces/%s/services/%s:dashboard/proxy", mgr.GetConfig().Host, rayCluster.Namespace, headSvcName)
}
httpClient.Transport = httpclientmetrics.NewInstrumentedRoundTripper(httpClient.Transport, httpclientmetrics.ClientTypeDashboard, mode)

Metrics instrumentation wraps transport even when disabled

Low Severity

NewInstrumentedRoundTripper unconditionally wraps the HTTP transport in both GetRayDashboardClientFunc and GetRayHttpProxyClientFunc, even when the operator runs with metrics disabled. RegisterMetrics is only called inside if config.EnableMetrics in main.go, but these call sites have no awareness of the metrics flag. When metrics are disabled, every outbound HTTP request still incurs timing and counter overhead, and the unregistered Prometheus collectors silently accumulate data that will never be scraped.

Reviewed by Cursor Bugbot for commit f3b4b78.

Development

Successfully merging this pull request may close these issues.

[Feature] [observability] Add latency metrics (p95, p99) for Ray HTTP clients

5 participants