Add http client metrics#4822
Conversation
Signed-off-by: Mark Rossetti <marosset@microsoft.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fa774a8cc2
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| g.Expect(err).NotTo(HaveOccurred()) | ||
| g.Expect(operatorPods.Items).NotTo(BeEmpty(), "kuberay-operator pod not found") | ||
|
|
||
| operatorPod := operatorPods.Items[0] |
There was a problem hiding this comment.
Select a ready operator pod before scraping metrics
This picks operatorPods.Items[0] without checking phase/readiness, but Kubernetes list order is not stable and can include terminating or not-yet-ready pods during upgrades/restarts. In that case operatorPod.Status.PodIP may be empty or unreachable and the curl scrape becomes flaky even though a healthy operator pod exists. Please filter for a Running/Ready pod (or at least non-empty PodIP) before building the metrics URL; the same pattern also appears in rayservice_httpclient_metrics_test.go.
Useful? React with 👍 / 👎.
| // e.g., /api/v1/namespaces/default/services/svc:dashboard/proxy/api/jobs/id | ||
| if idx := strings.Index(urlPath, "/proxy/"); idx != -1 { | ||
| urlPath = urlPath[idx+len("/proxy"):] | ||
| } |
There was a problem hiding this comment.
strings.Index matches wrong /proxy/ in namespace-containing paths
Medium Severity
normalizeEndpoint uses strings.Index to find the Kubernetes API server /proxy/ prefix, but this matches the first occurrence. If the Ray cluster's namespace is literally "proxy", the URL path becomes e.g. /api/v1/namespaces/proxy/services/svc:dashboard/proxy/api/jobs/id, and the first /proxy/ match is the one after namespaces, not the actual K8s proxy delimiter. This causes the stripped path to be /services/svc:dashboard/proxy/api/jobs/id, which doesn't match any known endpoint pattern, so all metrics get labeled ray_endpoint="unknown". Using strings.LastIndex instead would correctly find the K8s proxy delimiter.
Reviewed by Cursor Bugbot for commit f3b4b78. Configure here.
|
Please let the issue assignee know that you’d like to work on this issue, rather than submitting a PR directly next time. Thank you. |
seanlaii
left a comment
There was a problem hiding this comment.
Thanks for your contribution!
| httpClient.Transport = mgr.GetHTTPClient().Transport | ||
| dashboardURL = fmt.Sprintf("%s/api/v1/namespaces/%s/services/%s:dashboard/proxy", mgr.GetConfig().Host, rayCluster.Namespace, headSvcName) | ||
| } | ||
| httpClient.Transport = httpclientmetrics.NewInstrumentedRoundTripper(httpClient.Transport, httpclientmetrics.ClientTypeDashboard, mode) |
There was a problem hiding this comment.
The instrumented round tripper is unconditionally wrapping transport, even when metrics are disabled.
Although the cost is small, maybe we could conditionally enable it by checking if the metrics is enabled.
| return metric.GetHistogram().GetSampleCount() | ||
| } | ||
|
|
||
| func TestDashboardClientHistogram(t *testing.T) { |
There was a problem hiding this comment.
nit: Consider restructuring tests around the behavior being tested rather than the client type.
The current tests are well-written and cover the right behaviors.
One suggestion for readability: since newTestInstrumentedRT() creates isolated metrics independent of the package-level dashboard/proxy vars, the test names like TestDashboardClientHistogram / TestProxyClientCounter imply a distinction that doesn't actually exist at this level. Both groups exercise the same instrumentedRoundTripper.RoundTrip() code path. Restructuring around the behavior being tested makes it easier for future contributors to understand what's covered:
TestInstrumentedRoundTripper_RecordsHistogram: all method/path/code combinations in one table-driven test- TestInstrumentedRoundTripper_Modes: verifies the mode label
- TestInstrumentedRoundTripper_TransportError: verifies code="error" on transport failure
- TestInstrumentedRoundTripper_CounterIncrement: verifies counter accumulates correctly
This also removes some duplication (e.g., the dashboard and proxy counter tests verify the same counter logic with different paths, which is already covered by TestNormalizeEndpoint).
Also, the "serve applications with query" case in TestNormalizeEndpoint passes a path containing ?api_type=declarative, but normalizeEndpoint receives req.URL.Path which never includes query parameters.
Therefore, suggest removing it to avoid giving the impression that query string handling is tested here.
|
Also, could you help fix the e2e tests? |
|
|
||
| // NewInstrumentedRoundTripper returns a new http.RoundTripper that records | ||
| // latency and request count metrics. | ||
| func NewInstrumentedRoundTripper(inner http.RoundTripper, clientType ClientType, mode Mode) http.RoundTripper { |
There was a problem hiding this comment.
Would it be better to let the creation function be independent of type of client? The InstrumentedRoundTripper calculates the elapsed time and counts requests. It is not different by type of client.
ex:
func NewInstrumentedRoundTripper(inner http.RoundTripper, histogram *prometheus.HistogramVec, counter *prometheus.CounterVec, mode Mode) http.RoundTripper |
My apologies! I will close this PR. @JiangJiaWei1103 let me know if you would like me to re-open this. If not feel free to continue working on the issue. |
No worries at all! I've been a bit bandwidth-limited lately, so please feel free to go ahead and finish it up. Thank you! |
|
I reopend this PR, plz keep contributing to this project, thank you! |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit f3b4b78. Configure here.
| httpClient.Transport = mgr.GetHTTPClient().Transport | ||
| dashboardURL = fmt.Sprintf("%s/api/v1/namespaces/%s/services/%s:dashboard/proxy", mgr.GetConfig().Host, rayCluster.Namespace, headSvcName) | ||
| } | ||
| httpClient.Transport = httpclientmetrics.NewInstrumentedRoundTripper(httpClient.Transport, httpclientmetrics.ClientTypeDashboard, mode) |
There was a problem hiding this comment.
Metrics instrumentation wraps transport even when disabled
Low Severity
NewInstrumentedRoundTripper unconditionally wraps the HTTP transport in both GetRayDashboardClientFunc and GetRayHttpProxyClientFunc, even when the operator runs with metrics disabled. RegisterMetrics is only called inside if config.EnableMetrics in main.go, but these call sites have no awareness of the metrics flag. When metrics are disabled, every outbound HTTP request still incurs timing and counter overhead, and the unregistered Prometheus collectors silently accumulate data that will never be scraped.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit f3b4b78. Configure here.


Why are these changes needed?
This will help debug timeout issues and also look at performance differences between direct and proxied calls.
Related issue number
Fixes #4697
Checks
I also verified this in a kind cluster
Local testing
To test this out locally
--set extraArgs='{--enable-metrics}' and--set metrics.serviceMonitor.enabled=true`to view the raw metrics query
to view the metrics in grafana
kubectl -n prometheus-system port-forward svc/prometheus-grafana 3000:80 &