Metrics

This page covers how to enable and interpret metrics from an llm-d deployment. For Prometheus and Grafana installation, see Observability Setup first.

Note: Commands on this page use ${NAMESPACE} for the namespace where your llm-d workload runs. Set it before following along:

export NAMESPACE=<your-llm-d-namespace>

Prerequisites

  • A running llm-d deployment with an InferencePool and model servers — see the quickstart if needed
  • Prometheus and Grafana installed — see Observability Setup

Step 1: Enable Model Server Metrics

Model server metrics are enabled by default. Configuration varies by deployment method.

Kustomize Deployments

If you deployed your model server using kustomize build, add the monitoring component to your kustomization.yaml:

components:
- ../../../recipes/modelserver/components/monitoring # decode PodMonitor
# - ../../../recipes/modelserver/components/monitoring-pd # add for prefill/decode disaggregation

The monitoring component creates PodMonitors that scrape model server metrics. See guides/recipes/modelserver/components/monitoring/ for details.

Verify PodMonitors

Verify the PodMonitors exist:

kubectl get podmonitors -n ${NAMESPACE}

Expected output (prefill-podmonitor is listed only if you added the monitoring-pd component):

NAME AGE
decode-podmonitor 5m
prefill-podmonitor 5m

Key vLLM Metrics

| Metric | What it measures | Why it matters |
| --- | --- | --- |
| vllm:num_requests_running | Active requests being processed | High values indicate GPU saturation; new requests will queue. Watch for sustained spikes |
| vllm:num_requests_waiting | Requests queued, waiting to be processed | Non-zero means pods are saturated. Primary signal for autoscaling decisions |
| vllm:kv_cache_usage_perc | KV cache utilization (0.0 to 1.0) | Above 0.9 means GPU memory is nearly full and requests may get preempted or rejected |
| vllm:time_to_first_token_seconds (histogram) | Time from request arrival to first generated token (TTFT) | Directly impacts user experience. Use histogram_quantile() to query percentiles |
| vllm:inter_token_latency_seconds (histogram) | Time between consecutive generated tokens (ITL) | Affects streaming response speed. High ITL causes choppy output. Use histogram_quantile() to query percentiles |
| vllm:prefix_cache_hits_total | Number of prefix cache hits | Compare with prefix_cache_queries_total to get hit rate. Low hit rate suggests the EPP is not routing effectively |
| vllm:prefix_cache_queries_total | Total prefix cache lookups | Divide prefix_cache_hits_total by this to get hit rate. A dropping ratio indicates routing or prompt pattern changes |
| vllm:prompt_tokens_total | Total input tokens processed | Use rate() to get tokens/sec per pod. Compare across pods to spot uneven load distribution |
| vllm:generation_tokens_total | Total output tokens generated | Use rate() alongside prompt tokens to get total throughput. A drop signals degraded model performance |
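
The hit-rate, percentile, and throughput calculations mentioned in the table can be sketched as PromQL expressions. The block below only stores them in shell variables and prints them, so they can be pasted into the Prometheus UI. The variable names, the 5m window, and the pod label (assumed here to be attached to scraped targets by the PodMonitor) are illustrative assumptions, not values taken from llm-d.

```shell
# PromQL sketches for the vLLM metrics above (window and labels are assumptions).

# Prefix cache hit rate across all pods, 0.0 to 1.0.
HIT_RATE='sum(rate(vllm:prefix_cache_hits_total[5m])) / sum(rate(vllm:prefix_cache_queries_total[5m]))'

# 95th percentile time to first token; histograms expose buckets as <name>_bucket.
TTFT_P95='histogram_quantile(0.95, sum by (le) (rate(vllm:time_to_first_token_seconds_bucket[5m])))'

# Prompt plus generation tokens per second, per pod (assumes a "pod" target label).
TOKENS_PER_SEC='sum by (pod) (rate(vllm:prompt_tokens_total[5m])) + sum by (pod) (rate(vllm:generation_tokens_total[5m]))'

printf '%s\n' "$HIT_RATE" "$TTFT_P95" "$TOKENS_PER_SEC"
```

If your targets carry different labels, adjust the by (...) clauses accordingly.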

Key SGLang Metrics

| Metric | What it measures | Why it matters |
| --- | --- | --- |
| sglang:num_running_reqs | Active requests being processed | High values indicate GPU saturation; new requests will queue |
| sglang:num_queue_reqs | Requests queued, waiting to be processed | Non-zero means pods are saturated. Primary signal for autoscaling decisions |
| sglang:token_usage | KV cache token utilization (0.0 to 1.0) | Above 0.9 means GPU memory is nearly full |
| sglang:time_to_first_token_seconds (histogram) | Time from request arrival to first generated token (TTFT) | Directly impacts user experience. Use histogram_quantile() to query percentiles |
| sglang:inter_token_latency_seconds (histogram) | Time between consecutive generated tokens (ITL) | Affects streaming response speed. Use histogram_quantile() to query percentiles |
| sglang:prompt_tokens_total | Total input tokens processed | Use rate() to get tokens/sec per pod |
| sglang:generation_tokens_total | Total output tokens generated | Use rate() alongside prompt tokens to get total throughput |
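
For a quick check of a single pod without Prometheus, the two saturation signals from the table (a non-zero queue and token usage above 0.9) can be read straight from the pod's /metrics output. The following is a minimal sketch: sglang_saturation is a hypothetical helper name, and it assumes sample lines end with the value (no trailing timestamp).

```shell
# Hypothetical helper: reads Prometheus exposition text on stdin and prints
# a one-line saturation summary based on the thresholds in the table above.
sglang_saturation() {
  awk '
    BEGIN { queued = 0; usage = 0 }
    /^#/ { next }                                        # skip HELP/TYPE comments
    $1 ~ /^sglang:num_queue_reqs([{]|$)/ { queued += $NF }   # total queued requests
    $1 ~ /^sglang:token_usage([{]|$)/ { if ($NF + 0 > usage) usage = $NF + 0 }  # highest usage seen
    END {
      state = (queued > 0 || usage > 0.9) ? "saturated" : "ok"
      printf "queued=%d token_usage=%.2f state=%s\n", queued, usage, state
    }'
}
```

Pipe the output of a curl against the pod's metrics endpoint into sglang_saturation to use it.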

Step 3: Enable EPP Metrics

EPP (Endpoint Picker) metrics are enabled by default. To verify or enable manually, see the Monitoring & Tracing Configuration section in the llm-d-router Helm chart docs.

Verify the ServiceMonitor exists:

kubectl get servicemonitors -n ${NAMESPACE}

Expected output:

NAME AGE
epp-servicemonitor 5m

Key llm-d Router EPP Metrics

| Metric | What it measures | Why it matters |
| --- | --- | --- |
| llm_d_epp_request_total | Total request count per flow ID and priority | Baseline for calculating error rate and throughput per model |
| llm_d_epp_request_error_total | Error count per flow ID and priority | Rising errors signal backend failures. Alert when error rate exceeds 5% |
| llm_d_epp_request_duration_seconds | Response latency distribution per flow ID and priority | The SLO metric. Tracks full round-trip time from request to response |
| llm_d_epp_request_size_bytes | Incoming request size distribution in bytes per flow ID and priority | Helps identify payload size anomalies and exceptionally large incoming prompts |
| llm_d_epp_response_size_bytes | Outgoing response size distribution in bytes per flow ID and priority | Tracks outgoing bandwidth usage and response payload size distribution |
| llm_d_epp_request_input_tokens | Input token count distribution per flow ID and priority | Helps identify expensive requests. Long prompts cost more compute |
| llm_d_epp_request_output_tokens | Output token count distribution per flow ID and priority | Combined with duration, gives normalized cost and generation volume per token |
| llm_d_epp_request_cached_tokens | Cached prompt token distribution per flow ID and priority | Measures prefix cache utilization reported by model servers |
| llm_d_epp_request_running | Active request count per flow ID and priority | Shows real-time load concurrency across models |
| llm_d_epp_request_ntpot_seconds | Normalized time per output token (NTPOT) distribution per flow ID and priority | Key efficiency metric (lower is better). Compare across pods to find stragglers |
| llm_d_epp_request_ttft_seconds | Time to first token (TTFT) distribution per flow ID and priority | Directly measures user-perceived responsiveness and time to initial output byte |
| llm_d_epp_request_streaming_tpot_seconds | Time per output token (TPOT) distribution per flow ID and priority; applicable to streaming requests | Tracks ongoing generation speed excluding initial prompt prefill latency |
| llm_d_epp_request_streaming_itl_seconds | Inter-token latency (ITL) distribution per flow ID and priority; applicable to streaming requests | Measures pacing between consecutive response body chunks; spikes indicate choppy output |
| llm_d_epp_ready_endpoints | Number of ready endpoints in the pool | If this drops below expected count, pods are crashing or not scheduling |
| llm_d_epp_scheduler_attempts_total | Scheduling attempt counts and outcomes | Track failed scheduling attempts. High failure rate indicates filter/scorer misconfiguration |
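
The 5% error-rate guidance and the latency SLO from the table could be expressed roughly as follows. This is a sketch, not a shipped alert rule: the variable names and 5m window are assumptions, the queries aggregate over all labels because the exact label names are not listed here, and the _bucket suffix assumes the duration metric is exported as a standard Prometheus histogram.

```shell
# PromQL sketches for the EPP metrics above (illustrative assumptions throughout).

# Fraction of requests that failed over the last 5 minutes, across the pool.
ERROR_RATE='sum(rate(llm_d_epp_request_error_total[5m])) / sum(rate(llm_d_epp_request_total[5m]))'

# Returns a sample only while the error rate is above the 5% threshold from the table.
ERROR_ALERT="(${ERROR_RATE}) > 0.05"

# 99th percentile end-to-end request latency.
LATENCY_P99='histogram_quantile(0.99, sum by (le) (rate(llm_d_epp_request_duration_seconds_bucket[5m])))'

printf '%s\n' "$ERROR_RATE" "$ERROR_ALERT" "$LATENCY_P99"
```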

Flow Control Metrics

When flow control is enabled, these additional metrics are exposed:

| Metric | What it measures | Why it matters |
| --- | --- | --- |
| llm_d_epp_flow_control_queue_size | Queued request count per flow ID and priority | Growing queue means the pool cannot keep up. Consider scaling or adjusting priority bands |
| llm_d_epp_flow_control_queue_bytes | Queued payload size in bytes per flow ID and priority | Large queued payloads can exhaust EPP memory. Monitor alongside maxBytes config |
| llm_d_epp_flow_control_request_queue_duration_seconds | Queuing duration distribution per flow ID and priority | Directly impacts user-perceived latency. High values mean flow control is holding requests too long |
| llm_d_epp_flow_control_dispatch_cycle_duration_seconds | Internal dispatch cycle duration distribution | Tracks execution speed of the flow control scheduler loop |
| llm_d_epp_flow_control_request_enqueue_duration_seconds | Request enqueue duration distribution per flow ID and priority | Measures admission overhead entering the flow control queue |
| llm_d_epp_flow_control_pool_saturation | Pool saturation level (0.0 to 1.0+) | Above 1.0 means demand exceeds capacity and flow control is actively throttling. Scale up or shed load |
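
As a rough illustration of how these signals might be watched together, the expressions below flag a saturated pool, a growing queue, and long queue waits. The variable names and the 5m and 10m windows are assumptions, and the _bucket suffix assumes the queue duration is exported as a standard Prometheus histogram.

```shell
# PromQL sketches for the flow control metrics above (illustrative assumptions).

# Returns a sample only while saturation is above 1.0 (demand exceeds capacity).
POOL_SATURATED='max(llm_d_epp_flow_control_pool_saturation) > 1'

# Positive slope of the total queue size over 10 minutes, i.e. the queue is growing.
QUEUE_GROWING='sum(deriv(llm_d_epp_flow_control_queue_size[10m])) > 0'

# 95th percentile time requests spend waiting in the flow control queue.
QUEUE_WAIT_P95='histogram_quantile(0.95, sum by (le) (rate(llm_d_epp_flow_control_request_queue_duration_seconds_bucket[5m])))'

printf '%s\n' "$POOL_SATURATED" "$QUEUE_GROWING" "$QUEUE_WAIT_P95"
```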

Step 4: View Dashboards

llm-d provides pre-built Grafana dashboards for common monitoring scenarios.

Access Grafana

Note: The commands below use namespace and service names from the bundled install script. If you use an existing Prometheus or Grafana instance, adjust the namespace and service names accordingly.

kubectl port-forward -n llm-d-monitoring svc/llmd-grafana 3000:80
# Open http://localhost:3000
# Default login: admin / admin

Import Dashboards

Load all llm-d dashboards into Grafana:

./guides/recipes/observability/load-llm-d-dashboards.sh

Verify dashboards were imported:

kubectl get configmaps -n llm-d-monitoring -l grafana_dashboard=1

Expected output:

NAME DATA AGE
llm-d-vllm-overview 1 30s
llm-d-sglang-overview 1 30s
llm-d-failure-saturation-dashboard 1 30s
llm-d-diagnostic-drilldown-dashboard 1 30s
llm-d-performance-kv-cache 1 30s
llm-d-pd-coordinator-metrics 1 30s

Or import individual dashboard JSON files manually from guides/recipes/observability/grafana/dashboards/:

| Dashboard | What it shows |
| --- | --- |
| llm-d-vllm-overview.json | General vLLM metrics overview |
| llm-d-sglang-overview.json | General SGLang metrics overview |
| llm-d-failure-saturation-dashboard.json | Failure and saturation indicators |
| llm-d-diagnostic-drilldown-dashboard.json | Detailed diagnostic metrics for troubleshooting |
| llm-d-performance-kv-cache.json | Performance metrics including KV cache utilization |
| llm-d-pd-coordinator-metrics.json | Prefill/decode disaggregation metrics |

Step 5: Query Metrics

Access the Prometheus UI:

kubectl port-forward -n llm-d-monitoring svc/llmd-kube-prometheus-stack-prometheus 9090:9090
# Open http://localhost:9090 (or https://localhost:9090 if TLS is enabled)
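
Besides the web UI, queries can be run from the terminal through the Prometheus HTTP API. The sketch below wraps the instant-query endpoint (/api/v1/query) in a small function. The names promql and PROM_URL are hypothetical, and the default URL assumes the port-forward above is running without TLS.

```shell
# Hypothetical helper: run an instant PromQL query through the Prometheus HTTP API.
# PROM_URL assumes the port-forward above; use https:// if TLS is enabled.
PROM_URL="${PROM_URL:-http://localhost:9090}"

promql() {
  # --get sends the URL-encoded query as a GET parameter to /api/v1/query
  curl -s --get --data-urlencode "query=$1" "${PROM_URL}/api/v1/query"
}

# Example (requires the port-forward to be running):
#   promql 'sum(vllm:num_requests_waiting)'
```

The function prints the raw JSON response; pipe it through a JSON tool of your choice to extract values.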

Cleanup

Uninstall the bundled Prometheus and Grafana stack:

./guides/recipes/observability/install-prometheus-grafana.sh -u -n llm-d-monitoring

Troubleshooting

Autoscaler reports "http: server gave HTTP response to HTTPS client"

The autoscaler is configured for HTTPS but Prometheus is serving plain HTTP. Uninstall the monitoring stack, then reinstall it with TLS enabled:

./guides/recipes/observability/install-prometheus-grafana.sh -u
./guides/recipes/observability/install-prometheus-grafana.sh --enable-tls

Metrics not appearing in Prometheus

  1. Check that PodMonitors and ServiceMonitors exist:

    kubectl get podmonitors,servicemonitors -n ${NAMESPACE}

  2. Verify Prometheus is scraping the targets. Open http://localhost:9090/targets (after port-forwarding) and check that the vLLM and EPP targets show UP.

  3. Confirm pods expose metrics (adjust the app=my-model selector to match your model server pods):

    VLLM_POD=$(kubectl get pods -n ${NAMESPACE} -l app=my-model -o jsonpath='{.items[0].metadata.name}')
    # port-forward blocks the terminal, so run it in the background
    kubectl port-forward -n ${NAMESPACE} ${VLLM_POD} 8000:8000 &
    sleep 2  # give the tunnel a moment to come up
    curl -s http://localhost:8000/metrics | head -20
    kill $!  # stop the background port-forward
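
The target check in step 2 can also be done from the terminal. The sketch below counts active scrape targets by health state using the Prometheus targets endpoint (/api/v1/targets). The names target_health and PROM_URL are hypothetical, and the default URL assumes the Prometheus port-forward is running without TLS.

```shell
# Hypothetical helper: summarize scrape-target health via the Prometheus HTTP API.
PROM_URL="${PROM_URL:-http://localhost:9090}"

target_health() {
  # Each active target in the JSON response carries a "health" field
  # (up, down, or unknown); count how many targets are in each state.
  curl -s "${PROM_URL}/api/v1/targets?state=active" \
    | grep -o '"health":"[a-z]*"' \
    | sort | uniq -c
}
```

Any count next to "health":"down" points at targets that Prometheus discovered but cannot scrape.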

Grafana dashboards show "No data"

  1. Verify the Grafana datasource points to the correct Prometheus URL
  2. Check that metrics are flowing in Prometheus first (use the Prometheus UI)
  3. If using TLS, ensure the Grafana datasource is configured for HTTPS with the correct CA certificate
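
To see which URL the Grafana datasource actually points to (step 1), one option is the Grafana HTTP API. This is a sketch under assumptions: grafana_datasource_urls and GRAFANA_URL are hypothetical names, and it relies on the Grafana port-forward and the default admin / admin login shown earlier still being valid.

```shell
# Hypothetical helper: list the datasource URLs Grafana is configured with.
# Assumes the Grafana port-forward and the default admin / admin login.
GRAFANA_URL="${GRAFANA_URL:-http://localhost:3000}"

grafana_datasource_urls() {
  # /api/datasources returns a JSON array; pull out each "url" field
  curl -s -u admin:admin "${GRAFANA_URL}/api/datasources" \
    | grep -o '"url": *"[^"]*"'
}
```

Compare the printed URL with the Prometheus service address and scheme (http or https) that your installation uses.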