Skip to main content

llm-d Router Operations Guide

This guide covers operational best practices, high availability deployment architectures, and container sizing recommendations for the llm-d Router components.


1. Endpoint Picker Operations

When deploying the Endpoint Picker (EPP) in either Standalone or Gateway mode, resource allocations and multi-replica scaling behaviors depend on expected query throughput, prefix cache matching complexity, and high availability (HA) requirements.

High Availability & Scaling Modes

When running multiple replicas of the Endpoint Picker (router.epp.replicas > 1), its behavior depends on the configured HA mode.

Active-Passive Mode (Default)

By default, multi-replica EPP deployments automatically enable the --ha-enable-leader-election flag. One leader replica actively serves routing decisions and coordinates lease status, while remaining replicas act as warm standbys.

  • Sizing & Capacity Impact: Scaling replica count does not increase total request throughput capacity, as only the single active leader replica handles external processing requests.

Active-Active Mode

To scale routing throughput concurrently across all EPP replicas, disable leader election by passing ha-enable-leader-election: false under router.epp.flags:

router:
epp:
replicas: 3
flags:
ha-enable-leader-election: false
  • Near-Linear Throughput Scaling: Multiple EPP replicas share incoming request load concurrently:

    ReplicasScaling Factor
    11.0x
    22.0x
    32.7x
    43.5x
  • Warning (Plugin & Prefix Compatibility): In active-active mode, you must only use active-active compatible plugins—specifically plugins that query backend model servers dynamically for real-time metrics and state (such as queue depth or KV-cache utilization scorers). Avoid approximate prefix caching plugins in active-active mode; because replicas do not share local memory state, prefix routing partitions across replicas and degrades cache hit rates significantly.

Container Resource Sizing

CPU Allocation

  • Rule of Thumb: Allocate 0.5 to 1.0 CPU cores per request/second of expected throughput for large agentic workloads (~100k input / 1k output tokens).
  • Prefix Matching Overhead: Increasing maxPrefixBlocksToMatch increases CPU consumption. At lower throughputs, a limit of 6250 blocks can increase CPU consumption by over 100% compared to 256 blocks due to block search overhead.
  • Idle Scraping Overhead: Idle CPU consumption scales with total model-serving pods due to background Prometheus scraping. In a cluster with 100 pods, EPP idle consumption reaches approximately 7.5 cores.

Memory Allocation

  • Inflight Concurrency: Memory footprint scales directly with concurrent inflight requests and output decode length.
  • Sizing Guidelines:
    • At 50 to 100 requests/second with 1k output tokens, EPP requires 4 to 6 GiB of memory.
    • For long-output generation (e.g., 5k+ output tokens), memory footprint can exceed 20 GiB due to concurrent request state accumulation.

Performance Reference Data

Empirical benchmark reference data for Qwen/Qwen3-8B simulation across 100 serving pods:

Throughput and Prefix Block Sizing (100k Input / 1k Output Tokens)

ConfigurationRequest Rate (Req/s)maxPrefixBlocksToMatchPeak CPU (Cores)Peak Memory (GiB)Scheduler P50 Latency (s)
Small Prefix Match5.02561.190.260.00010
Large Prefix Match5.062503.820.650.00010
Small Prefix Match98.725635.172.460.00014
Large Prefix Match98.8625046.503.410.00020

Output Length Variation (50 Req/s Constant Throughput)

Input TokensOutput TokensmaxPrefixBlocksToMatchPeak CPU (Cores)Peak Memory (GiB)
100k50025615.132.27
100k500204817.143.76
100k100025617.513.66
100k1000204820.285.23
100k5000102430.9512.54
100k1000051232.5312.54

2. Proxy Operations in Standalone Mode

The following operational guidelines and proxy scaling architectures apply exclusively to Standalone Mode (llm-d-router-standalone), where a proxy (Envoy or Agentgateway) intercepts client requests and external-processes them via EPP.

Horizontally Scalable Proxy Service

By default, the standalone chart deploys the proxy as a sidecar container inside the EPP pod. To scale data plane throughput independently from control plane intelligence, deploy the proxy as a separate horizontally scalable Deployment and Service by setting router.proxy.mode=service.

In this decoupled architecture, the proxy communicates with EPP over the in-cluster EPP Service. If EPP undergoes active-passive leader failover or momentary pod restarts, the proxy fails open by default (router.proxy.failOpen=true), preserving uninterrupted client request processing.

helm install my-standalone-router ./config/charts/llm-d-router-standalone \
--set router.modelServers.matchLabels.app=my-vllm-service \
--set router.inferencePool.create=false \
--set router.proxy.mode=service \
--set router.proxy.replicas=3

Proxy Container Resource Sizing

When running Envoy as the standalone proxy, CPU consumption scales linearly with client request rate, while memory consumption remains stable across workloads.

CPU & Memory Guidelines

  • CPU Allocation: For < 10 requests/second, 1.2 to 2.0 cores is sufficient. For 100 requests/second at 100k context lengths, allocate at least 8 cores (peak observed at 7.27 cores). For high concurrency at smaller context lengths (892 requests/second at 10k context), allocate at least 10 cores.
  • Memory Footprint: Envoy memory footprint remains stable between 1.3 and 1.4 GiB across all tested throughputs and context lengths. Allocate 2 GiB baseline.

Envoy Performance Reference Data

Input TokensOutput TokensThroughput (Req/s)Peak CPU (Cores)Peak Memory (GiB)
100k1k10.01.201.30
100k1k100.07.27< 1.40
10k1k892.08.781.40

Helm Resource Override Example

Example resource_overrides.yaml configuring container resources for both EPP and standalone Envoy proxy containers supporting 50 requests/second for 100k/1k token workloads:

router:
epp:
resources:
requests:
cpu: "32"
memory: "64Gi"
limits:
memory: "128Gi"

proxy:
resources:
requests:
cpu: "8"
memory: "2Gi"
limits:
memory: "4Gi"
helm install optimize-baseline ./config/charts/llm-d-router-standalone -f resource_overrides.yaml