EPP gRPC APIs Reference

This document lists the gRPC APIs that the Endpoint Picker (EPP) supports for inference traffic. gRPC requests flow through the gateway as cleartext HTTP/2 (h2c) traffic; the EPP decodes the gRPC frames and protobuf payloads to perform prefix-cache-aware routing, make plugin decisions, and track response usage.

Unlike the HTTP APIs, gRPC parsing is not enabled by default: the matching parser plugin must be configured in the EndpointPickerConfig.
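
To make the frame-decoding step mentioned above concrete: on the wire, each gRPC message is a length-prefixed frame consisting of a 1-byte compression flag, a 4-byte big-endian length, and then the serialized protobuf payload. The following is a minimal Python sketch of that framing only. It is not the EPP's parser, and the helper names `split_grpc_frames` and `make_frame` are hypothetical.

```python
import struct
from typing import Iterator, Tuple


def make_frame(payload: bytes, compressed: bool = False) -> bytes:
    """Wrap an already-serialized protobuf payload in one gRPC frame."""
    # 1-byte compressed flag, then 4-byte big-endian payload length.
    return struct.pack(">BI", int(compressed), len(payload)) + payload


def split_grpc_frames(data: bytes) -> Iterator[Tuple[bool, bytes]]:
    """Yield (compressed, payload) for each complete gRPC frame in data.

    Hypothetical helper for illustration; a trailing partial frame is ignored.
    """
    offset = 0
    while offset + 5 <= len(data):
        flag, length = struct.unpack_from(">BI", data, offset)
        start = offset + 5
        end = start + length
        if end > len(data):
            break  # incomplete frame; a real parser would buffer more bytes
        yield bool(flag), data[start:end]
        offset = end


# Two frames back to back, as they could appear in one HTTP/2 DATA payload.
buffer = make_frame(b"first") + make_frame(b"second")
for compressed, payload in split_grpc_frames(buffer):
    print(compressed, payload)
```

The payload bytes would then be decoded with the message types generated from vllm_engine.proto, which this sketch deliberately leaves out.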

Supported gRPC APIs

gRPC Method	Source	Parser Plugin	Supported
`vllm.grpc.engine.VllmEngine/Generate`	vLLM gRPC engine API	`vllmgrpc-parser`	✅
`vllm.grpc.engine.VllmEngine/Embed`	vLLM gRPC engine API	`vllmgrpc-parser`	✅

For Generate, the gRPC API is currently token-out only: responses carry token IDs (chunk.token_ids and complete.output_ids) rather than decoded text, so clients are responsible for detokenization. grpcurl prints these fields under their JSON names, tokenIds and outputIds.

Parser Configuration

Parsers are configured via the requestHandler.parsers section of the EndpointPickerConfig. Instantiate the parser plugin in plugins, then reference it by name:

apiVersion: llm-d.ai/v1alpha1
kind: EndpointPickerConfig
plugins:
- name: maxScore
  type: max-score-picker
- name: vllmgrpcParser
  type: vllmgrpc-parser
schedulingProfiles:
  # ... omitted for brevity ...
requestHandler:
  parsers:
  - pluginRef: vllmgrpcParser
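
The wiring rule above (each pluginRef under requestHandler.parsers must name an entry in plugins) can be checked before a config is applied. Below is a minimal Python sketch, assuming the config has already been loaded into a dict (for example with a YAML library) and that every plugin carries an explicit name. `unresolved_parser_refs` is a hypothetical helper, not part of the EPP.

```python
from typing import Any, Dict, List


def unresolved_parser_refs(config: Dict[str, Any]) -> List[str]:
    """Return parser pluginRef values that match no entry in plugins."""
    declared = {plugin.get("name") for plugin in config.get("plugins", [])}
    parsers = config.get("requestHandler", {}).get("parsers", [])
    refs = [parser.get("pluginRef") for parser in parsers]
    return [ref for ref in refs if ref is not None and ref not in declared]


# Mirrors the example config above, already parsed into a dict.
config = {
    "plugins": [
        {"name": "maxScore", "type": "max-score-picker"},
        {"name": "vllmgrpcParser", "type": "vllmgrpc-parser"},
    ],
    "requestHandler": {"parsers": [{"pluginRef": "vllmgrpcParser"}]},
}
print(unresolved_parser_refs(config))
```

An empty list means every parser reference resolves; any names returned point at a typo or a missing plugins entry.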

InferencePool Configuration

gRPC requires HTTP/2 end to end. For the gateway to connect to the model server pods with HTTP/2 cleartext (h2c), the InferencePool must set appProtocol: kubernetes.io/h2c.

apiVersion: inference.networking.k8s.io/v1
kind: InferencePool
metadata:
  name: vllm-grpc-qwen3-32b
spec:
  targetPorts:
  - number: 8000
  appProtocol: kubernetes.io/h2c
  selector:
    matchLabels:
      app: vllm-grpc-qwen3-32b
  endpointPickerRef:
    name: vllm-grpc-qwen3-32b-epp
    port:
      number: 9002

When deploying with the llm-d-router Helm charts, setting router.modelServers.protocol=grpc configures this automatically.

Request Examples

The examples below use grpcurl and assume that ${IP} holds the proxy endpoint address, set as described in the verification steps of the relevant guide. They require the vllm_engine.proto definition and a model server that exposes the vLLM gRPC engine API.

vLLM `VllmEngine/Generate`

Request (text input; alternatively pass pre-tokenized input via the tokenized field):

grpcurl -plaintext -proto vllm_engine.proto \
    -d '{
        "request_id": "req-1",
        "text": "Hello",
        "sampling_params": {"max_tokens": 10}
    }' \
    ${IP}:80 vllm.grpc.engine.VllmEngine/Generate

Response:

{
  "complete": {
    "outputIds": [17993, 1894, 7332, 198, 286, 2415, 1140, 259, 4580, 892],
    "finishReason": "length",
    "promptTokens": 1,
    "completionTokens": 10
  }
}

Streaming request (set "stream": true; the server returns a stream of GenerateResponse messages with incremental chunk payloads followed by a final complete payload):

grpcurl -plaintext -proto vllm_engine.proto \
    -d '{
        "request_id": "req-2",
        "text": "Hello",
        "sampling_params": {"max_tokens": 10},
        "stream": true
    }' \
    ${IP}:80 vllm.grpc.engine.VllmEngine/Generate

Streaming response:

Response contents:
{
  "chunk": {
    "tokenIds": [
      883336980
    ],
    "promptTokens": 10,
    "completionTokens": 1
  }
}

Response contents:
{
  "chunk": {
    "tokenIds": [
      186949092
    ],
    "promptTokens": 10,
    "completionTokens": 1
  }
}

Response contents:
{
  "chunk": {
    "tokenIds": [
      446163293
    ],
    "promptTokens": 10,
    "completionTokens": 1
  }
}

Response contents:
{
  "chunk": {
    "tokenIds": [
      186949092
    ],
    "promptTokens": 10,
    "completionTokens": 1
  }
}

Response contents:
{
  "chunk": {
    "tokenIds": [
      3509523577
    ],
    "promptTokens": 10,
    "completionTokens": 1
  }
}

Response contents:
{
  "chunk": {
    "tokenIds": [
      1690122482
    ],
    "promptTokens": 10,
    "completionTokens": 1
  }
}

Response contents:
{
  "complete": {
    "finishReason": "stop",
    "promptTokens": 10
  }
}
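
Because the stream delivers token IDs in chunk messages and ends with a complete message, a client usually has to concatenate the chunks itself. The following is a minimal Python sketch, assuming each response has been converted to a dict that uses the JSON field names shown above (tokenIds, finishReason). `collect_stream` is a hypothetical helper, the token IDs in the usage lines are placeholders, and detokenization is left to the caller.

```python
from typing import Any, Dict, Iterable, List, Optional, Tuple


def collect_stream(
    responses: Iterable[Dict[str, Any]],
) -> Tuple[List[int], Optional[str]]:
    """Concatenate chunk token IDs and return them with the finish reason."""
    token_ids: List[int] = []
    finish_reason: Optional[str] = None
    for response in responses:
        if "chunk" in response:
            token_ids.extend(response["chunk"].get("tokenIds", []))
        elif "complete" in response:
            finish_reason = response["complete"].get("finishReason")
            break  # the complete payload is the final message of the stream
    return token_ids, finish_reason


# Placeholder messages shaped like the streaming output above.
stream = [
    {"chunk": {"tokenIds": [101], "promptTokens": 10, "completionTokens": 1}},
    {"chunk": {"tokenIds": [102], "promptTokens": 10, "completionTokens": 1}},
    {"complete": {"finishReason": "stop", "promptTokens": 10}},
]
ids, reason = collect_stream(stream)
print(ids, reason)
```

The collected IDs can then be passed to the tokenizer that matches the deployed model.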

vLLM `VllmEngine/Embed`

This method requires pre-tokenized input and an embedding model deployment.

Request:

grpcurl -plaintext -proto vllm_engine.proto \
    -d '{
        "request_id": "req-3",
        "tokenized": {"original_text": "Hello", "input_ids": [9906]}
    }' \
    ${IP}:80 vllm.grpc.engine.VllmEngine/Embed

Response (embedding vector truncated for readability):

{
  "embedding": [-0.01350, -0.02152, -0.01368, "..."],
  "promptTokens": 1,
  "embeddingDim": 1024
}
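
Since Embed accepts only pre-tokenized input, the request body has to be assembled from token IDs produced elsewhere. Below is a minimal Python sketch that builds the JSON passed to grpcurl -d, using only the field names from the request above. `build_embed_payload` is a hypothetical helper, the empty-input check is a client-side guard added for illustration, and the token IDs must come from the tokenizer of the deployed embedding model.

```python
import json
from typing import List


def build_embed_payload(request_id: str, text: str, input_ids: List[int]) -> str:
    """Return the JSON request body for VllmEngine/Embed."""
    if not input_ids:
        # Illustrative guard: there is nothing to embed without token IDs.
        raise ValueError("input_ids must contain at least one token ID")
    return json.dumps({
        "request_id": request_id,
        "tokenized": {"original_text": text, "input_ids": input_ids},
    })


print(build_embed_payload("req-3", "Hello", [9906]))
```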

HTTP Headers

The EPP HTTP headers (request classification, flow control, and SLO headers such as x-llm-d-inference-objective and x-llm-d-inference-fairness-id) work for gRPC requests exactly as they do for HTTP.

Specify them as gRPC metadata on the call. With grpcurl, use -H:

grpcurl -plaintext -proto vllm_engine.proto \
    -H 'x-llm-d-inference-objective: my-objective' \
    -H 'x-llm-d-inference-fairness-id: tenant-a' \
    -d '{
        "request_id": "req-4",
        "text": "Hello",
        "sampling_params": {"max_tokens": 10}
    }' \
    ${IP}:80 vllm.grpc.engine.VllmEngine/Generate

In a Go client, attach the metadata to the outgoing context:

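// Requires: import "google.golang.org/grpc/metadata"
// ctx, client, and req are assumed to be set up by the caller.
ctx = metadata.AppendToOutgoingContext(ctx,
    "x-llm-d-inference-objective", "my-objective",
    "x-llm-d-inference-fairness-id", "tenant-a")
resp, err := client.Generate(ctx, req)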

In Python, pass metadata on the call:

stub.Generate(request, metadata=(
    ("x-llm-d-inference-objective", "my-objective"),
    ("x-llm-d-inference-fairness-id", "tenant-a"),
))
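
HTTP/2 header names are lowercase, and gRPC metadata keys follow the same rule, so headers copied from an HTTP client may need normalizing first. Below is a small Python sketch that converts a header dict into the tuple form passed as metadata= above. `to_grpc_metadata` is a hypothetical helper.

```python
from typing import Dict, Tuple


def to_grpc_metadata(headers: Dict[str, str]) -> Tuple[Tuple[str, str], ...]:
    """Convert HTTP-style headers to gRPC metadata pairs with lowercase keys."""
    return tuple((name.strip().lower(), value) for name, value in headers.items())


metadata = to_grpc_metadata({
    "X-LLM-D-Inference-Objective": "my-objective",
    "x-llm-d-inference-fairness-id": "tenant-a",
})
print(metadata)
```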
