Skip to main content

EPP gRPC APIs Reference

This document lists the gRPC APIs the Endpoint Picker (EPP) supports for inference traffic. gRPC requests flow through the gateway as HTTP/2 (H2C) traffic, and the EPP decodes the gRPC frames and protobuf payloads to do prefix-cache aware routing, plugin decisions, and response usage tracking.

Unlike the HTTP APIs, gRPC parsing is not enabled by default: the matching parser plugin must be configured in the EndpointPickerConfig.

Supported gRPC APIs

gRPC MethodSourceParser PluginSupported
vllm.grpc.engine.VllmEngine/GeneratevLLM gRPC engine APIvllmgrpc-parser
vllm.grpc.engine.VllmEngine/EmbedvLLM gRPC engine APIvllmgrpc-parser

The gRPC API is currently token-out only for Generate: responses carry token IDs (chunk.token_ids, complete.output_ids) rather than decoded text, and clients are responsible for detokenization.

Parser Configuration

Parsers are configured via the requestHandler.parsers section of the EndpointPickerConfig. Instantiate the parser plugin in plugins, then reference it by name:

apiVersion: llm-d.ai/v1alpha1
kind: EndpointPickerConfig
plugins:
- name: maxScore
type: max-score-picker
- name: vllmgrpcParser
type: vllmgrpc-parser
schedulingProfiles:
# ... omitted for brevity ...
requestHandler:
parsers:
- pluginRef: vllmgrpcParser

InferencePool Configuration

gRPC requires HTTP/2 end to end. For the gateway to connect to the model server pods with HTTP/2 cleartext (h2c), the InferencePool must set appProtocol: kubernetes.io/h2c.

apiVersion: inference.networking.k8s.io/v1
kind: InferencePool
metadata:
name: vllm-grpc-qwen3-32b
spec:
targetPorts:
- number: 8000
appProtocol: kubernetes.io/h2c
selector:
matchLabels:
app: vllm-grpc-qwen3-32b
endpointPickerRef:
name: vllm-grpc-qwen3-32b-epp
port:
number: 9002

When deploying with the llm-d-router Helm charts, setting router.modelServers.protocol=grpc configures this automatically.


Request Examples

The examples below use grpcurl with the proxy endpoint as ${IP}, set per the relevant guide's verification steps. They require the vllm_engine.proto definition, and a model server that exposes the vLLM gRPC engine API.

vLLM VllmEngine/Generate

Request (text input; alternatively pass pre-tokenized input via the tokenized field):

grpcurl -plaintext -proto vllm_engine.proto \
-d '{
"request_id": "req-1",
"text": "Hello",
"sampling_params": {"max_tokens": 10}
}' \
${IP}:80 vllm.grpc.engine.VllmEngine/Generate

Response:

{
"complete": {
"outputIds": [17993, 1894, 7332, 198, 286, 2415, 1140, 259, 4580, 892],
"finishReason": "length",
"promptTokens": 1,
"completionTokens": 10
}
}

Streaming request (set "stream": true; the server returns a stream of GenerateResponse messages with incremental chunk payloads followed by a final complete payload):

grpcurl -plaintext -proto vllm_engine.proto \
-d '{
"request_id": "req-2",
"text": "Hello",
"sampling_params": {"max_tokens": 10},
"stream": true
}' \
${IP}:80 vllm.grpc.engine.VllmEngine/Generate
Streaming response
Response contents:
{
"chunk": {
"tokenIds": [
883336980
],
"promptTokens": 10,
"completionTokens": 1
}
}

Response contents:
{
"chunk": {
"tokenIds": [
186949092
],
"promptTokens": 10,
"completionTokens": 1
}
}

Response contents:
{
"chunk": {
"tokenIds": [
446163293
],
"promptTokens": 10,
"completionTokens": 1
}
}

Response contents:
{
"chunk": {
"tokenIds": [
186949092
],
"promptTokens": 10,
"completionTokens": 1
}
}

Response contents:
{
"chunk": {
"tokenIds": [
3509523577
],
"promptTokens": 10,
"completionTokens": 1
}
}

Response contents:
{
"chunk": {
"tokenIds": [
1690122482
],
"promptTokens": 10,
"completionTokens": 1
}
}

Response contents:
{
"complete": {
"finishReason": "stop",
"promptTokens": 10
}
}

vLLM VllmEngine/Embed

This method requires pre-tokenized input and an embedding model deployment.

Request:

grpcurl -plaintext -proto vllm_engine.proto \
-d '{
"request_id": "req-3",
"tokenized": {"original_text": "Hello", "input_ids": [9906]}
}' \
${IP}:80 vllm.grpc.engine.VllmEngine/Embed

Response (embedding vector truncated for readability):

{
"embedding": [-0.01350, -0.02152, -0.01368, "..."],
"promptTokens": 1,
"embeddingDim": 1024
}

HTTP Headers

The EPP HTTP headers (request classification, flow control, and SLO headers such as x-llm-d-inference-objective and x-llm-d-inference-fairness-id) work for gRPC requests exactly as they do for HTTP.

Specify them as gRPC metadata on the call. With grpcurl, use -H:

grpcurl -plaintext -proto vllm_engine.proto \
-H 'x-llm-d-inference-objective: my-objective' \
-H 'x-llm-d-inference-fairness-id: tenant-a' \
-d '{
"request_id": "req-4",
"text": "Hello",
"sampling_params": {"max_tokens": 10}
}' \
${IP}:80 vllm.grpc.engine.VllmEngine/Generate

In a Go client, attach the metadata to the outgoing context:

ctx = metadata.AppendToOutgoingContext(ctx,
"x-llm-d-inference-objective", "my-objective",
"x-llm-d-inference-fairness-id", "tenant-a")
resp, err := client.Generate(ctx, req)

In Python, pass metadata on the call:

stub.Generate(request, metadata=(
("x-llm-d-inference-objective", "my-objective"),
("x-llm-d-inference-fairness-id", "tenant-a"),
))