OpenTelemetry Metrics
Hive Router exposes OpenTelemetry metrics for gateway traffic, subgraph traffic, cache behavior, supergraph lifecycle, and GraphQL errors.
This guide explains where to export metrics, how to configure OTLP and Prometheus, how to customize instruments, and what each metric/label means in practice.
Choose your metrics destination
Hive Router exposes metrics through two widely used integration patterns:
- OTLP-based observability backends
- Prometheus scrape endpoints
Teams that already run an OpenTelemetry pipeline tend to integrate via OTLP, while teams built around Prometheus and Grafana typically stick with Prometheus scraping.
Send metrics to OTLP-compatible backends
Hive Router can export metrics using OTLP to standard OpenTelemetry pipelines, including the OpenTelemetry Collector and vendor backends that support OTLP ingestion over HTTP or gRPC.
After enabling the exporter, generate some traffic through the router and confirm that new metric series appear in your backend (for example HTTP server/client latency, cache metrics, and supergraph execution metrics).
If metrics do not appear, verify:
- Endpoint reachability (network, DNS, TLS)
- Authentication credentials or headers
- Exporter protocol matches the backend (OTLP/HTTP vs OTLP/gRPC)
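To check the protocol and reachability points above in isolation, it can help to point the router at a throwaway OpenTelemetry Collector that only prints what it receives. The sketch below is a Collector configuration, not a Hive Router one. It assumes a recent Collector release that ships the `debug` exporter and uses the conventional OTLP ports (4317 for gRPC, 4318 for HTTP).

```yaml
# OpenTelemetry Collector config (illustrative, not part of Hive Router).
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317 # use with `protocol: grpc` on the router
      http:
        endpoint: 0.0.0.0:4318 # use with `protocol: http` on the router

exporters:
  # Prints every received metric to the Collector's stdout.
  debug:
    verbosity: detailed

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [debug]
```

If metrics show up in the Collector output but not in your real backend, the problem is more likely credentials or backend-side configuration than the router itself. The router-side exporter configuration follows.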
telemetry:
  metrics:
    exporters:
      - kind: otlp
        enabled: true
        protocol: http
        endpoint: https://otel-collector.example.com/v1/metrics
        interval: 30s
        max_export_timeout: 5s
        http:
          headers:
            authorization:
              expression: |
                "Bearer " + env("OTLP_TOKEN")

Expose metrics for Prometheus scraping
If your observability stack is Prometheus-first, Hive Router can expose an HTTP endpoint that Prometheus scrapes at its configured interval.
The port and path settings define the address where the Router exposes metrics. Prometheus must
be able to reach that address from its runtime environment (local network, Kubernetes service, or VM
network path).
If port is not set, or is the same as the main HTTP server port, the Router exposes metrics
through the same HTTP server that serves the GraphQL API. If the port is different, the Router
starts a separate HTTP server dedicated solely to the Prometheus metrics endpoint.
In production, make sure this endpoint is reachable only by trusted scrapers (for example via
network policy, firewall rules, or private ingress). Once configured, confirm the target appears as
healthy in Prometheus and then verify expected series are present (for example
http.server.request.duration, http.client.request.duration).
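On the Prometheus side, a scrape job for this endpoint might look like the sketch below. This is a fragment of a Prometheus configuration file, not of the router configuration. The job name and target host are placeholders, and the port and path assume the router exporter settings used in this section (9090 and /metrics).

```yaml
# prometheus.yml fragment (illustrative).
scrape_configs:
  - job_name: hive-router # placeholder job name
    metrics_path: /metrics # must match the router's `path`
    scrape_interval: 30s
    static_configs:
      # Placeholder host; the port must match the router's `port`.
      - targets: ["hive-router.internal:9090"]
```

Once Prometheus reloads this configuration, the target should be listed on its targets page. The router-side exporter configuration follows.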
telemetry:
  metrics:
    exporters:
      - kind: prometheus
        enabled: true
        port: 9090
        path: /metrics

Production baseline
For production workloads, start with a single primary exporter, define a clear service identity, and keep default instrumentation settings.
telemetry:
  resource:
    attributes:
      service.name: hive-router
      service.namespace: your-platform
      deployment.environment:
        expression: env("ENVIRONMENT")
  metrics:
    exporters:
      - kind: otlp
        enabled: true
        protocol: grpc
        endpoint: https://otel-collector.example.com:4317
        interval: 30s
        max_export_timeout: 5s

This is a safe baseline and works well before introducing instrumentation-level customization. Additional exporters can be added later, but starting with one simplifies validation and troubleshooting.
Customize instrumentation
You can override behavior per metric under telemetry.metrics.instrumentation.instruments.
- false disables a metric.
- true keeps default behavior.
- object form enables the metric with optional attribute overrides.
telemetry:
  metrics:
    instrumentation:
      instruments:
        # Disable HTTP server request duration metric
        http.server.request.duration: false
        http.client.request.duration:
          attributes:
            # Disable the label
            subgraph.name: false
            # Enable the label (labels are enabled by default)
            http.response.status_code: true

Attribute override behavior:
- false: drop the label from that metric
- true: keep the label (all labels are enabled by default)
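As a further illustration of the object form, the sketch below keeps the HTTP server duration metric enabled but drops one of its labels. The choice of graphql.operation.name is only an example, picked because operation names can produce many distinct series. The keys are the ones described above.

```yaml
telemetry:
  metrics:
    instrumentation:
      instruments:
        # Object form: the metric stays enabled, only attributes are overridden.
        http.server.request.duration:
          attributes:
            # Drop this label from this metric only.
            graphql.operation.name: false
```

Labels not listed under attributes keep their default (enabled) behavior.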
Histogram aggregation can also be customized under telemetry.metrics.instrumentation.common.histogram.
- explicit (default) uses unit-specific bucket sets. Lets you configure unit-specific buckets:
  - seconds for histogram unit s
  - bytes for histogram unit By
- exponential uses one shared exponential strategy for all histogram metrics.
- record_min_max controls whether min and max are reported for histogram points.
Bucket format rules:
- buckets can be either all numbers or all strings.
- mixed arrays are not allowed.
- seconds.buckets string values are parsed as durations (for example "5ms", "1s").
- bytes.buckets string values are parsed as human-readable sizes (for example "1KB", "5MB").
In explicit mode, histogram units other than s and By fail startup.
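To illustrate the all-numbers form of the bucket rules, here is a minimal sketch. It assumes that numeric values are interpreted in the histogram's own unit (seconds here); that interpretation is not stated above, so verify it against the configuration reference before relying on it.

```yaml
telemetry:
  metrics:
    instrumentation:
      common:
        histogram:
          aggregation: explicit
          seconds:
            # All numbers (assumed to be seconds). Mixing in a string
            # such as '1s' would make this a mixed array, which is not allowed.
            buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
```

The equivalent all-strings form is shown in the full example that follows.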
telemetry:
  metrics:
    instrumentation:
      common:
        histogram:
          aggregation: explicit
          seconds:
            buckets:
              [
                '5ms',
                '10ms',
                '25ms',
                '50ms',
                '75ms',
                '100ms',
                '250ms',
                '500ms',
                '750ms',
                '1s',
                '2.5s',
                '5s',
                '7.5s',
                '10s'
              ]
            record_min_max: false
          bytes:
            buckets:
              [
                '128B',
                '512B',
                '1KB',
                '2KB',
                '4KB',
                '8KB',
                '16KB',
                '32KB',
                '64KB',
                '128KB',
                '256KB',
                '512KB',
                '1MB',
                '2MB',
                '3MB',
                '4MB',
                '5MB'
              ]
            record_min_max: false

Metrics reference
GraphQL
GraphQL metrics capture errors surfaced by the router across all stages of a GraphQL request lifecycle.
Metrics
- hive.router.graphql.errors_total (unit: {error})
  Total count of GraphQL errors encountered during query processing and execution, categorized by error code.
  Labels: code
Labels
- code: GraphQL error code
  Values: GRAPHQL_PARSE_FAILED, GRAPHQL_VALIDATION_FAILED, PLAN_EXECUTION_FAILED, UNKNOWN, ...
  Notes: Uses "extensions.code" values and the router's error codes. "UNKNOWN" is used when no code is available.
Supergraph
Supergraph metrics cover the polling and processing lifecycle of schema updates.
Metrics
- hive.router.supergraph.poll.total
  Total number of supergraph polling attempts, categorized by poll result.
  Labels: result
- hive.router.supergraph.poll.duration (unit: seconds)
  Duration of supergraph polling attempts, categorized by poll result.
  Labels: result
- hive.router.supergraph.process.duration (unit: seconds)
  Time spent processing supergraph updates, categorized by status.
  Labels: status
Labels
- result: Result of the poll
  Values: updated, not_modified, error
  Notes: Used by "hive.router.supergraph.poll.*" metrics only
- status: Supergraph processing status
  Values: ok, error
  Notes: Used by "hive.router.supergraph.process.*" metrics only
HTTP server
HTTP server metrics capture inbound client traffic processed by the router.
Metrics
- http.server.request.duration (unit: seconds)
  Duration of inbound HTTP requests handled by the router.
  Labels: http.request.method, http.response.status_code, http.route, network.protocol.name, network.protocol.version, url.scheme, error.type, graphql.operation.name, graphql.operation.type, graphql.response.status
- http.server.request.body.size (unit: bytes)
  Size of inbound HTTP request bodies handled by the router.
  Labels: http.request.method, http.response.status_code, http.route, network.protocol.name, network.protocol.version, url.scheme, error.type, graphql.operation.name, graphql.operation.type, graphql.response.status
- http.server.response.body.size (unit: bytes)
  Size of outbound HTTP response bodies returned by the router.
  Labels: http.request.method, http.response.status_code, http.route, network.protocol.name, network.protocol.version, url.scheme, error.type, graphql.operation.name, graphql.operation.type, graphql.response.status
- http.server.active_requests (unit: {request})
  Current number of in-flight inbound HTTP requests.
  Labels: http.request.method, network.protocol.name, url.scheme
Labels
- http.request.method: HTTP method
  Values: GET, POST, PUT, PATCH, DELETE, HEAD, OPTIONS, CONNECT, TRACE, QUERY, _OTHER
  Notes: _OTHER is the fallback for unknown methods
- http.response.status_code: Response status code
  Values: 200, 400, 500, ...
- http.route: Normalized router path
  Values: /graphql
- network.protocol.name: Protocol name
  Values: http
- network.protocol.version: Protocol version
  Values: 0.9, 1.0, 1.1, 2, 3
- url.scheme: URL scheme
  Values: http, https
- error.type: Error classification for failed requests
  Values: status code >= 400
  Notes: Only set for failed requests
- graphql.operation.name: GraphQL operation name associated with the HTTP request
  Values: UsersQuery, IntrospectionQuery, UNKNOWN
  Notes: Used by http.server.request.duration, http.server.request.body.size, and http.server.response.body.size
- graphql.operation.type: GraphQL operation type
  Values: query, mutation, subscription
  Notes: Used by http.server.request.duration, http.server.request.body.size, and http.server.response.body.size. Omitted when unknown
- graphql.response.status: GraphQL response status for the request
  Values: ok, error
  Notes: Used by http.server.request.duration, http.server.request.body.size, and http.server.response.body.size. "error" indicates the GraphQL response contains at least one error
HTTP client
HTTP client metrics capture outbound requests the router makes to subgraphs.
Metrics
- http.client.request.duration (unit: seconds)
  Duration of outbound HTTP requests sent from router to subgraphs.
  Labels: http.request.method, server.address, server.port, network.protocol.name, network.protocol.version, url.scheme, subgraph.name, http.response.status_code, error.type
- http.client.request.body.size (unit: bytes)
  Size of outbound HTTP request bodies sent to subgraphs.
  Labels: http.request.method, server.address, server.port, network.protocol.name, network.protocol.version, url.scheme, subgraph.name, http.response.status_code, error.type
- http.client.response.body.size (unit: bytes)
  Size of HTTP response bodies returned by subgraphs.
  Labels: http.request.method, server.address, server.port, network.protocol.name, network.protocol.version, url.scheme, subgraph.name, http.response.status_code, error.type
- http.client.active_requests (unit: {request})
  Current number of in-flight outbound HTTP requests to subgraphs.
  Labels: http.request.method, server.address, server.port, url.scheme, subgraph.name
Labels
- http.request.method: HTTP method
  Values: GET, POST, PUT, PATCH, DELETE, HEAD, OPTIONS, CONNECT, TRACE, QUERY, _OTHER
  Notes: _OTHER is the fallback for unknown methods
- http.response.status_code: Response status code
  Values: 200, 400, 500, ...
- network.protocol.name: Protocol name
  Values: http
- network.protocol.version: Protocol version
  Values: 0.9, 1.0, 1.1, 2, 3
- url.scheme: URL scheme
  Values: http, https
- server.address: Subgraph host
  Values: URI host, unknown
  Notes: URI host, or unknown fallback
- server.port: Subgraph port
  Values: 80, 443
  Notes: Explicit URI port, or fallback 80/443
- subgraph.name: Subgraph identifier
  Values: accounts
  Notes: Configured names (for example "accounts")
- error.type: Error classification
  Values: 400, SUBGRAPH_REQUEST_FAILURE, ...
  Notes: Numeric status code >= 400 or execution error code string
Cache
Cache metrics track lookup behavior and cache size across router caches used during request preparation and planning stages.
Parsing cache
Parsing cache metrics measure query parse cache hit/miss behavior and cache size.
Metrics
- hive.router.parse_cache.requests_total
  Total number of parsing cache lookups, categorized by result.
  Labels: result
- hive.router.parse_cache.duration (unit: seconds)
  Duration of parsing cache lookups, categorized by result.
  Labels: result
- hive.router.parse_cache.size
  Current number of entries stored in the parsing cache.
Validation cache
Validation cache metrics measure query validation cache hit/miss behavior and cache size.
Metrics
- hive.router.validate_cache.requests_total
  Total number of validation cache lookups, categorized by result.
  Labels: result
- hive.router.validate_cache.duration (unit: seconds)
  Duration of validation cache lookups, categorized by result.
  Labels: result
- hive.router.validate_cache.size
  Current number of entries stored in the validation cache.
Normalization cache
Normalization cache metrics measure query normalization cache hit/miss behavior and cache size.
Metrics
- hive.router.normalize_cache.requests_total
  Total number of normalization cache lookups, categorized by result.
  Labels: result
- hive.router.normalize_cache.duration (unit: seconds)
  Duration of normalization cache lookups, categorized by result.
  Labels: result
- hive.router.normalize_cache.size
  Current number of entries stored in the normalization cache.
Planning cache
Planning cache metrics measure query planning cache hit/miss behavior and cache size.
Metrics
- hive.router.plan_cache.requests_total
  Total number of planning cache lookups, categorized by result.
  Labels: result
- hive.router.plan_cache.duration (unit: seconds)
  Duration of planning cache lookups, categorized by result.
  Labels: result
- hive.router.plan_cache.size
  Current number of entries stored in the planning cache.
Labels
These labels are shared by cache lookup counters and duration histograms.
- result: Cache lookup outcome
  Values: hit, miss
  Notes: Used by cache `requests_total` and `duration` metrics
What to monitor in production
The examples below show which signals to monitor in production and how to break them down so you can quickly isolate API, subgraph, cache, and GraphQL issues.
Monitor end-to-end latency of your GraphQL API
Use http.server.request.duration as your primary latency signal.
In production, break this metric down by http.route, http.request.method,
http.response.status_code, and/or graphql.response.status, then track p95 and p99 latency per
route and method. Keep successful and failed responses separated so error-path latency does not get
hidden by healthy traffic.
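If the metrics end up in Prometheus, the p95/p99 breakdown described above could be precomputed with recording rules along these lines. This is a Prometheus rules file, not router configuration. The Prometheus-side names are assumptions based on the usual OpenTelemetry-to-Prometheus translation (dots become underscores, a `_seconds` unit suffix is added, histograms expose `_bucket` series); check the names your backend actually shows. The rule and group names are made up for illustration.

```yaml
# Prometheus recording rules (illustrative; metric and label names are assumed).
groups:
  - name: hive_router_latency
    rules:
      - record: hive_router:http_server_request_duration_seconds:p95
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, http_route, http_request_method, graphql_response_status) (
              rate(http_server_request_duration_seconds_bucket[5m])
            )
          )
      - record: hive_router:http_server_request_duration_seconds:p99
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, http_route, http_request_method, graphql_response_status) (
              rate(http_server_request_duration_seconds_bucket[5m])
            )
          )
```

Keeping graphql_response_status in the grouping is what keeps successful and failed responses on separate series.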
Monitor health of your subgraphs
Use http.client.request.duration and http.client.active_requests to monitor dependency health across your federated graph.
Break these metrics down by subgraph.name, http.response.status_code, and error.type to
identify which subgraph is driving tail latency or error spikes.
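A per-subgraph view of this breakdown could be sketched as Prometheus recording rules like the ones below. As before, the Prometheus-side metric and label names are assumed translations of the OpenTelemetry names (for example subgraph.name becoming subgraph_name), and the rule names are hypothetical.

```yaml
# Prometheus recording rules (illustrative; metric and label names are assumed).
groups:
  - name: hive_router_subgraphs
    rules:
      # Tail latency per subgraph.
      - record: hive_router:http_client_request_duration_seconds:p99_by_subgraph
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, subgraph_name) (
              rate(http_client_request_duration_seconds_bucket[5m])
            )
          )
      # Failed outbound requests per second, by subgraph and error type.
      # Assumes error_type is absent or empty on successful requests.
      - record: hive_router:http_client_request_errors:rate5m
        expr: |
          sum by (subgraph_name, error_type) (
            rate(http_client_request_duration_seconds_count{error_type!=""}[5m])
          )
      # In-flight outbound requests per subgraph.
      - record: hive_router:http_client_active_requests:by_subgraph
        expr: |
          sum by (subgraph_name) (http_client_active_requests)
```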
Monitor cache effectiveness and planning pressure
Use the cache metrics to evaluate cache hit ratio, miss cost, and pressure over time.
For request and duration metrics, split by result (hit and miss) so you can track hit ratio
and miss latency per cache kind.
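For the planning cache, hit ratio and miss latency could be derived in Prometheus roughly as follows. The same pattern would apply to the parse, validate, and normalize caches by swapping the metric prefix. The Prometheus-side names are assumed translations of the OpenTelemetry names, and the rule names are hypothetical.

```yaml
# Prometheus recording rules (illustrative; metric names are assumed).
groups:
  - name: hive_router_plan_cache
    rules:
      # Fraction of planning cache lookups that were hits over 5 minutes.
      - record: hive_router:plan_cache:hit_ratio_5m
        expr: |
          sum(rate(hive_router_plan_cache_requests_total{result="hit"}[5m]))
          /
          sum(rate(hive_router_plan_cache_requests_total[5m]))
      # p95 duration of lookups that missed, as a proxy for miss cost.
      - record: hive_router:plan_cache:miss_duration_seconds_p95
        expr: |
          histogram_quantile(
            0.95,
            sum by (le) (
              rate(hive_router_plan_cache_duration_seconds_bucket{result="miss"}[5m])
            )
          )
```

The ratio has no value while there is no lookup traffic, since the denominator is zero, so dashboards and alerts should tolerate gaps.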
Monitor GraphQL errors over time
Use hive.router.graphql.errors_total and break it down by code to track both volume and error distribution.
In production, monitor how error-code distribution changes over time, not only total count, so you can separate validation issues from execution failures.
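One way to track both volume and distribution in Prometheus is a pair of recording rules: one for the per-code error rate and one for each code's share of all errors. The Prometheus-side metric name is an assumed translation of hive.router.graphql.errors_total, and the rule names are hypothetical.

```yaml
# Prometheus recording rules (illustrative; metric name is assumed).
groups:
  - name: hive_router_graphql_errors
    rules:
      # Errors per second, per error code.
      - record: hive_router:graphql_errors:rate5m
        expr: |
          sum by (code) (rate(hive_router_graphql_errors_total[5m]))
      # Share of each error code in the total error rate (0 to 1).
      - record: hive_router:graphql_errors:share5m
        expr: |
          sum by (code) (rate(hive_router_graphql_errors_total[5m]))
          / ignoring(code) group_left
          sum(rate(hive_router_graphql_errors_total[5m]))
```

A shift in the share series with a flat total points to a change in the kind of failure rather than in its volume.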
Configuration reference
For full options and defaults, see the telemetry configuration reference.