OpenTelemetry Metrics

Hive Router exposes OpenTelemetry metrics for gateway traffic, subgraph traffic, cache behavior, supergraph lifecycle, and GraphQL errors.

This guide explains where to export metrics, how to configure OTLP and Prometheus, how to customize instruments, and what each metric/label means in practice.

Choose your metrics destination

Hive Router exposes metrics through two widely used integration patterns:

OTLP-based observability backends
Prometheus scrape endpoints

Most teams already running an OpenTelemetry pipeline tend to integrate via OTLP, while teams built around Prometheus and Grafana typically stick with Prometheus scraping.

Send metrics to OTLP-compatible backends

Hive Router can export metrics using OTLP to standard OpenTelemetry pipelines, including the OpenTelemetry Collector and vendor backends that support OTLP ingestion over HTTP or gRPC.

After enabling the exporter, generate some traffic through the router and confirm that new metric series appear in your backend (for example HTTP server/client latency, cache metrics, and supergraph execution metrics).

If metrics do not appear, verify:

Endpoint reachability (network, DNS, TLS)
Authentication credentials or headers
Exporter protocol matches the backend (OTLP/HTTP vs OTLP/gRPC)

router.config.yaml


telemetry:
  metrics:
    exporters:
      - kind: otlp
        enabled: true
        protocol: http
        endpoint: https://otel-collector.example.com/v1/metrics
        interval: 30s
        max_export_timeout: 5s
        http:
          headers:
            authorization:
              expression: |
                "Bearer " + env("OTLP_TOKEN")
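To validate the export path end to end, it can help to point the Router at a Collector that simply prints what it receives. The following is a minimal sketch of an OpenTelemetry Collector configuration, not part of the Router configuration. The listen addresses and the choice of the debug exporter are assumptions for illustration, and this sketch does not check the authorization header sent by the Router.

otel-collector.yaml (hypothetical file name)

```yaml
receivers:
  otlp:
    protocols:
      # OTLP/HTTP, matching `protocol: http` on the Router side
      http:
        endpoint: 0.0.0.0:4318
      # OTLP/gRPC, matching `protocol: grpc` on the Router side
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  # Prints a summary of received telemetry to the Collector logs
  debug:
    verbosity: basic

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [debug]
```

Once metrics show up in the Collector logs, the debug exporter can be swapped for the exporter your backend requires.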

Expose metrics for Prometheus scraping

If your observability stack is Prometheus-first, Hive Router can expose an HTTP endpoint that Prometheus scrapes at its configured interval.

The port and path settings define the address where the Router exposes metrics. Prometheus must be able to reach that address from its runtime environment (local network, Kubernetes service, or VM network path).

Note

If port is not set, or is the same as the main HTTP server port, the Router exposes metrics through the same HTTP server that serves the GraphQL API. If the port is different, the Router starts a separate HTTP server dedicated solely to the Prometheus metrics endpoint.

In production, make sure this endpoint is reachable only by trusted scrapers (for example via network policy, firewall rules, or private ingress). Once configured, confirm the target appears as healthy in Prometheus and then verify expected series are present (for example http.server.request.duration, http.client.request.duration).

router.config.yaml


telemetry:
  metrics:
    exporters:
      - kind: prometheus
        enabled: true
        port: 9090
        path: /metrics
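On the Prometheus side, a matching scrape job might look like the sketch below. The target host name hive-router.internal and the 30s scrape interval are assumptions for illustration; the port and path mirror the Router configuration above.

prometheus.yml

```yaml
scrape_configs:
  - job_name: hive-router
    # Must match `path` in the Router's Prometheus exporter
    metrics_path: /metrics
    scrape_interval: 30s
    static_configs:
      # Hypothetical host; the port must match `port` in the Router config
      - targets: ['hive-router.internal:9090']
```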

Production baseline

For production workloads, start with a single primary exporter, define a clear service identity, and keep default instrumentation settings.

router.config.yaml


telemetry:
  resource:
    attributes:
      service.name: hive-router
      service.namespace: your-platform
      deployment.environment:
        expression: env("ENVIRONMENT")
  metrics:
    exporters:
      - kind: otlp
        enabled: true
        protocol: grpc
        endpoint: https://otel-collector.example.com:4317
        interval: 30s
        max_export_timeout: 5s

This is a safe baseline and works well before introducing instrumentation-level customization. Additional exporters can be added later, but starting with one simplifies validation and troubleshooting.
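As a sketch of what adding a second exporter later might look like, the fragment below lists the OTLP exporter from the baseline next to the Prometheus exporter shown earlier. It only reuses keys that already appear in this guide; check the configuration reference before relying on this combination in your deployment.

router.config.yaml

```yaml
telemetry:
  metrics:
    exporters:
      # Primary exporter from the baseline
      - kind: otlp
        enabled: true
        protocol: grpc
        endpoint: https://otel-collector.example.com:4317
        interval: 30s
        max_export_timeout: 5s
      # Secondary exporter added after the baseline is validated
      - kind: prometheus
        enabled: true
        port: 9090
        path: /metrics
```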

Customize instrumentation

You can override behavior per metric under telemetry.metrics.instrumentation.instruments.

false disables a metric.
true keeps default behavior.
An object enables the metric and optionally overrides its attributes.

router.config.yaml


telemetry:
  metrics:
    instrumentation:
      instruments:
        # Disable HTTP server request duration metric
        http.server.request.duration: false
        http.client.request.duration:
          attributes:
            # Disable the label
            subgraph.name: false
            # Enable the label (labels are enabled by default)
            http.response.status_code: true

Attribute override behavior:

false - drop label from that metric
true - keep label (all labels are enabled by default)

Histogram aggregation can also be customized under telemetry.metrics.instrumentation.common.histogram.

explicit (default) uses unit-specific bucket sets, which you can configure per unit:
- seconds for histogram unit s
- bytes for histogram unit By
exponential uses one shared exponential strategy for all histogram metrics.
record_min_max controls whether min and max are reported for histogram points.

Bucket format rules:

buckets can be either all numbers or all strings.
mixed arrays are not allowed.
seconds.buckets string values are parsed as durations (for example "5ms", "1s").
bytes.buckets string values are parsed as human-readable sizes (for example "1KB", "5MB").

In explicit mode, a histogram with any unit other than s or By causes the Router to fail at startup.

router.config.yaml


telemetry:
  metrics:
    instrumentation:
      common:
        histogram:
          aggregation: explicit
          seconds:
            buckets:
              [
                '5ms',
                '10ms',
                '25ms',
                '50ms',
                '75ms',
                '100ms',
                '250ms',
                '500ms',
                '750ms',
                '1s',
                '2.5s',
                '5s',
                '7.5s',
                '10s'
              ]
            record_min_max: false
          bytes:
            buckets:
              [
                '128B',
                '512B',
                '1KB',
                '2KB',
                '4KB',
                '8KB',
                '16KB',
                '32KB',
                '64KB',
                '128KB',
                '256KB',
                '512KB',
                '1MB',
                '2MB',
                '3MB',
                '4MB',
                '5MB'
              ]
            record_min_max: false
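Because buckets can also be all numbers, the same kind of configuration can be written without unit suffixes. The sketch below assumes that numeric values are interpreted in the histogram's own unit (seconds for s, bytes for By); the guide does not state this explicitly, so verify it against the configuration reference. The specific boundaries are illustrative only.

router.config.yaml

```yaml
telemetry:
  metrics:
    instrumentation:
      common:
        histogram:
          aggregation: explicit
          seconds:
            # Assumed to be plain seconds
            buckets: [0.005, 0.05, 0.25, 1, 5]
          bytes:
            # Assumed to be plain byte counts
            buckets: [1024, 65536, 1048576]
```

Mixing the two forms in one array, for example [0.005, '1s'], is not allowed.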

Metrics reference

GraphQL

GraphQL metrics capture errors surfaced by the router across all stages of a GraphQL request lifecycle.

Metrics

hive.router.graphql.errors_total

Unit:{error}

Counter

Total count of GraphQL errors encountered during query processing and execution, categorized by error code.

Labels

code

GraphQL error code

Typical Values

GRAPHQL_PARSE_FAILED, GRAPHQL_VALIDATION_FAILED, PLAN_EXECUTION_FAILED, UNKNOWN, ...

Uses "extensions.code" values and the router's own error codes. "UNKNOWN" is used when no code is available.

Supergraph

Supergraph metrics cover the polling and processing lifecycle of schema updates.

Metrics

hive.router.supergraph.poll.total

Counter

Total number of supergraph polling attempts, categorized by poll result.

Labels

result

hive.router.supergraph.poll.duration

Unit:Seconds

Histogram

Duration of supergraph polling attempts, categorized by poll result.

Labels

result

hive.router.supergraph.process.duration

Unit:Seconds

Histogram

Time spent processing supergraph updates, categorized by status.

Labels

status

result

Result of the poll

Typical Values

updated, not_modified, error

Used by "hive.router.supergraph.poll.*" metrics only

status

Supergraph processing status

Typical Values

ok, error

Used by "hive.router.supergraph.process.*" metrics only

HTTP server

HTTP server metrics capture inbound client traffic processed by the router.

Metrics

http.server.request.duration

Unit:Seconds

Histogram

Duration of inbound HTTP requests handled by the router.

Labels

http.request.method, http.response.status_code, http.route, network.protocol.name, network.protocol.version, url.scheme, error.type, graphql.operation.name, graphql.operation.type, graphql.response.status

http.server.request.body.size

Unit:Bytes

Histogram

Size of inbound HTTP request bodies handled by the router.

Labels

http.request.method, http.response.status_code, http.route, network.protocol.name, network.protocol.version, url.scheme, error.type, graphql.operation.name, graphql.operation.type, graphql.response.status

http.server.response.body.size

Unit:Bytes

Histogram

Size of outbound HTTP response bodies returned by the router.

Labels

http.request.method, http.response.status_code, http.route, network.protocol.name, network.protocol.version, url.scheme, error.type, graphql.operation.name, graphql.operation.type, graphql.response.status

http.server.active_requests

Unit:{request}

UpDownCounter

Current number of in-flight inbound HTTP requests.

Labels

http.request.method, network.protocol.name, url.scheme

http.request.method

HTTP method

Typical Values

GET, POST, PUT, PATCH, DELETE, HEAD, OPTIONS, CONNECT, TRACE, QUERY, _OTHER

_OTHER is fallback for unknown methods

http.response.status_code

Response status code

Typical Values

200, 400, 500, ...

http.route

Normalized router path

Typical Values

/graphql

network.protocol.name

Protocol name

Typical Values

http

network.protocol.version

Protocol version

Typical Values

0.9, 1.0, 1.1, 2, 3

url.scheme

URL scheme

Typical Values

http, https

error.type

Error classification for failed requests

Typical Values

status code >= 400

Only set for failed requests

graphql.operation.name

GraphQL operation name associated with the HTTP request

Typical Values

UsersQuery, IntrospectionQuery, UNKNOWN

Used by http.server.request.duration, http.server.request.body.size, and http.server.response.body.size

graphql.operation.type

GraphQL operation type

Typical Values

query, mutation, subscription

Used by http.server.request.duration, http.server.request.body.size, and http.server.response.body.size. Omitted when unknown

graphql.response.status

GraphQL response status for the request

Typical Values

ok, error

Used by http.server.request.duration, http.server.request.body.size, and http.server.response.body.size. "error" indicates the GraphQL response contains at least one error

HTTP client

HTTP client metrics capture outbound requests the router makes to subgraphs.

Metrics

http.client.request.duration

Unit:Seconds

Histogram

Duration of outbound HTTP requests sent from router to subgraphs.

Labels

http.request.method, server.address, server.port, network.protocol.name, network.protocol.version, url.scheme, subgraph.name, http.response.status_code, error.type

http.client.request.body.size

Unit:Bytes

Histogram

Size of outbound HTTP request bodies sent to subgraphs.

Labels

http.request.method, server.address, server.port, network.protocol.name, network.protocol.version, url.scheme, subgraph.name, http.response.status_code, error.type

http.client.response.body.size

Unit:Bytes

Histogram

Size of HTTP response bodies returned by subgraphs.

Labels

http.request.method, server.address, server.port, network.protocol.name, network.protocol.version, url.scheme, subgraph.name, http.response.status_code, error.type

http.client.active_requests

Unit:{request}

UpDownCounter

Current number of in-flight outbound HTTP requests to subgraphs.

Labels

http.request.method, server.address, server.port, url.scheme, subgraph.name

http.request.method

HTTP method

Typical Values

GET, POST, PUT, PATCH, DELETE, HEAD, OPTIONS, CONNECT, TRACE, QUERY, _OTHER

_OTHER is fallback for unknown methods

http.response.status_code

Response status code

Typical Values

200, 400, 500, ...

network.protocol.name

Protocol name

Typical Values

http

network.protocol.version

Protocol version

Typical Values

0.9, 1.0, 1.1, 2, 3

url.scheme

URL scheme

Typical Values

http, https

server.address

Subgraph host

Typical Values

URI host, unknown

URI host, or unknown fallback

server.port

Subgraph port

Typical Values

80, 443

Explicit URI port, or fallback 80/443

subgraph.name

Subgraph identifier

Typical Values

accounts

Configured names (for example "accounts")

error.type

Error classification

Typical Values

400, SUBGRAPH_REQUEST_FAILURE, ...

Numeric status code >= 400 or execution error code string

Cache

Cache metrics track lookup behavior and cache size across router caches used during request preparation and planning stages.

Parsing cache

Parsing cache metrics measure query parse cache hit/miss behavior and cache size.

Metrics

hive.router.parse_cache.requests_total

Counter

Total number of parsing cache lookups, categorized by result.

Labels

result

hive.router.parse_cache.duration

Unit:Seconds

Histogram

Duration of parsing cache lookups, categorized by result.

Labels

result

hive.router.parse_cache.size

Gauge

Current number of entries stored in the parsing cache.

Validation cache

Validation cache metrics measure query validation cache hit/miss behavior and cache size.

Metrics

hive.router.validate_cache.requests_total

Counter

Total number of validation cache lookups, categorized by result.

Labels

result

hive.router.validate_cache.duration

Unit:Seconds

Histogram

Duration of validation cache lookups, categorized by result.

Labels

result

hive.router.validate_cache.size

Gauge

Current number of entries stored in the validation cache.

Normalization cache

Normalization cache metrics measure query normalization cache hit/miss behavior and cache size.

Metrics

hive.router.normalize_cache.requests_total

Counter

Total number of normalization cache lookups, categorized by result.

Labels

result

hive.router.normalize_cache.duration

Unit:Seconds

Histogram

Duration of normalization cache lookups, categorized by result.

Labels

result

hive.router.normalize_cache.size

Gauge

Current number of entries stored in the normalization cache.

Planning cache

Planning cache metrics measure query planning cache hit/miss behavior and cache size.

Metrics

hive.router.plan_cache.requests_total

Counter

Total number of planning cache lookups, categorized by result.

Labels

result

hive.router.plan_cache.duration

Unit:Seconds

Histogram

Duration of planning cache lookups, categorized by result.

Labels

result

hive.router.plan_cache.size

Gauge

Current number of entries stored in the planning cache.

Labels

These labels are shared by cache lookup counters and duration histograms.

result

Cache lookup outcome

Typical Values

hit, miss

Used by cache requests_total and duration metrics

What to monitor in production

The examples below show which signals to monitor in production and how to break them down so you can quickly isolate API, subgraph, cache, and GraphQL issues.

Monitor end-to-end latency of your GraphQL API

Use http.server.request.duration as your primary latency signal.

In production, break this metric down by http.route, http.request.method, http.response.status_code, and/or graphql.response.status, then track p95 and p99 latency per route and method. Keep successful and failed responses separated so error-path latency does not get hidden by healthy traffic.
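If you scrape the Router with Prometheus, the breakdown above could be precomputed with recording rules along the lines of this sketch. The Prometheus-side names are assumptions: they follow the usual OpenTelemetry-to-Prometheus translation (dots become underscores, a unit suffix is appended to histograms), so confirm the exact series and label names on your /metrics endpoint. The rule and group names are hypothetical.

hive-router-latency.rules.yaml (hypothetical file name)

```yaml
groups:
  - name: hive-router-latency
    rules:
      # p95 latency per route, method, and GraphQL response status
      - record: hive_router:http_server_request_duration_seconds:p95
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, http_route, http_request_method, graphql_response_status) (
              rate(http_server_request_duration_seconds_bucket[5m])
            )
          )
      # p99 latency with the same breakdown
      - record: hive_router:http_server_request_duration_seconds:p99
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, http_route, http_request_method, graphql_response_status) (
              rate(http_server_request_duration_seconds_bucket[5m])
            )
          )
```

Keeping graphql_response_status in the grouping is what keeps ok and error latency in separate series.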

Monitor health of your subgraphs

Use http.client.request.duration and http.client.active_requests to monitor dependency health across your federated graph.

Break these metrics down by subgraph.name, http.response.status_code, and error.type to identify which subgraph is driving tail latency or error spikes.
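A possible Prometheus rule set for this breakdown is sketched below. Series and label names are assumptions based on the usual OpenTelemetry-to-Prometheus naming translation, and the rule names are hypothetical; verify them against what the Router actually exposes.

hive-router-subgraphs.rules.yaml (hypothetical file name)

```yaml
groups:
  - name: hive-router-subgraphs
    rules:
      # Tail latency per subgraph
      - record: hive_router:subgraph_request_duration_seconds:p99
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, subgraph_name) (
              rate(http_client_request_duration_seconds_bucket[5m])
            )
          )
      # Share of subgraph requests that carry an error.type label
      - record: hive_router:subgraph_request_errors:ratio_rate5m
        expr: |
          sum by (subgraph_name) (
            rate(http_client_request_duration_seconds_count{error_type!=""}[5m])
          )
          /
          sum by (subgraph_name) (
            rate(http_client_request_duration_seconds_count[5m])
          )
```

The error ratio returns no series for a subgraph that had no failed requests in the window, which is worth keeping in mind when building alerts on it.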

Monitor cache effectiveness and planning pressure

Use the cache metrics to evaluate cache hit ratio, miss cost, and pressure over time.

For request and duration metrics, split by result (hit and miss) so you can track hit ratio and miss latency per cache kind.
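As an illustration for the planning cache, the following Prometheus rules compute a hit ratio and the p95 duration of misses. The series names assume the usual OpenTelemetry-to-Prometheus translation of hive.router.plan_cache.requests_total and hive.router.plan_cache.duration, and the rule names are hypothetical. The same pattern would apply to the parse, validate, and normalize caches by swapping the metric prefix.

hive-router-cache.rules.yaml (hypothetical file name)

```yaml
groups:
  - name: hive-router-cache
    rules:
      # Fraction of planning cache lookups that were hits
      - record: hive_router:plan_cache_hit:ratio_rate5m
        expr: |
          sum(rate(hive_router_plan_cache_requests_total{result="hit"}[5m]))
          /
          sum(rate(hive_router_plan_cache_requests_total[5m]))
      # p95 duration of lookups that missed, as a proxy for miss cost
      - record: hive_router:plan_cache_miss_duration_seconds:p95
        expr: |
          histogram_quantile(
            0.95,
            sum by (le) (
              rate(hive_router_plan_cache_duration_seconds_bucket{result="miss"}[5m])
            )
          )
```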

Monitor GraphQL errors over time

Use hive.router.graphql.errors_total and break it down by code to track both volume and error distribution.

In production, monitor how error-code distribution changes over time, not only total count, so you can separate validation issues from execution failures.
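One way to track both volume and distribution in Prometheus is sketched below. The series name hive_router_graphql_errors_total is an assumption derived from the usual OpenTelemetry-to-Prometheus naming translation, and the rule names are hypothetical.

hive-router-graphql-errors.rules.yaml (hypothetical file name)

```yaml
groups:
  - name: hive-router-graphql-errors
    rules:
      # Error volume per error code
      - record: hive_router:graphql_errors:rate5m
        expr: sum by (code) (rate(hive_router_graphql_errors_total[5m]))
      # Share of each error code in the total error rate
      - record: hive_router:graphql_errors:share_by_code
        expr: |
          sum by (code) (rate(hive_router_graphql_errors_total[5m]))
          / on() group_left
          sum(rate(hive_router_graphql_errors_total[5m]))
```

Plotting the share per code over time makes a shift from, say, validation errors to execution failures visible even when the total stays flat.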

Configuration reference

For full options and defaults, see the telemetry configuration reference.