ODOCK.AI

Metrics

The gateway metric catalog and label policy.

Metrics are the aggregated health signal of the LGTM stack. They answer questions like:

  • Is the gateway healthy?
  • Is one provider slow or erroring?
  • Are plugins adding overhead?
  • Is the usage or budget pipeline failing?

The stack keeps gateway metrics intentionally low-cardinality so dashboards, recording rules, and alerts remain cheap and predictable.
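Low cardinality matters because a metric can produce at most one time series per combination of label values, so the worst case is the product of each label's distinct values. The sketch below is written in Go, which is assumed here because the runtime metrics describe a Go process. It only multiplies hypothetical label sizes; `seriesUpperBound` is an illustrative helper, not part of the gateway.

```go
package main

import "fmt"

// seriesUpperBound returns the worst-case number of time series one metric
// can produce: the product of the distinct values each of its labels can take.
// A metric with no labels produces exactly one series.
func seriesUpperBound(distinctValues map[string]int) int {
	total := 1
	for _, n := range distinctValues {
		total *= n
	}
	return total
}

func main() {
	// Hypothetical label sizes: 20 routes, 5 status classes, 5 providers.
	bounded := map[string]int{"route": 20, "status_class": 5, "provider": 5}
	fmt.Println(seriesUpperBound(bounded)) // 500

	// Adding a per-request label multiplies that by every request ID seen.
	unbounded := map[string]int{"route": 20, "status_class": 5, "provider": 5, "request_id": 1_000_000}
	fmt.Println(seriesUpperBound(unbounded)) // 500000000
}
```

One unbounded label turns a few hundred series into hundreds of millions, which is why per-request identifiers are kept off metrics.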

[Figure: Metrics on Grafana]

Catalog

Requests

| Metric | Type | Notes |
| --- | --- | --- |
| gateway.requests.total | counter | One increment per request, labeled by status class. |
| gateway.request.duration | histogram | Total request latency. |
| gateway.requests.in_flight | up-down counter | Concurrent in-flight requests. |
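A minimal sketch of how the three request instruments relate, assuming `status_class` values of the form `2xx`, `4xx`, `5xx`. The exact values, the `unknown` fallback, and the `requestMetrics` type are hypothetical, and plain maps stand in for real OpenTelemetry instruments.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// statusClass collapses an HTTP status code into a bounded label value,
// so gateway.requests.total gets only a handful of series per label set.
func statusClass(code int) string {
	if code < 100 || code > 599 {
		return "unknown" // hypothetical fallback for invalid codes
	}
	return fmt.Sprintf("%dxx", code/100)
}

// requestMetrics is a toy in-process stand-in for the real instruments.
type requestMetrics struct {
	mu       sync.Mutex
	total    map[string]int64 // gateway.requests.total, keyed by status_class
	inFlight atomic.Int64     // gateway.requests.in_flight (goes up and down)
}

// track wraps one request: in_flight rises on entry and falls on exit, and the
// total counter is incremented exactly once with the final status class.
func (m *requestMetrics) track(handle func() int) {
	m.inFlight.Add(1)
	defer m.inFlight.Add(-1)
	code := handle()
	m.mu.Lock()
	m.total[statusClass(code)]++
	m.mu.Unlock()
}

func main() {
	m := &requestMetrics{total: map[string]int64{}}
	var wg sync.WaitGroup
	for _, code := range []int{200, 201, 429, 502} {
		wg.Add(1)
		go func(c int) {
			defer wg.Done()
			m.track(func() int { return c })
		}(code)
	}
	wg.Wait()
	fmt.Println(m.total, m.inFlight.Load()) // map[2xx:2 4xx:1 5xx:1] 0
}
```

Because the in-flight value is decremented on every exit path, it returns to zero once all requests finish, which is what distinguishes an up-down counter from a monotonic counter.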

Providers

| Metric | Type |
| --- | --- |
| gateway.provider.requests.total | counter |
| gateway.provider.request.duration | histogram |
| gateway.provider.errors.total | counter |
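Provider counters and histograms are cumulative, so a "slow or erroring provider" check compares two points in time. A minimal sketch, assuming the duration histogram exposes a sum and a count in seconds and that no counter reset happened between the two snapshots. `snapshot`, `windowStats`, and the provider name are hypothetical.

```go
package main

import "fmt"

// snapshot holds cumulative values for one provider at one instant.
type snapshot struct {
	requests    float64 // gateway.provider.requests.total
	errors      float64 // gateway.provider.errors.total
	durationSum float64 // sum of gateway.provider.request.duration (seconds, assumed)
	durationCnt float64 // count of gateway.provider.request.duration
}

// windowStats compares two snapshots of the same provider and returns the
// error ratio and mean latency of the requests that happened in between.
// It assumes no counter reset occurred between the two snapshots.
func windowStats(before, after snapshot) (errorRatio, meanLatency float64) {
	if reqs := after.requests - before.requests; reqs > 0 {
		errorRatio = (after.errors - before.errors) / reqs
	}
	if cnt := after.durationCnt - before.durationCnt; cnt > 0 {
		meanLatency = (after.durationSum - before.durationSum) / cnt
	}
	return errorRatio, meanLatency
}

func main() {
	// Hypothetical values for provider="provider-a", five minutes apart.
	before := snapshot{requests: 1000, errors: 10, durationSum: 800, durationCnt: 1000}
	after := snapshot{requests: 1200, errors: 30, durationSum: 1100, durationCnt: 1200}
	ratio, latency := windowStats(before, after)
	fmt.Printf("error ratio %.2f, mean latency %.2fs\n", ratio, latency) // error ratio 0.10, mean latency 1.50s
}
```

The mean is cheap to compute but hides tail latency; percentiles need the histogram's buckets.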

Plugins

| Metric | Type |
| --- | --- |
| gateway.plugin.executions.total | counter |
| gateway.plugin.execution.duration | histogram |
| gateway.plugin.errors.total | counter |

Routing

| Metric | Type |
| --- | --- |
| gateway.routing.decisions.total | counter |
| gateway.routing.duration | histogram |
| gateway.routing.attempts | histogram |
| gateway.routing.errors.total | counter |

Rate Limit, Security, Cache

| Metric | Type |
| --- | --- |
| gateway.ratelimit.decisions.total | counter |
| gateway.ratelimit.rejections.total | counter |
| gateway.security.module_runs.total | counter |
| gateway.security.blocks.total | counter |
| gateway.security.redactions.total | counter |
| gateway.security.decisions.total | counter |
| gateway.cache.lookups.total | counter |
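The cache lookup counter becomes a hit ratio once it is split by the `cache_hit` label. This sketch assumes the label takes the values `true` and `false` and that the inputs are per-label increases over the same window; `hitRatio` is a hypothetical helper.

```go
package main

import "fmt"

// hitRatio computes the share of cache lookups that hit, from the per-label
// increases of gateway.cache.lookups.total over one window.
// The cache_hit label values "true" and "false" are an assumption.
func hitRatio(lookups map[string]float64) (float64, bool) {
	hits := lookups["true"]
	total := hits + lookups["false"]
	if total == 0 {
		return 0, false // no lookups in the window: the ratio is undefined
	}
	return hits / total, true
}

func main() {
	ratio, ok := hitRatio(map[string]float64{"true": 300, "false": 100})
	fmt.Println(ratio, ok) // 0.75 true
}
```

Returning an explicit `ok` flag keeps an idle window from being reported as a 0% hit rate.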

Streaming

| Metric | Type |
| --- | --- |
| gateway.stream.sessions.total | counter |
| gateway.stream.duration | histogram |

Usage And Budgets

| Metric | Type |
| --- | --- |
| gateway.usage.publish.total | counter |
| gateway.usage.enqueue_errors.total | counter |
| gateway.usage.ingest_errors.total | counter |
| gateway.usage.flush_errors.total | counter |
| gateway.usage.flushed.total | counter |
| gateway.usage.queue_depth | gauge |
| gateway.budget.decisions.total | counter |
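Usage counters such as `gateway.usage.flush_errors.total` are cumulative and restart from zero when the gateway process restarts, so alerts should look at their increase over a window rather than at the raw value. Below is a minimal sketch of the counter-reset handling Prometheus applies in `increase()` and `rate()`, without Prometheus's extrapolation to the window edges. The Go `increase` function is a hypothetical helper, not the PromQL function itself, and the sample values are made up.

```go
package main

import "fmt"

// increase returns how much a cumulative counter grew across ordered samples.
// Any drop in value is treated as a counter reset (process restart): the
// counter is assumed to have restarted from zero, so the new value is all
// growth since the reset.
func increase(samples []float64) float64 {
	var total float64
	for i := 1; i < len(samples); i++ {
		delta := samples[i] - samples[i-1]
		if delta < 0 {
			delta = samples[i] // reset detected
		}
		total += delta
	}
	return total
}

func main() {
	// Hypothetical scrapes of gateway.usage.flush_errors.total; the gateway
	// restarted between the third and fourth sample.
	fmt.Println(increase([]float64{4, 6, 9, 1, 3})) // 8
}
```

Subtracting the first sample from the last would give -1 here; the reset-aware sum correctly reports 8 new flush errors.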

Tokens

gateway.tokens.total counters exist per class, including input, output, cached, reasoning, embeddings, tool-input, tool-output, billable, and rejected tokens.

These series feed token-throughput, spend, and concentration dashboards.
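A concentration view shows how much of the token volume goes to each provider. A minimal sketch, assuming the inputs are per-`provider` increases of one token class (for example billable tokens) over the same window. `concentration`, `share`, and the provider names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// share is one provider's fraction of the tokens counted in a window.
type share struct {
	Provider string
	Fraction float64
}

// concentration ranks providers by their share of a token class's increase,
// largest first. It returns an empty slice when no tokens were counted.
func concentration(tokensByProvider map[string]float64) []share {
	var total float64
	for _, v := range tokensByProvider {
		total += v
	}
	out := make([]share, 0, len(tokensByProvider))
	if total == 0 {
		return out
	}
	for p, v := range tokensByProvider {
		out = append(out, share{p, v / total})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Fraction != out[j].Fraction {
			return out[i].Fraction > out[j].Fraction
		}
		return out[i].Provider < out[j].Provider // deterministic order for ties
	})
	return out
}

func main() {
	// Hypothetical billable-token increases per provider label value.
	ranked := concentration(map[string]float64{"provider-a": 600, "provider-b": 300, "provider-c": 100})
	for _, s := range ranked {
		fmt.Printf("%s %.0f%%\n", s.Provider, s.Fraction*100) // provider-a 60%, provider-b 30%, provider-c 10%
	}
}
```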

Async Pools

| Metric | Type |
| --- | --- |
| gateway.async.queue_depth | gauge |
| gateway.async.tasks.total | counter |
| gateway.async.task.duration | histogram |

Logger Pipeline

| Metric | Type |
| --- | --- |
| gateway.logs.enqueued.total | counter |
| gateway.logs.dropped.total | counter |
| gateway.logs.write_errors.total | counter |
| gateway.logs.batch_size | histogram |
| gateway.logs.queue_depth | gauge |
| gateway.logs.flush_duration | histogram |

Process Runtime

| Metric | Type |
| --- | --- |
| process.runtime.go.goroutines | gauge |
| process.runtime.go.mem.heap_alloc | gauge |
| process.runtime.go.mem.heap_inuse | gauge |
| process.runtime.go.mem.heap_objects | gauge |
| process.runtime.go.gc.cycles | counter |
| process.runtime.go.gc.pause_total | counter |
| process.runtime.uptime | gauge |
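Each runtime series above corresponds to a value the Go runtime exposes directly. This sketch reads those values with the standard `runtime` package. The field-to-metric mapping is inferred from the metric names rather than taken from the gateway's instrumentation, and `runtimeSnapshot` is a hypothetical helper. The GC pause total is left in nanoseconds, as `runtime.MemStats` reports it; the unit the gateway exports may differ.

```go
package main

import (
	"fmt"
	"runtime"
	"sort"
	"time"
)

var startTime = time.Now()

// runtimeSnapshot reads the Go runtime values that the process.runtime.*
// series appear to describe. The mapping is an assumption based on names.
func runtimeSnapshot() map[string]float64 {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms) // briefly stops the world; fine at scrape intervals
	return map[string]float64{
		"process.runtime.go.goroutines":       float64(runtime.NumGoroutine()),
		"process.runtime.go.mem.heap_alloc":   float64(ms.HeapAlloc),   // bytes
		"process.runtime.go.mem.heap_inuse":   float64(ms.HeapInuse),   // bytes
		"process.runtime.go.mem.heap_objects": float64(ms.HeapObjects), // live objects
		"process.runtime.go.gc.cycles":        float64(ms.NumGC),
		"process.runtime.go.gc.pause_total":   float64(ms.PauseTotalNs), // nanoseconds
		"process.runtime.uptime":              time.Since(startTime).Seconds(),
	}
}

func main() {
	values := runtimeSnapshot()
	names := make([]string, 0, len(values))
	for name := range values {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Printf("%s = %g\n", name, values[name])
	}
}
```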

How Organisation Users Usually Consume Metrics

Most organisation users will not query Prometheus directly. They will see these metrics through Grafana dashboards such as:

  • Gateway Request Dashboard
  • Provider Dashboard
  • Rate Limit Dashboard
  • Token Usage Dashboard
  • Logger Health Dashboard

See Grafana dashboards.

Label Policy

Allowed labels are those whose values come from a small, bounded set, such as:

  • provider
  • route
  • endpoint
  • plugin_name and plugin_stage
  • ratelimit_stage and ratelimit_module
  • security_module
  • decision
  • status_class
  • stream
  • cache_hit
  • async_pool_name

High-cardinality fields like request_id, trace_id, and organization_id stay on traces and logs instead of metrics.
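If you extend the gateway, one way to enforce this policy is to check label keys against an allowlist. The following is a hypothetical lint helper, not the gateway's own mechanism. The allowlist copies the labels listed above, and the example label values are made up.

```go
package main

import (
	"fmt"
	"sort"
)

// allowedLabels mirrors the bounded label keys listed in the label policy.
var allowedLabels = map[string]bool{
	"provider": true, "route": true, "endpoint": true,
	"plugin_name": true, "plugin_stage": true,
	"ratelimit_stage": true, "ratelimit_module": true,
	"security_module": true, "decision": true, "status_class": true,
	"stream": true, "cache_hit": true, "async_pool_name": true,
}

// checkLabels returns the sorted label keys that are not on the allowlist,
// such as request_id or trace_id, which belong on traces and logs instead.
func checkLabels(labels map[string]string) []string {
	var rejected []string
	for k := range labels {
		if !allowedLabels[k] {
			rejected = append(rejected, k)
		}
	}
	sort.Strings(rejected)
	return rejected
}

func main() {
	fmt.Println(checkLabels(map[string]string{
		"provider":        "provider-a",
		"status_class":    "5xx",
		"request_id":      "req-123",
		"organization_id": "org-42",
	})) // [organization_id request_id]
}
```

An allowlist on keys does not bound the values themselves; a key like `route` stays low-cardinality only if it carries route templates rather than raw paths.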

Tips

Use recorded series and prebuilt dashboards for first-line investigation. Raw histogram queries are usually unnecessary outside deep debugging.

Changing labels on an existing metric is a breaking change for dashboards and alerts. If you extend the gateway, prefer adding a new metric instead of mutating an existing one.
