Observability
Dashboards, Alerts, And Logs
Use Grafana dashboards, Prometheus alerts, and structured logs.
Dashboards
Grafana dashboards are provisioned from observability/grafana/dashboards.
Folders:
- Infrastructure
- Containers
- Applications
- Operations
- Organizations
- Logs
- Traces
Key Dashboards
Operations dashboards include:
- Gateway Request Dashboard
- Provider Dashboard
- Rate Limit Dashboard
- Token Usage Dashboard
- Usage Budget Dashboard
- Backend Safety Dashboard
- Cache Revalidation Dashboard
- Logger Health Dashboard
Other useful dashboards:
- Applications Overview
- Organizations Overview
- Logs Overview
- Traces Overview
- OTel Pipeline Health
- Containers Health
- Host Health
- Infrastructure Overview
Important Metrics Families
Gateway metric families cover requests, providers, streams, tokens, budgets, rate limits, plugins, safety, caching, provider keys, the usage collector, and logger health.
Look for signals such as:
- request rate and 5xx ratio,
- provider latency and errors,
- stream failure rate,
- token throughput,
- organization token concentration,
- rate-limit hits,
- budget decisions,
- usage enqueue/flush errors,
- dropped log events,
- provider key decrypt failures,
- cache hits/misses,
- plugin and SafetySec execution outcomes.
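As a sketch, the request-rate and 5xx-ratio signals above might be queried like this in PromQL (the metric and label names here are assumptions; substitute the gateway's actual metric names):

```promql
# Hypothetical metric names -- adjust to the gateway's exposition.
# 5xx ratio over the last 5 minutes:
sum(rate(gateway_requests_total{status=~"5.."}[5m]))
  / sum(rate(gateway_requests_total[5m]))

# Provider p99 latency, broken out by provider:
histogram_quantile(0.99,
  sum by (le, provider) (rate(gateway_provider_request_duration_seconds_bucket[5m])))
```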
Alerts
Prometheus alert rules include:
- high host CPU,
- high host memory,
- container restart loop,
- scraped service down,
- OTLP export failures,
- gateway 5xx spike,
- provider error spike,
- stream failure spike,
- gateway logs dropped,
- gateway log write errors,
- usage collector enqueue failures,
- usage collector flush failures,
- Postgres readiness degraded,
- Redis readiness degraded,
- service metrics missing,
- token throughput spike,
- organization token concentration.
Alerts route through Alertmanager.
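For orientation, a gateway 5xx spike rule could be sketched like this (the metric names, thresholds, and labels are illustrative assumptions, not the repository's shipped rules):

```yaml
groups:
  - name: gateway
    rules:
      # Hypothetical rule: sustained 5xx ratio above 5% for 10 minutes.
      - alert: GatewayHigh5xxRatio
        expr: |
          sum(rate(gateway_requests_total{status=~"5.."}[5m]))
            / sum(rate(gateway_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: critical
          service: odock-server
        annotations:
          summary: "Gateway 5xx ratio above 5% for 10 minutes"
```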
Structured Logs
Gateway logs include normalized fields such as:
- gateway.request_id
- gateway.provider
- gateway.model
- gateway.route
- gateway.endpoint
- organization.id
- trace_id
- span_id
- severity fields
- component and event names
Do not promote high-cardinality identifiers such as request ID, trace ID, or organization ID as Loki labels. Keep them in structured log fields and query the log body.
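As an illustrative sketch (the helper and all field values are hypothetical, not the gateway's actual logger), an event carrying these normalized fields would be emitted as a single JSON line, with the high-cardinality IDs in the body rather than as labels:

```python
import json
import uuid

# Hypothetical helper: emit one gateway event as a single JSON log line,
# keeping high-cardinality IDs (request ID, trace ID, organization ID)
# in the structured body rather than promoting them to Loki labels.
def log_event(event, **fields):
    record = {"component": "gateway", "event": event, "severity": "info", **fields}
    print(json.dumps(record))
    return record

record = log_event(
    "request.completed",
    **{
        "gateway.request_id": str(uuid.uuid4()),
        "gateway.provider": "example-provider",          # hypothetical value
        "gateway.model": "example-model",                # hypothetical value
        "gateway.route": "/v1/chat/completions",         # hypothetical value
        "organization.id": "org-123",                    # hypothetical value
        "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",  # W3C example trace ID
        "span_id": "00f067aa0ba902b7",                   # W3C example span ID
    },
)
```

Because the IDs live in the JSON body, Loki's json parser can still filter on them at query time without exploding the label cardinality.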
Grafana Explore Workflows
From trace to logs
- Open Tempo Explore.
- Search by service name or trace ID.
- Open a trace.
- Use traces-to-logs to query Loki around the span time.
From logs to trace
- Open Loki Explore.
- Search for a request ID or error.
- Use the derived TraceID field if present.
- Open the trace in Tempo.
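The Loki side of this pivot might look like the following LogQL (the service label, field names, and request ID are assumptions; Loki's json parser flattens dotted field names like gateway.request_id to underscores):

```logql
# Hypothetical query: find a request's log lines and surface its trace ID.
{service="odock-server"}
  | json
  | gateway_request_id = `req-abc123`
  | line_format `trace={{.trace_id}} event={{.event}}`
```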
From alert to dashboard
- Open Alerting in Grafana or Alertmanager.
- Inspect labels such as service, environment, provider, and instance.
- Open the matching Operations dashboard.
- Pivot to logs by service and time range.
- Pivot to Tempo with trace IDs from logs or exemplars.
Troubleshooting Missing Telemetry
If metrics are missing:
- confirm /metrics is enabled in the gateway,
- confirm the Prometheus target odock-server is up,
- check the service_metrics_missing alert,
- confirm there is no duplicate or conflicting exporter setting.
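One quick check from Prometheus itself is the up series for the target (the job name here is assumed to match the odock-server target named above):

```promql
# 1 = target scraped successfully, 0 = down
# (0 corresponds to the scraped-service-down condition).
up{job="odock-server"}
```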
If traces are missing:
- confirm OBSERVABILITY_OTEL_TRACES_EXPORTER=otlphttp,
- confirm the collector is reachable at OBSERVABILITY_OTEL_ENDPOINT,
- check the OTel Pipeline Health dashboard,
- check collector exporter failure metrics.
If logs are missing:
- confirm the gateway logs to stdout,
- confirm Promtail is running,
- confirm Loki is healthy,
- query the container service label in Loki.
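As a final check, a bare stream selector over the service label confirms the Promtail-to-Loki path end to end (the label name and value are assumptions; match them to the Promtail configuration):

```logql
# Any line returned here means log lines are reaching Loki
# for the container's service label.
{service="odock-server"}
```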