Observability
Dashboards, Alerts, And Logs
Use Grafana dashboards, Prometheus alerts, and structured logs.
Dashboards
Grafana dashboards are provisioned from observability/grafana/dashboards.
Folders:
- Infrastructure
- Containers
- Applications
- Operations
- Organizations
- Logs
- Traces
Key Dashboards
Operations dashboards include:
- Gateway Request Dashboard
- Provider Dashboard
- Rate Limit Dashboard
- Token Usage Dashboard
- Usage Budget Dashboard
- Backend Safety Dashboard
- Cache Revalidation Dashboard
- Logger Health Dashboard
Other useful dashboards:
- Applications Overview
- Organizations Overview
- Logs Overview
- Traces Overview
- OTel Pipeline Health
- Containers Health
- Host Health
- Infrastructure Overview
Important Metrics Families
Gateway metric families cover requests, providers, streams, tokens, budgets, rate limits, plugins, safety, caching, provider keys, the usage collector, and logger health.
Look for signals such as:
- request rate and 5xx ratio,
- provider latency and errors,
- stream failure rate,
- token throughput,
- organization token concentration,
- rate-limit hits,
- budget decisions,
- usage enqueue/flush errors,
- dropped log events,
- provider key decrypt failures,
- cache hits/misses,
- plugin and SafetySec execution outcomes.
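As a sketch, the request-rate and 5xx-ratio signals above might be queried like this in PromQL (the metric and label names here are assumptions; substitute the gateway's actual metric names):

```promql
# Hypothetical metric names -- adjust to the gateway's exposition.
# 5xx ratio over the last 5 minutes:
sum(rate(gateway_requests_total{status=~"5.."}[5m]))
  / sum(rate(gateway_requests_total[5m]))

# Provider p99 latency, broken out by provider:
histogram_quantile(0.99,
  sum by (le, provider) (rate(gateway_provider_request_duration_seconds_bucket[5m])))
```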
Alerts
Prometheus alert rules include:
- high host CPU,
- high host memory,
- container restart loop,
- scraped service down,
- OTLP export failures,
- gateway 5xx spike,
- provider error spike,
- stream failure spike,
- gateway logs dropped,
- gateway log write errors,
- usage collector enqueue failures,
- usage collector flush failures,
- Postgres readiness degraded,
- Redis readiness degraded,
- service metrics missing,
- token throughput spike,
- organization token concentration.
Alerts route through Alertmanager.
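For orientation, a gateway 5xx spike rule could be sketched like this (the metric names, thresholds, and labels are illustrative assumptions, not the repository's shipped rules):

```yaml
groups:
  - name: gateway
    rules:
      # Hypothetical rule: sustained 5xx ratio above 5% for 10 minutes.
      - alert: GatewayHigh5xxRatio
        expr: |
          sum(rate(gateway_requests_total{status=~"5.."}[5m]))
            / sum(rate(gateway_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: critical
          service: odock-server
        annotations:
          summary: "Gateway 5xx ratio above 5% for 10 minutes"
```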
Structured Logs
Gateway logs include normalized fields such as:
- gateway.request_id
- gateway.provider
- gateway.model
- gateway.route
- gateway.endpoint
- organization.id
- trace_id
- span_id
- severity fields
- component and event names
Do not promote high-cardinality identifiers such as request ID, trace ID, or organization ID as Loki labels. Keep them in structured log fields and query the log body.
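As an illustrative sketch (the helper and all field values are hypothetical, not the gateway's actual logger), an event carrying these normalized fields would be emitted as a single JSON line, with the high-cardinality IDs in the body rather than as labels:

```python
import json
import uuid

# Hypothetical helper: emit one gateway event as a single JSON log line,
# keeping high-cardinality IDs (request ID, trace ID, organization ID)
# in the structured body rather than promoting them to Loki labels.
def log_event(event, **fields):
    record = {"component": "gateway", "event": event, "severity": "info", **fields}
    print(json.dumps(record))
    return record

record = log_event(
    "request.completed",
    **{
        "gateway.request_id": str(uuid.uuid4()),
        "gateway.provider": "example-provider",          # hypothetical value
        "gateway.model": "example-model",                # hypothetical value
        "gateway.route": "/v1/chat/completions",         # hypothetical value
        "organization.id": "org-123",                    # hypothetical value
        "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",  # W3C example trace ID
        "span_id": "00f067aa0ba902b7",                   # W3C example span ID
    },
)
```

Because the IDs live in the JSON body, Loki's json parser can still filter on them at query time without exploding the label cardinality.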
Grafana Explore Workflows
From trace to logs
- Open Tempo Explore.
- Search by service name or trace ID.
- Open a trace.
- Use traces-to-logs to query Loki around the span time.
From logs to trace
- Open Loki Explore.
- Search for a request ID or error.
- Use the derived TraceID field if present.
- Open the trace in Tempo.
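The Loki side of this pivot might look like the following LogQL (the service label, field names, and request ID are assumptions; Loki's json parser flattens dotted field names like gateway.request_id to underscores):

```logql
# Hypothetical query: find a request's log lines and surface its trace ID.
{service="odock-server"}
  | json
  | gateway_request_id = `req-abc123`
  | line_format `trace={{.trace_id}} event={{.event}}`
```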
From alert to dashboard
- Open Alerting in Grafana or Alertmanager.
- Inspect labels such as service, environment, provider, and instance.
- Open the matching Operations dashboard.
- Pivot to logs by service and time range.
- Pivot to Tempo with trace IDs from logs or exemplars.
Troubleshooting Missing Telemetry
If metrics are missing:
- confirm /metrics is enabled in the gateway,
- confirm the Prometheus target odock-server is up,
- check the service_metrics_missing alert,
- confirm there is no duplicate or conflicting exporter setting.
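One quick check from Prometheus itself is the up series for the target (the job name here is assumed to match the odock-server target named above):

```promql
# 1 = target scraped successfully, 0 = down
# (0 corresponds to the scraped-service-down condition).
up{job="odock-server"}
```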
If traces are missing:
- confirm OBSERVABILITY_OTEL_TRACES_EXPORTER=otlphttp,
- confirm the collector is reachable at OBSERVABILITY_OTEL_ENDPOINT,
- check the OTel Pipeline Health dashboard,
- check collector exporter failure metrics.
If logs are missing:
- confirm the gateway logs to stdout,
- confirm Promtail is running,
- confirm Loki is healthy,
- query the container service label in Loki.
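As a final check, a bare stream selector over the service label confirms the Promtail-to-Loki path end to end (the label name and value are assumptions; match them to the Promtail configuration):

```logql
# Any line returned here means log lines are reaching Loki
# for the container's service label.
{service="odock-server"}
```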