
Dashboards, Alerts, And Logs

Use Grafana dashboards, Prometheus alerts, and structured logs.


Grafana dashboards are provisioned from observability/grafana/dashboards.

Folders:

  • Infrastructure
  • Containers
  • Applications
  • Operations
  • Organizations
  • Logs
  • Traces

Key Dashboards

Operations dashboards include:

  • Gateway Request Dashboard
  • Provider Dashboard
  • Rate Limit Dashboard
  • Token Usage Dashboard
  • Usage Budget Dashboard
  • Backend Safety Dashboard
  • Cache Revalidation Dashboard
  • Logger Health Dashboard

Other useful dashboards:

  • Applications Overview
  • Organizations Overview
  • Logs Overview
  • Traces Overview
  • OTel Pipeline Health
  • Containers Health
  • Host Health
  • Infrastructure Overview

Important Metric Families

Gateway metrics include request, provider, stream, token, budget, rate-limit, plugin, safety, cache, provider-key, usage collector, and logger health metrics.

Look for signals such as:

  • request rate and 5xx ratio,
  • provider latency and errors,
  • stream failure rate,
  • token throughput,
  • organization token concentration,
  • rate-limit hits,
  • budget decisions,
  • usage enqueue/flush errors,
  • dropped log events,
  • provider key decrypt failures,
  • cache hits/misses,
  • plugin and SafetySec execution outcomes.
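The first two signals can be sketched as PromQL queries. The metric name gateway_requests_total and its status_code label are illustrative assumptions; substitute the actual names exported by your gateway build:

```promql
# Overall request rate over the last 5 minutes (metric name is hypothetical)
sum(rate(gateway_requests_total[5m]))

# Ratio of 5xx responses to all responses
sum(rate(gateway_requests_total{status_code=~"5.."}[5m]))
  / sum(rate(gateway_requests_total[5m]))
```

A sustained rise in the second query is the usual trigger for the gateway 5xx spike alert described below.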

Alerts

Prometheus alert rules include:

  • high host CPU,
  • high host memory,
  • container restart loop,
  • scraped service down,
  • OTLP export failures,
  • gateway 5xx spike,
  • provider error spike,
  • stream failure spike,
  • gateway logs dropped,
  • gateway log write errors,
  • usage collector enqueue failures,
  • usage collector flush failures,
  • Postgres readiness degraded,
  • Redis readiness degraded,
  • service metrics missing,
  • token throughput spike,
  • organization token concentration.

Alerts route through Alertmanager.
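As a sketch, a gateway 5xx spike rule could be written as follows. The metric name, threshold, and label values are assumptions to adapt to your rule files:

```yaml
groups:
  - name: gateway
    rules:
      - alert: GatewayHigh5xxRatio
        # Metric name and the 5% threshold are illustrative assumptions.
        expr: |
          sum(rate(gateway_requests_total{status_code=~"5.."}[5m]))
            / sum(rate(gateway_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: Gateway 5xx ratio above 5% for 10 minutes
```

The for: 10m clause keeps brief error bursts from paging; Alertmanager then routes the firing alert by its labels.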

Structured Logs

Gateway logs include normalized fields such as:

  • gateway.request_id
  • gateway.provider
  • gateway.model
  • gateway.route
  • gateway.endpoint
  • organization.id
  • trace_id
  • span_id
  • severity fields
  • component and event names

Do not promote high-cardinality identifiers such as request ID, trace ID, or organization ID as Loki labels. Keep them in structured log fields and query the log body.
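For example, to find a specific request without a high-cardinality label, select on a low-cardinality stream label and parse the identifier out of the log body with LogQL. The service label name here is an assumption; LogQL's json parser flattens nested keys, so gateway.request_id becomes gateway_request_id:

```logql
{service="odock-server"}
  | json
  | gateway_request_id = "req-123"
```

This keeps Loki's index small while still letting you filter on any structured field at query time.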

Grafana Explore Workflows

From trace to logs

  1. Open Tempo Explore.
  2. Search by service name or trace ID.
  3. Open a trace.
  4. Use traces-to-logs to query Loki around the span time.

From logs to trace

  1. Open Loki Explore.
  2. Search for a request ID or error.
  3. Use the derived TraceID field if present.
  4. Open the trace in Tempo.
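The derived TraceID field in step 3 comes from the Loki data source configuration. A provisioning sketch follows; the matcher regex and the Tempo data source UID are assumptions to match your setup:

```yaml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    jsonData:
      derivedFields:
        - name: TraceID
          # Assumes trace_id appears as a JSON field in the log line.
          matcherRegex: '"trace_id":"(\w+)"'
          # $$ escapes the literal $ in Grafana provisioning files.
          url: '$${__value.raw}'
          datasourceUid: tempo
```

With this in place, Grafana renders TraceID as a link that opens the matching trace in Tempo.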

From alert to dashboard

  1. Open Alerting in Grafana or Alertmanager.
  2. Inspect labels such as service, environment, provider, and instance.
  3. Open the matching Operations dashboard.
  4. Pivot to logs by service and time range.
  5. Pivot to Tempo with trace IDs from logs or exemplars.

Troubleshooting Missing Telemetry

If metrics are missing:

  • confirm /metrics is enabled in the gateway,
  • confirm Prometheus target odock-server is up,
  • check the service_metrics_missing alert,
  • confirm there are no duplicate or conflicting exporter settings.
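The target check can be expressed as PromQL against the up series, assuming the gateway is scraped under the job name odock-server (use whatever job name your scrape config defines):

```promql
# 1 if the odock-server target is up, 0 if the last scrape failed
up{job="odock-server"}

# Targets with no successful scrape in the last 5 minutes
max_over_time(up{job="odock-server"}[5m]) == 0
```

The second query mirrors the condition the service_metrics_missing alert watches for.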

If traces are missing:

  • confirm OBSERVABILITY_OTEL_TRACES_EXPORTER=otlphttp,
  • confirm the collector is reachable at OBSERVABILITY_OTEL_ENDPOINT,
  • check the OTel Pipeline Health dashboard,
  • check collector exporter failure metrics.

If logs are missing:

  • confirm the gateway logs to stdout,
  • confirm Promtail is running,
  • confirm Loki is healthy,
  • query the container service label in Loki.
