Alerts

The Prometheus alerting rules shipped with the stack.

Alerts are the proactive side of the LGTM stack. They turn degraded telemetry or degraded request health into notifications before users have to discover the issue themselves.

In most deployments, platform operators own alert routing and contact points. Organisation users with stack access are still usually expected to read the alert state, identify the alert family, and bring the right evidence to the platform team.

Routing And Delivery

The Prometheus alerting rules live in observability/prometheus/rules/alerts.yml. Delivery configuration (routing and contact points) is provisioned through Grafana alerting and the Alertmanager configuration files.
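
As an illustration of what the delivery side can look like, here is a minimal Alertmanager routing sketch. The receiver names and webhook URLs are placeholders, not values shipped with the stack, and the actual deployment may route through Grafana contact points instead.

```yaml
# alertmanager.yml (sketch; receiver names and URLs are hypothetical)
route:
  receiver: platform-default          # fallback for anything not matched below
  group_by: [alertname, severity]
  routes:
    - matchers:
        - severity="critical"
      receiver: platform-oncall       # critical alerts go to the on-call contact point
receivers:
  - name: platform-default
    webhook_configs:
      - url: http://example.internal/hooks/alerts
  - name: platform-oncall
    webhook_configs:
      - url: http://example.internal/hooks/oncall
```

Routing on the severity label keeps the rule file and the delivery configuration independent: rules only declare how severe a condition is, and operators decide who hears about it.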

Alert Families

Infrastructure

| Alert | Trigger | Severity |
| --- | --- | --- |
| high_cpu_usage | Host CPU above 85% for 10 minutes | warning |
| high_memory_usage | Host memory above 90% for 10 minutes | warning |
| container_restart_loop | More than 3 restarts in 15 minutes on a compose service | critical |
| service_down | Any core service down for 2 minutes | critical |
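
To show how triggers like these map onto a Prometheus rule file, here is a sketch of two of the infrastructure rules. It assumes host metrics come from node_exporter and that every core service is a scrape target. The expressions in the shipped alerts.yml may differ.

```yaml
groups:
  - name: infrastructure
    rules:
      - alert: high_cpu_usage
        # Busy CPU share per host: 100% minus the idle share averaged across cores.
        expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 85
        for: 10m                      # condition must hold for 10 minutes before firing
        labels:
          severity: warning
      - alert: service_down
        # `up` is 0 when Prometheus fails to scrape a target.
        # Assumption: no job filter, so this covers every scraped target.
        expr: up == 0
        for: 2m
        labels:
          severity: critical
```

The `for` clause carries the "for N minutes" part of each trigger. The alert stays pending until the expression has been true for that long.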

OTel Pipeline

| Alert | Trigger | Severity |
| --- | --- | --- |
| otlp_export_failures_sustained | Collector exporter failing to send spans, metrics, or log records for 10 minutes | critical |
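
A rule of this shape could be built on the OpenTelemetry Collector's internal telemetry counters, as sketched below. The metric names follow the Collector's otelcol_exporter_send_failed_* series. Depending on the Collector version they may carry a _total suffix, so treat the exact names as an assumption to verify against your own metrics endpoint.

```yaml
groups:
  - name: otel-pipeline
    rules:
      - alert: otlp_export_failures_sustained
        # Fires if any exporter keeps failing to send spans, metric points, or log records.
        expr: |
          sum by (exporter) (rate(otelcol_exporter_send_failed_spans[5m])) > 0
          or sum by (exporter) (rate(otelcol_exporter_send_failed_metric_points[5m])) > 0
          or sum by (exporter) (rate(otelcol_exporter_send_failed_log_records[5m])) > 0
        for: 10m
        labels:
          severity: critical
```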

Gateway Requests

| Alert | Trigger | Severity |
| --- | --- | --- |
| gateway_request_5xx_spike | 5xx ratio above 5% over 10 minutes with meaningful volume | critical |
| provider_error_spike | Per-provider error ratio above 10% over 10 minutes with meaningful volume | critical |
| stream_failure_spike | Streaming failure ratio above 10% over 10 minutes with meaningful volume | critical |
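
The "with meaningful volume" qualifier usually means the ratio is combined with a traffic floor, so that a single failed request on an idle gateway does not page anyone. A sketch of that pattern follows. The metric name gateway_requests_total, the status_class label, and the floor of 1 request per second are all hypothetical, since this page does not state them.

```yaml
groups:
  - name: gateway-requests
    rules:
      - alert: gateway_request_5xx_spike
        expr: |
          (
            sum(rate(gateway_requests_total{status_class="5xx"}[10m]))
            /
            sum(rate(gateway_requests_total[10m]))
          ) > 0.05
          and
          sum(rate(gateway_requests_total[10m])) > 1   # assumed volume floor: 1 req/s
        labels:
          severity: critical
```

The provider and streaming variants would follow the same shape, grouped by provider or filtered to streaming requests, with the threshold raised to 0.10.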

Logger Pipeline

| Alert | Trigger | Severity |
| --- | --- | --- |
| gateway_logs_dropped | Any increase in dropped log records for 5 minutes | warning |
| gateway_logs_write_errors | Any increase in log write errors for 5 minutes | critical |

Usage Collector

| Alert | Trigger | Severity |
| --- | --- | --- |
| usage_collector_flush_failures | Any increase in usage flush failures for 10 minutes | critical |
| usage_collector_enqueue_failures | Any increase in usage enqueue failures for 10 minutes | critical |

Data Stores

| Alert | Trigger | Severity |
| --- | --- | --- |
| postgres_readiness_degraded | Postgres exporter or pg_up down for 5 minutes | critical |
| redis_readiness_degraded | Redis exporter or redis_up down for 5 minutes | critical |
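
Each of these triggers has two halves: the exporter itself being unreachable, and the exporter reporting that the database is down. A sketch of how both can be combined is shown here. pg_up and redis_up are the gauges exposed by the common Postgres and Redis exporters. The job label values are assumptions about the scrape configuration.

```yaml
groups:
  - name: data-stores
    rules:
      - alert: postgres_readiness_degraded
        # Either the exporter cannot be scraped, or it reports Postgres as down.
        expr: up{job="postgres-exporter"} == 0 or pg_up == 0
        for: 5m
        labels:
          severity: critical
      - alert: redis_readiness_degraded
        expr: up{job="redis-exporter"} == 0 or redis_up == 0
        for: 5m
        labels:
          severity: critical
```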

Service Health

| Alert | Trigger | Severity |
| --- | --- | --- |
| service_metrics_missing | A service previously seen in the last hour stops emitting runtime metrics for at least 10 minutes | critical |
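
A "seen in the last hour but silent for 10 minutes" condition can be expressed by subtracting the recently seen services from those seen over the longer window. The sketch below assumes process_cpu_seconds_total as the runtime metric and job as the service label, both of which are stand-ins. It also needs a Prometheus version that supports present_over_time.

```yaml
groups:
  - name: service-health
    rules:
      - alert: service_metrics_missing
        expr: |
          group by (job) (present_over_time(process_cpu_seconds_total[1h]))
          unless
          group by (job) (present_over_time(process_cpu_seconds_total[10m]))
        labels:
          severity: critical
```

One consequence of this shape is that the alert clears by itself once a service has been silent for more than an hour, because it then drops out of the one-hour window.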

Token Volume Anomalies

| Alert | Trigger | Severity |
| --- | --- | --- |
| gateway_token_throughput_spike | 5-minute token rate above 3x the 1-hour baseline and above 500 tokens per second for 10 minutes | warning |
| gateway_organization_token_concentration | One organisation drives more than 60% of token throughput for 15 minutes | warning |
| gateway_api_key_token_concentration | One key drives more than 80% of its organisation's token throughput for 15 minutes | warning |
| gateway_token_request_size_extreme | P95 tokens per request above 50,000 for 10 minutes | warning |
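
The first two triggers combine a relative comparison with an absolute one. A sketch of both is below, assuming a hypothetical counter gateway_tokens_total with an organization label. The 5-minute window used for the concentration share is also an assumption, since the table only gives the 15-minute hold time.

```yaml
groups:
  - name: token-volume
    rules:
      - alert: gateway_token_throughput_spike
        # Short-window rate must exceed 3x the 1-hour baseline AND an absolute floor.
        expr: |
          sum(rate(gateway_tokens_total[5m])) > 3 * sum(rate(gateway_tokens_total[1h]))
          and
          sum(rate(gateway_tokens_total[5m])) > 500
        for: 10m
        labels:
          severity: warning
      - alert: gateway_organization_token_concentration
        # Each organisation's share of total token throughput.
        expr: |
          sum by (organization) (rate(gateway_tokens_total[5m]))
          / scalar(sum(rate(gateway_tokens_total[5m])))
          > 0.6
        for: 15m
        labels:
          severity: warning
```

The absolute floor of 500 tokens per second keeps the spike rule quiet on lightly loaded deployments, where tripling a small baseline is not noteworthy.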

What To Do When An Alert Fires

| Alert family | First dashboard or action |
| --- | --- |
| Infrastructure | Open Infrastructure dashboards and locate the failing node or service |
| OTel pipeline | Open Traces folder, then OTel Pipeline Health |
| Request 5xx | Open Gateway Request Dashboard |
| Provider | Open Provider Dashboard for the affected provider |
| Stream failure | Open Gateway Request Dashboard and filter to streaming traffic |
| Logger | Open Logger Health Dashboard |
| Usage collector | Open Usage / Budget Dashboard, then cross-check Redis and Postgres health |
| Token concentration | Open Token Usage Dashboard and drill into the offending organisation or key |

If you are an organisation user without alert-routing permissions, stop at evidence collection and hand the result to the deployment owner.

Tips

Token concentration alerts are an early signal of runaway workloads. Pair them with budgets and quotas where possible.

Critical alerts need configured contact points. Without delivery configuration, Alertmanager can accept alerts but no one will be notified.
