Alerts

The Prometheus alerting rules shipped with the stack.
Alerts are the proactive side of the LGTM stack. They turn degraded telemetry or degraded request health into a notification flow before users need to discover the issue manually.
In most deployments, platform operators own alert routing and contact points. Organisation users who have stack access are still often expected to read the alert state, understand the alert family, and bring the right evidence to the platform team.
The Prometheus alerting rules live under observability/prometheus/rules/alerts.yml. Delivery configuration is provisioned through Grafana alerting and Alertmanager configuration files.
| Alert | Trigger | Severity |
|---|---|---|
| high_cpu_usage | Host CPU above 85% for 10 minutes | warning |
| high_memory_usage | Host memory above 90% for 10 minutes | warning |
| container_restart_loop | More than 3 restarts in 15 minutes on a compose service | critical |
| service_down | Any core service down for 2 minutes | critical |

| Alert | Trigger | Severity |
|---|---|---|
| otlp_export_failures_sustained | Collector exporter failing to send spans, metrics, or log records for 10 minutes | critical |

| Alert | Trigger | Severity |
|---|---|---|
| gateway_request_5xx_spike | 5xx ratio above 5% over 10 minutes with meaningful volume | critical |
| provider_error_spike | Per-provider error ratio above 10% over 10 minutes with meaningful volume | critical |
| stream_failure_spike | Streaming failure ratio above 10% over 10 minutes with meaningful volume | critical |

| Alert | Trigger | Severity |
|---|---|---|
| gateway_logs_dropped | Any increase in dropped log records for 5 minutes | warning |
| gateway_logs_write_errors | Any increase in log write errors for 5 minutes | critical |

| Alert | Trigger | Severity |
|---|---|---|
| usage_collector_flush_failures | Any increase in usage flush failures for 10 minutes | critical |
| usage_collector_enqueue_failures | Any increase in usage enqueue failures for 10 minutes | critical |

| Alert | Trigger | Severity |
|---|---|---|
| postgres_readiness_degraded | Postgres exporter or pg_up down for 5 minutes | critical |
| redis_readiness_degraded | Redis exporter or redis_up down for 5 minutes | critical |

| Alert | Trigger | Severity |
|---|---|---|
| service_metrics_missing | A service previously seen in the last hour stops emitting runtime metrics for at least 10 minutes | critical |

| Alert | Trigger | Severity |
|---|---|---|
| gateway_token_throughput_spike | 5-minute token rate above 3x the 1-hour baseline and above 500 tokens per second for 10 minutes | warning |
| gateway_organization_token_concentration | One organisation drives more than 60% of token throughput for 15 minutes | warning |
| gateway_api_key_token_concentration | One key drives more than 80% of its organisation's token throughput for 15 minutes | warning |
| gateway_token_request_size_extreme | P95 tokens per request above 50,000 for 10 minutes | warning |
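Several of the triggers above pair a relative threshold with an absolute floor, which keeps low-traffic noise from paging anyone. The Python sketch below models that logic so the conditions are easier to reason about. It is not the shipped rule file: the function names are hypothetical, the `min_requests` floor is an assumption (the source only says "meaningful volume"), and `held_for` only approximates a hold duration as a count of consecutive evaluations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RatioRule:
    """A ratio trigger guarded by a minimum request volume."""
    max_ratio: float   # e.g. 0.05 for gateway_request_5xx_spike
    min_requests: int  # ASSUMPTION: the docs do not state the volume floor


def ratio_breached(rule: RatioRule, errors: int, total: int) -> bool:
    """True when the error ratio exceeds the threshold on meaningful volume."""
    if total == 0 or total < rule.min_requests:
        return False
    return errors / total > rule.max_ratio


def token_spike_breached(rate_5m: float, baseline_1h: float) -> bool:
    """gateway_token_throughput_spike: above 3x baseline AND above 500 tokens/s."""
    return rate_5m > 3 * baseline_1h and rate_5m > 500


def held_for(breaches: list[bool], required: int) -> bool:
    """Approximate a hold duration: the last `required` evaluations all breach."""
    return required > 0 and len(breaches) >= required and all(breaches[-required:])


if __name__ == "__main__":
    rule = RatioRule(max_ratio=0.05, min_requests=100)  # floor is illustrative
    print(ratio_breached(rule, errors=10, total=150))
    print(token_spike_breached(rate_5m=900, baseline_1h=200))
    print(held_for([False, True, True, True], required=3))
```

With these definitions, a 5-minute rate of 450 tokens per second against a baseline of 100 does not breach: it clears the 3x multiplier but not the 500 tokens-per-second floor.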
| Alert family | First dashboard or action |
|---|---|
| Infrastructure | Open Infrastructure dashboards and locate the failing node or service |
| OTel pipeline | Open Traces folder, then OTel Pipeline Health |
| Request 5xx | Open Gateway Request Dashboard |
| Provider | Open Provider Dashboard for the affected provider |
| Stream failure | Open Gateway Request Dashboard and filter to streaming traffic |
| Logger | Open Logger Health Dashboard |
| Usage collector | Open Usage / Budget Dashboard, then cross-check Redis and Postgres health |
| Token concentration | Open Token Usage Dashboard and drill into the offending organisation or key |
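If the triage table above needs to be embedded in a runbook script or chat bot, a plain lookup is usually enough. The sketch below is one hypothetical way to do that in Python. The family names and actions are copied from the table, while the `first_stop` helper and the fallback text are assumptions made for illustration.

```python
# Family names and first actions, copied from the triage table.
FIRST_STOP = {
    "Infrastructure": "Open Infrastructure dashboards and locate the failing node or service",
    "OTel pipeline": "Open Traces folder, then OTel Pipeline Health",
    "Request 5xx": "Open Gateway Request Dashboard",
    "Provider": "Open Provider Dashboard for the affected provider",
    "Stream failure": "Open Gateway Request Dashboard and filter to streaming traffic",
    "Logger": "Open Logger Health Dashboard",
    "Usage collector": "Open Usage / Budget Dashboard, then cross-check Redis and Postgres health",
    "Token concentration": "Open Token Usage Dashboard and drill into the offending organisation or key",
}

# ASSUMPTION: the fallback wording is illustrative, not taken from the stack.
FALLBACK = "Collect evidence and hand it to the platform team"

# Case-insensitive index so "logger" and "Logger" resolve to the same row.
_BY_KEY = {name.casefold(): action for name, action in FIRST_STOP.items()}


def first_stop(family: str) -> str:
    """Return the first dashboard or action for an alert family."""
    return _BY_KEY.get(family.strip().casefold(), FALLBACK)


if __name__ == "__main__":
    print(first_stop("logger"))
```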
If you are an organisation user without alert-routing permissions, stop at evidence collection and hand the result to the deployment owner.
Token concentration alerts are an early signal of runaway workloads. Pair them with budgets and quotas where possible.
Critical alerts need configured contact points. Without delivery configuration, Alertmanager can accept alerts but no one will be notified.
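One way to catch the silent-delivery case early is to check the routing tree for receivers that have no notification targets. The sketch below assumes the Alertmanager configuration has already been parsed into a Python dict (for example by a YAML loader) and follows the standard `route`, `routes`, `receivers`, and `*_configs` layout. The sample receiver names and webhook URL are hypothetical, and the check does not cover contact points provisioned through Grafana alerting.

```python
def referenced_receivers(route: dict) -> set[str]:
    """Collect every receiver name used anywhere in the routing tree."""
    names = set()
    if "receiver" in route:
        names.add(route["receiver"])
    for child in route.get("routes", []):
        names |= referenced_receivers(child)
    return names


def silent_receivers(config: dict) -> list[str]:
    """Receivers the routing tree uses that define no non-empty *_configs list."""
    has_target = {
        receiver["name"]: any(
            key.endswith("_configs") and value for key, value in receiver.items()
        )
        for receiver in config.get("receivers", [])
    }
    used = referenced_receivers(config.get("route", {}))
    # A name missing from `receivers` is treated as silent as well.
    return sorted(name for name in used if not has_target.get(name, False))


if __name__ == "__main__":
    # Hypothetical parsed configuration, for illustration only.
    sample = {
        "route": {
            "receiver": "default",
            "routes": [{"receiver": "oncall", "matchers": ['severity="critical"']}],
        },
        "receivers": [
            {"name": "default"},
            {"name": "oncall", "webhook_configs": [{"url": "http://example.invalid/hook"}]},
        ],
    }
    print(silent_receivers(sample))
```

In the sample, `default` is reported because it has a name but no targets, so any alert routed there would be accepted and then dropped without a notification.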