Understanding Alert Thresholds
Setting appropriate alert thresholds is crucial for effective monitoring. Too sensitive, and you'll suffer from alert fatigue. Too lenient, and you'll miss critical issues.
Two-Tier Alert System
Warning Alerts
Indicate potential issues that need attention but aren't immediately critical:
- Response: Investigate during business hours
- Escalation: Email, Slack notification
- Purpose: Early detection, trend analysis
- Example: Latency 1.5x normal, CPU at 70%
Critical Alerts
Indicate severe issues requiring immediate action:
- Response: Immediate investigation, 24/7
- Escalation: PagerDuty, phone call, SMS
- Purpose: Prevent/mitigate outages
- Example: Latency 2x normal, CPU at 90%
Threshold Strategies
Static Thresholds
Fixed values based on known limits or requirements:
Pros: Simple, predictable, easy to understand
Cons: Doesn't adapt to changing load patterns, may miss gradual degradation
Best for: Hard limits (disk space, memory), SLO targets
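As a minimal sketch of a static check (the metric names and limits here are illustrative, not recommendations for any particular system):

```python
# Static thresholds: fixed limits, independent of history.
# Metric names and percentages below are illustrative examples.
STATIC_LIMITS = {
    "disk_used_pct": {"warning": 80.0, "critical": 90.0},
    "memory_used_pct": {"warning": 75.0, "critical": 90.0},
}

def check_static(metric, value):
    """Return 'critical', 'warning', or None for a single metric sample."""
    limits = STATIC_LIMITS[metric]
    if value >= limits["critical"]:
        return "critical"
    if value >= limits["warning"]:
        return "warning"
    return None
```

The whole strategy is two comparisons against constants, which is exactly why it is predictable and easy to reason about.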
Dynamic Thresholds (Statistical)
Based on historical data and standard deviation:
Pros: Adapts to normal patterns, catches anomalies
Cons: More complex, requires historical data
Best for: Traffic patterns, latency, error rates
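A common statistical approach derives thresholds from the mean and standard deviation of recent history. A sketch, assuming warning at mean + 2 sigma and critical at mean + 3 sigma (the k values are illustrative and should be tuned per metric):

```python
import statistics

def dynamic_thresholds(history, k_warning=2.0, k_critical=3.0):
    """Derive thresholds as mean + k * stdev over historical samples.
    The k multipliers are assumptions; tune them per metric."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return mean + k_warning * stdev, mean + k_critical * stdev

def check_dynamic(value, history):
    """Classify a sample against thresholds learned from history."""
    warning, critical = dynamic_thresholds(history)
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return None
```

Because the thresholds move with the data, a value that is normal during peak traffic can still trigger an alert at 3 a.m. when the baseline is lower.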
Rate of Change
Alert on rapid changes rather than absolute values:
Pros: Catches sudden problems, works across scales
Cons: May miss slow degradation
Best for: Error rate spikes, traffic surges
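A rate-of-change check compares the per-second delta between two samples against rate limits rather than the values themselves. A sketch with hypothetical rate limits:

```python
def rate_of_change(previous, current, interval_seconds):
    """Per-second rate of change between two consecutive samples."""
    return (current - previous) / interval_seconds

def check_rate(previous, current, interval_seconds,
               warning_rate, critical_rate):
    """Alert on how fast a metric is moving, in either direction."""
    rate = abs(rate_of_change(previous, current, interval_seconds))
    if rate >= critical_rate:
        return "critical"
    if rate >= warning_rate:
        return "warning"
    return None
```

Note the absolute value: a sudden drop in traffic is often as alarming as a sudden spike, and this check catches both regardless of the metric's absolute scale.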
Sensitivity Levels
Low Sensitivity
- Warning: 2x baseline
- Critical: 3x baseline
- Use for: Stable systems, established services
- Benefit: Fewer false positives
- Risk: May miss gradual degradation
Medium Sensitivity
- Warning: 1.5x baseline
- Critical: 2x baseline
- Use for: Most production systems
- Benefit: Balanced approach
- Risk: Moderate alert volume
High Sensitivity
- Warning: 1.2x baseline
- Critical: 1.5x baseline
- Use for: Critical systems, new deployments
- Benefit: Early detection
- Risk: More false positives, alert fatigue
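The three sensitivity levels above reduce to a table of baseline multipliers, which can be sketched as:

```python
# Multipliers taken directly from the sensitivity levels above.
SENSITIVITY = {
    "low":    {"warning": 2.0, "critical": 3.0},
    "medium": {"warning": 1.5, "critical": 2.0},
    "high":   {"warning": 1.2, "critical": 1.5},
}

def thresholds_for(baseline, level="medium"):
    """Compute (warning, critical) thresholds from a baseline value."""
    m = SENSITIVITY[level]
    return baseline * m["warning"], baseline * m["critical"]
```

With a latency baseline of 100 ms, for example, medium sensitivity yields warning at 150 ms and critical at 200 ms.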
Metric-Specific Guidelines
Latency/Response Time
- Baseline: P95 or P99 latency
- Warning: 1.5x baseline
- Critical: 2x baseline or SLO violation
- Direction: Above threshold
Error Rate
- Baseline: Normal error rate (often < 0.1%)
- Warning: 50% of error budget consumed
- Critical: 100% of error budget or SLO violation
- Direction: Above threshold
CPU/Memory Usage
- Warning: 70-80% utilization
- Critical: 90% utilization
- Direction: Above threshold
- Note: Consider sustained usage (5+ minutes)
Throughput
- Warning: 30% below baseline
- Critical: 50% below baseline
- Direction: Below threshold
- Note: Account for time-of-day patterns
Best Practices
- Start conservative: Begin with low sensitivity and adjust based on experience
- Use duration: Require condition to persist (e.g., 5 minutes) to avoid flapping
- Consider context: Different thresholds for different times (peak vs off-peak)
- Document clearly: Explain why thresholds were chosen
- Review regularly: Adjust based on system changes and alert patterns
- Alert on symptoms: Not root causes (alert on slow response, not high CPU)
- Make alerts actionable: Include runbooks and context
- Track alert quality: Monitor false positive rates
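The "use duration" practice above can be sketched as a small stateful check that only fires after the condition holds for several consecutive evaluations (the count of 5 is illustrative, matching 5 checks at 1-minute intervals):

```python
class SustainedCondition:
    """Fire only after the condition holds for `required` consecutive
    checks, to avoid flapping alerts on transient spikes."""

    def __init__(self, required=5):
        self.required = required
        self.streak = 0

    def update(self, condition_met):
        """Feed one evaluation; returns True once the streak is long enough."""
        self.streak = self.streak + 1 if condition_met else 0
        return self.streak >= self.required
```

A single sample over threshold resets to a streak of one and stays silent; only sustained breaches page anyone.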
Avoiding Alert Fatigue
- Every alert should be actionable
- Critical alerts should wake someone up
- Warning alerts can wait for business hours
- Mute or fix noisy alerts immediately
- Use aggregation to reduce duplicate alerts
- Implement alert dependencies (don't alert on everything when DB is down)
- Regular alert audit and cleanup
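The aggregation point above can be sketched as grouping duplicate alerts by a key before notifying, so one noisy condition produces one page with a count instead of dozens (the `service`/`name` fields are assumed alert attributes, not from any specific tool):

```python
def aggregate_alerts(alerts):
    """Group duplicate alerts by (service, name) and report each
    group once with a count, reducing pager noise."""
    grouped = {}
    for alert in alerts:
        key = (alert["service"], alert["name"])
        grouped.setdefault(key, []).append(alert)
    return [
        {"service": service, "name": name, "count": len(items)}
        for (service, name), items in grouped.items()
    ]
```

The same grouping idea extends to dependencies: if the database-down group is firing, downstream groups keyed to services that depend on it can be suppressed rather than paged individually.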