Understanding Alert Thresholds
Setting appropriate alert thresholds is crucial for effective monitoring. Too sensitive, and you'll suffer from alert fatigue. Too lenient, and you'll miss critical issues.
Two-Tier Alert System
Warning Alerts
Indicate potential issues that need attention but aren't immediately critical:
- Response: Investigate during business hours
- Escalation: Email, Slack notification
- Purpose: Early detection, trend analysis
- Example: Latency 1.5x normal, CPU at 70%
Critical Alerts
Indicate severe issues requiring immediate action:
- Response: Immediate investigation, 24/7
- Escalation: PagerDuty, phone call, SMS
- Purpose: Prevent/mitigate outages
- Example: Latency 2x normal, CPU at 90%
Threshold Strategies
Static Thresholds
Fixed values based on known limits or requirements:
Pros: Simple, predictable, easy to understand
Cons: Doesn't adapt to changing load patterns, may miss gradual degradation
Best for: Hard limits (disk space, memory), SLO targets
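As a minimal sketch of a static check (the metric names and limits here are illustrative, not recommendations for any particular system):

```python
# Static thresholds: fixed limits, independent of history.
# Metric names and percentages below are illustrative examples.
STATIC_LIMITS = {
    "disk_used_pct": {"warning": 80.0, "critical": 90.0},
    "memory_used_pct": {"warning": 75.0, "critical": 90.0},
}

def check_static(metric, value):
    """Return 'critical', 'warning', or None for a single metric sample."""
    limits = STATIC_LIMITS[metric]
    if value >= limits["critical"]:
        return "critical"
    if value >= limits["warning"]:
        return "warning"
    return None
```

The whole strategy is two comparisons against constants, which is exactly why it is predictable and easy to reason about.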
Dynamic Thresholds (Statistical)
Based on historical data and standard deviation:
Pros: Adapts to normal patterns, catches anomalies
Cons: More complex, requires historical data
Best for: Traffic patterns, latency, error rates
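A common statistical approach derives thresholds from the mean and standard deviation of recent history. A sketch, assuming warning at mean + 2 sigma and critical at mean + 3 sigma (the k values are illustrative and should be tuned per metric):

```python
import statistics

def dynamic_thresholds(history, k_warning=2.0, k_critical=3.0):
    """Derive thresholds as mean + k * stdev over historical samples.
    The k multipliers are assumptions; tune them per metric."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return mean + k_warning * stdev, mean + k_critical * stdev

def check_dynamic(value, history):
    """Classify a sample against thresholds learned from history."""
    warning, critical = dynamic_thresholds(history)
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return None
```

Because the thresholds move with the data, a value that is normal during peak traffic can still trigger an alert at 3 a.m. when the baseline is lower.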
Rate of Change
Alert on rapid changes rather than absolute values:
Pros: Catches sudden problems, works across scales
Cons: May miss slow degradation
Best for: Error rate spikes, traffic surges
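A rate-of-change check compares the per-second delta between two samples against rate limits rather than the values themselves. A sketch with hypothetical rate limits:

```python
def rate_of_change(previous, current, interval_seconds):
    """Per-second rate of change between two consecutive samples."""
    return (current - previous) / interval_seconds

def check_rate(previous, current, interval_seconds,
               warning_rate, critical_rate):
    """Alert on how fast a metric is moving, in either direction."""
    rate = abs(rate_of_change(previous, current, interval_seconds))
    if rate >= critical_rate:
        return "critical"
    if rate >= warning_rate:
        return "warning"
    return None
```

Note the absolute value: a sudden drop in traffic is often as alarming as a sudden spike, and this check catches both regardless of the metric's absolute scale.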
Sensitivity Levels
Low Sensitivity
- Warning: 2x baseline
- Critical: 3x baseline
- Use for: Stable systems, established services
- Benefit: Fewer false positives
- Risk: May miss gradual degradation
Medium Sensitivity
- Warning: 1.5x baseline
- Critical: 2x baseline
- Use for: Most production systems
- Benefit: Balanced approach
- Risk: Moderate alert volume
High Sensitivity
- Warning: 1.2x baseline
- Critical: 1.5x baseline
- Use for: Critical systems, new deployments
- Benefit: Early detection
- Risk: More false positives, alert fatigue
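The three sensitivity levels above reduce to a table of baseline multipliers, which can be sketched as:

```python
# Multipliers taken directly from the sensitivity levels above.
SENSITIVITY = {
    "low":    {"warning": 2.0, "critical": 3.0},
    "medium": {"warning": 1.5, "critical": 2.0},
    "high":   {"warning": 1.2, "critical": 1.5},
}

def thresholds_for(baseline, level="medium"):
    """Compute (warning, critical) thresholds from a baseline value."""
    m = SENSITIVITY[level]
    return baseline * m["warning"], baseline * m["critical"]
```

With a latency baseline of 100 ms, for example, medium sensitivity yields warning at 150 ms and critical at 200 ms.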
Metric-Specific Guidelines
Latency/Response Time
- Baseline: P95 or P99 latency
- Warning: 1.5x baseline
- Critical: 2x baseline or SLO violation
- Direction: Above threshold
Error Rate
- Baseline: Normal error rate (often < 0.1%)
- Warning: 50% of error budget consumed
- Critical: 100% of error budget or SLO violation
- Direction: Above threshold
CPU/Memory Usage
- Warning: 70-80% utilization
- Critical: 90% utilization
- Direction: Above threshold
- Note: Consider sustained usage (5+ minutes)
Throughput
- Warning: 30% below baseline
- Critical: 50% below baseline
- Direction: Below threshold
- Note: Account for time-of-day patterns
Best Practices
- Start conservative: Begin with low sensitivity and adjust based on experience
- Use duration: Require condition to persist (e.g., 5 minutes) to avoid flapping
- Consider context: Different thresholds for different times (peak vs off-peak)
- Document clearly: Explain why thresholds were chosen
- Review regularly: Adjust based on system changes and alert patterns
- Alert on symptoms: Not root causes (alert on slow response, not high CPU)
- Make alerts actionable: Include runbooks and context
- Track alert quality: Monitor false positive rates
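The "use duration" practice above can be sketched as a small stateful check that only fires after the condition holds for several consecutive evaluations (the count of 5 is illustrative, matching 5 checks at 1-minute intervals):

```python
class SustainedCondition:
    """Fire only after the condition holds for `required` consecutive
    checks, to avoid flapping alerts on transient spikes."""

    def __init__(self, required=5):
        self.required = required
        self.streak = 0

    def update(self, condition_met):
        """Feed one evaluation; returns True once the streak is long enough."""
        self.streak = self.streak + 1 if condition_met else 0
        return self.streak >= self.required
```

A single sample over threshold resets to a streak of one and stays silent; only sustained breaches page anyone.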
Avoiding Alert Fatigue
- Every alert should be actionable
- Critical alerts should wake someone up
- Warning alerts can wait for business hours
- Mute or fix noisy alerts immediately
- Use aggregation to reduce duplicate alerts
- Implement alert dependencies (don't alert on everything when DB is down)
- Regular alert audit and cleanup
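The aggregation point above can be sketched as grouping duplicate alerts by a key before notifying, so one noisy condition produces one page with a count instead of dozens (the `service`/`name` fields are assumed alert attributes, not from any specific tool):

```python
def aggregate_alerts(alerts):
    """Group duplicate alerts by (service, name) and report each
    group once with a count, reducing pager noise."""
    grouped = {}
    for alert in alerts:
        key = (alert["service"], alert["name"])
        grouped.setdefault(key, []).append(alert)
    return [
        {"service": service, "name": name, "count": len(items)}
        for (service, name), items in grouped.items()
    ]
```

The same grouping idea extends to dependencies: if the database-down group is firing, downstream groups keyed to services that depend on it can be suppressed rather than paged individually.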