Alert configuration is where most monitoring setups go wrong. Engineers spend time setting up monitors and then configure alerts with default settings — which almost always produces too much noise. After a week of false alarms, the team starts ignoring alerts, and when a real incident hits, nobody notices. Here's how to configure alerts that actually work.

The Alert Quality Litmus Test

Before configuring any alert, answer three questions:

1. If this fires at 3 AM, what exact action should the on-call engineer take?
2. Does this alert firing directly indicate user impact?
3. Will this alert fire for transient, self-resolving issues?

If you can't answer question 1 clearly, the alert shouldn't page. If the answer to question 2 is "not necessarily," reconsider whether it should be a critical alert. If the answer to question 3 is "probably yes," add duration requirements so the condition must hold for several minutes before the alert fires.
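
As a rough illustration, the three answers can be turned into a routing suggestion. This is only a sketch: the `AlertReview` and `recommend_routing` names and the mapping to "info", "warning", and "critical" are assumptions made for this example, not part of any monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class AlertReview:
    """Answers to the three litmus-test questions for one proposed alert."""
    has_clear_action: bool        # Q1: is there an exact action to take at 3 AM?
    indicates_user_impact: bool   # Q2: does firing directly indicate user impact?
    fires_on_transients: bool     # Q3: will it fire for self-resolving issues?


def recommend_routing(review: AlertReview) -> dict:
    """Map litmus-test answers to a suggested routing (illustrative rules only)."""
    if not review.has_clear_action:
        severity = "info"          # no clear action: the alert should not page
    elif not review.indicates_user_impact:
        severity = "warning"       # reconsider before treating it as critical
    else:
        severity = "critical"
    return {
        "severity": severity,
        "page_on_call": severity == "critical",
        "needs_duration_requirement": review.fires_on_transients,
    }
```

Running a proposed alert through a check like this before it reaches production makes the "should this page?" decision explicit instead of leaving it to defaults.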

Setting Alert Thresholds

Thresholds should be based on data, not guesswork.

Step 1: Establish Baselines

Run your monitors for 2-4 weeks before configuring alerts. Collect baseline data:

# Analyze baseline to set thresholds
def calculate_alert_thresholds(historical_data):
    """
    Calculate alert thresholds based on historical performance.

    Percentiles are read by index from the sorted samples.
    Suggested thresholds sit well above the p95/p99 baselines,
    so alerts fire rarely under normal conditions.
    """
    if not historical_data:
        raise ValueError("historical_data must contain at least one sample")

    values = sorted(historical_data)
    n = len(values)

    # int(n * p) is always < n for p < 1, so these indexes stay in range
    p50 = values[int(n * 0.50)]
    p90 = values[int(n * 0.90)]
    p95 = values[int(n * 0.95)]
    p99 = values[int(n * 0.99)]

    return {
        "p50_baseline": p50,
        "p90_baseline": p90,
        "p95_baseline": p95,
        "p99_baseline": p99,
        "suggested_warning_threshold": p95 * 1.5,   # 50% above p95
        "suggested_critical_threshold": p99 * 2.0,   # 2x p99
    }

# Example output for a 200ms median service (p90_baseline omitted for brevity):
# {
#   "p50_baseline": 200,
#   "p95_baseline": 500,
#   "p99_baseline": 800,
#   "suggested_warning_threshold": 750,
#   "suggested_critical_threshold": 1600
# }
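
Before enabling a suggested threshold, it can help to replay it against the same baseline data and see how often it would have been breached. The helper below is a minimal sketch; `backtest_threshold` is a hypothetical name, and it assumes the samples are in chronological order.

```python
def backtest_threshold(samples, threshold):
    """Return how often `threshold` was exceeded and the longest run of breaches."""
    if not samples:
        raise ValueError("samples must not be empty")
    breaches = 0
    longest_run = 0
    current_run = 0
    for value in samples:
        if value > threshold:
            breaches += 1
            current_run += 1
            longest_run = max(longest_run, current_run)
        else:
            current_run = 0
    return {
        "breach_fraction": breaches / len(samples),
        "longest_consecutive_breaches": longest_run,
    }
```

If the longest run of breaches in a healthy baseline is already longer than the duration you plan to require, the threshold would likely have paged during normal operation and is worth raising.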

Step 2: Add Hysteresis

Set different thresholds for opening and closing alerts to prevent flapping:

alert:
  name: "API Response Time High"
  
  # Alert opens when: p95 > 1500ms for 5 consecutive minutes
  condition:
    metric: response_time_p95
    operator: greater_than
    threshold: 1500
    for: 5m
    
  # Alert closes when: p95 drops below 800ms for 3 minutes
  recovery:
    metric: response_time_p95
    operator: less_than
    threshold: 800
    for: 3m

Without hysteresis, an alert will flap around the threshold — firing and resolving repeatedly as the metric oscillates.
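
To make the mechanism concrete, here is a small state machine that applies separate open and close thresholds. It is a sketch under assumptions: the `HysteresisAlert` class is invented for this example, and it expects one sample per minute so that "for 5m" becomes five consecutive samples.

```python
class HysteresisAlert:
    """Open above `open_threshold` for `open_for` consecutive samples;
    close below `close_threshold` for `close_for` consecutive samples."""

    def __init__(self, open_threshold, close_threshold, open_for, close_for):
        if close_threshold >= open_threshold:
            raise ValueError("close_threshold must be below open_threshold")
        self.open_threshold = open_threshold
        self.close_threshold = close_threshold
        self.open_for = open_for
        self.close_for = close_for
        self.is_open = False
        self._streak = 0  # consecutive samples supporting a state change

    def observe(self, value):
        """Feed one sample (for example one per minute); return the current state."""
        if not self.is_open:
            self._streak = self._streak + 1 if value > self.open_threshold else 0
            if self._streak >= self.open_for:
                self.is_open, self._streak = True, 0
        else:
            self._streak = self._streak + 1 if value < self.close_threshold else 0
            if self._streak >= self.close_for:
                self.is_open, self._streak = False, 0
        return self.is_open


alert = HysteresisAlert(open_threshold=1500, close_threshold=800, open_for=5, close_for=3)
states = [alert.observe(v) for v in [1600] * 5 + [1400, 1000] + [700] * 3]
```

In this run the alert opens on the fifth sample above 1500, stays open while the metric sits between 800 and 1500, and closes only after three samples below 800.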

Consecutive Failure Requirements

For availability monitoring, require multiple consecutive failures before alerting:

# Single failure tolerance (reduce false positives from transient issues)
monitor:
  name: "Payment API"
  url: "https://api.example.com/health"
  
  alert_after:
    consecutive_failures: 2  # Alert only after 2 consecutive failures
    # With a 60s check interval, the alert fires 1-2 minutes after the failure begins
    
  recovery_after:
    consecutive_successes: 2  # Recover alert after 2 consecutive successes

This prevents alerts for single network timeouts or momentary blips while ensuring sustained failures are caught.
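
The counting logic behind a setting like this is simple enough to sketch. The `ConsecutiveFailureTracker` class below is hypothetical and only illustrates the idea; a real monitoring service would track this per monitor and usually per region.

```python
from typing import Optional


class ConsecutiveFailureTracker:
    """Alert after N consecutive failures; recover after M consecutive successes."""

    def __init__(self, failures_to_alert=2, successes_to_recover=2):
        self.failures_to_alert = failures_to_alert
        self.successes_to_recover = successes_to_recover
        self.alerting = False
        self._failures = 0
        self._successes = 0

    def record(self, check_passed: bool) -> Optional[str]:
        """Record one check result; return "alert", "recovered", or None."""
        if check_passed:
            self._successes += 1
            self._failures = 0
        else:
            self._failures += 1
            self._successes = 0
        if not self.alerting and self._failures >= self.failures_to_alert:
            self.alerting = True
            return "alert"
        if self.alerting and self._successes >= self.successes_to_recover:
            self.alerting = False
            return "recovered"
        return None


tracker = ConsecutiveFailureTracker()
events = [tracker.record(ok) for ok in [True, False, True, False, False, True, True]]
```

The isolated failure in the second check produces no event; only the two back-to-back failures trigger an alert, and two successes in a row clear it.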

Severity-Based Alert Configuration

Match alert severity to actual impact:

alerts:
  # P1 - Immediate page, wake people up
  critical:
    conditions:
      - metric: availability
        operator: less_than
        value: 0.90  # Less than 90% available
        for: 2m
      - metric: error_rate
        operator: greater_than
        value: 0.10  # More than 10% errors
        for: 3m
    channels:
      - type: pagerduty
        integration_key: "PD_P1_KEY"
        severity: critical
      - type: slack
        webhook: "WEBHOOK_URL"
        channel: "#incidents"
    escalation:
      timeout_minutes: 5
      next_contact: "secondary-oncall"
  
  # P2 - Push notification, investigate today
  warning:
    conditions:
      - metric: response_time_p95
        operator: greater_than
        value: 2000
        for: 10m
      - metric: error_rate
        operator: greater_than
        value: 0.02  # More than 2% errors
        for: 10m
    channels:
      - type: pagerduty
        integration_key: "PD_P2_KEY"
        severity: warning
      - type: slack
        webhook: "WEBHOOK_URL"
        channel: "#alerts"
  
  # P3 - Slack notification, review this week
  info:
    conditions:
      - metric: response_time_p95
        operator: greater_than
        value: 1000
        for: 30m
    channels:
      - type: slack
        webhook: "WEBHOOK_URL"
        channel: "#monitoring"
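
One way to read the configuration above is as an ordered rule list where the most severe matching tier wins. The sketch below assumes that conditions within a tier are OR-ed (the YAML does not say) and that each metric is available as per-minute samples; `RULES` and `classify` are names made up for this example.

```python
import operator

# (severity, metric, comparison, threshold, minutes), most severe first.
# Values mirror the YAML above.
RULES = [
    ("critical", "availability", operator.lt, 0.90, 2),
    ("critical", "error_rate", operator.gt, 0.10, 3),
    ("warning", "response_time_p95", operator.gt, 2000, 10),
    ("warning", "error_rate", operator.gt, 0.02, 10),
    ("info", "response_time_p95", operator.gt, 1000, 30),
]


def classify(history):
    """Return the most severe tier whose condition held for its full duration.

    `history` maps a metric name to per-minute samples, oldest first.
    Returns None when no rule matches.
    """
    for severity, metric, breached, threshold, minutes in RULES:
        window = history.get(metric, [])[-minutes:]
        if len(window) == minutes and all(breached(v, threshold) for v in window):
            return severity
    return None
```

For example, `classify({"error_rate": [0.03] * 7 + [0.12] * 3})` returns `"critical"` because the last three minutes all exceed 10% errors, even though the warning condition was also met.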

Multi-Channel Alert Configuration

Configure multiple channels to complement each other:

# P1 Alert - comprehensive notification strategy
p1_alert:
  channels:
    # Primary: Wake on-call
    - type: pagerduty
      urgency: high
      integration_key: "PD_KEY"
      
    # Backup: SMS if PD fails
    - type: sms
      recipients: ["[email protected]"]
      
    # Visibility: Team awareness
    - type: slack
      channel: "#incidents"
      message: |
        🚨 CRITICAL: {{monitor_name}} is DOWN
        URL: {{monitor_url}}
        From: {{failed_regions|join(", ")}}
        Started: {{incident_start}}
        Dashboard: {{dashboard_link}}
        Runbook: {{runbook_link}}
        
    # External: Customer communication trigger
    - type: webhook
      url: "https://api.statuspage.io/v1/pages/PAGE_ID/incidents"
      headers:
        Authorization: "OAuth YOUR_TOKEN"
      body_template: |
        {
          "incident": {
            "name": "Investigating {{monitor_name}} issues",
            "status": "investigating",
            "impact_override": "major"
          }
        }
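
The "backup if the primary fails" idea can be sketched as an ordered list of senders. This is an illustration only: `notify_with_fallback` is a hypothetical helper, and each sender is assumed to be a plain callable that raises an exception on failure.

```python
def notify_with_fallback(message, senders):
    """Try each (name, send) pair in order; stop at the first that succeeds.

    Returns the name of the channel that delivered the message and the
    list of (name, error) pairs for channels that failed before it.
    """
    errors = []
    for name, send in senders:
        try:
            send(message)
            return name, errors
        except Exception as exc:  # any delivery failure moves on to the next channel
            errors.append((name, str(exc)))
    raise RuntimeError(f"all channels failed: {errors}")
```

Visibility channels such as Slack would normally be sent in addition to this chain rather than as part of it, since they complement the page instead of replacing it.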

Configuring Time-Based Alert Behavior

Some alerts should behave differently outside business hours:

# Business hours vs after-hours alert routing
alert:
  name: "Cache Hit Rate Low"
  condition: "cache_hit_rate < 60% for 20 minutes"
  
  routing:
    business_hours:  # 09:00-18:00 weekdays
      channels:
        - slack: "#engineering"
      severity: warning
      
    after_hours:  # Evenings and weekends
      # Not critical enough to wake anyone up
      # Hold notification until business hours
      channels:
        - slack: "#monitoring"
      severity: info
      hold_until_business_hours: true

Not all issues warrant waking people up. Non-critical alerts that can wait until morning should be held.
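
Holding a notification comes down to computing the next business-hours start. The sketch below assumes the 09:00-18:00 weekday window from the config, naive datetimes already in the team's local time, and no public holidays; `is_business_hours` and `delivery_time` are hypothetical names.

```python
from datetime import datetime, time, timedelta

BUSINESS_START = time(9, 0)
BUSINESS_END = time(18, 0)


def is_business_hours(moment: datetime) -> bool:
    """True on Monday-Friday between 09:00 (inclusive) and 18:00 (exclusive)."""
    return moment.weekday() < 5 and BUSINESS_START <= moment.time() < BUSINESS_END


def delivery_time(fired_at: datetime) -> datetime:
    """Return when a held alert should be delivered."""
    if is_business_hours(fired_at):
        return fired_at
    candidate = fired_at.replace(hour=9, minute=0, second=0, microsecond=0)
    if fired_at.time() >= BUSINESS_START:
        candidate += timedelta(days=1)   # already past today's 09:00
    while candidate.weekday() >= 5:      # skip Saturday and Sunday
        candidate += timedelta(days=1)
    return candidate
```

An alert that fires on a Friday evening is held until 09:00 the following Monday, while one that fires mid-morning on a weekday is delivered immediately.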

SSL Certificate Alert Configuration

SSL alerts require different timing — they're not about current availability but about upcoming expiry:

ssl_alerts:
  name: "SSL Certificate Expiry"
  hostname: "api.example.com"
  
  alert_schedule:
    - days_before_expiry: 60
      severity: info
      channels: ["#engineering-weekly-digest"]
      send_once: true  # Don't repeat daily
      
    - days_before_expiry: 30
      severity: warning
      channels: ["#sre-alerts"]
      send_once_per_week: true
      
    - days_before_expiry: 14
      severity: warning
      channels: ["#sre-alerts"]
      send_daily: true
      
    - days_before_expiry: 7
      severity: critical
      channels: ["pagerduty", "#incidents"]
      send_daily: true
      
    - days_before_expiry: 3
      severity: critical
      channels: ["pagerduty", "sms", "#incidents"]
      send_every_6_hours: true
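
The schedule above amounts to picking the most urgent tier that the remaining lifetime has crossed. Here is a minimal sketch of that lookup; `TIERS` and `expiry_severity` are names invented for this example, and the certificate's expiry time is assumed to be available as a timezone-aware datetime.

```python
from datetime import datetime, timezone

# (days_before_expiry, severity), most urgent first; mirrors the schedule above
TIERS = [(3, "critical"), (7, "critical"), (14, "warning"), (30, "warning"), (60, "info")]


def expiry_severity(not_after, now=None):
    """Return the (days_before_expiry, severity) tier that applies, or None.

    An already expired certificate falls into the most urgent tier.
    """
    now = now or datetime.now(timezone.utc)
    days_left = (not_after - now).days
    for days, severity in TIERS:
        if days_left <= days:
            return days, severity
    return None
```

Returning the tier's day count along with the severity lets the caller look up the matching repeat interval and channels for that tier.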

Maintenance Window Configuration

Suppress alerts during planned maintenance to prevent false alarms:

# Create maintenance window via API
import os
from datetime import datetime, timedelta, timezone

import requests

def create_maintenance_window(
    monitor_ids,
    start_time,
    end_time,
    description,
    api_key
):
    """Create maintenance window to suppress alerts"""
    response = requests.post(
        "https://api.azmonitor.com/v1/maintenance",
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "monitor_ids": monitor_ids,
            "start_time": start_time.isoformat(),
            "end_time": end_time.isoformat(),
            "description": description,
            "notify_team": True  # Notify team that maintenance is scheduled
        },
        timeout=10,  # Don't let a hung API call block the deploy
    )
    response.raise_for_status()  # Fail loudly instead of parsing an error body
    return response.json()

# Add to your CI/CD pipeline
now = datetime.now(timezone.utc)  # Timezone-aware, so isoformat() includes the UTC offset

window = create_maintenance_window(
    monitor_ids=["payment-api", "user-api"],
    start_time=now,
    end_time=now + timedelta(hours=1),
    description="Deploying payment-service v2.5.0",
    api_key=os.environ["AZMONITOR_API_KEY"]
)

print(f"Maintenance window created: {window['id']}")
# ... deploy here ...
# Window auto-expires, alerts resume automatically
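
On the receiving side, suppression is a check of whether a failure falls inside any active window for that monitor. The sketch below is an assumption about how such a check could look, not a description of AzMonitor's internals; `is_suppressed` and the window dictionary shape are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def is_suppressed(monitor_id, at, windows):
    """True if `at` falls inside a maintenance window covering `monitor_id`.

    The end time is exclusive, so alerts resume the moment a window expires.
    """
    return any(
        monitor_id in w["monitor_ids"] and w["start_time"] <= at < w["end_time"]
        for w in windows
    )


start = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
windows = [{
    "monitor_ids": ["payment-api", "user-api"],
    "start_time": start,
    "end_time": start + timedelta(hours=1),
}]
```

With this window, a `payment-api` failure at 12:30 UTC is suppressed, while the same failure at 13:00 UTC or a failure on an unlisted monitor alerts normally.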

Testing Alert Configurations

Never assume your alerts work — test them:

# Test 1: Test each notification channel
# Use your monitoring tool's "Send test alert" feature
# Verify receipt in: Slack, email, PagerDuty

# Test 2: Test with a real failure
# Temporarily make your monitored URL return an error
# Verify alert fires within expected time
# Verify it reaches expected channels
# Verify it resolves when URL recovers

# Test 3: Test escalation
# Don't acknowledge the test alert
# Verify escalation fires after configured timeout
# Verify escalation reaches secondary contact

# Test 4: Test maintenance windows
# Create maintenance window
# Trigger failure during window
# Verify NO alert fires during window
# Allow window to expire
# Trigger failure again
# Verify alert fires normally

Alert Message Templates

Well-crafted alert messages accelerate incident response:

# Include actionable context in every alert
alert_templates:
  critical:
    title: "[CRITICAL] {{monitor_name}} DOWN — {{failed_regions}} failing"
    body: |
      Service {{monitor_name}} is DOWN.
      
      Failing from: {{failed_regions|join(", ")}}
      First failure: {{incident_start_time}} UTC
      URL: {{monitor_url}}
      
      Recent response: {{status_code}} in {{response_time}}ms
      
      Dashboard: {{dashboard_url}}
      Runbook: {{runbook_url}}
      
      Acknowledge: {{acknowledge_url}}
      
  warning:
    title: "[WARNING] {{monitor_name}} — {{metric}} elevated"
    body: |
      Service {{monitor_name}} is degraded.
      
      Metric: {{metric}} = {{current_value}} (threshold: {{threshold}})
      Duration: {{duration_minutes}} minutes
      
      Dashboard: {{dashboard_url}}
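
To preview how a template will read before a real incident, a small renderer is enough. The sketch below handles only the two placeholder forms used above, `{{name}}` and `{{name|join(", ")}}`; it is not the template engine of any particular tool, and `render` is a hypothetical helper.

```python
import re

# Matches {{name}} and {{name|join("separator")}}
PLACEHOLDER = re.compile(r'\{\{\s*(\w+)(?:\|join\("([^"]*)"\))?\s*\}\}')


def render(template, context):
    """Substitute placeholders in `template` with values from `context`."""
    def replace(match):
        name, separator = match.group(1), match.group(2)
        if name not in context:
            return match.group(0)  # leave unknown placeholders visible
        value = context[name]
        if separator is not None:
            return separator.join(str(item) for item in value)
        return str(value)
    return PLACEHOLDER.sub(replace, template)


preview = render(
    'Failing from: {{failed_regions|join(", ")}}',
    {"failed_regions": ["us-east", "eu-west"]},
)
```

Leaving unknown placeholders untouched makes a missing variable obvious in the preview instead of silently producing an empty field in a 3 AM page.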

Common Alert Configuration Mistakes

| Mistake | Impact | Fix |
|---|---|---|
| No consecutive failure requirement | Constant false positives | Require 2-3 consecutive failures |
| Single region checks | Alerts on monitoring provider issues | Use 2+ region confirmation |
| No hysteresis | Alert flapping | Set different open/close thresholds |
| All channels same for all severities | On-call burnout | Route by severity |
| No maintenance windows | Alerts during deployments | Automate maintenance window creation |
| No recovery notifications | Can't tell when issue resolved | Enable "all clear" notifications |
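
The "2+ region confirmation" fix from the table can be sketched in a few lines. The `confirmed_down` function and the region names are hypothetical; the input is assumed to be the latest check result per region.

```python
def confirmed_down(region_results, min_failing_regions=2):
    """Return (is_confirmed, failing_regions) for one round of checks.

    `region_results` maps a region name to True (check passed) or False.
    """
    failing = sorted(region for region, ok in region_results.items() if not ok)
    return len(failing) >= min_failing_regions, failing
```

A single failing region, which often points to a problem on the monitoring side or a local network path, is reported but does not count as a confirmed outage.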

Conclusion

Alert configuration is an iterative process. Start conservative (higher thresholds, consecutive failure requirements, multi-region confirmation), measure alert quality weekly, and tune based on what you observe. The goal is alerts that fire when something actually needs human attention and never fire for issues that resolve themselves. AzMonitor's alert configuration provides all the knobs needed to tune your monitoring precisely — consecutive failure requirements, multi-region confirmation, time-based routing, and per-channel severity configuration — so you can build a monitoring setup your team actually trusts.

Tags: alert configuration, monitoring alerts, alerting, on-call


AzMonitor Team

The AzMonitor team writes guides based on experience monitoring millions of endpoints daily across 10,000+ customer environments. Our expertise covers uptime monitoring, SRE practices, and reliability engineering.


How to Configure Monitoring Alerts: A Practical Guide