Setting up monitoring sounds simple — pick a tool, enter your URL, done. But monitoring configured carelessly is almost as useless as no monitoring at all. You'll get false alarms at 3 AM, miss real failures, and eventually stop trusting your alerts. This guide walks through setting up monitoring properly — deciding what to monitor, configuring it correctly, setting up meaningful alerts, and creating a status page.
Step 1: Decide What to Monitor
Before configuring anything, map out what matters. For a typical web application:
Critical paths (must monitor):
- Homepage / landing page
- Login / authentication endpoint
- Core product functionality (checkout, search, main dashboard)
- Payment processing API
- Primary API endpoints your customers rely on
Important but not critical (should monitor):
- Admin dashboard
- API documentation
- Email/webhook delivery
- Background job health
Nice to have (optional):
- Marketing pages
- Secondary APIs
- Analytics endpoints
Prioritize critical paths first. It's better to monitor 5 things well than 50 things poorly.
# Monitoring priority list example
critical:
  - url: "https://app.example.com"
    name: "Main Application"
    interval: 60s
  - url: "https://api.example.com/health"
    name: "API Health"
    interval: 60s
  - url: "https://api.example.com/auth/login"
    name: "Authentication"
    method: POST
    interval: 60s
  - url: "https://api.example.com/checkout"
    name: "Checkout API"
    method: POST
    interval: 60s
important:
  - url: "https://app.example.com/dashboard"
    name: "Dashboard"
    interval: 3min
  - url: "https://api.example.com/users"
    name: "Users API"
    interval: 5min
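The priority list mixes interval units (`60s`, `3min`, `5min`). As a minimal sketch, assuming a hypothetical helper named `interval_to_seconds` and an assumed set of suffixes (not AzMonitor's documented syntax), the following Python normalizes the intervals and orders monitors so critical ones come first:

```python
import re

# Assumed unit suffixes for illustration; a real tool may accept others.
_UNIT_SECONDS = {"s": 1, "min": 60, "h": 3600}


def interval_to_seconds(interval: str) -> int:
    """Convert strings such as '60s' or '3min' into whole seconds."""
    match = re.fullmatch(r"(\d+)\s*(s|min|h)", interval.strip())
    if match is None:
        raise ValueError(f"unrecognized interval: {interval!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_SECONDS[unit]


# A few entries from the priority list, in arbitrary order.
monitors = [
    {"name": "Dashboard", "tier": "important", "interval": "3min"},
    {"name": "API Health", "tier": "critical", "interval": "60s"},
    {"name": "Users API", "tier": "important", "interval": "5min"},
]

# Critical monitors first, then by how frequently they run.
TIER_ORDER = {"critical": 0, "important": 1, "optional": 2}
schedule = sorted(
    monitors,
    key=lambda m: (TIER_ORDER[m["tier"]], interval_to_seconds(m["interval"])),
)

for m in schedule:
    print(m["name"], interval_to_seconds(m["interval"]))
```

Normalizing to seconds up front makes it harder for a typo such as `5m` to silently become a different interval, because unknown suffixes raise an error.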
Step 2: Configure Your First Monitor
Let's start with the most common case — an HTTP/HTTPS monitor:
Basic HTTP Monitor Setup
In AzMonitor, create your first monitor:
- Click Add Monitor
- Select HTTP/HTTPS
- Configure:
name: "Main Website"
url: "https://example.com"
method: GET
interval: 60 # Check every minute
timeout: 30000 # 30 second timeout
Add Response Assertions
Don't just check that the server responds — verify it responds correctly:
assertions:
  # Must return 200 OK
  - type: status_code
    operator: equals
    value: 200
  # Must respond within 3 seconds
  - type: response_time
    operator: less_than
    value: 3000
  # Must contain expected content (catches broken deploys)
  - type: response_body
    operator: contains
    value: "Example Company"
The content assertion is particularly valuable — it catches scenarios where your server returns 200 but with an error page or completely wrong content.
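To make the logic concrete, here is a minimal Python sketch of how the three assertion types could be evaluated against a response. The `CheckResult` class and `evaluate_assertions` function are hypothetical names for illustration, not part of any monitoring tool's API.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    status_code: int
    response_time_ms: float
    body: str


def evaluate_assertions(result: CheckResult, assertions: list[dict]) -> list[str]:
    """Return a human-readable message for every assertion that fails."""
    failures = []
    for a in assertions:
        kind, op, expected = a["type"], a["operator"], a["value"]
        if kind == "status_code" and op == "equals":
            ok = result.status_code == expected
        elif kind == "response_time" and op == "less_than":
            ok = result.response_time_ms < expected
        elif kind == "response_body" and op == "contains":
            ok = expected in result.body
        else:
            raise ValueError(f"unsupported assertion: {kind} {op}")
        if not ok:
            failures.append(f"{kind} {op} {expected!r} failed")
    return failures


assertions = [
    {"type": "status_code", "operator": "equals", "value": 200},
    {"type": "response_time", "operator": "less_than", "value": 3000},
    {"type": "response_body", "operator": "contains", "value": "Example Company"},
]

# A fast 200 response that actually carries an error page.
error_page = CheckResult(200, 120.0, "<h1>Something went wrong</h1>")
print(evaluate_assertions(error_page, assertions))
```

In this example the status and timing assertions pass, and only the content assertion reports a failure, which is exactly the broken-deploy case a status-only check would miss.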
Configure Monitoring Regions
Always use multiple regions to avoid false positives:
regions:
- us-east-1
- eu-west-1
- ap-southeast-1
# Alert only when 2+ regions confirm the failure
alert_after_regions: 2
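A single-region check will page you whenever the monitoring provider's probe has a network hiccup. Requiring agreement from multiple regions filters out most of these probe-side failures, so an alert is far more likely to reflect a real outage.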
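The quorum rule behind `alert_after_regions` can be sketched in a few lines of Python. The function name `should_alert` and the shape of `region_results` are assumptions made for this example.

```python
def should_alert(region_results: dict[str, bool], alert_after_regions: int = 2) -> bool:
    """Decide whether to alert.

    region_results maps a region name to True when the check passed there.
    An alert fires only when enough regions report a failure.
    """
    failed_regions = [region for region, passed in region_results.items() if not passed]
    return len(failed_regions) >= alert_after_regions


# One probe location has trouble: treated as noise.
print(should_alert({"us-east-1": True, "eu-west-1": False, "ap-southeast-1": True}))

# Two locations agree the site is failing: alert.
print(should_alert({"us-east-1": False, "eu-west-1": False, "ap-southeast-1": True}))
```

The trade-off is that a genuinely regional outage affecting a single probe location will not alert under this rule, so the quorum size is worth revisiting if your users are concentrated in one region.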
Step 3: Configure API Monitors
For API endpoints, add more sophisticated checks:
monitor:
  name: "User API - Get User"
  url: "https://api.example.com/v1/users/{{TEST_USER_ID}}"
  method: GET
  headers:
    Authorization: "Bearer {{API_MONITOR_TOKEN}}"
    Accept: "application/json"
  assertions:
    - type: status_code
      operator: equals
      value: 200
    - type: json_path
      path: "$.id"
      operator: exists
    - type: json_path
      path: "$.email"
      operator: is_string
    - type: response_time
      operator: less_than
      value: 1000

monitor:
  name: "Auth API - Token Issuance"
  url: "https://api.example.com/auth/token"
  method: POST
  headers:
    Content-Type: "application/json"
  body: |
    {
      "client_id": "{{MONITOR_CLIENT_ID}}",
      "client_secret": "{{MONITOR_CLIENT_SECRET}}",
      "grant_type": "client_credentials"
    }
  assertions:
    - type: status_code
      operator: equals
      value: 200
    - type: json_path
      path: "$.access_token"
      operator: exists
    - type: json_path
      path: "$.expires_in"
      operator: greater_than
      value: 0
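The `json_path` assertions can be illustrated with a small Python sketch. It supports only simple `$.key.subkey` paths rather than full JSONPath, and `resolve_path` and `check_json_assertion` are hypothetical helper names.

```python
import json


def resolve_path(document, path: str):
    """Resolve a simple '$.a.b' path and return (found, value)."""
    if not path.startswith("$."):
        raise ValueError("this sketch only supports simple '$.key' paths")
    current = document
    for key in path[2:].split("."):
        if not isinstance(current, dict) or key not in current:
            return False, None
        current = current[key]
    return True, current


def check_json_assertion(document, assertion: dict) -> bool:
    """Evaluate one json_path assertion against a parsed JSON document."""
    found, value = resolve_path(document, assertion["path"])
    operator = assertion["operator"]
    if operator == "exists":
        return found
    if operator == "is_string":
        return found and isinstance(value, str)
    if operator == "greater_than":
        return found and isinstance(value, (int, float)) and value > assertion["value"]
    raise ValueError(f"unsupported operator: {operator}")


# Example token response, parsed from the raw body.
token_response = json.loads('{"access_token": "abc123", "expires_in": 3600}')

checks = [
    {"path": "$.access_token", "operator": "exists"},
    {"path": "$.expires_in", "operator": "greater_than", "value": 0},
]
print(all(check_json_assertion(token_response, c) for c in checks))
```

Checking field presence and type, rather than exact values, keeps the monitor stable when test data changes while still catching a response whose shape is wrong.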
Step 4: Set Up SSL Certificate Monitoring
SSL expiry is a predictable failure mode that always seems to sneak up on teams. Set up SSL monitoring:
ssl_monitor:
  name: "Main Domain SSL"
  hostname: "example.com"
  port: 443
  alerts:
    - days_before_expiry: 30
      severity: warning
      channels: ["slack-ops"]
    - days_before_expiry: 7
      severity: critical
      channels: ["pagerduty", "slack-ops", "email-ops"]
Add SSL monitors for every custom domain:
- Your main domain
- API subdomain
- Dashboard subdomain
- Any customer-facing subdomains
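For a sense of what an expiry check does under the hood, here is a minimal sketch using Python's standard `ssl` and `socket` modules. The function names are illustrative, and the 30-day and 7-day cutoffs simply mirror the alert configuration above.

```python
import socket
import ssl
import time
from typing import Optional


def fetch_not_after(hostname: str, port: int = 443, timeout: float = 10.0) -> str:
    """Connect to the host and return the certificate's notAfter string."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: float) -> int:
    """Whole days between `now` (epoch seconds) and the certificate expiry."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    return int((expires_at - now) // 86400)


def expiry_severity(days_left: int) -> Optional[str]:
    """Map remaining days to a severity, or None when no alert is needed."""
    if days_left <= 7:
        return "critical"
    if days_left <= 30:
        return "warning"
    return None


def check_certificate(hostname: str) -> Optional[str]:
    """Network-dependent entry point: fetch the certificate and classify it."""
    days_left = days_until_expiry(fetch_not_after(hostname), time.time())
    return expiry_severity(days_left)
```

Calling `check_certificate("example.com")` requires network access. Separating the date arithmetic from the network call means the threshold logic can be tested offline.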
Step 5: Configure Alert Channels
Alerts mean nothing if they don't reach the right people through the right channels.
Slack Integration
# Get your Slack webhook URL
# Slack App > Incoming Webhooks > Add New Webhook
# Test the webhook
curl -X POST YOUR_WEBHOOK_URL \
-H "Content-Type: application/json" \
-d '{"text": "Test alert from AzMonitor"}'
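The same test can be run from Python using only the standard library. This is a sketch: `build_slack_request` and `send_slack_alert` are names chosen for the example, and the webhook URL is whatever Slack issued for your workspace.

```python
import json
import urllib.request


def build_slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Build the POST request for a Slack incoming webhook."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_slack_alert(webhook_url: str, text: str, timeout: float = 10.0) -> int:
    """Send the message and return the HTTP status code (needs network access)."""
    request = build_slack_request(webhook_url, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Treat the webhook URL as a secret: anyone who has it can post to the channel, so load it from an environment variable or secret store rather than committing it.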
Configure different Slack channels for different severity levels:
- #incidents: P1 critical failures
- #alerts: P2 degradations
- #monitoring: P3 informational
Email Alerts
Add the on-call team email list, not individual accounts. Team email lists ensure alerts reach someone even during vacations.
PagerDuty Integration
For P1 alerts that require 24/7 response:
integrations:
  pagerduty:
    integration_key: "YOUR_PAGERDUTY_INTEGRATION_KEY"
    severity_mapping:
      critical: P1  # Triggers immediately
      warning: P2   # Sends push notification
      info: P3      # Logs to PagerDuty only
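If you ever need to send events to PagerDuty yourself instead of relying on a built-in integration, the payload for the Events API v2 looks roughly like the sketch below. The `build_trigger_event` helper and the `dedup_key` naming scheme are assumptions for illustration; check field names against PagerDuty's current documentation before depending on them.

```python
import json

# Events API v2 endpoint; events are sent here as a JSON POST body.
PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

# Severity values accepted by the Events API v2 payload.
ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key: str, monitor_name: str, summary: str, severity: str) -> dict:
    """Build a 'trigger' event for one failing monitor."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity}")
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        # Reusing one key per monitor groups repeat failures into one incident.
        "dedup_key": f"monitor-{monitor_name}",
        "payload": {
            "summary": summary,
            "source": monitor_name,
            "severity": severity,
        },
    }


event = build_trigger_event(
    "YOUR_PAGERDUTY_INTEGRATION_KEY",
    "api-health",
    "API Health check failing in 2 regions",
    "critical",
)
print(json.dumps(event, indent=2))
```

A stable `dedup_key` matters for noise control: without it, every failed check could open a separate incident for the same outage.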
Step 6: Set Alert Thresholds Appropriately
The most common monitoring mistake is setting thresholds that generate constant noise:
# Bad: Too sensitive, will generate false positives
alerts:
  - condition: "response_time > 500ms"
    severity: critical

# Good: Accounts for normal variation, requires persistence
alerts:
  - condition: "response_time > 2000ms for 3 consecutive checks"
    severity: warning
  # Fires only when at least 2 of the last 5 checks failed
  - condition: "availability < 80% in last 5 checks"
    severity: critical
Consecutive failure requirements dramatically reduce false positives. Require 2-3 consecutive failures before alerting for non-critical monitors.
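A consecutive-failure rule is simple to express in code. The following sketch uses a hypothetical `ConsecutiveFailureAlert` class that alerts once when the failure streak reaches the required length and sends a single recovery notice when checks pass again.

```python
from typing import Optional


class ConsecutiveFailureAlert:
    """Alert only after N failed checks in a row."""

    def __init__(self, required_failures: int = 3):
        self.required_failures = required_failures
        self.streak = 0
        self.alerting = False

    def record(self, check_passed: bool) -> Optional[str]:
        """Record one check result; return 'alert', 'resolved', or None."""
        if check_passed:
            self.streak = 0
            if self.alerting:
                self.alerting = False
                return "resolved"
            return None
        self.streak += 1
        if self.streak >= self.required_failures and not self.alerting:
            self.alerting = True
            return "alert"
        return None


tracker = ConsecutiveFailureAlert(required_failures=3)
# One isolated blip, then a sustained outage, then recovery.
history = [False, True, False, False, False, False, True]
events = [tracker.record(passed) for passed in history]
print(events)
```

The isolated failure at the start produces no alert, the sustained outage alerts once rather than on every check, and recovery produces one "resolved" event.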
Starting Threshold Values
Until you have baseline data, start with these conservative thresholds and refine:
| Monitor Type | Timeout | Warning | Critical |
|---|---|---|---|
| Homepage | 30s | > 3s for 3 checks | Down for 2 checks |
| API endpoint | 10s | > 1s for 3 checks | Down for 2 checks |
| Auth endpoint | 5s | > 500ms for 5 checks | Down for 2 checks |
| Payment API | 5s | > 2s for 3 checks | Down for 2 checks |
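Once baseline data exists, one way to refine these starting values is to derive the warning threshold from observed latency percentiles. The sketch below uses a nearest-rank percentile and an assumed rule of thumb (warning at 1.5 times the 95th percentile); both the `suggest_warning_threshold` helper and the multiplier are illustrative choices, not a recommendation from any specific tool.

```python
def percentile(samples: list[int], pct: int) -> int:
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Ceiling of pct * n / 100 using integer arithmetic.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def suggest_warning_threshold(samples_ms: list[int], headroom: float = 1.5) -> int:
    """Suggest a warning threshold as p95 latency plus headroom (assumed heuristic)."""
    return int(percentile(samples_ms, 95) * headroom)


# Synthetic response times in milliseconds: 100, 110, ..., 1090.
response_times_ms = list(range(100, 1100, 10))

print("p50:", percentile(response_times_ms, 50))
print("p95:", percentile(response_times_ms, 95))
print("suggested warning:", suggest_warning_threshold(response_times_ms))
```

Basing the threshold on a high percentile rather than the average keeps it above normal variation, so the warning fires on real slowdowns instead of routine slow requests.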
Step 7: Create a Status Page
Even if you only have 5 customers, a status page is worth setting up. It provides a canonical place for status information and reduces support ticket volume during incidents.
status_page:
  name: "Example Status"
  domain: "status.example.com"
  components:
    - name: "Website"
      monitors: ["main-website"]
    - name: "API"
      monitors: ["api-health", "user-api"]
    - name: "Authentication"
      monitors: ["auth-api"]
    - name: "Payment Processing"
      monitors: ["payment-api"]
  settings:
    show_uptime_history: true
    history_days: 90
    enable_subscriber_notifications: true
    custom_css: false
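Each status page component aggregates one or more monitors. A common way to roll them up, sketched here in Python under the assumption that a component shows the worst state among its monitors, is shown below; the state names and the `component_status` helper are hypothetical.

```python
# States ordered from best to worst; a component takes the worst one present.
STATE_ORDER = ["operational", "degraded", "down"]


def component_status(monitor_states: dict[str, str], monitors: list[str]) -> str:
    """Return the worst state among the monitors backing a component."""
    return max((monitor_states[name] for name in monitors), key=STATE_ORDER.index)


components = {
    "Website": ["main-website"],
    "API": ["api-health", "user-api"],
    "Authentication": ["auth-api"],
    "Payment Processing": ["payment-api"],
}

# Example snapshot of current monitor states.
monitor_states = {
    "main-website": "operational",
    "api-health": "operational",
    "user-api": "degraded",
    "auth-api": "operational",
    "payment-api": "down",
}

page = {name: component_status(monitor_states, mons) for name, mons in components.items()}
for name, status in page.items():
    print(f"{name}: {status}")
```

Worst-of aggregation errs on the side of transparency: a single degraded endpoint marks the whole component as degraded, which is usually what customers expect from a status page.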
Step 8: Test Your Setup
Before trusting your monitoring, verify it works:
# Test 1: Verify the monitor is running
# Check your monitoring dashboard - should show last check time within 2 minutes
# Test 2: Simulate a failure
# Temporarily block your server's response (or use a non-existent URL)
# Verify you receive an alert within expected time window
# Test 3: Verify alert resolution
# Unblock your server
# Verify you receive an "all clear" alert
# Test 4: Test each notification channel
# Click "Send test alert" for each configured channel
# Verify Slack, email, and PagerDuty receive the test
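If blocking a production server is not an option, the failure and recovery path can be rehearsed against a local stub. This sketch starts a throwaway HTTP server on a random local port, flips it between healthy and failing, and confirms that a simple check function notices both transitions. All names here are illustrative.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class ToggleHandler(BaseHTTPRequestHandler):
    healthy = True  # flipped by the drill to simulate an outage

    def do_GET(self):
        status = 200 if ToggleHandler.healthy else 503
        body = b"ok" if ToggleHandler.healthy else b"unavailable"
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep drill output quiet


# Bypass any configured HTTP proxy so requests reach the local stub directly.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def check(url: str) -> bool:
    """Return True when the URL answers with HTTP 200."""
    try:
        with _opener.open(url, timeout=5) as response:
            return response.status == 200
    except urllib.error.URLError:  # also covers HTTPError such as 503
        return False


def run_drill() -> list[bool]:
    """Healthy, then failing, then recovered; returns the check results."""
    server = HTTPServer(("127.0.0.1", 0), ToggleHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_port}/"
    results = []
    try:
        for healthy in (True, False, True):
            ToggleHandler.healthy = healthy
            results.append(check(url))
    finally:
        server.shutdown()
        server.server_close()
    return results
```

Calling `run_drill()` should report a pass, a failure, and a pass again. A drill like this only validates the check logic; you still need the end-to-end test above to confirm that alerts reach Slack, email, and PagerDuty.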
Step 9: Document Your Monitoring Setup
Create a monitoring runbook that your entire team can reference:
# Monitoring Runbook
## Where to Find Monitoring
Dashboard: https://app.azmonitor.com/dashboard
Status page: https://status.example.com
## Alert Channels
P1 incidents: PagerDuty (Engineering team) + #incidents Slack
P2 degradations: #alerts Slack channel
P3 informational: #monitoring Slack channel
## What Each Monitor Watches
- Main Website: homepage load time and content
- API Health: /health endpoint availability
- Authentication: login API availability
- Payment API: checkout endpoint
- SSL - example.com: certificate expiry
## Alert Response Guide
P1 (Critical): Page on-call, update status page, begin incident response
P2 (Warning): Investigate during business hours, update in #alerts
P3 (Info): Note in next standup, create ticket if persistent
## Maintenance Windows
Create maintenance window before: deployments, database migrations, infra changes
Location: AzMonitor > Settings > Maintenance Windows
Step 10: Review and Improve
Monitoring is never "done." Schedule a quarterly review:
- Which alerts fired in the last quarter?
- Which were false positives? (Tune thresholds)
- Which incidents weren't caught by monitoring? (Add coverage)
- Which monitors haven't fired in 6 months? (Remove or reconsider)
# Monitoring health check script
def quarterly_monitoring_review(monitoring_client, all_services):
    """Run quarterly monitoring review checks.

    `monitoring_client` and its methods stand in for whatever API your
    monitoring tool exposes; `all_services` is your service inventory.
    """
    issues = []

    # Find monitors that never alerted (might not be testing the right thing)
    silent_monitors = monitoring_client.get_monitors_with_no_alerts(days=90)
    if silent_monitors:
        issues.append(f"Monitors with no alerts in 90 days: {[m.name for m in silent_monitors]}")

    # Find monitors with high false positive rates
    noisy_monitors = monitoring_client.get_monitors_with_high_auto_resolve_rate(threshold=0.5)
    if noisy_monitors:
        issues.append(f"Noisy monitors (>50% auto-resolve): {[m.name for m in noisy_monitors]}")

    # Find services without monitoring
    monitored_domains = {m.get_domain() for m in monitoring_client.get_monitors()}
    unmonitored = [s for s in all_services if s.domain not in monitored_domains]
    if unmonitored:
        issues.append(f"Services without monitoring: {[s.name for s in unmonitored]}")

    return issues
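The script treats a high auto-resolve rate as a proxy for false positives. As a sketch of what a call like `get_monitors_with_high_auto_resolve_rate` might compute, assuming alert history is available as a list of records with hypothetical `monitor` and `auto_resolved` fields:

```python
from collections import defaultdict


def auto_resolve_rates(alerts: list[dict]) -> dict[str, float]:
    """Fraction of each monitor's alerts that cleared without human action."""
    totals = defaultdict(int)
    auto_resolved = defaultdict(int)
    for alert in alerts:
        totals[alert["monitor"]] += 1
        if alert["auto_resolved"]:
            auto_resolved[alert["monitor"]] += 1
    return {name: auto_resolved[name] / totals[name] for name in totals}


def noisy_monitors(alerts: list[dict], threshold: float = 0.5) -> list[str]:
    """Names of monitors whose auto-resolve rate exceeds the threshold."""
    rates = auto_resolve_rates(alerts)
    return sorted(name for name, rate in rates.items() if rate > threshold)


# Example alert history for one quarter.
alert_history = [
    {"monitor": "Users API", "auto_resolved": True},
    {"monitor": "Users API", "auto_resolved": True},
    {"monitor": "Users API", "auto_resolved": False},
    {"monitor": "Checkout API", "auto_resolved": True},
    {"monitor": "Checkout API", "auto_resolved": False},
]

print(noisy_monitors(alert_history))
```

Here "Users API" is flagged because two of its three alerts cleared on their own, while "Checkout API" sits exactly at 50% and is not flagged. Auto-resolution is only a proxy: a brief real outage also resolves by itself, so flagged monitors deserve a look before their thresholds are loosened.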
Conclusion
Good monitoring is a practice, not a one-time setup. Start with your critical paths, configure multi-region checks with meaningful assertions, route alerts appropriately, and review regularly. AzMonitor makes the setup process straightforward with sensible defaults that get you from zero to production-ready monitoring in under 30 minutes — and the review and improvement cycle that follows turns that initial setup into increasingly reliable coverage over time.