DNS is the phone book of the internet, and when it has problems, nothing works. DNS failures are uniquely frustrating for three reasons: they are opaque to affected users, who just see "site not found"; they can be geographically inconsistent, working in one city and broken in another; and they frequently take engineers by surprise because DNS seems like infrastructure that should "just work." Monitoring DNS properly prevents these surprises.
Why DNS Monitoring Matters
DNS issues cause more outages than most teams realize:
- Certificate renewals fail because ACME DNS challenges time out
- Deployments point traffic to wrong servers after a zone file error
- CDN configurations break because CNAMEs don't resolve
- Email delivery fails due to SPF/DKIM/DMARC DNS record issues
- Third-party integrations break when their endpoints change DNS
DNS monitoring should be a standard part of your infrastructure monitoring, not an afterthought.
Types of DNS Records to Monitor
Different record types serve different purposes and need different monitoring:
| Record Type | Purpose | Monitoring Focus |
|---|---|---|
| A | IPv4 address for hostname | Correct IP, TTL |
| AAAA | IPv6 address for hostname | Correct IPv6, TTL |
| CNAME | Alias to another hostname | Correct target, chain depth |
| MX | Mail server for domain | Correct priority and server |
| TXT | Text data (SPF, DKIM, verification) | Correct value, syntax |
| NS | Authoritative nameservers | Correct servers, accessibility |
| SOA | Zone authority information | Serial number, timing |
| CAA | Certificate authority authorization | Correct CA entries |
Monitoring DNS Record Values
Verify your DNS records contain expected values:
import re

import dns.exception
import dns.resolver


def check_dns_record(hostname, record_type, expected_value=None, expected_pattern=None):
    """
    Check a DNS record and optionally validate its value.

    Args:
        hostname: Domain to check
        record_type: 'A', 'AAAA', 'CNAME', 'MX', 'TXT', etc.
        expected_value: Expected exact value
        expected_pattern: Regex pattern to match against

    Returns:
        Dict with status, values, and any errors
    """
    resolver = dns.resolver.Resolver()
    resolver.timeout = 5    # Seconds to wait for each nameserver
    resolver.lifetime = 10  # Total seconds allowed for the whole query

    try:
        answers = resolver.resolve(hostname, record_type)
        values = []
        for rdata in answers:
            if record_type in ('A', 'AAAA'):
                values.append(str(rdata.address))
            elif record_type == 'CNAME':
                # Targets are fully qualified, so they end with a dot
                values.append(str(rdata.target))
            elif record_type == 'MX':
                values.append(f"{rdata.preference} {rdata.exchange}")
            elif record_type == 'TXT':
                # Long TXT values are split into several strings; rejoin them
                values.append(''.join(s.decode() for s in rdata.strings))
            else:
                values.append(str(rdata))

        result = {
            "status": "resolved",
            "hostname": hostname,
            "record_type": record_type,
            "values": values,
            "ttl": answers.rrset.ttl
        }

        # Validate against expected value
        if expected_value and expected_value not in values:
            result["status"] = "value_mismatch"
            result["expected"] = expected_value
            result["actual"] = values

        # Validate against pattern
        if expected_pattern:
            matched = any(re.match(expected_pattern, v) for v in values)
            if not matched:
                result["status"] = "pattern_mismatch"
                result["pattern"] = expected_pattern

        return result

    except dns.resolver.NXDOMAIN:
        return {
            "status": "nxdomain",
            "hostname": hostname,
            "record_type": record_type,
            "error": "Domain does not exist"
        }
    except dns.resolver.NoAnswer:
        return {
            "status": "no_record",
            "hostname": hostname,
            "record_type": record_type,
            "error": f"No {record_type} record found"
        }
    except dns.resolver.NoNameservers:
        # Every nameserver failed to answer (for example with SERVFAIL)
        return {
            "status": "server_failure",
            "hostname": hostname,
            "record_type": record_type,
            "error": "All nameservers failed to answer"
        }
    except dns.exception.Timeout:
        return {
            "status": "timeout",
            "hostname": hostname,
            "record_type": record_type,
            "error": "DNS query timed out"
        }
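One practical wrinkle when comparing against an expected value: hostnames in CNAME, MX, and NS answers typically come back fully qualified, with a trailing dot, and DNS names are case-insensitive. An exact string comparison can therefore report a mismatch for a record that is actually correct. The following is a minimal sketch of a normalization step using only the standard library; the helper names `normalize_dns_name` and `value_matches` are hypothetical and are not part of the function above.

```python
def normalize_dns_name(name):
    """Lowercase a DNS value and drop the trailing root dot, if present."""
    return name.strip().rstrip('.').lower()


def value_matches(expected, actual_values):
    """Return True if `expected` equals any resolved value after normalization."""
    wanted = normalize_dns_name(expected)
    return any(normalize_dns_name(value) == wanted for value in actual_values)


# A CNAME answer that differs from the expected value only in case and
# in the trailing dot still counts as a match.
print(value_matches("cdn.example.net", ["CDN.Example.NET."]))
```

IP addresses pass through this normalization unchanged, so the same comparison could be used for A records as well.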
Monitoring DNS Propagation
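When you change a DNS record, resolvers keep serving the cached old value until its TTL expires, so the change appears to "propagate" gradually. Query several public resolvers to see which ones already return the new value: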
import dns.resolver


def monitor_dns_propagation(hostname, record_type, expected_value, resolvers=None):
    """
    Check DNS record propagation across multiple resolvers.
    Shows which resolvers have the new value vs old value.
    """
    if resolvers is None:
        # Well-known public resolvers. These are anycast addresses, so each
        # query reaches the node nearest to wherever this check runs.
        resolvers = {
            "Google Public DNS": "8.8.8.8",
            "Cloudflare DNS": "1.1.1.1",
            "OpenDNS": "208.67.222.222",
            "Quad9": "9.9.9.9",
            "Google Public DNS (Secondary)": "8.8.4.4",
            "Cloudflare DNS (Secondary)": "1.0.0.1",
            "Comcast DNS": "75.75.75.75",
            "Level3 DNS": "209.244.0.3",
        }

    results = {}
    propagated_count = 0

    for resolver_name, resolver_ip in resolvers.items():
        try:
            # configure=False skips the system resolver settings;
            # only the resolver under test is queried
            resolver = dns.resolver.Resolver(configure=False)
            resolver.nameservers = [resolver_ip]
            resolver.timeout = 3
            resolver.lifetime = 3

            answers = resolver.resolve(hostname, record_type)
            values = [str(a) for a in answers]

            is_propagated = expected_value in values
            if is_propagated:
                propagated_count += 1

            results[resolver_name] = {
                "ip": resolver_ip,
                "values": values,
                "propagated": is_propagated,
                "status": "propagated" if is_propagated else "old_value"
            }
        except Exception as e:
            results[resolver_name] = {
                "ip": resolver_ip,
                "status": "error",
                "error": str(e)
            }

    total = len(resolvers)
    # Guard against an empty resolver list
    propagation_percentage = (propagated_count / total) * 100 if total else 0.0

    return {
        "hostname": hostname,
        "expected_value": expected_value,
        "propagation_percentage": round(propagation_percentage, 1),
        "propagated_count": propagated_count,
        "total_resolvers": total,
        "results": results,
        "fully_propagated": total > 0 and propagated_count == total
    }

# Example output (abridged):
# {
#     "propagation_percentage": 75.0,
#     "propagated_count": 6,
#     "total_resolvers": 8,
#     "results": {
#         "Google Public DNS": {"propagated": True, "values": ["192.0.2.1"]},
#         "Cloudflare DNS": {"propagated": False, "values": ["192.0.2.0"]}  # Old value
#     }
# }
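Note that in the report above, a resolver that errors out lowers the propagation percentage just like one that still serves the old value. When deciding whether to keep waiting or to investigate, it may help to separate the two cases. The sketch below assumes a report shaped like the return value of `monitor_dns_propagation`; the helper names `lagging_resolvers` and `format_summary` are hypothetical.

```python
def lagging_resolvers(report):
    """Split non-propagated resolvers into stale ones and failed queries."""
    stale, failed = [], []
    for name, info in report["results"].items():
        if info.get("status") == "error":
            failed.append(name)
        elif not info.get("propagated", False):
            stale.append(name)
    return sorted(stale), sorted(failed)


def format_summary(report):
    """Build a one-line, human-readable propagation summary."""
    stale, failed = lagging_resolvers(report)
    line = (f"{report['hostname']}: {report['propagated_count']}/"
            f"{report['total_resolvers']} resolvers updated")
    if stale:
        line += f"; stale: {', '.join(stale)}"
    if failed:
        line += f"; errors: {', '.join(failed)}"
    return line


# Hand-written sample report (illustrative values, not real measurements)
sample_report = {
    "hostname": "api.example.com",
    "propagated_count": 1,
    "total_resolvers": 3,
    "results": {
        "Resolver A": {"status": "propagated", "propagated": True},
        "Resolver B": {"status": "old_value", "propagated": False},
        "Resolver C": {"status": "error", "error": "timed out"},
    },
}
print(format_summary(sample_report))
```

Stale resolvers usually just need more time, bounded by the old TTL, whereas repeated errors point at a reachability problem rather than at caching.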
DNS TTL Management
TTL (Time To Live) determines how long resolvers cache your records. It directly affects propagation time:
| TTL Value | Cache Duration | Propagation Time | Use Case |
|---|---|---|---|
| 60s | 1 minute | Very fast | Before planned changes |
| 300s | 5 minutes | 5-10 minutes | Active change windows |
| 3600s | 1 hour | 1-2 hours | Normal operations |
| 86400s | 24 hours | 24-48 hours | Stable, rarely changed records |
Best practice for DNS changes:
- Lower the TTL to 300s ahead of your change window, then wait at least twice the old TTL so that resolvers discard copies cached under the old TTL
- Make the DNS change
- Verify propagation (see above)
- Restore TTL to normal value after successful propagation
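# Check the current TTL as seen by your default resolver
# (a recursive resolver reports the time remaining in its cache)
dig api.example.com A +noall +answer

# Check against the authoritative nameserver (shows the configured TTL)
dig @ns1.example.com api.example.com A +noall +answer

# Output includes the TTL in the second column:
# api.example.com.    300    IN    A    192.0.2.1
#                     ^^^ TTL = 300 seconds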
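The waiting period in the first step of the change procedure is simple arithmetic, but it is easy to get wrong under time pressure. Here is a minimal sketch of that calculation, assuming the "2x the old TTL" rule from the list above; the function names and the example timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def earliest_safe_change(ttl_lowered_at, old_ttl_seconds, safety_factor=2):
    """Return when copies cached under the old TTL should have expired."""
    return ttl_lowered_at + timedelta(seconds=old_ttl_seconds * safety_factor)


def seconds_until_safe(now, ttl_lowered_at, old_ttl_seconds, safety_factor=2):
    """Return how many seconds remain before the change window may open."""
    safe_at = earliest_safe_change(ttl_lowered_at, old_ttl_seconds, safety_factor)
    return max(0.0, (safe_at - now).total_seconds())


# TTL lowered from 3600s at midnight UTC: with a 2x margin,
# the record change should wait until 02:00 UTC.
lowered = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
print(earliest_safe_change(lowered, 3600).isoformat())
```

The margin exists because a resolver may have cached the record just before the TTL was lowered, and some resolvers hold entries slightly longer than the TTL requires.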
Monitoring DNS Response Time
DNS lookup time contributes directly to time to first byte (TTFB), because the client cannot connect until the name resolves. Monitor it:
# Measure DNS resolution time from different locations
# Using dig with timing
time dig api.example.com A @8.8.8.8 +noall +answer
# Or curl with timing breakdown
curl -s -o /dev/null -w "DNS: %{time_namelookup}s\nConnect: %{time_connect}s\n" \
https://api.example.com
# Typical healthy values:
# DNS from cache: < 1ms
# DNS uncached (same continent): 20-50ms
# DNS uncached (cross-continent): 100-300ms
Alert when DNS resolution time is unexpectedly high:
monitor:
  name: "DNS Resolution Time"
  type: dns
  hostname: "api.example.com"
  record_type: A
  assertions:
    - type: resolution_time
      operator: less_than
      value: 100  # ms
    - type: record_value
      expected: "192.0.2.1"
    - type: ttl
      operator: greater_than
      value: 60  # Ensure minimum TTL is set
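If you want to collect the same kind of timing from your own hosts, a rough equivalent can be written with the standard library. This is a sketch under stated assumptions, not a replacement for a dedicated probe: it times the operating system's resolver path through `socket.getaddrinfo`, so a local cache hit will look very fast, and the function names and the 100 ms default threshold are illustrative.

```python
import socket
import statistics
import time


def time_lookup_ms(hostname, resolve=socket.getaddrinfo):
    """Time one lookup through the given resolver function, in milliseconds."""
    start = time.perf_counter()
    resolve(hostname, None)
    return (time.perf_counter() - start) * 1000.0


def summarize_samples(samples_ms, threshold_ms=100.0):
    """Summarize lookup timings and flag a breach when the median is too high."""
    median = statistics.median(samples_ms)
    return {
        "median_ms": median,
        "max_ms": max(samples_ms),
        "breached": median > threshold_ms,
    }


# Summarizing three hand-written sample timings (not real measurements)
print(summarize_samples([10.0, 20.0, 300.0]))
```

Alerting on the median rather than on a single sample is a deliberate choice here: one uncached lookup can be slow without indicating a problem, while a high median suggests the resolver path is consistently degraded. The `resolve` parameter is injectable so the timing logic can be exercised without network access.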
Monitoring DNS for Email (MX, SPF, DKIM, DMARC)
Email delivery depends on DNS records that are easy to misconfigure and hard to detect as broken without specific monitoring:
import dns.resolver


def _txt_value(rdata):
    """Join the strings of one TXT record into a single value."""
    return ''.join(s.decode() for s in rdata.strings)


def check_email_dns_health(domain):
    """
    Comprehensive check of email-related DNS records.
    Returns health status for email deliverability.

    DKIM is not checked here because the lookup requires the selector name.
    """
    checks = {}

    # MX Records
    try:
        mx_records = dns.resolver.resolve(domain, 'MX')
        checks['mx'] = {
            'status': 'healthy',
            'records': sorted([str(r.exchange) for r in mx_records])
        }
    except Exception as e:
        checks['mx'] = {'status': 'missing', 'error': str(e)}

    # SPF Record (TXT record starting with "v=spf1")
    try:
        txt_records = dns.resolver.resolve(domain, 'TXT')
        spf_records = [
            value for value in map(_txt_value, txt_records)
            if value.lower().startswith('v=spf1')
        ]
        if spf_records:
            checks['spf'] = {
                'status': 'healthy',
                'record': spf_records[0]
            }
        else:
            checks['spf'] = {
                'status': 'missing',
                'error': 'No SPF record found'
            }
    except Exception as e:
        checks['spf'] = {'status': 'error', 'error': str(e)}

    # DMARC Record
    try:
        dmarc_records = dns.resolver.resolve(f'_dmarc.{domain}', 'TXT')
        dmarc = ''.join(_txt_value(r) for r in dmarc_records)
        if 'v=DMARC1' in dmarc:
            # Parse "tag=value" pairs so that "sp=reject" (the subdomain
            # policy) is not mistaken for "p=reject"
            tags = {}
            for part in dmarc.split(';'):
                name, sep, value = part.partition('=')
                if sep:
                    tags[name.strip().lower()] = value.strip()
            checks['dmarc'] = {
                'status': 'healthy',
                'policy': tags.get('p', 'none'),
                'record': dmarc
            }
        else:
            checks['dmarc'] = {'status': 'invalid', 'record': dmarc}
    except Exception as e:
        checks['dmarc'] = {'status': 'missing', 'error': str(e)}

    # Overall email health
    missing_critical = [
        k for k, v in checks.items()
        if v['status'] in ('missing', 'error')
    ]

    return {
        'domain': domain,
        'overall_status': 'unhealthy' if missing_critical else 'healthy',
        'missing_critical': missing_critical,
        'checks': checks
    }
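The function above covers MX, SPF, and DMARC but not DKIM, because a DKIM key lives at `<selector>._domainkey.<domain>` and the selector is chosen by whoever signs your mail. Once you know your selectors, the record itself is straightforward to validate. The sketch below shows only the name construction and the parsing step, using the standard library; the helper names and the `selector1` value are hypothetical, and fetching the TXT value is left to whichever resolver code you already use.

```python
def dkim_query_name(selector, domain):
    """Build the DNS name where a DKIM public key is published."""
    return f"{selector}._domainkey.{domain}"


def parse_dkim_record(txt):
    """Parse a DKIM TXT value into tags and judge whether it holds a usable key."""
    tags = {}
    for part in txt.split(';'):
        name, sep, value = part.partition('=')
        if sep:
            # Remove whitespace inside values; long keys are often split up
            tags[name.strip()] = ''.join(value.split())

    if tags.get('v', 'DKIM1') != 'DKIM1':
        status = 'invalid'   # A version tag is present but is not DKIM1
    elif 'p' not in tags:
        status = 'invalid'   # The public key tag is required
    elif tags['p'] == '':
        status = 'revoked'   # An empty p= tag means the key was revoked
    else:
        status = 'healthy'
    return {'status': status, 'tags': tags}


print(dkim_query_name("selector1", "example.com"))
print(parse_dkim_record("v=DKIM1; k=rsa; p=MIIBIjANBg")["status"])
```

Treating an empty `p=` tag as its own status is useful in practice: it distinguishes a deliberately revoked key from a record that was truncated or pasted incorrectly.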
DNS Hijacking Detection
DNS hijacking redirects users to malicious servers. Monitor for unexpected IP changes:
import json
from datetime import datetime, timezone

import dns.resolver


class DNSChangeDetector:
    """
    Detect unexpected DNS record changes.
    Stores baseline and alerts on changes.
    """

    def __init__(self, storage_path):
        self.storage_path = storage_path
        self.baseline = self.load_baseline()

    def load_baseline(self):
        try:
            with open(self.storage_path) as f:
                return json.load(f)
        except FileNotFoundError:
            return {}

    def save_baseline(self):
        with open(self.storage_path, 'w') as f:
            json.dump(self.baseline, f, indent=2)

    def resolve(self, hostname, record_type):
        """Resolve the record and return its values as sorted strings"""
        answers = dns.resolver.resolve(hostname, record_type)
        return {'values': sorted(str(rdata) for rdata in answers)}

    def check_and_update(self, hostname, record_type):
        """Check current values and compare to baseline"""
        current = self.resolve(hostname, record_type)
        key = f"{hostname}:{record_type}"
        now = datetime.now(timezone.utc).isoformat()

        if key not in self.baseline:
            # First run - establish baseline
            self.baseline[key] = {
                'values': current['values'],
                'first_seen': now,
                'last_seen': now
            }
            self.save_baseline()
            return {'status': 'baseline_established', 'values': current['values']}

        baseline_values = set(self.baseline[key]['values'])
        current_values = set(current['values'])

        added = current_values - baseline_values
        removed = baseline_values - current_values

        if added or removed:
            # The baseline is left untouched, so the alert repeats on every
            # check until someone reviews the change
            alert = {
                'status': 'CHANGE_DETECTED',
                'hostname': hostname,
                'record_type': record_type,
                'added': list(added),
                'removed': list(removed),
                'baseline_values': list(baseline_values),
                'current_values': list(current_values),
                'detected_at': now
            }
            # This could indicate DNS hijacking
            print(f"[DNS_CHANGE] {json.dumps(alert, indent=2)}")
            return alert

        # Update last seen timestamp
        self.baseline[key]['last_seen'] = now
        self.save_baseline()
        return {'status': 'no_change', 'values': list(current_values)}
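Not every detected change is a hijack. Load balancers and CDNs rotate addresses routinely, so an alert on every difference can become noisy. One way to triage is to check whether the newly seen addresses fall inside networks you already expect. The sketch below assumes you maintain such a list yourself; the function name `classify_added_ips` and the example ranges are hypothetical, and it operates on the `added` list from the alert above.

```python
import ipaddress


def classify_added_ips(added, allowed_networks):
    """Split newly seen values into those inside known ranges and the rest."""
    networks = [ipaddress.ip_network(net) for net in allowed_networks]
    expected, suspicious = [], []
    for value in added:
        try:
            ip = ipaddress.ip_address(value)
        except ValueError:
            # Not an IP address at all (for example a CNAME target)
            suspicious.append(value)
            continue
        if any(ip in net for net in networks):
            expected.append(value)
        else:
            suspicious.append(value)
    return {"expected": sorted(expected), "suspicious": sorted(suspicious)}


# One new address is inside the known range, the other is not
print(classify_added_ips(["192.0.2.10", "203.0.113.5"], ["192.0.2.0/24"]))
```

Values in the `suspicious` list are the ones worth paging on; changes that stay within known ranges could be logged and folded into the baseline after review.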
External DNS Monitoring with AzMonitor
Configure continuous DNS monitoring:
monitors:
  # A record monitoring
  - name: "DNS - API Hostname"
    type: dns
    hostname: "api.example.com"
    record_type: A
    interval: 300
    assertions:
      - type: resolves_to
        expected_ips: ["192.0.2.1", "192.0.2.2"]
      - type: resolution_time
        operator: less_than
        value: 200

  # SSL cert + DNS consistency
  - name: "DNS - SSL Alignment"
    type: dns
    hostname: "api.example.com"
    assertions:
      - type: resolves
        # Just checks it resolves to something
      - type: ttl
        operator: greater_than
        value: 60

  # CNAME chain monitoring
  - name: "DNS - CDN CNAME"
    type: dns
    hostname: "cdn.example.com"
    record_type: CNAME
    assertions:
      - type: resolves_to_pattern
        pattern: ".*\\.cloudfront\\.net$"
Conclusion
DNS monitoring sits at the intersection of infrastructure reliability and security. Basic health checks confirm your records resolve correctly. Change detection catches unexpected modifications, which can indicate hijacking or misconfiguration. Propagation monitoring tracks how DNS changes roll out globally, and email DNS checks ensure your deliverability setup stays intact. AzMonitor's monitoring capabilities include DNS checks that verify record values, TTL settings, and resolution time from multiple global locations. That continuous visibility helps prevent the "everything is down and nobody knows why" incidents that turn out to be DNS problems in disguise.