An incident timeline is the authoritative account of what happened, when it happened, and who was involved. It's the foundation of every useful postmortem. Without an accurate timeline, postmortems become debates about memory rather than analyses of facts. With a good timeline, you can pinpoint exactly when a problem started, correlate it with changes, and follow the chain of events that led to resolution.
Why Timelines Are Harder Than They Look
Building an accurate timeline after an incident is harder than it seems:
Clock skew — Different systems use different time zones or have clocks that drift. An event that shows as 14:32 in the application logs might be 14:30 in the database logs and 14:34 in the CDN logs.
Memory compression — Under stress, humans compress time. "That was about 10 minutes after we first noticed" is often off by a factor of 2-3x in either direction.
Incomplete documentation — During incident response, documenting what you're doing is a lower priority than doing it. Actions taken without being logged become gaps in the timeline.
Attribution gaps — "Someone restarted the service" with no record of who or exactly when.
Causation vs correlation — The timeline shows what happened in sequence, but not always which events caused which.
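The clock-skew problem above is usually handled by converting every timestamp to UTC and, where a system's clock is known to drift, applying a measured correction before comparing events. A minimal sketch, assuming log timestamps arrive as naive local-time strings. The source names, UTC offsets, and drift values are hypothetical, and fixed offsets are used for simplicity (a real setup would more likely use named time zones so daylight saving changes are handled):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical per-source settings: the fixed UTC offset each system logs in,
# and how far its clock runs ahead (+) or behind (-) a trusted reference.
SOURCE_CLOCKS = {
    "app": {"utc_offset_hours": 0, "drift_seconds": 0},
    "database": {"utc_offset_hours": -4, "drift_seconds": -120},
    "cdn": {"utc_offset_hours": 2, "drift_seconds": 120},
}


def to_utc(source: str, local_time: str) -> datetime:
    """Convert a naive 'YYYY-MM-DD HH:MM:SS' string from a source to corrected UTC."""
    clock = SOURCE_CLOCKS[source]
    naive = datetime.strptime(local_time, "%Y-%m-%d %H:%M:%S")
    local_zone = timezone(timedelta(hours=clock["utc_offset_hours"]))
    aware = naive.replace(tzinfo=local_zone)
    # Subtract the drift: a clock running 120 s ahead stamps events 120 s too late.
    corrected = aware - timedelta(seconds=clock["drift_seconds"])
    return corrected.astimezone(timezone.utc)


# The same moment as recorded by three systems with different zones and drift.
app = to_utc("app", "2025-04-16 14:32:00")
db = to_utc("database", "2025-04-16 10:30:00")
cdn = to_utc("cdn", "2025-04-16 16:34:00")
print(app == db == cdn)
```

After correction, all three readings resolve to the same UTC instant, so they can be sorted and compared safely.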
Real-Time Timeline Capture
The best timeline is built during the incident, not reconstructed from memory after it:
## Scribe Template — Real-Time Timeline Log
[All times in UTC. Full timestamps use ISO 8601 format: YYYY-MM-DDTHH:MM:SSZ. Log entries below use HH:MM:SS on the incident date.]
### Incident: [Name/ID]
### Declared: [timestamp]
### Incident Commander: [name]
---
14:22:15 | SYSTEM | Monitoring alert fired: checkout-api error rate > 5%
14:22:25 | AUTO | PagerDuty notified on-call engineer (Jane Smith)
14:23:00 | Jane | Acknowledged page, joining incident channel
14:23:30 | Jane | IC declared. Pulled in Bob (backend) and Carol (infra)
14:25:00 | Bob | Checking error logs on checkout API service
14:26:15 | Bob | Seeing "connection pool exhausted" errors in app logs
14:27:00 | Carol | Checking database metrics — CPU normal, connections at 95% max
14:28:30 | Jane | Hypothesis: connection pool leak. Checking recent deployments.
14:29:00 | Bob | Found: deploy v2.4.1 at 14:10 touched connection handling code
14:30:00 | Jane | Decision: roll back to v2.4.0. Bob authorized to proceed.
14:31:00 | Bob | Rollback initiated on checkout-api
14:33:45 | Bob | Rollback complete, watching metrics
14:35:00 | Carol | Database connections dropping — from 95% to 60%
14:37:00 | SYSTEM | Monitoring alert resolved: checkout-api error rate < 1%
14:38:00 | Jane | Customer-visible errors confirmed stopped per support team
14:38:30 | Jane | Incident declared resolved.
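A log in this pipe-delimited shape is easy to turn into structured events later, which is one reason to keep the format strict. A minimal sketch of a parser, assuming each entry is `HH:MM:SS | ACTOR | note` in UTC and that the incident date is supplied separately; `ScribeEntry` and `parse_scribe_line` are names made up for this illustration:

```python
from datetime import date, datetime, time, timezone
from typing import NamedTuple, Optional


class ScribeEntry(NamedTuple):
    timestamp: datetime
    actor: str
    note: str


def parse_scribe_line(line: str, incident_date: date) -> Optional[ScribeEntry]:
    """Parse one 'HH:MM:SS | ACTOR | note' line; return None for anything else."""
    # Split on the first two pipes only, so a note may itself contain '|'.
    parts = [p.strip() for p in line.split("|", 2)]
    if len(parts) != 3:
        return None
    try:
        clock = time.fromisoformat(parts[0])
    except ValueError:
        return None
    stamp = datetime.combine(incident_date, clock, tzinfo=timezone.utc)
    return ScribeEntry(stamp, parts[1], parts[2])


log = """\
### Incident Commander: [name]
---
14:26:15 | Bob | Seeing "connection pool exhausted" errors in app logs
14:27:00 | Carol | Checking database metrics
"""

entries = [
    entry
    for entry in (parse_scribe_line(l, date(2025, 4, 16)) for l in log.splitlines())
    if entry is not None
]
for entry in entries:
    print(entry.timestamp.isoformat(), entry.actor, entry.note)
```

Header and separator lines are skipped rather than rejected, so the whole template can be fed through the parser unchanged.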
Reconstructing Timelines from System Data
When real-time documentation is incomplete, reconstruct from system logs:
# timeline_reconstructor.py
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


class TimelineEvent:
    def __init__(self, timestamp: datetime, source: str, event: str,
                 actor: Optional[str] = None):
        # Normalize to UTC so events from different systems sort correctly.
        # Assumption: timestamps without a time zone are already in UTC.
        if timestamp.tzinfo is None:
            timestamp = timestamp.replace(tzinfo=timezone.utc)
        self.timestamp = timestamp.astimezone(timezone.utc)
        self.source = source
        self.event = event
        self.actor = actor

    def to_dict(self):
        return {
            "timestamp": self.timestamp.isoformat(),
            "source": self.source,
            "event": self.event,
            "actor": self.actor
        }


class IncidentTimelineReconstructor:
    """
    Reconstruct incident timeline from multiple data sources.
    Timestamps are normalized to UTC by TimelineEvent.
    """

    def __init__(self, azmonitor_client, cicd_client, log_client):
        # Clients for the monitoring, CI/CD, and log aggregation systems.
        # The client methods called below are illustrative; adapt them
        # to the APIs of your own tooling.
        self.azmonitor_client = azmonitor_client
        self.cicd_client = cicd_client
        self.log_client = log_client

    def collect_monitoring_events(self, incident_start, incident_end, monitor_ids):
        """Collect alert events from monitoring system."""
        events = []
        # Fetch monitoring alert history
        monitoring_data = self.azmonitor_client.get_alert_history(
            monitor_ids=monitor_ids,
            start=incident_start,
            end=incident_end
        )
        for alert in monitoring_data:
            events.append(TimelineEvent(
                timestamp=alert.triggered_at,
                source="azmonitor",
                event=f"Alert fired: {alert.name} ({alert.status})",
                actor="automated"
            ))
            # Alerts still open at collection time have no resolution timestamp
            if alert.resolved_at:
                events.append(TimelineEvent(
                    timestamp=alert.resolved_at,
                    source="azmonitor",
                    event=f"Alert resolved: {alert.name}",
                    actor="automated"
                ))
        return events

    def collect_deployment_events(self, incident_window_start, incident_window_end):
        """Collect deployment events from CI/CD system."""
        events = []
        # Extend window backward to catch pre-incident deployments
        window_start = incident_window_start - timedelta(hours=2)
        deployments = self.cicd_client.get_deployments(
            start=window_start,
            end=incident_window_end,
            environment="production"
        )
        for deploy in deployments:
            events.append(TimelineEvent(
                timestamp=deploy.completed_at,
                source="cicd",
                event=f"Deployment: {deploy.service} {deploy.version} → production",
                actor=deploy.triggered_by
            ))
        return events

    def collect_application_log_events(self, log_query, start, end):
        """Extract key events from application logs."""
        events = []
        # Query log aggregation system (e.g., Datadog, Splunk, CloudWatch)
        log_entries = self.log_client.query(
            query=log_query,
            start=start,
            end=end,
            level="ERROR"
        )
        # Keep the first 100 entries; this samples the earliest occurrences
        # only if the log client returns results in ascending time order.
        for entry in log_entries[:100]:
            events.append(TimelineEvent(
                timestamp=entry.timestamp,
                source="application_logs",
                event=f"{entry.level}: {entry.message[:200]}",
                actor=None
            ))
        return events

    def build_unified_timeline(self, events: List[TimelineEvent]) -> List[Dict]:
        """Sort and format all events into a unified timeline."""
        # Sort by timestamp (all timestamps are UTC-aware at this point)
        sorted_events = sorted(events, key=lambda e: e.timestamp)
        # Format for output
        timeline = []
        for event in sorted_events:
            timeline.append({
                "time": event.timestamp.strftime("%H:%M:%S UTC"),
                "source": event.source,
                "event": event.event,
                "actor": event.actor or "system"
            })
        return timeline
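Once events from several sources are merged, it can help to scan the result for long stretches with no recorded activity, since those are where undocumented actions tend to hide. A minimal sketch, assuming events are plain dicts with a timezone-aware `timestamp` key (not the formatted strings produced for display); `find_gaps` and the five-minute default threshold are illustrative choices, not part of the reconstructor:

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Tuple


def find_gaps(
    events: List[Dict],
    threshold: timedelta = timedelta(minutes=5),
) -> List[Tuple[Dict, Dict]]:
    """Return (earlier, later) pairs of consecutive events separated by more than threshold."""
    ordered = sorted(events, key=lambda e: e["timestamp"])
    gaps = []
    # Compare each event with the one that follows it in time.
    for earlier, later in zip(ordered, ordered[1:]):
        if later["timestamp"] - earlier["timestamp"] > threshold:
            gaps.append((earlier, later))
    return gaps


def utc(hour: int, minute: int, second: int) -> datetime:
    return datetime(2025, 4, 16, hour, minute, second, tzinfo=timezone.utc)


events = [
    {"timestamp": utc(14, 22, 15), "event": "Alert fired"},
    {"timestamp": utc(14, 10, 30), "event": "Deployment completed"},
    {"timestamp": utc(14, 23, 0), "event": "Page acknowledged"},
]

gaps = find_gaps(events)
for earlier, later in gaps:
    print(f"No events between '{earlier['event']}' and '{later['event']}'")
```

A gap is not necessarily a problem (here it is simply the quiet period between deployment and alert), but each one is worth a question in the postmortem: was anything done in that window that nobody wrote down?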
Annotating the Timeline
Raw events aren't enough — annotations add context that makes the timeline useful:
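from datetime import datetime, timedelta

TIME_FORMAT = "%H:%M:%S UTC"  # format produced by build_unified_timeline


def calculate_gap(earlier, later):
    """Return the time between two timeline entries as a timedelta.

    Entries carry only a time of day, so both are assumed to fall on the same date.
    """
    start = datetime.strptime(earlier["time"], TIME_FORMAT)
    end = datetime.strptime(later["time"], TIME_FORMAT)
    return end - start


def annotate_timeline(timeline_events, deploy_window=timedelta(minutes=30)):
    """
    Add analytical annotations to timeline events.
    """
    annotated = []
    for i, event in enumerate(timeline_events):
        annotated_event = event.copy()
        notes = []
        # Find deployment→alert correlations
        if event["source"] == "azmonitor" and "fired" in event["event"]:
            # Look for deployments within deploy_window before this alert
            recent_deploys = [
                e for e in timeline_events[:i]
                if e["source"] == "cicd"
                and calculate_gap(e, event) <= deploy_window
            ]
            if recent_deploys:
                last_deploy = recent_deploys[-1]
                gap = calculate_gap(last_deploy, event).total_seconds() / 60
                notes.append(
                    f"NOTE: Alert fired {gap:.0f} min after {last_deploy['event']}"
                )
        # Flag the detection gap. These fields are not set by
        # build_unified_timeline; they are assumed to be datetimes added by hand.
        if (event.get("is_incident_start") and event.get("detected_at")
                and event.get("actual_start")):
            gap_minutes = (event["detected_at"] - event["actual_start"]).total_seconds() / 60
            annotated_event["detection_gap_minutes"] = gap_minutes
            notes.append(
                f"DETECTION GAP: Issue started at {event['actual_start']} but "
                f"detected at {event['detected_at']} ({gap_minutes:.0f} min gap)"
            )
        # Keep both notes when an event matches both rules
        if notes:
            annotated_event["annotation"] = " ".join(notes)
        annotated.append(annotated_event)
    return annotated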
Timeline Format for Postmortems
The postmortem timeline should be readable by both technical and non-technical stakeholders:
# Incident Timeline — Checkout API Outage
## [2025-04-16 14:22 UTC → 14:38 UTC | 16 minutes]
### Pre-Incident
| Time (UTC) | Event |
|---|---|
| 14:10:00 | Deployment of v2.4.1 to production started (checkout-api) |
| 14:10:30 | Deployment completed successfully — all health checks passed |
### Incident Begins
| Time (UTC) | Event | Source |
|---|---|---|
| 14:22:00 | First customer errors — checkout API returning 500s | Application logs |
| 14:22:15 | **Alert fired**: checkout-api error rate exceeded 5% threshold | AzMonitor |
| 14:22:25 | On-call engineer (Jane) paged via PagerDuty | PagerDuty |
| 14:23:00 | Jane acknowledged page | PagerDuty |
**Detection gap: 15 seconds** (first customer errors at 14:22:00, alert fired at 14:22:15)
### Investigation
| Time (UTC) | Event | Actor |
|---|---|---|
| 14:23:30 | War room declared, Bob and Carol pulled in | Jane (IC) |
| 14:26:15 | "Connection pool exhausted" errors identified in logs | Bob |
| 14:27:00 | Database connection utilization at 95% — unusually high | Carol |
| 14:29:00 | Deployment v2.4.1 identified as likely cause | Bob |
### Mitigation
| Time (UTC) | Event | Actor |
|---|---|---|
| 14:30:00 | Decision to roll back to v2.4.0 | Jane (IC) |
| 14:31:00 | Rollback initiated | Bob |
| 14:33:45 | Rollback complete | CI/CD system |
| 14:35:00 | Database connections declining | Carol |
### Resolution
| Time (UTC) | Event | Source |
|---|---|---|
| 14:37:00 | **Alert resolved**: checkout-api error rate below threshold | AzMonitor |
| 14:38:00 | Customer-visible errors confirmed stopped | Support team |
| 14:38:30 | Incident declared resolved | Jane (IC) |
### Summary
- **Total duration**: 16 minutes 30 seconds (first customer errors → incident declared resolved)
- **Detection time**: 15 seconds (first customer errors → alert fired)
- **Time to resolution from detection**: 16 minutes 15 seconds
- **Time to root cause identification**: 6 minutes 45 seconds after detection
- **Affected users**: Approximately 15% of checkout traffic
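Summary figures like these are safer to compute from the table timestamps than to estimate by hand. A minimal sketch using the times from the example timeline; the milestone names are labels chosen for this illustration:

```python
from datetime import datetime


def at(hms: str) -> datetime:
    """Build a datetime on the incident date from an HH:MM:SS string (UTC assumed)."""
    return datetime.strptime(f"2025-04-16 {hms}", "%Y-%m-%d %H:%M:%S")


# Key moments taken from the timeline tables
milestones = {
    "first_errors": at("14:22:00"),
    "alert_fired": at("14:22:15"),
    "root_cause_identified": at("14:29:00"),
    "resolved": at("14:38:30"),
}


def elapsed(start: str, end: str) -> str:
    """Format the time between two milestones as minutes and seconds."""
    total = int((milestones[end] - milestones[start]).total_seconds())
    minutes, seconds = divmod(total, 60)
    return f"{minutes} min {seconds:02d} s"


summary = {
    "total_duration": elapsed("first_errors", "resolved"),
    "detection_time": elapsed("first_errors", "alert_fired"),
    "root_cause_after_detection": elapsed("alert_fired", "root_cause_identified"),
    "resolution_after_detection": elapsed("alert_fired", "resolved"),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Stating which two events each figure spans, as the dictionary keys do here, avoids the common ambiguity over whether "detection" means the alert, the page, or the acknowledgement.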
Metrics to Extract From Timelines
Timelines generate metrics that drive improvement:
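def find_event(timeline, event_type):
    """Return the first event tagged with event_type, or None.

    Assumes milestone events carry a "type" key and a datetime "timestamp".
    """
    return next((e for e in timeline if e.get("type") == event_type), None)


def minutes_between(start_event, end_event):
    """Minutes from start_event to end_event, or None if either is missing."""
    if start_event is None or end_event is None:
        return None
    delta = end_event["timestamp"] - start_event["timestamp"]
    return delta.total_seconds() / 60


def is_pre_incident(event, incident_start):
    """True if the event happened before the incident started."""
    return incident_start is not None and event["timestamp"] < incident_start["timestamp"]


def extract_timeline_metrics(timeline):
    """Extract key metrics from incident timeline for trend analysis."""
    incident_start = find_event(timeline, "incident_start")
    alert_fired = find_event(timeline, "alert_fired")
    root_cause_identified = find_event(timeline, "root_cause_identified")
    mitigation_started = find_event(timeline, "mitigation_started")
    incident_resolved = find_event(timeline, "incident_resolved")
    return {
        # Detection time: when did monitoring catch it vs when did it start?
        "detection_lag_minutes": minutes_between(incident_start, alert_fired),
        # Time to diagnose: from alert to root cause
        "time_to_diagnose_minutes": minutes_between(alert_fired, root_cause_identified),
        # Time to act: from diagnosis to mitigation
        "time_to_act_minutes": minutes_between(root_cause_identified, mitigation_started),
        # Time to resolve: from mitigation to resolution
        "time_to_resolve_minutes": minutes_between(mitigation_started, incident_resolved),
        # Time to restore for this incident, alert to resolution (the per-incident input to MTTR)
        "total_mttr_minutes": minutes_between(alert_fired, incident_resolved),
        # Was this deployment-related?
        "deployment_related": any(
            "deployment" in e["event"].lower()
            for e in timeline
            if is_pre_incident(e, incident_start)
        )
    }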
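Per-incident numbers become useful for improvement once they are aggregated across incidents. A minimal sketch of that step, assuming each incident yields a dict of metrics like the one above and that a missing measurement is stored as `None`; `summarize_incidents` is a hypothetical helper, and the second and third incidents are made-up sample values:

```python
from statistics import mean, median
from typing import Dict, List, Optional


def summarize_incidents(
    metrics_list: List[Dict[str, Optional[float]]],
    keys: List[str],
) -> Dict[str, Dict[str, float]]:
    """Compute count, mean, and median for each metric, skipping missing values."""
    summary = {}
    for key in keys:
        values = [m[key] for m in metrics_list if m.get(key) is not None]
        if values:
            summary[key] = {
                "count": len(values),
                "mean": mean(values),
                "median": median(values),
            }
    return summary


incidents = [
    {"detection_lag_minutes": 0.25, "total_mttr_minutes": 16.25},  # checkout example
    {"detection_lag_minutes": 6.0, "total_mttr_minutes": 42.0},    # made-up sample
    {"detection_lag_minutes": None, "total_mttr_minutes": 31.75},  # made-up sample
]

trend = summarize_incidents(
    incidents, ["detection_lag_minutes", "total_mttr_minutes"]
)
for metric, stats in trend.items():
    print(metric, stats)
```

Reporting the median alongside the mean is a deliberate choice: one unusually long incident can dominate the mean, while the median shows what a typical incident looks like.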
Conclusion
A well-built incident timeline transforms a chaotic event into a structured learning opportunity. The more accurate and detailed the timeline, the better the postmortem analysis and the more specific the action items. Building timelines during incidents is a discipline that pays off during the incident as well as after it: teams that document in real time maintain shared situational awareness, which tends to shorten resolution. Monitoring systems like AzMonitor contribute precise, automated timestamps to timelines, for example that a check started failing at 14:21:58 and the alert fired at 14:22:15. These system-generated data points are usually the most reliable anchors in an incident timeline, provided the clocks behind them are synchronized.