SLA negotiation is where engineering reality meets commercial pressure. Sales teams want to promise 99.99% uptime to close deals; engineering teams know that 99.99% allows only about 4.3 minutes of downtime in a 30-day month, across all failure modes. The negotiation determines what you're contractually obligated to deliver, and whether you can actually do it.
Start With Your Actual Uptime
Before negotiating any SLA, know your real numbers. Promising 99.9% when you're currently delivering 98.5% creates a contract you'll breach from day one.
def calculate_historical_uptime(monitoring_data, months=12):
    """
    Calculate actual historical availability as baseline for SLA negotiations.

    Assumes get_last_n_months, get_month_bounds, and calculate_availability
    are defined elsewhere in your monitoring tooling.
    """
    monthly_availabilities = []

    for month in get_last_n_months(months):
        period_start, period_end = get_month_bounds(month)

        checks = monitoring_data.get_checks(
            start=period_start,
            end=period_end
        )

        availability = calculate_availability(checks, period_start, period_end)
        monthly_availabilities.append({
            "month": month.strftime("%Y-%m"),
            "availability_pct": availability["availability_pct"],
            "downtime_minutes": availability["downtime_minutes"]
        })

    if not monthly_availabilities:
        raise ValueError("no monitoring data for the requested period")

    availabilities = [m["availability_pct"] for m in monthly_availabilities]
    worst = min(availabilities)

    return {
        "monthly_data": monthly_availabilities,
        "mean_availability": sum(availabilities) / len(availabilities),
        "min_availability": worst,
        "max_availability": max(availabilities),
        "worst_month": min(monthly_availabilities, key=lambda m: m["availability_pct"]),
        "sla_recommendation": {
            # Both targets sit below the worst observed month.
            "achievable_target": worst * 0.9990,  # Conservative buffer
            "aggressive_target": worst * 0.9995,  # Smaller buffer, less conservative
        }
    }

# If your worst month was 99.7%, don't commit to 99.9%.
# Set the SLA target below your actual worst performance.
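The function above depends on a `calculate_availability` helper that is not shown. A minimal sketch of what it might look like, assuming each check is a `(timestamp, is_up)` tuple taken at a fixed interval and that every failed check counts as one full interval of downtime. Both are assumptions; your monitoring data may need different handling.

```python
from datetime import datetime, timedelta


def calculate_availability(checks, period_start, period_end, interval_seconds=60):
    """Estimate availability from fixed-interval checks.

    Assumption: `checks` is an iterable of (timestamp, is_up) tuples and
    every failed check represents one full interval of downtime.
    """
    total_minutes = (period_end - period_start).total_seconds() / 60
    if total_minutes <= 0:
        raise ValueError("period_end must be after period_start")

    # Count failed checks that fall inside the measurement period.
    failed = sum(
        1 for ts, is_up in checks
        if period_start <= ts < period_end and not is_up
    )
    downtime_minutes = failed * interval_seconds / 60
    # Clamp so duplicate checks cannot push availability below zero.
    downtime_minutes = min(downtime_minutes, total_minutes)

    return {
        "availability_pct": 100 * (total_minutes - downtime_minutes) / total_minutes,
        "downtime_minutes": downtime_minutes,
    }
```

A shorter check interval gives a finer estimate, since each failed check is rounded up to a whole interval.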
Common SLA Structures
Flat Availability SLA
The simplest structure — a single availability percentage with a credit schedule:
Service Level Agreement
Provider commits to 99.9% monthly availability for the Core API.
Availability is measured as: (total minutes - downtime minutes) / total minutes.
Credit Schedule:
- 99.0% to below 99.9%: 10% credit of monthly fees
- 95.0% to below 99.0%: 25% credit of monthly fees
- Below 95.0%: 50% credit of monthly fees
Maximum credit: 50% of monthly fees in any calendar month.
Credits are the sole remedy for availability failures.
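To see how a schedule like this turns into a number on an invoice, here is a small sketch that maps measured availability to a credit. The function names are hypothetical, and the boundary handling (each tier includes its lower bound) is an assumption that the contract itself should spell out.

```python
def flat_sla_credit_pct(availability_pct):
    """Return the credit (as % of monthly fees) for a measured availability.

    Tiers mirror the sample schedule above. Assumption: each tier
    includes its lower bound and excludes its upper bound.
    """
    if availability_pct >= 99.9:
        return 0   # SLA met, no credit owed
    if availability_pct >= 99.0:
        return 10
    if availability_pct >= 95.0:
        return 25
    return 50      # also the stated monthly maximum


def credit_amount(availability_pct, monthly_fee):
    """Convert the credit percentage into a currency amount."""
    return monthly_fee * flat_sla_credit_pct(availability_pct) / 100
```

For example, `credit_amount(98.0, 2000)` falls in the 25% tier and yields a credit of 500.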
Tiered SLA by Service Component
Different components have different reliability requirements:
Service Level Commitments
| Component | Monthly Availability | Max Downtime |
|---|---|---|
| Core API | 99.9% | 43 min/month |
| Dashboard | 99.5% | 3.6 hr/month |
| Reporting | 99.0% | 7.2 hr/month |
| Analytics | 98.0% | 14.4 hr/month |
Rationale: Core API impacts all user operations. Reporting is
batch-oriented and less time-critical. Separate SLAs reflect
the actual reliability characteristics of each component.
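The downtime budgets in a table like this follow directly from the availability target. A quick sketch of the arithmetic, assuming a 30-day month (the function name and the default month length are illustrative choices):

```python
def max_downtime_minutes(availability_pct, days_in_month=30):
    """Allowed downtime per month for a given availability target."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (100 - availability_pct) / 100


for component, target in [("Core API", 99.9), ("Dashboard", 99.5),
                          ("Reporting", 99.0), ("Analytics", 98.0)]:
    minutes = max_downtime_minutes(target)
    print(f"{component}: {minutes:.1f} min ({minutes / 60:.1f} hr)")
```

Because the budget scales with the length of the month, it is worth stating in the contract whether the calculation uses calendar months or a fixed 30-day period.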
Response Time SLA
Add latency commitments alongside availability:
Latency Commitments
The API will respond to 99% of requests within 500ms (p99 latency).
The API will respond to 95% of requests within 200ms (p95 latency).
Measurement: Calculated from server-side timing, excluding client
network latency. Measured in 5-minute rolling windows.
Credit for latency SLA breach: 5% of monthly fees for each calendar
day where p99 latency exceeds 1000ms.
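Percentile commitments only work if both sides agree on how the percentile is computed. The sketch below uses the nearest-rank method on server-side latency samples from one measurement window. The method choice and the function names are assumptions; a contract could equally specify interpolation or histogram-based percentiles.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct% of all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_sla_met(latencies_ms, p95_limit_ms=200, p99_limit_ms=500):
    """Check one measurement window against the p95 and p99 limits."""
    return (percentile(latencies_ms, 95) <= p95_limit_ms
            and percentile(latencies_ms, 99) <= p99_limit_ms)
```

With few samples in a window, a single slow request can decide the p99, so low-traffic APIs may want a minimum sample count before a window is evaluated.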
Key Definitions to Negotiate
The precise definitions in an SLA determine what counts as downtime:
Availability Definition
## Negotiating the Availability Definition
"Availability" can mean different things. Define clearly:
Option A (Provider-favorable):
"Availability means the API returns HTTP responses. Slow responses,
partial failures, and error responses do not constitute unavailability."
Option B (Customer-favorable):
"Availability means the API returns successful (2xx) responses within
2000ms. Error responses (5xx) and timeouts constitute unavailability."
Option C (Balanced):
"Availability means the error rate for API requests is below 5%
as measured in any 5-minute window, AND p99 response time
is below 2000ms."
Our recommendation: Use a threshold-based definition (error rate %)
rather than binary up/down. This better reflects real user experience
and is more measurable.
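A threshold-based definition such as Option C can be evaluated mechanically per window. A minimal sketch, assuming each request is recorded as a `(status_code, latency_ms)` pair, that only 5xx responses count as errors, and that a window with no traffic counts as available. All three points are assumptions the contract should settle.

```python
import math


def window_is_available(requests, max_error_rate=0.05, p99_limit_ms=2000):
    """Classify one 5-minute window under a threshold-based definition.

    `requests` is a list of (status_code, latency_ms) tuples.
    Assumption: an empty window counts as available.
    """
    if not requests:
        return True

    errors = sum(1 for status, _ in requests if status >= 500)
    error_rate = errors / len(requests)

    # Nearest-rank p99 over the window's latencies.
    latencies = sorted(latency for _, latency in requests)
    p99 = latencies[math.ceil(99 * len(latencies) / 100) - 1]

    return error_rate < max_error_rate and p99 < p99_limit_ms
```

Monthly availability is then the share of windows for which this returns `True`.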
Measurement Source
## Who Measures Availability?
Option A: Provider measures (using their own internal monitoring)
Risk: Conflict of interest; provider can manipulate measurement
Option B: Customer measures (using their own monitoring)
Risk: Customer's measurement may not reflect actual service state;
may include customer-side network issues
Option C: Third-party measurement (external monitoring service)
Best practice: Use independent external monitoring as the reference.
Example: "Availability is measured by [AzMonitor/third party tool],
using HTTP checks from 3+ geographic regions, with a 60-second
check interval."
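When checks run from several regions, the contract also needs a rule for combining them. One common approach, sketched here as an assumption rather than a standard, is to count a check round as downtime only when a strict majority of regions fail, which filters out problems local to a single probe.

```python
def is_down(region_results, min_regions=3):
    """Decide whether one check round counts as downtime.

    `region_results` maps a region name to True (check passed) or
    False (check failed). Assumption: the service is down only when
    a strict majority of regions report a failure.
    """
    if len(region_results) < min_regions:
        raise ValueError(f"need results from at least {min_regions} regions")
    failures = sum(1 for ok in region_results.values() if not ok)
    return failures > len(region_results) / 2
```

A stricter customer might instead negotiate that any single regional failure counts, so the quorum rule is itself a negotiable term.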
Exclusions
Downtime that doesn't count against the SLA:
## Standard SLA Exclusions
The following are typically excluded from SLA calculations:
1. Scheduled maintenance windows
- Must be announced X hours in advance (typically 48-72 hours)
- Should be limited to N hours/month (typically 4-8 hours)
- Usually must occur during low-usage windows
2. Force majeure events
- Natural disasters, government actions, etc.
- Internet backbone/carrier failures outside provider control
- AWS/GCP/Azure regional outages (if applicable)
3. Customer-caused issues
- Customer-initiated DDoS
- Excessive API usage beyond contract limits
- Customer misconfiguration of their integration
4. Beta features
- Features explicitly marked "beta" or "preview"
- Typically excluded for 6-12 months after introduction
Negotiating tip: Customers should push back on broad exclusions
like "internet failures" that lack clear definitions. Outages caused
by the provider's own infrastructure choices (e.g., single-region
deployment) should not be excused.
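Exclusions change the downtime figure that feeds the availability calculation. As an illustration, this sketch subtracts announced maintenance windows from recorded outages. It assumes both are lists of `(start, end)` datetime pairs and that maintenance windows do not overlap each other; the function name is hypothetical.

```python
from datetime import datetime, timedelta


def countable_downtime_minutes(outages, maintenance_windows):
    """Sum outage minutes that fall outside announced maintenance windows.

    Assumption: maintenance windows do not overlap each other, so
    overlaps can be subtracted one window at a time.
    """
    total = timedelta()
    for out_start, out_end in outages:
        excluded = timedelta()
        for win_start, win_end in maintenance_windows:
            overlap_start = max(out_start, win_start)
            overlap_end = min(out_end, win_end)
            if overlap_end > overlap_start:
                excluded += overlap_end - overlap_start
        total += (out_end - out_start) - excluded
    return total.total_seconds() / 60
```

An outage that starts before a maintenance window and runs into it is only partly excluded, which is usually what both sides intend.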
What Enterprise Customers Will Push For
Know what sophisticated procurement teams will negotiate:
| Customer Request | Provider Position | Compromise |
|---|---|---|
| 99.99% uptime | We deliver 99.9% consistently | 99.95% with carve-outs |
| Unlimited SLA credits | Maximum 50% of monthly fee | Maximum 100% of monthly fee |
| Right to terminate after 1 breach | 3 breaches in 12 months | 2 breaches in 12 months |
| Consequential damages | Limited to fees paid | Fees paid + documented direct costs |
| Real-time uptime data access | API access to monitoring | Dashboard access + monthly report |
| Maintenance windows > 24hr notice | 48 hours | 72 hours for enterprise tier |
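A termination clause such as "2 breaches in 12 months" needs a precise counting rule. The sketch below assumes a trailing 12-month window that includes the current month and counts at most one breach per month. Both choices, and the function name, are assumptions; a contract might use a contract-year window instead.

```python
from datetime import date


def termination_right_triggered(breach_months, as_of,
                                max_breaches=2, window_months=12):
    """True when breaches in the trailing window reach the agreed limit.

    `breach_months` holds the first day of each month in which the
    SLA was missed. Assumption: the window is trailing and includes
    the month of `as_of`.
    """
    def month_index(d):
        return d.year * 12 + d.month

    newest = month_index(as_of)
    oldest_excluded = newest - window_months
    recent = [m for m in set(breach_months)
              if oldest_excluded < month_index(m) <= newest]
    return len(recent) >= max_breaches
```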
Structuring Credits to Align Incentives
Credit schedules should incentivize reliability, not just compensate for failures:
## Credit Schedule Design Principles
1. Make credits meaningful but not punishing
- 10% credit for minor breach: shows seriousness
- 50% maximum: maintains viability while compensating impact
- Don't cap credits below 1 month of downtime value
2. Don't create moral hazard
- Flat credits invite "once we breach, we might as well be completely down" thinking
- Avoid: the same credit for 1 hour and 24 hours of downtime
- Prefer: tiered credits that increase with breach severity
3. Request-based tracking for API SLAs
For APIs with variable usage:
Credit = (failed_requests / total_requests) * monthly_fee * multiplier
This directly correlates credits to actual impact.
4. Automatic application
"Credits shall be applied automatically to the next invoice
without requiring customer request."
This signals confidence in your reliability.
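The request-based formula in principle 3 is straightforward to implement. In this sketch the `multiplier` of 10 is an illustrative value, not a recommendation, and the cap reuses the 50% maximum from principle 1.

```python
def request_based_credit(failed_requests, total_requests, monthly_fee,
                         multiplier=10, max_credit_pct=50):
    """Credit = (failed / total) * monthly_fee * multiplier, capped.

    `multiplier` and `max_credit_pct` are illustrative values that
    would be set during negotiation.
    """
    if total_requests <= 0:
        return 0.0  # no traffic, nothing to compensate
    raw = failed_requests / total_requests * monthly_fee * multiplier
    return min(raw, monthly_fee * max_credit_pct / 100)
```

With a monthly fee of 2000, a 0.5% failure rate produces a credit of 100 at this multiplier, while a 20% failure rate hits the cap of 1000.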
The Credit Request Process
Define how credits are claimed:
## SLA Credit Process
Standard (Provider-favorable):
Customer must request credits within 30 days of the month end.
Credits applied to following month's invoice.
Customer must provide evidence of downtime to claim credits.
Better (More balanced):
Provider automatically calculates and applies credits
without requiring customer request.
Credits appear on the following month's invoice with explanation.
Best practice (Trust-building):
Provider proactively notifies customer when an SLA breach occurred
and the credit amount, before the customer requests it.
This is a competitive differentiator.
Maintenance Windows
Structure maintenance windows to minimize customer impact:
## Maintenance Window Best Practices
Standard window parameters:
- Maximum 4 hours of scheduled maintenance per month
- Minimum 72 hours advance notice for enterprise customers
- Maintenance windows during customer's lowest usage period
(typically 2am-6am in customer's primary timezone)
Notification requirements:
- Email to customer's primary technical contact
- Status page announcement
- In-app notification for maintenance windows > 30 min
Emergency maintenance (unplanned):
- Maximum 4 hours per quarter outside standard windows
- 1-hour notice minimum
- Counts against SLA only if outage exceeds 4 hours
Negotiating tip: Push for customer opt-out or rescheduling rights
for large enterprise accounts. If a customer has a critical business
event during your maintenance window, they should be able to request
a postponement.
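Rules like these are easy to enforce in scheduling tooling before a window is ever announced. A minimal sketch, assuming the 72-hour notice and 4-hour monthly limit from the sample parameters above; the function name and message strings are hypothetical.

```python
from datetime import datetime, timedelta


def validate_maintenance_window(announced_at, start, end,
                                minutes_used_this_month=0,
                                min_notice=timedelta(hours=72),
                                monthly_limit=timedelta(hours=4)):
    """Return a list of rule violations (empty if the window complies)."""
    problems = []
    if end <= start:
        problems.append("window must end after it starts")
    if start - announced_at < min_notice:
        problems.append("insufficient advance notice")
    # Add this window to maintenance time already used this month.
    used = timedelta(minutes=minutes_used_this_month) + (end - start)
    if used > monthly_limit:
        problems.append("monthly maintenance limit exceeded")
    return problems
```

A window that fails validation would not qualify for the scheduled-maintenance exclusion, so any resulting downtime would count against the SLA.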
Conclusion
SLA negotiation works best when you enter it with honesty about your capabilities, data supporting your commitments, and a structure that aligns your incentives with your customers' reliability needs. The specific percentages matter less than the measurement methodology, exclusion definitions, and credit processes — these determine whether the SLA is a meaningful commitment or a paper guarantee. AzMonitor provides the independent external monitoring that serves as an objective measurement source for SLA compliance, removing the conflict of interest inherent in self-reported uptime and giving both sides confidence in the accuracy of the availability data underlying the SLA.