GraphQL solves real problems — flexible queries, fewer round trips, self-documenting schemas — but it introduces monitoring challenges that traditional REST tooling wasn't built to handle. With REST, every endpoint is a separate URL you can check independently. With GraphQL, everything goes through a single endpoint, and the complexity lives in the request body. This fundamentally changes how you approach monitoring.
Why GraphQL Monitoring Is Different
A REST monitoring check is simple: send a GET to /api/users, expect a 200. GraphQL isn't like that. All queries go to POST /graphql, which means:
- You can't infer what's being queried from the URL alone
- A 200 response can still contain errors in the response body
- One slow resolver can hold up the entire query response
- N+1 query problems can hide behind a healthy-looking endpoint
Your monitoring strategy has to account for all of these.
The GraphQL Error Model
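GraphQL servers typically return HTTP 200 whenever a request executes, even if one or more fields failed. Malformed or invalid requests may still produce a 4xx, depending on the server. Execution errors live in the response body: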
{
  "data": {
    "user": null
  },
  "errors": [
    {
      "message": "User not found",
      "locations": [{"line": 2, "column": 3}],
      "path": ["user"],
      "extensions": {
        "code": "USER_NOT_FOUND"
      }
    }
  ]
}
This means a traditional HTTP status check will show green while your users are actually getting errors. Your monitoring must parse the response body and check for the errors field.
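As a rough sketch of that body check in JavaScript: the helper name `evaluateGraphQLResponse` and the shape of its return value are illustrative assumptions, not part of any monitoring tool's API. It treats a response as healthy only when the status is 200, the `errors` array is absent or empty, and `data` is present.

```javascript
// Hypothetical helper: decides whether a GraphQL HTTP response is healthy.
// `status` is the HTTP status code; `body` is the parsed JSON response.
function evaluateGraphQLResponse(status, body) {
  const failures = [];

  if (status !== 200) {
    failures.push(`unexpected HTTP status ${status}`);
  }

  // A 200 response can still carry execution errors in the body.
  const errors = body && Array.isArray(body.errors) ? body.errors : [];
  for (const err of errors) {
    const code = err.extensions && err.extensions.code ? err.extensions.code : 'UNKNOWN';
    const path = Array.isArray(err.path) ? err.path.join('.') : '(no path)';
    failures.push(`${code} at ${path}: ${err.message}`);
  }

  // Missing or null `data` means nothing usable was returned.
  if (!body || body.data === undefined || body.data === null) {
    failures.push('response has no data');
  }

  return { healthy: failures.length === 0, failures };
}
```

Returning the list of failures, rather than a bare boolean, lets the alert message say which field failed and with which error code.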
Writing GraphQL Assertions
monitor:
  name: "GraphQL - User Profile Query"
  url: "https://api.example.com/graphql"
  method: POST
  headers:
    Content-Type: "application/json"
    Authorization: "Bearer ${MONITORING_TOKEN}"
  body: |
    {
      "query": "query MonitorCheck { user(id: \"monitor-test-user\") { id name email } }"
    }
  assertions:
    - type: status_code
      value: 200
    - type: json_body
      path: "$.errors"
      operator: not_exists
    - type: json_body
      path: "$.data.user.id"
      operator: exists
    - type: response_time
      operator: less_than
      value: 1500
The key assertion here is checking that $.errors does not exist. If it does, the monitor fails even though the HTTP status was 200.
Monitoring Individual Resolvers
The hardest performance question in GraphQL is which resolver is slow. A query that fetches a user with their posts and comments can trigger three or more database queries, and the response is only as fast as its slowest resolver.
Using Apollo Studio / GraphOS
If you're using Apollo Server, Apollo Studio provides resolver-level tracing out of the box. Enable it with:
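// Apollo Server setup (import paths assume Apollo Server 4).
// Usage reporting also needs the APOLLO_KEY and APOLLO_GRAPH_REF
// environment variables to be set.
import { ApolloServer } from '@apollo/server';
import { ApolloServerPluginUsageReporting } from '@apollo/server/plugin/usageReporting';

const server = new ApolloServer({
  typeDefs,
  resolvers,
  plugins: [
    ApolloServerPluginUsageReporting({
      sendVariableValues: { none: true }, // Don't send variable values (may contain PII)
      sendHeaders: { onlyNames: ['x-request-id'] }, // Report only this header
    }),
  ],
});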
This gives you per-resolver latency breakdowns, which you can use to identify slow resolvers before they become incidents.
Custom Resolver Timing
If you're not using Apollo, add resolver timing manually:
// Resolver wrapper for timing.
// `metrics` is assumed to be your own StatsD-style metrics client.
function withTiming(name, resolver) {
  return async (parent, args, context, info) => {
    const start = Date.now();
    try {
      return await resolver(parent, args, context, info);
    } catch (error) {
      metrics.increment('resolver.error', { resolver: name });
      throw error;
    } finally {
      // Record the duration for both successful and failed calls
      metrics.histogram('resolver.duration', Date.now() - start, { resolver: name });
    }
  };
}

// Usage
const resolvers = {
  Query: {
    user: withTiming('Query.user', async (_, { id }, { db }) => {
      return db.users.findById(id);
    }),
  },
};
Detecting the N+1 Problem
The N+1 query problem is GraphQL's most notorious performance killer. When you fetch a list of users and each user resolver makes a separate database call for their posts, you get N+1 queries for N users.
Signs of N+1 in monitoring:
| Users Returned | Expected DB Queries | Actual DB Queries (N+1) |
|---|---|---|
| 10 | 1 | 11 |
| 50 | 1 | 51 |
| 100 | 1 | 101 |
If response time scales linearly with result set size, you likely have N+1. Set up monitors that vary the query complexity:
# Small query - baseline
query SmallCheck {
users(limit: 5) { id name posts { id title } }
}
# Medium query - should not be 10x slower
query MediumCheck {
users(limit: 50) { id name posts { id title } }
}
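If the medium query takes close to 10x longer than the small query, latency is growing in step with the number of users and you likely have an N+1 problem. Compare medians over several runs so a single slow request doesn't skew the result.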
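One way to turn that comparison into a number is sketched below. The helper names and the 0.5 threshold are assumptions chosen for illustration; tune the threshold against your own baseline timings.

```javascript
// Hypothetical helper: compares latency growth to result-set growth.
// A ratio near 1 means latency grows in step with the number of rows
// (consistent with N+1); a ratio near 0 means latency is roughly flat.
function scalingRatio(small, medium) {
  if (medium.rows <= small.rows) {
    throw new Error('medium check must return more rows than the small check');
  }
  const sizeGrowth = medium.rows / small.rows;
  const latencyGrowth = medium.ms / small.ms;
  return (latencyGrowth - 1) / (sizeGrowth - 1);
}

// Assumed threshold: flag the endpoint when latency grows at least
// half as fast as the result set.
function looksLikeNPlusOne(small, medium, threshold = 0.5) {
  return scalingRatio(small, medium) >= threshold;
}
```

For example, `looksLikeNPlusOne({ rows: 5, ms: 40 }, { rows: 50, ms: 400 })` flags the endpoint, while the same check with a 60 ms medium query does not.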
Schema Change Monitoring
GraphQL schemas evolve, and breaking changes can silently break clients. Monitor for schema changes as part of your CI pipeline:
# Using graphql-inspector (arguments are OLD schema, then NEW schema)
npx @graphql-inspector/cli diff \
  "https://api.example.com/graphql" \
  "./schema/current.graphql"
# Illustrative output:
# ✖ Field 'User.email' changed type from 'String!' to 'String'
# ✖ Field 'User.avatar' was removed
# ✓ Field 'User.profilePicture' was added
Integrate this into your deployment pipeline so breaking changes are caught before they reach production:
# GitHub Actions example
# diff takes OLD then NEW: production is the baseline, staging is the candidate
- name: Check for GraphQL breaking changes
  run: |
    npx @graphql-inspector/cli diff \
      "https://api.example.com/graphql" \
      "https://staging-api.example.com/graphql" \
      --rule suppressRemovalOfDeprecatedField
Query Complexity and Depth Limits
Expensive queries — deeply nested, requesting large collections — can overwhelm your server. Monitor query complexity limits to ensure they're working correctly:
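// GraphQL server with depth and complexity limits
import { ApolloServer } from '@apollo/server';
import depthLimit from 'graphql-depth-limit';
import { createComplexityRule, simpleEstimator } from 'graphql-query-complexity';

const server = new ApolloServer({
  typeDefs,
  resolvers,
  validationRules: [
    depthLimit(7), // Reject queries nested deeper than 7 levels
    createComplexityRule({
      maximumComplexity: 1000,
      // Assumed cost model: every field costs 1 unless another estimator applies
      estimators: [simpleEstimator({ defaultComplexity: 1 })],
      onComplete: (complexity) => {
        console.log('Query complexity:', complexity);
        metrics.histogram('graphql.query.complexity', complexity);
      },
    }),
  ],
});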
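Your monitoring should include two probes: a query that approaches but doesn't exceed your limit, which should succeed, and one that deliberately exceeds it, which should be rejected. Together they verify that the protection is active without blocking legitimate queries.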
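Probe queries for a depth limit can be generated rather than written by hand. The sketch below assumes a hypothetical schema in which `user` has a recursive `friends` field; substitute whatever self-referencing field your schema offers. How a server counts depth varies, so calibrate the two level values against your own limit.

```javascript
// Hypothetical probe builder: nests the assumed `friends` field `levels` times.
function buildNestedQuery(levels) {
  let selection = 'id';
  for (let i = 0; i < levels; i++) {
    selection = `id friends { ${selection} }`;
  }
  return `query DepthProbe { user(id: "monitor-test-user") { ${selection} } }`;
}

// The limit is working when the shallow probe returns no errors
// and the deep probe is rejected with at least one error.
function limitIsEnforced(underLimitBody, overLimitBody) {
  const hasErrors = (body) => Array.isArray(body.errors) && body.errors.length > 0;
  return !hasErrors(underLimitBody) && hasErrors(overLimitBody);
}
```

Send both generated queries from the same monitor run and pass the parsed response bodies to `limitIsEnforced`.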
Subscription Monitoring
GraphQL subscriptions add WebSocket connections to the mix. Monitoring subscriptions requires:
- Establishing a WebSocket connection
- Sending a subscription message
- Verifying you receive events within a timeout window
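// Subscription health check (Node.js, using the `ws` package).
// This verifies the connection handshake only; it does not wait for an event.
import WebSocket from 'ws';

async function checkSubscriptionHealth(wsUrl, authToken) {
  return new Promise((resolve, reject) => {
    const startTime = Date.now();
    // 'graphql-ws' is the legacy subscriptions-transport-ws subprotocol;
    // servers built on the graphql-ws library expect 'graphql-transport-ws'.
    const ws = new WebSocket(wsUrl, 'graphql-ws');
    const timeout = setTimeout(() => {
      ws.close();
      reject(new Error('Subscription timeout'));
    }, 10000);

    ws.on('open', () => {
      ws.send(JSON.stringify({
        type: 'connection_init',
        payload: { authorization: authToken },
      }));
    });

    ws.on('message', (data) => {
      const msg = JSON.parse(data.toString());
      if (msg.type === 'connection_ack') {
        clearTimeout(timeout);
        ws.close();
        resolve({ healthy: true, latency: Date.now() - startTime });
      }
    });

    // Fail fast on connection errors instead of waiting for the timeout
    ws.on('error', (err) => {
      clearTimeout(timeout);
      reject(err);
    });
  });
}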
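To cover the second and third steps in the list (subscribing and receiving an event), the message handling can be kept as a small pure function and wired into the socket's message handler. This is a sketch that assumes the newer graphql-transport-ws message types (`subscribe`, `next`, `error`, `complete`); the legacy protocol uses different type names. The function name and state labels are invented for illustration.

```javascript
// Hypothetical pure handler for the graphql-transport-ws message flow.
// Given the current state and an incoming message, it returns the next
// state and an optional message to send back to the server.
function handleMessage(state, msg, query) {
  if (state === 'connecting' && msg.type === 'connection_ack') {
    // Handshake done: start the subscription.
    return {
      state: 'subscribed',
      send: { id: '1', type: 'subscribe', payload: { query } },
    };
  }
  if (state === 'subscribed' && msg.id === '1') {
    // First event received: the check passes, so stop the subscription.
    if (msg.type === 'next') return { state: 'healthy', send: { id: '1', type: 'complete' } };
    // The server rejected or ended the subscription before any event arrived.
    if (msg.type === 'error' || msg.type === 'complete') return { state: 'failed', send: null };
  }
  // Ignore pings and unrelated messages.
  return { state, send: null };
}
```

The check only passes if the chosen subscription emits at least one event inside the timeout window, so it works best against a subscription that fires on a predictable schedule.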
Monitoring Federated GraphQL
If you're using Apollo Federation or another federated GraphQL gateway, you need to monitor both the gateway and the individual subgraphs:
| Layer | What to Monitor | Alert Threshold |
|---|---|---|
| Gateway | Query routing latency | > 50ms overhead |
| Auth subgraph | Token validation time | > 100ms |
| User subgraph | Entity resolution time | > 200ms |
| Products subgraph | Catalog query time | > 300ms |
| Gateway | Overall query latency | > 1000ms p95 |
If the gateway is slow but the subgraphs are fast, the issue is most likely in query planning or other gateway overhead. If a subgraph is slow, look at that service's application code or database.
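That triage rule can be sketched as a small function. The thresholds are taken from the table above; the key names and the `diagnose` helper are assumptions made for this example, and the measurements are expected to come from your own tracing.

```javascript
// Thresholds in milliseconds, taken from the table above.
const THRESHOLDS_MS = {
  gatewayOverhead: 50,
  authSubgraph: 100,
  userSubgraph: 200,
  productsSubgraph: 300,
  gatewayTotalP95: 1000,
};

// Hypothetical triage helper: lists the layers over threshold and
// names the most likely place to start looking.
function diagnose(measurementsMs) {
  const breaches = Object.keys(THRESHOLDS_MS).filter(
    (layer) => measurementsMs[layer] > THRESHOLDS_MS[layer]
  );
  const subgraphSlow = breaches.some((layer) => layer.endsWith('Subgraph'));
  let suspect = 'none';
  if (subgraphSlow) suspect = 'subgraph';
  else if (breaches.length > 0) suspect = 'gateway';
  return { breaches, suspect };
}
```

A slow subgraph takes priority over a gateway breach here because a slow subgraph also inflates the gateway's overall latency.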
Setting Up Effective Alerts
GraphQL-specific alert conditions to configure:
alerts:
  - name: "GraphQL Error Rate High"
    condition: "graphql_error_rate > 2%"
    window: "5 minutes"
    severity: critical
  - name: "Slow Resolver Detected"
    condition: "resolver_p95_latency{resolver='Query.products'} > 500ms"
    window: "10 minutes"
    severity: warning
  - name: "Query Complexity Spike"
    condition: "avg_query_complexity > 800"
    window: "5 minutes"
    severity: warning
  - name: "Schema Validation Errors"
    condition: "graphql_validation_errors > 0"
    window: "1 minute"
    severity: critical
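An error-rate alert needs a metric that counts responses containing errors, not HTTP failures. If you run Apollo Server, one possible source is a small plugin like the sketch below. The `metrics` client and the counter names are assumptions; `didEncounterErrors` and `willSendResponse` are Apollo Server request lifecycle hooks.

```javascript
// Hypothetical Apollo Server plugin that feeds an error-rate alert.
// `metrics` is an assumed client exposing increment(name).
function createErrorRatePlugin(metrics) {
  return {
    async requestDidStart() {
      let failed = false;
      return {
        // Fires when parsing, validation, or execution produces errors.
        async didEncounterErrors() {
          failed = true;
        },
        // Fires once per request, just before the response is sent.
        async willSendResponse() {
          metrics.increment('graphql.requests.total');
          if (failed) metrics.increment('graphql.requests.with_errors');
        },
      };
    },
  };
}
```

The error rate is then the ratio of the two counters over the alert window, computed in your metrics backend.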
Practical Monitoring Checklist
Before considering your GraphQL API properly monitored, verify:
- [ ] Response body is parsed and the errors field is checked
- [ ] At least one query per major entity type is monitored
- [ ] Resolver latency is tracked via APM or custom instrumentation
- [ ] N+1 detection queries are running at different result set sizes
- [ ] Schema change detection is in CI pipeline
- [ ] Query complexity limits are tested
- [ ] Subscription health is checked if subscriptions are in use
- [ ] Federation subgraphs are monitored independently
- [ ] Error rate alerting counts errors returned inside HTTP 200 responses
Conclusion
GraphQL monitoring requires reading past the HTTP status code and understanding what's actually happening inside your queries. The teams that invest in proper GraphQL observability — resolver tracing, error body parsing, complexity monitoring, and schema change detection — catch problems in minutes rather than hours. AzMonitor supports custom request bodies and response assertions that work well with GraphQL's single-endpoint model, letting you build comprehensive query health checks without specialized tooling.