Operational Runbooks¶
For AI Agents¶
Prompt: "The OAuth2 server is down in production - help me diagnose and fix the issue using the runbooks"
Common operational scenarios:
| Scenario | Prompt Example |
|---|---|
| Service health check | "Check if the OAuth2 server is healthy and ready" |
| High latency | "The token endpoint is slow - help diagnose the performance issue" |
| Failed deployment | "The latest deployment failed - help me roll back safely" |
| Database issues | "The readiness check is failing - troubleshoot database connectivity" |
| Token errors | "Users are getting 'invalid_token' errors - diagnose the issue" |
| Storage full | "The database is running out of space - help me clean up old tokens" |
| Memory leak | "The server memory usage is growing - help identify the leak" |
| Rate limit tuning | "Too many rate limit rejections - adjust the thresholds" |
First-response checklist:
curl http://localhost:8080/health # Basic liveness
curl http://localhost:8080/ready # Storage connectivity
curl http://localhost:8080/metrics # Current metrics
kubectl get pods -n oauth2-server # Pod status
kubectl logs -n oauth2-server -l app=oauth2-server # Recent logs
Quick actions:
- Roll back: kubectl rollout undo deployment/oauth2-server -n oauth2-server
- Restart: kubectl rollout restart deployment/oauth2-server -n oauth2-server
- Scale: kubectl scale deployment/oauth2-server --replicas=3 -n oauth2-server
This page is the first-response sheet, not an operations novel. Use it when the service is unhealthy, slow, or freshly deployed and suspicious.
First five minutes¶
Start here before guessing:
curl -fsS http://localhost:8080/health
curl -fsS http://localhost:8080/ready
curl -fsS http://localhost:8080/metrics | head
kubectl get pods -n oauth2-server
kubectl logs -n oauth2-server -l app=oauth2-server --tail=100
If /ready fails, treat it as a storage or config problem first. If /health is green but latency is bad, use metrics and recent deploy history before touching the database.
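The triage rule above can be sketched as a small helper. This is an illustration, not part of the server: the function names and the BASE_URL variable are hypothetical, and it assumes both endpoints return HTTP 200 when healthy.

```shell
#!/bin/sh
# Sketch: map /health and /ready status codes to the first thing to check.
# BASE_URL, probe, and triage are illustrative names, not part of the server.
BASE_URL="${BASE_URL:-http://localhost:8080}"

# Print the HTTP status code for a path; curl reports 000 when no
# HTTP response was received (connection refused, timeout, ...).
probe() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 5 "${BASE_URL}$1" || true
}

# Decide where to look first, given the /health and /ready status codes.
# Assumption: 200 means healthy for both endpoints.
triage() {
  if [ "$1" != "200" ]; then
    echo "liveness failing: check pod status and logs"
  elif [ "$2" != "200" ]; then
    echo "readiness failing: check storage and config first"
  else
    echo "health and readiness ok: check metrics and recent deploys"
  fi
}

# Usage: triage "$(probe /health)" "$(probe /ready)"
```

Keeping the decision in a function separate from the probing makes it easy to reuse against a port-forwarded pod by changing only BASE_URL.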
Roll back a bad deploy¶
Kubernetes¶
kubectl rollout history deployment/oauth2-server -n oauth2-server
kubectl rollout undo deployment/oauth2-server -n oauth2-server
kubectl rollout status deployment/oauth2-server -n oauth2-server
Docker Compose¶
docker compose ps
docker compose logs --tail=100 oauth2-server
docker compose down
docker compose up -d
Roll back fast when a deploy correlates with new 5xx, readiness failures, or broken admin login.
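If the previous revision is also suspect, the Kubernetes steps can target a specific revision from the rollout history. A minimal sketch, assuming the namespace and deployment names used on this page; the rollback function name and the 120s timeout are arbitrary choices for illustration.

```shell
#!/bin/sh
# Sketch: roll back, then wait for the rollout to settle.
NAMESPACE="oauth2-server"
DEPLOYMENT="deployment/oauth2-server"

# Roll back to the previous revision, or to a specific revision if a
# number from `kubectl rollout history` is passed as the first argument.
rollback() {
  if [ -n "${1:-}" ]; then
    kubectl rollout undo "$DEPLOYMENT" -n "$NAMESPACE" --to-revision="$1" || return 1
  else
    kubectl rollout undo "$DEPLOYMENT" -n "$NAMESPACE" || return 1
  fi
  # Block until the rollback completes or the timeout expires.
  kubectl rollout status "$DEPLOYMENT" -n "$NAMESPACE" --timeout=120s
}

# Usage: rollback        (previous revision)
#        rollback 3      (revision 3 from the rollout history)
```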
Readiness failing¶
Work this list in order:
- verify OAUTH2_DATABASE_URL and any secret-backed env vars
- check migration status (./scripts/migrate.sh locally, Flyway job in Kubernetes)
- inspect database logs
- confirm the app can resolve the database hostname
Useful checks:
kubectl logs postgres-0 -n oauth2-server --tail=100
kubectl get job -n oauth2-server
kubectl describe pod -n oauth2-server <pod-name>
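To confirm hostname resolution, it helps to pull the host out of the connection string first. The helper below is a sketch that assumes OAUTH2_DATABASE_URL has the common scheme://user:password@host:port/database shape; db_host is a hypothetical name, and the parsing does not handle IPv6 literals or passwords containing an unescaped slash.

```shell
#!/bin/sh
# Sketch: extract the hostname from a database URL using only
# POSIX parameter expansion. db_host is an illustrative helper.
db_host() {
  rest="${1#*://}"    # drop the scheme
  rest="${rest%%/*}"  # drop the path and anything after it
  rest="${rest##*@}"  # drop user:password@ if present
  echo "${rest%%:*}"  # drop :port if present
}

# Usage (assumes the container image ships getent, which may not be true):
#   kubectl exec -n oauth2-server <pod-name> -- \
#     getent hosts "$(db_host "$OAUTH2_DATABASE_URL")"
```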
High 5xx or latency¶
Use data before heroics:
- compare request rate and latency in /metrics
- check recent config or image changes
- inspect database saturation and connection pressure
- if eventing is enabled, confirm /events/health is not degraded
Focus on:
- request error rate
- request latency percentiles
- database query latency
- rate-limit rejection spikes
- restart counts and rollout events
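As a rough way to put a number on the request error rate, the sketch below sums a counter from Prometheus text output and reports the 5xx share. The metric name http_requests_total and its status label are assumptions for illustration; check /metrics for the names this server actually exports. It also assumes samples carry no trailing timestamp.

```shell
#!/bin/sh
# Sketch: percentage of requests with a 5xx status.
# Reads Prometheus text format on stdin; $1 is the counter name.
# The metric name and the status label are assumptions, not confirmed names.
error_rate() {
  awk -v metric="$1" '
    # Match only samples of the requested metric, not HELP/TYPE comments.
    index($0, metric "{") == 1 || index($0, metric " ") == 1 {
      value = $NF
      total += value
      if ($0 ~ /status="5[0-9][0-9]"/) errors += value
    }
    END {
      if (total > 0) printf "%.2f\n", 100 * errors / total
      else print "0.00"
    }
  '
}

# Usage: curl -fsS http://localhost:8080/metrics | error_rate http_requests_total
```

Counters are cumulative since process start, so this gives a lifetime ratio; comparing two samples taken a minute apart is a better signal during an incident.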
Eventing health failing¶
If GET /events/health is unhappy:
- confirm the configured backend matches the build features
- verify backend URLs (OAUTH2_EVENTS_*)
- look for fallback warnings in logs
The safe default remains in_memory, so a broker failure usually means degraded integration behavior rather than total server death.
Admin or auth login failures¶
Check these first:
- seed admin credentials (OAUTH2_SEED_USERNAME, OAUTH2_SEED_PASSWORD)
- session key stability (OAUTH2_SESSION_KEY)
- externally visible URL and proxy headers (OAUTH2_SERVER_PUBLIC_BASE_URL, OAUTH2_SERVER_TRUST_PROXY_HEADERS)
- social provider configuration for the specific /auth/login/{provider} route
Remember that Okta and Auth0 routes currently return 503 by design.
Rotate signing material¶
There are two different operations:
- JWT secret rotation: rotate OAUTH2_JWT_SECRET, redeploy, and expect existing HS256-signed tokens to stop validating
- keyset rotation: use POST /admin/api/keys/rotate when you are using managed signing keys and an authenticated admin session
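To see why existing HS256 tokens stop validating after a secret rotation, the sketch below signs the same input with two secrets. It assumes openssl is installed; hs256_sign and the secret values are placeholders, and the server's own token handling is not shown here.

```shell
#!/bin/sh
# Sketch: base64url HMAC-SHA256 signature over a JWT signing input
# ("<header>.<payload>"). $1 is the secret, $2 is the signing input.
# Note: passing a real secret as an argument exposes it in the process list.
hs256_sign() {
  printf '%s' "$2" \
    | openssl dgst -sha256 -hmac "$1" -binary \
    | openssl base64 -A \
    | tr '+/' '-_' \
    | tr -d '='
}

# Usage: the two outputs differ, so a token signed with the old secret
# fails verification once the server only knows the new one.
#   hs256_sign "old-secret" "header.payload"
#   hs256_sign "new-secret" "header.payload"
```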
After rotation, verify:
Revoke everything fast¶
There is no single “revoke all” endpoint. Your practical options are:
- rotate JWT secret or signing keys
- restart with new session and admin credentials if compromise is broader
- document the incident and the cutoff timestamp
Backups and restore¶
Database backup strategy is deployment-specific, so this page does not pretend every team uses the same S3 bucket and cron job.
Use your platform-native Postgres backup process, and verify restores on a non-production environment. For Kubernetes-specific mechanics, use the Kubernetes README.