Operational Runbooks¶

For AI Agents¶

Prompt: "The OAuth2 server is down in production - help me diagnose and fix the issue using the runbooks"

Common operational scenarios:

Scenario	Prompt Example
Service health check	"Check if the OAuth2 server is healthy and ready"
High latency	"The token endpoint is slow - help diagnose the performance issue"
Failed deployment	"The latest deployment failed - help me roll back safely"
Database issues	"The readiness check is failing - troubleshoot database connectivity"
Token errors	"Users are getting 'invalid_token' errors - diagnose the issue"
Storage full	"The database is running out of space - help me clean up old tokens"
Memory leak	"The server memory usage is growing - help identify the leak"
Rate limit tuning	"Too many rate limit rejections - adjust the thresholds"

First-response checklist:

curl http://localhost:8080/health     # Basic liveness
curl http://localhost:8080/ready      # Storage connectivity
curl http://localhost:8080/metrics    # Current metrics
kubectl get pods -n oauth2-server     # Pod status
kubectl logs -n oauth2-server -l app=oauth2-server     # Recent logs

Quick actions:

- Roll back: kubectl rollout undo deployment/oauth2-server -n oauth2-server
- Restart: kubectl rollout restart deployment/oauth2-server -n oauth2-server
- Scale: kubectl scale deployment/oauth2-server --replicas=3 -n oauth2-server

This page is the first-response sheet, not an operations novel. Use it when the service is unhealthy, slow, or freshly deployed and suspicious.

First five minutes¶

Start here before guessing:

curl -fsS http://localhost:8080/health
curl -fsS http://localhost:8080/ready
curl -fsS http://localhost:8080/metrics | head
kubectl get pods -n oauth2-server
kubectl logs -n oauth2-server -l app=oauth2-server --tail=100

If /ready fails, treat it as a storage or config problem first. If /health is green but latency is bad, use metrics and recent deploy history before touching the database.
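The decision rule above can be sketched as a small shell helper. This is an illustration only: `BASE_URL` and the `triage` function are names made up here, and the hint for a failing `/health` is a reasonable default rather than something this runbook prescribes.

```shell
# Sketch only: BASE_URL and triage() are hypothetical names, not part of the server.
BASE_URL="${BASE_URL:-http://localhost:8080}"

# Print a one-line hint about where to look first, based on /health and /ready.
triage() {
  if ! curl -fsS -o /dev/null "$BASE_URL/health"; then
    echo "liveness failing: check pod status and recent logs"
  elif ! curl -fsS -o /dev/null "$BASE_URL/ready"; then
    echo "readiness failing: check storage and config first"
  else
    echo "health and readiness OK: check metrics and recent deploys"
  fi
}
```

Run `triage` from a shell that can reach the service. The function only reads the two endpoints and changes nothing.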

Roll back a bad deploy¶

Kubernetes¶

kubectl rollout history deployment/oauth2-server -n oauth2-server
kubectl rollout undo deployment/oauth2-server -n oauth2-server
kubectl rollout status deployment/oauth2-server -n oauth2-server
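
By default `kubectl rollout undo` returns to the previous revision. If that revision is also suspect, you can pick one from the history and target it explicitly with `--to-revision`. A minimal sketch follows; the `rollback_to` wrapper and the 120-second timeout are illustrative choices, not project tooling.

```shell
# Roll back to an explicit revision taken from `kubectl rollout history`.
# rollback_to() is a hypothetical helper. Usage: rollback_to <revision-number>
rollback_to() {
  rev="$1"
  case "$rev" in
    ''|*[!0-9]*)
      echo "usage: rollback_to <revision-number>" >&2
      return 2
      ;;
  esac
  kubectl rollout undo deployment/oauth2-server -n oauth2-server --to-revision="$rev" &&
    kubectl rollout status deployment/oauth2-server -n oauth2-server --timeout=120s
}
```

The numeric check is there so a mistyped argument fails before anything touches the cluster.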

Docker Compose¶

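docker compose ps
docker compose logs --tail=100 oauth2-server
# Point the service back at the last known-good image tag or compose file before recreating;
# down/up on its own only restarts the current version.
docker compose down
docker compose up -d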

Roll back fast when a deploy correlates with new 5xx, readiness failures, or broken admin login.

Readiness failing¶

Work this list in order:

verify OAUTH2_DATABASE_URL and any secret-backed env vars
check migration status (./scripts/migrate.sh locally, Flyway job in Kubernetes)
inspect database logs
confirm the app can resolve the database hostname

Useful checks:

kubectl logs postgres-0 -n oauth2-server --tail=100
kubectl get job -n oauth2-server
kubectl describe pod -n oauth2-server <pod-name>
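
For the hostname step in the list above, it can help to pull the host out of the connection string rather than retype it. The sketch below assumes `OAUTH2_DATABASE_URL` has the usual `scheme://user:pass@host:port/db` shape. `db_host` is a hypothetical helper and does not handle bracketed IPv6 hosts.

```shell
# Extract the host from a URL such as postgres://user:pass@db.internal:5432/oauth2.
# Pure string handling; the URL shape is an assumption about OAUTH2_DATABASE_URL.
db_host() {
  rest="${1#*://}"     # drop the scheme
  rest="${rest%%/*}"   # drop the path and anything after it
  rest="${rest##*@}"   # drop credentials, if present
  echo "${rest%%:*}"   # drop the port, if present
}
```

One way to use it is `getent hosts "$(db_host "$OAUTH2_DATABASE_URL")"`, run from somewhere that shares the app's DNS view. Whether `getent` exists inside the application image is not something this page confirms.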

High `5xx` or latency¶

Use data before heroics:

compare request rate and latency in /metrics
check recent config or image changes
inspect database saturation and connection pressure
if eventing is enabled, confirm /events/health is not degraded

Focus on:

request error rate
request latency percentiles
database query latency
rate-limit rejection spikes
restart counts and rollout events
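
As a rough way to turn the metrics output into a request error rate, the sketch below sums a request counter by status class. The metric name `http_requests_total` and its `status` label are assumptions for illustration; check the actual `/metrics` output and adjust the pattern. It also assumes the sample value is the last field on the line, with no trailing timestamp.

```shell
# Read Prometheus text-format metrics on stdin and print the share of 5xx responses.
# ASSUMPTION: a counter named http_requests_total carrying a status="NNN" label.
error_ratio() {
  awk '
    /^http_requests_total[{]/ {
      total += $NF
      if ($0 ~ /status="5[0-9][0-9]"/) errors += $NF
    }
    END {
      if (total == 0) { print "no matching samples"; exit }
      printf "%d/%d requests were 5xx (%.1f%%)\n", errors, total, 100 * errors / total
    }'
}
```

Typical use would be `curl -fsS http://localhost:8080/metrics | error_ratio`. Counters are cumulative since process start, so this is a lifetime ratio; compare two readings taken a minute apart to see the current trend.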

Eventing health failing¶

If GET /events/health is unhappy:

confirm the configured backend matches the build features
verify backend URLs (OAUTH2_EVENTS_*)
look for fallback warnings in logs

The safe default remains in_memory, so a broker failure usually means degraded integration behavior rather than total server death.
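
To see which eventing settings the process actually received, you can list the `OAUTH2_EVENTS_*` variables without echoing their values, since broker URLs may embed credentials. `events_env` is a hypothetical helper that inspects the current shell's environment.

```shell
# Print the names of OAUTH2_EVENTS_* variables in the current environment,
# with values masked because backend URLs can contain credentials.
events_env() {
  env | sed -n 's/^\(OAUTH2_EVENTS_[A-Za-z0-9_]*\)=.*/\1=<set>/p' | sort
}
```

For the log check, filtering recent logs with `grep -i fallback` is a reasonable starting guess. The exact wording of the warning is not documented here.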

Login failing¶

Check these first:

seed admin credentials (OAUTH2_SEED_USERNAME, OAUTH2_SEED_PASSWORD)
session key stability (OAUTH2_SESSION_KEY)
externally visible URL and proxy headers (OAUTH2_SERVER_PUBLIC_BASE_URL, OAUTH2_SERVER_TRUST_PROXY_HEADERS)
social provider configuration for the specific /auth/login/{provider} route

Remember that Okta and Auth0 routes currently return 503 by design.
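
A quick way to cover the first three items is to confirm the variables are present without printing them. The helper below is a sketch; `missing_env` is an invented name, and it only inspects the environment of the shell it runs in.

```shell
# Print the name of every listed variable that is unset or empty.
# Returns non-zero if anything is missing. Values are never printed.
missing_env() {
  rc=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name"
      rc=1
    fi
  done
  return "$rc"
}
```

For example: `missing_env OAUTH2_SEED_USERNAME OAUTH2_SEED_PASSWORD OAUTH2_SESSION_KEY OAUTH2_SERVER_PUBLIC_BASE_URL OAUTH2_SERVER_TRUST_PROXY_HEADERS`. Presence is only half the check for the session key: if it changes between restarts, existing sessions will likely stop being accepted.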

Rotate signing material¶

There are two different operations:

JWT secret rotation: rotate OAUTH2_JWT_SECRET, redeploy, and expect existing HS256-signed tokens to stop validating
keyset rotation: when you use managed signing keys, call POST /admin/api/keys/rotate from an authenticated admin session

After rotation, verify:

curl -fsS http://localhost:8080/.well-known/jwks.json
curl -fsS http://localhost:8080/health
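
For keyset rotation, comparing the published key IDs before and after makes the JWKS check more than a visual skim. The sketch below relies only on the standard JWKS layout, a `keys` array whose entries carry a `kid`. `jwks_kids` is a hypothetical helper, and whether the previous key stays published after rotation is something to confirm against this server.

```shell
# Print the key IDs ("kid") found in a JWKS document on stdin, one per line, sorted.
jwks_kids() {
  grep -o '"kid"[[:space:]]*:[[:space:]]*"[^"]*"' |
    sed 's/.*"\([^"]*\)"$/\1/' |
    sort
}
```

Capture `curl -fsS http://localhost:8080/.well-known/jwks.json | jwks_kids` once before rotating and once after, then diff the two outputs. A shared HS256 secret is not normally published in a JWKS, so this check is mainly useful for managed keys.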

Revoke everything fast¶

There is no single “revoke all” endpoint. Your practical options are:

rotate JWT secret or signing keys
restart with a new session key and new admin credentials if the compromise is broader
document the incident and the cutoff timestamp
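
Two small helpers can support these steps: one generates replacement secret material and one produces a UTC cutoff timestamp for the incident record. Both are sketches. The 48-byte length is an arbitrary choice, and this page does not state what format or minimum length the server expects for `OAUTH2_JWT_SECRET`.

```shell
# Generate a random secret: 48 random bytes, base64-encoded on a single line.
# ASSUMPTION: the server accepts an arbitrary base64 string as a secret.
new_secret() {
  head -c 48 /dev/urandom | base64 | tr -d '\n'
}

# UTC timestamp to record as the incident cutoff, e.g. in the incident ticket.
cutoff_timestamp() {
  date -u +%Y-%m-%dT%H:%M:%SZ
}
```

Store the generated value through your normal secret-management path rather than in shell history, and note the cutoff at the moment the new secret goes live.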

Backups and restore¶

Database backup strategy is deployment-specific, so this page does not pretend every team uses the same S3 bucket and cron job.

Use your platform-native Postgres backup process, and verify restores in a non-production environment. For Kubernetes-specific mechanics, see the Kubernetes README.