Operational Runbooks¶
This directory contains operational runbooks for common tasks and procedures when managing the OAuth2 Server.
Runbook Index¶
Deployment¶
Database¶
Monitoring¶
Troubleshooting¶
Security¶
Initial Deployment¶
Prerequisites¶
- Kubernetes cluster (1.24+)
- kubectl configured
- Docker images built and pushed
- Secrets configured
Steps¶
- Create namespace
  kubectl apply -f k8s/base/namespace.yaml
- Configure secrets
  # Generate JWT secret
  JWT_SECRET=$(openssl rand -base64 32)
  # Create secret
  kubectl create secret generic oauth2-server-secret \
    --from-literal=OAUTH2_JWT_SECRET="$JWT_SECRET" \
    --from-literal=POSTGRES_PASSWORD="$(openssl rand -base64 20)" \
    -n oauth2-server
- Deploy PostgreSQL
  kubectl apply -k k8s/overlays/production
  kubectl wait --for=condition=ready pod/postgres-0 -n oauth2-server --timeout=300s
- Run migrations
  kubectl apply -f k8s/base/flyway-migration-job.yaml
  kubectl logs -f job/flyway-migration -n oauth2-server
- Deploy application
  kubectl apply -k k8s/overlays/production
  kubectl wait --for=condition=ready pod -l app=oauth2-server -n oauth2-server --timeout=300s
- Verify deployment
  kubectl get all -n oauth2-server
  curl -f https://oauth.example.com/health
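The health check in the final step can fail transiently while pods are still becoming ready or the ingress is still picking up the new endpoints. The following is a minimal sketch of a retry wrapper. The helper name `retry` and its arguments are hypothetical and not part of this repository.

```shell
#!/bin/sh
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or ATTEMPTS runs out.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "command failed after $attempts attempts: $*" >&2
  return 1
}
```

For example, `retry 10 6 curl -fsS https://oauth.example.com/health` would poll the health endpoint for roughly a minute before giving up.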
Rollback¶
If deployment fails:
kubectl delete -k k8s/overlays/production
kubectl delete namespace oauth2-server
Update Deployment¶
Prerequisites¶
- New Docker image built and pushed
- Tested in staging environment
Steps¶
- Update image tag
  # Run in a subshell so later commands still resolve paths from the repository root
  (cd k8s/overlays/production && kustomize edit set image docker.io/ianlintner068/oauth2-server:v1.1.0)
- Apply update
  kubectl apply -k k8s/overlays/production
- Monitor rollout
  kubectl rollout status deployment/oauth2-server -n oauth2-server
- Verify new pods
  kubectl get pods -n oauth2-server
  kubectl logs -f deployment/oauth2-server -n oauth2-server
- Health check
  curl -f https://oauth.example.com/health
Rollback¶
If issues occur:
kubectl rollout undo deployment/oauth2-server -n oauth2-server
kubectl rollout status deployment/oauth2-server -n oauth2-server
Rollback Deployment¶
When to Rollback¶
- High error rate (>5%)
- Failed health checks
- Database connection issues
- Performance degradation
Steps¶
- View rollout history
  kubectl rollout history deployment/oauth2-server -n oauth2-server
- Rollback to previous version
  kubectl rollout undo deployment/oauth2-server -n oauth2-server
- Or rollback to specific revision
  kubectl rollout undo deployment/oauth2-server --to-revision=5 -n oauth2-server
- Monitor rollback
  kubectl rollout status deployment/oauth2-server -n oauth2-server
- Verify health
  kubectl logs -f deployment/oauth2-server -n oauth2-server
  curl -f https://oauth.example.com/health
Database Backup¶
Schedule¶
- Daily: Automated backup at 2 AM UTC
- Before major changes: Manual backup
- Monthly: Full database dump to cold storage
Manual Backup¶
- Create backup
  # Capture the file name once so every later step refers to the same file
  BACKUP_FILE="backup-$(date +%Y%m%d-%H%M).dump"
  kubectl exec -n oauth2-server postgres-0 -- \
    pg_dump -U oauth2_user -F c oauth2 > "$BACKUP_FILE"
- Verify backup
  pg_restore --list "$BACKUP_FILE" | head -20
- Upload to S3 (if configured)
  aws s3 cp "$BACKUP_FILE" \
    "s3://oauth2-backups/$(date +%Y/%m/%d)/"
- Test restore (on staging)
  # On staging database
  pg_restore -U oauth2_user -d oauth2_staging "$BACKUP_FILE"
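Backups are uploaded under a date-based prefix, so the object key can be derived from the timestamp in the file name when a specific backup has to be found later. The following is a sketch assuming the `YYYYmmdd-HHMM` timestamp format and the `backup-<timestamp>.dump` naming used in the manual steps. The helper name `s3_backup_key` is hypothetical.

```shell
#!/bin/sh
# s3_backup_key BUCKET_URI TIMESTAMP
# Hypothetical helper: prints the S3 location of a manual backup.
# TIMESTAMP must look like 20240115-1400 (YYYYmmdd-HHMM).
s3_backup_key() {
  local bucket="$1" stamp="$2" year month day
  case "$stamp" in
    [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]) ;;
    *) echo "unexpected timestamp: $stamp" >&2; return 1 ;;
  esac
  year=$(printf '%s' "$stamp" | cut -c1-4)
  month=$(printf '%s' "$stamp" | cut -c5-6)
  day=$(printf '%s' "$stamp" | cut -c7-8)
  printf '%s/%s/%s/%s/backup-%s.dump\n' "$bucket" "$year" "$month" "$day" "$stamp"
}
```

Usage: `aws s3 cp "$(s3_backup_key s3://oauth2-backups 20240115-1400)" .` fetches the backup taken on 15 January 2024 at 14:00, provided the upload ran on the same calendar day as the dump.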
Automated Backup Script¶
#!/bin/bash
# /opt/scripts/backup-oauth2-db.sh
# Stop at the first failed command so a failed dump or upload is never reported as success
set -euo pipefail
trap 'echo "Backup failed!" >&2' ERR
DATE=$(date +%Y%m%d-%H%M)
BACKUP_FILE="/backups/oauth2-$DATE.dump"
S3_BUCKET="s3://oauth2-backups"
# Create backup
kubectl exec -n oauth2-server postgres-0 -- \
pg_dump -U oauth2_user -F c oauth2 > "$BACKUP_FILE"
# Upload to S3
aws s3 cp "$BACKUP_FILE" "$S3_BUCKET/$(date +%Y/%m/%d)/"
# Keep local backups for 7 days
find /backups -name "oauth2-*.dump" -mtime +7 -delete
# Reached only if every step above succeeded
echo "Backup successful: $BACKUP_FILE"
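Because the dump is written through a shell redirect, a failed `kubectl exec` can still leave an empty or truncated file behind. The following is a sketch of a cheap sanity check that could run before the upload. It relies on custom-format (`-F c`) archives starting with the bytes `PGDMP`. The helper name `is_pg_custom_dump` is hypothetical, and the check does not replace `pg_restore --list` or a test restore.

```shell
#!/bin/sh
# is_pg_custom_dump FILE
# Hypothetical helper: succeeds if FILE is non-empty and starts with "PGDMP",
# the magic bytes at the start of a pg_dump custom-format archive.
is_pg_custom_dump() {
  local file="$1"
  [ -s "$file" ] || return 1
  [ "$(head -c 5 "$file")" = "PGDMP" ]
}
```

Usage: `is_pg_custom_dump "$BACKUP_FILE" || { echo "Backup file looks invalid" >&2; exit 1; }`.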
Database Restore¶
Prerequisites¶
- Valid backup file
- Database accessible
- Application pods scaled to 0
Steps¶
- Scale down application
  kubectl scale deployment oauth2-server --replicas=0 -n oauth2-server
  kubectl wait --for=delete pod -l app=oauth2-server -n oauth2-server --timeout=60s
- Download backup from S3 (if needed)
  aws s3 cp s3://oauth2-backups/2024/01/15/backup-20240115-1400.dump .
- Restore database
  kubectl exec -i -n oauth2-server postgres-0 -- \
    pg_restore -U oauth2_user -d oauth2 -c < backup-20240115-1400.dump
- Verify restoration
  kubectl exec -it postgres-0 -n oauth2-server -- \
    psql -U oauth2_user -d oauth2 -c "\dt"
  # Check record counts
  kubectl exec -it postgres-0 -n oauth2-server -- \
    psql -U oauth2_user -d oauth2 -c "
      SELECT 'clients' as table, COUNT(*) FROM clients
      UNION ALL
      SELECT 'tokens', COUNT(*) FROM tokens
      UNION ALL
      SELECT 'users', COUNT(*) FROM users;
    "
- Scale up application
  kubectl scale deployment oauth2-server --replicas=2 -n oauth2-server
  kubectl wait --for=condition=ready pod -l app=oauth2-server -n oauth2-server --timeout=300s
- Verify application
  curl -f https://oauth.example.com/health
  curl -f https://oauth.example.com/metrics
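The scale-up step uses a fixed `--replicas=2`, which may not match what production was running before the restore. The following is a sketch that records the replica count first and reuses it afterwards. The helper names `current_replicas` and `scale_to` are hypothetical.

```shell
#!/bin/sh
# current_replicas DEPLOYMENT NAMESPACE
# Hypothetical helper: prints the deployment's configured replica count.
current_replicas() {
  kubectl get deployment "$1" -n "$2" -o jsonpath='{.spec.replicas}'
}

# scale_to DEPLOYMENT NAMESPACE REPLICAS
# Hypothetical helper: sets the deployment's replica count.
scale_to() {
  kubectl scale deployment "$1" --replicas="$3" -n "$2"
}

# Intended use around a restore:
#   saved=$(current_replicas oauth2-server oauth2-server)
#   scale_to oauth2-server oauth2-server 0
#   ... restore the database ...
#   scale_to oauth2-server oauth2-server "$saved"
```

Keeping the saved value in a variable only works within one shell session, so for long restores it may be safer to write it down in the incident notes as well.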
Run Database Migrations¶
Prerequisites¶
- Migration files in migrations/sql/
- Database accessible
- Tested in development/staging
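Flyway decides which files to run from their names, and with default settings a file that does not follow the naming pattern is typically skipped rather than applied. The following is a sketch of a pre-flight check, assuming the migration job keeps Flyway's default convention for versioned migrations (`V<version>__<description>.sql`). The helper names are hypothetical, and the pattern is deliberately conservative about which characters it accepts in the description.

```shell
#!/bin/sh
# is_flyway_versioned FILENAME
# Hypothetical helper: succeeds for names such as V1__init.sql or V2_1__add_index.sql.
is_flyway_versioned() {
  printf '%s\n' "$1" | grep -Eq '^V[0-9]+([._][0-9]+)*__[A-Za-z0-9_]+\.sql$'
}

# list_misnamed_migrations DIR
# Hypothetical helper: prints every .sql file in DIR that fails the check above.
list_misnamed_migrations() {
  local dir="$1" f name
  for f in "$dir"/*.sql; do
    [ -e "$f" ] || continue
    name=$(basename "$f")
    is_flyway_versioned "$name" || printf '%s\n' "$name"
  done
}
```

Running `list_misnamed_migrations migrations/sql` before creating the ConfigMap should print nothing. Repeatable (`R__`) migrations would be flagged by this sketch, so the pattern needs extending if the project uses them.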
Steps¶
- Create ConfigMap with migrations
  kubectl create configmap flyway-migrations \
    --from-file=migrations/sql/ \
    -n oauth2-server \
    --dry-run=client -o yaml | kubectl apply -f -
- Apply migration job
  # Delete old job if exists
  kubectl delete job flyway-migration -n oauth2-server --ignore-not-found
  # Apply new job
  kubectl apply -f k8s/base/flyway-migration-job.yaml
- Monitor migration
  kubectl logs -f job/flyway-migration -n oauth2-server
- Verify migration
  kubectl exec -it postgres-0 -n oauth2-server -- \
    psql -U oauth2_user -d oauth2 -c "
      SELECT installed_rank, version, description, installed_on, success
      FROM flyway_schema_history
      ORDER BY installed_rank DESC
      LIMIT 5;
    "
- Test application
  kubectl logs -f deployment/oauth2-server -n oauth2-server
  curl -f https://oauth.example.com/health
Rollback Migration¶
If migration fails:
- Check migration status
  kubectl logs job/flyway-migration -n oauth2-server
- Manually revert changes
  kubectl exec -it postgres-0 -n oauth2-server -- \
    psql -U oauth2_user -d oauth2
  # Run rollback SQL
- Restore from backup if needed
  # See "Database Restore" runbook
Check Server Health¶
Quick Health Check¶
# Health endpoint
curl -f https://oauth.example.com/health | jq
# Readiness endpoint
curl -f https://oauth.example.com/ready | jq
# Kubernetes pod status
kubectl get pods -n oauth2-server
# Recent logs
kubectl logs -n oauth2-server -l app=oauth2-server --tail=20
Detailed Health Check¶
# Application metrics
curl -s https://oauth.example.com/metrics | grep oauth2_server
# Database connectivity
kubectl exec -it postgres-0 -n oauth2-server -- \
pg_isready -U oauth2_user
# Pod resource usage
kubectl top pods -n oauth2-server
# Check for errors in logs (--tail=-1 lifts the 10-line-per-pod default that applies with a label selector)
kubectl logs -n oauth2-server -l app=oauth2-server --since=1h --tail=-1 | grep -i error
# HPA status
kubectl get hpa -n oauth2-server
# Service endpoints
kubectl get endpoints -n oauth2-server
View Application Logs¶
Real-time Logs¶
# All pods
kubectl logs -f -n oauth2-server -l app=oauth2-server
# Specific pod
kubectl logs -f -n oauth2-server oauth2-server-abc123-xyz
# Last 100 lines
kubectl logs -n oauth2-server -l app=oauth2-server --tail=100
# Since 1 hour ago (--tail=-1 lifts the 10-line-per-pod default that applies with a label selector)
kubectl logs -n oauth2-server -l app=oauth2-server --since=1h --tail=-1
# Previous pod (after crash)
kubectl logs -p -n oauth2-server oauth2-server-abc123-xyz
Log Analysis¶
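# With a label selector, kubectl returns only the last 10 lines per pod by default,
# so pass --tail=-1 (optionally with --since) to search the full log
# Count errors
kubectl logs -n oauth2-server -l app=oauth2-server --tail=-1 | grep -c ERROR
# Find authentication failures
kubectl logs -n oauth2-server -l app=oauth2-server --tail=-1 | grep "401\|403"
# Find slow queries
kubectl logs -n oauth2-server -l app=oauth2-server --tail=-1 | grep "query took"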
# Export logs for analysis
kubectl logs -n oauth2-server -l app=oauth2-server --tail=10000 > oauth2-logs.txt
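Running several `kubectl logs | grep` pipelines one after another reads the logs at slightly different moments. The following is a sketch that tallies errors and authentication failures in a single pass over one log stream. It assumes, as the grep commands above do, that error lines contain `ERROR` and that auth failures show up as `401` or `403` somewhere in the line. The helper name `summarize_logs` is hypothetical.

```shell
#!/bin/sh
# summarize_logs
# Hypothetical helper: reads log lines on stdin and prints one summary line.
summarize_logs() {
  awk '
    /ERROR/   { errors++ }
    /401|403/ { auth++ }
    END { printf "errors=%d auth_failures=%d\n", errors, auth }
  '
}
```

Usage: `kubectl logs -n oauth2-server -l app=oauth2-server --since=1h --tail=-1 | summarize_logs`. Like the grep versions, this matches `401` and `403` anywhere in a line, so it can overcount if those digits appear in IDs or durations.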
Monitor Metrics¶
Prometheus Queries¶
Access Prometheus and run these queries:
# Request rate
rate(oauth2_server_http_requests_total[5m])
# Error rate percentage (aggregate both sides; without sum() each 5xx series is only divided by itself)
100 * (
  sum(rate(oauth2_server_http_requests_total{status=~"5.."}[5m])) /
  sum(rate(oauth2_server_http_requests_total[5m]))
)
# Token issuance rate
rate(oauth2_server_oauth_token_issued_total[5m])
# Active tokens
oauth2_server_oauth_active_tokens
# P95 response time
histogram_quantile(0.95,
rate(oauth2_server_http_request_duration_seconds_bucket[5m]))
# Database query latency
histogram_quantile(0.95,
rate(oauth2_server_db_query_duration_seconds_bucket[5m]))
Key Metrics to Monitor¶
| Metric | Threshold | Action |
|---|---|---|
| Error rate | > 5% | Investigate logs |
| Response time (P95) | > 500ms | Check database |
| Active tokens | > 100,000 | Consider cleanup |
| Database CPU | > 80% | Scale or optimize |
| Memory usage | > 80% | Scale pods |
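The thresholds in the table can be encoded directly when a quick script-level check is wanted, for example in a smoke test after a deployment. The following is a sketch of that mapping. The metric keys (`error_rate_pct`, `p95_latency_ms`, and so on) and the helper name `metric_action` are hypothetical labels chosen for this example, not names exported by the server.

```shell
#!/bin/sh
# metric_action METRIC VALUE
# Hypothetical helper: prints the table's suggested action when VALUE exceeds
# the threshold, and prints nothing otherwise.
metric_action() {
  local metric="$1" value="$2" limit action
  case "$metric" in
    error_rate_pct) limit=5;      action="Investigate logs" ;;
    p95_latency_ms) limit=500;    action="Check database" ;;
    active_tokens)  limit=100000; action="Consider cleanup" ;;
    db_cpu_pct)     limit=80;     action="Scale or optimize" ;;
    memory_pct)     limit=80;     action="Scale pods" ;;
    *) echo "unknown metric: $metric" >&2; return 2 ;;
  esac
  # awk does the comparison so fractional values such as 5.3 work.
  if awk -v v="$value" -v l="$limit" 'BEGIN { exit !(v + 0 > l + 0) }'; then
    printf '%s\n' "$action"
  fi
}
```

Usage: `metric_action error_rate_pct 6.5` prints the action for an error rate of 6.5%, while a value at or below the threshold prints nothing.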
Set Up Alerts¶
Prometheus Alerting Rules¶
groups:
  - name: oauth2_server
    interval: 30s
    rules:
      - alert: HighErrorRate
        # sum() on both sides so 5xx requests are divided by all requests
        expr: |
          100 * (
            sum(rate(oauth2_server_http_requests_total{status=~"5.."}[5m])) /
            sum(rate(oauth2_server_http_requests_total[5m]))
          ) > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }}%"
      - alert: HighResponseTime
        expr: |
          histogram_quantile(0.95,
            rate(oauth2_server_http_request_duration_seconds_bucket[5m])
          ) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High response time"
          description: "P95 response time is {{ $value }}s"
      - alert: DatabaseDown
        expr: up{job="oauth2-database"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Database is down"
      - alert: PodRestartLoop
        expr: rate(kube_pod_container_status_restarts_total{namespace="oauth2-server"}[15m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Pod is restarting frequently"
Database Performance Tuning¶
- Confirm symptoms (latency, error rate, slow queries).
- Check database health:
  - CPU/memory/disk
  - connection count
  - long-running queries
- Validate schema + indexes (especially for token lookup/revocation patterns).
- Scale / tune:
  - increase Postgres resources
  - add connection pooling (e.g., PgBouncer)
  - tune max_connections, shared_buffers, and work_mem for your workload
Pod Not Starting¶
- Describe the pod:
  kubectl describe pod <pod> -n oauth2-server
- Check events for image pull errors, missing secrets, scheduling issues.
- Check logs:
  kubectl logs <pod> -n oauth2-server --previous
- Common causes:
  - missing OAUTH2_JWT_SECRET
  - database not reachable
  - migrations job failing
High Error Rate¶
- Check /metrics and logs for spikes in 4xx vs 5xx.
- Correlate with deploy/rollout events.
- Validate dependency health:
  - database (/ready)
  - eventing (/events/health, if enabled)
- If 5xx persists, consider rollback:
  kubectl rollout undo deployment/oauth2-server -n oauth2-server
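To compare 4xx and 5xx volumes without opening Prometheus, the raw `/metrics` output can be summed by status class. The following is a sketch assuming the endpoint serves the Prometheus text format and that `oauth2_server_http_requests_total` carries a `status` label with the three-digit HTTP code, as the queries in this runbook imply. The helper name `status_class_totals` is hypothetical.

```shell
#!/bin/sh
# status_class_totals
# Hypothetical helper: reads Prometheus text-format metrics on stdin and prints
# the request counter summed per status class (2xx, 4xx, 5xx, ...).
status_class_totals() {
  awk '
    /^oauth2_server_http_requests_total[{]/ {
      if (match($0, /status="[0-9][0-9][0-9]"/)) {
        # RSTART points at the "s" of status; the first digit is 8 characters on.
        class = substr($0, RSTART + 8, 1) "xx"
        totals[class] += $NF
      }
    }
    END { for (c in totals) printf "%s %d\n", c, totals[c] }
  ' | sort
}
```

Usage: `curl -s https://oauth.example.com/metrics | status_class_totals`. The counters are cumulative since each pod started, so run it twice a minute or so apart and compare the differences to see the current trend.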
Database Connection Issues¶
- Check readiness: GET /ready should report database ok.
- Verify Kubernetes service/DNS:
  kubectl get svc -n oauth2-server
- Verify credentials/secrets:
  kubectl get secret oauth2-server-secret -n oauth2-server -o yaml
- Check Postgres logs:
  kubectl logs postgres-0 -n oauth2-server
Performance Degradation¶
- Compare latency (P50/P95) before/after the degradation window.
- Check resource saturation (CPU/mem), restarts, and database health.
- If eventing is enabled, verify it is not misconfigured:
  - delivery to failing backends is best-effort, but repeated failures can add log noise.
- Consider temporarily reducing load and/or scaling up.
Rotate JWT Secret¶
JWT secret rotation invalidates existing tokens. Plan a maintenance window.
- Generate a new secret and update the Kubernetes secret.
- Roll out the deployment.
- Validate new token issuance and introspection.
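The steps above can be sketched as follows, assuming the secret name `oauth2-server-secret` and key `OAUTH2_JWT_SECRET` from the initial deployment, and assuming the application reads the secret only at startup, so the pods must be restarted. The helper names are hypothetical. A merge patch is used so other keys in the secret, such as the database password, are left untouched.

```shell
#!/bin/sh
# secret_patch_json KEY VALUE
# Hypothetical helper: builds a merge patch that sets one key via stringData.
# VALUE must not need JSON escaping; base64 output from openssl is safe.
secret_patch_json() {
  printf '{"stringData":{"%s":"%s"}}' "$1" "$2"
}

# rotate_jwt_secret
# Hypothetical helper: generates a new secret, stores it, and restarts the pods.
rotate_jwt_secret() {
  local new_secret
  new_secret=$(openssl rand -base64 32) || return 1
  kubectl patch secret oauth2-server-secret -n oauth2-server \
    --type merge -p "$(secret_patch_json OAUTH2_JWT_SECRET "$new_secret")" || return 1
  kubectl rollout restart deployment/oauth2-server -n oauth2-server || return 1
  kubectl rollout status deployment/oauth2-server -n oauth2-server
}
```

After `rotate_jwt_secret` completes, request a fresh token and run it through introspection. Tokens issued before the rotation are expected to be rejected.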
Rotate Database Password¶
- Update the database user password in Postgres.
- Update the Kubernetes secret used by the app.
- Restart/roll out the app.
- Verify /ready returns ok.
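The first step can be sketched as follows, assuming the application connects as `oauth2_user` (the role used elsewhere in this runbook) and that this role is allowed to change its own password. The helper names are hypothetical. The statement is sent to `psql` on stdin so the new password does not appear in the `kubectl` or `psql` command-line arguments.

```shell
#!/bin/sh
# alter_password_sql USER PASSWORD
# Hypothetical helper: prints an ALTER USER statement, doubling any single
# quotes in PASSWORD so the SQL string literal stays well-formed.
alter_password_sql() {
  local user="$1" escaped
  escaped=$(printf '%s' "$2" | sed "s/'/''/g")
  printf "ALTER USER \"%s\" WITH PASSWORD '%s';\n" "$user" "$escaped"
}

# apply_db_password PASSWORD
# Hypothetical helper: applies the statement inside the postgres-0 pod.
apply_db_password() {
  alter_password_sql oauth2_user "$1" |
    kubectl exec -i -n oauth2-server postgres-0 -- \
      psql -U oauth2_user -d oauth2 -v ON_ERROR_STOP=1
}
```

Usage: `NEW_PASSWORD=$(openssl rand -base64 20); apply_db_password "$NEW_PASSWORD"`, then store the same value in the Kubernetes secret and roll out the app. Between those two actions any new database connections will fail with the old password, so keep the gap short.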
Revoke All Tokens¶
There is no global revoke endpoint by default. The recommended approach is to rotate the JWT secret (see "Rotate JWT Secret") and/or rotate signing keys, which invalidates every token signed with the old secret.
Security Incident Response¶
- Contain: revoke credentials/secrets and restrict access.
- Eradicate: rotate JWT secret and database passwords.
- Recover: redeploy from a known-good version.
- Post-incident: open an issue and follow the repository security policy.
Additional Runbooks¶
For more specific scenarios, see:
- Operations Agent - Comprehensive operational procedures
- Database Agent - Database-specific operations
- Security Agent - Security incident response
Support¶
- Documentation: /docs directory
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Security: See SECURITY.md