Operational Runbooks¶

This directory contains operational runbooks for common tasks and procedures when managing the OAuth2 Server.

Runbook Index¶

Initial Deployment¶

Prerequisites¶

Kubernetes cluster (1.24+)
kubectl configured
Docker images built and pushed
Secrets configured

Steps¶

Create namespace

kubectl apply -f k8s/base/namespace.yaml

Configure secrets

# Generate JWT secret
JWT_SECRET=$(openssl rand -base64 32)

# Create secret
kubectl create secret generic oauth2-server-secret \
  --from-literal=OAUTH2_JWT_SECRET="$JWT_SECRET" \
  --from-literal=POSTGRES_PASSWORD="$(openssl rand -base64 20)" \
  -n oauth2-server

Deploy PostgreSQL

kubectl apply -k k8s/overlays/production
kubectl wait --for=condition=ready pod/postgres-0 -n oauth2-server --timeout=300s

Run migrations

kubectl apply -f k8s/base/flyway-migration-job.yaml
kubectl logs -f job/flyway-migration -n oauth2-server

Deploy application

kubectl apply -k k8s/overlays/production
kubectl wait --for=condition=ready pod -l app=oauth2-server -n oauth2-server --timeout=300s

Verify deployment

kubectl get all -n oauth2-server
curl -f https://oauth.example.com/health
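A freshly deployed service may need a little time before the health endpoint answers, so a single curl can fail spuriously. The sketch below polls the endpoint with a bounded number of retries. The function name, the default attempt count, and the delay are illustrative assumptions, not part of the project.

```shell
# Poll a health endpoint until it answers successfully or attempts run out.
# Usage: wait_for_health URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_health() {
  local url="$1" attempts="${2:-30}" delay="${3:-10}" i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP error responses (4xx/5xx)
    if curl -fsS -o /dev/null "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    echo "attempt $i/$attempts failed" >&2
    # Pause between attempts, but not after the last one
    if [ "$i" -lt "$attempts" ]; then sleep "$delay"; fi
    i=$((i + 1))
  done
  echo "still unhealthy after $attempts attempts" >&2
  return 1
}
```

For example, `wait_for_health https://oauth.example.com/health 30 10` waits up to roughly five minutes before giving up.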

Rollback¶

If deployment fails:

kubectl delete -k k8s/overlays/production
kubectl delete namespace oauth2-server

Update Deployment¶

Prerequisites¶

New Docker image built and pushed
Tested in staging environment

Steps¶

Update image tag

# Run in a subshell so the working directory is unchanged for the next step
(cd k8s/overlays/production && \
  kustomize edit set image docker.io/ianlintner068/oauth2-server:v1.1.0)

Apply update

kubectl apply -k k8s/overlays/production

Monitor rollout

kubectl rollout status deployment/oauth2-server -n oauth2-server

Verify new pods

kubectl get pods -n oauth2-server
kubectl logs -f deployment/oauth2-server -n oauth2-server

Health check

curl -f https://oauth.example.com/health

Rollback¶

If issues occur:

kubectl rollout undo deployment/oauth2-server -n oauth2-server
kubectl rollout status deployment/oauth2-server -n oauth2-server
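The monitor and rollback steps can be combined so that a stalled rollout is undone without manual intervention. This is a minimal sketch; the function name and the default timeout are assumptions.

```shell
# Wait for a rollout to finish; undo it automatically if it does not.
# Usage: rollout_or_undo DEPLOYMENT NAMESPACE [TIMEOUT]
rollout_or_undo() {
  local deploy="$1" ns="$2" timeout="${3:-300s}"
  if kubectl rollout status "deployment/$deploy" -n "$ns" --timeout="$timeout"; then
    echo "rollout of $deploy succeeded"
    return 0
  fi
  echo "rollout of $deploy failed; rolling back" >&2
  kubectl rollout undo "deployment/$deploy" -n "$ns"
  kubectl rollout status "deployment/$deploy" -n "$ns" --timeout="$timeout"
  # Report failure even when the rollback itself went well
  return 1
}
```

Calling `rollout_or_undo oauth2-server oauth2-server` after the apply step returns non-zero whenever a rollback was needed, which lets a CI job fail visibly.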

Rollback Deployment¶

When to Rollback¶

High error rate (>5%)
Failed health checks
Database connection issues
Performance degradation
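The error-rate criterion above can be checked mechanically once the request counts are known. The helper below is an illustrative sketch (the name and argument order are assumptions): it compares the share of failed requests against a percentage threshold that defaults to 5.

```shell
# Succeeds (exit 0) when the error share of requests exceeds the threshold.
# Usage: error_rate_exceeded ERRORS TOTAL [THRESHOLD_PERCENT]
error_rate_exceeded() {
  awk -v e="$1" -v t="$2" -v max="${3:-5}" 'BEGIN {
    if (t <= 0) exit 1                 # no traffic: nothing to judge
    exit !((100 * e / t) > max)        # exit 0 only when strictly above
  }'
}
```

It can guard a rollback, for example `error_rate_exceeded "$errors" "$total" && kubectl rollout undo deployment/oauth2-server -n oauth2-server`, where `errors` and `total` come from your metrics.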

Steps¶

View rollout history

kubectl rollout history deployment/oauth2-server -n oauth2-server

Rollback to previous version

kubectl rollout undo deployment/oauth2-server -n oauth2-server

Or rollback to specific revision

kubectl rollout undo deployment/oauth2-server --to-revision=5 -n oauth2-server

Monitor rollback

kubectl rollout status deployment/oauth2-server -n oauth2-server

Verify health

kubectl logs -f deployment/oauth2-server -n oauth2-server
curl -f https://oauth.example.com/health

Database Backup¶

Schedule¶

Daily: Automated backup at 2 AM UTC
Before major changes: Manual backup
Monthly: Full database dump to cold storage

Manual Backup¶

Create backup

# Fix the file name once so that every later step refers to the same file
BACKUP_FILE="backup-$(date +%Y%m%d-%H%M).dump"
kubectl exec -n oauth2-server postgres-0 -- \
  pg_dump -U oauth2_user -F c oauth2 > "$BACKUP_FILE"

Verify backup

pg_restore --list "$BACKUP_FILE" | head -20

Upload to S3 (if configured)

aws s3 cp "$BACKUP_FILE" \
  s3://oauth2-backups/$(date +%Y/%m/%d)/

Test restore (on staging)

# On staging database
pg_restore -U oauth2_user -d oauth2_staging "$BACKUP_FILE"

Automated Backup Script¶

#!/bin/bash
# /opt/scripts/backup-oauth2-db.sh

# Abort on the first failing command, including failures inside pipelines
set -euo pipefail
trap 'echo "Backup failed!" >&2' ERR

DATE=$(date +%Y%m%d-%H%M)
BACKUP_FILE="/backups/oauth2-$DATE.dump"
S3_BUCKET="s3://oauth2-backups"

# Create backup
kubectl exec -n oauth2-server postgres-0 -- \
  pg_dump -U oauth2_user -F c oauth2 > "$BACKUP_FILE"

# Verify: the redirect creates the file even on failure, so require it to be non-empty
[ -s "$BACKUP_FILE" ]

# Upload to S3
aws s3 cp "$BACKUP_FILE" "$S3_BUCKET/$(date +%Y/%m/%d)/"

# Keep local backups for 7 days
find /backups -name "oauth2-*.dump" -mtime +7 -delete

echo "Backup successful: $BACKUP_FILE"
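Because the shell redirect creates the dump file even when `kubectl exec` fails, a cheap sanity check before uploading can catch empty or truncated-at-start files. Archives written by `pg_dump -F c` begin with the bytes `PGDMP`; the helper below (its name is an assumption) checks for that marker. It is a quick filter, not a substitute for `pg_restore --list` or a test restore.

```shell
# Succeeds when FILE is non-empty and starts with the "PGDMP" marker that
# pg_dump writes at the beginning of custom-format (-F c) archives.
# Usage: is_custom_dump FILE
is_custom_dump() {
  [ -s "$1" ] && [ "$(head -c 5 "$1")" = "PGDMP" ]
}
```

In a backup script this could read `is_custom_dump "$BACKUP_FILE" || exit 1` just before the upload step.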

Database Restore¶

Prerequisites¶

Valid backup file
Database accessible
Application pods scaled to 0

Steps¶

Scale down application

kubectl scale deployment oauth2-server --replicas=0 -n oauth2-server
kubectl wait --for=delete pod -l app=oauth2-server -n oauth2-server --timeout=60s

Download backup from S3 (if needed)

aws s3 cp s3://oauth2-backups/2024/01/15/backup-20240115-1400.dump .

Restore database

kubectl exec -i -n oauth2-server postgres-0 -- \
  pg_restore -U oauth2_user -d oauth2 -c < backup-20240115-1400.dump

Verify restoration

kubectl exec -it postgres-0 -n oauth2-server -- \
  psql -U oauth2_user -d oauth2 -c "\dt"

# Check record counts
kubectl exec -it postgres-0 -n oauth2-server -- \
  psql -U oauth2_user -d oauth2 -c "
    SELECT 'clients' as table, COUNT(*) FROM clients
    UNION ALL
    SELECT 'tokens', COUNT(*) FROM tokens
    UNION ALL
    SELECT 'users', COUNT(*) FROM users;
  "

Scale up application

kubectl scale deployment oauth2-server --replicas=2 -n oauth2-server
kubectl wait --for=condition=ready pod -l app=oauth2-server -n oauth2-server --timeout=300s

Verify application

curl -f https://oauth.example.com/health
curl -f https://oauth.example.com/metrics
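Restoring while application pods are still connected can leave the database in an inconsistent state, so it may be worth asserting the "scaled to 0" prerequisite right before running pg_restore. The following guard is a sketch; the function name and defaults are assumptions based on the names used in this runbook.

```shell
# Succeeds only when no application pods remain in the namespace.
# Usage: app_is_scaled_down [NAMESPACE] [LABEL_SELECTOR]
app_is_scaled_down() {
  local ns="${1:-oauth2-server}" selector="${2:-app=oauth2-server}" pods
  # -o name prints one "pod/<name>" line per match, or nothing at all
  pods=$(kubectl get pods -n "$ns" -l "$selector" -o name) || return 2
  if [ -n "$pods" ]; then
    echo "refusing to restore; pods still present:" >&2
    echo "$pods" >&2
    return 1
  fi
}
```

A restore script could then start with `app_is_scaled_down || exit 1`.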

Run Database Migrations¶

Prerequisites¶

Migration files in migrations/sql/
Database accessible
Tested in development/staging

Steps¶

Create ConfigMap with migrations

kubectl create configmap flyway-migrations \
  --from-file=migrations/sql/ \
  -n oauth2-server \
  --dry-run=client -o yaml | kubectl apply -f -

Apply migration job

# Delete old job if exists
kubectl delete job flyway-migration -n oauth2-server --ignore-not-found

# Apply new job
kubectl apply -f k8s/base/flyway-migration-job.yaml

Monitor migration

kubectl logs -f job/flyway-migration -n oauth2-server

Verify migration

kubectl exec -it postgres-0 -n oauth2-server -- \
  psql -U oauth2_user -d oauth2 -c "
    SELECT installed_rank, version, description, installed_on, success 
    FROM flyway_schema_history 
    ORDER BY installed_rank DESC 
    LIMIT 5;
  "

Test application

kubectl logs -f deployment/oauth2-server -n oauth2-server
curl -f https://oauth.example.com/health
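Following the job logs shows progress but does not yield a pass/fail result that a script can act on. The sketch below waits for the job to complete and then counts failed rows in flyway_schema_history. The function name and the 300s timeout are assumptions; the job, pod, user, and database names are the ones used in the steps above.

```shell
# Wait for the Flyway job, then count failed rows in flyway_schema_history.
# Prints the count and succeeds only when it is 0.
# Usage: verify_migration [NAMESPACE]
verify_migration() {
  local ns="${1:-oauth2-server}" failed
  kubectl wait --for=condition=complete job/flyway-migration \
    -n "$ns" --timeout=300s || return 1
  # -tA prints the bare value without headers or alignment
  failed=$(kubectl exec postgres-0 -n "$ns" -- \
    psql -U oauth2_user -d oauth2 -tA \
    -c "SELECT COUNT(*) FROM flyway_schema_history WHERE NOT success;") || return 1
  echo "failed migrations: $failed"
  [ "$failed" = "0" ]
}
```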

Rollback Migration¶

If migration fails:

Check migration status

kubectl logs job/flyway-migration -n oauth2-server

Manually revert changes

kubectl exec -it postgres-0 -n oauth2-server -- \
  psql -U oauth2_user -d oauth2

# Run rollback SQL

Restore from backup if needed (see the "Database Restore" runbook).

Check Server Health¶

Quick Health Check¶

# Health endpoint
curl -f https://oauth.example.com/health | jq

# Readiness endpoint
curl -f https://oauth.example.com/ready | jq

# Kubernetes pod status
kubectl get pods -n oauth2-server

# Recent logs
kubectl logs -n oauth2-server -l app=oauth2-server --tail=20

Detailed Health Check¶

# Application metrics
curl -s https://oauth.example.com/metrics | grep oauth2_server

# Database connectivity
kubectl exec -it postgres-0 -n oauth2-server -- \
  pg_isready -U oauth2_user

# Pod resource usage
kubectl top pods -n oauth2-server

# Check for errors in logs (with -l, kubectl returns only the last 10 lines per pod unless --tail is set)
kubectl logs -n oauth2-server -l app=oauth2-server --since=1h --tail=-1 | grep -i error

# HPA status
kubectl get hpa -n oauth2-server

# Service endpoints
kubectl get endpoints -n oauth2-server
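When the individual checks are run by hand it is easy to miss one that failed. A small wrapper can run them in sequence and print a pass/fail summary. This is an illustrative sketch: the function names are assumptions, and only a subset of the checks above is wired in.

```shell
failures=0

# Run one named check, print PASS or FAIL, and remember failures.
# Usage: run_check NAME COMMAND [ARGS...]
run_check() {
  local name="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $name"
  else
    echo "FAIL  $name"
    failures=$((failures + 1))
  fi
}

# Run the core checks and succeed only when all of them pass.
health_report() {
  failures=0
  run_check "health endpoint" curl -fsS https://oauth.example.com/health
  run_check "readiness endpoint" curl -fsS https://oauth.example.com/ready
  run_check "database accepts connections" \
    kubectl exec postgres-0 -n oauth2-server -- pg_isready -U oauth2_user
  echo "$failures check(s) failed"
  [ "$failures" -eq 0 ]
}
```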

View Application Logs¶

Real-time Logs¶

# All pods
kubectl logs -f -n oauth2-server -l app=oauth2-server

# Specific pod
kubectl logs -f -n oauth2-server oauth2-server-abc123-xyz

# Last 100 lines
kubectl logs -n oauth2-server -l app=oauth2-server --tail=100

# Since 1 hour ago (with -l, kubectl returns only the last 10 lines per pod unless --tail is set)
kubectl logs -n oauth2-server -l app=oauth2-server --since=1h --tail=-1

# Previous pod (after crash)
kubectl logs -p -n oauth2-server oauth2-server-abc123-xyz

Log Analysis¶

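# With -l, kubectl returns only the last 10 lines per pod by default;
# --tail=-1 searches the full log instead.

# Count errors
kubectl logs -n oauth2-server -l app=oauth2-server --tail=-1 | grep -c ERROR

# Find authentication failures
kubectl logs -n oauth2-server -l app=oauth2-server --tail=-1 | grep -E "401|403"

# Find slow queries
kubectl logs -n oauth2-server -l app=oauth2-server --tail=-1 | grep "query took"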

# Export logs for analysis
kubectl logs -n oauth2-server -l app=oauth2-server --tail=10000 > oauth2-logs.txt
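Selecting logs by label merges the output of all pods, which hides the case where a single pod is producing most of the errors. The sketch below counts ERROR lines per pod. The function name is an assumption, and the literal string ERROR is taken from the commands above; adjust it to the log format actually in use.

```shell
# Print "<pod> <count>" for each application pod, counting ERROR lines
# logged within the given window.
# Usage: errors_per_pod [NAMESPACE] [SINCE]
errors_per_pod() {
  local ns="${1:-oauth2-server}" since="${2:-1h}" pod count
  kubectl get pods -n "$ns" -l app=oauth2-server -o name |
  while read -r pod; do
    # grep -c exits non-zero when the count is 0, hence "|| true"
    count=$(kubectl logs -n "$ns" "$pod" --since="$since" | grep -c ERROR || true)
    echo "${pod#pod/} $count"
  done
}
```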

Monitor Metrics¶

Prometheus Queries¶

Access Prometheus and run these queries:

# Request rate
rate(oauth2_server_http_requests_total[5m])

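# Error rate percentage (aggregate with sum() before dividing; without it,
# each 5xx series is divided by itself and the result is always 100)
100 * (
  sum(rate(oauth2_server_http_requests_total{status=~"5.."}[5m])) /
  sum(rate(oauth2_server_http_requests_total[5m]))
)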

# Token issuance rate
rate(oauth2_server_oauth_token_issued_total[5m])

# Active tokens
oauth2_server_oauth_active_tokens

# P95 response time
histogram_quantile(0.95, 
  rate(oauth2_server_http_request_duration_seconds_bucket[5m]))

# Database query latency
histogram_quantile(0.95,
  rate(oauth2_server_db_query_duration_seconds_bucket[5m]))
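The same queries can be run from a terminal through the Prometheus HTTP API, which is convenient during an incident or inside a script. In this sketch the Prometheus address (PROM_URL) and the function names are assumptions; the `/api/v1/query` endpoint is the standard instant-query endpoint.

```shell
# PROM_URL is an assumption; point it at your Prometheus server.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Run an instant PromQL query and print the raw JSON response.
# Usage: prom_query 'PROMQL_EXPRESSION'
prom_query() {
  # -G sends the URL-encoded query as a GET parameter
  curl -fsS -G "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}

# Print only the sample values from the response (requires jq).
prom_values() {
  prom_query "$1" | jq -r '.data.result[].value[1]'
}
```

For example, `prom_values 'oauth2_server_oauth_active_tokens'` prints one value per returned series.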

Key Metrics to Monitor¶

Metric	Threshold	Action
Error rate	> 5%	Investigate logs
Response time (P95)	> 500ms	Check database
Active tokens	> 100,000	Consider cleanup
Database CPU	> 80%	Scale or optimize
Memory usage	> 80%	Scale pods

Set Up Alerts¶

Alertmanager Rules¶

groups:
  - name: oauth2_server
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: |
          100 * (
            sum(rate(oauth2_server_http_requests_total{status=~"5.."}[5m])) /
            sum(rate(oauth2_server_http_requests_total[5m]))
          ) > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }}%"

      - alert: HighResponseTime
        expr: |
          histogram_quantile(0.95,
            rate(oauth2_server_http_request_duration_seconds_bucket[5m])
          ) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High response time"
          description: "P95 response time is {{ $value }}s"

      - alert: DatabaseDown
        expr: up{job="oauth2-database"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Database is down"

      - alert: PodRestartLoop
        expr: rate(kube_pod_container_status_restarts_total{namespace="oauth2-server"}[15m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Pod is restarting frequently"

Database Performance Tuning¶

Confirm symptoms (latency, error rate, slow queries).
Check database health:
  CPU/memory/disk
  connection count
  long-running queries
Validate schema + indexes (especially for token lookup/revocation patterns).
Scale / tune:
  increase Postgres resources
  add connection pooling (e.g., PgBouncer)
  tune max_connections, shared_buffers, and work_mem for your workload
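The connection count and long-running query checks can be read from PostgreSQL's pg_stat_activity view. The helpers below are a sketch: the function names and the 30-second default are assumptions, while the pod, user, and database names follow the rest of this runbook.

```shell
# Run a SQL statement inside the postgres-0 pod and print bare rows.
pg_query() {
  kubectl exec postgres-0 -n oauth2-server -- \
    psql -U oauth2_user -d oauth2 -tA -c "$1"
}

# Connection count per state (active, idle, idle in transaction, ...)
connection_counts() {
  pg_query "SELECT state, COUNT(*) FROM pg_stat_activity
            GROUP BY state ORDER BY 2 DESC;"
}

# Active statements that have been running longer than the given interval
# Usage: long_running_queries ['5 minutes']
long_running_queries() {
  pg_query "SELECT pid, now() - query_start AS runtime, left(query, 80)
            FROM pg_stat_activity
            WHERE state = 'active'
              AND now() - query_start > interval '${1:-30 seconds}'
            ORDER BY runtime DESC;"
}
```

A non-superuser role may see only its own sessions in full, so the output can be incomplete depending on the privileges of oauth2_user.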

Pod Not Starting¶

Describe the pod:
  kubectl describe pod <pod> -n oauth2-server
Check events for image pull errors, missing secrets, scheduling issues.
Check logs:
  kubectl logs <pod> -n oauth2-server --previous
Common causes:
  missing OAUTH2_JWT_SECRET
  database not reachable
  migrations job failing
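The first common cause, a missing OAUTH2_JWT_SECRET, can be checked without printing the secret value. This is a minimal sketch; the function name is an assumption, and the default secret name is the one created in the initial deployment steps.

```shell
# Succeeds when KEY exists and is non-empty in the given secret.
# Only emptiness is tested; the value itself is never printed.
# Usage: secret_has_key KEY [SECRET] [NAMESPACE]
secret_has_key() {
  local key="$1" secret="${2:-oauth2-server-secret}" ns="${3:-oauth2-server}" value
  value=$(kubectl get secret "$secret" -n "$ns" \
    -o "jsonpath={.data.$key}") || return 2
  [ -n "$value" ]
}
```

For example, `secret_has_key OAUTH2_JWT_SECRET || echo "JWT secret missing"`.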

High Error Rate¶

Check /metrics and logs for spikes in 4xx vs 5xx.
Correlate with deploy/rollout events.
Validate dependency health:
  database (/ready)
  eventing (/events/health, if enabled)
If 5xx persists, consider rollback:
  kubectl rollout undo deployment/oauth2-server -n oauth2-server
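To separate 4xx from 5xx quickly, the counters exposed on /metrics can be summed by status class. The sketch below assumes the endpoint serves the Prometheus text format and that oauth2_server_http_requests_total carries a `status` label, as the queries in this runbook suggest; the function name is hypothetical.

```shell
# Read Prometheus text-format metrics on stdin and sum the request
# counter by status class (2xx, 4xx, 5xx, ...).
status_class_totals() {
  awk '
    /^oauth2_server_http_requests_total[{]/ {
      if (match($0, /status="[0-9]+"/)) {
        # RSTART points at "status"; the first digit follows status="
        class = substr($0, RSTART + 8, 1) "xx"
        totals[class] += $NF          # the sample value is the last field
      }
    }
    END { for (c in totals) printf "%s %d\n", c, totals[c] }
  ' | sort
}
```

Usage: `curl -s https://oauth.example.com/metrics | status_class_totals`. The counters are cumulative since process start, so compare two readings taken a few minutes apart to see the current trend.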

Database Connection Issues¶

Check readiness: GET /ready should report database ok.
Verify Kubernetes service/DNS:
  kubectl get svc -n oauth2-server
Verify credentials/secrets:
  kubectl get secret oauth2-server-secret -n oauth2-server -o yaml
Check Postgres logs:
  kubectl logs postgres-0 -n oauth2-server

Performance Degradation¶

Compare latency (P50/P95) before/after the degradation window.
Check resource saturation (CPU/mem), restarts, and database health.
If eventing is enabled, verify it is not misconfigured: failing backends are best-effort, but they can add log noise.
Consider temporarily reducing load and/or scaling up.

Rotate JWT Secret¶

JWT secret rotation invalidates existing tokens. Plan a maintenance window.

Generate a new secret and update the Kubernetes secret.
Roll out the deployment.
Validate new token issuance and introspection.
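The first two steps can be sketched as follows. It assumes the deployment reads OAUTH2_JWT_SECRET from oauth2-server-secret at startup, so a restart is needed for the new value to take effect; the function name is hypothetical. Patching with `stringData` changes only the named key and leaves the other keys in the secret (such as POSTGRES_PASSWORD) untouched.

```shell
# Replace OAUTH2_JWT_SECRET in place and restart the deployment.
# Usage: rotate_jwt_secret [NAMESPACE]
rotate_jwt_secret() {
  local ns="${1:-oauth2-server}" new_secret
  new_secret=$(openssl rand -base64 32) || return 1
  # Base64 output contains no characters that need escaping in JSON
  kubectl patch secret oauth2-server-secret -n "$ns" --type merge \
    -p "{\"stringData\":{\"OAUTH2_JWT_SECRET\":\"$new_secret\"}}" || return 1
  kubectl rollout restart deployment/oauth2-server -n "$ns" || return 1
  kubectl rollout status deployment/oauth2-server -n "$ns" --timeout=300s
}
```

The new value is passed on the command line here, so it may be visible in the local process list while kubectl runs; on shared hosts, consider a patch file instead.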

Rotate Database Password¶

Update the database user password in Postgres.
Update the Kubernetes secret used by the app.
Restart/roll out the app.
Verify /ready returns ok.

Revoke All Tokens¶

There is no global revoke endpoint by default. The recommended approach is to rotate the JWT secret (see above), rotate the signing keys, or both.

Security Incident Response¶

Contain: revoke credentials/secrets and restrict access.
Eradicate: rotate JWT secret and database passwords.
Recover: redeploy from a known-good version.
Post-incident: open an issue and follow the repository security policy.

Additional Runbooks¶

For more specific scenarios, see:

Operations Agent - Comprehensive operational procedures
Database Agent - Database-specific operations
Security Agent - Security incident response

Support¶

Documentation: /docs directory
Issues: GitHub Issues
Discussions: GitHub Discussions
Security: See SECURITY.md
