Operational Runbooks

This directory contains operational runbooks for common tasks and procedures when managing the OAuth2 Server.

Runbook Index

  • Deployment
  • Database
  • Monitoring
  • Troubleshooting
  • Security


Initial Deployment

Prerequisites

  • Kubernetes cluster (1.24+)
  • kubectl configured
  • Docker images built and pushed
  • Secrets configured

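These prerequisites can be gated with a small preflight script. A minimal sketch; the tool list is an assumption, adjust for your environment:

```shell
#!/bin/bash
# preflight.sh - verify required CLIs are on PATH (sketch; tool list is an assumption)

check_cmds() {
  local missing=0 cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing required command: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Demo with tools that exist on any POSIX system
check_cmds sh ls && echo "preflight OK"
# Real usage for this runbook (assumption):
# check_cmds kubectl kustomize openssl curl
```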
Steps

  1. Create namespace

    kubectl apply -f k8s/base/namespace.yaml

  2. Configure secrets

    # Generate JWT secret
    JWT_SECRET=$(openssl rand -base64 32)
    
    # Create secret
    kubectl create secret generic oauth2-server-secret \
      --from-literal=OAUTH2_JWT_SECRET="$JWT_SECRET" \
      --from-literal=POSTGRES_PASSWORD="$(openssl rand -base64 20)" \
      -n oauth2-server

  3. Deploy PostgreSQL

    kubectl apply -k k8s/overlays/production
    kubectl wait --for=condition=ready pod/postgres-0 -n oauth2-server --timeout=300s

  4. Run migrations

    kubectl apply -f k8s/base/flyway-migration-job.yaml
    kubectl logs -f job/flyway-migration -n oauth2-server

  5. Deploy application

    kubectl apply -k k8s/overlays/production
    kubectl wait --for=condition=ready pod -l app=oauth2-server -n oauth2-server --timeout=300s

  6. Verify deployment

    kubectl get all -n oauth2-server
    curl -f https://oauth.example.com/health
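The final curl check can flap while pods warm up. A small retry helper (a sketch; the endpoint is the one from step 6):

```shell
#!/bin/bash
# retry.sh - retry a command N times with a fixed delay (sketch)

retry() {
  local attempts=$1 delay=$2 i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    echo "attempt $i/$attempts failed" >&2
    if (( i < attempts )); then sleep "$delay"; fi
  done
  return 1
}

# Demo: succeeds immediately
retry 3 1 true && echo "up"
# Real usage (endpoint from step 6):
# retry 10 5 curl -fsS https://oauth.example.com/health
```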

Rollback

If deployment fails:

kubectl delete -k k8s/overlays/production
kubectl delete namespace oauth2-server


Update Deployment

Prerequisites

  • New Docker image built and pushed
  • Tested in staging environment

Steps

  1. Update image tag

    cd k8s/overlays/production
    kustomize edit set image docker.io/ianlintner068/oauth2-server:v1.1.0

  2. Apply update

    kubectl apply -k k8s/overlays/production

  3. Monitor rollout

    kubectl rollout status deployment/oauth2-server -n oauth2-server

  4. Verify new pods

    kubectl get pods -n oauth2-server
    kubectl logs -f deployment/oauth2-server -n oauth2-server

  5. Health check

    curl -f https://oauth.example.com/health
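Before running `kustomize edit set image`, it can help to sanity-check the tag. A sketch assuming release tags follow vMAJOR.MINOR.PATCH:

```shell
# is_release_tag - check that a tag looks like vMAJOR.MINOR.PATCH (convention assumed)
is_release_tag() {
  [[ "$1" =~ ^v[0-9]+\.[0-9]+\.[0-9]+$ ]]
}

TAG=v1.1.0   # tag from step 1
if is_release_tag "$TAG"; then
  echo "tag OK: $TAG"
  # kustomize edit set image docker.io/ianlintner068/oauth2-server:"$TAG"
else
  echo "unexpected tag format: $TAG" >&2
  exit 1
fi
```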

Rollback

If issues occur:

kubectl rollout undo deployment/oauth2-server -n oauth2-server
kubectl rollout status deployment/oauth2-server -n oauth2-server


Rollback Deployment

When to Rollback

  • High error rate (>5%)
  • Failed health checks
  • Database connection issues
  • Performance degradation

Steps

  1. View rollout history

    kubectl rollout history deployment/oauth2-server -n oauth2-server

  2. Rollback to previous version

    kubectl rollout undo deployment/oauth2-server -n oauth2-server

  3. Or rollback to specific revision

    kubectl rollout undo deployment/oauth2-server --to-revision=5 -n oauth2-server

  4. Monitor rollback

    kubectl rollout status deployment/oauth2-server -n oauth2-server

  5. Verify health

    kubectl logs -f deployment/oauth2-server -n oauth2-server
    curl -f https://oauth.example.com/health


Database Backup

Schedule

  • Daily: Automated backup at 2 AM UTC
  • Before major changes: Manual backup
  • Monthly: Full database dump to cold storage

Manual Backup

  1. Create backup (capture the timestamp once so later steps reference the same file)

    STAMP=$(date +%Y%m%d-%H%M)
    kubectl exec -n oauth2-server postgres-0 -- \
      pg_dump -U oauth2_user -F c oauth2 > backup-$STAMP.dump

  2. Verify backup

    pg_restore --list backup-$STAMP.dump | head -20

  3. Upload to S3 (if configured)

    aws s3 cp backup-$STAMP.dump \
      s3://oauth2-backups/$(date +%Y/%m/%d)/

  4. Test restore (on staging)

    # On staging database
    pg_restore -U oauth2_user -d oauth2_staging backup-$STAMP.dump

Automated Backup Script

#!/bin/bash
# /opt/scripts/backup-oauth2-db.sh
set -euo pipefail

DATE=$(date +%Y%m%d-%H%M)
BACKUP_FILE="/backups/oauth2-$DATE.dump"
S3_BUCKET="s3://oauth2-backups"

# Create backup; check pg_dump's status directly (a trailing `$?` check at
# the end of the script would only reflect the last command run)
if ! kubectl exec -n oauth2-server postgres-0 -- \
  pg_dump -U oauth2_user -F c oauth2 > "$BACKUP_FILE"; then
  echo "Backup failed!" >&2
  exit 1
fi
echo "Backup successful: $BACKUP_FILE"

# Upload to S3
aws s3 cp "$BACKUP_FILE" "$S3_BUCKET/$(date +%Y/%m/%d)/"

# Keep local backups for 7 days
find /backups -name "oauth2-*.dump" -mtime +7 -delete
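To match the 2 AM UTC schedule above, the script can be driven by cron. A sketch; the log path is an assumption, and the entry assumes the host clock runs in UTC:

```shell
# /etc/cron.d/oauth2-backup (sketch; log path is an assumption)
# m h dom mon dow user command
0 2 * * * root /opt/scripts/backup-oauth2-db.sh >> /var/log/oauth2-backup.log 2>&1
```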

Database Restore

Prerequisites

  • Valid backup file
  • Database accessible
  • Application pods scaled to 0

Steps

  1. Scale down application

    kubectl scale deployment oauth2-server --replicas=0 -n oauth2-server
    kubectl wait --for=delete pod -l app=oauth2-server -n oauth2-server --timeout=60s

  2. Download backup from S3 (if needed)

    aws s3 cp s3://oauth2-backups/2024/01/15/backup-20240115-1400.dump .

  3. Restore database

    kubectl exec -i -n oauth2-server postgres-0 -- \
      pg_restore -U oauth2_user -d oauth2 -c < backup-20240115-1400.dump

  4. Verify restoration

    kubectl exec -it postgres-0 -n oauth2-server -- \
      psql -U oauth2_user -d oauth2 -c "\dt"
    
    # Check record counts
    kubectl exec -it postgres-0 -n oauth2-server -- \
      psql -U oauth2_user -d oauth2 -c "
        SELECT 'clients' AS table_name, COUNT(*) FROM clients
        UNION ALL
        SELECT 'tokens', COUNT(*) FROM tokens
        UNION ALL
        SELECT 'users', COUNT(*) FROM users;
      "

  5. Scale up application

    kubectl scale deployment oauth2-server --replicas=2 -n oauth2-server
    kubectl wait --for=condition=ready pod -l app=oauth2-server -n oauth2-server --timeout=300s

  6. Verify application

    curl -f https://oauth.example.com/health
    curl -f https://oauth.example.com/metrics


Run Database Migrations

Prerequisites

  • Migration files in migrations/sql/
  • Database accessible
  • Tested in development/staging

Steps

  1. Create ConfigMap with migrations

    kubectl create configmap flyway-migrations \
      --from-file=migrations/sql/ \
      -n oauth2-server \
      --dry-run=client -o yaml | kubectl apply -f -

  2. Apply migration job

    # Delete old job if exists
    kubectl delete job flyway-migration -n oauth2-server --ignore-not-found
    
    # Apply new job
    kubectl apply -f k8s/base/flyway-migration-job.yaml

  3. Monitor migration

    kubectl logs -f job/flyway-migration -n oauth2-server

  4. Verify migration

    kubectl exec -it postgres-0 -n oauth2-server -- \
      psql -U oauth2_user -d oauth2 -c "
        SELECT installed_rank, version, description, installed_on, success 
        FROM flyway_schema_history 
        ORDER BY installed_rank DESC 
        LIMIT 5;
      "

  5. Test application

    kubectl logs -f deployment/oauth2-server -n oauth2-server
    curl -f https://oauth.example.com/health
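Flyway only picks up files that match its naming convention (V&lt;version&gt;__&lt;description&gt;.sql for versioned migrations). A quick pre-check sketch before building the ConfigMap:

```shell
# check_migration_names - flag files Flyway would skip (sketch; simplified pattern,
# real Flyway also accepts underscores as version separators)
check_migration_names() {
  local bad=0 f base
  for f in "$@"; do
    base=${f##*/}
    if [[ ! "$base" =~ ^V[0-9]+(\.[0-9]+)*__[A-Za-z0-9_]+\.sql$ ]]; then
      echo "does not match Flyway convention: $base" >&2
      bad=1
    fi
  done
  return "$bad"
}

# Demo with hypothetical filenames
check_migration_names V1__init.sql V1.2__add_tokens.sql && echo "names OK"
# Real usage: check_migration_names migrations/sql/*.sql
```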

Rollback Migration

If migration fails:

  1. Check migration status

    kubectl logs job/flyway-migration -n oauth2-server

  2. Manually revert changes

    kubectl exec -it postgres-0 -n oauth2-server -- \
      psql -U oauth2_user -d oauth2
    
    # Run rollback SQL

  3. Restore from backup if needed

    # See "Database Restore" runbook


Check Server Health

Quick Health Check

# Health endpoint
curl -f https://oauth.example.com/health | jq

# Readiness endpoint
curl -f https://oauth.example.com/ready | jq

# Kubernetes pod status
kubectl get pods -n oauth2-server

# Recent logs
kubectl logs -n oauth2-server -l app=oauth2-server --tail=20

Detailed Health Check

# Application metrics
curl -s https://oauth.example.com/metrics | grep oauth2_server

# Database connectivity
kubectl exec -it postgres-0 -n oauth2-server -- \
  pg_isready -U oauth2_user

# Pod resource usage
kubectl top pods -n oauth2-server

# Check for errors in logs
kubectl logs -n oauth2-server -l app=oauth2-server --since=1h | grep -i error

# HPA status
kubectl get hpa -n oauth2-server

# Service endpoints
kubectl get endpoints -n oauth2-server

View Application Logs

Real-time Logs

# All pods
kubectl logs -f -n oauth2-server -l app=oauth2-server

# Specific pod
kubectl logs -f -n oauth2-server oauth2-server-abc123-xyz

# Last 100 lines
kubectl logs -n oauth2-server -l app=oauth2-server --tail=100

# Since 1 hour ago
kubectl logs -n oauth2-server -l app=oauth2-server --since=1h

# Previous pod (after crash)
kubectl logs -p -n oauth2-server oauth2-server-abc123-xyz

Log Analysis

# Count errors
kubectl logs -n oauth2-server -l app=oauth2-server | grep -c ERROR

# Find authentication failures
kubectl logs -n oauth2-server -l app=oauth2-server | grep "401\|403"

# Find slow queries
kubectl logs -n oauth2-server -l app=oauth2-server | grep "query took"

# Export logs for analysis
kubectl logs -n oauth2-server -l app=oauth2-server --tail=10000 > oauth2-logs.txt
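The exported file can be summarized further. A sketch that counts ERROR lines per hour; it assumes each log line starts with an ISO-8601 timestamp, which may not match your log format:

```shell
# errors_by_hour - count ERROR lines per hour on stdin
# (assumes each line starts with an ISO-8601 timestamp; format is an assumption)
errors_by_hour() {
  awk '/ERROR/ { hour = substr($1, 1, 13); count[hour]++ }
       END { for (h in count) print h, count[h] }' | sort
}

# Demo with hypothetical log lines
errors_by_hour <<'EOF'
2024-01-15T14:02:11Z INFO  token issued
2024-01-15T14:10:43Z ERROR db timeout
2024-01-15T15:01:02Z ERROR db timeout
2024-01-15T15:59:59Z ERROR invalid client
EOF
# prints:
# 2024-01-15T14 1
# 2024-01-15T15 2
```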

Monitor Metrics

Prometheus Queries

Access Prometheus and run these queries:

# Request rate
rate(oauth2_server_http_requests_total[5m])

# Error rate percentage
100 * (
  rate(oauth2_server_http_requests_total{status=~"5.."}[5m]) /
  rate(oauth2_server_http_requests_total[5m])
)

# Token issuance rate
rate(oauth2_server_oauth_token_issued_total[5m])

# Active tokens
oauth2_server_oauth_active_tokens

# P95 response time
histogram_quantile(0.95, 
  rate(oauth2_server_http_request_duration_seconds_bucket[5m]))

# Database query latency
histogram_quantile(0.95,
  rate(oauth2_server_db_query_duration_seconds_bucket[5m]))

Key Metrics to Monitor

  Metric                 Threshold    Action
  -------------------    ---------    ------------------
  Error rate             > 5%         Investigate logs
  Response time (P95)    > 500ms      Check database
  Active tokens          > 100,000    Consider cleanup
  Database CPU           > 80%        Scale or optimize
  Memory usage           > 80%        Scale pods
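The error-rate threshold can also be checked from raw counters in a one-off script. A sketch with hypothetical counter values:

```shell
# error_rate_pct TOTAL ERRORS - percentage of failed requests (sketch)
error_rate_pct() {
  awk -v total="$1" -v errors="$2" 'BEGIN {
    if (total == 0) { print "0.00"; exit }
    printf "%.2f\n", 100 * errors / total
  }'
}

RATE=$(error_rate_pct 2000 130)   # hypothetical counter values
echo "error rate: ${RATE}%"       # prints: error rate: 6.50%
# Compare against the 5% threshold from the table
awk -v r="$RATE" 'BEGIN { exit !(r > 5) }' && echo "ALERT: error rate above 5%"
```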

Set Up Alerts

Alertmanager Rules

groups:
  - name: oauth2_server
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: |
          100 * (
            rate(oauth2_server_http_requests_total{status=~"5.."}[5m]) /
            rate(oauth2_server_http_requests_total[5m])
          ) > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }}%"

      - alert: HighResponseTime
        expr: |
          histogram_quantile(0.95,
            rate(oauth2_server_http_request_duration_seconds_bucket[5m])
          ) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High response time"
          description: "P95 response time is {{ $value }}s"

      - alert: DatabaseDown
        expr: up{job="oauth2-database"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Database is down"

      - alert: PodRestartLoop
        expr: rate(kube_pod_container_status_restarts_total{namespace="oauth2-server"}[15m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Pod is restarting frequently"

Database Performance Tuning

  1. Confirm symptoms (latency, error rate, slow queries).
  2. Check database health:
  3. CPU/memory/disk
  4. connection count
  5. long-running queries
  6. Validate schema + indexes (especially for token lookup/revocation patterns).
  7. Scale / tune:
  8. increase Postgres resources
  9. add connection pooling (e.g., PgBouncer)
  10. tune max_connections, shared_buffers, and work_mem for your workload
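For the long-running-query check in step 2, pg_stat_activity is the usual source. A sketch; the 5-minute cutoff is an arbitrary choice:

```shell
# Long-running queries via pg_stat_activity (5-minute cutoff is an arbitrary choice)
LONG_QUERIES_SQL="
  SELECT pid, now() - query_start AS duration, state, left(query, 80) AS query
  FROM pg_stat_activity
  WHERE state <> 'idle' AND now() - query_start > interval '5 minutes'
  ORDER BY duration DESC;
"
echo "$LONG_QUERIES_SQL"
# Run it in-cluster (names from the other runbooks here):
# kubectl exec -it postgres-0 -n oauth2-server -- \
#   psql -U oauth2_user -d oauth2 -c "$LONG_QUERIES_SQL"
```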

Pod Not Starting

  1. Describe the pod:
  2. kubectl describe pod <pod> -n oauth2-server
  3. Check events for image pull errors, missing secrets, scheduling issues.
  4. Check logs:
  5. kubectl logs <pod> -n oauth2-server --previous
  6. Common causes:
  7. missing OAUTH2_JWT_SECRET
  8. database not reachable
  9. migrations job failing

High Error Rate

  1. Check /metrics and logs for spikes in 4xx vs 5xx.
  2. Correlate with deploy/rollout events.
  3. Validate dependency health:
  4. database (/ready)
  5. eventing (/events/health, if enabled)
  6. If 5xx persists, consider rollback:
  7. kubectl rollout undo deployment/oauth2-server -n oauth2-server

Database Connection Issues

  1. Check readiness: GET /ready should report database ok.
  2. Verify Kubernetes service/DNS:
  3. kubectl get svc -n oauth2-server
  4. Verify credentials/secrets:
  5. kubectl get secret oauth2-server-secret -n oauth2-server -o yaml
  6. Check Postgres logs:
  7. kubectl logs postgres-0 -n oauth2-server
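Beyond service/DNS checks, raw TCP reachability can be probed from a debug shell with bash's /dev/tcp. A sketch; the service hostname and port are assumptions:

```shell
#!/bin/bash
# wait_for_tcp HOST PORT [TIMEOUT] - poll a TCP port once per second (bash-only /dev/tcp)
wait_for_tcp() {
  local host=$1 port=$2 timeout=${3:-10} i
  for ((i = 0; i < timeout; i++)); do
    if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
      return 0   # connect succeeded; subshell closes the fd on exit
    fi
    sleep 1
  done
  return 1
}

# Usage (assumption: in-cluster Postgres service name and port)
# wait_for_tcp postgres.oauth2-server.svc.cluster.local 5432 30
```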

Performance Degradation

  1. Compare latency (P50/P95) before/after the degradation window.
  2. Check resource saturation (CPU/mem), restarts, and database health.
  3. If eventing is enabled, verify it is not misconfigured:
  4. failing backends are best-effort, but can add log noise.
  5. Consider temporarily reducing load and/or scaling up.

Rotate JWT Secret

JWT secret rotation invalidates existing tokens. Plan a maintenance window.

  1. Generate a new secret and update the Kubernetes secret.
  2. Roll out the deployment.
  3. Validate new token issuance and introspection.
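The steps above can be sketched as follows; the secret and key names are taken from the initial-deployment runbook, and the kubectl lines are left commented so the generation step can be checked standalone:

```shell
#!/bin/bash
# rotate-jwt-secret.sh - generate and stage a new JWT secret (sketch)

# 32 random bytes, base64-encoded; equivalent to the deploy runbook's
# `openssl rand -base64 32`
NEW_JWT_SECRET=$(head -c 32 /dev/urandom | base64 | tr -d '\n')
echo "generated secret of length ${#NEW_JWT_SECRET}"   # prints: generated secret of length 44

# Apply during the maintenance window (assumption: secret/key names from the
# deploy runbook; patch keeps the secret's other keys intact):
# kubectl patch secret oauth2-server-secret -n oauth2-server \
#   -p "{\"stringData\":{\"OAUTH2_JWT_SECRET\":\"$NEW_JWT_SECRET\"}}"
# kubectl rollout restart deployment/oauth2-server -n oauth2-server
```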

Rotate Database Password

  1. Update the database user password in Postgres.
  2. Update the Kubernetes secret used by the app.
  3. Restart/roll out the app.
  4. Verify /ready returns ok.

Revoke All Tokens

There is no global revoke endpoint by default. The recommended approach is to rotate the JWT secret (see above) and/or rotate the signing keys.

Security Incident Response

  1. Contain: revoke credentials/secrets and restrict access.
  2. Eradicate: rotate JWT secret and database passwords.
  3. Recover: redeploy from a known-good version.
  4. Post-incident: open an issue and follow the repository security policy.

Support

  • Documentation: /docs directory
  • Issues: GitHub Issues
  • Discussions: GitHub Discussions
  • Security: See SECURITY.md