
Adding Custom SLOs

This guide explains how to add custom Service Level Objectives (SLOs) for your specific needs. The Greenfield Cluster provides a flexible framework for defining SLOs at different levels.

Table of Contents

  1. Overview
  2. Distribution SLOs
  3. Service-Level SLOs
  4. App/Pod-Level SLOs
  5. Custom Business Metric SLOs
  6. Best Practices
  7. Testing Your SLOs
  8. Example: Complete SLO Implementation

Overview

SLOs are defined using Prometheus recording rules that pre-calculate metrics. The general pattern is:

- record: {scope}:{metric}:{aggregation}_{time_window}
  expr: |
    # PromQL expression

SLO Components

Every SLO should have:

  1. Recording rule: Pre-calculated metric
  2. Alert rule: Fires when the SLO is violated
  3. Grafana dashboard panel: Visualization
  4. Documentation: What it measures and why

Distribution SLOs

Distribution SLOs track the distribution of values (latency, response sizes, etc.) using percentiles.

Use Cases

  • API response time (P50, P95, P99)
  • Database query duration
  • Message processing time
  • File upload sizes

Example: Custom Latency SLO

Step 1: Define Recording Rule

Edit kustomize/base/observability/slos/application-slos.yaml:

- name: custom_latency_slos
  interval: 30s
  rules:
    # P50 latency for custom service
    - record: myapp:http_request:duration:p50_rate5m
      expr: |
        histogram_quantile(0.50,
          sum(rate(myapp_http_request_duration_seconds_bucket[5m]))
          by (service, endpoint, le)
        )

    # P95 latency for custom service
    - record: myapp:http_request:duration:p95_rate5m
      expr: |
        histogram_quantile(0.95,
          sum(rate(myapp_http_request_duration_seconds_bucket[5m]))
          by (service, endpoint, le)
        )

    # P99 latency for custom service
    - record: myapp:http_request:duration:p99_rate5m
      expr: |
        histogram_quantile(0.99,
          sum(rate(myapp_http_request_duration_seconds_bucket[5m]))
          by (service, endpoint, le)
        )

    # P99.9 latency for critical paths
    - record: myapp:http_request:duration:p999_rate5m
      expr: |
        histogram_quantile(0.999,
          sum(rate(myapp_http_request_duration_seconds_bucket[5m]))
          by (service, endpoint, le)
        )

Step 2: Add Alert Rules

Edit kustomize/base/observability/alerts/application-alerts.yaml:

- alert: MyAppHighP95Latency
  expr: |
    myapp:http_request:duration:p95_rate5m > 0.5
  for: 10m
  labels:
    severity: warning
    component: myapp
    slo: latency
  annotations:
    summary: "MyApp P95 latency above 500ms"
    description: "MyApp P95 latency is {{ $value }}s (target: < 0.5s)"

- alert: MyAppHighP99Latency
  expr: |
    myapp:http_request:duration:p99_rate5m > 1.0
  for: 5m
  labels:
    severity: critical
    component: myapp
    slo: latency
  annotations:
    summary: "MyApp P99 latency above 1s"
    description: "MyApp P99 latency is {{ $value }}s (target: < 1s)"

Step 3: Instrument Your Application

In your application code (Python example):

from prometheus_client import Histogram

# Define histogram with buckets appropriate for your use case
request_duration = Histogram(
    'myapp_http_request_duration_seconds',
    'HTTP request duration',
    ['service', 'endpoint'],
    buckets=[0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

# In your request handler
with request_duration.labels(service='myapp', endpoint='/api/users').time():
    # Your code
    pass
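
For Prometheus to scrape these metrics, the application also needs to expose a /metrics endpoint. A minimal sketch using prometheus_client's built-in HTTP server (the port is an assumption; align it with your scrape configuration or ServiceMonitor):

from prometheus_client import start_http_server

# Expose /metrics on port 8000 so Prometheus can scrape it.
# Port 8000 is an arbitrary choice for this sketch.
start_http_server(8000)

# ... start the application's main loop after the metrics server is up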

Service-Level SLOs

Service-level SLOs track request/response patterns for entire services.

Use Cases

  • API availability
  • Success rate per endpoint
  • Throughput (requests per second)
  • Error rates by error type

Example: Endpoint-Specific Success Rate

Recording Rule:

- name: service_endpoint_slos
  interval: 30s
  rules:
    # Success rate per endpoint
    - record: myapp:http_requests:success_ratio_per_endpoint_rate5m
      expr: |
        sum(rate(myapp_http_requests_total{status!~"5.."}[5m]))
        by (service, endpoint)
        /
        sum(rate(myapp_http_requests_total[5m]))
        by (service, endpoint)

    # Error rate per error type
    - record: myapp:http_requests:error_ratio_by_type_rate5m
      expr: |
        sum(rate(myapp_http_requests_total{status=~"5.."}[5m]))
        by (service, endpoint, status)
        / ignoring(status) group_left
        sum(rate(myapp_http_requests_total[5m]))
        by (service, endpoint)

    # Requests per second per endpoint
    - record: myapp:http_requests:rps_per_endpoint_rate5m
      expr: |
        sum(rate(myapp_http_requests_total[5m]))
        by (service, endpoint)

Alert Rule:

- alert: MyAppEndpointHighErrorRate
  expr: |
    myapp:http_requests:success_ratio_per_endpoint_rate5m < 0.95
    and
    myapp:http_requests:rps_per_endpoint_rate5m > 0.1
  for: 5m
  labels:
    severity: warning
    component: myapp
    slo: availability
  annotations:
    summary: "High error rate on {{ $labels.endpoint }}"
    description: "{{ $labels.service }}/{{ $labels.endpoint }} success rate: {{ $value | humanizePercentage }}"
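
These recording and alert rules assume the service exports a request counter with service, endpoint, and status labels. A minimal instrumentation sketch with prometheus_client (the metric and label names mirror the expressions above; the handler wiring is illustrative):

from prometheus_client import Counter

http_requests_total = Counter(
    'myapp_http_requests_total',
    'Total HTTP requests',
    ['service', 'endpoint', 'status']
)

# In your request handler, once the response status is known
http_requests_total.labels(
    service='myapp', endpoint='/api/users', status='200'
).inc()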

App/Pod-Level SLOs

App/Pod-level SLOs track resource usage and health at the application deployment level.

Use Cases

  • Pod restarts per application
  • Memory usage per pod
  • CPU throttling
  • Container health

Example: Application Restart Rate SLO

Recording Rule:

- name: app_health_slos
  interval: 30s
  rules:
    # Pod restart rate per application
    - record: app:pod_restarts:rate1h
      expr: |
        rate(kube_pod_container_status_restarts_total[1h])

    # Pods currently in CrashLoopBackOff
    - record: app:pods:crashloop_count
      expr: |
        count(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1)
        by (namespace, pod)

    # Application uptime (seconds since the pod started)
    - record: app:uptime:seconds
      expr: |
        time() - kube_pod_start_time

    # Pods currently not ready
    - record: app:pods:not_ready_count
      expr: |
        count(kube_pod_status_ready{condition="false"} == 1)
        by (namespace, pod)

Alert Rule:

- alert: AppHighRestartRate
  expr: |
    sum(app:pod_restarts:rate1h) by (namespace, pod) * 3600 > 0.5
  for: 15m
  labels:
    severity: warning
    component: application
    slo: stability
  annotations:
    summary: "Application {{ $labels.pod }} restarting frequently"
    description: "{{ $labels.pod }} in {{ $labels.namespace }} is restarting {{ $value }} times per hour"

- alert: AppInCrashLoop
  expr: |
    app:pods:crashloop_count > 0
  for: 5m
  labels:
    severity: critical
    component: application
    slo: stability
  annotations:
    summary: "Application {{ $labels.pod }} in CrashLoopBackOff"
    description: "{{ $labels.pod }} in {{ $labels.namespace }} is in CrashLoopBackOff state"

Example: Resource Saturation SLO

Recording Rule:

- name: app_resource_slos
  interval: 30s
  rules:
    # CPU saturation by application
    - record: app:cpu:saturation_by_app
      expr: |
        sum(rate(container_cpu_usage_seconds_total{container!=""}[5m]))
        by (namespace, app)
        /
        sum(container_spec_cpu_quota{container!=""}/container_spec_cpu_period{container!=""})
        by (namespace, app)

    # Memory saturation by application
    - record: app:memory:saturation_by_app
      expr: |
        sum(container_memory_working_set_bytes{container!=""})
        by (namespace, app)
        /
        sum(container_spec_memory_limit_bytes{container!=""})
        by (namespace, app)

    # Pods using more than 95% of their CPU or memory limit
    - record: app:pods:at_limits_count
      expr: |
        count(
          (
            rate(container_cpu_usage_seconds_total{container!=""}[5m])
            / (container_spec_cpu_quota{container!=""} / container_spec_cpu_period{container!=""})
            > 0.95
          )
          or
          (
            container_memory_working_set_bytes{container!=""}
            / container_spec_memory_limit_bytes{container!=""}
            > 0.95
          )
        ) by (namespace, app)

Custom Business Metric SLOs

Business metric SLOs track domain-specific metrics that matter to your business.

Use Cases

  • Order success rate
  • Payment processing time
  • User registration flow completion
  • Data processing throughput
  • Queue depth

Example: Order Processing SLO

Step 1: Expose Business Metrics

In your application:

from prometheus_client import Counter, Histogram

# Order metrics
orders_total = Counter(
    'orders_total',
    'Total orders',
    ['status', 'order_type']  # status: success, failed, cancelled
)

order_processing_duration = Histogram(
    'order_processing_duration_seconds',
    'Order processing duration',
    ['order_type'],
    buckets=[0.1, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0, 60.0]
)

# In your order processing code
with order_processing_duration.labels(order_type='standard').time():
    result = process_order(order)

if result.success:
    orders_total.labels(status='success', order_type='standard').inc()
else:
    orders_total.labels(status='failed', order_type='standard').inc()

Step 2: Define SLO Recording Rules

- name: business_slos
  interval: 30s
  rules:
    # Order success rate (SLO: 99.5%)
    - record: business:orders:success_ratio_rate5m
      expr: |
        sum(rate(orders_total{status="success"}[5m]))
        /
        sum(rate(orders_total[5m]))

    # Order success rate by type
    - record: business:orders:success_ratio_by_type_rate5m
      expr: |
        sum(rate(orders_total{status="success"}[5m])) by (order_type)
        /
        sum(rate(orders_total[5m])) by (order_type)

    # Order processing P95 latency (SLO: < 5s)
    - record: business:orders:processing_duration:p95_rate5m
      expr: |
        histogram_quantile(0.95,
          sum(rate(order_processing_duration_seconds_bucket[5m]))
          by (order_type, le)
        )

    # Order success rate over 30m (used for the error budget)
    - record: business:orders:success_ratio_rate30m
      expr: |
        sum(rate(orders_total{status="success"}[30m]))
        /
        sum(rate(orders_total[30m]))

    # Order error budget remaining (SLO: 99.5%, budget: 0.5%)
    - record: business:orders:error_budget_remaining
      expr: |
        (0.005 - (1 - business:orders:success_ratio_rate30m)) / 0.005

    # Order throughput (orders per minute)
    - record: business:orders:throughput_per_minute
      expr: |
        sum(rate(orders_total[1m])) * 60

Step 3: Add Alerts

- alert: OrderSuccessRateBelowSLO
  expr: |
    business:orders:success_ratio_rate5m < 0.995
    and
    sum(rate(orders_total[5m])) > 0.01
  for: 5m
  labels:
    severity: critical
    component: orders
    slo: success_rate
  annotations:
    summary: "Order success rate below 99.5% SLO"
    description: "Order success rate is {{ $value | humanizePercentage }} (target: > 99.5%)"

- alert: OrderProcessingSlowP95
  expr: |
    business:orders:processing_duration:p95_rate5m > 5
  for: 10m
  labels:
    severity: warning
    component: orders
    slo: latency
  annotations:
    summary: "Order processing P95 latency above 5s"
    description: "Order processing P95 is {{ $value }}s (target: < 5s)"

- alert: OrderErrorBudgetLow
  expr: |
    business:orders:error_budget_remaining < 0.1
  for: 5m
  labels:
    severity: warning
    component: orders
    slo: error_budget
  annotations:
    summary: "Order error budget nearly exhausted"
    description: "Only {{ $value | humanizePercentage }} of error budget remains"

Example: Queue Depth SLO

- name: queue_slos
  interval: 30s
  rules:
    # Queue depth (SLO: < 1000 messages)
    - record: business:queue:depth
      expr: |
        sum(queue_depth) by (queue_name)

    # Queue age (oldest message age in seconds)
    - record: business:queue:oldest_message_age_seconds
      expr: |
        max(queue_message_age_seconds) by (queue_name)

    # Queue processing rate (messages/sec)
    - record: business:queue:processing_rate
      expr: |
        rate(queue_messages_processed_total[5m])
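
These rules assume the queue worker exports queue_depth and queue_message_age_seconds gauges and a queue_messages_processed_total counter. A minimal sketch with prometheus_client (metric names mirror the expressions above; get_queue_length and get_oldest_message_age are placeholders for your broker's API):

from prometheus_client import Counter, Gauge

queue_depth = Gauge('queue_depth', 'Messages currently in the queue', ['queue_name'])
queue_oldest_age = Gauge('queue_message_age_seconds',
                         'Age of the oldest message in seconds', ['queue_name'])
messages_processed = Counter('queue_messages_processed_total',
                             'Messages processed', ['queue_name'])

# In the worker loop; the helper calls are broker-specific placeholders
queue_depth.labels(queue_name='orders').set(get_queue_length('orders'))
queue_oldest_age.labels(queue_name='orders').set(get_oldest_message_age('orders'))
messages_processed.labels(queue_name='orders').inc()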

Best Practices

1. Start Simple

Begin with the four golden signals:

  • Latency: How long does it take?
  • Traffic: How much demand?
  • Errors: How many requests fail?
  • Saturation: How full are resources?

2. Choose Appropriate Time Windows

  • 5m: Real-time alerting, fast feedback
  • 30m: Error budget calculations
  • 1h: Trend analysis
  • 1d: Capacity planning

3. Use Consistent Naming

Follow the pattern: {scope}:{metric}:{aggregation}_{time_window}

Examples:

  • myapp:http_requests:success_ratio_rate5m
  • business:orders:processing_duration:p95_rate5m
  • app:cpu:saturation_by_app

4. Set Realistic Targets

SLO Target   Downtime per Month   Use Case
90%          3 days               Development/testing
95%          1.5 days             Internal tools
99%          7.2 hours            Standard service
99.5%        3.6 hours            Important service
99.9%        43 minutes           Critical service
99.95%       21 minutes           Payment systems
99.99%       4 minutes            Life-critical systems

5. Track Error Budgets

Error budgets help balance reliability and velocity:

# Error budget for 99.9% SLO
- record: myapp:error_budget_remaining
  expr: |
    (0.001 - (1 - myapp:success_ratio_rate30m)) / 0.001
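
As a quick sanity check of the formula, here is the same calculation in plain Python (the measured value is made up for illustration):

slo_target = 0.999                     # 99.9% SLO
error_budget = 1 - slo_target          # 0.1% of requests may fail
measured_success = 0.9995              # e.g. the 30-minute success ratio

remaining = (error_budget - (1 - measured_success)) / error_budget
print(round(remaining, 3))             # 0.5 -> half the budget is left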

When error budget > 50%: Ship features
When error budget < 10%: Focus on reliability

6. Alert on SLO Violations, Not Symptoms

Bad ❌:

- alert: HighCPU
  expr: cpu_usage > 0.8

Good ✅:

- alert: LatencySLOViolation
  expr: myapp:http_request:duration:p95_rate5m > 1.0

7. Use Traffic Filters

Avoid false positives in low-traffic scenarios:

expr: |
  myapp:success_ratio < 0.99
  and
  myapp:requests_rate5m > 0.1  # Only alert if traffic > 0.1 req/s

8. Document Your SLOs

For each SLO, document:

  • What: What does it measure?
  • Why: Why does it matter?
  • Target: What's the target value?
  • Error Budget: How much unreliability is acceptable?
  • Alerting: When should we be notified?

9. Review and Iterate

  • Monthly SLO review meetings
  • Analyze SLO violations
  • Adjust targets based on data
  • Update runbooks with learnings

10. Multi-Window Multi-Burn-Rate Alerts

For critical SLOs, use multiple windows to catch both fast and slow burns:

# Fast burn (critical) - 2% budget consumed in 1 hour
- alert: MyAppErrorBudgetFastBurn
  expr: |
    (
      myapp:success_ratio_rate5m < 0.98  # 2% error rate
      and
      myapp:success_ratio_rate1h < 0.98
    )
  for: 2m
  labels:
    severity: critical

# Slow burn (warning) - consuming budget steadily
- alert: MyAppErrorBudgetSlowBurn
  expr: |
    (
      myapp:success_ratio_rate30m < 0.995  # 0.5% error rate
      and
      myapp:success_ratio_rate6h < 0.995
    )
  for: 15m
  labels:
    severity: warning

Testing Your SLOs

1. Simulate Load

# Generate traffic
for i in {1..1000}; do
  curl http://myapp/api/endpoint
done

2. Introduce Errors

# Call error endpoint
for i in {1..10}; do
  curl http://myapp/api/error-endpoint
done

3. Check Metrics

# Query Prometheus
curl 'http://localhost:9090/api/v1/query?query=myapp:success_ratio_rate5m'

4. Verify Alerts

Check Prometheus alerts page or AlertManager to see if alerts fire as expected.
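
The same check can be scripted, for example as part of a smoke test. A sketch using the Prometheus HTTP API via the requests library (the localhost URL assumes a port-forward to the Prometheus service):

import requests

# Query a recording rule through the Prometheus HTTP API
resp = requests.get(
    'http://localhost:9090/api/v1/query',
    params={'query': 'myapp:http_requests:success_ratio_per_endpoint_rate5m'},
    timeout=10,
)
resp.raise_for_status()

for result in resp.json()['data']['result']:
    print(result['metric'], result['value'])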

Example: Complete SLO Implementation

Here's a complete example for a payment service:

Recording Rules:

- name: payment_slos
  interval: 30s
  rules:
    # Success rate (target: 99.95%)
    - record: payment:transactions:success_ratio_rate5m
      expr: |
        sum(rate(payment_transactions_total{status="success"}[5m]))
        /
        sum(rate(payment_transactions_total[5m]))

    # P99 latency (target: < 2s)
    - record: payment:transactions:duration:p99_rate5m
      expr: |
        histogram_quantile(0.99,
          sum(rate(payment_transaction_duration_seconds_bucket[5m]))
          by (payment_method, le)
        )

    # Success rate over 30m (used for the error budget)
    - record: payment:transactions:success_ratio_rate30m
      expr: |
        sum(rate(payment_transactions_total{status="success"}[30m]))
        /
        sum(rate(payment_transactions_total[30m]))

    # Error budget (0.05% allowed)
    - record: payment:transactions:error_budget_remaining
      expr: |
        (0.0005 - (1 - payment:transactions:success_ratio_rate30m)) / 0.0005

Alerts:

- alert: PaymentSuccessRateCritical
  expr: |
    payment:transactions:success_ratio_rate5m < 0.9995
    and
    sum(rate(payment_transactions_total[5m])) > 0.01
  for: 2m
  labels:
    severity: critical
    component: payment
    slo: success_rate
  annotations:
    summary: "Payment success rate below 99.95%"
    description: "Success rate: {{ $value | humanizePercentage }}"
    runbook_url: "https://wiki.example.com/runbooks/payment-slo-violation"

- alert: PaymentLatencyHigh
  expr: |
    payment:transactions:duration:p99_rate5m > 2
  for: 5m
  labels:
    severity: warning
    component: payment
    slo: latency
  annotations:
    summary: "Payment P99 latency above 2s"
    description: "P99 latency: {{ $value }}s"

Grafana Dashboard Panel:

{
  "title": "Payment Success Rate vs SLO",
  "targets": [
    {
      "expr": "payment:transactions:success_ratio_rate5m * 100",
      "legendFormat": "Success Rate"
    },
    {
      "expr": "99.95",
      "legendFormat": "SLO Target (99.95%)"
    }
  ],
  "yaxis": {
    "min": 99.9,
    "max": 100
  }
}
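
The rules and panel above assume the payment service exports a transaction counter and a duration histogram. An instrumentation sketch with prometheus_client (names and labels mirror the expressions above; the bucket boundaries and the charge() call are illustrative):

from prometheus_client import Counter, Histogram

payment_transactions_total = Counter(
    'payment_transactions_total',
    'Total payment transactions',
    ['status', 'payment_method']
)

payment_transaction_duration = Histogram(
    'payment_transaction_duration_seconds',
    'Payment transaction duration',
    ['payment_method'],
    buckets=[0.05, 0.1, 0.25, 0.5, 1.0, 2.0, 5.0, 10.0]
)

# In the payment handler
with payment_transaction_duration.labels(payment_method='card').time():
    outcome = charge(payment)  # placeholder for your payment logic

payment_transactions_total.labels(
    status='success' if outcome.ok else 'failed',
    payment_method='card'
).inc()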

Further Reading