System Design and Engineering Interview Guide¶

A practical, interview-focused guide that helps you structure answers, make tradeoffs explicit, and communicate clearly. Includes frameworks, cheat sheets, worked examples, and question banks spanning system design, microservices, load balancing, consistency, and distributed systems.

Contents: - 1. Interview Strategy - 2. System Design Framework - 3. Capacity Estimation Cheat Sheet - 4. Core Building Blocks - 5. Microservice Design - 6. Load Balancing - 7. Consistency and CAP - 8. Data Partitioning and Storage - 9. Caching - 10. Messaging and Streaming - 11. Observability, Reliability, and Operations - 12. Security Basics - 13. Common Design Templates - 14. Question Bank (Prompts and Follow-ups) - 15. Glossary - 16. References - 17. Distributed Systems Fundamentals - 18. Apache Kafka and Event Streaming - 19. Kubernetes Essentials - 20. Cloud Services Cheat Sheet (AWS, GCP, Azure)

1) Interview Strategy¶

Clarify requirements
Functional: core features, APIs, data flows, SLAs.
Non-functional: scale (RPS/QPS), latency budgets (p95/p99), availability target, durability, cost constraints, data privacy/regulations.
Access patterns: read/write ratios, skew (hot keys), geographic distribution.
Define constraints and scope
Back-of-the-envelope estimates.
Prioritize must-have vs nice-to-have.
Propose a high-level design
Draw components: client, CDN, LB, API, services, storage, cache, queue, analytics.
Define data flow and control flow.
Drill into key challenges
Bottlenecks; consistency vs latency; partitioning; failure modes; backpressure.
Alternatives with tradeoffs.
Deep dive on one or two subsystems
Pick the riskiest/highest impact subsystem and go deep.
Address operational concerns
Observability, rollouts, incident response, cost, security.
Summarize tradeoffs and next steps
“We optimized for X at the cost of Y. If requirements change to Z, we can pivot to …”

Signals interviewers look for: - Structured thinking, clear tradeoffs, pragmatic decisions, correct use of concepts, realistic estimations, and strong communication.

2) System Design Framework¶

APIs and Contracts
Public APIs (HTTP/gRPC), input validation, idempotency keys, pagination, error semantics.
Data Modeling
Entities, relationships, indexes, normalization vs denormalization, schema evolution.
Storage
SQL vs NoSQL; consistency needs; OLTP vs OLAP; hot/cold tiers; durability needs.
Compute
Stateless vs stateful services; synchronous vs asynchronous; CPU vs I/O bound.
Caching
Where: client, CDN, edge, service, DB.
What: objects, query results, computed views.
How: TTL, invalidation strategies, write policies.
Partitioning and Replication
Hash-based, range-based, geo-sharding; replication factors and placement.
Consistency Model
Strong vs eventual; session/RYW; quorum math; isolation levels.
Load Balancing and Routing
L4 vs L7; algorithms; sticky sessions; circuit breaking; retries and timeouts.
Messaging
Queues vs streams; at-least-once vs exactly-once patterns; outbox/inbox.
Reliability and SLOs
Error budgets, backpressure, autoscaling, degradation strategies.
Observability
Metrics, logs, traces, SLO dashboards, alerting.
Security and Privacy
AuthN/Z, mTLS, token scopes, PII handling, encryption at rest/in transit.
Deployment and Operations
Blue/green, canary, feature flags, rollbacks, disaster recovery (RPO/RTO).

3) Capacity Estimation Cheat Sheet¶

Common conversions: - 1 KB ≈ 10^3 bytes; 1 MB ≈ 10^6; 1 GB ≈ 10^9; 1 TB ≈ 10^12 - 1 day ≈ 86,400 seconds

Back-of-the-envelope steps: - Requests per second (RPS) = daily active users × requests per user per day ÷ 86,400 - Data rate = event size × events per second - Storage per day = data rate × 86,400 - Network egress cost often dominates at scale; be mindful.

Latency budget example (p95 target 200 ms end-to-end): - CDN/cache: 5-20 ms - LB + TLS: 5-10 ms - App logic: 20-50 ms - DB read: 2-10 ms (cache hit) or 10-50 ms (miss) - Cross-region adds 50-150 ms

4) Core Building Blocks¶

CDN
Offload static content; edge caching; signed URLs; cache-control headers.
API Gateway
Routing, auth, rate limiting, request/response transformations.
Service Mesh
mTLS, retries, timeouts, circuit breaking, observability sidecars.
Databases
SQL (Postgres/MySQL): transactions, joins, strong consistency.
NoSQL (Cassandra/DynamoDB): horizontal scale, tunable consistency.
Search (Elasticsearch/OpenSearch), Analytics (ClickHouse, BigQuery).
Object Storage
Durable blob storage (S3/GCS); lifecycle policies; CDN fronting.
Caches
Redis/Memcached; eviction policies; distributed locks carefully.
Queues/Streams
SQS/RabbitMQ (queues), Kafka/Pulsar (streams); ordering, partitioning.
Compute
Containers/K8s; autoscaling; serverless for bursty workloads.

5) Microservice Design¶

Service boundaries
Domain-driven design (DDD) bounded contexts; avoid chatty RPC between services.
Contracts and versioning
Backward-compatible schemas; consumer-driven contracts; API versioning.
Data ownership
Each service owns its data; avoid shared DB across many services.
Communication patterns
Sync: gRPC/HTTP; Async: events/queues for decoupling and resilience.
Distributed transactions
Avoid 2PC across services; use Saga pattern (choreography or orchestration).
Resilience
Retries with jittered backoff; timeouts; circuit breakers; idempotency.
Service discovery
DNS, Consul, Kubernetes services; health checks; TTLs.
API Gateway/BFF
Backend-for-Frontend to tailor APIs per client; aggregate calls; caching.
Observability
Trace context propagation; RED/USE metrics; structured logging; exemplars.
Deployment
Independent deployability; canary releases; feature flags; schema evolution.

Tradeoffs: - Microservices improve team autonomy and scale-out but increase complexity, operational overhead, and consistency challenges. Start simple, split only when justified.

6) Load Balancing¶

Layers
L4 (TCP/UDP) vs L7 (HTTP/gRPC) load balancing.
Algorithms
Round robin, weighted round robin, least connections, least response time, power of two choices, consistent hashing (sticky to partitions).
Health checking
Active (probes) and passive (error rate-based); outlier detection.
Session affinity
Cookie-based or IP-based; minimize stateful affinity by making services stateless.
Global load balancing
GeoDNS, anycast, GSLB; route to nearest healthy region; failover plans.
TLS termination
At edge or at service; offload vs end-to-end encryption.
Retry and timeout policies
Prevent retry storms; use hedged requests sparingly; enforce budgets.
Rate limiting
Token bucket/leaky bucket; per-client/per-route; global vs local counters.

7) Consistency and CAP¶

CAP theorem
Under partition, choose between availability and consistency. Most internet-scale systems are AP with tunable consistency.
Consistency models
Strong (linearizable), sequential, causal, eventual, read-your-writes, monotonic reads/writes.
Quorums (for N replicas)
Read quorum R + write quorum W > N to achieve strong consistency for reads.
Example: N=3, W=2, R=2 gives R+W=4>3.
Isolation levels (DB)
Read uncommitted, read committed, repeatable read, serializable.
Conflict resolution
Last write wins (with caveats), vector clocks, CRDTs, app-specific merges.
Idempotency
Use idempotency keys; put operations behind unique request IDs.

8) Data Partitioning and Storage¶

Partitioning strategies
Hash-based (uniform, hard to range scan), range-based (good for range scans, hot ranges risk), directory/lookup (flexible, metadata overhead).
Hot keys and skew
Mitigate with time-bucketed keys, random suffixes, or consistent hashing with virtual nodes.
Replication
Synchronous (lower RPO, higher latency) vs asynchronous (higher RPO risk).
Leader-follower vs leaderless (Dynamo-style).
Secondary indexes
Local vs global secondary indexes; write amplification; consistency implications.
Schema evolution
Backward-compatible changes; dual writes/migrations; online backfills.

9) Caching¶

Layers
Client cache, CDN/edge, service-level cache (Redis), DB cache (buffer pool).
Patterns
Read-through, write-through, write-back, cache-aside.
Invalidation
TTLs, explicit invalidation on writes, versioned keys.
Pitfalls
Thundering herd: add jitter, request coalescing, locks, stale-while-revalidate.
Inconsistent cache + DB: accept eventual consistency or enforce write-through.
Key design
Namespacing, include version/schema hash; avoid unbounded cardinality.

10) Messaging and Streaming¶

Queues vs streams
Queues: competing consumers, at-least-once, work distribution.
Streams: ordered partitions, replays, multiple consumer groups, event sourcing.
Delivery semantics
At-most-once, at-least-once (most common), effectively-exactly-once (with idempotency and transactional outbox).
Outbox pattern
Write data and outbox in same transaction; relay to stream asynchronously.
Backpressure
Consumer lag, dynamic concurrency, dead-letter queues; circuit breaking upstream.

11) Observability, Reliability, and Operations¶

SLI/SLO/Error budgets
Define latency, availability, throughput. Track p50/p95/p99, saturation.
Metrics
RED (Rate, Errors, Duration), USE (Utilization, Saturation, Errors).
Tracing
Propagate context; sample smartly; link to logs; analyze critical paths.
Logging
Structured JSON; correlation IDs; PII scrubbing; retention policies.
Resilience patterns
Timeouts, retries with jitter, circuit breakers, bulkheads, load shedding.
Rollouts
Blue/green, canary, feature flags, progressive delivery; rollback plans.
DR and backups
RPO/RTO objectives; multi-AZ/region; backup verification; chaos testing.

12) Security Basics¶

AuthN/AuthZ
OAuth2/OIDC, JWT, short-lived tokens, scopes; ABAC/RBAC.
Transport security
TLS everywhere, mTLS between services; cert rotation.
Data security
Encryption at rest (KMS); key rotation; field-level encryption for PII.
Secrets management
Vault/KMS/SM; never bake secrets into images.
Threat modeling
OWASP Top 10; input validation; WAF; rate limiting; audit logging.

13) Common Design Templates¶

Each template lists: API, data model, architecture, scale, and key challenges.

A) URL Shortener - API - POST /shorten {long_url} -> {short_code} - GET /{short_code} -> 301 redirect to long_url - Data model - short_code (PK), long_url, created_at, owner_id, ttl (optional), visit_count - Architecture - CDN + edge cache for GET - API service (stateless) - DB: KV or SQL with unique index on short_code - Cache: short_code -> long_url - ID generation: base62 from sequence or hash(long_url) - Scale - Heavy read, moderate write - Pre-warm hot codes; Bloom filter to reduce DB misses - Challenges - Custom aliases collisions; abuse detection; TTL/purge; analytics separate

B) News Feed (Fan-out) - API - POST /post; GET /feed?user_id - Data model - posts, user_follows, user_feed (denormalized) - Architecture - Write path: enqueue fan-out to followers’ feeds (asynchronous) - Read path: merge user_feed + recency + personalization - Storage: posts in object store + metadata in DB; user_feed in KV - Cache: user_feed pages - Scale - Hot users with millions of followers: partial fan-out, on-read merge - Challenges - Ordering, dedupe, pagination, privacy, spam

C) Chat/Messaging - API - WebSocket for realtime; REST for history - Data model - conversations, messages (conversation_id, sender_id, seq_no, timestamp) - Architecture - Gateway (sticky by conversation), message broker, storage (append-only) - Presence service; delivery receipts; typing indicators (ephemeral) - Scale - Partition by conversation_id; global ordering per conversation only - Challenges - Mobile offline, end-to-end encryption (optional), spam, abuse

D) Rate Limiter - Algorithms - Token bucket/leaky bucket, fixed/sliding window - Architecture - Local limiter in gateway + global counter in Redis; Lua for atomic ops - Consistent hashing of keys; approximate counters for large scale - Challenges - Cluster-wide sync vs eventual; fairness; burst handling

E) File Storage Service - API - POST /upload; GET /download; signed URLs; multipart - Architecture - CDN + object storage; metadata DB; background virus scan - Deduplication via content hash; lifecycle to cold storage - Challenges - Large files, resumable uploads, encryption, egress costs

F) Notifications (Email/SMS/Push) - Architecture - Producer -> queue -> worker pools -> provider fan-out with retries - Idempotency per user+template+dedupe window - Challenges - Provider failures, deliverability, rate limits, opt-outs, compliance

14) Question Bank¶

System Design Prompts: - Design a URL shortener for 1B URLs and 1M RPS reads. - Design Twitter timeline with hot celebrities and 50M DAU. - Design WhatsApp-like chat with end-to-end encryption and 100M MAU. - Design a globally available file sharing service with 99.99% availability. - Design a realtime ride-hailing dispatch system with surge pricing. - Design a globally distributed configuration service with low-latency reads.

Microservices: - When to split a monolith; identify service boundaries. - How to implement Saga for order -> payment -> inventory -> shipping. - Design API gateway + service mesh architecture. - Versioning strategy for breaking API changes.

Load Balancing: - Choose between least-connections vs power-of-two choices. - Global LB across three regions; failover plan under partition. - Sticky sessions vs stateless services; when and how.

Consistency: - Choose consistency model for cart checkout; read-your-writes requirements. - Quorum configuration for N=5 replicas targeting high availability. - Handling conflicting updates with CRDTs vs LWW vs app-level merges.

Data: - Partitioning strategy for time-series metrics with hot tenants. - Designing global secondary indexes with write-heavy workload. - Schema evolution with rolling deployments and online backfills.

Caching: - Prevent thundering herd under cache miss for hot keys. - Cache invalidation strategies for profile updates. - Layered caching for product catalog and pricing.

Messaging: - Exactly-once pipeline design; outbox pattern; deduplication. - Choosing Kafka vs RabbitMQ; consumer lag management; dead-letter queues.

Observability/Operations: - Define SLOs for an API; error budget policy. - Incident response flow for a cascading failure. - Safe rollout plan for a high-risk change.

Follow-up Probes: - Failure modes and mitigation. - Tradeoffs if requirement X changes. - Cost awareness and optimizations. - Testing strategies (property tests, chaos experiments).

15) Glossary¶

CAP: Consistency, Availability, Partition tolerance.
RPO/RTO: Recovery Point/Time Objective.
Quorum: Minimum number of replicas participating to accept an operation.
SLO/SLI/SLA: Objective/Indicator/Agreement for reliability.
Saga: Sequence of local transactions with compensations across services.
Idempotency: Replaying an operation yields same effect.
Backpressure: Mechanism to slow producers when consumers lag.
Hedged requests: Duplicate requests to reduce tail latency (use sparingly).

16) References¶

Designing Data-Intensive Applications (Kleppmann)
Site Reliability Engineering (Google SRE)
The Art of Scalability (Abbott, Fisher)
Papers: Dynamo, Spanner, Raft
Production Ready Microservices (Newman)
Architecture blogs: AWS Builders, Google Cloud, ACM Queue

Usage in interviews: - Start with the framework (Section 2), do quick estimates (Section 3), assemble blocks (Section 4), and go deep on the hardest parts (Sections 5–12). Use templates (Section 13) to accelerate common designs and the question bank (Section 14) to practice.

17) Distributed Systems Fundamentals¶

Key concepts: - Failure models: process crash, network partitions, slow nodes (the common case), split brain, correlated failures (AZ outage). - Time and clocks: - Wall vs monotonic clocks; NTP drift; don’t rely on exact time ordering across nodes. - Logical clocks: Lamport clocks (causal ordering), vector clocks (conflict detection). - Consensus and membership: - Raft/Paxos for leader election and log replication; Single-writer (leader) simplifies invariants. - Failure detectors, heartbeats, quorum-based membership, gossip protocols. - Quorums and replication: - For N replicas, choose R/W such that R + W > N for read-write strong reads. - Anti-entropy, hinted handoff, read repair for AP systems. - Idempotency and exactly-once: - Exactly-once delivery is a system-level illusion; implement idempotent handlers with request IDs/outbox/inbox. - Backpressure and flow control: - Bounded queues, shedding load, circuit breakers; push back to callers with retry-after. - Data movement: - Rebalancing on scale-out/in; consistent hashing with virtual nodes; directory/lookup services. - Testing and resilience: - Chaos experiments, fault injection; steady-state SLO verification.

Interview prompts: - Explain why wall-clock timestamps can’t ensure total order. How do you detect/resolve conflicts? - When to choose leader-based vs leaderless replication? - Design a membership service with gossip and failure suspicion.

18) Apache Kafka and Event Streaming¶

Core model: - Topics split into partitions; ordering is guaranteed within a partition. - Producers choose partitions (by key hashing or custom strategy). - Consumer groups: each partition assigned to one consumer per group; parallelism = partitions. - Offsets: consumers control position; commits are how you checkpoint.

Storage/retention: - Append-only log; retention by time/size; log compaction keeps latest record per key (good for changelog tables). - Tiered storage (in some distributions/clouds) extends retention at lower cost.

Delivery semantics: - At-least-once by default: handle duplicates via idempotent consumers or dedupe tables. - Idempotent producer + transactions enable effectively exactly-once (EOS) with careful design. - Producer configs: enable.idempotence=true, acks=all, appropriate retries/backoff.

Schema and evolution: - Use a schema registry (Avro/Protobuf/JSON-Schema); enforce compatibility (backward/forward/full). - Version events; avoid breaking changes; prefer additive evolution.

Partitioning and keys: - Choose keys to balance load and preserve locality (e.g., user_id). - Hot keys: add random suffixes or bucketization; handle skew.

Rebalancing and availability: - Rebalance triggers on membership change; tune session/heartbeat timeouts; cooperative rebalancing to minimize disruption. - Replication factor ≥ 3; min.insync.replicas ≥ 2 for durability under broker failure.

Multi-region and DR: - Disaster recovery via MirrorMaker 2 / cluster linking; accept RPO > 0 unless synchronous stretch (high latency). - Geo-local consumers to reduce egress; consider per-region topics + async replication.

Ecosystem: - Kafka Connect for source/sink connectors; Single Message Transforms. - Stream processing: Kafka Streams, ksqlDB, Apache Flink/Spark Structured Streaming. - Observability: lag metrics per consumer group/partition; broker health, ISR, request latency.

Interview pitfalls: - “Exactly-once” claims without idempotency or transactions. - Mis-sized partitions (too few limits parallelism; too many wastes resources). - Using time-based ordering across partitions instead of per-key ordering.

19) Kubernetes Essentials¶

Core resources: - Pod (smallest unit), Deployment (stateless), StatefulSet (stable identity/storage), DaemonSet (per-node). - Service (ClusterIP/NodePort/LoadBalancer/Headless), Ingress/Ingress Controller for L7 routing. - ConfigMap/Secret for config; RBAC for authZ; ServiceAccount for identity.

Reliability and scaling: - Probes: liveness, readiness, startup. Use readiness to gate traffic; liveness for self-heal. - HPA (CPU/memory/custom metrics); PDB to protect against voluntary disruptions. - Requests/limits to get proper QoS; avoid CPU throttling and OOMKills. - Node autoscaling (cluster autoscaler); prioritize via PriorityClass and preemption.

Networking and security: - CNI provides pod networking; NetworkPolicies for east-west controls; mTLS via service mesh (Istio/Linkerd). - PodSecurity admission; image scanning; secrets mounted via CSI/KMS.

Stateful workloads (e.g., Kafka/ZooKeeper): - Use StatefulSets with PersistentVolumeClaims; set PodDisruptionBudgets and ordered updates. - Headless Service for stable DNS; rack/zone awareness via topology spread constraints.

Rollouts and ops: - RollingUpdate, Blue/Green, Canary (Argo Rollouts/Flagger); set maxSurge/maxUnavailable. - Troubleshooting: kubectl describe/get/logs/exec; common issues: CrashLoopBackOff, ImagePullBackOff, OOMKilled, Evicted. - Observability: metrics-server, Prometheus/Grafana, OpenTelemetry, events; set SLOs per service.

Interview prompts: - Design a multi-tenant K8s platform with resource isolation and network policies. - Deploy Kafka on K8s safely—what would you configure for storage, disruption, and upgrades?

20) Cloud Services Cheat Sheet (AWS, GCP, Azure)¶

Compute and orchestration: - Containers: AWS EKS, GCP GKE, Azure AKS. - Serverless: AWS Lambda, GCP Cloud Functions/Cloud Run, Azure Functions/Container Apps. - Batch/queues: AWS ECS/Fargate, GCP Cloud Run Jobs, Azure Container Instances.

Storage: - Object: AWS S3, GCP Cloud Storage (GCS), Azure Blob Storage. - Block: AWS EBS, GCP Persistent Disk, Azure Managed Disks. - Files: AWS EFS/FSx, GCP Filestore, Azure Files.

Databases: - Relational: AWS RDS/Aurora, GCP Cloud SQL/AlloyDB, Azure SQL Database. - NoSQL KV/Wide-column: AWS DynamoDB/Keyspaces, GCP Bigtable/Firestore, Azure Cosmos DB (multiple APIs). - Search/Analytics: AWS OpenSearch/Redshift, GCP BigQuery/Dataproc, Azure Synapse/Data Explorer.

Messaging/streaming: - Kafka: AWS MSK/Confluent Cloud, GCP Confluent Cloud, Azure Event Hubs for Kafka API. - Native: AWS Kinesis + SQS/SNS, GCP Pub/Sub, Azure Service Bus/Event Hubs.

Networking and delivery: - LB/Proxy: AWS ALB/NLB, GCP External/Internal LBs, Azure Application Gateway/Front Door. - CDN: AWS CloudFront, GCP Cloud CDN, Azure CDN. - DNS: Route 53, Cloud DNS, Azure DNS. - VPC/VNet, PrivateLink/Private Service Connect/Private Link Service for private connectivity.

Security and identity: - IAM: AWS IAM, GCP IAM, Azure RBAC. - Secrets/KMS: AWS Secrets Manager + KMS, GCP Secret Manager + KMS, Azure Key Vault. - mTLS/service identity: SPIRE, mesh integrations.

Observability: - AWS CloudWatch/X-Ray, GCP Cloud Monitoring/Trace/Logging, Azure Monitor/App Insights. - Managed OpenTelemetry collectors available across clouds.

Data governance and DR: - Cross-region replication (S3/Cloud Storage/Blob); multi-region DB options (DynamoDB Global Tables, Spanner, Cosmos DB). - RTO/RPO planning; backups and point-in-time recovery.

Cost and egress: - Data egress charges between regions/clouds can dominate; colocate consumers with producers; use CDNs to reduce origin egress. - Pick managed services where possible to reduce ops toil; compare SLAs.

Interview prompts: - Design a multi-region read-local/write-global service on AWS—how do you use DynamoDB Global Tables and Route 53?