DEVOPS / SRE

SLO SLI SLA

  • Service Level Objective
    • 99.9% availability per month
    • Error Budget
      • 99.9% availability per month -> 0.1% unavailability allowed -> about 43 minutes of downtime per month (30-day month)
  • Service Level Indicator
    • request error rate
    • request latency
  • Service Level Agreement
    • availability commitment made to the client
    • additional buffer: the SLA is set looser than the internal SLO
      • SLO = 99.9% -> SLA = 99.5%
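
The error budget arithmetic above can be checked with a small sketch. The function names and the 30-day month are assumptions made for illustration, not part of any standard tooling.

```python
def error_budget_minutes(slo_percent: float, period_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability target over a period.

    Assumes a 30-day month by default (hypothetical convention for this sketch).
    """
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     period_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = error_budget_minutes(slo_percent, period_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # SLO = 99.9% gives roughly 43.2 minutes per 30-day month.
    print(f"SLO 99.9% budget: {error_budget_minutes(99.9):.1f} min")
    # SLA = 99.5% gives 216 minutes, so the gap to the SLO is the buffer.
    print(f"SLA 99.5% budget: {error_budget_minutes(99.5):.1f} min")
```

The gap between the two budgets is the buffer: the internal target can be missed by a wide margin before the client-facing commitment is broken.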

Monitoring

  • Google SRE: four golden signals
    • Latency
      • p50 / p95 / p99 latency
    • Traffic
      • QPS (Queries Per Second)
    • Errors
      • error rate
      • exception count
      • timeouts
    • Saturation
      • CPU usage
      • Memory usage
      • Disk I/O
      • Queue length
  • Tools
    • Prometheus
    • Grafana
    • GCP Cloud Monitoring
    • Amazon CloudWatch
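
To make the latency and error signals concrete, here is a minimal sketch that computes p50 / p95 / p99 and an error rate from raw samples. It assumes the nearest-rank percentile method and in-memory samples. The tools listed above typically work from histograms or aggregated metrics instead, so treat this as an illustration of the definitions only. The function names are hypothetical.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def error_rate(total_requests: int, failed_requests: int) -> float:
    """Share of requests that failed; 0.0 when there was no traffic."""
    if total_requests == 0:
        return 0.0
    return failed_requests / total_requests


if __name__ == "__main__":
    # Hypothetical latencies in milliseconds: 1, 2, ..., 100.
    latencies_ms = list(range(1, 101))
    for p in (50, 95, 99):
        print(f"p{p} = {percentile(latencies_ms, p)} ms")
    print(f"error rate = {error_rate(10_000, 12):.4%}")
```

Tail percentiles (p95, p99) matter because an average can look healthy while a small share of requests is very slow.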

Deployment

  • Ansible
    • runs shell commands and configuration tasks in batch across many machines
  • Terraform
    • manages cloud infrastructure as code
  • K8s (Kubernetes)
    • runs containers in pods, with replicas for redundancy and more complex configuration

Zero-downtime deployments

  • Rolling Updates:
    • Gradually replacing instances of the application without taking the entire system offline.
  • Blue-Green Deployments:
    • Maintaining two production environments (blue and green) where one is active while the other is idle, allowing for seamless switching during updates.
  • Canary Releases:
    • Deploying new features to a small subset of users first, monitoring performance before a full rollout.
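
As an illustration of the canary idea above, the sketch below assigns each user to a stable bucket from 0 to 99 and sends users below a threshold to the new version. Hash-based bucketing is one common approach, assumed here for the example. The function names and version labels are hypothetical.

```python
import hashlib


def user_bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in [0, 100) using SHA-256."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def choose_version(user_id: str, canary_percent: int) -> str:
    """Return "canary" for roughly canary_percent of users, else "stable".

    The same user always lands in the same bucket, so raising the
    percentage only adds users to the canary and never removes any.
    """
    return "canary" if user_bucket(user_id) < canary_percent else "stable"


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    for percent in (1, 10, 50):
        share = sum(choose_version(u, percent) == "canary" for u in users) / len(users)
        print(f"canary_percent={percent}: observed share {share:.3f}")
```

Rolling back is a matter of setting the percentage to 0. A full rollout sets it to 100 once the monitored signals look healthy.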

Designing for the error budget

  • Multi-zone deployment
  • DB replication (multi-region)
  • Load balancer
  • Health checks
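
A rough way to see why redundancy helps is the standard availability arithmetic sketched below. It assumes component failures are independent, which real zones and regions only approximate. The availability figures are hypothetical.

```python
def parallel_availability(single: float, replicas: int) -> float:
    """Availability of N redundant replicas where any one can serve traffic.

    Assumes independent failures: the system is down only when all
    replicas are down at the same time.
    """
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    return 1.0 - (1.0 - single) ** replicas


def serial_availability(*components: float) -> float:
    """Availability of a chain where every component must be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


if __name__ == "__main__":
    # Hypothetical numbers: each zone is 99% available on its own.
    two_zones = parallel_availability(0.99, 2)
    print(f"two zones: {two_zones:.4%}")
    # A load balancer in front sits in series with the zones.
    print(f"with 99.99% load balancer: {serial_availability(0.9999, two_zones):.4%}")
```

Health checks are what make the parallel formula approach reality: traffic has to be steered away from a failed replica quickly for the redundancy to count.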

Incident Response

  • Post-mortem
  • RCA (Root Cause Analysis)