DEVOPS / SRE
SLO SLI SLA
- Service Level Objective
- 99.9% availability per month
- Error Budget
- 99.9% availability per month -> 0.1% unavailability per month -> about 43 min of downtime per month (30-day month)
- Service Level Indicator
- request error rate
- request latency
- Service Level Agreement
- availability committed to the client
- additional buffer below the SLO
- SLO = 99.9% -> SLA = 99.5%
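
The error-budget arithmetic above can be checked with a short sketch. It assumes a fixed 30-day month (43,200 minutes); the function name `error_budget_minutes` is made up for illustration and does not come from any library.

```python
def error_budget_minutes(availability_pct: float, days: int = 30) -> float:
    """Allowed downtime in minutes for an availability target.

    Assumption: a fixed-length window of `days` days (30 by default).
    """
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - availability_pct) / 100.0


slo_budget = error_budget_minutes(99.9)  # internal objective
sla_budget = error_budget_minutes(99.5)  # commitment to the client

print(f"SLO 99.9%: {slo_budget:.1f} min/month")
print(f"SLA 99.5%: {sla_budget:.1f} min/month")
print(f"Buffer between SLO and SLA: {sla_budget - slo_budget:.1f} min/month")
```

The gap between the two budgets is the buffer: the service can miss its internal SLO and still meet the SLA promised to the client.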
Monitoring
- Google SRE: the four golden signals
- Latency
- p50 / p95 / p99 latency
- Traffic
- QPS (Queries Per Second)
- Errors
- error rate
- exception count
- timeout
- Saturation
- CPU usage
- Memory usage
- Disk I/O
- Queue length
- tools
- Prometheus
- Grafana
- GCP Cloud Monitoring
- Amazon CloudWatch
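
As a rough illustration of the latency, traffic, and error signals listed above, the sketch below summarizes one window of requests using only the Python standard library. The `Request` record, the `summarize` function, and the sample data are all hypothetical; in practice a tool such as Prometheus would compute these from exported metrics.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Request:
    latency_ms: float
    ok: bool  # False for an error, exception, or timeout


def summarize(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Summarize latency percentiles, traffic, and error rate for one window."""
    latencies = [r.latency_ms for r in requests]
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = quantiles(latencies, n=100, method="inclusive")
    errors = sum(1 for r in requests if not r.ok)
    return {
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
        "qps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
    }


# Hypothetical sample: 100 requests in a 10-second window,
# latencies of 1..100 ms, every 20th request failing.
sample = [Request(latency_ms=float(i), ok=(i % 20 != 0)) for i in range(1, 101)]
stats = summarize(sample, window_seconds=10.0)
print(stats)
```

Saturation (CPU, memory, disk I/O, queue length) is left out because it is read from the host or runtime rather than derived from request records.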
Deployment
- Ansible
- run deployment commands in batch across many machines
- Terraform
- manage cloud infrastructure as code
- K8s
- runs containers in pods, with replicas for redundancy and more complex configuration
Zero-downtime deployments
- Rolling Updates:
- Gradually replacing instances of the application without taking the entire system offline.
- Blue-Green Deployments:
- Maintaining two production environments (blue and green) where one is active while the other is idle, allowing for seamless switching during updates.
- Canary Releases:
- Deploying new features to a small subset of users first, monitoring performance before a full rollout.
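
One common way to pick the "small subset of users" for a canary release is deterministic hashing of a user ID, so the same user always lands on the same version. The sketch below is a minimal illustration under that assumption; `canary_bucket`, `choose_version`, and the `user-N` IDs are hypothetical names, and real setups usually do this at the load balancer or service mesh.

```python
import hashlib


def canary_bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def choose_version(user_id: str, canary_percent: int) -> str:
    """Return "canary" for roughly canary_percent% of users, else "stable"."""
    return "canary" if canary_bucket(user_id) < canary_percent else "stable"


users = [f"user-{i}" for i in range(10_000)]
for percent in (0, 5, 50, 100):
    share = sum(choose_version(u, percent) == "canary" for u in users) / len(users)
    print(f"canary_percent={percent:3d} -> observed share {share:.1%}")
```

Because the bucket is fixed per user, raising `canary_percent` only adds users to the canary group; nobody flips back to the stable version mid-rollout.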
Designing for the error budget
- Multi-zone deployment
- DB replication (multi-region)
- Load Balancer
- health check
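
A back-of-the-envelope sketch of why redundancy helps stay within the error budget. It rests on strong assumptions that rarely hold exactly: zones fail independently, and the load balancer's health checks route around a failed zone instantly. The per-zone figure of 99% and both function names are hypothetical.

```python
def redundant_availability(zone_availability: float, zones: int) -> float:
    """Availability when at least one of `zones` independent replicas is up."""
    return 1.0 - (1.0 - zone_availability) ** zones


def serial_availability(*components: float) -> float:
    """Availability of a chain where every component must be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


single_zone = 0.99  # hypothetical per-zone availability
for n in (1, 2, 3):
    print(f"{n} zone(s): {redundant_availability(single_zone, n):.6f}")

# A load balancer in front of two zones: the chain is only as good
# as the product of its parts.
lb = 0.9999  # hypothetical load balancer availability
print(f"LB + 2 zones: {serial_availability(lb, redundant_availability(single_zone, 2)):.6f}")
```

Under these assumptions, adding a zone multiplies the unavailability by the per-zone failure probability, while every component placed in series (load balancer, database) lowers the total.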
Incident Response
- post-mortem
- RCA (Root Cause Analysis)