Time for System Reliability, SRE, Production Go-live
Define SLIs and SLOs for the system to ensure its reliability.
The error budget is expressed as a percentage: 100% minus the SLO target.
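A minimal sketch of how an error budget could be computed from an SLO target. The function names, the 30-day window, and the 99.9% target are illustrative assumptions, not values from these notes.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Allowed unavailability in `window` for an availability SLO (e.g. 0.999)."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return window * (1.0 - slo_target)


def budget_consumed_pct(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Share of the request-based error budget already spent, in percent."""
    allowed_failures = total_requests * (1.0 - slo_target)
    if allowed_failures == 0:
        raise ValueError("a 100% SLO (or zero traffic) leaves no error budget")
    return 100.0 * failed_requests / allowed_failures


if __name__ == "__main__":
    # Assumed example: 99.9% SLO over a 30-day window.
    print(error_budget(0.999, timedelta(days=30)))
    # Assumed example: 250 failures out of 1,000,000 requests.
    print(round(budget_consumed_pct(0.999, 1_000_000, 250), 2))
```

With a 99.9% target the budget is 0.1% of the window, which for 30 days is roughly 43 minutes of downtime.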
SRE layers:
---
1 - Application
---
2 - Monitoring
---
3 - Alert
Applying SPC (statistical process control) to SLOs:
- interval
- time window
- stats
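One way the interval / time window / stats idea could look in practice: sample an SLI once per interval, compute control limits over a baseline window, and flag samples that fall outside them. This is a sketch assuming classic mean ± 3 sigma limits; the function names and sample values are hypothetical.

```python
from statistics import mean, pstdev


def control_limits(samples, sigmas=3.0):
    """Return (lower, center, upper) control limits for a window of SLI samples."""
    center = mean(samples)
    spread = pstdev(samples)
    return center - sigmas * spread, center, center + sigmas * spread


def out_of_control(baseline, recent, sigmas=3.0):
    """Return (index, value) pairs from `recent` that fall outside the baseline limits."""
    lower, _, upper = control_limits(baseline, sigmas)
    return [(i, v) for i, v in enumerate(recent) if v < lower or v > upper]


if __name__ == "__main__":
    # Assumed data: one latency sample (ms) per interval.
    baseline_window = [100, 102, 98, 101, 99, 100, 103, 97]
    latest_samples = [101, 104, 120, 99]
    print(out_of_control(baseline_window, latest_samples))
```

A shorter interval reacts faster but is noisier; a longer baseline window gives steadier limits but adapts slowly to a legitimate shift.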
4 golden signals of SRE:
- Traffic (availability is typically measured as an SLI built on these signals)
- Latency
- Error
- Saturation
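A rough sketch of deriving these signals from a batch of request records. The `Request` record, the nearest-rank percentile, and the capacity inputs are assumptions made for illustration; a real setup would normally read these from a monitoring system.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(requests, window_seconds, used_capacity, total_capacity):
    """Summarise latency, traffic, errors and saturation for one time window."""
    if not requests:
        raise ValueError("no requests in window")
    errors = sum(1 for r in requests if not r.ok)
    error_rate = errors / len(requests)
    return {
        "latency_p95_ms": percentile([r.latency_ms for r in requests], 95),
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": error_rate,
        "availability": 1.0 - error_rate,  # derived from the error signal
        "saturation": used_capacity / total_capacity,
    }


if __name__ == "__main__":
    # Assumed sample: 20 requests in a 10 s window, the slowest one failed.
    sample = [Request(latency_ms=10.0 * i, ok=(i != 20)) for i in range(1, 21)]
    print(golden_signals(sample, window_seconds=10, used_capacity=6, total_capacity=8))
```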
Mission of SRE:
- toil elimination: e.g. reduce unnecessary logs.
- maintain service level and manage risk: make the go / no-go call on a release when the risk is high
- handle failure: incident management, SOPs, post-mortems
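The go or no-go release call could, for example, be driven by how much of the error budget has already been spent. The sketch below is only an illustration: the function name and both thresholds are assumptions, not a policy stated in these notes.

```python
def release_decision(
    budget_consumed_pct: float,
    high_risk: bool,
    freeze_threshold_pct: float = 100.0,    # assumed: budget exhausted, freeze releases
    high_risk_threshold_pct: float = 50.0,  # assumed: stricter bar for risky releases
) -> str:
    """Return "go" or "no-go" for a release based on error budget consumption."""
    if budget_consumed_pct >= freeze_threshold_pct:
        return "no-go"
    if high_risk and budget_consumed_pct >= high_risk_threshold_pct:
        return "no-go"
    return "go"


if __name__ == "__main__":
    print(release_decision(30.0, high_risk=True))
    print(release_decision(70.0, high_risk=True))
    print(release_decision(70.0, high_risk=False))
```

Keeping the thresholds as parameters lets each team agree on its own numbers instead of hard-coding them.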
Simply put, SRE is the team that runs the production system.
It combines software engineering (SWE) and systems engineering.
Target: keep toil at or below 50% of SRE time (Google team)
- goal: system reliability and availability
- increased productivity
- toolset and system standardization
- simplification
- culture of automation
Example: 20 working days x 6 SREs = 120 days
Breaking it down:
40 days for on-call + 20 days for additional toil + 60 days for engineering = 120 days
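The same arithmetic as a small sketch, so the split can be recomputed for other team sizes. The function name and returned keys are hypothetical; the numbers are the ones from the example.

```python
def sre_time_budget(working_days: int, engineers: int, oncall_days: int, toil_days: int) -> dict:
    """Split a team's total working days into on-call, toil and engineering time."""
    total = working_days * engineers
    engineering = total - oncall_days - toil_days
    if engineering < 0:
        raise ValueError("on-call and toil exceed the total available days")
    return {
        "total_days": total,
        "oncall_days": oncall_days,
        "toil_days": toil_days,
        "engineering_days": engineering,
        # Share of time spent on operational work (on-call plus toil).
        "operational_share": (oncall_days + toil_days) / total,
    }


if __name__ == "__main__":
    print(sre_time_budget(working_days=20, engineers=6, oncall_days=40, toil_days=20))
```

With these inputs the operational share comes out at 0.5, which matches the 50% target.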
Important! What on-call looks like: keep production live
- timely response
- managing incidents: triage / incident commander, communications, ops lead
- working on incidents effectively: model (triage, examine, diagnose, test, cure). TIL: Service Mesh Architecture
- post-mortems are key: everything is documented (identify, fix, report, review, publish) for continuous improvement.
Guidance
- production readiness reviews
- capacity planning
- designing for simplicity
Practical advice
- load balancing
- cascading failure
- testing for reliability