Time for System Reliability, SRE, Production Go-live

Define SLIs and SLOs for the system to ensure its reliability.

The error budget is expressed as a percentage: 100% minus the SLO target.
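
As a worked example of the percentage, here is a minimal sketch that turns an SLO target into an error budget. The 99.9% target, the 30-day window, and the function name `error_budget` are assumptions chosen for illustration, not values from these notes.

```python
def error_budget(slo_target: float, window_minutes: int) -> tuple[float, float]:
    """Return the error budget as (percent of the window, minutes of the window)."""
    budget_pct = 100.0 - slo_target                   # whatever the SLO does not promise
    budget_minutes = window_minutes * budget_pct / 100.0
    return budget_pct, budget_minutes


# Assumed values: 99.9% availability SLO over a 30-day window.
pct, minutes = error_budget(99.9, 30 * 24 * 60)
print(f"{pct:.2f}% -> {minutes:.1f} minutes")  # 0.10% -> 43.2 minutes
```

So a 99.9% SLO over 30 days leaves roughly 43 minutes of allowed unreliability.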

SRE layers breakdown:

---
1 - Application
---
2 - Monitoring
---
3 - Alert

Applying SPC (statistical process control) to SLOs

  • interval
  • time window
  • stats
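
One possible reading of the three items above, sketched under assumptions: sample the SLI at a fixed interval, collect the samples of one time window as a baseline, and use mean and standard deviation as the stats for control limits. The 3-sigma rule, the function names, and the sample values are all hypothetical.

```python
from statistics import mean, stdev


def control_limits(samples, sigmas=3.0):
    """Return (lower, center, upper) control limits for one window of samples."""
    center = mean(samples)
    spread = stdev(samples)
    return center - sigmas * spread, center, center + sigmas * spread


def out_of_control(baseline, new_points, sigmas=3.0):
    """Return the new points that fall outside the baseline's control limits."""
    lower, _, upper = control_limits(baseline, sigmas)
    return [x for x in new_points if x < lower or x > upper]


# Hypothetical error-rate SLI (%), sampled every 5 minutes over a 1-hour window.
baseline = [0.10, 0.12, 0.09, 0.11, 0.10, 0.13, 0.08, 0.10, 0.11, 0.09, 0.12, 0.10]
flagged = out_of_control(baseline, [0.11, 0.45, 0.10])
print(flagged)  # [0.45]
```

Points inside the limits are treated as normal variation; only points outside them would be candidates for an alert.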

4 golden signals of SRE

  • Latency
  • Traffic
  • Errors
  • Saturation

Availability is commonly tracked next to these as an SLI.
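
A minimal sketch of how such signals might be computed from raw request samples in one observation window. The `Request` type, the field names, the nearest-rank p95, and the worker-pool notion of saturation are assumptions for illustration; availability is derived here simply as 1 minus the error rate.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float  # time taken to serve the request
    ok: bool           # False when the request failed


def summarize(requests, window_seconds, in_use, capacity):
    """Summarize one non-empty observation window into signal values."""
    latencies = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(0.95 * len(latencies)))  # nearest-rank p95
    error_rate = sum(1 for r in requests if not r.ok) / len(requests)
    return {
        "latency_p95_ms": latencies[rank - 1],
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": error_rate,
        "availability": 1.0 - error_rate,
        "saturation": in_use / capacity,
    }


# Hypothetical window: 20 requests in 10 seconds, one failure, 8 of 10 workers busy.
requests = [Request(latency_ms=50.0 + i, ok=(i != 7)) for i in range(20)]
signals = summarize(requests, window_seconds=10, in_use=8, capacity=10)
print(signals)
```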

mission of SRE:

  • toil elimination: reduce unnecessary logs.
  • maintain service level and manage risk: make a go / no-go decision for a release when the risk is high
  • handle failure: incident management, SOP, post-mortems
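
The go / no-go decision can be tied to the error budget. The sketch below is one hypothetical policy, not a rule from these notes: block the release when less than 25% of the budget is left. The function name `release_decision` and the threshold are assumptions.

```python
def release_decision(slo_target, observed_availability, min_budget_left=0.25):
    """Return ("go" | "no-go", fraction of the error budget still remaining)."""
    budget = 100.0 - slo_target                       # allowed unreliability, in %
    if budget <= 0:
        return "no-go", 0.0                           # a 100% SLO leaves no budget
    spent = max(0.0, 100.0 - observed_availability)   # unreliability so far, in %
    remaining = 1.0 - spent / budget                  # negative means overspent
    decision = "go" if remaining >= min_budget_left else "no-go"
    return decision, remaining


# Assumed SLO of 99.9% with two hypothetical observations for the current window.
healthy = release_decision(99.9, 99.95)   # half of the budget left
burned = release_decision(99.9, 99.85)    # budget overspent
print(healthy[0], burned[0])  # go no-go
```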

    Simply put, SRE is the team that runs the production system.

It combines SWE (software engineering) and systems engineering.

The target is to keep toil at or below 50% of SRE time (Google team)

  • Target is: system reliability and availability
  • productivity increase
  • toolset standardization
  • system simplification
  • culture of automation

Example: 20 working days x 6 SREs = 120 days.
Breaking it down: 40 days for on-call + 20 days for additional toil + 60 days for engineering budget = 120 days.
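
The same arithmetic as a short sketch, to check that the split adds up and that operational work stays within the 50% target. Counting on-call together with additional toil as the operational share is an assumption about how the notes intend the split.

```python
working_days = 20
sres = 6
total = working_days * sres  # 120 SRE-days in the period

allocation = {"on-call": 40, "additional toil": 20, "engineering": 60}
assert sum(allocation.values()) == total  # the split must cover the whole period

# Operational share = on-call + additional toil (assumed grouping).
ops_share = (allocation["on-call"] + allocation["additional toil"]) / total

for name, days in allocation.items():
    print(f"{name}: {days} days ({days / total:.1%})")
print(f"operational share: {ops_share:.0%}")  # operational share: 50%
```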

Important! What on-call looks like: keep production live

  • timely response
  • managing incidents: triage / incident commander, communications, ops lead
  • working on incidents effectively: model (triage, examine, diagnose, test, cure)
  • post-mortems are key: document everything, then identify, fix, report, review, publish. Continuous improvement.

Guidance

  • production readiness reviews
  • capacity planning
  • designing for simplicity

practical advice

  • load balancing
  • cascading failure
  • testing for reliability
  • https://deepwiki.com/linkedin/school-of-sre/3.5-data-systems
  • https://deepwiki.com/mxssl/sre-interview-prep-guide/2.4.2-sre-processes
  • https://deepwiki.com/bregman-arie/sre-checklist/4.4-gitops-with-argocd