Time for System Reliability, SRE, Production Go-live

Created: 04 Jul 2025

3 min read

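Define SLIs (service level indicators) and SLOs (service level objectives) for the system to ensure its reliability.

The error budget is expressed as a percentage: the share of unreliability the SLO still allows.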
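To make the percentage concrete, here is a minimal sketch, assuming the common convention that the error budget is 100% minus the SLO. The function names, the 30-day window and the request counts are illustrative assumptions, not something taken from a specific tool.

```python
def error_budget(slo_percent: float) -> float:
    """Error budget in percent, assuming budget = 100% - SLO."""
    if not 0 < slo_percent < 100:
        raise ValueError("SLO must be strictly between 0 and 100 percent")
    return 100.0 - slo_percent


def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Downtime the budget allows over a window (window length is an assumption)."""
    budget_fraction = error_budget(slo_percent) / 100.0
    return budget_fraction * window_days * 24 * 60


def budget_consumed(good_events: int, total_events: int, slo_percent: float) -> float:
    """Fraction of the error budget already spent, from a request-based SLI."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_failures = error_budget(slo_percent) / 100.0 * total_events
    actual_failures = total_events - good_events
    return actual_failures / allowed_failures


if __name__ == "__main__":
    print(f"budget: {error_budget(99.9):.3f}%")
    print(f"downtime per 30 days: {allowed_downtime_minutes(99.9):.1f} min")
    print(f"consumed: {budget_consumed(999_500, 1_000_000, 99.9):.0%}")
```

With a 99.9% SLO the budget is 0.1%, which works out to roughly 43.2 minutes of downtime in a 30-day window.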

Diving into the SRE layers:

---
1 - Application
---
2 - Monitoring
---
3 - Alert

Toil elimination: reduce unnecessary logs.
Maintain service level and manage risk: make a go / no-go decision for a release when it is high risk.
Handle failure: incident management, SOPs, post-mortems.
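The go / no-go decision can be sketched as a simple gate on the remaining error budget. This is only one possible policy. The 50% threshold for high-risk releases and the function name `release_decision` are hypothetical and chosen for illustration.

```python
# Hypothetical policy threshold: a high-risk release needs at least
# this much of the error budget left (percent of the budget, not of uptime).
HIGH_RISK_MIN_BUDGET_PERCENT = 50.0


def release_decision(budget_remaining_percent: float, high_risk: bool) -> str:
    """Return "go" or "no-go" for a release under the assumed policy."""
    if budget_remaining_percent <= 0:
        # Budget exhausted: freeze releases regardless of risk.
        return "no-go"
    if high_risk and budget_remaining_percent < HIGH_RISK_MIN_BUDGET_PERCENT:
        # Not enough headroom to absorb a risky change.
        return "no-go"
    return "go"


if __name__ == "__main__":
    print(release_decision(80.0, high_risk=True))
    print(release_decision(30.0, high_risk=True))
    print(release_decision(30.0, high_risk=False))
```

The point of the gate is that the decision comes from data (how much budget is left) rather than from opinion. The actual thresholds would be agreed between the SRE and product teams.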

Simply put, SRE is the team that runs the production system.

Example: 20 working days x 6 SREs = 120 person-days.
Breaking it down:
40 days for on-call, 20 days for additional toil, 60 days of engineering budget.
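The same arithmetic as a small sketch, so the numbers can be changed for a different team size. The function and field names are assumptions made for this example.

```python
def sre_capacity(working_days: int, engineers: int,
                 oncall_days: int, toil_days: int) -> dict:
    """Split a team's person-days into on-call, toil and engineering time."""
    total = working_days * engineers
    engineering = total - oncall_days - toil_days
    if engineering < 0:
        raise ValueError("on-call and toil exceed the total capacity")
    return {
        "total_days": total,
        "oncall_days": oncall_days,
        "toil_days": toil_days,
        "engineering_days": engineering,
        "engineering_share": engineering / total,
    }


if __name__ == "__main__":
    plan = sre_capacity(working_days=20, engineers=6, oncall_days=40, toil_days=20)
    print(plan)
```

With the figures above, the engineering budget is 60 of 120 person-days, which is half of the team's capacity.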

Timely response matters.
Managing an incident: assign clear roles (triage / incident commander, communications, ops lead).
Working on an incident effectively: follow a model (triage, examine, diagnose, test, cure).
The post-mortem is key: document everything, then identify, fix, report, review, and publish. This drives continuous improvement.