
You do not need 100% automation. What you need is a systematic approach to handling problems followed by fixing the root cause.

Runbooks came from techops in broadcasting, power plant operations, etc., where there was a clear division between the operators who pushed buttons and ran cables and the people who decided which buttons to push and which cables to run. Dumb hands + runbooks created "smart hands".

If your SRE team runs like that, it is not SRE.

Look at how an incident is handled:

1. Identify the issue

2. Implement a workaround to restore the service

3. Identify the root cause

4. Implement a fix for the root cause

5. Remove the workaround

Runbooks cover steps 1 and 2.
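
To make that split concrete, here is a rough sketch of the five steps as data. Every name in it (Step, RUNBOOK_STEPS, remaining_engineering_work, is_closed) is made up for illustration and does not come from any real tool; it only shows that finishing the runbook still leaves three steps open.

```python
from enum import IntEnum


class Step(IntEnum):
    """The five incident-handling steps, numbered as in the list above."""
    IDENTIFY_ISSUE = 1
    IMPLEMENT_WORKAROUND = 2
    IDENTIFY_ROOT_CAUSE = 3
    FIX_ROOT_CAUSE = 4
    REMOVE_WORKAROUND = 5


# Assumption for this sketch: a runbook covers exactly steps 1 and 2.
RUNBOOK_STEPS = frozenset({Step.IDENTIFY_ISSUE, Step.IMPLEMENT_WORKAROUND})


def remaining_engineering_work(completed):
    """Return the open steps, in order, that no runbook covers."""
    return [s for s in Step if s not in completed and s not in RUNBOOK_STEPS]


def is_closed(completed):
    """An incident is closed only when all five steps are done."""
    return set(Step) <= set(completed)


# Service is restored once the runbook is done, but the incident is still open.
after_runbook = set(RUNBOOK_STEPS)
print([s.name for s in remaining_engineering_work(after_runbook)])
print(is_closed(after_runbook))
```

If the incident gets closed right after the runbook finishes, steps 3 to 5 never happen and the same runbook gets run again next week.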


