> Runbooks are more of an anti-pattern than anything.
No. They're essential to the sanity of the individual SRE oncall, and of the SRE team they are part of. They're also how institutional knowledge is preserved after successive rounds of oncall shifts.
If you think a playbook is bad, try being oncall for the first time for a massively (Google-scale) distributed system without a playbook.
> A list of commands you should run to accomplish X -> that's a script.
A list of possible root causes an SRE should consider, complex interactions with other systems that might be problematic, not overreacting to spurious alerts -> that's a playbook.
It seems the problems you're describing are management problems, not engineering problems. The people who designed, built, and are responsible for maintaining and updating the system should be on call if something goes wrong.
I view runbooks in the same way I view "knowledge transfers". It's management's desperate hope that somehow a person leaving a project or company can convey their acquired knowledge of a system on some Confluence page or from a few meetings. It's a complete failure to recognize the essence of the work.
I don't think runbooks are necessarily bad if they are used to bootstrap new team members. But thinking their existence is a green light to allow people without domain expertise in a system to be responsible for administering the system during "off hours" is misguided.
> The people who designed, built and are responsible for maintaining and updating the system should be on call if something goes wrong.
In my experience, the devs were somewhere on the US west coast, and the SRE teams were geographically distributed so that each covered part of the 24-hour period during its local daytime (nobody likes to be paged in the middle of the night). As an SRE in Zürich, I got paged in what was the middle of the night for the Kirkland people, dealt with the emergency (using the playbook), root-caused it (with the assistance of the playbook), and filed bugs to be looked at by the dev team when they woke up.
The systems stayed up, everyone could sleep at night, working as intended.
You do not need 100% automation. What you need is a systematic approach to handling problems followed by fixing the root cause.
Runbooks came from techops in broadcasting, power plant operations, etc., where there was a clear division between operators who pushed buttons, ran cables, etc. and those who made decisions about which buttons to push and which cables to run. Dumb hands + runbooks created "smart hands".