
> Runbooks are more of an anti-pattern than anything.

No. They're essential to the sanity of the individual SRE oncall, and of the SRE team they are part of. They're also how institutional knowledge is preserved after successive rounds of oncall shifts.

If you think a playbook is bad, try being oncall for the first time for a massively (Google-scale) distributed system without a playbook.

> A list of commands you should run to accomplish X -> that's a script.

A list of possible root causes an SRE should consider, complex interactions with other systems that might be problematic, not overreacting to spurious alerts -> that's a playbook.



It seems the problems you're describing are management problems, not engineering problems. The people who designed, built and are responsible for maintaining and updating the system should be on call if something goes wrong.

I view runbooks in the same way I view "knowledge transfers". It's management's desperate hope that somehow a person leaving a project or company can convey their acquired knowledge of a system on some Confluence page or from a few meetings. It's a complete failure to recognize the essence of the work.

I don't think runbooks are necessarily bad if they are used to bootstrap new team members. But thinking their existence is a green light to allow people without domain expertise in a system to be responsible for administering the system during "off hours" is misguided.


> The people who designed, built and are responsible for maintaining and updating the system should be on call if something goes wrong.

In my experience, the devs were somewhere on the US West Coast, and the SRE teams were geographically distributed to cover the 24 hour period during local daytime (nobody likes to be paged in the middle of the night). As an SRE in Zürich, I got paged in what was the middle of the night for the Kirkland people, dealt with the emergency (using the playbook), root-caused it (with the assistance of the playbook), and filed bugs to be looked at by the dev team when they woke up.

The systems stayed up, everyone could sleep at night, working as intended.


> and the SRE teams were geographically distributed to cover the 24 hour period during local daytime

Management problem number 1. These people should not be responsible for the running system.

> nobody likes to be paged in the middle of the night

Excellent motivation for the people that should be responsible for the running system to build quality software.


Why would randomly waking your engineers up in the middle of the night be an excellent motivation strategy?


It's an incentive to not release stuff that breaks in the middle of the night.

On the flip side, that can lead to slower releases or more expensive solutions.


> If you think a playbook is bad, try being oncall for the first time for a massively (Google-scale) distributed system without a playbook.

You are not Google scale. Don't invent a pen that writes in space when a pencil would do the trick.


I'd argue 100% automation is a space pen while run books are a pencil.


You do not need 100% automation. What you need is a systematic approach to handling problems followed by fixing the root cause.

Runbooks came from techops in broadcasting, power plant operations, etc., where there was a clear division between operators who pushed buttons, ran cables, etc. and those who made decisions about which buttons to push and which cables to run. Dumb hands + runbooks created "smart hands".

If your SRE runs like that, it is not SRE.

Look at the incident handling:

1. Identify the issue

2. Implement a workaround to restore the service

3. Identify the root cause

4. Implement a fix for the root cause

5. Remove the workaround

Runbooks cover 1. and 2.
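
To make the split concrete, here is a minimal sketch of those five steps as data, with the runbook covering only the first two. The names (IncidentStep, RUNBOOK_STEPS, and the two helper functions) are made up for illustration and don't come from any real tool.

```python
from enum import IntEnum


class IncidentStep(IntEnum):
    """The five incident-handling steps listed above, in order."""
    IDENTIFY_ISSUE = 1
    IMPLEMENT_WORKAROUND = 2
    IDENTIFY_ROOT_CAUSE = 3
    IMPLEMENT_FIX = 4
    REMOVE_WORKAROUND = 5


# Assumption taken from the comment above: a runbook covers steps 1 and 2 only.
RUNBOOK_STEPS = frozenset(
    {IncidentStep.IDENTIFY_ISSUE, IncidentStep.IMPLEMENT_WORKAROUND}
)


def covered_by_runbook(step: IncidentStep) -> bool:
    """Return True if the step is one the runbook is expected to handle."""
    return step in RUNBOOK_STEPS


def remaining_engineering_work() -> list[IncidentStep]:
    """Steps still left once the runbook has been followed, in order."""
    return [step for step in IncidentStep if step not in RUNBOOK_STEPS]
```

Calling remaining_engineering_work() gives back steps 3 to 5, which is the part that still needs someone who understands the system.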


> You are not Google scale.

I don't know, I think I was kind of Google scale when working as a Google SRE.


1) https://www.transposit.com/ is not Google.

2) Your impact on Google scale was nearly 0 because you were one of a thousand SREs at Google.


There are only 500.


My bad


There’s no tolerance for rude and unsubstantiated comments here.


Did you just assume my scale?


And check your metaphors before using them.

The pencil was more expensive.



