Table of Contents

Preface

Part I. New to SRE

1. Site Reliability Engineering in Six Words

2. Do We Know Why We Really Want Reliability?

3. Building Self-Regulating Processes

4. Four Engineers of an SRE Seder

5. The Reliability Stack

6. Infrastructure: It’s Where the Power Is

7. Thinking About Resilience

8. Observability in the Development Cycle

9. There Is No Magic

10. How Wikipedia Is Served to You

11. Why You Should Understand (a Little) About TCP

12. The Importance of a Management Interface

13. When It Comes to Storage, Think Distributed

14. The Role of Cardinality

15. Security Is like an Onion

16. Use Your Words

17. Where to SRE

18. Dear Future Team

19. Sustainability and Burnout

20. Don’t Take Advice from Graybeards

21. Facing That First Page

Part II. Zero to One

22. SRE, at Any Size, Is Cultural

23. Everyone Is an SRE in a Small Organization

24. Auditing Your Environment for Improvements

25. With Incident Response, Start Small

26. Solo SRE: Effecting Large-Scale Change as a Single Individual

27. Design Goals for SLO Measurement

28. I Have an Error Budget—Now What?

29. How to Change Things

30. Methodological Debugging

31. How Startups Can Build an SRE Mindset

32. Bootstrapping SRE in Enterprises

33. It’s Okay Not to Know, and It’s Okay to Be Wrong

34. Storytelling Is a Superpower

35. Get Your Work Recognized: Write a Brag Document

Part III. One to Ten

36. Making Work Visible

37. An Overlooked Engineering Skill

38. Unpacking the On-Call Divide

39. The Maestros of Incident Response

40. Effortless Incident Management

41. If You’re Doing Runbooks, Do Them Well

42. Why I Hate Our Playbooks

43. What Machines Do Well

44. Integrating Empathy into SRE Tools

45. Using ChatOps to Implement Empathy

46. Move Fast to Unbreak Things

47. You Don’t Know for Sure Until It Runs in Production

48. Sometimes the Fix Is the Problem

49. Legendary

50. Metrics Are Not SLIs (The Measure Everything Trap)

51. When SLOs Attack: Pathological SLOs and How to Fix Them

52. Holistic Approach to Product Reliability

53. In Search of the Lost Time

54. Unexpected Lessons from Office Hours

55. Building Tools for Internal Customers that They Actually Want to Use

56. It’s About the Individuals and Interactions

57. The Human Baseline in SRE

58. Remotely Productive or Productively Remote

59. Of Margins and Individuals

60. The Importance of Margins in Systems

61. Fewer Spreadsheets, More Napkins

62. Sneaking in Your DevOps Deliciously

63. Effecting SRE Cultural Changes in Enterprises

64. To All the SREs I’ve Loved

65. Complex: The Most Overloaded Word in Technology

Part IV. Ten to Hundred

66. The Best Advice I Can Give to Teams

67. Create Your Supporting Artifacts

68. The Order of Operations for Getting SLO Buy-In

69. Heroes Are Necessary, but Hero Culture Is Not

70. On-Call Rotations that People Want to Join

71. Study of Human Factors and Team Culture to Improve Pager Fatigue

72. Optimize for MTTBTB (Mean Time to Back to Bed)

73. Mitigating and Preventing Cascading Failures

74. On-Call Health: The Metric You Could Be Measuring

75. Helping Leaders Prioritize On-Call Health

76. The SRE as a Diplomat

77. The Forward-Deployed SRE

78. Test Your Disaster Plan

79. Why Training Matters to an SRE Practice and SRE Matters to Your Training Program

80. The Power of Uniformity

81. Bytes per User Value

82. Make Your Engineering Blog a Priority

83. Don’t Let Anyone Run Code in Your Context

84. Trading Places: SRE and Product

85. You See Teams, I See Product

86. The Performance Emergency Fund

87. Important but Not Urgent: Roadmaps for SREs

Part V. The Future of SRE

88. That 50% Thing

89. Following the Path of Safety-Critical Systems

90. Applicable and Achievable Static Analysis

91. The Importance of Formal Specification

92. Risk and Rot in Sociotechnical Systems

93. SRE in Crisis

94. Expected Risk Limitations

95. Beyond Local Risk: Accounting for Angry Birds

96. A Word from Software Safety Nerds

97. Incidents: A Window into Gaps

98. The Third Age of SRE