97 Things Every SRE Should Know
Table of Contents
1. Site Reliability Engineering in Six Words
2. Do We Know Why We Really Want Reliability?
3. Building Self-Regulating Processes
4. Four Engineers of an SRE Seder
6. Infrastructure: It’s Where the Power Is
8. Observability in the Development Cycle
10. How Wikipedia Is Served to You
11. Why You Should Understand (a Little) About TCP
12. The Importance of a Management Interface
13. When It Comes to Storage, Think Distributed
19. Sustainability and Burnout
20. Don’t Take Advice from Graybeards
22. SRE, at Any Size, Is Cultural
23. Everyone Is an SRE in a Small Organization
24. Auditing Your Environment for Improvements
25. With Incident Response, Start Small
26. Solo SRE: Effecting Large-Scale Change as a Single Individual
27. Design Goals for SLO Measurement
28. I Have an Error Budget—Now What?
31. How Startups Can Build an SRE Mindset
32. Bootstrapping SRE in Enterprises
33. It’s Okay Not to Know, and It’s Okay to Be Wrong
34. Storytelling Is a Superpower
35. Get Your Work Recognized: Write a Brag Document
37. An Overlooked Engineering Skill
38. Unpacking the On-Call Divide
39. The Maestros of Incident Response
40. Effortless Incident Management
41. If You’re Doing Runbooks, Do Them Well
44. Integrating Empathy into SRE Tools
45. Using ChatOps to Implement Empathy
46. Move Fast to Unbreak Things
47. You Don’t Know for Sure Until It Runs in Production
48. Sometimes the Fix Is the Problem
50. Metrics Are Not SLIs (The Measure Everything Trap)
51. When SLOs Attack: Pathological SLOs and How to Fix Them
52. Holistic Approach to Product Reliability
53. In Search of the Lost Time
54. Unexpected Lessons from Office Hours
55. Building Tools for Internal Customers that They Actually Want to Use
56. It’s About the Individuals and Interactions
58. Remotely Productive or Productively Remote
59. Of Margins and Individuals
60. The Importance of Margins in Systems
61. Fewer Spreadsheets, More Napkins
62. Sneaking in Your DevOps Deliciously
63. Effecting SRE Cultural Changes in Enterprises
64. To All the SREs I’ve Loved
65. Complex: The Most Overloaded Word in Technology
66. The Best Advice I Can Give to Teams
67. Create Your Supporting Artifacts
68. The Order of Operations for Getting SLO Buy-In
69. Heroes Are Necessary, but Hero Culture Is Not
70. On-Call Rotations that People Want to Join
71. Study of Human Factors and Team Culture to Improve Pager Fatigue
72. Optimize for MTTBTB (Mean Time to Back to Bed)
73. Mitigating and Preventing Cascading Failures
74. On-Call Health: The Metric You Could Be Measuring
75. Helping Leaders Prioritize On-Call Health
79. Why Training Matters to an SRE Practice and SRE Matters to Your Training Program
82. Make Your Engineering Blog a Priority
83. Don’t Let Anyone Run Code in Your Context
84. Trading Places: SRE and Product
85. You See Teams, I See Product
86. The Performance Emergency Fund
87. Important but Not Urgent: Roadmaps for SREs
89. Following the Path of Safety-Critical Systems
90. Applicable and Achievable Static Analysis
91. The Importance of Formal Specification
92. Risk and Rot in Sociotechnical Systems
95. Beyond Local Risk: Accounting for Angry Birds
96. A Word from Software Safety Nerds