SRE Weekly Issue #476

View on sreweekly.com
Automation and The Substitution Myth
The myth is: The underlying and often unexamined assumption for the benefits of automation is the notion that computers/machines are better at some tasks, and humans are better at a different, non-overlapping set of tasks. This article lays out several pitfalls to this approach, with references.
Courtney Nash
Incident Report: Spotify Outage on April 16, 2025
Wow, I seriously love this one. It’s written in a very approachable style that’s easy to understand from the outside. It lays out a series of cringe-worthy contributing factors that could happen to any of us, making them a great learning opportunity. ...

May 12, 2025

SRE Weekly Issue #475

Anomaly Detection in Time Series Using Statistical Analysis
I haven’t seen this level of detail in an article on anomaly detection in quite a while. Still, the math is very approachable even if you slept through stats class.
Ivan Shubin, Booking.com
A Key Incident Response Skill That Can Reduce Resolution Time
TL;DR: The Power of Knowledge Overlap in Incident Response
There’s an anecdote in this one that’s really making me think. ...

May 5, 2025

SRE Weekly Issue #474

A message from our sponsor, incident.io: We’ve just raised $62M at incident.io to build AI agents that resolve incidents with you. See how we’re pioneering a new era of incident management. https://go.incident.io/blog/incident.io-raises-62m
Why do we do blameless incident reviews?
This is a truly outstanding article about blameless incident analysis! Beyond just “why”, it covers many of the pitfalls that trip people up when they try to enact a blameless culture, including questions about accountability. ...

April 28, 2025

SRE Weekly Issue #473

Scaling Nextdoor’s Datastores: Part 5
In this final installment of the Scaling Nextdoor’s Datastores blog series, we detail how the Core-Services team at Nextdoor solved cache consistency challenges as part of a holistic approach to improve our database and cache scalability and usability. ...

April 21, 2025

SRE Weekly Issue #472

Scaling Nextdoor’s Datastores: Part 4
In this part of the Scaling Nextdoor’s Datastores blog series, we will see how the Core-Services team at Nextdoor keeps its cache consistent with database updates and avoids stale writes to the cache. ...

April 14, 2025

SRE Weekly Issue #471

A message from our sponsor, incident.io: We’re building an AI agent that investigates incidents with you, diagnosing the problem and even fixing it. Go behind the scenes with the incident.io engineers rethinking what’s possible with AI, one ambitious idea (and bug) at a time. https://go.incident.io/building-with-ai
Models, models every where, so let’s have a think
The author of this one draws a line between their two interests of formal methods and resilience engineering, and I’m so here for it. ...

April 7, 2025