SRE Weekly Issue #480

View on sreweekly.com A message from our sponsor, PagerDuty: 🔍 Notable PagerDuty shift: Full incident management now spans all paid tiers. The upgraded Slack-first and Teams-first experience means fewer tools to juggle during incidents. Only leveraging PagerDuty for basic alerting? Time to check out what’s newly available in your plan! ...

June 9, 2025

SRE Weekly Issue #479

Automatic rollbacks are a last resort
Rollbacks don’t always return you to a previous system state. They can return you to a state you’ve never tested or operated before.
Steve Fenton — Octopus Deploy
Burn rate is a better error rate
This article explains the math of burn rate alerting and gives well-thought-out reasoning for why burn rates are better.
James Frullo — Datadog
Is There A Purpose In Assigning Incident Severity?
This hot take is worth thinking about: what do you want to get out of assigning incident severity levels, and is it working? ...

June 2, 2025

SRE Weekly Issue #478

Security and SRE: How Datadog’s combined approach aims to tackle security and reliability challenges
Datadog has fully merged their SRE and Security teams. In this post, we’ll look at essential elements of SRE and security, the benefits we’ve realized by combining the two disciplines, and what that approach looks like for us.
Bianca Lankford — Datadog
What I Really Mean When I Say “Good Communication” in Incident Response
I love the way this article describes three different audiences for your communication during incidents. It describes what each audience is looking for and gives both positive and negative examples of how to communicate with them. ...

May 26, 2025

SRE Weekly Issue #477

Human Error Strikes Again… or Does It?
Why don’t we look for the root cause of a successful outcome?
Hamed Silatani — Uptime Labs
How we optimized LLM use for cost, quality, and safety to facilitate writing postmortems
They took a great deal of care to avoid the potential pitfalls of using an LLM in this way, and they share a lot of detail about the steps they took. ...

May 19, 2025

SRE Weekly Issue #476

Automation and The Substitution Myth
The myth is: The underlying and often unexamined assumption for the benefits of automation is the notion that computers/machines are better at some tasks, and humans are better at a different, non-overlapping set of tasks. This article lays out several pitfalls to this approach, with references.
Courtney Nash
Incident Report: Spotify Outage on April 16, 2025
Wow, I seriously love this one. It’s written in a very approachable style that’s easy to understand from the outside. It lays out a series of cringe-worthy contributing factors that could happen to any of us, making them a great learning opportunity. ...

May 12, 2025

SRE Weekly Issue #475

Anomaly Detection in Time Series Using Statistical Analysis
I haven’t seen this level of detail in an article on anomaly detection in quite a while. Still, the math is very approachable even if you slept through stats class.
Ivan Shubin — Booking.com
A Key Incident Response Skill That Can Reduce Resolution Time
TL;DR: The Power of Knowledge Overlap in Incident Response
There’s an anecdote in this one that’s really making me think. ...

May 5, 2025

SRE Weekly Issue #474

A message from our sponsor, incident.io: We’ve just raised $62M at incident.io to build AI agents that resolve incidents with you. See how we’re pioneering a new era of incident management. https://go.incident.io/blog/incident.io-raises-62m
Why do we do blameless incident reviews?
This is a truly outstanding article about blameless incident analysis! Beyond just “why”, it covers many of the pitfalls that trip people up when they try to enact a blameless culture, including questions about accountability. ...

April 28, 2025

SRE Weekly Issue #473

A message from our sponsor, incident.io: We’ve just raised $62M at incident.io to build AI agents that resolve incidents with you. See how we’re pioneering a new era of incident management. https://go.incident.io/blog/incident.io-raises-62m
Scaling Nextdoor’s Datastores: Part 5
In this final installment of the Scaling Nextdoor’s Datastores blog series, we detail how the Core-Services team at Nextdoor solved cache consistency challenges as part of a holistic approach to improve our database and cache scalability and usability. ...

April 21, 2025

SRE Weekly Issue #472

A message from our sponsor, incident.io: We’ve just raised $62M at incident.io to build AI agents that resolve incidents with you. See how we’re pioneering a new era of incident management. https://go.incident.io/blog/incident.io-raises-62m
Scaling Nextdoor’s Datastores: Part 4
In this part of the Scaling Nextdoor’s Datastores blog series, we will see how the Core-Services team at Nextdoor keeps its cache consistent with database updates and avoids stale writes to the cache. ...

April 14, 2025

SRE Weekly Issue #471

A message from our sponsor, incident.io: We’re building an AI agent that investigates incidents with you—diagnosing the problem and even fixing it. Go behind the scenes with the incident.io engineers rethinking what’s possible with AI, one ambitious idea (and bug) at a time. https://go.incident.io/building-with-ai
Models, models every where, so let’s have a think
The author of this one draws a line between their two interests of formal methods and resilience engineering, and I’m so here for it. ...

April 7, 2025