You show up at the office, and your boss says, “Alright — starting today, your job is to Reduce Incidents.” …Where do you begin? That’s the journey we’ll explore with you today
from real-world firefighting 2. Turning incident data into action 3. Rolling out a unified process company-wide Disclaimer 1: Single-case study (N = 1); findings are context-specific. Please keep that in mind. A Six-Month Journey on the Incident Task Force
& Ops … Dev already handles Ops + incidents 2. Applying SRE/DevOps best practices … Lessons drawn from the field Disclaimer 2: I’m not a pro SRE or DevOps guru!
How do we even define “a lot” of incidents? • Ikuo: Are we actually making that many changes? • Are changes even the root cause of these incidents? • What kind of changes are we talking about? At this point it was all just gut feeling and guesswork. Hold up! (Though I’ve learned that a senior engineer’s nose for trouble is not to be underestimated.) 1-1. The Beginning
[Org-chart slide: ACT members drawn from each division (SmartView (Article), Ads, News & Push, Ranking, CoreSystem, Mobile), members Ikuo, D, R, T, M, and T, managed by VPoE K and reporting to the CTO] * Let me call myself an “all-star” just for the sake of this story 🙏 1-2. Assemble the Strongest Team We pulled in the all-stars from every division…
was real. Pulling aces from every team showed how serious the company was about this. At the same time... we had no excuses. The downside of a top-down project 1-2. Assemble the Strongest Team
answers “Reduce incidents.” Sounds simple—turns out, it’s a massive problem area. • Where do we even start? • What’s the real problem? What actually helps? • And... are there even that many incidents :)?
Change” really mean? 1. Reduce critical incidents 2. Install SRE best practices into the org • Define Key KPIs to improve: • Mean Time Between Failures (MTBF) / Change Failure Rate (CFR) = # of incidents • Mean Time to Recovery (MTTR) = Recovery time The “why are we here?” got crystal clear Thanks to our awesome VPoE 1-3. Guiding the Team
Dirty — Jump into every incident! • Page (PagerDuty call) an ACT member for every incident • Pull every ACT member into each live incident • Fight the fire if it’s in your domain • Handle updates, escalation, and biz communications even if it’s not Brutal!!
they will now be the “pager monkeys” whose job it is to follow a script at 2 a.m. when the service goes down — From “Becoming SRE” Chapter 3: SRE Culture Definitely a bad practice Jumping into every incident won’t stop incidents… 2-1. P0: Supporting Ongoing Incident Handling
started thinking: Incident = ACT. And we earned a lot of trust! ACT shows up when there’s trouble. ACT’s got our back during incidents. ACT gets things done! 2-1. P0: Supporting Incident Handling
action items — Why? • We had a culture of writing incident reports. • And even listing action items for prevention. — Awesome! • But those items weren’t being tracked. • No assignees, no due dates, no status. … WHAT?? • And the report format differed for each division… • Sometimes even per person. Thus, there were definitely ones that just got… forgotten.
• Gather all action items from every incident report • Then auto-create Jira tickets, send reminders! • But… each report had a totally different format. • Now what? Help me, ChatGPT… That’s not happening. 2-2. P1. Crush unresolved critical action items
Lesson #1: Always store data in a machine-readable format!! Heck yeah! We manually moved a year’s worth of incident action items (AIs) into a Notion database! *Once it’s a database, you can pull it via API. Easy mode. Lesson #2: Don’t be afraid to get your hands dirty if it serves the mission. 2-2. P1. Crush unresolved critical action items
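As a sketch of what “pull it via API” enables once the action items live in a database, the snippet below queries a Notion database and opens a Jira ticket per open item. The database ID, property names, Jira project key, and auth details are all hypothetical placeholders, not our actual setup.

```python
# Minimal sketch: pull open action items from a Notion database and
# auto-create Jira tickets for them. All IDs and credentials are placeholders.
import requests

NOTION_TOKEN = "secret_..."             # assumption: a Notion integration token
NOTION_DB_ID = "<action-items-db-id>"   # hypothetical database ID
JIRA_BASE = "https://example.atlassian.net"  # hypothetical Jira instance

def fetch_open_action_items() -> list[dict]:
    """Query the Notion database for action items that are not done yet."""
    resp = requests.post(
        f"https://api.notion.com/v1/databases/{NOTION_DB_ID}/query",
        headers={
            "Authorization": f"Bearer {NOTION_TOKEN}",
            "Notion-Version": "2022-06-28",
            "Content-Type": "application/json",
        },
        # assumption: the database has a "Status" status property
        json={"filter": {"property": "Status", "status": {"does_not_equal": "Done"}}},
    )
    resp.raise_for_status()
    return resp.json()["results"]

def create_jira_ticket(summary: str) -> str:
    """Create one Jira task for an action item (basic auth with an API token)."""
    resp = requests.post(
        f"{JIRA_BASE}/rest/api/2/issue",
        auth=("bot@example.com", "api-token"),   # hypothetical service account
        json={"fields": {
            "project": {"key": "ACT"},           # hypothetical project key
            "summary": summary,
            "issuetype": {"name": "Task"},
        }},
    )
    resp.raise_for_status()
    return resp.json()["key"]
```

From there, a scheduled job can diff open items against existing tickets and send reminders; the point is that none of this is possible while the data sits in free-form report text.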
Response Process • Well, because… • Each division had its own way of handling incidents. • We’d previously tried to build a company-wide protocol, the “IRF: Incident Response Framework”. • But... it never really got used. • Why? — It only reflected the needs of one division. Why didn’t we already have a unified company-wide process?
as a company-wide & lightweight • Domain “all-stars” filled in the gaps • Must stay lightweight — who reads a wall of text during a fire🔥? • Borrowed proven parts from public frameworks • e.g. PagerDuty Incident Response How did we build a company-wide process and framework? Compile a unified framework for the whole company! 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process
3. Workflow 4. Communication Guideline 5. Incident Report Template, Postmortem Let me walk you through the key parts Please check the slide later for the details! 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process
engineer on call. Triages alerts and escalates to the IC if necessary, initiating the IRF (declaring the incident). • Incident Commander (IC) • Leads the incident response. Brings in necessary people and organizes information. May also act as the CL (Communication Lead). • Usually a Tech Lead or Engineering Manager. • Their responsibility is not to directly fix the issue, but to organize and make decisions. • Responder • Handles the actual work, such as rollbacks, config changes, etc. • Communication Lead (CL) • Handles communication with external stakeholders (i.e., non-engineers). Key point: separate responsibilities between IC and Responder 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process
severity judgment when declaring the incident. The final severity level is determined during the postmortem. • 🔥 SEV-1 • Complete failure of core UX features (e.g., news reading becomes unavailable) • 🧨 SEV-2 • Partial failure of core UX features, or complete failure of sub-UX features • 🕯 SEV-3 • Partial failure of sub-UX features It’s crucial to estimate severity early on; severe incidents should be resolved faster 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process
of an incident. 🩸 = Bleeding status (ongoing impact) 1. 🩸 Occurrence • An issue arises. Common triggers include deployments or config changes. 2. 🩸 Detection • The on-call engineer detects the issue via alerts. Triage begins. 3. 🩸 Declaration • The incident is officially declared. IRF begins under the IC's lead. External communication starts as needed. • While bleeding, updates must be continuously provided. 4. ❤🩹 Mitigation • Temporarily eliminate the cause (e.g., rollback) and stop further impact. 5. Resolution • Permanently fix the issue (e.g., bug fix, data correction). Bleeding is fully stopped. 6. Postmortem • Investigate root causes and discuss recurrence prevention based on the incident report. 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process
place (Slack channels): • #incident • Used for status updates to the entire company and for communication with external stakeholders. • #incident-irf-[incidentId]-[title] • For technical communication to resolve the issue. • All relevant discussions and information are gathered here. Having all discussions and info in one place makes writing the report much easier later 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process
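Purely as an illustration (the deck doesn’t say channel creation was automated), opening the per-incident channel and announcing it in #incident could look like this with the Slack SDK; the token, scopes, and naming details are assumptions.

```python
# Illustrative sketch: create the #incident-irf-<id>-<title> channel and
# announce it in #incident. Token and channel names are placeholders.
from slack_sdk import WebClient

client = WebClient(token="xoxb-...")   # assumption: a bot token with channel-management scopes

def open_incident_channel(incident_id: str, title: str) -> str:
    # Slack channel names must be lowercase, without spaces, max 80 chars
    name = f"incident-irf-{incident_id}-{title}".lower().replace(" ", "-")[:80]
    channel = client.conversations_create(name=name)["channel"]
    client.chat_postMessage(
        channel="#incident",  # assumption: the bot is already a member of #incident
        text=f"Incident declared: {title}. Technical discussion in <#{channel['id']}>",
    )
    return channel["id"]
```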
company-wide template includes: • Summary • Impact • Direct Cause, Mitigation • Root Cause Analysis (5-whys) • It’s crucial to analyze direct and root causes separately. • Based on root causes, define action items to prevent recurrence • Timeline • Use a machine-readable format!!!! We standardized templates across divisions (super important!) and centralized all postmortems. 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process
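A minimal sketch, assuming Python dataclasses, of what one machine-readable report record could look like. The field names follow the template sections above; the exact schema here is illustrative, not the team’s actual one.

```python
# Illustrative report schema: every section of the template becomes a typed
# field, so reports can be aggregated and queried instead of read one by one.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class ActionItem:
    description: str
    assignee: str          # no more "no assignees, no due dates, no status"
    due_date: datetime
    status: str = "open"

@dataclass
class IncidentReport:
    summary: str
    impact: str
    direct_cause: str
    mitigation: str
    root_causes: list[str]                       # from the 5-whys analysis
    action_items: list[ActionItem] = field(default_factory=list)
    timeline: dict[str, datetime] = field(default_factory=dict)  # state -> timestamp
```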
into every incident “Hello there, it’s me, Uncle IRF😎 Alright, I’ll be the Incident Commander this time! Everyone else, focus on firefighting!” Lesson #4: In an emergency, no one has time to learn a new protocol. Just do it and learn by doing! Lesson #5: Use it ourselves first, and build a feedback loop! 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process
did a lot! But… How many incidents did we handle this month? Or last month…? How long did it take to resolve each one? 2-4. P2: Root Fixes/ Enhancing Incident Clarity
we handle this month? Or last month…? How long did it take to resolve each one? No clue!! A critical realization: we’re not tracking KPIs! 2-4. P2: Root Fixes/ Enhancing Incident Clarity
Title • Status • State machine (we’ll get to this later) • Severity • SEV 1-3 (IRF 2.0) • Direct Cause • (explained later) • Direct Cause System • Group of components defined at the microservice level • Direct Cause Workload • Online Service, Offline Pipeline, … Define as many fields as possible using Enums! Free-form input → high cardinality, analysis breaks 2-4. P2: Root Fixes/ Enhancing Incident Clarity
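A minimal sketch of the “define fields as Enums” idea in Python. The members come from the slides above, but the exact taxonomy is illustrative.

```python
# Illustrative Enums: every constrained field gets a closed set of values,
# so analysis never has to deal with free-form strings.
from enum import Enum

class Severity(Enum):
    SEV1 = 1   # complete failure of core UX features
    SEV2 = 2   # partial failure of core UX / complete failure of sub-UX
    SEV3 = 3   # partial failure of sub-UX features

class Workload(Enum):
    ONLINE_SERVICE = "online_service"
    OFFLINE_PIPELINE = "offline_pipeline"

class Status(Enum):            # the state machine states (next slide)
    OCCURRED = "occurred"
    DETECTED = "detected"
    DECLARED = "declared"
    MITIGATED = "mitigated"
    RESOLVED = "resolved"
```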
transition between them — a State Machine! 2-4. P2: Root Fixes/ Enhancing Incident Clarity Record time for every transition; key time metrics pop out automatically
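A minimal sketch of that state machine in Python: record a timestamp for every transition and the key time metrics fall out of the timeline. The allowed transitions and the exact metric definitions (which deltas count as detect/mitigate/resolve time) are illustrative assumptions based on the lifecycle slide.

```python
# Illustrative incident state machine: one timestamp per transition,
# metrics derived from the recorded timeline.
from datetime import datetime, timedelta

# Assumed transition order, following the lifecycle slide
TRANSITIONS = {
    "occurred": {"detected"},
    "detected": {"declared"},
    "declared": {"mitigated"},
    "mitigated": {"resolved"},
}

class IncidentStateMachine:
    def __init__(self, occurred_at: datetime):
        self.timeline = {"occurred": occurred_at}
        self.state = "occurred"

    def transition(self, new_state: str, at: datetime) -> None:
        if new_state not in TRANSITIONS.get(self.state, set()):
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
        self.timeline[new_state] = at   # record the time of every transition

    def metrics(self) -> dict[str, timedelta]:
        # Assumed definitions, for illustration only
        t = self.timeline
        return {
            "time_to_detect": t["detected"] - t["occurred"],
            "time_to_mitigate": t["mitigated"] - t["detected"],
            "time_to_resolve": t["resolved"] - t["mitigated"],
        }
```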
the data definition is solid, the source can be flexible. (As long as the data is trustworthy, of course.) 2-4. P2: Root Fixes/ Enhancing Incident Clarity • Make required fields into mandatory attributes. • Add a Notion database for the state timeline • Have people record when states change • Make it machine-readable!!!!!
unified format) • Different formats across divisions • Free-format input, missing attributes, and so on… • Now what? Give it three months and we’ll have plenty of data. Right? We only have six months!! 2-4. P2: Root Fixes/ Enhancing Incident Clarity
Heck yeah! We manually migrated one year’s worth of incident reports. We divided the work and got it done within a week. 2-4. P2: Root Fixes/ Enhancing Incident Clarity Re: Lesson #2: Don’t be afraid to get your hands dirty if it serves the mission.
Declared → 4. Mitigated → 5. Resolved [Timeline diagram: Time To Detect, Time To Mitigate, and Time To Resolve marked between the lifecycle states] Now we know where the time is going, and where we stand! 2-4. P2: Root Fixes/ Enhancing Incident Clarity
What do we really want to reduce — incident count? 🤔 Maybe not. It’s the impact caused by incidents, e.g. revenue, reputation, developer velocity…. Right? Especially revenue loss…
Incident impact = Σ over all incidents (Resolution Time × Severity Factor), where the Severity Factor is the impact level of an incident and the sum runs over the number of incidents. For us (B2C and Ads) this pretty much defines the revenue impact. • Shorten the time → Quick win: relatively easy to improve • Reduce severity → Ideal, but hard to control • Reduce incident count → Requires mid/long-term efforts 3-1. What Does It Mean to Reduce Incidents?
Incident impact = Σ over all incidents (Resolution Time × Severity Factor). For us (B2C and Ads) this pretty much defines the revenue impact. • Shorten the time → Quick win: relatively easy to improve ← We’re starting here • Reduce severity → Ideal, but hard to control • Reduce incident count → Requires mid/long-term efforts It also aligns with the KPIs we set when ACT was first formed! But a few months in, we gained much better clarity. 3-1. What Does It Mean to Reduce Incidents?
2. Detected → 3. Declared → 4. Mitigated → 5. Resolved Each one has different significance — and needs a different solution! • Time To Detect: mainly in the alerting domain. • Time To Mitigate: bleeding; most critical, but also the easiest to improve. This is where IRF comes in. • Time To Resolve: time spent on root fixes and data correction. Bleeding has stopped; now it’s about accuracy, not speed. 3-2. Approaching Incident Resolution Time
more alerts doesn’t help • It can even make things worse • “Over-monitoring is a harder problem to solve than under-monitoring.” — SRE: How Google Runs Production Systems • Too many alerts (possibly false positives) → alert fatigue → real alerts get buried/ignored • Alert on SLO / Error-Budget burn instead • Not something you can fix overnight Still a work in progress — We’ll revisit this in Chapter 4 3-2. Approaching Incident Resolution Time
unified framework: IRF 2.0 • Clear incident definition – when to call it an incident • Unified response workflow & communication guideline • Role split: Incident Commander vs Responder • Responder can focus on firefighting • Ongoing drills & training — Aces lead by example Deploying top aces + rolling out IRF 2.0 had a huge impact! 3-2. Approaching Incident Resolution Time
without testing Postmortem discussion… • Why was it deployed without testing? • → Because it could only be tested in production. • Why only in production? • → Lack of data, broken staging environment, etc… • …. Alright! Let’s fix the staging environment! 3-3. Approaching the Number of Incidents
Environments Still in Progress: Way harder than we thought! • There are tons of components. • Each division—News, Ads, Infra—has different needs and usage. • Ads is B2B, tied directly to revenue → needs to be solid and stable • News is B2C, speed of feature delivery is key Trying to build staging for everything? Not realistic, not even useful So we started with Ads, where the demand was highest 3-3. Approaching the Number of Incidents
Tests? • Why didn’t we catch it with unit tests? • Because we didn’t have any… • … 😭 Alright! Let’s collect test coverage! 3-3. Approaching the Number of Incidents
Coverage • Jumped into systems lacking coverage tracking • Opened PRs for generating reports • Plotted unit test coverage vs. # of incidents by system [Scatter plot: average unit test coverage vs. # of incidents, one point per system] 3-3. Approaching the Number of Incidents
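For illustration only, a plot like the one on this slide can be produced with a few lines of matplotlib; the systems and numbers below are made up.

```python
# Illustrative coverage-vs-incidents scatter plot, one point per system.
import matplotlib.pyplot as plt

systems = {  # hypothetical data: system -> (avg coverage %, # of incidents)
    "ads-api": (35.0, 9),
    "ranking": (62.0, 3),
    "push": (48.0, 5),
    "core": (71.0, 2),
}

coverage = [v[0] for v in systems.values()]
incidents = [v[1] for v in systems.values()]

plt.scatter(coverage, incidents)
for name, (c, n) in systems.items():
    plt.annotate(name, (c, n))          # label each point with the system name
plt.xlabel("Average unit test coverage (%)")
plt.ylabel("# of incidents")
plt.title("Coverage vs. incidents per system")
plt.show()
```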
incidents? → Yes, there was. (= Low coverage means more incidents) • Then: Does higher coverage actually reduce incidents? → Not sure. Correlation ≠ causation. • Still, digging into low-coverage / high-incident systems shows similar roots: • Hard to write tests / no testing culture / etc… Alright! Let’s jump into the low-coverage systems and help write unit tests! Approach #2 to Lack of Testing: Analyzing Unit Test Coverage 3-3. Approaching the Number of Incidents
Tests Get our hands dirty! Add tests to everything: 1. Use SonarQube to find files with high LOC and low coverage (see the sketch below) 2. Use LLMs to help generate tests 3. Repeat until the entire component hits 50%+ coverage. We hit 3–4 components… but it didn’t really change anything. We thought that if we provided a few examples, others would follow… 3-3. Approaching the Number of Incidents
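A sketch of step 1, querying the SonarQube Web API for file-level LOC and coverage. The server URL, project key, auth, and thresholds are placeholders, not our actual setup.

```python
# Illustrative sketch: list files with high LOC and low coverage from SonarQube.
import requests

SONAR_URL = "https://sonarqube.example.com"   # hypothetical server
PROJECT_KEY = "my-component"                  # hypothetical project key
AUTH = ("sonar-token", "")                    # assumption: token-based basic auth

resp = requests.get(
    f"{SONAR_URL}/api/measures/component_tree",
    params={
        "component": PROJECT_KEY,
        "metricKeys": "ncloc,coverage",
        "qualifiers": "FIL",                  # files only
        "ps": 500,
    },
    auth=AUTH,
)
resp.raise_for_status()

candidates = []
for comp in resp.json()["components"]:
    measures = {m["metric"]: float(m["value"])
                for m in comp.get("measures", []) if "value" in m}
    loc = measures.get("ncloc", 0)
    cov = measures.get("coverage", 0.0)
    if loc > 300 and cov < 50:                # "high LOC, low coverage" thresholds (illustrative)
        candidates.append((comp["key"], loc, cov))

# Largest, least-covered files first: the best targets for LLM-assisted tests
for key, loc, cov in sorted(candidates, key=lambda x: (-x[1], x[2])):
    print(f"{key}: {loc:.0f} LOC, {cov:.0f}% coverage")
```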
to build a habit of writing tests continuously • Problems: • No incentive, no shared value • Everyone is busy: pressured by hard deadlines • (May 2025 update): LLMs could change the game! This is a team culture and organizational challenge. To be continued in Chapter 4… 3-3. Approaching the Number of Incidents Approach #2 to Lack of Testing: Building Out Unit Tests
dynamically/online → e.g. A/B Testing and Feature Flags, Testing in production • We have in-house platforms for both • Problems • They were complicated… → Unintended A/B assignments and misconfigurations caused frequent issues Alright, let’s clean up A/B testing and feature flags! 3-3. Approaching the Number of Incidents
Establish usage guidelines for feature flags • Strengthen validation logic • (Bad configs caused parse errors and crashes…) Collaborated with the platform team and made a lot of improvements! 3-3. Approaching the Number of Incidents Approaching Config Changes
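As one illustration of “strengthen validation logic”: validate flag configs at load time so a bad entry fails fast with a clear error instead of crashing at runtime. The schema and field names here are hypothetical, not our platform’s actual config format.

```python
# Illustrative feature-flag config validation: reject malformed entries early.
from dataclasses import dataclass

@dataclass
class FeatureFlag:
    name: str
    enabled: bool
    rollout_percent: int   # 0-100

def parse_flag(raw: dict) -> FeatureFlag:
    """Turn one raw config entry into a validated FeatureFlag or raise."""
    missing = {"name", "enabled", "rollout_percent"} - raw.keys()
    if missing:
        raise ValueError(f"flag config missing fields: {missing}")
    if not isinstance(raw["enabled"], bool):
        raise ValueError(f"{raw['name']}: 'enabled' must be a boolean")
    pct = raw["rollout_percent"]
    if not isinstance(pct, int) or not 0 <= pct <= 100:
        raise ValueError(f"{raw['name']}: 'rollout_percent' must be an int in [0, 100]")
    return FeatureFlag(raw["name"], raw["enabled"], pct)
```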
bunch of offline streaming Flink jobs… • e.g. Server → Kafka → Flink → Scylla, ClickHouse, … • We have an in-house platform for these as well • Problems • Few Flink experts on the app team side → led to frequent issues: performance problems, restarts, missing unit tests, bugs, etc. Alright! Let’s revamp the Flink platform! 3-3. Approaching the Number of Incidents
deployments, and more • Nurtured best practices • Provided best-practices documentation • Provided template projects (including tests!) • Sent direct refactor PRs to various components • Implemented best practices and tests Collaborated with the platform team and improved the platform and its docs! 3-3. Approaching the Number of Incidents Approaching Offline Batch
• Holiday rush → last-minute changes? • IRF 2.0 rollout side effect? • Clear definition → more detection? • Maslow’s Hammer: • “If all you have is IRF, everything starts to look like an incident” • After January, incidents started trending down Keep our eyes on it — continuous effort required. 3-4. Results: Did We Really “Halve” Incidents?
improvement in MTT-Mitigate • Thanks to the power of IRF 2.0! • But MTT-Detect didn’t improve • Detection is still a challenge Definitely felt the momentum of change! 3-4. Results: Did We Really “Halve” Incidents?
— (# of incidents ↑) × (resolution time ↓) → impact slightly down We didn’t quite halve incidents… However, the challenges are clear, and the foundation is set for improvement! Let’s make it happen 3-4. Results: Did We Really “Halve” Incidents?
In Reality — it never happens To truly minimize incidents… • Just stop releasing features? • → A slow death 😇 • Pour infinite cost (people, time) into prevention? • More cost likely correlates with fewer incidents… • → Keep testing until we feel 100% “safe”? 4-1. Remaining Challenges: Risk Management & Alerting
incident risk. • But how can we find the “right” balance? • And it differs for each system or project: • Required speed & release frequency • Cost we can throw in • Acceptable level of risk (≒ number of incidents, failure rate) • Ads is B2B and tied directly to revenue → needs to be rock-solid • News is B2C → speed of delivery comes first! Quantify our risk tolerance, and use that to control how many incidents we accept. 4-1. Remaining Challenges: Risk Management & Alerting
= Service Level Objective: “How much failure is acceptable?” • e.g. 99.9% available → 0.1% failure is “allowed” • Attach objectives to SLIs: metrics that reflect real UX harm • Error Budget: “How much failure room do we have left?” • When budget remains → we can take risk • Even bold releases are fair game • When the budget runs out → can’t take risk: UX is already suffering • No more risk — time to slow down — Ref: Implementing SLOs (Google SRE Workbook) Error Budgets let us express risk tolerance, numerically. And in theory… this sounds pretty solid. 4-1. Remaining Challenges: Risk Management & Alerting
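To make the 99.9% example concrete, the error-budget arithmetic fits in a few lines; the numbers follow directly from the SLO above.

```python
# Worked example: a 99.9% availability SLO over a 30-day window
# "allows" about 43 minutes of failure.
slo = 0.999
window_minutes = 30 * 24 * 60               # 43,200 minutes in a 30-day window
error_budget_minutes = (1 - slo) * window_minutes
print(error_budget_minutes)                 # 43.2 minutes of "allowed" failure
```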
Time to get those SLOs in place! • Alert on fast Error-Budget burn (consumption). • e.g. burn-rate-based alerting • If you ignore it, the Error Budget runs out → the SLO is violated → that means real UX damage! Users suffered! • Can’t ignore it — it is an incident! — Ref: Alerting on SLOs (Google SRE Workbook) Sounds good 4-1. Remaining Challenges: Risk Management & Alerting
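A minimal sketch of what burn-rate-based alerting computes, in the spirit of the Google SRE Workbook’s multiwindow, multi-burn-rate alerts. The thresholds and windows below are the commonly cited textbook examples, not our production config.

```python
# Illustrative burn-rate alerting: page only when the error budget is being
# consumed much faster than it is being earned.
def burn_rate(error_ratio: float, slo: float) -> float:
    """How fast the error budget is burning: 1.0 = exactly on budget."""
    budget = 1 - slo                         # e.g. 0.001 for a 99.9% SLO
    return error_ratio / budget

def should_page(err_1h: float, err_5m: float, slo: float = 0.999) -> bool:
    # Page when both a long and a short window burn fast: a 14.4x burn rate
    # would exhaust a 30-day budget in about 2 days; the short window keeps
    # the alert from firing long after the burn has stopped.
    return burn_rate(err_1h, slo) > 14.4 and burn_rate(err_5m, slo) > 14.4

print(should_page(err_1h=0.02, err_5m=0.03))  # True: paging-worthy burn
```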
rolling out SLOs in some places, but… • Defining effective SLOs isn’t easy • Biz and PdMs don’t always have the answers • Engineers have no time to implement SLOs • They can’t even find time to write unit tests! • And even if we set them up (actually, I did set some up…) • If no one respects the SLOs, what’s the point? There’s no silver bullet…
need everyone on board, across the company • Need an approach to culture • Ultimately, it’s about what we truly value • “Do we believe that balancing cost and risk with SLOs is worth it?” We want to install SLOs, and ultimately the mindset of SRE, into our engineering culture. 4-2. Remaining Challenge: Shaping Org & Culture
PdMs—get everyone involved • Top-Down: higher-ups • They are actually supportive of SRE • Ask for support and direction from leadership SRE and DevOps are culture — they don’t take root in a day. It takes sweat, patience, and steady effort. 4-2. Remaining Challenge: Shaping Org & Culture How do we make SLOs actually work?
was coming to an end. • The challenge remained: install SRE into SmartNews’s engineering culture • Implement and uphold SLOs • And more… • Boost observability • Track and act on DORA metrics, and so on… These require ongoing effort How can we keep the momentum going and tackle the remaining challenges even after ACT disbands?
Team” After ACT ends, ex-members return to their teams and continue SRE work using X% of their time It sounded reasonable to me… maybe? 4-3. What Comes Next…
their full-time job. • And allocating “X%” of time… yeah, that never really works. • Our decision: • Ex-ACTors would keep helping and promoting SRE, but we’d take the time to build a dedicated SRE team. We made that call as a team. There’s still a lot left unfinished, but no regrets! How should we disband ACT? — Team’s Call 4-3. What Comes Next…
has ended. Did we truly create an “Awesome Change”? Honestly… I’m not sure. 4-3. What Comes Next… But we do feel like “We’ve taken the first step on a long journey toward SRE!” And a huge thanks to my teammates for fighting through these past six months!
session helps some of you. You show up at the office after the conference and your boss says, “Alright — starting today, your job is to Reduce Incidents.” …Where do you begin?
at Scale • Site Reliability Engineering: How Google Runs Production Systems • The Site Reliability Workbook (Google SRE) • Effective DevOps: Building a Culture of Collaboration, Affinity, and Tooling at Scale • Fearless Change: Patterns for Introducing New Ideas