You show up at the office, and your boss says, “Alright — starting today, your job is to Reduce Incidents.” …Where do you begin? That’s the journey we’ll explore with you today
from real-world firefighting 2. Turning incident data into action 3. Rolling out a unified process company-wide Disclaimer 1: Single-case study (N = 1); findings are context-specific. Please keep that in mind. A Six-Month Journey on the Incident Task Force
& Ops … Dev already handles Ops + incidents 2. Applying SRE/DevOps best practices … Lessons drawn from the field Disclaimer 2: I’m not a pro SRE or DevOps guru!
How do we even define “a lot” of incidents? • Ikuo: Are we actually making that many changes? • Are changes even the root cause of these incidents? • What kind of changes are we talking about? At this point it was all just gut feeling and guesswork. Hold up! (Though I’ve learned that a senior engineer’s nose for trouble is not to be underestimated.) 1-1. The Beginning
[Org-chart slide: ACT members drawn from each division (SmartView (Article), Ads, News & Push, Ranking, CoreSystem, Mobile), members Ikuo, D, R, T, M, and T, managed by VPoE K and reporting to the CTO] * Let me call myself an “all-star” just for the sake of this story 🙏 1-2. Assemble the Strongest Team We pulled in the all-stars from every division…
was real. Pulling aces from every team showed how serious the company was about this. At the same time... we had no excuses. The downside of a top-down project 1-2. Assemble the Strongest Team
answers “Reduce incidents.” Sounds simple—turns out, it’s a massive problem area. • Where do we even start? • What’s the real problem? What actually helps? • And... are there even that many incidents :)?
Change” really mean? 1. Reduce critical incidents 2. Install SRE best practices into the org • Define Key KPIs to improve: • Mean Time Between Failures (MTBF) / Change Failure Rate (CFR) = # of incidents • Mean Time to Recovery (MTTR) = Recovery time The “why are we here?” got crystal clear Thanks to our awesome VPoE 1-3. Guiding the Team
Dirty — Jump into every incident! • Page (PagerDuty call) an ACT member for every incident • Pull every ACT member into each live incident • Fight the fire if it’s in your domain • Handle updates, escalation, and biz communications even if it’s not Brutal!!
they will now be the “pager monkeys” whose job it is to follow a script at 2 a.m. when the service goes down — From “Becoming SRE” Chapter 3: SRE Culture Definitely a bad practice Jumping into every incident won’t stop incidents… 2-1. P0: Supporting Ongoing Incident Handling
started thinking: Incident = ACT. And we earned a lot of trust! ACT shows up when there’s trouble. ACT’s got our back during incidents. ACT gets things done! 2-1. P0: Supporting Incident Handling
action items — Why? • We had a culture of writing incident reports. • And even listing action items for prevention. — Awesome! • But those items weren’t being tracked. • No assignees, no due dates, no status. … WHAT?? • And the report format differed for each division… • Sometimes even per person. Thus, there were definitely ones that just got… forgotten.
• Gather all action items from every incident report • Then auto-create Jira tickets, send reminders! • But… each report had a totally different format. • Now what? Help me, ChatGPT… That’s not happening. 2-2. P1. Crush unresolved critical action items
Lesson #1: Always store data in a machine-readable format!! Heck yeah! We manually moved a year’s worth of incident action items (AIs) into a Notion database! *Once it’s a database, you can pull it via API. Easy mode. Lesson #2: Don’t be afraid to get your hands dirty if it serves the mission. 2-2. P1. Crush unresolved critical action items
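As a sketch of what “pull it via API” enables once the action items live in a database, the snippet below queries a Notion database and opens a Jira ticket per open item. The database ID, property names, Jira project key, and auth details are all hypothetical placeholders, not our actual setup.

```python
# Minimal sketch: pull open action items from a Notion database and
# auto-create Jira tickets for them. All IDs and credentials are placeholders.
import requests

NOTION_TOKEN = "secret_..."             # assumption: a Notion integration token
NOTION_DB_ID = "<action-items-db-id>"   # hypothetical database ID
JIRA_BASE = "https://example.atlassian.net"  # hypothetical Jira instance

def fetch_open_action_items() -> list[dict]:
    """Query the Notion database for action items that are not done yet."""
    resp = requests.post(
        f"https://api.notion.com/v1/databases/{NOTION_DB_ID}/query",
        headers={
            "Authorization": f"Bearer {NOTION_TOKEN}",
            "Notion-Version": "2022-06-28",
            "Content-Type": "application/json",
        },
        # assumption: the database has a "Status" status property
        json={"filter": {"property": "Status", "status": {"does_not_equal": "Done"}}},
    )
    resp.raise_for_status()
    return resp.json()["results"]

def create_jira_ticket(summary: str) -> str:
    """Create one Jira task for an action item (basic auth with an API token)."""
    resp = requests.post(
        f"{JIRA_BASE}/rest/api/2/issue",
        auth=("bot@example.com", "api-token"),   # hypothetical service account
        json={"fields": {
            "project": {"key": "ACT"},           # hypothetical project key
            "summary": summary,
            "issuetype": {"name": "Task"},
        }},
    )
    resp.raise_for_status()
    return resp.json()["key"]
```

From there, a scheduled job can diff open items against existing tickets and send reminders; the point is that none of this is possible while the data sits in free-form report text.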
Response Process • Well, because… • Each division had its own way of handling incidents. • We’d previously tried to build a company-wide protocol, the “IRF: Incident Response Framework”. • But... it never really got used. • Why? — It only reflected the needs of one division. Why didn’t we already have a unified company-wide process?
as a company-wide & lightweight • Domain “all-stars” filled in the gaps • Must stay lightweight — who reads a wall of text during a fire🔥? • Borrowed proven parts from public frameworks • e.g. PagerDuty Incident Response How did we build a company-wide process and framework? Compile a unified framework for the whole company! 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process
3. Workflow 4. Communication Guideline 5. Incident Report Template, Postmortem Let me walk you through the key parts Please check the slide later for the details! 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process
engineer on call. Triages alerts and escalates to the IC if necessary, initiating the IRF (declaring the incident). • Incident Commander (IC) • Leads the incident response. Brings in necessary people and organizes information. May also act as the CL (Communication Lead). • Usually a Tech Lead or Engineering Manager. • Their responsibility is not to directly fix the issue, but to organize and make decisions. • Responder • Handles the actual work, such as rollbacks, config changes, etc. • Communication Lead (CL) • Handles communication with external stakeholders (i.e., non-engineers). Key point: separate responsibilities between IC and Responder 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process
severity judgment when declaring the incident. The final severity level is determined during the postmortem. • 🔥 SEV-1 • Complete failure of core UX features (e.g., news reading becomes unavailable) • 🧨 SEV-2 • Partial failure of core UX features, or complete failure of sub-UX features • 🕯 SEV-3 • Partial failure of sub-UX features It’s crucial to estimate severity early on; severe incidents should be resolved faster 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process
of an incident. 🩸 = Bleeding status (ongoing impact) 1. 🩸 Occurrence • An issue arises. Common triggers include deployments or config changes. 2. 🩸 Detection • The on-call engineer detects the issue via alerts. Triage begins. 3. 🩸 Declaration • The incident is officially declared. IRF begins under the IC's lead. External communication starts as needed. • While bleeding, updates must be continuously provided. 4. ❤🩹 Mitigation • Temporarily eliminate the cause (e.g., rollback) and stop further impact. 5. Resolution • Permanently fix the issue (e.g., bug fix, data correction). Bleeding is fully stopped. 6. Postmortem • Investigate root causes and discuss recurrence prevention based on the incident report. 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process
place (Slack channels): • #incident • Used for status updates to the entire company and for communication with external stakeholders. • #incident-irf-[incidentId]-[title] • For technical communication to resolve the issue. • All relevant discussions and information are gathered here. Having all discussions and info in one place makes writing the report much easier later 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process
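Purely as an illustration (the deck doesn’t say channel creation was automated), opening the per-incident channel and announcing it in #incident could look like this with the Slack SDK; the token, scopes, and naming details are assumptions.

```python
# Illustrative sketch: create the #incident-irf-<id>-<title> channel and
# announce it in #incident. Token and channel names are placeholders.
from slack_sdk import WebClient

client = WebClient(token="xoxb-...")   # assumption: a bot token with channel-management scopes

def open_incident_channel(incident_id: str, title: str) -> str:
    # Slack channel names must be lowercase, without spaces, max 80 chars
    name = f"incident-irf-{incident_id}-{title}".lower().replace(" ", "-")[:80]
    channel = client.conversations_create(name=name)["channel"]
    client.chat_postMessage(
        channel="#incident",  # assumption: the bot is already a member of #incident
        text=f"Incident declared: {title}. Technical discussion in <#{channel['id']}>",
    )
    return channel["id"]
```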
company-wide template includes: • Summary • Impact • Direct Cause, Mitigation • Root Cause Analysis (5-whys) • It’s crucial to analyze direct and root causes separately. • Based on root causes, define action items to prevent recurrence • Timeline • Use a machine-readable format!!!! We standardized templates across divisions (super important!) and centralized all postmortems. 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process
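A minimal sketch, assuming Python dataclasses, of what one machine-readable report record could look like. The field names follow the template sections above; the exact schema here is illustrative, not the team’s actual one.

```python
# Illustrative report schema: every section of the template becomes a typed
# field, so reports can be aggregated and queried instead of read one by one.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class ActionItem:
    description: str
    assignee: str          # no more "no assignees, no due dates, no status"
    due_date: datetime
    status: str = "open"

@dataclass
class IncidentReport:
    summary: str
    impact: str
    direct_cause: str
    mitigation: str
    root_causes: list[str]                       # from the 5-whys analysis
    action_items: list[ActionItem] = field(default_factory=list)
    timeline: dict[str, datetime] = field(default_factory=dict)  # state -> timestamp
```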
into every incident “Hello there, it’s me, Uncle IRF😎 Alright, I’ll be the Incident Commander this time! Everyone else, focus on firefighting!” Lesson #4: In an emergency, no one has time to learn a new protocol. Just do it and learn by doing! Lesson #5: Use it ourselves first, and build a feedback loop! 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process
did a lot! But… How many incidents did we handle this month? Or last month…? How long did it take to resolve each one? 2-4. P2: Root Fixes/ Enhancing Incident Clarity
we handle this month? Or last month…? How long did it take to resolve each one? No clue!! A critical realization: we’re not tracking KPIs! 2-4. P2: Root Fixes/ Enhancing Incident Clarity
Title • Status • State machine (we’ll get to this later) • Severity • SEV 1-3 (IRF 2.0) • Direct Cause • (explained later) • Direct Cause System • Group of components defined at the microservice level • Direct Cause Workload • Online Service, Offline Pipeline, … Define as many fields as possible using Enums! Free-form input → high cardinality, analysis breaks 2-4. P2: Root Fixes/ Enhancing Incident Clarity
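A minimal sketch of the “define fields as Enums” idea in Python. The members come from the slides above, but the exact taxonomy is illustrative.

```python
# Illustrative Enums: every constrained field gets a closed set of values,
# so analysis never has to deal with free-form strings.
from enum import Enum

class Severity(Enum):
    SEV1 = 1   # complete failure of core UX features
    SEV2 = 2   # partial failure of core UX / complete failure of sub-UX
    SEV3 = 3   # partial failure of sub-UX features

class Workload(Enum):
    ONLINE_SERVICE = "online_service"
    OFFLINE_PIPELINE = "offline_pipeline"

class Status(Enum):            # the state machine states (next slide)
    OCCURRED = "occurred"
    DETECTED = "detected"
    DECLARED = "declared"
    MITIGATED = "mitigated"
    RESOLVED = "resolved"
```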
transition between them — a State Machine! 2-4. P2: Root Fixes/ Enhancing Incident Clarity Record time for every transition; key time metrics pop out automatically
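A minimal sketch of that state machine in Python: record a timestamp for every transition and the key time metrics fall out of the timeline. The allowed transitions and the exact metric definitions (which deltas count as detect/mitigate/resolve time) are illustrative assumptions based on the lifecycle slide.

```python
# Illustrative incident state machine: one timestamp per transition,
# metrics derived from the recorded timeline.
from datetime import datetime, timedelta

# Assumed transition order, following the lifecycle slide
TRANSITIONS = {
    "occurred": {"detected"},
    "detected": {"declared"},
    "declared": {"mitigated"},
    "mitigated": {"resolved"},
}

class IncidentStateMachine:
    def __init__(self, occurred_at: datetime):
        self.timeline = {"occurred": occurred_at}
        self.state = "occurred"

    def transition(self, new_state: str, at: datetime) -> None:
        if new_state not in TRANSITIONS.get(self.state, set()):
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
        self.timeline[new_state] = at   # record the time of every transition

    def metrics(self) -> dict[str, timedelta]:
        # Assumed definitions, for illustration only
        t = self.timeline
        return {
            "time_to_detect": t["detected"] - t["occurred"],
            "time_to_mitigate": t["mitigated"] - t["detected"],
            "time_to_resolve": t["resolved"] - t["mitigated"],
        }
```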
the data definition is solid, the source can be flexible. (As long as the data is trustworthy, of course.) 2-4. P2: Root Fixes/ Enhancing Incident Clarity • Make required fields into mandatory attributes. • Add a Notion database for the state timeline • Have people record when states change • Make it machine-readable!!!!!
unified format) • Different formats across divisions • Free-format input, missing attributes, and so on… • Now what? Give it three months and we’ll have plenty of data. Right? We only have six months!! 2-4. P2: Root Fixes/ Enhancing Incident Clarity
Heck yeah! We manually migrated one year’s worth of incident reports. We divided the work and got it done within a week. 2-4. P2: Root Fixes/ Enhancing Incident Clarity Re: Lesson #2: Don’t be afraid to get your hands dirty if it serves the mission.
Declared → 4. Mitigated → 5. Resolved [Timeline diagram: Time To Detect, Time To Mitigate, and Time To Resolve marked between the lifecycle states] Now we know where the time is going, and where we stand! 2-4. P2: Root Fixes/ Enhancing Incident Clarity
What do we really want to reduce — incident count? 🤔 Maybe not. It’s the impact caused by incidents, e.g. revenue, reputation, developer velocity…. Right? Especially revenue loss…
Incident impact = Σ over all incidents (Resolution Time × Severity Factor), where the Severity Factor is the impact level of an incident and the sum runs over the number of incidents. For us (B2C and Ads) this pretty much defines the revenue impact. • Shorten the time → Quick win: relatively easy to improve • Reduce severity → Ideal, but hard to control • Reduce incident count → Requires mid/long-term efforts 3-1. What Does It Mean to Reduce Incidents?
Incident impact = Σ over all incidents (Resolution Time × Severity Factor). For us (B2C and Ads) this pretty much defines the revenue impact. • Shorten the time → Quick win: relatively easy to improve ← We’re starting here • Reduce severity → Ideal, but hard to control • Reduce incident count → Requires mid/long-term efforts It also aligns with the KPIs we set when ACT was first formed! But a few months in, we gained much better clarity. 3-1. What Does It Mean to Reduce Incidents?
2. Detected → 3. Declared → 4. Mitigated → 5. Resolved Each one has different significance — and needs a different solution! • Time To Detect: mainly in the alerting domain. • Time To Mitigate: bleeding; most critical, but also the easiest to improve. This is where IRF comes in. • Time To Resolve: time spent on root fixes and data correction. Bleeding has stopped; now it’s about accuracy, not speed. 3-2. Approaching Incident Resolution Time
more alerts doesn’t help • It can even make things worse • “Over-monitoring is a harder problem to solve than under-monitoring.” — SRE: How Google Runs Production Systems • Too many alerts (possibly false positives) → alert fatigue → real alerts get buried/ignored • Alert on SLO / Error-Budget burn instead • Not something you can fix overnight Still a work in progress — We’ll revisit this in Chapter 4 3-2. Approaching Incident Resolution Time
unified framework: IRF 2.0 • Clear incident definition – when to call it an incident • Unified response workflow & communication guideline • Role split: Incident Commander vs Responder • Responder can focus on firefighting • Ongoing drills & training — Aces lead by example Deploying top aces + rolling out IRF 2.0 had a huge impact! 3-2. Approaching Incident Resolution Time
without testing Postmortem discussion… • Why was it deployed without testing? • → Because it could only be tested in production. • Why only in production? • → Lack of data, broken staging environment, etc… • …. Alright! Let’s fix the staging environment! 3-3. Approaching the Number of Incidents
Environments Still in Progress: Way harder than we thought! • There are tons of components. • Each division—News, Ads, Infra—has different needs and usage. • Ads is B2B, tied directly to revenue → needs to be solid and stable • News is B2C, speed of feature delivery is key Trying to build staging for everything? Not realistic, not even useful So we started with Ads, where the demand was highest 3-3. Approaching the Number of Incidents
Tests? • Why didn’t we catch it with unit tests? • Because we didn’t have any… • … 😭 Alright! Let’s collect test coverage! 3-3. Approaching the Number of Incidents
Coverage • Jumped into systems lacking coverage tracking • Opened PRs for generating reports • Plotted unit test coverage vs. # of incidents by system [Scatter plot: average unit test coverage vs. # of incidents, one point per system] 3-3. Approaching the Number of Incidents
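For illustration only, a plot like the one on this slide can be produced with a few lines of matplotlib; the systems and numbers below are made up.

```python
# Illustrative coverage-vs-incidents scatter plot, one point per system.
import matplotlib.pyplot as plt

systems = {  # hypothetical data: system -> (avg coverage %, # of incidents)
    "ads-api": (35.0, 9),
    "ranking": (62.0, 3),
    "push": (48.0, 5),
    "core": (71.0, 2),
}

coverage = [v[0] for v in systems.values()]
incidents = [v[1] for v in systems.values()]

plt.scatter(coverage, incidents)
for name, (c, n) in systems.items():
    plt.annotate(name, (c, n))          # label each point with the system name
plt.xlabel("Average unit test coverage (%)")
plt.ylabel("# of incidents")
plt.title("Coverage vs. incidents per system")
plt.show()
```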
incidents? → Yes, there was. (= Low coverage means more incidents) • Then: Does higher coverage actually reduce incidents? → Not sure. Correlation ≠ causation. • Still, digging into low-coverage / high-incident systems shows similar roots: • Hard to write tests / no testing culture / etc… Alright! Let’s jump into the low-coverage systems and help write unit tests! Approach #2 to Lack of Testing: Analyzing Unit Test Coverage 3-3. Approaching the Number of Incidents
Tests Get our hands dirty! Add tests to everything: 1. Use SonarQube to find files with high LOC and low coverage (see the sketch below) 2. Use LLMs to help generate tests 3. Repeat until the entire component hits 50%+ coverage. We hit 3–4 components… but it didn’t really change anything. We thought that if we provided a few examples, others would follow… 3-3. Approaching the Number of Incidents
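A sketch of step 1, querying the SonarQube Web API for file-level LOC and coverage. The server URL, project key, auth, and thresholds are placeholders, not our actual setup.

```python
# Illustrative sketch: list files with high LOC and low coverage from SonarQube.
import requests

SONAR_URL = "https://sonarqube.example.com"   # hypothetical server
PROJECT_KEY = "my-component"                  # hypothetical project key
AUTH = ("sonar-token", "")                    # assumption: token-based basic auth

resp = requests.get(
    f"{SONAR_URL}/api/measures/component_tree",
    params={
        "component": PROJECT_KEY,
        "metricKeys": "ncloc,coverage",
        "qualifiers": "FIL",                  # files only
        "ps": 500,
    },
    auth=AUTH,
)
resp.raise_for_status()

candidates = []
for comp in resp.json()["components"]:
    measures = {m["metric"]: float(m["value"])
                for m in comp.get("measures", []) if "value" in m}
    loc = measures.get("ncloc", 0)
    cov = measures.get("coverage", 0.0)
    if loc > 300 and cov < 50:                # "high LOC, low coverage" thresholds (illustrative)
        candidates.append((comp["key"], loc, cov))

# Largest, least-covered files first: the best targets for LLM-assisted tests
for key, loc, cov in sorted(candidates, key=lambda x: (-x[1], x[2])):
    print(f"{key}: {loc:.0f} LOC, {cov:.0f}% coverage")
```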
to build a habit of writing tests continuously • Problems: • No incentive, no shared value • Everyone is busy: pressured by hard deadlines • (May 2025 update): LLMs could change the game! This is a team culture and organizational challenge. To be continued in Chapter 4… 3-3. Approaching the Number of Incidents Approach #2 to Lack of Testing: Building Out Unit Tests
dynamically/online → e.g. A/B Testing and Feature Flags, Testing in production • We have in-house platforms for both • Problems • They were complicated… → Unintended A/B assignments and misconfigurations caused frequent issues Alright, let’s clean up A/B testing and feature flags! 3-3. Approaching the Number of Incidents
Establish usage guidelines for feature flags • Strengthen validation logic • (Bad configs caused parse errors and crashes…) Collaborated with the platform team and made a lot of improvements! 3-3. Approaching the Number of Incidents Approaching Config Changes
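As one illustration of “strengthen validation logic”: validate flag configs at load time so a bad entry fails fast with a clear error instead of crashing at runtime. The schema and field names here are hypothetical, not our platform’s actual config format.

```python
# Illustrative feature-flag config validation: reject malformed entries early.
from dataclasses import dataclass

@dataclass
class FeatureFlag:
    name: str
    enabled: bool
    rollout_percent: int   # 0-100

def parse_flag(raw: dict) -> FeatureFlag:
    """Turn one raw config entry into a validated FeatureFlag or raise."""
    missing = {"name", "enabled", "rollout_percent"} - raw.keys()
    if missing:
        raise ValueError(f"flag config missing fields: {missing}")
    if not isinstance(raw["enabled"], bool):
        raise ValueError(f"{raw['name']}: 'enabled' must be a boolean")
    pct = raw["rollout_percent"]
    if not isinstance(pct, int) or not 0 <= pct <= 100:
        raise ValueError(f"{raw['name']}: 'rollout_percent' must be an int in [0, 100]")
    return FeatureFlag(raw["name"], raw["enabled"], pct)
```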
bunch of offline streaming Flink jobs… • e.g. Server → Kafka → Flink → Scylla, ClickHouse, … • We have an in-house platform for these as well • Problems • Few Flink experts on the app team side → led to frequent issues: performance problems, restarts, missing unit tests, bugs, etc. Alright! Let’s revamp the Flink platform! 3-3. Approaching the Number of Incidents
deployments, and more • Nurtured best practices • Provided best-practices documentation • Provided template projects (including tests!) • Sent direct refactor PRs to various components • Implemented best practices and tests Collaborated with the platform team and improved the platform and its docs! 3-3. Approaching the Number of Incidents Approaching Offline Batch
• Holiday rush → last-minute changes? • IRF 2.0 rollout side effect? • Clear definition → more detection? • Maslow’s Hammer: • “If all you have is IRF, everything starts to look like an incident” • After January, incidents started trending down Keep our eyes on it — continuous effort required. 3-4. Results: Did We Really “Halve” Incidents?
improvement in MTT-Mitigate • Thanks to the power of IRF 2.0! • But MTT-Detect didn’t improve • Detection is still a challenge Definitely felt the momentum of change! 3-4. Results: Did We Really “Halve” Incidents?
— (# of incidents ↑) × (resolution time ↓) → impact slightly down We didn’t quite halve incidents… However, the challenges are clear, and the foundation is set for improvement! Let’s make it happen 3-4. Results: Did We Really “Halve” Incidents?
In Reality — it never happens To truly minimize incidents… • Just stop releasing features? • → A slow death 😇 • Pour infinite cost (people, time) into prevention? • More cost likely correlates with fewer incidents… • → Keep testing until we feel 100% “safe”? 4-1. Remaining Challenges: Risk Management & Alerting
incident risk. • But how can we find the “right” balance? • And it differs for each system or project: • Required speed & release frequency • Cost we can throw in • Acceptable level of risk (≒ number of incidents, failure rate) • Ads is B2B and tied directly to revenue → needs to be rock-solid • News is B2C → speed of delivery comes first! Quantify our risk tolerance, and use that to control how many incidents we accept. 4-1. Remaining Challenges: Risk Management & Alerting
= Service Level Objective: “How much failure is acceptable?” • e.g. 99.9% available → 0.1% failure is “allowed” • Attach objectives to SLIs: metrics that reflect real UX harm • Error Budget: “How much failure room do we have left?” • When budget remains → we can take risk • Even bold releases are fair game • When the budget runs out → can’t take risk: UX is already suffering • No more risk — time to slow down — Ref: Implementing SLOs (Google SRE Workbook) Error Budgets let us express risk tolerance, numerically. And in theory… this sounds pretty solid. 4-1. Remaining Challenges: Risk Management & Alerting
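To make the 99.9% example concrete, the error-budget arithmetic fits in a few lines; the numbers follow directly from the SLO above.

```python
# Worked example: a 99.9% availability SLO over a 30-day window
# "allows" about 43 minutes of failure.
slo = 0.999
window_minutes = 30 * 24 * 60               # 43,200 minutes in a 30-day window
error_budget_minutes = (1 - slo) * window_minutes
print(error_budget_minutes)                 # 43.2 minutes of "allowed" failure
```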
Time to get those SLOs in place! • Alert on fast Error-Budget burn (consumption). • e.g. burn-rate-based alerting • If you ignore it, the Error Budget runs out → the SLO is violated → that means real UX damage! Users suffered! • Can’t ignore it — it is an incident! — Ref: Alerting on SLOs (Google SRE Workbook) Sounds good 4-1. Remaining Challenges: Risk Management & Alerting
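A minimal sketch of what burn-rate-based alerting computes, in the spirit of the Google SRE Workbook’s multiwindow, multi-burn-rate alerts. The thresholds and windows below are the commonly cited textbook examples, not our production config.

```python
# Illustrative burn-rate alerting: page only when the error budget is being
# consumed much faster than it is being earned.
def burn_rate(error_ratio: float, slo: float) -> float:
    """How fast the error budget is burning: 1.0 = exactly on budget."""
    budget = 1 - slo                         # e.g. 0.001 for a 99.9% SLO
    return error_ratio / budget

def should_page(err_1h: float, err_5m: float, slo: float = 0.999) -> bool:
    # Page when both a long and a short window burn fast: a 14.4x burn rate
    # would exhaust a 30-day budget in about 2 days; the short window keeps
    # the alert from firing long after the burn has stopped.
    return burn_rate(err_1h, slo) > 14.4 and burn_rate(err_5m, slo) > 14.4

print(should_page(err_1h=0.02, err_5m=0.03))  # True: paging-worthy burn
```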
rolling out SLOs in some places, but… • Defining effective SLOs isn’t easy • Biz and PdMs don’t always have the answers • Engineers have no time to implement SLOs • They can’t even find time to write unit tests! • And even if we set them up (actually, I did set some up…) • If no one respects the SLOs, what’s the point? There’s no silver bullet…
need everyone on board, across the company • Need an approach to culture • Ultimately, it’s about what we truly value • “Do we believe that balancing cost and risk with SLOs is worth it?” We want to install SLOs, and ultimately the mindset of SRE, into our engineering culture. 4-2. Remaining Challenge: Shaping Org & Culture
PdMs—get everyone involved • Top-Down: higher-ups • They are actually supportive of SRE • Ask for support and direction from leadership SRE and DevOps are culture — they don’t take root in a day. It takes sweat, patience, and steady effort. 4-2. Remaining Challenge: Shaping Org & Culture How do we make SLOs actually work?
was coming to an end. • The challenge remained: install SRE into SmartNews’s engineering culture • Implement and uphold SLOs • And more… • Boost observability • Track and act on DORA metrics, and so on… These require ongoing effort How can we keep the momentum going and tackle the remaining challenges even after ACT disbands?
Team” After ACT ends, ex-members return to their teams and continue SRE work using X% of their time It sounded reasonable to me… maybe? 4-3. What Comes Next…
their full-time job. • And allocating “X%” of time… yeah, that never really works. • Our decision: • Ex-ACTors would keep helping and promoting SRE, but we’d take the time to build a dedicated SRE team. We made that call as a team. There’s still a lot left unfinished, but no regrets! How should we disband ACT? — Team’s Call 4-3. What Comes Next…
has ended. Did we truly create an “Awesome Change”? Honestly… I’m not sure. 4-3. What Comes Next… But we do feel like “We’ve taken the first step on a long journey toward SRE!” And a huge thanks to my teammates for fighting through these past six months!
session helps some of you. You show up at the office after the conference and your boss says, “Alright — starting today, your job is to Reduce Incidents.” …Where do you begin?
at Scale • Site Reliability Engineering: How Google Runs Production Systems • The Site Reliability Workbook (Google SRE) • Effective DevOps: Building a Culture of Collaboration, Affinity, and Tooling at Scale • Fearless Change: Patterns for Introducing New Ideas