Slide 1

Slide 1 text

Creating "Awesome Change" in SmartNews! Š5IF.JTTJPOUP)BMWF*ODJEFOUTCZ5BTL'PSDFl"$5zŠ June 6, 2025

Slide 2

Slide 2 text

Who am I? Ikuo Suyama • Nov. 2020~ SmartNews, Inc. • Staff Engineer • Ads-Backend Expert • Interests: Fishing, Camping, Gunpla, Anime

Slide 3

Slide 3 text

About SmartNews: one of the biggest startups in Japan. Mission: “To deliver the world's quality information to the people who need it.” No. 1 user base: Japan’s biggest news app

Slide 4

Slide 4 text

4 Today, Incidents! I’m going to talk about…

Slide 5

Slide 5 text

5 Let me ask you to think for a moment... You show up at the office, and your boss says, “Alright — starting today, your job is to Reduce Incidents.” …Where do you begin? That’s the journey we’ll explore together today

Slide 6

Slide 6 text

6 What I will talk about today 1. Pulling lessons from real-world firefighting 2. Turning incident data into action 3. Rolling out a unified process company-wide Disclaimer 1: Single-case study (N = 1); findings are context-specific, so please keep that in mind. A Six-Month Journey on the Incident Task Force

Slide 7

Slide 7 text

7 What I won’t (can’t) talk about 1. Integrating Dev & Ops … Dev already handles Ops + incidents 2. Applying SRE/DevOps best practices … Lessons drawn from the field Disclaimer 2: I’m not a pro SRE or DevOps guru!

Slide 8

Slide 8 text

Phase 1: Assemble! Task Force “ACT”! Phase 2: Slogan “Get our hands dirty!” Phase 3: Halving incidents!? Phase 4: What remains, and what’s next Agenda

Slide 9

Slide 9 text

01 Phase 1: Assemble! Task Force “ACT”!

Slide 10

Slide 10 text

10 1-1. The Beginning It all started back in September... The Awesome Change Team— “ACT”! Too many incidents! Cut them in HALF. Let’s build a task force! CTO

Slide 11

Slide 11 text

11 …Could it be because you force us to ship a massive number of changes? ME: 1-1. The Beginning

Slide 12

Slide 12 text

12 ME: 1-1. The Beginning …Could it be because you force us to ship a massive number of changes? Hold up!

Slide 13

Slide 13 text

13 • CTO: Are incidents really happening that often? • How do we even define “a lot” of incidents? • Ikuo: Are we actually making that many changes? • Are changes even the root cause of these incidents? • What kind of changes are we talking about? At this point, it was all just gut feeling and guesswork. Hold up! (Though I’ve learned that a senior engineer’s nose for trouble is not to be underestimated.) 1-1. The Beginning

Slide 14

Slide 14 text

14 1-2. Assemble the Strongest Team With a six-month time limit, assembling “The Strongest Team” was our top priority! Advantage of a top-down project: this one came straight from the CTO.

Slide 15

Slide 15 text

15 Ads News Ranking Push Notification Core System (Infra) Mobile SmartView (Article) 1-2. Assemble the Strongest Team We pulled in the all-stars from every division…

Slide 16

Slide 16 text

16 Ads News Ranking Push Notification Core System (Infra) Mobile SmartView (Article) Ads Ikuo! News & Push D! Ranking R! CoreSystem T! Mobile M! SmartView T! VPoE K! * Let me call myself an “all-star” just for the sake of this story 🙏 (Manager) CTO Report To 1-2. Assemble the Strongest Team We pulled in the all-stars from every division…

Slide 17

Slide 17 text

17 “We had just six months to succeed” —That pressure was real. Pulling aces from every team showed how serious the company was about this. At the same time... we had no excuses. The downside of a top-down project 1-2. Assemble the Strongest Team

Slide 18

Slide 18 text

18 1-3. Guiding the Team Tackling ambiguous problems without clear answers “Reduce incidents.” Sounds simple—turns out, it's a massive problem area. • Where do we even start? • What’s the real problem? What actually helps? • And... are there even that many incidents :)?

Slide 19

Slide 19 text

19 Set a clear goal • Define what “Awesome Change” really means: 1. Reduce critical incidents 2. Install SRE best practices into the org • Define key KPIs to improve: • Mean Time Between Failures (MTBF) / Change Failure Rate (CFR) = # of incidents • Mean Time to Recovery (MTTR) = recovery time The “why are we here?” became crystal clear, thanks to our awesome VPoE 1-3. Guiding the Team
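For reference, these KPIs have conventional definitions; a quick formalization of how the two bullets map onto “# of incidents” and “recovery time” (my own summary, not taken from the slide):

```latex
\mathrm{MTBF} = \frac{\text{total operating time}}{\#\text{ incidents}}, \qquad
\mathrm{CFR}  = \frac{\#\text{ changes that caused an incident}}{\#\text{ changes deployed}}, \qquad
\mathrm{MTTR} = \frac{\sum_i \text{recovery time}_i}{\#\text{ incidents}}
```

Fewer incidents push MTBF up and CFR down; shorter recoveries push MTTR down.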

Slide 20

Slide 20 text

20 Set clear priorities • P0: Support ongoing incident handling • P1: Crush unresolved critical action items • P2: Prevent incidents by fixing root causes What do we need to do right now? Clear! 1-3. Guiding the Team

Slide 21

Slide 21 text

02 Phase 2: Slogan ”Get our hands dirty!”

Slide 22

Slide 22 text

22 2-1. P0: Supporting Ongoing Incident Handling Get our Hands Dirty — Jump into every incident! • Page (a PagerDuty call) an ACT member for every incident • Pull every ACT member into each live incident • Fight the fire if it’s in your domain • Handle updates, escalation, and biz communications even if it’s not Brutal!!

Slide 23

Slide 23 text

23 Anti-pattern: Pager Monkey “SRE handles all on call — they will now be the ‘pager monkeys’ whose job it is to follow a script at 2 a.m. when the service goes down” — From “Becoming SRE”, Chapter 3: SRE Culture. Jumping into every incident won’t stop incidents… definitely a bad practice. 2-1. P0: Supporting Ongoing Incident Handling

Slide 24

Slide 24 text

24 It’s an anti-pattern… but it wasn’t all bad People started thinking: Incident = ACT. And we earned a lot of trust! ACT shows up when there’s trouble. ACT’s got our back during incidents. ACT gets things done! 2-1. P0: Supporting Incident Handling

Slide 25

Slide 25 text

25 2-2. P1: Crush unresolved critical action items The forgotten action items — Why? • We had a culture of writing incident reports. • And even listing action items for prevention. — Awesome! • But those items weren’t being tracked. • No assignees. No due dates. No status. … WHAT?? • And the report format differed across divisions… • Sometimes even per person. Thus, there were definitely items that just got… forgotten.

Slide 26

Slide 26 text

26 List every forgotten item and track it in the ticket system • Gather all action items from every incident report • Then auto-create Jira tickets and send reminders! • But… each report had a totally different format. • Now what? Help me, ChatGPT… That’s not happening. 2-2. P1: Crush unresolved critical action items

Slide 27

Slide 27 text

27 Get our Hands Dirty: organize the data by hand We manually moved a year’s worth of incident action items into a Notion database! Heck yeah! *Once it’s a database, you can pull it via API. Easy mode. Lesson #1: Always store data in a machine-readable format!! Lesson #2: Don’t be afraid to get your hands dirty if it serves the mission. 2-2. P1: Crush unresolved critical action items
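As an illustration of the “pull it via API” step, here is a minimal Python sketch of how action items could be read from a Notion database and turned into Jira tickets. The database ID, property names, project key, and Jira URL are all hypothetical placeholders; the talk did not show the actual pipeline.

```python
import os
import requests

NOTION_TOKEN = os.environ["NOTION_TOKEN"]           # assumed environment variables
JIRA_BASE    = "https://example.atlassian.net"       # hypothetical Jira instance
JIRA_AUTH    = (os.environ["JIRA_USER"], os.environ["JIRA_API_TOKEN"])
DATABASE_ID  = "<notion-database-id>"                # placeholder

def fetch_open_action_items():
    """Query the Notion database that holds the incident action items."""
    resp = requests.post(
        f"https://api.notion.com/v1/databases/{DATABASE_ID}/query",
        headers={
            "Authorization": f"Bearer {NOTION_TOKEN}",
            "Notion-Version": "2022-06-28",
        },
        json={"filter": {"property": "Status", "select": {"does_not_equal": "Done"}}},
    )
    resp.raise_for_status()
    return resp.json()["results"]

def create_jira_ticket(summary: str) -> str:
    """Create a tracking ticket so the item finally has an assignee, due date, and status."""
    resp = requests.post(
        f"{JIRA_BASE}/rest/api/2/issue",
        auth=JIRA_AUTH,
        json={"fields": {
            "project": {"key": "ACT"},               # hypothetical project key
            "summary": summary,
            "issuetype": {"name": "Task"},
        }},
    )
    resp.raise_for_status()
    return resp.json()["key"]

if __name__ == "__main__":
    for page in fetch_open_action_items():
        title_prop = page["properties"]["Name"]["title"]   # assumes a "Name" title property
        summary = "".join(t["plain_text"] for t in title_prop)
        print("Created", create_jira_ticket(f"[Incident AI] {summary}"))
```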

Slide 28

Slide 28 text

28 Get our Hands Dirty: Crush everything still unfinished This isn’t done?! Crush it!! Lesson #3: People aren’t ignoring the work—they’re just too busy. Getting our hands dirty! 2-2. P1: Crush unresolved critical action items

Slide 29

Slide 29 text

29 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process Why didn’t we already have a unified company-wide process? • Well, because… • Each division had its own way of handling incidents. • We’d tried before to build a company-wide protocol: the “IRF: Incident Response Framework”. • But... it never really got used. • Why? — It only reflected the needs of one division.

Slide 30

Slide 30 text

30 How did we build a company-wide process and framework? • The IRF was well organized • Rebuilt it as a company-wide & lightweight framework • Domain “all-stars” filled in the gaps • Must stay lightweight — who reads a wall of text during a fire🔥? • Borrowed proven parts from public frameworks • e.g. PagerDuty Incident Response Compile a unified framework for the whole company! 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process

Slide 31

Slide 31 text

31 IRF 2.0 Contents 1. Role, Playbook 2. Severity Definition 3. Workflow 4. Communication Guideline 5. Incident Report Template, Postmortem Let me walk you through the key parts Please check the slide later for the details! 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process

Slide 32

Slide 32 text

32 IRF 2.0: Role, Playbook • On-Call Engineer • The engineer on call. Triages alerts and escalates to the IC if necessary, initiating the IRF (declaring the incident). • Incident Commander (IC) • Leads the incident response. Brings in necessary people and organizes information. May also act as the CL (Communication Lead). • Usually a Tech Lead or Engineering Manager. • Their responsibility is not to directly fix the issue, but to organize and make decisions. • Responder • Handles the actual work—such as rollbacks, config changes, etc. • Communication Lead (CL) • Handles communication with external stakeholders (i.e., non-engineers). Key point: Separate responsibilities between IC and Responder 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process

Slide 33

Slide 33 text

33 IRF 2.0: Severity Definition The IC makes a tentative severity judgment when declaring the incident. The final severity level is determined during the postmortem. • 🔥 SEV-1 • Complete failure of core UX features (e.g., news reading becomes unavailable) • 🧨 SEV-2 • Partial failure of core UX features, or complete failure of sub-UX features • 🕯 SEV-3 • Partial failure of sub-UX features It's crucial to estimate severity early on— severe incidents should be resolved faster 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process

Slide 34

Slide 34 text

34 IRF 2.0: Workflow The flow from start to finish of an incident. 🩸 = Bleeding status (ongoing impact) 1. 🩸 Occurrence • An issue arises. Common triggers include deployments or config changes. 2. 🩸 Detection • The on-call engineer detects the issue via alerts. Triage begins. 3. 🩸 Declaration • The incident is officially declared. IRF begins under the IC's lead. External communication starts as needed. • While bleeding, updates must be continuously provided. 4. ❤🩹 Mitigation • Temporarily eliminate the cause (e.g., rollback) and stop further impact. 5. Resolution • Permanently fix the issue (e.g., bug fix, data correction). Bleeding is fully stopped. 6. Postmortem • Investigate root causes and discuss recurrence prevention based on the incident report. 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process

Slide 35

Slide 35 text

35 IRF 2.0: Communication Guideline Defines where communication should take place (Slack channels): • #incident • Used for status updates to the entire company and for communication with external stakeholders. • #incident-irf-[incidentId]-[title] • For technical communication to resolve the issue. • All relevant discussions and information are gathered here. Having all discussions and info in one place makes writing the report much easier later 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process

Slide 36

Slide 36 text

36 IRF 2.0: Incident Report Template & Postmortem A unified company-wide template includes: • Summary • Impact • Direct Cause, Mitigation • Root Cause Analysis (5-whys) • It’s crucial to analyze direct and root causes separately. • Based on root causes, define action items to prevent recurrence • Timeline • Use a machine-readable format!!!! We standardized templates across divisions (super important!) and centralized all postmortems. 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process

Slide 37

Slide 37 text

37 We built it—but how do we make it land? “Here’s our amazing IRF 2.0. It’s perfect—so just read it and follow!” 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process

Slide 38

Slide 38 text

38 We built it—but how do we make it land? “Here’s our amazing IRF 2.0. It’s perfect—so just read it and follow!” 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process Noooo!

Slide 39

Slide 39 text

39 Get Our Hands Dirty: Forcefully apply IRF2.0 by diving into every incident “Hello there, it’s me, Uncle IRF😎 Alright, I’ll be the Incident Commander this time! Everyone else, focus on firefighting!” Lesson #4: In an emergency, no one has time to learn a new protocol. Just do & learn it! Lesson #5: Use it ourselves first, and build a feedback loop! 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process

Slide 40

Slide 40 text

40 A critical question came from… Hard at work, huh? So… did the number of incidents actually go down? What about MTTR? CTO Well… ME 2-4. P2: Root Fixes/ Enhancing Incident Clarity

Slide 41

Slide 41 text

41 A critical realization: We’re not tracking KPIs! Yes I did a lot! But… How many incidents did we handle this month? Or last month…? How long did it take to resolve each one? 2-4. P2: Root Fixes/ Enhancing Incident Clarity

Slide 42

Slide 42 text

42 A critical realization: We’re not tracking KPIs! Yes I did a lot! But… How many incidents did we handle this month? Or last month…? How long did it take to resolve each one? 2-4. P2: Root Fixes/ Enhancing Incident Clarity No clue!!

Slide 43

Slide 43 text

43 Let’s look at the data: what we need Data Collection Visualization Do this, then that, and boom! 2-4. P2: Root Fixes/ Enhancing Incident Clarity

Slide 44

Slide 44 text

44 Data Collection Visualization Data Definition Is Key! 2-4. P2: Root Fixes/ Enhancing Incident Clarity Let’s look at the data: what we need

Slide 45

Slide 45 text

Data Definition: Modeling Incidents • Attributes of an Incident • Title • Status • A state machine (we’ll get to this later) • Severity • SEV 1–3 (IRF 2.0) • Direct Cause • (explained later) • Direct Cause System • Group of components defined at the microservice level • Direct Cause Workload • Online Service, Offline Pipeline, … Define as many fields as possible using Enums! 45 Free-form input → high cardinality, analysis breaks 2-4. P2: Root Fixes/ Enhancing Incident Clarity
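A minimal sketch of what such an enum-based incident model could look like in Python. The field names follow the slide; the concrete enum members beyond SEV-1..3 and the exact typing are illustrative assumptions, not the actual SmartNews schema.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum

class Severity(Enum):
    SEV1 = 1   # complete failure of core UX features
    SEV2 = 2   # partial failure of core UX / complete failure of sub-UX
    SEV3 = 3   # partial failure of sub-UX features

class Status(Enum):
    OCCURRED = "occurred"
    DETECTED = "detected"
    DECLARED = "declared"
    MITIGATED = "mitigated"
    RESOLVED = "resolved"

class DirectCauseWorkload(Enum):       # illustrative members
    ONLINE_SERVICE = "online_service"
    OFFLINE_PIPELINE = "offline_pipeline"

@dataclass
class Incident:
    title: str                          # free text is fine here
    status: Status                      # everything you want to analyze is an Enum
    severity: Severity
    direct_cause: str
    direct_cause_system: str            # microservice-level component group
    direct_cause_workload: DirectCauseWorkload
    occurred_at: datetime
```

Keeping the analyzable fields as Enums is what keeps cardinality low, so grouping and counting in the dashboard stays trivial.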

Slide 46

Slide 46 text

Incident Modeling: Direct Cause The direct cause of the incident Define it in a way that makes it analyzable and actionable 46 2-4. P2: Root Fixes/ Enhancing Incident Clarity

Slide 47

Slide 47 text

Incident Modeling: Status -- State Machine Incidents have states and transition between them — a State Machine! 2-4. P2: Root Fixes/ Enhancing Incident Clarity Record time for every transition; key time metrics pop out automatically
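A sketch, continuing the hypothetical Python model above, of how transition timestamps could be recorded so the time metrics fall out of the timeline. The span boundaries (detect from occurrence, mitigate from detection, resolve from mitigation) are my reading of the slides, not a stated definition.

```python
from datetime import datetime, timezone

# Allowed transitions of the incident state machine from the slides.
TRANSITIONS = {
    "occurred": {"detected"},
    "detected": {"declared"},
    "declared": {"mitigated"},
    "mitigated": {"resolved"},
}

class IncidentStateMachine:
    def __init__(self):
        self.state = "occurred"
        self.timeline = {"occurred": datetime.now(timezone.utc)}

    def transition(self, new_state: str, at: datetime | None = None):
        """Move to the next state and record when it happened."""
        if new_state not in TRANSITIONS.get(self.state, set()):
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
        self.timeline[new_state] = at or datetime.now(timezone.utc)

    # The key time metrics pop out of the recorded timeline.
    def time_to_detect(self):
        return self.timeline["detected"] - self.timeline["occurred"]

    def time_to_mitigate(self):            # bleeding window after detection
        return self.timeline["mitigated"] - self.timeline["detected"]

    def time_to_resolve(self):             # root fix after the bleeding has stopped
        return self.timeline["resolved"] - self.timeline["mitigated"]
```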

Slide 48

Slide 48 text

Data Collection: Incident Report Update the incident report template If the data definition is solid, the source can be flexible. (As long as the data is trustworthy, of course.) 48 2-4. P2: Root Fixes/ Enhancing Incident Clarity • Make required fields into mandatory attributes. • Add a Notion Database for state timeline • Have people record when states change • Make it machine-readable!!!!!

Slide 49

Slide 49 text

The rest is easy: do this, then this, and boom! 49 ChatGPT did it overnight 2-4. P2: Root Fixes/ Enhancing Incident Clarity Data Collection Visualization

Slide 50

Slide 50 text

Incident Dashboard: Visualize the key metrics All green! 2-4. P2: Root Fixes/ Enhancing Incident Clarity

Slide 51

Slide 51 text

Incident Dashboard: Visualize the key metrics All green! 2-4. P2: Root Fixes/ Enhancing Incident Clarity Hold up!

Slide 52

Slide 52 text

What about the past data? Reports before IRF 2.0 (the unified format): • Different formats across divisions • Free-form input, missing attributes, etc. • Now what? Give it three months and we’ll have plenty of data. Right? We only have six months!! 2-4. P2: Root Fixes/ Enhancing Incident Clarity

Slide 53

Slide 53 text

What about past data? Of course—we Get Our Hands Dirty! Heck yeah! We manually migrated one year’s worth of incident reports. We divided the work and got it done within a week. 2-4. P2: Root Fixes/ Enhancing Incident Clarity Re: Lesson #2: Don’t be afraid to get your hands dirty if it serves the mission.

Slide 54

Slide 54 text

Incident Dashboard: Visualizing key metrics 2-4. P2: Root Fixes/ Enhancing Incident Clarity All green! All green!

Slide 55

Slide 55 text

Side Effects: Observing the MTTR Breakdown 1. Occurred 2. Detected 3. Declared 4. Mitigated 5. Resolved (spans: Time To Detect, Time To Mitigate, Time To Resolve) Now we know where the time is going—and where we stand! 2-4. P2: Root Fixes/ Enhancing Incident Clarity
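Spelled out against the state timeline (my reading of the slide's spans; the exact boundaries are not stated, but this is consistent with slide 58's MTTR = MTTD + MTTM):

```latex
\mathrm{TTD} = t_{\text{detected}} - t_{\text{occurred}}, \qquad
\mathrm{TTM} = t_{\text{mitigated}} - t_{\text{detected}}, \qquad
\mathrm{TTR} = t_{\text{resolved}} - t_{\text{mitigated}}
```

so the “bleeding” window per incident is TTD + TTM, and MTTR as used here is its average.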

Slide 56

Slide 56 text

Phase 3: Halving incidents!? 03

Slide 57

Slide 57 text

57 3-1. What Does It Mean to Reduce Incidents? What do we really want to reduce — incident count? 🤔 Maybe not. It’s the impact caused by incidents: e.g. revenue, reputation, developer velocity… Especially revenue loss… Right?

Slide 58

Slide 58 text

58 3-1. What Does It Mean to Reduce Incidents? Estimating incident impact: Number of Incidents × Severity Factor (impact level of an incident) × MTTR (MTTD + MTTM, the time to stop the bleeding), summed (Σ) over incidents. For us (B2C and Ads) this pretty much defines the revenue impact. • Shorten the time → Quick win — Relatively easy to improve • Reduce severity → Ideal, but hard to control • Reduce incident count → Requires mid/long-term efforts
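Written as a formula (my reconstruction of the slide’s diagram):

```latex
\text{Estimated impact} \;\approx\; \sum_{i=1}^{N_{\text{incidents}}} S_i \times \mathrm{TTR}_i,
\qquad \mathrm{TTR}_i = \mathrm{TTD}_i + \mathrm{TTM}_i
```

where \(S_i\) is the severity factor of incident \(i\) and \(\mathrm{TTR}_i\) its time to stop the bleeding; the three levers below correspond to the three factors of the sum.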

Slide 59

Slide 59 text

59 3-1. What Does It Mean to Reduce Incidents? Estimating incident impact: Number of Incidents × Severity Factor (impact level of an incident) × MTTR (MTTD + MTTM, the time to stop the bleeding), summed (Σ) over incidents. For us (B2C and Ads) this pretty much defines the revenue impact. • Shorten the time → Quick win — Relatively easy to improve ← We’re starting here • Reduce severity → Ideal, but hard to control • Reduce incident count → Requires mid/long-term efforts It also aligns with the KPIs we set when ACT was first formed! But a few months in, we gained much better clarity.

Slide 60

Slide 60 text

60 3-2. Approaching Incident Resolution Time How do we reduce MTTR? Seriously? Lesson #6: If a top-tier ace jumps into an incident, the incident is resolved faster (…maybe?)

Slide 61

Slide 61 text

Improving MTTR Clarity: Breaking Down MTTR with the State Machine 1. Occurred 2. Detected 3. Declared 4. Mitigated 5. Resolved • Time To Detect: mainly in the alerting domain. • Time To Mitigate: bleeding — most critical, but also the easiest to improve. This is where IRF comes in. • Time To Resolve: time spent on root fixes and data correction. Bleeding has stopped — now it’s about accuracy, not speed. Each one has a different significance — and needs a different solution! 3-2. Approaching Incident Resolution Time

Slide 62

Slide 62 text

62 Approaching MTTD (Mean Time To Detect) — Alerting • Adding more alerts doesn’t help • It can even make things worse • “Over-monitoring is a harder problem to solve than under-monitoring.” — SRE: How Google Runs Production Systems • Too many alerts (maybe false positives) → alert fatigue → real alerts get buried/ignored • Alert on SLO / Error-Budget burn instead • Not something you can fix overnight Still a work in progress — we’ll revisit this in Chapter 4 3-2. Approaching Incident Resolution Time

Slide 63

Slide 63 text

63 Approaching MTTM (Mean Time To Mitigate) — IRF 2.0 A unified framework: IRF 2.0 • Clear incident definition – when to call it an incident • Unified response workflow & communication guideline • Role split: Incident Commander vs Responder • The Responder can focus on firefighting • Ongoing drills & training — aces lead by example Deploying top aces + rolling out IRF 2.0 had a huge impact! 3-2. Approaching Incident Resolution Time

Slide 64

Slide 64 text

64 3-3. Approaching the Number of Incidents How do we reduce incidents themselves? Nooo… Lesson #7: Even if top-tier aces jump into incidents… The number of incidents won’t go down!!

Slide 65

Slide 65 text

65 What we need: Tackle the Bottlenecks We’ve got the data, don’t we?! Come forth—Incident Dashboard!! 3-3. Approaching the Number of Incidents

Slide 66

Slide 66 text

66 Now we know where and why incidents happen. Backed by data, we tackled each root cause head-on! 3-3. Approaching the Number of Incidents What we need: Tackle the Bottlenecks

Slide 67

Slide 67 text

67 #1 Incident Cause Lack of Testing 3-3. Approaching the Number of Incidents Tackle the Bottlenecks

Slide 68

Slide 68 text

68 Approach #1 to Lack of Testing: Released to production without testing Postmortem discussion… • Why was it deployed without testing? • → Because it could only be tested in production. • Why only in production? • → Lack of data, broken staging environment, etc… • …. Alright! Let’s fix the staging environment! 3-3. Approaching the Number of Incidents

Slide 69

Slide 69 text

69 Approach #1 to Lack of Testing: Building Out Staging Environments Still in Progress: Way harder than we thought! • There are tons of components. • Each division—News, Ads, Infra—has different needs and usage. • Ads is B2B, tied directly to revenue → needs to be solid and stable • News is B2C, speed of feature delivery is key Trying to build staging for everything? Not realistic, not even useful So we started with Ads, where the demand was highest 3-3. Approaching the Number of Incidents

Slide 70

Slide 70 text

70 Approach #2 to Lack of Testing: What about Unit Tests? • Why didn’t we catch it with unit tests? • Because we didn’t have any… • … 😭 Alright! Let’s collect test coverage! 3-3. Approaching the Number of Incidents

Slide 71

Slide 71 text

71 Approach #2 to Lack of Testing: Analyzing Unit Test Coverage • Jumped into systems lacking coverage tracking • Opened PRs for generating coverage reports • Plotted unit test coverage vs. # of incidents by system (scatter plot: avg. coverage vs. # of incidents) 3-3. Approaching the Number of Incidents

Slide 72

Slide 72 text

72 • Was there a correlation between test coverage and incidents? → Yes, there was. (= Low coverage means more incidents) • Then: Does higher coverage actually reduce incidents? → Not sure. Correlation ≠ Causation. • Still, digging into low-coverage / high-incident systems revealed similar roots: • Hard to write tests / no testing culture / etc… Alright! Let’s jump into the low-coverage systems and help write unit tests! Approach #2 to Lack of Testing: Analyzing Unit Test Coverage 3-3. Approaching the Number of Incidents
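For illustration, a quick way to check such a correlation in Python. The numbers here are made up, not the actual SmartNews data.

```python
import numpy as np

# Hypothetical per-system data: average unit-test coverage (%) and incident counts.
coverage  = np.array([12, 25, 38, 45, 60, 72, 80])
incidents = np.array([ 9,  7,  6,  4,  3,  2,  1])

# Pearson correlation coefficient between coverage and incident count.
r = np.corrcoef(coverage, incidents)[0, 1]
print(f"correlation: {r:.2f}")   # strongly negative -> lower coverage, more incidents

# Correlation != causation: this alone does not prove that raising coverage
# reduces incidents, only that the two move together in this sample.
```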

Slide 73

Slide 73 text

73 Approach #2 to Lack of Testing: Building Out Unit Tests Get our hands dirty! Add tests to everything: 1. Use SonarQube to find files with high LOC and low coverage 2. Use LLMs to help generate tests 3. Repeat until the entire component hits 50%+ coverage We thought that if we provided a few examples, others would follow… We hit 3–4 components… but it didn’t really change anything. 3-3. Approaching the Number of Incidents
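A sketch of step 1 using SonarQube’s web API (`api/measures/component_tree`). The server URL and project key are placeholders, and the “high LOC, low coverage” thresholds are my own, not the team’s actual cutoffs.

```python
import os
import requests

SONAR_URL   = "https://sonarqube.example.com"     # placeholder server
PROJECT_KEY = "my-component"                       # placeholder project key
AUTH        = (os.environ["SONAR_TOKEN"], "")      # token as basic-auth username

# Ask SonarQube for lines of code and coverage per file in the project.
resp = requests.get(
    f"{SONAR_URL}/api/measures/component_tree",
    auth=AUTH,
    params={
        "component": PROJECT_KEY,
        "metricKeys": "ncloc,coverage",
        "qualifiers": "FIL",
        "ps": 500,
    },
)
resp.raise_for_status()

# Surface big files with little coverage: the best candidates for new tests.
for comp in resp.json()["components"]:
    measures = {m["metric"]: float(m["value"]) for m in comp.get("measures", []) if "value" in m}
    ncloc, cov = measures.get("ncloc", 0), measures.get("coverage", 100.0)
    if ncloc > 300 and cov < 20:                   # arbitrary illustrative cutoffs
        print(f"{comp['path']}: {int(ncloc)} LOC, {cov:.1f}% coverage")
```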

Slide 74

Slide 74 text

74 Approach #2 to Lack of Testing: Building Out Unit Tests • Raising coverage by example didn’t change behavior • We need to build a habit of writing tests continuously • Problems: • No incentive, no shared value • Everyone is busy: pressured by hard deadlines • (May 2025 update): LLMs could change the game! This is a team-culture and organizational challenge. To be continued in Chapter 4… 3-3. Approaching the Number of Incidents

Slide 75

Slide 75 text

75 #2 Incident Cause Config Changes 3-3. Approaching the Number of Incidents Tackle the Bottlenecks

Slide 76

Slide 76 text

76 Approaching Config Changes • “Config Changes”: control app behavior dynamically/online → e.g. A/B testing and feature flags, testing in production • We have in-house platforms for both • Problems • They were complicated… → Unintended A/B assignments and misconfigurations caused frequent issues Alright, let’s clean up A/B testing and feature flags! 3-3. Approaching the Number of Incidents

Slide 77

Slide 77 text

77 • Bulk deletion of unused (defaulted) feature flags • Established usage guidelines for feature flags • Strengthened validation logic • (Bad configs caused parse errors and crashes…) Collaborated with the platform team and made a lot of improvements! 3-3. Approaching the Number of Incidents Approaching Config Changes
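The in-house platform itself wasn’t shown, but as an illustration of “strengthen validation logic”: validating a flag config at save time means a bad config fails loudly in the UI instead of crashing the parser in production. Everything below (schema, field names) is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class FeatureFlag:
    name: str
    enabled: bool
    rollout_percent: int              # 0-100
    allowed_app_versions: list[str]

def validate_flag(raw: dict) -> FeatureFlag:
    """Reject malformed flag configs up front instead of letting clients choke on them."""
    missing = {"name", "enabled", "rollout_percent"} - raw.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if not isinstance(raw["enabled"], bool):
        raise ValueError("'enabled' must be a boolean, not a string like 'true'")
    pct = raw["rollout_percent"]
    if not isinstance(pct, int) or not 0 <= pct <= 100:
        raise ValueError("'rollout_percent' must be an integer in [0, 100]")
    return FeatureFlag(
        name=raw["name"],
        enabled=raw["enabled"],
        rollout_percent=pct,
        allowed_app_versions=list(raw.get("allowed_app_versions", [])),
    )

# Example: a config like this would previously have shipped and broken the parser.
try:
    validate_flag({"name": "new_feed_ui", "enabled": "true", "rollout_percent": 150})
except ValueError as e:
    print("rejected:", e)
```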

Slide 78

Slide 78 text

78 #3 Incident Cause Offline Batch …basically, Flink 3-3. Approaching the Number of Incidents Tackle the Bottlenecks

Slide 79

Slide 79 text

79 Approaching Offline Batch: Flink (open-source stream-processing framework) • A bunch of offline streaming Flink jobs… • e.g. Server → Kafka → Flink → Scylla, ClickHouse, … • We have an in-house platform for this as well • Problems • Few Flink experts on the app-team side → led to frequent issues: performance, restarts, missing unit tests, bugs, etc. Alright! Let’s revamp the Flink platform! 3-3. Approaching the Number of Incidents

Slide 80

Slide 80 text

80 • Improved the platform itself • Better UI, automated deployments, and more • Nurtured best practices • Provided best-practices documentation • Provided template projects (including tests!) • Sent refactoring PRs directly to various components • Implemented best practices and tests Collaborated with the platform team to improve the platform and its docs! 3-3. Approaching the Number of Incidents Approaching Offline Batch

Slide 81

Slide 81 text

81 ACT era vs. before ACT Number of Incidents… +32% Increase!! MTTR… -48% Decrease!! Halved!!! 3-4. Results: Did We Really “Halve” Incidents?

Slide 82

Slide 82 text

ACT era vs. before ACT 3-4. Results: Did We Really “Halve” Incidents? 82 Number of Incidents… +32% Increase!! MTTR… -48% Decrease!! Halved!!! Hold up!

Slide 83

Slide 83 text

83 Aren’t incidents actually rising…? • December spiked😭 just seasonal? • Holiday rush → last-minute changes? • IRF2.0 roll out side effect? • Clear definition → more detection? • Maslow’s Hammer: • “If all you have is IRF, everything starts to look like an incident” • After January, started trending down Keep our eyes on it — continuous effort required. 3-4. Results: Did We Really “Halve” Incidents?

Slide 84

Slide 84 text

84 On the other hand, MTTR was Halved! • Dramatic improvement in MTTMitigate • Thanks to the power of IRF2.0! • But MTTDetect didn’t improve • Detection is still a challenge Definitely felt the momentum of change! 3-4. Results: Did We Really “Halve” Incidents?

Slide 85

Slide 85 text

85 Overall Assessment No major change in the severity breakdown — (# of Incidents ↑) × (Resolution Time ↓) → impact slightly down We didn’t quite halve incidents… However, the challenges are clear, and the foundation is set for improvement! Let’s make it happen 3-4. Results: Did We Really “Halve” Incidents?

Slide 86

Slide 86 text

Phase 4: What remains, and what’s next 04

Slide 87

Slide 87 text

87 Again: We Want to Reduce Incidents, But… 4-1. Remaining Challenges We’ll never get incidents to zero. No way…

Slide 88

Slide 88 text

88 Can we really make them zero? Or should we? In reality — it never happens To truly minimize incidents… • Just stop releasing features? • → A slow death 😇 • Pour infinite cost (people, time) into prevention? • More cost likely correlates with fewer incidents… • → Keep testing until we feel 100% “safe”? 4-1. Remaining Challenges: Risk Management & Alerting

Slide 89

Slide 89 text

89 We want to balance delivery speed, quality, cost, and incident risk. • But how do we find the “right” balance? • It even differs for each system or project: • Required speed & release frequency • Cost we can throw in • Acceptable level of risk (≒ number of incidents, failure rate) • Ads is B2B and tied directly to revenue → needs to be rock-solid • News is B2C → speed of delivery comes first! Quantify our risk tolerance and use it to control how many incidents we accept. 4-1. Remaining Challenges: Risk Management & Alerting

Slide 90

Slide 90 text

90 SLOs and Error Budgets: Visualizing Risk Tolerance • SLO = Service Level Objective: “How much failure is acceptable?” • e.g. 99.9% available → 0.1% failure is “allowed” • Attach objectives to SLIs—metrics that reflect real UX harm • Error Budget: “How much failure room we have left” • When budget remains → we can take risk • Even bold releases are fair game • When the budget runs out → can’t take risk: UX is already suffering • No more risk — time to slow down — Ref: Implementing SLOs — Google SRE Error Budgets let us express risk tolerance—numerically. And in theory… this sounds pretty solid. 4-1. Remaining Challenges: Risk Management & Alerting
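A minimal sketch of the arithmetic, assuming a request-based availability SLI; the window length and traffic numbers are illustrative, not SmartNews figures.

```python
SLO = 0.999                     # 99.9% of requests should succeed
total_requests  = 100_000_000   # requests served in the SLO window (illustrative)
failed_requests = 60_000        # errors observed so far in the window

error_budget          = (1 - SLO) * total_requests      # 100,000 allowed failures
remaining_budget      = error_budget - failed_requests  # 40,000 failures of room left
budget_consumed_ratio = failed_requests / error_budget  # 0.6 -> 60% of the budget burned

print(f"budget: {error_budget:,.0f}, remaining: {remaining_budget:,.0f}, "
      f"consumed: {budget_consumed_ratio:.0%}")
# Budget remains -> we can still take release risk; once it runs out, slow down.
```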

Slide 91

Slide 91 text

91 Improving Alerting — An Alert = an Incident Alright! Time to get those SLOs in place! • Alert on fast Error-Budget burn (consumption) • e.g. burn-rate-based alerting • If you ignore it, the Error Budget runs out • → the SLO is violated • That means real UX damage! Users suffered! • Can’t ignore it — it is an incident! — Ref: Alerting on SLOs Sounds good 4-1. Remaining Challenges: Risk Management & Alerting
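A sketch of the burn-rate idea in the spirit of the “Alerting on SLOs” chapter; the window and threshold here are illustrative and deliberately simpler than the book’s multiwindow, multi-burn-rate recipe.

```python
def burn_rate(error_ratio_in_window: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' we are burning the error budget.
    A burn rate of 1.0 would consume the whole budget exactly over the SLO period."""
    return error_ratio_in_window / (1 - slo)

SLO = 0.999                                # 99.9% availability target

# Example: over the last hour, 1% of requests failed.
rate = burn_rate(error_ratio_in_window=0.01, slo=SLO)
print(f"burn rate: {rate:.0f}x")           # 10x: at this pace the budget lasts 1/10 of the window

# Page when the burn is fast enough that ignoring it would blow the SLO.
if rate >= 10:                             # illustrative fast-burn threshold
    print("page the on-call: this is an incident")
```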

Slide 92

Slide 92 text

92 4-2. Remaining Challenge: Shaping Org & Culture We tried rolling out SLOs in some places, but… • Defining effective SLOs isn’t easy • Biz and PdMs don’t always have the answers • Engineers have no time to implement SLOs • They can’t even find time to write unit tests! • And even if we set them up (actually, I did set up some…) • If no one respects the SLOs, what’s the point? There’s no silver bullet…

Slide 93

Slide 93 text

93 How do we make SLOs actually work? • We need everyone on board, across the company • Need an approach to culture • Ultimately, it’s about what we truly value • “Do we believe that balancing cost and risk with SLOs is worth it?” We want to install SLOs— and ultimately, the mindset of SRE—into our engineering culture. 4-2. Remaining Challenge: Shaping Org & Culture

Slide 94

Slide 94 text

94 • Bottom-up: ACT • Educate and train engineers, Biz, and PdMs—get everyone involved • Top-down: higher-ups • They are actually supportive of SRE • Ask leadership for support and direction SRE and DevOps are culture — they don’t take root in a day. It takes sweat, patience, and steady effort. 4-2. Remaining Challenge: Shaping Org & Culture How do we make SLOs actually work?

Slide 95

Slide 95 text

95 4-3. What Comes Next… Our six-month mission as ACT was coming to an end. • The challenge remained: install SRE into SmartNews’s engineering culture • Implement and uphold SLOs • And more… • Boost observability • Track and act on DORA metrics, etc. These require ongoing effort. How can we keep the momentum going— and tackle the remaining challenges even after ACT disbands?

Slide 96

Slide 96 text

96 How should we disband ACT? — Proposal: “Distributed SRE Team” After ACT ends, ex-members return to their teams and continue SRE work using X% of their time It sounded reasonable to me… maybe? 4-3. What Comes Next…

Slide 97

Slide 97 text

97 • Rejected • No one wanted SRE to be their full-time job. • And allocating “X%” of time… yeah, that never really works. • Our decision: • Ex-ACTors would keep helping and promoting SRE, but we’d take the time to build a dedicated SRE team. We made that call as a team. There’s still a lot left unfinished—but no regrets! How should we disband ACT? — The Team’s Call 4-3. What Comes Next…

Slide 98

Slide 98 text

98 Our Awesome Change! Our (tough!!) six-month mission as ACT has ended. Did we truly create an “Awesome Change”? Honestly… I’m not sure. 4-3. What Comes Next… But we do feel like “We’ve taken the first step on a long journey toward SRE!” And a huge thanks to my teammates for fighting through these past six months!

Slide 99

Slide 99 text

99 Your Awesome Change! 4-3. What Comes Next… Hope this session helps some of you. You show up at the office after the conference and your boss says, “Alright — starting today, your job is to Reduce Incidents.” …Where do you begin?

Slide 100

Slide 100 text

Thank you for Your Kind Attention!

Slide 101

Slide 101 text

101 References • Seeking SRE: Conversations About Running Production Systems at Scale • Site Reliability Engineering: How Google Runs Production Systems • The Site Reliability Workbook (Google SRE Workbook) • Effective DevOps: Building a Culture of Collaboration, Affinity, and Tooling at Scale • Fearless Change: Patterns for Introducing New Ideas