Slide 1

Slide 1 text

Monorepo Error Management
Runbooks and Team-Targeted Alert Distribution
Shota Iwami | Platform Engineer | newmo, Inc.

Slide 2

Slide 2 text

Have you ever experienced…
NEW ERROR notifications flooding your #alerts-channel that nobody checks until the next morning…?

Slide 3

Slide 3 text

What happened? Who owns it? Who’s on it? What’s my move?

Slide 4

Slide 4 text

Today's Takeaways and keywords:
🔔 Alerting: priority-based triage, especially in a monorepo
📖 Runbooks: declarative and auto-delivered, no more silos
🚚 GitOps: one workflow for all observability assets

Slide 8

Slide 8 text

Shota Iwami, Platform Engineer, newmo, Inc.

Slide 10

Slide 10 text

Shota Iwami, Platform Engineer, newmo, Inc.
I have 2 dogs...good friends!

Slide 11

Slide 11 text

A Colorful journey with newmo
newmo is a Japan-based mobility and FinTech start-up founded in January 2024. It operates taxi and ridesharing platforms and car-leasing services.

Slide 12

Slide 12 text

Organizational Context — the case of newmo

Slide 13

Slide 13 text

Organizational Context: the case of newmo
🚗 Multi-product mobility startup
📦 Monorepo, fluid team moves
🤝 Business/Engineering co-own incidents
👀 Visible alert channels for all

Slide 14

Slide 14 text

Key Pain Points & Solutions
Problem: Alert channel noise
Impact: No culture of checking alert channels
Problem: Unknown troubleshooting procedures
Impact: Notifications arrive without metadata, so responders do not understand the root cause or the fix
Problem: Unclear ownership
Impact: Ambiguous responsibilities due to team changes

Slide 19

Slide 19 text

Broken Windows Theory
One unrepaired broken window is a signal that no one cares, and so breaking more windows costs nothing. (James Q. Wilson, George L. Kelling)
Radically Re-Defining Alert Channels
An unrepaired window says nobody cares…an untriaged alert says the same!
Image: AI-generated (OpenAI / Raycast)

Slide 20

Slide 20 text

Two Approaches Covered in This Session
📖 Automatic Runbooks Generation
🚚 Team-Specific Alert Notification

Slide 22

Slide 22 text

Automatic Runbooks Generation “Declarative Runbook Management”

Slide 23

Slide 23 text

About Runbook
🚨 A World without procedures
😵 Unclear where to start
💤 Issues ignored
🤯 High cognitive load
💡 A World with procedures
✅ Immediate starting point
🚀 Fast action by more people
🙌 Anyone can help

Slide 25

Slide 25 text

About Runbook
“A runbook is an excellent way to quickly indicate the direction you should take when an alert comes in. As environments become more complex, not everyone on the team knows every system, and runbooks become an excellent way to spread knowledge.” (Mike Julian, Practical Monitoring, O’Reilly, 2017)
Having procedures documented (aka “Runbooks”) helps prevent silos.

Slide 26

Slide 26 text

Why Runbooks Still Fail
But even with runbooks...
❌ Scattered docs (not Git-managed)
❌ Hard to reference from logs
❌ Log volume limits rich error messages

Slide 30

Slide 30 text

Scattered Information (e.g. Linked via Custom Code)
Embedding custom code in errors and keeping runbooks separate increases both lookup and maintenance costs.
Error: Error Message, Stack Trace, Datadog URL Link, etc… + Custom Code for Runbooks
Runbook: Runbook with Custom Code
The two are linked via the custom code: from the custom code in the error, search for the runbook, check the content, and take action…

Slide 31

Slide 31 text

One Runbook for All Alert Flows

Slide 32

Slide 32 text

Custom Code (We Call It “Reason Code”)
Reason Code: Runbook
RC00000: A foo error occurred. Please check resources at https://example.com. If it's bar, please proceed with the standard workaround.
RC00001: 5xx errors returned from bar's system. Their system might be experiencing issues. Check the logs, and if it's not temporary, contact the responsible party via Slack. BAR, Inc. 555-1234-5678
Previously, these were documented in Notion/Confluence/GitHub Wiki, etc…

Slide 33

Slide 33 text

Go Generate!

Slide 34

Slide 34 text

Architecture

Slide 36

Slide 36 text

Architecture
✅ Here's our WHY: We want all the benefits of runbooks without the maintenance nightmare.
Change it once → instantly reflected everywhere!

Slide 37

Slide 37 text

Architecture

Slide 38

Slide 38 text

Single Source of Truth: Proto File
● Reason Code: unique code linking implementation and runbook
● Message (Runbook): runbook associated with the corresponding reason code
------------------------------------------
reason_code.proto:
------------------------------------------
enum ReasonCode {
  RC00000 = 0 [(ext.reason_code) = {
    message: [
      "A foo error occurred.",
      "Please check resources at https://example.com.",
      "If it's bar, please proceed",
      "with the standard workaround."
    ]
  }];
  RC00001 = 1 [(ext.reason_code) = {
    message: [
      "5xx errors returned from foo's system.",
      "Their system might be experiencing issues.",
      "Check logs and if it's not temporary,",
      "contact the responsible party through PM.",
      "foo, Inc.",
      "080-1111-1111"
    ]
  }];
}

Slide 41

Slide 41 text

Architecture

Slide 42

Slide 42 text

Application Code: Go
● Reason Code Constants: auto-generated reason code from proto file
● Custom Error: custom errors with reason codes for Datadog integration
------------------------------------------
reason_code.pb.go:
------------------------------------------
// Code generated by protoc-gen-go-reason-code.
type ReasonCode string

const (
	RC00000 ReasonCode = "RC00000" // Message: A foo...
	RC00001 ReasonCode = "RC00001" // Message: 5xx err...
)
------------------------------------------
error.go:
------------------------------------------
// Custom Error
type Error struct {
	error
	...
	reasonCodes []ReasonCode
	...
}

func WithReasonCode(rc ReasonCode) Option {
	return func(e *Error) {
		e.reasonCodes = append(e.reasonCodes, rc)
	}
}

Slide 45

Slide 45 text

Error JSON
● Reason codes in error field (structured logs)
{
  "error": {
    "reason_codes": [ "RC00000" ],
    ...
    "message": "foo error"
  }
}
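As a rough illustration of how such a log line could be produced, the sketch below marshals only the two fields visible on the slide (`reason_codes` and `message`). The struct names and the `encodeErrorLog` helper are assumptions, and the fields elided on the slide are intentionally left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// errorField models only the fields visible on the slide.
type errorField struct {
	ReasonCodes []string `json:"reason_codes"`
	Message     string   `json:"message"`
}

// logEntry nests the error under the "error" key.
type logEntry struct {
	Error errorField `json:"error"`
}

// encodeErrorLog is a hypothetical helper that renders one structured log line.
func encodeErrorLog(message string, reasonCodes ...string) (string, error) {
	if reasonCodes == nil {
		reasonCodes = []string{} // emit [] instead of null when no code is attached
	}
	b, err := json.Marshal(logEntry{Error: errorField{ReasonCodes: reasonCodes, Message: message}})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	line, err := encodeErrorLog("foo error", "RC00000")
	if err != nil {
		panic(err)
	}
	fmt.Println(line)
}
```

Keeping the reason codes as a JSON array under a fixed key gives the monitor a stable attribute path to match on.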

Slide 46

Slide 46 text

Architecture

Slide 47

Slide 47 text

Datadog Monitors
Conditional variables

Slide 51

Slide 51 text

Message Branching with is_match
● Branch message by reason code with is_match
● Show runbook content when reason code matches
● Context-aware alert messages

{{#is_match "log.attributes.error.reason_codes" "\"RC00000\""}}
reason_code: RC00000
```
A foo error occurred.
Please check resources at https://example.com.
If it's bar, please proceed with the standard workaround.
```
{{/is_match}}

💭 OK, I get how this works with Datadog!! But so what?

Slide 52

Slide 52 text

From Garbage Drift to Gift Delivery
BEFORE
Alert channel = 🗑 Garbage floating in digital ocean
Let's be honest: our alert channel was a digital wasteland…

Slide 53

Slide 53 text

From Garbage Drift to Gift Delivery
AFTER
Alert channel = 🎁 Smart gifts arriving with perfect timing
Now it's a treasure chest of actionable intelligence!!
🎁 A foo error occurred. Please check resources at…
🎁 5xx errors returned from foo's system….
🎁 5xx errors returned from bar’s system….
🎁 429 errors returned from foo's system….
🎁 System A is down…

Slide 54

Slide 54 text

Architecture

Slide 56

Slide 56 text

Monitors as Code
● Monitor markdown auto-generated
● is_match branches auto-generated
● Loaded via built-in templatefile function
Complex, large-scale branching with zero manual cost, enabled by automation
------------------------------------------
application_error_alert_message.pb.md:
------------------------------------------
{{!-- Monitor generated by protoc-gen-go-reason-code. DO NOT EDIT. --}}
{{ log.message }}
{{#is_match "log.attributes.error.reason_codes" "\"RC00000\"" }}
reason_code: `RC00000`
A foo error occurred.
Please check resources at https://example.com.
If it's bar, please proceed with the standard workaround.
{{/is_match}}
{{#is_match "log.attributes.error.reason_codes" "\"RC00001\"" }}
reason_code: `RC00001`
5xx errors returned from bar's system.
Their system might be experiencing issues.
Check the logs, and if it's not temporary,
contact the responsible party via Slack.
bar, Inc.
555-1234-5678
{{/is_match}}
...
{{#is_alert}}
@${notification_alert_channel}
{{/is_alert}}
------------------------------------------
monitor.tf:
------------------------------------------
resource "datadog_monitor" "application_error_alert" {
  ...
  message = templatefile(
    "${path.module}/messages/application_error_alert_message.pb.md",
    {
      notification_alert_channel = var.notification_alert_channel
    }
  )
}
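The generation of the `is_match` branches can be sketched in a few lines of Go. This is an illustrative approximation, not the real plugin: the `runbook` type and `renderMonitorMessage` function are assumptions, the input is an in-memory slice in place of the proto descriptor, and the `is_alert` footer is omitted.

```go
package main

import (
	"fmt"
	"strings"
)

// runbook is a hypothetical in-memory form of one reason code entry.
type runbook struct {
	Code  string
	Lines []string
}

// renderMonitorMessage emits one is_match block per reason code,
// following the layout of the generated markdown shown on the slide.
func renderMonitorMessage(runbooks []runbook) string {
	var b strings.Builder
	b.WriteString("{{ log.message }}\n")
	for _, rb := range runbooks {
		// The reason code is wrapped in escaped quotes, as on the slide.
		fmt.Fprintf(&b, `{{#is_match "log.attributes.error.reason_codes" "\"%s\"" }}`+"\n", rb.Code)
		fmt.Fprintf(&b, "reason_code: `%s`\n", rb.Code)
		for _, line := range rb.Lines {
			b.WriteString(line + "\n")
		}
		b.WriteString("{{/is_match}}\n")
	}
	return b.String()
}

func main() {
	fmt.Print(renderMonitorMessage([]runbook{
		{Code: "RC00000", Lines: []string{"A foo error occurred.", "Please check resources at https://example.com."}},
		{Code: "RC00001", Lines: []string{"5xx errors returned from bar's system."}},
	}))
}
```

Since the branches are rendered in a loop, adding a hundredth reason code costs the same as adding the first: one new entry in the source definition.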

Slide 58

Slide 58 text

After
Alert with Runbook
● Reduced divergence between Runbooks and implementation code
● One PR updates code, documentation, and monitors simultaneously
● Runbooks provided directly in alert notifications, eliminating manual linking
One proto to rule them all, One proto to find them, One proto to generate them all, and in the code bind them! 💍

Slide 59

Slide 59 text

Two Approaches Covered in This Session
📖 Automatic Runbooks Generation
🚚 Team-Specific Alert Notification

Slide 60

Slide 60 text

Team-Specific Alert Notification

Slide 62

Slide 62 text

Slack Notifications
● To mention a Slack group, you must specify the group ID, not the group name
● The mention can't be switched dynamically based on the service name in log attributes
● The Slack group ID must be specified

Slide 63

Slide 63 text

Mapping CSV (Git-managed) — Leveraging Reference Tables

Slide 64

Slide 64 text

About Reference Tables
What are Reference Tables?
● Add metadata to information already in Datadog
● Describe metadata in CSV format
● Data sources can be direct CSV upload, S3, GCS, or Azure Storage

Slide 65

Slide 65 text

Linking Slack User Group IDs with Reference Tables
● Define mappings in CSV
● Manage CSVs in GitHub
● Changes are synced to GCS
● Datadog automatically updates from the latest CSV
GitHub → Cloud Storage → Reference Tables
------------------------------------------
slack-group-id.csv:
------------------------------------------
service,id,name
component.foo,aaaabbbb1234,alert-server-component-foo
component.bar,cccdddd4567,alert-server-component-bar
component.baz,eeefff8901,alert-server-component-baz
...
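To make the mapping concrete, the Go sketch below parses a CSV in the same three-column layout and turns a service name into a Slack user group mention. The `<!subteam^ID>` form is Slack's message syntax for mentioning a user group by ID; everything else (`slackGroup`, `loadGroups`, `mention`) is a hypothetical local stand-in for the lookup that Datadog performs through Reference Tables, which these slides do not show in detail.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// slackGroup holds the id and name columns of one CSV row.
type slackGroup struct {
	ID   string
	Name string
}

// loadGroups parses CSV text with a "service,id,name" header into a
// map keyed by service name.
func loadGroups(data string) (map[string]slackGroup, error) {
	rows, err := csv.NewReader(strings.NewReader(data)).ReadAll()
	if err != nil {
		return nil, err
	}
	if len(rows) == 0 || len(rows[0]) != 3 {
		return nil, fmt.Errorf("expected a 3-column header: service,id,name")
	}
	groups := make(map[string]slackGroup, len(rows)-1)
	for _, row := range rows[1:] { // skip the header row
		groups[row[0]] = slackGroup{ID: row[1], Name: row[2]}
	}
	return groups, nil
}

// mention returns the Slack user group mention for a service, if mapped.
func mention(groups map[string]slackGroup, service string) (string, bool) {
	g, ok := groups[service]
	if !ok {
		return "", false
	}
	return "<!subteam^" + g.ID + ">", true
}

func main() {
	data := "service,id,name\n" +
		"component.foo,aaaabbbb1234,alert-server-component-foo\n" +
		"component.bar,cccdddd4567,alert-server-component-bar\n"
	groups, err := loadGroups(data)
	if err != nil {
		panic(err)
	}
	if m, ok := mention(groups, "component.foo"); ok {
		fmt.Println(m)
	}
}
```

Keeping the mapping in a Git-managed CSV means a team change is a one-line diff that goes through the same review flow as code.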

Slide 66

Slide 66 text

Linking Slack User Group IDs with Reference Tables
Specify in Monitor Message
● Specify the group ID from log attributes in the monitor message.

Slide 67

Slide 67 text

Impact
Alert Notification With Group Mention

Slide 68

Slide 68 text

Impact
🔭 Easy to spot priority alerts for your group

Slide 69

Slide 69 text

Impact
📛 Owners instantly notified via Slack mention badges

Slide 70

Slide 70 text

Impact
🚔 Escalation is possible if owners don’t respond

Slide 71

Slide 71 text

Wrap Up!

Slide 72

Slide 72 text

Wrap Up!
📖 Automatic Runbooks Generation: runbooks embedded in code & monitors, always up to date, always in sync
🚚 Team-Specific Alert Notification: get the right alert to the right team, instantly, every time

Slide 73

Slide 73 text

Thank You! Shota Iwami Platform Engineer newmo, Inc. X: @B_Sardine ⚠ 💯