James Smith
October 02, 2015
GitHub Universe 2015 Talk - Your software is broken — pay attention: Rethinking production monitoring
My talk from GitHub Universe 2015's "Deploy" track
Transcript
RETHINKING PRODUCTION MONITORING YOUR SOFTWARE IS BROKEN — PAY ATTENTION
JAMES SMITH (loopj)
CODE TEST DEPLOY YOLO ¯\_(ツ)_/¯
CODE TEST DEPLOY YOLO ¯\_(ツ)_/¯
CODE TEST DEPLOY CONFIDENCE :)
STABILITY PERFORMANCE AVAILABILITY
DELIVERING AN AWESOME EXPERIENCE TO CUSTOMERS
WHY MONITORING MATTERS
YOUR APP WILL LIVE OR DIE BASED ON ITS QUALITY
— CUSTOMERS HAVE A CHOICE
84% OF USERS ABANDON AFTER TWO CRASHES
49% OF ENGINEERING TIME FINDING & FIXING BUGS
SINS OF PRODUCTION MONITORING: WHAT AM I DOING WRONG?
1. PRETENDING NOTHING IS WRONG
“But I’ve written tests!” “The QA Team will check that!”
“Works great for me!”
2. WAITING FOR CUSTOMERS TO COMPLAIN
“Nobody complained so everything must be OK”
3. LACK OF VISIBILITY
“We’ll just check the logs” “Did you remember to add a log statement?”
4. LACK OF OWNERSHIP
“Not my problem!” “I’ve got a feature to ship” “My code works fine”
HOW CAN WE DO BETTER?
CORE PRINCIPLES OF PRODUCTION MONITORING: ACCEPT, AUTOMATE, AGGREGATE, NOTIFY, PRIORITIZE, DIAGNOSE, TEND
1. ACCEPT: ACCEPT THAT YOUR SOFTWARE WILL BREAK AFTER SHIPPING
2. AUTOMATE: ADD HOOKS TO DETECT CRASHES/ERRORS/ISSUES IN PRODUCTION
3. AGGREGATE: DON'T JUST HAVE A STREAM OF EVENTS - GROUP LIKE ISSUES TOGETHER
4. NOTIFY: ALERT YOUR DEV TEAM WHERE THEY ALREADY COMMUNICATE
5. PRIORITIZE: YOU CAN'T FIX EVERY ERROR - SO FOCUS ON THE MOST HARMFUL ONES
6. DIAGNOSE: KNOWING ABOUT ISSUES ISN'T ENOUGH - THEY MUST BE ACTIONABLE
7. TEND: MAKE AN ORGANIZATIONAL CHANGE - SOMEONE NEEDS TO CARE ABOUT ERRORS
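
The AUTOMATE and AGGREGATE principles can be illustrated with a small Python sketch. It is not taken from the talk: the function names, the in-memory counter, and the choice to group by exception class plus innermost function are all assumptions made for illustration. The only standard mechanism relied on is Python's `sys.excepthook`, which is called for uncaught exceptions.

```python
import hashlib
import sys
import traceback
from collections import Counter

# Hypothetical in-memory store: grouping key -> number of occurrences.
# A real setup would send events to a monitoring service instead.
error_counts = Counter()


def grouping_key(exc_type, tb):
    """Build a stable key from the exception class and the innermost frame.

    Line numbers are left out on purpose so that small code edits
    do not split one issue into many groups (an assumed design choice).
    """
    frames = traceback.extract_tb(tb)
    if frames:
        top = frames[-1]
        location = f"{top.filename}:{top.name}"
    else:
        location = "unknown"
    raw = f"{exc_type.__name__}|{location}"
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()[:12]


def record_exception(exc_type, exc_value, tb):
    """Count one occurrence of an error under its grouping key."""
    key = grouping_key(exc_type, tb)
    error_counts[key] += 1
    return key


def install_hook():
    """Record every uncaught exception, then defer to the previous hook."""
    previous = sys.excepthook

    def hook(exc_type, exc_value, tb):
        record_exception(exc_type, exc_value, tb)
        previous(exc_type, exc_value, tb)

    sys.excepthook = hook
```

Calling `install_hook()` once at startup covers uncaught exceptions in the main thread; handled exceptions can be passed to `record_exception(*sys.exc_info())` from inside an `except` block.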
TAKING ACTION
TOOLS
USES “FAILURE” HOOKS
ASSESS IMPACT
ASSESS SEVERITY
CAPTURES DIAGNOSTIC DATA
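
One way to picture what such a tool tracks per error group is sketched below in Python. This is an illustration only: the `ErrorGroup` class, the severity weights, and the priority formula (severity weight times number of affected users) are assumptions, not something stated in the slides.

```python
from dataclasses import dataclass, field

# Assumed weights; any real tool will define severity in its own way.
SEVERITY_WEIGHT = {"info": 1, "warning": 2, "error": 5}


@dataclass
class ErrorGroup:
    name: str
    severity: str = "error"
    events: int = 0
    affected_users: set = field(default_factory=set)
    last_diagnostics: dict = field(default_factory=dict)

    def record(self, user_id, diagnostics):
        """Register one occurrence along with its diagnostic data."""
        self.events += 1
        self.affected_users.add(user_id)
        self.last_diagnostics = dict(diagnostics)

    def priority(self):
        """Impact (distinct users affected) scaled by severity."""
        return SEVERITY_WEIGHT[self.severity] * len(self.affected_users)


def most_harmful(groups, limit=3):
    """Return the groups to fix first, highest priority first."""
    return sorted(groups, key=lambda g: g.priority(), reverse=True)[:limit]
```

Counting distinct users rather than raw events keeps one user stuck in a retry loop from outranking a crash that hits many users.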
WORKFLOW
USE TEAM CHAT
EMBRACE COLLABORATION
TRACK PROGRESS OF FIXES
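
As a rough sketch of the "use team chat" step, the Python below builds a short alert and posts it to a chat webhook. The function names are hypothetical, and the JSON body with a single `text` field is an assumption: many chat tools accept that shape for incoming webhooks, but the exact format should be checked against the documentation of the tool in use.

```python
import json
import urllib.request


def build_chat_message(error_name, events, users, details_url):
    """Build a one-line alert that links back to the full error report."""
    text = (
        f"New error: {error_name} "
        f"({events} events, {users} users affected) {details_url}"
    )
    # Assumed payload shape: a JSON object with a "text" field.
    return {"text": text}


def post_to_chat(webhook_url, message, timeout=5):
    """POST the message as JSON to the webhook and return the HTTP status."""
    data = json.dumps(message).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Including the link in the message lets whoever picks up the alert in chat jump straight to the diagnostic data.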
TEAM STRUCTURES
EMBRACE RAPID ITERATION
CREATE A “BUG TEAM”
OR CREATE A “BUG ROTATION”
OR KNOW “WHO LAST TOUCHED THIS CODE”?
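
For the "who last touched this code" option, one possible approach (an assumption, since the slides do not name a tool) is to ask `git blame` about the line at the top of a stack trace. The Python sketch below runs `git blame --line-porcelain` for a single line and reads the `author` and `author-mail` fields from its output; the helper names are hypothetical.

```python
import subprocess


def parse_blame_author(porcelain_output):
    """Extract (author, email) from `git blame --line-porcelain` output."""
    author, email = None, None
    for line in porcelain_output.splitlines():
        if line.startswith("\t"):
            break  # a tab-prefixed line is the source line itself
        if line.startswith("author "):
            author = line[len("author "):]
        elif line.startswith("author-mail "):
            email = line[len("author-mail "):].strip("<>")
    return author, email


def last_author(path, line_number, repo_dir="."):
    """Ask git who last changed the given line of the given file."""
    result = subprocess.run(
        ["git", "blame", "--line-porcelain",
         "-L", f"{line_number},{line_number}", "--", path],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_blame_author(result.stdout)
```

The result is best treated as a starting point for a conversation about the error, since the last person to edit a line is not always the person who introduced the bug.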
TL;DR
AVOID THE SINS
EMBRACE CORE PRINCIPLES
TAKE ACTION
THANK YOU!
QUESTIONS?
BUGSNAG IS HIRING! bugsnag.com/jobs @bugsnag