Takeshi Kondo
January 24, 2021
Technology
Life with Datadog
July Tech Festa 2021 winter
https://techfesta.connpass.com/event/193966/
Transcript
Life with Datadog • Takeshi Kondo / @chaspy • 2021/01/24 • July Tech Festa Winter 2021
Who am I • Takeshi Kondo (chaspy / chaspy_) • Lead Software Engineer, Site Reliability at Quipper
Japan Datadog User Group Organizer • https://datadog-jp.connpass.com/
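Contributed a Datadog hands-on article to the February issue of Software Design • [Feature 2] How to Start and Sustain System Monitoring, Chapter 3: SaaS Monitoring in Practice with Datadog https://twitter.com/chaspy_/status/1350270946767081472?s=20
Datadog beginners, please take a look!
I received a T-shirt and socks https://twitter.com/chaspy_/status/1352577570278043649?s=20 If you want some, you might get them by speaking at a meetup or joining a mob programming session!
Today I will talk about my love for Datadog https://www.datadoghq.com/about/resources/ Monitoring SaaS
Today's talk • Audience • System operators • People interested in monitoring SaaS • People interested in Datadog • Goals • Learn what is good about Datadog • Pick up ideas for using Datadog by learning about its more advanced features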
Agenda • What I like about Datadog • Support that overflows with love for the product • A rich and cute GUI • SLO Summary & Alert • Life with Datadog • Monitor Summary • Weekly Summary • Combining it with Alfred • Life with Datadog: advanced topics • Living with Datadog cost • New services and Datadog • Making use of custom metrics
What I like about Datadog: the support is amazing • Responses are fast (a reply almost always comes back within 24h) • Answers are accurate • Even when something cannot be done, they suggest a workaround • You can see that they want to make the product better
How to ask support for help • I normally open a ticket (in English) • Help -> resource -> Tickets & Email Support • If English is really not an option, asking for an answer in Japanese apparently gets the ticket handed to Japanese support during business hours • Typing @support-datadog in the Event Stream apparently starts a live chat (I have never used it)
Examples of past questions • Terraform apply for an SLO is rejected with Invalid Query • Asked whether Monitor Summary can be collected as metrics • How to collect EKS Control Plane metrics • How to send an alert when an SLO is violated • Asked whether the number of API keys can be increased
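Asking whether Monitor Summary can be collected as metrics • Do you know about Monitor Summary? • It is the report that arrives by email once a week • https://app.datadoghq.com/report/monitor • There is actually no link to it in the GUI (you can only get there from the email)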
Monitor Summary
Monitor Summary • I want to filter down to the alerts of one chosen channel only
The raw data (CSV) can apparently be fetched via the API • The CSV can be downloaded • I wrote code that sends custom metrics from this CSV https://app.datadoghq.com/report/hourly_data/monitor https://docs.datadoghq.com/monitors/faq/how-can-i-export-alert-history/?tab=us
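The CSV-to-custom-metrics step described above might look roughly like the sketch below. It is not the author's script: the column names (`hour`, `monitor_name`, `notification_channel`, `alert_count`) and the metric name `monitor_summary.alerts` are hypothetical, and the real export has its own columns. The payload is shaped like a metric series submission (metric, type, points, tags) and the actual HTTP call is left out.

```python
import csv
import io
from collections import Counter


def alerts_per_channel(csv_text):
    """Sum alert counts per notification channel from CSV text."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Column names are hypothetical; adjust them to the real export.
        counts[row["notification_channel"]] += int(row["alert_count"])
    return counts


def to_series(counts, timestamp, metric="monitor_summary.alerts"):
    """Build one gauge series per channel, tagged with the channel name."""
    return {
        "series": [
            {
                "metric": metric,
                "type": "gauge",
                "points": [[timestamp, count]],
                "tags": [f"channel:{channel}"],
            }
            for channel, count in sorted(counts.items())
        ]
    }


# Made-up sample rows, only to exercise the functions.
SAMPLE = """hour,monitor_name,notification_channel,alert_count
2021-01-18T00:00,api latency,slack-sre,2
2021-01-18T01:00,api latency,slack-sre,1
2021-01-18T01:00,db cpu,slack-dev,4
"""

payload = to_series(alerts_per_channel(SAMPLE), timestamp=1610928000)
```

Tagging each series with the channel is what makes the "filter to one channel" wish from the previous slide possible on a dashboard.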
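How to send an alert when an SLO is violated • At the time, the SLO Error Budget Alert feature did not exist yet • A Product Manager joined the ticket, said the feature was under active development, and asked whether they could interview me as a user • I answered several questions on the support ticket • Later the SLO Alert beta was released!
Engagement goes up
A rich and cute GUI
A rich and cute GUI • Many ways to visualize data • Changes show up immediately • Many functions • Easy to copy and paste
Let's actually play with a graph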
SLO Summary & Alert
Datadog's SLO feature is an indispensable partner https://quipper.hatenablog.com/entry/2020/01/30/slo-review
SLOs are managed with Terraform, and a generator makes them easy to create when a new service is set up • Let's take a look
We look at Monitor Summary every morning at the Daily Standup
Monitor Summary • >muted:false tag:(("severity:alert" OR "severity:emergency") AND "team:sre") • Every alert has a severity tag and a team tag • At the Daily Standup we check only the alerts our own team is responsible for • If something is wrong, that day's person on duty deals with it
Weekly Summary • An email like this arrives • [Monitor Report] You received xxx alerts
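Weekly Summary • You can actually view it here • https://app.datadoghq.com/report/monitor • Clicking lets you narrow down by week, time of day, alert channel, and monitor • We use this to track how the number of alerts itself changes over time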
Combining it with Alfred • So that dashboards, monitors, and SLOs can be opened quickly • Registered as Web Searches • ddd https://app.datadoghq.com/dashboard/lists?q={query} • ddm https://app.datadoghq.com/monitors/manage?q={query} • ddslo https://app.datadoghq.com/slo?query={query}
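The Alfred Web Search entries (ddd, ddm, ddslo) are URL templates with a `{query}` placeholder. The sketch below shows how such a template expands into a URL. The three templates come from the slide. The `expand` helper is hypothetical, and the percent-encoding rules are an assumption, since Alfred may encode some characters differently.

```python
from urllib.parse import quote

# Keyword -> URL template, as registered in the talk.
WEB_SEARCHES = {
    "ddd": "https://app.datadoghq.com/dashboard/lists?q={query}",
    "ddm": "https://app.datadoghq.com/monitors/manage?q={query}",
    "ddslo": "https://app.datadoghq.com/slo?query={query}",
}


def expand(keyword, query):
    """Fill the {query} placeholder with the percent-encoded search text."""
    template = WEB_SEARCHES[keyword]
    return template.replace("{query}", quote(query, safe=""))


monitor_url = expand("ddm", "team:sre")
```

Typing `ddm team:sre` in the launcher would then open the monitor list already filtered to the SRE team's monitors.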
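How Datadog cost is calculated • The 99th percentile of the host count over the month (not the max) • Reducing the number of hosts also reduces Datadog cost • The only option is to reduce the number of hosts (there is no shortcut)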
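To make "99th percentile, not max" concrete, here is a small sketch. The talk gives neither the sampling interval nor the exact percentile method, so hourly samples and the nearest-rank method are assumptions made here for illustration.

```python
def billable_hosts(host_counts):
    """Return the 99th percentile of the samples (nearest-rank method)."""
    if not host_counts:
        raise ValueError("need at least one sample")
    ordered = sorted(host_counts)
    # ceil(0.99 * n) in integer arithmetic, as a 1-based rank.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# 720 hourly samples (30 days): a steady 100 hosts plus a 5-hour spike to 300.
short_spike = [100] * 715 + [300] * 5
# Same month, but the spike lasts 20 hours.
long_spike = [100] * 700 + [300] * 20

short_result = billable_hosts(short_spike)
long_result = billable_hosts(long_spike)
```

Under these assumptions, a spike lasting less than 1% of the month falls outside the percentile and a longer one does not. This is why only a sustained reduction in host count lowers the bill.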
New services and Datadog • When a new service is built, there is a "Production Readiness Checklist" https://quipper.hatenablog.com/entry/2020/01/30/production-readiness-with-all
Production Readiness Checklist - Monitoring / Logging
Template Dashboard
Let's actually look at the Template Dashboard
Production Readiness Checklist - SLI/SLO • Just create them with the generator and add them to the dashboard! • What the SLO itself should be is decided in the Design Doc phase
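Examples of using custom metrics • Measure the time spent in CI • Docker build time, the time of each test, and so on • Measure and visualize CircleCI cost • Visualize the number of PRs opened by Dependabot • To guard against traffic spikes, send load information as custom metrics and use it from the Kubernetes HPA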
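As an illustration of the CI timing example, the sketch below times one step and formats the result as a DogStatsD gauge datagram (`name:value|g|#tags`). The metric name `ci.step.duration_seconds`, the `step` tag, and the helper names are hypothetical. The talk does not say how the author's pipeline submits its metrics, so sending over UDP to a local agent on port 8125 is an assumption.

```python
import socket
import time
from contextlib import contextmanager


def dogstatsd_gauge(name, value, tags):
    """Format a gauge in the DogStatsD datagram format."""
    tag_part = ",".join(f"{key}:{val}" for key, val in sorted(tags.items()))
    suffix = f"|#{tag_part}" if tag_part else ""
    return f"{name}:{value:.3f}|g{suffix}"


def send_udp(datagram, host="127.0.0.1", port=8125):
    """Send one datagram to a local agent (assumed to listen on 8125)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))


@contextmanager
def timed(step, send):
    """Measure the wrapped block and hand the gauge datagram to `send`."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        send(dogstatsd_gauge("ci.step.duration_seconds", elapsed, {"step": step}))


# Collect datagrams in a list here; pass send_udp instead in a real CI job.
sent = []
with timed("docker_build", sent.append):
    time.sleep(0.01)  # stand-in for the real build command
```

Passing the sender as an argument keeps the timing logic testable without a running agent.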
I talked about this at Kubernetes Meetup Tokyo #38 https://twitter.com/chaspy_/status/1352194206962388995?s=20 https://speakerdeck.com/chaspy/v2beta2-and-examples-of-using-hpa-external-metrics-with-datadog https://quipper.hatenablog.com/entry/2020/11/30/scheduled-scaling-with-hpa
Benefits of sending the information around you as custom metrics • You can visualize the current state • You can filter or exclude by any tag • You can fire alerts from the metric
Design pattern for sending custom metrics • In a Kubernetes environment, the Autodiscovery feature is handy https://docs.datadoghq.com/agent/kubernetes/integrations/?tab=kubernetes • Diagram: a custom metric generator (Kubernetes Deployment) fetches data from an external system through its API and exports the metrics over HTTP in Prometheus format for the Datadog-agent
Design pattern for sending custom metrics • Pod annotations that tell the Datadog-agent to scrape the generator:
    annotations:
      ad.datadoghq.com/timed-exam-schedule-exporter.check_names: |
        ["prometheus"]
      ad.datadoghq.com/timed-exam-schedule-exporter.init_configs: |
        [{}]
      ad.datadoghq.com/timed-exam-schedule-exporter.instances: |
        [
          {
            "prometheus_url": "http://%%host%%:8080/metrics",
            "namespace": "namespace",
            "metrics": ["*"]
          }
        ]
Design pattern for sending custom metrics • Number of GitHub PRs and issues • AWS security events (GuardDuty / ECR Image Scan) • Number of EC2 hosts per OS • Number of RDS clusters per version
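The generator side of this design pattern could be sketched as below, using only the Python standard library. This is an illustration under assumptions, not the author's exporter: `fetch_counts` is a hypothetical stand-in for the call to the external system's API, and the metric name `github_pull_requests` and its `state` label are made up. The port 8080 and the `/metrics` path match the `prometheus_url` in the annotations.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def fetch_counts():
    """Hypothetical stand-in for calling the external system's API."""
    return {"open": 12, "closed": 340}


def render_metrics(counts, name="github_pull_requests"):
    """Render the counts in the Prometheus text exposition format."""
    lines = [f"# TYPE {name} gauge"]
    for state, value in sorted(counts.items()):
        lines.append(f'{name}{{state="{state}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; the agent scrapes this path.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(fetch_counts()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Block forever, serving /metrics on the given port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


example_output = render_metrics(fetch_counts())
```

Calling `serve()` in the container entrypoint would expose the endpoint. Because the generator only exports and the agent does the scraping, the generator needs no Datadog credentials of its own.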
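Summary • If there is something you do not understand about Datadog... • Ask Support right away! They are fast and thorough! • Asking in the Japan Datadog User Group Slack is fine too! • Use Monitor Summary and the Weekly Report to cut down the monitors themselves • Custom metrics are the real appeal of Datadog, so use them more and more!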
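More • If you want to see actual graphs, come by when nothing is being recorded! • The Zoom breakout room after this talk works too • If you contact me directly on Twitter, I am happy to talk through your Datadog questions
Promotion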
Thank you! • Takeshi Kondo (chaspy / chaspy_) • Lead Software Engineer, Site Reliability at Quipper