Life with Datadog
Takeshi Kondo
January 24, 2021
Technology
July Tech Festa 2021 winter
https://techfesta.connpass.com/event/193966/
Transcript
Life with Datadog
Takeshi Kondo / @chaspy
2021/01/24 July Tech Festa Winter 2021
Who am I
Takeshi Kondo (chaspy / chaspy_)
Lead Software Engineer, Site Reliability at Quipper
Japan Datadog User Group Organizer • https://datadog-jp.connpass.com/
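Contributed a Datadog hands-on article to the February issue of Software Design
• [Feature 2] How to start and sustain system monitoring, Chapter 3: SaaS monitoring in practice with Datadog
https://twitter.com/chaspy_/status/1350270946767081472?s=20
Datadog beginners, please take a look!
I received a T-shirt and socks
https://twitter.com/chaspy_/status/1352577570278043649?s=20
If you want some, you might get them by speaking at a meetup or joining a mob programming session!
Today I will talk about my love for Datadog
https://www.datadoghq.com/about/resources/
Monitoring SaaS
Today's talk
• Audience
  • System operators
  • People interested in monitoring SaaS
  • People interested in Datadog
• Goals
  • Learn what is good about Datadog
  • Pick up ideas for using Datadog by learning about its various advanced features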
Agenda
• What I like about Datadog
  • Support full of love for the product
  • A rich and cute GUI
  • SLO Summary & Alert
• Life with Datadog
  • Monitor Summary
  • Weekly Summary
  • Combining with Alfred
• Life with Datadog: advanced edition
  • Living with Datadog cost
  • New services and Datadog
  • Using custom metrics
What I like about Datadog: the support is great
• Responses are fast (almost always within 24 hours)
• Answers are accurate
• Even when something cannot be done, they suggest a workaround
• You can see their commitment to making the product better
How to ask support for help
• I usually file a ticket (in English)
• Help -> resource -> Tickets & Email Support
• If English really is not an option, you can apparently ask for an answer in Japanese and the Japan support team will handle it during business hours
• Apparently you can start a live chat by typing @support-datadog in the Event Stream (I have never tried it)
Examples of past questions
• Terraform apply of an SLO is rejected with Invalid Query
• Asked whether Monitor Summary can be retrieved as metrics
• How to collect EKS Control Plane metrics
• How to send an alert when an SLO is violated
• Asked whether the number of API keys can be increased
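Asking whether Monitor Summary can be retrieved as metrics
• Do you know Monitor Summary?
• It is the email that arrives once a week
• https://app.datadoghq.com/report/monitor
• There is actually no link to it in the GUI (you can only get there from the email)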
Monitor Summary
Monitor Summary: I want to filter down to the alerts for a given channel only
Raw data (CSV) can apparently be fetched via the API
• The CSV can be downloaded
• I wrote code that sends custom metrics from this CSV
https://app.datadoghq.com/report/hourly_data/monitor
https://docs.datadoghq.com/monitors/faq/how-can-i-export-alert-history/?tab=us
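The CSV-to-custom-metrics step could look roughly like the sketch below. It is not the speaker's code. The column name `channel`, the metric name `monitor_summary.alerts`, and the sample rows are all hypothetical, and the real export's header may differ. The sketch only builds a payload in the shape accepted by Datadog's v1 metric submission endpoint; it does not send anything.

```python
import csv
import io
import time
from collections import Counter


def alerts_per_channel(csv_text, channel_column="channel"):
    """Count alert rows per notification channel.

    `channel_column` is a hypothetical header name; check the real export.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[channel_column] for row in reader)


def to_series(counts, metric="monitor_summary.alerts", now=None):
    """Build a payload shaped like the body of a v1 series submission."""
    timestamp = int(now if now is not None else time.time())
    return {
        "series": [
            {
                "metric": metric,  # hypothetical metric name
                "type": "gauge",
                "points": [[timestamp, count]],
                "tags": [f"channel:{channel}"],
            }
            for channel, count in sorted(counts.items())
        ]
    }


# Hypothetical sample data standing in for the downloaded CSV.
sample = (
    "monitor_name,channel\n"
    "cpu high,slack-sre\n"
    "disk full,slack-sre\n"
    "5xx rate,slack-web\n"
)
payload = to_series(alerts_per_channel(sample), now=1611446400)
print(payload)
```

Tagging each point with the channel is what makes the "filter by channel" wish from the previous slide possible on a dashboard.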
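How to send an alert when an SLO is violated
• At the time, the SLO Error Budget Alert feature did not exist yet
• A Product Manager joined the ticket: the feature was in active development, and they asked whether they could do a user interview
• I answered several questions on the support ticket
• Later, SLO Alert was released in beta!
My engagement goes up
A rich and cute GUI
• Many ways to visualize
• Changes are reflected immediately
• Many functions
• Copy and paste is easy
Let's actually play with a graph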
SLO Summary & Alert
Datadog's SLO feature is an indispensable partner
https://quipper.hatenablog.com/entry/2020/01/30/slo-review
Managed with Terraform, and a generator makes SLOs easy to create when a new service is set up
Let's take a look
We look at Monitor Summary every morning at the Daily Standup
Monitor Summary
• >muted:false tag:(("severity:alert" OR "severity:emergency") AND "team:sre")
• Every alert has a severity tag and a team tag
• At the Daily Standup we check only the alerts our own team is responsible for
• If something is wrong, that day's person on duty deals with it
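The filter on this slide can also be composed programmatically. This is a minimal sketch with hypothetical function names. It assumes the leading `>` on the slide is a prompt marker rather than part of the query, and that the query goes into the `q` parameter of the monitors/manage URL, as in the Alfred web searches later in the deck.

```python
from urllib.parse import quote


def monitor_query(team, severities, muted=False):
    """Compose a monitor search string in the shape shown on the slide."""
    severity_clause = " OR ".join(f'"severity:{s}"' for s in severities)
    muted_flag = "true" if muted else "false"
    return f'muted:{muted_flag} tag:(({severity_clause}) AND "team:{team}")'


def manage_url(query):
    """Percent-encode the query into the monitors/manage search URL."""
    return "https://app.datadoghq.com/monitors/manage?q=" + quote(query, safe="")


query = monitor_query("sre", ["alert", "emergency"])
print(query)
print(manage_url(query))
```

Because every alert carries both tags, switching the `team` argument is enough for another team to get its own standup view.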
Weekly Summary
• You receive an email like this
• [Monitor Report] You received xxx alerts
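Weekly Summary
• It can actually be viewed here
• https://app.datadoghq.com/report/monitor
• Clicking lets you narrow the view by week, time of day, alert channel, or monitor
• We use this to track how the number of alerts itself changes over time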
Combining with Alfred
• So that dashboards, monitors, and SLOs can be opened quickly
• I have registered these as Web Searches
  • ddd https://app.datadoghq.com/dashboard/lists?q={query}
  • ddm https://app.datadoghq.com/monitors/manage?q={query}
  • ddslo https://app.datadoghq.com/slo?query={query}
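To illustrate what these web searches do, the sketch below emulates the expansion of a keyword plus search term into a URL. It is not Alfred's implementation. The `expand` helper is hypothetical, and only the three keywords and URL templates come from the slide.

```python
from urllib.parse import quote

# Keyword -> URL template, as registered on the slide.
WEB_SEARCHES = {
    "ddd": "https://app.datadoghq.com/dashboard/lists?q={query}",
    "ddm": "https://app.datadoghq.com/monitors/manage?q={query}",
    "ddslo": "https://app.datadoghq.com/slo?query={query}",
}


def expand(command):
    """Turn 'keyword search terms' into the URL that would be opened."""
    keyword, _, term = command.partition(" ")
    template = WEB_SEARCHES[keyword]
    return template.replace("{query}", quote(term, safe=""))


print(expand("ddslo api latency"))
```

Typing `ddm team:sre` in the launcher would then open the monitor list already filtered, without first navigating through the Datadog UI.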
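How Datadog cost is calculated
• The 99th percentile of the host count over the month (not the max)
• Reducing the number of hosts reduces Datadog cost
• The only way is to reduce the number of hosts (there is no royal road)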
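The difference between "99th percentile" and "max" can be seen in a small sketch. It assumes hourly host-count samples and a simple "discard the top 1% of hours" rule. The function name is hypothetical, and the exact billing computation is Datadog's and may differ in detail.

```python
def billable_hosts(hourly_host_counts):
    """Return the largest host count left after discarding the top 1% of hours."""
    if not hourly_host_counts:
        raise ValueError("need at least one hourly sample")
    ordered = sorted(hourly_host_counts)
    # ceil(0.99 * n) - 1, written with integer arithmetic to avoid float rounding.
    index = -(-99 * len(ordered) // 100) - 1
    return ordered[index]


# 99 hours at 10 hosts and a single one-hour spike to 500 hosts.
hours = [10] * 99 + [500]
print(billable_hosts(hours))  # the spike falls in the discarded top 1%
print(max(hours))             # a max-based rule would bill for the spike
```

A short spike is therefore forgiven, but a sustained increase in hosts is not, which is why the slide concludes that reducing hosts is the only lever.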
New services and Datadog
• We have a "Production Readiness Checklist" for building new services
https://quipper.hatenablog.com/entry/2020/01/30/production-readiness-with-all
Production Readiness Checklist - Monitoring / Logging
Template Dashboard
Let's look at the actual Template Dashboard
Production Readiness Checklist - SLI/SLO
• Just create it with the generator and add it to the dashboard!
• What the SLO itself should be is decided in the Design Doc phase
Examples of using custom metrics
• Measure the time spent in CI
  • Docker build time, the time taken by each test, and so on
• Measure and visualize CircleCI cost
• Visualize the number of PRs opened by Dependabot
• To guard against access spikes, send load information as custom metrics and use it from Kubernetes HPA
I talked about this at Kubernetes Meetup Tokyo #38
https://twitter.com/chaspy_/status/1352194206962388995?s=20
https://speakerdeck.com/chaspy/v2beta2-and-examples-of-using-hpa-external-metrics-with-datadog
https://quipper.hatenablog.com/entry/2020/11/30/scheduled-scaling-with-hpa
Benefits of sending the information around you as custom metrics
• You can visualize the current state
• You can filter or exclude by arbitrary tags
• You can fire alerts from the metric
A design pattern for sending custom metrics
• In a Kubernetes environment, the Autodiscovery feature is handy
https://docs.datadoghq.com/agent/kubernetes/integrations/?tab=kubernetes
[Diagram: an external system with an API; a custom metric generator (Kubernetes Deployment) that fetches data via the API and exports metrics over HTTP in Prometheus format; Datadog-agent]
A design pattern for sending custom metrics (Autodiscovery annotations)
    annotations:
      ad.datadoghq.com/timed-exam-schedule-exporter.check_names: |
        ["prometheus"]
      ad.datadoghq.com/timed-exam-schedule-exporter.init_configs: |
        [{}]
      ad.datadoghq.com/timed-exam-schedule-exporter.instances: |
        [
          {
            "prometheus_url": "http://%%host%%:8080/metrics",
            "namespace": "namespace",
            "metrics": ["*"]
          }
        ]
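The generator side of this pattern can be sketched as a tiny HTTP server that exposes metrics in Prometheus text format on port 8080 at /metrics, matching the `prometheus_url` in the annotations. This is an illustrative sketch using only the Python standard library, not the speaker's exporter. The metric name, label, and `fetch_samples` stand-in are hypothetical.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(samples):
    """Render (name, labels, value) tuples as Prometheus text exposition lines."""
    lines = []
    for name, labels, value in samples:
        label_text = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        suffix = "{" + label_text + "}" if label_text else ""
        lines.append(f"{name}{suffix} {value}")
    return "\n".join(lines) + "\n"


def fetch_samples():
    """Stand-in for the call to the external system's API (hypothetical data)."""
    return [("open_pull_requests", {"repo": "example"}, 4)]


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(fetch_samples()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Serve /metrics until interrupted; call this from the container entrypoint."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()


print(render_metrics(fetch_samples()), end="")
```

Calling `serve()` starts the endpoint. The Datadog agent then discovers the pod through the annotations and scrapes it, so the generator itself needs no Datadog credentials.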
A design pattern for sending custom metrics (example data sources)
• Number of GitHub PRs and issues
• AWS security events (GuardDuty / ECR Image Scan)
• Number of EC2 hosts per OS
• Number of RDS clusters per version
Summary
• If there is something you do not understand in Datadog...
  • Ask Support right away! They are fast and courteous!
  • Asking in the Japan Datadog User Group Slack is OK too!
• Use Monitor Summary and Weekly Report to reduce the number of monitors themselves
• Custom metrics are the real appeal of Datadog, so use them more and more!
More
• If you want to see the actual graphs, come when there is no recording!
• The Zoom breakout room after this talk is fine too
• If you contact me individually on Twitter, I am happy to talk about Datadog
Promotion
Thank you!
Takeshi Kondo (chaspy / chaspy_)
Lead Software Engineer, Site Reliability at Quipper