What Is Web Service Quality? Three Answers Learned from Alert Hell, Monitoring Failures, and Service Level Objective Design
by takuya542
Slide 1
Slide 1 text
What Is Web Service Quality? Three Answers Learned from Alert Hell, Monitoring Failures, and Service Level Objective Design. Takuya Onda / eureka, Inc. 2018-09-07 Builderscon Tokyo. Copyright © 2009-2018 eureka, inc. All rights reserved.
Slide 2
Slide 2 text
Introduction ■ Takuya Onda – eureka, Inc. – SRE team Head
Slide 3
Slide 3 text
No content
Slide 4
Slide 4 text
About Us - IAC/Match Group
Slide 5
Slide 5 text
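Agenda ■ Technology trends around web application monitoring ■ The purpose of monitoring and the challenges we face today ■ Failures at eureka and how we recovered ■ Overhauling monitoring in stages, with real examples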
Slide 6
Slide 6 text
Technology trends around web application monitoring
Slide 7
Slide 7 text
The rise of the public cloud ■ Powerful machine resources can be procured faster and more easily ■ Architectures premised on disposable servers
Slide 8
Slide 8 text
Richer monitoring tools ■ SaaS server monitoring services ■ A wide range of integrations
Slide 9
Slide 9 text
DevOps / SRE ■ Demand for both high development productivity and stable operations ■ Culture, Automation, Lean, Measurement, Sharing
Slide 10
Slide 10 text
Growing system complexity ■ Microservices, SPAs, and increasingly diverse devices ■ A need to monitor non-functional requirements
Slide 11
Slide 11 text
Has this ever happened to you?
Slide 12
Slide 12 text
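Alerts: sound familiar? ■ "Status [...] is increasing" – This fires at the same time every day, doesn't it... – Manager: "Is this alert OK?" ■ "DB connections have exceeded xxx" – Uh, what am I supposed to do? It's peak time right now. – A spike? Go into maintenance? Just wait and see for now...? ■ "Error XXX has occurred" – Long-time member: "You can ignore this one." – Recent hire: "...!?!?"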
Slide 13
Slide 13 text
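Alerts: sound familiar? ■ We can't tell whether it is really an anomaly ■ Alerts don't lead to action ■ We can't tell whether we are heading in the right direction (what is quality?)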
Slide 14
Slide 14 text
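Why do we monitor a system? ■ We want to deliver the performance that satisfies users ■ To do that, we want to detect system anomalies immediately ■ We want to respond to anomalies right away and minimize how long they last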
Slide 15
Slide 15 text
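Why do we monitor a system? ■ We want to deliver the performance that satisfies users ■ To do that, we want to detect system anomalies immediately ■ We want to respond to anomalies right away and minimize how long they last. Meeting the quality users expect is where the value lies; monitoring is a means to that end.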
Slide 16
Slide 16 text
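"Accidents are normal" (from the SRE Workbook) ■ Acting early matters for keeping the cost of failure down ■ The shorter the MTTR (mean time to recovery), the lighter the burden on developers ■ The later a problem is found, the harder it is to repair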
Slide 17
Slide 17 text
Examples of monitoring failures at eureka
Slide 18
Slide 18 text
Monitoring failures at eureka ■ Alerts don't lead to action ■ We can't judge whether something must be handled right now
Slide 19
Slide 19 text
Alerts don't lead to action ■ 1: It is working, more or less ■ 2: Always abnormal ■ 3: Nothing we can do
Slide 20
Slide 20 text
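Alerts don't lead to action ■ 1: It is working, more or less ■ 2: Always abnormal ■ 3: Nothing we can do • A DB whose load average gets tight at peak time • Dynamo capacity that is almost used up • A mail delivery server whose CPU is pegged every hour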
Slide 21
Slide 21 text
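Alerts don't lead to action ■ 1: It is working, more or less ■ 2: Always abnormal ■ 3: Nothing we can do • A job queue that clogs every day • Application error logs that flow constantly • Errors that occur on every deploy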
Slide 22
Slide 22 text
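Alerts don't lead to action ■ 1: It is working, more or less ■ 2: Always abnormal ■ 3: Nothing we can do • 5xx errors are increasing, but what is this? (wait and see) • Saturation in storage systems • SPOF servers whose configuration cannot be reproduced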
Slide 23
Slide 23 text
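We can't judge whether to act now ■ So does this actually affect users? ■ How large is the impact? ■ Can we even quantify it? ■ Is it a problem if we don't fix this permanently? ■ Does it take priority over business initiatives?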
Slide 24
Slide 24 text
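It also affects the mood inside the company: "They tell us to watch the alerts, but is there any point?" "Is our system really OK??" "Nobody cares anyway, so whatever." "Recurrence prevention never changes anything."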
Slide 25
Slide 25 text
How we tackled it
Slide 26
Slide 26 text
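Policy ■ 1: Set quantitative targets ■ 2: Alert = immediate action ■ 3: Keep the whole setup simple to manage ■ 4: Adopt an architecture that is easy to monitor ■ 5: Distinguish anomaly detection from performance targets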
Slide 27
Slide 27 text
1: Set quantitative targets ■ Define service level indicators (SLIs) ■ Set service level objectives (SLOs) ■ Use the SLO as the basis for alert thresholds and task priority
Slide 28
Slide 28 text
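1: Set quantitative targets ■ Define service level indicators (SLIs) ■ Set service level objectives (SLOs) ■ Use the SLO as the basis for alert thresholds and task priority • SLI = successful requests / total requests • SLO = SLI > 99.95% (window: 1 week)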
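The SLI and SLO definitions on this slide can be sketched in a few lines of Python. This is a minimal illustration, not eureka's implementation: the `RequestCounts` type, the function names, and the sample numbers are all assumptions, and 99.95 is read as a percentage.

```python
from dataclasses import dataclass


@dataclass
class RequestCounts:
    """Request totals over one SLO window (one week on the slide)."""
    successful: int
    total: int


def sli_percent(counts: RequestCounts) -> float:
    """SLI = successful requests / total requests, as a percentage."""
    if counts.total == 0:
        # Assumption: a window with no traffic counts as fully available.
        return 100.0
    return 100.0 * counts.successful / counts.total


def meets_slo(counts: RequestCounts, target_percent: float = 99.95) -> bool:
    """The slide states the SLO as SLI > 99.95 over the window."""
    return sli_percent(counts) > target_percent


# Hypothetical weekly totals, for illustration only.
week = RequestCounts(successful=9_996_000, total=10_000_000)
print(f"SLI = {sli_percent(week):.2f}%, SLO met: {meets_slo(week)}")
```

Because the same number drives both alert thresholds and task priority, keeping the calculation in one small function makes it easier to reuse in both places.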
Slide 29
Slide 29 text
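2: Alert = immediate action ■ Fire alerts only for things that truly require action ■ Move toward an architecture that allows immediate response • Things that can cause the SLO to be missed • Things that take time to handle (storage systems, for example)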
Slide 30
Slide 30 text
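2: An alert means something that requires immediate action ■ Fire alerts only for things that truly require action ■ Move toward an architecture that allows immediate response • Server resources can be added or replaced immediately • LB / API / Batch / DB / Cache / etc.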
Slide 31
Slide 31 text
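3: Keep the whole setup simple to manage ■ Standardize on common tools ■ Centralize quality definitions and thresholds • Narrow down the tools used for monitoring • Manage monitoring configuration as code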
Slide 32
Slide 32 text
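3: Keep the whole setup simple to manage ■ Standardize on common tools ■ Centralize quality definitions and thresholds • Use the same SLIs / SLOs everywhere • Threshold / Rate / Change / Anomaly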
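One way to picture "monitoring configuration as code" with a single home for thresholds is a small declarative module like the sketch below. Everything here is hypothetical: the `Monitor` schema, the metric names, and the limits are invented for illustration. The four condition kinds simply mirror the labels on the slide, and this is not the format of any particular monitoring product.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Condition(Enum):
    """The four condition kinds named on the slide."""
    THRESHOLD = "threshold"
    RATE = "rate"
    CHANGE = "change"
    ANOMALY = "anomaly"


@dataclass(frozen=True)
class Monitor:
    name: str
    metric: str            # hypothetical metric name
    condition: Condition
    limit: float           # meaning depends on the condition kind
    window_minutes: int


# Single source of truth for every alert definition (illustrative values).
MONITORS: List[Monitor] = [
    Monitor("api-error-rate", "api.requests.failed_ratio", Condition.RATE, 0.0005, 1),
    Monitor("db-connections", "db.connections", Condition.THRESHOLD, 500, 1),
    Monitor("api-latency-change", "api.latency.p95", Condition.CHANGE, 0.5, 60),
]


def validate(monitors: List[Monitor]) -> int:
    """Reject duplicate names and non-positive windows; return the monitor count."""
    names = [m.name for m in monitors]
    duplicates = sorted({n for n in names if names.count(n) > 1})
    if duplicates:
        raise ValueError(f"duplicate monitor names: {duplicates}")
    for m in monitors:
        if m.window_minutes <= 0:
            raise ValueError(f"{m.name}: window must be positive")
    return len(monitors)


print(validate(MONITORS), "monitors defined")
```

Keeping definitions in a reviewed file means a threshold change goes through the same review as any other code change.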
Slide 33
Slide 33 text
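4: Adopt an architecture that is easy to monitor ■ Servers are easy to replace ■ Stop building everything in-house ■ Don't let operations (monitoring cost) grow in proportion to the business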
Slide 34
Slide 34 text
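4-1: Easy monitoring ~ servers are easy to replace ■ Discard an abnormal server right away – Images built from code are always ready to be put into service (diagram labels: Scheduling, Rotate, API, Worker)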
Slide 35
Slide 35 text
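4-2: Easy monitoring ~ stop building everything in-house ■ Stand on the shoulders of a giant (AWS) and reduce what has to be monitored – Scaling out / up is easy & no failover management is needed (diagram labels: S3, Aurora, Dynamo, ElastiCache, SQS)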
Slide 36
Slide 36 text
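4-3: Easy monitoring ~ don't let monitoring cost grow in proportion to the business ■ The same provisioning process and technology stack – Don't let the technical configuration drift (diagram labels: Pairs JP, Pairs GL, Capacity = LL, Capacity = M)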
Slide 37
Slide 37 text
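5: Distinguish anomaly detection from performance targets ■ Use separate observation durations for anomaly detection and for target attainment ■ Distinguish "the service cannot be used" from "the service is hard to use" • Anomaly: detected by alerts (within 1 min) • Target: periodic review (within 1 week)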
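The split between a one-minute alerting window and a one-week review window can be sketched as two functions over the same per-minute samples. This is an illustrative sketch only: the `Sample` shape, the function names, and the 5% paging threshold are assumptions, not values from the talk.

```python
from typing import List, Tuple

# (successful requests, total requests) observed during one minute
Sample = Tuple[int, int]


def failure_ratio(sample: Sample) -> float:
    """Fraction of failed requests in one sample; 0.0 when there was no traffic."""
    ok, total = sample
    return 0.0 if total == 0 else 1.0 - ok / total


def should_page(latest_minute: Sample, max_failure_ratio: float = 0.05) -> bool:
    """Fast path: evaluated every minute, pages a human when breached.

    The 5% threshold is a hypothetical value chosen for this example.
    """
    return failure_ratio(latest_minute) > max_failure_ratio


def weekly_sli_percent(minutes: List[Sample]) -> float:
    """Slow path: aggregated once per week for the performance review."""
    ok = sum(s[0] for s in minutes)
    total = sum(s[1] for s in minutes)
    return 100.0 if total == 0 else 100.0 * ok / total


print(should_page((90, 100)))                        # one bad minute
print(weekly_sli_percent([(99, 100), (100, 100)]))   # aggregate over the window
```

The fast path answers "is the service unusable right now?", while the weekly aggregate answers "did we meet the target?", so a brief blip can page someone without necessarily breaking the weekly objective.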
Slide 38
Slide 38 text
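5: Distinguish anomaly detection from performance targets ■ Use separate observation durations for anomaly detection and for target attainment ■ Distinguish "the service cannot be used" from "the service is hard to use" • Cannot be used: users cannot log in, search, and so on • Hard to use: the service feels slow / heavy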
Slide 39
Slide 39 text
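Policy (recap) ■ 1: Set quantitative targets ■ 2: Alert = immediate action ■ 3: Keep the whole setup simple to manage ■ 4: Adopt an architecture that is easy to monitor ■ 5: Distinguish anomaly detection from performance targets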
Slide 40
Slide 40 text
We overhauled our monitoring and our architecture
Slide 41
Slide 41 text
Overhauling monitoring
Slide 42
Slide 42 text
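Evolving monitoring and response in stages ■ Anomalies can be observed (seen) ■ Anomalies can be detected ■ Anomalies can be handled ■ Anomalies are repaired automatically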
Slide 43
Slide 43 text
Anomalies can be observed (seen) ■ Datadog ■ StackDriver Logging
Slide 44
Slide 44 text
Host-based monitoring & integration with Datadog (diagram labels: Metrics Aggregate, AWS Integration)
Slide 45
Slide 45 text
Log visualization and report generation with StackDriver Logging (diagram labels: Metrics / Log Monitoring, Alert, Slack, call you, Log Aggregation, Sync to DWH, Generate Performance Report, Hosting Report, Dev & SRE)
Slide 46
Slide 46 text
Log visualization and report generation with StackDriver Logging (diagram labels: Metrics / Log Monitoring, Alert, Slack, call you, Log Aggregation, Sync to DWH, Generate Performance Report, Hosting Report, Dev & SRE) • Alerts & immediate response • Window: 1 minute
Slide 47
Slide 47 text
Log visualization and report generation with StackDriver Logging (diagram labels: Metrics / Log Monitoring, Alert, Slack, call you, Log Aggregation, Sync to DWH, Generate Performance Report, Hosting Report, Dev & SRE) • Performance retrospective • Window: 1 week
Slide 48
Slide 48 text
Log visualization and report generation with StackDriver Logging (diagram labels: Metrics / Log Monitoring, Alert, Slack, call you, Log Aggregation, Sync to DWH, Generate Performance Report, Hosting Report, Dev & SRE) • Performance retrospective • Window: 1 week. Report excerpt (partial): latency histogram
Slide 49
Slide 49 text
Log visualization and report generation with StackDriver Logging (diagram labels: Metrics / Log Monitoring, Alert, Slack, call you, Log Aggregation, Sync to DWH, Generate Performance Report, Hosting Report, Dev & SRE) • Performance retrospective • Window: 1 week. Report excerpt (partial): request volume trend per REST endpoint
Slide 50
Slide 50 text
Anomalies can be detected ■ Alerts from external monitoring (service monitoring) ■ Alerts from resource monitoring ■ Alerts from performance ■ Alerts from log monitoring
Slide 51
Slide 51 text
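Anomalies can be detected ■ Alerts from external monitoring (service monitoring) ■ Alerts from resource monitoring ■ Alerts from performance ■ Alerts from log monitoring • Requires an immediate response • Cover SSL as part of the same set (it takes time to fix)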
Slide 52
Slide 52 text
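Anomalies can be detected ■ Alerts from external monitoring (service monitoring) ■ Alerts from resource monitoring ■ Alerts from performance ■ Alerts from log monitoring • Don't watch the stateless layer (just discard it) • Do watch storage systems (they take time to handle)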
Slide 53
Slide 53 text
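Anomalies can be detected ■ Alerts from external monitoring (service monitoring) ■ Alerts from resource monitoring ■ Alerts from performance ■ Alerts from log monitoring • Latency: watch the change against the previous week or the previous hour • Request failures: look at SLO x Status Code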
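Alerting on latency as a change against the previous week, rather than against a fixed number, can be sketched as below. The function names and the 30% tolerance are assumptions made for this example; the talk does not state a concrete threshold.

```python
def relative_change(current: float, baseline: float) -> float:
    """Relative change of `current` against `baseline` (0.5 means +50%)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (current - baseline) / baseline


def latency_regressed(current_ms: float, last_week_ms: float,
                      max_increase: float = 0.3) -> bool:
    """True when latency grew by more than `max_increase` week over week.

    The 30% default is a hypothetical tolerance, not a value from the talk.
    """
    return relative_change(current_ms, last_week_ms) > max_increase


# Same hour last week: 200 ms. Now: 300 ms, a +50% change.
print(latency_regressed(current_ms=300.0, last_week_ms=200.0))
```

Comparing against the same period one week earlier absorbs normal daily and weekly traffic patterns, which a fixed latency threshold cannot do.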
Slide 54
Slide 54 text
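Anomalies can be detected ■ Alerts from external monitoring (service monitoring) ■ Alerts from resource monitoring ■ Alerts from performance ■ Alerts from log monitoring • Watch the change against the previous week or the previous hour • For logs that flow constantly, anomaly detection is a good fit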
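For logs that are always flowing, a fixed threshold is hard to choose, which is why the slide suggests anomaly detection. The sketch below uses a plain z-score over recent per-interval log counts as a stand-in. It is not the algorithm of any monitoring product, and the function name and the limit of 3 standard deviations are assumptions.

```python
from statistics import mean, pstdev
from typing import Sequence


def is_anomalous(history: Sequence[float], current: float,
                 z_limit: float = 3.0) -> bool:
    """Flag `current` when it is more than `z_limit` standard deviations
    from the mean of recent history (a simple z-score check)."""
    if len(history) < 2:
        # Not enough data to estimate normal behaviour.
        return False
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # Perfectly flat history: any deviation at all is unusual.
        return current != mu
    return abs(current - mu) / sigma > z_limit


# Hypothetical error-log counts per interval.
recent_counts = [100, 102, 98, 101, 99]
print(is_anomalous(recent_counts, 130))
print(is_anomalous(recent_counts, 101))
```

A check like this only fires when the volume departs from its own recent norm, so a log stream that is noisy but stable stays quiet.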
Slide 55
Slide 55 text
Anomalies can be handled ■ Make adding resources, replacing them, and rolling back easy ■ Lower the cost of investigation when a failure occurs
Slide 56
Slide 56 text
Make adding resources, replacing them, and rolling back easy (diagram labels: Scale Out / Discard, Scale Up, Add Shard, Scale Out, Scale Up, Vertical Split)
Slide 57
Slide 57 text
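Make adding resources, replacing them, and rolling back easy (diagram labels: Scale Out / Discard, Scale Up, Add Shard, Scale Out, Scale Up, Vertical Split) • Resources can be added easily and with no lead time • Make it easy to detach components that are misbehaving • Storage systems are the crux; rehearse in advance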
Slide 58
Slide 58 text
Anomalies are repaired automatically ■ Auto-healing ■ Injecting chaos
Slide 59
Slide 59 text
Anomalies are repaired automatically ■ Auto-healing ■ Injecting chaos • Detaching abnormal hosts and failing over • Storage systems are the crux • Regular evacuation drills
Slide 60
Slide 60 text
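Anomalies are repaired automatically ■ Auto-healing ■ Injecting chaos • Intentional fault injection • Forced testing of auto-healing, made part of the routine • The battle has only just begun!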
Slide 61
Slide 61 text
Summary
Slide 62
Slide 62 text
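Today's talk ■ Technology trends around web application monitoring ■ The purpose of monitoring and the challenges we face today ■ Failures at eureka and how we recovered ■ Evolving monitoring in stages, with real examples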
Slide 63
Slide 63 text
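Summary ■ The purpose of monitoring is to minimize MTTR and meet quality requirements ■ An alert is meaningless if nobody can act on it ■ Set targets for service quality ■ Aim for an architecture that needs as little monitoring as possible ■ Alerts + periodic quality checks, done jointly by Dev and Ops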
Slide 64
Slide 64 text
Summary ■ Systems keep changing and growing more complex every day ■ There is so much we want to do! ■ The eureka SRE team is hiring!
Slide 65
Slide 65 text
Thank you :)
Slide 66
Slide 66 text
Any Questions??