What Is Web Service Quality? Three Answers Learned from Alert Hell, Monitoring Failures, and Service Level Objective Design
by
takuya542
Slide 1
Slide 1 text
Copyright © 2009-2018 eureka, inc. All rights reserved. Takuya Onda / eureka, Inc. 2018-09-07 Builderscon Tokyo. What Is Web Service Quality? Three Answers Learned from Alert Hell, Monitoring Failures, and Service Level Objective Design
Slide 2
Slide 2 text
Introduction ■ Takuya Onda – eureka, Inc. – SRE team Head
Slide 3
Slide 3 text
No content
Slide 4
Slide 4 text
About Us - IAC/Match Group
Slide 5
Slide 5 text
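Agenda ■ Technology trends around web application monitoring ■ The purpose of monitoring and its current challenges ■ Failures at eureka and examples of how we recovered ■ Overhauling monitoring step by step, with real examples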
Slide 6
Slide 6 text
Technology trends around web application monitoring
Slide 7
Slide 7 text
The rise of the public cloud ■ Powerful machine resources can be procured faster and more easily ■ Architectures that assume servers are disposable
Slide 8
Slide 8 text
Richer monitoring tools ■ SaaS server monitoring services ■ Richer integrations
Slide 9
Slide 9 text
DevOps and SRE ■ Demand for both high development productivity and stable operations ■ Culture, Automation, Lean, Measurement, Sharing
Slide 10
Slide 10 text
Growing system complexity ■ Microservices, SPAs, and a wider variety of devices ■ A need to monitor non-functional requirements
Slide 11
Slide 11 text
Have you ever experienced this?
Slide 12
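Slide 12 text
■ "Status [...] responses are increasing" – It always shows up around this time... – Manager: "Is this alert OK?" ■ "The number of DB connections exceeded xxx" – Wait, what should I do? It is peak time right now. – A spike? Go into maintenance? Just watch for now...? ■ "Error XXX occurred" – Long-time member: "You can ignore this one." – Recent hire: "...!?!?" Common alert situations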
Slide 13
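Slide 13 text
Common alert situations ■ We cannot tell whether it is really abnormal ■ Alerts do not lead to action ■ We cannot tell whether we are heading in the right direction (what is quality?)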
Slide 14
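Slide 14 text
The purpose of monitoring a system ■ Meet the performance requirements that satisfy users ■ To do that, detect system anomalies immediately ■ Respond to anomalies at once and minimize how long they last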
Slide 15
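Slide 15 text
The purpose of monitoring a system ■ Meet the performance requirements that satisfy users ■ To do that, detect system anomalies immediately ■ Respond to anomalies at once and minimize how long they last. The value lies in meeting the quality users expect. Monitoring is a means to that end.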
Slide 16
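Slide 16 text
Accidents are normal (from the SRE Workbook) ■ Acting early matters for keeping the cost of failure down ■ The shorter the MTTR (mean time to recovery), the smaller the burden on developers ■ The later a problem is found, the harder it is to fix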
Slide 17
Slide 17 text
Examples of monitoring failures at eureka
Slide 18
Slide 18 text
Monitoring failures at eureka ■ Alerts do not lead to action ■ We cannot judge whether something must be handled now
Slide 19
Slide 19 text
Alerts do not lead to action ■ 1: It is working for now ■ 2: Always abnormal ■ 3: Nothing can be done
Slide 20
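Slide 20 text
Alerts do not lead to action ■ 1: It is working for now ■ 2: Always abnormal ■ 3: Nothing can be done • A DB whose load average gets tight at peak time • Dynamo capacity that is almost used up • A mail delivery server whose CPU is pegged every hour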
Slide 21
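Slide 21 text
Alerts do not lead to action ■ 1: It is working for now ■ 2: Always abnormal ■ 3: Nothing can be done • A job queue that clogs up regularly • Application error logs that flow constantly • Errors that occur on every deploy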
Slide 22
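Slide 22 text
Alerts do not lead to action ■ 1: It is working for now ■ 2: Always abnormal ■ 3: Nothing can be done • 5xx errors are rising, but what is this? (wait and see) • Saturation in storage systems • A SPOF server whose configuration cannot be reproduced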
Slide 23
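Slide 23 text
We cannot judge whether something must be handled now ■ Does this actually affect users? ■ How large is the impact? ■ Can we even quantify it? ■ Is it a problem if we do not fix this permanently? ■ Does it take priority over business initiatives?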
Slide 24
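Slide 24 text
It affects the mood inside the company: "They tell us to watch the alerts, but is there any point?" "Is our system really OK??" "Nobody cares anyway, so whatever." "Recurrence prevention never changes anything."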
Slide 25
Slide 25 text
How we tackled it
Slide 26
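Slide 26 text
Approach ■ 1: Set quantitative targets ■ 2: Make alert = immediate action ■ 3: Keep the whole setup simple to manage ■ 4: Adopt an architecture that is easy to monitor ■ 5: Distinguish anomaly detection from performance targets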
Slide 27
Slide 27 text
1: Set quantitative targets ■ Define service level indicators (SLIs) ■ Set service level objectives (SLOs) ■ Use the SLOs as the basis for alert thresholds and task priority
Slide 28
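Slide 28 text
1: Set quantitative targets ■ Define service level indicators (SLIs) ■ Set service level objectives (SLOs) ■ Use the SLOs as the basis for alert thresholds and task priority • SLI = successful requests / total requests • SLO = SLI > 99.95% (period: 1 week)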
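The SLI and SLO definitions on this slide can be sketched in a few lines. This is a minimal illustration, not eureka's implementation: the `RequestCounts` type, the function names, and the treatment of an empty window are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class RequestCounts:
    """Request counts over one observation window (assumed here to be one week)."""
    succeeded: int
    total: int


def sli_percent(counts: RequestCounts) -> float:
    """SLI = successful requests / total requests, expressed as a percentage."""
    if counts.total == 0:
        # Assumption: a window with no traffic is treated as fully healthy.
        return 100.0
    return 100.0 * counts.succeeded / counts.total


def meets_slo(counts: RequestCounts, target_percent: float = 99.95) -> bool:
    """SLO check: the SLI must be strictly above the target for the window."""
    return sli_percent(counts) > target_percent


# Hypothetical weekly totals, for illustration only.
week = RequestCounts(succeeded=999_700, total=1_000_000)
print(sli_percent(week), meets_slo(week))
```

Because the same number serves as both the alert threshold and the yardstick for prioritizing work, keeping the calculation in one small function makes it easy to reuse in both places.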
Slide 29
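Slide 29 text
2: Make alert = immediate action ■ Fire alerts only for things that truly need action ■ Move toward an architecture where an immediate response is possible • Things that cause an SLO miss • Things that take time to handle (storage systems, for example)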
Slide 30
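Slide 30 text
2: Make alert = something that needs immediate action ■ Fire alerts only for things that truly need action ■ Move toward an architecture where an immediate response is possible • Server resources can be added or replaced immediately • LB / API / Batch / DB / Cache / etc.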
Slide 31
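Slide 31 text
3: Keep the whole setup simple to manage ■ Unify the tools ■ Centralize quality definitions and thresholds • Narrow down the tools used for monitoring • Manage monitoring configuration as code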
Slide 32
Slide 32 text
3: Keep the whole setup simple to manage ■ Unify the tools ■ Centralize quality definitions and thresholds • Unify SLIs / SLOs • Threshold / Rate / Change / Anomaly
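The slide lists four kinds of alert condition (Threshold / Rate / Change / Anomaly) without defining them. The sketch below shows one plausible reading of each as a small predicate; the function names, parameters, and the z-score rule for "Anomaly" are assumptions for illustration, not definitions taken from the talk or from any monitoring product.

```python
from statistics import mean, pstdev
from typing import Sequence


def threshold_breached(value: float, limit: float) -> bool:
    """Threshold: the latest value is above a fixed limit."""
    return value > limit


def rate_breached(bad: int, total: int, max_ratio: float) -> bool:
    """Rate: the share of bad events in the window is above max_ratio."""
    return total > 0 and bad / total > max_ratio


def change_breached(current: float, baseline: float, max_increase: float) -> bool:
    """Change: the relative increase over a baseline is above max_increase."""
    if baseline <= 0:
        return False  # assumption: no usable baseline means no alert
    return (current - baseline) / baseline > max_increase


def anomaly_breached(history: Sequence[float], current: float, z_limit: float = 3.0) -> bool:
    """Anomaly: the value is more than z_limit standard deviations from the mean."""
    if len(history) < 2:
        return False  # not enough history to judge
    sigma = pstdev(history)
    if sigma == 0:
        return current != history[0]
    return abs(current - mean(history)) / sigma > z_limit
```

Keeping every condition in one of a few shared shapes like these is one way to centralize thresholds: each monitor then differs only in its parameters, which can live in version-controlled configuration.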
Slide 33
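Slide 33 text
4: Adopt an architecture that is easy to monitor ■ Servers are easy to replace ■ Give up the do-it-yourself mindset ■ Do not let operations (monitoring cost) grow in proportion to the business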
Slide 34
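Slide 34 text
4-1: Easy monitoring ~ servers are easy to replace ■ Discard abnormal servers right away – Images built from code are ready to roll out at any time (diagram labels: Scheduling, Rotate, API, Worker)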
Slide 35
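Slide 35 text
4-2: Easy monitoring ~ give up the do-it-yourself mindset ■ Stand on the shoulders of a giant (AWS) and reduce what has to be monitored – Scaling out / up is easy and no failover management is needed (diagram labels: S3, Aurora, Dynamo, ElastiCache, SQS)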
Slide 36
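Slide 36 text
4-3: Easy monitoring ~ do not let monitoring cost grow in proportion to the business ■ The same provisioning process and tech stack – Do not create variation in the technical configuration (diagram labels: Pairs JP, Pairs GL, Capacity = LL, Capacity = M)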
Slide 37
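Slide 37 text
5: Distinguish anomaly detection from performance targets ■ Use separate observation durations for anomaly detection and for target achievement ■ Distinguish "the service cannot be used" from "the service is hard to use" • Anomalies: detected by alerts (within 1min) • Targets: periodic review (within 1week)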
Slide 38
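Slide 38 text
5: Distinguish anomaly detection from performance targets ■ Use separate observation durations for anomaly detection and for target achievement ■ Distinguish "the service cannot be used" from "the service is hard to use" • Cannot be used: login or search does not work, and so on • Hard to use: the service feels slow / heavy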
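The two observation durations (about one minute for anomaly detection, one week for target achievement) can be illustrated by evaluating the same request samples over two trailing windows. This is a sketch under stated assumptions: the `(timestamp, succeeded)` sample shape, the function names, and the 0.5 paging floor are hypothetical; only the window lengths and the 99.95 target come from the slides.

```python
from datetime import datetime, timedelta
from typing import Sequence, Tuple

# Each sample is (timestamp, succeeded); a hypothetical shape for request logs.
Sample = Tuple[datetime, bool]


def success_ratio(samples: Sequence[Sample], now: datetime, window: timedelta) -> float:
    """Share of successful requests inside the trailing window ending at `now`."""
    recent = [ok for ts, ok in samples if now - window <= ts <= now]
    if not recent:
        return 1.0  # assumption: no traffic counts as healthy
    return sum(recent) / len(recent)


def should_page(samples: Sequence[Sample], now: datetime, floor: float = 0.5) -> bool:
    """Anomaly detection: does the service look unusable over the last minute?"""
    return success_ratio(samples, now, timedelta(minutes=1)) < floor


def weekly_target_met(samples: Sequence[Sample], now: datetime, target: float = 0.9995) -> bool:
    """Performance target: reviewed periodically over the last week."""
    return success_ratio(samples, now, timedelta(weeks=1)) > target
```

The short window answers "is it broken right now?", which warrants an alert, while the long window answers "are we meeting the target?", which is better handled in a periodic review than by paging someone.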
Slide 39
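Slide 39 text
Approach (recap) ■ 1: Set quantitative targets ■ 2: Make alert = immediate action ■ 3: Keep the whole setup simple to manage ■ 4: Adopt an architecture that is easy to monitor ■ 5: Distinguish anomaly detection from performance targets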
Slide 40
Slide 40 text
We overhauled our monitoring and architecture
Slide 41
Slide 41 text
Overhauling monitoring
Slide 42
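Slide 42 text
Evolve monitoring and response step by step ■ Anomalies can be known (seen) ■ Anomalies can be detected ■ Anomalies can be handled ■ Anomalies repair themselves automatically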
Slide 43
Slide 43 text
Anomalies can be known (seen) ■ Datadog ■ StackDriver Logging
Slide 44
Slide 44 text
Host-based monitoring and integrations with Datadog (diagram labels: Metrics Aggregate, AWS Integration)
Slide 45
Slide 45 text
Log visualization and report generation with StackDriver Logging (diagram labels: Metrics / Log Monitoring, Alert, Slack, call you, Log Aggregation, Sync to DWH, Generate Performance Report, Hosting Report, Dev & SRE)
Slide 46
Slide 46 text
Log visualization and report generation with StackDriver Logging (diagram labels: Metrics / Log Monitoring, Alert, Slack, call you, Log Aggregation, Sync to DWH, Generate Performance Report, Hosting Report, Dev & SRE) • Alerts and immediate response • Window: 1 minute
Slide 47
Slide 47 text
Log visualization and report generation with StackDriver Logging (diagram labels: Metrics / Log Monitoring, Alert, Slack, call you, Log Aggregation, Sync to DWH, Generate Performance Report, Hosting Report, Dev & SRE) • Performance review • Window: 1 week
Slide 48
Slide 48 text
Log visualization and report generation with StackDriver Logging (diagram labels: Metrics / Log Monitoring, Alert, Slack, call you, Log Aggregation, Sync to DWH, Generate Performance Report, Hosting Report, Dev & SRE) • Performance review • Window: 1 week. Report excerpt (partial): latency histogram
Slide 49
Slide 49 text
Log visualization and report generation with StackDriver Logging (diagram labels: Metrics / Log Monitoring, Alert, Slack, call you, Log Aggregation, Sync to DWH, Generate Performance Report, Hosting Report, Dev & SRE) • Performance review • Window: 1 week. Report excerpt (partial): request volume trend per REST endpoint
Slide 50
Slide 50 text
Anomalies can be detected ■ Alerts from external monitoring (service monitoring) ■ Alerts from resource monitoring ■ Alerts from performance ■ Alerts from log monitoring
Slide 51
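Slide 51 text
Anomalies can be detected ■ Alerts from external monitoring (service monitoring) ■ Alerts from resource monitoring ■ Alerts from performance ■ Alerts from log monitoring • Needs an immediate response • Cover SSL together with it (it takes time to handle)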
Slide 52
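Slide 52 text
Anomalies can be detected ■ Alerts from external monitoring (service monitoring) ■ Alerts from resource monitoring ■ Alerts from performance ■ Alerts from log monitoring • Do not watch the stateless layer (discard hosts instead) • Do watch storage systems (they take time to handle)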
Slide 53
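Slide 53 text
Anomalies can be detected ■ Alerts from external monitoring (service monitoring) ■ Alerts from resource monitoring ■ Alerts from performance ■ Alerts from log monitoring • Latency: change versus the previous week or the previous hour • Request failures: viewed by SLO x Status Code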
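Viewing request failures by SLO and status code can be sketched as a breakdown of how much of all traffic each failing status code accounts for, compared with the failure share the SLO allows. The function names and the choice to count only 5xx responses as failures are assumptions for this example.

```python
from collections import Counter
from typing import Dict, Sequence


def failure_share_by_status(status_codes: Sequence[int]) -> Dict[int, float]:
    """Return each failing status code's share of all requests, in percent.

    Assumption: only 5xx responses count as failures against the SLO.
    """
    counts = Counter(status_codes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        code: 100.0 * n / total
        for code, n in sorted(counts.items())
        if 500 <= code <= 599
    }


def codes_breaking_slo(status_codes: Sequence[int], slo_percent: float = 99.95) -> Dict[int, float]:
    """Status codes whose failures alone exceed the failure share the SLO allows."""
    allowed = 100.0 - slo_percent
    shares = failure_share_by_status(status_codes)
    return {code: share for code, share in shares.items() if share > allowed}
```

A breakdown like this points the responder at the dominant failure mode first, instead of leaving them with a single aggregate error rate.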
Slide 54
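Slide 54 text
Anomalies can be detected ■ Alerts from external monitoring (service monitoring) ■ Alerts from resource monitoring ■ Alerts from performance ■ Alerts from log monitoring • Change versus the previous week or the previous hour • For logs that flow constantly, anomaly detection works well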
Slide 55
Slide 55 text
Anomalies can be handled ■ Make it easy to add resources, replace them, and roll back ■ Lower the cost of investigation when a failure occurs
Slide 56
Slide 56 text
Make it easy to add resources, replace them, and roll back (diagram labels: Scale Out / Discard, Scale Up, Add Shard, Scale Out, Scale Up, Vertical Split)
Slide 57
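Slide 57 text
Make it easy to add resources, replace them, and roll back (diagram labels: Scale Out / Discard, Scale Up, Add Shard, Scale Out, Scale Up, Vertical Split) • Resources can be added easily and without lead time • Make failing components easy to detach • Storage systems are the key; rehearse in advance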
Slide 58
Slide 58 text
Anomalies repair themselves automatically ■ Auto-healing ■ Injecting chaos
Slide 59
Slide 59 text
Anomalies repair themselves automatically ■ Auto-healing ■ Injecting chaos • Detach abnormal hosts and fail over • Storage systems are the key • Regular evacuation drills
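The auto-healing idea (detach an abnormal host and bring in a replacement) can be sketched as a small loop over a host pool. Everything here is hypothetical: the `Pool` class, the host names, and the health-check callback stand in for whatever load balancer and provisioning tooling is actually used.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Pool:
    """A hypothetical pool of stateless hosts behind a load balancer."""
    hosts: List[str] = field(default_factory=list)
    launched: int = 0

    def launch_replacement(self) -> str:
        # Stand-in for starting a host from a prebuilt image.
        self.launched += 1
        name = f"replacement-{self.launched}"
        self.hosts.append(name)
        return name


def heal(pool: Pool, is_healthy: Callable[[str], bool]) -> List[str]:
    """Detach every unhealthy host and launch one replacement for each."""
    unhealthy = [h for h in pool.hosts if not is_healthy(h)]
    for host in unhealthy:
        pool.hosts.remove(host)      # detach from the pool
        pool.launch_replacement()    # keep capacity constant
    return unhealthy
```

This only works for hosts that can be thrown away; stateful storage needs failover instead, which is why the slides single it out as the hard part.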
Slide 60
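Slide 60 text
Anomalies repair themselves automatically ■ Auto-healing ■ Injecting chaos • Deliberately inject failures • Force-test auto-healing and make it routine • The battle has only just begun!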
Slide 61
Slide 61 text
Summary
Slide 62
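Slide 62 text
Today's talk ■ Technology trends around web application monitoring ■ The purpose of monitoring and its current challenges ■ Failures at eureka and examples of how we recovered ■ Evolving monitoring step by step, with real examples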
Slide 63
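Slide 63 text
Summary ■ The purpose of monitoring is to minimize MTTR and meet quality requirements ■ An alert is meaningless if you cannot act on it ■ Set targets for service quality ■ Aim for an architecture that needs as little monitoring as possible ■ Run alerts plus periodic quality checks together across Dev / Ops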
Slide 64
Slide 64 text
Summary ■ Systems keep growing and becoming more complex day by day ■ There is a lot we want to do! ■ The eureka SRE team is hiring!
Slide 65
Slide 65 text
CONFIDENTIAL Thank you :)
Slide 66
Slide 66 text
Any Questions??