What Is Web Service Quality? Three Answers Learned from Alert Hell, Monitoring Failures, and Service Level Objective Design
Takuya Onda / eureka, Inc.
2018-09-07 Builderscon Tokyo
Copyright © 2009-2018 eureka, inc. All rights reserved.
Introduction
■ Takuya Onda
– eureka, Inc.
– SRE team Head
About Us - IAC/Match Group
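Agenda
■ Technology trends around web application monitoring
■ The purpose of monitoring and the current challenges
■ Failures at eureka / case studies of how we recovered
■ Overhauling monitoring in stages, with real examples
Technology Trends Around Web Application Monitoring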
The Rise of the Public Cloud
■ Powerful machine resources can be procured faster and more easily
■ Architectures that assume servers are disposable
Richer Monitoring Tools
■ SaaS server monitoring services
■ Richer integrations
DevOps and SRE
■ Demand for both high development productivity and stable operations
■ Culture, Automation, Lean, Measurement, Sharing
Growing System Complexity
■ Microservices, SPAs, and a widening range of devices
■ A need to monitor non-functional requirements
Have you ever experienced this?
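Common Alert Experiences
■ "Status … is increasing"
– It always shows up around this time...
– Someone senior: "Is this alert OK?"
■ "DB connections have exceeded xxx"
– Uh, what should we do? It's peak time right now.
– A spike? Go into maintenance? Just watch for now...?
■ "An XXX error has occurred"
– Old-timer: "You can ignore this one"
– Recent hire: "...!?!?"
Common Alert Experiences
■ We can't tell whether something is really abnormal
■ Alerts don't lead to action
■ We can't tell whether we are heading in the right direction (what is quality?)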
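What Is the Purpose of Monitoring a System?
■ We want to meet the performance users need to be satisfied
■ To do that, we want to detect system anomalies immediately
■ We want to respond to anomalies at once and minimize how long they last
The value lies in meeting the quality users expect. Monitoring is a means to that end.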
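Treat Accidents as Normal (by SRE Workbook)
■ To keep the cost of failure down, it is important to act early
■ The shorter the MTTR (mean time to recovery), the smaller the burden on developers
■ The later a problem is discovered, the harder it is to repair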
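As a small illustration of the MTTR figure mentioned above, the following sketch averages the time between detection and recovery. The incident timestamps and the function name are hypothetical and not taken from the talk.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average of (recovered_at - detected_at) over all incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum(
        (recovered - detected for detected, recovered in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical incidents: (detected_at, recovered_at)
incidents = [
    (datetime(2018, 9, 1, 10, 0), datetime(2018, 9, 1, 10, 30)),   # 30 minutes
    (datetime(2018, 9, 3, 22, 15), datetime(2018, 9, 3, 23, 45)),  # 90 minutes
]
mttr = mean_time_to_recovery(incidents)
print(mttr)
```

Tracking this number over time is one way to see whether earlier detection is actually shortening recovery.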
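Examples of Monitoring Failures at eureka
Monitoring Failures at eureka
■ Alerts don't lead to action
■ We can't judge whether we should respond right now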
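Alerts Don't Lead to Action
■ 1: It is working for now
• A DB whose load average gets strained at peak time
• Dynamo capacity with very little headroom left
• A mail delivery server whose CPU pegs every hour
■ 2: It is always abnormal
• A job queue that clogs up regularly
• Application error logs that stream constantly
• Errors that occur on every deploy
■ 3: There is nothing we can do
• 5xx errors are increasing, but what is this? (wait and see)
• Saturation of storage systems
• A SPOF server whose configuration cannot be reproduced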
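We Can't Judge Whether to Respond Now
■ Does this actually affect users?
■ How large is the impact?
■ Can we even quantify it?
■ Is it bad if we don't fix this permanently?
■ Does it take priority over business initiatives?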
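Impact on the Mood Inside the Company
"They tell us to watch the alerts, but what's the point?"
"Is our system really OK??"
"Nobody cares anyway, so whatever"
"Recurrence prevention? Nothing ever changes"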
How We Tackled It
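Policy
■ 1: Set quantitative targets
■ 2: Alert = immediate action
■ 3: Manage the whole setup simply
■ 4: Adopt an architecture that is easy to monitor
■ 5: Distinguish anomaly detection from performance targets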
1: Set Quantitative Targets
■ Define service level indicators (SLIs)
■ Set service level objectives (SLOs)
■ Use the SLO as the basis for alert thresholds and task priority
• SLI = successful requests / total requests
• SLO = SLI > 99.95% (period: 1 week)
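The SLI and SLO definitions above can be sketched as follows. This is a minimal illustration, assuming the 99.95 on the slide is a percentage; the request counts, the function names, and the treatment of a zero-traffic week are assumptions for the example.

```python
SLO_TARGET_PERCENT = 99.95  # target from the slide, evaluated over one week


def sli(successful_requests: int, total_requests: int) -> float:
    """SLI as a percentage: successful requests / total requests."""
    if total_requests == 0:
        # Assumption: a period with no traffic counts as meeting the objective.
        return 100.0
    return 100.0 * successful_requests / total_requests


def meets_slo(successful_requests: int, total_requests: int,
              target: float = SLO_TARGET_PERCENT) -> bool:
    """True when the SLI for the period is above the SLO target."""
    return sli(successful_requests, total_requests) > target


# Hypothetical weekly totals
good_week = meets_slo(1_999_200, 2_000_000)  # SLI is about 99.96%
bad_week = meets_slo(1_998_000, 2_000_000)   # SLI is about 99.90%
print(good_week, bad_week)
```

Because the same function answers "are we within target?" for any period, it can back both an alert threshold and a weekly prioritization review.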
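2: Alert = Something That Needs Immediate Action
■ Fire alerts only for things that truly need action
• Things that cause the SLO to be missed
• Things that take time to respond to (storage systems, for example)
■ Move toward an architecture where an immediate response is possible
• Server resources can be scaled up or replaced immediately
• LB / API / Batch / DB / Cache / etc.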
3: Manage the Whole Setup Simply
■ Unify the tools
• Narrow down the tools used for monitoring
• Manage monitoring configuration as code
■ Centralize quality definitions and thresholds
• Unify SLIs / SLOs
• Threshold / Rate / Change / Anomaly
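One way to read "monitoring configuration as code" together with the four alert types (Threshold / Rate / Change / Anomaly) is a single rule definition evaluated by one function. The sketch below is an assumption about what such a definition could look like; the `Rule` class, the `fires` function, and all rule names and limits are hypothetical and are not the configuration format of any specific monitoring tool.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class Rule:
    name: str
    kind: str     # "threshold", "rate", "change" or "anomaly"
    limit: float


def fires(rule, current, history=(), total=None):
    """Return True when the rule should raise an alert for `current`."""
    if rule.kind == "threshold":   # absolute value above a fixed limit
        return current > rule.limit
    if rule.kind == "rate":        # share of a total, e.g. errors / requests
        return bool(total) and current / total > rule.limit
    if rule.kind == "change":      # relative increase against the previous value
        if not history or history[-1] <= 0:
            return False
        return (current - history[-1]) / history[-1] > rule.limit
    if rule.kind == "anomaly":     # distance from the mean, in standard deviations
        if len(history) < 2:
            return False
        sigma = pstdev(history)
        return sigma > 0 and abs(current - mean(history)) / sigma > rule.limit
    raise ValueError(f"unknown rule kind: {rule.kind}")


# Hypothetical rules kept in one place, reviewed like any other code
RULES = {
    "db_connections": Rule("db_connections", "threshold", 500),
    "error_rate": Rule("error_rate", "rate", 0.0005),
    "latency": Rule("latency", "change", 0.5),
    "log_volume": Rule("log_volume", "anomaly", 3.0),
}

print(fires(RULES["db_connections"], 620))
print(fires(RULES["log_volume"], 30, history=[10, 12, 11, 9, 10, 8]))
```

Keeping every threshold in one reviewed structure is the point of the sketch; the evaluation logic itself would normally live in the monitoring service.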
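4: Adopt an Architecture That Is Easy to Monitor
■ Servers are easy to replace
■ Stop building everything in-house
■ Don't let operations (monitoring cost) grow in proportion to the business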
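4-1: Easy Monitoring ~ Servers Are Easy to Replace
■ Discard a server with anomalies right away
– Images built from code are ready to be put into service immediately
[Diagram labels: Scheduling Rotate, API, Worker]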
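4-2: Easy Monitoring ~ Stop Building Everything In-House
■ Stand on the shoulders of a giant (AWS) and reduce what has to be monitored
– Scaling out / up is easy & no failover management is needed
[Diagram labels: S3, Aurora, Dynamo, ElastiCache, SQS]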
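4-3: Easy Monitoring ~ Don't Let Monitoring Cost Grow with the Business
■ The same provisioning process and technology stack
– Don't create drift in the technical configuration
[Diagram labels: Pairs JP (Capacity = LL), Pairs GL (Capacity = M)]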
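5: Distinguish Anomaly Detection from Performance Targets
■ Use different observation durations for anomaly detection and for target achievement
• Anomalies: detection by alerts (within 1min)
• Targets: periodic inspection (within 1week)
■ Distinguish "the service cannot be used" from "the service is hard to use"
• Cannot be used: users cannot log in, cannot search, and so on
• Hard to use: the service feels slow / heavy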
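The two observation durations can be illustrated by evaluating the same request stream over a one-minute window (for alerting) and a one-week window (for target review). This is a minimal sketch; the event data and the `success_ratio` helper are hypothetical.

```python
from datetime import datetime, timedelta


def success_ratio(events, now, window):
    """Share of successful requests among events in (now - window, now]."""
    recent = [ok for ts, ok in events if now - window < ts <= now]
    if not recent:
        return None  # no traffic in the window
    return sum(recent) / len(recent)


now = datetime(2018, 9, 7, 12, 0, 0)

# Hypothetical events: (timestamp, request succeeded?)
events = (
    [(now - timedelta(days=2), True)] * 98        # healthy traffic earlier in the week
    + [(now - timedelta(seconds=10), False)] * 2  # failures in the last minute
)

alert_view = success_ratio(events, now, timedelta(minutes=1))  # anomaly detection
weekly_view = success_ratio(events, now, timedelta(weeks=1))   # target review
print(alert_view, weekly_view)
```

The short window shows a total outage right now, while the weekly figure has barely moved, which is why the slide keeps the two questions separate.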
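Policy (recap)
■ 1: Set quantitative targets
■ 2: Alert = immediate action
■ 3: Manage the whole setup simply
■ 4: Adopt an architecture that is easy to monitor
■ 5: Distinguish anomaly detection from performance targets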
We Overhauled Our Monitoring and Architecture
Overhauling Monitoring
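Evolving Monitoring and Response in Stages
■ We can know about (see) anomalies
■ We can detect anomalies
■ We can respond to anomalies
■ Anomalies are repaired automatically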
We Can Know About (See) Anomalies
■ Datadog
■ StackDriver Logging
Host-Based Monitoring & Integration with Datadog
[Diagram labels: Metrics Aggregate, AWS Integration]
Log Visualization and Report Generation with StackDriver Logging
[Diagram labels: Metrics / Log, Monitoring, Alert, Slack, call you, Log Aggregation, Sync to DWH, Generate Performance Report, Hosting Report, Dev & SRE]
■ Alerting & immediate response
• Window: 1 minute
■ Performance review
• Window: 1 week
• Report excerpt (partial): latency histogram
• Report excerpt (partial): request volume trend per REST endpoint
We Can Detect Anomalies
■ Alerts from external monitoring (service monitoring)
• An immediate response is required
• Include SSL as well (it takes time to deal with)
■ Alerts from resource monitoring
• Don't watch the stateless layer (discard it instead)
• Watch storage systems (responding takes time)
■ Alerts from performance
• Latency: change compared with the previous week or the previous hour
• Request failures: viewed by SLO x Status Code
■ Alerts from log monitoring
• Change compared with the previous week or the previous hour
• For logs that stream constantly, anomaly detection works well
We Can Respond to Anomalies
■ Make scaling up, replacing, and rolling back resources easy
■ Lower the cost of investigation when a failure occurs
Make Scaling Up, Replacing, and Rolling Back Resources Easy
[Diagram labels: Scale Out / Discard, Scale Up, Add Shard, Scale Out, Scale Up, Vertical Split]
• Resources can be added easily and with no lead time
• Make failing components easy to detach
• Storage systems are the crux. Rehearse in advance
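Anomalies Are Repaired Automatically
■ Auto-healing
• Detaching abnormal hosts and failing over
• Storage systems are the crux
• Regular evacuation drills
■ Injecting chaos
• Intentional fault injection
• Forced testing of auto-healing & making it routine
• The battle has only just begun...!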
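Summary
Today's Talk
■ Technology trends around web application monitoring
■ The purpose of monitoring and the current challenges
■ Failures at eureka / case studies of how we recovered
■ Evolving monitoring in stages, with real examples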
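Summary
■ The purpose of monitoring is to minimize MTTR and meet quality requirements
■ An alert is meaningless if nobody can act on it
■ Set targets for service quality
■ Aim for an architecture that needs little monitoring
■ Alerts + regular quality checks, done by Dev / Ops together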
Summary
■ Systems keep growing & becoming more complex every day
■ There is a lot we still want to do!
■ The eureka SRE team is hiring!
Thank you :)
Any Questions??