What is web service quality? Three answers learned from alert hell,
monitoring failures, and service level objective design

takuya542
September 07, 2018

Transcript

  1. Copyright © 2009-2018 eureka, inc. All rights reserved. Takuya Onda / eureka, Inc. 2018-09-07 Builderscon Tokyo. What is web service quality? Three answers learned from alert hell, monitoring failures, and service level objective design
  2. Introduction ▪ Takuya Onda – eureka, Inc. – SRE team Head
  3. None
  4. About Us - IAC/Match Group

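  5. Agenda ▪ Technology trends in web application monitoring ▪ The purpose of monitoring and the problems seen in the field ▪ Failures at eureka and case studies of the recovery ▪ A step-by-step overhaul of monitoring, with real examples
  6. Technology trends in web application monitoring
  7. The rise of the public cloud ▪ Powerful machine resources can be procured faster and more easily ▪ Architectures that assume servers are disposable
  8. Richer monitoring tools ▪ SaaS server monitoring services ▪ Richer integrations
  9. DevOps and SRE ▪ Demand for both high development productivity and stable operations ▪ Culture, Automation, Lean, Measurement, Sharing
  10. Growing system complexity ▪ Microservices, SPAs, and a wider variety of devices ▪ A need to monitor non-functional requirements
  11. Have you ever experienced this?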

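  12. Familiar alert scenes ▪ "Status is increasing" – This one fires at this time every day... – Someone senior: "Is this alert OK?" ▪ "The number of DB connections exceeded xxx" – Um, what am I supposed to do? It's peak time right now! – A spike? Go into maintenance? Just watch it for now...? ▪ "An XXX error occurred" – Long-time member: "You can ignore this one" – Recent hire: "...!?!?"
  13. Familiar alert scenes ▪ Nobody knows whether it is really abnormal ▪ Alerts do not lead to action ▪ Nobody knows whether we are heading in the right direction (what is quality?)
  14. What is the purpose of monitoring a system? ▪ We want to meet the performance users expect ▪ To do that, we want to detect system anomalies immediately ▪ We want to respond to anomalies at once and minimize how long they last
  15. What is the purpose of monitoring a system? ▪ We want to meet the performance users expect ▪ To do that, we want to detect system anomalies immediately ▪ We want to respond to anomalies at once and minimize how long they last ▪ The value lies in meeting the quality users ask for; monitoring is a means to that end
  16. Treat accidents as normal (from the SRE Workbook) ▪ Moving early matters because it keeps the cost of failure down ▪ The shorter the MTTR (mean time to recovery), the smaller the burden on developers ▪ The later a problem is found, the harder it is to repair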
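
As a rough illustration of the MTTR figure mentioned on slide 16, the sketch below averages the time between detection and recovery over a list of incidents. The `(detected_at, recovered_at)` tuple layout and the function name `mttr` are assumptions made for this example; the talk does not describe how eureka records incidents.

```python
from datetime import datetime, timedelta

def mttr(incidents):
    """Mean time to recovery over (detected_at, recovered_at) pairs.

    The pair layout is a hypothetical incident record, not eureka's format.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    # Sum the individual recovery durations, then divide by the incident count.
    total = sum((recovered - detected for detected, recovered in incidents),
                timedelta())
    return total / len(incidents)

incidents = [
    (datetime(2018, 9, 7, 10, 0), datetime(2018, 9, 7, 10, 30)),  # 30 minutes
    (datetime(2018, 9, 7, 12, 0), datetime(2018, 9, 7, 13, 30)),  # 90 minutes
]
print(mttr(incidents))
```

Tracking this number over time is one way to check whether alerts are actually shortening recovery.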

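  17. Examples of monitoring failures at eureka
  18. Monitoring failures at eureka ▪ Alerts do not lead to action ▪ We cannot judge whether something must be handled right now
  19. Alerts do not lead to action ▪ 1: It is working for the time being ▪ 2: Always in an abnormal state ▪ 3: Nothing can be done
  20. Alerts do not lead to action ▪ 1: It is working for the time being ▪ 2: Always in an abnormal state ▪ 3: Nothing can be done • A DB whose load average gets tight at peak time • Dynamo capacity that is almost used up • A mail delivery server whose CPU is pegged every hour
  21. Alerts do not lead to action ▪ 1: It is working for the time being ▪ 2: Always in an abnormal state ▪ 3: Nothing can be done • A job queue that clogs up every day • Application error logs that flow constantly • Errors that occur on every deploy
  22. Alerts do not lead to action ▪ 1: It is working for the time being ▪ 2: Always in an abnormal state ▪ 3: Nothing can be done • 5xx errors are increasing, but what is this? (wait and see) • Saturation of storage systems • A SPOF server whose configuration cannot be reproduced
  23. We cannot judge whether something must be handled right now ▪ Does this actually affect users? ▪ How much does it affect them? ▪ Can it even be quantified? ▪ Is it bad if we don't make a permanent fix? ▪ Does it take priority over business initiatives?
  24. It also affects the mood inside the company: "They tell us to watch the alerts, but is there any point?" "Is our system really OK??" "Nobody cares anyway, so whatever" "Recurrence prevention never changes anything, does it?"
  25. How we tackled it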

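  26. Policy ▪ 1: Set quantitative goals ▪ 2: Alert = immediate action ▪ 3: Manage the whole thing simply ▪ 4: Move to an architecture that is easy to monitor ▪ 5: Distinguish anomaly detection from performance goals
  27. 1: Set quantitative goals ▪ Define service level indicators (SLI) ▪ Set service level objectives (SLO) ▪ Use the SLO as the basis for alert thresholds and task priority
  28. 1: Set quantitative goals ▪ Define service level indicators (SLI) ▪ Set service level objectives (SLO) ▪ Use the SLO as the basis for alert thresholds and task priority • SLI = successful requests / total requests • SLO = SLI > 99.95% (period: 1 week)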
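
The SLI and SLO definitions on slide 28 can be sketched in a few lines. This is a minimal example that assumes request counts for the one-week period are already available; the `RequestCounts` type, the function names, and the choice to treat an empty period as 100% are all assumptions for illustration, not details from the talk.

```python
from dataclasses import dataclass

@dataclass
class RequestCounts:
    successful: int
    total: int

def sli(counts: RequestCounts) -> float:
    """Availability SLI as a percentage: successful requests / total requests."""
    if counts.total == 0:
        # Assumption: a period with no traffic is treated as fully available.
        return 100.0
    return 100.0 * counts.successful / counts.total

def meets_slo(counts: RequestCounts, target_percent: float = 99.95) -> bool:
    """SLO from the slide: the SLI must exceed the target over the period."""
    return sli(counts) > target_percent

# Hypothetical counts for one week.
week = RequestCounts(successful=9_996_000, total=10_000_000)
print(meets_slo(week))
```

Because the same `target_percent` can feed both alert thresholds and task prioritization, keeping it in one place matches the slide's idea of using the SLO as a shared baseline.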
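  29. 2: Alert = immediate action ▪ Fire alerts only for things that truly require action ▪ Move toward an architecture where an immediate response is possible • Things that cause the SLO to be missed • Things that take time to handle (storage systems and the like)
  30. 2: Alert = an event that requires immediate action ▪ Fire alerts only for things that truly require action ▪ Move toward an architecture where an immediate response is possible • Server resources can be added or replaced immediately • LB / API / Batch / DB / Cache / etc.
  31. 3: Manage the whole thing simply ▪ Standardize tools ▪ Centralize quality definitions and thresholds • Narrow down the tools used for monitoring • Turn monitoring settings into code
  32. 3: Manage the whole thing simply ▪ Standardize tools ▪ Centralize quality definitions and thresholds • Unify SLI / SLO • Threshold / Rate / Change / Anomaly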
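
One way to picture "monitoring settings as code" with centralized thresholds is a single declarative list of monitors that is validated before it is applied. The sketch below is purely illustrative: the `Monitor` fields, metric names, and limits are hypothetical, and it reads the four words on slide 32 as alert condition types, which is an assumption. It does not call any real monitoring API.

```python
from dataclasses import dataclass
from enum import Enum

class Condition(Enum):
    # The four words listed on slide 32, read here as alert condition types.
    THRESHOLD = "threshold"
    RATE = "rate"
    CHANGE = "change"
    ANOMALY = "anomaly"

@dataclass(frozen=True)
class Monitor:
    name: str
    metric: str            # hypothetical metric name
    condition: Condition
    limit: float           # meaning depends on the condition type
    window_minutes: int = 1

# Single source of truth for thresholds (all values are made up).
MONITORS = [
    Monitor("api-availability", "api.success_ratio", Condition.THRESHOLD, 99.95),
    Monitor("api-latency-change", "api.latency_p95", Condition.CHANGE, 0.5),
]

def validate(monitors):
    """Reject duplicate names and non-positive windows before applying."""
    names = [m.name for m in monitors]
    if len(names) != len(set(names)):
        raise ValueError("duplicate monitor name")
    for m in monitors:
        if m.window_minutes <= 0:
            raise ValueError(f"{m.name}: window must be positive")
    return True
```

Keeping such a list in version control means threshold changes go through review like any other code change.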
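  33. 4: Move to an architecture that is easy to monitor ▪ Servers are easy to replace ▪ Stop insisting on building everything in-house ▪ Do not let operations effort (monitoring cost) grow in proportion to the business
  34. 4-1: Easy monitoring ~ servers are easy to replace ▪ Throw away abnormal servers right away – Images built from code are ready to be put into service immediately. Diagram labels: Scheduling, Rotate, API, Worker
  35. 4-2: Easy monitoring ~ stop building everything in-house ▪ Stand on the shoulders of a giant (AWS). Reduce the number of things to monitor – Scaling out / up is easy & no failover management is needed. Diagram labels: S3, Aurora, Dynamo, ElastiCache, SQS
  36. 4-3: Easy monitoring ~ do not let monitoring cost grow in proportion to the business ▪ The same provisioning process and technology stack – Do not create variation in the technical setup. Diagram labels: Pairs JP, Pairs GL, Capacity = LL, Capacity = M
  37. 5: Distinguish anomaly detection from performance goals ▪ Observe anomaly detection and goal attainment over separate durations ▪ Distinguish "the service cannot be used" from "the service is hard to use" • Anomaly: detection by alerts (within 1min) • Goal: periodic inspection (within 1week)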
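
The split between a one-minute alerting window and a one-week goal review on slide 37 can be sketched as two evaluations over the same event stream. The event layout `(timestamp, ok)`, the 50% alerting ratio, and the function names are assumptions for this example; only the two durations and the 99.95% target come from the slides.

```python
from datetime import datetime, timedelta

ALERT_WINDOW = timedelta(minutes=1)   # anomaly detection (from the slide)
REVIEW_WINDOW = timedelta(weeks=1)    # goal attainment (from the slide)

def success_ratio(events, now, window):
    """events: list of (timestamp, ok) tuples. Returns None if the window is empty."""
    recent = [ok for ts, ok in events if now - window <= ts <= now]
    if not recent:
        return None
    return sum(recent) / len(recent)

def should_alert(events, now, min_ratio=0.5):
    """Page someone only when the last minute looks broken (min_ratio is hypothetical)."""
    ratio = success_ratio(events, now, ALERT_WINDOW)
    return ratio is not None and ratio < min_ratio

def weekly_slo_met(events, now, target=0.9995):
    """Periodic review: did the week as a whole stay above the objective?"""
    ratio = success_ratio(events, now, REVIEW_WINDOW)
    return ratio is None or ratio > target
```

A short outage can trip `should_alert` without breaking the weekly objective, and a slow degradation can miss the objective without ever paging anyone, which is why the two checks are kept apart.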
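  38. 5: Distinguish anomaly detection from performance goals ▪ Observe anomaly detection and goal attainment over separate durations ▪ Distinguish "the service cannot be used" from "the service is hard to use" • Cannot be used: users cannot log in, cannot search, and so on • Hard to use: the service feels slow / heavy
  39. Policy (recap) ▪ 1: Set quantitative goals ▪ 2: Alert = immediate action ▪ 3: Manage the whole thing simply ▪ 4: Move to an architecture that is easy to monitor ▪ 5: Distinguish anomaly detection from performance goals
  40. We overhauled our monitoring and architecture
  41. Overhauling monitoring
  42. Evolve monitoring and response step by step ▪ Anomalies can be seen ▪ Anomalies can be detected ▪ Anomalies can be handled ▪ Anomalies are repaired automatically
  43. Anomalies can be seen ▪ Datadog ▪ Stackdriver Logging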

  44. Host-based monitoring & integration with Datadog. Diagram labels: Metrics Aggregate, AWS Integration
  45. Log visualization and report generation with Stackdriver Logging. Diagram labels: Metrics / Log Monitoring, Alert, Slack, call you, Log Aggregation, Sync to DWH, Generate Performance Report, Hosting Report, Dev & SRE
  46. Log visualization and report generation with Stackdriver Logging (same diagram) • Alerting & immediate response • Window: 1 minute
  47. Log visualization and report generation with Stackdriver Logging (same diagram) • Performance review • Window: 1 week
  48. Log visualization and report generation with Stackdriver Logging (same diagram) • Performance review • Window: 1 week. Report excerpt (partial): latency histogram
  49. Log visualization and report generation with Stackdriver Logging (same diagram) • Performance review • Window: 1 week. Report excerpt (partial): request volume over time per REST endpoint
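  50. Anomalies can be detected ▪ Alerts from external monitoring (service monitoring) ▪ Alerts from resource monitoring ▪ Alerts from performance ▪ Alerts from log monitoring
  51. Anomalies can be detected ▪ Alerts from external monitoring (service monitoring) ▪ Alerts from resource monitoring ▪ Alerts from performance ▪ Alerts from log monitoring • Requires an immediate response • Cover SSL as well (it takes time)
  52. Anomalies can be detected ▪ Alerts from external monitoring (service monitoring) ▪ Alerts from resource monitoring ▪ Alerts from performance ▪ Alerts from log monitoring • Do not watch the stateless layer (throw it away) • Do watch storage systems (they take time to handle)
  53. Anomalies can be detected ▪ Alerts from external monitoring (service monitoring) ▪ Alerts from resource monitoring ▪ Alerts from performance ▪ Alerts from log monitoring • For latency, use the rate of change versus the previous week or the previous hour • For request failures, look at SLO x Status Code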
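
The "rate of change versus the previous week or the previous hour" rule for latency can be sketched as a relative comparison against two baselines. The 50% limit, the choice to alert when either baseline is exceeded, and the function names are assumptions for illustration; the talk does not give the actual thresholds.

```python
def change_rate(current: float, baseline: float) -> float:
    """Relative change of current versus baseline (0.25 means +25%)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (current - baseline) / baseline

def latency_alert(current_ms: float, last_week_ms: float, last_hour_ms: float,
                  max_increase: float = 0.5) -> bool:
    """Alert when latency rose more than max_increase against either baseline.

    max_increase=0.5 is a made-up value for this sketch.
    """
    return (change_rate(current_ms, last_week_ms) > max_increase
            or change_rate(current_ms, last_hour_ms) > max_increase)

# Hypothetical p95 latencies in milliseconds.
print(latency_alert(current_ms=300, last_week_ms=180, last_hour_ms=190))
```

Comparing against the same time last week tolerates normal daily peaks, while the previous-hour baseline catches sudden regressions such as a bad deploy.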
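  54. Anomalies can be detected ▪ Alerts from external monitoring (service monitoring) ▪ Alerts from resource monitoring ▪ Alerts from performance ▪ Alerts from log monitoring • Use the rate of change versus the previous week or the previous hour • For logs that flow constantly, anomaly detection works well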
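
For log volumes that are never zero, a fixed threshold is hard to pick, which is where anomaly detection helps. The sketch below uses a plain z-score against recent history as a stand-in technique; it is not the algorithm used by Datadog or any other product, and the threshold of 3 standard deviations is an assumption.

```python
from statistics import mean, stdev

def is_anomalous(history, latest, z_threshold=3.0):
    """Flag latest if it deviates from the recent history by more than z_threshold sigmas.

    history: recent per-interval error-log counts (hypothetical input).
    """
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat history: any deviation at all is unusual.
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold

# Hypothetical error-log counts per minute.
recent_counts = [100, 102, 98, 101, 99, 100]
print(is_anomalous(recent_counts, 130), is_anomalous(recent_counts, 101))
```

A real service would likely also need to account for daily and weekly seasonality, which this sketch ignores.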
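  55. Anomalies can be handled ▪ Make it easy to add resources, replace them, and roll back ▪ Lower the cost of investigation when a failure occurs
  56. Make it easy to add resources, replace them, and roll back. Diagram labels: Scale Out / Discard, Scale Up, Add Shard, Scale Out, Scale Up, Vertical Split
  57. Make it easy to add resources, replace them, and roll back. Diagram labels: Scale Out / Discard, Scale Up, Add Shard, Scale Out, Scale Up, Vertical Split • Resources can be added easily and with no lead time • Make abnormal components easy to isolate • Storage systems are the crux. Rehearse in advance
  58. Anomalies are repaired automatically ▪ Auto-healing ▪ Injecting chaos
  59. Anomalies are repaired automatically ▪ Auto-healing ▪ Injecting chaos • Isolating abnormal hosts and failing over • Storage systems are the crux • Regular evacuation drills
  60. Anomalies are repaired automatically ▪ Auto-healing ▪ Injecting chaos • Intentional fault injection • Forced testing of auto-healing & making it routine • The battle has only just begun!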
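
The combination of auto-healing and deliberate fault injection on slides 58 to 60 can be illustrated with a small in-memory simulation: one function breaks a random host, another replaces every unhealthy host. Host names, the `broken` set, and both function names are hypothetical; a real setup would rely on the cloud provider's health checks and instance replacement rather than on code like this.

```python
import itertools
import random

def inject_failure(fleet, broken, rng):
    """Chaos injection: mark one randomly chosen host as broken."""
    victim = rng.choice(fleet)
    broken.add(victim)
    return victim

def heal(fleet, is_healthy, launch):
    """Auto-healing: keep healthy hosts, replace unhealthy ones with fresh ones."""
    return [host if is_healthy(host) else launch() for host in fleet]

# Simulated fleet; new hosts get sequential hypothetical names.
counter = itertools.count(1)
fleet = ["api-a", "api-b", "api-c"]
broken = set()

victim = inject_failure(fleet, broken, random.Random(0))
fleet = heal(fleet,
             is_healthy=lambda h: h not in broken,
             launch=lambda: f"api-new-{next(counter)}")
print(victim not in fleet, len(fleet))
```

Running the injection on a schedule is one way to turn the "evacuation drill" on slide 59 into a routine test of the healing path.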
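  61. Summary
  62. Today's talk ▪ Technology trends in web application monitoring ▪ The purpose of monitoring and the problems seen in the field ▪ Failures at eureka and case studies of the recovery ▪ The step-by-step evolution of monitoring, with real examples
  63. Summary ▪ The purpose of monitoring is to minimize MTTR and meet quality requirements ▪ An alert is meaningless if nobody can act on it ▪ Set target values for service quality ▪ Aim for an architecture that needs little monitoring ▪ Alerts + regular quality checks, done by Dev and Ops together
  64. Summary ▪ Systems grow and become more complex every year ▪ There is a lot we want to do! ▪ eureka is hiring members for its SRE team!
  65. CONFIDENTIAL Thank you :)
  66. Any Questions??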