
Introduction to SRE & the SRE work our team is doing #srefukuoka / introduce-to-sre


Slides from my talk at "SRE meetup at Fukuoka #1"
https://sre-fukuoka.connpass.com/event/119041/


Manabu Matsuzaki

March 13, 2019

Transcript

  1. Introduction to SRE & the SRE work our team is doing
    SRE meetup at Fukuoka #1, 2019/03/13, @matsumana

  2. About me
    • Name: Manabu Matsuzaki
    • Work at: LINE Fukuoka Corporation
    • Role: SRE
    • Twitter: @matsumana
  3. Agenda
    • What is SRE (Site Reliability Engineering)?
    • SLO and Error budget
    • The SRE work our team is doing

  4. What is SRE (Site Reliability Engineering)?

  5. At Google
    • https://landing.google.com/sre/
    • "SRE is what you get when you treat operations as if it’s a software problem. Our mission is to protect, provide for, and progress the software and systems behind all of Google’s public services with an ever-watchful eye on their availability, latency, performance, and capacity."
    • Paraphrase: SRE is engineering that solves operational problems with a software approach. Our mission is to protect and advance Google's services while constantly watching their availability, latency, performance, and capacity.
  6. In general
    • https://landing.google.com/sre/sre-book/chapters/introduction/
    • "In general, an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s)."
  7. SRE vs. DevOps: competing standards or close friends?
    • https://cloud.google.com/blog/products/gcp/sre-vs-devops-competing-standards-or-close-friends
    • This blog post puts it as "class SRE implements DevOps".
    • SRE is a concrete realization of the DevOps philosophy, and I think you can also say that DevOps is one element of SRE.
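  8. Looking at other sites as well
    • "What is the Site Reliability Engineering (SRE) that Google advocates?"
      https://furien.jp/columns/327/
      "The people sought for SRE roles tend to be highly skilled, because the job requires infrastructure skills first of all, and application skills as well."
    • Maintaining and improving API server availability
    • Improving API server performance
    • Maintaining and improving middleware availability
    • Improving middleware performance
    • Collecting logs
    • Building and operating a log analysis platform
    • Preparing servers and deployment environments
    • Preparing development and other environments
    • Strengthening security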
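  9. Scope of SRE work (summary so far)
    • DevOps: automation, Infrastructure as Code, CI/CD, etc.
    • Keep the service running stably (availability, performance) and keep improving it
    • Requires skills in infrastructure, middleware, and application implementation
    • Monitoring
    • Operations
    • Capacity planning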
  10. The most important thing in SRE

  11. SLO and Error budget

  12. • Site Reliability Engineering, Chapter 4 - Service Level Objectives
      https://landing.google.com/sre/sre-book/toc/
    • The Site Reliability Workbook, Chapter 2 - Implementing SLOs: “this is the most important chapter in this book”
      https://landing.google.com/sre/workbook/toc/
    • "A level-by-level checklist for assessing your SRE team": introduced there as the first item, as a basic of SRE
      https://cloudplatform-jp.googleblog.com/2019/02/how-to-start-and-assess-your-sre-journey.html
    • "Google explains why other companies' SRE practice is wrong": described there as extremely important
      https://www.infoq.com/jp/news/2018/08/google-explains-sre
  13. What are SLO and Error budget?

  14. SLI and SLO
    • SLI: Service Level Indicator
    • SLO: Service Level Objective, a service reliability target based on SLIs
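  15. Examples of SLIs for a web service
    • Availability: HTTP request success rate (successful requests / total requests)
    • Latency: proportion of requests that meet the threshold
    • Quality: responses served via fallback, e.g. showing the service's default image or stale data when an error occurs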
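The availability and latency SLIs on this slide are both simple ratios. A minimal Python sketch, assuming request counts and per-request latencies are already available; the function names and the zero-traffic policy are illustrative, not from the talk:

```python
from typing import Sequence


def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded (successful / total)."""
    if total_requests == 0:
        # Assumption: with no traffic, treat the SLI as fully met.
        return 1.0
    return successful_requests / total_requests


def latency_sli(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Fraction of requests that completed within the latency threshold."""
    if not latencies_ms:
        return 1.0
    fast = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return fast / len(latencies_ms)


if __name__ == "__main__":
    print(availability_sli(successful_requests=9_990, total_requests=10_000))
    print(latency_sli([120.0, 80.0, 450.0, 95.0], threshold_ms=300.0))
```

Expressing both SLIs as "good events / all events" keeps them directly comparable with an SLO stated as a percentage.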
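  16. SLO
    • If the service drops below its SLO threshold, users may start to feel dissatisfied with it or stop using it
    • The SRE workbook recommends setting SLOs based on current performance
    • SLO != SLA (service level agreement)
      • An SLA is a contract between users and the service provider
      • For example, with AWS EC2, if monthly uptime falls below the SLA, the shortfall is refunded as service credits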
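  17. Should the SLO be 100%?
    • A 100% SLO is the wrong target
    • For example, it would make the following impossible:
      • Adding new features
      • Improving existing features
      • Hardware and middleware maintenance
      • Applying security patches
    • Sometimes you want to prioritize releasing a new feature even at the cost of a lower SLO
    • In the first place, the SLA of the platform you run on is not 100%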
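  18. Error budget
    • 100% - SLO = Error budget
    • The acceptable proportion of errors, calculated from the SLO
    • For example, if the SLO is "99.9% availability for API requests":
      • The error budget is 0.1%
      • For a service with 3 million requests per month, the error budget is 3,000 requests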
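The arithmetic on this slide can be checked with a few lines of Python. This is a sketch: the function names are illustrative, and `Fraction` is used only so that 99.9% does not pick up floating-point noise.

```python
from fractions import Fraction


def error_budget(slo_percent: str) -> Fraction:
    """Error budget as a fraction of all requests: 100% - SLO."""
    return 1 - Fraction(slo_percent) / 100


def allowed_failures(total_requests: int, slo_percent: str) -> int:
    """Requests that may fail in the window without breaching the SLO."""
    return int(total_requests * error_budget(slo_percent))


if __name__ == "__main__":
    print(float(error_budget("99.9") * 100))    # 0.1 (percent)
    print(allowed_failures(3_000_000, "99.9"))  # 3000
```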
  19. Use SLO and Error budget for decision making
    • For example, when the error budget is close to being used up, focus on improving service reliability rather than on developing new features
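One way to turn that rule of thumb into something a team can check automatically is sketched below. The 20% caution threshold and the function names are hypothetical; the talk does not prescribe a specific policy.

```python
def remaining_budget_ratio(allowed_failures: int, observed_failures: int) -> float:
    """Share of the error budget still unspent (negative once the SLO is breached)."""
    if allowed_failures <= 0:
        raise ValueError("allowed_failures must be positive")
    return 1 - observed_failures / allowed_failures


def next_focus(allowed_failures: int, observed_failures: int,
               caution_ratio: float = 0.2) -> str:
    """Suggest a priority; caution_ratio is a hypothetical team-chosen setting."""
    remaining = remaining_budget_ratio(allowed_failures, observed_failures)
    if remaining <= caution_ratio:
        return "reliability"
    return "features"


if __name__ == "__main__":
    print(next_focus(allowed_failures=3_000, observed_failures=600))
    print(next_focus(allowed_failures=3_000, observed_failures=2_700))
```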

  20. What makes an SLO useful and effective?
    • All stakeholders need to agree to it
    • Keep reviewing and improving it continuously

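  21. Scope of SRE work (summary)
    • DevOps: automation, Infrastructure as Code, CI/CD, etc.
    • Keep the service running stably (availability, performance) and keep improving it
    • Requires skills in infrastructure, middleware, and application implementation
    • Monitoring
    • Operations
    • Capacity planning
    • SLO and Error budget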
  22. The SRE work our team is doing

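  23. The service I work on
    • LINE's content sales platform
      • Poster session material from LINE DEVELOPER DAY 2018
        https://twitter.com/LINE_DEV/status/1073068507707789313
    • Team structure
      • Tokyo: 10+ developers
      • Fukuoka: 10+ developers + 1 SRE (the SRE is also a member of the service development team)
    • Originally the engineers developing the service also did the SRE work. As the service grew larger and more complex, an SRE role was created within the team in 2018/07, and I transferred in at that point.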
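  24. Practical topics
    • SLO and Error budget
    • On-call
      • Weekly rotation across the whole team (developers + SRE), with one primary (1st) and one secondary (2nd)
    • Postmortems
      • Write a report, share it with the people involved, and hold a meeting (people from outside the team also attend)
    • DevOps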
  25. More technical topics
    • Monitoring
    • Application operations
    • Middleware operations (Nginx, Elasticsearch, etc.)
    • Capacity planning: scale up, scale out, scale in
    • Load test
  26. A few things I have done as an SRE since joining the team
    • SLO and Error budget
    • Elasticsearch slow log monitoring

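  27. Visualizing SLOs on a dashboard
    • Set SLOs for each microservice based on its current state
    • Use two SLIs: API availability and latency
    • Built a dashboard and show it on the team's shared monitor
    • Future work
      • Aggregate over periods such as a month or a quarter
      • There are too many metrics, so a naive calculation produces a query that never returns (Prometheus recording rules look like they could solve this)
      • Share the SLOs with stakeholders and keep improving them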
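The idea behind recording rules is to precompute small aggregates at regular intervals, so that a month-long or quarter-long query only has to combine those aggregates instead of scanning every raw sample. The Python sketch below is not Prometheus configuration; it only illustrates that principle, and the hourly roll-up shape and names are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class HourlyRollup:
    """Pre-aggregated request counts for one hour (hypothetical shape)."""
    total: int
    successful: int


def roll_up(samples: Iterable[Tuple[int, bool]]) -> List[HourlyRollup]:
    """Collapse raw (unix_seconds, succeeded) samples into per-hour counts."""
    buckets: Dict[int, Tuple[int, int]] = {}
    for timestamp, succeeded in samples:
        hour = timestamp // 3600
        total, ok = buckets.get(hour, (0, 0))
        buckets[hour] = (total + 1, ok + int(succeeded))
    return [HourlyRollup(*buckets[hour]) for hour in sorted(buckets)]


def availability(rollups: Iterable[HourlyRollup]) -> float:
    """Long-window availability computed from the small roll-ups only."""
    rollups = list(rollups)
    total = sum(r.total for r in rollups)
    return sum(r.successful for r in rollups) / total if total else 1.0


if __name__ == "__main__":
    raw = [(0, True), (10, True), (3600, False), (3700, True)]
    print(availability(roll_up(raw)))
```

A month of hourly roll-ups is only a few hundred rows per service, which is why the long-window calculation stays cheap.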
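  28. Elasticsearch slow log monitoring: Motivation
    • Want to monitor with Prometheus, which we already use every day
      • Number of slow log entries across the whole cluster
      • Number of slow log entries per node
    • Want to view the logs themselves in Kibana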
  29. Elasticsearch slow log monitoring
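The transcript does not say how the slow log counts were collected, so the following is only a sketch of the two numbers named in the motivation: slow log entries per node and for the whole cluster. The sample lines and the regular expression are assumptions loosely modeled on Elasticsearch's search slow log layout; adjust them to the real log format. In practice the resulting counts would be exposed as metrics for Prometheus to scrape.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed layout: [timestamp][LEVEL][index.search.slowlog.query] [node] ...
SLOWLOG_LINE = re.compile(
    r"^\[[^\]]+\]\[\w+\s*\]\[index\.(?:search|indexing)\.slowlog[^\]]*\]"
    r"\s*\[(?P<node>[^\]]+)\]"
)


def count_slow_logs_by_node(lines: Iterable[str]) -> Counter:
    """Count slow log entries per node; lines that do not match are ignored."""
    counts: Counter = Counter()
    for line in lines:
        match = SLOWLOG_LINE.match(line)
        if match:
            counts[match.group("node")] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines, not real production logs.
    sample = [
        "[2019-03-13T10:00:00,123][WARN ][index.search.slowlog.query] [node-1] [items][0] took[1.2s]",
        "[2019-03-13T10:00:05,456][WARN ][index.search.slowlog.query] [node-2] [items][1] took[2.0s]",
        "[2019-03-13T10:00:09,789][INFO ][index.search.slowlog.fetch] [node-1] [items][0] took[800ms]",
        "unrelated log line",
    ]
    per_node = count_slow_logs_by_node(sample)
    print(dict(per_node))          # per-node counts
    print(sum(per_node.values()))  # cluster-wide count
```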

  30. Thank you :)