Slide 1

Slide 1 text

Microservices and SRECon ~ SRECon'16 Wrap Up ~ Takumi Sakamoto @takus

Slide 2

Slide 2 text

About me • Takumi Sakamoto (@takus) • SRE @ SmartNews, Inc. • Recent interest: OLAP data stores (especially druid.io) • Recent hobbies: playing with my son (born in May), reading parenting books

Slide 3

Slide 3 text

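I wrote a book on Slack: Slack入門 [ChatOpsによるチーム開発の効率化] (Introduction to Slack: Streamlining Team Development with ChatOps). Reviews: • "Introduction to Slack" is a book I'd like people who are just starting to use Slack to read too • "Introduction to Slack" looks cute, and it's cute on the inside too • Easy to follow! A recommended book with very thorough explanations, for everyone from Slack beginners to intermediate users who want to get more out of it • [Book review] "Introduction to Slack: Streamlining Team Development with ChatOps" • Slack is an environment: a review of "Introduction to Slack: Streamlining Team Development with ChatOps"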

Slide 4

Slide 4 text

I contributed a post to the AWS blog: How SmartNews Built a Lambda Architecture on AWS to Analyze Customer Behavior and Recommend Content

Slide 5

Slide 5 text

SmartNews • News discovery app for mobile • Algorithm-driven article selection • 18M+ downloads worldwide https://www.smartnews.com/

Slide 6

Slide 6 text

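SmartNews and microservices • Not a so-called microservice architecture (I think) • A news product and an ads product • Each contains its own API, data analytics platform, etc. • The main API is fairly large • For details, see the reference materials on the following slides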

Slide 7

Slide 7 text

Reference 1: The server technology behind SmartNews news delivery

Slide 8

Slide 8 text

Reference 2: SmartNews TechNight vol5, "SmartNews Ads: The Big Illustrated Guide"

Slide 9

Slide 9 text

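SmartNews voluntary overseas travel incentive program • An internal program that, once every half year, covers airfare, local transportation, lodging, communication costs, overseas travel insurance, and conference or academic meeting registration fees for a visit to the San Francisco or New York office or for overseas travel to attend a conference • Conferences not directly related to your job are OK too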

Slide 10

Slide 10 text

• Conference for Site Reliability Engineers (SRE) • April 7-8, 2016 in Santa Clara, CA. • 600+ attendees https://www.usenix.org/conference/srecon16

Slide 11

Slide 11 text

What is SRECon? • A conference for Site Reliability Engineers (SREs) • Held April 7-8 this year in Santa Clara, California • About 600 attendees, mostly from within the US

Slide 12

Slide 12 text

Some of the sessions I attended • Netflix: 190 Countries and 5 CORE SREs • Panel: Who/What Is SRE? • Shaping Reality to Shape Outcomes: Making SRE Work with Uber Growth • nrrd 911 ic me: The Incident Commander Role • A Young Lady's Illustrated Primer to Technical Decision-Making • Continuous Deployment to Millions of Users 40 Times a Day • Finding the Order in Chaos • Performance Checklists for SREs • Doorman: Global Distributed Client Side Rate Limiting • Running Consul at Scale—Journey from RFC to Production • Panel: SRE Managers

Slide 13

Slide 13 text

Search results for "Microservices" https://www.usenix.org/conference/srecon16/program

Slide 14

Slide 14 text

But microservices are taken for granted ⬇ Blogs and slides from the speakers' companies ⬇ • Netflix • MicroServices at Netflix - challenges of scale • Uber • Service-Oriented Architecture: Scaling Our Codebase As We Grow • Fastly • Microservices war stories

Slide 15

Slide 15 text

How SREs engage with microservices (apologies if this overlaps with @kenjiszk's talk)

Slide 16

Slide 16 text

Netflix: 190 Countries and 5 CORE SREs / USENIX SRECon'16

Slide 17

Slide 17 text

Freedom & Responsibility

Slide 18

Slide 18 text

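I asked how far the freedom goes. Q. You say "Freedom", but how free is it really? • Basically, everything about each service is left to its development team • Every developer is given enough access to break Netflix's systems. Q. Doesn't the service get destroyed by unintended operations? • That is where the tools SREs build become important • The tools are so convenient that developers don't choose other tools that are harder to use and more accident-prone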

Slide 19

Slide 19 text

Developers can run Ops(*) *If provided the tools and support

Slide 20

Slide 20 text

Example: Spinnaker http://techblog.netflix.com/2015/11/global-continuous-delivery-with.html

Slide 21

Slide 21 text

Develop internal tools as if you were developing a product, too. "Overall, these SRE-developed tools are full-fledged software engineering projects, distinct from one-off solutions and quick hacks, and the SREs who develop them have adopted a product-based mindset that takes both internal customers and a roadmap for future plans into account." Chapter 18. Software Engineering in SRE - Site Reliability Engineering

Slide 22

Slide 22 text

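Example: SmartNews internal PaaS • Find problems inside the company (often they have already surfaced) • Build a prototype (don't over-build; keep MVP in mind) • Find the first customers (developers) • Maintain a roadmap and develop in priority order • Dogfood it • Listen to customers' voices and follow customers' behavior history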

Slide 23

Slide 23 text

Sometimes customers even evangelize it for you

Slide 24

Slide 24 text

Days of questioning myself (still far from good enough... orz): Is the xxx I introduced really easy to use? xxx = OSS tool / deploy system / SaaS

Slide 25

Slide 25 text

Shaping Reality to Shape Outcomes: Making SRE Work with Uber Growth / USENIX SRECon'16

Slide 26

Slide 26 text

the school of hard knocks i have 99 problems, and reliability is 1

Slide 27

Slide 27 text

Example: failover. Product teams find it hard to spend time on things that rarely happen

Slide 28

Slide 28 text

Trigger it regularly to make it feel real

Slide 29

Slide 29 text

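Mechanisms that build good habits naturally. A kind of gamification? • Chaos Engineering / failover testing • Break things regularly so people stay aware of failures • They work out ways to keep things from breaking • Automatic rollback when a problem is detected at deploy time • Get them to think about and understand the part that failed • They work out ways to pass the checks in order to deploy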
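The "automatic rollback when a problem is detected at deploy time" bullet can be illustrated with a minimal sketch. This is not the mechanism any of the speakers described; `deploy_with_rollback`, `activate`, and the health-check callables are hypothetical names, and an in-memory dict stands in for a real deploy system.

```python
from typing import Callable, List


def deploy_with_rollback(
    current_version: str,
    new_version: str,
    activate: Callable[[str], None],
    health_checks: List[Callable[[], bool]],
) -> str:
    """Activate new_version; roll back to current_version if any check fails.

    Returns the version that is live when the function finishes.
    """
    activate(new_version)
    for check in health_checks:
        if not check():
            # A failed check triggers the automatic rollback.
            activate(current_version)
            return current_version
    return new_version


# Usage with in-memory stand-ins (hypothetical) for a real deploy system.
live = {"version": "v1"}


def activate(version: str) -> None:
    live["version"] = version


result_ok = deploy_with_rollback("v1", "v2", activate, [lambda: True])
result_bad = deploy_with_rollback(
    "v2", "v3", activate, [lambda: True, lambda: False]
)
```

Because a failed check blocks the release, developers have an incentive to understand the failure and fix it, which is the habit-forming effect the slide points at.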

Slide 30

Slide 30 text

nrrd 911 ic me: The Incident Commander Role / USENIX SRECon'16

Slide 31

Slide 31 text

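Incident Command System • A standardized management system developed in the United States for disaster sites, incident scenes, and the like. Its hallmark is that the chain of command and the management methods are standardized. It was developed by fire services in the 1970s, gradually spread to other government agencies, and became the de facto standard. Source: "Incident Command System" / Wikipedia (Japanese edition)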

Slide 32

Slide 32 text

Example of ICS http://www.wikiwand.com/en/Incident_Command_System

Slide 33

Slide 33 text

Simplify it to fit your process. nrrd 911 ic me: The Incident Commander Role / USENIX SRECon 16

Slide 34

Slide 34 text

nrrd 911 ic me: The Incident Commander Role / USENIX SRECon 16

Slide 35

Slide 35 text

nrrd 911 ic me: The Incident Commander Role / USENIX SRECon 16

Slide 36

Slide 36 text

nrrd 911 ic me: The Incident Commander Role / USENIX SRECon 16

Slide 37

Slide 37 text

What is good about it? • Individual roles are clearly defined • Control of the information flow • Decisions made with the impact on the whole in mind

Slide 38

Slide 38 text

How do we spread it? ChatOps

Slide 39

Slide 39 text

Incident occurs

Slide 40

Slide 40 text

Status update

Slide 41

Slide 41 text

Communication with customers

Slide 42

Slide 42 text

Response complete

Slide 43

Slide 43 text

Preparing for the retrospective
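The ChatOps flow on slides 39 through 43 (incident occurs, status update, customer communication, response complete, retrospective preparation) could be modeled as a small state machine whose log is what a chat bot would post to the channel. This is only a sketch under assumptions: the `Stage` and `Incident` names, the strictly linear ordering, and the sample incident are all hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Stage(Enum):
    OPENED = "incident opened"
    STATUS_UPDATED = "status updated"
    CUSTOMERS_NOTIFIED = "customers notified"
    RESOLVED = "response complete"
    RETRO_PREPARED = "retrospective prepared"


# Assumption: stages are visited once each, in definition order.
ORDER = list(Stage)


@dataclass
class Incident:
    title: str
    commander: str  # the Incident Commander for this incident
    stage: Stage = Stage.OPENED
    log: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        self.log.append(
            f"{self.stage.value}: {self.title} (IC: {self.commander})"
        )

    def advance(self, note: str) -> Stage:
        """Move to the next stage and record a chat-style log line."""
        idx = ORDER.index(self.stage)
        if idx == len(ORDER) - 1:
            raise ValueError("incident is already at the final stage")
        self.stage = ORDER[idx + 1]
        self.log.append(f"{self.stage.value}: {note}")
        return self.stage


# Hypothetical walk-through of one incident.
incident = Incident(title="API latency spike", commander="alice")
incident.advance("mitigation in progress")
incident.advance("status page updated")
incident.advance("latency back to normal")
incident.advance("timeline exported for the retrospective")
```

Keeping every transition in one log means the material for the retrospective is collected as a side effect of handling the incident in chat.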

Slide 44

Slide 44 text

Summary

Slide 45

Slide 45 text

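SRE should be an Enabler • Pave the way so that developers can operate their services themselves • Get developers to be conscious of reliability • Build an organization where good practices are carried out naturally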

Slide 46

Slide 46 text

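SRE should not be a Servant • Google SRE's 50% rule • Don't spend 50% or more of your time on ball-fetching chores, operations included • Are you fetching balls and feeling like you have "done work"? • You may be losing the time to fundamentally improve something • Stay properly aware of task priorities • Sometimes it is important to say No • Of course, things like assisting with incident response take priority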
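As a rough illustration of the 50% rule, the share of chore-type work in a week can be computed from logged hours. The task names, the hours, and the `toil_fraction` helper below are made up for the example; how a team actually classifies its work is an assumption left to the reader.

```python
from typing import Dict, Set


def toil_fraction(hours_by_task: Dict[str, float], toil_tasks: Set[str]) -> float:
    """Return the share of total hours spent on tasks classified as toil."""
    total = sum(hours_by_task.values())
    if total == 0:
        return 0.0
    toil = sum(h for task, h in hours_by_task.items() if task in toil_tasks)
    return toil / total


# Hypothetical week: 22 of 40 hours went to chore-type work.
week = {"tickets": 12, "manual deploys": 6, "on-call": 4, "tooling": 18}
fraction = toil_fraction(week, {"tickets", "manual deploys", "on-call"})
over_budget = fraction >= 0.5  # the slide's threshold: 50% or more is too much
```

In this made-up week the fraction is 22 / 40 = 0.55, so the rule would say it is time to push back or automate something.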

Slide 47

Slide 47 text

Be an Enabler!!! https://www.wantedly.com/projects/48033