
エンジニアのためのSRE論文への招待 / Introduction to SRE Papers for Engineers

Yuuki Tsubouchi (yuuk1)
September 30, 2023

Slides for a 20-minute talk at SRE NEXT 2023 IN TOKYO. https://sre-next.dev/2023/schedule/#jp029

This published version includes all of the slides that were skipped during the talk because of time constraints.

Transcript

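  1. Profile: Yuuki TSUBOUCHI (yuuk1), https://yuuk.io/, @yuuk1t
    ・Senior Researcher, SAKURA internet Research Center
    ・Technology Advisor, Topotal
    ・Kyoto University, Graduate School of Informatics, doctoral program (withdrew after completing coursework)
    SRE NEXT speaking history
    ・2020, keynote: an overview of SRE and the SRE research I am working on
    ・2022, open-call session: AIOps research, automating root-cause diagnosis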
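  2. Engineers and papers
    ・A blog post from nine years ago, "Systems Papers for Infrastructure Engineers" ※1, introduced papers on OS, DB, and networking
    ・Since SRE became widespread, papers have started to mention SRE and to cite the SRE book
    ・Quite a few engineers are interested in reading papers or already read them
    ・Technical papers in particular are presumably written with practitioners in mind as readers ※2
    ※1 y_uuki, "Systems Papers for Infrastructure Engineers", 2014 https://blog.yuuk.io/entry/system-papers.
    ※2 engineering-reading-papers.md, https://gist.github.com/yuuki/20f22bdd85a00630006b8dab6386881e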
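  3. Broad classification of technical papers (the presenter's own classification and terms)
    Papers on already-adopted technology: engineers read these to understand the cloud services and OSS they use
    ・Papers proposing a technology, published after it was deployed in practice
    ・Papers proposing a technology that was deployed in practice and became widespread after publication
    ・Examples: Aurora, DynamoDB, Spanner, TiDB, CockroachDB, Firecracker, gVisor, Dapper, Gorilla's delta encoding, Monarch…
    Papers on not-yet-adopted technology ※1
    ・Papers proposing a technology that is not yet widespread
    ※1 rrreeeyyy, "On the Reliability of Web Services and the Automation of Operations", invited talk, 40th meeting of the IPSJ Special Interest Group on Internet and Operation Technology, 2018 https://speakerdeck.com/rrreeeyyy/iot40-rrreeeyyy.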
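  4. What is an academic paper? (the presenter's own definition)
    A document that serves as evidence of having pushed outward the boundary of the "knowledge" humanity has accumulated
    ・It argues for new knowledge built on top of existing knowledge
    ・Figure, excerpted and partly modified from The illustrated guide to Ph.D: humanity's known and unknown territory, with stages for elementary and junior high school, high school, university, master's program, and doctoral program
    Reference: Atsuto Onoda, "Misconceptions and Truths about Doctoral Programs: Based on the Material I Used to Persuade My Parents before Enrolling", 2018. https://www.slideshare.net/atsutoonoda/ss-124873093.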
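  5. Main categories of academic papers (mainly as categorized in information science)
    Thesis or dissertation: submitted to a university or similar institution to obtain a degree
    ・Summarizes the author's research activity
    ・May stitch together the contents of several published papers
    Published paper: appears in a medium run by an academic society (ACM, IEEE, etc.)
    ・Journal paper: published in a periodical; positioned as the place to publish the final, complete version of a study
    ・Proceedings paper: presented orally at a conference; in information science, international conferences carry more weight than in other fields
    Callout on the slide: read in larger numbers in practice
    References: "Differences between Types of Papers", 2008 https://next49.hatenadiary.jp/entry/20080612/p2. "Types of Papers and Their Positioning", https://wrc.sfc.keio.ac.jp/?p=129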
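  6. Categories of academic papers by content (information science)
    Research paper
    ・Proposes a novel algorithm or system
    ・Classified by length into Full, Short, Poster, Position paper, and so on
    Review or survey paper
    ・Surveys existing papers and products, then classifies or comparatively evaluates them
    ・Titles often begin with "A Survey of/on …"
    Industrial paper (examples: ※1, ※2)
    ・Focuses on practical problems, observations, and measurements of real-world systems
    ・Unfortunately, there are not many of them
    Callout on the slide: read in larger numbers in practice
    ※1 MIDDLEWARE 2020 CALL FOR INDUSTRY PAPERS, 2022 https://middleware-conf.github.io/2022/call-for-industry-papers/.
    ※2 SIGMOD 2023 Call for Papers - Industrial Track https://2023.sigmod.org/calls_industrial_track_papers.shtml.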
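  7. What is an SRE paper? (the presenter's own term)
    Scope of "SRE papers" in this talk
    ・Software engineering: software quality, development speed, maintainability
    ・Systems software: OS, distributed systems, language processors
    ・Databases, storage
    ・Computer networks
    ・Reliability engineering: analyzes reliability against failures of machines, buildings, and the like
    ・Cloud computing
    ・Site Reliability Engineering: SLI/SLO, Observability, incident management, …
    Because this is an interdisciplinary area, drawing a clear line is difficult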
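  8. Paper search engines
    Google Scholar
    ・Indexes quickly and responds fast
    ・The notification feature described later is convenient
    ・Drawback: pages from the SREcon site show up in the results
    Connected Papers ※1
    ・Makes it easy to trace citation relationships between papers
    ・Features are limited on the free plan
    Search keywords
    ・Terms such as "Observability" tend to hit papers from other fields
    ・Include a magic word that other fields do not use (such as "Microservices")
    ・As a coarse filter, pick papers from ACM, IEEE, and USENIX
    ※1 Connected Papers, https://www.connectedpapers.com/.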
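The search-keyword advice on the slide above can be sketched as a small query builder. This is only an illustration, not something shown in the talk: the function names `build_query`, `scholar_url`, and `coarse_filter`, the `publisher` field, and the sample records are all hypothetical. The only external detail assumed is the public Google Scholar search URL with its `q` parameter.

```python
from urllib.parse import urlencode

# Publishers the slide suggests as a coarse filter.
TRUSTED_PUBLISHERS = {"ACM", "IEEE", "USENIX"}


def build_query(topic, magic_words):
    """Quote the topic and each magic word so they are searched as phrases."""
    terms = [topic, *magic_words]
    return " ".join(f'"{term}"' for term in terms)


def scholar_url(query):
    """Build a Google Scholar search URL for the query string."""
    return "https://scholar.google.com/scholar?" + urlencode({"q": query})


def coarse_filter(results):
    """Keep only records whose (hypothetical) publisher field is trusted."""
    return [r for r in results if r.get("publisher") in TRUSTED_PUBLISHERS]


query = build_query("Observability", ["Microservices"])
url = scholar_url(query)

# Hypothetical, hand-written records standing in for search results.
sample_results = [
    {"title": "Paper A", "publisher": "USENIX"},
    {"title": "Paper B", "publisher": "Unknown Press"},
]
kept = coarse_filter(sample_results)
print(url)
print([r["title"] for r in kept])
```

Adding the magic word narrows an ambiguous term such as "Observability" toward the cloud and microservices context, and the publisher check is applied afterwards as a separate, deliberately rough step.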
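  9. How to find SRE papers
    Papers cited in the SRE book: prioritizes relevance to SRE
    ・Stronger relevance to SRE
    Touring international conferences: prioritizes field, freshness, and quality
    ・Look at the programs of conferences whose Field in the earlier list includes any of:
      - Software Engineering
      - Reliability
      - Cloud Computing
    ・Finding one or two papers per conference edition is enough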
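  10. Bookmarking papers and getting notifications
    Paperpile ※1 is recommended for bookmarking
    ・Can save paper files to Google Drive with normalized file names
    ・Integrates with Google Scholar through a browser extension
    Google Scholar Alert
    ・As you cultivate it, you become able to discover papers passively
    ・Can notify by email when:
      - a paper you follow is cited by another paper
      - an author you follow publishes a new paper
    ※1 Paperpile, https://paperpile.com/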
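  11. Example of an SRE paper
    Hauer, et al., "Meaningful Availability", NSDI 2020. (The slide reproduces the first page of [Hauer+, NSDI2020].)
    ・An availability metric used in Google's G Suite
    ・A not-yet-adopted technology: it has not been offered as a service or released as OSS
    ・An application at Pinterest was presented at SREcon21 ※1
    ※1 Anika Mukherji, User Uptime in Practice, SREcon, 2021.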
  12. Other examples of SRE papers (1)
    Using eBPF-derived metrics for container placement strategies and performance analysis
    ・Neves, et al., Black-box Inter-application Traffic Monitoring for Adaptive Container Placement, SAC, 2020.
    ・Amaral, et al., MicroLens: A Performance Analysis Framework for Microservices Using Hidden Metrics With BPF, CLOUD, 2022.
    ML models that address the sampling problem in distributed tracing
    ・Huang, et al., Sieve: Attention-based Sampling of End-to-End Trace Data in Distributed Microservice Systems, ICWS, 2021.
    ・Las-Casas, et al., Sifter: Scalable Sampling for Distributed Traces, without Feature Engineering, SoCC, 2019.
    Analysis of production incidents
    ・Wu, et al., An Empirical Study on Change-induced Incidents of Online Service Systems, ICSE 2023.
    ・Ghosh, et al., How to Fight Production Incidents? An Empirical Study on a Large-scale Cloud Service, SoCC 2022.
  13. Other examples of SRE papers (2)
    Root-cause diagnosis of failures and log analysis using LLMs
    ・Ahmed, et al., Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models, ICSE 2023.
    ・Gupta, et al., Learning Representations on Logs for AIOps, CLOUD 2023.
    A dynamic scaling framework using SLOs composed from metrics
    ・Nastic, et al., SLOC: Service Level Objectives for Next Generation Cloud Computing, IEEE Internet Computing 24(3).
    ・Pusztai, et al., SLO Script: A Novel Language for Implementing Complex Cloud-Native Elasticity-Driven SLOs, ICWS, 2021.
    ・Pusztai, et al., A Novel Middleware for Efficiently Implementing Complex Cloud-Native SLOs, CLOUD, 2021.
    ・Nastic, et al., Polaris Scheduler: Edge Sensitive and SLO Aware Workload Scheduling in Cloud-Edge-IoT Clusters, CLOUD, 2021.
    ・OSS: https://github.com/polaris-slo-cloud/polaris-slo-framework.
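  14. Basic policy for reading
    When searching for papers: skim
    ・Read only the title, abstract, and figures and tables
    ・Judge quickly whether it is a paper you want to read
    When considering implementing or applying a paper: read closely
    ・Read while taking notes (see the following pages)
    ※ Until you are used to papers, it is better to read a few closely, for example by picking some from the "Examples of SRE papers" slides
    Reference: Seitaro Shinagawa, "How to Read and Write Papers, and How to Spend Time in the Lab - NAIST", 2020 http://bit.ly/naist-how-to-research.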
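  15. Typical structure of a paper in information science, and the order to read it in
    ・Abstract: summary of the paper
    ・Introduction: positioning of the paper
    ・Related Work: related research and problem setting
    ・Method: details of the proposal
    ・Experiment: experiments, evaluation, and discussion
    ・Conclusion: conclusion
    Reading-order markers as they appear in the transcript: ① ③ ② ④ ⑤ ⑥
    Questions to ask while reading
    ・What did this paper do?
    ・How far did it get?
    ・Why is this paper important?
    ・Is the problem actually solved?
    ・How does it differ from other work, and what problems follow from that difference?
    ・How was the problem solved?
    Reference: Yoichi Ochiai, "Advanced Technology and Media Expression #1 #FTMA15", 2015 https://www.slideshare.net/Ochyai/1-ftma15.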
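  16. How to read the Introduction
    The Introduction describes the whole picture of the paper, so it is extremely important. It moves from general statements to topics specific to the paper:
    ① Social background and problem awareness: a widely known problem
    ② The background closest to the paper's proposal
    ③ Existing methods and their issues: the predecessors who have approached ②
    ④ Overview of the proposal and of the evaluation: a solution to the issues in ③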
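  17. How to read the Introduction (example)
    Reproduced from Hauer, et al., "Meaningful Availability", NSDI 2020. The excerpt is annotated with ① to ④:
    ・Background and problem awareness: quantifying reliability; an SLI should be meaningful, proportional, and actionable, and there is no metric that satisfies these requirements
    ・Issues with existing methods: success ratio is biased toward active users, among others
    ・Proposed method: windowed user-uptime
    ・Evaluation method: comparison with success ratio, using G Suite data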
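The point that success ratio is biased toward active users can be illustrated with a toy calculation. The sketch below assumes that user-uptime is computed as total uptime across all users divided by total uptime plus downtime, which is how the cited paper defines it as far as this example is concerned. It skips the windowing step entirely and takes per-user minutes as given instead of deriving them from request events. The class name, function names, and all numbers are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class UserStats:
    """Hypothetical per-user aggregates over one observation period."""
    requests: int
    successes: int
    up_minutes: float
    down_minutes: float


def success_ratio(users):
    """Successful requests divided by all requests, across all users."""
    total = sum(u.requests for u in users)
    return sum(u.successes for u in users) / total


def user_uptime(users):
    """Total uptime divided by total uptime plus downtime, across all users."""
    up = sum(u.up_minutes for u in users)
    down = sum(u.down_minutes for u in users)
    return up / (up + down)


users = [
    # One very active user who saw no failures.
    UserStats(requests=1000, successes=1000, up_minutes=60, down_minutes=0),
    # One light user for whom the service was down half of the time.
    UserStats(requests=10, successes=5, up_minutes=30, down_minutes=30),
]

ratio = success_ratio(users)   # 1005 / 1010
uptime = user_uptime(users)    # 90 / 120
print(f"success ratio: {ratio:.4f}")
print(f"user-uptime:   {uptime:.4f}")
```

With these made-up numbers the heavy user's 1000 successful requests dominate the success ratio (about 0.995), while user-uptime (0.75) weights both users by time rather than by request volume.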
  18. References
    1. Seitaro Shinagawa, "How to Read and Write Papers, and How to Spend Time in the Lab - NAIST", 2020 http://bit.ly/naist-how-to-research.
    2. Yoichi Ochiai, "Advanced Technology and Media Expression #1 #FTMA15", 2015 https://www.slideshare.net/Ochyai/1-ftma15.
    3. Michio Honda, "How to Read and Find Systems Papers", https://micchie.net/files/RG-HowToPaper.pdf.
    4. joisino, "On a Daily Routine of Reading Papers", 2023 https://joisino.hatenablog.com/entry/2023/04/10/170519.
    5. S. Keshav, "How to Read a Paper", ACM SIGCOMM Computer Communication Review, 2007.