Slide 1

Slide 1 text

Yuuki Tsubouchi (@yuuk1t)
Cloud Telemetry Systems: Research Trends 2025
2025/03/13, SAKURA internet Research Center Tech Talk, Spring 2025

Slide 2

Slide 2 text

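Yuuki Tsubouchi / yuuk1
Senior Researcher, SAKURA internet Research Center
Ph.D. in Informatics, Kyoto University (to appear; thesis review and procedures are complete, awaiting final confirmation)
https://yuuk.io/
Main research areas: AIOps, eBPF instrumentation, time-series DB
A researcher of SRE
Lives in Kyoto City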

Slide 3

Slide 3 text

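Agenda
1. Introduction to the doctoral thesis theme: introducing the "telemetry workload scaling" problem
2. Three contributions of the doctoral thesis: dynamic instrumentation of the Linux kernel, time-series DB (due to time constraints, only two are presented)
3. Latest research trends along the thesis theme: a tour of the world of telemetry workload scaling, including related work
4. Summary
Items 1 and 2 cover the doctoral thesis; item 3 covers the latest research trends.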

Slide 4

Slide 4 text

1. Introduction to the Doctoral Thesis Theme

Slide 5

Slide 5 text

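Monitoring and analysis with telemetry
Diagram labels: application, system, users, Internet, data transmission, telemetry system, engineer
Dictionary definition: the process of recording the readings of an instrument and transmitting them.
Definition in this research: automatically collecting data on performance and usage from systems, applications, and services and transmitting it to a remote location for monitoring and analysis.
Key elements of telemetry [63]: remote location, instrument, transmission, analysis
[63] Frank Carden, Russell P. Jedlicka, and Robert Henry. Telemetry Systems Engineering.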

Slide 6

Slide 6 text

Typical structure of a telemetry system: divided into three layers
Measurement: OpenTelemetry, Prometheus exporter, Fluentd, eBPF, …
Storage: Prometheus, InfluxDB, VictoriaMetrics, VictoriaLogs, Loki/Tempo/Mimir, ClickHouse, Apache Iceberg, …
Analysis: Grafana, Perses, AIOps
Diagram labels: application, system, users, Internet, data transmission, telemetry system, engineer

Slide 7

Slide 7 text

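Growth of the telemetry workload
Compute resource consumption and data processing volume grow in every layer:
Measurement
•Increased CPU/memory consumption of applications
•Increased processing latency of applications
Storage
•Increased DB ingestion load
•Increased volume of data stored in the DB
Analysis
•Increased processing latency and reduced accuracy of machine learning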

Slide 8

Slide 8 text

Goal: telemetry workload scaling
Scale the measurement, storage, and analysis layers efficiently as the telemetry workload, and the overhead it causes, grows.

Slide 9

Slide 9 text

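Contributions: technically solving the remaining issues in each layer
Issues addressed: growing compute resource consumption and growing data processing volume
Contribution ① (measurement): network traces
Contribution ② (storage): metrics
Contribution ③ (analysis)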

Slide 10

Slide 10 text

2. Three Contributions of the Doctoral Thesis (Due to time constraints, only the first two are presented.)

Slide 11

Slide 11 text

Contribution ①: Measurement
Issue: growing compute resource consumption
To obtain a network call graph, this work focuses on a low-overhead instrumentation method based on eBPF.
Y. Tsubouchi, M. Furukawa, R. Matsumoto, Low Overhead TCP/UDP Socket-based Tracing for Discovering Network Services Dependencies, JIP, 2022.

Slide 12

Slide 12 text

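Network call graph: how do we measure it?
Example components: Cloud Load Balancers, Web app servers, Database Clusters, Message queues
We want to know the call relationships between components.
- We want to know the impact range of a change.
- We want to know per-link metrics.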

Slide 13

Slide 13 text

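Instrumentation approaches
A measurement point is placed somewhere on the network communication path: App, Proxy, Network Stack (kernel), NIC, or Switch.
This work focuses on instrumentation at the upper layer of the kernel (sockets).
Versus App: no application changes are required.
Versus Proxy: no relay overhead.
Versus Switch: the measurement load can be distributed across end hosts.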

Slide 14

Slide 14 text

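Instrumentation methods at the socket layer
※ Flow = a unit of communication whose address and port pairs at both ends are identical. In the diagrams, arrows represent the flow of data from the kernel measurement point to the user-space agent.
Streaming method ([96,97])
✗ As the number of messages grows, the number of measurements transferred to user space grows.
Flow aggregation method ([27,98])
✔ Only measurements aggregated per flow are stored, which reduces the amount of transferred data.
✗ As short-lived flows increase, the amount of transferred data also increases.
Flow bundling method (proposed)
Flows with the same destination are bundled together (Flow1 to Flow4 become Bundle 1 and Bundle 2).
✔ The amount of transferred data stays low even with many short-lived flows.
✔ Measurement overhead is small.
References:
JinJin Lin, et al. "Microscope: Pinpoint Performance Issues with Causal Graphs in Micro-Service Environments". ICSOC.
Weave Scope. https://github.com/weaveworks/scope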
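The difference between per-flow aggregation and flow bundling can be illustrated with a small sketch. This is not the eBPF implementation from the paper; the event tuple layout and the bundle key (source address, destination address, destination port) are assumptions chosen only to show why dropping the ephemeral source port shrinks the number of records.

```python
from collections import defaultdict

# Each event is (src_ip, src_port, dst_ip, dst_port, nbytes); this layout is hypothetical.

def aggregate_per_flow(events):
    """Flow aggregation: keep one record per distinct flow (full 4-tuple)."""
    totals = defaultdict(int)
    for src_ip, src_port, dst_ip, dst_port, nbytes in events:
        totals[(src_ip, src_port, dst_ip, dst_port)] += nbytes
    return dict(totals)

def bundle_by_destination(events):
    """Flow bundling: ignore the ephemeral source port so that flows
    to the same destination share a single record."""
    totals = defaultdict(int)
    for src_ip, _src_port, dst_ip, dst_port, nbytes in events:
        totals[(src_ip, dst_ip, dst_port)] += nbytes
    return dict(totals)

# 1000 short-lived connections from one client to one database port.
events = [("10.0.0.1", 40000 + i, "10.0.0.9", 5432, 100) for i in range(1000)]
per_flow = aggregate_per_flow(events)
bundles = bundle_by_destination(events)
print(len(per_flow), len(bundles))
```

With many short-lived connections, the per-flow table grows with the number of connections while the bundled table grows only with the number of distinct destinations.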

Slide 15

Slide 15 text

Instrumentation methods at the socket layer
Streaming method ([96,97]): ✗ transfers to user space grow with the number of messages.
Flow aggregation method ([27,98]): ✔ only per-flow aggregated measurements are stored; ✗ transferred data grows when short-lived flows increase.
Flow bundling method (proposed): flows with the same destination are bundled; ✔ transferred data stays low even with many short-lived flows; ✔ measurement overhead is small.
References:
Francisco Neves, et al. "Black-box Inter-application Traffic Monitoring for Adaptive Container Placement". SAC.
Datadog Network Performance Monitoring. https://docs.datadoghq.com/network_monitoring/performance/

Slide 17

Slide 17 text

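Experiment: CPU load as the number of short-lived TCP flows grows
Proposed method: CPU utilization stays at or below 2.2%.
Streaming method: CPU utilization rises to as much as 21.3%.
In-kernel aggregation method: CPU utilization rises to as much as 11.5%.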

Slide 18

Slide 18 text

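Contribution ②: Storage
Issue: growing compute resource consumption
Achieve both higher ingestion efficiency for time-series data (metrics) and lower long-term data retention cost.
Y. Tsubouchi, A. Wakisaka, K. Hamada, M. Matsuki, T. Kobayashi, H. Abe, R. Matsumoto, HeteroTSDB: A High-Performance Time-Series Database with Automatic Tiering across Heterogeneous Distributed KVSs (in Japanese), IPSJ Journal, 2021.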

Slide 19

Slide 19 text

Workload of metrics storage
The ingestion workload for metrics is proportional to two dimensions:
① Resolution (typically in the range of 1 to 60 seconds)
② Number of metrics
Example series ingested over time:
cpu_seconds{instance=host1,…}
memory_total_bytes{instance=host1,…}
http_requests_count{instance=host1,…}
http_requests_count{instance=host99,…}

Slide 20

Slide 20 text

Workload of metrics storage
The ingestion workload for metrics is proportional to two dimensions:
① Resolution (typically in the range of 1 to 60 seconds)
② Number of metrics
Example series ingested over time:
cpu_seconds{instance=host1,…}
memory_total_bytes{instance=host1,…}
http_requests_count{instance=host1,…}
http_requests_count{instance=host99,…}
The number of metrics grows as they become multi-dimensional. For example, cpu_seconds{instance=host1,…} decomposes into:
cpu_seconds{instance=host1,mode=user,core_no=1,…}
cpu_seconds{instance=host1,mode=system,core_no=1,…}
cpu_seconds{instance=host1,mode=user,core_no=2,…}
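As a rough illustration of the two dimensions above, the following sketch estimates the number of series and the resulting ingestion rate. The label cardinalities (99 instances, 2 modes, 8 cores) and the 10-second resolution are hypothetical values, and the model assumes every label combination actually exists.

```python
from math import prod

def series_count(label_cardinalities):
    """Upper bound on series for one metric name: product of label cardinalities."""
    return prod(label_cardinalities.values())

def datapoints_per_second(num_series, resolution_seconds):
    """Each series emits one datapoint per resolution interval."""
    return num_series / resolution_seconds

# Hypothetical cardinalities for cpu_seconds{instance, mode, core_no}.
labels = {"instance": 99, "mode": 2, "core_no": 8}
n = series_count(labels)
rate = datapoints_per_second(n, resolution_seconds=10)
print(n, rate)
```

Adding one more label with cardinality c multiplies the series count, and therefore the ingestion rate, by c, which is why multi-dimensional metrics inflate the workload so quickly.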

Slide 21

Slide 21 text

Scalability requirements for metrics storage
Ingestion throughput
Slack: 12M datapoints/sec; Meta: 700M datapoints/min; LYCorp: 12.5M datapoints/min [19] [32] [112]
Common solutions:
・Ingestion across multiple horizontally partitioned nodes
・Efficient writes to in-memory data structures
Data storage volume
Slack: 12 TB/day; ByteDance: 10 TB/day; LYCorp: 2.7 TB/day; Mackerel: 460 days [19] [35] [69] [112]
Common solutions: data compression and long-term retention on cold storage
References:
Suman Karumuri, et al. "Towards Observability Data Management at Scale". SIGMOD Record.
Tuomas Pelkonen, et al. "Gorilla: A Fast, Scalable, In-Memory Time Series Database". VLDB.
Xuanhua Shi, et al. "ByteSeries: An In-Memory Time Series Database for Large-Scale Monitoring Systems". SoCC.
Hiroki Sakamoto. Scaling Time Series Data to Infinity: A Kubernetes-Powered Solution with Envoy. https://speakerdeck.com/lycorptech_jp/scaling_tsdb_inifinitely_with_oss
Mackerel. https://mackerel.io

Slide 22

Slide 22 text

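Ingestion efficiency of KVSs
A growing number of metrics means a growing number of KVS keys.
Memory-based KVS (Redis, Valkey, Dragonfly, …)
- Memory is efficient for random access, so a hash table is used. Write cost: O(k).
- No data is kept on disk (except commit logs and snapshots).
Disk-based KVS (HBase, Cassandra, …)
- Writes go to a sorted in-memory structure such as a balanced tree or skip list, at O(log n), and are then flushed to files on disk.
- Because the data is sorted, disk access is efficient.
↳ The cost of managing internal objects grows, for example the efficiency of index lookups when data is added.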

Slide 23

Slide 23 text

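Ingestion efficiency of KVSs
A growing number of metrics means a growing number of KVS keys.
Memory-based KVS (Redis, Valkey, Dragonfly, …)
- Memory is efficient for random access, so a hash table is used. Write cost: O(k).
- No data is kept on disk (except commit logs and snapshots).
✘ Memory costs more per unit of capacity, so it is unsuitable for long-term retention.
Disk-based KVS (HBase, Cassandra, …)
- Writes go to a sorted in-memory structure such as a balanced tree or skip list, at O(log n), and are then flushed to files on disk.
- Because the data is sorted, disk access is efficient.
✘ When the number of keys is large, write efficiency drops.
↳ The cost of managing internal objects grows, for example the efficiency of index lookups when data is added.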

Slide 24

Slide 24 text

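Proposed method: HeteroTSDB
Data path: App (Client) → memory-based KVS (Redis) → Flusher (data migration) → disk-based KVS (Cassandra)
Memory-based KVS: a memory buffer that holds data with recent timestamps; fast ingestion based on a hash table.
Disk-based KVS: disk storage that holds data with old timestamps; storing on SSD/HDD lowers the cost of long-term retention.
Migrating data between the two tiers achieves both goals at once.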
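The tiering idea can be sketched in a few lines. This is a toy model, not the HeteroTSDB implementation: a Python dict stands in for the memory-based KVS, a sorted list stands in for the disk-based KVS, and the class name, method names, and age-based flush rule are all assumptions made for illustration.

```python
import bisect

class TieredStore:
    """Toy two-tier store: recent points in a hash table, old points in a sorted list."""

    def __init__(self, max_age_seconds):
        self.max_age = max_age_seconds
        self.memory = {}  # series name -> list of (timestamp, value)
        self.disk = []    # sorted list of (series, timestamp, value)

    def write(self, series, ts, value):
        # Ingestion only touches the hash table, mirroring the fast path.
        self.memory.setdefault(series, []).append((ts, value))

    def flush(self, now):
        """Migrate points older than max_age from the memory tier to the disk tier."""
        moved = 0
        for series, points in self.memory.items():
            keep = []
            for ts, value in points:
                if now - ts > self.max_age:
                    bisect.insort(self.disk, (series, ts, value))
                    moved += 1
                else:
                    keep.append((ts, value))
            self.memory[series] = keep
        return moved

    def read(self, series, start, end):
        """Merge results from both tiers so readers see one logical series."""
        old = [(ts, v) for s, ts, v in self.disk if s == series and start <= ts <= end]
        new = [(ts, v) for ts, v in self.memory.get(series, []) if start <= ts <= end]
        return sorted(old + new)

store = TieredStore(max_age_seconds=60)
store.write("cpu", 0, 1.0)
store.write("cpu", 100, 2.0)
moved = store.flush(now=100)
points = store.read("cpu", 0, 200)
```

The design choice the sketch highlights is that writers never wait on the sorted structure; only the background flusher pays that cost.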

Slide 25

Slide 25 text

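Experiment: comparison of ingestion efficiency
Plot: ingestion throughput versus number of hosts (1 to 8), with the number of metrics fixed at 1M. Blue: KairosDB; orange: proposed method.
The proposed method (HeteroTSDB) reaches 420k datapoints/s, 3.98 times the baseline.
Scaled to Slack's workload of 12M datapoints/s, this works out to
- 229 hosts for the proposed method
- 915 hosts for KairosDB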
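The host count for the proposed method can be reproduced with simple arithmetic, assuming the 420k datapoints/s figure was measured at 8 hosts and that throughput scales linearly with the number of hosts. Both are assumptions read from the plot description, and the function name is hypothetical.

```python
import math

def hosts_needed(target_rate, measured_rate, measured_hosts):
    """Hosts required for target_rate under linear scaling of per-host throughput."""
    per_host = measured_rate / measured_hosts
    return math.ceil(target_rate / per_host)

# 12M datapoints/s target, 420k datapoints/s measured on 8 hosts.
proposed = hosts_needed(12_000_000, 420_000, 8)
print(proposed)
```

Per-host throughput is 52,500 datapoints/s, so 12,000,000 / 52,500 ≈ 228.6, which rounds up to the 229 hosts quoted on the slide.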

Slide 26

Slide 26 text

Contribution ③: Analysis (skipped in this talk)
Issue: growing load on compute resources
Proposes a preprocessing method that uses unsupervised machine learning to automatically remove metrics unrelated to a failure.
Y. Tsubouchi and H. Tsuruta, MetricSifter: Feature Reduction of Multivariate Time Series Data for Efficient Fault Localization in Cloud Applications, IEEE Access, 2024.

Slide 27

Slide 27 text

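Overview: design guidelines for telemetry systems
Basic principle: apply data reduction such as sampling, aggregation, and feature reduction at the places where context is rich (instrumentation and mining).
Measurement (application context: processes, sockets, transactions, etc.): Contribution ① aggregates based on sockets.
Storage: no data reduction; aims to improve the utilization efficiency of compute resources.
Analysis (operational context: failures, alerts, etc.): Contribution ③ reduces features based on failure occurrence.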

Slide 28

Slide 28 text

For details of the doctoral thesis, see the slides below or the full-text PDF to be published soon.
https://speakerdeck.com/yuukit/phddefence

Slide 29

Slide 29 text

3. Latest Research Trends Along the Thesis Theme
Selected mainly from top conferences in software engineering and systems software.

Slide 30

Slide 30 text

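The world of telemetry workload scaling
Layers of a telemetry system: measurement, storage, analysis
Many research papers have been published for each type of telemetry data, such as metrics, logs, traces, profiles, dumps, and sessions.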

Slide 31

Slide 31 text

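The world of telemetry workload scaling
Measurement
•The largest number of papers.
•Proposals for smart sampling methods for tracing are especially common.
Storage
•Many products but few papers.
•Proposals include compression of metadata (multi-dimensional labels) and query pushdown mechanisms.
Analysis
•AIOps papers are published in large numbers, but few deal with scaling.
•Filtering such as noise removal.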

Slide 32

Slide 32 text

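The world of telemetry workload scaling
Measurement (today's focus)
•The largest number of papers.
•Proposals for smart sampling methods for tracing are especially common.
Storage
•Many products but few papers.
•Proposals include compression of metadata (multi-dimensional labels) and query pushdown mechanisms.
Analysis
•AIOps papers are published in large numbers, but few deal with scaling.
•Filtering such as noise removal.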

Slide 33

Slide 33 text

Motivation for data reduction in the measurement layer
If the data volume is reduced in the measurement layer, the load on the subsequent layers can be reduced as well.
"99.99% of traces are trash" [186]
At WeChat, a single log template accounted for 95.7% of all storage. [71]
[186] Paige Cruz, "99.99% of Your Traces Are (Probably) Trash", SREcon24.
[71] Guangba Yu, et al. "LogReducer: Identify and Reduce Log Hotspots in Kernel on the Fly". ICSE. 2023.

Slide 34

Slide 34 text

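Research trends in the measurement layer: overview
Metrics (skipped today)
PMF (CLOUD, 2024)
- Dynamically determine the important metrics and raise their measurement frequency.
- Send a value only when it deviates greatly from the preceding value.
Traces
- Tail sampling with machine learning models: select only the small number of valuable traces, taking into account trace structure, execution time, diversity, and runtime state (system metrics).
- Retroactive sampling: after a fault is detected, go back in time to measure and collect all traces.
- Compression (retaining approximate information for all traces): decompose traces into common and variable parts and deduplicate.
Logs
- Automatic hotspot discovery: automatically find the log templates that dominate storage consumption.
- Reactive collection when a failure occurs: compute the blast radius of the fault and collect data only from nodes within that range.
PMF: Chakraborty, Aishwariya, et al. "Enabling Programmable Metric Flows." CLOUD, 2024.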
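The "send only on large deviation" idea for metrics can be sketched as a simple send-on-delta filter. This is a generic illustration, not PMF's mechanism; in particular, it compares each sample with the last value that was actually sent rather than the immediately preceding sample, which is an assumption made so that slow drifts are still reported eventually.

```python
def send_on_delta(samples, threshold):
    """Return only the (timestamp, value) samples that differ from the
    last sent value by more than threshold."""
    sent = []
    last = None
    for ts, value in samples:
        if last is None or abs(value - last) > threshold:
            sent.append((ts, value))
            last = value
    return sent

samples = [(0, 10.0), (1, 10.2), (2, 10.4), (3, 12.0), (4, 12.1)]
sent = send_on_delta(samples, threshold=0.5)
print(sent)
```

In this toy input, three of the five samples are suppressed because they stay within 0.5 of the last transmitted value.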

Slide 35

Slide 35 text

Research trends in the measurement layer: traces
Approaches: tail sampling with machine learning models (this slide), retroactive sampling, and compression.
Tail sampling with machine learning models selects only the small number of valuable traces, taking into account trace structure, execution time, diversity, and runtime state (system metrics).
- Sifter (SoCC, 2019): builds a model of normal behavior from trace data and prioritizes traces that are anomalous or outliers.
- Sieve (ICWS, 2021): prioritizes traces that are rare structurally and temporally, not only anomalous ones.
- STEAM (FSE, 2023) (Microsoft): maintains diversity per attribute, such as API, structure, latency, and status code.
- TraStrainer (FSE, 2024): prioritizes traces that are strongly related to changes in system state (changes in metrics).
Sifter: Las-Casas, Pedro, et al. "Sifter: Scalable sampling for distributed traces, without feature engineering." SoCC, 2019.
Sieve: Huang, Zicheng, et al. "Sieve: Attention-based sampling of end-to-end trace data in distributed microservice systems." ICWS, 2021.
STEAM: He, Shilin, et al. "STEAM: Observability-preserving trace sampling." ESEC/FSE, 2023.
TraStrainer: Huang, Haiyu, et al. "Trastrainer: Adaptive sampling for distributed traces with system runtime state." ESEC/FSE, 2024.

Slide 36

Slide 36 text

Research trends in the measurement layer: traces
Approach: compression (retaining approximate information for all traces). Traces are decomposed into common and variable parts and deduplicated.
Mint (ASPLOS, 2025) (Alibaba)
All traces (100%) are split into:
- Commonality (patterns): all stored
- Variability (parameters): only the important ones stored
Results:
• Storage usage: reduced to 2.7% of the original
• Network usage: reduced to 4.2% of the original
Limits of the "1 or 0" approach: even with 5% of traces retained, the miss rate of analysis queries was 27.17%.
Huang, Haiyu, et al. "Mint: Cost-Efficient Tracing with All Requests Collection via Commonality and Variability Analysis." arXiv preprint arXiv:2411.04605 (2024).
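A minimal sketch of splitting trace data into a common part and a variable part is shown below. It is not Mint's algorithm: treating every run of digits as the variable part, the `<*>` placeholder, and the function names are all assumptions chosen to keep the example short.

```python
import re

NUMBER = re.compile(r"\d+")

def split_span(text):
    """Split a span description into a reusable pattern and its parameters."""
    params = NUMBER.findall(text)
    pattern = NUMBER.sub("<*>", text)
    return pattern, params

def compress(spans):
    """Store each pattern once and keep only (pattern id, parameters) per span."""
    patterns = {}  # pattern text -> pattern id
    records = []
    for text in spans:
        pattern, params = split_span(text)
        pid = patterns.setdefault(pattern, len(patterns))
        records.append((pid, params))
    return patterns, records

spans = [
    "GET /users/17 took 12 ms",
    "GET /users/42 took 250 ms",
    "POST /orders took 31 ms",
]
patterns, records = compress(spans)
print(patterns)
print(records)
```

A further step in the spirit of the slide would be to keep every pattern but drop the parameter lists of unimportant spans; the sketch stops at the decomposition itself.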

Slide 37

Slide 37 text

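Research trends in the measurement layer: traces
Approach: retroactive sampling. Traces are retrieved by going back in time only after a fault is detected.
Hindsight (NSDI, 2023)
https://gitlab.mpi-sws.org/cld/tracing/hindsight
- Application: generates traces and stores them in local memory.
- Hindsight Agent: manages metadata and maintains trace metrics.
- Breadcrumbs: pointers used to reconstruct the path afterwards.
- Coordinator: follows the breadcrumbs and coordinates trace consistency across machines.
- Trigger: causes the buffered traces to be sent to the storage layer.
Zhang, Lei, et al. "The benefit of hindsight: Tracing Edge-Cases in distributed systems." NSDI, 2023.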
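The core idea of retroactive sampling, keeping recent trace data in local memory and exporting it only when a trigger fires, can be sketched with a bounded buffer. This is a single-process toy, not Hindsight's agent or coordinator protocol; the class and method names are hypothetical.

```python
from collections import deque

class RetroactiveBuffer:
    """Keep the most recent spans in memory; export a trace only on trigger."""

    def __init__(self, capacity):
        # A bounded deque evicts the oldest entry automatically when full.
        self.recent = deque(maxlen=capacity)
        self.collected = []

    def record(self, trace_id, span):
        self.recent.append((trace_id, span))

    def trigger(self, trace_id):
        """Called after a fault is detected: retrieve whatever is still
        buffered for that trace."""
        hits = [span for tid, span in self.recent if tid == trace_id]
        self.collected.extend(hits)
        return hits

buf = RetroactiveBuffer(capacity=3)
buf.record("a", "s1")
buf.record("b", "s2")
buf.record("a", "s3")
buf.record("c", "s4")  # evicts ("a", "s1")
hits = buf.trigger("a")
print(hits)
```

The sketch also shows the trade-off: anything evicted before the trigger fires is lost, so the buffer size bounds how far back in time a trace can be recovered.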

Slide 38

Slide 38 text

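Research trends in the measurement layer: logs
Approach: reactive collection when a failure occurs. The blast radius of the fault is computed, and data is collected only from nodes within that range.
SALO (CLOUD, 2024) (IBM Research)
Existing approach: a large volume of logs is collected. SALO: only the reduced set of logs inside the blast radius of the fault is collected.
Results:
•Up to 95% reduction in log volume
•Up to 20% performance improvement in downstream AIOps tasks
Pathak, Divya, et al. "Self Adjusting Log Observability for Cloud Native Applications." CLOUD, 2024.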
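One simple way to picture a blast radius is a bounded breadth-first search over a service dependency graph. This is an assumption made for illustration, not SALO's actual computation; the graph shape, the hop limit, and the function name are hypothetical.

```python
from collections import deque

def blast_radius(graph, faulty, max_hops):
    """Return the nodes reachable from the faulty node within max_hops edges."""
    hops = {faulty: 0}
    queue = deque([faulty])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return set(hops)

# Edges point from a component to the components that depend on it.
graph = {
    "db": ["api"],
    "api": ["web", "worker"],
    "web": ["lb"],
    "worker": [],
    "lb": [],
}
targets = blast_radius(graph, "db", max_hops=2)
print(sorted(targets))
```

Detailed log collection would then be enabled only on the returned nodes, leaving the rest of the system at its normal logging level.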

Slide 39

Slide 39 text

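Research trends in the measurement layer: logs
Approach: automatic hotspot discovery. The log templates that dominate storage consumption are found automatically.
LogReducer (ICSE, 2023) (Tencent)
Problem (WeChat: 19.7 PB and 100 trillion lines per day): a single log template accounted for 95.7% of all storage.
Method: hook the write() system call with eBPF and drop the log output once it is judged to be a hotspot.
Results:
- Reduction: 19.7 PB → 12.0 PB per day (about 39% less)
- Time to fix: 9 days → 10 minutes
Yu, Guangba, et al. "Logreducer: Identify and reduce log hotspots in kernel on the fly." ICSE, 2023.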
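Hotspot discovery can be pictured as measuring each log template's share of the total bytes written. The sketch below works on already-parsed (template, bytes) pairs in user space; it is not LogReducer's eBPF hook, and the threshold and function name are assumptions.

```python
from collections import Counter

def find_hotspots(entries, threshold):
    """Return {template: share of total bytes} for templates whose share
    is at least threshold."""
    usage = Counter()
    for template, nbytes in entries:
        usage[template] += nbytes
    total = sum(usage.values())
    if total == 0:
        return {}
    return {t: b / total for t, b in usage.items() if b / total >= threshold}

entries = [("conn ok from <*>", 500), ("timeout after <*> ms", 43), ("conn ok from <*>", 457)]
hotspots = find_hotspots(entries, threshold=0.5)
print(hotspots)
```

Once a template is flagged, a filter at the write path could drop or sample its output, which is the step the slide describes being done in the kernel.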

Slide 40

Slide 40 text

4. Summary

Slide 41

Slide 41 text

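Summary
・Introduced a comprehensive concept, built in the doctoral thesis, called telemetry workload scaling.
・Presented two contributions that follow this concept (network traces and time-series DB).
・Introduced the world of this concept by organizing the latest research papers according to it.
Many data reduction methods have been developed that act at measurement time and adapt dynamically to the runtime state, based on application context and operational context.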