Slide 1

Slide 1 text

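The Future of Linux Containers: Thinking from the State of the Art of the Component Technologies
Uchio Kondo (近藤宇智朗) / GMO Pepabo, Inc.
2021.02.21 "New Security Business Careers" (「新しいセキュリティビジネスキャリア」) Symposium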

Slide 2

Slide 2 text

GMO Pepabo, Inc. (GMOペパボ株式会社)
Senior Principal, Technical Infrastructure Team, Engineering Department
Uchio Kondo (@udzura)

Slide 3

Slide 3 text

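Uchio Kondo: Biography
• From Mikawa Province. Graduated from Jishukan High School (the former Yoshida Domain school), then earned a bachelor's degree in Japanese Language and Literature from the Faculty of Letters, University of Tokyo (2007).
• Worked as an in-house systems engineer at a media company, then on e-commerce site development and online game development; in the current position since 2013, and moved to Fukuoka the same year.
• Active in communities around Ruby, containers, and cloud native technologies. Author of 「Webで使えるmrubyシステムプログラミング入門」 (C&R研究所).
• Favorite system call (lately): socketpair(2).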

Slide 4

Slide 4 text

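Today's talk
• This content is aimed at people who have (roughly) understood the component technologies of containers from my lecture. For explanations of the component technologies, please see the reference materials.
• Reference 1: https://container-security.dev/
• Reference 2: 『コンテナ型仮想化概論』 (川口, カットシステム)
• I will introduce the new technologies in recent kernels that are related to containers.
• If time allows, I will then consider the future of containers on that basis.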

Slide 5

Slide 5 text

cgroup v2

Slide 6

Slide 6 text

cgroup recap
• One of the basic technologies of the Linux kernel.
• Groups processes and applies resource usage limits per group: CPU, memory, IO, number of processes, ...
https://gihyo.jp/admin/serial/01/linux_containers

Slide 7

Slide 7 text

cgroup v1 example
• Operated by running mkdir(2), read(2), write(2), and so on against a filesystem called cgroupfs.
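As a minimal sketch of what "operating cgroupfs through mkdir(2), read(2), and write(2)" can look like, the Python below creates a group and then writes and reads one of its files. The helper names, the group name `demo`, and the `root` argument are assumptions for illustration. On a real v1 host, `root` would be a controller directory such as /sys/fs/cgroup/memory, root privileges would usually be needed, and the kernel (not the script) would populate files such as memory.limit_in_bytes when the directory is created. The demo at the bottom uses a temporary directory as a stand-in, so it only shows the shape of the file operations.

```python
import os
import tempfile


def create_group(root: str, name: str) -> str:
    """mkdir(2): on cgroupfs, creating a directory creates a cgroup."""
    path = os.path.join(root, name)
    os.makedirs(path, exist_ok=True)
    return path


def write_value(group: str, filename: str, value: str) -> None:
    """write(2): set a limit, or attach a PID by writing to cgroup.procs."""
    with open(os.path.join(group, filename), "w") as f:
        f.write(value)


def read_value(group: str, filename: str) -> str:
    """read(2): read back a setting or a statistic."""
    with open(os.path.join(group, filename)) as f:
        return f.read().strip()


if __name__ == "__main__":
    # Assumption: a throwaway directory stands in for /sys/fs/cgroup/memory.
    with tempfile.TemporaryDirectory() as root:
        group = create_group(root, "demo")
        write_value(group, "memory.limit_in_bytes", str(256 * 1024 * 1024))
        print(read_value(group, "memory.limit_in_bytes"))
```

Taking `root` as a parameter keeps the sketch runnable anywhere; pointing it at a real controller directory is what would turn the same calls into actual cgroup operations.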

Slide 8

Slide 8 text

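cgroup v2
• Developed to overcome several shortcomings of v1, chiefly the requirement that directories be split per controlled resource (controller).
• The major difference: in v1, a directory is mounted per controller and a process can join groups controller by controller; in v2, only a single directory covering all controllers is mounted, and a group is created across all of them at once.
• For containers, the v2 behavior is more convenient.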

Slide 9

Slide 9 text

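Unified hierarchy
v1: /sys/fs/cgroup contains one directory per controller (/cpu, /memory, /blkio, ...), and each controller directory holds its own groups (/group-a, /group-b, /group-c, ...).
v2: /sys/fs/cgroup contains the groups directly (/group-a, /group-b), and each group holds the files of every controller (cpu.*, memory.*, io.*, ...).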

Slide 10

Slide 10 text

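Using cgroup in container runtimes
• (OCI-family) runtimes have the following two settings:
• Cgroup Driver: how the cgroup assigned to a container is controlled
  • cgroupfs: direct file operations on cgroupfs
  • systemd: management through systemd
• Cgroup Version: whether v1 or v2 is used for resource limits
  • Determined by which filesystem is mounted at /sys/fs/cgroup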
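The "which filesystem is mounted at /sys/fs/cgroup" check can be sketched in Python by reading a mounts table. This is an illustration, not any runtime's actual implementation: the function name and the "v1"/"v2"/"unknown" labels are assumptions, and a hybrid setup (tmpfs at /sys/fs/cgroup with a cgroup2 mount below it) is simply reported as v1 here.

```python
import os


def detect_cgroup_version(mounts_text: str, mountpoint: str = "/sys/fs/cgroup") -> str:
    """Classify a host as "v2", "v1", or "unknown" from /proc/mounts-style text."""
    result = "unknown"
    for line in mounts_text.splitlines():
        fields = line.split()
        # Each line is: device mountpoint fstype options dump pass
        if len(fields) < 3 or fields[1] != mountpoint:
            continue
        fstype = fields[2]
        if fstype == "cgroup2":
            result = "v2"  # the unified hierarchy is mounted directly
        elif fstype == "tmpfs":
            result = "v1"  # a tmpfs that holds per-controller v1 mounts
        # Later lines win, because a later mount covers an earlier one.
    return result


if __name__ == "__main__":
    if os.path.exists("/proc/self/mounts"):
        with open("/proc/self/mounts") as f:
            print(detect_cgroup_version(f.read()))
```

Passing the mounts table in as text keeps the decision logic separate from the host it runs on, which makes it easy to try against sample input.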

Slide 11

Slide 11 text

New features in v2
• Unified Hierarchy
• PSI (Pressure Stall Information)
• The cgroup ID can be obtained from eBPF
• nsdelegate (important for unprivileged containers)
• clone3(2) can create a process directly inside a specific cgroup
• and more...

Slide 12

Slide 12 text

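e.g. PSI (Pressure Stall Information)
• A load metric available system-wide or per cgroup
• Measures the share of a unit of time spent stalled on CPU, memory, or IO
• e.g. if some process in the group was delayed because of CPU for 45 seconds out of 1 minute: cpu some: 75.00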
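The arithmetic behind the 75.00 figure, and the text format in which PSI values are exposed, can be sketched as follows. The function names are assumptions for illustration. The sample line follows the `some avg10=... avg60=... avg300=... total=...` layout used by /proc/pressure/cpu and by cpu.pressure in a cgroup v2 directory. The kernel tracks these percentages as recent trends over its windows, so the plain ratio below conveys the intuition rather than the exact calculation.

```python
def stall_percentage(stalled_seconds: float, window_seconds: float) -> float:
    """Share of the window during which at least one task was stalled, in percent."""
    return 100.0 * stalled_seconds / window_seconds


def parse_psi(text: str) -> dict:
    """Parse PSI text into {"some": {"avg10": ..., ...}, "full": {...}}."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        kind, pairs = parts[0], parts[1:]
        # Every remaining token is key=value; total is in microseconds.
        result[kind] = {key: float(value) for key, value in (p.split("=", 1) for p in pairs)}
    return result


if __name__ == "__main__":
    print(stall_percentage(45, 60))
    sample = "some avg10=0.00 avg60=75.00 avg300=0.00 total=45000000"
    print(parse_psi(sample)["some"]["avg60"])
```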

Slide 13

Slide 13 text

e.g. tracking information in eBPF
• The bpf_get_current_cgroup_id(void) helper
• Returns the ID of the cgroup (v2) to which the task that triggered the eBPF event belongs.
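To match an ID reported by the helper against a container from user space, one commonly used approach is to compare it with the inode number of the cgroup directory. The sketch below rests on that assumption (the cgroup v2 ID equals the directory's inode number, as on typical 64-bit systems); the function names are hypothetical, and the slides do not say which lookup the author uses.

```python
import os


def own_cgroup_v2_path(proc_cgroup_text: str, root: str = "/sys/fs/cgroup") -> str:
    """Map the "0::<path>" line of /proc/<pid>/cgroup to a directory under root."""
    for line in proc_cgroup_text.splitlines():
        if not line.strip():
            continue
        hierarchy_id, controllers, path = line.split(":", 2)
        # cgroup v2 entries have hierarchy ID 0 and an empty controller list.
        if hierarchy_id == "0" and controllers == "":
            return os.path.normpath(os.path.join(root, path.lstrip("/")))
    raise LookupError("no cgroup v2 entry found")


def cgroup_id_of(directory: str) -> int:
    """Assumption: the cgroup v2 ID is the inode number of its directory."""
    return os.stat(directory).st_ino
```

On a cgroup v2 host, `cgroup_id_of(own_cgroup_v2_path(open("/proc/self/cgroup").read()))` would give the value to compare with what the BPF program reports for the current process.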

Slide 14

Slide 14 text

How cgroup-v2 and PSI Impacts Cloud Native? Uchio Kondo / GMO Pepabo, Inc. 2019.07.23 CloudNative Days Tokyo 2019 Image from pixabay: https://pixabay.com/images/id-3193865/

Slide 15

Slide 15 text

eBPF per containers

Slide 16

Slide 16 text

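What is eBPF?
• One of the technologies for running programs written in user space inside the kernel
• Introduced into seccomp in 2012, applied to SDN on Linux in 2013, and has matured since
• Good at filtering (tcpdump, seccomp, bpftrace)
• Can access kernel information, while safety is guaranteed to a certain degree: for example, dangerous code will not run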

Slide 17

Slide 17 text

eBPF application examples

Slide 18

Slide 18 text

eBPF tracing strategies for containers
• There are several strategies
• Linux namespace or cgroup (v2) information can be used

Slide 19

Slide 19 text

Example 1: follow the information in task_struct
• Get namespace information from task_struct→nsproxy and filter on it (cxray)
https://github.com/mrtc0/cxray/blob/master/pkg/tracer/open/open.go
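The namespace information that a BPF program reaches through task_struct→nsproxy is also visible from user space as the /proc/<pid>/ns/* links, which makes a convenient cross-check when building such a filter. The sketch below is a user-space analogue, not cxray's code; the function names are assumptions.

```python
import os
import re

# Namespace links read like "pid:[4026531836]".
_NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(link: str) -> tuple:
    """Split a namespace link target into (namespace type, inode number)."""
    match = _NS_LINK.match(link)
    if match is None:
        raise ValueError(f"not a namespace link: {link!r}")
    return match.group(1), int(match.group(2))


def pid_namespace_inode(pid: str = "self") -> int:
    """Inode number identifying the PID namespace of the given process."""
    return parse_ns_link(os.readlink(f"/proc/{pid}/ns/pid"))[1]


if __name__ == "__main__":
    if os.path.lexists("/proc/self/ns/pid"):
        print(pid_namespace_inode())
```

Two processes share a PID namespace exactly when this inode number is the same for both, so a tracer can filter events by comparing it with the number of a known container process.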

Slide 20

Slide 20 text

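Example 2: compare the PID inside the namespace and on the host
• Compare the tid obtained in the BPF program with the tid on the host; if they do not match, judge the task to be in a container (Tracee)
• Depends on task_struct
https://github.com/aquasecurity/tracee/blob/main/tracee/tracee.bpf.c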
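The same comparison can be tried from user space with the NSpid field of /proc/<pid>/status, which lists a task's thread ID in each PID namespace it belongs to, outermost first. This is only an analogue of the idea for experimentation, not Tracee's BPF code; the function names are assumptions.

```python
def parse_nspid(status_text: str) -> list:
    """Return the thread IDs from the NSpid line, outermost namespace first."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []


def looks_containerized(status_text: str) -> bool:
    """True when the ID seen from the reader's namespace differs from the innermost one."""
    ids = parse_nspid(status_text)
    return bool(ids) and ids[0] != ids[-1]


if __name__ == "__main__":
    # Hypothetical status excerpt for a task that is PID 1 in its own namespace.
    sample = "Name:\tnginx\nNSpid:\t12345\t1\n"
    print(parse_nspid(sample), looks_containerized(sample))
```

Read from inside the container itself, the same task shows a single entry, so the check is only meaningful when run from the host side.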

Slide 21

Slide 21 text

Example 3: use the cgroup helper
https://github.com/udzura/copenclose/blob/master/src/bpf/copenclose.bpf.c

Slide 22

Slide 22 text

Implementation example
• copenclose(8)
• Kondo's PoC (BPF + Rust)
• A flag switches between task_struct and the cgroup v2 ID

Slide 23

Slide 23 text

The xxx Relationship Between Runtimes and cgroup: With bpf_get_current_cgroup_id(void) on the Side
Uchio Kondo / Container Runtime Meetup #3
* Photo by Fukuoka City

Slide 24

Slide 24 text

seccomp

Slide 25

Slide 25 text

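seccomp recap
• Filters the system calls made by a program
• The action can vary depending on conditions on the system call arguments
• Can implement a blacklist (denylist), whitelist (allowlist), and so on
• Fine-grained flags exist: for example, only write an audit log for a system call, or make it return an arbitrary errno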
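How a denylist, an allowlist, and actions such as "log only" or "return an arbitrary errno" fit together can be modeled in a few lines of Python. This is a toy model of the decision only: it installs no real filter and ignores argument conditions. The `SECCOMP_RET_*` values are the action codes from linux/seccomp.h; `ret_errno` and `decide` are hypothetical helper names.

```python
import errno

# Action values as defined in <linux/seccomp.h>.
SECCOMP_RET_KILL_PROCESS = 0x80000000
SECCOMP_RET_ERRNO = 0x00050000
SECCOMP_RET_LOG = 0x7FFC0000
SECCOMP_RET_ALLOW = 0x7FFF0000
SECCOMP_RET_DATA = 0x0000FFFF  # low 16 bits carry data such as the errno


def ret_errno(err: int) -> int:
    """Action meaning: fail the system call with the given errno."""
    return SECCOMP_RET_ERRNO | (err & SECCOMP_RET_DATA)


def decide(syscall: str, rules: dict, default: int) -> int:
    """Return the action for a system call name, or the default action."""
    return rules.get(syscall, default)


# Denylist: allow by default, restrict the listed system calls.
denylist = {"mount": ret_errno(errno.EPERM), "ptrace": SECCOMP_RET_LOG}
# Allowlist: kill by default, allow only the listed system calls.
allowlist = {"read": SECCOMP_RET_ALLOW, "write": SECCOMP_RET_ALLOW}

if __name__ == "__main__":
    print(hex(decide("mount", denylist, SECCOMP_RET_ALLOW)))
    print(hex(decide("mount", allowlist, SECCOMP_RET_KILL_PROCESS)))
```

The only difference between the two list styles is the default action, which is why a single lookup function covers both.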

Slide 26

Slide 26 text

Using seccomp (mruby)

Slide 27

Slide 27 text

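User space notification
• A technique in which seccomp detects a system call and delegates its handling to a userland program
• Introduced in Linux 5.0 (2019/3)
• The system call blocks until a decision is made
• e.g. controlling device access in LXC
https://gihyo.jp/admin/serial/01/linux_containers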

Slide 28

Slide 28 text

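User space notification
https://gihyo.jp/admin/serial/01/linux_containers
• From 「LXCで学ぶコンテナ入門」 No. 47, 「非特権コンテナの可能性を広げるseccomp notify機能」 (the seccomp notify feature that widens the possibilities of unprivileged containers)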

Slide 29

Slide 29 text

Implementation example (using mruby)
• Prepare an acceptor.rb like the following

Slide 30

Slide 30 text

Implementation example (using mruby)
• Start the program through the following invoker, and call listen(2)

Slide 31

Slide 31 text

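Detecting a listen(2) call
• Allow/deny can be controlled from the console on the acceptor.rb side.
• If denied, startup fails at that point and the invoker process dies.
• If allowed, startup continues and the program listens as if nothing had happened.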
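The control flow described above (the call blocks, a supervisor decides, and the caller either continues or fails) can be modeled with two threads and a queue. This is a pure-Python simulation of the flow only: it does not use seccomp, and all class and function names are hypothetical. In the real mechanism the filter returns SECCOMP_RET_USER_NOTIF and the supervisor answers through a notification file descriptor.

```python
import errno
import queue
import threading


class Notifier:
    """Stands in for the notification channel between target and supervisor."""

    def __init__(self) -> None:
        self.requests = queue.Queue()

    def intercepted_call(self, name: str) -> int:
        """Target side: block until the supervisor has answered."""
        reply = queue.Queue(maxsize=1)
        self.requests.put((name, reply))
        if not reply.get():
            raise PermissionError(errno.EPERM, f"{name} denied by supervisor")
        return 0

    def supervise_once(self, policy) -> None:
        """Supervisor side: take one pending call and allow or deny it."""
        name, reply = self.requests.get()
        reply.put(bool(policy(name)))


def run_demo(policy) -> str:
    """Run one intercepted "listen" call under the given policy."""
    notifier = Notifier()
    outcome = []

    def invoker() -> None:
        try:
            notifier.intercepted_call("listen")
            outcome.append("listening")
        except PermissionError:
            outcome.append("failed")

    thread = threading.Thread(target=invoker)
    thread.start()
    notifier.supervise_once(policy)
    thread.join()
    return outcome[0]


if __name__ == "__main__":
    print(run_demo(lambda name: True))
    print(run_demo(lambda name: False))
```

The per-call reply queue plays the role of the notification ID: it ties each decision back to the one blocked caller that is waiting for it.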

Slide 32

Slide 32 text

Detecting a listen(2) call
Output when denied
Output when allowed

Slide 33

Slide 33 text

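Applications?
• Ran an experiment that stops a process at "an arbitrary library function call" and creates a process dump with CRIU (*).
• LD_PRELOAD + wrapper function + a "do-nothing" syscall + seccomp
https://udzura.hatenablog.jp/entry/…
(*) Checkpoint and Restore In Userspace: a technique that saves the state of a process and resumes it from there. https://criu.org/Main_Page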

Slide 34

Slide 34 text

Discussion

Slide 35

Slide 35 text

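New technologies let us do more, but...
• The timing of a new technology's "emergence" and its "adoption" differ
• For example, cgroup v2 first appeared in 2013.
• Runtime support progressed around 2019 to 2020
• Surrounding tools will probably appear even later
• There is great value in examining a technology on its own while it is emerging, to verify what problems (security included) and what possibilities it has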

Slide 36

Slide 36 text

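eBPF is becoming a basic technology of Linux
• Its range of application keeps widening
• Besides tracing, bandwidth control, and networking: cgroup (v2) device decisions, LSM BPF programs, and more...
• In a security context, it is expected to become indispensable for tracing, auditing, and anomaly detection
• Touching even a single prog type seems enough to get a feel for it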