Noah A Robust and Flexible Operating System Compatibility Architecture

Presentation at Container Runtime Meetup #2 at Tokyo (https://runtime.connpass.com/event/180172/)

Explains Noah, a flexible operating system compatibility architecture that was presented at VEE '20. The presentation also showed a toy OCI runtime implementation based on Noah.

Takaya Saeki

August 22, 2020

Transcript

  1. Noah: A Robust and Flexible Operating System Compatibility Architecture

    Takahiro Shinagawa, Shinichi Honiden, Yuichi Nishiwaki, Takaya Saeki (@nullpo_head)
  2. Who am I?

    • Takaya Saeki (@nullpo_head)
    • Software Engineer
    • Likes the web and the system layer
    • Projects that might sound interesting
      • Noah
      • Ported the XV6 OS to MIPS, and to a home-built FPGA CPU
      • Sudo by Windows Hello in WSL 2
  3. Noah: User-space Linux* compatibility layer powered by virtualization

    1. As an implementation: Noah runs Linux apps on macOS, like WSL 1 on Windows
    2. As an architecture: Noah is a kind of user-space kernel for OS compatibility, powered by virtualization. No Linux emulation kernel extension (unlike WSL 1)
      • Loads a guest binary into an empty VM without any kernel
      • Traps system calls and emulates them in user space
      • Accomplishes memory management such as CoW by virtualization

    * The architecture is not limited to Linux, but can be applied to other operating systems
    => Technically fun!
    => Academic novelty (APSys ’17, VEE ’20)
  4. Linux

    • One of the most important operating systems
    • Today’s de facto standard ecosystem
    • Kubernetes / Docker
  5. OS Compatibility Layers

    • Windows and FreeBSD have Linux compatibility layers to use the Linux ecosystem natively
    • So, why not let macOS have one?
    • Then, the Linux ABI would be a lingua franca!
    • What is more, creating yet another Linux layer is fun!
    • Started in 2016 as a MITOH project by me and Yuichi Nishiwaki.
  6. Implementing OS compatibility layer: Kernel-space vs User-space

    • Kernel space
      👍 Flexibility to achieve binary compatibility
        • System calls and memory management can be easily handled
      👎 Vulnerability against bugs in OS compatibility layers
        • A bug could lead to system crashes
    • User space
      👍 Robustness against bugs
        • Bugs do not affect the OS stability
      👎 Challenges to achieve full compatibility
        • E.g., copy-on-write not implemented in Cygwin

    [Figure labels, once per approach: Guest Application Binary, OS Compatibility Layer, Host OS Kernel]

    A Robust and Flexible Operating System Compatibility Architecture (VEE 2020, March 17, 2020)
  7. Noah’s OS Compatibility Architecture

    • Running each guest process in a VM (without its OS kernel)
      👍 Robustness
        • Most parts of the OS compatibility layer can be implemented in user space
        • Bugs do not cause kernel crashes
      👍 Flexibility
        • Hardware virtualization technology provides low-layer event handling functionalities
        • E.g., trapping system calls and page faults, manipulating page tables, …

    [Figure labels: Host OS Kernel, OS Compatibility Layer, VM, Host Process, Standardized Virtualization Interface, Guest Application Process, CPU Hardware Virtualization Function]

    ⇒ Published as papers for its academic novelty [T. Saeki, Y. Nishiwaki, T. Shinagawa, S. Honiden]
      • A robust and flexible operating system compatibility architecture, in VEE 2020
      • Bash on Ubuntu on macOS, in APSys 2017
  8. Overall Design

    • Three main components
      1. Guest VM
      2. VMM module
      3. Monitor process

    [Figure labels: monitor process, guest process, Guest VMs, no kernel, emulate system calls, user space, kernel space, trap system calls & exceptions, upcall, monitor, VMM module, manage VMs, Host OS, kernel]
  9. Our approach: Utilize Virtualization Technology

    [Figure labels: the host OS, guest VMs, monitor process, user space, no kernel, guest process, kernel space, kernel module]
  10. Our approach: Utilize Virtualization Technology

    1. The monitor process launches a new VM and loads the ELF binary inside it, without a kernel

    [Figure labels: the host OS, guest VMs, monitor process, user space, no kernel, guest process, kernel space, kernel module]
  11. Our approach: Utilize Virtualization Technology

    2. The ELF application issues Linux system calls while running in the VM. The VMM traps them.

    [Figure labels: the host OS, guest VMs, monitor process, user space, no kernel, guest process, kernel space, kernel module]
  12. Our approach: Utilize Virtualization Technology

    3. The VMM passes the trapped system call to the monitor process

    [Figure labels: the host OS, guest VMs, monitor process, user space, no kernel, guest process, kernel space, kernel module]
  13. Our approach: Utilize Virtualization Technology

    4. The monitor process emulates the behavior of the Linux system call with the host OS’s system calls

    [Figure labels: the host OS, guest VMs, monitor process, user space, no kernel, guest process, kernel space, kernel module]
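The four steps above (load, trap, pass to the monitor, emulate) can be sketched as a small dispatch loop. This is only a toy model, not Noah's code: `ToyGuest`, `emulate`, and `monitor_loop` are hypothetical names, the "VM" is a Python list of pre-recorded system calls, and only three Linux x86-64 system call numbers are handled.

```python
import os

# Linux x86-64 system call numbers and errno used in this toy example.
SYS_WRITE = 1
SYS_GETPID = 39
SYS_EXIT = 60
ENOSYS = 38


class ToyGuest:
    """Hypothetical stand-in for a guest VM: each 'trap' yields (number, args)."""

    def __init__(self, program):
        self.program = list(program)
        self.results = []  # return values handed back to the guest

    def run_until_trap(self):
        # A real VMM would resume the vCPU here and return on the next VM exit.
        if self.program:
            return self.program.pop(0)
        return (SYS_EXIT, (0,))


def emulate(nr, args, out_fd):
    """Monitor side: translate one guest system call into host calls."""
    if nr == SYS_WRITE:
        _guest_fd, data = args
        return os.write(out_fd, data)  # forwarded to a host file descriptor
    if nr == SYS_GETPID:
        return os.getpid()
    return -ENOSYS  # unimplemented system call


def monitor_loop(guest, out_fd):
    """Run the guest until it exits, emulating every trapped system call."""
    while True:
        nr, args = guest.run_until_trap()
        if nr == SYS_EXIT:
            return args[0]
        guest.results.append(emulate(nr, args, out_fd))
```

In this sketch the guest never touches the host directly. Every effect goes through `emulate`, which is the property that keeps bugs in the compatibility layer confined to a user-space process.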
  14. Example: Inter-process communication

    $ noah bash
    $ cat file | grep 42

    [Figure labels: macOS; Monitor Process, Monitor Process; Bash, Bash; fork; fork(), fork(); Clone the VM state]
  15. Example: Inter-process communication

    $ noah bash
    $ cat file | grep 42

    [Figure labels: macOS; Monitor Process, Monitor Process; Bash; Bash exec to “cat”; execve(…); cat; Replace VM contents]
  16. Example: Inter-process communication

    $ noah bash
    $ cat file | grep 42

    [Figure labels: macOS; Monitor Process, Monitor Process, Monitor Process; Bash, cat, grep; write, read]
  17. Example: Inter-process communication

    $ noah bash
    $ cat file | grep 42

    • Can do IPC with native apps smoothly

    [Figure labels: macOS; Monitor Process, Monitor Process; Regular Native Process; Bash, cat]
  18. Advantages of User-space compatibility layer with virtualization

    1. Robust
      • Does not cause OS crashes, because it is just a user-space app except for the VMM
    2. Flexible
      • Thanks to the VMM, the user-space kernel achieves binary compatibility and CoW
    3. Portable, with lower development cost than a kernel-space layer
      • Rich host OS functionalities: system calls, libraries, high-level languages…
      • Actually, NoahW is implemented in C++ with Boost
    4. Seamless
      • Single kernel: shares resources such as FS, memory, process scheduling, IPC…
  19. Implementation

    • Targets Linux 4.6 on x86-64 (Intel VT-x)
    • Noah: Linux compatibility layer for macOS
      • Uses Apple Hypervisor.framework as the VMM module
    • NoahW: Linux compatibility layer for Windows (preliminary)
      • Uses Intel Hardware Accelerated Execution Manager as the VMM module
  20. Memory Management

    • Two page tables
      (a) Guest page table in the VM
      (b) Nested page table (EPT) in the VMM
    • Fix (a) and modify (b)
      • (a) is fixed to the straight mapping
        • Virtual address = Physical address
      • (b) can be manipulated with the API
        • Provided by the VMM module
    • Limitation: GVA is up to 512 GiB
      • 39-bit physical address in Intel CPU
        • 48-bit virtual address
      • Stack is moved to the lower address
      • No kernel area

    [Figure labels: 0, 511 GiB, 512 GiB; GVA, GPA, HPA; Guest page table (fixed); Nested page table (modified); 1-GiB guest system data area (page tables, segment descriptors, …)]
    GVA: Guest Virtual Address, GPA: Guest Physical Address, HPA: Host Physical Address
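The 512 GiB limit follows directly from the 39-bit physical address width once the guest page table is a straight (identity) mapping. The arithmetic can be checked with a short sketch. The 1-GiB page granularity used for the mapping is an assumption made only for illustration, and `identity_map` is a hypothetical helper, not part of Noah.

```python
GIB = 1 << 30

PHYS_ADDR_BITS = 39  # from the slide: 39-bit physical address
VIRT_ADDR_BITS = 48  # from the slide: 48-bit virtual address

gpa_space = 1 << PHYS_ADDR_BITS  # bytes addressable as guest physical memory
gva_space = 1 << VIRT_ADDR_BITS  # bytes addressable as guest virtual memory


def identity_map(page_size, limit):
    """Return (virtual, physical) page pairs of a straight mapping below limit."""
    return [(addr, addr) for addr in range(0, limit, page_size)]


# Assumption: 1-GiB pages, so the whole fixed table has one entry per GiB.
mapping = identity_map(GIB, gpa_space)

print(gpa_space // GIB)        # size of the guest address range in GiB
print(len(mapping))            # number of identity-mapped 1-GiB pages
print(gva_space // gpa_space)  # how much of the 48-bit space goes unused
```

Because virtual address = physical address in table (a), the guest can never use more virtual address space than the physical width allows, even though the CPU supports 48-bit virtual addresses. All per-process mapping changes then happen in table (b).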
  21. Process Management (fork)

    • Noah (on macOS)
      • Implements a subset of clone()
      • Apple Hypervisor.framework does not support fork() with a VM
        • Save and destroy the VM before fork()
        • Restore the VM after fork()
    • NoahW (on Windows)
      • Implements fork() with copy-on-write using shared memory and virtualization
        • Create a memory region shared among monitor processes
        • Save, restore, and modify the VM states on fork()
        • Trap page faults in the VMs to implement copy-on-write
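The copy-on-write idea behind the NoahW bullets can be illustrated with a toy model: pages live in a shared pool, fork() shares them read-only, and the first write after fork() triggers a "fault" that copies the page. This is a sketch of the general technique only. `SharedPages` and `ToyProcess` are hypothetical names, and a real implementation would do this through nested page table permissions rather than Python dictionaries.

```python
PAGE_SIZE = 4096


class SharedPages:
    """Pool of page frames with reference counts (stand-in for shared memory)."""

    def __init__(self):
        self.frames = {}
        self.refs = {}
        self.next_id = 0

    def alloc(self, data=None):
        fid = self.next_id
        self.next_id += 1
        # bytearray(int) gives a zeroed page; bytearray(bytearray) copies it.
        self.frames[fid] = bytearray(data if data is not None else PAGE_SIZE)
        self.refs[fid] = 1
        return fid


class ToyProcess:
    def __init__(self, pool):
        self.pool = pool
        self.table = {}  # page number -> [frame id, writable flag]

    def map_page(self, page_no):
        self.table[page_no] = [self.pool.alloc(), True]

    def fork(self):
        """Share every frame with the child and drop write permission on both sides."""
        child = ToyProcess(self.pool)
        for page_no, entry in self.table.items():
            entry[1] = False
            self.pool.refs[entry[0]] += 1
            child.table[page_no] = [entry[0], False]
        return child

    def read(self, page_no, offset):
        return self.pool.frames[self.table[page_no][0]][offset]

    def write(self, page_no, offset, value):
        entry = self.table[page_no]
        if not entry[1]:
            self._handle_write_fault(entry)  # models the trapped page fault
        self.pool.frames[entry[0]][offset] = value

    def _handle_write_fault(self, entry):
        fid = entry[0]
        if self.pool.refs[fid] > 1:
            # Still shared: give this process a private copy.
            self.pool.refs[fid] -= 1
            entry[0] = self.pool.alloc(self.pool.frames[fid])
        entry[1] = True
```

In this model fork() costs one table walk regardless of how much memory is mapped, and copying is deferred to the pages that are actually written.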
  22. File System

    • Implemented a VFS layer
    • To run Linux apps, the default FS is mapped as follows

    [Figure labels: /, /usr, /etc, /Users, /dev, /tmp; /Users, /dev, ~/.noah/tree/usr, ~/.noah/tree/etc, /tmp]
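A path translation of this kind can be sketched as a prefix lookup. The pairing used here is an assumption read off the flattened diagram: /Users, /dev, and /tmp pass through to the host unchanged, and everything else (including /usr and /etc) resolves under ~/.noah/tree. `to_host_path` and the `home` default are hypothetical.

```python
import posixpath

# Assumed pass-through mounts: guest prefix -> host prefix.
PASS_THROUGH = {
    "/Users": "/Users",
    "/dev": "/dev",
    "/tmp": "/tmp",
}


def to_host_path(guest_path, home="/Users/alice"):
    """Translate an absolute guest path to a host path (illustrative only)."""
    path = posixpath.normpath(guest_path)  # collapse "." and ".." first
    for prefix, target in PASS_THROUGH.items():
        if path == prefix or path.startswith(prefix + "/"):
            return target + path[len(prefix):]
    # Assumption: the rest of the guest tree lives under ~/.noah/tree.
    root = posixpath.join(home, ".noah", "tree")
    return root if path == "/" else root + path


print(to_host_path("/usr/bin/ls"))
print(to_host_path("/tmp/build.log"))
```

Normalizing before the prefix match matters: without it, a guest path such as /usr/../tmp/y would be looked up under the Linux tree instead of the host /tmp.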
  23. Other System Calls

    • Futex
      • Emulated with condition variables
    • Signal
      • Implements a delivery system inside Noah
    • Socket
      • Integrated with Noah’s FS
    • I/O such as readv64 / writev64
      • Simulates incompatible small I/O system calls
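Reading the Futex bullet as emulation on top of a condition variable (an interpretation, not something the slide spells out), the core wait/wake semantics can be sketched as follows. `ToyFutex` is a hypothetical class covering only the compare-then-sleep and wake operations on a single word.

```python
import threading


class ToyFutex:
    """Wait/wake on one integer word, built on a condition variable."""

    def __init__(self, value=0):
        self.value = value
        self._cond = threading.Condition()

    def wait(self, expected, timeout=None):
        """Sleep only if the word still equals `expected`.

        Returns True when woken, False on value mismatch or timeout.
        """
        with self._cond:
            if self.value != expected:
                return False  # a Linux futex would report EAGAIN here
            return self._cond.wait(timeout)

    def store_and_wake(self, new_value, n=1):
        """Update the word and wake up to `n` sleeping waiters."""
        with self._cond:
            self.value = new_value
            self._cond.notify(n)
```

The compare and the sleep happen under the same lock, which mirrors the guarantee a futex gives: a waker that changes the word cannot slip in between the check and the sleep and leave the waiter stuck.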
  24. Macro Benchmark (Phoronix Test Suite + α)

    [Figure: bar chart, axis from -200% to 200%. Benchmarks: Linux kernel build, unpack-linux, postmark, sqlite, openssl, compress-7zip. Bar labels in extraction order: 16%, -23%, -4%, 50%, -58%, 9%]
  25. Primitive Benchmark: dup() system call

    [Figure: stacked bar chart of cycle counts for macOS and Windows, axis from 0 to 40,000. Segments: VM exit, upcall, pre-process, host syscall, post-process, downcall, VM enter. Segment labels in extraction order: 270, 320, 2520, 11091, 1330, 7044, 2770, 2118, 588, 5504, 2809, 11091, 251, 297]
  26. Micro Benchmark: lmbench (processor, process)

    [Figure: bar chart, axis from -100% to 400%. Benchmarks: null call, null I/O, stat, open clos, slct TCP, sig inst, sig hndl, fork proc, exec proc, sh proc. Bar labels in extraction order: 410%, 310%, 175%, 172%, 13%, 329%, 239%, 256%, -24%, 46%]
  27. Micro Benchmark: lmbench (File & VM latency)

    [Figure: bar chart, axis from -100% to 50%. Benchmarks: 0K Create, 0K Delete, 10K Create, 10K Delete, Mmap Latency, Prot Fault, Page Fault, 100fd selct. Bar labels in extraction order: 42%, 5%, 17%, 4%, 28%, -45%, -92%, 8%]
  28. Comparison of OS Compatibility Layers

    Benchmark                         NoahW     Cygwin    WSL1
    dup2() [call per second]          36,723    556,453   693,309
    write() [call per second]         0.30      0.56      0.57
    fork() (0 MiB array) [ms]         106.4     219.4     2.06
    fork() (512 MiB array) [ms]       338.9     789.9     32.51
    fork() (1 GiB array) [ms]         458.4     1531.8    62.66
  29. Summary

    • Noah has a novel OS compatibility architecture
      • Exploited the OS-standard virtualization technology support
      • Achieved both robustness and flexibility
    • The architecture consists of three components
      • VMs to run guest processes
      • The VMM module to provide API for hardware virtualization technology
      • Monitor processes to implement OS compatibility functions
    • Runs Linux binaries on macOS, and Windows (preliminary)
      • Noah implemented 172 out of 329 Linux system calls
      • The overhead of Linux kernel build time on Noah was 16%
  30. Wait, so you don’t mention containers at all…???? 🙄

    In “Container Runtime Meetup”..??? 🙄
  31. Noah as an OCI Runtime

    • OCI Runtime
      • The spec of the layer of runc
      • Runs container images
      • E.g., runc, gVisor’s runsc
    • Why not add Noah to them?
      • Run Linux images (near) natively on macOS
      • I can finally talk about containers in this Container Runtime Meetup #2 😂
  32. Thanks, Hajime-san…

    • Containerd and Dockerd buildable on macOS
      • https://github.com/ukontainer/containerd
      • https://github.com/ukontainer/dockerd-darwin
  33. This joke has only been in the making since late last night,

    so enjoy the simple demo as much as possible! 🤗