
AsiaBSDCon 2026 Keynote1

yasukata

March 21, 2026


Transcript

  1. Quiz Q. What is the key design principle of distributed

    systems for achieving optimal performance? (while there would be many ...) 2
  2. Quiz Q. What is the key design principle of distributed

    systems for achieving optimal performance? (while there would be many ...) • Ideally, the throughput should increase in proportion to the number of workers 3 [chart: Normalized Throughput vs. Number of Workers, showing an "Optimal" line]
  3. Quiz Q. What is the key design principle of distributed

    systems for achieving optimal performance? (while there would be many ...) • Ideally, the throughput should increase in proportion to the number of workers • In practice, distributed systems often fail to scale their performance 4 [chart: Normalized Throughput vs. Number of Workers, showing "Optimal" and "Suboptimal" lines]
  4. Quiz Q. What is the key design principle of distributed

    systems for achieving optimal performance? (while there would be many ...) • Ideally, the throughput should increase in proportion to the number of workers • In practice, distributed systems often fail to scale their performance A. Shared nothing • to minimize communications and synchronizations between workers 5 [chart: Normalized Throughput vs. Number of Workers, showing "Optimal" and "Suboptimal" lines]
  5. Quiz • In practice 11 Worker 1 Time Worker 2

    Worker 3 Shared Item lock acquired
  6. Quiz • In practice 13 Worker 1 Time Worker 2

    Worker 3 Shared Item wait wait
  7. Quiz • In practice 14 Worker 1 Time Worker 2

    Worker 3 Shared Item wait wait lock released
  8. Quiz • In practice 15 Worker 1 Time Worker 2

    Worker 3 Shared Item wait wait lock acquired
  9. Quiz • In practice 16 Worker 1 Time Worker 2

    Worker 3 Shared Item wait wait
  10. Quiz • In practice 17 Worker 1 Time Worker 2

    Worker 3 Shared Item wait lock released wait
  11. Quiz • In practice 18 Worker 1 Time Worker 2

    Worker 3 Shared Item lock acquired wait
  12. Quiz • In practice 19 Worker 1 Time Worker 2

    Worker 3 Shared Item wait wait
  13. Quiz • In practice 20 Worker 1 Time Worker 2

    Worker 3 Shared Item wait wait lock released
  14. Quiz • In practice 21 Worker 1 Time Worker 2

    Worker 3 Shared Item lock acquired wait
  15. Quiz • In practice 23 Worker 1 Time Worker 2

    Worker 3 Shared Item wait These are the time slots that the workers failed to spend on meaningful tasks
  16. Quiz • Shared nothing 24 Worker 1 Time Worker 2

    Worker 3 Item for Worker 1 Item for Worker 2 Item for Worker 3
  17. Quiz • Shared nothing 25 Worker 1 Time Worker 2

    Worker 3 Item for Worker 1 Item for Worker 2 Item for Worker 3 lock acquired
  18. Quiz • Shared nothing 26 Worker 1 Time Worker 2

    Worker 3 Item for Worker 1 Item for Worker 2 Item for Worker 3 lock acquired
  19. Quiz • Shared nothing 27 Worker 1 Time Worker 2

    Worker 3 Item for Worker 1 Item for Worker 2 Item for Worker 3 lock acquired
  20. Quiz • Shared nothing 28 Worker 1 Time Worker 2

    Worker 3 Item for Worker 1 Item for Worker 2 Item for Worker 3 lock released
  21. Quiz • Shared nothing 29 Worker 1 Time Worker 2

    Worker 3 Item for Worker 1 Item for Worker 2 Item for Worker 3 lock released
  22. Quiz • Shared nothing 30 Worker 1 Time Worker 2

    Worker 3 Item for Worker 1 Item for Worker 2 Item for Worker 3 lock acquired
  23. Quiz • Shared nothing 31 Worker 1 Time Worker 2

    Worker 3 Item for Worker 1 Item for Worker 2 Item for Worker 3 lock released
  24. Quiz • Shared nothing 32 Worker 1 Time Worker 2

    Worker 3 Item for Worker 1 Item for Worker 2 Item for Worker 3 The workers can spend all their time on meaningful tasks because they do not need to wait for each other to access a shared item
  25. Quiz • Shared nothing 33 Worker 1 Time Worker 2

    Worker 3 Item for Worker 1 Item for Worker 2 Item for Worker 3 The workers can spend all their time on meaningful tasks because they do not need to wait for each other to access a shared item In general, the designers of distributed systems try to reduce shared objects to achieve high throughput
  26. Shared Items in Software Development? 35 Developer 1 Time Developer

    2 Developer 3 Shared Codebase Developers share the codebase, and their communications are often far more complicated than those of computer systems
  27. The Main Discussion Point of This Talk 36 Developer 1

    Time Developer 2 Developer 3 Codebase of Developer 1 Codebase of Developer 2 Codebase of Developer 3 Can we apply the shared nothing model to software development?
  28. The Main Discussion Point of This Talk 37 Developer 1

    Time Developer 2 Developer 3 Codebase of OS functionality A Codebase of OS functionality B Codebase of OS functionality C It would be great if • developers do not need to communicate with each other • and, the OS functionalities are easily incorporated
  29. The Main Discussion Point of This Talk 38 Developer 1

    Time Developer 2 Developer 3 Codebase of OS functionality A,B,C Current OS development often relies on intensive communications of developers It would be great if • developers do not need to communicate with each other • and, the OS functionalities are easily incorporated
  30. The Main Discussion Point of This Talk 39 Developer 1

    Time Developer 2 Developer 3 Codebase of OS functionality A,B,C Current OS development often relies on intensive communications of developers Discussion Point How about exploring the possibility of decentralized collaboration? It would be great if • developers do not need to communicate with each other • and, the OS functionalities are easily incorporated
  31. Outline of This Talk • Why I would like to

    bring up this discussion • Prior exploration • Modest feature requests to the BSD community 40
  32. Outline of This Talk • Why I would like to

    bring up this discussion • Prior exploration • Modest feature requests to the BSD community 41
  33. In Early 2010s, ... • The throughput of commodity NICs

    has reached 10 Gbps • It got challenging for software to utilize the potential of NICs NIC 42
  34. In Early 2010s, ... • The common steps for a

    packet transmission NIC User-space program Kernel 43
  35. In Early 2010s, ... • The common steps for a

    packet transmission NIC User-space program Kernel Payload Buffer (DRAM) 1. prepares a payload 44
  36. In Early 2010s, ... • The common steps for a

    packet transmission NIC User-space program Kernel Payload Buffer (DRAM) 1. prepares a payload 2. triggers a system call like write() 45
  37. In Early 2010s, ... • The common steps for a

    packet transmission NIC User-space program Kernel Payload Buffer (DRAM) 1. prepares a payload 2. triggers a system call like write() 46
  38. In Early 2010s, ... • The common steps for a

    packet transmission NIC User-space program Kernel Payload Buffer (DRAM) 1. prepares a payload 2. triggers a system call like write() 3. allocates a buffer Buffer (DRAM) 47
  39. In Early 2010s, ... • The common steps for a

    packet transmission NIC User-space program Kernel Payload Buffer (DRAM) 1. prepares a payload 2. triggers a system call like write() 3. allocates a buffer Buffer (DRAM) 4. copies in the payload Payload 48
  40. In Early 2010s, ... • The common steps for a

    packet transmission NIC User-space program Kernel Payload Buffer (DRAM) 1. prepares a payload 2. triggers a system call like write() 3. allocates a buffer Buffer (DRAM) 4. copies in the payload Payload 5. associates the buffer with the NIC 49
  41. In Early 2010s, ... • The common steps for a

    packet transmission NIC User-space program Kernel Payload Buffer (DRAM) 1. prepares a payload 2. triggers a system call like write() 3. allocates a buffer Buffer (DRAM) 4. copies in the payload Payload 5. associates the buffer with the NIC 6. kicks the NIC to start the transmission 50
  42. In Early 2010s, ... • The common steps for a

    packet transmission NIC User-space program Kernel Payload Buffer (DRAM) 1. prepares a payload 2. triggers a system call like write() 3. allocates a buffer Buffer (DRAM) 4. copies in the payload Payload 5. associates the buffer with the NIC 6. kicks the NIC to start the transmission 51 People found that these steps are costly and make it hard to achieve high packet I/O performance
  43. In Early 2010s, ... • netmap, a packet I/O framework,

    merged in FreeBSD NIC User-space program Kernel 52
  44. In Early 2010s, ... • netmap, a packet I/O framework,

    merged in FreeBSD NIC User-space program Kernel Buffer (DRAM) 0. allocates buffers and associate them with the NIC 53
  45. In Early 2010s, ... • netmap, a packet I/O framework,

    merged in FreeBSD NIC User-space program Kernel 0. exposes buffers associated with the NIC through mmap() during the setup phase Buffer (DRAM) 0. allocates buffers and associate them with the NIC mmap 54
  46. In Early 2010s, ... • netmap, a packet I/O framework,

    merged in FreeBSD NIC User-space program Kernel 0. exposes buffers associated with the NIC through mmap() during the setup phase Buffer (DRAM) 0. allocates buffers and associate them with the NIC mmap Payload 1. prepares a payload Payload 55
  47. In Early 2010s, ... • netmap, a packet I/O framework,

    merged in FreeBSD NIC User-space program Kernel 0. exposes buffers associated with the NIC through mmap() during the setup phase Buffer (DRAM) 0. allocates buffers and associate them with the NIC mmap Payload 1. prepares a payload Payload 56 These are physically the same thus, no copy is necessary
  48. In Early 2010s, ... • netmap, a packet I/O framework,

    merged in FreeBSD NIC User-space program Kernel 0. exposes buffers associated with the NIC through mmap() during the setup phase Buffer (DRAM) 0. allocates buffers and associate them with the NIC mmap Payload 1. prepares a payload 2. triggers a system call like poll() and ioctl() Payload 57
  49. In Early 2010s, ... • netmap, a packet I/O framework,

    merged in FreeBSD NIC User-space program Kernel 0. exposes buffers associated with the NIC through mmap() during the setup phase Buffer (DRAM) 0. allocates buffers and associate them with the NIC mmap Payload 1. prepares a payload 2. triggers a system call like poll() and ioctl() Payload 58
  50. In Early 2010s, ... • netmap, a packet I/O framework,

    merged in FreeBSD NIC User-space program Kernel 0. exposes buffers associated with the NIC through mmap() during the setup phase Buffer (DRAM) 0. allocates buffers and associate them with the NIC mmap Payload 1. prepares a payload 2. triggers a system call like poll() and ioctl() 3. kicks the NIC to start the transmission Payload 59
  51. In Early 2010s, ... • netmap, a packet I/O framework,

    merged in FreeBSD NIC User-space program Kernel 0. exposes buffers associated with the NIC through mmap() during the setup phase Buffer (DRAM) 0. allocates buffers and associate them with the NIC mmap Payload 1. prepares a payload 2. triggers a system call like poll() and ioctl() 3. kicks the NIC to start the transmission Payload 60 netmap offers the fast data path
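In code, the netmap transmit path looks roughly like the following sketch. It follows the public netmap API (net/netmap_user.h), but it is not runnable as-is: it needs a netmap-capable kernel and NIC, and the interface name "ix0" is a placeholder.

```c
/* sketch only: requires a netmap-enabled kernel and NIC */
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
#include <string.h>
#include <sys/ioctl.h>

int tx_one_frame(const char *frame, size_t len)
{
    /* step 0: nm_open() mmap()s the NIC's buffers into our address space */
    struct nm_desc *d = nm_open("netmap:ix0", NULL, 0, NULL);
    if (d == NULL)
        return -1;
    struct netmap_ring *ring = NETMAP_TXRING(d->nifp, 0);
    if (nm_ring_empty(ring)) {
        nm_close(d);
        return -1;
    }
    struct netmap_slot *slot = &ring->slot[ring->cur];
    /* step 1: prepare the payload directly in the shared buffer -- this is
     * physically the same memory the NIC reads from, so no copy by the
     * kernel is needed */
    memcpy(NETMAP_BUF(ring, slot->buf_idx), frame, len);
    slot->len = (uint16_t)len;
    ring->head = ring->cur = nm_ring_next(ring, ring->cur);
    /* steps 2-3: one system call flushes the ring and kicks the NIC */
    ioctl(d->fd, NIOCTXSYNC, NULL);
    nm_close(d);
    return 0;
}
```

A real application would keep the descriptor open and batch many slots per NIOCTXSYNC (or poll()), which is where the per-packet cost amortizes away.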
  52. In Early 2010s, ... • netmap, a packet I/O framework,

    merged in FreeBSD [slide shows an excerpt of the netmap paper; its Figure 5 plots Tx Rate (Mpps) against clock speed (GHz) for netmap on 1, 2, and 4 cores, Linux/pktgen (a specialised in-kernel generator, peaking at about 4 Mpps), and FreeBSD/netsend (user space, peaking at 1.05 Mpps)] Luigi Rizzo, “netmap: a novel framework for fast packet I/O”, USENIX ATC 2012 https://www.usenix.org/conference/atc12/technical-sessions/presentation/rizzo 61
  53. In Early 2010s, ... • netmap, a packet I/O framework,

    merged in FreeBSD • The speed of packet transmission over a raw socket is 1.05 Mpps • Mpps: million packets per second [slide shows an excerpt of the netmap paper, including its Figure 5] 62 Luigi Rizzo, “netmap: a novel framework for fast packet I/O”, USENIX ATC 2012 https://www.usenix.org/conference/atc12/technical-sessions/presentation/rizzo
  54. In Early 2010s, ... • netmap, a packet I/O framework,

    merged in FreeBSD • The speed of packet transmission over a raw socket is 1.05 Mpps • Mpps: million packets per second • netmap achieves the line rate (14.88 Mpps) with one CPU core [slide shows an excerpt of the netmap paper, including its Figure 5] 63 Luigi Rizzo, “netmap: a novel framework for fast packet I/O”, USENIX ATC 2012 https://www.usenix.org/conference/atc12/technical-sessions/presentation/rizzo
  55. In Early 2010s, ... • netmap, a packet I/O framework,

    merged in FreeBSD NIC User-space program Kernel 64 Payload
  56. In Early 2010s, ... • netmap, a packet I/O framework,

    merged in FreeBSD NIC User-space program Kernel 65 Payload TCP/IP Stack When netmap is activated, payloads bypass the kernel-space TCP/IP stack
  57. In Early 2010s, ... • netmap, a packet I/O framework,

    merged in FreeBSD NIC User-space program Kernel 66 Payload TCP/IP Stack When netmap is activated, payloads bypass the kernel-space TCP/IP stack The user-space program bypassing a TCP/IP stack cannot communicate over TCP/IP networks
  58. In Early 2010s, ... • netmap, a packet I/O framework,

    merged in FreeBSD NIC User-space program Kernel 67 Payload TCP/IP Stack When netmap is activated, payloads bypass the kernel-space TCP/IP stack TCP/IP Stack User-space TCP/IP stacks enable user-space programs to communicate over TCP/IP networks while using netmap
  59. In Early 2010s, ... • netmap + a user-space TCP/IP

    stack NIC User-space program Kernel 0. exposes buffers associated with the NIC through mmap() during the setup phase Buffer (DRAM) 0. allocates buffers and associate them with the NIC mmap 68
  60. In Early 2010s, ... • netmap + a user-space TCP/IP

    stack NIC User-space program Kernel 0. exposes buffers associated with the NIC through mmap() during the setup phase Buffer (DRAM) 0. allocates buffers and associate them with the NIC mmap Payload 1. prepares a payload Payload 69
  61. In Early 2010s, ... • netmap + a user-space TCP/IP

    stack NIC User-space program Kernel 0. exposes buffers associated with the NIC through mmap() during the setup phase Buffer (DRAM) 0. allocates buffers and associate them with the NIC mmap Payload 1. prepares a payload Payload 2. prepares the header using the user-space TCP/IP stack 70
  62. In Early 2010s, ... • netmap + a user-space TCP/IP

    stack NIC User-space program Kernel 0. exposes buffers associated with the NIC through mmap() during the setup phase Buffer (DRAM) 0. allocates buffers and associate them with the NIC mmap Payload 1. prepares a payload 2. prepares the header using the user-space TCP/IP stack Payload 3. triggers a system call like poll() and ioctl() 71
  63. In Early 2010s, ... • netmap + a user-space TCP/IP

    stack NIC User-space program Kernel 0. exposes buffers associated with the NIC through mmap() during the setup phase Buffer (DRAM) 0. allocates buffers and associate them with the NIC mmap Payload 1. prepares a payload 2. prepares the header using the user-space TCP/IP stack Payload 3. triggers a system call like poll() and ioctl() 72
  64. In Early 2010s, ... • netmap + a user-space TCP/IP

    stack NIC User-space program Kernel 0. exposes buffers associated with the NIC through mmap() during the setup phase Buffer (DRAM) 0. allocates buffers and associate them with the NIC mmap Payload 1. prepares a payload 2. prepares the header using the user-space TCP/IP stack Payload 3. triggers a system call like poll() and ioctl() 3. kicks the NIC to start the transmission 73
  65. In Early 2010s, ... • Sandstorm (SIGCOMMʼ14) is a web

    server based on a user- space TCP/IP stack that leverages netmap for its packet I/O 32 64 128 256 512 756 1024 File size (KB) Sandstorm nginx + FreeBSD nginx + Linux roughput, 4 NICs 4 8 16 24 32 64 128 256 512 756 1024 0 20 40 60 File size (KB) Throughput (Gbps) Sandstorm nginx + FreeBSD nginx + Linux (c) Network throughput, 6 NICs 100 Figure 4 74 Ilias Marinos, Robert N.M. Watson, and Mark Handley, "Network Stack Specialization for Performance", SIGCOMM 2014 https://doi.org/10.1145/2740070.2626311
  66. In Early 2010s, ... • Many performance-optimized TCP/IP stacks have

    been introduced • StackMap (USENIX ATCʼ16) • F-Stack (Tencent) • Shenango (NSDIʼ19) • TAS (EuroSysʼ19) • Demikernel (SOSPʼ21) • Luna (USENIX ATCʼ23) • … • mTCP (NSDIʼ14) • Sandstorm (SIGCOMMʼ14) • UTCP (SIGCOMM CCR in 2014) • Arrakis (OSDIʼ14) • IX (OSDIʼ14) • Seastar (Cloudius Systems) 75
  67. Problem Statement • As of today, performance-optimized TCP/IP stacks have

    not become common among most users and are not widely used • even though the techniques to achieve high performance were discovered more than 10 years ago 76
  68. Problem Statement • As of today, performance-optimized TCP/IP stacks have

    not become common among most users and are not widely used • even though the techniques to achieve high performance were discovered more than 10 years ago [slide shows Figure 4 of the Sandstorm paper: network throughput (Gbps) vs. file size (KB) for Sandstorm, nginx + FreeBSD, and nginx + Linux, with 1, 4, and 6 NICs] Ilias Marinos, Robert N.M. Watson, and Mark Handley, "Network Stack Specialization for Performance", SIGCOMM 2014 https://doi.org/10.1145/2740070.2626311 77
  69. Problem Statement • As of today, performance-optimized TCP/IP stacks have

    not become common among most users and are not widely used • even though the techniques to achieve high performance were discovered more than 10 years ago [slide shows Figure 4 of the Sandstorm paper: network throughput (Gbps) vs. file size (KB) for Sandstorm, nginx + FreeBSD, and nginx + Linux, with 1, 4, and 6 NICs] Ilias Marinos, Robert N.M. Watson, and Mark Handley, "Network Stack Specialization for Performance", SIGCOMM 2014 https://doi.org/10.1145/2740070.2626311 This gap persists for most users until an adequate solution is identified 78
  70. Problem Statement • As of today, performance-optimized TCP/IP stacks have

    not become common among most users and are not widely used • even though the techniques to achieve high performance were discovered more than 10 years ago [slide shows Figure 4 of the Sandstorm paper: network throughput (Gbps) vs. file size (KB) for Sandstorm, nginx + FreeBSD, and nginx + Linux, with 1, 4, and 6 NICs] Ilias Marinos, Robert N.M. Watson, and Mark Handley, "Network Stack Specialization for Performance", SIGCOMM 2014 https://doi.org/10.1145/2740070.2626311 This gap persists for most users until an adequate solution is identified I am exploring this 79
  71. Problem Analysis • Why are the existing performance-optimized TCP/IP stacks

    not commonly used despite their performance advantages? 80
  72. Problem Analysis • Why are the existing performance-optimized TCP/IP stacks

    not commonly used despite their performance advantages? • There is a negative cycle 81
  73. Problem Analysis • Why are the existing performance-optimized TCP/IP stacks

    not commonly used despite their performance advantages? • There is a negative cycle • Many of the performance-optimized TCP/IP stacks are research prototypes, and their implementations are immature 82
  74. Problem Analysis • Why are the existing performance-optimized TCP/IP stacks

    not commonly used despite their performance advantages? • There is a negative cycle • Many of the performance-optimized TCP/IP stacks are research prototypes, and their implementations are immature • On one hand, users do not like to use immature TCP/IP stacks 83
  75. Problem Analysis • Why are the existing performance-optimized TCP/IP stacks

    not commonly used despite their performance advantages? • There is a negative cycle • Many of the performance-optimized TCP/IP stacks are research prototypes, and their implementations are immature • On one hand, users do not like to use immature TCP/IP stacks • On the other hand, usersʼ feedback and bug reports are essential for the maturity of the implementations 84
  76. Problem Analysis • Why are the existing performance-optimized TCP/IP stacks

    not commonly used despite their performance advantages? • There is a negative cycle • Many of the performance-optimized TCP/IP stacks are research prototypes, and their implementations are immature • On one hand, users do not like to use immature TCP/IP stacks • On the other hand, usersʼ feedback and bug reports are essential for the maturity of the implementations • So, immature TCP/IP stack implementations can never get mature 85
  77. Problem Analysis • Why are the existing performance-optimized TCP/IP stacks

    not commonly used despite their performance advantages? • There is a negative cycle • Many of the performance-optimized TCP/IP stacks are research prototypes, and their implementations are immature • On one hand, users do not like to use immature TCP/IP stacks • On the other hand, usersʼ feedback and bug reports are essential for the maturity of the implementations • So, immature TCP/IP stack implementations can never get mature 86
  78. Problem Analysis • Why are the existing performance-optimized TCP/IP stacks

    not commonly used despite their performance advantages? • There is a negative cycle • Many of the performance-optimized TCP/IP stacks are research prototypes, and their implementations are immature • On one hand, users do not like to use immature TCP/IP stacks • On the other hand, usersʼ feedback and bug reports are essential for the maturity of the implementations • So, immature TCP/IP stack implementations can never get mature 87 How can we break this negative cycle?
  79. Problem Analysis • Why are the existing performance-optimized TCP/IP stacks

    not commonly used despite their performance advantages? • There is a negative cycle • Many of the performance-optimized TCP/IP stacks are research prototypes, and their implementations are immature • On one hand, users do not like to use immature TCP/IP stacks • On the other hand, usersʼ feedback and bug reports are essential for the maturity of the implementations • So, immature TCP/IP stack implementations can never get mature 88 How can we break this negative cycle? My Approach How about lowering the bar for users to try out performance-optimized TCP/IP stacks, making it easier to gather user feedback?
  80. The Issues of Development Complexities 89 Developer 1 Developer 2

    Developer 3 Codebase of Developer 1 Codebase of Developer 2 Codebase of Developer 3 It would be great if developers could work independently
  81. The Issues of Development Complexities 90 Developer 1 Developer 2

    Developer 3 The codebase of an OS It would be great if developers could work independently Many Developers … But, developers often work on the same OS codebases while there are boundaries of functionalities
  82. The Issues of Development Complexities 91 Developer 1 Developer 2

    Developer 3 The codebase of an OS It would be great if developers could work independently Many Developers … But, developers often work on the same OS codebases while there are boundaries of functionalities The codebase of a performance-optimized TCP/IP stack The issue is that currently, it is not easy to incorporate performance-optimized TCP/IP stacks and OSes
  83. The Issues of Development Complexities 92 Developer 1 Developer 2

    Developer 3 The codebase of an OS It would be great if developers could work independently Many Developers … But, developers often work on the same OS codebases while there are boundaries of functionalities The codebase of a performance-optimized TCP/IP stack The issue is that currently, it is not easy to incorporate performance-optimized TCP/IP stacks and OSes As a result, it is hard for many users to try out performance-optimized TCP/IP stacks and for their developers to have feedback from users
  84. The Issues of Development Complexities 93 Developer 1 Developer 2

    Developer 3 The codebase of an OS It would be great if developers could work independently Many Developers … But, developers often work on the same OS codebases while there are boundaries of functionalities The codebase of a performance-optimized TCP/IP stack The issue is that currently, it is not easy to incorporate performance-optimized TCP/IP stacks and OSes If the developers wish to properly incorporate their TCP/IP stacks and OSes, they need to interact with the OS developers while imposing communication overheads As a result, it is hard for many users to try out performance-optimized TCP/IP stacks and for their developers to have feedback from users
  85. Why I would like to bring up this discussion •

    I believe OS development based on the decentralized collaboration model is beneficial for developers and users • It reduces communication overheads between developers • It makes it easier for users to try out emerging projects and for those projects to gather user feedback → Eventually, users will have many mature advanced products • The challenging part is the methods to incorporate OS subsystems independently developed by different parties • I would like to discuss solutions to the challenges with the BSD community, a key driver of OS development 94
  86. Outline of This Talk • Why I would like to

    bring up this discussion • Prior exploration • Modest feature requests to the BSD community 95
  87. Outline of This Talk • Why I would like to

    bring up this discussion • Prior exploration • Modest feature requests to the BSD community 96
  88. Focuses of My Prior Exploration • Portability: an OS subsystem,

    like a TCP/IP stack, should be aware of portability for the compatibility with different OSes • Transparent integration: an OS subsystem should allow for transparent integration into an existing OS 97 The codebase of OS A The codebase of a TCP/IP stack The codebase of OS B The codebase of OS C Portability + Transparent integration I think these two are important for decentralized collaboration
  89. Focuses of My Prior Exploration • Portability: an OS subsystem,

    like a TCP/IP stack, should be aware of portability for the compatibility with different OSes • Transparent integration: an OS subsystem should allow for transparent integration into an existing OS 98 The codebase of OS A The codebase of a TCP/IP stack The codebase of OS B The codebase of OS C Portability + Transparent integration Can we achieve high performance and portability simultaneously? I think these two are important for decentralized collaboration
  90. Exploration for Portability • iip • A TCP/IP stack aware

    of both performance and portability • Paper at SIGCOMM CCR • https://doi.org/10.1145/3687230.3687233 • nominated to be presented at the Best of CCR session in SIGCOMM 2024 • Source code • https://github.com/yasukata/iip 99 https://yasukata.github.io/presentation/2024/08/sigcomm2024/sigcomm2024ccr_slides_yasukata.pdf
  91. Rough Numbers of Other Implementations • TCP ping-pong workload 100

    100 Gbps iip pinger app 32 CPU cores DPDK ponger app 1 CPU core The pinger and ponger exchange 1-byte TCP payloads A ponger thread handles 32 concurrent TCP connections Run the benchmark while changing the TCP/IP stack implementation TCP/IP stack
  92. Rough Numbers of Other Implementations 101 NOTE 1: this comparison

    is not fair because the TCP/IP stacks have different features and implementations NOTE 2: it is often the case that a system outperforms others in its developersʼ environment TAS uses three CPU cores and Caladan uses two CPU cores because their minimal setups require the additional CPU cores [charts: Throughput (million requests/sec, higher is better) and 99th %ile Latency (us, lower is better) for Linux, lwIP, Seastar, F-Stack, TAS, Caladan, and iip; Linux's 99th %ile latency is 160.3 us] The benchmark setup is found at https://github.com/yasukata/bench-iip#performance-numbers-of-other-tcpip-stacks
  93. Example Application • mimicached: a memcached-compatible server using iip •

    https://github.com/yasukata/mimicached 102 The benchmark setup is found at https://github.com/yasukata/mimicached#rough-performance-numbers • set 10% get 90% • zipfian distribution • 1 million key-value items • key size is 8 bytes • value size is 8 bytes • text protocol Workload
  94. Focuses of My Prior Exploration • Portability: an OS subsystem,

    like a TCP/IP stack, should be aware of portability for compatibility with different OSes • Transparent integration: an OS subsystem should allow for transparent integration into an existing OS 103 The codebase of OS A The codebase of a TCP/IP stack The codebase of OS B The codebase of OS C Portability + Transparent integration Can we achieve high performance and portability simultaneously? I think these two are important for decentralized collaboration
  95. Focuses of My Prior Exploration • Portability: an OS subsystem,

    like a TCP/IP stack, should be aware of portability for compatibility with different OSes • Transparent integration: an OS subsystem should allow for transparent integration into an existing OS 104 The codebase of OS A The codebase of a TCP/IP stack The codebase of OS B The codebase of OS C Portability + Transparent integration Portable TCP/IP stacks can be compiled and run on a wide range of OSes, but we need a mechanism to transparently integrate them into OSes I think these two are important for decentralized collaboration
  96. Ideally, it should be like a FUSE • Filesystem in

    Userspace (FUSE) Application Networking Virtual File System (VFS) File System 105 User Space Kernel
  97. Ideally, it should be like a FUSE • Filesystem in

    Userspace (FUSE) Application Networking Virtual File System (VFS) File System 106 User Space Kernel The VFS layer steers requests from an application to kernel-space services
  98. Ideally, it should be like a FUSE • Filesystem in

    Userspace (FUSE) Application Networking Virtual File System (VFS) File System 107 User Space Kernel The VFS layer steers requests from an application to kernel-space services FUSE Driver
  99. Ideally, it should be like a FUSE • Filesystem in

    Userspace (FUSE) Application Networking Virtual File System (VFS) File System 108 User Space Kernel The VFS layer steers requests from an application to kernel-space services FUSE Daemon FUSE Driver
  100. Ideally, it should be like a FUSE • Filesystem in

    Userspace (FUSE) Application Networking Virtual File System (VFS) File System 109 User Space Kernel The VFS layer steers requests from an application to kernel-space services FUSE Daemon FUSE Driver File operation requests from an application are steered to a user-space file system implementation through the kernel-space FUSE subsystem
  101. Ideally, it should be like a FUSE • Filesystem in

    Userspace (FUSE) Application Networking Virtual File System (VFS) File System 110 User Space Kernel The VFS layer steers requests from an application to kernel-space services FUSE Daemon FUSE Driver File operation requests from an application are steered to a user-space file system implementation through the kernel-space FUSE subsystem Applications can be agnostic to FUSE and kernel-space file systems because their APIs are the same system calls
  102. The Main Discussion Point of This Talk 111 Developer 1

    Time Developer 2 Developer 3 Codebase of OS functionality A Codebase of OS functionality B Codebase of OS functionality C Coming back to ... It would be great if • developers did not need to communicate with each other • and the OS functionalities could be easily incorporated
  103. Ideally, it should be like a FUSE 112 Developer 1

    Time Developer 2 Developer 3 FUSE FS 1 FUSE FS 2 An OS
  104. Ideally, it should be like a FUSE 113 Developer 1

    Time Developer 2 Developer 3 FUSE FS 1 FUSE FS 2 An OS
  105. Ideally, it should be like a FUSE 114 Developer 1

    Time Developer 2 Developer 3 FUSE FS 1 FUSE FS 2 An OS
  106. Ideally, it should be like a FUSE 115 Developer 1

    Time Developer 2 Developer 3 An OS FUSE FS 1 FUSE FS 2 The developers of the OS, FUSE FS 1, FUSE FS 2 can work independently while their implementations can be properly incorporated
  107. Ideally, it should be like a FUSE 116 Developer 1

    Time Developer 2 Developer 3 An OS FUSE FS 1 FUSE FS 2 The developers of the OS, FUSE FS 1, FUSE FS 2 can work independently while their implementations can be properly incorporated My thought: this is great, so can we do the same for TCP/IP stacks?
  108. Ideally, it should be like a FUSE • What I

    wish to do Application Networking Virtual File System (VFS) File System 117 User Space Kernel FUSE Daemon FUSE Driver
  109. Ideally, it should be like a FUSE • What I

    wish to do Application Networking Virtual File System (VFS) File System 118 User Space Kernel FUSE Daemon FUSE Driver VFS-like Layer Adding a VFS-like layer in user space
  110. Ideally, it should be like a FUSE • What I

    wish to do Application Networking Virtual File System (VFS) File System 119 User Space Kernel FUSE Daemon FUSE Driver VFS-like Layer TCP/IP Stack Adding a VFS-like layer in user space Redirect application requests to a TCP/IP stack in user space
  111. Ideally, it should be like a FUSE • What I

    wish to do Application Networking Virtual File System (VFS) File System 120 User Space Kernel FUSE Daemon FUSE Driver VFS-like Layer TCP/IP Stack How can we add this?
  112. Ideally, it should be like a FUSE • What I

    wish to do Application Networking Virtual File System (VFS) File System 121 User Space Kernel FUSE Daemon FUSE Driver VFS-like Layer TCP/IP Stack How can we add this? Hooking system calls seems to be a good option
  113. Ideally, it should be like a FUSE • What I

    wish to do Application Networking Virtual File System (VFS) File System 122 User Space Kernel FUSE Daemon FUSE Driver VFS-like Layer TCP/IP Stack How can we add this? Hooking system calls seems to be a good option But I found that existing hook mechanisms have drawbacks
  114. Exploration for Transparent Integration • zpoline • A system call

    hook mechanism for x86-64 based on binary rewriting • Paper at USENIX ATC 2023 • https://www.usenix.org/conference/atc23/presentation/yasukata • received a Best Paper Award • Source code • https://github.com/yasukata/zpoline 123
  115. Exploration for Transparent Integration • svc-hook • A system call

    hook mechanism for ARM64 based on binary rewriting • Paper at ACM/IFIP Middleware 2025 • https://dl.acm.org/doi/10.1145/3721462.3770771 • Source code • https://github.com/retrage/svc-hook 124 https://speakerdeck.com/retrage/svc-hook-hooking-system-calls-on-arm64-by-binary-rewriting
  116. Ideally, it should be like a FUSE • For the

    next step ... Application Networking Virtual File System (VFS) File System 125 User Space Kernel FUSE Daemon FUSE Driver VFS-like Layer TCP/IP Stack I am exploring an appropriate design of this layer that can offer good compatibility with various systems without largely diminishing application performance
  117. Outline of This Talk • Why I would like to

    bring up this discussion • Prior exploration • Modest feature requests to the BSD community 126
  118. Outline of This Talk • Why I would like to

    bring up this discussion • Prior exploration • Modest feature requests to the BSD community 127
  119. Modest Feature Requests to BSD OSes • I would like

    to humbly put forward modest feature requests Application Networking Virtual File System (VFS) File System 128 User Space Kernel FUSE Daemon FUSE Driver VFS-like Layer TCP/IP Stack
  120. Modest Feature Requests to BSD OSes • I would like

    to humbly put forward modest feature requests Application Networking Virtual File System (VFS) File System 129 User Space Kernel FUSE Daemon FUSE Driver VFS-like Layer TCP/IP Stack I find doing everything properly is hard without the official support of OS kernels (FUSE is an example of such support)
  121. Modest Feature Requests to BSD OSes • I would like

    to humbly put forward modest feature requests Application Networking Virtual File System (VFS) File System 130 User Space Kernel FUSE Daemon FUSE Driver VFS-like Layer TCP/IP Stack I find doing everything properly is hard without the official support of OS kernels (FUSE is an example of such support) Feature Request The official kernel support for efficient system call hooking
  122. Modest Feature Requests to BSD OSes • I would like

    to humbly put forward modest feature requests Application Networking Virtual File System (VFS) File System 131 User Space Kernel FUSE Daemon FUSE Driver VFS-like Layer TCP/IP Stack I find doing everything properly is hard without the official support of OS kernels (FUSE is an example of such support) Feature Request The official kernel support for efficient system call hooking because binary rewriting approaches are not always applicable
  123. Modest Feature Requests to BSD OSes • I would like

    to humbly put forward modest feature requests Application Networking Virtual File System (VFS) File System 132 User Space Kernel FUSE Daemon FUSE Driver VFS-like Layer TCP/IP Stack I find doing everything properly is hard without the official support of OS kernels (FUSE is an example of such support) Feature Request The official kernel support for efficient system call hooking because binary rewriting approaches are not always applicable I think that there are many people who wish to have this feature; please see the references in our system call hooking papers
  124. Modest Feature Requests to BSD OSes • Rough idea of

    the kernel support for system call hooking 133 Application User Space Kernel syscall Immediately getting back to user space while jumping to a memory address preliminarily configured through a designated control interface
  125. Modest Feature Requests to BSD OSes • Rough idea of

    the kernel support for system call hooking 134 Application User Space Kernel syscall Immediately getting back to user space while jumping to a memory address preliminarily configured through a designated control interface It would be nice if users could use a BPF-like mechanism to choose, at run time, whether to return to user space or to execute the kernel-space system call BPF program
  126. Modest Feature Requests to BSD OSes • Rough idea of

    the kernel support for system call hooking 135 Application User Space Kernel syscall function pointer function Module It would be enough to add a function pointer that can be set by a loadable kernel module call
  127. Modest Feature Requests to BSD OSes • Rough idea of

    the kernel support for system call hooking Discussion point • What security implications should we consider? → Can we apply the same assumptions as ptrace? 136
  128. Modest Feature Requests to BSD OSes Another request • I

    would appreciate it if you could consider adding any support for decentralized collaboration in future updates 137
  129. Summary • It would be great if OS functionalities could

    be developed in a decentralized manner while still being easily incorporated • OS designs have a significant impact on how OSes are developed, and some designs would be better suited to decentralized collaboration • If certain features were officially supported by major OS kernels, it could drive substantial progress in this area 138