
AsiaBSDCon 2026 Keynote1

yasukata

March 21, 2026


Transcript

  1. Quiz Q. What is the key design principle of distributed

    systems for achieving optimal performance? (while there would be many ...) 2
  2. Quiz Q. What is the key design principle of distributed

    systems for achieving optimal performance? (while there would be many ...) • Ideally, the throughput should increase in proportion to the number of workers 3 [chart: Normalized Throughput vs. Number of Workers, showing an "Optimal" line]
  3. Quiz Q. What is the key design principle of distributed

    systems for achieving optimal performance? (while there would be many ...) • Ideally, the throughput should increase in proportion to the number of workers • In practice, distributed systems often fail to scale their performance 4 [chart: Normalized Throughput vs. Number of Workers, showing "Optimal" and "Suboptimal" lines]
  4. Quiz Q. What is the key design principle of distributed

    systems for achieving optimal performance? (while there would be many ...) • Ideally, the throughput should increase in proportion to the number of workers • In practice, distributed systems often fail to scale their performance A. Shared nothing • to minimize communications and synchronizations between workers 5 [chart: Normalized Throughput vs. Number of Workers, showing "Optimal" and "Suboptimal" lines]
  5. Quiz • In practice 11 Worker 1 Time Worker 2

    Worker 3 Shared Item lock acquired
  6. Quiz • In practice 13 Worker 1 Time Worker 2

    Worker 3 Shared Item wait wait
  7. Quiz • In practice 14 Worker 1 Time Worker 2

    Worker 3 Shared Item wait wait lock released
  8. Quiz • In practice 15 Worker 1 Time Worker 2

    Worker 3 Shared Item wait wait lock acquired
  9. Quiz • In practice 16 Worker 1 Time Worker 2

    Worker 3 Shared Item wait wait
  10. Quiz • In practice 17 Worker 1 Time Worker 2

    Worker 3 Shared Item wait lock released wait
  11. Quiz • In practice 18 Worker 1 Time Worker 2

    Worker 3 Shared Item lock acquired wait
  12. Quiz • In practice 19 Worker 1 Time Worker 2

    Worker 3 Shared Item wait wait
  13. Quiz • In practice 20 Worker 1 Time Worker 2

    Worker 3 Shared Item wait wait lock released
  14. Quiz • In practice 21 Worker 1 Time Worker 2

    Worker 3 Shared Item lock acquired wait
  15. Quiz • In practice 23 Worker 1 Time Worker 2

    Worker 3 Shared Item wait These are the time slots that the workers failed to spend on meaningful tasks
  16. Quiz • Shared nothing 24 Worker 1 Time Worker 2

    Worker 3 Item for Worker 1 Item for Worker 2 Item for Worker 3
  17. Quiz • Shared nothing 25 Worker 1 Time Worker 2

    Worker 3 Item for Worker 1 Item for Worker 2 Item for Worker 3 lock acquired
  18. Quiz • Shared nothing 26 Worker 1 Time Worker 2

    Worker 3 Item for Worker 1 Item for Worker 2 Item for Worker 3 lock acquired
  19. Quiz • Shared nothing 27 Worker 1 Time Worker 2

    Worker 3 Item for Worker 1 Item for Worker 2 Item for Worker 3 lock acquired
  20. Quiz • Shared nothing 28 Worker 1 Time Worker 2

    Worker 3 Item for Worker 1 Item for Worker 2 Item for Worker 3 lock released
  21. Quiz • Shared nothing 29 Worker 1 Time Worker 2

    Worker 3 Item for Worker 1 Item for Worker 2 Item for Worker 3 lock released
  22. Quiz • Shared nothing 30 Worker 1 Time Worker 2

    Worker 3 Item for Worker 1 Item for Worker 2 Item for Worker 3 lock acquired
  23. Quiz • Shared nothing 31 Worker 1 Time Worker 2

    Worker 3 Item for Worker 1 Item for Worker 2 Item for Worker 3 lock released
  24. Quiz • Shared nothing 32 Worker 1 Time Worker 2

    Worker 3 Item for Worker 1 Item for Worker 2 Item for Worker 3 The workers can spend all their time on meaningful tasks because they do not need to wait for each other to access a shared item
  25. Quiz • Shared nothing 33 Worker 1 Time Worker 2

    Worker 3 Item for Worker 1 Item for Worker 2 Item for Worker 3 The workers can spend all their time on meaningful tasks because they do not need to wait for each other to access a shared item In general, the designers of distributed systems try to reduce shared objects to achieve high throughput
  26. Shared Items in Software Development? 35 Developer 1 Time Developer

    2 Developer 3 Shared Codebase Developers share the codebase, and their communications are often far more complicated than those of computer systems
  27. The Main Discussion Point of This Talk 36 Developer 1

    Time Developer 2 Developer 3 Codebase of Developer 1 Codebase of Developer 2 Codebase of Developer 3 Can we apply the shared nothing model to software development?
  28. The Main Discussion Point of This Talk 37 Developer 1

    Time Developer 2 Developer 3 Codebase of OS functionality A Codebase of OS functionality B Codebase of OS functionality C It would be great if • developers do not need to communicate with each other • and, the OS functionalities are easily incorporated
  29. The Main Discussion Point of This Talk 38 Developer 1

    Time Developer 2 Developer 3 Codebase of OS functionality A,B,C Current OS development often relies on intensive communications of developers It would be great if • developers do not need to communicate with each other • and, the OS functionalities are easily incorporated
  30. The Main Discussion Point of This Talk 39 Developer 1

    Time Developer 2 Developer 3 Codebase of OS functionality A,B,C Current OS development often relies on intensive communications of developers Discussion Point How about exploring the possibility of decentralized collaboration? It would be great if • developers do not need to communicate with each other • and, the OS functionalities are easily incorporated
  31. Outline of This Talk • Why I would like to

    bring up this discussion • Prior exploration • Modest feature requests to the BSD community 40
  32. Outline of This Talk • Why I would like to

    bring up this discussion • Prior exploration • Modest feature requests to the BSD community 41
  33. In Early 2010s, ... • The throughput of commodity NICs

    has reached 10 Gbps • It got challenging for software to utilize the potential of NICs NIC 42
  34. In Early 2010s, ... • The common steps for a

    packet transmission NIC User-space program Kernel 43
  35. In Early 2010s, ... • The common steps for a

    packet transmission NIC User-space program Kernel Payload Buffer (DRAM) 1. prepares a payload 44
  36. In Early 2010s, ... • The common steps for a

    packet transmission NIC User-space program Kernel Payload Buffer (DRAM) 1. prepares a payload 2. triggers a system call like write() 45
  37. In Early 2010s, ... • The common steps for a

    packet transmission NIC User-space program Kernel Payload Buffer (DRAM) 1. prepares a payload 2. triggers a system call like write() 46
  38. In Early 2010s, ... • The common steps for a

    packet transmission NIC User-space program Kernel Payload Buffer (DRAM) 1. prepares a payload 2. triggers a system call like write() 3. allocates a buffer Buffer (DRAM) 47
  39. In Early 2010s, ... • The common steps for a

    packet transmission NIC User-space program Kernel Payload Buffer (DRAM) 1. prepares a payload 2. triggers a system call like write() 3. allocates a buffer Buffer (DRAM) 4. copies in the payload Payload 48
  40. In Early 2010s, ... • The common steps for a

    packet transmission NIC User-space program Kernel Payload Buffer (DRAM) 1. prepares a payload 2. triggers a system call like write() 3. allocates a buffer Buffer (DRAM) 4. copies in the payload Payload 5. associates the buffer with the NIC 49
  41. In Early 2010s, ... • The common steps for a

    packet transmission NIC User-space program Kernel Payload Buffer (DRAM) 1. prepares a payload 2. triggers a system call like write() 3. allocates a buffer Buffer (DRAM) 4. copies in the payload Payload 5. associates the buffer with the NIC 6. kicks the NIC to start the transmission 50
  42. In Early 2010s, ... • The common steps for a

    packet transmission NIC User-space program Kernel Payload Buffer (DRAM) 1. prepares a payload 2. triggers a system call like write() 3. allocates a buffer Buffer (DRAM) 4. copies in the payload Payload 5. associates the buffer with the NIC 6. kicks the NIC to start the transmission 51 People found that these steps are costly and make it hard to achieve high packet I/O performance
  43. In Early 2010s, ... • netmap, a packet I/O framework,

    merged in FreeBSD NIC User-space program Kernel 52
  44. In Early 2010s, ... • netmap, a packet I/O framework,

    merged in FreeBSD NIC User-space program Kernel Buffer (DRAM) 0. allocates buffers and associate them with the NIC 53
  45. In Early 2010s, ... • netmap, a packet I/O framework,

    merged in FreeBSD NIC User-space program Kernel 0. exposes buffers associated with the NIC through mmap() during the setup phase Buffer (DRAM) 0. allocates buffers and associate them with the NIC mmap 54
  46. In Early 2010s, ... • netmap, a packet I/O framework,

    merged in FreeBSD NIC User-space program Kernel 0. exposes buffers associated with the NIC through mmap() during the setup phase Buffer (DRAM) 0. allocates buffers and associate them with the NIC mmap Payload 1. prepares a payload Payload 55
  47. In Early 2010s, ... • netmap, a packet I/O framework,

    merged in FreeBSD NIC User-space program Kernel 0. exposes buffers associated with the NIC through mmap() during the setup phase Buffer (DRAM) 0. allocates buffers and associate them with the NIC mmap Payload 1. prepares a payload Payload 56 These are physically the same thus, no copy is necessary
  48. In Early 2010s, ... • netmap, a packet I/O framework,

    merged in FreeBSD NIC User-space program Kernel 0. exposes buffers associated with the NIC through mmap() during the setup phase Buffer (DRAM) 0. allocates buffers and associate them with the NIC mmap Payload 1. prepares a payload 2. triggers a system call like poll() and ioctl() Payload 57
  49. In Early 2010s, ... • netmap, a packet I/O framework,

    merged in FreeBSD NIC User-space program Kernel 0. exposes buffers associated with the NIC through mmap() during the setup phase Buffer (DRAM) 0. allocates buffers and associate them with the NIC mmap Payload 1. prepares a payload 2. triggers a system call like poll() and ioctl() Payload 58
  50. In Early 2010s, ... • netmap, a packet I/O framework,

    merged in FreeBSD NIC User-space program Kernel 0. exposes buffers associated with the NIC through mmap() during the setup phase Buffer (DRAM) 0. allocates buffers and associate them with the NIC mmap Payload 1. prepares a payload 2. triggers a system call like poll() and ioctl() 3. kicks the NIC to start the transmission Payload 59
  51. In Early 2010s, ... • netmap, a packet I/O framework,

    merged in FreeBSD NIC User-space program Kernel 0. exposes buffers associated with the NIC through mmap() during the setup phase Buffer (DRAM) 0. allocates buffers and associate them with the NIC mmap Payload 1. prepares a payload 2. triggers a system call like poll() and ioctl() 3. kicks the NIC to start the transmission Payload 60 netmap offers the fast data path
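In code, the netmap transmit path looks roughly like the following sketch. It follows the public netmap API (net/netmap_user.h), but it is not runnable as-is: it needs a netmap-capable kernel and NIC, and the interface name "ix0" is a placeholder.

```c
/* sketch only: requires a netmap-enabled kernel and NIC */
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
#include <string.h>
#include <sys/ioctl.h>

int tx_one_frame(const char *frame, size_t len)
{
    /* step 0: nm_open() mmap()s the NIC's buffers into our address space */
    struct nm_desc *d = nm_open("netmap:ix0", NULL, 0, NULL);
    if (d == NULL)
        return -1;
    struct netmap_ring *ring = NETMAP_TXRING(d->nifp, 0);
    if (nm_ring_empty(ring)) {
        nm_close(d);
        return -1;
    }
    struct netmap_slot *slot = &ring->slot[ring->cur];
    /* step 1: prepare the payload directly in the shared buffer -- this is
     * physically the same memory the NIC reads from, so no copy by the
     * kernel is needed */
    memcpy(NETMAP_BUF(ring, slot->buf_idx), frame, len);
    slot->len = (uint16_t)len;
    ring->head = ring->cur = nm_ring_next(ring, ring->cur);
    /* steps 2-3: one system call flushes the ring and kicks the NIC */
    ioctl(d->fd, NIOCTXSYNC, NULL);
    nm_close(d);
    return 0;
}
```

A real application would keep the descriptor open and batch many slots per NIOCTXSYNC (or poll()), which is where the per-packet cost amortizes away.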
  52. In Early 2010s, ... • netmap, a packet I/O framework,

    merged in FreeBSD [slide shows an excerpt of the netmap paper; its Figure 5 plots Tx Rate (Mpps) against clock speed (GHz) for netmap on 1, 2, and 4 cores, Linux/pktgen (a specialised in-kernel generator, peaking at about 4 Mpps), and FreeBSD/netsend (user space, peaking at 1.05 Mpps)] Luigi Rizzo, “netmap: a novel framework for fast packet I/O”, USENIX ATC 2012 https://www.usenix.org/conference/atc12/technical-sessions/presentation/rizzo 61
  53. In Early 2010s, ... • netmap, a packet I/O framework,

    merged in FreeBSD • The speed of packet transmission over a raw socket is 1.05 Mpps • Mpps: million packets per second [slide shows an excerpt of the netmap paper, including its Figure 5] 62 Luigi Rizzo, “netmap: a novel framework for fast packet I/O”, USENIX ATC 2012 https://www.usenix.org/conference/atc12/technical-sessions/presentation/rizzo
  54. In Early 2010s, ... • netmap, a packet I/O framework,

    merged in FreeBSD • The speed of packet transmission over a raw socket is 1.05 Mpps • Mpps: million packets per second • netmap achieves the line rate (14.88 Mpps) with one CPU core [slide shows an excerpt of the netmap paper, including its Figure 5] 63 Luigi Rizzo, “netmap: a novel framework for fast packet I/O”, USENIX ATC 2012 https://www.usenix.org/conference/atc12/technical-sessions/presentation/rizzo
  55. In Early 2010s, ... • netmap, a packet I/O framework,

    merged in FreeBSD NIC User-space program Kernel 64 Payload
  56. In Early 2010s, ... • netmap, a packet I/O framework,

    merged in FreeBSD NIC User-space program Kernel 65 Payload TCP/IP Stack When netmap is activated, payloads bypass the kernel-space TCP/IP stack
  57. In Early 2010s, ... • netmap, a packet I/O framework,

    merged in FreeBSD NIC User-space program Kernel 66 Payload TCP/IP Stack When netmap is activated, payloads bypass the kernel-space TCP/IP stack The user-space program bypassing a TCP/IP stack cannot communicate over TCP/IP networks
  58. In Early 2010s, ... • netmap, a packet I/O framework,

    merged in FreeBSD NIC User-space program Kernel 67 Payload TCP/IP Stack When netmap is activated, payloads bypass the kernel-space TCP/IP stack TCP/IP Stack User-space TCP/IP stacks enable user-space programs to communicate over TCP/IP networks while using netmap
  59. In Early 2010s, ... • netmap + a user-space TCP/IP

    stack NIC User-space program Kernel 0. exposes buffers associated with the NIC through mmap() during the setup phase Buffer (DRAM) 0. allocates buffers and associate them with the NIC mmap 68
  60. In Early 2010s, ... • netmap + a user-space TCP/IP

    stack NIC User-space program Kernel 0. exposes buffers associated with the NIC through mmap() during the setup phase Buffer (DRAM) 0. allocates buffers and associate them with the NIC mmap Payload 1. prepares a payload Payload 69
  61. In Early 2010s, ... • netmap + a user-space TCP/IP

    stack NIC User-space program Kernel 0. exposes buffers associated with the NIC through mmap() during the setup phase Buffer (DRAM) 0. allocates buffers and associate them with the NIC mmap Payload 1. prepares a payload Payload 2. prepares the header using the user-space TCP/IP stack 70
  62. In Early 2010s, ... • netmap + a user-space TCP/IP

    stack NIC User-space program Kernel 0. exposes buffers associated with the NIC through mmap() during the setup phase Buffer (DRAM) 0. allocates buffers and associate them with the NIC mmap Payload 1. prepares a payload 2. prepares the header using the user-space TCP/IP stack Payload 3. triggers a system call like poll() and ioctl() 71
  63. In Early 2010s, ... • netmap + a user-space TCP/IP

    stack NIC User-space program Kernel 0. exposes buffers associated with the NIC through mmap() during the setup phase Buffer (DRAM) 0. allocates buffers and associate them with the NIC mmap Payload 1. prepares a payload 2. prepares the header using the user-space TCP/IP stack Payload 3. triggers a system call like poll() and ioctl() 72
  64. In Early 2010s, ... • netmap + a user-space TCP/IP

    stack NIC User-space program Kernel 0. exposes buffers associated with the NIC through mmap() during the setup phase Buffer (DRAM) 0. allocates buffers and associate them with the NIC mmap Payload 1. prepares a payload 2. prepares the header using the user-space TCP/IP stack Payload 3. triggers a system call like poll() and ioctl() 3. kicks the NIC to start the transmission 73
  65. In Early 2010s, ... • Sandstorm (SIGCOMMʼ14) is a web

    server based on a user- space TCP/IP stack that leverages netmap for its packet I/O 32 64 128 256 512 756 1024 File size (KB) Sandstorm nginx + FreeBSD nginx + Linux roughput, 4 NICs 4 8 16 24 32 64 128 256 512 756 1024 0 20 40 60 File size (KB) Throughput (Gbps) Sandstorm nginx + FreeBSD nginx + Linux (c) Network throughput, 6 NICs 100 Figure 4 74 Ilias Marinos, Robert N.M. Watson, and Mark Handley, "Network Stack Specialization for Performance", SIGCOMM 2014 https://doi.org/10.1145/2740070.2626311
  66. In Early 2010s, ... • Many performance-optimized TCP/IP stacks have

    been introduced • StackMap (USENIX ATCʼ16) • F-Stack (Tencent) • Shenango (NSDIʼ19) • TAS (EuroSysʼ19) • Demikernel (SOSPʼ21) • Luna (USENIX ATCʼ23) • … • mTCP (NSDIʼ14) • Sandstorm (SIGCOMMʼ14) • UTCP (SIGCOMM CCR in 2014) • Arrakis (OSDIʼ14) • IX (OSDIʼ14) • Seastar (Cloudius Systems) 75
  67. Problem Statement • As of today, performance-optimized TCP/IP stacks have

    not become common among most users and are not widely used • even though the techniques to achieve high performance were discovered more than 10 years ago 76
  68. Problem Statement • As of today, performance-optimized TCP/IP stacks have

    not become common among most users and are not widely used • even though the techniques to achieve high performance were discovered more than 10 years ago [slide shows Figure 4 of the Sandstorm paper: network throughput (Gbps) vs. file size (KB) for Sandstorm, nginx + FreeBSD, and nginx + Linux, with 1, 4, and 6 NICs] Ilias Marinos, Robert N.M. Watson, and Mark Handley, "Network Stack Specialization for Performance", SIGCOMM 2014 https://doi.org/10.1145/2740070.2626311 77
  69. Problem Statement • As of today, performance-optimized TCP/IP stacks have

    not become common among most users and are not widely used • even though the techniques to achieve high performance were discovered more than 10 years ago [slide shows Figure 4 of the Sandstorm paper: network throughput (Gbps) vs. file size (KB) for Sandstorm, nginx + FreeBSD, and nginx + Linux, with 1, 4, and 6 NICs] Ilias Marinos, Robert N.M. Watson, and Mark Handley, "Network Stack Specialization for Performance", SIGCOMM 2014 https://doi.org/10.1145/2740070.2626311 This gap persists for most users until an adequate solution is identified 78
  70. Problem Statement • As of today, performance-optimized TCP/IP stacks have

    not become common among most users and are not widely used • even though the techniques to achieve high performance were discovered more than 10 years ago [slide shows Figure 4 of the Sandstorm paper: network throughput (Gbps) vs. file size (KB) for Sandstorm, nginx + FreeBSD, and nginx + Linux, with 1, 4, and 6 NICs] Ilias Marinos, Robert N.M. Watson, and Mark Handley, "Network Stack Specialization for Performance", SIGCOMM 2014 https://doi.org/10.1145/2740070.2626311 This gap persists for most users until an adequate solution is identified I am exploring this 79
  71. Problem Analysis • Why are the existing performance-optimized TCP/IP stacks

    not commonly used despite their performance advantages? 80
  72. Problem Analysis • Why are the existing performance-optimized TCP/IP stacks

    not commonly used despite their performance advantages? • There is a negative cycle 81
  73. Problem Analysis • Why are the existing performance-optimized TCP/IP stacks

    not commonly used despite their performance advantages? • There is a negative cycle • Many of the performance-optimized TCP/IP stacks are research prototypes, and their implementations are immature 82
  74. Problem Analysis • Why are the existing performance-optimized TCP/IP stacks

    not commonly used despite their performance advantages? • There is a negative cycle • Many of the performance-optimized TCP/IP stacks are research prototypes, and their implementations are immature • On one hand, users do not like to use immature TCP/IP stacks 83
  75. Problem Analysis • Why are the existing performance-optimized TCP/IP stacks

    not commonly used despite their performance advantages? • There is a negative cycle • Many of the performance-optimized TCP/IP stacks are research prototypes, and their implementations are immature • On one hand, users do not like to use immature TCP/IP stacks • On the other hand, usersʼ feedback and bug reports are essential for the maturity of the implementations 84
  76. Problem Analysis • Why are the existing performance-optimized TCP/IP stacks

    not commonly used despite their performance advantages? • There is a negative cycle • Many of the performance-optimized TCP/IP stacks are research prototypes, and their implementations are immature • On one hand, users do not like to use immature TCP/IP stacks • On the other hand, usersʼ feedback and bug reports are essential for the maturity of the implementations • So, immature TCP/IP stack implementations can never get mature 85
  77. Problem Analysis • Why are the existing performance-optimized TCP/IP stacks

    not commonly used despite their performance advantages? • There is a negative cycle • Many of the performance-optimized TCP/IP stacks are research prototypes, and their implementations are immature • On one hand, users do not like to use immature TCP/IP stacks • On the other hand, usersʼ feedback and bug reports are essential for the maturity of the implementations • So, immature TCP/IP stack implementations can never get mature 86
  78. Problem Analysis • Why are the existing performance-optimized TCP/IP stacks

    not commonly used despite their performance advantages? • There is a negative cycle • Many of the performance-optimized TCP/IP stacks are research prototypes, and their implementations are immature • On one hand, users do not like to use immature TCP/IP stacks • On the other hand, usersʼ feedback and bug reports are essential for the maturity of the implementations • So, immature TCP/IP stack implementations can never get mature 87 How can we break this negative cycle?
  79. Problem Analysis • Why are the existing performance-optimized TCP/IP stacks

    not commonly used despite their performance advantages? • There is a negative cycle • Many of the performance-optimized TCP/IP stacks are research prototypes, and their implementations are immature • On one hand, users do not like to use immature TCP/IP stacks • On the other hand, usersʼ feedback and bug reports are essential for the maturity of the implementations • So, immature TCP/IP stack implementations can never get mature 88 How can we break this negative cycle? My Approach How about lowering the bar for users to try out performance-optimized TCP/IP stacks, making it easier to gather user feedback?
  80. The Issues of Development Complexities 89 Developer 1 Developer 2

    Developer 3 Codebase of Developer 1 Codebase of Developer 2 Codebase of Developer 3 It would be great if developers could work independently
  81. The Issues of Development Complexities 90 Developer 1 Developer 2

    Developer 3 The codebase of an OS It would be great if developers could work independently Many Developers … But, developers often work on the same OS codebases while there are boundaries of functionalities
  82. The Issues of Development Complexities 91 Developer 1 Developer 2

    Developer 3 The codebase of an OS It would be great if developers could work independently Many Developers … But, developers often work on the same OS codebases while there are boundaries of functionalities The codebase of a performance-optimized TCP/IP stack The issue is that currently, it is not easy to incorporate performance-optimized TCP/IP stacks and OSes
  83. The Issues of Development Complexities 92 Developer 1 Developer 2

    Developer 3 The codebase of an OS It would be great if developers could work independently Many Developers … But, developers often work on the same OS codebases while there are boundaries of functionalities The codebase of a performance-optimized TCP/IP stack The issue is that currently, it is not easy to incorporate performance-optimized TCP/IP stacks and OSes As a result, it is hard for many users to try out performance-optimized TCP/IP stacks and for their developers to have feedback from users
  84. The Issues of Development Complexities 93 Developer 1 Developer 2

    Developer 3 The codebase of an OS It would be great if developers could work independently Many Developers … But, developers often work on the same OS codebases while there are boundaries of functionalities The codebase of a performance-optimized TCP/IP stack The issue is that currently, it is not easy to incorporate performance-optimized TCP/IP stacks and OSes If the developers wish to properly incorporate their TCP/IP stacks and OSes, they need to interact with the OS developers while imposing communication overheads As a result, it is hard for many users to try out performance-optimized TCP/IP stacks and for their developers to have feedback from users
  85. Why I would like to bring up this discussion •

    I believe OS development based on the decentralized collaboration model is beneficial for developers and users • It reduces communication overheads between developers • It makes it easier for users to try out emerging projects and for those projects to gather user feedback → Eventually, users will have many mature advanced products • The challenging part is the methods to incorporate OS subsystems independently developed by different parties • I would like to discuss solutions to the challenges with the BSD community, a key driver of OS development 94
  86. Outline of This Talk • Why I would like to

    bring up this discussion • Prior exploration • Modest feature requests to the BSD community 95
  87. Outline of This Talk • Why I would like to

    bring up this discussion • Prior exploration • Modest feature requests to the BSD community 96
  88. Focuses of My Prior Exploration • Portability: an OS subsystem,

    like a TCP/IP stack, should be aware of portability for the compatibility with different OSes • Transparent integration: an OS subsystem should allow for transparent integration into an existing OS 97 The codebase of OS A The codebase of a TCP/IP stack The codebase of OS B The codebase of OS C Portability + Transparent integration I think these two are important for decentralized collaboration
  89. Focuses of My Prior Exploration • Portability: an OS subsystem,

    like a TCP/IP stack, should be aware of portability for the compatibility with different OSes • Transparent integration: an OS subsystem should allow for transparent integration into an existing OS 98 The codebase of OS A The codebase of a TCP/IP stack The codebase of OS B The codebase of OS C Portability + Transparent integration Can we achieve high performance and portability simultaneously? I think these two are important for decentralized collaboration
  90. Exploration for Portability • iip • A TCP/IP stack aware

    of both performance and portability • Paper at SIGCOMM CCR • https://doi.org/10.1145/3687230.3687233 • nominated to be presented at the Best of CCR session in SIGCOMM 2024 • Source code • https://github.com/yasukata/iip 99 https://yasukata.github.io/presentation/2024/08/sigcomm2024/sigcomm2024ccr_slides_yasukata.pdf
  91. Rough Numbers of Other Implementations • TCP ping-pong workload 100

    100 Gbps iip pinger app 32 CPU cores DPDK ponger app 1 CPU core The pinger and ponger exchange 1-byte TCP payloads A ponger thread handles 32 concurrent TCP connections Run the benchmark while changing the TCP/IP stack implementation TCP/IP stack
  92. Rough Numbers of Other Implementations 101 NOTE 1: this comparison

    is not fair because the TCP/IP stacks have different features and implementations NOTE 2: it is often the case that a system outperforms others in its developersʼ environment TAS uses three CPU cores and Caladan uses two CPU cores because their minimal setups require the additional CPU cores [charts: Throughput (million requests/sec, higher is better) and 99th %ile Latency (us, lower is better) for Linux, lwIP, Seastar, F-Stack, TAS, Caladan, and iip; Linux's 99th %ile latency is 160.3 us] The benchmark setup is found at https://github.com/yasukata/bench-iip#performance-numbers-of-other-tcpip-stacks
  93. Example Application • mimicached: a memcached-compatible server using iip •

    https://github.com/yasukata/mimicached 102 The benchmark setup is found at https://github.com/yasukata/mimicached#rough-performance-numbers • set 10% get 90% • zipfian distribution • 1 million key-value items • key size is 8 bytes • value size is 8 bytes • text protocol Workload
  94. Focuses of My Prior Exploration • Portability: an OS subsystem,

    like a TCP/IP stack, should be aware of portability for compatibility with different OSes • Transparent integration: an OS subsystem should allow for transparent integration into an existing OS 103 The codebase of OS A The codebase of a TCP/IP stack The codebase of OS B The codebase of OS C Portability + Transparent integration Can we achieve high performance and portability simultaneously? I think these two are important for decentralized collaboration
  95. Focuses of My Prior Exploration • Portability: an OS subsystem,

    like a TCP/IP stack, should be aware of portability for compatibility with different OSes • Transparent integration: an OS subsystem should allow for transparent integration into an existing OS 104 The codebase of OS A The codebase of a TCP/IP stack The codebase of OS B The codebase of OS C Portability + Transparent integration Portable TCP/IP stacks can be compiled and run on a wide range of OSes, but we need a mechanism to transparently integrate them into OSes I think these two are important for decentralized collaboration
  96. Ideally, it should be like a FUSE • Filesystem in

    Userspace (FUSE) Application Networking Virtual File System (VFS) File System 105 User Space Kernel
  97. Ideally, it should be like a FUSE • Filesystem in

    Userspace (FUSE) Application Networking Virtual File System (VFS) File System 106 User Space Kernel The VFS layer steers requests from an application to kernel-space services
  98. Ideally, it should be like a FUSE • Filesystem in

    Userspace (FUSE) Application Networking Virtual File System (VFS) File System 107 User Space Kernel The VFS layer steers requests from an application to kernel-space services FUSE Driver
  99. Ideally, it should be like a FUSE • Filesystem in

    Userspace (FUSE) Application Networking Virtual File System (VFS) File System 108 User Space Kernel The VFS layer steers requests from an application to kernel-space services FUSE Daemon FUSE Driver
  100. Ideally, it should be like a FUSE • Filesystem in

    Userspace (FUSE) Application Networking Virtual File System (VFS) File System 109 User Space Kernel The VFS layer steers requests from an application to kernel-space services FUSE Daemon FUSE Driver File operation requests from an application are steered to a user-space file system implementation through the kernel-space FUSE subsystem
  101. Ideally, it should be like a FUSE • Filesystem in

    Userspace (FUSE) Application Networking Virtual File System (VFS) File System 110 User Space Kernel The VFS layer steers requests from an application to kernel-space services FUSE Daemon FUSE Driver File operation requests from an application are steered to a user-space file system implementation through the kernel-space FUSE subsystem Applications can be agnostic to FUSE and kernel-space file systems because their APIs are the same system calls
  102. The Main Discussion Point of This Talk 111 Developer 1

    Time Developer 2 Developer 3 Codebase of OS functionality A Codebase of OS functionality B Codebase of OS functionality C Coming back to ... It would be great if • developers did not need to communicate with each other • and the OS functionalities could be easily incorporated
  103. Ideally, it should be like a FUSE 112 Developer 1

    Time Developer 2 Developer 3 FUSE FS 1 FUSE FS 2 An OS
  104. Ideally, it should be like a FUSE 113 Developer 1

    Time Developer 2 Developer 3 FUSE FS 1 FUSE FS 2 An OS
  105. Ideally, it should be like a FUSE 114 Developer 1

    Time Developer 2 Developer 3 FUSE FS 1 FUSE FS 2 An OS
  106. Ideally, it should be like a FUSE 115 Developer 1

    Time Developer 2 Developer 3 An OS FUSE FS 1 FUSE FS 2 The developers of the OS, FUSE FS 1, FUSE FS 2 can work independently while their implementations can be properly incorporated
  107. Ideally, it should be like a FUSE 116 Developer 1

    Time Developer 2 Developer 3 An OS FUSE FS 1 FUSE FS 2 The developers of the OS, FUSE FS 1, FUSE FS 2 can work independently while their implementations can be properly incorporated My thought: this is great, so can we do the same for TCP/IP stacks?
  108. Ideally, it should be like a FUSE • What I

    wish to do Application Networking Virtual File System (VFS) File System 117 User Space Kernel FUSE Daemon FUSE Driver
  109. Ideally, it should be like a FUSE • What I

    wish to do Application Networking Virtual File System (VFS) File System 118 User Space Kernel FUSE Daemon FUSE Driver VFS-like Layer Adding a VFS-like layer in user space
  110. Ideally, it should be like a FUSE • What I

    wish to do Application Networking Virtual File System (VFS) File System 119 User Space Kernel FUSE Daemon FUSE Driver VFS-like Layer TCP/IP Stack Adding a VFS-like layer in user space Redirect application requests to a TCP/IP stack in user space
  111. Ideally, it should be like a FUSE • What I

    wish to do Application Networking Virtual File System (VFS) File System 120 User Space Kernel FUSE Daemon FUSE Driver VFS-like Layer TCP/IP Stack How can we add this?
  112. Ideally, it should be like a FUSE • What I

    wish to do Application Networking Virtual File System (VFS) File System 121 User Space Kernel FUSE Daemon FUSE Driver VFS-like Layer TCP/IP Stack How can we add this? Hooking system calls seems to be a good option
  113. Ideally, it should be like a FUSE • What I

    wish to do Application Networking Virtual File System (VFS) File System 122 User Space Kernel FUSE Daemon FUSE Driver VFS-like Layer TCP/IP Stack How can we add this? Hooking system calls seems to be a good option But I found that existing hook mechanisms have drawbacks
  114. Exploration for Transparent Integration • zpoline • A system call

    hook mechanism for x86-64 based on binary rewriting • Paper at USENIX ATC 2023 • https://www.usenix.org/conference/atc23/presentation/yasukata • received a Best Paper Award • Source code • https://github.com/yasukata/zpoline 123
  115. Exploration for Transparent Integration • svc-hook • A system call

    hook mechanism for ARM64 based on binary rewriting • Paper at ACM/IFIP Middleware 2025 • https://dl.acm.org/doi/10.1145/3721462.3770771 • Source code • https://github.com/retrage/svc-hook 124 https://speakerdeck.com/retrage/svc-hook-hooking-system-calls-on-arm64-by-binary-rewriting
  116. Ideally, it should be like a FUSE • For the

    next step ... Application Networking Virtual File System (VFS) File System 125 User Space Kernel FUSE Daemon FUSE Driver VFS-like Layer TCP/IP Stack I am exploring an appropriate design of this layer that can offer good compatibility with various systems without largely diminishing application performance
  117. Outline of This Talk • Why I would like to

    bring up this discussion • Prior exploration • Modest feature requests to the BSD community 126
  118. Outline of This Talk • Why I would like to

    bring up this discussion • Prior exploration • Modest feature requests to the BSD community 127
  119. Modest Feature Requests to BSD OSes • I would like

    to humbly put forward modest feature requests Application Networking Virtual File System (VFS) File System 128 User Space Kernel FUSE Daemon FUSE Driver VFS-like Layer TCP/IP Stack
  120. Modest Feature Requests to BSD OSes • I would like

    to humbly put forward modest feature requests Application Networking Virtual File System (VFS) File System 129 User Space Kernel FUSE Daemon FUSE Driver VFS-like Layer TCP/IP Stack I find doing everything properly is hard without the official support of OS kernels (FUSE is an example of such support)
  121. Modest Feature Requests to BSD OSes • I would like

    to humbly put forward modest feature requests Application Networking Virtual File System (VFS) File System 130 User Space Kernel FUSE Daemon FUSE Driver VFS-like Layer TCP/IP Stack I find doing everything properly is hard without the official support of OS kernels (FUSE is an example of such support) Feature Request The official kernel support for efficient system call hooking
  122. Modest Feature Requests to BSD OSes • I would like

    to humbly put forward modest feature requests Application Networking Virtual File System (VFS) File System 131 User Space Kernel FUSE Daemon FUSE Driver VFS-like Layer TCP/IP Stack I find doing everything properly is hard without the official support of OS kernels (FUSE is an example of such support) Feature Request The official kernel support for efficient system call hooking because binary rewriting approaches are not always applicable
  123. Modest Feature Requests to BSD OSes • I would like

    to humbly put forward modest feature requests Application Networking Virtual File System (VFS) File System 132 User Space Kernel FUSE Daemon FUSE Driver VFS-like Layer TCP/IP Stack I find doing everything properly is hard without the official support of OS kernels (FUSE is an example of such support) Feature Request The official kernel support for efficient system call hooking because binary rewriting approaches are not always applicable I think that there are many people who wish to have this feature; please see the references in our system call hooking papers
  124. Modest Feature Requests to BSD OSes • Rough idea of

    the kernel support for system call hooking 133 Application User Space Kernel syscall Immediately getting back to user space while jumping to a memory address preliminarily configured through a designated control interface
  125. Modest Feature Requests to BSD OSes • Rough idea of

    the kernel support for system call hooking 134 Application User Space Kernel syscall Immediately getting back to user space while jumping to a memory address preliminarily configured through a designated control interface It would be nice if users could use a BPF-like mechanism to choose, at run time, whether to return to user space or to execute the kernel-space system call BPF program
  126. Modest Feature Requests to BSD OSes • Rough idea of

    the kernel support for system call hooking 135 Application User Space Kernel syscall function pointer function Module It would be enough to add a function pointer that can be set by a loadable kernel module call
  127. Modest Feature Requests to BSD OSes • Rough idea of

    the kernel support for system call hooking Discussion point • What security implications should we consider? → Can we apply the same assumptions as ptrace? 136
  128. Modest Feature Requests to BSD OSes Another request • I

    would appreciate it if you could consider adding any support for decentralized collaboration in future updates 137
  129. Summary • It would be great if OS functionalities could

    be developed in a decentralized manner while still being easily incorporated • OS designs have a significant impact on how OSes are developed, and some designs would be better suited to decentralized collaboration • If certain features were officially supported by major OS kernels, it could drive substantial progress in this area 138