Slide 1

Slide 1 text

Making the Linux Kernel

Slide 2

Slide 2 text

Making the Linux Kernel SUCK LESS

Slide 3

Slide 3 text

September 19, 2024

Slide 4

Slide 4 text

Making the Linux Kernel SUCK LESS

Slide 5

Slide 5 text

Making the Linux Kernel SUCK LESS PREEMPT_RT

Slide 6

Slide 6 text

In memory of Daniel Bristot de Oliveira

Slide 7

Slide 7 text

Real-Time Going Upstream 6.12!!! ● In honor of RT going upstream, need to list what it did for the kernel ○ mutex_lock ○ NO_HZ ○ High Resolution Timers ○ lockdep ○ Priority Inheritance (futex) ○ Threaded interrupts ○ ftrace ○ printk

Slide 8

Slide 8 text

mutex_lock/unlock ● All sleeping locks used to be just semaphores

Slide 9

Slide 9 text

mutex_lock/unlock ● All sleeping locks used to be just semaphores ○ Semaphores allow a set of tasks to enter a critical section

Slide 10

Slide 10 text

mutex_lock/unlock ● All sleeping locks used to be just semaphores ○ Semaphores allow a set of tasks to enter a critical section ○ Most semaphores allowed just one task (mutex!)

Slide 11

Slide 11 text

mutex_lock/unlock ● All sleeping locks used to be just semaphores ○ Semaphores allow a set of tasks to enter a critical section ○ Most semaphores allowed just one task (mutex!) ○ Semaphores allowed passing of ownership

Slide 12

Slide 12 text

mutex_lock/unlock ● All sleeping locks used to be just semaphores ○ Semaphores allow a set of tasks to enter a critical section ○ Most semaphores allowed just one task (mutex!) ○ Semaphores allowed passing of ownership ● RT requires an “owner” for all critical sections ○ This is a requirement for priority inheritance

Slide 13

Slide 13 text

mutex_lock/unlock ● All sleeping locks used to be just semaphores ○ Semaphores allow a set of tasks to enter a critical section ○ Most semaphores allowed just one task (mutex!) ○ Semaphores allowed passing of ownership ● RT requires an “owner” for all critical sections ○ This is a requirement for priority inheritance ● The RT folks added the mutex_lock/unlock interface
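Why an owner is needed for priority inheritance can be sketched in user-space C. This is a toy model, not the kernel's implementation: `toy_task`, `toy_mutex`, and both helpers are hypothetical names, and "larger number means more urgent" is just the convention chosen here. The point is that a contended lock can only boost someone if it knows who holds it, which a counting semaphore does not record.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical task: a larger prio value means more urgent (toy convention). */
struct toy_task {
    int base_prio;   /* priority the task was given */
    int prio;        /* effective priority, possibly boosted */
};

/* A mutex records exactly one owner; a counting semaphore has no such field. */
struct toy_mutex {
    struct toy_task *owner;  /* NULL when unlocked */
};

/* Returns 1 if the lock was taken, 0 if the caller would have to block. */
int toy_mutex_trylock(struct toy_mutex *m, struct toy_task *t)
{
    if (m->owner == NULL) {
        m->owner = t;
        return 1;
    }
    /* Contended: because the owner is known, it can inherit our priority. */
    if (t->prio > m->owner->prio)
        m->owner->prio = t->prio;
    return 0;
}

/* Only the owner may unlock; the boost is dropped on release.
 * Simplification: a task holding several locks is not handled. */
void toy_mutex_unlock(struct toy_mutex *m, struct toy_task *t)
{
    assert(m->owner == t);
    t->prio = t->base_prio;
    m->owner = NULL;
}
```

With a low-priority owner and a high-priority waiter, the failed `toy_mutex_trylock()` raises the owner's `prio` to the waiter's value until the owner unlocks.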

Slide 14

Slide 14 text

NO_HZ and High Resolution Timers ● The tick used to always run ○ Even when the system was idle

Slide 15

Slide 15 text

NO_HZ and High Resolution Timers ● The tick used to always run ○ Even when the system was idle ○ Prevented the CPU from entering a deep sleep ■ Expensive for large data centers

Slide 16

Slide 16 text

NO_HZ and High Resolution Timers ● The tick used to always run ○ Even when the system was idle ○ Prevented the CPU from entering a deep sleep ■ Expensive for large data centers ● High Resolution Timers were being blocked by the timer maintainer

Slide 17

Slide 17 text

NO_HZ and High Resolution Timers ● The tick used to always run ○ Even when the system was idle ○ Prevented the CPU from entering a deep sleep ■ Expensive for large data centers ● High Resolution Timers were being blocked by the timer maintainer ○ NO_HZ was made dependent on HR Timers ○ Couldn’t have NO_HZ without HR Timers

Slide 18

Slide 18 text

NO_HZ and High Resolution Timers ● The tick used to always run ○ Even when the system was idle ○ Prevented the CPU from entering a deep sleep ■ Expensive for large data centers ● High Resolution Timers were being blocked by the timer maintainer ○ NO_HZ was made dependent on HR Timers ○ Couldn’t have NO_HZ without HR Timers ○ Linus pulled in both

Slide 19

Slide 19 text

NO_HZ and High Resolution Timers ● The tick used to always run ○ Even when the system was idle ○ Prevented the CPU from entering a deep sleep ■ Expensive for large data centers ● High Resolution Timers were being blocked by the timer maintainer ○ NO_HZ was made dependent on HR Timers ○ Couldn’t have NO_HZ without HR Timers ○ Linus pulled in both ○ Thomas Gleixner became the new timer maintainer

Slide 20

Slide 20 text

Lockdep Lock A Lock B TASK 1 Lock B Lock A TASK 2

Slide 21

Slide 21 text

Lockdep Lock A Lock B TASK 1 Lock B Lock C TASK 2 Lock C TASK 3 Lock A
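The A, B, C chain on this slide can be sketched as a small lock-order graph. This is a toy illustration under stated assumptions, not lockdep's code: the `toy_` names are hypothetical, locks are plain integers, and the graph is a fixed-size matrix. It shows the core idea that the deadlock never has to happen; recording the orderings each task used, even at different times, is enough to find the cycle.

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX_LOCKS 8

/* edge[a][b] != 0 means "some task took lock b while holding lock a". */
struct toy_lock_graph {
    unsigned char edge[TOY_MAX_LOCKS][TOY_MAX_LOCKS];
};

void toy_graph_init(struct toy_lock_graph *g)
{
    memset(g, 0, sizeof(*g));
}

/* Depth-first search: can 'to' be reached from 'from' along recorded edges? */
static int toy_reachable(const struct toy_lock_graph *g, int from, int to,
                         unsigned char *seen)
{
    if (from == to)
        return 1;
    seen[from] = 1;
    for (int next = 0; next < TOY_MAX_LOCKS; next++) {
        if (g->edge[from][next] && !seen[next] &&
            toy_reachable(g, next, to, seen))
            return 1;
    }
    return 0;
}

/*
 * Record that 'taken' was acquired while 'held' was held.
 * Returns 1 if the new ordering closes a cycle (a possible deadlock),
 * 0 otherwise. The edge is recorded either way.
 */
int toy_record_order(struct toy_lock_graph *g, int held, int taken)
{
    unsigned char seen[TOY_MAX_LOCKS] = { 0 };
    int cycle;

    assert(held >= 0 && held < TOY_MAX_LOCKS);
    assert(taken >= 0 && taken < TOY_MAX_LOCKS);

    /* A path taken -> ... -> held already exists? Then held -> taken loops. */
    cycle = toy_reachable(g, taken, held, seen);
    g->edge[held][taken] = 1;
    return cycle;
}
```

Recording A then B, B then C, and finally C then A makes the third call report a cycle, matching the three-task picture.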

Slide 22

Slide 22 text

Lockdep Lock A

Slide 23

Slide 23 text

Lockdep Lock A Interrupt

Slide 24

Slide 24 text

Lockdep Lock A Interrupt Lock A

Slide 25

Slide 25 text

Lockdep Lock A Interrupt Lock A DEADLOCK
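The interrupt case can be modeled by remembering how each lock has ever been used. The sketch below is a hypothetical simplification (the struct and function names are invented for illustration): once a lock has been taken both from interrupt context and from task context with interrupts enabled, the deadlock on this slide is possible, even if the interrupt has never actually arrived at the wrong moment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-lock usage history. */
struct toy_lock_usage {
    int taken_in_irq;        /* ever acquired from interrupt context */
    int held_with_irqs_on;   /* ever acquired in task context, interrupts enabled */
};

/*
 * Record one acquisition. Returns 1 once the history shows both usages,
 * meaning an interrupt could try to take a lock its own CPU already holds.
 */
int toy_lock_acquired(struct toy_lock_usage *u, int in_irq, int irqs_enabled)
{
    assert(u != NULL);
    if (in_irq)
        u->taken_in_irq = 1;
    else if (irqs_enabled)
        u->held_with_irqs_on = 1;
    return u->taken_in_irq && u->held_with_irqs_on;
}
```

A lock that is always taken with interrupts disabled in task context never trips the check, which mirrors the usual fix of disabling interrupts around such a lock.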

Slide 26

Slide 26 text

Lockdep Lock A

Slide 27

Slide 27 text

Lockdep Lock A Interrupt

Slide 28

Slide 28 text

Lockdep Lock A Lock B TASK 1

Slide 29

Slide 29 text

Lockdep Lock A Lock B TASK 1 Lock B TASK 2

Slide 30

Slide 30 text

Lockdep Lock A Lock B TASK 1 Lock B Lock C TASK 2 Lock C TASK 3

Slide 31

Slide 31 text

Lockdep Lock A Interrupt Lock B TASK 1 Lock B Lock C TASK 2 Lock C TASK 3

Slide 32

Slide 32 text

Lockdep Lock A Interrupt Lock B TASK 1 Lock B Lock C TASK 2 Lock C TASK 3 Lock A

Slide 33

Slide 33 text

Lockdep Lock A Interrupt Lock B TASK 1 Lock B Lock C TASK 2 Lock C TASK 3 Lock A DEADLOCK

Slide 34

Slide 34 text

Lockdep Lock A Interrupt Lock B TASK 1 Lock B Lock C TASK 2 Lock C TASK 3

Slide 35

Slide 35 text

Lockdep Lock A kmalloc(size, GFP_KERNEL)

Slide 36

Slide 36 text

Lockdep Lock A kmalloc(size, GFP_KERNEL) MEMORY RECLAIM!

Slide 37

Slide 37 text

Lockdep Lock A kmalloc(size, GFP_KERNEL) Lock A MEMORY RECLAIM!

Slide 38

Slide 38 text

Lockdep Lock A kmalloc(size, GFP_KERNEL) Lock A MEMORY RECLAIM! DEADLOCK

Slide 39

Slide 39 text

Lockdep Lock A MEMORY RECLAIM Lock B TASK 1 Lock B Lock C TASK 2 Lock C TASK 3 Lock A

Slide 40

Slide 40 text

Lockdep Lock A MEMORY RECLAIM Lock B TASK 1 Lock B Lock C TASK 2 Lock C TASK 3 Lock A DEADLOCK

Slide 41

Slide 41 text

Lockdep down_read A TASK 1 down_read B down_read B TASK 2 down_read A

Slide 42

Slide 42 text

Lockdep TASK 1 TASK 2 DEADLOCK?? down_read A down_read B down_read B down_read A

Slide 43

Slide 43 text

Lockdep TASK 1 TASK 2 DEADLOCK?? YES down_read A down_read B down_read B down_read A

Slide 44

Slide 44 text

Lockdep TASK 1 TASK 2 down_write C TASK 3 down_read A down_read B

Slide 45

Slide 45 text

Lockdep TASK 1 TASK 2 TASK 3 down_read A down_read B down_read B down_read A down_write C

Slide 46

Slide 46 text

Lockdep TASK 1 TASK 2 TASK 3 DEADLOCK down_read A down_read B down_read B down_read A down_write C

Slide 47

Slide 47 text

User Space Priority Inheritance

Slide 48

Slide 48 text

User Space Priority Inheritance pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT)
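A minimal sketch of how the call on this slide is typically used, assuming a POSIX threads implementation that supports `PTHREAD_PRIO_INHERIT`. The helper name `init_pi_mutex` is hypothetical; the pthread calls themselves are standard POSIX. On Linux this protocol is typically backed by the kernel's priority-inheritance futex operations.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/*
 * Initialize 'm' as a priority-inheritance mutex.
 * Returns 0 on success or the error number from the failing pthread call.
 */
int init_pi_mutex(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int ret;

    assert(m != NULL);

    ret = pthread_mutexattr_init(&attr);
    if (ret != 0)
        return ret;

    /* Ask for the owner to inherit the priority of blocked waiters. */
    ret = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (ret == 0)
        ret = pthread_mutex_init(m, &attr);

    /* The attribute object is no longer needed once the mutex exists. */
    pthread_mutexattr_destroy(&attr);
    return ret;
}
```

After `init_pi_mutex(&m)` succeeds, the mutex is locked and unlocked with the ordinary `pthread_mutex_lock()` and `pthread_mutex_unlock()` calls; only the attribute differs.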

Slide 49

Slide 49 text

Interrupts

Slide 50

Slide 50 text

Interrupt threads task: irq-123/harddrive

Slide 51

Slide 51 text

Tracing / ftrace ● Three month project ○ Started in January 2008 ○ Was to port the tracing infrastructure from PREEMPT_RT

Slide 52

Slide 52 text

Tracing / ftrace ● Three month project ○ Started in January 2008 ○ Was to port the tracing infrastructure from PREEMPT_RT ● Still haven’t finished it

Slide 53

Slide 53 text

printk
# cyclictest -q -D 90s -i 500 -d 500
T: 0 ( 3634) P: 0 I:500 C: 180000 Min: 10 Act: 54 Avg: 54 Max: 68

Slide 54

Slide 54 text

printk
# cyclictest -q -D 90s -i 500 -d 500
T: 0 ( 3634) P: 0 I:500 C: 180000 Min: 10 Act: 54 Avg: 54 Max: 68
# cyclictest -q -D 90s -i 500 -d 500

Slide 55

Slide 55 text

printk
# cyclictest -q -D 90s -i 500 -d 500
T: 0 ( 3634) P: 0 I:500 C: 180000 Min: 10 Act: 54 Avg: 54 Max: 68
# cyclictest -q -D 90s -i 500 -d 500
# echo ? > /proc/sysrq-trigger

Slide 56

Slide 56 text

printk
# cyclictest -q -D 90s -i 500 -d 500
T: 0 ( 3634) P: 0 I:500 C: 180000 Min: 10 Act: 54 Avg: 54 Max: 68
# cyclictest -q -D 90s -i 500 -d 500
T: 0 ( 3653) P: 0 I:500 C: 179931 Min: 6 Act: 55 Avg: 55 Max: 34658
# echo ? > /proc/sysrq-trigger
!!!!!!!!!!!!!!!!!

Slide 57

Slide 57 text

printk ● The last blocker of PREEMPT_RT!

Slide 58

Slide 58 text

printk ● The last blocker of PREEMPT_RT! ● Old printk serializes the output ○ Does each console one by one!

Slide 59

Slide 59 text

printk ● The last blocker of PREEMPT_RT! ● Old printk serializes the output ○ Does each console one by one! ● Threaded printk allows all consoles to be printed at once!

Slide 60

Slide 60 text

printk ● The last blocker of PREEMPT_RT! ● Old printk serializes the output ○ Does each console one by one! ● Threaded printk allows all consoles to be printed at once! ● Can now be called in any context! ○ NMI ○ Scheduler

Slide 61

Slide 61 text

Build speed ● In 2008, ftrace was added

Slide 62

Slide 62 text

Build speed ● In 2008, ftrace was added ● It originally required a daemon to find the mcount locations ○ This is needed to convert them to nops

Slide 63

Slide 63 text

Build speed ● In 2008, ftrace was added ● It originally required a daemon to find the mcount locations ○ This is needed to convert them to nops ● To get rid of the daemon, Perl scripts were added to the build

Slide 64

Slide 64 text

Build speed ● In 2008, ftrace was added ● It originally required a daemon to find the mcount locations ○ This is needed to convert them to nops ● To get rid of the daemon, Perl scripts were added to the build ○ It did objdump to find the locations of mcount calls

Slide 65

Slide 65 text

Build speed ● In 2008, ftrace was added ● It originally required a daemon to find the mcount locations ○ This is needed to convert them to nops ● To get rid of the daemon, Perl scripts were added to the build ○ It did objdump to find the locations of mcount calls ○ It then created an assembly file holding these locations

Slide 66

Slide 66 text

Build speed ● In 2008, ftrace was added ● It originally required a daemon to find the mcount locations ○ This is needed to convert them to nops ● To get rid of the daemon, Perl scripts were added to the build ○ It did objdump to find the locations of mcount calls ○ It then created an assembly file holding these locations ○ It then compiled and relinked the file into the original object file

Slide 67

Slide 67 text

Build speed ● In 2008, ftrace was added ● It originally required a daemon to find the mcount locations ○ This is needed to convert them to nops ● To get rid of the daemon, Perl scripts were added to the build ○ It did objdump to find the locations of mcount calls ○ It then created an assembly file holding these locations ○ It then compiled and relinked the file into the original object file ● Needless to say, this slowed down the build

Slide 68

Slide 68 text

Build speed ● In 2008, ftrace was added ● It originally required a daemon to find the mcount locations ○ This is needed to convert them to nops ● To get rid of the daemon, Perl scripts were added to the build ○ It did objdump to find the locations of mcount calls ○ It then created an assembly file holding these locations ○ It then compiled and relinked the file into the original object file ● Needless to say, this slowed down the build ● In 2010, recordmcount.c was written to do this in C ○ Written by John Reiser ○ sped up the build again!

Slide 69

Slide 69 text

Build speed ● In 2008, Linus complained about build speed

Slide 70

Slide 70 text

Build speed ● In 2008, Linus complained about build speed ○ We ask users to bisect the bugs they report

Slide 71

Slide 71 text

Build speed ● In 2008, Linus complained about build speed ○ We ask users to bisect the bugs they report ○ They only have the distro configs

Slide 72

Slide 72 text

Build speed ● In 2008, Linus complained about build speed ○ We ask users to bisect the bugs they report ○ They only have the distro configs ○ It can take them 13 hours to build on their machines

Slide 73

Slide 73 text

Build speed ● In 2008, Linus complained about build speed ○ We ask users to bisect the bugs they report ○ They only have the distro configs ○ It can take them 13 hours to build on their machines ● Linus told Kernel Summit attendees to fix it

Slide 74

Slide 74 text

Build speed ● In 2008, Linus complained about build speed ○ We ask users to bisect the bugs they report ○ They only have the distro configs ○ It can take them 13 hours to build on their machines ● Linus told Kernel Summit attendees to fix it (I wasn’t invited)

Slide 75

Slide 75 text

Build speed ● In 2008, Linus complained about build speed ○ We ask users to bisect the bugs they report ○ They only have the distro configs ○ It can take them 13 hours to build on their machines ● Linus told Kernel Summit attendees to fix it ○ Thomas Gleixner said we already have a script

Slide 76

Slide 76 text

Build speed ● In 2008, Linus complained about build speed ○ We ask users to bisect the bugs they report ○ They only have the distro configs ○ It can take them 13 hours to build on their machines ● Linus told Kernel Summit attendees to fix it ○ Thomas Gleixner said we already have a script ○ “Steven Rostedt has this streamline-config.pl”

Slide 77

Slide 77 text

Build speed ● In 2008, Linus complained about build speed ○ We ask users to bisect the bugs they report ○ They only have the distro configs ○ It can take them 13 hours to build on their machines ● Linus told Kernel Summit attendees to fix it ○ Thomas Gleixner said we already have a script ○ “Steven Rostedt has this streamline-config.pl” ○ Linus asked him why it’s not already in the kernel?

Slide 78

Slide 78 text

Build speed ● In 2008, Linus complained about build speed ○ We ask users to bisect the bugs they report ○ They only have the distro configs ○ It can take them 13 hours to build on their machines ● Linus told Kernel Summit attendees to fix it ○ Thomas Gleixner said we already have a script ○ “Steven Rostedt has this streamline-config.pl” ○ Linus asked him why it’s not already in the kernel? ● Later Steven was asked to add it

Slide 79

Slide 79 text

Build speed ● In 2008, Linus complained about build speed ○ We ask users to bisect the bugs they report ○ They only have the distro configs ○ It can take them 13 hours to build on their machines ● Linus told Kernel Summit attendees to fix it ○ Thomas Gleixner said we already have a script ○ “Steven Rostedt has this streamline-config.pl” ○ Linus asked him why it’s not already in the kernel? ● Later Steven was asked to add it ○ This became make localmodconfig ○ It sped up the build tremendously

Slide 80

Slide 80 text

Build speed ● In 2008, Linus complained about build speed ○ We ask users to bisect the bugs they report ○ They only have the distro configs ○ It can take them 13 hours to build on their machines ● Linus told Kernel Summit attendees to fix it ○ Thomas Gleixner said we already have a script ○ “Steven Rostedt has this streamline-config.pl” ○ Linus asked him why it’s not already in the kernel? ● Later Steven was asked to add it ○ This became make localmodconfig ○ It sped up the build tremendously ○ It also hid the slowdown caused by the ftrace Perl scripts!

Slide 81

Slide 81 text

text_poke()
Ftrace was the first to add runtime modification of code:
  0f 1f 44 00 00               nop
  53                           push   %rbx
  65 48 8b 1c 25 00 61 01 00   mov    %gs:0x16100,%rbx
          ffffffff81a1491b: R_X86_64_32S current_task
  48 8b 43 10                  mov    0x10(%rbx),%rax
  48 85 c0                     test   %rax,%rax
  74 10                        je     ffffffff81a14938
  f6 43 24 20                  testb  $0x20,0x24(%rbx)
  75 49                        jne    ffffffff81a14977
  48 83 bb 20 0c 00 00 00      cmpq   $0x0,0xc20(%rbx)
  74 1f                        je     ffffffff81a14957
  31 ff                        xor    %edi,%edi
  e8 a1 f8 ff ff               callq  ffffffff81a141e0 <__schedule>

Slide 82

Slide 82 text

text_poke()
Ftrace was the first to add runtime modification of code:
     1f 44 00 00               nop
  53                           push   %rbx
  65 48 8b 1c 25 00 61 01 00   mov    %gs:0x16100,%rbx
          ffffffff81a1491b: R_X86_64_32S current_task
  48 8b 43 10                  mov    0x10(%rbx),%rax
  48 85 c0                     test   %rax,%rax
  74 10                        je     ffffffff81a14938
  f6 43 24 20                  testb  $0x20,0x24(%rbx)
  75 49                        jne    ffffffff81a14977
  48 83 bb 20 0c 00 00 00      cmpq   $0x0,0xc20(%rbx)
  74 1f                        je     ffffffff81a14957
  31 ff                        xor    %edi,%edi
  e8 a1 f8 ff ff               callq  ffffffff81a141e0 <__schedule>

Slide 83

Slide 83 text

text_poke()
Ftrace was the first to add runtime modification of code:
  nop
  push   %rbx
  mov    %gs:0x16100,%rbx
  mov    0x10(%rbx),%rax
  test   %rax,%rax

Slide 84

Slide 84 text

text_poke()
Ftrace was the first to add runtime modification of code:
  nop
  push   %rbx
  mov    %gs:0x16100,%rbx
  mov    0x10(%rbx),%rax
  test   %rax,%rax

do_int3(struct pt_regs *regs)
{
        regs->ip += 5;
        return;
}

Slide 88

Slide 88 text

text_poke()
Ftrace was the first to add runtime modification of code:
     1f 44 00 00               nop
  53                           push   %rbx
  65 48 8b 1c 25 00 61 01 00   mov    %gs:0x16100,%rbx
          ffffffff81a1491b: R_X86_64_32S current_task
  48 8b 43 10                  mov    0x10(%rbx),%rax
  48 85 c0                     test   %rax,%rax
  74 10                        je     ffffffff81a14938
  f6 43 24 20                  testb  $0x20,0x24(%rbx)
  75 49                        jne    ffffffff81a14977
  48 83 bb 20 0c 00 00 00      cmpq   $0x0,0xc20(%rbx)
  74 1f                        je     ffffffff81a14957
  31 ff                        xor    %edi,%edi
  e8 a1 f8 ff ff               callq  ffffffff81a141e0 <__schedule>

Slide 89

Slide 89 text

text_poke()
Ftrace was the first to add runtime modification of code:
     1b d0 1e 00               callq  ffffffff81c01930 <__fentry__>
  53                           push   %rbx
  65 48 8b 1c 25 00 61 01 00   mov    %gs:0x16100,%rbx
          ffffffff81a1491b: R_X86_64_32S current_task
  48 8b 43 10                  mov    0x10(%rbx),%rax
  48 85 c0                     test   %rax,%rax
  74 10                        je     ffffffff81a14938
  f6 43 24 20                  testb  $0x20,0x24(%rbx)
  75 49                        jne    ffffffff81a14977
  48 83 bb 20 0c 00 00 00      cmpq   $0x0,0xc20(%rbx)
  74 1f                        je     ffffffff81a14957
  31 ff                        xor    %edi,%edi
  e8 a1 f8 ff ff               callq  ffffffff81a141e0 <__schedule>

Slide 90

Slide 90 text

text_poke()
Ftrace was the first to add runtime modification of code:
  e8 1b d0 1e 00               callq  ffffffff81c01930 <__fentry__>
  53                           push   %rbx
  65 48 8b 1c 25 00 61 01 00   mov    %gs:0x16100,%rbx
          ffffffff81a1491b: R_X86_64_32S current_task
  48 8b 43 10                  mov    0x10(%rbx),%rax
  48 85 c0                     test   %rax,%rax
  74 10                        je     ffffffff81a14938
  f6 43 24 20                  testb  $0x20,0x24(%rbx)
  75 49                        jne    ffffffff81a14977
  48 83 bb 20 0c 00 00 00      cmpq   $0x0,0xc20(%rbx)
  74 1f                        je     ffffffff81a14957
  31 ff                        xor    %edi,%edi
  e8 a1 f8 ff ff               callq  ffffffff81a141e0 <__schedule>
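The nop-to-call transition on these slides can be sketched on a plain byte buffer. This is a user-space illustration only, assuming the usual x86 breakpoint approach that the do_int3() slide hints at: the `toy_` names are hypothetical, `toy_sync_cores()` is an empty stand-in for real cross-CPU synchronization, and 0xcc is the x86 one-byte breakpoint opcode. The instruction bytes in the usage note are the ones shown on the slides.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_INSN_LEN 5
#define TOY_INT3     0xcc   /* x86 one-byte breakpoint opcode */

/* Hypothetical stand-in for synchronizing all CPUs between steps. */
static void toy_sync_cores(void)
{
    /* A real implementation must make every CPU see the previous write. */
}

/* Step 1: replace the first byte with a breakpoint. A CPU that reaches the
 * site now traps, and the handler can skip the half-written instruction. */
void toy_poke_add_breakpoint(unsigned char *site)
{
    site[0] = TOY_INT3;
    toy_sync_cores();
}

/* Step 2: write every byte of the new instruction except the first. */
void toy_poke_write_tail(unsigned char *site, const unsigned char *new_insn)
{
    assert(site[0] == TOY_INT3);
    for (size_t i = 1; i < TOY_INSN_LEN; i++)
        site[i] = new_insn[i];
    toy_sync_cores();
}

/* Step 3: replace the breakpoint with the first byte of the new instruction. */
void toy_poke_finish(unsigned char *site, const unsigned char *new_insn)
{
    site[0] = new_insn[0];
    toy_sync_cores();
}
```

Starting from `0f 1f 44 00 00` and patching in `e8 1b d0 1e 00`, the buffer passes through `cc 1f 44 00 00` and `cc 1b d0 1e 00` before the call is complete, so no CPU ever executes a mix of old and new bytes as one instruction.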

Slide 91

Slide 91 text

text_poke() Static branches can use the same method! if (static_key_false(&somekey)) { /* Do something unlikely */ }

Slide 92

Slide 92 text

text_poke()
Static branches can use the same method!
static int static_key_false(struct static_key *key)
{
        asm goto("1:"
                 ".byte 0xe9 \n\t .long 0\n\t /* nop */"
                 ".pushsection __jump_table, \"a\" \n\t"
                 _ASM_PTR "1b, %l[l_yes], %c0 \n\t"
                 ".popsection \n\t"
                 : : "i" (key) : : l_yes);
        return false;
l_yes:
        return true;
}

Slide 93

Slide 93 text

text_poke()
Static branches can use the same method!
static int static_key_false(struct static_key *key)
{
        asm goto("1:"
                 ".byte 0xe9 \n\t .long 0\n\t /* nop */"      <-- NOP
                 ".pushsection __jump_table, \"a\" \n\t"
                 _ASM_PTR "1b, %l[l_yes], %c0 \n\t"
                 ".popsection \n\t"
                 : : "i" (key) : : l_yes);
        return false;
l_yes:
        return true;
}

Slide 94

Slide 94 text

text_poke()
Static branches can use the same method!
static int static_key_false(struct static_key *key)
{
        asm goto("1:"
                 "jmp l_yes\n\t"
                 ".pushsection __jump_table, \"a\" \n\t"
                 _ASM_PTR "1b, %l[l_yes], %c0 \n\t"
                 ".popsection \n\t"
                 : : "i" (key) : : l_yes);
        return false;
l_yes:
        return true;
}

Slide 96

Slide 96 text

text_poke() Static branches can use the same method! if (static_key_false(&somekey)) { /* Do something unlikely */ }

Slide 97

Slide 97 text

text_poke() Static branches can use the same method! if (0) { /* Do something unlikely */ }

Slide 98

Slide 98 text

text_poke() Static branches can use the same method! if (1) { /* Do something unlikely */ }

Slide 99

Slide 99 text

text_poke() ● But you must enable CONFIG_JUMP_LABEL ○ General architecture-dependent options → Optimize very unlikely/likely branches

Slide 100

Slide 100 text

text_poke() ● But you must enable CONFIG_JUMP_LABEL ○ General architecture-dependent options → Optimize very unlikely/likely branches ● Without this, it’s just a conditional branch if (somekey->enabled) { /* Do something unlikely */ }
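The fallback described here can be sketched in user-space C. The names are hypothetical stand-ins for the kernel's types, and `__builtin_expect` is a GCC/Clang builtin that only hints at the likely direction. The contrast with the patched version is that this form loads `enabled` from memory and tests it on every call, whereas the jump-label form rewrites the instruction itself so the disabled case costs only a nop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct static_key. */
struct toy_static_key {
    int enabled;
};

/* Fallback form: a memory load and a conditional branch on every call. */
static inline bool toy_static_key_false(const struct toy_static_key *key)
{
    assert(key != NULL);
    /* The hint marks the branch as unlikely; the test is still executed. */
    return __builtin_expect(key->enabled != 0, 0);
}

/* Example caller: counts how often the unlikely path runs. */
int toy_count_slow_path(const struct toy_static_key *key, int iterations)
{
    int slow = 0;

    for (int i = 0; i < iterations; i++) {
        if (toy_static_key_false(key))
            slow++;   /* "Do something unlikely" */
    }
    return slow;
}
```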

Slide 101

Slide 101 text

cond_resched() ● In CONFIG_PREEMPT_NONE the kernel does not preempt!

Slide 102

Slide 102 text

cond_resched() ● In CONFIG_PREEMPT_NONE the kernel does not preempt! ● If there is a long loop, the watchdog timer will trigger ○ Need to notify areas that can schedule

Slide 103

Slide 103 text

cond_resched() ● In CONFIG_PREEMPT_NONE the kernel does not preempt! ● If there is a long loop, the watchdog timer will trigger ○ Need to notify areas that can schedule ● Approximately 1455 cond_resched() calls in v6.11

Slide 104

Slide 104 text

cond_resched()
kernel/bpf/verifier.c:
        for (i = 0; i < env->subprog_cnt; i++) {
                old_bpf_func = func[i]->bpf_func;
                tmp = bpf_int_jit_compile(func[i]);
                if (tmp != func[i] || func[i]->bpf_func != old_bpf_func) {
                        verbose(env, "JIT doesn't support bpf-to-bpf calls\n");
                        err = -ENOTSUPP;
                        goto out_free;
                }
        }

Slide 105

Slide 105 text

cond_resched()
kernel/bpf/verifier.c:
        for (i = 0; i < env->subprog_cnt; i++) {
                old_bpf_func = func[i]->bpf_func;
                tmp = bpf_int_jit_compile(func[i]);
                if (tmp != func[i] || func[i]->bpf_func != old_bpf_func) {
                        verbose(env, "JIT doesn't support bpf-to-bpf calls\n");
                        err = -ENOTSUPP;
                        goto out_free;
                }
+               cond_resched();
        }
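What the added `cond_resched()` buys can be shown with a toy model of a kernel that never preempts itself. Everything here is a hypothetical simulation (`toy_cpu`, the fake tick, and the helper names are not kernel APIs): a reschedule request is only honored if the loop stops to check for it.

```c
#include <assert.h>

/* Hypothetical per-CPU state for a kernel that never preempts itself. */
struct toy_cpu {
    int need_resched;   /* set when another task should run */
    int switches;       /* how many times the CPU was voluntarily given up */
};

/* Voluntary preemption point: schedule only if someone asked for it. */
int toy_cond_resched(struct toy_cpu *cpu)
{
    if (!cpu->need_resched)
        return 0;
    cpu->need_resched = 0;
    cpu->switches++;    /* stands in for calling the scheduler */
    return 1;
}

/*
 * A long loop. Every 'tick_every' iterations the toy "tick" requests a
 * reschedule; the request is honored only if the loop checks for it.
 * Returns the longest run of iterations without a context switch.
 */
int toy_long_loop(struct toy_cpu *cpu, int iterations, int tick_every,
                  int use_cond_resched)
{
    int run = 0, longest = 0;

    assert(tick_every > 0);
    for (int i = 1; i <= iterations; i++) {
        run++;
        if (run > longest)
            longest = run;
        if (i % tick_every == 0)
            cpu->need_resched = 1;
        if (use_cond_resched && toy_cond_resched(cpu))
            run = 0;
    }
    return longest;
}
```

With 1000 iterations and a request every 100, the loop without the check hogs the CPU for all 1000 iterations, while the loop with the check never runs more than 100 in a row.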

Slide 106

Slide 106 text

What’s wrong with CONFIG_PREEMPT? ● It can preempt when locks are held ● Some workloads can take a big hit from it

Slide 107

Slide 107 text

CONFIG_PREEMPT_RT

Slide 110

Slide 110 text

NEED_RESCHED CONFIG_PREEMPT_RT

Slide 118

Slide 118 text

LAZY_NEED_RESCHED

Slide 125

Slide 125 text

LAZY_NEED_RESCHED NEED_RESCHED

Slide 126

Slide 126 text

CONFIG_PREEMPT_AUTO
● Takes the lessons learned from LAZY_NEED_RESCHED of RT

Preempt Type        Schedule Tick   might_sleep()
PREEMPT_NONE        x               x
PREEMPT_VOLUNTARY   x               ✅
PREEMPT             ✅              ✅

Slide 127

Slide 127 text

CONFIG_PREEMPT_AUTO LAZY_NEED_RESCHED

Slide 128

Slide 128 text

LAZY_NEED_RESCHED Go to user space CONFIG_PREEMPT_AUTO

Slide 129

Slide 129 text

LAZY_NEED_RESCHED Go to user space schedule CONFIG_PREEMPT_AUTO

Slide 130

Slide 130 text

LAZY_NEED_RESCHED CONFIG_PREEMPT_AUTO

Slide 132

Slide 132 text

LAZY_NEED_RESCHED NEED_RESCHED CONFIG_PREEMPT_AUTO

Slide 133

Slide 133 text

LAZY_NEED_RESCHED NEED_RESCHED schedule CONFIG_PREEMPT_AUTO
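The two paths on these slides (lazy flag, then schedule on the way to user space; or lazy flag upgraded to NEED_RESCHED, then schedule) can be sketched as a small state machine. This is a toy model under stated assumptions, not kernel code: all `toy_` names are hypothetical, and the rules that the tick performs the upgrade and that real-time wakeups set the urgent flag directly are assumptions about the design that the extracted slide text does not spell out.

```c
#include <assert.h>
#include <stddef.h>

enum toy_flag {
    TOY_NONE = 0,
    TOY_LAZY_NEED_RESCHED,   /* reschedule at the next convenient point */
    TOY_NEED_RESCHED         /* reschedule as soon as preemption is allowed */
};

struct toy_cpu_state {
    enum toy_flag flag;
    int schedules;           /* number of times the scheduler was entered */
};

static void toy_schedule(struct toy_cpu_state *cpu)
{
    assert(cpu != NULL);
    cpu->flag = TOY_NONE;
    cpu->schedules++;
}

/* A task woke up: assumed urgent for real-time tasks, lazy otherwise. */
void toy_wakeup(struct toy_cpu_state *cpu, int woken_is_realtime)
{
    if (woken_is_realtime)
        cpu->flag = TOY_NEED_RESCHED;
    else if (cpu->flag == TOY_NONE)
        cpu->flag = TOY_LAZY_NEED_RESCHED;
}

/* Assumed tick behavior: a lazy request still pending becomes urgent. */
void toy_tick(struct toy_cpu_state *cpu)
{
    if (cpu->flag == TOY_LAZY_NEED_RESCHED)
        cpu->flag = TOY_NEED_RESCHED;
}

/* Preemption point inside the kernel: only the urgent flag is honored. */
void toy_kernel_preempt_point(struct toy_cpu_state *cpu)
{
    if (cpu->flag == TOY_NEED_RESCHED)
        toy_schedule(cpu);
}

/* Return to user space: either flag causes a schedule. */
void toy_return_to_user(struct toy_cpu_state *cpu)
{
    if (cpu->flag != TOY_NONE)
        toy_schedule(cpu);
}
```

In this model an ordinary wakeup does not interrupt kernel work that finishes quickly, yet a task that stays in the kernel past the next tick is still preempted, which is the behavior the scattered `cond_resched()` calls were approximating by hand.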

Slide 134

Slide 134 text

Questions?
