Slide 1

Slide 1 text

Improving Linux for (KVM) Real Time Workloads Leonardo Brás Soares Passos Kernel Recipes 2025

Slide 2

Slide 2 text

whoami ● Leonardo Brás Soares Passos ● Work @ Arm UK (started this month) ● Kernel team ● arm64 ● KVM ● Previously: Virt-RT team @ Red Hat ● Find me: leobras @ {IRC, GitLab, GitHub, Mastodon}

Slide 3

Slide 3 text

Disclaimer This presentation is based on the experience acquired during my employment at Red Hat, as a member of the Virt-RT team. This work was not done in any capacity of my Arm employment.

Slide 4

Slide 4 text

Part 1: The basic stuff (for Real-Time)

Slide 5

Slide 5 text

What is Real Time ?

Slide 6

Slide 6 text

What is Real Time ? I need you to deal with my request in T time, or else...

Slide 8

Slide 8 text

What is Real Time ? ● Real-time (computing) is the computer science term for hardware and software systems subject to a "real-time constraint", for example from event to system response.[1] ● Real-time programs must guarantee response within specified time constraints, often referred to as "deadlines". [2]

Slide 9

Slide 9 text

What is a deadline? [Diagram: a Request / Event enters the Real Time Workload, which produces a Result / Response; the time between the two is the Response Time.] Deadline = maximum “Response Time” acceptable for a given workload
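The definition above can be sketched as a small measurement helper. This is only an illustration: `timed_call`, `handler` and `deadline_ns` are hypothetical names, not part of any tool mentioned in the talk, and Python itself is not a real-time environment.

```python
import time

def timed_call(handler, request, deadline_ns):
    """Run handler(request); return (result, response_time_ns, met_deadline)."""
    start = time.monotonic_ns()                      # request / event arrives
    result = handler(request)                        # the real-time workload
    response_time_ns = time.monotonic_ns() - start   # result / response is ready
    # The deadline is the maximum response time the workload can accept.
    return result, response_time_ns, response_time_ns <= deadline_ns
```

A caller would pass the workload as `handler` and compare the returned flag against its own policy for missed deadlines.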

Slide 10

Slide 10 text

Why does Real Time Matter? ● Missing a deadline can have bad consequences [3]: ● Hard – missing a deadline is a total system failure. ● Firm – infrequent deadline misses are tolerable, but may degrade the system's quality of service. The usefulness of a result is zero after its deadline. ● Soft – the usefulness of a result degrades after its deadline, thereby degrading the system's quality of service.
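The three categories can be read as different "usefulness" curves for a late result. The sketch below is one possible encoding; the linear decay used for the soft case is an assumption made for illustration, since the definitions only say that usefulness degrades after the deadline.

```python
def usefulness(kind, response_time, deadline):
    """Value of a result between 0.0 and 1.0, given when it was delivered."""
    if response_time <= deadline:
        return 1.0                      # on time: fully useful in every category
    if kind == "hard":
        # Hard: a missed deadline is a total system failure.
        raise RuntimeError("hard deadline missed")
    if kind == "firm":
        return 0.0                      # Firm: a late result is worthless
    if kind == "soft":
        # Soft: usefulness degrades. Assumed shape: linear, zero at 2x deadline.
        overrun = response_time - deadline
        return max(0.0, 1.0 - overrun / deadline)
    raise ValueError(f"unknown deadline kind: {kind!r}")
```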

Slide 11

Slide 11 text

Real Time workload examples ● Hard Real Time: ● Medical Systems (pacemakers, robot surgery) ● Industrial Process Controllers (assembly line, often Firm RT) ● Soft Real Time: ● Live audio / video systems (such as online meeting apps) ● Video games (missing deadline degrades experience)

Slide 12

Slide 12 text

Real Time Operating Systems ● Historically, Real Time applications ran on bare metal ● No operating system below ● To avoid the latency overhead of an operating system ● Programming an application on top of an operating system is much more comfortable and faster ● Abstractions, drivers, libraries ● What about an OS focused on Real Time workloads?

Slide 13

Slide 13 text

Real Time Operating Systems ● In order to have an RTOS, a different strategy is needed: ● Zephyr: Simpler kernel, with RT-oriented mechanisms. Many schedulers available for different RT requirements. [4] ● FreeRTOS: Whole OS is deterministic, has many memory allocation options (for different RT requirements), does not share the stack between processes [5] ● RTLinux: Linux runs the non-RT tasks, relying on a virtualized interrupt control. RT tasks run as POSIX threads on the RTLinux infrastructure (not using the Linux infrastructure). ● RT threads have higher priority than Linux threads [6]

Slide 14

Slide 14 text

RTLinux [Diagram: the RT Workload runs on RTLinux; Linux runs Task A, Task B and Task C (all non-RT). A Virtualized Interrupt Control layer sits between them: RT interruptions go to the RTLinux interruption handlers, Linux interruptions go to the Linux interruption handlers.]

Slide 15

Slide 15 text

What about Linux? ● Linux is the major free OS in the market ● But it was not planned as a Real Time operating system. ● Uses performance-oriented schedulers ● Focus on throughput instead of latency ● Makes sense for server workloads ● No deterministic behavior ● Workloads can be interrupted by the scheduler or the kernel ● Possibly causing missed deadlines. ● What if we turn Linux into a Real Time OS?

Slide 16

Slide 16 text

Turn Linux into a Real Time OS? ● It is one of the major OSes ● Many programming languages, libraries & tools are available ● Extensive hardware support (drivers) ● More options during hardware design ● Source available ● Code can be modified to add a new feature or device support

Slide 17

Slide 17 text

Part 2: Improving Real-Time behavior in the Linux Kernel

Slide 18

Slide 18 text

Real-Time scheduling policies

Slide 19

Slide 19 text

Real-Time scheduling policies Run my task first!

Slide 21

Slide 21 text

Real-Time scheduling policies ● The scheduler is the kernel component that decides which runnable thread will be executed by the CPU. ● Some scheduling policies are normal (performance-oriented) policies (SCHED_OTHER, SCHED_BATCH, SCHED_IDLE) ● Others are known as Real-Time scheduling policies ● They make sure the highest-priority workloads run first
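As a rough illustration of how these policies look from userspace, the sketch below uses Python's `os` module wrappers around the Linux scheduling syscalls. The helper names `priority_ranges` and `try_set_fifo` are made up for this example; `SCHED_FIFO` stands in for any real-time policy, and switching to it normally needs extra privileges, so the helper reports failure instead of raising.

```python
import os

NORMAL_POLICIES = ("SCHED_OTHER", "SCHED_BATCH", "SCHED_IDLE")
REALTIME_POLICIES = ("SCHED_FIFO", "SCHED_RR")

def priority_ranges():
    """Map each policy the platform exposes to its (min, max) static priority."""
    ranges = {}
    for name in NORMAL_POLICIES + REALTIME_POLICIES:
        policy = getattr(os, name, None)     # not every platform defines all of them
        if policy is not None:
            ranges[name] = (os.sched_get_priority_min(policy),
                            os.sched_get_priority_max(policy))
    return ranges

def try_set_fifo(priority):
    """Try to move the calling process to SCHED_FIFO; return True on success."""
    if not hasattr(os, "sched_setscheduler") or not hasattr(os, "SCHED_FIFO"):
        return False
    try:
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
        return True
    except OSError:                          # typically a permission error
        return False
```

On Linux the normal policies report a priority range of 0 to 0, while the real-time ones report a non-empty range, which is what lets the scheduler order RT tasks ahead of everything else.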

Slide 22

Slide 22 text

Real-Time scheduling policies ● SCHED_FIFO ● The highest-priority task always runs first, until it blocks, yields, or finishes ● Among tasks of the same priority, the scheduler picks the one that was queued first ● SCHED_RR (Round-Robin) ● The scheduler runs a highest-priority task for a time slice, then switches to the next task of the same priority ● SCHED_DEADLINE ● Runs the task with the earliest deadline first ● Runtime, Deadline & Period must be specified
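The ordering rules of the three policies can be compared with a toy simulation. This is a deliberately simplified model, not the kernel's implementation: every task is assumed runnable at time zero, nothing blocks, and the earliest-deadline model ignores the runtime and period accounting that SCHED_DEADLINE performs. All function names and task tuples are hypothetical.

```python
from collections import deque

def run_fifo(tasks):
    """tasks: (name, priority, runtime) in arrival order -> one name per time unit."""
    timeline = []
    # sorted() is stable, so tasks with equal priority keep their arrival order.
    for name, _prio, runtime in sorted(tasks, key=lambda t: -t[1]):
        timeline.extend([name] * runtime)    # runs until finished
    return timeline

def run_rr(tasks, timeslice=1):
    """Like run_fifo, but same-priority tasks take turns of `timeslice` units."""
    timeline = []
    for prio in sorted({p for _n, p, _r in tasks}, reverse=True):
        queue = deque((n, r) for n, p, r in tasks if p == prio)
        while queue:
            name, remaining = queue.popleft()
            ran = min(timeslice, remaining)
            timeline.extend([name] * ran)
            if remaining > ran:              # not finished: back of the queue
                queue.append((name, remaining - ran))
    return timeline

def run_edf(jobs):
    """jobs: (name, absolute_deadline, runtime) -> earliest deadline runs first."""
    timeline = []
    for name, _deadline, runtime in sorted(jobs, key=lambda j: j[1]):
        timeline.extend([name] * runtime)
    return timeline
```

With three priority-99 tasks A, B, C of two time units each and one priority-1 task D, `run_fifo` yields A A B B C C D while `run_rr` yields A B C A B C D; in both cases D only runs once the higher-priority list is empty.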

Slide 23

Slide 23 text

Real-Time scheduling policies [Diagram: run-time timelines for three schedulers. SCHED_FIFO: tasks A, B, C and D, E, F are queued in per-priority lists (Priority 1 List ... Priority 99 List) and each task runs in queue order. SCHED_RR: same lists, but tasks of the same priority alternate (A B C A B C ..., D E F D E F ...). SCHED_DEADLINE: tasks A, B, C with deadlines T+0, T+1 and T+3 run in deadline order.]

Slide 24

Slide 24 text

PREEMPT_RT

Slide 25

Slide 25 text

PREEMPT_RT Stop the kernel, run my task!

Slide 27

Slide 27 text

PREEMPT_RT What is it? ● A kernel option that enables the preemption model: ● "Fully Preemptible Kernel (Real-Time)" ● “This option turns the kernel into a real-time kernel” ● What is Preemption? ● Interrupting a task in order to execute a higher priority task ● tl;dr: Allow interruption of Kernel code by userspace code
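One way to check from userspace whether the running kernel was built with this option is to look at the kernel version string, which recent kernels tag with `PREEMPT_RT`. The sketch below is a heuristic under that assumption; the helper names are hypothetical, and older RT-patched kernels may format the string differently.

```python
import platform

def is_preempt_rt(version_string):
    """Heuristic: True if a `uname -v` style string carries the PREEMPT_RT flag."""
    return "PREEMPT_RT" in version_string.split()

def running_kernel_is_rt():
    """Apply the heuristic to the kernel this interpreter is running on."""
    return is_preempt_rt(platform.version())
```

The check is done on whole tokens so that other preemption models, such as a string containing `PREEMPT_DYNAMIC`, are not mistaken for a real-time kernel.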

Slide 28

Slide 28 text

PREEMPT_RT What does it change? ● Replaces locking primitives (spinlocks, rwlocks, etc.) ● Locks are turned into preemptible, priority-inheritance aware variants ● Serves as a mechanism to break long non-preemptible sections ● Allows the kernel to be (mostly) preemptible ● Some code paths such as ‘entry code’, the scheduler, and low-level interrupt handling are still not preemptible.

Slide 29

Slide 29 text

PREEMPT_RT How does it improve RT behavior? ● Preemptible kernel → kernel can be “interrupted” ● RT workload gets higher priority → gets CPU time earlier ● RT workload has reduced latency ● Also has a reduced chance of missing a deadline

Slide 30

Slide 30 text

PREEMPT_RT [Diagram: two timelines for a Real-Time Request (IRQ) arriving while kernel code is running. Without PREEMPT_RT: the kernel code finishes before the scheduler switches to the user code that processes the request. With PREEMPT_RT: the kernel code is preempted, the user code processes the request first, and the kernel code continues afterwards.]

Slide 31

Slide 31 text

CPU Isolation

Slide 32

Slide 32 text

CPU Isolation Don’t run stuff on these CPUs!

Slide 33

Slide 33 text

CPU Isolation Don’t run stuff on these CPUs! Unless I tell you to

Slide 34

Slide 34 text

CPU Isolation What is it? ● Linux boot parameter (isolcpus) ● Removes a list of CPUs from the SMP load balancer and scheduler. ● Tells other Linux code to avoid scheduling work on them.

Slide 35

Slide 35 text

CPU Isolation What does it change? ● The scheduler does not schedule work on isolated CPUs by default ● The user needs to manually assign (pin) a process to a given CPU ● The user has full control of what runs there. ● Also avoids some ‘housekeeping’ kernel work there ● Less workload interruption
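Pinning a process is done through the CPU affinity interface. A minimal sketch using Python's wrappers for `sched_setaffinity` and `sched_getaffinity` follows; `pin_to_cpu` is a hypothetical helper name, and which CPU number is actually isolated depends on the boot configuration of the machine.

```python
import os

def pin_to_cpu(cpu, pid=0):
    """Restrict `pid` (0 means the calling process) to one CPU.

    Returns the affinity set the kernel reports afterwards.
    """
    os.sched_setaffinity(pid, {cpu})
    return os.sched_getaffinity(pid)
```

The same effect is commonly achieved from a shell with a tool such as `taskset`; the point in both cases is that nothing lands on an isolated CPU unless it is placed there explicitly.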

Slide 36

Slide 36 text

CPU Isolation How does it improve RT behavior? ● RT workloads can be ‘pinned’ to isolated CPUs ● Single workload per CPU → not competing for CPU time ● Multiple CPUs can handle multiple different RT workloads ● Non-RT stuff can run on other CPUs

Slide 37

Slide 37 text

CPU Isolation [Diagram: a scheduler distributing tasks A, B, C, D, E, F, G, H, J plus one pinned task over CPU 0 to CPU 3. No isolcpus parameter: CPUs to schedule on are 0,1,2,3 (CPUMASK 0xF). With isolcpus=1,3: CPUs to schedule on are 0,2 (CPUMASK 0x5).]
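The relation between the isolcpus list and the CPU mask in the diagram can be worked out in a few lines. The sketch below handles only the simple comma and range forms of a cpu-list (the kernel accepts more elaborate syntax), and the helper names are hypothetical. Reading `/sys/devices/system/cpu/isolated` is included as a convenience and falls back to an empty set if the file is unavailable.

```python
def parse_cpu_list(text):
    """Parse a cpu-list such as "1,3" or "0-2,7" into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue                          # empty list or stray comma
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus

def schedulable_mask(nr_cpus, isolated):
    """Bitmask of the CPUs the scheduler may still use by default."""
    mask = 0
    for cpu in range(nr_cpus):
        if cpu not in isolated:
            mask |= 1 << cpu
    return mask

def read_isolated(path="/sys/devices/system/cpu/isolated"):
    """Return the isolated CPUs reported by sysfs, or an empty set."""
    try:
        with open(path) as f:
            return parse_cpu_list(f.read())
    except OSError:
        return set()
```

For four CPUs, an empty isolation list gives the mask 0xF, and isolating CPUs 1 and 3 leaves bits 0 and 2 set, which is 0x5.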

Slide 38

Slide 38 text

NOHZ_FULL

Slide 39

Slide 39 text

NOHZ_FULL Is my task alone? Don’t run the scheduler!

Slide 41

Slide 41 text

NOHZ_FULL What is it? ● Linux boot parameter (nohz_full) ● Takes a cpu-list ● Turns off the tick for a CPU (when there is a single task). ● Also offloads RCU callbacks ● to CPUs that are not in the NOHZ_FULL list ● Requires CONFIG_NO_HZ_FULL=y
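To see which of these boot parameters a running system was started with, the kernel command line can be inspected. The sketch below extracts the raw values without interpreting them; `boot_params` and `read_cmdline` are hypothetical helper names, and the sample command line in the usage note is invented.

```python
def boot_params(cmdline, names=("isolcpus", "nohz_full")):
    """Return {parameter: raw value} for the selected key=value boot parameters."""
    found = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in names:
            found[key] = value
    return found

def read_cmdline(path="/proc/cmdline"):
    """Return the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read()
```

For example, `boot_params("ro quiet isolcpus=1,3 nohz_full=1,3")` returns both parameters with the value `"1,3"`, and `boot_params(read_cmdline())` does the same for the live system.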

Slide 42

Slide 42 text

NOHZ_FULL What does it change? ● Less interruption ● No scheduler interrupting the CPU to check for other tasks ● RCU callbacks will not be run on that CPU ● (Offloaded to other CPUs) ● How does it improve RT behavior? ● Less interruption → RT workload gets more CPU time

Slide 43

Slide 43 text

NOHZ_FULL [Diagram: total time to process a Real-Time Request. Without NOHZ_FULL: the user code is interrupted by the scheduler while processing the request. With NOHZ_FULL: the user code runs uninterrupted, so the total time is shorter.]

Slide 44

Slide 44 text

Part 3: Numbers

Slide 45

Slide 45 text

Tests ● Cyclictest ● “Sleeps” for a given time, checks the maximum extra time spent before returning. ● Good for checking predictability ● Oslat ● Reads the time in a loop, checks the largest gap between consecutive reads ● Measures the maximum time userspace gets interrupted ● Stress-ng ● Creates cpu/mem workload to try to disturb latency
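The two measurement ideas can be imitated in a few lines. This is only a sketch of the principle, not a substitute for the real tools: the function names are hypothetical, and an interpreted language adds far more jitter than the latencies these tools are meant to expose.

```python
import time

def sleep_overshoot_ns(interval_ns, loops):
    """Cyclictest-like: sleep for a fixed interval, record how late each wakeup is."""
    samples = []
    for _ in range(loops):
        start = time.monotonic_ns()
        time.sleep(interval_ns / 1e9)
        samples.append(time.monotonic_ns() - start - interval_ns)
    return samples

def max_gap_ns(duration_ns):
    """Oslat-like: busy-read the clock, return the largest gap between two reads."""
    end = time.monotonic_ns() + duration_ns
    previous = time.monotonic_ns()
    largest = 0
    while previous < end:
        now = time.monotonic_ns()
        largest = max(largest, now - previous)   # a large gap means we were interrupted
        previous = now
    return largest

def summarize(samples):
    """Return (average, highest) of a list of latency samples."""
    return sum(samples) / len(samples), max(samples)
```

The first function depends on the kernel waking the task up on time, so it reflects scheduling and timer latency; the second never sleeps, so any gap it sees comes from something taking the CPU away from the loop.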

Slide 46

Slide 46 text

Numbers ● Cyclictest: 12h, 20 CPUs ● No other workload ● Average: 4.57us, Highest: 8us ● Workload on another CPU ● Average: 4.7us, Highest: 8us ● Workload on the same CPU ● Average: 31us, Highest: 34us

Slide 47

Slide 47 text

Numbers ● Oslat: 12h, 20 CPUs ● No other workload ● Average: 2us, Highest: 2us ● Workload on another CPU ● Average: 2.1us, Highest: 3us ● Workload on the same CPU ● N/A ● Oslat does not share CPUs well

Slide 48

Slide 48 text

Part 4: RT in Virtual Machines

Slide 49

Slide 49 text

Real Time Virtual Machines [Diagram: a Physical Machine (PREEMPT_RT, isolated CPUs, NOHZ_FULL, CPU1 ... CPUn) runs KVM with its threads in SCHED_FIFO. It hosts two Virtual Machines (vCPU1 ... vCPUk and vCPU1 ... vCPUm), each with PREEMPT_RT, isolated CPUs and NOHZ_FULL, and each running an RT Workload with threads in SCHED_FIFO.]

Slide 50

Slide 50 text

Motivation Same as regular VM motivations. ● Can use fewer, bigger & more efficient machines ● While meeting the latency needs ● Centralizing workloads ● Easier to move a workload when needed (Live Migration) ● More modular approach ● Cloud providers could rent Real-Time Virtual Machines ● Easier to implement RT workloads ● No need to worry about hardware specs

Slide 51

Slide 51 text

Drawbacks ● Virtualization adds overhead, and thus adds latency ● guest_exit and guest_entry take time ● Preemption may need to happen both on guest and host ● Housekeeping tasks both in guest and host ● The deadline needs to cover the network latency ● In case the processing is far from the user

Slide 53

Slide 53 text

Numbers ● Oslat: 12h, 16 CPUs ● Stress on non-isolated cores at host & guest ● Average: 14us, Highest: 23us ● Cyclictest: 24h, 8 CPUs ● Stress on non-isolated cores at host & guest ● Average: 15.6us, Highest: 18us ● The Virt-RT team works on further reducing those numbers

Slide 54

Slide 54 text

Some of my work on RT ● Add tracepoints on remotely scheduled functions ● Avoid isolated CPUs for queue_delayed_work() ● Note RCU quiescent state in guest_exit ● Avoid rcu_core() running shortly after guest_exit ● Convinced paulmck to add a new “patience” RCU option ● Change local_lock strategy in RT to avoid IPIs ● Ongoing, should remove a lot of IPIs for isolated CPUs

Slide 55

Slide 55 text

Thanks! Questions?

Slide 56

Slide 56 text

References: [1] “FreeRTOS - Open Source RTOS Kernel for small embedded systems - What is FreeRTOS FAQ?”. FreeRTOS. Retrieved 2021-03-08. [2] Ben-Ari, Mordechai; “Principles of Concurrent and Distributed Programming”, ch. 16, Prentice Hall, 1990, ISBN 0-13-711821-X, page 164. [3] Kopetz, Hermann; “Real-Time Systems: Design Principles for Distributed Embedded Applications”, Kluwer Academic Publishers, 1997. [4] “Zephyr Documentation: Schedulers”. Retrieved 2023-08-23. https://docs.zephyrproject.org/latest/kernel/services/scheduling/index.html [5] “FreeRTOS Documentation”. Retrieved 2023-08-23. https://www.freertos.org/features.html [6] “RTLinux page”. Retrieved 2023-08-23. https://en.wikipedia.org/wiki/RTLinux [7] “Linux Kernel Documentation”. Retrieved 2023-08-24. https://www.kernel.org/doc/html/latest