Faster & Fewer Page Faults

We have improved the Linux page fault mechanism to reduce the number of faults and handle them more quickly when they do happen. By managing memory in large folios, we reduce the number of page faults. The 4KiB page used on many architectures is simply too small for the amount of memory we need to manage today. When you take a page fault, the kernel can allocate multiple pages and map them all at the same time. By managing VMAs in a Maple Tree, we handle page faults more quickly. The Maple Tree is shallower than the red-black tree and uses the CPU cache more effectively. When you take a page fault, the kernel can find the information it needs to handle the page fault more quickly. These two projects together result in a significant reduction of time spent handling page faults and allow your computer to spend more of its time running user code. No cars were crashed in the execution of this project.

Matthew Wilcox

Kernel Recipes

September 30, 2023

Transcript

  1. Safe harbor statement
     The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, timing, and pricing of any features or functionality described for Oracle’s products may change and remains at the sole discretion of Oracle Corporation.
     Copyright © 2023, Oracle and/or its affiliates.
  2. Four projects
     • Maple Tree
     • Per-VMA Locking
     • Large Folios
     • New PTE manipulation interfaces
     https://www.cs.virginia.edu/~robins/YouAndYourResearch.html
  3. Linked Lists are Immoral
     • Your CPU is attempting to extract parallelism from your sequential code
     • My 2.8GHz laptop CPU is able to issue:
       - 6 insn per clock
       - 30 insn per 5-cycle L1 cache hit
       - 70 insn per 14-cycle L2 cache hit
       - 200 insn per 40-cycle (14ns) L3 cache hit
       - 1680 insn per 100ns L3 cache miss
     • Linked lists bottleneck on fetching the next entry in the list
     • Arrays can be prefetched
     • Walking a million-entry array is 12x faster than a million-entry list on my laptop (see the sketch below)
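
     The 12x figure above was measured on the speaker's laptop; the exact ratio will vary by machine. The following standalone userspace C program (not from the talk) is one way to reproduce the comparison: the array walk streams through memory the prefetcher can predict, while the shuffled list walk stalls on every pointer chase.

        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        #define N 1000000

        struct node {
                long value;
                struct node *next;
        };

        static double elapsed_ms(struct timespec a, struct timespec b)
        {
                return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
        }

        int main(void)
        {
                long *array = malloc(N * sizeof(*array));
                struct node *nodes = malloc(N * sizeof(*nodes));
                long *order = malloc(N * sizeof(*order));
                struct timespec t0, t1;
                long long sum = 0;
                long i;

                /* Build a random permutation so the list traversal order
                 * defeats the hardware prefetcher, as real lists tend to. */
                for (i = 0; i < N; i++)
                        order[i] = i;
                srand(42);
                for (i = N - 1; i > 0; i--) {
                        long j = rand() % (i + 1);
                        long tmp = order[i]; order[i] = order[j]; order[j] = tmp;
                }
                for (i = 0; i < N; i++) {
                        array[i] = i;
                        nodes[order[i]].value = i;
                        nodes[order[i]].next = (i < N - 1) ? &nodes[order[i + 1]] : NULL;
                }

                clock_gettime(CLOCK_MONOTONIC, &t0);
                for (i = 0; i < N; i++)
                        sum += array[i];        /* sequential, prefetchable */
                clock_gettime(CLOCK_MONOTONIC, &t1);
                printf("array: %.3f ms (sum %lld)\n", elapsed_ms(t0, t1), sum);

                sum = 0;
                clock_gettime(CLOCK_MONOTONIC, &t0);
                for (struct node *n = &nodes[order[0]]; n; n = n->next)
                        sum += n->value;        /* every step waits on a load */
                clock_gettime(CLOCK_MONOTONIC, &t1);
                printf("list:  %.3f ms (sum %lld)\n", elapsed_ms(t0, t1), sum);

                free(order);
                free(nodes);
                free(array);
                return 0;
        }

     Build with something like "cc -O2 listwalk.c" (the file name is arbitrary); the malloc() return values are left unchecked for brevity.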
  4. Anatomy of a page fault
     • Look up VMA (Virtual Memory Area) for virtual address
     • Walk down the page tables
     • If VMA is anonymous, allocate a page
     • Otherwise, call VMA fault handler
       - Fault handler may return a page or populate page table directly
     • If page provided, insert entry into page table
     (a simplified sketch of this flow follows below)
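
     As a rough illustration of the steps above, here is a heavily simplified, non-authoritative C sketch in the style of the real fault path (handle_mm_fault() and friends in mm/memory.c). It assumes the VMA lookup and page-table walk have already happened and their results are recorded in the struct vm_fault; locking, retries, COW, THP and most error handling are omitted, and the allocation call is deliberately generic rather than the NUMA-aware one the kernel really uses.

        #include <linux/mm.h>
        #include <linux/pgtable.h>

        /* Sketch only: not the real handle_mm_fault(). */
        static vm_fault_t sketch_handle_fault(struct vm_fault *vmf)
        {
                struct vm_area_struct *vma = vmf->vma;
                struct page *page;
                pte_t entry;

                if (vma_is_anonymous(vma)) {
                        /* Anonymous VMA: the kernel allocates the page itself. */
                        page = alloc_page(GFP_HIGHUSER_MOVABLE | __GFP_ZERO);
                        if (!page)
                                return VM_FAULT_OOM;
                } else {
                        /* File/device VMA: ask its fault handler for a page.
                         * The handler may instead install the PTE itself and
                         * return VM_FAULT_NOPAGE. */
                        vm_fault_t ret = vma->vm_ops->fault(vmf);

                        if (ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY))
                                return ret;
                        page = vmf->page;
                }

                /* Build a PTE for the page and insert it at the slot the
                 * page-table walk located (vmf->pte). */
                entry = mk_pte(page, vma->vm_page_prot);
                set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
                return 0;
        }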
  5. Four projects
     • Maple Tree
     • Per-VMA Locking
     • Large Folios
     • New PTE manipulation interfaces
  6. Looking up a VMA
     • VMAs were originally stored on a singly-linked list in 0.98 (1992)
     • An AVL tree was added in 1.1.83 (1995)
     • A Red-Black tree replaced the AVL tree in 2.4.9.11 (2001)
     • A Maple Tree replaced the linked list & Red-Black tree in 6.1 (2022)
  7. Maple Tree
     • In-memory, RCU-safe B-tree for non-overlapping ranges
     • Average branching factor of eight creates shallower trees (faster lookups)
     • Modifications allocate memory (slower modifications)
     • Applications typically have between 20 VMAs (cat) and 1000 (Mozilla)
       - Can be millions in pathological cases (ElectricFence)
     • RCU safety guarantees that a VMA which was present before the RCU lock was taken, and is still present after the RCU lock is released, will be found (see the sketch below)
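
     For a feel of the interface, here is a minimal sketch using the maple tree's simple mtree_* API (the VMA code uses the more advanced mas_* interface): store one non-overlapping range and look it up by any index inside it. Error handling is abbreviated and the payload is just a static variable's address.

        #include <linux/maple_tree.h>
        #include <linux/gfp.h>
        #include <linux/bug.h>

        static DEFINE_MTREE(sketch_tree);
        static int sketch_payload;

        static int maple_tree_sketch(void)
        {
                int ret;

                /* Associate every index in [0x1000, 0x1fff] with one entry. */
                ret = mtree_store_range(&sketch_tree, 0x1000, 0x1fff,
                                        &sketch_payload, GFP_KERNEL);
                if (ret)
                        return ret;

                /* Any index inside the range returns that entry... */
                WARN_ON(mtree_load(&sketch_tree, 0x1234) != &sketch_payload);
                /* ...and indices outside any stored range return NULL. */
                WARN_ON(mtree_load(&sketch_tree, 0x2000) != NULL);

                mtree_destroy(&sketch_tree);
                return 0;
        }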
  8. Four projects
     • Maple Tree
     • Per-VMA Locking
     • Large Folios
     • New PTE manipulation interfaces
  9. VMA tree locking
     • Protected by a semaphore from 2.0.19 (1996)
     • Changed to a read-write semaphore from 2.4.2.5 (2001)
     • Added per-VMA read-write semaphores in 6.4 (2023)
  10. Per-VMA locking lookup
      • Take RCU read lock to prevent Maple Tree nodes and VMAs from being freed
      • Load VMA from Maple Tree
      • Read-trylock the per-VMA lock
        - If write-locked, a writer is modifying this VMA
      • If MM seqcount is equal to VMA seqcount, VMA is write-locked
        - This allows a writer to unlock all locked VMAs just by updating mm seqcount
      • Drop RCU read lock; we will not look at the Maple Tree, and the VMA cannot be freed
      (a sketch of this lookup follows below)
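
      The sketch below is loosely modelled on lock_vma_under_rcu() and vma_start_read() as merged in 6.4, with details (detached-VMA retries, anon_vma checks, statistics) left out; treat it as an illustration of the bullet points above rather than the authoritative code.

        #include <linux/mm.h>
        #include <linux/rcupdate.h>

        /* Sketch of the lockless VMA lookup used by per-VMA-locked faults. */
        static struct vm_area_struct *sketch_lock_vma(struct mm_struct *mm,
                                                      unsigned long addr)
        {
                MA_STATE(mas, &mm->mm_mt, addr, addr);
                struct vm_area_struct *vma;

                rcu_read_lock();        /* keeps tree nodes and VMAs from being freed */
                vma = mas_walk(&mas);   /* load the VMA from the Maple Tree */
                if (!vma)
                        goto inval;

                /* Read-trylock the per-VMA lock; this fails if a writer holds
                 * the lock or the VMA seqcount matches the mm seqcount. */
                if (!vma_start_read(vma))
                        goto inval;

                /* A writer may have resized the VMA before we got the lock. */
                if (addr < vma->vm_start || addr >= vma->vm_end) {
                        vma_end_read(vma);
                        goto inval;
                }

                rcu_read_unlock();      /* VMA is now pinned by its own lock */
                return vma;
        inval:
                rcu_read_unlock();
                return NULL;            /* caller falls back to the mmap lock */
        }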
  11. Support for per-VMA locking
      • Anonymous VMAs handled from 6.4 on arm64, powerpc, s390, x86; 6.5 on riscv
      • Swap and userfaultfd support in 6.6
      • In-core page cache VMA support in 6.6
      • DAX support in 6.6
      • Page cache faults that need reads in 6.7?
      • COW faults of page cache VMAs in 6.7?
      • More support is possible, both architectures and types of memory
        - Device drivers may rely on mmap_sem synchronisation
        - HugeTLB faults have not yet been converted
  12. Four projects
      • Maple Tree
      • Per-VMA Locking
      • Large Folios
      • New PTE manipulation interfaces
  13. Large Folios
      • XFS files can be buffered in chunks larger than PAGE_SIZE since 5.17 (2022)
        - AFS since 6.0, EROFS since 6.2
      • Large folios can be created on write() since 6.6
      • Support for other filesystems & anonymous memory is in progress (see the opt-in sketch below)
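
      The page-cache side of this is opt-in per filesystem. A filesystem whose read and write paths can handle folios of any size enables them with a single call when it sets up an inode, roughly as in the sketch below (this is how XFS opts in).

        #include <linux/pagemap.h>

        /* Sketch: opting a file's page cache into large folios. */
        static void sketch_setup_inode(struct inode *inode)
        {
                /* Allow the page cache to allocate folios larger than
                 * PAGE_SIZE for this file; readahead and write() will
                 * then create them when it is profitable. */
                mapping_set_large_folios(inode->i_mapping);
        }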
  14. Four projects
      • Maple Tree
      • Per-VMA Locking
      • Large Folios
      • New PTE manipulation interfaces
  15. New PTE manipulation interfaces
      • set_pte_at() could only insert a single Page Table Entry
      • set_ptes() can insert n consecutive Page Table Entries pointing to contiguous pages
      • flush_dcache_folio() flushes the entire folio from the data cache
      • flush_icache_pages() flushes n consecutive pages from the instruction cache
      • update_mmu_cache_range() acts on n consecutive pages
        - Also tells the architecture which page was actually requested
      (a sketch using these interfaces follows below)
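
      Here is a hedged sketch of how the batched interfaces fit together when mapping several pages of a folio at once, loosely based on 6.6's set_pte_range() in mm/memory.c; rmap, dirty/write bits, and accounting are omitted.

        #include <linux/mm.h>
        #include <linux/pgtable.h>
        #include <asm/cacheflush.h>

        /* Map 'nr' consecutive pages starting at 'page' into 'nr' consecutive
         * PTEs starting at vmf->pte, with one call per interface instead of
         * one call per page. */
        static void sketch_map_pages(struct vm_fault *vmf, struct page *page,
                                     unsigned int nr, unsigned long addr)
        {
                struct vm_area_struct *vma = vmf->vma;
                pte_t entry;

                flush_icache_pages(vma, page, nr);
                entry = mk_pte(page, vma->vm_page_prot);
                set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr);
                /* Tell the architecture which PTEs changed, and which page
                 * the fault actually asked for (vmf->address). */
                update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr);
        }

      In the generic implementation, set_ptes() advances the PFN for each successive entry, so the caller only constructs the first page's PTE.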
  16. Projects I Don’t Have Time To Talk About
      • Large Anonymous Folios
      • Removing writepage()
      • Removing launder_folio()
      • Shrinking struct page
      • Batched folio freeing
      • bdev_getblk()
      • ext2 directory handling
      • folio_end_read()
      • mrlock removal
      • Converting buffer_heads to use folios
      • Lockless page faults
      • Removing GFP_NOFS
      • struct ptdesc
      • A better approach to the LRU list
      • Block size > PAGE_SIZE
      • Removing arch_make_page_accessible()
      • Why kernel-doc is not my favourite
      • Rewriting the swap subsystem
      • Removing __GFP_COMP
      • What does folio mapcount mean anyway?
      • Replacing the XArray radix tree with the maple tree
      • Converting HugeTLBfs to folios
      • Making HugeTLBfs less special
      • mshare
      • Improving readahead for modern storage
      • Support folios larger than PMD size
  17. Thanks
      Andrew Morton, Darrick Wong, Dave Chinner, David Howells, David Hildenbrand, David Rientjes, Davidlohr Bueso, Greg Marsden, Jan Kara, Johannes Weiner, Jon Corbet, Kiryl Shutsemau, Laurent Dufour, Liam Howlett, Michal Hocko, Michel Lespinasse, Mike Kravetz, Mike Rapoport, Paul McKenney, Ryan Roberts, Song Liu, Suren Baghdasaryan, Vlastimil Babka, Yin Fengwei