Where does my memory come from?

This talk provides a deep-ish dive into the APIs the kernel provides to userland to map memory: how they work internally, how the kernel tracks everything, what might happen to that memory (e.g. reclaim, migration) and how we handle that, the dread terror of forking, VMA merging and splitting, and all those intricate cogs that must keep turning to do what seems the most fundamental thing any program might do: allocating memory.

Lorenzo STOAKES

September 25, 2025

Transcript

  1. Where Does My Memory Come From?

    Kernel Recipes 2025. Lorenzo Stoakes, Consulting Member of the Technical Staff, Oracle Cloud Infrastructure. September, 2025.
  2. Where Does My Memory Come From? Take 1

    Copyright © 2025, Oracle and/or its affiliates

    [Diagram: app → glibc malloc() ("magical internal implementation details") → brk() / mmap(), crossing from userspace into the kernel]
  3. Well no, not really

    Not actually allocating anything.

    [Diagram: the process's mm_struct points via .mm_mt to a maple tree of VMAs covering the range 0 to 0x7fffffffffff]
  4. OK, So How Does Userland Ask For a VMA?

    The modern way is mmap():

    void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE, -1, 0);

    NULL: figure out the address for me. PROT_READ | PROT_WRITE: read/write VMA. MAP_ANON: anonymous memory (not mapping a file). The kernel installs a VMA at ptr.
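The mmap() call on this slide can be exercised end to end with a minimal C sketch. The helper names map_anon() and map_touch_unmap() are made up for illustration, and the sketch assumes Linux with glibc; it uses MAP_ANONYMOUS, which is the same flag as MAP_ANON on Linux.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS under strict -std= modes */
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical helper: map `size` bytes of private, anonymous, read/write
 * memory. mmap() signals failure with MAP_FAILED, not NULL. */
static void *map_anon(size_t size)
{
    void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    return ptr == MAP_FAILED ? NULL : ptr;
}

/* Hypothetical helper: map, write every byte, check both ends, unmap.
 * Returns 0 on success, -1 on failure. */
static int map_touch_unmap(size_t size)
{
    unsigned char *p = map_anon(size);
    if (p == NULL)
        return -1;
    memset(p, 0xab, size); /* the first write to each page faults it in */
    int ok = p[0] == 0xab && p[size - 1] == 0xab;
    if (munmap(p, size) != 0)
        return -1;
    return ok ? 0 : -1;
}
```

Calling map_touch_unmap(1 << 16) maps 64 KiB, touches it and unmaps it again. A zero-length request is rejected by the kernel, so the helper returns -1 for it.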
  5. Well OK, Not Quite

    If the new mapping has the same attributes as an adjacent VMA, the kernel MERGES them instead of installing a separate VMA. mmap() then returns an address within a valid VMA.

    [Diagram: VMAs across the range 0 to 0x7fffffffffff, before and after the merge]
  6. Also, brk()

    Can grow the 'program break' instead:

    if (!brk(new_brk_addr)) ... allocation succeeded ...

    If the system call succeeds, it returns the END of the valid range, i.e. the 'break'. Otherwise it returns the old break. The libc function just returns 0 on success, -1 on failure. Can use the sbrk() function, as sbrk(0), to get the current brk.
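As a rough illustration of the libc interface described above, the following C sketch reads the break with sbrk(0), grows it, writes into the new space and then restores it with brk(). The function name grow_and_restore_break() is hypothetical, and the sketch assumes a single-threaded Linux/glibc program in which nothing else (such as malloc()) moves the break in between.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* exposes brk()/sbrk() under strict -std= modes */
#endif
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: grow the program break by `bytes`, use the new
 * space, then shrink back. Returns 0 on success, -1 on failure. */
static int grow_and_restore_break(intptr_t bytes)
{
    void *old_brk = sbrk(0); /* sbrk(0) reports the current break */
    if (old_brk == (void *)-1)
        return -1;

    /* On success sbrk() returns the previous break, i.e. the start of
     * the newly valid range. */
    if (sbrk(bytes) == (void *)-1)
        return -1;

    void *new_brk = sbrk(0);
    int grew = ((char *)new_brk - (char *)old_brk) == bytes;

    ((char *)old_brk)[0] = 1; /* first byte of the new range is usable */

    if (brk(old_brk) != 0) /* libc brk(): 0 on success, -1 on failure */
        return -1;
    return grew ? 0 : -1;
}
```

A call such as grow_and_restore_break(4096) moves the break up by one typical page and back down again.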
  7. Where Does My Memory Come From? Take 2

    [Diagram: app → glibc malloc() ("magical internal implementation details") → brk() / mmap(), crossing from userspace into the kernel, which installs a VMA]
  8. Well... Not Quite

    In modern, sane hardware, memory is virtual and uses page tables to 'map' memory.

    [Diagram: the process's mm_struct points via .pgd to the top-level page table; a virtual address is resolved through PGD → P4D → PUD → PMD → PTE to a physical address]

    Four levels allow mapping 128 TiB of user memory. (Newer x86-64 can have 5 levels, mapping 64 PiB of user memory.)
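To make the page table walk concrete, here is a small C sketch that splits a virtual address into its per-level indices. It assumes x86-64 with 4-level paging and 4 KiB pages (9 index bits per level, 12 offset bits), where the P4D level is folded away; the struct and function names are invented for this example.

```c
#include <stdint.h>

/* Hypothetical container for the per-level table indices. */
struct pt_indices {
    unsigned pgd, pud, pmd, pte, offset;
};

/* Split a virtual address, assuming x86-64 4-level paging with 4 KiB
 * pages: bits 47-39 PGD, 38-30 PUD, 29-21 PMD, 20-12 PTE, 11-0 offset. */
static struct pt_indices split_vaddr(uint64_t vaddr)
{
    struct pt_indices idx = {
        .pgd    = (unsigned)((vaddr >> 39) & 0x1ff),
        .pud    = (unsigned)((vaddr >> 30) & 0x1ff),
        .pmd    = (unsigned)((vaddr >> 21) & 0x1ff),
        .pte    = (unsigned)((vaddr >> 12) & 0x1ff),
        .offset = (unsigned)(vaddr & 0xfff),
    };
    return idx;
}
```

For the highest user address, 0x7fffffffffff, the PGD index is 0xff: userland can use 256 top-level entries, each covering 512 GiB, which gives the 128 TiB figure.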
  9. Page Faults

    When userland tries to access memory that doesn't have page tables set up:

    1. The CPU detects this and triggers a page fault.
    2. The kernel handles the page fault.
    3. If the address is valid, the kernel transparently allocates memory and sets up the mapping.

    [Flowchart: user accesses unmapped memory → MMU hardware traps the fault, kernel handles it → kernel checks whether a valid VMA spans the range → No: SEGFAULT; Yes: allocates & sets up page tables → allocates & maps physical memory → memory transparently available for use]
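One way to watch this happen from userland is to count minor page faults around the first touch of a fresh anonymous mapping. The C sketch below does so with getrusage(); the function names are hypothetical, and the exact count is deliberately not promised, since it depends on the kernel, on transparent huge pages and on whether the environment reports ru_minflt at all.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS under strict -std= modes */
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Minor faults taken by this process so far, or -1 on error. */
static long minor_faults(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_minflt;
}

/* Hypothetical helper: map `npages` anonymous pages, touch each one once
 * and return how many minor faults that took (-1 on error). */
static long faults_from_first_touch(size_t npages)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || npages == 0)
        return -1;

    size_t len = npages * (size_t)page;
    volatile unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    long before = minor_faults();
    for (size_t i = 0; i < len; i += (size_t)page)
        p[i] = 1; /* no page tables yet: each first touch may fault */
    long after = minor_faults();

    munmap((void *)p, len);
    if (before < 0 || after < 0)
        return -1;
    return after - before;
}
```

On a typical Linux system faults_from_first_touch(64) reports roughly one fault per page touched, while no faults are counted at mmap() time.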
  10. Physical Allocation

    Check freelists based on 2^n page size, where n is the 'order'.

    If the required size is not available, keep dividing the next available block size by two until we have what we need.

    The 'other half' of each divided block is that block's 'buddy', e.g. page_pfn ^ (1 << order).

    When freeing, 'coalesce' blocks by combining the freed block and its buddy, if that buddy is already free.

    Invented in 1963 by Harry Markowitz, known to be highly efficient (& helpfully, simple!). Described very well in The Art of Computer Programming by Donald Knuth.
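The buddy arithmetic above fits in a few lines of C. This is only a sketch of the index calculations, not the kernel's allocator: the function names are invented for illustration, and page frame numbers are assumed to be aligned to the block size for the given order.

```c
#include <stdint.h>

/* Buddy of the block starting at `pfn` for a given order: flip the bit
 * that distinguishes the two halves of the parent block. */
static uint64_t buddy_pfn(uint64_t pfn, unsigned order)
{
    return pfn ^ (UINT64_C(1) << order);
}

/* Start of the order+1 block formed by coalescing a block with its
 * buddy: the lower of the two starting addresses. */
static uint64_t merged_pfn(uint64_t pfn, unsigned order)
{
    return pfn & ~(UINT64_C(1) << order);
}

/* Smallest order whose block (2^order pages) holds `npages` pages.
 * Capped at 63 so the shift stays well-defined. */
static unsigned order_for_pages(uint64_t npages)
{
    unsigned order = 0;
    while (order < 63 && (UINT64_C(1) << order) < npages)
        order++;
    return order;
}
```

For example, the order-2 block at page frame 8 has its buddy at frame 12 and vice versa; freeing both yields one order-3 block starting at frame 8. A request for 3 pages is rounded up to order 2 (4 pages).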
  11. Folios

    Page tables map virtual addresses to physical ones for userland memory:

    [Diagram: virtual address → PGD → PUD → PMD → PTE → physical address; the metadata on the virtual side is the VMA, the metadata on the physical side is the folio]

    We store metadata describing virtual mappings using VMAs. We store metadata describing physical memory using folios.

    Folios describe physical memory that is:
    • Physically contiguous.
    • Virtually contiguous.
    • Mappable into userland.

    The physical memory is aligned but can be mapped arbitrarily in virtual memory. Folios are, of course, always page-aligned.

    Folios can be of any (valid) order and use pages allocated by the physical allocator.
  12. Anonymous vs. File-backed

    The kernel makes an important distinction between 'anonymous' memory and 'file-backed' mappings. File-backed mappings map a file, anonymous mappings do not. Simple.

    From the kernel perspective, VMAs are anonymous if they do not map a file, but folios are anonymous if they do not live in the page cache. A more useful point of view: swap-backed vs. file-backed*. (* Edge cases: MADV_FREE sets anon folios non-swap-backed; shmem.)

    Folios belonging to file-backed mappings live in the page cache. This is, as it sounds, a cache of... err... folios for file-backed memory. Page cache folio ranges are tied (ultimately) to inodes. We can look them up easily, so when mapping a file we can very quickly do so.

    Anonymous memory simply consists of arbitrarily allocated memory. We cannot look these folios up easily.

    Importantly: we reclaim memory based on whether the folio is anonymous or not.

    Generally: when referring to anonymous memory we mean the folios. When referring to anonymous mappings we mean the VMAs.

    Of course this is mm so we make life difficult for ourselves:
    • shmem (tmpfs et al.) mappings have file-backed VMAs and folios, and are in the page cache, but are swap-backed + reclaimed as anon.
    • A user can map file-backed memory MAP_PRIVATE, which means it maps file-backed folios up until the mapping is written to, at which point we copy them into anonymous folios.
  13. Reverse Mapping

    We deal with physical memory at folio level, but need to look up VMAs too (e.g. for unmapping reclaimed memory or truncated files). We use the reverse mapping for this.

    [Diagram, anonymous: page tables (PGD → PUD → PMD → PTE) → folio → anon_vma → VMAs. Diagram, file-backed: page tables (PGD → PUD → PMD → PTE) → folio → address_space (inode) → VMAs]

    Anonymous folios map to an anon_vma, and file-backed folios to an address_space. Folios track their offset in the mapping.

    The reverse mapping is established between VMAs and a mapping object. Each 'related' VMA is tracked from the mapping via an interval tree.

    A typical reverse mapping operation moves from the folio via the mapping to the VMAs, and from there ascertains the virtual address ranges, which are used to walk page tables.
  14. Where Does My Memory Come From? Take 3

    [Diagram: app → glibc malloc() ("magical internal implementation details") → brk() / mmap() → kernel installs a VMA. Later, page fault... → allocate page tables as necessary → allocate physical memory → map]
  15. Closer... But the Memory Might Not Stick Around

    • Under memory pressure, or on madvise(), we reclaim the least recently used memory.
    • When we can't allocate higher-order pages we need to perform compaction.
    • Files mapped into memory might be deleted, which truncates mappings.
    • When enough PTEs are faulted in, we can replace them with a huge page via khugepaged.
    • The user might force a zap of memory via madvise(MADV_DONTNEED).
    • A user can fault in a range and force conversion to a huge page via MADV_COLLAPSE.
    • Of course, a user might simply munmap() memory or free via brk().
    • DAMON may trigger migration of memory via DAMOS_MIGRATE_HOT/COLD.
    • User invokes the NUMA mbind(), migrate_pages(), move_pages() system calls, causing migration.
    • NUMA balancing clears memory mappings, which might then migrate.
    • User uses the cgroup cpuset mechanism to constrain tasks to a node, causing migration.
    • Long-term GUP pin migrates now-unmovable pages away from movable ones.
    • mremap() moves memory, unmapping the existing mapping.
    • userfaultfd memory move using UFFDIO_MOVE, unmapping the original memory.
    • Copy-on-Write on an anon/MAP_PRIVATE file-backed write fault.
    • A driver might migrate user-mapped pages to device memory (e.g. GPU to VRAM).
    • madvise(MADV_DONTFORK) mappings don't exist in the child post-fork.
    • And more...
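One item from the list above is easy to observe from userland: zapping private anonymous memory with madvise(MADV_DONTNEED). In this C sketch (the function name is hypothetical, Linux behaviour assumed) the VMA stays in place, but the page behind it is dropped, so the next access faults in a fresh zero-filled page.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* exposes madvise()/MAP_ANONYMOUS in strict modes */
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: write 42 to a private anonymous page, zap it with
 * MADV_DONTNEED, and return the byte read back afterwards (-1 on error). */
static int zap_and_refault(void)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;

    unsigned char *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    p[0] = 42; /* faults in a page and dirties it */

    if (madvise(p, (size_t)page, MADV_DONTNEED) != 0) {
        munmap(p, (size_t)page);
        return -1;
    }

    int value = p[0]; /* page was zapped: this access faults again */
    munmap(p, (size_t)page);
    return value;
}
```

On Linux the value read back is 0 rather than 42: the mapping is still valid, but the memory that held the data is gone.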
  16. Reclaim 1

    Reclaim is the kernel's means of freeing up memory under memory pressure. Approximates a Least Recently Used (LRU) algorithm.

    What is memory pressure? Memory pressure simply means that free memory is getting low. Use watermarks to figure out what to do.

    How to figure out if recently used?

    [Diagram: a 64-bit PTE with regions labelled 'Unusable' and 'All zero'. Hardware ignores these for purposes of the target address, so instead we can use them for flags.]

    The accessed flag is set high every time the mapped page is accessed. The kernel can clear it so it knows: "This page was accessed since I last cleared this flag."
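The accessed-flag trick can be modelled in a few lines of C. This is a toy, software-only model, not kernel code: the names are invented, and the only hardware fact assumed is that the x86-64 accessed ('A') flag is bit 5 of a page table entry.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit position: the x86-64 accessed ('A') flag is bit 5. */
#define TOY_PTE_ACCESSED (UINT64_C(1) << 5)

/* Models what the MMU does when memory is accessed via this entry. */
static void toy_hw_access(uint64_t *pte)
{
    *pte |= TOY_PTE_ACCESSED;
}

/* Models what a reclaim scan does: report whether the entry was used
 * since the last scan, and clear the flag ready for the next one. */
static bool toy_test_and_clear_accessed(uint64_t *pte)
{
    bool was_accessed = (*pte & TOY_PTE_ACCESSED) != 0;
    *pte &= ~TOY_PTE_ACCESSED;
    return was_accessed;
}
```

Two scans with no access in between return false the second time, which is the signal that the page has not been used recently.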
  17. Reclaim 2

    TWO pairs of LRUs, for file-backed and swap-backed memory.

    [Diagram: a folio is first faulted in onto the inactive LRU. Accessed once: set the folio referenced flag to mark it seen once; clear it if not accessed since. Accessed twice: move to the active LRU. Not accessed: reclaim from the inactive LRU.]

    Reclaim:
    1. If file-backed + clean, clear the page table entry at PTE level. If dirty, write to disk, then reclaim.
    2. If anonymous (not file-backed), put in swap cache and add a swap entry at PTE level, start write to disk.
    3. Either way, mapping now faults.
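The "seen once, then promoted" rule can be sketched as a tiny state machine in C. This is a toy model under simplifying assumptions (one referenced flag and one active flag per folio, no actual lists), and the type and function names are invented for illustration rather than taken from the kernel.

```c
#include <stdbool.h>

/* Toy folio state: which LRU it is on, and whether it was seen once. */
struct toy_folio {
    bool referenced; /* accessed once since the flag was last cleared */
    bool active;     /* true: active LRU, false: inactive LRU */
};

/* An access to the folio was observed. */
static void toy_mark_accessed(struct toy_folio *f)
{
    if (!f->referenced) {
        f->referenced = true;  /* first sighting: just remember it */
    } else if (!f->active) {
        f->active = true;      /* second sighting: promote to active */
        f->referenced = false;
    }
}

/* A scan found the folio had not been accessed since the last look. */
static void toy_mark_idle(struct toy_folio *f)
{
    f->referenced = false;
}
```

A folio touched twice in a row ends up on the active list, while one touched, left idle and touched again stays on the inactive list, where it remains a candidate for reclaim.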
  18. Swap

    Swap OUT:
    1. Put folio in swap cache, replace the PTE with a swap entry for every mapping. Mapping will now fault.
    2. Start async writeback of folio to swap.
    3. When writeback done, free folio from swap cache.

    Swap IN:
    1. Check if already in swap cache, if not, start async read of folio to swap cache.
    2. When done, replace swap entry with folio. No longer faults.
    3. If no swap entries left, remove from swap cache.

    [Diagrams: virtual address → PGD → PUD → PMD → PTE (swap entry); folio in the swap cache, written to or read from disk]
  19. Forking & Copy-on-Write (CoW)

    Anonymous memory seems to be simple, but forking makes it not so simple.

    For efficiency, when we fork, we copy page tables, marking the 'leaf' entries (PTE, or higher if huge) read-only.

    If somebody writes, then we fault: if they are the exclusive mapper the entry is just marked read/write, otherwise we copy.

    [Diagram: two sets of page tables (PGD → PUD → PMD → PTE), with the read-only flag set, map the same folio; the folio's mapping links back to the VMAs]

    Things get complicated fast, as one process can fork another which can fork another, etc. etc.
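The userland-visible effect of copy-on-write can be checked with a short C sketch: after fork(), a write in the child lands in the child's own copy and the parent still sees the old value. The function name is hypothetical and the sketch assumes Linux/POSIX; it shows only the observable behaviour, not how the kernel decides between re-marking the entry writable and copying.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS under strict -std= modes */
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: store 1 in private anonymous memory, fork, let the
 * child overwrite it with 2, and return what the parent sees afterwards
 * (-1 on error). */
static int parent_value_after_child_write(void)
{
    int *p = mmap(NULL, sizeof(*p), PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    *p = 1;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(p, sizeof(*p));
        return -1;
    }
    if (pid == 0) {
        *p = 2; /* write fault in the child: it gets a private copy */
        _exit(*p == 2 ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) != pid ||
        !WIFEXITED(status) || WEXITSTATUS(status) != 0) {
        munmap(p, sizeof(*p));
        return -1;
    }

    int seen = *p; /* the parent's folio was never modified */
    munmap(p, sizeof(*p));
    return seen;
}
```

The parent reads back 1 even though the child observed 2 at the same virtual address.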
  20. Compaction/Migration 1

    When we physically allocate memory, we allocate sizes based on physically contiguous pages:

    2^0 pages - order 0
    2^1 pages - order 1
    2^2 pages - order 2
    ...

    The physical allocator will coalesce smaller pages into larger ones when freed. But sometimes they get fragmented.

    [Diagram: pages owned by A, B and C are interleaved (A B A B B C C C A A), so D, which needs order 2 (4 pages), cannot be satisfied. Isolate and migrate folios* to other pageblocks, leaving a contiguous run for D D D D.]

    To reduce fragmentation between userland/kernel we place userland (movable) memory in pageblocks (typically order-9).

    *Folios consist of the order-N physical page provided by the physical allocator.
  21. Compaction/Migration 2

    1. Isolate folios by removing them from the LRU, then use the reverse mapping to find all mapped PTEs + set migration entries.
    2. Find a suitable destination for each previously mapped folio & isolate those free pages from being allocated.
    3. Copy each source folio to the destination, then replace the migration entry with a PTE entry pointing at the now-moved folio, freeing the original.

    If somebody page faults while a migration entry is in place, they wait for migration to be done, or for it to fail and put back the folio.

    (There are other ways for migration to happen too, with a similar mechanism.)

    [Diagram: page tables (PGD → PUD → PMD → PTE) hold a migration entry in place of the PTE while the folio, found via its mapping and VMAs, is copied to its destination folio]
  22. Where Does My Memory Come From? Take 4

    [Diagram: app → glibc malloc() ("magical internal implementation details") → brk() / mmap() → kernel installs a VMA. Later, page fault... → allocate page tables as necessary → allocate physical memory → map → maybe page fault again → kernel fixes up]

    A lot of complexity in the details here :)
  23. Conclusions

    • We make allocating memory easy from userland,
    • But there's a lot of complexity underneath.
    • Managing resources & abstracting complexity is the kernel's job :)
    • Memory is a lot more fluid than you might think...
      ◦ It moves around,
      ◦ And can flicker in and out of existence,
      ◦ But when you need it, it's there.
    • Virtual memory is very powerful.