
The Little Memory Block That Could


Alongside the CPU, memory is a core resource in every modern computing device. It is accessed by peripherals through DMA and by the main OS through the CPU. With the advent of virtualisation, concurrent OS and CPU technologies interact to give a seamless experience.
This talk will walk through the evolution of memory and CPUs leading to the current x86 and ARM landscape; be warned though, there be dragons...

GeekMasher

March 02, 2016


Transcript

  1. Who are we?? Pablo Crossa - 4th year EH student

    - Programmer - Low-level Junkie - Ranter of all-the-things Mathew Payne (@GeekMasher) - 4th year EH student - Programmer - I can Powerpoint - Listens to this Ranter 2
  2. Agenda - History - MMU - CPU Cache - VT-x

    / AMD-V - VT-d / AMD-Vi / IOMMU - NX / XN / XD / EVP - ASLR Where the rant will take us 3
  3. Why was it introduced? 6 - CPUs needed to store

    temporary data - Volatile memory was born - Directly accessible by the CPU - Faster than secondary storage - Lost on shutdown https://en.wikipedia.org/wiki/Computer_memory
  4. Memory Technologies - Unreliable types - “store bits of information

    in the form of sound waves propagating through mercury” - Magnetic-core memory - Semiconductor memory - Current memory type 7 https://en.wikipedia.org/wiki/Computer_memory
  5. Semiconductor memory types DRAM - Each bit is stored in

    a capacitor - These slowly discharge - All values need to be read and re-written periodically (called refreshing) - Slow but cheap - Denser due to component count (one capacitor and one transistor per bit) SRAM - Each bit is stored in a flip-flop circuit - No refresh needed - Fast but expensive 8 https://en.wikipedia.org/wiki/Computer_memory
  6. Introduction - An OS needed a way to: - Protect

    from buggy or malicious software - Isolate applications - Isolate the kernel - Reduce memory fragmentation - Memory Management Unit - The MMU was introduced, which: - Included memory protection - Supported per-process address mapping - Supported paging 10 https://en.wikipedia.org/wiki/Memory_management_unit
  7. Overview - The virtual address space is divided into “pages”

    - Usually a few kilobytes in size - Can be much larger (1GiB on x64) - These pages can have different access permissions - Reading, writing and/or executing, or none - Several virtual pages can map to the same physical address - The same virtual page can also map to different physical addresses over time 11 https://en.wikipedia.org/wiki/Memory_management_unit
  8. Paging - When low on memory, OS can mark physical

    pages as unassigned - When such a page is accessed, another active page is written out to secondary storage and its frame is allocated to this section - When the ‘swapped out’ page is accessed, another page is stored in the ‘paging file’ (‘PAGEFILE.SYS’ on Windows, ‘swap’ on Unix) and the old one is loaded in its place 12 http://wiki.osdev.org/Paging
  9. Page sizes - x86 - x86: - 4KiB - 4MiB

    - Since Pentium - 2MiB - PAE enabled - 36bit physical address - All can coexist within the same process 14 https://en.wikipedia.org/wiki/Memory_management_unit#IA-32_.2F_x86
  10. Page sizes - x64 - x64: - 4KiB - 2MiB

    - 1GiB - Huge pages - All can coexist within the same process 15 https://en.wikipedia.org/wiki/Memory_management_unit#x86-64
  11. Page sizes - ARM - ARM - 4KiB - 64KiB

    - 1MiB - 16MiB - (Legacy: 1KiB Tiny pages) - All can coexist within the same process 16 https://en.wikipedia.org/wiki/Memory_management_unit#ARM
  12. Original x86 MMU design - 4KiB pages only - 2

    table types - 2 stage translation - Page Directory - Page Table - Page directory contains 1024 x 32bit entries, each pointing to a page table (bits 31-12 address) - Page table contains 1024 x 32bit entries, each pointing to a 4KiB page in memory (bits 31-12 address) - 1024 * 1024 * 4KiB = 4GiB address space 18 http://wiki.osdev.org/Paging
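The two-stage split above is plain bit arithmetic; a minimal sketch (the function name is mine, not from the deck):

```python
# Split a 32-bit virtual address into the original x86 MMU's two translation
# indices: 10 bits of page-directory index, 10 bits of page-table index, and
# a 12-bit offset into the 4KiB page.

def split_x86_address(vaddr):
    offset = vaddr & 0xFFF            # bits 11-0: offset within the 4KiB page
    pt_index = (vaddr >> 12) & 0x3FF  # bits 21-12: page-table entry (0-1023)
    pd_index = (vaddr >> 22) & 0x3FF  # bits 31-22: page-directory entry (0-1023)
    return pd_index, pt_index, offset

# 1024 page tables x 1024 entries x 4KiB pages = the full 4GiB address space
assert 1024 * 1024 * 4096 == 2**32
```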
  13. Like all x86, it got extended... - 4MiB pages -

    Extend page directory - 1 stage translation - Setting S to 1 will mark the entry as 4MiB - bits 21 through 12 Reserved (bits 31-22 address) 19 http://wiki.osdev.org/Paging
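The size-bit decode can be sketched as follows; the S bit the slide mentions is the page-size (PS) bit, bit 7 of the directory entry, and the helper name is mine:

```python
# Decode an extended page-directory entry: with the size bit set, the entry
# maps a 4MiB page directly (address in bits 31-22); otherwise it points to a
# page table (address in bits 31-12).

PS_BIT = 1 << 7  # the page-size ("S") bit of the directory entry

def decode_pde(entry):
    if entry & PS_BIT:
        return ("4MiB page", entry & 0xFFC00000)  # bits 31-22
    return ("page table", entry & 0xFFFFF000)     # bits 31-12
```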
  14. And extended... - PAE - 4KiB and 2MiB pages -

    New table with 4 x 64bit entries, each pointing to a page directory - one more translation stage - Page directory with 512 x 64bit entries, each pointing to a 2MiB page or a page table - Page table with 512 x 64bit entries, each pointing to a 4KiB page 20 http://wiki.osdev.org/Paging
  15. And extended some more... - Long mode (x64) - 4KiB

    pages - 4 stage translation - 2MiB pages - 3 stage translation - 1GiB pages - 2 stage translation 22 http://wiki.osdev.org/Page_Tables#Long_mode_.2864-bit.29_page_map
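The 4-stage long-mode walk uses four 9-bit indices (512 entries per table) plus a 12-bit page offset; a sketch using the usual level names (PML4, PDPT, PD, PT), which are not from the deck:

```python
# Split a 48-bit x64 virtual address into the four 9-bit table indices and
# the 12-bit page offset used by the 4KiB 4-stage translation. For 2MiB pages
# the walk stops one level earlier (3 stages); for 1GiB, two levels earlier.

def split_x64_address(vaddr):
    offset = vaddr & 0xFFF        # bits 11-0
    pt   = (vaddr >> 12) & 0x1FF  # bits 20-12: page table
    pd   = (vaddr >> 21) & 0x1FF  # bits 29-21: page directory
    pdpt = (vaddr >> 30) & 0x1FF  # bits 38-30: page-directory-pointer table
    pml4 = (vaddr >> 39) & 0x1FF  # bits 47-39: page-map level 4
    return pml4, pdpt, pd, pt, offset
```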
  16. MMU lookup 27 - An x64 4KiB page read should

    need AT LEAST 5 memory accesses for 1 read (4 to translate, one to read) - But does it? http://wiki.osdev.org/Page_Tables#Long_mode_.2864-bit.29_page_map
  17. CPUs got too good - Memory was lagging behind CPUs

    - Latency too high, speed too low - Cache was widespread since the 1980s - Now cache is hierarchical 29 http://www.extremetech.com/extreme/188776-how-l1-and-l2-cpu-caches-work-and-why-theyre-an-essential-part-of-modern-chips
  18. CPU Caches - Present in 3 or 4 levels (L1

    to L4) - If there was a cache miss (not a cache hit) proceed to the next level - Different types and specialisation at each level 30 http://www.extremetech.com/extreme/188776-how-l1-and-l2-cpu-caches-work-and-why-theyre-an-essential-part-of-modern-chips
  19. Cache types - Direct mapped - Any address can only

    exist in one cache location - Fully associative - Any address can exist in any cache location - Every cache location must be checked - N-way set associative - In between - In a 2-way set associative cache, any address can exist in 2 cache locations - Duplicated circuitry (e.g. an 8-way set associative cache has 8 parallel readers for each possible location) 31 https://en.wikipedia.org/wiki/CPU_cache#Associativity
  20. Cache index fetch - All caches use some form

    of modulo operation - A direct-mapped 256-byte cache will do: index = [ address % 256 ] - Lowest hit rate (addresses 0 and 256, among many others, can’t coexist) - More complex schemes perform more calculations - A 2-way set associative 256-byte cache will do: index[0] = [ address % 128 ] * 2 index[1] = index[0] + 1 - From 4-way upward, hit rates become very high 32 https://en.wikipedia.org/wiki/CPU_cache#Associativity
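The two index formulas above, written out in code (using the slide's simplified byte-granularity model rather than real cache lines; function names are mine):

```python
# Index calculations for a 256-byte cache, following the slide's formulas.

def direct_mapped_index(address, cache_size=256):
    return address % cache_size               # one possible location

def two_way_indices(address, cache_size=256):
    base = (address % (cache_size // 2)) * 2  # index[0] = [ address % 128 ] * 2
    return [base, base + 1]                   # index[1] = index[0] + 1

# Addresses 0 and 256 collide in the direct-mapped case (same index)
assert direct_mapped_index(0) == direct_mapped_index(256)
```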
  21. Cache operation - Convert an address to a cache index

    - Relevant entries have a tag that verifies the address corresponds to the cache entry - If there is a miss, traverse to the next level 33 https://en.wikipedia.org/wiki/CPU_cache#Address_translation
  22. Cache index and tags - Physically indexed, Physically tagged -

    Needs to go through the MMU before the lookup can even begin - Virtually indexed, Virtually tagged - Very fast but flushed on every context switch - If two virtual addresses point to the same physical address (or vice-versa) it’s very difficult to ensure coherency - Virtually indexed, Physically tagged - Index computation and fetch can begin alongside MMU translation - No coherency issues - Physically indexed, Virtually tagged - ... 34 https://en.wikipedia.org/wiki/CPU_cache#Address_translation
  23. Example Entry 36 - index: 0xF0 - Tag = 0x456789F0

    - Data = 0xCAFEBABE - A 64-bit entry holding 32-bit data http://www.7-cpu.com/cpu/Skylake.html
  24. Example VIPT lookup - 256 byte direct mapped cache, 32bit

    data read - Virtual address: 0x801234F0 - MMU lookup starts & Cache lookup starts - index = [ 0x801234F0 % 256 ] = 0xF0 - tag = 0x456789F0 - MMU returns address 0x456789F0 - Cache entry valid, data is read - 0xCAFEBABE 37 https://en.wikipedia.org/wiki/Translation_lookaside_buffer#Overview
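The worked example can be replayed in code; the MMU here is just a stub lookup table standing in for a real page walk, and all names are mine:

```python
# Replay of the VIPT example: a 256-byte direct-mapped cache, indexed by the
# virtual address and tagged with the physical address.

CACHE = {0xF0: {"tag": 0x456789F0, "data": 0xCAFEBABE}}  # one valid entry
MMU = {0x801234F0: 0x456789F0}                           # stub virtual->physical map

def vipt_read(vaddr):
    index = vaddr % 256   # cache lookup can start before translation finishes
    paddr = MMU[vaddr]    # on real hardware the MMU lookup runs in parallel
    entry = CACHE.get(index)
    if entry is not None and entry["tag"] == paddr:  # physical tag confirms the hit
        return entry["data"]
    return None           # miss: traverse to the next cache level
```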
  25. L1 - Priority is speed - Very small - Separate

    data and instruction - i7-6700: 64KiB 8-way set associative - VIPT - 32KiB instruction: 4 cycle latency - Split into 2 x 16KiB caches if HT - 32KiB data: 5 cycle latency (for complex addresses) - Memory can be ~120 cycle latency 39 http://www.7-cpu.com/cpu/Skylake.html
  26. L2 - Larger than L1 - Unified data and instruction

    - i7-6700: 256KiB 4-way set associative, 12 cycle latency - PIPT 40 http://www.7-cpu.com/cpu/Skylake.html
  27. L3 - Larger than L2 - Unified data and instruction

    - Shared on all cores - i7-6700: 8MiB 16-way set associative, 42 cycle latency - PIPT - 2MiB per core, 1MiB per thread (HT enabled) 41 http://www.7-cpu.com/cpu/Skylake.html
  28. L4 - Larger than L3 - Unified data and instruction

    - Off silicon, but not off chip - Victim cache for L3 (eviction) - Haswell with Iris Pro GT3e: 128MiB - PIPT - IGP dedicated memory - ~4x speed of memory - 61 cycle latency 42 http://www.extremetech.com/extreme/188776-how-l1-and-l2-cpu-caches-work-and-why-theyre-an-essential-part-of-modern-chips
  29. MMU caches - TLBs - Intel Nehalem virtual translation TLBs:

    - L1 DTLB - L1 ITLB - Split into 2 x ITLB if HT - L2 TLB - Unified - First type of CPU cache was an MMU TLB 43 https://en.wikipedia.org/wiki/Translation_lookaside_buffer#Multiple_TLBs
  30. Before the Introduction of VT-x - First implementation of virtualisation

    was done in software - Very slow - Software emulated instruction decoder and execution - Software emulated MMU - Limited access to hardware - Similar to running an application on the host OS - Runs in Ring 3 - Apart from VirtualBox 46 https://en.wikipedia.org/wiki/X86_virtualization#Software-based_virtualization
  31. Introduction of VT-x - Sped up virtualisation - By running

    the virtualised instructions through the native decoder and executing them natively - Requires BIOS and CPU support - Still had MMU limitations - The MMU was still emulated in software - Until… 47 https://en.wikipedia.org/wiki/X86_virtualization#Intel_virtualization_.28VT-x.29
  32. SLAT (RVI / EPT) Support … Another MMU extension -

    Second Level Address Translation - Rapid Virtualization Indexing / Extended Page Tables - Allows for page-table virtualisation - Avoids the overhead associated with software-managed shadow page tables - Uses the native MMU, not a virtualised one in software - An x64 4KiB page read with SLAT can need AT LEAST 6 (!) memory accesses... 48 https://en.wikipedia.org/wiki/Second_Level_Address_Translation
  33. ARM Virtualisation Extensions - The first draft of the ARM

    virtualisation extensions include all the above technologies - Virtualised execution - SLAT - The Raspberry Pi 2 CPU supports virtualisation - play around with KVM 49 http://genode.org/documentation/articles/arm_virtualization
  34. What is Directed I/O ?? - Used for I/O devices

    VM passthrough - PCI & PCI Express devices - Network, GPU, sound, etc. cards - Secures memory isolation - Devices cannot access memory outside of the range allocated to them, enforced at the hardware level - Supports DMA remapping - Address translations for device DMA data transfers - Interrupt remapping - Provides VM routing and isolation of device interrupts 51 http://pix.toile-libre.org/upload/original/1353272461.png https://software.intel.com/en-us/blogs/2009/06/25/understanding-vt-d-intel-virtualization-technology-for-directed-io https://www-ssl.intel.com/content/dam/www/public/us/en/documents/product-specifications/vt-directed-io-spec.pdf
  35. Support & Implementations - VT-d requires chipset, CPU and BIOS

    support - All of these requirements must be met - Unlike VT-x, which does not require chipset support - Many devices claim to support this functionality - Implementations often ship broken from the beginning, or are broken by updates - Typically BIOS updates 52 http://wiki.xen.org/wiki/VTd_HowTo
  36. ARM Virtualisation Extensions - The first draft of the ARM

    virtualisation extensions include the above technology - IOMMU - Very powerful technologies introduced simultaneously 53 http://genode.org/documentation/articles/arm_virtualization
  37. The non-executable bit - Sets memory page to non-executable -

    if set to 1: - code cannot be executed from the page, which is assumed to hold data - else: - executable - This bit is set in the page table - The NX bit is bit 63 of the page entry - Stops code from being executed in data storage areas - For example: on the stack where variables are stored 59 https://en.wikipedia.org/wiki/NX_bit
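Bit 63 of a 64-bit page entry can be set and tested with plain masks; a sketch (the entry value is a made-up example, and the helper names are mine):

```python
# The NX bit is bit 63 of the 64-bit page entry: once set, the page can no
# longer be executed, only read or written.

NX_BIT = 1 << 63

def set_nx(entry):
    return entry | NX_BIT  # mark the page non-executable

def is_executable(entry):
    return (entry & NX_BIT) == 0

entry = 0x0000000000123067  # hypothetical page entry with some low flag bits set
assert is_executable(entry)
assert not is_executable(set_nx(entry))
```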
  38. DEP - Data Execution Prevention - Used on Windows (with

    OS X and iOS implementations) - DEP is an abstraction over all the different non-executable bits available - If the x86 processor supports NX, the system's BIOS supports it and it has been enabled - Then the DEP features are turned on in the OS - Also uses SafeSEH, which does not need the NX bit 61 https://en.wikipedia.org/wiki/Data_Execution_Prevention
  39. Software Implementations - For historical purposes - This is a

    lot slower - Projects: - Exec Shield and W^X - Emulate an NX bit on x86 CPUs that lack a native NX bit - It’s all done in hardware now - Only used on systems without native NX support 62 https://en.wikipedia.org/wiki/W%5EX http://web.archive.org/web/20050512030425/http://www.redhat.com/f/pdf/rhel/WHP0006US_Execshield.pdf
  40. ARM XN - Introduced on ARMv6, supported on all major

    platforms 63 http://infocenter.arm.com/help/topic/com.arm.doc.ddi0360f/CACHFICI.html
  41. Before Non-Executable Added - Buffer overflow exploits - One of

    the most basic attacks - Where an attacker’s code can be placed on the stack and then executed - Once non-executable support was implemented, the MMU stopped memory blocks with the NX bit set from being executed - It does this by throwing a page fault exception 65 https://en.wikipedia.org/wiki/Buffer_overflow#Stack-based_exploitation
  42. After Non-Executable Added - Return-Oriented Programming (ROP) Chains - Allows

    an attacker to fill the stack with jumps to code sections already present in memory - Can modify the NX bit of the page(s) where the attacker’s code is stored - This allows the code to be executed without throwing a page fault exception - Requires libraries and executables to be compiled with absolute base addresses - Until… 66 https://en.wikipedia.org/wiki/Return-oriented_programming
  43. ASLR - Address Space Layout Randomisation - Places executables and

    associated libraries in random locations in virtual memory space - Windows flags: DYNAMICBASE / HIGHENTROPYVA - Linux flags: PT_INTERP (kinda) - Makes ROP chains harder but not impossible - K(ernel)ASLR - Used to randomise the kernel’s location to hinder kernel-based exploits 69 https://en.wikipedia.org/wiki/Address_space_layout_randomization
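One crude way to observe ASLR from user space (assuming CPython on an OS with ASLR enabled): spawn two interpreters, print the address of a fresh heap object in each, and compare. With ASLR on, the heap base, and so the printed addresses, will usually differ between runs.

```python
# Print the address of a newly allocated object in two separate interpreter
# processes; under ASLR the two addresses usually differ.
import subprocess
import sys

def heap_address():
    out = subprocess.run(
        [sys.executable, "-c", "print(id(bytearray(1)))"],
        capture_output=True, text=True, check=True)
    return int(out.stdout)

a, b = heap_address(), heap_address()
print(hex(a), hex(b))  # typically two different addresses when ASLR is on
```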
  44. PIC & PIE - Position-Independent Code/Executable - Used primarily for

    libraries - Allows them to move dynamically per application - Can be executed at any memory address without modification - Differs from relocatable code (load time patching) 70 http://www.iecc.com/linker/linker08.html
  45. Testing the entropy 72 - Windows 8.1, 64 bit (VM): ~3,000

    results - Debian Linux, 64 bit (native): 10,000 results