
The Little Memory Block That Could


Alongside the CPU, memory is a core resource in every modern computing device. It is accessed by peripherals through DMA and by the main OS through the CPU. With the advent of virtualisation, concurrent OS and CPU technologies interact to give a seamless experience.
This talk will walk through the evolution of memory and CPUs leading to the current x86 and ARM landscape; be warned though, there be dragons...

GeekMasher

March 02, 2016


Transcript

  1. Who are we?? Pablo Crossa - 4th year EH student

    - Programmer - Low-level Junkie - Ranter of all-the-things Mathew Payne (@GeekMasher) - 4th year EH student - Programmer - I can Powerpoint - Listens to this Ranter 2
  2. Agenda - History - MMU - CPU Cache - VT-x

    / AMD-V - VT-d / AMD-Vi / IOMMU - NX / XN / XD / EVP - ASLR Where the rant will take us 3
  3. Why was it introduced? 6 - CPUs needed to store

    temporary data - Volatile memory was born - Directly accessible by the CPU - Faster than secondary storage - Lost on shutdown https://en.wikipedia.org/wiki/Computer_memory
  4. Memory Technologies - Unreliable types - “store bits of information

    in the form of sound waves propagating through mercury” - Magnetic-core memory - Semiconductor memory - Current memory type 7 https://en.wikipedia.org/wiki/Computer_memory
  5. Semiconductor memory types DRAM - Each bit is stored in

    a capacitor - These slowly discharge - All values need to be read and re-written periodically (called refreshing) - Slow but cheap - Denser due to component count (one capacitor and one transistor per bit) SRAM - Each bit is stored in a flip-flop circuit - No refresh needed - Fast but expensive 8 https://en.wikipedia.org/wiki/Computer_memory
  6. Introduction - An OS needed a way to: - Protect

    from buggy or malicious software - Isolate applications - Isolate the kernel - Reduce memory fragmentation - Memory Management Unit - The MMU was introduced, which: - Included memory protection - Supported per-process address mapping - Supported paging 10 https://en.wikipedia.org/wiki/Memory_management_unit
  7. Overview - The virtual address space is divided into “pages”

    - Usually a few kilobytes in size - Can be much larger (1GiB on x64) - These pages can have different access permissions - Reading, writing and/or executing, or none - Several virtual pages can map to the same physical address - The same virtual page can also map to different physical addresses over time 11 https://en.wikipedia.org/wiki/Memory_management_unit
  8. Paging - When low on memory, OS can mark physical

    pages as unassigned - When such a page is accessed, another active page is written out to secondary storage and its frame is allocated to this section - When the ‘swapped out’ page is accessed, another page is stored in the ‘paging file’ (‘PAGEFILE.SYS’ on Windows, ‘swap’ on Unix) and the old one is loaded in its place 12 http://wiki.osdev.org/Paging
  9. Page sizes - x86 - x86: - 4KiB - 4MiB

    - Since Pentium - 2MiB - PAE enabled - 36bit physical address - All can coexist within the same process 14 https://en.wikipedia.org/wiki/Memory_management_unit#IA-32_.2F_x86
  10. Page sizes - x64 - x64: - 4KiB - 2MiB

    - 1GiB - Huge pages - All can coexist within the same process 15 https://en.wikipedia.org/wiki/Memory_management_unit#x86-64
  11. Page sizes - ARM - ARM - 4KiB - 64KiB

    - 1MiB - 16MiB - (Legacy: 1KiB Tiny pages) - All can coexist within the same process 16 https://en.wikipedia.org/wiki/Memory_management_unit#ARM
  12. Original x86 MMU design - 4KiB pages only - 2

    table types - 2 stage translation - Page Directory - Page Table - Page directory contains 1024 x 32bit entries, each pointing to a page table (bits 31-12 address) - Page table contains 1024 x 32bit entries, each pointing to a 4KiB page in memory (bits 31-12 address) - 1024 * 1024 * 4KiB = 4GiB address space 18 http://wiki.osdev.org/Paging
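The two-stage split above is plain bit arithmetic; a minimal sketch (the function name is mine, not from the deck):

```python
# Split a 32-bit virtual address into the original x86 MMU's two translation
# indices: 10 bits of page-directory index, 10 bits of page-table index, and
# a 12-bit offset into the 4KiB page.

def split_x86_address(vaddr):
    offset = vaddr & 0xFFF            # bits 11-0: offset within the 4KiB page
    pt_index = (vaddr >> 12) & 0x3FF  # bits 21-12: page-table entry (0-1023)
    pd_index = (vaddr >> 22) & 0x3FF  # bits 31-22: page-directory entry (0-1023)
    return pd_index, pt_index, offset

# 1024 page tables x 1024 entries x 4KiB pages = the full 4GiB address space
assert 1024 * 1024 * 4096 == 2**32
```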
  13. Like all x86, it got extended... - 4MiB pages -

    Extend page directory - 1 stage translation - Setting S to 1 will mark the entry as 4MiB - bits 21 through 12 Reserved (bits 31-22 address) 19 http://wiki.osdev.org/Paging
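The size-bit decode can be sketched as follows; the S bit the slide mentions is the page-size (PS) bit, bit 7 of the directory entry, and the helper name is mine:

```python
# Decode an extended page-directory entry: with the size bit set, the entry
# maps a 4MiB page directly (address in bits 31-22); otherwise it points to a
# page table (address in bits 31-12).

PS_BIT = 1 << 7  # the page-size ("S") bit of the directory entry

def decode_pde(entry):
    if entry & PS_BIT:
        return ("4MiB page", entry & 0xFFC00000)  # bits 31-22
    return ("page table", entry & 0xFFFFF000)     # bits 31-12
```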
  14. And extended... - PAE - 4KiB and 2MiB pages -

    New table with 4 x 64bit entries, each pointing to a page directory - one more translation stage - Page directory with 512 x 64bit entries, each pointing to a 2MiB page or a page table - Page table with 512 x 64bit entries, each pointing to a 4KiB page 20 http://wiki.osdev.org/Paging
  15. And extended some more... - Long mode (x64) - 4KiB

    pages - 4 stage translation - 2MiB pages - 3 stage translation - 1GiB pages - 2 stage translation 22 http://wiki.osdev.org/Page_Tables#Long_mode_.2864-bit.29_page_map
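The 4-stage long-mode walk uses four 9-bit indices (512 entries per table) plus a 12-bit page offset; a sketch using the usual level names (PML4, PDPT, PD, PT), which are not from the deck:

```python
# Split a 48-bit x64 virtual address into the four 9-bit table indices and
# the 12-bit page offset used by the 4KiB 4-stage translation. For 2MiB pages
# the walk stops one level earlier (3 stages); for 1GiB, two levels earlier.

def split_x64_address(vaddr):
    offset = vaddr & 0xFFF        # bits 11-0
    pt   = (vaddr >> 12) & 0x1FF  # bits 20-12: page table
    pd   = (vaddr >> 21) & 0x1FF  # bits 29-21: page directory
    pdpt = (vaddr >> 30) & 0x1FF  # bits 38-30: page-directory-pointer table
    pml4 = (vaddr >> 39) & 0x1FF  # bits 47-39: page-map level 4
    return pml4, pdpt, pd, pt, offset
```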
  16. MMU lookup 27 - An x64 4KiB page read should

    need AT LEAST 5 memory accesses for 1 read (4 to translate, one to read) - But does it? http://wiki.osdev.org/Page_Tables#Long_mode_.2864-bit.29_page_map
  17. CPUs got too good - Memory was lagging behind CPUs

    - Latency too high, speed too low - Cache was widespread since the 1980s - Now cache is hierarchical 29 http://www.extremetech.com/extreme/188776-how-l1-and-l2-cpu-caches-work-and-why-theyre-an-essential-part-of-modern-chips
  18. CPU Caches - Present in 3 or 4 levels (L1

    to L4) - If there was a cache miss (not a cache hit) proceed to the next level - Different types and specialisation at each level 30 http://www.extremetech.com/extreme/188776-how-l1-and-l2-cpu-caches-work-and-why-theyre-an-essential-part-of-modern-chips
  19. Cache types - Direct mapped - Any address can only

    exist in one cache location - Fully associative - Any address can exist in any cache location - Every cache location must be checked - N-way set associative - In between - In a 2-way set associative cache, any address can exist in 2 cache locations - Duplicated circuitry (e.g. an 8-way set associative cache has 8 parallel readers for each possible location) 31 https://en.wikipedia.org/wiki/CPU_cache#Associativity
  20. Cache index fetch - All caches use some form

    of modulo operation - A direct-mapped 256-byte cache will do: index = [ address % 256 ] - Lowest hit rate (addresses 0 and 256, among many others, can’t coexist) - More complex schemes perform more calculations - A 2-way set associative 256-byte cache will do: index[0] = [ address % 128 ] * 2 index[1] = index[0] + 1 - From 4-way upward, hit rates become very high 32 https://en.wikipedia.org/wiki/CPU_cache#Associativity
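The two index formulas above, written out in code (using the slide's simplified byte-granularity model rather than real cache lines; function names are mine):

```python
# Index calculations for a 256-byte cache, following the slide's formulas.

def direct_mapped_index(address, cache_size=256):
    return address % cache_size               # one possible location

def two_way_indices(address, cache_size=256):
    base = (address % (cache_size // 2)) * 2  # index[0] = [ address % 128 ] * 2
    return [base, base + 1]                   # index[1] = index[0] + 1

# Addresses 0 and 256 collide in the direct-mapped case (same index)
assert direct_mapped_index(0) == direct_mapped_index(256)
```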
  21. Cache operation - Convert an address to a cache index

    - Relevant entries have a tag that verifies the address corresponds to the cache entry - If there is a miss, traverse to the next level 33 https://en.wikipedia.org/wiki/CPU_cache#Address_translation
  22. Cache index and tags - Physically indexed, Physically tagged -

    Needs to go through the MMU before the lookup can even begin - Virtually indexed, Virtually tagged - Very fast but flushed on every context switch - If two virtual addresses point to the same physical address (or vice-versa) it’s very difficult to ensure coherency - Virtually indexed, Physically tagged - Index computation and fetch can begin alongside MMU translation - No coherency issues - Physically indexed, Virtually tagged - ... 34 https://en.wikipedia.org/wiki/CPU_cache#Address_translation
  23. Example Entry 36 - index: 0xF0 - Tag = 0x456789F0

    - Data = 0xCAFEBABE - A 64-bit entry holding 32-bit data http://www.7-cpu.com/cpu/Skylake.html
  24. Example VIPT lookup - 256 byte direct mapped cache, 32bit

    data read - Virtual address: 0x801234F0 - MMU lookup starts & Cache lookup starts - index = [ 0x801234F0 % 256 ] = 0xF0 - tag = 0x456789F0 - MMU returns address 0x456789F0 - Cache entry valid, data is read - 0xCAFEBABE 37 https://en.wikipedia.org/wiki/Translation_lookaside_buffer#Overview
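The worked example can be replayed in code; the MMU here is just a stub lookup table standing in for a real page walk, and all names are mine:

```python
# Replay of the VIPT example: a 256-byte direct-mapped cache, indexed by the
# virtual address and tagged with the physical address.

CACHE = {0xF0: {"tag": 0x456789F0, "data": 0xCAFEBABE}}  # one valid entry
MMU = {0x801234F0: 0x456789F0}                           # stub virtual->physical map

def vipt_read(vaddr):
    index = vaddr % 256   # cache lookup can start before translation finishes
    paddr = MMU[vaddr]    # on real hardware the MMU lookup runs in parallel
    entry = CACHE.get(index)
    if entry is not None and entry["tag"] == paddr:  # physical tag confirms the hit
        return entry["data"]
    return None           # miss: traverse to the next cache level
```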
  25. L1 - Priority is speed - Very small - Separate

    data and instruction - i7-6700: 64KiB 8-way set associative - VIPT - 32KiB instruction: 4 cycle latency - Split into 2 x 16KiB caches if HT - 32KiB data: 5 cycle latency (for complex addresses) - Memory can be ~120 cycle latency 39 http://www.7-cpu.com/cpu/Skylake.html
  26. L2 - Larger than L1 - Unified data and instruction

    - i7-6700: 256KiB 4-way set associative, 12 cycle latency - PIPT 40 http://www.7-cpu.com/cpu/Skylake.html
  27. L3 - Larger than L2 - Unified data and instruction

    - Shared on all cores - i7-6700: 8MiB 16-way set associative, 42 cycle latency - PIPT - 2MiB per core, 1MiB per thread (HT enabled) 41 http://www.7-cpu.com/cpu/Skylake.html
  28. L4 - Larger than L3 - Unified data and instruction

    - Off silicon, but not off chip - Victim cache for L3 (eviction) - Haswell with Iris Pro GT3e: 128MiB - PIPT - IGP dedicated memory - ~4x speed of memory - 61 cycle latency 42 http://www.extremetech.com/extreme/188776-how-l1-and-l2-cpu-caches-work-and-why-theyre-an-essential-part-of-modern-chips
  29. MMU caches - TLBs - Intel Nehalem virtual translation TLBs:

    - L1 DTLB - L1 ITLB - Split into 2 x ITLB if HT - L2 TLB - Unified - First type of CPU cache was an MMU TLB 43 https://en.wikipedia.org/wiki/Translation_lookaside_buffer#Multiple_TLBs
  30. Before the Introduction of VT-x - First implementation of virtualisation

    was done in software - Very slow - Software emulated instruction decoder and execution - Software emulated MMU - Limited access to hardware - Similar to running an application on the host OS - Runs in Ring 3 - Apart from VirtualBox 46 https://en.wikipedia.org/wiki/X86_virtualization#Software-based_virtualization
  31. Introduction of VT-x - Sped up virtualisation - By running

    the virtualised instructions through the native decoder and executing them natively - Requires BIOS and CPU support - Still had MMU limitations - The MMU was still emulated in software - Until… 47 https://en.wikipedia.org/wiki/X86_virtualization#Intel_virtualization_.28VT-x.29
  32. SLAT (RVI / EPT) Support … Another MMU extension -

    Second Level Address Translation - Rapid Virtualization Indexing / Extended Page Tables - Allows for page-table virtualisation - Avoids the overhead associated with software-managed shadow page tables - Uses the native MMU, not a virtualised one in software - An x64 4KiB page read with SLAT can need AT LEAST 6 (!) memory accesses... 48 https://en.wikipedia.org/wiki/Second_Level_Address_Translation
  33. ARM Virtualisation Extensions - The first draft of the ARM

    virtualisation extensions include all the above technologies - Virtualised execution - SLAT - The Raspberry Pi 2 CPU supports virtualisation - play around with KVM 49 http://genode.org/documentation/articles/arm_virtualization
  34. What is Directed I/O ?? - Used for I/O devices

    VM passthrough - PCI & PCI Express devices - Network, GPU, sound, etc. cards - Secures memory isolation - Devices cannot access memory outside of the range allocated to them, enforced at the hardware level - Supports DMA remapping - Address translations for device DMA data transfers - Interrupt remapping - Provides VM routing and isolation of device interrupts 51 http://pix.toile-libre.org/upload/original/1353272461.png https://software.intel.com/en-us/blogs/2009/06/25/understanding-vt-d-intel-virtualization-technology-for-directed-io https://www-ssl.intel.com/content/dam/www/public/us/en/documents/product-specifications/vt-directed-io-spec.pdf
  35. Support & Implementations - VT-d requires chipset, CPU and BIOS

    support - All of these requirements must be met - Unlike VT-x, which does not require chipset support - Many devices claim to support this functionality - Implementations often ship broken from the beginning, or are broken by updates - Typically BIOS updates 52 http://wiki.xen.org/wiki/VTd_HowTo
  36. ARM Virtualisation Extensions - The first draft of the ARM

    virtualisation extensions include the above technology - IOMMU - Very powerful technologies introduced simultaneously 53 http://genode.org/documentation/articles/arm_virtualization
  37. The non-executable bit - Sets memory page to non-executable -

    if set to 1: - code cannot be executed from the page, which is assumed to hold data - else: - executable - This bit is set in the page table - The NX bit is bit 63 of the page entry - Stops code from being executed in data storage areas - For example: on the stack where variables are stored 59 https://en.wikipedia.org/wiki/NX_bit
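Bit 63 of a 64-bit page entry can be set and tested with plain masks; a sketch (the entry value is a made-up example, and the helper names are mine):

```python
# The NX bit is bit 63 of the 64-bit page entry: once set, the page can no
# longer be executed, only read or written.

NX_BIT = 1 << 63

def set_nx(entry):
    return entry | NX_BIT  # mark the page non-executable

def is_executable(entry):
    return (entry & NX_BIT) == 0

entry = 0x0000000000123067  # hypothetical page entry with some low flag bits set
assert is_executable(entry)
assert not is_executable(set_nx(entry))
```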
  38. DEP - Data Execution Prevention - Used on Windows (with

    OS X and iOS implementations) - DEP is an abstraction over all the different non-executable bits available - If the x86 processor supports NX, the system's BIOS supports it and it has been enabled - Then the DEP features are turned on in the OS - Also uses SafeSEH, which does not need the NX bit 61 https://en.wikipedia.org/wiki/Data_Execution_Prevention
  39. Software Implementations - For historical purposes - This is a

    lot slower - Projects: - Exec Shield and W^X - Emulate an NX bit on x86 CPUs that lack a native NX bit - It’s all done in hardware now - Only used on systems without native NX support 62 https://en.wikipedia.org/wiki/W%5EX http://web.archive.org/web/20050512030425/http://www.redhat.com/f/pdf/rhel/WHP0006US_Execshield.pdf
  40. ARM XN - Introduced on ARMv6, supported on all major

    platforms 63 http://infocenter.arm.com/help/topic/com.arm.doc.ddi0360f/CACHFICI.html
  41. Before Non-Executable Added - Buffer overflow exploits - One of

    the most basic attacks - Where an attacker’s code can be placed on the stack and then executed - Once non-executable support was implemented, the MMU stopped memory blocks with the NX bit set from being executed - It does this by throwing a page fault exception 65 https://en.wikipedia.org/wiki/Buffer_overflow#Stack-based_exploitation
  42. After Non-Executable Added - Return-Oriented Programming (ROP) Chains - Allows

    an attacker to fill the stack with jumps to code sections already present in memory - Can modify the NX bit of the page(s) where the attacker’s code is stored - This allows the code to be executed without throwing a page fault exception - Requires libraries and executables to be compiled with absolute base addresses - Until… 66 https://en.wikipedia.org/wiki/Return-oriented_programming
  43. ASLR - Address Space Layout Randomisation - Places executables and

    associated libraries in random locations in virtual memory space - Windows flags: DYNAMICBASE / HIGHENTROPYVA - Linux flags: PT_INTERP (kinda) - Makes ROP chains harder but not impossible - K(ernel)ASLR - Used to randomise the kernel’s location to hinder kernel-based exploits 69 https://en.wikipedia.org/wiki/Address_space_layout_randomization
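One crude way to observe ASLR from user space (assuming CPython on an OS with ASLR enabled): spawn two interpreters, print the address of a fresh heap object in each, and compare. With ASLR on, the heap base, and so the printed addresses, will usually differ between runs.

```python
# Print the address of a newly allocated object in two separate interpreter
# processes; under ASLR the two addresses usually differ.
import subprocess
import sys

def heap_address():
    out = subprocess.run(
        [sys.executable, "-c", "print(id(bytearray(1)))"],
        capture_output=True, text=True, check=True)
    return int(out.stdout)

a, b = heap_address(), heap_address()
print(hex(a), hex(b))  # typically two different addresses when ASLR is on
```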
  44. PIC & PIE - Position-Independent Code/Executable - Used primarily for

    libraries - Allows them to move dynamically per application - Can be executed at any memory address without modification - Differs from relocatable code (load time patching) 70 http://www.iecc.com/linker/linker08.html
  45. Testing the entropy 72 - Windows 8.1, 64 bit (VM): ~3,000

    results - Debian Linux, 64 bit (native): 10,000 results