Overview of binary file formats

A brief description of file formats like ELF using the compilation process of a simple C program to describe what goes on.

Noufal Ibrahim

March 11, 2017

Transcript

  1. Compilation: Blow by Blow

    Noufal Ibrahim, March 11, 2017
  5. Act 1: Introduction

    What happens when you run gcc program.c?
    How is the a.out produced? What does it contain?
    What is a .o file? What is a .so file?
    This presentation should give you a starting point to think about all this, with some digressions and live coding when relevant.
    First we’ll shoot through the process. Then we’ll rewind and go through it step by step.
    A sneak peek: try gcc -v program.c.
  6. Sample file: sample.c

        /* sample.c */
        int main()
        {
            return 7;
        }

    Compiling and running

        gcc -o sample sample.c
        ./sample
        echo $?
        ⇒ 7
  7. The preprocessor

        gcc -E sample.c

        # 1 "sample.c"
        # 1 "<built-in>"
        # 1 "<command-line>"
        # 1 "/usr/include/stdc-predef.h" 1 3 4
        # 1 "<command-line>" 2
        # 1 "sample.c"
        int main()
        {
            return 7;
        }
  9. What does it do?

    Preprocessor directives are handled before the compiler proper sees the code.
    Common uses:
    #define for textual macros
    #include to include other files
    #warning and #error for user info
    #line to change the line number and filename
    __LINE__ and __FILE__ for the current line number and file
    Pragmas: compiler-specific directives, e.g. to use some kinds of instructions.
  10. Round 2: The compiler

    Now we have pure C.
  11. gcc -S -fno-asynchronous-unwind-tables sample.c ⇒ sample.s

    sample.s

            .file   "sample.c"
            .text
            .globl  main
            .type   main, @function
        main:
            pushq   %rbp
            movq    %rsp, %rbp
            movl    $7, %eax
            popq    %rbp
            ret
            .size   main, .-main
            .ident  "GCC: (Debian 4.9.2-10) 4.9.2"
            .section        .note.GNU-stack,"",@progbits
  13. Assemblers Assemble!

    What is -fno-asynchronous-unwind-tables?
    What is main:?
    What are the . statements?
    What is the whole %rbp %rsp trickery?
    A little note on g++.
    We will revisit this whole thing.
  14. as sample.s -o sample.o → sample.o

        file sample.o
        sample.o: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), not stripped
  16. ELF files

    A common format for object code, executables, shared libraries and core dumps.
    Replaces a.out, COFF etc.
    Programs intended to execute directly on a processor.
  17. Structure of ELF files

    Depends on what you’re going to use it for.
    The ELF header (at a fixed position) has a roadmap for the sections.
    Sections hold things like the symbol table, instructions, data etc.
  18. ELF Linking view

        ELF Header
        Program header table (optional)
        Section 1
        ...
        Section n
        Section header table

    The "Section header table" tells you where the various sections are: name, size etc.
  19. ELF executable view

        ELF Header
        Program header table
        Segment 1
        Segment 2
        ...
        Section header table (optional)
  20. The ELF header

    The 64-bit header is 64 bytes long:
    4 bytes magic number
    1 byte to indicate 32/64 bit
    1 byte for endianness
    1 byte for ELF version
    1 byte for OS ABI
    Rest at https://en.wikipedia.org/wiki/Executable_and_Linkable_Forma
    You can actually twiddle this: the 8th byte is the OS ABI. 0x03 is Linux, 0x07 is AIX.
  21. Inspecting the file

    You can dump it with objdump.

        objdump -f sample.o

        sample.o:     file format elf64-x86-64
        architecture: i386:x86-64, flags 0x00000011:
        HAS_RELOC, HAS_SYMS
        start address 0x0000000000000000
  22. Inspecting the file

    Or eu-readelf.

        eu-readelf -h sample.o

        ELF Header:
          Magic:               7f 45 4c 46 02 01 01 03 00 00 00 00 00 00 00 00
          Class:               ELF64
          Data:                2’s complement, little endian
          Ident Version:       1 (current)
          OS/ABI:              Linux
          ABI Version:         0
          Type:                REL (Relocatable file)
          Machine:             AMD x86-64
          Version:             1 (current)
          Entry point address: 0
          ...
  23. Round 4: Linking

    Object files have to be linked so that they can be made executable.
  24. Linking

        ld -dynamic-linker /lib64/ld-linux-x86-64.so.2 \
            /usr/lib/x86_64-linux-gnu/crt1.o \
            /usr/lib/x86_64-linux-gnu/crti.o \
            /usr/lib/x86_64-linux-gnu/crtn.o \
            -o sample sample.o -lc
        ./sample ; echo $?
        7
  25. A recap

        gcc -E sample.c                  not much value for us
        gcc -S sample.c                  sample.c (C) → sample.s (assembly)
        as sample.s -o sample.o          sample.s (assembly) → sample.o (ELF object)
        ld ... -o sample sample.o -lc    sample.o (ELF object) → sample (ELF dynamic executable)
        ./sample                         Run it!
  26. Act 2: Behind the scenes

    We’ll look at each stage in some depth now.
  27. Preprocessor

    Not a very interesting thing.
    Textual substitution (no evaluation)
    #warning and #error
  28. Assembler

    By simply assembling it, you’ll see a lot of extra instructions.

        gcc -S sample1.c

            .file   "sample1.c"
            .text
            .globl  main
            .type   main, @function
        main:
        .LFB0:
            .cfi_startproc
            pushq   %rbp
            .cfi_def_cfa_offset 16
            .cfi_offset 6, -16
            movq    %rsp, %rbp
            .cfi_def_cfa_register 6
            movl    $9, %eax
            popq    %rbp
            .cfi_def_cfa 7, 8
            ret
            .cfi_endproc
        .LFE0:
            .size   main, .-main
            .ident  "GCC: (Debian 4.9.2-10) 4.9.2"
            .section        .note.GNU-stack,"",@progbits
  30. Assembler

    Specifically, lots of .cfi... instructions.
    Lines that start with a . are assembler directives. They tell the assembler how to assemble the file.
    CFI stands for "Call Frame Information". It tells the assembler to add extra information useful for unwinding the call stack.
    We can remove this using -fno-asynchronous-unwind-tables.
    We’ll come back to this when we discuss linking.
  31. Assembler

    Let’s take the small program.

        gcc -S -fno-asynchronous-unwind-tables sample1.c

            .file   "sample1.c"
            .text
            .globl  main
            .type   main, @function
        main:
            pushq   %rbp
            movq    %rsp, %rbp
            movl    $9, %eax
            popq    %rbp
            ret
            .size   main, .-main
            .ident  "GCC: (Debian 4.9.2-10) 4.9.2"
            .section        .note.GNU-stack,"",@progbits
  32. Assembler

    .file tells as that we’re starting a new file.
    .text assembles what follows into the text section (executable code).
    .globl makes the symbol visible to ld (the linker), not just internal.
    .type marks the given symbol as the given type (in this case, a function).
    main: is the label where the main function starts.
    The initial pushq and movq save the current stack base and create a new frame.
    movl puts 9 into eax.
    The stack is restored and the function returns.
    .size defines the size of the main symbol.
    .ident adds some comments on the file.
    .section adds a new section. Here the empty .note.GNU-stack section marks the code as not needing an executable stack, and @progbits declares it as a section that holds data in the file.
  33. Assembler

    Examine the generated .o with nm.

        nm sample1.o
        0000000000000000 T main

    nm prints the symbols in object files: value, type, name.
    The T tells you that the symbol is in the text section. main is the name.
  34. Assembler

    objdump

        objdump -f sample.o

        sample.o:     file format elf64-x86-64
        architecture: i386:x86-64, flags 0x00000010:
        HAS_SYMS
        start address 0x0000000000000000

    Notice the HAS_SYMS. It’s not stripped.
    Also notice the start address (of main).
    If you built it using -g, you can ask for the debugging info.
    objdump -x will give you header information.
  35. Assembler

    eu-readelf is more "user friendly" than objdump.
    -h gives you the file header.
    -S gives you the sections in the file.
    Notice the symbol table. That’s where nm gets its output from.
    We can strip it. After this, nm has nothing to show.
  36. Assembler

    Let’s look at slightly more complex programs.

        int add(int x, int y)
        {
            return x + y;
        }

        int main()
        {
            return add(4, 5);
        }
  37. Assembler

        gcc -S -fno-asynchronous-unwind-tables sample2.c

            .file   "sample2.c"
            .text
            .globl  add
            .type   add, @function
        add:
            pushq   %rbp
            movq    %rsp, %rbp
            movl    %edi, -4(%rbp)
            movl    %esi, -8(%rbp)
            movl    -4(%rbp), %edx
            movl    -8(%rbp), %eax
            addl    %edx, %eax
            popq    %rbp
            ret
            .size   add, .-add
            .globl  main
            .type   main, @function
        main:
            pushq   %rbp
            movq    %rsp, %rbp
            movl    $5, %esi
            movl    $4, %edi
            call    add
            popq    %rbp
            ret
            .size   main, .-main
            .ident  "GCC: (Debian 4.9.2-10) 4.9.2"
            .section        .note.GNU-stack,"",@progbits
  38. Assembler

    Let’s take main. Remember, this is unoptimised.

        main:
            pushq   %rbp
            movq    %rsp, %rbp
            movl    $5, %esi
            movl    $4, %edi
            call    add
            popq    %rbp
            ret

    Loads the two arguments into esi and edi.
    Then calls add.
    The rest is just creating an activation record for main itself.
  39. Assembler

    Now, let’s take add.

        add:
            pushq   %rbp
            movq    %rsp, %rbp
            movl    %edi, -4(%rbp)
            movl    %esi, -8(%rbp)
            movl    -4(%rbp), %edx
            movl    -8(%rbp), %eax
            addl    %edx, %eax
            popq    %rbp
            ret
            .size   add, .-add
            .globl  main
            .type   main, @function
  40. Assembler

    The values in edi and esi are saved onto the stack using movl.
    Then they’re loaded into edx and eax.
    They’re added and the result is stored in eax.
    The stack is restored.
    Since the result is in eax, returning it from main will be fine.
  41. Assembler

    You can also see multiple symbols now.

        nm sample2.o
        0000000000000000 T add
        0000000000000014 T main

    The symbols change if you use g++.
    Notice that all the functions are T. Let’s change that.
  42. Assembler

        cat hello.c
        #include <stdio.h>
        int main()
        {
            return printf("Hello, world\n");
        }

        nm hello.o
        0000000000000000 T main
                         U printf

    Notice the U with printf. This is an undefined symbol.
    It should get resolved when we do linking.
    There are no compilation errors, but linking fails if it can’t be resolved.
    We’ll touch this again when we discuss linking and shared libs.
  43. Assembler

    Suppose we make a small change to the program.

        cat hello1.c
        #include <stdio.h>
        int main()
        {
            printf("Hello, world\n");
            return 0;
        }

        nm hello1.o
        0000000000000000 T main
                         U puts

    Notice how gcc replaces the printf with puts now. Who knew?
  44. Linking

    The .o file is just object code. You can’t run it.
    To do that, you need to link it.
    Take multiple files and convert them into one.
    Let’s link our file.
  45. Linking

        ld -o sample sample.o ; ./sample
        ld: warning: cannot find entry symbol _start; defaulting to 00000000004000b0
        zsh: segmentation fault  ./sample
  47. Linking

    So, what’s the problem?
    Let’s build a conventional one with gcc sample.c -o sample-good.
    Look at the files:

        file sample
        sample: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, not stripped

        file sample-good
        sample-good: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, ...
  48. Linking

    Let’s step back and write a tiny program in direct assembler.
    It simply returns 9 to the OS.
  49. Linking

    Old style

        cat exit.s

            .section .data
            .section .text
            .globl _start
        _start:
            mov $1, %eax
            mov $9, %ebx
            int $0x80
  50. Linking

    Better: the syscall statement

            .section .data
            .section .text
            .globl _start
        _start:
            mov $60, %rax
            mov $0, %rdi
            syscall
  51. Linking

    Either way, we assemble it using

        as -o exit.o exit.s

    and link it using

        ld -o exit exit.o
        ./exit ; echo $?
        3
  53. Linking

    So, what’s the deal?
    In its simplest sense, the .o file we’ve generated above has just a _start symbol.
    When we link, the linker looks for a _start symbol. (Stripping this produces a warning.)
    That’s the piece of code that will run when you run the program.
    It simply sets a few registers and calls an interrupt to exit the program.
    Can’t get much smaller.
  55. Linking

    What’s wrong with our program then?
    Well, for one, we don’t have a _start.
    Heck! I have main. I’ll just use that.

        ld -e main sample.o -o sample ; ./sample
        zsh: segmentation fault  ./sample
  57. Linking

    So, this is not just a pure "no runtime" binary anymore. C has a runtime.
    crt*.o, e.g. /usr/lib/x86_64-linux-gnu/crt1.o, has our _start symbol.
    Stuff happens here before the handover to main.
    Let’s link this too.

        ld /usr/lib/x86_64-linux-gnu/crt1.o sample.o -o sample

        /usr/lib/x86_64-linux-gnu/crt1.o: In function ‘_start’:
        /build/glibc-qK83Be/glibc-2.19/csu/../sysdeps/x86_64/start.S:115: undefined reference to ‘__libc_csu_fini’
        /build/glibc-qK83Be/glibc-2.19/csu/../sysdeps/x86_64/start.S:116: undefined reference to ‘__libc_csu_init’
        /build/glibc-qK83Be/glibc-2.19/csu/../sysdeps/x86_64/start.S:122: undefined reference to ‘__libc_start_main’
  58. Linking

    libc is missing. Let’s add that.

        ld /usr/lib/x86_64-linux-gnu/crt1.o sample.o -o sample -lc

        /usr/lib/x86_64-linux-gnu/libc_nonshared.a(elf-init.oS): In function ‘__libc_csu_init’:
        (.text+0x2f): undefined reference to ‘_init’
  59. Linking

    This is in /usr/lib/x86_64-linux-gnu/crti.o.
    Let’s include that in our linker invocation and try again.

        ld /usr/lib/x86_64-linux-gnu/crt1.o /usr/lib/x86_64-linux-gnu/crti.o sample.o -o sample -lc

    Links but can’t run.

        file sample
        sample: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib/ld64.so.1 ...
  60. Linking

    You can see that the "interpreter" is wrong here, although we’ve made it a dynamic executable.

        ld -dynamic-linker /lib64/ld-linux-x86-64.so.2 /usr/lib/x86_64-linux-gnu/crt1.o /usr/lib/x86_64-linux-gnu/crti.o sample.o -o sample -lc

    Still segfaults.
  61. Linking

    This is actually running but the exit routines are not there yet.
    For that, we need a few more crt* libs.
    This becomes doubly obvious when we use a printf.

        ld -dynamic-linker /lib64/ld-linux-x86-64.so.2 \
            /usr/lib/x86_64-linux-gnu/crt1.o \
            /usr/lib/x86_64-linux-gnu/crti.o \
            /usr/lib/x86_64-linux-gnu/crtn.o \
            -o sample sample.o -lc
  62. Linking

    We could also compile this into a shared library.
    No entry point.
    Just callable functions.
  63. Using shared libraries from other languages

    An example using a C function from Python.