Slide 1

Slide 1 text

Secrets Behind HelloWorld.exe —— Compilation & Linking & Execution —— Shao-Chung Chen Trend Micro # 2015/11/03

Slide 2

Slide 2 text

Goal
1. Write better programs
2. Build large programs
3. Avoid dangerous programming errors
4. Understand how language rules are implemented
5. Understand important system concepts

Slide 3

Slide 3 text

Topics
1. Compilation
2. Static linking
3. Dynamic linking
4. Process launch

[Note] Unless stated otherwise, all experiments are conducted with Visual Studio 2012 on Windows 7 SP1 x86.

Slide 4

Slide 4 text

    #include <stdio.h>

    int main(int argc, char* argv[])
    {
        char const greeting[] = "Hello, world!";
        printf("%s\n", greeting);
        return 0;
    }

See anything?

Slide 5

Slide 5 text

    #include <stdio.h>

    int main(int argc, char* argv[])
    {
        char const greeting[] = "Hello, world!";
        printf("%s\n", greeting);
        return 0;
    }

Huh?

Slide 6

Slide 6 text

    #include <stdio.h>

    int main(int argc, char* argv[])
    {
        char const greeting[] = "Hello, world!";
        printf("%s\n", greeting);
        return 0;
    }

So many questions!
- What is stdio.h? What is #include? Why?
- Who invokes main()? Where are the arguments from?
- When is const-ness checked?
- Where and how is printf() implemented?
- Why does main() return an int? What happens after it returns?
- How does printf() format the output string? How does it take variable-length arguments?
- Where is the string literal stored? Can we modify the content?
- Are char[] and char* different?
- What is the value of greeting? When is the value assigned? Is it always the same value?

Slide 7

Slide 7 text

Lifecycle Overview of Windows Executables

Write -> Hello.c
Preprocess (with header files, e.g. stdio.h) -> Hello.i
Compile -> Hello.asm
Assemble -> Hello.obj
Link (with static libraries, e.g. msvcrt.lib) -> Hello.exe
Execute (with dynamic libraries: ntdll.dll, kernel32.dll, msvcr110.dll) -> Hello, World!

Slide 8

Slide 8 text

Compilation

Slide 9

Slide 9 text

Preprocessing

> CL.exe /P Hello.c

Hello.c:

    #include <stdio.h> // Hello!
    int main(int argc, char* argv[])
    {
        char const greeting[] = "Hello, world!";
        printf("%s\n", greeting);
        return 0;
    }

Hello.i:

    #line 1 "Hello.c"
    #line 1 "C:\\Program Files\\Microsoft Visual Studio 11.0\\VC\\INCLUDE\\stdio.h"
    #pragma once
    // …5700 lines removed…
    #line 271 "C:\\Program Files\\Microsoft Visual Studio 11.0\\VC\\INCLUDE\\stdio.h"
    int __cdecl printf( const char * _Format, ...);
    // …600 lines removed…
    #line 2 "Hello.c"
    int main(int argc, char* argv[])
    {
        char const greeting[] = "Hello, world!";
        printf("%s\n", greeting);
        return 0;
    }

1. Apply #defines, and expand all macros
2. Handle all conditional compilation directives (e.g., #if)
3. Insert all #include'd files in place, recursively
4. Remove all comments (i.e., //… and /*…*/)
5. Insert line numbers and file paths for error handling
6. Preserve all #pragma directives for the compiler

Slide 10

Slide 10 text

Compilation

> CL.exe /MD /Fa /Tc Hello.i

Hello.i:

    // …
    #line 2 "Hello.c"
    int main(int argc, char* argv[])
    {
        // …
    }

Hello.asm:

            .686P
            INCLUDELIB MSVCRT
            INCLUDELIB OLDNAMES
    _DATA   SEGMENT
    $SG2940 DB      'Hello, world!', 00H
    $SG2941 DB      '%s', 0aH, 00H
    _DATA   ENDS
    PUBLIC  _main
    EXTERN  _printf:PROC
    _TEXT   SEGMENT
    _main   PROC
            push    ebp
            mov     ebp, esp
            push    OFFSET $SG2940
            push    OFFSET $SG2941
            call    _printf
            mov     esp, ebp
            pop     ebp
            ret     0
    _main   ENDP
    _TEXT   ENDS
    END

Compiler pipeline:

    Source Code (Hello.i)
      -> Lexical Analysis -> Tokens
      -> Syntactic Analysis -> Abstract Syntax Tree
      -> Semantic Analysis -> Tagged Abstract Syntax Tree
      -> IR Code Gen. -> Intermediate Representation
      -> IR Optimization -> Optimized I.R.
      -> Target Code Generation -> Target Code
      -> Target Code Optimization -> Optimized Target Code (Hello.asm)

Slide 11

Slide 11 text

Compilation: Lexical Analysis

Character Stream -> Token Stream

Stmt.c:

    // …
    array[index] = (index + 4) * (2 + 6);
    // …

    Token   Type
    array   Identifier
    [       TK_LBracket
    index   Identifier
    ]       TK_RBracket
    =       TK_Assign
    (       TK_LParen
    index   Identifier
    +       TK_Plus
    4       Number
    )       TK_RParen
    *       TK_Asterisk
    (       TK_LParen
    2       Number
    +       TK_Plus
    6       Number
    )       TK_RParen
    ;       TK_Semicolon

Application: Syntax highlighting.

Slide 12

Slide 12 text

Compilation: Syntactic Analysis

Token Stream -> Abstract Syntax Tree

    Token   Type
    array   Identifier
    [       TK_LBracket
    index   Identifier
    ]       TK_RBracket
    =       TK_Assign
    (       TK_LParen
    index   Identifier
    +       TK_Plus
    4       Number
    )       TK_RParen
    *       TK_Asterisk
    (       TK_LParen
    2       Number
    +       TK_Plus
    6       Number
    )       TK_RParen
    ;       TK_Semicolon

Abstract syntax tree:

    Assignment =
        Subscription []
            Identifier array
            Identifier index
        Multiplication *
            Addition +
                Identifier index
                Number 4
            Addition +
                Number 2
                Number 6

Application: Source code formatting.

Slide 13

Slide 13 text

Compilation: Semantic Analysis

Abstract Syntax Tree -> Tagged AST

    Assignment = (int)
        Subscription [] (int)
            Identifier array (int[])
            Identifier index (int)
        Multiplication * (int)
            Addition + (int)
                Identifier index (int)
                Number 4 (int)
            Addition + (int)
                Number 2 (int)
                Number 6 (int)

1. Check for static semantics (e.g., type information)
2. Insert nodes for implicit type conversion

Application: IntelliSense.

Slide 14

Slide 14 text

Compilation: IR Code Generation

Tagged AST -> Intermediate Representation

This experiment is conducted with LLVM-3.4 on Ubuntu 14.04.3 amd64.

Simple.c:

    int main()
    {
        int a = 55, b = 66;
        return (a + b);
    }

$ clang -cc1 -ast-dump Simple.c

Hello.ast:

    TranslationUnitDecl 0x272b990 <>
    |-TypedefDecl 0x272be90 <> __int128_t '__int128'
    |-TypedefDecl 0x272bef0 <> __uint128_t 'unsigned __int128'
    |-TypedefDecl 0x272c240 <> __builtin_va_list '__va_list_tag [1]'
    `-FunctionDecl 0x272c2e0 main 'int ()'
      `-CompoundStmt 0x272c5b0
        |-DeclStmt 0x272c4b0
        | |-VarDecl 0x272c390 a 'int'
        | | `-IntegerLiteral 0x272c3e8 'int' 55
        | `-VarDecl 0x272c420 b 'int'
        |   `-IntegerLiteral 0x272c478 'int' 66
        `-ReturnStmt 0x272c590
          `-ParenExpr 0x272c570 'int'
            `-BinaryOperator 0x272c548 'int' '+'
              |-ImplicitCastExpr 0x272c518 'int'
              | `-DeclRefExpr 0x272c4c8 'int' lvalue Var 0x272c390 'a' 'int'
              `-ImplicitCastExpr 0x272c530 'int'
                `-DeclRefExpr 0x272c4f0 'int' lvalue Var 0x272c420 'b' 'int'

$ clang -S -emit-llvm Simple.c

Simple.ll:

    ; ModuleID = 'Simple.c'
    target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
    target triple = "x86_64-pc-linux-gnu"

    ; Function Attrs: nounwind
    define i32 @main() #0 {
      %1 = alloca i32, align 4
      %a = alloca i32, align 4
      %b = alloca i32, align 4
      store i32 0, i32* %1
      store i32 55, i32* %a, align 4
      store i32 66, i32* %b, align 4
      %2 = load i32* %a, align 4
      %3 = load i32* %b, align 4
      %4 = add nsw i32 %2, %3
      ret i32 %4
    }

    attributes #0 = { nounwind "less-precise-fpmad"="false" "no-frame-pointer-elim"="false" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "no-realign-stack" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }

    !llvm.ident = !{!0}

Slide 15

Slide 15 text

Who's next?

    Hello.c -> … -> Intermediate Representation
      -> IR Optimization -> Optimized I.R.
      -> Target Code Generation -> Target Code
      -> Target Code Optimization -> Optimized Target Code (Hello.asm)
      -> …

Slide 16

Slide 16 text

Assembly

Target Code -> PE Object File

Hello.asm:

            .686P
            INCLUDELIB MSVCRT
            INCLUDELIB OLDNAMES
    _DATA   SEGMENT
    $SG2940 DB      'Hello, world!', 00H
    $SG2941 DB      '%s', 0aH, 00H
    _DATA   ENDS
    PUBLIC  _main
    EXTERN  _printf:PROC
    _TEXT   SEGMENT
    _main   PROC
            push    ebp
            mov     ebp, esp
            push    OFFSET $SG2940
            push    OFFSET $SG2941
            call    _printf
            mov     esp, ebp
            pop     ebp
            ret     0
    _main   ENDP
    _TEXT   ENDS
    END

Hello.obj: ?

Slide 17

Slide 17 text

Segmentation

Executable Image File (Hello.exe, on disk storage): MZ Header, PE Header, …, .text, .data, .idata, …

Process Virtual Memory Space: reserved, …, .text, .data, heap, stack, …, kernel space

Physical Memory Pages: Page #1, Page #2, Page #3, Page #4, …, Page #220, …

Slide 18

Slide 18 text

Assembly

Target Code -> PE Object File

Hello.asm:

            .686P
            INCLUDELIB MSVCRT
            INCLUDELIB OLDNAMES
    _DATA   SEGMENT
    $SG2940 DB      'Hello, world!', 00H
    $SG2941 DB      '%s', 0aH, 00H
    _DATA   ENDS
    PUBLIC  _main
    EXTERN  _printf:PROC
    _TEXT   SEGMENT
    _main   PROC
            push    ebp
            mov     ebp, esp
            push    OFFSET $SG2940
            push    OFFSET $SG2941
            call    _printf
            mov     esp, ebp
            pop     ebp
            ret     0
    _main   ENDP
    _TEXT   ENDS
    END

Hello.obj (layout: COFF Header, .text, .data, …, .drectve, COFF Symtab):

    [.text]
    _main:
      00000000: 55                 push  ebp
      00000001: 8B EC              mov   ebp,esp
      00000003: 68 00 00 00 00     push  offset $SG2940
      00000008: 68 00 00 00 00     push  offset $SG2941
      0000000D: E8 00 00 00 00     call  _printf
      00000012: 8B E5              mov   esp,ebp
      00000014: 5D                 pop   ebp
      00000015: C3                 ret

    Relocations:
      Offset    Type   Applied To  Symbol Index  Symbol Name
      --------  -----  ----------  ------------  -----------
      00000004  DIR32  00000000    B             $SG2940
      00000009  DIR32  00000000    C             $SG2941
      0000000E  REL32  00000000    A             _printf

    [.data]
      00000000: 4865 6C6C 6F2C 2077 6F72 6C64 2100 2573  Hello, world!.%s
      00000010: 0A00                                     ..

    [.debug$S]
      …

    [.drectve]
      /DEFAULTLIB:MSVCRT /DEFAULTLIB:OLDNAMES

    [COFF SYMBOL TABLE]
      00000000 SECT1 notype Static   | .data
      00000000 SECT2 notype Static   | .text
      00000000 SECT3 notype Static   | .debug$S
      00000000 SECT4 notype Static   | .drectve
      00000000 UNDEF notype External | _printf
      00000000 SECT1 notype Static   | $SG2940
      0000000E SECT1 notype Static   | $SG2941

Slide 19

Slide 19 text

Static Linking

Slide 20

Slide 20 text

Static Linking
1. Address & storage allocation
2. Symbol resolution
3. Relocation

    Relocatable Object File A (a.obj): COFF Header, .text, .data, …, .drectve, COFF Symtab
  + Relocatable Object File B (b.obj): COFF Header, .text, .data, …, .drectve, COFF Symtab
  + Relocatable Object File C (c.obj): COFF Header, .text, .data, …, .drectve, COFF Symtab
  = Executable Object File (Hello.exe): MZ Header, PE Header, …, .text, .data, .idata, …

In contrast to the compiler and assembler, linkers should have minimal understanding of the target machine.

Slide 21

Slide 21 text

Symbols
1. Global symbols: defined in this module, can be referenced by other modules
2. External global symbols: referenced in this module, defined by some other module
3. Local symbols: defined and referenced exclusively within this module

a.c:

    extern int shared;       /* External global */
    int add(int, int);
    int a = 0x689;           /* Global */
    int main()               /* Global */
    {
        return add(a, shared);
    }

b.c:

    int shared = 0x92;       /* Global */
    static int private;      /* Local */
    int add(int x, int y)    /* Global */
    {
        return x + y;
    }

Slide 22

Slide 22 text

a.c:

    extern int shared;
    int add(int, int);
    int a = 0x689;
    int main()
    {
        return add(a, shared);
    }

a.obj:

    [.text]
    _main:
      00000000: 55                 push  ebp
      00000001: 8B EC              mov   ebp,esp
      00000003: A1 00 00 00 00     mov   eax,dword ptr [_shared]
      00000008: 50                 push  eax
      00000009: 8B 0D 00 00 00 00  mov   ecx,dword ptr [_a]
      0000000F: 51                 push  ecx
      00000010: E8 00 00 00 00     call  _add
      00000015: 83 C4 08           add   esp,8
      00000018: 5D                 pop   ebp
      00000019: C3                 ret

    Relocations:
      Offset    Type   Applied To  Symbol Index  Symbol Name
      --------  -----  ----------  ------------  -----------
      00000004  DIR32  00000000    D             _shared
      0000000B  DIR32  00000000    8             _a
      00000011  REL32  00000000    B             _add

    [.data]
      00000000: 89 06 00 00

    [COFF SYMBOL TABLE]
      02 00000000 SECT1 notype    Static   | .drectve
      06 00000000 SECT3 notype    Static   | .data
      08 00000000 SECT3 notype    External | _a
      09 00000000 SECT4 notype    Static   | .text
      0B 00000000 UNDEF notype () External | _add
      0C 00000000 SECT4 notype () External | _main
      0D 00000000 UNDEF notype    External | _shared

b.c:

    int shared = 0x92;
    static int private;
    int add(int x, int y)
    {
        return x + y;
    }

b.obj:

    [.text]
    _add:
      00000000: 55                 push  ebp
      00000001: 8B EC              mov   ebp,esp
      00000003: 8B 45 08           mov   eax,dword ptr [ebp+8]
      00000006: 8B 00              mov   eax,dword ptr [eax]
      00000008: 8B 4D 0C           mov   ecx,dword ptr [ebp+0Ch]
      0000000B: 03 01              add   eax,dword ptr [ecx]
      0000000D: 5D                 pop   ebp
      0000000E: C3                 ret

    [.data]
      00000000: 92 00 00 00

    [COFF SYMBOL TABLE]
      02 00000000 SECT1 notype    Static   | .drectve
      04 00000000 SECT2 notype    Static   | .debug$S
      06 00000000 SECT3 notype    Static   | .data
      08 00000000 SECT3 notype    External | _shared
      09 00000000 SECT4 notype    Static   | .text
      0B 00000000 SECT4 notype () External | _add

Slide 23

Slide 23 text

a.obj + b.obj -> CL.exe -> ab.exe

a.c:

    extern int shared;
    int add(int, int);
    int a = 0x689;
    int main()
    {
        return add(a, shared);
    }

a.obj:

    [.text]
    _main:
      00000000: 55                 push  ebp
      00000001: 8B EC              mov   ebp,esp
      00000003: A1 00 00 00 00     mov   eax,dword ptr [_shared]
      00000008: 50                 push  eax
      00000009: 8B 0D 00 00 00 00  mov   ecx,dword ptr [_a]
      0000000F: 51                 push  ecx
      00000010: E8 00 00 00 00     call  _add
      00000015: 83 C4 08           add   esp,8
      00000018: 5D                 pop   ebp
      00000019: C3                 ret

b.c:

    int shared = 0x92;
    int add(int x, int y)
    {
        return x + y;
    }

b.obj:

    [.text]
    _add:
      00000000: 55                 push  ebp
      00000001: 8B EC              mov   ebp,esp
      00000003: 8B 45 08           mov   eax,dword ptr [ebp+8]
      00000006: 8B 00              mov   eax,dword ptr [eax]
      00000008: 8B 4D 0C           mov   ecx,dword ptr [ebp+0Ch]
      0000000B: 03 01              add   eax,dword ptr [ecx]
      0000000D: 5D                 pop   ebp
      0000000E: C3                 ret

ab.exe:

    [OPTIONAL HEADER VALUES]
      …
      400000 image base (00400000 to 00404FFF)
        1000 section alignment
        1298 entry point (00401298)
      …

    [.text : 00401000–00401869]
    _main:
      00401000: 55                 push  ebp
      00401001: 8B EC              mov   ebp,esp
      00401003: A1 04 30 40 00     mov   eax,dword ptr ds:[00403004h]
      00401008: 50                 push  eax
      00401009: 8B 0D 00 30 40 00  mov   ecx,dword ptr ds:[00403000h]
      0040100F: 51                 push  ecx
      00401010: E8 0B 00 00 00     call  00401020
      00401015: 83 C4 08           add   esp,8
      00401018: 5D                 pop   ebp
      00401019: C3                 ret
      …
    _add:
      00401020: 55                 push  ebp
      00401021: 8B EC              mov   ebp,esp
      00401023: 8B 45 08           mov   eax,dword ptr [ebp+8]
      00401026: 8B 00              mov   eax,dword ptr [eax]
      00401028: 8B 4D 0C           mov   ecx,dword ptr [ebp+0Ch]
      0040102B: 03 01              add   eax,dword ptr [ecx]
      0040102D: 5D                 pop   ebp
      0040102E: C3                 ret
      …
      00401298: E8 B4 01 00 00     call  00401451
      0040129D: E9 91 FE FF FF     jmp   00401133
      …

    [.data : 00403000–00403383]
      00403000: 89 06 00 00 92 00 00 00 01 00 00 00 00 00 00 00
      00403010: FE FF FF FF FF FF FF FF 4E E6 40 BB B1 19 BF 44
      00403020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      …

Slide 24

Slide 24 text

Figure: structure of a 32-bit Portable Executable file.
Source: https://en.wikipedia.org/wiki/File:Portable_Executable_32_bit_Structure_in_SVG.svg

Though named "optional", the optional header is actually required :P

Slide 25

Slide 25 text

MSVC Runtime

    Static Library  Dynamic Library  Category                       Compiler Arg.
    libcmt.lib      -                Multi-Thread, Static           /MT
    msvcrt.lib      msvcr110.dll     Multi-Thread, Dynamic          /MD
    libcmtd.lib     -                Multi-Thread, Static, Debug    /MTd
    msvcrtd.lib     msvcr110d.dll    Multi-Thread, Dynamic, Debug   /MDd
    msvcmrt.lib     msvcm110.dll     Managed / Unmanaged Hybrid     /CLR
    msvcurt.lib     msvcm110.dll     Managed                        /CLR:PURE

Call stack while printing (innermost frame first):

    …
    ntdll!NtRequestWaitReplyPort         (Windows)
    kernel32!ConsoleClientCallServer     (Windows)
    kernel32!WriteConsoleInternal        (Windows)
    kernel32!WriteFileImplementation     (Windows)
    Hello!_write_no_lock                 (CRT)
    Hello!_write                         (CRT)
    Hello!_flush                         (CRT)
    Hello!_ftbuf                         (CRT)
    Hello!printf                         (CRT)
    Hello!main                           (Hello.c)
    Hello!__tmainCRTStartup              (CRT)
    kernel32!BaseThreadInitThunk         (Windows)
    ntdll!__RtlUserThreadStart           (Windows)
    ntdll!_RtlUserThreadStart            (Windows)
    …

Slide 26

Slide 26 text

Dynamic Linking

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

Import Address Table (IAT)

Image source: 《程序员的自我修养》 (A Programmer's Self-Cultivation) by 俞甲子 (Yu Jiazi), 石凡 (Shi Fan), and 潘爱民 (Pan Aimin)

Slide 29

Slide 29 text

Dynamic Linking: MSVCR110!printf

Hello.exe process virtual memory space:

    …
    Hello.exe      00fd0000 - 00fd5000
    …
    msvcr110.dll   6af10000 - 6afe2000
    …
    ntdll.dll      76dc0000 - 76f01000
    …

Hello.exe, main:

    00fd1000 55             push  ebp
    00fd1001 8bec           mov   ebp,esp
    00fd1003 680030fd00     push  offset Hello+0x3000 (00fd3000)
    00fd1008 680e30fd00     push  offset Hello+0x300e (00fd300e)
    00fd100d e804000000     call  Hello+0x1016 (00fd1016)      (Relative CALL)
    00fd1012 8be5           mov   esp,ebp
    00fd1014 5d             pop   ebp
    00fd1015 c3             ret
    00fd1016 ff259020fd00   jmp   dword ptr [Hello+0x2090 (00fd2090)]
    …

Before Relocation:

    00fd2090 00002248                                          (RVA)
    00fd2248 20067072 696e7466 00004d53 56435231  .printf..MSVCR1
    00fd2258 31302e64 6c6c00                      10.dll.

After Relocation:

    00fd2090 6af8d1d5

    MSVCR110!printf:
    6af8d1d5 6a0c           push  0Ch
    6af8d1d7 6888d2f86a     push  6af8d288 (offset MSVCR110!_CT??_R0?AV…)
    6af8d1dc e8371af9ff     call  MSVCR110!_SEH_prolog4 (6af1ec18)
    6af8d1e1 33c0           xor   eax,eax

Slide 30

Slide 30 text

Experiment: Break on Load Module

Slide 31

Slide 31 text

Process Launch

Slide 32

Slide 32 text

Double-click Hello.exe

Explorer.exe:

    SHELL32!CExecuteAssociation::Execute
    kernel32!CreateProcessW
    ntdll!NtCreateUserProcess

Windows Kernel:

    nt!ZwOpenFile
    nt!ZwCreateSection
    nt!PspAllocateProcess
    nt!PspAllocateThread
    nt!PspUserThreadStartup
    nt!NtTerminateProcess

Hello.exe:

    ntdll!LoadDll
    ntdll!LdrpProcessStaticImports
    Hello!mainCRTStartup
    Hello!main
    msvcr110d!exit
    kernel32!ExitProcessStub
    ntdll!RtlExitUserProcess

Stages:
- Open & validate executable image file
- Allocate process resources
- Launch loader
- Handle dynamic linking
- Queue user-mode APC
- Initialize C runtime (msvcrtd.lib)
- Hello!main
- Clean up process resources

Slide 33

Slide 33 text

Further Reading

Slide 34

Slide 34 text

Further Reading

• jserv, “How a Compiler Works: GNU Toolchain”, http://www.slideshare.net/jserv/how-a-compiler-works-gnu-toolchain
• “Compiling a Compiler”, https://wiki.sars.tw/doku.php?id=programming:compiling_a_compiler
• Kito Cheng, 《淺談編譯器最佳化技術》 (A Brief Talk on Compiler Optimization Techniques), http://www.slideshare.net/kitocheng/ss-42438227
• MSDN, “Peering Inside the PE: A Tour of the Win32 Portable Executable File Format”, https://msdn.microsoft.com/en-us/library/ms809762.aspx
• Alexander Sotirov, “Tiny PE: Creating the smallest possible PE executable”, http://www.phreedom.org/research/tinype/
• Randal E. Bryant & David R. O'Hallaron, “Computer Systems: A Programmer's Perspective”, http://csapp.cs.cmu.edu/
• 俞甲子, 石凡, 潘愛民, 《程式設計師的自我修養:連結、載入、程式庫》 (A Programmer's Self-Cultivation: Linking, Loading, and Libraries), http://www.tenlong.com.tw/items/9861818286?item_id=53897

Slide 35

Slide 35 text

< EOF >