Safe and Efficient Execution of LLVM-based Languages on the Java Virtual Machine
Swiss LLVM Compiler and Code Generation Social, 14 March 2019
Manuel Rigger, Advanced Software Technologies Lab (Zhendong Su), @RiggerManuel
C/C++ is Responsible for Dangerous Vulnerabilities
Heartbleed and Cloudbleed were caused by buffer overflows, the most dangerous class of vulnerability in unsafe languages.
What Makes a Language Unsafe?
Undefined Behavior (UB): “behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements” (C99 standard)
Buffer Overflows: Leaking Sensitive Data
    long *arr = malloc(3 * sizeof(long));
    long dest[4];
    memcpy(dest, arr, sizeof(dest));
[Figure: memory layout of arr and dest; the memcpy reads past the end of arr and copies adjacent "secret" data into dest]
Heartbleed and Cloudbleed were such vulnerabilities.
Integer Overflow
    void pause() {
        int a = 0;
        // run until overflow
        while (a < a + 1) {
            a++;
        }
    }
What’s the compilation output of Clang/GCC?
1. The function works as expected by the programmer
2. The function body is optimized away
3. The function results in an endless loop
4. It depends on the optimization level
State of the Art: Instrumentation-based Tools
Compile-time instrumentation
• AddressSanitizer
• SoftBound+CETS
[Figure: C source is compiled by Clang/GCC to a.out; running ./a.out prints "Hello world!"]
Conundrum: Finding Bugs vs. Performance
• Static compilers: optimize code based on Undefined Behavior
• Bug-finding tools: find bugs assuming that violations are visible side effects
(Wang et al. 2012, D'Silva 2015)
Map Data Structures and Operations to Java
C:
    long *arr = malloc(3 * sizeof(long));
    arr[4] = …
Mapped to Java code:
    long[] arr = new long[3];
    arr[4] = …   // ArrayIndexOutOfBoundsException
The semantics of an out-of-bounds access are well specified. The JVM performs automatic bounds checks that cannot be optimized away.
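A minimal Java sketch of the behavior this slide relies on. The class and method names are illustrative and not part of Safe Sulong; the sketch only shows that the JVM rejects a write to index 4 of a three-element array.

```java
final class BoundsCheckDemo {
    /** Writes to index 4 of a three-element array and reports what the JVM does. */
    static String writeOutOfBounds() {
        long[] arr = new long[3];
        try {
            arr[4] = 42L; // the JVM checks the index on every access
            return "no error";
        } catch (ArrayIndexOutOfBoundsException e) {
            return "ArrayIndexOutOfBoundsException";
        }
    }

    public static void main(String[] args) {
        System.out.println(writeOutOfBounds());
    }
}
```

The exception is part of the language semantics, so a compiler may only drop the check when it can prove the index is in bounds.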
Execution of LLVM IR
[Figure: Clang (C, C++), GCC (Fortran), and other LLVM front ends (...) produce LLVM IR, which runs on the Safe Execution Platform] (Lattner et al. 2004)
• We disable compiler optimizations of the front ends
• Targeting LLVM IR allows executing multiple unsafe languages
Execution of LLVM IR
[Figure: LLVM IR is executed by an LLVM IR interpreter built on Truffle, running on Graal and the JVM] (Würthinger et al. 2012, 2017)
Using Truffle and Graal, we can minimize the instrumentation overhead.
Safe Sulong can rely on the underlying JVM:
• Automatic checks
• Safe optimizations
• Abstraction from the underlying machine and OS
Prevent Out-Of-Bounds Accesses
    long *arr = malloc(3 * sizeof(long));
[Figure: an Address object (offset = 0, data) whose data field references an I64Array object with contents = {0, 0, 0}]
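The object layout in the figure can be sketched in plain Java. The names Address, offset, data, I64Array, and contents come from the slide; everything else (mallocLongs, plus, load, store, and treating offset as an element index) is an assumption made for illustration and is not Safe Sulong's actual implementation.

```java
/** Managed stand-in for a malloc'ed block of longs (name taken from the slide). */
final class I64Array {
    final long[] contents;

    I64Array(int length) {
        this.contents = new long[length]; // zero-initialized: {0, 0, 0} for length 3
    }
}

/** Managed pointer: a reference to the pointee plus an offset into it. */
final class Address {
    final I64Array data;
    final int offset; // assumption: counted in elements, not bytes

    Address(I64Array data, int offset) {
        this.data = data;
        this.offset = offset;
    }

    /** Hypothetical counterpart of malloc(n * sizeof(long)). */
    static Address mallocLongs(int elements) {
        return new Address(new I64Array(elements), 0);
    }

    /** Pointer arithmetic creates a new Address into the same object. */
    Address plus(int elements) {
        return new Address(data, offset + elements);
    }

    long load() {
        return data.contents[offset]; // bounds-checked by the JVM
    }

    void store(long value) {
        data.contents[offset] = value; // bounds-checked by the JVM
    }
}

final class ManagedPointerDemo {
    /** Returns true if storing to arr[index] of a three-element block is rejected. */
    static boolean storeIsRejected(int index) {
        Address arr = Address.mallocLongs(3);
        try {
            arr.plus(index).store(1L);
            return false;
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(storeIsRejected(2)); // in bounds
        System.out.println(storeIsRejected(4)); // out of bounds
    }
}
```

Because the pointee is an ordinary Java array, no separate instrumentation pass is needed in this sketch: every load and store goes through the JVM's own check.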
Prevent Integer Overflows
C:
    int a = 1, b = INT_MAX;
    int val = a + b;
Java:
    Math.addExact(a, b);   // ArithmeticException
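A short Java sketch contrasting plain addition with Math.addExact. The class and method names are illustrative only. Plain int addition in Java wraps around with defined semantics, while Math.addExact raises an ArithmeticException on overflow.

```java
final class OverflowDemo {
    /** Plain Java addition wraps around on overflow (defined behavior). */
    static int wrappingAdd(int a, int b) {
        return a + b;
    }

    /** Math.addExact throws ArithmeticException instead of wrapping. */
    static String checkedAdd(int a, int b) {
        try {
            return Integer.toString(Math.addExact(a, b));
        } catch (ArithmeticException e) {
            return "ArithmeticException";
        }
    }

    public static void main(String[] args) {
        System.out.println(wrappingAdd(1, Integer.MAX_VALUE));
        System.out.println(checkedAdd(1, Integer.MAX_VALUE));
        System.out.println(checkedAdd(1, 2));
    }
}
```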
Safe Optimizations
• ArrayIndexOutOfBoundsException
• NullPointerException
• ArithmeticException
Exceptions are visible side effects and cannot be optimized away.
Implementation of Operations
[Figure: abstract syntax tree: write %2 ← add(read %i.0, 1)]
Executable AST node:
    class LLVMI32LiteralNode extends LLVMExpressionNode {
        final int literal;

        public LLVMI32LiteralNode(int literal) {
            this.literal = literal;
        }

        @Override
        public int executeI32(VirtualFrame frame) {
            return literal;
        }
    }
Nodes return their result in an execute() method (Würthinger et al. 2012)
Implementation of Operations
[Figure: abstract syntax tree: write %2 ← add(read %i.0, 1)]
Executable AST node:
    @NodeChildren({@NodeChild("leftNode"), @NodeChild("rightNode")})
    class LLVMI32AddNode extends LLVMExpressionNode {
        @Specialization
        protected int executeI32(int left, int right) {
            return left + right;
        }
    }
A DSL allows a declarative style of specifying and executing nodes (Humer et al. 2015)
Implementation of Operations
[Figure: abstract syntax tree: write %2 ← add(read %i.0, 1)]
Executable AST node:
    @NodeChild("valueNode")
    class LLVMWriteI32Node extends LLVMExpressionNode {
        final FrameSlot slot;

        public LLVMWriteI32Node(FrameSlot slot) {
            this.slot = slot;
        }

        @Specialization
        public void writeI32(VirtualFrame frame, int value) {
            frame.setInt(slot, value);
        }
    }
Local variables are represented by an array-like VirtualFrame object
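The three node kinds on these slides can be approximated without the Truffle framework. The sketch below is a simplified stand-in: the Sketch* class names are hypothetical, a plain int[] plays the role of the VirtualFrame, and none of Truffle's DSL or specialization machinery is modeled. It builds and executes the tree from the figure, write %2 ← add(read %i.0, 1).

```java
/** Base class of the sketch: every node computes its result in execute(). */
abstract class SketchNode {
    abstract int execute(int[] frame);
}

final class SketchLiteralNode extends SketchNode {
    final int literal;

    SketchLiteralNode(int literal) {
        this.literal = literal;
    }

    @Override
    int execute(int[] frame) {
        return literal;
    }
}

final class SketchReadNode extends SketchNode {
    final int slot;

    SketchReadNode(int slot) {
        this.slot = slot;
    }

    @Override
    int execute(int[] frame) {
        return frame[slot]; // local variables live in the array-like frame
    }
}

final class SketchAddNode extends SketchNode {
    final SketchNode left;
    final SketchNode right;

    SketchAddNode(SketchNode left, SketchNode right) {
        this.left = left;
        this.right = right;
    }

    @Override
    int execute(int[] frame) {
        return left.execute(frame) + right.execute(frame);
    }
}

final class SketchWriteNode extends SketchNode {
    final int slot;
    final SketchNode value;

    SketchWriteNode(int slot, SketchNode value) {
        this.slot = slot;
        this.value = value;
    }

    @Override
    int execute(int[] frame) {
        int result = value.execute(frame);
        frame[slot] = result;
        return result;
    }
}

final class AstSketchDemo {
    /** Runs write %2 = add(read %i.0, 1) with slot 0 as %i.0 and slot 1 as %2. */
    static int[] run(int initial) {
        int[] frame = new int[2];
        frame[0] = initial;
        SketchNode tree = new SketchWriteNode(1,
                new SketchAddNode(new SketchReadNode(0), new SketchLiteralNode(1)));
        tree.execute(frame);
        return frame;
    }

    public static void main(String[] args) {
        System.out.println(run(41)[1]);
    }
}
```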
Evaluation Hypotheses
• Effectiveness: Safe Sulong detects bugs that are overlooked by other tools
• Performance: Safe Sulong’s performance overhead is “reasonable”
Effectiveness: Errors in GitHub Projects
• Valgrind detected half of the errors
• 8 errors not found by LLVM’s AddressSanitizer (and Valgrind)
• Compiler optimizations (ASan -O3) prevented the detection of 4 additional bugs
Effectiveness: Errors in GitHub Projects
    int main(int argc, char** argv) {
        printf("%d %s\n", argc, argv[5]);
    }
• Out-of-bounds accesses to argv are not instrumented by ASan
• In Safe Sulong, instrumentation cannot be omitted by design
Existing Approaches
• Symbolic execution
• Hardware security
• Static analysis
• Attacker mitigation
• Instrumentation-based bug-finding tools
• Safe languages
Safe Sulong leverages a safe implementation language for its bug-finding capabilities.
Limitations/Selected Threats to Validity
• Lack of support for binary libraries
• Generalizability of the benchmark results
• Relied on a custom libc for evaluation
• Lacks common low-level features
Defined Behavior in C (C11)
    int arr[3];
    int result = &arr[0] < &arr[2];
Implementing the semantics described in the standard is (often) relatively straightforward.
Integer Representation: Safe Sulong
    int arr[3];
    int result = &arr[0] < &arr[2];
    integer_rep(a) = a.offset
[Figure: an Address object (offset = 2, data) referencing an I64Array object with contents = {0, 0, 0}]
Can anyone see where our implementation could break programs?
Code Relies on Undefined Behavior
[Do you know code that uses] relational comparison (with <, >, <=, or >=) of two pointers to separately allocated objects (of compatible object types)? (Memarian et al. 2016)
Response                       % of Respondents
Yes                            33%
Yes, but it shouldn’t          12%
No, but there might well be    29%
No, that would be crazy        16%
Don’t know                      8%
Integer Representation: Lenient C
    integer_rep(a) = (long) System.identityHashCode(a.pointee) << 32 | offset;
[Figure: an Address object (offset = 2, data) referencing an I64Array object with contents = {0, 0, 0}]
Breaks antisymmetry, as different objects might have the same hash code.
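The formula on this slide can be written out as runnable Java. This is a sketch under assumptions: LenientPointer and IntegerRepDemo are hypothetical names, and masking the offset to 32 bits is an addition that keeps a negative offset from overwriting the hash bits. Pointers into the same object are ordered by their offsets; pointers into different objects get an arbitrary order, and can tie when two objects happen to share a hash code, as the slide notes.

```java
/** Hypothetical managed pointer: a pointee object plus an offset into it. */
final class LenientPointer {
    final Object pointee;
    final int offset;

    LenientPointer(Object pointee, int offset) {
        this.pointee = pointee;
        this.offset = offset;
    }

    /** Upper 32 bits: identity hash of the pointee. Lower 32 bits: the offset. */
    long integerRep() {
        return ((long) System.identityHashCode(pointee) << 32) | (offset & 0xFFFFFFFFL);
    }
}

final class IntegerRepDemo {
    /** Mirrors &arr[0] < &arr[2] for two pointers into the same object. */
    static boolean sameObjectOrdered() {
        long[] contents = new long[3];
        LenientPointer first = new LenientPointer(contents, 0);
        LenientPointer third = new LenientPointer(contents, 2);
        return first.integerRep() < third.integerRep();
    }

    public static void main(String[] args) {
        System.out.println(sameObjectOrdered());
    }
}
```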
Mitigate Use-after-Free Errors
    long *arr = malloc(3 * sizeof(long));
    free(arr);
    arr[0] = …
[Figure: an Address object (offset = 0, data) referencing an I64Array object with contents = {0, 0, 0}; the access maps to contents[0] = …]
The GC will collect the object when it is no longer referenced.
Idea
    int *arr = malloc(sizeof (int) * 10);
    …
    arr[4] = … ;
• The tool records metadata (arr.size = 40) and checks accesses
• The program can query metadata from the tool:
    int size = size_right(str);
Introspection Functions
    int *arr = malloc(sizeof (int) * 10);
    int *ptr = &(arr[4]);
    printf("%ld\n", size_right(ptr)); // prints 24
[Figure: the allocation spans sizeof(int) * 10 bytes; _size_right() covers the bytes from ptr to the end of the allocation]
We also designed introspection functions for other metadata.
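The arithmetic behind the "prints 24" comment can be sketched in Java. All names here (IntPointer, malloc, elementAt, sizeRight) are hypothetical and only model the slide's example; a 4-byte int is assumed, and returning 0 for a pointer outside the allocation is an assumption rather than documented behavior.

```java
/** Hypothetical managed pointer into an int allocation. */
final class IntPointer {
    static final int INT_SIZE = 4; // assumed sizeof(int)

    final int[] data;
    final int index;

    IntPointer(int[] data, int index) {
        this.data = data;
        this.index = index;
    }

    /** Stand-in for malloc(bytes) when the block holds ints. */
    static IntPointer malloc(int bytes) {
        return new IntPointer(new int[bytes / INT_SIZE], 0);
    }

    /** Stand-in for &(ptr[i]). */
    IntPointer elementAt(int i) {
        return new IntPointer(data, index + i);
    }

    /** Bytes between this pointer and the end of its allocation. */
    long sizeRight() {
        if (index < 0 || index > data.length) {
            return 0; // assumption: nothing is accessible from an invalid pointer
        }
        return (long) (data.length - index) * INT_SIZE;
    }
}

final class SizeRightDemo {
    public static void main(String[] args) {
        IntPointer arr = IntPointer.malloc(IntPointer.INT_SIZE * 10);
        IntPointer ptr = arr.elementAt(4);
        System.out.println(ptr.sizeRight()); // (10 - 4) * 4 = 24
    }
}
```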
Example: strlen()
    size_t strlen(const char *str) {
        size_t len = 0;
        while (*str != '\0') {
            len++;
            str++;
        }
        return len;
    }
[Figure: buffer containing "Programming" followed by \0; strlen() returns 11]
Example: strlen()
    size_t strlen(const char *str) {
        size_t len = 0;
        while (*str != '\0') {
            len++;
            str++;
        }
        return len;
    }
[Figure: buffer containing "Programming" without a terminating \0]
    ==16497==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7ffc59c0ef63
    READ of size 1 at 0x7ffc59c0ef63 thread T0
        #0 0x4e7442 in strlen /home/manuel/test.c:10:12
        #1 0x4e7392 in main /home/manuel/test.c:5:5
Example: strlen()
    size_t strlen(const char *str) {
        size_t len = 0;
        while (size_right(str) > 0 && *str != '\0') {
            len++;
            str++;
        }
        return len;
    }
[Figure: buffer containing "Programming" without a terminating \0; strlen() returns 11]
We enhanced a libc to deal with unterminated strings.
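The hardened loop can be mirrored in Java to show why it stops at the end of an unterminated buffer. This is only a sketch: CharPointer, sizeRight, read, next, and strlenOf are hypothetical names, and a byte array stands in for the C buffer.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical managed char pointer: a byte buffer plus an offset. */
final class CharPointer {
    final byte[] data;
    final int offset;

    CharPointer(byte[] data, int offset) {
        this.data = data;
        this.offset = offset;
    }

    /** Bytes remaining between this pointer and the end of the buffer. */
    long sizeRight() {
        return (offset < 0 || offset > data.length) ? 0 : data.length - offset;
    }

    byte read() {
        return data[offset];
    }

    CharPointer next() {
        return new CharPointer(data, offset + 1);
    }
}

final class SafeStrlenDemo {
    /** Same loop shape as the slide: check the remaining size before dereferencing. */
    static long strlen(CharPointer str) {
        long len = 0;
        while (str.sizeRight() > 0 && str.read() != 0) {
            len++;
            str = str.next();
        }
        return len;
    }

    /** Builds a buffer for the text, optionally followed by a terminating zero byte. */
    static long strlenOf(String text, boolean terminated) {
        byte[] chars = text.getBytes(StandardCharsets.US_ASCII);
        byte[] buffer = new byte[chars.length + (terminated ? 1 : 0)];
        System.arraycopy(chars, 0, buffer, 0, chars.length);
        return strlen(new CharPointer(buffer, 0));
    }

    public static void main(String[] args) {
        System.out.println(strlenOf("Programming", true));
        System.out.println(strlenOf("Programming", false));
    }
}
```

The order of the two conditions matters: the size check runs first, so the dereference is never reached once the pointer is at the end of the buffer.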
CVE-2017-9047 (Libxml2)
    if (content->name != NULL)
        strcat(buf, (char *) content->name);
The parser printed a truncated error message, similar to the fixed version.
Sulong Key Collaborators
Jacob Kreindl, Raphael Mosaner, Roland Schatz, Josef Eisl, Christian Häubl, Matthias Grimmer, Thomas Pointhuber, Daniel Pekarek, Chris Seaton, Lukas Stadler, Florian Angerer, David Gnedt, Swapnil Gaikwad
https://github.com/graalvm/sulong/graphs/contributors
EuroLLVM 2019 talks:
• LLVM IR in GraalVM: Multi-Level, Polyglot Debugging with Sulong
• Sulong: An experience report of using the "other end" of LLVM in GraalVM.
Summary
• UB is problematic
• Existing approaches can “optimize” UB “away”
• Execute C/C++ on the JVM!
• Automatic checks detect UB
• But: programs often invoke UB
• Metadata for manual checks
GraalVM