Slide 1

Slide 1 text

Programs that Write Programs Craig Stuntz https://speakerdeck.com/craigstuntz https://github.com/CraigStuntz/TinyLanguage You can grab my slides and code from the links above. I won’t reserve time at the end of the talk for questions. I have a lot of material to cover. Please interrupt and ask if I’m unclear about anything. And I’d be happy to “buy you dinner” afterwards if you want to talk more.

Slide 2

Slide 2 text

–Steve Yegge “You're actually surrounded by compilation problems. You run into them almost every day.” http://steve-yegge.blogspot.ca/2007/06/rich-programmer-food.html Why bother learning compilers? What can I tell you in just an hour? I think it’s fun & interesting in its own right. Maybe you agree, since you’re here. But you might think you don’t have the opportunity to use compiler techniques in your day job. Maybe you can! Obviously I can’t explain everything there is to know about compiler implementations in an hour, which is why I’m also giving you complete source code for a simple one to look at later! I want you to be able to look at a hard problem and tell your colleagues you know how to solve it.

Slide 3

Slide 3 text

–Greenspun’s Tenth Rule “Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.” https://commons.wikimedia.org/wiki/File:Philip_Greenspun_and_Alex_the_dog.jpg Compilation is fundamental to the problem of producing software. If you’re a professional programmer, you will — you will! — be asked to solve problems which you can do better if you recognize them as pieces of the compiler toolchain.

Slide 4

Slide 4 text

The Hoover Dam Second thing: There exist problems too big to test. You can’t test every possible function a compiler might need to compile. You need a different methodology to drive your designs. You can solve them anyway, and you can produce a design with confidence it’s going to work.

Slide 5

Slide 5 text

Generalize the Problem The real skill I want you to leave this talk with: recognize compilation problems (they’re everywhere!), and apply proven, reliable patterns toward solving them. Simply recognizing these problems for what they are and knowing where to look for the solution will make you a better developer. This is a super power!

Slide 6

Slide 6 text

–Eugene Wallingford “…compilers ultimately depend on a single big idea from the theory of computer science: that a certain kind of machine can simulate anything — including itself. As a result, this certain kind of machine, the Turing machine, is the very definition of computability.” http://www.cs.uni.edu/~wallingf/blog/archives/monthly/2015-09.html#e2015-09-03T15_26_47.htm What Is a Compiler, Really? (pause) It’s a program that writes a program! Or a program which takes a representation of data and writes another representation. Often, it’s a program that can write any possible program. Many compilers can compile themselves, producing their own executable as output. There are weird and non-obvious consequences of this! This idea on the screen — this is huge!

Slide 7

Slide 7 text

Compiler So a compiler produces a program by mapping source code to output. But maybe you prefer Ruby or JS? (Click) Interpreter is a compiler which recognizes source code plus additional user input at the same time. I’ll talk mostly about compilers, but remember an interpreter is just a compiler with an extra feature or three.

Slide 8

Slide 8 text

Compiler Interpreter

Slide 9

Slide 9 text

code exe Actually, I lied just slightly. We like to think of a compiler as code -> exe, but this is mostly wrong. (Click) Primary purpose is to reject incorrect code. Maybe 1 in 10 times we produce a program, right? Right?

Slide 10

Slide 10 text

code exe

Slide 11

Slide 11 text

Useful Bits
• Regular Expressions (lexing)
• Deserializers (parsing)
• Linters, static analysis (syntax, type checking)
• Solvers, theorem provers (optimization)
• Code migration tools (compilers!)
Also, most of the individual pieces of the compilation pipeline are useful in their own right. When you learn to write a compiler, you learn all of the above, and more!

Slide 12

Slide 12 text

A → B More generally, a compiler is a formalization for taking data in one representation and turning it into data in a different representation, if valid. The transformation (from A to B) is often nontrivial. This concept describes a large percentage of the code we write, so compilation concepts are really useful, even when not writing a compiler per se.

Slide 13

Slide 13 text

Source code → Program
JPEG file → Image on screen
Source code → Potential style error list
JSON → Object graph
Code with 2 digit years → Y2K compliant code
VB6 → C#
Object graph → User interface markup
Algorithm → Faster, equivalent algorithm
Some of these look like what we typically consider a compiler. Some don’t. None are contrived. I do all of these in my day job. These are problems you must routinely solve even if you’re not a language author.

Slide 14

Slide 14 text

Designing with Formal Methods Many people here are probably familiar with test-driven design, where you write tests to guide the evolution of a program design. Although I do write (many!) tests for my compilers, test-driven design doesn’t hold up as a design methodology. Let’s examine why, and what the alternative might be.

Slide 15

Slide 15 text

#define D define #D Y return #D R for #D e while #D I printf #D l int #D W if #D C y=v+111;H(x,v)*y++= *x #D H(a,b)R(a=b+11;au){R(w=i=0;i<4;i++)w+=(m=v[h[i]])==f?300:m==q?-300:(t=v[ih[i]])==f?-50: t==q?50:0;Y w;}H(z,0){W(E(v,z,f,100)){c++;w= -S(d+1,n,q,0,-b,-j);W(w>j){g=bz=z; j=w;W(w#$b%&w#$8003)Y w;}}}W(!c){g=0;W(_){H(x,v)c+= *x==f?1:*x==3-f?-1:0;Y c>0? 8000+c:c-8000;}C;j= -S(d+1,n,q,1,-b,-j);)bz=g;Y d#$u-1?j+(c'(3):j;}main(){R(;t< 1600;t+=100)R(m=0;m<100;m++)V[t+m]=m<11%&m>88%&(m+1)%10<2?3:0;I("Level:");V[44] =V[55]=1;V[45]=V[54]=2;s(u);e(lv>0){Z do{I("You:");s(m);}e(!E(V,m,2,0))*m+,99); W(m+,99)lv--;W(lv<15)*u<10)u+=2;U("Wait\n");I("Value:%d\n",S(0,V,1,0,-9000,9000 ));I("move: %d\n",(lv-=E(V,bz,1,0),bz));}}E(v,z,f,o)l*v;{l*j,q=3-f,g=0,i,w,*k=v +z;W(*k==0)R(i=7;i#$0;i--){j=k+(w=r[i]);e(*j==q)j+=w;W(*j==f)*j-w+,k){W(!g){g=1 ;C;}e(j+,k)*((j-=w)+o)=f;}}Y g;} Anyone here know any C? Does this look like a valid C program to you? This one is not valid, but it would be if you changed one character. Can you find it, or could you write a program to find it? This is a hard problem: Write a program that for any string whatsoever (and there are lots of possible strings!) either declares it a valid program or explains clearly to a human why it isn’t.

Slide 16

Slide 16 text

Duff’s Device There Are No Edge Cases In Programming Languages
send(to, from, count)
register short *to, *from;
register count;
{
    register n = (count + 7) / 8;
    switch (count % 8) {
    case 0: do { *to = *from++;
    case 7:      *to = *from++;
    case 6:      *to = *from++;
    case 5:      *to = *from++;
    case 4:      *to = *from++;
    case 3:      *to = *from++;
    case 2:      *to = *from++;
    case 1:      *to = *from++;
            } while (--n > 0);
    }
}
And I mean every possible program! (Explain Duff) People will do absolutely anything the PL grammar allows. Therefore your compiler must be able to classify literally any arbitrary string into a valid or invalid program, and you can’t predict the valid programs people will write.

Slide 17

Slide 17 text

Even the designer of a language can’t begin to predict all the things the compiler will be asked to parse. You can’t design a compiler which can parse any legal program by poking around with tests of possible programs. You would need an unbounded number of tests. You must design with formal methods. But you do still write tests; it’s just that you use a different design methodology!

Slide 18

Slide 18 text

1 + 2 + 3 + … + 100 = 100 * 101 / 2 = 5050 Another reason compilers are interesting is they try to solve a very hard problem: Don’t just convert A to B. Convert A into a highly-optimized-but-entirely-equivalent B. For any given A whatsoever! If a compiler crashes, that’s bad. If it silently emits incorrect code, it’s worse than bad; the world ends. Hard problems are interesting!
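The Gauss identity above is exactly the kind of rewrite an optimizer makes. Here is a minimal F# sketch (an illustration for these notes, not code from the compiler) of the equivalence such a rewrite must preserve:

let sumNaive n = List.sum [ 1 .. n ]  // add the numbers one at a time
let sumFast n = n * (n + 1) / 2       // Gauss's closed form
// Both yield 5050 for n = 100. The optimizer's burden is to guarantee the
// two agree on every valid input; spot-checking a few can never prove that.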

Slide 19

Slide 19 text

Good news! Although this is a hard problem, it’s a mostly solved problem. “Solved” not as in it will never get better. “Solved” as in there is a recipe, which most programmers are capable of following. There are lots of steps, but each individual step is pretty simple. You can learn it one tiny piece at a time. This is a really learnable skill. In fact, if you have a hard problem and can distort it until it looks like a compiler, then you have a solved problem. This is magic! Contrast this with web development, which is also hard, but where we change our minds about what constitutes “best practices” annually.

Slide 20

Slide 20 text

Lexer → Regular Expressions
Parser → Context Free Grammar
Optimizer → Algebra
Type Checker → Logical Inference Rules
Code Generator → Denotational Semantics
For each part of the compiler pipeline, there exists a formalism which guides our designs. Some of these words are long and unfamiliar, but just think of them as recipes for implementation. If you follow them, you will cover all of the cases in your language specification.

Slide 21

Slide 21 text

–Leslie Lamport “You don’t achieve simplicity by thinking in terms of complicated languages. Simplicity requires thinking abstractly before you start implementing.” http://www.heidelberg-laureate-forum.org/blog/video/lecture-monday-august-24-2015-leslie-lamport/ https://commons.wikimedia.org/wiki/File:Leslie_Lamport.jpg Worth noting: Compiler implementation is fairly straightforward, and there’s a recipe to follow. Language design is much harder, and there’s no “best” recipe. Lots of people can write good compilers. Far fewer can write a good language. Fewer still can write a good, simple language. Respect language designers. It’s common to try to learn compilers by inventing your own language or chasing a complicated language. I recommend you don’t do that! I use a tiny Lisp here, but I started with a math expression evaluator. Most math expressions are simpler than PLs.

Slide 22

Slide 22 text

A Few Important Concepts Before we dive into compiler implementation per se…

Slide 23

Slide 23 text

Syntax
x = x + 1; alert(x);
Sequence
  Assign
    x
    add
      x
      1
  Invoke
    alert
    x
A compiler must handle both the written form (syntax) and the internal meaning (semantics) of the language. Syntax has both a literal (characters) and a tree form. These are both representations of the syntax. This is the “abstract” syntax tree. Believe it or not it’s the simplified version!
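Here’s a sketch of how that tree form might be modeled as types (illustrative names, not any particular compiler’s real AST):

type Ast =
    | Sequence of Ast list
    | Assign of name: string * value: Ast
    | Invoke of name: string * argument: Ast
    | Add of Ast * Ast
    | Variable of string
    | Number of int

// x = x + 1; alert(x);
let program =
    Sequence [ Assign ("x", Add (Variable "x", Number 1))
               Invoke ("alert", Variable "x") ]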

Slide 24

Slide 24 text

Semantics
Elixir:
name = "Nate"        # => "Nate"
String.upcase(name)  # => "NATE"
name                 # => "Nate"
Ruby:
name = "Nate"        # => "Nate"
name.upcase!         # => "NATE"
name                 # => "NATE"
http://www.natescottwest.com/elixir-for-rubyists-part-2/
People obsess about syntax. They reject LISPs because (). But semantics are, I think, most important. Nobody says we don’t need JavaScript since we have Java. People do ask if we need VB.NET since there is C# (syntactically different, but semantically equivalent). I want to clarify the distinction. Elixir is the first listing, Ruby the second. They look similar, but Elixir strings are immutable. Similar syntax; different semantics. Similar looking programs mean different things.

Slide 25

Slide 25 text

Semantics
VB.NET:
Imports System
Namespace Hello
    Class HelloWorld
        Overloads Shared Sub Main(ByVal args() As String)
            Dim name As String = "VB.NET"
            'See if argument passed
            If args.Length = 1 Then name = args(0)
            Console.WriteLine("Hello, " & name & "!")
        End Sub
    End Class
End Namespace
C#:
using System;
namespace Hello
{
    public class HelloWorld
    {
        public static void Main(string[] args)
        {
            string name = "C#";
            // See if argument passed
            if (args.Length == 1)
                name = args[0];
            Console.WriteLine("Hello, " + name + "!");
        }
    }
}
http://www.harding.edu/fmccown/vbnet_csharp_comparison.html
VB.NET is the first listing, C# the second. They look different, but the semantics are identical. Different looking programs mean the same thing. “The syntax of a language is governed by the constructs that define its types, and its semantics is determined by the interactions among those constructs.” -Robert Harper Is this clear? Compilers deal with both; the distinction is important. Let’s look at the big picture.

Slide 26

Slide 26 text

Front End: Understand Language Back End: Emit Code Front end, back end Definitions vary, but nearly always exist

Slide 27

Slide 27 text

Lexer IL Generator Parser Type Checker Optimizer Optimizer Object Code Generator Binder This is simplified. Production compilers have more stages. Front end, back end… Middle end? There is an intermediate representation — collection of types — for each stage. I show two optimizers here. Production compilers have more.

Slide 28

Slide 28 text

OK, so let’s compile something already!
module Compiler

let compile =
    Lexer.lex
    >> Parser.parse
    >> Binder.bind
    >> OptimizeBinding.optimize
    >> IlGenerator.codegen
    >> Railway.map OptimizeIl.optimize
    >> Railway.map Il.toAssemblyBuilder
That’s the real, full source code for the compiler itself on the screen; it just chains together the “recipes” we’ll be examining in detail in the examples to come. I’m literally just piping one module into the next.

Slide 29

Slide 29 text

(inc -1) Let’s start with something really, really simple. We want to be able to compile this program. (Remember, start with a toy language.) To what? There’s more than one valid representation of this program. Here’s one (click, explain IL). Here’s another (click). How do we even decide? Is it clear to everyone why these are both valid representations of the program I’m compiling?

Slide 30

Slide 30 text

(inc -1)
Ldc.i4 -1
Ldc.i4 1
Add

Slide 31

Slide 31 text

(inc -1)
Ldc.i4 -1
Ldc.i4 1
Add
Ldc.i4.0

Slide 32

Slide 32 text

(inc -1)
Lex          LeftParen, Identifier(inc), Number(-1), RightParen
Parse        Apply “inc” to -1
Type check   “inc” exists and takes an int argument, and -1 is an int. Great!
Optimize     -1 + 1 = 0, so just emit int 0!
IL generate  Ldc.i4 0
Optimize     Ldc.i4 0 → Ldc.i4.0
Object code  Produce assembly with entry point which contains the IL generated
Let’s break this process into individual steps. All we’ve done is bust the hard problem of source code -> optimized EXE into a bunch of relatively small problems. (Read each) We’ll explain them in detail in just a moment. Production compiler engineers will say this is too simple, but their code, like most production code, is probably a mess. You can learn a lot from the simple case!

Slide 33

Slide 33 text

(defun add-1 (int x) (inc x)) (defun main () (print (add-1 2))) Or maybe this one. It’s a little more complicated, but hopefully it still makes sense. Let’s look at the individual pieces of the pipeline.

Slide 34

Slide 34 text

Lexer What Problem Are We Solving? String → Sequence of tokens Non-Compiler Example Text search Lexers break strings into tokens using a grammar which is a bunch of really simple regular expressions. Tokens (lexemes) are like characters, but they’re just a little bit richer: you get one token for the int 123 instead of the individual characters 1, 2, and 3. Regular expressions search for text within a string. Lexers search for the reserved words and symbols of a language.

Slide 35

Slide 35 text

Lexer Search “am” I am. You are. You don’t expect a search for “am” to match the word “are” simply because they’re conjugations of the same verb. This is the difference between lexing and parsing; lexing deals with symbols and parsing deals with language grammar. Lexing works on character input. Parsing works on lexeme input (roughly, words or symbols) — the fundamental unit of the PL’s grammar and also the output of the lexer.

Slide 36

Slide 36 text

Regular Expressions
leftParenthesis = ‘(‘
rightParenthesis = ‘)’
letter = ‘A’ | ‘B’ | ‘C’ | …
digit = ‘0’ | ‘1’ | ‘2’ | …
number = (‘+’digit | ‘-’digit | digit) digit*
alphanumeric = letter | number
…
Of course an acceptable program can’t really be any random string. There are rules. We define the rules via formalisms. For lexers, the formalism is regular expressions. This doesn’t mean a PCRE; it’s much simpler. As with the example above, a RE can be a literal character, a choice, or a sequence. That’s it! There are no other options. Lexers are really simple! But how can we be sure that the code behaves like this grammar, every time?
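Because the formalism is that small, it fits in a handful of F# constructors. A sketch of the idea (illustrative types, not the compiler’s code); Star covers repetition, for rules like digit* above:

type Regex =
    | Literal of char
    | Choice of Regex * Regex
    | Sequence of Regex * Regex
    | Star of Regex   // repetition, as in digit*

// digit = '0' | '1' | '2', written as nested choices
let digit = Choice (Literal '0', Choice (Literal '1', Literal '2'))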

Slide 37

Slide 37 text

Lexer If there’s one rule I know which will for sure make your programs better, it’s this: Make illegal states unrepresentable. (Yaron Minsky)

Slide 38

Slide 38 text

Lexer
type Lexeme =
    | LeftParenthesis
    | RightParenthesis
    | Identifier of string
    | LiteralInt of int
    | LiteralString of string
    | Unrecognized of char
The regular expressions in the lexical grammar directly map to the types I create for the lexer. I cannot construct an instance of something which doesn’t fit in the types, hence, I cannot construct an instance of a program which doesn’t fit in the grammar. Any code input which doesn’t match the lexical grammar gets tossed into the bottom “Unrecognized” type. It will eventually surface as an error to the user.

Slide 39

Slide 39 text

Lexer (inc -1) Start with this, transform to (click). Haven’t tried to infer any kind of meaning at all yet. What’s the point of lexing vs parsing? Output of lexing is terminals of the grammar used by the parser. Lexer sets the parser up for success. Makes the formalism possible.

Slide 40

Slide 40 text

Lexer
(inc -1)
“(“ “inc” “-1” “)”
LeftParenthesis Identifier(“inc”) LiteralInt(-1) RightParenthesis

Slide 41

Slide 41 text

Lexer ( -1) Possible errors: Something which doesn’t fit the lexical grammar, and also can’t be recognized as a terminal for the parsing grammar.

Slide 42

Slide 42 text

Lexer
let rec private lexChars (source: char list) : Lexeme list =
    match source with
    | '(' :: rest -> LeftParenthesis :: lexChars rest
    | ')' :: rest -> RightParenthesis :: lexChars rest
    | '"' :: rest -> lexString(rest, "")
    | c :: rest when isIdentifierStart c -> lexName (source, "")
    | d :: rest when System.Char.IsDigit d -> lexNumber(source, "")
    | [] -> []
    | w :: rest when System.Char.IsWhiteSpace w -> lexChars rest
    | c :: rest -> Unrecognized c :: lexChars rest
I write all of my compiler code as purely functional. Most example compiler code requires understanding the state of the compiler as well as the behavior of the code. You can understand this code from the code alone. There is no state at all to know. Lexer is recursive: Look at first char, decide what to do, then repeat.
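Given the helpers elided above (lexString, lexName, lexNumber), the earlier walkthrough is a single call. A sketch, shown on (inc 1) to sidestep how the minus sign is lexed, since that helper isn’t on the slide:

lexChars [ '('; 'i'; 'n'; 'c'; ' '; '1'; ')' ]
// = [LeftParenthesis; Identifier "inc"; LiteralInt 1; RightParenthesis]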

Slide 43

Slide 43 text

Lexer http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags Counterexamples: The thing you must know about lexing is when not to use it. You can’t parse XML with a regex. The user experience for validation should be gently helping the user to avoid fat fingers and submit correct data.

Slide 44

Slide 44 text

Lexer http://www.regular-expressions.info/email.html “So even when following official standards, there are still trade-offs to be made. Don't blindly copy regular expressions from online libraries or discussion forums.” -Jan Goyvaerts, regular-expressions.info Don’t piss them off by telling them they’ve misspelt their Irish surname or that their real email address is “invalid.” Don’t validate a grammar (like email addresses) with a lexer/regex.

Slide 45

Slide 45 text

Parser What Problem Are We Solving? Sequence of tokens → Syntax tree Non-Compiler Example Deserialization The lexer produces a sequence of tokens. We want to turn that into an abstract syntax tree, respecting operator precedence.

Slide 46

Slide 46 text

PEMDAS 1 + 2 * 3 1 + (2 * 3) What is precedence? In most languages, for example, you multiply before you add, regardless of their sequence in an expression. The expressions above and below the line should be identical.
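Grammars usually encode precedence by layering: each level refers only to the next-tighter level, so the tree comes out grouped correctly with no extra work. A sketch, not ECMAScript’s actual productions:

<expression> := <term> (("+" | "-") <term>)*
<term>       := <factor> (("*" | "/") <factor>)*
<factor>     := number | "(" <expression> ")"

Because 2 * 3 can only be produced inside a single <term>, 1 + 2 * 3 must parse as 1 + (2 * 3); the grammar itself enforces PEMDAS.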

Slide 47

Slide 47 text

To parse a language, you must understand its grammar. Example is part of ECMAScript grammar. All valid statements in the language (all!) can be constructed from the grammar. Anything we can’t construct from the grammar is invalid and will be a parse error. The converse isn’t true, though. We can also construct invalid statements from the grammar, because the parser doesn’t type check. Just because something can be parsed doesn’t make it correct, but it’s certainly invalid if it can’t be parsed. True in spoken languages as well: “Colorless green ideas sleep furiously” Grammatically correct, but semantically nonsensical. Parsing works on syntax, not semantics.

Slide 48

Slide 48 text

Grammar
<program>    := <defun> | <expression>
<expression> := <literal> | <invoke>
<defun>      := “(defun” identifier <expression> “)”
<literal>    := number | string | …
<invoke>     := “(” identifier <expression> “)”
With a reasonably specified language, the rules are pretty easy to follow. (good: C#; bad: Ruby) Our goal today is to implement a really simple language. We’ll follow the grammar here. Explain Terminals vs. nonterminals. Important for lexing vs. parsing.

Slide 49

Slide 49 text

Grammar
type Expression =
    | IntExpr of int
    | StringExpr of string
    | DefunExpr of name: string * argument: ArgumentExpression option * body: Expression
    | InvokeExpr of name: string * argument: Expression option
    | IdentifierExpr of string
    | ErrorExpr of string
    | EmptyListExpr
Just as we defined types to represent productions in the lexical grammar, we do the same for the parsing grammar. At this point you might ask if this mapping from the formal grammar to F# types can be automated, and it can! I’ve built my entire compiler “from scratch” so you can see how it works, but it’s common to use lexer and parser generators in real-world work.

Slide 50

Slide 50 text

Parser LeftParenthesis Identifier(“inc”) LiteralInt(-1) RightParenthesis How does parsing work? Start with a list of tokens from the lexer. Click Produce syntax tree. Here we have a const -1 argument; argument could be another expression like another invocation.

Slide 51

Slide 51 text

Parser
LeftParenthesis Identifier(“inc”) LiteralInt(-1) RightParenthesis
Invoke “inc” -1

Slide 52

Slide 52 text

Parser LeftParenthesis Identifier(“inc”) LiteralInt(-1) LiteralInt(-1) Possible errors: Bad syntax. Every stage of compiler checks the form of the stage previous. (click) Parser checks lexer output.

Slide 53

Slide 53 text

Parser
LeftParenthesis Identifier(“inc”) LiteralInt(-1) LiteralInt(-1)
“Expected ‘)’”

Slide 54

Slide 54 text

Parser
let rec private parseExpression (state : ParseState): ParseState =
    match state.Remaining with
    | LeftParenthesis :: Identifier "defun" :: Identifier name :: rest ->
        let defun = parseDefun (name, { state with Remaining = rest })
        match defun.Expressions, defun.Remaining with
        | [ ErrorExpr _ ], _ -> defun
        | _, RightParenthesis :: remaining -> { defun with Remaining = remaining }
        | _, [] -> error ("Expected ')'.")
        | _, wrong :: _ -> error (sprintf "Expected ')'; found %A." wrong)
    | LeftParenthesis :: Identifier name :: argumentsAndBody ->
        let invoke = parseInvoke (name, { state with Remaining = argumentsAndBody })
        match invoke.Remaining with
        | RightParenthesis :: remaining -> { invoke with Remaining = remaining }
        | [] -> error ("Expected ')'.")
        | wrong :: _ -> error (sprintf "Expected ')'; found %A." wrong)
    | LeftParenthesis :: wrong -> error (sprintf "%A cannot follow '('." wrong)
Implementing parsers can be a bit tricky, since there are lots of choices of implementation style with subtle tradeoffs. But you can use an off-the-shelf toolkit in most cases. Parts of my parser are shown here. Again, I’m just doing a pattern match against the token stream from the lexer, because parsing a LISP is easy. The key to not getting stuck in a compiler is to do only one thing at a time. Do not optimize. Do not type check. Just look for valid syntax.

Slide 55

Slide 55 text

Parser Practical parsing: Deserialization, especially of untrusted input like a PNG file, is a parsing job, and you should use formal methods to implement it. If you try to “wing it”, you are probably exposing everyone who uses your library to the bad guys.

Slide 56

Slide 56 text

–Guy Steele “If it's worth telling another programmer, it's worth telling the compiler, I think.” https://joshvarty.wordpress.com/2015/08/03/learn-roslyn-now-part-11-introduction-to-code-fixes/ One super practical use for a parser is enforcing code style guides. Many companies write style guides as Word documents or worse. That leads to lax enforcement, usually targeted at new employees only, and lots of pissy email threads.

Slide 57

Slide 57 text

Parser https://joshvarty.wordpress.com/2015/08/03/learn-roslyn-now-part-11-introduction-to-code-fixes/ Instead, you can write a Roslyn syntax analyzer and just fail the build if a rule is broken. Parsers: Bringing world peace through technology!

Slide 58

Slide 58 text

Scope What Problem Are We Solving? What does “x” mean right now? Non-Compiler Example Bounded Context in Domain Driven Design In an ideal world (for the compiler author), developers would give a unique and unambiguous name to every variable. In the real world, probably half the variables in any average program are called “temp”. The compiler has to decide which assignment to something called “temp” to use in a given context.

Slide 59

Slide 59 text

Scope https://msujaws.wordpress.com/2011/05/03/static-vs-dynamic-scoping/ Scoping rules unambiguously associate occurrences of identifier names to their binding locations, or declaration sites. Essentially all modern languages use lexical (static) scoping. This is the rarest of things: A settled argument in computer science. Essentially nobody argues that dynamic scoping is a good idea anymore. Only archaic languages like SNOBOL and early LISPs use dynamic scoping. (Perl has opt-in dynamic scoping because it’s Perl.) Most contemporary dynamic languages use lexical scoping. As a compiler writer, this requires some care.
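A quick illustration of the difference, in F# (which, like essentially every modern language, is lexically scoped; the dynamic result appears only in a comment):

let x = 1
let f () = x          // f refers to the x in scope where f is defined
let g () =
    let x = 2
    f ()
// Lexical (static) scoping: g () = 1.
// Under dynamic scoping, f would see its caller's x, and g () would be 2.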

Slide 60

Slide 60 text

Binding InvokeExpr “inc” -1 The binder handles scoping concerns. Binder connects syntactic identifiers like “inc” with meaning. Start with an abstract syntax tree. Click. Produce a binding tree. Note the line Function = Inc — the string “inc” has been mapped to a built-in function the compiler understands and can code generate. Binding enforces scope and gives us the ability to check types.

Slide 61

Slide 61 text

InvokeBinding {
    FunctionName = "inc"
    Function = Inc
    Argument = IntBinding -1 }
Binding
InvokeExpr “inc” -1

Slide 62

Slide 62 text

InvokeExpr { Name = "not-a-function" Argument = StringExpr "" } Binding Possible errors. (click)

Slide 63

Slide 63 text

InvokeExpr { Name = "not-a-function" Argument = StringExpr "" }
Binding
“Undefined function ‘not-a-function’.”

Slide 64

Slide 64 text

https://msdn.microsoft.com/en-us/library/ms228296.aspx?f=255&MSPPError=-2147217396 Someone asked me how much code in a compiler is dedicated to error handling. Let’s take a digression and consider this. Remember the most common case for a compiler is failing on bad code and giving the user a good error message. C# sure has a lot of them. Each of these has code behind it.

Slide 65

Slide 65 text

As a compiler author, you either put in the effort to do it well or you leave your user with a really poor experience.

Slide 66

Slide 66 text

About Those Errors
[<Test>]
member this.``should return error for unbound invocation``() =
    let source = "(bad-method 2)"
    let expected = ErrorBinding ("Undefined function 'bad-method'.", EmptyBinding)
    let actual = bind source
    actual |> should equal expected
So I write tests for each. Each error needs a test example, or several. This isn’t test driven design; it’s regression testing.

Slide 67

Slide 67 text

About Those Errors http://www.drdobbs.com/architecture-and-design/so-you-want-to-write-your-own-language/240165488?pgno=2 There are a few possible ways to deal with errors. (click) First is common when getting started; entirely unsuitable for production compilers. Just fail on first error found. (click) Second friendly but hard and often wrong. Guess what the user intended and try to continue compilation. Report all errors at the end of the process. (click) I use third.

Slide 68

Slide 68 text

About Those Errors
• Die in a fire
http://www.drdobbs.com/architecture-and-design/so-you-want-to-write-your-own-language/240165488?pgno=2

Slide 69

Slide 69 text

About Those Errors
• Die in a fire
• Guess what I meant, not what I said
http://www.drdobbs.com/architecture-and-design/so-you-want-to-write-your-own-language/240165488?pgno=2

Slide 70

Slide 70 text

About Those Errors
• Die in a fire
• Guess what I meant, not what I said
• Poisoning
http://www.drdobbs.com/architecture-and-design/so-you-want-to-write-your-own-language/240165488?pgno=2

Slide 71

Slide 71 text

Type Checking What Problem Are We Solving? AST → Boolean “Is it valid?” Non-Compiler Example Linter Moving on, Type checking doesn’t substantially change the intermediate representation. Mostly just “thumbs up/thumbs down.”
But some compilers do type inference, which does change the representation: it transforms nodes of unknown type into nodes of known type. Type checkers are really useful outside of compilers: (Swagger)

Slide 72

Slide 72 text

Type Checking
ldstr "Hi"
ldstr "Hi"
div
This is bad. Don’t do this. One way to understand type checking is to look at a language which doesn’t do any. IL/ASM spend very little time checking. You can try to “divide” two strings, for example, and your application might crash or, worse, silently produce incorrect results. So it is critical that the compiler never emit such IL.

Slide 73

Slide 73 text

Type Inference Rules
Γ ⊢ A    Γ ⊢ B          Γ ⊢ v1 : Int    Γ ⊢ v2 : Int
--------------          ----------------------------
   Γ ⊢ A×B                    Γ ⊢ v1 + v2 : Int
Type systems are specified by inference rules. Given the premises or assumptions above the line, we’re allowed to form the derivations below the line. There’s an example on the slide. Here we say that if A and B are both types in a certain type environment named gamma, then we are also allowed to form a pair with members of types A and B. (You read “Γ ⊢” as “it’s provable within an environment Γ that…”) Similarly, if we know that two values, v1 and v2, are both integers, then so is their sum. There are lots of rules, but each one should be pretty simple. These seem almost too obvious to bother stating, but it helps to be really clear what the rules are, because, as you’ve seen with JavaScript, the corner cases can be a bit scary.
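Each rule translates almost mechanically into one case of a checker. A self-contained sketch of the addition rule (toy types for illustration, not the TinyLanguage binder):

type Ty = TyInt | TyString
type Expr =
    | IntLit of int
    | StrLit of string
    | Add of Expr * Expr

// Γ ⊢ v1 : Int    Γ ⊢ v2 : Int
// -----------------------------
//       Γ ⊢ v1 + v2 : Int
let rec typeOf (expr: Expr) : Ty =
    match expr with
    | IntLit _ -> TyInt
    | StrLit _ -> TyString
    | Add (v1, v2) ->
        match typeOf v1, typeOf v2 with
        | TyInt, TyInt -> TyInt
        | _ -> failwith "Type error: '+' expects two Ints."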

Slide 74

Slide 74 text

Type Checking
• Statically typed
• Unityped (“dynamic language”)
• Untyped
You might ask yourself, “Self, what if I'm working on a dynamic language? Then I don't have to do any type checking, right?” However, you must do nearly the same type checking for evaluating a dynamic language as for pre-compiling a static language. The biggest difference is that the check is deferred until runtime. Other than that, it is very similar. Some languages, like ASM and C, do very little type checking at all. This is part of the reason why OpenSSL has so many issues and your Toyota accelerates itself. C has its uses, but please don’t build another C.

Slide 75

Slide 75 text

Type Checking
let rec private toBinding (environment: Map<string, Binding>) (expression : Expression) : Binding =
    match expression with
    | IntExpr n -> IntBinding n
    | StringExpr str -> StringBinding str
“A type system specifies the type rules of a programming language independently of particular typechecking algorithms. This is analogous to describing the syntax of a programming language by a formal grammar, independently of particular parsing algorithms.” - Luca Cardelli Some of this is easy!

Slide 76

Slide 76 text

Type Checking
| InvokeExpr (name, argument) ->
    match environment.TryFind name with
    | Some (FunctionBinding func) ->
        let argumentBinding = toInvokedArgumentBinding environment argument
        match argumentTypeError argumentBinding func with
        | None ->
            InvokeBinding {
                FunctionName = name
                Function = func
                Argument = argumentBinding }
        | Some argumentTypeErrorMessage ->
            ErrorBinding (argumentTypeErrorMessage, EmptyBinding)
    | Some bindingType ->
        ErrorBinding (sprintf "Expected function; found %A" bindingType, EmptyBinding)
    | None ->
        ErrorBinding (sprintf "Undefined function '%s'." name, EmptyBinding)
Some of it is considerably harder. When binding an invocation, we must first make sure it’s a real function, then that the argument can be bound at all, and lastly that the types match. Type checking is easier than type inference, but both are mostly solved problems, at least for simpler cases. Now we’re moving to a semantic understanding of the code.

Slide 77

Slide 77 text

InvokeExpr { Name = "inc" Argument = StringExpr "Oops!" }
Type Checking
Possible errors.

Slide 78

Slide 78 text

InvokeExpr { Name = "inc" Argument = StringExpr "Oops!" }
Type Checking
“Expected integer; found ‘Oops!’.”

Slide 79

Slide 79 text

Optimizers What Problem Are We Solving? Program → Faster, but equivalent program Non-Compiler Example Theorem prover Fight the urge to optimize outside the optimizers! Remember, one of the essential characteristics of compiler optimization is you can turn it off, e.g., for easier debugging. This is really hard if you optimize outside the optimizer. Optimizer must never change program behavior, except maybe making it harder to debug. Non-optimizer code should be so non-optimal it looks dumb.

Slide 80

Slide 80 text

Optimization (I) InvokeBinding “inc” -1 There cannot be any errors based on user input. Either make it better or leave it alone. Unlike other phases. Here’s an optimized version of the tree (click)

Slide 81

Slide 81 text

Optimization (I)
InvokeBinding “inc” -1
IntBinding 0

Slide 82

Slide 82 text

Optimization (I) Invoke “some-method” -1 Often, you’ll do nothing. (click)

Slide 83

Slide 83 text

Optimization (I)
Invoke “some-method” -1
Invoke “some-method” -1

Slide 84

Slide 84 text

Optimization (I)
let private optimizeInc (binding: Binding) : Binding =
    match binding with
    | IncBinding (IntBinding number) -> IntBinding (number + 1)
    | IncBinding _
    | BoolBinding _
    | IntBinding _
    | StringBinding _
    | VariableBinding _
    | FunctionBinding _
    | InvokeBinding _
    | DefBinding _
    | ErrorBinding _
    | EmptyBinding _ -> binding
This is an example of a function which does one specific optimization: Find when the “inc” function is applied to a literal int and substitute the correct result. Hooray, we’ve optimized away the function call! It ignores other kinds of nodes.
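Applied to the running example, that rewrite is a single call. (IncBinding is the bound form of an inc invocation, per the pattern above.)

optimizeInc (IncBinding (IntBinding -1))  // = IntBinding 0; the function call is gone
optimizeInc (StringBinding "hi")          // = StringBinding "hi"; everything else passes through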

Slide 85

Slide 85 text

IL Generation IntBinding 0 After optimizing the tree, generate IL. If we start from a simple binding… (click) We want to produce If you’ve seen IL before, you may wonder why not (click). More optimized! Why don’t we do this? (anyone?)

Slide 86

Slide 86 text

IL Generation
IntBinding 0
Ldc.i4 0

Slide 87

Slide 87 text

IL Generation
IntBinding 0
Ldc.i4 0
Ldc.i4.0

Slide 88

Slide 88 text

IL Generation
let rec private codegenBinding (binding : Binding) =
    match binding with
    | BoolBinding b ->
        match b with
        | true -> [Ldc_I4_1]
        | false -> [Ldc_I4_0]
    | IntBinding n -> [Ldc_I4 n]
    | StringBinding s -> [Ldstr s]
    // …
Some of this is pretty straightforward.

Slide 89

Slide 89 text

IL Generation
let private writeLineMethod =
    typeof<System.Console>.GetMethod("WriteLine", [| typeof<string> |])

let private codegenOper = function
    | IncInt -> [ Instruction.Ldc_I4_1
                  Instruction.Add ]
    | WriteLine -> [ Instruction.Call writeLineMethod ]
But for built-in, primitive operations, I have to write out the code in IL. I need these primitives for more complicated programs.

Slide 90

Slide 90 text

Optimization (II) Ldc.i4 0 For IL optimization, we want to replace the generic operations with their optimized IL short forms. Go from this, to (click)

Slide 91

Slide 91 text

Optimization (II)
Ldc.i4 0
Ldc.i4.0

Slide 92

Slide 92 text

Optimization (II)
let private optimalShortEncodingFor = function
    | Ldc_I4 0 -> Ldc_I4_0
    | Ldc_I4 1 -> Ldc_I4_1
    | Ldc_I4 2 -> Ldc_I4_2
    | Ldc_I4 3 -> Ldc_I4_3
    | Ldc_I4 4 -> Ldc_I4_4
    | Ldc_I4 5 -> Ldc_I4_5
    | Ldc_I4 6 -> Ldc_I4_6
    | Ldc_I4 7 -> Ldc_I4_7
    | Ldc_I4 8 -> Ldc_I4_8
    | Ldloc 0 -> Ldloc_0
    | Ldloc 1 -> Ldloc_1
    | Ldloc 2 -> Ldloc_2
    | Ldloc 3 -> Ldloc_3
    | Ldloc i when i <= maxByte -> Ldloc_S(Convert.ToByte(i))
Anything not listed here stays the same. No errors!
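Running a whole instruction list through it is then just a map. A sketch with an assumed wrapper name; note the real pass also needs a catch-all case returning unlisted instructions unchanged, per the note above:

let shortenAll (il: Instruction list) = List.map optimalShortEncodingFor il
// shortenAll [Ldc_I4 0; Ldc_I4 100] = [Ldc_I4_0; Ldc_I4 100]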

Slide 93

Slide 93 text

Special Tools! PEVerify, ildasm. Similar tools exist for other platforms. ildasm: You may have seen it before. It decompiles an EXE/DLL into IL. PEVerify checks that the IL and metadata you emit are valid.

Slide 94

Slide 94 text

Compare! When in doubt, try it in C#, and see what that compiler emits. Sometimes it’s a little strange. Roll with it.

Slide 95

Slide 95 text

https://www.ece.cmu.edu/~ganger/712.fall02/papers/p761-thompson.pdf OK, this seems really weird when you first hear about it. Thompson says we should trust people more than code. But people are fallible and code is precise, right?

Slide 96

Slide 96 text

Trusting Trust Compiler Executable Compiler Source Code Compiler Executable Many compilers can compile themselves. This is how it’s supposed to work. The two green boxes are identical, right? Right. This is how we expect it to work. Compilers map source code to executables.

Slide 97

Slide 97 text

Trusting Trust Compiler Executable Compiler Source Code Trojaned Compiler Executable Trojan Code What if someone adds some malicious code to the compiler source code? What does that do?

Slide 98

Slide 98 text

Trusting Trust Trojaned Compiler Executable Benign App Source Code Trojaned App Executable Code which adds a trojan to any app the compiler compiles? This is obviously bad, but maybe not too surprising, right?

Slide 99

Slide 99 text

Trusting Trust Trojaned Compiler Executable (Benign!) Compiler Source Code Trojaned Compiler Executable Now the trojan lives in the compiler EXE only, not the source code! Even if you recompile the compiler itself from good, benign source code, you don’t know if you’re secure. You need to know the full lineage of the compiler. This is true in the context of formal verification, as well, not just security against bad guys.

Slide 100

Slide 100 text

Conclusion Don’t fear hard problems. Recognize solved problems, use recipes that have been developed over a half century of prior art.

Slide 101

Slide 101 text

Further Reading Don’t buy the dragon book. Universally recommended by people who haven’t read it, or who have read nothing else.

Slide 102

Slide 102 text

Further Reading
• Programming Language Concepts, by Peter Sestoft
• Modern Compiler Implementation in ML, by Andrew W. Appel
• miniml (608 line implementation of ML subset), by Andrej Bauer
• Coursera Compilers Course, by Alex Aiken

Slide 103

Slide 103 text

Craig Stuntz @craigstuntz [email protected] http://blogs.teamb.com/craigstuntz http://www.meetup.com/Papers-We-Love-Columbus/ https://speakerdeck.com/craigstuntz https://github.com/CraigStuntz/TinyLanguage Feel free to reach out and ask follow up questions, either here at the conference or by one of the ways on the slide.