Slide 1

Slide 1 text

Programs that Write Programs Craig Stuntz https://speakerdeck.com/craigstuntz https://github.com/CraigStuntz/TinyLanguage What is a compiler?
 What is the formalism? Different than TDD. How do we drive code implementation via the formalism?

Slide 2

Slide 2 text

A → B Why bother learning compilers? I think it’s fun & interesting in its own right. Maybe you agree, since you’re here. But you might think you don’t have the opportunity to use this stuff in your day job. Maybe you can! A compiler is a formalization for taking data in one representation and turning it into data in a different representation. The transformation (from A to B) is often nontrivial.

Slide 3

Slide 3 text

–Steve Yegge “You're actually surrounded by compilation problems. You run into them almost every day.” http://steve-yegge.blogspot.ca/2007/06/rich-programmer-food.html This is, more or less, every problem in software engineering. Every problem.

Slide 4

Slide 4 text

Source code → Program JPEG file → Image on screen Source code → Potential style error list JSON → Object graph Code with 2 digit years → Y2K compliant code VB6 → C# Object graph → User interface markup Algorithm → Faster, equivalent algorithm Some examples Some of these look like what we typically consider a compiler. Some don’t. None are contrived. I do all of these in my day job.

Slide 5

Slide 5 text

Useful Bits • Regular Expressions (lexing) • Deserializers (parsing) • Linters, static analysis (syntax, type checking) • Solvers, theorem provers (optimization) • Code migration tools (compilers!) Also, most of the individual pieces of the compilation pipeline are useful in their own right When you learn to write a compiler, you learn all of the above, and more!

Slide 6

Slide 6 text

–Greenspun’s Tenth Rule “Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.” Another good reason: You will be doing this stuff anyway; may as well do it right! DIBOL story.

Slide 7

Slide 7 text

1 + 2 + 3 + … + 100 = 100 * 101 / 2 = 5050 Another reason compilers are interesting is they try to solve a very hard problem: Don’t just convert A to B. Convert A into a highly-optimized-but-entirely-equivalent B. For any given A whatsoever! If a compiler crashes, that’s bad. If it silently emits incorrect code, it’s worse than bad; the world ends. Hard problems are interesting!

Slide 8

Slide 8 text

Good News! Although this is a hard problem, it’s a mostly solved problem. “Solved” not as in it will never get better. “Solved” as in there is a recipe. In fact, if you have a hard problem and can distort it until it looks like a compiler then you have a solved problem. Contrast this with web development, which is also hard, but nobody has any bloody idea what they’re doing.

Slide 9

Slide 9 text

–Eugene Wallingford “…compilers ultimately depend on a single big idea from the theory of computer science: that a certain kind of machine can simulate anything — including itself. As a result, this certain kind of machine, the Turing machine, is the very definition of computability.” http://www.cs.uni.edu/~wallingf/blog/archives/monthly/2015-09.html#e2015-09-03T15_26_47.htm What Is a Compiler, Really? (pause) It’s a program that writes a program! Or a program which takes a representation and writes a representation. Often, it’s a program that can write any possible program. Many compilers can compile themselves, producing their own executable as output. There are weird and non-obvious consequences of this! This idea on the screen — this is huge!

Slide 10

Slide 10 text

code exe Actually, I lied just slightly. We like to think of a compiler as code -> exe, but this is mostly wrong. (Click) Primary purpose is to reject incorrect code. Maybe 1 in 10 times we produce a program. A good compiler is the ultimate total function: Take any random string and either produce a program or explain clearly why it’s not a program.

Slide 11

Slide 11 text

code exe Actually, I lied just slightly. We like to think of a compiler as code -> exe, but this is mostly wrong. (Click) Primary purpose is to reject incorrect code. Maybe 1 in 10 times we produce a program. A good compiler is the ultimate total function: Take any random string and either produce a program or explain clearly why it’s not a program.

Slide 12

Slide 12 text

Syntax and Semantics • Java and JavaScript have similar syntax, but different semantics. Ruby and Elixir have similar syntax, but different semantics • C# and VB.NET have different syntax, but fairly similar semantics A compiler must handle both the written form (syntax) and the internal meaning (semantics) of the language. People obsess about syntax. They reject LISPs because (). But semantics are, I think, most important. Nobody says we don’t need JavaScript since we have Java. People do ask if we need VB.NET since there is C# (syntactically different, but semantically equivalent). Syntax is really important!

Slide 13

Slide 13 text

Similar Syntax, Different Semantics

Elixir:

    name = "Nate"        # => "Nate"
    String.upcase(name)  # => "NATE"
    name                 # => "Nate"

Ruby:

    name = "Nate"   # => "Nate"
    name.upcase!    # => "NATE"
    name            # => "NATE"

http://www.natescottwest.com/elixir-for-rubyists-part-2/ Elixir on left, Ruby on right. Looks similar, but Elixir strings are immutable.

Slide 14

Slide 14 text

Different Syntax, Similar Semantics

VB.NET:

    Imports System
    Namespace Hello
        Class HelloWorld
            Overloads Shared Sub Main(ByVal args() As String)
                Dim name As String = "VB.NET"
                'See if argument passed
                If args.Length = 1 Then name = args(0)
                Console.WriteLine("Hello, " & name & "!")
            End Sub
        End Class
    End Namespace

C#:

    using System;
    namespace Hello
    {
        public class HelloWorld
        {
            public static void Main(string[] args)
            {
                string name = "C#";
                // See if argument passed
                if (args.Length == 1)
                    name = args[0];
                Console.WriteLine("Hello, " + name + "!");
            }
        }
    }

http://www.harding.edu/fmccown/vbnet_csharp_comparison.html VB.NET on left, C# on right. Looks different, but semantics are identical. Is this clear? Compilers deal with both; distinction is important.

Slide 15

Slide 15 text

x = x + 1; alert(x); Sequence Assign Invoke x add x 1 alert x Syntax has both a literal (characters) and a tree form. This is the “abstract” syntax tree. Believe it or not it’s the simplified version! But recognizing the syntax and semantics is only the start of what a compiler does. Let’s look at the big picture.
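The tree form of syntax maps directly onto types. Here is a minimal sketch of that tree as F# discriminated unions; the type and constructor names are hypothetical, chosen for this example rather than taken from any real compiler:

```fsharp
// The "tree form" of syntax as F# types (hypothetical names).
type Expr =
    | Var of string
    | IntLit of int
    | Add of Expr * Expr
    | Invoke of string * Expr list

type Statement =
    | Assign of string * Expr
    | ExprStatement of Expr

// "x = x + 1; alert(x);" as an abstract syntax tree:
let program =
    [ Assign ("x", Add (Var "x", IntLit 1))
      ExprStatement (Invoke ("alert", [ Var "x" ])) ]
```

Notice that the characters-level details (the `;`, the `=`, the parentheses) are already gone; only the structure remains.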

Slide 16

Slide 16 text

Compiler I have a very general notion of a compiler… Includes building EXEs, but also a lot else. Compiler produces a program, which maps input to output. But maybe you prefer Ruby or JS? (Click) Interpreter is a compiler with added user input. Maps source and input to output. I’ll talk mostly about compilers, but remember an interpreter is just a compiler with an extra feature or three.

Slide 17

Slide 17 text

Compiler Interpreter I have a very general notion of a compiler… Includes building EXEs, but also a lot else. Compiler produces a program, which maps input to output. But maybe you prefer Ruby or JS? (Click) Interpreter is a compiler with added user input. Maps source and input to output. I’ll talk mostly about compilers, but remember an interpreter is just a compiler with an extra feature or three.

Slide 18

Slide 18 text

Front End: Understand Language Back End: Emit Code Front end, back end Definitions vary, but nearly always exist A C# app has two compilers. (What are they?) csc and the JITter.

Slide 19

Slide 19 text

Lexer IL Generator Parser Type Checker Optimizer Optimizer Object Code Generator This is simplified. Production compilers have more stages. Front end, back end… Middle end? There is an intermediate representation — collection of types — for each stage. I show two optimizers here. Production compilers have more.

Slide 20

Slide 20 text

To lex parse a language, you must understand its grammar. This is one of the lovely problems in software (seriously) which can easily be formally specified! Example is part of ECMAScript grammar.

Slide 21

Slide 21 text

(inc -1) Let’s start with something simple. Simpler. We want to be able to compile this program. Is this program statically typed or dynamically typed? (Start with a toy language. Language design is harder than compiler design. Even amongst production languages, some are way harder to implement than others. Not Ruby. Not made up. Start with a Lisp.) To what? There’s more than one valid representation of this program. Here’s one (click, explain IL). Here’s another (click). How do we even decide?

Slide 22

Slide 22 text

(inc -1) Ldc.i4 -1 Ldc.i4 1 Add Let’s start with something simple. Simpler. We want to be able to compile this program. Is this program statically typed or dynamically typed? (Start with a toy language. Language design is harder than compiler design. Even amongst production languages, some are way harder to implement than others. Not Ruby. Not made up. Start with a Lisp.) To what? There’s more than one valid representation of this program. Here’s one (click, explain IL). Here’s another (click). How do we even decide?

Slide 23

Slide 23 text

(inc -1) Ldc.i4 -1 Ldc.i4 1 Add Ldc.i4.0 Let’s start with something simple. Simpler. We want to be able to compile this program. Is this program statically typed or dynamically typed? (Start with a toy language. Language design is harder than compiler design. Even amongst production languages, some are way harder to implement than others. Not Ruby. Not made up. Start with a Lisp.) To what? There’s more than one valid representation of this program. Here’s one (click, explain IL). Here’s another (click). How do we even decide?

Slide 24

Slide 24 text

(inc -1)

Lex          LeftParen, Identifier(inc), Number(-1), RightParen
Parse        Apply “inc” to -1
Type check   “inc” exists and takes an int argument, and -1 is an int. Great!
Optimize     -1 + 1 = 0, so just emit int 0!
IL generate  Ldc.i4 0
Optimize     Ldc.i4 0 → Ldc.i4.0
Object code  Produce assembly with entry point which contains the IL generated

Here’s a simple example… Note Number(-1)? Optimize 1: Compiler can do arithmetic! Optimize 2…. All we’ve done is bust the hard problem of source code → optimized EXE into a bunch of relatively small problems. Make sense? Production compiler engineers will say this is too simple, but their code, like most production code, is probably a mess. You can learn a lot from the simple case!
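The staged breakdown above can be sketched end to end in miniature. Everything below is a deliberately tiny stub with hypothetical names (the lexer just pretends it scanned `(inc -1)`, and type checking is skipped); the point is the shape, a chain of small functions where each stage consumes the previous stage's output:

```fsharp
// Hypothetical pipeline sketch, not a real compiler.
type Token = LParen | RParen | Ident of string | Num of int
type Ast = InvokeExpr of string * int
type Instr = Ldc_I4 of int | Ldc_I4_0 | Add

// Stubbed lexer: pretend we scanned "(inc -1)".
let lex (_source: string) : Token list =
    [ LParen; Ident "inc"; Num (-1); RParen ]

let parse (tokens: Token list) : Ast option =
    match tokens with
    | [ LParen; Ident name; Num n; RParen ] -> Some (InvokeExpr (name, n))
    | _ -> None

let generateIl (ast: Ast) : Instr list =
    match ast with
    | InvokeExpr ("inc", n) -> [ Ldc_I4 n; Ldc_I4 1; Add ]
    | InvokeExpr _ -> []

// Optimize 1: the compiler can do arithmetic.
let foldConstants (il: Instr list) : Instr list =
    match il with
    | [ Ldc_I4 a; Ldc_I4 b; Add ] -> [ Ldc_I4 (a + b) ]
    | other -> other

// Optimize 2: prefer the short encoding for small constants.
let shortenEncodings : Instr list -> Instr list =
    List.map (function Ldc_I4 0 -> Ldc_I4_0 | instr -> instr)

let compile source =
    source |> lex |> parse |> Option.map (generateIl >> foldConstants >> shortenEncodings)
```

Composing the stages, `compile "(inc -1)"` walks the same path as the table above and ends at the single instruction `Ldc.i4.0`.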

Slide 25

Slide 25 text

(defun add-1 (x) (inc x)) (defun main () (print (add-1 2))) Or maybe this one.

Slide 26

Slide 26 text

There Are No Edge Cases In Programming Languages

Duff’s Device:

    send(to, from, count)
    register short *to, *from;
    register count;
    {
        register n = (count + 7) / 8;
        switch (count % 8) {
        case 0: do { *to = *from++;
        case 7:      *to = *from++;
        case 6:      *to = *from++;
        case 5:      *to = *from++;
        case 4:      *to = *from++;
        case 3:      *to = *from++;
        case 2:      *to = *from++;
        case 1:      *to = *from++;
                } while (--n > 0);
        }
    }

We must consider every possible program! People will do absolutely anything the PL grammar allows. Therefore a compiler must be able to classify literally any arbitrary string as a valid or invalid program. There are a lot of possible strings!

Slide 27

Slide 27 text

Grammar

    <program>    := <expression> | <expression> <program>
    <expression> := <defun> | <value>
    <defun>      := “(defun” identifier <arguments> <body> “)”
    <value>      := number | string | <invoke>
    <invoke>     := “(” identifier <value> “)”
    identifier   := “…”
    number       := “…”
    string       := “…”

Of course an acceptable program can’t really be any random string. There are rules. TDD vs. formal methods. I like tests, but… totality. With a reasonably specified language (good: C#; bad: Ruby), the rules are pretty easy to follow. Our goal today is to implement a really simple language. We’ll follow the grammar here. Terminals vs. nonterminals. Important for lexing vs. parsing.
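Nonterminals in a grammar like this translate almost mechanically into types. A sketch of that translation in F#, with hypothetical names rather than the TinyLanguage ones:

```fsharp
// Each grammar production becomes a union case (hypothetical names).
type Expression =
    | Number of int                                   // number
    | String of string                                // string
    | Invoke of name: string * argument: Expression   // "(" identifier value ")"
    | Defun of name: string * argument: string * body: Expression

type Program = Expression list

// The program "(inc -1)" as a value of this type:
let example = Invoke ("inc", Number (-1))
```

The terminals (parentheses, identifiers, numbers) are the lexer's job; the nonterminals are the parser's output.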

Slide 28

Slide 28 text

Lexing (inc -1) Start with this, transform to (click). Haven’t tried to infer any kind of meaning at all yet. What’s the point of lexing vs parsing? Output of lexing is terminals of the grammar used by the parser. Lexer sets the parser up for success. Makes the formalism possible.

Slide 29

Slide 29 text

Lexing (inc -1) “(“ “inc” “-1” “)” LeftParenthesis Identifier(“inc”) LiteralInt(-1) RightParenthesis Start with this, transform to (click). Haven’t tried to infer any kind of meaning at all yet. What’s the point of lexing vs parsing? Output of lexing is terminals of the grammar used by the parser. Lexer sets the parser up for success. Makes the formalism possible.

Slide 30

Slide 30 text

Lexing ( -1) Possible errors: Something which can’t be recognized as a terminal for the grammar.

Slide 31

Slide 31 text

Lexing

    let rec private lexChars (source: char list) : Lexeme list =
        match source with
        | '(' :: rest → LeftParenthesis :: lexChars rest
        | ')' :: rest → RightParenthesis :: lexChars rest
        | '"' :: rest → lexString(rest, "")
        | c :: rest when isIdentifierStart c → lexName (source, "")
        | d :: rest when System.Char.IsDigit d → lexNumber(source, "")
        | [] → []
        | w :: rest when System.Char.IsWhiteSpace w → lexChars rest
        | c :: rest → Unrecognized c :: lexChars rest

I write all of my compiler code as purely functional. Most example compiler code requires understanding the state of the compiler as well as the behavior of the code. You can understand this code from the code alone. Lexer is recursive: Look at first char, decide what to do. How do you eat an elephant?

Slide 32

Slide 32 text

Parsing LeftParenthesis Identifier(“inc”) LiteralInt(-1) RightParenthesis Start with a list of tokens from the lexer. Click Produce syntax tree. Here we have a const -1; could be another expression like another invocation.

Slide 33

Slide 33 text

Parsing LeftParenthesis Identifier(“inc”) LiteralInt(-1) RightParenthesis Invoke “inc” -1 Start with a list of tokens from the lexer. Click Produce syntax tree. Here we have a const -1; could be another expression like another invocation.

Slide 34

Slide 34 text

Parsing LeftParenthesis Identifier(“inc”) LiteralInt(-1) LiteralInt(-1) Possible errors: Bad syntax. Every stage of compiler checks the form of the stage previous. (click) Parser checks lexer output.

Slide 35

Slide 35 text

Parsing LeftParenthesis Identifier(“inc”) LiteralInt(-1) LiteralInt(-1) “Expected ‘)’” Possible errors: Bad syntax. Every stage of compiler checks the form of the stage previous. (click) Parser checks lexer output.

Slide 36

Slide 36 text

Parsing

    let rec private parseExpression (state : ParseState): ParseState =
        match state.Remaining with
        | LeftParenthesis :: Identifier "defun" :: Identifier name :: rest →
            let defun = parseDefun (name, { state with Remaining = rest })
            match defun.Expressions, defun.Remaining with
            | [ ErrorExpr _ ], _ → defun
            | _, RightParenthesis :: remaining → { defun with Remaining = remaining }
            | _, [] → error ("Expected ')'.")
            | _, wrong :: _ → error (sprintf "Expected ')'; found %A." wrong)
        | LeftParenthesis :: Identifier name :: argumentsAndBody →
            let invoke = parseInvoke (name, { state with Remaining = argumentsAndBody })
            match invoke.Remaining with
            | RightParenthesis :: remaining → { invoke with Remaining = remaining }
            | [] → error ("Expected ')'.")
            | wrong :: _ → error (sprintf "Expected ')'; found %A." wrong)
        | LeftParenthesis :: wrong → error (sprintf "%A cannot follow '('." wrong)

Do not optimize. Do not type check. Just look for valid syntax.
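The same idea can be shown in a self-contained miniature: the parser consumes the lexer's terminals and either builds a tree or explains why it can't. All names below are hypothetical simplifications, not the real parser:

```fsharp
// Simplified sketch: parse exactly one "(identifier int)" invocation.
type Lexeme = LeftParenthesis | RightParenthesis | Identifier of string | LiteralInt of int
type Expression = IntExpr of int | InvokeExpr of string * Expression

let parse (lexemes: Lexeme list) : Result<Expression, string> =
    match lexemes with
    | [ LeftParenthesis; Identifier name; LiteralInt n; RightParenthesis ] ->
        Ok (InvokeExpr (name, IntExpr n))
    | LeftParenthesis :: Identifier _ :: LiteralInt _ :: rest ->
        Error (sprintf "Expected ')'; found %A." rest)   // the missing-paren case
    | _ ->
        Error "Expected '('."
```

Note that every failure produces an explanation, not a crash; this is the "total function" idea again.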

Slide 37

Slide 37 text

Optimization (I) Invoke “inc” -1 There cannot be any errors. Unlike other phases.

Slide 38

Slide 38 text

Optimization (I) Invoke “inc” -1 Constant Int 0 There cannot be any errors. Unlike other phases.
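Because the optimizer takes a valid tree and returns a valid tree, it is a total function with no error cases. A sketch of this rewrite with hypothetical type names (assuming `inc` is known to be pure):

```fsharp
// AST-level constant folding, sketched (hypothetical names).
type Expression =
    | Constant of int
    | Invoke of string * Expression

let rec optimize (expression: Expression) : Expression =
    match expression with
    | Invoke ("inc", argument) ->
        (match optimize argument with
         | Constant n -> Constant (n + 1)           // fold (inc -1) down to a constant
         | optimized -> Invoke ("inc", optimized))  // can't fold; keep the call
    | other -> other                                // nothing to do; return unchanged
```

Feeding it the tree for `(inc -1)` yields `Constant 0`; feeding it anything it doesn't recognize returns the input unchanged, never an error.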

Slide 39

Slide 39 text

Optimization Fight the Urge to Optimize Outside the Optimizers!

Slide 40

Slide 40 text

Optimization (I) Invoke “some-method” -1 Often, you’ll do nothing.

Slide 41

Slide 41 text

Optimization (I) Invoke “some-method” -1 Invoke “some-method” -1 Often, you’ll do nothing.

Slide 42

Slide 42 text

Binding / Type Checking Invoke “inc” -1 Start with an abstract syntax tree. Click. Produce a binding tree.

Slide 43

Slide 43 text

Binding / Type Checking

Invoke “inc” -1

    ("inc", FunctionBinding {
        Name = "inc"
        Argument = { Name = "value"; ArgumentType = IntType }
        Body = IntBinding 0
        ResultType = IntType })

    InvokeBinding {
        Name = "inc"
        Argument = IntBinding -1
        ResultType = IntType }

Start with an abstract syntax tree. Click. Produce a binding tree.

Slide 44

Slide 44 text

Binding / Type Checking

    InvokeBinding {
        Name = "inc"
        Argument = StringBinding "Oops!"
        ResultType = IntType }

Possible errors.

Slide 45

Slide 45 text

Binding / Type Checking

    InvokeBinding {
        Name = "inc"
        Argument = StringBinding "Oops!"
        ResultType = IntType }

“Expected integer; found ‘Oops!’.”

Possible errors.

Slide 46

Slide 46 text

Binding / Type Checking

    [<Test>]
    member this.``should return error for unbound invocation``() =
        let source = "(bad-method 2)"
        let expected = [ Ignore (ErrorBinding "Undefined function 'bad-method'.") ]
        let actual = bind source
        actual |> should equal expected

Slide 47

Slide 47 text

Binding / Type Checking

    let rec private toBinding (environment: Map<string, Binding>) (expression : Expression) : Bind…
        match expression with
        | IntExpr n → IntBinding n, None
        | StringExpr str → StringBinding str, None

Now we’re moving to a semantic understanding of the code. Some of this is easy!

Slide 48

Slide 48 text

Binding / Type Checking Some of it is considerably harder. Type checking is easier than type inference, but both are mostly solved problems, at least for simpler cases. Now we’re going to a semantic understanding of the code. Multiple formalisms are involved: Type theory, denotational semantics, operational semantics.
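The "easy" end of checking can be sketched in a few lines: compare a declared parameter type with the type of the actual argument. Types and names below are hypothetical simplifications, not the TinyLanguage binder:

```fsharp
// Simplified type check for one invocation argument (hypothetical names).
type ExpressionType = IntType | StringType

type Binding =
    | IntBinding of int
    | StringBinding of string

let typeOf (binding: Binding) : ExpressionType =
    match binding with
    | IntBinding _ -> IntType
    | StringBinding _ -> StringType

// The declared parameter type must match the argument's type.
let checkArgument (expected: ExpressionType) (argument: Binding) : Result<Binding, string> =
    if typeOf argument = expected
    then Ok argument
    else Error (sprintf "Expected %A; found %A." expected argument)
```

So checking `(inc "Oops!")` against an int parameter produces an error value, exactly like the “Expected integer; found ‘Oops!’.” slide; the hard end (inference, generics) needs the heavier formalisms named above.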

Slide 49

Slide 49 text

IL Generation IntBinding 0 If we start from a simple binding… (click) We want to produce If you’ve seen IL before, you may wonder why not (click). More optimized! We’ll get to that.

Slide 50

Slide 50 text

IL Generation IntBinding 0 Ldc.i4 0 If we start from a simple binding… (click) We want to produce If you’ve seen IL before, you may wonder why not (click). More optimized! We’ll get to that.

Slide 51

Slide 51 text

IL Generation IntBinding 0 Ldc.i4 0 Ldc.i4.0 If we start from a simple binding… (click) We want to produce If you’ve seen IL before, you may wonder why not (click). More optimized! We’ll get to that.

Slide 52

Slide 52 text

Optimization (II) Ldc.i4 0

Slide 53

Slide 53 text

Optimization (II) Ldc.i4 0 Ldc.i4.0

Slide 54

Slide 54 text

Optimization (II)

    let private optimalShortEncodingFor = function
        | Ldc_I4 0 → Ldc_I4_0
        | Ldc_I4 1 → Ldc_I4_1
        | Ldc_I4 2 → Ldc_I4_2
        | Ldc_I4 3 → Ldc_I4_3
        | Ldc_I4 4 → Ldc_I4_4
        | Ldc_I4 5 → Ldc_I4_5
        | Ldc_I4 6 → Ldc_I4_6
        | Ldc_I4 7 → Ldc_I4_7
        | Ldc_I4 8 → Ldc_I4_8
        | Ldloc 0 → Ldloc_0
        | Ldloc 1 → Ldloc_1
        | Ldloc 2 → Ldloc_2
        | Ldloc 3 → Ldloc_3
        | Ldloc i when i <= maxByte → Ldloc_S(Convert.ToByte(i))

Anything not listed here stays the same. No errors!

Slide 55

Slide 55 text

Special Tools! PE Verify, ildasm. Similar tools exist for other platforms.

Slide 56

Slide 56 text

Compare! When in doubt, try it in C#, and see what that compiler emits. Sometimes it’s a little strange. Roll with it.

Slide 57

Slide 57 text

https://www.ece.cmu.edu/~ganger/712.fall02/papers/p761-thompson.pdf OK, this is weird.

Slide 58

Slide 58 text

Further Reading Don’t buy the dragon book. Universally recommended by people who haven’t read it, or who have read nothing else.

Slide 59

Slide 59 text

Further Reading • Programming Language Concepts, by Peter Sestoft • Modern Compiler Implementation in ML, by Andrew W. Appel

Slide 60

Slide 60 text

Craig Stuntz @craigstuntz [email protected] http://blogs.teamb.com/craigstuntz http://www.meetup.com/Papers-We-Love-Columbus/