
Programs That Write Programs


While production compilers can be quite complicated, the principles of compiler design are not too hard to learn, and are broadly applicable to many seemingly difficult programming problems. In this session you will learn to write a compiler for a toy language, targeting the .NET runtime, in F#, even if you have no previous experience with compilers, F#, or functional programming. You will learn how every part of a modern compiler toolchain works, including lexing, parsing, optimization, and code generation.


Craig Stuntz

October 08, 2015

Transcript

  1. Programs that Write Programs Craig Stuntz https://speakerdeck.com/craigstuntz https://github.com/CraigStuntz/TinyLanguage What is

    a compiler?
 What is the formalism? Different from TDD. How do we drive code implementation via the formalism?
  2. A → B Why bother learning compilers? I think it’s

    fun & interesting in its own right. Maybe you agree, since you’re here. But you might think you don’t have the opportunity to use this stuff in your day job. Maybe you can! A compiler is a formalization for taking data in one representation and turning it into data in a different representation. The transformation (from A to B) is often nontrivial.
  3. –Steve Yegge “You're actually surrounded by compilation problems. You run

    into them almost every day.” http://steve-yegge.blogspot.ca/2007/06/rich-programmer-food.html This is, more or less, every problem in software engineering. Every problem.
  4. Source code → Program JPEG file → Image on screen

    Source code → Potential style error list JSON → Object graph Code with 2 digit years → Y2K compliant code VB6 → C# Object graph → User interface markup Algorithm → Faster, equivalent algorithm Some examples Some of these look like what we typically consider a compiler. Some don’t. None are contrived. I do all of these in my day job.
  5. Useful Bits • Regular Expressions (lexing) • Deserializers (parsing) •

    Linters, static analysis (syntax, type checking) • Solvers, theorem provers (optimization) • Code migration tools (compilers!) Also, most of the individual pieces of the compilation pipeline are useful in their own right When you learn to write a compiler, you learn all of the above, and more!
  6. –Greenspun’s Tenth Rule “Any sufficiently complicated C or Fortran program

    contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.” Another good reason: You will be doing this stuff anyway; may as well do it right! DIBOL story.
  7. 1 + 2 + 3 + … + 100 =

    100 * 101 / 2 = 5050 Another reason compilers are interesting is they try to solve a very hard problem: Don’t just convert A to B. Convert A into a highly-optimized-but-entirely-equivalent B. For any given A whatsoever! If a compiler crashes, that’s bad. If it silently emits incorrect code, it’s worse than bad; the world ends. Hard problems are interesting!
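The arithmetic above is exactly the optimizer's obligation: produce the same answer for every input, not just one convenient case. A hypothetical Python sketch of the idea (the talk's code is F#; these function names are made up):

```python
# A "slow" program and its optimized-but-entirely-equivalent rewrite.
def sum_slow(n):
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

def sum_fast(n):
    # Gauss's closed form: 1 + 2 + ... + n = n * (n + 1) / 2
    return n * (n + 1) // 2

# The optimizer's burden of proof: equivalence for any input whatsoever.
assert all(sum_slow(n) == sum_fast(n) for n in range(200))
print(sum_fast(100))  # → 5050
```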
  8. Good News! Although this is a hard problem, it’s a

    mostly solved problem. “Solved” not as in it will never get better. “Solved” as in there is a recipe. In fact, if you have a hard problem and can distort it until it looks like a compiler then you have a solved problem. Contrast this with web development, which is also hard, but nobody has any bloody idea what they’re doing.
  9. –Eugene Wallingford “…compilers ultimately depend on a single big idea

    from the theory of computer science: that a certain kind of machine can simulate anything — including itself. As a result, this certain kind of machine, the Turing machine, is the very definition of computability.” http://www.cs.uni.edu/~wallingf/blog/archives/monthly/2015-09.html#e2015-09-03T15_26_47.htm What Is a Compiler, Really? (pause) It’s a program that writes a program! Or a program which takes a representation and writes a representation. Often, it’s a program that can write any possible program. Many compilers can compile themselves, producing their own executable as output. There are weird and non-obvious consequences of this! This idea on the screen — this is huge!
  10. code exe Actually, I lied just slightly. We like to

    think of a compiler as code -> exe, but this is mostly wrong. (Click) Primary purpose is to reject incorrect code. Maybe 1 in 10 times we produce a program. A good compiler is the ultimate total function: Take any random string and either produce a program or explain clearly why it’s not a program.
  11. code exe Actually, I lied just slightly. We like to

    think of a compiler as code -> exe, but this is mostly wrong. (Click) Primary purpose is to reject incorrect code. Maybe 1 in 10 times we produce a program. A good compiler is the ultimate total function: Take any random string and either produce a program or explain clearly why it’s not a program.
  12. Syntax and Semantics • Java and JavaScript have similar syntax,

    but different semantics. Ruby and Elixir have similar syntax, but different semantics • C# and VB.NET have different syntax, but fairly similar semantics A compiler must handle both the written form (syntax) and the internal meaning (semantics) of the language. People obsess about syntax. They reject LISPs because (). But semantics are, I think, most important. Nobody says we don’t need JavaScript since we have Java. People do ask if we need VB.NET since there is C# (syntactically different, but semantically equivalent). Syntax is really important!
  13. Similar Syntax, Different Semantics name = "Nate" # => "Nate"

    String.upcase(name) # => "NATE" name # => "Nate" name = "Nate" # => "Nate" name.upcase! # => "NATE" name # => "NATE" http://www.natescottwest.com/elixir-for-rubyists-part-2/ Elixir on left, Ruby on right. Looks similar, but Elixir strings are immutable.
  14. Different Syntax, Similar Semantics Imports System Namespace Hello Class HelloWorld

    Overloads Shared Sub Main(ByVal args() As String) Dim name As String = "VB.NET" 'See if argument passed If args.Length = 1 Then name = args(0) Console.WriteLine("Hello, " & name & "!") End Sub End Class End Namespace using System; namespace Hello { public class HelloWorld { public static void Main(string[] args) { string name = "C#"; // See if argument passed if (args.Length == 1) name = args[0]; Console.WriteLine("Hello, " + name + "!"); } } } http://www.harding.edu/fmccown/vbnet_csharp_comparison.html VB.NET on left, C# on right. Looks different, but semantics are identical. Is this clear? Compilers deal with both; distinction is important.
  15. x = x + 1; alert(x); Sequence Assign Invoke x

    add x 1 alert x Syntax has both a literal (characters) and a tree form. This is the “abstract” syntax tree. Believe it or not it’s the simplified version! But recognizing the syntax and semantics is only the start of what a compiler does. Let’s look at the big picture.
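The tree on the slide can be written down directly as data. A hypothetical Python sketch (node names echo the slide; the talk itself uses F# types):

```python
from dataclasses import dataclass
from typing import List

# Hypothetical node types mirroring the abstract syntax tree on the slide.
@dataclass
class Var:
    name: str

@dataclass
class Const:
    value: int

@dataclass
class Add:
    left: object
    right: object

@dataclass
class Assign:
    target: str
    value: object

@dataclass
class Invoke:
    func: str
    argument: object

@dataclass
class Sequence:
    statements: List[object]

# x = x + 1; alert(x) as a tree:
program = Sequence([
    Assign("x", Add(Var("x"), Const(1))),
    Invoke("alert", Var("x")),
])
```

The literal characters and this tree carry the same syntax; the tree is just the form the rest of the compiler can walk.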
  16. Compiler I have a very general notion of a compiler…

    Includes building EXEs, but also a lot else. Compiler produces a program, which maps input to output. But maybe you prefer Ruby or JS? (Click) Interpreter is a compiler with added user input. Maps source and input to output. I’ll talk mostly about compilers, but remember an interpreter is just a compiler with an extra feature or three.
  17. Compiler Interpreter I have a very general notion of a

    compiler… Includes building EXEs, but also a lot else. Compiler produces a program, which maps input to output. But maybe you prefer Ruby or JS? (Click) Interpreter is a compiler with added user input. Maps source and input to output. I’ll talk mostly about compilers, but remember an interpreter is just a compiler with an extra feature or three.
  18. Front End: Understand Language Back End: Emit Code Front end,

    back end Definitions vary, but nearly always exist A C# app has two compilers. (What are they?) csc and the JITter.
  19. Lexer IL Generator Parser Type Checker Optimizer Optimizer Object Code

    Generator This is simplified. Production compilers have more stages. Front end, back end… Middle end? There is an intermediate representation — collection of types — for each stage. I show two optimizers here. Production compilers have more.
  20. To lex and parse a language, you must understand its grammar.

    This is one of the lovely problems in software (seriously) which can easily be formally specified! Example is part of ECMAScript grammar.
  21. (inc -1) Let’s start with something simple. Simpler. We want

    to be able to compile this program. Is this program statically typed or dynamically typed? (Start with a toy language. Language design is harder than compiler design. Even amongst production languages, some are way harder to implement than others. Not Ruby. Not made up. Start with a Lisp.) To what? There’s more than one valid representation of this program. Here’s one (click, explain IL). Here’s another (click). How do we even decide?
  22. (inc -1) Ldc.i4 -1 Ldc.i4 1 Add Let’s start with

    something simple. Simpler. We want to be able to compile this program. Is this program statically typed or dynamically typed? (Start with a toy language. Language design is harder than compiler design. Even amongst production languages, some are way harder to implement than others. Not Ruby. Not made up. Start with a Lisp.) To what? There’s more than one valid representation of this program. Here’s one (click, explain IL). Here’s another (click). How do we even decide?
  23. (inc -1) Ldc.i4 -1 Ldc.i4 1 Add Ldc.i4.0 Let’s start

    with something simple. Simpler. We want to be able to compile this program. Is this program statically typed or dynamically typed? (Start with a toy language. Language design is harder than compiler design. Even amongst production languages, some are way harder to implement than others. Not Ruby. Not made up. Start with a Lisp.) To what? There’s more than one valid representation of this program. Here’s one (click, explain IL). Here’s another (click). How do we even decide?
  24. (inc -1) Lex LeftParen, Identifier(inc), Number(-1), RightParen Parse Apply “inc”

    to -1 Type check “inc” exists and takes an int argument, and -1 is an int. Great! Optimize -1 + 1 = 0, so just emit int 0! IL generate Ldc.i4 0 Optimize Ldc.i4 0 → Ldc.i4.0 Object code Produce assembly with entry point which contains the IL generated Here’s a simple example… Note Number(-1)? Optimize 1: Compiler can do arithmetic! Optimize 2…. All we’ve done is bust hard problem of source code -> optimized EXE into a bunch of relatively small problems. Make sense? Production compiler engineers will say this is too simple, but their code, like most production code, is probably a mess. You can learn a lot from the simple case!
  25. (defun add-1 (x) (inc x)) (defun main () (print (add-1

    2))) Or maybe this one.
  26. There Are No Edge Cases In Programming Languages Duff’s Device

    send(to, from, count) register short *to, *from; register count; { register n = (count + 7) / 8; switch (count % 8) { case 0: do { *to = *from++; case 7: *to = *from++; case 6: *to = *from++; case 5: *to = *from++; case 4: *to = *from++; case 3: *to = *from++; case 2: *to = *from++; case 1: *to = *from++; } while (--n > 0); } } We must consider every possible program! People will do absolutely anything the PL grammar allows. Therefore a compiler must be able to classify literally any arbitrary string into a valid or invalid program. There are a lot of possible strings!
  27. Grammar <program> := <statement> | <program> <statement> <statement> := <defun>

    | <expr> <defun> := “(defun” identifier <expr> <expr> “)” <expr> := number | string | <invoke> <invoke> := “(” identifier <expr> “)” identifier := … number := … string := … Of course an acceptable program can’t really be any random string. There are rules. TDD vs. formal methods. I like tests, but… totality. With a reasonably specified language (good: C#; bad: Ruby), the rules are pretty easy to follow. Our goal today is to implement a really simple language. We’ll follow the grammar here. Terminals vs. nonterminals. Important for lexing vs. parsing.
  28. Lexing (inc -1) Start with this, transform to (click). Haven’t

    tried to infer any kind of meaning at all yet. What’s the point of lexing vs parsing? Output of lexing is terminals of the grammar used by the parser. Lexer sets the parser up for success. Makes the formalism possible.
  29. Lexing (inc -1) “(“ “inc” “-1” “)” LeftParenthesis Identifier(“inc”) LiteralInt(-1)

    RightParenthesis Start with this, transform to (click). Haven’t tried to infer any kind of meaning at all yet. What’s the point of lexing vs parsing? Output of lexing is terminals of the grammar used by the parser. Lexer sets the parser up for success. Makes the formalism possible.
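The token stream on the slide can be reproduced with a small hand-rolled lexer. A hedged Python sketch (the talk's lexer is F#, shown on a later slide; these token names simply echo the slide):

```python
def lex(source: str):
    """Turn '(inc -1)' into grammar terminals, inferring no meaning yet."""
    tokens, i = [], 0
    while i < len(source):
        c = source[i]
        if c == '(':
            tokens.append(("LeftParenthesis",)); i += 1
        elif c == ')':
            tokens.append(("RightParenthesis",)); i += 1
        elif c.isspace():
            i += 1
        elif (c == '-' and i + 1 < len(source) and source[i + 1].isdigit()) or c.isdigit():
            # Note: the minus sign is lexed as part of the number, so we
            # get LiteralInt(-1), not Minus followed by LiteralInt(1).
            j = i + 1
            while j < len(source) and source[j].isdigit():
                j += 1
            tokens.append(("LiteralInt", int(source[i:j]))); i = j
        elif c.isalpha():
            j = i
            while j < len(source) and (source[j].isalnum() or source[j] == '-'):
                j += 1
            tokens.append(("Identifier", source[i:j])); i = j
        else:
            # Anything else can't be a terminal of the grammar.
            tokens.append(("Unrecognized", c)); i += 1
    return tokens

print(lex("(inc -1)"))
# [('LeftParenthesis',), ('Identifier', 'inc'), ('LiteralInt', -1), ('RightParenthesis',)]
```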
  30. Lexing ( -1) Possible errors: Something which can’t be recognized

    as a terminal for the grammar.
  31. Lexing let rec private lexChars (source: char list) : Lexeme

    list = match source with | '(' :: rest → LeftParenthesis :: lexChars rest | ')' :: rest → RightParenthesis :: lexChars rest | '"' :: rest → lexString(rest, "") | c :: rest when isIdentifierStart c → lexName (source, "") | d :: rest when System.Char.IsDigit d → lexNumber(source, "") | [] → [] | w :: rest when System.Char.IsWhiteSpace w → lexChars rest | c :: rest → Unrecognized c :: lexChars rest I write all of my compiler code as purely functional. Most example compiler code requires understanding the state of the compiler as well as the behavior of the code. You can understand this code from the code alone. Lexer is recursive: Look at first char, decide what to do. How do you eat an elephant?
  32. Parsing LeftParenthesis Identifier(“inc”) LiteralInt(-1) RightParenthesis Start with a list of

    tokens from the lexer. Click Produce syntax tree. Here we have a const -1; could be another expression like another invocation.
  33. Parsing LeftParenthesis Identifier(“inc”) LiteralInt(-1) RightParenthesis Invoke “inc” -1 Start with

    a list of tokens from the lexer. Click Produce syntax tree. Here we have a const -1; could be another expression like another invocation.
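The parser's job — turning the lexer's terminals into the tree — follows the grammar on slide 27. A hypothetical recursive-descent sketch in Python (not the talk's F# code, which appears on slide 36):

```python
def parse_expression(tokens):
    """Parse one <expr> from a token list; return (tree, remaining tokens)."""
    head, rest = tokens[0], tokens[1:]
    if head[0] == "LiteralInt":
        return ("Constant", head[1]), rest
    if head[0] == "LeftParenthesis" and rest and rest[0][0] == "Identifier":
        # <invoke> := "(" identifier <expr> ")"
        name = rest[0][1]
        argument, rest = parse_expression(rest[1:])
        if not rest or rest[0][0] != "RightParenthesis":
            return ("Error", "Expected ')'."), rest
        return ("Invoke", name, argument), rest[1:]
    return ("Error", "Unexpected token %r." % (head,)), rest

tokens = [("LeftParenthesis",), ("Identifier", "inc"),
          ("LiteralInt", -1), ("RightParenthesis",)]
tree, remaining = parse_expression(tokens)
print(tree)       # ('Invoke', 'inc', ('Constant', -1))
print(remaining)  # []
```

Note the error path: a missing right parenthesis yields an error value rather than an exception, so every token list maps to either a tree or an explanation.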
  34. Parsing LeftParenthesis Identifier(“inc”) LiteralInt(-1) LiteralInt(-1) Possible errors: Bad syntax. Every

    stage of the compiler checks the form of the stage previous. (click) Parser checks lexer output.
  35. Parsing LeftParenthesis Identifier(“inc”) LiteralInt(-1) LiteralInt(-1) “Expected ‘)’” Possible errors: Bad

    syntax. Every stage of the compiler checks the form of the stage previous. (click) Parser checks lexer output.
  36. Parsing let rec private parseExpression (state : ParseState): ParseState =

    match state.Remaining with | LeftParenthesis :: Identifier "defun" :: Identifier name :: rest → let defun = parseDefun (name, { state with Remaining = rest }) match defun.Expressions, defun.Remaining with | [ ErrorExpr _ ], _ → defun | _, RightParenthesis :: remaining → { defun with Remaining = remaining } | _, [] → error ("Expected ')'.") | _, wrong :: _ → error (sprintf "Expected ')'; found %A." wrong) | LeftParenthesis :: Identifier name :: argumentsAndBody → let invoke = parseInvoke (name, { state with Remaining = argumentsAndBody }) match invoke.Remaining with | RightParenthesis :: remaining → { invoke with Remaining = remaining } | [] → error ("Expected ')'.") | wrong :: _ → error (sprintf "Expected ')'; found %A." wrong) | LeftParenthesis :: wrong → error (sprintf "%A cannot follow '('." wrong) Do not optimize. Do not type check. Just look for valid syntax.
  37. Optimization (I) Invoke “inc” -1 There cannot be any errors.

    Unlike other phases.
  38. Optimization (I) Invoke “inc” -1 Constant Int 0 There cannot

    be any errors. Unlike other phases.
  39. Optimization Fight the Urge to Optimize Outside the Optimizers!

  40. Optimization (I) Invoke “some-method” -1 Often, you’ll do nothing.

  41. Optimization (I) Invoke “some-method” -1 Invoke “some-method” -1 Often, you’ll

    do nothing.
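Both behaviors — fold `(inc -1)` down to a constant, leave `some-method` alone — fit in one recursive rewrite. A hypothetical Python sketch using tuple-shaped trees (not the talk's F#):

```python
def optimize(tree):
    """Fold calls to 'inc' on constant arguments; leave everything else alone."""
    if tree[0] == "Invoke":
        _, name, argument = tree
        argument = optimize(argument)           # optimize inside-out
        if name == "inc" and argument[0] == "Constant":
            # The compiler can do arithmetic: -1 + 1 = 0.
            return ("Constant", argument[1] + 1)
        return ("Invoke", name, argument)       # often, you'll do nothing
    return tree

print(optimize(("Invoke", "inc", ("Constant", -1))))
# ('Constant', 0)
print(optimize(("Invoke", "some-method", ("Constant", -1))))
# ('Invoke', 'some-method', ('Constant', -1))
```

This stage can never produce an error: its input was already checked, and the identity rewrite is always a legal answer.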
  42. Binding / Type Checking Invoke “inc” -1 Start with an

    abstract syntax tree. Click. Produce a binding tree.
  43. ("inc", Function Binding { Name = "inc" Argument = {

    Name = "value" ArgumentType = IntType } Body = IntBinding 0; ResultType = IntType }) InvokeBinding { Name = "inc" Argument = IntBinding -1 ResultType = IntType } Binding / Type Checking Invoke “inc” -1 Start with an abstract syntax tree. Click. Produce a binding tree.
  44. InvokeBinding { Name = "inc" Argument = StringBinding "Oops!"

    ResultType = IntType } Binding / Type Checking Possible errors.
  45. InvokeBinding { Name = "inc" Argument = StringBinding "Oops!"

    ResultType = IntType } Binding / Type Checking “Expected integer; found ‘Oops!’.” Possible errors.
  46. Binding / Type Checking [<Test>] member this.``should return error for

    unbound invocation``() = let source = "(bad-method 2)" let expected = [ Ignore (ErrorBinding "Undefined function 'bad-method'.") ] let actual = bind source actual |> should equal expected
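The binder's three outcomes shown above — a well-typed InvokeBinding, a type error, an unbound name — can be mimicked in a few lines. A hedged Python sketch; every record shape and message here simply echoes the slides:

```python
# Hypothetical environment: the names the binder knows, with their types.
ENVIRONMENT = {"inc": {"argument_type": "IntType", "result_type": "IntType"}}

def bind(tree):
    """Turn a syntax tree into a binding tree, or an error binding."""
    if tree[0] == "Constant":
        value = tree[1]
        if isinstance(value, int):
            return {"kind": "IntBinding", "value": value, "type": "IntType"}
        return {"kind": "StringBinding", "value": value, "type": "StringType"}
    if tree[0] == "Invoke":
        _, name, arg = tree
        function = ENVIRONMENT.get(name)
        if function is None:
            return {"kind": "ErrorBinding",
                    "message": "Undefined function '%s'." % name}
        argument = bind(arg)
        if argument.get("type") != function["argument_type"]:
            return {"kind": "ErrorBinding",
                    "message": "Expected integer; found %r." % (arg[1],)}
        return {"kind": "InvokeBinding", "name": name,
                "argument": argument, "type": function["result_type"]}

print(bind(("Invoke", "inc", ("Constant", -1)))["kind"])         # InvokeBinding
print(bind(("Invoke", "inc", ("Constant", "Oops!")))["message"])
# Expected integer; found 'Oops!'.
print(bind(("Invoke", "bad-method", ("Constant", 2)))["message"])
# Undefined function 'bad-method'.
```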
  47. Binding / Type Checking let rec private toBinding (environment: Map<string,

    Binding>) (expression : Expression) : Bind match expression with | IntExpr n → IntBinding n, None | StringExpr str → StringBinding str, None Now we’re getting to a semantic understanding of the code. Some of this is easy!
  48. Binding / Type Checking Some of it is considerably harder.

    Type checking is easier than type inference, but both are mostly solved problems, at least for simpler cases. Now we’re getting to a semantic understanding of the code. Multiple formalisms are involved: Type theory, denotational semantics, operational semantics.
  49. IL Generation IntBinding 0 If we start from a simple

    binding… (click) We want to produce If you’ve seen IL before, you may wonder why not (click). More optimized! We’ll get to that.
  50. IL Generation IntBinding 0 Ldc.i4 0 If we start from

    a simple binding… (click) We want to produce If you’ve seen IL before, you may wonder why not (click). More optimized! We’ll get to that.
  51. IL Generation IntBinding 0 Ldc.i4 0 Ldc.i4.0 If we start

    from a simple binding… (click) We want to produce If you’ve seen IL before, you may wonder why not (click). More optimized! We’ll get to that.
  52. Optimization (II) Ldc.i4 0

  53. Optimization (II) Ldc.i4 0 Ldc.i4.0

  54. Optimization (II) let private optimalShortEncodingFor = function | Ldc_I4 0

    → Ldc_I4_0 | Ldc_I4 1 → Ldc_I4_1 | Ldc_I4 2 → Ldc_I4_2 | Ldc_I4 3 → Ldc_I4_3 | Ldc_I4 4 → Ldc_I4_4 | Ldc_I4 5 → Ldc_I4_5 | Ldc_I4 6 → Ldc_I4_6 | Ldc_I4 7 → Ldc_I4_7 | Ldc_I4 8 → Ldc_I4_8 | Ldloc 0 → Ldloc_0 | Ldloc 1 → Ldloc_1 | Ldloc 2 → Ldloc_2 | Ldloc 3 → Ldloc_3 | Ldloc i when i <= maxByte → Ldloc_S(Convert.ToByte(i)) Anything not listed here stays the same. No errors!
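The table reads the same in any language. A hypothetical Python rendering of the peephole (string-named opcodes as stand-ins; not real IL emission):

```python
MAX_BYTE = 255  # largest operand the short Ldloc_S encoding can hold

def optimal_short_encoding(instruction):
    """Rewrite a general IL encoding to its short form when one exists."""
    op, arg = instruction
    if op == "Ldc_I4" and 0 <= arg <= 8:
        return ("Ldc_I4_%d" % arg,)
    if op == "Ldloc" and 0 <= arg <= 3:
        return ("Ldloc_%d" % arg,)
    if op == "Ldloc" and arg <= MAX_BYTE:
        return ("Ldloc_S", arg)
    return instruction  # anything not listed stays the same; no errors

print(optimal_short_encoding(("Ldc_I4", 0)))   # ('Ldc_I4_0',)
print(optimal_short_encoding(("Ldloc", 12)))   # ('Ldloc_S', 12)
print(optimal_short_encoding(("Ldc_I4", 42)))  # ('Ldc_I4', 42)
```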
  55. Special Tools! PEVerify, ildasm. Similar tools exist for other

    platforms.
  56. Compare! When in doubt, try it in C#, and see

    what that compiler emits. Sometimes it’s a little strange. Roll with it.
  57. https://www.ece.cmu.edu/~ganger/712.fall02/papers/p761-thompson.pdf OK, this is weird.

  58. Further Reading Don’t buy the dragon book. Universally recommended by

    people who haven’t read it, or who have read nothing else.
  59. Further Reading • Programming Language Concepts, by Peter Sestoft •

    Modern Compiler Implementation in ML, by Andrew W. Appel
  60. Craig Stuntz @craigstuntz Craig.Stuntz@Improving.com http://blogs.teamb.com/craigstuntz http://www.meetup.com/Papers-We-Love-Columbus/