Programs That Write Programs

While production compilers can be quite complicated, the principles of compiler design are not too hard to learn, and are broadly applicable to many seemingly difficult programming problems. In this session you will learn to write a compiler for a toy language, targeting the .NET runtime, in F#, even if you have no previous experience with compilers, F#, or functional programming. You will learn how every part of a modern compiler toolchain works, including lexing, parsing, optimization, and code generation.

Craig Stuntz

October 08, 2015

Transcript

  1. Programs that
    Write Programs
    Craig Stuntz
    https://speakerdeck.com/craigstuntz
    https://github.com/CraigStuntz/TinyLanguage
    What is a compiler?

    What is the formalism? Different than TDD.

    How do we drive code implementation via the formalism?

  2. A → B
    Why bother learning compilers?

    I think it’s fun & interesting in its own right. Maybe you agree, since you’re here. But you might think you don’t have the opportunity to use this stuff in your day job.
    Maybe you can!

    A compiler is a formalization for taking data in one representation and turning it into data in a different representation. The transformation (from A to B) is often nontrivial.

  3. –Steve Yegge
    “You're actually surrounded by
    compilation problems. You run into
    them almost every day.”
    http://steve-yegge.blogspot.ca/2007/06/rich-programmer-food.html
    This is, more or less, every problem in software engineering. Every problem.

  4. Source code → Program
    JPEG file → Image on screen
    Source code → Potential style error list
    JSON → Object graph
    Code with 2 digit years → Y2K compliant code
    VB6 → C#
    Object graph → User interface markup
    Algorithm → Faster, equivalent
    algorithm
    Some examples

    Some of these look like what we typically consider a compiler. Some don’t.

    None are contrived.

    I do all of these in my day job.

  5. Useful Bits
    • Regular Expressions (lexing)
    • Deserializers (parsing)
    • Linters, static analysis (syntax, type
    checking)
    • Solvers, theorem provers (optimization)
    • Code migration tools (compilers!)
    Also, most of the individual pieces of the compilation pipeline are useful in their own right

    When you learn to write a compiler, you learn all of the above, and more!

  6. –Greenspun’s Tenth Rule
    “Any sufficiently complicated C or
    Fortran program contains an ad
    hoc, informally-specified, bug-
    ridden, slow implementation of
    half of Common Lisp.”
    Another good reason: You will be doing this stuff anyway; may as well do it right!

    DIBOL story.

  7. 1 + 2 + 3 + … + 100
    =
    100 * 101 / 2
    =
    5050
    Another reason compilers are interesting is they try to solve a very hard problem:

    Don’t just convert A to B.

    Convert A into a highly-optimized-but-entirely-equivalent B. For any given A whatsoever!

    If a compiler crashes, that’s bad. If it silently emits incorrect code, it’s worse than bad; the world ends.

    Hard problems are interesting!
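
    The rewrite on the slide can be checked mechanically. A minimal F# sketch of the two equivalent computations; the names `naive` and `closedForm` are illustrative and not from the talk:

```fsharp
// Sum 1..n the slow way, then with the closed form n(n+1)/2.
let n = 100
let naive = List.sum [ 1 .. n ]
let closedForm = n * (n + 1) / 2

printfn "%d %d" naive closedForm
```

    Both values are 5050. An optimizer that knows the identity can replace the loop with the closed form, which is only safe because the two are equivalent for every n.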

  8. Good News!
    Although this is a hard problem, it’s a mostly solved problem.

    “Solved” not as in it will never get better.

    “Solved” as in there is a recipe.

    In fact, if you have a hard problem and can distort it until it looks like a compiler then you have a solved problem.

    Contrast this with web development, which is also hard, but nobody has any bloody idea what they’re doing.

  9. –Eugene Wallingford
    “…compilers ultimately depend on a
    single big idea from the theory of
    computer science: that a certain
    kind of machine can simulate
    anything — including itself. As a
    result, this certain kind of
    machine, the Turing machine, is
    the very definition of
    computability.”
    http://www.cs.uni.edu/~wallingf/blog/archives/monthly/2015-09.html#e2015-09-03T15_26_47.htm
    What Is a Compiler, Really? (pause) It’s a program that writes a program! Or a program which takes a representation and writes a representation. Often, it’s a program that
    can write any possible program. Many compilers can compile themselves, producing their own executable as output. There are weird and non-obvious consequences of
    this! This idea on the screen — this is huge!

  10. code → exe
    Actually, I lied just slightly.

    We like to think of a compiler as code -> exe, but this is mostly wrong. (Click)

    Primary purpose is to reject incorrect code. Maybe 1 in 10 times we produce a program.

    A good compiler is the ultimate total function: Take any random string and either produce a program or explain clearly why it’s not a program.
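
    One way to read "total function" in F# terms is that the return type has a case for every possible input. The sketch below assumes a toy language in which a program is a single integer literal; `CompileResult`, `Program`, `Rejected`, and `compile` are hypothetical names, not taken from TinyLanguage:

```fsharp
// Every input string maps to exactly one of these two cases.
type CompileResult =
    | Program of string list   // stand-in for the emitted instructions
    | Rejected of string       // an explanation of why the input is not a program

// Toy language (an assumption): a program is one integer literal.
let compile (source: string) : CompileResult =
    match System.Int32.TryParse(source.Trim()) with
    | true, n -> Program [ sprintf "Ldc.i4 %d" n ]
    | false, _ -> Rejected (sprintf "Expected an integer literal; found '%s'." source)
```

    `compile "42"` yields a `Program`, and `compile "hello"` yields a `Rejected` with a message. Neither input throws.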

  11. code → exe

  12. Syntax and Semantics
    • Java and JavaScript have similar
    syntax, but different semantics.
    • Ruby and Elixir have similar
    syntax, but different semantics.
    • C# and VB.NET have different
    syntax, but fairly similar
    semantics
    A compiler must handle both the written form (syntax) and the internal meaning (semantics) of the language.

    People obsess about syntax. They reject LISPs because (). But semantics are, I think, most important.

    Nobody says we don’t need JavaScript since we have Java. People do ask if we need VB.NET since there is C# (syntactically different, but semantically equivalent).
    Syntax is really important!

  13. Similar Syntax,
    Different Semantics
    # Elixir
    name = "Nate"
    # => "Nate"
    String.upcase(name)
    # => "NATE"
    name
    # => "Nate"
    # Ruby
    name = "Nate"
    # => "Nate"
    name.upcase!
    # => "NATE"
    name
    # => "NATE"
    http://www.natescottwest.com/elixir-for-rubyists-part-2/
    Elixir on left, Ruby on right. Looks similar, but Elixir strings are immutable.

  14. Different Syntax,
    Similar Semantics
    Imports System
    Namespace Hello
    Class HelloWorld
    Overloads Shared Sub Main(ByVal args() As String)
    Dim name As String = "VB.NET"
    'See if argument passed
    If args.Length = 1 Then name = args(0)
    Console.WriteLine("Hello, " & name & "!")
    End Sub
    End Class
    End Namespace
    using System;
    namespace Hello {
    public class HelloWorld {
    public static void Main(string[] args) {
    string name = "C#";
    // See if argument passed
    if (args.Length == 1) name = args[0];
    Console.WriteLine("Hello, " + name + "!");
    }
    }
    }
    http://www.harding.edu/fmccown/vbnet_csharp_comparison.html
    VB.NET on left, C# on right. Looks different, but semantics are identical

    Is this clear? Compilers deal with both; distinction is important.

  15. x = x + 1;
    alert(x);
    Sequence
      Assign
        x
        add
          x
          1
      Invoke
        alert
        x
    Syntax has both a literal (characters) and a tree form.

    This is the “abstract” syntax tree. Believe it or not it’s the simplified version!

    But recognizing the syntax and semantics is only the start of what a compiler does. Let’s look at the big picture.

  16. Compiler
    I have a very general notion of a compiler… Includes building EXEs, but also a lot else. Compiler produces a program, which maps input to output.

    But maybe you prefer Ruby or JS? (Click) Interpreter is a compiler with added user input. Maps source and input to output.

    I’ll talk mostly about compilers, but remember an interpreter is just a compiler with an extra feature or three.
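
    The difference between the two shows up in the function types. A minimal sketch, assuming a toy source language whose only programs are the strings "inc" and "dec"; `Program`, `compile`, and `interpret` are illustrative names:

```fsharp
// A compiled "program" is a function from input to output.
type Program = int -> int

// Compiler: source -> program. The input is not needed yet.
let compile (source: string) : Program =
    match source with
    | "inc" -> fun x -> x + 1
    | "dec" -> fun x -> x - 1
    | other -> failwithf "Unknown program '%s'." other

// Interpreter: source and input together -> output.
let interpret (source: string) (input: int) : int =
    (compile source) input

printfn "%d" (interpret "inc" (-1))
```

    Here the interpreter is literally the compiler plus one extra argument, which is the sense in which an interpreter is a compiler with added user input.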

  17. Compiler Interpreter

  18. Front End:
    Understand
    Language
    Back End:
    Emit Code
    Front end, back end

    Definitions vary, but nearly always exist

    A C# app has two compilers. (What are they?) csc and the JITter.

  19. Lexer → Parser → Type Checker → Optimizer → IL Generator → Optimizer → Object Code Generator
    This is simplified. Production compilers have more stages. Front end, back end… Middle end?

    There is an intermediate representation — collection of types — for each stage.

    I show two optimizers here. Production compilers have more.
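
    A pipeline like this can be modeled as function composition in which each stage takes one intermediate representation and returns the next, or an error. This is a sketch under assumptions: the `Stage` type, the `>=>` operator, and the three toy stages are hypothetical and far simpler than a real compiler's stages:

```fsharp
// A stage maps one representation to the next, or stops with a message.
type Stage<'a, 'b> = 'a -> Result<'b, string>

// Compose two stages; an error from the first skips the second.
let (>=>) (first: Stage<'a, 'b>) (second: Stage<'b, 'c>) : Stage<'a, 'c> =
    fun input -> Result.bind second (first input)

// Toy stages: read an int, apply "inc" at compile time, emit IL as text.
let parse : Stage<string, int> =
    fun (text: string) ->
        match System.Int32.TryParse(text) with
        | true, n -> Ok n
        | false, _ -> Error (sprintf "Not a number: '%s'." text)

let increment : Stage<int, int> = fun n -> Ok (n + 1)

let emit : Stage<int, string list> = fun n -> Ok [ sprintf "Ldc.i4 %d" n ]

let compile = parse >=> increment >=> emit
```

    `compile "-1"` returns `Ok [ "Ldc.i4 0" ]`, while `compile "x"` returns the parse error and never reaches the later stages.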

  20. To lex and parse a language, you must understand its grammar.

    This is one of the lovely problems in software (seriously) which can easily be formally specified!

    Example is part of ECMAScript grammar.

  21. (inc -1)
    Let’s start with something simple. Simpler.

    We want to be able to compile this program. Is this program statically typed or dynamically typed?

    (Start with a toy language. Language design is harder than compiler design. Even amongst production languages, some are way harder to implement than others. Not
    Ruby. Not made up. Start with a Lisp.)

    To what? There’s more than one valid representation of this program. Here’s one (click, explain IL). Here’s another (click). How do we even decide?

  22. (inc -1)
    Ldc.i4 -1
    Ldc.i4 1
    Add

  23. (inc -1)
    Ldc.i4 -1
    Ldc.i4 1
    Add
    Ldc.i4.0

  24. (inc -1)
    Lex: LeftParen, Identifier(inc), Number(-1), RightParen
    Parse: Apply “inc” to -1
    Type check: “inc” exists and takes an int argument, and -1 is an int. Great!
    Optimize: -1 + 1 = 0, so just emit int 0!
    IL generate: Ldc.i4 0
    Optimize: Ldc.i4 0 → Ldc.i4.0
    Object code: Produce assembly with entry point which contains the IL generated
    Here’s a simple example… Note Number(-1)?

    Optimize 1: Compiler can do arithmetic!

    Optimize 2….

    All we’ve done is bust hard problem of source code -> optimized EXE into a bunch of relatively small problems. Make sense?

    Production compiler engineers will say this is too simple, but their code, like most production code, is probably a mess. You can learn a lot from the simple case!

  25. (defun add-1 (x)
    (inc x))
    (defun main ()
    (print (add-1 2)))
    Or maybe this one.

  26. There Are No
    Edge Cases In
    Programming
    Languages
    Duff’s Device
    send(to, from, count)
    register short *to, *from;
    register count;
    {
    register n = (count + 7) / 8;
    switch (count % 8) {
    case 0: do { *to = *from++;
    case 7: *to = *from++;
    case 6: *to = *from++;
    case 5: *to = *from++;
    case 4: *to = *from++;
    case 3: *to = *from++;
    case 2: *to = *from++;
    case 1: *to = *from++;
    } while (--n > 0);
    }
    }
    We must consider every possible program!

    People will do absolutely anything the PL grammar allows.

    Therefore a compiler must be able to classify literally any arbitrary string as either a valid or an invalid program.

    There are a lot of possible strings!

  27. Grammar
    := |
    := |
    := “(defun” identifier “)”
    := number | string |
    := “(” identifier “)”
    identifier := "".
    number := "".
    string := "".
    Of course an acceptable program can’t really be any random string. There are rules.

    TDD vs. formal methods. I like tests, but… totality.

    With a reasonably specified language (good: C#; bad: Ruby), the rules are pretty easy to follow.

    Our goal today is to implement a really simple language. We’ll follow the grammar here.

    Terminals vs. nonterminals. Important for lexing vs. parsing.
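
    In F#, the terminal/nonterminal split maps naturally onto two discriminated unions: one for what the lexer produces and one for what the parser builds. This is a sketch, not the TinyLanguage definitions. `LeftParenthesis`, `RightParenthesis`, `Identifier`, `LiteralInt`, `IntExpr`, and `StringExpr` appear elsewhere in these slides; `LiteralString`, `InvokeExpr`, and `DefunExpr` are assumed names:

```fsharp
// Terminals: the lexer's output and the parser's input.
type Lexeme =
    | LeftParenthesis
    | RightParenthesis
    | Identifier of string
    | LiteralInt of int
    | LiteralString of string

// Nonterminals: each case corresponds to a grammar rule the parser recognizes.
type Expression =
    | IntExpr of int
    | StringExpr of string
    | InvokeExpr of name: string * argument: Expression
    | DefunExpr of name: string * body: Expression

// (inc -1) as a tree rather than as characters.
let example = InvokeExpr ("inc", IntExpr (-1))
```

    Because `Expression` refers to itself, nesting such as `(inc (inc -1))` needs no extra machinery.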

  28. Lexing
    (inc -1)
    Start with this, transform to (click). Haven’t tried to infer any kind of meaning at all yet.

    What’s the point of lexing vs parsing? Output of lexing is terminals of the grammar used by the parser. Lexer sets the parser up for success. Makes the formalism
    possible.

  29. Lexing
    (inc -1)
    “(“ “inc” “-1” “)”
    LeftParenthesis
    Identifier(“inc”)
    LiteralInt(-1)
    RightParenthesis

  30. Lexing
    ( -1)
    Possible errors: Something which can’t be recognized as a terminal for the grammar.

  31. Lexing
    let rec private lexChars (source: char list) : Lexeme list =
        match source with
        // Single-character tokens.
        | '(' :: rest -> LeftParenthesis :: lexChars rest
        | ')' :: rest -> RightParenthesis :: lexChars rest
        // Multi-character tokens are handed to helper lexers.
        | '"' :: rest -> lexString (rest, "")
        | c :: rest when isIdentifierStart c -> lexName (source, "")
        | d :: rest when System.Char.IsDigit d -> lexNumber (source, "")
        | [] -> []
        // Whitespace separates tokens and is otherwise skipped.
        | w :: rest when System.Char.IsWhiteSpace w -> lexChars rest
        // Anything else is reported rather than silently dropped.
        | c :: rest -> Unrecognized c :: lexChars rest
    I write all of my compiler code as purely functional.

    Most example compiler code requires understanding the state of the compiler as well as the behavior of the code. You can understand this code from the code alone.

    Lexer is recursive: Look at first char, decide what to do. How do you eat an elephant?
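
    The slide's helpers (`lexString`, `lexName`, `lexNumber`, `isIdentifierStart`) are not shown, so here is a self-contained sketch of the same recursive shape that runs on its own. The helpers `takeWhile`, `toText`, and `isDigit`, the handling of a leading minus sign, and the identifier rule are assumptions for illustration. String literals are left out:

```fsharp
type Lexeme =
    | LeftParenthesis
    | RightParenthesis
    | Identifier of string
    | LiteralInt of int
    | Unrecognized of char

let isDigit (c: char) = System.Char.IsDigit c

// Split a char list into its longest prefix satisfying `pred` and the rest.
let rec takeWhile (pred: char -> bool) (chars: char list) : char list * char list =
    match chars with
    | c :: rest when pred c ->
        let taken, remaining = takeWhile pred rest
        c :: taken, remaining
    | _ -> [], chars

let toText (chars: char list) : string =
    chars |> List.map string |> String.concat ""

// Look at the first character, decide what to do, recurse on what is left.
let rec lexChars (source: char list) : Lexeme list =
    match source with
    | [] -> []
    | '(' :: rest -> LeftParenthesis :: lexChars rest
    | ')' :: rest -> RightParenthesis :: lexChars rest
    | w :: rest when System.Char.IsWhiteSpace w -> lexChars rest
    | '-' :: d :: _ when isDigit d ->
        let digits, rest = takeWhile isDigit (List.tail source)
        LiteralInt (-(int (toText digits))) :: lexChars rest
    | d :: _ when isDigit d ->
        let digits, rest = takeWhile isDigit source
        LiteralInt (int (toText digits)) :: lexChars rest
    | c :: _ when System.Char.IsLetter c ->
        let name, rest =
            takeWhile (fun ch -> System.Char.IsLetterOrDigit ch || ch = '-') source
        Identifier (toText name) :: lexChars rest
    | c :: rest -> Unrecognized c :: lexChars rest

let lex (source: string) : Lexeme list = lexChars (List.ofSeq source)
```

    `lex "(inc -1)"` gives `[LeftParenthesis; Identifier "inc"; LiteralInt -1; RightParenthesis]`. No state is carried between calls, so each case can be read in isolation.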

  32. Parsing
    LeftParenthesis
    Identifier(“inc”)
    LiteralInt(-1)
    RightParenthesis
    Start with a list of tokens from the lexer. Click

    Produce syntax tree.

    Here we have a const -1; could be another expression like another invocation.

  33. Parsing
    LeftParenthesis
    Identifier(“inc”)
    LiteralInt(-1)
    RightParenthesis
    Invoke “inc” -1

  34. Parsing
    LeftParenthesis
    Identifier(“inc”)
    LiteralInt(-1)
    LiteralInt(-1)
    Possible errors: Bad syntax.

    Every stage of compiler checks the form of the stage previous. (click)

    Parser checks lexer output.

  35. Parsing
    LeftParenthesis
    Identifier(“inc”)
    LiteralInt(-1)
    LiteralInt(-1)
    “Expected ‘)’”

  36. Parsing
    let rec private parseExpression (state : ParseState): ParseState =
        match state.Remaining with
        // (defun name ...) must be closed by ')'.
        | LeftParenthesis :: Identifier "defun" :: Identifier name :: rest ->
            let defun = parseDefun (name, { state with Remaining = rest })
            match defun.Expressions, defun.Remaining with
            | [ ErrorExpr _ ], _ -> defun
            | _, RightParenthesis :: remaining -> { defun with Remaining = remaining }
            | _, [] -> error ("Expected ')'.")
            | _, wrong :: _ -> error (sprintf "Expected ')'; found %A." wrong)
        // (name argument) is an invocation, also closed by ')'.
        | LeftParenthesis :: Identifier name :: argumentsAndBody ->
            let invoke = parseInvoke (name, { state with Remaining = argumentsAndBody })
            match invoke.Remaining with
            | RightParenthesis :: remaining -> { invoke with Remaining = remaining }
            | [] -> error ("Expected ')'.")
            | wrong :: _ -> error (sprintf "Expected ')'; found %A." wrong)
        | LeftParenthesis :: wrong -> error (sprintf "%A cannot follow '('." wrong)
    Do not optimize.

    Do not type check.

    Just look for valid syntax.
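
    A stripped-down parser in the same style, small enough to run on its own. It is a sketch under assumptions: it threads the remaining lexemes as a tuple instead of the slide's `ParseState` record, handles only integer literals and single-argument invocations, and `InvokeExpr` and the last two error messages are invented for the example:

```fsharp
type Lexeme =
    | LeftParenthesis
    | RightParenthesis
    | Identifier of string
    | LiteralInt of int

type Expression =
    | IntExpr of int
    | InvokeExpr of string * Expression
    | ErrorExpr of string

// Parse one expression; return it together with the lexemes left over.
let rec parseExpression (lexemes: Lexeme list) : Expression * Lexeme list =
    match lexemes with
    | LiteralInt n :: rest -> IntExpr n, rest
    | LeftParenthesis :: Identifier name :: rest ->
        let argument, afterArgument = parseExpression rest
        match argument, afterArgument with
        | ErrorExpr _, _ -> argument, afterArgument
        | _, RightParenthesis :: remaining -> InvokeExpr (name, argument), remaining
        | _, [] -> ErrorExpr "Expected ')'.", []
        | _, wrong :: _ ->
            ErrorExpr (sprintf "Expected ')'; found %A." wrong), afterArgument
    | LeftParenthesis :: wrong ->
        ErrorExpr (sprintf "%A cannot follow '('." wrong), []
    | [] -> ErrorExpr "Unexpected end of input.", []
    | wrong :: rest -> ErrorExpr (sprintf "Unexpected %A." wrong), rest
```

    Feeding it the lexemes for `(inc -1)` yields `InvokeExpr ("inc", IntExpr -1)` with nothing left over. Replacing the closing parenthesis with a second `LiteralInt -1` yields an `ErrorExpr` whose message begins "Expected ')'". The parser never asks whether `inc` exists or what type it takes.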

  37. Optimization (I)
    Invoke “inc” -1
    There cannot be any errors.

    Unlike other phases.

  38. Optimization (I)
    Invoke “inc” -1
    Constant Int 0

  39. Optimization
    Fight the Urge to Optimize Outside the
    Optimizers!

  40. Optimization (I)
    Invoke “some-method” -1
    Often, you’ll do nothing.

  41. Optimization (I)
    Invoke “some-method” -1
    Invoke “some-method” -1
    Often, you’ll do nothing.
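
    Both cases, folding `inc` of a constant and leaving everything else alone, fit in a few lines. A minimal sketch, assuming a two-case tree; `InvokeExpr` and `optimize` are illustrative names rather than the talk's code:

```fsharp
type Expression =
    | IntExpr of int
    | InvokeExpr of string * Expression

// Fold calls to the built-in "inc" when the argument is a known constant.
// The result is always an Expression: this phase has no error case.
let rec optimize (expression: Expression) : Expression =
    match expression with
    | InvokeExpr (name, argument) ->
        match name, optimize argument with
        | "inc", IntExpr n -> IntExpr (n + 1)
        | _, optimized -> InvokeExpr (name, optimized)
    | other -> other
```

    `optimize (InvokeExpr ("inc", IntExpr (-1)))` returns `IntExpr 0`, while `optimize (InvokeExpr ("some-method", IntExpr (-1)))` returns its input unchanged.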

  42. Binding / Type Checking
    Invoke “inc” -1
    Start with an abstract syntax tree. Click.

    Produce a binding tree.

  43. ("inc", FunctionBinding {
    Name = "inc"
    Argument = {
    Name = "value"
    ArgumentType = IntType
    }
    Body = IntBinding 0;
    ResultType = IntType
    })
    InvokeBinding {
    Name = "inc"
    Argument = IntBinding -1
    ResultType = IntType }
    Binding / Type Checking
    Invoke “inc” -1

  44. InvokeBinding {
    Name = "inc"
    Argument = StringBinding "Oops!"
    ResultType = IntType }
    Binding / Type Checking
    Possible errors.

  45. InvokeBinding {
    Name = "inc"
    Argument = StringBinding "Oops!"
    ResultType = IntType }
    Binding / Type Checking
    “Expected integer; found ‘Oops!’.”
    Possible errors.
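
    A small binder shows where these errors come from. This is a sketch under assumptions: the `builtIns` table, `TinyType`, `typeOf`, and the tuple-shaped `InvokeBinding` are simplifications invented here, and the type mismatch message is not worded like the slide's. Only "Undefined function '...'." is copied from the talk's test:

```fsharp
type TinyType = IntType | StringType

type Expression =
    | IntExpr of int
    | StringExpr of string
    | InvokeExpr of string * Expression

type Binding =
    | IntBinding of int
    | StringBinding of string
    | InvokeBinding of name: string * argument: Binding * resultType: TinyType
    | ErrorBinding of string

// Hypothetical table of built-ins: name -> (argument type, result type).
let builtIns = Map.ofList [ "inc", (IntType, IntType) ]

let typeOf (binding: Binding) : TinyType option =
    match binding with
    | IntBinding _ -> Some IntType
    | StringBinding _ -> Some StringType
    | InvokeBinding (_, _, resultType) -> Some resultType
    | ErrorBinding _ -> None

// Attach meaning to syntax: resolve names and check argument types.
let rec bind (expression: Expression) : Binding =
    match expression with
    | IntExpr n -> IntBinding n
    | StringExpr s -> StringBinding s
    | InvokeExpr (name, argument) ->
        match Map.tryFind name builtIns, bind argument with
        | None, _ -> ErrorBinding (sprintf "Undefined function '%s'." name)
        | _, (ErrorBinding _ as failed) -> failed
        | Some (expected, result), bound when typeOf bound = Some expected ->
            InvokeBinding (name, bound, result)
        | Some (expected, _), bound ->
            ErrorBinding (sprintf "Expected %A; found %A." expected bound)
```

    Binding `(inc -1)` succeeds with result type `IntType`. Binding `(inc "Oops!")` and `(bad-method 2)` each produce an `ErrorBinding` instead of throwing.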

  46. Binding / Type Checking
    []
    member this.``should return error for unbound invocation``() =
        let source = "(bad-method 2)"
        let expected = [
            Ignore (ErrorBinding "Undefined function 'bad-method'.")
        ]
        let actual = bind source
        actual |> should equal expected

  47. Binding / Type Checking
    let rec private toBinding (environment: Map) (expression : Expression) : Bind
        match expression with
        | IntExpr n -> IntBinding n, None
        | StringExpr str -> StringBinding str, None
    Now we’re moving to a semantic understanding of the code.

    Some of this is easy!

  48. Binding / Type Checking
    Some of it is considerably harder.

    Type checking is easier than type inference, but both are mostly solved problems, at least for simpler cases.

    Now we’re moving to a semantic understanding of the code.

    Multiple formalisms are involved: Type theory, denotational semantics, operational semantics.

  49. IL Generation
    IntBinding 0
    If we start from a simple binding… (click)

    We want to produce

    If you’ve seen IL before, you may wonder why not (click). More optimized!

    We’ll get to that.
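
    IL is a stack machine, so generating code for a tree means emitting operands first and the operation last. A minimal sketch that models instructions as a discriminated union rather than calling a real emitter; `IncBinding` and `codegen` are hypothetical names, and `Ldc_I4` follows the naming used on the later optimization slide:

```fsharp
type Binding =
    | IntBinding of int
    | IncBinding of Binding   // hypothetical: a call to the built-in "inc"

type Instruction =
    | Ldc_I4 of int   // push a 32-bit integer constant
    | Add             // pop two values, push their sum

// Emit operands first, then the operation that consumes them.
let rec codegen (binding: Binding) : Instruction list =
    match binding with
    | IntBinding n -> [ Ldc_I4 n ]
    | IncBinding argument -> codegen argument @ [ Ldc_I4 1; Add ]
```

    `codegen (IntBinding 0)` gives `[Ldc_I4 0]`, and an unoptimized `IncBinding (IntBinding -1)` gives the three-instruction form shown earlier: `Ldc_I4 -1`, `Ldc_I4 1`, `Add`. The generator always emits the general form and leaves shorter encodings to the next phase.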

  50. IL Generation
    IntBinding 0
    Ldc.i4 0

  51. IL Generation
    IntBinding 0
    Ldc.i4 0
    Ldc.i4.0

  52. Optimization (II)
    Ldc.i4 0

  53. Optimization (II)
    Ldc.i4 0
    Ldc.i4.0

  54. Optimization (II)
    let private optimalShortEncodingFor = function
        | Ldc_I4 0 -> Ldc_I4_0
        | Ldc_I4 1 -> Ldc_I4_1
        | Ldc_I4 2 -> Ldc_I4_2
        | Ldc_I4 3 -> Ldc_I4_3
        | Ldc_I4 4 -> Ldc_I4_4
        | Ldc_I4 5 -> Ldc_I4_5
        | Ldc_I4 6 -> Ldc_I4_6
        | Ldc_I4 7 -> Ldc_I4_7
        | Ldc_I4 8 -> Ldc_I4_8
        | Ldloc 0 -> Ldloc_0
        | Ldloc 1 -> Ldloc_1
        | Ldloc 2 -> Ldloc_2
        | Ldloc 3 -> Ldloc_3
        | Ldloc i when i <= maxByte -> Ldloc_S(Convert.ToByte(i))
        // Anything not listed above stays the same.
        | instruction -> instruction
    Anything not listed here stays the same.

    No errors!
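
    Applied to a whole method body, this kind of peephole pass is just a map over the instruction list. A cut-down sketch with only a few instruction cases; the `optimize` wrapper and the reduced `Instruction` type are assumptions for the example:

```fsharp
type Instruction =
    | Ldc_I4 of int
    | Ldc_I4_0
    | Ldc_I4_1
    | Ldc_I4_2
    | Add

// Swap an instruction for its shorter encoding where one exists.
let optimalShortEncodingFor = function
    | Ldc_I4 0 -> Ldc_I4_0
    | Ldc_I4 1 -> Ldc_I4_1
    | Ldc_I4 2 -> Ldc_I4_2
    | other -> other   // no shorter form: keep the instruction as is

// Each instruction is rewritten independently, so the pass cannot fail.
let optimize (instructions: Instruction list) : Instruction list =
    List.map optimalShortEncodingFor instructions
```

    `optimize [ Ldc_I4 0 ]` returns `[ Ldc_I4_0 ]`, and a list with no short forms, such as `[ Ldc_I4 100; Add ]`, comes back unchanged.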

  55. Special Tools!
    PEVerify, ildasm. Similar tools exist for other platforms.

  56. Compare!
    When in doubt, try it in C#, and see what that compiler emits. Sometimes it’s a little strange. Roll with it.

  57. https://www.ece.cmu.edu/~ganger/712.fall02/papers/p761-thompson.pdf
    OK, this is weird.

  58. Further Reading

    Don’t buy the dragon book. Universally recommended by people who haven’t read it, or who have read nothing else.

  59. Further Reading
    • Programming Language Concepts, by
    Peter Sestoft
    • Modern Compiler Implementation in
    ML, by Andrew W. Appel

  60. Craig Stuntz
    @craigstuntz
    http://blogs.teamb.com/craigstuntz
    http://www.meetup.com/Papers-We-Love-Columbus/
