Programs that Write Programs: How Compilers Work

Craig Stuntz
January 07, 2016

Slides and speaker notes for my CodeMash 2016 presentation

Transcript

  1. Programs that
    Write Programs
    Craig Stuntz
    https://speakerdeck.com/craigstuntz
    https://github.com/CraigStuntz/TinyLanguage
    You can grab my slides and code from the links above.

    I won’t reserve time at the end of the talk for questions. I have a lot of material to cover. Please interrupt and ask if I’m unclear about anything. And I’d be happy to “buy
    you dinner” afterwards if you want to talk more.

  2. –Steve Yegge
    “You're actually
    surrounded by
    compilation problems.
    You run into them
    almost every day.”
    http://steve-yegge.blogspot.ca/2007/06/rich-programmer-food.html
    Why bother learning compilers? What can I tell you in just an hour?

    I think it’s fun & interesting in its own right. Maybe you agree, since you’re here. But you might think you don’t have the opportunity to use compiler techniques in your
    day job. Maybe you can! Obviously I can’t explain everything there is to know about compiler implementations in an hour, which is why I’m also giving you complete
    source code for a simple one to look at later! I want you to be able to look at a hard problem and tell your colleagues you know how to solve it.

  3. –Greenspun’s Tenth Rule
    “Any sufficiently
    complicated C or Fortran
    program contains an ad
    hoc, informally-
    specified, bug-ridden,
    slow implementation of
    half of Common Lisp.”
    https://commons.wikimedia.org/wiki/File:Philip_Greenspun_and_Alex_the_dog.jpg
    Compilation is fundamental to the problem of producing software. If you’re a professional programmer, you will — you will! — be asked to solve problems which you can
    do better if you recognize them as pieces of the compiler toolchain.

  4. The Hoover Dam
    Second thing: There exist problems too big to test. You can’t test every possible program a compiler might need to compile. You need a different methodology to drive
    your designs. You can solve these problems anyway, and produce a design with confidence that it’s going to work.

  5. Generalize the
    Problem
    The real skill I want you to leave this talk with: Recognize compilation problems (they’re everywhere!), and apply proven, reliable patterns to solve them.

    Simply recognizing these problems for what they are and knowing where to look to find the solution will make you a better developer. This is a super power!

  6. –Eugene Wallingford
    “…compilers ultimately depend on a
    single big idea from the theory of
    computer science: that a certain
    kind of machine can simulate
    anything — including itself. As a
    result, this certain kind of
    machine, the Turing machine, is the
    very definition of computability.”
    http://www.cs.uni.edu/~wallingf/blog/archives/monthly/2015-09.html#e2015-09-03T15_26_47.htm
    What Is a Compiler, Really? (pause) It’s a program that writes a program! Or a program which takes a representation of data and writes another representation. Often, it’s
    a program that can write any possible program. Many compilers can compile themselves, producing their own executable as output. There are weird and non-obvious
    consequences of this! This idea on the screen — this is huge!

  7. Compiler
    So a compiler produces a program by mapping source code to output.

    But maybe you prefer Ruby or JS? (Click) Interpreter is a compiler which recognizes source code plus additional user input at the same time.

    I’ll talk mostly about compilers, but remember an interpreter is just a compiler with an extra feature or three.

  8. Compiler Interpreter

  9. code exe
    Actually, I lied just slightly.

    We like to think of a compiler as code -> exe, but this is mostly wrong. (Click)

    Primary purpose is to reject incorrect code. Maybe 1 in 10 times we produce a program, right? Right?

  10. code exe

  11. Useful Bits
    • Regular Expressions (lexing)
    • Deserializers (parsing)
    • Linters, static analysis (syntax, type
    checking)
    • Solvers, theorem provers (optimization)
    • Code migration tools (compilers!)
    Also, most of the individual pieces of the compilation pipeline are useful in their own right.

    When you learn to write a compiler, you learn all of the above, and more!

  12. A → B
    More generally, a compiler is a formalization for taking data in one representation and turning it into data in a different representation, if valid. The transformation (from A
    to B) is often nontrivial. This concept describes a large percentage of the code we write, so compilation concepts are really useful, even when not writing a compiler per
    se.

  13. Source code → Program
    JPEG file → Image on screen
    Source code → Potential style error list
    JSON → Object graph
    Code with 2 digit years → Y2K compliant code
    VB6 → C#
    Object graph → User interface markup
    Algorithm → Faster, equivalent
    algorithm
    Some of these look like what we typically consider a compiler. Some don’t.

    None are contrived.

    I do all of these in my day job. These are problems you must routinely solve even if you’re not a language author.

  14. Designing with
    Formal Methods
    Many people here are probably familiar with test driven design, where you write tests to guide evolution of program design.

    Although I do write (many!) tests for my compilers, Test Driven Design doesn’t hold up as a design methodology. Let’s examine why, and what the alternative might be.

  15. #define D define
    #D Y return
    #D R for
    #D e while
    #D I printf
    #D l int
    #D W if
    #D C y=v+111;H(x,v)*y++= *x
    #D H(a,b)R(a=b+11;a#D s(a)t=scanf("%d",&a)
    #D U Z I
    #D Z I("123\
    45678\n");H(x,V){putchar(".XO"[*x]);W((x-V)%10==8){x+=2;I("%d\n",(x-V)/10-1);}}
    l V[1600],u,r[]={-1,-11,-10,-9,1,11,10,9},h[]={11,18,81,88},ih[]={22,27,72,77},
    bz,lv=60,*x,*y,m,t;S(d,v,f,_,a,b)l*v;{l c=0,*n=v+100,j=d3-f;W(d>u){R(w=i=0;i<4;i++)w+=(m=v[h[i]])==f?300:m==q?-300:(t=v[ih[i]])==f?-50:
    t==q?50:0;Y w;}H(z,0){W(E(v,z,f,100)){c++;w= -S(d+1,n,q,0,-b,-j);W(w>j){g=bz=z;
    j=w;W(w#$b%&w#$8003)Y w;}}}W(!c){g=0;W(_){H(x,v)c+= *x==f?1:*x==3-f?-1:0;Y c>0?
    8000+c:c-8000;}C;j= -S(d+1,n,q,1,-b,-j);)bz=g;Y d#$u-1?j+(c'(3):j;}main(){R(;t<
    1600;t+=100)R(m=0;m<100;m++)V[t+m]=m<11%&m>88%&(m+1)%10<2?3:0;I("Level:");V[44]
    =V[55]=1;V[45]=V[54]=2;s(u);e(lv>0){Z do{I("You:");s(m);}e(!E(V,m,2,0))*m+,99);
    W(m+,99)lv--;W(lv<15)*u<10)u+=2;U("Wait\n");I("Value:%d\n",S(0,V,1,0,-9000,9000
    ));I("move: %d\n",(lv-=E(V,bz,1,0),bz));}}E(v,z,f,o)l*v;{l*j,q=3-f,g=0,i,w,*k=v
    +z;W(*k==0)R(i=7;i#$0;i--){j=k+(w=r[i]);e(*j==q)j+=w;W(*j==f)*j-w+,k){W(!g){g=1
    ;C;}e(j+,k)*((j-=w)+o)=f;}}Y g;}
    Anyone here know any C? Does this look like a valid C program to you?

    This one is not valid, but it would be if you changed one character. Can you find it, or could you write a program to find it?

    This is a hard problem: Write a program that for any string whatsoever (and there are lots of possible strings!) either declares it a valid program or explains clearly to a
    human why it isn’t.

  16. Duff’s
    Device
    There Are No Edge
    Cases In
    Programming
    Languages
    send(to, from, count)
    register short *to, *from;
    register count;
    {
        register n = (count + 7) / 8;
        switch (count % 8) {
        case 0: do { *to = *from++;
        case 7:      *to = *from++;
        case 6:      *to = *from++;
        case 5:      *to = *from++;
        case 4:      *to = *from++;
        case 3:      *to = *from++;
        case 2:      *to = *from++;
        case 1:      *to = *from++;
                } while (--n > 0);
        }
    }
    And I mean every possible program! (Explain Duff)

    People will do absolutely anything the PL grammar allows.

    Therefore your compiler must be able to classify literally any arbitrary string into a valid or invalid program, and you can’t predict the valid programs people will write.

  17. Even the designer of a language can’t begin to predict all the things the compiler will be asked to parse.

    You can’t design a compiler which can parse any legal program by poking around with tests of possible programs. You would need an unbounded number of tests.
    You must design with formal methods. But you do still write tests; it’s just that you use a different design methodology!

  18. 1 + 2 + 3 + … + 100
    =
    100 * 101 / 2
    =
    5050
    Another reason compilers are interesting is they try to solve a very hard problem:

    Don’t just convert A to B.

    Convert A into a highly-optimized-but-entirely-equivalent B. For any given A whatsoever!

    If a compiler crashes, that’s bad. If it silently emits incorrect code, it’s worse than bad; the world ends.

    Hard problems are interesting!
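
    You can check that particular rewrite the way an optimizer might justify it: with an executable equality. A one-line F# sketch (mine, not the talk’s):

    List.sum [1..100] = 100 * 101 / 2  // true: both sides evaluate to 5050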

  19. Good news! Although this is a hard problem, it’s a mostly solved problem. “Solved” not as in it will never get better.

    “Solved” as in there is a recipe, which most programmers are capable of following. There are lots of steps, but each individual step is pretty simple. You can learn it one
    tiny piece at a time. This is a really learnable skill.

    In fact, if you have a hard problem and can distort it until it looks like a compiler then you have a solved problem. This is magic!

    Contrast this with web development, which is also hard, but we change our minds about what constitutes “best practices” annually.

  20. Lexer → Regular Expressions
    Parser → Context-Free Grammar
    Optimizer → Algebra
    Type Checker → Logical Inference Rules
    Code Generator → Denotational Semantics
    For each part of the compiler pipeline, there exists a formalism which guides our designs. Some of these words are long and unfamiliar, but just think of them as recipes
    for implementation. If you follow them, you will cover all of the cases in your language specification.

  21. –Leslie Lamport
    “You don’t achieve
    simplicity by thinking
    in terms of complicated
    languages. Simplicity
    requires thinking
    abstractly before you
    start implementing.”
    http://www.heidelberg-laureate-forum.org/blog/video/lecture-monday-august-24-2015-leslie-lamport/
    https://commons.wikimedia.org/wiki/File:Leslie_Lamport.jpg
    Worth noting: Compiler implementation is fairly straightforward, and there’s a recipe to follow.

    Language design is much harder, and there’s no “best” recipe. Lots of people can write good compilers. Far fewer can write a good language. Fewer still can write a
    good, simple language.

    Respect language designers. It’s common to try to learn compilers by inventing your own language or chasing a complicated language. I recommend you don’t do that! I
    use a tiny Lisp here, but I started with a math expression evaluator. Most math expressions are simpler than PLs.

  22. A Few Important
    Concepts
    Before we dive into compiler implementation per se…

  23. Syntax
    x = x + 1;
    alert(x);
    Sequence
    ├─ Assign
    │    ├─ x
    │    └─ add
    │         ├─ x
    │         └─ 1
    └─ Invoke
         ├─ alert
         └─ x
    A compiler must handle both the written form (syntax) and the internal meaning (semantics) of the language.

    Syntax has both a literal (characters) and a tree form. These are both representations of the syntax.

    This is the “abstract” syntax tree. Believe it or not it’s the simplified version!
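
    As a sketch of what that tree form might look like as data (my types, not the talk’s F# code), here it is modeled in F#:

    type Expr =
        | Identifier of string
        | IntLiteral of int
        | Add of Expr * Expr
    type Statement =
        | Assign of name: string * value: Expr
        | Invoke of name: string * argument: Expr
    type Syntax = Sequence of Statement list

    // x = x + 1; alert(x);
    let tree =
        Sequence
            [ Assign ("x", Add (Identifier "x", IntLiteral 1))
              Invoke ("alert", Identifier "x") ]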

  24. Semantics
    name = "Nate"
    # +/ "Nate"
    String.upcase(name)
    # +/ "NATE"
    name
    # +/ "Nate"
    name = "Nate"
    # +/ "Nate"
    name.upcase!
    # +/ "NATE"
    name
    # +/ "NATE"
    http://www.natescottwest.com/elixir-for-rubyists-part-2/
    People obsess about syntax. They reject LISPs because (). But semantics are, I think, most important.

    Nobody says we don’t need JavaScript since we have Java. People do ask if we need VB.NET since there is C# (syntactically different, but semantically equivalent). I
    want to clarify the distinction.

    Elixir on left, Ruby on right. Looks similar, but Elixir strings are immutable. Similar syntax; different semantics. Similar looking programs mean different things.

  25. Semantics
    Imports System
    Namespace Hello
    Class HelloWorld
    Overloads Shared Sub Main(ByVal args() As String)
    Dim name As String = "VB.NET"
    'See if argument passed
    If args.Length = 1 Then name = args(0)
    Console.WriteLine("Hello, " & name & "!")
    End Sub
    End Class
    End Namespace
    using System;
    namespace Hello {
    public class HelloWorld {
    public static void Main(string[] args) {
    string name = "C#";
    !" See if argument passed
    if (args.Length == 1) name = args[0];
    Console.WriteLine("Hello, " + name + "!");
    }
    }
    }
    http://www.harding.edu/fmccown/vbnet_csharp_comparison.html
    VB.NET on left, C# on right. Looks different, but semantics are identical. Different looking programs mean the same thing.

    “The syntax of a language is governed by the constructs that define its types, and its semantics is determined by the interactions among those constructs.” -Robert
    Harper

    Is this clear? Compilers deal with both; distinction is important.

    Let’s look at the big picture.

  26. Front End: Understand Language
    Back End: Emit Code
    Front end, back end

    Definitions vary, but nearly always exist

  27. Lexer → Parser → Binder → Type Checker → Optimizer →
    IL Generator → Optimizer → Object Code Generator
    This is simplified. Production compilers have more stages. Front end, back end… Middle end?

    There is an intermediate representation — collection of types — for each stage.

    I show two optimizers here. Production compilers have more.

  28. OK, so let’s compile
    something already!
    module Compiler
    let compile =
        Lexer.lex
        >> Parser.parse
        >> Binder.bind
        >> OptimizeBinding.optimize
        >> IlGenerator.codegen
        >> Railway.map OptimizeIl.optimize
        >> Railway.map Il.toAssemblyBuilder
    The real, full source code for the compiler itself is on the screen; it just chains together the “recipes” we’ll be examining in detail in the examples to come. I’m
    literally just piping one module into the next.
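
    Because compile is built with function composition (>>), each stage’s output type must match the next stage’s input, and the whole pipeline is itself just a function from source text to a result. A hypothetical invocation (my example; the exact result type lives in the repo’s Railway module):

    let result = Compiler.compile "(inc -1)"  // success carries the assembly; failure carries error messages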

  29. (inc -1)
    Let’s start with something really, really simple.

    We want to be able to compile this program. (Remember, start with a toy language.)

    To what? There’s more than one valid representation of this program. Here’s one (click, explain IL). Here’s another (click). How do we even decide?

    Is it clear to everyone why these are both valid representations of the program I’m compiling?

  30. (inc -1)
    Ldc.i4 -1
    Ldc.i4 1
    Add

  31. (inc -1)
    Ldc.i4 -1
    Ldc.i4 1
    Add
    Ldc.i4.0

  32. (inc -1) Lex
    LeftParen, Identifier(inc),
    Number(-1), RightParen
    Parse Apply “inc” to -1
    Type
    check
    “inc” exists and takes an int
    argument, and -1 is an int. Great!
    Optimize -1 + 1 = 0, so just emit int 0!
    IL
    generate
    Ldc.i4 0
    Optimize Ldc.i4 0 → Ldc.i4.0
    Object
    code
    Produce assembly with entry point
    which contains the IL generated
    Let’s break this process into individual steps.

    All we’ve done is bust the hard problem of source code → optimized EXE into a bunch of relatively small problems. (Read each) We’ll explain them in detail in just a moment.

    Production compiler engineers will say this is too simple, but their code, like most production code, is probably a mess. You can learn a lot from the simple case!

  33. (defun add-1 (int x)
    (inc x))
    (defun main ()
    (print (add-1 2)))
    Or maybe this one. It’s a little more complicated, but hopefully it still makes sense.

    Let’s look at the individual pieces of the pipeline.

  34. Lexer
    What Problem Are We Solving? String → Sequence of tokens
    Non-Compiler Example: Text search
    Lexers break strings into tokens using a grammar which is a bunch of really simple regular expressions.

    Tokens/lexemes are like characters, but they’re just a little bit richer. You get a token for the int 123 instead of the individual characters 1, 2, and 3.

    Regular expressions search text within a string.

    Lexers search for reserved words/symbols in a language.

  35. Lexer
    Search “am”
    I am. You are.
    You don’t expect a search for “am” to match the word “are” simply because they’re conjugations of the same verb. This is the difference between lexing and parsing:
    lexing deals with symbols and parsing deals with language grammar. Lexing works on character input. Parsing works on lexeme input (roughly, words or symbols) —
    the fundamental unit of the PL’s grammar and also the output of the lexer.

  36. Regular Expressions
    leftParenthesis = ‘(‘
    rightParenthesis = ‘)’
    letter = ‘A’ | ‘B’ | ‘C’ | …
    digit = ‘0’ | ‘1’ | ‘2’ | …
    number = (‘+’digit|‘-’digit|digit) digit*
    alphanumeric = letter | number
    …
    Of course an acceptable program can’t really be any random string. There are rules. We define the rules via formalisms. For lexers, the formalism is regular expressions.
    This doesn’t mean a PCRE; it’s much simpler.

    As with the example above, a RE can be a literal character, a choice, or a sequence. That’s it! There are no other options. Lexers are really simple!
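
    To make the “number” rule concrete, here it is translated to a .NET regex (my translation, not the slide’s):

    open System.Text.RegularExpressions
    // number = (‘+’digit | ‘-’digit | digit) digit*
    let number = Regex(@"^[+-]?[0-9][0-9]*")
    number.IsMatch "-1"    // true
    number.IsMatch "inc"   // false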

    But how can we be sure that the code behaves like this grammar, every time?

  37. Lexer
    If there’s one rule I know which will for sure make your programs better, it’s this: Make illegal states unrepresentable. (Yaron Minsky)

  38. Lexer
    type Lexeme =
    | LeftParenthesis
    | RightParenthesis
    | Identifier of string
    | LiteralInt of int
    | LiteralString of string
    | Unrecognized of char
    The regular expressions in the lexical grammar directly map to the types I create for the lexer.

    I cannot construct an instance of something which doesn’t fit in the types, hence, I cannot construct an instance of a program which doesn’t fit in the grammar. Any
    code input which doesn’t match the lexical grammar gets tossed into the bottom “Unrecognized” type. It will eventually surface as an error to the user.

  39. Lexer
    (inc -1)
    Start with this, transform to (click). Haven’t tried to infer any kind of meaning at all yet.

    What’s the point of lexing vs parsing? Output of lexing is terminals of the grammar used by the parser. Lexer sets the parser up for success. Makes the formalism
    possible.

  40. Lexer
    (inc -1)
    “(“ “inc” “-1” “)”
    LeftParenthesis
    Identifier(“inc”)
    LiteralInt(-1)
    RightParenthesis

  41. Lexer
    ( -1)
    Possible errors: Something which doesn’t fit the lexical grammar, and also can’t be recognized as a terminal for the parsing grammar.

  42. Lexer
    let rec private lexChars (source: char list) : Lexeme list =
        match source with
        | '(' :: rest -> LeftParenthesis :: lexChars rest
        | ')' :: rest -> RightParenthesis :: lexChars rest
        | '"' :: rest -> lexString(rest, "")
        | c :: rest when isIdentifierStart c -> lexName (source, "")
        | d :: rest when System.Char.IsDigit d -> lexNumber(source, "")
        | [] -> []
        | w :: rest when System.Char.IsWhiteSpace w -> lexChars rest
        | c :: rest -> Unrecognized c :: lexChars rest
    I write all of my compiler code as purely functional.

    Most example compiler code requires understanding the state of the compiler as well as the behavior of the code. You can understand this code from the code alone.
    There is no state at all to know.

    Lexer is recursive: Look at first char, decide what to do, then repeat.
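
    Helpers like lexNumber aren’t on the slide. A minimal sketch of the idea (mine; the repo’s version differs, and I omit sign handling): consume the leading digits, emit one LiteralInt token, and hand the rest of the input back to the caller.

    let private lexNumber (source: char list) : Lexeme * char list =
        let digits = source |> List.takeWhile System.Char.IsDigit
        let rest   = source |> List.skipWhile System.Char.IsDigit
        let value  = digits |> Array.ofList |> System.String |> int
        LiteralInt value, rest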

  43. Lexer
    http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
    Counterexamples: The thing you must know about lexing is when not to use it. You can’t parse XML with a regex.

    The user experience for validation should be gently helping the user to avoid fat fingers and submit correct data.

  44. Lexer
    http://www.regular-expressions.info/email.html
    “So even when following official standards,
    there are still trade-offs to be made. Don't
    blindly copy regular expressions from online
    libraries or discussion forums.”
    -Jan Goyvaerts, regular-expressions.info
    Don’t piss them off by telling them they’ve misspelt their Irish surname or that their real email address is “invalid.”

    Don’t validate a grammar (like email addresses) with a lexer/regex.

  45. Parser
    What Problem Are We Solving? Sequence of tokens → Syntax tree
    Non-Compiler Example: Deserialization
    The lexer produces a sequence of tokens. We want to turn that into an abstract syntax tree, respecting operator precedence.

  46. PEMDAS
    1 + 2 * 3
    1 + (2 * 3)
    What is precedence? In most languages, for example, you multiply before you add, regardless of their sequence in an expression. The expressions above and below the
    line should be identical.
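
    One standard way to encode that rule is to layer the grammar so that * binds tighter than + (a textbook sketch, not this talk’s Lisp grammar, which doesn’t need precedence):

    expression := term ((“+” | “-”) term)*
    term       := factor ((“*” | “/”) factor)*
    factor     := number | “(” expression “)”

    Because a term can only contain factors, 1 + 2 * 3 has no parse other than 1 + (2 * 3).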

  47. To parse a language, you must understand its grammar. Example is part of ECMAScript grammar.

    All valid statements in the language (all!) can be constructed from the grammar. Anything we can’t construct from the grammar is invalid and will be a parse error. The
    converse isn’t true, though.

    We can also construct invalid statements from the grammar, because the parser doesn’t type check. Just because something can be parsed doesn’t make it correct, but
    it’s certainly invalid if it can’t be parsed. True in spoken languages as well: “Colorless green ideas sleep furiously” Grammatically correct, but semantically nonsensical.
    Parsing works on syntax, not semantics.

  48. Grammar
    <program>    := <expression> | <expression> <program>
    <expression> := <defun> | <value>
    <defun>      := “(defun” identifier <expression> “)”
    <value>      := number | string | <invoke>
    <invoke>     := “(” identifier <value> “)”
    With a reasonably specified language, the rules are pretty easy to follow. (good: C#; bad: Ruby)

    Our goal today is to implement a really simple language. We’ll follow the grammar here.

    Explain Terminals vs. nonterminals. Important for lexing vs. parsing.

  49. Grammar
    type Expression =
    | IntExpr of int
    | StringExpr of string
    | DefunExpr of name: string * argument: ArgumentExpression option * body: Expression
    | InvokeExpr of name: string * argument: Expression option
    | IdentifierExpr of string
    | ErrorExpr of string
    | EmptyListExpr
    Just as we defined types to represent productions in the lexical grammar, we do the same for the parsing grammar. At this point you might ask if this mapping from the
    formal grammar to F# types can be automated, and it can! I’ve built my entire compiler “from scratch” so you can see how it works, but it’s common to use lexer and
    parser generators in real-world work.
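
    Under these types, the parse of “(inc -1)” from the earlier slides comes out as (my construction, using the slide’s types):

    let incNegOne = InvokeExpr (name = "inc", argument = Some (IntExpr (-1)))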

  50. Parser
    LeftParenthesis
    Identifier(“inc”)
    LiteralInt(-1)
    RightParenthesis
    How does parsing work?

    Start with a list of tokens from the lexer. Click

    Produce syntax tree.

    Here we have a const -1 argument; argument could be another expression like another invocation.

  51. Parser
    LeftParenthesis
    Identifier(“inc”)
    LiteralInt(-1)
    RightParenthesis
    Invoke “inc” -1

  52. Parser
    LeftParenthesis
    Identifier(“inc”)
    LiteralInt(-1)
    LiteralInt(-1)
    Possible errors: Bad syntax.

    Every stage of the compiler checks the form of the previous stage. (click)

    Parser checks lexer output.

  53. Parser
    LeftParenthesis
    Identifier(“inc”)
    LiteralInt(-1)
    LiteralInt(-1)
    “Expected ‘)’”

  54. Parser
    let rec private parseExpression (state : ParseState) : ParseState =
        match state.Remaining with
        | LeftParenthesis :: Identifier "defun" :: Identifier name :: rest ->
            let defun = parseDefun (name, { state with Remaining = rest })
            match defun.Expressions, defun.Remaining with
            | [ ErrorExpr _ ], _ -> defun
            | _, RightParenthesis :: remaining -> { defun with Remaining = remaining }
            | _, [] -> error ("Expected ')'.")
            | _, wrong :: _ -> error (sprintf "Expected ')'; found %A." wrong)
        | LeftParenthesis :: Identifier name :: argumentsAndBody ->
            let invoke = parseInvoke (name, { state with Remaining = argumentsAndBody })
            match invoke.Remaining with
            | RightParenthesis :: remaining -> { invoke with Remaining = remaining }
            | [] -> error ("Expected ')'.")
            | wrong :: _ -> error (sprintf "Expected ')'; found %A." wrong)
        | LeftParenthesis :: wrong -> error (sprintf "%A cannot follow '('." wrong)
    Implementing parsers can be a bit tricky since there are lots of choices in implementation style with subtle tradeoffs. But you can use an off-the-shelf toolkit in most
    cases. Parts of my parser shown here. Again, I’m just doing a pattern match against the token stream from the lexer, because parsing a LISP is easy.

    The key to not getting stuck in a compiler is to do only one thing at a time.

    Do not optimize.

    Do not type check.

    Just look for valid syntax.

  55. Parser
    Practical parsing:

    Deserialization, especially of untrusted input like reading a file format such as PNG, is a parsing job, and you should use formal methods to implement it. If you try to
    “wing it”, you are probably letting the bad guys in to attack anyone who uses your library.

  56. –Guy Steele
    “If it's worth
    telling another
    programmer, it's
    worth telling the
    compiler, I think.”
    https://joshvarty.wordpress.com/2015/08/03/learn-roslyn-now-part-11-introduction-to-code-fixes/
    One super practical use for a parser is enforcing code style guides.

    Many companies write style guides as Word documents or worse. That leads to lax enforcement, usually targeted at new employees only, and lots of pissy email threads.

  57. Parser
    https://joshvarty.wordpress.com/2015/08/03/learn-roslyn-now-part-11-introduction-to-code-fixes/
    Instead, you can write a Roslyn syntax analyzer and just fail the build if a rule is broken.

    Parsers: Bringing world peace through technology!

  58. Scope
    What Problem Are We Solving? What does “x” mean right now?
    Non-Compiler Example: Bounded Context in Domain-Driven Design
    In an ideal world (for the compiler author), developers would give a unique and unambiguous name to every variable.

    In the real world, probably half the variables in any average program are called “temp”.

    The compiler has to decide which assignment to something called “temp” to use in a given context.

  59. Scope
    https://msujaws.wordpress.com/2011/05/03/static-vs-dynamic-scoping/
    Scoping rules unambiguously associate occurrences of identifier names to their binding locations, or declaration sites.

    Essentially all modern languages use lexical (static) scoping. This is the rarest of things: A settled argument in computer science. Essentially nobody argues that dynamic
    scoping is a good idea anymore. Only archaic languages like SNOBOL and early LISPs use dynamic scoping. (Perl has opt-in dynamic scoping because it’s Perl.) Most
    contemporary dynamic languages use lexical scoping.

    As a compiler writer, this requires some care.
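
    A tiny F# illustration (my example) of what “lexical” means here: the x an inner function sees is fixed by where the code is written, not by who calls it.

    let x = 1
    let inner () = x    // always the x above, by lexical scope
    let outer () =
        let x = 2       // shadows x locally, but cannot change what inner sees
        inner ()
    let result = outer ()   // 1 under lexical scoping; dynamic scoping would say 2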

  60. Binding
    InvokeExpr “inc” -1
    The binder handles scoping concerns. Binder connects syntactic identifiers like “inc” with meaning.

    Start with an abstract syntax tree. Click.

    Produce a binding tree. Note the line Function = Inc — the string “inc” has been mapped to a built-in function the compiler understands and can code generate.

    Binding enforces scope and gives us the ability to check types.

  61. InvokeBinding {
    FunctionName = "inc"
    Function = Inc
    Argument = IntBinding -1}
    Binding
    InvokeExpr “inc” -1

  62. InvokeExpr {
    Name = "not-a-function"
    Argument = StringExpr "" }
    Binding
    Possible errors. (click)

  63. InvokeExpr {
    Name = "not-a-function"
    Argument = StringExpr "" }
    Binding
    “Undefined function ‘not-a-function’.”

  64. https://msdn.microsoft.com/en-us/library/ms228296.aspx?f=255&MSPPError=-2147217396
    Someone asked me how much code in a compiler is dedicated to error handling. Let’s take a digression and consider this.

    Remember, the most common case for a compiler is failing on bad code and giving the user a good error message.

    C# sure has a lot of them. Each of these has code behind it.

  65. As a compiler author, you either put in the effort to do it well or you leave your user with a really poor experience.

  66. About Those Errors
    [<Test>]
    member this.``should return error for unbound invocation``() =
    let source = "(bad-method 2)"
    let expected = ErrorBinding (
    "Undefined function 'bad-method'.", EmptyBinding)
    let actual = bind source
    actual |> should equal expected
    So I write tests for each.

    Each error needs a test example, or several. This isn’t test driven design; it’s regression testing.

  67. About Those Errors
    http://www.drdobbs.com/architecture-and-design/so-you-want-to-write-your-own-language/240165488?pgno=2
    There are a few possible ways to deal with errors. (click)

    First is common when getting started; entirely unsuitable for production compilers. Just fail on first error found. (click)

    Second friendly but hard and often wrong. Guess what the user intended and try to continue compilation. Report all errors at the end of the process. (click)

    I use third.

  68. About Those Errors
    • Die in a fire
    http://www.drdobbs.com/architecture-and-design/so-you-want-to-write-your-own-language/240165488?pgno=2

  69. About Those Errors
    • Die in a fire
    • Guess what I meant, not what I
    said
    http://www.drdobbs.com/architecture-and-design/so-you-want-to-write-your-own-language/240165488?pgno=2

  70. About Those Errors
    • Die in a fire
    • Guess what I meant, not what I
    said
    • Poisoning
    http://www.drdobbs.com/architecture-and-design/so-you-want-to-write-your-own-language/240165488?pgno=2
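
    A minimal sketch of the third strategy, poisoning (the types here are mine and much smaller than the repo’s): once a subtree has failed, every later stage matches the error case first and passes it through, so one mistake yields one message instead of a cascade.

    type Binding =
        | IntBinding of int
        | IncBinding of Binding
        | ErrorBinding of string

    let rec optimize (binding: Binding) : Binding =
        match binding with
        | ErrorBinding _ as poisoned -> poisoned   // already reported; leave it alone
        | IncBinding (IntBinding n)  -> IntBinding (n + 1)
        | IncBinding inner           -> IncBinding (optimize inner)
        | IntBinding _               -> binding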

  71. Type Checking
    What Problem Are We Solving? AST → Boolean: “Is it valid?”
    Non-Compiler Example: Linter
    Moving on: type checking doesn’t substantially change the intermediate representation. Mostly it’s just “thumbs up/thumbs down.”

    But some compilers do type inference, which does change it: it transforms nodes of unknown type into nodes of known type.

    Type checkers are really useful outside of compilers: (Swagger)

  72. Type Checking
    ldstr "Hi"
    ldstr "Hi"
    div
    This is bad.
    Don’t do this.
    One way to understand type checking is to look at a language which doesn’t do any. IL/ASM spend very little time checking. You can try to “divide” two strings, for
    example, and your application might crash or, worse, silently produce incorrect results. So it is critical that the compiler never emit such IL.

  73. Type Inference Rules
    Γ ⊢ A    Γ ⊢ B
    ——————————————
       Γ ⊢ A × B

    Γ ⊢ v₁ : Int    Γ ⊢ v₂ : Int
    ————————————————————————————
         Γ ⊢ v₁ + v₂ : Int
    Type systems are specified by inference rules. Given the premises or assumptions above the line, we’re allowed to form the derivations below the line. There’s an
    example on the slide. Here we say that if A and B are both types in a certain type environment named gamma, then we are also allowed to form a pair with members of
    types A and B. (You read “Γ ⊢” as “it’s provable within an environment Γ that…”) Similarly, if we know that two values, v₁ and v₂, are both integers, then so is their
    sum. There are lots of rules, but each one should be pretty simple.

    These seem almost too obvious to bother stating, but it helps to be really clear what the rules are, because, as you’ve seen with JavaScript, the corner cases can be a bit
    scary.
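
    Written as code rather than notation, the second rule might look like this (my sketch, not the talk’s checker):

    type Ty = IntTy | StringTy

    // Γ ⊢ v1 : Int    Γ ⊢ v2 : Int   ⇒   Γ ⊢ v1 + v2 : Int
    let checkAdd (t1: Ty) (t2: Ty) : Result<Ty, string> =
        match t1, t2 with
        | IntTy, IntTy -> Ok IntTy
        | _            -> Error "'+' expects two Ints"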

  74. Type Checking
    • Statically typed
    • Unityped (“dynamic language”)
    • Untyped
    You might ask yourself, “Self, what if I'm working on a dynamic language? Then I don't have to do any type checking, right?” However, you must do nearly the same type
    checking for evaluating a dynamic language as for pre-compiling a static language. The biggest difference is that the check is deferred until runtime. Other than that, it is
    very similar.

    Some languages, like ASM and C, do very little type checking at all. This is part of the reason why OpenSSL has so many issues and your Toyota accelerates itself. C
    has its uses, but please don’t build another C.

  75. Type Checking
    let rec private toBinding (environment: Map<string, Binding>) (expression: Expression) : Binding =
        match expression with
        | IntExpr n      -> IntBinding n
        | StringExpr str -> StringBinding str
    “A type system specifies the type rules of a programming language independently of particular typechecking algorithms. This is analogous to describing the syntax of a
    programming language by a formal grammar, independently of particular parsing algorithms.” - Luca Cardelli

    Some of this is easy!

  76. Type Checking
    | InvokeExpr (name, argument) ->
        match environment.TryFind name with
        | Some (FunctionBinding func) ->
            let argumentBinding = toInvokedArgumentBinding environment argument
            match argumentTypeError argumentBinding func with
            | None ->
                InvokeBinding {
                    FunctionName = name
                    Function = func
                    Argument = argumentBinding
                }
            | Some argumentTypeErrorMessage ->
                ErrorBinding (argumentTypeErrorMessage, EmptyBinding)
        | Some bindingType ->
            ErrorBinding (sprintf "Expected function; found %A" bindingType, EmptyBinding)
        | None ->
            ErrorBinding (sprintf "Undefined function '%s'." name, EmptyBinding)
    Some of it is considerably harder. When binding an invocation, we must first make sure it’s a real function and then that the argument can be bound at all, and lastly that
    the types match.

    Type checking is easier than type inference, but both are mostly solved problems, at least for simpler cases.

    Now we’re getting to a semantic understanding of the code.

  77. InvokeExpr {
    Name = "inc"
    Argument = StringExpr “Oops!" }
    Type Checking
    Possible errors.

  78. InvokeExpr {
    Name = "inc"
    Argument = StringExpr “Oops!" }
    Type Checking
    “Expected integer; found ‘Oops!’.”

  79. Optimizers
    What Problem Are We Solving? Program → Faster, but equivalent program
    Non-Compiler Example: Theorem prover
    Fight the urge to optimize outside the optimizers!

    Remember, one of the essential characteristics of compiler optimization is you can turn it off, e.g., for easier debugging. This is really hard if you optimize outside the
    optimizer.

    Optimizer must never change program behavior, except maybe making it harder to debug.

    Non-optimizer code should be so non-optimal it looks dumb.

  80. Optimization (I)
    InvokeBinding “inc” -1
    There cannot be any errors based on user input. Either make it better or leave it alone.

    Unlike other phases.

    Here’s an optimized version of the tree (click)

  81. Optimization (I)
    InvokeBinding “inc” -1
    IntBinding 0

  82. Optimization (I)
    Invoke “some-method” -1
    Often, you’ll do nothing. (click)

  83. Optimization (I)
    Invoke “some-method” -1
    Invoke “some-method” -1

  84. Optimization (I)
    let private optimizeInc (binding: Binding) : Binding =
        match binding with
        | IncBinding (IntBinding number)
            -> IntBinding (number + 1)
        | IncBinding _
        | BoolBinding _
        | IntBinding _
        | StringBinding _
        | VariableBinding _
        | FunctionBinding _
        | InvokeBinding _
        | DefBinding _
        | ErrorBinding _
        | EmptyBinding _
            -> binding
    This is an example of a function which does one specific optimization:

    Find when the “inc” function is applied to a literal int and substitute the correct result.

    Hooray, we’ve optimized away the function call!

    It ignores other kinds of nodes.
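
    In use (my examples, assuming the repo’s constructors):

    optimizeInc (IncBinding (IntBinding (-1)))      // IntBinding 0: the call is folded away
    optimizeInc (IncBinding (VariableBinding "x"))  // unchanged: not a literal, so leave it alone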

  85. IL Generation
    IntBinding 0
    After optimizing the tree, generate IL.

    If we start from a simple binding… (click)

    We want to produce

    If you’ve seen IL before, you may wonder why not (click). More optimized!

    Why don’t we do this? (anyone?)

  86. IL Generation
    IntBinding 0
    Ldc.i4 0

  87. IL Generation
    IntBinding 0
    Ldc.i4 0
    Ldc.i4.0

  88. IL Generation
    let rec private codegenBinding (binding : Binding) =
        match binding with
        | BoolBinding b ->
            match b with
            | true  -> [Ldc_I4_1]
            | false -> [Ldc_I4_0]
        | IntBinding n    -> [Ldc_I4 n]
        | StringBinding s -> [Ldstr s]
        | // …
    Some of this is pretty straightforward.

  89. IL Generation
    let private writeLineMethod =
        typeof<System.Console>.GetMethod(
            "WriteLine", [| typeof<int> |])
    let private codegenOper = function
        | IncInt ->
            [ Instruction.Ldc_I4_1
              Instruction.Add ]
        | WriteLine ->
            [ Instruction.Call writeLineMethod ]
    But for built-in, primitive operations, I have to write out the code in IL. I need these primitives for more complicated programs.
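
    Putting the two slides together (my trace, not the repo’s output), the body of (inc 2) comes out as:

    codegenBinding (IntBinding 2) @ codegenOper IncInt
    // [Ldc_I4 2; Ldc_I4_1; Add] : push 2, push 1, add them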

  90. Optimization (II)
    Ldc.i4 0
    For IL optimization, we want to replace the generic operations with their optimized IL short forms. Go from this, to (click)

  91. Optimization (II)
    Ldc.i4 0
    Ldc.i4.0

  92. Optimization (II)
    let private optimalShortEncodingFor = function
        | Ldc_I4 0 -> Ldc_I4_0
        | Ldc_I4 1 -> Ldc_I4_1
        | Ldc_I4 2 -> Ldc_I4_2
        | Ldc_I4 3 -> Ldc_I4_3
        | Ldc_I4 4 -> Ldc_I4_4
        | Ldc_I4 5 -> Ldc_I4_5
        | Ldc_I4 6 -> Ldc_I4_6
        | Ldc_I4 7 -> Ldc_I4_7
        | Ldc_I4 8 -> Ldc_I4_8
        | Ldloc 0 -> Ldloc_0
        | Ldloc 1 -> Ldloc_1
        | Ldloc 2 -> Ldloc_2
        | Ldloc 3 -> Ldloc_3
        | Ldloc i when i <= maxByte -> Ldloc_S(Convert.ToByte(i))
    Anything not listed here stays the same.

    No errors!
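
    For example (mine):

    optimalShortEncodingFor (Ldc_I4 3)    // Ldc_I4_3: a one-byte encoding instead of five
    optimalShortEncodingFor (Ldc_I4 500)  // unchanged: no short form in the list above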

  93. Special Tools!
    PEVerify, ildasm. Similar tools exist for other platforms.

    ildasm: You may have seen it before. It decompiles an EXE/DLL into IL.

    PEVerify: checks that the IL and metadata you emitted are valid.

  94. Compare!
    When in doubt, try it in C#, and see what that compiler emits. Sometimes it’s a little strange. Roll with it.

  95. https://www.ece.cmu.edu/~ganger/712.fall02/papers/p761-thompson.pdf
    OK, this seems really weird when you first hear about it.

    Thompson says we should trust people more than code. But people are fallible and code is precise, right?

  96. Trusting Trust
    Compiler
    Executable
    Compiler
    Source
    Code
    Compiler
    Executable
    Many compilers can compile themselves. This is how it’s supposed to work.

    The two green boxes are identical, right? Right. This is how we expect it to work. Compilers map source code to executables.

  97. Trusting Trust
    Compiler
    Executable
    Compiler
    Source
    Code
    Trojaned
    Compiler
    Executable
    Trojan
    Code
    What if someone adds some malicious code to the compiler source code?

    What does that do?

  98. Trusting Trust
    Trojaned
    Compiler
    Executable
    Benign App
    Source
    Code
    Trojaned
    App
    Executable
    Code which adds a trojan to any app the compiler compiles?

    This is obviously bad, but maybe not too surprising, right?

  99. Trusting Trust
    Trojaned
    Compiler
    Executable
    (Benign!)
    Compiler
    Source
    Code
    Trojaned
    Compiler
    Executable
    Now the trojan lives in the compiler EXE only, not the source code! Even if you recompile the compiler itself from good, benign source code, you don’t know if you’re
    secure. You need to know the full lineage of the compiler.

    This is true in the context of formal verification, as well, not just security against bad guys.

  100. Conclusion
    Don’t fear hard problems. Recognize solved problems, use recipes that have been developed over a half century of prior art.

  101. Further Reading

    Don’t buy the dragon book. Universally recommended by people who haven’t read it, or who have read nothing else.

  102. Further Reading
    • Programming Language Concepts, by Peter
    Sestoft
    • Modern Compiler Implementation in ML, by
    Andrew W. Appel
    • miniml (608 line implementation of ML
    subset), by Andrej Bauer
    • Coursera Compilers Course, by Alex Aiken

  103. Craig Stuntz
    @craigstuntz
    [email protected]
    http://blogs.teamb.com/craigstuntz
    http://www.meetup.com/Papers-We-Love-Columbus/
    https://speakerdeck.com/craigstuntz
    https://github.com/CraigStuntz/TinyLanguage
    Feel free to reach out and ask follow up questions, either here at the conference or by one of the ways on the slide.
