Programs that Write Programs: How Compilers Work

Craig Stuntz
January 07, 2016

Slides and speaker notes for my CodeMash 2016 presentation

Transcript

  1. Programs that
    Write Programs
    Craig Stuntz
    https://speakerdeck.com/craigstuntz
    https://github.com/CraigStuntz/TinyLanguage
    You can grab my slides and code from the links above.

    I won’t reserve time at the end of the talk for questions. I have a lot of material to cover. Please interrupt and ask if I’m unclear about anything. And I’d be happy to “buy
    you dinner” afterwards if you want to talk more.

  2. –Steve Yegge
    “You're actually
    surrounded by
    compilation problems.
    You run into them
    almost every day.”
    http://steve-yegge.blogspot.ca/2007/06/rich-programmer-food.html
    Why bother learning compilers? What can I tell you in just an hour?

    I think it’s fun & interesting in its own right. Maybe you agree, since you’re here. But you might think you don’t have the opportunity to use compiler techniques in your
    day job. Maybe you can! Obviously I can’t explain everything there is to know about compiler implementations in an hour, which is why I’m also giving you complete
    source code for a simple one to look at later! I want you to be able to look at a hard problem and tell your colleagues you know how to solve it.

  3. –Greenspun’s Tenth Rule
    “Any sufficiently
    complicated C or Fortran
    program contains an ad
    hoc, informally-
    specified, bug-ridden,
    slow implementation of
    half of Common Lisp.”
    https://commons.wikimedia.org/wiki/File:Philip_Greenspun_and_Alex_the_dog.jpg
    Compilation is fundamental to the problem of producing software. If you’re a professional programmer, you will — you will! — be asked to solve problems which you can
    do better if you recognize them as pieces of the compiler toolchain.

  4. The Hoover Dam
    Second thing: There exist problems too big to test. You can’t test every possible program a compiler might need to compile. You need a different methodology to drive
    your designs. You can solve these problems anyway, and produce a design with confidence that it’s going to work.

  5. Generalize the
    Problem
    The real skill I want you to leave this talk with: Recognize compilation problems (they’re everywhere!), and apply proven, reliable patterns to solve them.

    Simply recognizing these problems for what they are and knowing where to look to find the solution will make you a better developer. This is a super power!

  6. –Eugene Wallingford
    “…compilers ultimately depend on a
    single big idea from the theory of
    computer science: that a certain
    kind of machine can simulate
    anything — including itself. As a
    result, this certain kind of
    machine, the Turing machine, is the
    very definition of computability.”
    http://www.cs.uni.edu/~wallingf/blog/archives/monthly/2015-09.html#e2015-09-03T15_26_47.htm
    What Is a Compiler, Really? (pause) It’s a program that writes a program! Or a program which takes a representation of data and writes another representation. Often, it’s
    a program that can write any possible program. Many compilers can compile themselves, producing their own executable as output. There are weird and non-obvious
    consequences of this! This idea on the screen — this is huge!

  7. Compiler
    So a compiler produces a program by mapping source code to output.

    But maybe you prefer Ruby or JS? (Click) Interpreter is a compiler which recognizes source code plus additional user input at the same time.

    I’ll talk mostly about compilers, but remember an interpreter is just a compiler with an extra feature or three.

  8. Compiler Interpreter

  9. code exe
    Actually, I lied just slightly.

    We like to think of a compiler as code -> exe, but this is mostly wrong. (Click)

    Primary purpose is to reject incorrect code. Maybe 1 in 10 times we produce a program, right? Right?

  10. code exe

  11. Useful Bits
    • Regular Expressions (lexing)
    • Deserializers (parsing)
    • Linters, static analysis (syntax, type
    checking)
    • Solvers, theorem provers (optimization)
    • Code migration tools (compilers!)
    Also, most of the individual pieces of the compilation pipeline are useful in their own right.

    When you learn to write a compiler, you learn all of the above, and more!

  12. A → B
    More generally, a compiler is a formalization for taking data in one representation and turning it into data in a different representation, if valid. The transformation (from A
    to B) is often nontrivial. This concept describes a large percentage of the code we write, so compilation concepts are really useful, even when not writing a compiler per
    se.

  13. Source code → Program
    JPEG file → Image on screen
    Source code → Potential style error list
    JSON → Object graph
    Code with 2 digit years → Y2K compliant code
    VB6 → C#
    Object graph → User interface markup
    Algorithm → Faster, equivalent
    algorithm
    Some of these look like what we typically consider a compiler. Some don’t.

    None are contrived.

    I do all of these in my day job. These are problems you must routinely solve even if you’re not a language author.

  14. Designing with
    Formal Methods
    Many people here are probably familiar with test driven design, where you write tests to guide evolution of program design.

    Although I do write (many!) tests for my compilers, Test Driven Design doesn’t hold up as a design methodology. Let’s examine why, and what the alternative might be.

  15. #define D define
    #D Y return
    #D R for
    #D e while
    #D I printf
    #D l int
    #D W if
    #D C y=v+111;H(x,v)*y++= *x
    #D H(a,b)R(a=b+11;a#D s(a)t=scanf("%d",&a)
    #D U Z I
    #D Z I("123\
    45678\n");H(x,V){putchar(".XO"[*x]);W((x-V)%10==8){x+=2;I("%d\n",(x-V)/10-1);}}
    l V[1600],u,r[]={-1,-11,-10,-9,1,11,10,9},h[]={11,18,81,88},ih[]={22,27,72,77},
    bz,lv=60,*x,*y,m,t;S(d,v,f,_,a,b)l*v;{l c=0,*n=v+100,j=d3-f;W(d>u){R(w=i=0;i<4;i++)w+=(m=v[h[i]])==f?300:m==q?-300:(t=v[ih[i]])==f?-50:
    t==q?50:0;Y w;}H(z,0){W(E(v,z,f,100)){c++;w= -S(d+1,n,q,0,-b,-j);W(w>j){g=bz=z;
    j=w;W(w#$b%&w#$8003)Y w;}}}W(!c){g=0;W(_){H(x,v)c+= *x==f?1:*x==3-f?-1:0;Y c>0?
    8000+c:c-8000;}C;j= -S(d+1,n,q,1,-b,-j);)bz=g;Y d#$u-1?j+(c'(3):j;}main(){R(;t<
    1600;t+=100)R(m=0;m<100;m++)V[t+m]=m<11%&m>88%&(m+1)%10<2?3:0;I("Level:");V[44]
    =V[55]=1;V[45]=V[54]=2;s(u);e(lv>0){Z do{I("You:");s(m);}e(!E(V,m,2,0))*m+,99);
    W(m+,99)lv--;W(lv<15)*u<10)u+=2;U("Wait\n");I("Value:%d\n",S(0,V,1,0,-9000,9000
    ));I("move: %d\n",(lv-=E(V,bz,1,0),bz));}}E(v,z,f,o)l*v;{l*j,q=3-f,g=0,i,w,*k=v
    +z;W(*k==0)R(i=7;i#$0;i--){j=k+(w=r[i]);e(*j==q)j+=w;W(*j==f)*j-w+,k){W(!g){g=1
    ;C;}e(j+,k)*((j-=w)+o)=f;}}Y g;}
    Anyone here know any C? Does this look like a valid C program to you?

    This one is not valid, but it would be if you changed one character. Can you find it, or could you write a program to find it?

    This is a hard problem: Write a program that for any string whatsoever (and there are lots of possible strings!) either declares it a valid program or explains clearly to a
    human why it isn’t.

  16. Duff’s
    Device
    There Are No Edge
    Cases In
    Programming
    Languages
    send(to, from, count)
    register short *to, *from;
    register count;
    {
        register n = (count + 7) / 8;
        switch (count % 8) {
        case 0: do { *to = *from++;
        case 7:      *to = *from++;
        case 6:      *to = *from++;
        case 5:      *to = *from++;
        case 4:      *to = *from++;
        case 3:      *to = *from++;
        case 2:      *to = *from++;
        case 1:      *to = *from++;
                } while (--n > 0);
        }
    }
    And I mean every possible program! (Explain Duff)

    People will do absolutely anything the PL grammar allows.

    Therefore your compiler must be able to classify literally any arbitrary string into a valid or invalid program, and you can’t predict the valid programs people will write.

  17. Even the designer of a language can’t begin to predict all the things the compiler will be asked to parse.

    You can’t design a compiler which can parse any legal program by poking around with tests of possible programs. You would need an unbounded number of tests.
    You must design with formal methods. But you do still write tests; it’s just that you use a different design methodology!

  18. 1 + 2 + 3 + … + 100
    =
    100 * 101 / 2
    =
    5050
    Another reason compilers are interesting is they try to solve a very hard problem:

    Don’t just convert A to B.

    Convert A into a highly-optimized-but-entirely-equivalent B. For any given A whatsoever!

    If a compiler crashes, that’s bad. If it silently emits incorrect code, it’s worse than bad; the world ends.

    Hard problems are interesting!
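
    You can check that particular rewrite the way an optimizer might justify it: with an executable equality. A one-line F# sketch (mine, not the talk’s):

    List.sum [1..100] = 100 * 101 / 2  // true: both sides evaluate to 5050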

  19. Good news! Although this is a hard problem, it’s a mostly solved problem. “Solved” not as in it will never get better.

    “Solved” as in there is a recipe, which most programmers are capable of following. There are lots of steps, but each individual step is pretty simple. You can learn it one
    tiny piece at a time. This is a really learnable skill.

    In fact, if you have a hard problem and can distort it until it looks like a compiler then you have a solved problem. This is magic!

    Contrast this with web development, which is also hard, but we change our minds about what constitutes “best practices” annually.

  20. Lexer → Regular Expressions
    Parser → Context-Free Grammar
    Optimizer → Algebra
    Type Checker → Logical Inference Rules
    Code Generator → Denotational Semantics
    For each part of the compiler pipeline, there exists a formalism which guides our designs. Some of these words are long and unfamiliar, but just think of them as recipes
    for implementation. If you follow them, you will cover all of the cases in your language specification.

  21. –Leslie Lamport
    “You don’t achieve
    simplicity by thinking
    in terms of complicated
    languages. Simplicity
    requires thinking
    abstractly before you
    start implementing.”
    http://www.heidelberg-laureate-forum.org/blog/video/lecture-monday-august-24-2015-leslie-lamport/
    https://commons.wikimedia.org/wiki/File:Leslie_Lamport.jpg
    Worth noting: Compiler implementation is fairly straightforward, and there’s a recipe to follow.

    Language design is much harder, and there’s no “best” recipe. Lots of people can write good compilers. Far fewer can write a good language. Fewer still can write a
    good, simple language.

    Respect language designers. It’s common to try to learn compilers by inventing your own language or chasing a complicated language. I recommend you don’t do that! I
    use a tiny Lisp here, but I started with a math expression evaluator. Most math expressions are simpler than PLs.

  22. A Few Important
    Concepts
    Before we dive into compiler implementation per se…

  23. Syntax
    x = x + 1;
    alert(x);
    Sequence
    ├─ Assign
    │    ├─ x
    │    └─ add
    │         ├─ x
    │         └─ 1
    └─ Invoke
         ├─ alert
         └─ x
    A compiler must handle both the written form (syntax) and the internal meaning (semantics) of the language.

    Syntax has both a literal (characters) and a tree form. These are both representations of the syntax.

    This is the “abstract” syntax tree. Believe it or not it’s the simplified version!
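
    As a sketch of what that tree form might look like as data (my types, not the talk’s F# code), here it is modeled in F#:

    type Expr =
        | Identifier of string
        | IntLiteral of int
        | Add of Expr * Expr
    type Statement =
        | Assign of name: string * value: Expr
        | Invoke of name: string * argument: Expr
    type Syntax = Sequence of Statement list

    // x = x + 1; alert(x);
    let tree =
        Sequence
            [ Assign ("x", Add (Identifier "x", IntLiteral 1))
              Invoke ("alert", Identifier "x") ]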

  24. Semantics
    name = "Nate"
    # +/ "Nate"
    String.upcase(name)
    # +/ "NATE"
    name
    # +/ "Nate"
    name = "Nate"
    # +/ "Nate"
    name.upcase!
    # +/ "NATE"
    name
    # +/ "NATE"
    http://www.natescottwest.com/elixir-for-rubyists-part-2/
    People obsess about syntax. They reject LISPs because (). But semantics are, I think, most important.

    Nobody says we don’t need JavaScript since we have Java. People do ask if we need VB.NET since there is C# (syntactically different, but semantically equivalent). I
    want to clarify the distinction.

    Elixir on left, Ruby on right. Looks similar, but Elixir strings are immutable. Similar syntax; different semantics. Similar looking programs mean different things.

  25. Semantics
    Imports System
    Namespace Hello
    Class HelloWorld
    Overloads Shared Sub Main(ByVal args() As String)
    Dim name As String = "VB.NET"
    'See if argument passed
    If args.Length = 1 Then name = args(0)
    Console.WriteLine("Hello, " & name & "!")
    End Sub
    End Class
    End Namespace
    using System;
    namespace Hello {
    public class HelloWorld {
    public static void Main(string[] args) {
    string name = "C#";
    !" See if argument passed
    if (args.Length == 1) name = args[0];
    Console.WriteLine("Hello, " + name + "!");
    }
    }
    }
    http://www.harding.edu/fmccown/vbnet_csharp_comparison.html
    VB.NET on left, C# on right. Looks different, but semantics are identical. Different looking programs mean the same thing.

    “The syntax of a language is governed by the constructs that define its types, and its semantics is determined by the interactions among those constructs.” -Robert
    Harper

    Is this clear? Compilers deal with both; distinction is important.

    Let’s look at the big picture.

  26. Front End: Understand Language
    Back End: Emit Code
    Front end, back end

    Definitions vary, but nearly always exist

  27. Lexer → Parser → Binder → Type Checker → Optimizer →
    IL Generator → Optimizer → Object Code Generator
    This is simplified. Production compilers have more stages. Front end, back end… Middle end?

    There is an intermediate representation — collection of types — for each stage.

    I show two optimizers here. Production compilers have more.

  28. OK, so let’s compile
    something already!
    module Compiler
    let compile =
        Lexer.lex
        >> Parser.parse
        >> Binder.bind
        >> OptimizeBinding.optimize
        >> IlGenerator.codegen
        >> Railway.map OptimizeIl.optimize
        >> Railway.map Il.toAssemblyBuilder
    The real, full source code for the compiler itself is on the screen; it just chains together the “recipes” we’ll be examining in detail in the examples to come. I’m
    literally just piping one module into the next.
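
    Because compile is built with function composition (>>), each stage’s output type must match the next stage’s input, and the whole pipeline is itself just a function from source text to a result. A hypothetical invocation (my example; the exact result type lives in the repo’s Railway module):

    let result = Compiler.compile "(inc -1)"  // success carries the assembly; failure carries error messages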

  29. (inc -1)
    Let’s start with something really, really simple.

    We want to be able to compile this program. (Remember, start with a toy language.)

    To what? There’s more than one valid representation of this program. Here’s one (click, explain IL). Here’s another (click). How do we even decide?

    Is it clear to everyone why these are both valid representations of the program I’m compiling?

  30. (inc -1)
    Ldc.i4 -1
    Ldc.i4 1
    Add

  31. (inc -1)
    Ldc.i4 -1
    Ldc.i4 1
    Add
    Ldc.i4.0

  32. (inc -1) Lex
    LeftParen, Identifier(inc),
    Number(-1), RightParen
    Parse Apply “inc” to -1
    Type
    check
    “inc” exists and takes an int
    argument, and -1 is an int. Great!
    Optimize -1 + 1 = 0, so just emit int 0!
    IL
    generate
    Ldc.i4 0
    Optimize Ldc.i4 0 → Ldc.i4.0
    Object
    code
    Produce assembly with entry point
    which contains the IL generated
    Let’s break this process into individual steps.

    All we’ve done is bust the hard problem of source code → optimized EXE into a bunch of relatively small problems. (Read each) We’ll explain them in detail in just a moment.

    Production compiler engineers will say this is too simple, but their code, like most production code, is probably a mess. You can learn a lot from the simple case!

  33. (defun add-1 (int x)
    (inc x))
    (defun main ()
    (print (add-1 2)))
    Or maybe this one. It’s a little more complicated, but hopefully it still makes sense.

    Let’s look at the individual pieces of the pipeline.

  34. Lexer
    What Problem Are We Solving? String → Sequence of tokens
    Non-Compiler Example: Text search
    Lexers break strings into tokens using a grammar which is a bunch of really simple regular expressions.

    Tokens/lexemes are like characters, but they’re just a little bit richer. You get a token for the int 123 instead of the individual characters 1, 2, and 3.

    Regular expressions search text within a string.

    Lexers search for reserved words/symbols in a language.

  35. Lexer
    Search “am”
    I am. You are.
    You don’t expect a search for “am” to match the word “are” simply because they’re conjugations of the same verb. This is the difference between lexing and parsing:
    lexing deals with symbols and parsing deals with language grammar. Lexing works on character input. Parsing works on lexeme input (roughly, words or symbols) —
    the fundamental unit of the PL’s grammar and also the output of the lexer.

  36. Regular Expressions
    leftParenthesis = ‘(‘
    rightParenthesis = ‘)’
    letter = ‘A’ | ‘B’ | ‘C’ | …
    digit = ‘0’ | ‘1’ | ‘2’ | …
    number = (‘+’digit|‘-’digit|digit) digit*
    alphanumeric = letter | number
    …
    Of course an acceptable program can’t really be any random string. There are rules. We define the rules via formalisms. For lexers, the formalism is regular expressions.
    This doesn’t mean a PCRE; it’s much simpler.

    As with the example above, a RE can be a literal character, a choice, or a sequence. That’s it! There are no other options. Lexers are really simple!
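
    To make the “number” rule concrete, here it is translated to a .NET regex (my translation, not the slide’s):

    open System.Text.RegularExpressions
    // number = (‘+’digit | ‘-’digit | digit) digit*
    let number = Regex(@"^[+-]?[0-9][0-9]*")
    number.IsMatch "-1"    // true
    number.IsMatch "inc"   // false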

    But how can we be sure that the code behaves like this grammar, every time?

  37. Lexer
    If there’s one rule I know which will for sure make your programs better, it’s this: Make illegal states unrepresentable. (Yaron Minsky)

  38. Lexer
    type Lexeme =
    | LeftParenthesis
    | RightParenthesis
    | Identifier of string
    | LiteralInt of int
    | LiteralString of string
    | Unrecognized of char
    The regular expressions in the lexical grammar directly map to the types I create for the lexer.

    I cannot construct an instance of something which doesn’t fit in the types, hence, I cannot construct an instance of a program which doesn’t fit in the grammar. Any
    code input which doesn’t match the lexical grammar gets tossed into the bottom “Unrecognized” type. It will eventually surface as an error to the user.

  39. Lexer
    (inc -1)
    Start with this, transform to (click). Haven’t tried to infer any kind of meaning at all yet.

    What’s the point of lexing vs parsing? Output of lexing is terminals of the grammar used by the parser. Lexer sets the parser up for success. Makes the formalism
    possible.

  40. Lexer
    (inc -1)
    “(“ “inc” “-1” “)”
    LeftParenthesis
    Identifier(“inc”)
    LiteralInt(-1)
    RightParenthesis

  41. Lexer
    ( -1)
    Possible errors: Something which doesn’t fit the lexical grammar, and also can’t be recognized as a terminal for the parsing grammar.

  42. Lexer
    let rec private lexChars (source: char list) : Lexeme list =
        match source with
        | '(' :: rest -> LeftParenthesis :: lexChars rest
        | ')' :: rest -> RightParenthesis :: lexChars rest
        | '"' :: rest -> lexString(rest, "")
        | c :: rest when isIdentifierStart c -> lexName (source, "")
        | d :: rest when System.Char.IsDigit d -> lexNumber(source, "")
        | [] -> []
        | w :: rest when System.Char.IsWhiteSpace w -> lexChars rest
        | c :: rest -> Unrecognized c :: lexChars rest
    I write all of my compiler code as purely functional.

    Most example compiler code requires understanding the state of the compiler as well as the behavior of the code. You can understand this code from the code alone.
    There is no state at all to know.

    Lexer is recursive: Look at first char, decide what to do, then repeat.
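
    Helpers like lexNumber aren’t on the slide. A minimal sketch of the idea (mine; the repo’s version differs, and I omit sign handling): consume the leading digits, emit one LiteralInt token, and hand the rest of the input back to the caller.

    let private lexNumber (source: char list) : Lexeme * char list =
        let digits = source |> List.takeWhile System.Char.IsDigit
        let rest   = source |> List.skipWhile System.Char.IsDigit
        let value  = digits |> Array.ofList |> System.String |> int
        LiteralInt value, rest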

  43. Lexer
    http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
    Counterexamples: The thing you must know about lexing is when not to use it. You can’t parse XML with a regex.

    The user experience for validation should be gently helping the user to avoid fat fingers and submit correct data.

  44. Lexer
    http://www.regular-expressions.info/email.html
    “So even when following official standards,
    there are still trade-offs to be made. Don't
    blindly copy regular expressions from online
    libraries or discussion forums.”
    -Jan Goyvaerts, regular-expressions.info
    Don’t piss them off by telling them they’ve misspelt their Irish surname or that their real email address is “invalid.”

    Don’t validate a grammar (like email addresses) with a lexer/regex.

  45. Parser
    What Problem Are We Solving? Sequence of tokens → Syntax tree
    Non-Compiler Example: Deserialization
    The lexer produces a sequence of tokens. We want to turn that into an abstract syntax tree, respecting operator precedence.

  46. PEMDAS
    1 + 2 * 3
    1 + (2 * 3)
    What is precedence? In most languages, for example, you multiply before you add, regardless of their sequence in an expression. The expressions above and below the
    line should be identical.
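
    One standard way to encode that rule is to layer the grammar so that * binds tighter than + (a textbook sketch, not this talk’s Lisp grammar, which doesn’t need precedence):

    expression := term ((“+” | “-”) term)*
    term       := factor ((“*” | “/”) factor)*
    factor     := number | “(” expression “)”

    Because a term can only contain factors, 1 + 2 * 3 has no parse other than 1 + (2 * 3).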

  47. To parse a language, you must understand its grammar. Example is part of ECMAScript grammar.

    All valid statements in the language (all!) can be constructed from the grammar. Anything we can’t construct from the grammar is invalid and will be a parse error. The
    converse isn’t true, though.

    We can also construct invalid statements from the grammar, because the parser doesn’t type check. Just because something can be parsed doesn’t make it correct, but
    it’s certainly invalid if it can’t be parsed. True in spoken languages as well: “Colorless green ideas sleep furiously” Grammatically correct, but semantically nonsensical.
    Parsing works on syntax, not semantics.

  48. Grammar
    <program>    := <expression> | <expression> <program>
    <expression> := <defun> | <value>
    <defun>      := “(defun” identifier <expression> “)”
    <value>      := number | string | <invoke>
    <invoke>     := “(” identifier <value> “)”
    With a reasonably specified language, the rules are pretty easy to follow. (good: C#; bad: Ruby)

    Our goal today is to implement a really simple language. We’ll follow the grammar here.

    Explain Terminals vs. nonterminals. Important for lexing vs. parsing.

  49. Grammar
    type Expression =
    | IntExpr of int
    | StringExpr of string
    | DefunExpr of name: string * argument: ArgumentExpression option * body: Expression
    | InvokeExpr of name: string * argument: Expression option
    | IdentifierExpr of string
    | ErrorExpr of string
    | EmptyListExpr
    Just as we defined types to represent productions in the lexical grammar, we do the same for the parsing grammar. At this point you might ask if this mapping from the
    formal grammar to F# types can be automated, and it can! I’ve built my entire compiler “from scratch” so you can see how it works, but it’s common to use lexer and
    parser generators in real-world work.
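
    Under these types, the parse of “(inc -1)” from the earlier slides comes out as (my construction, using the slide’s types):

    let incNegOne = InvokeExpr (name = "inc", argument = Some (IntExpr (-1)))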

  50. Parser
    LeftParenthesis
    Identifier(“inc”)
    LiteralInt(-1)
    RightParenthesis
    How does parsing work?

    Start with a list of tokens from the lexer. Click

    Produce syntax tree.

    Here we have a const -1 argument; argument could be another expression like another invocation.

  51. Parser
    LeftParenthesis
    Identifier(“inc”)
    LiteralInt(-1)
    RightParenthesis
    Invoke “inc” -1

  52. Parser
    LeftParenthesis
    Identifier(“inc”)
    LiteralInt(-1)
    LiteralInt(-1)
    Possible errors: Bad syntax.

    Every stage of the compiler checks the form of the previous stage. (click)

    Parser checks lexer output.

  53. Parser
    LeftParenthesis
    Identifier(“inc”)
    LiteralInt(-1)
    LiteralInt(-1)
    “Expected ‘)’”

  54. Parser
    let rec private parseExpression (state : ParseState) : ParseState =
        match state.Remaining with
        | LeftParenthesis :: Identifier "defun" :: Identifier name :: rest ->
            let defun = parseDefun (name, { state with Remaining = rest })
            match defun.Expressions, defun.Remaining with
            | [ ErrorExpr _ ], _ -> defun
            | _, RightParenthesis :: remaining -> { defun with Remaining = remaining }
            | _, [] -> error ("Expected ')'.")
            | _, wrong :: _ -> error (sprintf "Expected ')'; found %A." wrong)
        | LeftParenthesis :: Identifier name :: argumentsAndBody ->
            let invoke = parseInvoke (name, { state with Remaining = argumentsAndBody })
            match invoke.Remaining with
            | RightParenthesis :: remaining -> { invoke with Remaining = remaining }
            | [] -> error ("Expected ')'.")
            | wrong :: _ -> error (sprintf "Expected ')'; found %A." wrong)
        | LeftParenthesis :: wrong -> error (sprintf "%A cannot follow '('." wrong)
    Implementing parsers can be a bit tricky since there are lots of choices in implementation style with subtle tradeoffs. But you can use an off-the-shelf toolkit in most
    cases. Parts of my parser shown here. Again, I’m just doing a pattern match against the token stream from the lexer, because parsing a LISP is easy.

    The key to not getting stuck in a compiler is to do only one thing at a time.

    Do not optimize.

    Do not type check.

    Just look for valid syntax.

  55. Parser
    Practical parsing:

    Deserialization, especially of untrusted input like reading a file format such as PNG, is a parsing job, and you should use formal methods to implement it. If you try to
    “wing it”, you are probably letting the bad guys in to attack anyone who uses your library.

  56. –Guy Steele
    “If it's worth
    telling another
    programmer, it's
    worth telling the
    compiler, I think.”
    https://joshvarty.wordpress.com/2015/08/03/learn-roslyn-now-part-11-introduction-to-code-fixes/
    One super practical use for a parser is enforcing code style guides.

    Many companies write style guides as Word documents or worse. That leads to lax enforcement, usually targeted at new employees only, and lots of pissy email threads.

  57. Parser
    https://joshvarty.wordpress.com/2015/08/03/learn-roslyn-now-part-11-introduction-to-code-fixes/
    Instead, you can write a Roslyn syntax analyzer and just fail the build if a rule is broken.

    Parsers: Bringing world peace through technology!

  58. Scope
    What Problem Are We Solving? What does “x” mean right now?
    Non-Compiler Example: Bounded Context in Domain-Driven Design
    In an ideal world (for the compiler author), developers would give a unique and unambiguous name to every variable.

    In the real world, probably half the variables in any average program are called “temp”.

    The compiler has to decide which assignment to something called “temp” to use in a given context.

  59. Scope
    https://msujaws.wordpress.com/2011/05/03/static-vs-dynamic-scoping/
    Scoping rules unambiguously associate occurrences of identifier names to their binding locations, or declaration sites.

    Essentially all modern languages use lexical (static) scoping. This is the rarest of things: A settled argument in computer science. Essentially nobody argues that dynamic
    scoping is a good idea anymore. Only archaic languages like SNOBOL and early LISPs use dynamic scoping. (Perl has opt-in dynamic scoping because it’s Perl.) Most
    contemporary dynamic languages use lexical scoping.

    As a compiler writer, this requires some care.
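
    A tiny F# illustration (my example) of what “lexical” means here: the x an inner function sees is fixed by where the code is written, not by who calls it.

    let x = 1
    let inner () = x    // always the x above, by lexical scope
    let outer () =
        let x = 2       // shadows x locally, but cannot change what inner sees
        inner ()
    let result = outer ()   // 1 under lexical scoping; dynamic scoping would say 2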

  60. Binding
    InvokeExpr “inc” -1
    The binder handles scoping concerns. Binder connects syntactic identifiers like “inc” with meaning.

    Start with an abstract syntax tree. Click.

    Produce a binding tree. Note the line Function = Inc — the string “inc” has been mapped to a built-in function the compiler understands and can code generate.

    Binding enforces scope and gives us the ability to check types.

  61. InvokeBinding {
    FunctionName = "inc"
    Function = Inc
    Argument = IntBinding -1}
    Binding
    InvokeExpr “inc” -1

  62. InvokeExpr {
    Name = "not-a-function"
    Argument = StringExpr "" }
    Binding
    Possible errors. (click)

  63. InvokeExpr {
    Name = "not-a-function"
    Argument = StringExpr "" }
    Binding
    “Undefined function ‘not-a-function’.”

  64. https://msdn.microsoft.com/en-us/library/ms228296.aspx?f=255&MSPPError=-2147217396
    Someone asked me how much code in a compiler is dedicated to error handling. Let’s take a digression and consider this.

    Remember, the most common case for a compiler is failing on bad code and giving the user a good error message.

    C# sure has a lot of them. Each of these has code behind it.

  65. As a compiler author, you either put in the effort to do it well or you leave your user with a really poor experience.

  66. About Those Errors
    [<Test>]
    member this.``should return error for unbound invocation``() =
    let source = "(bad-method 2)"
    let expected = ErrorBinding (
    "Undefined function 'bad-method'.", EmptyBinding)
    let actual = bind source
    actual |> should equal expected
    So I write tests for each.

    Each error needs a test example, or several. This isn’t test driven design; it’s regression testing.

  67. About Those Errors
    http://www.drdobbs.com/architecture-and-design/so-you-want-to-write-your-own-language/240165488?pgno=2
    There are a few possible ways to deal with errors. (click)

    First is common when getting started; entirely unsuitable for production compilers. Just fail on first error found. (click)

    Second friendly but hard and often wrong. Guess what the user intended and try to continue compilation. Report all errors at the end of the process. (click)

    I use third.

  68. About Those Errors
    • Die in a fire
    http://www.drdobbs.com/architecture-and-design/so-you-want-to-write-your-own-language/240165488?pgno=2

  69. About Those Errors
    • Die in a fire
    • Guess what I meant, not what I
    said
    http://www.drdobbs.com/architecture-and-design/so-you-want-to-write-your-own-language/240165488?pgno=2

  70. About Those Errors
    • Die in a fire
    • Guess what I meant, not what I
    said
    • Poisoning
    http://www.drdobbs.com/architecture-and-design/so-you-want-to-write-your-own-language/240165488?pgno=2
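
    A minimal sketch of the third strategy, poisoning (the types here are mine and much smaller than the repo’s): once a subtree has failed, every later stage matches the error case first and passes it through, so one mistake yields one message instead of a cascade.

    type Binding =
        | IntBinding of int
        | IncBinding of Binding
        | ErrorBinding of string

    let rec optimize (binding: Binding) : Binding =
        match binding with
        | ErrorBinding _ as poisoned -> poisoned   // already reported; leave it alone
        | IncBinding (IntBinding n)  -> IntBinding (n + 1)
        | IncBinding inner           -> IncBinding (optimize inner)
        | IntBinding _               -> binding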

  71. Type Checking
    What Problem Are We Solving? AST → Boolean: “Is it valid?”
    Non-Compiler Example: Linter
    Moving on: type checking doesn’t substantially change the intermediate representation. Mostly it’s just “thumbs up/thumbs down.”

    But some compilers do type inference, which does change it: it transforms nodes of unknown type into nodes of known type.

    Type checkers are really useful outside of compilers: (Swagger)

  72. Type Checking
    ldstr "Hi"
    ldstr "Hi"
    div
    This is bad.
    Don’t do this.
    One way to understand type checking is to look at a language which doesn’t do any. IL/ASM spend very little time checking. You can try to “divide” two strings, for
    example, and your application might crash or, worse, silently produce incorrect results. So it is critical that the compiler never emit such IL.

  73. Type Inference Rules
    Γ ⊢ A    Γ ⊢ B
    ——————————————
       Γ ⊢ A × B

    Γ ⊢ v₁ : Int    Γ ⊢ v₂ : Int
    ————————————————————————————
         Γ ⊢ v₁ + v₂ : Int
    Type systems are specified by inference rules. Given the premises or assumptions above the line, we’re allowed to form the derivations below the line. There’s an
    example on the slide. Here we say that if A and B are both types in a certain type environment named gamma, then we are also allowed to form a pair with members of
    types A and B. (You read “Γ ⊢” as “it’s provable within an environment Γ that…”) Similarly, if we know that two values, v₁ and v₂, are both integers, then so is their
    sum. There are lots of rules, but each one should be pretty simple.

    These seem almost too obvious to bother stating, but it helps to be really clear what the rules are, because, as you’ve seen with JavaScript, the corner cases can be a bit
    scary.
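
    Written as code rather than notation, the second rule might look like this (my sketch, not the talk’s checker):

    type Ty = IntTy | StringTy

    // Γ ⊢ v1 : Int    Γ ⊢ v2 : Int   ⇒   Γ ⊢ v1 + v2 : Int
    let checkAdd (t1: Ty) (t2: Ty) : Result<Ty, string> =
        match t1, t2 with
        | IntTy, IntTy -> Ok IntTy
        | _            -> Error "'+' expects two Ints"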

  74. Type Checking
    • Statically typed
    • Unityped (“dynamic language”)
    • Untyped
    You might ask yourself, “Self, what if I'm working on a dynamic language? Then I don't have to do any type checking, right?” However, you must do nearly the same type
    checking for evaluating a dynamic language as for pre-compiling a static language. The biggest difference is that the check is deferred until runtime. Other than that, it is
    very similar.

    Some languages, like ASM and C, do very little type checking at all. This is part of the reason why OpenSSL has so many issues and your Toyota accelerates itself. C
    has its uses, but please don’t build another C.

  75. Type Checking
    let rec private toBinding (environment: Map<string, Binding>) (expression: Expression) : Binding =
        match expression with
        | IntExpr n      -> IntBinding n
        | StringExpr str -> StringBinding str
    “A type system specifies the type rules of a programming language independently of particular typechecking algorithms. This is analogous to describing the syntax of a
    programming language by a formal grammar, independently of particular parsing algorithms.” - Luca Cardelli

    Some of this is easy!

  76. Type Checking
    | InvokeExpr (name, argument) ->
        match environment.TryFind name with
        | Some (FunctionBinding func) ->
            let argumentBinding = toInvokedArgumentBinding environment argument
            match argumentTypeError argumentBinding func with
            | None ->
                InvokeBinding {
                    FunctionName = name
                    Function = func
                    Argument = argumentBinding
                }
            | Some argumentTypeErrorMessage ->
                ErrorBinding (argumentTypeErrorMessage, EmptyBinding)
        | Some bindingType ->
            ErrorBinding (sprintf "Expected function; found %A" bindingType, EmptyBinding)
        | None ->
            ErrorBinding (sprintf "Undefined function '%s'." name, EmptyBinding)
    Some of it is considerably harder. When binding an invocation, we must first make sure it’s a real function and then that the argument can be bound at all, and lastly that
    the types match.

    Type checking is easier than type inference, but both are mostly solved problems, at least for simpler cases.

    Now we’re getting to a semantic understanding of the code.

  77. InvokeExpr {
    Name = "inc"
    Argument = StringExpr “Oops!" }
    Type Checking
    Possible errors.

  78. InvokeExpr {
    Name = "inc"
    Argument = StringExpr “Oops!" }
    Type Checking
    “Expected integer; found ‘Oops!’.”

  79. Optimizers
    What Problem Are We Solving? Program → Faster, but equivalent program
    Non-Compiler Example: Theorem prover
    Fight the urge to optimize outside the optimizers!

    Remember, one of the essential characteristics of compiler optimization is you can turn it off, e.g., for easier debugging. This is really hard if you optimize outside the
    optimizer.

    Optimizer must never change program behavior, except maybe making it harder to debug.

    Non-optimizer code should be so non-optimal it looks dumb.

  80. Optimization (I)
    InvokeBinding “inc” -1
    There cannot be any errors based on user input. Either make it better or leave it alone.

    Unlike other phases.

    Here’s an optimized version of the tree (click)

  81. Optimization (I)
    InvokeBinding “inc” -1
    IntBinding 0

  82. Optimization (I)
    Invoke “some-method” -1
    Often, you’ll do nothing. (click)

  83. Optimization (I)
    Invoke “some-method” -1
    Invoke “some-method” -1

  84. Optimization (I)
    let private optimizeInc (binding: Binding) : Binding =
        match binding with
        | IncBinding (IntBinding number)
            -> IntBinding (number + 1)
        | IncBinding _
        | BoolBinding _
        | IntBinding _
        | StringBinding _
        | VariableBinding _
        | FunctionBinding _
        | InvokeBinding _
        | DefBinding _
        | ErrorBinding _
        | EmptyBinding _
            -> binding
    This is an example of a function which does one specific optimization:

    Find when the “inc” function is applied to a literal int and substitute the correct result.

    Hooray, we’ve optimized away the function call!

    It ignores other kinds of nodes.
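
    In use (my examples, assuming the repo’s constructors):

    optimizeInc (IncBinding (IntBinding (-1)))      // IntBinding 0: the call is folded away
    optimizeInc (IncBinding (VariableBinding "x"))  // unchanged: not a literal, so leave it alone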

  85. IL Generation
    IntBinding 0
    After optimizing the tree, generate IL.

    If we start from a simple binding… (click)

    We want to produce

    If you’ve seen IL before, you may wonder why not (click). More optimized!

    Why don’t we do this? (anyone?)

  86. IL Generation
    IntBinding 0
    Ldc.i4 0

  87. IL Generation
    IntBinding 0
    Ldc.i4 0
    Ldc.i4.0

  88. IL Generation
    let rec private codegenBinding (binding : Binding) =
        match binding with
        | BoolBinding b ->
            match b with
            | true  -> [Ldc_I4_1]
            | false -> [Ldc_I4_0]
        | IntBinding n    -> [Ldc_I4 n]
        | StringBinding s -> [Ldstr s]
        | // …
    Some of this is pretty straightforward.

  89. IL Generation
    let private writeLineMethod =
        typeof<System.Console>.GetMethod(
            "WriteLine", [| typeof<int> |])
    let private codegenOper = function
        | IncInt ->
            [ Instruction.Ldc_I4_1
              Instruction.Add ]
        | WriteLine ->
            [ Instruction.Call writeLineMethod ]
    But for built-in, primitive operations, I have to write out the code in IL. I need these primitives for more complicated programs.
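
    Putting the two slides together (my trace, not the repo’s output), the body of (inc 2) comes out as:

    codegenBinding (IntBinding 2) @ codegenOper IncInt
    // [Ldc_I4 2; Ldc_I4_1; Add] : push 2, push 1, add them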

  90. Optimization (II)
    Ldc.i4 0
    For IL optimization, we want to replace the generic operations with their optimized IL short forms. Go from this, to (click)

  91. Optimization (II)
    Ldc.i4 0
    Ldc.i4.0

  92. Optimization (II)
    let private optimalShortEncodingFor = function
        | Ldc_I4 0 -> Ldc_I4_0
        | Ldc_I4 1 -> Ldc_I4_1
        | Ldc_I4 2 -> Ldc_I4_2
        | Ldc_I4 3 -> Ldc_I4_3
        | Ldc_I4 4 -> Ldc_I4_4
        | Ldc_I4 5 -> Ldc_I4_5
        | Ldc_I4 6 -> Ldc_I4_6
        | Ldc_I4 7 -> Ldc_I4_7
        | Ldc_I4 8 -> Ldc_I4_8
        | Ldloc 0 -> Ldloc_0
        | Ldloc 1 -> Ldloc_1
        | Ldloc 2 -> Ldloc_2
        | Ldloc 3 -> Ldloc_3
        | Ldloc i when i <= maxByte -> Ldloc_S(Convert.ToByte(i))
    Anything not listed here stays the same.

    No errors!
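
    For example (mine):

    optimalShortEncodingFor (Ldc_I4 3)    // Ldc_I4_3: a one-byte encoding instead of five
    optimalShortEncodingFor (Ldc_I4 500)  // unchanged: no short form in the list above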

  93. Special Tools!
    PEVerify, ildasm. Similar tools exist for other platforms.

    ildasm: You may have seen it before. It decompiles an EXE/DLL into IL.

    PEVerify: checks that the IL and metadata you emitted are valid.

  94. Compare!
    When in doubt, try it in C#, and see what that compiler emits. Sometimes it’s a little strange. Roll with it.

  95. https://www.ece.cmu.edu/~ganger/712.fall02/papers/p761-thompson.pdf
    OK, this seems really weird when you first hear about it.

    Thompson says we should trust people more than code. But people are fallible and code is precise, right?

  96. Trusting Trust
    Compiler
    Executable
    Compiler
    Source
    Code
    Compiler
    Executable
    Many compilers can compile themselves. This is how it’s supposed to work.

    The two green boxes are identical, right? Right. This is how we expect it to work. Compilers map source code to executables.

  97. Trusting Trust
    Compiler
    Executable
    Compiler
    Source
    Code
    Trojaned
    Compiler
    Executable
    Trojan
    Code
    What if someone adds some malicious code to the compiler source code?

    What does that do?

  98. Trusting Trust
    Trojaned
    Compiler
    Executable
    Benign App
    Source
    Code
    Trojaned
    App
    Executable
    Code which adds a trojan to any app the compiler compiles?

    This is obviously bad, but maybe not too surprising, right?

  99. Trusting Trust
    Trojaned
    Compiler
    Executable
    (Benign!)
    Compiler
    Source
    Code
    Trojaned
    Compiler
    Executable
    Now the trojan lives in the compiler EXE only, not the source code! Even if you recompile the compiler itself from good, benign source code, you don’t know if you’re
    secure. You need to know the full lineage of the compiler.

    This is true in the context of formal verification, as well, not just security against bad guys.

  100. Conclusion
    Don’t fear hard problems. Recognize solved problems, use recipes that have been developed over a half century of prior art.

  101. Further Reading

    Don’t buy the dragon book. Universally recommended by people who haven’t read it, or who have read nothing else.

  102. Further Reading
    • Programming Language Concepts, by Peter
    Sestoft
    • Modern Compiler Implementation in ML, by
    Andrew W. Appel
    • miniml (608 line implementation of ML
    subset), by Andrej Bauer
    • Coursera Compilers Course, by Alex Aiken

  103. Craig Stuntz
    @craigstuntz
    [email protected]
    http://blogs.teamb.com/craigstuntz
    http://www.meetup.com/Papers-We-Love-Columbus/
    https://speakerdeck.com/craigstuntz
    https://github.com/CraigStuntz/TinyLanguage
    Feel free to reach out and ask follow up questions, either here at the conference or by one of the ways on the slide.
