grab my slides and code from the links above. I won’t reserve time at the end of the talk for questions. I have a lot of material to cover. Please interrupt and ask if I’m unclear about anything. And I’d be happy to “buy you dinner” afterwards if you want to talk more.
into them almost every day.” http://steve-yegge.blogspot.ca/2007/06/rich-programmer-food.html Why bother learning compilers? What can I tell you in just an hour? I think it’s fun & interesting in its own right. Maybe you agree, since you’re here. But you might think you don’t have the opportunity to use compiler techniques in your day job. Maybe you can! Obviously I can’t explain everything there is to know about compiler implementations in an hour, which is why I’m also giving you complete source code for a simple one to look at later! I want you to be able to look at a hard problem and tell your colleagues you know how to solve it.
contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.” https://commons.wikimedia.org/wiki/File:Philip_Greenspun_and_Alex_the_dog.jpg Compilation is fundamental to the problem of producing software. If you’re a professional programmer, you will — you will! — be asked to solve problems which you can solve better if you recognize them as pieces of the compiler toolchain.
to test. You can’t test every possible function a compiler might need to compile. You need a different methodology to drive your designs. You can solve them anyway, and you can produce a design with confidence it’s going to work.
leave this talk with: Recognize compilation problems (they’re everywhere!), and apply proven, reliable patterns toward solving them. Simply recognizing these problems for what they are and knowing where to look for the solution will make you a better developer. This is a super power!
from the theory of computer science: that a certain kind of machine can simulate anything — including itself. As a result, this certain kind of machine, the Turing machine, is the very definition of computability.” http://www.cs.uni.edu/~wallingf/blog/archives/monthly/2015-09.html#e2015-09-03T15_26_47.htm What Is a Compiler, Really? (pause) It’s a program that writes a program! Or a program which takes a representation of data and writes another representation. Often, it’s a program that can write any possible program. Many compilers can compile themselves, producing their own executable as output. There are weird and non-obvious consequences of this! This idea on the screen — this is huge!
source code to output. But maybe you prefer Ruby or JS? (Click) An interpreter is a compiler which recognizes source code plus additional user input at the same time. I’ll talk mostly about compilers, but remember an interpreter is just a compiler with an extra feature or three.
think of a compiler as code -> exe, but this is mostly wrong. (Click) Its primary purpose is to reject incorrect code. Maybe 1 in 10 times we produce a program, right? Right?
Linters, static analysis (syntax, type checking) • Solvers, theorem provers (optimization) • Code migration tools (compilers!) Also, most of the individual pieces of the compilation pipeline are useful in their own right. When you learn to write a compiler, you learn all of the above, and more!
for taking data in one representation and turning it into data in a different representation, if valid. The transformation (from A to B) is often nontrivial. This concept describes a large percentage of the code we write, so compilation concepts are really useful, even when not writing a compiler per se.
Source code → Potential style error list JSON → Object graph Code with 2 digit years → Y2K compliant code VB6 → C# Object graph → User interface markup Algorithm → Faster, equivalent algorithm Some of these look like what we typically consider a compiler. Some don’t. None are contrived. I do all of these in my day job. These are problems you must routinely solve even if you’re not a language author.
with test driven design, where you write tests to guide evolution of program design. Although I do write (many!) tests for my compilers, Test Driven Design doesn’t hold up as a design methodology. Let’s examine why, and what the alternative might be.
e while #D I printf #D l int #D W if #D C y=v+111;H(x,v)*y++= *x #D H(a,b)R(a=b+11;a<b+89;a++) #D s(a)t=scanf("%d",&a) #D U Z I #D Z I("123\ 45678\n");H(x,V){putchar(".XO"[*x]);W((x-V)%10==8){x+=2;I("%d\n",(x-V)/10-1);}} l V[1600],u,r[]={-1,-11,-10,-9,1,11,10,9},h[]={11,18,81,88},ih[]={22,27,72,77}, bz,lv=60,*x,*y,m,t;S(d,v,f,_,a,b)l*v;{l c=0,*n=v+100,j=d<u-1?a:-9000,w,z,i,g,q= 3-f;W(d>u){R(w=i=0;i<4;i++)w+=(m=v[h[i]])==f?300:m==q?-300:(t=v[ih[i]])==f?-50: t==q?50:0;Y w;}H(z,0){W(E(v,z,f,100)){c++;w= -S(d+1,n,q,0,-b,-j);W(w>j){g=bz=z; j=w;W(w#$b%&w#$8003)Y w;}}}W(!c){g=0;W(_){H(x,v)c+= *x==f?1:*x==3-f?-1:0;Y c>0? 8000+c:c-8000;}C;j= -S(d+1,n,q,1,-b,-j);)bz=g;Y d#$u-1?j+(c'(3):j;}main(){R(;t< 1600;t+=100)R(m=0;m<100;m++)V[t+m]=m<11%&m>88%&(m+1)%10<2?3:0;I("Level:");V[44] =V[55]=1;V[45]=V[54]=2;s(u);e(lv>0){Z do{I("You:");s(m);}e(!E(V,m,2,0))*m+,99); W(m+,99)lv--;W(lv<15)*u<10)u+=2;U("Wait\n");I("Value:%d\n",S(0,V,1,0,-9000,9000 ));I("move: %d\n",(lv-=E(V,bz,1,0),bz));}}E(v,z,f,o)l*v;{l*j,q=3-f,g=0,i,w,*k=v +z;W(*k==0)R(i=7;i#$0;i--){j=k+(w=r[i]);e(*j==q)j+=w;W(*j==f)*j-w+,k){W(!g){g=1 ;C;}e(j+,k)*((j-=w)+o)=f;}}Y g;} Anyone here know any C? Does this look like a valid C program to you? This one is not valid, but it would be if you changed one character. Can you find it, or could you write a program to find it? This is a hard problem: Write a program that for any string whatsoever (and there are lots of possible strings!) either declares it a valid program or explains clearly to a human why it isn’t.
send(to, from, count)
register short *to, *from;
register count;
{
    register n = (count + 7) / 8;
    switch (count % 8) {
    case 0: do { *to = *from++;
    case 7:      *to = *from++;
    case 6:      *to = *from++;
    case 5:      *to = *from++;
    case 4:      *to = *from++;
    case 3:      *to = *from++;
    case 2:      *to = *from++;
    case 1:      *to = *from++;
            } while (--n > 0);
    }
}
And I mean every possible program! (Explain Duff’s device.) People will do absolutely anything the PL grammar allows. Therefore your compiler must be able to classify literally any arbitrary string as a valid or invalid program, and you can’t predict the valid programs people will write.
all the things the compiler will be asked to parse. You can’t design a compiler which can parse any legal program by poking around with tests of possible programs. You would need an unbounded number of tests. You must design with formal methods. But you do still write tests; it’s just that you use a different design methodology!
100 * 101 / 2 = 5050 Another reason compilers are interesting is they try to solve a very hard problem: Don’t just convert A to B. Convert A into a highly-optimized-but-entirely-equivalent B. For any given A whatsoever! If a compiler crashes, that’s bad. If it silently emits incorrect code, it’s worse than bad; the world ends. Hard problems are interesting!
mostly solved problem. “Solved” not as in it will never get better. “Solved” as in there is a recipe, which most programmers are capable of following. There are lots of steps, but each individual step is pretty simple. You can learn it one tiny piece at a time. This is a really learnable skill. In fact, if you have a hard problem and can distort it until it looks like a compiler then you have a solved problem. This is magic! Contrast this with web development, which is also hard, but we change our minds about what constitutes “best practices” annually.
→ Algebra Type Checker → Logical Inference Rules Code Generator → Denotational Semantics For each part of the compiler pipeline, there exists a formalism which guides our designs. Some of these words are long and unfamiliar, but just think of them as recipes for implementation. If you follow them, you will cover all of the cases in your language specification.
of complicated languages. Simplicity requires thinking abstractly before you start implementing.” http://www.heidelberg-laureate-forum.org/blog/video/lecture-monday-august-24-2015-leslie-lamport/ https://commons.wikimedia.org/wiki/File:Leslie_Lamport.jpg Worth noting: Compiler implementation is fairly straightforward, and there’s a recipe to follow. Language design is much harder, and there’s no “best” recipe. Lots of people can write good compilers. Far fewer can write a good language. Fewer still can write a good, simple language. Respect language designers. It’s common to try to learn compilers by inventing your own language or chasing a complicated language. I recommend you don’t do that! I use a tiny Lisp here, but I started with a math expression evaluator. Most math expressions are simpler than PLs.
x add x 1 alert x A compiler must handle both the written form (syntax) and the internal meaning (semantics) of the language. Syntax has both a literal (characters) and a tree form. These are both representations of the syntax. This is the “abstract” syntax tree. Believe it or not it’s the simplified version!
"NATE" name # +/ "Nate" name = "Nate" # +/ "Nate" name.upcase! # +/ "NATE" name # +/ "NATE" http://www.natescottwest.com/elixir-for-rubyists-part-2/ People obsess about syntax. They reject LISPs because (). But semantics are, I think, most important. Nobody says we don’t need JavaScript since we have Java. People do ask if we need VB.NET since there is C# (syntactically different, but semantically equivalent). I want to clarify the distinction. Elixir on left, Ruby on right. Looks similar, but Elixir strings are immutable. Similar syntax; different semantics. Similar looking programs mean different things.
Main(ByVal args() As String)
        Dim name As String = "VB.NET"
        'See if argument passed
        If args.Length = 1 Then name = args(0)
        Console.WriteLine("Hello, " & name & "!")
    End Sub
End Class
End Namespace

using System;
namespace Hello {
    public class HelloWorld {
        public static void Main(string[] args) {
            string name = "C#";
            // See if argument passed
            if (args.Length == 1)
                name = args[0];
            Console.WriteLine("Hello, " + name + "!");
        }
    }
}
http://www.harding.edu/fmccown/vbnet_csharp_comparison.html VB.NET on left, C# on right. Looks different, but the semantics are identical. Different-looking programs mean the same thing. “The syntax of a language is governed by the constructs that define its types, and its semantics is determined by the interactions among those constructs.” -Robert Harper Is this clear? Compilers deal with both; the distinction is important. Let’s look at the big picture.
Generator Binder This is simplified. Production compilers have more stages. Front end, back end… Middle end? There is an intermediate representation — collection of types — for each stage. I show two optimizers here. Production compilers have more.
= Lexer.lex
    >> Parser.parse
    >> Binder.bind
    >> OptimizeBinding.optimize
    >> IlGenerator.codegen
    >> Railway.map OptimizeIl.optimize
    >> Railway.map Il.toAssemblyBuilder
That’s the real, full source code for the compiler itself on the screen; it just chains together the “recipes” we’ll be examining in detail in the examples to come. I’m literally just piping one module into the next.
with something really, really simple. We want to be able to compile this program. (Remember, start with a toy language.) To what? There’s more than one valid representation of this program. Here’s one (click, explain IL). Here’s another (click). How do we even decide? Is it clear to everyone why these are both valid representations of the program I’m compiling?
to -1 Type check “inc” exists and takes an int argument, and -1 is an int. Great! Optimize -1 + 1 = 0, so just emit int 0! IL generate Ldc.i4 0 Optimize Ldc.i4 0 → Ldc.i4.0 Object code Produce assembly with entry point which contains the IL generated Let’s break this process into individual steps. All we’ve done is break the hard problem of source code -> optimized EXE into a bunch of relatively small problems. (Read each) We’ll explain them in detail in just a moment. Production compiler engineers will say this is too simple, but their code, like most production code, is probably a mess. You can learn a lot from the simple case!
(add-1 2))) Or maybe this one. It’s a little more complicated, but hopefully it still makes sense. Let’s look at the individual pieces of the pipeline.
tokens Non-Compiler Example Text search Lexers break strings into tokens using a grammar which is a bunch of really simple regular expressions. Tokens/lexemes are like characters, but they’re just a little bit richer. You get a token for the int 123 instead of the individual characters 1, 2, and 3. Regular expressions search for text within a string. Lexers search for reserved words/symbols in a language.
a search for “am” to match the word “are” simply because they’re conjugations of the same verb. This is the difference between lexing and parsing; lexing deals with symbols and parsing deals with language grammar. Lexing works on character input. Parsing works on lexeme input (roughly, words or symbols)— the fundamental unit of the PL’s grammar and also the output of the lexer.
‘A’ | ‘B’ | ‘C’ | …
digit = ‘0’ | ‘1’ | ‘2’ | …
number = (‘+’ digit | ‘-’ digit | digit) digit*
alphanumeric = letter | number
…
Of course an acceptable program can’t really be any random string. There are rules. We define the rules via formalisms. For lexers, the formalism is regular expressions. This doesn’t mean a PCRE; it’s much simpler. As in the example above, a RE can be a literal character, a choice, or a sequence. That’s it! There are no other options. Lexers are really simple! But how can we be sure that the code behaves like this grammar, every time?
of string
| LiteralInt of int
| LiteralString of string
| Unrecognized of char
The regular expressions in the lexical grammar directly map to the types I create for the lexer. I cannot construct an instance of something which doesn’t fit in the types; hence, I cannot construct an instance of a program which doesn’t fit in the grammar. Any code input which doesn’t match the lexical grammar gets tossed into the bottom “Unrecognized” type. It will eventually surface as an error to the user.
RightParenthesis Start with this, transform to (click). Haven’t tried to infer any kind of meaning at all yet. What’s the point of lexing vs parsing? The output of lexing is the terminals of the grammar used by the parser. The lexer sets the parser up for success. It makes the formalism possible.
list = match source with
| '(' :: rest -> LeftParenthesis :: lexChars rest
| ')' :: rest -> RightParenthesis :: lexChars rest
| '"' :: rest -> lexString (rest, "")
| c :: rest when isIdentifierStart c -> lexName (source, "")
| d :: rest when System.Char.IsDigit d -> lexNumber (source, "")
| [] -> []
| w :: rest when System.Char.IsWhiteSpace w -> lexChars rest
| c :: rest -> Unrecognized c :: lexChars rest
I write all of my compiler code as purely functional. Most example compiler code requires understanding the state of the compiler as well as the behavior of the code. You can understand this code from the code alone. There is no state at all to know. The lexer is recursive: Look at the first char, decide what to do, then repeat.
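For a concrete feel for the output, here’s roughly what lexing a tiny program yields. lexChars is the function above; the exact result shape is my reading of the token type, so treat it as a sketch:

    // Lex "(inc 1)" into tokens (assumed usage)
    let tokens = lexChars (List.ofSeq "(inc 1)")
    // → [LeftParenthesis; Identifier "inc"; LiteralInt 1; RightParenthesis]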
is when not to use it. You can’t parse XML with a regex. The user experience for validation should be to gently help the user avoid fat-fingered input and submit correct data.
still trade-offs to be made. Don't blindly copy regular expressions from online libraries or discussion forums.” -Jan Goyvaerts, regular-expressions.info Don’t piss them off by telling them they’ve misspelt their Irish surname or that their real email address is “invalid.” Don’t validate a grammar (like email addresses) with a lexer/regex.
Syntax tree Non-Compiler Example Deserialization The lexer produces a sequence of tokens. We want to turn that into an abstract syntax tree, respecting operator precedence.
3) What is precedence? In most languages, for example, you multiply before you add, regardless of their sequence in an expression. The expressions above and below the line should be identical.
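A concrete way to see precedence is in the shape of the tree the parser builds. Here’s a minimal sketch in F# with invented type names (our toy Lisp sidesteps precedence entirely, since parentheses make every tree explicit):

    // Hypothetical AST for an infix language, just to illustrate precedence
    type Expr =
        | Num of int
        | Add of Expr * Expr
        | Mul of Expr * Expr

    // "1 + 2 * 3" must parse as…
    let right = Add (Num 1, Mul (Num 2, Num 3))   // evaluates to 7
    // …not as…
    let wrong = Mul (Add (Num 1, Num 2), Num 3)   // evaluates to 9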
is part of ECMAScript grammar. All valid statements in the language (all!) can be constructed from the grammar. Anything we can’t construct from the grammar is invalid and will be a parse error. The converse isn’t true, though. We can also construct invalid statements from the grammar, because the parser doesn’t type check. Just because something can be parsed doesn’t make it correct, but it’s certainly invalid if it can’t be parsed. True in spoken languages as well: “Colorless green ideas sleep furiously” Grammatically correct, but semantically nonsensical. Parsing works on syntax, not semantics.
| <expr>
<defun> := “(defun” identifier <expr> <expr> “)”
<expr> := number | string | <invoke>
<invoke> := “(” identifier <expr> “)”
With a reasonably specified language, the rules are pretty easy to follow. (good: C#; bad: Ruby) Our goal today is to implement a really simple language. We’ll follow the grammar here; a worked derivation appears below. Explain terminals vs. nonterminals. Important for lexing vs. parsing.
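To make the grammar concrete, here’s how “(inc -1)” derives from these productions, step by step:

    <expr> → <invoke>
           → “(” identifier <expr> “)”
           → “(” “inc” number “)”
           → (inc -1)

Anything you can reach this way is syntactically valid; anything you can’t reach is a parse error.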
of string
| DefunExpr of name: string * argument: ArgumentExpression option * body: Expression
| InvokeExpr of name: string * argument: Expression option
| IdentifierExpr of string
| ErrorExpr of string
| EmptyListExpr
Just as we defined types to represent productions in the lexical grammar, we do the same for the parsing grammar. At this point you might ask if this mapping from the formal grammar to F# types can be automated, and it can! I’ve built my entire compiler “from scratch” so you can see how it works, but it’s common to use lexer and parser generators in real-world work.
parsing work? Start with a list of tokens from the lexer. (Click) Produce a syntax tree. Here we have a const -1 argument; the argument could be another expression, like another invocation.
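In the Expression type above, the tree for “(inc -1)” would look something like this (a sketch; the IntExpr case is cropped off the earlier slide, but it shows up again in the binder):

    // Assumed AST value for "(inc -1)"
    let tree = InvokeExpr (name = "inc", argument = Some (IntExpr (-1)))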
match state.Remaining with
| LeftParenthesis :: Identifier "defun" :: Identifier name :: rest ->
    let defun = parseDefun (name, { state with Remaining = rest })
    match defun.Expressions, defun.Remaining with
    | [ ErrorExpr _ ], _ -> defun
    | _, RightParenthesis :: remaining -> { defun with Remaining = remaining }
    | _, [] -> error ("Expected ')'.")
    | _, wrong :: _ -> error (sprintf "Expected ')'; found %A." wrong)
| LeftParenthesis :: Identifier name :: argumentsAndBody ->
    let invoke = parseInvoke (name, { state with Remaining = argumentsAndBody })
    match invoke.Remaining with
    | RightParenthesis :: remaining -> { invoke with Remaining = remaining }
    | [] -> error ("Expected ')'.")
    | wrong :: _ -> error (sprintf "Expected ')'; found %A." wrong)
| LeftParenthesis :: wrong -> error (sprintf "%A cannot follow '('." wrong)
Implementing parsers can be a bit tricky since there are lots of choices in implementation style with subtle tradeoffs. But you can use an off-the-shelf toolkit in most cases. Parts of my parser are shown here. Again, I’m just doing a pattern match against the token stream from the lexer, because parsing a LISP is easy. The key to not getting stuck in a compiler is to do only one thing at a time. Do not optimize. Do not type check. Just look for valid syntax.
a file format such as PNG is a parsing job, and you should use formal methods to implement it. If you try to “wing it”, you are probably exposing everyone who uses your library to the bad guys.
telling the compiler, I think.” https://joshvarty.wordpress.com/2015/08/03/learn-roslyn-now-part-11-introduction-to-code-fixes/ One super practical use for a parser is enforcing code style guides. Many companies write style guides as Word documents or worse. That leads to lax enforcement, usually targeted at new employees only, and lots of pissy email threads.
right now? Non-Compiler Example Bounded Context in Domain Driven Design In an ideal world (for the compiler author), developers would give a unique and unambiguous name to every variable. In the real world, probably half the variables in any average program are called “temp”. The compiler has to decide which assignment to something called “temp” to use in a given context.
to their binding locations, or declaration sites. Essentially all modern languages use lexical (static) scoping. This is the rarest of things: a settled argument in computer science. Essentially nobody argues that dynamic scoping is a good idea anymore. Only archaic languages like SNOBOL and early LISPs use dynamic scoping. (Perl has opt-in dynamic scoping because it’s Perl.) Most contemporary dynamic languages use lexical scoping. As a compiler writer, you need to handle this with some care.
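Here’s the difference in two functions, sketched in F# (which is lexically scoped); the names are mine:

    let x = 1
    let f () = x       // x is bound where f is *defined* (lexical scope)
    let g () =
        let x = 2      // shadows the outer x, but doesn't affect f
        f ()           // lexical scoping: returns 1; dynamic scoping would return 2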
IntBinding -1} Binding InvokeExpr “inc” -1 The binder handles scoping concerns. It connects syntactic identifiers like “inc” with meaning. Start with an abstract syntax tree. (Click) Produce a binding tree. Note the line Function = Inc — the string “inc” has been mapped to a built-in function the compiler understands and can generate code for. Binding enforces scope and gives us the ability to check types.
is dedicated to error handling? Let’s take a digression and consider this. Remember, the most common case for a compiler is failing on bad code and giving the user a good error message. C# sure has a lot of them. Each of these has code behind it.
invocation``() =
    let source = "(bad-method 2)"
    let expected =
        ErrorBinding ("Undefined function 'bad-method'.", EmptyBinding)
    let actual = bind source
    actual |> should equal expected
So I write tests for each. Each error needs a test example, or several. This isn’t test-driven design; it’s regression testing.
what I meant, not what I said • Poisoning http://www.drdobbs.com/architecture-and-design/so-you-want-to-write-your-own-language/240165488?pgno=2 There are a few possible ways to deal with errors. (Click) The first is common when getting started but entirely unsuitable for production compilers: just fail on the first error found. (Click) The second is friendly but hard and often wrong: guess what the user intended and try to continue compilation. (Click) The third, poisoning: report all errors at the end of the process. I use the third.
“Is it valid?” Non-Compiler Example Linter Moving on: type checking doesn’t substantially change the intermediate representation. It’s mostly just “thumbs up/thumbs down.” But some compilers do type inference, which does change it: inference transforms nodes of unknown type into nodes of known type. Type checkers are really useful outside of compilers: (Swagger)
Don’t do this. One way to understand type checking is to look at a language which doesn’t do any. IL/ASM spend very little time checking. You can try to “divide” two strings, for example, and your application might crash or, worse, silently produce incorrect results. So it is critical that the compiler never emit such IL.
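To make that concrete, here’s the sort of IL no compiler should ever emit, written in the same instruction-list style as the code generator later in this talk. I’m assuming an Instruction case for IL’s real div opcode, which this toy compiler may not define:

    // Nonsense IL: "divide" two strings; nothing rejects this until runtime
    [ Ldstr "6"; Ldstr "3"; Div ]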
Γ ⊢ A    Γ ⊢ B
----------------
   Γ ⊢ A × B

Γ ⊢ v1 : Int    Γ ⊢ v2 : Int
------------------------------
      Γ ⊢ v1 + v2 : Int

Type systems are specified by inference rules. Given the premises or assumptions above the line, we’re allowed to form the derivations below the line. There’s an example on the slide. Here we say that if A and B are both types in a certain type environment named gamma, then we are also allowed to form a pair with members of types A and B. (You read “Γ ⊢” as “it’s provable within an environment Γ that…”) Similarly, if we know that two values, v1 and v2, are both integers, then so is their sum. There are lots of rules, but each one should be pretty simple. These seem almost too obvious to bother stating, but it helps to be really clear what the rules are, because, as you’ve seen with JavaScript, the corner cases can be a bit scary.
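It can help to see how an inference rule becomes checker code: the premises turn into pattern-match conditions, and the conclusion becomes the result. A hedged sketch with invented names, not the talk’s actual compiler:

    type Ty =
        | IntType
        | StringType

    // The (v1 + v2 : Int) rule as a function: both premises must hold
    let checkAdd (typeOf: 'expr -> Ty) (v1: 'expr) (v2: 'expr) =
        match typeOf v1, typeOf v2 with
        | IntType, IntType -> Ok IntType
        | _ -> Error "'+' requires two Int operands."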
Untyped You might ask yourself, “Self, what if I’m working on a dynamic language? Then I don’t have to do any type checking, right?” However, you must do nearly the same type checking for evaluating a dynamic language as for pre-compiling a static language. The biggest difference is that the check is deferred until runtime. Other than that, it is very similar. Some languages, like ASM and C, do very little type checking at all. This is part of the reason why OpenSSL has so many issues and your Toyota accelerates itself. C has its uses, but please don’t build another C.
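In code: a dynamic language’s evaluator runs essentially the same test as the static checker above, just when “+” is evaluated rather than when the program is compiled. Again a sketch with invented names:

    type Value =
        | IntValue of int
        | StringValue of string

    // Same rule as the static checker, enforced at run time instead
    let evalAdd v1 v2 =
        match v1, v2 with
        | IntValue a, IntValue b -> IntValue (a + b)
        | _ -> failwith "Runtime type error: '+' requires two Ints."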
: Expression) : Binding =
    match expression with
    | IntExpr n -> IntBinding n
    | StringExpr str -> StringBinding str
“A type system specifies the type rules of a programming language independently of particular typechecking algorithms. This is analogous to describing the syntax of a programming language by a formal grammar, independently of particular parsing algorithms.” - Luca Cardelli Some of this is easy!
with
| Some (FunctionBinding func) ->
    let argumentBinding = toInvokedArgumentBinding environment argument
    match argumentTypeError argumentBinding func with
    | None ->
        InvokeBinding {
            FunctionName = name
            Function = func
            Argument = argumentBinding }
    | Some argumentTypeErrorMessage ->
        ErrorBinding (argumentTypeErrorMessage, EmptyBinding)
| Some bindingType ->
    ErrorBinding (sprintf "Expected function; found %A" bindingType, EmptyBinding)
| None ->
    ErrorBinding (sprintf "Undefined function '%s'." name, EmptyBinding)
Some of it is considerably harder. When binding an invocation, we must first make sure it’s a real function, then that the argument can be bound at all, and lastly that the types match. Type checking is easier than type inference, but both are mostly solved problems, at least for simpler cases. Now we’re moving toward a semantic understanding of the code.
equivalent program Non-Compiler Example Theorem prover Fight the urge to optimize outside the optimizers! Remember, one of the essential characteristics of compiler optimization is you can turn it off, e.g., for easier debugging. This is really hard if you optimize outside the optimizer. Optimizer must never change program behavior, except maybe making it harder to debug. Non-optimizer code should be so non-optimal it looks dumb.
match binding with
| IncBinding (IntBinding number) -> IntBinding (number + 1)
| IncBinding _
| BoolBinding _
| IntBinding _
| StringBinding _
| VariableBinding _
| FunctionBinding _
| InvokeBinding _
| DefBinding _
| ErrorBinding _
| EmptyBinding -> binding
This is an example of a function which does one specific optimization: find when the “inc” function is applied to a literal int and substitute the correct result. Hooray, we’ve optimized away the function call! It ignores other kinds of nodes.
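Assuming the surrounding function is named optimize (the slide crops the name off), applying it to the binding for “(inc -1)” folds the call away entirely:

    optimize (IncBinding (IntBinding (-1)))   // → IntBinding 0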
tree, generate IL. If we start from a simple binding… (click) We want to produce this. If you’ve seen IL before, you may wonder why not this (click). More optimized! Why don’t we do this? (Anyone?)
match binding with
| BoolBinding b ->
    match b with
    | true  -> [Ldc_I4_1]
    | false -> [Ldc_I4_0]
| IntBinding n -> [Ldc_I4 n]
| StringBinding s -> [Ldstr s]
// …
Some of this is pretty straightforward.
|]

let private codegenOper = function
    | IncInt ->
        [ Instruction.Ldc_I4_1
          Instruction.Add ]
    | WriteLine ->
        [ Instruction.Call writeLineMethod ]
But for built-in, primitive operations, I have to write out the code in IL. I need these primitives for more complicated programs.
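Putting the pieces together: code generation for “(inc 2)” should concatenate the argument’s IL with the inc primitive’s IL. The codegen name comes from the pipeline earlier; the exact output is my expectation, not a captured run:

    codegen (IncBinding (IntBinding 2))
    // → [Ldc_I4 2; Ldc_I4_1; Add]  — push 2, push 1, add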
compilers can compile themselves. This is how it’s supposed to work. The two green boxes are identical, right? Right. Compilers map source code to executables.
Compiler Executable Now the trojan lives in the compiler EXE only, not the source code! Even if you recompile the compiler itself from good, benign source code, you don’t know if you’re secure. You need to know the full lineage of the compiler. This is true in the context of formal verification, as well, not just security against bad guys.
Modern Compiler Implementation in ML, by Andrew W. Appel • miniml (608 line implementation of ML subset), by Andrej Bauer • Coursera Compilers Course, by Alex Aiken