Programs That Write Programs

Craig Stuntz
https://speakerdeck.com/craigstuntz
https://github.com/CraigStuntz/TinyLanguage
What is a compiler? What is the formalism? Different than TDD. How do we drive code implementation via the formalism?

A → B
Why bother learning compilers? I think it’s fun and interesting in its own right. Maybe you agree, since you’re here. But you might think you don’t have the opportunity to use this stuff in your day job. Maybe you can! A compiler is a formalization for taking data in one representation and turning it into data in a different representation. The transformation (from A to B) is often nontrivial.

“You're actually surrounded by compilation problems. You run into them almost every day.”
–Steve Yegge, http://steve-yegge.blogspot.ca/2007/06/rich-programmer-food.html
This is, more or less, every problem in software engineering. Every problem.

Some examples
• Source code → Program
• JPEG file → Image on screen
• Source code → Potential style error list
• JSON → Object graph
• Code with 2 digit years → Y2K compliant code
• VB6 → C#
• Object graph → User interface markup
• Algorithm → Faster, equivalent algorithm
Some of these look like what we typically consider a compiler. Some don’t. None are contrived. I do all of these in my day job.

Useful Bits
• Regular Expressions (lexing)
• Deserializers (parsing)
• Linters, static analysis (syntax, type checking)
• Solvers, theorem provers (optimization)
• Code migration tools (compilers!)
Also, most of the individual pieces of the compilation pipeline are useful in their own right. When you learn to write a compiler, you learn all of the above, and more!

“Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.”
–Greenspun’s Tenth Rule
Another good reason: You will be doing this stuff anyway; you may as well do it right! DIBOL story.

1 + 2 + 3 + … + 100 = 100 * 101 / 2 = 5050
Another reason compilers are interesting is they try to solve a very hard problem: Don’t just convert A to B. Convert A into a highly-optimized-but-entirely-equivalent B. For any given A whatsoever! If a compiler crashes, that’s bad. If it silently emits incorrect code, it’s worse than bad; the world ends. Hard problems are interesting!
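The arithmetic on this slide can be checked directly. A minimal Python sketch (not from the talk; the function names are made up for illustration) comparing the naive loop with the closed form an optimizer might substitute for it:

```python
def sum_naive(n: int) -> int:
    """Add 1 + 2 + ... + n one term at a time."""
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


def sum_closed_form(n: int) -> int:
    """Same result via n * (n + 1) / 2, with no loop."""
    return n * (n + 1) // 2


# "Entirely equivalent" means agreeing on every input, not just the one on the slide.
assert all(sum_naive(n) == sum_closed_form(n) for n in range(1000))
print(sum_closed_form(100))  # 5050
```

The assertion only samples inputs; a real optimizer has to justify the rewrite for all of them.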

Good News!
Although this is a hard problem, it’s a mostly solved problem. “Solved” not as in it will never get better. “Solved” as in there is a recipe. In fact, if you have a hard problem and can distort it until it looks like a compiler, then you have a solved problem. Contrast this with web development, which is also hard, but nobody has any bloody idea what they’re doing.

“…compilers ultimately depend on a single big idea from the theory of computer science: that a certain kind of machine can simulate anything, including itself. As a result, this certain kind of machine, the Turing machine, is the very definition of computability.”
–Eugene Wallingford, http://www.cs.uni.edu/~wallingf/blog/archives/monthly/2015-09.html#e2015-09-03T15_26_47.htm
What Is a Compiler, Really? (pause) It’s a program that writes a program! Or a program which takes a representation and writes a representation. Often, it’s a program that can write any possible program. Many compilers can compile themselves, producing their own executable as output. There are weird and non-obvious consequences of this! This idea on the screen is huge!

code → exe
Actually, I lied just slightly. We like to think of a compiler as code -> exe, but this is mostly wrong. (Click) Its primary purpose is to reject incorrect code. Maybe 1 in 10 times we produce a program. A good compiler is the ultimate total function: Take any random string and either produce a program or explain clearly why it’s not a program.
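To make "total function" concrete, here is a minimal Python sketch, assuming a toy language whose only valid program is a single integer literal. The names `Program`, `Rejection` and `compile_source` are invented for this illustration and are not from TinyLanguage:

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Program:
    """Stands in for whatever the compiler would emit."""
    instructions: Tuple[str, ...]


@dataclass(frozen=True)
class Rejection:
    """A clear explanation of why the input is not a program."""
    message: str


def compile_source(source: str) -> Union[Program, Rejection]:
    """Total: returns a value for every possible string and never raises."""
    text = source.strip()
    if text == "":
        return Rejection("Expected an integer literal; found nothing.")
    try:
        value = int(text)
    except ValueError:
        return Rejection(f"Expected an integer literal; found {text!r}.")
    return Program((f"Ldc.i4 {value}",))


print(compile_source("42"))
print(compile_source("(oops"))
```

Rejection is an ordinary return value here, not an exception, so the caller has to handle both outcomes.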

Syntax and Semantics
• Java and JavaScript have similar syntax, but different semantics
• Ruby and Elixir have similar syntax, but different semantics
• C# and VB.NET have different syntax, but fairly similar semantics
A compiler must handle both the written form (syntax) and the internal meaning (semantics) of the language. People obsess about syntax. They reject Lisps because of the parentheses. But semantics are, I think, most important. Nobody says we don’t need JavaScript since we have Java. People do ask if we need VB.NET since there is C# (syntactically different, but semantically equivalent). Syntax is really important!

Similar Syntax, Different Semantics
Elixir:
    name = "Nate"          # => "Nate"
    String.upcase(name)    # => "NATE"
    name                   # => "Nate"
Ruby:
    name = "Nate"          # => "Nate"
    name.upcase!           # => "NATE"
    name                   # => "NATE"
http://www.natescottwest.com/elixir-for-rubyists-part-2/
Elixir first, Ruby second. Looks similar, but Elixir strings are immutable.

Different Syntax, Similar Semantics
VB.NET:
    Imports System
    Namespace Hello
        Class HelloWorld
            Overloads Shared Sub Main(ByVal args() As String)
                Dim name As String = "VB.NET"
                'See if argument passed
                If args.Length = 1 Then name = args(0)
                Console.WriteLine("Hello, " & name & "!")
            End Sub
        End Class
    End Namespace
C#:
    using System;
    namespace Hello {
        public class HelloWorld {
            public static void Main(string[] args) {
                string name = "C#";
                // See if argument passed
                if (args.Length == 1) name = args[0];
                Console.WriteLine("Hello, " + name + "!");
            }
        }
    }
http://www.harding.edu/fmccown/vbnet_csharp_comparison.html
VB.NET first, C# second. Looks different, but semantics are identical. Is this clear? Compilers deal with both; the distinction is important.

x = x + 1; alert(x);
Sequence
    Assign
        x
        add
            x
            1
    Invoke
        alert
        x
Syntax has both a literal (characters) and a tree form. This is the “abstract” syntax tree. Believe it or not, it’s the simplified version! But recognizing the syntax and semantics is only the start of what a compiler does. Let’s look at the big picture.
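One way to hold the tree form in memory is a small set of node types. This is a Python sketch with made-up class names (`Var`, `Num`, `Add`, `Assign`, `Invoke`, `Sequence`), not the representation TinyLanguage uses:

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Num:
    value: int


@dataclass(frozen=True)
class Add:
    left: "Node"
    right: "Node"


@dataclass(frozen=True)
class Assign:
    target: Var
    value: "Node"


@dataclass(frozen=True)
class Invoke:
    function: str
    argument: "Node"


@dataclass(frozen=True)
class Sequence:
    statements: Tuple["Node", ...]


Node = Union[Var, Num, Add, Assign, Invoke, Sequence]

# Tree form of:  x = x + 1; alert(x);
tree = Sequence((
    Assign(Var("x"), Add(Var("x"), Num(1))),
    Invoke("alert", Var("x")),
))


def show(node: Node, depth: int = 0) -> str:
    """Render the tree one node per line, indented by depth."""
    pad = "  " * depth
    if isinstance(node, Sequence):
        lines = [pad + "Sequence"] + [show(s, depth + 1) for s in node.statements]
    elif isinstance(node, Assign):
        lines = [pad + "Assign", show(node.target, depth + 1), show(node.value, depth + 1)]
    elif isinstance(node, Invoke):
        lines = [pad + "Invoke " + node.function, show(node.argument, depth + 1)]
    elif isinstance(node, Add):
        lines = [pad + "add", show(node.left, depth + 1), show(node.right, depth + 1)]
    elif isinstance(node, Var):
        lines = [pad + node.name]
    else:
        lines = [pad + str(node.value)]
    return "\n".join(lines)


print(show(tree))
```

Whitespace and semicolons are gone from the tree, which is why it is called abstract.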

Compiler
Interpreter
I have a very general notion of a compiler… It includes building EXEs, but also a lot else. A compiler produces a program, which maps input to output. But maybe you prefer Ruby or JS? (Click) An interpreter is a compiler with added user input. It maps source and input to output. I’ll talk mostly about compilers, but remember an interpreter is just a compiler with an extra feature or three.
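The two shapes can be written down as function signatures. This is a Python sketch, assuming a toy "language" with exactly two programs; `compile_program` and `interpret` are names invented for this example:

```python
from typing import Callable


def compile_program(source: str) -> Callable[[int], int]:
    """Compiler: source -> program, where a program maps input to output."""
    if source == "inc":
        return lambda n: n + 1
    if source == "double":
        return lambda n: n * 2
    raise ValueError(f"Unknown program: {source!r}")


def interpret(source: str, program_input: int) -> int:
    """Interpreter: takes source and input together and maps them to output."""
    return compile_program(source)(program_input)


program = compile_program("inc")   # compile once...
print(program(-1), program(41))    # ...run many times
print(interpret("double", 21))     # or do both in one step
```

Here the interpreter is literally the compiler plus one extra argument.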

Front End: Understand Language
Back End: Emit Code
Front end, back end. Definitions vary, but nearly always exist. A C# app has two compilers. (What are they?) csc and the JITter.

Lexer → Parser → Type Checker → Optimizer → IL Generator → Optimizer → Object Code Generator
This is simplified. Production compilers have more stages. Front end, back end… Middle end? There is an intermediate representation (a collection of types) for each stage. I show two optimizers here. Production compilers have more.

To lex and parse a language, you must understand its grammar. This is one of the lovely problems in software (seriously) which can easily be formally specified! The example is part of the ECMAScript grammar.

(inc -1)
    Ldc.i4 -1
    Ldc.i4 1
    Add
or
    Ldc.i4.0
Let’s start with something simple. Simpler. We want to be able to compile this program. Is this program statically typed or dynamically typed? (Start with a toy language. Language design is harder than compiler design. Even amongst production languages, some are way harder to implement than others. Not Ruby. Not made up. Start with a Lisp.) To what? There’s more than one valid representation of this program. Here’s one (click, explain IL). Here’s another (click). How do we even decide?

(inc -1)
Lex: LeftParen, Identifier(inc), Number(-1), RightParen
Parse: Apply “inc” to -1
Type check: “inc” exists and takes an int argument, and -1 is an int. Great!
Optimize: -1 + 1 = 0, so just emit int 0!
IL generate: Ldc.i4 0
Optimize: Ldc.i4 0 → Ldc.i4.0
Object code: Produce assembly with entry point which contains the IL generated
Here’s a simple example… Note Number(-1)? Optimize 1: The compiler can do arithmetic! Optimize 2…. All we’ve done is break the hard problem of source code -> optimized EXE into a bunch of relatively small problems. Make sense? Production compiler engineers will say this is too simple, but their code, like most production code, is probably a mess. You can learn a lot from the simple case!

    (defun add-1 (x) (inc x))
    (defun main () (print (add-1 2)))
Or maybe this one.

There Are No Edge Cases In Programming Languages
Duff’s Device
    send(to, from, count)
    register short *to, *from;
    register count;
    {
        register n = (count + 7) / 8;
        switch (count % 8) {
        case 0: do { *to = *from++;
        case 7:      *to = *from++;
        case 6:      *to = *from++;
        case 5:      *to = *from++;
        case 4:      *to = *from++;
        case 3:      *to = *from++;
        case 2:      *to = *from++;
        case 1:      *to = *from++;
                } while (--n > 0);
        }
    }
We must consider every possible program! People will do absolutely anything the PL grammar allows. Therefore a compiler must be able to classify literally any arbitrary string into a valid or invalid program. There are a lot of possible strings!

Grammar
    <program>   := <statement> | <program> <statement>
    <statement> := <defun> | <expr>
    <defun>     := “(defun” identifier <expr> <expr> “)”
    <expr>      := number | string | <invoke>
    <invoke>    := “(” identifier <expr> “)”
    identifier  := ...
    number      := ...
    string      := ...
Of course an acceptable program can’t really be any random string. There are rules. TDD vs. formal methods. I like tests, but… totality. With a reasonably specified language (good: C#; bad: Ruby), the rules are pretty easy to follow. Our goal today is to implement a really simple language. We’ll follow the grammar here. Terminals vs. nonterminals. Important for lexing vs. parsing.

Lexing
(inc -1)
“(“ “inc” “-1” “)”
LeftParenthesis Identifier(“inc”) LiteralInt(-1) RightParenthesis
Start with this, transform to (click). We haven’t tried to infer any kind of meaning at all yet. What’s the point of lexing vs. parsing? The output of lexing is the terminals of the grammar used by the parser. The lexer sets the parser up for success. It makes the formalism possible.

Lexing
( -1)
Possible errors: Something which can’t be recognized as a terminal for the grammar.

Lexing
    let rec private lexChars (source: char list) : Lexeme list =
        match source with
        | '(' :: rest -> LeftParenthesis :: lexChars rest
        | ')' :: rest -> RightParenthesis :: lexChars rest
        | '"' :: rest -> lexString(rest, "")
        | c :: rest when isIdentifierStart c -> lexName (source, "")
        | d :: rest when System.Char.IsDigit d -> lexNumber(source, "")
        | [] -> []
        | w :: rest when System.Char.IsWhiteSpace w -> lexChars rest
        | c :: rest -> Unrecognized c :: lexChars rest
I write all of my compiler code as purely functional. Most example compiler code requires understanding the state of the compiler as well as the behavior of the code. You can understand this code from the code alone. The lexer is recursive: Look at the first char, decide what to do. How do you eat an elephant?
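For readers who don't read F#, here is a rough Python analogue of the same idea. It is a sketch, not a port: it uses a loop instead of recursion, skips string literals, and its rules for identifiers and negative numbers are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Union

DIGITS = "0123456789"


@dataclass(frozen=True)
class LeftParenthesis:
    pass


@dataclass(frozen=True)
class RightParenthesis:
    pass


@dataclass(frozen=True)
class Identifier:
    name: str


@dataclass(frozen=True)
class LiteralInt:
    value: int


@dataclass(frozen=True)
class Unrecognized:
    char: str


Lexeme = Union[LeftParenthesis, RightParenthesis, Identifier, LiteralInt, Unrecognized]


def lex(source: str) -> List[Lexeme]:
    """Turn characters into lexemes; attach no meaning yet."""
    lexemes: List[Lexeme] = []
    i = 0
    while i < len(source):
        c = source[i]
        if c.isspace():
            i += 1
        elif c == "(":
            lexemes.append(LeftParenthesis())
            i += 1
        elif c == ")":
            lexemes.append(RightParenthesis())
            i += 1
        elif c in DIGITS or (c == "-" and i + 1 < len(source) and source[i + 1] in DIGITS):
            # Assumption: a '-' directly before a digit is part of the number.
            start = i
            i += 1
            while i < len(source) and source[i] in DIGITS:
                i += 1
            lexemes.append(LiteralInt(int(source[start:i])))
        elif c.isalpha():
            # Assumption: identifiers are letters, digits and '-' (as in add-1).
            start = i
            while i < len(source) and (source[i].isalnum() or source[i] == "-"):
                i += 1
            lexemes.append(Identifier(source[start:i]))
        else:
            # Never raise: record the bad character and keep going.
            lexemes.append(Unrecognized(c))
            i += 1
    return lexemes


print(lex("(inc -1)"))
```

Like the F# version, a character it can't classify becomes an `Unrecognized` lexeme instead of an exception, so a later stage can report it.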

Parsing
LeftParenthesis Identifier(“inc”) LiteralInt(-1) RightParenthesis
Invoke “inc” -1
Start with a list of tokens from the lexer. (Click) Produce a syntax tree. Here we have a const -1; it could be another expression, like another invocation.

Parsing
LeftParenthesis Identifier(“inc”) LiteralInt(-1) LiteralInt(-1)
“Expected ‘)’”
Possible errors: Bad syntax. Every stage of the compiler checks the form of the previous stage’s output. (Click) The parser checks lexer output.

Parsing
    let rec private parseExpression (state : ParseState): ParseState =
        match state.Remaining with
        | LeftParenthesis :: Identifier "defun" :: Identifier name :: rest ->
            let defun = parseDefun (name, { state with Remaining = rest })
            match defun.Expressions, defun.Remaining with
            | [ ErrorExpr _ ], _ -> defun
            | _, RightParenthesis :: remaining -> { defun with Remaining = remaining }
            | _, [] -> error ("Expected ')'.")
            | _, wrong :: _ -> error (sprintf "Expected ')'; found %A." wrong)
        | LeftParenthesis :: Identifier name :: argumentsAndBody ->
            let invoke = parseInvoke (name, { state with Remaining = argumentsAndBody })
            match invoke.Remaining with
            | RightParenthesis :: remaining -> { invoke with Remaining = remaining }
            | [] -> error ("Expected ')'.")
            | wrong :: _ -> error (sprintf "Expected ')'; found %A." wrong)
        | LeftParenthesis :: wrong -> error (sprintf "%A cannot follow '('." wrong)
Do not optimize. Do not type check. Just look for valid syntax.
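The same shape in Python, as a hedged sketch: it covers only the number and invoke rules of the grammar (no defun, no strings) and uses plain tuples for tokens and expressions. Those encodings are assumptions for this example, not TinyLanguage's types.

```python
from typing import List, Tuple

Token = Tuple  # ("(",), (")",), ("id", name) or ("int", value)
Expr = Tuple   # ("int", value), ("invoke", name, argument) or ("error", message)


def parse_expr(tokens: List[Token]) -> Tuple[Expr, List[Token]]:
    """Parse one <expr> from the front of tokens; return it with the unused tokens."""
    if not tokens:
        return ("error", "Expected an expression; found end of input."), []
    first, rest = tokens[0], tokens[1:]
    if first[0] == "int":
        return first, rest
    if first[0] == "(":
        if not rest or rest[0][0] != "id":
            found = rest[0] if rest else "end of input"
            return ("error", f"{found} cannot follow '('."), rest
        name = rest[0][1]
        argument, after_argument = parse_expr(rest[1:])
        if argument[0] == "error":
            return argument, after_argument
        if not after_argument:
            return ("error", "Expected ')'."), []
        if after_argument[0][0] != ")":
            return ("error", f"Expected ')'; found {after_argument[0]}."), after_argument
        return ("invoke", name, argument), after_argument[1:]
    return ("error", f"Unexpected token {first}."), rest


good = [("(",), ("id", "inc"), ("int", -1), (")",)]
bad = [("(",), ("id", "inc"), ("int", -1), ("int", -1)]
print(parse_expr(good)[0])  # ('invoke', 'inc', ('int', -1))
print(parse_expr(bad)[0])   # an ("error", ...) expression
```

Errors are ordinary values in the tree here as well, so the parser stays total: it never raises on bad input.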

Optimization (I)
Invoke “inc” -1 → Constant Int 0
There cannot be any errors. Unlike other phases.
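This first optimization is constant folding. A minimal Python sketch, assuming expressions are tuples of the form ("int", value) or ("invoke", name, argument); that encoding is made up for this example:

```python
from typing import Tuple

Expr = Tuple  # ("int", value) or ("invoke", name, argument)


def optimize(expr: Expr) -> Expr:
    """Fold (inc <constant>) into a constant; leave everything else alone."""
    if expr[0] == "invoke":
        name, argument = expr[1], optimize(expr[2])
        if name == "inc" and argument[0] == "int":
            # The compiler does the arithmetic so the program doesn't have to.
            return ("int", argument[1] + 1)
        return ("invoke", name, argument)
    return expr


print(optimize(("invoke", "inc", ("int", -1))))          # ('int', 0)
print(optimize(("invoke", "some-method", ("int", -1))))  # unchanged
```

Every branch returns a valid expression, which is the sense in which this phase has no errors to report. Python integers don't overflow, so a real 32-bit target would need more care than this sketch takes.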

Optimization
Fight the Urge to Optimize Outside the Optimizers!

Optimization (I)
Invoke “some-method” -1 → Invoke “some-method” -1
Often, you’ll do nothing.

Binding / Type Checking
Invoke “inc” -1
    ("inc", FunctionBinding {
        Name = "inc"
        Argument = { Name = "value"
                     ArgumentType = IntType }
        Body = IntBinding 0;
        ResultType = IntType })
    InvokeBinding {
        Name = "inc"
        Argument = IntBinding -1
        ResultType = IntType }
Start with an abstract syntax tree. (Click) Produce a binding tree.

Binding / Type Checking
    InvokeBinding {
        Name = "inc"
        Argument = StringBinding "Oops!"
        ResultType = IntType }
“Expected integer; found ‘Oops!’.”
Possible errors.

Binding / Type Checking
    [<Test>]
    member this.``should return error for unbound invocation``() =
        let source = "(bad-method 2)"
        let expected = [ Ignore (ErrorBinding "Undefined function 'bad-method'.") ]
        let actual = bind source
        actual |> should equal expected

Binding / Type Checking
    let rec private toBinding (environment: Map<string, Binding>) (expression : Expression) : Bind
        match expression with
        | IntExpr n -> IntBinding n, None
        | StringExpr str -> StringBinding str, None
Now we’re going to a semantic understanding of the code. Some of this is easy!

Binding / Type Checking
Some of it is considerably harder. Type checking is easier than type inference, but both are mostly solved problems, at least for simpler cases. Now we’re going to a semantic understanding of the code. Multiple formalisms are involved: type theory, denotational semantics, operational semantics.
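Here is a deliberately tiny Python sketch of the binding step, covering both errors shown on the slides (wrong argument type, undefined function). The tuple encodings and the one-entry `ENVIRONMENT` are assumptions for this example, and the message wording is mine; TinyLanguage's real binder is richer.

```python
from typing import Dict, Tuple

Expr = Tuple     # ("int", n), ("string", s) or ("invoke", name, argument)
Binding = Tuple  # ("int", n), ("string", s), ("invoke", name, argument, result_type)
                 # or ("error", message)

# Hypothetical built-in environment: function name -> (argument type, result type).
ENVIRONMENT: Dict[str, Tuple[str, str]] = {"inc": ("int", "int")}


def type_of(binding: Binding) -> str:
    """Result type of a binding that is already known not to be an error."""
    return binding[3] if binding[0] == "invoke" else binding[0]


def bind(expr: Expr, environment: Dict[str, Tuple[str, str]] = ENVIRONMENT) -> Binding:
    """Attach meaning to syntax: resolve names and check argument types."""
    if expr[0] in ("int", "string"):
        return expr  # The easy part: literals bind to themselves.
    name, argument = expr[1], bind(expr[2], environment)
    if argument[0] == "error":
        return argument  # Report the innermost error first.
    if name not in environment:
        return ("error", f"Undefined function '{name}'.")
    expected, result = environment[name]
    actual = type_of(argument)
    if actual != expected:
        return ("error", f"Expected {expected}; found {actual}.")
    return ("invoke", name, argument, result)


print(bind(("invoke", "inc", ("int", -1))))
print(bind(("invoke", "inc", ("string", "Oops!"))))
print(bind(("invoke", "bad-method", ("int", 2))))
```

This is type checking only: every type is either written in the environment or read off a literal, and nothing is inferred.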

IL Generation
IntBinding 0
Ldc.i4 0
Ldc.i4.0
If we start from a simple binding… (click) we want to produce this IL. If you’ve seen IL before, you may wonder why not (click) this. More optimized! We’ll get to that.
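IL generation for a stack machine is a straightforward tree walk. A Python sketch, assuming the same made-up tuple encoding for bindings and emitting instructions as plain strings in the slides' spelling (a real generator would emit opcodes through the platform's API):

```python
from typing import List, Tuple

Binding = Tuple  # ("int", n) or ("invoke", name, argument, result_type)


def generate_il(binding: Binding) -> List[str]:
    """Emit stack-machine instructions that leave the binding's value on the stack."""
    if binding[0] == "int":
        return [f"Ldc.i4 {binding[1]}"]
    if binding[0] == "invoke" and binding[1] == "inc":
        # Evaluate the argument, push 1, add the top two stack values.
        return generate_il(binding[2]) + ["Ldc.i4 1", "Add"]
    raise NotImplementedError(f"No IL generation for {binding!r}")


print(generate_il(("int", 0)))                             # ['Ldc.i4 0']
print(generate_il(("invoke", "inc", ("int", -1), "int")))  # ['Ldc.i4 -1', 'Ldc.i4 1', 'Add']
```

The generator always emits the general `Ldc.i4 n` form and makes no attempt to be clever. Picking the shorter encoding is left to the optimizer that runs afterwards.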

Optimization (II)
Ldc.i4 0 → Ldc.i4.0
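This second optimizer is a peephole pass over the emitted instructions. A Python sketch, treating instructions as strings in the slides' spelling. CIL does define short forms for loading -1 and 0 through 8; the table and function names here are invented for this example:

```python
from typing import List

# Short encodings for loading the constants 0 through 8 and -1.
SHORT_FORMS = {f"Ldc.i4 {n}": f"Ldc.i4.{n}" for n in range(0, 9)}
SHORT_FORMS["Ldc.i4 -1"] = "Ldc.i4.m1"


def peephole(instructions: List[str]) -> List[str]:
    """Replace each instruction that has a short form; keep the rest unchanged."""
    return [SHORT_FORMS.get(instruction, instruction) for instruction in instructions]


print(peephole(["Ldc.i4 0"]))    # ['Ldc.i4.0']
print(peephole(["Ldc.i4 100"]))  # ['Ldc.i4 100'], no short form exists
```

As with the first optimizer, the pass never fails: anything it doesn't recognize is passed through unchanged.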

Special Tools!
PEVerify, ildasm. Similar tools exist for other platforms.

Compare!
When in doubt, try it in C#, and see what that compiler emits. Sometimes it’s a little strange. Roll with it.

https://www.ece.cmu.edu/~ganger/712.fall02/papers/p761-thompson.pdf OK, this is weird.

Further Reading
Don’t buy the dragon book. Universally recommended by people who haven’t read it, or who have read nothing else.

Further Reading
• Programming Language Concepts, by Peter Sestoft
• Modern Compiler Implementation in ML, by Andrew W. Appel

Craig Stuntz
@craigstuntz
http://blogs.teamb.com/craigstuntz
http://www.meetup.com/Papers-We-Love-Columbus/
