Since the previous talk at Go Con 2014 Autumn, lots of things in the internals have changed. In this talk, I will try to give an overview of Go compiler internals and update the information as much as possible, along with my new hacks.
Hacking Go Compiler Internals
2nd season
May 18, 2019
Moriyoshi Koizumi
Open Collector, Inc.
Agenda
  Compiler basics recap
  A brief history of Go language infrastructure
  Hacking the internals

Who am I:
  @moriyoshi at github.com / @moriyoshit at twitter.com.
  An early Go contributor.
  Reviewed the Japanese translation of "Concurrency in Go".

    commit a8fbf5dc2cd5b58167402df47bb06217c5e8fd22
    Author: Moriyoshi Koizumi
    Date:   Tue Dec 15 21:24:17 2009 -0800

        This patch enables cgo utility to correctly convert enums in the C source
        into consts in the resulting Go source. Previously known as issue 161047,
        which I deleted accidentally. Fixes issue 207.

        R=rsc
        https://golang.org/cl/166059
Lexer
The lexer scans the source code and cuts it into a series of meaningful chunks (tokens).
  src/cmd/compile/internal/syntax/tokens.go
  src/cmd/compile/internal/syntax/scanner.go

    a := b + c(12)

    a     _Name
    :=    _Define
    b     _Name
    +     _Operator
    c     _Name
    (     _Lparen
    12    _Literal
    )     _Rparen
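The compiler's own scanner lives in an internal package and cannot be imported from ordinary programs, so the sketch below uses the standard library's go/scanner instead. That is a separate implementation with different token names (IDENT rather than _Name, for instance), but it illustrates the same step. The helper name tokenize is made up for this example.

```go
package main

import (
	"fmt"
	"go/scanner"
	"go/token"
)

// tokenize is a hypothetical helper: it returns the token kinds that
// go/scanner (not the compiler's internal scanner) produces for src.
func tokenize(src string) []string {
	fset := token.NewFileSet()
	file := fset.AddFile("example.go", fset.Base(), len(src))

	var s scanner.Scanner
	s.Init(file, []byte(src), nil, 0) // nil error handler, default mode

	var kinds []string
	for {
		_, tok, _ := s.Scan()
		if tok == token.EOF {
			return kinds
		}
		kinds = append(kinds, tok.String())
	}
}

func main() {
	for _, k := range tokenize("a := b + c(12)") {
		fmt.Println(k)
	}
}
```

Besides the eight tokens of the statement itself, go/scanner also reports the semicolon that Go inserts automatically at the end of the line.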
Parser
The parser reads the generated tokens and builds an AST (abstract syntax tree).
  src/cmd/compile/internal/syntax/nodes.go
  src/cmd/compile/internal/syntax/parser.go

Tokens:
    _Name _Define _Name _Operator _Name _Lparen _Literal _Rparen

AST:
    AssignStmt (:=)
      Name (a)
      Operation (+)
        Name (b)
        CallExpr
          Name (c)
          BasicLit (12)
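A similar tree can be observed with the standard library's go/parser and go/ast. These are not the packages the compiler uses, and the node names differ (BinaryExpr instead of Operation, Ident instead of Name). The helper stmtNodeKinds and the wrapper function f are assumptions made for this sketch.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// stmtNodeKinds is a hypothetical helper: it parses one statement with
// go/parser and lists the AST node types in pre-order.
func stmtNodeKinds(stmt string) ([]string, error) {
	// A statement cannot be parsed alone, so wrap it in a function body.
	src := "package p\nfunc f() {\n" + stmt + "\n}\n"

	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "example.go", src, 0)
	if err != nil {
		return nil, err
	}

	body := file.Decls[0].(*ast.FuncDecl).Body
	var kinds []string
	ast.Inspect(body.List[0], func(n ast.Node) bool {
		if n != nil {
			kinds = append(kinds, fmt.Sprintf("%T", n))
		}
		return true
	})
	return kinds, nil
}

func main() {
	kinds, err := stmtNodeKinds("a := b + c(12)")
	if err != nil {
		panic(err)
	}
	for _, k := range kinds {
		fmt.Println(k)
	}
}
```

The parser does not care that b and c are undeclared; resolving and checking them is left to later stages.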
Parser (cont'd)
A Go source file corresponds to a *syntax.File node, which consists of declaration nodes.

    package main

    import "fmt"

    const world = "world"

    type myString string

    var hello myString = "hello, " + world

    func main() {
        fmt.Println(hello)
    }

    File
      ImportDecl
      ConstDecl
      TypeDecl
      VarDecl
      FuncDecl
        CallExpr
          SelectorExpr
            Name (fmt)
            Name (Println)
          Name (hello)
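The file-level structure can be inspected in the same spirit with go/parser. Note one difference from the compiler's syntax package: go/ast groups import, const, type and var declarations under a single GenDecl node distinguished by its Tok field. The helper declKinds and the constant sample are names invented for this sketch.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

const sample = `package main

import "fmt"

const world = "world"

type myString string

var hello myString = "hello, " + world

func main() {
	fmt.Println(hello)
}
`

// declKinds is a hypothetical helper: it lists the kind of every top-level
// declaration in src, in source order.
func declKinds(src string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "example.go", src, 0)
	if err != nil {
		return nil, err
	}

	var kinds []string
	for _, d := range file.Decls {
		switch d := d.(type) {
		case *ast.GenDecl: // import, const, type or var
			kinds = append(kinds, d.Tok.String())
		case *ast.FuncDecl:
			kinds = append(kinds, "func "+d.Name.Name)
		}
	}
	return kinds, nil
}

func main() {
	kinds, err := declKinds(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(kinds)
}
```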
Annotated AST construction
"Noder" translates the AST into the annotated AST (node tree).
  src/cmd/compile/internal/gc/noder.go
Translation is done on a one-to-one basis.

    AST                       Node Tree
    AssignStmt (:=)           OAS
      Name (a)                  ONAME
      Operation (+)             OADD
        Name (b)                  ONAME
        CallExpr                  OCALL
          Name (c)                  ONAME
          BasicLit (12)             OLITERAL
Typechecking
"Typecheck" walks through the node tree and tries to...
  Determine the type of each node.
  Annotate the nodes with the information used in later stages.
  Replace the nodes for special function calls (len, cap, append and so on) with dedicated nodes.
  Translate references to methods into closures.
  etc.
Typechecking also happens occasionally in the later stages to deal with "synthesized" nodes.
  src/cmd/compile/internal/gc/typecheck.go
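The "determine the type of each node" step can be tried out with the standard library's go/types. It is an independent type checker, not the gc typecheck pass described above, so this is only an analogy. The helper exprTypes and the sample source are assumptions for illustration.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"go/types"
)

const typedSample = `package p

var d string
var n = len(d)
`

// exprTypes is a hypothetical helper: it type-checks src with go/types and
// maps the text of each recorded expression to the type assigned to it.
func exprTypes(src string) (map[string]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "example.go", src, 0)
	if err != nil {
		return nil, err
	}

	info := &types.Info{Types: make(map[ast.Expr]types.TypeAndValue)}
	var conf types.Config // no Importer needed: src imports nothing
	if _, err := conf.Check("p", fset, []*ast.File{file}, info); err != nil {
		return nil, err
	}

	out := make(map[string]string)
	for expr, tv := range info.Types {
		out[types.ExprString(expr)] = tv.Type.String()
	}
	return out, nil
}

func main() {
	m, err := exprTypes(typedSample)
	if err != nil {
		panic(err)
	}
	fmt.Println("len(d):", m["len(d)"])
	fmt.Println("d:", m["d"])
}
```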
Typechecking (cont'd)

a := b + c(12)

    Before typecheck        After typecheck
    OAS (:=)                OAS (:=)
      ONAME a                 ONAME a           type=int
      OADD (+)                OADD (+)          type=int
        ONAME b                 ONAME b         type=int
        OCALL                   OCALLFUNC       type=int
          ONAME c                 ONAME c       type=func()
          OLITERAL 12             OLITERAL 12   type=int

len(d)

    Before typecheck        After typecheck
    OCALL                   OLEN                type=int
      ONAME len               ONAME d           type=string
      ONAME d
Variable Capturing
For each declared function, the compiler checks whether its body contains any closures and finds out how the outer variables are referenced in each closure.

    a := 1
    b := 2
    func() {
        fmt.Println(a, b)
    }()
    a = 1

    ODCLFUNC
      ODCL
        ONAME a        type=int, assigned
      ODCL
        ONAME b        type=int
      OCALLFUNC
        OCLOSURE       type=func()
      OAS (=)
        ONAME a        type=int, assigned
        OLITERAL 1     type=int

Chosen strategies:
  a → pass by reference
  b → pass by value
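The reason the two variables get different strategies can be seen from the language semantics alone: a closure must observe later assignments to a captured variable, so only a variable that is never reassigned after capture can safely be copied. The following sketch shows the observable behavior; the function name capture is invented, and the program says nothing about which strategy a particular compiler version actually picks.

```go
package main

import "fmt"

// capture returns what the closure observes for a and b when it runs
// after a has been reassigned.
func capture() (int, int) {
	a := 1
	b := 2
	f := func() (int, int) { return a, b }
	a = 10 // reassigned after capture: the closure has to see this update
	return f()
}

func main() {
	fmt.Println(capture()) // prints "10 2"
}
```

Since b is never assigned again, copying its value into the closure is indistinguishable from referencing it, which is what makes pass by value a valid choice for b.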
Escape Analysis
Checks for every variable whether its address leaks off the stack.
  src/cmd/compile/internal/gc/escape.go

    var z *int

    func main() {
        ...
        a := 1
        z = &a
        ...
    }

After typecheck:
    ODCLFUNC
      ODCL
        ONAME a        type=int, addrtaken
      OAS (:=)
        ONAME a        type=int, addrtaken
        OLITERAL 1     type=int
      OAS (=)
        ONAME z        type=*int
        OADDR (&)      type=*int
          ONAME a      type=int, addrtaken

After escAnalyze:
    ONAME z            type=*int
    OADDR (&)          type=*int
      ONAME a          type=int, addrtaken, class(PAUTOHEAP)

z is located off-stack → mark the right operand as PAUTOHEAP
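A small program makes the distinction concrete. In the sketch below, leak stores the address of a local in a package-level pointer, while noLeak takes an address that never leaves the function. The function names are invented for this example.

```go
package main

import "fmt"

var z *int

// leak stores the address of its local variable in the package-level z,
// so a must outlive the call.
func leak() {
	a := 1
	z = &a
}

// noLeak takes the address of a local but never lets it leave the function.
func noLeak() int {
	a := 1
	p := &a
	return *p
}

func main() {
	leak()
	fmt.Println(*z, noLeak()) // prints "1 1"
}
```

Building with `go build -gcflags=-m` asks the compiler to report its escape analysis decisions. For leak it should say that a is moved to the heap, though the exact wording of the diagnostics varies between releases.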
Closure Rewriting
Transforms an immediate call to a closure into a simpler form.
  src/cmd/compile/internal/gc/closure.go

Before:
    func do() {
        a := 1
        func() {
            fmt.Println(a)
            a = 2
        }()
    }

After:
    func func1(a *int) {
        fmt.Println(*a)
        *a = 2
    }

    func do() {
        a := 1
        func1(&a)
    }

This can be done after escape analysis because there is no chance the outer variables will leak.
Walk
The "walk" phase happens right before code generation takes place.
  Transforms the nodes into simpler forms.
  Transforms some nodes into function calls.
  Promotes PAUTOHEAP variables (see Escape Analysis) into pointers and initializes them with ONEWOBJ.
  etc.
  src/cmd/compile/internal/gc/walk.go
SSA Generation
SSA (Static Single Assignment) form is an intermediate representation that is often used to mediate between an AST form of source code and the corresponding machine code.
  In SSA form, every variable gets assigned only once during its lifecycle.
  Since each value has a single definition and each basic block is a single straight-line path, the same optimization strategies can be applied uniformly.
  A basic block: a node of a control flow graph, which contains no branch operation by definition (branches are represented as edges in the CFG).

GOSSAFUNC environment variable

    ~$ GOSSAFUNC=foo go tool compile foo.go
    dumped SSA to /home/moriyoshi/ssa.html
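To make the single-assignment idea concrete, here is a tiny function together with a hand-written SSA-style rendering. The rendering in the comment is illustrative only and is not actual compiler output. The function absSSA mimics it in plain Go, where the two assignments to v4 stand in for a phi node. Both function names are invented for this sketch.

```go
package main

import "fmt"

// abs in ordinary Go: r is assigned on two different paths.
func abs(x int) int {
	r := x
	if x < 0 {
		r = -x
	}
	return r
}

// Hand-written SSA-style rendering of abs (illustrative, not compiler output):
//
//	b1: v1 = Arg x
//	    v2 = Less v1, 0
//	    If v2 -> b2, b3
//	b2: v3 = Neg v1
//	    Goto b3
//	b3: v4 = Phi(v3 from b2, v1 from b1)
//	    Ret v4
//
// absSSA mirrors that shape: every v is defined once, and the if/else that
// sets v4 plays the role of the phi node merging the two paths.
func absSSA(x int) int {
	v1 := x
	v2 := v1 < 0
	var v4 int
	if v2 {
		v3 := -v1
		v4 = v3
	} else {
		v4 = v1
	}
	return v4
}

func main() {
	fmt.Println(abs(-3), absSSA(-3), abs(4), absSSA(4)) // prints "3 3 4 4"
}
```

Saving this as a file and running the compiler with GOSSAFUNC=abs produces the real SSA dump, which can be compared against the simplified picture above.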
Prehistoric Age
The earliest commits of Go imply the "symbolic" lineage.
(ken=Thompson / dmr=Ritchie / bwk=Kernighan)
There was no such thing as git at that time... was there?

    commit 7d7c6a97f815e9279d08cfaea7d5efb5e90695a8
    Author: Brian Kernighan
    Date:   Tue Jul 18 19:05:45 1972 -0500

        hello, world

        R=ken
        DELTA=7 (7 added, 0 deleted, 0 changed)

    commit 0bb0b61d6a85b2a1a33dcbc418089656f2754d32
    Author: Brian Kernighan
    Date:   Sun Jan 20 01:02:03 1974 -0400

        convert to C

        R=dmr
        DELTA=6 (0 added, 3 deleted, 3 changed)

    commit 0744ac969119db8a0ad3253951d375eb77cfce9e
    Author: Brian Kernighan
    Date:   Fri Apr 1 02:02:04 1988 -0500
Pre-1.0 to 1.4
Most of the bootstrapping tools were derived from Plan 9 / Inferno.
  6a / 6c / 6g / lib9...
The toolchain was written in Plan-9 flavored C (http://doc.cat-v.org/plan_9/4th_edition/papers/comp)
Naive code generation facility (~1.6)
  Semi-concrete IRs were generated directly from annotated ASTs.
1.3
  Near-precise GC of pointers in stack.
    Based on the construction of control flow graphs (CFG).
  Contiguous stack model.
1.5
  Achieved complete self-hosting.
1.7
  SSA-based codegen introduced.
Example 1: Let Go accept emojis for various identifiers.
Modify the relevant portion of the lexer so that it accepts "🍣" (U+1F363) in identifiers:

    func (s *scanner) isIdentRune(c rune, first bool) bool {
        switch {
        case unicode.IsLetter(c) || c == '_' || c == 0x1f363: // Modified
            // ok
        case unicode.IsDigit(c):
            if first {
                s.errorf("identifier cannot begin with digit %#U", c)
            }
        case c >= utf8.RuneSelf:
            s.errorf("invalid identifier character %#U", c)
        default:
            return false
        }
        return true
    }
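Why the patch is needed at all can be checked outside the compiler: U+1F363 is a symbol, not a letter, so unicode.IsLetter rejects it. The sketch below reproduces only the first case of the predicate as two standalone functions. The names isIdentStart and isIdentStartPatched are invented here and do not exist in the compiler.

```go
package main

import (
	"fmt"
	"unicode"
)

const sushi = 0x1f363 // U+1F363, the code point used in the patch above

// isIdentStart mirrors the letter-or-underscore test of the stock lexer.
func isIdentStart(c rune) bool {
	return unicode.IsLetter(c) || c == '_'
}

// isIdentStartPatched adds the single extra code point, as the patch does.
func isIdentStartPatched(c rune) bool {
	return isIdentStart(c) || c == sushi
}

func main() {
	fmt.Println(unicode.IsLetter(sushi))    // false: not a letter
	fmt.Println(unicode.IsSymbol(sushi))    // true: it is a symbol
	fmt.Println(isIdentStart(sushi))        // false
	fmt.Println(isIdentStartPatched(sushi)) // true
}
```

Accepting a single code point keeps the change minimal; supporting emojis in general would need a broader test than one equality check.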
Example 2: Add a New Operator
  Modify the lexer so it generates a token value for the added operator.
  Modify the parser so it understands the token and produces the right AST node for it.
  Modify either the typecheck or the walk facility so that the annotated node corresponding to the AST node is transformed into a plain (non-fancy) subtree of nodes.

    type Foo struct {
    }

    func (*Foo) ->B(i int) {
        fmt.Println(i)
    }

    func (*Foo) B->() int {
        return 1
    }

    type X interface {
        ->B(int)
        B->() int
    }

    func do(x X) {
        x->B = 0
        x->B, c := 2, 3
        fmt.Println(c)
    }