From Source to Code: How CPython's Compiler Works

How the heck does CPython take a blob of bytes you call source code and create another blob of bytes called bytecode which it is able to execute to make the magic of Python programs work? This talk's aim is to provide a conceptual answer to that question. The overall process of tokenizing, parsing, creating an AST, and then finally emitting bytecode will be covered.

If you have no clue what any of those previous words meant, don't worry! This talk will be accessible to people who are not compiler experts. We'll also cover how various parts of the compiler are exposed through Python's standard library so you can play with what you learn afterwards.

This was presented at PyCon CA 2013: http://www.youtube.com/watch?v=R31NRWgoIWM

Brett Cannon

August 10, 2013

Transcript

  1. From Python to Code: How CPython's compiler works
    Dr. Brett Cannon, PyCon CA 2013
    http://bit.ly/BrettCannon-PyConCA2013
    http://bit.ly/brettcannon
  2. Based on "Design of CPython’s Compiler" http://docs.python.org/devguide/compiler.html

  3. The five basic steps of compilation
    Decoding, Tokenizing, Parsing, AST, Compiling (grouped into front-end and back-end)
  4. Decoding Bytes to text

  5. PEP 263 re.compile("coding[:=]\s*([-\w.]+)")

  6. PEP 3120 Using UTF-8 as the Default Source Encoding

  7. PEP 3131 Supporting Non-ASCII Identifiers

  8. Where in the Stdlib • importlib.util.decode_source(source_bytes) • tokenize.detect_encoding(readline)
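The two functions named on this slide can be tried directly. The snippet below is a minimal sketch; the sample source bytes and the `latin-1` coding cookie are made up for illustration and are not from the talk.

```python
import io
import tokenize
from importlib.util import decode_source

# Hypothetical source bytes: a PEP 263 coding cookie followed by Latin-1 text.
source_bytes = "# -*- coding: latin-1 -*-\nname = 'caf\xe9'\n".encode("latin-1")

# detect_encoding() takes a readline callable that returns bytes and reports
# the normalized encoding name plus the lines it had to read to find it.
encoding, lines_read = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
print(encoding)

# Without a cookie or BOM, PEP 3120 makes UTF-8 the default.
default_encoding, _ = tokenize.detect_encoding(io.BytesIO(b"x = 1\n").readline)
print(default_encoding)

# decode_source() performs the detection and the decoding in one call.
text = decode_source(source_bytes)
print(ascii(text))
```

`tokenize.detect_encoding()` only inspects the first two lines, which is where PEP 263 says a coding cookie may appear.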

  9. Tokenizing Text to "words"

  10. x=3+2 "x" "=" "3" "+" "2"

  11. echo "x=3+2" | ./python -m tokenize -e
    1,0-1,1:    NAME       'x'
    1,1-1,2:    EQUAL      '='
    1,2-1,3:    NUMBER     '3'
    1,3-1,4:    PLUS       '+'
    1,4-1,5:    NUMBER     '2'
    1,5-1,6:    NEWLINE    '\n'
    2,0-2,0:    ENDMARKER  ''
  12. Where in the Stdlib keyword tokenize
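The same token stream can be produced from inside Python rather than from the command line. This is a small sketch using `tokenize.generate_tokens()`; the variable names are illustrative only.

```python
import io
import keyword
import token
import tokenize

source = "x=3+2\n"

# generate_tokens() takes a readline callable that returns str.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

for tok in tokens:
    # exact_type separates operators such as EQUAL and PLUS,
    # which otherwise all share the generic OP type.
    print(tok.start, tok.end, token.tok_name[tok.exact_type], repr(tok.string))

# The "words" from the earlier slide: names, numbers, and operators.
words = [tok.string for tok in tokens
         if tok.type in (token.NAME, token.NUMBER, token.OP)]
exact_names = [token.tok_name[tok.exact_type] for tok in tokens]

# The keyword module answers whether a NAME is reserved.
print(keyword.iskeyword("pass"), keyword.iskeyword("x"))
```

Note that the tokenizer reports `pass` as an ordinary `NAME`; the `keyword` module is what tells you whether a name is reserved.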

  13. Parsing "Words" to "sentence structure"

  14. Use a grammar to define structure
    pass
    stmt: simple_stmt | compound_stmt
    simple_stmt: small_stmt (';' small_stmt)* [';'] NEWLINE
    small_stmt: (expr_stmt | del_stmt | pass_stmt | flow_stmt | import_stmt | global_stmt | nonlocal_stmt | assert_stmt)
    del_stmt: 'del' exprlist
    pass_stmt: 'pass'
    flow_stmt: break_stmt | continue_stmt | return_stmt | raise_stmt | yield_stmt
    break_stmt: 'break'
    ...
  15. Concrete syntax tree for pass
    (268,                      stmt
      (269,                    simple_stmt
        (270,                  small_stmt
          (275,                pass_stmt
            (1, 'pass'))),     NAME
        (4, '')))              NEWLINE
  16. 3+2
    (258, (327, (302, (306, (307, (308, (309, (312, (313, (314, (315,
      (arith_expr,
        (term, (factor, (power, (atom, (NUMBER, '3'))))),
        (PLUS, '+'),
        (term, (factor, (power, (atom, (NUMBER, '2')))))
      )
    )))))))))), (4, ''), (0, ''))
  17. LL(1) Parser

  18. Where in the Stdlib parser symbol token

  19. Abstract Syntax Tree "Sentence structure" to semantics

  20. Zephyr ASDL
    expr = BoolOp(boolop op, expr* values)
         | BinOp(expr left, operator op, expr right)
         | UnaryOp(unaryop op, expr operand)
         | Lambda(arguments args, expr body)
         | IfExp(expr test, expr body, expr orelse)
         | Dict(expr* keys, expr* values)
         | Set(expr* elts)
         | ListComp(expr elt, comprehension* generators)
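  21. x = 3 + 2, visually
    [Tree diagram: an Assign node (=) whose targets edge leads to x and whose value edge leads to a BinOp node (+), which has a left edge to 3 and a right edge to 2.]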
  22. x = 3 + 2, actually
    Module(body=[
        Assign(targets=[Name(id='x', ctx=Store())],
               value=BinOp(left=Num(n=3), op=Add(), right=Num(n=2)))])
  23. Where in the stdlib ast
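To see the tree for `x = 3 + 2` yourself, the `ast` module can parse the source and dump the result. A short sketch follows. One difference from the slide to keep in mind: since Python 3.8 number literals are represented as `ast.Constant` nodes, so the dump will not show `Num`.

```python
import ast

tree = ast.parse("x = 3 + 2")
print(ast.dump(tree))

# Walk down to the pieces shown in the diagram.
assign = tree.body[0]
target = assign.targets[0]
binop = assign.value
print(type(assign).__name__, target.id, type(binop.op).__name__)

# An AST can be handed straight to compile() and then executed.
namespace = {}
exec(compile(tree, "<ast>", "exec"), namespace)
print(namespace["x"])
```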

  24. Working with the AST
    import ast

    class PEP8(ast.NodeVisitor):
        def visit_Assign(self, node):
            for target in node.targets:
                self.visit(target)
                print(' = ', end='')
            self.visit(node.value)
            print()

        def visit_BinOp(self, node):
            self.visit(node.left)
            print(' ', end='')
            self.visit(node.op)
            print(' ', end='')
            self.visit(node.right)

        def visit_Name(self, node):
            print(node.id, end='')

        def visit_Num(self, node):
            print(node.n, end='')

        def visit_Add(self, node):
            print('+', end='')

    >>> nodes = ast.parse('x=3+2;y=4')
    >>> PEP8().visit(nodes)
    x = 3 + 2
    y = 4
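The visitor on this slide targets the Python of 2013. `ast.Num` has been deprecated since Python 3.8, where number literals became `ast.Constant` nodes. The sketch below adapts the same idea under that assumption; the class name `SpacedAssignments` and the choice to collect output in a list instead of printing are illustrative, not from the talk.

```python
import ast


class SpacedAssignments(ast.NodeVisitor):
    """Rebuild simple assignments with PEP 8 spacing (hypothetical helper)."""

    def __init__(self):
        self.parts = []

    def visit_Assign(self, node):
        for target in node.targets:
            self.visit(target)
            self.parts.append(" = ")
        self.visit(node.value)
        self.parts.append("\n")

    def visit_BinOp(self, node):
        self.visit(node.left)
        self.parts.append(" ")
        self.visit(node.op)
        self.parts.append(" ")
        self.visit(node.right)

    def visit_Name(self, node):
        self.parts.append(node.id)

    def visit_Constant(self, node):
        # Replaces visit_Num from the slide on Python 3.8 and later.
        self.parts.append(repr(node.value))

    def visit_Add(self, node):
        self.parts.append("+")


visitor = SpacedAssignments()
# Module has no dedicated method, so generic_visit() walks into its body.
visitor.visit(ast.parse("x=3+2;y=4"))
formatted = "".join(visitor.parts)
print(formatted, end="")
```

Like the original, this only handles the node types it defines methods for; anything else falls through to `generic_visit()`.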
  25. Compiling Semantics to bytecode

  26. Bytecode • CPython implementation detail • Stack-based • 101 instructions

  27. def func(): x = y + 2
    1     0 LOAD_GLOBAL     0 (y)
          3 LOAD_CONST      1 (2)
          6 BINARY_ADD
          7 STORE_FAST      0 (x)
         10 LOAD_CONST      0 (None)
         13 RETURN_VALUE
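A listing like the one above comes from the `dis` module. The sketch below disassembles the same function. Because bytecode is a CPython implementation detail, the exact opcode names and offsets you see will probably differ from the slide depending on your interpreter version, so the code only looks for opcode families rather than exact names.

```python
import dis


def func():
    x = y + 2  # y is never defined; the function is only compiled, not called


# Human-readable listing, in the same spirit as the slide.
dis.dis(func)

# The same information as data.
opnames = [ins.opname for ins in dis.get_instructions(func)]
loads_global = any(name.startswith("LOAD_GLOBAL") for name in opnames)
stores_local = any(name.startswith("STORE_FAST") for name in opnames)
has_binary_op = any(name.startswith("BINARY_") for name in opnames)
print(loads_global, stores_local, has_binary_op)

# The raw bytecode itself is just a bytes object on the code object.
print(type(func.__code__.co_code))
```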
  28. Control Flow Graph
    • Essentially a graph of bytecode blocks where edges are jumps to other blocks
    • Three step process
      a. Figure out scoping of each variable
      b. Compile blocks to bytecode without jumps calculated
      c. Flatten CFG into a list and calculate jumps
  29. Peepholer
    • Operates at the bytecode level
    • Simple transformations within a single block only
    • Bytecode is required to either stay the same length or shrink
    • Always used
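One visible effect of this kind of optimization is constant folding: `3 + 2` never reaches the interpreter as an addition. The sketch below checks for that. Which compiler stage performs the folding has moved around between CPython releases, so treat this as an observation about the output rather than a statement about where it happens; the helper name `binary_ops` is made up for the example.

```python
import dis

folded = compile("x = 3 + 2", "<example>", "exec")
unfolded = compile("x = y + 2", "<example>", "exec")


def binary_ops(code):
    """Return the names of binary-operation instructions in a code object."""
    return [ins.opname for ins in dis.get_instructions(code)
            if ins.opname.startswith("BINARY_")]


# Two constants: the addition is done at compile time.
print(binary_ops(folded))
# A name is involved: the addition must wait until run time.
print(binary_ops(unfolded))

namespace = {}
exec(folded, namespace)
print(namespace["x"])
```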
  30. Where in the stdlib compile() dis symtable
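The scoping step of the back-end is exposed through `symtable`. As a minimal sketch, the snippet below asks how the compiler classifies `x` and `y` in the earlier `func` example, which is what decides between `STORE_FAST` and `LOAD_GLOBAL` in the disassembly.

```python
import symtable

source = "def func():\n    x = y + 2\n"

# Build the symbol tables for the module and pick out the function's table.
module_table = symtable.symtable(source, "<example>", "exec")
func_table = module_table.get_children()[0]
print(func_table.get_name(), func_table.get_type())

x = func_table.lookup("x")
y = func_table.lookup("y")

# x is assigned inside the function, so it is local;
# y is only read, so it resolves to a global.
print("x local:", x.is_local(), "| y global:", y.is_global())
```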

  31. Recap Decoding Tokenizing Parsing AST Compiling Front-end Back-end