Curious about CPython's new PEG parser but not a compiler expert? This talk is for you. It covers the fundamentals, from traditional parsing to PEG, based on the speaker's own learning.
did Python use an LL(1) parser before? Why did Guido choose a PEG parser? What other parsers are there? What are the differences between those parsers? How are those parsers implemented?
Parse Tree / Concrete Syntax Tree (CST)
A rooted tree that represents the syntactic structure of a string according to some context-free grammar.
Abstract Syntax Tree (AST)
A tree representation of the abstract syntactic structure of source code written in a programming language.
or indirect left recursion.
Left-recursive grammar:
  E -> E + T | T
  T -> T * F | F
  F -> Num
Rewritten grammar:
  E -> TE’
  E’ -> +TE’ | None
  T -> FT’
  T’ -> *FT’ | None
  F -> Num
The E in the first E + T derives a second E + T, the E in that second E + T derives a third E + T, and so on without end.
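To see why a top-down parser cannot use the left-recursive rule directly, here is a minimal sketch. The function names are made up for illustration. A naive function for E -> E + T | T has to parse an E at the same position before it consumes any token, so it only stops when Python's recursion limit is hit.

```python
def parse_E_naive(tokens, pos=0):
    # E -> E + T | T
    # Trying the first alternative means parsing an E at the same position,
    # so the function calls itself before consuming any token.
    return parse_E_naive(tokens, pos)


def recursion_never_ends(tokens):
    # Report whether the naive parser ran into Python's recursion limit.
    try:
        parse_E_naive(tokens)
    except RecursionError:
        return True
    return False
```

The rewritten grammar avoids this because every rule consumes a token (via F -> Num) before it recurses.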
LL(k) = Left-to-right, Leftmost derivation, k-token lookahead (k>0)
LR(k) = Left-to-right, Rightmost derivation, k-token lookahead (k>=0)
*LL and LR parsers derive the nodes of the tree in a different order.
[Figure: two parse trees with nodes E, N, + and *, annotated "+ → *" and "* → +"]
Input string: 2 + 3 * 4
I am "a number token". If I perform 1-token lookahead and meet "a + token", what do I do next?
Grammar:
  E -> TE’
  E’ -> +TE’ | None
  T -> FT’
  T’ -> *FT’ | None
  F -> Num
Step1
  *write a function for each non-terminal: parse_E() is parse_Ep(parse_T()), parse_T() is parse_Tp(parse_F())
Step2
  *parse from left to right
  *perform k-lookahead
Step3
  *recursively parse the input string, starting from the first rule parse_E()
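The three steps above can be sketched as a small recursive-descent parser with 1-token lookahead. This is only an illustration: the class name LL1Parser, the regex tokenizer and the choice to evaluate the expression instead of building a tree are assumptions, not part of the slides.

```python
import re


def tokenize(text):
    # Split the input into number tokens and the operators + and *.
    return re.findall(r"\d+|[+*]", text)


class LL1Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        # 1-token lookahead: inspect the next token without consuming it.
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def parse_E(self):          # E -> T E'
        return self.parse_Ep(self.parse_T())

    def parse_Ep(self, left):   # E' -> + T E' | None
        if self.peek() == "+":
            self.pos += 1
            return self.parse_Ep(left + self.parse_T())
        return left             # the empty (None) alternative

    def parse_T(self):          # T -> F T'
        return self.parse_Tp(self.parse_F())

    def parse_Tp(self, left):   # T' -> * F T' | None
        if self.peek() == "*":
            self.pos += 1
            return self.parse_Tp(left * self.parse_F())
        return left             # the empty (None) alternative

    def parse_F(self):          # F -> Num
        token = self.peek()
        if token is None or not token.isdigit():
            raise SyntaxError(f"expected a number, got {token!r}")
        self.pos += 1
        return int(token)


def evaluate(text):
    parser = LL1Parser(tokenize(text))
    value = parser.parse_E()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return value
```

For the input string 2 + 3 * 4, evaluate returns 14: after the first number, the lookahead token + makes parse_Tp take the empty alternative and parse_Ep take the + branch.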
Augmented grammar (items with the dot at the start):
  E’ -> .E --- (1)
  E -> .E + T --- (2)
  E -> .T --- (3)
  T -> .T * Num --- (4)
  T -> .Num --- (5)
Step1
  Build a Deterministic Finite Automaton (DFA)
Item sets of the other states:
  E’ -> E.   E -> E. + T
  E -> T.   T -> T. * Num
  T -> Num.
  E -> E + .T   T -> .T * Num   T -> .Num
  T -> T * .Num
  E -> E + T.   T -> T. * Num
  T -> T * Num.
[Figure: DFA with states S1 to S8 and transitions labelled E, T, Num, * and +]
Left recursion support
rule: A -> B | a
  A, B: Non-Terminal
  a: Terminal
  ->: Derivation (*some papers write <-)
  |: OR (if / elif / ...)
A tries A -> B first. Only after A -> B fails does A try A -> a.
*disallows ambiguous syntax
*Introduced in 2002 (Packrat Parsing: Simple, Powerful, Lazy, Linear Time)
*support for Regular Expression operators (EBNF grammar) comes from another paper
a
Grammar2: A -> a | a b
• An LL/LR parser will fail to complete when the input grammar is ambiguous.
• A PEG parser tries the alternatives in order and commits to the first one that matches, so in Grammar2 the latter alternative (a b) never succeeds.
“A PEG parser generator will resolve unintended ambiguities earliest-match-first, which may be arbitrary and lead to surprising parses.” (source)
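The effect of ordered choice on Grammar2 can be checked with a few hand-written combinators. This is a minimal sketch: literal, sequence, ordered_choice and matches_whole are hypothetical helper names, and the input is written without spaces ("ab") for simplicity.

```python
def literal(text):
    # Match an exact string; return the new position, or None on failure.
    def parse(source, pos):
        end = pos + len(text)
        return end if source[pos:end] == text else None
    return parse


def sequence(*parsers):
    # Match every parser in turn; fail as soon as one of them fails.
    def parse(source, pos):
        for parser in parsers:
            pos = parser(source, pos)
            if pos is None:
                return None
        return pos
    return parse


def ordered_choice(*parsers):
    # PEG choice: commit to the first alternative that succeeds.
    def parse(source, pos):
        for parser in parsers:
            end = parser(source, pos)
            if end is not None:
                return end
        return None
    return parse


def matches_whole(rule, source):
    # True only if the rule consumes the entire input.
    return rule(source, 0) == len(source)


# Grammar2: A -> a | a b
grammar2 = ordered_choice(literal("a"), sequence(literal("a"), literal("b")))
# Same alternatives with the longer one first: A -> a b | a
reordered = ordered_choice(sequence(literal("a"), literal("b")), literal("a"))
```

With grammar2, the input "ab" stops after the first "a" because the first alternative already succeeded. Putting the longer alternative first lets both "a" and "ab" match completely.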
A PEG parser can be a Packrat parser, or another traditional parser with a k-lookahead limitation. In most cases, "PEG parser" means Packrat parser.
[Figure: diagram relating CFG, EBNF grammar and PEG to traditional parsers, PEG parsers and Packrat parsers]
Grammar:
  E -> E + T | T
  T -> T * F | F
  F -> Num
Step1
  *write a function for each non-terminal (PEG rule): parse_E() calls parse_E() and parse_T(); parse_T() calls parse_T() and parse_F()
Step2
  *parse from left to right
  *perform infinite lookahead + memoization
  *the idea of memoization was introduced in 1970
Step3
  *recursively parse the input string, starting from the first rule parse_E()
Left recursion support
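The memoization step can be sketched as follows. To keep the example free of left recursion, it assumes a right-recursive variant of the grammar (E <- T '+' E / T, T <- F '*' T / F), which is a swap made for illustration and not the grammar on the slide. The class name PackratParser and the calls counter are also hypothetical.

```python
import re


class PackratParser:
    def __init__(self, text):
        self.tokens = re.findall(r"\d+|[+*]", text)
        self.memo = {}   # (rule name, position) -> (value, next position) or None
        self.calls = 0   # number of rule bodies actually executed

    def apply(self, rule, pos):
        # Memoization: each (rule, position) pair is computed at most once.
        key = (rule.__name__, pos)
        if key not in self.memo:
            self.calls += 1
            self.memo[key] = rule(pos)
        return self.memo[key]

    def at(self, token, pos):
        return pos < len(self.tokens) and self.tokens[pos] == token

    def parse_E(self, pos):   # E <- T '+' E / T
        left = self.apply(self.parse_T, pos)
        if left is not None and self.at("+", left[1]):
            right = self.apply(self.parse_E, left[1] + 1)
            if right is not None:
                return left[0] + right[0], right[1]
        # Second alternative: T at the same position is answered from the memo.
        return self.apply(self.parse_T, pos)

    def parse_T(self, pos):   # T <- F '*' T / F
        left = self.apply(self.parse_F, pos)
        if left is not None and self.at("*", left[1]):
            right = self.apply(self.parse_T, left[1] + 1)
            if right is not None:
                return left[0] * right[0], right[1]
        return self.apply(self.parse_F, pos)

    def parse_F(self, pos):   # F <- Num
        if pos < len(self.tokens) and self.tokens[pos].isdigit():
            return int(self.tokens[pos]), pos + 1
        return None

    def parse(self):
        result = self.apply(self.parse_E, 0)
        if result is None or result[1] != len(self.tokens):
            raise SyntaxError("input does not match the grammar")
        return result[0]
```

When the first alternative of a rule fails and the parser backtracks, the second alternative reuses the memoized result instead of parsing the same position again. The memo table is also where the extra memory consumption comes from.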
of operator) < (count function call): return False
Approach 2: reverse the call stack (adopted in CPython!)
Source: Guido's Medium post (Left-recursive PEG Grammars)
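A minimal sketch of how a memo table can support left recursion, assuming Approach 2 refers to the technique of priming the memo with a failure and then re-running the rule while the match keeps growing. The names LeftRecursiveParser and grow are made up for illustration, and this is not CPython's actual implementation.

```python
import re


class LeftRecursiveParser:
    def __init__(self, text):
        self.tokens = re.findall(r"\d+|[+*]", text)
        self.memo = {}   # (rule name, position) -> (tree, next position) or None

    def grow(self, name, rule, pos):
        key = (name, pos)
        if key in self.memo:
            return self.memo[key]
        # Prime the memo with failure so the inner left-recursive call
        # returns immediately instead of recursing forever.
        self.memo[key] = best = None
        while True:
            result = rule(pos)
            # Stop as soon as a pass no longer consumes more input.
            if result is None or (best is not None and result[1] <= best[1]):
                break
            self.memo[key] = best = result
        return best

    def at(self, token, pos):
        return pos < len(self.tokens) and self.tokens[pos] == token

    def parse_E(self, pos):
        return self.grow("E", self._E, pos)

    def _E(self, pos):        # E <- E '+' T / T
        left = self.parse_E(pos)   # answered from the memo (the current seed)
        if left is not None and self.at("+", left[1]):
            right = self.parse_T(left[1] + 1)
            if right is not None:
                return ("+", left[0], right[0]), right[1]
        return self.parse_T(pos)

    def parse_T(self, pos):
        return self.grow("T", self._T, pos)

    def _T(self, pos):        # T <- T '*' F / F
        left = self.parse_T(pos)
        if left is not None and self.at("*", left[1]):
            right = self.parse_F(left[1] + 1)
            if right is not None:
                return ("*", left[0], right[0]), right[1]
        return self.parse_F(pos)

    def parse_F(self, pos):   # F <- Num
        if pos < len(self.tokens) and self.tokens[pos].isdigit():
            return int(self.tokens[pos]), pos + 1
        return None

    def parse(self):
        result = self.parse_E(0)
        if result is None or result[1] != len(self.tokens):
            raise SyntaxError("input does not match the grammar")
        return result[0]
```

Each pass through the loop extends the previous result by one operator, so 1 + 2 + 3 comes out as a left-leaning tree without rewriting the grammar.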
Comparison, row by row (Packrat parser | traditional LL/LR parser):
  (*Right-to-left memo) | Left-to-right
  Left Recursion Support: (*not supported in the first paper) | LL needs to rewrite the grammar
  Ambiguous: Disallowed (determinism) | Allowed
  Space Complexity: O(Code Size) (space consumption) | O(Depth of Parse Tree)
  Worst Time Complexity: Superlinear time (statelessness) *because of features like typedef in C | Exponential time
  Capability: Basically covers all traditional cases (infinite lookahead) | No left recursion/ambiguity for LL; k-lookahead limitations for both (e.g. dangling else)
Red text: 3 highlighted characteristics of Packrat parser.
written by Guido 30 years ago
The parser requires steps to generate a CST and convert the CST to an AST.
CPython 3.9 uses a PEG (Packrat) parser
  Infinite lookahead
  PEG rules support left recursion
  No more CST-to-AST step (source)
CPython 3.10 drops LL(1) parser support
This answers “Why PEG?”
LL(1) to LL(∞) (infinite lookahead)
Hardware supports Packrat’s memory consumption now
Skip intermediate parse tree (CST) construction
Performance: within 10% of the LL(1) parser in both speed and memory consumption (PEP 617)
LL(1) and LR(0) parser, and Packrat parser from scratch!
Leetcode: 227. Basic Calculator II
Q. How do I verify my understanding?
A. Get your hands dirty!
Writing a PEG parser for fun and profit
Pablo Galindo and Lysandros Nikolaou @ Podcast.__init__: The Journey To Replace Python's Parser And What It Means For The Future
Emily Morehouse-Valcarcel @ PyCon 2018: The AST and Me
Alex Gaynor @ PyCon 2013: So you want to write an interpreter?