SpiderMonkey Parser API: A Standard For Structured JS Representations

Description
-----------

The representation of JavaScript programs that Mozilla used when they exposed their SpiderMonkey reflection API isn't perfect; in fact, it has a good number of flaws. But a rich ecosystem of tools has formed around this particular structured representation of JavaScript programs, most notably the popular esprima parser.

The reusability and composability of these tools have made this format the standard for modern projects that transform, generate, analyse, or otherwise work with JavaScript programs. We will explore this burgeoning format, evaluate its design with the benefit of hindsight, and showcase some of the more useful and prominent projects that have adopted it.

Speaker Notes
-------------

=== Slide 1 ===

=== Slide 2 ===

* this is a JavaScript program
* uses the `new` operator on a constructor, "big C"
* passes result of `1 + a`
* not very useful in this format; just a series of characters
* meaningful static analysis requires a more structured representation

=== Slide 3 ===

* in creating this structure, we usually start with lexical analysis (tokenisation)
* makes for a much simpler parser
* character stream turned into stream of more meaningful tokens
* tokens tagged with type
* whitespace characters do not create tokens
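
A minimal sketch of the tokenisation step described above, assuming a tiny token grammar that only covers the running example `new C(1 + a)`. The `tokenise` function and its token type names are illustrative, not the API of any real lexer.

```javascript
// Hypothetical, deliberately tiny tokeniser: handles only what the
// example `new C(1 + a)` needs. Real lexers cover far more.
const KEYWORDS = new Set(["new"]);

function tokenise(source) {
  const tokens = [];
  // One alternative per token kind; whitespace is matched but never emitted.
  const pattern = /\s+|([A-Za-z_$][\w$]*)|(\d+)|([()+])/y;
  let index = 0;
  while (index < source.length) {
    pattern.lastIndex = index;
    const match = pattern.exec(source);
    if (match === null) {
      throw new SyntaxError("Unexpected character at offset " + index);
    }
    index = pattern.lastIndex;
    const [, word, number, punctuator] = match;
    if (word !== undefined) {
      // Tag each token with its type.
      tokens.push({ type: KEYWORDS.has(word) ? "Keyword" : "Identifier", value: word });
    } else if (number !== undefined) {
      tokens.push({ type: "Numeric", value: number });
    } else if (punctuator !== undefined) {
      tokens.push({ type: "Punctuator", value: punctuator });
    }
  }
  return tokens;
}

console.log(tokenise("new C(1 + a)").map((t) => t.type + ":" + t.value).join(" "));
```

The character stream becomes seven typed tokens, and the two spaces in the input produce none.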

=== Slide 4 ===

* parsers are magic
* they turn the token stream into a tree

=== Slide 5 ===

* this is a representation of an abstract syntax tree (AST)
* formatted the way the SpiderMonkey interpreter represents it internally

=== Slide 6 ===

* same AST as a JavaScript object

=== Slide 7 ===

* mid-2010, Dave Herman announced on his Mozilla blog
** new public API in SpiderMonkey
** exposes its JavaScript parser.

* even Dave Herman criticised the interface
** exposing the internal format wouldn't allow the interpreter to evolve its IR

=== Slide 8 ===

* since they couldn't change it, Mozilla created a specification for the SpiderMonkey AST format

=== Slide 9 ===

* specification defines Reflect.parse method with this signature
** takes a string of JS as input
** returns a SpiderMonkey-format AST

* vast majority of document specifies AST format
* let's take a closer look at it

=== Slide 10 ===

* the Node interface
* all nodes have a "type" member
* each node may have source tracking information; line/column of start/end parse position
* used to preserve location information through transformations to track original source
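
As a rough illustration of the Node interface described above, here is what a single node with source tracking information might look like. The `describeLocation` helper is hypothetical, and the sketch assumes 1-based lines and 0-based columns.

```javascript
// Illustrative only: an Identifier node for the `a` in `new C(1 + a)`.
// Assumes 1-based line numbers and 0-based column numbers.
const identifierNode = {
  type: "Identifier",
  name: "a",
  loc: {
    source: null,
    start: { line: 1, column: 10 },
    end: { line: 1, column: 11 }
  }
};

// Hypothetical helper: render a node's position, tolerating a missing `loc`.
function describeLocation(node) {
  if (!node.loc) return node.type + " (no location)";
  const { start, end } = node.loc;
  return node.type + " @ " + start.line + ":" + start.column +
    "-" + end.line + ":" + end.column;
}

console.log(describeLocation(identifierNode));      // Identifier @ 1:10-1:11
console.log(describeLocation({ type: "Program" })); // Program (no location)
```

A transformation that copies `loc` onto the nodes it creates keeps the link back to the original source.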

=== Slide 11 ===

* Program is a Node
* any successful parse will have a top-level Program node

* Function interface shows support for ES6 features
* because SpiderMonkey’s implementation parses JavaScript, not ECMAScript

=== Slide 12 ===

* Statement interface extends Node interface
* EmptyStatement is the simplest Statement node
* BlockStatement contains a list of statements and executes them in sequence
* ExpressionStatement allows an Expression to be used in Statement position

=== Slide 13 ===

before we continue, let’s look at what we’re looking for in a good AST format

1) allows a simple and efficient visitor
2) allows moving / copying / replacing subtrees
3) allows code generator to be total
4) prevents code duplication

=== Slide 14 ===

* These nodes are combinations of similarly structured nodes... kind of
* BinaryExpression not split up into PlusExpression, MultiplicationExpression, etc.
* But for some reason, split up AssignmentExpression (assignment ops), LogicalExpression (&&, ||), and BinaryExpression (all other binary operators)

=== Slide 15 ===

* Same thing for UpdateExpression (increments, decrements) and UnaryExpression

=== Slide 16 ===

* Identifier node uses arbitrary string name
** bare identifier cannot be a reserved word

* Literal node way overloaded
** "type" depends on type of value
** cannot be serialised to JSON

* MemberExpression
** computed flag can conflict with object/property
** should be dynamic/static member access nodes

* Identifier misused for MemberExpressions
** "identifier" used as member access shouldn't be allowed to be numeric
** "identifier" *should* be allowed to be a reserved word
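
A small sketch of the JSON problem with Literal nodes, using only the built-in `JSON` object. The two node objects are hand-written examples rather than parser output.

```javascript
// A Literal node for the regular expression /ab+c/g holds a RegExp object.
const regexLiteral = { type: "Literal", value: /ab+c/g };

// A Literal node for the numeric literal 1e999 holds the number Infinity.
const hugeNumberLiteral = { type: "Literal", value: 1e999 };

// Neither value survives a round trip through JSON.
console.log(JSON.stringify(regexLiteral));      // {"type":"Literal","value":{}}
console.log(JSON.stringify(hugeNumberLiteral)); // {"type":"Literal","value":null}
```

After the round trip, a consumer can no longer tell what kind of literal the node was.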

=== Slide 17 ===

* one thing stands out about ObjectExpression: properties are not Nodes
** makes it difficult to implement visitor

* key is an Identifier (makes sense) or a Literal
** Literal restricted to string- or number-type value
** shows that Literal node was a terrible idea

=== Slide 18 ===

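* an IfStatement without an alternate, nested as the consequent of an IfStatement with an alternate, has defined semantics but cannot be written directly in JS syntax (the `else` would bind to the inner `if`)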
* semantics of a TryStatement without catch/finally are undefined

=== Slide 19 ===

[[ NOTE: read slide aloud ]]

* Declarations are not Statements
* 16 “SpiderMonkey-specific” warnings

=== Slide 20 ===

* SequenceExpression, VariableDeclaration, SwitchStatement

=== Slide 21 ===

* these AST problems are directly derived from problems with the language

=== Slide 22 ===

* two different programs, create same AST, have different behaviour
* lack of a DirectiveStatement node
* for now, parsers treat directives as strings
* we’re working on fixing this one
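
The behavioural difference can be seen without a parser. In this sketch, the first function body starts with a directive and the second with a parenthesised string expression. In the format described here, both opening statements are an ExpressionStatement wrapping a Literal with the value "use strict".

```javascript
// Body starts with a directive, so the function is strict.
const asDirective = new Function('"use strict"; return this === undefined;');

// Body starts with an ordinary expression statement, so the function is sloppy.
const asExpression = new Function('("use strict"); return this === undefined;');

console.log(asDirective());  // true: `this` is undefined in strict mode
console.log(asExpression()); // false: `this` is the global object
```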

=== Slide 23 ===

* SpiderMonkey AST is definitely not perfect -- why would we want to use it?
* Reflect.js introduced about a year after Reflect.parse
* JavaScript parser written in ES3-compatible JavaScript
* made a bit of noise, but nothing came of it

=== Slide 24 ===

* today, Esprima is the most popular JavaScript Reflect.parse implementation
* heavily tested, very true to spec
* even has a harmony branch that follows ES6 development

=== Slide 25 ===

* created a fuzzer for generative testing of Reflect.parse implementations

=== Slide 26 ===

* found 11 bugs in 4 implementations in the first 2 or 3 weeks

=== Slide 27 ===

=== Slide 28 ===

* implements visitor pattern for SpiderMonkey ASTs
* doesn't do much on its own
* useful for building other tools that operate on SpiderMonkey ASTs
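
To make the visitor idea concrete, here is a hand-rolled depth-first walker over the AST for `new C(1 + a)`. It is a sketch of the pattern only, not the implementation or API of the library on this slide, and `walk` is a hypothetical name.

```javascript
// Hand-rolled pre-order walker. Anything that is an object with a string
// `type` is treated as a node; everything else is ignored. Children hidden
// inside untyped objects (such as object-literal properties) are missed.
function walk(node, visit) {
  visit(node);
  for (const key of Object.keys(node)) {
    const value = node[key];
    const candidates = Array.isArray(value) ? value : [value];
    for (const child of candidates) {
      if (child !== null && typeof child === "object" && typeof child.type === "string") {
        walk(child, visit);
      }
    }
  }
}

// AST for `new C(1 + a)`, written out by hand.
const ast = {
  type: "Program",
  body: [{
    type: "ExpressionStatement",
    expression: {
      type: "NewExpression",
      callee: { type: "Identifier", name: "C" },
      arguments: [{
        type: "BinaryExpression",
        operator: "+",
        left: { type: "Literal", value: 1 },
        right: { type: "Identifier", name: "a" }
      }]
    }
  }]
};

const visitedTypes = [];
walk(ast, (node) => visitedTypes.push(node.type));
console.log(visitedTypes.join(" > "));
```

Because every node carries a `type`, the walker needs no per-node knowledge to find children.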

=== Slide 29 ===

* provides constructors for AST nodes

=== Slide 30 ===

* JS code generator
* inverse of esprima parser
* uses estraverse
* configurable formatting: spaces, semicolons, indentation, newlines
* minification preset omits unnecessary syntax
* guarantees that re-parsing the output JS yields the input AST
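
A toy code generator gives a feel for what "inverse of the parser" means. This sketch supports only the node types in the running example and is not how the tool on this slide is implemented. The `generate` function is hypothetical, and it assumes NewExpression callees are plain identifiers.

```javascript
// Wrap nested binary expressions in parentheses so precedence never matters.
function operand(node) {
  const text = generate(node);
  return node.type === "BinaryExpression" ? "(" + text + ")" : text;
}

// Toy generator: throws on any node type it does not know about.
function generate(node) {
  switch (node.type) {
    case "Program":
      return node.body.map(generate).join("\n");
    case "ExpressionStatement":
      return generate(node.expression) + ";";
    case "NewExpression":
      return "new " + generate(node.callee) +
        "(" + node.arguments.map(generate).join(", ") + ")";
    case "BinaryExpression":
      return operand(node.left) + " " + node.operator + " " + operand(node.right);
    case "Identifier":
      return node.name;
    case "Literal":
      // Good enough for numbers, strings, booleans and null only.
      return JSON.stringify(node.value);
    default:
      throw new Error("Unsupported node type: " + node.type);
  }
}

const exampleAst = {
  type: "Program",
  body: [{
    type: "ExpressionStatement",
    expression: {
      type: "NewExpression",
      callee: { type: "Identifier", name: "C" },
      arguments: [{
        type: "BinaryExpression",
        operator: "+",
        left: { type: "Literal", value: 1 },
        right: { type: "Identifier", name: "a" }
      }]
    }
  }]
};

console.log(generate(exampleAst)); // new C(1 + a);
```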

=== Slide 31 ===

* if we generate JS, we lose source information in AST nodes
* can use source information to create source maps for the generated JS
* source maps: bidirectional mapping

=== Slide 32 ===

* allow you to see and debug a compile-to-JS language in the browser

=== Slide 33 ===

* Nick Fitzgerald: summer 2012 internship at Mozilla
* generates source maps (or JS) from annotated CSTs

=== Slide 34 ===

* thought this was really cool; rewrote escodegen to generate CSTs instead of JS
* generate trees instead of concatenating strings
* mark the generated tree with AST location info
* CSTs passed through mozilla/source-map to either generate source map or concat to JS
* now everything that uses escodegen gets source maps for free

=== Slide 35 ===

* if you have a compiled file and a source map, create an annotated AST

=== Slide 36 ===

* similar to escodegen
* attempts to preserve original formatting when rendering an AST

=== Slide 37 ===

* collect AST nodes that match a given selector

=== Slide 38 ===

=== Slide 39 ===

* determines whether a SpiderMonkey format AST represents a valid JS program
* tries to be resistant to malformed ASTs and list all problems with descriptive errors
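
The kind of check involved can be sketched in a few lines. This is a hypothetical `findProblems` function covering four of the "overly permissive" cases from earlier slides. It is not the implementation of the tool on this slide and is far from a complete validator.

```javascript
// Collect every problem instead of stopping at the first one.
function findProblems(root) {
  const problems = [];
  (function check(value) {
    if (Array.isArray(value)) { value.forEach(check); return; }
    if (value === null || typeof value !== "object") return;
    switch (value.type) {
      case "TryStatement":
        if (!value.handler && !value.finalizer) {
          problems.push("TryStatement has neither a handler nor a finalizer");
        }
        break;
      case "SequenceExpression":
        if ((value.expressions || []).length < 2) {
          problems.push("SequenceExpression needs at least 2 expressions");
        }
        break;
      case "VariableDeclaration":
        if ((value.declarations || []).length < 1) {
          problems.push("VariableDeclaration needs at least 1 declarator");
        }
        break;
      case "SwitchStatement":
        if ((value.cases || []).filter((c) => c && c.test === null).length > 1) {
          problems.push("SwitchStatement has more than one default case");
        }
        break;
    }
    // Keep descending so nested problems are reported too.
    for (const key of Object.keys(value)) check(value[key]);
  })(root);
  return problems;
}

const invalidProgram = {
  type: "Program",
  body: [
    { type: "TryStatement", block: { type: "BlockStatement", body: [] },
      handler: null, guardedHandlers: [], finalizer: null },
    { type: "VariableDeclaration", kind: "var", declarations: [] }
  ]
};

console.log(findProblems(invalidProgram));
```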

=== Slide 40 ===

=== Slide 41 ===

* scope analysis
* originally extracted from esmangle project
* detects static/dynamic scopes: global scope, presence of `with`/`eval`
* detects references/declarations

=== Slide 42 ===

* visualisation of escope's scope analysis on my favourite JS program

=== Slide 43 ===

* uses escope to determine how deeply nested a variable is within scope chain
* inspired by Crockford request

=== Slide 44 ===

* can generate visualisations like this one

=== Slide 45 ===

* replaces identifiers and their declarations with shorter names
* meant to be a component of a minifier

=== Slide 46 ===

* generates control flow graph from AST
* keeps track of previous statements, next statement, throw target, true/false targets
* can be used for advanced esmangle transformations

=== Slide 47 ===

* example visualisation of a very small program
* notice that any nonlinear control flow causes branching/joining

=== Slide 48 ===

* web demo available

=== Slide 49 ===

* computes complexity metrics

=== Slide 50 ===

* on a single module, get
** cyclomatic complexity
** source lines of code
** maintainability index
** more

* per function and for whole program
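
As a rough sketch of how one of these metrics falls out of the AST, cyclomatic complexity can be approximated as one plus the number of decision points. The set of node types counted below is an assumption (tools differ on details such as logical operators), and `cyclomaticComplexity` is a hypothetical name.

```javascript
// Node types treated as decision points in this simplified model.
const DECISION_TYPES = new Set([
  "IfStatement", "ConditionalExpression", "ForStatement", "ForInStatement",
  "WhileStatement", "DoWhileStatement", "LogicalExpression", "CatchClause"
]);

function cyclomaticComplexity(root) {
  let decisions = 0;
  (function visit(value) {
    if (Array.isArray(value)) { value.forEach(visit); return; }
    if (value === null || typeof value !== "object") return;
    if (DECISION_TYPES.has(value.type)) decisions += 1;
    // A `case` adds a path; `default` (test === null) does not.
    if (value.type === "SwitchCase" && value.test !== null) decisions += 1;
    for (const key of Object.keys(value)) visit(value[key]);
  })(root);
  return decisions + 1;
}

const id = (name) => ({ type: "Identifier", name });
const call = (name) => ({
  type: "ExpressionStatement",
  expression: { type: "CallExpression", callee: id(name), arguments: [] }
});

// AST for: if (a && b) c(); else d();
const sample = {
  type: "Program",
  body: [{
    type: "IfStatement",
    test: { type: "LogicalExpression", operator: "&&", left: id("a"), right: id("b") },
    consequent: call("c"),
    alternate: call("d")
  }]
};

console.log(cyclomaticComplexity(sample)); // 3
```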

=== Slide 51 ===

* across multiple modules, get coupling and maintainability metrics

=== Slide 52 ===

* Plato visualises these metrics

=== Slide 53 ===

* fully pluggable linter

* alerts about potential bugs

* consistent code style
** not just formatting, structural too

=== Slide 54 ===

* example eslint rule

=== Slide 55 ===

* tracks line, function, and branch coverage
* uses instrumentation

=== Slide 56 ===

* standard LCOV report
* visualised using an LCOV visualiser

=== Slide 57 ===

=== Slide 58 ===

* partially evaluates JS programs
* generates own control flow graph, does own scope analysis
* not very good at either; should use escope/esgraph
* replaces AST nodes that can be statically computed
* unrolls loops

=== Slide 59 ===

* Jez went one step further
* metacircular interpreter

=== Slide 60 ===

* step through evaluation, generate environment state at any point
* still doesn't use escope or estraverse
* so it's not always correct

=== Slide 61 ===

=== Slide 62 ===

* performs tail call elimination
* uses estraverse and escope

=== Slide 63 ===

* transforms to iterative loops

=== Slide 64 ===

* compiles ES6 generators to ES5

=== Slide 65 ===

* generates semantically equivalent, syntactically minimal AST
* uses fixed point evaluation strategy: repeatedly applies a set of rules to an AST (using estraverse) until it reaches a fixed point
* 2 phases: simplification then expansion
* simplification generates smaller AST; expansion generates larger AST
* also does name mangling, but should probably be separated out
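
The fixed point strategy can be sketched with a single rule. The code below assumes one simplification rule, `!!!x` to `!x`, and hand-written helpers (`applyRules`, `toFixedPoint`). The real tool has many rules and uses a proper traversal library.

```javascript
const isNot = (node) =>
  node !== null && typeof node === "object" &&
  node.type === "UnaryExpression" && node.operator === "!";

// One bottom-up pass. Returns a rewritten copy and records whether the
// rule fired. Only values with a string `type` are treated as nodes.
function applyRules(value, state) {
  if (Array.isArray(value)) return value.map((v) => applyRules(v, state));
  if (value === null || typeof value !== "object" || typeof value.type !== "string") {
    return value;
  }
  const copy = {};
  for (const key of Object.keys(value)) copy[key] = applyRules(value[key], state);
  // Rule: !!!x -> !x (safe because `!` always produces a boolean).
  if (isNot(copy) && isNot(copy.argument) && isNot(copy.argument.argument)) {
    state.changed = true;
    return copy.argument.argument;
  }
  return copy;
}

// Re-run the pass until it changes nothing.
function toFixedPoint(ast) {
  let current = ast;
  for (;;) {
    const state = { changed: false };
    current = applyRules(current, state);
    if (!state.changed) return current;
  }
}

const not = (argument) =>
  ({ type: "UnaryExpression", prefix: true, operator: "!", argument });

// AST for f(!!!a)
const before = {
  type: "CallExpression",
  callee: { type: "Identifier", name: "f" },
  arguments: [not(not(not({ type: "Identifier", name: "a" })))]
};

console.log(JSON.stringify(toFixedPoint(before).arguments[0]));
```

The result is the AST for `f(!a)`. Note that `!!a` is left alone, since rewriting it to `a` would change the value of the expression.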

=== Slide 66 ===

* 1st phase reduces AST to simpler AST

=== Slide 67 ===

* 1st phase reduces AST to simpler AST

=== Slide 68 ===

* 2nd phase creates AST that has more compact syntax

=== Slide 69 ===

* 2nd phase creates AST that has more compact syntax
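
A single expansion rule, sketched by hand: rewriting `a.Infinity` to the shorter `a[1/0]`. The `expandInfinityMember` function is hypothetical and handles only this one pattern.

```javascript
// The AST grows (one Identifier becomes a BinaryExpression with two
// Literals), but the generated source shrinks from 10 characters to 6.
function expandInfinityMember(node) {
  const applies =
    node.type === "MemberExpression" &&
    node.computed === false &&
    node.property.type === "Identifier" &&
    node.property.name === "Infinity";
  if (!applies) return node;
  return {
    type: "MemberExpression",
    computed: true,
    object: node.object,
    property: {
      type: "BinaryExpression",
      operator: "/",
      left: { type: "Literal", value: 1, raw: "1" },
      right: { type: "Literal", value: 0, raw: "0" }
    }
  };
}

// AST for a.Infinity
const memberAccess = {
  type: "MemberExpression",
  computed: false,
  object: { type: "Identifier", name: "a" },
  property: { type: "Identifier", name: "Infinity" }
};

console.log(JSON.stringify(expandInfinityMember(memberAccess), null, 2));
```

The rewrite preserves behaviour because `1/0` evaluates to `Infinity`, which is converted to the property key "Infinity".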

=== Slide 70 ===

[[ NOTE: read slide aloud ]]

=== Slide 71 ===

* grepping with esquery style selectors

=== Slide 72 ===

* grep with placeholders

=== Slide 73 ===

* replacement

=== Slide 74 ===

* another 2012 summer internship from Mozilla
* write JS with hygienic macros
* basically modifies token stream before it's sent to the parser
* this is a very difficult problem; much harder than it sounds: no parsing context

=== Slide 75 ===

* simple macro: replace `function` keyword with `def`

=== Slide 76 ===

* more advanced macros
* like CoffeeScript's "class" sugar

=== Slide 77 ===

* browser bundler
* traces CommonJS require calls, builds dependency graph
* combines modules into one file with minimal plumbing needed to mimic CommonJS environment
* bundle built by combining SpiderMonkey ASTs

=== Slide 78 ===

* around 20 lines of overhead
* preserves source location information, so source maps back to original files

=== Slide 79 ===

* compile-to-JS languages taking advantage of these tools
* CoffeeScript, Akira, Roy, Wisp, RumCoke, LLJS all have compilers that use SpiderMonkey AST

=== Slide 80 ===

* In summary:
* use the SpiderMonkey AST
** it's not perfect
** unfortunately, ASTs not guaranteed to represent valid JS
** will be expanded for ES6 and beyond
** the tooling is awesome
** JS tooling is now comparable to that of mature languages

* don't make your own AST format
** you'll probably get it wrong
** you don't want to recreate all these tools

* don't ever manipulate strings of code: EVER
** not in any programming language
** especially not in JavaScript

=== Slide 81 ===

* standard CST allows preservation of original syntactic representation
* standard ASG allows semantic analysis without understanding syntax

=== Slide 82 ===

* If you're into this kind of stuff, check me out on GitHub

Michael Ficarra

July 22, 2014

Transcript

  1. Typical Tokenisation

    new C ( 1 + a )

    Token types: PUNCTUATOR, KEYWORD, IDENTIFIER, NUMERIC
  2. Structured Representation (AST)

    new C(1 + a)

    Program
      body: [
        ExpressionStatement
          expression:
            NewExpression
              callee: Identifier (name: "C")
              arguments: [
                BinaryExpression
                  operator: "+"
                  left: Literal (value: 1)
                  right: Identifier (name: "a")
              ]
      ]
  3. Structured Representation (AST)

    {
      type: "Program",
      body: [{
        type: "ExpressionStatement",
        expression: {
          type: "NewExpression",
          callee: {type: "Identifier", name: "C"},
          arguments: [{
            type: "BinaryExpression",
            operator: "+",
            left: {type: "Literal", value: 1},
            right: {type: "Identifier", name: "a"}
          }]
        }
      }]
    }
  4. Properties of a Good AST Format

    1. each node tagged with its type(s)
    2. nodes have no state or knowledge of context
    3. disallows construction of invalid program
    4. similar syntactic productions are meaningfully grouped
  5. Overly Permissive: Structures

    {
      type: "IfStatement",
      test: (...),
      consequent: {
        type: "IfStatement",
        test: (...),
        consequent: (...),
        alternate: null
      },
      alternate: (...)
    }

    if(test) if(test) a(); else b();

    {
      type: "TryStatement",
      block: (...),
      handler: null,
      guardedHandlers: [],
      finalizer: null
    }

    try { a() }
  6. Overly Permissive: List Properties

    0, // needs to sequence at least 2 expressions

    var // needs at least one declarator

    // cannot contain more than one default
    switch(0) { default: 0 default: 0 }
  7. Overly Permissive: Context

    • iteration context
        while(0) continue;
        while(0) break;
    • switch context
        switch(0) { case 0: break; }
    • function context
        (function(){ return; });
    • label set context
        label: label: 0;
        label: while(0) break label;
        label: while(0) continue label;
        label: (function(){ label: 0; });
    • strict mode context
        function f(){ "use strict"; return {a: 0, a: 1}; };
  8. Simplification Phase

    f(!!!a)    f(!a)

    {
      type: "CallExpression",
      callee: {type: "Identifier", name: "f"},
      arguments: [{
        type: "UnaryExpression",
        prefix: true,
        operator: "!",
        argument: {
          type: "UnaryExpression",
          prefix: true,
          operator: "!",
          argument: {
            type: "UnaryExpression",
            prefix: true,
            operator: "!",
            argument: {type: "Identifier", name: "a"}
          }
        }
      }]
    }
  9. Simplification Phase

    f(!!!a)    f(!a)

    {
      type: "CallExpression",
      callee: {type: "Identifier", name: "f"},
      arguments: [{
        type: "UnaryExpression",
        prefix: true,
        operator: "!",
        argument: {type: "Identifier", name: "a"}
      }]
    }
  10. Expansion Phase

    a.Infinity    a[1/0]

    {
      type: "MemberExpression",
      computed: false,
      object: {type: "Identifier", name: "a"},
      property: {type: "Identifier", name: "Infinity"}
    }
  11. Expansion Phase

    a.Infinity    a[1/0]

    {
      type: "MemberExpression",
      computed: true,
      object: {type: "Identifier", name: "a"},
      property: {
        type: "BinaryExpression",
        operator: "/",
        left: {type: "Literal", value: 1, raw: "1"},
        right: {type: "Literal", value: 0, raw: "0"}
      }
    }