Slide 1

Slide 1 text

@maraujop DSLs Can't parse that!♫ Miguel Araujo

Slide 2

Slide 2 text

Text processing is hard
Shell scripting with pipes is not for the average user
It's a day-to-day problem everywhere
I'm not into Excel

Slide 3

Slide 3 text

What are they?

Slide 4

Slide 4 text

DSLs Examples: SQL
SELECT name FROM Users WHERE age > 28;

Slide 5

Slide 5 text

DSLs Examples: CSS
h1 {
    font-size: 24px;
    font-weight: bold;
}

Slide 6

Slide 6 text

DSLs Examples: Template
A Django template
{{ title }}
{% for name in names %}
    {{ forloop.counter }} hello {{ name }}
{% endfor %}

Slide 7

Slide 7 text

Create a language? but how?

Slide 8

Slide 8 text

Lexical Analysis
streams -> lexer (scanner), using patterns for matching -> tokens (language elements)
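
The flow above can be sketched with nothing but the standard `re` module. This is a minimal illustration, not how any particular lexer generator is implemented; the token names `NUMERAL`, `OPERATOR` and `SKIP` and the `tokenize` helper are assumptions made up for the example.

```python
import re

# Hypothetical token patterns for a toy arithmetic language.
TOKEN_PATTERNS = [
    ("NUMERAL", r"\d+"),
    ("OPERATOR", r"[+\-*/]"),
    ("SKIP", r"\s+"),
]
# One master regex with a named group per token type.
MASTER = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_PATTERNS))


def tokenize(stream):
    """Yield (type, value) pairs; raise on a character no pattern matches."""
    pos = 0
    while pos < len(stream):
        match = MASTER.match(stream, pos)
        if match is None:
            raise SyntaxError("Illegal character %r at %d" % (stream[pos], pos))
        if match.lastgroup != "SKIP":  # whitespace is matched but not emitted
            yield match.lastgroup, match.group()
        pos = match.end()


print(list(tokenize("4 + 4")))
```

The scanner walks the input stream once, and each token carries a type and a value, which is the same pair a parser later consumes.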

Slide 9

Slide 9 text

Python lexer

import tokenize
from io import StringIO

python_line = "for i in [1,2]"
token_gen = tokenize.generate_tokens(
    StringIO(python_line).readline
)

def handle_token(type, token, start, end, line):
    # start and end are (row, column) pairs
    (srow, scol), (erow, ecol) = start, end
    print("%d,%d-%d,%d:\t%s\t%s" % (
        srow, scol, erow, ecol,
        tokenize.tok_name[type], repr(token)
    ))

Slide 10

Slide 10 text

Python lexer

python_line = "for i in [1,2]"

for token in token_gen:
    handle_token(*token)

1,0-1,3:     NAME       'for'
1,4-1,5:     NAME       'i'
1,6-1,8:     NAME       'in'
1,9-1,10:    OP         '['
1,10-1,11:   NUMBER     '1'
1,11-1,12:   OP         ','
1,12-1,13:   NUMBER     '2'
…
2,0-2,0:     ENDMARKER  ''

Slide 11

Slide 11 text

Syntax Analysis
tokens -> parser, using grammar rules -> syntax trees (structure)

Slide 12

Slide 12 text

CPython parser and compile
lexer -> tokens -> parser (Python grammar) -> AST -> compile() -> bytecode
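
The same pipeline can be observed from Python itself with the standard `ast` module and the built-in `compile()`. This is a small sketch; the source string and the `<example>` filename are arbitrary choices for the illustration.

```python
import ast

source = "total = 4 + 4"

# Lexing and parsing: source text -> abstract syntax tree.
tree = ast.parse(source)
print(ast.dump(tree.body[0].value))

# Compiling: AST -> bytecode (a code object).
code = compile(tree, filename="<example>", mode="exec")

# Executing the bytecode in an empty namespace.
namespace = {}
exec(code, namespace)
print(namespace["total"])  # 8
```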

Slide 13

Slide 13 text

DSLs are a little different
lexer -> tokens -> parser (your grammar) -> structured Python code -> executes()

Slide 14

Slide 14 text

Lexers and parsers are difficult to build
lexers: regex
parsers: LR, LALR, SLR

Slide 15

Slide 15 text

Internals
Compilers: Principles, Techniques, and Tools

Slide 16

Slide 16 text

Generator tools
Lex: Lexical Analyzer generator
Yacc: Yet Another Compiler Compiler

Slide 17

Slide 17 text

Generator tools
Flex: A fast scanner generator
Bison: The YACC-compatible Parser Generator

Slide 18

Slide 18 text

PLY (Python Lex Yacc) by David M. Beazley
• Performance
• Extensive error checking
• Useful diagnostics

Slide 19

Slide 19 text

Creating a lexer with PLY

Slide 20

Slide 20 text

TOKENS
token.type: OPERATOR
token.value: +

Slide 21

Slide 21 text

LEXER PATTERNS

t_TOKEN_TYPE = r'regex'

def t_TOKEN_TYPE(token):
    r'regex'
    return token

By default, the token type is the name after t_ and the token value is the matched text.

Slide 22

Slide 22 text

CREATING A LEXER WITH PLY
FIRST: tokens defined by functions, in definition order
THEN: tokens defined by strings, in decreasing regular expression length
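
Why the length ordering matters can be shown with plain `re`. The sketch below only imitates the "longest regular expression first" rule for string-defined tokens; it is not PLY's code, and the token names `ASSIGN` and `EQUALS` and the `build_master` helper are hypothetical.

```python
import re

# Hypothetical string-defined tokens; the names are made up for this sketch.
STRING_TOKENS = {"ASSIGN": r"=", "EQUALS": r"=="}


def build_master(patterns, longest_first=True):
    """Combine the patterns into one alternation regex."""
    items = list(patterns.items())
    if longest_first:
        # Decreasing regular expression length, as the slide describes.
        items.sort(key=lambda item: len(item[1]), reverse=True)
    return re.compile("|".join("(?P<%s>%s)" % item for item in items))


naive = build_master(STRING_TOKENS, longest_first=False)
ordered = build_master(STRING_TOKENS)

print(naive.match("==").lastgroup)    # ASSIGN: the short pattern wins too early
print(ordered.match("==").lastgroup)  # EQUALS
```

Regex alternation picks the first alternative that matches, so trying longer patterns first keeps `==` from being split into two `=` tokens.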

Slide 23

Slide 23 text

Simple Arnold Example

Slide 24

Slide 24 text

ARNOLD LEXER

def t_KILLER_SENTENCE(token):
    r"no\ problemo|it's\ showtime"
    token.type = "OPERATOR"
    return token

Slide 25

Slide 25 text

ARNOLD LEXER
lex.py

from ply import lex

tokens = ['OPERATOR']

t_ignore = '\t\n'

def t_error(t):
    # Report the offending character and skip past it
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

lex.lex()

Slide 26

Slide 26 text

ARNOLD LEXER

from lex import lex

lex.input(text)
while True:
    token = lex.token()
    if token is None:
        break
    print(token)

Slide 27

Slide 27 text

ARNOLD LEXER

text = """
no problemo
it's showtime
no problemo
"""

LexToken(OPERATOR,'no problemo',1,0)
LexToken(OPERATOR,"it's showtime",1,12)
LexToken(OPERATOR,'no problemo',1,26)

Slide 28

Slide 28 text

ARNOLD LEXER

text = """
consider that a divorce
no problemo
it's showtime
no problemo
"""

Illegal character 'c'
Illegal character 'o'
Illegal character 'n'
Illegal character 's'
Illegal character 'i'
…

Slide 29

Slide 29 text

Creating a parser with PLY

Slide 30

Slide 30 text

GRAMMAR TREE
expression
    expression: NUMERAL 4
    OPERATOR +
    expression: NUMERAL 4

Slide 31

Slide 31 text

ARNOLD GRAMMAR
BNF (Backus-Naur Form)

statement :
          | statement OPERATOR

Slide 32

Slide 32 text

ARNOLD GRAMMAR IN PLY

def p_statement(p):
    """statement :
                 | statement OPERATOR"""
    if len(p) == 1:
        p[0] = []
    else:
        p[0] = p[1] + [Sentence(p[2])]

p is a YaccProduction: p[1] is the statement, p[2] is the OPERATOR

Slide 33

Slide 33 text

ARNOLD GRAMMAR IN PLY

def p_statement(p):
    """statement :
                 | statement OPERATOR"""
    if len(p) == 1:
        p[0] = []
    else:
        p[0] = p[1] + [Sentence(p[2])]

Slide 34

Slide 34 text

ARNOLD GRAMMAR IN PLY

def p_statement(p):
    """statement :
                 | statement OPERATOR"""
    if len(p) == 1:
        p[0] = []
    else:
        p[0] = p[1] + [Sentence(p[2])]

statement OPERATOR [ ]
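
How the left-recursive rule accumulates a list can be traced without a parser generator. This is only a sketch of the effect of the two alternatives: the `Sentence` class is a hypothetical stand-in for the one used on the slides, and `reduce_statement` is a made-up helper that replays the reductions by hand.

```python
class Sentence(object):
    """Hypothetical stand-in for the Sentence node used on the slides."""

    def __init__(self, text):
        self.text = text

    def __repr__(self):
        return "Sentence(%r)" % self.text


def reduce_statement(operators):
    """Apply `statement : | statement OPERATOR` once per OPERATOR token."""
    statement = []  # empty alternative: p[0] = []
    for operator in operators:
        # recursive alternative: p[0] = p[1] + [Sentence(p[2])]
        statement = statement + [Sentence(operator)]
    return statement


print(reduce_statement(["no problemo", "it's showtime"]))
```

Each OPERATOR token triggers one reduction, so the list grows from the empty alternative one `Sentence` at a time, in input order.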

Slide 35

Slide 35 text

PLY GRAMMAR ERROR HANDLING

import sys

def p_error(p):
    if p is None:
        print("""
        Syntax error most likely at the end of the text
        """)
    else:
        print("Syntax error!! at %s" % p)
        print("Stopped at %s" % p.lexer.token())
    sys.exit(1)

Slide 36

Slide 36 text

PLY PARSER

from yacc import yacc

text = """
consider that a divorce
no problemo
it's showtime
"""

tree = yacc.parse(text)

> [, , ]

Slide 37

Slide 37 text

TIPS
You won't get it right the first time
As always, practice makes perfect
Read other grammars beforehand
Think and plan ahead

Slide 38

Slide 38 text

TEXTUS

Slide 39

Slide 39 text

TEXTUS
Text processing is hard
Shell scripting with pipes is not for the average user
It's a day-to-day problem everywhere
I'm not into Excel

Slide 40

Slide 40 text

TEXTUS

every word like "precio" ; count ;

every row from "[" to "]" ; match $name":"$email ; print "upper(#{name}) -> #{email}"

every word ; match {matricula} ; find repeated ;

Slide 41

Slide 41 text

TEXTUS
elements: word, column, email, number, paragraph, row
selectors: first, every, last, odd, even
free text modifiers: like "text", length, like regex, date / phone / …

Slide 42

Slide 42 text

TEXTUS
structured modifiers, separated by ";"
indicators: 1, 1-6, 1,6, 1,-2, >6
operators: count, average, unique, find unique, find repeated, max, min, sum, count_if, group by, group by day, group by month, group by year, …

Slide 43

Slide 43 text

TEXTUS
conditions: where, with != = in "not in" > < >= <=
context changes: without headers, save as $var_name

Slide 44

Slide 44 text

TEXTUS statement selectorStatement selector element selector block_element condition rangeSelector FIRST EVERY selector element free_text_modifier

Slide 45

Slide 45 text

TEXTUS input selectorStatement element block_element condition rangeSelector element free_text_modifier SELECTOR SELECTOR SELECTOR 45 rules -> 43 rules

Slide 46

Slide 46 text

Bad rule example

condstmt : WHERE column_selector EQUALS PATTERN
         | WHERE column_selector NOT_EQUALS PATTERN
         | WHERE column_selector NOT_IN VARIABLE
         | WHERE column_selector IN VARIABLE
         | WHERE range_selector EQUALS PATTERN
         | WHERE range_selector NOT_EQUALS PATTERN
         | WHERE range_selector NOT_IN VARIABLE
         | WHERE range_selector IN VARIABLE

Slide 47

Slide 47 text

Simplify grammar rules

condstmt : WHERE column_selector condition
         | WHERE range_selector condition

condition : EQUALS PATTERN
          | NOT_EQUALS PATTERN
          | NOT_IN VARIABLE
          | IN VARIABLE

Slide 48

Slide 48 text

condition rules

def p_condition(p):
    """
    condition : EQUALS PATTERN
              | NOT_EQUALS PATTERN
              | NOT_IN VARIABLE
              | IN VARIABLE
    """
    p[0] = [p[1], p[2]]
    # or, as a dict:
    # p[0] = {'operator': p[1], 'value': p[2]}

Slide 49

Slide 49 text

condition rules

def p_condition_statement(p):
    """
    condition_statement : WHERE column_selector condition
    """
    p[0] = Condition(
        column_selector=p[2],
        operator=p[3][0],
        pattern=p[3][1],
    )

p[3] holds the condition result, p[0] = [p[1], p[2]], which becomes a Condition!
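
The hand-off between the two rules can be exercised without PLY by passing plain lists in place of its production objects. This is a sketch under assumptions: the `Condition` class below is a hypothetical stand-in (the real Textus class may differ), and the sample values `"where"`, `"state"`, `"="` and `"void"` are invented for the demonstration.

```python
class Condition(object):
    """Hypothetical node; the real Textus class may differ."""

    def __init__(self, column_selector, operator, pattern):
        self.column_selector = column_selector
        self.operator = operator
        self.pattern = pattern


def p_condition(p):
    """condition : EQUALS PATTERN
                 | NOT_EQUALS PATTERN
                 | NOT_IN VARIABLE
                 | IN VARIABLE"""
    p[0] = [p[1], p[2]]


def p_condition_statement(p):
    """condition_statement : WHERE column_selector condition"""
    p[0] = Condition(column_selector=p[2], operator=p[3][0], pattern=p[3][1])


# Plain lists stand in for the parser's production objects: slot 0 is the result.
inner = [None, "=", "void"]
p_condition(inner)
outer = [None, "where", "state", inner[0]]
p_condition_statement(outer)
print(outer[0].operator, outer[0].pattern)
```

Because every `condition` alternative has the same two-symbol shape, one function body covers all four operators, and the enclosing rule never needs to know which one matched.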

Slide 50

Slide 50 text

Parser tree?

without headers ; column 2 ; every email ; count ;

[, , , ]

Condition -> process(text, context)

Slide 51

Slide 51 text

from collections import defaultdict

class OperatorStatement(object):
    def __init__(self, *args, **kwargs):
        self.operation = args[0]

    def process(self, text, context):
        # Raw text means no selector has run yet
        if isinstance(text, str):
            raise ParseException(
                "Operation needs to go after selectors"
            )
        # "find repeated" dispatches to self.find_repeated
        return getattr(self, self.operation.replace(' ', '_'))(text)

    def count(self, text):
        if isinstance(text, list):
            return len(text)
        elif isinstance(text, dict):
            results = defaultdict(int)
            # Count the values under each column
            for col_name, col_values in text.items():
                results[col_name] += len(col_values)
            return results
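
The `getattr` dispatch is the part worth isolating: an operator name written with spaces in the DSL is mapped to a method name with underscores. The class below is a reduced, hypothetical sketch of that idea, not the Textus implementation; `Operator` and its `find_repeated` body are assumptions written for this example.

```python
class Operator(object):
    """Hypothetical, reduced version of an operator statement."""

    def __init__(self, operation):
        self.operation = operation

    def process(self, values):
        # "find repeated" -> self.find_repeated
        method = getattr(self, self.operation.replace(" ", "_"))
        return method(values)

    def count(self, values):
        return len(values)

    def find_repeated(self, values):
        seen, repeated = set(), []
        for value in values:
            if value in seen and value not in repeated:
                repeated.append(value)
            seen.add(value)
        return repeated


barcodes = ["123", "124", "125", "123"]
print(Operator("count").process(barcodes))          # 4
print(Operator("find repeated").process(barcodes))  # ['123']
```

Adding a new operator then only requires adding a method, with no change to the dispatch code.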

Slide 52

Slide 52 text

Interpreter

tree = yacc.parse(rule)
context = Context()
for processor in tree:
    text = processor.process(text, context)
return text
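
The interpreter loop is a simple pipeline: each node's output becomes the next node's input. The sketch below builds the tree by hand instead of calling a parser; `Context`, `EveryWord`, `Count` and `interpret` are hypothetical names chosen for the illustration.

```python
class Context(object):
    """Hypothetical shared state, e.g. variables stored by `save as`."""

    def __init__(self):
        self.variables = {}


class EveryWord(object):
    def process(self, text, context):
        return text.split()


class Count(object):
    def process(self, text, context):
        return len(text)


def interpret(tree, text):
    """Feed the output of each processor into the next one."""
    context = Context()
    for processor in tree:
        text = processor.process(text, context)
    return text


# A hand-built tree standing in for the parse of "every word ; count ;".
print(interpret([EveryWord(), Count()], "no problemo it's showtime"))  # 4
```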

Slide 53

Slide 53 text

TEXTUS

Slide 54

Slide 54 text

TEXTUS Testing

def test_csv_column_every_email_count(self):
    text = """
        Sara;eASYDK;[email protected]
        James;oASKDK;[email protected]
        Jane;oASWDK;wrongemail@;
    """
    rules = """without headers ; column 3 ; every email ; count ;"""
    result = process_text(
        textwrap.dedent(text), rules
    )
    assert result == 2

Slide 55

Slide 55 text

Solving a real CSV problem

Input:
name;barcode;state;
Miguel;123;valid;
Raul;124;valid;
Maria;125;valid;
Miguel;123;void;

Output:
name;barcode;state;
Raul;124;valid;
Maria;125;valid;

Slide 56

Slide 56 text

TEXTUS

name;barcode;state;
Miguel;123;valid;
Raul;124;valid;
Maria;125;valid;
Miguel;123;void;

every row where column "state" = "void" ; column "barcode" ; save as $invalid_barcodes ;
every row where column "barcode" not in $invalid_barcodes ;
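
For comparison, here is roughly what those two rules do, written in plain Python with the standard `csv` module. This is a sketch of the intended result, not of how Textus evaluates the rules; `drop_voided` is a hypothetical helper.

```python
import csv
import io

DATA = """\
name;barcode;state;
Miguel;123;valid;
Raul;124;valid;
Maria;125;valid;
Miguel;123;void;
"""


def drop_voided(text):
    """Remove every row whose barcode appears in a row with state 'void'."""
    rows = list(csv.reader(io.StringIO(text), delimiter=";"))
    header, body = rows[0], rows[1:]
    barcode, state = header.index("barcode"), header.index("state")
    # every row where column "state" = "void" ; column "barcode" ; save as ...
    invalid_barcodes = {row[barcode] for row in body if row[state] == "void"}
    # every row where column "barcode" not in $invalid_barcodes
    # (row[:3] drops the empty field created by the trailing ";")
    return [row[:3] for row in body if row[barcode] not in invalid_barcodes]


print(drop_voided(DATA))
```

The Python version needs index lookups, a set and a filter, which the DSL expresses in two short rules.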

Slide 57

Slide 57 text

TEXTUS

name;barcode;state;
Miguel;123;valid;
Raul;124;valid;
Maria;125;valid;
Miguel;123;void;

column "barcode" ; find repeated ; save as $invalid_barcodes ;
every row where column "barcode" not in $invalid_barcodes ;

Slide 58

Slide 58 text

TEXTUS

name;barcode;state;
Miguel;123;valid;
Raul;124;valid;
Maria;125;valid;
Miguel;123;void;

column "barcode" ; find unique ; save as $valid_barcodes ;
every row where column "barcode" in $valid_barcodes ;

Slide 59

Slide 59 text

TEXTUS

Slide 60

Slide 60 text

Thanks! @maraujop