@maraujop DSLs Can't parse that!♫ Miguel Araujo

Text processing is hard. Shell scripting with pipes is not for the average user. It’s a day-to-day problem everywhere. I’m not into Excel.

What are they?

DSLs Examples SQL SELECT name FROM Users WHERE age > 28;

DSLs Examples CSS h1 { font-size: 24px; font-weight: bold; }

DSLs Examples Template A Django template

    {{ title }}
    {% for name in names %}
      {{ forloop.counter }} hello {{ name }}
    {% endfor %}

Create a language? but how?

Lexical Analysis: streams + patterns for matching -> lexer (scanner) -> tokens (language elements)
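To make the idea concrete, here is a minimal sketch of a regex-driven lexer using only the standard `re` module. The token names and patterns (`NUMBER`, `NAME`, `OP`, `SKIP`) and the function `tokenize_text` are assumptions made up for this illustration; they do not come from any particular tool.

```python
import re

# Hypothetical token patterns for illustration; order matters because
# the first alternative that matches wins.
TOKEN_PATTERNS = [
    ("NUMBER", r"\d+"),
    ("NAME", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/=<>]"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_PATTERNS))


def tokenize_text(text):
    """Yield (type, value) pairs for each token found in text."""
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError("Illegal character %r at %d" % (text[pos], pos))
        pos = match.end()
        if match.lastgroup != "SKIP":  # whitespace is matched but not emitted
            yield match.lastgroup, match.group()


tokens = list(tokenize_text("age > 28"))
print(tokens)  # [('NAME', 'age'), ('OP', '>'), ('NUMBER', '28')]
```

The stream is consumed left to right, and each match becomes a token carrying a type and a value.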

Python lexer

    import tokenize
    from io import StringIO

    python_line = "for i in [1,2]"
    token_gen = tokenize.generate_tokens(
        StringIO(python_line).readline
    )

    def handle_token(type, token, start, end, line):
        # start and end are (row, column) pairs
        (srow, scol), (erow, ecol) = start, end
        print("%d,%d-%d,%d:\t%s\t%s" % (
            srow, scol, erow, ecol,
            tokenize.tok_name[type], repr(token)
        ))

Python lexer

    python_line = "for i in [1,2]"
    for token in token_gen:
        handle_token(*token)

    1,0-1,3:     NAME       'for'
    1,4-1,5:     NAME       'i'
    1,6-1,8:     NAME       'in'
    1,9-1,10:    OP         '['
    1,10-1,11:   NUMBER     '1'
    1,11-1,12:   OP         ','
    1,12-1,13:   NUMBER     '2'
    …
    2,0-2,0:     ENDMARKER  ''

Syntax Analysis: tokens + grammar rules -> parser -> syntax trees (structure)

CPython parser and compile: lexer -> tokens -> parser (Python grammar) -> AST -> compile() -> bytecode
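The same pipeline can be walked by hand from Python itself. This is a small sketch using the standard `ast` module and the built-in `compile()`; the source string and the `namespace` dictionary are just example values.

```python
import ast

source = "total = 4 + 4"

# Lexing and parsing happen inside ast.parse, which returns the AST.
tree = ast.parse(source)

# compile() turns the AST into a code object holding the bytecode.
code = compile(tree, "<example>", "exec")

# Running the bytecode fills the namespace with the assigned name.
namespace = {}
exec(code, namespace)

print(type(tree).__name__)  # Module
print(namespace["total"])   # 8
```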

DSLs are a little different: lexer -> tokens -> parser (your grammar) -> structured Python code -> executes()

Lexers and parsers are difficult to build. Lexers: regex. Parsers: LR, LALR, SLR.

Internals: "Compilers: Principles, Techniques, and Tools"

Generator tools. Lex: Lexical Analyzer generator. Yacc: Yet Another Compiler Compiler.

Generator tools. Flex: a fast scanner generator. Bison: the Yacc-compatible parser generator.

PLY (Python Lex-Yacc) by David M. Beazley • Performance • Extensive error checking • Useful diagnostics

Creating a lexer with PLY

TOKENS token.type = OPERATOR, token.value = +

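LEXER PATTERNS

    # As a string: the name after t_ is the default token type
    t_TOKEN_TYPE = r'regex'

    # As a function: the docstring is the regex,
    # and the matched text arrives as the token value
    def t_TOKEN_TYPE(token):
        r'regex'
        return token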

CREATING A LEXER WITH PLY Rule matching order: FIRST, functions, in the order they are defined. THEN, tokens defined by strings, sorted by decreasing regular expression length.
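Why sort string patterns by decreasing length? A rough illustration with the standard `re` module follows. This is not PLY's code, only a sketch of the idea; the rule names (`EQUALS`, `GTE`, ...) and the helpers `build_master` and `first_token` are hypothetical.

```python
import re

# Hypothetical string-defined rules as (token type, regex) pairs.
string_rules = [
    ("EQUALS", r"="),
    ("NOT_EQUALS", r"!="),
    ("GTE", r">="),
    ("GT", r">"),
]


def build_master(rules):
    """Join the rules into one alternation; earlier alternatives win."""
    return re.compile("|".join("(?P<%s>%s)" % rule for rule in rules))


def first_token(rules, text):
    """Return (type, value) of the token found at the start of text."""
    match = build_master(rules).match(text)
    return match.lastgroup, match.group()


# Short pattern first: ">" matches and ">=" never gets a chance.
naive = first_token([("GT", r">"), ("GTE", r">=")], ">= 6")

# Longest regex first: ">=" is tried before ">".
by_length = sorted(string_rules, key=lambda rule: len(rule[1]), reverse=True)
ordered = first_token(by_length, ">= 6")

print(naive)    # ('GT', '>')
print(ordered)  # ('GTE', '>=')
```

When the order of function rules matters, it is controlled simply by the order in which they are written.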

Simple Arnold Example

ARNOLD LEXER

    def t_KILLER_SENTENCE(token):
        r"no\ problemo|it's\ showtime"
        token.type = "OPERATOR"
        return token

ARNOLD LEXER

    from ply import lex

    tokens = ['OPERATOR']
    t_ignore = '\t\n'

    def t_error(t):
        print("Illegal character '%s'" % t.value[0])
        t.lexer.skip(1)

    lex.lex()

ARNOLD LEXER

    from lex import lex

    lex.input(text)
    while True:
        token = lex.token()
        if token is None:
            break
        print(token)

ARNOLD LEXER

    text = """ no problemo it's showtime no problemo """

    LexToken(OPERATOR,'no problemo',1,0)
    LexToken(OPERATOR,"it's showtime",1,12)
    LexToken(OPERATOR,'no problemo',1,26)

ARNOLD LEXER

    text = """ consider that a divorce no problemo it's showtime no problemo """

    Illegal character 'c'
    Illegal character 'o'
    Illegal character 'n'
    Illegal character 's'
    Illegal character 'i'
    …

Creating a parser with PLY

GRAMMAR TREE for 4 + 4: expression -> expression OPERATOR expression, where each inner expression -> NUMERAL (4) and the OPERATOR is +
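One possible way to picture that tree in code is as nested tuples. The tuple layout and the `evaluate` helper below are assumptions made for this sketch, not something a parser generator produces by itself.

```python
# A syntax tree for "4 + 4" as nested tuples: (rule name, children...).
tree = (
    "expression",
    ("expression", ("NUMERAL", 4)),
    ("OPERATOR", "+"),
    ("expression", ("NUMERAL", 4)),
)


def evaluate(node):
    """Walk the tree bottom-up and compute its value."""
    kind = node[0]
    if kind == "NUMERAL":
        return node[1]
    if kind == "expression" and len(node) == 2:
        # expression -> NUMERAL
        return evaluate(node[1])
    # expression -> expression OPERATOR expression
    left, operator, right = node[1], node[2], node[3]
    if operator[1] != "+":
        raise ValueError("unsupported operator %r" % operator[1])
    return evaluate(left) + evaluate(right)


result = evaluate(tree)
print(result)  # 8
```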

ARNOLD GRAMMAR in BNF (Backus-Naur Form)

    statement :
              | statement OPERATOR

ARNOLD GRAMMAR IN PLY

    def p_statement(p):
        """statement :
                     | statement OPERATOR"""
        if len(p) == 1:
            p[0] = []
        else:
            p[0] = p[1] + [Sentence(p[2])]

p is a YaccProduction; p[1] and p[2] are the values of the symbols on the right-hand side.


ARNOLD GRAMMAR IN PLY

    def p_statement(p):
        """statement :
                     | statement OPERATOR"""
        if len(p) == 1:
            p[0] = []                        # empty alternative: [ ]
        else:
            p[0] = p[1] + [Sentence(p[2])]   # p[1] = statement, p[2] = OPERATOR
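How the left-recursive rule grows the list can be traced without PLY. In this sketch a plain Python list stands in for the YaccProduction, `Sentence` is a hypothetical stand-in for the class used on the slide, and `parse_operators` replays the reductions by hand.

```python
class Sentence(object):
    """Hypothetical stand-in for the Sentence class used on the slides."""

    def __init__(self, text):
        self.text = text

    def __repr__(self):
        return "Sentence(%r)" % self.text


def reduce_statement(p):
    """Same body as p_statement; p is a plain list instead of a YaccProduction."""
    if len(p) == 1:
        p[0] = []                        # statement : <empty>
    else:
        p[0] = p[1] + [Sentence(p[2])]   # statement : statement OPERATOR
    return p[0]


def parse_operators(values):
    """Replay the reductions: the empty alternative first, then one per OPERATOR."""
    statement = reduce_statement([None])
    for value in values:
        statement = reduce_statement([None, statement, value])
    return statement


tree = parse_operators(["no problemo", "it's showtime"])
print(tree)  # [Sentence('no problemo'), Sentence("it's showtime")]
```

Each reduction receives the list built so far in p[1] and appends one more element, so the result keeps the input order.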

PLY GRAMMAR ERROR HANDLING

    import sys

    def p_error(p):
        if p is None:
            print("Syntax error most likely at the end of the text")
        else:
            print("Syntax error!! at %s" % p)
            print("Stopped at %s" % p.lexer.token())
        sys.exit(1)

PLY PARSER

    from yacc import yacc

    text = """ consider that a divorce no problemo it's showtime """
    tree = yacc.parse(text)

    > [<Sentence …>, <Sentence …>, <Sentence …>]

TIPS • You won’t get it right the first time • As always, practice makes perfect • Read other grammars beforehand • Think and plan ahead

TEXTUS Text processing is hard. Shell scripting with pipes is not for the average user. It’s a day-to-day problem everywhere. I’m not into Excel.

TEXTUS

    every word like "precio" ; count ;
    every row from "[" to "]" ; match $name":"$email ; print "upper(#{name}) -> #{email}"
    every word ; match {matricula} ; find repeated ;

TEXTUS elements: word, column, email, number, paragraph, row. Selectors: first, every, last, odd, even. Free text modifiers: like "text", length, like regex, date / phone / …

TEXTUS structured: modifiers separated by ";". Indicators: 1 | 1-6 | 1,6 | 1,-2 | >6. Operators: count, average, unique, find unique, find repeated, group by, group by day, max, min, sum, count_if, group by month, group by year, …

TEXTUS conditions: where with != = in not in > < >= <=. Context changes: without headers, save as $var_name.

TEXTUS statement selectorStatement selector element selector block_element condition rangeSelector FIRST EVERY selector element free_text_modifier

TEXTUS input selectorStatement element block_element condition rangeSelector element free_text_modifier SELECTOR SELECTOR SELECTOR 45 rules -> 43 rules

Bad rule example

    condstmt : WHERE column_selector EQUALS PATTERN
             | WHERE column_selector NOT_EQUALS PATTERN
             | WHERE column_selector NOT_IN VARIABLE
             | WHERE column_selector IN VARIABLE
             | WHERE range_selector EQUALS PATTERN
             | WHERE range_selector NOT_EQUALS PATTERN
             | WHERE range_selector NOT_IN VARIABLE
             | WHERE range_selector IN VARIABLE

Simplify grammar rules

    condstmt : WHERE column_selector condition
             | WHERE range_selector condition

    condition : EQUALS PATTERN
              | NOT_EQUALS PATTERN
              | NOT_IN VARIABLE
              | IN VARIABLE

condition rules

    def p_condition(p):
        """
        condition : EQUALS PATTERN
                  | NOT_EQUALS PATTERN
                  | NOT_IN VARIABLE
                  | IN VARIABLE
        """
        p[0] = [p[1], p[2]]
        # alternative: p[0] = {'operator': p[1], 'value': p[2]}

condition rules

    def p_condition_statement(p):
        """
        condition_statement : WHERE column_selector condition
        """
        # p[3] is the [p[1], p[2]] list built by p_condition;
        # here it becomes a Condition!
        p[0] = Condition(
            column_selector=p[2],
            operator=p[3][0],
            pattern=p[3][1],
        )

Parser tree? without headers ; column 2 ; every email ; count ; -> [<…>, <…>, <…>, <…>]. Each element, such as a Condition, offers process(text, context).

    from collections import defaultdict

    class OperatorStatement(object):
        def __init__(self, *args, **kwargs):
            self.operation = args[0]

        def process(self, text, context):
            # Operators work on selected elements, not on the raw input string
            if isinstance(text, str):
                raise ParseException(
                    "Operation needs to go after selectors"
                )
            # Dispatch by name: "find repeated" -> self.find_repeated
            return getattr(self, self.operation.replace(' ', '_'))(text)

        def count(self, text):
            if isinstance(text, list):
                return len(text)
            elif isinstance(text, dict):
                results = defaultdict(int)
                for col_name, col_values in text.items():
                    results[col_name] += 1
                return results

Interpreter

    def process_text(text, rule):
        tree = yacc.parse(rule)
        context = Context()
        for processor in tree:
            text = processor.process(text, context)
        return text
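The interpreter loop is just a pipeline: each processor receives the previous result and a shared context. A self-contained sketch, with a hand-built tree in place of the parser: `Context`, `EveryWord`, `Like`, `Count` and `interpret` are hypothetical names invented for this example and are not the real TEXTUS classes.

```python
class Context(object):
    """Hypothetical shared state, e.g. variables stored by 'save as'."""

    def __init__(self):
        self.variables = {}


class EveryWord(object):
    def process(self, text, context):
        return text.split()


class Like(object):
    def __init__(self, fragment):
        self.fragment = fragment

    def process(self, text, context):
        return [word for word in text if self.fragment in word]


class Count(object):
    def process(self, text, context):
        return len(text)


def interpret(tree, text):
    """Feed the output of each processor into the next one."""
    context = Context()
    for processor in tree:
        text = processor.process(text, context)
    return text


# Hand-built stand-in for a parsed rule such as: every word like "precio" ; count ;
tree = [EveryWord(), Like("precio"), Count()]
result = interpret(tree, "precio alto, precio bajo, coste medio")
print(result)  # 2
```

Because every node exposes the same process(text, context) method, adding a new operator only means adding a new class.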

TEXTUS Testing

    def test_csv_column_every_email_count(self):
        text = """
            Sara;eASYDK;[email protected]
            James;oASKDK;[email protected]
            Jane;oASWDK;wrongemail@;
        """
        rules = """without headers ; column 3 ; every email ; count ;"""
        result = process_text(
            textwrap.dedent(text), rules
        )
        assert result == 2

Solving a real CSV problem

    name;barcode;state;
    Miguel;123;valid;
    Raul;124;valid;
    Maria;125;valid;
    Miguel;123;void;

    ->

    name;barcode;state;
    Raul;124;valid;
    Maria;125;valid;

TEXTUS

    name;barcode;state;
    Miguel;123;valid;
    Raul;124;valid;
    Maria;125;valid;
    Miguel;123;void;

    every row where column "state" = "void" ; column "barcode" ; save as $invalid_barcodes ;
    every row where column "barcode" not in $invalid_barcodes ;
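For comparison, roughly the same two steps written in plain Python. This is only a sketch of what the rules express, not how TEXTUS implements them; the variable names are chosen to mirror the rule text.

```python
text = """name;barcode;state;
Miguel;123;valid;
Raul;124;valid;
Maria;125;valid;
Miguel;123;void;
"""

lines = text.splitlines()
header = lines[0].rstrip(";").split(";")
rows = [dict(zip(header, line.rstrip(";").split(";"))) for line in lines[1:]]

# every row where column "state" = "void" ; column "barcode" ; save as $invalid_barcodes ;
invalid_barcodes = {row["barcode"] for row in rows if row["state"] == "void"}

# every row where column "barcode" not in $invalid_barcodes ;
kept = [row for row in rows if row["barcode"] not in invalid_barcodes]

# Rebuild the CSV lines, keeping the trailing ";" used in the input.
output = [lines[0]] + [
    ";".join(row[name] for name in header) + ";" for row in kept
]
print("\n".join(output))
```

Both rows with barcode 123 disappear, which leaves only the Raul and Maria rows under the header.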

TEXTUS

    name;barcode;state;
    Miguel;123;valid;
    Raul;124;valid;
    Maria;125;valid;
    Miguel;123;void;

    column "barcode" ; find repeated ; save as $invalid_barcodes ;
    every row where column "barcode" not in $invalid_barcodes ;

TEXTUS

    name;barcode;state;
    Miguel;123;valid;
    Raul;124;valid;
    Maria;125;valid;
    Miguel;123;void;

    column "barcode" ; find unique ; save as $valid_barcodes ;
    every row where column "barcode" in $valid_barcodes ;

Thanks! @maraujop