DSLs can't parse that! ♫

This talk briefly introduces what a DSL is, the types of DSL, and when they are useful, with a few examples.

It then presents PLY for building our own lexer and parser in Python, starting with a simple Arnold language.

Finally, once we are familiar with PLY, it presents textus, a DSL for text processing: how to parse it, how to interpret it, and how it solves a problem in a different way from the one we are used to, giving a new perspective.

Miguel Araujo

November 21, 2015

Transcript

  1. 2.

    Text processing is hard
    Shell scripting with pipes is not for the average user
    It's a day-to-day problem everywhere
    I'm not into Excel
  2. 6.

    DSLs Examples
    Template
    A Django template

    {{ title }}
    {% for name in names %}
      {{ forloop.counter }} hello {{ name }}
    {% endfor %}
  3. 9.

    Python lexer

    import tokenize
    from io import StringIO

    python_line = "for i in [1,2]"
    token_gen = tokenize.generate_tokens(
        StringIO(python_line).readline
    )

    def handle_token(tok_type, token, start, end, line):
        # start and end are (row, column) pairs
        (srow, scol), (erow, ecol) = start, end
        print("%d,%d-%d,%d:\t%s\t%s" % (
            srow, scol, erow, ecol,
            tokenize.tok_name[tok_type], repr(token)
        ))
  4. 10.

    Python lexer

    python_line = "for i in [1,2]"

    for token in token_gen:
        # each token is a 5-tuple, so unpack it into the handler's arguments
        handle_token(*token)

    1,0-1,3:    NAME       'for'
    1,4-1,5:    NAME       'i'
    1,6-1,8:    NAME       'in'
    1,9-1,10:   OP         '['
    1,10-1,11:  NUMBER     '1'
    1,11-1,12:  OP         ','
    1,12-1,13:  NUMBER     '2'
    …
    2,0-2,0:    ENDMARKER  ''
  5. 18.

    PLY
    PLY (Python Lex Yacc) by David M. Beazley
    • Performance
    • Extensive error checking
    • Useful diagnostics
  6. 22.

    CREATING A LEXER WITH PLY
    FIRST: tokens defined by functions, in the order they are defined
    THEN: tokens defined by strings, in decreasing regular expression length
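
The ordering above matters whenever two rules can match at the same position. The following is a minimal sketch that uses only the standard `re` module, not PLY itself: `build_master_regex`, `tokenize`, and the sample rules are hypothetical names that merely imitate the ordering described on the slide (function rules first, then string rules sorted by decreasing pattern length).

```python
import re


def build_master_regex(function_rules, string_rules):
    """Combine rules into one alternation, imitating the order on the slide.

    function_rules: list of (name, pattern) pairs in definition order.
    string_rules: dict of name -> pattern (its own order is irrelevant).
    """
    ordered = list(function_rules)
    # String rules come after function rules, longest pattern first.
    ordered += sorted(
        string_rules.items(), key=lambda item: len(item[1]), reverse=True
    )
    return re.compile(
        "|".join("(?P<%s>%s)" % (name, pattern) for name, pattern in ordered)
    )


def tokenize(text, master):
    """Return (token_type, value) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        match = master.match(text, pos)
        if match is None:
            raise ValueError("Illegal character %r" % text[pos])
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


# Hypothetical rules: '=' is declared before '==', but '==' is tried first
# because its pattern is longer.
function_rules = [("NUMBER", r"\d+")]
string_rules = {"ASSIGN": r"=", "EQUALS": r"=="}

master = build_master_regex(function_rules, string_rules)
print(tokenize("1 == 2", master))
```

Without the length-based sort, `==` would be read as two `ASSIGN` tokens, because the alternation picks the first alternative that matches.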
  7. 25.

    ARNOLD LEXER
    lex.py

    from ply import lex

    tokens = ['OPERATOR']

    t_ignore = '\t\n'

    def t_error(t):
        print("Illegal character '%s'" % t.value[0])
        t.lexer.skip(1)

    lex.lex()
  8. 26.

    ARNOLD LEXER

    from lex import lex

    lex.input(text)
    while True:
        token = lex.token()
        if token is None:
            break
        print(token)
  9. 28.

    ARNOLD LEXER

    text = """
    consider that a divorce
    no problemo
    it's showtime
    no problemo
    """

    Illegal character 'c'
    Illegal character 'o'
    Illegal character 'n'
    Illegal character 's'
    Illegal character 'i'
    …
  10. 32.

    ARNOLD GRAMMAR IN PLY

    def p_statement(p):
        """statement :
                     | statement OPERATOR"""
        if len(p) == 1:
            p[0] = []
        else:
            p[0] = p[1] + [Sentence(p[2])]

    Callouts: YaccProduction (p), p[1], p[2]
  11. 33.

    ARNOLD GRAMMAR IN PLY

    def p_statement(p):
        """statement :
                     | statement OPERATOR"""
        if len(p) == 1:
            p[0] = []
        else:
            p[0] = p[1] + [Sentence(p[2])]
  12. 34.

    ARNOLD GRAMMAR IN PLY

    def p_statement(p):
        """statement :
                     | statement OPERATOR"""
        if len(p) == 1:
            p[0] = []
        else:
            p[0] = p[1] + [Sentence(p[2])]

    Callouts: statement, OPERATOR, [ ]
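
To see how the left-recursive rule above accumulates its result, the following sketch replays the reductions by hand, without PLY. `Sentence` is a hypothetical stand-in for the class named on the slides, and `reduce_statements` is an illustrative helper, not part of PLY.

```python
class Sentence:
    """Hypothetical stand-in for the Sentence class used on the slides."""

    def __init__(self, text):
        self.text = text

    def __repr__(self):
        return "<Sentence %r>" % self.text


def reduce_statements(operators):
    """Replay 'statement : | statement OPERATOR' over a list of token values."""
    # Empty production: len(p) == 1, so p[0] = []
    statement = []
    for operator in operators:
        # Recursive production: p[1] is the list built so far,
        # p[2] is the value of the new OPERATOR token.
        statement = statement + [Sentence(operator)]
    return statement


tree = reduce_statements(["no problemo", "it's showtime", "no problemo"])
print(tree)
```

Each step wraps one more token and appends it to the list produced by the previous reduction, which is why the parser ends up returning a flat list of sentences in source order.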
  13. 35.

    PLY GRAMMAR ERROR HANDLING

    import sys

    def p_error(p):
        if p is None:
            # PLY passes None when the error is at the end of the input
            print("Syntax error most likely at the end of the text")
        else:
            print("Syntax error!! at %s" % p)
            print("Stopped at %s" % p.lexer.token())
        sys.exit(1)
  14. 36.

    PLY PARSER

    from yacc import yacc

    text = """
    consider that a divorce
    no problemo
    it's showtime
    """
    tree = yacc.parse(text)

    > [<Sentence 'no problemo'>, <Sentence "it's showtime">, <Sentence 'no problemo'>]
  15. 37.

    TIPS
    You won't get it right the first time
    As always, practice makes perfect
    Read other grammars beforehand
    Think and plan ahead
  16. 38.
  17. 39.

    TEXTUS
    Text processing is hard
    Shell scripting with pipes is not for the average user
    It's a day-to-day problem everywhere
    I'm not into Excel
  18. 40.

    TEXTUS

    every word like "precio" ; count ;

    every row from "[" to "]" ; match $name":"$email ; print "upper(#{name}) -> #{email}"

    every word ; match {matricula} ; find repeated ;
  19. 41.

    TEXTUS
    elements: word, column, email, number, paragraph, row
    selectors: first, every, last, odd, even
    free text modifiers: like "text", length, like regex, date / phone / …
  20. 42.

    TEXTUS
    structured modifiers: separated by ";"
    indicators: 1   1-6   1,6   1,-2   >6
    operators: count, average, unique, find unique, find repeated, group by, group by day, group by month, group by year, max, min, sum, count_if, …
  21. 43.

    TEXTUS
    conditions: where, with the operators !=   =   in   not in   >   <   >=   <=
    context changes: without headers, save as $var_name
  22. 46.

    Bad rule example

    condstmt : WHERE column_selector EQUALS PATTERN
             | WHERE column_selector NOT_EQUALS PATTERN
             | WHERE column_selector NOT_IN VARIABLE
             | WHERE column_selector IN VARIABLE
             | WHERE range_selector EQUALS PATTERN
             | WHERE range_selector NOT_EQUALS PATTERN
             | WHERE range_selector NOT_IN VARIABLE
             | WHERE range_selector IN VARIABLE
  23. 47.

    Simplify grammar rules

    condstmt : WHERE column_selector condition
             | WHERE range_selector condition

    condition : EQUALS PATTERN
              | NOT_EQUALS PATTERN
              | NOT_IN VARIABLE
              | IN VARIABLE
  24. 48.

    condition rules

    def p_condition(p):
        """
        condition : EQUALS PATTERN
                  | NOT_EQUALS PATTERN
                  | NOT_IN VARIABLE
                  | IN VARIABLE
        """
        p[0] = [p[1], p[2]]

    Alternative shown on the slide:

        p[0] = {'operator': p[1], 'value': p[2]}
  25. 49.

    condition rules

    def p_condition_statement(p):
        """
        condition_statement : WHERE column_selector condition
        """
        p[0] = Condition(
            column_selector=p[2],
            operator=p[3][0],
            pattern=p[3][1],
        )

    Callouts: p[0] = [p[1], p[2]] (the value of condition, read here as p[3]); Condition !
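
The slides do not show how `Condition` is implemented. As a rough sketch of what such an object could do with the operator and value handed over by the parser, the class below filters rows stored as dicts. Everything in it is an assumption made for illustration: the `OPERATORS` table, the operator spellings, the `matches` and `process` methods, and the row format are not taken from textus.

```python
import operator as op

# Assumed mapping from operator text to a comparison; the real token
# values used by textus are not shown on the slides.
OPERATORS = {
    "=": op.eq,
    "!=": op.ne,
    "in": lambda value, candidates: value in candidates,
    "not in": lambda value, candidates: value not in candidates,
}


class Condition:
    """Hypothetical sketch of a 'where' condition over rows stored as dicts."""

    def __init__(self, column_selector, operator, pattern):
        self.column_selector = column_selector
        self.operator = operator
        self.pattern = pattern

    def matches(self, row):
        compare = OPERATORS[self.operator]
        return compare(row[self.column_selector], self.pattern)

    def process(self, rows, context=None):
        # Keep only the rows that satisfy the condition.
        return [row for row in rows if self.matches(row)]


rows = [
    {"name": "Miguel", "state": "valid"},
    {"name": "Raul", "state": "void"},
]

# p_condition returns [operator, value]; the statement rule unpacks it.
parsed_condition = ["=", "void"]
condition = Condition(
    column_selector="state",
    operator=parsed_condition[0],
    pattern=parsed_condition[1],
)
print(condition.process(rows))
```

Keeping the operator as data, rather than one grammar alternative per operator, is what lets the simplified `condition` rule stay small.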
  26. 51.

    from collections import defaultdict

    class OperatorStatement(object):
        def __init__(self, *args, **kwargs):
            self.operation = args[0]

        def process(self, text, context):
            # ParseException is not shown on the slide
            if isinstance(text, str):
                raise ParseException(
                    "Operation needs to go after selectors"
                )
            # e.g. "find repeated" dispatches to self.find_repeated
            return getattr(self, self.operation.replace(' ', '_'))(text)

        def count(self, text):
            if isinstance(text, list):
                return len(text)
            elif isinstance(text, dict):
                results = defaultdict(int)
                for col_name, col_values in text.items():
                    results[col_name] += 1
                return results
  27. 52.

    Interpreter

    tree = yacc.parse(rule)
    context = Context()
    for processor in tree:
        text = processor.process(text, context)
    return text
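
The interpreter loop above is a pipeline: each parsed statement receives the output of the previous one. A self-contained sketch of that idea follows. `Context`, `EveryWord`, `Count`, and `interpret` are hypothetical stand-ins written for this example, the tree is built by hand instead of calling `yacc.parse`, and `like` is assumed to mean a substring match.

```python
class Context:
    """Hypothetical shared state, e.g. variables stored by 'save as'."""

    def __init__(self):
        self.variables = {}


class EveryWord:
    """Hypothetical selector: split the text into words, optionally filtered."""

    def __init__(self, like=None):
        self.like = like

    def process(self, text, context):
        words = text.split()
        if self.like is not None:
            # Assumption: 'like' keeps the words that contain the given text.
            words = [word for word in words if self.like in word]
        return words


class Count:
    """Hypothetical operator: count the selected elements."""

    def process(self, text, context):
        return len(text)


def interpret(tree, text):
    """Feed the output of each processor into the next one."""
    context = Context()
    for processor in tree:
        text = processor.process(text, context)
    return text


# Hand-built stand-in for yacc.parse('every word like "precio" ; count ;')
tree = [EveryWord(like="precio"), Count()]
print(interpret(tree, "precio base 10 precio final 12"))
```

Because every processor exposes the same `process(text, context)` method, new selectors and operators can be added without touching the loop.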
  28. 53.
  29. 54.

    TEXTUS Testing

    def test_csv_column_every_email_count(self):
        text = """
        Sara;eASYDK;sarah@test.com
        James;oASKDK;james@test.com
        Jane;oASWDK;wrongemail@;
        """
        rules = """without headers ; column 3 ; every email ; count ;"""
        result = process_text(
            textwrap.dedent(text), rules
        )
        assert result == 2
  30. 56.

    TEXTUS

    name;barcode;state;
    Miguel;123;valid;
    Raul;124;valid;
    Maria;125;valid;
    Miguel;123;void;

    every row where column "state" = "void" ; column "barcode" ; save as $invalid_barcodes ;
    every row where column "barcode" not in $invalid_barcodes ;
  31. 57.

    TEXTUS

    name;barcode;state;
    Miguel;123;valid;
    Raul;124;valid;
    Maria;125;valid;
    Miguel;123;void;

    column "barcode" ; find repeated ; save as $invalid_barcodes ;
    every row where column "barcode" not in $invalid_barcodes ;
  32. 58.

    TEXTUS

    name;barcode;state;
    Miguel;123;valid;
    Raul;124;valid;
    Maria;125;valid;
    Miguel;123;void;

    column "barcode" ; find unique ; save as $valid_barcodes ;
    every row where column "barcode" in $valid_barcodes ;
  33. 59.
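
For comparison with the barcode rules in the last TEXTUS slides, here is roughly what the first rule set (collect the barcodes of void rows, then keep the rows whose barcode is not among them) looks like in plain Python. This is a sketch under assumed semantics, not the textus implementation: `read_rows` is a hypothetical helper, and the trailing `;` on each line is treated as an empty last field to be dropped.

```python
import csv
import io

DATA = """\
name;barcode;state;
Miguel;123;valid;
Raul;124;valid;
Maria;125;valid;
Miguel;123;void;
"""


def read_rows(text):
    """Parse the semicolon-separated sample into a list of dicts."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    # Drop the empty field produced by the trailing ';' on every line.
    header, *records = [row[:-1] for row in reader if row]
    return [dict(zip(header, record)) for record in records]


rows = read_rows(DATA)

# every row where column "state" = "void" ; column "barcode" ; save as $invalid_barcodes ;
invalid_barcodes = {row["barcode"] for row in rows if row["state"] == "void"}

# every row where column "barcode" not in $invalid_barcodes ;
valid_rows = [row for row in rows if row["barcode"] not in invalid_barcodes]

for row in valid_rows:
    print(row["name"], row["barcode"], row["state"])
```

The Python version needs parsing, a set, and a list comprehension; the two textus rules express the same intent in a form closer to how the problem would be described aloud.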