DSLs can't parse that! ♫

A brief introduction to what a DSL is, the types of DSL, and when they are worth using, with some examples.

Next, PLY is presented as a way to build our own lexer and parser in Python, starting with a simple Arnold language.

Finally, once we are familiar with PLY, the talk presents textus, a DSL for text processing: how to parse it, how to interpret it, and how to solve a problem in a different way from the one we are used to, giving it a new perspective.

Miguel Araujo

November 21, 2015

Transcript

  1. DSLs Can't parse that! ♫ Miguel Araujo (@maraujop)

  2. Text processing is hard. Shell scripting with pipes is not

    for the average user. It's a day-to-day problem everywhere. I'm not into Excel.
  3. What are they?

  4. DSL examples: SQL

    SELECT name FROM Users WHERE age > 28;
  5. DSL examples: CSS. h1 { fontSize: 24; fontWeight: 'bold'; }

  6. DSL examples: templates. A Django template:

    {{ title }} {% for name in names %} {{ forloop.counter }} hello {{ name }} {% endfor %}
  7. Create a language? But how?

  8. Lexical analysis: the lexer (scanner) uses patterns for matching

    to turn character streams into tokens (language elements).
  9. Python lexer

    import tokenize
    from io import StringIO

    python_line = "for i in [1,2]"
    token_gen = tokenize.generate_tokens(StringIO(python_line).readline)

    def handle_token(type, token, start, end, line):
        (srow, scol), (erow, ecol) = start, end  # (row, column) pairs
        print("%d,%d-%d,%d:\t%s\t%s" % (srow, scol, erow, ecol, tokenize.tok_name[type], repr(token)))
  10. Python lexer

    python_line = "for i in [1,2]"
    for token in token_gen:
        handle_token(*token)

    1,0-1,3: NAME 'for'
    1,4-1,5: NAME 'i'
    1,6-1,8: NAME 'in'
    1,9-1,10: OP '['
    1,10-1,11: NUMBER '1'
    1,11-1,12: OP ','
    1,12-1,13: NUMBER '2'
    …
    2,0-2,0: ENDMARKER ''
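For readers on Python 3, the tokenizer demo from the two slides above can be sketched as follows. This is a minimal sketch and not the speaker's code: it assumes the input line was `for i in [1,2]` (which is what the reported column positions suggest), and the helper name `token_rows` is hypothetical.

```python
import io
import tokenize


def token_rows(source):
    """Return (start, end, type name, text) for every token in source."""
    readline = io.StringIO(source).readline
    return [
        (tok.start, tok.end, tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(readline)
    ]


rows = token_rows("for i in [1,2]")
for (srow, scol), (erow, ecol), name, text in rows:
    # Same layout as the slide: start-end, token type, token text
    print("%d,%d-%d,%d:\t%s\t%r" % (srow, scol, erow, ecol, name, text))
```

The standard library tokenizer only understands Python source, which is why a DSL needs its own lexer.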
  11. Syntax analysis: the parser applies grammar rules to tokens and produces syntax trees (structure).

  12. CPython parser and compile: lexer -> tokens -> parser (Python grammar)

    -> AST -> compile() -> bytecode
  13. DSLs are a little different: lexer -> tokens -> parser (your grammar)

    -> structured Python code -> executes()
  14. Lexers and parsers are difficult to build. Lexers: regex.

    Parsers: LR, LALR, SLR.
  15. Internals: "Compilers: Principles, Techniques, and Tools"

  16. Generator tools. Lex: lexical analyzer generator.

    Yacc: Yet Another Compiler-Compiler.
  17. Generator tools. Flex: a fast scanner generator.

    Bison: the Yacc-compatible parser generator.
  18. PLY (Python Lex-Yacc) by David M. Beazley

    • Performance • Extensive error checking • Useful diagnostics
  19. Creating a lexer with PLY

  20. TOKENS token.type: OPERATOR, token.value: +

  21. LEXER PATTERNS

    t_TOKEN_TYPE = r'regex'

    def t_TOKEN_TYPE(token):
        r'regex'
        return token

    Labels: default token type (the name after t_), token value (the text matched by the regex)
  22. CREATING A LEXER WITH PLY. Rule matching order. FIRST: tokens defined by functions, in the order they are defined.

    THEN: tokens defined by strings, in order of decreasing regular expression length.
  23. Simple Arnold Example

  24. ARNOLD LEXER

    def t_KILLER_SENTENCE(token):
        r"no\ problemo|it's\ showtime"
        token.type = "OPERATOR"
        return token
  25. ARNOLD LEXER (lex.py)

    from ply import lex

    tokens = ['OPERATOR']
    t_ignore = ' \t\n'  # skip spaces, tabs and newlines

    def t_error(t):
        print("Illegal character '%s'" % t.value[0])
        t.lexer.skip(1)

    lex.lex()
  26. ARNOLD LEXER

    from lex import lex

    lex.input(text)
    while True:
        token = lex.token()
        if token is None:
            break
        print(token)
  27. ARNOLD LEXER

    text = """ no problemo it's showtime no problemo """

    LexToken(OPERATOR,'no problemo',1,0)
    LexToken(OPERATOR,"it's showtime",1,12)
    LexToken(OPERATOR,'no problemo',1,26)
  28. ARNOLD LEXER

    text = """ consider that a divorce no problemo it's showtime no problemo """

    Illegal character 'c'
    Illegal character 'o'
    Illegal character 'n'
    Illegal character 's'
    Illegal character 'i'
    …
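To make the lexer behaviour on the Arnold slides concrete without installing PLY, here is a minimal standard-library sketch of the same idea. It is not PLY and not the speaker's code: the names `TOKEN_RE`, `IGNORED` and `lex_text` are hypothetical, and it assumes spaces, tabs and newlines are ignored.

```python
import re

# One named group per token type; the group name becomes the token type.
TOKEN_RE = re.compile(r"(?P<OPERATOR>no problemo|it's showtime)")
IGNORED = " \t\n"  # assumption: whitespace is skipped, like t_ignore


def lex_text(text):
    """Return (tokens, illegal_chars), skipping one bad character at a time."""
    tokens, illegal = [], []
    pos = 0
    while pos < len(text):
        if text[pos] in IGNORED:
            pos += 1
            continue
        match = TOKEN_RE.match(text, pos)
        if match:
            # (type, value, position), similar to the fields of a LexToken
            tokens.append((match.lastgroup, match.group(), pos))
            pos = match.end()
        else:
            illegal.append(text[pos])  # what t_error would report
            pos += 1                   # same effect as t.lexer.skip(1)
    return tokens, illegal


tokens, illegal = lex_text("no problemo it's showtime no problemo")
print(tokens)
```

Feeding it `"consider that a divorce no problemo"` reports every character of the unknown words as illegal, one at a time, before the final sentence is recognized.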
  29. Creating a parser with PLY

  30. GRAMMAR TREE for 4 + 4: expression -> expression OPERATOR expression,

    where each inner expression -> NUMERAL (4) and OPERATOR is +
  31. ARNOLD GRAMMAR in BNF (Backus-Naur Form): statement : | statement OPERATOR

  32. ARNOLD GRAMMAR IN PLY

    def p_statement(p):
        """statement :
                     | statement OPERATOR"""
        if len(p) == 1:
            p[0] = []
        else:
            p[0] = p[1] + [Sentence(p[2])]

    Labels: YaccProduction (p), p[1], p[2]
  33. ARNOLD GRAMMAR IN PLY

    (same code as slide 32)
  34. ARNOLD GRAMMAR IN PLY

    (same code as slide 32) Labels: statement (p[1]), OPERATOR (p[2]), [ ]
  35. PLY GRAMMAR ERROR HANDLING

    import sys

    def p_error(p):
        if p is None:
            print("Syntax error most likely at the end of the text")
        else:
            print("Syntax error!! at %s" % p)
            print("Stopped at %s" % p.lexer.token())
        sys.exit(1)
  36. PLY PARSER

    from yacc import yacc

    text = """ consider that a divorce no problemo it's showtime """
    tree = yacc.parse(text)

    > [<Sentence 'no problemo'>, <Sentence "it's showtime">, <Sentence 'no problemo'>]
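The way the left-recursive rule `statement : | statement OPERATOR` accumulates a list can be traced by hand. The sketch below is not PLY: it replays, in plain Python, the reductions an LR parser would perform for that rule. `Sentence` here is a hypothetical stand-in for the class used in the talk, and `reduce_statement` and `parse` are illustrative names.

```python
class Sentence:
    """Hypothetical stand-in for the Sentence class used in the talk."""

    def __init__(self, text):
        self.text = text

    def __repr__(self):
        return "<Sentence %r>" % self.text


def reduce_statement(p):
    """Semantic action of `statement : | statement OPERATOR`.

    p mimics a YaccProduction: p[0] is the result slot and
    p[1:] are the values of the matched right-hand-side symbols.
    """
    if len(p) == 1:          # empty alternative
        p[0] = []
    else:                    # statement OPERATOR
        p[0] = p[1] + [Sentence(p[2])]
    return p[0]


def parse(operators):
    """Apply the reductions left to right, one per OPERATOR token."""
    statement = reduce_statement([None])  # statement : <empty>
    for value in operators:
        statement = reduce_statement([None, statement, value])
    return statement


tree = parse(["no problemo", "it's showtime", "no problemo"])
print(tree)
```

Because the recursion is on the left, the list grows in input order and the parser stack stays shallow, which is the usual reason LR grammars prefer left recursion.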
  37. TIPS: You won't get it right the first time. As

    always, practice makes perfect. Read other grammars beforehand. Think and plan ahead.
  38. TEXTUS

  39. TEXTUS: Text processing is hard. Shell scripting with pipes is

    not for the average user. It's a day-to-day problem everywhere. I'm not into Excel.
  40. TEXTUS

    every word like "precio" ; count ;
    every row from "[" to "]" ; match $name":"$email ; print "upper(#{name}) -> #{email}"
    every word ; match {matricula} ; find repeated ;
  41. TEXTUS elements: word, column, email, number, paragraph, row. Selectors: first,

    every, last, odd, even. Free text modifiers: like "text", length, like regex, date / phone / …
  42. TEXTUS structured: modifiers separated by ";". Indicators: 1 | 1-6 | 1,6 | 1,-2 | >6.

    Operators: count, average, unique, find unique, find repeated, group by, group by day, group by month, group by year, max, min, sum, count_if, …
  43. TEXTUS conditions: where, with operators != | = | in | not in | > | < | >= | <=.

    Context changes: without headers | save as $var_name
  44. TEXTUS statement selectorStatement selector element selector block_element condition rangeSelector FIRST

    EVERY selector element free_text_modier
  45. TEXTUS input selectorStatement element block_element condition rangeSelector element free_text_modier SELECTOR

    SELECTOR SELECTOR 45 rules -> 43 rules
  46. Bad rule example

    condstmt : WHERE column_selector EQUALS PATTERN
             | WHERE column_selector NOT_EQUALS PATTERN
             | WHERE column_selector NOT_IN VARIABLE
             | WHERE column_selector IN VARIABLE
             | WHERE range_selector EQUALS PATTERN
             | WHERE range_selector NOT_EQUALS PATTERN
             | WHERE range_selector NOT_IN VARIABLE
             | WHERE range_selector IN VARIABLE
  47. Simplify grammar rules

    condstmt : WHERE column_selector condition
             | WHERE range_selector condition
    condition : EQUALS PATTERN
              | NOT_EQUALS PATTERN
              | NOT_IN VARIABLE
              | IN VARIABLE
  48. condition rules

    def p_condition(p):
        """ condition : EQUALS PATTERN
                      | NOT_EQUALS PATTERN
                      | NOT_IN VARIABLE
                      | IN VARIABLE """
        p[0] = [p[1], p[2]]
        # alternative shown on the slide: p[0] = {'operator': p[1], 'value': p[2]}
  49. condition rules

    def p_condition_statement(p):
        """ condition_statement : WHERE column_selector condition """
        p[0] = Condition(
            column_selector=p[2],
            operator=p[3][0],
            pattern=p[3][1],
        )

    Labels: p[0] = [p[1], p[2]] -> Condition!
  50. Parser tree?

    without headers ; column 2 ; every email ; count ;
    [<textus.interpreter.ContextChange>, <textus.interpreter.ColumnSelectorStatement>, <textus.interpreter.SelectorStatement>, <textus.interpreter.OperatorStatement>]
    Condition -> process(text, context)
  51. OperatorStatement

    from collections import defaultdict

    class OperatorStatement(object):
        def __init__(self, *args, **kwargs):
            self.operation = args[0]

        def process(self, text, context):
            if isinstance(text, str):
                raise ParseException("Operation needs to go after selectors")
            # "find repeated" dispatches to self.find_repeated, and so on
            return getattr(self, self.operation.replace(' ', '_'))(text)

        def count(self, text):
            if isinstance(text, list):
                return len(text)
            elif isinstance(text, dict):
                results = defaultdict(int)
                for col_name, col_values in text.items():
                    results[col_name] += 1
                return results
  52. Interpreter

    tree = yacc.parse(rule)
    context = Context()
    for processor in tree:
        text = processor.process(text, context)
    return text
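The interpreter loop above treats the parse tree as a pipeline: each node receives the previous node's output plus a shared context. The following is a minimal sketch of that pattern under stated assumptions, not the textus implementation: every class name (`Context`, `WithoutHeaders`, `Column`, `EveryEmail`, `Count`), the loose email regex, and the sample rows are hypothetical, and the tree is built by hand instead of by a parser.

```python
import re


class Context:
    """Hypothetical shared state passed to every processor."""

    def __init__(self):
        self.has_headers = True


class WithoutHeaders:
    def process(self, text, context):
        context.has_headers = False  # a context change, text passes through
        return text


class Column:
    def __init__(self, index, separator=";"):
        self.index, self.separator = index, separator

    def process(self, text, context):
        rows = [line for line in text.splitlines() if line.strip()]
        if context.has_headers:
            rows = rows[1:]
        return [row.split(self.separator)[self.index - 1] for row in rows]


class EveryEmail:
    EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # deliberately loose

    def process(self, values, context):
        return [value for value in values if self.EMAIL.match(value)]


class Count:
    def process(self, values, context):
        return len(values)


def interpret(tree, text):
    """Feed the output of each processor into the next one."""
    context = Context()
    for processor in tree:
        text = processor.process(text, context)
    return text


# Hand-built tree for: without headers ; column 3 ; every email ; count ;
tree = [WithoutHeaders(), Column(3), EveryEmail(), Count()]
sample = (
    "Sara;eASYDK;sarah@test.com\n"
    "James;oASKDK;james@test.com\n"
    "Jane;oASWDK;wrongemail@;\n"
)
result = interpret(tree, sample)
print(result)
```

Keeping every node behind the same `process(text, context)` signature is what lets the interpreter stay a four-line loop regardless of how many statement types the grammar grows.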
  53. TEXTUS

  54. TEXTUS Testing

    def test_csv_column_every_email_count(self):
        text = """
        Sara;eASYDK;sarah@test.com
        James;oASKDK;james@test.com
        Jane;oASWDK;wrongemail@;
        """
        rules = """without headers ; column 3 ; every email ; count ;"""
        result = process_text(textwrap.dedent(text), rules)
        assert result == 2
  55. Solving a real CSV problem

    name;barcode;state;
    Miguel;123;valid;
    Raul;124;valid;
    Maria;125;valid;
    Miguel;123;void;
    ->
    name;barcode;state;
    Raul;124;valid;
    Maria;125;valid;
  56. TEXTUS

    name;barcode;state;
    Miguel;123;valid;
    Raul;124;valid;
    Maria;125;valid;
    Miguel;123;void;

    every row where column "state" = "void" ; column "barcode" ; save as $invalid_barcodes ;
    every row where column "barcode" not in $invalid_barcodes ;
  57. TEXTUS

    name;barcode;state;
    Miguel;123;valid;
    Raul;124;valid;
    Maria;125;valid;
    Miguel;123;void;

    column "barcode" ; find repeated ; save as $invalid_barcodes ;
    every row where column "barcode" not in $invalid_barcodes ;
  58. TEXTUS

    name;barcode;state;
    Miguel;123;valid;
    Raul;124;valid;
    Maria;125;valid;
    Miguel;123;void;

    column "barcode" ; find unique ; save as $valid_barcodes ;
    every row where column "barcode" in $valid_barcodes ;
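For comparison, the first textus solution (collect the barcodes of the void rows, then keep only rows whose barcode is not among them) can be written in plain Python. This is a rough equivalent for illustration, not how textus works internally; the function name `drop_voided` is hypothetical.

```python
import csv
import io

DATA = """name;barcode;state;
Miguel;123;valid;
Raul;124;valid;
Maria;125;valid;
Miguel;123;void;
"""


def drop_voided(text):
    """Drop every row whose barcode also appears in a row marked void."""
    rows = list(csv.reader(io.StringIO(text), delimiter=";"))
    header, body = rows[0], rows[1:]
    barcode = header.index("barcode")
    state = header.index("state")
    # every row where column "state" = "void" ; column "barcode" ; save as ...
    invalid_barcodes = {row[barcode] for row in body if row[state] == "void"}
    # every row where column "barcode" not in $invalid_barcodes
    kept = [row for row in body if row[barcode] not in invalid_barcodes]
    return [header] + kept


cleaned = drop_voided(DATA)
for row in cleaned:
    print(";".join(row))
```

The trailing `;` on each line produces an empty last field, which is kept so that joining the rows reproduces the original layout.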
  59. TEXTUS

  60. Thanks! @maraujop