DSLs can't parse that! ♫

A brief introduction to what a DSL is, the kinds of DSL that exist, and when they are worth using, with some examples.

PLY is then presented as a way to build our own lexer and parser in Python, starting with a simple Arnold language.

Finally, once familiar with PLY, we look at textus, a DSL for text processing: how to parse it, how to interpret it, and how it lets us solve a problem in a different way from the one we are used to, giving it a new perspective.

Miguel Araujo

November 21, 2015

Transcript

  1. @maraujop
    DSLs
    Can't parse that!♫
    Miguel Araujo

  2. Text processing is hard
    Shell scripting with pipes is not for the average user
    It’s a day-to-day problem everywhere
    I’m not into Excel

  3. What are they?

  4. DSLs Examples SQL
    SELECT name
    FROM Users
    WHERE age > 28;

  5. DSLs Examples CSS
    h1 {
        font-size: 24px;
        font-weight: bold;
    }

  6. DSLs Examples Template
    A Django template
    {{ title }}
    {% for name in names %}
        {{ forloop.counter }} hello
        {{ name }}
    {% endfor %}

  7. Create a language?
    but how?

  8. Lexical Analysis
    streams → lexer (scanner) → tokens (language elements)
    patterns for matching

  9. Python lexer
    # Python 2 code
    import tokenize
    from StringIO import StringIO

    python_line = "for i in [1,2]"
    token_gen = tokenize.generate_tokens(
        StringIO(python_line).readline
    )

    def handle_token(
        type, token, (srow, scol), (erow, ecol), line
    ):
        print "%d,%d-%d,%d:\t%s\t%s" % (
            srow, scol, erow, ecol,
            tokenize.tok_name[type], repr(token)
        )

  10. Python lexer
    python_line = "for i in [1,2]"
    for token in token_gen:
        handle_token(*token)

    1,0-1,3: NAME 'for'
    1,4-1,5: NAME 'i'
    1,6-1,8: NAME 'in'
    1,9-1,10: OP '['
    1,10-1,11: NUMBER '1'
    1,11-1,12: OP ','
    1,12-1,13: NUMBER '2'
    1,13-1,14: OP ']'
    2,0-2,0: ENDMARKER ''
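
The slide code is Python 2: tuple parameters and the `print` statement no longer exist in Python 3. Below is a minimal Python 3 sketch of the same idea, using only the standard `tokenize` and `io` modules. The helper name `lex_line` is made up for this illustration.

```python
import io
import tokenize


def lex_line(source):
    """Return (start, end, token name, text) for every token in source."""
    readline = io.StringIO(source).readline
    return [
        (tok.start, tok.end, tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(readline)
    ]


for start, end, name, text in lex_line("for i in [1,2]"):
    # start and end are (row, column) pairs, as in the slide output
    print("%d,%d-%d,%d:\t%s\t%r" % (start + end + (name, text)))
```

Recent Python 3 versions also emit a NEWLINE token just before ENDMARKER, so the listing can be one line longer than the one on the slide.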

  11. Syntax Analysis
    tokens → parser → syntax trees (structure)
    grammar rules

  12. CPython parser and compile
    lexer → tokens → parser → AST → compile() → bytecode
    Python grammar
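
The same pipeline can be driven by hand from Python. This is a small sketch using the standard `ast` module and the built-in `compile()` and `exec()`; the source string and the `<example>` file name are arbitrary.

```python
import ast

source = "total = 1 + 2"

tree = ast.parse(source)                   # lexer + parser -> AST
code = compile(tree, "<example>", "exec")  # AST -> bytecode (a code object)

namespace = {}
exec(code, namespace)                      # run the bytecode
print(type(tree).__name__, namespace["total"])
```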

  13. DSLs are a little different
    lexer → tokens → parser → structured Python code → executes()
    your grammar

  14. Lexers and parsers
    difficult to build
    lexers: regex
    parsers: LR, LALR, SLR

  15. Internals
    Compilers: Principles, Techniques, and Tools

  16. Generator tools
    Lex: Lexical Analyzer generator
    Yacc: Yet Another Compiler Compiler

  17. Generator tools
    Flex: a fast scanner generator
    Bison: the YACC-compatible parser generator

  18. PLY
    PLY (Python Lex Yacc)
    by David M. Beazley
    • Performance
    • Extensive error checking
    • Useful diagnostics

  19. Creating a lexer
    with PLY

  20. TOKENS
    token.type: OPERATOR
    token.value: +

  21. LEXER PATTERNS
    t_TOKEN_TYPE = r'regex'

    def t_TOKEN_TYPE(token):
        r'regex'
        return token

    # default token type: the name after t_ (TOKEN_TYPE)
    # token value: the text matched by the regex

  22. CREATING A LEXER WITH PLY
    Patterns are tried in this order:
    FIRST: tokens defined by functions, in the order they are defined
    THEN: tokens defined by strings, in order of decreasing regular expression length
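
Why the order matters can be seen with plain `re`. The sketch below is not PLY code: it joins named patterns into one alternation, in the order given, and the first alternative that matches wins. The token names `ASSIGN` and `EQUALS` are made up for the illustration.

```python
import re


def scan(text, patterns):
    """Tokenize text by trying the patterns as one ordered alternation."""
    master = re.compile("|".join("(?P<%s>%s)" % pair for pair in patterns))
    return [(m.lastgroup, m.group()) for m in master.finditer(text)]


short_first = [("ASSIGN", r"="), ("EQUALS", r"==")]
long_first = [("EQUALS", r"=="), ("ASSIGN", r"=")]

print(scan("a == b", short_first))  # [('ASSIGN', '='), ('ASSIGN', '=')]
print(scan("a == b", long_first))   # [('EQUALS', '==')]
```

Putting the longer pattern first keeps `==` from being split into two `=` tokens, which is the problem the decreasing-length rule for string patterns avoids.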

  23. Simple Arnold Example

  24. ARNOLD LEXER
    def t_KILLER_SENTENCE(token):
        # PLY compiles patterns in verbose mode, so spaces must be escaped
        r"no\ problemo|it's\ showtime"
        token.type = "OPERATOR"
        return token

  25. ARNOLD LEXER lex.py
    from ply import lex

    tokens = ['OPERATOR']
    # characters skipped between tokens: space, tab and newline
    t_ignore = ' \t\n'

    def t_error(t):
        print "Illegal character '%s'" % \
            t.value[0]
        t.lexer.skip(1)

    lex.lex()

  26. ARNOLD LEXER
    from lex import lex

    lex.input(text)
    while True:
        token = lex.token()
        if token is None:
            break
        print token

  27. ARNOLD LEXER
    text = """
    no problemo it's showtime
    no problemo
    """

    LexToken(OPERATOR,'no problemo',1,0)
    LexToken(OPERATOR,"it's showtime",1,12)
    LexToken(OPERATOR,'no problemo',1,26)

  28. ARNOLD LEXER
    text = """
    consider that a divorce no problemo it's showtime
    no problemo
    """

    Illegal character 'c'
    Illegal character 'o'
    Illegal character 'n'
    Illegal character 's'
    Illegal character 'i'
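
The behaviour shown on these slides (ignore whitespace, emit an OPERATOR for each known sentence, report and skip one character on error) can be imitated without PLY. The following is a hand-rolled sketch for illustration only. The function name `tokenize_arnold` and the tuple layout are assumptions, and it does not use or reproduce PLY's API.

```python
import re

SENTENCE = re.compile(r"no problemo|it's showtime")
IGNORED = " \t\n"


def tokenize_arnold(text):
    """Return (tokens, illegal characters) for an Arnold program."""
    tokens, errors = [], []
    pos = 0
    while pos < len(text):
        if text[pos] in IGNORED:
            pos += 1                      # like t_ignore
            continue
        match = SENTENCE.match(text, pos)
        if match:
            tokens.append(("OPERATOR", match.group(), pos))
            pos = match.end()
        else:
            errors.append(text[pos])      # like t_error ...
            pos += 1                      # ... followed by skip(1)
    return tokens, errors


tokens, errors = tokenize_arnold("consider that a divorce no problemo")
print(tokens)
print("".join(errors))
```

Every letter of the unknown sentence is reported one by one, which is why the slide output starts with 'c', 'o', 'n', 's', 'i'.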

  29. Creating a parser
    with PLY

  30. GRAMMAR TREE
    4 + 4

    expression
    ├── expression
    │   └── NUMERAL: 4
    ├── OPERATOR: +
    └── expression
        └── NUMERAL: 4

  31. ARNOLD GRAMMAR BNF
    Backus-Naur Form
    statement :
              | statement OPERATOR

  32. ARNOLD GRAMMAR IN PLY
    def p_statement(p):
        """statement :
                     | statement OPERATOR"""
        if len(p) == 1:
            p[0] = []
        else:
            p[0] = p[1] + [Sentence(p[2])]

    # p is a YaccProduction: p[1] is statement, p[2] is OPERATOR

  33. ARNOLD GRAMMAR IN PLY
    def p_statement(p):
        """statement :
                     | statement OPERATOR"""
        if len(p) == 1:
            p[0] = []
        else:
            p[0] = p[1] + [Sentence(p[2])]

  34. ARNOLD GRAMMAR IN PLY
    def p_statement(p):
        """statement :
                     | statement OPERATOR"""
        if len(p) == 1:
            p[0] = []
        else:
            p[0] = p[1] + [Sentence(p[2])]

    # empty alternative -> [] ; otherwise statement + [OPERATOR]
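
What the left-recursive rule builds can be traced by hand. The sketch below applies the two alternatives in the order a parser would reduce them. It is plain Python, not PLY, and `Sentence` here is a hypothetical stand-in for the class used on the slides, whose definition is not shown.

```python
class Sentence(object):
    """Hypothetical stand-in for the Sentence class used on the slides."""

    def __init__(self, text):
        self.text = text

    def __repr__(self):
        return "Sentence(%r)" % self.text


def reduce_statements(operators):
    """Apply the two alternatives of the rule by hand, left to right."""
    statement = []                                 # statement : <empty>
    for value in operators:                        # statement : statement OPERATOR
        statement = statement + [Sentence(value)]  # p[0] = p[1] + [Sentence(p[2])]
    return statement


print(reduce_statements(["no problemo", "it's showtime"]))
```

The empty alternative is reduced first and gives the initial list; each following OPERATOR triggers one more reduction that appends to it.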

  35. PLY GRAMMAR ERROR HANDLING
    def p_error(p):
        if p is None:
            print """
            Syntax error most likely at
            the end of the text
            """
        else:
            print "Syntax error!! at %s" % p
            print "Stopped at %s" % p.lexer.token()
        sys.exit(1)

  36. PLY PARSER
    from yacc import yacc
    text = """
    consider that a divorce no problemo it's showtime
    """
    tree = yacc.parse(text)
    > [,
    ,
    ]

  37. TIPS
    You won’t get it right the first time
    As always, practice makes perfect
    Read other grammars beforehand
    Think and plan ahead

  38. TEXTUS

  39. TEXTUS
    Text processing is hard
    Shell scripting with pipes is not for the average user
    It’s a day-to-day problem everywhere
    I’m not into Excel

  40. TEXTUS
    every word like "precio" ; count ;
    every row from "[" to "]" ; match $name":"$email ;
    print "upper(#{name}) -> #{email}"
    every word ; match {matricula} ; find repeated ;

  41. TEXTUS
    elements
      word
      column
      email
      number
      paragraph
      row
    selectors
      first
      every
      last
      odd
      even
    free text modifiers
      like "text"
      length
      like regex
      date / phone / …

  42. TEXTUS
    structured modifiers
      separated by ";"
    indicators
      1
      1-6
      1,6
      1,-2
      >6
    operators
      count
      average
      unique
      find unique
      find repeated
      group by
      group by day
      group by month
      group by year
      max
      min
      sum
      count_if

  43. TEXTUS
    conditions
      where
      !=
      =
      in
      not in
      >
      <
      >=
      <=
    context changes
      without headers
      save as $var_name

  44. TEXTUS statement
    selectorStatement
    selector element
    selector block_element condition
    rangeSelector
    FIRST
    EVERY
    selector element free_text_modifier

  45. TEXTUS
    input
    selectorStatement
    element
    block_element condition
    rangeSelector
    element free_text_modifier
    SELECTOR
    SELECTOR
    SELECTOR
    45 rules -> 43 rules

  46. Bad rule example
    condstmt : WHERE column_selector EQUALS PATTERN
             | WHERE column_selector NOT_EQUALS PATTERN
             | WHERE column_selector NOT_IN VARIABLE
             | WHERE column_selector IN VARIABLE
             | WHERE range_selector EQUALS PATTERN
             | WHERE range_selector NOT_EQUALS PATTERN
             | WHERE range_selector NOT_IN VARIABLE
             | WHERE range_selector IN VARIABLE

  47. Simplify grammar rules
    condstmt : WHERE column_selector condition
             | WHERE range_selector condition
    condition : EQUALS PATTERN
              | NOT_EQUALS PATTERN
              | NOT_IN VARIABLE
              | IN VARIABLE
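
A quick check that factoring loses nothing, assuming `condition` covers the same four operator/operand pairs that appear in the long rule (EQUALS and NOT_EQUALS with PATTERN, IN and NOT_IN with VARIABLE). The sketch expands the factored grammar back into flat alternatives; it only manipulates token names and parses nothing.

```python
from itertools import product

selectors = ["column_selector", "range_selector"]
conditions = [
    ("EQUALS", "PATTERN"),
    ("NOT_EQUALS", "PATTERN"),
    ("NOT_IN", "VARIABLE"),
    ("IN", "VARIABLE"),
]

# Expand "condstmt : WHERE <selector> condition" into flat alternatives.
expanded = {
    ("WHERE", selector) + condition
    for selector, condition in product(selectors, conditions)
}

print(len(selectors) + len(conditions), "alternatives describe", len(expanded), "sentences")
```

Two plus four alternatives cover the same eight token sequences, and adding a new operator later means touching one rule instead of one line per selector.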

  48. condition rules
    def p_condition(p):
        """
        condition : EQUALS PATTERN
                  | NOT_EQUALS PATTERN
                  | NOT_IN VARIABLE
                  | IN VARIABLE
        """
        p[0] = [p[1], p[2]]
        # alternative: p[0] = {'operator': p[1], 'value': p[2]}

  49. condition rules
    def p_condition_statement(p):
        """
        condition_statement : WHERE column_selector condition
        """
        # p[3] comes from the condition rule: p[0] = [p[1], p[2]]
        p[0] = Condition(
            column_selector=p[2],
            operator=p[3][0],
            pattern=p[3][1],
        )
    # result: a Condition object

  50. Parser tree ?
    [,
    ,
    ,
    ]
    without headers ; column 2 ; every email ; count ;
    Condition -> process(text, context)

  51. class OperatorStatement(object):
        # Python 2 code; needs: from collections import defaultdict
        def __init__(self, *args, **kwargs):
            self.operation = args[0]

        def process(self, text, context):
            if isinstance(text, basestring):
                raise ParseException(
                    "Operation needs to go after selectors"
                )
            # e.g. "find repeated" dispatches to self.find_repeated
            return getattr(self, self.operation.replace(' ', '_'))(text)

        def count(self, text):
            if isinstance(text, list):
                return len(text)
            elif isinstance(text, dict):
                results = defaultdict(int)
                for col_name, col_values in text.items():
                    results[col_name] += 1
                return results

  52. Interpreter
    tree = yacc.parse(rule)
    context = Context()
    for processor in tree:
        text = processor.process(text, context)
    return text
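
The loop above is a pipeline: each object in the tree receives the previous result and returns the next one. Here is a self-contained sketch of that design with a hand-built tree instead of a parsed one. The classes `EveryWord`, `Like` and `Count`, the body of `Context`, and the substring meaning given to `like` are all assumptions made for the illustration, not textus code.

```python
class Context(object):
    """Hypothetical shared state passed to every processor."""

    def __init__(self):
        self.variables = {}


class EveryWord(object):
    """Selector: turn the raw text into a list of words."""

    def process(self, text, context):
        return text.split()


class Like(object):
    """Modifier: keep only the words containing a fragment."""

    def __init__(self, fragment):
        self.fragment = fragment

    def process(self, words, context):
        return [word for word in words if self.fragment in word]


class Count(object):
    """Operator: reduce the selection to its size."""

    def process(self, words, context):
        return len(words)


def interpret(tree, text):
    """Feed the output of each processor into the next one."""
    context = Context()
    for processor in tree:
        text = processor.process(text, context)
    return text


# Hand-built equivalent of: every word like "precio" ; count ;
tree = [EveryWord(), Like("precio"), Count()]
print(interpret(tree, "precio alto, precio bajo, coste medio"))  # 2
```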

  53. TEXTUS

  54. TEXTUS Testing
    def test_csv_column_every_email_count(self):
        text = """
        Sara;eASYDK;[email protected]
        James;oASKDK;[email protected]
        Jane;oASWDK;[email protected];
        """
        rules = """without headers ; column 3 ;
        every email ; count ;"""
        result = process_text(
            textwrap.dedent(text),
            rules
        )
        assert result == 2

  55. Solving a real CSV problem
    input:
    name;barcode;state;
    Miguel;123;valid;
    Raul;124;valid;
    Maria;125;valid;
    Miguel;123;void;
    expected output:
    name;barcode;state;
    Raul;124;valid;
    Maria;125;valid;

  56. TEXTUS
    name;barcode;state;
    Miguel;123;valid;
    Raul;124;valid;
    Maria;125;valid;
    Miguel;123;void;
    every row where column "state" = "void" ; column "barcode" ; save as $invalid_barcodes ;
    every row where column "barcode" not in $invalid_barcodes ;
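
For comparison, the same two steps written directly in Python with the standard `csv` module. This is only a sketch of what the two rules compute, not how textus implements them; the function name `drop_voided` is made up.

```python
import csv
import io

DATA = """\
name;barcode;state;
Miguel;123;valid;
Raul;124;valid;
Maria;125;valid;
Miguel;123;void;
"""


def drop_voided(data):
    """Return the rows whose barcode never appears with state 'void'."""
    rows = list(csv.DictReader(io.StringIO(data), delimiter=";"))
    # rule 1: every row where column "state" = "void" ; column "barcode"
    invalid_barcodes = {row["barcode"] for row in rows if row["state"] == "void"}
    # rule 2: every row where column "barcode" not in $invalid_barcodes
    return [row for row in rows if row["barcode"] not in invalid_barcodes]


for row in drop_voided(DATA):
    print(row["name"], row["barcode"], row["state"])
```

The trailing `;` on each line produces an extra empty column, which the sketch simply ignores.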

  57. TEXTUS
    name;barcode;state;
    Miguel;123;valid;
    Raul;124;valid;
    Maria;125;valid;
    Miguel;123;void;
    column "barcode" ; find repeated ; save as $invalid_barcodes ;
    every row where column "barcode" not in $invalid_barcodes ;

  58. TEXTUS
    name;barcode;state;
    Miguel;123;valid;
    Raul;124;valid;
    Maria;125;valid;
    Miguel;123;void;
    column "barcode" ; find unique ; save as $valid_barcodes ;
    every row where column "barcode" in $valid_barcodes ;

  59. TEXTUS

  60. Thanks!
    @maraujop
