
More than Regexp

Zete
November 18, 2012

RubyConfChina 2012: Regular Expressions That Go Beyond Regular Expressions (超越正则表达式的正则表达式)


Transcript

  1. I mastered Basic, C, C++, Objective-C, Java, C#, Mathematica, Ruby, Perl, Python, CoffeeScript, Haskell, Scala, Groovy, R, SML, Erlang, F#, MASM / GNU assembly, LLVM assembly
    Monday, November 19, 12
  2. I mastered Basic, C, C++, Objective-C, Java, C#, Mathematica, Ruby, Perl, Python, CoffeeScript, Haskell, Scala, Groovy, R, SML, Erlang, F#, MASM / GNU assembly, LLVM assembly's hello world
  3. I know quite a lot about web development technologies, compiler techniques, functional programming, quantum mechanics, abstract algebra, category theory and medieval history
  4. I know quite a lot about web development technologies, compiler techniques, functional programming, quantum mechanics, abstract algebra, category theory and medieval history, a little bit
  5. ★ Search and replace
    ★ Validate form data
    ★ Implement protocols
    ★ Virus scan
  6. ★ Search and replace
    ★ Validate form data
    ★ Implement protocols
    ★ Virus scan
    ★ Match mRNA motifs
  7. ★ Search and replace
    ★ Validate form data
    ★ Implement protocols
    ★ Virus scan
    ★ Match mRNA motifs
    ★ ...
  8. ★ Stephen Cole Kleene, 1950s
    ★ Ken Thompson, ed, g/re/p
    ★ Larry Wall, Perl
  13. ★ Lookaround
    ★ Unicode support
    ★ Matching history
    ★ A PEG engine, in fact
  14. ★ Lookaround
    ★ Unicode support
    ★ Matching history
    ★ A PEG engine, in fact
    ★ Ruby 1.9: Oniguruma (鬼车)
  15. ★ Lookaround
    ★ Unicode support
    ★ Matching history
    ★ A PEG engine, in fact
    ★ Ruby 1.9: Oniguruma (鬼车)
    ★ Ruby 2.0: Onigmo (鬼云)
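Two of the features listed above, lookaround and Unicode support, can be sketched as follows. This is only an illustration, assuming Ruby 2.0 or later with UTF-8 source; the constant names and sample strings are made up for the example.

```ruby
# Lookahead: "Ruby" only when "Conf" follows; "Conf" itself is not consumed.
LOOKAHEAD  = /Ruby(?=Conf)/
# Lookbehind: digits only when a dollar sign comes right before them.
LOOKBEHIND = /(?<=\$)\d+/
# Unicode property: a run of Han (Chinese) characters.
HAN        = /\p{Han}+/

puts 'RubyConfChina'[LOOKAHEAD]   # prints "Ruby"
puts 'price: $42'[LOOKBEHIND]     # prints "42"
puts 'Ruby 正则表达式'[HAN]        # prints "正则表达式"
```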
  16. Regexp in Ruby
    RUBY_VERSION =~ /1\.9|2\.0/
    s = "肿么办?"
    s[/肿么/] = "怎么"
    s  #=> "怎么办?"
    %r(#{words.join '|'})
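One caveat with the `%r(#{words.join '|'})` idiom: the words are interpolated verbatim, so any regexp metacharacter inside a word changes the pattern. A possible alternative, sketched here with a hypothetical word list, is `Regexp.union`, which escapes each string before joining.

```ruby
WORDS = ['C++', 'C#', 'Ruby']  # hypothetical word list

# Plain interpolation: "++" is read as a quantifier, not as two plus signs.
NAIVE = %r(#{WORDS.join '|'})
# Regexp.union escapes every string before joining them with "|".
SAFE  = Regexp.union(WORDS)

puts 'I like C++'[NAIVE]  # prints "C"
puts 'I like C++'[SAFE]   # prints "C++"
```

Whether escaping is wanted depends on the word list: if the entries are meant to be sub-patterns rather than literal words, plain interpolation is the right tool.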
  17. foo    "foo" exactly
    .      matches ANY char (except newline, unless /m is set)
    a|b    or
    a?     maybe yes, maybe no
  18. foo    "foo" exactly
    .      matches ANY char (except newline, unless /m is set)
    a|b    or
    a?     maybe yes, maybe no
    a*     kleeeeeene star
  19. foo    "foo" exactly
    .      matches ANY char (except newline, unless /m is set)
    a|b    or
    a?     maybe yes, maybe no
    a*     kleeeeeene star
    a{0}   repeat 0 times
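The table above can be tried out directly. The following is a small sketch, with each pattern anchored by `\A` and `\z` so that it has to cover the whole string; the constant names are chosen for this example only.

```ruby
EXACT    = /\Afoo\z/      # "foo" exactly
ANY      = /\Af.o\z/      # "." stands for any single char except newline
EITHER   = /\A(?:a|b)\z/  # a or b
OPTIONAL = /\Aab?\z/      # "b" maybe yes, maybe no
STAR     = /\Aa*\z/       # zero or more "a", so the empty string matches too
ZERO     = /\Aa{0}b\z/    # "a" repeated 0 times: only "b" is left to match

p 'f-o' =~ ANY    # prints 0
p "f\no" =~ ANY   # prints nil
p '' =~ STAR      # prints 0
p 'ab' =~ ZERO    # prints nil
```

The `a{0}` case looks useless on its own, but a group repeated zero times is still defined, which makes it a handy place to park a named sub-pattern.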
  20. (a)          group
    \1           back ref (the exact text the group captured)
    (?<name>a)   define named group
    \g<name>     call the named group (re-applies its pattern)
  21. Difference between a back ref and a named group call
    backref = /(\w+) \1/
    backref =~ 'ha ha' # 0
    backref =~ 'ha ho' # nil
    named = /(?<word>\w+) \g<word>/
    named =~ 'ha ha' # 0
    named =~ 'ha ho' # 0
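Named groups can also be back-referenced with `\k<name>`, which behaves like `\1`. The sketch below puts the two side by side: `\k<word>` must see the same text again, while `\g<word>` only re-runs the `\w+` pattern. The constant names are made up for this example.

```ruby
NAMED_BACKREF = /(?<word>\w+) \k<word>/  # second word must equal the first
SUBEXP_CALL   = /(?<word>\w+) \g<word>/  # second word only has to match \w+

p 'ha ha' =~ NAMED_BACKREF  # prints 0
p 'ha ho' =~ NAMED_BACKREF  # prints nil
p 'ha ho' =~ SUBEXP_CALL    # prints 0
```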
  22. ★ A complex regexp packs in a lot of information
    ★ Add whitespace to make it human-readable
  23. ★ A complex regexp packs in a lot of information
    ★ Add whitespace to make it human-readable
    ★ Try not to write overly complex regexps
  24. Alignment reduces visual complexity:
    /^
      [\ \t]* (?:class) \s*
      (.*?)   \s*
      (<.*?)? \s*
      (\#.*)?
    $/x
  25. Add comments
    r = /^
      [\ \t]* (?:class) \s*
      (.*?)   \s* # class name
      (<.*?)? \s* # inheritance
      (\#.*)?     # line comment (a literal # must be escaped under /x)
    $/x
    r =~ "class A < B # match!"
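A commented `/x` regexp pairs well with named groups, since the names then document the captures at the point of use. The following is a sketch only: the group names `name`, `parent` and `note` are invented for this example and are not from the slides.

```ruby
CLASS_LINE = /^
  [\ \t]* (?:class) \s*
  (?<name>   .*?  )  \s*   # class name
  (?<parent> <.*? )? \s*   # inheritance
  (?<note>   \#.* )?       # line comment (literal # escaped under x mode)
$/x

m = CLASS_LINE.match('class A < B # match!')
puts m[:name]    # prints "A"
puts m[:parent]  # prints "< B"
puts m[:note]    # prints "# match!"
```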
  26. Grammar in BNF Notation:
    <A> ::= "我知道" <B>?    ("I know")
    <B> ::= "你知道" <A>     ("you know")
  27. Grammar in BNF Notation:
    <A> ::= "我知道" <B>?    ("I know")
    <B> ::= "你知道" <A>     ("you know")
    Tail Recursion
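A two-rule grammar of this shape can be written as a regexp with `\g<name>` calls, one `(?<name> ... ){0}` definition per rule. The sketch below uses English stand-in literals ("I know", " you know ") chosen for this example; they are assumptions, not the slide's strings.

```ruby
GRAMMAR = /
  (?<a> I\ know \g<b>? ){0}        # <A> ::= "I know" <B>?
  (?<b> \ you\ know\ \g<a> ){0}    # <B> ::= " you know " <A>
  \A \g<a> \z                      # start symbol: the whole string must be an <A>
/x

p 'I know' =~ GRAMMAR                  # prints 0
p 'I know you know I know' =~ GRAMMAR  # prints 0
p 'I know you know' =~ GRAMMAR         # prints nil
```

Each rule consumes some text before it calls the other one, so the mutual recursion always makes progress.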
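  28. /
      (?<主语> 蜘蛛侠|我 ){0}                                  # subject: Spider-Man | I
      (?<谓语> 是|在RubyConfChina上讲了 ){0}                    # predicate: is | told at RubyConfChina
      (?<宾语> 章鱼博士的儿子|冷笑话:\g<宾语同位语> ){0}         # object: Doctor Octopus's son | a bad joke: <appositive>
      (?<陈述句> \g<主语>\g<谓语>\g<宾语> ){0}                  # statement: subject predicate object
      (?<冷笑话> \g<陈述句>,(\g<原因从句>|\g<转折从句>) ){0}    # bad joke: statement, (reason | contrast)
      (?<原因从句> 因为蜘蛛和章鱼都有八条腿 ){0}                # reason: because spiders and octopuses both have eight legs
      (?<宾语同位语> “\g<冷笑话>” ){0}                          # appositive: a quoted bad joke
      (?<转折从句> 但大家没有笑 ){0}                            # contrast: but nobody laughed
      \g<冷笑话>
    /x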
  29. Use dictionaries for the sentence components and you can build a natural-language parser with "Regexp" (a PEG, in fact)
  30. Use dictionaries for the sentence components and you can build a natural-language parser with "Regexp" (a PEG, in fact)
    (?<主语>蜘蛛侠|蝙蝠侠|灯笼侠|...)    # subject: Spider-Man | Batman | Green Lantern | ...
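The dictionary idea can be sketched in a few lines: each sentence component becomes a named `{0}` definition filled from a word list, and the final line strings the components together. The word lists and the subject-verb-object sentence shape below are hypothetical, picked only to keep the example small.

```ruby
SUBJECTS = %w[Spider-Man Batman]  # hypothetical dictionaries
VERBS    = %w[likes fears]
OBJECTS  = %w[spiders bats]

SENTENCE = /
  (?<subject> #{Regexp.union(SUBJECTS)} ){0}
  (?<verb>    #{Regexp.union(VERBS)}    ){0}
  (?<object>  #{Regexp.union(OBJECTS)}  ){0}
  \A \g<subject> \s \g<verb> \s \g<object> \z
/x

p 'Batman fears bats' =~ SENTENCE  # prints 0
p 'bats fears Batman' =~ SENTENCE  # prints nil
```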
  31. A real-world language is a bit more than a PEG; generally it is a context-free grammar (CFG).
  32. In a CFG the branches are not ordered: if A and B are two rules, A|B is the same as B|A. (In a PEG the choice is ordered, so the first branch that matches wins.)
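The effect of ordered branches is easy to observe in Ruby's regexp engine. The sketch below shows that swapping the alternatives changes the result, and uses an atomic group `(?>...)` as an approximation of a PEG's committed choice; the constant names are made up for this example.

```ruby
# Ordered choice: the first alternative that matches wins.
FIRST_A  = /a|ab/
FIRST_AB = /ab|a/

puts 'ab'[FIRST_A]   # prints "a"
puts 'ab'[FIRST_AB]  # prints "ab"

# A plain group can backtrack into its second branch; an atomic group
# commits to the first branch that succeeded, as a PEG choice would.
BACKTRACKING = /\A(?:a|ab)c\z/
COMMITTED    = /\A(?>a|ab)c\z/

p 'abc' =~ BACKTRACKING  # prints 0
p 'abc' =~ COMMITTED     # prints nil
```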
  33. Real-world example: a simple Markdown parser in 130 lines (many features ignored, but...)
  34. It supports nested parens! (while ruby-china doesn't)
    [ruby](http://en.wikipedia.org/wiki/Ruby_(programming_language))
    <a href='http://en.wikipedia.org/wiki/Ruby_(programming_language)'> ruby </a>
  35. The parser for nested parens:
    /(?<paren>
      \(
      (
        [^\(\)]+    # non-paren chars
        |           # or
        \g<paren>   # a paren
      )*
      \)
    )/x
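Applied to the link from the earlier slide, the recursive pattern picks out the whole balanced URL part. This is a usage sketch; the constant name `PAREN` and the surrounding sample text are assumptions made for the example.

```ruby
PAREN = /(?<paren>
  \(
  (?:
    [^()]+      # non-paren chars
    |           # or
    \g<paren>   # a nested, balanced paren group
  )*
  \)
)/x

link = 'see [ruby](http://en.wikipedia.org/wiki/Ruby_(programming_language)) now'
puts link[PAREN]      # prints "(http://en.wikipedia.org/wiki/Ruby_(programming_language))"
puts 'f(a(b)'[PAREN]  # prints "(b)", the only balanced group in the string
```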
  36. ? + * +? *? ?+ ++ *+
    Can you tell them all apart?
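The three families are greedy (`?`, `+`, `*`), lazy (`+?`, `*?`) and possessive (`?+`, `++`, `*+`). A minimal sketch of how the `+` variants differ on one made-up sample string:

```ruby
TEXT = '<b>bold</b>'

GREEDY     = /<.+>/    # takes as much as possible, then backs off as needed
LAZY       = /<.+?>/   # takes as little as possible
POSSESSIVE = /<.++>/   # takes as much as possible and never gives it back

p TEXT[GREEDY]      # prints "<b>bold</b>"
p TEXT[LAZY]        # prints "<b>"
p TEXT[POSSESSIVE]  # prints nil: ".++" swallows the final ">" and cannot backtrack
```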