More than Regexp

November 18, 2012

RubyConfChina 2012: regular expressions that go beyond regular expressions

Zete

Transcript

  1. I mastered Basic, C, C++, Objective-C, Java, C#, Mathematica, Ruby, Perl, Python, CoffeeScript, Haskell, Scala, Groovy, R, SML, Erlang, F#, MASM / GNU assembly, LLVM assembly

    Monday, November 19, 12
  2. I mastered Basic, C, C++, Objective-C, Java, C#, Mathematica, Ruby, Perl, Python, CoffeeScript, Haskell, Scala, Groovy, R, SML, Erlang, F#, MASM / GNU assembly, LLVM assembly's hello world
  3. I know quite a lot about web development technologies, compiler techniques, functional programming, quantum mechanics, abstract algebra, category theory and medieval history
  4. I know quite a lot about web development technologies, compiler techniques, functional programming, quantum mechanics, abstract algebra, category theory and medieval history, a little bit
  5. ★ Search and replace ★ Validate form data ★ Implement protocols ★ Virus scan
  6. ★ Search and replace ★ Validate form data ★ Implement protocols ★ Virus scan ★ Match mRNA motifs
  7. ★ Search and replace ★ Validate form data ★ Implement protocols ★ Virus scan ★ Match mRNA motifs ★ ...
  8. ★ Stephen Cole Kleene, 1950s ★ Ken Thompson, ed, g/re/p ★ Larry Wall, Perl
  13. ★ Lookaround ★ Unicode support ★ Matching history ★ In fact, a PEG engine
  14. ★ Lookaround ★ Unicode support ★ Matching history ★ In fact, a PEG engine ★ Ruby 1.9: Oniguruma (鬼車)
  15. ★ Lookaround ★ Unicode support ★ Matching history ★ In fact, a PEG engine ★ Ruby 1.9: Oniguruma (鬼車) ★ Ruby 2.0: Onigmo (鬼雲)
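The lookaround and Unicode support listed above can be tried directly in Ruby. This is a minimal sketch with made-up sample strings; it assumes a UTF-8 source file, which is the default since Ruby 2.0.

```ruby
# Lookahead: take the digits only when " yen" follows (the unit is not consumed).
amount = "price: 100 yen"[/\d+(?= yen)/]

# Lookbehind: take the word characters only when a dot precedes them.
extension = "slides.rb"[/(?<=\.)\w+/]

# Unicode property: a run of Han characters.
han = "Ruby 鬼車"[/\p{Han}+/]

puts amount, extension, han
```

Neither lookaround consumes input, which is why the unit and the dot do not appear in the results.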
  16. Regexp in Ruby

        RUBY_VERSION =~ /1\.9|2\.0/

        s = "肿么办?"          # slang spelling of "怎么办?" ("What to do?")
        s[/肿么/] = "怎么"     # s is now "怎么办?"

        %r(#{words.join '|'})
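The idioms on this slide can be run as shown below. This is a small sketch with made-up data: `words` is a hypothetical array, and `Regexp.union` is not on the slide. It is added as an alternative to joining by hand because it escapes metacharacters in each word.

```ruby
s = "hello world".dup        # dup so the string is safe to mutate
found = s[/wor\w+/]          # String#[] with a regexp returns the matched text
s[/wor\w+/] = "Ruby"         # String#[]= replaces the first match in place

words = ["cat", "a.b"]       # hypothetical word list
joined = %r(#{words.join '|'})   # the "." in "a.b" stays a wildcard
union  = Regexp.union(words)     # the "." in "a.b" is escaped

joined_hits = !(joined =~ "axb").nil?
union_hits  = !(union  =~ "axb").nil?

puts found, s, joined_hits, union_hits
```

Interpolating a joined list is fine when the words are known to be plain text. With user-supplied words, escaping matters.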
  17. foo    "foo" exactly
      .      matches ANY char (except newline, unless /m is set)
      a|b    a or b
      a?     maybe yes, maybe no
  18. foo    "foo" exactly
      .      matches ANY char (except newline, unless /m is set)
      a|b    a or b
      a?     maybe yes, maybe no
      a*     kleeeeeene star
  19. foo    "foo" exactly
      .      matches ANY char (except newline, unless /m is set)
      a|b    a or b
      a?     maybe yes, maybe no
      a*     kleeeeeene star
      a{0}   repeat 0 times
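Each row of this cheat sheet can be checked with `=~`, which returns the index of the first match or `nil`. The sketch below uses made-up sample strings. Two details are worth seeing in action: `.` skips newlines unless the `/m` flag is given, and `a{0}` always matches the empty string.

```ruby
literal    = "seafood" =~ /foo/    # index where "foo" starts
dot        = "a\nb" =~ /a.b/       # "." does not match "\n" by default
dot_m      = "a\nb" =~ /a.b/m      # with /m it does
either     = "b" =~ /a|b/
optional   = "color" =~ /colou?r/  # the "u" may be absent
star       = "baaa"[/ba*/]         # "a" repeated as often as possible
zero_times = "ab" =~ /a{0}b/       # a{0} consumes nothing, so "b" matches at index 1

p [literal, dot, dot_m, either, optional, star, zero_times]
```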
  20. (a)          group
      \1           back ref (fixed: matches the same text again)
      (?<name>a)   define named group
      \g<name>     use named ref (re-runs the group's pattern)
  21. Difference between back ref and named group

        backref = /(\w+) \1/
        backref =~ 'ha ha' # 0
        backref =~ 'ha ho' # nil

        named = /(?<word>\w+) \g<word>/
        named =~ 'ha ha' # 0
        named =~ 'ha ho' # 0
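To see the difference in one place, the sketch below puts a named back reference (`\k<word>`, which must repeat the captured text) next to a subexpression call (`\g<word>`, which re-runs the group's pattern). The sample strings are the slide's own. `\k<word>` is added here for comparison and is not on the slide.

```ruby
backref = /(?<word>\w+) \k<word>/   # second word must equal the first
call    = /(?<word>\w+) \g<word>/   # second word must merely look like a word

same_text    = [backref =~ 'ha ha', backref =~ 'ha ho']
same_pattern = [call =~ 'ha ha', call =~ 'ha ho']

p same_text, same_pattern
```

The subexpression call is what makes grammar-like, recursive patterns possible, since a group can call itself or another group.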
  22. ★ A complex regexp packs in a lot of information ★ Add whitespace to make it human-readable
  23. ★ A complex regexp packs in a lot of information ★ Add whitespace to make it human-readable ★ Try not to write overly complex regexps
  24. Alignment reduces visual complexity:

        /^ [\ \t]* (?:class) \s*
           (.*?)    \s*
           (<.*?)?  \s*
           (\#.*)?
         $/x
  25. Add comments

        r = /^ [\ \t]* (?:class) \s*
               (.*?)    \s*   # class name
               (<.*?)?  \s*   # inheritance
               (\#.*)?        # line comment
             $/x
        r =~ "class A < B # match!"
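As a quick check of what the commented pattern captures, here is a runnable sketch. The constant name `CLASS_LINE` is made up for this example. Note that in `/x` mode an unescaped `#` starts a comment, so the literal hash is written as `\#`.

```ruby
CLASS_LINE = /^ [\ \t]* (?:class) \s*
  (.*?)    \s*   # class name
  (<.*?)?  \s*   # inheritance
  (\#.*)?        # line comment
$/x

m = CLASS_LINE.match("class A < B # match!")
class_name, inheritance, comment = m[1], m[2], m[3]
no_match = CLASS_LINE.match("module A")

p class_name, inheritance, comment, no_match
```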
  26. Grammar in BNF Notation:

        <A> ::= "我知道" <B>?     ("I know")
        <B> ::= "你知道" <A>      ("you know")
  27. Grammar in BNF Notation:

        <A> ::= "我知道" <B>?     ("I know")
        <B> ::= "你知道" <A>      ("you know")

    Tail Recursion
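  28. Group names are translated to English; the Chinese literals are glossed in comments:

        /
          (?<subject>          蜘蛛侠|我 ){0}                            # Spider-Man | I
          (?<predicate>        是|在RubyConfChina上讲了 ){0}              # is | told at RubyConfChina
          (?<object>           章鱼博士的儿子|冷笑话:\g<object_appositive> ){0}  # Doctor Octopus's son | a cold joke: ...
          (?<statement>        \g<subject>\g<predicate>\g<object> ){0}
          (?<cold_joke>        \g<statement>,(\g<cause_clause>|\g<contrast_clause>) ){0}
          (?<cause_clause>     因为蜘蛛和章鱼都有八条腿 ){0}              # because spiders and octopuses both have eight legs
          (?<object_appositive> “\g<cold_joke>” ){0}
          (?<contrast_clause>  但大家没有笑 ){0}                          # but nobody laughed
          \g<cold_joke>
        /x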
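The `(?<name> ... ){0}` idiom defines a rule without matching anything at that point, and `\g<name>` then calls it. As a smaller sketch of the same technique, the two-rule BNF grammar from the earlier slide can be written this way. The English strings "I know" and " you know " are stand-ins chosen for this example, and `GRAMMAR` is a made-up name.

```ruby
GRAMMAR = /
  (?<a> I\ know \g<b>? ){0}          # rule A: "I know", optionally followed by B
  (?<b> \ you\ know\  \g<a> ){0}     # rule B: " you know " followed by A
  \A \g<a> \z                        # the whole string must be one A
/x

short   = GRAMMAR =~ "I know"
nested  = GRAMMAR =~ "I know you know I know"
dangled = GRAMMAR =~ "I know you know"    # B needs a trailing A
wrong   = GRAMMAR =~ "you know I know"    # must start with A

p [short, nested, dangled, wrong]
```

Spaces are escaped as `\ ` because `/x` mode ignores plain whitespace in the pattern.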
  29. Plug dictionaries into the sentence components and you can build a natural-language parser with "Regexp" (a PEG, in fact)
  30. Plug dictionaries into the sentence components and you can build a natural-language parser with "Regexp" (a PEG, in fact)

        (?<subject> 蜘蛛侠|蝙蝠侠|灯笼侠|... )    # Spider-Man | Batman | Lantern-Man | ...
  31. Real-world languages need a bit more than a PEG; in general they call for a Context Free Grammar.
  32. In a CFG, the branches are not ordered: if A and B are two rules, A|B is the same as B|A.
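By contrast, alternation in a regexp engine is ordered, so swapping the branches can change the result. The sketch below shows this with made-up inputs. The atomic group `(?>...)` is used here only as a rough approximation of a PEG's committed choice; that comparison is this example's assumption, not a claim from the slides.

```ruby
# Ordered choice: the first branch that matches wins.
first_a  = "ab"[/a|ab/]
first_ab = "ab"[/ab|a/]

# With an atomic group the engine cannot go back and try the other branch.
committed = /\A(?>a|ab)\z/ =~ "ab"   # "a" is taken, then \z fails for good
reordered = /\A(?>ab|a)\z/ =~ "ab"   # "ab" is tried first and succeeds

p [first_a, first_ab, committed, reordered]
```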
  33. Real-world example: a simple Markdown parser in 130 lines (many features ignored, but...)
  34. It supports nested parens! (while ruby-china doesn't)

        [ruby](http://en.wikipedia.org/wiki/Ruby_(programming_language))

        <a href='http://en.wikipedia.org/wiki/Ruby_(programming_language)'>ruby</a>
  35. The parser for nested parens:

        /(?<paren>
          \(
          (
            [^\(\)]+    # non-paren chars
            |           # or
            \g<paren>   # a paren
          )*
          \)
        )/x
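The recursive pattern can be exercised on the Markdown link from the previous slide. This is a small sketch; `PAREN` and the test strings are names and data chosen for this example, and the inner group is written as `(?:...)` for explicitness.

```ruby
PAREN = /(?<paren>
  \(
  (?:
    [^\(\)]+    # non-paren chars
    |           # or
    \g<paren>   # a nested paren group
  )*
  \)
)/x

url      = "http://en.wikipedia.org/wiki/Ruby_(programming_language)"
target   = "[ruby](#{url})"[PAREN]   # the whole link target, inner parens included
balanced = "(a(b)c) tail"[PAREN]
broken   = "(a(b"[PAREN]             # no closing paren anywhere

p target, balanced, broken
```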
  36. ? + * +? *? ?+ ++ *+ Can you tell them all apart?
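One way to tell them apart is to run each family on a tiny input. The sketch below uses made-up strings: plain quantifiers are greedy, a trailing `?` makes them lazy, and a trailing `+` makes them possessive (greedy, and unwilling to give anything back).

```ruby
greedy    = "aaa"[/a+/]     # takes as much as possible
lazy      = "aaa"[/a+?/]    # takes as little as possible
lazy_star = "aaa"[/a*?/]    # as little as possible can be nothing at all

# Greedy quantifiers give characters back when the rest of the pattern needs them.
gives_back = /a+a/ =~ "aaa"
# Possessive quantifiers keep everything, so the final "a" has nothing left to match.
keeps_all  = /a++a/ =~ "aaa"

optional_greedy     = /a?a/  =~ "a"
optional_possessive = /a?+a/ =~ "a"

p [greedy, lazy, lazy_star, gives_back, keeps_all, optional_greedy, optional_possessive]
```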