Slide 1

Slide 1 text

More Than Regexp https://github.com/luikore Monday, November 19, 12

Slide 2

Slide 2 text

Who am I

Slide 3

Slide 3 text

I mastered Basic, C, C++, Objective-C, Java, C#, Mathematica, Ruby, Perl, Python, CoffeeScript, Haskell, Scala, Groovy, R, SML, Erlang, F#, MASM / GNU assembly, LLVM assembly

Slide 4

Slide 4 text

I mastered Basic, C, C++, Objective-C, Java, C#, Mathematica, Ruby, Perl, Python, CoffeeScript, Haskell, Scala, Groovy, R, SML, Erlang, F#, MASM / GNU assembly, LLVM assembly's hello world

Slide 5

Slide 5 text

I know quite a lot about web development technologies, compiler techniques, functional programming, quantum mechanics, abstract algebra, category theory and medieval history

Slide 6

Slide 6 text

I know quite a lot about web development technologies, compiler techniques, functional programming, quantum mechanics, abstract algebra, category theory and medieval history a little bit

Slide 7

Slide 7 text

I am a rubyist just like you

Slide 8

Slide 8 text

on Languages

Slide 9

Slide 9 text

Parsing
Parsing is reading, the computer way

Slide 10

Slide 10 text

Regexp
The built-in parsing tool of Ruby

Slide 11

Slide 11 text

★ Search and replace

Slide 12

Slide 12 text

★ Search and replace
★ Validate form data

Slide 13

Slide 13 text

★ Search and replace
★ Validate form data
★ Implement protocols

Slide 14

Slide 14 text

★ Search and replace
★ Validate form data
★ Implement protocols
★ Virus scan

Slide 15

Slide 15 text

★ Search and replace
★ Validate form data
★ Implement protocols
★ Virus scan
★ Matching mRNA motif

Slide 16

Slide 16 text

★ Search and replace
★ Validate form data
★ Implement protocols
★ Virus scan
★ Matching mRNA motif
★ ...

Slide 17

Slide 17 text

A Brief History

Slide 18

Slide 18 text

★ Stephen Cole Kleene, 1950s
★ Ken Thompson, ed, g/re/p
★ Larry Wall, Perl

Slide 23

Slide 23 text

Modern Implementations
Go far beyond the original definition

Slide 24

Slide 24 text

★ Lookaround

Slide 25

Slide 25 text

★ Lookaround
★ Unicode support

Slide 26

Slide 26 text

★ Lookaround
★ Unicode support
★ Matching history

Slide 27

Slide 27 text

★ Lookaround
★ Unicode support
★ Matching history
★ PEG engine in fact

Slide 28

Slide 28 text

★ Lookaround
★ Unicode support
★ Matching history
★ PEG engine in fact
★ Ruby 1.9 Oniguruma (鬼车)

Slide 29

Slide 29 text

★ Lookaround
★ Unicode support
★ Matching history
★ PEG engine in fact
★ Ruby 1.9 Oniguruma (鬼车)
★ Ruby 2.0 Onigmo (鬼云)

Slide 30

Slide 30 text

Regexp in Ruby
RUBY_VERSION =~ /1\.9|2\.0/
s = "肿么办?"
s[/肿么/] = "怎么"
#=> "怎么办?"
%r(#{words.join '|'})

Slide 31

Slide 31 text

foo    "foo" exactly

Slide 32

Slide 32 text

foo    "foo" exactly
.      matches ANY char (except newline, unless /m)

Slide 33

Slide 33 text

foo    "foo" exactly
.      matches ANY char (except newline, unless /m)
a|b    or

Slide 34

Slide 34 text

foo    "foo" exactly
.      matches ANY char (except newline, unless /m)
a|b    or
a?     maybe yes, maybe no

Slide 35

Slide 35 text

foo    "foo" exactly
.      matches ANY char (except newline, unless /m)
a|b    or
a?     maybe yes, maybe no
a*     kleeeeeene star

Slide 36

Slide 36 text

foo    "foo" exactly
.      matches ANY char (except newline, unless /m)
a|b    or
a?     maybe yes, maybe no
a*     kleeeeeene star
a{0}   repeat 0 times
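
The building blocks listed above can be checked with a small Ruby sketch. The sample strings and the constant name `BASIC_EXAMPLES` are made up here for illustration.

```ruby
# Each entry: [pattern, input, substring we expect String#[] to return].
BASIC_EXAMPLES = [
  [/foo/,    "a foo b", "foo"],    # literal text
  [/f.o/,    "f-o",     "f-o"],    # . matches any char (not newline without /m)
  [/a|b/,    "xbz",     "b"],      # alternation
  [/ab?c/,   "ac",      "ac"],     # ? makes b optional
  [/ab*c/,   "abbbc",   "abbbc"],  # * repeats b zero or more times
  [/ab{0}c/, "ac",      "ac"],     # {0} repeats b exactly zero times
]

BASIC_EXAMPLES.each do |pattern, input, expected|
  actual = input[pattern]
  puts "#{pattern.inspect} on #{input.inspect} -> #{actual.inspect} (expected #{expected.inspect})"
end
```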

Slide 37

Slide 37 text

(a)          group

Slide 38

Slide 38 text

(a)          group
\1           back ref (fixed)

Slide 39

Slide 39 text

(a)          group
\1           back ref (fixed)
(?<name>a)   define named group

Slide 40

Slide 40 text

(a)          group
\1           back ref (fixed)
(?<name>a)   define named group
\g<name>     use named ref

Slide 41

Slide 41 text

Difference between back ref and named group
backref = /(\w+) \1/
backref =~ 'ha ha' # 0
backref =~ 'ha ho' # nil
named = /(?<word>\w+) \g<word>/
named =~ 'ha ha' # 0
named =~ 'ha ho' # 0
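
A runnable sketch of the same comparison, assuming a group name `word` (the name and the constants are chosen here for illustration): `\1` must repeat the text that was captured, while `\g<word>` re-runs the group's pattern.

```ruby
# \1 must match the SAME TEXT that group 1 captured.
BACKREF = /(\w+) \1/
# \g<word> re-runs the PATTERN of the group named "word".
SUBEXP = /(?<word>\w+) \g<word>/

["ha ha", "ha ho"].each do |s|
  puts "#{s.inspect}: backref=#{(BACKREF =~ s).inspect} subexp=#{(SUBEXP =~ s).inspect}"
end
```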

Slide 42

Slide 42 text

★ Complex regexp contains much information

Slide 43

Slide 43 text

★ Complex regexp contains much information
★ Add space to make it human-readable

Slide 44

Slide 44 text

★ Complex regexp contains much information
★ Add space to make it human-readable
★ Try not to make too-complex regexps

Slide 45

Slide 45 text

What does it do?
/^[ \t]*(?:class)\s*(.*?)\s*(<.*?)?\s*(#.*)?$/

Slide 46

Slide 46 text

Add margins and paddings
/^ [\ \t]* (?:class)\s* (.*?)\s* (<.*?)?\s* (\#.*)? $/x

Slide 47

Slide 47 text

Alignment reduces visual complexity:
/^
  [\ \t]* (?:class) \s*
  (.*?)   \s*
  (<.*?)? \s*
  (\#.*)?
$/x

Slide 48

Slide 48 text

Add comments
r = /^
  [\ \t]* (?:class) \s*
  (.*?)   \s*   # class name
  (<.*?)? \s*   # inheritance
  (\#.*)?       # line comment
$/x
r =~ "class A < B # match!"
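
A minimal sketch of the commented regexp in use. Two assumptions here: the literal `#` is written as `\#`, because an unescaped `#` starts a comment in extended (`/x`) mode, and the helper `parse_class_line` is a hypothetical name added for illustration.

```ruby
CLASS_LINE = /^
  [\ \t]* (?:class) \s*
  (.*?)   \s*   # class name
  (<.*?)? \s*   # inheritance
  (\#.*)?       # line comment (escaped: a bare hash would start a comment)
$/x

# Hypothetical helper: returns the three captures as a hash, or nil.
def parse_class_line(line)
  m = CLASS_LINE.match(line)
  m && { name: m[1], parent: m[2], comment: m[3] }
end

p parse_class_line("class A < B # match!")
```

Groups that do not take part in the match come back as `nil`, so a bare `class Foo` line yields only a name.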

Slide 49

Slide 49 text

Mathematical modeling languages
Formal Language Theory

Slide 50

Slide 50 text

Mathematical modeling languages
Formal Language Theory
数学 (mathematics)

Slide 51

Slide 51 text

Mathematical modeling languages
Formal Language Theory
数学 (mathematics)
语文 (language)

Slide 52

Slide 52 text

A Regular Expression expresses a Regular Grammar, which recognizes a Regular Language, which is Non-Recursive

Slide 53

Slide 53 text

A Parsing Expression Grammar recognizes a parsing expression language, which can be Recursive
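
To make "recursive" concrete, here is a small sketch (not from the talk) of a pattern no regular expression in the strict sense can express: n `a`s followed by exactly n `b`s. The group name `pair` and the helper `nested_ab?` are assumptions made for this example.

```ruby
# The group calls itself through \g<pair>; the call is optional,
# so the recursion stops at the innermost "ab".
NESTED_AB = /\A(?<pair>a\g<pair>?b)\z/

# Hypothetical helper returning true or false.
def nested_ab?(s)
  !(NESTED_AB =~ s).nil?
end

%w[ab aabb aaabbb aab abab].each do |s|
  puts "#{s}: #{nested_ab?(s)}"
end
```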

Slide 54

Slide 54 text


Slide 55

Slide 55 text


Slide 56

Slide 56 text


Slide 57

Slide 57 text

Example -- Match the following strings:

Slide 58

Slide 58 text

Example -- Match the following strings:
我知道 (I know)

Slide 59

Slide 59 text

Example -- Match the following strings:
我知道 (I know)
我知道你知道我知道 (I know you know I know)

Slide 60

Slide 60 text

Example -- Match the following strings:
我知道 (I know)
我知道你知道我知道 (I know you know I know)
我知道你知道我知道你知道我知道

Slide 61

Slide 61 text

Example -- Match the following strings:
我知道 (I know)
我知道你知道我知道 (I know you know I know)
我知道你知道我知道你知道我知道
我知道你知道我知道你知道我知道你知道我知道

Slide 62

Slide 62 text

Example -- Match the following strings:
我知道 (I know)
我知道你知道我知道 (I know you know I know)
我知道你知道我知道你知道我知道
我知道你知道我知道你知道我知道你知道我知道
...

Slide 63

Slide 63 text

The Language (A Regular Language):
L = { 我知道, 我知道你知道我知道, ... }

Slide 64

Slide 64 text

The Regexp (A Regular Grammar):
/我知道(你知道我知道)*/
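
A sketch of this regexp in Ruby, assuming the sample strings are 我知道 ("I know") followed by any number of 你知道我知道 ("you know I know"). The `\A` and `\z` anchors and the helper name are additions made here so that the whole string has to belong to the language.

```ruby
I_KNOW = /\A我知道(你知道我知道)*\z/

# Hypothetical helper: is the string a member of the language L?
def in_language?(s)
  !(I_KNOW =~ s).nil?
end

puts in_language?("我知道")
puts in_language?("我知道你知道我知道")
puts in_language?("我知道你知道")
```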

Slide 65

Slide 65 text

Structural Analysis:
我知道(你知道( 我知道(我知道( 你知道(我知道) )) ))

Slide 66

Slide 66 text

Slide 68

Slide 68 text

Slide 69

Slide 69 text

The really-recursive example (1):
蜘蛛侠是章鱼博士的儿子 (Spider-Man is Doctor Octopus's son)

Slide 70

Slide 70 text

The really-recursive example (1):
蜘蛛侠是章鱼博士的儿子 (Spider-Man is Doctor Octopus's son)
主语 (subject)

Slide 71

Slide 71 text

The really-recursive example (1):
蜘蛛侠是章鱼博士的儿子 (Spider-Man is Doctor Octopus's son)
主语 (subject)
谓语 (predicate)

Slide 72

Slide 72 text

The really-recursive example (1):
蜘蛛侠是章鱼博士的儿子 (Spider-Man is Doctor Octopus's son)
主语 (subject)
谓语 (predicate)
宾语 (object)

Slide 73

Slide 73 text

(?<主语>蜘蛛侠)   # subject
(?<谓语>是)   # predicate
(?<宾语>章鱼博士的儿子)   # object
(?<陈述句>\g<主语>\g<谓语>\g<宾语>)   # statement

Slide 74

Slide 74 text

The really-recursive example (2):
蜘蛛侠是章鱼博士的儿子, 因为蜘蛛和章鱼都有八条腿 (Spider-Man is Doctor Octopus's son, because spiders and octopuses both have eight legs)

Slide 75

Slide 75 text

The really-recursive example (2):
蜘蛛侠是章鱼博士的儿子, 因为蜘蛛和章鱼都有八条腿 (Spider-Man is Doctor Octopus's son, because spiders and octopuses both have eight legs)
陈述句 (statement)

Slide 76

Slide 76 text

The really-recursive example (2):
蜘蛛侠是章鱼博士的儿子, 因为蜘蛛和章鱼都有八条腿 (Spider-Man is Doctor Octopus's son, because spiders and octopuses both have eight legs)
陈述句 (statement)
原因从句 (causal clause)

Slide 77

Slide 77 text

(?<原因从句>因为蜘蛛和章鱼都有八条腿)   # causal clause
(?<冷笑话>\g<陈述句>,\g<原因从句>)   # lame joke

Slide 78

Slide 78 text

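The really-recursive example (3):
我在RubyConfChina上讲了冷笑话: “蜘蛛侠是章鱼博士的儿子, 因为蜘蛛和章鱼都有八条腿” (I told a lame joke at RubyConfChina: "Spider-Man is Doctor Octopus's son, because spiders and octopuses both have eight legs")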

Slide 79

Slide 79 text

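The really-recursive example (3):
我在RubyConfChina上讲了冷笑话: “蜘蛛侠是章鱼博士的儿子, 因为蜘蛛和章鱼都有八条腿” (I told a lame joke at RubyConfChina: "Spider-Man is Doctor Octopus's son, because spiders and octopuses both have eight legs")
陈述句 (statement)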

Slide 80

Slide 80 text

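The really-recursive example (3):
我在RubyConfChina上讲了冷笑话: “蜘蛛侠是章鱼博士的儿子, 因为蜘蛛和章鱼都有八条腿” (I told a lame joke at RubyConfChina: "Spider-Man is Doctor Octopus's son, because spiders and octopuses both have eight legs")
陈述句 (statement)
宾语同位语 (object appositive)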

Slide 81

Slide 81 text

(?<主语>蜘蛛侠|我)   # subject
(?<谓语>是|在RubyConfChina上讲了)   # predicate
(?<宾语>章鱼博士的儿子|冷笑话:\g<宾语同位语>)   # object
(?<宾语同位语>“\g<陈述句>”)   # object appositive

Slide 82

Slide 82 text

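The really-recursive example (4):
我在RubyConfChina上讲了冷笑话: “蜘蛛侠是章鱼博士的儿子, 因为蜘蛛和章鱼都有八条腿”,但大家没有笑 (I told a lame joke at RubyConfChina: "Spider-Man is Doctor Octopus's son, because spiders and octopuses both have eight legs", but nobody laughed)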

Slide 83

Slide 83 text

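The really-recursive example (4):
我在RubyConfChina上讲了冷笑话: “蜘蛛侠是章鱼博士的儿子, 因为蜘蛛和章鱼都有八条腿”,但大家没有笑 (I told a lame joke at RubyConfChina: "Spider-Man is Doctor Octopus's son, because spiders and octopuses both have eight legs", but nobody laughed)
转折从句 (adversative clause)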

Slide 84

Slide 84 text

(?<转折从句>但大家没有笑)   # adversative clause
(?<冷笑话>\g<陈述句>,(\g<原因从句>|\g<转折从句>))   # lame joke

Slide 85

Slide 85 text

Combine them all

Slide 86

Slide 86 text

/
  (?<主语> 蜘蛛侠|我 ){0}   # subject
  (?<谓语> 是|在RubyConfChina上讲了 ){0}   # predicate
  (?<宾语> 章鱼博士的儿子|冷笑话:\g<宾语同位语> ){0}   # object
  (?<陈述句> \g<主语>\g<谓语>\g<宾语> ){0}   # statement
  (?<冷笑话> \g<陈述句>,(\g<原因从句>|\g<转折从句>) ){0}   # lame joke
  (?<原因从句> 因为蜘蛛和章鱼都有八条腿 ){0}   # causal clause
  (?<宾语同位语> “\g<冷笑话>” ){0}   # object appositive
  (?<转折从句> 但大家没有笑 ){0}   # adversative clause
  \g<冷笑话>
/x
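
The `(?<name> ... ){0}` idiom above defines a rule without matching anything at that spot; only the final `\g<...>` call consumes input. Here is a much smaller sketch of the same idea, with a made-up English vocabulary standing in for the talk's grammar. All rule names, `TOY_GRAMMAR`, and `toy_sentence?` are assumptions for illustration.

```ruby
TOY_GRAMMAR = /
  (?<subject>   I | Spider-Man                        ){0}
  (?<verb>      is | said                             ){0}
  (?<object>    a\ hero | \g<quote>                   ){0}
  (?<statement> \g<subject> \s \g<verb> \s \g<object> ){0}
  (?<quote>     "\g<statement>"                       ){0}
  \A \g<statement> \z
/x

# Hypothetical helper: does the whole string parse as a statement?
def toy_sentence?(s)
  !(TOY_GRAMMAR =~ s).nil?
end

puts toy_sentence?('Spider-Man is a hero')
puts toy_sentence?('I said "I said "Spider-Man is a hero""')
puts toy_sentence?('Spider-Man is')
```

The recursion goes statement, object, quote, and back to statement, so quotes can nest to any depth.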

Slide 87

Slide 87 text

Using dictionaries for the sentence components, you can make a natural language parser with "Regexp" (PEG in fact)

Slide 88

Slide 88 text

Using dictionaries for the sentence components, you can make a natural language parser with "Regexp" (PEG in fact)
(?<主语>蜘蛛侠|蝙蝠侠|灯笼侠|...)

Slide 89

Slide 89 text

Real-world languages need a bit more than PEG: generally a Context-Free Grammar (CFG).

Slide 90

Slide 90 text

In CFG, the branches are not ordered: if A and B are two rules, A|B is the same as B|A in CFG.
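
By contrast, PEG choice is ordered. Ruby's plain alternation backtracks, so to imitate PEG's commit-to-first-match behaviour this sketch wraps the alternatives in an atomic group `(?>...)`. That is a choice made for this example, not something shown in the talk.

```ruby
# An atomic group commits to the first alternative that matches and
# never retries the later ones.
SHORT_FIRST = /\A(?>a|ab)\z/   # "a" wins, then \z fails on "ab"
LONG_FIRST  = /\A(?>ab|a)\z/   # "ab" is tried first

puts (SHORT_FIRST =~ "ab").inspect
puts (LONG_FIRST =~ "ab").inspect
```

With ordered choice, A|B and B|A accept different strings, which is exactly what an unordered CFG alternation never does.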

Slide 91

Slide 91 text

Even CFG can't solve some ambiguity

Slide 92

Slide 92 text

Even CFG can't solve some ambiguity
有一大波僵尸正在接近 (A huge wave of zombies is approaching)

Slide 93

Slide 93 text

Even CFG can't solve some ambiguity
有一大波僵尸正在接近 (A huge wave of zombies is approaching)
( ) ??

Slide 94

Slide 94 text

Now you know more than regexp
parsec, rsec, treetop, parslet ...

Slide 95

Slide 95 text

Real world example
Simple markdown parser in 130 lines (many features ignored but...)

Slide 96

Slide 96 text

It supports nested parens! (while ruby-china doesn't)
[ruby](http://en.wikipedia.org/wiki/Ruby_(programming_language))
ruby

Slide 97

Slide 97 text

The parser for nested parens:
/(?<paren>
  \(
  (
    [^\(\)]+   # non-paren chars
    |          # or
    \g<paren>  # a paren
  )*
  \)
)/x
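
A runnable sketch of the nested-paren matcher applied to a markdown link. The group name `paren` and the constant `PAREN` are assumptions made here; the 130-line parser itself is not reproduced.

```ruby
PAREN = /(?<paren>
  \(
    (?:
      [^()]+       # non-paren chars
      |            # or
      \g<paren>    # a nested paren group
    )*
  \)
)/x

link = "[ruby](http://en.wikipedia.org/wiki/Ruby_(programming_language))"
# String#[] with a regexp returns the first whole match.
puts link[PAREN]
```

A non-recursive pattern such as `\(.*?\)` would stop at the first closing paren and cut the URL short.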

Slide 98

Slide 98 text

Helpful Links
★ ri Regexp
★ https://github.com/k-takata/Onigmo/tree/master/doc/RE
★ http://en.wikipedia.org/wiki/Parsing_Expression_Grammar

Slide 99

Slide 99 text

More sugars with Onigmo
If there's time...

Slide 100

Slide 100 text

/\p{Han}/
/\p{Hiragana,Katakana}/
/\g'name'/
/\g<3>/
/\g'-3'/
/\k<1>/ # /\1/
/\k'name'/
/\k<-1>/
/\k'-n-level'/
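
A few of these in action, as a minimal sketch. The sample strings and constant names are made up, and the kana example uses a character class of two separate properties rather than the comma form shown above.

```ruby
HAN      = /\p{Han}+/                     # a run of Han (CJK) characters
KANA     = /[\p{Hiragana}\p{Katakana}]+/  # a run of kana
REPEATED = /(?<w>\w+) \k<w>/              # \k<w> is a named back reference

puts "Ruby 鬼车 1.9"[HAN]
puts "abc ひらがなカタカナ xyz"[KANA]
puts (REPEATED =~ "ha ha").inspect
```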

Slide 101

Slide 101 text

/(?<=backward)something(?=forward)/ /(?
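
Lookarounds assert what surrounds a match without consuming it. A minimal sketch with made-up sample strings and constant names:

```ruby
# (?<=...) looks behind, (?=...) looks ahead; neither is part of the match.
PRICE   = /(?<=\$)\d+(?=\.)/
# (?!...) is a negative lookahead: "foo" not followed by "bar".
NOT_BAR = /foo(?!bar)/

puts 'cost: $42.50'[PRICE]
puts (NOT_BAR =~ 'foobar foobaz').inspect
```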

Slide 102

Slide 102 text

? + * +? *? ?+ ++ *+
Can you tell them all apart?
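
One way to tell them apart, as a small sketch: plain `?`, `+`, `*` are greedy, a trailing `?` makes them lazy, and a trailing `+` makes them possessive. The hash name and sample string are made up for illustration.

```ruby
QUANTIFIER_DEMO = {
  greedy:           "aaa"[/a+/],    # takes as much as possible
  lazy:             "aaa"[/a+?/],   # takes as little as possible
  greedy_then_a:    "aaa"[/a+a/],   # a+ gives one char back so the last a can match
  possessive_then_a: "aaa"[/a++a/], # a++ never gives back, so the last a cannot match
}

QUANTIFIER_DEMO.each { |kind, result| puts "#{kind}: #{result.inspect}" }
```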

Slide 103

Slide 103 text

Thanks