More Than Regexp
https://github.com/luikore
Monday, November 19, 12
Slide 2
Who am I
Slide 4
I mastered Basic, C, C++, Objective-C, Java,
C#, Mathematica, Ruby, Perl, Python,
CoffeeScript, Haskell, Scala, Groovy, R, SML,
Erlang, F#, MASM / GNU assembly, LLVM
assembly’s hello world
Slide 6
I know quite a lot about web development
technologies, compiler techniques, functional
programming, quantum mechanics, abstract
algebra, category theory and medieval history
a little bit
Slide 7
I am a rubyist just like you
Slide 8
on Languages
Slide 9
Parsing
Parsing is reading, the computer way
Slide 10
Regexp
The built-in parsing tool of Ruby
Slide 16
★ Search and replace
★ Validate form data
★ Implement protocols
★ Virus scan
★ Matching mRNA motif
★ ...
Slide 17
A Brief History
Slide 22
★ Stephen Cole Kleene, 1950s
★ Ken Thompson, ed, g/re/p
★ Larry Wall, Perl
Slide 23
Modern
Implementations
Go far beyond the original definition
Slide 29
★ Lookaround (lookahead and lookbehind)
★ Unicode support
★ Matching history
★ PEG engine in fact
★ Ruby 1.9 Oniguruma (鬼車)
★ Ruby 2.0 Onigmo (鬼雲)
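A few of these extensions in action. This is only a minimal sketch, assuming Ruby 1.9 or later and a UTF-8 source file; the sample strings and variable names are made up for illustration.

```ruby
# Lookahead: match "Ruby" only when a space and a digit follow it.
# The lookahead checks the text but does not consume it.
lookahead = "Ruby 2.0"[/Ruby(?= \d)/]

# Lookbehind: grab the digits that come right after a dollar sign.
lookbehind = "cost: $42"[/(?<=\$)\d+/]

# Unicode property: a run of Han (Chinese) characters.
han = "Oniguruma 鬼車"[/\p{Han}+/]

p lookahead, lookbehind, han
```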
Slide 30
Regexp in Ruby
RUBY_VERSION =~ /1\.9|2\.0/
s = "肿么办?" # Internet slang for "what to do?"
s[/肿么/] = "怎么" # s is now "怎么办?"
%r(#{words.join '|'})
Slide 36
foo     “foo” exactly
.       matches any char (except newline, unless /m is set)
a|b     a or b
a?      maybe yes, maybe no (zero or one)
a*      kleeeeeene star (zero or more)
a{0}    repeat 0 times
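The table above can be tried out directly in irb. The following is a small sketch rather than an exhaustive test; the sample strings and the `examples` hash are invented for illustration.

```ruby
# Each entry pairs a pattern from the table with a string it should match.
examples = {
  /foo/         => "seafood", # literal text, found anywhere in the string
  /\Aa.c\z/     => "abc",     # . stands for any one character
  /\A(?:a|b)\z/ => "b",       # alternation
  /\Aab?c\z/    => "ac",      # ? makes the b optional
  /\Aa*\z/      => "aaaa",    # * allows any number of a's, including none
  /\Aa{0}b\z/   => "b"        # {0} means the a must not appear at all
}
results = examples.map { |re, str| !!(re =~ str) }

# Without /m the dot refuses to cross a newline.
dot_plain     = "a\nc" =~ /a.c/
dot_multiline = "a\nc" =~ /a.c/m

p results, dot_plain, dot_multiline
```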
Slide 40
(a)          group
\1           back ref (fixed): matches the same text the group captured
(?<name>a)   define named group
\g<name>     call named group (runs its pattern again)
Slide 41
Difference between back ref and
named group call
backref = /(\w+) \1/
backref =~ 'ha ha' # 0
backref =~ 'ha ho' # nil
named = /(?<word>\w+) \g<word>/
named =~ 'ha ha' # 0
named =~ 'ha ho' # 0
Slide 44
★ A complex regexp packs in a lot of
information
★ Add whitespace to make it human-readable
★ Try not to write overly complex regexps
Slide 45
What does it do?
/^[ \t]*(?:class)\s*(.*?)\s*(<.*?)?\s*(#.*)?$/
Slide 46
Add margins and padding
/^
[\ \t]*
(?:class)\s*
(.*?)\s*
(<.*?)?\s*
(\#.*)?
$/x
Add comments
r =
/^
[\ \t]*
(?:class) \s*
(.*?) \s* # class name
(<.*?)? \s* # inheritance
(\#.*)? # line comment (a literal # must be escaped under /x)
$/x
r =~ "class A < B # match!"
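Named groups take readability one step further. The sketch below is a variation on the regexp above, not the slide's own code; the group names `name`, `parent` and `comment` and the constant `CLASS_LINE` are assumptions chosen for illustration.

```ruby
# Same pattern as above, with named groups instead of numbered ones.
# Under /x whitespace is ignored and an unescaped # starts a comment,
# so the literal # of a Ruby line comment is written as \#.
CLASS_LINE = /
  ^ [\ \t]*
  class \s+
  (?<name>    .*?  )  \s*   # class name
  (?<parent>  <.*? )? \s*   # inheritance
  (?<comment> \#.* )?       # line comment
  $
/x

m = CLASS_LINE.match("class A < B # match!")
p m[:name], m[:parent], m[:comment]
```

Reading `m[:parent]` says more at the call site than `$2` does.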
Slide 51
Mathematical modeling of languages
Formal Language Theory
形式语言 (formal language)
Slide 52
A Regular Expression expresses a Regular
Grammar, which recognizes a Regular
Language; such a language is Non-Recursive
Slide 53
A Parsing Expression Grammar (PEG) recognizes
a parsing expression language, which can be
Recursive
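The classic illustration of the gap between the two is "n a's followed by exactly n b's", which no regular expression in the textbook sense can describe. The sketch below uses Ruby's `\g<name>` group call for the recursive version; the group name `ab` and the sample strings are made up for illustration.

```ruby
# Regular: any number of a's, then any number of b's (the counts are unrelated).
regular = /\Aa*b*\z/

# Recursive: an "a", optionally a nested copy of the whole group, then a "b".
balanced = /\A(?<ab>a\g<ab>?b)\z/

regular_ok  = %w[ab aabb aab].map { |s| !!(s =~ regular) }
balanced_ok = %w[ab aabb aaabbb aab abb].map { |s| !!(s =~ balanced) }

p regular_ok, balanced_ok
```

The regular pattern cannot reject "aab", because it has no way to count; the recursive one can.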
Slide 62
Example -- Match the following strings:
զಓ
զಓ㟬ಓզಓ
զಓ㟬ಓզಓ㟬ಓզಓ
զಓ㟬ಓզಓ㟬ಓզಓ㟬ಓզಓ
...
Slide 63
The Language (A Regular Language):
L = {
զಓ,
զಓ㟬ಓզಓ,
...
}
Slide 64
The Regexp (A Regular Grammar):
/զಓ(㟬ಓզಓ)*/
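The same shape, "a base phrase followed by any number of connector-plus-base repetitions", can be checked with ASCII stand-ins. In this sketch "ab" stands in for the base phrase and "cb" for the connector; both are hypothetical placeholders, not the strings on the slide. The `\A` and `\z` anchors are added so that the whole string has to belong to the language.

```ruby
# "ab" is the base phrase, "cb" the connector (hypothetical stand-ins).
chain = /\Aab(?:cbab)*\z/

members     = %w[ab abcbab abcbabcbab]
non_members = ["", "abcb", "abab"]

members_ok     = members.all?      { |s| s =~ chain }
non_members_ok = non_members.none? { |s| s =~ chain }

p members_ok, non_members_ok
```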
Slide 88
Plug dictionaries into the sentence
components and you can build a natural-language
parser with “Regexp” (PEG in fact)
(?<ओ语>ᥨښ|㫴ښ|౮笼ښ|...)
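One way the dictionary idea could look in Ruby is sketched below. The English word lists, the group names and the constant `SENTENCE` are all hypothetical stand-ins for the dictionaries on the slide; a real natural-language parser would need far more than this.

```ruby
# Hypothetical English dictionaries standing in for the word lists.
SUBJECTS = %w[cats dogs rubyists]
VERBS    = %w[chase like write]
OBJECTS  = %w[mice bones parsers]

# Regexp.union escapes each word and joins the words with "|";
# .source gives the pattern text so it can be interpolated.
SENTENCE = /
  \A
  (?<subject> #{Regexp.union(SUBJECTS).source} ) \s+
  (?<verb>    #{Regexp.union(VERBS).source}    ) \s+
  (?<object>  #{Regexp.union(OBJECTS).source}  )
  \z
/x

m = SENTENCE.match("rubyists write parsers")
parsed   = m && [m[:subject], m[:verb], m[:object]]
rejected = SENTENCE.match("rubyists eat parsers")

p parsed, rejected
```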
Slide 89
Real-world languages need a bit more than PEG;
they are generally described by a Context-Free Grammar (CFG).
Slide 90
In CFG, the branches are not ordered:
if A and B are two rules, A|B is the
same as B|A in CFG. In PEG, the first
branch that matches wins.
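Ruby's regexp alternation also tries its branches from left to right, which makes the effect of ordering easy to see. This is a tiny sketch with made-up variable names:

```ruby
# Both patterns describe the same set of branches, in a different order.
short_first = "ab"[/a|ab/]  # "a" matches first, so the longer branch is never tried
long_first  = "ab"[/ab|a/]  # "ab" is tried first and matches

p short_first, long_first
```

In a CFG the two spellings would describe exactly the same language, so the order could not change the result.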
Slide 93
Even CFG can’t resolve some ambiguities
༗Ұେၣረਖ਼ࡏۙ
( )
??
Slide 94
Now you know more
than regexp
parsec, rsec, treetop, parslet ...
Slide 95
Real-world example
Simple Markdown parser in 130 lines (many
features ignored, but...)
Slide 96
It supports nested parens! (while ruby-china
doesn’t)
[ruby](http://en.wikipedia.org/wiki/Ruby_(programming_language))
ruby
Slide 97
The parser for nested parens:
/(?<paren>
\(
(
[^\(\)]+ # non-paren chars
| # or
\g<paren> # a paren
)*
\)
)/x
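Applied to the Markdown link from the previous slide, the recursive group might be used as follows. This is a sketch, not the 130-line parser itself; the constant `LINK` and the group name `text` are assumptions made for illustration.

```ruby
LINK = /
  \[ (?<text> [^\]]+ ) \]      # link text in square brackets
  (?<paren>                    # the target, outer parens included
    \(
    (?: [^()]+ | \g<paren> )*  # non-paren chars, or a nested paren group
    \)
  )
/x

markdown = "[ruby](http://en.wikipedia.org/wiki/Ruby_(programming_language))"
m    = LINK.match(markdown)
text = m[:text]
url  = m[:paren][1..-2]        # strip the outer pair of parens

p text, url
```

A non-recursive pattern such as `\((.*?)\)` would stop at the first closing paren and cut the URL short.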
Slide 98
Helpful Links
★ ri Regexp
★ https://github.com/k-takata/Onigmo/tree/master/doc/RE
★ http://en.wikipedia.org/wiki/Parsing_Expression_Grammar
Slide 99
More sugar with
Onigmo
If there’s time...