Secrets of Regexp

Hiro Asari
February 28, 2013

The Regexp class is in every Rubyist's toolbox. But do you know the theory behind it, and what goes on under the hood?


Transcript

  1. Some people, when confronted with a problem, think, "I know, I'll use regular expressions." Now they have two problems.

    Jamie Zawinski, 12 Aug 1997
    http://regex.info/blog/2006-09-15/247
    http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html

    The point is not so much the evils of regular expressions as the evils of overusing them.
  3. Formal Language Theory

    • Alphabet Σ = {a, b, c, d, e, …, z, λ} (example)
    • Words over Σ: "a", "b", "ab", "aequafdhfad"
    • Σ*: the set of all words over Σ
  4. Formal Language over Σ

    • A subset L of Σ* (with various properties)
    • L can be finite, simply enumerating its well-formed words, but it is often infinite
  5. Example

    • Language L over Σ = {a, b}
    • 'a' is a word
    • A word may be obtained by appending 'ab' to an existing word
    • Only words thus formed are legal
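The rules on this slide generate the words a, aab, aabab, and so on, which can be written as a(ab)*. A minimal sketch of a membership test follows; the names `L_PATTERN` and `in_language?` are hypothetical and do not come from the deck.

```ruby
# Hypothetical helper: tests membership in L = { a, aab, aabab, ... }.
# 'a' is a word; appending 'ab' to a word yields another word.
L_PATTERN = /\Aa(?:ab)*\z/

def in_language?(word)
  L_PATTERN.match?(word)
end

%w[a aab aabab ab b aa].each do |w|
  puts "#{w.inspect}: #{in_language?(w)}"
end
```

The `\A` and `\z` anchors make the pattern test the whole word rather than search for a matching substring.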
  6. Expression

    • A textual representation of a formal language, against which an input is tested to see whether it is a well-formed word in that language
  9. Regular Languages

    • ∅ (the empty language) is regular
    • For each a ∈ Σ (a belongs to Σ), the singleton language {a} is a regular language
    • If A and B are regular languages, then A ∪ B (union), A•B (concatenation), and A* (Kleene star) are regular languages
    • No other languages over Σ are regular
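The three closure operations map directly onto regular expression syntax. The following is a sketch only; the constant names and the `whole_word` helper are hypothetical, chosen for illustration.

```ruby
# Singleton languages {a} and {b}, written as regular expressions.
A = /a/
B = /b/

UNION         = Regexp.union(A, B)               # A ∪ B
CONCATENATION = Regexp.new(A.source + B.source)  # A•B
KLEENE_STAR   = Regexp.new("(?:#{A.source})*")   # A*

# Hypothetical helper: anchor a pattern so the whole word must belong
# to the language, instead of merely containing a match.
def whole_word(re)
  /\A(?:#{re.source})\z/
end

p whole_word(UNION).match?("b")           # b is in A ∪ B
p whole_word(CONCATENATION).match?("ab")  # ab is in A•B
p whole_word(KLEENE_STAR).match?("aaa")   # aaa is in A*
p whole_word(KLEENE_STAR).match?("")      # the empty word is in A*
```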
  10. Regular? Expressions

    • It turns out that some expressions are more powerful and express non-regular languages
    • Language of 'squares': (.*)\1
    • a, aa, aaaa, WikiWiki
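A short sketch of the 'squares' language in Ruby. The anchors are an addition here: unanchored, `(.*)\1` can succeed on an empty substring of any input, so `\A` and `\z` are needed to test whether the whole word is a square. The name `SQUARE` is hypothetical.

```ruby
# A word is a "square" if it is some string written twice in a row.
# The backreference \1 is what takes this beyond regular languages.
SQUARE = /\A(.*)\1\z/

%w[aa aaaa WikiWiki a Wiki].each do |word|
  puts "#{word.inspect}: #{SQUARE.match?(word)}"
end
```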
  11. How does Regexp work?

    • Build a finite state automaton representing a given regular expression
    • Feed the String to the regular expression and see if the match succeeds
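To make the idea of a finite state automaton concrete, here is a hand-built one for the language a(ab)*. This is a conceptual sketch only, not how Ruby's regexp engine is implemented; the state names, `TRANSITIONS`, `ACCEPTING`, and `accepts?` are all hypothetical.

```ruby
# Hypothetical transition table for the language a(ab)*.
# Each state maps an input character to the next state.
TRANSITIONS = {
  start:  { "a" => :word },
  word:   { "a" => :need_b },
  need_b: { "b" => :word }
}.freeze

# States in which the input read so far is a complete word.
ACCEPTING = [:word].freeze

def accepts?(input)
  state = :start
  input.each_char do |ch|
    state = TRANSITIONS.fetch(state, {})[ch]
    return false if state.nil? # no transition: reject immediately
  end
  ACCEPTING.include?(state)
end

%w[a aab aabab aa ab].each do |w|
  puts "#{w.inspect}: #{accepts?(w)}"
end
```

The automaton reads each character exactly once and never backs up, in contrast to the position-by-position retrying shown on the following slides.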
  12. [Diagram: finite state automaton; the only labels recovered are "a" and "a"]

  17. /a$/ against zyxwvutsrqponmlkjihgfedcba

    (the ^ marker advances through the string one position per frame, from the first character to the last)

    Regexp does not think, "'a$' can match only at the end of the line, so we should fast-forward to the end of the line."
  18. abc d a dfadg

    (the ^ marker advances through the string over four frames)

    # matches 'abc d a dfadg '
    ^\s*(.*)\s*$
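A sketch of why the capture on this slide keeps trailing whitespace: the greedy `(.*)` consumes everything up to the end of the line, leaving the final `\s*` to match nothing. The input string below is hypothetical (the exact spacing on the slide is not recoverable from the transcript), and the lazy variant is one possible fix, not necessarily the one the speaker had in mind.

```ruby
INPUT = "  abc d  " # hypothetical input with leading and trailing spaces

# Greedy: (.*) swallows the trailing spaces; the last \s* matches nothing.
GREEDY = INPUT[/^\s*(.*)\s*$/, 1]

# Lazy: (.*?) takes as little as possible, so \s* gets the trailing spaces.
LAZY = INPUT[/^\s*(.*?)\s*$/, 1]

p GREEDY
p LAZY
```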
  19. a?a?a?…a?aaa…a

        def pathological(n=5)
          Regexp.new('a?' * n + 'a' * n)
        end

        1.upto(40) do |n|
          print n, ": "
          print Time.now, "\n" if 'a'*n =~ pathological(n)
        end
  20. Use /x

        UP_TO_256 = /\b(?:25[0-5]   # 250-255
                    |2[0-4][0-9]    # 200-249
                    |1[0-9][0-9]    # 100-199
                    |[1-9][0-9]     # 2-digit numbers
                    |[0-9])         # single-digit numbers
                    \b/x

        IPV4_ADDRESS = /#{UP_TO_256}(?:\.#{UP_TO_256}){3}/
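A brief usage sketch for the pattern on this slide. The two constants are repeated so the example is self-contained; the sample strings are hypothetical.

```ruby
# Same definitions as on the slide. With /x, whitespace and # comments
# inside the pattern are ignored, so each alternative can be annotated.
UP_TO_256 = /\b(?:25[0-5]   # 250-255
            |2[0-4][0-9]    # 200-249
            |1[0-9][0-9]    # 100-199
            |[1-9][0-9]     # 2-digit numbers
            |[0-9])         # single-digit numbers
            \b/x

# Interpolating a Regexp keeps its own options, so /x still applies
# to the embedded UP_TO_256 parts.
IPV4_ADDRESS = /#{UP_TO_256}(?:\.#{UP_TO_256}){3}/

p IPV4_ADDRESS.match?("192.168.0.1")
p IPV4_ADDRESS.match?("10.0.0.256")
p "host 127.0.0.1 up"[IPV4_ADDRESS]
```

Because the pattern is unanchored, it finds an address anywhere in the input; wrap it in `\A` and `\z` when the whole string must be an address.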
  22. \A, \z for strings; ^, $ for lines (always in Ruby)

    • \A: the beginning of the string
    • \z: the end of the string
    • ^: after \n
    • $: before \n
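The four anchors can be compared on a two-line string. A minimal sketch; the constant `TEXT` is hypothetical.

```ruby
TEXT = "abc\ndef"

p TEXT =~ /^d/   # ^ matches right after the \n, so this finds index 4
p TEXT =~ /\Ad/  # \A matches only at the start of the string: nil
p TEXT =~ /c$/   # $ matches right before the \n, so this finds index 2
p TEXT =~ /c\z/  # \z matches only at the end of the string: nil
p TEXT =~ /f\z/  # index 6
```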
  23. What's the problem?

        #! /usr/bin/env perl
        $a = "abc\ndef";
        if ($a =~ /^d/) {
          print "yes\n";
        }
        if ($a =~ /^d/m) {
          print "yes now\n";
        }
        # prints 'yes now'

    Also note the difference in what /m means: in Perl, /m makes ^ and $ match at line boundaries; in Ruby, /m makes . match newlines.
  24. What's the problem?

        #! /usr/bin/env ruby
        a = "abc\ndef";
        if (a =~ /^d/)
          p "yes"
        end

    http://guides.rubyonrails.org/security.html#regular-expressions
  25. Security Implications

        class File < ActiveRecord::Base
          validates :name, :format => /^[\w\.\-\+]+$/
        end

    http://guides.rubyonrails.org/security.html#regular-expressions
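The risk can be shown without Rails. In this sketch, a multi-line input passes the line-anchored pattern from the slide because its first line is valid, while a string-anchored variant rejects it. The constant names and the sample input are hypothetical.

```ruby
LINE_ANCHORED   = /^[\w\.\-\+]+$/    # the pattern from the slide
STRING_ANCHORED = /\A[\w\.\-\+]+\z/  # same character class, string anchors

# Hypothetical malicious input: a valid first line, then a payload.
MALICIOUS = "report.txt\n<script>alert(1)</script>"

p LINE_ANCHORED.match?(MALICIOUS)       # true: the first line alone matches
p STRING_ANCHORED.match?(MALICIOUS)     # false: the whole string must match
p STRING_ANCHORED.match?("report.txt")  # true
```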
  26. Prefer Character Classes to Alternations

        require 'benchmark'

        # simple benchmark for alternations and character class
        n = 5_000
        str = 'cafebabedeadbeef' * n

        Benchmark.bmbm do |x|
          x.report('alternation') do
            str =~ /^(a|b|c|d|e|f)+$/
          end

          x.report('character class') do
            str =~ /^[a-f]+$/
          end
        end
  27. Benchmarks

    Ruby 1.8.7
                          user     system      total        real
    alternation       0.030000   0.010000   0.040000 (  0.036702)
    character class   0.000000   0.000000   0.000000 (  0.004704)

    Ruby 2.0.0
                          user     system      total        real
    alternation       0.020000   0.010000   0.030000 (  0.023139)
    character class   0.000000   0.000000   0.000000 (  0.009641)

    JRuby 1.7.4.dev
                          user     system      total        real
    alternation       0.030000   0.000000   0.030000 (  0.021000)
    character class   0.010000   0.000000   0.010000 (  0.007000)
  28. Beware of Character Classes

        # case-insensitively match any non-word character…
        # one is unlike the others
        'r' =~ /(?i:[\W])/
        's' =~ /(?i:[\W])/
        't' =~ /(?i:[\W])/

    's' matches, even though 's' is a word character.
    https://bugs.ruby-lang.org/issues/4044
  29. Integer#prime?

        class Integer
          def prime?
            "1" * self !~ /^1?$|^(11+?)\1+$/
          end
        end

    No performance guarantee. Attributed to the Perl hacker Abigail.
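A usage sketch for the primality regexp. To avoid reopening Integer, it wraps the same pattern in a standalone method; the name `unary_prime?` is hypothetical.

```ruby
# Hypothetical standalone version of the slide's Integer#prime?.
def unary_prime?(n)
  # "1" * n is n written in unary. The pattern matches zero or one 1
  # (not prime), or a group of two or more 1s repeated at least twice
  # (a composite number). A number is prime when neither branch matches.
  ("1" * n) !~ /^1?$|^(11+?)\1+$/
end

p((1..20).select { |n| unary_prime?(n) })
# prints [2, 3, 5, 7, 11, 13, 17, 19]
```

The backreference forces the engine to try every candidate divisor by backtracking, which is why the slide offers no performance guarantee.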