Introduction to Regular Expressions

Jamie Ly

March 04, 2010

Transcript

  1. Intro to Regular Expressions
    Presented by: Jamie Ly

  2. Agenda
    ● Definition
    ● Infamy
    ● Whetting
    ● Literacy
    ● Potpourri

  3. [a-z0-9\.]+@wharton\.upenn\.edu

  4. What is a Regex?
    ● Pattern matcher
    ● Text processor
    ● mini-specification

  5. How are they processed?
    ● One method of processing uses finite-state automata (FSAs)
    ○ http://osteele.com/tools/reanimator/
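
To make the FSA idea concrete, here is a minimal hand-built automaton for the pattern ca[tp]er (the pattern is borrowed from the unused slides near the end of this deck). The state numbering, the `TRANSITIONS` table and the `fsa_fullmatch` helper are illustrative assumptions, not how any particular regex engine is implemented.

```python
# Hand-built finite-state automaton for the pattern ca[tp]er.
# States are integers; state 5 is the only accepting state.
TRANSITIONS = {
    (0, "c"): 1,
    (1, "a"): 2,
    (2, "t"): 3,  # [tp] becomes two edges into the same state
    (2, "p"): 3,
    (3, "e"): 4,
    (4, "r"): 5,
}
ACCEPTING = {5}


def fsa_fullmatch(text):
    """Return True if the whole string drives the automaton to an accepting state."""
    state = 0
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:  # no edge for this character: reject
            return False
    return state in ACCEPTING


for word in ["cater", "caper", "acapella"]:
    print(word, fsa_fullmatch(word))
```

Each character consumes exactly one edge, so the running time is linear in the length of the input.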

  6. Why the bad rap?

  7. Some people, when
    confronted with a problem,
    think 'I know, I’ll use regular
    expressions.' Now they have
    two problems.
    -Jamie Zawinski

  8. Overused
    Unintelligible

  9. Obscure Regexes
    E-Mail Validation, Complete Spec
    http://ex-parrot.com/~pdw/Mail-RFC822-Address.html
    Sudoku Solver
    http://perl.abigail.be/Talks/Sudoku/HTML/ (direct link)

  10. Feature: Literals
    ● 1 2 3
    ● a b c
    ● A B C
    ● - # %
    ● \. \^ \$
    ● \t \n \r
    /abc/
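
A short sketch of why some literals need a backslash, using Python's standard `re` module. The sample strings are made up for illustration.

```python
import re

# A plain literal matches itself, character for character.
plain_hit = re.search(r"abc", "xxabcxx") is not None

# An unescaped dot matches any character, so 3.14 also accepts "3x14".
dot_unescaped = re.fullmatch(r"3.14", "3x14") is not None
dot_escaped = re.fullmatch(r"3\.14", "3x14") is not None
dot_literal = re.fullmatch(r"3\.14", "3.14") is not None

# re.escape() adds the backslashes for special characters such as $ and .
price = re.escape("$9.99")
price_hit = re.search(price, "only $9.99 today") is not None

print(plain_hit, dot_unescaped, dot_escaped, dot_literal, price_hit)
```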

  11. Feature: Character classes/sets
    ● enclosed with brackets
    ● OR
    ● ba[rs]k matches bark and bask
    ● Negations ^
    ● c[^r]ap matches clap and chap, but not crap (the second character can be anything except r)
    ● Short-hand
    ○ Perl: \d, \s, \w
    ○ POSIX: [:digit:], [:space:], [:alnum:] (used inside brackets, e.g. [[:digit:]]; [:word:] is a non-standard extension)
    /a[bc]/
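
The bullets above can be checked with a few lines of Python. The word list is an illustrative assumption; `re.fullmatch` is used so that the whole word has to match.

```python
import re

words = ["bark", "bask", "back", "crap", "clap", "chap", "cap"]

# [rs] accepts either r or s in the third position.
bark_or_bask = [w for w in words if re.fullmatch(r"ba[rs]k", w)]

# [^r] accepts any single character except r, but still requires one character.
not_crap = [w for w in words if re.fullmatch(r"c[^r]ap", w)]

# \d is shorthand for a digit.
digits = re.findall(r"\d", "room 101")

print(bark_or_bask)
print(not_crap)
print(digits)
```

Note that "cap" is rejected by c[^r]ap: a negated class still consumes exactly one character.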

  12. Feature: Quantifiers
    ● ? zero or one
    ● * zero or more (>= 0)
    ● + one or more (> 0)
    ● {n} exactly n
    ● {n,m} between n and m, inclusive
    /a[bc]+/
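
A small sketch comparing the quantifiers on the slide's a[bc] pattern. The sample strings are assumptions chosen to show where each quantifier stops matching.

```python
import re

samples = ["a", "ab", "abcb", "abbbbb"]
patterns = [r"a[bc]?", r"a[bc]*", r"a[bc]+", r"a[bc]{3}", r"a[bc]{1,3}"]

# For each pattern, keep the samples that match from start to end.
results = {p: [s for s in samples if re.fullmatch(p, s)] for p in patterns}

for pattern, matched in results.items():
    print(pattern, matched)
```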

  13. Feature: Grouping
    ● Uses parens
    ● For capture
    ● Extraction
    ● Backreferences*
    /(a[bc])+/

  14. Feature: Grouping - Capture
    import re
    search = "apple pear scrapple snapple peach pineapple"
    pattern = r"([a-z]+pple)"
    matches = re.findall(pattern, search)  # every non-overlapping match of the group
    print(matches)
    # ['apple', 'scrapple', 'snapple', 'pineapple']

  15. Feature: Others
    ● Modifiers
    ● Start/end anchors
    ● Zero-length matches
    ● Backreferences
    ● Lookahead, lookbehind, lookaround, ...

  16. Example: Quantifiers
    /(not)? to be/
    /((lion)+(tiger)+(bear)+)/
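
The second pattern can be tried out directly. This is a minimal sketch using Python's `re.search`; the candidate strings are the ones used on the following slides.

```python
import re

BIG_CATS = re.compile(r"((lion)+(tiger)+(bear)+)")

candidates = [
    "liontigerbear",
    "lionliontigertigerbear",
    "lion tiger bear",  # spaces are not part of the pattern
    "lionbear",         # (tiger)+ needs at least one "tiger"
]

# Map each candidate to whether the pattern is found anywhere in it.
outcome = {s: BIG_CATS.search(s) is not None for s in candidates}

for text, found in outcome.items():
    print(repr(text), found)
```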

  17. Example: Quantifiers - Matches
    /(not)? to be/
    "not to be"
    " to be" (the leading space is part of the pattern)

  18. Example: Quantifiers - No match
    /(not)? to be/ (tested against the whole string, case-sensitive)
    "NOT to be" (wrong case)
    "not  to be" (two spaces after "not")

  19. Example: Quantifiers - Matches
    /((lion)+(tiger)+(bear)+)/
    "liontigerbear"
    "lionliontigertigerbear"

  20. Example: Quantifiers - No Match
    /((lion)+(tiger)+(bear)+)/
    "lion tiger bear"
    "lionbear"

  21. Example: Grouping - Alternation
    /the (best|worst) of times/
    /(liberty|death)/
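
A brief sketch of alternation inside a group, using Python's `re.findall`, which returns the captured alternative for each match. The sentences are the familiar quotations the patterns allude to.

```python
import re

dickens = "It was the best of times, it was the worst of times"
henry = "Give me liberty, or give me death!"

# With one capturing group, findall returns what the group captured each time.
which_times = re.findall(r"the (best|worst) of times", dickens)
liberty_or_death = re.findall(r"(liberty|death)", henry)

print(which_times)
print(liberty_or_death)
```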

  22. 4 Main Uses
    ● Search
    ● Normalization
    ● Extraction
    ● Validation

  23. Use: Search - Finding Text In Files
    ● grep
    ● findstr (Windows)
    ● Eclipse, Notepad++, ...
    ● Possible Uses:
    ○ files containing @todo or !todo
    ○ files using cf tags

  24. Use: Search - Using Regex In Code
    ● ColdFusion - REReplace note
    ● JavaScript - RegExp
    ● Java - java.util.regex.Pattern
    ● ASP - VBScript.RegExp
    ● C# - System.Text.RegularExpressions.Regex
    ● Comparison http://en.wikipedia.org/wiki/Comparison_of_regular_expression_engines

  25. Use: Search - Another Domain
    ● Programming language detection
    ○ Pastebin
    ○ Syntax Highlighting

  26. Use: Normalization
    ● Legacy system data (simple flat-file processing)
    ● One-off user lists
    ● Phone numbers, zip code
    ● Example: Phone Number List
    ● Common use case: Joining/Splitting Strings
    ● Common use case: Code refactoring

  27. Use: Extraction
    ● Ties into search and normalization
    ● Example: URLs, e-mail addresses
    ● Plug: CiteThis! sample
    ● Use: Screen scraping Yahoo Finance src code
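
A small sketch of extracting e-mail addresses and URLs from free text. Both patterns are deliberately simple assumptions and will miss or over-match unusual inputs; the addresses use reserved example domains.

```python
import re

text = (
    "Contact jamie@example.com or admin@example.org. "
    "Slides: http://example.com/regex and https://example.org/deck (draft)"
)

# Local part, @, then one or more dot-separated domain labels.
emails = re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text)

# http or https followed by everything up to the next whitespace.
urls = re.findall(r"https?://\S+", text)

print(emails)
print(urls)
```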

  28. Use: Validation
    ● User input
    ● Casting vs. String Matching
    ● Example: Input Masking, Stripping HTML

  29. When to Avoid RE
    ● HTML Parsing
    ○ DOM/XPath
    ● Performance critical
    ● Stateful processing
    ○ odd number of a's
    ● Some text file processing (some CSV)

  30. Regex Literacy

  31. How to Read REs
    ● Why be able to read them?
    ● Decompose a regex into its component parts
    ○ How to solve? 1+3*4/5^6+3+5-6*10+3-5
    ○ Decompose/Group!
    ● Be familiar with the problem space/domain
    ● Write down strings that match as you scan
    ● Test against various strings (only use this if stumped!)

  32. /^#?([a-f0-9]{6}|[a-f0-9]{3})$/

  33. /^#?([a-f0-9]{6}|[a-f0-9]{3})$/

  34. /^#?(
    [a-f0-9]{6}
    |
    [a-f0-9]{3}
    )$/

  35. #aaa
    af0
    af9f04
    #00cc00
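
The four strings above can be checked against the hex-colour pattern from the preceding slides. The three extra non-matching strings are assumptions added to show where the pattern draws the line.

```python
import re

HEX_COLOR = re.compile(r"^#?([a-f0-9]{6}|[a-f0-9]{3})$")

candidates = ["#aaa", "af0", "af9f04", "#00cc00", "#00cc0", "#AAA", "ggg"]

# Map each candidate to the captured hex digits, or None when there is no match.
captured = {}
for value in candidates:
    match = HEX_COLOR.match(value)
    captured[value] = match.group(1) if match else None

for value, digits in captured.items():
    print(value, digits)
```

"#AAA" fails only because the pattern is case-sensitive; an ignore-case modifier would accept it.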

  36. /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/

  37. /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/

  38. /^
    ([a-z0-9_\.-]+)
    @
    ([\da-z\.-]+)
    \.
    ([a-z\.]{2,6})
    $/
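
The same decomposition can be kept in source code. This sketch uses Python's `re.VERBOSE` flag, which ignores whitespace in the pattern and allows comments; the comment labels and the sample addresses are assumptions.

```python
import re

EMAIL = re.compile(
    r"""
    ^
    ([a-z0-9_\.-]+)   # group 1: local part
    @
    ([\da-z\.-]+)     # group 2: domain name
    \.
    ([a-z\.]{2,6})    # group 3: top-level domain
    $
    """,
    re.VERBOSE,
)

good = EMAIL.match("first.last@example.com")
bad = EMAIL.match("First.Last@example.com")  # uppercase is not in the classes

parts = good.groups() if good else None
print(parts)
print(bad)
```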

  39. Regex
    Construction

  40. How to Construct a RE
    ● Have a few examples of:
    ○ matching strings
    ○ not matching strings
    ● start with a simple expression
    ● build up
    ● like writing pseudo-code
    ● Let's write a date-matching RE!

  41. Just because we can...
    does it mean we should?
    Casting vs. RegEx
    ● Simple natural language parsing
    ○ Search queries
    ○ I Want Sandy
    ○ Remember the milk

  42. Date Match!
    1. What will we match?
    ● Typical US Dates?
    ● International format? 2010-03-04
    ● Do we match times?
    Let's settle on matching:
    ● Month date, year
    ● Including 3-letter month names

  43. Date Match!
    1. What will we match? (continued)
    ● Possible matches
    ○ Oct. 20, 2000
    ● Not matches
    ○ 10/20/2000
    ○ October 20

  44. Date Match!
    2. Start with a simple expression
    Write a Regex to match "October 20, 2000"
    /October 20, 2000/
    Easy huh? Now, be able to match each day in October.
    /October [1-3][0-9], 2000/

  45. Date Match!
    2. Start with a simple expression (cont)
    Close! But it doesn't match the 1st through 9th
    /October [1-3]?[0-9], 2000/
    Better! Now, match any year from 1000 to 3999.
    /October [1-3]?[0-9], [123][0-9][0-9][0-9]/

  46. Date Match!
    2. Start with a simple expression (cont)
    When you see the same classes repeated, you can simplify!
    /October [1-3]?[0-9], [123][0-9]{3}/
    Now, match October's abbreviation.
    /(October|Oct\.) [1-3]?[0-9], [123][0-9]{3}/

  47. Date Match!
    2. Start with a simple expression (cont)
    Now we can add the other months. We'll only add May and
    December so we can limit the size.
    /((May|October|December)|(Oct|Dec)\.) [1-3]?[0-9], [123][0-9]{3}/
    or
    /(May|October|December|Oct\.|Dec\.) [1-3]?[0-9], [123][0-9]{3}/

  48. Date Match!
    2. Start with a simple expression (cont)
    From here, we determine whether to loosen the requirements.
    ● Ignore whitespace?
    ● Ignore case?
    ● Comma, other punctuation optional?
    /(May|October|December|Oct\.?|Dec\.?)\s+[1-3]?[0-9],?\s+[123][0-9]{3}/
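
The loosened pattern can be exercised as follows. This sketch uses the day class [1-3]?[0-9] from the earlier step so that the 30th and 31st are accepted, and adds Python's `re.IGNORECASE` flag as one possible answer to "Ignore case?". The test strings beyond those on the slides are assumptions.

```python
import re

DATE = re.compile(
    r"(May|October|December|Oct\.?|Dec\.?)\s+[1-3]?[0-9],?\s+[123][0-9]{3}",
    re.IGNORECASE,
)

candidates = [
    "October 20, 2000",
    "Oct. 20, 2000",
    "oct 5 1999",
    "Dec.  31,  2010",
    "10/20/2000",
    "October 20",
]

# fullmatch requires the whole string to be a date, not just part of it.
is_date = {s: DATE.fullmatch(s) is not None for s in candidates}

for text, ok in is_date.items():
    print(repr(text), ok)
```

The pattern is still permissive: it also accepts impossible days such as "October 39, 2000", which is one reason to consider casting to a real date type after matching.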

  49. Regex Builders
    ● Not crutches!
    ● Testers/Interactive builder
    ○ http://rubular.com/
    ○ http://gskinner.com/RegExr/
    ○ *http://ryanswanson.com/regexp/#start
    ○ http://osteele.com/tools/rework/#
    ○ http://txt2re.com/index-php.php3
    ○ http://tools.netshiftmedia.com/regexlibrary/

  50. Custom Project Using REs
    Word Jumble!
    http://scorpio-dev.wharton.upenn.edu/users/jamiely/wordjumble/

  51. http://www.addedbytes.com/cheat-sheets/regular-expressions-cheat-sheet/

  52. References
    ● http://net.tutsplus.com/tutorials/other/8-regular-expressions-you-should-know/
    ● http://regexlib.com/
    ● http://www.regular-expressions.info/

  53. Unused
    slides

  54. RE Linksheet

  55. Feature: Backreference
    ● You may use a captured group in a regex
    ● This is useful for paired data such as html
    ● When you don't know what the first match will be
    Matches an html tag and its closing tag (kinda)
    /<(\w+).+> inner html including tags <\/\1>/
    Matches a line of the name song!
    /\w(\w+), B\1, Bo B\1, Banana Fanana/
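
A runnable sketch of backreferences in Python. The tag pattern below is a simplified assumption in the same "kinda" spirit as the slide: it pairs a tag with a closing tag of the same name but does not handle nesting of identical tags. The doubled-word pattern is a second common use of \1.

```python
import re

# \1 must repeat whatever group 1 captured, so the closing tag has to match.
TAG_PAIR = re.compile(r"<(\w+)[^>]*>(.*?)</\1>", re.DOTALL)

paired = TAG_PAIR.search('<p class="x">hello <b>world</b></p>')
tag_name = paired.group(1)
inner = paired.group(2)

mismatched = TAG_PAIR.search("<b>bold</i>")

# The same word twice in a row, separated by one space.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")

print(tag_name, inner)
print(mismatched)
print(doubled.group(0))
```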

  56. Feature: Modifiers
    ● (g)lobal
    ● (i)gnore case/case (i)nsensitive
    ● (m)ultiline
    ● example
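
A possible example, sketched in Python. Python has no g modifier; `re.findall` returns every match and plays that role here, while `re.IGNORECASE` and `re.MULTILINE` correspond to i and m. The sample text is an assumption.

```python
import re

text = "Do re mi\ndo re mi\nfa so la"

# No flags: ^ only matches at the very start, and "Do" is not "do".
plain = re.findall(r"^do", text)

# i: ignore case.
ignore_case = re.findall(r"^do", text, re.IGNORECASE)

# m: ^ also matches right after each newline.
multiline = re.findall(r"^do", text, re.MULTILINE)

# Flags combine with the | operator.
both = re.findall(r"^do", text, re.IGNORECASE | re.MULTILINE)

print(plain, ignore_case, multiline, both)
```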

  57. Feature: Start/end anchors
    ● ^ begin
    ● $ end
    ● Maria: Let's start at the very beginning!
    ● ^do re mi
    ● REM: world as we know it$
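
The two anchored patterns above, tried with Python's `re.search`. The surrounding test strings are assumptions.

```python
import re

# ^ pins the match to the beginning of the string.
starts = re.search(r"^do re mi", "do re mi fa so la") is not None
starts_late = re.search(r"^do re mi", "so do re mi") is not None

# $ pins the match to the end of the string.
ends = re.search(r"world as we know it$", "the end of the world as we know it") is not None
ends_early = re.search(r"world as we know it$", "world as we know it, and I feel fine") is not None

print(starts, starts_late, ends, ends_early)
```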

  58. Example: Quantifiers - Alt. Regex
    /((lion)+(tiger)+(bear)+)/
    /((lion){1,2}(tiger){1,2}(bear)?)/
    "liontigerbear"
    "lionliontigertigerbear"

  59. Example: Grouping - Matches
    /(circle of (hell|the (inferno|abyss))){1,7}/
    "circle of the inferno"
    "circle of hellcircle of hell"

  60. Example: Grouping - No match
    /(circle of (hell|the (inferno|abyss))){1,7}/
    "circle of abyss"
    "circle of"

  61. *Feature: - Zero Length Matches
    ● Zero-length matches (^, \b)
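
A short sketch of what "zero-length" means, using the word boundary \b in Python. The boundary matches a position between characters rather than a character, so each match is an empty string. The sample strings are assumptions.

```python
import re

# \b lets "cat" match only as a whole word.
whole_words = re.findall(r"\bcat\b", "cat concat cats cat.")

# Each \b match has the same start and end offset: it consumes nothing.
boundaries = [(m.start(), m.end()) for m in re.finditer(r"\b", "to be")]

print(whole_words)
print(boundaries)
```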

  62. Use: Capture
    /(heart|mind)*/

  63. Search
    ● dir/ls
    ● Example: Find files beginning with test
    ● find
    ● Example: Find files beginning with test and ending with a numeric timestamp.
    ● Common use case:
    ○ finding all log files
    ○ finding image files

  64. Use: Normalization (cont)
    ● Phone numbers, zip code
    ● Example: Phone Number List
    ● Common use case: Joining/Splitting Strings

  65. General Search
    ● IDE
    ○ Notepad++
    ■ sql search
    ○ Eclipse
    ■ function search
    ○ Dreamweaver?
    ○ Word?
    ■ Not really regex
    ○ Emacs?
    ● FindStr/Grep
    ○ Example: ?
    ○ Common use case: Finding all instances of a global variable

  66. Search
    ● find

  67. grep, findstr

  68. RE Flavors and Compensating
    How to compensate for a deficient implementation.
    (ColdFusion)

  69. What is a regular expression?
    ● formal language
    ● interpreted by RE processor
    ● mini-programs
    ● mini-specifications
    ● domain: text-processing
    ● iffy: find:RE::dom:XPath ... kinda
    ● show program parse tree?
    ○ compare to regex automata graph
    ○ graph: ca[tp]er against cater caper acapella

  70. Disperse Examples and Tests
    Throughout?
    ● Roman numerals: http://stackoverflow.com/questions/800813/what-is-the-most-difficult-challenging-regular-expression-you-have-ever-written/800932

  71. RE Cheatsheet
    ● How to encourage RE use?
    ● Distribute cheatsheets
    ● Comb through LL source for examples where REs could be used?

  72. Use: Search Domains
    ● Find all files containing @todo or !todo
    ● Find all files using cf tags
