Introduction to Regular Expressions

Jamie Ly

March 04, 2010

Transcript

  1. 5.

    How are they processed? • One method of processing uses finite-state automata (FSAs) ◦ http://osteele.com/tools/reanimator/
  2. 7.

    Some people, when confronted with a problem, think 'I know, I’ll use regular expressions.' Now they have two problems. - Jamie Zawinski
  3. 10.
  4. 11.
  5. 12.

    Feature: Literals • 1 2 3 • a b c • A B C • - # % • \. \^ \$ • \t \n \r • Example: /abc/
  6. 13.

    Feature: Character classes/sets • Enclosed in brackets • Act as an OR over single characters: ba[rs]k matches bark and bask • Negation with ^: c[^r]ap matches clap or chap, but not crap • Shorthand ◦ Perl: \d, \s, \w ◦ POSIX: [:digit:], [:space:], [:word:] (the last is a common extension rather than standard POSIX) • Example: /a[bc]/
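A minimal Python sketch of the class examples on this slide, using the standard `re` module. The word lists are made up for illustration.

```python
import re

# ba[rs]k: the class [rs] matches exactly one character, either r or s.
words = ["bark", "bask", "balk", "back"]
bark_or_bask = [w for w in words if re.fullmatch(r"ba[rs]k", w)]

# c[^r]ap: the negated class [^r] matches any single character except r.
candidates = ["clap", "chap", "crap", "cap"]
not_crap = [w for w in candidates if re.fullmatch(r"c[^r]ap", w)]

# Perl-style shorthand: \d matches a digit.
digits = re.findall(r"\d", "room 42b")

print(bark_or_bask)  # ['bark', 'bask']
print(not_crap)      # ['clap', 'chap']
print(digits)        # ['4', '2']
```

Note that "cap" is rejected too: a negated class still has to consume one character.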
  7. 14.

    Feature: Quantifiers • ? zero or one • * zero or more • + one or more • {n} exactly n • {n,m} between n and m, inclusive • Example: /a[bc]+/
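Each quantifier can be tried against the same small set of strings. This is a sketch in Python's `re` module; the sample strings are invented for the comparison.

```python
import re

samples = ["a", "ab", "abcb", "abd"]

def full_matches(pattern):
    """Return the samples that the pattern matches in their entirety."""
    return [s for s in samples if re.fullmatch(pattern, s)]

zero_or_one = full_matches(r"a[bc]?")      # ['a', 'ab']
zero_or_more = full_matches(r"a[bc]*")     # ['a', 'ab', 'abcb']
one_or_more = full_matches(r"a[bc]+")      # ['ab', 'abcb']
exactly_three = full_matches(r"a[bc]{3}")  # ['abcb']
one_to_two = full_matches(r"a[bc]{1,2}")   # ['ab']

print(zero_or_one, zero_or_more, one_or_more, exactly_three, one_to_two)
```

"abd" never matches because d is outside the class [bc].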
  8. 16.

    Feature: Grouping - Capture
    search = "apple pear scrapple snapple peach pineapple"
    re = "([a-z]+pple)"
    matches = RegexMatch(search, re)
    print matches
    ["apple", "scrapple", "snapple", "pineapple"]
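The slide's `RegexMatch` is pseudocode. One way to get the same result in Python is `re.findall`, which returns the captured group for each match when the pattern has exactly one group.

```python
import re

search = "apple pear scrapple snapple peach pineapple"
pattern = r"([a-z]+pple)"

# With a single capture group, findall returns a list of the captured text.
matches = re.findall(pattern, search)
print(matches)  # ['apple', 'scrapple', 'snapple', 'pineapple']
```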
  9. 17.

    Feature: Others • Modifiers • Start/end anchors • Zero-length matches • Backreferences • Lookahead, lookbehind, lookaround, ...
  10. 18.
  11. 25.
  12. 27.

    Use: Search - Finding Text In Files • grep • findstr (Windows) • Eclipse, Notepad++, ... • Possible uses: ◦ files containing @todo or !todo ◦ files using ColdFusion (cf) tags
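The two possible uses can be sketched as a tiny grep-like helper in Python. The patterns are assumptions about what "@todo or !todo" and "cf tags" look like in a file, and the sample text is invented.

```python
import re

TODO = re.compile(r"[@!]todo")
# Assumption: a ColdFusion tag looks like <cfsomething or </cfsomething.
CF_TAG = re.compile(r"</?cf\w+", re.IGNORECASE)

def matching_lines(pattern, text):
    """Return (line_number, line) pairs for lines the pattern matches."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if pattern.search(line)
    ]

sample = """<cfset total = 0>
// @todo handle empty carts
var x = 1; // !todo rename
<p>plain html</p>"""

todo_hits = matching_lines(TODO, sample)
cf_hits = matching_lines(CF_TAG, sample)
print(todo_hits)
print(cf_hits)
```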
  13. 28.

    Use: Search - Using Regex In Code • ColdFusion - REReplace note • JavaScript - RegExp • Java - java.util.regex.Pattern • ASP - VBScript.RegExp • C# - System.Text.RegularExpressions.Regex • Comparison: http://en.wikipedia.org/wiki/Comparison_of_regular_expression_engines
  14. 30.

    Use: Normalization • Legacy system data (simple flat-file processing) • One-off user lists • Phone numbers, ZIP codes • Example: Phone Number List • Common use case: Joining/Splitting Strings • Common use case: Code refactoring
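A sketch of the phone-number case in Python. The target format, the 10-digit US assumption, and the helper name `normalize_phone` are all illustrative choices, not something the slides specify.

```python
import re

def normalize_phone(raw):
    """Reduce a US phone number to (NNN) NNN-NNNN, or return None if it does not fit."""
    digits = re.sub(r"\D", "", raw)  # drop everything that is not a digit
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]          # strip a leading country code
    if len(digits) != 10:
        return None
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"

# Splitting on mixed separators is the other common case on this slide.
names = re.split(r"\s*[,;]\s*", "ann, bob;carol")

print(normalize_phone("215.555.0100"))    # (215) 555-0100
print(normalize_phone("1-215-555-0100"))  # (215) 555-0100
print(names)                              # ['ann', 'bob', 'carol']
```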
  15. 31.

    Use: Extraction • Ties into search and normalization • Example: URLs, e-mail addresses • Plug: CiteThis! sample • Use: Screen scraping Yahoo Finance source code
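A rough Python sketch of pulling URLs and e-mail addresses out of free text. Both patterns are deliberately simple assumptions; real-world URLs and addresses have many more edge cases than these cover.

```python
import re

# Simplified patterns, good enough for a demonstration only.
URL = re.compile(r"https?://[^\s<>\"]+")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

text = "Mail jamie@example.com or visit http://example.com/docs today"

urls = URL.findall(text)
emails = EMAIL.findall(text)
print(urls)    # ['http://example.com/docs']
print(emails)  # ['jamie@example.com']
```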
  16. 32.

    Use: Validation • User input • Casting vs. String Matching • Example: Input Masking, Stripping HTML
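The casting versus string matching choice can be illustrated in Python. The ZIP code pattern (five digits with an optional four-digit suffix) and both helper names are assumptions made for this sketch.

```python
import re

ZIP = re.compile(r"\d{5}(?:-\d{4})?")

def is_zip(text):
    """String matching: the whole input must look like a US ZIP code."""
    return ZIP.fullmatch(text) is not None

def is_int_by_cast(text):
    """Casting: let the language decide whether the input is an integer."""
    try:
        int(text)
        return True
    except ValueError:
        return False

print(is_zip("19104"), is_zip("19104-1234"), is_zip("1910"))  # True True False
print(is_int_by_cast("42"), is_int_by_cast("4x"))             # True False
```

Casting is usually the simpler choice when the language already has a parser for the type; a regex earns its place when the format is a shape, like a ZIP code, rather than a value.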
  17. 33.

    When to Avoid RE • HTML parsing ◦ use DOM/XPath instead • Performance-critical code • Stateful processing ◦ odd number of a's • Some text file processing (some CSV)
  18. 35.

    How to Read REs • Why be able to read them? • Decompose a regex into its component parts ◦ How to solve? 1+3*4/5^6+3+5-6*10+3-5 ◦ Decompose/Group! • Be familiar with the problem space/domain • Write down strings that match as you scan • Test against various strings (only use this if stumped!)
  19. 45.

    How to Construct a RE • Have a few examples of: ◦ matching strings ◦ non-matching strings • Start with a simple expression • Build up • Like writing pseudo-code • Let's write a date-matching RE!
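The "keep matching and non-matching examples, then build up" workflow can be sketched as a small Python harness. The `check` helper is hypothetical, and the example strings are taken from the date exercise later in the deck.

```python
import re

def check(pattern, should_match, should_not_match):
    """Return (problem, text) pairs; an empty list means the draft passes."""
    regex = re.compile(pattern)
    problems = []
    for text in should_match:
        if not regex.search(text):
            problems.append(("missed", text))
    for text in should_not_match:
        if regex.search(text):
            problems.append(("wrongly matched", text))
    return problems

should_match = ["October 20, 2000", "Oct. 20, 2000"]
should_not_match = ["10/20/2000", "October 20"]

# Start simple, see what fails, then build up.
first_draft = check(r"October 20, 2000", should_match, should_not_match)
second_draft = check(r"(October|Oct\.) 20, 2000", should_match, should_not_match)

print(first_draft)   # [('missed', 'Oct. 20, 2000')]
print(second_draft)  # []
```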
  20. 46.

    Just because we can... does it mean we should? Casting vs. RegEx • Simple natural language parsing ◦ Search queries ◦ I Want Sandy ◦ Remember The Milk
  21. 47.

    Date Match! 1. What will we match? • Typical US dates? • International format? 2010-03-04 • Do we match times? Let's settle on matching: • Month date, year • Including 3-letter month names
  22. 48.

    Date Match! 1. What will we match? (continued) • Possible matches ◦ Oct. 20, 2000 • Non-matches ◦ 10/20/2000 ◦ October 20
  23. 49.

    Date Match! 2. Start with a simple expression. Write a regex to match "October 20, 2000": /October 20, 2000/ Easy, huh? Now match each day in October: /October [1-3][0-9], 2000/
  24. 50.

    Date Match! 2. Start with a simple expression (cont) Close! But it doesn't match the 1st through 9th. /October [1-3]?[0-9], 2000/ Better! Now, match any year from 1000 to 3999. /October [1-3]?[0-9], [123][0-9][0-9][0-9]/
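The effect of making the tens digit optional can be checked in Python by running both day patterns over the days 1 through 31.

```python
import re

days = [str(d) for d in range(1, 32)]

two_digit = re.compile(r"[1-3][0-9]")       # day part of the first attempt
optional_tens = re.compile(r"[1-3]?[0-9]")  # day part after adding ?

missed_before = [d for d in days if not two_digit.fullmatch(d)]
missed_after = [d for d in days if not optional_tens.fullmatch(d)]

print(missed_before)  # ['1', '2', '3', '4', '5', '6', '7', '8', '9']
print(missed_after)   # []

# The pattern is still loose: it also accepts impossible days.
accepts_39 = optional_tens.fullmatch("39") is not None
accepts_0 = optional_tens.fullmatch("0") is not None
```

Whether accepting days like 0 or 39 matters depends on whether the regex is meant to find dates or to validate them.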
  25. 51.

    Date Match! 2. Start with a simple expression (cont) When you see the same classes repeated, you can simplify! /October [1-3]?[0-9], [123][0-9]{3}/ Now, match October's abbreviation. /(October|Oct\.) [1-3]?[0-9], [123][0-9]{3}/
  26. 52.

    Date Match! 2. Start with a simple expression (cont) Now we can add the other months. We'll only add May and December so we can limit the size. /((May|October|December)|(Oct|Dec)\.) [1-3]?[0-9], [123][0-9]{3}/ or /(May|October|December|Oct\.|Dec\.) [1-3]?[0-9], [123][0-9]{3}/
  27. 53.

    Date Match! 2. Start with a simple expression (cont) From here, we determine whether to loosen the requirements. • Ignore whitespace? • Ignore case? • Comma, other punctuation optional? /(May|October|December|Oct\.?|Dec\.?)\s+[1-3]?[0-9],?\s+[123][0-9]{3}/
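A Python sketch of the loosened expression, tried against the matches and non-matches chosen earlier in the deck. Two assumptions: the tens digit is written as [1-3] so that the 30th and 31st can match, and "ignore case" is handled with the `re.IGNORECASE` flag. The helper name `find_date` is hypothetical.

```python
import re

# Assumption: tens digit is [1-3] so that days 30 and 31 can match.
DATE = re.compile(
    r"(May|October|December|Oct\.?|Dec\.?)\s+[1-3]?[0-9],?\s+[123][0-9]{3}"
)

def find_date(text, ignore_case=False):
    """Return the first date-like substring in text, or None."""
    regex = re.compile(DATE.pattern, re.IGNORECASE) if ignore_case else DATE
    match = regex.search(text)
    return match.group(0) if match else None

print(find_date("Due Oct. 20, 2000 at noon"))           # Oct. 20, 2000
print(find_date("Dec 5 1999"))                          # Dec 5 1999
print(find_date("10/20/2000"))                          # None
print(find_date("October 20"))                          # None
print(find_date("oct 20, 2000", ignore_case=True))      # oct 20, 2000
```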
  28. 54.
  29. 55.

    Regex Builders • Not crutches! • Testers/Interactive builders ◦ http://rubular.com/ ◦ http://gskinner.com/RegExr/ ◦ *http://ryanswanson.com/regexp/#start ◦ http://osteele.com/tools/rework/# ◦ http://txt2re.com/index-php.php3 ◦ http://tools.netshiftmedia.com/regexlibrary/
  30. 62.

    Feature: Backreference • You may refer back to a captured group later in the same regex • Useful for paired data such as HTML tags • Useful when you don't know in advance what the first match will be • Matches an HTML tag and its closing tag (kinda): /<(\w+).+> inner html including tags </\1>/ • Matches a line of the name song: /\w(\w+), B\1, Bo B\1, Banana Fanana/
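Both backreference examples can be tried in Python. The tag pattern here is a reworked, runnable variant of the slide's sketch (the slide's version is schematic), and like the original it only "kinda" works: nested tags of the same name will confuse it.

```python
import re

# \1 must repeat whatever the first group captured, so <p> pairs only with </p>.
TAG = re.compile(r"<(\w+)[^>]*>(.*?)</\1>", re.DOTALL)

paired = TAG.search('<p class="x">hi <b>there</b></p>')
mismatched = TAG.search("<b>oops</i>")

print(paired.group(1))  # p
print(paired.group(2))  # hi <b>there</b>
print(mismatched)       # None

# The name-song pattern from the slide: group 1 is the name minus its first letter.
NAME_SONG = re.compile(r"\w(\w+), B\1, Bo B\1, Banana Fanana")
song = NAME_SONG.search("Katie, Batie, Bo Batie, Banana Fanana")
print(song.group(1))    # atie
```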
  31. 64.

    Feature: Start/end anchors • ^ begin • $ end • Maria: Let's start at the very beginning! • ^do re mi • REM: world as we know it$
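The two anchored patterns behave as follows in Python. The test strings are invented around the slide's lyrics.

```python
import re

starts = re.compile(r"^do re mi")
ends = re.compile(r"world as we know it$")

starts_hit = starts.search("do re mi fa so la ti") is not None                # True
starts_miss = starts.search("la ti do re mi") is not None                     # False
ends_hit = ends.search("it's the end of the world as we know it") is not None   # True
ends_miss = ends.search("the world as we know it, and I feel fine") is not None # False

# By default ^ anchors at the start of the whole string;
# with re.MULTILINE it also matches after each newline.
single = re.findall(r"^do", "do\nre\ndo")
multi = re.findall(r"^do", "do\nre\ndo", re.MULTILINE)

print(starts_hit, starts_miss, ends_hit, ends_miss)
print(single, multi)  # ['do'] ['do', 'do']
```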
  32. 70.

    Search • dir/ls • Example: Find files beginning with test • find • Example: Find files beginning with test and ending with a numeric timestamp. • Common use case: ◦ finding all log files ◦ finding image files
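The second example can be sketched in Python against a list of names. The slides do not say what the timestamp looks like, so this assumes it is simply a run of digits at the end of the name; the filenames are hypothetical.

```python
import re

# Assumption: "numeric timestamp" means the name ends in one or more digits.
TEST_WITH_TIMESTAMP = re.compile(r"test.*\d+")

filenames = [
    "test_run_20100304",
    "test_notes.txt",
    "latest_20100304",
    "test1267660800",
]

# fullmatch anchors the pattern at both ends of the name.
hits = [name for name in filenames if TEST_WITH_TIMESTAMP.fullmatch(name)]
print(hits)  # ['test_run_20100304', 'test1267660800']
```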
  33. 71.

    Use: Normalization (cont) • Phone numbers, ZIP codes • Example: Phone Number List • Common use case: Joining/Splitting Strings
  34. 72.

    General Search • IDE ◦ Notepad++ ▪ SQL search ◦ Eclipse ▪ function search ◦ Dreamweaver? ◦ Word? ▪ Not really regex ◦ Emacs? • FindStr/Grep ◦ Example: ? ◦ Common use case: Finding all instances of a global variable
  35. 76.

    What is a regular expression? • formal language • interpreted by RE processor • mini-programs • mini-specifications • domain: text-processing • iffy analogy: find : RE :: DOM : XPath ... kinda • show program parse tree? ◦ compare to regex automata graph ◦ graph: ca[tp]er against cater, caper, acapella
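The automata graph for ca[tp]er can be written out by hand as a small state machine. This is a sketch of the idea in Python, not a description of how any particular engine works; many engines, including Python's `re`, use backtracking rather than a pure automaton. The state numbering is arbitrary.

```python
import re

# One transition per edge in the graph: (current state, character) -> next state.
TRANSITIONS = {
    (0, "c"): 1,
    (1, "a"): 2,
    (2, "t"): 3,  # the class [tp] becomes two edges into the same state
    (2, "p"): 3,
    (3, "e"): 4,
    (4, "r"): 5,
}
ACCEPT = {5}

def accepts(word):
    """Walk the automaton over word; accept only if it ends in a final state."""
    state = 0
    for ch in word:
        state = TRANSITIONS.get((state, ch))
        if state is None:  # no edge for this character: reject
            return False
    return state in ACCEPT

results = {w: accepts(w) for w in ["cater", "caper", "acapella"]}
print(results)  # {'cater': True, 'caper': True, 'acapella': False}
```

The hand-built machine agrees with `re.fullmatch(r"ca[tp]er", word)` on all three words.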
  36. 78.

    RE Cheatsheet • How to encourage RE use? • Distribute cheatsheets • Comb through LL source for examples where REs could be used?
  37. 79.

    Use: Search Domains • Find all files containing @todo or !todo • Find all files using ColdFusion (cf) tags