Introduction to Regular Expressions

Jamie Ly

March 04, 2010

Transcript

  1. How are they processed?
    • One method of processing uses FSAs (finite-state automata)
      ◦ http://osteele.com/tools/reanimator/
  2. "Some people, when confronted with a problem, think 'I know, I’ll use regular expressions.' Now they have two problems." -Jamie Zawinski
  3. Feature: Literals
    • 1 2 3
    • a b c
    • A B C
    • - # %
    • \. \^ \$
    • \t \n \r
    /abc/
  4. Feature: Character classes/sets
    • enclosed with brackets
    • OR
    • ba[rs]k matches bark and bask
    • Negations ^
    • c[^r]ap matches clap or chap, but not crap (and not cap: the class still consumes one character)
    • Short-hand
      ◦ Perl: \d, \s, \w
      ◦ POSIX: [:digit:], [:space:], [:word:]
    /a[bc]/
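The character-class examples on this slide can be tried out with a minimal sketch. Python's re module is an assumption here (the deck does not name a language for its examples), and the word lists are made up for illustration.

```python
import re

# ba[rs]k: the class [rs] matches exactly one character, either r or s.
words = ["bark", "bask", "back", "balk"]
class_hits = [w for w in words if re.fullmatch(r"ba[rs]k", w)]
print(class_hits)  # ['bark', 'bask']

# c[^r]ap: the negated class still consumes one character, just not an r.
candidates = ["crap", "clap", "chap", "cap"]
negated_hits = [w for w in candidates if re.fullmatch(r"c[^r]ap", w)]
print(negated_hits)  # ['clap', 'chap']

# \d is the Perl-style shorthand for a digit.
digit_runs = re.findall(r"\d+", "call 555 1234 now")
print(digit_runs)  # ['555', '1234']
```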
  5. Feature: Quantifiers
    • ? one or none
    • * >= 0
    • + > 0
    • {n} = n
    • {n,m} between n and m (no space after the comma)
    /a[bc]+/
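A small sketch of the quantifiers in action, using the slide's /a[bc]+/ pattern. Python's re module and the sample strings are assumptions made for illustration; the helper name first_match is hypothetical.

```python
import re

pattern = re.compile(r"a[bc]+")  # an 'a' followed by one or more 'b' or 'c'

def first_match(text):
    """Return the first substring matching a[bc]+, or None if there is none."""
    m = pattern.search(text)
    return m.group(0) if m else None

print(first_match("a"))      # None: + needs at least one b or c
print(first_match("ab"))     # ab
print(first_match("xacbz"))  # acb

# {n,m} bounds the repeat count: two to four digits here.
two_to_four = re.compile(r"\d{2,4}")
print(bool(two_to_four.fullmatch("123")))    # True
print(bool(two_to_four.fullmatch("12345")))  # False
```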
  6. Feature: Grouping - Capture
    search = "apple pear scrapple snapple peach pineapple"
    re = "([a-z]+pple)"
    matches = RegexMatch(search, re)
    print matches
    ["apple", "scrapple", "snapple", "pineapple"]
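The slide's pseudocode maps fairly directly onto a real engine. The sketch below uses Python's re.findall as a stand-in for the slide's RegexMatch function, which is not a real API as far as this transcript shows.

```python
import re

search = "apple pear scrapple snapple peach pineapple"

# With a single capture group, findall returns the text captured by that group
# for every non-overlapping match.
matches = re.findall(r"([a-z]+pple)", search)
print(matches)  # ['apple', 'scrapple', 'snapple', 'pineapple']
```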
  7. Feature: Others
    • Modifiers
    • Start/end anchors
    • Zero-length matches
    • Backreferences
    • Lookahead, lookbehind, lookaround, ...
  8. Use: Search - Finding Text In Files
    • grep
    • findstr (Windows)
    • eclipse, Notepad++, ...
    • Possible Uses:
      ◦ files containing @todo or !todo
      ◦ files using cf tags
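The two "possible uses" can be approximated in a few lines. This is a sketch only: the file names and contents are hypothetical and held in a dictionary rather than read from disk, and the patterns are one reasonable guess at what "@todo or !todo" and "cf tags" would look like as regexes.

```python
import re

# Hypothetical file contents, keyed by file name, standing in for files on disk.
files = {
    "cart.cfm": "<cfset total = 0> <!--- @todo handle tax --->",
    "util.js": "// nothing pending here",
    "login.cfm": "<cfif loggedIn> <!--- !todo add logout --->",
    "readme.txt": "@todo write docs",
}

todo_re = re.compile(r"[@!]todo")  # @todo or !todo
cftag_re = re.compile(r"<cf\w+")   # an opening ColdFusion tag such as <cfset

todo_files = sorted(name for name, body in files.items() if todo_re.search(body))
cf_files = sorted(name for name, body in files.items() if cftag_re.search(body))

print(todo_files)  # ['cart.cfm', 'login.cfm', 'readme.txt']
print(cf_files)    # ['cart.cfm', 'login.cfm']
```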
  9. Use: Search - Using Regex In Code
    • ColdFusion - REReplace note
    • Javascript - RegExp
    • Java - java.util.regex.Pattern
    • ASP - VBScript.Regexp
    • C# - System.Text.RegularExpressions.Regex
    • Comparison http://en.wikipedia.org/wiki/Comparison_of_regular_expression_engines
  10. Use: Normalization
    • Legacy system data (simple flat-file processing)
    • One-off user lists
    • Phone numbers, zip code
    • Example: Phone Number List
    • Common use case: Joining/Splitting Strings
    • Common use case: Code refactoring
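As an illustration of the phone-number and splitting use cases, here is a minimal sketch. The target format "(215) 555-0123", the 10-digit US assumption, and the function name normalize_phone are all choices made for this example and do not come from the deck.

```python
import re

def normalize_phone(raw):
    """Reduce a US-style phone number to "(NNN) NNN-NNNN", or None if it has the wrong length."""
    digits = re.sub(r"\D", "", raw)           # drop everything that is not a digit
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                   # drop a leading country code
    if len(digits) != 10:
        return None
    return "({}) {}-{}".format(digits[:3], digits[3:6], digits[6:])

print(normalize_phone("215.555.0123"))    # (215) 555-0123
print(normalize_phone("1-215-555-0123"))  # (215) 555-0123
print(normalize_phone("555-0123"))        # None

# Splitting a one-off user list on commas or semicolons with optional spaces.
users = re.split(r"\s*[,;]\s*", "ann, bob;carl ; dee")
print(users)  # ['ann', 'bob', 'carl', 'dee']
```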
  11. Use: Extraction
    • Ties into search and normalization
    • Example: URLs, e-mail addresses
    • Plug: CiteThis! sample
    • Use: Screen scraping Yahoo Finance src code
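Extraction of URLs and e-mail addresses might look like the sketch below. Both patterns are deliberately simplified assumptions: they handle the sample text but are nowhere near a full treatment of either syntax.

```python
import re

text = "Mail jamie@example.com or see http://example.com/regex and https://example.org."

# Simplified e-mail pattern: local part, @, then at least two dot-separated labels.
emails = re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text)

# Simplified URL pattern: the final \w keeps a sentence-ending period out of the match.
urls = re.findall(r"https?://[\w./-]*\w", text)

print(emails)  # ['jamie@example.com']
print(urls)    # ['http://example.com/regex', 'https://example.org']
```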
  12. Use: Validation
    • User input
    • Casting vs. String Matching
    • Example: Input Masking, Stripping HTML
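The "casting vs. string matching" trade-off can be made concrete with a short sketch. The function names are hypothetical, and the integer check is just one example of user-input validation.

```python
import re

def is_int_by_cast(s):
    """Validate by attempting the conversion and catching the failure."""
    try:
        int(s)
        return True
    except ValueError:
        return False

def is_int_by_regex(s):
    """Validate by matching the whole string against an explicit pattern."""
    return re.fullmatch(r"[+-]?\d+", s) is not None

for sample in ["42", "-7", "4.2", " 42 "]:
    print(repr(sample), is_int_by_cast(sample), is_int_by_regex(sample))
```

The two approaches agree on the first three samples but differ on " 42 ": Python's int() tolerates surrounding whitespace, while the regex accepts only what it spells out. Which behavior is preferable depends on how strict the input format needs to be.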
  13. When to Avoid RE
    • HTML Parsing
      ◦ DOM/XPath
    • Performance critical
    • Stateful processing
      ◦ odd number of a's
    • Some text file processing (some CSV)
  14. How to Read REs
    • Why be able to read them?
    • Decompose a regex into its component parts
      ◦ How to solve? 1+3*4/5^6+3+5-6*10+3-5
      ◦ Decompose/Group!
    • Be familiar with the problem space/domain
    • Write down strings that match as you scan
    • Test against various strings (only use this if stumped!)
  15. How to Construct a RE
    • Have a few examples of:
      ◦ matching strings
      ◦ non-matching strings
    • start with a simple expression
    • build up
    • like writing pseudo-code
    • Let's write a date-matching RE!
  16. Just because we can... does it mean we should?
    Casting vs. RegEx
    • Simple natural language parsing
      ◦ Search queries
      ◦ I Want Sandy
      ◦ Remember the milk
  17. Date Match! 1. What will we match?
    • Typical US Dates?
    • International format? 2010-03-04
    • Do we match times?
    Let's settle on matching:
    • Month date, year
    • Including 3-letter month names
  18. Date Match! 1. What will we match? (continued)
    • Possible matches
      ◦ Oct. 20, 2000
    • Not matches
      ◦ 10/20/2000
      ◦ October 20
  19. Date Match! 2. Start with a simple expression
    Write a Regex to match "October 20, 2000"
    /October 20, 2000/
    Easy huh? Now, be able to match each day in October.
    /October [1-3][0-9], 2000/
  20. Date Match! 2. Start with a simple expression (cont)
    Close! But it doesn't match the 1st through 9th
    /October [1-3]?[0-9], 2000/
    Better! Now, match any year from 1000 to 3999.
    /October [1-3]?[0-9], [123][0-9][0-9][0-9]/
  21. Date Match! 2. Start with a simple expression (cont)
    When you see the same classes repeated, you can simplify!
    /October [1-3]?[0-9], [123][0-9]{3}/
    Now, match October's abbreviation.
    /(October|Oct\.) [1-3]?[0-9], [123][0-9]{3}/
  22. Date Match! 2. Start with a simple expression (cont)
    Now we can add the other months. We'll only add May and December so we can limit the size.
    /((May|October|December)|(Oct|Dec)\.) [1-3]?[0-9], [123][0-9]{3}/
    or
    /(May|October|December|Oct\.|Dec\.) [1-3]?[0-9], [123][0-9]{3}/
  23. Date Match! 2. Start with a simple expression (cont)
    From here, we determine whether to loosen the requirements.
    • Ignore whitespace?
    • Ignore case?
    • Comma, other punctuation optional?
    /(May|October|December|Oct\.?|Dec\.?)\s+[1-3]?[0-9],?\s+[123][0-9]{3}/
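The loosened pattern can be checked against the match and non-match examples listed earlier in the deck. This is a sketch under two assumptions: Python's re module is used, and the day class is written as [1-3]? so that the 30th and 31st are accepted. The IGNORECASE flag is one way to answer the "Ignore case?" question.

```python
import re

DATE_RE = re.compile(
    r"(May|October|December|Oct\.?|Dec\.?)\s+[1-3]?[0-9],?\s+[123][0-9]{3}",
    re.IGNORECASE,
)

def is_date(text):
    """True when the whole string is a date in "Month day, year" form."""
    return DATE_RE.fullmatch(text) is not None

print(is_date("Oct. 20, 2000"))     # True
print(is_date("October 20, 2000"))  # True
print(is_date("dec 31 1999"))       # True: case, period and comma are all optional
print(is_date("10/20/2000"))        # False
print(is_date("October 20"))        # False: no year
```

Note that the pattern is still permissive: it accepts day values such as 0 or 39, so a real validator would need a further range check.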
  24. Regex Builders
    • Not crutches!
    • Testers/Interactive builder
      ◦ http://rubular.com/
      ◦ http://gskinner.com/RegExr/
      ◦ *http://ryanswanson.com/regexp/#start
      ◦ http://osteele.com/tools/rework/#
      ◦ http://txt2re.com/index-php.php3
      ◦ http://tools.netshiftmedia.com/regexlibrary/
  25. Feature: Backreference
    • You may use a captured group in a regex
    • This is useful for paired data such as html
    • When you don't know what the first match will be
    Matches an html tag and its closing tag (kinda)
    /<(\w+).+> inner html including tags </\1>/
    Matches a line of the name song!
    /\w(\w+), B\1, Bo B\1, Banana Fanana/
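Both backreference ideas can be sketched in runnable form. The tag pattern below is an adaptation, not the slide's exact expression: it assumes Python's re module, uses [^>]* for attributes and a lazy (.*?) for the inner html, and is just as "kinda" as the original when tags of the same name are nested.

```python
import re

# \1 must repeat whatever tag name the first group captured.
tag_re = re.compile(r"<(\w+)[^>]*>(.*?)</\1>", re.DOTALL)

m = tag_re.search('<p class="x">hello <b>world</b></p>')
print(m.group(1))  # p
print(m.group(2))  # hello <b>world</b>

# A mismatched closing tag finds no match at all.
print(tag_re.search("<b>oops</i>"))  # None

# The name-song pattern from the slide: \1 is the name minus its first letter.
song_re = re.compile(r"\w(\w+), B\1, Bo B\1, Banana Fanana")
song = song_re.search("Katie, Batie, Bo Batie, Banana Fanana")
print(song.group(1))  # atie
```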
  26. Feature: Start/end anchors
    • ^ begin
    • $ end
    • Maria: Let's start at the very beginning!
    • ^do re mi
    • REM: world as we know it$
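The two anchored patterns can be exercised with a short sketch. Python's re module and the sample lines are assumptions made for illustration.

```python
import re

start_re = re.compile(r"^do re mi")            # must appear at the very beginning
end_re = re.compile(r"world as we know it$")   # must appear at the very end

starts = start_re.search("do re mi fa so") is not None
not_at_start = start_re.search("let's sing do re mi") is not None
ends = end_re.search("it's the end of the world as we know it") is not None
not_at_end = end_re.search("world as we know it, and I feel fine") is not None

print(starts, not_at_start)  # True False
print(ends, not_at_end)      # True False
```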
  27. Search
    • dir/ls
    • Example: Find files beginning with test
    • find
    • Example: Find files beginning with test and ending with a numeric timestamp.
    • Common use case:
      ◦ finding all log files
      ◦ finding image files
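The second example (names beginning with test and ending with a numeric timestamp) could be expressed as the sketch below. The file names are hypothetical, and the assumption that a timestamp is 8 to 14 digits (a date, optionally followed by a time) is a choice made for this example only.

```python
import re

# Hypothetical directory listing.
names = [
    "test_run_20100304",
    "test_notes.txt",
    "mytest_20100304",
    "test20100304120000",
]

# ^test anchors the prefix; \d{8,14}$ requires the name to end in the timestamp.
stamp_re = re.compile(r"^test.*?\d{8,14}$")

stamped = [n for n in names if stamp_re.search(n)]
print(stamped)  # ['test_run_20100304', 'test20100304120000']
```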
  28. Use: Normalization (cont)
    • Phone numbers, zip code
    • Example: Phone Number List
    • Common use case: Joining/Splitting Strings
  29. General Search
    • IDE
      ◦ Notepad++
        ▪ sql search
      ◦ Eclipse
        ▪ function search
      ◦ Dreamweaver?
      ◦ Word?
        ▪ Not really regex
      ◦ Emacs?
    • FindStr/Grep
      ◦ Example: ?
      ◦ Common use case: Finding all instances of a global variable
  30. What is a regular expression?
    • formal language
    • interpreted by RE processor
    • mini-programs
    • mini-specifications
    • domain: text-processing
    • iffy: find:RE::dom:XPath ... kinda
    • show program parse tree?
      ◦ compare to regex automata graph
      ◦ graph: ca[tp]er against cater caper acapella
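To make the "regex automata graph" idea concrete, here is a hand-built finite-state automaton for ca[tp]er, compared against a regex engine on the slide's three words. The transition table is written by hand for this sketch; it is not how any particular engine actually compiles the pattern.

```python
import re

# State -> {character: next state}. State 0 is the start, state 5 is accepting.
TRANSITIONS = {
    0: {"c": 1},
    1: {"a": 2},
    2: {"t": 3, "p": 3},  # the [tp] class: either character leads to the same state
    3: {"e": 4},
    4: {"r": 5},
}
ACCEPT = 5

def fsa_accepts(word):
    """Walk the automaton one character at a time; reject on any missing edge."""
    state = 0
    for ch in word:
        state = TRANSITIONS.get(state, {}).get(ch)
        if state is None:
            return False
    return state == ACCEPT

for word in ["cater", "caper", "acapella"]:
    by_regex = re.fullmatch(r"ca[tp]er", word) is not None
    print(word, fsa_accepts(word), by_regex)
```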
  31. RE Cheatsheet
    • How to encourage RE use?
    • Distribute cheatsheets
    • Comb through LL source for examples where REs could be used?
  32. Use: Search Domains
    • Find all files containing @todo or !todo
    • Find all files using cf tags