Introduction to Regular Expressions

Jamie Ly

March 04, 2010

Transcript

  1. Intro to Regular Expressions Presented by: Jamie Ly

  2. Agenda •Definition •Infamy •Whetting •Literacy •Potpourri

  3. [a-z0-9\.]+@wharton\.upenn\.edu

  4. What is a Regex? • Pattern matcher • Text processor

    • mini-specification
  5. How are they processed? • One method of processing uses

    finite-state automata (FSAs) ◦ http://osteele.com/tools/reanimator/
  6. Why the bad rap?

  7. Some people, when confronted with a problem, think 'I know,

    I’ll use regular expressions.' Now they have two problems. -Jamie Zawinski
  8. Overused Unintelligible

  9. Obscure Regexes E-Mail Validation, Complete Spec http://ex-parrot.com/~pdw/Mail-RFC822-Address.html Sudoku Solver http://perl.abigail.be/Talks/Sudoku/HTML/

    (direct link)
  10. Whetting

  11. Features

  12. Feature: Literals • 1 2 3 • a b c

    • A B C • - # % • \. \^ \$ • \t \n \r /abc/
  13. Feature: Character classes/sets • enclosed with brackets • OR •

    ba[rs]k matches bark and bask • Negations ^ • c[^r]ap matches clap or chap, but not crap • Short-hand ◦ Perl: \d, \s, \w ◦ POSIX: [:digit:], [:space:], [:word:] /a[bc]/
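
A minimal sketch of the character-class bullets above, written in Python (the deck itself is language-neutral, so the `re` module and the sample word lists are assumptions made for illustration):

```python
import re

# ba[rs]k: the class [rs] matches exactly one character, either r or s.
bark_or_bask = re.compile(r"ba[rs]k")
class_hits = [w for w in ["bark", "bask", "balk", "back"]
              if bark_or_bask.fullmatch(w)]

# c[^r]ap: the negated class [^r] matches any single character except r.
not_crap = re.compile(r"c[^r]ap")
negated_hits = [w for w in ["clap", "crap", "chap", "cap"]
                if not_crap.fullmatch(w)]

# \d is the Perl-style shorthand for a digit.
digit_runs = re.findall(r"\d+", "room 101, floor 3")

print(class_hits)    # ['bark', 'bask']
print(negated_hits)  # ['clap', 'chap']
print(digit_runs)    # ['101', '3']
```

Note that "cap" is rejected: a negated class still has to consume one character.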
  14. Feature: Quantifiers • ? one or none • * >=0

    • + > 0 • {n} = n • {n,m} between n and m /a[bc]+/
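
To compare the quantifiers side by side, here is a small Python sketch that runs variants of the slide's /a[bc]+/ against the same sample strings. The sample strings and the use of full-string matching are assumptions chosen for illustration:

```python
import re

samples = ["a", "ab", "abc", "abcbc", "abd"]

# Each pattern applies a different quantifier to the class [bc].
patterns = {
    "?": r"a[bc]?",          # zero or one
    "*": r"a[bc]*",          # zero or more
    "+": r"a[bc]+",          # one or more
    "{2}": r"a[bc]{2}",      # exactly two
    "{2,4}": r"a[bc]{2,4}",  # between two and four
}

# fullmatch requires the whole string to match, not just a substring.
quantifier_hits = {
    name: [s for s in samples if re.fullmatch(pattern, s)]
    for name, pattern in patterns.items()
}

for name, hits in quantifier_hits.items():
    print(name, hits)
```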
  15. Feature: Grouping • Uses parens • For capture • Extraction

    • Backreferences* /(a[bc])+/
  16. Feature: Grouping - Capture search = "apple pear scrapple snapple

    peach pineapple" re = "([a-z]+pple)" matches = RegexMatch ( search, re ) print matches ["apple", "scrapple", "snapple", "pineapple"]
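
The pseudocode above uses a generic RegexMatch function. One way to get the same result in Python is `re.findall`, which returns the captured group for every match when the pattern has a single group. This is a sketch of one possible equivalent, not the presenter's own code:

```python
import re

search = "apple pear scrapple snapple peach pineapple"
pattern = r"([a-z]+pple)"

# With exactly one capture group, findall returns a list of that group's text.
matches = re.findall(pattern, search)

print(matches)  # ['apple', 'scrapple', 'snapple', 'pineapple']
```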
  17. Feature: Others • Modifiers • Start/end anchors • Zero-length matches

    • Backreferences • Lookahead, lookbehind, lookaround, ...
  18. Examples

  19. Example: Quantifiers /(not)? to be/ /((lion)+(tiger)+(bear)+)/

  20. Example: Quantifiers - Matches /(not)? to be/ "not to be"

    "to be"
  21. Example: Quantifiers - No match /(not)? to be/ "NOT to

    be" "not to be"
  22. Example: Quantifiers - Matches /((lion)+(tiger)+(bear)+)/ "liontigerbear" "lionliontigertigerbear"

  23. Example: Quantifiers - No Match /((lion)+(tiger)+(bear)+)/ "lion tiger bear" "lionbear"
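
The lion/tiger/bear examples can be checked with a short Python sketch. It assumes `re.search` semantics (a match anywhere in the string), which gives the same answers here as matching the whole string:

```python
import re

ZOO = re.compile(r"((lion)+(tiger)+(bear)+)")

candidates = [
    "liontigerbear",
    "lionliontigertigerbear",
    "lion tiger bear",  # spaces break the required adjacency
    "lionbear",         # (tiger)+ needs at least one tiger
]

zoo_results = {text: bool(ZOO.search(text)) for text in candidates}

for text, matched in zoo_results.items():
    print(f"{text!r}: {matched}")
```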

  24. Example: Grouping - Alternation /the (best|worst) of times/ /(liberty|death)/
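
As a sketch of alternation inside a group, the first pattern can be run in Python against a sample sentence. The sentence is an assumption added for illustration:

```python
import re

text = "It was the best of times, it was the worst of times"

# (best|worst) matches either word; findall returns what the group captured.
superlatives = re.findall(r"the (best|worst) of times", text)

print(superlatives)  # ['best', 'worst']
```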

  25. Uses

  26. 4 Main Uses •Search •Normalization •Extraction •Validation

  27. Use: Search - Finding Text In Files • grep •

    findstr (Windows) • eclipse, Notepad++, ... • Possible Uses: ◦ files containing @todo or !todo ◦ files using cf tags
  28. Use: Search - Using Regex In Code • ColdFusion -

    REReplace note • Javascript - RegExp • Java - java.util.regex.Pattern • ASP - VBScript.Regexp • C# - System.Text.RegularExpressions.Regex • Comparison http://en.wikipedia.org/wiki/Comparison_of_regular_expression_engines
  29. Use: Search - Another Domain • Programming language detection ◦

    Pastebin ◦ Syntax Highlighting
  30. Use: Normalization • Legacy system data (simple flat-file processing) •

    One-off user lists • Phone numbers, zip code • Example: Phone Number List • Common use case: Joining/Splitting Strings • Common use case: Code refactoring
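
A hedged sketch of the phone-number and splitting use cases in Python. The function name `normalize_phone`, the 10-digit US format, and the sample inputs are all hypothetical; the deck's own phone number list is not reproduced in the transcript:

```python
import re

def normalize_phone(raw):
    """Return a US number as (NNN) NNN-NNNN, or None if it is not 10 digits.

    Hypothetical helper for illustration only.
    """
    digits = re.sub(r"\D", "", raw)  # drop everything that is not a digit
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]          # strip a leading country code
    if len(digits) != 10:
        return None
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"

normalized = [normalize_phone(p)
              for p in ["215.555.0100", "1-215-555-0100", "555-0100"]]

# Splitting on a comma with optional surrounding whitespace.
fields = re.split(r"\s*,\s*", "alpha , beta,gamma")

print(normalized)  # ['(215) 555-0100', '(215) 555-0100', None]
print(fields)      # ['alpha', 'beta', 'gamma']
```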
  31. Use: Extraction • Ties into search and normalization • Example:

    URLs, e-mail addresses • Plug: CiteThis! sample • Use: Screen scraping Yahoo Finance src code
  32. Use: Validation • User input • Casting v String Matching

    • Example: Input Masking, Stripping HTML
  33. When to Avoid RE • HTML Parsing ◦ DOM/XPath •

    Performance critical • Stateful processing ◦ odd number of a's • Some text file processing (some CSV)
  34. Regex Literacy

  35. How to Read REs • Why be able to read

    them? • Decompose a regex into its component parts ◦ How to solve? 1+3*4/5^6+3+5-6*10+3-5 ◦ Decompose/Group! • Be familiar with the problem space/domain • Write down strings that match as you scan • Test against various strings (only use this if stumped!)
  36. /^#?([a-f0-9]{6}|[a-f0-9]{3})$/

  37. /^#?([a-f0-9]{6}|[a-f0-9]{3})$/

  38. /^#?( [a-f0-9]{6} | [a-f0-9]{3} )$/

  39. #aaa af0 af9f04 #00cc00
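
The last reading tip, testing against various strings, can be sketched in Python for the hex color pattern. The four strings from the slide are expected to match; the three extra strings are assumptions added to show rejections:

```python
import re

HEX_COLOR = re.compile(r"^#?([a-f0-9]{6}|[a-f0-9]{3})$")

from_slide = ["#aaa", "af0", "af9f04", "#00cc00"]
extras = ["#ggg", "#abcd", "#AAA"]  # bad letter, bad length, uppercase

hex_results = {s: bool(HEX_COLOR.search(s)) for s in from_slide + extras}

for sample, matched in hex_results.items():
    print(f"{sample!r}: {matched}")
```

"#AAA" fails only because the pattern lists lowercase letters; a case-insensitive modifier would accept it.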

  40. /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/

  41. /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/

  42. /^ ([a-z0-9_\.-]+) @ ([\da-z\.-]+) \. ([a-z\.]{2,6}) $/

  43. a@-.az -@000.apples _______@0------0... jamiely@wharton.upenn.edu
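
A Python sketch confirming that the four strings above, odd as some of them look, all satisfy the e-mail pattern. It assumes the pattern is written without the stray space that appears in some renderings of the slide:

```python
import re

EMAIL = re.compile(r"^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$")

addresses = [
    "a@-.az",
    "-@000.apples",
    "_______@0------0...",
    "jamiely@wharton.upenn.edu",
]

email_results = {a: bool(EMAIL.search(a)) for a in addresses}

# The three groups are the user name, the domain, and the top-level part.
email_parts = EMAIL.search("jamiely@wharton.upenn.edu").groups()

print(email_results)
print(email_parts)  # ('jamiely', 'wharton.upenn', 'edu')
```

The pattern is permissive by design: it checks shape, not whether the address is deliverable.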

  44. Regex Construction

  45. How to Construct a RE • Have a few examples

    of : ◦ matching strings ◦ not matching strings • start with a simple expression • build up • like writing pseudo-code • Let's write a date-matching RE!
  46. Just because we can... does it mean we should? Casting

    v RegEx • Simple natural language parsing ◦ Search queries ◦ I Want Sandy ◦ Remember the milk
  47. Date Match! 1. What will we match? • Typical US

    Dates? • International format? 2010-03-04 • Do we match times? Let's settle on matching: • Month date, year • Including 3-letter month names
  48. Date Match! 1. What will we match? (continued) • Possible

    matches ◦ Oct. 20, 2000 • Not matches ◦ 10/20/2000 ◦ October 20
  49. Date Match! 2. Start with a simple expression Write a

    Regex to match "October 20, 2000" /October 20, 2000/ Easy huh? Now, be able to match each day in October. /October [1-3][0-9], 2000/
  50. Date Match! 2. Start with a simple expression (cont) Close!

    But it doesn't match the 1st through 9th /October [1-3]?[0-9], 2000/ Better! Now, match any year from 1000 to 3999. /October [1-3]?[0-9], [123][0-9][0-9][0-9]/
  51. Date Match! 2. Start with a simple expression (cont) When

    you see the same classes repeated, you can simplify! /October [1-3]?[0-9], [123][0-9]{3}/ Now, match October's abbreviation. /(October|Oct\.) [1-3]?[0-9], [123][0-9]{3}/
  52. Date Match! 2. Start with a simple expression (cont) Now

    we can add the other months. We'll only add May and December so we can limit the size. /((May|October|December)|(Oct|Dec)\.) [1-3]?[0-9], [123][0-9]{3}/ or /(May|October|December|Oct\.|Dec\.) [1-3]?[0-9], [123][0-9]{3}/
  53. Date Match! 2. Start with a simple expression (cont) From

    here, we determine whether to loosen the requirements. • Ignore whitespace? • Ignore case? • Comma, other punctuation optional? /(May|October|December|Oct\.?|Dec\.?)\s+[1-3]?[0-9],?\s+[123][0-9]{3}/
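
A sketch of the loosened date pattern in Python. Two assumptions are made here: the day class is written as [1-3]?[0-9] so that the 30th and 31st match, and the "ignore case" option is switched on with `re.IGNORECASE`. The sample strings include the matches and non-matches listed earlier in the deck:

```python
import re

DATE_RE = re.compile(
    r"(May|October|December|Oct\.?|Dec\.?)\s+[1-3]?[0-9],?\s+[123][0-9]{3}",
    re.IGNORECASE,
)

candidates = [
    "Oct. 20, 2000",
    "October 20, 2000",
    "dec 31 1999",   # lowercase, no period, no comma
    "10/20/2000",    # numeric format is out of scope
    "October 20",    # missing year
]

date_results = {text: bool(DATE_RE.search(text)) for text in candidates}

for text, matched in date_results.items():
    print(f"{text!r}: {matched}")
```

The day class is still loose: it accepts 0 and 32 through 39, so real validation would need a further check.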
  54. Potpourri

  55. Regex Builders • Not crutches! • Testers/Interactive builder ◦ http://rubular.com/

    ◦ http://gskinner.com/RegExr/ ◦ *http://ryanswanson.com/regexp/#start ◦ http://osteele.com/tools/rework/# ◦ http://txt2re.com/index-php.php3 ◦ http://tools.netshiftmedia.com/regexlibrary/
  56. Custom Project Using REs Word Jumble! http://scorpio-dev.wharton.upenn.edu/users/jamiely/wordjumble/

  57. http://www.addedbytes.com/cheat-sheets/regular-expressions-cheat-sheet/

  58. Appendices

  59. References • http://net.tutsplus.com/tutorials/other/8-regular-expressions-you-should-know/ • http://regexlib.com/ • http://www.regular-expressions.info/

  60. Unused slides

  61. RE Linksheet

  62. Feature: Backreference • You may use a captured group in

    a regex • This is useful for paired data such as html • When you don't know what the first match will be Matches an html tag and its closing tag (kinda) /<(\w+).+> inner html including tags </\1>/ Matches a line of the name song! /\w(\w+), B\1, Bo B\1, Banana Fanana/
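
A Python sketch of both backreference ideas. The tag pattern is a simplified stand-in for the slide's pseudo-pattern (it is an assumption, and like the original it only "kinda" handles HTML); the name-song pattern is taken from the slide, with a sample line invented for illustration:

```python
import re

# \1 must repeat whatever group 1 captured, so the closing tag has to pair up.
PAIRED_TAG = re.compile(r"<(\w+)[^>]*>.*?</\1>")
paired_ok = bool(PAIRED_TAG.search("<b>bold</b>"))
paired_bad = bool(PAIRED_TAG.search("<b>bold</i>"))

# The first letter is skipped and the rest of the name is captured and reused.
NAME_SONG = re.compile(r"\w(\w+), B\1, Bo B\1, Banana Fanana")
song = NAME_SONG.search("Katie, Batie, Bo Batie, Banana Fanana")
name_tail = song.group(1) if song else None

print(paired_ok, paired_bad)  # True False
print(name_tail)              # atie
```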
  63. Feature: Modifiers • (g)lobal • (i)gnore case/case (i)nsensitive • (m)ultiline

    • example
  64. Feature: Start/end anchors • ^ begin • $ end •

    Maria: Let's start at the very beginning! • ^do re mi • REM: world as we know it$
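
The two anchored patterns can be tried in Python. The sample sentences are assumptions; the last two lines also show how the (m)ultiline modifier changes what ^ means:

```python
import re

# ^ pins the match to the start of the string.
starts = bool(re.search(r"^do re mi", "do re mi fa so"))
not_start = bool(re.search(r"^do re mi", "let's sing do re mi"))

# $ pins the match to the end of the string.
ends = bool(re.search(r"world as we know it$",
                      "it's the end of the world as we know it"))
not_end = bool(re.search(r"world as we know it$",
                         "the world as we know it, and I feel fine"))

# With re.MULTILINE, ^ also matches at the start of every line.
first_only = re.findall(r"^\w+", "do\nre\nmi")
each_line = re.findall(r"^\w+", "do\nre\nmi", re.MULTILINE)

print(starts, not_start, ends, not_end)  # True False True False
print(first_only, each_line)             # ['do'] ['do', 're', 'mi']
```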
  65. Example: Quantifiers - Alt. Regex /((lion)+(tiger)+(bear)+)/ /((lion){1,2}(tiger){1,2}(bear)?)/ "liontigerbear" "lionliontigertigerbear"

  66. Example: Grouping - Matches /(circle of (hell|the (inferno|abyss))){1,7}/ "circle of

    the inferno" "circle of hellcircle of hell"
  67. Example: Grouping - No match /(circle of (hell|the (inferno|abyss))){1,7}/ "circle

    of abyss" "circle of"
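
The nested grouping examples can be checked with a short Python sketch. It assumes the whole string must match, which is what `re.fullmatch` enforces:

```python
import re

CIRCLE = re.compile(r"(circle of (hell|the (inferno|abyss))){1,7}")

candidates = [
    "circle of the inferno",
    "circle of hellcircle of hell",  # the outer group repeats twice
    "circle of abyss",               # "the" is required before abyss
    "circle of",                     # nothing for the inner group to match
]

circle_results = {text: bool(CIRCLE.fullmatch(text)) for text in candidates}

for text, matched in circle_results.items():
    print(f"{text!r}: {matched}")
```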
  68. *Feature: - Zero Length Matches • Zero-length matches (^, \b)

  69. Use: Capture /(heart|mind)*/

  70. Search • dir/ls • Example: Find files beginning with test

    • find • Example: Find files beginning with test and ending with a numeric timestamp. • Common use case: ◦ finding all log files ◦ finding image files
  71. Use: Normalization (cont) • Phone numbers, zip code • Example:

    Phone Number List • Common use case: Joining/Splitting Strings
  72. General Search • IDE ◦ Notepad++ ▪ sql search ◦

    Eclipse ▪ function search ◦ Dreamweaver? ◦ Word? ▪ Not really regex ◦ Emacs? • FindStr/Grep ◦ Example: ? ◦ Common use case: Finding all instances of a global variable
  73. Search • find

  74. grep, findstr

  75. RE Flavors and Compensating How to compensate for a deficient

    implementation. (ColdFusion)
  76. What is a regular expression? • formal language • interpreted

    by RE processor • mini-programs • mini-specifications • domain: text-processing • iffy: find:RE::dom:XPath ... kinda • show program parse tree? ◦ compare to regex automata graph ◦ graph: ca[tp]er against cater caper acapella
  77. Disperse Examples and Tests Throughout? • Roman numerals:

    http://stackoverflow.com/questions/800813/what-is-the-most-difficult-challenging-regular-expression-you-have-ever-written/800932
  78. RE Cheatsheet • How to encourage RE use? • Distribute

    cheatsheets • Comb through LL source for examples where REs could be used?
  79. Use: Search Domains • Find all files containing @todo or

    !todo • Find all files using cf tags