Introduction to Regular Expressions

Jamie Ly

March 04, 2010

Transcript

  1. Intro to Regular Expressions Presented by: Jamie Ly

  2. Agenda •Definition •Infamy •Whetting •Literacy •Potpourri

  3. [a-z0-9\.][email protected]\.upenn\.edu

  4. What is a Regex? • Pattern matcher • Text processor • Mini-specification

  5. How are they processed? • One method of processing uses finite-state automata (FSAs) ◦ http://osteele.com/tools/reanimator/

  6. Why the bad rap?

  7. Some people, when confronted with a problem, think 'I know, I’ll use regular expressions.' Now they have two problems. -Jamie Zawinski

  8. Overused Unintelligible

  9. Obscure Regexes • E-Mail Validation, Complete Spec http://ex-parrot.com/~pdw/Mail-RFC822-Address.html • Sudoku Solver http://perl.abigail.be/Talks/Sudoku/HTML/ (direct link)

  10. Whetting

  11. Features

  12. Feature: Literals • 1 2 3 • a b c • A B C • - # % • \. \^ \$ • \t \n \r /abc/

  13. Feature: Character classes/sets • Enclosed in brackets • Acts as an OR over single characters • ba[rs]k matches bark and bask • Negation with ^ • c[^r]ap matches clap or chap, but not crap • Shorthand ◦ Perl: \d, \s, \w ◦ POSIX: [:digit:], [:space:], [:word:] /a[bc]/
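
The character-class bullets can be checked with a few lines of code. This is a minimal sketch assuming Python's `re` module (the slides do not name a language); the test words other than bark, bask and crap are made up for illustration.

```python
import re

# [rs] accepts exactly one character that is either "r" or "s".
bark_or_bask = re.compile(r"ba[rs]k")
# [^r] accepts exactly one character that is anything except "r".
not_crap = re.compile(r"c[^r]ap")
# \d is the Perl-style shorthand for a digit.
digits = re.compile(r"\d+")

set_results = [bool(bark_or_bask.fullmatch(w)) for w in ("bark", "bask", "balk")]
negated_results = [bool(not_crap.fullmatch(w)) for w in ("clap", "chap", "crap", "cap")]
digit_runs = digits.findall("room 101, floor 3")

print(set_results)      # [True, True, False]
print(negated_results)  # [True, True, False, False]
print(digit_runs)       # ['101', '3']
```

Note that "cap" fails: a negated class still has to consume one character, so `c[^r]ap` only matches four-character strings.
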
  14. Feature: Quantifiers • ? zero or one • * zero or more (>= 0) • + one or more (> 0) • {n} exactly n • {n,m} between n and m, inclusive /a[bc]+/
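
Each quantifier can be tried against a handful of strings. The sketch below assumes Python's `re` module; the helper name `matches` and the test strings are illustrative, not from the slides.

```python
import re

def matches(pattern, text):
    """Return True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# pattern -> candidate strings to test against it
checks = {
    r"ab?": ["a", "ab", "abb"],              # ?     zero or one b
    r"ab*": ["a", "ab", "abbb"],             # *     zero or more b
    r"ab+": ["a", "ab", "abbb"],             # +     one or more b
    r"ab{2}": ["ab", "abb", "abbb"],         # {n}   exactly two b
    r"ab{1,2}": ["a", "ab", "abb", "abbb"],  # {n,m} one or two b
    r"a[bc]+": ["a", "ab", "acbc", "abd"],   # the slide's example
}
results = {p: [matches(p, s) for s in strings] for p, strings in checks.items()}

for pattern, outcome in results.items():
    print(pattern, outcome)
```

In Python, a space inside the braces (`{1, 2}`) makes the braces literal text, so the range form is written without one.
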
  15. Feature: Grouping • Uses parens • For capture • Extraction • Backreferences* /(a[bc])+/

  16. Feature: Grouping - Capture search = "apple pear scrapple snapple peach pineapple" re = "([a-z]+pple)" matches = RegexMatch(search, re) print matches → ["apple", "scrapple", "snapple", "pineapple"]
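
The capture pseudo-code maps directly onto a real API. Here is one possible rendering, assuming Python, where `re.findall` stands in for the slide's generic `RegexMatch`.

```python
import re

search = "apple pear scrapple snapple peach pineapple"
pattern = r"([a-z]+pple)"

# With a single capture group, findall returns what the group captured
# for every non-overlapping match.
matches = re.findall(pattern, search)
print(matches)  # ['apple', 'scrapple', 'snapple', 'pineapple']
```
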
  17. Feature: Others • Modifiers • Start/end anchors • Zero-length matches • Backreferences • Lookahead, lookbehind, lookaround, ...

  18. Examples

  19. Example: Quantifiers /(not)? to be/ /((lion)+(tiger)+(bear)+)/

  20. Example: Quantifiers - Matches /(not)? to be/ "not to be" " to be" (the space before "to" is always required; /(not )?to be/ would accept a bare "to be")

  21. Example: Quantifiers - No match (when the whole string must match) /(not)? to be/ "NOT to be" (case differs) "not  to be" (two spaces)

  22. Example: Quantifiers - Matches /((lion)+(tiger)+(bear)+)/ "liontigerbear" "lionliontigertigerbear"

  23. Example: Quantifiers - No Match /((lion)+(tiger)+(bear)+)/ "lion tiger bear" "lionbear"
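
The match and no-match slides can be verified mechanically. This sketch assumes Python and whole-string matching (`fullmatch`); the last line shows that an unanchored search behaves differently.

```python
import re

to_be = re.compile(r"(not)? to be")
big_cats = re.compile(r"((lion)+(tiger)+(bear)+)")

def whole(pattern, text):
    """True when the entire string matches the compiled pattern."""
    return pattern.fullmatch(text) is not None

to_be_results = {s: whole(to_be, s)
                 for s in ["not to be", " to be", "to be", "NOT to be"]}
big_cat_results = {s: whole(big_cats, s)
                   for s in ["liontigerbear", "lionliontigertigerbear",
                             "lion tiger bear", "lionbear"]}

# search() only needs the pattern to occur somewhere in the string,
# so it finds " to be" inside "NOT to be".
found_inside = to_be.search("NOT to be") is not None

print(to_be_results)
print(big_cat_results)
print(found_inside)  # True
```

Because the space sits outside the optional group, a bare "to be" is rejected while " to be" is accepted.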

  24. Example: Grouping - Alternation /the (best|worst) of times/ /(liberty|death)/
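
A short sketch of alternation inside a group, assuming Python; the sample sentences are supplied here for illustration.

```python
import re

times = re.compile(r"the (best|worst) of times")
opening = "it was the best of times, it was the worst of times"
# findall returns the alternative that was chosen in each match.
choices = times.findall(opening)

cry = re.search(r"(liberty|death)", "give me liberty or give me death")
# search stops at the first match, scanning left to right.
first_choice = cry.group(1)

print(choices)       # ['best', 'worst']
print(first_choice)  # liberty
```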

  25. Uses

  26. 4 Main Uses •Search •Normalization •Extraction •Validation

  27. Use: Search - Finding Text In Files • grep • findstr (Windows) • Eclipse, Notepad++, ... • Possible Uses: ◦ files containing @todo or !todo ◦ files using cf tags
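
The @todo / !todo search can also be done in code. This is a grep-like sketch assuming Python; the function name `lines_with_todo` and the sample text are hypothetical.

```python
import re

# "@todo" or "!todo", any letter case, ending at a word boundary.
TODO_MARKER = re.compile(r"[@!]todo\b", re.IGNORECASE)

def lines_with_todo(text):
    """Return (line_number, line) pairs for lines containing a marker."""
    return [(number, line)
            for number, line in enumerate(text.splitlines(), start=1)
            if TODO_MARKER.search(line)]

sample = (
    "x = 1\n"
    "// @todo remove magic number\n"
    "<!-- !TODO fix layout -->\n"
    "method_todo()\n"
)
hits = lines_with_todo(sample)
for number, line in hits:
    print(number, line)
```

The last sample line is not reported because "todo" there is preceded by an underscore rather than @ or !.
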
  28. Use: Search - Using Regex In Code • ColdFusion - REReplace note • JavaScript - RegExp • Java - java.util.regex.Pattern • ASP - VBScript.RegExp • C# - System.Text.RegularExpressions.Regex • Comparison http://en.wikipedia.org/wiki/Comparison_of_regular_expression_engines

  29. Use: Search - Another Domain • Programming language detection ◦ Pastebin ◦ Syntax Highlighting

  30. Use: Normalization • Legacy system data (simple flat-file processing) • One-off user lists • Phone numbers, zip code • Example: Phone Number List • Common use case: Joining/Splitting Strings • Common use case: Code refactoring
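
A sketch of phone-number normalization and string splitting, assuming Python. The target format (10-digit US numbers written as XXX-XXX-XXXX), the helper name `normalize_phone`, and all sample data are assumptions made for illustration.

```python
import re

def normalize_phone(raw):
    """Return a 10-digit US number as 'XXX-XXX-XXXX', or None if it does not fit."""
    digits = re.sub(r"\D", "", raw)  # drop everything that is not a digit
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]          # drop a leading country code
    if len(digits) != 10:
        return None
    # Regroup the digits using three capture groups.
    return re.sub(r"(\d{3})(\d{3})(\d{4})", r"\1-\2-\3", digits)

raw_numbers = ["(215) 555-0101", "215.555.0102", "1 215 555 0103", "555-0104"]
normalized = [normalize_phone(n) for n in raw_numbers]
print(normalized)

# Splitting on commas or semicolons with optional surrounding whitespace.
names = re.split(r"\s*[,;]\s*", "alice, bob;carol ,dave")
print(names)
```
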
  31. Use: Extraction • Ties into search and normalization • Example: URLs, e-mail addresses • Plug: CiteThis! sample • Use: Screen scraping Yahoo Finance src code
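
URL extraction might look like the following sketch, assuming Python. The pattern is deliberately loose (a scheme followed by anything that is not whitespace, a parenthesis or an angle bracket) and is not a full URL grammar.

```python
import re

text = ("See http://www.regular-expressions.info/ and "
        "http://regexlib.com/ (both from the references).")

# Loose URL pattern: good enough for pulling links out of prose.
urls = re.findall(r"https?://[^\s()<>]+", text)
print(urls)  # ['http://www.regular-expressions.info/', 'http://regexlib.com/']
```
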
  32. Use: Validation • User input • Casting vs. string matching • Example: Input Masking, Stripping HTML
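
The casting versus string matching trade-off can be illustrated with integer input. This sketch assumes Python; the function names are hypothetical.

```python
import re

INTEGER = re.compile(r"[+-]?\d+")

def valid_by_pattern(text):
    """Accept only an optional sign followed by digits, nothing else."""
    return INTEGER.fullmatch(text) is not None

def valid_by_cast(text):
    """Accept whatever int() is willing to parse."""
    try:
        int(text)
    except ValueError:
        return False
    return True

inputs = ["42", "-7", " 42 ", "12a", ""]
by_pattern = [valid_by_pattern(s) for s in inputs]
by_cast = [valid_by_cast(s) for s in inputs]

print(by_pattern)  # [True, True, False, False, False]
print(by_cast)     # [True, True, True, False, False]
```

The two approaches disagree on " 42 ": `int()` tolerates surrounding whitespace, while the pattern states exactly which characters are allowed.
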
  33. When to Avoid RE • HTML Parsing ◦ DOM/XPath • Performance critical • Stateful processing ◦ odd number of a's • Some text file processing (some CSV)

  34. Regex Literacy

  35. How to Read REs • Why be able to read them? • Decompose a regex into its component parts ◦ How to solve? 1+3*4/5^6+3+5-6*10+3-5 ◦ Decompose/Group! • Be familiar with the problem space/domain • Write down strings that match as you scan • Test against various strings (only use this if stumped!)

  36. /^#?([a-f0-9]{6}|[a-f0-9]{3})$/

  37. /^#?([a-f0-9]{6}|[a-f0-9]{3})$/

  38. /^#?( [a-f0-9]{6} | [a-f0-9]{3} )$/

  39. #aaa af0 af9f04 #00cc00
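
The hex-color pattern can be run against the four strings from the slide. This sketch assumes Python; the three rejected strings are added here for contrast.

```python
import re

HEX_COLOR = re.compile(r"^#?([a-f0-9]{6}|[a-f0-9]{3})$")

samples = ["#aaa", "af0", "af9f04", "#00cc00",  # from the slide
           "#AAA", "#abcd", "00cc0g"]           # added for contrast
results = {s: bool(HEX_COLOR.match(s)) for s in samples}

for sample, ok in results.items():
    print(sample, ok)
```

"#AAA" is rejected because the class only lists lowercase letters; an ignore-case modifier would accept it.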

  40. /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/

  41. /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/

  42. /^ ([a-z0-9_\.-]+) @ ([\da-z\.-]+) \. ([a-z\.]{2,6}) $/

  43. [email protected] [email protected] [email protected] [email protected]
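
The decomposition of the e-mail pattern can be kept in the code itself. This sketch assumes Python's `re.VERBOSE` flag, which ignores whitespace and allows comments inside the pattern. The addresses are hypothetical stand-ins, since the slide's own examples are obscured in this transcript.

```python
import re

EMAIL = re.compile(r"""
    ^
    ([a-z0-9_\.-]+)   # local part: letters, digits, underscore, dot, hyphen
    @
    ([\da-z\.-]+)     # domain name, possibly with subdomains
    \.
    ([a-z\.]{2,6})    # top-level domain, 2 to 6 characters
    $
""", re.VERBOSE)

parts_simple = EMAIL.match("jane.doe@example.com").groups()
parts_subdomain = EMAIL.match("user@mail.example.org").groups()
rejected = [EMAIL.match(s) is None
            for s in ("no-at-sign.example.com", "Jane@example.com")]

print(parts_simple)     # ('jane.doe', 'example', 'com')
print(parts_subdomain)  # ('user', 'mail.example', 'org')
print(rejected)         # [True, True]
```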

  44. Regex Construction

  45. How to Construct a RE • Have a few examples of: ◦ matching strings ◦ non-matching strings • Start with a simple expression • Build up • Like writing pseudo-code • Let's write a date-matching RE!

  46. Just because we can... does it mean we should? Casting vs. RegEx • Simple natural language parsing ◦ Search queries ◦ I Want Sandy ◦ Remember the Milk

  47. Date Match! 1. What will we match? • Typical US Dates? • International format? 2010-03-04 • Do we match times? Let's settle on matching: • Month date, year • Including 3-letter month names

  48. Date Match! 1. What will we match? (continued) • Possible matches ◦ Oct. 20, 2000 • Not matches ◦ 10/20/2000 ◦ October 20

  49. Date Match! 2. Start with a simple expression Write a Regex to match "October 20, 2000" /October 20, 2000/ Easy huh? Now, be able to match each day in October. /October [1-3][0-9], 2000/

  50. Date Match! 2. Start with a simple expression (cont) Close! But it doesn't match the 1st through 9th. /October [1-3]?[0-9], 2000/ Better! Now, match any year from 1000 to 3999. /October [1-3]?[0-9], [123][0-9][0-9][0-9]/
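
The effect of making the tens digit optional can be checked directly. This is a sketch assuming Python, with test dates chosen for illustration.

```python
import re

two_digit_day = re.compile(r"October [1-3][0-9], 2000")
optional_tens = re.compile(r"October [1-3]?[0-9], 2000")

dates = ["October 20, 2000", "October 5, 2000", "October 31, 2000"]
step1 = [two_digit_day.fullmatch(d) is not None for d in dates]
step2 = [optional_tens.fullmatch(d) is not None for d in dates]

# The pattern checks the shape of a date, not whether the day exists.
accepts_day_39 = optional_tens.fullmatch("October 39, 2000") is not None

print(step1)           # [True, False, True]
print(step2)           # [True, True, True]
print(accepts_day_39)  # True
```

The tens digit needs to stay `[1-3]` for the 30th and 31st to match.
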
  51. Date Match! 2. Start with a simple expression (cont) When you see the same classes repeated, you can simplify! /October [1-3]?[0-9], [123][0-9]{3}/ Now, match October's abbreviation. /(October|Oct\.) [1-3]?[0-9], [123][0-9]{3}/

  52. Date Match! 2. Start with a simple expression (cont) Now we can add the other months. We'll only add May and December so we can limit the size. /((May|October|December)|(Oct|Dec)\.) [1-3]?[0-9], [123][0-9]{3}/ or /(May|October|December|Oct\.|Dec\.) [1-3]?[0-9], [123][0-9]{3}/

  53. Date Match! 2. Start with a simple expression (cont) From here, we determine whether to loosen the requirements. • Ignore whitespace? • Ignore case? • Comma, other punctuation optional? /(May|October|December|Oct\.?|Dec\.?)\s+[1-3]?[0-9],?\s+[123][0-9]{3}/
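
The loosened date pattern can be tested against the matching and non-matching strings listed earlier in the walkthrough. This sketch assumes Python and uses `[1-3]` for the tens digit so that the 30th and 31st are accepted; the extra test strings and the case-insensitive variant are illustrative.

```python
import re

DATE_PATTERN = r"(May|October|December|Oct\.?|Dec\.?)\s+[1-3]?[0-9],?\s+[123][0-9]{3}"
DATE = re.compile(DATE_PATTERN)
DATE_ANY_CASE = re.compile(DATE_PATTERN, re.IGNORECASE)  # "Ignore case?"

candidates = ["Oct. 20, 2000", "October 20, 2000", "Dec 5 1999",
              "May  31,  2010", "10/20/2000", "October 20"]
results = {c: DATE.fullmatch(c) is not None for c in candidates}

lower_strict = DATE.fullmatch("october 20, 2000") is not None
lower_loose = DATE_ANY_CASE.fullmatch("october 20, 2000") is not None

for candidate, ok in results.items():
    print(repr(candidate), ok)
print(lower_strict, lower_loose)  # False True
```
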
  54. Potpourri

  55. Regex Builders • Not crutches! • Testers/Interactive builder ◦ http://rubular.com/ ◦ http://gskinner.com/RegExr/ ◦ *http://ryanswanson.com/regexp/#start ◦ http://osteele.com/tools/rework/# ◦ http://txt2re.com/index-php.php3 ◦ http://tools.netshiftmedia.com/regexlibrary/

  56. Custom Project Using REs Word Jumble! http://scorpio-dev.wharton.upenn.edu/users/jamiely/wordjumble/

  57. http://www.addedbytes.com/cheat-sheets/regular-expressions-cheat-sheet/

  58. Appendices

  59. References • http://net.tutsplus.com/tutorials/other/8-regular-expressions-you-should-know/ • http://regexlib.com/ • http://www.regular-expressions.info/

  60. Unused slides

  61. RE Linksheet

  62. Feature: Backreference • You may use a captured group in a regex • This is useful for paired data such as HTML • When you don't know what the first match will be • Matches an HTML tag and its closing tag (kinda): /<(\w+).+> inner html including tags </\1>/ • Matches a line of the name song! /\w(\w+), B\1, Bo B\1, Banana Fanana/
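
Both backreference examples can be made runnable. This sketch assumes Python; the tag pattern is a simplified variant written for this example (it does not handle nesting of the same tag), and the sample strings are made up.

```python
import re

# \1 must repeat whatever group 1 captured, here the tag name.
paired_tag = re.compile(r"<(\w+)[^>]*>(.*?)</\1>", re.DOTALL)

found = paired_tag.search('<p class="x">hello <b>world</b></p>')
tag_name, inner = found.group(1), found.group(2)
mismatched = paired_tag.search("<b>text</i>")  # closing tag differs

name_song = re.compile(r"\w(\w+), B\1, Bo B\1, Banana Fanana")
verse = name_song.search("Katie, Batie, Bo Batie, Banana Fanana")
name_tail = verse.group(1)

print(tag_name, "|", inner)  # p | hello <b>world</b>
print(mismatched)            # None
print(name_tail)             # atie
```
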
  63. Feature: Modifiers • (g)lobal • (i)gnore case/case (i)nsensitive • (m)ultiline • example
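
A sketch of the three modifiers, assuming Python. Python has no global flag: `re.search` returns the first match, while `re.findall` returns all of them, which plays the role of (g). The sample text is illustrative.

```python
import re

text = "Do re mi\ndo re mi\nDO RE MI"

first_only = re.search(r"do", text, re.IGNORECASE).group()  # no "global"
case_sensitive = re.findall(r"do", text)
ignore_case = re.findall(r"do", text, re.IGNORECASE)
# Without MULTILINE, ^ matches only at the start of the whole string.
string_start = re.findall(r"^do", text, re.IGNORECASE)
# With MULTILINE, ^ also matches right after each newline.
line_starts = re.findall(r"^do", text, re.IGNORECASE | re.MULTILINE)

print(first_only)      # Do
print(case_sensitive)  # ['do']
print(ignore_case)     # ['Do', 'do', 'DO']
print(string_start)    # ['Do']
print(line_starts)     # ['Do', 'do', 'DO']
```
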
  64. Feature: Start/end anchors • ^ begin • $ end • Maria: Let's start at the very beginning! • ^do re mi • REM: world as we know it$
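
The two anchored patterns can be tried with `search`, which would otherwise match anywhere in the string. This sketch assumes Python; the longer sample strings are illustrative.

```python
import re

starts = re.compile(r"^do re mi")
ends = re.compile(r"world as we know it$")

start_results = [starts.search(s) is not None
                 for s in ("do re mi fa so la ti", "so do re mi")]
end_results = [ends.search(s) is not None
               for s in ("it's the end of the world as we know it",
                         "the world as we know it, and I feel fine")]

print(start_results)  # [True, False]
print(end_results)    # [True, False]
```
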
  65. Example: Quantifiers - Alt. Regex /((lion)+(tiger)+(bear)+)/ /((lion){1,2}(tiger){1,2}(bear)?)/ "liontigerbear" "lionliontigertigerbear"

  66. Example: Grouping - Matches /(circle of (hell|the (inferno|abyss))){1,7}/ "circle of the inferno" "circle of hellcircle of hell"

  67. Example: Grouping - No match /(circle of (hell|the (inferno|abyss))){1,7}/ "circle of abyss" "circle of"
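
The nested-group example can be verified against all four strings. This sketch assumes Python and whole-string matching.

```python
import re

circle = re.compile(r"(circle of (hell|the (inferno|abyss))){1,7}")

candidates = ["circle of the inferno", "circle of hellcircle of hell",
              "circle of abyss", "circle of"]
results = {c: circle.fullmatch(c) is not None for c in candidates}

for candidate, ok in results.items():
    print(repr(candidate), ok)
```

"circle of abyss" fails because "abyss" is only reachable through the inner alternative that begins with "the ".
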
  68. *Feature: - Zero Length Matches • Zero-length matches (^, \b)
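
A zero-length match consumes no characters; it only asserts a position. The sketch below assumes Python 3.7 or later and uses `\b` (word boundary); the sample strings are illustrative.

```python
import re

# \b keeps "cat" from matching inside "concat" or "cats".
whole_words = re.findall(r"\bcat\b", "cat concat cats cat.")

# Every match of a bare \b is empty: start and end are the same index.
boundaries = [(m.start(), m.end()) for m in re.finditer(r"\b", "to be")]

print(whole_words)  # ['cat', 'cat']
print(boundaries)   # [(0, 0), (2, 2), (3, 3), (5, 5)]
```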

  69. Use: Capture /(heart|mind)*/

  70. Search • dir/ls • Example: Find files beginning with test • find • Example: Find files beginning with test and ending with a numeric timestamp. • Common use case: ◦ finding all log files ◦ finding image files
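
The timestamp example could be filtered in code as follows. This sketch assumes Python and a 14-digit timestamp of the form YYYYMMDDhhmmss, which is an assumption; the slide does not specify a format. The file names are made up.

```python
import re

# Starts with "test", ends with exactly-placed 14 digits (assumed format).
TEST_WITH_TIMESTAMP = re.compile(r"^test.*\d{14}$")

file_names = [
    "test_run_20100304093000",
    "test_notes.txt",
    "mytest_20100304093000",
    "test20100304",
]
selected = [name for name in file_names if TEST_WITH_TIMESTAMP.match(name)]
print(selected)  # ['test_run_20100304093000']
```
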
  71. Use: Normalization (cont) • Phone numbers, zip code • Example: Phone Number List • Common use case: Joining/Splitting Strings

  72. General Search • IDE ◦ Notepad++ ▪ sql search ◦ Eclipse ▪ function search ◦ Dreamweaver? ◦ Word? ▪ Not really regex ◦ Emacs? • FindStr/Grep ◦ Example: ? ◦ Common use case: Finding all instances of a global variable

  73. Search • find

  74. grep, findstr

  75. RE Flavors and Compensating • How to compensate for a deficient implementation. (ColdFusion)

  76. What is a regular expression? • formal language • interpreted by RE processor • mini-programs • mini-specifications • domain: text-processing • iffy: find:RE::dom:XPath ... kinda • show program parse tree? ◦ compare to regex automata graph ◦ graph: ca[tp]er against cater caper acapella
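
The automata graph for ca[tp]er can be written out by hand as a small deterministic state machine. This is a sketch assuming Python; the state numbering and the function name `accepts` are choices made for this example, not how any particular regex engine is implemented.

```python
# Hand-built automaton for ca[tp]er:
# (current state, input character) -> next state
TRANSITIONS = {
    (0, "c"): 1,
    (1, "a"): 2,
    (2, "t"): 3,  # the class [tp] becomes two edges
    (2, "p"): 3,  # leading to the same state
    (3, "e"): 4,
    (4, "r"): 5,
}
ACCEPTING_STATE = 5

def accepts(text):
    """Run the automaton over text; True if it ends in the accepting state."""
    state = 0
    for character in text:
        state = TRANSITIONS.get((state, character))
        if state is None:  # no edge for this character: reject
            return False
    return state == ACCEPTING_STATE

outcomes = {word: accepts(word) for word in ("cater", "caper", "acapella")}
print(outcomes)  # {'cater': True, 'caper': True, 'acapella': False}
```
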
  77. Disperse Examples and Tests Throughout? • Roman numerals: http://stackoverflow.com/questions/800813/what-is-the-most-difficult-challenging-regular-expression-you-have-ever-written/800932

  78. RE Cheatsheet • How to encourage RE use? • Distribute cheatsheets • Comb through LL source for examples where REs could be used?

  79. Use: Search Domains • Find all files containing @todo or !todo • Find all files using cf tags