Slide 1

Slide 1 text

Intro to Regular Expressions Presented by: Jamie Ly

Slide 2

Slide 2 text

Agenda ●Definition ●Infamy ●Whetting ●Literacy ●Potpourri

Slide 3

Slide 3 text

[a-z0-9\.]+@wharton\.upenn\.edu

Slide 4

Slide 4 text

What is a Regex? ● Pattern matcher ● Text processor ● mini-specification

Slide 5

Slide 5 text

How are they processed? ● One common method of processing uses finite-state automata (FSAs) ○ http://osteele.com/tools/reanimator/

Slide 6

Slide 6 text

Why the bad rap?

Slide 7

Slide 7 text

Some people, when confronted with a problem, think 'I know, I’ll use regular expressions.' Now they have two problems. -Jamie Zawinski

Slide 8

Slide 8 text

● Overused ● Unintelligible

Slide 9

Slide 9 text

Obscure Regexes ● E-Mail Validation, Complete Spec: http://ex-parrot.com/~pdw/Mail-RFC822-Address.html ● Sudoku Solver: http://perl.abigail.be/Talks/Sudoku/HTML/

Slide 10

Slide 10 text

Whetting

Slide 11

Slide 11 text

Features

Slide 12

Slide 12 text

Feature: Literals ● 1 2 3 ● a b c ● A B C ● - # % ● \. \^ \$ ● \t \n \r /abc/

Slide 13

Slide 13 text

Feature: Character classes/sets ● enclosed with brackets ● OR: matches any one character from the set ● ba[rs]k matches bark and bask ● Negations ^ ● c[^r]ap matches clap or chap, but not crap ● Short-hand ○ Perl: \d, \s, \w ○ POSIX: [:digit:], [:space:], [:alnum:] plus _ ([:word:] is a non-standard extension) /a[bc]/
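As a rough illustration of the set and negation bullets above, here is a minimal sketch using Python's standard `re` module. The word lists and the sample sentence are made up for the demo.

```python
import re

# ba[rs]k: the set [rs] matches exactly one character, either r or s.
bark_or_bask = re.compile(r"ba[rs]k")
set_hits = [w for w in ["bark", "bask", "balk"] if bark_or_bask.fullmatch(w)]

# c[^r]ap: the negated set [^r] matches exactly one character that is not r.
not_r = re.compile(r"c[^r]ap")
negated_hits = [w for w in ["clap", "chap", "crap", "cap"] if not_r.fullmatch(w)]

# \d is the Perl-style shorthand for a digit.
digits = re.findall(r"\d+", "room 101, floor 3")

print(set_hits)      # ['bark', 'bask']
print(negated_hits)  # ['clap', 'chap']
print(digits)        # ['101', '3']
```

Note that "cap" is rejected too: a negated set still has to consume one character.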

Slide 14

Slide 14 text

Feature: Quantifiers ● ? zero or one ● * zero or more ● + one or more ● {n} exactly n ● {n,m} between n and m, inclusive (no space after the comma) /a[bc]+/
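To see the quantifiers side by side, the sketch below applies each one to the slide's a[bc] pattern. The sample strings are assumptions chosen for the demo, and whole-string matching (`re.fullmatch`) is used so the differences are visible.

```python
import re

samples = ["a", "ab", "abcb", "ad"]

# One pattern per quantifier on the slide, all built on the class [bc].
patterns = [r"a[bc]?", r"a[bc]*", r"a[bc]+", r"a[bc]{3}", r"a[bc]{1,2}"]

# fullmatch requires the whole sample to fit the pattern.
results = {p: [s for s in samples if re.fullmatch(p, s)] for p in patterns}

for pattern, hits in results.items():
    print(pattern, hits)
# a[bc]?      ['a', 'ab']
# a[bc]*      ['a', 'ab', 'abcb']
# a[bc]+      ['ab', 'abcb']
# a[bc]{3}    ['abcb']
# a[bc]{1,2}  ['ab']
```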

Slide 15

Slide 15 text

Feature: Grouping ● Uses parens ● For capture ● Extraction ● Backreferences* /(a[bc])+/

Slide 16

Slide 16 text

Feature: Grouping - Capture
search = "apple pear scrapple snapple peach pineapple"
re = "([a-z]+pple)"
matches = RegexMatch(search, re)
print matches
["apple", "scrapple", "snapple", "pineapple"]
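The capture slide uses language-neutral pseudocode. One possible Python equivalent, assuming `re.findall` stands in for the slide's `RegexMatch`, might look like this:

```python
import re

search = "apple pear scrapple snapple peach pineapple"
pattern = r"([a-z]+pple)"

# With a single capture group, findall returns the text of that group
# for every non-overlapping match.
matches = re.findall(pattern, search)

print(matches)  # ['apple', 'scrapple', 'snapple', 'pineapple']
```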

Slide 17

Slide 17 text

Feature: Others ● Modifiers ● Start/end anchors ● Zero-length matches ● Backreferences ● Lookahead, lookbehind, lookaround, ...

Slide 18

Slide 18 text

Examples

Slide 19

Slide 19 text

Example: Quantifiers /(not)? to be/ /((lion)+(tiger)+(bear)+)/

Slide 20

Slide 20 text

Example: Quantifiers - Matches /(not)? to be/ "not to be" " to be" (the pattern requires the space before "to")

Slide 21

Slide 21 text

Example: Quantifiers - No match (when the whole string must match) /(not)? to be/ "NOT to be" "not  to be" (two spaces between "not" and "to")
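Whether a string counts as a match depends on whether the engine looks for the pattern anywhere in the string or requires the whole string to fit. The sketch below contrasts the two with Python's `re.search` and `re.fullmatch`. The candidate strings are assumptions for the demo, including one with two spaces after "not".

```python
import re

pattern = re.compile(r"(not)? to be")

candidates = ["not to be", " to be", "to be", "NOT to be", "not  to be"]

# search: the pattern may match anywhere inside the string.
found_anywhere = {text: bool(pattern.search(text)) for text in candidates}

# fullmatch: the pattern must account for the entire string.
matches_whole = {text: bool(pattern.fullmatch(text)) for text in candidates}

for text in candidates:
    print(repr(text), found_anywhere[text], matches_whole[text])
# 'not to be'   True  True
# ' to be'      True  True
# 'to be'       False False   (no space before "to")
# 'NOT to be'   True  False   (search finds " to be" after "NOT")
# 'not  to be'  True  False   (search finds " to be" after the first space)
```

Moving the space inside the optional group, as in /(not )?to be/, would let a bare "to be" match as well.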

Slide 22

Slide 22 text

Example: Quantifiers - Matches /((lion)+(tiger)+(bear)+)/ "liontigerbear" "lionliontigertigerbear"

Slide 23

Slide 23 text

Example: Quantifiers - No Match /((lion)+(tiger)+(bear)+)/ "lion tiger bear" "lionbear"

Slide 24

Slide 24 text

Example: Grouping - Alternation /the (best|worst) of times/ /(liberty|death)/
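A short sketch of alternation inside a group, using Python's `re` module. The sample sentences are assumptions added for the demo.

```python
import re

times = re.compile(r"the (best|worst) of times")
text = "it was the best of times, it was the worst of times"

# The group captures whichever alternative matched.
choices = times.findall(text)

either = re.search(r"(liberty|death)", "give me liberty, or give me death!")
first_choice = either.group(1)

print(choices)       # ['best', 'worst']
print(first_choice)  # 'liberty' (search stops at the leftmost match)
```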

Slide 25

Slide 25 text

Uses

Slide 26

Slide 26 text

4 Main Uses ●Search ●Normalization ●Extraction ●Validation

Slide 27

Slide 27 text

Use: Search - Finding Text In Files ● grep ● findstr (Windows) ● Eclipse, Notepad++, ... ● Possible Uses: ○ files containing @todo or !todo ○ files using cf tags

Slide 28

Slide 28 text

Use: Search - Using Regex In Code ● ColdFusion - REReplace note ● JavaScript - RegExp ● Java - java.util.regex.Pattern ● ASP - VBScript.RegExp ● C# - System.Text.RegularExpressions.Regex ● Comparison http://en.wikipedia.org/wiki/Comparison_of_regular_expression_engines

Slide 29

Slide 29 text

Use: Search - Another Domain ● Programming language detection ○ Pastebin ○ Syntax Highlighting

Slide 30

Slide 30 text

Use: Normalization ● Legacy system data (simple flat-file processing) ● One-off user lists ● Phone numbers, zip code ● Example: Phone Number List ● Common use case: Joining/Splitting Strings ● Common use case: Code refactoring
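As one possible take on the phone-number and joining/splitting bullets, here is a minimal sketch. The function name `normalize_phone`, the ten-digit US assumption, the output format, and the sample data are all hypothetical choices for the demo.

```python
import re
from typing import Optional


def normalize_phone(raw: str) -> Optional[str]:
    """Return a US number as (AAA) PPP-NNNN, or None if it is not 10 digits."""
    digits = re.sub(r"\D", "", raw)          # drop everything that is not a digit
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                  # strip a leading country code
    if len(digits) != 10:
        return None
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"


phones = [normalize_phone(p) for p in ["215.555.0100", "1-215-555-0101", "555-0102"]]

# Splitting on a regex tolerates inconsistent separators and spacing.
names = re.split(r"\s*[,;]\s*", "ann, bob;carl ; dee")

print(phones)  # ['(215) 555-0100', '(215) 555-0101', None]
print(names)   # ['ann', 'bob', 'carl', 'dee']
```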

Slide 31

Slide 31 text

Use: Extraction ● Ties into search and normalization ● Example: URLs, e-mail addresses ● Plug: CiteThis! sample ● Use: Screen scraping Yahoo Finance src code

Slide 32

Slide 32 text

Use: Validation ● User input ● Casting v String Matching ● Example: Input Masking, Stripping HTML

Slide 33

Slide 33 text

When to Avoid RE ● HTML Parsing ○ DOM/XPath ● Performance critical ● Stateful processing ○ odd number of a's ● Some text file processing (some CSV)
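The CSV bullet can be illustrated with a quoted field that contains a comma. This sketch compares a naive regex split with Python's standard `csv` module. The sample record is made up.

```python
import csv
import io
import re

line = 'Ly,"Philadelphia, PA",2010'

# A naive split on commas breaks the quoted field in two.
naive = re.split(r",", line)

# The csv module understands quoting rules.
parsed = next(csv.reader(io.StringIO(line)))

print(naive)   # ['Ly', '"Philadelphia', ' PA"', '2010']
print(parsed)  # ['Ly', 'Philadelphia, PA', '2010']
```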

Slide 34

Slide 34 text

Regex Literacy

Slide 35

Slide 35 text

How to Read REs ● Why be able to read them? ● Decompose a regex into its component parts ○ How to solve? 1+3*4/5^6+3+5-6*10+3-5 ○ Decompose/Group! ● Be familiar with the problem space/domain ● Write down strings that match as you scan ● Test against various strings (only use this if stumped!)

Slide 36

Slide 36 text

/^#?([a-f0-9]{6}|[a-f0-9]{3})$/

Slide 37

Slide 37 text

/^#?([a-f0-9]{6}|[a-f0-9]{3})$/

Slide 38

Slide 38 text

/^#?( [a-f0-9]{6} | [a-f0-9]{3} )$/

Slide 39

Slide 39 text

#aaa af0 af9f04 #00cc00
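The hex-color pattern from the preceding slides can be checked against these strings with a small Python sketch. The rejected samples are assumptions added for contrast.

```python
import re

hex_color = re.compile(r"^#?([a-f0-9]{6}|[a-f0-9]{3})$")

accepted = ["#aaa", "af0", "af9f04", "#00cc00"]
rejected = ["#abcd", "#ggg", "#AAA"]  # wrong length, non-hex letter, uppercase

accepted_ok = [bool(hex_color.match(s)) for s in accepted]
rejected_ok = [bool(hex_color.match(s)) for s in rejected]

print(accepted_ok)  # [True, True, True, True]
print(rejected_ok)  # [False, False, False]
```

Uppercase digits are rejected because the pattern lists only a-f. An ignore-case modifier would accept them.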

Slide 40

Slide 40 text

/^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/

Slide 41

Slide 41 text

/^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/

Slide 42

Slide 42 text

/^ ([a-z0-9_\.-]+) @ ([\da-z\.-]+) \. ([a-z\.]{2,6}) $/

Slide 43

Slide 43 text

a@-.az -@000.apples _______@0------0... jamiely@wharton.upenn.edu
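A quick way to confirm that all four strings above satisfy the e-mail pattern, odd as some of them look, is to run them through it. This is a sketch with Python's `re` module. The uppercase sample is an assumption added for contrast.

```python
import re

email = re.compile(r"^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$")

samples = ["a@-.az", "-@000.apples", "_______@0------0...", "jamiely@wharton.upenn.edu"]
all_match = all(email.match(s) for s in samples)

# The three groups are the local part, the domain, and the final label.
parts = email.match("jamiely@wharton.upenn.edu").groups()

# No ignore-case modifier, so uppercase letters are rejected.
uppercase = email.match("Jamiely@Wharton.upenn.edu")

print(all_match)  # True
print(parts)      # ('jamiely', 'wharton.upenn', 'edu')
print(uppercase)  # None
```

The pattern is deliberately loose: it describes the shape of an address, not whether the address is valid.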

Slide 44

Slide 44 text

Regex Construction

Slide 45

Slide 45 text

How to Construct a RE ● Have a few examples of : ○ matching strings ○ not matching strings ● start with a simple expression ● build up ● like writing pseudo-code ● Let's write a date-matching RE!

Slide 46

Slide 46 text

Just because we can... does it mean we should? Casting v RegEx ● Simple natural language parsing ○ Search queries ○ I Want Sandy ○ Remember the milk

Slide 47

Slide 47 text

Date Match! 1. What will we match? ● Typical US Dates? ● International format? 2010-03-04 ● Do we match times? Let's settle on matching: ● Month day, year ● Including 3-letter month names

Slide 48

Slide 48 text

Date Match! 1. What will we match? (continued) ● Possible matches ○ Oct. 20, 2000 ● Not matches ○ 10/20/2000 ○ October 20

Slide 49

Slide 49 text

Date Match! 2. Start with a simple expression Write a Regex to match "October 20, 2000" /October 20, 2000/ Easy huh? Now, be able to match each day in October. /October [1-3][0-9], 2000/

Slide 50

Slide 50 text

Date Match! 2. Start with a simple expression (cont) Close! But it doesn't match the 1st through 9th. /October [1-3]?[0-9], 2000/ Better! Now, match any year from 1000 to 3999. /October [1-3]?[0-9], [123][0-9][0-9][0-9]/

Slide 51

Slide 51 text

Date Match! 2. Start with a simple expression (cont) When you see the same classes repeated, you can simplify! /October [1-3]?[0-9], [123][0-9]{3}/ Now, match October's abbreviation. /(October|Oct\.) [1-3]?[0-9], [123][0-9]{3}/

Slide 52

Slide 52 text

Date Match! 2. Start with a simple expression (cont) Now we can add the other months. We'll only add May and December so we can limit the size. /((May|October|December)|(Oct|Dec)\.) [1-3]?[0-9], [123][0-9]{3}/ or /(May|October|December|Oct\.|Dec\.) [1-3]?[0-9], [123][0-9]{3}/

Slide 53

Slide 53 text

Date Match! 2. Start with a simple expression (cont) From here, we determine whether to loosen the requirements. ● Ignore whitespace? ● Ignore case? ● Comma, other punctuation optional? /(May|October|December|Oct\.?|Dec\.?)\s+[1-3]?[0-9],?\s+[123][0-9]{3}/
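The loosened date pattern can be exercised against the "matches" and "not matches" lists built up earlier. The sketch below assumes a tens digit of [1-3] so that the 30th and 31st are accepted, and the extra sample strings are made up for the demo.

```python
import re

date_re = re.compile(
    r"(May|October|December|Oct\.?|Dec\.?)\s+[1-3]?[0-9],?\s+[123][0-9]{3}"
)

should_match = ["October 20, 2000", "Oct. 20, 2000", "Dec 5 1999", "May  31,  2010"]
should_not_match = ["10/20/2000", "October 20", "Nov. 3, 2001"]

matched = [bool(date_re.fullmatch(s)) for s in should_match]
unmatched = [bool(date_re.fullmatch(s)) for s in should_not_match]

# Caveat: the day part checks digits, not calendar validity.
accepts_day_39 = bool(date_re.fullmatch("October 39, 2000"))

print(matched)         # [True, True, True, True]
print(unmatched)       # [False, False, False]
print(accepts_day_39)  # True
```

The last check shows why a regex is usually paired with a real date parser when validity matters: the pattern recognizes the shape of a date, not whether the day exists.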

Slide 54

Slide 54 text

Potpourri

Slide 55

Slide 55 text

Regex Builders ● Not crutches! ● Testers/Interactive builder ○ http://rubular.com/ ○ http://gskinner.com/RegExr/ ○ *http://ryanswanson.com/regexp/#start ○ http://osteele.com/tools/rework/# ○ http://txt2re.com/index-php.php3 ○ http://tools.netshiftmedia.com/regexlibrary/

Slide 56

Slide 56 text

Custom Project Using REs Word Jumble! http://scorpio-dev.wharton.upenn.edu/users/jamiely/wordjumble/

Slide 57

Slide 57 text

http://www.addedbytes.com/cheat-sheets/regular-expressions-cheat-sheet/

Slide 58

Slide 58 text

Appendices

Slide 59

Slide 59 text

References ● http://net.tutsplus.com/tutorials/other/8-regular-expressions-you-should-know/ ● http://regexlib.com/ ● http://www.regular-expressions.info/

Slide 60

Slide 60 text

Unused slides

Slide 61

Slide 61 text

RE Linksheet

Slide 62

Slide 62 text

Feature: Backreference ● You may refer back to a captured group later in the same regex ● This is useful for paired data such as HTML tags ● Also useful when you don't know in advance what the first match will be ● Matches an HTML tag and its closing tag (kinda): /<(\w+).+> inner html including tags <\/\1>/ ● Matches a line of the name song: /\w(\w+), B\1, Bo B\1, Banana Fanana/
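A simplified sketch of a backreference in Python, where \1 must repeat whatever group 1 captured. These patterns are assumptions written for the demo, not the slide's exact expressions, and the tag pattern is only a toy (it ignores attributes and nesting).

```python
import re

# \1 must equal the tag name captured by (\w+).
paired = re.compile(r"<(\w+)>.*?</\1>")

balanced = paired.search("<b>bold</b>")
mismatched = paired.search("<b>bold</i>")

# Another common use: spotting an accidentally doubled word.
doubled = re.search(r"\b(\w+) \1\b", "it was the the best of times")

print(balanced.group(0))  # <b>bold</b>
print(mismatched)         # None
print(doubled.group(1))   # the
```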

Slide 63

Slide 63 text

Feature: Modifiers ● (g)lobal ● (i)gnore case/case (i)nsensitive ● (m)ultiline ● example
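Modifier syntax differs between flavors. As a rough sketch of how the three modifiers map onto Python: ignore-case and multiline are flags (`re.IGNORECASE`, `re.MULTILINE`), while there is no global flag, so the choice between `re.search` (first match) and `re.findall`/`re.finditer` (all matches) plays that role. The sample text is made up.

```python
import re

text = "Do re mi\ndo re mi"

# (i) ignore case
case_sensitive = re.findall(r"do", text)
ignore_case = re.findall(r"do", text, re.IGNORECASE)

# (m) multiline: ^ matches at the start of every line, not just the string
string_start_only = re.findall(r"^do", text, re.IGNORECASE)
every_line_start = re.findall(r"^do", text, re.IGNORECASE | re.MULTILINE)

# (g) global: first match versus all matches
first_position = re.search(r"re", text).start()
all_positions = [m.start() for m in re.finditer(r"re", text)]

print(case_sensitive)     # ['do']
print(ignore_case)        # ['Do', 'do']
print(string_start_only)  # ['Do']
print(every_line_start)   # ['Do', 'do']
print(first_position)     # 3
print(all_positions)      # [3, 12]
```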

Slide 64

Slide 64 text

Feature: Start/end anchors ● ^ begin ● $ end ● Maria: Let's start at the very beginning! ● ^do re mi ● REM: world as we know it$
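The two anchored patterns on this slide can be tried out with a short Python sketch. The surrounding sample strings are assumptions for the demo.

```python
import re

starts = re.compile(r"^do re mi")
ends = re.compile(r"world as we know it$")

# ^ pins the match to the beginning of the string.
at_start = bool(starts.search("do re mi fa sol"))
not_at_start = bool(starts.search("sing do re mi"))

# $ pins the match to the end of the string.
at_end = bool(ends.search("it's the end of the world as we know it"))
not_at_end = bool(ends.search("world as we know it, and I feel fine"))

print(at_start, not_at_start)  # True False
print(at_end, not_at_end)      # True False
```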

Slide 65

Slide 65 text

Example: Quantifiers - Alt. Regex /((lion)+(tiger)+(bear)+)/ /((lion){1,2}(tiger){1,2}(bear)?)/ "liontigerbear" "lionliontigertigerbear"

Slide 66

Slide 66 text

Example: Grouping - Matches /(circle of (hell|the (inferno|abyss))){1,7}/ "circle of the inferno" "circle of hellcircle of hell"

Slide 67

Slide 67 text

Example: Grouping - No match /(circle of (hell|the (inferno|abyss))){1,7}/ "circle of abyss" "circle of"

Slide 68

Slide 68 text

*Feature: - Zero Length Matches ● Zero-length matches (^, \b)
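A zero-length match consumes no characters: it identifies a position rather than text. The sketch below lists where the word boundary \b matches in a short made-up string, using Python's `re.finditer`.

```python
import re

text = "to be"

# Each match is empty, so its start and end offsets are equal.
boundaries = [m.span() for m in re.finditer(r"\b", text)]
all_empty = all(m.group(0) == "" for m in re.finditer(r"\b", text))

print(boundaries)  # [(0, 0), (2, 2), (3, 3), (5, 5)]
print(all_empty)   # True
```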

Slide 69

Slide 69 text

Use: Capture /(heart|mind)*/

Slide 70

Slide 70 text

Search ● dir/ls ● Example: Find files beginning with test ● find ● Example: Find files beginning with test and ending with a numeric timestamp. ● Common use case: ○ finding all log files ○ finding image files

Slide 71

Slide 71 text

Use: Normalization (cont) ● Phone numbers, zip code ● Example: Phone Number List ● Common use case: Joining/Splitting Strings

Slide 72

Slide 72 text

General Search ● IDE ○ Notepad++ ■ sql search ○ Eclipse ■ function search ○ Dreamweaver? ○ Word? ■ Not really regex ○ Emacs? ● FindStr/Grep ○ Example: ? ○ Common use case: Finding all instances of a global variable

Slide 73

Slide 73 text

Search ● find

Slide 74

Slide 74 text

grep, findstr

Slide 75

Slide 75 text

RE Flavors and Compensating How to compensate for a deficient implementation. (ColdFusion)

Slide 76

Slide 76 text

What is a regular expression? ● formal language ● interpreted by RE processor ● mini-programs ● mini-specifications ● domain: text-processing ● iffy: find:RE::dom:XPath ... kinda ● show program parse tree? ○ compare to regex automata graph ○ graph: ca[tp]er against cater caper acapella

Slide 77

Slide 77 text

Disperse Examples and Tests Throughout? ● Roman numerals: http://stackoverflow.com/questions/800813/what-is-the-most-difficult-challenging-regular-expression-you-have-ever-written/800932

Slide 78

Slide 78 text

RE Cheatsheet ● How to encourage RE use? ● Distribute cheatsheets ● Comb through LL source for examples where REs could be used?

Slide 79

Slide 79 text

Use: Search Domains ● Find all files containing @todo or !todo ● Find all files using cf tags