Intro to Regex

A general and thorough exploration of Regex. Starts with the theory and basic implementation and dives into numerous examples. Covers everything from character sets through lookarounds and conditionals.

Jacob Emerick

June 12, 2014

Transcript

  1. ABOUT JACOB
    Software Engineer for Shutterstock.
    10 years of PHP.
    Also a pretty cool cat.
    @jpemeric, Personal Site.
  2. SO WHAT ARE WE TALKING ABOUT?
    1. Explore the theory
    2. Implementation
    3. Introduce via examples
    4. Steadily increase complexity
    5. How to use regex
    6. Share some resources
  3. FORMAL LANGUAGE THEORY
    If an alphabet is a finite set of symbols, you can create an infinite set of strings (Kleene Closure) using those symbols.
    Any arbitrary subset of these strings (a language) cannot be determined by computing.
    However, they may be defined.
  4. REGULAR EXPRESSION LANGUAGE
    A language is a regular language if it can be described using only these parameters.
    For a deeper exploration, check out Eric Lippert's series.
  5. IMPURE IMPLEMENTATIONS
    Formal language theory is pretty.
    The real world is not pretty.
    HTML, XML, etc. are not regular languages.
    Oh, and backtracking.
  6. ANATOMY OF A REGEX
    delimiter (/, #, ~, @, etc)
    literal characters
    special characters
    modifiers
  7. LANGUAGE DIFFERENCES
    Features compared: Recursion, Conditionals, Atomic, > 9 Capture.
    GREP: x
    Javascript: x
    Ruby: x x x
    VIM: x
    PHP (PCRE): x x x x
    Comparison on Wikipedia.
  8. LITERALS, SPECIAL, AND CLASSES
    Literal characters are acceptable: abc, 158
    You can also use alternation with literals: cat|dog
    Special characters mean something special: +, -, []
    Escape a special character to use it literally: \+
    It's easy to define ranges, or character classes: [a-z], [0-9]
  9. WHITESPACE AND SHORTHAND
    \n is new line
    \r is carriage return (thanks windows)
    \t is tab
    \s is whitespace ([ \t\f] and maybe [\r\n])
    \w is word character: [A-Za-z0-9_]
    \b is word boundary (zero-length)
    \d is digit
    Capitalize to negate (\D matches !digit)
  10. QUANTIFIERS
    * matches 0 or more
    + matches 1 or more
    {4} matches exactly four of the preceding item
    {4,6} matches between four and six of the preceding item
    ? makes things optional
  11. DOTS AND BOUNDARIES
    . matches everything except line breaks
    Dot is too powerful - avoid if possible
    Negated character classes are better
    ^, $ force boundaries
    /^\d+$/ vs /\d+/ on 123abc456
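The anchored and unanchored patterns from this slide can be compared directly. This is a minimal sketch that uses Python's `re` module as a stand-in for PCRE (an assumption, since the deck itself targets PHP), so the slash delimiters are dropped.

```python
import re

subject = "123abc456"

# Unanchored: the engine may start anywhere, so the first run of digits matches.
unanchored = re.search(r"\d+", subject)

# Anchored: ^ and $ force the digits to span the whole string, so this fails.
anchored = re.search(r"^\d+$", subject)

print(unanchored.group(0))  # 123
print(anchored)             # None
```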
  12. EXAMPLE - VALID PASSWORD
    Only contain letters (upper and lower) and numbers.
    Between six and eight characters.
    /^[a-zA-Z0-9]{6,8}$/
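As a rough sketch, the password pattern could be wrapped in a small validator. The helper name `is_valid_password` is hypothetical, and Python's `re` module stands in for PCRE here.

```python
import re

# Same pattern as the slide, minus the PCRE slash delimiters.
PASSWORD_RE = re.compile(r"^[a-zA-Z0-9]{6,8}$")


def is_valid_password(candidate: str) -> bool:
    """Return True when candidate is 6 to 8 ASCII letters or digits."""
    # Note: in Python, $ also matches just before a single trailing newline.
    return PASSWORD_RE.search(candidate) is not None


print(is_valid_password("abc123"))     # True
print(is_valid_password("abc12"))      # False: too short
print(is_valid_password("abcdefghi"))  # False: too long
print(is_valid_password("abc_123"))    # False: underscore not allowed
```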
  13. EXAMPLE - FULL NAME
    First, last, optional middle name.
    Space between name pieces.
    First letter capitalized, no special chars allowed.
    /^[A-Z][a-z]+ ([A-Z][a-z]+ )?[A-Z][a-z]+$/
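The full-name pattern can be exercised the same way. This sketch assumes Python's `re` module rather than PHP, and the sample names are made up for illustration.

```python
import re

# First name, optional middle name, last name; each piece is one capital
# letter followed by at least one lowercase letter.
FULL_NAME_RE = re.compile(r"^[A-Z][a-z]+ ([A-Z][a-z]+ )?[A-Z][a-z]+$")


def is_full_name(candidate: str) -> bool:
    return FULL_NAME_RE.search(candidate) is not None


print(is_full_name("Ada Lovelace"))       # True
print(is_full_name("Ada King Lovelace"))  # True
print(is_full_name("ada Lovelace"))       # False: first letter not capitalized
print(is_full_name("Ada K Lovelace"))     # False: each piece needs 2+ letters
```

The last case shows a limit of the pattern as written: a single-letter middle initial is rejected because `[a-z]+` requires at least one lowercase letter.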
  14. GREEDINESS
    Regex is by nature greedy.
    Example: /<a\b.*>/ for anchor tags
    Add ? to make quantifiers lazy: +?, *?, ??
    Alternatively, use negated character classes: /<a\b[^>]*>/
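To see the difference between greedy, lazy, and negated-class matching, the three variants can be run against the same input. This is a sketch in Python's `re` module (an assumption; the deck targets PCRE), and the sample HTML is invented.

```python
import re

html = '<a href="/home">Home</a> <b>bold</b>'

# Greedy: .* runs to the end of the line, then backs up to the LAST '>'.
greedy = re.search(r"<a\b.*>", html).group(0)

# Lazy: .*? expands one character at a time and stops at the FIRST '>'.
lazy = re.search(r"<a\b.*?>", html).group(0)

# Negated class: [^>]* can never cross a '>', so no backtracking is needed.
negated = re.search(r"<a\b[^>]*>", html).group(0)

print(greedy)   # <a href="/home">Home</a> <b>bold</b>
print(lazy)     # <a href="/home">
print(negated)  # <a href="/home">
```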
  15. CAPTURING GROUPS
    () for capturing blocks for later parsing.
    Also useful for limiting quantifiers or alternations.
    If you need a group but don't want it captured, (?:
    Alternation can spawn multiple captures, use (?|
    You can also name groups with (?P<name>pattern)
  16. EXAMPLES - BASIC LINK
    Create a regex to only capture url and anchor from a link.
    If title is present, capture that too.
    /<a href="(?P<link>[^"]+)"(?: title="(?P<title>[^"]+)")?>(?P<anchor>[^<]+)<\/a>/
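The `(?P<name>...)` syntax used in the link pattern also works in Python's `re` module, so the pattern can be tried there as a sketch. The input strings below are illustrative only.

```python
import re

# Same pattern as the slide, without the slash delimiters (so </a> needs
# no escaping). The title attribute sits in an optional non-capturing group.
LINK_RE = re.compile(
    r'<a href="(?P<link>[^"]+)"(?: title="(?P<title>[^"]+)")?>(?P<anchor>[^<]+)</a>'
)

with_title = LINK_RE.search('<a href="url" title="seo title">seo anchor</a>')
without_title = LINK_RE.search('<a href="url">seo anchor</a>')

print(with_title.groupdict())
# {'link': 'url', 'title': 'seo title', 'anchor': 'seo anchor'}
print(without_title.group("title"))  # None: the optional group did not take part
```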
  17. BACKREFERENCES
    Allows you to reference previous capturing groups.
    Indexed left to right: \1
    You can also backreference named captures: (?P=name)
    Backtracking is tricky.
    /<([a-z]+)[^>]*>[^<]+<\/\1>/
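A short sketch of the backreference pattern, again assuming Python's `re` module in place of PCRE: `\1` must repeat whatever the first group captured, so mismatched tags are rejected.

```python
import re

# \1 refers back to the tag name captured by ([a-z]+).
PAIRED_TAG_RE = re.compile(r"<([a-z]+)[^>]*>[^<]+</\1>")

matched = PAIRED_TAG_RE.search('<em class="x">hi</em>')
mismatched = PAIRED_TAG_RE.search("<em>hi</strong>")

print(matched.group(1))  # em
print(mismatched)        # None: closing tag does not repeat the capture
```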
  18. PATTERN MODIFIERS
    i sets case-insensitive
    m sets multiline (for ^ and $)
    s sets . to match everything, including newlines
    x ignores whitespace in pattern (so it can breathe more)
    e evaluates the code (deprecated, don't do this!)
    Many more on PHP's PCRE page.
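Python's `re` module exposes rough analogues of the i, m, s, and x modifiers as flags, which makes their effects easy to try out. This is a sketch under that assumption; it is not PHP's modifier syntax, the e modifier has no counterpart here, and the sample strings are invented.

```python
import re

# i: case-insensitive matching
case_insensitive = re.search(r"regex", "Intro to REGEX", re.I)

# m: ^ and $ match at every line, not just the ends of the whole string
per_line = re.findall(r"^\w+$", "one\ntwo", re.M)
whole_string_only = re.findall(r"^\w+$", "one\ntwo")

# s: the dot also matches newlines
dot_default = re.search(r"a.b", "a\nb")
dot_all = re.search(r"a.b", "a\nb", re.S)

# x: whitespace and comments inside the pattern are ignored
phone = re.compile(
    r"""
    \d{3}   # first block of digits (hypothetical example format)
    -       # literal separator
    \d{4}   # second block of digits
    """,
    re.X,
)

print(case_insensitive.group(0))  # REGEX
print(per_line)                   # ['one', 'two']
print(whole_string_only)          # []
print(dot_default)                # None
print(phone.search("call 555-1234 now").group(0))  # 555-1234
```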
  19. ATOMIC GROUPINGS
    (?> throws away all backtracking information after matching.
    Useful for optimization.
    /\b(brogrammer|bro)\b/ vs /\b(?>brogrammer|bro)\b/
    Order is important.
    Avoid catastrophic backtracking!
  20. POSSESSIVE QUANTIFIER
    Greedy, stubborn quantifier: *+, ++, ?+
    Refuses to give up matches during backtracking.
    Useful for optimization.
    Avoid catastrophic backtracking!
    /\b<span class="[^"]+">[^<]<\/span>/
  21. EXAMPLE - OPTIMIZE THIS
    Example from the author of RegexBuddy.
    For a 12-column CSV, find row with the last element of 'P'.
    /^(.*?,){11}P/ results in 3,265 steps.
    /^(?>([^,\r\n]*+,){11})P/
  22. POSITIVE AND NEGATIVE LOOKAROUNDS
    Zero-length assertions at the beginning and end of patterns.
    Lookaheads are (?= and (?!
    Lookbehinds are (?<= and (?<!
    More complex: /\b(?=[a-z]+\b)[a-z]*(nike|viagra|nfl)[a-z]*/
    Lookbehinds only support fixed width, no quantifiers.
    Workaround (in some languages) is \K
  23. EXAMPLE - LOOKAROUNDS
    Match all states that end with the letter 'a'.
    Given a list of states, comma separated.
    /[A-Za-z ]+a(?=,)/
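The lookahead keeps the comma out of the match while still requiring it. A minimal sketch, assuming Python's `re` module and an invented list of states:

```python
import re

# (?=,) asserts that a comma follows without consuming it.
STATE_ENDING_IN_A = re.compile(r"[A-Za-z ]+a(?=,)")

states = "Alabama,Texas,Florida,Ohio,Montana,Utah"
matches = STATE_ENDING_IN_A.findall(states)

print(matches)  # ['Alabama', 'Florida', 'Montana']
```

Because the pattern demands a trailing comma, a state ending in 'a' in the final position of the list would be missed unless the list ends with a comma.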
  24. EXAMPLE - PARSING HTTP HEADERS
    Given standard response headers, pull code and date out.
    HTTP/1.x 200 OK
    Date: Sat, 28 Nov 2009 04:36:25 GMT
    Expires: Sat, 28 Nov 2009 05:36:25 GMT
    /(?:(HTTP)\/\d.[a-z]|(Date|Expires):) (?(1)(\d+)|([A-Za-z ,\d:]+))/
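The conditional `(?(1)yes|no)` is also available in Python's `re` module, so the header pattern can be sketched there. The variable names are hypothetical, and the slash delimiters (and the escaped slash) are dropped.

```python
import re

# If group 1 (HTTP) took part in the match, expect a numeric status code;
# otherwise expect a date-like value after "Date:" or "Expires:".
HEADER_RE = re.compile(
    r"(?:(HTTP)/\d.[a-z]|(Date|Expires):) (?(1)(\d+)|([A-Za-z ,\d:]+))"
)

headers = (
    "HTTP/1.x 200 OK\n"
    "Date: Sat, 28 Nov 2009 04:36:25 GMT\n"
    "Expires: Sat, 28 Nov 2009 05:36:25 GMT\n"
)

status = None
dates = {}
for match in HEADER_RE.finditer(headers):
    if match.group(1):
        status = match.group(3)  # status line branch
    else:
        dates[match.group(2)] = match.group(4)  # Date / Expires branch

print(status)  # 200
print(dates["Date"])     # Sat, 28 Nov 2009 04:36:25 GMT
print(dates["Expires"])  # Sat, 28 Nov 2009 05:36:25 GMT
```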
  25. RECURSION AND SUBROUTINES
    (?R)? for recursion.
    Example: /(?:b(?R)?o)+/ matches bobo, bboobboobboo, etc.
    Subroutines allow recursion based on capturing groups.
    Syntax is similar but references: (?1), (?)
  26. MATCHING
    preg_match() for single, preg_match_all() for global.
    args: pattern, subject, matches (by reference), flags, offset.
    preg_match('/<a href="(?P<link>[^"]+)"(?: title="(?P<title>[^"]+)")?>(?P<anchor>[^<]+)<\/a>/', '<a href="url" title="seo title">seo anchor</a>', $matches);
    {
      [0]=> string(46) "<a href="url" title="seo title">seo anchor</a>",
      ["link"]=> string(3) "url",
      [1]=> string(3) "url",
      ["title"]=> string(9) "seo title",
      [2]=> string(9) "seo title",
      ["anchor"]=> string(10) "seo anchor",
      [3]=> string(10) "seo anchor",
    }
  27. REPLACING
    preg_replace() and preg_filter().
    args: pattern, replacement, subject, limit, count.
    Returns the modified subject.
    preg_filter() will return null or an empty array on failure.
    Both can accept strings or arrays for replacements and subjects.
    preg_replace_callback() accepts a callback (awesome).
  28. MISC
    preg_split() splits a string into an array by a regex.
    preg_quote() escapes characters.
    preg_last_error() returns an error code (for debugging).
    PHP's PCRE functions.
  29. RESOURCES
    Excellent PCRE debugger: Regex101
    Awesome general Regex: Regexr
    General tutorials and hints: Regular Expressions Info
    Awesome Regex Game: Regex Crossword
    Theory of Regular Expressions: Eric Lippert