twitter.com/jeanneboyarsky mastodon.social/@jeanneboyarsky

History

1943     Patterns in neuroscience
1951     Stephen Kleene describing neural networks
1960s    Pattern matching in text editors, lexical parsing in compilers
1980s    Perl
2002     Java 1.4 - regex in core Java

Greedy Quantifiers

Symbol    # of j's
j         1
j?        0-1
j*        0 or more
j+        1 or more
j{5}      5
j{5,6}    5-6
j{5,}     5 or more
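
The quantifier table can be checked with a short sketch. The class name QuantifierDemo and the helper matchesAll are illustrative names, not from the slides; the helper only wraps Pattern.matches, which requires the whole input to match.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    // Hypothetical helper: true only if the entire input matches the regex.
    static boolean matchesAll(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matchesAll("j?", ""));                // true: zero j's allowed
        System.out.println(matchesAll("j+", ""));                // false: needs at least one j
        System.out.println(matchesAll("j{5}", "j".repeat(5)));   // true: exactly five
        System.out.println(matchesAll("j{5,6}", "j".repeat(7))); // false: seven is too many
        System.out.println(matchesAll("j{5,}", "j".repeat(7)));  // true: five or more
    }
}
```
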

Common Character Classes

Regex       Matches
[123]       Any of 1, 2 or 3
[1-3]       Any of 1, 2 or 3
[^5]        Any character but "5"
[a-zA-Z]    Letter
\d          Digit
\s          Whitespace
\w          Word character (letter, digit or underscore)

Less Common Character Classes

Regex             Matches           Longer form
\D                Not digit         [^0-9]
\S                Not whitespace    [^\s]
\W                Not word char     [^a-zA-Z0-9_]
[1-3[x-z]]        Union             [1-3x-z]
[[m-p]&&[l-n]]    Intersection      [mn]
[m-p&&[^o]]       Subtraction       [mnp]

Callout: Clarity
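
A minimal sketch of the union, intersection and subtraction rows. The class CharClassDemo and the helper keep are made-up names for illustration; keep simply collects every character of the input that the class matches.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // Hypothetical helper: concatenates every match of the character class.
    static String keep(String charClass, String input) {
        var kept = new StringBuilder();
        var matcher = Pattern.compile(charClass).matcher(input);
        while (matcher.find()) {
            kept.append(matcher.group());
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        System.out.println(keep("[1-3[x-z]]", "0123wxyz"));  // 123xyz (union)
        System.out.println(keep("[[m-p]&&[l-n]]", "lmnop")); // mn (intersection)
        System.out.println(keep("[m-p&&[^o]]", "lmnop"));    // mnp (subtraction)
    }
}
```
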

Match a Pattern

    var phoneNumbers = """
        111-111-1111
        222-222-2222
        """;
    var pattern = Pattern.compile(
        "[0-9]{3}-[0-9]{3}-[0-9]{4}");
    var matcher = pattern.matcher(phoneNumbers);
    while (matcher.find())
        System.out.println(matcher.group());

You promised readable regex!

Groups

    var phoneNumbers = """
        111-111-1111
        222-222-2222
        """;
    var areaCodeGroup = "(\\d{3})";
    var threeDigits = "\\d{3}";
    var fourDigits = "\\d{4}";
    var dash = "-";
    var regex = areaCodeGroup + dash
        + threeDigits + dash + fourDigits;
    var pattern = Pattern.compile(regex);
    var matcher = pattern.matcher(phoneNumbers);
    while (matcher.find())
        System.out.println(matcher.group(1));

Named Capturing Groups

    var phoneNumbers = """
        111-111-1111
        """;
    var areaCodeGroup = "(?<areaCode>\\d{3})";
    var threeDigits = "\\d{3}";
    var fourDigits = "\\d{4}";
    var dash = "-";
    var regex = areaCodeGroup + dash
        + threeDigits + dash + fourDigits;
    var pattern = Pattern.compile(regex);
    var matcher = pattern.matcher(phoneNumbers);
    while (matcher.find())
        System.out.println(
            matcher.group("areaCode"));

Exact match

    var string = "Elevation high";
    var regex = "[a-zA-Z ]+";
    System.out.println(
        string.matches(regex));

That’s a lot of ceremony!

Replace

    var before = "123 Sesame Street";
    var after = before.replaceAll("\\d", "");
    System.out.println(after);

Now THAT is easy to read

What does this print?

    var string = "Mile High City";
    string = string.replaceAll("^\\w+", "");
    string = string.replaceAll("\\w+$", "");
    string = string.strip();
    System.out.println(string);

Prints: High

What about now?

    var string = "Mile High City";
    var boundaryAndWord = "\\b\\w+";
    string = string.replaceAll(
        boundaryAndWord, "");
    string = string.strip();
    System.out.println(string);

Prints: a blank line. The start of the string and each space are both word boundaries, so every word is removed.

What does this print?

    var text = "\\___/";
    var regex = "\\_.*/";
    System.out.println(text.matches(regex));

Prints: false

Flags

Flag    Name                Purpose
(?i)    CASE_INSENSITIVE    Case-insensitive matching (ASCII only)
(?m)    MULTILINE           ^ and $ match at line breaks
(?s)    DOTALL              . matches line breaks
(?d)    UNIX_LINES          Only \n counts as a line break
(?x)    COMMENTS            Ignores whitespace and # to end of line

Plus the Unicode ones
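
The effect of the embedded flags is easiest to see by counting matches with and without them. This is a small sketch; FlagDemo and countMatches are illustrative names, and Matcher.results() assumes Java 9 or later.

```java
import java.util.regex.Pattern;

class FlagDemo {
    // Hypothetical helper: number of matches find() would report.
    static long countMatches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).results().count();
    }

    public static void main(String[] args) {
        var text = "Sam\nsam";
        System.out.println(countMatches("^sam$", text));       // 0: ^ only matches at the start of input
        System.out.println(countMatches("(?m)^sam$", text));   // 1: second line matches
        System.out.println(countMatches("(?im)^sam$", text));  // 2: both lines, ignoring case
        System.out.println(countMatches("Sam.sam", text));     // 0: . does not match \n
        System.out.println(countMatches("(?s)Sam.sam", text)); // 1: . matches \n
    }
}
```
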

Comments

    var fiveDigits = "\\d{5}";
    var optionalFourDigitSuffix = "(-\\d{4})?";
    var regex = fiveDigits + optionalFourDigitSuffix;
    var pattern = Pattern.compile(regex);

Versus:

    var regex = """
        \\d{5}      # five digits
        (-\\d{4})?  # optional four digits
        """;
    var pattern = Pattern.compile(regex,
        Pattern.COMMENTS);

Which is more readable?

So I have to say what I don’t want?

    var regex = "(?s).*<body>(.*)</body>.*";
    var body = html
        .replaceFirst(regex, "$1")
        .strip();
    System.out.println(body);

Ready!

Where did the close * go?

    var text = "* -aa- -b- *";
    var pattern = Pattern.compile("-([a-z]+)-");
    var matcher = pattern.matcher(text);
    var builder = new StringBuilder();
    while (matcher.find())
        matcher.appendReplacement(
            builder, "x");
    System.out.println(builder);

Prints: * x x

Where did the close * go?

    var text = "* -aa- -b- *";
    var pattern = Pattern.compile("-([a-z]+)-");
    var matcher = pattern.matcher(text);
    var builder = new StringBuilder();
    while (matcher.find())
        matcher.appendReplacement(
            builder, "x");
    matcher.appendTail(builder);
    System.out.println(builder);

Prints: * x x *

What does this do?

    var text = "* -aa- -b- *";
    var pattern = Pattern.compile("-([a-z]+)-");
    var matcher = pattern.matcher(text);
    var builder = new StringBuilder();
    while (matcher.find())
        matcher.appendReplacement(
            builder, "$");
    System.out.println(builder);

Throws: IllegalArgumentException: Illegal group reference: group index is missing

Fix

    var text = "* -aa- -b- *";
    var pattern = Pattern.compile("-([a-z]+)-");
    var matcher = pattern.matcher(text);
    var builder = new StringBuilder();
    // Braces are required: a declaration cannot be the sole body of a while loop.
    while (matcher.find()) {
        var replace = Matcher.quoteReplacement("$");
        matcher.appendReplacement(builder, replace);
    }
    System.out.println(builder);

Prints: * $ $

Quantifier Types

Sample    Type          Description
z?        Greedy        Read whole string and backtrack
z??       Reluctant     Look at one character at a time
z?+       Possessive    Read whole string, never backtrack
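
The three quantifier types behave differently on the same input. The sketch below uses .* rather than z? so the difference is visible; the class QuantifierTypeDemo, the helper findAll and the sample input are assumptions made for illustration.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class QuantifierTypeDemo {
    // Hypothetical helper: every match that find() would report, in order.
    static List<String> findAll(String regex, String input) {
        return Pattern.compile(regex).matcher(input).results()
            .map(MatchResult::group)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        var input = "xfooxxxxxxfoo";
        // Greedy: takes everything, then backtracks until "foo" fits.
        System.out.println(findAll(".*foo", input));  // [xfooxxxxxxfoo]
        // Reluctant: takes as little as possible before each "foo".
        System.out.println(findAll(".*?foo", input)); // [xfoo, xxxxxxfoo]
        // Possessive: takes everything and never gives any back.
        System.out.println(findAll(".*+foo", input)); // []
    }
}
```
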

Looking

    var text = "1 fish 2 fish red fish blue fish";
    var regex = "\\w+ fish(?! blue)";
    var pattern = Pattern.compile(regex);
    var matcher = pattern.matcher(text);
    while (matcher.find())
        System.out.println(matcher.group());

Prints:
1 fish
2 fish
blue fish

Indexes

    var text = "i am sam. I am SAM. Sam i am";
    var pattern = Pattern.compile("(?i)sam");
    var matcher = pattern.matcher(text);
    while (matcher.find())
        System.out.println(matcher.start()
            + "-" + matcher.end());

Prints:
5-8
15-18
20-23

What’s wrong?

    changed = changed.replaceAll("\\.\\.\\.", ";");

Performance. No need for a regex.
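
A literal replacement does the same job without compiling a pattern. This is a sketch under the assumption that the intent is to replace every literal "..." with ";"; the class and method names are illustrative.

```java
class LiteralReplaceDemo {
    // Regex version from the slide: each dot has to be escaped.
    static String viaRegex(String s) {
        return s.replaceAll("\\.\\.\\.", ";");
    }

    // Literal version: String.replace treats its arguments as plain text.
    static String viaLiteral(String s) {
        return s.replace("...", ";");
    }

    public static void main(String[] args) {
        var input = "wait... what... ok";
        System.out.println(viaRegex(input));   // wait; what; ok
        System.out.println(viaLiteral(input)); // wait; what; ok
    }
}
```
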

What’s wrong?

    if (dateString.matches("^(?:(?:31(\\/|-|\\.)(?:0?[13578]|"
        + "1[02]))\\1|(?:(?:29|30)(\\/|-|\\.)(?:0?[13-9]|1[0-2])\\2))"
        + "(?:(?:1[6-9]|[2-9]\\d)?\\d{2})$|^(?:29(\\/|-|\\.)0?2\\3(?:"
        + "(?:(?:1[6-9]|[2-9]\\d)?(?:0[48]|[2468][048]|[13579][26])|"
        + "(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:0?[1-9]|1\\d|"
        + "2[0-8])(\\/|-|\\.)(?:(?:0?[1-9])|(?:1[0-2]))\\4(?:"
        + "(?:1[6-9]|[2-9]\\d)?\\d{2})$")) {
        handleDate(dateString);
    }

Too complicated. I draw the line way before this :)

What’s wrong?

    Pattern.compile("(?=a)b");

The lookahead requires the next character to be a, so that same character can never be b. The pattern cannot match.

What’s wrong?

    Pattern.compile("(a|b)*");

Backtracking can overflow the stack on large strings. Vs
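
The slide text breaks off after "Vs", so the comparison it made is not recoverable here. One commonly suggested alternative, offered as an assumption rather than as the slide's own answer, is a character class, which accepts the same strings without a group to recurse through. The class and constant names below are illustrative.

```java
import java.util.regex.Pattern;

class AlternationDemo {
    // Pattern from the slide: a repeated group containing an alternation.
    static final Pattern ALTERNATION = Pattern.compile("(a|b)*");
    // Assumed alternative: a character class accepting the same strings.
    static final Pattern CHAR_CLASS = Pattern.compile("[ab]*");

    public static void main(String[] args) {
        var small = "abba";
        System.out.println(ALTERNATION.matcher(small).matches()); // true
        System.out.println(CHAR_CLASS.matcher(small).matches());  // true

        // Only the character class is tried on the large input; the
        // alternation form is the one the slide warns about.
        var large = "ab".repeat(100_000);
        System.out.println(CHAR_CLASS.matcher(large).matches());  // true
    }
}
```
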

What’s wrong?

    Pattern.compile("(.|\n)*");

Have the dot itself match the line breaks. Better to use:
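
The replacement pattern is missing from the extracted slide. A minimal sketch of one way to let the dot match line breaks, assuming the DOTALL flag from the Flags slide is what was meant; the class and constant names are illustrative.

```java
import java.util.regex.Pattern;

class DotAllDemo {
    // Pattern from the slide: alternation between dot and newline.
    static final Pattern DOT_OR_NEWLINE = Pattern.compile("(.|\n)*");
    // Assumed alternative: DOTALL makes . match line breaks directly.
    static final Pattern DOTALL = Pattern.compile("(?s).*");

    public static void main(String[] args) {
        var small = "line1\nline2";
        System.out.println(DOT_OR_NEWLINE.matcher(small).matches()); // true
        System.out.println(DOTALL.matcher(small).matches());         // true

        // Only the DOTALL form is tried on the large input.
        var large = "x\n".repeat(100_000);
        System.out.println(DOTALL.matcher(large).matches());         // true
    }
}
```
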

What’s wrong?

    Pattern.compile("a++abc");

Can’t match: a++ is possessive, so it consumes every "a" and never backtracks, leaving no "a" for the abc that follows.
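
Comparing the possessive pattern with its greedy counterpart on the same input shows the difference. This is a small sketch; PossessiveDemo, found and the sample input "aaabc" are illustrative choices.

```java
import java.util.regex.Pattern;

class PossessiveDemo {
    // Hypothetical helper: true if the regex matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Possessive a++ keeps every "a" it consumed, so "abc" never fits.
        System.out.println(found("a++abc", "aaabc")); // false
        // Greedy a+ gives one "a" back, so "abc" can match.
        System.out.println(found("a+abc", "aaabc"));  // true
    }
}
```
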

Beyond English

Problem:    "cc̈d̈d".replaceAll("[c̈d̈]", "X");
Reason:     Incorrectly assumes a Unicode grapheme cluster is one code point.
Fix:        "cc̈d̈d".replaceAll("c̈|d̈", "X");

Problem:    Pattern.compile("söme pättern", Pattern.CASE_INSENSITIVE);
Reason:     By default, case insensitive is ASCII only.
Fix:        Pattern.compile("söme pättern", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

Problem:    Pattern p = Pattern.compile("é|ë|è");
Reason:     Each character could be a single code point or a cluster.
Fix:        Pattern p = Pattern.compile("é|ë|è", Pattern.CANON_EQ);
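
The case-insensitivity row can be checked directly. The sketch below covers only that row; the class and helper names are illustrative, and the letters are written as Unicode escapes (\u00e4 is a-umlaut, \u00c4 is A-umlaut) so the example does not depend on source file encoding.

```java
import java.util.regex.Pattern;

class UnicodeCaseDemo {
    // Hypothetical helper: does lowercase a-umlaut match uppercase A-umlaut
    // when the pattern is compiled with the given flags?
    static boolean umlautMatches(int flags) {
        return Pattern.compile("\u00e4", flags).matcher("\u00c4").matches();
    }

    public static void main(String[] args) {
        // CASE_INSENSITIVE alone only folds ASCII letters.
        System.out.println(umlautMatches(Pattern.CASE_INSENSITIVE)); // false
        // Adding UNICODE_CASE extends case folding beyond ASCII.
        System.out.println(umlautMatches(
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE));       // true
    }
}
```
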

Kotlin

    val text = "Mary had a little lamb"
    val regex = Regex("\\b\\w{3,4} ")
    print(regex.find(text)?.value)

Prints: Mary

Kotlin

    val text = "Mary had a little lamb"
    val regex = "\\b\\w{3,4} ".toRegex()
    regex.findAll(text)
        .map { it.groupValues[0] }
        .forEach { print(it) }

Prints: Mary had

Kotlin

    val text = "Mary had a little lamb."
    val wordBoundary = "\\b"
    val threeOrFourChars = "\\w{3,4}"
    val space = " "
    val regex = Regex(wordBoundary +
        threeOrFourChars + space)
    println(regex.replaceFirst(text, "_"))

Prints: _had a little lamb.

Scala

    val text = "Mary had a little lamb"
    val regex = """\b\w{3,4} """.r
    val optional = regex findFirstIn text
    println(optional.getOrElse("No Match"))

Prints: Mary

Scala

    val text = "Mary had a little lamb."
    val regex = """\b\w{3,4} """.r
    val it = regex findAllIn text
    it foreach print

Prints: Mary had

Scala

    import scala.util.matching.Regex

    val text = "Mary had a little lamb."
    val wordBoundary = """\b"""
    val threeOrFourChars = """\w{3,4}"""
    val space = " "
    val regex = new Regex(wordBoundary +
        threeOrFourChars + space)
    println(regex replaceFirstIn(text, "_"))

Prints: _had a little lamb.

Groovy

    def text = 'Mary had a little lamb'
    def regex = /\b\w{3,4} /
    def matcher = text =~ regex
    print matcher[0]

Prints: Mary

Groovy

    def text = 'Mary had a little lamb'
    def regex = /\b\w{3,4} /
    def matcher = text =~ regex
    print matcher.findAll().join(' ')

Prints: Mary had

Clojure

    (ns clojure.examples.example
      (:gen-class))
    (defn Replacer []
      (def text "Mary had a little lamb.")
      (def wordBoundary "\\b")
      (def threeOrFourChars "\\w{3,4}")
      (def space " ")
      (def regex (str wordBoundary
                      threeOrFourChars space))
      (def pat (re-pattern regex))
      (println (clojure.string/replace-first
                 text pat "_")))
    (Replacer)

Prints: _had a little lamb.