Readable Regular Expressions in Java

Jeanne Boyarsky

July 20, 2023

Transcript

  1. twitter.com/jeanneboyarsky mastodon.social/@jeanneboyarsky

    History
    1943: Patterns in neuroscience
    1951: Stephen Kleene describing neural networks
    1960s: Pattern matching in text editors, lexical parsing in compilers
    1980s: Perl
    2002: Java 1.4 - regex in core Java
  2. Greedy Quantifiers

    | Symbol | # j’s? |
    | --- | --- |
    | j | 1 |
    | j? | 0-1 |
    | j* | 0 or more |
    | j+ | 1 or more |
    | j{5} | 5 |
    | j{5,6} | 5-6 |
    | j{5,} | 5 or more |
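The quantifier table can be checked directly with `Pattern.matches`, which requires the whole input to match. The following is a minimal sketch; the class name `QuantifierDemo` and the helper `fullMatch` are illustrative names, not from the slides.

```java
import java.util.regex.Pattern;

public class QuantifierDemo {
    // Returns true only when the entire input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("j?", ""));           // zero j's is allowed by ?
        System.out.println(fullMatch("j+", ""));           // + needs at least one j
        System.out.println(fullMatch("j{5,6}", "jjjjjj")); // six j's is within 5-6
        System.out.println(fullMatch("j{5,}", "jjjj"));    // four j's is too few
    }
}
```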
  3. Puzzle time - teams

    Can you come up with 10 ways of matching your assigned regex? (Try at regex101.com if you aren’t sure what will match.)
    • One or more x’s
    • Zero or more x’s
    • Two x’s
    Note: for this game, you can only have two x’s in each regex.
  4. Sample solutions

    One or more: • x+ • x{1,} • xx* • x{1}x* • x{1,1}x* • x{1}x{0,} • x{1,1}x{0,} • xx{0,} • x{0,0}x{1,} • (x|x)+ • etc
    Zero or more: • x* • x*x* • x{0}x* • x{0,}x* • x{0,0}x* • x*x{0} • x*x{0,} • x*x{0,0} • x{0}x{0,} • (x|x)* • etc
    Two: • xx • x{2} • x{2,2} • x{1}x{1} • x{1,1}x{1,1} • x{0}x{2} • x{0,0}x{2} • x{2}x{0} • (x|x){2} • etc
  5. Common Character Classes

    | Regex | Matches |
    | --- | --- |
    | [123] | Any of 1, 2 or 3 |
    | [1-3] | Any of 1, 2 or 3 |
    | [^5] | Any character but “5” |
    | [a-zA-Z] | Letter |
    | \d | Digit |
    | \s | Whitespace |
    | \w | Word character (letter, digit or underscore) |
  6. Less Common Character Classes

    | Regex | Matches | Longer form |
    | --- | --- | --- |
    | \D | Not digit | [^0-9] |
    | \S | Not whitespace | [^\s] |
    | \W | Not word char | [^a-zA-Z0-9_] |
    | [1-3[x-z]] | Union | [1-3x-z] |
    | [[m-p]&&[l-n]] | Intersection | [mn] |
    | [m-p&&[^o]] | Subtraction | [mnp] |

    Clarity Understanding
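The union, intersection and subtraction forms are specific to Java's character class syntax, so a quick check may help. This is a small sketch; `CharClassDemo` and `matches` are illustrative names.

```java
import java.util.regex.Pattern;

public class CharClassDemo {
    // Returns true only when the entire input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Union: digits 1-3 or letters x-z
        System.out.println(matches("[1-3[x-z]]", "y"));     // inside the union
        System.out.println(matches("[1-3[x-z]]", "5"));     // outside the union
        // Intersection: only m and n are in both ranges
        System.out.println(matches("[[m-p]&&[l-n]]", "n")); // in both ranges
        System.out.println(matches("[[m-p]&&[l-n]]", "o")); // only in m-p
        // Subtraction: m-p without o
        System.out.println(matches("[m-p&&[^o]]", "p"));    // kept
        System.out.println(matches("[m-p&&[^o]]", "o"));    // subtracted
    }
}
```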
  7. Match a Pattern

    var phoneNumbers = """
      111-111-1111
      222-222-2222
      """;
    var pattern = Pattern.compile("[0-9]{3}-[0-9]{3}-[0-9]{4}");
    var matcher = pattern.matcher(phoneNumbers);
    while (matcher.find())
      System.out.println(matcher.group());

    You promised readable regex!
  8. Refactored

    var phoneNumbers = """
      111-111-1111
      222-222-2222
      """;
    var threeDigits = "[0-9]{3}";
    var fourDigits = "[0-9]{4}";
    var dash = "-";
    var regex = threeDigits + dash + threeDigits + dash + fourDigits;
    var pattern = Pattern.compile(regex);
    var matcher = pattern.matcher(phoneNumbers);
    while (matcher.find())
      System.out.println(matcher.group());
  9. Escaping

    var phoneNumbers = """
      111-111-1111
      222-222-2222
      """;
    var threeDigits = "\\d{3}";
    var fourDigits = "\\d{4}";
    var dash = "-";
    var regex = threeDigits + dash + threeDigits + dash + fourDigits;
    var pattern = Pattern.compile(regex);
    var matcher = pattern.matcher(phoneNumbers);
    while (matcher.find())
      System.out.println(matcher.group());
  10. Groups

    var phoneNumbers = """
      111-111-1111
      222-222-2222
      """;
    var areaCodeGroup = "(\\d{3})";
    var threeDigits = "\\d{3}";
    var fourDigits = "\\d{4}";
    var dash = "-";
    var regex = areaCodeGroup + dash + threeDigits + dash + fourDigits;
    var pattern = Pattern.compile(regex);
    var matcher = pattern.matcher(phoneNumbers);
    while (matcher.find())
      System.out.println(matcher.group(1));
  11. What is the output?

    var numbers = "012";
    var regex = "((\\d)(\\d))\\d";
    var pattern = Pattern.compile(regex);
    var matcher = pattern.matcher(numbers);
    while (matcher.find()) {
      System.out.format("%s %s ", matcher.group(), matcher.group(0));
      System.out.format("%s %s ", matcher.group(1), matcher.group(2));
      System.out.format("%s %s", matcher.group(3), matcher.group(4));
    }

    Output: 012 012 01 0 Index out of bounds: no group 4
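One way to avoid asking for a group that does not exist is to loop up to `Matcher.groupCount()`, which reports the number of capturing groups (group 0, the whole match, is not counted). A minimal sketch, with `GroupCountDemo` and `groupsOfFirstMatch` as illustrative names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupCountDemo {
    // Collects group 0 (the whole match) through the last capturing group
    // of the first match; returns an empty list when nothing matches.
    static List<String> groupsOfFirstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> groups = new ArrayList<>();
        if (matcher.find()) {
            for (int i = 0; i <= matcher.groupCount(); i++) {
                groups.add(matcher.group(i));
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        // Groups are numbered by the position of their opening parenthesis.
        System.out.println(groupsOfFirstMatch("((\\d)(\\d))\\d", "012")); // [012, 01, 0, 1]
    }
}
```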
  12. Named Capturing Groups

    var phoneNumbers = """
      111-111-1111
      """;
    var areaCodeGroup = "(?<areaCode>\\d{3})";
    var threeDigits = "\\d{3}";
    var fourDigits = "\\d{4}";
    var dash = "-";
    var regex = areaCodeGroup + dash + threeDigits + dash + fourDigits;
    var pattern = Pattern.compile(regex);
    var matcher = pattern.matcher(phoneNumbers);
    while (matcher.find())
      System.out.println(matcher.group("areaCode"));
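Named groups pay off more when several parts of the match are needed. The sketch below extends the slide's idea to three groups; only `areaCode` comes from the slide, while `exchange`, `lineNumber`, the class name and the `describe` helper are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedGroupDemo {
    // Only "areaCode" appears in the talk; the other group names are illustrative.
    private static final String AREA_CODE = "(?<areaCode>\\d{3})";
    private static final String EXCHANGE = "(?<exchange>\\d{3})";
    private static final String LINE_NUMBER = "(?<lineNumber>\\d{4})";
    private static final Pattern PHONE =
            Pattern.compile(AREA_CODE + "-" + EXCHANGE + "-" + LINE_NUMBER);

    // Returns "areaCode/exchange/lineNumber" for the first phone number found, or null.
    static String describe(String input) {
        Matcher matcher = PHONE.matcher(input);
        if (!matcher.find()) {
            return null;
        }
        return matcher.group("areaCode") + "/"
                + matcher.group("exchange") + "/"
                + matcher.group("lineNumber");
    }

    public static void main(String[] args) {
        System.out.println(describe("call 111-222-3333 today")); // 111/222/3333
    }
}
```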
  13. Exact match

    var string = "Elevation high";
    var regex = "[a-zA-Z ]+";
    System.out.println(string.matches(regex));

    That’s a lot of ceremony!
  14. Replace

    var before = "123 Sesame Street";
    var after = before.replaceAll("\\d", "");
    System.out.println(after);

    Now THAT is easy to read
  15. What does this print?

    var string = "Mile High City";
    string = string.replaceAll("^\\w+", "");
    string = string.replaceAll("\\w+$", "");
    string = string.strip();
    System.out.println(string);

    Output: High
  16. With readability this time?

    var string = "Mile High City";
    var firstWord = "^\\w+";
    var lastWord = "\\w+$";
    string = string.replaceAll(firstWord, "");
    string = string.replaceAll(lastWord, "");
    string = string.strip();
    System.out.println(string);
  17. What about now?

    var string = "Mile High City";
    var boundaryAndWord = "\\b\\w+";
    string = string.replaceAll(boundaryAndWord, "");
    string = string.strip();
    System.out.println(string);

    Output: Blank. Both start of string and spaces are boundaries
  18. What does this print?

    var text = "\\___/";
    var regex = "\\_.*/";
    System.out.println(text.matches(regex));

    Output: false
    Need four backslashes in the regex to print true.
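The four-backslash fix can be sketched as follows, next to an alternative that uses `Pattern.quote` to build the literal backslash. The class and method names are illustrative, and the `Pattern.quote` variant is a suggestion rather than something shown in the talk.

```java
import java.util.regex.Pattern;

public class BackslashDemo {
    // Four backslashes in Java source become two in the regex,
    // which match one literal backslash in the text.
    static boolean matchesEscaped(String text) {
        return text.matches("\\\\_.*/");
    }

    // Pattern.quote produces a regex that matches its argument literally.
    static boolean matchesQuoted(String text) {
        return text.matches(Pattern.quote("\\") + "_.*/");
    }

    public static void main(String[] args) {
        var text = "\\___/"; // the five characters \___/
        System.out.println(matchesEscaped(text));
        System.out.println(matchesQuoted(text));
    }
}
```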
  19. Flags

    | Flag | Name | Purpose |
    | --- | --- | --- |
    | (?i) | CASE_INSENSITIVE | Case insensitive ASCII |
    | (?m) | MULTILINE | ^ and $ match at line breaks |
    | (?s) | DOTALL | . matches line break |
    | (?d) | UNIX_LINES | Only \n counts as a line break |
    | (?x) | COMMENTS | Ignores whitespace and # to end of line |

    + Unicode ones
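The effect of the embedded flags is easiest to see side by side with the unflagged regex. A short sketch, assuming Java 9 or later for `Matcher.results()`; `FlagDemo` and `countMatches` are illustrative names.

```java
import java.util.regex.Pattern;

public class FlagDemo {
    // Counts how many times the regex is found in the input.
    static long countMatches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).results().count();
    }

    public static void main(String[] args) {
        var lines = "one\ntwo\nthree";
        // Without (?m), ^ matches only at the start of the input.
        System.out.println(countMatches("^\\w+", lines));     // 1
        // With (?m), ^ also matches after each line break.
        System.out.println(countMatches("(?m)^\\w+", lines)); // 3
        // Without (?s), . does not match a line break.
        System.out.println("a\nb".matches("a.b"));            // false
        System.out.println("a\nb".matches("(?s)a.b"));        // true
    }
}
```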
  20. Comments

    var fiveDigits = "\\d{5}";
    var optionalFourDigitSuffix = "(-\\d{4})?";
    var regex = fiveDigits + optionalFourDigitSuffix;
    var pattern = Pattern.compile(regex);

    var regex = """
      \\d{5}      # five digits
      (-\\d{4})?  # optional four digits
      """;
    var pattern = Pattern.compile(regex, Pattern.COMMENTS);

    Which is more readable? When would the other be?
  21. Embedding Flag

    var html = """
      <html>
      …
      <body>
      <p>Ready!</p>
      </body>
      </html>
      """;
    var body = html.replaceFirst("(?s)^.*<body>", "")
      .replaceFirst("(?s)</body>.*$", "")
      .strip();
    System.out.println(body);

    Output: <p>Ready!</p>
  22. So I have to say what I don’t want?

    var regex = "(?s).*<body>(.*)</body>.*";
    var body = html
      .replaceFirst(regex, "$1")
      .strip();
    System.out.println(body);

    Output: <p>Ready!</p>
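An alternative that only states what is wanted is to search with `Matcher.find()` and read the capturing group, instead of replacing everything around it. This is a sketch of that approach, not something from the slides; `BodyExtractor` and `extractBody` are illustrative names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BodyExtractor {
    // (?s) lets . match line breaks so the body can span several lines.
    private static final Pattern BODY = Pattern.compile("(?s)<body>(.*)</body>");

    // Returns the stripped text between <body> and </body>, or "" if not found.
    static String extractBody(String html) {
        Matcher matcher = BODY.matcher(html);
        return matcher.find() ? matcher.group(1).strip() : "";
    }

    public static void main(String[] args) {
        var html = "<html>\n<body>\n<p>Ready!</p>\n</body>\n</html>\n";
        System.out.println(extractBody(html)); // <p>Ready!</p>
    }
}
```

Because `find()` does not need to match the whole input, the leading and trailing `.*` are no longer required.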
  23. Huh?

    var dotAllMode = "(?s)";
    var anyChars = ".*";
    var captureAnyChars = "(.*)";
    var startBody = "<body>";
    var endBody = "</body>";
    var bodyPart = startBody + captureAnyChars + endBody;
    var regex = dotAllMode + anyChars + bodyPart + anyChars;
    var body = html.replaceFirst(regex, "$1")
      .strip();
    System.out.println(body);

    Output: <p>Ready!</p>
  24. Where did the close * go?

    var text = "* -aa- -b- *";
    var pattern = Pattern.compile("-([a-z]+)-");
    var matcher = pattern.matcher(text);
    var builder = new StringBuilder();
    while (matcher.find())
      matcher.appendReplacement(builder, "x");
    System.out.println(builder);

    Output: * x x
  25. Where did the close * go?

    var text = "* -aa- -b- *";
    var pattern = Pattern.compile("-([a-z]+)-");
    var matcher = pattern.matcher(text);
    var builder = new StringBuilder();
    while (matcher.find())
      matcher.appendReplacement(builder, "x");
    matcher.appendTail(builder);
    System.out.println(builder);

    Output: * x x *
  26. What does this do?

    var text = "* -aa- -b- *";
    var pattern = Pattern.compile("-([a-z]+)-");
    var matcher = pattern.matcher(text);
    var builder = new StringBuilder();
    while (matcher.find())
      matcher.appendReplacement(builder, "$");
    System.out.println(builder);

    Output: IllegalArgumentException: Illegal group reference: group index is missing
  27. Fix

    var text = "* -aa- -b- *";
    var pattern = Pattern.compile("-([a-z]+)-");
    var matcher = pattern.matcher(text);
    var builder = new StringBuilder();
    while (matcher.find()) {
      var replace = Matcher.quoteReplacement("$");
      matcher.appendReplacement(builder, replace);
    }
    System.out.println(builder);

    Output: * $ $
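The two lessons from these slides, calling `appendTail` and quoting the replacement, can be combined in one helper. A minimal sketch, assuming Java 9 or later for the `StringBuilder` overloads; `LiteralReplaceDemo` and `replaceEach` are illustrative names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LiteralReplaceDemo {
    // Replaces every match of regex with literalReplacement, treating
    // $ and \ in the replacement as plain characters.
    static String replaceEach(String text, String regex, String literalReplacement) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        StringBuilder builder = new StringBuilder();
        String safe = Matcher.quoteReplacement(literalReplacement);
        while (matcher.find()) {
            matcher.appendReplacement(builder, safe);
        }
        // Without appendTail, the text after the last match is dropped.
        matcher.appendTail(builder);
        return builder.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceEach("* -aa- -b- *", "-([a-z]+)-", "$")); // * $ $ *
    }
}
```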
  28. Quantifier Types

    | Sample | Type | Description |
    | --- | --- | --- |
    | z? | Greedy | Read whole string and backtrack |
    | z?? | Reluctant | Look at one character at a time |
    | z?+ | Possessive | Read whole string/never backtrack |
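The three quantifier types give visibly different results when searching for a bracketed tag. The sketch below uses the `+` forms rather than the `?` forms in the table; the input string, `QuantifierTypeDemo` and `firstMatch` are illustrative choices.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierTypeDemo {
    // Returns the first match of regex in input, or "no match".
    static String firstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group() : "no match";
    }

    public static void main(String[] args) {
        var input = "<a><b>";
        // Greedy: takes everything, then backtracks to the last >.
        System.out.println(firstMatch("<.+>", input));  // <a><b>
        // Reluctant: stops at the first > it can.
        System.out.println(firstMatch("<.+?>", input)); // <a>
        // Possessive: takes everything including the final > and never gives it back.
        System.out.println(firstMatch("<.++>", input)); // no match
    }
}
```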
  29. Comparing

    var text = "Poem: row row row your boat";
    System.out.println(text.matches(".*(row )+your boat"));
    System.out.println(text.matches(".*?(row )+your boat"));
    System.out.println(text.matches(".*+(row )+your boat"));

    Output:
    true (extra backtracking)
    true (faster)
    false
  30. Looking

    var text = "1 fish 2 fish red fish blue fish";
    var regex = "\\w+ fish(?! blue)";
    var pattern = Pattern.compile(regex);
    var matcher = pattern.matcher(text);
    while (matcher.find())
      System.out.println(matcher.group());

    Output:
    1 fish
    2 fish
    blue fish
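The slide shows a negative lookahead. For comparison, the sketch below runs the same text through a positive lookahead and a positive lookbehind as well; those two regexes are additions for illustration, not from the talk, and `LookaroundDemo` and `findAll` are illustrative names. It assumes Java 9 or later for `Matcher.results()`.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class LookaroundDemo {
    // Returns every match of regex in input, in order.
    static List<String> findAll(String regex, String input) {
        return Pattern.compile(regex).matcher(input).results()
                .map(MatchResult::group)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        var text = "1 fish 2 fish red fish blue fish";
        // Negative lookahead: not followed by " blue"
        System.out.println(findAll("\\w+ fish(?! blue)", text)); // [1 fish, 2 fish, blue fish]
        // Positive lookahead: must be followed by " blue"
        System.out.println(findAll("\\w+ fish(?= blue)", text)); // [red fish]
        // Positive lookbehind: must be preceded by "red "
        System.out.println(findAll("(?<=red )\\w+", text));      // [fish]
    }
}
```

Lookarounds check the surrounding text without consuming it, which is why " blue" is not part of any match.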
  31. Indexes

    var text = "i am sam. I am SAM. Sam i am";
    var pattern = Pattern.compile("(?i)sam");
    var matcher = pattern.matcher(text);
    while (matcher.find())
      System.out.println(matcher.start() + "-" + matcher.end());

    Output:
    5-8
    15-18
    20-23
  32. What’s wrong?

    Pattern regex = Pattern.compile("myRegex");
    Matcher matcher = regex.matcher("s");

    Performance since not static pattern
    Readability tradeoff
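A common way to address the performance point is to compile the pattern once into a `static final` constant and reuse it. A minimal sketch using the zip code regex from the Comments slide; `ZipCodeValidator`, `ZIP_CODE` and `isZipCode` are illustrative names.

```java
import java.util.regex.Pattern;

public class ZipCodeValidator {
    // Compiled once when the class is loaded and reused by every call.
    // Pattern instances are immutable and safe to share between threads.
    private static final Pattern ZIP_CODE = Pattern.compile("\\d{5}(-\\d{4})?");

    // A new Matcher is created per call because matchers hold state.
    static boolean isZipCode(String input) {
        return ZIP_CODE.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isZipCode("12345"));      // true
        System.out.println(isZipCode("12345-6789")); // true
        System.out.println(isZipCode("1234"));       // false
    }
}
```

The constant's name also documents what the regex is for, which helps with the readability tradeoff.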
  33. What’s wrong?

    if (dateString.matches("^(?:(?:31(\\/|-|\\.)(?:0?[13578]|1[02]))\\1|(?:(?:29|30)(\\/|-|\\.)(?:0?[13-9]|1[0-2])\\2))(?:(?:1[6-9]|[2-9]\\d)?\\d{2})$|^(?:29(\\/|-|\\.)0?2\\3(?:(?:(?:1[6-9]|[2-9]\\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:0?[1-9]|1\\d|2[0-8])(\\/|-|\\.)(?:(?:0?[1-9])|(?:1[0-2]))\\4(?:(?:1[6-9]|[2-9]\\d)?\\d{2})$")) {
      handleDate(dateString);
    }

    Too complicated
    I draw the line way before this :)
  34. What’s wrong?

    String regex = request.getParameter("regex");
    String input = request.getParameter("input");
    return input.matches(regex);

    Denial of service opportunity. Need to validate
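The slide does not say how to validate. One possible mitigation, when the user only needs a literal text search, is to never interpret user input as a regex at all and wrap it with `Pattern.quote`. The sketch below shows that option under that assumption; `SafeUserSearch` and `containsLiteral` are hypothetical names, and other situations may call for different defenses such as an allow-list of patterns.

```java
import java.util.regex.Pattern;

public class SafeUserSearch {
    // Treats userText as literal text, so metacharacters such as ( + $
    // cannot trigger expensive backtracking.
    static boolean containsLiteral(String input, String userText) {
        Pattern literal = Pattern.compile(Pattern.quote(userText));
        return literal.matcher(input).find();
    }

    public static void main(String[] args) {
        // A regex like (a+)+$ is a classic backtracking trap; here it is just text.
        System.out.println(containsLiteral("aaaa", "(a+)+$"));          // false
        System.out.println(containsLiteral("see (a+)+$ here", "(a+)+$")); // true
    }
}
```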
  35. Beyond English

    Problem: "cc̈d̈d".replaceAll("[c̈d̈]", "X");
    Reason/Fix: Incorrectly assumes a Unicode Grapheme Cluster is one code point. Fix: "cc̈d̈d".replaceAll("c̈|d̈", "X");

    Problem: Pattern.compile("söme pättern", Pattern.CASE_INSENSITIVE);
    Reason/Fix: By default, case insensitive is ASCII only. Fix: Pattern.compile("söme pättern", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    Problem: Pattern p = Pattern.compile("é|ë|è");
    Reason/Fix: Could be code point or cluster. Fix: Pattern p = Pattern.compile("é|ë|è", Pattern.CANON_EQ);
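The case-insensitivity problem is easy to reproduce. The sketch below uses a shortened version of the slide's string and writes ö and Ö as Unicode escapes so the example does not depend on source file encoding; `UnicodeCaseDemo` and `matchesIgnoringCase` are illustrative names.

```java
import java.util.regex.Pattern;

public class UnicodeCaseDemo {
    // Matches the whole input case-insensitively, optionally Unicode-aware.
    static boolean matchesIgnoringCase(String regex, String input, boolean unicodeAware) {
        int flags = Pattern.CASE_INSENSITIVE;
        if (unicodeAware) {
            flags |= Pattern.UNICODE_CASE;
        }
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        var lower = "s\u00f6me"; // "söme"
        var upper = "S\u00d6ME"; // "SÖME"
        // ASCII-only folding handles s, m and e but not ö versus Ö.
        System.out.println(matchesIgnoringCase(lower, upper, false)); // false
        System.out.println(matchesIgnoringCase(lower, upper, true));  // true
    }
}
```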
  36. Kotlin

    val text = "Mary had a little lamb"
    val regex = Regex("\\b\\w{3,4} ")
    print(regex.find(text)?.value)

    Output: Mary
  37. Kotlin

    val text = "Mary had a little lamb"
    val regex = "\\b\\w{3,4} ".toRegex()
    regex.findAll(text)
      .map { it.groupValues[0] }
      .forEach { print(it) }

    Output: Mary had
  38. Kotlin

    val text = "Mary had a little lamb."
    val wordBoundary = "\\b"
    val threeOrFourChars = "\\w{3,4}"
    val space = " "
    val regex = Regex(wordBoundary + threeOrFourChars + space)
    println(regex.replaceFirst(text, "_"))

    Output: _had a little lamb.
  39. Kotlin - SuperExpressive

    anyOf {
      string("hello")
        .digit()
        .word()
        .char('.')
        .char('#')
    }

    Justin Lee https://github.com/evanchooly/super-expressive
  40. Scala

    val text = "Mary had a little lamb"
    val regex = """\b\w{3,4} """.r
    val optional = regex findFirstIn text
    println(optional.getOrElse("No Match"))

    Output: Mary
  41. Scala

    val text = "Mary had a little lamb."
    val regex = """\b\w{3,4} """.r
    val it = regex findAllIn text
    it foreach print

    Output: Mary had
  42. Scala

    import scala.util.matching.Regex
    val text = "Mary had a little lamb."
    val wordBoundary = """\b"""
    val threeOrFourChars = """\w{3,4}"""
    val space = " "
    val regex = new Regex(wordBoundary + threeOrFourChars + space)
    println(regex replaceFirstIn(text, "_"))

    Output: _had a little lamb.
  43. Groovy

    def text = 'Mary had a little lamb'
    def regex = /\b\w{3,4} /
    def matcher = text =~ regex
    print matcher[0]

    Output: Mary
  44. Groovy

    def text = 'Mary had a little lamb'
    def regex = /\b\w{3,4} /
    def matcher = text =~ regex
    print matcher.findAll().join(' ')

    Output: Mary had
  45. Groovy

    def text = 'Mary had a little lamb.'
    def wordBoundary = "\\b"
    def threeOrFourChars = "\\w{3,4}"
    def space = " "
    def regex = /$wordBoundary$threeOrFourChars$space/
    println text.replaceFirst(regex) { it -> '_' }
    println text.replaceFirst(regex, '_')

    Output:
    _had a little lamb.
    _had a little lamb.
  46. Clojure

    (ns clojure.examples.example
      (:gen-class))
    (defn Replacer []
      (def text "Mary had a little lamb.")
      (def wordBoundary "\\b")
      (def threeOrFourChars "\\w{3,4}")
      (def space " ")
      (def regex (str wordBoundary threeOrFourChars space))
      (def pat (re-pattern regex))
      (println (clojure.string/replace-first text pat "_")))
    (Replacer)

    Output: _had a little lamb.
  47. Puzzle Time

    Challenge before book draw: regexcrossword.com
    Answer key: https://github.com/deepaksood619/RegexCrossword
    Experienced - questionable tough
    I needed answer key for two