Jeanne Boyarsky
July 20, 2023

Jeanne Boyarsky

Thursday, July 20th 2023

UberConf

speakerdeck.com/boyarsky

Pause for a Commercial
2
Java certs: 8/11/17

Book giveaway at end!

note!

Intro

Why learning regex matters
5
Anyone see what is wrong?
Match all version 1 TPR reports

Why learning regex matters
6
Uh oh :(

Why learning regex matters
7
Whew

History
8
1943 Patterns in neuroscience
1951 Stephen Kleene describing neural networks
1960’s Pattern matching in text editors, lexical parsing in compilers
1980’s PERL
2002 Java 1.4 - regex in core Java

Greedy Quantifiers
9
Symbol # j’s?
j 1
j? 0-1
j* 0 or more
j+ 1 or more
j{5} 5
j{5,6} 5-6
j{5,} 5 or more
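
One way to check the table is to run each quantifier against strings of j's. This is a minimal sketch, not code from the talk: the class and helper names (QuantifierDemo, fullMatch) are made up for illustration, and it relies on Pattern.matches, which only succeeds when the whole input matches.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    // Hypothetical helper: true only when the ENTIRE input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("j?", ""));            // zero j's against j?
        System.out.println(fullMatch("j+", ""));            // zero j's against j+
        System.out.println(fullMatch("j{5,6}", "jjjjj"));   // five j's against j{5,6}
        System.out.println(fullMatch("j{5,6}", "jjjjjjj")); // seven j's against j{5,6}
        System.out.println(fullMatch("j{5,}", "jjjjjjj"));  // seven j's against j{5,}
    }
}
```

Swapping in find() instead of a full match would change the answers, since find() is happy with a match anywhere in the input.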

Puzzle time - teams
10
Can you come up with 10 ways of matching each of these?

(try at regex101.com if you aren’t sure what will match)

• One or more x’s

• Zero or more x’s

• Two x’s?

Note: for this game, you can only have two x’s in each regex

Sample solutions
11
One or more

• x+

• x{1,}

• xx*

• x{1}x*

• x{1,1}x*

• x{1}x{0,}

• x{1,1}x{0,}

• xx{0,}

• x{0,0}x{1,}

• (x|x)+

• etc

Zero or more

• x*

• x*x*

• x{0}x*

• x{0,}x*

• x{0,0}x*

• x*x{0}

• x*x{0,}

• x*x{0,0}

• x{0}x{0,}

• (x|x)*

• etc

Two

• xx

• x{2}

• x{2,2}

• x{1}x{1}

• x{1,1}x{1,1}

• x{0}x{2}

• x{0,0}x{2}

• x{2}x{0,0}

• x{2}x{0}

• (x|x){2}

• etc

Common Character Classes
12
Regex Matches
[123] Any of 1, 2 or 3
[1-3] Any of 1, 2 or 3
[^5] Any character but “5”
[a-zA-Z] Letter
\d Digit
\s Whitespace
\w Word character (letter, digit or underscore)

Less Common Character Classes
13
Regex Matches Longer form
\D Not digit [^0-9]
\S Not whitespace [^\s]
\W Not word char [^a-zA-Z0-9_]
[1-3[x-z]] Union [1-3x-z]
[[m-p]&&[l-n]] Intersection [mn]
[m-p&&[^o]] Subtraction [mnp]
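
The union, intersection and subtraction rows can be verified by filtering a string of candidate characters through each class. This is an illustrative sketch only; CharClassDemo and keep are hypothetical names, not part of the talk.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // Hypothetical helper: keeps only the characters that the character class accepts.
    static String keep(String charClass, String candidates) {
        Pattern pattern = Pattern.compile(charClass);
        StringBuilder kept = new StringBuilder();
        for (char c : candidates.toCharArray()) {
            if (pattern.matcher(String.valueOf(c)).matches()) {
                kept.append(c);
            }
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        System.out.println(keep("[1-3[x-z]]", "0123wxyz"));  // union
        System.out.println(keep("[[m-p]&&[l-n]]", "lmnop")); // intersection
        System.out.println(keep("[m-p&&[^o]]", "lmnop"));    // subtraction
    }
}
```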
Clarity

Understanding

14
Runtime generated classes?
[[m-p]&&[l-n]] Rarely clearer

15
Team knows what \D means?
\d vs \\D?

Conclusion:
Don’t be clever!

Use case
17
Doing by hand likely faster…
Do 10 replaces one time

Time for Java

var phoneNumbers = """
111-111-1111
222-222-2222
""";
var pattern = Pattern.compile(
"[0-9]{3}-[0-9]{3}-[0-9]{4}");
var matcher = pattern.matcher(phoneNumbers);
while (matcher.find())
System.out.println(matcher.group());
Match a Pattern
19
You promised

var phoneNumbers = """
111-111-1111
222-222-2222
""";
var threeDigits = "[0-9]{3}";
var fourDigits = "[0-9]{4}";
var dash = "-";
var regex = threeDigits + dash
+ threeDigits + dash + fourDigits;
var pattern = Pattern.compile(regex);
var matcher = pattern.matcher(phoneNumbers);
while (matcher.find())
System.out.println(matcher.group());
Refactored
20

var phoneNumbers = """
111-111-1111
222-222-2222
""";
var threeDigits = "\\d{3}";
var fourDigits = "\\d{4}";
var dash = "-";
var regex = threeDigits + dash
+ threeDigits + dash + fourDigits;
var pattern = Pattern.compile(regex);
var matcher = pattern.matcher(phoneNumbers);
while (matcher.find())
System.out.println(matcher.group());
Escaping
21

var phoneNumbers = """
111-111-1111
222-222-2222
""";
var areaCodeGroup = "(\\d{3})";
var threeDigits = "\\d{3}";
var fourDigits = "\\d{4}";
var dash = "-";
var regex = areaCodeGroup + dash
+ threeDigits + dash + fourDigits;
var pattern = Pattern.compile(regex);
var matcher = pattern.matcher(phoneNumbers);
while (matcher.find())
System.out.println(matcher.group(1));
Groups
22

var numbers = "012";
var regex = "((\\d)(\\d))\\d";
var pattern = Pattern.compile(regex);
var matcher = pattern.matcher(numbers);
while (matcher.find()) {
System.out.format("%s %s ",
matcher.group(), matcher.group(0));
System.out.format("%s %s ",
matcher.group(1), matcher.group(2));
System.out.format("%s %s",
matcher.group(3), matcher.group(4));
}
What is the output?
23
012 012 01 0

Index out of bounds: no group 4

Named capturing groups
24
Group 2 was what now?
(?<name>X)

var phoneNumbers = """
111-111-1111
""";
var areaCodeGroup = "(?<areaCode>\\d{3})";
var threeDigits = "\\d{3}";
var fourDigits = "\\d{4}";
var dash = "-";
var regex = areaCodeGroup + dash
+ threeDigits + dash + fourDigits;
var pattern = Pattern.compile(regex);
var matcher = pattern.matcher(phoneNumbers);
while (matcher.find())
System.out.println(
matcher.group("areaCode"));
Named Capturing Groups
25

var string = "Elevation high";
var regex = "[a-zA-Z ]+";
System.out.println(
string.matches(regex));
Exact match
26
That’s a lot of ceremony!

var before = "123 Sesame Street";
var after = before.replaceAll("\\d", "");
System.out.println(after);
Replace
27
Now THAT is easy

var string = "Mile High City";
string = string.replaceAll("^\\w+", "");
string = string.replaceAll("\\w+$", "");
string = string.strip();
System.out.println(string);
What does this print?
28
High

29
With time?
var string = "Mile High City";
var firstWord = "^\\w+";
var lastWord = "\\w+$";
string = string.replaceAll(firstWord, "");
string = string.replaceAll(lastWord, "");
string = string.strip();
System.out.println(string);

var string = "Mile High City";
var boundaryAndWord = "\\b\\w+";
string = string.replaceAll(
boundaryAndWord, "");
string = string.strip();
System.out.println(string);
30
Blank. Both start of string and spaces are boundaries

What does this print?
31
var text = "\\___/";
var regex = "\\_.*/";
System.out.println(text.matches(regex));
false

Need four backslashes in the regex to print true.
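
The two levels of escaping can be seen side by side in a short sketch. BackslashDemo and matchesText are hypothetical names used only for this illustration; the text is the one from the slide.

```java
class BackslashDemo {
    // The text from the slide: one backslash, three underscores, one slash.
    static final String TEXT = "\\___/";

    // Hypothetical helper: does the whole text match the regex?
    static boolean matchesText(String regex) {
        return TEXT.matches(regex);
    }

    public static void main(String[] args) {
        // Java source "\\_" is the regex \_ which matches only an underscore,
        // so nothing consumes the leading backslash in the text.
        System.out.println(matchesText("\\_.*/"));
        // Java source "\\\\" is the regex \\ which matches one literal backslash.
        System.out.println(matchesText("\\\\_.*/"));
    }
}
```

Java string escaping halves the backslashes first, then the regex engine halves them again, which is why a single literal backslash costs four in source.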

Beyond the Basics

Flags
33
Flag Name Purpose
(?i) CASE_INSENSITIVE Case insensitive ASCII
(?m) MULTILINE ^ and $ match line breaks
(?s) DOTALL . matches line break
(?d) UNIX_LINES Only matches \n
(?x) COMMENTS Ignores whitespace and # to end of line
+ Unicode ones
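
A small sketch of three of these flags in embedded form. FlagDemo and count are hypothetical names for this example, the inputs are made up, and Matcher.results() assumes Java 9 or later.

```java
import java.util.regex.Pattern;

class FlagDemo {
    // Hypothetical helper: how many times is the regex found in the input?
    static long count(String regex, String input) {
        return Pattern.compile(regex).matcher(input).results().count();
    }

    public static void main(String[] args) {
        String lines = "1\n22\n333";
        // Without (?m), ^ and $ only anchor to the whole input.
        System.out.println(count("^\\d+$", lines));
        // With (?m), ^ and $ also anchor at each line break.
        System.out.println(count("(?m)^\\d+$", lines));

        // Without (?s), the dot refuses to cross the line break.
        System.out.println("a\nb".matches(".*"));
        System.out.println("a\nb".matches("(?s).*"));

        // With (?x), whitespace and # comments inside the regex are ignored.
        String zip = "(?x) \\d{5} # five digits\n (-\\d{4})? # optional four digits";
        System.out.println(Pattern.matches(zip, "12345-6789"));
    }
}
```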

var fiveDigits = "\\d{5}";
var optionalFourDigitSuffix = "(-\\d{4})?";
var regex = fiveDigits + optionalFourDigitSuffix;
var pattern = Pattern.compile(regex);
var regex = """
\\d{5} # five digits
(-\\d{4})? # optional four digits
""";
34
Which

When would the other be?

var html = """

""";
var body = html.replaceFirst("(?s)^.*<body>", "")
.replaceFirst("(?s)</body>.*$", "")
.strip();
System.out.println(body);
Embedding Flag
35

36
So I have to say what I don’t want?
var regex = "(?s).*<body>(.*)</body>.*";
var body = html
.replaceFirst(regex, "$1")
.strip();
System.out.println(body);

37
Huh?
var dotAllMode = "(?s)";
var anyChars = ".*";
var captureAnyChars = "(.*)";
var startBody = "<body>";
var endBody = "</body>";
var bodyPart = startBody
+ captureAnyChars + endBody;
var regex = dotAllMode + anyChars
+ bodyPart + anyChars;
var body = html.replaceFirst(regex, "$1")
.strip();
System.out.println(body);

Where did the close * go?
38
var text = "* -aa- -b- *";
var pattern =
Pattern.compile("-([a-z]+)-");
var matcher = pattern.matcher(text);
var builder = new StringBuilder();
while(matcher.find())
matcher.appendReplacement(
builder, "x");
System.out.println(builder);
* x x

Where did the close * go?
39
var text = "* -aa- -b- *";
var pattern =
Pattern.compile("-([a-z]+)-");
var matcher = pattern.matcher(text);
var builder = new StringBuilder();
while(matcher.find())
matcher.appendReplacement(
builder, "x");
matcher.appendTail(builder);
System.out.println(builder);
* x x *

What does this do?
40
var text = "* -aa- -b- *";
var pattern =
Pattern.compile("-([a-z]+)-");
var matcher = pattern.matcher(text);
var builder = new StringBuilder();
while(matcher.find())
matcher.appendReplacement(
builder, "$");
System.out.println(builder);
IllegalArgumentException: Illegal group reference:
group index is missing

Fix
41
var text = "* -aa- -b- *";
var pattern =
Pattern.compile("-([a-z]+)-");
var matcher = pattern.matcher(text);
var builder = new StringBuilder();
while(matcher.find()) {
    var replace = Matcher.quoteReplacement("$");
    matcher.appendReplacement(builder, replace);
}
System.out.println(builder);
* $ $
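
Putting the two fixes together (quoteReplacement for the literal $ and appendTail for the closing *) might look like the sketch below. AppendDemo and replaceEach are hypothetical names, and the StringBuilder overloads assume Java 9 or later.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AppendDemo {
    // Hypothetical helper: replaces every match with the literal text,
    // keeping everything before, between and after the matches.
    static String replaceEach(String text, String regex, String literal) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        StringBuilder builder = new StringBuilder();
        // Escape $ and \ so they are not read as group references.
        String safe = Matcher.quoteReplacement(literal);
        while (matcher.find()) {
            matcher.appendReplacement(builder, safe);
        }
        // Copy whatever follows the last match.
        matcher.appendTail(builder);
        return builder.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceEach("* -aa- -b- *", "-([a-z]+)-", "$"));
    }
}
```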

Quantifier Types
42
Sample Type Description
z? Greedy Read whole string and backtrack
z?? Reluctant Look at one character at a time
z?+ Possessive Read whole string/never backtrack
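
The three types are easiest to tell apart on a pattern where backtracking decides the outcome. The expressions below are an assumption chosen for illustration, not necessarily the ones used in the talk; QuantifierTypeDemo and matchesText are hypothetical names.

```java
class QuantifierTypeDemo {
    static final String TEXT = "Poem: row row row your boat";

    // Hypothetical helper: does the whole text match the regex?
    static boolean matchesText(String regex) {
        return TEXT.matches(regex);
    }

    public static void main(String[] args) {
        // Greedy: .* grabs everything, then gives characters back until "boat" fits.
        System.out.println(matchesText(".*boat"));
        // Reluctant: .*? grows one character at a time until "boat" fits.
        System.out.println(matchesText(".*?boat"));
        // Possessive: .*+ grabs everything and never gives any back,
        // so nothing is left for "boat".
        System.out.println(matchesText(".*+boat"));
    }
}
```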

Comparing
43
var text = "Poem: row row row your boat";
System.out.println(
System.out.println(
System.out.println(
true (extra backtracking)

true (faster)

false

44
Sample Type
(?<=r) Positive lookbehind
(?

Looking
45
var text =
"1 fish 2 fish red fish blue fish";
var regex = "\\w+ fish(?! blue)";
var pattern = Pattern.compile(regex);
var matcher = pattern.matcher(text);
while (matcher.find())
System.out.println(matcher.group());
1 fish

2 fish

blue fish
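
For comparison, here is a sketch that runs the slide's negative lookahead next to a positive lookahead and both kinds of lookbehind on the same text. LookaroundDemo and findAll are hypothetical names, the extra patterns are made up for illustration, and Matcher.results() assumes Java 9 or later.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class LookaroundDemo {
    static final String TEXT = "1 fish 2 fish red fish blue fish";

    // Hypothetical helper: every match of the regex in TEXT, in order.
    static List<String> findAll(String regex) {
        return Pattern.compile(regex).matcher(TEXT).results()
                .map(MatchResult::group)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(findAll("\\w+ fish(?! blue)")); // negative lookahead (from the slide)
        System.out.println(findAll("\\w+ fish(?= blue)")); // positive lookahead
        System.out.println(findAll("(?<=red )fish"));      // positive lookbehind
        System.out.println(findAll("(?<!red )fish"));      // negative lookbehind
    }
}
```

Lookarounds check the surrounding text without consuming it, which is why " blue" never appears in any of the matches.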

Indexes
46
var text = "i am sam. I am SAM. Sam i am";
var pattern = Pattern.compile("(?i)sam");
var matcher = pattern.matcher(text);
while (matcher.find())
System.out.println(matcher.start()
+ "-" + matcher.end());
5-8

15-18

20-23

Debugging Tip
47
Small regex and build up
Online regex checker

Sonar 9 Highlights

What’s wrong?
49
"[ab]|a"
Redundant. a is a subset of [ab]

What’s wrong?
50
changed = changed.
replaceAll("\\.\\.\\.", ";")
Performance. No need for a regex

changed = changed.replace("...", ";");

What’s wrong?
51
Pattern regex =
Pattern.compile("myRegex");
Matcher matcher =
regex.matcher("s");
Performance since not static pattern
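
One common shape for the fix is to compile the pattern once as a constant and reuse it. A minimal sketch, assuming a ZIP code check like the one earlier in the deck; ZipCodes, ZIP and isZip are hypothetical names.

```java
import java.util.regex.Pattern;

class ZipCodes {
    // Compiled once when the class loads, then shared by every call.
    private static final Pattern ZIP = Pattern.compile("\\d{5}(-\\d{4})?");

    // Hypothetical helper: only the cheap Matcher is created per call.
    static boolean isZip(String candidate) {
        return ZIP.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(isZip("12345"));
        System.out.println(isZip("12345-6789"));
        System.out.println(isZip("1234"));
    }
}
```

Pattern instances are immutable and safe to share between threads; Matcher instances are not, so the matcher stays local to the method.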

What’s wrong?
52
<.+?>

<[^>]+>
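
The slide shows only the two patterns. Presumably the point is that a negated character class is preferred over a reluctant quantifier because it does the same job with less work per character. The sketch below only checks that both find the same tags on ordinary input; TagDemo, findAll and the sample HTML are made up for illustration.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class TagDemo {
    // Hypothetical helper: every match of the regex in the input, in order.
    static List<String> findAll(String regex, String input) {
        return Pattern.compile(regex).matcher(input).results()
                .map(MatchResult::group)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String html = "<b>bold</b> and <i>italic</i>";
        // Reluctant: after each character, checks whether > comes next.
        System.out.println(findAll("<.+?>", html));
        // Negated class: simply consumes everything that is not >.
        System.out.println(findAll("<[^>]+>", html));
    }
}
```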

What’s wrong?
53
if (dateString.matches("^(?:(?:31(\\/|-|\\.)(?:0?[13578]|
1[02]))\\1|(?:(?:29|30)(\\/|-|\\.)(?:0?[13-9]|1[0-2])\\2))
(?:(?:1[6-9]|[2-9]\\d)?\\d{2})$|^(?:29(\\/|-|\\.)0?2\\3(?:
(?:(?:1[6-9]|[2-9]\\d)?(?:0[48]|[2468][048]|[13579][26])|
(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:0?[1-9]|1\\d|
2[0-8])(\\/|-|\\.)(?:(?:0?[1-9])|(?:1[0-2]))\\4(?:
(?:1[6-9]|[2-9]\\d)?\\d{2})$")) {
handleDate(dateString);
}
Too complicated
I draw the line way before this :)

What’s wrong?
54
Pattern.compile("$[a-z]+^");
Functionally wrong. Should be:

Pattern.compile("^[a-z]+$");

What’s wrong?
55
Pattern.compile("(?=a)b");
If lookahead matches next character = a, it isn’t b

What’s wrong?
56
Pattern.compile("[a-zA-Z]");
Assumes only English characters

Pattern.compile("\\p{IsAlphabetic}");
a-z is clearer if that’s all you want

What’s wrong?
57
String regex = request
.getParameter("regex");
String input = request
.getParameter("input");
return input.matches(regex);
Denial of service opportunity. Need to validate

What’s wrong?
58
Pattern.compile("(a|b)*");
Backtracking can overflow stack on large strings. Vs

Pattern.compile("[ab]*");

What’s wrong?
59
Pattern.compile("(.|\n)*");
Have dot itself match the line breaks. Better to use:

Pattern.compile("(?s).*");

What’s wrong?
60
Pattern.compile("(ab?)*");
Possessive quantifiers disable backtracking

Pattern.compile("(ab?)*+");

What’s wrong?
61
str.matches("\\d*?")
? is redundant here and causes backtracking

str.matches("\\d*")

What’s wrong?
62
Pattern.compile("a++abc");
Can’t match because ++ is possessive so no “a” left after

Pattern.compile("aa++bc");
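
A quick check of the two patterns, plus the plain greedy version for contrast. PossessiveDemo and fullMatch are hypothetical names and the input is made up.

```java
import java.util.regex.Pattern;

class PossessiveDemo {
    // Hypothetical helper: true only when the entire input matches.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // a++ takes every a and never gives one back, so the following "a" cannot match.
        System.out.println(fullMatch("a++abc", "aaabc"));
        // Moving the single "a" in front of the possessive part fixes it.
        System.out.println(fullMatch("aa++bc", "aaabc"));
        // Greedy a+ would have backtracked one character to let "abc" match.
        System.out.println(fullMatch("a+abc", "aaabc"));
    }
}
```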

Beyond English
63
Problem Reason/Fix
"cc̈d̈d".replaceAll("[c̈d̈]", "X");
Incorrectly assumes Unicode Grapheme Cluster is one code point. Fix:

"cc̈d̈d".replaceAll("c̈|d̈", "X");
Pattern.compile("söme pättern", Pattern.CASE_INSENSITIVE);
By default, case insensitive is ASCII only. Fix:

Pattern.compile("söme pättern",
Pattern.CASE_INSENSITIVE |
Pattern.UNICODE_CASE);
Pattern p = Pattern.compile("é|ë|è");
Could be code point or cluster. Fix:

Pattern p = Pattern.compile("é|ë|è", Pattern.CANON_EQ);
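
The case-insensitivity row can be checked with a short sketch. UnicodeCaseDemo and matchesIgnoringCase are hypothetical names; the ö and Ö are written as Unicode escapes so the example does not depend on source file encoding.

```java
import java.util.regex.Pattern;

class UnicodeCaseDemo {
    // Hypothetical helper: case-insensitive full match, optionally Unicode-aware.
    static boolean matchesIgnoringCase(String regex, String input, boolean unicodeCase) {
        int flags = Pattern.CASE_INSENSITIVE;
        if (unicodeCase) {
            flags |= Pattern.UNICODE_CASE;
        }
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String regex = "s\u00f6me"; // "söme"
        String input = "S\u00d6ME"; // "SÖME"
        // ASCII-only folding handles s/S but not the o-umlaut.
        System.out.println(matchesIgnoringCase(regex, input, false));
        // UNICODE_CASE extends case folding to non-ASCII letters.
        System.out.println(matchesIgnoringCase(regex, input, true));
    }
}
```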

Other JVM Lang Sampling

Use Cases
65
1. Find first match

2. Find all matches

3. Replace first match

val text = "Mary had a little lamb"
val regex = Regex("\\b\\w{3,4} ")
print(regex.find(text)?.value)
Kotlin
66
Mary

val text = "Mary had a little lamb"
val regex = "\\b\\w{3,4} ".toRegex()
regex.findAll(text)
.map { it.groupValues[0] }
.forEach { print(it) }
Kotlin
67

val text = "Mary had a little lamb."
val wordBoundary = "\\b"
val threeOrFourChars = "\\w{3,4}"
val space = " "
val regex = Regex(wordBoundary +
threeOrFourChars + space)
println(regex.replaceFirst(text, "_"))
Kotlin
68

anyOf {
string("hello")
.digit()
.word()
.char('.')
.char('#')
}
Kotlin - SuperExpressive
69
Justin Lee

https://github.com/evanchooly/super-expressive

val text = "Mary had a little lamb"
val regex = """\b\w{3,4} """.r
val optional = regex findFirstIn text
println(optional.getOrElse("No Match"))
Scala
70
Mary

val text = "Mary had a little lamb."
val regex = """\b\w{3,4} """.r
val it = regex findAllIn text
it foreach print
Scala
71

import scala.util.matching.Regex
val text = "Mary had a little lamb."
val wordBoundary = """\b"""
val threeOrFourChars = """\w{3,4}"""
val space = " "
val regex = new Regex(wordBoundary +
threeOrFourChars + space)
println(regex replaceFirstIn(text, "_"))
Scala
72

def text = 'Mary had a little lamb'
def regex = /\b\w{3,4} /
def matcher = text =~ regex
print matcher[0]
Groovy
73
Mary

def text = 'Mary had a little lamb'
def regex = /\b\w{3,4} /
def matcher = text =~ regex
print matcher.findAll().join(' ')
Groovy
74

def text = 'Mary had a little lamb.'
def wordBoundary = "\\b"
def threeOrFourChars = "\\w{3,4}"
def space = " "
def regex =
/$wordBoundary$threeOrFourChars$space/
println text.replaceFirst(regex) { it -> '_' }
println text.replaceFirst(regex, '_')
Groovy
75

(println
(re-find #"\b\w{3,4} ",
"Mary had a little lamb"))
Clojure
76
Mary

(println
(re-seq #"\b\w{3,4} ",
"Mary had a little lamb"))
Clojure
77

(ns clojure.examples.example
(:gen-class))
(defn Replacer []
(def text "Mary had a little lamb.")
(def wordBoundary "\\b")
(def threeOrFourChars "\\w{3,4}")
(def space " ")
(def regex (str wordBoundary
threeOrFourChars space))
(def pat (re-pattern regex))
(println(clojure.string/replace-first
text pat "_")))
(Replacer)
Clojure
78

Puzzle Time
79
Challenge before book draw

regexcrossword.com

https://github.com/deepaksood619/RegexCrossword

Experienced - questionable tough

I needed answer key for two