twitter.com/jeanneboyarsky mastodon.social/@jeanneboyarsky
History
8
1943 Patterns in neuroscience
1951 Stephen Kleene describing neural networks
1960s Pattern matching in text editors, lexical parsing in compilers
1980s Perl
2002 Java 1.4 - regex in core Java
Slide 9
Greedy Quantifiers
9
Symbol # j’s?
j 1
j? 0-1
j* 0 or more
j+ 1 or more
j{5} 5
j{5,6} 5-6
j{5,} 5 or more
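The table above can be checked directly in Java. This is a minimal sketch; the class name `QuantifierDemo` and the helper `matchesWhole` are made up for illustration and are not part of the slides.

```java
import java.util.regex.Pattern;

public class QuantifierDemo {
    // Returns true only when the whole input matches the regex.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String fiveJs = "j".repeat(5);
        String sevenJs = "j".repeat(7);
        System.out.println(matchesWhole("j?", ""));          // zero j's is allowed
        System.out.println(matchesWhole("j+", ""));          // needs at least one j
        System.out.println(matchesWhole("j{5}", fiveJs));    // exactly five
        System.out.println(matchesWhole("j{5,6}", sevenJs)); // seven is too many
        System.out.println(matchesWhole("j{5,}", sevenJs));  // five or more
    }
}
```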
Slide 10
Puzzle time
10
Can you come up with 10 ways of
matching your assigned regex?
• One or more x’s
Note: for this game, you can only have
two x’s in each regex
Slide 11
Sample solutions
11
One or more
• x+
• x{1,}
• xx*
• x{1}x*
• x{1,1}x*
• x{1}x{0,}
• x{1,1}x{0,}
• xx{0,}
• x{0,0}x{1,}
• (x|x)+
• etc
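One way to convince yourself that a puzzle answer really means "one or more x's" is to compare it with `x+` on a few inputs. The sketch below does that; the class name `PuzzleCheck`, the sample inputs, and the helper are assumptions for illustration, and agreeing on a handful of inputs is evidence rather than proof of equivalence.

```java
import java.util.List;
import java.util.regex.Pattern;

public class PuzzleCheck {
    // A selection of the sample solutions from the slide.
    static final List<String> CANDIDATES = List.of(
            "x+", "x{1,}", "xx*", "x{1}x*", "x{1,1}x*",
            "x{1}x{0,}", "x{1,1}x{0,}", "xx{0,}", "(x|x)+");

    // Hypothetical sample inputs, chosen to cover empty, matching, and non-matching cases.
    static final List<String> INPUTS = List.of("", "x", "xx", "xxxxx", "y", "xy");

    // True when the candidate agrees with x+ on every sample input.
    static boolean sameAsOneOrMore(String candidate) {
        for (String input : INPUTS) {
            if (Pattern.matches(candidate, input) != Pattern.matches("x+", input)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        for (String candidate : CANDIDATES) {
            System.out.println(candidate + " -> " + sameAsOneOrMore(candidate));
        }
    }
}
```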
Slide 12
Common Character Classes
12
Regex Matches
[123] Any of 1, 2 or 3
[1-3] Any of 1, 2 or 3
[^5] Any character but “5”
[a-zA-Z] Letter
\d Digit
\s Whitespace
\w Word character (letter, digit, or underscore)
Slide 13
Less Common Character Classes
13
Regex Matches Longer form
\D Not digit [^0-9]
\S Not whitespace [^\s]
\W Not word char [^a-zA-Z0-9_]
[1-3[x-z]] Union [1-3x-z]
[[m-p]&&[l-n]] Intersection [mn]
[m-p&&[^o]] Subtraction [mnp]
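The union, intersection, and subtraction rows can be tried out one character at a time. A minimal sketch, with `CharacterClassOps` and `matchesChar` as made-up names:

```java
public class CharacterClassOps {
    // Tests a single character against a character class.
    static boolean matchesChar(String charClass, char c) {
        return String.valueOf(c).matches(charClass);
    }

    public static void main(String[] args) {
        // Union: 1-3 or x-z
        System.out.println(matchesChar("[1-3[x-z]]", 'y'));
        // Intersection: only m and n are in both ranges
        System.out.println(matchesChar("[[m-p]&&[l-n]]", 'n'));
        System.out.println(matchesChar("[[m-p]&&[l-n]]", 'o'));
        // Subtraction: m-p except o
        System.out.println(matchesChar("[m-p&&[^o]]", 'o'));
        System.out.println(matchesChar("[m-p&&[^o]]", 'p'));
    }
}
```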
Clarity
Understanding
Slide 14
Know your use case
14
Runtime
generated
classes?
[[m-p]&&[l-n]] Rarely
clearer
Slide 15
Know your team
15
Team
knows what \D
means?
\D vs [^0-9] Misread
\d vs \\D?
Slide 16
Conclusion:
Don’t be clever!
Slide 17
Use case
17
Doing
by hand likely
faster….
Do 10 replaces one time
Slide 18
Time for Java
Slide 19
var phoneNumbers = """
111-111-1111
222-222-2222
""";
var pattern = Pattern.compile(
"[0-9]{3}-[0-9]{3}-[0-9]{4}");
var matcher = pattern.matcher(phoneNumbers);
while (matcher.find())
System.out.println(matcher.group());
Match a Pattern
19
You
promised
readable regex!
Slide 20
var phoneNumbers = """
111-111-1111
222-222-2222
""";
var threeDigits = "[0-9]{3}";
var fourDigits = "[0-9]{4}";
var dash = "-";
var regex = threeDigits + dash
+ threeDigits + dash + fourDigits;
var pattern = Pattern.compile(regex);
var matcher = pattern.matcher(phoneNumbers);
while (matcher.find())
System.out.println(matcher.group());
Refactored
20
Slide 21
var phoneNumbers = """
111-111-1111
222-222-2222
""";
var threeDigits = "\\d{3}";
var fourDigits = "\\d{4}";
var dash = "-";
var regex = threeDigits + dash
+ threeDigits + dash + fourDigits;
var pattern = Pattern.compile(regex);
var matcher = pattern.matcher(phoneNumbers);
while (matcher.find())
System.out.println(matcher.group());
Escaping
21
Slide 22
var phoneNumbers = """
111-111-1111
222-222-2222
""";
var areaCodeGroup = "(\\d{3})";
var threeDigits = "\\d{3}";
var fourDigits = "\\d{4}";
var dash = "-";
var regex = areaCodeGroup + dash
+ threeDigits + dash + fourDigits;
var pattern = Pattern.compile(regex);
var matcher = pattern.matcher(phoneNumbers);
while (matcher.find())
System.out.println(matcher.group(1));
Groups
22
Slide 23
var numbers = "012";
var regex = "((\\d)(\\d))\\d";
var pattern = Pattern.compile(regex);
var matcher = pattern.matcher(numbers);
while (matcher.find()) {
System.out.format("%s %s ",
matcher.group(), matcher.group(0));
System.out.format("%s %s ",
matcher.group(1), matcher.group(2));
System.out.format("%s %s",
matcher.group(3), matcher.group(4));
}
What is the output?
23
012 012 01 0
Index out of bounds: no group 4
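Groups are numbered by the position of their opening parenthesis, and `groupCount()` says how many there are, which avoids asking for a group that does not exist. A small sketch of that idea; the class name `GroupNumbering` and the helper `groups` are assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupNumbering {
    // Returns group 0 through groupCount() for the first match, or an empty list.
    static List<String> groups(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> result = new ArrayList<>();
        if (matcher.find()) {
            for (int i = 0; i <= matcher.groupCount(); i++) {
                result.add(matcher.group(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Group 0 is the whole match; 1 is the outer pair; 2 and 3 are the inner pairs.
        System.out.println(groups("((\\d)(\\d))\\d", "012")); // [012, 01, 0, 1]
    }
}
```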
Slide 24
Named capturing groups
24
Group
2 was what
now?
(?<name>)
Slide 25
var phoneNumbers = """
111-111-1111
""";
var areaCodeGroup = "(?<areaCode>\\d{3})";
var threeDigits = "\\d{3}";
var fourDigits = "\\d{4}";
var dash = "-";
var regex = areaCodeGroup + dash
+ threeDigits + dash + fourDigits;
var pattern = Pattern.compile(regex);
var matcher = pattern.matcher(phoneNumbers);
while (matcher.find())
System.out.println(
matcher.group("areaCode"));
Named Capturing Groups
25
Slide 26
var string = "Elevation high";
var regex = "[a-zA-Z ]+";
System.out.println(
string.matches(regex));
Exact match
26
That’s
a lot of
ceremony!
Slide 27
var before = "123 Sesame Street";
var after = before.replaceAll("\\d", "");
System.out.println(after);
Replace
27
Now
THAT is easy
to read
Slide 28
var string = "Mile High City";
string = string.replaceAll("^\\w+", "");
string = string.replaceAll("\\w+$", "");
string = string.strip();
System.out.println(string);
What does this print?
28
High
Slide 29
29
With
readability this
time?
var string = "Mile High City";
var firstWord = "^\\w+";
var lastWord = "\\w+$";
string = string.replaceAll(firstWord, "");
string = string.replaceAll(lastWord, "");
string = string.strip();
System.out.println(string);
Slide 30
var string = "Mile High City";
var boundaryAndWord = "\\b\\w+";
string = string.replaceAll(
boundaryAndWord, "");
string = string.strip();
System.out.println(string);
What about now?
30
Blank. Both start of string and spaces are boundaries
Slide 31
What does this print?
31
var text = "\\___/";
var regex = "\\_.*/";
System.out.println(text.matches(regex));
false
Need four backslashes in the regex to print true.
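The four-backslash version can be sketched as follows. `Pattern.quote` is shown as one alternative for the literal part; the class name `BackslashEscaping` and the helper are made up for illustration.

```java
import java.util.regex.Pattern;

public class BackslashEscaping {
    // The five characters \___/ (the Java literal needs two backslashes for one).
    static final String TEXT = "\\___/";

    static boolean matchesText(String regex) {
        return TEXT.matches(regex);
    }

    public static void main(String[] args) {
        // Regex \_.*/ : the backslash only escapes the underscore, so the leading \ in TEXT is unmatched.
        System.out.println(matchesText("\\_.*/"));
        // Regex \\_.*/ : a literal backslash, then an underscore, then anything up to the slash.
        System.out.println(matchesText("\\\\_.*/"));
        // Quoting the literal prefix avoids counting backslashes in the regex itself.
        System.out.println(matchesText(Pattern.quote("\\_") + ".*/"));
    }
}
```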
Slide 32
Beyond the Basics
Slide 33
Flags
33
Flag Name Purpose
(?i) CASE_INSENSITIVE Case insensitive (ASCII)
(?m) MULTILINE ^ and $ match at line breaks
(?s) DOTALL . matches line breaks
(?d) UNIX_LINES Only \n is treated as a line break
(?x) COMMENTS Ignores whitespace and # to end of line
+ Unicode ones
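A few of these flags in action, both embedded in the regex and passed as constants. This is only a sketch; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class FlagDemo {
    // (?i) embedded in the regex itself.
    static boolean caseInsensitive() {
        return "SAM".matches("(?i)sam");
    }

    // With MULTILINE, ^ and $ also match around the line break in "a\nb".
    static boolean findsSecondLine(boolean multiline) {
        int flags = multiline ? Pattern.MULTILINE : 0;
        return Pattern.compile("^b$", flags).matcher("a\nb").find();
    }

    // With (?s), the dot also matches the line break.
    static boolean dotMatchesNewline(boolean dotAll) {
        String regex = (dotAll ? "(?s)" : "") + "a.b";
        return "a\nb".matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(caseInsensitive());
        System.out.println(findsSecondLine(false) + " " + findsSecondLine(true));
        System.out.println(dotMatchesNewline(false) + " " + dotMatchesNewline(true));
    }
}
```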
Slide 34
var fiveDigits = "\\d{5}";
var optionalFourDigitSuffix = "(-\\d{4})?";
var regex = fiveDigits + optionalFourDigitSuffix;
var pattern = Pattern.compile(regex);
var regex = """
\\d{5} # five digits
(-\\d{4})? # optional four digits
""";
var pattern = Pattern.compile(regex, Pattern.COMMENTS);
Comments
34
Which
is more readable?
When would the other
be?
Slide 35
var html = """
…
<body>Ready!</body>
""";
var body = html.replaceFirst("(?s)^.*<body>", "")
.replaceFirst("(?s)</body>.*$", "")
.strip();
System.out.println(body);
Embedding Flag
35
Ready!
Slide 36
36
So I
have to say what
I don’t want?
var regex = "(?s).*<body>(.*)</body>.*";
var body = html
.replaceFirst(regex, "$1")
.strip();
System.out.println(body);
Ready!
Slide 37
37
Huh?
var dotAllMode = "(?s)";
var anyChars = ".*";
var captureAnyChars = "(.*)";
var startBody = "<body>";
var endBody = "</body>";
var bodyPart = startBody
+ captureAnyChars + endBody;
var regex = dotAllMode + anyChars
+ bodyPart + anyChars;
var body = html.replaceFirst(regex, "$1")
.strip();
System.out.println(body);
Ready!
Slide 38
Where did the close * go?
38
var text = "* -aa- -b- *";
var pattern =
Pattern.compile("-([a-z]+)-");
var matcher = pattern.matcher(text);
var builder = new StringBuilder();
while(matcher.find())
matcher.appendReplacement(
builder, "x");
System.out.println(builder);
* x x
Slide 39
Where did the close * go?
39
var text = "* -aa- -b- *";
var pattern =
Pattern.compile("-([a-z]+)-");
var matcher = pattern.matcher(text);
var builder = new StringBuilder();
while(matcher.find())
matcher.appendReplacement(
builder, "x");
matcher.appendTail(builder);
System.out.println(builder);
* x x *
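If Java 9 or later is available, `Matcher.replaceAll` also accepts a function, which copies the tail for you, so there is no `appendTail` to forget. The sketch below is an alternative to the slide's loop, not the slide's own code, and the upper-casing replacement is made up for illustration.

```java
import java.util.Locale;
import java.util.regex.Pattern;

public class FunctionalReplace {
    // Replaces each -word- with the upper-cased word; text outside matches is kept, including the tail.
    static String upperCaseDashed(String text) {
        return Pattern.compile("-([a-z]+)-")
                .matcher(text)
                .replaceAll(match -> match.group(1).toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(upperCaseDashed("* -aa- -b- *")); // * AA B *
    }
}
```

The string returned by the function is still treated as a replacement string, so a literal `$` or backslash needs `Matcher.quoteReplacement` here as well.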
Slide 40
What does this do?
40
var text = "* -aa- -b- *";
var pattern =
Pattern.compile("-([a-z]+)-");
var matcher = pattern.matcher(text);
var builder = new StringBuilder();
while(matcher.find())
matcher.appendReplacement(
builder, "$");
System.out.println(builder);
IllegalArgumentException: Illegal group reference:
group index is missing
Slide 41
Fix
41
var text = "* -aa- -b- *";
var pattern =
Pattern.compile("-([a-z]+)-");
var matcher = pattern.matcher(text);
var builder = new StringBuilder();
while(matcher.find()) {
var replace = Matcher.quoteReplacement("$");
matcher.appendReplacement(builder, replace);
}
System.out.println(builder);
* $ $
Slide 42
Quantifier Types
42
Sample Type Description
z? Greedy Read whole string and backtrack
z?? Reluctant Look at one character at a time
z?+ Possessive Read whole string/never backtrack
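The difference between the three types is easiest to see with `+` on the same input. A minimal sketch; the input `<a><b>` and the names `QuantifierTypes` and `firstMatch` are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierTypes {
    // Returns the first match of regex in input, or null when nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String input = "<a><b>";
        // Greedy: grabs as much as possible, then backtracks to the last '>'.
        System.out.println(firstMatch("<.+>", input));  // <a><b>
        // Reluctant: grows one character at a time and stops at the first '>'.
        System.out.println(firstMatch("<.+?>", input)); // <a>
        // Possessive: grabs everything, never gives the final '>' back, so there is no match.
        System.out.println(firstMatch("<.++>", input)); // null
    }
}
```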
Looking
45
var text =
"1 fish 2 fish red fish blue fish";
var regex = "\\w+ fish(?! blue)";
var pattern = Pattern.compile(regex);
var matcher = pattern.matcher(text);
while (matcher.find())
System.out.println(matcher.group());
1 fish
2 fish
blue fish
Slide 46
Indexes
46
var text = "i am sam. I am SAM. Sam i am";
var pattern = Pattern.compile("(?i)sam");
var matcher = pattern.matcher(text);
while (matcher.find())
System.out.println(matcher.start()
+ "-" + matcher.end());
5-8
15-18
20-23
Slide 47
Debugging Tip
47
Small
regex and build
up
Online
regex checker
What’s wrong?
49
"[ab]|a"
Redundant. a is a subset of [ab]
Slide 50
What’s wrong?
50
changed = changed.
replaceAll("\\.\\.\\.", ";")
Performance. No need for a regex
changed = changed.replace("...", ";");
Slide 51
What’s wrong?
51
Pattern regex =
Pattern.compile("myRegex");
Matcher matcher =
regex.matcher("s");
Performance: the pattern is recompiled on every call because it is not a static constant
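One common shape for the fix is a `static final` pattern that is compiled once and reused. The sketch below borrows the zip code regex from the Comments slide; the class name `ZipCodeValidator` and the method `isZip` are hypothetical.

```java
import java.util.regex.Pattern;

public class ZipCodeValidator {
    // Compiled once when the class loads and reused by every call.
    private static final Pattern ZIP = Pattern.compile("\\d{5}(-\\d{4})?");

    static boolean isZip(String input) {
        // Only the lightweight Matcher is created per call.
        return ZIP.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isZip("12345"));
        System.out.println(isZip("12345-6789"));
        System.out.println(isZip("1234"));
    }
}
```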
Readability
tradeoff
What’s wrong?
53
if (dateString.matches("^(?:(?:31(\\/|-|\\.)(?:0?[13578]|
1[02]))\\1|(?:(?:29|30)(\\/|-|\\.)(?:0?[13-9]|1[0-2])\\2))
(?:(?:1[6-9]|[2-9]\\d)?\\d{2})$|^(?:29(\\/|-|\\.)0?2\\3(?:
(?:(?:1[6-9]|[2-9]\\d)?(?:0[48]|[2468][048]|[13579][26])|
(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:0?[1-9]|1\\d|
2[0-8])(\\/|-|\\.)(?:(?:0?[1-9])|(?:1[0-2]))\\4(?:
(?:1[6-9]|[2-9]\\d)?\\d{2})$")) {
handleDate(dateString);
}
Too complicated
I
draw the line way
before this :)
What’s wrong?
55
Pattern.compile("(?=a)b");
If lookahead matches next character = a, it isn’t b
Slide 56
What’s wrong?
56
Pattern.compile("[a-zA-Z]");
Assumes only English characters
Pattern.compile("\\p{IsAlphabetic}");
a-z
is clearer if that’s all
you want
Slide 57
What’s wrong?
57
String regex = request
.getParameter("regex");
String input = request
.getParameter("input");
return input.matches(regex);
Denial of service opportunity. Need to validate
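How to validate depends on the use case. If the caller only needs a plain text search, one option is to never treat the user's input as a regex at all and quote it with `Pattern.quote`. This is a sketch of that single option under that assumption, not a general answer for cases where users must supply real regexes; the names `LiteralSearch` and `containsLiteral` are made up.

```java
import java.util.regex.Pattern;

public class LiteralSearch {
    // Treats userText as plain text, so regex metacharacters lose their special meaning.
    static boolean containsLiteral(String input, String userText) {
        return Pattern.compile(Pattern.quote(userText)).matcher(input).find();
    }

    public static void main(String[] args) {
        // The nested quantifier is searched for literally instead of being executed.
        System.out.println(containsLiteral("x(a+)+y", "(a+)+"));
        System.out.println(containsLiteral("aaaa", "(a+)+"));
    }
}
```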
Slide 58
What’s wrong?
58
Pattern.compile("(a|b)*");
Backtracking can overflow stack on large strings. Vs
Pattern.compile("[ab]*");
Slide 59
What’s wrong?
59
Pattern.compile("(.|\n)*");
Have dot itself match the line breaks. Better to use:
Pattern.compile("(?s).*");
What’s wrong?
61
str.matches("\\d*?")
? is redundant here and causes backtracking
str.matches("\\d*")
Slide 62
What’s wrong?
62
Pattern.compile("a++abc");
Can’t match because ++ is possessive: it takes every “a” and never gives one back, so no “a” is left after
Pattern.compile("aa++bc");
Slide 63
Beyond English
63
Problem: "cc̈d̈d".replaceAll("[c̈d̈]", "X");
Reason/Fix: Incorrectly assumes a Unicode grapheme cluster is one code point. Fix:
"cc̈d̈d".replaceAll("c̈|d̈", "X");
Problem: Pattern.compile("söme pättern", Pattern.CASE_INSENSITIVE);
Reason/Fix: By default, case insensitive is ASCII only. Fix:
Pattern.compile("söme pättern", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
Problem: Pattern p = Pattern.compile("é|ë|è");
Reason/Fix: Could be code point or cluster. Fix:
Pattern p = Pattern.compile("é|ë|è", Pattern.CANON_EQ);
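The case-insensitivity row can be demonstrated with a short sketch. Unicode escapes stand in for ö and Ö so the example does not depend on source file encoding; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class UnicodeCaseDemo {
    // Matches the whole input case-insensitively, with or without UNICODE_CASE.
    static boolean matchesIgnoringCase(String regex, String input, boolean unicodeCase) {
        int flags = Pattern.CASE_INSENSITIVE | (unicodeCase ? Pattern.UNICODE_CASE : 0);
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String regex = "s\u00f6me"; // söme
        String input = "S\u00d6ME"; // SÖME
        // ASCII-only folding: S/s, M/m and E/e fold, but Ö/ö does not.
        System.out.println(matchesIgnoringCase(regex, input, false));
        // With UNICODE_CASE the non-ASCII letters fold too.
        System.out.println(matchesIgnoringCase(regex, input, true));
    }
}
```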
Slide 64
Other JVM Lang Sampling
Slide 65
Use Cases
65
1. Find first match
2. Find all matches
3. Replace first match
Slide 66
val text = "Mary had a little lamb"
val regex = Regex("\\b\\w{3,4} ")
print(regex.find(text)?.value)
Kotlin
66
Mary
Slide 67
val text = "Mary had a little lamb"
val regex = "\\b\\w{3,4} ".toRegex()
regex.findAll(text)
.map { it.groupValues[0] }
.forEach { print(it) }
Kotlin
67
Mary had
Slide 68
val text = "Mary had a little lamb."
val wordBoundary = "\\b"
val threeOrFourChars = "\\w{3,4}"
val space = " "
val regex = Regex(wordBoundary +
threeOrFourChars + space)
println(regex.replaceFirst(text, "_"))
Kotlin
68
_had a little lamb.
val text = "Mary had a little lamb"
val regex = """\b\w{3,4} """.r
val optional = regex findFirstIn text
println(optional.getOrElse("No Match"))
Scala
70
Mary
Slide 71
val text = "Mary had a little lamb."
val regex = """\b\w{3,4} """.r
val it = regex findAllIn text
it foreach print
Scala
71
Mary had
Slide 72
import scala.util.matching.Regex
val text = "Mary had a little lamb."
val wordBoundary = """\b"""
val threeOrFourChars = """\w{3,4}"""
val space = " "
val regex = new Regex(wordBoundary +
threeOrFourChars + space)
println(regex replaceFirstIn(text, "_"))
Scala
72
_had a little lamb.
Slide 73
def text = 'Mary had a little lamb'
def regex = /\b\w{3,4} /
def matcher = text =~ regex
print matcher[0]
Groovy
73
Mary
Slide 74
def text = 'Mary had a little lamb'
def regex = /\b\w{3,4} /
def matcher = text =~ regex
print matcher.findAll().join(' ')
Groovy
74
Mary had
Slide 75
def text = 'Mary had a little lamb.'
def wordBoundary = "\\b"
def threeOrFourChars = "\\w{3,4}"
def space = " "
def regex =
/$wordBoundary$threeOrFourChars$space/
println text.replaceFirst(regex) { it -> '_' }
println text.replaceFirst(regex, '_')
Groovy
75
_had a little lamb.
_had a little lamb.
Slide 76
(println(
re-find #"\b\w{3,4} ",
"Mary had a little lamb"))
Clojure
76
Mary
Slide 77
(println(
re-seq #"\b\w{3,4} ",
"Mary had a little lamb"))
Clojure
77
(Mary had )
Slide 78
(ns clojure.examples.example
(:gen-class))
(defn Replacer []
(def text "Mary had a little lamb.")
(def wordBoundary "\\b")
(def threeOrFourChars "\\w{3,4}")
(def space " ")
(def regex (str wordBoundary
threeOrFourChars space))
(def pat (re-pattern regex))
(println(clojure.string/replace-first
text pat "_")))
(Replacer)
Clojure
78
_had a little lamb.
Slide 79
For more reading
79