Unicode
Regular Expressions
s/ / /g
Nick Patch
23 January 2013
Unicode Refresher
Unicode attempts to support the
characters of the world — a massive task!
Unicode Refresher
It's hard to attach a single meaning to the
word “character” but most folks think of
characters as the smallest stand-alone
components of a writing system.
Unicode Refresher
In Unicode, this sense of characters is
represented by one or more code points,
which are each stored in one or more bytes.
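A small sketch of that layering, written in Python only because its strings are sequences of code points (the slides do not use Python; the variable names are illustrative). It shows the same visible character as one code point or as four, and counts the UTF-8 bytes for each form.

```python
import unicodedata

# One user-perceived character, two encodings of it.
precomposed = "\u1F82"  # GREEK SMALL LETTER ALPHA WITH PSILI AND VARIA AND YPOGEGRAMMENI
decomposed = unicodedata.normalize("NFD", precomposed)  # alpha + three combining marks

for label, s in (("precomposed", precomposed), ("decomposed", decomposed)):
    code_points = [f"U+{ord(c):04X}" for c in s]
    byte_count = len(s.encode("utf-8"))
    print(label, code_points, byte_count, "bytes in UTF-8")
```

Both strings render as the same character, yet they differ in length whether you count code points or bytes.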
Unicode Refresher
However, programmers and
programming languages tend to think of
characters as individual code points,
or worse, individual bytes.
We need to modernize our habits!
Unicode Refresher
Unicode is not just a big set of characters.
It also defines standard properties for
each character and standard algorithms
for operations such as collation,
normalization, and segmentation.
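To make two of these pieces concrete, here is a minimal sketch using Python's standard `unicodedata` module, which exposes normalization and the General Category property. Collation and segmentation are not in Python's standard library, so they are left out here.

```python
import unicodedata

composed = "\u0439"          # CYRILLIC SMALL LETTER SHORT I as one code point
decomposed = "\u0438\u0306"  # CYRILLIC SMALL LETTER I followed by COMBINING BREVE

# The two strings look identical but compare unequal until normalized.
equal_raw = composed == decomposed
equal_nfc = unicodedata.normalize("NFC", decomposed) == composed
equal_nfd = unicodedata.normalize("NFD", composed) == decomposed

# Every code point carries standard properties, such as its General Category.
letter_category = unicodedata.category(composed)  # a lowercase letter
mark_category = unicodedata.category("\u0306")    # a nonspacing mark

print(equal_raw, equal_nfc, equal_nfd, letter_category, mark_category)
```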
Grapheme Clusters
regex: /^.$/
string 1: ᾂ
⇧
string 2: α◌̓◌̀◌ͅ
⇧
1. anchor beginning of string
2. match code point (excl. \n)
Grapheme Clusters
regex: /^.$/
string 1: ᾂ
⇧⇧
string 2: α◌̓◌̀◌ͅ
1. anchor beginning of string
2. match code point (excl. \n)
3. anchor at end of string
Grapheme Clusters
regex: /^.$/
string 1: ᾂ
⇧⇧
string 2: α◌̓◌̀◌ͅ
1. anchor beginning of string
2. match code point (excl. \n)
3. anchor at end of string
4. 1 success but 1 failure — mixed results
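The walkthrough above can be reproduced in any engine whose `.` matches a single code point. This sketch uses Python's built-in `re` module as a stand-in for the regex flavor on the slides; that substitution is an assumption, but the outcome for this pattern is the same.

```python
import re
import unicodedata

pattern = re.compile(r"^.$")  # "." matches exactly one code point (excl. \n)

string1 = "\u1F82"                               # precomposed: 1 code point
string2 = unicodedata.normalize("NFD", string1)  # decomposed: 4 code points

match1 = pattern.search(string1)
match2 = pattern.search(string2)

print(bool(match1), bool(match2))
```

The first string matches and the second does not, even though a reader sees the same character in both.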
Match Any Character
two bytes (if byte mode): е..и
code point (excl. \n): е.и
code point (incl. \n): е\p{Any}и
grapheme cluster (incl. \n): е\Xи
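A sketch of the first three rows in Python's `re` module, used here as an assumed stand-in for the slides' regex flavor. Python has no `\p{Any}` or `\X`, so the `DOTALL` flag plays the role of "code point incl. \n" and the grapheme cluster row is omitted.

```python
import re

text = "\u0435\u043B\u0438"  # Cyrillic е, л, и

# Byte mode: the middle letter is two bytes in UTF-8, so it takes two dots.
byte_pattern = "\u0435..\u0438".encode("utf-8")
byte_match = re.search(byte_pattern, text.encode("utf-8"))

# Code point mode: one dot covers the whole middle letter.
code_point_match = re.search("\u0435.\u0438", text)

# "." excludes a newline unless told otherwise.
with_newline = "\u0435\n\u0438"
dot_match = re.search("\u0435.\u0438", with_newline)
dotall_match = re.search("\u0435.\u0438", with_newline, re.DOTALL)

print(bool(byte_match), bool(code_point_match), bool(dot_match), bool(dotall_match))
```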
Match Any Letter
letter code point: е\p{General_Category=Letter}и
letter code point: е\pLи
Cyrillic code point: е\p{Script=Cyrillic}и
Cyrillic code point: е\p{Cyrillic}и
letter grapheme cluster: е(?=\pL)\Xи
regex: / о \p{Cyrillic} т /x
string 1: който
string 2: кои◌̆то
regex: / о \p{Cyrillic} т /x
string 1: който
string 2: кои◌̆то
1. match letter о
regex: / о \p{Cyrillic} т /x
string 1: който
string 2: кои◌̆то
1. match letter о
2. match Cyrillic letter (1 code point)
regex: / о \p{Cyrillic} т /x
string 1: който
string 2: кои◌̆то
1. match letter о
2. match Cyrillic letter (1 code point)
3. match letter т
regex: / о \p{Cyrillic} т /x
string 1: който
string 2: кои◌̆то
1. match letter о
2. match Cyrillic letter (1 code point)
3. match letter т
4. 1 success but 1 failure — mixed results
regex: / о (?= \p{Cyrillic} ) \X т /x
string 1: който
string 2: кои◌̆то
regex: / о (?= \p{Cyrillic} ) \X т /x
string 1: който
string 2: кои◌̆то
1. match letter о
regex: / о (?= \p{Cyrillic} ) \X т /x
string 1: който
⇧
string 2: кои◌̆то
⇧
1. match letter о
2. positive lookahead Cyrillic letter (1 code point)
regex: / о (?= \p{Cyrillic} ) \X т /x
string 1: който
⇧
string 2: кои◌̆то
⇧
1. match letter о
2. positive lookahead Cyrillic letter (1 code point)
3. match grapheme cluster (1+ code points)
regex: / о (?= \p{Cyrillic} ) \X т /x
string 1: който
⇧
string 2: кои◌̆то
⇧
1. match letter о
2. positive lookahead Cyrillic letter (1 code point)
3. match grapheme cluster (1+ code points)
4. match letter т
regex: / о (?= \p{Cyrillic} ) \X т /x
string 1: който
⇧
string 2: кои◌̆то
⇧
1. match letter о
2. positive lookahead Cyrillic letter (1 code point)
3. match grapheme cluster (1+ code points)
4. match letter т
5. success!
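Both walkthroughs can be approximated with Python's built-in `re` module, which lacks `\p{Cyrillic}` and `\X`. The two stand-ins below are assumptions made for this sketch and are much narrower than the real constructs: `CYRILLIC` covers only the basic Cyrillic block, and `CLUSTER` treats a grapheme cluster as one code point plus any combining diacritical marks from U+0300 to U+036F.

```python
import re
import unicodedata

CYRILLIC = "[\u0400-\u04FF]"       # rough stand-in for \p{Cyrillic}
CLUSTER = "(?:.[\u0300-\u036F]*)"  # rough stand-in for \X

# о, one Cyrillic code point, т
code_point_re = re.compile("\u043E" + CYRILLIC + "\u0442")
# о, lookahead for a Cyrillic code point, one (approximate) grapheme cluster, т
cluster_re = re.compile("\u043E(?=" + CYRILLIC + ")" + CLUSTER + "\u0442")

string1 = "\u043A\u043E\u0439\u0442\u043E"       # който with precomposed й
string2 = unicodedata.normalize("NFD", string1)  # й becomes и + combining breve

code_point_results = [bool(code_point_re.search(s)) for s in (string1, string2)]
cluster_results = [bool(cluster_re.search(s)) for s in (string1, string2)]

print(code_point_results, cluster_results)
```

The code point pattern matches only the precomposed string, while the lookahead plus cluster pattern matches both, which mirrors the mixed result and the success on the slides.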
Character Literals
[ ي ی ]
(?: ي | ی )
[\x{064A}\x{06CC}]
[\N{ARABIC LETTER YEH}
\N{ARABIC LETTER FARSI YEH}]
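A sketch checking that these spellings are interchangeable, using Python's `re` module as an assumed stand-in for the slides' regex flavor. Python writes the hex form as `\u064A` rather than `\x{064A}`, and named escapes inside a pattern need Python 3.8 or later.

```python
import re

by_literal = re.compile("[\u064A\u06CC]")          # the characters themselves
by_alternation = re.compile("(?:\u064A|\u06CC)")   # alternation instead of a class
by_hex = re.compile(r"[\u064A\u06CC]")             # escapes resolved by the regex engine
by_name = re.compile(r"[\N{ARABIC LETTER YEH}\N{ARABIC LETTER FARSI YEH}]")

sample = "\u064A\u06CC\u0628"  # Arabic yeh, Farsi yeh, beh

results = [p.findall(sample) for p in (by_literal, by_alternation, by_hex, by_name)]
print(all(r == results[0] for r in results), len(results[0]))
```

The named form is the longest to type but the easiest to review, since the two yeh letters look alike in many fonts.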
Properties
\p{Script=Latin}
Name: Script
Value: Latin
Match any code point with the
value “Latin” for the Script property.
Properties
\P{Script=Latin}
Name: Script
Value: not Latin
Negated form:
Match any code point without the
value “Latin” for the Script property.
Properties
\p{Latin}
Name: Script (implicit)
Value: Latin
The Script and General Category
properties don't require the name
because they're so common and
their values don't conflict.
Properties
\p{General_Category=Letter}
Name: General Category
Value: Letter
Match any code point with the value
“Letter” for the General Category property.
Properties
\p{gc=Letter}
Name: General Category (gc)
Value: Letter
Property names may be abbreviated.
Properties
\p{gc=L}
Name: General Category (gc)
Value: Letter (L)
The General Category property is
so commonly used that its values
all have standard abbreviations.
Properties
\p{L}
Name: General Category (implicit)
Value: Letter (L)
And the General Category values may even
be used on their own, like the Script values.
These two properties have distinct values.
Properties
\pL
Name: General Category (implicit)
Value: Letter (L)
Single-character General Category
values don't require curly braces.
Properties
\PL
Name: General Category (implicit)
Value: not Letter (L)
Don't forget negation!
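Python's built-in `re` module does not support `\p{...}`, so this sketch emulates the General Category test with `unicodedata.category`. The helper names `is_letter` and `is_not_letter` are hypothetical, chosen to mirror `\pL` and `\PL`.

```python
import unicodedata

def is_letter(ch):
    # Analogue of \pL: any General Category value starting with "L"
    # (Lu, Ll, Lt, Lm, Lo).
    return unicodedata.category(ch).startswith("L")

def is_not_letter(ch):
    # Analogue of \PL: the negated form.
    return not is_letter(ch)

# Latin a, Cyrillic zhe, Arabic-Indic digit one, space, combining breve
sample = "a\u0436\u0661 \u0306"

letters = [c for c in sample if is_letter(c)]
non_letters = [c for c in sample if is_not_letter(c)]
categories = [unicodedata.category(c) for c in sample]

print(categories, len(letters), len(non_letters))
```

The two-letter categories returned by Python are the same standard abbreviations the slides mention, so `Ll` and `Lu` both fall under the one-letter value `L`.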