Advanced Unicode Regular Expressions

Unicode Regular Expressions s/ / /g Nick Patch 23 January 2013

Unicode Refresher Unicode attempts to support the characters of the
world — a massive task!

Unicode Refresher It's hard to attach a single meaning to
the word “character” but most folks think of characters as the smallest stand-alone components of a writing system.

Unicode Refresher In Unicode, this sense of characters is represented
by one or more code points, which are each stored in one or more bytes.

Unicode Refresher However, programmers and programming languages tend to think
of characters as individual code points, or worse, individual bytes. We need to modernize our habits!
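To make the distinction concrete, here is a minimal JavaScript sketch that counts one string four different ways. It assumes a runtime that provides `TextEncoder` and `Intl.Segmenter` (for example a recent Node.js), and the variable names are illustrative only.

```javascript
// One user-perceived character: α followed by three combining marks.
const str = '\u03B1\u0313\u0300\u0345';

// UTF-8 bytes: each of these four code points is encoded in two bytes.
const byteCount = new TextEncoder().encode(str).length;

// UTF-16 code units: this is what String.prototype.length reports.
const codeUnitCount = str.length;

// Code points: string iteration walks code points, not code units.
const codePointCount = [...str].length;

// Grapheme clusters: Intl.Segmenter applies Unicode segmentation rules.
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const graphemeCount = [...segmenter.segment(str)].length;

console.log({ byteCount, codeUnitCount, codePointCount, graphemeCount });
```

The same text is 8 bytes, 4 code units, 4 code points, and 1 grapheme cluster, so the answer to "how many characters?" depends entirely on which layer is being counted.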

Unicode Refresher Unicode is not just a big set of
characters. It also defines standard properties for each character and standard algorithms for operations such as collation, normalization, and segmentation.

Normalization
NFD(ᾀ◌̀) = α◌̓◌̀◌ͅ
NFC(ᾀ◌̀) = ᾂ

Normalization
NFD(Чю◌́рлёнис) = Чю◌́рле◌̈нис
NFC(Чю◌́рлёнис) = Чю◌́рлёнис

Normalization
ᾂ ≡ ἂ◌ͅ ≡ ᾀ◌̀ ≡ ᾳ◌̓◌̀ ≡ α◌̓◌̀◌ͅ ≡ α◌̓◌ͅ◌̀ ≡ α◌ͅ◌̓◌̀
≠
ᾲ◌̓ ≡ ὰ◌̓◌ͅ ≡ ὰ◌ͅ◌̓ ≡ ᾳ◌̀◌̓ ≡ α◌̀◌̓◌ͅ ≡ α◌̀◌ͅ◌̓ ≡ α◌ͅ◌̀◌̓
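The equivalences on this slide can be checked mechanically. The following is a sketch using the built-in `String.prototype.normalize`; the helper `canonicallyEqual` and the two arrays are illustrative names, and only a sample of the sequences from each group is listed.

```javascript
// Two strings are canonically equivalent when their NFD forms are identical.
const canonicallyEqual = (a, b) => a.normalize('NFD') === b.normalize('NFD');

// Sample sequences from the first group (comma above before grave).
const groupA = [
  '\u1F82',                   // ᾂ
  '\u1F02\u0345',             // ἂ + ypogegrammeni
  '\u1F80\u0300',             // ᾀ + grave
  '\u1FB3\u0313\u0300',       // ᾳ + comma above + grave
  '\u03B1\u0313\u0300\u0345', // α + comma above + grave + ypogegrammeni
  '\u03B1\u0345\u0313\u0300', // α + ypogegrammeni + comma above + grave
];

// Sample sequences from the second group (grave before comma above).
const groupB = [
  '\u1FB2\u0313',             // ᾲ + comma above
  '\u1F70\u0313\u0345',       // ὰ + comma above + ypogegrammeni
  '\u03B1\u0300\u0313\u0345', // α + grave + comma above + ypogegrammeni
];

const allEqualA = groupA.every((s) => canonicallyEqual(s, groupA[0]));
const allEqualB = groupB.every((s) => canonicallyEqual(s, groupB[0]));
const crossEqual = canonicallyEqual(groupA[0], groupB[0]);
const nfcIsSingleCodePoint = '\u1F80\u0300'.normalize('NFC') === '\u1F82';

console.log({ allEqualA, allEqualB, crossEqual, nfcIsSingleCodePoint });
```

Each group is internally equivalent, but the groups differ from each other: the comma above and the grave share a combining class, so normalization never reorders them relative to one another, while the ypogegrammeni is free to move past both.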

Perl Normalization
    use feature 'say';       # needed for say
    use Unicode::Normalize;
    say $str;       # ᾀ◌̀
    say NFD($str);  # α◌̓◌̀◌ͅ
    say NFC($str);  # ᾂ

JavaScript Normalization
    var unorm = require('unorm');
    console.log($str);             // ᾀ◌̀
    console.log(unorm.nfd($str));  // α◌̓◌̀◌ͅ
    console.log(unorm.nfc($str));  // ᾂ

PHP Normalization
    echo $str;                                            # ᾀ◌̀
    echo Normalizer::normalize($str, Normalizer::FORM_D); # α◌̓◌̀◌ͅ
    echo Normalizer::normalize($str, Normalizer::FORM_C); # ᾂ

Grapheme Clusters
regex: /^.$/
string 1: ᾂ
string 2: α◌̓◌̀◌ͅ
1. anchor at beginning of string
2. match code point (excl. \n)
3. anchor at end of string
4. 1 success but 1 failure: string 1 is a single code point, but in string 2 the dot consumes only α and three combining marks remain before the end of the string. Mixed results.

Grapheme Clusters
regex: /^\X$/
string 1: ᾂ
string 2: α◌̓◌̀◌ͅ
1. anchor at beginning of string
2. match grapheme cluster
3. anchor at end of string
4. success! Both strings are a single grapheme cluster, however many code points they contain.

Perl
    use v5.12;  # better yet: v5.14
    use utf8;
    use charnames qw( :full );  # unless v5.16
    use open qw( :encoding(UTF-8) :std );
    $str =~ /^\X$/;
    $str =~ s/^(\X)$/->$1<-/;

PHP
    preg_match('/^\X$/u', $str);
    preg_replace('/^(\X)$/u', '->$1<-', $str);

JavaScript [This slide intentionally left blank.]
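JavaScript regular expressions have no \X. One possible workaround, sketched here on the assumption that `Intl.Segmenter` is available in the runtime, is to segment the string into grapheme clusters outside the regex. The helper name `isSingleGrapheme` is hypothetical, and it only approximates what /^\X$/ checks in Perl or PCRE.

```javascript
// Hypothetical helper: true when the whole string is exactly one grapheme cluster.
function isSingleGrapheme(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return [...segmenter.segment(str)].length === 1;
}

const composed = '\u1F82';                     // ᾂ as one code point
const decomposed = '\u03B1\u0313\u0300\u0345'; // α followed by three combining marks

// /^.$/u matches exactly one code point, so only the composed form passes.
const dotComposed = /^.$/u.test(composed);
const dotDecomposed = /^.$/u.test(decomposed);

// Segmenting by grapheme cluster treats both forms the same way.
const clusterComposed = isSingleGrapheme(composed);
const clusterDecomposed = isSingleGrapheme(decomposed);

console.log({ dotComposed, dotDecomposed, clusterComposed, clusterDecomposed });
```

This reproduces the mixed result of the /^.$/ walk-through and the uniform result of the /^\X$/ walk-through, at the cost of leaving the regex engine for the cluster check.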

Match Any Character
two bytes (if byte mode): е..и
code point (excl. \n): е.и
code point (incl. \n): е\p{Any}и
grapheme cluster (incl. \n): е\Xи
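For comparison, here is a sketch of the code-point rows in JavaScript, whose regex engine works on UTF-16 code units unless the `u` flag is set. The sample strings and constant names are illustrative. The Cyrillic letters are written as escapes (U+0435 for е, U+0438 for и) so they cannot be confused with Latin look-alikes.

```javascript
const withNewline = '\u0435\n\u0438';      // е, line feed, и
const withEmoji = '\u0435\u{1F600}\u0438'; // е, an astral code point, и

// Without the u flag, . matches one UTF-16 code unit; an astral code point needs two.
const codeUnitDot = /^\u0435.\u0438$/.test(withEmoji);

// With the u flag, . matches one code point but still excludes line terminators.
const codePointDot = /^\u0435.\u0438$/u.test(withEmoji);
const dotOnNewline = /^\u0435.\u0438$/u.test(withNewline);

// \p{Any}, or . combined with the s flag, also matches line terminators.
const anyOnNewline = /^\u0435\p{Any}\u0438$/u.test(withNewline);
const dotAllOnNewline = /^\u0435.\u0438$/su.test(withNewline);

console.log({ codeUnitDot, codePointDot, dotOnNewline, anyOnNewline, dotAllOnNewline });
```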

Match Any Letter
letter code point: е\p{General_Category=Letter}и
letter code point: е\pLи
Cyrillic code point: е\p{Script=Cyrillic}и
Cyrillic code point: е\p{Cyrillic}и
letter grapheme cluster: е(?=\pL)\Xи
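The code-point rows can be tried in JavaScript as well, with the caveat that its `u`-mode syntax is stricter than Perl's: braces are required, and script values need an explicit `Script=` (or `sc=`) name. This is a sketch with illustrative sample strings, again using escapes for е (U+0435) and и (U+0438).

```javascript
const anyLetter = /^\u0435\p{General_Category=Letter}\u0438$/u; // long form
const anyLetterShort = /^\u0435\p{L}\u0438$/u;                   // abbreviated value
const cyrillicOnly = /^\u0435\p{Script=Cyrillic}\u0438$/u;

const cyrillicMiddle = '\u0435\u0436\u0438'; // е ж и
const latinMiddle = '\u0435z\u0438';         // е z и
const digitMiddle = '\u04357\u0438';         // е 7 и

for (const sample of [cyrillicMiddle, latinMiddle, digitMiddle]) {
  console.log(
    JSON.stringify(sample),
    anyLetter.test(sample),
    anyLetterShort.test(sample),
    cyrillicOnly.test(sample),
  );
}
```

The Latin z counts as a letter but not as Cyrillic, and the digit is neither, which is the difference between the General Category rows and the Script rows.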

regex: / о \p{Cyrillic} т /x
string 1: който
string 2: кои◌̆то
1. match letter о
2. match Cyrillic letter (1 code point)
3. match letter т
4. 1 success but 1 failure: in string 2 the code point after и is the combining breve, not т. Mixed results.

regex: / о (?= \p{Cyrillic} ) \X т /x
string 1: който
string 2: кои◌̆то
1. match letter о
2. positive lookahead Cyrillic letter (1 code point)
3. match grapheme cluster (1+ code points)
4. match letter т
5. success!
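Engines without \X can approximate this walk-through. The sketch below is in JavaScript and swaps in a different technique: instead of a lookahead plus \X, it matches a Cyrillic base code point followed by any number of combining marks (`\p{M}*`). That is a simplification of real grapheme cluster rules, and the constant names are illustrative.

```javascript
const composed = '\u043A\u043E\u0439\u0442\u043E';         // който, й as one code point
const decomposed = '\u043A\u043E\u0438\u0306\u0442\u043E'; // кои◌̆то, и + combining breve

// One Cyrillic code point between о (U+043E) and т (U+0442).
const oneCodePoint = /\u043E\p{Script=Cyrillic}\u0442/u;

// Approximation of (?=\p{Cyrillic})\X: a Cyrillic base plus trailing combining marks.
const baseWithMarks = /\u043E\p{Script=Cyrillic}\p{M}*\u0442/u;

const results = {
  oneCodePointComposed: oneCodePoint.test(composed),
  oneCodePointDecomposed: oneCodePoint.test(decomposed),
  baseWithMarksComposed: baseWithMarks.test(composed),
  baseWithMarksDecomposed: baseWithMarks.test(decomposed),
  // Normalizing to NFC first is another option when a composed form exists.
  nfcMakesThemEqual: decomposed.normalize('NFC') === composed,
};

console.log(results);
```

Normalizing first only helps when a precomposed code point exists, as it does for й; matching the base plus its marks also covers sequences that have no composed form.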

Character Literals
[ ي ی ]
(?: ي | ی )
[\x{064A}\x{06CC}]
[\N{ARABIC LETTER YEH} \N{ARABIC LETTER FARSI YEH}]
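The same progression from literal glyphs to unambiguous escapes can be sketched in JavaScript, which has code point escapes but no \N{...} named escapes. Named constants, an assumption of this example rather than a language feature, stand in for the character names.

```javascript
// Named constants play the role of \N{...}; the names are illustrative.
const ARABIC_LETTER_YEH = '\u064A';
const ARABIC_LETTER_FARSI_YEH = '\u06CC';

// Code point escapes in a character class (the \u{...} form requires the u flag).
const eitherYeh = /[\u{064A}\u{06CC}]/u;

// The same class built from the named constants.
const eitherYehNamed = new RegExp(
  `[${ARABIC_LETTER_YEH}${ARABIC_LETTER_FARSI_YEH}]`,
  'u',
);

console.log(eitherYeh.test(ARABIC_LETTER_YEH), eitherYeh.test(ARABIC_LETTER_FARSI_YEH));
console.log(eitherYehNamed.test('\u064A'), eitherYehNamed.test('\u06CC'));
```

Escapes and names survive editors, fonts, and bidirectional rendering that make the two look-alike letters hard to tell apart in source code.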

Properties \p{Script=Latin} Name: Script Value: Latin Match any code point
with the value “Latin” for the Script property.

Properties \P{Script=Latin} Name: Script Value: not Latin Negated form: Match
any code point without the value “Latin” for the Script property.

Properties \p{Latin} Name: Script (implicit) Value: Latin The Script and
General Category properties don't require the name because they're so common and their values don't conflict.

Properties \p{General_Category=Letter} Name: General Category Value: Letter Match any code
point with the value “Letter” for the General Category property.

Properties \p{gc=Letter} Name: General Category (gc) Value: Letter Property names
may be abbreviated.

Properties \p{gc=L} Name: General Category (gc) Value: Letter (L) The
General Category property is so commonly used that its values all have standard abbreviations.

Properties \p{L} Name: General Category (implicit) Value: Letter (L) And
the General Category values may even be used on their own, like the Script values. These two properties have distinct values.

Properties \pL Name: General Category (implicit) Value: Letter (L) Single-character
General Category values don't require curly braces.

Properties \PL Name: General Category (implicit) Value: not Letter (L)
Don't forget negation!
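Not every engine accepts every form shown on these slides. As a sketch, the following JavaScript checks which property forms compile under the `u` flag. The helper `compilesWithUnicodeFlag` is a hypothetical name introduced for this example.

```javascript
// Hypothetical helper: does this pattern source compile with the u flag?
function compilesWithUnicodeFlag(source) {
  try {
    new RegExp(source, 'u');
    return true;
  } catch (err) {
    return false; // a SyntaxError means the form is not accepted
  }
}

// Forms with an explicit name, abbreviated names, and bare General Category values.
const accepted = [
  '\\p{Script=Latin}', '\\p{sc=Latin}', '\\P{Script=Latin}',
  '\\p{General_Category=Letter}', '\\p{gc=L}', '\\p{L}', '\\P{L}',
];

// Bare script values and brace-less forms are Perl conveniences.
const rejected = ['\\p{Latin}', '\\pL', '\\PL'];

const allAcceptedCompile = accepted.every(compilesWithUnicodeFlag);
const anyRejectedCompiles = rejected.some(compilesWithUnicodeFlag);

console.log({ allAcceptedCompile, anyRejectedCompiles });
console.log(/\p{Script=Latin}/u.test('a'), /\P{Script=Latin}/u.test('a'));
```

In JavaScript the implicit-name shortcut applies only to General Category values, so `\p{L}` works while `\p{Latin}` must be spelled `\p{Script=Latin}`.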

s/ / /g
