Advanced Unicode Regular Expressions

Slide 1

Slide 1 text

Unicode Regular Expressions s/ / /g Nick Patch 23 January 2013

Slide 2

Slide 2 text

Unicode Refresher Unicode attempts to support the characters of the world — a massive task!

Slide 3

Slide 3 text

Unicode Refresher It's hard to attach a single meaning to the word “character” but most folks think of characters as the smallest stand-alone components of a writing system.

Slide 4

Slide 4 text

Unicode Refresher In Unicode, this sense of characters is represented by one or more code points, which are each stored in one or more bytes.
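To make the three layers concrete, here is a minimal JavaScript sketch that counts code points and UTF-8 bytes for one user-perceived character written two ways. The helper names `codePoints` and `utf8Bytes` are illustrative only, and it assumes a runtime where `TextEncoder` is available as a global (recent Node.js or a browser).

```javascript
// One user-perceived character, written two ways.
const composed = "\u1F82";                     // ᾂ as a single code point
const decomposed = "\u03B1\u0313\u0300\u0345"; // α + psili + grave + ypogegrammeni

// Hypothetical helpers for this sketch.
const codePoints = (s) => [...s].length;                     // string iteration is per code point
const utf8Bytes = (s) => new TextEncoder().encode(s).length; // TextEncoder always emits UTF-8

console.log(codePoints(composed), utf8Bytes(composed));     // 1 3
console.log(codePoints(decomposed), utf8Bytes(decomposed)); // 4 8
```

Both strings render as the same character, yet they differ at the code point level and at the byte level.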

Slide 5

Slide 5 text

Unicode Refresher However, programmers and programming languages tend to think of characters as individual code points, or worse, individual bytes. We need to modernize our habits!

Slide 6

Slide 6 text

Unicode Refresher Unicode is not just a big set of characters. It also defines standard properties for each character and standard algorithms for operations such as collation, normalization, and segmentation.

Slide 7

Slide 7 text

Normalization NFD(ᾀ◌̀) = α◌̓◌̀◌ͅ NFC(ᾀ◌̀) = ᾂ

Slide 8

Slide 8 text

Normalization NFD(Чю◌́рлёнис) = Чю◌́рле◌̈нис NFC(Чю◌́рлёнис) = Чю◌́рлёнис

Slide 9

Slide 9 text

Normalization
ᾂ ≡ ἂ◌ͅ ≡ ᾀ◌̀ ≡ ᾳ◌̓◌̀ ≡ α◌̓◌̀◌ͅ ≡ α◌̓◌ͅ◌̀ ≡ α◌ͅ◌̓◌̀
≠ ᾲ◌̓ ≡ ὰ◌̓◌ͅ ≡ ὰ◌ͅ◌̓ ≡ ᾳ◌̀◌̓ ≡ α◌̀◌̓◌ͅ ≡ α◌̀◌ͅ◌̓ ≡ α◌ͅ◌̀◌̓
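The equivalences above can be checked mechanically: two strings are canonically equivalent when their normalized forms are identical. The sketch below uses the built-in `String.prototype.normalize` (ES2015, so newer than this talk); `canonicallyEqual` is a hypothetical helper name.

```javascript
// Hypothetical helper: canonical equivalence via NFC comparison.
const canonicallyEqual = (a, b) => a.normalize("NFC") === b.normalize("NFC");

const a = "\u1F82";                   // ᾂ
const b = "\u1F80\u0300";             // ᾀ + grave
const c = "\u03B1\u0345\u0313\u0300"; // α + ypogegrammeni + psili + grave
const d = "\u1FB2\u0313";             // ᾲ + psili (grave comes before psili)

console.log(canonicallyEqual(a, b)); // true
console.log(canonicallyEqual(a, c)); // true
console.log(canonicallyEqual(a, d)); // false
```

The psili and the grave share a combining class, so their relative order is significant; the ypogegrammeni has a different class and may appear anywhere among them.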

Slide 10

Slide 10 text

Perl Normalization
use Unicode::Normalize;
say $str;      # ᾀ◌̀
say NFD($str); # α◌̓◌̀◌ͅ
say NFC($str); # ᾂ

Slide 11

Slide 11 text

JavaScript Normalization
var unorm = require('unorm');
console.log($str);            // ᾀ◌̀
console.log(unorm.nfd($str)); // α◌̓◌̀◌ͅ
console.log(unorm.nfc($str)); // ᾂ

Slide 12

Slide 12 text

PHP Normalization
echo $str;                                          # ᾀ◌̀
echo Normalizer::normalize($str, Normalizer::FORM_D); # α◌̓◌̀◌ͅ
echo Normalizer::normalize($str, Normalizer::FORM_C); # ᾂ

Slide 13

Slide 13 text

Grapheme Clusters regex: /^.$/ string 1: ᾂ string 2: α◌̓◌̀◌ͅ

Slide 14

Slide 14 text

Grapheme Clusters regex: /^.$/ string 1: ᾂ ⇧ string 2: α◌̓◌̀◌ͅ ⇧ 1. anchor beginning of string

Slide 15

Slide 15 text

Grapheme Clusters regex: /^.$/ string 1: ᾂ ⇧ string 2: α◌̓◌̀◌ͅ ⇧ 1. anchor beginning of string 2. match code point (excl. \n)

Slide 16

Slide 16 text

Grapheme Clusters regex: /^.$/ string 1: ᾂ ⇧⇧ string 2: α◌̓◌̀◌ͅ 1. anchor beginning of string 2. match code point (excl. \n) 3. anchor at end of string

Slide 17

Slide 17 text

Grapheme Clusters regex: /^.$/ string 1: ᾂ ⇧⇧ string 2: α◌̓◌̀◌ͅ 1. anchor beginning of string 2. match code point (excl. \n) 3. anchor at end of string 4. 1 success but 1 failure — mixed results
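The mixed result is easy to reproduce. A minimal JavaScript sketch, assuming the `u` flag so that `.` matches a whole code point rather than a UTF-16 code unit:

```javascript
const dot = /^.$/u; // exactly one code point (excluding line terminators)

const string1 = "\u1F82";                   // ᾂ, one code point
const string2 = "\u03B1\u0313\u0300\u0345"; // same character, four code points

console.log(dot.test(string1)); // true
console.log(dot.test(string2)); // false
```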

Slide 18

Slide 18 text

Grapheme Clusters regex: /^\X$/ string 1: ᾂ string 2: α◌̓◌̀◌ͅ

Slide 19

Slide 19 text

Grapheme Clusters regex: /^\X$/ string 1: ᾂ ⇧ string 2: α◌̓◌̀◌ͅ ⇧ 1. anchor beginning of string

Slide 20

Slide 20 text

Grapheme Clusters regex: /^\X$/ string 1: ᾂ ⇧ string 2: α◌̓◌̀◌ͅ ⇧ 1. anchor beginning of string 2. match grapheme cluster

Slide 21

Slide 21 text

Grapheme Clusters regex: /^\X$/ string 1: ᾂ ⇧⇧ string 2: α◌̓◌̀◌ͅ ⇧ ⇧ 1. anchor beginning of string 2. match grapheme cluster 3. anchor at end of string

Slide 22

Slide 22 text

Grapheme Clusters regex: /^\X$/ string 1: ᾂ ⇧⇧ string 2: α◌̓◌̀◌ͅ ⇧ ⇧ 1. anchor beginning of string 2. match grapheme cluster 3. anchor at end of string 4. success!

Slide 23

Slide 23 text

Perl
use v5.12;                            # better yet: v5.14
use utf8;
use charnames qw( :full );            # unless v5.16
use open qw( :encoding(UTF-8) :std );
$str =~ /^\X$/;
$str =~ s/^(\X)$/->$1<-/;

Slide 24

Slide 24 text

PHP
preg_match('/^\X$/u', $str);
preg_replace('/^(\X)$/u', '->$1<-', $str);

Slide 25

Slide 25 text

JavaScript [This slide intentionally left blank.]
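JavaScript regular expressions still have no `\X`, but runtimes that ship `Intl.Segmenter` (added years after this talk) can split a string into grapheme clusters outside the regex engine. The sketch below is one possible workaround, not a regex feature; `graphemes` and `isSingleGrapheme` are hypothetical helper names, and it assumes a runtime with `Intl.Segmenter` support.

```javascript
// Grapheme segmentation without \X, using Intl.Segmenter.
const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

// Hypothetical helpers for this sketch.
const graphemes = (s) => Array.from(segmenter.segment(s), (seg) => seg.segment);
const isSingleGrapheme = (s) => graphemes(s).length === 1; // rough stand-in for /^\X$/

console.log(isSingleGrapheme("\u1F82"));                   // true
console.log(isSingleGrapheme("\u03B1\u0313\u0300\u0345")); // true
console.log(isSingleGrapheme("ab"));                       // false
```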

Slide 26

Slide 26 text

Match Any Character
two bytes (if byte mode): е..и
code point (excl. \n): е.и
code point (incl. \n): е\p{Any}и
grapheme cluster (incl. \n): е\Xи
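The difference between `.` and `\p{Any}` shows up as soon as the middle character is a newline. A small JavaScript sketch, assuming an engine with Unicode property escapes (ES2018 or later):

```javascript
const dotBetween = /е.и/u;        // any code point except line terminators
const anyBetween = /е\p{Any}и/u;  // any code point at all

console.log(dotBetween.test("ежи"));   // true
console.log(dotBetween.test("е\nи"));  // false
console.log(anyBetween.test("е\nи"));  // true
```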

Slide 27

Slide 27 text

Match Any Letter
letter code point: е\p{General_Category=Letter}и
letter code point: е\pLи
Cyrillic code point: е\p{Script=Cyrillic}и
Cyrillic code point: е\p{Cyrillic}и
letter grapheme cluster: е(?=\pL)\Xи

Slide 28

Slide 28 text

regex: / о \p{Cyrillic} т /x string 1: който string 2: кои◌̆то

Slide 29

Slide 29 text

regex: / о \p{Cyrillic} т /x string 1: който string 2: кои◌̆то 1. match letter о

Slide 30

Slide 30 text

regex: / о \p{Cyrillic} т /x string 1: който string 2: кои◌̆то 1. match letter о 2. match Cyrillic letter (1 code point)

Slide 31

Slide 31 text

regex: / о \p{Cyrillic} т /x string 1: който string 2: кои◌̆то 1. match letter о 2. match Cyrillic letter (1 code point) 3. match letter т

Slide 32

Slide 32 text

regex: / о \p{Cyrillic} т /x string 1: който string 2: кои◌̆то 1. match letter о 2. match Cyrillic letter (1 code point) 3. match letter т 4. 1 success but 1 failure — mixed results

Slide 33

Slide 33 text

regex: / о (?= \p{Cyrillic} ) \X т /x string 1: който string 2: кои◌̆то

Slide 34

Slide 34 text

regex: / о (?= \p{Cyrillic} ) \X т /x string 1: който string 2: кои◌̆то 1. match letter о

Slide 35

Slide 35 text

regex: / о (?= \p{Cyrillic} ) \X т /x string 1: който ⇧ string 2: кои◌̆то ⇧ 1. match letter о 2. positive lookahead Cyrillic letter (1 code point)

Slide 36

Slide 36 text

regex: / о (?= \p{Cyrillic} ) \X т /x string 1: който ⇧ string 2: кои◌̆то ⇧ 1. match letter о 2. positive lookahead Cyrillic letter (1 code point) 3. match grapheme cluster (1+ code points)

Slide 37

Slide 37 text

regex: / о (?= \p{Cyrillic} ) \X т /x string 1: който ⇧ string 2: кои◌̆то ⇧ 1. match letter о 2. positive lookahead Cyrillic letter (1 code point) 3. match grapheme cluster (1+ code points) 4. match letter т

Slide 38

Slide 38 text

regex: / о (?= \p{Cyrillic} ) \X т /x string 1: който ⇧ string 2: кои◌̆то ⇧ 1. match letter о 2. positive lookahead Cyrillic letter (1 code point) 3. match grapheme cluster (1+ code points) 4. match letter т 5. success!
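The same walkthrough can be approximated in JavaScript. Because JavaScript has no `\X`, the sketch below stands in "one code point followed by any combining marks" (`.\p{M}*`) for the grapheme cluster. That is a simplification of the real segmentation rules, assumed here only because it is sufficient for these two strings.

```javascript
const precomposed = "който";       // й as a single code point (U+0439)
const combining = "кои\u0306то";   // и followed by a combining breve (U+0306)

// Code point version: mixed results.
const perCodePoint = /о\p{Script=Cyrillic}т/u;

// Lookahead plus an approximate grapheme cluster: both strings match.
const perCluster = /о(?=\p{Script=Cyrillic}).\p{M}*т/u;

console.log(perCodePoint.test(precomposed), perCodePoint.test(combining)); // true false
console.log(perCluster.test(precomposed), perCluster.test(combining));     // true true
```

Normalizing both inputs to NFC before matching would also make the code point version succeed here, but only because a precomposed й happens to exist.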

Slide 39

Slide 39 text

Character Literals [ ي ی ] (?: ي | ی )

Slide 40

Slide 40 text

Character Literals [ ي ی ] (?: ي | ی )

Slide 41

Slide 41 text

Character Literals [ ي ی ] (?: ي | ی ) [\x{064A}\x{06CC}]

Slide 42

Slide 42 text

Character Literals [ ي ی ] (?: ي | ی ) [\x{064A}\x{06CC}] [\N{ARABIC LETTER YEH} \N{ARABIC LETTER FARSI YEH}]
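Escapes keep look-alike letters distinguishable in source code. JavaScript has no `\N{...}` named escapes, so this sketch uses code point escapes only and puts the character names in comments; the variable name `yeh` is illustrative.

```javascript
// U+064A ARABIC LETTER YEH, U+06CC ARABIC LETTER FARSI YEH
const yeh = /^[\u{064A}\u{06CC}]$/u;

console.log(yeh.test("\u064A")); // true  (Arabic yeh)
console.log(yeh.test("\u06CC")); // true  (Farsi yeh)
console.log(yeh.test("\u0649")); // false (alef maksura, a third look-alike)
```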

Slide 43

Slide 43 text

Properties \p{Script=Latin} Name: Script Value: Latin Match any code point with the value “Latin” for the Script property.

Slide 44

Slide 44 text

Properties \P{Script=Latin} Name: Script Value: not Latin Negated form: Match any code point without the value “Latin” for the Script property.

Slide 45

Slide 45 text

Properties \p{Latin} Name: Script (implicit) Value: Latin The Script and General Category properties don't require the name because they're so common and their values don't conflict.

Slide 46

Slide 46 text

Properties \p{General_Category=Letter} Name: General Category Value: Letter Match any code point with the value “Letter” for the General Category property.

Slide 47

Slide 47 text

Properties \p{gc=Letter} Name: General Category (gc) Value: Letter Property names may be abbreviated.

Slide 48

Slide 48 text

Properties \p{gc=L} Name: General Category (gc) Value: Letter (L) The General Category property is so commonly used that its values all have standard abbreviations.

Slide 49

Slide 49 text

Properties \p{L} Name: General Category (implicit) Value: Letter (L) And the General Category values may even be used on their own, like the Script values. These two properties have distinct values.

Slide 50

Slide 50 text

Properties \pL Name: General Category (implicit) Value: Letter (L) Single-character General Category values don't require curly braces.

Slide 51

Slide 51 text

Properties \PL Name: General Category (implicit) Value: not Letter (L) Don't forget negation!
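Most of these forms carry over to JavaScript regular expressions with the `u` flag (ES2018 or later), with two differences worth noting: the Script property always needs its name, so a bare `\p{Latin}` is a syntax error, and the brace-less `\pL` shorthand is not accepted. A sketch of the forms that are accepted; the `forms` object is only there to group them.

```javascript
const forms = {
  scriptLatin: /^\p{Script=Latin}$/u,          // name and value
  notLatin: /^\P{Script=Latin}$/u,             // negated
  letterLong: /^\p{General_Category=Letter}$/u,
  letterShort: /^\p{gc=L}$/u,                  // abbreviated name and value
  letterBare: /^\p{L}$/u,                      // General Category value on its own
  notLetter: /^\P{L}$/u,                       // negated
};

console.log(forms.scriptLatin.test("a"), forms.scriptLatin.test("я")); // true false
console.log(forms.notLatin.test("я"));                                 // true
console.log(forms.letterShort.test("я"), forms.letterBare.test("я"));  // true true
console.log(forms.notLetter.test("1"));                                // true
```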

Slide 52

Slide 52 text

s/ / /g