Advanced Unicode Regular Expressions

Nova Patch
January 23, 2013

Unicode regular expression tutorial with examples in Perl, PHP, and JavaScript.

Presented at:
◦  2013-01-23: Shutterstock “Brown Bag Lunch” Tech Talk, New York, NY


Transcript

  1. Unicode Regular Expressions s/ / /g Nick Patch 23 January 2013
  2. Unicode Refresher: Unicode attempts to support the characters of the world: a massive task!
  3. Unicode Refresher: It's hard to attach a single meaning to the word “character” but most folks think of characters as the smallest stand-alone components of a writing system.
  4. Unicode Refresher: In Unicode, this sense of characters is represented by one or more code points, which are each stored in one or more bytes.
  5. Unicode Refresher: However, programmers and programming languages tend to think of characters as individual code points, or worse, individual bytes. We need to modernize our habits!
  6. Unicode Refresher: Unicode is not just a big set of characters. It also defines standard properties for each character and standard algorithms for operations such as collation, normalization, and segmentation.
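The gap between bytes, code points, and user-perceived characters described in the refresher slides can be sketched in JavaScript. This is only an illustration, not part of the talk: it assumes a runtime that provides `TextEncoder` and `Intl.Segmenter` (both arrived after 2013), and the variable names are made up for the example.

```javascript
// One user-perceived character, counted at three different levels.
// The string is the decomposed form used in the slides:
// α + combining comma above + combining grave + combining ypogegrammeni.
const str = "\u03B1\u0313\u0300\u0345";

// Bytes: how the code points are stored when encoded as UTF-8.
const utf8Bytes = new TextEncoder().encode(str).length;

// Code points: iterating a string yields one entry per code point.
const codePoints = Array.from(str).length;

// Grapheme clusters: the closest match to what most people call a character.
const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
const graphemes = Array.from(segmenter.segment(str)).length;

console.log({ utf8Bytes, codePoints, graphemes });
// { utf8Bytes: 8, codePoints: 4, graphemes: 1 }
```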
  7. Normalization: NFD(ᾀ◌̀) = α◌̓◌̀◌ͅ; NFC(ᾀ◌̀) = ᾂ
  8. Normalization: NFD(Чю◌́рлёнис) = Чю◌́рле◌̈нис; NFC(Чю◌́рлёнис) = Чю◌́рлёнис
  9. Normalization: ᾂ ≡ ἂ◌ͅ ≡ ᾀ◌̀ ≡ ᾳ◌̓◌̀ ≡ α◌̓◌̀◌ͅ ≡ α◌̓◌ͅ◌̀ ≡ α◌ͅ◌̓◌̀ ≠ ᾲ◌̓ ≡ ὰ◌̓◌ͅ ≡ ὰ◌ͅ◌̓ ≡ ᾳ◌̀◌̓ ≡ α◌̀◌̓◌ͅ ≡ α◌̀◌ͅ◌̓ ≡ α◌ͅ◌̀◌̓
  10. Perl Normalization:
        use Unicode::Normalize;
        say $str;       # ᾀ◌̀
        say NFD($str);  # α◌̓◌̀◌ͅ
        say NFC($str);  # ᾂ
  11. JavaScript Normalization:
        var unorm = require('unorm');
        console.log($str);             // ᾀ◌̀
        console.log(unorm.nfd($str));  // α◌̓◌̀◌ͅ
        console.log(unorm.nfc($str));  // ᾂ
  12. PHP Normalization:
        echo $str;                                          # ᾀ◌̀
        echo Normalizer::normalize($str, Normalizer::FORM_D);  # α◌̓◌̀◌ͅ
        echo Normalizer::normalize($str, Normalizer::FORM_C);  # ᾂ
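The chain of equivalences on the normalization slides can be checked mechanically: two strings are canonically equivalent when their normalized forms are identical. The sketch below assumes a JavaScript runtime with the built-in `String.prototype.normalize` (standardized in ES2015, after this talk) rather than the `unorm` module used on the slide. `canonicallyEqual` is a hypothetical helper name.

```javascript
// Hypothetical helper: compare two strings by their NFD forms.
function canonicallyEqual(a, b) {
  return a.normalize("NFD") === b.normalize("NFD");
}

const precomposed = "\u1F82";                   // ᾂ as a single code point
const mixed = "\u1F80\u0300";                   // ᾀ followed by a combining grave
const decomposed = "\u03B1\u0313\u0300\u0345";  // α + psili + grave + ypogegrammeni
const otherOrder = "\u03B1\u0300\u0313\u0345";  // α + grave + psili + ypogegrammeni

console.log(mixed.normalize("NFD") === decomposed);      // true
console.log(mixed.normalize("NFC") === precomposed);     // true
console.log(canonicallyEqual(precomposed, decomposed));  // true
console.log(canonicallyEqual(precomposed, otherOrder));  // false
```

The last comparison is false because the psili and the grave share a combining class, so normalization keeps their relative order, and swapping them yields a different character.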
  13. Grapheme Clusters: regex: /^.$/; string 1: ᾂ; string 2: α◌̓◌̀◌ͅ
  14. Grapheme Clusters: regex: /^.$/; string 1: ᾂ ⇧; string 2: α◌̓◌̀◌ͅ ⇧; 1. anchor at beginning of string
  15. Grapheme Clusters: regex: /^.$/; string 1: ᾂ ⇧; string 2: α◌̓◌̀◌ͅ ⇧; 1. anchor at beginning of string 2. match code point (excl. \n)
  16. Grapheme Clusters: regex: /^.$/; string 1: ᾂ ⇧⇧; string 2: α◌̓◌̀◌ͅ; 1. anchor at beginning of string 2. match code point (excl. \n) 3. anchor at end of string
  17. Grapheme Clusters: regex: /^.$/; string 1: ᾂ ⇧⇧; string 2: α◌̓◌̀◌ͅ; 1. anchor at beginning of string 2. match code point (excl. \n) 3. anchor at end of string 4. 1 success but 1 failure: mixed results
  18. Grapheme Clusters: regex: /^\X$/; string 1: ᾂ; string 2: α◌̓◌̀◌ͅ
  19. Grapheme Clusters: regex: /^\X$/; string 1: ᾂ ⇧; string 2: α◌̓◌̀◌ͅ ⇧; 1. anchor at beginning of string
  20. Grapheme Clusters: regex: /^\X$/; string 1: ᾂ ⇧; string 2: α◌̓◌̀◌ͅ ⇧; 1. anchor at beginning of string 2. match grapheme cluster
  21. Grapheme Clusters: regex: /^\X$/; string 1: ᾂ ⇧⇧; string 2: α◌̓◌̀◌ͅ ⇧ ⇧; 1. anchor at beginning of string 2. match grapheme cluster 3. anchor at end of string
  22. Grapheme Clusters: regex: /^\X$/; string 1: ᾂ ⇧⇧; string 2: α◌̓◌̀◌ͅ ⇧ ⇧; 1. anchor at beginning of string 2. match grapheme cluster 3. anchor at end of string 4. success!
  23. Perl:
        use v5.12;  # better yet: v5.14
        use utf8;
        use charnames qw( :full );  # unless v5.16
        use open qw( :encoding(UTF-8) :std );
        $str =~ /^\X$/;
        $str =~ s/^(\X)$/->$1<-/;
  24. PHP:
        preg_match('/^\X$/u', $str);
        preg_replace('/^(\X)$/u', '->$1<-', $str);
  25. JavaScript: [This slide intentionally left blank.]
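JavaScript regular expressions have no `\X`, which is presumably why the slide is blank. In runtimes that ship `Intl.Segmenter` (added to the language well after this talk), the effect of `/^\X$/` can be approximated outside the regex engine. The following is a rough sketch under that assumption. `isSingleGrapheme` is a hypothetical helper, not an API from the talk.

```javascript
// Hypothetical helper: true when str consists of exactly one grapheme cluster,
// roughly what /^\X$/ tests in Perl and PHP.
function isSingleGrapheme(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return Array.from(segmenter.segment(str)).length === 1;
}

const precomposed = "\u1F82";                   // ᾂ
const decomposed = "\u03B1\u0313\u0300\u0345";  // α + three combining marks

// The code point matcher gives mixed results...
console.log(/^.$/u.test(precomposed));  // true
console.log(/^.$/u.test(decomposed));   // false

// ...while grapheme segmentation treats both strings alike.
console.log(isSingleGrapheme(precomposed));  // true
console.log(isSingleGrapheme(decomposed));   // true
```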

  26. Match Any Character: two bytes (if byte mode): е..и; code point (excl. \n): е.и; code point (incl. \n): е\p{Any}и; grapheme cluster (incl. \n): е\Xи
  27. Match Any Letter: letter code point: е\p{General_Category=Letter}и; letter code point: е\pLи; Cyrillic code point: е\p{Script=Cyrillic}и; Cyrillic code point: е\p{Cyrillic}и; letter grapheme cluster: е(?=\pL)\Xи
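Several of the code point matchers above have close JavaScript counterparts once the `u` flag is set. This is a sketch assuming an ES2018 or later engine, which postdates the talk. It uses the `s` (dotAll) flag in place of `\p{Any}` for the newline-inclusive case, and the sample strings are invented for illustration.

```javascript
// е = U+0435 and и = U+0438, written as escapes to avoid look-alike Latin letters.
const anyExceptNewline = /\u0435.\u0438/u;              // е.и
const anyInclNewline = /\u0435.\u0438/su;               // dotAll: "." also matches \n
const anyLetter = /\u0435\p{L}\u0438/u;                 // е\pLи
const cyrillic = /\u0435\p{Script=Cyrillic}\u0438/u;    // е\p{Script=Cyrillic}и

const withCyrillic = "\u0435\u0436\u0438";  // ежи
const withNewline = "\u0435\n\u0438";
const withDigit = "\u04351\u0438";
const withLatin = "\u0435z\u0438";

console.log(anyExceptNewline.test(withCyrillic));  // true
console.log(anyExceptNewline.test(withNewline));   // false
console.log(anyInclNewline.test(withNewline));     // true
console.log(anyLetter.test(withDigit));            // false
console.log(anyLetter.test(withLatin));            // true
console.log(cyrillic.test(withLatin));             // false
console.log(cyrillic.test(withCyrillic));          // true
```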
  28. regex: / о \p{Cyrillic} т /x; string 1: който; string 2: кои◌̆то
  29. regex: / о \p{Cyrillic} т /x; string 1: който; string 2: кои◌̆то; 1. match letter о
  30. regex: / о \p{Cyrillic} т /x; string 1: който; string 2: кои◌̆то; 1. match letter о 2. match Cyrillic letter (1 code point)
  31. regex: / о \p{Cyrillic} т /x; string 1: който; string 2: кои◌̆то; 1. match letter о 2. match Cyrillic letter (1 code point) 3. match letter т
  32. regex: / о \p{Cyrillic} т /x; string 1: който; string 2: кои◌̆то; 1. match letter о 2. match Cyrillic letter (1 code point) 3. match letter т 4. 1 success but 1 failure: mixed results
  33. regex: / о (?= \p{Cyrillic} ) \X т /x; string 1: който; string 2: кои◌̆то
  34. regex: / о (?= \p{Cyrillic} ) \X т /x; string 1: който; string 2: кои◌̆то; 1. match letter о
  35. regex: / о (?= \p{Cyrillic} ) \X т /x; string 1: който ⇧; string 2: кои◌̆то ⇧; 1. match letter о 2. positive lookahead Cyrillic letter (1 code point)
  36. regex: / о (?= \p{Cyrillic} ) \X т /x; string 1: който ⇧; string 2: кои◌̆то ⇧; 1. match letter о 2. positive lookahead Cyrillic letter (1 code point) 3. match grapheme cluster (1+ code points)
  37. regex: / о (?= \p{Cyrillic} ) \X т /x; string 1: който ⇧; string 2: кои◌̆то ⇧; 1. match letter о 2. positive lookahead Cyrillic letter (1 code point) 3. match grapheme cluster (1+ code points) 4. match letter т
  38. regex: / о (?= \p{Cyrillic} ) \X т /x; string 1: който ⇧; string 2: кои◌̆то ⇧; 1. match letter о 2. positive lookahead Cyrillic letter (1 code point) 3. match grapheme cluster (1+ code points) 4. match letter т 5. success!
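The walk-through above can be approximated in JavaScript, with one substitution stated plainly: since JavaScript has no `\X`, the sketch matches a Cyrillic code point followed by any number of combining marks (`\p{M}*`). That covers the base-plus-marks case shown on the slides but is not a full grapheme cluster matcher. It assumes an ES2018 or later engine.

```javascript
const precomposed = "\u043A\u043E\u0439\u0442\u043E";        // който, with й as one code point
const decomposed = "\u043A\u043E\u0438\u0306\u0442\u043E";   // кои + combining breve + то

// о, one Cyrillic code point, т: fails when й is decomposed.
const codePointOnly = /\u043E\p{Script=Cyrillic}\u0442/u;

// о, one Cyrillic code point plus trailing combining marks, т.
const withMarks = /\u043E\p{Script=Cyrillic}\p{M}*\u0442/u;

console.log(codePointOnly.test(precomposed));  // true
console.log(codePointOnly.test(decomposed));   // false
console.log(withMarks.test(precomposed));      // true
console.log(withMarks.test(decomposed));       // true
```

The combining breve has the Script value Inherited rather than Cyrillic, so it has to be matched by its General Category (Mark) instead.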
  39. Character Literals [ ي ی ] (?: | ی ي)

  40. Character Literals: [ ي ی ] (?: ي | ی )
  41. Character Literals: [ ي ی ] (?: ي | ی ) [\x{064A}\x{06CC}]
  42. Character Literals: [ ي ی ] (?: ي | ی ) [\x{064A}\x{06CC}] [\N{ARABIC LETTER YEH} \N{ARABIC LETTER FARSI YEH}]
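The escaped form of the character class carries over to JavaScript as `\u{...}` under the `u` flag. JavaScript regular expressions have no named-character escape comparable to `\N{...}`, so the names appear only as comments in this sketch, which assumes an ES2015 or later engine.

```javascript
// U+064A ARABIC LETTER YEH, U+06CC ARABIC LETTER FARSI YEH
const yeh = /[\u{064A}\u{06CC}]/u;

console.log(yeh.test("\u064A"));  // true
console.log(yeh.test("\u06CC"));  // true
console.log(yeh.test("\u0649"));  // false (U+0649 ARABIC LETTER ALEF MAKSURA)
```

Writing the code points as escapes keeps the pattern readable in editors that reorder right-to-left text and makes look-alike letters distinguishable.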
  43. Properties: \p{Script=Latin}; Name: Script; Value: Latin. Match any code point with the value “Latin” for the Script property.
  44. Properties: \P{Script=Latin}; Name: Script; Value: not Latin. Negated form: Match any code point without the value “Latin” for the Script property.
  45. Properties: \p{Latin}; Name: Script (implicit); Value: Latin. The Script and General Category properties don't require the name because they're so common and their values don't conflict.
  46. Properties: \p{General_Category=Letter}; Name: General Category; Value: Letter. Match any code point with the value “Letter” for the General Category property.
  47. Properties: \p{gc=Letter}; Name: General Category (gc); Value: Letter. Property names may be abbreviated.
  48. Properties: \p{gc=L}; Name: General Category (gc); Value: Letter (L). The General Category property is so commonly used that its values all have standard abbreviations.
  49. Properties: \p{L}; Name: General Category (implicit); Value: Letter (L). And the General Category values may even be used on their own, like the Script values. These two properties have distinct values.
  50. Properties: \pL; Name: General Category (implicit); Value: Letter (L). Single-character General Category values don't require curly braces.
  51. Properties: \PL; Name: General Category (implicit); Value: not Letter (L). Don't forget negation!
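Most of the property spellings above also exist in JavaScript under the `u` flag, with two differences worth noting: Script values need an explicit `Script=` (or `sc=`) name, and the brace-less `\pL` form is not accepted. The sketch below assumes an ES2018 or later engine. `compiles` is a hypothetical helper that reports whether a pattern is valid.

```javascript
// Hypothetical helper: does this pattern source compile with the u flag?
function compiles(source) {
  try {
    new RegExp(source, "u");
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err;
  }
}

// Equivalent spellings of "any letter code point".
const letterForms = [
  /\p{General_Category=Letter}/u,
  /\p{gc=Letter}/u,
  /\p{gc=L}/u,
  /\p{Letter}/u,
  /\p{L}/u,
];
console.log(letterForms.every((re) => re.test("a") && !re.test("1")));  // true

// Negation with an uppercase P.
console.log(/\P{L}/u.test("1"));             // true
console.log(/\P{Script=Latin}/u.test("a"));  // false

// Forms from the slides that JavaScript rejects or accepts.
console.log(compiles("\\p{Script=Latin}"));  // true
console.log(compiles("\\p{Latin}"));         // false
console.log(compiles("\\pL"));               // false
```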
  52. s/ / /g