Advanced Unicode Regular Expressions

Nova Patch
January 23, 2013

Unicode regular expression tutorial with examples in Perl, PHP, and JavaScript.

Presented at:
◦  2013-01-23: Shutterstock “Brown Bag Lunch” Tech Talk, New York, NY

Transcript

  1. Unicode Refresher: It's hard to attach a single meaning to the word “character”, but most folks think of characters as the smallest stand-alone components of a writing system.
  2. Unicode Refresher: In Unicode, this sense of characters is represented by one or more code points, each of which is stored in one or more bytes.
  3. Unicode Refresher: However, programmers and programming languages tend to think of characters as individual code points or, worse, individual bytes. We need to modernize our habits!
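The gap between bytes, code points, and user-perceived characters can be made concrete with a small JavaScript sketch. The helper name `measure` is hypothetical, and the code assumes a runtime that provides `TextEncoder` and `Intl.Segmenter` (for example a recent Node.js or browser).

```javascript
// Count one piece of text at four different levels.
// Assumes TextEncoder and Intl.Segmenter are available in the runtime.
function measure(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return {
    bytes: new TextEncoder().encode(text).length,   // UTF-8 bytes
    codeUnits: text.length,                         // UTF-16 code units
    codePoints: [...text].length,                   // string iteration is by code point
    graphemes: [...segmenter.segment(text)].length, // user-perceived characters
  };
}

// Greek alpha followed by three combining marks: one character to a reader.
const decomposedAlpha = '\u03B1\u0313\u0300\u0345';
console.log(measure(decomposedAlpha));
// bytes: 8, codeUnits: 4, codePoints: 4, graphemes: 1
```

Only the last count agrees with what a reader would call one character, which is the habit the slides argue for.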
  4. Unicode Refresher: Unicode is not just a big set of characters. It also defines standard properties for each character and standard algorithms for operations such as collation, normalization, and segmentation.
  5. Normalization:
     NFD(Чю◌́рлёнис) = Чю◌́рле◌̈нис
     NFC(Чю◌́рлёнис) = Чю◌́рлёнис
  6. Normalization:
     ᾂ ≡ ἂ◌ͅ ≡ ᾀ◌̀ ≡ ᾳ◌̓◌̀ ≡ α◌̓◌̀◌ͅ ≡ α◌̓◌ͅ◌̀ ≡ α◌ͅ◌̓◌̀
     ≠
     ᾲ◌̓ ≡ ὰ◌̓◌ͅ ≡ ὰ◌ͅ◌̓ ≡ ᾳ◌̀◌̓ ≡ α◌̀◌̓◌ͅ ≡ α◌̀◌ͅ◌̓ ≡ α◌ͅ◌̀◌̓
  7. PHP Normalization:
     echo $str;                                            # ᾀ◌̀
     echo Normalizer::normalize($str, Normalizer::FORM_D); # α◌̓◌̀◌ͅ
     echo Normalizer::normalize($str, Normalizer::FORM_C); # ᾂ
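For comparison, the same normalization can be sketched in JavaScript with the built-in `String.prototype.normalize`. The helpers `codePoints` and `canonicallyEqual` are hypothetical names introduced for illustration, and the input mirrors the PHP slide (ᾀ followed by a combining grave accent).

```javascript
// ᾀ (U+1F80) followed by a combining grave accent (U+0300).
const str = '\u1F80\u0300';

const nfd = str.normalize('NFD'); // fully decomposed, marks in canonical order
const nfc = str.normalize('NFC'); // decomposed, then recomposed

// Show code points, since the glyphs may render identically.
function codePoints(s) {
  return [...s]
    .map(c => 'U+' + c.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'))
    .join(' ');
}

console.log(codePoints(str)); // U+1F80 U+0300
console.log(codePoints(nfd)); // U+03B1 U+0313 U+0300 U+0345
console.log(codePoints(nfc)); // U+1F82

// Two strings are canonically equivalent when their normal forms match.
function canonicallyEqual(a, b) {
  return a.normalize('NFC') === b.normalize('NFC');
}
```

Comparing normalized forms, rather than raw strings, is what makes the ≡ and ≠ relations on the previous slide checkable in code.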
  8. Grapheme Clusters: regex: /^.$/   string 1: ᾂ   string 2: α◌̓◌̀◌ͅ
     1. anchor at beginning of string
  9. Grapheme Clusters: regex: /^.$/   string 1: ᾂ   string 2: α◌̓◌̀◌ͅ
     1. anchor at beginning of string
     2. match code point (excl. \n)
  10. Grapheme Clusters: regex: /^.$/   string 1: ᾂ   string 2: α◌̓◌̀◌ͅ
     1. anchor at beginning of string
     2. match code point (excl. \n)
     3. anchor at end of string
  11. Grapheme Clusters: regex: /^.$/   string 1: ᾂ   string 2: α◌̓◌̀◌ͅ
     1. anchor at beginning of string
     2. match code point (excl. \n)
     3. anchor at end of string
     4. 1 success but 1 failure: mixed results
  12. Grapheme Clusters: regex: /^\X$/   string 1: ᾂ   string 2: α◌̓◌̀◌ͅ
     1. anchor at beginning of string
  13. Grapheme Clusters: regex: /^\X$/   string 1: ᾂ   string 2: α◌̓◌̀◌ͅ
     1. anchor at beginning of string
     2. match grapheme cluster
  14. Grapheme Clusters: regex: /^\X$/   string 1: ᾂ   string 2: α◌̓◌̀◌ͅ
     1. anchor at beginning of string
     2. match grapheme cluster
     3. anchor at end of string
  15. Grapheme Clusters: regex: /^\X$/   string 1: ᾂ   string 2: α◌̓◌̀◌ͅ
     1. anchor at beginning of string
     2. match grapheme cluster
     3. anchor at end of string
     4. success!
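JavaScript regular expressions have no `\X`, so the walkthrough above cannot be copied directly. A rough stand-in, assuming `Intl.Segmenter` is available, is to segment the string into grapheme clusters and count them. The function name `isSingleGrapheme` is hypothetical.

```javascript
const composed = '\u1F82';                   // ᾂ as one code point
const decomposed = '\u03B1\u0313\u0300\u0345'; // α plus three combining marks

// /^.$/u matches exactly one code point (excluding line terminators).
const dotMatchesComposed = /^.$/u.test(composed);     // true
const dotMatchesDecomposed = /^.$/u.test(decomposed); // false

// Stand-in for /^\X$/: the whole string is exactly one grapheme cluster.
function isSingleGrapheme(s) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return [...segmenter.segment(s)].length === 1;
}

console.log(dotMatchesComposed, dotMatchesDecomposed);                  // true false
console.log(isSingleGrapheme(composed), isSingleGrapheme(decomposed)); // true true
```

This is a sketch of the idea rather than a drop-in replacement for `\X`, since it tests a whole string instead of matching inside a larger pattern.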
  16. Perl:
     use v5.12;  # better yet: v5.14
     use utf8;
     use charnames qw( :full );  # unless v5.16
     use open qw( :encoding(UTF-8) :std );
     $str =~ /^\X$/;
     $str =~ s/^(\X)$/->$1<-/;
  17. Match Any Character:
     two bytes (if byte mode): е..и
     code point (excl. \n): е.и
     code point (incl. \n): е\p{Any}и
     grapheme cluster (incl. \n): е\Xи
  18. Match Any Letter:
     letter code point: е\p{General_Category=Letter}и
     letter code point: е\pLи
     Cyrillic code point: е\p{Script=Cyrillic}и
     Cyrillic code point: е\p{Cyrillic}и
     letter grapheme cluster: е(?=\pL)\Xи
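A minimal JavaScript sketch of the code point variants from the two slides above, using Unicode property escapes with the `u` flag. The object name `patterns` is hypothetical, the Cyrillic letters е and и are written as escapes to keep them unambiguous, and the `\X` variants are omitted because JavaScript has no `\X`.

```javascript
// \u0435 is Cyrillic е and \u0438 is Cyrillic и.
const patterns = {
  anyExceptNewline: /\u0435.\u0438/u,              // е.и
  anyCodePoint: /\u0435\p{Any}\u0438/u,            // е\p{Any}и
  letter: /\u0435\p{L}\u0438/u,                    // е\pLи
  cyrillic: /\u0435\p{Script=Cyrillic}\u0438/u,    // е\p{Script=Cyrillic}и
};

// Report which patterns match a given middle character.
function matching(middle) {
  const text = '\u0435' + middle + '\u0438';
  return Object.keys(patterns).filter(name => patterns[name].test(text));
}

console.log(matching('\u0436')); // Cyrillic ж
console.log(matching('z'));      // Latin letter
console.log(matching('1'));      // digit
console.log(matching('\n'));     // newline
```

Each pattern is strictly narrower or wider than its neighbour, so running a few sample characters through `matching` shows where the boundaries fall.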
  19. regex: / о \p{Cyrillic} т /x   string 1: който   string 2: кои◌̆то
     1. match letter о
  20. regex: / о \p{Cyrillic} т /x   string 1: който   string 2: кои◌̆то
     1. match letter о
     2. match Cyrillic letter (1 code point)
  21. regex: / о \p{Cyrillic} т /x   string 1: който   string 2: кои◌̆то
     1. match letter о
     2. match Cyrillic letter (1 code point)
     3. match letter т
  22. regex: / о \p{Cyrillic} т /x   string 1: който   string 2: кои◌̆то
     1. match letter о
     2. match Cyrillic letter (1 code point)
     3. match letter т
     4. 1 success but 1 failure: mixed results
  23. regex: / о (?= \p{Cyrillic} ) \X т /x   string 1: който   string 2: кои◌̆то
  24. regex: / о (?= \p{Cyrillic} ) \X т /x   string 1: който   string 2: кои◌̆то
     1. match letter о
  25. regex: / о (?= \p{Cyrillic} ) \X т /x   string 1: който   string 2: кои◌̆то
     1. match letter о
     2. positive lookahead Cyrillic letter (1 code point)
  26. regex: / о (?= \p{Cyrillic} ) \X т /x   string 1: който   string 2: кои◌̆то
     1. match letter о
     2. positive lookahead Cyrillic letter (1 code point)
     3. match grapheme cluster (1+ code points)
  27. regex: / о (?= \p{Cyrillic} ) \X т /x   string 1: който   string 2: кои◌̆то
     1. match letter о
     2. positive lookahead Cyrillic letter (1 code point)
     3. match grapheme cluster (1+ code points)
     4. match letter т
  28. regex: / о (?= \p{Cyrillic} ) \X т /x   string 1: който   string 2: кои◌̆то
     1. match letter о
     2. positive lookahead Cyrillic letter (1 code point)
     3. match grapheme cluster (1+ code points)
     4. match letter т
     5. success!
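The lookahead plus `\X` pattern walked through above has no direct JavaScript equivalent. One approximation, shown here as a sketch, is to follow the Cyrillic code point with any number of combining marks (`\p{M}*`). This covers base-plus-marks sequences such as и + ◌̆ but is not a full implementation of extended grapheme clusters. The variable names are hypothetical.

```javascript
// \u043E is о, \u0442 is т.
const oneCodePoint = /\u043E\p{Script=Cyrillic}\u0442/u;       // о \p{Cyrillic} т
const withMarks = /\u043E\p{Script=Cyrillic}\p{M}*\u0442/u;    // approximates о (?=\p{Cyrillic}) \X т

const composed = '\u043A\u043E\u0439\u0442\u043E'; // който, with precomposed й
const decomposed = composed.normalize('NFD');      // кои◌̆то, й split into и + breve

console.log(oneCodePoint.test(composed));   // true
console.log(oneCodePoint.test(decomposed)); // false: the breve sits where т is expected
console.log(withMarks.test(composed));      // true
console.log(withMarks.test(decomposed));    // true
```

Normalizing both the pattern input and the text to NFC before matching would be another way to get consistent results for this particular word, although not every base-plus-mark sequence has a precomposed form.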
  29. Character Literals:
     [ ي ی ]
     (?: ي | ی )
     [\x{064A}\x{06CC}]
     [\N{ARABIC LETTER YEH} \N{ARABIC LETTER FARSI YEH}]
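In JavaScript the escaped forms look slightly different: code points are written `\uXXXX` or, with the `u` flag, `\u{…}`, and standard JavaScript regular expressions do not offer a named-character escape like `\N{…}`. A small sketch, with the hypothetical names `yehForms` and `matchesAll`:

```javascript
// Three equivalent ways to match ARABIC LETTER YEH (U+064A)
// or ARABIC LETTER FARSI YEH (U+06CC).
const yehForms = [
  /[\u064A\u06CC]/u,       // four-digit escapes in a character class
  /[\u{064A}\u{06CC}]/u,   // braced escapes, which require the u flag
  /(?:\u064A|\u06CC)/u,    // alternation
];

// True when every form matches the given text.
function matchesAll(text) {
  return yehForms.every(re => re.test(text));
}

console.log(matchesAll('\u064A')); // true
console.log(matchesAll('\u06CC')); // true
console.log(matchesAll('\u0627')); // false: ARABIC LETTER ALEF
```

Escapes are often easier to review than literal characters when two letters look nearly identical, as these two do in many fonts.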
  30. Properties: \p{Script=Latin}
     Name: Script   Value: Latin
     Match any code point with the value “Latin” for the Script property.
  31. Properties: \P{Script=Latin}
     Name: Script   Value: not Latin
     Negated form: Match any code point without the value “Latin” for the Script property.
  32. Properties: \p{Latin}
     Name: Script (implicit)   Value: Latin
     The Script and General Category properties don't require the name because they're so common and their values don't conflict.
  33. Properties: \p{General_Category=Letter}
     Name: General Category   Value: Letter
     Match any code point with the value “Letter” for the General Category property.
  34. Properties: \p{gc=L}
     Name: General Category (gc)   Value: Letter (L)
     The General Category property is so commonly used that its values all have standard abbreviations.
  35. Properties: \p{L}
     Name: General Category (implicit)   Value: Letter (L)
     And the General Category values may even be used on their own, like the Script values. These two properties have distinct values.
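The property forms above carry over to JavaScript with one difference worth knowing: ECMAScript accepts General Category values on their own (`\p{L}`) but requires the property name for scripts (`\p{Script=Latin}` or `\p{sc=Latn}`), so the implicit `\p{Latin}` form is Perl and PCRE syntax. The sketch below uses hypothetical names (`forms`, `classify`, `acceptsImplicitScript`) and probes the implicit form at run time instead of assuming the result.

```javascript
const forms = {
  scriptLong: /^\p{Script=Latin}$/u,
  scriptShort: /^\p{sc=Latn}$/u,
  notLatin: /^\P{Script=Latin}$/u,           // negated form
  gcLong: /^\p{General_Category=Letter}$/u,
  gcShort: /^\p{gc=L}$/u,
  gcImplicit: /^\p{L}$/u,
};

// List the forms that match a single code point.
function classify(ch) {
  return Object.keys(forms).filter(name => forms[name].test(ch));
}

// Probe whether this engine accepts a bare script name such as \p{Latin}.
function acceptsImplicitScript() {
  try {
    new RegExp('\\p{Latin}', 'u');
    return true;
  } catch (e) {
    return false;
  }
}

console.log(classify('a'));      // Latin letter
console.log(classify('\u044F')); // Cyrillic я
console.log(classify('1'));      // digit, Script=Common
console.log(acceptsImplicitScript());
```

The long, abbreviated, and implicit General Category forms always agree with each other, which is the point of slides 33 to 35.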