Unicode Regular Expressions in Perl

Perl’s regular expression engine provides rich features for matching and parsing Unicode strings. Recent releases of Perl have added powerful new modifiers, character classes, and other escape sequences that are worth adding to your toolkit. The behavior of regex metacharacters has also evolved to conform to Unicode standards, so it’s important to understand how they differ from their ASCII-only counterparts.

This talk will be useful to programmers of all levels who want to learn about Unicode character properties and new regex features. A basic knowledge of regular expressions is required.
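
As a small taste of those differences, the following sketch compares the Unicode-aware \d with the ASCII-only [0-9] and with \d under the /a modifier (available from Perl 5.14). The helper names are made up for illustration and do not come from the talk.

```perl
use v5.14;   # enables strict, say, and the /a modifier
use utf8;    # this source file contains non-ASCII string literals
use open qw( :encoding(UTF-8) :std );   # encode output as UTF-8

# Hypothetical helpers, named here only for illustration.

# True if $s consists solely of decimal digits in any script.
sub is_unicode_number { my ($s) = @_; return $s =~ /\A \d+ \z/x ? 1 : 0 }

# True if $s consists solely of the ASCII digits 0-9.
sub is_ascii_number { my ($s) = @_; return $s =~ /\A [0-9]+ \z/x ? 1 : 0 }

# Same check, but using /a to restrict \d to ASCII.
sub is_ascii_number_a { my ($s) = @_; return $s =~ /\A \d+ \z/xa ? 1 : 0 }

# Columns: string, Unicode \d, [0-9], \d with /a
for my $s ('123', '১২৩') {
    say join "\t", $s, is_unicode_number($s), is_ascii_number($s),
        is_ascii_number_a($s);
}
```

The Bengali digits pass only the first check, which is why [0-9] or /a is the safer choice when the input must be ASCII digits.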

Presented at:
◦  2013-04-20: DC–Baltimore Perl Workshop (DCBPW) 2013, Baltimore, MD

Nova Patch

April 20, 2013

Transcript

  1. Unicode Regular Expressions s/ / /g Nick Patch Shutterstock

  2. “Perl has some of the best Unicode support today, especially with respect to regular expressions.” Benjamin Peterson, The Guts of Unicode in Python, PyCon 2013

  3. UTF-8 encoded input

  4. UTF-8 encoded input ⇩ decode

  5. UTF-8 encoded input ⇩ decode ⇩ character string

  6. UTF-8 encoded input ⇩ decode ⇩ character string ⇩ hack… hack… hack…

  7. UTF-8 encoded input ⇩ decode ⇩ character string ⇩ hack… hack… hack… ⇩ encode

  8. UTF-8 encoded input ⇩ decode ⇩ character string ⇩ hack… hack… hack… ⇩ encode ⇩ UTF-8 encoded output

  9. use utf8;

  10. use utf8;
      $word =~ s{ (?: آباد | باره | بندی | بندي | ترین | ریزی | ریزي | سازی | سازي | هایی ) $}{}x;

  11. use utf8;
      $word =~ s{ (?:
            ия  # definite articles for nouns:
          | ът  # ∙ masculine
          | та  # ∙ feminine
          | то  # ∙ neutral
          | те  # ∙ plural
      ) $}{}x;

  12. use open qw( :encoding(UTF-8) :std );

  13. use utf8;
      use open qw( :encoding(UTF-8) :std );
      use Test::More tests => 66;
      use Lingua::Stem::UniNE::CS qw( stem );
      is stem('zvířatech'), 'zvíř', 'rm -atech';
      is stem('zvířatům'),  'zvíř', 'rm -atům';
      is stem('zvířata'),   'zvíř', 'rm -ata';
      is stem('zvířaty'),   'zvíř', 'rm -aty';

  14. decode_json($json)

  15. decode('UTF-8', $arg)

  16. use v5.12;

  17. use v5.14;

  18. use charnames ':full';

  19. use charnames ':full';
      sub remove_kasra {
          my ($word) = @_;
          $word =~ s{ \N{ARABIC KASRA} $}{}x;
          return $word;
      }

  20. use v5.16;
      sub remove_kasra {
          my ($word) = @_;
          $word =~ s{ \N{ARABIC KASRA} $}{}x;
          return $word;
      }

  21. \d

  22. \d 123…

  23. \d 123… … ১২৩

  24. \d 123… … ১২৩ … ໑໒໓

  25. [0-9] 123…

  26. \w

  27. \w abc… 123… _

  28. \w abc… 123… _ αβγ… … かきく

  29. \w abc… 123… _ αβγ… … かきく …د ج ب

  30. [A-Za-z0-9_] abc… 123… _

  31. \b abc… 123… _ αβγ… … かきく …د ج ب

  32. \s

  33. \R

  34. \R LF (\n) CR (\r) FF (\f)

  35. \R LF (\n) CR (\r) FF (\f) CRLF (\r\n)

  36. \R LF (\n) CR (\r) FF (\f) CRLF (\r\n) NEL VT LS PS

  37. .

  38. \X

  39. \X n̈

  40. \X Spın̈al Tap

  41. \X Spın̈al Tap n\N{COMBINING DIAERESIS}

  42. \X Spın̈al Tap n\N{COMBINING DIAERESIS} \r\n

  43. \p

  44. \p{ASCII}

  45. \P{ASCII}

  46. \p{General_Category=Letter}

  47. \p{Letter}

  48. \p{L}

  49. \pL

  50. L  Letter
      M  Mark
      N  Number
      P  Punctuation
      S  Symbol
      Z  Separator
      C  Other

  51. S   Symbol
      Sm  Math_Symbol
      Sc  Currency_Symbol
      Sk  Modifier_Symbol
      So  Other_Symbol

  52. \p{Script=Latin}

  53. \p{Latin}

  54. [\p{Hiragana} \p{Katakana} \p{Han} \p{Latin} \p{Common}]

  55. [\p{Hira} \p{Kana} \p{Hani} \p{Latn} \p{Common}]

  56. Arab  Arabic
      Beng  Bengali
      Deva  Devanagari
      Egyp  Egyptian hieroglyphs
      Ethi  Ethiopic
      Grek  Greek
      Hang  Hangul
      …

  57. return $word
          if $word =~ s{ зи $}{г}x
          || $word =~ s{ е ( \p{Cyrl} ) и $}{я$1}x
          || $word =~ s{ ци $}{к}x
          || $word =~ s{ (?: та | ища ) $}{}x;

  58. use v5.18;

  59. (?[ ])

  60. (?[ \d - \p{ASCII} ])

  61. (?[ \d & \p{Thai} ])

  62. perlre: regex syntax
      perlrebackslash: regex escape sequences
      perlrecharclass: regex character classes
      perlunicode: Unicode features
      Lingua::Stem::UniNE: code examples

  63. @nickpatch
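
To try out the pieces covered in the slides above, the following sketch strings several of them together: the decode, process, encode workflow, \R, \X, a script property, and the (?[ ]) set operations. It is a minimal illustration rather than code from the talk; the helper names and sample strings are invented, and it assumes Perl 5.18 or later. Because (?[ ]) was experimental when it was introduced, the corresponding warning is silenced explicitly.

```perl
use v5.18;   # (?[ ]) extended bracketed character classes need 5.18+
use utf8;    # this source file contains non-ASCII regex literals
use Encode qw( decode encode );
no warnings 'experimental::regex_sets';   # (?[ ]) was experimental in 5.18

# Hypothetical helpers, named for illustration only.

# Decode UTF-8 bytes, strip a Bulgarian definite article suffix (the
# alternation from the slides), and encode the result back to UTF-8 bytes.
sub strip_article {
    my ($bytes) = @_;
    my $word = decode('UTF-8', $bytes);
    $word =~ s{ (?: ия | ът | та | то | те ) $}{}x;
    return encode('UTF-8', $word);
}

# Split on any Unicode linebreak sequence; \R treats CRLF as one break.
sub split_lines { my ($text) = @_; return split /\R/, $text }

# Count user-perceived characters (extended grapheme clusters) with \X.
sub count_graphemes { my ($s) = @_; my $n = () = $s =~ /\X/g; return $n }

# True if every character of $s is in the Cyrillic script.
sub is_cyrillic { my ($s) = @_; return $s =~ /\A \p{Cyrl}+ \z/x ? 1 : 0 }

# True if $s is a single digit outside ASCII (set subtraction).
sub is_non_ascii_digit {
    my ($s) = @_;
    return $s =~ /\A (?[ \d - \p{ASCII} ]) \z/x ? 1 : 0;
}

# True if $s is a single Thai digit (set intersection).
sub is_thai_digit {
    my ($s) = @_;
    return $s =~ /\A (?[ \d & \p{Thai} ]) \z/x ? 1 : 0;
}

# Usage: all output below is plain ASCII.
say count_graphemes("Sp\N{LATIN SMALL LETTER DOTLESS I}n\N{COMBINING DIAERESIS}al Tap");  # 10
say join ',', split_lines("one\r\ntwo\nthree");   # one,two,three
say is_thai_digit("\N{THAI DIGIT ONE}");          # 1
say is_non_ascii_digit('1');                      # 0
```

Keeping the decode and encode calls at the edges, as in strip_article, means every regex in between operates on characters rather than bytes.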