Unicode Programming in Modern Perl

In 2010 the Perl 5 language switched to a yearly major release cycle with monthly developer releases. Perl has a history of great Unicode support and the new development process has enabled the language to rapidly enhance Unicode functionality and support new Unicode standards. This talk will demonstrate the current state of Unicode in Perl, review the exciting changes in the last few years, and touch on future development.

Presented at:
◦ 2013-10-22: Internationalization & Unicode Conference 37 (IUC37), Santa Clara, CA
◦ 2014-03-27: New York Perl Mongers (NY.pm), New York, NY

Nova Patch

March 27, 2014

Transcript

  1. Unicode Programming in Modern Perl Nick Patch @nickpatch Shutterstock

  3. "Perl has some of the best Unicode support today, especially with respect to regular expressions." Benjamin Peterson, "The Guts of Unicode in Python", PyCon 2013

  4. UTF-8 encoded input

  5. UTF-8 encoded input ⇩ decode

  6. UTF-8 encoded input ⇩ decode ⇩ character string

  7. UTF-8 encoded input ⇩ decode ⇩ character string ⇩ hack… hack… hack…

  8. UTF-8 encoded input ⇩ decode ⇩ character string ⇩ hack… hack… hack… ⇩ encode

  9. UTF-8 encoded input ⇩ decode ⇩ character string ⇩ hack… hack… hack… ⇩ encode ⇩ UTF-8 encoded output
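
The decode, hack, encode flow on the slides above can be sketched with the core Encode module. This is a minimal illustration, not code from the talk: the sample bytes and the variable names are assumptions chosen for the example, and `uc` stands in for the "hack…" step.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Assumed sample input: the UTF-8 bytes for the Czech word "zvířata",
# as they might arrive from a file or a socket.
my $input_bytes = "zv\xC3\xAD\xC5\x99ata";

# decode: UTF-8 bytes become a character string
my $chars = decode('UTF-8', $input_bytes);

# hack: work on characters, not bytes (uppercasing as a stand-in)
my $result = uc $chars;

# encode: the character string becomes UTF-8 bytes again
my $output_bytes = encode('UTF-8', $result);

printf "%d bytes in, %d characters, %d bytes out\n",
    length $input_bytes, length $chars, length $output_bytes;
```

The point of the sketch is the boundary: bytes exist only at the edges of the program, and every string operation in between sees characters.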
  10. use utf8;

  11. use utf8;
      $word =~ s{ (?: نیرت | يدنب | یدنب | هراب | دابأ | ییاه | يزاس | یزاس | يزیر | یزیر ) $}{}x;
  13. use utf8;
      $word =~ s{ (?: دابأ | هراب | یدنب | يدنب | نیرت | یزیر | يزیر | یزاس | يزاس | ییاه ) $}{}x;
  14. use utf8;
      $word =~ s{ (?:
            ия  # definite articles for nouns:
          | ът  # masculine ∙
          | та  # feminine ∙
          | то  # neutral ∙
          | те  # plural ∙
      ) $}{}x;
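
What `use utf8` changes can be seen by measuring the same literal with and without it. The sketch below assumes the source file is saved as UTF-8; `strip_article` is a hypothetical helper written in the style of the slide above, not the module's actual function.

```perl
use strict;
use warnings;

my ($chars_with, $chars_without);
{
    use utf8;    # source literals are decoded into characters
    $chars_with = length 'ът';
}
{
    no utf8;     # source literals stay as the raw bytes of the file
    $chars_without = length 'ът';
}

use utf8;

# Hypothetical helper: strip a Bulgarian definite article from the end
# of a word, using Cyrillic literals directly in the pattern.
sub strip_article {
    my ($word) = @_;
    $word =~ s{ (?: ия | ът | та | то | те ) $}{}x;
    return $word;
}

my $stem = strip_article('градът');

print "with utf8: $chars_with, without: $chars_without\n";
```

Without the pragma, each Cyrillic letter in a UTF-8 source file counts as two bytes, so a pattern built from such literals would not match decoded input.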
  15. =encoding UTF-8

  16. =encoding UTF-8

      =head1 NAME

      Lingua::Stem::UniNE - University of Neuchâtel stemmers

  19. use open qw( :encoding(UTF-8) :std );

  22. use utf8;
      use open qw( :encoding(UTF-8) :std );
      use Test::More tests => 66;
      use Lingua::Stem::UniNE::CS qw( stem );

      is stem('zvířatech'), 'zvíř', 'rm -atech';
      is stem('zvířatům'),  'zvíř', 'rm -atům';
      is stem('zvířata'),   'zvíř', 'rm -ata';
      is stem('zvířaty'),   'zvíř', 'rm -aty';
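
The `use open` pragma makes `:encoding(UTF-8)` the default I/O layer and, with `:std`, applies it to the standard handles. As a rough sketch of what that layer does, the example below applies it explicitly to one in-memory filehandle; the handle and variable names are assumptions for illustration.

```perl
use strict;
use warnings;
use utf8;

my $word = 'zvíř';    # 4 characters

# Write the character string through an explicit :encoding(UTF-8) layer
# into an in-memory file, so $buffer ends up holding encoded bytes.
open my $out, '>:encoding(UTF-8)', \my $buffer
    or die "cannot open buffer for writing: $!";
print {$out} $word;
close $out;

# Read the bytes back through the same layer to get characters again.
open my $in, '<:encoding(UTF-8)', \$buffer
    or die "cannot open buffer for reading: $!";
my $round_trip = <$in>;
close $in;

printf "%d characters, %d bytes on disk\n", length $word, length $buffer;
```

With the pragma in effect, test descriptions and diagnostics containing non-ASCII text are encoded on output in the same way, without a layer on every `open`.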
  26. use charnames ':full';

  27. sub remove_kasra {
          my ($word) = @_;
          $word =~ s{ \x{0650} $}{}x;
          return $word;
      }

  29. use charnames ':full';
      sub remove_kasra {
          my ($word) = @_;
          $word =~ s{ \x{0650} $}{}x;
          return $word;
      }

  30. use charnames ':full';
      sub remove_kasra {
          my ($word) = @_;
          $word =~ s{ \N{ARABIC KASRA} $}{}x;
          return $word;
      }

  31. use v5.16;
      sub remove_kasra {
          my ($word) = @_;
          $word =~ s{ \N{ARABIC KASRA} $}{}x;
          return $word;
      }
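
A short check that the named and numeric forms on these slides refer to the same character. The sample word (a single ARABIC LETTER BEH followed by a kasra) is an assumption made for the example.

```perl
use v5.16;    # \N{...} names load automatically from 5.16 on
use warnings;

sub remove_kasra {
    my ($word) = @_;
    $word =~ s{ \N{ARABIC KASRA} $}{}x;
    return $word;
}

# Assumed sample input: one letter followed by the kasra vowel mark.
my $with_kasra = "\N{ARABIC LETTER BEH}\N{ARABIC KASRA}";
my $stripped   = remove_kasra($with_kasra);

# The named escape and the code point escape are the same character.
my $same = "\N{ARABIC KASRA}" eq "\x{0650}" ? 'yes' : 'no';

printf "same character: %s, length before: %d, after: %d\n",
    $same, length $with_kasra, length $stripped;
```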
  32. lc('Größe') eq 'größe'

  33. lc('Größe') eq 'größe'
      uc('Größe') eq 'GRÖSSE'

  34. lc('Größe') eq 'größe'
      uc('Größe') eq 'GRÖSSE'
      lc('Größe') ne lc(uc('Größe'))

  35. lc('Größe') eq 'größe'
      uc('Größe') eq 'GRÖSSE'
      lc('Größe') ne lc(uc('Größe'))
      fc('Größe') eq fc('GRÖSSE')
  36. use Unicode::CaseFold; fc('Größe') eq fc('GRÖSSE')

  37. use v5.16; fc('Größe') eq fc('GRÖSSE')

  38. use Unicode::Normalize; NFC('Größe') eq NFC("Gro\x{0308}ße")

  39. use v5.16; use Unicode::Normalize; NFC(fc('Größe')) eq NFC(fc("GRO\x{0308}SSE"))
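
The three comparison strategies from the preceding slides can be put side by side. This is a sketch with assumed variable names; the strings are written with escapes so the precomposed and decomposed forms are unambiguous. Note that `\x{...}` escapes are only interpreted in double-quoted strings.

```perl
use v5.16;    # enables the fc built-in
use warnings;
use Unicode::Normalize qw( NFC );

my $mixed = "Gr\x{F6}\x{DF}e";     # "Größe" with a precomposed ö
my $upper = "GRO\x{0308}SSE";      # "GRÖSSE" with O + combining diaeresis

# lc alone: ß is not expanded, so the strings differ.
my $lc_equal = lc($mixed) eq lc($upper) ? 1 : 0;

# fc alone: ß folds to "ss", but ö is still composed differently.
my $fc_equal = fc($mixed) eq fc($upper) ? 1 : 0;

# fc plus NFC: case and composition are both neutralized.
my $fc_nfc_equal = NFC(fc($mixed)) eq NFC(fc($upper)) ? 1 : 0;

say "lc: $lc_equal, fc: $fc_equal, fc+NFC: $fc_nfc_equal";
```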

  42. use Unicode::Collate; my $c = Unicode::Collate->new; @countries = $c->sort(@countries);

  43. use Unicode::Collate;
      my $c = Unicode::Collate->new(
          level => 2  # ignore case
      );
      $c->eq('Größe', "GRO\x{0308}SSE")
  45. use Unicode::Collate::Locale;
      my $c = Unicode::Collate::Locale->new(
          locale => 'de'
      );
      @words_de = $c->sort(@words_de);
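
To show why collation differs from Perl's built-in `sort`, here is a small sketch using Unicode::Collate with its default settings. The word list and variable names are assumptions; the comparisons use deliberately simple inputs.

```perl
use strict;
use warnings;
use Unicode::Collate;

my @items = ( 'z', "\x{E9}", 'e', 'E' );    # "\x{E9}" is é

# Built-in sort compares code points, so é lands after z.
my @by_code_point = sort @items;

# The default collator orders by letter first, then accent, then case.
my $collator = Unicode::Collate->new;
my @collated = $collator->sort(@items);

# Lower levels ignore finer distinctions when testing equality.
my $ignore_case    = Unicode::Collate->new( level => 2 );
my $ignore_accents = Unicode::Collate->new( level => 1 );

my $case_equal   = $ignore_case->eq( 'perl', 'PERL' ) ? 1 : 0;
my $accent_equal = $ignore_accents->eq( 'resume', "r\x{E9}sum\x{E9}" ) ? 1 : 0;

print "case-insensitive: $case_equal, accent-insensitive: $accent_equal\n";
```

Level 1 compares base letters only, level 2 adds accents, and level 3 adds case, which is why `level => 2` is described on the slide as ignoring case.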
  47. \d 123… … ১২৩ … ໑໒໓

  51. [0-9] 123…

  52. \w abc… 123… _ αβγ… ㄅㄆㄇ… …ج ب أ 

  56. \b abc… 123… _ αβγ… ㄅㄆㄇ… …ج ب أ 

  57. /\w/a abc… 123… _
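
The difference between `\d`, `[0-9]`, and the `/a` modifier can be checked directly. A minimal sketch, with sample strings chosen for the example:

```perl
use strict;
use warnings;

my $bengali = "\x{09E7}\x{09E8}\x{09E9}";    # Bengali digits one, two, three
my $greek   = "\x{03B1}\x{03B2}\x{03B3}";    # Greek alpha, beta, gamma

my %matched = (
    digits_unicode => ( $bengali =~ /\A\d+\z/     ? 1 : 0 ),  # any Unicode digit
    digits_range   => ( $bengali =~ /\A[0-9]+\z/  ? 1 : 0 ),  # ASCII digits only
    digits_ascii   => ( $bengali =~ /\A\d+\z/a    ? 1 : 0 ),  # /a restricts \d
    word_unicode   => ( $greek   =~ /\A\w+\z/     ? 1 : 0 ),  # any Unicode word char
    word_ascii     => ( $greek   =~ /\A\w+\z/a    ? 1 : 0 ),  # /a restricts \w
);

print "$_: $matched{$_}\n" for sort keys %matched;
```

When input must be ASCII digits, for example before numeric conversion, `[0-9]` or `/a` is the safer choice; `\d` alone accepts digits from every script.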

  58. \s

  59. \R LF (\n) CR (\r) FF (\f) CRLF (\r\n) NEL VT LS PS
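
A sketch of `\R` used for line splitting, compared with splitting on `\n` alone. The sample text mixes several of the line endings listed above and is an assumption for the example.

```perl
use strict;
use warnings;

# Assumed sample text: CRLF, LF, LINE SEPARATOR (LS), and NEL line endings.
my $text = "one\r\ntwo\nthree\x{2028}four\x{85}five";

# \R treats CRLF as a single line ending and recognizes the others too.
my @lines = split /\R/, $text;

# Splitting on \n alone leaves a stray \r and misses LS and NEL.
my @naive = split /\n/, $text;

printf "\\R: %d lines, \\n: %d lines\n", scalar @lines, scalar @naive;
```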
  63. .

  64. \X Spınal Tap ̈ n\N{COMBINING DIAERESIS} 각 กำำ நி िष CRLF (\r\n)

  65. \X Spın̈al Tap n\N{COMBINING DIAERESIS} 각 กำำ நி िष CRLF (\r\n)
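
Counting with `\X` gives user-perceived characters (extended grapheme clusters) rather than code points. A minimal sketch with assumed variable names, using the band name from the slide:

```perl
use strict;
use warnings;

# "Spın̈al Tap": dotless i, then n followed by COMBINING DIAERESIS.
my $name = "Sp\x{0131}n\x{0308}al Tap";

my $code_points = length $name;
my @clusters    = $name =~ /\X/g;

# Three conjoining Hangul jamo that render as one syllable block.
my $hangul          = "\x{1100}\x{1161}\x{11A8}";
my @hangul_clusters = $hangul =~ /\X/g;

# CRLF is also treated as a single grapheme cluster.
my @crlf_clusters = "\r\n" =~ /\X/g;

printf "%d code points, %d grapheme clusters\n",
    $code_points, scalar @clusters;
```

Operations such as truncation or reversal are usually safer on the `\X` matches than on individual code points, since they keep combining marks attached to their base characters.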
  70. \p{…}

  71. \p{General_Category=Letter}

  72. \p{Letter}

  73. \p{L}

  74. \pL

  75. L  Letter
      M  Mark
      N  Number
      P  Punctuation
      S  Symbol
      Z  Separator
      C  Other

  76. S   Symbol
      Sm  Math_Symbol
      Sc  Currency_Symbol
      Sk  Modifier_Symbol
      So  Other_Symbol
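
The four spellings of the Letter property shown above are interchangeable, and the subcategories nest inside their parent. A small sketch with an assumed sample string:

```perl
use strict;
use warnings;

# Assumed sample: "Größe = 42 €"
my $text = "Gr\x{F6}\x{DF}e = 42 \x{20AC}";

# scalar( () = ... ) counts the matches of a global pattern.
my %count = (
    long_form  => scalar( () = $text =~ /\p{General_Category=Letter}/g ),
    name_only  => scalar( () = $text =~ /\p{Letter}/g ),
    short_form => scalar( () = $text =~ /\p{L}/g ),
    shortest   => scalar( () = $text =~ /\pL/g ),
    numbers    => scalar( () = $text =~ /\p{N}/g ),
    symbols    => scalar( () = $text =~ /\p{S}/g ),    # "=" and "€"
    math       => scalar( () = $text =~ /\p{Sm}/g ),   # "="
    currency   => scalar( () = $text =~ /\p{Sc}/g ),   # "€"
    separators => scalar( () = $text =~ /\p{Z}/g ),    # the three spaces
);

print "$_: $count{$_}\n" for sort keys %count;
```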

  77. \p{Script=Latin}

  78. \p{Latin}

  79. [\p{Hiragana} \p{Katakana} \p{Han} \p{Latin} \p{Common}]

  80. [\p{Hira} \p{Kana} \p{Hani} \p{Latn} \p{Common}]

  81. Arab  Arabic
      Beng  Bengali
      Deva  Devanagari
      Egyp  Egyptian hieroglyphs
      Ethi  Ethiopic
      Grek  Greek
      Hang  Hangul
      …
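
Script properties can be used to pick apart mixed-script text. The sketch below uses the four-letter script codes from the slides on an assumed sample string; spaces fall under the Common script.

```perl
use strict;
use warnings;

# Assumed sample: Latin "Perl", two Greek letters, two Cyrillic letters,
# and one Hiragana letter, separated by spaces.
my $mixed = "Perl \x{03B1}\x{03B2} \x{0434}\x{0430} \x{3042}";

my @latin    = $mixed =~ /\p{Latn}/g;
my @greek    = $mixed =~ /\p{Grek}/g;
my @cyrillic = $mixed =~ /\p{Cyrl}/g;
my @hiragana = $mixed =~ /\p{Hira}/g;
my @common   = $mixed =~ /\p{Common}/g;

# The long form matches the same characters as the short code.
my @latin_long = $mixed =~ /\p{Script=Latin}/g;

printf "Latin %d, Greek %d, Cyrillic %d, Hiragana %d, Common %d\n",
    scalar @latin, scalar @greek, scalar @cyrillic,
    scalar @hiragana, scalar @common;
```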
  82. return $word
          if $word =~ s{ зи $}{г}x
          || $word =~ s{ е ( \p{Cyrl} ) и $}{я$1}x
          || $word =~ s{ ци $}{к}x
          || $word =~ s{ (?: та | ища ) $}{}x;
  83. \p{ASCII}

  84. \P{ASCII}

  85. use v5.18;

  86. (?[…])

  87. (?[ \d - \p{ASCII} ])

  88. (?[ \d & \p{Thai} ])

  89. no warnings 'experimental::regex_sets'; (?[ \d & \p{Thai} ])
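
The two set expressions from the slides can be tried on single characters. A minimal sketch with assumed variable names; the `no warnings` line is needed on the Perl releases where the feature is still marked experimental.

```perl
use v5.18;
use warnings;
no warnings 'experimental::regex_sets';

# Intersection: digits that are also in the Thai script.
my $thai_digit = qr/(?[ \d & \p{Thai} ])/;

# Subtraction: digits that are not ASCII.
my $non_ascii_digit = qr/(?[ \d - \p{ASCII} ])/;

my %sample = (
    ascii   => '1',
    thai    => "\x{0E51}",    # THAI DIGIT ONE
    bengali => "\x{09E7}",    # BENGALI DIGIT ONE
);

for my $name ( sort keys %sample ) {
    printf "%-8s thai digit: %s, non-ASCII digit: %s\n",
        $name,
        ( $sample{$name} =~ $thai_digit      ? 'yes' : 'no' ),
        ( $sample{$name} =~ $non_ascii_digit ? 'yes' : 'no' );
}
```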

  90. perlunicode           Unicode features
      perluniprops          Unicode properties
      perlre                regex syntax
      perlreref             regex reference
      perlrebackslash       regex escape sequences
      perlrecharclass       regex character classes
      Unicode::UCD          Unicode Character DB
      Lingua::Stem::UniNE   code examples
  91. Questions? Nick Patch @nickpatch Shutterstock