Unicode Programming in Modern Perl

In 2010 the Perl 5 language switched to a yearly major release cycle with monthly developer releases. Perl has a history of great Unicode support and the new development process has enabled the language to rapidly enhance Unicode functionality and support new Unicode standards. This talk will demonstrate the current state of Unicode in Perl, review the exciting changes in the last few years, and touch on future development.

Presented at:
◦ 2013-10-22: Internationalization & Unicode Conference 37 (IUC37), Santa Clara, CA
◦ 2014-03-27: New York Perl Mongers (NY.pm), New York, NY

Nova Patch

March 27, 2014

Transcript

  1. "Perl has some of the best Unicode support today, especially with respect to regular expressions."
     Benjamin Peterson, "The Guts of Unicode in Python", PyCon 2013
  2. UTF-8 encoded input ⇩ decode ⇩ character string ⇩ hack… hack… hack… ⇩ encode ⇩ UTF-8 encoded output
  3. use utf8;
     $word =~ s{ (?: ترین | بندي | بندی | باره | أباد | هایی | سازي | سازی | ریزي | ریزی ) $}{}x;
  5. use utf8;
     $word =~ s{ (?: أباد | باره | بندی | بندي | ترین | ریزی | ریزي | سازی | سازي | هایی ) $}{}x;
  6. use utf8;
     $word =~ s{ (?:
           ия  # definite articles for nouns:
         | ът  # masculine ∙
         | та  # feminine ∙
         | то  # neutral ∙
         | те  # plural ∙
     ) $}{}x;
  7. use utf8;
     use open qw( :encoding(UTF-8) :std );
     use Test::More tests => 66;
     use Lingua::Stem::UniNE::CS qw( stem );
     is stem('zvířatech'), 'zvíř', 'rm -atech';
     is stem('zvířatům'), 'zvíř', 'rm -atům';
     is stem('zvířata'), 'zvíř', 'rm -ata';
     is stem('zvířaty'), 'zvíř', 'rm -aty';
  11. sub remove_kasra {
          my ($word) = @_;
          $word =~ s{ \x{0650} $}{}x;
          return $word;
      }
  13. use charnames ':full';
      sub remove_kasra {
          my ($word) = @_;
          $word =~ s{ \x{0650} $}{}x;
          return $word;
      }
  14. use charnames ':full';
      sub remove_kasra {
          my ($word) = @_;
          $word =~ s{ \N{ARABIC KASRA} $}{}x;
          return $word;
      }
  15. use v5.16;
      sub remove_kasra {
          my ($word) = @_;
          $word =~ s{ \N{ARABIC KASRA} $}{}x;
          return $word;
      }
  16. use utf8;
      use Unicode::Collate;
      my $c = Unicode::Collate->new(
          level => 2  # ignore case
      );
      $c->eq('Größe', "GRO\x{0308}SSE")  # double quotes so \x{0308} is interpolated
  18. \s
  19. .
  20. \pL
  21. return $word
          if $word =~ s{ зи $}{г}x
          || $word =~ s{ е ( \p{Cyrl} ) и $}{я$1}x
          || $word =~ s{ ци $}{к}x
          || $word =~ s{ (?: та | ища ) $}{}x;
  22. perlunicode: Unicode features
      perluniprops: Unicode properties
      perlre: regex syntax
      perlreref: regex reference
      perlrebackslash: regex escape sequences
      perlrecharclass: regex character classes
      Unicode::UCD: Unicode Character DB
      Lingua::Stem::UniNE: code examples
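
The flow on slide 2 (decode, work on a character string, encode) can be sketched with the core Encode module. This is only an illustration, not code from the talk: the helper name strip_kasra_utf8 and the sample bytes are assumptions, and the substitution in the middle is borrowed from the remove_kasra example on slide 15.

```perl
use v5.16;      # enables strict, say, and \N{...} character names
use warnings;
use Encode qw( decode encode );

# Hypothetical helper (not from the talk): wraps the three steps
# decode -> process -> encode around the kasra-removal substitution.
sub strip_kasra_utf8 {
    my ($bytes) = @_;

    # 1. decode: UTF-8 encoded bytes become a character string
    my $word = decode('UTF-8', $bytes);

    # 2. hack: operate on characters, not bytes
    $word =~ s{ \N{ARABIC KASRA} $}{}x;

    # 3. encode: the character string becomes UTF-8 bytes again
    return encode('UTF-8', $word);
}

# "\xD8\xA8\xD9\x90" is the UTF-8 encoding of U+0628 U+0650
# (ARABIC LETTER BEH followed by ARABIC KASRA).
my $out = strip_kasra_utf8("\xD8\xA8\xD9\x90");

# Show the resulting bytes in hex; prints "D8 A8"
say join ' ', map { sprintf '%02X', ord } split //, $out;
```

For whole filehandles, the `use open qw( :encoding(UTF-8) :std );` line on slide 7 applies the same decode and encode steps as I/O layers, so explicit calls like the ones above are mostly useful for byte strings that arrive from other sources.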