Unicode Best Practices in Perl

Unicode Best Practices Nick Patch @nickpatch Shutterstock

"Perl has some of the best Unicode support today, especially with respect to regular expressions."
Benjamin Peterson, The Guts of Unicode in Python, PyCon 2013

use v5.8;

use v5.12;

use v5.14;

UTF-8 encoded input ⇩ decode ⇩ character string ⇩ hack… hack… hack… ⇩ encode ⇩ UTF-8 encoded output
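The decode, hack, encode flow can be sketched with the core Encode module. This is a minimal illustration, not code from the talk: the input bytes are a made-up example, and `uc` stands in for whatever processing the program actually does.

```perl
use v5.14;      # also enables strict and the unicode_strings feature
use warnings;
use Encode qw( decode encode );

# Hypothetical input: the UTF-8 bytes for "Größe", as read from a file or socket.
my $bytes = "Gr\xC3\xB6\xC3\x9Fe";

# Decode at the boundary: bytes in, character string out.
my $string = decode('UTF-8', $bytes);

# Hack: work on characters, not bytes.
my $upper = uc $string;

# Encode at the boundary: character string in, bytes out.
my $output = encode('UTF-8', $upper);

printf "%d bytes in, %d characters, %d bytes out\n",
    length $bytes, length $string, length $output;
```

Keeping the decoding and encoding at the edges means everything in between can rely on `length`, `uc`, and regular expressions operating on characters.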

use utf8;

s/ / /g

use utf8;
$word =~ s{ (?:
    آباد | باره | بندی | بندي | ترین | ریزی | ریزي | سازی | سازي | هایی
) $}{}x;

use utf8;
$word =~ s{ (?:
      ия  # definite articles for nouns:
    | ът  # ∙ masculine
    | та  # ∙ feminine
    | то  # ∙ neutral
    | те  # ∙ plural
) $}{}x;

=encoding UTF-8

=head1 NAME

Lingua::Stem::UniNE - University of Neuchâtel stemmers

use open qw( :encoding(UTF-8) :std );

use utf8;
use open qw( :encoding(UTF-8) :std );
use Test::More tests => 66;
use Lingua::Stem::UniNE::CS qw( stem );
is stem('zvířatech'), 'zvíř', 'rm -atech';
is stem('zvířatům'),  'zvíř', 'rm -atům';
is stem('zvířata'),   'zvíř', 'rm -ata';
is stem('zvířaty'),   'zvíř', 'rm -aty';

decode_json($json)

$res->decoded_content

decode('UTF-8', $arg)
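Each of these calls turns encoded bytes into a character string at the point where data enters the program. A small sketch of two of them, using the core JSON::PP and Encode modules; the payload and variable names are made up for illustration:

```perl
use v5.14;
use warnings;
use JSON::PP qw( decode_json );
use Encode   qw( decode );

# Hypothetical payload: UTF-8 encoded JSON bytes, as they might arrive over HTTP.
my $json = qq({"city":"Neuch\xC3\xA2tel"});

# decode_json expects UTF-8 bytes and returns character strings.
my $data = decode_json($json);

# A plain argument, e.g. from @ARGV, is decoded explicitly.
my $arg  = "Neuch\xC3\xA2tel";
my $city = decode('UTF-8', $arg);

printf "%d bytes, %d characters\n", length $arg, length $city;
```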

use charnames ':full';

use charnames ':full';
sub remove_kasra {
    my ($word) = @_;
    $word =~ s{ \N{ARABIC KASRA} $}{}x;
    return $word;
}

use v5.16;
sub remove_kasra {
    my ($word) = @_;
    $word =~ s{ \N{ARABIC KASRA} $}{}x;
    return $word;
}

\d 123… … ১২৩ … ໑໒໓

[0-9] 123…
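The difference between `\d` and `[0-9]` can be checked directly. The helper name `digit_report` is invented for this sketch; the `/a` modifier (available from Perl 5.14) is a third option that restricts `\d` to ASCII.

```perl
use v5.14;
use warnings;
use utf8;

# Hypothetical helper: reports which digit patterns accept the whole input.
sub digit_report {
    my ($input) = @_;
    my %report = (
        unicode => ( $input =~ /\A\d+\z/    ? 1 : 0 ),  # any Unicode decimal digit
        range   => ( $input =~ /\A[0-9]+\z/ ? 1 : 0 ),  # ASCII digits only
        ascii   => ( $input =~ /\A\d+\z/a   ? 1 : 0 ),  # /a restricts \d to ASCII
    );
    return \%report;
}

my $western = digit_report('123');   # accepted by all three patterns
my $bengali = digit_report('১২৩');   # accepted by plain \d only
```

When input must be usable as a Perl number, `[0-9]` or `/a` is the safer choice.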

\w abc… 123… _ αβγ… … ㄅㄆㄇ … أ ب ج

\b abc… 123… _ αβγ… … ㄅㄆㄇ … أ ب ج

[A-Za-z0-9_] abc… 123… _

\p{PerlWord} abc… 123… _
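A quick way to see the two definitions of "word character" side by side. The sample strings are chosen for illustration only.

```perl
use v5.14;
use warnings;
use utf8;

my @words = ('abc_123', 'αβγ', 'ㄅㄆㄇ');

# \w accepts letters, digits, and underscore from any script.
my @unicode_words = grep { /\A\w+\z/ } @words;

# \p{PerlWord} is the ASCII-only set, equivalent to [A-Za-z0-9_].
my @ascii_words = grep { /\A\p{PerlWord}+\z/ } @words;

printf "%d match \\w, %d match \\p{PerlWord}\n",
    scalar @unicode_words, scalar @ascii_words;
```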

\R LF (\n) CR (\r) FF (\f) CRLF (\r\n) NEL VT LS PS
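Because `\R` treats CRLF as a single line break and also covers the less common separators, it is handy for splitting text of unknown origin into lines. A minimal sketch with a made-up input string:

```perl
use v5.14;
use warnings;

# Hypothetical text mixing LF, CRLF, LINE SEPARATOR (U+2028), and NEL (U+0085).
my $text = "one\ntwo\r\nthree\x{2028}four\x{85}five";

# \R matches each line break once; CRLF does not produce an empty line.
my @lines = split /\R/, $text;

say scalar(@lines), ' lines: ', join(', ', @lines);
```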

\X n̈ Spın̈al Tap n\N{COMBINING DIAERESIS} \r\n
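`\X` matches one user-perceived character (an extended grapheme cluster), so it can be used to count or split what a reader would call characters. A small sketch; the variable names are illustrative:

```perl
use v5.14;
use warnings;

# "Spın̈al Tap": dotless i (U+0131), then n followed by COMBINING DIAERESIS (U+0308).
my $name = "Sp\x{131}n\x{308}al Tap";

my $code_points = length $name;       # counts the combining mark separately
my @clusters    = $name =~ /\X/g;     # n plus its diaeresis is one cluster

# CRLF also counts as a single grapheme cluster.
my $crlf_is_one = "\r\n" =~ /\A\X\z/ ? 1 : 0;

printf "%d code points, %d grapheme clusters\n",
    $code_points, scalar @clusters;
```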

\p{ASCII}

\P{ASCII}

\p{General_Category=Letter}

\p{Letter}

L Letter
M Mark
N Number
P Punctuation
S Symbol
Z Separator
C Other

S Symbol
Sm Math_Symbol
Sc Currency_Symbol
Sk Modifier_Symbol
So Other_Symbol
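General categories and their subcategories can be used directly in a pattern by their short names. A brief sketch on a made-up string:

```perl
use v5.14;
use warnings;
use utf8;

my $text = 'Price: $5, €4, £3 or ¥500 (x + y = z)';

my @currency = $text =~ /\p{Sc}/g;   # Currency_Symbol: $ € £ ¥
my @math     = $text =~ /\p{Sm}/g;   # Math_Symbol: + =
my @symbols  = $text =~ /\p{S}/g;    # every Symbol subcategory together

printf "%d currency, %d math, %d symbols in total\n",
    scalar @currency, scalar @math, scalar @symbols;
```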

\p{Script=Latin}

\p{Latin}

[\p{Hiragana} \p{Katakana} \p{Han} \p{Latin} \p{Common}]

[\p{Hira} \p{Kana} \p{Hani} \p{Latn} \p{Common}]
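A character class built from script properties can act as a rough filter for text in a given writing system. The sketch below reuses the class from the slide; the sample strings and variable names are made up.

```perl
use v5.14;
use warnings;
use utf8;

# Scripts commonly found in Japanese text, plus Common for
# punctuation, spaces, and digits.
my $japanese = qr/\A[\p{Hiragana}\p{Katakana}\p{Han}\p{Latin}\p{Common}]+\z/;

my $is_japanese_1 = '日本語のテキスト' =~ $japanese ? 1 : 0;   # Han, Hiragana, Katakana
my $is_japanese_2 = 'Привет'          =~ $japanese ? 1 : 0;   # Cyrillic, rejected

say "$is_japanese_1 $is_japanese_2";
```

This only checks which scripts appear; it does not detect the language itself.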

Arab Arabic
Beng Bengali
Deva Devanagari
Egyp Egyptian hieroglyphs
Ethi Ethiopic
Grek Greek
Hang Hangul
…

return $word
    if $word =~ s{ зи $}{г}x
    || $word =~ s{ е ( \p{Cyrl} ) и $}{я$1}x
    || $word =~ s{ ци $}{к}x
    || $word =~ s{ (?: та | ища ) $}{}x;

lc('Größe') eq 'größe'
uc('Größe') eq 'GRÖSSE'
lc('Größe') ne lc(uc('Größe'))

use Unicode::CaseFold;
fc('Größe') eq fc('GRÖSSE')

use v5.16;
fc('Größe') eq fc('GRÖSSE')
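A short comparison of `lc` and `fc` for case-insensitive matching, using the same two words as the slides. The variable names are illustrative.

```perl
use v5.16;      # enables the fc feature
use warnings;
use utf8;

my ($title, $shout) = ('Größe', 'GRÖSSE');

# lc keeps ß in one string and produces ss in the other, so they differ.
my $lc_equal = lc($title) eq lc($shout) ? 1 : 0;

# fc folds ß to ss, so both strings fold to the same value.
my $fc_equal = fc($title) eq fc($shout) ? 1 : 0;

say "lc: $lc_equal, fc: $fc_equal";
```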

use Unicode::Normalize; NFC('Größe') eq NFC('Gro◌̈ße')

use v5.16; use Unicode::Normalize; NFC(fc('Größe')) eq NFC(fc('GRO◌̈SSE'))
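In running code the decomposed form is written with an explicit combining character rather than the ◌ placeholder used on the slides. A minimal sketch with the core Unicode::Normalize module; the variable names are made up:

```perl
use v5.16;
use warnings;
use Unicode::Normalize qw( NFC );

my $precomposed = "Gr\x{F6}\x{DF}e";      # ö as a single code point
my $decomposed  = "GRO\x{308}SSE";        # O followed by COMBINING DIAERESIS

# Without folding and normalization the strings differ.
my $raw_equal = $precomposed eq $decomposed ? 1 : 0;

# Case-fold first, then normalize, then compare.
my $folded_equal = NFC(fc($precomposed)) eq NFC(fc($decomposed)) ? 1 : 0;

say "raw: $raw_equal, folded and normalized: $folded_equal";
```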

UTF-8 encoded input ⇩ decode ⇩ character string ⇩ NFD ⇩ hack… hack… hack… ⇩ NFC ⇩ encode ⇩ UTF-8 encoded output

use Unicode::Collate;
my $c = Unicode::Collate->new;
@countries = $c->sort(@countries);

use Unicode::Collate;
my $c = Unicode::Collate->new(
    level => 2  # ignore case
);
$c->eq('Größe', 'GRO◌̈SSE')

use Unicode::Collate::Locale;
my $c = Unicode::Collate::Locale->new( locale => 'de' );
@words_de = $c->sort(@words_de);
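To see why a collator is needed, compare the built-in `sort`, which orders by code point, with Unicode::Collate using its default settings. The word list is made up for this sketch.

```perl
use v5.14;
use warnings;
use utf8;
use Unicode::Collate;

my @words = ('zoo', 'Öl', 'apple');

# Built-in sort compares code points, so Ö (U+00D6) lands after z.
my @by_code_point = sort @words;

# The collator compares base letters first, so Öl sorts with the o words.
my $collator     = Unicode::Collate->new;
my @by_collation = $collator->sort(@words);
```

A locale-specific collator is created the same way through Unicode::Collate::Locale when a language's own ordering rules are required.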

use Test::More;
new_ok 'Text::CSV::Hashify' => [
    file     => $file,
    max_rows => '٤٠',
];

@nickpatch
