Unicode Regular Expressions in Perl

Unicode Regular Expressions s/ / /g Nick Patch Shutterstock

"Perl has some of the best Unicode support today, especially with respect to regular expressions."
Benjamin Peterson, "The Guts of Unicode in Python", PyCon 2013

UTF-8 encoded input
⇩
decode
⇩
character string
⇩
hack… hack… hack…
⇩
encode
⇩
UTF-8 encoded output
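The decode, process, encode pipeline above can be sketched with the core Encode module. This is a minimal illustration rather than code from the talk: the sample bytes and variable names are assumptions chosen for the example.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical input: the bytes of "café" encoded as UTF-8.
my $bytes = "caf\xC3\xA9";

# Decode: UTF-8 bytes become a string of characters.
my $chars = decode('UTF-8', $bytes);

# Hack: work on characters, not bytes.
my $byte_len = length $bytes;   # counts bytes
my $char_len = length $chars;   # counts characters
my $upper    = uc $chars;       # case mapping follows Unicode rules

# Encode: characters become UTF-8 bytes again for output.
my $out = encode('UTF-8', $upper);

printf "%d bytes in, %d characters, %d bytes out\n",
    $byte_len, $char_len, length $out;
# prints "5 bytes in, 4 characters, 5 bytes out"
```

Keeping the decode and encode steps at the edges of the program means everything in between, including every regex, sees characters rather than bytes.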

use utf8;

use utf8;

$word =~ s{ (?:
    آباد | باره | بندی | بندي | ترین
    | ریزی | ریزي | سازی | سازي | هایی
) $}{}x;

use utf8;

$word =~ s{ (?:
      ия  # definite articles for nouns:
    | ът  # ∙ masculine
    | та  # ∙ feminine
    | то  # ∙ neutral
    | те  # ∙ plural
) $}{}x;

use open qw( :encoding(UTF-8) :std );

use utf8;
use open qw( :encoding(UTF-8) :std );
use Test::More tests => 66;
use Lingua::Stem::UniNE::CS qw( stem );

is stem('zvířatech'), 'zvíř', 'rm -atech';
is stem('zvířatům'),  'zvíř', 'rm -atům';
is stem('zvířata'),   'zvíř', 'rm -ata';
is stem('zvířaty'),   'zvíř', 'rm -aty';

decode_json($json)

decode('UTF-8', $arg)
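Input that does not come through an encoding-aware filehandle still has to be decoded. The sketch below assumes the two calls above refer to JSON text and to command-line arguments. It uses JSON::PP (a core module) as a stand-in for whichever JSON module is in use, and @raw_args as a stand-in for @ARGV; the sample data is hypothetical.

```perl
use strict;
use warnings;
use Encode   qw( decode );
use JSON::PP qw( decode_json );

# Hypothetical JSON text as UTF-8 bytes, such as an HTTP response body.
my $json = qq({"city":"K\xC3\xB6ln"});

# decode_json expects UTF-8 bytes and returns character strings.
my $data     = decode_json($json);
my $city_len = length $data->{city};

# Command-line arguments arrive as bytes; decode each one explicitly.
my @raw_args = ("na\xC3\xAFve");    # stands in for @ARGV
my @args     = map { decode('UTF-8', $_) } @raw_args;
my $arg_len  = length $args[0];

print "$city_len $arg_len\n";   # prints "4 5"
```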

use v5.12;

use v5.14;

use charnames ':full';

use charnames ':full';

sub remove_kasra {
    my ($word) = @_;
    $word =~ s{ \N{ARABIC KASRA} $}{}x;
    return $word;
}

use v5.16;

sub remove_kasra {
    my ($word) = @_;
    $word =~ s{ \N{ARABIC KASRA} $}{}x;
    return $word;
}

\d 123… ১২৩… ໑໒໓…

[0-9] 123…
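The difference between \d and [0-9] can be checked directly. This is a small sketch with sample strings chosen for illustration. The Bengali digits are written as escapes so the example does not depend on the source file's encoding. The /a modifier, available from Perl 5.14, is shown as one more way to restrict \d to ASCII.

```perl
use v5.14;   # enables say and strict; /a needs 5.14 or later
use warnings;

my $ascii   = '123';
my $bengali = "\x{09E7}\x{09E8}\x{09E9}";   # Bengali digits one, two, three

my @results = (
    ($ascii   =~ /^\d+$/    ? 'yes' : 'no'),   # \d matches ASCII digits
    ($bengali =~ /^\d+$/    ? 'yes' : 'no'),   # \d matches any Unicode decimal digit
    ($bengali =~ /^[0-9]+$/ ? 'yes' : 'no'),   # the explicit range is ASCII only
    ($bengali =~ /^\d+$/a   ? 'yes' : 'no'),   # /a restricts \d to ASCII
);

say join ' ', @results;   # prints "yes yes no no"
```

When the matched text is later treated as a number, [0-9] or /a is usually the safer choice, since Perl's numeric conversion only understands ASCII digits.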

\w abc… 123… _ αβγ… かきく… ب ج د…

[A-Za-z0-9_] abc… 123… _

\b abc… 123… _ αβγ… かきく… ب ج د…
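Because \b is defined in terms of \w, both follow the same Unicode rules on character strings. The sketch below uses an illustrative sample string of Greek, Hiragana, and ASCII words, written with escapes so it does not depend on source encoding.

```perl
use v5.12;   # enables strict and the unicode_strings feature
use warnings;

# "αβγ かきく abc_1"
my $text = "\x{3B1}\x{3B2}\x{3B3} \x{304B}\x{304D}\x{304F} abc_1";

# \w+ finds words in any script; the ASCII class only finds the last one.
my @unicode_words = $text =~ /\w+/g;
my @ascii_words   = $text =~ /[A-Za-z0-9_]+/g;

# \b marks the edges of \w runs, so this uppercases the first character
# of every word that has a case distinction.
(my $title = $text) =~ s/\b(\w)/\u$1/g;

say scalar(@unicode_words), ' ', scalar(@ascii_words);   # prints "3 1"
```

After the substitution, $title begins with GREEK CAPITAL LETTER ALPHA and ends with "Abc_1"; the Hiragana word is unchanged because it has no case.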

\R LF (\n) CR (\r) FF (\f) CRLF (\r\n) NEL VT LS PS
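A common use of \R is splitting text whose line endings are mixed or unknown. This is a minimal sketch with a made-up input string; note that \R treats CRLF as a single line break rather than two.

```perl
use v5.10;   # \R is available from Perl 5.10; also enables say
use strict;
use warnings;

# LF, CRLF, CR, and LINE SEPARATOR (U+2028) all in one string.
my $text  = "one\ntwo\r\nthree\rfour\x{2028}five";
my @lines = split /\R/, $text;

say join '|', @lines;   # prints "one|two|three|four|five"
```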

\X n̈ Spın̈al Tap n\N{COMBINING DIAERESIS} \r\n
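\X matches one extended grapheme cluster, which is closer to what a reader perceives as a single character than a code point is. The sketch below counts both for the band name above, and also checks that CRLF is treated as one cluster. The variable names are illustrative.

```perl
use v5.16;   # \N{...} works without loading charnames from 5.16
use warnings;

# "Spın̈al Tap": a dotless i, then n followed by a combining diaeresis.
my $name = "Sp\N{LATIN SMALL LETTER DOTLESS I}n\N{COMBINING DIAERESIS}al Tap";

my $code_points = length $name;              # counts code points
my $graphemes   = () = $name  =~ /\X/g;      # counts grapheme clusters
my $crlf        = () = "\r\n" =~ /\X/g;      # CRLF is a single cluster

say "$code_points $graphemes $crlf";   # prints "11 10 1"
```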

\p{ASCII}

\P{ASCII}

\p{General_Category=Letter}

\p{Letter}

L  Letter
M  Mark
N  Number
P  Punctuation
S  Symbol
Z  Separator
C  Other

S   Symbol
Sm  Math_Symbol
Sc  Currency_Symbol
Sk  Modifier_Symbol
So  Other_Symbol
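General categories and their subcategories can be used interchangeably in long or short form. The sketch below counts matches in a made-up price string to show how the Symbol category contains its subcategories.

```perl
use strict;
use warnings;

# "Price: €42 + ¥7", written with escapes for the non-ASCII characters.
my $text = "Price: \x{20AC}42 + \x{A5}7";

my $currency = () = $text =~ /\p{Sc}/g;            # EURO SIGN and YEN SIGN
my $math     = () = $text =~ /\p{Math_Symbol}/g;   # PLUS SIGN
my $symbols  = () = $text =~ /\p{Symbol}/g;        # all of the above
my $letters  = () = $text =~ /\p{Letter}/g;        # the letters of "Price"

print "$currency $math $symbols $letters\n";   # prints "2 1 3 5"
```

The colon is not counted as a symbol because its general category is Punctuation.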

\p{Script=Latin}

\p{Latin}

[\p{Hiragana} \p{Katakana} \p{Han} \p{Latin} \p{Common}]

[\p{Hira} \p{Kana} \p{Hani} \p{Latn} \p{Common}]

Arab  Arabic
Beng  Bengali
Deva  Devanagari
Egyp  Egyptian hieroglyphs
Ethi  Ethiopic
Grek  Greek
Hang  Hangul
…
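One practical use of script properties is spotting text that mixes scripts, such as a lookalike letter slipped into a Latin word. This is a sketch with hypothetical sample strings, not an example from the talk.

```perl
use strict;
use warnings;

my $latin = 'paypal';
my $mixed = "p\x{430}ypal";   # second letter is CYRILLIC SMALL LETTER A

# A word passes only if every character belongs to the Latin script.
my @verdicts = map { /^\p{Latin}+$/ ? 'latin' : 'mixed' } ($latin, $mixed);

# The four-letter script code works the same way as the full name.
my $cyrillic = () = $mixed =~ /\p{Cyrl}/g;

print "@verdicts $cyrillic\n";   # prints "latin mixed 1"
```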

return $word
    if $word =~ s{ зи $}{г}x
    || $word =~ s{ е ( \p{Cyrl} ) и $}{я$1}x
    || $word =~ s{ ци $}{к}x
    || $word =~ s{ (?: та | ища ) $}{}x;

use v5.18;

(?[ ])

(?[ \d - \p{ASCII} ])

(?[ \d & \p{Thai} ])
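The two set expressions above can be tried on a short string. The sample below is a sketch with three hypothetical digits from different scripts. On Perl 5.18 the extended bracketed character class syntax was experimental and warned, so the sketch disables that warning category; newer releases accept the syntax without the warning.

```perl
use v5.18;
use warnings;
no warnings 'experimental::regex_sets';

# ASCII 7, THAI DIGIT SEVEN, BENGALI DIGIT SEVEN
my $text = "7 \x{0E57} \x{09ED}";

# Set difference: any digit except the ASCII ones.
my @non_ascii_digits = $text =~ /(?[ \d - \p{ASCII} ])/g;

# Set intersection: digits that also belong to the Thai script.
my @thai_digits = $text =~ /(?[ \d & \p{Thai} ])/g;

say scalar(@non_ascii_digits), ' ', scalar(@thai_digits);   # prints "2 1"
```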

perlre: regex syntax
perlrebackslash: regex escape sequences
perlrecharclass: regex character classes
perlunicode: Unicode features
Lingua::Stem::UniNE: code examples

@nickpatch
