Unicode Regular Expressions in Perl

Perl’s regular expression engine provides rich features for matching and parsing Unicode strings. Recent releases of Perl have added powerful new modifiers, character classes, and other special escape sequences that you can add to your toolkit. The behavior of regex metacharacters has also been evolving to conform to Unicode standards, and it’s important to understand the differences.

This talk will be useful to programmers of all levels who want to learn about Unicode character properties and new regex features. A basic knowledge of regular expressions is required.

Presented at:
◦  2013-04-20: DC–Baltimore Perl Workshop (DCBPW) 2013, Baltimore, MD

Nova Patch

April 20, 2013

Transcript

  1. "Perl has some of the best Unicode support today, especially with respect to regular expressions."
    Benjamin Peterson, "The Guts of Unicode in Python", PyCon 2013
  2. UTF-8 encoded input ⇩ decode ⇩ character string ⇩ hack… hack… hack… ⇩ encode ⇩ UTF-8 encoded output
  3. use utf8;
    $word =~ s{
        (?: آباد
          | باره
          | بندی
          | بندي
          | ترین
          | ریزی
          | ریزي
          | سازی
          | سازي
          | هایی
        ) $
    }{}x;
  4. use utf8;
    $word =~ s{
        (?: ия  # definite articles for nouns:
          | ът  # ∙ masculine
          | та  # ∙ feminine
          | то  # ∙ neutral
          | те  # ∙ plural
        ) $
    }{}x;
  5. use utf8;
    use open qw( :encoding(UTF-8) :std );
    use Test::More tests => 66;
    use Lingua::Stem::UniNE::CS qw( stem );
    is stem('zvířatech'), 'zvíř', 'rm -atech';
    is stem('zvířatům'),  'zvíř', 'rm -atům';
    is stem('zvířata'),   'zvíř', 'rm -ata';
    is stem('zvířaty'),   'zvíř', 'rm -aty';
  6. use charnames ':full';
    sub remove_kasra {
        my ($word) = @_;
        $word =~ s{ \N{ARABIC KASRA} $}{}x;
        return $word;
    }
  7. use v5.16;
    sub remove_kasra {
        my ($word) = @_;
        $word =~ s{ \N{ARABIC KASRA} $}{}x;
        return $word;
    }
  8. \d

  9. \w

  10. \s

  11. \R

  12. .

  13. \X

  14. \p

  15. \pL

  16. return $word
        if $word =~ s{ зи $}{г}x
        || $word =~ s{ е ( \p{Cyrl} ) и $}{я$1}x
        || $word =~ s{ ци $}{к}x
        || $word =~ s{ (?: та | ища ) $}{}x;
  17. perlre: regex syntax
    perlrebackslash: regex escape sequences
    perlrecharclass: regex character classes
    perlunicode: Unicode features
    Lingua::Stem::UniNE: code examples
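The slides name the decode/process/encode pipeline and the escapes \d, \R, ., \X, \p and \pL without showing them side by side. The following is a minimal sketch of how they might behave, not code from the talk. The sample strings (бели, an Arabic-Indic digit, an "e" followed by a combining accent) and all variable names are assumptions chosen for illustration.

```perl
use v5.16;       # enables strict, unicode_strings, and \N{...} name lookup
use warnings;
use utf8;        # this source file itself is UTF-8
use Encode qw( decode encode );

# Slide 2: decode bytes, work on characters, encode on the way out.
my $bytes_in = encode('UTF-8', 'бели');      # stand-in for raw input bytes
my $word     = decode('UTF-8', $bytes_in);   # now a character string

# Slide 16 style rule: е + one Cyrillic letter + и at the end becomes я + that letter.
(my $stem = $word) =~ s{ е ( \p{Cyrl} ) и $}{я$1}x;
my $bytes_out = encode('UTF-8', $stem);

# \d follows Unicode rules on this string; the /a modifier restricts it to ASCII 0-9.
my $digit     = "\N{ARABIC-INDIC DIGIT THREE}";
my $d_unicode = $digit =~ /\A \d \z/x  ? 1 : 0;
my $d_ascii   = $digit =~ /\A \d \z/xa ? 1 : 0;

# \R treats the CRLF pair as a single line break.
my $crlf_ok = "a\r\nb" =~ /\A a \R b \z/x ? 1 : 0;

# "." matches one code point at a time; \X matches a whole grapheme cluster.
my $grapheme = "e\N{COMBINING ACUTE ACCENT}";
my $dots     = () = $grapheme =~ /./gs;
my $clusters = () = $grapheme =~ /\X/g;

# \pL is any letter; \p{Cyrl} is limited to the Cyrillic script.
my $is_letter   = 'я' =~ /\A \pL \z/x      ? 1 : 0;
my $z_is_cyrl   = 'z' =~ /\A \p{Cyrl} \z/x ? 1 : 0;

printf "%d bytes in, %d characters\n", length($bytes_in), length($word);
print  "$bytes_out\n";   # already encoded, so safe to print as bytes
```

Working on the decoded string is what lets \p{Cyrl} and \X see whole characters; applied to the raw UTF-8 bytes, the same patterns would be matching individual bytes instead.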