Fundamental Unicode in Perl

Fundamental Unicode Nick Patch

“The smallest component of written language that has semantic value;
refers to the abstract meaning and/or shape, rather than a specific shape.” —The Unicode Consortium What Is a Character?

Glyphs are visual representations of characters. Fonts are collections of
glyphs. There may be many different glyphs for the same character. This talk is not about fonts or glyphs. What Is a Glyph?

a b c π ث й Letters

1 2 3 ໓ ๓ ३ Numbers

. / ? 「 « » 」 Punctuation

™ © ≠ ☺ ☠ Symbols

CARRIAGE RETURN NO-BREAK SPACE COMBINING GRAPHEME JOINER RIGHT-TO-LEFT MARK Control
Characters

Many people use “character set” to mean one or more
of these: character code, character encoding, or character repertoire. This makes for a confusing situation. Character Set

A defined mapping of characters to numbers. A ⇒ 41,
B ⇒ 42, C ⇒ 43 (hexadecimal). Each value in a character code is called a code point. Character Code
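In Perl, this mapping can be inspected with the built-in ord and chr functions. A minimal sketch (the variable names are illustrative and not from the slides):

```perl
use strict;
use warnings;

# ord maps a character to its code point; chr maps a code point back.
my $code_point = ord 'A';                     # 65 in decimal
my $hex        = sprintf '%X', $code_point;   # "41" in hexadecimal
my $char       = chr 0x42;                    # "B"

print "A => $hex\n";
print "42 => $char\n";
```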

An algorithm to convert code points to a digital form
for ease of transmitting or storing data. 41 (A) ⇒ 1000001 42 (B) ⇒ 1000010 43 (C) ⇒ 1000011 Character Encoding

A character repertoire is a collection of distinct characters. Character
codes, keyboards, and written languages all have well-defined character repertoires. Character Repertoire

ASCII character code: 128 code points character encoding: 7 bits
each Latin 1 (ISO-8859-1) character code: 256 code points character encoding: 8 bits (1 byte) each Character Codes & Encodings

Unicode (character code) 1,112,064 code points (110,000+ defined) character encodings:
UTF-8 — 1 to 4 bytes each UTF-16 — 2 or 4 bytes each UTF-32 — 4 bytes each Character Codes & Encodings
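The byte counts above can be checked with the core Encode module. This is a small sketch, not part of the original slides; it assumes the big-endian variants UTF-16BE and UTF-32BE so that no byte order mark is added to the output, and the %bytes hash is an illustrative name.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Count the encoded bytes of three code points in each encoding.
my %bytes;
for my $cp (0x41, 0x0ED3, 0x1F4A9) {
    my $char = chr $cp;
    for my $encoding ('UTF-8', 'UTF-16BE', 'UTF-32BE') {
        $bytes{$cp}{$encoding} = length encode($encoding, $char);
        printf "U+%04X in %-8s: %d bytes\n",
            $cp, $encoding, $bytes{$cp}{$encoding};
    }
}
```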

A U+0041 LATIN CAPITAL LETTER A
໓ U+0ED3 LAO DIGIT THREE
💩 U+1F4A9 PILE OF POO
Code Points
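Code point names like these can be looked up at run time with charnames::viacode from the core charnames module. A minimal sketch, assuming a Perl recent enough to know all three code points; the @names array and the '(unnamed)' fallback are illustrative:

```perl
use strict;
use warnings;
use charnames ();

# Look up the official Unicode name for each code point.
my @names;
for my $cp (0x41, 0x0ED3, 0x1F4A9) {
    my $name = charnames::viacode($cp) // '(unnamed)';
    push @names, $name;
    printf "U+%04X %s\n", $cp, $name;
}
```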

Some code points have precomposed diacritics. ȫ U+022B LATIN SMALL
LETTER O WITH DIAERESIS AND MACRON Code Points

Other characters must be composed from multiple code points using
“combining characters.” n̈ U+006E LATIN SMALL LETTER N U+0308 COMBINING DIAERESIS Code Points

Any series of code points that are composed into a
single user-perceived character. Informally known as “graphemes.” A (U+0041) n̥̈ (U+006E U+0308 U+0325) CRLF (U+000D U+000A) Grapheme Clusters

U+1F42A DROMEDARY CAMEL Time for some…

# ¡jalapeño!
say "\x{A1}jalape\x{F1}o!";
use v5.12;
say "\N{U+00A1}jalape\N{U+00F1}o!";
String constants ... TIMTOWTDI

use charnames qw( :full );
say "\N{INVERTED EXCLAMATION MARK}jalape\N{LATIN SMALL LETTER N WITH TILDE}o!";
use utf8;
say '¡jalapeño!';
String constants ... TIMTOWTDI

=encoding UTF-8

=head1 ¡jalapeño!

String constants ... POD

UTF-8 encoded input ⇩ decode ⇩ Perl Unicode string ⇩
encode ⇩ UTF-8 encoded output I/O

# decode while reading
open my $fh, '<:encoding(UTF-8)', $filename;
# encode while writing
open my $fh, '>:encoding(UTF-8)', $filename;
# or set the layer on a handle that is already open
binmode $fh, ':encoding(UTF-8)';
binmode STDIN, ':encoding(UTF-8)';
I/O

use open qw( :encoding(UTF-8) ); open my $fh, '<', $filename;
I/O

# :std for STDIN, STDOUT, STDERR
use open qw( :encoding(UTF-8) :std );
# CPAN module to enable everything UTF-8
use utf8::all;
I/O
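The effect of the :encoding(UTF-8) layer can be tried without touching the file system by opening in-memory filehandles. This is only a sketch; the in-memory handles and the variable names are assumptions made for the demonstration and do not come from the slides.

```perl
use strict;
use warnings;

# UTF-8 bytes for U+044E (CYRILLIC SMALL LETTER YU) plus a newline.
my $bytes = "\xD1\x8E\n";

# Reading through the layer decodes bytes into characters.
open my $in, '<:encoding(UTF-8)', \$bytes or die "Cannot open: $!";
my $line = <$in>;
close $in;
chomp $line;

# Writing through the layer encodes characters back into bytes.
my $written = '';
open my $out, '>:encoding(UTF-8)', \$written or die "Cannot open: $!";
print {$out} $line;
close $out;

printf "read %d character(s), wrote %d byte(s)\n",
    length $line, length $written;
```

One character is read and two bytes are written, which mirrors the decode and encode steps of the I/O diagram.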

use Encode; my $internal = decode('UTF-8', $input); my $output =
encode('UTF-8', $internal); Explicit Encoding & Decoding
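A runnable round trip might look like the sketch below. The input bytes and the uc step are illustrative assumptions, added only to show that string operations happen on the decoded characters between the two calls.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# UTF-8 bytes for "¡jalapeño!" (12 bytes, 10 characters).
my $input = "\xC2\xA1jalape\xC3\xB1o!";

my $internal = decode('UTF-8', $input);    # bytes      => characters
my $upper    = uc $internal;               # work on characters
my $output   = encode('UTF-8', $upper);    # characters => bytes

printf "%d bytes in, %d characters inside, %d bytes out\n",
    length $input, length $internal, length $output;
```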

Let’s use this grapheme cluster as the string in our
next example: ю ́ U+044E CYRILLIC SMALL LETTER YU U+0301 COMBINING ACUTE ACCENT String Length

# UTF-8 encoded: D1 8E CC 81
say length $encoded_grapheme; # 4
use Encode;
# Unicode string: 044E 0301
my $grapheme = decode('UTF-8', $encoded_grapheme);
say length $grapheme; # 2
my $length = () = $grapheme =~ /\X/g;
say $length; # 1
String Length

# sort of complex for a simple length, eh?
my $length = () = $str =~ /\X/g;
say $length;
# and tricky depending on the context
say scalar( () = $str =~ /\X/g );
# a little better (start counting from zero)
$length = 0;
$length++ while $str =~ /\X/g;
say $length;
String Length

# an alternative approach
use Unicode::GCString;
say Unicode::GCString->new($str)->length;
# and yet another (Warning: I wrote it!)
use Unicode::Util qw( grapheme_length );
say grapheme_length($str);
String Length

Standard ordering of strings for comparison and sorting. sort @names
$a cmp $b $x gt $y $foo eq $bar Collation

Perl provides a collation algorithm based on code points.
@words = qw( Äpfel durian Xerxes );
sort @words # Xerxes durian Äpfel
sort { lc $a cmp lc $b } @words # durian Xerxes Äpfel
Collation

The Unicode Collation Algorithm (UCA) provides collation based on natural language usage.
use Unicode::Collate;
my $collator = Unicode::Collate->new;
$collator->sort(@words); # Äpfel durian Xerxes
$collator->sort(@names)
$collator->cmp($a, $b)
$collator->gt($x, $y)
$collator->eq($foo, $bar)
Collation
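The difference between the two orderings can be reproduced with the core Unicode::Collate module. A minimal sketch using the same three words; the array names are illustrative, and the source file is assumed to be saved as UTF-8 because of use utf8.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

my @words = qw( Äpfel durian Xerxes );

# Built-in sort compares code points: X (58) < d (64) < Ä (C4).
my @by_code_point = sort @words;

# The UCA compares by letter first, ignoring case and accents initially.
my @by_uca = Unicode::Collate->new->sort(@words);

print "code points: @by_code_point\n";
print "UCA:         @by_uca\n";
```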

UCA also provides locale-specific collations for different languages.
use Unicode::Collate::Locale;
my $kolator = Unicode::Collate::Locale->new(
    locale => 'pl' # Polish
);
Collation
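As a rough illustration of what a locale changes, the sketch below sorts three Polish words with the default collator and with the Polish one. The word list is an assumption chosen for the demonstration: in Polish, ą is a separate letter that sorts after a, while the default collation treats it as an accented a.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;
use Unicode::Collate::Locale;

binmode STDOUT, ':encoding(UTF-8)';

my @words = qw( kąt kawa kat );

# Default UCA: ą differs from a only as an accent.
my @default = Unicode::Collate->new->sort(@words);

# Polish tailoring: ą is its own letter, ordered after a.
my @polish = Unicode::Collate::Locale->new( locale => 'pl' )->sort(@words);

print "default: @default\n";
print "Polish:  @polish\n";
```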

Unicode has 4 normalization forms. The most important are:
NFD: Normalization Form Canonical Decomposition
NFC: Normalization Form Canonical Composition
Normalization

use Unicode::Normalize;
# NFD can be helpful on input
$str = NFD($input);
# NFC is recommended on output
$output = NFC($str);
Normalization

UTF-8 encoded input ⇩ decode ⇩ NFD ⇩ Perl Unicode
string ⇩ NFC ⇩ encode ⇩ UTF-8 encoded output Normalization
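To see why normalization matters in this pipeline, consider é, which can be stored either as one precomposed code point or as e followed by a combining accent. The sketch below uses the core Unicode::Normalize module; the variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFC NFD );

my $composed   = "\x{E9}";      # LATIN SMALL LETTER E WITH ACUTE
my $decomposed = "e\x{301}";    # e + COMBINING ACUTE ACCENT

# The two strings look identical but differ code point by code point.
my $raw_equal = $composed eq $decomposed ? 'yes' : 'no';

# After normalizing both to the same form they compare equal.
my $nfd_equal = NFD($composed) eq NFD($decomposed) ? 'yes' : 'no';
my $nfc_equal = NFC($composed) eq NFC($decomposed) ? 'yes' : 'no';

print "raw: $raw_equal, NFD: $nfd_equal, NFC: $nfc_equal\n";
printf "NFD length: %d, NFC length: %d\n",
    length NFD($composed), length NFC($decomposed);
```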

By default, unfortunately, strings and regexes are not guaranteed to
use Unicode semantics. This is known as “The Unicode Bug.” There are a few ways to fix this:
utf8::upgrade($str);
use v5.12;
use feature 'unicode_strings';
Unicode Semantics
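The bug is easiest to see with a character in the range U+0080 to U+00FF. In this sketch, which assumes an ASCII-based platform with no locale in effect and deliberately avoids use v5.12 at the top, é only counts as a word character once the string is upgraded or the unicode_strings feature is in scope. The variable names are illustrative.

```perl
use strict;
use warnings;

my $str = "\xE9";    # U+00E9 (é) stored as a native 8-bit string

# Without Unicode semantics, \w does not match é.
my $before = $str =~ /\w/ ? 'yes' : 'no';

# Upgrading the internal representation switches on Unicode rules.
utf8::upgrade($str);
my $after = $str =~ /\w/ ? 'yes' : 'no';

# The feature gives the same result without touching the string.
my $scoped;
{
    use feature 'unicode_strings';
    my $fresh = "\xE9";
    $scoped = $fresh =~ /\w/ ? 'yes' : 'no';
}

print "before upgrade: $before\n";
print "after upgrade:  $after\n";
print "with feature:   $scoped\n";
```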

You’ll see the “utf8” encoding used frequently in Perl. “utf8”
follows the UTF-8 standard very loosely and allows many errors in your data without warnings. By default, use “UTF-8” instead. UTF-8 vs. utf8 vs. :utf8

# utf8 is Perl's internal encoding form
my $internal = decode('utf8', $input);
# UTF-8 is the official UTF-8 encoding
my $internal = decode('UTF-8', $input);
# insecure! no encoding validation at all
open my $fh, '<:utf8', $filename;
# proper UTF-8 validation
open my $fh, '<:encoding(UTF-8)', $filename;
UTF-8 vs. utf8 vs. :utf8
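Validation can also be made explicit when decoding by hand. The sketch below is an addition to the slides, not something they cover: it assumes Encode's optional CHECK argument, where the default replaces malformed bytes with U+FFFD and Encode::FB_CROAK dies instead. The sample bytes are an illustrative malformed sequence.

```perl
use strict;
use warnings;
use Encode qw( decode );

# 0xC3 starts a two-byte sequence, but 0x28 "(" is not a continuation byte.
my $bad = "\xC3\x28";

# Default behavior: the malformed byte becomes U+FFFD REPLACEMENT CHARACTER.
my $lenient = decode('UTF-8', $bad);

# FB_CROAK: die on malformed input; LEAVE_SRC keeps $bad unmodified.
my $strict = eval {
    decode('UTF-8', $bad, Encode::FB_CROAK() | Encode::LEAVE_SRC());
};
my $error = $@;

printf "lenient result starts with U+%04X\n", ord $lenient;
print  "strict decode ", ( defined $strict ? "succeeded" : "failed" ), "\n";
```

Dying on bad input is usually the safer choice at trust boundaries, since silently replaced characters are hard to detect later.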

Slides will be posted to: @nickpatch Questions?
