Fundamental Unicode in Perl

Slide 1

Slide 1 text

Fundamental Unicode Nick Patch

Slide 2

Slide 2 text

“The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape, rather than a specific shape.” —The Unicode Consortium What Is a Character?

Slide 3

Slide 3 text

Glyphs are visual representations of characters. Fonts are collections of glyphs. There may be many different glyphs for the same character. This talk is not about fonts or glyphs. What Is a Glyph?

Slide 4

Slide 4 text

a b c π ث й Letters

Slide 5

Slide 5 text

1 2 3 ໓ ๓ ३ Numbers

Slide 6

Slide 6 text

. / ? 「 « » 」 Punctuation

Slide 7

Slide 7 text

™ © ≠ ☺ ☠ Symbols

Slide 8

Slide 8 text

    CARRIAGE RETURN
    NO-BREAK SPACE
    COMBINING GRAPHEME JOINER
    RIGHT-TO-LEFT MARK
Control Characters

Slide 9

Slide 9 text

Many people use “character set” to mean one or more of these:
    Character Code
    Character Encoding
    Character Repertoire
This makes for a confusing situation.
Character Set

Slide 10

Slide 10 text

A defined mapping of characters to numbers (shown here in hexadecimal).
    A ⇒ 41
    B ⇒ 42
    C ⇒ 43
Each value in a character code is called a code point.
Character Code
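As a small illustration of this mapping, Perl's built-in ord and chr convert between a character and its code point. This is only a sketch; the helper name code_point_label is made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper for this sketch: format a character's code point
# in the U+XXXX notation used throughout these slides.
sub code_point_label {
    my ($char) = @_;
    return sprintf 'U+%04X', ord $char;
}

for my $char (qw( A B C )) {
    printf "%s => %s\n", $char, code_point_label($char);
}

# chr goes the other way: from a code point back to a character.
print chr(0x43), "\n";
```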

Slide 11

Slide 11 text

An algorithm to convert code points to a digital form for ease of transmitting or storing data.
    41 (A) ⇒ 1000001
    42 (B) ⇒ 1000010
    43 (C) ⇒ 1000011
Character Encoding

Slide 12

Slide 12 text

A character repertoire is a collection of distinct characters. Character codes, keyboards, and written languages all have well-defined character repertoires. Character Repertoire

Slide 13

Slide 13 text

ASCII
    character code: 128 code points
    character encoding: 7 bits each
Character Codes & Encodings

Slide 14

Slide 14 text

ASCII
    character code: 128 code points
    character encoding: 7 bits each
Latin 1 (ISO-8859-1)
    character code: 256 code points
    character encoding: 8 bits (1 byte) each
Character Codes & Encodings

Slide 15

Slide 15 text

Unicode (character code)
    1,114,112 code points; 1,112,064 excluding surrogates (110,000+ defined)
Character Codes & Encodings

Slide 16

Slide 16 text

Unicode (character code)
    1,114,112 code points; 1,112,064 excluding surrogates (110,000+ defined)
    character encodings:
        UTF-8: 1 to 4 bytes each
        UTF-16: 2 or 4 bytes each
        UTF-32: 4 bytes each
Character Codes & Encodings
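The byte counts for the three encodings can be checked with the core Encode module. This is a minimal sketch; the helper name encoded_lengths is invented here, and the big-endian variants UTF-16BE and UTF-32BE are used so that no byte order mark is added to the counts.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper: number of bytes one code point occupies
# in UTF-8, UTF-16BE, and UTF-32BE, in that order.
sub encoded_lengths {
    my ($code_point) = @_;
    my $char = chr $code_point;
    return map { length encode($_, $char) } qw( UTF-8 UTF-16BE UTF-32BE );
}

# A, ñ, LAO DIGIT THREE, PILE OF POO
for my $code_point (0x41, 0xF1, 0x0ED3, 0x1F4A9) {
    printf "U+%04X: UTF-8 %d, UTF-16 %d, UTF-32 %d bytes\n",
        $code_point, encoded_lengths($code_point);
}
```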

Slide 17

Slide 17 text

    A U+0041 LATIN CAPITAL LETTER A
    ໓ U+0ED3 LAO DIGIT THREE
    💩 U+1F4A9 PILE OF POO
Code Points

Slide 18

Slide 18 text

Some code points have precomposed diacritics.
    ȫ U+022B LATIN SMALL LETTER O WITH DIAERESIS AND MACRON
Code Points

Slide 19

Slide 19 text

Other characters must be composed from multiple code points using “combining characters.”
    n̈
    U+006E LATIN SMALL LETTER N
    U+0308 COMBINING DIAERESIS
Code Points

Slide 20

Slide 20 text

Any series of code points that are composed into a single user-perceived character. Informally known as “graphemes.”
    A (U+0041)
    n̥̈ (U+006E U+0308 U+0325)
    CRLF (U+000D U+000A)
Grapheme Clusters
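The three examples on this slide can be checked with the regex escape \X, which matches one extended grapheme cluster. A rough sketch follows; grapheme_count is a name invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: count grapheme clusters by counting \X matches.
sub grapheme_count {
    my ($str) = @_;
    my $count = () = $str =~ /\X/g;
    return $count;
}

my %samples = (
    'A'                  => "A",
    'n + two diacritics' => "n\x{308}\x{325}",
    'CRLF'               => "\r\n",
);

for my $name (sort keys %samples) {
    my $str = $samples{$name};
    printf "%-20s %d code point(s), %d grapheme cluster(s)\n",
        $name, length $str, grapheme_count($str);
}
```

Each sample has a different number of code points, yet each is a single grapheme cluster.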

Slide 21

Slide 21 text

🐪 U+1F42A DROMEDARY CAMEL
Time for some…

Slide 22

Slide 22 text

    # ¡jalapeño!
    say "\x{A1}jalape\x{F1}o!";
String constants ... TIMTOWTDI

Slide 23

Slide 23 text

    # ¡jalapeño!
    say "\x{A1}jalape\x{F1}o!";
    use v5.12;
    say "\N{U+00A1}jalape\N{U+00F1}o!";
String constants ... TIMTOWTDI

Slide 24

Slide 24 text

    use charnames qw( :full );
    say "\N{INVERTED EXCLAMATION MARK}jalape\N{LATIN SMALL LETTER N WITH TILDE}o!";
String constants ... TIMTOWTDI

Slide 25

Slide 25 text

    use charnames qw( :full );
    say "\N{INVERTED EXCLAMATION MARK}jalape\N{LATIN SMALL LETTER N WITH TILDE}o!";
    use utf8;
    say '¡jalapeño!';
String constants ... TIMTOWTDI
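To confirm that these spellings really are interchangeable, the following sketch builds the same string four ways and compares them. The variable names are invented for the example, and it assumes the source file is saved as UTF-8, which is what use utf8 declares. Note that ñ is U+00F1; U+00D1 is the capital Ñ.

```perl
use strict;
use warnings;
use utf8;                       # assumption: this file is saved as UTF-8
use charnames qw( :full );

my $by_hex     = "\x{A1}jalape\x{F1}o!";
my $by_number  = "\N{U+00A1}jalape\N{U+00F1}o!";
my $by_name    = "\N{INVERTED EXCLAMATION MARK}jalape\N{LATIN SMALL LETTER N WITH TILDE}o!";
my $by_literal = '¡jalapeño!';

my $all_equal = $by_hex    eq $by_number
             && $by_number eq $by_name
             && $by_name   eq $by_literal;

binmode STDOUT, ':encoding(UTF-8)';
print $by_literal, "\n";
print $all_equal
    ? "all four spellings are the same string\n"
    : "the spellings differ\n";
```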

Slide 26

Slide 26 text

    =encoding UTF-8

    =head1 ¡jalapeño!
String constants ... POD

Slide 27

Slide 27 text

    UTF-8 encoded input
    ⇩
    decode
    ⇩
    Perl Unicode string
    ⇩
    encode
    ⇩
    UTF-8 encoded output
I/O

Slide 28

Slide 28 text

    open my $fh, '<:encoding(UTF-8)', $filename;
    open my $fh, '>:encoding(UTF-8)', $filename;
I/O

Slide 29

Slide 29 text

    open my $fh, '<:encoding(UTF-8)', $filename;
    open my $fh, '>:encoding(UTF-8)', $filename;
    binmode $fh, ':encoding(UTF-8)';
    binmode STDIN, ':encoding(UTF-8)';
I/O

Slide 30

Slide 30 text

    use open qw( :encoding(UTF-8) );
    open my $fh, '<', $filename;
I/O

Slide 31

Slide 31 text

    use open qw( :encoding(UTF-8) );
    open my $fh, '<', $filename;
    # :std for STDIN, STDOUT, STDERR
    use open qw( :encoding(UTF-8) :std );
I/O

Slide 32

Slide 32 text

    use open qw( :encoding(UTF-8) );
    open my $fh, '<', $filename;
    # :std for STDIN, STDOUT, STDERR
    use open qw( :encoding(UTF-8) :std );
    # CPAN module to enable everything UTF-8
    use utf8::all;
I/O
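The decode-on-input, encode-on-output flow can be seen in a round trip through a temporary file. This is a sketch using only core modules; the variable names are illustrative and the temporary file comes from File::Temp.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

my $text = "\x{A1}jalape\x{F1}o!";    # ¡jalapeño! as 10 code points

my ($tmp_fh, $filename) = tempfile( UNLINK => 1 );
close $tmp_fh;

# Encode on the way out: the layer turns code points into UTF-8 bytes.
open my $out, '>:encoding(UTF-8)', $filename or die "Cannot write $filename: $!";
print {$out} $text;
close $out or die "Cannot close $filename: $!";

# Read the raw bytes to see what actually landed on disk.
open my $raw, '<:raw', $filename or die "Cannot read $filename: $!";
my $bytes = do { local $/; <$raw> };
close $raw;

# Decode on the way in: the layer turns UTF-8 bytes back into code points.
open my $in, '<:encoding(UTF-8)', $filename or die "Cannot read $filename: $!";
my $decoded = do { local $/; <$in> };
close $in;

printf "bytes on disk: %d\n", length $bytes;
printf "code points:   %d\n", length $decoded;
print $decoded eq $text ? "round trip preserved the string\n" : "round trip changed the string\n";
```

The two non-ASCII characters take two bytes each in UTF-8, which is why the byte count is larger than the code point count.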

Slide 33

Slide 33 text

    use Encode;
    my $internal = decode('UTF-8', $input);
    my $output = encode('UTF-8', $internal);
Explicit Encoding & Decoding

Slide 34

Slide 34 text

Let’s use this grapheme cluster as the string in our next example:
    ю́
    U+044E CYRILLIC SMALL LETTER YU
    U+0301 COMBINING ACUTE ACCENT
String Length

Slide 35

Slide 35 text

    # UTF-8 encoded: D1 8E CC 81
    say length $encoded_grapheme; # 4
String Length

Slide 36

Slide 36 text

    # UTF-8 encoded: D1 8E CC 81
    say length $encoded_grapheme; # 4
    use Encode;
    # Unicode string: 044E 0301
    my $grapheme = decode('UTF-8', $encoded_grapheme);
    say length $grapheme; # 2
String Length

Slide 37

Slide 37 text

    # UTF-8 encoded: D1 8E CC 81
    say length $encoded_grapheme; # 4
    use Encode;
    # Unicode string: 044E 0301
    my $grapheme = decode('UTF-8', $encoded_grapheme);
    say length $grapheme; # 2
    my $length = () = $grapheme =~ /\X/g;
    say $length; # 1
String Length

Slide 38

Slide 38 text

    # sort of complex for a simple length, eh?
    my $length = () = $str =~ /\X/g;
    say $length;
String Length

Slide 39

Slide 39 text

    # sort of complex for a simple length, eh?
    my $length = () = $str =~ /\X/g;
    say $length;
    # and tricky depending on the context
    say scalar( () = $str =~ /\X/g );
String Length

Slide 40

Slide 40 text

    # sort of complex for a simple length, eh?
    my $length = () = $str =~ /\X/g;
    say $length;
    # and tricky depending on the context
    say scalar( () = $str =~ /\X/g );
    # a little better
    my $count = 0;
    $count++ while $str =~ /\X/g;
    say $count;
String Length

Slide 41

Slide 41 text

    # an alternative approach
    use Unicode::GCString;
    say Unicode::GCString->new($str)->length;
String Length

Slide 42

Slide 42 text

    # an alternative approach
    use Unicode::GCString;
    say Unicode::GCString->new($str)->length;
    # and yet another (Warning: I wrote it!)
    use Unicode::Util qw( grapheme_length );
    say grapheme_length($str);
String Length

Slide 43

Slide 43 text

Standard ordering of strings for comparison and sorting.
    sort @names
    $a cmp $b
    $x gt $y
    $foo eq $bar
Collation

Slide 44

Slide 44 text

Perl provides a collation algorithm based on code points. Collation

Slide 45

Slide 45 text

Perl provides a collation algorithm based on code points.
    @words = qw( Äpfel durian Xerxes );
    sort @words # Xerxes durian Äpfel
Collation

Slide 46

Slide 46 text

Perl provides a collation algorithm based on code points.
    @words = qw( Äpfel durian Xerxes );
    sort @words # Xerxes durian Äpfel
    sort { lc $a cmp lc $b } @words # durian Xerxes Äpfel
Collation

Slide 47

Slide 47 text

Unicode Collation Algorithm (UCA) provides collation based on natural language usage. Collation

Slide 48

Slide 48 text

Unicode Collation Algorithm (UCA) provides collation based on natural language usage.
    use Unicode::Collate;
    my $collator = Unicode::Collate->new;
    $collator->sort(@words); # Äpfel durian Xerxes
Collation
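The two orderings can be compared side by side in one runnable script. This is a sketch; it assumes the source file is saved as UTF-8 (hence use utf8) and uses the core Unicode::Collate module with its default settings.

```perl
use strict;
use warnings;
use utf8;                       # assumption: this file is saved as UTF-8
use Unicode::Collate;

my @words = qw( Äpfel durian Xerxes );

# Built-in sort compares code points: X (U+0058) < d (U+0064) < Ä (U+00C4).
my @by_code_point = sort @words;

# UCA compares letters first and treats case and accents as minor differences.
my @by_uca = Unicode::Collate->new->sort(@words);

binmode STDOUT, ':encoding(UTF-8)';
print "code point order: @by_code_point\n";
print "UCA order:        @by_uca\n";
```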

Slide 49

Slide 49 text

Unicode Collation Algorithm (UCA) provides collation based on natural language usage.
    $collator->sort(@names)
    $collator->cmp($a, $b)
    $collator->gt($x, $y)
    $collator->eq($foo, $bar)
Collation

Slide 50

Slide 50 text

UCA also provides locale-specific collations for different languages. Collation

Slide 51

Slide 51 text

UCA also provides locale-specific collations for different languages.
    use Unicode::Collate::Locale;
    my $kolator = Unicode::Collate::Locale->new(
        locale => 'pl' # Polish
    );
Collation
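To show what the locale changes, here is a sketch that sorts three Polish words with the default collator and with the 'pl' locale. The word list is chosen for illustration only; in Polish, ą is a separate letter that sorts after a, whereas the default collation treats it as an accented a.

```perl
use strict;
use warnings;
use utf8;                       # assumption: this file is saved as UTF-8
use Unicode::Collate;
use Unicode::Collate::Locale;

my @words = qw( kąt katar kat );

# Default UCA: ą differs from a only at the accent level.
my @default_order = Unicode::Collate->new->sort(@words);

# Polish tailoring: ą is its own letter, ordered after a.
my @polish_order = Unicode::Collate::Locale->new( locale => 'pl' )->sort(@words);

binmode STDOUT, ':encoding(UTF-8)';
print "default: @default_order\n";
print "Polish:  @polish_order\n";
```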

Slide 52

Slide 52 text

Unicode has 4 normalization forms. The most important are:
    NFD: Normalization Form Canonical Decomposition
    NFC: Normalization Form Canonical Composition
Normalization

Slide 53

Slide 53 text

    use Unicode::Normalize;
    # NFD can be helpful on input
    $str = NFD($input);
    # NFC is recommended on output
    $output = NFC($str);
Normalization
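A short sketch of why normalization matters: the same visible character can be stored as one code point or as two, and the two forms only compare equal after normalizing. The variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFD NFC );

my $composed   = "\x{F1}";      # ñ as one precomposed code point
my $decomposed = "n\x{303}";    # n followed by COMBINING TILDE

printf "equal before normalizing: %s\n", $composed eq $decomposed ? 'yes' : 'no';
printf "equal after NFD:          %s\n", NFD($composed) eq NFD($decomposed) ? 'yes' : 'no';
printf "equal after NFC:          %s\n", NFC($composed) eq NFC($decomposed) ? 'yes' : 'no';

printf "length in NFD: %d\n", length NFD($composed);
printf "length in NFC: %d\n", length NFC($decomposed);
```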

Slide 54

Slide 54 text

    UTF-8 encoded input
    ⇩
    decode
    ⇩
    NFD
    ⇩
    Perl Unicode string
    ⇩
    NFC
    ⇩
    encode
    ⇩
    UTF-8 encoded output
Normalization

Slide 55

Slide 55 text

By default, unfortunately, strings and regexes are not guaranteed to use Unicode semantics. This is known as “The Unicode Bug.” There are a few ways to fix this: Unicode Semantics

Slide 56

Slide 56 text

By default, unfortunately, strings and regexes are not guaranteed to use Unicode semantics. This is known as “The Unicode Bug.” There are a few ways to fix this:
    utf8::upgrade($str);
Unicode Semantics
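The effect can be observed with a single character below U+0100. This is a sketch that deliberately avoids use v5.12 or later, since that would turn on the unicode_strings feature and mask the behavior; it also assumes no locale pragma is in effect.

```perl
use strict;
use warnings;

# é (U+00E9) written this way is stored internally as a single native byte.
my $str = "\x{E9}";

# Without Unicode semantics, \w only recognizes ASCII word characters here.
my $matched_before = $str =~ /\w/ ? 1 : 0;

# Upgrading changes the internal representation, not the characters,
# and makes Perl apply Unicode rules to this string.
utf8::upgrade($str);
my $matched_after = $str =~ /\w/ ? 1 : 0;

print "matches \\w before utf8::upgrade: $matched_before\n";
print "matches \\w after utf8::upgrade:  $matched_after\n";
```

The string holds the same single character throughout; only the semantics applied to it change.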

Slide 57

Slide 57 text

Slide 58

Slide 58 text

Slide 59

Slide 59 text

You’ll see the “utf8” encoding used frequently in Perl. “utf8” follows the UTF-8 standard very loosely and allows many errors in your data without warnings. By default, use “UTF-8” instead. UTF-8 vs. utf8 vs. :utf8

Slide 60

Slide 60 text

    # utf8 is Perl's internal encoding form
    my $internal = decode('utf8', $input);
    # UTF-8 is the official UTF-8 encoding
    my $internal = decode('UTF-8', $input);
UTF-8 vs. utf8 vs. :utf8

Slide 61

Slide 61 text

    # utf8 is Perl's internal encoding form
    my $internal = decode('utf8', $input);
    # UTF-8 is the official UTF-8 encoding
    my $internal = decode('UTF-8', $input);
    # insecure! no encoding validation at all
    open my $fh, '<:utf8', $filename;
    # proper UTF-8 validation
    open my $fh, '<:encoding(UTF-8)', $filename;
UTF-8 vs. utf8 vs. :utf8
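The distinction shows up in how Encode resolves the two names, and in what the strict decoder refuses. In this sketch the sample bytes are the UTF-8-style encoding of the surrogate U+D800, which the UTF-8 standard does not allow; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw( decode find_encoding FB_CROAK );

# The two spellings resolve to different encoding objects.
my $strict_name = find_encoding('UTF-8')->name;
my $lax_name    = find_encoding('utf8')->name;
print "UTF-8 resolves to: $strict_name\n";
print "utf8  resolves to: $lax_name\n";

# Bytes that look like UTF-8 but encode a surrogate code point.
my $bad_bytes = "\xED\xA0\x80";

my $strict_ok = eval {
    my $copy = $bad_bytes;      # FB_CROAK may modify its input, so decode a copy
    decode('UTF-8', $copy, FB_CROAK);
    1;
};
print $strict_ok
    ? "strict decoder accepted the input\n"
    : "strict decoder rejected the input\n";
```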

Slide 62

Slide 62 text

Slides will be posted to: @nickpatch Questions?