Unicode Best Practices

Unicode Best Practices s///g nick patch @nickpatch 『 shutterstock 』

UTF-8 encoded input ⇩ decode ⇩ character string ⇩ hack… hack… hack… ⇩ encode ⇩ UTF-8 encoded output
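
A minimal Python 3 sketch of this decode, process, encode pipeline. The function name `shout` and the uppercase step are hypothetical stand-ins for the "hack…" stage, not something from the slides.

```python
def shout(raw: bytes) -> bytes:
    """Decode UTF-8 input, work on characters, encode UTF-8 output."""
    text = raw.decode("utf-8")    # bytes -> character string
    text = text.upper()           # hypothetical stand-in for "hack… hack… hack…"
    return text.encode("utf-8")   # character string -> bytes

raw = "Gr\u00f6\u00dfe".encode("utf-8")     # "Größe" as UTF-8 bytes
print(len(raw), len(raw.decode("utf-8")))   # byte count vs. character count
print(shout(raw).decode("utf-8"))
```

The point of the sketch is that all string work happens between the decode and the encode, on characters rather than bytes.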

UTF-8 source code
UTF-8 I/O

Perl 5
use utf8;
# hack… hack… hack…
=encoding UTF-8
docs… docs… docs…

Python 2
# -*- coding: UTF-8 -*-
# hack… hack… hack…

Python 3
# hack… hack… hack…

Ruby
# encoding: UTF-8
# hack… hack… hack…
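
A short Python 3 sketch of both halves, UTF-8 source code and UTF-8 I/O. Python 3 reads source files as UTF-8 by default, so the literal needs no declaration; the I/O encoding is named explicitly rather than left to the platform default. The file name `words.txt` is an assumption for illustration.

```python
import os
import tempfile

word = "zvířatech"  # non-ASCII literal in UTF-8 source code

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "words.txt")  # hypothetical file name

    # Encode on the way out ...
    with open(path, "w", encoding="utf-8") as out:
        out.write(word + "\n")

    # ... and decode on the way in, naming the encoding both times.
    with open(path, encoding="utf-8") as src:
        round_tripped = src.read().strip()

print(round_tripped == word)
```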

/ (
    ия  # definite articles for nouns:
  | ът  # ∙ masculine
  | та  # ∙ feminine
  | то  # ∙ neuter
  | те  # ∙ plural
) $ /x

func remove_kasra (word) {
    word.subst(/ \N{ARABIC KASRA} $ /x, "")
}
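
The two slides above (commented patterns in extended mode, characters referred to by name) can be sketched in Python 3 roughly as follows. The constant name `KASRA_AT_END` is made up for this example; `re.VERBOSE` plays the role of `/x`, and the `\N{…}` name is resolved by the Python string literal.

```python
import re

# Verbose mode: whitespace and "#" comments inside the pattern are ignored.
KASRA_AT_END = re.compile("""
    \N{ARABIC KASRA}   # the vowel mark, written by name instead of pasted in
    $                  # only at the end of the word
""", re.VERBOSE)

def remove_kasra(word):
    """Strip a trailing ARABIC KASRA (U+0650), if there is one."""
    return KASRA_AT_END.sub("", word)

with_kasra = "\u0643\u062a\u0627\u0628\u0650"   # four letters plus a final kasra
print(len(with_kasra), len(remove_kasra(with_kasra)))
```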

\d 123… ১২৩… ໑໒໓…

[0-9] 123… (but not ১২৩… ໑໒໓…)
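
A quick Python 3 check of the same distinction, as a sketch. For `str` patterns, Python's `re` treats `\d` as any Unicode decimal digit, while `[0-9]` stays ASCII-only; the sample string is my own.

```python
import re

# ASCII, Bengali and Lao digits, written as escapes to keep the source unambiguous.
text = "123 \u09e7\u09e8\u09e9 \u0ed1\u0ed2\u0ed3"

unicode_digits = re.findall(r"\d+", text)     # any decimal digit
ascii_digits = re.findall(r"[0-9]+", text)    # ASCII digits only

print(len(unicode_digits), len(ascii_digits))
print([int(run) for run in unicode_digits])   # int() also accepts non-ASCII decimal digits
```

Which one is right depends on the input: use `[0-9]` when the format demands ASCII digits, `\d` when accepting what users actually type.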

\w abc… 123… _ αβγ… ㄅㄆㄇ… أ ب ج …

\b abc… 123… _ αβγ… ㄅㄆㄇ… أ ب ج …

[A-Za-z0-9_] abc… 123… _
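
The same comparison for word characters, sketched in Python 3. `re.ASCII` is the standard flag that narrows `\w` back to `[A-Za-z0-9_]`; the sample text is an assumption.

```python
import re

# Latin, digits, underscore, Greek, Bopomofo
text = "abc 123 _ \u03b1\u03b2\u03b3 \u3105\u3106\u3107"

unicode_words = re.findall(r"\w+", text)            # Unicode-aware for str patterns
ascii_words = re.findall(r"\w+", text, re.ASCII)    # behaves like [A-Za-z0-9_]+
explicit_ascii = re.findall(r"[A-Za-z0-9_]+", text)

print(len(unicode_words), len(ascii_words))
print(ascii_words == explicit_ascii)
```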

\R LF (\n) CR (\r) FF (\f) CRLF (\r\n) NEL VT LS PS
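
Python's built-in `re` module has no `\R`, so the following is only an approximation of it, spelled out as an alternation covering the line breaks listed above. The name `LINEBREAK` is made up for this sketch.

```python
import re

# CRLF must come first so that "\r\n" is consumed as one line break, not two.
LINEBREAK = re.compile(r"\r\n|[\n\r\f\v\x85\u2028\u2029]")

text = "one\r\ntwo\nthree\u2028four\x85five"
lines = LINEBREAK.split(text)
print(lines)
```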

\X Spın̈al Tap n\N{COMBINING DIAERESIS} CRLF (\r\n)
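
Python's standard `re` does not support `\X` (the third-party `regex` module does). As a rough illustration of what a grapheme cluster is, here is a deliberately simplified sketch that only handles the two cases on the slide: a base character followed by combining marks, and CRLF. It is not a full implementation of Unicode text segmentation, and `simple_clusters` is a hypothetical helper name.

```python
import unicodedata

def simple_clusters(text):
    """Group combining marks with their base character, and CR with LF."""
    clusters = []
    for ch in text:
        joins_previous = bool(clusters) and (
            unicodedata.combining(ch) != 0             # mark with a nonzero combining class
            or (ch == "\n" and clusters[-1] == "\r")   # CRLF counts as one cluster
        )
        if joins_previous:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

name = "Sp\u0131n\u0308al Tap"   # dotless i, then n + COMBINING DIAERESIS
print(len(name), len(simple_clusters(name)))
```

The gap between the two numbers is why length, truncation and reversal should work on clusters rather than code points when the result is shown to users.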

\p{…}

\p{ASCII}

\P{ASCII}

\p{General_Category=Letter}

\p{Letter}

\p{L}

L Letter
M Mark
N Number
P Punctuation
S Symbol
Z Separator
C Other

S Symbol
Sm Math_Symbol
Sc Currency_Symbol
Sk Modifier_Symbol
So Other_Symbol
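
Python's standard `re` has no `\p{…}`, but the general categories listed above are available through `unicodedata.category`. A minimal sketch, with `has_major_category` as a hypothetical helper that compares only the first letter of the two-letter category:

```python
import unicodedata

def has_major_category(ch, major):
    # Rough stand-in for \p{L}, \p{S}, ...: "Sc".startswith("S") and so on.
    return unicodedata.category(ch).startswith(major)

for ch in ["\u00df", "+", "\u20ac", "^", "\u00a9"]:   # ß + € ^ ©
    print(repr(ch), unicodedata.category(ch))

print(has_major_category("\u20ac", "S"), has_major_category("1", "L"))
```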

\p{Script=Latin}

\p{Latin}

[\p{Hiragana} \p{Katakana} \p{Han} \p{Latin} \p{Common}]

[\p{Hira} \p{Kana} \p{Hani} \p{Latn} \p{Common}]

Arab Arabic
Beng Bengali
Deva Devanagari
Egyp Egyptian hieroglyphs
Ethi Ethiopic
Grek Greek
Hang Hangul
…

return $word
    if $word =~ s{ зи $}{г}x
    || $word =~ s{ е ( \p{Cyrl} ) и $}{я$1}x
    || $word =~ s{ ци $}{к}x
    || $word =~ s{ ( та | ища ) $}{}x;

"Größe".lc == "größe"
"Größe".uc == "GRÖSSE"
"Größe".lc != "Größe".uc.lc

"Größe".fc == "GRÖSSE".fc

"Größe".nfc == "Gro◌̈ße".nfc

"Größe".nfc.fc == "GRO◌̈SSE".nfc.fc

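The case and normalization comparisons above can be tried in Python 3, where `str.casefold` corresponds to `fc` and `unicodedata.normalize` to `nfc`. This is a sketch; `caseless_key` is a hypothetical helper that mirrors the `.nfc.fc` order used on the slide.

```python
import unicodedata

word = "Gr\u00f6\u00dfe"            # "Größe" with a precomposed ö
decomposed_upper = "GRO\u0308SSE"   # O followed by COMBINING DIAERESIS

print(word.lower() == "gr\u00f6\u00dfe")
print(word.upper() == "GR\u00d6SSE")
print(word.lower() != word.upper().lower())   # ß does not survive the round trip
print(word.casefold() == "GR\u00d6SSE".casefold())

def caseless_key(text):
    # Normalize first so both spellings of ö agree, then fold case.
    return unicodedata.normalize("NFC", text).casefold()

print(caseless_key(word) == caseless_key(decomposed_upper))
```

Lowercasing is for display; comparing strings regardless of case calls for case folding, and comparing strings from different sources calls for normalizing them first.
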
UTF-8 encoded input ⇩ decode ⇩ character string ⇩ NFD ⇨ hack… hack… hack… ⇨ NFC ⇩ encode ⇩ UTF-8 encoded output
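
A sketch of the extended pipeline in Python 3: decode, normalize to NFD, do the work, normalize to NFC, encode. The names `process` and `drop_marks` are assumptions; `drop_marks` is just one example of a "hack" that becomes simple once every accent is a separate combining character.

```python
import unicodedata

def process(raw: bytes, hack) -> bytes:
    text = raw.decode("utf-8")                  # bytes -> character string
    text = unicodedata.normalize("NFD", text)   # decompose before working
    text = hack(text)                           # "hack… hack… hack…"
    text = unicodedata.normalize("NFC", text)   # recompose before output
    return text.encode("utf-8")                 # character string -> bytes

def drop_marks(text):
    # Hypothetical processing step: remove combining marks from NFD text.
    return "".join(ch for ch in text if not unicodedata.combining(ch))

print(process("N\u00f3ir\u00edn".encode("utf-8"), drop_marks))
```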

collator = Unicode::Collator.new
countries = collator.sort(countries)

collator = Unicode::Collator.new(
    locale: "de"  # German
)
de_words = collator.sort(de_words)

collator = Unicode::Collator.new(
    level: 2  # ignore case
)
collator.eq("Größe", "GRO◌̈SSE")

Testing
characters > 7F: Nóirín
characters > FF: 松本行弘
characters > FFFF: ⩨
grapheme clusters: Spın̈al Tap
foreign digits: ٤٠

use utf8;
use open qw( :encoding(UTF-8) :std );
use Test::More tests => 66;
use Lingua::Stem::UniNE::CS qw( stem );

is stem("zvířatech"), "zvíř", "rm -atech";
is stem("zvířatům"), "zvíř", "rm -atům";
is stem("zvířata"), "zvíř", "rm -ata";
is stem("zvířaty"), "zvíř", "rm -aty";
…
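
In the same spirit, a small Python 3 sketch of test data covering the categories from the Testing slide. The dictionary name `SAMPLES`, the helper `utf8_round_trip` and the supplementary-plane character (U+1F600) are my own choices for illustration, not taken from the slides.

```python
SAMPLES = {
    "characters > 7F": "N\u00f3ir\u00edn",
    "characters > FF": "松本行弘",
    "characters > FFFF": "\U0001F600",          # assumed example beyond the BMP
    "grapheme clusters": "Sp\u0131n\u0308al Tap",
    "foreign digits": "\u0664\u0660",           # ARABIC-INDIC DIGIT FOUR, ZERO
}

def utf8_round_trip(text):
    """Encode to UTF-8 and decode again; the text should come back unchanged."""
    return text.encode("utf-8").decode("utf-8")

for label, text in SAMPLES.items():
    assert utf8_round_trip(text) == text, label

print(max(ord(ch) for ch in SAMPLES["characters > FFFF"]) > 0xFFFF)
print(int(SAMPLES["foreign digits"]))
```

Feeding each sample through the code under test catches the common failures: bytes treated as characters, 16-bit assumptions, code points treated as user-visible characters, and ASCII-only digit handling.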

@nickpatch
