Slide 1

Slide 1 text

Unicode Best Practices s///g nick patch @nickpatch 『 shutterstock 』

Slide 2

Slide 2 text

UTF-8 encoded input

Slide 3

Slide 3 text

UTF-8 encoded input ⇩ decode

Slide 4

Slide 4 text

UTF-8 encoded input ⇩ decode ⇩ character string

Slide 5

Slide 5 text

UTF-8 encoded input ⇩ decode ⇩ character string ⇩ hack… hack… hack…

Slide 6

Slide 6 text

UTF-8 encoded input ⇩ decode ⇩ character string ⇩ hack… hack… hack… ⇩ encode

Slide 7

Slide 7 text

UTF-8 encoded input ⇩ decode ⇩ character string ⇩ hack… hack… hack… ⇩ encode ⇩ UTF-8 encoded output
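The pipeline above can be sketched in Python 3, one of the languages the deck covers. This is a minimal illustration, not code from the talk: `shout` and its uppercasing step are hypothetical stand-ins for the "hack… hack… hack…" stage.

```python
def shout(raw):
    """Decode UTF-8 bytes, work on characters, then encode back to UTF-8."""
    text = raw.decode("utf-8")       # bytes -> character string
    text = text.upper()              # placeholder for the real processing
    return text.encode("utf-8")      # character string -> bytes


raw = "N\u00f3ir\u00edn".encode("utf-8")   # "Nóirín" as UTF-8 bytes
byte_count = len(raw)                      # 8 bytes
char_count = len(raw.decode("utf-8"))      # 6 characters
result = shout(raw)
```

The byte count and the character count differ, which is why the work in the middle should happen on the decoded string and not on the raw bytes.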

Slide 8

Slide 8 text

UTF-8 source code

Slide 9

Slide 9 text

UTF-8 source code UTF-8 I/O

Slide 10

Slide 10 text

Perl 5 use utf8;

Slide 11

Slide 11 text

Perl 5 use utf8; # hack… hack… hack…

Slide 12

Slide 12 text

Perl 5 use utf8; # hack… hack… hack… =encoding UTF-8

Slide 13

Slide 13 text

Perl 5 use utf8; # hack… hack… hack… =encoding UTF-8 docs… docs… docs…

Slide 14

Slide 14 text

Python 2 #-*- coding: UTF-8 -*- # hack… hack… hack…

Slide 15

Slide 15 text

Python 3 # hack… hack… hack…

Slide 16

Slide 16 text

Ruby # encoding: UTF-8 # hack… hack… hack…

Slide 17

Slide 17 text

/ (
    ия    # definite articles for nouns:
  | ът    # ∙ masculine
  | та    # ∙ feminine
  | то    # ∙ neuter
  | те    # ∙ plural
) $ /x

Slide 18

Slide 18 text

/ ( آباد | باره | بندی | بندي | ترین | ریزی | ريزي | سازی | سازي | هایی ) $ /x

Slide 19

Slide 19 text

func remove_kasra (word) { word.subst(/ \N{ARABIC KASRA} $ /x, "") }
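A possible Python 3 rendering of the pseudocode above, assuming the intent is simply to drop a trailing kasra. Naming the character with `\N{…}` keeps the pattern readable even when the editor or font cannot display the mark.

```python
import re

# "\N{ARABIC KASRA}" in a normal (non-raw) string literal is U+0650.
KASRA_AT_END = re.compile("\N{ARABIC KASRA}$")


def remove_kasra(word):
    """Return word without a final ARABIC KASRA, if it has one."""
    return KASRA_AT_END.sub("", word)


with_kasra = "\u0643\u062a\u0627\u0628\u0650"   # four letters plus a kasra
stripped = remove_kasra(with_kasra)
```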

Slide 20

Slide 20 text

\d ১২৩… ໑໒໓…

Slide 21

Slide 21 text

\d 123… ১২৩… ໑໒໓…

Slide 22

Slide 22 text

\d 123… ১২৩… ໑໒໓…

Slide 23

Slide 23 text

\d 123… ১২৩… ໑໒໓…

Slide 24

Slide 24 text

[0-9] 123… ১২৩… ໑໒໓…
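The difference between `\d` and `[0-9]` can be checked directly. The sketch below uses Python 3's standard `re` module, where `\d` on a text pattern matches any Unicode decimal digit unless the `re.ASCII` flag is set; other regex engines may default differently.

```python
import re

# ASCII, Bengali, and Lao digits one-two-three
samples = ["123", "\u09e7\u09e8\u09e9", "\u0ed1\u0ed2\u0ed3"]

unicode_digits = re.compile(r"\d+")            # any decimal digit
ascii_class = re.compile(r"[0-9]+")            # ASCII digits only
ascii_flag = re.compile(r"\d+", re.ASCII)      # \d restricted to ASCII

matches_d = [bool(unicode_digits.fullmatch(s)) for s in samples]
matches_class = [bool(ascii_class.fullmatch(s)) for s in samples]
matches_flag = [bool(ascii_flag.fullmatch(s)) for s in samples]
```

If the code goes on to do arithmetic or build identifiers from the match, the explicit `[0-9]` class states the narrower intent.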

Slide 25

Slide 25 text

\w abc… 123… _ αβγ… ㄅㄆㄇ… أ   ب   ج   …

Slide 26

Slide 26 text

\w abc… 123… _ αβγ… ㄅㄆㄇ… أ   ب   ج   …

Slide 27

Slide 27 text

\w abc… 123… _ αβγ… ㄅㄆㄇ… أ   ب   ج   …

Slide 28

Slide 28 text

\w abc… 123… _ αβγ… ㄅㄆㄇ… أ   ب   ج   …

Slide 29

Slide 29 text

\w abc… 123… _ αβγ… ㄅㄆㄇ… أ   ب   ج   …

Slide 30

Slide 30 text

\w abc… 123… _ αβγ… ㄅㄆㄇ… أ   ب   ج   …

Slide 31

Slide 31 text

\w abc… 123… _ αβγ… ㄅㄆㄇ… أ   ب   ج   …

Slide 32

Slide 32 text

\b abc… 123… _ αβγ… ㄅㄆㄇ… أ   ب   ج   …

Slide 33

Slide 33 text

[A-Za-z0-9_] abc… 123… _
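As a rough illustration of what each form tokenizes, here is a Python 3 sketch using the standard `re` module. The sample string is made up from the character groups on the slides.

```python
import re

# Latin, digits, underscore, Greek, Bopomofo, Arabic
text = "abc 123 _ \u03b1\u03b2\u03b3 \u3105\u3106\u3107 \u0623\u0628\u062c"

unicode_words = re.findall(r"\w+", text)           # Unicode-aware word characters
ascii_words = re.findall(r"[A-Za-z0-9_]+", text)   # explicit ASCII-only class
```

`unicode_words` picks up all six tokens, while `ascii_words` keeps only the first three.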

Slide 34

Slide 34 text

\s

Slide 35

Slide 35 text

\R

Slide 36

Slide 36 text

\R LF (\n) CR (\r) FF (\f)

Slide 37

Slide 37 text

\R LF (\n) CR (\r) FF (\f) CRLF (\r\n)

Slide 38

Slide 38 text

\R LF (\n) CR (\r) FF (\f) CRLF (\r\n) NEL VT LS PS
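Python's standard `re` module has no `\R`, but `str.splitlines()` recognizes every line break listed above, so it can serve as a rough stand-in. Note that it also splits on the separators U+001C to U+001E, which `\R` does not cover.

```python
# One string containing each of the line breaks from the slide.
text = (
    "one\n"          # LF
    "two\r"          # CR
    "three\x0c"      # FF
    "four\r\n"       # CRLF, treated as a single break
    "five\x85"       # NEL
    "six\x0b"        # VT
    "seven\u2028"    # LS
    "eight\u2029"    # PS
    "nine"
)

lines = text.splitlines()
```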

Slide 39

Slide 39 text

.

Slide 40

Slide 40 text

\X

Slide 41

Slide 41 text

\X Spın̈al Tap

Slide 42

Slide 42 text

\X Spın̈al Tap

Slide 43

Slide 43 text

\X Spın̈al Tap n\N{COMBINING DIAERESIS}

Slide 44

Slide 44 text

\X Spın̈al Tap n\N{COMBINING DIAERESIS} CRLF (\r\n)

Slide 45

Slide 45 text

\X Spın̈al Tap n\N{COMBINING DIAERESIS} CRLF (\r\n)
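To make the idea behind `\X` concrete, here is a deliberately simplified Python 3 sketch. It is not the full Unicode text segmentation algorithm: the hypothetical helper `simple_clusters` only attaches combining marks to the preceding character and keeps CRLF together, which is enough for the two cases on the slide.

```python
import unicodedata


def simple_clusters(text):
    """Split text into approximate grapheme clusters (simplified rules)."""
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        completes_crlf = ch == "\n" and bool(clusters) and clusters[-1] == "\r"
        if clusters and (is_mark or completes_crlf):
            clusters[-1] += ch       # extend the previous cluster
        else:
            clusters.append(ch)      # start a new cluster
    return clusters


word = "Sp\u0131n\u0308al Tap"       # n followed by COMBINING DIAERESIS
code_points = len(word)              # 11
clusters = simple_clusters(word)     # 10 user-perceived characters
```

For production code, a library that implements the full segmentation rules is the safer choice.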

Slide 46

Slide 46 text

\p{…}

Slide 47

Slide 47 text

\p{ASCII}

Slide 48

Slide 48 text

\P{ASCII}

Slide 49

Slide 49 text

\p{General_Category=Letter}

Slide 50

Slide 50 text

\p{Letter}

Slide 51

Slide 51 text

\p{L}

Slide 52

Slide 52 text

\pL

Slide 53

Slide 53 text

L Letter M Mark N Number P Punctuation S Symbol Z Separator C Other

Slide 54

Slide 54 text

S Symbol Sm Math_Symbol Sc Currency_Symbol Sk Modifier_Symbol So Other_Symbol
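The same two-letter codes are what Python 3's standard `unicodedata.category` returns, so the tables above can be explored interactively. The `MAJOR` mapping and `major_category` helper below are illustrative names, not part of any library.

```python
import unicodedata

# First letter of the two-letter code gives the major category.
MAJOR = {
    "L": "Letter", "M": "Mark", "N": "Number", "P": "Punctuation",
    "S": "Symbol", "Z": "Separator", "C": "Other",
}


def major_category(ch):
    """Map a character to its major General_Category name."""
    return MAJOR[unicodedata.category(ch)[0]]


symbol_codes = {ch: unicodedata.category(ch) for ch in "+\u20ac^\u00a9"}
# "+" -> Sm, "€" -> Sc, "^" -> Sk, "©" -> So
```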

Slide 55

Slide 55 text

\p{Script=Latin}

Slide 56

Slide 56 text

\p{Latin}

Slide 57

Slide 57 text

[\p{Hiragana} \p{Katakana} \p{Han} \p{Latin} \p{Common}]

Slide 58

Slide 58 text

[\p{Hira} \p{Kana} \p{Hani} \p{Latn} \p{Common}]

Slide 59

Slide 59 text

Arab Arabic Beng Bengali Deva Devanagari Egyp Egyptian hieroglyphs Ethi Ethiopic Grek Greek Hang Hangul …

Slide 60

Slide 60 text

return $word
    if $word =~ s{ зи $}{г}x
    || $word =~ s{ е ( \p{Cyrl} ) и $}{я$1}x
    || $word =~ s{ ци $}{к}x
    || $word =~ s{ ( та | ища ) $}{}x;

Slide 61

Slide 61 text

"Größe".lc == "größe"

Slide 62

Slide 62 text

"Größe".lc == "größe" "Größe".uc == "GRÖSSE"

Slide 63

Slide 63 text

"Größe".lc == "größe" "Größe".uc == "GRÖSSE" "Größe".lc != "Größe".uc.lc

Slide 64

Slide 64 text

"Größe".fc == "GRÖSSE".fc

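The case-mapping behaviour on the preceding slides can be reproduced in Python 3, where `str.casefold()` plays the role of `fc`. This is a sketch; `same_ignoring_case` is a hypothetical helper name.

```python
word = "Gr\u00f6\u00dfe"               # "Größe"

lowered = word.lower()                 # "größe"
uppered = word.upper()                 # "GRÖSSE": ß uppercases to SS
round_tripped = uppered.lower()        # "grösse": the ß is not restored


def same_ignoring_case(a, b):
    """Caseless comparison using full case folding."""
    return a.casefold() == b.casefold()
```

Comparing `lower()` results would treat "Größe" and "GRÖSSE" as different words, while the case-folded comparison treats them as equal.
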
Slide 65

Slide 65 text

"Größe".nfc == "Gro\N{COMBINING DIAERESIS}ße".nfc

Slide 66

Slide 66 text

"Größe".nfc.fc == "GRO\N{COMBINING DIAERESIS}SSE".nfc.fc

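A Python 3 sketch of the same comparison, following the slide's order of operations (normalize, then case fold). `caseless_key` is a hypothetical helper name.

```python
import unicodedata


def caseless_key(text):
    """Normalize to NFC, then case fold, so equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", text).casefold()


composed = "Gr\u00f6\u00dfe"           # precomposed ö
decomposed_upper = "GRO\u0308SSE"      # O followed by COMBINING DIAERESIS

equal_without_nfc = composed.casefold() == decomposed_upper.casefold()
equal_with_nfc = caseless_key(composed) == caseless_key(decomposed_upper)
```

Case folding alone leaves the two strings different because one still carries a separate combining mark.
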
Slide 67

Slide 67 text

UTF-8 encoded input ⇩ decode ⇩ character string ⇩ NFD ⇨ hack… hack… hack… ⇨ NFC ⇩ encode ⇩ UTF-8 encoded output

Slide 68

Slide 68 text

UTF-8 encoded input ⇩ decode ⇩ character string ⇩ NFD ⇨ hack… hack… hack… ⇨ NFC ⇩ encode ⇩ UTF-8 encoded output

Slide 69

Slide 69 text

UTF-8 encoded input ⇩ decode ⇩ character string ⇩ NFD ⇨ hack… hack… hack… ⇨ NFC ⇩ encode ⇩ UTF-8 encoded output

Slide 70

Slide 70 text

UTF-8 encoded input ⇩ decode ⇩ character string ⇩ NFD ⇨ hack… hack… hack… ⇨ NFC

Slide 71

Slide 71 text

UTF-8 encoded input ⇩ decode ⇩ character string ⇩ NFD ⇨ hack… hack… hack… ⇨ NFC

Slide 72

Slide 72 text

UTF-8 encoded input ⇩ decode ⇩ character string ⇩ NFD ⇨ hack… hack… hack… ⇨ NFC

Slide 73

Slide 73 text

UTF-8 encoded input ⇩ decode ⇩ character string ⇩ NFD ⇨ hack… hack… hack… ⇨ NFC ⇩ encode

Slide 74

Slide 74 text

UTF-8 encoded input ⇩ decode ⇩ character string ⇩ NFD ⇨ hack… hack… hack… ⇨ NFC ⇩ encode ⇩ UTF-8 encoded output
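The full pipeline, with normalization on both sides of the processing step, might look like this in Python 3. It is a sketch under assumptions: `process_utf8` and `strip_diaeresis` are hypothetical names, and the stripping step only stands in for whatever the "hack… hack… hack…" stage really does.

```python
import unicodedata


def process_utf8(raw, hack):
    """decode -> NFD -> hack -> NFC -> encode, as in the diagram."""
    text = raw.decode("utf-8")
    text = unicodedata.normalize("NFD", text)    # decompose before processing
    text = hack(text)
    text = unicodedata.normalize("NFC", text)    # recompose before output
    return text.encode("utf-8")


def strip_diaeresis(text):
    """Example step that is only simple on decomposed text."""
    return text.replace("\N{COMBINING DIAERESIS}", "")


stripped = process_utf8("Gr\u00f6\u00dfe".encode("utf-8"), strip_diaeresis)
recomposed = process_utf8("Gro\u0308\u00dfe".encode("utf-8"), lambda t: t)
```

Working on NFD text means a mark can be found regardless of whether the input used the precomposed or the decomposed spelling, and the final NFC step gives consumers a consistent form.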

Slide 75

Slide 75 text

No content

Slide 76

Slide 76 text

No content

Slide 77

Slide 77 text

collator = Unicode::Collator.new
countries = collator.sort(countries)

Slide 78

Slide 78 text

collator = Unicode::Collator.new
countries = collator.sort(countries)

Slide 79

Slide 79 text

collator = Unicode::Collator.new(
    locale: "de"  # German
)
de_words = collator.sort(de_words)

Slide 80

Slide 80 text

collator = Unicode::Collator.new(
    level: 2  # ignore case
)
collator.eq("Größe", "GRO\N{COMBINING DIAERESIS}SSE")

Slide 81

Slide 81 text

Testing characters > 7F: Nóirín characters > FF: 松本行弘 characters > FFFF: ⩨ grapheme clusters: Spın̈al Tap foreign digits: ٤٠

Slide 82

Slide 82 text

Testing characters > 7F: Nóirín characters > FF: 松本行弘 characters > FFFF: ⩨ grapheme clusters: Spın̈al Tap foreign digits: ٤٠

Slide 83

Slide 83 text

No content

Slide 84

Slide 84 text

Testing characters > 7F: Nóirín characters > FF: 松本行弘 characters > FFFF: ⩨ grapheme clusters: Spın̈al Tap foreign digits: ٤٠

Slide 85

Slide 85 text

Testing characters > 7F: Nóirín characters > FF: 松本行弘 characters > FFFF: ⩨ grapheme clusters: Spın̈al Tap foreign digits: ٤٠

Slide 86

Slide 86 text

Testing characters > 7F: Nóirín characters > FF: 松本行弘 characters > FFFF: ⩨ grapheme clusters: Spın̈al Tap foreign digits: ٤٠

Slide 87

Slide 87 text

Testing characters > 7F: Nóirín characters > FF: 松本行弘 characters > FFFF: ⩨ grapheme clusters: Spın̈al Tap foreign digits: ٤٠
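These sample strings can be collected into reusable test data. The Python 3 sketch below is one way to do it; the dictionary keys and helper names are made up for illustration, and the character chosen for the "> FFFF" case (U+1F600) is an assumption, not necessarily the one shown on the slide.

```python
samples = {
    "above_7f": "Nóirín",
    "above_ff": "松本行弘",
    "above_ffff": "\U0001F600",              # assumption: any code point beyond U+FFFF
    "grapheme_cluster": "Sp\u0131n\u0308al Tap",
    "foreign_digits": "\u0664\u0660",        # Arabic-Indic digits four and zero
}


def highest_code_point(text):
    """Largest code point in text, to confirm a sample exercises its range."""
    return max(ord(ch) for ch in text)


def survives_round_trip(text):
    """True if encoding to UTF-8 and decoding again returns the same string."""
    return text.encode("utf-8").decode("utf-8") == text


all_round_trip = all(survives_round_trip(s) for s in samples.values())
parsed_digits = int(samples["foreign_digits"])   # int() accepts Unicode decimal digits
```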

Slide 88

Slide 88 text

use utf8;
use open qw( :encoding(UTF-8) :std );
use Test::More tests => 66;
use Lingua::Stem::UniNE::CS qw( stem );

is stem("zvířatech"), "zvíř", "rm -atech";
is stem("zvířatům"),  "zvíř", "rm -atům";
is stem("zvířata"),   "zvíř", "rm -ata";
is stem("zvířaty"),   "zvíř", "rm -aty";
…

Slide 89

Slide 89 text

@nickpatch

Slide 90

Slide 90 text

No content