Strings 101

Afup Day 2024 - Poitiers
Marion Hurteau @marionherisson / 📧 [email protected]

Hello World 👋
Marion Hurteau @MarionHerisson /MarionLeHerisson 📧 [email protected]
JoliCode since 2019 ➡ Consulting, development, audit, expertise and training
🏰 Pony club, castle & homemade beer
Alcohol is dangerous for your health. Drink in moderation (and in good company).

What is a string?

“What is a string?” is a string.
💁 💬 Anything we express can be turned into a string
💻 💬 0s and 1s ❓

Glyph
~ is a glyph, and so are n and ñ
→ an image, an abstract shape

Grapheme
ñ is a grapheme, and n is one too
→ a minimal, distinct unit of writing
→ tied to the context of a specific writing system
Diacritic: ~

A bit of history 🔎

The first encodings
💻 💬 “01100001” means “a”
💁 💬 Well, that depends on who you ask
Image: Unsplash - @fedechanw

The first encodings
1963: ANSI (then called ASA): ASCII!
7 bits → 128 characters (codes 0 to 127)

ASCII
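The mapping between "01100001" and "a" can be checked directly. The snippet below is a minimal Python sketch, not part of the original slides; the variable names are chosen only for illustration.

```python
# Map between a character, its numeric code and its 8-bit binary form.
code = ord("a")                # numeric code of "a"
bits = format(code, "08b")     # the same number written as 8 binary digits
back = chr(int(bits, 2))       # and back from binary digits to a character

# 7 bits can encode 2**7 distinct values, numbered from 0 to 127.
ascii_values = 2 ** 7

print(code, bits, back, ascii_values)
```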

Not great for the rest of the world
German: Schildkröte 🐢
Swedish: Skål! 🍻
French: Éléphant 🐘

1972: 8-bit CPUs arrive!
8 bits → 256 characters (codes 0 to 255)

Image: The I.T. Crowd series

Mojibake
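Mojibake typically appears when bytes written in one encoding are read back in another. The following Python sketch (an illustration added here, assuming the usual UTF-8 versus Latin-1 mix-up) reproduces the effect and then reverses it.

```python
original = "Les chaînes de caractères"

# Write the text as UTF-8, then (wrongly) read the bytes as Latin-1.
utf8_bytes = original.encode("utf-8")
garbled = utf8_bytes.decode("latin-1")
print(garbled)

# As long as no byte was lost, the mistake can be undone by reversing it.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)
```

Each accented letter takes two bytes in UTF-8, and Latin-1 shows each of those bytes as a separate character, which is why one letter turns into two.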

1991: Unicode V1.0!
16 bits → 65,536 characters
• universal
• uniform
• unique

Code points
The code point of A is U+0041
That of B is U+0042
…
That of Ω is U+2126 (the ohm sign; the Greek capital omega is U+03A9)
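A code point is just a number, conventionally written as U+ followed by at least four hexadecimal digits. The helper below is a hypothetical name used for this sketch only.

```python
import unicodedata


def code_point(char: str) -> str:
    """Format the code point of a single character in U+XXXX notation."""
    return f"U+{ord(char):04X}"


for char in ["A", "B", "\u2126", "\u03A9"]:
    print(char, code_point(char), unicodedata.name(char))
```

The last two characters usually look identical on screen, yet they are distinct code points with distinct names.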

1996: Unicode V2.0 introduces UTF!
Encoding ≠ code point
- UTF-32 → 32-bit code units, 4 bytes
- UTF-16 → 16-bit code units, 2 or 4 bytes
- UTF-8 → 8-bit code units, 1 to 4 bytes

✨ UTF-8 ✨
• 8-bit code units
• 0-127: 1 byte → backward compatible with ASCII
• 128+: 2 to 4 bytes (up to 6 in the original design, before it was restricted in 2003)
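The variable width of UTF-8 is easy to observe. This Python sketch (sample characters picked for illustration) encodes one character from each size class and prints the resulting bytes.

```python
# One sample per UTF-8 length: ASCII, Latin-1 range, rest of the BMP, beyond the BMP.
samples = ["a", "\u00e9", "\u20ac", "\U0001F63A"]  # a, é, €, 😺

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} {ch}: {len(encoded)} byte(s) -> {encoded.hex(' ')}")

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

# For code points 0-127 the UTF-8 bytes are identical to the ASCII bytes.
same_as_ascii = "a".encode("utf-8") == "a".encode("ascii")
```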

2022: Unicode V15.0.0 → 149,186 characters and around 245,000 assigned code points, in a space that can hold up to 1,114,112 different code points

To summarize ☕
Graphemes      C            a            f            é
Glyphs         C            a            f            e            ◌́
Code points    U+0043       U+0061       U+0066       U+0065       U+0301
UTF-32         00 00 00 43  00 00 00 61  00 00 00 66  00 00 00 65  00 00 03 01
UTF-16         00 43        00 61        00 66        00 65        03 01
UTF-8          43           61           66           65           CC 81
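The summary can be reproduced with a few lines of Python. This is a sketch that assumes the word is stored in decomposed form (e followed by the combining acute accent U+0301) and uses big-endian encodings so that the bytes read in the same order as above.

```python
# "Café" written with a combining accent: 4 graphemes, 5 code points.
word = "Cafe\u0301"

code_points = [f"U+{ord(ch):04X}" for ch in word]
utf32 = word.encode("utf-32-be").hex(" ", 4)  # one group per 4-byte code unit
utf16 = word.encode("utf-16-be").hex(" ", 2)  # one group per 2-byte code unit
utf8 = word.encode("utf-8").hex(" ")          # one group per byte

print(code_points)
print("UTF-32:", utf32)
print("UTF-16:", utf16)
print("UTF-8: ", utf8)
```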

The Symfony String component 🧶

The Symfony String component
Provides an object-oriented approach to string manipulation.
    new UnicodeString('Å');
    new CodePointString('Å');
    new ByteString('Å');

The Symfony String component
    $text = (new UnicodeString('This is a déjà-vu situation.'))
        ->trimEnd('.')
        ->replace('déjà-vu', 'jamais-vu')
        ->append('!');
    // $text = 'This is a jamais-vu situation!'

The Symfony String component
    u('foo BAR')->upper(); // 'FOO BAR'
    u('FOO Bar')->lower(); // 'foo bar'
    u('Die O\'Brian Straße')->folded(); // "die o'brian strasse"
    u('Foo: Bar-baz.')->camel(); // 'fooBarBaz'
    u('Foo: Bar-baz.')->snake(); // 'foo_bar_baz'
    u('Foo: Bar-baz.')->camel()->title(); // 'FooBarBaz'

The Symfony String component
    u('abc')->indexOf('B'); // null
    u('abc')->ignoreCase()->indexOf('B'); // 1
    u('hello')->append('world'); // 'helloworld'
    u('hello')->append(' ', 'world'); // 'hello world'
    u('User')->ensureEnd('Controller'); // 'UserController'
    u('UserController')->ensureEnd('Controller'); // 'UserController'

The Symfony String component
    u(' Lorem Ipsum ')->padBoth(20, '-'); // '--- Lorem Ipsum ----'
    u('_.')->repeat(10); // '_._._._._._._._._._.'
    u(' Lorem Ipsum ')->trim(); // 'Lorem Ipsum'
    u('http://symfony.com')->replace('http://', 'https://');
    u('Symfony is great')->slice(0, 7); // 'Symfony'
    u('template_name.html.twig')->split('.'); // ['template_name', 'html', 'twig']

Methods specific to ByteString
    // returns TRUE if the string contents
    // are valid UTF-8 contents
    b('Lorem Ipsum')->isUtf8(); // true
    b("\xc3\x28")->isUtf8(); // false
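Outside PHP, the same kind of validity check can be sketched by attempting a strict decode. The Python function below is a hypothetical stand-in written for this example, not Symfony's implementation.

```python
def is_utf8(data: bytes) -> bool:
    """Return True when the bytes form a valid UTF-8 sequence."""
    try:
        data.decode("utf-8")  # strict decoding raises on any invalid sequence
    except UnicodeDecodeError:
        return False
    return True


print(is_utf8(b"Lorem Ipsum"))
# 0xC3 announces a two-byte sequence, but 0x28 is not a continuation byte.
print(is_utf8(b"\xc3\x28"))
```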

Methods specific to CodePointString and UnicodeString
    u('नमस्ते')->codePointsAt(0); // न [2344]
    u('नमस्ते')->codePointsAt(1); // म [2350]
    u('नमस्ते')->codePointsAt(2); // स्ते [2360, 2381, 2340, 2375]
    u('नमस्ते')->ascii(); // 'namaste'
    u('さよなら')->ascii(); // 'sayonara'
    u('спасибо')->ascii(); // 'spasibo'

AsciiSlugger
    $slugger = new AsciiSlugger();
    $slug = $slugger->slug('Wôrķšƥáçè ~~sèťtïñğš~~', '/');
    // $slug = 'Workspace/settings'

    $slugger = $slugger->withEmoji();
    $slug = $slugger->slug('a 😺, and a 🦁 go to 🏞', '-', 'en');
    // $slug = 'a-grinning-cat-and-a-lion-go-to-national-park';
    $slug = $slugger->slug('un 😺, et un 🦁 vont au 🏞', '-', 'fr');
    // $slug = 'un-chat-qui-sourit-et-un-tete-de-lion-vont-au-parc-national';

Inflector
    $inflector = new EnglishInflector();
    $result = $inflector->singularize('teeth'); // ['tooth']
    $result = $inflector->singularize('radii'); // ['radius']
    $result = $inflector->singularize('leaves'); // ['leaf', 'leave', 'leaff']
    $result = $inflector->pluralize('bacterium'); // ['bacteria']
    $result = $inflector->pluralize('news'); // ['news']
    $result = $inflector->pluralize('person'); // ['persons', 'people']

What is the length of “🤦🏼‍♂️”?

What is the length of “🤦🏼‍♂️”?
python3:
    >>> len("🤦🏼‍♂️") == 5
    True
JavaScript:
    "🤦🏼‍♂️".length == 7 // true
Rust:
    "🤦🏼‍♂️".len() == 17 // true
Elixir:
    String.length("🤦🏼‍♂️") # 1
Swift:
    var s = "🤦🏼‍♂️"
    print(s.count) // 1
    print(s.unicodeScalars.count) // 5
    print(s.utf16.count) // 7
    print(s.utf8.count) // 17
Source: https://hsivonen.fi/string-length/

What is the length of “🤦🏼‍♂️”?
PHP:
    strlen('🤦🏼‍♂️'); // 17
    mb_strlen('🤦🏼‍♂️', 'UTF-8'); // 5
Laravel:
    Str::length("🤦🏼‍♂️", "UTF-8"); // 5
Symfony:
    u('🤦🏼‍♂️')->length(); // 1

What is the length of “🤦🏼‍♂️”?

Unicode scalar                                  UTF-32  UTF-16  UTF-8   UTF-32  UTF-16  UTF-8
                                                code    code    code    bytes   bytes   bytes
                                                units   units   units
U+1F926 FACE PALM 🤦                            1       2       4       4       4       4
U+1F3FC EMOJI MODIFIER FITZPATRICK TYPE-3 🏼     1       2       4       4       4       4
U+200D ZERO WIDTH JOINER                        1       1       3       4       2       3
U+2642 MALE SIGN ♂                              1       1       3       4       2       3
U+FE0F VARIATION SELECTOR-16                    1       1       3       4       2       3
Total                                           5       7       17      20      14      17

Source: https://hsivonen.fi/string-length/
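The totals in that table can be recomputed in a few lines. This Python sketch builds the emoji from its five code points with escape sequences, so it does not depend on how the editor stores the character.

```python
# FACE PALM + skin tone modifier + ZERO WIDTH JOINER + MALE SIGN + VARIATION SELECTOR-16
facepalm = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

counts = {
    "code points (UTF-32 code units)": len(facepalm),
    "UTF-16 code units": len(facepalm.encode("utf-16-le")) // 2,
    "UTF-8 code units": len(facepalm.encode("utf-8")),
    "UTF-32 bytes": len(facepalm.encode("utf-32-le")),
    "UTF-16 bytes": len(facepalm.encode("utf-16-le")),
    "UTF-8 bytes": len(facepalm.encode("utf-8")),
}

for label, value in counts.items():
    print(f"{label}: {value}")
```

Counting graphemes (the answer 1) is not covered here: Python's standard library has no grapheme segmentation, so that would need a third-party package.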

The Symfony String component
    (new ByteString('🤦🏼‍♂️'))->length(); // 17
    (new CodePointString('🤦🏼‍♂️'))->length(); // 5
    (new UnicodeString('🤦🏼‍♂️'))->length(); // 1

The Symfony String component
    u('👩‍👩‍👧‍👦')->codePointsAt(0);
    [
        0 => 128105    U+1F469 👩 WOMAN
        1 => 8205      U+200D ZERO WIDTH JOINER
        2 => 128105    U+1F469 👩 WOMAN
        3 => 8205      U+200D ZERO WIDTH JOINER
        4 => 128103    U+1F467 👧 GIRL
        5 => 8205      U+200D ZERO WIDTH JOINER
        6 => 128102    U+1F466 👦 BOY
    ]

Transliteration 📖

Transliteration?
• Αθήνα → Athena
• Before sorting, indexing, …
Sorted by code point: [ Antwerpen, Brussel, Cannes, …, Zurich, Αθήνα ]
Sorted after transliteration: [ Antwerpen, Αθήνα, Brussel, Cannes, …, Zurich ]

Transliteration?
    transliterator_transliterate(
        'Any-Latin; Latin-ASCII; Lower()',
        "A æ Übérmensch på høyeste nivå! И я люблю PHP! ﬁ"
    );
    // "a ae ubermensch pa hoyeste niva! i a lublu php! fi"
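Part of what the ICU rules above do, removing accents, can be approximated with a decomposition followed by dropping the combining marks. The Python sketch below shows only that step and is not a replacement for ICU: it does not convert between scripts, and letters without a decomposition (such as ø or æ) pass through unchanged. The function name is hypothetical.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose the text (NFD) and drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("Übérmensch på nivå").lower())
print(strip_accents("høyeste"))  # ø has no decomposition, so it is kept
print(strip_accents("Αθήνα"))    # the accent goes away, the script stays Greek
```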

Bad encoding kills 📱

Source: https://gizmodo.com/a-cellphones-missing-dot-kills-two-people-puts-three-m-382026

The Turkish i case
Locale tr:
    i (U+0069) → toUpperCase → İ (U+0130)
    İ (U+0130) → toLowerCase → i (U+0069)
    ı (U+0131) → toUpperCase → I (U+0049)
    I (U+0049) → toLowerCase → ı (U+0131)

Source: https://decodeunicode.org/en/u+00130

The Turkish i case
Locale en_US:
    I (U+0049) → toLowerCase → i (U+0069)
    i (U+0069) → toUpperCase → I (U+0049)
    İ (U+0130) and ı (U+0131) fall outside this pair
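Python's `str.upper()` and `str.lower()` ignore the locale and always apply the default Unicode mappings, which makes the problem easy to observe. The sketch below is an illustration of that default behavior, not of Turkish-aware casing.

```python
dotless_i = "\u0131"         # ı LATIN SMALL LETTER DOTLESS I
dotted_capital_i = "\u0130"  # İ LATIN CAPITAL LETTER I WITH DOT ABOVE

# The default mappings pair I with i, as an English speaker expects.
upper_of_dotless = dotless_i.upper()        # plain "I"
lower_of_dotted = dotted_capital_i.lower()  # "i" followed by COMBINING DOT ABOVE

# Upper-casing then lower-casing a Turkish dotless ı does not give it back.
round_trip = dotless_i.upper().lower()

print([f"U+{ord(c):04X}" for c in upper_of_dotless])
print([f"U+{ord(c):04X}" for c in lower_of_dotted])
print(round_trip == dotless_i)
```

Locale-aware casing needs an ICU-based library or an explicit special case for Turkish text.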

Normalization 🦄

Why is "Å" !== "Å" !== "Å"?
• A combination of A (U+0041) and ◌̊ (U+030A)
• The code point U+00C5, which gives Å, or “LATIN CAPITAL LETTER A WITH RING ABOVE”
• The code point U+212B, for “ANGSTROM SIGN” Å
Image: https://tonsky.me/blog/unicode/

Normalization
Canonical normalization:
    NFD: canonical decomposition, Å => A + ◌̊
    NFC: canonical composition, A + ◌̊ => Å
Compatibility normalization:
    NFKD: compatibility decomposition, Å => A + ◌̊
    NFKC: compatibility composition, A + ◌̊ => Å

Normalization
Composition: A + ◌̊ (U+0041 + U+030A) and Å (U+212B) → Å (U+00C5)
Decomposition: Å (U+00C5) and Å (U+212B) → A + ◌̊ (U+0041 + U+030A)
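The three spellings of Å can be compared before and after normalization. This Python sketch relies on the standard `unicodedata` module; the variable names are illustrative.

```python
import unicodedata

composed = "\u00C5"   # LATIN CAPITAL LETTER A WITH RING ABOVE
combined = "A\u030A"  # A followed by COMBINING RING ABOVE
angstrom = "\u212B"   # ANGSTROM SIGN
forms = [composed, combined, angstrom]

# As raw strings, the three spellings are all different.
all_distinct = len(set(forms)) == 3

# After normalization they collapse to a single spelling.
nfc = {unicodedata.normalize("NFC", s) for s in forms}
nfd = {unicodedata.normalize("NFD", s) for s in forms}

# Compatibility forms go further, for example expanding the "fi" ligature.
ligature_nfc = unicodedata.normalize("NFC", "\uFB01")
ligature_nfkc = unicodedata.normalize("NFKC", "\uFB01")

print(all_distinct, nfc, nfd, ligature_nfkc)
```

Normalizing both sides to the same form before comparing or storing strings avoids this class of mismatch.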

PHP native functions 🐘

ASCII: chr() and ord()
    /**
     * Generate a single-byte string from a number
     * @param int $codepoint : The ascii code.
     * @return string the specified character.
     */
    #[Pure]
    function chr(int $codepoint): string {}

    /**
     * Convert the first byte of a string to a value between 0 and 255
     * @param string $character : A character.
     * @return int<0, 255> the ASCII value as an integer.
     */
    #[Pure]
    function ord(string $character): int {}

mb_ functions
• > 1 byte (multibyte)
• str_replace works just fine, provided the haystack and the needle use the same encoding
• The mbstring extension has to be enabled manually in PHP

utf8_ functions
utf8_encode() and utf8_decode()
Only from and to ISO-8859-1!
Deprecated since PHP 8.2.0

Emojis as class names…
    interface 🍚 {}
    interface 🐟 {}
    class 🍣 implements 🍚, 🐟
    {
        // some code
    }

…or any other Unicode character
    interface ┻━┻ {}
    class （╯°□°）╯︵┻━┻ extends Exception implements ┻━┻
    {
        // some code
    }
    class （ノ゜Д゜）ノ︵┻━┻ extends Exception implements ┻━┻
    {
        // some code
    }
Source: https://github.com/sgolemon/table-flip

Equations that are easier to read!
    $√2π = sqrt(2 * $π);
    $⟮z ＋ g ＋½⟯ᶻ⁺½ = pow($z + $g + 0.5, $z + 0.5);
    $ℯ＾−⟮z ＋ g ＋½⟯ = exp(-($z + $g + 0.5));

    /**
     * Put it all together:
     *  __  /         1 \ z+½
     * √2π | z + g + -  |    e^-(z+g+½) A(z)
     *      \         2 /
     */
    return $√2π * $⟮z ＋ g ＋½⟯ᶻ⁺½ * $ℯ＾−⟮z ＋ g ＋½⟯ * $A⟮z⟯;
Source: https://github.com/markrogoyski/math-php

Spaces in method names

Homoglyphs
    ( U+0028 LEFT PARENTHESIS
    （ U+FF08 FULLWIDTH LEFT PARENTHESIS
    ﹙ U+FE59 SMALL LEFT PARENTHESIS
    ⁽ U+207D SUPERSCRIPT LEFT PARENTHESIS
    ₍ U+208D SUBSCRIPT LEFT PARENTHESIS
    ❨ U+2768 MEDIUM LEFT PARENTHESIS ORNAMENT
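When two characters look alike, their code points and official names tell them apart. The Python sketch below lists the same six parentheses from escape sequences, using the standard `unicodedata` module.

```python
import unicodedata

# Six different code points that all render as some kind of left parenthesis.
lookalikes = "\u0028\uFF08\uFE59\u207D\u208D\u2768"

names = [unicodedata.name(ch) for ch in lookalikes]
for ch, name in zip(lookalikes, names):
    print(f"{ch} U+{ord(ch):04X} {name}")

# None of them compares equal to the ASCII parenthesis except the first.
equal_to_ascii = [ch == "(" for ch in lookalikes]
```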

Source: https://decodeunicode.org/en/u+00069

In your database 💾

“SET NAMES”, “CHARACTER SET”, “COLLATE”?
    SET NAMES utf8mb4 COLLATE utf8mb4_general_ci;
    CREATE DATABASE awesome_database CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

Charsets & Collations
Character set: glyphs and their encoding
    my_alphabet = [ 0 = A, 1 = B, 2 = a, 3 = b ]
A ? B
A < B → Collation
a < b → Collation
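The difference between a character set and a collation can be sketched with the four-letter alphabet above: the character set assigns numbers, the collation decides how those numbers are compared. The two key functions below are hypothetical names written for this toy example and do not mirror any real database collation.

```python
# Character set: each glyph gets a number.
MY_ALPHABET = {"A": 0, "B": 1, "a": 2, "b": 3}


def binary_key(word: str) -> list[int]:
    """Binary collation: compare the raw encoded values."""
    return [MY_ALPHABET[ch] for ch in word]


def case_insensitive_key(word: str) -> list[int]:
    """Case-insensitive collation: treat a like A and b like B."""
    return [MY_ALPHABET[ch.upper()] for ch in word]


words = ["b", "A", "a", "B"]
binary_order = sorted(words, key=binary_key)
ci_order = sorted(words, key=case_insensitive_key)

print(binary_order)
print(ci_order)
```

Same characters, same encoding, two different orders: that is what choosing a collation changes.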

Collation examples
• utf8mb4_unicode_ci:
    ◦ "ß" = "ss"
    ◦ "Œ" = "OE"
    ◦ Some characters are ignored
• utf8mb4_general_ci:
    ◦ "ß" = "s"
    ◦ "Œ" = "e"

And “SET NAMES”?
    SET character_set_client = 'utf8mb4'
    SET character_set_connection = 'utf8mb4'
    SET character_set_results = 'utf8mb4'

UTF-8 naming
• PostgreSQL: UTF8
    ◦ 8 bits
    ◦ 1 to 4 bytes
• Oracle: AL32UTF8 (true UTF-8, Unicode 9.0)
    ◦ Not UTF8! (resembles CESU-8, Unicode 3.0)
• MySQL: utf8mb4
    ◦ Not utf8! (a kind of UTF-8 limited to 3 bytes)

MySQL's “utf8”
2002: UTF-8 → up to 6 bytes per character
Performance boost if every row in the table has the same number of bytes => CHAR type!
CHAR(1) = 6 bytes, CHAR(2) = 12 bytes, etc.
2003: the old UTF-8 standard becomes obsolete

Source: https://github.com/mysql/mysql-server/commit/43a506c0ced0e6ea101d3ab8b4b423ce3fa327d0

2010: utf8mb4

Security 🔓

Phabricator
In MySQL, when you insert a character whose code point is above 0xFFFF into a column that uses the utf8 charset, MySQL truncates the string at that character.
“Hello 👋 world 🌍 !” → “Hello ”
👋 = U+1F44B > 0xFFFF

Phabricator
Sign-up is restricted to a specific domain
The attacker signs up with “[email protected]🍕@allowed-domain.com”
The domain check, if there is one, passes
But only “[email protected]” is stored in the database!
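The attack can be simulated without a database. In the Python sketch below, `store_in_legacy_utf8_column` is a hypothetical function that only imitates the truncation described above, `domain_is_allowed` is a deliberately naive check, and the e-mail addresses are made up for the example.

```python
def store_in_legacy_utf8_column(value: str) -> str:
    """Imitate the truncation: cut at the first code point above 0xFFFF."""
    for index, char in enumerate(value):
        if ord(char) > 0xFFFF:
            return value[:index]
    return value


def domain_is_allowed(email: str, allowed_domain: str) -> bool:
    """Naive check performed on the submitted value, before storage."""
    return email.endswith("@" + allowed_domain)


greeting = store_in_legacy_utf8_column("Hello \U0001F44B world \U0001F30D !")

submitted = "attacker@evil.example\U0001F355@allowed-domain.com"
check_passes = domain_is_allowed(submitted, "allowed-domain.com")
stored = store_in_legacy_utf8_column(submitted)

print(repr(greeting), check_passes, stored)
```

The validated value and the stored value differ, which is the root of the problem: validation should run on exactly what ends up in the database, and the column should use utf8mb4.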

Homoglyph attack
“Forgot your password?” page
Enter an e-mail address that looks like the targeted address
[email protected] ≠ ｍｉᎬᎬ@example.org
The address is normalized before being looked up in the database
A token for [email protected] is generated and sent to ｍｉᎬᎬ@example.org, who can now log in as [email protected]
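The collision created by normalizing before the lookup can be shown in a few lines. This Python sketch uses made-up addresses and a hypothetical `normalize_email` function based on NFKC; the look-alike here is built from fullwidth Latin letters, which is only one of many possible homoglyph sets.

```python
import unicodedata


def normalize_email(address: str) -> str:
    """Hypothetical normalization step applied before the database lookup."""
    return unicodedata.normalize("NFKC", address).casefold()


victim = "mike@example.org"
# Fullwidth letters m, i, k, e: visually close to ASCII, different code points.
lookalike = "\uFF4D\uFF49\uFF4B\uFF45@example.org"

different_strings = lookalike != victim
same_after_normalization = normalize_email(lookalike) == normalize_email(victim)

print(different_strings, same_after_normalization)
```

One possible mitigation, under these assumptions, is to send the reset token to the address stored in the database rather than to the address typed into the form.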

To conclude 🧁

UTF-8
• Everywhere. All the time.
• Except that it is not always called “utf-8”
• Watch out for the locale!
• Fortunately, there are tools

Happy encoding! 🔧

Thank you! Any questions?

SOURCES - Webpages
https://adamhooper.medium.com/in-mysql-never-use-utf8-use-utf8mb4-11761243e434
https://decodeunicode.org/
https://deliciousbrains.com/how-unicode-works/
https://dev.mysql.com/doc/refman/8.0/en/charset-general.html
https://github.com/brefphp/bref/blob/f4df37277181dc76b6f644663de236eae7a793d2/src/functions.php#L11
https://github.com/captioning/captioning/issues/86
https://github.com/jolicode/emoji-search
https://github.com/markrogoyski/math-php
https://github.com/mysql/mysql-server/commit/43a506c0ced0e6ea101d3ab8b4b423ce3fa327d0
https://github.com/PHP-CS-Fixer/PHP-CS-Fixer/blob/master/src/Fixer/Basic/EncodingFixer.php
https://github.com/sgolemon/table-flip/blob/master/src/TableFlip.php
https://github.com/symfony/symfony/blob/85b97226def5e4a50c1e3805a6c31bb6642efb70/src/Symfony/Component/Intl/Tests/Transliterator/EmojiTransliteratorTest.php
https://github.com/symfony/symfony/pull/33896/files
https://gizmodo.com/a-cellphones-missing-dot-kills-two-people-puts-three-m-382026
https://hackerone.com/reports/2233
https://hsivonen.fi/string-length/
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
https://jolicode.github.io/unicode-conf/index.html#/
https://kunststube.net/encoding/
https://news.ycombinator.com/item?id=8892157
https://www.php.net/manual/en/function.utf8-decode.php
https://www.php.net/manual/en/mbstring.supported-encodings.php
https://www.php.net/manual/fr/refs.international.php
https://www.postgresql.org/docs/current/multibyte.html
https://pyrech.github.io/php-wtf/#/15?_k=dyazd4
https://stackoverflow.com/questions/766809/whats-the-difference-between-utf8-general-ci-and-utf8-unicode-ci/
https://symfony.com/blog/new-in-symfony-6-2-better-emoji-support
https://symfony.com/doc/current/components/intl.html
https://symfony.com/doc/current/components/string.html
https://tonsky.me/blog/unicode/
http://www.unicode.org/charts/
http://unicode.org/emoji/charts/emoji-variants.html
https://unicode-org.github.io/icu/userguide/transforms/general/#script-transliteration
https://unicode.org/glossary/
https://fr.wikipedia.org/wiki/Wikip%C3%A9dia:Unicode/Test
https://en.wikipedia.org/wiki/Character_encoding
https://en.wikipedia.org/wiki/UTF-8
https://en.wikipedia.org/wiki/Mojibake
https://en.wikipedia.org/wiki/Byte_order_mark
https://www.youtube.com/watch?v=kaucJce8hhE&t=19s&ab_channel=TheUnicodeConsortium

SOURCES - Images
https://unsplash.com/photos/command-computer-keyboard-key-46T6nVjRc2w (Hannah Joshua)
https://unsplash.com/photos/crt-monitor-turned-off-aiqKc07b5PA (Federica Galli)

SOURCES - Books
Unicode à gogo ! by Design Brouhaha