Slide 1

Slide 1 text

The fabulous World of Emojis and other Unicode symbols #Symfony_Live @nicolasgrekas

Slide 2

Slide 2 text

ASCII (rows 0x to 7x; rows 8x to Fx are the Windows-1252 extension)
     x0   x1   x2   x3   x4   x5   x6   x7   x8   x9   xA   xB   xC   xD   xE   xF
0x   NUL  SOH  STX  ETX  EOT  ENQ  ACK  BEL  BS   HT   LF   VT   FF   CR   SO   SI
1x   DLE  DC1  DC2  DC3  DC4  NAK  SYN  ETB  CAN  EM   SUB  ESC  FS   GS   RS   US
2x   SP   !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /
3x   0    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?
4x   @    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O
5x   P    Q    R    S    T    U    V    W    X    Y    Z    [    \    ]    ^    _
6x   `    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o
7x   p    q    r    s    t    u    v    w    x    y    z    {    |    }    ~    DEL
8x   €         ‚    ƒ    „    …    †    ‡    ˆ    ‰    Š    ‹    Œ         Ž
9x        ‘    ’    “    ”    •    –    —    ˜    ™    š    ›    œ         ž    Ÿ
Ax   NBSP ¡    ¢    £    ¤    ¥    ¦    §    ¨    ©    ª    «    ¬    SHY  ®    ¯
Bx   °    ±    ²    ³    ´    µ    ¶    ·    ¸    ¹    º    »    ¼    ½    ¾    ¿
Cx   À    Á    Â    Ã    Ä    Å    Æ    Ç    È    É    Ê    Ë    Ì    Í    Î    Ï
Dx   Ð    Ñ    Ò    Ó    Ô    Õ    Ö    ×    Ø    Ù    Ú    Û    Ü    Ý    Þ    ß
Ex   à    á    â    ã    ä    å    æ    ç    è    é    ê    ë    ì    í    î    ï
Fx   ð    ñ    ò    ó    ô    õ    ö    ÷    ø    ù    ú    û    ü    ý    þ    ÿ

Slide 3

Slide 3 text

Windows-1252
     x0   x1   x2   x3   x4   x5   x6   x7   x8   x9   xA   xB   xC   xD   xE   xF
0x   NUL  SOH  STX  ETX  EOT  ENQ  ACK  BEL  BS   HT   LF   VT   FF   CR   SO   SI
1x   DLE  DC1  DC2  DC3  DC4  NAK  SYN  ETB  CAN  EM   SUB  ESC  FS   GS   RS   US
2x   SP   !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /
3x   0    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?
4x   @    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O
5x   P    Q    R    S    T    U    V    W    X    Y    Z    [    \    ]    ^    _
6x   `    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o
7x   p    q    r    s    t    u    v    w    x    y    z    {    |    }    ~    DEL
8x   €         ‚    ƒ    „    …    †    ‡    ˆ    ‰    Š    ‹    Œ         Ž
9x        ‘    ’    “    ”    •    –    —    ˜    ™    š    ›    œ         ž    Ÿ
Ax   NBSP ¡    ¢    £    ¤    ¥    ¦    §    ¨    ©    ª    «    ¬    SHY  ®    ¯
Bx   °    ±    ²    ³    ´    µ    ¶    ·    ¸    ¹    º    »    ¼    ½    ¾    ¿
Cx   À    Á    Â    Ã    Ä    Å    Æ    Ç    È    É    Ê    Ë    Ì    Í    Î    Ï
Dx   Ð    Ñ    Ò    Ó    Ô    Õ    Ö    ×    Ø    Ù    Ú    Û    Ü    Ý    Þ    ß
Ex   à    á    â    ã    ä    å    æ    ç    è    é    ê    ë    ì    í    î    ï
Fx   ð    ñ    ò    ó    ô    õ    ö    ÷    ø    ù    ú    û    ü    ý    þ    ÿ

Slide 4

Slide 4 text

W3Techs survey – Feb 2017

Slide 5

Slide 5 text

Unicode: 130k characters, 135 scripts
Peace سلام 和平 ☮

Slide 6

Slide 6 text

Unicode: 130k characters, 135 scripts
P U+0050 LATIN CAPITAL LETTER P
س U+0633 ARABIC LETTER SEEN
和 U+548C CJK UNIFIED IDEOGRAPH-548C
☮ U+262E PEACE SYMBOL
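
The code point and name pairs on this slide can be looked up programmatically. A minimal sketch, assuming Python and its standard `unicodedata` module; the helper name `describe` is made up for illustration:

```python
import unicodedata

def describe(ch):
    """Return 'U+XXXX NAME' for a single character (hypothetical helper)."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# The four characters from the slide, written as escapes to avoid
# any ambiguity about which code point is meant.
for ch in ["P", "\u0633", "\u548c", "\u262e"]:
    print(ch, describe(ch))
```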

Slide 7

Slide 7 text


Slide 8

Slide 8 text

17 planes
http://reedbeta.com/blog/programmers-intro-to-unicode/
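
The number 17 follows from the size of the code space: code points run from U+0000 to U+10FFFF, and each plane holds 65,536 of them. A small sketch in Python; the names `PLANE_SIZE`, `number_of_planes` and `plane_of` are illustrative only:

```python
PLANE_SIZE = 0x10000        # 65,536 code points per plane
MAX_CODE_POINT = 0x10FFFF   # highest code point Unicode allows

number_of_planes = (MAX_CODE_POINT + 1) // PLANE_SIZE

def plane_of(ch):
    """Return the plane index (0 to 16) of a single character."""
    return ord(ch) >> 16

print(number_of_planes)        # planes in the whole code space
print(plane_of("A"))           # Basic Multilingual Plane
print(plane_of("\U0001F600"))  # an emoji from a supplementary plane
```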

Slide 9

Slide 9 text


Slide 10

Slide 10 text


Slide 11

Slide 11 text

From Code Points to Bytes
á U+00E1 LATIN SMALL LETTER A WITH ACUTE
  UTF-16BE: 00 E1
  UTF-8: C3 A1
あ U+3042 HIRAGANA LETTER A
  UTF-16BE: 30 42
  UTF-8: E3 81 82
UTF-8: 1, 2, 3 or 4 bytes
UTF-16: 2 or 4 bytes
UTF-32: 4 bytes
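
The byte sequences on this slide can be reproduced with any language that exposes its encoders. A minimal sketch in Python, where `hex_bytes` is a made-up helper name:

```python
def hex_bytes(text, encoding):
    """Encode text and return its bytes as space-separated hex pairs."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))

for ch in ["\u00e1", "\u3042"]:  # á and あ
    print(ch, "UTF-16BE", hex_bytes(ch, "utf-16-be"))
    print(ch, "UTF-8   ", hex_bytes(ch, "utf-8"))
    print(ch, "UTF-32BE", hex_bytes(ch, "utf-32-be"))
```

The same character costs a different number of bytes in each encoding, which is why "length" is ambiguous until the unit is stated.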

Slide 12

Slide 12 text

UTF-8 rules the World
Byte 1    Byte 2    Byte 3    Byte 4
0xxxxxxx
110xxxxx  10xxxxxx
1110xxxx  10xxxxxx  10xxxxxx
11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
• ASCII compatible
• Self-synchronizing
• Endianness insensitive
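
The bit layout in the table above can be turned into a hand-written encoder. This is only a sketch to make the patterns concrete: `utf8_encode` is a hypothetical name, real code should use the built-in encoder, and this version does not reject surrogate code points.

```python
def utf8_encode(code_point):
    """Encode one code point following the 1 to 4 byte UTF-8 patterns."""
    if code_point < 0x80:        # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:   # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point out of range")

# Compare with the built-in encoder for one character of each length.
for ch in ["A", "\u00e1", "\u3042", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", utf8_encode(ord(ch)).hex(" "), ch.encode("utf-8").hex(" "))
```

Continuation bytes always start with the bits 10, so a decoder dropped into the middle of a stream can skip forward to the next lead byte; that is the self-synchronizing property listed on the slide.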

Slide 13

Slide 13 text

Case sensitivity
• About 1k characters are affected
• One upper case letter, two lower case variants: Σ ⇔ σ/ς
• The Turkish exception: I ⇔ i vs İ ⇔ i and I ⇔ ı
• Full case folding: ß ⇔ ss
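
The cases listed above can be observed with locale-independent case mappings. A short sketch using Python's built-in `str` methods; note that these methods know nothing about the Turkish locale, so the Turkish pairing I ⇔ ı is not applied here, only the default mappings are shown:

```python
# Two lower case sigmas share one upper case letter.
print("\u03c3".upper() == "\u03a3")   # σ -> Σ
print("\u03c2".upper() == "\u03a3")   # ς -> Σ
print("\u03c2".casefold())            # folding maps final sigma to σ

# Full case folding may change the length of a string.
print("\u00df".casefold())            # ß -> ss
print("STRASSE".casefold() == "stra\u00dfe".casefold())

# Default (non-Turkish) mappings around the dotted and dotless i.
print("I".lower())                    # plain i, not dotless ı
print("\u0131".upper())               # ı -> I
print(len("\u0130".lower()))          # İ -> i + combining dot above
```

Comparing strings case-insensitively is therefore safer with folding on both sides than with `lower()` or `upper()` on one side.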

Slide 14

Slide 14 text

Collations
• Lithuanian puts y between i and k
• In Spanish, ch is traditionally a single letter
• œ sorts between oe and of
• In Danish, Å is a separate letter that follows Z
• In Swedish, v and w are the same letter
• Traditional German considers ä to be the same as ae
• In French, côté > coté > côte > cote
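
The French ordering in the last bullet comes from comparing base letters first and accents second, with the accents compared from the end of the word. The following is a toy sketch of that idea only, not the Unicode Collation Algorithm; `french_sort_key` is a hypothetical helper, and production code should rely on ICU or a database collation instead.

```python
import unicodedata

def french_sort_key(word):
    """Toy key: base letters first, then accents compared right to left."""
    base_letters = []
    marks = []  # combining marks attached to each base letter
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and marks:
            marks[-1] += ch          # attach the accent to the previous letter
        else:
            base_letters.append(ch)
            marks.append("")
    return ("".join(base_letters), marks[::-1])

# cote, côte, coté, côté written with escapes so the input is unambiguous.
words = ["c\u00f4t\u00e9", "cote", "cot\u00e9", "c\u00f4te"]

by_code_point = sorted(words)
by_french_key = sorted(words, key=french_sort_key)

print(by_code_point)
print(by_french_key)
```

Sorting by code point places coté before côte, while the toy key yields cote, côte, coté, côté, which matches the order given on the slide.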

Slide 15

Slide 15 text

Composed/decomposed forms
• NFC: D é j à
  U+0044 U+00E9 U+006A U+00E0
• NFD: D e ◌́ j a ◌̀
  U+0044 U+0065 U+0301 U+006A U+0061 U+0300
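
The two forms look identical on screen but are different code point sequences, so they must be normalized before comparison. A minimal sketch with Python's standard `unicodedata` module, using the word from the slide; the variable and helper names are illustrative:

```python
import unicodedata

composed = "D\u00e9j\u00e0"           # NFC: 4 code points
decomposed = "De\u0301ja\u0300"       # NFD: 6 code points

def code_points(text):
    """List the code points of a string in U+XXXX notation."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

print(code_points(composed))
print(code_points(decomposed))

# A naive comparison fails; comparing after normalization succeeds.
print(composed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == composed)
print(unicodedata.normalize("NFD", composed) == decomposed)
```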

Slide 16

Slide 16 text

Grapheme Clusters
• NFC: D é j à
  U+0044 U+00E9 U+006A U+00E0
• NFD: D e ◌́ j a ◌̀
  U+0044 U+0065 U+0301 U+006A U+0061 U+0300
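
Whatever the normalization form, a reader sees four characters in this word. The sketch below contrasts three ways of measuring length. The grapheme count is a rough approximation (it only skips combining marks) and is an assumption made for illustration; real segmentation follows the Unicode text segmentation rules and also handles emoji sequences, Hangul and more, typically through ICU.

```python
import unicodedata

def approx_grapheme_count(text):
    """Rough estimate: count code points that are not combining marks."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

composed = "D\u00e9j\u00e0"       # NFC
decomposed = "De\u0301ja\u0300"   # NFD

for label, text in [("NFC", composed), ("NFD", decomposed)]:
    print(label,
          "bytes:", len(text.encode("utf-8")),
          "code points:", len(text),
          "graphemes (approx.):", approx_grapheme_count(text))
```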

Slide 17

Slide 17 text

Emoji - http://unicode.org/reports/tr51/

Slide 18

Slide 18 text

Emoji glyphs

Slide 19

Slide 19 text

Unicode fundamentals summary
• Uppercase, lowercase, folding
• Compositions, ligatures
• Comparison: normalizations & collations
• Segmentation: characters, words, sentences & hyphens
• Locales: cultural conventions, transliterations
• Identifiers & security, confusables
• Display: direction, width

Slide 20

Slide 20 text

Unicode in Practice?

Slide 21

Slide 21 text

MySQL
• utf8_bin: A ≠ a
• utf8_general_ci: œ ≠ oe
• utf8_unicode_ci: œ = oe
• utf8_swedish_ci: z > å > æ = ä > ö = ø
• SET NAMES utf8mb4: for security and storing emojis
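
The reason utf8mb4 is needed for emojis is a matter of byte counts: MySQL's legacy utf8 character set (utf8mb3) stores at most three bytes per character, while emojis need four bytes in UTF-8. A small sketch in Python that checks this on the application side; `fits_in_utf8mb3` is a hypothetical helper name and no database is involved:

```python
def fits_in_utf8mb3(text):
    """True if every character needs at most 3 bytes in UTF-8."""
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)

samples = ["caf\u00e9", "\u548c\u5e73", "I \U0001F499 PHP"]
for text in samples:
    widths = [len(ch.encode("utf-8")) for ch in text]
    print(repr(text), widths, fits_in_utf8mb3(text))
```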

Slide 22

Slide 22 text

Outside of the PHP world
• ICU: Java and C/C++
  • X-like licence, supported by IBM, the reference implementation of Unicode
• JavaScript: Unicode (NFC)
• Python: binary and unicode strings

Slide 23

Slide 23 text

In PHP (RIP PHP 6)
• Mbstring: mb_*()
• Iconv: iconv_*()
• PCRE: preg_*()
• Intl: grapheme_*() + normalizer_*()
  Collator, NumberFormatter, Locale, MessageFormatter, IntlDateFormatter, Spoofchecker, Transliterator, IDN
• symfony/polyfill-*

Slide 24

Slide 24 text

symfony/string

Slide 25

Slide 25 text


Slide 26

Slide 26 text

symfony/string
• Immutable value objects
• AbstractString
• BinaryString
• AbstractUnicodeString
• GraphemeString
• Utf8String

Slide 27

Slide 27 text

AbstractString
function __construct(string $string = '');
function jsonSerialize(): string;
function length(): int;
function toBinary(): BinaryString;
function toGrapheme(): GraphemeString;
function toUtf8(): Utf8String;
function __clone();
function __toString(): string;

Slide 28

Slide 28 text

function after($needle, bool $includeNeedle = false): self;
function afterLast($needle, bool $includeNeedle = false): self;
function append(string ...$suffix): self;
function before($needle, bool $includeNeedle = false): self;
function beforeLast($needle, bool $includeNeedle = false): self;
function chunk(int $length = 1): array;
function collapseWhitespace(): self;
function contains(): bool;
function endsWith(string $suffix): bool;
function ensureEnd(string $suffix): self;
function ensureStart(string $prefix): self;
function equalsTo($string): bool;
function folded(): self;
function ignoreCase(): self;
function indexOf($needle, int $offset = 0): ?int;
function indexOfLast($needle, int $offset = 0): ?int;
function isEmpty(): bool;
function join(array $strings): self;
function lower(): self;

Slide 29

Slide 29 text

function match(string $pattern, int $flags = 0, int $offset = 0): ?array;
function padBoth(int $length, string $padStr = ' '): self;
function padEnd(int $length, string $padStr = ' '): self;
function padStart(int $length, string $padStr = ' '): self;
function prepend(string ...$prefix): self;
function repeat(int $multiplier): self;
function replace(string $from, string $to): self;
function replaceMatches(string $fromPattern, $to): self;
function slice(int $start = 0, int $length = null): self;
function splice(string $replacement, int $start = 0, int $length = null): self;
function split(string $delimiter, int $limit = null): array;
function startsWith(string $prefix): bool;
function title(bool $allWords = false): self;
function trim(string $chars = " \t\n\r\0\x0B\x0C\u{A0}\u{FEFF}"): self;
function trimEnd(string $chars = " \t\n\r\0\x0B\x0C\u{A0}\u{FEFF}"): self;
function trimStart(string $chars = " \t\n\r\0\x0B\x0C\u{A0}\u{FEFF}"): self;
function truncate(int $length, string $ellipsis = ''): self;
function upper(): self;
function wordwrap(int $width = 75, string $break = "\n", bool $cut = false): self;

Slide 30

Slide 30 text

Merci Thank you Gracias شكرا 多謝 Ευχαριστώ ありがとう 감사합니다 Спасибо!