Symfony String - flexible handling of Unicode

The fabulous World of Emojis and other Unicode symbols #Symfony_Live
@nicolasgrekas

ASCII
     x0   x1   x2   x3   x4   x5   x6   x7   x8   x9   xA   xB   xC   xD   xE   xF
0x   NUL  SOH  STX  ETX  EOT  ENQ  ACK  BEL  BS   HT   LF   VT   FF   CR   SO   SI
1x   DLE  DC1  DC2  DC3  DC4  NAK  SYN  ETB  CAN  EM   SUB  ESC  FS   GS   RS   US
2x   SP   !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /
3x   0    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?
4x   @    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O
5x   P    Q    R    S    T    U    V    W    X    Y    Z    [    \    ]    ^    _
6x   `    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o
7x   p    q    r    s    t    u    v    w    x    y    z    {    |    }    ~    DEL

Windows-1252
     x0   x1   x2   x3   x4   x5   x6   x7   x8   x9   xA   xB   xC   xD   xE   xF
0x   NUL  SOH  STX  ETX  EOT  ENQ  ACK  BEL  BS   HT   LF   VT   FF   CR   SO   SI
1x   DLE  DC1  DC2  DC3  DC4  NAK  SYN  ETB  CAN  EM   SUB  ESC  FS   GS   RS   US
2x   SP   !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /
3x   0    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?
4x   @    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O
5x   P    Q    R    S    T    U    V    W    X    Y    Z    [    \    ]    ^    _
6x   `    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o
7x   p    q    r    s    t    u    v    w    x    y    z    {    |    }    ~    DEL
8x   €         ‚    ƒ    „    …    †    ‡    ˆ    ‰    Š    ‹    Œ         Ž
9x        ‘    ’    “    ”    •    –    —    ˜    ™    š    ›    œ         ž    Ÿ
Ax   NBSP ¡    ¢    £    ¤    ¥    ¦    §    ¨    ©    ª    «    ¬    SHY  ®    ¯
Bx   °    ±    ²    ³    ´    µ    ¶    ·    ¸    ¹    º    »    ¼    ½    ¾    ¿
Cx   À    Á    Â    Ã    Ä    Å    Æ    Ç    È    É    Ê    Ë    Ì    Í    Î    Ï
Dx   Ð    Ñ    Ò    Ó    Ô    Õ    Ö    ×    Ø    Ù    Ú    Û    Ü    Ý    Þ    ß
Ex   à    á    â    ã    ä    å    æ    ç    è    é    ê    ë    ì    í    î    ï
Fx   ð    ñ    ò    ó    ô    õ    ö    ÷    ø    ù    ú    û    ü    ý    þ    ÿ

W3Techs survey, Feb 2017

Unicode: 130k characters, 135 scripts
Peace سلام 和平 ☮

Unicode: 130k characters, 135 scripts
P U+0050 LATIN CAPITAL LETTER P
س U+0633 ARABIC LETTER SEEN
和 U+548C CJK UNIFIED IDEOGRAPH-548C
☮ U+262E PEACE SYMBOL
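
As a small illustration of the code point and name pairs listed on this slide, here is a minimal sketch using Python's standard `unicodedata` module. Python is used only as a convenient, language-neutral tool here; the helper name `describe` is an assumption made for this example and does not come from the talk.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character (hypothetical helper)."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


if __name__ == "__main__":
    # The four characters shown on the slide.
    for ch in ["P", "س", "和", "☮"]:
        print(describe(ch))
```

The names come from the Unicode Character Database bundled with the interpreter, so a very recent character may be unknown to an older Python version.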

17 planes http://reedbeta.com/blog/programmers-intro-to-unicode/

From Code Points to Bytes
á U+00E1 LATIN SMALL LETTER A WITH ACUTE
  UTF-16BE: 00 E1
  UTF-8: C3 A1
あ U+3042 HIRAGANA LETTER A
  UTF-16BE: 30 42
  UTF-8: E3 81 82
UTF-8: 1, 2, 3 or 4 bytes
UTF-16: 2 or 4 bytes
UTF-32: 4 bytes
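
The byte sequences on this slide can be reproduced with a few lines of Python. This is only a sketch: `hex_bytes` is a hypothetical helper written for this example, and the codec names (`utf-8`, `utf-16-be`, `utf-32-be`) are the ones Python's standard library uses.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Encode text and return its bytes as space-separated uppercase hex."""
    return " ".join(f"{byte:02X}" for byte in text.encode(encoding))


if __name__ == "__main__":
    for ch in ["\u00e1", "\u3042"]:  # á and あ from the slide
        for encoding in ["utf-8", "utf-16-be", "utf-32-be"]:
            print(f"U+{ord(ch):04X} {encoding}: {hex_bytes(ch, encoding)}")
```

The same helper shows that a code point outside the Basic Multilingual Plane, such as U+1F600, takes four bytes in both UTF-8 and UTF-16.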

UTF-8 rules the World
Byte 1    Byte 2    Byte 3    Byte 4
0xxxxxxx
110xxxxx  10xxxxxx
1110xxxx  10xxxxxx  10xxxxxx
11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
• ASCII compatible
• Self-synchronizing
• Endianness-independent
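
The bit layout in the table above can be turned into a hand-written encoder. The following is a minimal sketch for illustration, not a replacement for a real codec; the function names `utf8_encode` and `is_continuation_byte` are assumptions made for this example.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point following the 1- to 4-byte UTF-8 layout."""
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded in UTF-8")
    if code_point < 0x80:  # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:  # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:  # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:  # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point out of Unicode range")


def is_continuation_byte(byte: int) -> bool:
    """Continuation bytes always match 10xxxxxx; lead bytes never do."""
    return (byte & 0xC0) == 0x80
```

Because lead bytes and continuation bytes never share a bit pattern, a decoder that lands in the middle of a sequence can skip continuation bytes until it finds the next lead byte, which is what makes the encoding self-synchronizing.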

Case sensitivity
• About 1k characters are affected
• One upper case letter, two lower case variants: Σ ⇔ σ/ς
• The Turkish exception: I ⇔ i vs İ ⇔ i and I ⇔ ı
• Full case folding: ß ⇔ ss
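
The cases listed on this slide can be observed with Python's built-in string methods, which apply the locale-independent Unicode default mappings. This is a sketch for illustration; `equals_ignoring_case` is a hypothetical helper, and the Turkish dotted/dotless rules are not applied by these default mappings.

```python
def equals_ignoring_case(left: str, right: str) -> bool:
    """Compare two strings using full Unicode case folding."""
    return left.casefold() == right.casefold()


# Both lower case sigmas map to the same upper case letter.
assert "\u03c3".upper() == "\u03a3"  # σ -> Σ
assert "\u03c2".upper() == "\u03a3"  # ς -> Σ

# Full case folding maps ß to "ss", so the lengths differ.
assert "\u00df".casefold() == "ss"
assert equals_ignoring_case("Stra\u00dfe", "STRASSE")

# Default (non-Turkish) mappings: I lowers to i, and dotted capital İ
# lowers to "i" followed by U+0307 COMBINING DOT ABOVE.
assert "I".lower() == "i"
assert "\u0130".lower() == "i\u0307"
```

A comparison based on `lower()` alone would treat σ and ς as different letters, which is why case-insensitive matching is usually built on folding rather than on lowercasing.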

Collations
• Lithuanian puts y between i and k
• In Spanish, ch is a single character
• œ sorts between oe and of
• In Danish, Å is a separate letter that follows Z
• In Swedish, v and w are the same letter
• Traditional German considers ä to be the same as ae
• In French, côté > coté > côte > cote
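
To make the French example concrete, here is a toy sort key that compares base letters first and then accents starting from the end of the word, which is the behaviour the ordering on the slide implies. It is a deliberately simplified sketch, assuming that decomposing to NFD and separating combining marks is good enough for these four words; real collation relies on locale data such as that shipped with ICU, and `french_toy_key` is a hypothetical name.

```python
import unicodedata


def french_toy_key(word: str):
    """Toy key: base letters first, then accents compared from the end."""
    base_letters = []
    accents = []  # one tuple of combining marks per base letter
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            if accents:
                accents[-1] += (ord(ch),)
        else:
            base_letters.append(ch.lower())
            accents.append(())
    return ("".join(base_letters), tuple(reversed(accents)))


WORDS = ["côté", "cote", "coté", "côte"]
SORTED_WORDS = sorted(WORDS, key=french_toy_key)
```

With this key the words come out in ascending order as cote, côte, coté, côté, the reverse of the descending list on the slide, whereas a plain code point sort places coté before côte.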

Composed/decomposed forms
NFC: D é j à
     U+0044 U+00E9 U+006A U+00E0
NFD: D e ◌́ j a ◌̀
     U+0044 U+0065 U+0301 U+006A U+0061 U+0300
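
The two forms of "Déjà" shown here can be reproduced with Python's standard `unicodedata.normalize`. This is a minimal sketch; the helper name `code_points` is an assumption for this example.

```python
import unicodedata


def code_points(text: str) -> list:
    """List the code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


WORD = "D\u00e9j\u00e0"  # "Déjà", written with escapes to fix its form
NFC = unicodedata.normalize("NFC", WORD)
NFD = unicodedata.normalize("NFD", WORD)

if __name__ == "__main__":
    print("NFC:", " ".join(code_points(NFC)))
    print("NFD:", " ".join(code_points(NFD)))
```

The two strings render identically but compare as different, so text is usually normalized to a single form before comparing or hashing it.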

Grapheme Clusters
NFC: D é j à
     U+0044 U+00E9 U+006A U+00E0
NFD: D e ◌́ j a ◌̀
     U+0044 U+0065 U+0301 U+006A U+0061 U+0300
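
The point of this slide is that the same word has different lengths depending on the unit counted. The sketch below counts bytes, code points, and an approximation of grapheme clusters. The approximation (counting code points that are not combining marks) is an assumption that only holds for simple cases like this one; it ignores emoji sequences, Hangul jamo, and the other rules of real grapheme segmentation. `approx_grapheme_count` is a hypothetical helper.

```python
import unicodedata


def approx_grapheme_count(text: str) -> int:
    """Rough grapheme count: every non-combining code point starts a cluster."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


def lengths(text: str) -> dict:
    """Length of the same text in bytes (UTF-8), code points and graphemes."""
    return {
        "bytes": len(text.encode("utf-8")),
        "code_points": len(text),
        "graphemes": approx_grapheme_count(text),
    }


NFC = unicodedata.normalize("NFC", "D\u00e9j\u00e0")
NFD = unicodedata.normalize("NFD", "D\u00e9j\u00e0")
```

For the NFC form this gives 6 bytes, 4 code points and 4 graphemes; for the NFD form 8 bytes, 6 code points and still 4 graphemes.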

Emoji - http://unicode.org/reports/tr51/

Emoji glyphs

Unicode fundamentals summary
• Uppercase, lowercase, folding
• Compositions, ligatures
• Comparison: normalizations & collations
• Segmentation: characters, words, sentences & hyphens
• Locales: cultural conventions, transliterations
• Identifiers & security, confusables
• Display: direction, width

Unicode in Practice?

MySQL
• utf8_bin: A != a
• utf8_general_ci: œ ≠ oe
• utf8_unicode_ci: œ = oe
• utf8_swedish_ci: z < å < æ = ä < ö = ø
• SET NAMES utf8mb4: for security and storing emojis
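
The utf8mb4 bullet matters because MySQL's legacy `utf8` character set stores at most three bytes per character, while emoji and other code points above U+FFFF need four bytes in UTF-8. The check below is a sketch of that rule only; `needs_utf8mb4` is a hypothetical helper and does not query any database.

```python
def needs_utf8mb4(text: str) -> bool:
    """True if any code point needs four bytes in UTF-8 (above U+FFFF)."""
    return any(ord(ch) > 0xFFFF for ch in text)


if __name__ == "__main__":
    samples = {
        "PEACE SYMBOL U+262E": "\u262e",
        "GRINNING FACE U+1F600": "\U0001F600",
    }
    for label, text in samples.items():
        size = len(text.encode("utf-8"))
        print(f"{label}: {size} bytes, needs utf8mb4: {needs_utf8mb4(text)}")
```

The peace symbol from the earlier slide fits in three bytes, so it is a poor test case for emoji storage; a code point from the supplementary planes is needed to exercise the four-byte path.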

Outside of the PHP world
• ICU: Java and C/C++
  • X-like license, supported by IBM, the reference implementation of Unicode
• JavaScript: Unicode (NFC)
• Python: binary and unicode strings

In PHP (RIP PHP 6)
• Mbstring: mb_*()
• Iconv: iconv_*()
• PCRE: preg_*()
• Intl: grapheme_*() + normalizer_*()
  Collator, NumberFormatter, Locale, MessageFormatter, IntlDateFormatter, Spoofchecker, Transliterator, IDN
• symfony/polyfill-*

symfony/string

symfony/string
• Immutable value objects
• AbstractString
• BinaryString
• AbstractUnicodeString
• GraphemeString
• Utf8String

AbstractString
function __construct(string $string = '')
function jsonSerialize(): string
function length(): int
function toBinary(): BinaryString
function toGrapheme(): GraphemeString
function toUtf8(): Utf8String
function __clone()
function __toString(): string

function after($needle, bool $includeNeedle = false): self;
function afterLast($needle, bool $includeNeedle = false): self;
function append(string ...$suffix): self;
function before($needle, bool $includeNeedle = false): self;
function beforeLast($needle, bool $includeNeedle = false): self;
function chunk(int $length = 1): array;
function collapseWhitespace(): self
function contains(): bool
function endsWith(string $suffix): bool;
function ensureEnd(string $suffix): self
function ensureStart(string $prefix): self
function equalsTo($string): bool;
function folded(): self;
function ignoreCase(): self
function indexOf($needle, int $offset = 0): ?int;
function indexOfLast($needle, int $offset = 0): ?int;
function isEmpty(): bool
function join(array $strings): self
function lower(): self

function match(string $pattern, int $flags = 0, int $offset = 0): ?array
function padBoth(int $length, string $padStr = ' '): self
function padEnd(int $length, string $padStr = ' '): self
function padStart(int $length, string $padStr = ' '): self
function prepend(string ...$prefix): self;
function repeat(int $multiplier): self
function replace(string $from, string $to): self;
function replaceMatches(string $fromPattern, $to): self
function slice(int $start = 0, int $length = null): self;
function splice(string $replacement, int $start = 0, int $length = null): self;
function split(string $delimiter, int $limit = null): array;
function startsWith(string $prefix): bool;
function title(bool $allWords = false): self
function trim(string $chars = " \t\n\r\0\x0B\x0C\u{A0}\u{FEFF}"): self
function trimEnd(string $chars = " \t\n\r\0\x0B\x0C\u{A0}\u{FEFF}"): self
function trimStart(string $chars = " \t\n\r\0\x0B\x0C\u{A0}\u{FEFF}"): self
function truncate(int $length, string $ellipsis = ''): self
function upper(): self
function wordwrap(int $width = 75, string $break = "\n", bool $cut = false): self

Merci Thank you Gracias شكرا 多謝 Ευχαριστώ ありがとう 감사합니다 Спасибо!


Nicolas Grekas
