Symfony String - flexible handling of Unicode

Nicolas Grekas
September 13, 2019

Handling strings properly goes hand in hand with understanding the core concepts of Unicode. In the beginning, characters were just bytes; then they became code points; and finally, code points could be combined into grapheme clusters. Whether you like it or not, those are three unit systems that we have to deal with as programmers if we want cultures to be able to communicate using computers.

My experience on the topic, gathered from my work on the patchwork/utf8 library, has now been ported to Symfony String, a component that provides a unified API for all three systems.
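
To make the three unit systems concrete, here is a minimal sketch in plain PHP that measures the same text in bytes, code points, and grapheme clusters. It deliberately does not use symfony/string; the helper name `measure()` is hypothetical, and the sketch assumes a PHP 7+ build with the bundled PCRE Unicode support.

```php
<?php
// Hypothetical helper (not part of symfony/string): measures one string
// in the three unit systems described above.
function measure(string $s): array
{
    return [
        'bytes'      => strlen($s),                    // raw bytes
        'codePoints' => preg_match_all('/./su', $s),   // one match per code point
        'graphemes'  => preg_match_all('/\X/u', $s),   // one match per grapheme cluster
    ];
}

$nfd = "De\u{301}ja\u{300}"; // "Déjà" with combining accents (decomposed)
$nfc = "D\u{E9}j\u{E0}";     // "Déjà" with precomposed letters

print_r(measure($nfd)); // bytes: 8, codePoints: 6, graphemes: 4
print_r(measure($nfc)); // bytes: 6, codePoints: 4, graphemes: 4
```

Both strings display the same four characters, yet only the grapheme count agrees between them, which is why an API has to say which unit it counts in.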


Transcript

  1. The fabulous World of Emojis and other Unicode symbols #Symfony_Live @nicolasgrekas
  2. ASCII

        x0   x1   x2   x3   x4   x5   x6   x7   x8   x9   xA   xB   xC   xD   xE   xF
    0x  NUL  SOH  STX  ETX  EOT  ENQ  ACK  BEL  BS   HT   LF   VT   FF   CR   SO   SI
    1x  DLE  DC1  DC2  DC3  DC4  NAK  SYN  ETB  CAN  EM   SUB  ESC  FS   GS   RS   US
    2x  SP   !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /
    3x  0    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?
    4x  @    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O
    5x  P    Q    R    S    T    U    V    W    X    Y    Z    [    \    ]    ^    _
    6x  `    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o
    7x  p    q    r    s    t    u    v    w    x    y    z    {    |    }    ~    DEL
    8x  €         ‚    ƒ    „    …    †    ‡    ˆ    ‰    Š    ‹    Œ         Ž
    9x       ‘    ’    “    ”    •    –    —    ˜    ™    š    ›    œ         ž    Ÿ
    Ax  NBSP ¡    ¢    £    ¤    ¥    ¦    §    ¨    ©    ª    «    ¬    SHY  ®    ¯
    Bx  °    ±    ²    ³    ´    µ    ¶    ·    ¸    ¹    º    »    ¼    ½    ¾    ¿
    Cx  À    Á    Â    Ã    Ä    Å    Æ    Ç    È    É    Ê    Ë    Ì    Í    Î    Ï
    Dx  Ð    Ñ    Ò    Ó    Ô    Õ    Ö    ×    Ø    Ù    Ú    Û    Ü    Ý    Þ    ß
    Ex  à    á    â    ã    ä    å    æ    ç    è    é    ê    ë    ì    í    î    ï
    Fx  ð    ñ    ò    ó    ô    õ    ö    ÷    ø    ù    ú    û    ü    ý    þ    ÿ
  3. Windows-1252

        x0   x1   x2   x3   x4   x5   x6   x7   x8   x9   xA   xB   xC   xD   xE   xF
    0x  NUL  SOH  STX  ETX  EOT  ENQ  ACK  BEL  BS   HT   LF   VT   FF   CR   SO   SI
    1x  DLE  DC1  DC2  DC3  DC4  NAK  SYN  ETB  CAN  EM   SUB  ESC  FS   GS   RS   US
    2x  SP   !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /
    3x  0    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?
    4x  @    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O
    5x  P    Q    R    S    T    U    V    W    X    Y    Z    [    \    ]    ^    _
    6x  `    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o
    7x  p    q    r    s    t    u    v    w    x    y    z    {    |    }    ~    DEL
    8x  €         ‚    ƒ    „    …    †    ‡    ˆ    ‰    Š    ‹    Œ         Ž
    9x       ‘    ’    “    ”    •    –    —    ˜    ™    š    ›    œ         ž    Ÿ
    Ax  NBSP ¡    ¢    £    ¤    ¥    ¦    §    ¨    ©    ª    «    ¬    SHY  ®    ¯
    Bx  °    ±    ²    ³    ´    µ    ¶    ·    ¸    ¹    º    »    ¼    ½    ¾    ¿
    Cx  À    Á    Â    Ã    Ä    Å    Æ    Ç    È    É    Ê    Ë    Ì    Í    Î    Ï
    Dx  Ð    Ñ    Ò    Ó    Ô    Õ    Ö    ×    Ø    Ù    Ú    Û    Ü    Ý    Þ    ß
    Ex  à    á    â    ã    ä    å    æ    ç    è    é    ê    ë    ì    í    î    ï
    Fx  ð    ñ    ò    ó    ô    õ    ö    ÷    ø    ù    ú    û    ü    ý    þ    ÿ
  4. W3Techs survey – Feb 2017

  5. Unicode: 130k characters, 135 scripts

    Peace سلام 和平 ☮

  6. Unicode: 130k characters, 135 scripts

    P U+0050 LATIN CAPITAL LETTER P
    س U+0633 ARABIC LETTER SEEN
    和 U+548C CJK UNIFIED IDEOGRAPH-548C
    ☮ U+262E PEACE SYMBOL
  7. @nicolasgrekas

  8. 17 planes http://reedbeta.com/blog/programmers-intro-to-unicode/

  9. @nicolasgrekas

  10. @nicolasgrekas

  11. From Code Points to Bytes

    á U+00E1 LATIN SMALL LETTER A WITH ACUTE
      UTF-16BE: 00 E1
      UTF-8: C3 A1
    あ U+3042 HIRAGANA LETTER A
      UTF-16BE: 30 42
      UTF-8: E3 81 82
    UTF-8: 1, 2, 3 or 4 bytes
    UTF-16: 2 or 4 bytes
    UTF-32: 4 bytes
  12. UTF-8 rules the World

    Byte 1    Byte 2    Byte 3    Byte 4
    0xxxxxxx
    110xxxxx  10xxxxxx
    1110xxxx  10xxxxxx  10xxxxxx
    11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
    • ASCII compatible
    • Self-synchronizing
    • Endianness insensitive
  13. Case sensitivity

    • About 1k characters are affected
    • One upper case letter, two lower case variants: Σ ⇔ σ/ς
    • The Turkish exception: I ⇔ i vs İ ⇔ i and I ⇔ ı
    • Full case folding: ß ⇔ ss
  14. Collations

    • Lithuanian puts y between i and k
    • ch in Spanish is a single character
    • œ between oe and of
    • In Danish, Å is a separate letter that follows Z
    • In Swedish, v and w are the same letter
    • Traditional German considers ä to be the same as ae
    • In French, côté > coté > côte > cote
  15. Composed/decomposed forms

    NFC: D é j à
         U+0044 U+00E9 U+006A U+00E0
    NFD: D e ◌́ j a ◌̀
         U+0044 U+0065 U+0301 U+006A U+0061 U+0300
  16. Grapheme Clusters

    NFC: D é j à
         U+0044 U+00E9 U+006A U+00E0
    NFD: D e ◌́ j a ◌̀
         U+0044 U+0065 U+0301 U+006A U+0061 U+0300
  17. Emoji - http://unicode.org/reports/tr51/

  18. Emoji glyphs

  19. Unicode fundamentals summary

    • Uppercase, lowercase, folding
    • Compositions, ligatures
    • Comparison: normalizations & collations
    • Segmentation: characters, words, sentences & hyphens
    • Locales: cultural conventions, transliterations
    • Identifiers & security, confusables
    • Display: direction, width
  20. Unicode in Practice?

  21. MySQL

    • utf8_bin: A != a
    • utf8_general_ci: œ ≠ oe
    • utf8_unicode_ci: œ = oe
    • utf8_swedish_ci: z > å > æ = ä > ö = ø
    • SET NAMES utf8mb4: for security and storing emojis
  22. Outside of PHP world

    • ICU: Java and C/C++
      • X-like licence, supported by IBM, the reference implementation of Unicode
    • JavaScript: Unicode (NFC)
    • Python: binary and unicode strings
  23. In PHP (RIP PHP 6)

    • Mbstring: mb_*()
    • Iconv: iconv_*()
    • PCRE: preg_*()
    • Intl: grapheme_*() + normalizer_*()
      Collator, NumberFormatter, Locale, MessageFormatter, IntlDateFormatter, Spoofchecker, Transliterator, IDN
    • symfony/polyfill-*
  24. symfony/string

  25. @nicolasgrekas

  26. symfony/string

    • Immutable value objects
    • AbstractString
    • BinaryString
    • AbstractUnicodeString
    • GraphemeString
    • Utf8String
  27. AbstractString

    function __construct(string $string = '')
    function jsonSerialize(): string
    function length(): int
    function toBinary(): BinaryString
    function toGrapheme(): GraphemeString
    function toUtf8(): Utf8String
    function __clone()
    function __toString(): string
  28. @nicolasgrekas

    function after($needle, bool $includeNeedle = false): self;
    function afterLast($needle, bool $includeNeedle = false): self;
    function append(string ...$suffix): self;
    function before($needle, bool $includeNeedle = false): self;
    function beforeLast($needle, bool $includeNeedle = false): self;
    function chunk(int $length = 1): array;
    function collapseWhitespace(): self
    function contains(): bool
    function endsWith(string $suffix): bool;
    function ensureEnd(string $suffix): self
    function ensureStart(string $prefix): self
    function equalsTo($string): bool;
    function folded(): self;
    function ignoreCase(): self
    function indexOf($needle, int $offset = 0): ?int;
    function indexOfLast($needle, int $offset = 0): ?int;
    function isEmpty(): bool
    function join(array $strings): self
    function lower(): self
  29. @nicolasgrekas

    function match(string $pattern, int $flags = 0, int $offset = 0): ?array
    function padBoth(int $length, string $padStr = ' '): self
    function padEnd(int $length, string $padStr = ' '): self
    function padStart(int $length, string $padStr = ' '): self
    function prepend(string ...$prefix): self;
    function repeat(int $multiplier): self
    function replace(string $from, string $to): self;
    function replaceMatches(string $fromPattern, $to): self
    function slice(int $start = 0, int $length = null): self;
    function splice(string $replacement, int $start = 0, int $length = null): self;
    function split(string $delimiter, int $limit = null): array;
    function startsWith(string $prefix): bool;
    function title(bool $allWords = false): self
    function trim(string $chars = " \t\n\r\0\x0B\x0C\u{A0}\u{FEFF}"): self
    function trimEnd(string $chars = " \t\n\r\0\x0B\x0C\u{A0}\u{FEFF}"): self
    function trimStart(string $chars = " \t\n\r\0\x0B\x0C\u{A0}\u{FEFF}"): self
    function truncate(int $length, string $ellipsis = ''): self
    function upper(): self
    function wordwrap(int $width = 75, string $break = "\n", bool $cut = false): self
  30. Merci Thank you Gracias شكرا 多謝 Ευχαριστώ ありがとう 감사합니다 Спасибо!
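
As a companion to the "From Code Points to Bytes" and "UTF-8 rules the World" slides in the transcript, the UTF-8 bit layout can be sketched in a few lines of plain PHP. The function name `utf8Encode()` is hypothetical and this is only an illustration of the layout, not how PHP or symfony/string encode strings internally; for brevity it does not reject surrogate code points.

```php
<?php
// Hypothetical helper: encodes a single Unicode code point as UTF-8 bytes
// using the 1- to 4-byte layout from the slide. Surrogates are not rejected.
function utf8Encode(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF) {
        throw new InvalidArgumentException('Code point out of range');
    }
    if ($cp < 0x80) {
        return chr($cp);                            // 0xxxxxxx
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6))               // 110xxxxx
            . chr(0x80 | ($cp & 0x3F));             // 10xxxxxx
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))              // 1110xxxx
            . chr(0x80 | (($cp >> 6) & 0x3F))       // 10xxxxxx
            . chr(0x80 | ($cp & 0x3F));             // 10xxxxxx
    }
    return chr(0xF0 | ($cp >> 18))                  // 11110xxx
        . chr(0x80 | (($cp >> 12) & 0x3F))          // 10xxxxxx
        . chr(0x80 | (($cp >> 6) & 0x3F))           // 10xxxxxx
        . chr(0x80 | ($cp & 0x3F));                 // 10xxxxxx
}

echo strtoupper(bin2hex(utf8Encode(0x00E1))), "\n"; // C3A1   (á)
echo strtoupper(bin2hex(utf8Encode(0x3042))), "\n"; // E38182 (あ)
```

The two printed values match the byte sequences listed for á and あ in the transcript. Because every continuation byte starts with the bits 10, a decoder that lands in the middle of a sequence can always find the next character boundary.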