
Symfony String - flexible handling of Unicode

Nicolas Grekas
September 13, 2019


Handling strings properly usually goes hand in hand with understanding the core concepts of Unicode. In the beginning, characters were just bytes; then they became code points; and finally, code points can be combined into grapheme clusters. Whether you like it or not, that's three unit systems we have to deal with as programmers if we want cultures to be able to communicate using computers.

Gathered from my work on the patchwork/utf8 library, my experience on the topic has now been ported to Symfony String, a component that provides a unified API for all 3 systems.



Transcript

  1. @nicolasgrekas ASCII

         x0   x1   x2   x3   x4   x5   x6   x7   x8   x9   xA   xB   xC   xD   xE   xF
    0x   NUL  SOH  STX  ETX  EOT  ENQ  ACK  BEL  BS   HT   LF   VT   FF   CR   SO   SI
    1x   DLE  DC1  DC2  DC3  DC4  NAK  SYN  ETB  CAN  EM   SUB  ESC  FS   GS   RS   US
    2x   SP   !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /
    3x   0    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?
    4x   @    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O
    5x   P    Q    R    S    T    U    V    W    X    Y    Z    [    \    ]    ^    _
    6x   `    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o
    7x   p    q    r    s    t    u    v    w    x    y    z    {    |    }    ~    DEL
  2. Windows-1252

         x0   x1   x2   x3   x4   x5   x6   x7   x8   x9   xA   xB   xC   xD   xE   xF
    0x   NUL  SOH  STX  ETX  EOT  ENQ  ACK  BEL  BS   HT   LF   VT   FF   CR   SO   SI
    1x   DLE  DC1  DC2  DC3  DC4  NAK  SYN  ETB  CAN  EM   SUB  ESC  FS   GS   RS   US
    2x   SP   !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /
    3x   0    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?
    4x   @    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O
    5x   P    Q    R    S    T    U    V    W    X    Y    Z    [    \    ]    ^    _
    6x   `    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o
    7x   p    q    r    s    t    u    v    w    x    y    z    {    |    }    ~    DEL
    8x   €         ‚    ƒ    „    …    †    ‡    ˆ    ‰    Š    ‹    Œ         Ž
    9x        ‘    ’    “    ”    •    –    —    ˜    ™    š    ›    œ         ž    Ÿ
    Ax   NBSP ¡    ¢    £    ¤    ¥    ¦    §    ¨    ©    ª    «    ¬         ®    ¯
    Bx   °    ±    ²    ³    ´    µ    ¶    ·    ¸    ¹    º    »    ¼    ½    ¾    ¿
    Cx   À    Á    Â    Ã    Ä    Å    Æ    Ç    È    É    Ê    Ë    Ì    Í    Î    Ï
    Dx   Ð    Ñ    Ò    Ó    Ô    Õ    Ö    ×    Ø    Ù    Ú    Û    Ü    Ý    Þ    ß
    Ex   à    á    â    ã    ä    å    æ    ç    è    é    ê    ë    ì    í    î    ï
    Fx   ð    ñ    ò    ó    ô    õ    ö    ÷    ø    ù    ú    û    ü    ý    þ    ÿ
  3. Unicode: 130k characters, 135 scripts

    P  U+0050 LATIN CAPITAL LETTER P
    س  U+0633 ARABIC LETTER SEEN
    和 U+548C CJK UNIFIED IDEOGRAPH-548C
    ☮  U+262E PEACE SYMBOL
  4. From Code Points to Bytes

    á  U+00E1 LATIN SMALL LETTER A WITH ACUTE
       UTF-16BE  00 E1
       UTF-8     C3 A1

    あ U+3042 HIRAGANA LETTER A
       UTF-16BE  30 42
       UTF-8     E3 81 82

    UTF-8: 1, 2, 3 or 4 bytes
    UTF-16: 2 or 4 bytes
    UTF-32: 4 bytes
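The byte sequences for á and あ can be reproduced with PHP's mbstring extension. This is a minimal sketch, assuming mbstring is loaded; `hexBytes()` is a hypothetical helper introduced here for display only.

```php
<?php
// Assumes the mbstring extension is available (mb_convert_encoding, mb_ord).

/** Hypothetical helper: formats raw bytes as space-separated uppercase hex pairs. */
function hexBytes(string $bytes): string
{
    return strtoupper(implode(' ', str_split(bin2hex($bytes), 2)));
}

// PHP string literals are byte strings; "\u{...}" emits the UTF-8 bytes of a code point.
$samples = ["\u{00E1}", "\u{3042}"];

foreach ($samples as $char) {
    // Re-encode the same code point as big-endian UTF-16.
    $utf16 = mb_convert_encoding($char, 'UTF-16BE', 'UTF-8');

    printf(
        "U+%04X  UTF-8: %s  UTF-16BE: %s\n",
        mb_ord($char, 'UTF-8'),
        hexBytes($char),
        hexBytes($utf16)
    );
}
```

The code point stays the same in both cases; only its serialization to bytes changes with the encoding.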
  5. UTF-8 rules the World

    Byte 1    Byte 2    Byte 3    Byte 4
    0xxxxxxx
    110xxxxx  10xxxxxx
    1110xxxx  10xxxxxx  10xxxxxx
    11110xxx  10xxxxxx  10xxxxxx  10xxxxxx

    • ASCII compatible
    • Self-synchronizing
    • Endianness insensitive
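The bit layout on this slide can be turned into a small encoder. The function below is an illustrative sketch (`encodeUtf8()` is a name made up for this example, not a PHP built-in); in practice `mb_chr()` or a `"\u{...}"` literal does the same job.

```php
<?php
/**
 * Hypothetical helper: encodes one Unicode code point to UTF-8 by filling
 * the x bits of the 1, 2, 3 or 4 byte patterns shown on the slide.
 */
function encodeUtf8(int $cp): string
{
    // Surrogates (U+D800..U+DFFF) and values above U+10FFFF are not encodable.
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException('Not a Unicode scalar value.');
    }

    if ($cp < 0x80) {            // 0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {           // 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
            . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {         // 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
            . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }

    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
        . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(encodeUtf8(0x00E1)), "\n"; // á
echo bin2hex(encodeUtf8(0x3042)), "\n"; // あ
```

Because continuation bytes always start with `10`, a decoder that lands in the middle of a sequence can skip forward to the next lead byte, which is what makes the encoding self-synchronizing.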
  6. Case sensitivity

    • About 1k characters are affected
    • One upper case letter, two lower case variants: Σ ⇔ σ/ς
    • The Turkish exception: I ⇔ i vs İ ⇔ i and I ⇔ ı
    • Full case folding: ß ⇔ ss
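Case-insensitive comparison along these lines can be sketched with mbstring's case folding. This assumes PHP 7.3 or later, where `MB_CASE_FOLD` applies full case folding; `sameIgnoringCase()` is a hypothetical helper, and the folding is locale-neutral, so the Turkish rules are not applied.

```php
<?php
// Assumes mbstring on PHP 7.3+, where MB_CASE_FOLD performs full case folding.

/** Hypothetical helper: compares two UTF-8 strings after case folding both. */
function sameIgnoringCase(string $a, string $b): bool
{
    return mb_convert_case($a, MB_CASE_FOLD, 'UTF-8')
        === mb_convert_case($b, MB_CASE_FOLD, 'UTF-8');
}

var_dump(sameIgnoringCase('Straße', 'STRASSE')); // full folding: ß folds to ss
var_dump(sameIgnoringCase('Σ', 'ς'));            // both sigma variants fold to σ
var_dump(sameIgnoringCase('I', 'ı'));            // default folding maps I to i, not to dotless ı
```

Folding is meant for comparison only; for display, lowercasing or uppercasing with the right locale is still needed.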
  7. Collations

    • Lithuanian puts y between i and k
    • ch in Spanish is a single character
    • œ between oe and of
    • In Danish, Å is a separate letter that follows Z
    • In Swedish, v and w are the same letter
    • Traditional German considers ä to be the same as ae
    • In French, côté > coté > côte > cote
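Locale-dependent ordering like this is what the intl extension's `Collator` class provides. The sketch below assumes intl is installed with Swedish and German locale data; `sortForLocale()` is a hypothetical wrapper, and exact results for less common rules can vary with the ICU version.

```php
<?php
// Assumes the intl extension (ICU) with 'sv' and 'de' locale data.

/** Hypothetical helper: returns $items sorted by the collation rules of $locale. */
function sortForLocale(string $locale, array $items): array
{
    $collator = new Collator($locale);
    $collator->sort($items); // sorts in place

    return array_values($items);
}

// Swedish treats ä as a separate letter sorted after z;
// German sorts it next to a.
echo implode(' ', sortForLocale('sv', ['ä', 'z', 'a'])), "\n";
echo implode(' ', sortForLocale('de', ['ä', 'z', 'a'])), "\n";
```

The same three strings end up in a different order depending only on the locale, which is why byte-wise `sort()` is rarely what users expect.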
  8. Composed/decomposed forms

    NFC  D      é      j      à
         U+0044 U+00E9 U+006A U+00E0

    NFD  D      e      ◌́      j      a      ◌̀
         U+0044 U+0065 U+0301 U+006A U+0061 U+0300
  9. Grapheme Clusters

    NFC  D      é      j      à
         U+0044 U+00E9 U+006A U+00E0

    NFD  D      e      ◌́      j      a      ◌̀
         U+0044 U+0065 U+0301 U+006A U+0061 U+0300
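The "Déjà" example can be measured in all three unit systems with native PHP functions. This is a sketch assuming the intl and mbstring extensions are loaded; `unitCounts()` is a hypothetical helper.

```php
<?php
// Assumes the intl (Normalizer, grapheme_strlen) and mbstring (mb_strlen) extensions.

/** Hypothetical helper: length of a UTF-8 string in bytes, code points and grapheme clusters. */
function unitCounts(string $s): array
{
    return [
        'bytes'      => strlen($s),
        'codePoints' => mb_strlen($s, 'UTF-8'),
        'graphemes'  => (int) grapheme_strlen($s),
    ];
}

$nfc = "D\u{E9}j\u{E0}";                                  // composed form
$nfd = Normalizer::normalize($nfc, Normalizer::FORM_D);   // decomposed form

print_r(unitCounts($nfc));
print_r(unitCounts($nfd));

// The two forms differ byte for byte, but are equal once normalized to the same form.
var_dump($nfc === $nfd);
var_dump($nfc === Normalizer::normalize($nfd, Normalizer::FORM_C));
```

Only the grapheme count stays the same across both forms, which is the usual argument for treating grapheme clusters as what users perceive as characters.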
  10. Unicode fundamentals summary

    • Uppercase, lowercase, folding
    • Compositions, ligatures
    • Comparison: normalizations & collations
    • Segmentation: characters, words, sentences & hyphens
    • Locales: cultural conventions, transliterations
    • Identifiers & security, confusables
    • Display: direction, width
  11. MySQL

    • utf8_bin: A ≠ a
    • utf8_general_ci: œ ≠ oe
    • utf8_unicode_ci: œ = oe
    • utf8_swedish_ci: z < å < æ = ä < ö = ø
    • SET NAMES utf8mb4: for security and storing emojis
  12. Outside of PHP world

    • ICU: Java and C/C++
      • X-like licence, supported by IBM, the reference implementation of Unicode
    • JavaScript: Unicode (NFC)
    • Python: binary and unicode strings
  13. In PHP (RIP PHP 6)

    • Mbstring: mb_*()
    • Iconv: iconv_*()
    • PCRE: preg_*()
    • Intl: grapheme_*() + normalizer_*()
      Collator, NumberFormatter, Locale, MessageFormatter, IntlDateFormatter, Spoofchecker, Transliterator, IDN
    • symfony/polyfill-*
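Each of these extensions works in a different unit, which shows up as soon as a string is sliced. The following sketch assumes mbstring, intl and PCRE with UTF-8 support; the variable names are illustrative only.

```php
<?php
// Assumes the mbstring, intl and pcre extensions.

$nfc = "D\u{E9}j\u{E0}";       // "Déjà", composed
$nfd = "De\u{301}ja\u{300}";   // "Déjà", decomposed

// Core string functions count bytes: this cuts é (C3 A9) in half.
$byBytes = substr($nfc, 0, 2);

// mbstring counts code points: the combining accent is left behind.
$byCodePoints = mb_substr($nfd, 0, 2, 'UTF-8');

// intl counts grapheme clusters: the accent stays with its base letter.
$byGraphemes = grapheme_substr($nfd, 0, 2);

var_dump(mb_check_encoding($byBytes, 'UTF-8')); // is the byte slice still valid UTF-8?
var_dump(bin2hex($byCodePoints));
var_dump(bin2hex($byGraphemes));

// PCRE in UTF-8 mode can also match whole grapheme clusters with \X.
$clusterCount = preg_match_all('/\X/u', $nfd);
var_dump($clusterCount);
```

Having to pick the right function family for every operation is the friction that a unified string API is meant to remove.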
  14. symfony/string

    • Immutable value objects
    • AbstractString
    • BinaryString
    • AbstractUnicodeString
    • GraphemeString
    • Utf8String
  15. AbstractString

    function __construct(string $string = '');
    function jsonSerialize(): string;
    function length(): int;
    function toBinary(): BinaryString;
    function toGrapheme(): GraphemeString;
    function toUtf8(): Utf8String;
    function __clone();
    function __toString(): string;
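To make the idea of one API over three unit systems concrete, here is a hypothetical miniature built on native functions. It is not the symfony/string source: the `Sketch*` class names are invented for this example, only `length()` and the three conversions are modelled, and the class names in the released component may differ from those on the slides.

```php
<?php
// Hypothetical miniature, NOT the symfony/string implementation.
// Assumes the mbstring and intl extensions.

abstract class SketchString
{
    protected $string;

    public function __construct(string $string = '')
    {
        $this->string = $string;
    }

    /** Length in the unit system of the concrete class. */
    abstract public function length(): int;

    // Conversions wrap the same bytes in a class with different unit semantics.
    public function toBinary(): SketchBinaryString
    {
        return new SketchBinaryString($this->string);
    }

    public function toUtf8(): SketchUtf8String
    {
        return new SketchUtf8String($this->string);
    }

    public function toGrapheme(): SketchGraphemeString
    {
        return new SketchGraphemeString($this->string);
    }

    public function __toString(): string
    {
        return $this->string;
    }
}

final class SketchBinaryString extends SketchString
{
    public function length(): int
    {
        return strlen($this->string); // bytes
    }
}

final class SketchUtf8String extends SketchString
{
    public function length(): int
    {
        return mb_strlen($this->string, 'UTF-8'); // code points
    }
}

final class SketchGraphemeString extends SketchString
{
    public function length(): int
    {
        return (int) grapheme_strlen($this->string); // grapheme clusters
    }
}

$binary = new SketchBinaryString("De\u{301}ja\u{300}"); // decomposed "Déjà"

echo $binary->length(), "\n";
echo $binary->toUtf8()->length(), "\n";
echo $binary->toGrapheme()->length(), "\n";
```

The caller chooses the unit system once, by choosing the class, and every later call on that object stays consistent with it.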
  16. function after($needle, bool $includeNeedle = false): self;
    function afterLast($needle, bool $includeNeedle = false): self;
    function append(string ...$suffix): self;
    function before($needle, bool $includeNeedle = false): self;
    function beforeLast($needle, bool $includeNeedle = false): self;
    function chunk(int $length = 1): array;
    function collapseWhitespace(): self;
    function contains(): bool;
    function endsWith(string $suffix): bool;
    function ensureEnd(string $suffix): self;
    function ensureStart(string $prefix): self;
    function equalsTo($string): bool;
    function folded(): self;
    function ignoreCase(): self;
    function indexOf($needle, int $offset = 0): ?int;
    function indexOfLast($needle, int $offset = 0): ?int;
    function isEmpty(): bool;
    function join(array $strings): self;
    function lower(): self;
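Since these methods return `self` on immutable value objects, each call is expected to hand back a new instance. The sketch below shows one common way to do that with `clone`. `ImmutableSketch` is a hypothetical class; the two method names are borrowed from the slide, but their behaviour here (especially `ensureEnd()`) is an assumption, not the component's documented semantics.

```php
<?php
// Hypothetical illustration of immutable, chainable string methods.
// NOT the symfony/string implementation; semantics are assumed.

final class ImmutableSketch
{
    private $string;

    public function __construct(string $string = '')
    {
        $this->string = $string;
    }

    /** Returns a copy with the suffixes appended; the original is left untouched. */
    public function append(string ...$suffix): self
    {
        $copy = clone $this;
        $copy->string .= implode('', $suffix);

        return $copy;
    }

    /** Assumed semantics: returns a copy that ends with $suffix, adding it only if missing. */
    public function ensureEnd(string $suffix): self
    {
        if ($suffix === '' || substr($this->string, -strlen($suffix)) === $suffix) {
            return clone $this;
        }

        return $this->append($suffix);
    }

    public function __toString(): string
    {
        return $this->string;
    }
}

$path = new ImmutableSketch('/var/www');
$withFile = $path->ensureEnd('/')->append('index.php');

echo $path, "\n";     // the original object is unchanged
echo $withFile, "\n";
```

Returning fresh objects keeps chained calls safe to share between callers, at the cost of one allocation per step.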
  17. function match(string $pattern, int $flags = 0, int $offset = 0): ?array;
    function padBoth(int $length, string $padStr = ' '): self;
    function padEnd(int $length, string $padStr = ' '): self;
    function padStart(int $length, string $padStr = ' '): self;
    function prepend(string ...$prefix): self;
    function repeat(int $multiplier): self;
    function replace(string $from, string $to): self;
    function replaceMatches(string $fromPattern, $to): self;
    function slice(int $start = 0, int $length = null): self;
    function splice(string $replacement, int $start = 0, int $length = null): self;
    function split(string $delimiter, int $limit = null): array;
    function startsWith(string $prefix): bool;
    function title(bool $allWords = false): self;
    function trim(string $chars = " \t\n\r\0\x0B\x0C\u{A0}\u{FEFF}"): self;
    function trimEnd(string $chars = " \t\n\r\0\x0B\x0C\u{A0}\u{FEFF}"): self;
    function trimStart(string $chars = " \t\n\r\0\x0B\x0C\u{A0}\u{FEFF}"): self;
    function truncate(int $length, string $ellipsis = ''): self;
    function upper(): self;
    function wordwrap(int $width = 75, string $break = "\n", bool $cut = false): self;