Symfony String - flexible handling of Unicode

Nicolas Grekas
September 13, 2019

Handling strings properly goes hand in hand with understanding the core concepts of Unicode. In the beginning, characters were just bytes; then they became code points; finally, code points can be combined into grapheme clusters. Whether you like it or not, those are three unit systems we have to deal with as programmers if we want people of all cultures to be able to communicate using computers.

My experience on the topic, gathered from my work on the patchwork/utf8 library, has now been ported to Symfony String, a component that provides a unified API for all three unit systems.

Transcript

  1. The fabulous World of Emojis
    and other Unicode symbols
    #Symfony_Live
    @nicolasgrekas

  2. ASCII (7-bit: code units 0x00 to 0x7F)
        x0   x1   x2   x3   x4   x5   x6   x7   x8   x9   xA   xB   xC   xD   xE   xF
    0x  NUL  SOH  STX  ETX  EOT  ENQ  ACK  BEL  BS   HT   LF   VT   FF   CR   SO   SI
    1x  DLE  DC1  DC2  DC3  DC4  NAK  SYN  ETB  CAN  EM   SUB  ESC  FS   GS   RS   US
    2x  SP   !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /
    3x  0    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?
    4x  @    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O
    5x  P    Q    R    S    T    U    V    W    X    Y    Z    [    \    ]    ^    _
    6x  `    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o
    7x  p    q    r    s    t    u    v    w    x    y    z    {    |    }    ~    DEL

  3. Windows-1252
        x0   x1   x2   x3   x4   x5   x6   x7   x8   x9   xA   xB   xC   xD   xE   xF
    0x  NUL  SOH  STX  ETX  EOT  ENQ  ACK  BEL  BS   HT   LF   VT   FF   CR   SO   SI
    1x  DLE  DC1  DC2  DC3  DC4  NAK  SYN  ETB  CAN  EM   SUB  ESC  FS   GS   RS   US
    2x  SP   !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /
    3x  0    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?
    4x  @    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O
    5x  P    Q    R    S    T    U    V    W    X    Y    Z    [    \    ]    ^    _
    6x  `    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o
    7x  p    q    r    s    t    u    v    w    x    y    z    {    |    }    ~    DEL
    8x  €    --   ‚    ƒ    „    …    †    ‡    ˆ    ‰    Š    ‹    Œ    --   Ž    --
    9x  --   ‘    ’    “    ”    •    –    —    ˜    ™    š    ›    œ    --   ž    Ÿ
    Ax  NBSP ¡    ¢    £    ¤    ¥    ¦    §    ¨    ©    ª    «    ¬    SHY  ®    ¯
    Bx  °    ±    ²    ³    ´    µ    ¶    ·    ¸    ¹    º    »    ¼    ½    ¾    ¿
    Cx  À    Á    Â    Ã    Ä    Å    Æ    Ç    È    É    Ê    Ë    Ì    Í    Î    Ï
    Dx  Ð    Ñ    Ò    Ó    Ô    Õ    Ö    ×    Ø    Ù    Ú    Û    Ü    Ý    Þ    ß
    Ex  à    á    â    ã    ä    å    æ    ç    è    é    ê    ë    ì    í    î    ï
    Fx  ð    ñ    ò    ó    ô    õ    ö    ÷    ø    ù    ú    û    ü    ý    þ    ÿ
    (-- marks the five code units that Windows-1252 leaves undefined: 81, 8D, 8F, 90 and 9D; SHY is the invisible soft hyphen.)

  4. W3Techs survey, Feb 2017

  5. Unicode: 130k characters, 135 scripts
    Peace سلام
    和平 ☮

  6. Unicode: 130k characters, 135 scripts
    P U+0050
    LATIN CAPITAL LETTER P
    س U+0633
    ARABIC LETTER SEEN
    和 U+548C
    CJK UNIFIED IDEOGRAPH-548C
    ☮ U+262E
    PEACE SYMBOL
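
The code point numbers and names on this slide can be looked up in any copy of the Unicode character database. A minimal sketch in Python, using only the standard `unicodedata` module; the helper name `describe` is made up for this note and is not part of any library:

```python
import unicodedata


def describe(char: str) -> str:
    """Return 'U+XXXX NAME' for a single code point (hypothetical helper)."""
    return f"U+{ord(char):04X} {unicodedata.name(char)}"


# The four characters shown on the slide, written as escapes to avoid
# any ambiguity about which code point is meant.
for char in ("P", "\u0633", "\u548c", "\u262e"):
    print(describe(char))
```

Each printed line should match the pair of lines shown for that character on the slide.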

  7. (no text on this slide)

  8. 17 planes
    http://reedbeta.com/blog/programmers-intro-to-unicode/
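
A plane is a block of 65,536 (0x10000) consecutive code points, so the plane number is simply the code point shifted right by 16 bits. A small Python sketch of that arithmetic; the function name `plane_of` is an illustrative choice, not an existing API:

```python
MAX_CODE_POINT = 0x10FFFF  # highest code point Unicode allows


def plane_of(char: str) -> int:
    """Return the plane (0 to 16) that holds the given character."""
    return ord(char) >> 16


print(plane_of("P"))               # plane 0, the Basic Multilingual Plane
print(plane_of("\U0001F600"))      # plane 1, where most emoji live
print((MAX_CODE_POINT >> 16) + 1)  # total number of planes
```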

  9. (no text on this slide)

  10. (no text on this slide)

  11. From Code Points to Bytes
    á U+00E1
    LATIN SMALL LETTER A WITH ACUTE
    UTF-16BE 00 E1
    UTF-8 C3 A1
    あ U+3042
    HIRAGANA LETTER A
    UTF-16BE 30 42
    UTF-8 E3 81 82
    UTF-8: 1, 2, 3 or 4 bytes per code point
    UTF-16: 2 or 4 bytes per code point
    UTF-32: 4 bytes per code point
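
The byte sequences on this slide can be reproduced with Python's built-in codecs. This is only a quick check, assuming Python 3.8 or later for `bytes.hex(" ")`; the helper name `hex_bytes` is made up for the example:

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Encode text and return its bytes as space-separated upper-case hex."""
    return text.encode(encoding).hex(" ").upper()


for char in ("\u00e1", "\u3042"):  # á and あ
    for encoding in ("utf-16-be", "utf-8", "utf-32-be"):
        print(f"U+{ord(char):04X} {encoding}: {hex_bytes(char, encoding)}")

# A code point outside plane 0 needs 4 bytes in all three encodings.
emoji = "\U0001F600"
print({enc: len(emoji.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")})
```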

  12. UTF-8 rules the World
    Byte 1   Byte 2   Byte 3   Byte 4
    0xxxxxxx
    110xxxxx 10xxxxxx
    1110xxxx 10xxxxxx 10xxxxxx
    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    • ASCII compatible
    • Self-synchronizing
    • Endianness insensitive
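
The bit patterns above translate almost directly into code. The following Python sketch fills the `x` bits by hand and is meant only to illustrate the layout; in practice `str.encode("utf-8")` does this job. The names `utf8_encode` and `is_continuation` are illustrative, not an existing API:

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by filling the x bits of the UTF-8 patterns."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x80:      # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:   # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),  # 11110xxx + three 10xxxxxx
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


def is_continuation(byte: int) -> bool:
    """True for 10xxxxxx bytes, which never start a character."""
    return (byte & 0xC0) == 0x80


print(utf8_encode(0x3042).hex(" "))
print([is_continuation(b) for b in utf8_encode(0x3042)])
```

Because continuation bytes are recognizable on their own, a decoder that lands in the middle of a stream can skip forward to the next byte that is not a continuation, which is what makes the encoding self-synchronizing.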

  13. Case sensitivity
    • About 1k characters are affected
    • One upper case letter, two lower case variants: Σ ⇔ σ/ς
    • The Turkish exception: I ⇔ i elsewhere vs İ ⇔ i and I ⇔ ı in Turkish
    • Full case folding: ß ⇔ ss
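
Most of these cases can be observed with Python's built-in string methods, which apply the locale-independent Unicode rules (so they do not implement the Turkish tailoring). A short sketch; `same_ignoring_case` is a hypothetical helper written for this note:

```python
def same_ignoring_case(a: str, b: str) -> bool:
    """Compare two strings using full Unicode case folding."""
    return a.casefold() == b.casefold()


# Both lower case sigmas share a single upper case letter.
print("\u03c3".upper() == "\u03c2".upper() == "\u03a3")

# Full case folding maps ß to ss; plain lower-casing does not.
print(same_ignoring_case("Straße", "STRASSE"))
print("Straße".lower() == "STRASSE".lower())

# Without Turkish rules, dotted capital İ lower-cases to i + combining dot.
print([f"U+{ord(c):04X}" for c in "\u0130".lower()])
```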

  14. Collations
    • Lithuanian puts y between i and k
    • In traditional Spanish, ch is a single letter
    • œ sorts between oe and of
    • In Danish, Å is a separate letter that follows Z
    • In Swedish, v and w are the same letter
    • Traditional German considers ä to be the same as ae
    • In French, côté > coté > côte > cote
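
The French ordering in the last bullet differs from a plain code point sort because accents are only a tie-breaker, and they are compared starting from the end of the word. The toy sort key below reproduces that one rule in Python; it is a deliberately simplified sketch (the name `french_key` is made up), not a replacement for a real collation library such as ICU:

```python
import unicodedata


def french_key(word: str):
    """Toy key: base letters first, then accents compared from the end."""
    decomposed = unicodedata.normalize("NFD", word)
    base = ""
    accents = []  # one entry per base letter, holding its combining marks
    for char in decomposed:
        if unicodedata.combining(char) and accents:
            accents[-1] += char
        else:
            base += char
            accents.append("")
    return (base, accents[::-1])


# cote, coté, côte, côté written with escapes so the code points are explicit
words = ["cote", "cot\u00e9", "c\u00f4te", "c\u00f4t\u00e9"]

print(sorted(words))                  # code point order
print(sorted(words, key=french_key))  # accent order from the end of the word
```

The second list comes out as cote, côte, coté, côté, which matches the ordering on the slide; the first one swaps the two middle words.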

  15. Composed/decomposed forms
    NFC
    D é j à
    U+0044 U+00E9 U+006A U+00E0
    NFD
    D e ◌́ j a ◌̀
    U+0044 U+0065 U+0301 U+006A U+0061 U+0300
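
Both forms can be produced from one another with a normalizer. A minimal Python check using the standard `unicodedata` module; the helper `code_points` is only there to print the sequences in the notation used on the slide:

```python
import unicodedata


def code_points(text: str) -> list:
    """List the code points of text in U+XXXX notation."""
    return [f"U+{ord(char):04X}" for char in text]


nfc = unicodedata.normalize("NFC", "De\u0301ja\u0300")  # compose
nfd = unicodedata.normalize("NFD", nfc)                 # decompose again

print(code_points(nfc))
print(code_points(nfd))

# The two strings look identical on screen but are not equal as sequences,
# so normalize both sides to the same form before comparing.
print(nfc == nfd)
print(unicodedata.normalize("NFC", nfd) == nfc)
```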

  16. Grapheme Clusters
    NFC
    D é j à
    U+0044 U+00E9 U+006A U+00E0
    NFD
    D e ◌́ j a ◌̀
    U+0044 U+0065 U+0301 U+006A U+0061 U+0300
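
Whatever the normalization form, a reader sees four characters here. The Python sketch below measures the same word in the three unit systems: bytes, code points and grapheme clusters. The clustering rule is a strong simplification that only attaches combining marks to the preceding code point; the full rules (emoji sequences, Hangul, and so on) are defined in Unicode Standard Annex #29. `simple_clusters` and `lengths` are names invented for this example:

```python
import unicodedata


def simple_clusters(text: str) -> list:
    """Group each base code point with the combining marks that follow it."""
    clusters = []
    for char in text:
        if clusters and unicodedata.combining(char):
            clusters[-1] += char
        else:
            clusters.append(char)
    return clusters


def lengths(text: str) -> dict:
    """Length of text in UTF-8 bytes, code points and (simplified) graphemes."""
    return {
        "bytes": len(text.encode("utf-8")),
        "code_points": len(text),
        "graphemes": len(simple_clusters(text)),
    }


nfc = unicodedata.normalize("NFC", "De\u0301ja\u0300")
nfd = unicodedata.normalize("NFD", nfc)
print(lengths(nfc))
print(lengths(nfd))
```

Only the grapheme count is the same for both forms, which is why it is the unit that usually matches what users expect from "length".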

  17. Emoji - http://unicode.org/reports/tr51/

  18. Emoji glyphs

  19. Unicode fundamentals summary
    • Uppercase, lowercase, folding
    • Compositions, ligatures
    • Comparison: normalizations & collations
    • Segmentation: characters, words, sentences & hyphens
    • Locales: cultural conventions, transliterations
    • Identifiers & security, confusables
    • Display: direction, width

  20. Unicode in Practice?

  21. MySQL
    • utf8_bin: A ≠ a
    • utf8_general_ci: œ ≠ oe
    • utf8_unicode_ci: œ = oe
    • utf8_swedish_ci: z < å < æ = ä < ö = ø
    • SET NAMES utf8mb4: for security and storing emojis

  22. Outside the PHP world
    • ICU: Java and C/C++
      X-like licence, supported by IBM; the reference implementation of Unicode
    • JavaScript: Unicode (NFC)
    • Python: binary and Unicode strings

  23. In PHP (RIP PHP 6)
    • Mbstring: mb_*()
    • Iconv: iconv_*()
    • PCRE: preg_*()
    • Intl: grapheme_*() + normalizer_*(),
      Collator, NumberFormatter, Locale, MessageFormatter, IntlDateFormatter, Spoofchecker, Transliterator, IDN
    • symfony/polyfill-*

  24. symfony/string

  25. (no text on this slide)

  26. symfony/string
    • Immutable value objects
    • AbstractString
    • BinaryString
    • AbstractUnicodeString
    • GraphemeString
    • Utf8String

  27. AbstractString
    function __construct(string $string = '')
    function jsonSerialize(): string
    function length(): int
    function toBinary(): BinaryString
    function toGrapheme(): GraphemeString
    function toUtf8(): Utf8String
    function __clone()
    function __toString(): string

  28. function after($needle, bool $includeNeedle = false): self
    function afterLast($needle, bool $includeNeedle = false): self
    function append(string ...$suffix): self
    function before($needle, bool $includeNeedle = false): self
    function beforeLast($needle, bool $includeNeedle = false): self
    function chunk(int $length = 1): array
    function collapseWhitespace(): self
    function contains(): bool
    function endsWith(string $suffix): bool
    function ensureEnd(string $suffix): self
    function ensureStart(string $prefix): self
    function equalsTo($string): bool
    function folded(): self
    function ignoreCase(): self
    function indexOf($needle, int $offset = 0): ?int
    function indexOfLast($needle, int $offset = 0): ?int
    function isEmpty(): bool
    function join(array $strings): self
    function lower(): self

  29. function match(string $pattern, int $flags = 0, int $offset = 0): ?array
    function padBoth(int $length, string $padStr = ' '): self
    function padEnd(int $length, string $padStr = ' '): self
    function padStart(int $length, string $padStr = ' '): self
    function prepend(string ...$prefix): self
    function repeat(int $multiplier): self
    function replace(string $from, string $to): self
    function replaceMatches(string $fromPattern, $to): self
    function slice(int $start = 0, int $length = null): self
    function splice(string $replacement, int $start = 0, int $length = null): self
    function split(string $delimiter, int $limit = null): array
    function startsWith(string $prefix): bool
    function title(bool $allWords = false): self
    function trim(string $chars = " \t\n\r\0\x0B\x0C\u{A0}\u{FEFF}"): self
    function trimEnd(string $chars = " \t\n\r\0\x0B\x0C\u{A0}\u{FEFF}"): self
    function trimStart(string $chars = " \t\n\r\0\x0B\x0C\u{A0}\u{FEFF}"): self
    function truncate(int $length, string $ellipsis = ''): self
    function upper(): self
    function wordwrap(int $width = 75, string $break = "\n", bool $cut = false): self
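
The default `$chars` of the three trim methods go beyond ASCII whitespace: they also cover the no-break space (U+00A0) and the byte order mark (U+FEFF). The Python analogue below only illustrates why that character set matters; it is not the Symfony API, and `DEFAULT_TRIM_CHARS` and `trim` are names chosen for this sketch:

```python
# Same characters as the default $chars in the signatures above.
DEFAULT_TRIM_CHARS = " \t\n\r\0\x0b\x0c\u00a0\ufeff"


def trim(text: str, chars: str = DEFAULT_TRIM_CHARS) -> str:
    """Strip the given characters from both ends of text."""
    return text.strip(chars)


padded = "\ufeff\u00a0 hello world \t\n"

print(repr(trim(padded)))
# Python's argument-less strip() does not treat U+FEFF as whitespace,
# so a leading byte order mark survives it.
print(repr(padded.strip()))
```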

  30. Merci Thank you Gracias شكرا
    多謝 Ευχαριστώ ありがとう
    감사합니다 Спасибо!
