UTF8: The last encoding you will ever need

What charsets and encodings are, the advantages and disadvantages of each of the available options, and, finally, why UTF-8 is the last encoding you will ever need.

Ricardo Coelho

October 20, 2016

Transcript

  1. UTF-8: THE LAST
    ENCODING YOU WILL
    EVER NEED
    @RAMCOELHO

  2. MOJIBAKE
    £
    =C1RV=CDZT=DBR=D5 T=DCK=D6RF=DAR=D3G=C9P
    =E1rv=EDzt=FBr=F5 t=FCk=F6rf=FAr=F3g=E9p
    ÃRVÃZTŰRÅ TÜKÖRFÚRÃ"GÉP
    árvÃztűrÅ‘ tükörfúrógép
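The garbled lines on this slide can be reproduced on purpose. The sketch below takes the quoted-printable bytes from the slide, decodes them once with the charset they appear to have been written in (ISO-8859-2 is an assumption based on the Hungarian letters) and once with a different one, then shows the common web variant where UTF-8 bytes are read as Windows-1252.

```python
import quopri

# Quoted-printable bytes copied from the lowercase line on the slide.
raw = quopri.decodestring(b"=E1rv=EDzt=FBr=F5 t=FCk=F6rf=FAr=F3g=E9p")

# Decoded with the charset the bytes were written in (assumed: ISO-8859-2).
right = raw.decode("iso-8859-2")

# Decoded with another single-byte charset: no error, just wrong letters.
wrong = raw.decode("iso-8859-1")

# The classic web variant: UTF-8 bytes read back as Windows-1252.
mojibake = right.encode("utf-8").decode("cp1252")

# Reversible only while every byte survived the trip.
repaired = mojibake.encode("cp1252").decode("utf-8")

print(right)
print(wrong)
print(mojibake)
```

Nothing raises an exception here, which is why mojibake tends to go unnoticed until a human reads the text.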

  3. ENCODING
    CONVERSION

  4. ENCODING
    TRANSLATION

  5. COMPUTER
    01000011 01001111 01001101 01010000 01010101 01010100 01000101 01010010
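As a small illustration, the eight bit groups on this slide can be turned back into text by reading each group as a number and looking that number up in ASCII. The variable names are only for this sketch.

```python
bits = "01000011 01001111 01001101 01010000 01010101 01010100 01000101 01010010"

# Each 8-bit group is one number between 0 and 255.
codes = [int(group, 2) for group in bits.split()]

# An encoding is the table that maps those numbers to characters.
text = bytes(codes).decode("ascii")

print(codes)
print(text)
```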

  6. ASCII

  7. ASCII
    4E 49 43 45 21
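The hex bytes on this slide decode as plain ASCII. A minimal sketch, which also checks that every byte fits in seven bits:

```python
data = bytes.fromhex("4E 49 43 45 21")

# ASCII only defines values 0x00-0x7F, so the top bit of each byte is unused.
seven_bit = all(byte < 0x80 for byte in data)

text = data.decode("ascii")
roundtrip = text.encode("ascii").hex(" ").upper()

print(text, seven_bit, roundtrip)
```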

  8. ASCII
    SERIOUSLY

  9. EBCDIC
    ÅÂÃÄÉÃ
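The odd string on this slide looks like the word EBCDIC itself, encoded in EBCDIC and then displayed as Latin-1. The sketch below reproduces it, assuming code page 037 (one of several EBCDIC variants; the deck does not say which one was used).

```python
word = "EBCDIC"

# cp037 is one EBCDIC code page shipped with Python; chosen here as an assumption.
ebcdic_bytes = word.encode("cp037")

# Reading those bytes as Latin-1 gives the string shown on the slide.
misread = ebcdic_bytes.decode("latin-1")

print(ebcdic_bytes.hex(" "))
print(misread)
```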

  10. ENCODING

  11. THE EIGHTH BIT
    ˢ˱BOX DRAWINGˮ˺
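Once bytes had an eighth bit, values 0x80-0xFF became available for extra symbols. As a rough sketch, assuming CP437 (the original IBM PC code page), a few of those values are box-drawing characters, while the same bytes mean something else in Latin-1:

```python
# Four bytes with the high bit set.
frame = bytes([0xC9, 0xCD, 0xCD, 0xBB])

high_bit_set = all(byte >= 0x80 for byte in frame)

as_cp437 = frame.decode("cp437")     # box-drawing corner and lines
as_latin1 = frame.decode("latin-1")  # accented letters and a guillemet

print(as_cp437)
print(as_latin1)
```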

  12. CODE PAGES
    CP850, CP858, CP437, CP1252 (WINDOWS-1252), ISO-8859-1 (LATIN-1)
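To see why code pages cause trouble, the sketch below decodes one and the same byte with each code page named on this slide, and then looks up where the euro sign lives in the ones that have it. The codec names are Python's spellings of those code pages.

```python
byte = b"\xe9"
code_pages = ["cp437", "cp850", "cp858", "cp1252", "iso-8859-1"]

# One byte value, several different characters.
decoded = {name: byte.decode(name) for name in code_pages}

# The euro sign sits at different positions, or is missing entirely.
euro = {}
for name in code_pages:
    try:
        euro[name] = "€".encode(name).hex()
    except UnicodeEncodeError:
        euro[name] = None

print(decoded)
print(euro)
```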

  13. ASCII CODE PAGES
    ISO-8859-1 … ISO-8859-16
    Standard 128 chars
    Charsets
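A quick way to check the "standard 128 chars" idea: every ISO-8859 part should decode bytes 0x00-0x7F exactly like ASCII and differ only in the upper half. This sketch uses Python's bundled codecs; part 12 is skipped because it was abandoned and Python ships no codec for it.

```python
# ISO-8859 parts available as Python codecs (there is no part 12).
parts = [n for n in range(1, 17) if n != 12]

lower = bytes(range(128))
ascii_text = lower.decode("ascii")

# The lower half is identical everywhere.
shared_lower_half = all(lower.decode(f"iso8859_{n}") == ascii_text for n in parts)

# The upper half is where the charsets differ; 0xF5 is just one sample byte.
upper_sample = {n: b"\xf5".decode(f"iso8859_{n}", errors="replace") for n in parts}

print(shared_lower_half)
print(upper_sample)
```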

  14. CHARSET

  15. MULTI-ALPHABETS DOCUMENTS
    Δεν το πιστεύω
    ͘΀͵ͯ͢Ώ̶ͣ ͳ΢͢፥䋚Ͷ
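The problem with multi-alphabet documents can be sketched as follows: a single-byte code page covers one alphabet, so a document mixing two of them cannot be stored in either. The Japanese string below is a stand-in chosen for this example, since the slide's second line did not survive extraction.

```python
greek = "Δεν το πιστεύω"
japanese = "信じられない"  # stand-in text, not recovered from the slide
document = greek + " / " + japanese


def can_encode(text, charset):
    """Return True if every character of text exists in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


print(can_encode(greek, "iso-8859-7"))     # Greek alone fits the Greek code page
print(can_encode(document, "iso-8859-7"))  # the mixed document does not
print(can_encode(document, "utf-8"))       # a Unicode encoding holds both
```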

  16. UNICODE
    LET THERE BE
    יהי מערכת קוד ייחודי
    fiat unum codice
    捰ํӞ㮆ࠔӞጱդ嘨ᔮ妞
    एक अद्वितीय कोड सिस्टम हुन त्यहाँ गरौं
    hãy có một hệ thống mã độc đáo
    ας υπάρχει ένα μοναδικό
    σύστημα κωδικού
    ੉ ࣻ ੓ب۾ Ҋਬ ௏٘ दझమ ੌ ࣻ

  17. THE PENTALOGUE
    UNICODE
    • Thou shalt support every alphabet known to man
    • Thou shalt have ASCII compatibility
    • Thou shalt depend upon no code page
    • Thy symbol shalt have its own code point (sometimes more)
    • Thou shalt support bi-directional languages
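Three of these rules can be poked at with Python's standard `unicodedata` module. This is only a sketch: it checks ASCII compatibility for one letter, shows a symbol that has both a single code point and a two-code-point spelling (the "sometimes more" part), and reads the bidirectional class of a Latin and a Hebrew letter.

```python
import unicodedata

# ASCII compatibility: the code point of "A" equals its ASCII value.
ascii_value = "A".encode("ascii")[0]
code_point = ord("A")

# One symbol, two spellings: a single code point, or a letter plus a combining mark.
precomposed = "\u00C7"   # Ç
decomposed = "C\u0327"   # C followed by COMBINING CEDILLA
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)

# Bidirectional support: each character carries a direction class.
latin_dir = unicodedata.bidirectional("A")
hebrew_dir = unicodedata.bidirectional("\u05D0")  # HEBREW LETTER ALEF

print(ascii_value, code_point, len(precomposed), len(decomposed), latin_dir, hebrew_dir)
```

Comparing strings without normalizing them first is a common source of "identical" text that does not compare equal.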

  18. OR, IS IT?
    UCS-2: 16 BITS TO RULE THEM ALL
    2^16 = 65,536
    “640K ought to be enough for anybody.”
    – Gates, William
    (64K)
    This encoding later evolved into UTF-16
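The step from UCS-2 to UTF-16 is the surrogate pair: code points above U+FFFF are split across two 16-bit units. A minimal sketch of that arithmetic follows; the function name is made up for this example, and it does not reject invalid input such as lone surrogates.

```python
def to_utf16_units(code_point):
    """Return the 16-bit code units UTF-16 uses for one code point."""
    if code_point < 0x10000:
        return [code_point]           # fits in one unit, as in UCS-2
    offset = code_point - 0x10000     # 20 bits remain
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return [high, low]


for cp in (0x00C7, 0x3B04, 0x12453):
    units = to_utf16_units(cp)
    print(f"U+{cp:04X}", [f"{u:04X}" for u in units])
```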

  19. SURE, THAT WILL WORK!
    32 BITS TO RULE THEM ALL
    This encoding later evolved into UTF-32

  20. OR, IS IT?
    32 BITS TO RULE THEM ALL
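One plausible reading of the doubt raised here is cost: a fixed 32 bits per character is simple, but it spends four bytes even on plain ASCII. A rough size comparison, using Python's built-in codecs:

```python
text = "COMPUTER"

# Bytes needed for the same eight ASCII letters in each encoding form.
sizes = {name: len(text.encode(name)) for name in ("utf-8", "utf-16-be", "utf-32-be")}

# UTF-32 is fixed width: a rare character costs the same four bytes.
rare = chr(0x12453)
rare_size = len(rare.encode("utf-32-be"))

print(sizes)
print(rare_size)
```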

  21. NOT PAGES ;)
    PLANES
    Basic Multilingual Plane (BMP)
    U+0000 – U+FFFF
    Control ASCII, ASCII, Extended ASCII (including Box Drawing), Latin, Greek, Cyrillic
    (including Supplement), Hebrew, Arabic, Bengali, Runic, Thai, Cherokee, Phonetic,
    Balinese, Hiragana, Katakana, Vedic, Braille… (ISO-8859-x)

  22. NOT PAGES ;)
    PLANES
    Supplementary Multilingual Plane (SMP)
    U+10000 – U+1FFFF
    Cuneiform, Hieroglyphs, Ancient Persian, Musical Symbols (Modern and Ancient),
    Mahjong, Domino and Cards Symbols, Emoticons, Alchemical Symbols

  23. NOT PAGES ;)
    PLANES
    Supplementary Ideographic Plane (SIP)
    U+20000 – U+2FFFF
    Rarely Used Chinese/Japanese/Korean Ideographs
    Han Unification: Hanzi, Kanji & Hanja

  24. NOT PAGES ;)
    PLANES
    11 Unassigned Planes (3-13)
    U+30000 – U+DFFFF
    Plane 3 will probably be called the TIP: Tertiary Ideographic Plane
    Reserved for: Oracle Bone script, Bronze script, Small Seal script
    No glyphs assigned yet. (These 11 planes are 64.7% of all 17 planes.)

  25. NOT PAGES ;)
    PLANES
    Supplementary Special-purpose Plane
    U+E0000 – U+EFFFF
    Invisible Text Tags (which are not recommended) and CJK Variations
    (99.5% unassigned)

  26. NOT PAGES ;)
    PLANES
    Private Use Area Planes (15 & 16)
    U+F0000 – U+10FFFF
    Ligatures, auxiliary glyphs and glyph building blocks
    Limited interoperability; assignments are made outside ISO and the Unicode Consortium
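The plane ranges on these slides follow one pattern: each plane holds 0x10000 code points, so the plane number is the code point shifted right by 16 bits. A small sketch, where the function name and the abbreviation table are only for illustration:

```python
PLANE_NAMES = {0: "BMP", 1: "SMP", 2: "SIP", 14: "SSP", 15: "PUA-A", 16: "PUA-B"}


def plane_of(code_point):
    """Return the Unicode plane number (0-16) for a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    return code_point >> 16


for cp in (0x0041, 0x12453, 0x20000, 0xE0001, 0x10FFFF):
    plane = plane_of(cp)
    print(f"U+{cp:04X} -> plane {plane} ({PLANE_NAMES.get(plane, 'unassigned')})")

# Share of the 17 planes taken by the 11 unassigned ones (planes 3-13).
unassigned_share = round(11 / 17 * 100, 1)
print(unassigned_share)
```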

  27. FOR UNICODE TRANSFORMATION FORMATS
    SECURITY CONCERNS
    • UTF-16 APIs
    • UTF-32 APIs
    • UTF-EBCDIC
    • Endianness
    • UTF-16LE / UTF-16BE
    • UTF-32LE / UTF-32BE
    • Efficiency
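The endianness item can be made concrete with one letter. In the sketch below the same two bytes give "A" or an unrelated CJK character depending on the assumed byte order, and a byte order mark (BOM) is what lets the generic `utf-16` codec pick the right one.

```python
import codecs

text = "A"
le = text.encode("utf-16-le")   # low byte first
be = text.encode("utf-16-be")   # high byte first

# Guessing the byte order wrong silently yields a different character.
misread = be.decode("utf-16-le")

# With a BOM in front, the generic codec detects the order and strips the mark.
with_bom = codecs.BOM_UTF16_BE + be
decoded = with_bom.decode("utf-16")

print(le, be, f"U+{ord(misread):04X}", decoded)
```

UTF-8 has no such ambiguity because its unit is a single byte.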

  28. IN THE REAL WORLD
    UNICODE
    HOW TO FIT 32 BITS INSIDE AN 8-BIT CHANNEL
    Credit: Chima by LEGO™

  29. UTF-8

  30. IT’S MAGICAL!
    UTF-8
    U+0041 A
    U+00C7 Ç
    U+3B04
    U+12453
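The "magic" is a variable-length bit layout: the four code points on this slide need one, two, three and four bytes respectively. Below is a hand-written sketch of the encoder, checked against Python's built-in codec. It is a simplified version that does not reject surrogates or out-of-range values.

```python
def utf8_encode(cp):
    """Encode one code point as UTF-8 (no validation of surrogates or range)."""
    if cp < 0x80:                      # 0xxxxxxx: plain ASCII, one byte
        return bytes([cp])
    if cp < 0x800:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),   # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for cp in (0x0041, 0x00C7, 0x3B04, 0x12453):
    encoded = utf8_encode(cp)
    print(f"U+{cp:04X}", len(encoded), encoded.hex(" "))
```

The first branch is why every ASCII file is already valid UTF-8.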

  31. UTF-8
    • No conversion ever needed again!
    • Your OS supports it
    • Your application server supports it
    • Your application supports it
    • Your database server supports it
    • Your API supports it
    • Your friends support it
    Let’s U+!

  32. FOR PHP PROGRAMMERS
    PRACTICAL ADVICE
    • Forget about utf8_encode / utf8_decode
    • Try iconv. You can thank me later
    • Find an mb_ version of your string function
    • mb_detect_encoding does the best it can (not its fault)
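The reasoning behind the last two bullets is not specific to PHP, so it is sketched here in Python. Byte-oriented string functions count bytes rather than characters (the split PHP shows between `strlen` and `mb_strlen`), and encoding detection can only guess, because the same bytes are valid in several encodings at once. The helper name and the candidate list are made up for this example.

```python
data = "Ç".encode("utf-8")

byte_count = len(data)                  # what a byte-oriented function sees
char_count = len(data.decode("utf-8"))  # what a multibyte-aware function sees


def plausible_encodings(raw, candidates=("utf-8", "iso-8859-1", "cp1252", "cp850")):
    """Return every candidate encoding that decodes raw without an error."""
    matches = []
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        matches.append(name)
    return matches


print(byte_count, char_count)
print(plausible_encodings(data))     # several candidates fit
print(plausible_encodings(b"\xe9"))  # only UTF-8 can be ruled out
```

A detector can rule encodings out, but it rarely has enough evidence to rule exactly one in.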

  33. THANK YOU!

  34. QUESTIONS
