UTF8: The last encoding you will ever need

What charsets and encodings are, the advantages and disadvantages of each of the available options, and, finally, why UTF-8 is the last encoding you will ever need.

Ricardo Coelho

October 20, 2016

Transcript

  1. UTF-8: THE LAST ENCODING YOU WILL EVER NEED @RAMCOELHO

  2. MOJIBAKE £ =C1RV=CDZT=DBR=D5 T=DCK=D6RF=DAR=D3G=C9P =E1rv=EDzt=FBr=F5 t=FCk=F6rf=FAr=F3g=E9p ÃRVÃZTÅ°RÅ TÃœKÖRFÚRÃ"GÉP árvÃztűrÅ‘ tükörfúrógép
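The garbled strings on the mojibake slide can be reproduced on purpose. The sketch below is an illustration in Python, not something taken from the deck: it assumes the test phrase is the Hungarian "árvíztűrő tükörfúrógép", that the "Ã…" variants come from UTF-8 bytes read as Windows-1252, and that the "=C1RV=CD…" variant is quoted-printable over ISO-8859-2 bytes.

```python
import quopri

original = "árvíztűrő tükörfúrógép"

# UTF-8 bytes misread as Windows-1252: each accented letter turns into two symbols.
garbled = original.encode("utf-8").decode("cp1252")
print(garbled)

# The damage is reversible as long as no byte was lost or replaced on the way.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired)

# The "=C1RV=CD..." form is quoted-printable wrapped around ISO-8859-2 bytes.
qp = b"=C1RV=CDZT=DBR=D5 T=DCK=D6RF=DAR=D3G=C9P"
decoded_qp = quopri.decodestring(qp).decode("iso-8859-2")
print(decoded_qp)
```

The round trip only works because every intermediate byte happens to be defined in Windows-1252; with other text or other code pages the misread step can fail or lose data.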

  3. ENCODING CONVERSION

  4. ENCODING TRANSLATION

  5. COMPUTER 01000011 01001111 01001101 01010000 01010101 01010100 01000101 01010010

  6. ASCII

  7. ASCII 4E 49 43 45 21
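The bit patterns on the COMPUTER slide and the hex bytes on the ASCII slide are both plain ASCII. A minimal Python sketch (variable names are illustrative) that turns them back into text:

```python
# Eight 8-bit groups from the COMPUTER slide.
bits = "01000011 01001111 01001101 01010000 01010101 01010100 01000101 01010010"
word_from_bits = bytes(int(group, 2) for group in bits.split()).decode("ascii")

# Five hex bytes from the ASCII slide.
word_from_hex = bytes.fromhex("4E 49 43 45 21").decode("ascii")

print(word_from_bits)  # COMPUTER
print(word_from_hex)   # NICE!
```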

  8. ASCII SERIOUSLY

  9. EBCDIC ÅÂÃÄÉÃ
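The accented letters on the EBCDIC slide look like the word "EBCDIC" encoded in EBCDIC and then read as Latin-1. The sketch below assumes code page 037 as the EBCDIC variant, which the slide does not specify; Python ships it as the `cp037` codec.

```python
# Encode with an EBCDIC code page (cp037 is an assumption, see above).
ebcdic_bytes = "EBCDIC".encode("cp037")
print(ebcdic_bytes.hex())  # c5c2c3c4c9c3

# Read the same bytes as if they were ISO-8859-1 (Latin-1).
misread = ebcdic_bytes.decode("latin-1")
print(misread)  # ÅÂÃÄÉÃ
```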

  10. ENCODING

  11. THE EIGHTH BIT ˢ˱BOX DRAWINGˮ˺

  12. CODE PAGES CP850, CP858, CP437, CP 1252 (WINDOWS 1252), ISO-8859-1 (LATIN 1)
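The problem with code pages is that a byte above 0x7F has no meaning until you know which page applies. A small Python sketch, using the codecs Python provides for some of the code pages named on the slide, shows one byte read four ways while an ASCII byte stays stable:

```python
codecs_to_try = ("cp437", "cp850", "cp1252", "iso-8859-1")

high_byte = b"\xe9"   # uses the eighth bit: meaning depends on the code page
ascii_byte = b"A"     # 7-bit ASCII: same meaning in all of them

readings = {name: high_byte.decode(name) for name in codecs_to_try}
ascii_readings = {name: ascii_byte.decode(name) for name in codecs_to_try}

for name in codecs_to_try:
    print(f"{name:>10}: 0xE9 -> {readings[name]!r}, 0x41 -> {ascii_readings[name]!r}")
```

Only the two Windows/ISO pages agree on "é" here; the two DOS pages each map the byte to a different symbol.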
  13. ASCII CODE PAGES ISO-8859-1 … ISO-8859-16 … … Standard 128 chars Charsets
  14. CHARSET

  15. MULTI-ALPHABETS DOCUMENTS Δεν το πιστεύω ͘΀͵ͯ͢Ώ̶ͣ ͳ΢͢፥䋚Ͷ

  16. UNICODE LET THERE BE ידוחיי דוק תכרעמ יהי fiat unum

    codice 捰ํӞ㮆ࠔӞጱդ嘨ᔮ妞 एक अिद्वतीय कोड िसस्टम हुन त्यहाँ गरौं hãy có một hệ thống mã độc đáo ας υπάρχει ένα μοναδικό σύστημα κωδικού ੉ ࣻ ੓ب۾ Ҋਬ ௏٘ दझమ ੌ ࣻ
  17. THE PENTALOGUE UNICODE • Thou shalt support every alphabet known to man • Thou shalt have ASCII compatibility • Thou shalt depend upon no code page • Thy symbol shalt have its own code point (sometimes more) • Thou shalt support bi-directional languages
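The "(sometimes more)" caveat matters in practice: the same visible symbol can be one code point or a base letter plus a combining mark. A short Python sketch using the standard `unicodedata` module, with Ç chosen here only as an example:

```python
import unicodedata

precomposed = "\u00C7"   # Ç as a single code point
decomposed = "C\u0327"   # C followed by COMBINING CEDILLA

# They render the same but compare as different strings.
print(precomposed == decomposed)  # False

# Normalizing both sides to the same form makes comparison reliable.
as_nfc = unicodedata.normalize("NFC", decomposed)
as_nfd = unicodedata.normalize("NFD", precomposed)
print(as_nfc == precomposed, as_nfd == decomposed)  # True True
```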
  18. OR, IS IT? UCS-2: 16 BITS TO RULE THEM ALL 2^16 = 65,536 “640K ought to be enough for anybody.” – Gates, William (64K) This encoding later evolved into UTF-16
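Sixteen bits cover only 65,536 values, so UTF-16 represents anything above U+FFFF as a pair of 16-bit surrogates. The sketch below shows the arithmetic for U+12453 (a code point that appears later in the deck); `to_surrogates` is a hypothetical helper name, checked against Python's built-in codec.

```python
def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above the BMP need surrogates")
    offset = cp - 0x10000               # 20 bits remain
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low

high, low = to_surrogates(0x12453)
print(f"{high:04X} {low:04X}")  # D809 DC53

# Cross-check against the standard library's UTF-16 encoder.
manual = high.to_bytes(2, "big") + low.to_bytes(2, "big")
builtin = chr(0x12453).encode("utf-16-be")
print(manual == builtin)  # True
```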
  19. SURE, THAT WILL WORK! 32 BITS TO RULE THEM ALL This encoding later evolved into UTF-32
  20. OR, IS IT? 32 BITS TO RULE THEM ALL

  21. NOT PAGES ;) PLANES Basic Multilingual Plane (BMP) U+0000 – U+FFFF Control ASCII, ASCII, Extended ASCII (including Box Drawing), Latin, Greek, Cyrillic (including Supplement), Hebrew, Arabic, Bengali, Runic, Thai, Cherokee, Phonetic, Balinese, Hiragana, Katakana, Vedic, Braille… (ISO-8859-x)
  22. NOT PAGES ;) PLANES Supplementary Multilingual Plane (SMP) U+10000 – U+1FFFF Cuneiform, Hieroglyphs, Ancient Persian, Musical Symbols (Modern and Ancient), Mahjong, Domino and Cards Symbols, Emoticons, Alchemical Symbols
  23. NOT PAGES ;) PLANES Supplementary Ideographic Plane (SIP) U+20000 – U+2FFFF Rarely Used Chinese/Japanese/Korean Ideographs Han Unification: Hanzi, Kanji & Hanja
  24. NOT PAGES ;) PLANES 11 Unassigned Planes (3-13) U+30000 – U+DFFFF Plane 3 will (probably) be called TIP: Tertiary Ideographic Plane Reserved for: Oracle Bone script, Bronze Script, Small Seal Script No glyphs assigned yet. (represents 64.7% of total planes)
  25. NOT PAGES ;) PLANES Supplementary Special-purpose Plane U+E0000 – U+EFFFF Invisible Text Tags (which are not recommended) and CJK Variations (99.5% unassigned)
  26. NOT PAGES ;) PLANES Private Use Area Planes (15 & 16) U+F0000 – U+10FFFF Ligatures, auxiliary glyphs and glyph building blocks Limited interoperability, assignment from outside ISO and Unicode Consortium
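Because each plane spans exactly 0x10000 code points, the plane number is simply the code point shifted right by 16 bits. A minimal Python sketch (the `plane` helper is a hypothetical name) that also re-derives the 64.7% figure from the unassigned-planes slide:

```python
def plane(cp: int) -> int:
    """Return the Unicode plane (0..16) that contains code point cp."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    return cp >> 16

for cp in (0x0041, 0x12453, 0x20000, 0xE0001, 0x10FFFF):
    print(f"U+{cp:04X} is in plane {plane(cp)}")

# 11 of the 17 planes (3 through 13) were unassigned when the deck was written.
unassigned_share = round(11 / 17 * 100, 1)
print(unassigned_share)  # 64.7
```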
  27. FOR UNICODE TRANSFORMATION FORMATS SECURITY CONCERNS • UTF-16 APIs • UTF-32 APIs • UTF-EBCDIC • Endianness • UTF-16LE / UTF-16BE • UTF-32LE / UTF-32BE • Efficiency
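Two of the concerns listed above, endianness and efficiency, are easy to see in bytes. The Python sketch below uses the explicit LE/BE codec names so that no byte order mark is emitted; the sample sentence is an arbitrary ASCII-only string chosen for illustration.

```python
codecs_to_compare = ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")

# Endianness: the same letter, different byte order.
layouts = {name: "A".encode(name).hex() for name in codecs_to_compare}
for name, hexed in layouts.items():
    print(f"{name:>9}: {hexed}")

# Efficiency: size in bytes of the same ASCII-only text.
sample = "The last encoding you will ever need"
sizes = {name: len(sample.encode(name)) for name in codecs_to_compare}
print(sizes)
```

For ASCII-heavy text UTF-16 doubles and UTF-32 quadruples the size, and both force the reader to know the byte order; UTF-8 is a byte stream and has neither issue.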
  28. IN THE REAL WORLD UNICODE Credit: Chima by LEGO™ HOW TO FIT 32 BITS INSIDE AN 8-BIT CHANNEL
  29. UTF-8

  30. IT’S MAGICAL! UTF-8 U+0041 A ⠀ U+3B04 U+12453 U+00C7 Ç
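The "magic" is a variable-length bit layout: one to four bytes per code point, with ASCII left untouched. The following is a hand-written sketch of that layout in Python (`utf8_encode` is a hypothetical helper, not the deck's code), checked against the built-in encoder for the four code points on the slide.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by hand."""
    if 0xD800 <= cp <= 0xDFFF or not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:        # 0xxxxxxx: plain ASCII, one byte
        return bytes([cp])
    if cp < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

for cp in (0x0041, 0x00C7, 0x3B04, 0x12453):
    encoded = utf8_encode(cp)
    assert encoded == chr(cp).encode("utf-8")  # agrees with the built-in codec
    print(f"U+{cp:04X} -> {encoded.hex(' ')}")
```

Every lead byte announces the sequence length and every continuation byte starts with the bits 10, which is what lets a decoder resynchronize after a corrupted byte.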

  31. UTF-8 • No conversion ever needed again! • Your OS supports it • Your application server supports it • Your application supports it • Your database server supports it • Your API supports it • Your friends support it Let’s U+!
  32. FOR PHP PROGRAMMERS PRACTICAL ADVICE • Forget about utf8_encode / utf8_decode • Try iconv. You can thank me later • Find an mb_ version of your string function • mb_detect_encoding does the best it can (not its fault)
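The advice above is PHP-specific, but the two underlying ideas are language-neutral: convert with an explicitly named source encoding (what iconv does), and do not expect detection to be reliable. The sketch below illustrates both in Python rather than PHP; `convert` and `plausible_encodings` are hypothetical helpers, not ports of iconv or mb_detect_encoding.

```python
def convert(data: bytes, source: str, target: str = "utf-8") -> bytes:
    """Transcode bytes from a known source encoding; fails loudly on bad input."""
    return data.decode(source).encode(target)

def plausible_encodings(data: bytes,
                        candidates=("utf-8", "iso-8859-1", "iso-8859-2", "cp1252")):
    """Return every candidate encoding under which the bytes decode without error."""
    matches = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        matches.append(name)
    return matches

legacy = "árvíztűrő".encode("iso-8859-2")

# Detection is guesswork: several single-byte encodings accept the same bytes.
guesses = plausible_encodings(legacy)
print(guesses)

# Conversion is deterministic once the source encoding is known.
as_utf8 = convert(legacy, "iso-8859-2")
print(as_utf8.decode("utf-8"))
```

Because almost any byte sequence is valid in a single-byte code page, a detector can only rank guesses; the reliable fix is to record the encoding where the data originates.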
  33. THANK YOU!

  34. QUESTIONS

  35. REFERENCES • http://www.babelstone.co.uk/Unicode/unicode.html • http://en.wikipedia.org/wiki/%D0%AF#Computing_codes • http://en.wikipedia.org/wiki/UTF-8#Examples • http://www.fileformat.info/info/unicode/char/2F80/index.htm • http://dev.mysql.com/doc/refman/5.5/en/charset-general.html • http://blog.tremend.ro/2006/09/26/mysql-php-and-utf8/ • http://www.php.net/manual/en/function.mb-detect-encoding.php • http://www.php.net/manual/en/function.iconv.php • http://php.net/manual/en/function.utf8-decode.php • https://en.wikipedia.org/wiki/Plane_(Unicode) • https://en.wikipedia.org/wiki/Unicode • https://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings • http://en.wikipedia.org/wiki/ASCII • http://www.asciitable.com/ • http://en.wikipedia.org/wiki/Teleprinter • http://en.wikipedia.org/wiki/Microprocessor#8-bit_designs • http://en.wikipedia.org/wiki/Code_page_437 • http://en.wikipedia.org/wiki/Code_pages#IBM_PC_.28OEM.29_code_pages • http://en.wikipedia.org/wiki/Iso-8859