An Introduction to Character Encoding - WCNO - WCNL

My talk from WordCamp Norway 2016 and WordCamp Nederlands 2016

As a developer, understanding character encoding adds a lot of clarity to your work, especially when you’re dealing with text that contains characters beyond A-Z. If you’ve ever migrated a database from one site to another and ended up with jumbled characters in your content, this talk is for you. I’ll also explain why emoji in WordPress is the PR face of something much deeper and more important.

Video: https://www.youtube.com/watch?v=tdhaHt9-6Kw

#wordpress #wordcamp #wcno #wcnl #php

John Blackbourn

October 15, 2016

Transcript

  1. #WCNO John Blackbourn
    • WordPress core developer
    • Senior engineer at Human Made
    • Find me on Twitter, GitHub, WordPress.org, etc: @johnbillion
  2. ASCII: American Standard Code for Information Interchange
    0-15: NUL SOH STX ETX EOT ENQ ACK BEL BS TAB LF VT FF CR SO SI
    16-31: DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US
    32-47: SP ! " # $ % & ' ( ) * + , - . /
    48-63: 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
    64-79: @ A B C D E F G H I J K L M N O
    80-95: P Q R S T U V W X Y Z [ \ ] ^ _
    96-111: ` a b c d e f g h i j k l m n o
    112-127: p q r s t u v w x y z { | } ~ DEL
  3. ASCII doesn’t include characters for…
    • Ahom
    • Arabic
    • Imperial Aramaic
    • Armenian
    • Avestan
    • Balinese
    • Bamum
    • Batak
    • Bengali
    • Bopomofo
    • Brahmi
  4. Unicode: 120,000 characters and counting
    163 £ U+00A3
    197 Å U+00C5
    198 Æ U+00C6
    216 Ø U+00D8
    8364 € U+20AC
    128169 💩 U+1F4A9
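The decimal numbers and the U+ values on this slide are two spellings of the same code point. A minimal sketch in Python (chosen here only for illustration; the helper name `describe` is hypothetical) shows the relationship:

```python
# Each character maps to one Unicode code point.
# The U+XXXX notation is that same number written in hexadecimal.
chars = ["£", "Å", "Æ", "Ø", "€", "\U0001F4A9"]

def describe(ch):
    """Return (decimal code point, U+ notation) for a single character."""
    cp = ord(ch)
    return cp, f"U+{cp:04X}"

for ch in chars:
    decimal, notation = describe(ch)
    print(ch, decimal, notation)
```

Running it should reproduce the pairs listed on the slide, for example 163 alongside U+00A3.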
  5. 010101011010100111111100101100100101
    • UTF-8 uses the high bit to signify leading / continuation byte of a sequence of multiple bytes.
    • Windows-1252 uses the high bit to fit in 128 more characters.
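To make the two uses of the high bit concrete, here is a small Python sketch (for illustration only; `bits` and `utf8_byte_role` are hypothetical helper names). It prints the bit patterns for £ in both encodings and classifies each UTF-8 byte by its leading bits:

```python
def bits(data):
    """Render each byte as an 8-character string of 0s and 1s."""
    return [f"{b:08b}" for b in data]

def utf8_byte_role(b):
    """Classify one UTF-8 byte by its high bits."""
    if b < 0x80:
        return "single"        # 0xxxxxxx: plain ASCII
    if (b & 0xC0) == 0x80:
        return "continuation"  # 10xxxxxx: continues a sequence
    return "leading"           # 11xxxxxx: starts a multi-byte sequence

utf8 = "£".encode("utf-8")     # two bytes in UTF-8
cp1252 = "£".encode("cp1252")  # one byte in Windows-1252

print(bits(utf8), [utf8_byte_role(b) for b in utf8])
print(bits(cp1252))
```

In UTF-8 the high bit marks structure (a leading byte followed by a continuation byte), while in Windows-1252 the same bit simply selects one of the upper 128 characters.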
  6. Here’s the kicker
    A two-byte character encoded with UTF-8 will be seen as two separate characters if it’s read using Windows-1252.
  7. Character | Code point | UTF-8 | Windows-1252 | Mojibake
    A | 65 | 41 | 41 | A
    £ | 163 | C2 A3 | A3 | Â£
    Å | 197 | C3 85 | C5 | Ã…
    Æ | 198 | C3 86 | C6 | Ã†
    Ø | 216 | C3 98 | D8 | Ã˜
    € | 8364 | E2 82 AC | 80 | â‚¬
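The mojibake shown on this slide can be reproduced by encoding text as UTF-8 and then decoding the resulting bytes as Windows-1252. The following is a minimal Python sketch (Python is used only for illustration, and `mojibake` is a hypothetical helper name):

```python
def mojibake(text):
    """Encode as UTF-8, then (wrongly) decode the bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")

for ch in ["A", "£", "Å", "Æ", "Ø", "€"]:
    utf8_hex = ch.encode("utf-8").hex(" ").upper()
    # Columns: character, code point, UTF-8 bytes, misread result
    print(ch, ord(ch), utf8_hex, mojibake(ch))
```

Plain ASCII such as A survives unchanged because both encodings agree on the first 128 values; everything above that turns into two or three unrelated characters.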
  8. Here’s the takeaway
    If you’re storing or transmitting text, you need to know what encoding it uses, otherwise you cannot reliably display it.
  9. How does mojibake happen?
    • Migrating data between databases
      Destination database’s encoding doesn’t match source
    • Reading strings using wrong encoding
      Reading a Windows-1252 encoded Word file as UTF-8
      Reading an XML feed that uses a different encoding
    • Opening files in editor using wrong encoding
      Most editors can switch encoding but can’t often fix it
  10. How can mojibake be fixed?
    • Migrating data between databases
      Re-import using the correct encoding (collation)
    • Reading strings using wrong encoding
      iconv() in PHP if you know the source encoding
    • Opening files in editor using wrong encoding
      Re-open file using correct encoding, then convert
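When the wrong encoding is known, the misreading can often be undone by reversing it. The slide names iconv() in PHP for this; the sketch below shows the same idea in Python, for illustration only, assuming UTF-8 text that was misread as Windows-1252. The name `fix_mojibake` is hypothetical, and the approach only works if no bytes were lost or replaced along the way:

```python
def fix_mojibake(garbled):
    """Undo 'UTF-8 bytes read as Windows-1252'.

    Re-encode the garbled text back to the original bytes using
    Windows-1252, then decode those bytes correctly as UTF-8.
    If the round trip is not possible, return the input unchanged.
    """
    try:
        return garbled.encode("cp1252").decode("utf-8")
    except UnicodeError:
        return garbled

print(fix_mojibake("Ã…"))   # repaired
print(fix_mojibake("Å"))    # already correct, left alone
```

Returning the input unchanged on failure is a design choice for this sketch: text that was never garbled usually fails the round trip, so it passes through untouched.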
  11. Multibyte in PHP
    • String functions
      substr(), strlen()
      Only support single-byte characters
      Using them will split multibyte characters
    • Multibyte String Functions
      mb_strlen() mb_strtolower() mb_substr() and more…
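The difference between byte-oriented and character-oriented string functions can be illustrated outside PHP as well. In the Python sketch below (an analogy only, not PHP's actual behaviour), operating on the encoded bytes stands in for the single-byte functions and operating on the string stands in for the multibyte ones. The sample word is an arbitrary choice:

```python
word = "Blåbær"               # 6 characters
data = word.encode("utf-8")   # å and æ take two bytes each

byte_length = len(data)       # counts bytes
char_length = len(word)       # counts characters

# Cutting after 3 bytes lands in the middle of the two-byte "å".
byte_cut = data[:3].decode("utf-8", errors="replace")
# Cutting after 3 characters keeps "å" intact.
char_cut = word[:3]

print(byte_length, char_length)
print(byte_cut, char_cut)
```

The byte-based cut ends with a replacement character because only the first half of å survived, which is the kind of damage the slide warns about.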
  12. Multibyte in
    ☺ ✊ ✌ ✋ ☝ £ ¤ ¦ § © ª « ¬ - ® ¯ ° ± ² ³ ´ µ ¶
  13. Multibyte in
    • utf8
      MySQL database character encoding that supports up to three bytes per character.
    • utf8mb4
      MySQL database character encoding that supports up to four bytes per character. Enables support for all four-byte characters in UTF-8.
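Whether a piece of text fits within the three-byte limit described here depends only on how many bytes each character needs in UTF-8. A rough Python sketch (for illustration; `fits_three_bytes` is a hypothetical helper that checks byte widths only and says nothing about how MySQL reacts to oversized characters):

```python
def fits_three_bytes(text):
    """True if every character needs at most three bytes in UTF-8."""
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)

print(fits_three_bytes("£ Å €"))        # all within three bytes
print(fits_three_bytes("\U0001F4A9"))   # U+1F4A9 needs four bytes
```

Characters such as € (three bytes) fit either encoding, while emoji like U+1F4A9 need the fourth byte that only utf8mb4 provides.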
  14. Multibyte in
    者 為 今 令 免 入 全 具 刃 化 外 情 才 抵 次 海 面 直 真 神 空 草 角 道 雇 骨
  15. The takeaway
    If you’re storing or transmitting text, you need to know what encoding it uses, otherwise you cannot reliably display it.