Slide 2

Slide 2 text

#WCNO John Blackbourn • WordPress core developer • Senior engineer at Human Made • Find me on Twitter, GitHub, WordPress.org, etc: @johnbillion

Slide 3

Slide 3 text

#WCNO £25.00 That’s nice. } mojibake

Slide 4

Slide 4 text

#WCNO Why do I have strange characters in my content?

Slide 5

Slide 5 text

#WCNO Binary 010101011010100100000100101100100101

Slide 6

Slide 6 text

#WCNO Binary 010101011010100100000100101100100101 } byte 256 values (2^8)

Slide 7

Slide 7 text

#WCNO Binary 010101011010100100000100101100100101 } A Code point 65
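The mapping from bits to a byte value to a character can be checked directly. The following is a minimal Python sketch, not part of the original slides; the variable name `byte_value` is only illustrative.

```python
# One byte is 8 bits, so it can hold 2**8 = 256 distinct values.
print(2 ** 8)                    # 256

# The bit pattern 01000001 is the number 65 ...
byte_value = 0b01000001
print(byte_value)                # 65

# ... and code point 65 is the letter "A".
print(chr(byte_value))           # A
print(format(ord("A"), "08b"))   # 01000001
```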

Slide 8

Slide 8 text

#WCNO ASCII: American Standard Code for Information Interchange

  0 NUL    1 SOH    2 STX    3 ETX    4 EOT    5 ENQ    6 ACK    7 BEL
  8 BS     9 TAB   10 LF    11 VT    12 FF    13 CR    14 SO    15 SI
 16 DLE   17 DC1   18 DC2   19 DC3   20 DC4   21 NAK   22 SYN   23 ETB
 24 CAN   25 EM    26 SUB   27 ESC   28 FS    29 GS    30 RS    31 US
 32 SP    33 !     34 "     35 #     36 $     37 %     38 &     39 '
 40 (     41 )     42 *     43 +     44 ,     45 -     46 .     47 /
 48 0     49 1     50 2     51 3     52 4     53 5     54 6     55 7
 56 8     57 9     58 :     59 ;     60 <     61 =     62 >     63 ?
 64 @     65 A     66 B     67 C     68 D     69 E     70 F     71 G
 72 H     73 I     74 J     75 K     76 L     77 M     78 N     79 O
 80 P     81 Q     82 R     83 S     84 T     85 U     86 V     87 W
 88 X     89 Y     90 Z     91 [     92 \     93 ]     94 ^     95 _
 96 `     97 a     98 b     99 c    100 d    101 e    102 f    103 g
104 h    105 i    106 j    107 k    108 l    109 m    110 n    111 o
112 p    113 q    114 r    115 s    116 t    117 u    118 v    119 w
120 x    121 y    122 z    123 {    124 |    125 }    126 ~    127 DEL

Slide 9

Slide 9 text

#WCNO ASCII Doesn’t include characters for… • Ahom • Arabic • Imperial Aramaic • Armenian • Avestan • Balinese • Bamum • Batak • Bengali • Bopomofo • Brahmi

Slide 10

Slide 10 text

#WCNO

Slide 11

Slide 11 text

#WCNO :(

Slide 12

Slide 12 text

#WCNO Unicode: 120,000 characters and counting

   163  £   U+00A3
   197  Å   U+00C5
   198  Æ   U+00C6
   216  Ø   U+00D8
  8364  €   U+20AC
128169  💩  U+1F4A9
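The decimal numbers and the U+ notation on this slide are two spellings of the same code point. A small Python sketch, assuming only the standard library; the helper name `describe` is hypothetical, and the official character name is printed instead of the glyph so the output stays plain ASCII.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return 'decimal U+HEX NAME' for a single character."""
    code_point = ord(ch)
    return f"{code_point} U+{code_point:04X} {unicodedata.name(ch)}"

for ch in ["\u00a3", "\u00c5", "\u00c6", "\u00d8", "\u20ac", "\U0001F4A9"]:
    print(describe(ch))
# First line printed: 163 U+00A3 POUND SIGN
# Last line printed:  128169 U+1F4A9 PILE OF POO
```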

Slide 13

Slide 13 text

#WCNO 010101011010100100000100101100100101 } 256 != 120,000

Slide 14

Slide 14 text

#WCNO 💩 (U+1F4A9) } UTF-8 11110000 10011111 10010010 10101001
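The four bytes shown here are the UTF-8 encoding of code point 128169 (U+1F4A9). A short Python sketch that reproduces them; the helper name `utf8_bits` is made up for this illustration.

```python
def utf8_bits(ch: str) -> str:
    """Encode a character as UTF-8 and show each byte as 8 bits."""
    return " ".join(f"{byte:08b}" for byte in ch.encode("utf-8"))

print(utf8_bits("\U0001F4A9"))  # 11110000 10011111 10010010 10101001
print(utf8_bits("\u00a3"))      # 11000010 10100011  (pound sign, two bytes)
print(utf8_bits("A"))           # 01000001           (same single byte as in ASCII)
```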

Slide 15

Slide 15 text

#WCNO 010101011010100100000100101100100101 } A UTF-8 Code point 65

Slide 16

Slide 16 text

#WCNO UTF-8 Problem solved.

Slide 17

Slide 17 text

#WCNO UTF-8 ASCII Windows-1252 Latin-1 and many more…

Slide 18

Slide 18 text

#WCNO

Slide 19

Slide 19 text

#WCNO 010101011010100111111100101100100101
UTF-8 uses the high bit to signify the leading or continuation byte of a sequence of multiple bytes.
Windows-1252 uses the high bit to fit in 128 more characters.
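The different uses of the high bit can be sketched in Python. This is a simplified illustration that assumes well-formed UTF-8 input (it does not reject byte values that can never appear in valid UTF-8); the function name `utf8_byte_role` is hypothetical.

```python
def utf8_byte_role(byte: int) -> str:
    """Classify one byte of well-formed UTF-8 by its high bits."""
    if byte < 0x80:
        return "single"        # 0xxxxxxx: high bit clear, same as ASCII
    if byte < 0xC0:
        return "continuation"  # 10xxxxxx: continues a multi-byte sequence
    return "leading"           # 11xxxxxx: starts a multi-byte sequence

pound_utf8 = "\u00a3".encode("utf-8")            # bytes C2 A3
print([utf8_byte_role(b) for b in pound_utf8])   # ['leading', 'continuation']

# Windows-1252 instead treats every byte above 127 as one more character.
pound_cp1252 = "\u00a3".encode("cp1252")         # the single byte A3
print(len(pound_utf8), len(pound_cp1252))        # 2 1
```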

Slide 20

Slide 20 text

#WCNO Here’s the kicker A two-byte character encoded with UTF-8 will be seen as two separate characters if it’s read using Windows-1252.

Slide 21

Slide 21 text

#WCNO
Character  Code point  UTF-8     Windows-1252  Mojibake
A          65          41        41            A
£          163         C2 A3     A3            Â£
Å          197         C3 85     C5            Ã…
Æ          198         C3 86     C6            Ã†
Ø          216         C3 98     D8            Ã˜
€          8364        E2 82 AC  80            â‚¬
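The mojibake column can be reproduced by encoding each character as UTF-8 and then decoding those bytes as Windows-1252. A minimal Python sketch, assuming Python's `cp1252` codec as the Windows-1252 implementation; the function name `mojibake` is illustrative. Note that this codec raises an error for the few byte values Windows-1252 leaves undefined, which none of these examples hit.

```python
def mojibake(text: str) -> str:
    """Encode as UTF-8, then misread the bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")

table = {ch: mojibake(ch) for ch in "A£ÅÆØ€"}
# table == {"A": "A", "£": "Â£", "Å": "Ã…", "Æ": "Ã†", "Ø": "Ã˜", "€": "â‚¬"}

# ASCII characters survive because both encodings agree on bytes 0 to 127.
print(table["A"] == "A")           # True
# Each multi-byte character turns into one wrong character per byte.
print(len(table["£"]), len(table["€"]))  # 2 3
```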

Slide 22

Slide 22 text

#WCNO Here’s the takeaway If you’re storing or transmitting text, you need to know what encoding it uses, otherwise you cannot reliably display it.

Slide 23

Slide 23 text

#WCNO How does mojibake happen?
• Migrating data between databases
  The destination database’s encoding doesn’t match the source
• Reading strings using the wrong encoding
  Reading a Windows-1252 encoded Word file as UTF-8
  Reading an XML feed that uses a different encoding
• Opening files in an editor using the wrong encoding
  Most editors can switch encoding, but often can’t fix it

Slide 24

Slide 24 text

#WCNO How can mojibake be fixed?
• Migrating data between databases
  Re-import using the correct encoding (character set)
• Reading strings using the wrong encoding
  iconv() in PHP if you know the source encoding
• Opening files in an editor using the wrong encoding
  Re-open the file using the correct encoding, then convert
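Both kinds of fix can be sketched in Python rather than PHP. The first helper decodes raw bytes with a known source encoding, which is roughly what converting with iconv() achieves. The second reverses text that was already misread as Windows-1252, and it only works if the damage happened exactly once and no bytes were lost along the way. Both function names are hypothetical.

```python
def decode_known(raw: bytes, source_encoding: str) -> str:
    """Decode bytes when the source encoding is known."""
    return raw.decode(source_encoding)

def undo_mojibake(garbled: str) -> str:
    """Reverse UTF-8 text that was misread as Windows-1252 (one round only)."""
    return garbled.encode("cp1252").decode("utf-8")

# A Windows-1252 encoded price: the single byte A3 is the pound sign.
price = decode_known(b"\xa325.00", "cp1252")   # "£25.00"

# Text that was already garbled can sometimes be repaired afterwards.
fixed_price = undo_mojibake("Â£25.00")         # "£25.00"
fixed_quote = undo_mojibake("Thatâ€™s nice.")  # "That’s nice."
print(price == fixed_price)                    # True
```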

Slide 25

Slide 25 text

#WCNO Multibyte in PHP
• String functions
  substr(), strlen()
  Only support single-byte characters
  Using them will split multibyte characters
• Multibyte String Functions
  mb_strlen() mb_strtolower() mb_substr() and more…
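The difference between counting bytes and counting characters can be illustrated in Python, where `bytes` behaves like a byte-oriented string function sees the data and `str` behaves like a multibyte-aware one. This is an analogy only, not PHP code, and the sample word is an arbitrary choice.

```python
text = "blåbær"               # 6 characters, 2 of them outside ASCII
data = text.encode("utf-8")   # the bytes a byte-oriented function works on

print(len(text))   # 6  (characters, the multibyte-aware count)
print(len(data))   # 8  (bytes, the byte-oriented count)

# Slicing by characters keeps "å" intact.
first_three_chars = text[:3]   # "blå"

# Slicing by bytes cuts "å" (bytes C3 A5) in half.
first_three_bytes = data[:3]   # b"bl\xc3"
broken = first_three_bytes.decode("utf-8", errors="replace")
print(broken.endswith("\ufffd"))   # True: the half character became U+FFFD
```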

Slide 26

Slide 26 text

#WCNO ☺ ✊ ✌ ✋ ☝ £ ¤ ¦ § © ª « ¬ - ® ¯ ° ± ² ³ ´ µ ¶ Multibyte in

Slide 27

Slide 27 text

#WCNO Multibyte in
• utf8
  MySQL database character encoding that supports up to three bytes per character.
• utf8mb4
  MySQL database character encoding that supports up to four bytes per character.
  Enables support for all four-byte characters in UTF-8.
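Whether a piece of text fits in the three-byte encoding can be checked by looking for characters whose UTF-8 form is four bytes long. A minimal Python sketch; the helper name `needs_utf8mb4` is made up for this example and is not a MySQL or WordPress function.

```python
def needs_utf8mb4(text: str) -> bool:
    """True if any character takes four bytes in UTF-8."""
    return any(len(ch.encode("utf-8")) == 4 for ch in text)

print(needs_utf8mb4("\u00a325.00"))       # False: the pound sign is two bytes
print(needs_utf8mb4("\u20ac"))            # False: the euro sign is three bytes
print(needs_utf8mb4("Hi \U0001F4A9"))     # True: U+1F4A9 is four bytes
```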

Slide 28

Slide 28 text

#WCNO 者 為 今 令 免 入 全 具 刃 化 外 情 才 抵 次 海 面 直 真 神 空 草 角 道 雇 骨 Multibyte in

Slide 29

Slide 29 text

#WCNO The takeaway If you’re storing or transmitting text, you need to know what encoding it uses, otherwise you cannot reliably display it.

Slide 30

Slide 30 text

#WCNO Resources codepoints.net

Slide 31

Slide 31 text

#WCNO

Slide 32

Slide 32 text

#WCNO Resources codepoints.net Joel Spolsky on character encoding Unicode’s Adopt a Character

Slide 33

Slide 33 text

#WCNO John Blackbourn Find me on Twitter, GitHub, WordPress.org, etc: @johnbillion Questions?