An Introduction to Character Encoding - WCNO - WCNL

My talk from WordCamp Norway 2016 and WordCamp Nederland 2016.

As a developer, understanding character encoding adds a lot of clarity to your work, especially when you’re dealing with text that contains characters beyond A-Z. If you’ve ever migrated a database from one site to another and ended up with jumbled characters in your content, this talk is for you. I’ll also explain why emoji in WordPress is the PR face of something much deeper and more important.

Video: https://www.youtube.com/watch?v=tdhaHt9-6Kw

#wordpress #wordcamp #wcno #wcnl #php

John Blackbourn

October 15, 2016

Transcript

  1. None
  2. #WCNO John Blackbourn
    • WordPress core developer
    • Senior engineer at Human Made
    • Find me on Twitter, GitHub, WordPress.org, etc: @johnbillion
  3. #WCNO Â£25.00 Thatâ€™s nice. } mojibake

  4. #WCNO Why do I have strange characters in my content?

  5. #WCNO Binary 010101011010100100000100101100100101

  6. #WCNO Binary 010101011010100100000100101100100101 } byte 256 values (2^8)

  7. #WCNO Binary 010101011010100100000100101100100101 } A Code point 65

  8. #WCNO ASCII: American Standard Code for Information Interchange

       0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
     NUL SOH STX ETX EOT ENQ ACK BEL  BS TAB  LF  VT  FF  CR  SO  SI

      16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31
     DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN  EM SUB ESC  FS  GS  RS  US

      32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47
      SP   !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /

      48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63
       0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?

      64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79
       @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O

      80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95
       P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _

      96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111
       `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o

     112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
       p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~ DEL
  9. #WCNO ASCII doesn’t include characters for…
    • Ahom • Arabic • Imperial Aramaic • Armenian • Avestan • Balinese • Bamum • Batak • Bengali • Bopomofo • Brahmi
  10. #WCNO

  11. #WCNO :(

  12. #WCNO Unicode: 120,000 characters and counting

       163  £   U+00A3
       197  Å   U+00C5
       198  Æ   U+00C6
       216  Ø   U+00D8
      8364  €   U+20AC
    128169  💩  U+1F4A9
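
The decimal numbers and the U+ labels on this slide are two spellings of the same code point. A minimal Python sketch of that relationship follows; the helper name `u_notation` is made up for illustration and is not from the talk.

```python
def u_notation(char: str) -> str:
    """Return the U+XXXX label for a single character (hypothetical helper)."""
    return f"U+{ord(char):04X}"


for char in ["A", "£", "Å", "Æ", "Ø", "€", chr(128169)]:
    # ord() gives the decimal code point; u_notation() gives its hex label.
    print(ord(char), u_notation(char))
```

The last entry is built with `chr(128169)` so the example does not depend on the editor or font being able to show that character.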
  13. #WCNO 010101011010100100000100101100100101 } 256 != 120,000

  14. #WCNO } UTF-8 11110000 10011111 10010010 10101001
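
The four bytes on this slide are the UTF-8 encoding of code point U+1F4A9 (128169). As a rough check, assuming Python's built-in `utf-8` codec, they can be reproduced like this; `utf8_bits` is a hypothetical helper name.

```python
def utf8_bits(char: str) -> str:
    """Return the UTF-8 bytes of a character as space-separated bit strings."""
    return " ".join(f"{byte:08b}" for byte in char.encode("utf-8"))


# "A" (code point 65) fits in one byte, identical to its ASCII value.
print(utf8_bits("A"))           # 01000001

# U+1F4A9 needs a four-byte sequence.
print(utf8_bits(chr(0x1F4A9)))  # 11110000 10011111 10010010 10101001
```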

  15. #WCNO 010101011010100100000100101100100101 } A UTF-8 Code point 65

  16. #WCNO UTF-8 Problem solved.

  17. #WCNO UTF-8 ASCII Windows-1252 Latin-1 and many more…

  18. #WCNO

  19. #WCNO 010101011010100111111100101100100101
    • UTF-8 uses the high bit to signify the leading / continuation byte of a sequence of multiple bytes.
    • Windows-1252 uses the high bit to fit in 128 more characters.
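
To make the two uses of the high bit concrete, here is a simplified Python sketch. The function name `classify_utf8_byte` is an assumption for illustration, and it deliberately ignores the byte values that are never valid in UTF-8.

```python
def classify_utf8_byte(byte: int) -> str:
    """Classify one byte of UTF-8 data by its high bits (simplified)."""
    if (byte & 0b10000000) == 0:
        return "single"        # 0xxxxxxx: a complete ASCII-range character
    if (byte & 0b11000000) == 0b10000000:
        return "continuation"  # 10xxxxxx: a later byte of a sequence
    return "leading"           # 11xxxxxx: first byte of a multi-byte sequence


# "A" is one byte; "£" is a leading byte followed by a continuation byte.
for byte in "A£".encode("utf-8"):
    print(f"{byte:08b}", classify_utf8_byte(byte))

# In Windows-1252 the same high-bit byte 10100011 is a whole character: "£".
pound = bytes([0b10100011]).decode("cp1252")
```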
  20. #WCNO Here’s the kicker: A two-byte character encoded with UTF-8 will be seen as two separate characters if it’s read using Windows-1252.
  21. #WCNO

    Character  Code point  UTF-8     Windows-1252  Mojibake
    A                  65  41        41            A
    £                 163  C2 A3     A3            Â£
    Å                 197  C3 85     C5            Ã…
    Æ                 198  C3 86     C6            Ã†
    Ø                 216  C3 98     D8            Ã˜
    €                8364  E2 82 AC  80            â‚¬
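
The table on this slide can be reproduced by encoding each character as UTF-8 and then decoding those bytes as Windows-1252. This is only a sketch using Python's `utf-8` and `cp1252` codecs; the helper names are hypothetical.

```python
def misread_as_cp1252(text: str) -> str:
    """Encode text as UTF-8, then decode those bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def hex_bytes(char: str, encoding: str) -> str:
    """Return the encoded bytes as upper-case hex, for example 'C2 A3'."""
    return char.encode(encoding).hex(" ").upper()


# One row per character: code point, UTF-8 bytes, Windows-1252 bytes, mojibake.
rows = [
    (char, ord(char), hex_bytes(char, "utf-8"), hex_bytes(char, "cp1252"),
     misread_as_cp1252(char))
    for char in ["A", "£", "Å", "Æ", "Ø", "€"]
]

for row in rows:
    print(*row)
```

Note that Python's `cp1252` codec raises an error for the few byte values Windows-1252 leaves undefined, so this approach would not work for every possible character.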
  22. #WCNO Here’s the takeaway: If you’re storing or transmitting text, you need to know what encoding it uses, otherwise you cannot reliably display it.
  23. #WCNO How does mojibake happen?
    • Migrating data between databases
      Destination database’s encoding doesn’t match source
    • Reading strings using wrong encoding
      Reading a Windows-1252 encoded Word file as UTF-8
      Reading an XML feed that uses a different encoding
    • Opening files in editor using wrong encoding
      Most editors can switch encoding but often can’t fix it
  24. #WCNO How can mojibake be fixed?
    • Migrating data between databases
      Re-import using the correct encoding (collation)
    • Reading strings using wrong encoding
      iconv() in PHP if you know the source encoding
    • Opening files in editor using wrong encoding
      Re-open file using correct encoding, then convert
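
The slide suggests iconv() in PHP; the same two ideas are sketched below in Python as a stand-in. The function names are hypothetical, and the repair step assumes the damage came from UTF-8 bytes being decoded as Windows-1252 with no bytes lost along the way.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known source encoding as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def repair_mojibake(garbled: str) -> str:
    """Undo UTF-8 text that was wrongly decoded as Windows-1252."""
    return garbled.encode("cp1252").decode("utf-8")


# Windows-1252 bytes for "£25.00", converted to UTF-8 bytes.
print(to_utf8(b"\xa325.00", "cp1252"))  # b'\xc2\xa325.00'

# Reversing the misreading recovers the original text.
fixed = repair_mojibake("Â£25.00 Thatâ€™s nice.")
print(fixed == "£25.00 That’s nice.")   # True
```

If the source encoding is a guess rather than a known fact, either step can fail with an error or produce different garbage, which is the point of the takeaway slide.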
  25. #WCNO Multibyte in PHP
    • String functions
      substr(), strlen()
      Only support single byte characters
      Using them will split multibyte characters
    • Multibyte String Functions
      mb_strlen() mb_strtolower() mb_substr() and more…
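
The difference between counting bytes and counting characters is sketched here in Python rather than PHP: Python's `bytes` type behaves like the byte-oriented strlen() and substr(), while `str` behaves like the character-aware mb_ functions. The sample string is an arbitrary choice.

```python
text = "£25.00"
data = text.encode("utf-8")

print(len(text))  # 6 characters
print(len(data))  # 7 bytes, because "£" takes two bytes in UTF-8

# Cutting after one byte splits "£" in half; cutting after one character does not.
broken = data[:1].decode("utf-8", errors="replace")
whole = text[:1]
print(broken == "\ufffd", whole == "£")  # True True
```

The replacement character U+FFFD in `broken` is what a decoder substitutes when it meets a leading byte with no continuation byte after it.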
  26. #WCNO Multibyte in
    ☺ ✊ ✌ ✋ ☝ £ ¤ ¦ § © ª « ¬ - ® ¯ ° ± ² ³ ´ µ ¶
  27. #WCNO Multibyte in
    • utf8
      MySQL database character encoding that supports up to three bytes per character.
    • utf8mb4
      MySQL database character encoding that supports up to four bytes per character. Enables support for all four-byte characters in UTF-8.
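
One way to see which text is affected by the three-byte limit is to look for characters whose UTF-8 encoding is four bytes long. The check below is a Python sketch with a made-up function name; it does not talk to MySQL.

```python
def needs_utf8mb4(text: str) -> bool:
    """Return True if any character takes four bytes when encoded as UTF-8."""
    return any(len(char.encode("utf-8")) == 4 for char in text)


print(needs_utf8mb4("£25.00 €"))    # False: every character fits in three bytes
print(needs_utf8mb4(chr(0x1F4A9)))  # True: U+1F4A9 needs four bytes
```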
  28. #WCNO Multibyte in
    者 為 今 令 免 入 全 具 刃 化 外 情 才 抵 次 海 面 直 真 神 空 草 角 道 雇 骨
  29. #WCNO The takeaway: If you’re storing or transmitting text, you need to know what encoding it uses, otherwise you cannot reliably display it.
  30. #WCNO Resources codepoints.net

  31. #WCNO

  32. #WCNO Resources
    • codepoints.net
    • Joel Spolsky on character encoding
    • Unicode’s Adopt a Character
  33. #WCNO John Blackbourn
    Find me on Twitter, GitHub, WordPress.org, etc: @johnbillion
    Questions?