Upgrade to Pro — share decks privately, control downloads, hide ads and more …

What the Ã�©Ð”ã° is going on!?

What the Ã�©Ð”ã° is going on!?

Finding Your Way Out of Charset Hell — Say goodbye to cargo cult solutions. Finally get a grasp on character set encoding, learn how it works for PHP apps and MySQL, and become confident in fixing encoding issues once and for all.

Kevin Smith

October 26, 2018
Tweet

More Decks by Kevin Smith

Other Decks in Programming

Transcript

  1. kevinsmith.io 01000101 01000101 01000011 01101111 01101110 01100110 00100000 01101001 01110011

    00100000 01110000 01101111 01110111 01100101 01110010 01100101 01100100 00100000 01100010 01111001 00100000 01110100 01101000 01100101 00100000 01100011 01101111 01101101 01101101 01110101 01101110 01101001 01110100 01111001 00100001 9
  2. kevinsmith.io Character Encoding:
 a set of rules that map a

    sequence of 1s and 0s to characters
 
 Character Set = Charset = Encoding 10
  3. kevinsmith.io • American Standard Code for Information Interchange • Developed

    in the US in the 1960s • Single-byte Character Set • 8 bits = byte • 128 characters defined ASCII 11
  4. kevinsmith.io ASCII Table 12 Character Binary a 1100001 b 1100010

    c 1100011 d 1100100 e 1100101 f 1100110 Character Binary A 1000001 B 1000010 C 1000011 D 1000100 E 1000101 F 1000110
  5. kevinsmith.io Doesn’t include… à, è, ì, ò, ù á, é,

    í, ó, ú, ý â, ê, î, ô, û ñ, õ, ã ä, ë, ï, ö, ü, ÿ 14
  6. kevinsmith.io • Extend encoding using rest of the 256 possible

    characters • Latin1 (ISO-8859-1) • Western European (like Spanish, Irish, Norwegian, Swedish, etc.) • Latin2 (ISO-8859-2) • Eastern European (like Bosnian, Croatian, Czech, Polish, etc.) • Latin5 (ISO-8859-5) • Cyrillic languages (like Russian, Bulgarian, Serbian, etc.) Extended ASCII 15
  7. kevinsmith.io Multi-byte Encodings • 2 bytes (16 bits) = 65,536

    distinct values • 3 bytes (24 bits) = 16,777,216 distinct values • 4 bytes (32 bits) = 4,294,967,296 distinct values • Variable-length encoding • BIG-5: mostly traditional Chinese characters • GB18030: traditional Chinese and simplified Chinese • Shift JIS: Japanese characters 17
  8. kevinsmith.io • Development began in the late 1980’s • Mapping

    of characters to code points • Up to 1,114,112 code points • 137,374 code points currently used (less than 13%) • 137,468 code points reserved for private use Unicode to Rule Them All! 18
  9. kevinsmith.io • Multiple ways of encoding code points into bits

    • UTF-32 uses 32 bits (4 bytes) per code point • Variable width character encodings… • UTF-16 uses 16 or 32 bits • Used internally by Windows, Java, JavaScript • UTF-8 uses 8-32 bits (1-4 bytes) per code point • Efficient storage & backwards compatible with ASCII aka UTF Unicode Transformation Format 20
  10. kevinsmith.io What Every Programmer Absolutely, Positively Needs To Know About

    Encodings And Character Sets To Work With Text http://kunststube.net/encoding/ 21
  11. kevinsmith.io • UTF-8 has been dominant and highly recommended
 since

    mid 2000’s • App or Web Server: Content-Type: text/html; charset=UTF-8 • HTML: <meta charset="UTF-8"> Web Browsers 24
  12. kevinsmith.io • PHP files must be encoded ASCII-compatible • Encoding

    matters when operating on content • Otherwise pretty agnostic about encoding PHP 25
  13. kevinsmith.io • latin1 default charset until v8 • utf8mb4 new

    default charset as of v8 • The connection charset and column charset
 are all that matters at runtime MySQL 26
  14. kevinsmith.io MySQL Charsets at Runtime Connection Character Set “The content

    being sent to me
 is encoded with this character encoding.” Column Character Set “The content stored in this column’s fields
 is encoded with this character encoding.” 27
  15. kevinsmith.io The Situation UTF-8 has been dominant encoding for
 the

    web since mid 2000’s PHP is mostly agnostic about encoding MySQL’s default encoding was latin1 until v8 No MySQL connection charset specified in PHP?
 Default MySQL connection charset will be used. 29
  16. kevinsmith.io The Plot Thickens Text content on the web and

    in PHP
 was UTF-8 encoded. PHP sent that content to MySQL
 and said it was latin1 encoded. MySQL thought it was latin1 encoded
 and stored it as if it were. 30
  17. kevinsmith.io The Hidden Problem You were probably storing UTF-8 encoded

    content
 in database fields that thought it was
 latin1 encoded content. 31
  18. kevinsmith.io MySQL will search and sort based
 on the charset

    it thinks the
 content is encoded with. 33
  19. kevinsmith.io We Fixed* It! So you upgraded your sites to

    CMS versions that started declaring connection charset to be UTF-8… And content since then has been correctly encoded… Though the UTF-8 to latin1 conversion is lossy… …and the previously encoded non-ASCII content
 comes back as Mojibake. 39
  20. kevinsmith.io • Switching DB columns or connection to utf8 or

    utf8mb4 • Slap utf8_encode and utf8_decode on there • A UnicodeFixer() function peppered throughout • Wild guesses using a variety of iconv() conversions Cargo Cult Troubleshooting 41
  21. kevinsmith.io There is No Magic Bullet • Nearly impossible for

    computer to autodetect proper encoding • mb_detect_encoding() in PHP is worthless • Ultimately human readers are needed to confirm • Investigate to figure out what happened 43
  22. kevinsmith.io A Targeted Correction 2 Step Process:
 ALTER TABLE FOO

    MODIFY COLUMN
 V VARCHAR(100) CHARSET binary; ALTER TABLE FOO MODIFY COLUMN
 V VARCHAR(100) CHARSET utf8mb4; 45
  23. kevinsmith.io Export the DB with same connection as PHP
 


    mysqldump -u User -h dbhost.com -p dbName \
 --hex-blob --default-character-set=latin1 \
 > export.sql Correcting the Whole Database 46
  24. kevinsmith.io Edit the export to change charset declaration
 
 sed

    \
 -e 's/SET NAMES latin1/SET NAMES utf8mb4/g' \
 -e 's/CHARSET=utf8/CHARSET=utf8mb4 \
 COLLATE=utf8mb4_unicode_520_ci/g' \
 -e 's/CHARSET=latin1/CHARSET=utf8mb4 \
 COLLATE=utf8mb4_unicode_520_ci/g' \
 -i export.sql Correcting the Whole Database 47
  25. kevinsmith.io Import into tmp DB with utf8mb4 connection
 
 mysql

    -u User -h dbhost.com -p dbName_tmp \
 < export.sql Correcting the Whole Database 48
  26. kevinsmith.io Then test the output
 
 Look for places where

    an encoding error may have cut off the rest of the string! Correcting the Whole Database 49