Finding Your Way Out of Charset Hell — Say goodbye to cargo cult solutions. Finally get a grasp on character set encoding, learn how it works for PHP apps and MySQL, and become confident in fixing encoding issues once and for all.
of characters to code points
• Up to 1,114,112 code points
• 137,374 code points currently used (less than 13%)
• 137,468 code points reserved for private use
Unicode to Rule Them All!
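The figures on this slide can be checked with a little arithmetic. The Unicode code space runs from U+0000 to U+10FFFF, which is 17 planes of 65,536 code points each. A minimal Python sketch follows (Python is used only for illustration; the assigned count is the slide's own number):

```python
import sys

PLANES = 17                 # planes 0 through 16
PLANE_SIZE = 0x10000        # 65,536 code points per plane

total = PLANES * PLANE_SIZE
assert total == sys.maxunicode + 1 == 0x10FFFF + 1

assigned = 137_374          # figure quoted on the slide
share = assigned / total * 100

# Private use: U+E000..U+F8FF in plane 0, plus planes 15 and 16,
# each of which leaves out its last two code points.
private_use = (0xF8FF - 0xE000 + 1) + 2 * 65_534

print(f"{total:,} code points in total")
print(f"{share:.1f}% assigned")
print(f"{private_use:,} reserved for private use")

# A code point is just a number; ord() and chr() map between the two.
print(hex(ord("é")))        # 0xe9
```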
being sent to me is encoded with this character encoding.”
Column Character Set
“The content stored in this column’s fields is encoded with this character encoding.”
web since the mid-2000s
PHP is mostly agnostic about encoding
MySQL’s default encoding was latin1 until v8
No MySQL connection charset specified in PHP? The default MySQL connection charset will be used.
in PHP was UTF-8 encoded.
PHP sent that content to MySQL and said it was latin1 encoded.
MySQL thought it was latin1 encoded and stored it as if it were.
CMS versions that started declaring the connection charset to be UTF-8…
And content since then has been correctly encoded…
Though the UTF-8 to latin1 conversion is lossy…
…and the previously encoded non-ASCII content comes back as Mojibake.
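The sequence described on these slides can be reproduced at the byte level. The sketch below is Python rather than PHP, and it models MySQL only loosely: Python's latin-1 codec stands in for MySQL's latin1 (which is closer to Windows-1252), and all variable names are illustrative.

```python
text = "café"

# 1. PHP holds the text as UTF-8 bytes.
php_bytes = text.encode("utf-8")              # b'caf\xc3\xa9'

# 2. The connection is labelled latin1 and so is the column, so MySQL
#    stores the bytes unchanged and treats each byte as one character.
stored = php_bytes
as_mysql_sees_it = stored.decode("latin-1")   # 5 "characters", not 4

# 3. A later CMS version declares the connection to be UTF-8. MySQL now
#    converts the column's "latin1" characters to UTF-8 on the way out.
sent_to_php = as_mysql_sees_it.encode("utf-8")

print(len(text), len(as_mysql_sees_it))
print(sent_to_php.decode("utf-8"))            # cafÃ©
```

The two bytes of "é" were each treated as a separate latin1 character and each re-encoded as UTF-8, which is where the familiar "Ã©" pattern comes from.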
utf8mb4
• Slap utf8_encode and utf8_decode on there
• A UnicodeFixer() function peppered throughout
• Wild guesses using a variety of iconv() conversions
Cargo Cult Troubleshooting
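A small model shows why the blanket utf8_encode/utf8_decode approach tends to backfire. PHP's utf8_encode() converts ISO-8859-1 to UTF-8, and utf8_decode() goes the other way, replacing anything ISO-8859-1 cannot represent with "?". The two Python helpers below are hypothetical stand-ins for those functions, not PHP itself.

```python
def utf8_encode_like(data: bytes) -> bytes:
    """Treat every byte as an ISO-8859-1 character and re-encode as UTF-8."""
    return data.decode("latin-1").encode("utf-8")


def utf8_decode_like(data: bytes) -> bytes:
    """Convert UTF-8 to ISO-8859-1, substituting '?' where no mapping exists."""
    return data.decode("utf-8").encode("latin-1", errors="replace")


good = "naïve".encode("utf-8")      # already valid UTF-8

# Applying the "fix" to data that was never broken corrupts it,
# and each extra application makes it worse.
once = utf8_encode_like(good)
twice = utf8_encode_like(once)
print(len(good), len(once), len(twice))
print(once.decode("utf-8"))

# Going the other way silently drops characters outside ISO-8859-1.
lossy = utf8_decode_like("€5 日本".encode("utf-8"))
print(lossy)
```

Neither helper can tell whether its input is already in the target encoding, so whether it repairs or damages the data depends entirely on knowing what went wrong first.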
computer to autodetect proper encoding
• mb_detect_encoding() in PHP is worthless
• Ultimately human readers are needed to confirm
• Investigate to figure out what happened
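One reason automatic detection cannot be trusted: the same bytes are often valid in several encodings at once, and nothing in the bytes says which reading was intended. A short Python illustration follows (the candidate list is an arbitrary choice for the demo):

```python
data = "é".encode("utf-8")       # the two bytes C3 A9

candidates = ["utf-8", "latin-1", "cp1252", "cp1251", "shift_jis"]

readings = {}
for name in candidates:
    try:
        readings[name] = data.decode(name)
    except UnicodeDecodeError:
        readings[name] = None    # these bytes are not valid in this encoding

for name, result in readings.items():
    print(f"{name:10} {result!r}")
```

Every candidate decodes these two bytes without error, each to different text in a different script, so only a reader who knows what the content should say can pick the right one.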