Character Building - Fun with charsets and encodings

Christoph Lühr

July 02, 2013

Transcript

  1. Christoph Lühr
    @chluehr / @bephpug 2013
    "Fun with charsets and encodings"
    Character Building
    ٩(͡๏̯͡๏)۶

  2. basilicom

  5. Image source: http://www.flickr.com/photos/stinajonsson/3932774410 CC BY-NC 2.0

  6. Charset
    vs.
    Encoding

  7. Set of Characters
    [ A B C ... 1 2 3 ... @#$ ]
    UNICODE / CODE PAGES

  8. U+2278
    NEITHER LESS-THAN
    NOR GREATER-THAN

  9. U+2620
    SKULL AND
    CROSSBONES

  10. U+FDFA
    ARABIC LIGATURE
    SALLALLAHOU ALAYHE
    WASALLAM
    "May Allah bless him and grant him peace"

  11. Encoding = Mapping
    A = 0x41 (decimal 65)
    UTF-8, ISO-8859-1
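
To make the "encoding = mapping" idea concrete, here is a minimal Python sketch that encodes the same characters with both encodings named on the slide and prints the resulting bytes. The helper name `encoded_hex` is made up for this illustration.

```python
def encoded_hex(text: str, encoding: str) -> str:
    """Return the bytes of `text` in `encoding` as space-separated hex."""
    return text.encode(encoding).hex(" ")


for encoding in ("utf-8", "iso-8859-1"):
    # "\u00e4" is ä (LATIN SMALL LETTER A WITH DIAERESIS).
    print(encoding, encoded_hex("A", encoding), encoded_hex("\u00e4", encoding))
```

"A" maps to the single byte 41 in both encodings, while "ä" maps to c3 a4 in UTF-8 and to e4 in ISO-8859-1: same character set entry, different byte mapping.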

  12. Single Byte
    vs.
    Multi Byte

  13. UTF-16
    variable length!
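
A small Python sketch of why UTF-16 counts as variable length: characters outside the Basic Multilingual Plane need two 16-bit code units (a surrogate pair). The helper name `utf16_units` is hypothetical.

```python
def utf16_units(ch: str) -> int:
    """Number of 16-bit code units needed to store `ch` in UTF-16."""
    # Encode without a BOM so only the character itself is counted.
    return len(ch.encode("utf-16-be")) // 2


print(utf16_units("A"))           # plain ASCII letter
print(utf16_units("\u2620"))      # SKULL AND CROSSBONES, inside the BMP
print(utf16_units("\U0001F600"))  # GRINNING FACE, outside the BMP
```

The first two characters take one code unit each, the last one takes two.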

  14. UTF-16/32
    Big- vs. Little-Endian
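
The byte-order issue can be sketched in Python by encoding one character both ways and then deliberately decoding with the wrong endianness. The variable names are illustrative only.

```python
skull = "\u2620"  # SKULL AND CROSSBONES

big_endian = skull.encode("utf-16-be")     # most significant byte first
little_endian = skull.encode("utf-16-le")  # least significant byte first

print(big_endian.hex(" "))
print(little_endian.hex(" "))

# Reading big-endian bytes as little-endian silently yields another character.
misread = big_endian.decode("utf-16-le")
print(f"U+{ord(misread):04X}")
```

The same code point is stored as 26 20 or 20 26, and the misread variant decodes without error to U+2026 (HORIZONTAL ELLIPSIS), which is why a byte-order hint is needed.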

  15. U+FEFF: BOM
    Byte-Order-Mark
    ZERO WIDTH NO-BREAK SPACE

  16. U+FFFE: byte-swapped BOM
    A Byte-Order-Mark read with the wrong endianness; U+FFFE itself is a noncharacter

  17. BOM BOM BOM...
    UTF-8 BOM
    0xEF 0xBB 0xBF
    UTF-32BE BOM
    0x00 0x00 0xFE 0xFF
    UTF-32LE BOM
    0xFF 0xFE 0x00 0x00
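
As a rough sketch, BOM sniffing can be done in Python with the constants from the standard `codecs` module. The function name `sniff_bom` is made up here, and the two UTF-16 BOMs (FE FF and FF FE) are added to the three listed on the slide. The UTF-32 checks come first because the UTF-32LE BOM begins with the same two bytes as the UTF-16LE BOM.

```python
import codecs

# Order matters: longer BOMs must be tested before their shorter prefixes.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return the encoding suggested by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom(b"\xef\xbb\xbfHallo"))
print(sniff_bom(b"Hallo"))
```

A missing BOM says nothing about the encoding; the function then simply returns `None`.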

  18. How to debug?
    terminal.

  19. hexdump -C foo.txt
    00000000 48 61 6c 6c 6f 20 62 65 70 68 70 75 67 21 0a 48 |Hallo bephpug!.H|
    00000010 69 65 72 20 65 69 6e 20 61 2d 55 6d 6c 61 75 74 |ier ein a-Umlaut|
    00000020 3a c3 a4 21 0a 48 69 65 72 20 65 69 6e 20 61 2d |:..!.Hier ein a-|
    00000030 6d 69 74 2d 4b 72 69 6e 67 65 6c 3a c3 a5 21 0a |mit-Kringel:..!.|
    00000040 0a
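
If `hexdump` is not at hand, a similar view can be approximated in a few lines of Python. This is only a sketch of one output row (no offsets, no 16-byte wrapping), and `hexdump_line` is a hypothetical helper name.

```python
def hexdump_line(data: bytes) -> str:
    """Format bytes as hex plus a printable-ASCII column, non-printables as '.'."""
    hex_part = " ".join(f"{byte:02x}" for byte in data)
    text_part = "".join(chr(byte) if 32 <= byte < 127 else "." for byte in data)
    return f"{hex_part} |{text_part}|"


# "\u00e4" is ä; in UTF-8 it shows up as the two bytes c3 a4, as in the dump above.
print(hexdump_line("a-Umlaut:\u00e4!\n".encode("utf-8")))
```

As in the `hexdump -C` output, the two bytes of the umlaut appear as two dots in the text column.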

  20. How to re-encode?
    iconv
    -f FROM -t TO
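
What `iconv -f FROM -t TO` does can be sketched in Python as decode-then-encode. The function name `reencode` and the sample bytes are assumptions for illustration.

```python
def reencode(data: bytes, from_encoding: str, to_encoding: str) -> bytes:
    """Interpret `data` as `from_encoding` and write it back out as `to_encoding`."""
    return data.decode(from_encoding).encode(to_encoding)


latin1_bytes = b"M\xfcsli"  # "Müsli" stored as ISO-8859-1
utf8_bytes = reencode(latin1_bytes, "iso-8859-1", "utf-8")
print(utf8_bytes.hex(" "))
```

The single byte fc becomes the two-byte sequence c3 bc. The conversion is only correct if the FROM encoding is really the one the data was written in.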

  21. PHP?
    strlen, substr, ...

  22. c3 a4 = ä
    c3 =
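
The byte-versus-character problem can be reproduced in Python, used here only as a neutral illustration: PHP's `strlen` and `substr` count and cut bytes, which corresponds to working on the encoded `bytes` object below. The helper name `decodes_cleanly` is made up.

```python
text = "\u00e4"            # ä, one character
raw = text.encode("utf-8")  # c3 a4, two bytes

print(len(text))  # character count
print(len(raw))   # byte count, what a byte-oriented length function reports


def decodes_cleanly(data: bytes) -> bool:
    """True if `data` is a complete, valid UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


first_byte = raw[:1]  # what a byte-wise substring of length 1 returns
print(decodes_cleanly(raw), decodes_cleanly(first_byte))
print(first_byte.decode("utf-8", errors="replace"))  # shows U+FFFD
```

Cutting after the first byte leaves a lone c3, which is not valid UTF-8 and is typically rendered as the replacement character.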

  23. PHP!
    mbstring
    mb_*
    iconv_*

  24. Transliteration
    ü => ue
    ü => u
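
Both transliteration styles from the slide can be sketched in Python. The mapping table `GERMAN` and the two function names are assumptions for this example, and the table covers lowercase letters only.

```python
import unicodedata

# Language-specific rule: German convention, ü => ue (lowercase only in this sketch).
GERMAN = {"\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue", "\u00df": "ss"}


def translit_german(text: str) -> str:
    """Replace German umlauts and ß with their two-letter spellings."""
    return "".join(GERMAN.get(ch, ch) for ch in text)


def strip_marks(text: str) -> str:
    """Generic rule: decompose, then drop combining marks, ü => u."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


word = "M\u00fcsli"
print(translit_german(word), strip_marks(word))
```

Which result is right depends on the language and use case, so the choice usually cannot be made by the encoding library alone.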

  25. Databases

  26. DB - Connection
    SET NAMES 'UTF8'

  27. DB - Storage
    Table vs. DB

  28. DB - Collation
    ü..rstuvw ... rstuüvw
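
How much the sort order depends on the collation can be sketched in Python. This is not how a database implements collations; it only contrasts plain code-point order with a simplified accent-aware key. The function names are hypothetical.

```python
import unicodedata

letters = ["r", "s", "t", "u", "\u00fc", "v", "w"]  # "\u00fc" is ü


def binary_order(items):
    """Sort by raw code point, comparable to a binary collation."""
    return sorted(items)


def accent_aware_order(items):
    """Sort by base letter first, using the original character as tie-breaker."""
    def key(ch):
        decomposed = unicodedata.normalize("NFD", ch)
        base = "".join(c for c in decomposed if not unicodedata.combining(c))
        return (base, ch)
    return sorted(items, key=key)


print("".join(binary_order(letters)))
print("".join(accent_aware_order(letters)))
```

Code-point order pushes ü behind w, while the accent-aware key places it right after u. Real collations add further language-specific rules on top.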

  29. Problems
    Weird Stuff

  30. PHP: Identifiers
    $Schüssel = new Müsli(T_FRÜCHTE);
    [a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*
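
The identifier pattern from the slide is byte-oriented, which is why UTF-8 encoded umlauts pass: every byte of a multi-byte UTF-8 sequence lies in the range \x7f-\xff. A Python sketch that applies the same pattern to encoded bytes (the helper name `is_identifier` is made up):

```python
import re

# Same pattern as on the slide, applied to raw bytes.
IDENTIFIER = re.compile(rb"[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*")


def is_identifier(name: str, encoding: str = "utf-8") -> bool:
    """True if the encoded bytes of `name` match the identifier pattern completely."""
    return IDENTIFIER.fullmatch(name.encode(encoding)) is not None


for candidate in ("Sch\u00fcssel", "T_FR\u00dcCHTE", "\u2620", "9lives"):
    print(candidate, is_identifier(candidate))
```

Even a lone U+2620 passes, because the pattern never looks at characters, only at byte values.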

  31. Different Line-Endings

  32. LF
    // let's say hello! LF
    echo "hello" LF

  33. LF
    // let's say hello! LF
    echo "hello" LF
    // let's say hello! LF echo "hello"
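
One way two lines can collapse into one is sketched below in Python, under the assumption that the file uses a line ending the reader does not recognize (CR-only is used here as the example). A tool that splits on LF only then sees the comment and the `echo` as a single line. The helper name `lf_lines` is hypothetical.

```python
source_lf = b'// let\'s say hello!\necho "hello"\n'  # Unix line endings
source_cr = b'// let\'s say hello!\recho "hello"\r'  # CR-only line endings


def lf_lines(data: bytes):
    """Split on LF only, the way an LF-only tool would see the file."""
    return [line for line in data.split(b"\n") if line]


print(len(lf_lines(source_lf)))     # comment and echo are separate lines
print(len(lf_lines(source_cr)))     # everything ends up on the comment line
print(len(source_cr.splitlines()))  # splitlines() also recognizes CR and CRLF
```

In the CR-only case the `echo` becomes part of the `//` comment for any reader that only honors LF.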

  34. Diacritics
    ü => u+"
    U+00FC ü c3 bc LATIN SMALL LETTER U WITH DIAERESIS
    U+0075 u 75 LATIN SMALL LETTER U
    U+0308 _̈ cc 88 COMBINING DIAERESIS
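
The two spellings of ü listed above look identical but compare as different strings. A short Python sketch using the standard `unicodedata` module shows the difference and how normalization reconciles them:

```python
import unicodedata

precomposed = "\u00fc"  # LATIN SMALL LETTER U WITH DIAERESIS
decomposed = "u\u0308"  # LATIN SMALL LETTER U + COMBINING DIAERESIS

print(precomposed == decomposed)  # different code point sequences
print(precomposed.encode("utf-8").hex(" "))
print(decomposed.encode("utf-8").hex(" "))

# NFC composes, NFD decomposes; normalize both sides before comparing.
composed_again = unicodedata.normalize("NFC", decomposed)
split_again = unicodedata.normalize("NFD", precomposed)
print(composed_again == precomposed, split_again == decomposed)
```

Normalizing input to one form (commonly NFC) before storing or comparing avoids "identical" strings that fail equality checks.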

  35. Advice

  36. Use UTF-8

  37. UTF-8

  38. PHP/Server
    header('Content-Type: text/html; charset=utf-8');
    HTML

  39. Database
    SET NAMES UTF8 (or PDO)
    [mysql]
    default-character-set=utf8

  40. Contact
    Christoph Lühr
    eMail: [email protected], [email protected]
    Twitter: @chluehr
    Slides license
    Attribution-NonCommercial-ShareAlike 3.0
    http://creativecommons.org/licenses/by-nc-sa/3.0/
    Thanks!
    Questions?
    U+3020
    POSTAL MARK FACE

  41. Links
    ● Kore Nordmann (FAQ!)
    http://kore-nordmann.de/blog/0082_charset_versus_encoding.html
    http://kore-nordmann.de/blog/php_charset_encoding_FAQ.html
    ● Misc. Resources
    http://www.iana.org/assignments/character-sets/character-sets.xml
    http://www.joelonsoftware.com/articles/Unicode.html
    http://www.unicode.org/charts/
    http://t-a-w.blogspot.de/2008/12/funny-characters-in-unicode.html
    http://www.utf8-zeichentabelle.de/unicode-utf8-table.pl?number=1024
    http://stackoverflow.com/questions/3417180/exotic-names-for-methods-constants-variables-and-fields-bug-or-feature
