An Introduction to Character Encoding - WCNO - WCNL

My talk from WordCamp Norway 2016 and WordCamp Nederlands 2016

As a developer, understanding character encoding adds a lot of clarity to your work, especially when you’re dealing with text that contains characters beyond A-Z. If you’ve ever migrated a database from one site to another and ended up with jumbled characters in your content, this talk is for you. I’ll also explain why emoji in WordPress is the PR face of something much deeper and more important.

Video: https://www.youtube.com/watch?v=tdhaHt9-6Kw

#wordpress #wordcamp #wcno #wcnl #php

John Blackbourn

October 15, 2016

Transcript

  2. #WCNO
    John Blackbourn
    • WordPress core developer
    • Senior engineer at Human Made
    • Find me on Twitter, GitHub, WordPress.org, etc:
    @johnbillion

  3. #WCNO
    £25.00
    That’s nice.
    }
    mojibake

  4. #WCNO
    Why do I have
    strange characters
    in my content?

  5. #WCNO
    Binary
    010101011010100100000100101100100101

  6. #WCNO
    Binary
    010101011010100100000100101100100101
    }
    byte
    256 values (2^8)

  7. #WCNO
    Binary
    010101011010100100000100101100100101
    }
    A
    Code point 65
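
The two slides above can be sketched in a few lines. This is only an illustration, written in Python rather than the PHP the talk targets, and the variable names are my own: it shows that one byte holds 2^8 values and that the letter A corresponds to code point 65.

```python
# One byte is eight bits, so it can hold 2**8 distinct values.
values_per_byte = 2 ** 8

# ord() returns the code point of a character; chr() goes the other way.
code_point = ord("A")
character = chr(65)

# The same code point written out as the eight bits of a single byte.
bits = format(code_point, "08b")

print(values_per_byte, code_point, character, bits)
```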

  8. #WCNO
    ASCII
    American Standard Code for Information Interchange
    0    1    2    3    4    5    6    7    8    9    10   11   12   13   14   15
    NUL  SOH  STX  ETX  EOT  ENQ  ACK  BEL  BS   TAB  LF   VT   FF   CR   SO   SI

    16   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31
    DLE  DC1  DC2  DC3  DC4  NAK  SYN  ETB  CAN  EM   SUB  ESC  FS   GS   RS   US

    32   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47
    SP   !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /

    48   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63
    0    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?

    64   65   66   67   68   69   70   71   72   73   74   75   76   77   78   79
    @    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O

    80   81   82   83   84   85   86   87   88   89   90   91   92   93   94   95
    P    Q    R    S    T    U    V    W    X    Y    Z    [    \    ]    ^    _

    96   97   98   99   100  101  102  103  104  105  106  107  108  109  110  111
    `    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o

    112  113  114  115  116  117  118  119  120  121  122  123  124  125  126  127
    p    q    r    s    t    u    v    w    x    y    z    {    |    }    ~    DEL

  9. #WCNO
    ASCII
    Doesn’t include characters for…
    • Ahom
    • Arabic
    • Imperial Aramaic
    • Armenian
    • Avestan
    • Balinese
    • Bamum
    • Batak
    • Bengali
    • Bopomofo
    • Brahmi

  10. #WCNO

  11. #WCNO
    :(

  12. #WCNO
    Unicode
    120,000 characters and counting
    Character   Code point (decimal)   Code point (hex)
    £           163                    U+00A3
    Å           197                    U+00C5
    Æ           198                    U+00C6
    Ø           216                    U+00D8
    €           8364                   U+20AC
    💩          128169                 U+1F4A9
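
The decimal and U+ forms on this slide are the same number in two notations. A minimal Python sketch, assuming a hypothetical helper named `describe` (not something from the talk), makes the relationship explicit:

```python
def describe(char: str) -> str:
    """Return the decimal code point and U+ notation for one character."""
    code_point = ord(char)
    # U+ notation is the code point in hexadecimal, padded to at least four digits.
    return f"{code_point} U+{code_point:04X}"


# The characters from the slide; the last one is written as an escape sequence.
for char in "£ÅÆØ€\U0001F4A9":
    print(char, describe(char))
```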

  13. #WCNO
    010101011010100100000100101100100101
    }
    256 != 120,000

  14. #WCNO
    }
    UTF-8
    11110000 10011111 10010010 10101001
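
The four bytes on this slide appear to be the UTF-8 encoding of code point 128169 (U+1F4A9) from the Unicode slide. A short Python sketch, offered as an illustration rather than anything shown in the talk, reproduces them:

```python
pile_of_poo = "\U0001F4A9"  # code point 128169 from the Unicode slide

# Encoding to UTF-8 turns the single code point into a sequence of bytes.
encoded = pile_of_poo.encode("utf-8")

# Write each byte as eight bits, separated by spaces.
bits = " ".join(format(byte, "08b") for byte in encoded)

print(len(encoded))
print(bits)
```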

  15. #WCNO
    010101011010100100000100101100100101
    }
    A
    UTF-8
    Code point 65

  16. #WCNO
    UTF-8
    Problem solved.

  17. #WCNO
    UTF-8
    ASCII
    Windows-1252
    Latin-1
    and many more…

  18. #WCNO

  19. #WCNO
    010101011010100111111100101100100101
    }
    UTF-8
    uses the high bit to signify the leading / continuation byte of
    a sequence of multiple bytes.
    Windows-1252
    uses the high bit to fit in 128 more characters.
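
To make the difference concrete, here is a rough Python sketch. The function name `utf8_byte_role` is hypothetical, and the classifier only inspects the top bits; it does not validate that a byte sequence is well-formed UTF-8.

```python
def utf8_byte_role(byte: int) -> str:
    """Classify a byte by its high bits, following UTF-8's bit patterns."""
    if byte & 0b10000000 == 0:
        return "single"        # 0xxxxxxx: a one-byte (ASCII) character
    if byte & 0b11000000 == 0b10000000:
        return "continuation"  # 10xxxxxx: continues a multi-byte sequence
    return "leading"           # 11xxxxxx: starts a multi-byte sequence


# In UTF-8, "Å" is the two bytes C3 85: one leading byte, one continuation byte.
roles = [utf8_byte_role(b) for b in "Å".encode("utf-8")]

# Windows-1252 instead treats every byte, high bit or not, as one whole character.
as_cp1252 = bytes([0xC5]).decode("cp1252")

print(roles, as_cp1252)
```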

  20. #WCNO
    Here’s the kicker
    A two-byte character encoded with UTF-8
    will be seen as two separate characters
    if it’s read using Windows-1252.

  21. #WCNO
    Character   Code point   UTF-8 bytes   Windows-1252 byte   Mojibake (UTF-8 read as Windows-1252)
    A           65           41            41                  (none)
    £           163          C2 A3         A3                  Â£
    Å           197          C3 85         C5                  Ã…
    Æ           198          C3 86         C6                  Ã†
    Ø           216          C3 98         D8                  Ã˜
    €           8364         E2 82 AC      80                  â‚¬
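
The mojibake on this slide can be reproduced by encoding each character as UTF-8 and then decoding the resulting bytes as Windows-1252. A small Python sketch, assuming Python's `cp1252` codec as a stand-in for Windows-1252 and a hypothetical helper name:

```python
def read_utf8_as_cp1252(text: str) -> str:
    """Encode text as UTF-8, then misread those bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


for char in "A£ÅÆØ€":
    # Hex view of the UTF-8 bytes, for comparison with the slide.
    utf8_hex = char.encode("utf-8").hex(" ").upper()
    print(char, ord(char), utf8_hex, read_utf8_as_cp1252(char))
```

Plain ASCII such as A comes through unchanged, which is one reason this kind of bug can go unnoticed on mostly English content.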

  22. #WCNO
    Here’s the takeaway
    If you’re storing or transmitting text,
    you need to know what encoding it uses,
    otherwise you cannot reliably display it.

  23. #WCNO
    How does mojibake happen?
    • Migrating data between databases
    Destination database’s encoding doesn’t match source
    • Reading strings using wrong encoding
    Reading a Windows-1252 encoded Word file as UTF-8
    Reading an XML feed that uses a different encoding
    • Opening files in editor using wrong encoding
    Most editors can switch encoding but often can’t fix it

  24. #WCNO
    How can mojibake be fixed?
    • Migrating data between databases
    Re-import using the correct encoding (character set)
    • Reading strings using wrong encoding
    iconv() in PHP if you know the source encoding
    • Opening files in editor using wrong encoding
    Re-open file using correct encoding, then convert
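
The second fix, converting when the source encoding is known, can be sketched as follows. This is a Python analogue of what iconv() does in PHP, not PHP code itself, and both function names are hypothetical. The repair function only works when the text was UTF-8 misread as Windows-1252 and no bytes were lost along the way.

```python
def convert_encoding(data: bytes, source: str, target: str = "utf-8") -> bytes:
    """Re-encode bytes from a known source encoding into the target encoding."""
    return data.decode(source).encode(target)


def repair_mojibake(garbled: str) -> str:
    """Undo UTF-8 text that was misread as Windows-1252."""
    return garbled.encode("cp1252").decode("utf-8")


# Bytes as a Windows-1252 source would store them.
word_bytes = "£25.00".encode("cp1252")
converted = convert_encoding(word_bytes, "cp1252")

repaired = repair_mojibake("Â£25.00")
print(converted, repaired)
```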

  25. #WCNO
    Multibyte in PHP
    • String functions
    substr(), strlen() - Only support single-byte characters
    Using them can split multibyte characters
    • Multibyte String Functions
    mb_strlen()
    mb_strtolower()
    mb_substr()
    and more…
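
The byte-versus-character distinction behind this slide can be shown in Python, used here only as a stand-in: operating on `bytes` behaves roughly like PHP's byte-oriented string functions, while operating on `str` behaves roughly like the mb_ functions. The sample string is my own.

```python
text = "Ålesund"
raw = text.encode("utf-8")

byte_length = len(raw)   # counts bytes, as a byte-oriented length function would
char_length = len(text)  # counts characters, as a multibyte-aware function would

# Cutting after the first byte splits the two-byte "Å" in half;
# errors="replace" substitutes U+FFFD for the incomplete sequence.
broken = raw[:1].decode("utf-8", errors="replace")

# Cutting after the first character keeps it intact.
intact = text[:1]

print(byte_length, char_length, broken, intact)
```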

  26. #WCNO




    ✊ ✌
    ✋ ☝


    £ ¤ ¦ § © ª « ¬ - ® ¯ °
    ± ² ³ ´ µ ¶

    Multibyte in

  27. #WCNO
    Multibyte in
    • utf8
    MySQL database character encoding that supports up to
    three bytes per character.
    • utf8mb4
    MySQL database character encoding that supports up to
    four bytes per character.
    Enables support for all four-byte characters in UTF-8.
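
Whether a piece of text needs the four-byte encoding comes down to how many bytes UTF-8 uses for each character. A minimal Python sketch with hypothetical helper names; it checks byte lengths only and does not talk to a database:

```python
def utf8_length(char: str) -> int:
    """Number of bytes UTF-8 needs for a single character."""
    return len(char.encode("utf-8"))


def fits_in_three_bytes(text: str) -> bool:
    """True if every character needs at most three bytes in UTF-8."""
    return all(utf8_length(char) <= 3 for char in text)


print(utf8_length("€"), utf8_length("\U0001F4A9"))
print(fits_in_three_bytes("£ Å €"), fits_in_three_bytes("\U0001F4A9"))
```

Text for which `fits_in_three_bytes` returns False contains at least one character outside the three-byte limit described above, so it would need the four-byte encoding.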

  28. #WCNO
    者 為 今 令 免 入 全 具 刃 化 外
    情 才 抵 次 海 面 直 真 神 空 草
    角 道 雇 骨





    Multibyte in

  29. #WCNO
    The takeaway
    If you’re storing or transmitting text,
    you need to know what encoding it uses,
    otherwise you cannot reliably display it.

  30. #WCNO
    Resources
    codepoints.net

  31. #WCNO

  32. #WCNO
    Resources
    codepoints.net
    Joel Spolsky on character encoding
    Unicode’s Adopt a Character

  33. #WCNO
    John Blackbourn
    Find me on Twitter, GitHub, WordPress.org, etc:
    @johnbillion
    Questions?
