#WCNO
Binary
010101011010100100000100101100100101
A
Code point 65
ASCII
American Standard Code for Information Interchange
Dec  Char   Dec  Char   Dec  Char   Dec  Char
  0  NUL     32  SP      64  @       96  `
  1  SOH     33  !       65  A       97  a
  2  STX     34  "       66  B       98  b
  3  ETX     35  #       67  C       99  c
  4  EOT     36  $       68  D      100  d
  5  ENQ     37  %       69  E      101  e
  6  ACK     38  &       70  F      102  f
  7  BEL     39  '       71  G      103  g
  8  BS      40  (       72  H      104  h
  9  TAB     41  )       73  I      105  i
 10  LF      42  *       74  J      106  j
 11  VT      43  +       75  K      107  k
 12  FF      44  ,       76  L      108  l
 13  CR      45  -       77  M      109  m
 14  SO      46  .       78  N      110  n
 15  SI      47  /       79  O      111  o
 16  DLE     48  0       80  P      112  p
 17  DC1     49  1       81  Q      113  q
 18  DC2     50  2       82  R      114  r
 19  DC3     51  3       83  S      115  s
 20  DC4     52  4       84  T      116  t
 21  NAK     53  5       85  U      117  u
 22  SYN     54  6       86  V      118  v
 23  ETB     55  7       87  W      119  w
 24  CAN     56  8       88  X      120  x
 25  EM      57  9       89  Y      121  y
 26  SUB     58  :       90  Z      122  z
 27  ESC     59  ;       91  [      123  {
 28  FS      60  <       92  \      124  |
 29  GS      61  =       93  ]      125  }
 30  RS      62  >       94  ^      126  ~
 31  US      63  ?       95  _      127  DEL
010101011010100100000100101100100101
A
UTF-8
Code point 65
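As a small illustration (Python is used here for convenience; it is not part of the original slides), a single byte with the bit pattern 01000001 decodes as UTF-8 to code point 65, which is the character A. The variable names are only for this sketch.

```python
# A single byte with the bit pattern 01000001 (decimal 65, hex 41).
raw = bytes([0b01000001])

# Decoding the byte as UTF-8 yields a one-character string.
char = raw.decode("utf-8")

# ord() returns the code point of that character.
code_point = ord(char)

print(char, code_point)
```

The same byte also means A in ASCII, Latin-1 and Windows-1252, which is why plain English text tends to survive encoding mix-ups.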
UTF-8
Problem solved.
UTF-8
ASCII
Windows-1252
Latin-1
and many more…
010101011010100111111100101100100101
UTF-8 uses the high bit to signify the leading or continuation byte of a multi-byte sequence.
Windows-1252 uses the high bit to fit in up to 128 more characters.
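A minimal sketch of the difference, assuming Python and a hypothetical helper named classify_utf8_byte. It only inspects the top bits of a byte and does not reject byte values that can never appear in valid UTF-8.

```python
def classify_utf8_byte(byte: int) -> str:
    """Classify one byte of UTF-8 text by its leading bits (simplified)."""
    if byte & 0b10000000 == 0:
        return "single"        # 0xxxxxxx: a complete ASCII character
    if byte & 0b11000000 == 0b10000000:
        return "continuation"  # 10xxxxxx: continues a multi-byte sequence
    return "leading"           # 11xxxxxx: starts a multi-byte sequence


letter = "\u00c5"                        # the character Å
utf8_bytes = letter.encode("utf-8")      # two bytes: C3 85
cp1252_bytes = letter.encode("cp1252")   # one byte: C5

kinds = [classify_utf8_byte(b) for b in utf8_bytes]
print(kinds)
print(len(utf8_bytes), len(cp1252_bytes))
```

In Windows-1252 the byte C5 simply is the character Å; there is no notion of leading or continuation bytes.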
Here’s the kicker
A two-byte character encoded with UTF-8
will be seen as two separate characters
if it’s read using Windows-1252.
Character  Code point  UTF-8     Windows-1252  Mojibake
A          65          41        41            A
£          163         C2 A3     A3            Â£
Å          197         C3 85     C5            Ã…
Æ          198         C3 86     C6            Ã†
Ø          216         C3 98     D8            Ã˜
€          8364        E2 82 AC  80            â‚¬
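The rows above can be reproduced with a short Python sketch. The helper name mojibake is made up for this example, and "cp1252" is Python's codec name for Windows-1252. Python leaves a handful of Windows-1252 byte values undefined, so other characters than the ones shown may raise an error instead of producing mojibake.

```python
def mojibake(text: str) -> str:
    """Encode text as UTF-8, then (wrongly) decode the bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


# A, £, Å, Æ, Ø, € written as escapes so the script does not depend on
# how the source file itself is encoded.
for char in "A\u00a3\u00c5\u00c6\u00d8\u20ac":
    utf8_hex = char.encode("utf-8").hex(" ").upper()
    cp1252_hex = char.encode("cp1252").hex().upper()
    print(char, ord(char), utf8_hex, cp1252_hex, mojibake(char))
```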
Here’s the takeaway
If you’re storing or transmitting text,
you need to know what encoding it uses,
otherwise you cannot reliably display it.
How does mojibake happen?
• Migrating data between databases
  Destination database’s encoding doesn’t match the source’s
• Reading strings using the wrong encoding
  Reading a Windows-1252 encoded Word file as UTF-8
  Reading an XML feed that uses a different encoding
• Opening files in an editor using the wrong encoding
  Most editors can switch encodings, but often can’t repair text that is already garbled
How can mojibake be fixed?
• Migrating data between databases
  Re-import using the correct encoding (character set and matching collation)
• Reading strings using the wrong encoding
  Convert with iconv() in PHP if you know the source encoding
• Opening files in an editor using the wrong encoding
  Re-open the file using the correct encoding, then convert
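The same two ideas can be sketched in Python, with hypothetical helper names. convert_bytes plays the role that iconv() plays in PHP when the source encoding is known, and repair_mojibake reverses one specific mistake: UTF-8 bytes that were decoded as Windows-1252. The repair only works if no characters were lost along the way (for example replaced by "?").

```python
def convert_bytes(raw: bytes, source: str, target: str = "utf-8") -> bytes:
    """Re-encode raw bytes from a known source encoding to the target encoding."""
    return raw.decode(source).encode(target)


def repair_mojibake(garbled: str) -> str:
    """Undo UTF-8 text that was mistakenly decoded as Windows-1252."""
    return garbled.encode("cp1252").decode("utf-8")


# The Windows-1252 byte C5 (Å) becomes the UTF-8 bytes C3 85.
print(convert_bytes(b"\xc5", "cp1252"))

# The two-character mojibake for Å turns back into a single Å.
print(repair_mojibake("\u00c3\u2026"))
```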
Multibyte in PHP
• String functions
  substr(), strlen() - Work on bytes, so they only handle single-byte characters correctly
  Using them can split multibyte characters
• Multibyte String Functions
  mb_strlen()
  mb_strtolower()
  mb_substr()
  and more…
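The difference between counting bytes and counting characters can be shown outside PHP as well. This is a Python sketch, not PHP code: len() on a bytes object behaves roughly like a byte-oriented length function, and len() on a str counts characters. The sample word is chosen only for illustration.

```python
text = "Bl\u00e5b\u00e6r"     # "Blåbær": 6 characters, two of them non-ASCII
data = text.encode("utf-8")   # the same text as UTF-8 bytes

print(len(text))   # counts characters (code points)
print(len(data))   # counts bytes

# Cutting the byte string after 3 bytes lands in the middle of "å" (C3 A5).
broken = data[:3]
print(broken.decode("utf-8", errors="replace"))

# Slicing the decoded string works on whole characters instead.
print(text[:3])
```

The truncated byte string ends with a lone leading byte, so decoding it yields a replacement character where the å should be.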
Multibyte in MySQL
• utf8
  MySQL database character encoding that supports up to
  three bytes per character.
• utf8mb4
  MySQL database character encoding that supports up to
  four bytes per character.
  Enables support for all four-byte characters in UTF-8.
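To see which characters need the fourth byte, a short Python sketch can measure UTF-8 lengths. The helper names are hypothetical, and the code does not talk to MySQL; it only checks whether text would fit within a three-byte-per-character limit.

```python
def utf8_length(char: str) -> int:
    """Number of bytes a single character occupies in UTF-8."""
    return len(char.encode("utf-8"))


def fits_in_three_bytes(text: str) -> bool:
    """True if every character needs at most three UTF-8 bytes."""
    return all(utf8_length(c) <= 3 for c in text)


# A, Å, € and an emoji (U+1F600), written as escapes.
samples = ["A", "\u00c5", "\u20ac", "\U0001F600"]
lengths = [utf8_length(c) for c in samples]

print(lengths)
print(fits_in_three_bytes("\u20ac"), fits_in_three_bytes("\U0001F600"))
```

Characters such as emoji fall outside the three-byte range, which is the practical reason for choosing a four-byte encoding.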