Slide 1

Slide 1 text

Christoph Lühr @chluehr / @bephpug 2013 "Fun with charsets and encodings" Character Building ٩(͡๏̯͡๏)۶

Slide 2

Slide 2 text

basilicom

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

Slide 5

Slide 5 text

Image source: http://www.flickr.com/photos/stinajonsson/3932774410 CC BY-NC 2.0

Slide 6

Slide 6 text

Charset vs. Encoding

Slide 7

Slide 7 text

Set of Characters [ A B C ... 1 2 3 ... @#$ ] UNICODE / CODE PAGES

Slide 8

Slide 8 text

U+2278 NEITHER LESS-THAN NOR GREATER-THAN

Slide 9

Slide 9 text

U+2620 SKULL AND CROSSBONES

Slide 10

Slide 10 text

U+FDFA ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM "May Allah bless him and grant him peace"

Slide 11

Slide 11 text

Encoding = Mapping A = 0x41 (decimal 65) UTF-8, ISO-8859-1
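To make the charset/encoding split concrete, here is a minimal Python sketch (not part of the original slides). The helper name `describe` is made up for illustration; it shows the code point of a character next to the bytes two encodings map it to.

```python
def describe(ch: str):
    """Return (code point, UTF-8 bytes, ISO-8859-1 bytes or None) for one character."""
    code_point = f"U+{ord(ch):04X}"          # position in the character set
    utf8 = ch.encode("utf-8").hex(" ")       # bytes chosen by the UTF-8 encoding
    try:
        latin1 = ch.encode("iso-8859-1").hex(" ")
    except UnicodeEncodeError:
        latin1 = None                        # character is not in ISO-8859-1 at all
    return code_point, utf8, latin1


for ch in ("A", "ä", "☠"):
    print(ch, describe(ch))
```

The letter A maps to the single byte 0x41 in both encodings, while ä is one byte in ISO-8859-1 and two bytes in UTF-8.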

Slide 12

Slide 12 text

Single Byte vs. Multi Byte

Slide 13

Slide 13 text

UTF-16 variable length!
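A small Python sketch (an illustration, not from the talk) of why UTF-16 is variable length: characters outside the Basic Multilingual Plane need two 16-bit code units, a surrogate pair. The function name `utf16_units` is an assumption for this example.

```python
def utf16_units(ch: str) -> int:
    """Number of 16-bit code units UTF-16 needs for a single character."""
    return len(ch.encode("utf-16-be")) // 2


for ch in ("A", "☠", "😀"):
    print(ch, utf16_units(ch), ch.encode("utf-16-be").hex(" "))
```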

Slide 14

Slide 14 text

UTF-16/32: Big-Endian vs. Little-Endian
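As a hedged illustration of the byte-order question, the following Python sketch encodes one character with the explicit big-endian and little-endian codecs. The helper name `byte_orders` is hypothetical.

```python
def byte_orders(ch: str) -> dict:
    """Encode one character with explicit big- and little-endian codecs."""
    encodings = ("utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")
    return {enc: ch.encode(enc).hex(" ") for enc in encodings}


print(byte_orders("ä"))
```

The same code point U+00E4 yields the same byte values in each pair, only in reversed order.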

Slide 15

Slide 15 text

U+FEFF: BOM Byte-Order-Mark ZERO WIDTH NO-BREAK SPACE

Slide 16

Slide 16 text

U+FFFE: byte-swapped BOM Byte-Order-Mark (a noncharacter: the BOM read with the wrong byte order)
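One way to see how U+FEFF and U+FFFE relate is to write the BOM in one byte order and read it back in the other. This is a minimal Python sketch; the function name `read_bom` is made up for illustration.

```python
import codecs


def read_bom(data: bytes, codec: str) -> str:
    """Decode the first two bytes with an explicit byte order."""
    return data[:2].decode(codec)


bom_le = "\ufeff".encode("utf-16-le")        # bytes ff fe
print(bom_le.hex(" "))
print(hex(ord(read_bom(bom_le, "utf-16-le"))))  # read correctly
print(hex(ord(read_bom(bom_le, "utf-16-be"))))  # read with the wrong byte order
```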

Slide 17

Slide 17 text

BOM BOM BOM...
UTF8 BOM 0xEF 0xBB 0xBF
UTF32BE BOM 0x00 0x00 0xFE 0xFF
UTF32LE BOM 0xFF 0xFE 0x00 0x00
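The BOM byte sequences above can be used to guess an encoding from the first bytes of a file. The following is a sketch under assumptions: `sniff_bom` is a hypothetical helper, and the UTF-16 entries are added here from Python's `codecs` constants although the slide does not list them. The UTF-32 little-endian check has to come before UTF-16 little-endian because both start with FF FE.

```python
import codecs

# Order matters: the UTF-32-LE BOM (ff fe 00 00) starts with the UTF-16-LE BOM (ff fe).
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom(b"\xef\xbb\xbfHallo"))
```

A missing BOM proves nothing about the encoding, so a `None` result still needs a fallback decision.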

Slide 18

Slide 18 text

How to debug? terminal.

Slide 19

Slide 19 text

hexdump -C foo.txt
00000000  48 61 6c 6c 6f 20 62 65  70 68 70 75 67 21 0a 48  |Hallo bephpug!.H|
00000010  69 65 72 20 65 69 6e 20  61 2d 55 6d 6c 61 75 74  |ier ein a-Umlaut|
00000020  3a c3 a4 21 0a 48 69 65  72 20 65 69 6e 20 61 2d  |:..!.Hier ein a-|
00000030  6d 69 74 2d 4b 72 69 6e  67 65 6c 3a c3 a5 21 0a  |mit-Kringel:..!.|
00000040  0a
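Where `hexdump` is not at hand, a rough equivalent is easy to sketch in Python. This is an illustrative approximation of the `hexdump -C` layout, not a faithful reimplementation; the function name and column widths are assumptions.

```python
def hexdump(data: bytes, width: int = 16) -> str:
    """Render bytes as offset, hex values and printable ASCII, one row per `width` bytes."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Non-printable and non-ASCII bytes are shown as '.', as hexdump does.
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{width * 3}} |{text}|")
    return "\n".join(lines)


print(hexdump("a-Umlaut:ä!\n".encode("utf-8")))
```

The two dots in the text column are the bytes c3 a4, that is, a single ä encoded as UTF-8.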

Slide 20

Slide 20 text

How to re-encode? iconv -f FROM -t TO
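The same decode-then-encode step that `iconv -f FROM -t TO` performs can be sketched in Python. The function name `reencode` is hypothetical.

```python
def reencode(data: bytes, from_enc: str, to_enc: str) -> bytes:
    """Interpret bytes as from_enc, then write the same characters as to_enc."""
    return data.decode(from_enc).encode(to_enc)


latin1_bytes = b"M\xfcsli"                       # "Müsli" in ISO-8859-1
print(reencode(latin1_bytes, "iso-8859-1", "utf-8"))
```

This only works if FROM really is the encoding of the input; a wrong guess converts without error and produces garbage.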

Slide 21

Slide 21 text

PHP? strlen, substr, ...

Slide 22

Slide 22 text

c3 a4 = ä c3 = �
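The effect of cutting a string at a byte boundary instead of a character boundary can be reproduced in a few lines. This is a Python sketch of the idea, not the PHP behaviour itself; `byte_truncate` is a made-up helper.

```python
def byte_truncate(text: str, max_bytes: int) -> str:
    """Cut the UTF-8 bytes of text at max_bytes, as a byte-oriented substr would."""
    raw = text.encode("utf-8")[:max_bytes]
    # An incomplete multi-byte sequence decodes to U+FFFD, the replacement character.
    return raw.decode("utf-8", errors="replace")


print(len("ä"), len("ä".encode("utf-8")))   # characters vs. bytes
print(byte_truncate("ä", 1))
```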

Slide 23

Slide 23 text

PHP! mbstring mb_* iconv_*

Slide 24

Slide 24 text

Transliteration ü => ue ü => u
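Both transliteration styles from the slide can be sketched in Python. The mapping table and the function names are assumptions chosen for this example; real transliteration rules depend on the language.

```python
import unicodedata

# German-style transliteration: umlauts become a two-letter sequence.
GERMAN = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss",
})


def translit_german(text: str) -> str:
    return text.translate(GERMAN)


def strip_diacritics(text: str) -> str:
    """Decompose characters, then drop the combining marks (ü => u)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


print(translit_german("Müsli"), strip_diacritics("Müsli"))
```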

Slide 25

Slide 25 text

Databases

Slide 26

Slide 26 text

DB - Connection SET NAMES 'UTF8'

Slide 27

Slide 27 text

DB - Storage Table vs. DB

Slide 28

Slide 28 text

DB - Collation ü..rstuvw ... rstuüvw
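The two orderings on the slide can be imitated with different sort keys. This Python sketch is a deliberate simplification and not how a database collation is implemented; `collation_key` is a hypothetical helper that sorts by the base letter first.

```python
import unicodedata


def collation_key(word: str):
    """Sort by the text without diacritics, using the original as a tie-breaker."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return (base, word)


letters = ["v", "ü", "u", "r"]
print(sorted(letters))                     # plain code point order
print(sorted(letters, key=collation_key))  # ü sorted next to u
```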

Slide 29

Slide 29 text

Problems Weird Stuff

Slide 30

Slide 30 text

PHP: Identifiers $Schüssel = new Müsli (T_FRÜCHTE); [a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*
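The identifier pattern on the slide works on bytes, which is why UTF-8 encoded umlauts pass: each of their bytes falls into the \x7f-\xff range. The following Python sketch applies the same pattern to encoded names; `is_valid_label` is a made-up helper and the UTF-8 default is an assumption.

```python
import re

# The pattern from the slide, applied to raw bytes rather than characters.
PHP_LABEL = re.compile(rb"[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*")


def is_valid_label(name: str, encoding: str = "utf-8") -> bool:
    """Check whether the encoded bytes of name match the label pattern."""
    return PHP_LABEL.fullmatch(name.encode(encoding)) is not None


for name in ("Schüssel", "Müsli", "T_FRÜCHTE", "1abc"):
    print(name, is_valid_label(name))
```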

Slide 31

Slide 31 text

Different Line-Endings
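A common way to deal with mixed line endings is to normalize everything to one convention before processing. This is a minimal Python sketch, assuming LF as the target; the function name is hypothetical.

```python
def normalize_newlines(text: str) -> str:
    """Convert CRLF (Windows) and lone CR (classic Mac OS) to LF (Unix)."""
    # Replace CRLF first so the CR of a CRLF pair is not converted twice.
    return text.replace("\r\n", "\n").replace("\r", "\n")


print(repr(normalize_newlines("eins\r\nzwei\rdrei\n")))
```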

Slide 32

Slide 32 text

Slide 33

Slide 33 text

Slide 34

Slide 34 text

Diacritics ü => u+"
U+00FC  ü  c3 bc  LATIN SMALL LETTER U WITH DIAERESIS
U+0075  u  75     LATIN SMALL LETTER U
U+0308  _̈  cc 88  COMBINING DIAERESIS
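The two spellings of ü look identical but compare as different strings until they are normalized. A short Python sketch using the standard `unicodedata` module shows this; the variable names are chosen for illustration.

```python
import unicodedata

composed = "\u00fc"        # LATIN SMALL LETTER U WITH DIAERESIS
decomposed = "u\u0308"     # LATIN SMALL LETTER U + COMBINING DIAERESIS

print(composed == decomposed)                       # different code point sequences
print(composed.encode("utf-8").hex(" "))            # c3 bc
print(decomposed.encode("utf-8").hex(" "))          # 75 cc 88

# Normalizing both sides to the same form makes them comparable.
same_nfc = unicodedata.normalize("NFC", decomposed) == composed
same_nfd = unicodedata.normalize("NFD", composed) == decomposed
print(same_nfc, same_nfd)
```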

Slide 35

Slide 35 text

Advice

Slide 36

Slide 36 text

Use UTF-8

Slide 37

Slide 37 text

UTF-8

Slide 38

Slide 38 text

PHP/Server header( 'Content-type: text/html; charset=utf-8' ); HTML

Slide 39

Slide 39 text

Database SET NAMES UTF8 (or PDO) [mysql] default-character-set=utf8

Slide 40

Slide 40 text

Contact
Christoph Lühr
eMail: luehr@r-pentomino.de, christoph.luehr@basilicom.de
Twitter: @chluehr
Slides license: Attribution-NonCommercial-ShareAlike 3.0 http://creativecommons.org/licenses/by-nc-sa/3.0/
Thanks! Questions?
U+3020 POSTAL MARK FACE

Slide 41

Slide 41 text

Links
● Kore Nordmann (FAQ!)
http://kore-nordmann.de/blog/0082_charset_versus_encoding.html
http://kore-nordmann.de/blog/php_charset_encoding_FAQ.html
● Misc. Resources
http://www.iana.org/assignments/character-sets/character-sets.xml
http://www.joelonsoftware.com/articles/Unicode.html
http://www.unicode.org/charts/
http://t-a-w.blogspot.de/2008/12/funny-characters-in-unicode.html
http://www.utf8-zeichentabelle.de/unicode-utf8-table.pl?number=1024
http://stackoverflow.com/questions/3417180/exotic-names-for-methods-constants-variables-and-fields-bug-or-feature