Character Building - Fun with charsets and encodings
Christoph Lühr
July 02, 2013
Transcript
Christoph Lühr @chluehr / @bephpug 2013 "Fun with charsets and encodings" Character Building ٩(͡๏̯͡๏)۶
basilicom
�
Image source: http://www.flickr.com/photos/stinajonsson/3932774410 CC BY-NC 2.0
Charset vs. Encoding
Set of Characters [ A B C ... 1 2 3 ... @#$ ] UNICODE / CODE PAGES
U+2278 NEITHER LESS-THAN NOR GREATER-THAN
U+2620 SKULL AND CROSSBONES
U+FDFA ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM "May Allah bless him and grant him peace"
Encoding = Mapping A = 0x41 (decimal 65) UTF-8, ISO-8859-1
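To make the charset/encoding split concrete, here is a minimal Python sketch (Python is an assumption of this example; the deck itself targets PHP). The character and its code point stay the same while the bytes depend on the chosen encoding.

```python
# One character, two encodings: same code point, different bytes.
char = "ä"                                 # U+00E4 in the Unicode character set
code_point = ord(char)                     # 0xE4

utf8_bytes = char.encode("utf-8")          # two bytes: c3 a4
latin1_bytes = char.encode("iso-8859-1")   # one byte:  e4

# ASCII characters such as "A" map to the single byte 0x41 in both encodings.
ascii_same = "A".encode("utf-8") == "A".encode("iso-8859-1") == b"\x41"

print(f"U+{code_point:04X}", utf8_bytes.hex(" "), latin1_bytes.hex(" "))
```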
Single Byte vs. Multi Byte
UTF-16 variable length!
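A small Python sketch of why UTF-16 is variable length: characters outside the Basic Multilingual Plane need a surrogate pair, that is, two 16-bit code units. The helper name `utf16_code_units` is made up for this illustration.

```python
def utf16_code_units(text: str) -> int:
    """Count 16-bit code units needed to store text in UTF-16."""
    return len(text.encode("utf-16-be")) // 2

skull = "\u2620"        # U+2620 SKULL AND CROSSBONES, inside the BMP
emoji = "\U0001F600"    # U+1F600, outside the BMP

print(utf16_code_units(skull))   # one code unit
print(utf16_code_units(emoji))   # two code units (surrogate pair d83d de00)
```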
UTF-16/32: Big-Endian vs. Little-Endian
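As a quick illustration of byte order (a Python sketch, not taken from the slides), the same character yields mirrored byte sequences in the big-endian and little-endian variants:

```python
# "A" is U+0041; only the order of the bytes differs between BE and LE.
utf16_be = "A".encode("utf-16-be")   # 00 41
utf16_le = "A".encode("utf-16-le")   # 41 00
utf32_be = "A".encode("utf-32-be")   # 00 00 00 41
utf32_le = "A".encode("utf-32-le")   # 41 00 00 00

for name, raw in [("UTF-16BE", utf16_be), ("UTF-16LE", utf16_le),
                  ("UTF-32BE", utf32_be), ("UTF-32LE", utf32_le)]:
    print(name, raw.hex(" "))
```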
U+FEFF: BOM Byte-Order-Mark (ZERO WIDTH NO-BREAK SPACE)
U+FFFE: not a character; what a BOM looks like when read with the wrong byte order
BOM BOM BOM...
UTF-8 BOM: 0xEF 0xBB 0xBF
UTF-32BE BOM: 0x00 0x00 0xFE 0xFF
UTF-32LE BOM: 0xFF 0xFE 0x00 0x00
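A minimal BOM sniffer in Python, assuming the data starts at the beginning of the file. The function name `sniff_bom` is hypothetical; the BOM constants come from the standard `codecs` module. The UTF-32 entries are checked first because the UTF-32LE BOM begins with the same two bytes as the UTF-16LE BOM.

```python
import codecs

# Longest BOMs first: ff fe 00 00 (UTF-32LE) starts with ff fe (UTF-16LE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None

print(sniff_bom(b"\xef\xbb\xbfHallo"))   # utf-8
print(sniff_bom(b"Hallo"))               # None
```

A missing BOM says nothing about the encoding; this only recognizes files that carry one.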
How to debug? terminal.
hexdump -C foo.txt
00000000  48 61 6c 6c 6f 20 62 65  70 68 70 75 67 21 0a 48  |Hallo bephpug!.H|
00000010  69 65 72 20 65 69 6e 20  61 2d 55 6d 6c 61 75 74  |ier ein a-Umlaut|
00000020  3a c3 a4 21 0a 48 69 65  72 20 65 69 6e 20 61 2d  |:..!.Hier ein a-|
00000030  6d 69 74 2d 4b 72 69 6e  67 65 6c 3a c3 a5 21 0a  |mit-Kringel:..!.|
00000040  0a
How to re-encode? iconv -f FROM -t TO
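What `iconv -f FROM -t TO` does can be sketched in Python as decode-then-encode. The helper name `reencode` is an assumption of this example, not part of any tool mentioned in the deck.

```python
def reencode(data: bytes, from_enc: str, to_enc: str) -> bytes:
    """Interpret data as from_enc and write the same characters as to_enc."""
    return data.decode(from_enc).encode(to_enc)

latin1 = b"Hier ein a-Umlaut:\xe4!"            # "ä" as a single ISO-8859-1 byte
utf8 = reencode(latin1, "iso-8859-1", "utf-8")
print(utf8.hex(" "))                           # ends in ... 3a c3 a4 21
```

Re-encoding only works when FROM is the encoding the bytes were really written in; a wrong guess produces mojibake without raising an error.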
PHP? strlen, substr, ...
c3 a4 = ä, c3 = �
PHP! mbstring mb_* iconv_*
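The byte-versus-character problem behind `strlen`/`substr` and the `mb_*` functions can be shown with a Python analogy (an assumption of this sketch: Python `bytes` stand in for PHP's byte strings, Python `str` for what the multibyte functions operate on).

```python
text = "ä"
raw = text.encode("utf-8")        # c3 a4

byte_length = len(raw)            # counts bytes, like a byte-oriented strlen
char_length = len(text)           # counts characters, like a multibyte-aware length

# Cutting after the first byte splits the two-byte sequence in half.
broken = raw[:1].decode("utf-8", errors="replace")

print(byte_length, char_length, broken)   # 2 1 �
```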
Transliteration ü => ue ü => u
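Both transliteration styles can be sketched in Python. The mapping table and the function names are made up for this illustration: `translit_german` follows the German convention (ü => ue), while `strip_diacritics` simply drops combining marks (ü => u).

```python
import unicodedata

# Illustrative table only; a real project would need a fuller mapping.
GERMAN = {"ä": "ae", "ö": "oe", "ü": "ue",
          "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"}

def translit_german(text: str) -> str:
    """ü => ue, using a language-specific table."""
    return "".join(GERMAN.get(ch, ch) for ch in text)

def strip_diacritics(text: str) -> str:
    """ü => u, by decomposing and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(translit_german("Müsli"))    # Muesli
print(strip_diacritics("Müsli"))   # Musli
```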
Databases
DB - Connection SET NAMES 'UTF8'
DB - Storage Table vs. DB
DB - Collation ü..rstuvw ... rstuüvw
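The effect of a collation can be imitated in Python. This is only a sketch of the idea, not how a database implements it: sorting by raw code point puts ü after w, while a key that compares the base letter first places ü right after u. The helper `base_letter` is hypothetical.

```python
import unicodedata

def base_letter(ch: str) -> str:
    """Return the character with combining marks removed (ü -> u)."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

letters = list("üwvutsr")

by_code_point = "".join(sorted(letters))
by_base_letter = "".join(sorted(letters, key=lambda ch: (base_letter(ch), ch)))

print(by_code_point)    # rstuvwü
print(by_base_letter)   # rstuüvw
```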
Problems Weird Stuff
PHP: Identifiers $Schüssel = new Müsli (T_FRÜCHTE); [a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*
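Why umlauts pass as identifiers: the pattern above works on bytes, and every byte of a UTF-8 multi-byte sequence falls into the \x7f-\xff range. A Python sketch that applies the same pattern to encoded names (the function name `is_php_label` is an assumption of this example):

```python
import re

# Same byte-oriented pattern as on the slide, applied to encoded names.
PHP_LABEL = re.compile(rb"[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*")

def is_php_label(name: str, encoding: str = "utf-8") -> bool:
    """Check whether the encoded bytes of name match the label pattern."""
    return PHP_LABEL.fullmatch(name.encode(encoding)) is not None

print(is_php_label("Schüssel"))    # True: c3 bc are both >= 0x7f
print(is_php_label("T_FRÜCHTE"))   # True
print(is_php_label("1abc"))        # False: must not start with a digit
```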
Different Line-Endings
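A file that mixes CR LF, LF and bare CR can be normalized before anything else parses it. A minimal Python sketch, with `normalize_newlines` as a hypothetical helper name; CR LF is replaced first so it does not turn into two line breaks.

```python
def normalize_newlines(text: str) -> str:
    """Convert CR LF and bare CR line endings to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")

mixed = '<?php\r\n\n// lets say hello!\recho "hello"\n'
print(repr(normalize_newlines(mixed)))
```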
<?php CR LF LF // lets say hello! LF echo "hello" LF
<?php CR LF LF // lets say hello! LF echo "hello" LF <?php // lets say hello! LF echo "hello"
Diacritics ü => u+"
U+00FC ü c3 bc LATIN SMALL LETTER U WITH DIAERESIS
U+0075 u 75 LATIN SMALL LETTER U
U+0308 _̈ cc 88 COMBINING DIAERESIS
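The two spellings of ü look identical but compare as different strings until they are normalized. A short Python sketch using the standard `unicodedata` module:

```python
import unicodedata

precomposed = "\u00fc"     # ü as one code point (bytes c3 bc)
decomposed = "u\u0308"     # u followed by COMBINING DIAERESIS (bytes 75 cc 88)

equal_raw = precomposed == decomposed                     # False
nfc = unicodedata.normalize("NFC", decomposed)            # composes to U+00FC
nfd = unicodedata.normalize("NFD", precomposed)           # decomposes to u + U+0308

print(equal_raw, nfc == precomposed, nfd == decomposed)   # False True True
```

Normalizing both sides to the same form (NFC is a common choice) before comparing or storing avoids this kind of mismatch.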
Advice
Use UTF-8
UTF-8
PHP/Server header('Content-type: text/html; charset=utf-8');
HTML <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
Database SET NAMES UTF8 (or PDO)
[mysql]
default-character-set=utf8
Contact Christoph Lühr
eMail: [email protected], [email protected]
Twitter: @chluehr
Slides license: Attribution-NonCommercial-ShareAlike 3.0 http://creativecommons.org/licenses/by-nc-sa/3.0/
Thanks! Questions? U+3020 POSTAL MARK FACE
Links
• Kore Nordmann (FAQ!)
http://kore-nordmann.de/blog/0082_charset_versus_encoding.html
http://kore-nordmann.de/blog/php_charset_encoding_FAQ.html
• Misc. Resources
http://www.iana.org/assignments/character-sets/character-sets.xml
http://www.joelonsoftware.com/articles/Unicode.html
http://www.unicode.org/charts/
http://t-a-w.blogspot.de/2008/12/funny-characters-in-unicode.html
http://www.utf8-zeichentabelle.de/unicode-utf8-table.pl?number=1024
http://stackoverflow.com/questions/3417180/exotic-names-for-methods-constants-variables-and-fields-bug-or-feature