Christoph Lühr
July 02, 2013
Character Building - Fun with charsets and encodings
Transcript
Character Building ٩(͡๏̯͡๏)۶
"Fun with charsets and encodings"
Christoph Lühr @chluehr / @bephpug 2013
basilicom
Image source: http://www.flickr.com/photos/stinajonsson/3932774410 CC BY-NC 2.0
Charset vs. Encoding
Set of Characters: [ A B C ... 1 2 3 ... @#$ ]
UNICODE / CODE PAGES
U+2278 NEITHER LESS-THAN NOR GREATER-THAN
U+2620 SKULL AND CROSSBONES
U+FDFA ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM ("May Allah bless him and grant him peace")
Encoding = Mapping: A = 0x41 (decimal 65) in both UTF-8 and ISO-8859-1
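A minimal sketch of the "encoding = mapping" idea, written in Python rather than the deck's PHP purely for illustration. The helper name `to_hex` is made up for this example. It encodes the same characters under the two encodings named on the slide and shows where they agree and where they differ.

```python
def to_hex(text: str, encoding: str) -> str:
    """Encode text and return the resulting bytes as space-separated hex."""
    return text.encode(encoding).hex(" ")

# 'A' is code point U+0041 and maps to the same single byte in both encodings.
a_utf8 = to_hex("A", "utf-8")
a_latin1 = to_hex("A", "iso-8859-1")

# U+00E4 (a with diaeresis) is one byte in ISO-8859-1 but two bytes in UTF-8.
ae_utf8 = to_hex("\u00e4", "utf-8")
ae_latin1 = to_hex("\u00e4", "iso-8859-1")

print(a_utf8, a_latin1, ae_utf8, ae_latin1)
```

The character set (which characters exist) is the same in all four calls; only the byte mapping changes.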
Single Byte vs. Multi Byte
UTF-16 variable length!
UTF-16/32 Big- / vs Little-Endian
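The two UTF-16 points above can be checked with a short Python sketch. The helper `utf16_units` and the choice of U+1F600 as the example outside the Basic Multilingual Plane are assumptions for illustration, not from the slides.

```python
def utf16_units(text: str) -> int:
    """Number of 16-bit code units needed to store text in UTF-16."""
    return len(text.encode("utf-16-be")) // 2

bmp_char = "\u2620"         # SKULL AND CROSSBONES, inside the BMP
astral_char = "\U0001F600"  # GRINNING FACE, outside the BMP

units_bmp = utf16_units(bmp_char)        # one code unit
units_astral = utf16_units(astral_char)  # two code units (a surrogate pair)

# Same character, same encoding family, different byte order.
big_endian = bmp_char.encode("utf-16-be").hex(" ")
little_endian = bmp_char.encode("utf-16-le").hex(" ")
surrogates = astral_char.encode("utf-16-be").hex(" ")

print(units_bmp, units_astral, big_endian, little_endian, surrogates)
```

So UTF-16 is only "two bytes per character" for the Basic Multilingual Plane, and a reader has to know the byte order before any of the bytes make sense.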
U+FEFF: BOM (Byte-Order-Mark), also ZERO WIDTH NO-BREAK SPACE
U+FFFE: not a valid character; this is what a BOM looks like when it is read with the wrong byte order
BOM BOM BOM...
UTF-8 BOM: 0xEF 0xBB 0xBF
UTF-32BE BOM: 0x00 0x00 0xFE 0xFF
UTF-32LE BOM: 0xFF 0xFE 0x00 0x00
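A rough sketch of BOM sniffing in Python, using the BOM constants from the standard `codecs` module. The function name `sniff_bom` and the table order are assumptions of this example. The one subtle point is ordering: the UTF-32LE BOM starts with the same two bytes as the UTF-16LE BOM, so the longer signatures have to be tested first.

```python
import codecs
from typing import Optional, Tuple

# Longest signatures first: FF FE 00 00 (UTF-32LE) begins with FF FE (UTF-16LE).
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def sniff_bom(data: bytes) -> Tuple[Optional[str], int]:
    """Return (encoding name, BOM length), or (None, 0) if no BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

print(sniff_bom(b"\xef\xbb\xbfHallo"))
```

A missing BOM proves nothing about the encoding; this only identifies files that announce themselves.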
How to debug? Use the terminal.
hexdump -C foo.txt
00000000  48 61 6c 6c 6f 20 62 65  70 68 70 75 67 21 0a 48  |Hallo bephpug!.H|
00000010  69 65 72 20 65 69 6e 20  61 2d 55 6d 6c 61 75 74  |ier ein a-Umlaut|
00000020  3a c3 a4 21 0a 48 69 65  72 20 65 69 6e 20 61 2d  |:..!.Hier ein a-|
00000030  6d 69 74 2d 4b 72 69 6e  67 65 6c 3a c3 a5 21 0a  |mit-Kringel:..!.|
00000040  0a
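Where `hexdump` is not available, a similar view can be produced with a few lines of Python. This is a simplified sketch, not a reimplementation of `hexdump -C`: the function name `hexdump` and the exact column spacing are choices made for this example.

```python
def hexdump(data: bytes, width: int = 16) -> str:
    """Render bytes as offset, hex values and a printable-ASCII column."""
    pad = width * 3
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Non-printable and non-ASCII bytes are shown as '.', as in the dump above.
        ascii_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{pad}} |{ascii_part}|")
    return "\n".join(lines)

dump = hexdump(b"a:\xc3\xa4!\n")
print(dump)
```

The two dots in the ASCII column are the two UTF-8 bytes `c3 a4` of a single character, which is exactly the pattern to look for in the dump above.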
How to re-encode? iconv -f FROM -t TO
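The same FROM/TO idea, sketched in Python: decode the bytes with the encoding they really are in, then encode to the target. The function name `reencode` and the sample bytes are made up for this example.

```python
def reencode(data: bytes, from_enc: str, to_enc: str) -> bytes:
    """Convert bytes from one encoding to another, like iconv -f FROM -t TO."""
    return data.decode(from_enc).encode(to_enc)

latin1_bytes = b"a-Umlaut: \xe4"
utf8_bytes = reencode(latin1_bytes, "iso-8859-1", "utf-8")

# Getting FROM wrong silently produces garbage: converting data that is
# already UTF-8 as if it were ISO-8859-1 double-encodes every non-ASCII byte.
double_encoded = reencode(utf8_bytes, "iso-8859-1", "utf-8")

print(utf8_bytes, double_encoded)
```

The conversion itself cannot tell whether FROM is correct, which is why inspecting the raw bytes first matters.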
PHP? strlen, substr, ...
c3 a4 = ä
c3 = � (a lone c3 is an incomplete sequence)
PHP! mbstring mb_* iconv_*
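The byte-versus-character problem is not specific to PHP. As a sketch in Python, where `bytes` behaves like PHP's byte-oriented `strlen`/`substr` and `str` behaves like the multibyte-aware `mb_*` functions (the variable names are illustrative only):

```python
text = "\u00e4"            # one character: a with diaeresis
raw = text.encode("utf-8")  # its UTF-8 bytes: c3 a4

char_len = len(text)  # counts characters
byte_len = len(raw)   # counts bytes

# Cutting after the first byte splits the character in half.
first_byte = raw[:1]
shown = first_byte.decode("utf-8", errors="replace")

print(char_len, byte_len, first_byte, shown)
```

The truncated byte decodes to U+FFFD, the replacement character that shows up as � in browsers and terminals.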
Transliteration: ü => ue, or ü => u
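Both transliteration styles can be sketched in Python. The mapping table below covers only lowercase German letters and is an assumption of this example; the function names are hypothetical. The second variant decomposes each character and drops the combining marks.

```python
import unicodedata

# Language-specific mapping (German convention), lowercase only in this sketch.
GERMAN_MAP = str.maketrans({
    "\u00e4": "ae",  # a with diaeresis
    "\u00f6": "oe",  # o with diaeresis
    "\u00fc": "ue",  # u with diaeresis
    "\u00df": "ss",  # sharp s
})

def translit_german(text: str) -> str:
    """Replace umlauts the German way: u-umlaut becomes 'ue'."""
    return text.translate(GERMAN_MAP)

def strip_diacritics(text: str) -> str:
    """Drop combining marks after decomposition: u-umlaut becomes 'u'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(translit_german("M\u00fcsli"), strip_diacritics("M\u00fcsli"))
```

Which result is correct depends on the language and the use case, so neither can be applied blindly.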
Databases
DB - Connection SET NAMES 'UTF8'
DB - Storage Table vs. DB
DB - Collation ü..rstuvw ... rstuüvw
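The effect of a collation can be imitated in Python. This is only an approximation for illustration: real database collations follow their own rule tables, and the key function `base_key` is an assumption of this sketch. Sorting by code point pushes ü past every ASCII letter, while a key based on the undecorated base letter places it next to u.

```python
import unicodedata

def base_key(s: str):
    """Sort by base letters first, using the original string as a tie-breaker."""
    decomposed = unicodedata.normalize("NFD", s)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base, s)

letters = ["v", "\u00fc", "r", "u", "w", "s", "t"]

binary_order = "".join(sorted(letters))                    # code point order
dictionary_order = "".join(sorted(letters, key=base_key))  # u-umlaut next to u

print(binary_order, dictionary_order)
```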
Problems Weird Stuff
PHP: Identifiers
$Schüssel = new Müsli (T_FRÜCHTE);
[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*
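Why the umlaut identifiers are accepted can be shown by applying the pattern above to raw bytes. This Python sketch assumes the pattern is matched byte by byte against the source file; the helper `is_php_label` is hypothetical. Every byte of a multi-byte UTF-8 character lies in the range 0x80 to 0xff, so it falls inside `\x7f-\xff`.

```python
import re

# The label pattern from the slide, compiled as a bytes pattern.
PHP_LABEL = re.compile(rb"[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*")

def is_php_label(name: str, encoding: str = "utf-8") -> bool:
    """Check whether name, encoded to bytes, matches the label pattern entirely."""
    return PHP_LABEL.fullmatch(name.encode(encoding)) is not None

print(is_php_label("Sch\u00fcssel"), is_php_label("1abc"))
```

The same name also matches when the file is saved as ISO-8859-1, but the bytes differ, so the "same" identifier in two differently encoded files is not the same identifier.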
Different Line-Endings
<?php CR LF LF // lets say hello! LF echo "hello" LF
<?php // lets say hello! LF echo "hello"
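Mixed line endings are easy to detect and normalize on the byte level. A minimal Python sketch, with made-up function names and a sample that mirrors the slide (one CR LF followed by plain LFs):

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF pairs, lone CRs and lone LFs in raw file content."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        "CR": data.count(b"\r") - crlf,
        "LF": data.count(b"\n") - crlf,
    }

def normalize_newlines(data: bytes) -> bytes:
    """Convert CRLF and lone CR to LF. CRLF must be replaced first."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

sample = b'<?php\r\n\n// lets say hello!\necho "hello"\n'
counts = count_line_endings(sample)
clean = normalize_newlines(sample)

print(counts, clean)
```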
Diacritics: ü => u + ¨
U+00FC  ü  c3 bc  LATIN SMALL LETTER U WITH DIAERESIS
U+0075  u  75     LATIN SMALL LETTER U
U+0308  ◌̈  cc 88  COMBINING DIAERESIS
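The two spellings of ü look identical but compare as different strings until they are normalized. A short Python sketch using the standard `unicodedata` module (variable names are illustrative):

```python
import unicodedata

composed = "\u00fc"     # single code point U+00FC
decomposed = "u\u0308"  # U+0075 followed by U+0308 COMBINING DIAERESIS

same_without_normalizing = composed == decomposed

# NFC composes, NFD decomposes; after either, both spellings compare equal.
nfc_equal = unicodedata.normalize("NFC", decomposed) == composed
nfd_equal = unicodedata.normalize("NFD", composed) == decomposed

composed_hex = composed.encode("utf-8").hex(" ")
decomposed_hex = decomposed.encode("utf-8").hex(" ")

print(same_without_normalizing, nfc_equal, nfd_equal, composed_hex, decomposed_hex)
```

Normalizing both sides to one form before comparing, searching or storing avoids this class of mismatch.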
Advice
Use UTF-8
UTF-8
PHP/Server: header( 'Content-type: text/html; charset=utf-8' );
HTML: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" >
Database: SET NAMES UTF8 (or PDO)
[mysql] default-character-set=utf8
Contact: Christoph Lühr
eMail: [email protected], [email protected]
Twitter: @chluehr
Slides license: Attribution-NonCommercial-ShareAlike 3.0 http://creativecommons.org/licenses/by-nc-sa/3.0/
Thanks! Questions? U+3020 POSTAL MARK FACE
Links
• Kore Nordmann (FAQ!)
http://kore-nordmann.de/blog/0082_charset_versus_encoding.html
http://kore-nordmann.de/blog/php_charset_encoding_FAQ.html
• Misc. Resources
http://www.iana.org/assignments/character-sets/character-sets.xml
http://www.joelonsoftware.com/articles/Unicode.html
http://www.unicode.org/charts/
http://t-a-w.blogspot.de/2008/12/funny-characters-in-unicode.html
http://www.utf8-zeichentabelle.de/unicode-utf8-table.pl?number=1024
http://stackoverflow.com/questions/3417180/exotic-names-for-methods-constants-variables-and-fields-bug-or-feature