CJK and Unicode From a PHP Committer

by てきめん tekimen

Slide 1

Slide 1 text

CJK and Unicode From a PHP Committer

Slide 2

Slide 2 text

Self-introduction
てきめん (tekimen)
● https://tekitoh-memdhoi.info
● X (Twitter): @youkidearitai
● https://github.com/youkidearitai
● https://mstdn.jp/tekimen
● https://phpc.social/@youkidearitai
● A PHP committer working on mbstring/Unicode

Slide 3

Slide 3 text

Agenda
● Unicode is good
● About CJK
– Ideographic characters and phonographic characters
● About grapheme functions
● What's new on the Unicode side of PHP

Slide 4

Slide 4 text

Unicode is good
● Unicode is good
● Freedom from mojibake (garbled characters)
● We can also use emoji 🎉🎉

Slide 5

Slide 5 text

About 漢字 (Kanji)
● Kanji originated in China; the word is pronounced “Hanzi” in Chinese.
● 漢字 are ideographic characters (表意文字).
– That is, each character carries a meaning.
– By contrast, the English alphabet consists of phonographic characters (表音文字): each character represents a sound.

Slide 6

Slide 6 text

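How to learn Kanji (or ideographic characters)?
Story-based mnemonics are not historically accurate, but they make ideographic characters easy to understand while learning.
● ストーリーで覚える漢字300 英語・韓国語・ポルトガル語・スペイン語訳版 (300 Kanji Learned Through Stories: English, Korean, Portuguese, and Spanish translation edition)
– https://www.9640.jp/nihongo/ja/detail/?402
– https://www.amazon.co.jp/dp/4874244025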

Slide 7

Slide 7 text

Nit: the (probably) correct history of “駅”
● A look at the history of 駅 (station) shows that the character was formed from pictograms.
● The left side is a horse (馬); the right side is an eye (目) plus a handcuff symbol (幸) → 驛 (old form) → 駅
● “Learn the meaning, origin, readings, stroke count, and radical of the kanji 駅/驛” https://okjiten.jp/kanji462.html
● “What is 駅 (eki)?” [word article], Niconico Pedia https://dic.nicovideo.jp/a/%E9%A7%85

Slide 8

Slide 8 text

About CJK
● Kanji originated in China and spread throughout East Asia, where many characters are now shared.
● However, the shapes of kanji vary by country or region.
● When Unicode was standardized, these regional variants were unified into shared code points.
– This is called Han unification (漢字統合).
● The regions involved are referred to as “CJK”, from the initial letters of their names (China, Japan, Korea).
– Vietnamese may also be included, in which case the term is CJKV.

Slide 9

Slide 9 text

Example: 化 U+5316
● 化 Chinese (Noto Sans SC)
● 化 Japanese (Noto Sans JP)
● 化 Korean (Noto Sans KR)
Annotation on the slide: “Not protruding”

Slide 10

Slide 10 text

Example: 乳 U+4E73
● 乳 Chinese (Noto Sans SC)
● 乳 Japanese (Noto Sans JP)
● 乳 Korean (Noto Sans KR)
Annotation on the slide: “Different point”

Slide 11

Slide 11 text

Example: 誤 U+8AA4
● 誤 Chinese (Noto Sans SC)
● 誤 Japanese (Noto Sans JP)
● 誤 Korean (Noto Sans KR)
Same code point, but all three shapes differ!

Slide 12

Slide 12 text

Further information
● Your Code Displays Japanese Wrong
– How to fix: specify a font for the target country or region.

Slide 13

Slide 13 text

Unicode has unified the world. But...
● On the other hand, there are also regional differences. CJK is one example.
● Unicode has a concept called locale.
– In the root locale, the Turkish letter “İ” (U+0130) is not matched to the lowercase “i”.
● Likewise, “ss” is not matched to the German letter “ẞ” (U+1E9E), even though “ss” is its case-folded form.

Slide 14

Slide 14 text

Grapheme cluster
● 🇯🇵 consists of “🇯” and “🇵”.
– An emoji is sometimes made of multiple code points.
● Japanese kanji can also be multi-code-point characters. For example, 「邉」
– 邉邉邉󠄁 邉󠄂 邉󠄃 邉󠄄 邉󠄅 邉󠄆 邉󠄈 邉󠄉 邉󠄊 邉󠄋 邉󠄌 邉󠄍 邉󠄎
– The code points are U+9089 U+E0101
● Variation selectors: U+E0100 through U+E01EF
– Such a sequence is called an IVS (Ideographic Variation Sequence)
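The code points listed above can be checked from PHP. This is a minimal sketch, assuming the mbstring extension is loaded; the helper name codePoints() is hypothetical and does not appear in the slides.

```php
<?php
// Hypothetical helper (not from the slides): list the code points of a UTF-8 string.
function codePoints(string $text): array
{
    return array_map(
        fn (string $char): string => sprintf('U+%04X', mb_ord($char, 'UTF-8')),
        mb_str_split($text, 1, 'UTF-8')
    );
}

$flag = "\u{1F1EF}\u{1F1F5}"; // regional indicator symbols J and P
$ivs  = "\u{9089}\u{E0101}";  // 邉 followed by a variation selector

print_r(codePoints($flag)); // U+1F1EF, U+1F1F5
print_r(codePoints($ivs));  // U+9089, U+E0101
```

Both strings display as a single character, yet each one is two code points.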

Slide 15

Slide 15 text

Grapheme cluster
● A sequence of code points that appears as a single character is called a grapheme cluster.
– Kanji (漢字)
– Emoji (✌️)
– It probably covers characters from around the world

Slide 16

Slide 16 text

PHP Unicode Support
● PHP provides the grapheme functions, which operate on grapheme clusters.
● In PHP 8.5, the grapheme functions gain a locale parameter.
– This addresses the problems shown on the earlier slide (İ (U+0130) matches its lowercase form, and ẞ (U+1E9E) matches “ss”).
● For example, pass “de_DE-u-ks-primary” as $locale.

Slide 17

Slide 17 text

Grapheme functions
● They are part of the intl extension.
– Requires the --enable-intl configure option.
● Length is counted in characters as they appear on screen, that is, in grapheme clusters.
– Useful for emoji 🙆‍♂️🙆‍♀️
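To make “counted as seen” concrete, the sketch below measures one emoji in three units. It assumes both mbstring and intl are loaded; the grapheme count depends on the ICU version PHP was built against, so treat the last number as what a reasonably recent ICU reports.

```php
<?php
// 🙆‍♂️ = U+1F646 U+200D U+2642 U+FE0F
$emoji = "\u{1F646}\u{200D}\u{2642}\u{FE0F}";

$bytes      = strlen($emoji);             // UTF-8 bytes: 4 + 3 + 3 + 3
$codePoints = mb_strlen($emoji, 'UTF-8'); // Unicode code points
$clusters   = grapheme_strlen($emoji);    // grapheme clusters (ext/intl)

printf(
    "bytes=%d, code points=%d, grapheme clusters=%d\n",
    $bytes,
    $codePoints,
    $clusters
);
```

Only the grapheme count agrees with what a reader sees: one character.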

Slide 18

Slide 18 text

grapheme functions and mbstring functions
● mb_str_split splits in Unicode code point units
– U+1F646
– U+200D
– U+2640
– U+FE0F
● grapheme_str_split splits in grapheme cluster units
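The comparison on this slide can be reproduced with a few lines. This is a sketch: it assumes grapheme_str_split() is only available on PHP 8.4 or later with the intl extension, so the call is guarded and falls back to null elsewhere.

```php
<?php
// 🙆‍♀️ from the slide: U+1F646 U+200D U+2640 U+FE0F
$emoji = "\u{1F646}\u{200D}\u{2640}\u{FE0F}";

// Code point units: one array element per code point.
$codePoints = mb_str_split($emoji, 1, 'UTF-8');

// Grapheme cluster units. Guarded because the function is assumed to be
// missing on older PHP versions or when ext/intl is not loaded.
$clusters = function_exists('grapheme_str_split')
    ? grapheme_str_split($emoji)
    : null;

var_dump(count($codePoints));
var_dump($clusters === null ? null : count($clusters));
```

mb_str_split yields four pieces that are meaningless on their own, while the grapheme split keeps the emoji intact as one element.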

Slide 19

Slide 19 text

Add locale for grapheme functions
● https://wiki.php.net/rfc/grapheme_add_locale_for_case_insensitive
– I added a $locale parameter to the grapheme functions.
– It is based on LDML (Locale Data Markup Language)
● https://www.unicode.org/reports/tr35/

Slide 20

Slide 20 text

How to use the $locale parameter
● İ (U+0130) matches “i” with the locale “tr_TR”
● ẞ (U+1E9E) matches “ss” with the locale “u-ks-level1”
● 邉󠄅 does not match 邊 with the locale “u-ks-identic”
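In code, the first two bullets might look like the sketch below. It assumes, based on the RFC, that PHP 8.5 lets grapheme_stripos() take the locale as a fourth argument; that signature is an assumption here, so the call is guarded and no result is asserted. matchWithLocale() is a hypothetical helper, not part of PHP or the slides.

```php
<?php
// Hypothetical helper (not from the slides). Assumes PHP 8.5 adds a fourth
// $locale argument to grapheme_stripos(); returns null where unavailable.
function matchWithLocale(string $haystack, string $needle, string $locale): ?bool
{
    if (PHP_VERSION_ID < 80500 || !function_exists('grapheme_stripos')) {
        return null;
    }

    return grapheme_stripos($haystack, $needle, 0, $locale) !== false;
}

$results = [
    // The slide states that İ (U+0130) matches "i" with tr_TR.
    'tr_TR'       => matchWithLocale("\u{0130}", 'i', 'tr_TR'),
    // The slide states that U+1E9E matches "ss" with u-ks-level1.
    'u-ks-level1' => matchWithLocale("\u{1E9E}", 'ss', 'u-ks-level1'),
];

var_dump($results);
```

On PHP versions before 8.5 both entries are null, which keeps the script runnable while making it obvious that the locale-aware path was not exercised.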

Slide 21

Slide 21 text

Conclusion
● Enhancements to the grapheme functions enable further Unicode support.
● Unicode is difficult, but it lets us communicate across the world.
– I explained the details using kanji, but I think the differences between characters are acceptable.
● I think the discomfort over CJK characters looking different will fade within a few generations.

Slide 22

Slide 22 text

Thank you!

Slide 23

Slide 23 text

Appendix: references
● https://maidonanews.jp/article/13142228
● https://www.unicode.org/reports/tr35/
● https://heistak.github.io/your-code-displays-japanese-wrong/
● https://www.9640.jp/nihongo/ja/detail/?402
● https://speakerdeck.com/youkidearitai/wen-zi-tohananika-phpnowen-zi-kodochu-li-nituite-php-lovers-meetup-number-5
● https://ken-lunde.medium.com/genuine-han-unification-redux-3912b561ecae
– JP: https://medium.com/@takagi.yuusuke/genuine-han-unification-%E6%97%A5%E6%9C%AC%E8%AA%9E%E8%A8%B3-24a705d77f9b