
A Talk on Unicode (2007)

4D Japan
November 01, 2024



Transcript

  1. word of note
     • Unicode compatible
       – the database
       – text variable
       – file path
     • Unicode is NOT fully implemented
       – method editor
       – form editor
  2. Unicode in a nutshell
     • Code Points
     • Glyphs
     • Combining Characters
     • Precomposed Characters
     • Normalization
     • Surrogate Pairs
  3. Code Points - unique code of a character
     • U+0000 to U+007F (128 characters) – identical to ASCII
     • U+0000 to U+FFFF (65,536 code points) – Basic Multilingual Plane
     • U+10000 to U+10FFFF (1,048,576 code points) – Supplementary Planes
  4. examples :
     • U+0041 – LATIN CAPITAL LETTER A (A)
     • U+0430 – CYRILLIC SMALL LETTER A (а)
     • U+10409 – DESERET CAPITAL LETTER SHORT AH (𐐉)
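The ranges and examples above can be checked with a few lines of Python. This is a minimal sketch, not 4D code: `plane_of` is a hypothetical helper introduced here for illustration, and the character names come from Python's bundled Unicode database.

```python
import unicodedata


def plane_of(code_point: int) -> str:
    """Classify a code point by the three ranges listed on the slide."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if code_point <= 0x7F:
        return "ASCII"
    if code_point <= 0xFFFF:
        return "BMP"
    return "Supplementary"


# The three example code points from the slide.
for cp in (0x0041, 0x0430, 0x10409):
    print(f"U+{cp:04X}", unicodedata.name(chr(cp)), plane_of(cp))
```

Note that ASCII is a subset of the Basic Multilingual Plane; the helper reports the narrowest range that contains the code point.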
  5. Non Characters - Length = 1, value=""
     • Non Characters – for Internal Software Use
     • Combining Characters – Diacritical Marks, etc.
     • Surrogate Pairs – Represent a Supplementary Code Point
  6. examples :
     • U+FEFF – BYTE ORDER MARK
     • U+0061, U+0301 – LATIN SMALL LETTER A, COMBINING ACUTE ACCENT (á)
     • U+D846, U+DD15 – surrogate pair for CJK UNIFIED IDEOGRAPH-21915 (U+21915)
  7. Glyphs - appearance on screen or paper
     • code points : font or style independent – characters are organized by name, not shape
     • code points : script dependent – characters are organized by meaning, not shape
     • code point : CJK Unified Ideograph
       – Chinese, Japanese, Korean share the same code
       – Simplified and Traditional Chinese are distinct
  8. examples :
     • U+0061 – LATIN SMALL LETTER A (a)
     • U+0430 – CYRILLIC SMALL LETTER A (а)
     • U+8A9E – CJK UNIFIED IDEOGRAPH-8A9E (語)
     • U+8BED – CJK UNIFIED IDEOGRAPH-8BED (语)
  9. Combining Characters - part of a letter
     • Diacritical Marks – acute, grave, kasra, shadda, dakuten, handakuten
     • Hangeul (Korean) – choseong, jungseong, jongseong
  10. examples :
     • U+0061, U+0323, U+0302 – LATIN SMALL LETTER A – COMBINING DOT BELOW – COMBINING CIRCUMFLEX ACCENT
     • U+0633, U+064E, U+0651 – ARABIC LETTER SEEN – ARABIC FATHA – ARABIC SHADDA
     • U+3046, U+3099 – HIRAGANA LETTER U – COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
  11. examples :
     • U+1100, U+1161, U+11B9 – HANGUL CHOSEONG KIYEOK – HANGUL JUNGSEONG A – HANGUL JONGSEONG PIEUP-SIOS
  12. examples :
     • U+0041, U+0301 – LATIN CAPITAL LETTER A – COMBINING ACUTE ACCENT
     • U+00C1 – LATIN CAPITAL LETTER A WITH ACUTE
     • U+1100, U+1161, U+11B9 – HANGUL CHOSEONG KIYEOK – HANGUL JUNGSEONG A – HANGUL JONGSEONG PIEUP-SIOS
     • U+AC12 – HANGUL SYLLABLE GABS
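The relation between the three conjoining jamo and the precomposed syllable U+AC12 is plain arithmetic. The sketch below follows the Hangul syllable composition formula from the Unicode Standard; the function name `compose_hangul` and the constant names are chosen here for illustration.

```python
import unicodedata

# Base code points and counts from the Unicode Hangul composition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28


def compose_hangul(lead: int, vowel: int, trail: int = T_BASE) -> int:
    """Return the precomposed syllable for a choseong/jungseong/jongseong triple.

    Passing T_BASE as the trail means "no trailing consonant".
    """
    l_index = lead - L_BASE
    v_index = vowel - V_BASE
    t_index = trail - T_BASE
    return S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index


syllable = compose_hangul(0x1100, 0x1161, 0x11B9)
print(f"U+{syllable:04X}")  # U+AC12

# NFC normalization performs the same composition on the jamo sequence.
print(unicodedata.normalize("NFC", "\u1100\u1161\u11b9") == chr(syllable))
```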
  13. in order to handle Unicode correctly, you need to...
     • Understand text evaluation rules – Comparison operators or Match regex
     • Know the exact number of letters – Length = the number of Code Points
     • Recognize unbreakable sequences
       – Substring can split Surrogates
       – Character code can be just a fraction
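The gap between "letters" and stored units can be illustrated outside 4D. The Python sketch below is an analogy only: Python's `len` counts code points, and the hypothetical helpers `utf16_units` and `split_is_safe` show what a UTF-16 based engine sees. How 4D's own Length and Substring treat supplementary characters is not spelled out on the slide, so nothing here should be read as 4D's exact behavior.

```python
def utf16_units(text: str) -> int:
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def split_is_safe(text: str, unit_index: int) -> bool:
    """Return False if cutting after unit_index UTF-16 units splits a surrogate pair.

    unit_index must be between 1 and utf16_units(text).
    """
    data = text.encode("utf-16-le")
    last_unit = int.from_bytes(data[2 * (unit_index - 1):2 * unit_index], "little")
    # A cut right after a high surrogate leaves its low surrogate orphaned.
    return not (0xD800 <= last_unit <= 0xDBFF)


accented = "a\u0301"       # one visible letter, two code points
ideograph = "\U00021915"   # one code point, two UTF-16 code units

print(len(accented), utf16_units(accented))    # 2 2
print(len(ideograph), utf16_units(ideograph))  # 1 2
print(split_is_safe(ideograph, 1))             # False
```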
  14. Normalization
     • NFD – decompose as much as possible
     • NFC – decompose, then re-compose
     • NFKC, NFKD – also factor in compatibility characters
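The four forms can be tried with Python's standard `unicodedata.normalize`. This is a minimal sketch using the Latin example from the earlier slides plus a half-width katakana pair chosen here to show the compatibility forms; `same_text` is a hypothetical helper, not a 4D command.

```python
import unicodedata

decomposed = "A\u0301"         # LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT
precomposed = "\u00C1"         # LATIN CAPITAL LETTER A WITH ACUTE
halfwidth_ga = "\uFF76\uFF9E"  # half-width KA + half-width voiced sound mark


def same_text(a: str, b: str) -> bool:
    """Compare two strings after bringing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(decomposed == precomposed)           # False: different code points
print(same_text(decomposed, precomposed))  # True: canonically equivalent

# NFD splits the precomposed letter; NFC puts it back together.
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True

# Only the K forms fold compatibility characters such as half-width kana.
print(unicodedata.normalize("NFC", halfwidth_ga) == halfwidth_ga)  # True
print(unicodedata.normalize("NFKC", halfwidth_ga) == "\u30AC")     # True
```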
  15. Surrogate Pairs (UTF-16 specific)
     • high (upper) surrogate – 0xD800 - 0xDBFF
     • low (lower) surrogate – 0xDC00 - 0xDFFF
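The mapping between a supplementary code point and its surrogate pair is the standard UTF-16 bit split, sketched below. The function names are chosen here for illustration; the check value is the U+21915 example from the earlier slide.

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points need surrogates")
    offset = code_point - 0x10000           # 20-bit value
    high = 0xD800 + (offset >> 10)          # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)         # bottom 10 bits
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Recombine a surrogate pair into its code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x21915)
print(f"U+{high:04X}, U+{low:04X}")  # U+D846, U+DD15
```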
  16. Extended Japanese
     • 4D 2004 – JIS X 0208 – 6,879 characters
     • 4D v11 SQL – JIS X 0213 – 11,233 characters
     • variant glyph : the lower stroke is longer than the upper – only available in v11 SQL
     • variant glyph : the dot on the top right is unnecessary – only available in v11 SQL
  17. Extended Chinese
     • A+ – GB18030 support with ethnic minority scripts
     • A – GB18030 support without ethnic minority scripts
     • B – updated product that meets the A standard
     • C – not in conformity with GB18030 (uncertified)
     • source : http://www.lisa.org/globalizationinsider/2002/05/a_look_at_china.html
     • 4D v11 SQL (UTF-16 with surrogate pairs) = A+
     • 4D 2004 (GB2312/Big5) = C
  18. Regular Expressions
     • \x – match code point (c.f. Char)
     • .* – match zero or more of any character (c.f. @)
     • \p – match property (no equiv.)
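The three constructs can be tried in Python as a rough stand-in. This sketch uses Python's `re` module, not the engine behind 4D's Match regex, so details may differ. In particular, `re` has no `\p`, so the hypothetical helper `has_property` emulates a property test with `unicodedata.category`.

```python
import re
import unicodedata

# \x and \u escapes match a character by its code point.
print(bool(re.fullmatch(r"\x41", "A")))          # True
print(bool(re.fullmatch(r"\u0430", "\u0430")))   # True: CYRILLIC SMALL LETTER A

# .* accepts zero characters, while .+ needs at least one.
print(bool(re.fullmatch(r"ab.*", "ab")))         # True
print(bool(re.fullmatch(r"ab.+", "ab")))         # False


def has_property(ch: str, category_prefix: str) -> bool:
    """Emulate a \\p{...} test using the General_Category of one character."""
    return unicodedata.category(ch).startswith(category_prefix)


print(has_property("\u0301", "M"))  # True: combining acute is a Mark
print(has_property("A", "Lu"))      # True: uppercase Letter
```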
  19. Enhanced Functions
     • Position with *
       – count Non Characters as 1 letter
       – diacritic sensitive search
     • Lowercase with * – keep diacritic marks
     • Uppercase with * – keep diacritic marks
  20. Plugin API v11
     • Unichar* – null terminated buffer of UTF-16
     • Unistring – structure of Unichar* and its length
     • evk_ArrayUnicode – array of Unistring
     • What if Unicode mode is turned off? – doesn't matter
  21. Plugin API 2004
     • Compatible with v11 SQL
     • What if Unicode mode is turned on? – doesn't matter
  22. Pasteboard
     • GET PASTEBOARD DATA
       – text (Unicode or legacy encoding)
       – file path as URL
     • GET FILE FROM PASTEBOARD – multiple file paths in system representation
     • GET TEXT FROM PASTEBOARD – "Mac" encoding
  23. Conversion to and from BLOB
     • CONVERT FROM TEXT – specify target encoding
     • TEXT TO BLOB – "Mac" encoding
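Why the target encoding matters can be shown with plain byte encoding in Python. This is an analogy, not 4D code: it assumes the "Mac" encoding is MacRoman, as on a Western system (a Japanese system would use a different legacy encoding), and `encodable` is a hypothetical helper.

```python
def encodable(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the target encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# The same letter yields different bytes depending on the target encoding.
print("é".encode("mac_roman"))   # b'\x8e'
print("é".encode("utf-8"))       # b'\xc3\xa9'
print("é".encode("utf-16-le"))   # b'\xe9\x00'

# A legacy single-byte encoding cannot hold every Unicode character.
print(encodable("語", "mac_roman"))  # False
print(encodable("語", "utf-8"))      # True
```

A fixed legacy encoding loses any character outside its repertoire, which is the practical reason to choose the target encoding explicitly.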
  24. PROCESS HTML TAGS
     • To create Web Page/XML...
       – use C_BLOB
       – declare charset="whatever"
     • To create UTF-16 text... – use C_TEXT
     • Conversion affected by... – Database parameter #17 (Character set)
  25. Upgrading from 4D 2004
     • single language throughout the DB – conversion automatic
     • multiple languages within the DB – create MultiLang.TXT