A Talk on Unicode (2007)

4D Japan
November 01, 2024

Transcript

  1. A word of note • Unicode compatible – the database – text variables – file paths • Unicode is NOT fully implemented – Method editor – Form editor
  2. Unicode in a nutshell • Code Points • Glyphs • Combining Characters • Precomposed Characters • Normalization • Surrogate Pairs
  3. Code Points - unique code of a character • U+0000 to U+007F (128 code points) – identical to ASCII • U+0000 to U+FFFF (65,536 code points) – Basic Multilingual Plane • U+10000 to U+10FFFF (1,048,576 code points) – Supplementary Planes
  4. Examples: • U+0041 – LATIN CAPITAL LETTER A • U+0430 – CYRILLIC SMALL LETTER A • U+10409 – DESERET CAPITAL LETTER SHORT AH
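The ranges and examples above can be checked with a short Python sketch. It is only an illustration, not part of the original talk: the helper names `plane_of` and `describe` are hypothetical, and the character names come from Python's bundled `unicodedata` tables.

```python
import unicodedata


def plane_of(code_point: int) -> int:
    """Return the plane number: 0 is the BMP, 1 to 16 are supplementary planes."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return code_point >> 16


def describe(code_point: int) -> str:
    """Format a code point with its plane and its official character name."""
    name = unicodedata.name(chr(code_point), "<unnamed>")
    return f"U+{code_point:04X} plane {plane_of(code_point)} {name}"


# The three examples from the slide: two BMP characters, one supplementary.
for cp in (0x0041, 0x0430, 0x10409):
    print(describe(cp))
```

Shifting right by 16 bits works because each plane holds exactly 65,536 (0x10000) code points.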
  5. Non Characters - Length = 1, value="" • Non Characters – for Internal Software Use • Combining Characters – Diacritical Marks, etc. • Surrogate Pairs – Represent a Supplementary Code Point
  6. Examples: • U+FEFF – BYTE ORDER MARK • U+0061, U+0301 – LATIN SMALL LETTER A, COMBINING ACUTE ACCENT (á) • U+D846, U+DD15 – CJK UNIFIED IDEOGRAPH-21915 (U+21915)
  7. Glyphs - appearance on screen or paper • code points : font or style independent – characters are organized by name, not shape • code points : script dependent – characters are organized by meaning, not shape • code point : CJK Unified Ideograph – Chinese, Japanese, Korean share the same code – Simplified and Traditional Chinese are distinct
  8. Examples: • U+0061 – LATIN SMALL LETTER A (a) • U+0430 – CYRILLIC SMALL LETTER A (а) • U+8A9E – CJK UNIFIED IDEOGRAPH-8A9E (語) • U+8BED – CJK UNIFIED IDEOGRAPH-8BED (语)
  9. Combining Characters - part of a letter • Diacritical Marks – acute, grave, kasra, shadda, dakuten, handakuten • Hangul (Korean) – choseong, jungseong, jongseong
  10. Examples: • U+0061, U+0323, U+0302 – LATIN SMALL LETTER A – COMBINING DOT BELOW – COMBINING CIRCUMFLEX ACCENT • U+0633, U+064E, U+0651 – ARABIC LETTER SEEN – ARABIC FATHA – ARABIC SHADDA • U+3046, U+3099 – HIRAGANA LETTER U – COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
  11. Examples: • U+1100, U+1161, U+11B9 – HANGUL CHOSEONG KIYEOK – HANGUL JUNGSEONG A – HANGUL JONGSEONG PIEUP-SIOS
  12. Examples: • U+0041, U+0301 – LATIN CAPITAL LETTER A – COMBINING ACUTE ACCENT • U+00C1 – LATIN CAPITAL LETTER A WITH ACUTE • U+1100, U+1161, U+11B9 – HANGUL CHOSEONG KIYEOK – HANGUL JUNGSEONG A – HANGUL JONGSEONG PIEUP-SIOS • U+AC12 – HANGUL SYLLABLE GABS
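The Hangul example above (U+1100, U+1161, U+11B9 composing to U+AC12) follows an arithmetic rule defined by the Unicode Standard. The sketch below reproduces it in Python; the function name `compose_hangul` is hypothetical, and the function does no range checking, so it assumes valid conjoining jamo as input.

```python
import unicodedata

# Constants from the Unicode Standard's Hangul syllable composition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28


def compose_hangul(lead: int, vowel: int, tail: int = T_BASE) -> int:
    """Compose choseong + jungseong (+ optional jongseong) into one syllable."""
    l_index = lead - L_BASE
    v_index = vowel - V_BASE
    t_index = tail - T_BASE  # 0 means "no trailing consonant"
    return S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index


syllable = compose_hangul(0x1100, 0x1161, 0x11B9)
print(f"U+{syllable:04X}")

# Cross-check against the normalization tables shipped with Python.
print(unicodedata.normalize("NFC", "\u1100\u1161\u11B9") == chr(syllable))
```

For the slide's example the indices are 0, 0 and 18 (0x11B9 - 0x11A7), so the result is 0xAC00 + 18 = 0xAC12.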
  13. In order to handle Unicode correctly, you need to... • Understand text evaluation rules – comparison operators or Match regex • Know the exact number of letters – Length = the number of Code Points • Recognize unbreakable sequences – Substring can split Surrogates – Character code can be just a fraction
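The pitfalls listed above can be reproduced outside 4D. The following Python sketch is an illustration under stated assumptions: Python's `len` counts code points, so UTF-16 code units are obtained by encoding explicitly, and the helpers `utf16_units` and `safe_prefix_units` are hypothetical names, not 4D commands.

```python
def utf16_units(text: str) -> list[int]:
    """Return the UTF-16 code units of text as integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def is_high_surrogate(unit: int) -> bool:
    return 0xD800 <= unit <= 0xDBFF


def safe_prefix_units(text: str, max_units: int) -> list[int]:
    """Take at most max_units code units without cutting a surrogate pair in half."""
    units = utf16_units(text)
    cut = min(max_units, len(units))
    if 0 < cut < len(units) and is_high_surrogate(units[cut - 1]):
        cut -= 1  # back off so the pair stays together
    return units[:cut]


text = "a\U00021915"                # one BMP letter plus one supplementary ideograph
print(len(text))                    # number of code points
print(len(utf16_units(text)))       # number of UTF-16 code units
print(len("a\u0301"))               # one visible letter, two code points
print([hex(u) for u in safe_prefix_units(text, 2)])
```

A naive cut after two code units would end on U+D846, which is only half of a character; the safe version returns just the first unit instead.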
  14. Normalization • NFD – decompose as much as possible • NFC – decompose, then re-compose • NFKD, NFKC – also replace compatibility characters
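A short Python sketch shows what the normalization forms do to the earlier examples. It uses the standard `unicodedata.normalize` function; the helper `same_text` is a hypothetical name, and the half-width katakana sample is my own choice of compatibility character, not one from the talk.

```python
import unicodedata

decomposed = "A\u0301"    # LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT
precomposed = "\u00C1"    # LATIN CAPITAL LETTER A WITH ACUTE


def same_text(left: str, right: str) -> bool:
    """Compare two strings after bringing both to NFC."""
    return unicodedata.normalize("NFC", left) == unicodedata.normalize("NFC", right)


print(decomposed == precomposed)            # raw comparison of code points
print(same_text(decomposed, precomposed))   # comparison after normalization

# NFD splits the precomposed letter back into base + combining mark.
print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", precomposed)])

# NFKC additionally folds compatibility characters, e.g. half-width katakana KA.
halfwidth_ka = "\uFF76"
print(f"U+{ord(unicodedata.normalize('NFKC', halfwidth_ka)):04X}")
```

Normalizing both sides before comparing is one way to make the two spellings of the same letter compare equal.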
  15. Surrogate Pairs (UTF-16 specific) • high (upper) surrogate – 0xD800 - 0xDBFF • low (lower) surrogate – 0xDC00 - 0xDFFF
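The mapping between a supplementary code point and its surrogate pair is plain arithmetic. The sketch below, with hypothetical function names `to_surrogates` and `from_surrogates`, reproduces the earlier U+21915 example (U+D846, U+DD15).

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points need surrogates")
    offset = code_point - 0x10000          # 20 significant bits
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x21915)
print(f"U+{high:04X}, U+{low:04X}")
print(f"U+{from_surrogates(high, low):04X}")
```

Each surrogate carries 10 bits, which is why the two ranges of 1,024 values each cover exactly the 1,048,576 supplementary code points.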
  16. Extended Japanese • 4D 2004 – JIS X 0208 – 6,879 characters • 4D v11 SQL – JIS X 0213 – 11,233 characters • Example glyphs: – the lower stroke is longer than the upper; this is a variant character only available in v11 SQL – the dot on the top right is unnecessary; this is a variant character only available in v11 SQL
  17. Extended Chinese • A+ – GB18030 support with ethnic minority scripts • A – GB18030 support without ethnic minority scripts • B – updated product that meets the A standard • C – not in conformity with GB18030 (uncertified) • source : http://www.lisa.org/globalizationinsider/2002/05/a_look_at_china.html • 4D v11 SQL (UTF-16 with surrogate pairs) = A+ • 4D 2004 (GB2312/Big5) = C
  18. Regular Expressions • \x – match code point (cf. Char) • .* – match zero or more characters (cf. @) • \p – match property (no equiv.)
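The first two patterns behave the same way in most regex engines, so they can be tried in Python. This is a sketch, not 4D code: Python's standard `re` module does not accept `\p{...}`, so the property test is approximated here with `unicodedata.category`, and the helper names `is_letter` and `all_letters` are hypothetical.

```python
import re
import unicodedata

# \xhh names a code point by its hexadecimal value.
matches_latin_a = re.fullmatch(r"\x41", chr(0x41)) is not None

# .* matches zero or more characters, so "a.*" also accepts a bare "a".
matches_bare_a = re.fullmatch(r"a.*", "a") is not None
matches_longer = re.fullmatch(r"a.*", "abc") is not None


def is_letter(char: str) -> bool:
    """Rough stand-in for \\p{L}: true when the general category starts with L."""
    return unicodedata.category(char).startswith("L")


def all_letters(text: str) -> bool:
    """True when text is non-empty and every code point is a letter."""
    return bool(text) and all(is_letter(char) for char in text)


print(matches_latin_a, matches_bare_a, matches_longer)
print(all_letters("a\u0430\u8A9E"), all_letters("a1"))
```

Note that a combining mark such as U+0301 has category Mn, so it is not a letter under this test even though it is part of one on screen.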
  19. Enhanced Functions • Position with * – count Non Characters as 1 letter – diacritic sensitive search • Lowercase with * – keep diacritic marks • Uppercase with * – keep diacritic marks
  20. Plugin API v11 • Unichar* – null terminated buffer of UTF-16 • Unistring – structure of Unichar* and its length • evk_ArrayUnicode – array of Unistring • What if Unicode mode is turned off? – doesn't matter
  21. Plugin API 2004 • Compatible with v11 SQL • What if Unicode mode is turned on? – doesn't matter
  22. Pasteboard • GET PASTEBOARD DATA – text (Unicode or legacy encoding) – file path as URL • GET FILE FROM PASTEBOARD – multiple file paths in system representation • GET TEXT FROM PASTEBOARD – "Mac" encoding
  23. Conversion to and from BLOB • CONVERT FROM TEXT – specify target encoding • TEXT TO BLOB – "Mac" encoding
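Why the target encoding matters can be seen by encoding the same letter three ways. The sketch below is Python, not 4D: `text_to_blob` and `blob_to_text` are hypothetical helpers that merely wrap `str.encode` and `bytes.decode`, and `mac_roman` is used as an assumed stand-in for the "Mac" encoding, which on a real system depends on the script in use.

```python
def text_to_blob(text: str, encoding: str) -> bytes:
    """Serialize text into bytes using an explicitly chosen encoding."""
    return text.encode(encoding)


def blob_to_text(blob: bytes, encoding: str) -> str:
    """Read bytes back into text; the encoding must match the one used to write."""
    return blob.decode(encoding)


letter = "\u00E1"  # LATIN SMALL LETTER A WITH ACUTE

for encoding in ("utf-8", "utf-16-le", "mac_roman"):
    blob = text_to_blob(letter, encoding)
    print(encoding, blob.hex(), blob_to_text(blob, encoding) == letter)

# Decoding with the wrong encoding silently yields different text.
print(blob_to_text(text_to_blob(letter, "mac_roman"), "latin-1") == letter)
```

The byte sequence differs for every encoding, so the reader of a BLOB has to know which one the writer chose.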
  24. PROCESS HTML TAGS • To create Web Page/XML... – use C_BLOB – declare charset="whatever" • To create UTF-16 text... – use C_TEXT • Conversion affected by... – Database parameter #17 (Character set)
  25. Upgrading from 4D 2004 • single language throughout the DB – conversion automatic • multiple languages within the DB – create MultiLang.TXT