A Talk about Unicode (2007)

4D v11 SQL Unicode Compatibility
Keisuke Miyako, 4D Japan

In this Session...
• Unicode explained
• 4D v11 SQL demo
• Conversion from 4D 2004

word of note
• Unicode compatible
  – the database
  – text variable
  – file path
• Unicode is NOT fully implemented
  – method editor
  – form editor

word of note
• non-Unicode mode available
  – preferences
  – SET DATABASE PARAMETER

Unicode in a nutshell
• Code Points
• Glyphs
• Combining Characters
• Precomposed Characters
• Normalization
• Surrogate Pairs

Code Points - unique code of character
• U+0000 to U+007F (128 characters)
  – identical to ASCII
• U+0000 to U+FFFF (65,536 code points)
  – Basic Multilingual Plane
• U+10000 to U+10FFFF (1,048,576 code points)
  – Supplementary Planes

examples :
• U+0041 – LATIN CAPITAL LETTER A (A)
• U+0430 – CYRILLIC SMALL LETTER A (а)
• U+10409 – DESERET CAPITAL LETTER SHORT AH (𐐉)
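
The ranges and examples above can be checked with a short script. This is a minimal sketch in Python (not 4D code); the helper name `describe` is made up for illustration, and the character names come from whatever Unicode database ships with the Python interpreter.

```python
import unicodedata


def describe(cp: int) -> str:
    """Return 'U+XXXX NAME (plane N)' for a code point (hypothetical helper)."""
    name = unicodedata.name(chr(cp))
    plane = cp >> 16  # plane 0 is the Basic Multilingual Plane
    return f"U+{cp:04X} {name} (plane {plane})"


# The three example code points from the slide
for cp in (0x0041, 0x0430, 0x10409):
    print(describe(cp))

# Size of each range: last code point minus first, plus one
ascii_count = 0x007F - 0x0000 + 1
bmp_count = 0xFFFF - 0x0000 + 1
supplementary_count = 0x10FFFF - 0x10000 + 1
print(ascii_count, bmp_count, supplementary_count)
```

Note that the Latin and Cyrillic letters look alike on screen but sit at different code points, and that the Deseret letter lies outside plane 0.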

Non Characters - Length = 1, value=""
• Non Characters
  – for internal software use
• Combining Characters
  – diacritical marks, etc.
• Surrogate Pairs
  – represent a supplementary code point

examples :
• U+FEFF – BYTE ORDER MARK
• U+0061, U+0301 – LATIN SMALL LETTER A, COMBINING ACUTE ACCENT (á)
• U+D846, U+DD15 – surrogate pair for CJK UNIFIED IDEOGRAPH-21915 (U+21915)

Glyphs - appearance on screen or paper
• code points : font or style independent
  – characters are organized by name, not shape
• code points : script dependent
  – characters are organized by meaning, not shape
• code point : CJK Unified Ideograph
  – Chinese, Japanese, Korean share the same code
  – Simplified and Traditional Chinese are distinct

examples :
• U+0061 – LATIN SMALL LETTER A (a)
• U+0430 – CYRILLIC SMALL LETTER A (а)
• U+8A9E – CJK UNIFIED IDEOGRAPH-8A9E (語)
• U+8BED – CJK UNIFIED IDEOGRAPH-8BED (语)

Combining Characters - part of a letter
• Diacritical Marks
  – acute, grave, kasra, shadda, dakuten, handakuten
• Hangeul (Korean)
  – choseong, jungseong, jongseong

examples :
• U+0061, U+0323, U+0302
  – LATIN SMALL LETTER A
  – COMBINING DOT BELOW
  – COMBINING CIRCUMFLEX ACCENT
• U+0633, U+064E, U+0651
  – ARABIC LETTER SEEN
  – ARABIC FATHA
  – ARABIC SHADDA
• U+3046, U+3099
  – HIRAGANA LETTER U
  – COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK

examples :
• U+1100, U+1161, U+11B9
  – HANGUL CHOSEONG KIYEOK
  – HANGUL JUNGSEONG A
  – HANGUL JONGSEONG PIEUP-SIOS

Precomposed Characters - ready made
• alphabet with diacritical marks
• precomposed Hangeul (Korean)

examples :
• U+0041, U+0301
  – LATIN CAPITAL LETTER A
  – COMBINING ACUTE ACCENT
• U+00C1
  – LATIN CAPITAL LETTER A WITH ACUTE
• U+1100, U+1161, U+11B9
  – HANGUL CHOSEONG KIYEOK
  – HANGUL JUNGSEONG A
  – HANGUL JONGSEONG PIEUP-SIOS
• U+AC12
  – HANGUL SYLLABLE GABS
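
The precomposed Hangul syllable in the last example is not an arbitrary assignment: it follows from the arithmetic the Unicode Standard defines for composing conjoining jamo. The sketch below, in Python rather than 4D, reproduces that arithmetic; the function name `compose_hangul` is hypothetical, and the result is cross-checked against Python's own NFC normalization.

```python
import unicodedata

# Constants from the Unicode Standard's Hangul syllable composition algorithm
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28


def compose_hangul(lead: str, vowel: str, trail: str = "") -> str:
    """Compose choseong + jungseong (+ optional jongseong) into one syllable."""
    l_index = ord(lead) - L_BASE
    v_index = ord(vowel) - V_BASE
    t_index = ord(trail) - T_BASE if trail else 0
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


jamo = "\u1100\u1161\u11B9"      # KIYEOK, A, PIEUP-SIOS
syllable = compose_hangul(*jamo)  # one precomposed code point
print(f"U+{ord(syllable):04X}", unicodedata.name(syllable))

# The library's NFC normalization should agree with the hand-rolled formula
assert unicodedata.normalize("NFC", jamo) == syllable
```

Three code points go in and one comes out, which is why the same visible syllable can have a Length of either 3 or 1.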

In order to handle Unicode correctly, you need to...
• Understand text evaluation rules
  – comparison operators or Match regex
• Know the exact number of letters
  – Length = the number of Code Points
• Recognize unbreakable sequences
  – Substring can split Surrogates
  – Character code can return just a fraction of a character
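
To make the counting and splitting problems concrete, here is a small Python sketch. It is an analogy, not 4D code: Python strings count code points, so UTF-16 code units are obtained by encoding first. The helper names `utf16_units` and `first_unit_as_text` are invented for this illustration.

```python
def utf16_units(text: str) -> int:
    """Number of 16-bit code units the text occupies in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def first_unit_as_text(text: str):
    """Cut the text after its first UTF-16 code unit, as a naive substring
    on a UTF-16 buffer would, and return None if the cut leaves half a
    surrogate pair behind."""
    raw = text.encode("utf-16-le")[:2]
    try:
        return raw.decode("utf-16-le")
    except UnicodeDecodeError:
        return None


accented = "a\u0301"      # one visible letter, two code points
ideograph = "\U00021915"  # one code point, two UTF-16 code units

print(len(accented), utf16_units(accented))    # code points, code units
print(len(ideograph), utf16_units(ideograph))  # code points, code units
print(first_unit_as_text("a"))                 # a whole character survives
print(first_unit_as_text(ideograph))           # None: the pair was split
```

The "number of letters" therefore depends on what is being counted: visible letters, code points, or code units.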

Normalization
• NFD – decompose as much as possible
• NFC – decompose, then re-compose
• NFKC, NFKD – factor in compatibility characters
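
The effect of each form is easiest to see on a concrete string. The following is a minimal Python sketch using the standard `unicodedata` module (4D's own normalization commands are not shown here); the helper `same_text` is a hypothetical name for "compare after normalizing".

```python
import unicodedata

decomposed = "A\u0301"  # LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT
precomposed = "\u00C1"  # LATIN CAPITAL LETTER A WITH ACUTE


def same_text(a: str, b: str) -> bool:
    """Compare two strings after bringing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# A raw code point comparison sees two different strings
print(decomposed == precomposed)
# After normalization they are the same text
print(same_text(decomposed, precomposed))

nfc = unicodedata.normalize("NFC", decomposed)   # re-composed: 1 code point
nfd = unicodedata.normalize("NFD", precomposed)  # decomposed: 2 code points

# Compatibility forms also fold characters such as FULLWIDTH LATIN CAPITAL
# LETTER A (U+FF21) and LATIN SMALL LIGATURE FI (U+FB01)
compat = "\uFF21\uFB01"
nfkc = unicodedata.normalize("NFKC", compat)
print(len(nfc), len(nfd), nfkc)
```

NFC and NFD leave the compatibility characters alone; only the "K" forms replace them.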

String Comparison
Numbers in parentheses indicate the Length of the string (in Code Points)

Surrogate Pairs (UTF-16 specific)
• upper (high) surrogate – 0xD800 - 0xDBFF
• lower (low) surrogate – 0xDC00 - 0xDFFF
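
The mapping between a supplementary code point and its surrogate pair is plain bit arithmetic defined by UTF-16. Below is a short Python sketch of it; the function names `to_surrogates` and `from_surrogates` are chosen for this example and are not part of any 4D API.

```python
def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into (upper, lower) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary code points use surrogate pairs")
    offset = cp - 0x10000                # 20-bit value
    upper = 0xD800 + (offset >> 10)      # top 10 bits
    lower = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return upper, lower


def from_surrogates(upper: int, lower: int) -> int:
    """Recombine a surrogate pair into the code point it represents."""
    return 0x10000 + ((upper - 0xD800) << 10) + (lower - 0xDC00)


upper, lower = to_surrogates(0x21915)
print(f"U+{upper:04X}, U+{lower:04X}")
print(f"U+{from_surrogates(upper, lower):04X}")
```

Each surrogate carries 10 bits, which is exactly why the supplementary planes hold 1,048,576 (2 to the 20th) code points.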

Extended Japanese
• 4D 2004
  – JIS X 0208
  – 6,879 characters
• 4D v11 SQL
  – JIS X 0213
  – 11,233 characters
the lower stroke is longer than the upper; this is a disfigured character only available in v11 SQL
the dot on the top right is unnecessary; this is a disfigured character only available in v11 SQL

Extended Chinese
• A+ – GB18030 support with ethnic minority scripts
• A – GB18030 support without ethnic minority scripts
• B – updated product that meets the A standard
• C – not in conformity with GB18030 (uncertified)
source : http://www.lisa.org/globalizationinsider/2002/05/a_look_at_china.html
4D v11 SQL (UTF-16 with surrogate pairs) = A+
4D 2004 (GB2312/Big5) = C
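
One practical difference between the legacy encodings and GB18030 is coverage of supplementary-plane characters. The Python sketch below illustrates this with the codecs bundled in the standard library; it says nothing about 4D's internals or about the certification levels themselves, and the helper name `can_encode` is hypothetical.

```python
def can_encode(text: str, encoding: str) -> bool:
    """True if every character of the text exists in the target encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


ideograph = "\U00021915"  # supplementary-plane ideograph (a surrogate pair in UTF-16)

for encoding in ("gb2312", "big5", "gb18030"):
    print(encoding, can_encode(ideograph, encoding))

# GB18030 represents supplementary characters in four bytes and round-trips them
gb_bytes = ideograph.encode("gb18030")
assert gb_bytes.decode("gb18030") == ideograph
```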

Regular Expressions
• \x – match code point (cf. Char)
• .* – match zero or more of any character (cf. @)
• \p – match property (no equiv.)
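
As a rough illustration of these three patterns, here is a sketch using Python's standard `re` module as a stand-in for the regex engine behind 4D's Match regex. Two assumptions to keep in mind: escape syntax differs slightly between engines, and Python's `re` does not support `\p`, so a property test is approximated with a hypothetical helper `is_letter` built on `unicodedata.category`.

```python
import re
import unicodedata

# \x41 matches the code point U+0041 (the letter A)
starts_with_a = re.compile(r"\x41.*")

# ".*" matches zero or more characters, so "A" alone is accepted...
print(bool(starts_with_a.fullmatch("A")))
print(bool(starts_with_a.fullmatch("Abc")))
# ...whereas ".+" would require at least one more character
print(bool(re.fullmatch(r"\x41.+", "A")))

# \uXXXX addresses code points beyond two hex digits in Python's syntax
print(bool(re.fullmatch(r"\u0430", "\u0430")))  # CYRILLIC SMALL LETTER A


def is_letter(ch: str) -> bool:
    """Stand-in for a \\p{L} property test: general category starts with 'L'."""
    return unicodedata.category(ch).startswith("L")


print(is_letter("\u8A9E"), is_letter("\u0301"))
```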

Enhanced Functions
• Position with *
  – count Non Characters as 1 letter
  – diacritic sensitive search
• Lowercase with *
  – keep diacritical marks
• Uppercase with *
  – keep diacritical marks

Plugin API v11
• Unichar* – null terminated buffer of UTF-16
• Unistring – structure of Unichar* and its length
• evk_ArrayUnicode – array of Unistring
• What if Unicode mode is turned off?
  – doesn't matter

Plugin API 2004
• Compatible with v11 SQL
• What if Unicode mode is turned on?
  – doesn't matter

Pasteboard
• GET PASTEBOARD DATA
  – text (Unicode or legacy encoding)
  – file path as URL
• GET FILE FROM PASTEBOARD
  – multiple file paths in system representation
• GET TEXT FROM PASTEBOARD
  – "Mac" encoding

Conversion to and from BLOB
• CONVERT FROM TEXT
  – specify target encoding
• TEXT TO BLOB
  – "Mac" encoding
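
Why the choice of encoding matters can be shown by encoding the same text several ways. This Python sketch is an analogy only: `to_blob` is a hypothetical helper, not the 4D command, and Mac Roman is assumed here as a representative "Mac" encoding (on a Japanese system the legacy Mac encoding would be a different one).

```python
def to_blob(text: str, encoding: str) -> bytes:
    """Encode text into bytes using an explicitly chosen target encoding."""
    return text.encode(encoding)


text = "\u00E9"  # LATIN SMALL LETTER E WITH ACUTE

utf8 = to_blob(text, "utf-8")       # two bytes
utf16 = to_blob(text, "utf-16-le")  # one 16-bit code unit
mac = to_blob(text, "mac_roman")    # one byte in the legacy encoding
print(utf8, utf16, mac)

# A legacy single-byte encoding cannot represent every Unicode character
try:
    to_blob("\u8A9E", "mac_roman")
    legacy_ok = True
except UnicodeEncodeError:
    legacy_ok = False
print(legacy_ok)
```

The same character yields three different byte sequences, so whoever reads the BLOB back has to know which encoding was used to write it.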

PROCESS HTML TAGS
• To create Web Page/XML...
  – use C_BLOB
  – declare charset="whatever"
• To create UTF-16 text...
  – use C_TEXT
• Conversion affected by...
  – Database parameter #17 (Character set)

Upgrading from 4D 2004
• single language throughout the DB
  – conversion automatic
• multiple languages within the DB
  – create MultiLang.TXT

4D v11 SQL Unicode Compatibility
end of presentation
