Limit of code point for grapheme cluster in programming language side.

てきめん tekimen

February 23, 2026

Transcript

  1. Trap
    • How many grapheme clusters is the sequence below?
      – 👨‍👩‍👧‍👧‍$$
    • It may look like 3, but the correct answer is 1.
      – All of the code points are joined with ZWJ.
      – The cursor moves across it in a single step.
    • It renders as several glyphs because the font has no corresponding glyph.
      – Unicode publishes the allowed list in emoji-zwj-sequences.txt.
    • https://unicode.org/Public/17.0.0/emoji/emoji-zwj-sequences.txt
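To make the role of ZWJ concrete, here is a minimal Python sketch that lists the code points of a ZWJ sequence. The string `family` is a stand-in chosen for illustration (man, woman, girl, girl joined by U+200D), not the exact sequence on the slide, and the variable names are hypothetical.

```python
# A stand-in ZWJ sequence (assumption: not the exact string from the slide).
ZWJ = "\u200d"  # ZERO WIDTH JOINER
family = ZWJ.join(["\U0001F468", "\U0001F469", "\U0001F467", "\U0001F467"])

# Python's len() counts code points, not grapheme clusters.
code_points = [f"U+{ord(ch):04X}" for ch in family]

# Splitting on ZWJ recovers the individual emoji that a renderer may
# fall back to when the font has no glyph for the whole sequence.
parts = family.split(ZWJ)

print(len(family))  # 7 code points: 4 emoji + 3 ZWJ
print(len(parts))   # 4 separate emoji
print(code_points)
```

Python has no grapheme cluster API in its standard library, so this only shows the code point side of the count.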
  2. Emoji bomb
    • It looks like 👨‍👦‍👦.
    • It is actually 200 MB, yet it is still 1 grapheme cluster.
      – Let's call it an Emoji Bomb, not to be confused with the Bomb Emoji. 💣️ 💣️
    • In addition, it cannot be shown here, because merely displaying it on the screen causes a crash.
  3. Emoji bomb: Compare Swift and PHP
    • Read emoji_bomb.txt in Swift and PHP.
    • Both give the same result: 1 grapheme cluster and 59999999 code points.
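The slides do not say how emoji_bomb.txt is built. One layout that is consistent with both figures (59999999 code points and roughly 200 MB) is 30,000,000 four-byte emoji joined by ZWJ. The Python sketch below checks that arithmetic under this assumption; `bomb_stats` and the choice of the BOY emoji are hypothetical.

```python
ZWJ = "\u200d"      # ZERO WIDTH JOINER, 3 bytes in UTF-8
BOY = "\U0001F466"  # BOY emoji, 4 bytes in UTF-8 (assumed filler emoji)


def bomb_stats(n_emoji):
    """Return (code points, UTF-8 bytes) for n_emoji emoji joined by ZWJ."""
    code_points = 2 * n_emoji - 1
    utf8_bytes = 4 * n_emoji + 3 * (n_emoji - 1)
    return code_points, utf8_bytes


# Check the formula against a small string actually built in memory.
small = ZWJ.join([BOY] * 30)
assert bomb_stats(30) == (len(small), len(small.encode("utf-8")))

# Full-size figures, computed without allocating the string.
print(bomb_stats(30_000_000))
```

Under this assumption the full-size file would hold 59999999 code points in 209,999,997 bytes, which is about 200 MiB.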
  4. Zalgo text
    • ä̈̈̈̈̈̈̈̈̈̈
      – Zalgo text. Example: ư̵̧̡̥̙̭̿̈̀̒̐̊͒͑ H̵̛͕̞̦̰̜͍̰̥̟͆̏͂̌͑ͅ ä̷͔̟͓̬̯̟͍̭͉͈̮͙̣̯̬͚̞̭̍̀̾͠m̴̡̧̛̝̯̹̗̹̤̲̺̟̥̈̏͊̔̑̍͆̌̀̚͝͝b̴̢̢̫̝̠̗̼̬̻̮̺̭͔̘͑̆̎̚ r̷̡̡̲̼̖͎̫̮̜͇̬͌͘g̷̹͍͎̬͕͓͕̐̃̈́̓̆̚͝ẻ̵̡̼̬̥̹͇̭͔̯̉͛̈́̕r̸̮̖̻̮̣̗͚͖̝̂͌̾̓̀̿̔̀͋̈́͌̈́̋͜
    • No limits.
      – People on social networks play with Zalgo text.
      – As with emoji, a single grapheme cluster can carry a large number of code points.
    • In addition, Symfony replaced grapheme_strlen with mb_strlen.
      – https://github.com/symfony/symfony/pull/13527/files
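One possible defense against Zalgo-style input is to cap the number of combining marks that may follow a base character. The Python sketch below is an illustration only: `cap_marks` and the cap value are hypothetical, and a cap that is too low would damage legitimate text in scripts that stack several marks.

```python
import unicodedata


def cap_marks(text, max_marks):
    """Keep at most max_marks combining marks after each base character."""
    out = []
    run = 0  # marks seen since the last non-mark character
    for ch in text:
        if unicodedata.category(ch).startswith("M"):
            run += 1
            if run > max_marks:
                continue  # drop the excess mark
        else:
            run = 0
        out.append(ch)
    return "".join(out)


zalgo = "a" + "\u0308" * 10  # 'a' followed by ten COMBINING DIAERESIS
print(len(zalgo))                # 11 code points
print(len(cap_marks(zalgo, 2)))  # 3 code points
```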
  5. Practical code points
    • Going by UTS #51 and UAX #15, I think a limit of 32 code points is practical.
      – https://unicode.org/reports/tr51/#valid-emoji-tag-sequences
        • Emoji
      – https://unicode.org/reports/tr15/#Stream_Safe_Text_Format
        • NFKD
    • The longest code point sequence in a human language is “Hakṣhmalawarayaṁ (ཧྐྵྨླྺྼྻྂ)”: 9 code points (1 base character + 8 combining characters).
      – https://stackoverflow.com/questions/11978912/how-to-protect-against-diacritics-such-as-zalgo-text
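As a rough illustration of the UAX #15 side of this, the Python sketch below checks the Stream-Safe Text Format condition: after NFKD normalization, no run of non-starters (characters with a non-zero combining class) may exceed 30. The function names are hypothetical, and the Tibetan stack is written as escapes according to my reading of the slide, which is an assumption.

```python
import unicodedata

MAX_NON_STARTERS = 30  # run length allowed by Stream-Safe Text Format


def longest_non_starter_run(text):
    """Longest run of non-zero combining class characters after NFKD."""
    longest = run = 0
    for ch in unicodedata.normalize("NFKD", text):
        if unicodedata.combining(ch) != 0:
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return longest


def is_stream_safe(text):
    return longest_non_starter_run(text) <= MAX_NON_STARTERS


# Tibetan stack from the slide, as escapes (assumed reading).
stack = "\u0f67\u0f90\u0fb5\u0fa8\u0fb3\u0fba\u0fbc\u0fbb\u0f82"
marks = sum(unicodedata.category(ch).startswith("M") for ch in stack)

print(len(stack), marks)                    # 9 code points, 8 marks
print(is_stream_safe("a" + "\u0301" * 30))  # True
print(is_stream_safe("a" + "\u0301" * 31))  # False
```

Note that most of the Tibetan subjoined letters have combining class 0, so the mark count above uses the general category rather than the combining class.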
  6. What is a “character”?
    • Is a character a code point, a byte, or a grapheme cluster?
    • It seems best to choose the unit that matches the requirements of the application.
      – In CJK, if you have to be careful with kanji, you need grapheme clusters.
        • Identity becomes important, such as for people's names and place names...
    • However, I favor the code point as the unit.
      – Grapheme cluster processing is very slow.
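The kanji point can be illustrated with an ideographic variation sequence, where a variation selector follows the base kanji. The Python sketch below uses U+845B followed by VARIATION SELECTOR-17 as an assumed example of a name character; cutting the string by code point silently drops the selector.

```python
import unicodedata

# Assumed example: a kanji followed by an ideographic variation selector.
name = "\u845b\U000E0100"  # U+845B + VARIATION SELECTOR-17

print(len(name.encode("utf-8")))  # 7 bytes
print(len(name))                  # 2 code points, shown as one character

# Cutting by code point keeps the base kanji but drops the selector,
# so the stored string no longer equals the original name.
truncated = name[:1]
print(truncated == name)          # False
print(unicodedata.category(name[1]))  # Mn (nonspacing mark)
```

An application that stores names would therefore need to truncate and compare on grapheme cluster boundaries, even if it counts length in code points elsewhere.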