Special characters and where to find them

There are 23 official languages within the European Union, and many, if not all, of them have special characters. German, for example, has umlauts (“ä”, “ö”, “ü”) and the “ß”; other languages have more. Many characters exist both as a single pre-composed character and as a sequence of two characters (a base letter plus a combining mark). Using the two-character form can break search, spell checking, and the transliteration used for slugs. In a filename it can also break images, depending on server configuration and browser.

Presentation from WordCamp Europe 2019.

Torsten Landsiedel

June 21, 2019

Transcript

  1. SPECIAL CHARACTERS
    AND WHERE TO FIND THEM
    WORDCAMP EUROPE 2019 // BERLIN
    Ümlaut, Baby!
    Torsten Landsiedel // @zodiac1978

  2. WHO AM I?
    @zodiac1978
    01. SINCE 2012: Full-time WordPress freelancer
    02. 2009–2017: General Translation Editor (for German)
    03. 2013–2018: Support forum moderator (for German)

  3. WHY ARE WE HERE?
    There are 23 official languages WITHIN THE EUROPEAN UNION
    Do we really want to support them?

  4. NON-ENGLISH INSTALLATIONS
    48.5%

  5. CLASSIC EDITOR

  6. CLASSIC EDITOR
    Popup with special characters

  7. OKAY AND NOW?
    Cat pics?

  8. BROKEN FONTS

  9. BROKEN FONTS

  10. DISABLE
    VIA TRANSLATION

  11. WHAT AM I TALKING ABOUT?
    Normalization
    SUBSETS
    FONTS
    LOCALES
    COLLATION
    ENCODING
    CHARACTER SETS

  12. THIS IS A LITTLE STORY ABOUT …
    WAPUU THE BEAR

  13. THIS IS A LITTLE STORY ABOUT …
    WAPUU THE BEAR
    bear = Bär (in German)

  14. THIS IS A LITTLE STORY ABOUT …
    WAPUU THE BEAR
    ä is an Umlaut

  15. THIS IS A LITTLE STORY ABOUT …
    WAPUU THE BEAR
    ä = ASCII 228 [strictly, code point 228 (U+00E4) in Latin-1 and Unicode; 7-bit ASCII ends at 127]

  16. THIS IS A LITTLE STORY ABOUT …
    WAPUU THE BEAR
    ä ≠ a + ¨

  17. ä ≠ a + ¨
    (ä) ≠ (a) + (¨)

  18. ä
    a + ¨
    Combining Diaeresis

  19. ä
    a + ¨
    Combining Diaeresis
    UTF-8: 0xCC 0x88 (cc88)

  20. ä
    a + ¨
    Combining Diaeresis
    UTF-8: 0xCC 0x88 (cc88)
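
The difference between the two spellings of "ä" can be inspected directly. The following is a minimal sketch in Python, which is not part of the talk and is used here only for illustration. It compares the precomposed character with the sequence "a" plus combining diaeresis and prints the UTF-8 bytes of each.

```python
import unicodedata

precomposed = "\u00e4"   # ä as a single code point (U+00E4, decimal 228)
decomposed = "a\u0308"   # a followed by U+0308 COMBINING DIAERESIS

# Both render as "ä", but they are different strings of different length.
print(precomposed == decomposed)          # False
print(len(precomposed), len(decomposed))  # 1 2

# UTF-8 bytes: the combining mark is the 0xCC 0x88 shown on the slide.
print(precomposed.encode("utf-8").hex())  # c3a4
print(decomposed.encode("utf-8").hex())   # 61cc88
print(unicodedata.name("\u0308"))         # COMBINING DIAERESIS
```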

  21. Even though UTF-8 provides a single byte sequence
    for each character sequence, the existence of
    multiple character sequences for "the same thing"
    may have security consequences
    whenever string matching, indexing, searching,
    sorting, regular expression
    matching and selection are involved.

    RFC 3629
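
Two of the operations named in the quote, sorting and selection, can be shown in a few lines. This is an illustrative Python sketch; the word "Bob" and the helper name `nfc` are made up for the example.

```python
import unicodedata

typed = "B\u00e4r"    # precomposed ä
pasted = "Ba\u0308r"  # a + combining diaeresis; looks identical on screen

# Selection/deduplication: both spellings survive as separate entries.
print(len({typed, pasted}))  # 2

# Sorting by code point: the position relative to "Bob" depends on the form.
print(sorted(["Bob", typed]) == ["Bob", typed])    # True
print(sorted(["Bob", pasted]) == [pasted, "Bob"])  # True

def nfc(text):
    """Return text in Normalization Form C."""
    return unicodedata.normalize("NFC", text)

# After normalizing, the two spellings collapse into one value.
print(len({nfc(typed), nfc(pasted)}))  # 1
```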

  22. UNICODE NORMALIZATION FORMS
    NORMALIZATION FORM   Description
    D (NFD)              Canonical Decomposition
    C (NFC)              Canonical Decomposition, followed by Canonical Composition
    KD (NFKD)            Compatibility Decomposition
    KC (NFKC)            Compatibility Decomposition, followed by Canonical Composition
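
To make the four forms concrete, here is a small Python sketch using the standard `unicodedata` module. The sample strings and the helper name `forms` are chosen for illustration only.

```python
import unicodedata

FORMS = ("NFD", "NFC", "NFKD", "NFKC")

def forms(text):
    """Return the text in each of the four normalization forms."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}

samples = {
    "decomposed Baer": "Ba\u0308r",  # a + combining diaeresis
    "fi ligature": "\ufb01",         # U+FB01 LATIN SMALL LIGATURE FI
}

for label, text in samples.items():
    for form, result in forms(text).items():
        # Print code points so invisible differences become visible.
        print(label, form, [f"U+{ord(ch):04X}" for ch in result])
```

The canonical forms (NFD, NFC) only switch between the composed and decomposed spelling of "ä" and leave the ligature alone. The compatibility forms (NFKD, NFKC) additionally replace the ligature with the two letters "fi".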

  23. However, it can be difficult for users to assure
    that a given resource or set of resources
    uses a consistent textual representation
    because the differences are
    usually not visible when viewed as text.

    WWW.W3.ORG/TR/CHARMOD-NORM/#UNICODENORMALIZATION

  24. Tools [like WordPress] and implementations thus
    need to consider the difficulties
    experienced by users when
    visually or logically equivalent strings
    that "ought to" match (in the user's mind)
    are considered to be distinct values.

    WWW.W3.ORG/TR/CHARMOD-NORM/#UNICODENORMALIZATION

  25. Providing a means for users
    to see these differences and/or
    normalize them as appropriate
    makes it possible for end users
    to avoid failures that spring from
    invisible differences in their source documents.
    For example, the W3C Validator
    warns when an HTML document is
    not fully in Unicode Normalization Form C.

    WWW.W3.ORG/TR/CHARMOD-NORM/#UNICODENORMALIZATION
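
A check of this kind is easy to approximate. The sketch below is not the W3C Validator's implementation. It is a hypothetical Python helper, `non_nfc_lines`, that reports which lines of a text change when normalized to Form C.

```python
import unicodedata

def non_nfc_lines(text):
    """Return the 1-based numbers of lines that are not fully in NFC."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if unicodedata.normalize("NFC", line) != line
    ]

document = "B\u00e4r\nBa\u0308r\nBear"  # line 2 uses a + combining diaeresis
print(non_nfc_lines(document))  # [2]
```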

  26. But
    WHY?

  27. And then picking NFD normalization –
    and making it visible,
    and actively converting correct unicode
    into that absolutely horrible format,
    that’s just inexcusable.
    Even the people who think
    normalization is a good thing admit that
    NFD is a bad format,
    and certainly not for data exchange.
    It’s not even “paste-eater” quality thinking.
    It’s actually actively corrupting user data.
    By design. Christ.

    LINUS TORVALDS

  28. HFS Plus converts all file names
    to decomposed Unicode,
    while Macintosh keyboards generally
    produce precomposed Unicode.
    This isn't a problem as long as you
    use system-provided APIs to process text.
    Apple's APIs correctly handle both
    precomposed and decomposed Unicode.
    TECHNICAL Q&A QA1235 (APPLE)

  29. How CAN THIS BE SOLVED?

  30. PHP
    Normalizer::normalize
    Normalizes the input provided
    and returns the normalized string
    PHP 5 ≥ 5.3.0, PHP 7, PECL intl ≥ 1.0.0

  31. JAVASCRIPT
    String.prototype.normalize()
    The normalize() method returns the specified
    Unicode Normalization Form of the string.
    It does not affect the value of the string itself.
    ECMAScript 2015 (6th Edition, ECMA-262)

  32. JAVASCRIPT
    Supported in modern browsers

  33. What is the STATUS QUO?

  34. BLOCK EDITOR/GUTENBERG (JS)
    html = html.normalize();
    BUT, only for the content (#14178)
    and no polyfill for older browsers (#13157)
    GitHub issue numbers

  35. BLOCK EDITOR/GUTENBERG (JS)
    LET’S TRY THAT …
    ✴ Dictionary/proofreading fails
    ✴ Transliteration (Bär -> Baer) does not work:
      Slug is showing %cc%88
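
The slug symptom can be reproduced outside WordPress. The following Python sketch is not WordPress's slug code. The `TRANSLIT` table and the `make_slug` function are hypothetical stand-ins that only show why a decomposed "ä" slips past a transliteration table keyed on the precomposed character.

```python
import unicodedata
from urllib.parse import quote

# Hypothetical German transliteration table, keyed on precomposed characters.
TRANSLIT = {"\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue", "\u00df": "ss"}

def make_slug(title, normalize=True):
    """Lowercase, transliterate, and percent-encode a title."""
    if normalize:
        title = unicodedata.normalize("NFC", title)
    title = title.lower()
    for char, replacement in TRANSLIT.items():
        title = title.replace(char, replacement)
    return quote(title).lower()

pasted = "Ba\u0308r"  # a + combining diaeresis

print(make_slug(pasted, normalize=False))  # ba%cc%88r
print(make_slug(pasted))                   # baer
```

Without normalization the table never sees a precomposed "ä", so the combining mark is left in place and ends up percent-encoded as %cc%88.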

  36. BLOCK EDITOR/GUTENBERG (JS)
    MAYBE AFTER SAVE?
    ✴ Editor still shows “wrong” permalink

  37. BLOCK EDITOR/GUTENBERG (JS)
    INTERNAL SEARCH BROKEN
    ✴ Search for “Bär” should have found the post
      with the title “Bär” but is showing “No posts found”
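
The failed search can be modeled with a plain substring match. This Python sketch uses a made-up list of titles and a hypothetical helper `search_titles`; it is not how WordPress searches its database.

```python
import unicodedata

def search_titles(query, titles):
    """Case-insensitive substring search that ignores normalization form."""
    def key(text):
        return unicodedata.normalize("NFC", text).casefold()
    return [title for title in titles if key(query) in key(title)]

titles = ["Ba\u0308r", "Hello world"]  # stored title uses a + combining diaeresis
query = "B\u00e4r"                     # typed query uses precomposed ä

# A naive match finds nothing, which corresponds to "No posts found".
print([title for title in titles if query in title])  # []

# Normalizing both sides first finds the post.
print(len(search_titles(query, titles)))  # 1
```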

  38. BLOCK EDITOR/GUTENBERG (JS)
    BROWSER SEARCH BROKEN (FIREFOX)

  39. BLOCK EDITOR/GUTENBERG (JS)
    BROWSER SEARCH WORKS (CHROME)

  40. BLOCK EDITOR/GUTENBERG (JS)
    BROWSER SEARCH BROKEN! (SAFARI)

  41. BLOCK EDITOR/GUTENBERG (JS)
    WEIRD BEHAVIOUR
    ✴ This happens on first pasting and publishing

  42. BLOCK EDITOR/GUTENBERG (JS)
    WEIRD BEHAVIOUR
    ✴ This happens after deleting the slug and re-generating it

  43. Is this PLUGIN TERRITORY?

  44. IS THIS PLUGIN TERRITORY?

  45. IS THIS PLUGIN TERRITORY?

  46. Talking about FILENAMES
    French screenshot name:
    Capture d’écran 2013-03-30 à 11.36.32.png
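
One way a filename like this can break an image is that the URL built from the composed name differs from the URL of the decomposed name on disk. The Python sketch below assumes, for illustration, a file system that stores names decomposed (as described for HFS Plus) and a link that uses the precomposed spelling.

```python
import unicodedata
from urllib.parse import quote

# The file name from the slide, written here with precomposed é and à.
name = "Capture d\u2019\u00e9cran 2013-03-30 \u00e0 11.36.32.png"

in_link = unicodedata.normalize("NFC", name)  # spelling used in a link
on_disk = unicodedata.normalize("NFD", name)  # spelling a decomposing file system stores

print(quote(in_link))
print(quote(on_disk))

# The percent-encoded paths differ, so a byte-wise lookup would miss the file.
print(quote(in_link) == quote(on_disk))  # False
```

In the first URL "é" is encoded as %C3%A9; in the second it becomes "e" followed by %CC%81. Whether the request still succeeds then depends on the server and the file system, which matches the "some server configuration and browsers" caveat in the talk description.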

  47. How do we solve this IN THE FUTURE?

  48. TRAC IDEAS SO FAR
    01. ADDITIONAL FILTERS: Using the PHP normalizer function (if available)
    02. REGEX POLYFILLS/FALLBACKS: If PHP normalizer function not available
    03. GLOBAL JS/PASTE SOLUTION: Current solution is limited to the editor
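
The second idea, a regex-based fallback for environments without a normalizer, might look roughly like the sketch below. It is written in Python for illustration rather than PHP, and the table `COMPOSE` and the function `compose_fallback` are hypothetical. It covers only the German umlauts, whereas a real fallback would need a far larger table.

```python
import re

# Hypothetical fallback table: decomposed sequence -> precomposed character.
COMPOSE = {
    "a\u0308": "\u00e4", "o\u0308": "\u00f6", "u\u0308": "\u00fc",
    "A\u0308": "\u00c4", "O\u0308": "\u00d6", "U\u0308": "\u00dc",
}

# One alternation that matches any decomposed sequence in the table.
PATTERN = re.compile("|".join(re.escape(sequence) for sequence in COMPOSE))

def compose_fallback(text):
    """Replace known decomposed sequences with their precomposed form."""
    return PATTERN.sub(lambda match: COMPOSE[match.group(0)], text)

print(compose_fallback("Ba\u0308r") == "B\u00e4r")  # True
```

A table like this only approximates NFC for the listed characters. Anything outside the table passes through unchanged.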

  49. Questions
    PLEASE!

  50. Thänk you!

  51. PICTURE CREDITS
    #2 Photo by Harrison Moore on Unsplash
    #2 Photo by Jonas Jacobsson on Unsplash
    #3 Photo by Adi Goldstein on Unsplash
    #4 https://wordpress.org/about/stats/
    #5+6 Photo by Nad X on Unsplash
    #8 Thanks to Michael Schäfer!
    #9 Screenshot from https://walktowc.eu/
    #11 Photo by Darius Bashar on Unsplash
    #12-16 Wapuu the BER
    #22 Photo by Jon Tyson on Unsplash
    #26 Photo by Javier García on Unsplash
    #27 Wikimedia: Krd CC BY-SA 3.0
    #29 Photo by Nick Fewings on Unsplash
    #33 Photo by Conor Samuel on Unsplash
    #43 Photo by Ivana Milakovic on Unsplash
    #46 Photo by Aarón Blanco Tejedor on Unsplash
    #47 Photo by Bruce Warrington on Unsplash
    #48 Photo by Mr TT on Unsplash
    #48 Photo by Chris Barbalis on Unsplash
    #48 Photo by Ashim D’Silva on Unsplash
    #49 Photo by Cris DiNoto on Unsplash
    #50 Photo by Chris Barbalis on Unsplash