
Time, Numbers, Text


How hard can it be to add two numbers together? How hard can it be to repeat the same task at the same time tomorrow? How hard can it be to sort a list of names, or even count how many letters there are in a given word?

We'll talk about the surprising complexity of data that we all deal with every day. This is something every developer should know but very few actually grasp.

By Андрей Листочкин (Andrey Listochkin)


Transcript


  2. @listochkin


  3. Ember


  4. 2018
    Looks like Vue
    Tons of components
    WASM
    SSR
    Preact speeds


  5. Awesome?


  6. You tell me!
    No UI in 3? years


  7. jQuery
    React
    <br/>


  8. You don’t need a UI


  9. Email
    Calendar
    CRM
    SMS
    Phone
    API



  11. Math
    Physics


  12. Programming
    Languages


  13. =>
    async


  14. conditions
    loops
    functions
    data-types


  15. Binary Hardware


  16. Reality <-> Digital


  17. UI


  18. Messy
    Inconsistent
    Illogical


  19. Humans


  20. Logical
    Wise
    Sentient
    Decision-makers


  21. Emotional
    Greasy
    Impulsive
    Reactionary


  22. Math
    Poetry
    Music
    Art


  23. Logical and Illogical
    Consistent and Inconsistent


  24. Programmers’ Minefield


  25. Human data


  26. Text
    Numbers
    Time
    Color & Sound
    Smell & Taste


  27. Logical yet inconsistent


  28. 2018


  29. 3 golden rules


  30. Store time in UTC


  31. Unicode and UTF-8


  32. Fixed-point decimals
    or integers for Money


  33. Store time in UTC
    Unicode and UTF-8
    Fixed-point decimals
    or integers for Money


  34. Bad News


  35. Bad News #1
    Not everybody does it


  36. Bad News #2
    These rules aren’t enough


  37. There was a bank


  38. Loans
    Principal + Interest
    Monthly Installments


  39. SMS Reminder


  40. Every month
    3 days before payment due
    at 10:00 am


  41. 1. Find the right date


  42. 29th -> 26th


  43. Move to 25th for February


  44. Except Leap Year


  45. Except when that date
    ends up being
    a weekend or a bank holiday

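The date rules on slides 41 to 45 can be sketched in JavaScript roughly as follows. This is a minimal sketch, not the bank's actual logic: the function name `reminderDate`, the `holidays` set of ISO dates, and the choice to move a blocked date earlier rather than later are all assumptions made for illustration.

```javascript
// Calendar arithmetic is done on UTC midnights so that DST shifts
// cannot move a date by an hour and change the day.

// Number of days in a month; `month` is 1-12.
function daysInMonth(year, month) {
  // Day 0 of the following month is the last day of this one.
  return new Date(Date.UTC(year, month, 0)).getUTCDate();
}

// Hypothetical helper: returns the reminder date as 'YYYY-MM-DD'.
function reminderDate(year, month, dueDay, holidays = new Set()) {
  // A loan due on the 29th-31st falls on the last day of a shorter month.
  const effectiveDueDay = Math.min(dueDay, daysInMonth(year, month));
  // Three days before the due date; day values <= 0 roll into the previous month.
  const date = new Date(Date.UTC(year, month - 1, effectiveDueDay - 3));
  const iso = (d) => d.toISOString().slice(0, 10);
  const isBlocked = (d) =>
    d.getUTCDay() === 0 || d.getUTCDay() === 6 || holidays.has(iso(d));
  // Assumption: a blocked date moves back to the previous working day.
  while (isBlocked(date)) {
    date.setUTCDate(date.getUTCDate() - 1);
  }
  return iso(date);
}

console.log(reminderDate(2020, 2, 29)); // leap year: 2020-02-26
console.log(reminderDate(2018, 2, 29)); // due on the 28th, the 25th is a Sunday: 2018-02-23
```

Finding the right day is only half of the problem: the result is a calendar date, and "10:00 am" still has to be resolved in the customer's time zone.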

  46. 2. Find the right time


  47. 10:00 am


  48. Time Zones


  49. Time Offset
    vs
    Time Zone



  52. UTC+2
    Finland
    Ukraine
    South Africa


  53. UTC+3: Finland, Ukraine
    UTC+2: South Africa


  54. Time Zones


  55. tzdata


  56. “Timestamp with timezone”
    is a lie


  57. Timestamp with timezone offset


  58. User time =
    Local DateTime + TimeZone Name

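To illustrate why the zone name matters more than the offset, here is a rough sketch that keeps a reminder as a local date-time plus an IANA zone name and resolves it to a UTC instant only when needed. `offsetMinutes` and `toInstant` are hypothetical helpers; the sketch assumes a runtime whose `Intl.DateTimeFormat` ships time zone data, and it deliberately ignores local times that are skipped or repeated around a DST change.

```javascript
// Offset of `timeZone` from UTC, in minutes, at a given instant.
function offsetMinutes(timeZone, instant) {
  const parts = new Intl.DateTimeFormat('en-US', {
    timeZone,
    hourCycle: 'h23',
    year: 'numeric', month: 'numeric', day: 'numeric',
    hour: 'numeric', minute: 'numeric', second: 'numeric',
  }).formatToParts(instant);
  const get = (type) => Number(parts.find((p) => p.type === type).value);
  // Re-read the zone's wall-clock fields as if they were UTC.
  const asUtc = Date.UTC(get('year'), get('month') - 1, get('day'),
                         get('hour'), get('minute'), get('second'));
  return Math.round((asUtc - instant.getTime()) / 60000);
}

// `local` is a wall-clock time such as '2018-07-01T10:00' (no offset stored).
function toInstant(local, timeZone) {
  const wall = Date.parse(local + ':00Z');
  // First guess uses the offset at the wall-clock value, then refine once.
  let guess = wall - offsetMinutes(timeZone, new Date(wall)) * 60000;
  guess = wall - offsetMinutes(timeZone, new Date(guess)) * 60000;
  return new Date(guess);
}

// Same local time, same zone name, different offsets in winter and summer.
console.log(toInstant('2018-01-15T10:00', 'Europe/Helsinki').toISOString());     // 2018-01-15T08:00:00.000Z
console.log(toInstant('2018-07-01T10:00', 'Europe/Helsinki').toISOString());     // 2018-07-01T07:00:00.000Z
console.log(toInstant('2018-07-01T10:00', 'Africa/Johannesburg').toISOString()); // 2018-07-01T08:00:00.000Z
```

Because the stored value is the local time and the zone name, a later tzdata update changes the resolved instant without any data migration.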

  59. Almost nobody does it today


  60. Location -> TimeZone


  61. APIs
    Local datasets


  62. Moving target


  63. tzdata
    2018c
    10+ releases a year
    Often bundled with other software


  64. App-tzdata
    Database-tzdata
    System-tzdata


  65. Time Math can be
    surprising!


  66. 1. Do it in a single place
    2. Update tzdata regularly


  67. Numbers


  68. Number Base


  69. Base 10
    0.5 0.111


  70. Base 10
    Financial Data


  71. 0.1 + 0.2 =
    0.30000000000000004


  72. True for almost all languages


  73. Intel
    1977


  74. Sign Fraction Exponent


  75. (1 + F) * 2^E

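The sign, fraction and exponent fields can be inspected directly, which shows why 0.1 cannot be stored exactly. The sketch below is illustrative only: `decodeDouble` is a hypothetical helper, and it handles normal 64-bit doubles only (not zero, subnormals, infinities or NaN).

```javascript
// Splits a double into the three fields of the (1 + F) * 2^E formula.
function decodeDouble(x) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, x);
  const bits = view.getBigUint64(0);
  const sign = Number(bits >> 63n);                        // 1 bit
  const exponent = Number((bits >> 52n) & 0x7ffn) - 1023;  // 11 bits, biased by 1023
  const fraction = Number(bits & 0xfffffffffffffn) / 2 ** 52; // 52 bits, scaled to [0, 1)
  return { sign, exponent, fraction };
}

const tenth = decodeDouble(0.1);
console.log(tenth.exponent);                                    // -4
console.log((1 + tenth.fraction) * 2 ** tenth.exponent === 0.1); // true
console.log(0.1 + 0.2 === 0.3);                                 // false
console.log(0.5 + 0.25 === 0.75);                               // true: sums of powers of two are exact
```

The stored fraction for 0.1 is the nearest 52-bit approximation of 0.6, so every operation on it starts from a value that is already slightly off.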

  76. IEEE 754 - 1985, 2008


  77. Money


  78. “Use fixed point numbers”


  79. + - *


  80. /


  81. %


  82. Installments


  83. (Interest + Premium) / Duration


  84. Typical durations:
    3, 6, 12, 24, 120, etc


  85. Nonoptimal for Base-2


  86. Rounding


  87. 10.265


  88. Round-half-up
    for all installments
    but the last one


  89. 10.265 => 10.27


  90. Skews the effective rate up
    Need to compensate


  91. Last installment

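One way to make the installments add up is to work in integer cents, round every installment but the last half-up, and let the last one absorb the difference. This is only a sketch under that assumption: `splitInstallments` is a hypothetical name, and a real loan schedule would also have to account for how interest accrues.

```javascript
// Splits a non-negative integer amount of cents into `count` installments.
function splitInstallments(totalCents, count) {
  const remainder = totalCents % count;
  const floorShare = (totalCents - remainder) / count; // exact integer division
  // Round half-up: bump the share when the remainder is at least half of `count`.
  const regular = floorShare + (2 * remainder >= count ? 1 : 0);
  // The last installment compensates for the rounding of all the others.
  const last = totalCents - regular * (count - 1);
  return [...Array(count - 1).fill(regular), last];
}

// 123.18 over 12 months: the exact share is 10.265, rounded half-up to 10.27.
const schedule = splitInstallments(12318, 12);
console.log(schedule[0]);                        // 1027
console.log(schedule[schedule.length - 1]);      // 1021
console.log(schedule.reduce((a, b) => a + b, 0)); // 12318
```

Eleven payments of 10.27 overshoot by 0.055 in total, which is why the last payment comes out at 10.21 rather than 10.27.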

  92. Round-half-to-even
    Banking rounding


  93. 10.265 => 10.26
    6 is even
    7 is odd


  94. Math.round()
    Always half-up

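`Math.round()` rounds ties toward positive infinity, so banker's rounding has to be written by hand. Below is a minimal sketch on integers (for example, thousandths of a currency unit divided by 10 to get cents). `divRoundHalfEven` is a hypothetical helper and assumes a non-negative integer numerator and a positive integer denominator.

```javascript
// Divides n by d and rounds the result half-to-even, using integers only.
function divRoundHalfEven(n, d) {
  const remainder = n % d;
  const quotient = (n - remainder) / d;
  if (2 * remainder > d) return quotient + 1; // above the midpoint
  if (2 * remainder < d) return quotient;     // below the midpoint
  return quotient % 2 === 0 ? quotient : quotient + 1; // tie: pick the even neighbour
}

console.log(Math.round(2.5));             // 3
console.log(Math.round(-2.5));            // -2 (ties go toward +Infinity)
console.log(divRoundHalfEven(10265, 10)); // 1026, i.e. 10.265 => 10.26
console.log(divRoundHalfEven(10275, 10)); // 1028, i.e. 10.275 => 10.28
```

Working on integers avoids a second trap: a literal such as 10.265 is itself not exactly representable in binary floating point, so a tie may no longer be a tie by the time it is rounded.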

  95. CSS


  96. Text


  97. Writing systems
    Right-to-left
    Top-to-bottom
    Graphemes vs Characters


  98. 6-bit => 7-bit (ASCII)
    8-bit CP-1251, koi8-u
    16-bit - GB, Big5, Shift-JIS


  99. Unicode


  100. Let’s encode every writing system


  101. 2 bytes should be enough


  102. Windows
    macOS
    iOS
    JavaScript


  103. 0 - 10FFFF


  104. UCS-2 => UTF-16
    Surrogate pairs


  105. 1101 10xx xxxx xxxx
    1101 11xx xxxx xxxx

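The two bit patterns above translate into a few lines of arithmetic. As a sketch (the helper name `toSurrogates` is made up for this example), a code point above U+FFFF is split into a high and a low surrogate like this:

```javascript
// Splits a supplementary code point (U+10000..U+10FFFF) into a UTF-16 surrogate pair.
function toSurrogates(codePoint) {
  const v = codePoint - 0x10000;     // 20 bits remain
  const high = 0xd800 | (v >> 10);   // 1101 10xx xxxx xxxx: top 10 bits
  const low = 0xdc00 | (v & 0x3ff);  // 1101 11xx xxxx xxxx: bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogates(0x1f468); // U+1F468 MAN
console.log(high.toString(16), low.toString(16)); // d83d dc68
console.log(String.fromCharCode(high, low) === String.fromCodePoint(0x1f468)); // true
```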

  106. UTF-8


  107. Linux
    Android
    Networking
    Databases


  108. iOS device
    Network
    Server


  109. UTF-8 <=> UTF-16


  110. UTF-16 => UTF-8


  111. Correct:
    2 surrogates (1 pair)
    1 codepoint
    Re-encode as 4-byte UTF-8


  112. Practical:
    Take each surrogate separately
    2 codepoints
    Re-encode as 2 * 3-byte UTF-8


  113. Invalid UTF-8 bytes
    CESU-8

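The difference between the correct and the "practical" conversion is easy to see in bytes. In this sketch, `cesu8Bytes` is a hypothetical helper that encodes each UTF-16 code unit on its own, which is how CESU-8 output comes about; `TextEncoder` is assumed to be available as a global, as it is in browsers and recent Node versions.

```javascript
const hex = (bytes) =>
  Array.from(bytes, (b) => b.toString(16).padStart(2, '0')).join(' ');

// Encodes every UTF-16 code unit separately, surrogates included.
function cesu8Bytes(str) {
  const out = [];
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit < 0x80) {
      out.push(unit);
    } else if (unit < 0x800) {
      out.push(0xc0 | (unit >> 6), 0x80 | (unit & 0x3f));
    } else {
      // Surrogates land here and become 3 bytes each.
      out.push(0xe0 | (unit >> 12), 0x80 | ((unit >> 6) & 0x3f), 0x80 | (unit & 0x3f));
    }
  }
  return out;
}

const man = '\u{1F468}'; // one code point, two UTF-16 code units
console.log(hex(new TextEncoder().encode(man))); // f0 9f 91 a8
console.log(hex(cesu8Bytes(man)));               // ed a0 bd ed b1 a8
```

For text inside the Basic Multilingual Plane both encoders produce identical bytes, which is why this kind of bug can stay dormant until the first emoji arrives.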

  114. NULL-byte


  115. “Modified UTF-8”
    Often CESU-8-encoded


  116. CESU-8
    Oracle
    MySQL
    Java
    Desktop software


  117. Forbidden on Web


  118. Codepoint
    vs
    Character
    vs
    Glyph



  120. U+1F468
    U+200D
    U+1F469
    U+200D
    U+1F467
    U+200D
    U+1F466


  121. 1 glyph
    7 codepoints

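The three ways of counting this sequence can be compared directly. This is a small sketch; it assumes a runtime that provides `Intl.Segmenter`, an API that is newer than this talk, with ICU data recent enough to treat emoji ZWJ sequences as a single grapheme cluster.

```javascript
// U+1F468 ZWJ U+1F469 ZWJ U+1F467 ZWJ U+1F466: the family emoji from the slide above.
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';

console.log(family.length);      // 11 UTF-16 code units
console.log([...family].length); // 7 code points

const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
console.log([...segmenter.segment(family)].length); // 1 grapheme cluster
```

Any `split('')` or `substr` call works on the first of these counts, so it can cut the sequence in the middle of a surrogate pair or between joined emoji.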

  122. Cyrillic small letter ha
    combining breve



  124. देवनागर
    Devanagari


  125. split
    substr


  126. Unicode-aware
    regular expressions


  127. /\p{Alpha}/u

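A short comparison shows what the `u` flag and Unicode property escapes buy. This is just an illustration, assuming an engine with ES2018 property-escape support; the sample string is the speaker's surname in Cyrillic.

```javascript
const surname = 'Листочкин';

console.log(/^[a-z]+$/i.test(surname));     // false: ASCII letters only
console.log(/^\w+$/.test(surname));         // false: \w is ASCII-only as well
console.log(/^\p{Alpha}+$/u.test(surname)); // true: any alphabetic code point
```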

  128. Unicode
    is not always the answer


  129. Han Unification




  131. Serbian б
    Ukrainian б



  133. Now imagine this
    for almost every letter


  134. Traditional Chinese
    Simplified Chinese
    Japanese (Kanji)
    Korean (Hanja)
    Vietnamese (Chữ Nôm)



  136. Plain-text document
    in 2 CJK languages?
    Show historic and
    modern kanji variants?
    Impossible in Unicode


  137. Errors
    Inaccuracies
    Ambiguities


  138. HTML


  139. No Browser API to detect
    User Input Language


  140. window.navigator.languages
    Statistical methods


  141. Sorting


  142. Collation


  143. Harald
    Henrik
    Haakon


  144. aa <=> å
    … x y z æ ø å

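Locale-aware ordering is available through `Intl.Collator`. The sketch below assumes a runtime with full ICU locale data for Norwegian Bokmål (`nb`). Whether "aa" is treated as "å" depends on the collation rules the runtime ships, so no expected output is given for the list of names.

```javascript
const names = ['Harald', 'Henrik', 'Haakon'];

// Default sort compares UTF-16 code units.
console.log([...names].sort()); // [ 'Haakon', 'Harald', 'Henrik' ]

// Locale-aware sort; the position of 'Haakon' depends on the collation data.
console.log([...names].sort(new Intl.Collator('nb').compare));

// The alphabet order itself is easy to check: å comes after z in Norwegian.
const letters = ['\u00e5', 'z', 'a']; // å, z, a
console.log([...letters].sort(new Intl.Collator('nb').compare)); // [ 'a', 'z', 'å' ]
console.log([...letters].sort(new Intl.Collator('en').compare)); // [ 'a', 'å', 'z' ]
```

The same list therefore has a different correct order for different users, and the locale has to be passed in explicitly.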

  145. Sorting is not always
    character-based


  146. kanji 漢字
    hiragana (ひらがな / 平仮名)
    katakana (カタカナ / 片仮名)
    rōmaji


  147. ラドクリフ、マラソン
    五輪代表に 1万m出場にも含み


  148. Can’t sort text in Kanji



  150. Search


  151. ñ
    U+006E U+0303
    U+00F1


  152. Composed
    Decomposed
    Canonical
    Compatibility


  153. Oﬃce (ﬃ ligature, U+FB03)
    Office (f + f + i)


  154. NFKD
    Compatibility Decomposition


  155. Key is
    to have all strings
    normalized
    the same way!

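In JavaScript the four normalization forms are exposed through `String.prototype.normalize()`. A minimal sketch of the search problem and its fix follows; `sameText` is a hypothetical helper, and choosing NFKD for comparison is an assumption that suits search but may be too aggressive for storage.

```javascript
const composed = '\u00f1';    // ñ as one code point
const decomposed = 'n\u0303'; // n + combining tilde

console.log(composed === decomposed);                  // false, although they look identical
console.log(composed.normalize('NFD') === decomposed); // true
console.log(decomposed.normalize('NFC') === composed); // true

// Compatibility forms also unfold ligatures such as U+FB03.
console.log('O\ufb03ce'.normalize('NFKD'));            // Office

// Normalize both sides the same way before comparing or searching.
const sameText = (a, b) => a.normalize('NFKD') === b.normalize('NFKD');
console.log(sameText('O\ufb03ce', 'Office'));          // true
```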

  156. ICU


  157. International
    Components for Unicode


  158. Iteration
    Collation
    Formatting for text and datetimes
    Normalization


  159. ~30 MiB


  160. JavaScript
    ECMA-402
    Intl


  161. Not included
    in Node by default


  162. Be extra careful!



  164. Every developer
    has made mistakes
    handling these kinds of data


  165. Non-obvious
    Dormant
    Painful to find, test, and fix


  166. Very expensive bugs


  167. Demand knowledge


  168. while
    if
    Unicode


  169. Check your software
    Teach your teammates
    Screen for it in interviews



  171. Time Numbers Text
    @listochkin
