Time, Numbers, Text

How hard can it be to add two numbers together? How hard can it be to repeat the same task at the same time tomorrow? How hard can it be to sort a list of names, or even to count how many letters there are in a given word?

We'll talk about the surprising complexity of data that we all deal with every day. This is something every developer should know but very few actually grasp.

Transcript

  1. None
  2. @listochkin

  3. Ember

  4. 2018 Looks like Vue Tons of components WASM SSR Preact speeds

  5. Awesome?

  6. You tell me! No UI in 3? years

  7. jQuery React <script type="text/babel">

  8. You don’t need a UI

  9. Email Calendar CRM Sms Phone API

  10. None
  11. Math Physics

  12. Programming Languages

  13. => async

  14. conditions loops functions data-types

  15. Binary Hardware

  16. Reality <-> Digital

  17. UI

  18. Messy Inconsistent Illogical

  19. Humans

  20. Logical Wise Sentient Decision-makers

  21. Emotional Greasy Impulsive Reactionary

  22. Math Poetry Music Art

  23. Logical and Illogical Consistent and Inconsistent

  24. Programmers’ Minefield

  25. Human data

  26. Text Numbers Time Color & Sound Smell & Taste

  27. Logical yet inconsistent

  28. 2018

  29. 3 golden rules

  30. Store time in UTC

  31. Unicode and UTF-8

  32. Fixed-point decimals or integers for Money

  33. Store time in UTC Unicode and UTF-8 Fixed-point decimals or integers for Money

  34. Bad News

  35. Bad News #1 Not everybody does it

  36. Bad News #2 These rules aren’t enough

  37. There was a bank

  38. Loans Principal + Interest Monthly Installments

  39. SMS Reminder

  40. Every month 3 days before payment due at 10:00 am

  41. 1. Find the right date

  42. 29th -> 26th

  43. Move to 25th for February

  44. Except Leap Year

  45. Except when that date ends up being a weekend or a bank holiday

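The date rules on the slides above can be sketched roughly as follows. This is a minimal illustration, not the bank's actual logic: the function name `reminderDate`, the `holidays` set, and the choice to move the reminder to an earlier day (rather than a later one) are all assumptions.

```javascript
// Hypothetical helper: returns the reminder date as "YYYY-MM-DD".
// month is 1-12; holidays is a Set of "YYYY-MM-DD" strings (assumed input).
function reminderDate(year, month, dueDay, holidays = new Set()) {
  // Day 0 of the following month is the last day of this month,
  // so a due day of 29 becomes 28 in a non-leap February.
  const lastDay = new Date(Date.UTC(year, month, 0)).getUTCDate();
  const due = Math.min(dueDay, lastDay);

  // Start 3 days before the due date (Date.UTC normalizes day <= 0).
  const d = new Date(Date.UTC(year, month - 1, due - 3));
  const iso = () => d.toISOString().slice(0, 10);

  // Assumption: step back one day at a time past weekends and holidays.
  while (d.getUTCDay() === 0 || d.getUTCDay() === 6 || holidays.has(iso())) {
    d.setUTCDate(d.getUTCDate() - 1);
  }
  return iso();
}

console.log(reminderDate(2018, 3, 29)); // 2018-03-26, a Monday
console.log(reminderDate(2018, 2, 29)); // 2018-02-23, because the 25th is a Sunday
console.log(reminderDate(2020, 2, 29)); // 2020-02-26, leap year
```

Working on calendar dates in UTC here only avoids local-offset surprises while counting days; it says nothing yet about what "10:00 am" means for the customer.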
  46. 2. Find the right time

  47. 10:00 am

  48. Time Zones

  49. Time Offset vs Time Zone

  50. None
  51. None
  52. UTC+2 Finland Ukraine South Africa

  53. UTC+3 Finland Ukraine UTC+2 South Africa
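The difference between an offset and a zone shown on the two slides above can be checked with `Intl.DateTimeFormat`. This sketch assumes a runtime with full ICU time zone data (for example a recent Node.js); `localHour` is a name made up for the illustration.

```javascript
// Hypothetical helper: the wall-clock hour in a named zone at a given UTC instant.
function localHour(timeZone, isoUtc) {
  const parts = new Intl.DateTimeFormat("en-GB", {
    timeZone,
    hour: "2-digit",
    hourCycle: "h23",
  }).formatToParts(new Date(isoUtc));
  return Number(parts.find((p) => p.type === "hour").value);
}

const zones = ["Europe/Helsinki", "Europe/Kiev", "Africa/Johannesburg"];

// Winter: all three zones are at UTC+2.
console.log(zones.map((z) => localHour(z, "2018-01-15T12:00:00Z"))); // [ 14, 14, 14 ]

// Summer: Finland and Ukraine move to UTC+3, South Africa stays at UTC+2.
console.log(zones.map((z) => localHour(z, "2018-07-15T12:00:00Z"))); // [ 15, 15, 14 ]
```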

  54. Time Zones

  55. tzdata

  56. “Timestamp with timezone” is a lie

  57. Timestamp with timezone offset

  58. User time = Local DateTime + TimeZone Name
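One way to read "Local DateTime + TimeZone Name" is: store the wall-clock time and the zone name, and resolve the UTC instant only when it is needed. The sketch below does that with plain `Intl`. The helper names `wallClockMs` and `zonedToUtc` are assumptions, and it ignores local times that are skipped or repeated around a DST switch.

```javascript
// Hypothetical helper: wall-clock fields of an instant in a zone, packed as if they were UTC.
function wallClockMs(ms, timeZone) {
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone,
    hourCycle: "h23",
    year: "numeric", month: "2-digit", day: "2-digit",
    hour: "2-digit", minute: "2-digit", second: "2-digit",
  }).formatToParts(new Date(ms));
  const p = Object.fromEntries(parts.map((x) => [x.type, x.value]));
  return Date.UTC(+p.year, p.month - 1, +p.day, +p.hour, +p.minute, +p.second);
}

// Hypothetical helper: local is "YYYY-MM-DDTHH:mm:ss" without any offset.
function zonedToUtc(local, timeZone) {
  const asUtc = Date.parse(local + "Z");
  // First guess uses the offset at the naive instant, second pass corrects it.
  let utc = asUtc - (wallClockMs(asUtc, timeZone) - asUtc);
  utc = asUtc - (wallClockMs(utc, timeZone) - utc);
  return new Date(utc);
}

// The same "10:00 am" reminder maps to different UTC instants across the year.
console.log(zonedToUtc("2018-01-26T10:00:00", "Europe/Kiev").toISOString()); // 2018-01-26T08:00:00.000Z
console.log(zonedToUtc("2018-07-26T10:00:00", "Europe/Kiev").toISOString()); // 2018-07-26T07:00:00.000Z
```

Because the result depends on the tzdata the runtime ships with, a stored UTC value for a future event can go stale while the stored local time and zone name stay correct.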

  59. Almost nobody does it today

  60. Location -> TimeZone

  61. APIs Local datasets

  62. Moving target

  63. tzdata <YYYYw> 2018c 10+ releases a year Often bundled with other software

  64. App-tzdata Database-tzdata System-tzdata

  65. Time Math can be surprising!

  66. 1. Do it in a single place 2. Update tzdata regularly

  67. Numbers

  68. Number Base

  69. Base 10 0.5 0.111 ⅓

  70. Base 10 Financial Data

  71. 0.1 + 0.2 = 0.30000000000000004

  72. True for almost all languages

  73. Intel 1977

  74. Sign Fraction Exponent

  75. <S> (1 + F) * 2^E

  76. IEEE 754 - 1985, 2008
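The floating-point slides above are easy to reproduce in JavaScript, whose numbers are IEEE 754 doubles. The integer-cents variant and the `formatCents` helper are illustrative assumptions, not something the slides prescribe.

```javascript
// Neither 0.1 nor 0.2 has an exact binary representation, so the sum drifts.
const floatSum = 0.1 + 0.2;
console.log(floatSum);         // 0.30000000000000004
console.log(floatSum === 0.3); // false

// The same amounts kept as integer cents add up exactly.
const centsSum = 10 + 20;
console.log(centsSum === 30);  // true

// Hypothetical helper: convert to a decimal string only at the display edge.
function formatCents(cents) {
  const sign = cents < 0 ? "-" : "";
  const abs = Math.abs(cents);
  return `${sign}${Math.floor(abs / 100)}.${String(abs % 100).padStart(2, "0")}`;
}

console.log(formatCents(centsSum)); // 0.30
```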

  77. Money

  78. “Use fixed point numbers”

  79. + - *

  80. /

  81. %

  82. Installments

  83. (Interest + Premium) / Duration

  84. Typical durations: 3, 6, 12, 24, 120, etc

  85. Nonoptimal for Base-2

  86. Rounding

  87. 10.265

  88. Round-half-up for all installments but the last one

  89. 10.265 => 10.27

  90. Skews the effective rate up Need to compensate

  91. Last installment

  92. Round-half-to-even Banking rounding

  93. 10.265 => 10.26 6 is even 7 is odd

  94. Math.round() Always half-up

  95. CSS
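The two rounding modes from the installment slides above can be sketched on integers, which sidesteps binary fractions entirely. All names here are hypothetical, inputs are assumed to be non-negative integers, and letting the last installment absorb the remainder is one possible way to compensate, not necessarily what the bank did.

```javascript
// Divide n by d and round halves up (both non-negative integers).
function divHalfUp(n, d) {
  const r = n % d;
  const q = (n - r) / d;
  return 2 * r >= d ? q + 1 : q;
}

// Divide n by d and round halves to the nearest even result ("banking rounding").
function divHalfEven(n, d) {
  const r = n % d;
  const q = (n - r) / d;
  if (2 * r > d) return q + 1;
  if (2 * r < d) return q;
  return q % 2 === 0 ? q : q + 1;
}

// 10.265 expressed as 10265 thousandths, rounded to cents:
console.log(divHalfUp(10265, 10));   // 1027, i.e. 10.27
console.log(divHalfEven(10265, 10)); // 1026, i.e. 10.26

// Assumption: regular installments use half-up, the last one takes what is left,
// so the installments always add up to the original total.
function splitInstallments(totalCents, count) {
  const regular = divHalfUp(totalCents, count);
  const last = totalCents - regular * (count - 1);
  return [...Array(count - 1).fill(regular), last];
}

console.log(splitInstallments(12318, 12)); // eleven times 1027, then 1021
```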

  96. Text

  97. Writing systems Right-to-left Top-to-bottom Graphemes vs Characters

  98. 6-bit => 7-bit (ASCII) 8-bit CP-1251, koi8-u 16-bit - GB, Big5, Shift-JIS

  99. Unicode

  100. Let’s encode every writing

  101. 2 bytes should be enough

  102. Windows macOS iOS JavaScript

  103. 0 - 10FFFF

  104. UCS-2 => UTF-16 Surrogate pairs

  105. 1101 10xx xxxx xxxx 1101 11xx xxxx xxxx

  106. UTF-8

  107. Linux Android Networking Databases

  108. iOS device Network Server

  109. UTF-8 <=> UTF-16

  110. UTF-16 => UTF-8

  111. Correct: 2 surrogates (1 surrogate pair) 1 codepoint Re-encode as 4-byte UTF-8

  112. Practical: Take each surrogate separately 2 codepoints Re-encode as 2 * 3 byte UTF-8

  113. Non-valid UTF-8 bytes CESU-8

  114. NULL-byte

  115. “Modified UTF-8” Often CESU-8-encoded

  116. CESU-8 Oracle MySQL Java Desktop software

  117. Forbidden on Web
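The UTF-16, UTF-8 and CESU-8 slides above can be illustrated with one supplementary-plane character. `TextEncoder` is assumed to be available as a global (browsers and recent Node.js); the `cesu8` function is a hand-written sketch of the "encode each surrogate separately" shortcut, not a library API.

```javascript
const man = "\u{1F468}"; // one code point outside the 16-bit range

// UTF-16: JavaScript strings expose the surrogate pair as two code units.
const units = Array.from({ length: man.length }, (_, i) => man.charCodeAt(i).toString(16));
console.log(units); // [ 'd83d', 'dc68' ]

// Correct UTF-8: the pair is decoded to one code point, then encoded as 4 bytes.
const utf8 = Array.from(new TextEncoder().encode(man), (b) => b.toString(16));
console.log(utf8); // [ 'f0', '9f', '91', 'a8' ]

// CESU-8 shortcut: each 16-bit unit is encoded on its own, surrogates included.
function cesu8(str) {
  const out = [];
  for (let i = 0; i < str.length; i++) {
    const u = str.charCodeAt(i);
    if (u < 0x80) out.push(u);
    else if (u < 0x800) out.push(0xc0 | (u >> 6), 0x80 | (u & 0x3f));
    else out.push(0xe0 | (u >> 12), 0x80 | ((u >> 6) & 0x3f), 0x80 | (u & 0x3f));
  }
  return out.map((b) => b.toString(16));
}

console.log(cesu8(man)); // [ 'ed', 'a0', 'bd', 'ed', 'b1', 'a8' ], not valid UTF-8
```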

  118. Codepoint vs Character vs Glyph

  119. None
  120. U+1F468 U+200D U+1F469 U+200D U+1F467 U+200D U+1F466

  121. 1 glyph 7 codepoints

  122. Cyrillic small letter ha combining breve

  123. None
  124. देवनागर Devanagari

  125. split substr

  126. Unicode-aware regular expressions

  127. /\p{Alpha}/u
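The gap between code units, code points and what a reader sees as one character can be measured directly. This sketch assumes a runtime that ships `Intl.Segmenter` (for example Node.js 16 or later); the helper names are made up for the example.

```javascript
// The family emoji from the slides: 4 people joined by 3 zero-width joiners.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

console.log(family.length);      // 11 UTF-16 code units
console.log([...family].length); // 7 code points

// Hypothetical helper: split into user-perceived characters (grapheme clusters).
function graphemes(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return Array.from(segmenter.segment(str), (s) => s.segment);
}

console.log(graphemes(family).length);         // 1
console.log(graphemes("\u0445\u0306").length); // 1: Cyrillic ha + combining breve

// Index-based slicing cuts the cluster apart and leaves only the first person.
console.log(family.slice(0, 2) === "\u{1F468}"); // true

// Unicode-aware regular expression versus an ASCII character class.
const isLetter = (ch) => /\p{Alpha}/u.test(ch);
console.log(isLetter("б"), /[a-z]/i.test("б")); // true false
```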

  128. Unicode is not always the answer

  129. Han Unification

  130. None
  131. <div style="font-family:Ubuntu"> <p lang="sr">Serbian б</p> <p lang="uk">Ukrainian б</p> </div>

  132. None
  133. Now imagine this for almost every letter

  134. Traditional Chinese Simplified Chinese Japanese (Kanji) Korean (Hanja) Vietnamese (Chữ Nôm)

  135. None
  136. Plain-text document in 2 CJK languages? Show historic and modern kanji variants? Impossible in Unicode

  137. Errors Inaccuracies Ambiguities

  138. HTML <p lang="ja">

  139. No Browser API to detect User Input Language

  140. window.navigator.languages Statistical methods

  141. Sorting

  142. Collation

  143. Harald Henrik Haakon

  144. aa <=> å … x y z æ ø å

  145. Sorting is not always character-based

  146. kanji 漢字 hiragana (ひらがな / 平仮名) katakana (カタカナ / 片仮名) rōmaji

  147. ラドクリフ、マラソン 五輪代表に 1万m出場にも含み

  148. Can’t sort text in Kanji
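The collation slides above can be tried with `Intl.Collator`. This sketch assumes a runtime with full ICU locale data. How the digraph "aa" is ordered depends on the locale data shipped with the runtime, so the name example is printed without a claimed result.

```javascript
const letters = ["z", "å", "a", "ø", "æ"];

// Default sort compares UTF-16 code units, which matches no human alphabet.
const codeUnitOrder = [...letters].sort();
console.log(codeUnitOrder); // [ 'a', 'z', 'å', 'æ', 'ø' ]

// Norwegian alphabet order: ... x y z æ ø å
const norwegianOrder = [...letters].sort(new Intl.Collator("nb").compare);
console.log(norwegianOrder); // [ 'a', 'z', 'æ', 'ø', 'å' ]

// The same letter lands in a different place under English rules.
const englishSaysBeforeZ = new Intl.Collator("en").compare("å", "z") < 0;
console.log(englishSaysBeforeZ); // true

// The names from the slides; output depends on how the locale treats "aa".
const names = ["Harald", "Henrik", "Haakon"];
console.log([...names].sort(new Intl.Collator("nb").compare));
```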

  149. None
  150. Search

  151. ñ U+006E U+0303 U+00F1

  152. Composed Decomposed Canonical Compatibility

  153. O ffi ce O f f i ce

  154. NFKD Compatible Decomposed

  155. Key is to have all strings normalized the same way!
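The normalization forms from the search slides above are available through `String.prototype.normalize`. The helpers `sameText` and `looseIncludes` are hypothetical names, and picking NFC for equality and NFKD for search is only one reasonable choice; what matters is that both sides go through the same form.

```javascript
const composed = "\u00F1";    // ñ as one code point
const decomposed = "n\u0303"; // n followed by a combining tilde

console.log(composed === decomposed);                  // false
console.log(composed.normalize("NFD") === decomposed); // true
console.log(decomposed.normalize("NFC") === composed); // true

// Compatibility decomposition also unfolds ligatures such as U+FB03 (ffi).
const ligature = "O\uFB03ce";
console.log(ligature.normalize("NFKD")); // Office
console.log(ligature.normalize("NFC") === ligature); // true: canonical forms keep the ligature

// Hypothetical helpers: normalize both sides the same way before comparing.
function sameText(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

function looseIncludes(haystack, needle) {
  return haystack.normalize("NFKD").includes(needle.normalize("NFKD"));
}

console.log(sameText(composed, decomposed));              // true
console.log(looseIncludes("Head O\uFB03ce", "Office"));   // true
```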

  156. ICU

  157. International Components for Unicode

  158. Iteration Collation Formatting for text and datetimes Normalization

  159. ~30 MiB

  160. JavaScript ECMA-402 Intl

  161. Not included in Node by default

  162. Be extra careful!

  163. None
  164. Every developer made mistakes handling these kinds of data

  165. Non-obvious Dormant Painful to find, test, and fix

  166. Very expensive bugs

  167. Demand knowledge

  168. while if Unicode

  169. Check your software Teach your team-mates Screen at the interview

  170. None
  171. Time Numbers Text @listochkin