Why Strings Are Evil

Given at Playgrounds 2017

Samuel E. Giddins

February 23, 2017

Transcript

  1. Why Strings are Evil Samuel Giddins

  2. Roadmap 1. Defining a String 2. Motivation 3. Strings in

    Our Programming Languages 4. Character Encodings 5. Unicode 6. Correctness, Performance, & Shipping
  3. What is a String?

  4. → Text → A bunch of letters and weird characters

    → Emoji → Control characters
  5. Strings @segiddins

  6. Strings import Foundation NSLog("Hello, world!")

  7. Strings Ä__stubs__TEXTÄ0ÄÄ__stub_helper__TEXT∞`∞Ä__const__TEXTà__cstring_ TEXTòò__unwind_info__TEXT∞P∞Ë__DATA__nl_symbol_ptr__DATA__la_symbol_ptr__DATA@ H__LINKEDIT 0 †&"Ä0 pê ∏ ∞!x P

    h! /usr/lib/dyldvÛQùØ=ìRA¡SwE}$
  8. Strings segiddins@segiddins.me

  9. Strings → Unformatted data → Often has meaning beyond the

    bytes → The universal data type?
  10. Motivation

  11. Motivation → I think I'm pretty good at this →

    But I find strings really hard
  12. String-Intensive Applications → Displaying rich text → Writing a JSON

    parser → Detecting data in forms → Literally anything that involves code → Communication
  13. Motivation → I've spent more time writing string-related code than

    I want to → I want to work on projects that require understanding strings → I don't think I'm alone
  14. Motivation → Strings are useful → Strings are hard →

    Strings aren't getting any less hard → I want to rationalize all the pain I've had working with Strings
  15. Strings (in our languages)

  16. In C char *string = "string"; char string[] = {'s',

    't', 'r', 'i', 'n', 'g', 0};
  17. In C → String is a sequence of characters →

    A character is a typedef'd uint8_t → Nul-terminated → That's it
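As a minimal sketch of what nul termination implies, the length of a C string is not stored anywhere; it is found by scanning for the first 0 byte. The helper name `scan_length` is hypothetical and only mirrors what `strlen`-style functions do.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts bytes up to (not including) the first
   0 byte, which is how a nul-terminated string's end is located. */
size_t scan_length(const char *s) {
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != '\0') {
        n++;
    }
    return n;
}
```

One consequence: a 0 byte embedded in the data ends the string early, so `scan_length("str\0ing")` stops after three bytes.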
  18. In Objective-C NSString *string = @"string\b!\0!";

  19. In Objective-C An NSString object encodes a Unicode-compliant text string,

    represented as a sequence of UTF-16 code units. All lengths, character indexes, and ranges are expressed in terms of 16-bit platform-endian values, with index values starting at 0.
  20. In Objective-C → Inspired by Pascal strings → NSString is

    a class cluster → Generally, a pointer to a buffer and a length → Canonically, UTF-16
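To illustrate why lengths counted in UTF-16 code units can differ from the number of characters a reader sees, here is a rough sketch of UTF-16 encoding for a single code point. The function name `utf16_encode` is an assumption made for this example; it is not an NSString API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encodes one Unicode scalar value as UTF-16.
   Writes 1 or 2 code units to `out` and returns how many were written. */
size_t utf16_encode(uint32_t cp, uint16_t out[2]) {
    assert(cp <= 0x10FFFF);             /* outside the Unicode range */
    assert(cp < 0xD800 || cp > 0xDFFF); /* surrogates are not scalar values */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* fits in a single code unit */
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate: bottom 10 bits */
    return 2;
}
```

Under this encoding, U+00F1 (ñ) occupies one code unit, while U+1F600 (an emoji) occupies two, so a length measured in 16-bit units reports 2 for a single emoji.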
  21. In Swift let ! = """

  22. In Swift A Unicode string value. A string is a

    series of characters. Strings in Swift are Unicode correct, locale insensitive, and designed to be efficient.
  23. In Swift → String is a value-type struct → Composed

    of characters/extended grapheme clusters → 15 associated types → 75 initializers → 269 methods → 107 properties
  24. Character Encodings

  25. Character Encodings → Maps bytes to 'characters' → Map isn't

    necessarily invertible → Many choices → Lots of legacy cruft
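One reading of "map isn't necessarily invertible" is that converting into a smaller encoding loses information. The sketch below assumes a hypothetical lossy ASCII encoder, `to_ascii_lossy`, that substitutes `?` for anything ASCII cannot represent.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical lossy encoder: ASCII covers only U+0000 to U+007F,
   so every other code point collapses to '?'. */
char to_ascii_lossy(uint32_t code_point) {
    assert(code_point <= 0x10FFFF);
    return code_point < 0x80 ? (char)code_point : '?';
}
```

Both U+00F1 (ñ) and U+00E9 (é) come out as `?`, and so does a literal question mark, so the original text cannot be recovered from the encoded bytes.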
  26. Common Character Encodings → Morse → EBCDIC → ASCII →

    Windows-1252 → UTF-{8,16,32}
  27. ASCII

  28. ASCII → "plain text" → Great for English → Missing

    everything else
  29. What's missing from ASCII? → Accents → Non-latin characters →

    Emoji → Math symbols → ...
  30. Enter Unicode

  31. Unicode provides a unique number for every character, no matter

    what the platform, no matter what the program, no matter what the language.
  32. Why is Unicode important? → Puts "all" characters in a

    single encoding → Has idea of "canonical equivalence" → Several different underlying representations → Can choose between them depending upon the application
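As a sketch of one of those underlying representations, the following encodes a single code point as UTF-8, where the number of bytes depends on the size of the number. The name `utf8_encode` is an assumption for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encodes one Unicode scalar value as UTF-8.
   Writes 1 to 4 bytes to `out` and returns how many were written. */
size_t utf8_encode(uint32_t cp, unsigned char out[4]) {
    assert(cp <= 0x10FFFF);
    assert(cp < 0xD800 || cp > 0xDFFF); /* surrogates are not encodable */
    if (cp < 0x80) {                    /* 7 bits: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                   /* 11 bits: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                 /* 16 bits: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    /* 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

The same code point therefore has different byte layouts in UTF-8, UTF-16, and UTF-32, which is part of why the choice of representation depends on the application.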
  33. Swift Strings are Unicode-Aware "\u{006E}\u{0303}" == "\u{00F1}" // true "ñ"

    == "ñ" // true
  34. Unicode Lessons → Byte equality ≠ string equality → String equality

    is (potentially) quadratic → Comparison, sorting, ... → Good luck! → Locales, casing, etc. are complicated
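A small sketch of the first lesson, using the two spellings of "ñ" from the Swift slide written out as UTF-8 bytes. The names `PRECOMPOSED`, `DECOMPOSED`, and `bytes_equal` are made up for this example; a Unicode-aware comparison would need normalization, which this sketch deliberately does not attempt.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* "ñ" as the single code point U+00F1, encoded in UTF-8. */
const char PRECOMPOSED[] = "\xC3\xB1";
/* "ñ" as "n" (U+006E) followed by combining tilde (U+0303), in UTF-8. */
const char DECOMPOSED[] = "n\xCC\x83";

/* Hypothetical helper: compares two nul-terminated strings byte by byte. */
bool bytes_equal(const char *a, const char *b) {
    assert(a != NULL && b != NULL);
    return strcmp(a, b) == 0;
}
```

Here `bytes_equal(PRECOMPOSED, DECOMPOSED)` is false, and the two buffers do not even have the same byte length, although Unicode treats the two sequences as canonically equivalent.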
  35. Unicode Lessons → Unicode is complicated → Because the languages

    we use are complicated
  36. Correctness, Performance, & Shipping

  37. Correctness, Performance, & Shipping Choose 2

  38. Correctness → Being "unicode-aware" → Handling control characters → Detecting

    invalid byte sequences in a given encoding → Recognizing specific grammars in certain strings
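Detecting invalid byte sequences can be sketched for UTF-8 as follows. This is an illustrative validator under the assumption that the input is a byte buffer with a known length; the name `is_valid_utf8` is hypothetical, and production code would more likely rely on a well-tested library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns true if `s[0..len)` is well-formed UTF-8. */
bool is_valid_utf8(const unsigned char *s, size_t len) {
    assert(s != NULL || len == 0);
    size_t i = 0;
    while (i < len) {
        unsigned char lead = s[i];
        size_t extra;  /* continuation bytes that must follow */
        uint32_t cp;   /* code point being decoded */
        uint32_t min;  /* smallest code point allowed at this length */
        if (lead < 0x80)                { extra = 0; cp = lead;        min = 0; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; min = 0x10000; }
        else return false;             /* stray continuation or invalid lead byte */

        if (len - i - 1 < extra) return false;           /* truncated sequence */
        for (size_t k = 1; k <= extra; k++) {
            unsigned char c = s[i + k];
            if ((c & 0xC0) != 0x80) return false;        /* not 10xxxxxx */
            cp = (cp << 6) | (uint32_t)(c & 0x3F);
        }
        if (cp < min) return false;                      /* overlong encoding */
        if (cp > 0x10FFFF) return false;                 /* beyond Unicode */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  /* surrogate */
        i += extra + 1;
    }
    return true;
}
```

Even this small check has to cover truncation, overlong forms, surrogates, and out-of-range values, which gives a sense of how much "correct" asks of string code.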
  39. Recognizing specific grammars in certain strings → Telephone numbers →

    Emails → JSON → Programming languages
  40. Recognizing specific grammars in certain strings These all use strings

    in a certain format (defined by a grammar) to represent more than just their bytes
  41. Performance → Indexing is → Length is → Looping can

    be either or → Checking equality → What does that even mean‽
  42. Performance → Simple-looking code can be slow → Unicode adds

    performance wrinkles → String slicing can lead to memory leaks → Or string slicing can be slow
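One way Unicode adds a performance wrinkle can be sketched with UTF-8: because code points take a variable number of bytes, counting them or finding the n-th one means scanning from the start. The helpers `utf8_count` and `utf8_at` are hypothetical and assume the input is already valid, nul-terminated UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of the form 10xxxxxx continue a code point; any other byte starts one. */
static int starts_code_point(char byte) {
    return ((unsigned char)byte & 0xC0) != 0x80;
}

/* Hypothetical helper: counts code points by visiting every byte. */
size_t utf8_count(const char *s) {
    assert(s != NULL);
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (starts_code_point(*s)) count++;
    }
    return count;
}

/* Hypothetical helper: returns a pointer to the n-th code point (0-based),
   or NULL if the string has fewer code points. Scans from the beginning. */
const char *utf8_at(const char *s, size_t n) {
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (starts_code_point(*s)) {
            if (n == 0) return s;
            n--;
        }
    }
    return NULL;
}
```

A loop that calls `utf8_at(s, i)` for every `i` looks simple but rescans the string each time, which is the kind of hidden cost the slide warns about.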
  43. Performance → Changing encodings → Delineating lines/paragraphs/pages → Text layout

    → S ̵͘ ͝ ͖͎̯̱͕̣̝ u ̵̊̈́̓̈́̾̎̔̄̃ ̢͚͓̞̺ ͜ p ̶͓͔͕̞͆͆̇̂̄ 㸅 ͍̘̰̮̝̪̖ e ̴ ̚ ͑̿ ̚ ̙̬̜͕͓͈̝ r ̷̛͠ ͐̈́͌̎̒́͑ ̕ ͖ ̴̔̎̇ ́͆͐̔̚ ̢̳̲͔ ͎ l ̸̀̎̿̚ ͉̥ a ̴̾̅̿́ ̢̪̤͇͎̣ ͔̠ r ̷̓̈́̓͑ ̡̥̯͈ ̨̜̬̘ g ̵͛̅ ͚̤̰ e ̸̡͍͚̞̩͇̱̘̭͇̝̦͕̃́́̀̈́̌̄̆́ ̸̽̈́̃͘ ̚ ̐ ͒͝ ̕ ̓̽ ̨͍͍̬͖̭̠͓ s ̷̓̔͌̌ ̡ ̞͍͕̻̰̥ t ̵̆͂̍͐͝ ͋ ͕̭̯ r ̷̖͎̟̻̯̗̥̓̅ i̴ ̠̝̑̚ ͇̼͎̖̝̮ n ̸̛͉ g ̵̀͛̓̒͗̓͌͆̑͘̚ ͓ s ̷̏́͛͒ ̡̙̻̪
  44. Shipping

  45. Shipping All of these considerations mean designing fast & correct

    string handling is a massive undertaking.
  46. Shipping Sometimes, sacrificing on one of them is OK. Sometimes,

    being incorrect or slow and shipping the rest of your app is a worthy tradeoff.
  47. sɓuıɹʇS 'oS

  48. So, Strings

  49. So, Strings → Not evil, just complicated → An overloaded

    concept – pick the overload you need → Move from a string to structured data as soon as possible
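As a rough sketch of moving from a string to structured data, the example below parses an email-like string once and hands back a struct. The type `EmailAddress`, the function `parse_email`, and the buffer sizes are assumptions for illustration; the check is deliberately simplistic and does not implement the real email address grammar.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical structured form of an address: the two parts are
   separated once, at the boundary, instead of re-split everywhere. */
typedef struct {
    char local[65];
    char domain[256];
} EmailAddress;

/* Hypothetical parser: accepts exactly one '@' with non-empty text on
   both sides. Returns false and leaves `out` unspecified on failure. */
bool parse_email(const char *text, EmailAddress *out) {
    assert(text != NULL && out != NULL);
    const char *at = strchr(text, '@');
    if (at == NULL || strchr(at + 1, '@') != NULL) return false;

    size_t local_len = (size_t)(at - text);
    size_t domain_len = strlen(at + 1);
    if (local_len == 0 || local_len >= sizeof out->local) return false;
    if (domain_len == 0 || domain_len >= sizeof out->domain) return false;

    memcpy(out->local, text, local_len);
    out->local[local_len] = '\0';
    memcpy(out->domain, at + 1, domain_len + 1); /* includes the nul */
    return true;
}
```

After a successful parse, the rest of the program can work with `local` and `domain` directly and never needs to look for the `@` again.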
  50. So, Strings Not so simple in the end. Just as

    crucial.
  51. @segiddins