
Why Strings Are Evil

Given at Playgrounds 2017

Samuel E. Giddins

February 23, 2017

Transcript

  1. Why Strings are Evil Samuel Giddins

  2. Roadmap 1. Defining a String 2. Motivation 3. Strings in Our Programming Languages 4. Character Encodings 5. Unicode 6. Correctness, Performance, & Shipping
  3. What is a String?

  4. → Text → A bunch of letters and weird characters → Emoji → Control characters
  5. Strings @segiddins

  6. Strings import Foundation NSLog("Hello, world!")

  7. Strings Ä__stubs__TEXTÄ0ÄÄ__stub_helper__TEXT∞`∞Ä__const__TEXTà__cstring_ TEXTòò__unwind_info__TEXT∞P∞Ë__DATA__nl_symbol_ptr__DATA__la_symbol_ptr__DATA@ H__LINKEDIT 0 †&"Ä0 pê ∏ ∞!x P

    h! /usr/lib/dyldvÛQùØ=ìRA¡SwE}$
  8. Strings segiddins@segiddins.me

  9. Strings → Unformatted data → Often has meaning beyond the bytes → The universal data type?
  10. Motivation

  11. Motivation → I think I'm pretty good at this → But I find strings really hard
  12. String-Intensive Applications → Displaying rich text → Writing a JSON parser → Detecting data in forms → Literally anything that involves code → Communication
  13. Motivation → I've spent more time writing string-related code than I want to → I want to work on projects that require understanding strings → I don't think I'm alone
  14. Motivation → Strings are useful → Strings are hard → Strings aren't getting any less hard → I want to rationalize all the pain I've had working with strings
  15. Strings (in our languages)

  16. In C char *string = "string"; char string[] = {'s', 't', 'r', 'i', 'n', 'g', 0};
  17. In C → String is a sequence of characters → A character is a typedef'd uint8_t → Nul-terminated → That's it
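
The nul-terminated layout described on this slide can be made visible from Swift, which is the language used for all added examples here. This is a minimal sketch, not code from the talk: the variable names and the sample bytes are illustrative assumptions. It reads the C representation of a string through the standard library's `utf8CString` property, then shows that an embedded nul cuts the string short when the bytes are read back as a C string.

```swift
// The C view of "string": one byte per letter plus a terminating 0.
let cBytes = Array("string".utf8CString)
print(cBytes.count)          // 7: six letters and the nul terminator

// A C string ends at the first nul, so anything after it is never seen.
// These are the bytes of "hi\0!\0" (illustrative data).
let embeddedNul: [CChar] = [104, 105, 0, 33, 0]
let decoded = embeddedNul.withUnsafeBufferPointer { buffer in
    String(cString: buffer.baseAddress!)
}
print(decoded)               // hi
```

The length of a C string is not stored anywhere; it is only discovered by scanning for the terminator, which is why the "!" above is silently lost.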
  18. In Objective-C NSString *string = @"string\b!\0!";

  19. In Objective-C An NSString object encodes a Unicode-compliant text string, represented as a sequence of UTF-16 code units. All lengths, character indexes, and ranges are expressed in terms of 16-bit platform-endian values, with index values starting at 0.
  20. In Objective-C → Inspired by Pascal strings → NSString is a class cluster → Generally, a pointer to a buffer and a length → Canonically, UTF-16
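
To illustrate what "lengths expressed in UTF-16 code units" means in practice, here is a small Swift sketch using only the standard library. The `utf16` view of a Swift `String` counts the same 16-bit units the NSString documentation quoted above refers to; the sample character is an arbitrary choice for illustration.

```swift
// U+1F389 PARTY POPPER lies outside the Basic Multilingual Plane.
let party = "\u{1F389}"

print(party.utf16.count)            // 2: a surrogate pair
print(party.unicodeScalars.count)   // 1: a single code point
print(party.utf8.count)             // 4: four UTF-8 bytes
print(party.count)                  // 1: one user-visible character

// The two 16-bit code units that make up the surrogate pair.
let units = Array(party.utf16)
print(units.map { String($0, radix: 16, uppercase: true) })   // ["D83C", "DF89"]
```

One visible character therefore has a "length" of 1, 2, or 4 depending on which unit is being counted.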
  21. In Swift let ! = """

  22. In Swift A Unicode string value. A string is a series of characters. Strings in Swift are Unicode correct, locale insensitive, and designed to be efficient.
  23. In Swift → String is a value-type struct → Composed of characters/extended grapheme clusters → 15 associated types → 75 initializers → 269 methods → 107 properties
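
A short sketch of what "composed of extended grapheme clusters" means: a Swift `Character` can be built from several Unicode scalars. The three sample strings below are illustrative choices, not taken from the slides.

```swift
let accented = "e\u{301}"            // "e" followed by COMBINING ACUTE ACCENT
let flag = "\u{1F1FA}\u{1F1F8}"      // two regional indicator symbols (U and S)
let lineBreak = "\r\n"               // carriage return followed by line feed

for sample in [accented, flag, lineBreak] {
    // Each sample is one Character made of two Unicode scalars.
    print(sample.count, sample.unicodeScalars.count)   // 1 2
}
```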
  24. Character Encodings

  25. Character Encodings → Maps bytes to 'characters' → Map isn't necessarily invertible → Many choices → Lots of legacy cruft
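
One way to see that the byte-to-character map depends on the encoding, and is not always invertible, is to decode the same bytes twice. This is a hedged sketch with made-up sample bytes; the Latin-1 decoding is done by hand because each ISO-8859-1 byte equals its Unicode code point.

```swift
// Four bytes of illustrative data: "caf" followed by 0xE9.
let bytes: [UInt8] = [0x63, 0x61, 0x66, 0xE9]

// Read as ISO-8859-1 (Latin-1): every byte maps to the code point with the
// same value, so 0xE9 becomes U+00E9 (é).
var asLatin1 = ""
for byte in bytes {
    asLatin1.unicodeScalars.append(Unicode.Scalar(byte))
}

// Read as UTF-8: a lone 0xE9 is an incomplete sequence, and the standard
// library substitutes U+FFFD REPLACEMENT CHARACTER.
let asUTF8 = String(decoding: bytes, as: UTF8.self)

print(asLatin1)                      // café
print(asUTF8 == "caf\u{FFFD}")       // true

// Re-encoding the repaired string does not give the original bytes back.
print(Array(asUTF8.utf8) == bytes)   // false
```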
  26. Common Character Encodings → Morse → EBCDIC → ASCII → Windows-1252 → UTF-{8,16,32}
  27. ASCII

  28. ASCII → "plain text" → Great for English → Missing everything else
  29. What's missing from ASCII? → Accents → Non-latin characters → Emoji → Math symbols → ...
  30. Enter Unicode

  31. Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.
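
The "unique number" is the code point. As a small illustration, the sketch below lists the code points of a string in the conventional `U+XXXX` notation. `codePoints(of:)` is a hypothetical helper written for this example, not a standard library function.

```swift
/// Lists the code points of `text` in U+XXXX notation.
/// Hypothetical helper for illustration only.
func codePoints(of text: String) -> [String] {
    return text.unicodeScalars.map { scalar in
        let hex = String(scalar.value, radix: 16, uppercase: true)
        // Pad to at least four hex digits, as is conventional.
        let padding = String(repeating: "0", count: max(0, 4 - hex.count))
        return "U+" + padding + hex
    }
}

print(codePoints(of: "A\u{F1}\u{1F389}"))   // ["U+0041", "U+00F1", "U+1F389"]
```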
  32. Why is Unicode important? → Puts "all" characters in a single encoding → Has idea of "canonical equivalence" → Several different underlying representations → Can choose between them depending upon the application
  33. Swift Strings are Unicode-Aware "\u{006E}\u{0303}" == "\u{00F1}" // true "ñ" == "ñ" // true
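
The comparison on this slide can be expanded into a runnable sketch that also inspects the underlying bytes. The variable names are illustrative; the two strings are the same ones the slide compares.

```swift
let decomposed = "\u{006E}\u{0303}"   // n followed by COMBINING TILDE
let precomposed = "\u{00F1}"          // ñ as a single code point

// Swift's == uses canonical equivalence, so the two strings are equal...
print(decomposed == precomposed)      // true

// ...even though their UTF-8 bytes differ.
print(Array(decomposed.utf8))         // [110, 204, 131]
print(Array(precomposed.utf8))        // [195, 177]

// Comparing code point by code point reports a difference.
print(decomposed.unicodeScalars.elementsEqual(precomposed.unicodeScalars))   // false
```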
  34. Unicode Lessons → Byte equality ≠ string equality → String equality is (potentially) quadratic → Comparison, sorting, ... → Good luck! → Locales, casing, etc. are complicated
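
Two small standard-library examples hint at why sorting and casing are hard. This is a sketch with illustrative sample words: the default ordering is not locale-aware, so it will not match what a reader of a particular language expects, and changing case can change the number of characters.

```swift
// Default ordering is locale-insensitive: "ä" (U+00E4) sorts after "z".
let words = ["z", "\u{E4}", "a"]
print(words.sorted())                 // ["a", "z", "ä"]

// Uppercasing "ß" yields "SS", so the string grows by one character.
let street = "Stra\u{DF}e"            // "Straße"
let shouted = street.uppercased()
print(shouted)                        // STRASSE
print(street.count, shouted.count)    // 6 7
```

Locale-aware comparison and casing need APIs beyond these defaults; which one is right depends on the application.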
  35. Unicode Lessons → Unicode is complicated → Because the languages we use are complicated
  36. Correctness, Performance, & Shipping

  37. Correctness, Performance, & Shipping Choose 2

  38. Correctness → Being "Unicode-aware" → Handling control characters → Detecting invalid byte sequences in a given encoding → Recognizing specific grammars in certain strings
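
As one possible illustration of "detecting invalid byte sequences in a given encoding", the sketch below checks whether a byte array is well-formed UTF-8 by running the standard library's `UTF8` decoder over it. `isValidUTF8` is a hypothetical helper name chosen for this example.

```swift
/// Returns true when `bytes` form a well-formed UTF-8 sequence.
/// Hypothetical helper, not a standard library API.
func isValidUTF8(_ bytes: [UInt8]) -> Bool {
    var decoder = UTF8()
    var iterator = bytes.makeIterator()
    while true {
        switch decoder.decode(&iterator) {
        case .scalarValue:
            continue            // decoded one scalar; keep going
        case .emptyInput:
            return true         // reached the end without errors
        case .error:
            return false        // malformed or truncated sequence
        }
    }
}

print(isValidUTF8([0x63, 0xC3, 0xA9]))   // true: "cé"
print(isValidUTF8([0x63, 0xE9]))         // false: truncated sequence
print(isValidUTF8([0xC0, 0x80]))         // false: overlong encoding
```

Validating up front is a design choice: the alternative, `String(decoding:as:)`, silently repairs bad input with U+FFFD, which may or may not be acceptable.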
  39. Recognizing specific grammars in certain strings → Telephone numbers → Emails → JSON → Programming languages
  40. Recognizing specific grammars in certain strings These all use strings in a certain format (defined by a grammar) to represent more than just their bytes
  41. Performance → Indexing is → Length is → Looping can be either or → Checking equality → What does that even mean‽
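
In Swift, for example, `String` is not a random-access collection: there is no integer subscript, reaching the i-th character means walking from the start, and `count` is also linear. The sketch below uses an illustrative sample string to contrast a loop that re-walks the string on every iteration with one that iterates once.

```swift
let text = "na\u{EF}ve caf\u{E9}"    // "naïve café"

// The fifth Character is found by stepping forward from startIndex.
let fifth = text[text.index(text.startIndex, offsetBy: 4)]
print(fifth)                          // e

// Quadratic: every iteration walks from the start of the string again.
var slow: [Character] = []
for offset in 0..<text.count {
    slow.append(text[text.index(text.startIndex, offsetBy: offset)])
}

// Linear: a single pass over the characters.
var fast: [Character] = []
for character in text {
    fast.append(character)
}

print(slow == fast)                   // true: same result, different cost
```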
  42. Performance → Simple-looking code can be slow → Unicode adds performance wrinkles → String slicing can lead to memory leaks → Or string slicing can be slow
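
The slicing trade-off can be sketched with Swift's `Substring`, which shares storage with the string it was cut from. Slicing is cheap, but holding on to the slice keeps the whole original buffer alive; copying into a new `String` costs time and releases it. The sample data is made up for illustration.

```swift
// A large string of which only the first few characters are needed.
let log = "ERROR: disk full; " + String(repeating: "x", count: 10_000)

// Cheap: `level` is a Substring that refers to the storage of `log`.
let level = log.prefix(5)

// Copying: `detached` owns its own five characters and no longer
// depends on the 10,000-character buffer.
let detached = String(level)

print(type(of: level))    // Substring
print(detached)           // ERROR
```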
  43. Performance → Changing encodings → Delineating lines/paragraphs/pages → Text layout

    → S ̵͘ ͝ ͖͎̯̱͕̣̝ u ̵̊̈́̓̈́̾̎̔̄̃ ̢͚͓̞̺ ͜ p ̶͓͔͕̞͆͆̇̂̄ 㸅 ͍̘̰̮̝̪̖ e ̴ ̚ ͑̿ ̚ ̙̬̜͕͓͈̝ r ̷̛͠ ͐̈́͌̎̒́͑ ̕ ͖ ̴̔̎̇ ́͆͐̔̚ ̢̳̲͔ ͎ l ̸̀̎̿̚ ͉̥ a ̴̾̅̿́ ̢̪̤͇͎̣ ͔̠ r ̷̓̈́̓͑ ̡̥̯͈ ̨̜̬̘ g ̵͛̅ ͚̤̰ e ̸̡͍͚̞̩͇̱̘̭͇̝̦͕̃́́̀̈́̌̄̆́ ̸̽̈́̃͘ ̚ ̐ ͒͝ ̕ ̓̽ ̨͍͍̬͖̭̠͓ s ̷̓̔͌̌ ̡ ̞͍͕̻̰̥ t ̵̆͂̍͐͝ ͋ ͕̭̯ r ̷̖͎̟̻̯̗̥̓̅ i̴ ̠̝̑̚ ͇̼͎̖̝̮ n ̸̛͉ g ̵̀͛̓̒͗̓͌͆̑͘̚ ͓ s ̷̏́͛͒ ̡̙̻̪
  44. Shipping

  45. Shipping All of these considerations mean designing fast & correct string handling is a massive undertaking.
  46. Shipping Sometimes, sacrificing on one of them is OK. Sometimes, being incorrect or slow and shipping the rest of your app is a worthy tradeoff.
  47. sɓuıɹʇS 'oS

  48. So, Strings

  49. So, Strings → Not evil, just complicated → An overloaded concept – pick the overload you need → Move from a string to structured data as soon as possible
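
"Move from a string to structured data as soon as possible" might look like the sketch below: parse once at the boundary, then pass a typed value around. `EmailAddress` is a hypothetical type invented for this example, and its validation is deliberately simplistic; it is nowhere near a full implementation of the email grammar.

```swift
/// A deliberately simplistic, hypothetical wrapper around an email address.
struct EmailAddress: Equatable {
    let localPart: String
    let domain: String

    /// Fails unless `raw` has exactly one "@", a non-empty local part,
    /// and a domain containing a dot.
    init?(_ raw: String) {
        let parts = raw.split(separator: "@", omittingEmptySubsequences: false)
        guard parts.count == 2, !parts[0].isEmpty, parts[1].contains(".") else {
            return nil
        }
        localPart = String(parts[0])
        domain = String(parts[1]).lowercased()
    }
}

if let address = EmailAddress("segiddins@segiddins.me") {
    print(address.localPart, address.domain)   // segiddins segiddins.me
}
print(EmailAddress("not an email") == nil)     // true
```

Once the value exists, the rest of the program no longer has to ask whether a given string is "really" an email address.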
  50. So, Strings Not so simple in the end. Just as crucial.
  51. @segiddins