
Why Strings Are Evil

Given at Playgrounds 2017

Samuel E. Giddins

February 23, 2017

Transcript

  1. Roadmap 1. Defining a String 2. Motivation 3. Strings in

    Our Programming Languages 4. Character Encodings 5. Unicode 6. Correctness, Performance, & Shipping
  2. → Text → A bunch of letters and weird characters

    → Emoji → Control characters
  3. Motivation → I think I'm pretty good at this →

    But I find strings really hard
  4. String-Intensive Applications → Displaying rich text → Writing a JSON

    parser → Detecting data in forms → Literally anything that involves code → Communication
  5. Motivation → I've spent more time writing string-related code than

    I want to → I want to work on projects that require understanding strings → I don't think I'm alone
  6. Motivation → Strings are useful → Strings are hard →

    Strings aren't getting any less hard → I want to rationalize all the pain I've had working with Strings
  7. In C → String is a sequence of characters →

    A character is a typedef'd uint8_t → NUL-terminated → That's it
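
The NUL-terminated layout described on this slide can be sketched in Python by treating a `bytes` object as a C buffer. The helper name `c_strlen` is an illustration only, not a real API. It shows two consequences of the design: finding the length means walking the buffer, and an embedded NUL silently ends the string.

```python
def c_strlen(buf: bytes) -> int:
    """Count bytes up to (not including) the first NUL, the way C's strlen does."""
    n = 0
    # Walks the buffer one byte at a time, so the cost grows with the length.
    # A buffer with no NUL raises IndexError here; in C it is undefined behavior.
    while buf[n] != 0:
        n += 1
    return n

buf = b"hello\x00world\x00"
length = c_strlen(buf)
visible = buf[:length]   # everything after the first NUL is invisible to C string functions
```

Everything past the first zero byte is still in memory but is not part of the string as C sees it.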
  8. In Objective-C An NSString object encodes a Unicode-compliant text string,

    represented as a sequence of UTF-16 code units. All lengths, character indexes, and ranges are expressed in terms of 16-bit platform-endian values, with index values starting at 0.
  9. In Objective-C → Inspired by Pascal strings → NSString is

    a class cluster → Generally, a pointer to a buffer and a length → Canonically, UTF-16
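
To make "lengths are expressed in UTF-16 code units" concrete, here is a small Python sketch. The helper `utf16_length` is a hypothetical name. It counts code units by encoding to UTF-16, which is the quantity an NSString-style length reports, and compares it with Python's own code point count.

```python
def utf16_length(text: str) -> int:
    """Number of UTF-16 code units needed to represent `text`."""
    # Encode without a byte-order mark; every code unit is exactly 2 bytes.
    return len(text.encode("utf-16-le")) // 2

sample = "a\U0001F600"              # "a" followed by U+1F600, which lies outside the BMP
code_points = len(sample)           # Python's len() counts code points
code_units = utf16_length(sample)   # U+1F600 needs a surrogate pair in UTF-16
```

The two counts differ because any code point above U+FFFF occupies two 16-bit units.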
  10. In Swift A Unicode string value. A string is a

    series of characters. Strings in Swift are Unicode correct, locale insensitive, and designed to be efficient.
  11. In Swift → String is a value-type struct → Composed

    of characters/extended grapheme clusters → 15 associated types → 75 initializers → 269 methods → 107 properties
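
As a rough illustration of why "character" and "code point" are different units, the Python sketch below groups each base code point with the combining marks that follow it. This is a deliberately simplified approximation, not the Unicode text segmentation algorithm (UAX #29) that real grapheme cluster support follows, and `rough_clusters` is a made-up helper name.

```python
import unicodedata

def rough_clusters(text: str) -> list[str]:
    """Group each base code point with the combining marks that follow it.

    Simplification: ignores emoji ZWJ sequences, Hangul jamo, regional
    indicators, and the other rules of full grapheme segmentation.
    """
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch      # attach the mark to the previous base
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

word = "man\u0303ana"               # "mañana" spelled with n + COMBINING TILDE
clusters = rough_clusters(word)
```

A reader sees six characters in `word`, while the string holds seven code points.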
  12. Character Encodings → Maps bytes to 'characters' → Map isn't

    necessarily invertible → Many choices → Lots of legacy cruft
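
A short Python sketch of the points above, using codecs from the standard library. The same byte means different things under different encodings, and encoding into a smaller repertoire can lose information, so the round trip does not return the original text.

```python
raw = b"\xe9"

as_latin1 = raw.decode("latin-1")   # U+00E9, LATIN SMALL LETTER E WITH ACUTE
as_cp1251 = raw.decode("cp1251")    # U+0439, CYRILLIC SMALL LETTER SHORT I

# Encoding to ASCII with replacement is lossy: the accent cannot be recovered.
lossy = "\u00e9".encode("ascii", errors="replace")
round_trip = lossy.decode("ascii")
```

Without knowing which encoding produced `raw`, there is no way to tell which of the two characters was intended.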
  13. Unicode provides a unique number for every character, no matter

    what the platform, no matter what the program, no matter what the language.
  14. Why is Unicode important? → Puts "all" characters in a

    single encoding → Has idea of "canonical equivalence" → Several different underlying representations → Can choose between them depending upon the application
  15. Unicode Lessons → Byte equality ≠ string equality → String equality

    is (potentially) quadratic → Comparison, sorting, ... → Good luck! → Locales, casing, etc. are complicated
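
The gap between comparing bytes and comparing text can be shown with Python's `unicodedata` module. This is a minimal sketch: `canonically_equal` is an illustrative helper that compares NFC-normalized forms, which is one standard way to test canonical equivalence. It does not address locale-aware comparison or case folding.

```python
import unicodedata

composed = "\u00e9"        # é as a single code point
decomposed = "e\u0301"     # e followed by COMBINING ACUTE ACCENT

def canonically_equal(a: str, b: str) -> bool:
    """True when both strings have the same NFC normalized form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

same_code_points = composed == decomposed
same_bytes = composed.encode("utf-8") == decomposed.encode("utf-8")
same_text = canonically_equal(composed, decomposed)
```

Both spellings render identically, yet only the normalized comparison treats them as equal.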
  16. Correctness → Being "unicode-aware" → Handling control characters → Detecting

    invalid byte sequences in a given encoding → Recognizing specific grammars in certain strings
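
Detecting invalid byte sequences can be sketched in Python by attempting a strict decode. The function name `is_valid_utf8` is hypothetical, and UTF-8 is chosen only as an example encoding.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True when `data` is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False

good = "héllo".encode("utf-8")
bad = b"\xc3\x28"   # 0xC3 announces a 2-byte sequence, but 0x28 is not a continuation byte
```

Rejecting malformed input at the boundary is usually simpler than letting replacement characters spread through the rest of a program.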
  17. Recognizing specific grammars in certain strings These all use strings

    in a certain format (defined by a grammar) to represent more than just their bytes
  18. Performance → Indexing is → Length is → Looping can

    be either or → Checking equality → What does that even mean‽
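
One reason indexing a string can cost more than indexing an array: in a variable-width encoding such as UTF-8, the position of the n-th code point is only known after scanning the bytes before it. The Python sketch below assumes valid UTF-8 input, and `byte_offset_of_code_point` is an illustrative name.

```python
def byte_offset_of_code_point(data: bytes, index: int) -> int:
    """Byte offset at which code point number `index` starts in valid UTF-8 `data`."""
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes have the form 0b10xxxxxx; any other byte starts a code point.
        if (byte & 0xC0) != 0x80:
            if seen == index:
                return offset
            seen += 1
    raise IndexError("code point index out of range")

data = "a\u00e9\U0001F600z".encode("utf-8")   # 1-, 2-, 4-, and 1-byte code points
offsets = [byte_offset_of_code_point(data, i) for i in range(4)]
```

Calling this helper inside a loop over every index rescans the prefix each time, which is how innocent-looking code ends up quadratic.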
  19. Performance → Simple-looking code can be slow → Unicode adds

    performance wrinkles → String slicing can lead to memory leaks → Or string slicing can be slow
  20. Performance → Changing encodings → Delineating lines/paragraphs/pages → Text layout

    → S ̵͘ ͝ ͖͎̯̱͕̣̝ u ̵̊̈́̓̈́̾̎̔̄̃ ̢͚͓̞̺ ͜ p ̶͓͔͕̞͆͆̇̂̄ 㸅 ͍̘̰̮̝̪̖ e ̴ ̚ ͑̿ ̚ ̙̬̜͕͓͈̝ r ̷̛͠ ͐̈́͌̎̒́͑ ̕ ͖ ̴̔̎̇ ́͆͐̔̚ ̢̳̲͔ ͎ l ̸̀̎̿̚ ͉̥ a ̴̾̅̿́ ̢̪̤͇͎̣ ͔̠ r ̷̓̈́̓͑ ̡̥̯͈ ̨̜̬̘ g ̵͛̅ ͚̤̰ e ̸̡͍͚̞̩͇̱̘̭͇̝̦͕̃́́̀̈́̌̄̆́ ̸̽̈́̃͘ ̚ ̐ ͒͝ ̕ ̓̽ ̨͍͍̬͖̭̠͓ s ̷̓̔͌̌ ̡ ̞͍͕̻̰̥ t ̵̆͂̍͐͝ ͋ ͕̭̯ r ̷̖͎̟̻̯̗̥̓̅ i̴ ̠̝̑̚ ͇̼͎̖̝̮ n ̸̛͉ g ̵̀͛̓̒͗̓͌͆̑͘̚ ͓ s ̷̏́͛͒ ̡̙̻̪
  21. Shipping All of these considerations mean designing fast & correct

    string handling is a massive undertaking.
  22. Shipping Sometimes, sacrificing on one of them is OK. Sometimes,

    being incorrect or slow and shipping the rest of your app is a worthy tradeoff.
  23. So, Strings → Not evil, just complicated → An overloaded

    concept – pick the overload you need → Move from a string to structured data as soon as possible
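
As one possible reading of "move from a string to structured data as soon as possible", the Python sketch below parses a URL once at the boundary and works with typed fields afterwards. The URL itself is a made-up example.

```python
from urllib.parse import urlsplit

raw = "https://example.com:8080/decks/strings?page=2"   # hypothetical input
url = urlsplit(raw)

# From here on, code reads named fields instead of slicing or searching the raw string.
scheme = url.scheme
host = url.hostname
port = url.port      # already an int, not a substring
path = url.path
query = url.query
```

The grammar is handled in one place, and the rest of the program never needs to know how the string was laid out.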