String Theory

Nathan Long
September 02, 2016

By James Edward Gray and Nathan Long at ElixirConf (AKA Elixir and Phoenix Conf) 2016.

Elixir's strings and iolists enable great features, but do you understand them? Why is a string a binary, and what do the numbers in the binary have to do with Elixir's great Unicode support? What are iolists, and how do they enable efficient template rendering?
In this talk, you'll learn how to:
- Understand the relationship between bitstrings, binaries, and strings
- Properly compare and dissect UTF-8 strings
- Efficiently build string output to write to a file or socket

Come along for a magical journey 🌈🌠 into the bits and bytes that make these structures so powerful.

Transcript

  1. 9/6/16, 10:57 AM String Theory Page 2 of 93 http://localhost:9090/print

    String Theory Artwork & photo by Mahmoud Al-Qammari
  2. Overview of Stringy Types

    - Charlist
    - Atom
    - String
    - "iolist"
  3. Charlist (Single-quoted Strings)

    List of integers that represent Unicode codepoints

        'abc' == [97, 98, 99]

        > i 'abc'
        Term: 'abc'
        Data type: List
        Raw representation: [97, 98, 99]
  4. Atom

    - Fast pattern matching

        {:ok, contents} = File.read("some_file.txt")

    - Module names

        File == :"Elixir.File"
  5. Atom

    Never garbage collected

        ✗ String.to_atom(param)
        ✓ String.to_existing_atom(param)
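
The atom advice above can be sketched as a small wrapper around untrusted input. `SafeAtom` and `fetch/1` are hypothetical names, not something from the talk; the only library call is `String.to_existing_atom/1`, which raises `ArgumentError` when the atom is not already known to the VM.

```elixir
defmodule SafeAtom do
  # Hypothetical helper: convert a string to an atom only if that atom
  # already exists, so untrusted input can never grow the atom table.
  def fetch(param) when is_binary(param) do
    {:ok, String.to_existing_atom(param)}
  rescue
    ArgumentError -> :error
  end
end

# :ok already exists (it is used in this very module).
IO.inspect(SafeAtom.fetch("ok"))

# A freshly generated name is very unlikely to exist as an atom.
unknown = "no_such_atom_" <> Integer.to_string(System.unique_integer([:positive]))
IO.inspect(SafeAtom.fetch(unknown))
```

Returning a tagged result instead of raising keeps the caller in ordinary pattern-matching territory, which is one reasonable design among several.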
  6. Bitstrings

    A sequence of bits, contiguous in memory

        <<1::size(1), 0::size(1)>>
  7. Binaries

    A bitstring containing a sequence of bytes

        <<0, 255>> == <<0::size(8), 255::size(8)>>
  8. Strings (Double-quoted Strings)

    A binary whose bytes can be interpreted as characters

        ✓ <<97, 98, 99>> => "abc"
        ✓ <<195, 161, 114, 98, 111, 108>> => "árbol"
        ✗ <<255, 0>> => <<255, 0>>
  9. All Together Now

    - bitstring if it is a series of bits via <<…>>
    - binary if the bits make up a whole number of bytes
    - string if the bytes are UTF-8 encoded
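
The three-level hierarchy above can be checked directly in code. This is a minimal sketch using only `is_bitstring/1`, `is_binary/1`, `bit_size/1`, and `String.valid?/1`; the variable names are illustrative.

```elixir
two_bits = <<1::size(1), 0::size(1)>>
raw_bytes = <<255, 0>>
text = <<195, 161, 114, 98, 111, 108>>

# Every <<...>> value is a bitstring.
IO.inspect(is_bitstring(two_bits), label: "two_bits is a bitstring")

# Two bits are not a whole number of bytes, so this is not a binary.
IO.inspect(is_binary(two_bits), label: "two_bits is a binary")
IO.inspect(bit_size(two_bits), label: "two_bits bit_size")

# Two full bytes form a binary, but 255 never occurs in valid UTF-8.
IO.inspect(is_binary(raw_bytes), label: "raw_bytes is a binary")
IO.inspect(String.valid?(raw_bytes), label: "raw_bytes is a string")

# These six bytes are valid UTF-8, so the binary is also a string.
IO.inspect(String.valid?(text), label: "text is a string")
IO.inspect(text == "\u00E1rbol", label: "text is the word with a precomposed á")
```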
  10. String Concatenation

        iex(1)> {james, n, nathan} = {"James", " and ", "Nathan"}
        {"James", " and ", "Nathan"}
        iex(2)> IO.puts james <> n <> nathan
        James and Nathan
        :ok

    - Requires allocating an extra string
    - Involves memory copying into that string
  11. String Interpolation

        iex(3)> IO.puts "#{james}#{n}#{nathan}"
        James and Nathan
        :ok

    - This is just good looking concatenation
    - Under the hood it's similar
  12. Can we just use lists?

        iex(4)> IO.puts [james, n, nathan]
        James and Nathan
        :ok

    - Allocates a few small lists
    - They point at the binaries
    - No memory copying is needed
  13. Improper Nesting

        iex(5)> IO.puts [[james | n] | nathan]
        James and Nathan
        :ok

    - Now we're down to just two lists
    - This format is also faster to build by appending data
  14. Quick List Appending

    Slow:

        ["a"] ++ ["b"] ++ ["c"] ++ ["d"]

    Fast:

        output = "a"
        output = [output | "b"]
        output = [output | "c"]
        output = [output | "d"]
        # [[["a" | "b"] | "c"] | "d"]
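
The fast appending pattern above generalizes to a reduce. This is a minimal sketch, assuming the pieces arrive as a list called `parts` (an illustrative name); `IO.iodata_to_binary/1` and `IO.iodata_length/1` are standard library functions that accept the resulting improper list.

```elixir
parts = ["a", "b", "c", "d"]

# Wrapping the accumulator as [acc | part] is a constant-time append:
# it allocates one cons cell and copies no bytes.
output = Enum.reduce(parts, [], fn part, acc -> [acc | part] end)

# The nested, improper list is valid iodata.
IO.inspect(IO.iodata_length(output), label: "bytes")

# Flatten only if a single binary is truly needed.
IO.inspect(IO.iodata_to_binary(output), label: "flattened")
```

Starting from `[]` rather than `"a"` leaves an empty list at the innermost position, which iodata consumers simply skip.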
  15. Elixir's Definitions

    - bitstring: Anything expressed as <<…>>
    - binary: A bitstring with a length in bits divisible by eight
    - string: A binary that contains UTF-8 encoded codepoints
    - charlist: A proper list of integers representing codepoints
    - iolist: A proper or improper list of numeric bytes, binaries, and/or nested iolists
    - iodata: An iolist or a binary
    - chardata: A string, or a proper or improper list of UTF-8 codepoints, strings, and/or nested chardata lists
  16. Converting to Base 2

        base_2 = fn (i) -> Integer.to_string(i, 2) end
  17. ASCII Fits in 1 Byte

        ?a == 97
        base_2.(?a) == "1100001"

    Padded to a full byte: 01100001
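
The step from the 7-digit result to the 8-bit byte is just left-padding with zeros. A minimal sketch, where `byte_bits` is a hypothetical helper built on the talk's `base_2` function and `String.pad_leading/3`:

```elixir
base_2 = fn i -> Integer.to_string(i, 2) end

# Hypothetical helper: show an integer as a full 8-bit byte.
byte_bits = fn i -> i |> base_2.() |> String.pad_leading(8, "0") end

IO.inspect(base_2.(?a))    # "1100001"
IO.inspect(byte_bits.(?a)) # "01100001"
```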
  18. Unicode

    https://www.tofugu.com/japan/japanese-internet/
  19. Unicode: Also A Mapping

    - A = 65
    - Λ = 923
    - 🕴 = 128,372
  20. ~1 million codepoints!

        ?𠜎 == 132_878
        base_2.(?𠜎) == "100000011100001110"

    Some codepoints need more bytes than others...
  21. UTF-8 Templates

        0xxxxxxx
        110xxxxx 10xxxxxx
        1110xxxx 10xxxxxx 10xxxxxx
        11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  22. The Bits For ⏰

        ?⏰ == 9200
        base_2.(?⏰) == "10001111110000"
  23. Encoding ⏰ Into UTF-8

          10001111110000

          1110xxxx 10xxxxxx 10xxxxxx
        = 11100010 10001111 10110000
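
Filling the three-byte template by hand can be written with bitstring syntax. This is a sketch, not how Elixir implements it: `Utf8Sketch` and `encode3/1` are hypothetical names, and the function assumes a codepoint in the three-byte range and ignores details such as surrogates. The result is compared against the built-in `::utf8` modifier.

```elixir
defmodule Utf8Sketch do
  # Hypothetical helper: encode a codepoint from the 3-byte range by
  # splitting its 16 bits into groups of 4, 6, and 6 and dropping them
  # into the 1110xxxx 10xxxxxx 10xxxxxx template.
  def encode3(cp) when cp in 0x0800..0xFFFF do
    <<a::4, b::6, c::6>> = <<cp::16>>
    <<0b1110::4, a::4, 0b10::2, b::6, 0b10::2, c::6>>
  end
end

clock = 0x23F0 # the codepoint of ⏰, 9200 in decimal

IO.inspect(Utf8Sketch.encode3(clock), binaries: :as_binaries)
IO.inspect(Utf8Sketch.encode3(clock) == <<clock::utf8>>, label: "matches built-in encoder")
```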
  24. Is That What Elixir Does?

        i "⏰"
        ....
        Raw representation
          <<226, 143, 176>>

        [226, 143, 176] |> Enum.map(base_2) ==
          ["11100010", "10001111", "10110000"]
  25. Three Kinds of Bytes

        Starts With             Kind
        0                       Solo
        10                      Continuation
        110 or 1110 or 11110    First of N (count the 1s)
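
The table above maps naturally onto pattern matching against the leading bits of each byte. A minimal sketch with hypothetical names (`ByteKind`, `classify/1`, `kinds/1`); it only labels bytes and does not validate a whole string.

```elixir
defmodule ByteKind do
  # Classify one byte by its leading bits, following the table.
  def classify(<<0::1, _::7>>), do: :solo
  def classify(<<0b10::2, _::6>>), do: :continuation
  def classify(<<0b110::3, _::5>>), do: {:first_of, 2}
  def classify(<<0b1110::4, _::4>>), do: {:first_of, 3}
  def classify(<<0b11110::5, _::3>>), do: {:first_of, 4}
  # Anything else (11111xxx) cannot appear in UTF-8.
  def classify(<<_::8>>), do: :invalid

  # Label every byte of a binary, one byte at a time.
  def kinds(binary) when is_binary(binary) do
    for <<byte::binary-size(1) <- binary>>, do: classify(byte)
  end
end

# "a" followed by ⏰ (codepoint 0x23F0)
IO.inspect(ByteKind.kinds("a" <> <<0x23F0::utf8>>))
```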
  26. Three Kinds of Bytes

        Character    UTF-8 bytes
        a            01100001
        🍠           11110000 10011111 10001101 10100000
  27. String Reversal

        Solo | First of 3 | Continuation | Continuation
  28. String Reversal: Wrong

        Solo | First of 3 | Continuation | Continuation
        ↓
        Continuation | Continuation | First of 3 | Solo
  29. String Reversal: Right

        Solo | First of 3 | Continuation | Continuation
        ↓
        First of 3 | Continuation | Continuation | Solo
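
The difference between the wrong and right reversal can be reproduced with a solo byte followed by a three-byte character. A minimal sketch; the variable names are illustrative, and the byte-wise reversal goes through `:binary.bin_to_list/1` and `:binary.list_to_bin/1`.

```elixir
clock = <<0x23F0::utf8>> # ⏰, three bytes
text = "a" <> clock

# Wrong: reversing raw bytes puts continuation bytes first.
bytewise =
  text
  |> :binary.bin_to_list()
  |> Enum.reverse()
  |> :binary.list_to_bin()

IO.inspect(bytewise, label: "bytewise")
IO.inspect(String.valid?(bytewise), label: "bytewise is valid UTF-8")

# Right: String.reverse/1 keeps each character's bytes together.
IO.inspect(String.reverse(text) == clock <> "a", label: "String.reverse keeps ⏰ intact")
```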
  30. Combining Diacritical Marks

    Multiple codepoints, one "grapheme"

        ë = e + ̈
  31. Combining Diacritical Marks

        noel = "noe\u0308l"
        "noël"

        String.codepoints(noel)
        ["n", "o", "e", "̈", "l"]

        String.graphemes(noel)
        ["n", "o", "ë", "l"]
  32. Tricky Traversal - O(N)

        some_string |> String.length
        some_string |> String.slice(2,3)
  33. Tricky Length

        # Length in bytes or graphemes?
        # (the ë here is "e" plus the combining mark \u0308)
        byte_size("noe\u0308l") == 6
        String.length("noe\u0308l") == 4
  34. Tricky Reversal

    Elixir:

        String.reverse("noe\u0308l")
        => "lëon"

    Ruby:

        "noe\u0308l".reverse
        => "l̈eon"
  35. Tricky Equality

        {combined, single} = {"e\u0308", "\u00EB"}
        => {"ë", "ë"}

        combined == single
        => false

        String.equivalent?(combined, single)
        => true
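
When two strings will be compared or stored more than once, normalizing them up front is an alternative to calling `String.equivalent?/2` each time. A minimal sketch using `String.normalize/2` with the `:nfc` (composed) and `:nfd` (decomposed) forms; this is an addition for illustration, not something shown in the talk.

```elixir
combined = "e\u0308" # "e" plus combining diaeresis
single = "\u00EB"    # precomposed ë

# The byte sequences differ, so plain equality fails.
IO.inspect(byte_size(combined), label: "combined bytes")
IO.inspect(byte_size(single), label: "single bytes")
IO.inspect(combined == single, label: "raw ==")

# After normalizing both to the same form, == works.
IO.inspect(String.normalize(combined, :nfc) == single, label: "NFC ==")
IO.inspect(String.normalize(single, :nfd) == combined, label: "NFD ==")

# Either way, both are a single grapheme.
IO.inspect(String.length(combined), label: "graphemes")
```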
  36. Tricky Casing

    Elixir:

        String.upcase("mañana") == "MAÑANA"

    Ruby < 2.4.0:

        "mañana".upcase
        => "MAñANA"
  37. Even Elixir Is Not Perfect

        String.downcase("ΛΣ") # should be "λς"
        => "λσ"
  38. The Rules of the BEAM

    - Processes are isolated
    - Data is immutable
    - Message sends are memory copies
  39. The Large Binary Space

    - Binaries up to 64 bytes are stored in the process
      - This is the same as other data types
    - "Large" binaries (called Refc Binaries) are stored outside processes
      - The actual binary is in the "Large Binary Space"
      - A small reference to the binary (called a ProcBin) is used in the process
  40. So What???

    This is transparent … and AMAZING … until it isn't … and it tries to kill your code
  41. The Major Win

    - Passing large binaries between processes is cheap
      - Small references are copied over
    - This is potentially very helpful since we store encoded data in binaries
      - HTTP responses
      - Rendered sections of Phoenix templates
  42. The Tradeoff

    - Refc Binaries are "reference counted"
      - When a reference is made in a new process, the count goes up
      - When a reference is garbage collected, the count goes down
      - When the count hits zero, the binary is removed
    - But what if a process doesn't hit garbage collection for a while?
  43. Mysteries

    - This problem sometimes surfaces as mysterious
      - Memory leaks
      - Crashes
    - It has affected
      - Heroku
      - Splunk
      - Avdi Grimm
    - It can be a hard scenario to debug
  44. Detecting the Leak

    This may help you see the memory used:

        :erlang.system_info(:allocated_areas)
        |> Enum.find(fn category -> elem(category, 0) == :binary end)
  45. Possible Solutions

    - Manually force the GC of processes periodically
    - Make processes short running (There's no GC like exit!)
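
Both ideas, watching binary memory and forcing a collection, can be tried with two standard BEAM calls. This is a minimal sketch: `:erlang.memory(:binary)` is used here as an alternative way to read binary memory (it is not the call shown in the talk), and the idle process is a made-up stand-in for a long-running worker.

```elixir
# Total bytes currently allocated for binaries across the VM.
binary_bytes = :erlang.memory(:binary)
IO.inspect(binary_bytes, label: "binary memory (bytes)")

# Stand-in for a long-lived process that rarely collects on its own.
pid =
  spawn(fn ->
    receive do
      :stop -> :ok
    end
  end)

# Force a collection in that one process so it drops any references
# it no longer uses; this returns true while the process is alive.
collected? = :erlang.garbage_collect(pid)
IO.inspect(collected?, label: "collected?")

send(pid, :stop)
```

Comparing `:erlang.memory(:binary)` before and after a forced collection is a rough way to tell whether lingering references are the cause of growth.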
  46. Look Ma, No Concatenation!

        IO.puts "Hi " <> "James"
        # => "Hi James"

        IO.puts ["Hi ", "James"]
        # => "Hi James"

        IO.puts ["Hi, ", 9731]
        # => "Hi, ☃"

        IO.puts ["H",["e",["l",["l",["o"]]]]]
        # => "Hello"
  47. String Reuse

        users = [%{name: "Joe"}, %{name: "Amy"}]
        [start_li, end_li] = ["<li>", "</li>"]

        response = Enum.map(users, fn (user) ->
          [start_li, user.name, end_li]
        end)

        IO.puts response
  48. Benefits of String Reuse

    - Skip the work of concatenation
    - Less string allocation = less RAM used
    - Less string allocation = less GC work
  49. "iolists" are for I/O

    Your process talking with the outside world:

    - Printing to standard output
    - Writing to a file
    - Sending data over the network
  50. Different System Calls

    - The write system call writes out a binary to files or sockets
    - But writev writes a vector (list) of binaries
      - Addresses can be reused
      - Concatenation isn't needed
  51. Different System Calls

        {:ok, file} = :file.open("/tmp/tmp.txt", [:write, :raw])
        foo = "foo"
        bar = "bar"
        output = [foo, bar, foo]
        output = Enum.join(output)
        :file.write(file, output)

    System Call:

        write:return Write data (9 bytes): \
          0x00000000146007e2 foobarfoo
  52. Different System Calls

        {:ok, file} = :file.open("/tmp/tmp.txt", [:write, :raw])
        foo = "foo"
        bar = "bar"
        output = [foo, bar, foo]
        # output = Enum.join(output)
        :file.write(file, output)

    System Call:

        writev:return Writev data 1/3: (3 bytes): \
          0x0000000014600430 foo
        writev:return Writev data 2/3: (3 bytes): \
          0x0000000014600120 bar
        writev:return Writev data 3/3: (3 bytes): \
          0x0000000014600430 foo
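
The iolist write can be run end to end as a script. This sketch only confirms that the bytes land in the file in order; whether the VM actually issues `writev` depends on the runtime and is not checked here. The temporary file name is made up for the example.

```elixir
# A throwaway path under the system temp directory (hypothetical name).
path =
  Path.join(
    System.tmp_dir!(),
    "string_theory_#{System.unique_integer([:positive])}.txt"
  )

# Raw mode, as in the slides.
{:ok, file} = :file.open(path, [:write, :raw])

foo = "foo"
bar = "bar"

# Hand the list to the file driver without joining it first.
:ok = :file.write(file, [foo, bar, foo])
:ok = :file.close(file)

written = File.read!(path)
IO.inspect(written) # "foobarfoo"

File.rm!(path)
```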
  53. What IO has Repetitive Strings?

    Web responses!

    - Snippets repeated in page: <li class="product">
    - Sections repeated across requests: <footer>...</footer>
  54. Caching via function compilation

    - foo.html.eex is found
    - Phoenix uses EEx to compile it
      - Phoenix specifies that rendered chunks should be added to an iolist
    - Compiles to MyView.render("foo.html", assigns)
    - That function builds iolists
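
To make the idea concrete, here is a hand-written stand-in for the kind of function a compiled template might turn into. It is an illustration only, not actual Phoenix or EEx output; the markup and the `name` assign are invented for the example.

```elixir
defmodule MyView do
  # Hypothetical, hand-written equivalent of a compiled template.
  # The static chunks are literals in the module; only the dynamic
  # value changes between calls, and nothing is concatenated.
  def render("foo.html", assigns) do
    ["<h1>Hello, ", assigns.name, "!</h1>"]
  end
end

iolist = MyView.render("foo.html", %{name: "James"})

# The result stays a list until something writes it out.
IO.inspect(iolist)
IO.puts(iolist)
```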
  55. Simple Caching

    - All static strings from template are reused - "cached"
    - No dynamic parts of template are cached
    - Cache is invalidated only if you change the template
  56. Moral: Use iolists When Doing I/O

    (With caveats)

    - When passing iolists across processes (e.g. to Cowboy), small (non-refc) strings will get combined
    - You have to use raw mode when writing files to get writev
    - I'll elaborate in a blog post on https://www.bignerdranch.com/blog/ soon

    But: it won't hurt to use iolists, and may help. There's no point in joining strings yourself.
  57. THANKS FOR LISTENING!!! ✌

    James: graysoftinc.com
    Nathan: NathanMLong.com