Bits, Bytes, Strings And Emojis -- Playing With Binaries In Elixir

qhwa
May 16, 2020

These are the slides for the talk I gave at the Beijing Elixir Meetup in May 2020. In this talk, I shared some interesting things I learned while studying binaries in Elixir / Erlang. It covers BitString, Binary, String, Charlist, Unicode, and UTF-8.

There is also a series of blog posts on the same topic, covering more details.
https://medium.com/@qhwa_85848/questions-for-bitstring-binary-charlist-and-string-in-elixir-5b7e0c1e41a0

Thanks for watching and please share your thoughts.

Transcript

  1. About Me: Qiu Hua
    ➔ Doting dad to a daughter
    ➔ Full-stack web developer (web UI + backend)
    ➔ I love functional programming
    ➔ Worked at Alibaba / Helijia.com
    ➔ Current job: babysitting & some open source projects
    ➔ Github / Twitter: @qhwa
  2. Agenda
    - Emoji + String
    - Bytes / Binary (f5 2c)
    - Bits / BitString (110101)
    - Memory Model
    - Performance Pitfalls
    - UTF-8
    - Unicode
  3. <<...>>
    1. constructing
        <<0, 1, 2, 3>>
        <<"Hello", 32, ?w, 28530::unsigned-integer-size(16), 27748::16>>
    2. pattern matching
        <<data_length::integer-2, content::bytes>> = io_data
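The two uses of `<<...>>` above can be sketched with a made-up frame layout: a 16-bit length prefix followed by the payload. The module name `FrameSketch` and the layout itself are assumptions for illustration only; they do not come from the slides.

```elixir
defmodule FrameSketch do
  # Constructing: a 16-bit unsigned length followed by the payload bytes.
  def encode(payload) when is_binary(payload) do
    <<byte_size(payload)::unsigned-integer-size(16), payload::binary>>
  end

  # Pattern matching: `len` is bound by the first segment and then used
  # as the size of the `payload` segment in the same pattern.
  def decode(<<len::unsigned-integer-size(16), payload::binary-size(len), rest::binary>>) do
    {:ok, payload, rest}
  end

  # Anything too short for its own length prefix is rejected.
  def decode(_other), do: :error
end

# The segments from the slide spell out a familiar string.
IO.inspect(<<"Hello", 32, ?w, 28530::unsigned-integer-size(16), 27748::16>>)
# prints "Hello world"

IO.inspect(FrameSketch.decode(FrameSketch.encode("abc") <> "tail"))
# prints {:ok, "abc", "tail"}
```

The 16-bit segments work because 28530 is 0x6F72 ("or") and 27748 is 0x6C64 ("ld").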
  4. Bytes / Binary
    A bitstring is a sequence of zero or more bits, where the number of bits does not need to be divisible by 8. If the number of bits is divisible by 8, the bitstring is also a binary.
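The definition can be checked with the `is_binary/1` and `is_bitstring/1` guards. This is a small sketch; the module name `BitsSketch` and the returned atoms are hypothetical.

```elixir
defmodule BitsSketch do
  # Every binary is a bitstring, so the more specific guard comes first.
  def classify(term) when is_binary(term), do: :binary
  def classify(term) when is_bitstring(term), do: :bitstring_only
  def classify(_term), do: :neither
end

# Five single-bit segments: not divisible by 8, so not a binary.
five_bits = <<1::size(1), 0::size(1), 1::size(1), 0::size(1), 1::size(1)>>

IO.inspect(bit_size(five_bits))
# prints 5
IO.inspect(BitsSketch.classify(five_bits))
# prints :bitstring_only

# Padding with three more bits reaches a byte boundary.
IO.inspect(BitsSketch.classify(<<five_bits::bitstring, 0::size(3)>>))
# prints :binary
```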
  5. iex(1)> i <<65>>
    Term
      "A"
    Data type
      BitString
    Byte size
      1
    Description
      This is a string: a UTF-8 encoded binary. It's printed surrounded by
      "double quotes" because all UTF-8 encoded code points in it are printable.
    Raw representation
      <<65>>
    Reference modules
      String, :binary
    Implemented protocols
      Collectable, IEx.Info, Inspect, List.Chars, String.Chars
  6. iex(2)> i <<128>>
    Term
      <<128>>
    Data type
      BitString
    Byte size
      1
    Description
      This is a binary: a collection of bytes. It's printed with the `<<>>`
      syntax (as opposed to double quotes) because it is not a UTF-8 encoded
      binary (the first invalid byte being `<<128>>`)
    Reference modules
      :binary
    Implemented protocols
      Collectable, IEx.Info, Inspect, List.Chars, String.Chars
  7. How are binaries implemented?
    http://erlang.org/doc/efficiency_guide/binaryhandling.html#how-binaries-are-implemented
    - BitString and Binary are implemented the same way (because Binaries are BitStrings)
    - Four types of binaries
      - Two are containers holding data
        - heap binary (<= 64B)
        - refc-binary (reference-counted binary) (> 64B)
      - Two are merely references to part of an existing binary
        - sub binary
        - match-context
  8. sub binary
    - created by split_binary
    - … when pattern matching a binary
    - is a reference to another binary (refc / heap)
    - :binary.referenced_byte_size/1

    iex(1)> <<_head::binary-size(3), rest::binary>> = :binary.copy("x", 50)
    "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
    iex(2)> byte_size(rest)
    47
    iex(3)> :binary.referenced_byte_size(rest)
    50
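Because a sub binary keeps the binary it points into alive, a small slice can pin a much larger one in memory. One common way to break that link is `:binary.copy/1`, sketched below. The module name `SubBinarySketch` is hypothetical, and the size reported before the copy depends on how the runtime represents the slice, so only the size after the copy is asserted here.

```elixir
defmodule SubBinarySketch do
  # Returns {referenced bytes before copying, referenced bytes after copying}
  # for the tail of `big` that remains after skipping `skip` bytes.
  def tail_sizes(big, skip) when is_binary(big) do
    <<_head::binary-size(skip), rest::binary>> = big

    # :binary.copy/1 allocates a fresh binary that holds only the bytes of `rest`.
    detached = :binary.copy(rest)

    {:binary.referenced_byte_size(rest), :binary.referenced_byte_size(detached)}
  end
end

{before_copy, after_copy} = SubBinarySketch.tail_sizes(:binary.copy("x", 50), 3)
IO.inspect({before_copy, after_copy})
# after_copy is 47: the detached copy references only its own bytes
```

Copying costs time and memory of its own, so it tends to pay off only when a short-lived large binary would otherwise be kept alive by a long-lived small slice.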
  9. match context
    - is similar to sub-binary
    - optimized for binary matching

    def my_binary_to_list(<<h, t::binary>>), do: [h | my_binary_to_list(t)]
    def my_binary_to_list(<<>>), do: []

    - A match context (reusable pointer) instead of sub-binary will be created
    - Only 1 match context will be created during the whole recursion.
  10. BAD code example:
    defmodule MyBinParser do
      # length is a 16-bit prefix, wide enough to hold 1000
      def valid?(<<length::16, data::binary>>), do: byte_size(data) == length
      def valid?(_), do: false
    end

    data = <<1000::16, :binary.copy("0", 1000)::binary>>
    MyBinParser.valid?(data) #=> true
  11. GOOD code example:
    defmodule MyBinParser do
      # length is a 16-bit prefix, wide enough to hold 1000
      def valid?(<<length::16, _data::binary-size(length), _rest::binary>>), do: true
      def valid?(_), do: false
    end

    data = <<1000::16, :binary.copy("0", 1000)::binary>>
    MyBinParser.valid?(data) #=> true
  12. benchmark
    Name              ips        average    deviation     median    99th %
    match_context     39.84 M    25.10 ns   ±18419.53%    18 ns     81 ns
    sub_binary        16.42 M    60.89 ns   ±28228.57%    38 ns     112 ns

    Comparison:
    match_context     39.84 M
    sub_binary        16.42 M - 2.43x slower +35.79 ns
  13. BAD code example:
    defmodule MyBinParser1 do
      @moduledoc """
      Retrieve binaries before a 0x00 byte, recursively concatenating the result binary.
      """

      def before_zero(bin), do: do_before_zero(bin, "")

      defp do_before_zero(<<>>, acc), do: acc
      defp do_before_zero(<<n, 0, _rest::binary>>, acc), do: <<acc::binary, n>>

      defp do_before_zero(<<n, rest::binary>>, acc),
        do: <<acc::binary, n, before_zero(rest)::binary>>
    end
  14. GOOD code example #1 (match-context):
    defmodule MyBinParser2 do
      @moduledoc """
      Retrieve binaries before a 0x00 byte, using a match context.
      """

      def before_zero(bin), do: do_before_zero(bin, 0, byte_size(bin))

      defp do_before_zero(bin, max, max), do: bin

      defp do_before_zero(bin, pt, max) do
        case bin do
          <<leading::binary-size(pt), 0, _rest::binary>> -> leading
          _ -> do_before_zero(bin, pt + 1, max)
        end
      end
    end
  15. GOOD code example #2 (iolist):
    defmodule MyBinParser3 do
      @moduledoc """
      Retrieve binaries before a 0x00 byte, using a list.
      """

      def before_zero(bin), do: do_before_zero(bin, [])

      defp do_before_zero(<<>>, acc), do: acc |> Enum.reverse() |> IO.iodata_to_binary()
      defp do_before_zero(<<0, _rest::binary>>, acc), do: do_before_zero("", acc)
      defp do_before_zero(<<n, rest::binary>>, acc), do: do_before_zero(rest, [n | acc])
    end
  16. benchmark
    data = <<:binary.copy("abcd", 100)::binary, 0, "xyz">>

    Benchee.run(
      %{
        binary_concating: fn -> MyBinParser1.before_zero(data) end,
        match_context: fn -> MyBinParser2.before_zero(data) end,
        iodata: fn -> MyBinParser3.before_zero(data) end
      }
    )
  17. benchmark result
    Name                 ips         average     deviation    median      99th %
    iodata               161.22 K    6.20 μs     ±161.59%     5.82 μs     10.73 μs
    match_context        113.67 K    8.80 μs     ±70.38%      8.68 μs     10.05 μs
    binary_concating     16.34 K     61.19 μs    ±39.46%      50.47 μs    139.48 μs

    Comparison:
    iodata               161.22 K
    match_context        113.67 K - 1.42x slower +2.60 μs
    binary_concating     16.34 K - 9.87x slower +54.99 μs
  18. compiler to help
    - erlc +bin_opt_info Mod.erl
    - ERL_COMPILER_OPTIONS=bin_opt_info mix compile --force

    Warning: NOT OPTIMIZED: binary is returned from the function
    Warning: OPTIMIZED: match context reused
  19. String in Erlang
    - A sequence of bytes encoded in UTF-8 ← String in Elixir
    - A list of Unicode codepoints ← Charlist in Elixir
    - A combination of the two above
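The three shapes listed above can be told apart at runtime. The sketch below is illustrative only: the module name `TextKindSketch` and the returned atoms are hypothetical, and the list check is deliberately shallow (it does not validate nested content).

```elixir
defmodule TextKindSketch do
  # A binary counts as an Elixir string only when it is valid UTF-8.
  def kind(term) when is_binary(term) do
    if String.valid?(term), do: :utf8_binary, else: :raw_binary
  end

  # A flat list of integers is a charlist; a list that also holds
  # binaries or nested lists is the "combination of the two".
  def kind(term) when is_list(term) do
    if Enum.all?(term, &is_integer/1), do: :charlist, else: :mixed_chardata
  end
end

# 0x1F36D is the codepoint used in the decoding example of this deck.
mixed = ["l", [0x1F36D], "p"]

IO.inspect(TextKindSketch.kind("lp"))
# prints :utf8_binary
IO.inspect(TextKindSketch.kind(String.to_charlist("lp")))
# prints :charlist
IO.inspect(TextKindSketch.kind(mixed))
# prints :mixed_chardata

# IO.chardata_to_string/1 flattens the mixed form into one UTF-8 binary.
IO.inspect(IO.chardata_to_string(mixed) == <<?l, 0x1F36D::utf8, ?p>>)
# prints true
```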
  20. Unicode
    - A set of specifications
    - contains a list of user-perceived characters and their corresponding codes (codepoints)
    - still growing (v13.0 in May 2020)
  21. Unicode characters
    A -> U+0041   # same as ASCII when < 0x80
    力 -> U+529B
    …
    More than 140k characters
  22. Encoding & Decoding Codepoints to Binaries
    - Unicode codepoints range from 0 to 0x10FFFF.
    - 3 bytes are enough for a single codepoint.
    - 4 bytes should be enough for the foreseeable future.
    - Most obvious method: 4 bytes for every character
      - called UTF-32, or UCS-4
      - too much space wasted
      - incompatible with old systems
        - ASCII files are not valid UTF-32
        - zero bytes mean "end of the string" in many legacy systems
    - UTF-8, a brilliant way
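The space trade-off between the two encodings can be measured with the `utf8` and `utf32` bitstring modifiers. This is a small sketch; the module name `EncodingSizeSketch` is hypothetical, and the sample codepoints are the ones that appear elsewhere in this deck.

```elixir
defmodule EncodingSizeSketch do
  # Number of bytes one codepoint occupies in each encoding.
  def sizes(codepoint) when is_integer(codepoint) do
    %{
      utf8: byte_size(<<codepoint::utf8>>),
      utf32: byte_size(<<codepoint::utf32>>)
    }
  end
end

for codepoint <- [?A, 0x529B, 0x1F36D] do
  IO.inspect({codepoint, EncodingSizeSketch.sizes(codepoint)})
end
# prints {65, %{utf32: 4, utf8: 1}}
# prints {21147, %{utf32: 4, utf8: 3}}
# prints {127853, %{utf32: 4, utf8: 4}}

# UTF-32 pads ASCII with zero bytes, which is what trips up legacy systems.
IO.inspect(<<?A::utf32>>)
# prints <<0, 0, 0, 65>>
```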
  23. 11110000 10011111 10001101 10101101
      -----xxx --xxxxxx --xxxxxx --xxxxxx
           000   011111   001101   101101
      ┕━━━━━━━━━ 0x1f36d ━━━━━━━━━━┙

    decoding example
    6c f0 9f 8d ad 70
    -> 01101100 11110000 10011111 10001101 10101101 01110000
    -> l🍭p
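The bit layout above maps directly onto bitstring patterns. The sketch below decodes UTF-8 by hand for illustration; the module name `Utf8Sketch` is hypothetical, and the decoder is intentionally lenient (it does not reject overlong forms or surrogates, and it raises on malformed input). In real code the built-in `utf8` modifier or `String.to_charlist/1` is the better choice.

```elixir
defmodule Utf8Sketch do
  # 1 byte: 0xxxxxxx
  def decode_one(<<0::1, cp::7, rest::binary>>), do: {cp, rest}

  # 2 bytes: 110xxxxx 10xxxxxx
  def decode_one(<<0b110::3, a::5, 0b10::2, b::6, rest::binary>>) do
    <<cp::11>> = <<a::5, b::6>>
    {cp, rest}
  end

  # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
  def decode_one(<<0b1110::4, a::4, 0b10::2, b::6, 0b10::2, c::6, rest::binary>>) do
    <<cp::16>> = <<a::4, b::6, c::6>>
    {cp, rest}
  end

  # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  def decode_one(
        <<0b11110::5, a::3, 0b10::2, b::6, 0b10::2, c::6, 0b10::2, d::6, rest::binary>>
      ) do
    # Glue the payload bits together and read them back as one integer.
    <<cp::21>> = <<a::3, b::6, c::6, d::6>>
    {cp, rest}
  end

  # Decode a whole binary into a list of codepoints.
  def decode(<<>>), do: []

  def decode(bin) when is_binary(bin) do
    {cp, rest} = decode_one(bin)
    [cp | decode(rest)]
  end
end

IO.inspect(Utf8Sketch.decode(<<0x6C, 0xF0, 0x9F, 0x8D, 0xAD, 0x70>>))
# prints [108, 127853, 112]
```

127853 is 0x1F36D, the value spelled out bit by bit on the slide.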
  24. Charlist: codepoints as a list of integers
    iex(1)> 'l🍭p'
    [108, 127853, 112]
    iex(2)> "l🍭p" |> String.to_charlist()
    [108, 127853, 112]
  25. graphemes vs codepoints
    # single codepoint character
    iex(1)> "" |> String.codepoints()
    [""]
    iex(2)> "" |> String.graphemes()
    [""]

    # combined (multi-codepoint) character
    iex(3)> "i\u0300" |> String.codepoints()
    ["i", "̀"]
    iex(4)> "i\u0300" |> String.graphemes()
    ["ì"]
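The difference shows up as three different "lengths" for the same text. Below is a minimal sketch; the module name `TextMeasureSketch` is hypothetical, and `String.normalize/2` is brought in only to contrast the decomposed form with its precomposed equivalent.

```elixir
defmodule TextMeasureSketch do
  # Count the same string at three levels: bytes, codepoints, graphemes.
  def measure(string) when is_binary(string) do
    %{
      bytes: byte_size(string),
      codepoints: length(String.codepoints(string)),
      graphemes: String.length(string)
    }
  end
end

decomposed = "i\u0300"
composed = String.normalize(decomposed, :nfc)

IO.inspect(TextMeasureSketch.measure(decomposed))
# prints %{bytes: 3, codepoints: 2, graphemes: 1}
IO.inspect(TextMeasureSketch.measure(composed))
# prints %{bytes: 2, codepoints: 1, graphemes: 1}

# Different bytes, same user-perceived character.
IO.inspect({decomposed == composed, String.equivalent?(decomposed, composed)})
# prints {false, true}
```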
  26. 1F469 1F3FB 200D 1F52C
    iex(1)> woman_scientist = "👩🏻‍🔬"
    iex(2)> String.length(woman_scientist)
    1
    iex(3)> woman_scientist |> String.codepoints()
    ["👩", "🏻", "‍", "🔬"]
    iex(4)> woman_scientist |> String.replace("", "")
    "‍"