Slide 1

Slide 1 text

Bits, Bytes, String & Emojis: Playing With Binaries In Elixir
Qiu Hua • 2020/05

Slide 2

Slide 2 text

About Me
Qiu Hua
➔ Doting dad, wrapped around my daughter's little finger
➔ Full-stack web developer (web UI + backend)
➔ I love functional programming
➔ Worked at Alibaba / Helijia.com
➔ Current job: babysitting & some open source projects
➔ GitHub / Twitter: @qhwa

Slide 3

Slide 3 text

Agenda
- Emoji + String
- Bytes / Binary (f5 2c)
- Bits / BitString (110101)
- Memory Model
- Performance Pitfalls
- UTF-8
- Unicode

Slide 4

Slide 4 text

Bits / BitString
The smallest unit of data we can operate on

Slide 5

Slide 5 text

<<...>>
1. constructing
   <<0, 1, 2, 3>>
   <<"Hello", 32, ?w, 28530::unsigned-integer-size(16), 27748::16>>
2. pattern matching
   <<...>> = io_data
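
A minimal sketch of both uses of `<<...>>`, reusing the construction from the slide. The variable names (`hello`, `first`, `rest`, `high`, `low`) are only illustrative.

```elixir
# Constructing: a segment is an integer (8 bits by default), a string,
# or a value with an explicit ::type-size specifier.
hello = <<"Hello", 32, ?w, 28530::unsigned-integer-size(16), 27748::16>>
IO.inspect(hello)          # prints "Hello world"

# Pattern matching: take the first byte, keep the rest as a binary.
<<first, rest::binary>> = hello
IO.inspect({first, rest})  # prints {72, "ello world"}

# Segments need not be byte-aligned: split one byte into two 4-bit halves.
<<high::4, low::4>> = <<0xF5>>
IO.inspect({high, low})    # prints {15, 5}
```

The two 16-bit segments are just the byte pairs "or" (0x6F72) and "ld" (0x6C64) written as integers.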

Slide 6

Slide 6 text

Worth reading: the Elixir documentation for Kernel.SpecialForms: https://hexdocs.pm/elixir/Kernel.SpecialForms.html#%3C%3C%3E%3E/1

Slide 7

Slide 7 text

Bytes / Binary A bitstring is a sequence of zero or more bits, where the number of bits does not need to be divisible by 8. If the number of bits is divisible by 8, the bitstring is also a binary.

Slide 8

Slide 8 text

Binaries <<>> "" <<"">> <<1>> <<2000>> <<3::6, 0::18>>
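
A small check, as a sketch, that every literal on this slide is a binary, and a counter-example that is only a bitstring. `samples` and `three_bits` are illustrative names.

```elixir
samples = [<<>>, "", <<"">>, <<1>>, <<2000>>, <<3::6, 0::18>>]

# Every sample has a bit count divisible by 8, so all of them are binaries.
IO.inspect(Enum.map(samples, &is_binary/1))

# An integer segment defaults to 8 bits, so 2000 is truncated to 2000 mod 256.
IO.inspect(<<2000>> == <<208>>)           # prints true

# 6 bits + 18 bits = 24 bits = 3 bytes.
IO.inspect(byte_size(<<3::6, 0::18>>))    # prints 3

# 3 bits: a bitstring, but not a binary.
three_bits = <<5::3>>
IO.inspect({is_bitstring(three_bits), is_binary(three_bits)})  # prints {true, false}
```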

Slide 9

Slide 9 text

iex(1)> i <<65>>
Term
  "A"
Data type
  BitString
Byte size
  1
Description
  This is a string: a UTF-8 encoded binary. It's printed surrounded by
  "double quotes" because all UTF-8 encoded code points in it are printable.
Raw representation
  <<65>>
Reference modules
  String, :binary
Implemented protocols
  Collectable, IEx.Info, Inspect, List.Chars, String.Chars

Slide 10

Slide 10 text

iex(2)> i <<128>>
Term
  <<128>>
Data type
  BitString
Byte size
  1
Description
  This is a binary: a collection of bytes. It's printed with the `<<>>`
  syntax (as opposed to double quotes) because it is not a UTF-8 encoded
  binary (the first invalid byte being `<<128>>`)
Reference modules
  :binary
Implemented protocols
  Collectable, IEx.Info, Inspect, List.Chars, String.Chars

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

How are binaries implemented?
http://erlang.org/doc/efficiency_guide/binaryhandling.html#how-binaries-are-implemented
- BitString and Binary are implemented the same way (because Binaries are BitStrings)
- Four types of binaries
  - Two are containers holding data
    - heap binary (<= 64 bytes)
    - refc binary (reference-counted binary) (> 64 bytes)
  - Two are merely references to part of an existing binary
    - sub binary
    - match context

Slide 13

Slide 13 text

heap binary & refc binary credit: https://medium.com/@mentels/a-short-guide-to-refc-binaries-f13f9029f6e2

Slide 14

Slide 14 text

binary sharing across processes credit: https://medium.com/@mentels/a-short-guide-to-refc-binaries-f13f9029f6e2

Slide 15

Slide 15 text

sub binary
- created by split_binary
- … when pattern matching a binary
- is a reference to another binary (refc / heap)
- :binary.referenced_byte_size/1

iex(1)> <<_head::binary-size(3), rest::binary>> = :binary.copy("x", 50)
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
iex(2)> byte_size(rest)
47
iex(3)> :binary.referenced_byte_size(rest)
50
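
Because a sub binary keeps its whole parent alive, a common remedy is to copy the part you want to keep. The sketch below assumes a 1000-byte parent and uses `:binary.copy/1` to detach the slice. The names `big`, `rest` and `detached` are illustrative, and the exact value reported for `rest` may depend on the OTP version.

```elixir
big = :binary.copy("x", 1000)

# `rest` is matched out of `big`; it may be a sub binary pointing into it.
<<_head::binary-size(100), rest::binary>> = big
IO.inspect(byte_size(rest))                       # prints 900

# Size of the binary that `rest` actually keeps referenced (the parent).
IO.inspect(:binary.referenced_byte_size(rest))

# :binary.copy/1 makes a fresh binary holding only the bytes we need,
# so the parent can be garbage collected.
detached = :binary.copy(rest)
IO.inspect(:binary.referenced_byte_size(detached)) # prints 900
```

Copying costs time and memory, so it mostly pays off when a small slice would otherwise pin a much larger binary.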

Slide 16

Slide 16 text

match context
- is similar to sub-binary
- optimized for binary matching

def my_binary_to_list(<<h, t::binary>>), do: [h | my_binary_to_list(t)]
def my_binary_to_list(<<>>), do: []

- A match context (reusable pointer) is created instead of a sub-binary
- Only 1 match context is created during the whole recursion.

Slide 17

Slide 17 text

BAD code example:

defmodule MyBinParser do
  def valid?(<<...>>), do: byte_size(data) == length
  def valid?(_), do: false
end

data = <<1000::2, :binary.copy("0", 1000)::binary>>
MyBinParser.valid?(data) #=> true

Slide 18

Slide 18 text

GOOD code example:

defmodule MyBinParser do
  def valid?(<<...>>), do: true
  def valid?(_), do: false
end

data = <<1000::2, :binary.copy("0", 1000)::binary>>
MyBinParser.valid?(data) #=> true
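
The patterns inside the two `valid?/1` variants are not legible in the slides, so the following is only a guess at their shape. It assumes a 16-bit length prefix followed by the payload. The module name `LengthPrefixed`, both function names and the `16` are assumptions, not the author's code.

```elixir
defmodule LengthPrefixed do
  # Variant 1: binds the payload to `data` and measures it afterwards.
  # Binding `data` may force the runtime to build a sub binary.
  def valid_via_sub_binary?(<<length::16, data::binary>>),
    do: byte_size(data) == length

  def valid_via_sub_binary?(_), do: false

  # Variant 2: the size check is part of the pattern and the payload is
  # never used, so the match can stay inside the match context.
  def valid_via_match?(<<length::16, _data::binary-size(length)>>), do: true
  def valid_via_match?(_), do: false
end

data = <<1000::16, :binary.copy("0", 1000)::binary>>
IO.inspect(LengthPrefixed.valid_via_sub_binary?(data))  # prints true
IO.inspect(LengthPrefixed.valid_via_match?(data))       # prints true

# A payload shorter than announced is rejected by both variants.
short = <<5::16, "abc">>
IO.inspect(LengthPrefixed.valid_via_match?(short))      # prints false
```

Both variants accept exactly the same inputs; the difference is only in what the VM has to allocate while matching.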

Slide 19

Slide 19 text

benchmark

Name                    ips        average    deviation       median       99th %
match_context       39.84 M       25.10 ns   ±18419.53%        18 ns        81 ns
sub_binary          16.42 M       60.89 ns   ±28228.57%        38 ns       112 ns

Comparison:
match_context       39.84 M
sub_binary          16.42 M - 2.43x slower +35.79 ns

Slide 20

Slide 20 text

BAD code example:

defmodule MyBinParser1 do
  @moduledoc """
  Retrieve binaries before a 0x00 byte, recursively
  concatenating the result binary.
  """

  def before_zero(bin), do: do_before_zero(bin, "")

  defp do_before_zero(<<>>, acc), do: acc
  defp do_before_zero(<<...>>, acc), do: <<...>>
  defp do_before_zero(<<...>>, acc), do: <<...>>
end
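
For illustration only, here is one plausible shape of a parser that builds its result by recursive concatenation. The clause bodies are an assumption based on the moduledoc, not a recovery of the slide's code, and the module name `ConcatBeforeZero` is hypothetical.

```elixir
defmodule ConcatBeforeZero do
  # Returns the bytes that precede the first 0x00 byte.
  def before_zero(bin), do: do_before_zero(bin, "")

  # End of input: nothing more to add.
  defp do_before_zero(<<>>, acc), do: acc

  # Found the zero byte: stop here.
  defp do_before_zero(<<0, _rest::binary>>, acc), do: acc

  # Body recursion: each level builds a new binary by putting `n` in front
  # of the result of the recursive call, which copies that result again.
  defp do_before_zero(<<n, rest::binary>>, acc),
    do: <<n, do_before_zero(rest, acc)::binary>>
end

IO.inspect(ConcatBeforeZero.before_zero(<<"abc", 0, "xyz">>))  # prints "abc"
```

Prepending to a binary cannot reuse the existing buffer, so the total work grows roughly quadratically with the length of the result.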

Slide 21

Slide 21 text

GOOD code example #1 (match-context):

defmodule MyBinParser2 do
  @moduledoc """
  Retrieve binaries before a 0x00 byte, using a match context.
  """

  def before_zero(bin), do: do_before_zero(bin, 0, byte_size(bin))

  defp do_before_zero(bin, max, max), do: bin

  defp do_before_zero(bin, pt, max) do
    case bin do
      <<leading::binary-size(pt), 0, _::binary>> -> leading
      _ -> do_before_zero(bin, pt + 1, max)
    end
  end
end

Slide 22

Slide 22 text

GOOD code example #2 (iolist):

defmodule MyBinParser3 do
  @moduledoc """
  Retrieve binaries before a 0x00 byte, using a list.
  """

  def before_zero(bin), do: do_before_zero(bin, [])

  defp do_before_zero(<<>>, acc), do: acc |> Enum.reverse() |> IO.iodata_to_binary()
  defp do_before_zero(<<0, _rest::binary>>, acc), do: do_before_zero("", acc)
  defp do_before_zero(<<n, rest::binary>>, acc), do: do_before_zero(rest, [n | acc])
end

Slide 23

Slide 23 text

benchmark

data = <<:binary.copy("abcd", 100)::binary, 0, "xyz">>

Benchee.run(
  %{
    binary_concating: fn -> MyBinParser1.before_zero(data) end,
    match_context: fn -> MyBinParser2.before_zero(data) end,
    iodata: fn -> MyBinParser3.before_zero(data) end
  }
)

Slide 24

Slide 24 text

benchmark result

Name                       ips        average    deviation       median       99th %
iodata                161.22 K        6.20 μs     ±161.59%      5.82 μs     10.73 μs
match_context         113.67 K        8.80 μs      ±70.38%      8.68 μs     10.05 μs
binary_concating       16.34 K       61.19 μs      ±39.46%     50.47 μs    139.48 μs

Comparison:
iodata                161.22 K
match_context         113.67 K - 1.42x slower +2.60 μs
binary_concating       16.34 K - 9.87x slower +54.99 μs

Slide 25

Slide 25 text

compiler to help
- erlc +bin_opt_info Mod.erl
- ERL_COMPILER_OPTIONS=bin_opt_info mix compile --force

Warning: NOT OPTIMIZED: binary is returned from the function
Warning: OPTIMIZED: match context reused

Slide 26

Slide 26 text

String UTF-8-encoded Unicode codepoints.

Slide 27

Slide 27 text

String in Erlang
- A sequence of bytes encoded in UTF-8 ← String in Elixir
- A list of Unicode codepoints ← Charlist in Elixir
- A combination of the two above
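
The three representations can be compared side by side in Elixir. This is a small sketch; `string`, `charlist` and `mixed` are illustrative names.

```elixir
# 1. A UTF-8 encoded binary: what Elixir calls a string.
string = "abc"
IO.inspect(is_binary(string))                 # prints true

# 2. A list of codepoints: what Elixir calls a charlist.
charlist = String.to_charlist(string)
IO.inspect(charlist, charlists: :as_lists)    # prints [97, 98, 99]

# 3. A mix of both, possibly nested (chardata).
mixed = ["ab", ?c, [?d, "e"]]
IO.inspect(IO.chardata_to_string(mixed))      # prints "abcde"
```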

Slide 28

Slide 28 text

Unicode
- A set of specifications
- contains a list of user-perceived characters and their corresponding codes (codepoints)
- still growing (v13.0 as of May 2020)

Slide 29

Slide 29 text

Unicode characters
A -> U+0041 # same as ASCII when < 0x80
力 -> U+529B
…
More than 140k characters

Slide 30

Slide 30 text

Encoding & Decoding Codepoints to Binaries
- Unicode codepoints range from 0 to 0x10FFFF.
  - 3 bytes are enough for a single codepoint.
  - 4 bytes should be enough for the foreseeable future.
- Most obvious method: 4 bytes for every character
  - called UTF-32, or UCS-4
  - too much space wasted
  - incompatible with old systems
    - ASCII files are not valid UTF-32
    - a zero byte means “end of the string” in many legacy systems
- UTF-8, a brilliant way
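
The space trade-off can be checked with the `utf8` and `utf32` segment types. This is a sketch; the three codepoints are the ones used as examples in these slides, and `sizes` is an illustrative name.

```elixir
# {UTF-8 size, UTF-32 size} in bytes for "A", "力" and U+1F36D.
sizes =
  for codepoint <- [0x41, 0x529B, 0x1F36D] do
    {byte_size(<<codepoint::utf8>>), byte_size(<<codepoint::utf32>>)}
  end

IO.inspect(sizes)              # prints [{1, 4}, {3, 4}, {4, 4}]

# UTF-32 pads ASCII with zero bytes, which legacy systems read as
# "end of string"; UTF-8 keeps ASCII as a single unchanged byte.
IO.inspect(<<?A::utf32>>)      # prints <<0, 0, 0, 65>>
IO.inspect(<<?A::utf8>>)       # prints "A"
```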

Slide 31

Slide 31 text

decoding example binaries: 6c f0 9f 8d ad 70 characters?

Slide 32

Slide 32 text

decoding example

6c f0 9f 8d ad 70
-> 01101100 11110000 10011111 10001101 10101101 01110000
-> l        🍭                                  p

11110000 10011111 10001101 10101101
-----xxx --xxxxxx --xxxxxx --xxxxxx
     000   011111   001101   101101
     ┕━━━━━━━━━ 0x1f36d ━━━━━━━━━━┙
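
The same decoding can be done by hand with bit-level pattern matching and then compared with the built-in `utf8` segment type. This is a sketch for the 4-byte case only; the names `a`, `b`, `c`, `d` for the payload bit groups are illustrative.

```elixir
bytes = <<0x6C, 0xF0, 0x9F, 0x8D, 0xAD, 0x70>>

# One ASCII byte, one 4-byte sequence, one ASCII byte.
<<l, four_bytes::binary-size(4), p>> = bytes

# Strip the marker bits (11110, 10, 10, 10) and keep the payload bits.
<<0b11110::5, a::3, 0b10::2, b::6, 0b10::2, c::6, 0b10::2, d::6>> = four_bytes

# Glue the 3 + 6 + 6 + 6 = 21 payload bits back into one integer.
<<codepoint::21>> = <<a::3, b::6, c::6, d::6>>
IO.inspect(codepoint, base: :hex)      # prints 0x1F36D

# The built-in utf8 segment type performs the same decoding.
<<decoded::utf8>> = four_bytes
IO.inspect(decoded == codepoint)       # prints true

IO.inspect([l, codepoint, p])          # prints [108, 127853, 112]
```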

Slide 33

Slide 33 text

Charlist: codepoints as a list of integers

iex(1)> 'l🍭p'
[108, 127853, 112]
iex(2)> "l🍭p" |> String.to_charlist()
[108, 127853, 112]

Slide 34

Slide 34 text

graphemes vs codepoints

# single-codepoint character
iex(1)> "🍭" |> String.codepoints()
["🍭"]
iex(2)> "🍭" |> String.graphemes()
["🍭"]

# combined (multi-codepoint) character
iex(3)> "i\u0300" |> String.codepoints()
["i", "̀"]
iex(4)> "i\u0300" |> String.graphemes()
["ì"]
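
The three ways of counting give three different answers for the combined character. A short sketch follows, with NFC normalization added as a related illustration that is not on the slide; `decomposed` and `composed` are illustrative names.

```elixir
decomposed = "i\u0300"   # "i" followed by COMBINING GRAVE ACCENT

IO.inspect(byte_size(decomposed))                    # prints 3
IO.inspect(length(String.codepoints(decomposed)))    # prints 2
IO.inspect(String.length(decomposed))                # prints 1 (graphemes)

# NFC normalization composes the pair into the single codepoint U+00EC.
composed = String.normalize(decomposed, :nfc)
IO.inspect(String.to_charlist(composed), charlists: :as_lists)  # prints [236]

# The two binaries differ byte for byte but are canonically equivalent.
IO.inspect(decomposed == composed)                   # prints false
IO.inspect(String.equivalent?(decomposed, composed)) # prints true
```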

Slide 35

Slide 35 text

Emoji

Slide 36

Slide 36 text

1F469 1F3FB 200D 1F52C

iex(1) > woman_scientist = "👩🏻‍🔬"
iex(2) > String.length(woman_scientist)
1
iex(3) > woman_scientist |> String.codepoints()
["👩", "🏻", "‍", "🔬"]
iex(4) > woman_scientist |> String.replace("", "")
"‍"
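
The emoji can also be assembled from the four codepoints listed on the slide, which makes its structure explicit. In this sketch the swap of the microscope for a rocket (U+1F680) is a hypothetical substitution chosen for illustration, not necessarily the one shown in the talk.

```elixir
# WOMAN + light skin tone modifier + ZERO WIDTH JOINER + MICROSCOPE
woman_scientist = <<0x1F469::utf8, 0x1F3FB::utf8, 0x200D::utf8, 0x1F52C::utf8>>

IO.inspect(byte_size(woman_scientist))                   # prints 15
IO.inspect(length(String.codepoints(woman_scientist)))   # prints 4
IO.inspect(String.length(woman_scientist))               # one grapheme on current Elixir

# String.replace/3 works on the underlying bytes, so a single codepoint
# inside the grapheme can be replaced.
microscope = <<0x1F52C::utf8>>
rocket = <<0x1F680::utf8>>   # hypothetical replacement
changed = String.replace(woman_scientist, microscope, rocket)

IO.inspect(String.to_charlist(changed), charlists: :as_lists)
# prints [128105, 127995, 8205, 128640]
```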

Slide 37

Slide 37 text

Recap

Slide 38

Slide 38 text

Thank you! Have fun playing with binaries!