Bits, Bytes, Strings And Emojis -- Playing With Binaries In Elixir

Bits, Bytes, String & Emojis Playing With Binaries In Elixir
Qiu Hua • 2020/05

About Me
Qiu Hua
- Doting dad (devoted to my daughter)
- Full-stack web developer (web UI + backend)
- I love functional programming
- Worked at Alibaba / Helijia.com
- Current job: babysitting & some open source projects
- GitHub / Twitter: @qhwa

Agenda
- Emoji + String
- Bytes / Binary (f5 2c)
- Bits / BitString (110101)
- Memory Model, Performance, Pitfalls
- UTF-8, Unicode

Bits / BitString
The smallest unit of data we can operate on.

Worth reading: the Elixir documentation for Kernel.SpecialForms, <<>>/1: https://hexdocs.pm/elixir/Kernel.SpecialForms.html#%3C%3C%3E%3E/1

Bytes / Binary
A bitstring is a sequence of zero or more bits, where the number of bits does not need to be divisible by 8. If the number of bits is divisible by 8, the bitstring is also a binary.
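The definition above can be checked directly with the Kernel guards. This is a minimal sketch; the variable names `bits` and `bytes` are only illustrative.

```elixir
# A 3-bit bitstring: not divisible by 8, so it is not a binary.
bits = <<1::1, 0::1, 1::1>>

# A 24-bit bitstring: divisible by 8, so it is also a binary.
bytes = <<1, 2, 3>>

IO.inspect(is_bitstring(bits))   # true
IO.inspect(is_binary(bits))      # false
IO.inspect(bit_size(bits))       # 3
IO.inspect(byte_size(bits))      # 1 (byte_size rounds up to whole bytes)

IO.inspect(is_bitstring(bytes))  # true
IO.inspect(is_binary(bytes))     # true
IO.inspect(bit_size(bytes))      # 24
```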

Binaries

    <<>>
    ""
    <<"">>
    <<1>>
    <<2000>>
    <<3::6, 0::18>>
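A short sketch of what the examples on this slide evaluate to. Each segment defaults to an 8-bit integer, so an out-of-range value is truncated, and explicit sizes that add up to a multiple of 8 still yield a binary.

```elixir
# The empty binary and the empty string are the same term.
IO.inspect(<<>> == "")            # true
IO.inspect(<<"">> == "")          # true

# A default segment is an 8-bit unsigned integer.
IO.inspect(byte_size(<<1>>))      # 1

# 2000 does not fit in 8 bits; only the lowest 8 bits are kept (2000 rem 256 = 208).
IO.inspect(<<2000>> == <<208>>)   # true

# 6 bits + 18 bits = 24 bits = 3 bytes, so the result is a binary.
IO.inspect(<<3::6, 0::18>> == <<12, 0, 0>>)   # true
IO.inspect(is_binary(<<3::6, 0::18>>))        # true
```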

iex(1)> i <<65>>
Term
  "A"
Data type
  BitString
Byte size
  1
Description
  This is a string: a UTF-8 encoded binary. It's printed surrounded by
  "double quotes" because all UTF-8 encoded code points in it are printable.
Raw representation
  <<65>>
Reference modules
  String, :binary
Implemented protocols
  Collectable, IEx.Info, Inspect, List.Chars, String.Chars

iex(2)> i <<128>>
Term
  <<128>>
Data type
  BitString
Byte size
  1
Description
  This is a binary: a collection of bytes. It's printed with the `<<>>`
  syntax (as opposed to double quotes) because it is not a UTF-8 encoded
  binary (the first invalid byte being `<<128>>`)
Reference modules
  :binary
Implemented protocols
  Collectable, IEx.Info, Inspect, List.Chars, String.Chars
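The difference between the two `i` outputs comes down to whether the bytes form valid, printable UTF-8. A minimal sketch using the `String` module and `inspect/1`:

```elixir
# <<65>> is the UTF-8 encoding of "A", so it is a string.
IO.inspect(<<65>> == "A")                # true
IO.inspect(String.valid?(<<65>>))        # true
IO.inspect(String.printable?(<<65>>))    # true

# 128 (0b10000000) is a continuation byte and cannot start a UTF-8 sequence.
IO.inspect(String.valid?(<<128>>))       # false

# inspect/1 picks the representation accordingly.
IO.puts(inspect(<<65>>))                 # "A"
IO.puts(inspect(<<128>>))                # <<128>>
```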

How are binaries implemented?
http://erlang.org/doc/efficiency_guide/binaryhandling.html#how-binaries-are-implemented
- BitString and Binary are implemented the same way (because Binaries are BitStrings)
- Four types of binaries
  - Two are containers holding data
    - heap binary (<= 64B)
    - refc-binary (reference-counted binary) (> 64B)
  - Two are merely references to part of an existing binary
    - sub binary
    - match-context

heap binary & refc binary credit: https://medium.com/@mentels/a-short-guide-to-refc-binaries-f13f9029f6e2

binary sharing across processes credit: https://medium.com/@mentels/a-short-guide-to-refc-binaries-f13f9029f6e2

sub binary
- created by split_binary
- … when pattern matching a binary
- is a reference to another binary (refc / heap)
- :binary.referenced_byte_size/1

    iex(1)> <<_head::binary-size(3), rest::binary>> = :binary.copy("x", 50)
    "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
    iex(2)> byte_size(rest)
    47
    iex(3)> :binary.referenced_byte_size(rest)
    50
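Because a sub binary keeps the whole original binary alive, a small slice of a large binary can pin a lot of memory. The sketch below repeats the experiment with a larger binary and then detaches the slice with `:binary.copy/1`; the variable names are illustrative.

```elixir
# A 1000-byte binary (large enough to be a refc binary).
big = :binary.copy("x", 1_000)

# `rest` is a sub binary pointing into `big`.
<<_head::binary-size(3), rest::binary>> = big

IO.inspect(byte_size(rest))                        # 997
IO.inspect(:binary.referenced_byte_size(rest))     # 1000

# :binary.copy/1 makes a fresh binary that no longer references `big`.
detached = :binary.copy(rest)
IO.inspect(:binary.referenced_byte_size(detached)) # 997
```

Copying costs time and memory up front, so it is mainly worth it when a small slice is going to outlive a much larger source binary.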

match context
- is similar to a sub binary
- optimized for binary matching

    def my_binary_to_list(<<h, t::binary>>), do: [h | my_binary_to_list(t)]
    def my_binary_to_list(<<>>), do: []

- A match context (reusable pointer) will be created instead of a sub binary
- Only 1 match context will be created during the whole recursion.
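The same shape works for a tail-recursive loop. In this sketch (the module name `ByteSumSketch` is hypothetical) the tail `t` is only ever handed to the next binary match, which is the pattern the compiler is expected to optimize; compiling with `bin_opt_info` is the way to confirm that on a given OTP version.

```elixir
defmodule ByteSumSketch do
  # Sums all bytes of a binary.
  def sum(bin) when is_binary(bin), do: do_sum(bin, 0)

  # The tail `t` is passed straight into the next match,
  # so no intermediate sub binary needs to be returned.
  defp do_sum(<<h, t::binary>>, acc), do: do_sum(t, acc + h)
  defp do_sum(<<>>, acc), do: acc
end

IO.inspect(ByteSumSketch.sum("abc"))  # 294 (97 + 98 + 99)
```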

benchmark

    Name                ips        average    deviation    median    99th %
    match_context    39.84 M      25.10 ns   ±18419.53%     18 ns     81 ns
    sub_binary       16.42 M      60.89 ns   ±28228.57%     38 ns    112 ns

    Comparison:
    match_context    39.84 M
    sub_binary       16.42 M - 2.43x slower +35.79 ns

GOOD code example #1 (match-context):

      … byte, using a match context.
      """
      def before_zero(bin), do: do_before_zero(bin, 0, byte_size(bin))

      defp do_before_zero(bin, max, max), do: bin

      defp do_before_zero(bin, pt, max) do
        case bin do
          <<leading::binary-size(pt), 0, _rest::binary>> -> leading
          _ -> do_before_zero(bin, pt + 1, max)
        end
      end
    end
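The slide starts mid-docstring, so the module header is not part of the transcript. The sketch below wraps the same functions in a module with a hypothetical name (`BeforeZeroSketch`) and a docstring of my own, so the behaviour can be tried out.

```elixir
defmodule BeforeZeroSketch do
  @doc """
  Returns the bytes before the first zero byte,
  or the whole binary when it contains no zero byte.
  (Docstring wording is an assumption; the original is truncated.)
  """
  def before_zero(bin) when is_binary(bin),
    do: do_before_zero(bin, 0, byte_size(bin))

  # Reached the end without finding a zero byte: return everything.
  defp do_before_zero(bin, max, max), do: bin

  # Try to match a zero byte at offset `pt`; otherwise move one byte forward.
  defp do_before_zero(bin, pt, max) do
    case bin do
      <<leading::binary-size(pt), 0, _rest::binary>> -> leading
      _ -> do_before_zero(bin, pt + 1, max)
    end
  end
end

IO.inspect(BeforeZeroSketch.before_zero(<<"abc", 0, "xyz">>))  # "abc"
IO.inspect(BeforeZeroSketch.before_zero("abc"))                # "abc"
```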

benchmark

    data = <<:binary.copy("abcd", 100)::binary, 0, "xyz">>

    Benchee.run(
      %{
        binary_concating: fn -> MyBinParser1.before_zero(data) end,
        match_context: fn -> MyBinParser2.before_zero(data) end,
        iodata: fn -> MyBinParser3.before_zero(data) end
      }
    )

benchmark result

    Name                   ips        average    deviation     median      99th %
    iodata            161.22 K        6.20 μs    ±161.59%     5.82 μs    10.73 μs
    match_context     113.67 K        8.80 μs     ±70.38%     8.68 μs    10.05 μs
    binary_concating   16.34 K       61.19 μs     ±39.46%    50.47 μs   139.48 μs

    Comparison:
    iodata            161.22 K
    match_context     113.67 K - 1.42x slower +2.60 μs
    binary_concating   16.34 K - 9.87x slower +54.99 μs
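The `binary_concating` and `iodata` implementations are not included in this transcript. As a rough illustration only, the two approaches might look like the sketch below; the module names and the exact code are assumptions, not the benchmarked modules, so the numbers above should not be attributed to this code.

```elixir
# Assumed shape of a "binary concatenation" variant:
# the result is rebuilt as a new binary on every step.
defmodule ConcatSketch do
  def before_zero(bin), do: do_before_zero(bin, <<>>)

  defp do_before_zero(<<0, _rest::binary>>, acc), do: acc
  defp do_before_zero(<<byte, rest::binary>>, acc),
    do: do_before_zero(rest, <<acc::binary, byte>>)
  defp do_before_zero(<<>>, acc), do: acc
end

# Assumed shape of an "iodata" variant:
# bytes are collected in a nested list and flattened once at the end.
defmodule IodataSketch do
  def before_zero(bin) do
    bin |> do_before_zero([]) |> IO.iodata_to_binary()
  end

  defp do_before_zero(<<0, _rest::binary>>, acc), do: acc
  defp do_before_zero(<<byte, rest::binary>>, acc),
    do: do_before_zero(rest, [acc, byte])
  defp do_before_zero(<<>>, acc), do: acc
end

data = <<"abc", 0, "xyz">>
IO.inspect(ConcatSketch.before_zero(data))  # "abc"
IO.inspect(IodataSketch.before_zero(data))  # "abc"
```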

compiler to help
- erlc +bin_opt_info Mod.erl
- ERL_COMPILER_OPTIONS=bin_opt_info mix compile --force

    Warning: NOT OPTIMIZED: binary is returned from the function
    Warning: OPTIMIZED: match context reused

String
UTF-8-encoded Unicode codepoints.

String in Erlang
- A sequence of bytes encoded in UTF-8 ← String in Elixir
- A list of Unicode codepoints ← Charlist in Elixir
- A combination of the two above
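The three representations can be compared side by side. In this sketch the "combination" case is shown as chardata (a list mixing binaries and codepoints), which is my reading of the third bullet.

```elixir
string = "abc"                           # UTF-8 encoded binary
charlist = String.to_charlist(string)    # list of codepoints

IO.inspect(is_binary(string))            # true
IO.inspect(is_list(charlist))            # true
IO.inspect(charlist == [97, 98, 99])     # true
IO.inspect(List.to_string(charlist))     # "abc"

# A mix of binaries and codepoints (chardata) flattens to a string.
mixed = ["ab", ?c, [?d]]
IO.inspect(IO.chardata_to_string(mixed)) # "abcd"
```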

Unicode
- A set of specifications
- contains a list of user-perceived characters and their corresponding codes (codepoints)
- still growing (v13.0 as of May 2020)

Unicode characters

    A -> U+0041 # same as ASCII when < 0x80
    力 -> U+529B
    …

More than 140k characters
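In Elixir the `?` operator gives the codepoint of a character, which makes the mapping above easy to verify.

```elixir
# ?x evaluates to the Unicode codepoint of the character x.
IO.inspect(?A == 0x41)                            # true
IO.inspect(?力 == 0x529B)                          # true

# Going the other way: encode a codepoint as UTF-8.
IO.inspect(<<0x41::utf8>> == "A")                 # true
IO.inspect(<<0x529B::utf8>> == "力")               # true
IO.inspect(String.to_charlist("力") == [0x529B])   # true
```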

Encoding & Decoding Codepoints to Binaries
- Unicode codepoints range from 0 to 0x10FFFF.
- 3 bytes are enough for a single codepoint.
- 4 bytes should be enough for the foreseeable future.
- Most obvious method: 4 bytes for every character
  - called UTF-32, or UCS-4
  - too much space wasted
  - incompatible with old systems
    - ASCII files are not valid UTF-32
    - a zero byte means "end of the string" in many legacy systems
- UTF-8, a brilliant way
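The trade-off between the two encodings shows up directly in the bitstring syntax, which supports both `utf32` and `utf8` segment types. A small sketch:

```elixir
# UTF-32: always 4 bytes, and ASCII characters are padded with zero bytes.
IO.inspect(<<?A::utf32>>)                    # <<0, 0, 0, 65>>
IO.inspect(byte_size(<<?力::utf32>>))         # 4

# UTF-8: 1 to 4 bytes depending on the codepoint; ASCII stays a single byte.
IO.inspect(<<?A::utf8>>)                     # "A"
IO.inspect(byte_size(<<?A::utf8>>))          # 1
IO.inspect(byte_size(<<?力::utf8>>))          # 3
IO.inspect(byte_size(<<0x1F36D::utf8>>))     # 4
```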

decoding example
binaries: 6c f0 9f 8d ad 70
characters?

decoding example

    11110000 10011111 10001101 10101101
    -----xxx --xxxxxx --xxxxxx --xxxxxx
         000   011111   001101   101101
    ┕━━━━━━━━━━━━ 0x1f36d ━━━━━━━━━━━━┙

    6c f0 9f 8d ad 70
    -> 01101100 11110000 10011111 10001101 10101101 01110000
    -> l 🍭 p
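The manual decoding above can be reproduced with bitstring pattern matching: strip the `11110`, `10`, `10`, `10` marker bits and glue the remaining 3 + 6 + 6 + 6 payload bits back together. This is a sketch for the 4-byte case only; variable names are illustrative.

```elixir
lollipop = <<0xF0, 0x9F, 0x8D, 0xAD>>

# Match the marker bits literally and capture the payload bits.
<<0b11110::5, a::3, 0b10::2, b::6, 0b10::2, c::6, 0b10::2, d::6>> = lollipop

# Reassemble the 21 payload bits into one integer.
<<codepoint::21>> = <<a::3, b::6, c::6, d::6>>
IO.inspect(codepoint, base: :hex)   # 0x1F36D

# The built-in utf8 segment type does the same decoding.
<<same::utf8>> = lollipop
IO.inspect(same == codepoint)       # true

# The full byte sequence from the slide decodes to three characters.
IO.inspect(String.to_charlist(<<0x6C, 0xF0, 0x9F, 0x8D, 0xAD, 0x70>>))
# [108, 127853, 112]
```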

Charlist: codepoints as a list of integers

    iex(1)> 'l🍭p'
    [108, 127853, 112]
    iex(2)> "l🍭p" |> String.to_charlist()
    [108, 127853, 112]

graphemes vs codepoints

    # single codepoint character
    iex(1)> "…" |> String.codepoints()
    ["…"]
    iex(2)> "…" |> String.graphemes()
    ["…"]

    # combined (multiple codepoint) character
    iex(3)> "i\u0300" |> String.codepoints()
    ["i", "̀"]
    iex(4)> "i\u0300" |> String.graphemes()
    ["ì"]
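Counting makes the distinction concrete: the same string has three different "lengths" depending on whether bytes, codepoints or graphemes are counted. A minimal sketch using the combined character from the slide:

```elixir
i_grave = "i\u0300"   # "i" followed by U+0300 COMBINING GRAVE ACCENT

IO.inspect(byte_size(i_grave))                   # 3 (1 byte + 2 bytes)
IO.inspect(length(String.codepoints(i_grave)))   # 2
IO.inspect(length(String.graphemes(i_grave)))    # 1
IO.inspect(String.length(i_grave))               # 1 (String.length counts graphemes)

# NFC normalization composes the pair into the single codepoint U+00EC.
IO.inspect(String.normalize(i_grave, :nfc) == "\u00EC")  # true
```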

1F469 1F3FB 200D 1F52C

    iex(1)> woman_scientist = "👩🏻‍🔬"
    iex(2)> String.length(woman_scientist)
    1
    iex(3)> woman_scientist |> String.codepoints()
    ["👩", "🏻", "\u200D", "🔬"]
    iex(4)> woman_scientist |> String.replace("…", "…")
    "…"
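The emoji can be built from the four codepoints listed on the slide, which avoids depending on how an editor or terminal renders it. The replacement at the end is my own illustration (swapping the microscope U+1F52C for a rocket U+1F680); the arguments used on the original slide are not recoverable from the transcript.

```elixir
# woman + light skin tone + zero width joiner + microscope
woman_scientist =
  <<0x1F469::utf8, 0x1F3FB::utf8, 0x200D::utf8, 0x1F52C::utf8>>

IO.inspect(byte_size(woman_scientist))            # 15 (4 + 4 + 3 + 4 bytes)
IO.inspect(String.to_charlist(woman_scientist))   # [128105, 127995, 8205, 128300]

# String.replace/3 works on the underlying bytes, so one codepoint
# inside the grapheme cluster can be swapped for another.
astronaut = String.replace(woman_scientist, <<0x1F52C::utf8>>, <<0x1F680::utf8>>)
IO.inspect(String.to_charlist(astronaut))         # [128105, 127995, 8205, 128640]
```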

Recall

Thank you! Have fun playing with binaries!
