Crashing BEAM Applications @ElixirConf.EU 2020

@nirev Crashing BEAM applications 2020-10-08 Guilherme de Maio

@nirev Why?

@nirev Who?

@nirev
Based in São Paulo, Brazil
Elixir since 2015 (mostly Java and C before that)
co-organizer of SP Elixir User Group ;)
@Telnyx since 2017

@nirev Things we often hear

@nirev Actor Model!

@nirev Isolated Processes!

@nirev Amazing GC per Process!

@nirev Super scalable!

@nirev Just let it CRASH!!

@nirev Well... that's not the full picture

@nirev How to crash your BEAM application

@nirev 1. The case of Exploding Atoms

@nirev String.to_atom("don't do it")

@nirev
defmodule X do
  def explode do
    Enum.each(1..2_000_000, fn i ->
      IO.puts(" #{i}")
      20
      |> :crypto.strong_rand_bytes()
      |> Base.encode64()
      |> String.to_atom()
    end)
  end
end

@nirev 82534 ... 82535 ... 82536 ... 82537 ... 82538 ... 82539 ... 82540 ...
no more index entries in atom_tab (max=100000)
Crash dump is being written to: erl_crash.dump ...done

@nirev
• Module names
• Node names
• Struct fields
• “decode as atom”
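
One common way to avoid the atom-table exhaustion shown above is to never create atoms from untrusted input. The sketch below assumes a hypothetical `SafeAtoms` module and a made-up allow-list of roles; it either maps known strings through that list or relies on `String.to_existing_atom/1`, which raises `ArgumentError` for atoms that are not already in the table.

```elixir
defmodule SafeAtoms do
  # Hypothetical allow-list: only these strings may become atoms.
  @allowed %{"admin" => :admin, "user" => :user, "guest" => :guest}

  # Returns {:ok, atom} for known values and :error otherwise,
  # without ever creating a new atom at runtime.
  def to_role(string) when is_binary(string) do
    Map.fetch(@allowed, string)
  end

  # String.to_existing_atom/1 raises ArgumentError when the atom is not
  # already in the atom table, so random input cannot grow the table.
  def existing(string) when is_binary(string) do
    {:ok, String.to_existing_atom(string)}
  rescue
    ArgumentError -> :error
  end
end
```

The same idea applies to decoders that offer a "decode keys as atoms" option: prefer the variant that only accepts existing atoms, if the library provides one.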

@nirev 2. The case of the Linked Agent

@nirev Process 1 Linked Process start_link

@nirev Process 1 Linked Process start_link exit reason: exception

@nirev Process 1 Linked Process start_link exit reason: normal

@nirev Linked Process “I'm a survivor”

@nirev
defmodule X do
  def spawn do
    spawn(fn ->
      {:ok, _pid} = Agent.start_link(fn -> 42 end)
      Process.sleep(1_000)
      exit(:normal)
    end)
  end
end

@nirev
iex(1)> Process.list() |> Enum.count()
50
iex(2)> for _ <- 1..10, do: X.spawn
[#PID<0.94.0>, #PID<0.95.0>, #PID<0.96.0>, #PID<0.97.0>, #PID<0.98.0>, #PID<0.99.0>, #PID<0.100.0>, #PID<0.101.0>, #PID<0.102.0>, #PID<0.103.0>]
iex(3)> Process.list() |> Enum.count()
70
iex(4)> Process.sleep(2_000)
:ok
iex(5)> Process.list() |> Enum.count()
60

@nirev
ref = Process.monitor(pid)
receive do
  {:DOWN, ^ref, :process, ^pid, _reason} ->
    # receive down message and do something
    :ok
end
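
A minimal sketch of the monitoring approach, assuming a hypothetical `MonitoredAgent` module: the agent is started unlinked, and a small watcher process monitors the owner and stops the agent when the owner goes down for any reason, including `:normal`, which a link alone would not propagate.

```elixir
defmodule MonitoredAgent do
  # Starts an unlinked Agent plus a watcher that monitors `owner`.
  # The watcher stops the agent as soon as the owner exits, whatever the reason.
  def start(owner, fun) when is_pid(owner) and is_function(fun, 0) do
    {:ok, agent} = Agent.start(fun)

    spawn(fn ->
      ref = Process.monitor(owner)

      receive do
        # Also delivered (with reason :noproc) if the owner is already dead.
        {:DOWN, ^ref, :process, ^owner, _reason} ->
          Agent.stop(agent)
      end
    end)

    {:ok, agent}
  end
end
```

Called as `MonitoredAgent.start(self(), fn -> 42 end)` from the spawned process, the agent no longer outlives its owner.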

@nirev 3. The case of Requests Monitoring

@nirev request, Tracker: read data from db, api call, update db, publish to queue

@nirev request, Tracker: read data from db, api call, update db, publish to queue
Tracker.add_breadcrumb(:key, metadata)
Tracker.add_breadcrumb(:update_db, %{user: x, role: y, ..})

@nirev request, Tracker: read data from db, api call, update db, publish to queue, CRASH
breadcrumbs = Tracker.get_breadcrumbs(tracker)
send_report(exception, breadcrumbs)

@nirev For each request:
- Start a new Agent
- Monitor the agent and the process
- Kill the agent when the process ends normally
  OR, in case of exception: take breadcrumbs from the agent, report, and kill

@nirev [chart, axis ticks: 40, 80, 120, 160] ??????

@nirev Cowboy implements the keep-alive mechanism by reusing the same process for all requests. This allows Cowboy to save memory. This works well because most code will not have any side effect impacting subsequent requests. But it also means you need to clean up if you do have code with side effects. The terminate/3 function can be used for this purpose.

@nirev [chart, axis ticks: 40, 80, 120, 160] agents forever
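
Because a keep-alive connection reuses the same process, cleanup cannot depend on that process exiting. The sketch below is one possible shape, not the talk's actual implementation: a hypothetical `Tracker` whose functions take the agent pid as first argument (an assumption, the slides show a shorter call) and whose `track/2` always stops the agent at the end of the request.

```elixir
defmodule Tracker do
  # Hypothetical per-request breadcrumb store backed by an unlinked Agent.
  def start, do: Agent.start(fn -> [] end)

  def add_breadcrumb(tracker, key, metadata) do
    Agent.update(tracker, fn crumbs -> [{key, metadata} | crumbs] end)
  end

  # Breadcrumbs are stored newest-first, so reverse them on read.
  def get_breadcrumbs(tracker), do: Agent.get(tracker, &Enum.reverse/1)

  # Runs one request and ALWAYS stops the agent afterwards, so cleanup does
  # not depend on the (possibly reused) request process terminating.
  def track(request_fun, report_fun) do
    {:ok, tracker} = start()

    try do
      request_fun.(tracker)
    rescue
      exception ->
        report_fun.(exception, get_breadcrumbs(tracker))
        reraise exception, __STACKTRACE__
    after
      Agent.stop(tracker)
    end
  end
end
```

With Cowboy itself, the `terminate/3` callback mentioned in the quote is another place to do the same cleanup.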

@nirev 4. The case of Infinite Restarts

@nirev Process

@nirev Error Reporter Task Process

@nirev Task Supervisor Error Reporter Task Error Reporter Task Process

@nirev Task Supervisor Error Reporter Task Error Reporter Task Remote API is down!!!

@nirev Task Supervisor Task.Supervisor.start_child(…, restart: :transient)

@nirev Task Supervisor Error Reporter Task Error Reporter Task

@nirev Task Supervisor Error Reporter Task Error Reporter Task Remote API is down!!!

@nirev Task Supervisor Task.Supervisor.start_child(…, restart: :temporary)
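
A small sketch of why `:temporary` is the safer choice here. With `:transient`, a task that exits abnormally is restarted, and while the remote API stays down every restart fails again until the supervisor's restart intensity is exceeded and the supervisor itself goes down. With `:temporary`, a failed report costs one process and nothing more. The supervisor name and the always-failing function are hypothetical stand-ins.

```elixir
# Start a Task.Supervisor (the name is hypothetical).
{:ok, sup} = Task.Supervisor.start_link(name: ErrorReporter.TaskSupervisor)

# Stand-in for the error reporter: it always fails, as if the remote API were down.
failing_report = fn -> exit(:remote_api_down) end

# :temporary children are never restarted by the supervisor.
{:ok, task} = Task.Supervisor.start_child(sup, failing_report, restart: :temporary)

ref = Process.monitor(task)

receive do
  {:DOWN, ^ref, :process, ^task, reason} -> IO.inspect(reason, label: "task exited")
after
  1_000 -> IO.puts("task still running")
end

# The supervisor is unaffected by the failed task.
IO.inspect(Process.alive?(sup), label: "supervisor alive?")
```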

@nirev 5. The case of Waiting Processes

@nirev Process request Task.Supervisor.start_child(…) Notifier

@nirev Process request Task.Supervisor.start_child(…) Notifier Process request Task.Supervisor.start_child(…) Notifier

@nirev Notifier short-lived  just one http request

@nirev
|No |Pid       |Memory      |Name or Initial Call |Reductions |MsgQueue |Current Function                   |
|1  |<0.433.0> |137.1796 MB |inet_gethost_native  |3016210668 |179710   |inet_gethost_native:do_handle_call |
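
A table like the one above can be approximated with nothing but `Process.list/0` and `Process.info/2`. The `QueueInspector` module below is a hypothetical helper, shown as a sketch for finding the processes with the longest message queues.

```elixir
defmodule QueueInspector do
  # Returns the `n` processes with the longest message queues.
  def top(n) when is_integer(n) and n > 0 do
    Process.list()
    |> Enum.map(fn pid ->
      # Process.info/2 returns nil if the process died in the meantime.
      case Process.info(pid, [:message_queue_len, :registered_name, :current_function]) do
        nil -> nil
        info -> {pid, info}
      end
    end)
    |> Enum.reject(&is_nil/1)
    |> Enum.sort_by(fn {_pid, info} -> info[:message_queue_len] end, :desc)
    |> Enum.take(n)
  end
end
```

Running `QueueInspector.top(5)` in a remote shell gives a quick view of which processes are falling behind. On a node with a very large number of processes this walk is itself expensive, so use it sparingly.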

@nirev 6. The case of the Message Router

@nirev Tracking devices

@nirev Backend Server TCP Socket API Commands

@nirev Message Router

@nirev Message Router Out of Memory

@nirev Process Control Block Stack Heap Process Memory Layout

@nirev Process Memory Layout
Process: Process Control Block, Stack, Heap, ProcBin (pointers)
Shared Heap: Refc Binary

@nirev Private Heap Garbage Collection (Process Control Block, Stack, Heap)
Generational
• young generation: newly allocated data
• old generation: data that survives GC
Fullsweep vs Generational runs:
• min_heap_size
• fullsweep_after

@nirev Shared Heap Garbage Collection (Shared Heap, Refc Binary)
Reference Counting
• any binary without references will be cleaned

@nirev The problem with Message Router
It receives messages > 64 bytes (Refc binaries).
So it adds a reference (ProcBin) to each of them on its heap.
But since it doesn't use much memory, it won't grow past min_heap_size.
That means the references are not cleaned, so large binaries linger on the Shared Heap.

@nirev The SOLUTION for Message Router
1) fullsweep_after process flag: configures how often a fullsweep GC will happen, and that will collect ProcBins
2) hibernating process: when it hibernates, a fullsweep GC will run
3) moving work to short-lived processes, if possible
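
The first two options can be sketched with a plain GenServer. Everything named here (`MessageRouter`, `route/3`, the value `10`) is an assumption made for illustration, not the code from the talk: `spawn_opt: [fullsweep_after: 10]` sets the flag at process start, and returning `:hibernate` from a callback triggers a full garbage collection before the process waits for its next message.

```elixir
defmodule MessageRouter do
  use GenServer

  # `fullsweep_after: 10` is an illustrative value, not a recommendation:
  # it forces a fullsweep GC after at most 10 generational collections.
  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, :ok, [spawn_opt: [fullsweep_after: 10]] ++ opts)
  end

  def route(router, destination, payload) when is_binary(payload) do
    GenServer.cast(router, {:route, destination, payload})
  end

  @impl true
  def init(:ok), do: {:ok, %{routed: 0}}

  @impl true
  def handle_cast({:route, destination, payload}, state) do
    send(destination, {:routed, payload})
    # Returning :hibernate makes the process run a full GC, which drops its
    # references to refc binaries, before it waits for the next message.
    {:noreply, %{state | routed: state.routed + 1}, :hibernate}
  end
end
```

Hibernating after every message costs CPU, so on a busy router it may be preferable to hibernate only after a quiet period, or to rely on `fullsweep_after` alone.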

@nirev What to do when it happens

@nirev What to do when it happens (before)

@nirev Make it operable!

@nirev
$ iex
Erlang/OTP 23 [erts-11.0] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [hipe]
Interactive Elixir (1.10.4) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> Node.list
[]

@nirev
$ iex --sname node1@localhost --cookie ilovecookies
Erlang/OTP 23 [erts-11.0] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [hipe]
Interactive Elixir (1.10.4) - press Ctrl+C to exit (type h() ENTER for help)
iex(node1@localhost)1> Node.list
[]

@nirev
$ iex --sname node2 --cookie ilovecookies --remsh node1@localhost
Erlang/OTP 23 [erts-11.0] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [hipe]
Interactive Elixir (1.10.4) - press Ctrl+C to exit (type h() ENTER for help)
iex(node1@localhost)1> Node.list
[:node2@C02T24Z2HF1R]

@nirev
$ mix release
$ ./path/to/release/my_app start
$ ./path/to/release/my_app remote

@nirev Operate!

@nirev Native tools
Process.list/0
Process.info/1
:sys.get_*
MyModule.my_very_helpful_debug_function()
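
A short sketch of those native tools in a shell session. The Agent and its state are made up for the example and stand in for whatever process needs inspecting.

```elixir
# A throwaway Agent stands in for a process you want to inspect.
{:ok, pid} = Agent.start_link(fn -> %{requests: 3} end)

# Process.info/2 with a list of keys fetches only what is asked for.
pid
|> Process.info([:memory, :message_queue_len, :current_function, :links])
|> IO.inspect(label: "info")

# :sys.get_state/1 returns the internal state of an OTP-compliant process
# (GenServer, Agent, ...); :sys.get_status/1 returns more process details.
state = :sys.get_state(pid)
IO.inspect(state, label: "state")
```

Both `:sys` calls are meant for debugging: they block on the target process, so they can time out when its mailbox is already overloaded.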

@nirev :observer_cli.start()

@nirev phoenix_live_dashboard

@nirev recon
https://ferd.github.io/recon/
"Recon is a library to be dropped into any other Erlang project, to be used to assist DevOps people diagnose problems in production nodes."

@nirev recon
iex(..)1> :recon.bin_leak(3)
[
  {#PID<0.80.0>, -606,
   [current_function: {Process, :sleep, 1}, initial_call: {:erlang, :apply, 2}]},
  {#PID<0.124.0>, -176,
   [
     :ssl_manager,
     {:current_function, {:gen_server, :loop, 7}},
     {:initial_call, {:proc_lib, :init_p, 5}}
   ]},
  {#PID<0.1905.0>, -165,
   [current_function: {:ranch_conns_sup, :loop, 4}, initial_call: {:proc_lib, :init_p, 5}]}
]

@nirev Metrics and Visibility!

@nirev
vmstats: send vm metrics to statsd
https://github.com/ferd/vmstats
prometheus: send vm metrics to… prometheus
https://github.com/deadtrickster/prometheus.ex
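
To give an idea of what such libraries report, here is a hand-rolled sketch that reads a few VM numbers relevant to the crashes in this talk. `VMSnapshot` is a hypothetical module; shipping the values to statsd or Prometheus is deliberately left out.

```elixir
defmodule VMSnapshot do
  # Collects a few VM-level numbers using only built-in :erlang functions.
  def take do
    %{
      atom_count: :erlang.system_info(:atom_count),
      atom_limit: :erlang.system_info(:atom_limit),
      process_count: :erlang.system_info(:process_count),
      process_limit: :erlang.system_info(:process_limit),
      total_memory_bytes: :erlang.memory(:total),
      binary_memory_bytes: :erlang.memory(:binary),
      run_queue: :erlang.statistics(:run_queue)
    }
  end
end

IO.inspect(VMSnapshot.take(), label: "vm")
```

Tracking `atom_count` against `atom_limit`, `process_count` over time, and binary memory would have surfaced the atom, agent, and refc binary problems described earlier before they became outages.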

@nirev :telemetry

@nirev create your own dashboards!

@nirev Log aggregation

@nirev Error Reporting: Sentry, AppSignal, Bugsnag, Rollbar, ...

@nirev Bottom line?

@nirev BEAM is amazing

@nirev There are several ways to break your systems

@nirev Don’t go to production without visibility

@nirev Please read Erlang in Anger

Guilherme de Maio @nirev
Thank you! (Obrigado!) ❤
Elixir Conf EU 2020
We are hiring! www.telnyx.com
