Slide 1

Slide 1 text

@nirev Crashing BEAM applications 2020-10-08 Guilherme de Maio

Slide 2

Slide 2 text

@nirev Why?

Slide 3

Slide 3 text

@nirev Who?

Slide 4

Slide 4 text

@nirev Based in São Paulo, Brazil
 
 Elixir since 2015
 (mostly Java and C before that)
 
 co-organizer of SP Elixir User Group ;)

Slide 5

Slide 5 text

@nirev Based in São Paulo, Brazil
 
 Elixir since 2015
 (mostly Java and C before that)
 
 co-organizer of SP Elixir User Group ;)
 @Telnyx since 2017

Slide 6

Slide 6 text

@nirev

Slide 7

Slide 7 text

@nirev Things we often hear

Slide 8

Slide 8 text

@nirev Actor Model!

Slide 9

Slide 9 text

@nirev Isolated Processes!

Slide 10

Slide 10 text

@nirev Amazing GC per Process!

Slide 11

Slide 11 text

@nirev Super scalable!

Slide 12

Slide 12 text

@nirev Just let it CRASH!!

Slide 13

Slide 13 text

@nirev Well... that’s not the full picture

Slide 14

Slide 14 text

@nirev How to crash your BEAM application

Slide 15

Slide 15 text

@nirev 1. The case of Exploding Atoms

Slide 16

Slide 16 text

@nirev String.to_atom("don't do it")

Slide 17

Slide 17 text

@nirev
defmodule X do
  def explode do
    Enum.each(1..2_000_000, fn i ->
      IO.puts(" #{i}")

      20
      |> :crypto.strong_rand_bytes()
      |> Base.encode64()
      |> String.to_atom()
    end)
  end
end

Slide 18

Slide 18 text

@nirev
82534 ...
82535 ...
82536 ...
82537 ...
82538 ...
82539 ...
82540 ...
no more index entries in atom_tab (max=100000)

Crash dump is being written to: erl_crash.dump...done

Slide 19

Slide 19 text

@nirev
 • Module names
 • Node names
 • Struct fields
 • “decode as atom”
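
A minimal sketch of one way to keep untrusted input out of the atom table, assuming the set of valid atoms is already known to the running code. `SafeAtoms`, `to_known_atom/1` and `atom_usage/0` are illustrative names, not from the talk; `String.to_existing_atom/1` and `:erlang.system_info/1` are standard calls.

```elixir
defmodule SafeAtoms do
  @doc "Converts a string to an atom only if that atom already exists in the VM."
  def to_known_atom(string) when is_binary(string) do
    {:ok, String.to_existing_atom(string)}
  rescue
    # Raised when no such atom exists; nothing new is added to the atom table.
    ArgumentError -> :error
  end

  @doc "Returns {atoms_in_use, atom_table_limit}, handy as a metric to watch."
  def atom_usage do
    {:erlang.system_info(:atom_count), :erlang.system_info(:atom_limit)}
  end
end
```

Calling `SafeAtoms.to_known_atom/1` with arbitrary external strings returns `:error` instead of growing the atom table, and `SafeAtoms.atom_usage/0` could be exported as a gauge so that a slow atom leak shows up long before the limit is hit.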

Slide 20

Slide 20 text

@nirev 2. The case of the Linked Agent

Slide 21

Slide 21 text

@nirev Process 1 Linked Process start_link

Slide 22

Slide 22 text

@nirev Process 1 Linked Process start_link exit reason: exception

Slide 24

Slide 24 text

@nirev Process 1 Linked Process start_link exit reason: normal

Slide 25

Slide 25 text

@nirev Linked Process “I'm a survivor”

Slide 26

Slide 26 text

@nirev
defmodule X do
  def spawn do
    spawn(fn ->
      {:ok, _pid} = Agent.start_link(fn -> 42 end)
      Process.sleep(1_000)
      exit(:normal)
    end)
  end
end

Slide 27

Slide 27 text

@nirev
iex(1)> Process.list() |> Enum.count()
50
iex(2)> for _ <- 1..10, do: X.spawn
[#PID<0.94.0>, #PID<0.95.0>, #PID<0.96.0>, #PID<0.97.0>, #PID<0.98.0>,
 #PID<0.99.0>, #PID<0.100.0>, #PID<0.101.0>, #PID<0.102.0>, #PID<0.103.0>]
iex(3)> Process.list() |> Enum.count()
70
iex(4)> Process.sleep(2_000)
:ok
iex(5)> Process.list() |> Enum.count()
60

Slide 28

Slide 28 text

@nirev
ref = Process.monitor(pid)

receive do
  {:DOWN, ^ref, :process, ^pid, _reason} ->
    # receive down message and do something
    :ok
end
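
The behaviour in this case can be checked with a small experiment. The sketch below is an assumption-laden illustration (the module `LinkDemo` and its function are hypothetical names): it starts a linked Agent from a short-lived process, lets that process exit with a given reason, and reports whether the Agent is still alive afterwards.

```elixir
defmodule LinkDemo do
  @doc """
  Spawns a parent that starts a linked Agent and then exits with `reason`.
  Returns true if the Agent outlives the parent, false otherwise.
  """
  def agent_survives_parent_exit?(reason) do
    caller = self()

    {parent, parent_ref} =
      spawn_monitor(fn ->
        {:ok, agent} = Agent.start_link(fn -> 42 end)
        send(caller, {:agent, agent})
        exit(reason)
      end)

    agent =
      receive do
        {:agent, pid} -> pid
      end

    # Wait until the parent is really gone.
    receive do
      {:DOWN, ^parent_ref, :process, ^parent, _reason} -> :ok
    end

    agent_ref = Process.monitor(agent)

    receive do
      {:DOWN, ^agent_ref, :process, ^agent, _reason} -> false
    after
      200 ->
        # The Agent survived: stop monitoring and clean it up ourselves.
        Process.demonitor(agent_ref, [:flush])
        Process.exit(agent, :kill)
        true
    end
  end
end
```

With `:normal` as the reason the link does not take the Agent down, so the function returns `true`; with an abnormal reason such as `:boom` the exit signal propagates and it returns `false`.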

Slide 29

Slide 29 text

@nirev 3. The case of Requests Monitoring

Slide 30

Slide 30 text

@nirev request Tracker read data from db api call update db publish to queue

Slide 31

Slide 31 text

@nirev
Tracker.add_breadcrumb(:key, metadata)
Tracker.add_breadcrumb(:update_db, %{user: x, role: y, ..})

request Tracker read data from db api call update db publish to queue

Slide 32

Slide 32 text

@nirev request Tracker read data from db api call update db publish to queue CRASH

breadcrumbs = Tracker.get_breadcrumbs(tracker)
send_report(exception, breadcrumbs)

Slide 33

Slide 33 text

@nirev For each request:
 - Start a new Agent
 - Monitor the agent and the process
 - Kill the agent when the process ends normally
   OR
   in case of exception: take breadcrumbs from the agent, report, and kill
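
The per-request scheme above can be sketched roughly as follows. This is not the talk's actual `Tracker`; `BreadcrumbSketch` and its functions are hypothetical stand-ins, and reporting is left as a comment. The point to notice is that cleanup is driven only by the `:DOWN` message of the owning process.

```elixir
defmodule BreadcrumbSketch do
  @doc "Starts an Agent for the calling process plus a watcher that cleans it up."
  def start do
    owner = self()
    {:ok, agent} = Agent.start(fn -> [] end)

    # The watcher monitors the owner and stops the Agent once the owner is gone.
    spawn(fn ->
      ref = Process.monitor(owner)

      receive do
        {:DOWN, ^ref, :process, ^owner, _reason} ->
          # A real tracker would read the breadcrumbs and report here
          # when the reason is abnormal, before stopping the Agent.
          Agent.stop(agent)
      end
    end)

    agent
  end

  def add_breadcrumb(agent, key, metadata) do
    Agent.update(agent, fn crumbs -> [{key, metadata} | crumbs] end)
  end

  def get_breadcrumbs(agent) do
    Agent.get(agent, &Enum.reverse/1)
  end
end
```

Because the Agent is only stopped when the owner process dies, a process that is reused for many requests never triggers the cleanup, and one Agent per request accumulates.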

Slide 34

Slide 34 text

@nirev 40 80 120 160 ??????

Slide 35

Slide 35 text

@nirev Cowboy implements the keep-alive mechanism by reusing the same process for all requests. This allows Cowboy to save memory. This works well because most code will not have any side effect impacting subsequent requests. But it also means you need to clean up if you do have code with side effects. The terminate/3 function can be used for this purpose.

Slide 36

Slide 36 text

@nirev 40 80 120 160 agents forever

Slide 37

Slide 37 text

@nirev 4. The case of Infinite Restarts

Slide 38

Slide 38 text

@nirev Process

Slide 39

Slide 39 text

@nirev Process

Slide 40

Slide 40 text

@nirev Error Reporter Task Process

Slide 41

Slide 41 text

@nirev Task Supervisor Error Reporter Task Error Reporter Task Process

Slide 42

Slide 42 text

@nirev Task Supervisor Error Reporter Task Error Reporter Task Remote API is down!!!

Slide 43

Slide 43 text

@nirev Task Supervisor Task.Supervisor.start_child(…, restart: :transient)

Slide 44

Slide 44 text

@nirev Task Supervisor Error Reporter Task Error Reporter Task

Slide 45

Slide 45 text

@nirev Task Supervisor Error Reporter Task Error Reporter Task Remote API is down!!!

Slide 46

Slide 46 text

@nirev Task Supervisor Task.Supervisor.start_child(…, restart: :temporary)
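
A minimal sketch of the fix, assuming a fire-and-forget error report that should be dropped rather than retried when the remote API is down. `ReportSketch` and `report_async/2` are hypothetical names; `Task.Supervisor.start_child/3` and its `:restart` option are the real API, and `:temporary` is its documented default.

```elixir
defmodule ReportSketch do
  @doc """
  Runs `report_fun` under the given Task.Supervisor.
  With restart: :temporary a failing task is never restarted,
  so a dead remote API cannot cause a restart storm.
  """
  def report_async(supervisor, report_fun) when is_function(report_fun, 0) do
    Task.Supervisor.start_child(supervisor, report_fun, restart: :temporary)
  end
end
```

With `restart: :transient` the same failing task would be restarted on every abnormal exit until the supervisor's restart intensity is exceeded, at which point the supervisor itself goes down.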

Slide 47

Slide 47 text

@nirev 5. The case of Waiting Processes

Slide 48

Slide 48 text

@nirev Process request Task.Supervisor.start_child(…) Notifier

Slide 49

Slide 49 text

@nirev Process request Task.Supervisor.start_child(…) Notifier Process request Task.Supervisor.start_child(…) Notifier

Slide 50

Slide 50 text

@nirev Notifier short-lived
 just one http request

Slide 51

Slide 51 text

@nirev

Slide 52

Slide 52 text

@nirev

Slide 53

Slide 53 text

@nirev
| No | Pid       | Memory      | Name or Initial Call | Reductions | MsgQueue | Current Function                   |
| 1  | <0.433.0> | 137.1796 MB | inet_gethost_native  | 3016210668 | 179710   | inet_gethost_native:do_handle_call |

Slide 54

Slide 54 text

@nirev 6. The case of the Message Router

Slide 55

Slide 55 text

@nirev Tracking devices

Slide 56

Slide 56 text

@nirev Backend Server TCP Socket API Commands

Slide 57

Slide 57 text

@nirev Message Router

Slide 58

Slide 58 text

@nirev Message Router Out of Memory

Slide 59

Slide 59 text

@nirev Process Control Block Stack Heap Process Memory Layout

Slide 60

Slide 60 text

@nirev Process Control Block Stack Heap Process Memory Layout Shared Heap Refc Binary ProcBin (pointers)

Slide 61

Slide 61 text

@nirev Process Control Block, Stack, Heap
 Private Heap Garbage Collection
 Generational
 • young generation: newly allocated data
 • old generation: data that survives GC
 Fullsweep vs Generational runs:
 • min_heap_size
 • fullsweep_after

Slide 62

Slide 62 text

@nirev Shared Heap Garbage Collection
 Reference Counting
 • any binary without references will be cleaned
 Shared Heap: Refc Binary

Slide 63

Slide 63 text

@nirev The problem with Message Router
 It receives binaries larger than 64 bytes (refc binaries),
 so each one adds a reference (ProcBin) to the router's own heap.
 But since the router itself uses little memory, its heap won’t grow past min_heap_size and garbage collection rarely runs.
 That means the references are not cleaned up, so large binaries linger on the Shared Heap.
 Message Router

Slide 64

Slide 64 text

@nirev The SOLUTION for Message Router
 1) fullsweep_after process flag
    configures how often a fullsweep GC will happen, and that will collect ProcBins
 2) hibernating process
    when it hibernates, a fullsweep GC will run
 3) moving work to short-lived processes if possible
 Message Router
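
Options 1 and 2 can be combined in a GenServer roughly like this. It is a sketch under assumptions: `RouterSketch`, `route/3` and the value `10` for `fullsweep_after` are illustrative, not taken from the talk, and hibernating after every single message trades CPU for memory, which may be too aggressive for a busy router.

```elixir
defmodule RouterSketch do
  use GenServer

  @doc "Starts the router with a low fullsweep_after so ProcBins are collected often."
  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, :ok, [spawn_opt: [fullsweep_after: 10]] ++ opts)
  end

  @doc "Asks the router to forward `payload` to the process `dest`."
  def route(router, dest, payload) do
    GenServer.cast(router, {:route, dest, payload})
  end

  @impl true
  def init(:ok), do: {:ok, %{routed: 0}}

  @impl true
  def handle_cast({:route, dest, payload}, state) do
    send(dest, {:routed, payload})
    # Returning :hibernate triggers a full garbage collection before the
    # process goes idle, dropping its references to large binaries.
    {:noreply, %{state | routed: state.routed + 1}, :hibernate}
  end
end
```

In practice one would likely pick either a tuned `fullsweep_after` or occasional hibernation (for example only when the mailbox is empty), and measure binary memory before and after.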

Slide 65

Slide 65 text

@nirev What to do when it happens

Slide 66

Slide 66 text

@nirev What to do when it happens (before)

Slide 67

Slide 67 text

@nirev Make it operable!

Slide 68

Slide 68 text

@nirev
$ iex
Erlang/OTP 23 [erts-11.0] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [hipe]

Interactive Elixir (1.10.4) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> Node.list
[]

Slide 69

Slide 69 text

@nirev
$ iex --sname node1@localhost --cookie ilovecookies
Erlang/OTP 23 [erts-11.0] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [hipe]

Interactive Elixir (1.10.4) - press Ctrl+C to exit (type h() ENTER for help)
iex(node1@localhost)1> Node.list
[]

Slide 70

Slide 70 text

@nirev
$ iex --sname node2 --cookie ilovecookies --remsh node1@localhost
Erlang/OTP 23 [erts-11.0] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [hipe]

Interactive Elixir (1.10.4) - press Ctrl+C to exit (type h() ENTER for help)
iex(node1@localhost)1> Node.list
[:node2@C02T24Z2HF1R]

Slide 72

Slide 72 text

@nirev
$ mix release
$ ./path/to/release/my_app start
$ ./path/to/release/my_app remote

Slide 73

Slide 73 text

@nirev Operate!

Slide 74

Slide 74 text

@nirev Native tools
 Process.list/0
 Process.info/1
 :sys.get_*
 MyModule.my_very_helpful_debug_function()
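
As an illustration of what the native tools allow from a remote shell, the sketch below lists the processes with the longest message queues, the symptom seen in the Waiting Processes case. `QueueSketch` and `top_by_message_queue/1` are hypothetical names; only `Process.list/0` and `Process.info/2` are used underneath.

```elixir
defmodule QueueSketch do
  @doc "Returns up to `n` {pid, message_queue_len} tuples, longest queues first."
  def top_by_message_queue(n) when is_integer(n) and n > 0 do
    Process.list()
    |> Enum.flat_map(fn pid ->
      case Process.info(pid, :message_queue_len) do
        {:message_queue_len, len} -> [{pid, len}]
        # The process died between Process.list/0 and Process.info/2.
        nil -> []
      end
    end)
    |> Enum.sort_by(fn {_pid, len} -> len end, :desc)
    |> Enum.take(n)
  end
end
```

On a large node, walking every process like this has a cost of its own, which is one reason purpose-built tools such as recon exist.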

Slide 75

Slide 75 text

@nirev :observer_cli.start()

Slide 76

Slide 76 text

@nirev phoenix_live_dashboard

Slide 77

Slide 77 text

@nirev https://ferd.github.io/recon/
 Recon is a library to be dropped into any other Erlang project, to be used to assist DevOps people diagnose problems in production nodes.
 recon

Slide 78

Slide 78 text

@nirev
iex(..)1> :recon.bin_leak(3)
[
  {#PID<0.80.0>, -606,
   [current_function: {Process, :sleep, 1}, initial_call: {:erlang, :apply, 2}]},
  {#PID<0.124.0>, -176,
   [
     :ssl_manager,
     {:current_function, {:gen_server, :loop, 7}},
     {:initial_call, {:proc_lib, :init_p, 5}}
   ]},
  {#PID<0.1905.0>, -165,
   [current_function: {:ranch_conns_sup, :loop, 4}, initial_call: {:proc_lib, :init_p, 5}]}
]
recon

Slide 79

Slide 79 text

@nirev Metrics and Visibility!

Slide 80

Slide 80 text

@nirev
 vmstats: send vm metrics to statsd
 https://github.com/ferd/vmstats
 prometheus: send vm metrics to… prometheus
 https://github.com/deadtrickster/prometheus.ex

Slide 81

Slide 81 text

@nirev :telemetry

Slide 82

Slide 82 text

@nirev create your own dashboards!

Slide 84

Slide 84 text

@nirev Log aggregation

Slide 85

Slide 85 text

@nirev Error Reporting
 Sentry
 AppSignal
 Bugsnag
 Rollbar
 ..

Slide 86

Slide 86 text

@nirev Bottom line?

Slide 87

Slide 87 text

@nirev BEAM is amazing

Slide 88

Slide 88 text

@nirev There are several ways to break your systems

Slide 89

Slide 89 text

@nirev Don’t go to production without visibility

Slide 90

Slide 90 text

@nirev Please read Erlang in Anger

Slide 91

Slide 91 text

Guilherme de Maio
nirev@taming-chaos.com
@nirev
Obrigado! (Thank you!) ❤
Elixir Conf EU 2020
We are hiring! www.telnyx.com