Crashing BEAM Applications @ElixirConf.EU 2020

We often talk about embracing failure and building resilient applications, and how the BEAM helps us do that. Nonetheless, there are several ways to write code that will crash your application, and sometimes the whole VM! This talk is about those things, showing several ways I may or may not have managed to crash the BEAM in the past, some obvious, and some not that obvious ;)

Guilherme de Maio, nirev

October 08, 2020

Transcript

  1. @nirev Crashing BEAM applications 2020-10-08 Guilherme de Maio

  2. @nirev Why?

  3. @nirev Who?

  4. @nirev Based in São Paulo, Brazil
    Elixir since 2015 (mostly Java and C before that)
    co-organizer of SP Elixir User Group ;)
  5. @nirev Based in São Paulo, Brazil
    Elixir since 2015 (mostly Java and C before that)
    co-organizer of SP Elixir User Group ;)
    @Telnyx since 2017
  6. @nirev

  7. @nirev Things we often hear

  8. @nirev Actor Model!

  9. @nirev Isolated Processes!

  10. @nirev Amazing GC per Process!

  11. @nirev Super scalable!

  12. @nirev Just let it CRASH!!

  13. @nirev Well.. That’s not the full picture

  14. @nirev How to crash your BEAM application

  15. @nirev The case of Exploding Atoms 1

  16. @nirev String.to_atom("don't do it")

  17. @nirev
    defmodule X do
      def explode do
        Enum.each(1..2_000_000, fn i ->
          IO.puts(" #{i}")

          20
          |> :crypto.strong_rand_bytes()
          |> Base.encode64()
          |> String.to_atom()
        end)
      end
    end
  18. @nirev
    82534 ... 82535 ... 82536 ... 82537 ... 82538 ... 82539 ... 82540 ...
    no more index entries in atom_tab (max=100000)
    Crash dump is being written to: erl_crash.dump ...done
  19. @nirev
    • Module names
    • Node names
    • Struct fields
    • “decode as atom”
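
A minimal sketch of one way to avoid the atom-table exhaustion shown above, assuming the set of valid atoms is known ahead of time. `String.to_existing_atom/1` only converts strings whose atom already exists and raises `ArgumentError` otherwise, so untrusted input cannot grow the atom table. The module name `SafeAtom` and its allow-list are hypothetical and not part of the talk.

```elixir
defmodule SafeAtom do
  # Hypothetical allow-list. Writing the atoms here guarantees they already
  # exist in the atom table once this module is loaded.
  @allowed [:admin, :editor, :viewer]

  # Converts untrusted input to an atom only if it is on the allow-list.
  # Unknown strings never create a new atom.
  def parse(string) when is_binary(string) do
    atom = String.to_existing_atom(string)
    if atom in @allowed, do: {:ok, atom}, else: :error
  rescue
    ArgumentError -> :error
  end

  # Current number of atoms and the configured limit, useful as a metric.
  def atom_usage do
    {:erlang.system_info(:atom_count), :erlang.system_info(:atom_limit)}
  end
end
```

Calling `SafeAtom.parse("admin")` returns `{:ok, :admin}`, while a random string returns `:error` without touching the atom table. Watching `SafeAtom.atom_usage/0` over time is one way to notice a leak before the VM goes down.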
  20. @nirev The case of the Linked Agent 2

  21. @nirev Process 1 Linked Process start_link

  22. @nirev Process 1 Linked Process start_link exit reason: exception

  23. @nirev Process 1 Linked Process start_link exit reason: exception

  24. @nirev Process 1 Linked Process start_link exit reason: normal

  25. @nirev Linked Process “I'm a survivor”

  26. @nirev
    defmodule X do
      def spawn do
        spawn(fn ->
          {:ok, _pid} = Agent.start_link(fn -> 42 end)
          Process.sleep(1_000)
          exit(:normal)
        end)
      end
    end
  27. @nirev
    iex(1)> Process.list() |> Enum.count()
    50
    iex(2)> for _ <- 1..10, do: X.spawn
    [#PID<0.94.0>, #PID<0.95.0>, #PID<0.96.0>, #PID<0.97.0>, #PID<0.98.0>,
     #PID<0.99.0>, #PID<0.100.0>, #PID<0.101.0>, #PID<0.102.0>, #PID<0.103.0>]
    iex(3)> Process.list() |> Enum.count()
    70
    iex(4)> Process.sleep(2_000)
    :ok
    iex(5)> Process.list() |> Enum.count()
    60
  28. @nirev
    ref = Process.monitor(pid)

    receive do
      {:DOWN, ^ref, :process, ^pid, _reason} ->
        # receive down message and do something
        :ok
    end
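
The surviving-agent behaviour above can be reproduced in a few lines. This sketch assumes nothing beyond the standard library. The module and function names (`LinkDemo`, `orphaned_agent/0`, `stop_when_down/2`) are hypothetical. The first function shows that an Agent linked to a process exiting with `:normal` stays alive; the second shows one possible monitor-based cleanup.

```elixir
defmodule LinkDemo do
  # Starts an Agent linked to a short-lived process that exits with :normal,
  # waits for that process to be gone, and returns the agent pid.
  def orphaned_agent do
    caller = self()

    parent =
      spawn(fn ->
        {:ok, agent} = Agent.start_link(fn -> 42 end)
        send(caller, {:agent, agent})
        exit(:normal)
      end)

    ref = Process.monitor(parent)

    agent =
      receive do
        {:agent, pid} -> pid
      after
        1_000 -> raise "agent was not started"
      end

    receive do
      {:DOWN, ^ref, :process, ^parent, _reason} -> :ok
    after
      1_000 -> raise "parent did not exit"
    end

    agent
  end

  # Spawns a watcher that monitors `owner` and stops `agent` once the owner
  # goes down, whatever the exit reason is.
  def stop_when_down(owner, agent) do
    spawn(fn ->
      ref = Process.monitor(owner)

      receive do
        {:DOWN, ^ref, :process, ^owner, _reason} -> Agent.stop(agent)
      end
    end)
  end
end
```

An exit signal with reason `:normal` does not take down a linked process that is not trapping exits, which is why the agent returned by `LinkDemo.orphaned_agent/0` is still alive. A monitor delivers a `:DOWN` message for every exit reason, so it can drive the cleanup explicitly.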
  29. @nirev The case of Requests Monitoring 3

  30. @nirev [Diagram: request, Tracker; steps: read data from db, api call, update db, publish to queue]
  31. @nirev
    Tracker.add_breadcrumb(:key, metadata)
    Tracker.add_breadcrumb(:update_db, %{user: x, role: y, ..})
    [Diagram: request, Tracker; steps: read data from db, api call, update db, publish to queue]
  32. @nirev
    [Diagram: request, Tracker; steps: read data from db, api call, update db, publish to queue; CRASH]
    breadcrumbs = Tracker.get_breadcrumbs(tracker)
    send_report(exception, breadcrumbs)
  33. @nirev For each request:
    - Start a new Agent
    - Monitor the agent and the process
    - Kill the agent when the process ends normally
      OR, in case of exception: take breadcrumbs from the agent, report, and kill
  34. @nirev 40 80 120 160 ??????

  35. @nirev Cowboy implements the keep-alive mechanism by reusing the same process for all requests. This allows Cowboy to save memory. This works well because most code will not have any side effect impacting subsequent requests. But it also means you need to clean up if you do have code with side effects. The terminate/3 function can be used for this purpose.
  36. @nirev 40 80 120 160 agents forever

  37. @nirev The case of Infinite Restarts 4

  38. @nirev Process

  39. @nirev Process

  40. @nirev Error Reporter Task Process

  41. @nirev Task Supervisor Error Reporter Task Error Reporter Task Process

  42. @nirev Task Supervisor Error Reporter Task Error Reporter Task Remote API is down!!!
  43. @nirev Task Supervisor Task.Supervisor.start_child(…, restart: :transient)

  44. @nirev Task Supervisor Error Reporter Task Error Reporter Task

  45. @nirev Task Supervisor Error Reporter Task Error Reporter Task Remote API is down!!!
  46. @nirev Task Supervisor Task.Supervisor.start_child(…, restart: :temporary)
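
A minimal sketch of the `:temporary` variant from the slide above. With `restart: :transient`, a task that exits abnormally is restarted. While the remote API stays down, every restart fails again, and once the supervisor's restart limit is exceeded the supervisor itself goes down. With `restart: :temporary` the failed task is simply dropped. The module `ReporterSketch` and its always-failing `report/1` are hypothetical stand-ins for the real error reporter.

```elixir
defmodule ReporterSketch do
  # Hypothetical stand-in for an HTTP call to an error-reporting service
  # that is currently unreachable.
  def report(_error), do: raise("remote API is down")

  # Starts the report as a supervised, fire-and-forget task.
  # :temporary means the task is never restarted, even if it crashes.
  def start(supervisor, error) do
    Task.Supervisor.start_child(supervisor, fn -> report(error) end, restart: :temporary)
  end
end
```

Usage, assuming a supervisor started with `{:ok, sup} = Task.Supervisor.start_link()`: `ReporterSketch.start(sup, :boom)` returns `{:ok, pid}`, the task crashes and logs an error, and the supervisor stays up with no children left.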

  47. @nirev The case of Waiting Processes 5

  48. @nirev Process request Task.Supervisor.start_child(…) Notifier

  49. @nirev Process request Task.Supervisor.start_child(…) Notifier Process request Task.Supervisor.start_child(…) Notifier

  50. @nirev Notifier: short-lived, just one HTTP request

  51. @nirev

  52. @nirev

  53. @nirev
    |No | Pid       | Memory      | Name or Initial Call | Reductions | MsgQueue | Current Function                   |
    |1  | <0.433.0> | 137.1796 MB | inet_gethost_native  | 3016210668 | 179710   | inet_gethost_native:do_handle_call |
  54. @nirev The case of the Message Router 6

  55. @nirev Tracking devices

  56. @nirev Backend Server TCP Socket API Commands

  57. @nirev Message Router

  58. @nirev Message Router Out of Memory

  59. @nirev Process Memory Layout: Process Control Block, Stack, Heap

  60. @nirev Process Memory Layout
    Process: Process Control Block, Stack, Heap
    Shared Heap: Refc Binary
    ProcBin (pointers)
  61. @nirev Private Heap Garbage Collection (Process Control Block, Stack, Heap)
    Generational
    • young generation: newly allocated data
    • old generation: data that survives GC
    Fullsweep vs Generational runs:
    • min_heap_size
    • fullsweep_after
  62. @nirev Shared Heap Garbage Collection (Shared Heap, Refc Binary)
    Reference Counting
    • any binary without references will be cleaned
  63. @nirev The problem with Message Router
    It receives messages > 64 bytes (Refc bins).
    So it adds a reference to each of them to its heap.
    But since it doesn’t use much memory, it won’t grow past min_heap_size.
    That means the references are not cleaned, so large binaries linger on the Shared Heap.
  64. @nirev The SOLUTION for Message Router
    1) fullsweep_after process flag: configures how often a fullsweep GC will happen, and that will collect ProcBins
    2) hibernating the process: when it hibernates, a fullsweep GC will run
    3) moving work to short-lived processes, if possible
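
The first two options can be sketched with a small GenServer. This is an illustrative sketch, not the router from the talk: the module `RouterSketch`, its message shape, and the threshold of 10 are assumptions. It sets the `:fullsweep_after` flag in `init/1` and returns `:hibernate` after handling each message.

```elixir
defmodule RouterSketch do
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)

  # Hands a binary payload to the router (hypothetical API).
  def route(router, payload) when is_binary(payload) do
    GenServer.cast(router, {:route, payload})
  end

  def routed_count(router), do: GenServer.call(router, :count)

  @impl true
  def init(:ok) do
    # Option 1: force a fullsweep after at most 10 generational collections.
    # The value 10 is an arbitrary choice for this sketch.
    Process.flag(:fullsweep_after, 10)
    {:ok, 0}
  end

  @impl true
  def handle_cast({:route, _payload}, count) do
    # A real router would forward the payload here.
    # Option 2: hibernate until the next message, which triggers a fullsweep
    # GC and drops this process's references to large binaries.
    {:noreply, count + 1, :hibernate}
  end

  @impl true
  def handle_call(:count, _from, count), do: {:reply, count, count}
end
```

Hibernating after every message costs CPU on a busy process, so in practice it may be better to hibernate only after an idle period (for example with the `:hibernate_after` start option) or to rely on the `fullsweep_after` setting alone.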
  65. @nirev What to do when it happens

  66. @nirev What to do when it happens (before)

  67. @nirev Make it operable!

  68. @nirev
    $ iex
    Erlang/OTP 23 [erts-11.0] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [hipe]
    Interactive Elixir (1.10.4) - press Ctrl+C to exit (type h() ENTER for help)
    iex(1)> Node.list
    []
  69. @nirev
    $ iex --sname node1@localhost --cookie ilovecookies
    Erlang/OTP 23 [erts-11.0] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [hipe]
    Interactive Elixir (1.10.4) - press Ctrl+C to exit (type h() ENTER for help)
    iex(node1@localhost)1> Node.list
    []
  70. @nirev
    $ iex --sname node2 --cookie ilovecookies --remsh node1@localhost
    Erlang/OTP 23 [erts-11.0] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [hipe]
    Interactive Elixir (1.10.4) - press Ctrl+C to exit (type h() ENTER for help)
    iex(node1@localhost)1> Node.list
    [:node2@C02T24Z2HF1R]
  72. @nirev
    $ mix release
    $ ./path/to/release/my_app start
    $ ./path/to/release/my_app remote

  73. @nirev Operate!

  74. @nirev Native tools
    Process.list/0
    Process.info/1
    :sys.get_*
    MyModule.my_very_helpful_debug_function()
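
As a small illustration of what the native tools allow, the sketch below ranks live processes by one `Process.info/2` key, which is roughly how a process with a huge message queue could be found from a remote shell. The module name `ProcTop` and its `by/2` function are hypothetical helpers, not part of Elixir or of the talk.

```elixir
defmodule ProcTop do
  # Returns the top `n` processes as {pid, value} tuples, sorted by the
  # given Process.info/2 key in descending order.
  def by(key, n \\ 5) when key in [:message_queue_len, :memory, :reductions] do
    Process.list()
    |> Enum.flat_map(fn pid ->
      case Process.info(pid, key) do
        {^key, value} -> [{pid, value}]
        # The process died between Process.list/0 and Process.info/2.
        nil -> []
      end
    end)
    |> Enum.sort_by(fn {_pid, value} -> value end, :desc)
    |> Enum.take(n)
  end
end
```

For example, `ProcTop.by(:message_queue_len, 3)` lists the three processes with the longest mailboxes. Scanning every process has a cost on nodes with very many processes, so a sketch like this is for occasional diagnosis rather than continuous polling.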

  75. @nirev :observer_cli.start()

  76. @nirev phoenix_live_dashboard

  77. @nirev recon
    https://ferd.github.io/recon/
    Recon is a library to be dropped into any other Erlang project, to be used to assist DevOps people diagnose problems in production nodes.
  78. @nirev recon
    iex(..)1> :recon.bin_leak(3)
    [
      {#PID<0.80.0>, -606,
       [current_function: {Process, :sleep, 1}, initial_call: {:erlang, :apply, 2}]},
      {#PID<0.124.0>, -176,
       [
         :ssl_manager,
         {:current_function, {:gen_server, :loop, 7}},
         {:initial_call, {:proc_lib, :init_p, 5}}
       ]},
      {#PID<0.1905.0>, -165,
       [current_function: {:ranch_conns_sup, :loop, 4}, initial_call: {:proc_lib, :init_p, 5}]}
    ]
  79. @nirev Metrics and Visibility!

  80. @nirev
    vmstats: send vm metrics to statsd
    https://github.com/ferd/vmstats
    prometheus: send vm metrics to… prometheus
    https://github.com/deadtrickster/prometheus.ex
  81. @nirev :telemetry

  82. @nirev create your own dashboards!

  83. @nirev create your own dashboards!

  84. @nirev Log aggregation

  85. @nirev Error Reporting Sentry Appsignal Bugsnag Rollbar ..

  86. @nirev Bottom line?

  87. @nirev BEAM is amazing

  88. @nirev There are several ways to break your systems

  89. @nirev Don’t go to production without visibility

  90. @nirev Please, Read Erlang in Anger

  91. Guilherme de Maio, nirev@taming-chaos.com, @nirev
    Obrigado! (Thank you!) ❤ ElixirConf EU 2020
    We are hiring! www.telnyx.com