Crashing BEAM Applications @ElixirConf.EU 2020

We often talk about embracing failure and building resilient applications, and how the BEAM helps us do that. Nonetheless, there are several ways to write code that will crash your application, and sometimes the whole VM! This talk is about those things, showing several ways I may or may not have managed to crash the BEAM in the past, some obvious, and some not so obvious ;)

Guilherme de Maio, nirev

October 08, 2020

Transcript

  1. @nirev
    Crashing BEAM
    applications
    2020-10-08
    Guilherme de Maio

  2. @nirev
    Based in São Paulo, Brazil


    Elixir since 2015

    (mostly Java and C before that)


    co-organizer of SP Elixir User Group ;)

  3. @nirev
    Based in São Paulo, Brazil


    Elixir since 2015

    (mostly Java and C before that)


    co-organizer of SP Elixir User Group ;)

    @Telnyx since 2017

  4. @nirev
    Things we often hear

  5. @nirev
    Actor Model!

  6. @nirev
    Isolated Processes!

  7. @nirev
    Amazing GC
    per Process!

  8. @nirev
    Super scalable!

  9. @nirev
    Just
    let it CRASH!!

  10. @nirev
    Well.. That’s not the full
    picture

  11. @nirev
    How to crash your BEAM
    application

  12. @nirev
    The case of
    Exploding Atoms
    1

  13. @nirev
    String.to_atom("don't do it")

  14. @nirev
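    defmodule X do
      def explode do
        Enum.each(1..2_000_000, fn i ->
          IO.puts(" #{i}")

          # Every iteration creates a brand-new atom from random bytes;
          # atoms are never garbage collected, so the atom table only grows.
          20
          |> :crypto.strong_rand_bytes()
          |> Base.encode64()
          |> String.to_atom()
        end)
      end
    end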

  15. @nirev
    82534 ...
    82535 ...
    82536 ...
    82537 ...
    82538 ...
    82539 ...
    82540 ...
    no more index entries in atom_tab (max=100000)
    Crash dump is being written to: erl_crash.dump ...done

  16. @nirev
    • Module names
    • Node names
    • Struct fields
    • “decode as atom”
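
    When atoms come from external input, one common way to keep the atom table bounded is to accept only atoms that already exist. The following is a minimal sketch under that assumption; `SafeAtoms` and `to_known_atom/1` are hypothetical names, not something shown in the talk.

    ```elixir
    defmodule SafeAtoms do
      # Hypothetical helper (not from the talk): converts external input to an
      # atom only if that atom already exists, so untrusted strings cannot
      # grow the atom table.
      def to_known_atom(string) when is_binary(string) do
        {:ok, String.to_existing_atom(string)}
      rescue
        ArgumentError -> :error
      end
    end

    # :ok already exists in every running VM, so the lookup succeeds.
    known = SafeAtoms.to_known_atom("ok")

    # A string built at runtime has no matching atom and is rejected
    # instead of being added to the table.
    unknown = SafeAtoms.to_known_atom("never_seen_#{System.unique_integer([:positive])}")

    # The current atom count and the configured limit can be read at runtime.
    usage = {:erlang.system_info(:atom_count), :erlang.system_info(:atom_limit)}
    IO.inspect({known, unknown, usage})
    ```

    Watching the atom count against the limit is one way to notice a leak before the VM hits the `atom_tab` error shown earlier.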

  17. @nirev
    The case of the
    Linked Agent
    2

  18. @nirev
    Process 1 -- start_link --> Linked Process

  19. @nirev
    Process 1 -- start_link --> Linked Process
    exit reason: exception

  21. @nirev
    Process 1 -- start_link --> Linked Process
    exit reason: normal

  22. @nirev
    Linked Process: "I'm a survivor"

  23. @nirev
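    defmodule X do
      def spawn do
        spawn(fn ->
          # The Agent is linked to this short-lived process...
          {:ok, _pid} = Agent.start_link(fn -> 42 end)
          Process.sleep(1_000)
          # ...but a :normal exit does not take linked processes down.
          exit(:normal)
        end)
      end
    end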

  24. @nirev
    iex(1)> Process.list() |> Enum.count()
    50
    iex(2)> for _ <- 1..10, do: X.spawn
    [#PID<0.94.0>, #PID<0.95.0>, #PID<0.96.0>, #PID<0.97.0>, #PID<0.98.0>,
    #PID<0.99.0>, #PID<0.100.0>, #PID<0.101.0>, #PID<0.102.0>, #PID<0.103.0>]
    iex(3)> Process.list() |> Enum.count()
    70
    iex(4)> Process.sleep(2_000)
    :ok
    iex(5)> Process.list() |> Enum.count()
    60

  25. @nirev
    ref = Process.monitor(pid)

    receive do
      {:DOWN, ^ref, :process, ^pid, _reason} ->
        # receive down message and do something
        :ok
    end
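
    A fuller sketch of the monitor approach: a small watcher process monitors the owner and stops the Agent on any exit, including `:normal`. `Cleanup` and `stop_when_down/2` are hypothetical names used only for illustration.

    ```elixir
    defmodule Cleanup do
      # Hypothetical helper: spawns a watcher that monitors `owner` and stops
      # `agent` as soon as the owner goes down, whatever the exit reason.
      def stop_when_down(owner, agent) do
        spawn(fn ->
          ref = Process.monitor(owner)

          receive do
            {:DOWN, ^ref, :process, ^owner, _reason} -> Agent.stop(agent)
          end
        end)
      end
    end

    # An owner that exits normally once it receives :done.
    owner =
      spawn(fn ->
        receive do
          :done -> :ok
        end
      end)

    # Started without a link; the watcher is responsible for cleanup.
    {:ok, agent} = Agent.start(fn -> 42 end)
    Cleanup.stop_when_down(owner, agent)

    agent_ref = Process.monitor(agent)
    send(owner, :done)

    # Wait until the watcher has stopped the Agent (or give up after 1s).
    stopped? =
      receive do
        {:DOWN, ^agent_ref, :process, ^agent, _reason} -> true
      after
        1_000 -> false
      end

    IO.inspect(stopped?, label: "agent stopped")
    ```

    Unlike a link, the monitor delivers a `:DOWN` message for every exit reason, so the Agent does not outlive an owner that exits normally.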

  26. @nirev
    The case of
    Requests Monitoring
    3

  27. @nirev
    request
    Tracker
    read data from db
    api call
    update db
    publish to queue

  28. @nirev
    Tracker.add_breadcrumb(:key, metadata)
    Tracker.add_breadcrumb(:update_db, %{user: x, role: y, ..})
    request
    Tracker
    read data from db
    api call
    update db
    publish to queue

  29. @nirev
    request
    Tracker
    read data from db
    api call
    update db
    publish to queue
    CRASH
    breadcrumbs = Tracker.get_breadcrumbs(tracker)
    send_report(exception, breadcrumbs)

  30. @nirev
    For each request:
    - Start a new Agent
    - Monitor the agent and the process
    - Kill the agent when the process ends normally
    OR, in case of an exception:
    - Take the breadcrumbs from the agent, report, and kill the agent
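
    The design above might look roughly like the sketch below. Only `add_breadcrumb` and `get_breadcrumbs` are named in the slides; `start/2`, the explicit `tracker` argument, and `report_fun` are assumptions made for illustration.

    ```elixir
    defmodule Tracker do
      # Illustrative reconstruction, not the original implementation.
      # Starts an Agent for `request_pid` plus a watcher that cleans it up.
      def start(request_pid, report_fun) do
        {:ok, tracker} = Agent.start(fn -> [] end)
        caller = self()

        watcher =
          spawn(fn ->
            ref = Process.monitor(request_pid)
            send(caller, {:watching, self()})

            receive do
              {:DOWN, ^ref, :process, ^request_pid, reason} ->
                # Report only abnormal exits, then always stop the Agent.
                if reason not in [:normal, :shutdown] do
                  report_fun.(reason, get_breadcrumbs(tracker))
                end

                Agent.stop(tracker)
            end
          end)

        # Wait until the monitor is in place before handing the tracker out.
        receive do
          {:watching, ^watcher} -> tracker
        end
      end

      def add_breadcrumb(tracker, key, metadata) do
        Agent.update(tracker, &[{key, metadata} | &1])
      end

      def get_breadcrumbs(tracker) do
        tracker |> Agent.get(& &1) |> Enum.reverse()
      end
    end

    # Usage: a fake request process that records a breadcrumb and then crashes.
    test_pid = self()

    request =
      spawn(fn ->
        receive do
          {:tracker, tracker} ->
            Tracker.add_breadcrumb(tracker, :update_db, %{user: 1})
            exit(:boom)
        end
      end)

    tracker = Tracker.start(request, fn reason, crumbs -> send(test_pid, {:report, reason, crumbs}) end)
    send(request, {:tracker, tracker})

    report =
      receive do
        {:report, reason, crumbs} -> {reason, crumbs}
      after
        1_000 -> :no_report
      end

    IO.inspect(report)
    ```

    Note that the cleanup only fires when the monitored request process actually terminates; a process that stays alive keeps its Agent alive too.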

  31. @nirev
    [Chart: axis ticks 40, 80, 120, 160; annotation "??????"]

  32. @nirev
    Cowboy implements the keep-alive mechanism by reusing the same
    process for all requests. This allows Cowboy to save memory. This
    works well because most code will not have any side effect impacting
    subsequent requests. But it also means you need to clean up if you do
    have code with side effects. The terminate/3 function can be used for
    this purpose.

  33. @nirev
    [Chart: axis ticks 40, 80, 120, 160; annotation "agents forever"]

  34. @nirev
    The case of
    Infinite Restarts
    4

  35. @nirev
    Process

  37. @nirev
    Error Reporter Task
    Process

  38. @nirev
    Task Supervisor
    Error Reporter Task
    Error Reporter Task
    Process

  39. @nirev
    Task Supervisor
    Error Reporter Task
    Error Reporter Task
    Remote API is down!!!

  40. @nirev
    Task Supervisor
    Task.Supervisor.start_child(…, restart: :transient)

  41. @nirev
    Task Supervisor
    Error Reporter Task
    Error Reporter Task

  42. @nirev
    Task Supervisor
    Error Reporter Task
    Error Reporter Task
    Remote API is down!!!

  43. @nirev
    Task Supervisor
    Task.Supervisor.start_child(…, restart: :temporary)
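
    A small sketch of the `:temporary` behaviour, using a counter Agent to show that a failing task runs exactly once. The `failing_report` function and the `:api_down` exit reason are made up for illustration.

    ```elixir
    {:ok, sup} = Task.Supervisor.start_link()
    {:ok, attempts} = Agent.start_link(fn -> 0 end)

    # Hypothetical reporter: counts the attempt, then fails as if the
    # remote API were down.
    failing_report = fn ->
      Agent.update(attempts, &(&1 + 1))
      exit(:api_down)
    end

    # :temporary children are never restarted, whatever the exit reason.
    {:ok, task} = Task.Supervisor.start_child(sup, failing_report, restart: :temporary)

    ref = Process.monitor(task)

    receive do
      {:DOWN, ^ref, :process, ^task, _reason} -> :ok
    after
      1_000 -> :timeout
    end

    # Leave a moment in which a restart would have happened.
    Process.sleep(200)

    result = {Agent.get(attempts, & &1), Process.alive?(sup)}
    IO.inspect(result, label: "{attempts, supervisor alive?}")
    ```

    With `restart: :transient`, a child that exits abnormally is restarted, and once the restarts exceed the supervisor's `max_restarts` within `max_seconds`, the supervisor itself terminates. With `:temporary`, the failed report is simply dropped.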

  44. @nirev
    The case of
    Waiting Processes
    5

  45. @nirev
    Process
    request Task.Supervisor.start_child(…)
    Notifier

  46. @nirev
    Process
    request Task.Supervisor.start_child(…)
    Notifier
    Process
    request Task.Supervisor.start_child(…)
    Notifier

  47. @nirev
    Notifier
    short-lived

    just one http request

  48. @nirev
    | No | Pid       | Memory      | Name or Initial Call | Reductions | MsgQueue | Current Function                   |
    | 1  | <0.433.0> | 137.1796 MB | inet_gethost_native  | 3016210668 | 179710   | inet_gethost_native:do_handle_call |
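
    A backlog like the one in the MsgQueue column can also be found with nothing but the standard library. The sketch below ranks processes by mailbox length; `Mailboxes.top/1` is a hypothetical helper, and the stuck process is a stand-in for an overloaded one.

    ```elixir
    defmodule Mailboxes do
      # Returns the `n` processes with the longest message queues as
      # {pid, queue_length, registered_name} tuples
      # (registered_name is [] for unregistered processes).
      def top(n) do
        Process.list()
        |> Enum.flat_map(fn pid ->
          case Process.info(pid, [:message_queue_len, :registered_name]) do
            # The process died between Process.list/0 and Process.info/2.
            nil -> []
            info -> [{pid, info[:message_queue_len], info[:registered_name]}]
          end
        end)
        |> Enum.sort_by(fn {_pid, len, _name} -> len end, :desc)
        |> Enum.take(n)
      end
    end

    # Stand-in for an overloaded process: it never reads its mailbox.
    stuck = spawn(fn -> Process.sleep(:infinity) end)
    for i <- 1..1_000, do: send(stuck, {:lookup, i})

    worst = Mailboxes.top(1)
    IO.inspect(worst, label: "longest mailbox")
    ```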

  49. @nirev
    The case of
    the Message Router
    6

  50. @nirev
    Tracking devices

  51. @nirev
    Backend Server
    TCP Socket
    API
    Commands

  52. @nirev
    Message Router

  53. @nirev
    Message Router
    Out of Memory

  54. @nirev
    Process Memory Layout
    • Process Control Block
    • Stack
    • Heap

  55. @nirev
    Process Memory Layout
    • Process Control Block
    • Stack
    • Heap: ProcBin (pointers)
    Shared Heap
    • Refc Binary

  56. @nirev
    Private Heap Garbage Collection
    (Process Control Block, Stack, Heap)
    Generational:
    • young generation: newly allocated data
    • old generation: data that survives GC
    Fullsweep vs Generational runs:
    • min_heap_size
    • fullsweep_after

  57. @nirev
    Shared Heap Garbage Collection
    Reference Counting
    • any binary without references will be cleaned
    Shared Heap: Refc Binary

  58. @nirev
    The problem with Message Router
    It receives messages larger than 64 bytes (Refc binaries),
    so each one adds a reference (ProcBin) to its heap.
    But since it doesn't use much memory,
    it won't grow past min_heap_size, so a GC is rarely triggered.
    That means the references are not cleaned up, and large
    binaries linger on the Shared Heap.
    Message Router

  59. @nirev
    The SOLUTION for Message Router
    1) fullsweep_after process flag:
       configures how often a fullsweep GC happens,
       which collects the ProcBins
    2) hibernating the process:
       when it hibernates, a fullsweep GC runs
    3) moving work to short-lived processes, if possible
    Message Router
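
    Options 1 and 2 could be combined in a GenServer roughly as follows. This is a sketch, not the original router: the `Router` module, the `{:route, to, payload}` message shape, and the `fullsweep_after` value of 10 are all assumptions.

    ```elixir
    defmodule Router do
      use GenServer

      # Illustrative router: forwards every {:route, to, payload} message.
      def start_link(opts \\ []) do
        # Option 1: force a fullsweep GC after at most 10 generational
        # collections. The value 10 is an arbitrary choice for this sketch.
        GenServer.start_link(__MODULE__, :ok, [spawn_opt: [fullsweep_after: 10]] ++ opts)
      end

      @impl true
      def init(:ok), do: {:ok, %{routed: 0}}

      @impl true
      def handle_info({:route, to, payload}, state) do
        send(to, payload)
        # Option 2: hibernate after handling the message, which runs a
        # fullsweep GC and drops references to the binary just forwarded.
        {:noreply, %{state | routed: state.routed + 1}, :hibernate}
      end
    end

    {:ok, router} = Router.start_link()

    # A 1 KB payload: larger than 64 bytes, so it is stored as a Refc binary.
    payload = :binary.copy(<<0>>, 1_024)
    send(router, {:route, self(), payload})

    delivered? =
      receive do
        ^payload -> true
      after
        1_000 -> false
      end

    {:garbage_collection, gc} = Process.info(router, :garbage_collection)
    IO.inspect({delivered?, gc[:fullsweep_after]})
    ```

    Hibernating after every message costs CPU, so on a busy router it may be preferable to hibernate only when idle or to rely on `fullsweep_after` alone.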

  60. @nirev
    What to do when it happens

  61. @nirev
    What to do when it happens
    (before)

  62. @nirev
    Make it operable!

  63. @nirev
    $ iex
    Erlang/OTP 23 [erts-11.0] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [hipe]
    Interactive Elixir (1.10.4) - press Ctrl+C to exit (type h() ENTER for help)
    iex(1)> Node.list
    []

  64. @nirev
    $ iex --sname node1@localhost --cookie ilovecookies
    Erlang/OTP 23 [erts-11.0] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [hipe]
    Interactive Elixir (1.10.4) - press Ctrl+C to exit (type h() ENTER for help)
    iex(node1@localhost)1> Node.list
    []

  65. @nirev
    $ iex --sname node2 --cookie ilovecookies --remsh node1@localhost
    Erlang/OTP 23 [erts-11.0] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [hipe]
    Interactive Elixir (1.10.4) - press Ctrl+C to exit (type h() ENTER for help)
    iex(node1@localhost)1> Node.list
    [:node2@C02T24Z2HF1R]

  67. @nirev
    $ mix release
    $ ./path/to/release/my_app start
    $ ./path/to/release/my_app remote

  68. @nirev
    Operate!

  69. @nirev
    Native tools
    Process.list/0
    Process.info/1
    :sys.get_*
    MyModule.my_very_helpful_debug_function()
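
    For example, from a remote shell these tools can be combined as in the sketch below. The Agent is only a stand-in for one of your own processes, and `Process.info/2` is used here to select specific fields.

    ```elixir
    # A stand-in process to inspect; in production this would be one of your own.
    {:ok, agent} = Agent.start_link(fn -> %{requests: 0} end)

    # Selected facts about a single process; returns nil if it has died.
    info = Process.info(agent, [:memory, :message_queue_len, :current_function])

    # Current state of any OTP-compliant process (GenServer, Agent, ...).
    state = :sys.get_state(agent)

    IO.inspect(info, label: "info")
    IO.inspect(state, label: "state")
    ```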

  70. @nirev
    :observer_cli.start()

  71. @nirev
    phoenix_live_dashboard

  72. @nirev
    recon
    https://ferd.github.io/recon/
    Recon is a library to be dropped into any other Erlang
    project, to be used to assist DevOps people diagnose
    problems in production nodes.

  73. @nirev
    recon
    iex(..)1> :recon.bin_leak(3)
    [
      {#PID<0.80.0>, -606,
       [current_function: {Process, :sleep, 1},
        initial_call: {:erlang, :apply, 2}]},
      {#PID<0.124.0>, -176,
       [:ssl_manager,
        {:current_function, {:gen_server, :loop, 7}},
        {:initial_call, {:proc_lib, :init_p, 5}}]},
      {#PID<0.1905.0>, -165,
       [current_function: {:ranch_conns_sup, :loop, 4},
        initial_call: {:proc_lib, :init_p, 5}]}
    ]
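
    The negative numbers are the change in the number of binary references a process holds once it is garbage collected. A rough standard-library approximation of that idea is sketched below; `BinLeak` is a hypothetical module, not recon's implementation, and the `:binary` item of `Process.info/2` is a debugging aid whose format may change between OTP releases.

    ```elixir
    defmodule BinLeak do
      # For every process: count binary references, force a GC, count again.
      # Returns the `n` processes that released the most references, as
      # {pid, delta} tuples with the most negative delta first.
      def top(n) do
        Process.list()
        |> Enum.flat_map(fn pid ->
          with {:binary, before_gc} <- Process.info(pid, :binary),
               true <- :erlang.garbage_collect(pid),
               {:binary, after_gc} <- Process.info(pid, :binary) do
            [{pid, length(after_gc) - length(before_gc)}]
          else
            # The process died somewhere along the way; skip it.
            _ -> []
          end
        end)
        |> Enum.sort_by(fn {_pid, delta} -> delta end)
        |> Enum.take(n)
      end
    end

    leaks = BinLeak.top(3)
    IO.inspect(leaks, label: "largest drops in binary references")
    ```

    Forcing a garbage collection of every process is expensive, so this is something to run by hand while investigating, not on a schedule.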

  74. @nirev
    Metrics
    and
    Visibility!

  75. @nirev
    vmstats: send vm metrics to statsd
    https://github.com/ferd/vmstats
    prometheus: send vm metrics to… prometheus
    https://github.com/deadtrickster/prometheus.ex

  76. @nirev
    :telemetry

  77. @nirev
    create your own dashboards!

  79. @nirev
    Log aggregation

  80. @nirev
    Error Reporting
    Sentry
    Appsignal
    Bugsnag
    Rollbar
    ..

  81. @nirev
    Bottom line?

  82. @nirev
    BEAM is amazing

  83. @nirev
    There are several ways to
    break your systems

  84. @nirev
    Don’t go to production
    without visibility

  85. @nirev
    Please, Read Erlang in Anger

  86. Guilherme de Maio
    Obrigado!
    ❤ ElixirConf EU 2020
    We are hiring!
    www.telnyx.com
    @nirev
