Crashing BEAM Applications @ElixirConf.EU 2020

We often talk about embracing failure and building resilient applications, and how BEAM helps us do that. Nonetheless, there are several ways to write code that will crash your application, and sometimes the whole VM! This talk is about those things, showing several ways I may or may not have managed to crash BEAM in the past, some obvious, and some not that obvious ;)

Guilherme de Maio, nirev

October 08, 2020

Transcript

  1. @nirev
    Crashing BEAM
    applications
    2020-10-08
    Guilherme de Maio

  2. @nirev
    Why?

  3. @nirev
    Who?

  4. @nirev
    Based in São Paulo, Brazil


    Elixir since 2015

    (mostly Java and C before that)


    co-organizer of SP Elixir User Group ;)

  5. @nirev
    Based in São Paulo, Brazil


    Elixir since 2015

    (mostly Java and C before that)


    co-organizer of SP Elixir User Group ;)

    @Telnyx since 2017

  6. @nirev

  7. @nirev
    Things we often hear

  8. @nirev
    Actor Model!

  9. @nirev
    Isolated Processes!

  10. @nirev
    Amazing GC
    per Process!

  11. @nirev
    Super scalable!

  12. @nirev
    Just
    let it CRASH!!

  13. @nirev
    Well... that’s not the full
    picture

  14. @nirev
    How to crash your BEAM
    application

  15. @nirev
    The case of
    Exploding Atoms
    1

  16. @nirev
    String.to_atom("don't do it")

  17. @nirev
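    defmodule X do
      # Creates up to two million unique atoms. Atoms are never garbage
      # collected, so the atom table eventually fills up and the VM crashes.
      def explode do
        Enum.each(1..2_000_000, fn i ->
          IO.puts(" #{i}")

          20
          |> :crypto.strong_rand_bytes()
          |> Base.encode64()
          |> String.to_atom()
        end)
      end
    end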

  18. @nirev
    82534 ...
    82535 ...
    82536 ...
    82537 ...
    82538 ...
    82539 ...
    82540 ...
    no more index entries in atom_tab (max=100000)
    Crash dump is being written to: erl_crash.dump ...done
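
    A small sketch for watching how close a node is to the atom limit. `:erlang.system_info(:atom_count)` and `:erlang.system_info(:atom_limit)` are standard calls (OTP 20 and later); the `AtomWatch` module name and the 0.8 threshold are assumptions made up for illustration. The `max=100000` in the output above suggests the demo node ran with a lowered limit (for example `iex --erl "+t 100000"`), but that is a guess, not something the slides state.

    ```elixir
    defmodule AtomWatch do
      # Returns {count, limit, ratio} for this node's atom table.
      def usage do
        count = :erlang.system_info(:atom_count)
        limit = :erlang.system_info(:atom_limit)
        {count, limit, count / limit}
      end

      # True when usage reaches the given ratio (0.8 is an arbitrary example).
      def near_limit?(threshold \\ 0.8) do
        {_count, _limit, ratio} = usage()
        ratio >= threshold
      end
    end
    ```

    Calling `AtomWatch.usage()` periodically and exporting the ratio as a metric would make this kind of growth visible long before the crash dump.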

  19. @nirev
    • Module names
    • Node names
    • Struct fields
    • “decode as atom”
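
    For the sources listed above, one defensive pattern is to never create atoms from external input. The sketch below shows two options: an explicit allowlist, and `String.to_existing_atom/1`, which raises `ArgumentError` instead of growing the atom table. The `SafeAtom` module and its allowlist are hypothetical names used only for illustration.

    ```elixir
    defmodule SafeAtom do
      # Hypothetical allowlist of keys this application expects from outside.
      @known ~w(name role status)a

      # Converts external input to an atom only if it is on the allowlist.
      def fetch(string) when is_binary(string) do
        case Enum.find(@known, &(Atom.to_string(&1) == string)) do
          nil -> :error
          atom -> {:ok, atom}
        end
      end

      # Alternative: only reuse atoms that already exist in the atom table.
      def existing(string) when is_binary(string) do
        {:ok, String.to_existing_atom(string)}
      rescue
        ArgumentError -> :error
      end
    end
    ```

    Neither function can add entries to the atom table, no matter how many distinct strings are passed in.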

  20. @nirev
    The case of the
    Linked Agent
    2

  21. @nirev
    Process 1 --start_link--> Linked Process

  22. @nirev
    Process 1 --start_link--> Linked Process
    exit reason: exception

  23. @nirev
    Process 1 --start_link--> Linked Process
    exit reason: exception

  24. @nirev
    Process 1 --start_link--> Linked Process
    exit reason: normal

  25. @nirev
    Linked
    Process
    “I'm a survivor”

  26. @nirev
    defmodule X do
      def spawn do
        # Kernel.spawn/1: the parent links an Agent, waits, then exits normally.
        spawn(fn ->
          {:ok, _pid} = Agent.start_link(fn -> 42 end)
          Process.sleep(1_000)
          exit(:normal)
        end)
      end
    end

  27. @nirev
    iex(1)> Process.list() |> Enum.count()
    50
    iex(2)> for _ <- 1..10, do: X.spawn
    [#PID<0.94.0>, #PID<0.95.0>, #PID<0.96.0>, #PID<0.97.0>, #PID<0.98.0>,
    #PID<0.99.0>, #PID<0.100.0>, #PID<0.101.0>, #PID<0.102.0>, #PID<0.103.0>]
    iex(3)> Process.list() |> Enum.count()
    70
    iex(4)> Process.sleep(2_000)
    :ok
    iex(5)> Process.list() |> Enum.count()
    60
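
    The session above can be reduced to a single check. The sketch below spawns a parent that links an Agent and then exits with a given reason, and reports whether the Agent outlives it. `LinkedAgentDemo` is a made-up name, and the 500 ms wait is an arbitrary demo value.

    ```elixir
    defmodule LinkedAgentDemo do
      # Spawns a parent that links an Agent and then exits with `reason`.
      # Returns true if the Agent is still alive afterwards.
      def agent_survives?(reason) do
        caller = self()

        spawn(fn ->
          {:ok, agent} = Agent.start_link(fn -> 42 end)
          send(caller, {:agent, agent})
          exit(reason)
        end)

        agent =
          receive do
            {:agent, pid} -> pid
          end

        ref = Process.monitor(agent)

        receive do
          {:DOWN, ^ref, :process, ^agent, _reason} -> false
        after
          500 ->
            # Still alive: clean up so the demo itself does not leak an Agent.
            Process.demonitor(ref, [:flush])
            Agent.stop(agent)
            true
        end
      end
    end
    ```

    `LinkedAgentDemo.agent_survives?(:normal)` returns `true`, because a `:normal` exit signal is ignored by a linked process that does not trap exits, while `LinkedAgentDemo.agent_survives?(:shutdown)` returns `false`.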

  28. @nirev
    ref = Process.monitor(pid)

    receive do
      {:DOWN, ^ref, :process, ^pid, _reason} ->
        # receive down message and do something
        :ok
    end
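
    A minimal sketch of how a monitor can be used to clean up a helper process whatever the owner's exit reason. The `Janitor` module and `start_owned_agent/2` are hypothetical names, not part of the talk, and the watcher is deliberately simplistic (it ignores the small race where the agent dies between the two calls in the `:DOWN` branch).

    ```elixir
    defmodule Janitor do
      # Starts an unlinked Agent plus a watcher process. The watcher monitors
      # `owner` and stops the Agent as soon as the owner goes down, for any reason.
      def start_owned_agent(owner, fun) when is_pid(owner) and is_function(fun, 0) do
        {:ok, agent} = Agent.start(fun)

        spawn(fn ->
          ref = Process.monitor(owner)

          receive do
            {:DOWN, ^ref, :process, ^owner, _reason} ->
              if Process.alive?(agent), do: Agent.stop(agent)
          end
        end)

        {:ok, agent}
      end
    end
    ```

    Unlike a link, the `:DOWN` message is delivered for `:normal` exits too, so the Agent does not survive its owner.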

  29. @nirev
    The case of
    Requests Monitoring
    3

  30. @nirev
    request
    Tracker
    read data from db
    api call
    update db
    publish to queue

  31. @nirev
    Tracker.add_breadcrumb(:key, metadata)
    Tracker.add_breadcrumb(:update_db, %{user: x, role: y, ..})
    request
    Tracker
    read data from db
    api call
    update db
    publish to queue

  32. @nirev
    request
    Tracker
    read data from db
    api call
    update db
    publish to queue
    CRASH
    breadcrumbs = Tracker.get_breadcrumbs(tracker)
    send_report(exception, breadcrumbs)

  33. @nirev
    For each request:
    - start a new Agent
    - monitor the agent and the request process
    - kill the agent when the process ends normally
    OR, in case of an exception:
    - take the breadcrumbs from the agent, report them, and kill the agent

  34. @nirev
    [Chart: axis ticks 40, 80, 120, 160]
    ??????

  35. @nirev
    Cowboy implements the keep-alive mechanism by reusing the same
    process for all requests. This allows Cowboy to save memory. This
    works well because most code will not have any side effect impacting
    subsequent requests. But it also means you need to clean up if you do
    have code with side effects. The terminate/3 function can be used for
    this purpose.
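
    One way to do the kind of cleanup the quote asks for, sketched without Cowboy: tie the tracker's lifetime to the handler call instead of to the connection process, so a reused process cannot accumulate agents. `RequestTracker`, `track/2`, and the `report` callback are hypothetical names; the talk does not show the actual fix.

    ```elixir
    defmodule RequestTracker do
      # Runs `handler` with a fresh breadcrumb Agent and always stops the Agent
      # afterwards, even when the handler raises.
      def track(handler, report) when is_function(handler, 1) and is_function(report, 2) do
        {:ok, tracker} = Agent.start(fn -> [] end)

        try do
          handler.(tracker)
        rescue
          exception ->
            # Report with the breadcrumbs collected so far, then re-raise.
            report.(exception, breadcrumbs(tracker))
            reraise exception, __STACKTRACE__
        after
          Agent.stop(tracker)
        end
      end

      def add_breadcrumb(tracker, key, metadata) do
        Agent.update(tracker, &[{key, metadata} | &1])
      end

      # Breadcrumbs in the order they were added.
      def breadcrumbs(tracker), do: Agent.get(tracker, &Enum.reverse/1)
    end
    ```

    Because the `after` clause runs on both paths, the number of live agents stays bounded by the number of requests in flight, regardless of keep-alive.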

  36. @nirev
    [Chart: axis ticks 40, 80, 120, 160]
    agents forever

  37. @nirev
    The case of
    Infinite Restarts
    4

  38. @nirev
    Process

  39. @nirev
    Process

  40. @nirev
    Error Reporter Task
    Process

  41. @nirev
    Task Supervisor
    Error Reporter Task
    Error Reporter Task
    Process

  42. @nirev
    Task Supervisor
    Error Reporter Task
    Error Reporter Task
    Remote API is down!!!

  43. @nirev
    Task Supervisor
    Task.Supervisor.start_child(…, restart: :transient)

  44. @nirev
    Task Supervisor
    Error Reporter Task
    Error Reporter Task

  45. @nirev
    Task Supervisor
    Error Reporter Task
    Error Reporter Task
    Remote API is down!!!

  46. @nirev
    Task Supervisor
    Task.Supervisor.start_child(…, restart: :temporary)
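
    A small sketch of what `restart: :temporary` means in practice: a failing task runs once and is not restarted, so a remote outage cannot push the supervisor past its restart limit. The `failing_report` function and the `:remote_api_down` exit reason are invented for the demo; `Task.Supervisor.start_child/3` and its `:restart` option are the real API.

    ```elixir
    {:ok, sup} = Task.Supervisor.start_link()
    {:ok, attempts} = Agent.start_link(fn -> 0 end)

    # Stand-in for an error reporter whose remote API is down.
    failing_report = fn ->
      Agent.update(attempts, &(&1 + 1))
      exit(:remote_api_down)
    end

    {:ok, task} = Task.Supervisor.start_child(sup, failing_report, restart: :temporary)
    ref = Process.monitor(task)

    receive do
      {:DOWN, ^ref, :process, ^task, _reason} -> :ok
    end

    # Give the supervisor a moment; with :temporary there is nothing to restart.
    Process.sleep(200)

    IO.inspect(Agent.get(attempts, & &1), label: "attempts")
    IO.inspect(Process.alive?(sup), label: "supervisor alive?")
    ```

    With `:transient` the same failing task would be restarted after each abnormal exit until the supervisor's restart intensity is exceeded, at which point the supervisor itself terminates.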

  47. @nirev
    The case of
    Waiting Processes
    5

  48. @nirev
    Process
    request Task.Supervisor.start_child(…)
    Notifier

  49. @nirev
    Process
    request Task.Supervisor.start_child(…)
    Notifier
    Process
    request Task.Supervisor.start_child(…)
    Notifier

  50. @nirev
    Notifier
    short-lived

    just one http request

  51. @nirev

  52. @nirev

  53. @nirev
    |No | Pid | Memory |Name or Initial Call | Reductions| MsgQueue |Current Function |
    |1 |<0.433.0> | 137.1796 MB |inet_gethost_native | 3016210668| 179710 |inet_gethost_native:do_handle_call|
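
    A row like the one above can also be found with nothing but `Process.list/0` and `Process.info/2`. The sketch below lists the processes with the longest message queues; `QueueInspector` is a hypothetical helper name, and `Enum.sort_by/3` with `:desc` assumes Elixir 1.10 or later.

    ```elixir
    defmodule QueueInspector do
      # Returns up to `n` {pid, message_queue_len} pairs, longest queue first.
      def top_by_message_queue(n \\ 5) do
        Process.list()
        |> Enum.flat_map(fn pid ->
          case Process.info(pid, :message_queue_len) do
            {:message_queue_len, len} -> [{pid, len}]
            # The process died between Process.list/0 and Process.info/2.
            nil -> []
          end
        end)
        |> Enum.sort_by(fn {_pid, len} -> len end, :desc)
        |> Enum.take(n)
      end
    end
    ```

    A single process with a queue that keeps growing, as in the table, usually means many callers are waiting on one serialized resource.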

  54. @nirev
    The case of
    the Message Router
    6

  55. @nirev
    Tracking devices

  56. @nirev
    Backend Server
    TCP Socket
    API
    Commands

  57. @nirev
    Message Router

  58. @nirev
    Message Router
    Out of Memory

  59. @nirev
    Process Memory Layout
    • Process Control Block
    • Stack
    • Heap

  60. @nirev
    Process Memory Layout
    • Process Control Block
    • Stack
    • Heap: ProcBin (pointers)
    Shared Heap
    • Refc Binary

  61. @nirev
    Private Heap Garbage Collection
    (Process Control Block, Stack, Heap)
    Generational
    • young generation: newly allocated data
    • old generation: data that survives GC
    Fullsweep vs Generational runs:
    • min_heap_size
    • fullsweep_after

  62. @nirev
    Shared Heap Garbage Collection
    Reference Counting
    • any binary without references will be cleaned
    Shared Heap
    Refc Binary

  63. @nirev
    The problem with Message Router
    It receives messages > 64 bytes (Refc binaries),
    so it adds a reference (ProcBin) to each of them on its own heap.
    But since it doesn’t use much memory, its heap
    won’t grow past min_heap_size, and GC rarely runs.
    That means the references are not cleaned up, so large
    binaries linger on the Shared Heap.
    Message
    Router
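
    A rough way to observe this on a live node is to count the refc binaries a process still references, before and after forcing a collection. `Process.info(pid, :binary)` and `:erlang.garbage_collect/1` are standard; the `TinyRouter` module is a toy stand-in for the router, and the actual counts depend on when the VM decides to collect, so none are shown.

    ```elixir
    defmodule TinyRouter do
      # A deliberately idle "router": it receives binaries and does almost no work.
      def start, do: spawn(fn -> loop() end)

      defp loop do
        receive do
          {:route, payload} when is_binary(payload) ->
            # A real router would forward the payload; here it is just dropped.
            loop()
        end
      end

      # Number of refc binaries the process currently references.
      def refc_count(pid) do
        {:binary, bins} = Process.info(pid, :binary)
        length(bins)
      end

      # Forces a garbage collection of the process, releasing unused ProcBins.
      def sweep(pid), do: :erlang.garbage_collect(pid)
    end

    router = TinyRouter.start()

    # 1 KB payloads are well above the 64-byte threshold, so they are refc binaries.
    for _ <- 1..100, do: send(router, {:route, :binary.copy("x", 1024)})
    Process.sleep(100)

    IO.inspect(TinyRouter.refc_count(router), label: "refc binaries before sweep")
    TinyRouter.sweep(router)
    IO.inspect(TinyRouter.refc_count(router), label: "refc binaries after sweep")
    ```

    This is essentially the comparison that tools built for finding binary leaks automate across all processes.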

  64. @nirev
    The SOLUTION for Message Router
    1) fullsweep_after process flag:
    configures how often a fullsweep GC will happen,
    and that will collect ProcBins
    2) hibernating the process:
    when it hibernates, a fullsweep GC will run
    3) moving work to short-lived processes, if possible
    Message
    Router
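
    Options 1 and 2 can both be set when the process is started. The sketch below assumes the router is a GenServer; `SweptRouter` is a hypothetical name, and the values `10` and `1_000` are arbitrary examples rather than recommendations. `:spawn_opt` and `:hibernate_after` are standard `GenServer.start_link/3` options.

    ```elixir
    defmodule SweptRouter do
      use GenServer

      def start_link(opts \\ []) do
        defaults = [
          # Force a fullsweep after at most 10 generational collections.
          spawn_opt: [fullsweep_after: 10],
          # Hibernate (which runs a fullsweep) after 1 second without messages.
          hibernate_after: 1_000
        ]

        GenServer.start_link(__MODULE__, :ok, Keyword.merge(defaults, opts))
      end

      # Number of messages routed so far.
      def routed(pid), do: GenServer.call(pid, :routed)

      @impl true
      def init(:ok), do: {:ok, 0}

      @impl true
      def handle_info({:route, payload}, routed) when is_binary(payload) do
        # A real router would forward the payload to its destination here.
        {:noreply, routed + 1}
      end

      def handle_info(_other, routed), do: {:noreply, routed}

      @impl true
      def handle_call(:routed, _from, routed), do: {:reply, routed, routed}
    end
    ```

    `Process.info(pid, :garbage_collection)` shows the effective `fullsweep_after` value, which is a quick way to confirm the option was applied.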

  65. @nirev
    What to do when it happens

  66. @nirev
    What to do when it happens
    (before)

  67. @nirev
    Make it operable!

  68. @nirev
    $ iex
    Erlang/OTP 23 [erts-11.0] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [hipe]
    Interactive Elixir (1.10.4) - press Ctrl+C to exit (type h() ENTER for help)
    iex(1)> Node.list
    []

  69. @nirev
    $ iex --sname [email protected] --cookie ilovecookies
    Erlang/OTP 23 [erts-11.0] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [hipe]
    Interactive Elixir (1.10.4) - press Ctrl+C to exit (type h() ENTER for help)
    iex([email protected])1> Node.list
    []

  70. @nirev
    $ iex --sname node2 --cookie ilovecookies --remsh [email protected]
    Erlang/OTP 23 [erts-11.0] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [hipe]
    Interactive Elixir (1.10.4) - press Ctrl+C to exit (type h() ENTER for help)
    iex([email protected])1> Node.list
    [:[email protected]]

  71. @nirev
    $ iex --sname node2 --cookie ilovecookies --remsh [email protected]
    Erlang/OTP 23 [erts-11.0] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [hipe]
    Interactive Elixir (1.10.4) - press Ctrl+C to exit (type h() ENTER for help)
    iex([email protected])1> Node.list
    [:[email protected]]

  72. @nirev
    $ mix release
    $ ./path/to/release/my_app start
    $ ./path/to/release/my_app remote

  73. @nirev
    Operate!

  74. @nirev
    Native tools
    Process.list/0
    Process.info/1
    :sys.get_*
    MyModule.my_very_helpful_debug_function()
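
    A short illustration of the native tools listed above, run against a throwaway Agent. The Agent and its `%{hits: 0}` state are made up for the example; `:sys.get_state/1` and `Process.info/2` are the real calls.

    ```elixir
    {:ok, agent} = Agent.start_link(fn -> %{hits: 0} end)
    Agent.update(agent, &Map.update!(&1, :hits, fn n -> n + 1 end))

    # :sys.get_state/1 peeks at the state of any OTP-compliant process.
    state = :sys.get_state(agent)

    # Process.info/2 with a list of keys returns a keyword list.
    info = Process.info(agent, [:memory, :message_queue_len, :current_function])

    IO.inspect(state, label: "state")
    IO.inspect(info, label: "info")
    ```

    `:sys.get_state/1` is meant for debugging only, since it makes a synchronous call to the process being inspected.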

  75. @nirev
    :observer_cli.start()

  76. @nirev
    phoenix_live_dashboard

  77. @nirev
    https://ferd.github.io/recon/
    Recon is a library to be dropped into any other Erlang
    project, to be used to assist DevOps people diagnose
    problems in production nodes.
    recon

  78. @nirev
    iex(..)1> :recon.bin_leak(3)
    [
      {#PID<0.80.0>, -606,
       [current_function: {Process, :sleep, 1},
        initial_call: {:erlang, :apply, 2}]},
      {#PID<0.124.0>, -176,
       [:ssl_manager,
        {:current_function, {:gen_server, :loop, 7}},
        {:initial_call, {:proc_lib, :init_p, 5}}]},
      {#PID<0.1905.0>, -165,
       [current_function: {:ranch_conns_sup, :loop, 4},
        initial_call: {:proc_lib, :init_p, 5}]}
    ]
    recon

  79. @nirev
    Metrics
    and
    Visibility!

  80. @nirev
    vmstats: send vm metrics to statsd
    https://github.com/ferd/vmstats
    prometheus: send vm metrics to… prometheus
    https://github.com/deadtrickster/prometheus.ex
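
    Before wiring up an exporter, it can help to see what raw VM numbers are available. The sketch below takes a snapshot using only built-in calls; the `VMSnapshot` module and the choice of fields are assumptions for illustration and do not mirror any particular library's metric set.

    ```elixir
    defmodule VMSnapshot do
      # Returns a map of a few VM-level numbers worth graphing over time.
      def take do
        %{
          total_memory: :erlang.memory(:total),
          binary_memory: :erlang.memory(:binary),
          atom_count: :erlang.system_info(:atom_count),
          process_count: :erlang.system_info(:process_count),
          run_queue: :erlang.statistics(:run_queue)
        }
      end
    end

    IO.inspect(VMSnapshot.take(), label: "vm snapshot")
    ```

    Each case in this talk would show up in one of these fields: atoms in `atom_count`, leaked agents in `process_count`, and the router problem in `binary_memory`.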

  81. @nirev
    :telemetry

  82. @nirev
    create your own dashboards!

  83. @nirev
    create your own dashboards!

  84. @nirev
    Log aggregation

  85. @nirev
    Error Reporting
    Sentry
    Appsignal
    Bugsnag
    Rollbar
    ..

  86. @nirev
    Bottom line?

  87. @nirev
    BEAM is amazing

  88. @nirev
    There are several ways to
    break your systems

  89. @nirev
    Don’t go to production
    without visibility

  90. @nirev
    Please, Read Erlang in Anger

  91. Guilherme de Maio
    [email protected]
    Obrigado!
    ❤Elixir Conf EU 2020
    We are hiring!
    www.telnyx.com
    @nirev
