Let it crash - fault tolerance in Elixir/OTP

Maciej Kaszubowski

September 28, 2017

Transcript

  1. LET IT CRASH!
    Poznań Elixir Meetup #4

  2. (DON'T) LET IT CRASH!
    Poznań Elixir Meetup #4

  3. (DON'T) LET IT CRASH!
    Fault tolerance in Elixir/OTP
    Poznań Elixir Meetup #4

  4. (YOU CAN ASK
    QUESTIONS)

  5. Elixir (Erlang) features
    ‣ Concurrent
    ‣ Functional
    ‣ Immutable state
    ‣ Message passing
    ‣ Distributed
    ‣ Hot upgrades

  6. FAULT TOLERANCE

  7. Elixir (Erlang) features
    ‣ Concurrent
    ‣ Functional
    ‣ Immutable state
    ‣ Message passing
    ‣ Distributed
    ‣ Hot upgrades

  8. LET IT CRASH!

  9. Let it crash!

  10. Let it crash!
    ‣ Accept the fact that things fail
    ‣ Focus on the happy path
    ‣ Make failures more predictable

  11. Let it crash!
    ‣ Separate the logic and error handling
    ‣ When something is wrong, let the
    process crash and let another one
    handle it (e.g. by restarting)

  12. https://ferd.ca/an-open-letter-to-the-erlang-beginner-or-onlooker.html

  13. Tools
    ‣ Monitors
    ‣ Links
    ‣ Supervisors
    ‣ Heart
    ‣ Distribution

  14. Monitors
    pid_a
    ref = Process.monitor(pid_b)
    pid_b

  15. Monitors
    pid_a
    ref = Process.monitor(pid_b)
    {:DOWN, ref, :process,
    pid_b, reason}
    pid_b
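
The monitor flow on this slide can be sketched as a small Elixir script. This is a minimal illustration rather than code from the talk: the calling process plays the role of pid_a, and the `:stop` message and the `:boom` exit reason are assumptions made up for the example.

```elixir
# pid_b waits for a (hypothetical) :stop message and then exits abnormally.
pid_b =
  spawn(fn ->
    receive do
      :stop -> exit(:boom)
    end
  end)

# The calling process (pid_a) starts monitoring pid_b.
ref = Process.monitor(pid_b)
send(pid_b, :stop)

# The monitor delivers a :DOWN message; pid_a itself keeps running.
result =
  receive do
    {:DOWN, ^ref, :process, ^pid_b, reason} -> {:down, reason}
  after
    1_000 -> :timeout
  end

IO.inspect(result)
```

A monitor is one-directional: pid_a is only notified about pid_b's exit and is not affected by it.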

  16. Links
    pid_a
    Process.link(pid_b)
    pid_b

  17. Links
    pid_a
    pid_b
    Process.link(pid_b)

  18. Links
    pid_b
    Process.link(pid_b)
    Process.flag(:trap_exit, true)
    pid_a {:EXIT, from, reason}
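
A minimal sketch of a link combined with `:trap_exit`, runnable as an Elixir script. As before, the calling process stands in for pid_a, and the `:stop` message and `:boom` reason are illustrative assumptions.

```elixir
# Trapping exits converts exit signals from linked processes into messages.
Process.flag(:trap_exit, true)

# pid_b waits for a (hypothetical) :stop message and then crashes.
pid_b =
  spawn(fn ->
    receive do
      :stop -> exit(:boom)
    end
  end)

# Links are bidirectional: an abnormal exit on either side reaches the other.
Process.link(pid_b)
send(pid_b, :stop)

result =
  receive do
    {:EXIT, ^pid_b, reason} -> {:exit, reason}
  after
    1_000 -> :timeout
  end

IO.inspect(result)
```

Without the `Process.flag(:trap_exit, true)` call, the same exit signal would terminate the calling process as well.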

  19. Links
    pid_b
    Process.link(pid_b)
    Process.flag(:trap_exit, true)
    pid_a

  20. Supervisors
    Worker Worker
    Supervisor

  22. Supervisors
    Worker Worker
    Supervisor
    Worker
    *New* process

  23. Supervision strategies

  24. opts = [
        name: MyApp.Supervisor
      ]
      Supervisor.start_link(children, opts)

  25. opts = [
        name: MyApp.Supervisor,
        strategy: :one_for_one
      ]
      Supervisor.start_link(children, opts)
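
The slide leaves `children` undefined. The sketch below fills it in with a single Agent so that a restart under `:one_for_one` can be observed. It assumes Elixir 1.5 or later (map child specs); the names `Demo.Supervisor` and `:counter` are hypothetical, and the short sleep is a crude way of waiting for the restart.

```elixir
children = [
  # One permanent worker: an Agent holding a counter, registered as :counter.
  %{id: :counter, start: {Agent, :start_link, [fn -> 0 end, [name: :counter]]}}
]

opts = [name: Demo.Supervisor, strategy: :one_for_one]
{:ok, _sup} = Supervisor.start_link(children, opts)

# Change the worker's state, then remember which process currently holds it.
Agent.update(:counter, &(&1 + 1))
old_pid = Process.whereis(:counter)

# Simulate a crash and wait until the old process is really gone.
ref = Process.monitor(old_pid)
Process.exit(old_pid, :kill)

receive do
  {:DOWN, ^ref, :process, ^old_pid, _reason} -> :ok
end

# Give the supervisor a moment to start the replacement.
Process.sleep(100)

new_pid = Process.whereis(:counter)
state = Agent.get(:counter, & &1)

IO.inspect({new_pid != old_pid, state})
```

The replacement is a new process with a fresh initial state (the counter is back at 0), not a resumed copy of the old one.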

  26. :one_for_one
    [Diagram: supervisor (S) with worker (W) processes, shown in successive states]

  27. :all_for_one
    [Diagram: supervisor (S) with worker (W) processes, shown in successive states]

  28. :rest_for_one
    [Diagram: supervisor (S) with worker (W) processes, shown in successive states]

  29. :simple_one_for_one
    [Diagram: supervisor (S) with worker (W) processes, shown in successive states]

  30. Heart
    ## vm.args
    ## Heartbeat management; auto-restarts VM if it
    ## dies or becomes unresponsive
    ## (Disabled by default; use with caution!)
    -heart
    -env HEART_COMMAND ~/heart_command.sh

  31. WHY RESTARTING
    WORKS

  32. Why restarting works
    ‣ Independent processes
    ‣ Clean state
    ‣ Bohrbugs vs. Heisenbugs

  33. Bohrbugs
    ‣ Repeatable
    ‣ Easy to debug
    ‣ Easy to fix
    ‣ Rare in production
    ‣ Restarting doesn't help

  34. Heisenbugs
    ‣ Unpredictable
    ‣ Hard to debug
    ‣ Hard to fix
    ‣ Frequent in production
    ‣ Restarting HELPS!

  36. Supervisors
    Worker Worker
    Supervisor
    Worker
    *New* process

  37. New process
    ‣ Clean state
    ‣ Predictable
    ‣ High chance of fixing the bug

  38. Limits
    ‣ :max_restarts (default: 3)
    ‣ :max_seconds (default: 5)

  39. Limits
    opts = [
      name: MyApp.Supervisor,
      strategy: :one_for_one,
      max_restarts: 1, max_seconds: 1
    ]
    Supervisor.start_link(children, opts)
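
What happens when these limits are exceeded can be sketched as follows. This is an illustration under assumptions, not code from the talk: the worker is a throwaway Agent registered under the hypothetical name `:limits_worker`, and the short sleeps are a crude way of letting the supervisor react.

```elixir
children = [
  %{id: :worker, start: {Agent, :start_link, [fn -> :ok end, [name: :limits_worker]]}}
]

opts = [strategy: :one_for_one, max_restarts: 1, max_seconds: 1]
{:ok, sup} = Supervisor.start_link(children, opts)

# Unlink so the supervisor's shutdown does not take this process down,
# and monitor it so the exit reason can be observed.
Process.unlink(sup)
sup_ref = Process.monitor(sup)

# Kills the current worker and waits until it is really gone.
kill_worker = fn ->
  pid = Process.whereis(:limits_worker)
  ref = Process.monitor(pid)
  Process.exit(pid, :kill)

  receive do
    {:DOWN, ^ref, :process, ^pid, _reason} -> :ok
  end

  # Crude pause so the supervisor can react before the next step.
  Process.sleep(50)
end

kill_worker.() # first crash: one restart is allowed
kill_worker.() # second crash within one second: limit exceeded

result =
  receive do
    {:DOWN, ^sup_ref, :process, ^sup, reason} -> reason
  after
    1_000 -> :still_running
  end

IO.inspect(result)
```

Once the limit is exceeded the supervisor gives up and exits itself, which passes the failure one level up the supervision tree.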

  40. Restarting
    ‣ Process
    ‣ Supervisor
    ‣ Node
    ‣ Machine

  41. Mistakes
    ‣ Poor supervision tree structure
    ‣ Not validating user params
    ‣ Not handling expected errors
    ‣ {:error, reason} tuples everywhere

  42. Mistakes
    ‣ Trying to recreate the state
    ‣ Timeouts
    ‣ Not reading libraries' source code
    ‣ Incorrect limits

  43. Expected errors
    {:ok, user} =
    Auth.authenticate(email, password)
    {:ok, user} =
    UserService.fetch_by_id(params["id"])
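
Both matches above crash on input that is entirely expected in normal operation (a wrong password, an unknown id). A minimal sketch of handling such an expected error explicitly is shown below; `Demo.Auth` is a hypothetical stand-in for the `Auth` module on the slide, and the `:invalid_credentials` reason and the error message are assumptions.

```elixir
defmodule Demo.Auth do
  # Hypothetical stand-in for the Auth module on the slide.
  def authenticate("jane@example.com", "secret"),
    do: {:ok, %{email: "jane@example.com"}}

  def authenticate(_email, _password),
    do: {:error, :invalid_credentials}
end

defmodule Demo.Login do
  # A wrong password is an expected outcome, so it is handled explicitly.
  # Any other return value is unexpected and still crashes with a
  # CaseClauseError.
  def login(email, password) do
    case Demo.Auth.authenticate(email, password) do
      {:ok, user} -> {:ok, user}
      {:error, :invalid_credentials} -> {:error, "Invalid email or password"}
    end
  end
end

IO.inspect(Demo.Login.login("jane@example.com", "secret"))
IO.inspect(Demo.Login.login("jane@example.com", "wrong"))
```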

  44. Restoring the state
    def init(_) do
      state = restore_state()
      {:ok, state}
    end
    def terminate(_reason, state) do
      save_state(state)
    end
    http://mkaszubowski.pl/2017/09/02/On-Restoring-Process-State.html

  45. Poor supervision structure

  46. Stable, long-lived, important, protected
    Short-lived, transient, can fail

  47. Incorrect limits

  48. Benefits
    ‣ Less code (= fewer bugs, easier to understand, easier to change)
    ‣ Less logic duplication
    ‣ Faster bug fixes

  49. def update_name(user, name) do
      end

  50. def update_name(user, name) do
        update(user, %{name: name})
      end

  51. def update_name(user, name) do
        case update(user, %{name: name}) do
        end
      end

  52. def update_name(user, name) do
        case update(user, %{name: name}) do
          {:ok, user} -> {:ok, user}
        end
      end

  53. def update_name(user, name) do
        case update(user, %{name: name}) do
          {:ok, user} -> {:ok, user}
          {:error, reason} -> {:error, reason}
        end
      end

  55. def update_name(user, name) do
        case update(user, %{name: name}) do
          {:ok, user} -> {:ok, user}
          {:error, reason} -> {:error, reason}
        end
      end
    ‣ Do you know how to handle reason?
    ‣ Is {:error, reason} even possible?
    ‣ Fatal or acceptable error?

  56. ‣ What is likely to happen?
    ‣ What is an acceptable error?
    ‣ What do I know how to handle?

  57. def update_name(user, name) do
        {:ok, _} = update(user, %{name: name})
      end

  58. def update_name(user, name) do
        case update(user, %{name: name}) do
          {:ok, user} ->
            {:ok, user}
          {:error, %{errors: [username: "cannot be blank"]}} ->
            {:error, :blank_username}
        end
      end
    Acceptable error

  59. def update_description(transaction, user) do
        with %{receipt: receipt} <- transaction,
             false <- is_nil(receipt),
             {:ok, %{"id" => id}} <- Poison.decode(receipt),
             {:ok, %{status: 200, body: body}} <- Adapter.update(id, user),
             {:ok, _} <- update_db_record(id, body) do
          :ok
        end
      end

  60. def update_description(transaction, user) do
        Task.Supervisor.start_child(MyApp.TaskSupervisor, fn ->
          with %{receipt: receipt} <- transaction,
               false <- is_nil(receipt),
               {:ok, %{"id" => id}} <- Poison.decode(receipt),
               {:ok, %{status: 200, body: body}} <- Adapter.update(id, user),
               {:ok, _} <- update_db_record(id, body) do
            :ok
          end
        end)
      end

  63. def update_description(transaction, user) do
        Task.Supervisor.start_child(MyApp.TaskSupervisor, fn ->
          %{receipt: receipt} = transaction
          %{"id" => transaction_id} = Poison.decode!(receipt)
          {:ok, %{body: body}} = Adapter.update(transaction_id, user)
          {:ok, _} = update_db_record(transaction_id, body)
        end)
      end

  64. Less duplicated logic

  65. def add_contact(_current_user_id, nil),
        do: {:error, :invalid_contact_id}
      def add_contact(current_user_id, contact_id) do
        params = %{user_id: current_user_id, contact_id: contact_id}
        %Contact{}
        |> Contact.changeset(params)
        |> Repo.insert()
        |> case do
          {:ok, contact} -> {:ok, contact}
          {:error, changeset} -> {:error, changeset}
        end
      end

  69. def add_contact(current_user_id, contact_id) do
        params = %{user_id: current_user_id, contact_id: contact_id}
        {:ok, _} =
          %Contact{}
          |> Contact.changeset(params)
          |> Repo.insert()
      end

  70. Faster bug fixes

  71. def handle_info(:do_work, state) do
        with {:ok, data} <- ServiceA.fetch_data(),
             {:ok, other_data} <- ServiceB.fetch_data() do
          do_some_work(data, other_data)
        end
        Process.send_after(self(), :do_work, @one_hour)
        {:noreply, state}
      end

  72. def handle_info(:do_work, state) do
        {:ok, data} = ServiceA.fetch_data()
        {:ok, other_data} = ServiceB.fetch_data()
        :ok = do_some_work(data, other_data)
        Process.send_after(self(), :do_work, @one_hour)
        {:noreply, state}
      end

  73. defmodule ServiceA do
        def fetch_data() do
          {:ok, [1, 2, 3, 4, 5]}
        end
      end
      defmodule ServiceA do
        def fetch_data() do
          [1, 2, 3, 4, 5]
        end
      end

  74. iex(4)> with {:ok, data} <- ServiceA.fetch_data, do: :ok
      [1, 2, 3, 4, 5]
      iex(6)> {:ok, data} = ServiceA.fetch_data()
      ** (MatchError) no match of right hand side value: [1, 2, 3, 4, 5]

  75. [error] GenServer Fail.Worker terminating
    ** (MatchError) no match of right hand side value: [1, 2, 3, 4, 5]
    (fail) lib/fail/worker.ex:30: Fail.Worker.handle_info/2
    (stdlib) gen_server.erl:615: :gen_server.try_dispatch/4
    (stdlib) gen_server.erl:681: :gen_server.handle_msg/5
    (stdlib) proc_lib.erl:240: :proc_lib.init_p_do_apply/3
    Last message: :do_work
    State: nil

  76. ‣ Things will fail
    ‣ Fault tolerance isn't free
    ‣ Know your tools
    ‣ Think what you can handle
    ‣ Don't try to handle every possible error
    ‣ Think about supervision structure

  77. ‣ https://ferd.ca/the-zen-of-erlang.html
    ‣ https://medium.com/@jlouis666/error-kernels-9ad991200abd
    ‣ http://jlouisramblings.blogspot.com/2010/11/on-erlang-state-and-crashes.html
    ‣ https://mazenharake.wordpress.com/2009/09/14/let-it-crash-the-right-way/
    ‣ http://blog.plataformatec.com.br/2016/05/beyond-functional-programming-with-elixir-and-erlang/
    ‣ https://mazenharake.wordpress.com/2010/10/31/9-erlang-pitfalls-you-should-know-about/ ("Returning arbitrary {error, Reason}")
    ‣ http://mkaszubowski.pl/2017/09/02/On-Restoring-Process-State.html

  78. THANK YOU!
    mkaszubowski94
    http://mkaszubowski.pl
