Let it crash - fault tolerance in Elixir/OTP

Maciej Kaszubowski

September 28, 2017

Transcript

  1. LET IT CRASH! Poznań Elixir Meetup #4

  2. (DON'T) LET IT CRASH! Poznań Elixir Meetup #4

  3. (DON'T) LET IT CRASH! Fault tolerance in Elixir/OTP Poznań Elixir Meetup #4
  4. (YOU CAN ASK QUESTIONS)

  5. Elixir (Erlang) features ‣ Concurrent ‣ Functional ‣ Immutable state ‣ Message passing ‣ Distributed ‣ Hot upgrades
  6. FAULT TOLERANCE

  7. Elixir (Erlang) features ‣ Concurrent ‣ Functional ‣ Immutable state ‣ Message passing ‣ Distributed ‣ Hot upgrades
  8. LET IT CRASH!

  9. Let it crash!

  10. Let it crash! ‣ Accept the fact that things fail ‣ Focus on the happy path ‣ Make failures more predictable

  11. Let it crash! ‣ Separate the logic and error handling ‣ When something is wrong, let the process crash and let another one handle it (e.g. by restarting)
  12. https://ferd.ca/an-open-letter-to-the-erlang-beginner-or-onlooker.html

  13. THE TOOLS

  14. Tools ‣ Monitors ‣ Links ‣ Supervisors ‣ Heart ‣ Distribution
  15. Monitors: pid_a monitors pid_b with ref = Process.monitor(pid_b)

  16. Monitors: when pid_b exits, pid_a receives {:DOWN, ref, :process, pid_b, reason}
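
A minimal sketch of the monitor flow from the two slides above. The module name MonitorDemo and the :go handshake are assumptions made for illustration; the handshake only ensures the monitor is in place before pid_b exits.

```elixir
defmodule MonitorDemo do
  # Spawns pid_b, monitors it, lets it exit with `reason`, and returns the
  # reason carried by the :DOWN message.
  def watch(reason) do
    # pid_b waits for :go so that the monitor exists before it exits.
    pid_b =
      spawn(fn ->
        receive do
          :go -> exit(reason)
        end
      end)

    ref = Process.monitor(pid_b)
    send(pid_b, :go)

    receive do
      {:DOWN, ^ref, :process, ^pid_b, down_reason} -> {:down, down_reason}
    after
      1_000 -> :timeout
    end
  end
end
```

Calling MonitorDemo.watch(:boom) from iex should hand back the exit reason without affecting the calling process, which is the main difference from a link.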
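  17. Links: pid_a links itself to pid_b with Process.link(pid_b)

  18. Links: the link is bidirectional, so when one of the linked processes crashes, the other one exits too

  19. Links: if pid_a calls Process.flag(:trap_exit, true), the exit of pid_b arrives as a message {:EXIT, from, reason} instead

  20. Links: with Process.flag(:trap_exit, true) set, pid_a keeps running after pid_b exits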
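
A small sketch of linking with trap_exit, under the assumption that pid_a runs in a throwaway process so the trap_exit flag does not leak into the caller. LinkDemo and the :go handshake are hypothetical names, not part of the talk.

```elixir
defmodule LinkDemo do
  # Returns {:exit, reason} once pid_a has seen pid_b's exit as a message.
  def trap_and_observe(reason) do
    caller = self()

    # pid_a runs in its own process so that trapping exits does not change
    # the behaviour of the calling process.
    spawn(fn ->
      Process.flag(:trap_exit, true)

      # pid_b waits for :go so that the link exists before it exits.
      pid_b =
        spawn(fn ->
          receive do
            :go -> exit(reason)
          end
        end)

      Process.link(pid_b)
      send(pid_b, :go)

      receive do
        {:EXIT, ^pid_b, exit_reason} -> send(caller, {:observed, exit_reason})
      after
        1_000 -> send(caller, :timeout)
      end
    end)

    receive do
      {:observed, exit_reason} -> {:exit, exit_reason}
      :timeout -> :timeout
    after
      2_000 -> :timeout
    end
  end
end
```

Without the Process.flag(:trap_exit, true) call, an abnormal exit of pid_b would take pid_a down as well instead of producing a message.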

  21. Supervisors Worker Worker Supervisor

  22. Supervisors Worker Worker Supervisor

  23. Supervisors: the Supervisor starts a *new* Worker process in place of the one that went down

  24. Supervision strategies

  25. opts = [ name: MyApp.Supervisor, ] Supervisor.start_link(children, opts)

  26. opts = [ name: MyApp.Supervisor, strategy: :one_for_one ] Supervisor.start_link(children, opts)

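  27. :one_for_one: only the worker (W) that crashed is restarted by the supervisor (S)

  28. :all_for_one: when one worker crashes, all workers under the supervisor are restarted

  29. :rest_for_one: when one worker crashes, it and the workers started after it are restarted

  30. :simple_one_for_one: like :one_for_one, for children of one type that are started dynamically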
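
One way to see the difference between the strategies is to kill the middle worker of three and check which pids changed. This is only a sketch: StrategyDemo, the worker names :w1..:w3 and the use of plain Agents as workers are assumptions, and :simple_one_for_one is left out.

```elixir
defmodule StrategyDemo do
  @names [:w1, :w2, :w3]

  # Starts three named Agents under `strategy`, kills :w2 and returns the
  # names of the workers that ended up with a new pid.
  def restarted_after_killing_w2(strategy) do
    children =
      for name <- @names do
        %{id: name, start: {Agent, :start_link, [fn -> nil end, [name: name]]}}
      end

    {:ok, sup} = Supervisor.start_link(children, strategy: strategy)
    before = pids(sup)

    Process.exit(before[:w2], :kill)
    after_restart = wait_for_restart(sup, before[:w2])

    Supervisor.stop(sup)
    for name <- @names, before[name] != after_restart[name], do: name
  end

  # Map of child id => pid as currently known by the supervisor.
  defp pids(sup) do
    Map.new(Supervisor.which_children(sup), fn {id, pid, _type, _modules} -> {id, pid} end)
  end

  # Polls until the supervisor reports a new pid for :w2.
  defp wait_for_restart(sup, old_w2) do
    current = pids(sup)

    if is_pid(current[:w2]) and current[:w2] != old_w2 do
      current
    else
      Process.sleep(10)
      wait_for_restart(sup, old_w2)
    end
  end
end
```

With :one_for_one only :w2 should be reported, with :rest_for_one :w2 and :w3, and with :all_for_one all three workers.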

  31. Heart
      ## vm.args
      ## Heartbeat management; auto-restarts VM if it dies or becomes unresponsive
      ## (Disabled by default, use with caution!)
      -heart
      -env HEART_COMMAND ~/heart_command.sh
  32. WHY RESTARTING WORKS

  33. Why restarting works ‣ Independent processes ‣ Clean state ‣ Bohrbugs vs. Heisenbugs

  34. Bohrbugs ‣ Repeatable ‣ Easy to debug ‣ Easy to fix ‣ Rare in production ‣ Restarting doesn't help

  35. Heisenbugs ‣ Unpredictable ‣ Hard to debug ‣ Hard to fix ‣ Frequent in production ‣ Restarting HELPS!
  37. Supervisors: the Supervisor starts a *new* Worker process in place of the one that went down

  38. New process ‣ Clean state ‣ Predictable ‣ High chance of fixing the bug
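
A sketch of the "clean state" point: a counter worker is bumped, killed, and restarted by its supervisor, after which the counter is back at its initial value. CleanStateDemo and its Counter worker are hypothetical names introduced for this example.

```elixir
defmodule CleanStateDemo.Counter do
  use GenServer

  def start_link(_arg), do: GenServer.start_link(__MODULE__, 0, name: __MODULE__)
  def increment, do: GenServer.call(__MODULE__, :increment)
  def value, do: GenServer.call(__MODULE__, :value)

  @impl true
  def init(initial), do: {:ok, initial}

  @impl true
  def handle_call(:increment, _from, count), do: {:reply, count + 1, count + 1}
  def handle_call(:value, _from, count), do: {:reply, count, count}
end

defmodule CleanStateDemo do
  alias CleanStateDemo.Counter

  # Returns the counter value before the crash and after the restart.
  def run do
    {:ok, sup} = Supervisor.start_link([Counter], strategy: :one_for_one)

    Counter.increment()
    Counter.increment()
    before = Counter.value()

    old_pid = Process.whereis(Counter)
    Process.exit(old_pid, :kill)
    new_pid = wait_for_new_pid(old_pid)
    after_restart = Counter.value()

    Supervisor.stop(sup)
    %{before: before, after_restart: after_restart, new_process?: new_pid != old_pid}
  end

  # Polls until the name is registered to a different pid.
  defp wait_for_new_pid(old_pid) do
    case Process.whereis(Counter) do
      pid when is_pid(pid) and pid != old_pid ->
        pid

      _ ->
        Process.sleep(10)
        wait_for_new_pid(old_pid)
    end
  end
end
```

The restarted worker runs init/1 again, so whatever state led to the crash is gone along with the old process.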
  39. LIMITS

  40. Limits ‣ :max_restarts (default: 3) ‣ :max_seconds (default: 5)

  41. Limits
      opts = [
        name: MyApp.Supervisor,
        strategy: :one_for_one,
        max_restarts: 1,
        max_seconds: 1
      ]
      Supervisor.start_link(children, opts)
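
To illustrate what happens when the limits are exceeded, here is a sketch with a child that exits immediately after every start. LimitsDemo and start_crasher are hypothetical; the supervisor is started from a helper process that traps exits so that the calling process survives the supervisor giving up.

```elixir
defmodule LimitsDemo do
  # Child start function: the child exits right away, so it needs a restart
  # immediately after every start.
  def start_crasher do
    {:ok, spawn_link(fn -> exit(:boom) end)}
  end

  # Returns the reason the supervisor exits with once the limit is exceeded.
  def supervisor_exit_reason do
    caller = self()

    spawn(fn ->
      # Trap exits so the supervisor's exit arrives here as a message.
      Process.flag(:trap_exit, true)

      crasher = %{id: :crasher, start: {__MODULE__, :start_crasher, []}}
      opts = [strategy: :one_for_one, max_restarts: 1, max_seconds: 1]
      {:ok, sup} = Supervisor.start_link([crasher], opts)

      receive do
        {:EXIT, ^sup, reason} -> send(caller, {:supervisor_exit, reason})
      after
        5_000 -> send(caller, :still_running)
      end
    end)

    receive do
      {:supervisor_exit, reason} -> reason
      :still_running -> :still_running
    after
      6_000 -> :timeout
    end
  end
end
```

With max_restarts: 1 the first restart is allowed and the second one within the same second makes the supervisor itself exit, which hands the problem to the next level of the tree.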
  42. Restarting ‣ Process ‣ Supervisor ‣ Node ‣ Machine

  43. MISTAKES

  44. Mistakes ‣ Poor supervision tree structure ‣ Not validating user params ‣ Not handling expected errors ‣ {:error, reason} tuples everywhere

  45. Mistakes ‣ Trying to recreate the state ‣ Timeouts ‣ Not reading libraries' source code ‣ Incorrect limits
  46. Expected errors
      {:ok, user} = Auth.authenticate(email, password)
      {:ok, user} = UserService.fetch_by_id(params["id"])
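
Reading the slide above as an example of the mistake (a failed login is an expected outcome, not a reason to crash), a hedged sketch of the alternative could look like this. The Auth module here is a hypothetical stand-in with made-up credentials and return values; only the shape of the case expression matters.

```elixir
defmodule Auth do
  # Hypothetical stand-in for the Auth module named on the slide.
  def authenticate("user@example.com", "secret"), do: {:ok, %{email: "user@example.com"}}
  def authenticate(_email, _password), do: {:error, :invalid_credentials}
end

defmodule Login do
  # Wrong credentials are expected, so they are handled explicitly.
  # Any other return value is unexpected and still crashes (CaseClauseError).
  def login(email, password) do
    case Auth.authenticate(email, password) do
      {:ok, user} -> {:ok, user}
      {:error, :invalid_credentials} -> {:error, :unauthorized}
    end
  end
end
```

Only the error that the caller knows how to handle gets a clause; everything else is left to the supervisor.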
  47. Restoring the state
      def init(_) do
        state = restore_state()
        {:ok, state}
      end
      def terminate(_reason, state) do
        save_state(state)
      end
      http://mkaszubowski.pl/2017/09/02/On-Restoring-Process-State.html
  48. Poor supervision structure

  49. Stable, long-lived, important, protected ‣ Short-lived, transient, can fail

  50. Incorrect limits

  51. DEMO!

  52. BENEFITS

  53. Benefits ‣ Less code (= fewer bugs, easier to understand, easier to change) ‣ Less logic duplication ‣ Faster bug fixes
  54. Less code

  55. def update_name(user, name) do end

  56. def update_name(user, name) do update(user, %{name: name}) end

  57. def update_name(user, name) do
        case update(user, %{name: name}) do
        end
      end

  58. def update_name(user, name) do
        case update(user, %{name: name}) do
          {:ok, user} -> {:ok, user}
        end
      end

  59. def update_name(user, name) do
        case update(user, %{name: name}) do
          {:ok, user} -> {:ok, user}
          {:error, reason} -> {:error, reason}
        end
      end
  61. def update_name(user, name) do
        case update(user, %{name: name}) do
          {:ok, user} -> {:ok, user}
          {:error, reason} -> {:error, reason}
        end
      end
      ‣ Do you know how to handle reason? ‣ Is {:error, reason} even possible? ‣ Fatal or acceptable error?

  62. ‣ What is likely to happen? ‣ What is an acceptable error? ‣ What do I know how to handle?
  63. def update_name(user, name) do
        {:ok, _} = update(user, %{name: name})
      end
  64. Acceptable error
      def update_name(user, name) do
        case update(user, %{name: name}) do
          {:ok, user} -> {:ok, user}
          {:error, %{errors: [username: "cannot be blank"]}} -> {:error, :blank_username}
        end
      end
  66. def update_description(transaction, user) do
        with %{receipt: receipt} <- transaction,
             false <- is_nil(receipt),
             {:ok, %{"id" => id}} <- Poison.decode(receipt),
             {:ok, %{status: 200, body: body}} <- Adapter.update(id, user),
             {:ok, _} <- update_db_record(id, body) do
          :ok
        end
      end
  67. def update_description(transaction, user) do
        Task.Supervisor.start_child(MyApp.TaskSupervisor, fn ->
          with %{receipt: receipt} <- transaction,
               false <- is_nil(receipt),
               {:ok, %{"id" => id}} <- Poison.decode(receipt),
               {:ok, %{status: 200, body: body}} <- Adapter.update(id, user),
               {:ok, _} <- update_db_record(id, body) do
            :ok
          end
        end)
      end
  70. def update_description(transaction, user) do
        Task.Supervisor.start_child(MyApp.TaskSupervisor, fn ->
          %{"id" => transaction_id} = Poison.decode!(receipt)
          {:ok, %{body: body}} = Adapter.update(transaction_id, user)
          {:ok, _} = update_db_record(transaction_id, body)
        end)
      end
  71. Less duplicated logic

  72. def add_contact(current_user_id, nil), do: {:error, :invalid_contact_id}
      def add_contact(current_user_id, contact_id) do
        params = %{user_id: current_user_id, contact_id: contact_id}
        %Contact{}
        |> Contact.Changeset(params)
        |> Repo.insert()
        |> case do
          {:ok, contact} -> {:ok, contact}
          {:error, changeset} -> {:error, changeset}
        end
      end
  76. def add_contact(current_user_id, contact_id) do
        params = %{user_id: current_user_id, contact_id: contact_id}
        {:ok, _} = %Contact{} |> Contact.Changeset(params) |> Repo.insert()
      end
  77. Faster bug fixes

  78. def handle_info(:do_work, state) do
        with {:ok, data} <- ServiceA.fetch_data(),
             {:ok, other_data} <- ServiceB.fetch_data() do
          do_some_work(data, other_data)
        end
        Process.send_after(self(), :do_work, @one_hour)
        {:noreply, state}
      end

  79. def handle_info(:do_work, state) do
        {:ok, data} = ServiceA.fetch_data()
        {:ok, other_data} = ServiceB.fetch_data()
        :ok = do_some_work(data, other_data)
        Process.send_after(self(), :do_work, @one_hour)
        {:noreply, state}
      end
  80. defmodule ServiceA do
        def fetch_data() do
          {:ok, [1, 2, 3, 4, 5]}
        end
      end
      defmodule ServiceA do
        def fetch_data() do
          [1, 2, 3, 4, 5]
        end
      end
  81. iex(4)> with {:ok, data} <- ServiceA.fetch_data, do: :ok
      [1, 2, 3, 4, 5]
      iex(6)> {:ok, data} = ServiceA.fetch_data()
      ** (MatchError) no match of right hand side value: [1, 2, 3, 4, 5]
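
The same contrast as a small self-contained sketch. WithVsMatch is a hypothetical module; the result is passed in as an argument to stand for whatever ServiceA.fetch_data() returns after its return shape changed.

```elixir
defmodule WithVsMatch do
  # `with` without an else block returns the non-matching value as is,
  # so the changed return shape goes unnoticed.
  def with_version(result) do
    with {:ok, _data} <- result, do: :ok
  end

  # A plain match raises MatchError at the exact line where the
  # assumption about the return shape stopped being true.
  def match_version(result) do
    {:ok, data} = result
    data
  end
end
```

Passing [1, 2, 3, 4, 5] to with_version/1 quietly returns the list, while match_version/1 raises a MatchError that points at the offending line.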
  82. [error] GenServer Fail.Worker terminating
      ** (MatchError) no match of right hand side value: [1, 2, 3, 4, 5]
          (fail) lib/fail/worker.ex:30: Fail.Worker.handle_info/2
          (stdlib) gen_server.erl:615: :gen_server.try_dispatch/4
          (stdlib) gen_server.erl:681: :gen_server.handle_msg/5
          (stdlib) proc_lib.erl:240: :proc_lib.init_p_do_apply/3
      Last message: :do_work
      State: nil
  83. SUMMARY

  84. ‣ Things will fail ‣ Fault tolerance isn't free ‣ Know your tools ‣ Think what you can handle ‣ Don't try to handle every possible error ‣ Think about supervision structure
  85. ‣ https://ferd.ca/the-zen-of-erlang.html
      ‣ https://medium.com/@jlouis666/error-kernels-9ad991200abd
      ‣ http://jlouisramblings.blogspot.com/2010/11/on-erlang-state-and-crashes.html
      ‣ https://mazenharake.wordpress.com/2009/09/14/let-it-crash-the-right-way/
      ‣ http://blog.plataformatec.com.br/2016/05/beyond-functional-programming-with-elixir-and-erlang/
      ‣ https://mazenharake.wordpress.com/2010/10/31/9-erlang-pitfalls-you-should-know-about/ ("Returning arbitrary {error, Reason}")
      ‣ http://mkaszubowski.pl/2017/09/02/On-Restoring-Process-State.html
  86. THANK YOU! mkaszubowski94 http://mkaszubowski.pl