Let it crash - fault tolerance in Elixir/OTP

Maciej Kaszubowski

September 28, 2017

Transcript

  1. Elixir (Erlang) features
    ‣ Concurrent
    ‣ Functional
    ‣ Immutable state
    ‣ Message passing
    ‣ Distributed
    ‣ Hot upgrades
  3. Let it crash!
    ‣ Accept the fact that things fail
    ‣ Focus on the happy path
    ‣ Make failures more predictable
  4. Let it crash!
    ‣ Separate the logic and error handling
    ‣ When something is wrong, let the process crash and let another one handle it (e.g. by restarting)
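
The restart idea on this slide can be sketched with a supervised GenServer. This is a minimal illustration, not code from the talk: `Demo.Counter`, its `crash/0` function and the `await_restart/1` helper are hypothetical names introduced here, and the worker crashes on purpose so the supervisor has something to restart.

```elixir
defmodule Demo.Counter do
  # Hypothetical worker (not from the talk): keeps a counter in memory
  # and can be told to crash, so a supervisor has something to restart.
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, 0, name: __MODULE__)
  def increment, do: GenServer.call(__MODULE__, :increment)
  def crash, do: GenServer.cast(__MODULE__, :crash)

  # Polls until the supervisor has registered a replacement process.
  def await_restart(old_pid) do
    case Process.whereis(__MODULE__) do
      pid when is_pid(pid) and pid != old_pid ->
        pid

      _ ->
        Process.sleep(10)
        await_restart(old_pid)
    end
  end

  @impl true
  def init(count), do: {:ok, count}

  @impl true
  def handle_call(:increment, _from, count), do: {:reply, count + 1, count + 1}

  # No error handling here: the process simply dies and the supervisor
  # starts a fresh one with the initial state.
  @impl true
  def handle_cast(:crash, _count), do: raise("boom")
end
```

Assuming the module is started with `Supervisor.start_link([Demo.Counter], strategy: :one_for_one)`, calling `Demo.Counter.crash()` kills the worker, and after `Demo.Counter.await_restart/1` returns, the next `Demo.Counter.increment()` starts counting from the initial state again. The worker contains only the happy path; recovery lives entirely in the supervisor.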
  5. Heart
      ## vm.args
      ## Heartbeat management; auto-restarts VM if it dies or becomes unresponsive
      ## (Disabled by default; use with caution!)
      -heart
      -env HEART_COMMAND ~/heart_command.sh
  6. Bohrbugs
    ‣ Repeatable
    ‣ Easy to debug
    ‣ Easy to fix
    ‣ Rare in production
    ‣ Restarting doesn't help
  7. Heisenbugs
    ‣ Unpredictable
    ‣ Hard to debug
    ‣ Hard to fix
    ‣ Frequent in production
    ‣ Restarting HELPS!
  9. Mistakes
    ‣ Poor supervision tree structure
    ‣ Not validating user params
    ‣ Not handling expected errors
    ‣ {:error, reason} tuples everywhere
  10. Mistakes
    ‣ Trying to recreate the state
    ‣ Timeouts
    ‣ Not reading libraries' source code
    ‣ Incorrect limits
  11. Restoring the state
      def init(_) do
        state = restore_state()
        {:ok, state}
      end

      def terminate(_reason, state) do
        save_state(state)
      end
    http://mkaszubowski.pl/2017/09/02/On-Restoring-Process-State.html
  12. Benefits
    ‣ Less code (= fewer bugs, easier to understand, easier to change)
    ‣ Less logic duplication
    ‣ Faster bug fixes
  15. def update_name(user, name) do
        case update(user, %{name: name}) do
          {:ok, user} -> {:ok, user}
          {:error, reason} -> {:error, reason}
        end
      end
    ‣ Do you know how to handle reason?
    ‣ Is {:error, reason} even possible?
    ‣ Fatal or acceptable error?
  16. ‣ What is likely to happen?
    ‣ What is an acceptable error?
    ‣ What do I know how to handle?
  17. Acceptable error
      def update_name(user, name) do
        case update(user, %{name: name}) do
          {:ok, user} -> {:ok, user}
          {:error, %{errors: [username: "cannot be blank"]}} -> {:error, :blank_username}
        end
      end
  18. def update_description(transaction, user) do
        with %{receipt: receipt} <- transaction,
             false <- is_nil(receipt),
             {:ok, %{"id" => id}} <- Poison.decode(receipt),
             {:ok, %{status: 200, body: body}} <- Adapter.update(id, user),
             {:ok, _} <- update_db_record(id, body) do
          :ok
        end
      end
  19. def update_description(transaction, user) do
        Task.Supervisor.start_child(MyApp.TaskSupervisor, fn ->
          with %{receipt: receipt} <- transaction,
               false <- is_nil(receipt),
               {:ok, %{"id" => id}} <- Poison.decode(receipt),
               {:ok, %{status: 200, body: body}} <- Adapter.update(id, user),
               {:ok, _} <- update_db_record(id, body) do
            :ok
          end
        end)
      end
  22. def update_description(transaction, user) do
        Task.Supervisor.start_child(MyApp.TaskSupervisor, fn ->
          %{receipt: receipt} = transaction
          %{"id" => transaction_id} = Poison.decode!(receipt)
          {:ok, %{body: body}} = Adapter.update(transaction_id, user)
          {:ok, _} = update_db_record(transaction_id, body)
        end)
      end
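
The supervised-task pattern on these slides can be tried in isolation. The sketch below is an assumption-laden stand-in, not the talk's code: `Demo.Receipts` is hypothetical, the receipt is passed in already decoded so no JSON library is needed, and the `reply_to` argument exists only so the outcome can be observed from the calling process.

```elixir
defmodule Demo.Receipts do
  # Hypothetical stand-in for the slide's decode/update steps. The task
  # asserts the shape it expects and reports back only on success.
  def update_description(supervisor, receipt, reply_to) do
    Task.Supervisor.start_child(supervisor, fn ->
      # A receipt without an "id" key fails this match and crashes the
      # task. Neither the caller nor the task supervisor is affected.
      %{"id" => id} = receipt
      send(reply_to, {:updated, id})
    end)
  end
end
```

Given a supervisor started with `Task.Supervisor.start_link()`, a well-formed receipt results in an `{:updated, id}` message to `reply_to`, while a malformed one only terminates its own task (and logs the crash). Later calls keep working because each task is an isolated process under the supervisor.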
  23. def add_contact(_current_user_id, nil), do: {:error, :invalid_contact_id}
      def add_contact(current_user_id, contact_id) do
        params = %{user_id: current_user_id, contact_id: contact_id}
        %Contact{}
        |> Contact.changeset(params)
        |> Repo.insert()
        |> case do
          {:ok, contact} -> {:ok, contact}
          {:error, changeset} -> {:error, changeset}
        end
      end
  27. def add_contact(current_user_id, contact_id) do
        params = %{user_id: current_user_id, contact_id: contact_id}
        {:ok, _} =
          %Contact{}
          |> Contact.changeset(params)
          |> Repo.insert()
      end
  28. def handle_info(:do_work, state) do
        with {:ok, data} <- ServiceA.fetch_data(),
             {:ok, other_data} <- ServiceB.fetch_data() do
          do_some_work(data, other_data)
        end
        Process.send_after(self(), :do_work, @one_hour)
        {:noreply, state}
      end
  29. def handle_info(:do_work, state) do
        {:ok, data} = ServiceA.fetch_data()
        {:ok, other_data} = ServiceB.fetch_data()
        :ok = do_some_work(data, other_data)
        Process.send_after(self(), :do_work, @one_hour)
        {:noreply, state}
      end
  30. defmodule ServiceA do
        def fetch_data() do
          {:ok, [1, 2, 3, 4, 5]}
        end
      end

      defmodule ServiceA do
        def fetch_data() do
          [1, 2, 3, 4, 5]
        end
      end
  31. iex(4)> with {:ok, data} <- ServiceA.fetch_data, do: :ok
      [1, 2, 3, 4, 5]
      iex(6)> {:ok, data} = ServiceA.fetch_data()
      ** (MatchError) no match of right hand side value: [1, 2, 3, 4, 5]
  32. [error] GenServer Fail.Worker terminating
      ** (MatchError) no match of right hand side value: [1, 2, 3, 4, 5]
          (fail) lib/fail/worker.ex:30: Fail.Worker.handle_info/2
          (stdlib) gen_server.erl:615: :gen_server.try_dispatch/4
          (stdlib) gen_server.erl:681: :gen_server.handle_msg/5
          (stdlib) proc_lib.erl:240: :proc_lib.init_p_do_apply/3
      Last message: :do_work
      State: nil
  33. ‣ Things will fail
    ‣ Fault tolerance isn't free
    ‣ Know your tools
    ‣ Think what you can handle
    ‣ Don't try to handle every possible error
    ‣ Think about supervision structure
  34. ‣ https://ferd.ca/the-zen-of-erlang.html
    ‣ https://medium.com/@jlouis666/error-kernels-9ad991200abd
    ‣ http://jlouisramblings.blogspot.com/2010/11/on-erlang-state-and-crashes.html
    ‣ https://mazenharake.wordpress.com/2009/09/14/let-it-crash-the-right-way/
    ‣ http://blog.plataformatec.com.br/2016/05/beyond-functional-programming-with-elixir-and-erlang/
    ‣ https://mazenharake.wordpress.com/2010/10/31/9-erlang-pitfalls-you-should-know-about/ ("Returning arbitrary {error, Reason}")
    ‣ http://mkaszubowski.pl/2017/09/02/On-Restoring-Process-State.html