Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Let it crash - fault tolerance in Elixir/OTP
Search
Maciej Kaszubowski
September 28, 2017
Programming
0
490
Let it crash - fault tolerance in Elixir/OTP
Maciej Kaszubowski
September 28, 2017
Tweet
Share
More Decks by Maciej Kaszubowski
See All by Maciej Kaszubowski
Error-free Elixir
mkaszubowski
0
390
Modular Design in Elixir (ElixirConf EU 2019)
mkaszubowski
2
790
The Big Ball of Nouns
mkaszubowski
0
110
Modular Design in Elixir
mkaszubowski
1
390
Our three years with Elixir
mkaszubowski
0
250
Concurrency Basics for Elixir
mkaszubowski
0
130
Distributed Elixir
mkaszubowski
0
160
Software Architecture
mkaszubowski
0
140
CRDTs - The science behind Phoenix Presence
mkaszubowski
2
270
Other Decks in Programming
See All in Programming
釣り地図SNSにおける有料機能の実装
nokonoko1203
0
200
Towards Transactional Buffering of CDC Events @ Flink Forward 2025 Barcelona Spain
hpgrahsl
0
120
The Past, Present, and Future of Enterprise Java
ivargrimstad
0
520
O Que É e Como Funciona o PHP-FPM?
marcelgsantos
0
210
Google Opalで使える37のライブラリ
mickey_kubo
3
150
Android16 Migration Stories ~Building a Pattern for Android OS upgrades~
reoandroider
0
140
One Enishi After Another
snoozer05
PRO
0
160
Amazon Verified Permissions実践入門 〜Cedar活用とAppSync導入事例/Practical Introduction to Amazon Verified Permissions
fossamagna
2
100
マンガアプリViewerの大画面対応を考える
kk__777
0
370
contribution to astral-sh/uv
shunsock
0
550
CSC305 Lecture 09
javiergs
PRO
0
320
Reactive Thinking with Signals and the Resource API
manfredsteyer
PRO
0
120
Featured
See All Featured
Measuring & Analyzing Core Web Vitals
bluesmoon
9
640
Save Time (by Creating Custom Rails Generators)
garrettdimon
PRO
32
1.7k
Leading Effective Engineering Teams in the AI Era
addyosmani
7
650
Bootstrapping a Software Product
garrettdimon
PRO
307
110k
Building Adaptive Systems
keathley
44
2.8k
Statistics for Hackers
jakevdp
799
220k
GitHub's CSS Performance
jonrohan
1032
470k
What’s in a name? Adding method to the madness
productmarketing
PRO
24
3.7k
How to train your dragon (web standard)
notwaldorf
97
6.3k
Designing Experiences People Love
moore
142
24k
Product Roadmaps are Hard
iamctodd
PRO
55
11k
Mobile First: as difficult as doing things right
swwweet
225
10k
Transcript
LET IT CRASH! Poznań Elixir Metup #4
(DON'T) LET IT CRASH! Poznań Elixir Metup #4
(DON'T) LET IT CRASH! Fault tolerance in Elixir/OTP Poznań Elixir
Metup #4
(YOU CAN ASK QUESTIONS)
Elixir (Erlang) features ‣ Concurrent ‣ Functional ‣ Immutable state
‣ Message passing ‣ Distributed ‣ Hot upgrades
FAULT TOLERANCE
Elixir (Erlang) features ‣ Concurrent ‣ Functional ‣ Immutable state
‣ Message passing ‣ Distributed ‣ Hot upgrades
LET IT CRASH!
Let it crash!
Let it crash! ‣ Accept the fact that things fail
‣ Focus on the happy path ‣ Make failures more predictable
Let it crash! ‣ Separate the logic and error handling
‣ When something is wrong, let the process crash and let another one handle it (e.g. by restarting)
https://ferd.ca/an-open-letter-to-the-erlang-beginner-or-onlooker.html
THE TOOLS
Tools ‣ Monitors ‣ Links ‣ Supervisors ‣ Heart ‣
Distribution
Monitors pid_a ref = Process.monitor(pid_b) pid_b
Monitors pid_a ref = Process.monitor(pid_b) {:DOWN, ref, :process, pid_b, reason}
pid_b
Links pid_a Process.link(pid_b) pid_b
Links pid_a pid_b Process.link(pid_b)
Links pid_b Process.link(pid_b) Process.flag(:trap_exit, true) pid_a {:EXIT, from, reason}
Links pid_b Process.link(pid_b) Process.flag(:trap_exit, true) pid_a
Supervisors Worker Worker Supervisor
Supervisors Worker Worker Supervisor
Supervisors Worker Worker Supervisor Worker *New* process
Supervision strategies
opts = [ name: MyApp.Supervisor, ] Supervisor.start_link(children, opts)
opts = [ name: MyApp.Supervisor, strategy: :one_for_one ] Supervisor.start_link(children, opts)
:one_for_one W S W W S W W S W
:all_for_one W S W W S W W S W
W S W
:rest_for_one W S W W W S W W W
S W W W S W W
:simple_one_for_one W S W W S W W S W
Heart ## vm.args ## Heartbeat management; auto-restarts VM if it
##dies or becomes unresponsive ## (Disabled by default use with caution!) -heart -env HEART_COMMAND ~/heart_command.sh
WHY RESTARTING WORKS
Why restarting works ‣ Independent processes ‣ Clean state ‣
Bohrbugs vs. Heisenbugs
Bohrbugs ‣ Repeatable ‣ Easy to debug ‣ Easy to
fix ‣ Rare in production ‣ Restarting doesn't help
Heisenbugs ‣ Unpredictable ‣ Hard to debug ‣ Hard to
fix ‣ Frequent in production ‣ Restarting HELPS!
Heisenbugs ‣ Unpredictable ‣ Hard to debug ‣ Hard to
fix ‣ Frequent in production ‣ Restarting HELPS!
Supervisors Worker Worker Supervisor Worker *New* process
New process ‣ Clean state ‣ Predictable ‣ High chance
of fixing the bug
LIMITS
Limits ‣ :max_restarts (default: 3) ‣ :max_seconds (default: 5)
opts = [ name: MyApp.Supervisor, strategy: :one_for_one, max_restarts: 1, max_seconds:
1 ] Supervisor.start_link(children, opts) Limits
‣ Process ‣ Supervisor ‣ Node ‣ Machine Restarting
MISTAKES
‣ Poor supervision tree structure ‣ Not validating user params
‣ Not handling expected errors ‣ {:error, reason} tuples everywhere Mistakes
‣ Trying to recreate the state ‣ Timeouts ‣ Not
reading libraries source code ‣ Incorrect limits Mistakes
Expected errors {:ok, user} = Auth.authenticate(email, password) {:ok, user} =
UserService.fetch_by_id(params["id"])
Restoring the state def init(_) do state = restore_state() {:ok,
state} end def terminate(_reason, state) do save_state(state) end http://mkaszubowski.pl/2017/09/02/On-Restoring-Process-State.html
Poor supervision structure
Stable, long-lived, important, protected Short-lived, transient, can fail
Incorrect limits
DEMO!
BENEFITS
‣ Less code (= less bugs, easier to understand, easier
to change) ‣ Less logic duplication ‣ Faster bug fixes Benefits
Less code
def update_name(user, name) do end
def update_name(user, name) do update(user, %{name: name}) end
def update_name(user, name) do case update(user, %{name: name}) do end
end
def update_name(user, name) do case update(user, %{name: name}) do {:ok,
user}  {:ok, user} end end
def update_name(user, name) do case update(user, %{name: name}) do {:ok,
user}  {:ok, user} {:error, reason}  {:error, reason} end end
def update_name(user, name) do case update(user, %{name: name}) do {:ok,
user}  {:ok, user} {:error, reason}  {:error, reason} end end
def update_name(user, name) do case update(user, %{name: name}) do {:ok,
user}  {:ok, user} {:error, reason}  {:error, reason} end end ‣ Do you know how to handle reason? ‣ Is {:error, reason} even possible? ‣ Fatal or acceptable error?
‣ What is likely to happen? ‣ What is an
acceptable error? ‣ What do I know how to handle?
def update_name(user, name) do {:ok, _} = update(user, %{name: name})
do end
def update_name(user, name) do case update(user, %{name: name}) do {:ok,
user}  {:ok, user} {:error, %{errors: [username: "cannot be blank"]}}  {:error, :blank_username} end end Acceptable error
None
def update_description(transaction, user) do with \ %{receipt: receipt}  transaction,
false  is_nil(receipt), {:ok, %{"id"  id}  Poison.decode(receipt), {:ok, %{status: 200, body: body}}  Adapter.update(id, user) {:ok, _}  update_db_record(id, body) do :ok end end
def update_description(transaction, user) do Task.Supervisor.start_child(MyApp.TaskSupervisor, fn  with \ %{receipt:
receipt}  transaction, false  is_nil(receipt), {:ok, %{"id"  id}  Poison.decode(receipt), {:ok, %{status: 200, body: body}}  Adapter.update(id, user) {:ok, _}  update_db_record(id, body) do :ok end end) end
def update_description(transaction, user) do Task.Supervisor.start_child(MyApp.TaskSupervisor, fn  with \ %{receipt:
receipt}  transaction, false  is_nil(receipt), {:ok, %{"id"  id}  Poison.decode(receipt), {:ok, %{status: 200, body: body}}  Adapter.update(id, user) {:ok, _}  update_db_record(id, body) do :ok end end) end
def update_description(transaction, user) do Task.Supervisor.start_child(MyApp.TaskSupervisor, fn  with \ %{receipt:
receipt}  transaction, false  is_nil(receipt), {:ok, %{"id"  id}  Poison.decode(receipt), {:ok, %{status: 200, body: body}}  Adapter.update(id, user) {:ok, _}  update_db_record(id, body) do :ok end end) end
def update_description(transaction, user) do Task.Supervisor.start_child(MyApp.TaskSupervisor, fn  %{"id"  transaction_id}
= Poison.decode!(receipt) {:ok, %{body: body}} = Adapter.update(transaction_id, user) {:ok, _} = update_db_record(transaction_id, body) end end
Less duplicated logic
def add_contact(current_user_id, nil), do: {:error, :invalid_contact_id} def add_contact(current_user_id, contact_id) do
params = %{user_id: current_user_id, contact_id: contact_id} %Contact{}  Contact.Changeset(params)  Repo.insert()  case do {:ok, contact}  {:ok, contact} {:error, changeset}  {:error, changeset} end end
def add_contact(current_user_id, nil), do: {:error, :invalid_contact_id} def add_contact(current_user_id, contact_id) do
params = %{user_id: current_user_id, contact_id: contact_id} %Contact{}  Contact.Changeset(params)  Repo.insert()  case do {:ok, contact}  {:ok, contact} {:error, changeset}  {:error, changeset} end end
def add_contact(current_user_id, nil), do: {:error, :invalid_contact_id} def add_contact(current_user_id, contact_id) do
params = %{user_id: current_user_id, contact_id: contact_id} %Contact{}  Contact.Changeset(params)  Repo.insert()  case do {:ok, contact}  {:ok, contact} {:error, changeset}  {:error, changeset} end end
def add_contact(current_user_id, nil), do: {:error, :invalid_contact_id} def add_contact(current_user_id, contact_id) do
params = %{user_id: current_user_id, contact_id: contact_id} %Contact{}  Contact.Changeset(params)  Repo.insert()  case do {:ok, contact}  {:ok, contact} {:error, changeset}  {:error, changeset} end end
def add_contact(current_user_id, contact_id) do params = %{user_id: current_user_id, contact_id: contact_id}
{:ok, _} = %Contact{}  Contact.Changeset(params)  Repo.insert() end
Faster bug fixes
def handle_info(:do_work, state) do with {:ok, data}  ServiceA.fetch_data(), {:ok,
other_data}  ServiceB.fetch_data() do do_some_work(data, other_data) end Process.send_after(self(), :do_work, @one_hour) {:noreply, state} end
def handle_info(:do_work, state) do {:ok, data} = ServiceA.fetch_data() {:ok, other_data}
= ServiceB.fetch_data() :ok = do_some_work(data, other_data) Process.send_after(self(), :do_work, @one_hour) {:noreply, state} end
defmodule ServiceA do def fetch_data() do {:ok, [1, 2, 3,
4, 5]} end end defmodule ServiceA do def fetch_data() do [1, 2, 3, 4, 5] end end
iex(4)> with {:ok, data}  ServiceA.fetch_data, do: :ok [1, 2,
3, 4, 5] iex(6)> {:ok, data} = ServiceA.fetch_data() ** (MatchError) no match of right hand side value: [1, 2, 3, 4, 5]
[error] GenServer Fail.Worker terminating ** (MatchError) no match of right
hand side value: [1, 2, 3, 4, 5] (fail) lib/fail/worker.ex:30: Fail.Worker.handle_info/2 (stdlib) gen_server.erl:615: :gen_server.try_dispatch/4 (stdlib) gen_server.erl:681: :gen_server.handle_msg/5 (stdlib) proc_lib.erl:240: :proc_lib.init_p_do_apply/3 Last message: :do_work State: nil
SUMMARY
‣ Things will fail ‣ Fault tolerance isn't free ‣
Know your tools ‣ Think what you can handle ‣ Don't try to handle every possible error ‣ Think about supervision structure
‣ https://ferd.ca/the-zen-of-erlang.html ‣ https://medium.com/@jlouis666/error-kernels-9ad991200abd ‣ http://jlouisramblings.blogspot.com/2010/11/on-erlang-state-and- crashes.html ‣ https://mazenharake.wordpress.com/2009/09/14/let-it-crash-the- right-way/
‣ http://blog.plataformatec.com.br/2016/05/beyond-functional- programming-with-elixir-and-erlang/ ‣ https://mazenharake.wordpress.com/2010/10/31/9-erlang-pitfalls- you-should-know-about/ ("Returning arbitrary {error, Reason}") ‣ http://mkaszubowski.pl/2017/09/02/On-Restoring-Process-State.html
THANK YOU! mkaszubowski94 http://mkaszubowski.pl