Upgrade to PRO for Only $50/Year—Limited-Time Offer! 🔥
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Let it crash - fault tolerance in Elixir/OTP
Search
Maciej Kaszubowski
September 28, 2017
Programming
0
500
Let it crash - fault tolerance in Elixir/OTP
Maciej Kaszubowski
September 28, 2017
Tweet
Share
More Decks by Maciej Kaszubowski
See All by Maciej Kaszubowski
Error-free Elixir
mkaszubowski
0
420
Modular Design in Elixir (ElixirConf EU 2019)
mkaszubowski
2
830
The Big Ball of Nouns
mkaszubowski
0
120
Modular Design in Elixir
mkaszubowski
1
400
Our three years with Elixir
mkaszubowski
0
260
Concurrency Basics for Elixir
mkaszubowski
0
130
Distributed Elixir
mkaszubowski
0
170
Software Architecture
mkaszubowski
0
150
CRDTs - The science behind Phoenix Presence
mkaszubowski
2
280
Other Decks in Programming
See All in Programming
AI Agent Dojo #4: watsonx Orchestrate ADK体験
oniak3ibm
PRO
0
110
実はマルチモーダルだった。ブラウザの組み込みAI🧠でWebの未来を感じてみよう #jsfes #gemini
n0bisuke2
3
1.3k
Combinatorial Interview Problems with Backtracking Solutions - From Imperative Procedural Programming to Declarative Functional Programming - Part 2
philipschwarz
PRO
0
120
まだ間に合う!Claude Code元年をふりかえる
nogu66
5
900
AI 駆動開発ライフサイクル(AI-DLC):ソフトウェアエンジニアリングの再構築 / AI-DLC Introduction
kanamasa
11
3.9k
Vibe codingでおすすめの言語と開発手法
uyuki234
0
120
これならできる!個人開発のすゝめ
tinykitten
PRO
0
130
フルサイクルエンジニアリングをAI Agentで全自動化したい 〜構想と現在地〜
kamina_zzz
0
300
ゲームの物理 剛体編
fadis
0
370
DevFest Android in Korea 2025 - 개발자 커뮤니티를 통해 얻는 가치
wisemuji
0
170
PC-6001でPSG曲を鳴らすまでを全部NetBSD上の Makefile に押し込んでみた / osc2025hiroshima
tsutsui
0
190
モデル駆動設計をやってみようワークショップ開催報告(Modeling Forum2025) / model driven design workshop report
haru860
0
280
Featured
See All Featured
How to Align SEO within the Product Triangle To Get Buy-In & Support - #RIMC
aleyda
1
1.3k
Leadership Guide Workshop - DevTernity 2021
reverentgeek
0
170
Scaling GitHub
holman
464
140k
What does AI have to do with Human Rights?
axbom
PRO
0
1.9k
SEO in 2025: How to Prepare for the Future of Search
ipullrank
3
3.3k
技術選定の審美眼(2025年版) / Understanding the Spiral of Technologies 2025 edition
twada
PRO
115
93k
Game over? The fight for quality and originality in the time of robots
wayneb77
1
66
Everyday Curiosity
cassininazir
0
110
The MySQL Ecosystem @ GitHub 2015
samlambert
251
13k
Rails Girls Zürich Keynote
gr2m
95
14k
Art, The Web, and Tiny UX
lynnandtonic
304
21k
Reflections from 52 weeks, 52 projects
jeffersonlam
355
21k
Transcript
LET IT CRASH! Poznań Elixir Metup #4
(DON'T) LET IT CRASH! Poznań Elixir Metup #4
(DON'T) LET IT CRASH! Fault tolerance in Elixir/OTP Poznań Elixir
Metup #4
(YOU CAN ASK QUESTIONS)
Elixir (Erlang) features ‣ Concurrent ‣ Functional ‣ Immutable state
‣ Message passing ‣ Distributed ‣ Hot upgrades
FAULT TOLERANCE
Elixir (Erlang) features ‣ Concurrent ‣ Functional ‣ Immutable state
‣ Message passing ‣ Distributed ‣ Hot upgrades
LET IT CRASH!
Let it crash!
Let it crash! ‣ Accept the fact that things fail
‣ Focus on the happy path ‣ Make failures more predictable
Let it crash! ‣ Separate the logic and error handling
‣ When something is wrong, let the process crash and let another one handle it (e.g. by restarting)
https://ferd.ca/an-open-letter-to-the-erlang-beginner-or-onlooker.html
THE TOOLS
Tools ‣ Monitors ‣ Links ‣ Supervisors ‣ Heart ‣
Distribution
Monitors pid_a ref = Process.monitor(pid_b) pid_b
Monitors pid_a ref = Process.monitor(pid_b) {:DOWN, ref, :process, pid_b, reason}
pid_b
Links pid_a Process.link(pid_b) pid_b
Links pid_a pid_b Process.link(pid_b)
Links pid_b Process.link(pid_b) Process.flag(:trap_exit, true) pid_a {:EXIT, from, reason}
Links pid_b Process.link(pid_b) Process.flag(:trap_exit, true) pid_a
Supervisors Worker Worker Supervisor
Supervisors Worker Worker Supervisor
Supervisors Worker Worker Supervisor Worker *New* process
Supervision strategies
opts = [ name: MyApp.Supervisor, ] Supervisor.start_link(children, opts)
opts = [ name: MyApp.Supervisor, strategy: :one_for_one ] Supervisor.start_link(children, opts)
:one_for_one W S W W S W W S W
:all_for_one W S W W S W W S W
W S W
:rest_for_one W S W W W S W W W
S W W W S W W
:simple_one_for_one W S W W S W W S W
Heart ## vm.args ## Heartbeat management; auto-restarts VM if it
##dies or becomes unresponsive ## (Disabled by default use with caution!) -heart -env HEART_COMMAND ~/heart_command.sh
WHY RESTARTING WORKS
Why restarting works ‣ Independent processes ‣ Clean state ‣
Bohrbugs vs. Heisenbugs
Bohrbugs ‣ Repeatable ‣ Easy to debug ‣ Easy to
fix ‣ Rare in production ‣ Restarting doesn't help
Heisenbugs ‣ Unpredictable ‣ Hard to debug ‣ Hard to
fix ‣ Frequent in production ‣ Restarting HELPS!
Heisenbugs ‣ Unpredictable ‣ Hard to debug ‣ Hard to
fix ‣ Frequent in production ‣ Restarting HELPS!
Supervisors Worker Worker Supervisor Worker *New* process
New process ‣ Clean state ‣ Predictable ‣ High chance
of fixing the bug
LIMITS
Limits ‣ :max_restarts (default: 3) ‣ :max_seconds (default: 5)
opts = [ name: MyApp.Supervisor, strategy: :one_for_one, max_restarts: 1, max_seconds:
1 ] Supervisor.start_link(children, opts) Limits
‣ Process ‣ Supervisor ‣ Node ‣ Machine Restarting
MISTAKES
‣ Poor supervision tree structure ‣ Not validating user params
‣ Not handling expected errors ‣ {:error, reason} tuples everywhere Mistakes
‣ Trying to recreate the state ‣ Timeouts ‣ Not
reading libraries source code ‣ Incorrect limits Mistakes
Expected errors {:ok, user} = Auth.authenticate(email, password) {:ok, user} =
UserService.fetch_by_id(params["id"])
Restoring the state def init(_) do state = restore_state() {:ok,
state} end def terminate(_reason, state) do save_state(state) end http://mkaszubowski.pl/2017/09/02/On-Restoring-Process-State.html
Poor supervision structure
Stable, long-lived, important, protected Short-lived, transient, can fail
Incorrect limits
DEMO!
BENEFITS
‣ Less code (= less bugs, easier to understand, easier
to change) ‣ Less logic duplication ‣ Faster bug fixes Benefits
Less code
def update_name(user, name) do end
def update_name(user, name) do update(user, %{name: name}) end
def update_name(user, name) do case update(user, %{name: name}) do end
end
def update_name(user, name) do case update(user, %{name: name}) do {:ok,
user} {:ok, user} end end
def update_name(user, name) do case update(user, %{name: name}) do {:ok,
user} {:ok, user} {:error, reason} {:error, reason} end end
def update_name(user, name) do case update(user, %{name: name}) do {:ok,
user} {:ok, user} {:error, reason} {:error, reason} end end
def update_name(user, name) do case update(user, %{name: name}) do {:ok,
user} {:ok, user} {:error, reason} {:error, reason} end end ‣ Do you know how to handle reason? ‣ Is {:error, reason} even possible? ‣ Fatal or acceptable error?
‣ What is likely to happen? ‣ What is an
acceptable error? ‣ What do I know how to handle?
def update_name(user, name) do {:ok, _} = update(user, %{name: name})
do end
def update_name(user, name) do case update(user, %{name: name}) do {:ok,
user} {:ok, user} {:error, %{errors: [username: "cannot be blank"]}} {:error, :blank_username} end end Acceptable error
None
def update_description(transaction, user) do with \ %{receipt: receipt} transaction,
false is_nil(receipt), {:ok, %{"id" id} Poison.decode(receipt), {:ok, %{status: 200, body: body}} Adapter.update(id, user) {:ok, _} update_db_record(id, body) do :ok end end
def update_description(transaction, user) do Task.Supervisor.start_child(MyApp.TaskSupervisor, fn with \ %{receipt:
receipt} transaction, false is_nil(receipt), {:ok, %{"id" id} Poison.decode(receipt), {:ok, %{status: 200, body: body}} Adapter.update(id, user) {:ok, _} update_db_record(id, body) do :ok end end) end
def update_description(transaction, user) do Task.Supervisor.start_child(MyApp.TaskSupervisor, fn with \ %{receipt:
receipt} transaction, false is_nil(receipt), {:ok, %{"id" id} Poison.decode(receipt), {:ok, %{status: 200, body: body}} Adapter.update(id, user) {:ok, _} update_db_record(id, body) do :ok end end) end
def update_description(transaction, user) do Task.Supervisor.start_child(MyApp.TaskSupervisor, fn with \ %{receipt:
receipt} transaction, false is_nil(receipt), {:ok, %{"id" id} Poison.decode(receipt), {:ok, %{status: 200, body: body}} Adapter.update(id, user) {:ok, _} update_db_record(id, body) do :ok end end) end
def update_description(transaction, user) do Task.Supervisor.start_child(MyApp.TaskSupervisor, fn %{"id" transaction_id}
= Poison.decode!(receipt) {:ok, %{body: body}} = Adapter.update(transaction_id, user) {:ok, _} = update_db_record(transaction_id, body) end end
Less duplicated logic
def add_contact(current_user_id, nil), do: {:error, :invalid_contact_id} def add_contact(current_user_id, contact_id) do
params = %{user_id: current_user_id, contact_id: contact_id} %Contact{} Contact.Changeset(params) Repo.insert() case do {:ok, contact} {:ok, contact} {:error, changeset} {:error, changeset} end end
def add_contact(current_user_id, nil), do: {:error, :invalid_contact_id} def add_contact(current_user_id, contact_id) do
params = %{user_id: current_user_id, contact_id: contact_id} %Contact{} Contact.Changeset(params) Repo.insert() case do {:ok, contact} {:ok, contact} {:error, changeset} {:error, changeset} end end
def add_contact(current_user_id, nil), do: {:error, :invalid_contact_id} def add_contact(current_user_id, contact_id) do
params = %{user_id: current_user_id, contact_id: contact_id} %Contact{} Contact.Changeset(params) Repo.insert() case do {:ok, contact} {:ok, contact} {:error, changeset} {:error, changeset} end end
def add_contact(current_user_id, nil), do: {:error, :invalid_contact_id} def add_contact(current_user_id, contact_id) do
params = %{user_id: current_user_id, contact_id: contact_id} %Contact{} Contact.Changeset(params) Repo.insert() case do {:ok, contact} {:ok, contact} {:error, changeset} {:error, changeset} end end
def add_contact(current_user_id, contact_id) do params = %{user_id: current_user_id, contact_id: contact_id}
{:ok, _} = %Contact{} Contact.Changeset(params) Repo.insert() end
Faster bug fixes
def handle_info(:do_work, state) do with {:ok, data} ServiceA.fetch_data(), {:ok,
other_data} ServiceB.fetch_data() do do_some_work(data, other_data) end Process.send_after(self(), :do_work, @one_hour) {:noreply, state} end
def handle_info(:do_work, state) do {:ok, data} = ServiceA.fetch_data() {:ok, other_data}
= ServiceB.fetch_data() :ok = do_some_work(data, other_data) Process.send_after(self(), :do_work, @one_hour) {:noreply, state} end
defmodule ServiceA do def fetch_data() do {:ok, [1, 2, 3,
4, 5]} end end defmodule ServiceA do def fetch_data() do [1, 2, 3, 4, 5] end end
iex(4)> with {:ok, data} ServiceA.fetch_data, do: :ok [1, 2,
3, 4, 5] iex(6)> {:ok, data} = ServiceA.fetch_data() ** (MatchError) no match of right hand side value: [1, 2, 3, 4, 5]
[error] GenServer Fail.Worker terminating ** (MatchError) no match of right
hand side value: [1, 2, 3, 4, 5] (fail) lib/fail/worker.ex:30: Fail.Worker.handle_info/2 (stdlib) gen_server.erl:615: :gen_server.try_dispatch/4 (stdlib) gen_server.erl:681: :gen_server.handle_msg/5 (stdlib) proc_lib.erl:240: :proc_lib.init_p_do_apply/3 Last message: :do_work State: nil
SUMMARY
‣ Things will fail ‣ Fault tolerance isn't free ‣
Know your tools ‣ Think what you can handle ‣ Don't try to handle every possible error ‣ Think about supervision structure
‣ https://ferd.ca/the-zen-of-erlang.html ‣ https://medium.com/@jlouis666/error-kernels-9ad991200abd ‣ http://jlouisramblings.blogspot.com/2010/11/on-erlang-state-and- crashes.html ‣ https://mazenharake.wordpress.com/2009/09/14/let-it-crash-the- right-way/
‣ http://blog.plataformatec.com.br/2016/05/beyond-functional- programming-with-elixir-and-erlang/ ‣ https://mazenharake.wordpress.com/2010/10/31/9-erlang-pitfalls- you-should-know-about/ ("Returning arbitrary {error, Reason}") ‣ http://mkaszubowski.pl/2017/09/02/On-Restoring-Process-State.html
THANK YOU! mkaszubowski94 http://mkaszubowski.pl