Building resilient systems with stacking Chris Keathley / @ChrisKeathley

Breaking resilient systems with stacking Chris Keathley / @ChrisKeathley

Purely functional data structures explained Chris Keathley / @ChrisKeathley

How to build reliable systems with your face (and not on your face) Chris Keathley / @ChrisKeathley

How to boot your apps correctly Chris Keathley / @ChrisKeathley

Scaling BEAM

Resilience /ri-ˈzil-yən(t)s/: an ability to recover from or adjust easily to misfortune or change.

Complex systems run in degraded mode. “…complex systems run as broken systems. The system continues to function because it contains so many redundancies and because people can make it function, despite the presence of many flaws… System operations are dynamic, with components (organizational, human, technical) failing and being replaced continuously.”

System /ˈsistəm/: a group of interacting, interrelated, or interdependent elements forming a complex whole.

Systems have dependencies

Our App Webserver DB Redis Kafka Systems

Systems Our App Other Service Other Service Other Service Other Service Other Service Other Service

Scaling is a problem of handling failure

Our App Systems Other Service Client

Dependencies are more than other systems

Systems Our App Humans!

Systems should…
- Handle failures gracefully
- Provide feedback to other systems
- Give insight to operators

Our App Webserver DB Redis Kafka Stacked Design

Let's talk about…
- Booting the runtime & Configuration
- Starting dependencies
- Connecting to external systems
- Alarms and feedback
- Communicating with services we don’t control

Kubernetes Release

Our App Webserver DB Redis Kafka

Our App

Releases are the unit of deployment in Erlang/Elixir

What has to be here to start our application?

App Boot
- Read in system configuration
- Start the BEAM
- Start the App
- Read runtime configuration
- Proceed to next level

Mix config vs. runtime config
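The difference named on this slide can be sketched in a few lines. A value read in a module attribute is fixed when the module is compiled, much as `config/config.exs` is evaluated when a release is built, while a value read inside a function is looked up each time the running system asks for it. The module name `ConfigTiming` and the `DEMO_PORT` variable are made up for this illustration and do not appear in the talk.

```elixir
# Start from a known state: DEMO_PORT is unset while the module compiles.
System.delete_env("DEMO_PORT")

defmodule ConfigTiming do
  # Evaluated once, at compile time, and baked into the compiled module.
  @compiled_port System.get_env("DEMO_PORT", "4000")

  def compiled_port, do: @compiled_port

  # Evaluated on every call, so it sees the environment of the running system.
  def runtime_port, do: System.get_env("DEMO_PORT", "4000")
end

# Simulate the deployment environment setting the variable after compilation.
System.put_env("DEMO_PORT", "5000")

IO.inspect(ConfigTiming.compiled_port(), label: "compile time")
IO.inspect(ConfigTiming.runtime_port(), label: "runtime")
```

The compile-time reader keeps reporting the fallback value, which is why configuration that depends on the deployment environment is read at boot in the slides that follow.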

defmodule Jenga.Application do
  use Application

  def start(_type, _args) do
    # Maps each config key to the environment variable it is read from.
    config = [
      port: "PORT",
      db_url: "DB_URL"
    ]

    children = [
      {Jenga.Config, config}
    ]

    opts = [strategy: :one_for_one, name: Jenga.Supervisor]
    Supervisor.start_link(children, opts)
  end
end

defmodule Jenga.Config do
  use GenServer

  def start_link(desired_config) do
    GenServer.start_link(__MODULE__, desired_config, name: __MODULE__)
  end

  def init(desired) do
    # Named, protected table: this process writes, any process can read.
    :jenga_config = :ets.new(:jenga_config, [:set, :protected, :named_table])

    case load_config(:jenga_config, desired) do
      :ok ->
        {:ok, %{table: :jenga_config, desired: desired}}

      # Refuse to start without configuration, which stops the application from booting.
      :error ->
        {:stop, :could_not_load_config}
    end
  end

  defp load_config(table, config, retry_count \\ 0)
  defp load_config(_table, [], _), do: :ok
  # Give up once the retry count reaches 10.
  defp load_config(_table, _, 10), do: :error

  defp load_config(table, [{k, v} | tail], retry_count) do
    case System.get_env(v) do
      nil ->
        load_config(table, [{k, v} | tail], retry_count + 1)

      value ->
        :ets.insert(table, {k, value})
        load_config(table, tail, retry_count)
    end
  end
end
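The endpoint shown later calls `Jenga.Config.get/1`, which the slides do not define. Because the table is named and protected, any process can read from it directly. A minimal sketch of such a reader, assuming values are stored as `{key, value}` tuples as in `load_config/3`; the module name `ConfigReaderSketch`, the table name `:config_sketch`, and the default-value argument are assumptions made for this example.

```elixir
defmodule ConfigReaderSketch do
  @table :config_sketch

  # In the talk the Config GenServer owns the table. Here the calling
  # process creates it so that the sketch runs on its own.
  def setup(pairs) do
    :ets.new(@table, [:set, :protected, :named_table])
    true = :ets.insert(@table, pairs)
    :ok
  end

  # Readers go straight to ETS instead of through the owning process,
  # so lookups do not queue behind a single mailbox.
  def get(key, default \\ nil) do
    case :ets.lookup(@table, key) do
      [{^key, value}] -> value
      [] -> default
    end
  end
end

:ok = ConfigReaderSketch.setup(port: "4000", db_url: "ecto://localhost/jenga")
IO.inspect(ConfigReaderSketch.get(:port))
IO.inspect(ConfigReaderSketch.get(:missing, :not_set))
```

Values read from the environment are strings, so a real reader would likely convert the port to an integer before handing it to the web server.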

** (Mix) Could not start application jenga: Jenga.Application.start(:normal, []) returned an error: shutdown: failed to start child: Jenga.Config
    ** (EXIT) :could_not_load_config

Let's talk about…
- Booting the runtime & Configuration
- Starting dependencies
- Connecting to external systems
- Alarms and feedback
- Communicating with services we don’t control

App Load Balancer /up Operators alarms

App Phoenix

defmodule Jenga.Application do
  use Application

  def start(_type, _args) do
    config = [
      port: "PORT",
      db_url: "DB_URL"
    ]

    # Children start in list order, so the endpoint can read its port from Jenga.Config.
    children = [
      {Jenga.Config, config},
      JengaWeb.Endpoint
    ]

    opts = [strategy: :one_for_one, name: Jenga.Supervisor]
    Supervisor.start_link(children, opts)
  end
end

defmodule JengaWeb.Endpoint do
  use Phoenix.Endpoint, otp_app: :jenga

  # Called when the endpoint starts, which lets us inject runtime configuration.
  def init(_key, config) do
    port = Jenga.Config.get(:port)
    {:ok, Keyword.put(config, :http, [:inet6, port: port])}
  end
end

defmodule JengaWeb.UpController do
  use JengaWeb, :controller

  def up(conn, _params) do
    {code, message} = status()

    conn
    |> Plug.Conn.put_status(code)
    |> json(message)
  end

  # Nothing to check yet, so always report that the app is still loading.
  defp status do
    {500, %{status: "LOADING"}}
  end
end

App Phoenix Database

App Phoenix Pool Supervisor Conn Conn Conn

App Phoenix Pool Supervisor Conn Conn Conn Disconnected

Supervisors are about guarantees -“Friend of the show” Fred Hebert

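defmodule Jenga.DemoConnection do
  use GenServer

  # backoff/0, jitter/0 and do_connect/1 are not shown on the slides.

  def init(opts) do
    # Return immediately and connect later, so an unreachable database
    # cannot block or crash the boot of the supervision tree.
    wait_for = 3_000 + backoff() + jitter()
    Process.send_after(self(), :try_connect, wait_for)
    {:ok, %{state: :disconnected, opts: opts}}
  end

  def handle_info(:try_connect, state) do
    case do_connect(state.opts) do
      :ok ->
        {:noreply, %{state | state: :connected}}

      :error ->
        # Stay disconnected and schedule another attempt.
        wait_for = 3_000 + backoff() + jitter()
        Process.send_after(self(), :try_connect, wait_for)
        {:noreply, state}
    end
  end
end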
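The connection code calls `backoff()` and `jitter()` without showing them. One common way to fill them in is exponential backoff with a cap plus uniform random jitter. The sketch below assumes that approach; the module name `BackoffSketch`, the attempt counter, and every constant except the 3,000 ms base are assumptions rather than values from the talk.

```elixir
defmodule BackoffSketch do
  import Bitwise, only: [bsl: 2]

  @base_ms 3_000
  @step_ms 500
  @max_backoff_ms 30_000
  @max_jitter_ms 1_000

  # Exponential backoff: 500 ms, 1 s, 2 s, ... capped at @max_backoff_ms.
  def backoff(attempt) when is_integer(attempt) and attempt >= 0 do
    min(@step_ms * bsl(1, attempt), @max_backoff_ms)
  end

  # Uniform random jitter in 0..@max_jitter_ms, so that connections which
  # failed together do not all retry at the same instant.
  def jitter do
    :rand.uniform(@max_jitter_ms + 1) - 1
  end

  # Total delay before the next connection attempt.
  def wait_for(attempt) do
    @base_ms + backoff(attempt) + jitter()
  end
end

IO.inspect(BackoffSketch.backoff(0), label: "attempt 0")
IO.inspect(BackoffSketch.backoff(3), label: "attempt 3")
IO.inspect(BackoffSketch.backoff(20), label: "attempt 20")
```

To use this from a GenServer like the one above, the process would keep the attempt count in its state and reset it to zero after a successful connection.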

App Phoenix Pool Supervisor Conn Conn Conn Load Balancer

defmodule JengaWeb.UpController do
  use JengaWeb, :controller

  def up(conn, _params) do
    {code, message} = status()

    conn
    |> Plug.Conn.put_status(code)
    |> json(message)
  end

  # Report healthy to the load balancer only once the database is reachable.
  defp status do
    case Jenga.Database.check_status() do
      :ok -> {200, %{status: "OK"}}
      _ -> {500, %{status: "LOADING"}}
    end
  end
end

App Phoenix Pool Supervisor Conn Conn Conn Operators alarms

App Phoenix Pool supervisor Operators alarms db_supervisor Watchdog

Watchdog Good Bad Check DB Status Close alarm Open alarm

defmodule Jenga.Database.Watchdog do
  use GenServer

  # Assumed from the alarm id matched in Jenga.AlarmHandler; the slides do not show the definition.
  @alarm_id :database_disconnected

  # schedule_check/0 is not shown on the slides; it has to deliver :check_db to this process.

  def init(:ok) do
    schedule_check()
    {:ok, %{status: :degraded, passing_checks: 0}}
  end

  def handle_info(:check_db, state) do
    status = Jenga.Database.check_status()
    state = change_state(status, state)
    schedule_check()
    {:noreply, state}
  end

  defp change_state(result, %{status: status, passing_checks: count}) do
    case {result, status, count} do
      {:ok, :connected, count} ->
        # Clear the alarm only after a run of consecutive passing checks.
        if count == 3 do
          :alarm_handler.clear_alarm(@alarm_id)
        end

        %{status: :connected, passing_checks: count + 1}

      {:ok, :degraded, _} ->
        %{status: :connected, passing_checks: 0}

      {:error, :connected, _} ->
        :alarm_handler.set_alarm({@alarm_id, "We cannot connect to the database"})
        %{status: :degraded, passing_checks: 0}

      {:error, :degraded, _} ->
        %{status: :degraded, passing_checks: 0}
    end
  end
end

:alarm_handler.clear_alarm(@alarm_id)
:alarm_handler.set_alarm({@alarm_id, "We cannot connect to the database"})
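These two calls can be tried in isolation. `:alarm_handler` ships with SASL, so the sketch below starts SASL first and uses the default handler, which keeps a list of active alarms that `:alarm_handler.get_alarms/0` returns. The alarm id `:database_disconnected` is taken from the handler shown later in the deck; the variable names are made up for the example.

```elixir
# :alarm_handler is part of SASL, so SASL has to be running.
{:ok, _started} = Application.ensure_all_started(:sasl)

alarm_id = :database_disconnected
description = "We cannot connect to the database"

# Setting an alarm takes an {id, description} tuple.
:ok = :alarm_handler.set_alarm({alarm_id, description})
after_set = :alarm_handler.get_alarms()
IO.inspect(after_set, label: "after set")

# Clearing an alarm takes only the id.
:ok = :alarm_handler.clear_alarm(alarm_id)
after_clear = :alarm_handler.get_alarms()
IO.inspect(after_clear, label: "after clear")
```

The asymmetry between the two calls matters for custom handlers: a set event carries the whole tuple, while a clear event carries only the id.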

defmodule Jenga.Application do
  use Application

  def start(_type, _args) do
    config = [
      port: "PORT",
      db_url: "DB_URL"
    ]

    # Replace SASL's default alarm handler with our own. The old handler's
    # alarms are passed to the new handler's init/1.
    :gen_event.swap_handler(
      :alarm_handler,
      {:alarm_handler, :swap},
      {Jenga.AlarmHandler, :ok}
    )

    children = [
      {Jenga.Config, config},
      JengaWeb.Endpoint,
      Jenga.Database.Supervisor
    ]

    opts = [strategy: :one_for_one, name: Jenga.Supervisor]
    Supervisor.start_link(children, opts)
  end
end

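defmodule Jenga.AlarmHandler do
  require Logger

  # send_alert_to_slack/1, send_recovery_to_slack/1 and database_alarm/0 are not shown on the slides.

  def init({:ok, {:alarm_handler, _old_alarms}}) do
    Logger.info("Installing alarm handler")
    {:ok, %{}}
  end

  # set_alarm/1 delivers the whole {alarm_id, description} tuple.
  def handle_event({:set_alarm, {:database_disconnected, _description}}, alarms) do
    send_alert_to_slack(database_alarm())
    {:ok, alarms}
  end

  # clear_alarm/1 delivers only the alarm id.
  def handle_event({:clear_alarm, :database_disconnected}, alarms) do
    send_recovery_to_slack(database_alarm())
    {:ok, alarms}
  end

  def handle_event(event, state) do
    Logger.info("Unhandled alarm event: #{inspect(event)}")
    {:ok, state}
  end
end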

App Other Service Client External Services

Circuit Breakers

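defmodule Jenga.ExternalService do
  # @fuse (the name of the fuse) and make_call/1 are not shown on the slides.

  def fetch(params) do
    # Ask the breaker before calling out, so a blown fuse fails fast.
    with :ok <- :fuse.ask(@fuse, :async_dirty),
         {:ok, result} <- make_call(params) do
      {:ok, result}
    else
      {:error, e} ->
        # Record the failure; enough melts within the configured window blow the fuse.
        :ok = :fuse.melt(@fuse)
        {:error, e}

      :blown ->
        {:error, :service_is_down}
    end
  end
end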
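To make the ask/melt flow concrete without pulling in the fuse library, here is a deliberately small stand-in that keeps a failure counter in ETS. It is not how fuse is implemented: it has no time window and no automatic reset, and the names `BreakerSketch`, `install/0`, `ask/0`, `melt/0`, and `call/1` are invented for this sketch.

```elixir
defmodule BreakerSketch do
  @table :breaker_sketch
  @max_failures 3

  # Create the shared counter. The table is public so any caller can melt it.
  def install do
    :ets.new(@table, [:set, :public, :named_table])
    :ets.insert(@table, {:failures, 0})
    :ok
  end

  # :ok while the breaker is closed, :blown once too many failures were recorded.
  def ask do
    [{:failures, count}] = :ets.lookup(@table, :failures)
    if count >= @max_failures, do: :blown, else: :ok
  end

  # Record one failure.
  def melt do
    :ets.update_counter(@table, :failures, 1)
    :ok
  end

  # Same shape as fetch/1 on the slide: ask first, melt on error.
  def call(fun) do
    with :ok <- ask(),
         {:ok, result} <- fun.() do
      {:ok, result}
    else
      {:error, reason} ->
        melt()
        {:error, reason}

      :blown ->
        {:error, :service_is_down}
    end
  end
end

:ok = BreakerSketch.install()
failing = fn -> {:error, :timeout} end

# Three real failures reach the service, then the breaker answers on its own.
results = for _ <- 1..4, do: BreakerSketch.call(failing)
IO.inspect(results)
```

After the third failure the function passed to `call/1` is no longer invoked, which is the point of the pattern: a struggling service stops receiving traffic from this client.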

App Other Service Client External Services ETS

Circuit Breakers

Additive Increase Multiplicative Decrease
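The slide names the technique only. As a rough illustration, additive increase, multiplicative decrease (AIMD) grows a limit by a fixed step after each success and cuts it by a fixed factor after each failure. The sketch below applies it to a limit on concurrent calls to an external service, which is an assumption about how it would be used here; the module name `AimdSketch` and all constants are made up.

```elixir
defmodule AimdSketch do
  @min_limit 1
  @max_limit 100
  @increase 1
  @decrease_factor 0.5

  # Success: probe for more capacity by a small, fixed step.
  def adjust(limit, :ok), do: min(limit + @increase, @max_limit)

  # Failure: back off sharply, but never below the minimum.
  def adjust(limit, :error), do: max(trunc(limit * @decrease_factor), @min_limit)
end

outcomes = [:ok, :ok, :ok, :error, :ok]

# Track the limit after each outcome, starting from a limit of 10.
limits = Enum.scan(outcomes, 10, fn outcome, limit -> AimdSketch.adjust(limit, outcome) end)
IO.inspect(limits)
```

The limit climbs slowly while calls succeed and drops quickly on the first failure, so a client sheds load fast and recovers cautiously.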

We booted our application!

Now what?

Systems should…
- Handle failures gracefully
- Provide feedback to other systems
- Give insight to operators

We have powerful tools in our runtime

Take advantage of them to build more robust systems

Thanks Chris Keathley / @ChrisKeathley