Building Resilient Elixir Systems

Presented at GigCity Elixir - 2018

This was my attempt at describing a methodology for building systems in Elixir that can handle failures at all levels. It touches on technical solutions as well as how to engage humans in those solutions.

Chris Keathley

October 27, 2018

Transcript

  1. Building resilient systems with stacking Chris Keathley / @ChrisKeathley / c@keathley.io

  2. Breaking resilient systems with stacking Chris Keathley / @ChrisKeathley / c@keathley.io

  3. Purely functional data structures explained Chris Keathley / @ChrisKeathley / c@keathley.io

  4. How to build reliable systems with your face (and not on your face) Chris Keathley / @ChrisKeathley / c@keathley.io

  5. How to boot your apps correctly Chris Keathley / @ChrisKeathley / c@keathley.io

  6. Scaling

  7. Scaling

  8. Scaling BEAM

  9. Resilience /ri-ˈzil-yən(t)s/: an ability to recover from or adjust easily to misfortune or change

  10. None
  11. Complex systems run in degraded mode. “…complex systems run as broken systems. The system continues to function because it contains so many redundancies and because people can make it function, despite the presence of many flaws… System operations are dynamic, with components (organizational, human, technical) failing and being replaced continuously.”

  12. System /ˈsistəm/: a group of interacting, interrelated, or interdependent elements forming a complex whole.

  13. Systems have dependencies

  14. Systems

  15. Our App Systems

  16. Our App Webserver Systems

  17. Our App Webserver DB Systems

  18. Our App Webserver DB Redis Systems

  19. Our App Webserver DB Redis Kafka Systems

  20. Our App Systems

  21. Our App Systems

  22. Systems

  23. Systems Our App

  24. Systems Our App Other Service Other Service Other Service Other Service Other Service Other Service

  25. Scaling is a problem of handling failure

  26. Our App Systems Other Service Client

  27. Our App Systems Other Service Client

  28. Our App Systems Other Service Client

  29. Our App Systems Other Service Client

  30. Our App Systems Other Service Client

  31. Our App Systems Other Service Client

  32. Our App Systems Other Service Client

  33. Our App Systems Other Service Client

  34. Our App Systems Other Service Client

  35. Our App Systems Other Service Client

  36. Dependencies are more than other systems

  37. Systems Our App

  38. Systems Our App Humans!

  39. Systems should… Handle failures gracefully; provide feedback to other systems; give insight to operators

  40. Our App Webserver DB Redis Kafka

  41. Our App Webserver DB Redis Kafka Stacked Design

  42. Let's talk about…

  43. Let's talk about… Booting the runtime & configuration; starting dependencies; connecting to external systems; alarms and feedback; communicating with services we don’t control

  44. Let's talk about… Booting the runtime & configuration; starting dependencies; connecting to external systems; alarms and feedback; communicating with services we don’t control

  45. Server

  46. Kubernetes

  47. Kubernetes Release

  48. Our App Webserver DB Redis Kafka

  49. Our App

  50. Releases are the unit of deployment in Erlang/Elixir

  51. What has to be here to start our application?

  52. App Boot

  53. App Boot Read in system configuration

  54. App Boot Read in system configuration Start the BEAM

  55. App Boot Read in system configuration Start the BEAM Start the App

  56. App Boot Start the App Read runtime configuration

  57. App Boot Start the App Read runtime configuration Proceed to next level

  58. App Boot Start the App Read runtime configuration Proceed to next level

  59. Mix config vs. runtime config

  60. None
  61. defmodule Jenga.Application do
        use Application

        def start(_type, _args) do
          children = []

          opts = [strategy: :one_for_one, name: Jenga.Supervisor]
          Supervisor.start_link(children, opts)
        end
      end

  62. defmodule Jenga.Application do
        use Application

        def start(_type, _args) do
          config = [
            port: "PORT",
            db_url: "DB_URL"
          ]

          children = []

          opts = [strategy: :one_for_one, name: Jenga.Supervisor]
          Supervisor.start_link(children, opts)
        end
      end

  63. defmodule Jenga.Application do
        use Application

        def start(_type, _args) do
          config = [
            port: "PORT",
            db_url: "DB_URL"
          ]

          children = [
            {Jenga.Config, config}
          ]

          opts = [strategy: :one_for_one, name: Jenga.Supervisor]
          Supervisor.start_link(children, opts)
        end
      end

  64. defmodule Jenga.Config do end

  65. defmodule Jenga.Config do
        use GenServer

        def start_link(desired_config) do
          GenServer.start_link(__MODULE__, desired_config, name: __MODULE__)
        end
      end

  66. defmodule Jenga.Config do
        use GenServer

        def start_link(desired_config) do
          GenServer.start_link(__MODULE__, desired_config, name: __MODULE__)
        end

        def init(desired) do
          :jenga_config = :ets.new(:jenga_config, [:set, :protected, :named_table])
        end
      end

  67. defmodule Jenga.Config do
        use GenServer

        def start_link(desired_config) do
          GenServer.start_link(__MODULE__, desired_config, name: __MODULE__)
        end

        def init(desired) do
          :jenga_config = :ets.new(:jenga_config, [:set, :protected, :named_table])

          case load_config(:jenga_config, desired) do
            :ok ->
              {:ok, %{table: :jenga_config, desired: desired}}
          end
        end
      end

  68. defmodule Jenga.Config do
        use GenServer

        def start_link(desired_config) do
          GenServer.start_link(__MODULE__, desired_config, name: __MODULE__)
        end

        def init(desired) do
          :jenga_config = :ets.new(:jenga_config, [:set, :protected, :named_table])

          case load_config(:jenga_config, desired) do
            :ok ->
              {:ok, %{table: :jenga_config, desired: desired}}

            :error ->
              {:stop, :could_not_load_config}
          end
        end
      end

  69. defmodule Jenga.Config do
        use GenServer

        def start_link(desired_config) do
          GenServer.start_link(__MODULE__, desired_config, name: __MODULE__)
        end

        def init(desired) do
          :jenga_config = :ets.new(:jenga_config, [:set, :protected, :named_table])

          case load_config(:jenga_config, desired) do
            :ok ->
              {:ok, %{table: :jenga_config, desired: desired}}

            :error ->
              {:stop, :could_not_load_config}
          end
        end

        defp load_config(table, config, retry_count \\ 0)
        defp load_config(_table, [], _), do: :ok
        defp load_config(_table, _, 10), do: :error

        defp load_config(table, [{k, v} | tail], retry_count) do
          case System.get_env(v) do
            nil ->
              load_config(table, [{k, v} | tail], retry_count + 1)

            value ->
              :ets.insert(table, {k, value})
              load_config(table, tail, retry_count)
          end
        end
      end

  70. ** (Mix) Could not start application jenga: Jenga.Application.start(:normal, []) returned an error: shutdown: failed to start child: Jenga.Config
          ** (EXIT) :could_not_load_config

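The endpoint slide later in the deck calls `Jenga.Config.get(:port)`, but the slides never show that function. The following is a minimal sketch of how such a read path could look, assuming the values live in a named ETS table as on the slides above. `ConfigSketch`, its table name, and `get/1` are hypothetical stand-ins, not the talk's actual implementation, and the sketch stops on the first missing variable instead of retrying.

```elixir
defmodule ConfigSketch do
  # Hypothetical stand-in for Jenga.Config; names here are assumptions.
  use GenServer

  @table :config_sketch

  def start_link(desired) do
    GenServer.start_link(__MODULE__, desired, name: __MODULE__)
  end

  # Reads go straight to ETS, so callers never wait on the GenServer.
  def get(key) do
    case :ets.lookup(@table, key) do
      [{^key, value}] -> value
      [] -> nil
    end
  end

  @impl true
  def init(desired) do
    :ets.new(@table, [:set, :protected, :named_table, read_concurrency: true])

    case load(desired) do
      :ok -> {:ok, %{desired: desired}}
      {:error, env_var} -> {:stop, {:could_not_load_config, env_var}}
    end
  end

  # Copies each environment variable into the table; halts on the first
  # variable that is not set.
  defp load(desired) do
    Enum.reduce_while(desired, :ok, fn {key, env_var}, :ok ->
      case System.get_env(env_var) do
        nil ->
          {:halt, {:error, env_var}}

        value ->
          :ets.insert(@table, {key, value})
          {:cont, :ok}
      end
    end)
  end
end
```

Values read this way are strings, so a caller that needs an integer port would still have to convert it, for example with `String.to_integer/1`.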
  71. Let's talk about… Booting the runtime & configuration; starting dependencies; connecting to external systems; alarms and feedback; communicating with services we don’t control

  72. Let's talk about… Booting the runtime & configuration; starting dependencies; connecting to external systems; alarms and feedback; communicating with services we don’t control

  73. App

  74. App Load Balancer /up

  75. App Load Balancer /up Operators alarms

  76. App

  77. App

  78. App Phoenix

  79. defmodule Jenga.Application do
        use Application

        def start(_type, _args) do
          config = [
            port: "PORT",
            db_url: "DB_URL"
          ]

          children = [
            {Jenga.Config, config}
          ]

          opts = [strategy: :one_for_one, name: Jenga.Supervisor]
          Supervisor.start_link(children, opts)
        end
      end

  80. defmodule Jenga.Application do
        use Application

        def start(_type, _args) do
          config = [
            port: "PORT",
            db_url: "DB_URL"
          ]

          children = [
            {Jenga.Config, config},
            JengaWeb.Endpoint
          ]

          opts = [strategy: :one_for_one, name: Jenga.Supervisor]
          Supervisor.start_link(children, opts)
        end
      end

  81. defmodule JengaWeb.Endpoint do
        use Phoenix.Endpoint, otp_app: :jenga

        def init(_key, config) do
          port = Jenga.Config.get(:port)
          {:ok, Keyword.put(config, :http, [:inet6, port: port])}
        end
      end

  82. defmodule JengaWeb.UpController do
        use JengaWeb, :controller

        def up(conn, _params) do
          {code, message} = status()

          conn
          |> Plug.Conn.put_status(code)
          |> json(message)
        end

        defp status do
          {500, %{status: "LOADING"}}
        end
      end

  83. Let's talk about… Booting the runtime & configuration; starting dependencies; connecting to external systems; alarms and feedback; communicating with services we don’t control

  84. Let's talk about… Booting the runtime & configuration; starting dependencies; connecting to external systems; alarms and feedback; communicating with services we don’t control

  85. App Phoenix

  86. App Phoenix Database

  87. App Phoenix Pool Supervisor Conn Conn Conn

  88. App Phoenix Pool Supervisor Conn Conn Conn Disconnected

  89. “Supervisors are about guarantees” - “Friend of the show” Fred Hebert

  90. App Phoenix Pool Supervisor Conn Conn Conn

  91. App Phoenix Pool Supervisor Conn Conn Conn

  92. App Phoenix Pool Supervisor Conn Conn Conn

  93. App Phoenix Pool Supervisor Conn Conn Conn

  94. App Phoenix Pool Supervisor Conn Conn Conn

  95. defmodule Jenga.DemoConnection do use GenServer end

  96. defmodule Jenga.DemoConnection do
        use GenServer

        def init(opts) do
          wait_for = 3_000 + backoff() + jitter()
          Process.send_after(self(), {:try_connect, opts}, wait_for)
          {:ok, %{state: :disconnected}}
        end
      end

  97. defmodule Jenga.DemoConnection do
        use GenServer

        def init(opts) do
          wait_for = 3_000 + backoff() + jitter()
          Process.send_after(self(), {:try_connect, opts}, wait_for)
          {:ok, %{state: :disconnected}}
        end

        def handle_info({:try_connect, opts}, state) do
          do_connect(opts)
          {:noreply, state}
        end
      end

  98. defmodule Jenga.DemoConnection do
        use GenServer

        def init(opts) do
          wait_for = 3_000 + backoff() + jitter()
          Process.send_after(self(), {:try_connect, opts}, wait_for)
          {:ok, %{state: :disconnected}}
        end

        # Match the same {:try_connect, opts} message that init/1 schedules
        def handle_info({:try_connect, opts}, state) do
          case do_connect(opts) do
            :ok ->
              {:noreply, %{state | state: :connected}}

            :error ->
              wait_for = 3_000 + backoff() + jitter()
              Process.send_after(self(), {:try_connect, opts}, wait_for)
              {:noreply, state}
          end
        end
      end

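The connection slides call `backoff()` and `jitter()` without showing them. As a rough illustration only, the helpers could look like the sketch below. It assumes an exponential backoff keyed on an attempt counter (the slides call `backoff()` with no arguments, so the counter is an addition) and a uniformly random jitter. The module name and all constants other than the 3_000 ms base are hypothetical.

```elixir
defmodule BackoffSketch do
  # All values except the 3_000 ms base are assumptions for illustration.
  @base_ms 3_000
  @step_ms 500
  @max_backoff_ms 30_000
  @max_jitter_ms 1_000

  # Doubles with every failed attempt (500, 1_000, 2_000, ...) up to a cap.
  def backoff(attempt) when is_integer(attempt) and attempt >= 0 do
    exponent = min(attempt, 10)
    min(@step_ms * Integer.pow(2, exponent), @max_backoff_ms)
  end

  # Random offset in 0..@max_jitter_ms so that many connections restarting
  # at once do not all retry at the same instant.
  def jitter do
    :rand.uniform(@max_jitter_ms + 1) - 1
  end

  # Total delay before the next connection attempt, in milliseconds.
  def wait_for(attempt) do
    @base_ms + backoff(attempt) + jitter()
  end
end
```

A connection process could keep the attempt count in its state, pass it to `BackoffSketch.wait_for/1` on each failure, and reset it to zero once a connection succeeds.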
  99. Let's talk about… Booting the runtime & configuration; starting dependencies; connecting to external systems; alarms and feedback; communicating with services we don’t control

  100. Let's talk about… Booting the runtime & configuration; starting dependencies; connecting to external systems; alarms and feedback; communicating with services we don’t control

  101. App Phoenix Pool Supervisor Conn Conn Conn

  102. App Phoenix Pool Supervisor Conn Conn Conn

  103. App Phoenix Pool Supervisor Conn Conn Conn

  104. App Phoenix Pool Supervisor Conn Conn Conn

  105. App Phoenix Pool Supervisor Conn Conn Conn

  106. App Phoenix Pool Supervisor Conn Conn Conn Load Balancer

  107. defmodule JengaWeb.UpController do
         use JengaWeb, :controller

         def up(conn, _params) do
           {code, message} = status()

           conn
           |> Plug.Conn.put_status(code)
           |> json(message)
         end

         defp status do
           {500, %{status: "LOADING"}}
         end
       end

  108. defmodule JengaWeb.UpController do
         use JengaWeb, :controller

         def up(conn, _params) do
           {code, message} = status()

           conn
           |> Plug.Conn.put_status(code)
           |> json(message)
         end

         defp status do
           case Jenga.Database.check_status() do
             :ok -> {200, %{status: "OK"}}
             _ -> {500, %{status: "LOADING"}}
           end
         end
       end

  109. App Phoenix Pool Supervisor Conn Conn Conn Load Balancer

  110. App Phoenix Pool Supervisor Conn Conn Conn

  111. App Phoenix Pool Supervisor Conn Conn Conn Operators alarms

  112. App Phoenix Pool supervisor Operators alarms db_supervisor Watchdog

  113. Watchdog

  114. Watchdog Good Bad Check DB Status

  115. Watchdog Good Bad Check DB Status Open alarm

  116. Watchdog Good Bad Check DB Status Close alarm Open alarm

  117. defmodule Jenga.Database.Watchdog do use GenServer end

  118. defmodule Jenga.Database.Watchdog do
         use GenServer

         def init(:ok) do
           schedule_check()
           {:ok, %{status: :degraded, passing_checks: 0}}
         end
       end

  119. defmodule Jenga.Database.Watchdog do
         use GenServer

         def init(:ok) do
           schedule_check()
           {:ok, %{status: :degraded, passing_checks: 0}}
         end

         def handle_info(:check_db, state) do
           status = Jenga.Database.check_status()
           state = change_state(status, state)
           schedule_check()
           {:noreply, state}
         end
       end

  120. defmodule Jenga.Database.Watchdog do
         use GenServer

         def init(:ok) do
           schedule_check()
           {:ok, %{status: :degraded, passing_checks: 0}}
         end

         def handle_info(:check_db, state) do
           status = Jenga.Database.check_status()
           state = change_state(status, state)
           schedule_check()
           {:noreply, state}
         end

         defp change_state(result, %{status: status, passing_checks: count}) do
         end
       end

  121. defmodule Jenga.Database.Watchdog do
         use GenServer

         # Same alarm ID that Jenga.AlarmHandler matches on
         @alarm_id :database_disconnected

         def init(:ok) do
           schedule_check()
           {:ok, %{status: :degraded, passing_checks: 0}}
         end

         def handle_info(:check_db, state) do
           status = Jenga.Database.check_status()
           state = change_state(status, state)
           schedule_check()
           {:noreply, state}
         end

         defp change_state(result, %{status: status, passing_checks: count}) do
           case {result, status, count} do
             {:ok, :connected, count} ->
               if count == 3 do
                 :alarm_handler.clear_alarm(@alarm_id)
               end

               %{status: :connected, passing_checks: count + 1}

             {:ok, :degraded, _} ->
               %{status: :connected, passing_checks: 0}
           end
         end
       end

  122. defmodule Jenga.Database.Watchdog do
         use GenServer

         # Same alarm ID that Jenga.AlarmHandler matches on
         @alarm_id :database_disconnected

         def init(:ok) do
           schedule_check()
           {:ok, %{status: :degraded, passing_checks: 0}}
         end

         def handle_info(:check_db, state) do
           status = Jenga.Database.check_status()
           state = change_state(status, state)
           schedule_check()
           {:noreply, state}
         end

         defp change_state(result, %{status: status, passing_checks: count}) do
           case {result, status, count} do
             {:ok, :connected, count} ->
               if count == 3 do
                 :alarm_handler.clear_alarm(@alarm_id)
               end

               %{status: :connected, passing_checks: count + 1}

             {:ok, :degraded, _} ->
               %{status: :connected, passing_checks: 0}

             {:error, :connected, _} ->
               :alarm_handler.set_alarm({@alarm_id, "We cannot connect to the database"})
               %{status: :degraded, passing_checks: 0}

             {:error, :degraded, _} ->
               %{status: :degraded, passing_checks: 0}
           end
         end
       end

  123. :alarm_handler.clear_alarm(@alarm_id)
       :alarm_handler.set_alarm({@alarm_id, "We cannot connect to the database"})

  124. defmodule Jenga.Application do
         use Application

         def start(_type, _args) do
           config = [
             port: "PORT",
             db_url: "DB_URL"
           ]

           children = [
             {Jenga.Config, config},
             JengaWeb.Endpoint,
             Jenga.Database.Supervisor
           ]

           opts = [strategy: :one_for_one, name: Jenga.Supervisor]
           Supervisor.start_link(children, opts)
         end
       end

  125. defmodule Jenga.Application do
         use Application

         def start(_type, _args) do
           config = [
             port: "PORT",
             db_url: "DB_URL"
           ]

           :gen_event.swap_handler(
             :alarm_handler,
             {:alarm_handler, :swap},
             {Jenga.AlarmHandler, :ok}
           )

           children = [
             {Jenga.Config, config},
             JengaWeb.Endpoint,
             Jenga.Database.Supervisor
           ]

           opts = [strategy: :one_for_one, name: Jenga.Supervisor]
           Supervisor.start_link(children, opts)
         end
       end

  126. defmodule Jenga.AlarmHandler do
         require Logger

         def init({:ok, {:alarm_handler, _old_alarms}}) do
           Logger.info("Installing alarm handler")
           {:ok, %{}}
         end

         # :alarm_handler.set_alarm/1 delivers the whole {id, description} tuple
         def handle_event({:set_alarm, {:database_disconnected, _description}}, alarms) do
           send_alert_to_slack(database_alarm())
           {:ok, alarms}
         end

         def handle_event({:clear_alarm, :database_disconnected}, alarms) do
           send_recovery_to_slack(database_alarm())
           {:ok, alarms}
         end

         def handle_event(event, state) do
           Logger.info("Unhandled alarm event: #{inspect(event)}")
           {:ok, state}
         end
       end

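To make the event shapes concrete, here is a small runnable sketch that swaps in a handler for SASL's `:alarm_handler` and forwards alarm events to a process of your choice. It is an illustration rather than the talk's code: `AlarmSketch`, `install/1`, and the `{:alarm_set, id}` / `{:alarm_cleared, id}` messages are hypothetical, and the Slack calls from the slide are replaced with a plain `send/2`.

```elixir
defmodule AlarmSketch do
  # Hypothetical handler; forwards alarms to `notify` instead of Slack.
  @behaviour :gen_event

  # Replaces SASL's default alarm handler with this module.
  def install(notify) when is_pid(notify) do
    {:ok, _apps} = Application.ensure_all_started(:sasl)

    :gen_event.swap_handler(
      :alarm_handler,
      {:alarm_handler, :swap},
      {__MODULE__, notify}
    )
  end

  # The second tuple element is whatever the old handler returned from
  # terminate/2; for the default handler that is {:alarm_handler, alarms}.
  @impl true
  def init({notify, _from_previous_handler}), do: {:ok, notify}

  # set_alarm/1 delivers the full {id, description} tuple.
  @impl true
  def handle_event({:set_alarm, {id, _description}}, notify) do
    send(notify, {:alarm_set, id})
    {:ok, notify}
  end

  # clear_alarm/1 delivers only the alarm ID.
  def handle_event({:clear_alarm, id}, notify) do
    send(notify, {:alarm_cleared, id})
    {:ok, notify}
  end

  def handle_event(_other, notify), do: {:ok, notify}

  @impl true
  def handle_call(_request, notify), do: {:ok, :ok, notify}
end
```

After `AlarmSketch.install(self())`, calling `:alarm_handler.set_alarm({:database_disconnected, "We cannot connect to the database"})` should deliver `{:alarm_set, :database_disconnected}` to the calling process, and `:alarm_handler.clear_alarm(:database_disconnected)` should deliver `{:alarm_cleared, :database_disconnected}`.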
  127. Let's talk about… Booting the runtime & configuration; starting dependencies; connecting to external systems; alarms and feedback; communicating with services we don’t control

  128. Let's talk about… Booting the runtime & configuration; starting dependencies; connecting to external systems; alarms and feedback; communicating with services we don’t control

  129. App Other Service Client External Services

  130. App Other Service Client External Services

  131. App Other Service Client External Services

  132. App Other Service Client External Services

  133. App Other Service Client External Services

  134. App Other Service Client External Services

  135. App Other Service Client External Services

  136. Circuit Breakers

  137. defmodule Jenga.ExternalService do
         def fetch(params) do
           with :ok <- :fuse.ask(@fuse, :async_dirty),
                {:ok, result} <- make_call(params) do
             {:ok, result}
           else
             {:error, e} ->
               :ok = :fuse.melt(@fuse)
               {:error, e}

             :blown ->
               {:error, :service_is_down}
           end
         end
       end

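The slide above relies on the `fuse` library. To show the underlying idea without that dependency, here is a deliberately simplified, hand-rolled breaker built on an `Agent`. It is not how `fuse` works internally: the module name, the failure threshold, and the reset window are all assumptions, and it counts failures since the last reset rather than failures within a time window.

```elixir
defmodule BreakerSketch do
  # Hand-rolled illustration; thresholds are hypothetical.
  use Agent

  @max_failures 3
  @reset_ms 10_000

  def start_link(_opts \\ []) do
    Agent.start_link(fn -> %{failures: 0, blown_at: nil} end, name: __MODULE__)
  end

  # Returns :ok when calls may proceed and :blown while the breaker is open.
  # Once the reset window has passed, the breaker closes again.
  def ask(now \\ System.monotonic_time(:millisecond)) do
    Agent.get_and_update(__MODULE__, fn
      %{blown_at: nil} = s -> {:ok, s}
      %{blown_at: t} = s when now - t >= @reset_ms -> {:ok, %{s | failures: 0, blown_at: nil}}
      s -> {:blown, s}
    end)
  end

  # Records one failure and opens the breaker once the threshold is reached.
  def melt(now \\ System.monotonic_time(:millisecond)) do
    Agent.update(__MODULE__, fn s ->
      failures = s.failures + 1
      blown_at = if failures >= @max_failures, do: s.blown_at || now, else: s.blown_at
      %{s | failures: failures, blown_at: blown_at}
    end)
  end

  # Mirrors the shape of fetch/1 on the slide: ask first, melt on error.
  def call(fun) when is_function(fun, 0) do
    with :ok <- ask(),
         {:ok, result} <- fun.() do
      {:ok, result}
    else
      :blown ->
        {:error, :service_is_down}

      {:error, reason} ->
        melt()
        {:error, reason}
    end
  end
end
```

With the breaker started, three calls that return `{:error, reason}` open it, and further calls return `{:error, :service_is_down}` without invoking the function until the reset window has elapsed.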
  138. App Other Service Client External Services

  139. App Other Service Client External Services

  140. App Other Service Client External Services ETS

  141. App Other Service Client External Services ETS

  142. App Other Service Client External Services ETS

  143. App Other Service Client External Services ETS

  144. Circuit Breakers

  145. Additive Increase Multiplicative Decrease
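Additive increase, multiplicative decrease (AIMD) can be sketched as a pure function over a concurrency limit: grow the limit by a fixed step after each success and cut it by a factor after each failure. The module name, bounds, step, and factor below are assumptions chosen for illustration; the slides do not specify them.

```elixir
defmodule AimdSketch do
  # Hypothetical bounds and step sizes.
  @min_limit 1
  @max_limit 100
  @increase 1
  @decrease_factor 0.5

  # Success: raise the limit by a fixed amount, up to the ceiling.
  def adjust(limit, :ok), do: min(limit + @increase, @max_limit)

  # Failure: cut the limit by a constant factor, down to the floor.
  def adjust(limit, :error), do: max(trunc(limit * @decrease_factor), @min_limit)
end
```

Starting from a limit of 10, the outcomes `[:ok, :ok, :error, :ok]` move the limit through 11, 12, 6, and finally 7, which shows how a single failure undoes many successes and keeps pressure off a struggling service.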

  146. Let's talk about… Booting the runtime & configuration; starting dependencies; connecting to external systems; alarms and feedback; communicating with services we don’t control

  147. We booted our application!

  148. Now what?

  149. Systems should… Handle failures gracefully; provide feedback to other systems; give insight to operators

  150. Systems should… Handle failures gracefully; provide feedback to other systems; give insight to operators

  151. Systems should… Handle failures gracefully; provide feedback to other systems; give insight to operators

  152. Systems should… Handle failures gracefully; provide feedback to other systems; give insight to operators

  153. Systems should… Handle failures gracefully; provide feedback to other systems; give insight to operators

  154. We have powerful tools in our runtime

  155. Take advantage of them to build more robust systems

  156. Thanks Chris Keathley / @ChrisKeathley / keathley.io