Back on your Feet

When writing resilient Elixir applications, one of our major concerns is state: where we store it, what happens to it when a process crashes, and how we can efficiently recreate it.
In this talk, we'll look at an example application and apply different techniques that can be used to protect and recover state, each one with specific implications and tradeoffs.
All of the techniques shown in this talk can be applied to everyday development.

Claudio Ortolina

September 07, 2017

Transcript

  1. BACK ON YOUR FEET Or how I learned to love

    state
  2. HELLO! Claudio Ortolina Head of Elixir @ Erlang Solutions Ltd.

    claudio.ortolina@erlang-solutions.com @cloud8421
  3. ABOUT STATE Data that changes over time

  4. Useful Difficult to model Difficult to maintain Can be lost

    (and needs to be recovered)
  5. IT REQUIRES THINKING

  6. OUT OF THE TAR PIT* B. Moseley, P. Marks, 2006 *Link

  7. OUR USE CASE: GIGS NEAR ME For the concert connoisseur

  8. HOW IT WORKS ➤ Given my location (defined by lat/lng),

    get a list of relevant gigs ➤ For each one of them, get the artists involved ➤ For each artist, get their latest release
  9. EXAMPLE GET /gigs/47.6183116/-122.20037839999999

  10. DATA MODEL LOCATION has many EVENTS have many ARTISTS have

    one RELEASE
  11. DATA MODEL

        defmodule Gig.Location do
          defstruct coords: {0, 0}, metro_area: nil, event_ids: []
        end

        defmodule Gig.Event do
          defstruct id: nil, name: nil, artists: [], venue: nil, starts_at: :not_available
        end
  12. DATA MODEL

        defmodule Gig.Artist do
          defstruct id: nil, mbid: nil, name: nil
        end

        defmodule Gig.Release do
          defstruct id: nil, title: nil, type: "Album", release_date: :not_available
        end
  13. PAIN POINTS ➤ Cannot query APIs in real-time (too expensive,

    N+1 api calls) ➤ Both APIs are rate-limited ➤ Need to cache data ➤ Results update over time without us knowing anything about it (making polling necessary)
  14. ITERATION ONE

  15. ONE LOCATION, ONE PROCESS ➤ For each location, start a

    new process ➤ We use Registry to track them ➤ Each process fetches and refreshes its own data
  16. PROCESS LIFECYCLE Time INIT FETCH EVENTS cast FETCH RELEASES cast

    TERMINATE send_after :expire
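The lifecycle on slides 15 and 16 could be sketched roughly as follows. This is a minimal illustration, not the code from the talk: the module name `Gig.LocationServer`, the registry name `Gig.LocationRegistry`, the `:fetcher` and `:ttl` options and the one-hour default are all assumptions. Fetching releases would follow the same cast pattern as fetching events and is left out for brevity.

```elixir
defmodule Gig.LocationServer do
  # Hypothetical module: one GenServer per {lat, lng} pair, tracked via Registry.
  use GenServer

  @registry Gig.LocationRegistry      # assumed Registry name
  @default_ttl :timer.hours(1)        # assumed lifetime before self-termination

  def start_link(coords, opts \\ []) do
    GenServer.start_link(__MODULE__, {coords, opts}, name: via(coords))
  end

  def events(coords), do: GenServer.call(via(coords), :events)

  defp via(coords), do: {:via, Registry, {@registry, coords}}

  @impl true
  def init({coords, opts}) do
    # The fetcher is injected so the sketch does not depend on a real API client.
    fetcher = Keyword.get(opts, :fetcher, fn _coords -> [] end)
    ttl = Keyword.get(opts, :ttl, @default_ttl)

    # INIT: schedule the first fetch and the expiry, then return immediately.
    GenServer.cast(self(), :fetch_events)
    Process.send_after(self(), :expire, ttl)

    {:ok, %{coords: coords, events: [], fetcher: fetcher}}
  end

  @impl true
  def handle_cast(:fetch_events, state) do
    {:noreply, %{state | events: state.fetcher.(state.coords)}}
  end

  @impl true
  def handle_call(:events, _from, state), do: {:reply, state.events, state}

  @impl true
  def handle_info(:expire, state) do
    # TERMINATE: a normal stop; Registry drops the entry automatically.
    {:stop, :normal, state}
  end
end
```

Because the cast is sent from `init/1`, it sits in the mailbox ahead of any call made after `start_link/2` returns, so the first read already sees fetched data. All state lives inside the process, which is why a crash loses it.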
  17. PROS ➤ Basic isolation (an isolated process crash doesn't affect

    others) ➤ Scales predictably (memory usage) ➤ Easy expire (self-terminate the process)
  18. CONS ➤ A repeated failure of a single process can

    take down the application tree ➤ Events are duplicated among processes
  19. ITERATION TWO

  20. EXTRACT DATA STORAGE ➤ Events and releases moved to shared

    ETS tables ➤ Process keeps location data + list of event ids ➤ Requires periodic cleanup of storage (in case of crashes, data may get stale)
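One way to picture the shared storage on slide 20 is a small wrapper around named, public ETS tables. The module name `Gig.Store`, the table names and the function names below are assumptions made for this sketch.

```elixir
defmodule Gig.Store do
  # Hypothetical storage module: events and releases live in shared ETS tables
  # instead of inside each location process.
  @events :gig_events
  @releases :gig_releases

  # Creates the tables. The calling process becomes their owner.
  def init do
    for table <- [@events, @releases] do
      :ets.new(table, [:named_table, :public, :set, read_concurrency: true])
    end

    :ok
  end

  def put_event(%{id: id} = event), do: :ets.insert(@events, {id, event})

  def get_event(id), do: lookup(@events, id)

  # Releases are keyed by the artist they belong to (an assumption of this sketch).
  def put_release(artist_id, release), do: :ets.insert(@releases, {artist_id, release})

  def get_release(artist_id), do: lookup(@releases, artist_id)

  defp lookup(table, key) do
    case :ets.lookup(table, key) do
      [{^key, value}] -> {:ok, value}
      [] -> :error
    end
  end
end
```

An ETS table is destroyed when its owner process exits, so `init/0` would normally be called from a long-lived process near the top of the supervision tree. That is what lets the data outlive crashes of the individual location processes.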
  21. PROCESS LIFECYCLE Time INIT TERMINATE send_after :expire ETS FETCH EVENTS

    cast FETCH RELEASES cast
  22. PROS ➤ More efficient memory usage ➤ Concurrent reads and

    writes ➤ Data survives everything except a node crash
  23. CONS ➤ A process crash loses the relationship between location

    and events
  24. ITERATION THREE

  25. MORE EXTRACTIONS ➤ Move locations to ETS ➤ Don’t go

    through the process for any reads ➤ The process is only responsible for refresh and expire
  26. PROCESS LIFECYCLE Time INIT TERMINATE send_after :expire ETS FETCH EVENTS

    cast FETCH RELEASES cast STORE LOCATION cast
  27. PROS ➤ Fast concurrent lookup for everything ➤ Survives refresh

    crashes (worst case scenario is stale data)
  28. CONS ➤ Stale data requires explicit information about its nature,

    e.g. "6 hours old" ➤ Requires sweep
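The two points on slide 28 can be illustrated by storing a fetch timestamp next to each location, so that reads can report the age of the data and a periodic sweep can delete old rows. This is a sketch under assumptions: the module name `Gig.LocationStore`, the table name and the row layout `{coords, event_ids, fetched_at}` are not from the talk.

```elixir
defmodule Gig.LocationStore do
  # Hypothetical module: locations in ETS, each row tagged with its fetch time.
  @table :gig_locations

  def init do
    :ets.new(@table, [:named_table, :public, :set, read_concurrency: true])
    :ok
  end

  # `now` is a parameter so that callers (and tests) can control time.
  def put(coords, event_ids, now \\ System.system_time(:second)) do
    :ets.insert(@table, {coords, event_ids, now})
  end

  # Returns the event ids together with their age in seconds, so the caller
  # can tell the consumer how stale the data is.
  def get(coords, now \\ System.system_time(:second)) do
    case :ets.lookup(@table, coords) do
      [{^coords, event_ids, fetched_at}] -> {:ok, event_ids, now - fetched_at}
      [] -> :error
    end
  end

  # Deletes every row fetched more than `max_age` seconds ago and returns
  # the number of deleted rows.
  def sweep(max_age, now \\ System.system_time(:second)) do
    cutoff = now - max_age
    match_spec = [{{:_, :_, :"$1"}, [{:<, :"$1", cutoff}], [true]}]
    :ets.select_delete(@table, match_spec)
  end
end
```

A separate process would typically call `sweep/2` on a timer. Reads never go through that process, which matches the idea that a refresh crash leaves stale data at worst.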
  29. ITERATION FOUR

  30. GOING DISTRIBUTED ➤ Discrete pieces of data linked by references

    (event ids, musicbrainz ids) ➤ If any reference points to non-existent data, we can trigger a refresh and expose the inconsistent state to the api consumer, so that the user has the right expectations ➤ For sharding on normal distribution, we can replace ETS with shards (or equivalent) ➤ Another option is a classic external datastore (which allows horizontal scaling)
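The second point on slide 30, resolving references and surfacing the ones that point nowhere, might look like this. `Gig.Resolver`, its return shapes and the `on_missing` callback are hypothetical names chosen for the sketch. The lookup function is passed in, so the same code works against ETS, a sharded table or an external datastore.

```elixir
defmodule Gig.Resolver do
  # Hypothetical helper: turns a list of ids into records and makes missing
  # references explicit instead of silently dropping them.
  #
  # `lookup` must return {:ok, record} or :error for a given id.
  # `on_missing` receives the missing ids, e.g. to trigger a refresh.
  def resolve(ids, lookup, on_missing \\ fn _ids -> :ok end) do
    results = Enum.map(ids, fn id -> {id, lookup.(id)} end)

    found = for {_id, {:ok, record}} <- results, do: record
    missing = for {id, :error} <- results, do: id

    case missing do
      [] ->
        {:complete, found}

      _ ->
        on_missing.(missing)
        {:partial, found, missing}
    end
  end
end
```

For example, with a plain map standing in for the store:

```elixir
store = %{1 => :gig_a, 3 => :gig_c}
Gig.Resolver.resolve([1, 2], &Map.fetch(store, &1))
# returns {:partial, [:gig_a], [2]}
```

The `:partial` tag is what gets exposed to the api consumer, so the client knows the list is incomplete.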
  31. RATE LIMIT ➤ Interaction with external apis can be modelled with

    a queue which fetches respecting a rate limit ➤ Api client should also use a rate limiter to avoid being blocked ➤ Load testing with rate limit is key
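A queue that fetches while respecting a rate limit can be approximated with a GenServer that runs at most one job per tick. This is a deliberately simple sketch: the module name `Gig.RateLimitedQueue`, the `:interval` option and the one-second default are assumptions, and real rate limits are often expressed as a budget per time window rather than a fixed spacing.

```elixir
defmodule Gig.RateLimitedQueue do
  # Hypothetical module: jobs are zero-arity functions, executed one per tick.
  use GenServer

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, opts)
  end

  def enqueue(pid, job) when is_function(job, 0) do
    GenServer.cast(pid, {:enqueue, job})
  end

  @impl true
  def init(opts) do
    interval = Keyword.get(opts, :interval, 1_000)
    schedule(interval)
    {:ok, %{queue: :queue.new(), interval: interval}}
  end

  @impl true
  def handle_cast({:enqueue, job}, state) do
    {:noreply, %{state | queue: :queue.in(job, state.queue)}}
  end

  @impl true
  def handle_info(:tick, state) do
    schedule(state.interval)

    case :queue.out(state.queue) do
      {{:value, job}, rest} ->
        # The job runs inside the queue process, so a slow job delays the
        # following ones. Handing it to a Task would avoid that.
        job.()
        {:noreply, %{state | queue: rest}}

      {:empty, _queue} ->
        {:noreply, state}
    end
  end

  defp schedule(interval), do: Process.send_after(self(), :tick, interval)
end
```

Jobs run in the order they were enqueued. A crash inside a job takes the queue and its pending jobs down with it, which is the kind of trade-off the talk asks to be considered up front.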
  32. SUMMARY

  33. FOCUS ON STATUSES Exposing status (started, fetching, fetched) forces clients

    to handle all corner cases
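As an illustration of exposing statuses, each of the three states named on the slide can get its own clause when building a response. The module name `Gig.Response`, the tuple shapes and the response keys are assumptions made for this sketch.

```elixir
defmodule Gig.Response do
  # Hypothetical response builder: one explicit clause per status, so an
  # unhandled status fails loudly instead of producing a misleading payload.

  # The location process exists but no fetch has happened yet.
  def build(:started), do: %{status: "started", events: []}

  # A fetch is in flight; return whatever is cached in the meantime.
  def build({:fetching, cached_events}) do
    %{status: "fetching", events: cached_events}
  end

  # Data is available; include its age so the client can judge staleness.
  def build({:fetched, events, age_in_seconds}) do
    %{status: "fetched", age_in_seconds: age_in_seconds, events: events}
  end
end
```

A client receiving `"started"` or `"fetching"` knows it has to retry or show a loading state, rather than mistaking an empty list for "no gigs nearby".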
  34. ASSUME DATA INCONSISTENCY Sooner or later something will crash. Focus

    on writing code that recovers as efficiently as possible
  35. HANDLE DISTRIBUTION RESPONSIBLY Powerful, but may introduce more inconsistent scenarios

  36. KEEP A WIDE ARSENAL OF TOOLS Queues, rate limiters, backoffs

    are only a few examples. Resilient design requires planning.
  37. THANKS! Any questions?

  38. HELLO! Claudio Ortolina Head of Elixir @ Erlang Solutions Ltd.

    claudio.ortolina@erlang-solutions.com @cloud8421