Elixir and the Internet of Things - Handling a Stampede

When Ruby/EventMachine failed to scale to the number of simultaneous connections we needed, we took a look at Elixir and the Erlang runtime and found it much easier to meet our performance requirements. Learn what an acceptor pool is and why you don't have to build your own, and what Erlang/OTP restart strategies mean for the runtime behavior of your system.

Doug Rohrer

July 22, 2014

Transcript

  1. Elixir and the Internet of Things: Handling a Stampede
     Doug Rohrer
     CC by https://www.flickr.com/people/t3rmin4t0r/
  2. Me
     • Professional software developer for over 20 years
     • Have written code in many languages on many operating systems
     • Often called upon to understand and mitigate performance issues on projects
     • Had never touched Elixir or Erlang (or much FP) before this project
  3. The Problem
     • Tens of thousands of connected devices
     • Sometimes all connecting in a short period of time
     • Need to process many messages in a short period of time (round-trip time is critical, as users are waiting for doors to unlock)
     • Beyond functionally routing messages, the number one requirement for this application was to scale to very large numbers (200k) of simultaneous connections
  4. Our Process
     • Built a performance testing tool to evaluate solutions
         – Written in Go, based on a tool used on another project written by Geoff Lane
         – Essentially an acceptance-testing tool for performance
     • Built a potential solution (Ruby, Elixir)
     • Ran the tool and evaluated results
     • Profiled applications to find and fix bottlenecks, then retested
     • Abandoned the solution when there were no clear bottlenecks to fix but performance was inadequate
  5. We Tried Ruby/EventMachine
     • Single-threaded
     • High CPU overhead for AMQP
     • Unbalanced I/O
         – Would often do lots of “I” before any “O”
     • Hard to understand behavior
         – Event loop is opaque
         – Keep throwing “next_tick” callbacks in until the behavior is better.
     • Feels very “write-once, modify never”
  6. We Tried (J)Ruby/Celluloid
     • Programming was better than EventMachine
         – Agents are easier to understand than lots of callbacks
     • Uses Fibers
         – Which end up being real, native threads on JRuby
         – Which caused thread exhaustion on Linux without tweaking
     • Uses memory
         – Lots of memory
         – Really, a lot of memory
         – We may have done something wrong to cause this, but couldn’t find it in the time allotted for the Celluloid spike.
  7. Why Not Ruby/JRuby?
     • Inability to handle connection requirements
     • Concurrency is hard - Ruby doesn’t concurrency
     • High CPU requirements for parsing data from RabbitMQ
         – This was surprising
         – We fixed some AMQP gem issues, but couldn’t resolve the CPU utilization issues in the time box we had for testing solutions
  8. Why Try Erlang?
     • Erlang/OTP was designed to handle this kind of problem
     • {ok, concurrency} = Erlang.use
         – Concurrency is built into the runtime
         – Actor model is easy to implement
         – Known for handling the load levels we needed
     • Clustering is (almost) code free
         – We need to load balance across machines
         – Erlang allows us to do this with minimal additional coding
  9. Why Elixir and not Erlang?
     • The client is predominantly a Ruby shop, and Elixir was a more comfortable choice for them as it has a more familiar syntax than Erlang
     • While young, Elixir provides a much more comfortable syntax for Rubyists.
     • Can still use all Erlang libraries
     • Compiles to Erlang bytecode and runs on the Erlang VM
     • For all practical purposes, Elixir >= Erlang for us
  10. Processes
     • Are not OS processes - much more lightweight (100,000 on a single Erlang instance is not unheard of)
     • Are how you maintain state (Actor model)
     • Communicate with other processes via messages and mailboxes
     • Process messages in FIFO order
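
The points on this slide can be sketched in a few lines of Elixir. This is a minimal illustration, not code from the project: the `Counter` module and its message shapes are hypothetical. It shows a process holding state in its receive loop, other code talking to it only through messages, and the mailbox being drained in arrival order.

```elixir
defmodule Counter do
  # Start a lightweight (non-OS) process whose only state is an integer.
  def start(initial \\ 0), do: spawn(fn -> loop(initial) end)

  # The receive loop. State lives in the function argument, and each
  # message is taken from the mailbox in the order it arrived.
  defp loop(count) do
    receive do
      {:add, n} ->
        loop(count + n)

      {:get, caller} ->
        send(caller, {:count, count})
        loop(count)
    end
  end
end

pid = Counter.start()
send(pid, {:add, 1})
send(pid, {:add, 2})
send(pid, {:get, self()})

# Both :add messages were sent before :get, so they are handled first.
total =
  receive do
    {:count, value} -> value
  after
    1_000 -> :timeout
  end

# Spawning thousands of these processes is cheap.
pids = for _ <- 1..10_000, do: Counter.start()

IO.inspect({total, length(pids)})
```

The script prints `{3, 10000}`: the two additions are applied before the read because messages between one sender and one receiver keep their order.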
  11. OTP - Open Telecom Platform
     • Abstraction above simple processes that provides templates for actors
     • Supervisors and Workers
         – If worker dies, supervisor is notified and can take action
     • Defined, well understood pattern for handling state and message passing among processes
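
A worker under a supervisor might look like the following sketch. It uses the `GenServer` and `Supervisor` modules from current Elixir, whose API postdates this 2014 talk; `Sketch.Worker` and `Sketch.Wait` are hypothetical names, not modules from the project. The worker is killed on purpose to show the supervisor starting a replacement with fresh state.

```elixir
defmodule Sketch.Worker do
  use GenServer

  def start_link(initial) do
    GenServer.start_link(__MODULE__, initial, name: __MODULE__)
  end

  def put(value), do: GenServer.cast(__MODULE__, {:put, value})
  def get, do: GenServer.call(__MODULE__, :get)

  @impl true
  def init(initial), do: {:ok, initial}

  @impl true
  def handle_cast({:put, value}, _state), do: {:noreply, value}

  @impl true
  def handle_call(:get, _from, state), do: {:reply, state, state}
end

defmodule Sketch.Wait do
  # Poll until `name` is registered to a pid other than `old_pid`.
  def for_replacement(name, old_pid) do
    case Process.whereis(name) do
      pid when is_pid(pid) and pid != old_pid ->
        pid

      _ ->
        Process.sleep(10)
        for_replacement(name, old_pid)
    end
  end
end

{:ok, _sup} = Supervisor.start_link([{Sketch.Worker, :fresh}], strategy: :one_for_one)

Sketch.Worker.put(:changed)
before_crash = Sketch.Worker.get()

# Kill the worker. The supervisor is notified and starts a new one.
old_pid = Process.whereis(Sketch.Worker)
Process.exit(old_pid, :kill)
new_pid = Sketch.Wait.for_replacement(Sketch.Worker, old_pid)
after_crash = Sketch.Worker.get()

IO.inspect({before_crash, after_crash, new_pid != old_pid})
```

The replacement worker starts from its initial argument (`:fresh`), so any state the crashed process held is gone unless it was stored elsewhere.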
  12. Supervisor Tree (diagram; node labels as they appear on the slide): Owsla supervisor, Tcp Supervisor, Tcp Listener Supervisor, Tcp Acceptor (1000), Tcp Endpoint Supervisor, Tcp Endpoint (N), Rabbit Producer (1), Rabbit Consumer (8)
  13. Why do I need an Acceptor Pool?
     • Single-threaded listen/accept loops artificially limit connections/second
     • Acceptors can all share the listen socket, with only 1 being signaled for each connection
     • Allows for concurrent acceptance of a very high number of connections/second (in our case, the goal was 1000/second)
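
The idea of an acceptor pool can be sketched with Erlang's `:gen_tcp` directly. This is an illustration under assumptions, not the project's code and not how Ranch is implemented: `Sketch.AcceptorPool` is a hypothetical module, the pool size is arbitrary, and the handler only echoes one message. Several processes block in `:gen_tcp.accept/1` on the same listen socket, and each incoming connection wakes exactly one of them.

```elixir
defmodule Sketch.AcceptorPool do
  # Open one listen socket and start `size` acceptors that all share it.
  # Passing port 0 asks the OS for a free port, which is returned.
  def start(port, size) do
    {:ok, listen} =
      :gen_tcp.listen(port, [:binary, active: false, reuseaddr: true, backlog: 1024])

    {:ok, actual_port} = :inet.port(listen)
    for _ <- 1..size, do: spawn_link(fn -> accept_loop(listen) end)
    {:ok, listen, actual_port}
  end

  defp accept_loop(listen) do
    case :gen_tcp.accept(listen) do
      {:ok, socket} ->
        # Hand the connection to its own process so this acceptor can
        # go straight back to accepting.
        handler =
          spawn(fn ->
            receive do
              :socket_ready -> echo_once(socket)
            end
          end)

        :ok = :gen_tcp.controlling_process(socket, handler)
        send(handler, :socket_ready)
        accept_loop(listen)

      {:error, _reason} ->
        # Listen socket closed (or failed): let this acceptor finish.
        :ok
    end
  end

  # Placeholder connection handler: echo one message and close.
  defp echo_once(socket) do
    case :gen_tcp.recv(socket, 0) do
      {:ok, data} -> :gen_tcp.send(socket, data)
      {:error, _reason} -> :ok
    end

    :gen_tcp.close(socket)
  end
end

{:ok, listen, port} = Sketch.AcceptorPool.start(0, 10)
{:ok, client} = :gen_tcp.connect({127, 0, 0, 1}, port, [:binary, active: false])
:ok = :gen_tcp.send(client, "ping")
{:ok, reply} = :gen_tcp.recv(client, 0, 1_000)
:gen_tcp.close(client)
:gen_tcp.close(listen)

IO.inspect(reply)
```

A production pool would also supervise the acceptors and the connection handlers, which is part of what a library takes care of.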
  14. Should I build my own like you did?
     • NO
         – We didn’t really know what an acceptor pool was when we started writing this code
         – Had we known, we would have found an existing library and used it instead.
     • What libraries are out there?
         – Writing Erlang? Check out Ranch: https://github.com/extend/ranch
         – If you’re writing Elixir, check out reagent: https://github.com/meh/reagent
  15. Houston, We Have a Problem
     • Load-testing tool would open 5-10k connections per instance to our server
     • We’d run several instances of the load tool while testing
     • Every time we stopped one of them, the entire TcpSupervisor tree would crash/restart, taking down all of the live connections
  16. Restart Strategies
     • one_for_one – If a child crashes, restart it
     • one_for_all – If any child crashes, restart all of them
     • rest_for_one – If a child crashes, it and all children started after it are restarted
     • simple_one_for_one – Like one_for_one, but designed for starting children dynamically.
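
The difference between the first three strategies can be observed directly. The sketch below is an illustration using current Elixir's `Supervisor` (not the 2014 API); `Sketch.Strategies` and the child ids `:a`, `:b`, `:c` are hypothetical. It starts three children in order, kills the middle one, and reports which children ended up with a new pid. `simple_one_for_one` is left out because current Elixir covers that case with `DynamicSupervisor` instead.

```elixir
defmodule Sketch.Strategies do
  # Start children :a, :b, :c (in that order) under `strategy`, kill :b,
  # and return the ids of the children that were restarted.
  def restarted_after_killing_b(strategy) do
    children =
      for id <- [:a, :b, :c] do
        %{id: id, start: {Agent, :start_link, [fn -> id end]}}
      end

    {:ok, sup} = Supervisor.start_link(children, strategy: strategy)
    old = pids(sup)

    Process.exit(old[:b], :kill)
    new = wait_for_restart(sup, old[:b])

    Supervisor.stop(sup)
    for id <- [:a, :b, :c], old[id] != new[id], do: id
  end

  # Map of child id => current pid, as reported by the supervisor.
  defp pids(sup) do
    for {id, pid, _type, _modules} <- Supervisor.which_children(sup), into: %{} do
      {id, pid}
    end
  end

  # Poll until the supervisor reports a new pid for :b. The supervisor
  # handles one crash in a single step, so the snapshot is consistent.
  defp wait_for_restart(sup, old_b) do
    current = pids(sup)

    if is_pid(current[:b]) and current[:b] != old_b do
      current
    else
      Process.sleep(10)
      wait_for_restart(sup, old_b)
    end
  end
end

IO.inspect(Sketch.Strategies.restarted_after_killing_b(:one_for_one))
IO.inspect(Sketch.Strategies.restarted_after_killing_b(:one_for_all))
IO.inspect(Sketch.Strategies.restarted_after_killing_b(:rest_for_one))
```

The three calls print `[:b]`, `[:a, :b, :c]`, and `[:b, :c]` respectively.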
  17. Maximum Restart Frequency
     • MaxT - The time window in which to measure restarts (in seconds) - default is 5
     • MaxR - The maximum number of restarts that can occur within MaxT seconds before the supervisor will tear itself down - default is 5
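
To see the limit in action, the following sketch sets both values explicitly rather than relying on defaults. In current Elixir they are the `:max_restarts` and `:max_seconds` options of `Supervisor.start_link/2`. `Sketch.Intensity` is a hypothetical module, and the 200 ms pause is a simplifying assumption that gives the supervisor time to finish each restart. The child is killed repeatedly until the supervisor gives up.

```elixir
defmodule Sketch.Intensity do
  # Start a supervisor with the given limits and crash its only child
  # until the supervisor itself exits. Returns {number_of_kills, reason}.
  def run(max_restarts, max_seconds) do
    # Trap exits so the supervisor's shutdown arrives as a message
    # instead of taking this process down with it.
    Process.flag(:trap_exit, true)

    child = %{id: :flaky, start: {Agent, :start_link, [fn -> :ok end]}}

    {:ok, sup} =
      Supervisor.start_link([child],
        strategy: :one_for_one,
        max_restarts: max_restarts,
        max_seconds: max_seconds
      )

    crash_until_supervisor_exits(sup, 0)
  end

  defp crash_until_supervisor_exits(sup, kills) do
    [{:flaky, pid, _type, _modules}] = Supervisor.which_children(sup)
    Process.exit(pid, :kill)

    receive do
      {:EXIT, ^sup, reason} -> {kills + 1, reason}
    after
      # No exit from the supervisor: assume the child was restarted.
      200 -> crash_until_supervisor_exits(sup, kills + 1)
    end
  end
end

IO.inspect(Sketch.Intensity.run(2, 5))
```

With `max_restarts: 2` the first two crashes are absorbed and the third one, still inside the 5-second window, makes the supervisor exit with reason `:shutdown`, so the script prints `{3, :shutdown}`. A burst of children exiting together can exhaust this budget very quickly, which fits the symptom described on the "Houston, We Have a Problem" slide.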
  18. Child Specifications: Restart
     • permanent - always restart this child when it crashes (unless MaxR is exceeded)
     • transient - restart if the child ‘terminates abnormally’
     • temporary - never restart, even if supervisor restart strategy tells you otherwise. These, therefore, aren’t counted when looking at MaxR
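
The three restart values can be compared with a small experiment. This is a sketch using current Elixir child-spec maps; `Sketch.RestartTypes`, the `:conn` id, and the 100 ms wait are assumptions made for illustration. Each run starts one child with the given `:restart` value, ends it either normally or with a kill, and checks whether the supervisor started a replacement.

```elixir
defmodule Sketch.RestartTypes do
  # Start one child with the given restart value, end it normally or by
  # killing it, and report whether the supervisor started a replacement.
  def restarted?(restart, how) when how in [:normal, :crash] do
    child = %{id: :conn, restart: restart, start: {Agent, :start_link, [fn -> :ok end]}}
    {:ok, sup} = Supervisor.start_link([child], strategy: :one_for_one)
    [{:conn, old_pid, _type, _modules}] = Supervisor.which_children(sup)

    case how do
      :normal -> Agent.stop(old_pid)
      :crash -> Process.exit(old_pid, :kill)
    end

    # Give the supervisor a moment to react to the exit.
    Process.sleep(100)

    result =
      case Supervisor.which_children(sup) do
        [{:conn, new_pid, _, _}] when is_pid(new_pid) and new_pid != old_pid -> true
        # No live replacement (a temporary child is removed entirely).
        _ -> false
      end

    Supervisor.stop(sup)
    result
  end
end

for restart <- [:permanent, :transient, :temporary], how <- [:normal, :crash] do
  IO.inspect({restart, how, Sketch.RestartTypes.restarted?(restart, how)})
end
```

Only `:permanent` children come back after a normal exit, `:transient` children come back only after a crash, and `:temporary` children never come back. For short-lived per-connection processes, `:temporary` is therefore a natural fit.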
  19. Resources/Questions?
     • http://elixir-lang.org
     • http://erlang.org
     • irc: #elixir-lang
     • http://jeetkundoug.wordpress.com/2014/01/13/elixir-and-the-internet-of-things
     • Me: @jeetkundoug
     • My partners in crime on this project
         – @eymiha - Dave Anderson
         – @jvoegele - Jason Voegele