Elixir and the Internet of Things - IoT Day Norway Edition

Doug Rohrer
November 06, 2014

The Internet of Things is upon us and being able to efficiently interconnect things will become increasingly difficult as more of them become available. How can we handle the stampede of connecting devices efficiently? When Ruby failed to live up to the task, we turned to Elixir and Erlang and were astonished by the results.

This session will showcase the use of Elixir for a real-world IoT system. We will walk through the actual code and discuss some of the struggles we faced while learning (enough) Erlang and Elixir to get things up and running. By the end of the presentation, you will have a basic understanding of Erlang’s OTP behaviors, realize the importance of restart strategies, know what an “Acceptor Pool” is, and see how we built one without knowing it.

Transcript

  1. Elixir and the Internet of Things: Handling a Stampede
     Doug Rohrer, Basho Technologies
     January 14, 2014
     CC by https://www.flickr.com/people/t3rmin4t0r/
  2. Me
     • Professional software developer for over 20 years
     • Polyglot developer on many operating systems
     • Often called upon to understand and mitigate performance issues on projects
     • Had never touched Elixir or Erlang (or much FP) before this project
  3. The Problem
     • 100s of thousands of connected devices
     • Sometimes all connecting in a short period of time
     • Need to process many messages in a short period of time (round-trip time is critical, as users are waiting for doors to unlock)
     • Beyond functionally routing messages, the number one requirement for this application was to scale to very large (200k) numbers of simultaneous connections
  4. Performance-Driven Development
     • Built a performance testing tool (in Go)
     • Built a potential solution (Ruby, Elixir)
     • Test
     • Evaluate
     • Profiled applications to find and fix bottlenecks
     • Repeat until there were no clear bottlenecks to fix
     • Did it work?
       – Yes - keep it
       – No - move on to another option
  5. We Tried Ruby/EventMachine
     • Pros
       – Existing solution
     • Cons
       – Single-threaded event loop (callback hell)
       – High CPU overhead for AMQP
       – Unbalanced I/O - would do all of the input before any output
  6. We Tried Ruby/Celluloid.io
     • Pros
       – Agents are easier to understand than lots of callbacks and next_tick calls
     • Cons
       – Uses Fibers, which are native threads on JRuby
       – Uses LOTS of memory (40GB for ~40k connections)
  7. Why Not Ruby/JRuby?
     • Inability to handle connection requirements
     • Concurrency is hard - Ruby doesn’t concurrency
     • Inability to handle push-based RabbitMQ subscriptions
       – A single RabbitMQ server could push data faster than Ruby/EventMachine (with 8 instances running) could handle it
       – We fixed some AMQP gem issues, but couldn’t resolve the CPU utilization/event loop ‘yielding’ issues in the time box we had for testing solutions
  8. Why Try Erlang?
     • Erlang/OTP was designed to handle this kind of problem
     • {ok, concurrency} = erlang:use()
       – Concurrency is built into the runtime
       – Actor model is easy to implement
       – Known for handling the load levels we needed
     • Clustering is (almost) code free
       – We need to load balance across machines
       – Erlang allows us to do this with minimal additional coding
  9. Why Elixir and not Erlang?
     • The client is predominantly a Ruby shop, and Elixir was a more comfortable choice for them as it has a more familiar syntax than Erlang
     • While young, Elixir provides a much more comfortable syntax for Rubyists (this was key for the POC)
     • Can still use all Erlang libraries
     • Compiles to BEAM bytecode and runs on the Erlang VM
  10. Processes
     • Are not OS processes - much more lightweight (100,000 on a single Erlang instance is not uncommon, millions per instance are doable)
     • Can maintain state using recursive calls
     • Communicate with other processes via messages and mailboxes
     • Process messages in FIFO order
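The process model on this slide can be sketched in a few lines. This is a minimal illustration written against current Elixir, not code from the project; the `Counter` module and its messages are hypothetical names chosen for the example.

```elixir
defmodule Counter do
  # Spawn a lightweight BEAM process; its state is the argument of loop/1.
  def start(initial \\ 0), do: spawn(fn -> loop(initial) end)

  # State is kept by recursing with the new value after each message.
  defp loop(count) do
    receive do
      :increment ->
        loop(count + 1)

      {:get, caller} ->
        send(caller, {:count, count})
        loop(count)
    end
  end
end

pid = Counter.start()
send(pid, :increment)
send(pid, :increment)
send(pid, {:get, self()})

# Messages from a single sender are delivered in the order they were sent,
# so both increments are handled before the :get request.
receive do
  {:count, n} -> IO.puts("count = #{n}")
end
```

Running the script should print `count = 2`, since the mailbox is drained in order.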
  11. OTP - Open Telecom Platform
     • Abstraction above simple processes that provides templates for actors
     • Supervisors and Workers
       – If a worker dies, the supervisor is notified and can take action
     • Defined, well-understood pattern for handling state and message passing among processes
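As a rough sketch of the supervisor/worker relationship, the script below starts one GenServer under a supervisor, kills it, and checks that a replacement appears. It uses current Elixir syntax (the 2014 APIs differed), and the `Worker` module is a hypothetical stand-in, not one of the project's processes.

```elixir
defmodule Worker do
  use GenServer

  # Registered under the module name so callers do not need the pid.
  def start_link(_arg), do: GenServer.start_link(__MODULE__, 0, name: __MODULE__)
  def bump, do: GenServer.call(__MODULE__, :bump)

  @impl true
  def init(count), do: {:ok, count}

  @impl true
  def handle_call(:bump, _from, count), do: {:reply, count + 1, count + 1}
end

{:ok, _sup} = Supervisor.start_link([Worker], strategy: :one_for_one)

1 = Worker.bump()
old_pid = Process.whereis(Worker)

# Simulate a crash, then give the supervisor a moment to restart the worker.
Process.exit(old_pid, :kill)
Process.sleep(100)

new_pid = Process.whereis(Worker)
IO.inspect(new_pid != old_pid, label: "restarted with a new pid")
IO.inspect(Worker.bump(), label: "first bump after restart")
```

The restarted worker begins again from its initial state, which is why state that must survive a crash has to live somewhere other than the worker itself.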
  12. Supervisor Tree - Owsla
     [Diagram. Nodes shown: Owsla supervisor; Tcp Supervisor; Tcp Listener Supervisor; Tcp Endpoint Supervisor; Rabbit Consumer Supervisor; Rabbit Producer (1); Tcp Acceptor (1000); Tcp Endpoint (N); Rabbit Consumer (8)]
  13. Why do I need an Acceptor?
     • Single-threaded listen/accept loops artificially limit connections/second
     • Acceptors can all share the listen socket, with only 1 being signaled for each connection
     • Allows for concurrent acceptance of a very high number of connections/second (in our case, the goal was 1000/second)
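The idea can be sketched with Erlang's `:gen_tcp`: open one listen socket and let several processes block in `accept` on it. This is an illustrative echo server under assumed names (`AcceptorPool`, `serve/1`), not the project's acceptor code, and it omits the supervision a production pool would need.

```elixir
defmodule AcceptorPool do
  # Open one listen socket and start `count` acceptor processes that share it.
  def start(port, count) do
    {:ok, listen} =
      :gen_tcp.listen(port, [:binary, active: false, reuseaddr: true, backlog: 1024])

    for _ <- 1..count, do: spawn(fn -> accept_loop(listen) end)
    listen
  end

  # Every acceptor blocks in accept/1; each new connection wakes exactly one of them.
  defp accept_loop(listen) do
    case :gen_tcp.accept(listen) do
      {:ok, socket} ->
        # Hand the connection to its own process so the acceptor can go
        # straight back to accepting.
        pid = spawn(fn -> serve(socket) end)
        :gen_tcp.controlling_process(socket, pid)
        accept_loop(listen)

      {:error, _reason} ->
        # The listen socket was closed; let this acceptor stop.
        :ok
    end
  end

  # Hypothetical per-connection handler: echo whatever arrives.
  defp serve(socket) do
    case :gen_tcp.recv(socket, 0) do
      {:ok, data} ->
        :gen_tcp.send(socket, data)
        serve(socket)

      {:error, _reason} ->
        :gen_tcp.close(socket)
    end
  end
end

# Port 0 asks the OS for any free port; 10 acceptors share the socket.
listen = AcceptorPool.start(0, 10)
{:ok, port} = :inet.port(listen)

{:ok, client} = :gen_tcp.connect({127, 0, 0, 1}, port, [:binary, active: false])
:ok = :gen_tcp.send(client, "ping")
{:ok, reply} = :gen_tcp.recv(client, 4, 1_000)
IO.inspect(reply, label: "echoed")
```

As slide 14 says, an existing library is the better choice for real systems; the sketch only shows why sharing the listen socket removes the single accept loop as a bottleneck.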
  14. Should I build my own like you did?
     • NO
       – We didn’t really know what an acceptor pool was when we started writing this code
       – Had we known, we would have found an existing library and used it instead
     • What libraries are out there?
       – Writing Erlang? Check out Ranch: https://github.com/extend/ranch
       – If you’re writing Elixir, check out reagent: https://github.com/meh/reagent
  15. Mapping IDs to Processes
     • Thermostat connects & is assigned unique ID
     • Send messages to Ruby back-end
     • Ruby sends messages back to cluster, which must find the connection to which it should send the response
  16. First Attempt: A "Map Process"
     • Store a dict (Elixir Dictionary) in a process’s state
     • Add/remove/query from other processes to that process using standard OTP call/handle_call
     • Why is this a bad idea?
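A minimal sketch of such a map process, assuming current Elixir (a plain map instead of the 2014-era dict) and a hypothetical `MapProcess` module:

```elixir
defmodule MapProcess do
  use GenServer

  def start_link(_arg), do: GenServer.start_link(__MODULE__, %{}, name: __MODULE__)

  # Every operation is a synchronous call into the one process that owns the map.
  def put(id, pid), do: GenServer.call(__MODULE__, {:put, id, pid})
  def get(id), do: GenServer.call(__MODULE__, {:get, id})
  def delete(id), do: GenServer.call(__MODULE__, {:delete, id})

  @impl true
  def init(map), do: {:ok, map}

  @impl true
  def handle_call({:put, id, pid}, _from, map), do: {:reply, :ok, Map.put(map, id, pid)}
  def handle_call({:get, id}, _from, map), do: {:reply, Map.fetch(map, id), map}
  def handle_call({:delete, id}, _from, map), do: {:reply, :ok, Map.delete(map, id)}
end

{:ok, _pid} = MapProcess.start_link(nil)
:ok = MapProcess.put("hypothetical-id", self())
IO.inspect(MapProcess.get("hypothetical-id"), label: "lookup")
```

Because a process handles one message at a time, every add, remove, and lookup from every connection queues up in this single mailbox, so the map process becomes a serialization point as the connection count grows.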
  17. Sharding
     • We use UUIDs for connection IDs
     • UUIDs are a series of bytes
     • Take the 1st byte (which should be fairly random if you use v4, or random, UUIDs)
     • Use the 1st byte to pick one of 256 processes, each with its own map that contains just the UUIDs that begin with that byte
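The shard selection can be sketched as a pattern match on the first byte of the binary UUID. The `ShardedMap` module below is a hypothetical illustration that uses one `Agent` per shard; the talk does not say how the project implemented its shard processes, and 16 random bytes stand in for a real v4 UUID.

```elixir
defmodule ShardedMap do
  @shards 256

  # Start one Agent per shard, each holding its own map, and keep the pids
  # in a tuple so a shard can be picked by index.
  def start do
    1..@shards
    |> Enum.map(fn _ ->
      {:ok, pid} = Agent.start_link(fn -> %{} end)
      pid
    end)
    |> List.to_tuple()
  end

  # The first byte of the UUID (0..255) selects the shard.
  def shard_for(shards, <<first, _rest::binary>>), do: elem(shards, first)

  def put(shards, uuid, pid) do
    Agent.update(shard_for(shards, uuid), &Map.put(&1, uuid, pid))
  end

  def get(shards, uuid) do
    Agent.get(shard_for(shards, uuid), &Map.fetch(&1, uuid))
  end
end

shards = ShardedMap.start()

# Assumption: random bytes stand in for a v4 UUID in binary form.
uuid = :crypto.strong_rand_bytes(16)
:ok = ShardedMap.put(shards, uuid, self())
{:ok, pid} = ShardedMap.get(shards, uuid)
IO.inspect(pid == self(), label: "found the registering process")
```

Each shard still serializes its own traffic, but with evenly distributed first bytes any one process sees roughly 1/256 of the load.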
  18. Sharded Map
     [Diagram: Thermostats and Back-ends each connecting to sharded map processes labeled Map 0, Map 1, … Map 253, Map 254]
  19. ETS
     • Can store data so multiple processes can access it concurrently
     • Stores Erlang tuples
     • Pick one item in the tuple as a key
     • Multiple types
       – set - each key must be unique
       – ordered_set - like set, but iteration is in sorted key order
       – bag - duplicate keys are OK
       – duplicate_bag - duplicate tuples are OK
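A short sketch of the ETS calls involved, using Erlang's `:ets` module from Elixir. The table names and keys are made up for illustration.

```elixir
# A public set table: any process may read or write, and keys are unique.
table = :ets.new(:connections, [:set, :public, read_concurrency: true])

# By default the first tuple element is the key.
true = :ets.insert(table, {"uuid-1", self()})
[{"uuid-1", pid}] = :ets.lookup(table, "uuid-1")
IO.inspect(pid == self(), label: "lookup returned our pid")

# A missing key yields an empty list rather than an error.
[] = :ets.lookup(table, "missing")

# In a set, inserting an existing key replaces the old tuple.
true = :ets.insert(table, {"uuid-1", :replaced})
[{"uuid-1", :replaced}] = :ets.lookup(table, "uuid-1")

# A bag allows several tuples per key but drops exact duplicates;
# a duplicate_bag keeps them.
bag = :ets.new(:bag_demo, [:bag])
dup = :ets.new(:dup_demo, [:duplicate_bag])

for t <- [bag, dup] do
  :ets.insert(t, {:door, 1})
  :ets.insert(t, {:door, 2})
  :ets.insert(t, {:door, 2})
end

IO.inspect(length(:ets.lookup(bag, :door)), label: "bag entries")
IO.inspect(length(:ets.lookup(dup, :door)), label: "duplicate_bag entries")
```

Lookups go straight to the table rather than through another process's mailbox, which is what lets many connection processes resolve IDs at the same time.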
  20. UUID -> PID mapping in ETS
     [Diagram: Thermostats and Back-ends all reading from and writing to a single ETS table]
  21. Global Name Registration
     • The global_name_server process exists on every node in a cluster
     • Provides several cluster-wide services, including
       – Name registration
       – Locking
       – Maintaining the fully-connected network
  22. Global Naming
     • Register a process (PID) using any Erlang term (not limited to atoms like local names)
       – Can use things like UUIDs for names
     • Registration adds/removes are propagated through the entire cluster
       – If a process exits or crashes, it is removed from the registry
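Global registration can be sketched with Erlang's `:global` module. The example runs on a single node, where the same calls work without a cluster; the name term and the `:unlock` message are hypothetical.

```elixir
# A stand-in for a connection process that answers one request and exits.
conn =
  spawn(fn ->
    receive do
      {:unlock, from} -> send(from, :unlocked)
    end
  end)

# Any term can be a global name; here a tuple wrapping a made-up ID.
name = {:thermostat, "hypothetical-uuid-1234"}

:yes = :global.register_name(name, conn)
^conn = :global.whereis_name(name)

# :global.send/2 resolves the name and delivers the message to that pid,
# on whichever node in the cluster the process lives.
:global.send(name, {:unlock, self()})

receive do
  :unlocked -> IO.puts("reply received")
after
  1_000 -> IO.puts("no reply")
end
```

`register_name/2` returns `:no` if the name is already taken, so callers that generate their own IDs should be prepared to handle that case.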
  23. Houston, We Have a Problem
     • Load-testing tool would open 5-10k connections per instance to our server
     • We’d run several instances of the load tool while testing
     • Every time we stopped one of them, the entire TcpSupervisor tree would crash/restart, taking down all of the live connections
  24. Restart Strategies
     • one_for_one – if a child crashes, restart it
     • one_for_all – if any child crashes, restart all of them
     • rest_for_one – if a child crashes, it and all children started after it are restarted
     • simple_one_for_one – like one_for_one, but designed for starting children dynamically
  25. Maximum Restart Frequency
     • MaxT - the time window in which to measure restarts (in seconds) - default is 5
     • MaxR - the maximum number of restarts that can occur within MaxT seconds before the supervisor will tear itself down - default is 5
  26. Child Specifications: Restart
     • permanent - always restart this child when it crashes (unless MaxR is exceeded)
     • transient - restart if the child ‘terminates abnormally’
     • temporary - never restart, even if the supervisor restart strategy tells you otherwise. These, therefore, aren’t counted when looking at MaxR
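The interaction between restart frequency and the child restart type can be sketched as follows. The `Endpoint` module is a hypothetical stand-in for a per-connection process, the option names are those of current Elixir (`max_restarts` for MaxR, `max_seconds` for MaxT), and the values are set explicitly rather than relying on defaults.

```elixir
defmodule Endpoint do
  # restart: :temporary means the supervisor never restarts this child,
  # so its exits do not count toward the restart limit.
  use GenServer, restart: :temporary

  def start_link(id), do: GenServer.start_link(__MODULE__, id)

  @impl true
  def init(id), do: {:ok, id}
end

children =
  for n <- 1..10 do
    Supervisor.child_spec({Endpoint, n}, id: {:endpoint, n})
  end

{:ok, sup} =
  Supervisor.start_link(children,
    strategy: :one_for_one,
    max_restarts: 5,
    max_seconds: 5
  )

# Kill all ten children at once, as if a load-test client had dropped
# its connections. That is more exits than max_restarts allows.
for {_id, pid, _type, _modules} <- Supervisor.which_children(sup) do
  Process.exit(pid, :kill)
end

Process.sleep(100)

IO.inspect(Process.alive?(sup), label: "supervisor still alive")
IO.inspect(Supervisor.count_children(sup), label: "children")
```

With `restart: :permanent` instead, the same ten exits inside the five-second window would exceed the limit and the supervisor would shut itself down, which matches the symptom described on slide 23.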
  27. Resources/Questions?
     • http://elixir-lang.org
     • http://erlang.org
     • irc: #elixir-lang
     • http://jeetkundoug.wordpress.com/2014/01/13/elixir-and-the-internet-of-things
     • Me: @jeetkundoug
     • My partners in crime on this project
       – @eymiha - Dave Anderson
       – @jvoegele - Jason Voegele