Erlang in production. Lessons learned

Developing a social applications platform with Erlang, Riak Core and Riak Pipe


Oleksii Kachaiev

November 16, 2013


  1. 2.

    About me
    •CTO at
    •Erlang, Go, Clojure, Scala
    •CPython & Twitter Storm contributor
    •Author of library
    •Hobbies: Haskell, Scheme, Racket, CRDT, type systems, compilers
  2. 4.

    Will tell you
    •project goals and challenges
    •tech. stack that we use
    •development
    •testing, deployment, debugging
    •problems & solutions
  3. 6.

    Will not tell
    •why Erlang is cool
    •how Erlang is cool
    •why you should use Erlang
    •1 million concurrent users
    •1000+ node cluster
  4. 13.

    •profiles
    •social network accounts
    •chat
    •posts
    •photos
    •photo albums
    •timeline
    •notifications
    •likes
    •permissions
    •replies
    •following
    •tweets tracking
    •activity streams
    •checkins
    •notes
    •sharing
    •search
    •RSS news tracking
    •blocking & ban
  5. 16.
  6. 17.

    •10k++ mobile applications
    •~2k profiles for each
    •activity spikes (obviously)
    •apps should work independently (*)
    (*) Technically
  7. 18.
  8. 19.

    “Delaware”
    •prototype, not so many features
    •~3 weeks of active development
    •wrapper for CouchDB (in Erlang)
    •biggest problem: push notifications
    •serves ~75 mobile apps and still running
  9. 21.
  10. 23.

    “Gomer”
    •~46k LOC
    •53 “xxx” comments (incl. 3 “xxx!!!”)
    •47 libraries (incl. 3 forks)
    •117 public RPC methods
  11. 27.

    Data
    •graph-oriented (like Facebook)
    •Riak for most data: nodes, links, streams
    •etcd for consistent cases (Raft consensus): settings, cluster structure
    •in-memory ETS: cache, sync ordering
    •pre-built data for reading
  12. 28.

    Graph
    •nodes: id, rev, attrs, system flags
    •links: from-id, to-id, type
    •holds an essential part of the logic, e.g. a session is a link from a profile to a device
    •Facebook TAO model: fetching nodes and the simplest link walking
    •implemented as an independent library
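To make the node/link model concrete, here is a minimal in-memory sketch. It is not the library from the talk (which is written in Erlang and not shown on the slides); Python is used purely for illustration, the class and method names (`Graph`, `walk`, and so on) are hypothetical, and the "system flags" field is left out for brevity.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Node:
    id: str
    rev: int
    attrs: dict = field(default_factory=dict)


@dataclass(frozen=True)
class Link:
    from_id: str
    to_id: str
    type: str


class Graph:
    """Tiny in-memory stand-in for a node/link store (hypothetical API)."""

    def __init__(self):
        self._nodes = {}
        self._out = defaultdict(list)  # (from_id, link type) -> [Link]

    def put_node(self, node):
        self._nodes[node.id] = node

    def add_link(self, from_id, to_id, type):
        self._out[(from_id, type)].append(Link(from_id, to_id, type))

    def get_node(self, node_id):
        return self._nodes.get(node_id)

    def walk(self, from_id, type):
        """One-hop walk: nodes reachable from from_id over links of one type."""
        links = self._out.get((from_id, type), [])
        return [self._nodes[l.to_id] for l in links if l.to_id in self._nodes]


g = Graph()
g.put_node(Node("profile:1", rev=1, attrs={"name": "alice"}))
g.put_node(Node("device:9", rev=1, attrs={"os": "ios"}))
# A session is modelled as a link from a profile to a device.
g.add_link("profile:1", "device:9", "session")
devices = g.walk("profile:1", "session")
```

Keeping sessions, follows and similar relations as typed links means one fetch-and-walk primitive can serve many features.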
  13. 29.

    K-ordering
    •revision control for each entity
    •to ensure all client calls are idempotent
    •k-ordering for cursor-based sync (**)
    •flake library (snowflake-like)
    •one more, riak_id
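As a rough illustration of what "snowflake-like" IDs look like, the sketch below packs a millisecond timestamp, a worker id and a per-millisecond sequence into one integer, so IDs sort roughly by creation time. The 10-bit worker and 12-bit sequence split follows the commonly described Snowflake layout and is an assumption here; it is not the bit layout of the flake or riak_id libraries, and `FlakeLikeIds` is a hypothetical name.

```python
import threading
import time


class FlakeLikeIds:
    """Roughly time-ordered IDs: timestamp | worker | sequence (assumed layout)."""

    def __init__(self, worker_id, clock=lambda: int(time.time() * 1000)):
        self.worker_id = worker_id & 0x3FF  # keep 10 bits
        self._clock = clock
        self._last_ms = -1
        self._seq = 0
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                now = self._last_ms  # clock went backwards: stay on last ms
            if now == self._last_ms:
                self._seq = (self._seq + 1) & 0xFFF  # 12-bit sequence
                if self._seq == 0:
                    now = self._last_ms + 1  # sequence exhausted: borrow next ms
            else:
                self._seq = 0
            self._last_ms = now
            return (now << 22) | (self.worker_id << 12) | self._seq


gen = FlakeLikeIds(worker_id=1)
first, second = gen.next_id(), gen.next_id()
```

IDs from a single generator are strictly increasing; IDs from different workers are only ordered up to clock skew, which is the "k" in k-ordering.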
  14. 30.

    K-ordering **
    client tells the server the max revision it has ever seen (a.k.a. cursor)
    server sends changed data only (current rev > client max rev)
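The cursor exchange described on this slide can be sketched as follows. This is an illustrative model only: a single in-process counter stands in for k-ordered revisions, and `SyncStore` and `changes_since` are hypothetical names, not the project's API.

```python
class SyncStore:
    """Keeps the latest (rev, value) per key and serves changes after a cursor."""

    def __init__(self):
        self._rev = 0  # stand-in for a k-ordered revision generator
        self._items = {}  # key -> (rev, value)

    def put(self, key, value):
        self._rev += 1
        self._items[key] = (self._rev, value)
        return self._rev

    def changes_since(self, cursor):
        """Return entries with rev > cursor, oldest first, plus the new cursor."""
        changed = sorted(
            (rev, key, value)
            for key, (rev, value) in self._items.items()
            if rev > cursor
        )
        new_cursor = changed[-1][0] if changed else cursor
        return changed, new_cursor


store = SyncStore()
store.put("note:1", "draft")
store.put("note:2", "hello")
first_sync, cursor = store.changes_since(0)       # client has seen nothing yet
store.put("note:1", "final")
second_sync, cursor = store.changes_since(cursor)  # only the changed entry
```

Because the client only ever sends the highest revision it has seen, repeating a sync request with the same cursor returns the same data, which keeps the call idempotent.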
  15. 32.

    Streams
    •Actor, Action, Object, Target
    •cases: timelines, activity streams, chats, notification center
    •linked lists
    •cursor-based fetch
  16. 33.
  17. 34.

    Offline support
    •cases: follow-ups, notes, cleared messages etc
    •event-sourcing (both server & clients)
    •LWW (last write wins) for conflicting rewrites
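A last-write-wins merge can be sketched in a few lines. The slides do not say what the project uses as the ordering key or tie-breaker, so the `timestamp` and `origin` fields below are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    value: str
    timestamp: int  # assumed: a revision or wall-clock millis
    origin: str     # assumed tie-breaker, e.g. a device id


def lww_merge(a, b):
    """Return the winning write; equal timestamps fall back to origin."""
    return max(a, b, key=lambda w: (w.timestamp, w.origin))


offline_edit = Write("buy milk", timestamp=100, origin="phone")
server_edit = Write("buy bread", timestamp=120, origin="tablet")
winner = lww_merge(offline_edit, server_edit)
```

The deterministic tie-breaker matters: without it, two replicas merging the same pair in different order could keep different values.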
  18. 35.

    ETS
    •use to avoid copying state in gen_server
    •2 approaches (use both):
    •supervisor creates ETS and gives it to the child at start
    •server creates ETS and fills it with data on each gen_server:init
  19. 36.
  20. 40.

    Processes v.1
    •started from “process per device”
    •easy to start, client is an Actor
    •not really HA
    •bad fit for a cluster of only a few nodes
    •many problems with event routing
    •reimplemented
  21. 41.

    Processes v.2
    •riak_core vnodes ring
    •riak_pipe vnodes ring
    •supervisor for each app
    •auth
    •profiles ordering
    •twitter reader
    •rss reader
  22. 43.

    riak_core
    •vnodes ring
    •“service” and compatibility tracking
    •consistent hashing for task routing
    •handoffs
    •join, leave, status, membership
    •CLI admin interface
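For readers unfamiliar with consistent hashing, the sketch below shows the general idea of routing a task key to an owner on a hash ring. It is a generic textbook variant with virtual points per node, written in Python for illustration; it is not riak_core's own partition-claim scheme, and `Ring`, `owner` and `points_per_node` are hypothetical names.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a point on the ring (SHA-1 as a big integer)."""
    return int.from_bytes(hashlib.sha1(key.encode()).digest(), "big")


class Ring:
    def __init__(self, nodes, points_per_node=64):
        # Each node owns several points so load spreads more evenly.
        self._points = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(points_per_node)
        )
        self._keys = [h for h, _ in self._points]

    def owner(self, task_key):
        """The first point clockwise from the key's hash owns the task."""
        idx = bisect.bisect_right(self._keys, _hash(task_key)) % len(self._keys)
        return self._points[idx][1]


full = Ring(["node-a", "node-b", "node-c"])
shrunk = Ring(["node-a", "node-b"])  # node-c has left the cluster
tasks = [f"profile:{i}" for i in range(200)]
moved = [t for t in tasks if full.owner(t) != shrunk.owner(t)]
```

When a node leaves, only the tasks it owned are re-routed; everything else keeps its owner, which is what makes handoffs tractable.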
  23. 44.

    riak_core
    •few problems
    •great facilities with no docs
    •... but the whole source code is easy to read
    •thanks to the guys from Basho for their advice
    •waiting for version 2.0
  24. 45.
  25. 46.
  26. 48.

    Testing
    •2-phase: “unit” and “functional”
    •eunit (built-in testing framework)
    •etest library for functional tests
    •functional tests in separate modules
    •don’t track coverage
  27. 49.

    Testing
    •a lot of high-level helpers
    •assert functions over JSON structure
    •?wait_for macro to test async operations
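The slides do not show how the `?wait_for` macro is defined, so the following is only a guess at the idea behind such a helper: poll a condition until it holds or a timeout expires. It is written in Python for illustration, and the name `wait_for` and its parameters are assumptions.

```python
import threading
import time


def wait_for(predicate, timeout=1.0, interval=0.01):
    """Poll predicate until it returns a truthy value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise AssertionError(f"condition not met within {timeout}s")
        time.sleep(interval)


# Simulate an async side effect: a "push" that arrives a bit later.
inbox = []
threading.Timer(0.05, inbox.append, args=("push",)).start()
wait_for(lambda: "push" in inbox)
```

Polling with a deadline keeps async tests free of fixed sleeps: the test continues as soon as the condition holds and fails with a clear message otherwise.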
  28. 50.

    Mocks
    •mocking: external HTTP endpoints, IP detectors
    •meck library: creating modules, history API
    •good enough
    •strange “random” problems after recompilation
  29. 51.

    Lesson #7
    •test coverage is a key factor for really quick development
    •concentrate on “negative” cases
    •it’s easy to turn this process into fun
  30. 52.

    Lesson #8
    •a good type system matters
    •too many tests to check input values
    •too many tests to check formatting
    •too many tests to check protocols
  31. 53.

    Cluster
    •you need to prepare tests for a multi-node system
    •(only) then start working on distribution
    •riak_test
    •property testing: PropEr
    •... both are great, but hard to adopt
  32. 56.

    Lesson #9
    •it’s hard to do everything right on the first try
    •it’s impossible to do it on the first try?
    •it’s impossible to do it at all?
    •more experiments!
  33. 57.

    Events
    •a lot of async operations
    •e.g. like → save in DB → update timeline entry → publish activity stream entry → add notification → send to device
    •started with RabbitMQ and an exchange for each event type (easy to start)
    •reimplemented
  34. 58.

    Events 2
    •2 types: bound & unbound
    •bound: known number of subscribers
    •e.g. “like”
    •converting to “active coordinator”: FSM under an appropriate supervisor
    •sourcing for fault-tolerance
  35. 59.

    Events 3
    •unbound cases:
    •ban profile → remove all content
    •update timeline → send push to all subscribed devices
    •use riak_pipe
  36. 60.

    riak_pipe
    •part of Riak internals
    •map/reduce flavored with unix pipes
    •declarative fittings
    •custom routing
    •back-pressure control
    •logging and tracing
    •handoffs
  37. 61.
  38. 63.

    Lesson #10
    there is no such thing as “exactly once delivery”
    back-pressure control is essential
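Both halves of this lesson can be illustrated with a small sketch. If delivery is at-least-once, the consumer has to tolerate duplicates, here by remembering processed event ids; and a bounded queue gives producers a signal to slow down instead of letting the backlog grow. This is a generic Python illustration, not the project's riak_pipe setup; `IdempotentHandler` and `offer` are hypothetical names.

```python
import queue


class IdempotentHandler:
    """Applies each event at most once, even if it is delivered repeatedly."""

    def __init__(self, apply):
        self._apply = apply
        self._seen = set()  # in a real system this would need to be durable

    def handle(self, event_id, payload):
        if event_id in self._seen:
            return False  # duplicate delivery: ignore
        self._apply(payload)
        self._seen.add(event_id)
        return True


def offer(q, item):
    """Non-blocking enqueue; False tells the producer to back off."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        return False


notifications = []
handler = IdempotentHandler(notifications.append)
handler.handle("evt-1", "alice liked your photo")
handler.handle("evt-1", "alice liked your photo")  # redelivered after a retry

inbox = queue.Queue(maxsize=2)
accepted = [offer(inbox, n) for n in ("a", "b", "c")]
```

The third `offer` is rejected because the queue is full; what the producer does next (retry later, drop, or shed load upstream) is a policy decision the sketch leaves open.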
  39. 64.

    Meta programming
    •it matters!
    •cases: RPC definitions, permissions etc
    •-define(MACRO, ...)
    •... great, but sometimes inconvenient
    •parse_transform
    •... great, but hard to develop & support
    •Elixir? no, thanks
  40. 66.

    Documentation
    •... an external tool is not so easy to support
    •edoc ?
    •parse_transform ? e.g. -doc()
  41. 68.

    Deployment
    •don’t use hot swapping for releases
    •reltool to prepare package(s)
    •run_erl to run VM as a daemon
    •shell script for common operations: start, stop, restart, attach
    •shell script for cluster operations (wrapper for node calls): join, leave, status (ring & members)
  42. 69.

    Deployment
    •rebar generate to /opt/gomer/<version>/*
    •shared directory for compiled deps: much faster get-deps & compile
    •zip and store on S3
    •download from S3, unzip, relink
    •fabric (Python) for automation
  43. 70.

    Lesson #12
    we still don’t know what the best way to deploy an application across the cluster is
  44. 73.

    Debugging
    •a lot of log messages
    •for all concerned
    •dbg on live server
    •a few helpers of our own for the most common cases
    •“trace_off” on timeout
  45. 74.
  46. 76.

    Lesson #15
    there are a few features in Erlang that you really, really miss when using other technologies
  47. 77.

    The team
    •2 engineers
    •“2 weeks” to start writing production code
    •ha. first feature: on the second day *
    * first day: tripped up by Mac OS
  48. 80.

    Libraries
  49. 81.

    Questions #1 Performance
    •~20-25ms for most responses
    •100+ connections without any impact
    •faster than Python & Ruby
    •not as fast as Scala, Clojure and Go
    •... but do you really care?
  50. 82.

    Questions #2 Candidates
    •Erlang (our choice)
    •Scala (jvm)
    •Clojure (jvm)
    •Python (bad fit)
    •Go (too large project)
    •Haskell (bad fit)
    •Java (oh, come on..)
  51. 84.

    Notes #1
    •we use Go and Clojure for other systems
    •do you want to ask “Why”?
    •we are still at an early production stage
    •wait for new lessons coming soon