
Erlang in production. Lessons learned

Developing a social application platform for attendify.com with Erlang, Riak Core and Riak Pipe

Oleksii Kachaiev

November 16, 2013


  1. Social platform in Erlang Lessons Learned Alexey Kachayev, 2013

  2. About me •CTO at Attendify.com •Erlang, Go, Clojure, Scala •CPython & Twitter Storm contributor •Author of fn.py library •Hobbies: Haskell, Scheme, Racket, CRDT, type systems, compilers
  3. Contacts •@kachayev •github: kachayev •[email protected]

  4. Will tell you •project goals and challenges •tech. stack that we use •development •testing, deployment, debugging •problems & solutions
  5. unstructured content: too much information, at least 4 different talks

  6. Will not tell •why Erlang is cool •how Erlang is cool •why you should use Erlang •1 million concurrent users •1000+ node clusters
  7. attendify.com

  8. social platform hundreds of mobile apps “little facebook” inside each

  9. social platform: not hundreds but thousands of mobile apps

  10. social platform “Yammer” for events (at least technically)

  11. tons of features

  12. no, really! tons of features

  13. •profiles •social network accounts •chat •posts •photos •photo albums •timeline •notifications •likes •permissions •replies •following •tweets tracking •activity streams •checkins •notes •sharing •search •RSS news tracking •blocking & ban
  14. •push notifications •multi-device synchronization •offline support for a few features Special

  15. •high availability (HA) •plug-in infrastructure •RPC for thin clients •stability & guarantees •quick development Requirements
  16. Requirements goals •high availability (HA) •plug-in infrastructure •RPC for thin clients •stability & guarantees •quick development
  17. •10k++ mobile applications •~2k profiles for each •activity spikes (obviously) •apps should work independently (*) Technically
  18. Prototype

  19. •prototype, not so many features •~3 weeks of active development •wrapper for CouchDB (in Erlang) •biggest problem: push notifications •serves ~75 mobile apps and is still running “Delaware”
  20. “Delaware” implementation check SNS integration switch admin panel

  21. Current

  22. •4 months of active development •2 engineers •10 repos •1,178 commits “Gomer”
  23. •~46k LOC •53 “xxx” comments (incl. 3 “xxx!!!”) •47 libraries (incl. 3 forks) •117 public RPC methods “Gomer”
  24. •1,148 test cases •make testall •390 apps •1,195 devices •19,047 log messages “Gomer”
  25. •pretty big project •very dynamic •quickly growing in size & features so?
  26. System design

  27. •graph-oriented (like Facebook) •Riak for most data: nodes, links, streams •etcd for consistent cases (Raft consensus): settings, cluster structure •in-memory ETS: cache, sync ordering •pre-built data for reading Data
  28. •nodes: id, rev, attrs, system flags •links: from-id, to-id, type •holds an essential part of the logic, e.g. a session is a link from a profile to a device •Facebook TAO model: fetching nodes and the simplest link-walking •implemented as an independent library Graph
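
A minimal sketch of the node/link model from this slide, assuming an in-memory store. The names `Node`, `GraphStore`, `put_node`, `add_link` and `walk` are hypothetical and are not taken from the library mentioned in the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: str
    rev: int
    attrs: dict = field(default_factory=dict)


class GraphStore:
    """In-memory stand-in for the node/link store (hypothetical API)."""

    def __init__(self):
        self.nodes = {}
        self.links = {}  # (from_id, link_type) -> list of to_id

    def put_node(self, node):
        self.nodes[node.id] = node

    def add_link(self, from_id, to_id, link_type):
        self.links.setdefault((from_id, link_type), []).append(to_id)

    def walk(self, from_id, link_type):
        """Simplest link-walking: follow one link type from one node."""
        return [self.nodes[t] for t in self.links.get((from_id, link_type), [])]


store = GraphStore()
store.put_node(Node("profile:1", rev=1, attrs={"name": "Ann"}))
store.put_node(Node("device:9", rev=1, attrs={"os": "ios"}))
# A session is modeled as a link from a profile to a device.
store.add_link("profile:1", "device:9", "session")
sessions = store.walk("profile:1", "session")
```

Keeping sessions, follows and similar relations as typed links means most reads reduce to "fetch a node" or "walk one link type".
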
  29. •revision control for each entity •to ensure all client calls are idempotent •k-ordering for cursor-based sync (**) •flake library (snowflake-like) •one more, riak_id K-ordering
  30. K-ordering ** client tells the server the max revision it has ever seen (a.k.a. cursor); server sends only changed data (current rev > client max rev)
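
A rough sketch of cursor-based sync on top of roughly time-ordered revisions. The bit layout (timestamp, 10-bit worker id, 12-bit sequence) follows the general snowflake idea and is an assumption; it is not the exact format of flake or riak_id. `RevisionGenerator` and `changes_since` are hypothetical names.

```python
import time


class RevisionGenerator:
    """Snowflake-like ids: timestamp bits, then worker id, then sequence."""

    def __init__(self, worker_id, clock=lambda: int(time.time() * 1000)):
        self.worker_id = worker_id & 0x3FF  # keep 10 bits
        self.clock = clock
        self.last_ms = -1
        self.seq = 0

    def next_rev(self):
        now = self.clock()
        if now < self.last_ms:
            now = self.last_ms  # clock went backwards: keep the ordering
        if now == self.last_ms:
            self.seq = (self.seq + 1) & 0xFFF
            if self.seq == 0:
                now += 1  # sequence exhausted: borrow the next millisecond
        else:
            self.seq = 0
        self.last_ms = now
        return (now << 22) | (self.worker_id << 12) | self.seq


def changes_since(entities, cursor):
    """Return entities whose revision is newer than the client's cursor."""
    changed = [e for e in entities if e["rev"] > cursor]
    return sorted(changed, key=lambda e: e["rev"])


gen = RevisionGenerator(worker_id=1, clock=lambda: 1_000)
entities = [{"id": name, "rev": gen.next_rev()} for name in ("a", "b", "c")]
cursor = entities[0]["rev"]  # the client has already seen "a"
delta = changes_since(entities, cursor)
new_cursor = max(e["rev"] for e in delta)
```

Revisions from one generator are strictly increasing; across workers they are only roughly ordered by time, which is what "k-ordering" refers to.
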
  31. •github.com/twitter/snowflake (Scala) •github.com/boundary/flake (Erlang) * •github.com/seancribbs/riak_id (Erlang) K-ordering

  32. •activitystrea.ms •Actor, Action, Object, Target •cases: timelines, activity streams, chats, notification center •linked lists •cursor-based fetch Streams
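
One way to picture a stream as a linked list with cursor-based fetch, as a sketch only. The entry fields mirror Actor, Action, Object, Target; the `Stream` class, its methods and the integer ids are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    id: int
    actor: str
    action: str
    object: str
    target: Optional[str]
    prev: Optional[int]  # id of the previous (older) entry


class Stream:
    def __init__(self):
        self.entries = {}
        self.head = None  # id of the newest entry

    def publish(self, actor, action, obj, target=None):
        entry_id = len(self.entries) + 1
        self.entries[entry_id] = Entry(entry_id, actor, action, obj, target, self.head)
        self.head = entry_id
        return entry_id

    def fetch(self, cursor=None, limit=2):
        """Return up to `limit` entries older than `cursor` and the next cursor."""
        current = self.head if cursor is None else self.entries[cursor].prev
        page = []
        while current is not None and len(page) < limit:
            entry = self.entries[current]
            page.append(entry)
            current = entry.prev
        next_cursor = page[-1].id if page else cursor
        return page, next_cursor


timeline = Stream()
timeline.publish("ann", "post", "photo:1")
timeline.publish("bob", "like", "photo:1", target="ann")
timeline.publish("ann", "reply", "comment:5", target="bob")
page1, cursor = timeline.fetch()
page2, cursor = timeline.fetch(cursor)
```

The client only has to remember the last entry id it received; the server never computes offsets.
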
  33. Streams

  34. •cases: follow-ups, notes, cleared messages etc •event-sourcing (both server & clients) •LWW for conflicted rewrites Offline support
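
A small sketch of replaying an event log with last-write-wins (LWW) conflict resolution. The event shape (`key`, `value`, `ts`, `origin`) and the use of `origin` as a tie-breaker are assumptions made for illustration.

```python
def apply_events(events):
    """Replay events; for the same key the write with the latest stamp wins."""
    state = {}
    for event in events:
        key = event["key"]
        stamp = (event["ts"], event["origin"])  # origin breaks timestamp ties
        current = state.get(key)
        if current is None or stamp > current["stamp"]:
            state[key] = {"value": event["value"], "stamp": stamp}
    return {key: item["value"] for key, item in state.items()}


server_log = [
    {"key": "note:1", "value": "draft", "ts": 10, "origin": "server"},
]
client_log = [
    {"key": "note:1", "value": "edited offline", "ts": 12, "origin": "device-9"},
    {"key": "msg:7", "value": "cleared", "ts": 11, "origin": "device-9"},
]
merged = apply_events(server_log + client_log)
```

Because the winner depends only on the stamps, the server and the client reach the same state regardless of the order in which they replay the logs.
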
  35. •used to avoid copying state in gen_server •2 approaches (we use both): •supervisor creates the ETS table and gives it to the child at start •server creates the ETS table and fills it with data on each gen_server:init ETS
  36. Lesson #1: graph-oriented data is a good fit; (most) graph databases are strange
  37. Lesson #2: data modeling is hard; any kind of consistency is hard
  38. Lesson #3 Erlang is good for async data processing

  39. Lesson #4: each mobile client is a part of a single distributed system
  40. •started from “process per device” •easy to start, client is an Actor •not really HA •bad fit for a cluster of few nodes •many problems with event routing •reimplemented Processes v.1
  41. •riak_core vnodes ring •riak_pipe vnodes ring •supervisor for each app •auth •profiles ordering •twitter reader •rss reader Processes v.2
  42. Lesson #5: an obvious solution can be a bad fit; “fail fast” in your decisions
  43. •vnodes ring •“service” and compatibility tracking •consistent hashing for task routing •handoffs •join, leave, status, membership •CLI admin interface riak_core
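
A simplified illustration of routing tasks by consistent hashing over a fixed ring of partitions. This is not riak_core's actual algorithm or API: the `Ring` class, the round-robin ownership and the partition count are assumptions; only the idea of hashing a key into a SHA-1 ring split into partitions comes from how such rings are commonly described.

```python
import hashlib

RING_SIZE = 2 ** 160  # size of the SHA-1 output space


class Ring:
    def __init__(self, nodes, partitions=64):
        self.partitions = partitions
        # Round-robin partition ownership (a simplification).
        self.owners = [nodes[i % len(nodes)] for i in range(partitions)]

    def partition_for(self, key):
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        position = int.from_bytes(digest, "big")
        return position * self.partitions // RING_SIZE

    def owner_for(self, key):
        return self.owners[self.partition_for(key)]


nodes = ["node-a", "node-b", "node-c"]
ring = Ring(nodes)
owner = ring.owner_for("app:42")  # every task for this app goes to one owner
```

The number of partitions stays fixed; when nodes join or leave, only partition ownership changes, which is what makes handoffs possible.
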
  44. •a few problems •great facilities with no docs •... but the whole source code is easy to read •thanks to the guys from Basho for their advice •waiting for version 2.0 riak_core
  45. riak_core

  46. riak_core

  47. Lesson #6 riak_core is a good enough reason for using

  48. •2-phase: “unit” and “functional” •eunit (built-in testing framework) •etest library for functional tests •functional tests in separate modules •don’t track coverage Testing
  49. •a lot of high-level helpers •assert functions over JSON structure •?wait_for macro to test async operations Testing
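
The `?wait_for` macro itself is not shown in the deck. The sketch below is a Python analogue of the idea, assuming it polls a condition until it holds or a timeout expires; the function name, parameters and defaults are hypothetical.

```python
import threading
import time


def wait_for(predicate, timeout=1.0, interval=0.01):
    """Poll `predicate` until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise AssertionError("condition not met within %.2fs" % timeout)
        time.sleep(interval)


timeline = []
# Simulate an asynchronous write that lands a little later.
threading.Timer(0.05, timeline.append, args=("like",)).start()
wait_for(lambda: "like" in timeline)
```

Polling with a deadline keeps async tests free of fixed sleeps: the test continues as soon as the condition holds and fails loudly when it never does.
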
  50. •mocking: external HTTP endpoints, IP detectors •meck library: creating modules, history API •good enough •strange “random” problems after recompilation Mocks
  51. •test coverage is a key factor for really quick development •concentrate on “negative” cases •it’s easy to turn this process into fun Lesson #7
  52. •a good type system matters •too many tests to check input values •too many tests to check formatting •too many tests to check protocols Lesson #8
  53. •you need to prepare tests for a multi-node system •(only) then start working on distribution •riak_test •property testing: PropEr •... both are great, but hard to adopt Cluster
  54. •make devrel to run 3+ nodes •it’s fun too Cluster

  55. Cluster testing

  56. •it’s hard to do everything right on the first try •it’s impossible to do it on the first try? •it’s impossible to do it at all? •more experiments! Lesson #9
  57. •a lot of async operations •e.g. like → save in DB → update timeline entry → publish activity stream entry → add notification → send to device •started with RabbitMQ and an exchange for each event type (easy to start) •reimplemented Events
  58. •2 types: bound & unbound •bound: known number of subscribers •e.g. “like” •converting to “active coordinator”: FSM under the appropriate supervisor •sourcing for fault-tolerance Events 2
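
A sketch of the "active coordinator" idea for a bound event such as a like: a fixed sequence of steps, with each completed step journaled so that a restarted coordinator resumes where it stopped. The step names follow the chain on the previous slide; the `Coordinator` class, the list-based journal and the simulated crash are assumptions, not the talk's implementation.

```python
STEPS = ["save", "update_timeline", "publish_activity",
         "add_notification", "send_to_device"]


class Coordinator:
    def __init__(self, journal):
        self.journal = journal  # completed step names; assumed to survive restarts

    def run(self, handlers):
        for step in STEPS:
            if step in self.journal:
                continue  # already done before the restart
            handlers[step]()
            self.journal.append(step)


calls = []


def make_handler(step, fail_once=False):
    state = {"failed": False}

    def handler():
        if fail_once and not state["failed"]:
            state["failed"] = True
            raise RuntimeError("crash in " + step)
        calls.append(step)

    return handler


handlers = {s: make_handler(s, fail_once=(s == "publish_activity")) for s in STEPS}
journal = []
try:
    Coordinator(journal).run(handlers)
except RuntimeError:
    pass  # a supervisor would restart the coordinator at this point
Coordinator(journal).run(handlers)
```

A crash between running a handler and journaling it would repeat that step after the restart, so the handlers still need to be idempotent.
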
  59. •unbound cases: •ban profile → remove all content •update timeline → send push to all subscribed devices •use riak_pipe Events 3
  60. •part of Riak internals •map/reduce flavored with unix pipes •declarative fittings •custom routing •back-pressure control •logging and tracing •handoffs riak_pipe
  61. riak_pipe

  62. riak_pipe working on workshop github.com/kachayev/riak-pipe-workshop

  63. Lesson #10: there is no such thing as “exactly once delivery”; back-pressure control is essential
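
Both halves of this lesson can be shown in a few lines, as a sketch under assumptions: a handler that deduplicates by message id, so that at-least-once redelivery is harmless, and a bounded queue that refuses work when full. The names `IdempotentHandler` and `offer` and the queue size are hypothetical.

```python
import queue


class IdempotentHandler:
    """Processes each message id at most once, so redelivery is harmless."""

    def __init__(self):
        self.seen = set()
        self.sent = []

    def handle(self, message):
        if message["id"] in self.seen:
            return False  # duplicate delivery, ignore
        self.seen.add(message["id"])
        self.sent.append(message["body"])
        return True


inbox = queue.Queue(maxsize=2)  # bounded: producers must slow down when full


def offer(message):
    """Try to enqueue; False is the back-pressure signal to the caller."""
    try:
        inbox.put_nowait(message)
        return True
    except queue.Full:
        return False


accepted = [offer({"id": i, "body": "push %d" % i}) for i in (1, 2, 3)]
handler = IdempotentHandler()
while not inbox.empty():
    message = inbox.get_nowait()
    handler.handle(message)
    handler.handle(message)  # simulate at-least-once redelivery
```
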
  64. •it matters! •cases: RPC definitions, permissions etc •-define(MACRO, ...) •... great, but sometimes inconvenient •parse_transform •... great, but hard to develop & support •Elixir? no, thanks Meta programming
  65. •it matters! •public API description, at least •our solution: parser for test logs (in Python) Documentation
  66. •... an external tool is not so easy to support •edoc? •parse_transform? e.g. -doc() Documentation
  67. Lesson #11 meta programming matters documentation matters

  68. •don’t use hot swapping for releases •reltool to prepare package(s) •run_erl to run VM as a daemon •shell script for common operations: start, stop, restart, attach •shell script for cluster operations (wrapper for node calls): join, leave, status (ring & members) Deployment
  69. •rebar generate to /opt/gomer/<version>/* •shared directory for compiled deps: much faster get-deps & compile •zip and store on S3 •download from S3, unzip, relink •fabric (Python) for automation Deployment
  70. Lesson #12: we still don’t know the best way to deploy an application across the cluster
  71. Lesson #13 Another to_erl process already attached to pipe

  72. Lesson #14: there is a big difference between ^C (stop VM) and ^D (quit)
  73. •a lot of log messages •papertrailapp.com for all concerned •dbg on live server •a few helpers of our own for the most common cases •“trace_off” on timeout Debugging
  74. Debugging

  75. •erlang.org/doc/man/dbg.html •github.com/ferd/recon •erlang.org/doc/man/os_mon_app.html Debugging

  76. Lesson #15: there are a few features in Erlang that you really, really miss when using other technologies
  77. •2 engineers •“2 weeks” to start writing production code •ha. first feature: on the second day * •* first day: tripped up by Mac OS The team
  78. Lesson #16 Erlang is a good technology to hire good

  79. •guys from Wooga •guys from Yammer •guys from Basho Thanks

  80. •github.com/eproxus/meck •github.com/uwiger/gproc •github.com/wooga/etest •github.com/wooga/etest_http •github.com/basho/riak_kv •github.com/basho/riak_core •github.com/basho/riak_pipe •github.com/basho/lager •github.com/marccampbell/flake •github.com/gleber/erlcloud Libraries
  81. •~20-25ms for most responses •100+ connections without any impact •faster than Python & Ruby •not as fast as Scala, Clojure and Go •... but do you really care? Questions #1 Performance
  82. •Erlang (our choice) •Scala (jvm) •Clojure (jvm) •Python (bad fit) •Go (too large a project) •Haskell (bad fit) •Java (oh, come on..) Questions #2 Candidates
  83. •Emacs •VIM Questions #3 IDE?

  84. •we use Go and Clojure for other systems •do you want to ask “Why”? •we are still at an early production stage •wait for new lessons coming soon Notes #1
  85. Ideas? Questions? Alexey Kachayev, 2013