•push notifications
•multi-device synchronization
•offline support for a few features
Special thanks
Slide 15
•high availability (HA)
•plug-in infrastructure
•RPC for thin clients
•stability & guarantees
•quick development
Requirements
Slide 16
Requirements
goals
•high availability (HA)
•plug-in infrastructure
•RPC for thin clients
•stability & guarantees
•quick development
Slide 17
•10k+ mobile applications
•~2k profiles for each
•activity spikes (obviously)
•apps should work independently (*)
Technically
Slide 18
Prototype
Slide 19
•prototype, not so many features
•~3 weeks of active development
•wrapper for CouchDB (in Erlang)
•biggest problem: push notifications
•serves ~75 mobile apps and is still running
“Delaware”
•pretty big project
•very dynamic
•quickly growing in size & features
so?
Slide 26
System design
Slide 27
•graph-oriented (like Facebook)
•Riak for most data: nodes, links, streams
•etcd for consistent cases (Raft consensus): settings, cluster structure
•in-memory ETS: cache, sync ordering
•pre-built data for reading
Data
Slide 28
•nodes: id, rev, attrs, system flags
•links: from-id, to-id, type
•links hold an essential part of the logic, e.g. a session is a link from a profile to a device
•Facebook TAO model: fetching nodes and the simplest link-walking
•implemented as an independent library (sketch below)
Graph
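A rough sketch of how this node/link model can look in Erlang. The record fields follow the bullets above; the module name and the storage callbacks (Store:links_from/2, Store:get_node/1) are invented here for illustration, since the real library is internal:

    -module(graph).
    -export([neighbours/3]).

    %% entities stored in Riak, modeled after the bullets above
    -record(node, {id, rev, attrs = [], flags = []}).
    -record(link, {from, to, type}).

    %% The simplest link-walking (TAO-style): fetch all nodes reachable
    %% from FromId over links of a given Type. Store is an assumed
    %% storage callback module sitting on top of Riak.
    neighbours(Store, FromId, Type) ->
        Links = Store:links_from(FromId, Type),
        [Store:get_node(To) || #link{to = To} <- Links].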
Slide 29
•revision control for each entity
•to ensure all client calls are idempotent
•k-ordering for cursor-based sync (**)
•flake library (snowflake-like; layout sketch below)
•another option: riak_id
K-ordering
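flake produces roughly time-ordered (k-ordered) 128-bit ids. A hand-rolled illustration of the same layout, not the actual flake API: timestamp in the most significant bits, then a worker id and a sequence counter.

    -module(kid).
    -export([new/2]).

    %% WorkerId: 48-bit node identifier (e.g. a MAC address),
    %% Seq: per-millisecond counter maintained by the caller.
    %% Ids sort by creation time because the millisecond timestamp
    %% occupies the most significant bits.
    new(WorkerId, Seq) ->
        {Mega, Sec, Micro} = os:timestamp(),
        Millis = (Mega * 1000000 + Sec) * 1000 + Micro div 1000,
        <<Millis:64, WorkerId:48, Seq:16>>.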
Slide 30
K-ordering
** the client tells the server the max revision it has ever seen (a.k.a. the cursor);
the server sends back changed data only (current rev > client max rev), as in the sketch below
•cases: follow-ups, notes, cleared messages etc
•event-sourcing (both server & clients)
•last-write-wins (LWW) for conflicting rewrites
Offline support
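A minimal sketch of the cursor rule above, assuming every entity carries a k-ordered rev; module name and data shapes are illustrative:

    -module(sync).
    -export([changes_since/2]).

    %% The client sends the max revision it has ever seen (the cursor);
    %% the server returns only entities with a newer revision.
    %% Entities are {Id, Rev, Data} tuples here.
    changes_since(Cursor, Entities) ->
        [E || {_Id, Rev, _Data} = E <- Entities, Rev > Cursor].

With fixed-size binary revs (like the flake ids above) the > comparison is byte-wise and therefore time-ordered, which is exactly what the cursor needs.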
Slide 35
•used to avoid copying state around in a gen_server
•2 approaches (we use both; sketch of the first below):
•the supervisor creates the ETS table and gives it to the child at start
•the server creates the ETS table and fills it with data in each gen_server:init
ETS
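A sketch of the first approach, with a hypothetical cache_srv worker: the supervisor owns a public named table, so the table survives worker restarts and the worker never has to copy the data into its gen_server state.

    -module(cache_sup).
    -behaviour(supervisor).
    -export([start_link/0, init/1]).

    start_link() ->
        supervisor:start_link({local, ?MODULE}, ?MODULE, []).

    init([]) ->
        %% the supervisor creates the table and hands its name to the child;
        %% cache_srv is a hypothetical gen_server that reads/writes it
        cache = ets:new(cache, [named_table, public, set,
                                {read_concurrency, true}]),
        Worker = {cache_srv, {cache_srv, start_link, [cache]},
                  permanent, 5000, worker, [cache_srv]},
        {ok, {{one_for_one, 5, 10}, [Worker]}}.

In the second approach the table is created and filled inside the worker's own gen_server:init/1, so the data is rebuilt on every restart.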
Slide 36
Lesson #1
graph-oriented data is a good fit
(most) graph databases are strange
Slide 37
Lesson #2
data modeling is hard
any kind of consistency is hard
Slide 38
Lesson #3
Erlang is good for async data processing
Slide 39
Lesson #4
each mobile client is a part of a single distributed system
Slide 40
•started with “process per device”
•easy to start: the client is an Actor
•not really HA
•a bad fit for a small (few-node) cluster
•many problems with event routing
•reimplemented
Processes v.1
Slide 41
•riak_core vnodes ring
•riak_pipe vnodes ring
•supervisor for each app
•auth
•profiles ordering
•Twitter reader
•RSS reader
Processes v.2
Slide 42
Lesson #5
an obvious solution can be a bad fit
“fail fast” in your decisions
Slide 43
•vnodes ring
•“service” and compatibility tracking
•consistent hashing for task routing (routing sketch below)
•handoffs
•join, leave, status, membership
•CLI admin interface
riak_core
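A sketch of task routing on top of riak_core's consistent hashing; myapp (the registered service) and myapp_vnode_master are placeholder names for illustration:

    -module(myapp_router).
    -export([route/2]).

    %% Hash the app id onto the ring and send the task to the first
    %% vnode in the preference list for that key.
    route(AppId, Task) ->
        DocIdx = riak_core_util:chash_key({<<"apps">>, AppId}),
        [IndexNode | _] = riak_core_apl:get_apl(DocIdx, 1, myapp),
        riak_core_vnode_master:command(IndexNode, {task, Task},
                                       myapp_vnode_master).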
Slide 44
•a few problems
•great facilities with no docs
•... but easy to read the whole source code
•thanks to the guys from Basho for their advice
•waiting for the 2.0 version
riak_core
Slide 45
riak_core
Slide 46
riak_core
Slide 47
Lesson #6
riak_core is a good enough reason for using Erlang
Slide 48
•2-phase: “unit” and “functional”
•eunit (built-in testing framework; example below)
•etest library for functional tests
•functional tests in separate modules
•don’t track coverage
Testing
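For the unit phase, plain eunit is enough; a tiny example against the sync sketch from earlier (the functional phase goes through the public API in separate modules instead):

    -module(sync_tests).
    -include_lib("eunit/include/eunit.hrl").

    changes_since_empty_test() ->
        ?assertEqual([], sync:changes_since(0, [])).

    changes_since_filters_test() ->
        Entities = [{a, 1, x}, {b, 3, y}],
        ?assertEqual([{b, 3, y}], sync:changes_since(2, Entities)).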
Slide 49
•a lot of high-level helpers
•assert functions over JSON structures
•?wait_for macro to test async operations (a possible definition below)
Testing
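The deck doesn't show the macro itself; one plausible shape for ?wait_for is to poll an expression until it becomes true or a retry budget runs out (names and defaults here are guesses):

    %% in test_helpers.hrl (hypothetical)
    -define(wait_for(Expr),
            test_helpers:wait_for(fun() -> Expr end, 50, 100)).

    %% in test_helpers.erl (hypothetical)
    -module(test_helpers).
    -export([wait_for/3]).

    %% Evaluate Fun every Interval ms, at most Retries times.
    wait_for(_Fun, _Interval, 0) ->
        error(wait_for_timeout);
    wait_for(Fun, Interval, Retries) ->
        case Fun() of
            true -> ok;
            _    -> timer:sleep(Interval),
                    wait_for(Fun, Interval, Retries - 1)
        end.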
Slide 50
•mocking: external HTTP endpoints, IP detectors
•meck library: creating modules, history API (example below)
•good enough
•strange “random” problems after recompilation
Mocks
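A small meck example in the spirit of the bullets above; geo_detector stands in for one of the mocked external endpoints:

    -module(mock_tests).
    -include_lib("eunit/include/eunit.hrl").

    geo_detector_test() ->
        %% non_strict lets us mock a module that may not exist in the test env
        ok = meck:new(geo_detector, [non_strict]),
        meck:expect(geo_detector, lookup, fun(_Ip) -> {ok, <<"UA">>} end),
        ?assertEqual({ok, <<"UA">>}, geo_detector:lookup(<<"1.2.3.4">>)),
        %% the history API lets us verify how the mock was called
        ?assert(meck:called(geo_detector, lookup, [<<"1.2.3.4">>])),
        ?assert(meck:validate(geo_detector)),
        meck:unload(geo_detector).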
Slide 51
•test coverage is a key factor for really quick development
•concentrate on “negative” cases
•it’s easy to turn this process into fun
Lesson #7
Slide 52
•a good type system matters
•too many tests just to check input values
•too many tests just to check formatting
•too many tests just to check protocols
Lesson #8
Slide 53
•you need to prepare tests for a multi-node system
•(only) then start working on distribution
•riak_test
•property-based testing: PropEr (example below)
•... both are great, but hard to adopt
Cluster
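A small PropEr property over the sync sketch from earlier: whatever cursor the client sends, nothing at or below it should ever come back.

    -module(sync_props).
    -include_lib("proper/include/proper.hrl").

    prop_changes_since_only_newer() ->
        ?FORALL({Cursor, Revs}, {nat(), list(nat())},
                begin
                    Entities = [{id, R, data} || R <- Revs],
                    lists:all(fun({_, R, _}) -> R > Cursor end,
                              sync:changes_since(Cursor, Entities))
                end).

Run with proper:quickcheck(sync_props:prop_changes_since_only_newer()).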
Slide 54
•make devrel to run 3+ nodes
•it’s fun too
Cluster
Slide 55
Cluster testing
Slide 56
•it’s hard to do everything right on the first try
•it’s impossible to do it on the first try?
•it’s impossible to do it at all?
•more experiments!
Lesson #9
Slide 57
•a lot of async operations
•e.g. like → save in DB → update timeline entry → publish activity stream entry → add notification → send to device
•started with RabbitMQ and an exchange for each event type (easy to start)
•reimplemented
Events
Slide 58
•2 types: bound & unbound
•bound: known number of subscribers
•e.g. “like”
•converting to an “active coordinator”: an FSM under an appropriate supervisor
•event sourcing for fault tolerance
Events 2
Slide 59
•unbound cases:
•ban profile → remove all content
•update timeline → send push to all subscribed devices
•use riak_pipe
Events 3
Slide 60
•part of Riak internals
•map/reduce, flavored with Unix pipes
•declarative fittings (tiny pipeline below)
•custom routing
•back-pressure control
•logging and tracing
•handoffs
riak_pipe
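A tiny pipeline, roughly following the riak_pipe README (it assumes a node with riak_core/riak_pipe running); a real fan-out would chain several fittings instead of this single pass-through one:

    -module(pipe_demo).
    -export([demo/0]).
    -include_lib("riak_pipe/include/riak_pipe.hrl").

    demo() ->
        %% one pass-through fitting; `follow` keeps work on the same partition
        Spec = [#fitting_spec{name     = pass,
                              module   = riak_pipe_w_pass,
                              chashfun = follow}],
        {ok, Pipe} = riak_pipe:exec(Spec, []),
        riak_pipe:queue_work(Pipe, {push, device_1, <<"hello">>}),
        riak_pipe:eoi(Pipe),                 % signal end of inputs
        riak_pipe:collect_results(Pipe).     % -> {eoi, Results, Logs}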
Slide 61
riak_pipe
Slide 62
riak_pipe
working on a workshop:
github.com/kachayev/riak-pipe-workshop
Slide 63
Lesson #10
there is no such thing as “exactly once delivery”
back-pressure control is essential
Slide 64
•it matters!
•cases: RPC definitions, permissions etc
•-define(MACRO, ...)
•... great, but sometimes inconvenient
•parse_transform (skeleton below)
•... great, but hard to develop & support
•Elixir? no, thanks
Meta programming
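The smallest possible parse_transform, just to show the moving parts: it receives the module's abstract forms at compile time and returns (possibly rewritten) forms. A real one would, for example, read permission attributes and generate functions from them.

    -module(noop_transform).
    -export([parse_transform/2]).

    %% Enabled in a target module with:
    %%   -compile({parse_transform, noop_transform}).
    parse_transform(Forms, _Options) ->
        %% inspect or rewrite Forms here; returning them unchanged is a no-op
        Forms.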
Slide 65
•it matters!
•public API description, at least
•our solution: a parser for test logs (in Python)
Documentation
Slide 66
•... an external tool is not so easy to support
•edoc? (example below)
•parse_transform? e.g. a -doc() attribute
Documentation
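For reference, this is what the edoc route looks like: doc comments next to the code, rendered to HTML with edoc. The function is the sync sketch from earlier, now annotated.

    %% @doc Cursor-based sync helpers.
    -module(sync).
    -export([changes_since/2]).

    %% @doc Return only the entities newer than the client's cursor.
    %%      Used by the public sync endpoint.
    -spec changes_since(term(), [{term(), term(), term()}]) -> [tuple()].
    changes_since(Cursor, Entities) ->
        [E || {_Id, Rev, _Data} = E <- Entities, Rev > Cursor].

HTML is generated with edoc:files(["src/sync.erl"], [{dir, "doc"}]).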
Slide 67
Lesson #11
meta programming matters
documentation matters
Slide 68
•don’t use hot swapping for releases
•reltool to prepare package(s)
•run_erl to run VM as a daemon
•shell script for common operations: start, stop, restart, attach
•shell script for cluster operations (wrapper for node calls): join, leave, status (ring & members)
Deployment
Slide 69
•rebar generate to /opt/gomer//*
•shared directory for compiled deps: much faster get-deps & compile
•zip and store on S3
•download from S3, unzip, relink
•fabric (Python) for automation
Deployment
Slide 70
Lesson #12
still don’t know what the best way to deploy an application across the cluster is
Slide 71
Lesson #13
“Another to_erl process already attached to pipe”
Slide 72
Lesson #14
there is a big difference between ^C (stop VM) and ^D (quit)
Slide 73
•a lot of log messages
•papertrailapp.com for all concerned
•dbg on a live server (sketch below)
•a few of our own helpers for the most common cases
•“trace off” automatically on a timeout
Debugging
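A sketch of the kind of helper meant by “dbg on a live server with trace off on a timeout”: set a trace pattern on one function and schedule an automatic stop, so a forgotten trace cannot flood the node.

    -module(trace_helper).
    -export([calls/3]).

    %% Trace calls (and return values) of Mod:Fun for TimeoutMs, then stop.
    calls(Mod, Fun, TimeoutMs) ->
        {ok, _} = dbg:tracer(),
        {ok, _} = dbg:p(all, call),
        {ok, _} = dbg:tpl(Mod, Fun, [{'_', [], [{return_trace}]}]),
        %% automatic "trace off": clear all patterns after the timeout
        timer:apply_after(TimeoutMs, dbg, stop_clear, []).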
•~20-25ms for most responses
•100+ connections without any impact
•faster than Python & Ruby
•not as fast as Scala, Clojure and Go
•... but do you really care?
Questions #1
Performance