Slide 1

Slide 1 text

Social platform in Erlang: Lessons Learned. Alexey Kachayev, 2013

Slide 2

Slide 2 text

About me •CTO at •Erlang, Go, Clojure, Scala •CPython & Twitter Storm contributor •Author of library •Hobbies: Haskell, Scheme, Racket, CRDT, type systems, compilers

Slide 3

Slide 3 text

Contacts •@kachayev •github: kachayev •[email protected]

Slide 4

Slide 4 text

Will tell you •project goals and challenges •tech. stack that we use •development •testing, deployment, debugging •problems & solutions

Slide 5

Slide 5 text

unstructured content too much information at least 4 different talks

Slide 6

Slide 6 text

Will not tell •why Erlang is cool •how Erlang is cool •why you should use Erlang •1 million concurrent users •1000+ node cluster

Slide 7

Slide 7 text

Slide 8

Slide 8 text

social platform hundreds of mobile apps “little facebook” inside each one

Slide 9

Slide 9 text

social platform thousands (not hundreds) of mobile apps

Slide 10

Slide 10 text

social platform “Yammer” for events (at least technically)

Slide 11

Slide 11 text

tons of features

Slide 12

Slide 12 text

no, really! tons of features

Slide 13

Slide 13 text

• profiles • social network accounts • chat • posts • photos • photo albums • timeline • notifications • likes • permissions •replies •following •tweets tracking •activity streams •checkins •notes •sharing •search •RSS news tracking •blocking & ban

Slide 14

Slide 14 text

Special thanks •push notifications •multi-device synchronization •offline support for a few features

Slide 15

Slide 15 text

Requirements •high availability (HA) •plug-in infrastructure •RPC for thin clients •stability & guarantees •quick development

Slide 16

Slide 16 text

Requirements goals •high availability (HA) •plug-in infrastructure •RPC for thin clients •stability & guarantees •quick development

Slide 17

Slide 17 text

Technically •10k++ mobile applications •~2k profiles for each •activity spikes (obviously) •apps should work independently (*)

Slide 18

Slide 18 text


Slide 19

Slide 19 text

“Delaware” •prototype, not so many features •~3 weeks of active development •wrapper for CouchDB (in Erlang) •biggest problem: push notifications •serves ~75 mobile apps and is still running

Slide 20

Slide 20 text

“Delaware” implementation check SNS integration switch admin panel

Slide 21

Slide 21 text


Slide 22

Slide 22 text

“Gomer” •4 months of active development •2 engineers •10 repos •1,178 commits

Slide 23

Slide 23 text

“Gomer” •~46k LOC •53 “xxx” comments (incl. 3 “xxx!!!”) •47 libraries (incl. 3 forks) •117 public RPC methods

Slide 24

Slide 24 text

“Gomer” •1,148 test cases •make testall •390 apps •1,195 devices •19,047 log messages

Slide 25

Slide 25 text

•pretty big project •very dynamic •quickly growing in size & features so?

Slide 26

Slide 26 text

System design

Slide 27

Slide 27 text

Data •graph-oriented (like Facebook) •Riak for most data: nodes, links, streams •etcd where consistency is required (Raft consensus): settings, cluster structure •in-memory ETS: cache, sync ordering •pre-built data for reading

Slide 28

Slide 28 text

Graph •nodes: id, rev, attrs, system flags •links: from-id, to-id, type •holds an essential part of the logic, e.g. a session is a link from a profile to a device •Facebook TAO model: fetching nodes and the simplest link walking •implemented as an independent library
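The node and link model above can be sketched in a few lines. This is a minimal Python illustration, not the project's Erlang library: the `Graph` class, its method names and the in-memory dictionaries are assumptions made for the example, and the "system flags" field is left out.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Node:
    id: str
    rev: int
    attrs: dict = field(default_factory=dict)


class Graph:
    """Toy in-memory store: fetch nodes by id, walk links by (from_id, type)."""

    def __init__(self):
        self.nodes = {}
        self.links = defaultdict(list)  # (from_id, link_type) -> [to_id, ...]

    def put_node(self, node):
        self.nodes[node.id] = node

    def add_link(self, from_id, to_id, link_type):
        self.links[(from_id, link_type)].append(to_id)

    def get_node(self, node_id):
        return self.nodes.get(node_id)

    def walk(self, from_id, link_type):
        """Return the nodes reachable over one link of the given type."""
        to_ids = self.links.get((from_id, link_type), [])
        return [self.nodes[to_id] for to_id in to_ids if to_id in self.nodes]


g = Graph()
g.put_node(Node("profile:1", rev=1, attrs={"name": "alice"}))
g.put_node(Node("device:9", rev=1, attrs={"os": "ios"}))
g.add_link("profile:1", "device:9", "session")  # a session is just a link
sessions = g.walk("profile:1", "session")
```

Only two read operations exist here, fetching a node and walking one link type from a node, which is the restricted query surface the TAO bullet refers to.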

Slide 29

Slide 29 text

K-ordering •revision control for each entity •to ensure all client calls are idempotent •k-ordering for cursor-based sync (**) •flake library (snowflake-like) •one more: riak_id
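A rough idea of how snowflake-like, k-ordered identifiers are built, as a Python sketch. The bit layout (milliseconds, then a 16-bit worker id, then a 16-bit sequence) and the `KOrderedIds` name are assumptions chosen for the example; they are not the layout used by flake or riak_id.

```python
import time


class KOrderedIds:
    """Roughly time-ordered ids: (milliseconds, worker id, sequence) packed into one int."""

    def __init__(self, worker_id, clock=lambda: int(time.time() * 1000)):
        self.worker_id = worker_id & 0xFFFF  # 16 bits, assumed layout
        self.clock = clock
        self.last_ms = -1
        self.seq = 0

    def next_id(self):
        now = self.clock()
        if now < self.last_ms:
            raise RuntimeError("clock moved backwards")
        if now == self.last_ms:
            self.seq += 1  # same millisecond: bump the sequence
            if self.seq > 0xFFFF:
                raise RuntimeError("sequence exhausted for this millisecond")
        else:
            self.last_ms, self.seq = now, 0
        return (now << 32) | (self.worker_id << 16) | self.seq


ticks = iter([100, 100, 101])  # fake clock so the example is deterministic
gen = KOrderedIds(worker_id=7, clock=lambda: next(ticks))
a, b, c = gen.next_id(), gen.next_id(), gen.next_id()
```

Ids from one generator always increase. Ids from different workers are ordered by time first, so they are only approximately ordered across nodes, within clock skew, which is what the "k" in k-ordering stands for.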

Slide 30

Slide 30 text

K-ordering ** the client tells the server the max revision it has ever seen (a.k.a. cursor); the server sends changed data only (current rev > client max rev)
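The cursor exchange described above can be sketched as a single function. This is an illustrative Python version under simple assumptions: entities are plain dictionaries with a numeric `rev`, and `changes_since` is a hypothetical name, not one of the platform's RPC methods.

```python
def changes_since(entities, cursor):
    """Return (entities with rev > cursor, ordered by rev) and the new cursor."""
    changed = sorted((e for e in entities if e["rev"] > cursor), key=lambda e: e["rev"])
    new_cursor = changed[-1]["rev"] if changed else cursor
    return changed, new_cursor


server_state = [
    {"id": "post:1", "rev": 10},
    {"id": "post:2", "rev": 12},
    {"id": "like:7", "rev": 15},
]

# The client has seen everything up to rev 10.
changed, cursor = changes_since(server_state, cursor=10)

# Asking again with the returned cursor yields nothing new.
again, cursor_again = changes_since(server_state, cursor)
```

Repeating a call with the same cursor returns the same answer, so a client that retries after a dropped connection does no harm.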

Slide 31

Slide 31 text

• (Scala) • (Erlang) * • (Erlang) K-ordering

Slide 32

Slide 32 text

Streams • •Actor, Action, Object, Target •cases: timelines, activity streams, chats, notification center •linked lists •cursor-based fetch
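A stream stored as a linked list with cursor-based paging might look like the following Python sketch. The `Entry` fields follow the Actor, Action, Object, Target bullet; the `Stream` class, the integer ids and the page size are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    id: int
    actor: str
    action: str
    object: str
    target: Optional[str]
    prev: Optional[int]  # id of the previous (older) entry, None at the tail


class Stream:
    def __init__(self):
        self.entries = {}
        self.head = None  # id of the newest entry

    def publish(self, actor, action, obj, target=None):
        entry_id = len(self.entries) + 1
        self.entries[entry_id] = Entry(entry_id, actor, action, obj, target, self.head)
        self.head = entry_id
        return entry_id

    def fetch(self, cursor=None, limit=2):
        """Return up to `limit` entries older than `cursor`, newest first, plus the next cursor."""
        current = self.head if cursor is None else self.entries[cursor].prev
        page = []
        while current is not None and len(page) < limit:
            page.append(self.entries[current])
            current = self.entries[current].prev
        next_cursor = page[-1].id if page else cursor
        return page, next_cursor


timeline = Stream()
timeline.publish("alice", "post", "photo:1")
timeline.publish("bob", "like", "photo:1", target="alice")
timeline.publish("carol", "reply", "photo:1", target="alice")

page1, cur1 = timeline.fetch()             # the two newest entries
page2, cur2 = timeline.fetch(cursor=cur1)  # the remaining older entry
```

Each fetch only follows `prev` pointers from the cursor, so its cost depends on the page size rather than on the total length of the stream.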

Slide 33

Slide 33 text


Slide 34

Slide 34 text

Offline support •cases: follow-ups, notes, cleared messages etc. •event sourcing (both server & clients) •LWW (last write wins) for conflicting rewrites
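Event sourcing with last-write-wins can be illustrated by replaying a merged log. This Python sketch assumes every event carries a timestamp `ts` and an `origin`; the event shape, the `replay` name and the tie-break on origin are assumptions for the example, not the platform's actual protocol.

```python
def replay(events):
    """Fold an event log into state; for conflicting writes to one key the latest timestamp wins."""
    state = {}
    # Sorting by (timestamp, origin) makes the result independent of arrival order.
    # The origin tie-break is an arbitrary but deterministic assumption.
    for ev in sorted(events, key=lambda e: (e["ts"], e["origin"])):
        if ev["op"] == "set":
            state[ev["key"]] = ev["value"]
        elif ev["op"] == "clear":
            state.pop(ev["key"], None)
    return state


server_log = [
    {"ts": 1, "origin": "server", "op": "set", "key": "note:1", "value": "draft"},
]
offline_log = [
    {"ts": 3, "origin": "phone", "op": "set", "key": "note:1", "value": "edited offline"},
    {"ts": 2, "origin": "tablet", "op": "set", "key": "note:1", "value": "edited on tablet"},
]

merged = replay(server_log + offline_log)
```

Because the fold sorts before applying, server and clients reach the same state regardless of the order in which offline events are uploaded.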

Slide 35

Slide 35 text

ETS •used to avoid copying state in gen_server •2 approaches (we use both): •supervisor creates ETS and gives it to child at start •server creates ETS and fills it with data on each gen_server:init

Slide 36

Slide 36 text

Lesson #1 graph oriented data is a good fit (most) graph databases are strange

Slide 37

Slide 37 text

Lesson #2 data modeling is hard any kind of consistency is hard

Slide 38

Slide 38 text

Lesson #3 Erlang is good for async data processing

Slide 39

Slide 39 text

Lesson #4 each mobile client is a part of single distributed system

Slide 40

Slide 40 text

Processes v.1 •started from “process per device” •easy to start, client is an Actor •not really HA •bad fit for a cluster of only a few nodes •many problems with event routing •reimplemented

Slide 41

Slide 41 text

Processes v.2 •riak_core vnodes ring •riak_pipe vnodes ring •supervisor for each app •auth •profiles ordering •twitter reader •rss reader

Slide 42

Slide 42 text

Lesson #5 obvious solution can be a bad fit “fail fast” in your decisions

Slide 43

Slide 43 text

riak_core •vnodes ring •“service” and compatibility tracking •consistent hashing for task routing •handoffs •join, leave, status, membership •CLI admin interface
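The routing idea behind a vnode ring can be shown with a generic consistent-hashing sketch in Python. It is not riak_core's ring implementation: the `Ring` class, the SHA-1 based hash and the 64 virtual points per node are assumptions picked for the example.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")


class Ring:
    def __init__(self, nodes, vnodes=64):
        # Every physical node owns several points (virtual nodes) on the ring.
        self.points = sorted((_hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self.keys = [h for h, _ in self.points]

    def owner(self, task_key):
        """Route a task to the first virtual node clockwise from its hash."""
        idx = bisect.bisect_right(self.keys, _hash(task_key)) % len(self.points)
        return self.points[idx][1]


ring = Ring(["node-a", "node-b", "node-c"])
owner = ring.owner("app:42")
```

When a node joins, only the keys that now hash to the newcomer's points change owner; everything else stays where it was, which keeps handoffs small.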

Slide 44

Slide 44 text

riak_core •few problems •great facilities with no docs •... but it is easy to read the whole source code •thanks to the guys from Basho for their advice •waiting for version 2.0

Slide 45

Slide 45 text


Slide 46

Slide 46 text


Slide 47

Slide 47 text

Lesson #6 riak_core is a good enough reason for using Erlang

Slide 48

Slide 48 text

Testing •2-phase: “unit” and “functional” •eunit (built-in testing framework) •etest library for functional tests •functional tests in separate modules •don’t track coverage

Slide 49

Slide 49 text

Testing •a lot of high-level helpers •assert functions over JSON structure •?wait_for macro to test async operations
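The slides do not show how the ?wait_for macro is defined. A polling helper of the kind it suggests could look like this Python sketch, where the name `wait_for`, the timeout and the polling interval are all assumptions.

```python
import threading
import time


def wait_for(predicate, timeout=1.0, interval=0.01):
    """Poll `predicate` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Simulate an asynchronous side effect that lands a little later.
inbox = []
threading.Timer(0.05, lambda: inbox.append("push sent")).start()
delivered = wait_for(lambda: "push sent" in inbox)
```

A test then asserts on the helper's result instead of sleeping for a fixed time, so it finishes as soon as the asynchronous effect is visible and fails only after the timeout.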

Slide 50

Slide 50 text

Mocks •mocking: external HTTP endpoints, IP detectors •meck library: creating modules, history API •good enough •strange “random” problems after recompilation

Slide 51

Slide 51 text

Lesson #7 •test coverage is a key factor for really quick development •concentrate on “negative” cases •it’s easy to turn this process into fun

Slide 52

Slide 52 text

Lesson #8 •a good type system matters •too many tests to check input values •too many tests to check formatting •too many tests to check protocols

Slide 53

Slide 53 text

Cluster •you need to prepare tests for a multi-node system •(only) then start working on distribution •riak_test •property testing: PropEr •... both are great, but hard to adopt

Slide 54

Slide 54 text

Cluster •make devrel to run 3+ nodes •it’s fun too

Slide 55

Slide 55 text

Cluster testing

Slide 56

Slide 56 text

Lesson #9 •it’s hard to do everything right on the first try •it’s impossible to do it on the first try? •it’s impossible to do it at all? •more experiments!

Slide 57

Slide 57 text

Events •a lot of async operations •e.g. like → save in DB → update timeline entry → publish activity stream entry → add notification → send to device •started with RabbitMQ and an exchange for each event type (easy to start) •reimplemented

Slide 58

Slide 58 text

Events 2 •2 types: bound & unbound •bound: known number of subscribers •e.g. “like” •converting to “active coordinator”: FSM under appropriate supervisor •sourcing for fault-tolerance

Slide 59

Slide 59 text

Events 3 •unbound cases: •ban profile → remove all content •update timeline → send push to all subscribed devices •use riak_pipe

Slide 60

Slide 60 text

riak_pipe •part of Riak internals •map/reduce flavored with unix pipes •declarative fittings •custom routing •back-pressure control •logging and tracing •handoffs

Slide 61

Slide 61 text


Slide 62

Slide 62 text

riak_pipe working on workshop

Slide 63

Slide 63 text

Lesson #10 there is no such thing as “exactly once delivery” back-pressure control is essential
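Both halves of this lesson fit in one small Python sketch: a bounded inbox that pushes back on producers, and a consumer that tolerates redelivery by remembering event ids. The `Worker` class, the event shape and the capacity are assumptions for illustration; riak_pipe's own back-pressure mechanism is not shown here.

```python
import queue


class Worker:
    def __init__(self, capacity):
        self.inbox = queue.Queue(maxsize=capacity)  # bounded: this is the back-pressure
        self.seen = set()
        self.sent = []

    def offer(self, event):
        """Producer side: returns False instead of buffering without limit."""
        try:
            self.inbox.put_nowait(event)
            return True
        except queue.Full:
            return False

    def drain(self):
        """Consumer side: handle each event id once, even if it was delivered twice."""
        while not self.inbox.empty():
            event = self.inbox.get_nowait()
            if event["id"] in self.seen:  # redelivery: already handled
                continue
            self.seen.add(event["id"])
            self.sent.append(event["payload"])


worker = Worker(capacity=2)
events = [
    {"id": 1, "payload": "push to device A"},
    {"id": 1, "payload": "push to device A"},  # duplicate delivery of the same event
    {"id": 2, "payload": "push to device B"},
]
accepted = [worker.offer(e) for e in events]  # the third offer is rejected: inbox is full
worker.drain()
retried = worker.offer(events[2])             # the producer retries once there is room
worker.drain()
```

The producer has to handle a rejected offer by retrying or dropping, and the retry is exactly why the consumer must deduplicate: delivery becomes at-least-once, and the id check turns it into effectively-once processing.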

Slide 64

Slide 64 text

Meta programming •it matters! •cases: RPC definitions, permissions etc •-define(MACRO, ...) •... great, but sometimes inconvenient •parse_transform •... great, but hard to develop & support •Elixir? no, thanks

Slide 65

Slide 65 text

Documentation •it matters! •public API description, at least •our solution: parser for test logs (in Python)

Slide 66

Slide 66 text

Documentation •... an external tool is not so easy to support •edoc ? •parse_transform ? i.e. -doc()

Slide 67

Slide 67 text

Lesson #11 meta programming matters documentation matters

Slide 68

Slide 68 text

Deployment •don’t use hot swapping for releases •reltool to prepare package(s) •run_erl to run VM as a daemon •shell script for common operations: start, stop, restart, attach •shell script for cluster operations (wrapper for node calls): join, leave, status (ring & members)

Slide 69

Slide 69 text

Deployment •rebar generate to /opt/ gomer//* •shared directory for compiled deps: much faster get-deps & compile •zip and store on S3 •download from S3, unzip, relink •fabric (Python) for automation

Slide 70

Slide 70 text

Lesson #12 we still don’t know the best way to deploy an application across the cluster

Slide 71

Slide 71 text

Lesson #13 Another to_erl process already attached to pipe

Slide 72

Slide 72 text

Lesson #14 there is a big difference between ^C (stop VM) and ^D (quit)

Slide 73

Slide 73 text

Debugging •a lot of log messages • for all concerned •dbg on live server •a few helpers of our own for the most common cases •“trace_off” on timeout

Slide 74

Slide 74 text


Slide 75

Slide 75 text

• • • Debugging

Slide 76

Slide 76 text

Lesson #15 there are few features in Erlang that you really-really miss when using other technologies

Slide 77

Slide 77 text

The team •2 engineers •“2 weeks” to start writing production code •ha. first feature - on the second day * •* first day - tripped up by Mac OS

Slide 78

Slide 78 text

Lesson #16 Erlang is a good technology to hire good engineers

Slide 79

Slide 79 text

Thanks to •guys from Wooga •guys from Yammer •guys from Basho

Slide 80

Slide 80 text

• • • • • • • • • • Libraries

Slide 81

Slide 81 text

Questions #1 Performance •~20-25ms for most responses •100+ connections without any impact •faster than Python & Ruby •not as fast as Scala, Clojure and Go •... but do you really care?

Slide 82

Slide 82 text

Questions #2 Candidates •Erlang (our choice) •Scala (jvm) •Clojure (jvm) •Python (bad fit) •Go (too large project) •Haskell (bad fit) •Java (oh, come on..)

Slide 83

Slide 83 text

Questions #3 IDE? •Emacs •VIM

Slide 84

Slide 84 text

Notes #1 •we use Go and Clojure for other systems •do you want to ask “Why”? •we are still at an early production stage •wait for new lessons coming soon

Slide 85

Slide 85 text

Ideas? Questions? Alexey Kachayev, 2013