Erlang in production. Lessons learned

Social platform in Erlang. Lessons Learned. Alexey Kachayev, 2013

About me
• CTO at Attendify.com
• Erlang, Go, Clojure, Scala
• CPython & Twitter Storm contributor
• Author of fn.py library
• Hobbies: Haskell, Scheme, Racket, CRDT, type systems, compilers

Contacts •@kachayev •github: kachayev •[email protected]

Will tell you
• project goals and challenges
• tech stack that we use
• development
• testing, deployment, debugging
• problems & solutions

unstructured content too much information at least 4 different talks

Will not tell
• why Erlang is cool
• how Erlang is cool
• why you should use Erlang
• 1 million concurrent users
• 1000+ node cluster

attendify.com

social platform hundreds of mobile apps “little facebook” inside each one

social platform hundreds thousands of mobile apps

social platform “Yammer” for events (at least technically)

tons of features

no, really! tons of features

• profiles
• social network accounts
• chat
• posts
• photos
• photo albums
• timeline
• notifications
• likes
• permissions
• replies
• following
• tweets tracking
• activity streams
• checkins
• notes
• sharing
• search
• RSS news tracking
• blocking & ban

Special thanks
• push notifications
• multi-device synchronization
• offline support for a few features

Requirements
• high availability (HA)
• plug-in infrastructure
• RPC for thin clients
• stability & guarantees
• quick development

Requirements goals
• high availability (HA)
• plug-in infrastructure
• RPC for thin clients
• stability & guarantees
• quick development

Technically
• 10k++ mobile applications
• ~2k profiles for each
• activity spikes (obviously)
• apps should work independently (*)

Prototype

“Delaware”
• prototype, not so many features
• ~3 weeks of active development
• wrapper for CouchDB (in Erlang)
• biggest problem: push notifications
• serves ~75 mobile apps and still running

“Delaware” implementation check SNS integration switch admin panel

Current

“Gomer”
• 4 months of active development
• 2 engineers
• 10 repos
• 1,178 commits

“Gomer”
• ~46k LOC
• 53 “xxx” comments (incl. 3 “xxx!!!”)
• 47 libraries (incl. 3 forks)
• 117 public RPC methods

“Gomer”
• 1,148 test cases
• make testall
• 390 apps
• 1,195 devices
• 19,047 log messages

so?
• pretty big project
• very dynamic
• quickly growing in size & features

System design

Data
• graph-oriented (like Facebook)
• Riak for most data: nodes, links, streams
• etcd for consistent cases (Raft consensus): settings, cluster structure
• in-memory ETS: cache, sync ordering
• pre-built data for reading

Graph
• nodes: id, rev, attrs, system flags
• links: from-id, to-id, type
• holds an essential part of the logic, e.g. a session is a link from profile to device
• Facebook TAO model: fetching nodes and the simplest link-walking
• implemented as an independent library
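
A minimal sketch of the node/link model described on this slide, assuming a plain in-memory store in place of Riak. The class and method names (`Graph`, `put_node`, `add_link`, `walk`) are hypothetical and are not the API of the library mentioned in the deck; system flags are left out for brevity.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Node:
    id: str
    rev: int
    attrs: dict = field(default_factory=dict)


class Graph:
    """In-memory stand-in for a node/link store (hypothetical API)."""

    def __init__(self):
        self.nodes = {}
        # (from_id, link_type) -> list of to_ids, in insertion order
        self.links = defaultdict(list)

    def put_node(self, node):
        self.nodes[node.id] = node

    def add_link(self, from_id, link_type, to_id):
        targets = self.links[(from_id, link_type)]
        if to_id not in targets:  # adding the same link twice is a no-op
            targets.append(to_id)

    def walk(self, from_id, link_type):
        """Simplest link-walking: follow one link type from one node."""
        targets = self.links[(from_id, link_type)]
        return [self.nodes[t] for t in targets if t in self.nodes]


g = Graph()
g.put_node(Node("profile:1", rev=1, attrs={"name": "Alice"}))
g.put_node(Node("device:9", rev=1, attrs={"platform": "ios"}))
# A session is modeled as a link from a profile to a device.
g.add_link("profile:1", "session", "device:9")
devices = g.walk("profile:1", "session")
```

Keeping business concepts such as sessions as typed links means the store only has to support two operations: fetch a node and follow one link type.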

K-ordering
• revision control for each entity
• to ensure all client calls are idempotent
• k-ordering for cursor-based sync (**)
• flake library (snowflake-like)
• one more, riak_id

K-ordering **
client tells the server the max revision it has ever seen (a.k.a. cursor)
server sends changed data only (current rev > client max rev)
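
The cursor rule above can be sketched in a few lines. This assumes revisions are comparable integers and entities are plain dicts; `changes_since` is a hypothetical name, not part of the system described in the deck.

```python
def changes_since(entities, cursor):
    """Return entities newer than the client's cursor, oldest first,
    plus the new cursor the client should remember."""
    changed = sorted(
        (e for e in entities if e["rev"] > cursor),
        key=lambda e: e["rev"],
    )
    new_cursor = changed[-1]["rev"] if changed else cursor
    return changed, new_cursor


store = [
    {"id": "post:1", "rev": 10},
    {"id": "post:2", "rev": 12},
    {"id": "like:7", "rev": 15},
]

# The client has seen everything up to revision 10.
changed, cursor = changes_since(store, cursor=10)
# Asking again with the returned cursor yields nothing new.
again, same_cursor = changes_since(store, cursor=cursor)
```

Because the client only ever sends the highest revision it holds, repeating a sync request after a dropped response is harmless.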

K-ordering
• github.com/twitter/snowflake (Scala)
• github.com/boundary/flake (Erlang) *
• github.com/seancribbs/riak_id (Erlang)
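
For readers unfamiliar with snowflake-like identifiers, here is a rough sketch of the idea: a timestamp in the high bits, then a worker id, then a per-millisecond sequence, so ids sort roughly by creation time. The bit widths and the `KOrderedIds` class are assumptions for illustration and do not reproduce the layout of any of the libraries listed above.

```python
import threading
import time


class KOrderedIds:
    """Snowflake-like generator: timestamp | worker id | sequence."""

    WORKER_BITS = 10    # illustrative widths, not taken from the deck
    SEQUENCE_BITS = 12

    def __init__(self, worker_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= worker_id < (1 << self.WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self.clock = clock          # injectable so tests can fake time
        self.last_ms = -1
        self.sequence = 0
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = self.clock()
            if now < self.last_ms:
                # Refuse to hand out ids that would sort before earlier ones.
                raise RuntimeError("clock moved backwards")
            if now == self.last_ms:
                self.sequence += 1
                if self.sequence >= (1 << self.SEQUENCE_BITS):
                    raise RuntimeError("sequence exhausted for this millisecond")
            else:
                self.last_ms = now
                self.sequence = 0
            shift = self.WORKER_BITS + self.SEQUENCE_BITS
            return (now << shift) | (self.worker_id << self.SEQUENCE_BITS) | self.sequence


gen = KOrderedIds(worker_id=1)
first = gen.next_id()
second = gen.next_id()  # always greater than `first` on the same worker
```

Ids from different workers are only ordered up to clock skew, which is what the "k" in k-ordering refers to.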

Streams
• activitystrea.ms
• Actor, Action, Object, Target
• cases: timelines, activity streams, chats, notification center
• linked lists
• cursor-based fetch
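
One way to picture a stream stored as a linked list with cursor-based fetch is sketched below. It assumes each entry keeps a pointer to the previous one and that the cursor is simply the id of the next entry to read; the `Stream` class is hypothetical and keeps everything in memory.

```python
class Stream:
    """Newest-first linked list of activities (hypothetical sketch)."""

    def __init__(self):
        self.entries = {}   # entry id -> (activity, id of previous entry or None)
        self.head = None    # id of the newest entry
        self._next_id = 1

    def push(self, actor, action, obj, target=None):
        entry_id = self._next_id
        self._next_id += 1
        activity = {"actor": actor, "action": action, "object": obj, "target": target}
        self.entries[entry_id] = (activity, self.head)
        self.head = entry_id
        return entry_id

    def fetch(self, cursor=None, limit=20):
        """Walk newest-to-oldest starting at `cursor` (or at the head).

        Returns (activities, next_cursor). A next_cursor of None means the
        end of the stream was reached and the caller should stop paging."""
        current = self.head if cursor is None else cursor
        page = []
        while current is not None and len(page) < limit:
            activity, previous = self.entries[current]
            page.append(activity)
            current = previous
        return page, current


timeline = Stream()
timeline.push("alice", "post", "photo:1", target="album:3")
timeline.push("bob", "like", "photo:1")
timeline.push("carol", "reply", "photo:1")

page1, cursor = timeline.fetch(limit=2)                  # carol, bob
page2, cursor2 = timeline.fetch(cursor=cursor, limit=2)  # alice, then end
```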

Streams

Offline support
• cases: follow-ups, notes, cleared messages etc
• event-sourcing (both server & clients)
• LWW for conflicting rewrites
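
A small sketch of last-write-wins (LWW) over a log of client events. It assumes every event carries a timestamp and a client id, with the client id used only to break ties; the event shape and the `merge_lww` name are hypothetical.

```python
def merge_lww(state, events):
    """Apply events to `state`. For each key the write with the highest
    (timestamp, client id) wins, regardless of arrival order."""
    for event in events:
        key = event["key"]
        stamp = (event["ts"], event["client"])
        current = state.get(key)
        if current is None or stamp > current["stamp"]:
            state[key] = {"value": event["value"], "stamp": stamp}
    return state


# Two devices edited the same note while offline, then synced.
phone = [{"key": "note:1", "value": "call Bob", "ts": 100, "client": "phone"}]
tablet = [{"key": "note:1", "value": "call Bob at 5pm", "ts": 130, "client": "tablet"}]

a = merge_lww({}, phone + tablet)
b = merge_lww({}, tablet + phone)   # same result in the opposite order
```

The merge is order-independent, which is what lets server and clients replay each other's events in whatever order they arrive.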

ETS
• used to avoid copying state in gen_server
• 2 approaches (we use both):
  • supervisor creates the ETS table and gives it to the child at start
  • server creates the ETS table and fills it with data on each gen_server:init

Lesson #1
graph-oriented data is a good fit
(most) graph databases are strange

Lesson #2
data modeling is hard
any kind of consistency is hard

Lesson #3 Erlang is good for async data processing

Lesson #4
each mobile client is part of a single distributed system

Processes v.1
• started from “process per device”
• easy to start, client is an Actor
• not really HA
• bad fit for a cluster of a few nodes
• many problems with event routing
• reimplemented

Processes v.2
• riak_core vnodes ring
• riak_pipe vnodes ring
• supervisor for each app
• auth
• profiles ordering
• twitter reader
• rss reader

Lesson #5
an obvious solution can be a bad fit
“fail fast” in your decisions

riak_core
• vnodes ring
• “service” and compatibility tracking
• consistent hashing for task routing
• handoffs
• join, leave, status, membership
• CLI admin interface
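
To illustrate consistent hashing for task routing, here is a simplified ring: the hash space is cut into a fixed number of partitions, each partition is owned by a node, and a task key is routed to the owner of the partition its hash falls into. This is a sketch of the general technique only; the `Ring` class, the round-robin ownership and the partition count are assumptions, and riak_core's real claim and handoff logic is considerably more involved.

```python
import hashlib

RING_TOP = 2 ** 160  # size of the SHA-1 output space


class Ring:
    """Fixed-partition hash ring (hypothetical, simplified)."""

    def __init__(self, nodes, num_partitions=64):
        self.num_partitions = num_partitions
        self.partition_size = RING_TOP // num_partitions
        # Naive round-robin ownership, for illustration only.
        self.owners = [nodes[i % len(nodes)] for i in range(num_partitions)]

    def partition_for(self, key):
        digest = int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest(), "big")
        # min() guards the last slot when the space does not divide evenly.
        return min(digest // self.partition_size, self.num_partitions - 1)

    def node_for(self, key):
        return self.owners[self.partition_for(key)]


nodes = ["gomer1@host", "gomer2@host", "gomer3@host"]
ring = Ring(nodes)
# Every node computes the same owner for the same key, with no coordinator.
owner = ring.node_for("app:42/twitter-reader")
```

Because the partition count is fixed, adding or removing a node changes who owns some partitions but never which partition a key maps to.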

riak_core
• few problems
• great facilities with no docs
• ... but it is easy to read the whole source code
• thanks to the guys from Basho for their advice
• waiting for version 2.0

riak_core

Lesson #6
riak_core is a good enough reason for using Erlang

Testing
• 2-phase: “unit” and “functional”
• eunit (built-in testing framework)
• etest library for functional tests
• functional tests in separate modules
• don’t track coverage

Testing
• a lot of high-level helpers
• assert functions over JSON structure
• ?wait_for macro to test async operations
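
The two helper styles mentioned here can be sketched as follows: a polling wait for asynchronous effects and a subset assertion over a JSON-like structure. Both functions are hypothetical Python analogues written for illustration; they are not the deck's `?wait_for` macro or its assert helpers.

```python
import threading
import time


def wait_for(predicate, timeout=2.0, interval=0.05):
    """Poll `predicate` until it returns true or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            raise AssertionError("condition not met within %.2fs" % timeout)
        time.sleep(interval)


def assert_json_contains(actual, expected, path="$"):
    """Assert that every key/value in `expected` is present in `actual`;
    extra keys in `actual` are ignored."""
    if isinstance(expected, dict):
        assert isinstance(actual, dict), "%s: expected an object" % path
        for key, value in expected.items():
            assert key in actual, "%s: missing key %r" % (path, key)
            assert_json_contains(actual[key], value, "%s.%s" % (path, key))
    else:
        assert actual == expected, "%s: %r != %r" % (path, actual, expected)


# Simulate a notification that arrives asynchronously after 100 ms.
inbox = []
notification = {"type": "like", "actor": {"id": 1, "name": "Alice"}}
threading.Timer(0.1, lambda: inbox.append(notification)).start()

wait_for(lambda: len(inbox) == 1)
assert_json_contains(inbox[0], {"type": "like", "actor": {"id": 1}})
```

Matching only the fields a test cares about keeps functional tests stable when unrelated fields are added to a response.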

Mocks
• mocking: external HTTP endpoints, IP detectors
• meck library: creating modules, history API
• good enough
• strange “random” problems after recompilation

Lesson #7
• test coverage is a key factor for really quick development
• concentrate on “negative” cases
• it’s easy to turn this process into fun

Lesson #8
• a good type system matters
• too many tests to check input values
• too many tests to check formatting
• too many tests to check protocols

Cluster
• you need to prepare tests for a multi-node system
• (only) then start working on distribution
• riak_test
• property testing: PropEr
• ... both are great, but hard to adopt

Cluster
• make devrel to run 3+ nodes
• it’s fun too

Cluster testing

Lesson #9
• it’s hard to do everything right on the first try
• it’s impossible to do it on the first try?
• it’s impossible to do it at all?
• more experiments!

Events
• a lot of async operations
• e.g. like → save in DB → update timeline entry → publish activity stream entry → add notification → send to device
• started with RabbitMQ and an exchange for each event type (easy to start)
• reimplemented

Events 2
• 2 types: bound & unbound
• bound: known number of subscribers
• e.g. “like”
• converting to “active coordinator”: FSM under the appropriate supervisor
• sourcing for fault-tolerance
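
A rough sketch of an "active coordinator" for a bound event such as a like. It assumes the coordinator walks a fixed list of steps and journals each completed step, so that a restarted coordinator can resume where the previous one stopped. Reading "sourcing" as journaling the coordinator's progress is an interpretation, and the step names and `LikeCoordinator` class are hypothetical.

```python
STEPS = [
    "save_in_db",
    "update_timeline",
    "publish_activity",
    "add_notification",
    "send_to_device",
]


class LikeCoordinator:
    """Runs each step once, recording progress in a journal."""

    def __init__(self, handlers, journal=None):
        self.handlers = handlers
        self.journal = journal if journal is not None else []

    def run(self, event):
        done = set(self.journal)
        for step in STEPS:
            if step in done:
                continue                  # finished before a restart
            self.handlers[step](event)    # may raise; journal stays intact
            self.journal.append(step)


calls = []


def make_handler(name):
    return lambda event: calls.append(name)


def unavailable(event):
    raise RuntimeError("timeline store unavailable")


handlers = {step: make_handler(step) for step in STEPS}
journal = []

# First attempt crashes on the second step.
try:
    LikeCoordinator(dict(handlers, update_timeline=unavailable), journal).run({"like": 1})
except RuntimeError:
    pass

# A fresh coordinator (as a supervisor would start) resumes from the journal.
LikeCoordinator(handlers, journal).run({"like": 1})
```

In this sketch a step that crashes before being journaled runs again after the restart, so individual handlers still need to be idempotent.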

Events 3
• unbound cases:
  • ban profile → remove all content
  • update timeline → send push to all subscribed devices
• use riak_pipe

riak_pipe
• part of Riak internals
• map/reduce flavored with unix pipes
• declarative fittings
• custom routing
• back-pressure control
• logging and tracing
• handoffs

riak_pipe

riak_pipe
working on a workshop: github.com/kachayev/riak-pipe-workshop

Lesson #10
there is no such thing as “exactly once delivery”
back-pressure control is essential
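
Both halves of this lesson can be shown in one small sketch: a bounded queue gives back-pressure (the producer blocks while the consumer is behind), and de-duplication by message id makes at-least-once redelivery harmless. The `run_pipeline` function and the message shape are assumptions for illustration and are unrelated to riak_pipe's actual API.

```python
import queue
import threading


def run_pipeline(messages, handle, capacity=2):
    """Feed `messages` to `handle` through a bounded queue, dropping
    duplicate ids."""
    inbox = queue.Queue(maxsize=capacity)  # put() blocks when the queue is full
    seen = set()

    def worker():
        while True:
            msg = inbox.get()
            if msg is None:          # sentinel: producer is done
                return
            if msg["id"] in seen:    # duplicate caused by a redelivery
                continue
            seen.add(msg["id"])
            handle(msg)

    consumer = threading.Thread(target=worker)
    consumer.start()
    for msg in messages:
        inbox.put(msg)               # back-pressure: waits for free capacity
    inbox.put(None)
    consumer.join()


delivered = []
# Message 1 arrives twice, as it might after a retry.
run_pipeline(
    [{"id": 1, "push": "hello"}, {"id": 2, "push": "world"}, {"id": 1, "push": "hello"}],
    delivered.append,
)
```

In a real system the set of seen ids would have to be persisted and bounded; here it lives only for the duration of the call.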

Meta programming
• it matters!
• cases: RPC definitions, permissions etc
• -define(MACRO, ...)
• ... great, but sometimes inconvenient
• parse_transform
• ... great, but hard to develop & support
• Elixir? no, thanks

Documentation
• it matters!
• public API description, at least
• our solution: parser for test logs (in Python)

Documentation
• ... an external tool is not so easy to support
• edoc?
• parse_transform? e.g. -doc()

Lesson #11
meta programming matters
documentation matters

Deployment
• don’t use hot swapping for releases
• reltool to prepare package(s)
• run_erl to run VM as a daemon
• shell script for common operations: start, stop, restart, attach
• shell script for cluster operations (wrapper for node calls): join, leave, status (ring & members)

Deployment
• rebar generate to /opt/gomer/<version>/*
• shared directory for compiled deps: much faster get-deps & compile
• zip and store on S3
• download from S3, unzip, relink
• fabric (Python) for automation

Lesson #12
still don’t know the best way to deploy an application across the cluster

Lesson #13 Another to_erl process already attached to pipe

Lesson #14
there is a big difference between ^C (stop VM) and ^D (quit)

Debugging
• a lot of log messages
• papertrailapp.com for all concerned
• dbg on live server
• a few helpers of our own for the most common cases
• “trace_off” on timeout

Debugging

Debugging
• erlang.org/doc/man/dbg.html
• github.com/ferd/recon
• erlang.org/doc/man/os_mon_app.html

Lesson #15
there are a few features in Erlang that you really-really miss when using other technologies

The team
• 2 engineers
• “2 weeks” to start writing production code
• ha, first feature on the second day (*)
• (*) first day: tripped up by Mac OS

Lesson #16
Erlang is a good technology for hiring good engineers

Thanks to
• guys from Wooga
• guys from Yammer
• guys from Basho

Libraries
• github.com/eproxus/meck
• github.com/uwiger/gproc
• github.com/wooga/etest
• github.com/wooga/etest_http
• github.com/basho/riak_kv
• github.com/basho/riak_core
• github.com/basho/riak_pipe
• github.com/basho/lager
• github.com/marccampbell/flake
• github.com/gleber/erlcloud

Questions #1 Performance
• ~20-25ms for most responses
• 100+ connections without any impact
• faster than Python & Ruby
• not as fast as Scala, Clojure and Go
• ... but do you really care?

Questions #2 Candidates
• Erlang (our choice)
• Scala (jvm)
• Clojure (jvm)
• Python (bad fit)
• Go (too large project)
• Haskell (bad fit)
• Java (oh, come on...)

Questions #3 IDE?
• Emacs
• VIM

Notes #1
• we use Go and Clojure for other systems
• do you want to ask “Why”?
• we are still at an early production stage
• wait for new lessons coming soon

Ideas? Questions? Alexey Kachayev, 2013
