“Delaware”
• prototype, not many features
• ~3 weeks of active development
• wrapper for CouchDB (in Erlang)
• biggest problem: push notifications
• serves ~75 mobile apps and is still running
Data
• graph-oriented (like Facebook)
• Riak for most data: nodes, links, streams
• etcd for cases that need consistency (Raft consensus): settings, cluster structure
• in-memory ETS: cache, sync ordering
• pre-built data for reading
Graph
• nodes: id, rev, attrs, system flags
• links: from-id, to-id, type
• links hold an essential part of the logic, e.g. a session is a link from a profile to a device
• Facebook TAO model: fetching nodes and the simplest link walking
• implemented as an independent library
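The node and link shapes above can be sketched with two records and a one-hop walk. This is only an illustration over in-memory lists: the module name `graph_sketch`, the record layout, and the function names are assumptions, not the API of the library mentioned on the slide.

```erlang
-module(graph_sketch).
-export([links_from/3, neighbours/4, demo/0]).

%% Field names follow the slide; the record layout itself is an assumption.
-record(node, {id, rev, attrs = [], flags = []}).
-record(link, {from, to, type}).

%% All links of a given type that start at FromId.
links_from(FromId, Type, Links) ->
    [L || L = #link{from = F, type = T} <- Links, F =:= FromId, T =:= Type].

%% One-hop walk: fetch the nodes at the far end of those links.
neighbours(FromId, Type, Links, Nodes) ->
    ToIds = [To || #link{to = To} <- links_from(FromId, Type, Links)],
    [N || N = #node{id = Id} <- Nodes, lists:member(Id, ToIds)].

%% A session modelled as a link from a profile to a device.
demo() ->
    Nodes = [#node{id = profile1, rev = 1},
             #node{id = device1, rev = 1},
             #node{id = device2, rev = 1}],
    Links = [#link{from = profile1, to = device1, type = session},
             #link{from = profile1, to = device2, type = owns}],
    [#node{id = device1}] = neighbours(profile1, session, Links, Nodes),
    ok.
```

In a real store the two list scans would become key lookups; the point here is only that "session" needs no table of its own.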
K-ordering
• revision control for each entity
• to ensure all client calls are idempotent
• k-ordering for cursor-based sync (**)
• flake library (snowflake-like)
• one more, riak_id
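A minimal sketch of both ideas, under stated assumptions. `new_id/2` builds a roughly time-ordered binary id in a snowflake-like layout (64-bit millisecond timestamp, 48-bit worker id, 16-bit sequence); the layout is an assumption and not necessarily what flake or riak_id produce. `update/3` shows one way a revision check can keep a retried client call from being applied twice. The module and function names are hypothetical.

```erlang
-module(sync_sketch).
-export([new_id/2, update/3]).

%% Big-endian fields, so comparing two ids as binaries orders them by
%% timestamp first, then worker, then sequence. Ordering is only as good
%% as the wall clock: erlang:system_time/1 is not monotonic.
new_id(WorkerId, Seq) ->
    Ms = erlang:system_time(millisecond),
    <<Ms:64, WorkerId:48, Seq:16>>.

%% Apply an update only if the caller saw the current revision.
%% A retry that still carries the old revision gets a conflict.
update(Entity = #{rev := Rev}, Rev, Attrs) ->
    {ok, Entity#{attrs => Attrs, rev := Rev + 1}};
update(#{rev := _Current}, _StaleRev, _Attrs) ->
    {error, conflict}.
```

A sync cursor can then be just the last id the client has seen: everything that compares greater is newer, give or take clock skew between workers.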
ETS
• used to avoid copying state in gen_server
• 2 approaches (we use both):
• the supervisor creates the ETS table and gives it to the child at start
• the server creates the ETS table and fills it with data on each gen_server:init
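The second approach might look like the sketch below: the server owns a named table, fills it in `init/1`, and readers query ETS directly from their own process. The module name, table name, and key/value data are assumptions for illustration.

```erlang
-module(cache_sketch).
-behaviour(gen_server).
-export([start_link/1, lookup/1]).
-export([init/1, handle_call/3, handle_cast/2]).

-define(TAB, cache_sketch_tab).

%% Pairs is a list of {Key, Value} tuples to preload.
start_link(Pairs) ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, Pairs, []).

%% Runs in the caller's process: no message to the server, no state copy.
lookup(Key) ->
    case ets:lookup(?TAB, Key) of
        [{_, Value}] -> {ok, Value};
        [] -> not_found
    end.

%% The table dies with the server and is rebuilt on every (re)start.
init(Pairs) ->
    Tab = ets:new(?TAB, [named_table, protected, set,
                         {read_concurrency, true}]),
    true = ets:insert(Tab, Pairs),
    {ok, Tab}.

handle_call(_Req, _From, Tab) -> {reply, ok, Tab}.
handle_cast(_Msg, Tab) -> {noreply, Tab}.
```

For the first approach, the supervisor would create a `public` table in its own `init/1` and pass the table id to the child as a start argument, so the data survives a child restart.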
Processes v.1
• started from “process per device”
• easy to start, client is an Actor
• not really HA
• bad fit for a cluster with few nodes
• many problems with event routing
• reimplemented
riak_core
• few problems
• great facilities with no docs
• ... but the whole source code is easy to read
• thanks to the guys from Basho for their advice
• waiting for version 2.0
Mocks
• mocking: external HTTP endpoints, IP detectors
• meck library: creating modules, history API
• good enough
• strange “random” problems after recompilation
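A small sketch of the two meck features named above: creating a module that does not exist and inspecting the call history. `ip_detector:country/1` is a hypothetical module and function invented for this example; the meck dependency must be on the code path.

```erlang
-module(mock_sketch).
-export([demo/0]).

%% ip_detector is hypothetical; non_strict lets meck create it from scratch.
demo() ->
    ok = meck:new(ip_detector, [non_strict]),
    ok = meck:expect(ip_detector, country,
                     fun(_Ip) -> {ok, <<"US">>} end),
    try
        {ok, <<"US">>} = ip_detector:country({8, 8, 8, 8}),
        %% history/1 lists recorded calls as {Pid, {M, F, Args}, Result}.
        [{_Pid, {ip_detector, country, [{8, 8, 8, 8}]}, {ok, <<"US">>}}] =
            meck:history(ip_detector),
        true = meck:validate(ip_detector)
    after
        %% Always unload, or the mock leaks into later tests.
        meck:unload(ip_detector)
    end,
    ok.
```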
Cluster
• you need to prepare tests for a multi-node system
• (only) then start working on distribution
• riak_test
• property testing: PropEr
• ... both are great, but hard to adopt
Lesson #9
• it’s hard to do everything right on the first try
• it’s impossible to do it on the first try?
• it’s impossible to do it at all?
• more experiments!
Events
• a lot of async operations
• e.g. like → save in DB → update timeline entry → publish activity stream entry → add notification → send to device
• started with RabbitMQ and an exchange for each event type (easy to start)
• reimplemented
Events 2
• 2 types: bound & unbound
• bound: known number of subscribers
• e.g. “like”
• converting to an “active coordinator”: an FSM under an appropriate supervisor
• sourcing for fault-tolerance
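One possible reading of the coordinator idea, reduced to a pure function: a bound event such as “like” has a fixed list of steps, and a journal of completed step names lets a restarted coordinator resume instead of repeating work. This is a sketch only. The module name, the step format, and the journal are assumptions; the slides do not say how the FSM or the sourcing is actually built.

```erlang
-module(coordinator_sketch).
-export([run/2]).

%% Steps: list of {Name, Fun}, where Fun() returns ok | {error, Reason}.
%% Done:  names of steps that already completed (the journal).
%% Returns {ok, NewlyDone} or {error, FailedStep, Reason, NewlyDone}.
run(Steps, Done) ->
    run(Steps, Done, []).

run([], _Done, Acc) ->
    {ok, lists:reverse(Acc)};
run([{Name, Fun} | Rest], Done, Acc) ->
    case lists:member(Name, Done) of
        true ->
            %% Already journalled: skip, do not repeat the side effect.
            run(Rest, Done, Acc);
        false ->
            case Fun() of
                ok ->
                    run(Rest, Done, [Name | Acc]);
                {error, Reason} ->
                    {error, Name, Reason, lists:reverse(Acc)}
            end
    end.
```

In a running system the caller would persist the newly completed step names after each call and pass them back in on retry.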
Meta programming
• it matters!
• cases: RPC definitions, permissions etc.
• -define(MACRO, ...)
• ... great, but sometimes inconvenient
• parse_transform
• ... great, but hard to develop & support
• Elixir? no, thanks
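As an illustration of the macro route, a hypothetical `?RPC` macro can keep each RPC definition and its permission on one line. The macro, the RPC names, and the role model are all invented for this sketch.

```erlang
-module(rpc_sketch).
-export([rpcs/0, allowed/2]).

%% Hypothetical macro: one line per RPC, expanded to {Name, Arity, Perm}.
-define(RPC(Name, Arity, Perm), {Name, Arity, Perm}).

rpcs() ->
    [?RPC(get_profile, 1, user),
     ?RPC(ban_profile, 1, admin)].

%% A role may call an RPC if it matches the required permission;
%% admin may call everything. Unknown RPCs are always refused.
allowed(Name, Role) ->
    case lists:keyfind(Name, 1, rpcs()) of
        {Name, _Arity, Perm} -> Perm =:= Role orelse Role =:= admin;
        false -> false
    end.
```

The inconvenience shows up quickly: a macro can only expand to an expression or form fragment, so generating whole functions per RPC is where a parse_transform starts to look attractive.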
Deployment
• don’t use hot swapping for releases
• reltool to prepare package(s)
• run_erl to run VM as a daemon
• shell script for common operations: start, stop, restart, attach
• shell script for cluster operations (wrapper for node calls): join, leave, status (ring & members)
Deployment
• rebar generate to /opt/ gomer//*
• shared directory for compiled deps: much faster get-deps & compile
• zip and store on S3
• download from S3, unzip, relink
• fabric (Python) for automation
Debugging
• a lot of log messages
• papertrailapp.com for all concerned
• dbg on live server
• a few helpers of our own for the most common cases
• “trace_off” on timeout
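A helper in the spirit of “trace_off on timeout” could look like this sketch: start call tracing for one module and arm a timer that always switches it off, so a forgotten session cannot flood a live node. The module and function names are assumptions; the `dbg` calls are from OTP's runtime_tools.

```erlang
-module(trace_sketch).
-export([trace/2]).

%% Trace calls to every function in Mod (with return values and
%% exceptions) for TimeoutMs milliseconds, then turn tracing off.
trace(Mod, TimeoutMs) ->
    {ok, _} = dbg:tracer(),
    {ok, _} = dbg:p(all, call),
    {ok, _} = dbg:tpl(Mod, x),
    spawn(fun() ->
                  timer:sleep(TimeoutMs),
                  dbg:ctpl(Mod),  % clear the local trace patterns set above
                  dbg:stop()      % stop the tracer, clear process trace flags
          end),
    ok.
```

Usage from an attached shell: `trace_sketch:trace(my_module, 5000).` where `my_module` is a placeholder for the module under investigation.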
Questions #1 Performance
• ~20-25ms for most responses
• 100+ connections without any impact
• faster than Python & Ruby
• not as fast as Scala, Clojure and Go
• ... but do you really care?