Erlang in production. Lessons learned

Developing a social applications platform for attendify.com with Erlang, Riak Core, and Riak Pipe

Oleksii Kachaiev

November 16, 2013

Transcript

  1. Social platform
    in Erlang
    Lessons Learned
    Alexey Kachayev, 2013

  2. About me
    •CTO at Attendify.com
    •Erlang, Go, Clojure, Scala
    •CPython & Twitter Storm contributor
    •Author of fn.py library
    •Hobbies: Haskell, Scheme, Racket,
    CRDT, type systems, compilers

  3. Contacts
    •@kachayev
    •github: kachayev

  4. Will tell you
    •project goals and challenges
    •tech. stack that we use
    •development
    •testing, deployment, debugging
    •problems & solutions

  5. unstructured
    content
    too much information
    at least 4 different talks

  6. Will not tell
    •why Erlang is cool
    •how Erlang is cool
    •why you should use Erlang
    •1 million concurrent users
    •1000+ node cluster

  7. attendify.com

  8. social platform
    hundreds of mobile apps
    “little facebook” inside each one

  9. social platform
    hundreds thousands
    of mobile apps

  10. social platform
    “Yammer” for events
    (at least technically)

  11. tons of features

  12. no, really!
    tons of features

  13. • profiles
    • social network accounts
    • chat
    • posts
    • photos
    • photo albums
    • timeline
    • notifications
    • likes
    • permissions
    •replies
    •following
    •tweets tracking
    •activity streams
    •checkins
    •notes
    •sharing
    •search
    •RSS news tracking
    •blocking & ban

  14. Special thanks
    •push notifications
    •multi-device synchronization
    •offline support for a few features

  15. Requirements
    •high availability (HA)
    •plug-in infrastructure
    •RPC for thin clients
    •stability & guarantees
    •quick development

  16. Requirements
    goals
    •high availability (HA)
    •plug-in infrastructure
    •RPC for thin clients
    •stability & guarantees
    •quick development

  17. Technically
    •10k++ mobile applications
    •~ 2k profiles for each
    •activity spikes (obviously)
    •apps should work independently (*)

  18. Prototype

  19. “Delaware”
    •prototype, not so many features
    •~3 weeks of active development
    •wrapper for CouchDB (in Erlang)
    •biggest problem: push notifications
    •still running and serving ~75 mobile apps

  20. “Delaware”
    implementation
    check SNS
    integration
    switch
    admin panel

  21. Current

  22. “Gomer”
    •4 months of active development
    •2 engineers
    •10 repos
    •1,178 commits

  23. “Gomer”
    •~46k LOC
    •53 “xxx” comments (incl. 3 “xxx!!!”)
    •47 libraries (incl. 3 forks)
    •117 public RPC methods

  24. “Gomer”
    •1,148 test cases
    •make testall
    •390 apps
    •1,195 devices
    •19,047 log messages

  25. so?
    •pretty big project
    •very dynamic
    •quickly growing in size & features

  26. System design

  27. Data
    •graph-oriented (like Facebook)
    •Riak for most data: nodes, links, streams
    •etcd (Raft consensus) for cases that need consistency: settings, cluster structure
    •in-memory ETS: cache, sync ordering
    •pre-built data for reading

  28. Graph
    •nodes: id, rev, attrs, system flags
    •links: from-id, to-id, type
    •holds an essential part of the logic, e.g. a session is a link from a profile to a device
    •Facebook TAO model: fetching nodes and the simplest link walking
    •implemented as an independent library
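
The node and link shapes on this slide can be sketched as a small in-memory module. This is only an illustration, not the library from the talk: the module name `graph_sketch`, its functions, and the use of maps (which need a newer OTP than the 2013 release) are all assumptions.

```erlang
-module(graph_sketch).
-export([new/0, add_node/3, add_link/4, linked/3]).

%% Hypothetical in-memory stand-in for the node/link store.
-record(graph, {nodes = #{}, links = []}).

new() -> #graph{}.

%% A node carries an id, a revision, free-form attributes and system flags.
add_node(G = #graph{nodes = Nodes}, Id, Attrs) ->
    Node = #{id => Id, rev => 1, attrs => Attrs, flags => []},
    G#graph{nodes = Nodes#{Id => Node}}.

%% A link is a typed, directed edge between two node ids.
add_link(G = #graph{links = Links}, From, To, Type) ->
    G#graph{links = [{From, To, Type} | Links]}.

%% Simplest link walk: nodes reachable from From over links of one type.
linked(#graph{nodes = Nodes, links = Links}, From, Type) ->
    [maps:get(To, Nodes) || {F, To, T} <- Links, F =:= From, T =:= Type].
```

With this shape, "a session is a link from a profile to a device" becomes `add_link(G, ProfileId, DeviceId, session)`, and listing a profile's devices is `linked(G, ProfileId, session)`.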

  29. K-ordering
    •revision control for each entity
    •to ensure all client calls are idempotent
    •k-ordering for cursor-based sync (**)
    •flake library (snowflake-like)
    •one more, riak_id

  30. K-ordering
    ** the client tells the server the max revision it has ever seen (a.k.a. the cursor);
    the server sends only changed data (current rev > client max rev)
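
A minimal sketch of the cursor rule above, assuming entities are plain `{Id, Rev, Data}` tuples held in a list. The module name `cursor_sync` and the tuple layout are hypothetical; the real system reads from Riak and ETS.

```erlang
-module(cursor_sync).
-export([changes_since/2]).

%% Entities are {Id, Rev, Data}; Cursor is the max revision the client has seen.
%% Returns the changed entities (oldest first) and the cursor for the next call.
changes_since(Entities, Cursor) ->
    Changed = lists:keysort(2, [E || {_Id, Rev, _} = E <- Entities, Rev > Cursor]),
    NewCursor = case Changed of
                    [] -> Cursor;
                    _  -> element(2, lists:last(Changed))
                end,
    {Changed, NewCursor}.
```

Because the client only ever sends its own maximum revision, repeating the same call after a lost response returns the same data, which fits the idempotency goal from the previous slide.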

  31. K-ordering
    •github.com/twitter/snowflake (Scala)
    •github.com/boundary/flake (Erlang) *
    •github.com/seancribbs/riak_id (Erlang)
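
To show what "k-ordered" ids look like, here is a sketch of a snowflake-style generator. It is not the code of any of the libraries listed above: the module `kid_sketch` is hypothetical, the 64/48/16-bit split is an assumption loosely modeled on how flake describes its ids, sequence overflow is ignored, and `erlang:system_time(millisecond)` needs a newer OTP than was available in 2013.

```erlang
-module(kid_sketch).
-export([new/1, next/1]).

%% State: {WorkerId, LastMillis, Seq}. WorkerId is any integer below 2^48.
new(WorkerId) -> {WorkerId, 0, 0}.

%% Returns {Id, NewState}; Id is a 128-bit binary that sorts roughly by time.
next({WorkerId, LastMs, Seq}) ->
    Now = erlang:system_time(millisecond),
    {Ms, NextSeq} = if
                        Now > LastMs -> {Now, 0};
                        true -> {LastMs, Seq + 1}   % same ms, or clock went back
                    end,
    {<<Ms:64, WorkerId:48, NextSeq:16>>, {WorkerId, Ms, NextSeq}}.
```

Ids from one worker compare in generation order as binaries; ids from different workers are only ordered up to clock skew, which is the "k" in k-ordering.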

  32. Streams
    •activitystrea.ms
    •Actor, Action, Object, Target
    •cases: timelines, activity streams, chats, notification center
    •linked lists
    •cursor-based fetch
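
A rough sketch of a stream as a newest-first list of Actor/Action/Object/Target entries with a cursor-based fetch. The module `stream_sketch`, the entry map, and the use of a revision as the cursor are assumptions for illustration; the talk does not show the real storage layout.

```erlang
-module(stream_sketch).
-export([publish/3, fetch/3]).

%% Stream is a list of entries, newest first; Rev is assumed to grow over time.
publish(Stream, Rev, {Actor, Action, Object, Target}) ->
    Entry = #{rev => Rev, actor => Actor, action => Action,
              object => Object, target => Target},
    [Entry | Stream].

%% Cursor-based fetch: up to Limit entries strictly older than Cursor.
%% Pass the rev of the last entry received to get the next page.
fetch(Stream, Cursor, Limit) ->
    Older = lists:dropwhile(fun(#{rev := R}) -> R >= Cursor end, Stream),
    lists:sublist(Older, Limit).
```

Unlike offset paging, a cursor stays valid when new entries are pushed onto the head of the stream between two fetches.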

  33. Streams

  34. Offline support
    •cases: follow-ups, notes, cleared messages etc
    •event-sourcing (both server & clients)
    •LWW (last-write-wins) for conflicting writes
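
Last-write-wins can be sketched as a merge of two keyed states where each value carries the timestamp of its write. The module `lww_sketch` and the `{Timestamp, Value}` layout are assumptions, and on equal timestamps this sketch simply keeps the local value.

```erlang
-module(lww_sketch).
-export([merge/2]).

%% Each side maps Key => {Timestamp, Value}. For a key written on both
%% sides, the write with the larger timestamp wins (last-write-wins).
merge(Local, Remote) ->
    maps:fold(fun(Key, {Ts, _} = New, Acc) ->
                      case Acc of
                          #{Key := {OldTs, _}} when OldTs >= Ts -> Acc;
                          _ -> Acc#{Key => New}
                      end
              end, Local, Remote).
```

LWW silently drops the older of two concurrent writes, so it suits data such as notes or "cleared" flags where losing an intermediate value is acceptable.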

  35. ETS
    •used to avoid copying state in gen_server
    •2 approaches (we use both):
    •the supervisor creates the ETS table and gives it to the child at start
    •the server creates the ETS table and fills it with data on each gen_server:init
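
A sketch of the first approach, assuming a `public` table is created by a longer-lived owner (for example the supervisor) and handed to the server as a start argument. The module `ets_state` and its API are hypothetical. In the second approach, `init/1` would call `ets:new/2` itself and reload the data.

```erlang
-module(ets_state).
-behaviour(gen_server).
-export([start_link/1, store/3, lookup/2]).
-export([init/1, handle_call/3, handle_cast/2]).

%% Tab must be a public ETS table created by a process that outlives
%% this server, so its contents survive a server restart.
start_link(Tab) -> gen_server:start_link(?MODULE, Tab, []).

%% Writes are serialized through the server.
store(Pid, Key, Value) -> gen_server:call(Pid, {store, Key, Value}).

%% Reads go straight to ETS from the calling process: no message
%% round-trip and no copy of the whole server state.
lookup(Tab, Key) ->
    case ets:lookup(Tab, Key) of
        [{Key, Value}] -> {ok, Value};
        [] -> not_found
    end.

init(Tab) -> {ok, Tab}.

handle_call({store, Key, Value}, _From, Tab) ->
    true = ets:insert(Tab, {Key, Value}),
    {reply, ok, Tab}.

handle_cast(_Msg, Tab) -> {noreply, Tab}.
```

The server state is just the table id, so nothing large is copied on each call, and the data is still readable after the server stops.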

  36. Lesson #1
    graph oriented data is a good fit
    (most) graph databases are
    strange

  37. Lesson #2
    data modeling is hard
    any kind of consistency is hard

  38. Lesson #3
    Erlang is good for async data
    processing

  39. Lesson #4
    each mobile client is a part of
    a single distributed system

  40. Processes v.1
    •started from “process per device”
    •easy to start, client is an Actor
    •not really HA
    •bad fit for a cluster of few nodes
    •many problems with event routing
    •reimplemented

  41. Processes v.2
    •riak_core vnodes ring
    •riak_pipe vnodes ring
    •supervisor for each app
    •auth
    •profiles ordering
    •twitter reader
    •rss reader

  42. Lesson #5
    the obvious solution can be a bad fit
    “fail fast” in your decisions

  43. riak_core
    •vnodes ring
    •“service” and compatibility tracking
    •consistent hashing for task routing
    •handoffs
    •join, leave, status, membership
    •CLI admin interface
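
The idea of routing tasks over a ring of partitions can be sketched without riak_core itself. This is a simplified stand-in, not riak_core's API: the module `ring_sketch` is hypothetical, it assumes the partition count is a power of two, and it maps a key to the slice its hash falls into, while the real ring also handles ownership claims, preference lists and handoff.

```erlang
-module(ring_sketch).
-export([new/2, owner/2]).

-define(RING_TOP, (1 bsl 160)).   % size of the SHA-1 output space

%% Split the hash space into NumPartitions equal slices ("vnodes") and
%% assign them to nodes round-robin. Returns [{PartitionIndex, Node}].
%% NumPartitions is assumed to be a power of two.
new(NumPartitions, Nodes) ->
    Inc = ?RING_TOP div NumPartitions,
    [{I * Inc, lists:nth((I rem length(Nodes)) + 1, Nodes)}
     || I <- lists:seq(0, NumPartitions - 1)].

%% Route a task: hash the key, then pick the slice the hash falls into.
owner(Ring, Key) ->
    <<Hash:160>> = crypto:hash(sha, term_to_binary(Key)),
    Inc = ?RING_TOP div length(Ring),
    lists:nth(Hash div Inc + 1, Ring).
```

The same key always lands on the same partition, and adding a node only changes which node owns some partitions, not how keys map to partitions.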

  44. riak_core
    •a few problems
    •great facilities with no docs
    •... but it is easy to read the whole source code
    •thanks to the guys from Basho for their advice
    •waiting for version 2.0

  45. riak_core

  46. riak_core

  47. Lesson #6
    riak_core is a good enough
    reason for using Erlang

  48. Testing
    •2-phase: “unit” and “functional”
    •eunit (built-in testing framework)
    •etest library for functional tests
    •functional tests in separate modules
    •don’t track coverage

  49. Testing
    •a lot of high-level helpers
    •assert functions over JSON structure
    •?wait_for macro to test async operations
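
The talk does not show the `?wait_for` macro, so the following is only a guess at its shape: a macro that wraps an expression in a fun and polls it until it returns `true` or a timeout expires. The module name, the 50 ms poll interval and the `demo/0` function are all assumptions.

```erlang
-module(wait_for_sketch).
-export([wait_for/3, demo/0]).

%% Hypothetical macro: retry Expr every 50 ms until it returns true or
%% about TimeoutMs have passed.
-define(wait_for(Expr, TimeoutMs),
        wait_for_sketch:wait_for(fun() -> (Expr) end, TimeoutMs, 50)).

wait_for(Check, TimeLeft, Interval) ->
    case Check() of
        true -> ok;
        _ when TimeLeft =< 0 -> error(wait_for_timeout);
        _ ->
            timer:sleep(Interval),
            wait_for(Check, TimeLeft - Interval, Interval)
    end.

%% An async operation (here: a delayed ETS insert) becomes testable
%% without a fixed sleep in the test body.
demo() ->
    Tab = ets:new(demo, [public]),
    spawn(fun() -> timer:sleep(100), ets:insert(Tab, {done, true}) end),
    ?wait_for(ets:member(Tab, done), 1000).
```

Polling with a deadline keeps tests fast when the operation finishes early and still fails loudly when it never finishes.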

  50. Mocks
    •mocking: external HTTP endpoints, IP detectors
    •meck library: creating modules, history API
    •good enough
    •strange “random” problems after recompilation
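
A short eunit test showing the two meck features named above, module creation and the history API. It needs the meck library on the code path. `ip_detector` and its `country/1` function are made-up names standing in for a module that would call an external HTTP service.

```erlang
-module(geo_tests).
-include_lib("eunit/include/eunit.hrl").

%% ip_detector is a hypothetical module; the non_strict option lets meck
%% create it even though no such module exists.
detect_country_test() ->
    ok = meck:new(ip_detector, [non_strict]),
    try
        ok = meck:expect(ip_detector, country,
                         fun(_Ip) -> {ok, <<"UA">>} end),
        ?assertEqual({ok, <<"UA">>}, ip_detector:country({8, 8, 8, 8})),
        %% History API: every call made to the mock is recorded.
        ?assertMatch([{_Pid, {ip_detector, country, [{8, 8, 8, 8}]},
                       {ok, <<"UA">>}}],
                     meck:history(ip_detector)),
        ?assert(meck:validate(ip_detector))
    after
        meck:unload(ip_detector)
    end.
```

Unloading in `after` matters: a mock left loaded after a failed test is one plausible source of failures that look random in later test runs.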

  51. Lesson #7
    •test coverage is a key factor for really quick development
    •concentrate on “negative” cases
    •it’s easy to turn this process into fun

  52. Lesson #8
    •a good type system matters
    •too many tests to check input values
    •too many tests to check formatting
    •too many tests to check protocols

  53. Cluster
    •you need to prepare tests for a multi-node system
    •(only) then start working on distribution
    •riak_test
    •property testing: PropEr
    •... both are great, but hard to adopt

  54. Cluster
    •make devrel to run 3+ nodes
    •it’s fun too

  55. Cluster testing

  56. Lesson #9
    •it’s hard to do everything right on the first try
    •it’s impossible to do it on the first try?
    •it’s impossible to do it at all?
    •more experiments!

  57. Events
    •a lot of async operations
    •e.g. like → save in DB → update timeline entry → publish activity stream entry → add notification → send to device
    •started with RabbitMQ and an exchange for each event type (easy to start)
    •reimplemented

  58. Events 2
    •2 types: bound & unbound
    •bound: known number of subscribers
    •e.g. “like”
    •converting to an “active coordinator”: an FSM under the appropriate supervisor
    •sourcing for fault tolerance

  59. Events 3
    •unbound cases:
    •ban profile → remove all content
    •update timeline → send push to all subscribed devices
    •use riak_pipe

  60. riak_pipe
    •part of Riak internals
    •map/reduce flavored with unix pipes
    •declarative fittings
    •custom routing
    •back-pressure control
    •logging and tracing
    •handoffs

  61. riak_pipe

  62. riak_pipe
    working on a workshop
    github.com/kachayev/riak-pipe-workshop

  63. Lesson #10
    there is no such thing as “exactly
    once delivery”
    back-pressure control is essential
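
If delivery is at-least-once, a common way to cope is to make the consumer idempotent. The sketch below assumes every event carries a unique id and keeps the ids already handled in a set. The module `dedup_sketch` is hypothetical and is not taken from the talk, and a real system would also bound or expire the set.

```erlang
-module(dedup_sketch).
-export([new/0, handle/3]).

%% State is the set of event ids already processed.
new() -> sets:new().

%% Deliveries may repeat, so each event carries a unique id and the
%% side effect runs only the first time that id is seen.
handle(Seen, {EventId, Payload}, Effect) ->
    case sets:is_element(EventId, Seen) of
        true  -> {duplicate, Seen};
        false ->
            Effect(Payload),
            {ok, sets:add_element(EventId, Seen)}
    end.
```

A redelivered "send push" event then turns into a no-op instead of a second notification on the user's device.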

  64. Meta programming
    •it matters!
    •cases: RPC definitions, permissions etc
    •-define(MACRO, ...)
    •... great, but sometimes inconvenient
    •parse_transform
    •... great, but hard to develop & support
    •Elixir? no, thanks
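
One way `-define` can help with RPC definitions and permissions is to declare each public method on a single line. The macro `?RPC`, the method names and the role model below are invented for illustration; the talk does not show the project's real macros.

```erlang
-module(rpc_sketch).
-export([call/3, methods/0]).

%% Hypothetical macro: one line per public RPC method, expanding to a
%% {Name, RequiredRole, Handler} tuple.
-define(RPC(Name, Role, Fun), {<<Name>>, Role, fun Fun/1}).

methods() ->
    [?RPC("profile.get", user, get_profile),
     ?RPC("profile.ban", admin, ban_profile)].

%% Dispatch by method name; admins may call everything.
call(Role, Method, Args) ->
    case lists:keyfind(Method, 1, methods()) of
        {_, Required, Handler} when Required =:= Role; Role =:= admin ->
            Handler(Args);
        {_, _, _} -> {error, forbidden};
        false -> {error, unknown_method}
    end.

get_profile(#{id := Id}) -> {ok, #{id => Id}}.
ban_profile(#{id := Id}) -> {ok, {banned, Id}}.
```

The macro keeps the table readable, but it cannot generate new functions or inspect handler bodies, which is where parse_transform comes in at a much higher maintenance cost.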

  65. Documentation
    •it matters!
    •public API description, at least
    •our solution: a parser for test logs (in Python)

  66. Documentation
    •... an external tool is not so easy to support
    •edoc ?
    •parse_transform ? i.e. -doc()

  67. Lesson #11
    meta programming matters
    documentation matters

  68. Deployment
    •don’t use hot swapping for releases
    •reltool to prepare package(s)
    •run_erl to run VM as a daemon
    •shell script for common operations: start, stop, restart, attach
    •shell script for cluster operations (wrapper for node calls): join, leave, status (ring & members)

  69. Deployment
    •rebar generate to /opt/gomer//*
    •shared directory for compiled deps: much faster get-deps & compile
    •zip and store on S3
    •download from S3, unzip, relink
    •fabric (Python) for automation

  70. Lesson #12
    still don’t know the best way
    to deploy an application across
    the cluster

  71. Lesson #13
    Another to_erl process
    already attached to
    pipe

  72. Lesson #14
    there is a big difference between
    ^C (stop VM) and ^D (quit)

  73. Debugging
    •a lot of log messages
    •papertrailapp.com for all concerned
    •dbg on live server
    •a few helpers of our own for the most common cases
    •“trace_off” on timeout
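
A helper in the spirit of the "trace_off on timeout" bullet might look like the sketch below. It uses only standard `dbg` and `timer` calls; the module `trace_sketch` and its function are hypothetical and not the team's own helper. On recent OTP releases `dbg:stop_clear/0` is deprecated in favor of `dbg:stop/0`.

```erlang
-module(trace_sketch).
-export([calls/3]).

%% Trace calls to Mod:Fun (any arity) in all processes, then switch
%% tracing off after TimeoutMs so a forgotten trace cannot flood a
%% live node.
calls(Mod, Fun, TimeoutMs) ->
    case dbg:tracer() of
        {ok, _} -> ok;
        {error, already_started} -> ok
    end,
    {ok, _} = dbg:p(all, c),          % enable call tracing everywhere
    {ok, _} = dbg:tpl(Mod, Fun, x),   % x: also trace return values/exceptions
    {ok, _TRef} = timer:apply_after(TimeoutMs, dbg, stop_clear, []),
    ok.
```

For example, `trace_sketch:calls(lists, seq, 5000)` prints calls to `lists:seq` for five seconds and then stops. The recon library listed on a later slide adds rate limits on top of the same idea.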

  74. Debugging

  75. Debugging
    •erlang.org/doc/man/dbg.html
    •github.com/ferd/recon
    •erlang.org/doc/man/os_mon_app.html

  76. Lesson #15
    there are a few features in Erlang
    that you really, really miss when
    using other technologies

  77. The team
    •2 engineers
    •“2 weeks” to start writing production code
    •ha. first feature - on the second day *
    •* first day - tripped up by Mac OS

  78. Lesson #16
    Erlang is a good technology to
    hire good engineers

  79. Thanks to
    •guys from Wooga
    •guys from Yammer
    •guys from Basho

  80. Libraries
    • github.com/eproxus/meck
    • github.com/uwiger/gproc
    • github.com/wooga/etest
    • github.com/wooga/etest_http
    • github.com/basho/riak_kv
    • github.com/basho/riak_core
    • github.com/basho/riak_pipe
    • github.com/basho/lager
    • github.com/marccampbell/flake
    • github.com/gleber/erlcloud

  81. Questions #1
    Performance
    •~20-25ms for most responses
    •100+ connections without any impact
    •faster than Python & Ruby
    •not as fast as Scala, Clojure and Go
    •... but do you really care?

  82. Questions #2
    Candidates
    • Erlang (our choice)
    • Scala (jvm)
    • Clojure (jvm)
    • Python (bad fit)
    • Go (too large project)
    • Haskell (bad fit)
    • Java (oh, come on..)

  83. Questions #3
    IDE?
    •Emacs
    •VIM

  84. Notes #1
    •we use Go and Clojure for other systems
    •do you want to ask “Why”?
    •we are still at an early production stage
    •wait for new lessons coming soon

  85. Ideas?
    Questions?
    Alexey Kachayev, 2013
