
Distributed Systems at ok.ru by Oleg Anastasyev

Riga Dev Day
March 13, 2016


Transcript

  1. OK.ru has come to:
     1. Absolutely reliable network
     2. with negligible Latency
     3. and practically unlimited Bandwidth
     4. It is homogenous
     5. Nobody can break into our LAN
     6. Topology changes are unnoticeable
     7. All managed by single genius admin
     8. So data transport cost is zero now
  2. Fallacies of distributed computing [Peter Deutsch, 1994; James Gosling, 1997]
     https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing
     1. Absolutely reliable network
     2. with negligible Latency
     3. and practically unlimited Bandwidth
     4. It is homogenous (same HW and hop count to every server)
     5. Nobody can break into our LAN
     6. Topology changes are unnoticeable
     7. All managed by single genius admin
     8. So data transport cost is zero now
  3. My friends page
     1. Retrieve friends ids
     2. Filter by friendship type
     3. Apply black list
     4. Resolve ids to profiles
     5. Sort profiles
     6. Retrieve stickers
     7. Calculate summaries
  4. The Simple Way™
     SELECT * FROM friendlist f, users u
     WHERE f.userId = ? AND f.kind = ? AND u.name LIKE ?
       AND NOT EXISTS( SELECT * FROM blacklist …) …
  5. Simple ways don't work
     • Friendships
       • 12 billion edges, 300 GB
       • 500,000 requests/sec
     • User profiles
       • > 350 million profiles
       • 3,500,000 requests/sec, 50 Gbit/sec
  6. How stuff works
     [Architecture diagram: web frontend and API frontend call the app server, which calls the one-graph, user-cache and black-list microservices]
  7. Micro-service dissected: Remote interface
     https://github.com/odnoklassniki/one-nio

     interface GraphService extends RemoteService {
         @RemoteMethod
         long[] getFriendsByFilter(@Partition long vertexId, long relationMask);
     }

     interface UserCache {
         @RemoteMethod
         User getUserById(long id);
     }
  8. App Server code
     https://github.com/odnoklassniki/one-nio

     long[] friendsIds = graphService.getFriendsByFilter(userId, mask);
     List<User> users = new ArrayList<>(friendsIds.length);
     for (long id : friendsIds) {
         if (blackList.isAllowed(userId, id)) {
             users.add(userCache.getUserById(id));
         }
     }
     …
     return users;
  9. interface GraphService extends RemoteService {
         @RemoteMethod
         long[] getFriendsByFilter(@Partition long vertexId, long relationMask);
     }
     • Partition by this parameter value
     • Using a partitioning strategy
       • long id -> int partitionId(id) -> node1, node2, …
     • Strategies can be different
       • Cassandra ring, Voldemort partitions
       • or …
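
     A minimal sketch of what such a partitioning strategy could look like in plain Java. This is not the one-nio API; PartitionStrategy and ModuloStrategy are hypothetical names used only to illustrate the id -> partition -> nodes mapping.

     import java.util.List;

     // Illustration only: a partitioning strategy maps a long id to a partition
     // number, and each partition owns a list of candidate nodes.
     interface PartitionStrategy {
         int partitionId(long id);                 // long id -> int partitionId(id)
         List<String> nodesFor(int partitionId);   // partitionId -> node1, node2, …
     }

     // Simplest possible strategy: modulo over a fixed partition table.
     final class ModuloStrategy implements PartitionStrategy {
         private final List<List<String>> partitionToNodes;  // index = partition id

         ModuloStrategy(List<List<String>> partitionToNodes) {
             this.partitionToNodes = partitionToNodes;
         }

         public int partitionId(long id) {
             return (int) Math.floorMod(id, (long) partitionToNodes.size());
         }

         public List<String> nodesFor(int partitionId) {
             return partitionToNodes.get(partitionId);
         }
     }
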
  10. Weighted quadrant
      [Diagram: partition p = id % 16 selects a quadrant of replica nodes (N01, N02, N03, …, N11); within the quadrant a node is chosen by weighted round robin, node = wrr(p), with per-node weights such as W = 1 and W = 100]
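
      A sketch of one way node = wrr(p) could work: the smooth weighted round-robin selection popularized by nginx. The talk does not show the actual ok.ru implementation, so the class and its wiring are assumptions; only the idea of weight-proportional selection is taken from the slide.

      import java.util.List;

      // Smooth weighted round robin over the replica nodes of one partition.
      // Nodes with a larger weight (e.g. W = 100 vs W = 1) are picked more often,
      // but selections stay evenly interleaved.
      final class WeightedRoundRobin {
          private final List<String> nodes;
          private final int[] weights;   // e.g. 1 for a small node, 100 for a big one
          private final int[] current;
          private final int totalWeight;

          WeightedRoundRobin(List<String> nodes, int[] weights) {
              this.nodes = nodes;
              this.weights = weights.clone();
              this.current = new int[weights.length];
              int sum = 0;
              for (int w : weights) sum += w;
              this.totalWeight = sum;
          }

          // node = wrr(p): called for the replica set of one partition
          synchronized String next() {
              int best = 0;
              for (int i = 0; i < current.length; i++) {
                  current[i] += weights[i];
                  if (current[i] > current[best]) best = i;
              }
              current[best] -= totalWeight;
              return nodes.get(best);
          }
      }
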
  11. A coding issue
      https://github.com/odnoklassniki/one-nio

      long[] friendsIds = graphService.getFriendsByFilter(userId, mask);
      List<User> users = new ArrayList<>(friendsIds.length);
      for (long id : friendsIds) {
          if (blackList.isAllowed(userId, id)) {
              users.add(userCache.getUserById(id));   // one remote call per friend
          }
      }
      …
      return users;
  12. A roundtrip price
      • 0.1-0.3 ms
      • 0.7-1.0 ms remote datacenter
      * this price is tightly coupled with the specific infrastructure and frameworks

      latency = 1.0 ms * 2 reqs per friend * 200 friends = 400 ms
      10k friends: latency = 20 seconds
  13. Batch requests to the rescue
      public interface UserCache {
          @RemoteMethod( split = true )
          Collection<User> getUsersByIds(long[] keys);
      }

      long[] friendsIds = graphService.getFriendsByFilter(userId, mask);
      friendsIds = blackList.filterAllowed(userId, friendsIds);
      Collection<User> users = userCache.getUsersByIds(friendsIds);
      …
      return users;
  14. split & merge
      [Diagram: split( ids by p ) -> ids0, ids1; ids0 goes to a p = 0 node and ids1 to a p = 1 node (N01, N02, N03, …, N11); the results are combined with users = merge( users0, users1 )]
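
      A hand-rolled sketch of what such a split & merge does behind a split = true batch call. The framework performs this routing itself; SplitAndMerge, NodeClient and fetchFromNode are hypothetical names, and the per-node calls are shown sequentially for brevity where a real client would issue them in parallel.

      import java.util.ArrayList;
      import java.util.HashMap;
      import java.util.List;
      import java.util.Map;

      final class SplitAndMerge {
          // Hypothetical per-node call; stands in for the real remote client.
          interface NodeClient { List<User> fetchFromNode(int partition, long[] ids); }
          // Placeholder for the real profile class.
          record User(long id) {}

          static List<User> getUsersByIds(long[] keys, int partitions, NodeClient client) {
              // split ( ids by p ) -> ids0, ids1, …
              Map<Integer, List<Long>> byPartition = new HashMap<>();
              for (long id : keys) {
                  int p = (int) Math.floorMod(id, (long) partitions);
                  byPartition.computeIfAbsent(p, k -> new ArrayList<>()).add(id);
              }
              // query each partition's node and merge ( users0, users1, … )
              List<User> merged = new ArrayList<>(keys.length);
              byPartition.forEach((p, ids) -> {
                  long[] chunk = ids.stream().mapToLong(Long::longValue).toArray();
                  merged.addAll(client.fetchFromNode(p, chunk));
              });
              return merged;
          }
      }
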
  15. What could possibly fail?
      1. Client crash
      2. Server crash
      3. Request omission
      4. Response omission
      5. Server timeout
      6. Invalid value response
      7. Arbitrary failure
  16. What to do with failures?
      • We can not prevent failures - only mask them
      • If a failure can occur, it will occur
      • Redundancy is a must to mask failures
        • Information (error correction codes)
        • Hardware (replicas, substitute hardware)
        • Time (transactions, retries)
  17. What happened to the transaction?
      [Diagram: an "Add Friend" request whose outcome is unknown - "Don't give up! Must retry!" vs. "Must give up! Don't retry!"]
  18. Did the friendship succeed?
      • The client does not really know
      • What can the client do?
        • Don't make any guarantees.
        • Never retry. At Most Once.
        • Always retry. At Least Once.
  19. Making new friendship
      1. Transaction in an ACID database
         • single master, success is atomic (either yes or no)
         • atomic rollback is possible
      2. Cache cluster refresh
         • many replicas, no master
         • no rollback, partial failures are possible
  20. Idempotence
      https://en.wikipedia.org/wiki/Idempotence
      The "Always retry" policy can be applied only to Idempotent Operations
      • An operation that can be reapplied multiple times with the same result
        • e.g.: read, Set.add(), Math.max(x, y)
      • Atomic change with order and dup control
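
      A small sketch of an "always retry" (at-least-once) wrapper. Retries is a hypothetical helper, not part of one-nio; the point it illustrates is that blind retries are only safe when the wrapped operation is idempotent.

      import java.util.concurrent.Callable;

      final class Retries {
          // Safe only for idempotent operations (read, Set.add(), Math.max(x, y), …),
          // because the operation may end up applied more than once.
          static <T> T retryIdempotent(Callable<T> idempotentOp, int maxAttempts) throws Exception {
              if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be >= 1");
              Exception last = null;
              for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                  try {
                      return idempotentOp.call();
                  } catch (Exception e) {
                      last = e;   // transient failure: retrying cannot corrupt state
                  }
              }
              throw last;         // still failing after maxAttempts
          }
      }
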
  21. Idempotence in an ACID database
      [Sequence diagram: client sends "Make friends"; server checks "Already friends? No, let's make it!", but the client's wait ends in a timeout; the client retries "Make friends"; server checks "Already friends? Yes, NOP!" and replies "Friendship, peace and bubble gum!"]
  22. Sequencing
      [Sequence diagram: the client generates an id, OpId := Generate(), and sends MakeFriends(OpId); the server asks "Is Dup(OpId)?" - no, so it makes the changes and replies "Made friends!"]
      Generate() examples:
      • OpId += 1
      • OpId = currentTimeMillis()
      • OpId = TimeUUID
      http://johannburkard.de/software/uuid/
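
      A sketch of server-side duplicate suppression keyed by OpId. SequencedServer and applyMakeFriends are hypothetical names, and the in-memory set only illustrates the idea; a real server would persist the seen OpIds together with the data change in one transaction.

      import java.util.Set;
      import java.util.UUID;
      import java.util.concurrent.ConcurrentHashMap;

      final class SequencedServer {
          private final Set<UUID> seenOpIds = ConcurrentHashMap.newKeySet();

          // The client generates the OpId once and reuses it on every retry.
          boolean makeFriends(UUID opId, long userA, long userB) {
              if (!seenOpIds.add(opId)) {
                  return true;                 // Is Dup(OpId)? Yes -> NOP, report success
              }
              applyMakeFriends(userA, userB);  // No -> making changes
              return true;
          }

          private void applyMakeFriends(long userA, long userB) { /* write to storage */ }
      }
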
  23. Making new friendship
      1. Transaction in an ACID database
         • single master, success is atomic (either yes or no)
         • atomic rollback is possible
      2. Cache cluster refresh
         • many replicas, no master
         • no rollback, partial failures are possible
  24. Cache cluster refresh
      [Diagram: add(Friend) is applied to every replica of partition p = 0 (N01, N02, N03, …)]
      Retries are meaningless
      But replicas' state will diverge otherwise
  25. Syncing cache from DB
      • A background data sync process
        • Reads updated records from the ACID store
          SELECT * FROM users WHERE modified > ?
        • Applies them into its memory
        • Loads updates on node startup
      • Retry can be omitted then
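
      A sketch of such a background sync loop over JDBC, assuming a users table with a modified column; UserCacheSyncer, the column names and the id-to-name mapping are illustrative, not the actual ok.ru cache code.

      import java.sql.Connection;
      import java.sql.PreparedStatement;
      import java.sql.ResultSet;
      import java.sql.SQLException;
      import java.sql.Timestamp;
      import java.util.Map;
      import java.util.concurrent.ConcurrentHashMap;

      final class UserCacheSyncer {
          private final Map<Long, String> usersById = new ConcurrentHashMap<>(); // id -> name
          private volatile Timestamp lastSeen = new Timestamp(0);                // full load on startup

          // Called periodically by a single background thread.
          void syncOnce(Connection db) throws SQLException {
              try (PreparedStatement ps =
                       db.prepareStatement("SELECT id, name, modified FROM users WHERE modified > ?")) {
                  ps.setTimestamp(1, lastSeen);
                  try (ResultSet rs = ps.executeQuery()) {
                      while (rs.next()) {
                          usersById.put(rs.getLong("id"), rs.getString("name")); // apply into memory
                          Timestamp m = rs.getTimestamp("modified");
                          if (m.after(lastSeen)) lastSeen = m;                   // remember the high-water mark
                      }
                  }
              }
          }
      }
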
  26. Server cut-off
      1. Clients stop sending requests to a server
         after X continuous failures for the last second
      2. Clients monitor server availability
         in the background, once a minute
      3. And turn it back on
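
      A minimal sketch of the client-side cut-off state. ServerCutoff is a hypothetical class; the threshold, counters and the background-probe wiring are assumptions made for the illustration.

      import java.util.concurrent.atomic.AtomicInteger;

      final class ServerCutoff {
          private final int failureThreshold;
          private final AtomicInteger consecutiveFailures = new AtomicInteger();
          private volatile boolean cutOff = false;

          ServerCutoff(int failureThreshold) { this.failureThreshold = failureThreshold; }

          boolean isUsable() { return !cutOff; }

          void onSuccess() { consecutiveFailures.set(0); }

          void onFailure() {
              if (consecutiveFailures.incrementAndGet() >= failureThreshold) {
                  cutOff = true;            // 1. stop sending requests to this server
              }
          }

          // 2-3. called by a background availability monitor, e.g. once a minute
          void onProbeSucceeded() {
              consecutiveFailures.set(0);
              cutOff = false;               // turn the server back on
          }
      }
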
  27. Death by slowing down
      A healthy server: Avg = 1.5 ms, Max = 1.5 s, 24 CPU cores, Cap = 24,000 ops
      A slowed-down server: Avg = 24 ms, Max = 1.5 s, 24 CPU cores, Cap = 1,000 ops, under 10,000 ops of load
      Choose a 2.4 ms timeout? Cut it off from clients if latency avg > 2.4 ms?
  28. Speculative retry
      • Makes requests to replicas before the timeout
      • Better 99%, even average latencies
      • More stable system
      • Not always applicable:
        • idempotent ops, additional load, traffic (to consider)
      • Can be balanced: always, >avg, >99p
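
      A sketch of speculative retry with CompletableFuture: if the primary replica has not answered within a chosen delay, a second replica is asked and the first successful response wins. SpeculativeRetry is a hypothetical helper; error propagation when both replicas fail is omitted for brevity, and as the slide notes this is only safe for idempotent calls.

      import java.util.concurrent.CompletableFuture;
      import java.util.concurrent.Executors;
      import java.util.concurrent.ScheduledExecutorService;
      import java.util.concurrent.TimeUnit;
      import java.util.function.Supplier;

      final class SpeculativeRetry {
          private static final ScheduledExecutorService SCHEDULER =
              Executors.newSingleThreadScheduledExecutor();

          static <T> CompletableFuture<T> call(Supplier<CompletableFuture<T>> primary,
                                               Supplier<CompletableFuture<T>> backup,
                                               long speculateAfterMillis) {
              CompletableFuture<T> result = new CompletableFuture<>();
              primary.get().whenComplete((v, e) -> { if (e == null) result.complete(v); });

              SCHEDULER.schedule(() -> {
                  if (!result.isDone()) {
                      // primary is slow: speculatively ask another replica
                      backup.get().whenComplete((v, e) -> { if (e == null) result.complete(v); });
                  }
              }, speculateAfterMillis, TimeUnit.MILLISECONDS);

              return result;   // completes with whichever replica answers first
          }
      }
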
  29. All replicas failure
      • Excessive load
      • Excessive paranoia
      • Bugs
      • Human error
      • Massive outages
  30. Degrade (gracefully)!
      • Use of non-authoritative datasources, degrade consistency
      • Use of incomplete data in UI, partial feature degradation
      • Single feature full degradation
  31. The code
      interface UserCache {
          @RemoteMethod
          Distributed<Collection<User>> getUsersByIds(long[] keys);
      }

      interface Distributed<D> {
          boolean isInconsistency();
          D getData();
      }

      class UserCacheStub implements UserCache {
          public Distributed<Collection<User>> getUsersByIds(long[] keys) {
              return Distributed.inconsistent();
          }
      }
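
      A sketch of how calling code might use this wrapper to degrade a single feature instead of failing the whole page. It assumes the UserCache, Distributed and User types above; FriendsPage, renderFriends and renderFriendsUnavailable are hypothetical UI hooks added for the illustration.

      import java.util.Collection;

      final class FriendsPage {
          private final UserCache userCache;
          FriendsPage(UserCache userCache) { this.userCache = userCache; }

          void show(long[] friendsIds) {
              Distributed<Collection<User>> result = userCache.getUsersByIds(friendsIds);
              if (result.isInconsistency()) {
                  renderFriendsUnavailable();      // single feature degrades, the page still works
              } else {
                  renderFriends(result.getData()); // normal path
              }
          }

          private void renderFriends(Collection<User> users) { /* UI */ }
          private void renderFriendsUnavailable() { /* "friends are temporarily unavailable" */ }
      }
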
  32. What to test for failure?
      • The product you make
      • Operations in the production env
      • "Standard" products - with special care!
  33. The product we make: "Guerrilla"
      • What it does:
        • Detects network connections between servers
        • Disables them (iptables drop)
        • Runs auto tests
      • What we check:
        • No crashes, nice UI messages are rendered
        • Server does start and can serve requests
  34. Why
      • To know an accident exists. Fast.
      • To track down the source of an accident. Fast.
      • To prevent accidents before they happen.
  35. Is there (will there be) an accident?
      • Zabbix
      • Cacti
      • Operational metrics
        • Names of operations, e.g. "Graph.getFriendsByFilter"
        • Call count, their success or failure
        • Latency of calls
  36. What the charts show us
      • Current metrics and trends
      • Aggregated call and failure counts
      • Aggregated latencies
        • Average, Max
        • Percentiles 50, 75, 98, 99, 99.9
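
      A toy sketch of per-operation metrics like the ones charted above. A real deployment would use a metrics library with proper histograms; OperationMetrics is illustrative only and is not how the ok.ru pipeline works.

      import java.util.ArrayList;
      import java.util.Collections;
      import java.util.List;

      final class OperationMetrics {
          private final String name;                     // e.g. "Graph.getFriendsByFilter"
          private long calls, failures;
          private final List<Long> latenciesMicros = new ArrayList<>();

          OperationMetrics(String name) { this.name = name; }

          synchronized void record(long latencyMicros, boolean success) {
              calls++;
              if (!success) failures++;
              latenciesMicros.add(latencyMicros);
          }

          synchronized String report() {
              List<Long> sorted = new ArrayList<>(latenciesMicros);
              Collections.sort(sorted);
              long max = sorted.isEmpty() ? 0 : sorted.get(sorted.size() - 1);
              double avg = sorted.stream().mapToLong(Long::longValue).average().orElse(0);
              long p99 = sorted.isEmpty() ? 0 : sorted.get((int) Math.floor(0.99 * (sorted.size() - 1)));
              return name + ": calls=" + calls + " failures=" + failures
                   + " avg=" + avg + "us max=" + max + "us p99=" + p99 + "us";
          }
      }
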
  37. Short summary
      • The possibilities for failure in distributed systems are endless
      • Don't "prevent", but mask failures through redundancy
      • Degrade gracefully on unmask-able failures
      • Test failures
      • Production diagnostics are key to failure detection and prevention