
Distributed Systems at ok.ru by Oleg Anastasyev

Riga Dev Day
March 13, 2016


Transcript

  1. OK.ru has come to:
     1. Absolutely reliable network
     2. with negligible Latency
     3. and practically unlimited Bandwidth
     4. It is homogenous
     5. Nobody can break into our LAN
     6. Topology changes are unnoticeable
     7. All managed by single genius admin
     8. So data transport cost is zero now
  2. Fallacies of distributed computing [Peter Deutsch, 1994; James Gosling, 1997]
     https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing
     1. Absolutely reliable network
     2. with negligible Latency
     3. and practically unlimited Bandwidth
     4. It is homogenous (same HW and hop count to every server)
     5. Nobody can break into our LAN
     6. Topology changes are unnoticeable
     7. All managed by single genius admin
     8. So data transport cost is zero now
  3. My friends page
     1. Retrieve friends ids
     2. Filter by friendship type
     3. Apply black list
     4. Resolve ids to profiles
     5. Sort profiles
     6. Retrieve stickers
     7. Calculate summaries
  4. The Simple Way™
     SELECT * FROM friendlist f, users u
     WHERE f.userId = ? AND f.kind = ? AND u.name LIKE ?
       AND NOT EXISTS( SELECT * FROM blacklist …) …
  5. Simple ways don't work
     • Friendships
       • 12 billion edges, 300 GB
       • 500,000 requests/sec
     • User profiles
       • > 350 million profiles
       • 3,500,000 requests/sec, 50 Gbit/sec
  6. How stuff works
     [Architecture diagram: web frontend and API frontend call the app server, which calls the one-graph, user-cache and black-list microservices]
  7. Micro-service dissected: Remote interface
     https://github.com/odnoklassniki/one-nio

     interface GraphService extends RemoteService {
         @RemoteMethod
         long[] getFriendsByFilter(@Partition long vertexId, long relationMask);
     }

     interface UserCache {
         @RemoteMethod
         User getUserById(long id);
     }
  8. App Server code
     https://github.com/odnoklassniki/one-nio

     long[] friendsIds = graphService.getFriendsByFilter(userId, mask);
     List<User> users = new ArrayList<>(friendsIds.length);
     for (long id : friendsIds) {
         if (blackList.isAllowed(userId, id)) {
             users.add(userCache.getUserById(id));
         }
     }
     …
     return users;
  9. interface GraphService extends RemoteService {
         @RemoteMethod
         long[] getFriendsByFilter(@Partition long vertexId, long relationMask);
     }
     • Partition by this parameter value
     • Using a partitioning strategy
       • long id -> int partitionId(id) -> node1, node2, …
     • Strategies can be different
       • Cassandra ring, Voldemort partitions
       • or …
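
     A minimal sketch of what such a partitioning strategy could look like in plain Java. This is not the one-nio API; PartitionStrategy and ModuloStrategy are hypothetical names used only to illustrate the id -> partition -> nodes mapping.

     import java.util.List;

     // Illustration only: a partitioning strategy maps a long id to a partition
     // number, and each partition owns a list of candidate nodes.
     interface PartitionStrategy {
         int partitionId(long id);                 // long id -> int partitionId(id)
         List<String> nodesFor(int partitionId);   // partitionId -> node1, node2, …
     }

     // Simplest possible strategy: modulo over a fixed partition table.
     final class ModuloStrategy implements PartitionStrategy {
         private final List<List<String>> partitionToNodes;  // index = partition id

         ModuloStrategy(List<List<String>> partitionToNodes) {
             this.partitionToNodes = partitionToNodes;
         }

         public int partitionId(long id) {
             return (int) Math.floorMod(id, (long) partitionToNodes.size());
         }

         public List<String> nodesFor(int partitionId) {
             return partitionToNodes.get(partitionId);
         }
     }
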
  10. Weighted quadrant
      [Diagram: partition p = id % 16 selects a quadrant of replica nodes (N01, N02, N03, …, N11); within the quadrant a node is chosen by weighted round robin, node = wrr(p), with per-node weights such as W = 1 and W = 100]
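
      A sketch of one way node = wrr(p) could work: the smooth weighted round-robin selection popularized by nginx. The talk does not show the actual ok.ru implementation, so the class and its wiring are assumptions; only the idea of weight-proportional selection is taken from the slide.

      import java.util.List;

      // Smooth weighted round robin over the replica nodes of one partition.
      // Nodes with a larger weight (e.g. W = 100 vs W = 1) are picked more often,
      // but selections stay evenly interleaved.
      final class WeightedRoundRobin {
          private final List<String> nodes;
          private final int[] weights;   // e.g. 1 for a small node, 100 for a big one
          private final int[] current;
          private final int totalWeight;

          WeightedRoundRobin(List<String> nodes, int[] weights) {
              this.nodes = nodes;
              this.weights = weights.clone();
              this.current = new int[weights.length];
              int sum = 0;
              for (int w : weights) sum += w;
              this.totalWeight = sum;
          }

          // node = wrr(p): called for the replica set of one partition
          synchronized String next() {
              int best = 0;
              for (int i = 0; i < current.length; i++) {
                  current[i] += weights[i];
                  if (current[i] > current[best]) best = i;
              }
              current[best] -= totalWeight;
              return nodes.get(best);
          }
      }
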
  11. A coding issue
      https://github.com/odnoklassniki/one-nio

      long[] friendsIds = graphService.getFriendsByFilter(userId, mask);
      List<User> users = new ArrayList<>(friendsIds.length);
      for (long id : friendsIds) {
          if (blackList.isAllowed(userId, id)) {
              users.add(userCache.getUserById(id));   // one remote call per friend
          }
      }
      …
      return users;
  12. A roundtrip price
      • 0.1-0.3 ms
      • 0.7-1.0 ms remote datacenter
      * this price is tightly coupled with the specific infrastructure and frameworks

      latency = 1.0 ms * 2 reqs per friend * 200 friends = 400 ms
      10k friends: latency = 20 seconds
  13. Batch requests to the rescue
      public interface UserCache {
          @RemoteMethod( split = true )
          Collection<User> getUsersByIds(long[] keys);
      }

      long[] friendsIds = graphService.getFriendsByFilter(userId, mask);
      friendsIds = blackList.filterAllowed(userId, friendsIds);
      Collection<User> users = userCache.getUsersByIds(friendsIds);
      …
      return users;
  14. split & merge
      [Diagram: split( ids by p ) -> ids0, ids1; ids0 goes to a p = 0 node and ids1 to a p = 1 node (N01, N02, N03, …, N11); the results are combined with users = merge( users0, users1 )]
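
      A hand-rolled sketch of what such a split & merge does behind a split = true batch call. The framework performs this routing itself; SplitAndMerge, NodeClient and fetchFromNode are hypothetical names, and the per-node calls are shown sequentially for brevity where a real client would issue them in parallel.

      import java.util.ArrayList;
      import java.util.HashMap;
      import java.util.List;
      import java.util.Map;

      final class SplitAndMerge {
          // Hypothetical per-node call; stands in for the real remote client.
          interface NodeClient { List<User> fetchFromNode(int partition, long[] ids); }
          // Placeholder for the real profile class.
          record User(long id) {}

          static List<User> getUsersByIds(long[] keys, int partitions, NodeClient client) {
              // split ( ids by p ) -> ids0, ids1, …
              Map<Integer, List<Long>> byPartition = new HashMap<>();
              for (long id : keys) {
                  int p = (int) Math.floorMod(id, (long) partitions);
                  byPartition.computeIfAbsent(p, k -> new ArrayList<>()).add(id);
              }
              // query each partition's node and merge ( users0, users1, … )
              List<User> merged = new ArrayList<>(keys.length);
              byPartition.forEach((p, ids) -> {
                  long[] chunk = ids.stream().mapToLong(Long::longValue).toArray();
                  merged.addAll(client.fetchFromNode(p, chunk));
              });
              return merged;
          }
      }
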
  15. What could possibly fail?
      1. Client crash
      2. Server crash
      3. Request omission
      4. Response omission
      5. Server timeout
      6. Invalid value response
      7. Arbitrary failure
  16. What to do with failures?
      • We can not prevent failures - only mask them
      • If a failure can occur, it will occur
      • Redundancy is a must to mask failures
        • Information (error correction codes)
        • Hardware (replicas, substitute hardware)
        • Time (transactions, retries)
  17. What happened to the transaction?
      [Diagram: an "Add Friend" request whose outcome is unknown - "Don't give up! Must retry!" vs. "Must give up! Don't retry!"]
  18. Did the friendship succeed?
      • The client does not really know
      • What can the client do?
        • Don't make any guarantees.
        • Never retry. At Most Once.
        • Always retry. At Least Once.
  19. Making new friendship
      1. Transaction in an ACID database
         • single master, success is atomic (either yes or no)
         • atomic rollback is possible
      2. Cache cluster refresh
         • many replicas, no master
         • no rollback, partial failures are possible
  20. Idempotence
      https://en.wikipedia.org/wiki/Idempotence
      The "Always retry" policy can be applied only to Idempotent Operations
      • An operation that can be reapplied multiple times with the same result
        • e.g.: read, Set.add(), Math.max(x, y)
      • Atomic change with order and dup control
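
      A small sketch of an "always retry" (at-least-once) wrapper. Retries is a hypothetical helper, not part of one-nio; the point it illustrates is that blind retries are only safe when the wrapped operation is idempotent.

      import java.util.concurrent.Callable;

      final class Retries {
          // Safe only for idempotent operations (read, Set.add(), Math.max(x, y), …),
          // because the operation may end up applied more than once.
          static <T> T retryIdempotent(Callable<T> idempotentOp, int maxAttempts) throws Exception {
              if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be >= 1");
              Exception last = null;
              for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                  try {
                      return idempotentOp.call();
                  } catch (Exception e) {
                      last = e;   // transient failure: retrying cannot corrupt state
                  }
              }
              throw last;         // still failing after maxAttempts
          }
      }
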
  21. Idempotence in an ACID database
      [Sequence diagram: client sends "Make friends"; server checks "Already friends? No, let's make it!", but the client's wait ends in a timeout; the client retries "Make friends"; server checks "Already friends? Yes, NOP!" and replies "Friendship, peace and bubble gum!"]
  22. Sequencing
      [Sequence diagram: the client generates an id, OpId := Generate(), and sends MakeFriends(OpId); the server asks "Is Dup(OpId)?" - no, so it makes the changes and replies "Made friends!"]
      Generate() examples:
      • OpId += 1
      • OpId = currentTimeMillis()
      • OpId = TimeUUID
      http://johannburkard.de/software/uuid/
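
      A sketch of server-side duplicate suppression keyed by OpId. SequencedServer and applyMakeFriends are hypothetical names, and the in-memory set only illustrates the idea; a real server would persist the seen OpIds together with the data change in one transaction.

      import java.util.Set;
      import java.util.UUID;
      import java.util.concurrent.ConcurrentHashMap;

      final class SequencedServer {
          private final Set<UUID> seenOpIds = ConcurrentHashMap.newKeySet();

          // The client generates the OpId once and reuses it on every retry.
          boolean makeFriends(UUID opId, long userA, long userB) {
              if (!seenOpIds.add(opId)) {
                  return true;                 // Is Dup(OpId)? Yes -> NOP, report success
              }
              applyMakeFriends(userA, userB);  // No -> making changes
              return true;
          }

          private void applyMakeFriends(long userA, long userB) { /* write to storage */ }
      }
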
  23. Making new friendship
      1. Transaction in an ACID database
         • single master, success is atomic (either yes or no)
         • atomic rollback is possible
      2. Cache cluster refresh
         • many replicas, no master
         • no rollback, partial failures are possible
  24. Cache cluster refresh
      [Diagram: add(Friend) is applied to every replica of partition p = 0 (N01, N02, N03, …)]
      Retries are meaningless
      But replicas' state will diverge otherwise
  25. Syncing cache from DB
      • A background data sync process
        • Reads updated records from the ACID store
          SELECT * FROM users WHERE modified > ?
        • Applies them into its memory
        • Loads updates on node startup
      • Retry can be omitted then
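
      A sketch of such a background sync loop over JDBC, assuming a users table with a modified column; UserCacheSyncer, the column names and the id-to-name mapping are illustrative, not the actual ok.ru cache code.

      import java.sql.Connection;
      import java.sql.PreparedStatement;
      import java.sql.ResultSet;
      import java.sql.SQLException;
      import java.sql.Timestamp;
      import java.util.Map;
      import java.util.concurrent.ConcurrentHashMap;

      final class UserCacheSyncer {
          private final Map<Long, String> usersById = new ConcurrentHashMap<>(); // id -> name
          private volatile Timestamp lastSeen = new Timestamp(0);                // full load on startup

          // Called periodically by a single background thread.
          void syncOnce(Connection db) throws SQLException {
              try (PreparedStatement ps =
                       db.prepareStatement("SELECT id, name, modified FROM users WHERE modified > ?")) {
                  ps.setTimestamp(1, lastSeen);
                  try (ResultSet rs = ps.executeQuery()) {
                      while (rs.next()) {
                          usersById.put(rs.getLong("id"), rs.getString("name")); // apply into memory
                          Timestamp m = rs.getTimestamp("modified");
                          if (m.after(lastSeen)) lastSeen = m;                   // remember the high-water mark
                      }
                  }
              }
          }
      }
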
  26. Server cut-off
      1. Clients stop sending requests to a server
         after X continuous failures for the last second
      2. Clients monitor server availability
         in the background, once a minute
      3. And turn it back on
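
      A minimal sketch of the client-side cut-off state. ServerCutoff is a hypothetical class; the threshold, counters and the background-probe wiring are assumptions made for the illustration.

      import java.util.concurrent.atomic.AtomicInteger;

      final class ServerCutoff {
          private final int failureThreshold;
          private final AtomicInteger consecutiveFailures = new AtomicInteger();
          private volatile boolean cutOff = false;

          ServerCutoff(int failureThreshold) { this.failureThreshold = failureThreshold; }

          boolean isUsable() { return !cutOff; }

          void onSuccess() { consecutiveFailures.set(0); }

          void onFailure() {
              if (consecutiveFailures.incrementAndGet() >= failureThreshold) {
                  cutOff = true;            // 1. stop sending requests to this server
              }
          }

          // 2-3. called by a background availability monitor, e.g. once a minute
          void onProbeSucceeded() {
              consecutiveFailures.set(0);
              cutOff = false;               // turn the server back on
          }
      }
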
  27. Death by slowing down
      A healthy server: Avg = 1.5 ms, Max = 1.5 s, 24 CPU cores, Cap = 24,000 ops
      A slowed-down server: Avg = 24 ms, Max = 1.5 s, 24 CPU cores, Cap = 1,000 ops, under 10,000 ops of load
      Choose a 2.4 ms timeout? Cut it off from clients if latency avg > 2.4 ms?
  28. Speculative retry
      • Makes requests to replicas before the timeout
      • Better 99%, even average latencies
      • More stable system
      • Not always applicable:
        • idempotent ops, additional load, traffic (to consider)
      • Can be balanced: always, >avg, >99p
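
      A sketch of speculative retry with CompletableFuture: if the primary replica has not answered within a chosen delay, a second replica is asked and the first successful response wins. SpeculativeRetry is a hypothetical helper; error propagation when both replicas fail is omitted for brevity, and as the slide notes this is only safe for idempotent calls.

      import java.util.concurrent.CompletableFuture;
      import java.util.concurrent.Executors;
      import java.util.concurrent.ScheduledExecutorService;
      import java.util.concurrent.TimeUnit;
      import java.util.function.Supplier;

      final class SpeculativeRetry {
          private static final ScheduledExecutorService SCHEDULER =
              Executors.newSingleThreadScheduledExecutor();

          static <T> CompletableFuture<T> call(Supplier<CompletableFuture<T>> primary,
                                               Supplier<CompletableFuture<T>> backup,
                                               long speculateAfterMillis) {
              CompletableFuture<T> result = new CompletableFuture<>();
              primary.get().whenComplete((v, e) -> { if (e == null) result.complete(v); });

              SCHEDULER.schedule(() -> {
                  if (!result.isDone()) {
                      // primary is slow: speculatively ask another replica
                      backup.get().whenComplete((v, e) -> { if (e == null) result.complete(v); });
                  }
              }, speculateAfterMillis, TimeUnit.MILLISECONDS);

              return result;   // completes with whichever replica answers first
          }
      }
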
  29. All replicas failure
      • Excessive load
      • Excessive paranoia
      • Bugs
      • Human error
      • Massive outages
  30. Degrade (gracefully)!
      • Use of non-authoritative datasources, degrade consistency
      • Use of incomplete data in UI, partial feature degradation
      • Single feature full degradation
  31. The code
      interface UserCache {
          @RemoteMethod
          Distributed<Collection<User>> getUsersByIds(long[] keys);
      }

      interface Distributed<D> {
          boolean isInconsistency();
          D getData();
      }

      class UserCacheStub implements UserCache {
          public Distributed<Collection<User>> getUsersByIds(long[] keys) {
              return Distributed.inconsistent();
          }
      }
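
      A sketch of how calling code might use this wrapper to degrade a single feature instead of failing the whole page. It assumes the UserCache, Distributed and User types above; FriendsPage, renderFriends and renderFriendsUnavailable are hypothetical UI hooks added for the illustration.

      import java.util.Collection;

      final class FriendsPage {
          private final UserCache userCache;
          FriendsPage(UserCache userCache) { this.userCache = userCache; }

          void show(long[] friendsIds) {
              Distributed<Collection<User>> result = userCache.getUsersByIds(friendsIds);
              if (result.isInconsistency()) {
                  renderFriendsUnavailable();      // single feature degrades, the page still works
              } else {
                  renderFriends(result.getData()); // normal path
              }
          }

          private void renderFriends(Collection<User> users) { /* UI */ }
          private void renderFriendsUnavailable() { /* "friends are temporarily unavailable" */ }
      }
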
  32. What to test for failure?
      • The product you make
      • Operations in the production env
      • "Standard" products - with special care!
  33. The product we make: "Guerrilla"
      • What it does:
        • Detects network connections between servers
        • Disables them (iptables drop)
        • Runs auto tests
      • What we check:
        • No crashes, nice UI messages are rendered
        • Server does start and can serve requests
  34. Why
      • To know an accident exists. Fast.
      • To track down the source of an accident. Fast.
      • To prevent accidents before they happen.
  35. Is there (will there be) an accident?
      • Zabbix
      • Cacti
      • Operational metrics
        • Names of operations, e.g. "Graph.getFriendsByFilter"
        • Call count, their success or failure
        • Latency of calls
  36. What the charts show us
      • Current metrics and trends
      • Aggregated call and failure counts
      • Aggregated latencies
        • Average, Max
        • Percentiles 50, 75, 98, 99, 99.9
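
      A toy sketch of per-operation metrics like the ones charted above. A real deployment would use a metrics library with proper histograms; OperationMetrics is illustrative only and is not how the ok.ru pipeline works.

      import java.util.ArrayList;
      import java.util.Collections;
      import java.util.List;

      final class OperationMetrics {
          private final String name;                     // e.g. "Graph.getFriendsByFilter"
          private long calls, failures;
          private final List<Long> latenciesMicros = new ArrayList<>();

          OperationMetrics(String name) { this.name = name; }

          synchronized void record(long latencyMicros, boolean success) {
              calls++;
              if (!success) failures++;
              latenciesMicros.add(latencyMicros);
          }

          synchronized String report() {
              List<Long> sorted = new ArrayList<>(latenciesMicros);
              Collections.sort(sorted);
              long max = sorted.isEmpty() ? 0 : sorted.get(sorted.size() - 1);
              double avg = sorted.stream().mapToLong(Long::longValue).average().orElse(0);
              long p99 = sorted.isEmpty() ? 0 : sorted.get((int) Math.floor(0.99 * (sorted.size() - 1)));
              return name + ": calls=" + calls + " failures=" + failures
                   + " avg=" + avg + "us max=" + max + "us p99=" + p99 + "us";
          }
      }
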
  37. Short summary
      • The possibilities for failure in distributed systems are endless
      • Don't "prevent", but mask failures through redundancy
      • Degrade gracefully on unmask-able failures
      • Test failures
      • Production diagnostics are key to failure detection and prevention