
Using Riak for events storage and analysis at Booking.com

June 12, 2015


At Booking.com, we have a constant flow of events coming from various applications and internal subsystems. This critical data needs to be stored for real-time, medium- and long-term analysis. Events are schema-less, making it difficult to use standard analysis tools. This presentation will explain how we built a real-time storage and analysis solution based on Riak. The talk will cover: data aggregation and serialization, Riak configuration, solutions for lowering network usage, and finally, how Riak's advanced features are used to perform real-time data crunching on the cluster nodes.


Transcript

  1. Using Riak for Events storage and analysis at Booking.com Damien

    Krotkine
  2. Damien Krotkine • Software Engineer at Booking.com • github.com/dams •

    @damsieboy • dkrotkine
  3. None
  4. KEY FIGURES • 600,000 hotels • 212 countries • 800,000

    room nights every 24 hours • 43 million+ guest reviews • 155+ offices worldwide • 8,600 people • not a small website…
  5. INTRODUCTION

  6. None
  7. None
  8. None
  9. None
  10. www API mobi

  11. www API mobi frontend

  12. www API frontend backend mobi events storage events: info about

    subsystems status
  13. backend web mobi api databases caches load balancers availability cluster

    email etc…
  14. WHAT IS AN EVENT ?

  15. EVENT STRUCTURE • Provides info about subsystems • Data •

    Deep HashMap • Timestamp • Type + Subtype • The rest: specific data • Schema-less
  16. { timestamp => 12345, type => 'WEB', subtype => 'app',
        dc => 1,
        action => {
            is_normal_user => 1,
            pageview_id => '188a362744c301c2',
            # ...
        },
        tuning => {
            the_request => 'GET /display/...',
            bytes_body => 35,
            wallclock => 111,
            nr_warnings => 0,
            # ...
        },
        # ...
      }
  17. { type => 'FAV', subtype => 'fav', timestamp => 1401262979,
        dc => 1,
        tuning => {
            flatav => {
                cluster => '205',
                sum_latencies => 21,
                role => 'fav',
                num_queries => 7,
            },
        },
      }
  18. EVENTS FLOW PROPERTIES • Read-only • Schema-less • Continuous, sequential,

    timed • 15 K events per sec • 1.25 Billion events per day • peak at 70 MB/s, min 25 MB/s • 100 GB per hour
  19. USAGE

  20. USAGE 1. GRAPHS 2. DECISION MAKING 3. SHORT TERM ANALYSIS

    4. A/B TESTING
  21. GRAPHS • Graph in real-time ( few seconds lag )

    • Graph as many systems as possible • General platform health check
  22. GRAPHS

  23. GRAPHS

  24. DASHBOARDS

  25. META GRAPHS

  26. USAGE 1. GRAPHS 2. DECISION MAKING 3. SHORT TERM ANALYSIS

    4. A/B TESTING
  27. DECISION MAKING • Strategic decision ( use facts ) •

    Long term or short term • Technical / non-technical reporting
  28. USAGE 1. GRAPHS 2. DECISION MAKING 3. SHORT TERM ANALYSIS

    4. A/B TESTING
  29. SHORT TERM ANALYSIS • From 10 sec ago -> few

    hours ago • Code deployment checks and rollback • Anomaly Detector
  30. USAGE 1. GRAPHS 2. DECISION MAKING 3. SHORT TERM ANALYSIS

    4. A/B TESTING
  31. A/B TESTING • Our core philosophy: use facts • It

    means: do A/B testing • Concept of Experiments • Events provide data to compare • We need data from the last few days
  32. EVENT AGGREGATION

  33. EVENT AGGREGATION • Group events • Granularity we need: second

  34. SERIALIZATION • JSON didn’t work for us (slow, big, lack

    features) • Created Sereal in 2012 • « Sereal, a new, binary data serialization format that provides high-performance, schema-less serialization » • https://github.com/Sereal/Sereal
  35. event event events storage event event event event event event

    event event event event event
  36. e e e e e e e e e e

    e e e e e e e e LOGGER e e
  37. web api e e e e e e e e

    e e e e e e e e e e e e e e e e e e e e e
  38. web api dbs e e e e e e e

    e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e
  39. web api dbs e e e e e e e

    e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e 1 sec e e
  40. web api dbs e e e e e e e

    e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e 1 sec e e
  41. web api dbs e e e e e e e

    e e e e e 1 sec events storage
  42. web api dbs e e e e e e e

    e e 1 sec events storage e e e reserialize + compress
  43. events storage LOGGER … LOGGER LOGGER
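The logger pipeline on the slides above (raw events grouped into one-second buckets, then reserialized and compressed) can be sketched as follows. This is an illustrative sketch only: the talk uses Sereal for serialization, so JSON + zlib stand in here, and the bucket key layout is a simplification.

```python
import json
import zlib
from collections import defaultdict

def aggregate(events):
    """Group raw events into one-second buckets keyed by
    (epoch, type, subtype), then serialize and compress each bucket.
    JSON + zlib stand in for the Sereal serialization used in the talk."""
    buckets = defaultdict(list)
    for event in events:
        key = (event["timestamp"], event["type"], event["subtype"])
        buckets[key].append(event)
    return {
        key: zlib.compress(json.dumps(bucket).encode())
        for key, bucket in buckets.items()
    }

events = [
    {"timestamp": 12345, "type": "WEB", "subtype": "app", "wallclock": 111},
    {"timestamp": 12345, "type": "WEB", "subtype": "app", "wallclock": 98},
    {"timestamp": 12345, "type": "FAV", "subtype": "fav", "num_queries": 7},
]
chunks = aggregate(events)
# one compressed blob per (epoch, type, subtype)
print(sorted(chunks))  # [(12345, 'FAV', 'fav'), (12345, 'WEB', 'app')]
```

Compressing one bucket at a time, close to the source, is what keeps the 100 GB/hour stream manageable before it reaches storage.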

  44. STORAGE

  45. WHAT WE WANT • Storage security • Mass write performance

    • Mass read performance • Easy administration • Very scalable • => We reviewed a bunch of contenders
  46. WE CHOSE RIAK • Security: cluster, distributed, very robust •

    Good and predictable read / write performance • The easiest to setup and administrate • Advanced features (MapReduce, triggers, 2i, CRDTs …) • Riak Search • Multi Datacenter Replication
  47. None
  48. CLUSTER • Commodity hardware • All nodes serve data •

    Data replication • Gossip between nodes • No master • Distributed system • Ring of servers
  49. hash(key)
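The hash(key) placement above can be sketched with a minimal consistent-hashing model. Riak does hash bucket plus key with SHA-1 onto a 160-bit ring, but the exact input encoding and vnode claim algorithm differ; the `preference_list` walk below is illustrative only.

```python
import hashlib

RING_SIZE = 2 ** 160  # Riak hashes bucket+key with SHA-1 onto a 160-bit ring

def ring_position(bucket, key):
    """Map a bucket/key pair to a position on the ring."""
    digest = hashlib.sha1((bucket + key).encode()).digest()
    return int.from_bytes(digest, "big")

def preference_list(bucket, key, partitions, n_val=3):
    """Illustrative only: walk clockwise from the key's ring position and
    take the next n_val partitions (Riak's real vnode claim logic differs)."""
    pos = ring_position(bucket, key)
    ordered = sorted(partitions)
    start = next((i for i, p in enumerate(ordered) if p >= pos), 0)
    return [ordered[(start + i) % len(ordered)] for i in range(n_val)]

# 8 evenly spaced partitions on the ring
partitions = [i * (RING_SIZE // 8) for i in range(8)]
replicas = preference_list("metadata", "1428415043-1", partitions)
assert len(replicas) == 3  # n_val = 3 distinct partitions hold the value
```

Because every node can compute this placement, any node can coordinate a request: there is no master.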

  50. None
  51. KEY VALUE STORE • Namespaces: bucket • Values: opaque or

    CRDTs
  52. RIAK: ADVANCED FEATURES • MapReduce • Secondary indexes (2i) •

    Riak Search • Multi DataCenter Replication
  53. MULTI-BACKEND FOR STORAGE • Bitcask • Eleveldb • Memory

  54. BACKEND: BITCASK • Log-based storage backend • Append-only files (AOF

    files) • Advanced expiration • Predictable performance (1 disk-seek max) • Perfect for sequential data • SSD or spinning disks ?
  55. CLUSTER CONFIGURATION

  56. DISK SPACE NEEDED • 8 days • 100 GB per

    hour • Replication 3 • 100 * 24 * 8 * 3 • Need ~60 TB
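The sizing arithmetic on the slide, written out:

```python
# Back-of-the-envelope sizing from the slide:
# 100 GB/hour, 8 days of retention, replication factor 3.
gb_per_hour = 100
hours = 24 * 8
replication = 3
total_gb = gb_per_hour * hours * replication
print(total_gb)          # 57600 GB
print(total_gb / 1000)   # 57.6 TB, rounded up to ~60 TB on the slide
```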
  57. HARDWARE • 2 clusters • 12 nodes, then 16, then

    27 • 12 CPU cores ( Xeon 2.5Ghz) • 192 GB RAM • network 1 Gbit/s • 8 TB (raid 6) • Cluster total space: 128 TB
  58. DATA DESIGN

  59. web api dbs e e e e e e e

    e e 1 sec events storage 1 blob per EPOCH / DC / CELL / TYPE / SUBTYPE 500 KB max chunks
  60. DATA • Bucket name: “data“ • Key: “12345:1:cell0:WEB:app:chunk0“ • Value:

    List of events (Hashmaps), serialized & compressed • 200 keys per seconds
  61. METADATA • Bucket name: “metadata“ • Key: <epoch>-<dc> “1428415043-2“ •
      Value: list of data keys:
      [ “1428415043:1:cell0:WEB:app:chunk0“,
        “1428415043:1:cell0:WEB:app:chunk1“,
        …
        “1428415043:4:cell0:EMK::chunk3“ ]
      • As pipe separated value (PSV)
  62. WRITE DATA

  63. PUSH DATA IN • Every second: • Push data values

    to Riak, in parallel • Wait for success • Push metadata
  64. JAVA
      Bucket DataBucket = riakClient.fetchBucket("data").execute();
      DataBucket.store("12345:1:cell0:WEB:app:chunk0", Data1).execute();
      DataBucket.store("12345:1:cell0:WEB:app:chunk1", Data2).execute();
      DataBucket.store("12345:1:cell0:WEB:app:chunk2", Data3).execute();
      Bucket MetaDataBucket = riakClient.fetchBucket("metadata").execute();
      MetaDataBucket.store("12345-1", metaData).execute();
      riakClient.shutdown();
  65. Perl
      my $client = Riak::Client->new(…);
      $client->put(data => '12345:1:cell0:WEB:app:chunk0', $data1);
      $client->put(data => '12345:1:cell0:WEB:app:chunk1', $data2);
      $client->put(data => '12345:1:cell0:WEB:app:chunk2', $data3);
      $client->put(metadata => '12345-1', $metadata, 'text/plain');
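The ordering described on slide 63 (data chunks in parallel, wait for success, metadata last) is what makes a second atomic for readers: the metadata key only appears once all of its data keys exist. A sketch in Python, where `put(bucket, key, value)` is a hypothetical stand-in for any Riak client call:

```python
from concurrent.futures import ThreadPoolExecutor

def store_second(put, epoch, dc, chunks):
    """Write all data chunks in parallel, wait for every write to succeed,
    then write the metadata key that makes this second visible to readers.
    put(bucket, key, value) stands in for any Riak client call."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(put, "data", key, value)
                   for key, value in chunks.items()]
        for future in futures:
            future.result()  # raises if any data write failed
    put("metadata", f"{epoch}-{dc}", "|".join(chunks))

writes = []
store_second(lambda bucket, key, value: writes.append((bucket, key)),
             12345, 1,
             {"12345:1:cell0:WEB:app:chunk0": b"...",
              "12345:1:cell0:WEB:app:chunk1": b"..."})
assert writes[-1][0] == "metadata"  # metadata is always written last
```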
  66. READ DATA

  67. READ ONE SECOND • For one second (a given epoch)

    • Request metadata for <epoch>-DC • Parse value • Filter out unwanted types / subtypes • Fetch the keys from the “data” bucket
  68. Perl
      my $client = Riak::Client->new(…);
      my @array = split '\|', $client->get(metadata => '1428415043-1');
      my @filtered_array = grep { /WEB/ } @array;
      $client->get(data => $_) foreach @filtered_array;
  69. READ AN INTERVAL • For an interval epoch1 -> epoch2

    • Generate the list of epochs • Fetch in parallel • Riak excels at handling huge number of req/sec
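Because keys are epoch-based, reading an interval needs no scan: the client generates every metadata key itself and fetches them all in parallel. A sketch, with `get(bucket, key)` as a hypothetical stand-in for any Riak client call:

```python
from concurrent.futures import ThreadPoolExecutor

def read_interval(get, epoch1, epoch2, dc):
    """Fetch the metadata key of every second in [epoch1, epoch2] in
    parallel; get(bucket, key) stands in for any Riak client call."""
    epochs = range(epoch1, epoch2 + 1)
    with ThreadPoolExecutor(max_workers=32) as pool:
        return list(pool.map(lambda e: get("metadata", f"{e}-{dc}"), epochs))

results = read_interval(lambda bucket, key: key, 1428415043, 1428415045, 1)
print(results)  # ['1428415043-1', '1428415044-1', '1428415045-1']
```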
  70. RIAK CLUSTER HEALTH

  71. CPU USAGE

  72. DISK IOPS

  73. DISK IO UTILIZATION %

  74. DISK SPACE RECLAIMED one day

  75. REAL TIME PROCESSING OUTSIDE OF RIAK

  76. EXAMPLES • Streaming => Graphite ( every sec ) •

    Streaming => Anomaly Detector ( last 2 min ) • Streaming => Experiment analysis ( last day ) • Every minute => Hadoop • Manual request => test, debug, investigate • Batch fetch => ad hoc analysis • => Huge numbers of read requests
  77. events storage => graphite cluster, Anomaly detector, experiment
      cluster, hadoop cluster, mysql analysis, manual requests
      ( 50 MB/s to each consumer )
  78. THIS IS REALTIME • 1 second of data • Stored

    in < 1 sec • Available after < 1 sec • Issue : network saturation
  79. REAL TIME PROCESSING INSIDE RIAK

  80. THE IDEA • Instead of • Fetch data, • Crunch

    data (ex: average), • Produce a small result • Do • Bring code to data • Crunch data on Riak • Fetch the result
  81. WHAT TAKES TIME • Takes a lot of time •

    Fetching data out: network issue • Decompressing: CPU time issue • Takes almost no time • Crunching data
  82. MAPREDUCE • Input: epoch-dc • Map1: metadata keys => data

    keys • Map2: data crunching • Reduce: aggregate • Realtime: OK • network usage: OK • CPU time: NOT OK
  83. HOOKS • Every time metadata is written • Post-Commit hook

    triggered • Riak executes a callback we provided • Crunch data on the nodes
  84. None
  85. On each node (NODE HOST): the RIAK service's post-commit hook sends
      keys over a socket to the REST service on the same host, which
      decompresses and processes all tasks; the result is sent for storage
  86. HOOK CODE
      metadata_stored_hook(RiakObject) ->
          Key = riak_object:key(RiakObject),
          Bucket = riak_object:bucket(RiakObject),
          [ Epoch, DC ] = binary:split(Key, <<"-">>),
          MetaData = riak_object:get_value(RiakObject),
          DataKeys = binary:split(MetaData, <<"|">>, [ global ]),
          send_to_REST(Epoch, Hostname, DataKeys),
          ok.
  87. send_to_REST(Epoch, Hostname, DataKeys) ->
          Method = post,
          URL = "http://" ++ binary_to_list(Hostname)
                ++ ":5000?epoch=" ++ binary_to_list(Epoch),
          HTTPOptions = [ { timeout, 4000 } ],
          Options = [ { body_format, string },
                      { sync, false },
                      { receiver, fun(ReplyInfo) -> ok end } ],
          Body = iolist_to_binary(mochijson2:encode(DataKeys)),
          httpc:request(Method, {URL, [], "application/json", Body},
                        HTTPOptions, Options),
          ok.
  88. REST SERVICE • In Perl, using PSGI (WSGI-like), Starman, preforks

    • Allows writing data crunchers in Perl • Also supports loading code on demand • Monitored list of operational companions
  89. REST SERVICE: SCALABLE • Scalable • 1 second = 200

    keys • 16 nodes, 10 CPUs ( 2 for Riak ) • 1 key must be crunched in 16*10/200 sec = 0.8 sec • => we have time
  90. OPTIMIZATION • PUT goes to the primary node for that key •
      on each node, R : Riak, C : Companion
  91. ADVANTAGES • CPU usage and execution time can be capped

    • Data is local to processing • Data is decompressed only once • Data crunching done all at once • Two systems are loosely coupled • can be written in any language
  92. DISADVANTAGES • Only for incoming data (streaming), not old data

    • Can’t easily use cross-second data
  93. THE BANDWIDTH PROBLEM

  94. • PUT - bad case • n_val = 3 • inside usage = 3 x outside usage
  95. • PUT - good case • n_val = 3 • inside usage = 2 x outside usage
  96. • GET - bad case • inside usage = 3 x outside usage
  97. • GET - good case • inside usage = 2 x outside usage
  98. • network usage ( PUT and GET ): • 3 x 13/16 + 2 x 3/16 = 2.81 •
      plus gossip • inside network > 3 x outside network
  99. • Usually it’s not a problem • But in our

    case: • big values, constant PUTs, lots of GETs • sadly, only 1 Gbit/s • => network bandwidth issue
  100. THE BANDWIDTH SOLUTIONS

  101. THE BANDWIDTH SOLUTIONS 1. Optimize GET for network usage, not

    speed 2. Don’t choose a node at random
  102. • GET - bad case • n_val = 1 • inside usage = 1 x outside
  103. • GET - good case • n_val = 1 • inside usage = 0 x outside
  104. WARNING • Possible only because data is read-only • Data

    has internal checksum • No conflict possible • Corruption detected
  105. RESULT • practical network usage cut in half !

  106. THE BANDWIDTH SOLUTIONS 1. Optimize GET for network usage, not

    speed 2. Don’t choose a node at random
  107. • bucket = “metadata” • key = “12345”

  108. • bucket = “metadata” • key = “12345” Hash =

    hashFunction(bucket + key) RingStatus = getRingStatus PrimaryNodes = Fun(Hash, RingStatus)
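The pseudocode above (hash the bucket and key, then look up the primary nodes from the ring status) can be sketched as a deterministic client-side router. Illustrative only: a real client would reuse Riak's exact hash and the ring's vnode claim table, not the simple modulo below.

```python
import hashlib

def primary_node(bucket, key, nodes):
    """Illustrative only: pick a node deterministically from the key's
    hash so every client sends a given key to the same node, mimicking
    hashFunction() + getRingStatus() from the slide. Riak's real
    placement walks the ring's vnode claim table."""
    digest = hashlib.sha1((bucket + key).encode()).digest()
    return nodes[int.from_bytes(digest, "big") % len(nodes)]

nodes = [f"riak{i:02d}" for i in range(16)]
target = primary_node("metadata", "12345-1", nodes)
# Every client computes the same target for the same key,
# so GETs hit a node that holds the data locally.
assert target == primary_node("metadata", "12345-1", nodes)
```

As the next slides warn, this only works if the node list is monitored and clients fall back to a random node when the computed one is down.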
  109. hashFunction() getRingStatus()

  110. hashFunction() getRingStatus()

  111. WARNING • Possible only if • Nodes list is monitored

    • In case of failed node, default to random • Data is requested in a uniform way
  112. RESULT • Network usage even more reduced ! • Especially

    for GETs
  113. NETWORK USAGE

  114. CONCLUSION

  115. CONCLUSION • We used only Riak Open Source • No

    training, self-taught, small team • Riak is a great solution • Robust, fast, scalable, easy • Very flexible and hackable • Helps us continue scaling
  116. Q&A @damsieboy