
Using Riak for events storage and analysis at Booking.com

At Booking.com, we have a constant flow of events coming from various applications and internal subsystems. This critical data needs to be stored for real-time, medium-term, and long-term analysis. Events are schema-less, making it difficult to use standard analysis tools. This presentation explains how we built a real-time storage and analysis solution based on Riak. The talk covers: data aggregation and serialization, Riak configuration, solutions for lowering network usage, and finally, how Riak's advanced features are used to perform real-time data crunching on the cluster nodes.

dams

June 12, 2015

Transcript

  1. KEY FIGURES • 600,000 hotels • 212 countries • 800,000 room nights every 24 hours • 43 million+ guest reviews • 155+ offices worldwide • 8,600 people • not a small website…
  2. EVENT STRUCTURE • Provides info about subsystems • Data • Deep HashMap • Timestamp • Type + Subtype • The rest: specific data • Schema-less
  3. { timestamp => 12345,
       type      => 'WEB',
       subtype   => 'app',
       dc        => 1,
       action    => {
         is_normal_user => 1,
         pageview_id    => '188a362744c301c2',
         # ...
       },
       tuning    => {
         the_request => 'GET /display/...',
         bytes_body  => 35,
         wallclock   => 111,
         nr_warnings => 0,
         # ...
       },
       # ...
     }
  4. { type      => 'FAV',
       subtype   => 'fav',
       timestamp => 1401262979,
       dc        => 1,
       tuning    => {
         flatav => {
           cluster       => '205',
           sum_latencies => 21,
           role          => 'fav',
           num_queries   => 7
         }
       }
     }
  5. EVENTS FLOW PROPERTIES • Read-only • Schema-less • Continuous, sequential, timed • 15K events per sec • 1.25 billion events per day • Peak at 70 MB/s, min 25 MB/s • 100 GB per hour
  6. GRAPHS • Graph in real time (few seconds lag) • Graph as many systems as possible • General platform health check
  7. DECISION MAKING • Strategic decisions (use facts) • Long term or short term • Technical / non-technical reporting
  8. SHORT TERM ANALYSIS • From 10 sec ago -> a few hours ago • Code deployment checks and rollback • Anomaly Detector
  9. A/B TESTING • Our core philosophy: use facts • It means: do A/B testing • Concept of Experiments • Events provide data to compare • We need data from the last few days
  10. SERIALIZATION • JSON didn't work for us (slow, big, lacks features) • Created Sereal in 2012 • « Sereal, a new, binary data serialization format that provides high-performance, schema-less serialization » • https://github.com/Sereal/Sereal
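For illustration, a minimal sketch of round-tripping an event through Sereal in Perl (the event hash is illustrative):

      use Sereal::Encoder;
      use Sereal::Decoder;

      my $event = { timestamp => 12345, type => 'WEB', subtype => 'app' };

      my $blob    = Sereal::Encoder->new->encode($event);  # compact binary blob
      my $decoded = Sereal::Decoder->new->decode($blob);   # back to a Perl hashref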
  11. [Diagram: a continuous stream of events flowing into a LOGGER]
  12. [Diagram: web and api servers each emitting streams of events]
  13. [Diagram: web, api, and database servers each emitting streams of events]
  14. [Diagram: the event streams being grouped into 1-second batches]
  15. [Diagram: the event streams being grouped into 1-second batches (animation step)]
  16. [Diagram: 1-second batches of events flowing into events storage]
  17. [Diagram: 1-second batches being reserialized and compressed on their way to events storage]
  18. WHAT WE WANT • Storage security • Mass write performance • Mass read performance • Easy administration • Very scalable • => We reviewed a bunch of contenders
  19. WE CHOSE RIAK • Security: cluster, distributed, very robust • Good and predictable read/write performance • The easiest to set up and administrate • Advanced features (MapReduce, triggers, 2i, CRDTs …) • Riak Search • Multi-Datacenter Replication
  20. CLUSTER • Commodity hardware • All nodes serve data • Data replication • Gossip between nodes • No master • Distributed system • Ring of servers
  21. RIAK: ADVANCED FEATURES • MapReduce • Secondary indexes (2i) • Riak Search • Multi-DataCenter Replication
  22. BACKEND: BITCASK • Log-based storage backend • Append-only files (AOF files) • Advanced expiration • Predictable performance (1 disk seek max) • Perfect for sequential data • SSD or spinning disks?
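The 8-day retention on the next slide maps naturally onto Bitcask's automatic expiration; a hedged app.config excerpt (path illustrative):

      {bitcask, [
          {data_root,   "/var/lib/riak/bitcask"},  %% illustrative path
          {expiry_secs, 691200}                    %% 8 days = 8 * 86400 seconds
      ]}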
  23. DISK SPACE NEEDED • 8 days • 100 GB per hour • Replication 3 • 100 GB × 24 h × 8 days × 3 = 57,600 GB • Need ~60 TB
  24. HARDWARE • 2 clusters • 12 nodes, then 16, then 27 • 12 CPU cores (Xeon 2.5 GHz) • 192 GB RAM • Network 1 Gbit/s • 8 TB (RAID 6) • Cluster total space: 128 TB
  25. [Diagram: web, api, and database events batched per second into events storage] • 1 blob per EPOCH / DC / CELL / TYPE / SUBTYPE • 500 KB max chunks
  26. DATA • Bucket name: "data" • Key: "12345:1:cell0:WEB:app:chunk0" • Value: list of events (hashmaps), serialized & compressed • 200 keys per second
  27. METADATA • Bucket name: "metadata" • Key: <epoch>-<dc>, e.g. "1428415043-2" • Value: list of data keys:
      [ "1428415043:1:cell0:WEB:app:chunk0",
        "1428415043:1:cell0:WEB:app:chunk1",
        …
        "1428415043:4:cell0:EMK::chunk3" ]
      • Stored as pipe-separated values (PSV)
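A minimal Perl sketch of assembling these keys and the pipe-separated metadata value (field values are illustrative):

      my ($epoch, $dc, $cell) = (1428415043, 1, 'cell0');

      # one data key per epoch / dc / cell / type / subtype / chunk
      my @data_keys = map {
          my ($type, $subtype, $chunk) = @$_;
          join ':', $epoch, $dc, $cell, $type, $subtype, "chunk$chunk";
      } ( [ 'WEB', 'app', 0 ], [ 'WEB', 'app', 1 ] );

      # metadata key and its pipe-separated value
      my $metadata_key   = "$epoch-$dc";
      my $metadata_value = join '|', @data_keys;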
  28. PUSH DATA IN • Every second: • Push data values to Riak, in parallel • Wait for success • Push metadata
  29. JAVA
      Bucket DataBucket = riakClient.fetchBucket("data").execute();
      DataBucket.store("12345:1:cell0:WEB:app:chunk0", Data1).execute();
      DataBucket.store("12345:1:cell0:WEB:app:chunk1", Data2).execute();
      DataBucket.store("12345:1:cell0:WEB:app:chunk2", Data3).execute();
      Bucket MetaDataBucket = riakClient.fetchBucket("metadata").execute();
      MetaDataBucket.store("12345-1", metaData).execute();
      riakClient.shutdown();
  30. PERL
      my $client = Riak::Client->new(…);
      $client->put(data => '12345:1:cell0:WEB:app:chunk0', $data1);
      $client->put(data => '12345:1:cell0:WEB:app:chunk1', $data2);
      $client->put(data => '12345:1:cell0:WEB:app:chunk2', $data3);
      $client->put(metadata => '12345-1', $metadata, 'text/plain');
  31. READ ONE SECOND • For one second (a given epoch) • Request metadata for <epoch>-DC • Parse value • Filter out unwanted types / subtypes • Fetch the keys from the "data" bucket
  32. PERL
      my $client = Riak::Client->new(…);
      my @array = split '\|', $client->get(metadata => '1428415043-1');
      my @filtered_array = grep { /WEB/ } @array;
      $client->get(data => $_) foreach @filtered_array;
  33. READ AN INTERVAL • For an interval epoch1 -> epoch2 • Generate the list of epochs • Fetch in parallel • Riak excels at handling a huge number of req/sec
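A hedged sketch of such an interval read with Riak::Client (host, port, DC count, and the 60-second interval are illustrative; the GETs are shown sequentially but would be issued in parallel):

      use Riak::Client;

      my $client = Riak::Client->new( host => 'riak01.example', port => 8087 );
      my ($epoch1, $epoch2) = (1428415043, 1428415103);  # a 60-second interval

      for my $epoch ( $epoch1 .. $epoch2 ) {
          for my $dc ( 1 .. 4 ) {                        # one metadata key per epoch and DC
              my $metadata = $client->get( metadata => "$epoch-$dc" ) or next;
              my @keys   = grep { /WEB/ } split /\|/, $metadata;
              my @chunks = map  { $client->get( data => $_ ) } @keys;
              # ... deserialize and crunch @chunks
          }
      }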
  34. EXAMPLES • Streaming => Graphite (every sec) • Streaming => Anomaly Detector (last 2 min) • Streaming => Experiment analysis (last day) • Every minute => Hadoop • Manual request => test, debug, investigate • Batch fetch => ad hoc analysis • => Huge numbers of read requests
  35. [Diagram: events storage fanning out at roughly 50 MB/s per consumer to the Graphite cluster, the Anomaly Detector, the experiment cluster, the Hadoop cluster, MySQL analysis, and manual requests]
  36. THIS IS REALTIME • 1 second of data • Stored in < 1 sec • Available after < 1 sec • Issue: network saturation
  37. THE IDEA • Instead of • Fetch data • Crunch data (e.g. average) • Produce a small result • Do • Bring code to data • Crunch data on Riak • Fetch the result
  38. WHAT TAKES TIME • Takes a lot of time: • Fetching data out: network issue • Decompressing: CPU time issue • Takes almost no time: • Crunching data
  39. MAPREDUCE • Input: epoch-dc • Map1: metadata keys => data keys • Map2: data crunching • Reduce: aggregate • Realtime: OK • Network usage: OK • CPU time: NOT OK
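A hedged sketch of what such a query could look like against Riak's HTTP /mapred endpoint (the events_mr module and its function names are hypothetical):

      POST /mapred
      {
        "inputs": [ ["metadata", "1428415043-1"] ],
        "query": [
          { "map":    { "language": "erlang", "module": "events_mr", "function": "metadata_to_data_keys" } },
          { "map":    { "language": "erlang", "module": "events_mr", "function": "crunch" } },
          { "reduce": { "language": "erlang", "module": "events_mr", "function": "aggregate" } }
        ]
      }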
  40. HOOKS • Every time metadata is written • Post-commit hook triggered • Riak executes a callback we provided • Crunch data on the nodes
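Post-commit hooks are registered in the bucket properties; a hedged example over Riak's HTTP API (the module name is hypothetical, the function is the one shown on slide 42):

      PUT /buckets/metadata/props
      { "props": { "postcommit": [ { "mod": "metadata_hooks", "fun": "metadata_stored_hook" } ] } }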
  41. [Diagram: on each node host, the Riak post-commit hook sends keys over a socket to a local REST service, which decompresses the data, processes all tasks, and sends the result back to Riak for storage]
  42. HOOK CODE
      metadata_stored_hook(RiakObject) ->
          Key = riak_object:key(RiakObject),
          Bucket = riak_object:bucket(RiakObject),
          [ Epoch, DC ] = binary:split(Key, <<"-">>),
          MetaData = riak_object:get_value(RiakObject),
          DataKeys = binary:split(MetaData, <<"|">>, [ global ]),
          {ok, Hostname} = inet:gethostname(),  %% elided on the slide; the local node's hostname
          send_to_REST(Epoch, Hostname, DataKeys),
          ok.
  43. send_to_REST(Epoch, Hostname, DataKeys) ->
          Method = post,
          URL = "http://" ++ Hostname ++ ":5000?epoch=" ++ binary_to_list(Epoch),
          HTTPOptions = [ { timeout, 4000 } ],
          Options = [ { body_format, string }, { sync, false },
                      { receiver, fun(_ReplyInfo) -> ok end } ],
          Body = iolist_to_binary(mochijson2:encode(DataKeys)),
          httpc:request(Method, {URL, [], "application/json", Body}, HTTPOptions, Options),
          ok.
  44. REST SERVICE • In Perl, using PSGI (WSGI-like), Starman, preforks • Allows writing data crunchers in Perl • Also supports loading code on demand • Monitored list of operational companions
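A minimal PSGI sketch of such a service (the crunch step is elided; Plack::Request and JSON::XS are assumed to be available):

      # app.psgi (run with: starman --workers 10 --port 5000 app.psgi)
      use strict;
      use warnings;
      use Plack::Request;
      use JSON::XS qw(decode_json);

      my $app = sub {
          my $req       = Plack::Request->new(shift);
          my $epoch     = $req->param('epoch');          # epoch sent by the hook
          my $data_keys = decode_json( $req->content );  # JSON list of data keys

          for my $key (@$data_keys) {
              # fetch $key from the local Riak node, decompress once,
              # then run every registered data cruncher on the events
          }
          [ 200, [ 'Content-Type' => 'text/plain' ], ['ok'] ];
      };
      $app;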
  45. REST SERVICE: SCALABLE • Scalable • 1 second = 200 keys • 16 nodes, 10 CPUs each (of 12 cores, 2 are reserved for Riak) • 1 key must be crunched in 16 × 10 / 200 sec = 0.8 sec • => we have time
  46. OPTIMIZATION [Diagram: a PUT is routed to the primary node for its key; R = Riak, C = Companion]
  47. ADVANTAGES • CPU usage and execution time can be capped • Data is local to processing • Data is decompressed only once • Data crunching done all at once • The two systems are loosely coupled • Crunchers can be written in any language
  48. • PUT - bad case • n_val = 3 • inside usage = 3 x outside usage
  49. • PUT - good case • n_val = 3 • inside usage = 2 x outside usage
  50. • Network usage (PUT and GET): 3 x 13/16 + 2 x 3/16 = 2.81 • Plus gossip • Inside network > 3 x outside network
  51. • Usually it's not a problem • But in our case: • Big values, constant PUTs, lots of GETs • Sadly, only 1 Gbit/s • => Network bandwidth issue
  52. THE BANDWIDTH SOLUTIONS 1. Optimize GET for network usage, not speed 2. Don't choose a node at random
  53. • GET - bad case • n_val = 1 • inside usage = 1 x outside
  54. • GET - good case • n_val = 1 • inside usage = 0 x outside
  55. WARNING • Possible only because data is read-only • Data has internal checksum • No conflict possible • Corruption detected
  56. THE BANDWIDTH SOLUTIONS 1. Optimize GET for network usage, not speed 2. Don't choose a node at random
  57. • bucket = "metadata" • key = "12345" •
      Hash = hashFunction(bucket + key)
      RingStatus = getRingStatus
      PrimaryNodes = Fun(Hash, RingStatus)
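A hedged Perl sketch of that client-side routing (a simplification of Riak's SHA-1 based ring; the node list is illustrative):

      use Digest::SHA qw(sha1);

      # Pick the primary node for a bucket/key pair, given the ring as an
      # ordered list of nodes.
      sub primary_node {
          my ($bucket, $key, @ring) = @_;
          my $hash = unpack 'N', sha1($bucket . $key);  # first 32 bits of SHA-1
          return $ring[ $hash % @ring ];
      }

      my @ring = map { "riak$_.example" } 1 .. 16;      # illustrative node list
      my $node = primary_node('metadata', '12345', @ring);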
  58. WARNING • Possible only if: • Nodes list is monitored • In case of a failed node, default to random • Data is requested in a uniform way
  59. CONCLUSION • We used only Riak Open Source • No training, self-taught, small team • Riak is a great solution • Robust, fast, scalable, easy • Very flexible and hackable • Helps us continue scaling