Using Riak for events storage and analysis at Booking.com

At Booking.com, we have a constant flow of events coming from various applications and internal subsystems. This critical data needs to be stored for real-time, medium- and long-term analysis. Events are schema-less, making it difficult to use standard analysis tools. This presentation explains how we built a real-time storage and analysis solution based on Riak. The talk covers: data aggregation and serialization, Riak configuration, solutions for lowering network usage, and finally, how Riak's advanced features are used to perform real-time data crunching on the cluster nodes.

dams

June 12, 2015
Transcript

  1. KEY FIGURES • 600,000 hotels • 212 countries • 800,000 room nights every 24 hours • 43 million+ guest reviews • 155+ offices worldwide • 8,600 people • not a small website…
  2. EVENT STRUCTURE • Provides info about subsystems • Data: a deep HashMap • Timestamp • Type + Subtype • The rest: specific data • Schema-less
  3. { timestamp => 12345,
       type      => 'WEB',
       subtype   => 'app',
       dc        => 1,
       action    => {
           is_normal_user => 1,
           pageview_id    => '188a362744c301c2',
           # ...
       },
       tuning    => {
           the_request => 'GET /display/...',
           bytes_body  => 35,
           wallclock   => 111,
           nr_warnings => 0,
           # ...
       },
       # ...
     }
  4. { type      => 'FAV',
       subtype   => 'fav',
       timestamp => 1401262979,
       dc        => 1,
       tuning    => {
           flatav => {
               cluster       => '205',
               sum_latencies => 21,
               role          => 'fav',
               num_queries   => 7,
           },
       },
     }
  5. EVENTS FLOW PROPERTIES • Read-only • Schema-less • Continuous, sequential, timed • 15K events per second • 1.25 billion events per day • Peak at 70 MB/s, min 25 MB/s • 100 GB per hour
  6. GRAPHS • Graph in real time (few seconds of lag) • Graph as many systems as possible • General platform health check
  7. DECISION MAKING • Strategic decisions (use facts) • Long term or short term • Technical / non-technical reporting
  8. SHORT TERM ANALYSIS • From 10 seconds ago -> a few hours ago • Code deployment checks and rollbacks • Anomaly Detector
  9. A/B TESTING • Our core philosophy: use facts • It means: do A/B testing • Concept of Experiments • Events provide the data to compare • We need data from the last few days
  10. SERIALIZATION • JSON didn't work for us (slow, big, lacks features) • Created Sereal in 2012 • « Sereal, a new, binary data serialization format that provides high-performance, schema-less serialization » • https://github.com/Sereal/Sereal
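
As a rough illustration (not from the deck; the event fields mirror slide 3, and the Snappy compression choice is an assumption), the Perl round-trip with Sereal looks like this:

    use Sereal::Encoder qw(encode_sereal SRL_SNAPPY);
    use Sereal::Decoder qw(decode_sereal);

    # A toy event, shaped like the examples above
    my $event = {
        timestamp => 12345,
        type      => 'WEB',
        subtype   => 'app',
        dc        => 1,
    };

    # Serialize; enabling Snappy compression is optional
    my $blob = encode_sereal($event, { compress => SRL_SNAPPY });

    # ... store $blob in Riak, fetch it back later ...
    my $copy = decode_sereal($blob);
    print $copy->{type};    # WEB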
  11. [Diagram: many events (e) converging into a LOGGER]
  12. [Diagram: events streaming out of the web and api servers]
  13. [Diagram: events streaming out of the web, api and dbs servers]
  14. [Diagram: the same flow, with events grouped into 1 sec batches]
  15. [Diagram: the same flow, with events grouped into 1 sec batches (repeated)]
  16. [Diagram: the 1 sec batches flowing into the events storage]
  17. [Diagram: the 1 sec batches reserialized + compressed before entering the events storage]
  18. WHAT WE WANT • Storage security • Mass write performance • Mass read performance • Easy administration • Very scalable • => We reviewed a bunch of contenders
  19. WE CHOSE RIAK • Security: cluster, distributed, very robust • Good and predictable read / write performance • The easiest to set up and administrate • Advanced features (MapReduce, triggers, 2i, CRDTs…) • Riak Search • Multi-Datacenter Replication
  20. CLUSTER • Commodity hardware • All nodes serve data • Data replication • Gossip between nodes • No master • Distributed system • Ring of servers
  21. RIAK: ADVANCED FEATURES • MapReduce • Secondary indexes (2i) • Riak Search • Multi-Datacenter Replication
  22. BACKEND: BITCASK • Log-based storage backend • Append-only files (AOF) • Advanced expiration • Predictable performance (1 disk seek max) • Perfect for sequential data • SSD or spinning disks?
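
For reference, the Bitcask features above map to a short app.config section. A minimal sketch, assuming the 8-day retention computed on the next slide (the path and values are illustrative, not the production config):

    {bitcask, [
        {data_root, "/var/lib/riak/bitcask"},
        %% automatic expiration: 8 days = 8 * 24 * 3600 seconds
        {expiry_secs, 691200}
    ]}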
  23. DISK SPACE NEEDED • 8 days of retention • 100 GB per hour • Replication factor 3 • 100 GB × 24 h × 8 days × 3 = 57,600 GB • Need ~60 TB
  24. HARDWARE • 2 clusters • 12 nodes, then 16, then 27 • 12 CPU cores (Xeon 2.5 GHz) • 192 GB RAM • Network: 1 Gbit/s • 8 TB (RAID 6) per node • Cluster total space: 128 TB
  25. [Diagram: same pipeline as before] • 1 blob per EPOCH / DC / CELL / TYPE / SUBTYPE • 500 KB max chunks
  26. DATA • Bucket name: "data" • Key: "12345:1:cell0:WEB:app:chunk0" • Value: list of events (HashMaps), serialized & compressed • 200 keys per second
  27. METADATA • Bucket name: "metadata" • Key: <epoch>-<dc>, e.g. "1428415043-2" • Value: the list of data keys, stored as pipe-separated values (PSV):
      [ "1428415043:1:cell0:WEB:app:chunk0",
        "1428415043:1:cell0:WEB:app:chunk1",
        …
        "1428415043:4:cell0:EMK::chunk3" ]
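
A small sketch of how such keys can be assembled in Perl (the helper name is invented; the layout follows slides 26-27):

    use strict;
    use warnings;

    # Hypothetical helper following <epoch>:<dc>:<cell>:<type>:<subtype>:chunk<n>
    sub data_key {
        my ($epoch, $dc, $cell, $type, $subtype, $chunk) = @_;
        return join ':', $epoch, $dc, $cell, $type, $subtype, "chunk$chunk";
    }

    my @data_keys = (
        data_key(1428415043, 1, 'cell0', 'WEB', 'app', 0),
        data_key(1428415043, 1, 'cell0', 'WEB', 'app', 1),
    );

    my $metadata_key   = '1428415043-1';        # <epoch>-<dc>
    my $metadata_value = join '|', @data_keys;  # the PSV stored as the value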
  28. PUSH DATA IN • Every second: • Push data values to Riak, in parallel • Wait for success • Push metadata (see the sketch after the next two code slides)
  29. JAVA
      Bucket DataBucket = riakClient.fetchBucket("data").execute();
      DataBucket.store("12345:1:cell0:WEB:app:chunk0", Data1).execute();
      DataBucket.store("12345:1:cell0:WEB:app:chunk1", Data2).execute();
      DataBucket.store("12345:1:cell0:WEB:app:chunk2", Data3).execute();
      Bucket MetaDataBucket = riakClient.fetchBucket("metadata").execute();
      MetaDataBucket.store("12345-1", metaData).execute();
      riakClient.shutdown();
  30. Perl
      my $client = Riak::Client->new(…);
      $client->put(data => '12345:1:cell0:WEB:app:chunk0', $data1);
      $client->put(data => '12345:1:cell0:WEB:app:chunk1', $data2);
      $client->put(data => '12345:1:cell0:WEB:app:chunk2', $data3);
      $client->put(metadata => '12345-1', $metadata, 'text/plain');
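
Slide 28's "push in parallel, wait for success, then push metadata" step could look like the following sketch (Parallel::ForkManager is my choice for brevity, not necessarily what the authors used; host, port and payloads are placeholders):

    use strict;
    use warnings;
    use Riak::Client;
    use Parallel::ForkManager;

    # Serialized + compressed chunk payloads, produced upstream (placeholders)
    my %chunks = (
        '12345:1:cell0:WEB:app:chunk0' => 'blob0',
        '12345:1:cell0:WEB:app:chunk1' => 'blob1',
        '12345:1:cell0:WEB:app:chunk2' => 'blob2',
    );

    my $pm = Parallel::ForkManager->new(8);    # up to 8 parallel writers
    for my $key (keys %chunks) {
        $pm->start and next;                   # fork one child per key
        my $riak = Riak::Client->new(host => 'riak.example', port => 8087);
        $riak->put(data => $key, $chunks{$key});
        $pm->finish;                           # child exits after the PUT
    }
    $pm->wait_all_children;                    # "wait for success"

    # Only once all data chunks are in do we publish the metadata key
    my $riak = Riak::Client->new(host => 'riak.example', port => 8087);
    $riak->put(metadata => '12345-1', join('|', sort keys %chunks), 'text/plain');

Publishing metadata last matters: readers discover data keys only through the metadata, so a half-written second is never visible.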
  31. READ ONE SECOND • For one second (a given epoch) • Request metadata for <epoch>-<dc> • Parse the value • Filter out unwanted types / subtypes • Fetch the keys from the "data" bucket
  32. Perl
      my $client = Riak::Client->new(…);
      my @array = split /\|/, $client->get(metadata => '1428415043-1');
      my @filtered_array = grep { /WEB/ } @array;
      $client->get(data => $_) foreach @filtered_array;
  33. READ AN INTERVAL • For an interval epoch1 -> epoch2 • Generate the list of epochs • Fetch in parallel • Riak excels at handling a huge number of requests per second
  34. EXAMPLES • Streaming => Graphite (every sec) • Streaming => Anomaly Detector (last 2 min) • Streaming => Experiment analysis (last day) • Every minute => Hadoop • Manual requests => test, debug, investigate • Batch fetch => ad hoc analysis • => Huge numbers of read requests
  35. [Diagram: the events storage fans out at ~50 MB/s each to the Graphite cluster, the Anomaly Detector, the experiment cluster, the Hadoop cluster, MySQL analysis and manual requests]
  36. THIS IS REALTIME • 1 second of data • Stored in < 1 sec • Available after < 1 sec • Issue: network saturation
  37. THE IDEA • Instead of • Fetch data • Crunch data (e.g. average) • Produce a small result • Do • Bring the code to the data • Crunch data on Riak • Fetch the result
  38. WHAT TAKES TIME • Takes a lot of time • Fetching data out: network issue • Decompressing: CPU time issue • Takes almost no time • Crunching data
  39. MAPREDUCE • Input: epoch-dc • Map 1: metadata keys => data keys • Map 2: data crunching • Reduce: aggregate • Realtime: OK • Network usage: OK • CPU time: NOT OK
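
For reference, such a job can be submitted to Riak's HTTP /mapred endpoint. A sketch (the JavaScript phases are stand-ins for the real crunching, which is exactly where the CPU cost shows up):

    use strict;
    use warnings;
    use HTTP::Tiny;
    use JSON::PP qw(encode_json);

    # Map 1: turn the metadata PSV into ["data", key] inputs for the next phase
    my $map1 = q{
        function (v) {
            return v.values[0].data.split('|').map(function (k) { return ['data', k]; });
        }
    };

    # Map 2: stand-in cruncher that just counts one chunk per object;
    # real code would deserialize and aggregate the events
    my $map2 = q{
        function (v) { return [1]; }
    };

    my $job = {
        inputs => [ [ 'metadata', '1428415043-1' ] ],
        query  => [
            { map    => { language => 'javascript', source => $map1 } },
            { map    => { language => 'javascript', source => $map2 } },
            { reduce => { language => 'javascript', name   => 'Riak.reduceSum' } },
        ],
    };

    my $res = HTTP::Tiny->new->post(
        'http://riak.example:8098/mapred',
        { headers => { 'Content-Type' => 'application/json' },
          content => encode_json($job) },
    );
    print $res->{content};   # a small aggregate instead of the raw chunks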
  40. HOOKS • Every time metadata is written • Post-commit hook triggered • Riak executes a callback we provided • Crunch data on the nodes
  41. [Diagram: on each node host, the Riak service's post-commit hook passes the key over a socket to the local REST service, which decompresses the data, processes all tasks, and sends the result back to Riak for storage]
  42. HOOK CODE
      metadata_stored_hook(RiakObject) ->
          Key = riak_object:key(RiakObject),
          _Bucket = riak_object:bucket(RiakObject),   %% not needed here
          [ Epoch, _DC ] = binary:split(Key, <<"-">>),
          MetaData = riak_object:get_value(RiakObject),
          DataKeys = binary:split(MetaData, <<"|">>, [ global ]),
          %% the slide leaves Hostname implicit; the local hostname works here
          {ok, HostnameStr} = inet:gethostname(),
          send_to_REST(Epoch, list_to_binary(HostnameStr), DataKeys),
          ok.
  43. send_to_REST(Epoch, Hostname, DataKeys) ->
          Method = post,
          URL = "http://" ++ binary_to_list(Hostname)
                ++ ":5000?epoch=" ++ binary_to_list(Epoch),
          HTTPOptions = [ { timeout, 4000 } ],
          Options = [ { body_format, string },
                      { sync, false },
                      { receiver, fun(_ReplyInfo) -> ok end } ],
          Body = iolist_to_binary(mochijson2:encode(DataKeys)),
          httpc:request(Method, {URL, [], "application/json", Body},
                        HTTPOptions, Options),
          ok.
  44. REST SERVICE • In Perl, using PSGI (WSGI-like), Starman, preforks • Allows writing data crunchers in Perl • Also supports loading code on demand • Monitored list of operational companions
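
A minimal sketch of such a PSGI endpoint (the crunching itself is elided; the epoch parameter and JSON body match the hook code above):

    # app.psgi - run with e.g.: starman --workers 10 --port 5000 app.psgi
    use strict;
    use warnings;
    use Plack::Request;
    use JSON::PP qw(decode_json);

    my $app = sub {
        my $req      = Plack::Request->new(shift);
        my $epoch    = $req->query_parameters->get('epoch');
        my $datakeys = decode_json($req->content);   # data keys sent by the hook

        # ... fetch those keys from the local node, decompress once,
        #     run all registered crunchers, store the results ...

        return [ 200, [ 'Content-Type' => 'text/plain' ], [ "ok $epoch" ] ];
    };
    $app;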
  45. REST SERVICE: SCALABLE • 1 second = 200 keys • 16 nodes, 10 usable CPUs each (2 of the 12 cores go to Riak) • 1 key must be crunched in 16 × 10 / 200 = 0.8 sec of CPU time • => we have time
  46. OPTIMIZATION [Diagram: each key is PUT directly to the primary node for that key; every node runs Riak (R) plus a Companion (C)]
  47. ADVANTAGES • CPU usage and execution time can be capped • Data is local to processing • Data is decompressed only once • Data crunching is done all at once • The two systems are loosely coupled • Crunchers can be written in any language
  48. • PUT - bad case • n_val = 3 • inside usage = 3 x outside usage
  49. • PUT - good case • n_val = 3 • inside usage = 2 x outside usage
  50. • Network usage (PUT and GET): with 16 nodes, a randomly chosen coordinator is a primary for only 3 keys out of 16, so the expected internal copies per request are 3 x 13/16 + 2 x 3/16 = 2.81 • plus gossip • inside network > 3 x outside network
  51. • Usually it's not a problem • But in our case: • big values, constant PUTs, lots of GETs • sadly, only 1 Gbit/s • => network bandwidth issue
  52. THE BANDWIDTH SOLUTIONS 1. Optimize GET for network usage, not speed 2. Don't choose a node at random
  53. • GET - bad case • n_val = 1 • inside usage = 1 x outside
  54. • GET - good case • n_val = 1 • inside usage = 0 x outside
  55. WARNING • Possible only because data is read-only • Data has an internal checksum • No conflict possible • Corruption is detected
  56. THE BANDWIDTH SOLUTIONS 1. Optimize GET for network usage, not speed 2. Don't choose a node at random
  57. • bucket = "metadata" • key = "12345"
      Hash = hashFunction(bucket + key)
      RingStatus = getRingStatus()
      PrimaryNodes = Fun(Hash, RingStatus)
  58. WARNING • Possible only if • the node list is monitored • in case of a failed node, default to random • data is requested in a uniform way
  59. CONCLUSION • We used only Riak Open Source • No training, self-taught, small team • Riak is a great solution • Robust, fast, scalable, easy • Very flexible and hackable • Helps us continue scaling