Using Riak for events storage and analysis at Booking.com

Damien Krotkine (dams)
June 12, 2015

At Booking.com, we have a constant flow of events coming from various applications and internal subsystems. This critical data needs to be stored for real-time, medium- and long-term analysis. Events are schema-less, which makes it difficult to use standard analysis tools. This presentation explains how we built a real-time storage and analysis solution based on Riak. The talk covers: data aggregation and serialization, Riak configuration, solutions for lowering network usage, and finally, how Riak's advanced features are used to perform real-time data crunching on the cluster nodes.

Transcript

  1. Using Riak for Events storage and analysis at
    Booking.com
    Damien Krotkine

  2. Damien Krotkine
    • Software Engineer at Booking.com
    • github.com/dams
    • @damsieboy
    • dkrotkine

  3. (image-only slide)

  4. KEY FIGURES
    • 600,000 hotels
    • 212 countries
    • 800,000 room nights every 24 hours
    • 43 million+ guest reviews
    • 155+ offices worldwide
    • 8,600 people
    • not a small website…

  5. INTRODUCTION

  6. (image-only slide)

  7. (image-only slide)

  8. (image-only slide)

  9. (image-only slide)

  10. (diagram: www, API, and mobi applications)

  11. (diagram: www, API, and mobi form the frontend)

  12. (diagram: frontend and backend both send events to events storage; events carry info about subsystems status)

  13. (diagram: the backend includes web/mobi/api, databases, caches, load balancers, the availability cluster, email, etc.)

  14. WHAT IS AN EVENT?

  15. EVENT STRUCTURE
    • Provides info about subsystems
    • Data: a deep HashMap
    • Timestamp
    • Type + subtype
    • The rest: specific, schema-less data

  16. { timestamp => 12345,
        type      => 'WEB',
        subtype   => 'app',
        dc        => 1,
        action    => { is_normal_user => 1,
                       pageview_id    => '188a362744c301c2',
                       # ...
                     },
        tuning    => { the_request => 'GET /display/...',
                       bytes_body  => 35,
                       wallclock   => 111,
                       nr_warnings => 0,
                       # ...
                     },
        # ...
      }

  17. { type      => 'FAV',
        subtype   => 'fav',
        timestamp => 1401262979,
        dc        => 1,
        tuning    => {
          flatav => {
            cluster       => '205',
            sum_latencies => 21,
            role          => 'fav',
            num_queries   => 7,
          },
        },
      }

  18. EVENTS FLOW PROPERTIES
    • Read-only
    • Schema-less
    • Continuous, sequential, timed
    • 15K events per second
    • 1.25 billion events per day
    • Peak at 70 MB/s, minimum 25 MB/s
    • 100 GB per hour

  19. USAGE

  20. USAGE
    1. GRAPHS
    2. DECISION MAKING
    3. SHORT TERM ANALYSIS
    4. A/B TESTING

  21. GRAPHS
    • Graph in real time (a few seconds of lag)
    • Graph as many systems as possible
    • General platform health check

  22. GRAPHS (image)

  23. GRAPHS (image)

  24. DASHBOARDS (image)

  25. META GRAPHS (image)

  26. USAGE
    1. GRAPHS
    2. DECISION MAKING
    3. SHORT TERM ANALYSIS
    4. A/B TESTING

  27. DECISION MAKING
    • Strategic decisions (use facts)
    • Long term or short term
    • Technical and non-technical reporting

  28. USAGE
    1. GRAPHS
    2. DECISION MAKING
    3. SHORT TERM ANALYSIS
    4. A/B TESTING

  29. SHORT TERM ANALYSIS
    • From 10 seconds ago to a few hours ago
    • Code deployment checks and rollbacks
    • Anomaly detector

  30. USAGE
    1. GRAPHS
    2. DECISION MAKING
    3. SHORT TERM ANALYSIS
    4. A/B TESTING

  31. A/B TESTING
    • Our core philosophy: use facts
    • It means: do A/B testing
    • Concept of Experiments
    • Events provide data to compare
    • We need data from the last few days

  32. EVENT AGGREGATION

  33. EVENT AGGREGATION
    • Group events
    • Granularity we need: one second
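A minimal sketch of this grouping step (illustrative Python, not the production Perl pipeline; the `timestamp` and `type` fields follow the event examples earlier in the deck):

```python
from collections import defaultdict

def group_by_second(events):
    """Group a stream of event dicts into 1-second buckets keyed by epoch."""
    buckets = defaultdict(list)
    for event in events:
        # int() truncates to the epoch second the event belongs to
        buckets[int(event["timestamp"])].append(event)
    return dict(buckets)

events = [
    {"timestamp": 12345.1, "type": "WEB"},
    {"timestamp": 12345.9, "type": "FAV"},
    {"timestamp": 12346.2, "type": "WEB"},
]
grouped = group_by_second(events)
# two buckets: epoch 12345 holds two events, epoch 12346 holds one
```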

  34. SERIALIZATION
    • JSON didn't work for us (slow, big, lacks features)
    • Created Sereal in 2012
    • "Sereal, a new, binary data serialization format that provides high-performance, schema-less serialization"
    • https://github.com/Sereal/Sereal

  35. (diagram: a stream of events arriving at events storage)

  36. (diagram: events collected by a LOGGER process)

  37. (diagram: web and api servers emitting streams of events)

  38. (diagram: web, api, and dbs servers emitting streams of events)

  39. (diagram: the event stream cut into 1-second intervals)

  40. (diagram: events grouped into 1-second buckets)

  41. (diagram: 1-second event groups sent to events storage)

  42. (diagram: 1-second event groups reserialized + compressed, then sent to events storage)

  43. (diagram: multiple LOGGERs feeding events storage)

  44. STORAGE

  45. WHAT WE WANT
    • Storage security
    • Mass write performance
    • Mass read performance
    • Easy administration
    • Very scalable
    • => We reviewed a bunch of contenders

  46. WE CHOSE RIAK
    • Security: clustered, distributed, very robust
    • Good and predictable read/write performance
    • The easiest to set up and administrate
    • Advanced features (MapReduce, triggers, 2i, CRDTs, …)
    • Riak Search
    • Multi-datacenter replication

  47. (image-only slide)

  48. CLUSTER
    • Commodity hardware
    • All nodes serve data
    • Data replication
    • Gossip between nodes
    • No master
    • Distributed system
    (diagram: a ring of servers)

  49. (diagram: hash(key) determines a position on the ring)

  50. (image-only slide)

  51. KEY VALUE STORE
    • Namespaces: buckets
    • Values: opaque blobs or CRDTs

  52. RIAK: ADVANCED FEATURES
    • MapReduce
    • Secondary indexes (2i)
    • Riak Search
    • Multi DataCenter Replication

  53. MULTI-BACKEND FOR STORAGE
    • Bitcask
    • eLevelDB
    • Memory

  54. BACKEND: BITCASK
    • Log-based storage backend
    • Append-only files (AOF)
    • Advanced expiration
    • Predictable performance (1 disk seek max)
    • Perfect for sequential data
    • SSDs or spinning disks?

  55. CLUSTER CONFIGURATION

  56. DISK SPACE NEEDED
    • 8 days of retention
    • 100 GB per hour
    • Replication factor 3
    • 100 GB × 24 h × 8 days × 3 = 57.6 TB
    • Need ~60 TB

  57. HARDWARE
    • 2 clusters
    • 12 nodes, then 16, then 27
    • 12 CPU cores (Xeon 2.5 GHz)
    • 192 GB RAM
    • 1 Gbit/s network
    • 8 TB (RAID 6) per node
    • Cluster total space: 128 TB

  58. DATA DESIGN

  59. (diagram: 1-second event groups sent to events storage)
    • 1 blob per EPOCH / DC / CELL / TYPE / SUBTYPE
    • 500 KB max chunks

  60. DATA
    • Bucket name: "data"
    • Key: "12345:1:cell0:WEB:app:chunk0"
    • Value: a list of events (HashMaps), serialized & compressed
    • 200 keys per second

  61. METADATA
    • Bucket name: "metadata"
    • Key: "1428415043-2"
    • Value: the list of data keys, as a pipe-separated value (PSV):
      [ "1428415043:1:cell0:WEB:app:chunk0",
        "1428415043:1:cell0:WEB:app:chunk1",
        …
        "1428415043:4:cell0:EMK::chunk3" ]
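The data/metadata key scheme above can be sketched as follows (illustrative Python; the helper names are ours, not Booking.com's code):

```python
def data_key(epoch, dc, cell, etype, subtype, chunk):
    """Build a 'data' bucket key: EPOCH:DC:CELL:TYPE:SUBTYPE:chunkN."""
    return f"{epoch}:{dc}:{cell}:{etype}:{subtype}:chunk{chunk}"

def metadata_value(data_keys):
    """The 'metadata' value is the list of data keys, pipe-separated (PSV)."""
    return "|".join(data_keys)

keys = [data_key(1428415043, 1, "cell0", "WEB", "app", i) for i in range(2)]
meta = metadata_value(keys)
# meta == "1428415043:1:cell0:WEB:app:chunk0|1428415043:1:cell0:WEB:app:chunk1"
```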

  62. WRITE DATA

  63. PUSH DATA IN
    • Every second:
    • Push data values to Riak, in parallel
    • Wait for success
    • Push metadata
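The per-second write protocol can be sketched in Python; `store` and `put` below are an in-memory stand-in for a real Riak client (the real code PUTs over the network), and the key layout follows the slides:

```python
from concurrent.futures import ThreadPoolExecutor

# In-memory stand-in for a Riak client; real code would PUT to the cluster.
store = {"data": {}, "metadata": {}}

def put(bucket, key, value):
    store[bucket][key] = value

def push_one_second(epoch, dc, chunks):
    """Push all data chunks in parallel, wait for success, then write metadata."""
    keys = [f"{epoch}:{dc}:cell0:WEB:app:chunk{i}" for i, _ in enumerate(chunks)]
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(put, "data", k, c) for k, c in zip(keys, chunks)]
        for f in futures:
            f.result()  # wait for success; raises if any put failed
    # metadata is written last, so readers never see keys whose data is missing
    put("metadata", f"{epoch}-{dc}", "|".join(keys))

push_one_second(12345, 1, [b"blob0", b"blob1", b"blob2"])
```

Writing metadata only after every data PUT succeeded is what makes a second of data appear atomically to readers.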

  64. JAVA
    Bucket DataBucket = riakClient.fetchBucket("data").execute();
    DataBucket.store("12345:1:cell0:WEB:app:chunk0", Data1).execute();
    DataBucket.store("12345:1:cell0:WEB:app:chunk1", Data2).execute();
    DataBucket.store("12345:1:cell0:WEB:app:chunk2", Data3).execute();
    Bucket MetaDataBucket = riakClient.fetchBucket("metadata").execute();
    MetaDataBucket.store("12345-1", metaData).execute();
    riakClient.shutdown();

  65. Perl
    my $client = Riak::Client->new(…);
    $client->put(data => '12345:1:cell0:WEB:app:chunk0', $data1);
    $client->put(data => '12345:1:cell0:WEB:app:chunk1', $data2);
    $client->put(data => '12345:1:cell0:WEB:app:chunk2', $data3);
    $client->put(metadata => '12345-1', $metadata, 'text/plain' );

  66. READ DATA

  67. READ ONE SECOND
    • For one second (a given epoch)
    • Request metadata for "<epoch>-<DC>"
    • Parse the value
    • Filter out unwanted types / subtypes
    • Fetch the keys from the "data" bucket

  68. Perl
    my $client = Riak::Client->new(…);
    my @array = split /\|/, $client->get(metadata => '1428415043-1');
    my @filtered_array = grep { /WEB/ } @array;
    $client->get(data => $_) foreach @filtered_array;

  69. READ AN INTERVAL
    • For an interval epoch1 -> epoch2
    • Generate the list of epochs
    • Fetch them in parallel
    • Riak excels at handling a huge number of requests per second
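The steps above can be sketched as follows (illustrative Python; `fetch_second` is a hypothetical stand-in for the per-second metadata/data fetch shown on the previous slides):

```python
from concurrent.futures import ThreadPoolExecutor

def epochs_between(epoch1, epoch2):
    """Every second in the closed interval [epoch1, epoch2]."""
    return list(range(epoch1, epoch2 + 1))

def fetch_second(epoch):
    # stand-in: real code fetches metadata, filters keys, fetches data values
    return f"metadata for {epoch}-1"

def read_interval(epoch1, epoch2):
    """Fetch all seconds of an interval in parallel."""
    with ThreadPoolExecutor(max_workers=32) as pool:
        return list(pool.map(fetch_second, epochs_between(epoch1, epoch2)))

results = read_interval(1428415043, 1428415047)  # 5 seconds of data
```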

  70. RIAK CLUSTER HEALTH

  71. CPU USAGE (graph)

  72. DISK IOPS (graph)

  73. DISK IO UTILIZATION % (graph)

  74. DISK SPACE RECLAIMED (graph over one day)

  75. REAL TIME PROCESSING OUTSIDE OF RIAK

  76. EXAMPLES
    • Streaming => Graphite (every second)
    • Streaming => Anomaly detector (last 2 min)
    • Streaming => Experiment analysis (last day)
    • Every minute => Hadoop
    • Manual requests => test, debug, investigate
    • Batch fetch => ad hoc analysis
    • => Huge number of read requests

  77. (diagram: events storage fans out, at roughly 50 MB/s per consumer, to the graphite cluster, the anomaly detector, the experiment cluster, the hadoop cluster, mysql analysis, and manual requests)

  78. THIS IS REALTIME
    • 1 second of data
    • Stored in < 1 sec
    • Available after < 1 sec
    • Issue: network saturation

  79. REAL TIME PROCESSING INSIDE RIAK

  80. THE IDEA
    • Instead of:
      • fetch data,
      • crunch data (e.g. average),
      • produce a small result
    • Do:
      • bring the code to the data
      • crunch data on Riak
      • fetch the result

  81. WHAT TAKES TIME
    • Takes a lot of time:
      • fetching data out: network issue
      • decompressing: CPU time issue
    • Takes almost no time:
      • crunching data

  82. MAPREDUCE
    • Input: epoch-dc
    • Map1: metadata keys => data keys
    • Map2: data crunching
    • Reduce: aggregate
    • Realtime: OK
    • network usage: OK
    • CPU time: NOT OK

  83. HOOKS
    • Every time metadata is written
    • A post-commit hook is triggered
    • Riak executes a callback we provided
    • Crunch data on the nodes

  84. (image-only slide)

  85. (diagram: on each node host, the Riak post-commit hook sends keys over a socket to a local REST service, which decompresses the values, processes all tasks, and sends the result back to Riak for storage)

  86. HOOK CODE
    metadata_stored_hook(RiakObject) ->
        Key = riak_object:key(RiakObject),
        [ Epoch, _DC ] = binary:split(Key, <<"-">>),
        MetaData = riak_object:get_value(RiakObject),
        DataKeys = binary:split(MetaData, <<"|">>, [ global ]),
        %% crunch locally: target the REST companion on this very node
        {ok, HostnameStr} = inet:gethostname(),
        Hostname = list_to_binary(HostnameStr),
        send_to_REST(Epoch, Hostname, DataKeys),
        ok.

  87. send_to_REST(Epoch, Hostname, DataKeys) ->
        Method = post,
        URL = "http://" ++ binary_to_list(Hostname)
              ++ ":5000?epoch=" ++ binary_to_list(Epoch),
        HTTPOptions = [ { timeout, 4000 } ],
        Options = [ { body_format, string },
                    { sync, false },
                    { receiver, fun(_ReplyInfo) -> ok end } ],
        Body = iolist_to_binary(mochijson2:encode(DataKeys)),
        %% fire-and-forget asynchronous HTTP POST
        httpc:request(Method, {URL, [], "application/json", Body},
                      HTTPOptions, Options),
        ok.

  88. REST SERVICE
    • In Perl, using PSGI (WSGI-like), Starman, preforked workers
    • Allows writing data crunchers in Perl
    • Also supports loading code on demand
    • Monitored, with a list of operational companions

  89. REST SERVICE: SCALABLE
    • 1 second = 200 keys
    • 16 nodes, 10 CPU cores each (2 of 12 reserved for Riak)
    • 1 key must be crunched in 16 × 10 / 200 = 0.8 sec
    • => we have time

  90. OPTIMIZATION
    (diagram: each PUT is sent to the primary node for that key, where the companion runs; R: Riak, C: Companion)

  91. ADVANTAGES
    • CPU usage and execution time can be capped
    • Data is local to processing
    • Data is decompressed only once
    • Data crunching is done all at once
    • The two systems are loosely coupled
    • Crunchers can be written in any language

  92. DISADVANTAGES
    • Only for incoming data (streaming), not old data
    • Can’t easily use cross-second data

  93. THE BANDWIDTH PROBLEM

  94. • PUT - bad case
    • n_val = 3
    • inside usage = 3 × outside usage

  95. • PUT - good case
    • n_val = 3
    • inside usage = 2 × outside usage

  96. • GET - bad case
    • inside usage = 3 × outside usage

  97. • GET - good case
    • inside usage = 2 × outside usage

  98. • Network usage (PUT and GET):
    • 3 × 13/16 + 2 × 3/16 = 2.81
    • plus gossip
    • inside network > 3 × outside network
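Where the 2.81 figure comes from: with 16 nodes and n_val = 3, a randomly chosen coordinator already holds one of the three replicas with probability 3/16, in which case only 2 copies cross the internal network; otherwise all 3 do. A quick back-of-envelope check of the slide's numbers:

```python
nodes = 16
n_val = 3

# probability the randomly chosen coordinator is one of the 3 replica holders
p_local = n_val / nodes   # 3/16: only n_val - 1 copies cross the network
p_remote = 1 - p_local    # 13/16: all n_val copies cross the network

ratio = 3 * p_remote + 2 * p_local
print(ratio)  # 2.8125, i.e. ~2.81x external traffic, before counting gossip
```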

  99. • Usually it's not a problem
    • But in our case:
      • big values, constant PUTs, lots of GETs
      • sadly, only 1 Gbit/s
    • => network bandwidth issue

  100. THE BANDWIDTH SOLUTIONS

  101. THE BANDWIDTH SOLUTIONS
    1. Optimize GET for network usage, not speed
    2. Don’t choose a node at random

  102. • GET - bad case
    • n_val = 1
    • inside usage = 1 × outside usage

  103. • GET - good case
    • n_val = 1
    • inside usage = 0 × outside usage

  104. WARNING
    • Possible only because data is read-only
    • Data has internal checksum
    • No conflict possible
    • Corruption detected

  105. RESULT
    • Practical network usage cut in half!

  106. THE BANDWIDTH SOLUTIONS
    1. Optimize GET for network usage, not speed
    2. Don’t choose a node at random

  107. • bucket = "metadata"
    • key = "12345"

  108. • bucket = "metadata"
    • key = "12345"
    Hash = hashFunction(bucket + key)
    RingStatus = getRingStatus()
    PrimaryNodes = Fun(Hash, RingStatus)
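The node selection above can be sketched as simplified consistent hashing (illustrative Python; the real Riak ring uses SHA-1 over a 160-bit space with vnode claim logic, and `primary_nodes` here is our own simplification):

```python
import hashlib

def hash_key(bucket, key):
    # Riak hashes the bucket/key pair with SHA-1 onto a 160-bit ring
    return int.from_bytes(hashlib.sha1((bucket + key).encode()).digest(), "big")

def primary_nodes(bucket, key, ring, n_val=3):
    """Map the hash to a ring partition, then take n_val successive owners."""
    partitions = len(ring)
    start = hash_key(bucket, key) * partitions // 2**160
    return [ring[(start + i) % partitions] for i in range(n_val)]

ring = [f"node{i % 4}" for i in range(16)]  # 16 partitions over 4 nodes
print(primary_nodes("metadata", "12345", ring))
```

Knowing the ring status, a client can compute this itself and send the GET straight to a primary node instead of a random one.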

  109. (diagram: the client computes hashFunction() and getRingStatus() itself to find the primary nodes)

  110. (diagram: the request is then sent directly to a primary node)

  111. WARNING
    • Possible only if:
      • the node list is monitored
      • in case of a failed node, we default to random
      • data is requested in a uniform way

  112. RESULT
    • Network usage reduced even more!
    • Especially for GETs

  113. NETWORK USAGE (graph)

  114. CONCLUSION

  115. CONCLUSION
    • We used only Riak Open Source
    • No training, self-taught, small team
    • Riak is a great solution
    • Robust, fast, scalable, easy
    • Very flexible and hackable
    • Helps us continue scaling

  116. Q&A
    @damsieboy