Building a Data Platform with Ruby glue

Augusto Becciu

November 28, 2013

Transcript

  1. Who am I? • @abecciu {github, twitter} • Startups guy • Like to write and run code • Cur: CTO at • Prev: Dev / Ops Engineer at
  2. Data Applications • Fraud detection • Monitoring and alerting • Metric dashboards • Recommendation systems • Bidding systems • Audit trails • Marketing automation • A/B testing • User behavior analytics
  3. Two groups of applications • Real-time processing (e.g. fraud detection, bidding systems, monitoring and alerting). • Batch processing (e.g. recommendation systems, user behavior analytics).
  4. Design Principles • Make it super easy for developers to produce and consume data. • Optimize for low operations and infrastructure costs. • Make the best effort to not lose data, but don’t worry about guarantees. • Be programming language agnostic.
  5. • Open source tool to collect events and logs. • Pluggable inputs and outputs. • Code and plugins written in Ruby. • Provides fault tolerance and simple HA facilities. • JSON everywhere.
  6. Data Storage • Data stream is partitioned by day and stored in a hierarchical way like: /year/month/day/events.gz • Events are saved in newline-delimited text files. • Every file is gzipped. • Events are buffered on the local filesystem and uploaded to S3 periodically.
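The storage layout above can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions, not the platform's actual code: the `time` field name, the use of UTC, the zero-padded month and day, and the helper names `partition_key`, `group_by_partition` and `gzip_events` are all hypothetical. The upload to S3 is left out.

```ruby
require "json"
require "time"
require "zlib"

# Day-partitioned key for a timestamp, following /year/month/day/events.gz.
# Zero-padding and UTC are assumptions of this sketch.
def partition_key(time)
  time.utc.strftime("/%Y/%m/%d/events.gz")
end

# Group events by the day partition they belong to.
# The "time" field name is hypothetical.
def group_by_partition(events)
  events.group_by { |event| partition_key(Time.parse(event.fetch("time"))) }
end

# Serialize events as newline-delimited JSON, then gzip the whole payload.
def gzip_events(events)
  payload = events.map { |event| JSON.generate(event) }.join("\n") + "\n"
  Zlib.gzip(payload)
end
```

Each gzipped payload would then be written to the local buffer under its key and uploaded periodically.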
  7. Data Distribution • Publish-subscribe pattern. • Events are published to a topic exchange. • Consumers can subscribe to topics based on event keys, e.g. ui.webapp.# • SSL-based authentication and encryption over persistent TCP connections allow consumers to be anywhere on the Internet.
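To make the subscription pattern concrete, here is a small pure-Ruby sketch of how a binding such as ui.webapp.# selects event keys. It assumes the exchange follows standard AMQP topic semantics, where words are separated by dots, `*` stands for exactly one word and `#` for zero or more words. In practice the broker does this matching; `topic_match?` and `match_words` are hypothetical names used only for illustration.

```ruby
# Returns true when routing_key is selected by the binding pattern.
def topic_match?(pattern, routing_key)
  match_words(pattern.split("."), routing_key.split("."))
end

def match_words(pattern, words)
  return words.empty? if pattern.empty?

  head, *rest = pattern
  case head
  when "#"
    # '#' may absorb zero or more words: try every possible split point.
    (0..words.length).any? { |n| match_words(rest, words.drop(n)) }
  when "*"
    # '*' must consume exactly one word.
    !words.empty? && match_words(rest, words.drop(1))
  else
    # A literal word must match the next word of the key.
    words.first == head && match_words(rest, words.drop(1))
  end
end
```

With these rules, `topic_match?("ui.webapp.#", "ui.webapp.click.button")` is true, while a key starting with errors.webapp is not selected by that binding.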
  8. Tools to produce data • Libraries were created to abstract the process of creating and publishing an event. • Multiple languages are supported, including Ruby, JavaScript and Objective-C. For example, in a Rails app it’s as simple as: (Namespace and context data are added automagically...)
  9. Tools to consume data • Any application can use any standard AMQP or S3 client. • We also created a couple of command-line tools for easy prototyping and experimentation.
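A prototype on the receiving end of such a command-line tool could look like the sketch below. It assumes the tool writes one JSON event per line to standard output and that each event carries a `key` field. Both are assumptions, and `count_by_key` is a hypothetical helper rather than part of the platform.

```ruby
require "json"

# Count events per key from a stream of newline-delimited JSON.
# The "key" field name is an assumption made for this sketch.
def count_by_key(io)
  counts = Hash.new(0)
  io.each_line do |line|
    line = line.strip
    next if line.empty?

    begin
      event = JSON.parse(line)
    rescue JSON::ParserError
      next # skip lines that are not valid JSON
    end
    next unless event.is_a?(Hash)

    counts[event["key"]] += 1
  end
  counts
end

# In a script fed by a pipe, the stream would be $stdin:
#   count_by_key($stdin).each { |key, n| puts "#{key}\t#{n}" }
```

Taking the stream as an argument keeps the logic easy to try against a file or a string buffer before pointing it at live data.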
  10. Tools to consume data • tail_f gets you events from the firehose:

    $ tail_f ui.webapp.#, errors.webapp.# | ruby do_staff.rb

    do_staff.rb:
  11. Tools to consume data • s3_cat downloads cold data from S3:

    $ s3_cat 2013-11-26..2013-11-28 | ruby do_staff.rb
    $ s3_cat 2013-10, 2013-11-01..2013-11-15 > data
    $ s3_cat 2013 | mongoimport -d mydb -c mycollection
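The date arguments in these commands (a year, a month, or a day range) map naturally onto the /year/month/day layout. The sketch below shows one way such arguments might be expanded into key prefixes. It is not the real s3_cat implementation: `expand_spec`, `expand_args` and the prefix format are hypothetical, and zero-padded path components are assumed.

```ruby
require "date"

# Expand one date spec into key prefixes under the /year/month/day layout.
def expand_spec(spec)
  case spec
  when /\A(\d{4}-\d{2}-\d{2})\.\.(\d{4}-\d{2}-\d{2})\z/
    # Day range: one prefix per day, both ends inclusive.
    first = Date.iso8601($1)
    last = Date.iso8601($2)
    (first..last).map { |day| day.strftime("/%Y/%m/%d/") }
  when /\A\d{4}-\d{2}-\d{2}\z/
    [Date.iso8601(spec).strftime("/%Y/%m/%d/")]
  when /\A(\d{4})-(\d{2})\z/
    ["/#{$1}/#{$2}/"] # whole month
  when /\A\d{4}\z/
    ["/#{spec}/"] # whole year
  else
    raise ArgumentError, "unrecognized date spec: #{spec.inspect}"
  end
end

# Expand command-line arguments, tolerating the trailing commas
# used to separate specs in the examples above.
def expand_args(args)
  args.flat_map { |arg| expand_spec(arg.chomp(",")) }
end
```

A tool built this way would list the objects under each prefix, download them, and write the gunzipped events to standard output.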