Building a data platform
with Ruby glue
Augusto Becciu
Slide 2
Who am I?
• @abecciu {github, twitter}
• Startups guy
• Like to write and run code
• Cur: CTO at
• Prev: Dev / Ops Engineer at
Slide 3
Data is the DNA of a
Company
Slide 4
What is a data
platform?
Slide 5
Data Applications
Fraud detection
Monitoring and alerting
Metric dashboards
Recommendation systems
Bidding systems
Audit trails
Marketing automation
A/B Testing
User behavior analytics
Slide 7
Two groups of
applications
• Real-time processing (e.g. fraud detection,
bidding systems, monitoring and alerting).
• Batch processing (e.g. recommendation
systems, user behavior analytics).
Slide 9
Design Principles
• Make it super easy for developers to
produce and consume data.
• Optimize for low operations and
infrastructure costs.
• Make a best effort not to lose data, but
don’t worry about guarantees.
• Be programming language agnostic.
Slide 10
Data Acquisition
Slide 11
Data Standardization
The unit of data is an event represented
as a JSON document.
Slide 12
Data Standardization
Anatomy of an event:
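As an illustration only, here is a hypothetical event in Ruby. The field names (key, timestamp, context, data) and all values are assumptions chosen to match the ideas in this deck (dot-separated event keys, context data, JSON documents); they are not the platform's actual schema.

```ruby
require "json"

# Hypothetical event; every field name and value below is an assumption.
SAMPLE_EVENT = {
  "key"       => "ui.webapp.signup_clicked",       # dot-separated event key
  "timestamp" => "2013-11-26T14:03:12Z",           # when the event was produced
  "context"   => { "host" => "web-1", "user_id" => 42 },
  "data"      => { "plan" => "free" }              # event-specific payload
}.freeze

# Encodes an event as a single-line JSON document.
def encode_event(event)
  JSON.generate(event)
end

puts encode_event(SAMPLE_EVENT)
```

JSON.generate emits no newlines, so each encoded event fits on one line of a text file.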
Slide 13
Data Consolidation
Slide 14
• Open source tool to collect events and
logs.
• Pluggable inputs and outputs.
• Code and plugins written in Ruby.
• Provides fault tolerance and simple HA
facilities.
• JSON everywhere.
Slide 17
Data Storage
• The data stream is partitioned by day and
stored hierarchically, like:
/year/month/day/events.gz
• Events are saved in newline-delimited text
files.
• Every file is gzipped.
• Events are buffered on the local filesystem and
uploaded to S3 periodically.
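The storage layout above can be sketched in a few lines of Ruby. This is a minimal sketch, not the platform's code: the helper names storage_key and gzip_events are hypothetical, zero-padded month and day are an assumption, and the S3 upload step is left out.

```ruby
require "json"
require "zlib"
require "date"

# Builds the day-partitioned key, e.g. "/2013/11/26/events.gz".
# Zero-padding of month and day is an assumption.
def storage_key(date)
  date.strftime("/%Y/%m/%d/events.gz")
end

# Serializes events as newline-delimited JSON and gzips the result.
def gzip_events(events)
  body = events.map { |event| JSON.generate(event) + "\n" }.join
  Zlib.gzip(body)
end

events = [{ "key" => "ui.webapp.click" }, { "key" => "errors.webapp.timeout" }]
puts storage_key(Date.new(2013, 11, 26))
puts Zlib.gunzip(gzip_events(events))
```

Because each event occupies one line, a consumer can decompress the file as a stream and process events one at a time.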
Slide 18
Data Distribution
• Publish-subscribe pattern.
• Events are published to a topic exchange.
• Consumers can subscribe to topics based
on event keys, e.g. ui.webapp.#
• SSL-based authentication and encryption
over persistent TCP connections allow
consumers to be anywhere on the Internet.
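Subscriptions like ui.webapp.# follow AMQP topic-exchange matching: keys are split on dots, * stands for exactly one word and # for zero or more words. A minimal Ruby sketch of that rule follows; the function names are illustrative, and in practice the broker performs this matching for you.

```ruby
# Returns true when routing_key matches an AMQP-style topic pattern.
def topic_match?(pattern, routing_key)
  match_words(pattern.split("."), routing_key.split("."))
end

# Recursive word-by-word matcher used by topic_match?.
def match_words(pattern, words)
  return words.empty? if pattern.empty?

  head, *rest = pattern
  case head
  when "#"
    # "#" may absorb zero words, or consume one word and try again.
    match_words(rest, words) || (!words.empty? && match_words(pattern, words.drop(1)))
  when "*"
    # "*" must consume exactly one word.
    !words.empty? && match_words(rest, words.drop(1))
  else
    # Literal words must match exactly.
    words.first == head && match_words(rest, words.drop(1))
  end
end

puts topic_match?("ui.webapp.#", "ui.webapp.signup.clicked")
puts topic_match?("ui.webapp.#", "errors.webapp.timeout")
```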
Slide 19
Tools to produce data
• Libraries were created to abstract the
process of creating and publishing an event.
• Multiple languages are supported, including Ruby,
JavaScript and Objective-C.
For example, in a Rails app it’s as simple as:
(Namespace and context data are added automagically...)
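A minimal sketch of how such a helper could work in Ruby. EventPublisher, FakeExchange and the event fields are hypothetical names, not the library's actual API. The exchange is assumed to respond to publish(payload, routing_key: ...), which is the shape the Bunny AMQP client uses; a stand-in exchange keeps the example runnable without a broker.

```ruby
require "json"
require "time"

# Hypothetical publisher that prefixes a namespace and attaches context.
class EventPublisher
  def initialize(exchange, namespace:, context: {})
    @exchange = exchange
    @namespace = namespace
    @context = context
  end

  # Builds the event, publishes it as JSON and returns the event hash.
  def publish(name, data = {})
    key = "#{@namespace}.#{name}"
    event = {
      "key"       => key,
      "timestamp" => Time.now.utc.iso8601,
      "context"   => @context,
      "data"      => data
    }
    @exchange.publish(JSON.generate(event), routing_key: key)
    event
  end
end

# Stand-in for a real AMQP exchange; it only records what was published.
class FakeExchange
  attr_reader :messages

  def initialize
    @messages = []
  end

  def publish(payload, routing_key:)
    @messages << [routing_key, payload]
  end
end

exchange = FakeExchange.new
publisher = EventPublisher.new(exchange, namespace: "ui.webapp", context: { "host" => "web-1" })
publisher.publish("signup_clicked", "plan" => "free")
puts exchange.messages.first.first
```

Application code only supplies the event name and payload; the namespace and context are set once when the publisher is created.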
Slide 20
Tools to consume data
• Any application can consume data with any standard
AMQP or S3 client.
• We also created a couple of command-line
tools for easy prototyping and
experimentation.
Slide 21
Tools to consume data
tail_f gets you events from the firehose
$ tail_f ui.webapp.#, errors.webapp.# | ruby do_staff.rb
do_staff.rb:
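A minimal sketch of what a consumer script of this kind could look like, assuming tail_f writes one JSON event per line and that each event carries a "key" field (both are assumptions). The helper name count_by_key is hypothetical.

```ruby
require "json"
require "stringio"

# Counts events per key from newline-delimited JSON read from io.
def count_by_key(io)
  counts = Hash.new(0)
  io.each_line do |line|
    line = line.strip
    next if line.empty?

    begin
      event = JSON.parse(line)
    rescue JSON::ParserError
      next # skip malformed lines rather than abort the stream
    end
    next unless event.is_a?(Hash)

    counts[event["key"]] += 1
  end
  counts
end

# In a piped script, pass $stdin instead of this sample input.
sample = StringIO.new(<<~EVENTS)
  {"key":"ui.webapp.click"}
  {"key":"errors.webapp.timeout"}
  {"key":"ui.webapp.click"}
EVENTS
count_by_key(sample).each { |key, n| puts "#{key}\t#{n}" }
```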
Slide 22
Tools to consume data
s3_cat downloads cold data from S3
$ s3_cat 2013-11-26..2013-11-28 | ruby do_staff.rb
$ s3_cat 2013-10, 2013-11-01..2013-11-15 > data
$ s3_cat 2013 | mongoimport -d mydb -c mycollection
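The date arguments above mix single periods (a year or a month) with explicit day ranges. A minimal Ruby sketch of one way to expand such arguments into individual days, each of which maps to one day-partitioned file. How s3_cat actually parses its arguments is not shown, so the function names and the parsing rules here are assumptions.

```ruby
require "date"

# Splits "YYYY", "YYYY-MM" or "YYYY-MM-DD" into integer parts.
def date_parts(text)
  text.split("-").map { |part| Integer(part, 10) }
end

# First day covered by a period such as "2013" or "2013-10".
def period_start(text)
  year, month, day = date_parts(text)
  Date.new(year, month || 1, day || 1)
end

# Last day covered by a period; a day of -1 means the last day of the month.
def period_end(text)
  year, month, day = date_parts(text)
  Date.new(year, month || 12, day || -1)
end

# Expands one argument, e.g. "2013-11-26..2013-11-28" or "2013-10".
def expand_spec(spec)
  first, last = spec.split("..", 2)
  last ||= first
  (period_start(first)..period_end(last)).to_a
end

# Expands several arguments, tolerating trailing commas between them.
def expand_specs(args)
  args.flat_map { |arg| expand_spec(arg.delete(",").strip) }
end

puts expand_spec("2013-11-26..2013-11-28").map(&:to_s).join(" ")
puts expand_specs(["2013-10,", "2013-11-01..2013-11-15"]).size
```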