Slide 1

Slide 1 text

Building a data platform with Ruby glue
Augusto Becciu

Slide 2

Slide 2 text

Who am I? • @abecciu {github, twitter} • Startups guy • Like to write and run code • Cur: CTO at • Prev: Dev / Ops Engineer at

Slide 3

Slide 3 text

Data is the DNA of a Company

Slide 4

Slide 4 text

What is a data platform?

Slide 5

Slide 5 text

Data Applications • Fraud detection • Monitoring and alerting • Metric dashboards • Recommendation systems • Bidding systems • Audit trails • Marketing automation • A/B testing • User behavior analytics

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

Two groups of applications • Real-time processing (e.g. fraud detection, bidding systems, monitoring and alerting). • Batch processing (e.g. recommendation systems, user behavior analytics).

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

Design Principles • Make it super easy for developers to produce and consume data. • Optimize for low operations and infrastructure costs. • Make a best effort not to lose data, but don’t worry about guarantees. • Be programming-language agnostic.

Slide 10

Slide 10 text

Data Acquisition

Slide 11

Slide 11 text

Data Standardization The unit of data is an event represented as a JSON document.
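As a rough illustration of "an event represented as a JSON document", here is a minimal Ruby sketch. The field names (`key`, `timestamp`, `data`) and the helper `build_event` are hypothetical choices made for this example; they are not the schema shown in the talk.

```ruby
require "json"
require "time"

# Hypothetical event layout. The field names are illustrative assumptions,
# not the schema used by the platform described in these slides.
def build_event(key, data, at: Time.now.utc)
  {
    "key"       => key,         # dotted event key, e.g. "ui.webapp.click"
    "timestamp" => at.iso8601,  # ISO 8601 timestamp
    "data"      => data         # free-form payload
  }
end

event = build_event("ui.webapp.click", { "button" => "signup" },
                    at: Time.utc(2013, 11, 26, 12, 0, 0))

# One event serializes to a single line of JSON.
puts JSON.generate(event)
```

Keeping each event on a single line makes it straightforward to append events to text files or pipe them between processes.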

Slide 12

Slide 12 text

Data Standardization Anatomy of an event:

Slide 13

Slide 13 text

Data Consolidation

Slide 14

Slide 14 text

Fluentd • Open source tool to collect events and logs. • Pluggable inputs and outputs. • Code and plugins written in Ruby. • Provides fault tolerance and simple HA facilities. • JSON everywhere.

Slide 15

Slide 15 text

No content

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

Data Storage • The data stream is partitioned by day and stored hierarchically, like: /year/month/day/events.gz • Events are saved in newline-delimited text files. • Every file is gzipped. • Events are buffered on the local filesystem and uploaded to S3 periodically.
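The storage layout can be sketched in a few lines of Ruby using only the standard library. This is an illustration, not the platform's actual code: the helper names `partition_key`, `gzip_events`, and `gunzip_events` are hypothetical, zero-padded month and day components are an assumption, and the upload to S3 is left out.

```ruby
require "json"
require "zlib"
require "stringio"

# Day-partitioned key for a given time (assumes zero-padded month and day).
def partition_key(time)
  time.getutc.strftime("/%Y/%m/%d/events.gz")
end

# Serializes events as newline-delimited JSON and gzips them in memory.
def gzip_events(events)
  io = StringIO.new
  io.set_encoding(Encoding::BINARY)
  gz = Zlib::GzipWriter.new(io)
  events.each { |event| gz.puts(JSON.generate(event)) }
  gz.close
  io.string
end

# Reads a gzipped newline-delimited blob back into an array of events.
def gunzip_events(blob)
  reader = Zlib::GzipReader.new(StringIO.new(blob))
  events = reader.each_line.map { |line| JSON.parse(line) }
  reader.close
  events
end

blob = gzip_events([{ "key" => "ui.webapp.click" }, { "key" => "ui.webapp.view" }])
puts partition_key(Time.utc(2013, 11, 26, 9, 30))
puts gunzip_events(blob).length
```

Because each line is an independent JSON document, a consumer can stream a file and process events one at a time without loading the whole day into memory.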

Slide 18

Slide 18 text

Data Distribution • Publish-subscribe pattern. • Events are published to a topic exchange. • Consumers can subscribe to topics based on event keys. Ex. ui.webapp.# • SSL-based authentication and encryption over persistent TCP connections allow consumers to be anywhere on the Internet.
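To make the subscription pattern concrete, the sketch below reimplements AMQP topic matching in plain Ruby: words are separated by dots, `*` stands for exactly one word, and `#` stands for zero or more words. In practice the broker performs this matching; `topic_match?` is a hypothetical helper written only to show how a binding such as `ui.webapp.#` selects events.

```ruby
# Returns true when an AMQP-style topic pattern matches a routing key.
def topic_match?(pattern, routing_key)
  match_words(pattern.split("."), routing_key.split("."))
end

def match_words(pattern, words)
  return words.empty? if pattern.empty?

  head, *rest = pattern
  case head
  when "#"
    # "#" may absorb zero or more words, so try every possible split point.
    (0..words.length).any? { |n| match_words(rest, words.drop(n)) }
  when "*"
    # "*" must consume exactly one word.
    !words.empty? && match_words(rest, words.drop(1))
  else
    words.first == head && match_words(rest, words.drop(1))
  end
end

puts topic_match?("ui.webapp.#", "ui.webapp.click")      # true
puts topic_match?("ui.webapp.#", "errors.webapp.crash")  # false
puts topic_match?("*.webapp.*", "errors.webapp.crash")   # true
```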

Slide 19

Slide 19 text

Tools to produce data • Libraries were created to abstract the process of creating and publishing an event. • Multiple languages are supported, such as Ruby, JavaScript and Objective-C. For example, in a Rails app it’s as simple as: (Namespace and context data are added automagically...)

Slide 20

Slide 20 text

Tools to consume data • Any application can use any standard AMQP or S3 client. • We also created a couple of command-line tools for easy prototyping and experimentation.

Slide 21

Slide 21 text

Tools to consume data
tail_f gets you events from the firehose
$ tail_f ui.webapp.#, errors.webapp.# | ruby do_staff.rb
do_staff.rb:
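The contents of do_staff.rb are not preserved in this transcript. As a stand-in, here is a minimal sketch of what a script on the receiving end of such a pipe might look like. It assumes that tail_f writes one JSON event per line and that each event carries a `key` field; both are assumptions, and `each_event` and `count_by_key` are hypothetical helpers.

```ruby
require "json"
require "stringio"

# Yields each event parsed from a newline-delimited JSON stream.
# Lines that are not JSON objects are skipped (best effort, no guarantees).
def each_event(io)
  return enum_for(:each_event, io) unless block_given?

  io.each_line do |line|
    event = begin
      JSON.parse(line)
    rescue JSON::ParserError
      nil
    end
    yield event if event.is_a?(Hash)
  end
end

# Example consumer: counts events per key ("key" is an assumed field name).
def count_by_key(io)
  counts = Hash.new(0)
  each_event(io) { |event| counts[event["key"]] += 1 }
  counts
end

# Demo on an in-memory stream; in a pipe you would pass $stdin instead.
sample = StringIO.new(<<~EVENTS)
  {"key": "ui.webapp.click"}
  not json
  {"key": "ui.webapp.click"}
  {"key": "errors.webapp.timeout"}
EVENTS
p count_by_key(sample)
```

Reading line by line keeps the script usable on an unbounded stream, since nothing has to be buffered beyond the current event.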

Slide 22

Slide 22 text

Tools to consume data
s3_cat downloads cold data from S3
$ s3_cat 2013-11-26..2013-11-28 | ruby do_staff.rb
$ s3_cat 2013-10, 2013-11-01..2013-11-15 > data
$ s3_cat 2013 | mongoimport -d mydb -c mycollection
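The date arguments in these examples (a year, a month, a day, or an `A..B` range) map naturally onto the day-partitioned storage layout. The sketch below shows one way such arguments could be expanded into object keys. It is a guess at the behavior based only on the examples above, not the actual s3_cat implementation; `expand_dates` and `object_keys` are hypothetical names.

```ruby
require "date"

# Expands one date argument into the list of days it covers.
# Accepted forms: "YYYY", "YYYY-MM", "YYYY-MM-DD", and "A..B" where
# A and B are full dates (inclusive on both ends).
def expand_dates(arg)
  if arg.include?("..")
    first, last = arg.split("..", 2).map { |s| Date.strptime(s, "%Y-%m-%d") }
    return (first..last).to_a
  end

  parts = arg.split("-").map { |part| Integer(part, 10) }
  case parts.length
  when 1 then (Date.new(parts[0], 1, 1)..Date.new(parts[0], 12, 31)).to_a
  when 2 then (Date.new(parts[0], parts[1], 1)..Date.new(parts[0], parts[1], -1)).to_a
  when 3 then [Date.new(*parts)]
  else raise ArgumentError, "unrecognized date argument: #{arg.inspect}"
  end
end

# Maps a date argument to the day-partitioned keys it would read.
def object_keys(arg)
  expand_dates(arg).map { |day| day.strftime("/%Y/%m/%d/events.gz") }
end

puts object_keys("2013-11-26..2013-11-28")
```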

Slide 23

Slide 23 text

Resources • http://fluentd.org/ • https://github.com/fluent/fluent-plugin-s3 • https://github.com/restorando/fluent-plugin-amqp • http://www.rabbitmq.com/ • https://github.com/restorando

Slide 24

Slide 24 text

Thanks! / Questions? Meet the team http://restorando.com/joinus