Slide 1

Slide 1 text

Transforming your Data with Ruby and ActiveWarehouse-ETL Saturday, June 23, 12

Slide 2

Slide 2 text

Thibaut Barrère twitter: @thibaut_barrere github: @thbar Saturday, June 23, 12

Slide 3

Slide 3 text

Core Concepts 1 Saturday, June 23, 12

Slide 4

Slide 4 text

E T L ? Saturday, June 23, 12

Slide 5

Slide 5 text

Extract Transform Load Saturday, June 23, 12

Slide 6

Slide 6 text

Extract • from a local database • from a CRM via some command-line • from a remote source • partial or complete Saturday, June 23, 12

Slide 7

Slide 7 text

Transform • clean-up • look-up • enrich • filter • conform • (de) normalize Saturday, June 23, 12

Slide 8

Slide 8 text

Load • dump to file • insert • upsert (update/insert) Saturday, June 23, 12

Slide 9

Slide 9 text

Saturday, June 23, 12

Slide 10

Slide 10 text

•first released in 2006 •latest release this week •slow maintenance pace • http://www.activewarehouse.info/ ActiveWarehouse-ETL Saturday, June 23, 12

Slide 11

Slide 11 text

•MRI ruby 1.8.7 / 1.9.3 •activerecord 3+ ActiveWarehouse-ETL Saturday, June 23, 12

Slide 12

Slide 12 text

• gem command: “etl” • declarative ruby scripts aka “control files” • database.yml • works with or without a Rails app ActiveWarehouse-ETL Saturday, June 23, 12

Slide 13

Slide 13 text

https://github.com/activewarehouse/activewarehouse-etl-sample Sample Walk-Through Saturday, June 23, 12

Slide 14

Slide 14 text

| first_name | last_name | email | |------------|------------|------------------------| | John | Barry | [email protected] | | Jonathon | More | [email protected] | | Matt | Black | [email protected] | | Marlena | Shaw | [email protected] | | Neil | Young | [email protected] | customers table ETL upsert based on email extract email host remove hotmail rows 1 2 3 Saturday, June 23, 12

Slide 15

Slide 15 text

| first_name | last_name | email | |------------|------------|------------------------| | John | Barry | [email protected] | | Jonathon | More | [email protected] | | Matt | Black | [email protected] | | Marlena | Shaw | [email protected] | | Neil | Young | [email protected] | Input Saturday, June 23, 12

Slide 16

Slide 16 text

source :input, { :file => file, :parser => :csv, :skip_lines => 1 }, [ :first_name, :last_name, :email ] Data source Saturday, June 23, 12

Slide 17

Slide 17 text

transform(:email_provider) do |name,value,row| row[:email].downcase.split('@').last end Transforms Saturday, June 23, 12

Slide 18

Slide 18 text

transform :email_provider, :default, :default_value => "Unknown" Transforms Saturday, June 23, 12

Slide 19

Slide 19 text

transform(:full_name) do |n,v,r| [r[:first_name], r[:last_name]].join(' ') end Transforms Saturday, June 23, 12

Slide 20

Slide 20 text

before_write do |r| r[:email_provider] =~ /hotmail/ ? nil : r end Before Write Saturday, June 23, 12

Slide 21

Slide 21 text

destination :out, { :type => :insert_update_database, :target => :datawarehouse, :table => 'customers' }, { :primarykey => [:email], :order => [ :email, :full_name, :email_provider ] } Destination Saturday, June 23, 12

Slide 22

Slide 22 text

class Customer < ActiveRecord::Base end screen(:fatal) { assert_equal 1, Customer.where(:email => '[email protected]').count } Screens Saturday, June 23, 12

Slide 23

Slide 23 text

Pipeline Life-Cycle - file evaluated - pre_process - sources fetched - after_read(row) - transforms(row) - before_write(row) - row written to destination - screen - post_process - after_post_process_screen Saturday, June 23, 12

Slide 24

Slide 24 text

Key Points Saturday, June 23, 12

Slide 25

Slide 25 text

Highly hackable • Ruby! • Refactorable • Testable • Diff-able • Composable Saturday, June 23, 12

Slide 26

Slide 26 text

Declarative Syntax •Concise •Easier to maintain •Easier to reuse Saturday, June 23, 12

Slide 27

Slide 27 text

Row-based processing •Creative constraint •Provides a framework Saturday, June 23, 12

Slide 28

Slide 28 text

The docs kinda suck (embrace the code for now) Saturday, June 23, 12

Slide 29

Slide 29 text

Ruby is slow! Saturday, June 23, 12

Slide 30

Slide 30 text

Ruby is slow! • More often than not, it’s fast enough! People are handling tens of millions of rows daily • If too slow, delegate the critical loop to a faster tool and use Ruby to drive the workflow • Use bulk load/upsert features of your db Saturday, June 23, 12

Slide 31

Slide 31 text

Real-Life Examples 2 Saturday, June 23, 12

Slide 32

Slide 32 text

Reporting / Business Intelligence Saturday, June 23, 12

Slide 33

Slide 33 text

Production Reporting ETL Saturday, June 23, 12

Slide 34

Slide 34 text

Date Dimension Time Dimension Customer Dimension Purchase Fact Product Dimension Saturday, June 23, 12

Slide 35

Slide 35 text

| id | date | quarter | quarter_year | ... | |------------|------------|---------|--------------|-----| | 1 | 2000-01-01 | T1 | 2000-T1 | ... | | 2 | 2000-01-02 | T1 | 2000-T1 | ... | | | | | | ... | | ... | | | | ... | | 7660 | 2020-12-20 | T2 | 2020-T2 | ... | Date Dimension Saturday, June 23, 12

Slide 36

Slide 36 text

| id | email | email_provider | country | language | |------|------------|----------------|---------|----------| | 1 | xxx | gmail | France | french | | 2 | xxx | gmail | U.K. | english | | 3 | xxx | hotmail | France | french | Customer Dimension Saturday, June 23, 12

Slide 37

Slide 37 text

resolver = ActiveRecordResolver.new(DateDimension, :find_by_sql_date_stamp) copy :created_at, :date_id transform :date_id, :foreign_key_lookup, { :resolver => resolver } Surrogate Key Look-Up Saturday, June 23, 12

Slide 38

Slide 38 text

Saturday, June 23, 12

Slide 39

Slide 39 text

OLAP with JRuby API https://github.com/rsim/mondrian-olap Saturday, June 23, 12

Slide 40

Slide 40 text

CRM WWW Extracting and geocoding Saturday, June 23, 12

Slide 41

Slide 41 text

•Full export from CRM •Cleaning + geocoding •Push to production via SCP •Import/indexing via CRON Saturday, June 23, 12

Slide 42

Slide 42 text

before_write do |row| position, status = geocoding_with_redis_cache(row) row[:geocoding_status] = status row[:latitude] = position.latitude row[:longitude] = position.longitude row end Saturday, June 23, 12

Slide 43

Slide 43 text

JAVA COBOL Holà! What? Translating Saturday, June 23, 12

Slide 44

Slide 44 text

•Both systems are too rigid •Create a well-tested translator with Ruby Saturday, June 23, 12

Slide 45

Slide 45 text

it "raises an error when a data is too large" do lambda do destination.write( { :age => 270, :sex => "M" } ) destination.close end.should raise_error(/field age is too large (max width 2\)/) end it "outputs all the fields and terminates by eol" do destination.write( { :age => "27", :sex => "M"} ) destination.write( { :age => "20", :sex => "F"} ) destination.close output.should == "27M\r\n20F\r\n" end Test, test, test. Or you’re all fired. Saturday, June 23, 12

Slide 46

Slide 46 text

Consolidating AdWords Analytics App/CRM PABX Decide, buy, sell, bid Saturday, June 23, 12

Slide 47

Slide 47 text

Aggregating Rentals partner A CSV Rentals partner B XML Rentals partner C Rotten CSV Our holiday rental site Saturday, June 23, 12

Slide 48

Slide 48 text

Cleaning and extending the CRM (with rubygem functionalities) CRM Extract rows Write down corrupt IBANs Craft an email to supervisor 1 2 3 Saturday, June 23, 12

Slide 49

Slide 49 text

Cleaning and extending the CRM require 'iban-check' @invalid_ibans = [] before_write do |row| iban = row.slice(:iban_1, .., :iban_6).join(' ') iban_check = Iban::IbanCheck.new :iban => iban @invalid_ibans << iban if !iban_check.valid? row end post_process { Notifier.deliver_iban_report(@invalid_ibans) } Saturday, June 23, 12

Slide 50

Slide 50 text

Change Data Capture Tricks • immutable data • mutable data • “mark as deleted“ • trigger / list of deleted ids • row checksums created_at > last_extract_timestamp id > last_processed_id created_at > X or modified_at > X Saturday, June 23, 12

Slide 51

Slide 51 text

Thank You! Questions? Talk crafted and delivered by http://logeek.fr Ping me for more advanded topics! Thanks to Miss Anthropy for her “Thought Provoking” palette (ColourLovers)” Saturday, June 23, 12