Transforming Data with Ruby and ActiveWarehouse-ETL

Transforming Data with Ruby and ActiveWarehouse-ETL

This talk (given at RuLu 2012) will show how to use ActiveWarehouse-ETL to extract, transform and move your data around in Ruby, and what other people do in real life with this toolkit.

1# Beginner guide to ActiveWarehouse-ETL

Gem presentation (concepts etc)
Beginner examples of data processing

2# Real-life examples all in Ruby:

Extracting data from a CRM to build a metrics dimensional dashboard (business intelligence)
Extracting and geocoding places before publishing a map on the web
Adapting data for consumption by COBOL systems
Cleaning and extending the CRM
Extracting only what changed (aka Change Data Capture)

91eb330fb36d1e03c856574dfb77d2bc?s=128

Thibaut Barrère

June 23, 2012
Tweet

Transcript

  1. 6.

    Extract • from a local database • from a CRM

    via some command-line • from a remote source • partial or complete Saturday, June 23, 12
  2. 7.

    Transform • clean-up • look-up • enrich • filter •

    conform • (de) normalize Saturday, June 23, 12
  3. 10.

    •first released in 2006 •latest release this week •slow maintenance

    pace • http://www.activewarehouse.info/ ActiveWarehouse-ETL Saturday, June 23, 12
  4. 12.

    • gem command: “etl” • declarative ruby scripts aka “control

    files” • database.yml • works with or without a Rails app ActiveWarehouse-ETL Saturday, June 23, 12
  5. 14.

    | first_name | last_name | email | |------------|------------|------------------------| | John

    | Barry | john.barry@gmail.com | | Jonathon | More | jon@coldcut.com | | Matt | Black | matt@coldcut.com | | Marlena | Shaw | marlena.shaw@gmail.com | | Neil | Young | neil.young@hotmail.com | customers table ETL upsert based on email extract email host remove hotmail rows 1 2 3 Saturday, June 23, 12
  6. 15.

    | first_name | last_name | email | |------------|------------|------------------------| | John

    | Barry | john.barry@gmail.com | | Jonathon | More | jon@coldcut.com | | Matt | Black | matt@coldcut.com | | Marlena | Shaw | marlena.shaw@gmail.com | | Neil | Young | neil.young@hotmail.com | Input Saturday, June 23, 12
  7. 16.

    source :input, { :file => file, :parser => :csv, :skip_lines

    => 1 }, [ :first_name, :last_name, :email ] Data source Saturday, June 23, 12
  8. 20.

    before_write do |r| r[:email_provider] =~ /hotmail/ ? nil : r

    end Before Write Saturday, June 23, 12
  9. 21.

    destination :out, { :type => :insert_update_database, :target => :datawarehouse, :table

    => 'customers' }, { :primarykey => [:email], :order => [ :email, :full_name, :email_provider ] } Destination Saturday, June 23, 12
  10. 22.

    class Customer < ActiveRecord::Base end screen(:fatal) { assert_equal 1, Customer.where(:email

    => 'john.barry@gmail.com').count } Screens Saturday, June 23, 12
  11. 23.

    Pipeline Life-Cycle - file evaluated - pre_process - sources fetched

    - after_read(row) - transforms(row) - before_write(row) - row written to destination - screen - post_process - after_post_process_screen Saturday, June 23, 12
  12. 30.

    Ruby is slow! • More often than not, it’s fast

    enough! People are handling tens of millions of rows daily • If too slow, delegate the critical loop to a faster tool and use Ruby to drive the workflow • Use bulk load/upsert features of your db Saturday, June 23, 12
  13. 35.

    | id | date | quarter | quarter_year | ...

    | |------------|------------|---------|--------------|-----| | 1 | 2000-01-01 | T1 | 2000-T1 | ... | | 2 | 2000-01-02 | T1 | 2000-T1 | ... | | | | | | ... | | ... | | | | ... | | 7660 | 2020-12-20 | T2 | 2020-T2 | ... | Date Dimension Saturday, June 23, 12
  14. 36.

    | id | email | email_provider | country | language

    | |------|------------|----------------|---------|----------| | 1 | xxx | gmail | France | french | | 2 | xxx | gmail | U.K. | english | | 3 | xxx | hotmail | France | french | Customer Dimension Saturday, June 23, 12
  15. 41.

    •Full export from CRM •Cleaning + geocoding •Push to production

    via SCP •Import/indexing via CRON Saturday, June 23, 12
  16. 42.

    before_write do |row| position, status = geocoding_with_redis_cache(row) row[:geocoding_status] = status

    row[:latitude] = position.latitude row[:longitude] = position.longitude row end Saturday, June 23, 12
  17. 45.

    it "raises an error when a data is too large"

    do lambda do destination.write( { :age => 270, :sex => "M" } ) destination.close end.should raise_error(/field age is too large (max width 2\)/) end it "outputs all the fields and terminates by eol" do destination.write( { :age => "27", :sex => "M"} ) destination.write( { :age => "20", :sex => "F"} ) destination.close output.should == "27M\r\n20F\r\n" end Test, test, test. Or you’re all fired. Saturday, June 23, 12
  18. 47.

    Aggregating Rentals partner A CSV Rentals partner B XML Rentals

    partner C Rotten CSV Our holiday rental site Saturday, June 23, 12
  19. 48.

    Cleaning and extending the CRM (with rubygem functionalities) CRM Extract

    rows Write down corrupt IBANs Craft an email to supervisor 1 2 3 Saturday, June 23, 12
  20. 49.

    Cleaning and extending the CRM require 'iban-check' @invalid_ibans = []

    before_write do |row| iban = row.slice(:iban_1, .., :iban_6).join(' ') iban_check = Iban::IbanCheck.new :iban => iban @invalid_ibans << iban if !iban_check.valid? row end post_process { Notifier.deliver_iban_report(@invalid_ibans) } Saturday, June 23, 12
  21. 50.

    Change Data Capture Tricks • immutable data • mutable data

    • “mark as deleted“ • trigger / list of deleted ids • row checksums created_at > last_extract_timestamp id > last_processed_id created_at > X or modified_at > X Saturday, June 23, 12
  22. 51.

    Thank You! Questions? Talk crafted and delivered by http://logeek.fr Ping

    me for more advanded topics! Thanks to Miss Anthropy for her “Thought Provoking” palette (ColourLovers)” Saturday, June 23, 12