$30 off During Our Annual Pro Sale. View Details »

Transforming Data with Ruby and ActiveWarehouse-ETL

Transforming Data with Ruby and ActiveWarehouse-ETL

This talk (given at RuLu 2012) will show how to use ActiveWarehouse-ETL to extract, transform and move your data around in Ruby, and what other people do in real life with this toolkit.

1# Beginner guide to ActiveWarehouse-ETL

Gem presentation (concepts etc)
Beginner examples of data processing

2# Real-life examples all in Ruby:

Extracting data from a CRM to build a metrics dimensional dashboard (business intelligence)
Extracting and geocoding places before publishing a map on the web
Adapting data for consumption by COBOL systems
Cleaning and extending the CRM
Extracting only what changed (aka Change Data Capture)

Thibaut Barrère

June 23, 2012
Tweet

More Decks by Thibaut Barrère

Other Decks in Programming

Transcript

  1. Transforming your
    Data
    with Ruby
    and ActiveWarehouse-ETL
    Saturday, June 23, 12

    View Slide

  2. Thibaut Barrère
    twitter: @thibaut_barrere
    github: @thbar
    Saturday, June 23, 12

    View Slide

  3. Core Concepts
    1
    Saturday, June 23, 12

    View Slide

  4. E T L ?
    Saturday, June 23, 12

    View Slide

  5. Extract
    Transform
    Load
    Saturday, June 23, 12

    View Slide

  6. Extract
    • from a local database
    • from a CRM via some command-line
    • from a remote source
    • partial or complete
    Saturday, June 23, 12

    View Slide

  7. Transform
    • clean-up
    • look-up
    • enrich
    • filter
    • conform
    • (de) normalize
    Saturday, June 23, 12

    View Slide

  8. Load
    • dump to file
    • insert
    • upsert (update/insert)
    Saturday, June 23, 12

    View Slide

  9. Saturday, June 23, 12

    View Slide

  10. •first released in 2006
    •latest release this week
    •slow maintenance pace
    • http://www.activewarehouse.info/
    ActiveWarehouse-ETL
    Saturday, June 23, 12

    View Slide

  11. •MRI ruby 1.8.7 / 1.9.3
    •activerecord 3+
    ActiveWarehouse-ETL
    Saturday, June 23, 12

    View Slide

  12. • gem command: “etl”
    • declarative ruby scripts aka
    “control files”
    • database.yml
    • works with or without a Rails app
    ActiveWarehouse-ETL
    Saturday, June 23, 12

    View Slide

  13. https://github.com/activewarehouse/activewarehouse-etl-sample
    Sample Walk-Through
    Saturday, June 23, 12

    View Slide

  14. | first_name | last_name | email |
    |------------|------------|------------------------|
    | John | Barry | [email protected] |
    | Jonathon | More | [email protected] |
    | Matt | Black | [email protected] |
    | Marlena | Shaw | [email protected] |
    | Neil | Young | [email protected] |
    customers
    table
    ETL
    upsert based on email
    extract email host
    remove hotmail rows
    1
    2
    3
    Saturday, June 23, 12

    View Slide

  15. | first_name | last_name | email |
    |------------|------------|------------------------|
    | John | Barry | [email protected] |
    | Jonathon | More | [email protected] |
    | Matt | Black | [email protected] |
    | Marlena | Shaw | [email protected] |
    | Neil | Young | [email protected] |
    Input
    Saturday, June 23, 12

    View Slide

  16. source :input,
    {
    :file => file,
    :parser => :csv,
    :skip_lines => 1
    },
    [
    :first_name,
    :last_name,
    :email
    ]
    Data source
    Saturday, June 23, 12

    View Slide

  17. transform(:email_provider) do |name,value,row|
    row[:email].downcase.split('@').last
    end
    Transforms
    Saturday, June 23, 12

    View Slide

  18. transform :email_provider,
    :default,
    :default_value => "Unknown"
    Transforms
    Saturday, June 23, 12

    View Slide

  19. transform(:full_name) do |n,v,r|
    [r[:first_name], r[:last_name]].join(' ')
    end
    Transforms
    Saturday, June 23, 12

    View Slide

  20. before_write do |r|
    r[:email_provider] =~ /hotmail/ ? nil : r
    end
    Before Write
    Saturday, June 23, 12

    View Slide

  21. destination :out,
    {
    :type => :insert_update_database,
    :target => :datawarehouse,
    :table => 'customers'
    },
    {
    :primarykey => [:email],
    :order => [
    :email, :full_name, :email_provider
    ]
    }
    Destination
    Saturday, June 23, 12

    View Slide

  22. class Customer < ActiveRecord::Base
    end
    screen(:fatal) {
    assert_equal 1, Customer.where(:email =>
    '[email protected]').count
    }
    Screens
    Saturday, June 23, 12

    View Slide

  23. Pipeline Life-Cycle
    - file evaluated
    - pre_process
    - sources fetched
    - after_read(row)
    - transforms(row)
    - before_write(row)
    - row written to destination
    - screen
    - post_process
    - after_post_process_screen
    Saturday, June 23, 12

    View Slide

  24. Key Points
    Saturday, June 23, 12

    View Slide

  25. Highly hackable
    • Ruby!
    • Refactorable
    • Testable
    • Diff-able
    • Composable
    Saturday, June 23, 12

    View Slide

  26. Declarative
    Syntax
    •Concise
    •Easier to maintain
    •Easier to reuse
    Saturday, June 23, 12

    View Slide

  27. Row-based
    processing
    •Creative constraint
    •Provides a framework
    Saturday, June 23, 12

    View Slide

  28. The docs kinda suck
    (embrace the code for now)
    Saturday, June 23, 12

    View Slide

  29. Ruby is
    slow!
    Saturday, June 23, 12

    View Slide

  30. Ruby is slow!
    • More often than not, it’s fast enough!
    People are handling tens of millions of
    rows daily
    • If too slow, delegate the critical loop to a
    faster tool and use Ruby to drive the
    workflow
    • Use bulk load/upsert features of your db
    Saturday, June 23, 12

    View Slide

  31. Real-Life Examples
    2
    Saturday, June 23, 12

    View Slide

  32. Reporting / Business Intelligence
    Saturday, June 23, 12

    View Slide

  33. Production Reporting
    ETL
    Saturday, June 23, 12

    View Slide

  34. Date
    Dimension
    Time
    Dimension Customer
    Dimension
    Purchase
    Fact
    Product
    Dimension
    Saturday, June 23, 12

    View Slide

  35. | id | date | quarter | quarter_year | ... |
    |------------|------------|---------|--------------|-----|
    | 1 | 2000-01-01 | T1 | 2000-T1 | ... |
    | 2 | 2000-01-02 | T1 | 2000-T1 | ... |
    | | | | | ... |
    | ... | | | | ... |
    | 7660 | 2020-12-20 | T2 | 2020-T2 | ... |
    Date Dimension
    Saturday, June 23, 12

    View Slide

  36. | id | email | email_provider | country | language |
    |------|------------|----------------|---------|----------|
    | 1 | xxx | gmail | France | french |
    | 2 | xxx | gmail | U.K. | english |
    | 3 | xxx | hotmail | France | french |
    Customer Dimension
    Saturday, June 23, 12

    View Slide

  37. resolver = ActiveRecordResolver.new(DateDimension, :find_by_sql_date_stamp)
    copy :created_at, :date_id
    transform :date_id,
    :foreign_key_lookup,
    { :resolver => resolver }
    Surrogate Key Look-Up
    Saturday, June 23, 12

    View Slide

  38. Saturday, June 23, 12

    View Slide

  39. OLAP with JRuby API
    https://github.com/rsim/mondrian-olap
    Saturday, June 23, 12

    View Slide

  40. CRM WWW
    Extracting and geocoding
    Saturday, June 23, 12

    View Slide

  41. •Full export from CRM
    •Cleaning + geocoding
    •Push to production via SCP
    •Import/indexing via CRON
    Saturday, June 23, 12

    View Slide

  42. before_write do |row|
    position, status = geocoding_with_redis_cache(row)
    row[:geocoding_status] = status
    row[:latitude] = position.latitude
    row[:longitude] = position.longitude
    row
    end
    Saturday, June 23, 12

    View Slide

  43. JAVA COBOL
    Holà!
    What?
    Translating
    Saturday, June 23, 12

    View Slide

  44. •Both systems are too rigid
    •Create a well-tested
    translator with Ruby
    Saturday, June 23, 12

    View Slide

  45. it "raises an error when a data is too large" do
    lambda do
    destination.write( { :age => 270, :sex => "M" } )
    destination.close
    end.should raise_error(/field age is too large (max width 2\)/)
    end
    it "outputs all the fields and terminates by eol" do
    destination.write( { :age => "27", :sex => "M"} )
    destination.write( { :age => "20", :sex => "F"} )
    destination.close
    output.should == "27M\r\n20F\r\n"
    end
    Test, test, test. Or you’re all fired.
    Saturday, June 23, 12

    View Slide

  46. Consolidating
    AdWords
    Analytics
    App/CRM
    PABX
    Decide,
    buy, sell, bid
    Saturday, June 23, 12

    View Slide

  47. Aggregating
    Rentals
    partner A
    CSV
    Rentals
    partner B
    XML
    Rentals
    partner C
    Rotten CSV
    Our holiday
    rental site
    Saturday, June 23, 12

    View Slide

  48. Cleaning and extending the CRM
    (with rubygem functionalities)
    CRM
    Extract rows
    Write down corrupt IBANs
    Craft an email to supervisor
    1
    2
    3
    Saturday, June 23, 12

    View Slide

  49. Cleaning and extending the CRM
    require 'iban-check'
    @invalid_ibans = []
    before_write do |row|
    iban = row.slice(:iban_1, .., :iban_6).join(' ')
    iban_check = Iban::IbanCheck.new :iban => iban
    @invalid_ibans << iban if !iban_check.valid?
    row
    end
    post_process {
    Notifier.deliver_iban_report(@invalid_ibans)
    }
    Saturday, June 23, 12

    View Slide

  50. Change Data Capture Tricks
    • immutable data
    • mutable data
    • “mark as deleted“
    • trigger / list of deleted ids
    • row checksums
    created_at > last_extract_timestamp
    id > last_processed_id
    created_at > X or modified_at > X
    Saturday, June 23, 12

    View Slide

  51. Thank You!
    Questions?
    Talk crafted and delivered by http://logeek.fr
    Ping me for more advanded topics!
    Thanks to Miss Anthropy for her “Thought Provoking” palette (ColourLovers)”
    Saturday, June 23, 12

    View Slide