• gem command: “etl”
• declarative ruby scripts aka
“control files”
• database.yml
• works with or without a Rails app
ActiveWarehouse-ETL
Saturday, June 23, 12
Slide 13
Slide 13 text
https://github.com/activewarehouse/activewarehouse-etl-sample
Sample Walk-Through
Saturday, June 23, 12
Slide 14
Slide 14 text
| first_name | last_name | email |
|------------|------------|------------------------|
| John | Barry | [email protected] |
| Jonathon | More | [email protected] |
| Matt | Black | [email protected] |
| Marlena | Shaw | [email protected] |
| Neil | Young | [email protected] |
customers
table
ETL
upsert based on email
extract email host
remove hotmail rows
1
2
3
Saturday, June 23, 12
Declarative
Syntax
•Concise
•Easier to maintain
•Easier to reuse
Saturday, June 23, 12
Slide 27
Slide 27 text
Row-based
processing
•Creative constraint
•Provides a framework
Saturday, June 23, 12
Slide 28
Slide 28 text
The docs kinda suck
(embrace the code for now)
Saturday, June 23, 12
Slide 29
Slide 29 text
Ruby is
slow!
Saturday, June 23, 12
Slide 30
Slide 30 text
Ruby is slow!
• More often than not, it’s fast enough!
People are handling tens of millions of
rows daily
• If too slow, delegate the critical loop to a
faster tool and use Ruby to drive the
workflow
• Use bulk load/upsert features of your db
Saturday, June 23, 12
Slide 31
Slide 31 text
Real-Life Examples
2
Saturday, June 23, 12
Slide 32
Slide 32 text
Reporting / Business Intelligence
Saturday, June 23, 12
Slide 33
Slide 33 text
Production Reporting
ETL
Saturday, June 23, 12
Slide 34
Slide 34 text
Date
Dimension
Time
Dimension Customer
Dimension
Purchase
Fact
Product
Dimension
Saturday, June 23, 12
OLAP with JRuby API
https://github.com/rsim/mondrian-olap
Saturday, June 23, 12
Slide 40
Slide 40 text
CRM WWW
Extracting and geocoding
Saturday, June 23, 12
Slide 41
Slide 41 text
•Full export from CRM
•Cleaning + geocoding
•Push to production via SCP
•Import/indexing via CRON
Saturday, June 23, 12
Slide 42
Slide 42 text
before_write do |row|
position, status = geocoding_with_redis_cache(row)
row[:geocoding_status] = status
row[:latitude] = position.latitude
row[:longitude] = position.longitude
row
end
Saturday, June 23, 12
Slide 43
Slide 43 text
JAVA COBOL
Holà!
What?
Translating
Saturday, June 23, 12
Slide 44
Slide 44 text
•Both systems are too rigid
•Create a well-tested
translator with Ruby
Saturday, June 23, 12
Slide 45
Slide 45 text
it "raises an error when a data is too large" do
lambda do
destination.write( { :age => 270, :sex => "M" } )
destination.close
end.should raise_error(/field age is too large (max width 2\)/)
end
it "outputs all the fields and terminates by eol" do
destination.write( { :age => "27", :sex => "M"} )
destination.write( { :age => "20", :sex => "F"} )
destination.close
output.should == "27M\r\n20F\r\n"
end
Test, test, test. Or you’re all fired.
Saturday, June 23, 12
Aggregating
Rentals
partner A
CSV
Rentals
partner B
XML
Rentals
partner C
Rotten CSV
Our holiday
rental site
Saturday, June 23, 12
Slide 48
Slide 48 text
Cleaning and extending the CRM
(with rubygem functionalities)
CRM
Extract rows
Write down corrupt IBANs
Craft an email to supervisor
1
2
3
Saturday, June 23, 12
Slide 49
Slide 49 text
Cleaning and extending the CRM
require 'iban-check'
@invalid_ibans = []
before_write do |row|
iban = row.slice(:iban_1, .., :iban_6).join(' ')
iban_check = Iban::IbanCheck.new :iban => iban
@invalid_ibans << iban if !iban_check.valid?
row
end
post_process {
Notifier.deliver_iban_report(@invalid_ibans)
}
Saturday, June 23, 12
Slide 50
Slide 50 text
Change Data Capture Tricks
• immutable data
• mutable data
• “mark as deleted“
• trigger / list of deleted ids
• row checksums
created_at > last_extract_timestamp
id > last_processed_id
created_at > X or modified_at > X
Saturday, June 23, 12
Slide 51
Slide 51 text
Thank You!
Questions?
Talk crafted and delivered by http://logeek.fr
Ping me for more advanded topics!
Thanks to Miss Anthropy for her “Thought Provoking” palette (ColourLovers)”
Saturday, June 23, 12