Upgrade to Pro — share decks privately, control downloads, hide ads and more …

pETL - Pretty good ETL framework

pETL - Pretty good ETL framework

Data Warehouse in Ruby, why not? I'll demonstrate how we Open Sourced our homegrown ETL engine. I'll show the challenges that we had and how we solved it here at ZestFinance.

Presented at LA Ruby Meetup http://www.meetup.com/laruby/events/142080482/

Alexander Tamoykin

November 14, 2013
Tweet

More Decks by Alexander Tamoykin

Other Decks in Technology

Transcript

  1. PETL - Pretty good ETL framework Alexander Tamoykin Senior Software

    Engineer @ Zest Finance http://alextamoykin.com
  2. Why ETL ? • Pros • Answer business questions fast

    • Many departments need it (Marketing, Accounting, Analysis etc) • SWE don’t have to deal with it • Cons • Support • Maintenance • Monitoring
  3. First Try desc "imports new history data from source database"

    task :source_history => :environment do puts "[ #{Time.now} ] importing history from the Source database (new sql)" SourceHistory.import puts "[ #{Time.now} ] imported #{SourceHistory.count} new entries into the SourceHistory" end
  4. class SourceHistory < Source::Base EPOCH = "1970-01-01".to_time ! set_table_name :source_history

    ! def query_source Source::History.since latest_post_date end ! def source_count Source::History.since(EPOCH.to_date).count end ! def latest_date (self.class.maximum(:Date) || EPOCH).to_date end ! def verify_count logger = Rails.logger if self.source_count != self.class.count logger.error "source_history_etl: counts don't match" end ! logger.info "source_history_etl: source count #{self.source_count}" logger.info "source_history_etl: destination count #{self.class.count}" end ! def import(models) puts " importing #{models.length}..." columns = self.class.column_names inserts = [] ! models.each do |model| values = columns.inject([]) do |memo,key| memo << (model.attributes[key].nil? ? 'NULL' : "'#{model.attributes[key]}'") end inserts << "(#{values.join(',')})" end ! sql ="INSERT IGNORE INTO #{self.class.table_name}(#{columns.join(',')}) VALUES #{inserts.join(",")}" self.class.connection.execute sql end end
  5. No Structure class SourceHistory < Source::Base EPOCH = "1970-01-01".to_time !

    set_table_name :source_history ! def query_source Source::History.since latest_post_date end ! def source_count Source::History.since(EPOCH.to_date).count end ! def latest_date (self.class.maximum(:Date) || EPOCH).to_date end ! def verify_count logger = Rails.logger if self.source_count != self.class.count logger.error "source_history_etl: counts don't match" end ! logger.info "source_history_etl: source count #{self.source_count}" logger.info "source_history_etl: destination count #{self.class.count}" end ! def import(models) puts " importing #{models.length}..." columns = self.class.column_names inserts = [] ! models.each do |model| values = columns.inject([]) do |memo,key| memo << (model.attributes[key].nil? ? 'NULL' : "'#{model.attributes[key]}'") end inserts << "(#{values.join(',')})" end ! sql ="INSERT IGNORE INTO #{self.class.table_name}(#{columns.join(',')}) VALUES #{inserts.join(",")}" self.class.connection.execute sql end end
  6. Code Duplication desc "import telephony agent qa files which have

    not yet been imported" task :update_agent_qa => :environment do ! start_count = TelephonyAgentQa.count start_time = Time.now TelephonyAgentQa.update finish_count = TelephonyAgentQa.count finish_time = Time.now ! puts "imported #{finish_count - start_count} new telephony agent qa entries" puts "in #{finish_time - start_time} seconds" end
  7. More problems • Fear caused by unfamiliarity • Bugs •

    Spotty test coverage • Sad users :(
  8. Extract Transform Load class SourceHistory < Source::Base … def query_source

    Source::History.since latest_post_date end ! def source_count Source::History.since(EPOCH.to_date).count end ! def latest_date (self.class.maximum(:Date) || EPOCH).to_date end ! def verify_count logger = Rails.logger if self.source_count != self.class.count logger.error "source_history_etl: counts don't match" end ! logger.info "source_history_etl: source count #{self.source_count}" logger.info "source_history_etl: destination count #{self.class.count}" end ! def import(models) puts " importing #{models.length}..." columns = self.class.column_names inserts = [] ! models.each do |model| values = columns.inject([]) do |memo,key| memo << (model.attributes[key].nil? ? 'NULL' : "'#{model.attributes[key]}'") end inserts << "(#{values.join(',')})" end ! sql ="INSERT IGNORE INTO #{self.class.table_name}(#{columns.join(',')}) VALUES #{inserts.join(",")}" self.class.connection.execute sql end end
  9. Verification class SourceHistory < Source::Base … def query_source Source::History.since latest_post_date

    end ! def source_count Source::History.since(EPOCH.to_date).count end ! def latest_date (self.class.maximum(:Date) || EPOCH).to_date end ! def verify_count logger = Rails.logger if self.source_count != self.class.count logger.error "source_history_etl: counts don't match" end ! logger.info "source_history_etl: source count #{self.source_count}" logger.info "source_history_etl: destination count #{self.class.count}" end ! def import(models) puts " importing #{models.length}..." columns = self.class.column_names inserts = [] ! models.each do |model| values = columns.inject([]) do |memo,key| memo << (model.attributes[key].nil? ? 'NULL' : "'#{model.attributes[key]}'") end inserts << "(#{values.join(',')})" end ! sql ="INSERT IGNORE INTO #{self.class.table_name}(#{columns.join(',')}) VALUES #{inserts.join(",")}" self.class.connection.execute sql end end
  10. Timing desc "imports new history data from source database" task

    :source_history => :environment do puts "[ #{Time.now} ] importing history from the Source database (new sql)" SourceHistory.import puts "[ #{Time.now} ] imported #{SourceHistory.count} new entries into the SourceHistory" end
  11. 2. Declare your ETL require 'petl' ! module ETL::Example extend

    Petl extend self ! def extract max_id = Destination.maximum(:id) Source.since(max_id).collect(&:attributes) end ! def transform rows rows.collect do |row| row.slice *Destination.column_names end end ! def load rows Destination.create! rows end ! def source_count Source.since(0).count end ! def destination_count Destination.count end end
  12. 3. Run it! • ETL::Example.perform • Output ! Nov 05

    23:10:04 [info] ETL::Example starting at Wed Nov 06 07:10:04 +0000 2013! Nov 05 23:11:12 [info] ETL::Example: source count 98383! Nov 05 23:11:12 [info] ETL::Example: destination count 98383! Nov 05 23:11:12 [info] ETL::Example finished at Wed Nov 06 07:11:12 +0000 2013. Took 68 seconds
  13. Before class SourceHistory < Source::Base EPOCH = "1970-01-01".to_time set_table_name :source_history

    def query_source Source::History.since latest_date end ! def source_count Source::History.since(EPOCH.to_date).count end ! def latest_date (self.class.maximum(:Date) || EPOCH).to_date end def verify_count logger = Rails.logger if self.source_count != self.class.count logger.error "source_history_etl: counts don't match" end logger.info "source_history_etl: source count #{self.source_count}" logger.info "source_history_etl: destination count #{self.class.count}" end ! def import(models) puts " importing #{models.length}..." columns = self.class.column_names inserts = [] ! models.each do |model| values = columns.inject([]) do |memo,key| memo << (model.attributes[key].nil? ? 'NULL' : "'#{model.attributes[key]}'") end inserts << "(#{values.join(',')})" end ! sql ="INSERT IGNORE INTO #{self.class.table_name} (#{columns.join(',')}) VALUES #{inserts.join(",")}" self.class.connection.execute sql end end desc "imports new history data from source database" task :source_history => :environment do puts "[ #{Time.now} ] importing history from the Source database (new sql)" SourceHistory.import puts "[ #{Time.now} ] imported #{SourceHistory.count} new entries into the SourceHistory" end
  14. After require 'petl' ! module ETL::Example extend Petl extend self

    ! def extract max_id = Destination.maximum(:id) Source.since(max_id).collect(&:attributes) end ! def transform rows rows.collect do |row| row.slice *Destination.column_names end end ! def load rows Destination.import! rows end ! def source_count Source.since(0).count end ! def destination_count Destination.count end end desc "Update history table from LA Pro" task :history_raw => :environment do ETL::Example.perform end
  15. Batching module ETL::Example ... def batch true end ! def

    extract batch_size = 500000, &block max_id = Destination.maximum(:ID) || 0 ! Source.after(max_id).find_in_batches(:batch_size => batch_size) do |batch| yield batch.collect(&:attributes) end end ... end
  16. One big SQL insert module ETL::Example … def load rows

    Destination.import! rows end … end
  17. Benchmarks • 200 000 rows • 6 hours => 25

    mins ! • 6 million rows • 1 day => 2 hours
  18. Thank you • Me on the web • http://alextamoykin.com !

    • ZestFinance is hiring • http://www.zestfinance.com/careers.html