pETL - Pretty good ETL framework

pETL - Pretty good ETL framework

Data Warehouse in Ruby, why not? I'll demonstrate how we Open Sourced our homegrown ETL engine. I'll show the challenges that we had and how we solved it here at ZestFinance.

Presented at LA Ruby Meetup http://www.meetup.com/laruby/events/142080482/

36b118e795c80c058d942b7af047ea67?s=128

Alexander Tamoykin

November 14, 2013
Tweet

Transcript

  1. PETL - Pretty good ETL framework Alexander Tamoykin Senior Software

    Engineer @ Zest Finance http://alextamoykin.com
  2. Story • What is ETL • First try • PETL

    gem
  3. ETL What ?

  4. Data Warehouse • Extract • Transform • Load

  5. ETL Visualization

  6. Why ETL ? • Pros • Answer business questions fast

    • Many departments need it (Marketing, Accounting, Analysis etc) • SWE don’t have to deal with it • Cons • Support • Maintenance • Monitoring
  7. First Try desc "imports new history data from source database"

    task :source_history => :environment do puts "[ #{Time.now} ] importing history from the Source database (new sql)" SourceHistory.import puts "[ #{Time.now} ] imported #{SourceHistory.count} new entries into the SourceHistory" end
  8. class SourceHistory < Source::Base EPOCH = "1970-01-01".to_time ! set_table_name :source_history

    ! def query_source Source::History.since latest_post_date end ! def source_count Source::History.since(EPOCH.to_date).count end ! def latest_date (self.class.maximum(:Date) || EPOCH).to_date end ! def verify_count logger = Rails.logger if self.source_count != self.class.count logger.error "source_history_etl: counts don't match" end ! logger.info "source_history_etl: source count #{self.source_count}" logger.info "source_history_etl: destination count #{self.class.count}" end ! def import(models) puts " importing #{models.length}..." columns = self.class.column_names inserts = [] ! models.each do |model| values = columns.inject([]) do |memo,key| memo << (model.attributes[key].nil? ? 'NULL' : "'#{model.attributes[key]}'") end inserts << "(#{values.join(',')})" end ! sql ="INSERT IGNORE INTO #{self.class.table_name}(#{columns.join(',')}) VALUES #{inserts.join(",")}" self.class.connection.execute sql end end
  9. None
  10. No Structure class SourceHistory < Source::Base EPOCH = "1970-01-01".to_time !

    set_table_name :source_history ! def query_source Source::History.since latest_post_date end ! def source_count Source::History.since(EPOCH.to_date).count end ! def latest_date (self.class.maximum(:Date) || EPOCH).to_date end ! def verify_count logger = Rails.logger if self.source_count != self.class.count logger.error "source_history_etl: counts don't match" end ! logger.info "source_history_etl: source count #{self.source_count}" logger.info "source_history_etl: destination count #{self.class.count}" end ! def import(models) puts " importing #{models.length}..." columns = self.class.column_names inserts = [] ! models.each do |model| values = columns.inject([]) do |memo,key| memo << (model.attributes[key].nil? ? 'NULL' : "'#{model.attributes[key]}'") end inserts << "(#{values.join(',')})" end ! sql ="INSERT IGNORE INTO #{self.class.table_name}(#{columns.join(',')}) VALUES #{inserts.join(",")}" self.class.connection.execute sql end end
  11. Code Duplication desc "import telephony agent qa files which have

    not yet been imported" task :update_agent_qa => :environment do ! start_count = TelephonyAgentQa.count start_time = Time.now TelephonyAgentQa.update finish_count = TelephonyAgentQa.count finish_time = Time.now ! puts "imported #{finish_count - start_count} new telephony agent qa entries" puts "in #{finish_time - start_time} seconds" end
  12. More problems • Fear caused by unfamiliarity • Bugs •

    Spotty test coverage • Sad users :(
  13. Solution • Structure • Automatic • Familiar

  14. Look at existing ETLs

  15. Extract Transform Load class SourceHistory < Source::Base … def query_source

    Source::History.since latest_post_date end ! def source_count Source::History.since(EPOCH.to_date).count end ! def latest_date (self.class.maximum(:Date) || EPOCH).to_date end ! def verify_count logger = Rails.logger if self.source_count != self.class.count logger.error "source_history_etl: counts don't match" end ! logger.info "source_history_etl: source count #{self.source_count}" logger.info "source_history_etl: destination count #{self.class.count}" end ! def import(models) puts " importing #{models.length}..." columns = self.class.column_names inserts = [] ! models.each do |model| values = columns.inject([]) do |memo,key| memo << (model.attributes[key].nil? ? 'NULL' : "'#{model.attributes[key]}'") end inserts << "(#{values.join(',')})" end ! sql ="INSERT IGNORE INTO #{self.class.table_name}(#{columns.join(',')}) VALUES #{inserts.join(",")}" self.class.connection.execute sql end end
  16. Verification class SourceHistory < Source::Base … def query_source Source::History.since latest_post_date

    end ! def source_count Source::History.since(EPOCH.to_date).count end ! def latest_date (self.class.maximum(:Date) || EPOCH).to_date end ! def verify_count logger = Rails.logger if self.source_count != self.class.count logger.error "source_history_etl: counts don't match" end ! logger.info "source_history_etl: source count #{self.source_count}" logger.info "source_history_etl: destination count #{self.class.count}" end ! def import(models) puts " importing #{models.length}..." columns = self.class.column_names inserts = [] ! models.each do |model| values = columns.inject([]) do |memo,key| memo << (model.attributes[key].nil? ? 'NULL' : "'#{model.attributes[key]}'") end inserts << "(#{values.join(',')})" end ! sql ="INSERT IGNORE INTO #{self.class.table_name}(#{columns.join(',')}) VALUES #{inserts.join(",")}" self.class.connection.execute sql end end
  17. Timing desc "imports new history data from source database" task

    :source_history => :environment do puts "[ #{Time.now} ] importing history from the Source database (new sql)" SourceHistory.import puts "[ #{Time.now} ] imported #{SourceHistory.count} new entries into the SourceHistory" end
  18. PETL • Extract • Transform • Load • Verify •

    Time
  19. Simple as 1, 2, 3

  20. 1. Installation gem install petl

  21. 2. Declare your ETL require 'petl' ! module ETL::Example extend

    Petl extend self ! def extract max_id = Destination.maximum(:id) Source.since(max_id).collect(&:attributes) end ! def transform rows rows.collect do |row| row.slice *Destination.column_names end end ! def load rows Destination.create! rows end ! def source_count Source.since(0).count end ! def destination_count Destination.count end end
  22. 3. Run it! • ETL::Example.perform • Output ! Nov 05

    23:10:04 [info] ETL::Example starting at Wed Nov 06 07:10:04 +0000 2013! Nov 05 23:11:12 [info] ETL::Example: source count 98383! Nov 05 23:11:12 [info] ETL::Example: destination count 98383! Nov 05 23:11:12 [info] ETL::Example finished at Wed Nov 06 07:11:12 +0000 2013. Took 68 seconds
  23. Before class SourceHistory < Source::Base EPOCH = "1970-01-01".to_time set_table_name :source_history

    def query_source Source::History.since latest_date end ! def source_count Source::History.since(EPOCH.to_date).count end ! def latest_date (self.class.maximum(:Date) || EPOCH).to_date end def verify_count logger = Rails.logger if self.source_count != self.class.count logger.error "source_history_etl: counts don't match" end logger.info "source_history_etl: source count #{self.source_count}" logger.info "source_history_etl: destination count #{self.class.count}" end ! def import(models) puts " importing #{models.length}..." columns = self.class.column_names inserts = [] ! models.each do |model| values = columns.inject([]) do |memo,key| memo << (model.attributes[key].nil? ? 'NULL' : "'#{model.attributes[key]}'") end inserts << "(#{values.join(',')})" end ! sql ="INSERT IGNORE INTO #{self.class.table_name} (#{columns.join(',')}) VALUES #{inserts.join(",")}" self.class.connection.execute sql end end desc "imports new history data from source database" task :source_history => :environment do puts "[ #{Time.now} ] importing history from the Source database (new sql)" SourceHistory.import puts "[ #{Time.now} ] imported #{SourceHistory.count} new entries into the SourceHistory" end
  24. After require 'petl' ! module ETL::Example extend Petl extend self

    ! def extract max_id = Destination.maximum(:id) Source.since(max_id).collect(&:attributes) end ! def transform rows rows.collect do |row| row.slice *Destination.column_names end end ! def load rows Destination.import! rows end ! def source_count Source.since(0).count end ! def destination_count Destination.count end end desc "Update history table from LA Pro" task :history_raw => :environment do ETL::Example.perform end
  25. Need for speed

  26. Batching module ETL::Example ... def batch true end ! def

    extract batch_size = 500000, &block max_id = Destination.maximum(:ID) || 0 ! Source.after(max_id).find_in_batches(:batch_size => batch_size) do |batch| yield batch.collect(&:attributes) end end ... end
  27. Load module ETL::Example … def load rows Destination.create! rows end

    … end
  28. One big SQL insert module ETL::Example … def load rows

    Destination.import! rows end … end
  29. Benchmarks • 200 000 rows • 6 hours => 25

    mins ! • 6 million rows • 1 day => 2 hours
  30. That’s pETL http://github.com/ZestFinance/petl

  31. Even 70 lines of code is enough to give back

    to the community
  32. Thank you • Me on the web • http://alextamoykin.com !

    • ZestFinance is hiring • http://www.zestfinance.com/careers.html