Processing data with Ruby and Kiba ETL

In this talk: why processing data matters, and how to do it with Kiba ETL in Ruby.

Check out the recording, which includes a live coding session, at
https://www.youtube.com/watch?v=6hmNuXXIKf4

Thibaut Barrère

November 21, 2015

Transcript

  1. BUILDING PRODUCTS THAT MAKE DATA SPEAK

    LEGACY / CRM / HEALTHCARE (unable to answer questions despite having the data) → ETL → MODERN PRODUCT / APP (tailored to answer those questions)
  2. LEVERAGING DATA SYNC AS INTEGRATION MARKETING

    Your SaaS app
    Well-established SaaS app with large customer base
  3. YOU CAN DO MORE WITH ETL & DATA PROCESSING…

    ▸ Migrate your data from one schema to another (with grace).
    ▸ Generate reports automatically.
    ▸ Synchronize all or part of two of your apps.
    ▸ Prepare and clean data before indexing it for full-text search.
    ▸ Aggregate data from multiple sources inside your searchable app.
    ▸ Geocode records to present them online to your users.
    ▸ Implement a data import or export for your users.
  4. require_relative 'my_components'

     source CSVSource, filename: 'patients.csv'
     transform ParseDate, from: 'bd_001', to: :birth_date
     transform do |row|
       row[:birth_date].year < 2000 ? row : nil
     end
     destination InsertToDatabase, db_config, table: 'patients'
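The job definition above only declares a source, some transforms and a destination. A minimal sketch of how such a declaration could be executed follows. It is not Kiba's implementation: `run_pipeline`, `ArraySource`, `ArrayDestination` and `UpcaseField` are hypothetical names made up for illustration, and the sketch assumes only the `each` / `process` / `write` / `close` contract visible in the slides, with a `nil` result dropping the row.

```ruby
# Hypothetical mini-runner, NOT Kiba's code: it mirrors the
# each / process / write / close contract shown in the slides.
def run_pipeline(source:, transforms:, destination:)
  source.each do |row|
    transforms.each do |transform|
      # Class-based transforms expose #process; block transforms are called.
      row = transform.respond_to?(:process) ? transform.process(row) : transform.call(row)
      break if row.nil? # a nil result removes the row from the pipeline
    end
    destination.write(row) unless row.nil?
  end
ensure
  destination.close
end

# In-memory source, for illustration only.
class ArraySource
  def initialize(rows)
    @rows = rows
  end

  def each
    @rows.each { |row| yield(row.dup) }
  end
end

# In-memory destination, for illustration only.
class ArrayDestination
  attr_reader :rows

  def initialize
    @rows = []
    @closed = false
  end

  def write(row)
    @rows << row
  end

  def close
    @closed = true
  end

  def closed?
    @closed
  end
end

# Class-based transform with the same shape as ParseDate.
class UpcaseField
  def initialize(field:)
    @field = field
  end

  def process(row)
    row[@field] = row[@field].upcase
    row
  end
end

source = ArraySource.new([{ name: 'ada', born: 1985 }, { name: 'bob', born: 2004 }])
destination = ArrayDestination.new
keep_pre_2000 = ->(row) { row[:born] < 2000 ? row : nil }

run_pipeline(
  source: source,
  transforms: [UpcaseField.new(field: :name), keep_pre_2000],
  destination: destination
)
```

After the run, `destination.rows` holds only the upcased 1985 row, because the filtering transform returned `nil` for the other one.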
  5. require 'csv'

     class CSVSource
       def initialize(filename:)
         @filename = filename
       end

       def each
         csv = CSV.open(@filename, headers: true)
         csv.each do |row|
           yield(row.to_hash)
         end
         csv.close
       end
     end
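As a quick usage sketch, the source above can be exercised on its own, outside any job. The class is reproduced from the slide so the example runs standalone; the sample file content is made up for illustration.

```ruby
require 'csv'
require 'tempfile'

# CSVSource as shown on the slide.
class CSVSource
  def initialize(filename:)
    @filename = filename
  end

  def each
    csv = CSV.open(@filename, headers: true)
    csv.each do |row|
      yield(row.to_hash)
    end
    csv.close
  end
end

# Made-up sample data, written to a temporary file.
file = Tempfile.new(['patients', '.csv'])
file.write("name,bd_001\nAda,1985-12-10\nBob,2004-03-02\n")
file.close

rows = []
CSVSource.new(filename: file.path).each { |row| rows << row }
file.unlink

# Each row is a Hash with String keys and String values.
p rows.first
```

Keys and values both come out as plain strings, which is why a later transform such as `ParseDate` is needed before the row can be filtered on a date.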
  6. require 'recurly'

     class RecurlyInvoices
       def initialize(from:, to:, fields:, cache: NullCache.new)
         @range = (from..to)
         @cache = cache
         @fields = fields
       end

       def each
         @range.each do |number|
           cache_key = ([number] + @fields).map(&:to_s).join(':')
           row = @cache.fetch(cache_key) do
             invoice = Recurly::Invoice.find(number)
             invoice.attributes.slice(*@fields)
           end
           yield row.dup
         end
       end
     end
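The slide references `NullCache` without defining it. The sketch below shows what objects satisfying the `fetch(key) { ... }` call used above could look like. Both classes are assumptions written for illustration, not code from the talk, and `MemoryCache` is a hypothetical name; no Recurly call is made here, a counter stands in for the remote lookup.

```ruby
# Hypothetical caches: only the `fetch(key) { ... }` shape comes from the slide.
class NullCache
  # Stores nothing: the block runs on every call.
  def fetch(_key)
    yield
  end
end

class MemoryCache
  def initialize
    @store = {}
  end

  # Runs the block only on a cache miss and remembers the result.
  def fetch(key)
    return @store[key] if @store.key?(key)

    @store[key] = yield
  end
end

# A counter stands in for the remote API call.
calls = 0
lookup = lambda do
  calls += 1
  { 'invoice_number' => 1001 }
end

cache = MemoryCache.new
first  = cache.fetch('1001:invoice_number') { lookup.call }
second = cache.fetch('1001:invoice_number') { lookup.call }
memory_calls = calls # the second fetch was served from memory

null_cache = NullCache.new
2.times { null_cache.fetch('1001:invoice_number') { lookup.call } }
null_calls = calls - memory_calls # the block ran both times
```

A cache like this hands back the very same Hash object on every hit, which is presumably why the source yields `row.dup`: later transforms can then modify the row without altering the cached copy.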
  7. require 'date'

     class ParseDate
       def initialize(from:, to:, format:)
         @from, @to = from, to
         @format = format
       end

       def process(row)
         row[@to] = Date.strptime(row[@from], @format)
         row
       end
     end
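Because a transform is a plain Ruby object, it can be tried in isolation. The sketch below reproduces the class from the slide and feeds it one row; the `'%Y-%m-%d'` format and the sample values are assumptions chosen for illustration.

```ruby
require 'date'

# ParseDate as shown on the slide.
class ParseDate
  def initialize(from:, to:, format:)
    @from, @to = from, to
    @format = format
  end

  def process(row)
    row[@to] = Date.strptime(row[@from], @format)
    row
  end
end

# The format string is an assumption; use whatever the input data follows.
parser = ParseDate.new(from: 'bd_001', to: :birth_date, format: '%Y-%m-%d')

result = parser.process({ 'bd_001' => '1985-12-10' })

# Input that does not match the format raises instead of passing through.
error = nil
begin
  parser.process({ 'bd_001' => 'not a date' })
rescue ArgumentError => e
  error = e
end
```

The parsed `Date` is stored under `:birth_date` while the original `'bd_001'` string is left in place. A malformed value raises an `ArgumentError`, so a job using this transform as written would stop on bad input rather than skip the row.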
  8. require 'csv'

     class CSVDestination
       def initialize(output_file)
         @csv = CSV.open(output_file, 'w')
         @headers_written = false
       end

       def write(row)
         unless @headers_written
           @headers_written = true
           @csv << row.keys
         end
         @csv << row.values
       end

       def close
         @csv.close
       end
     end
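A short usage sketch for the destination above: write two rows, close, and read the file back. The class is reproduced from the slide; the rows and the file name are made up for illustration.

```ruby
require 'csv'
require 'tmpdir'

# CSVDestination as shown on the slide.
class CSVDestination
  def initialize(output_file)
    @csv = CSV.open(output_file, 'w')
    @headers_written = false
  end

  def write(row)
    unless @headers_written
      @headers_written = true
      @csv << row.keys
    end
    @csv << row.values
  end

  def close
    @csv.close
  end
end

output = nil
Dir.mktmpdir do |dir|
  path = File.join(dir, 'patients_out.csv')

  destination = CSVDestination.new(path)
  destination.write({ name: 'Ada', birth_year: 1985 })
  destination.write({ name: 'Bob', birth_year: 1990 })
  destination.close # flushes and closes the underlying file

  output = File.read(path)
end

puts output
```

The header line is taken from the keys of the first row only, so this destination implicitly assumes every row carries the same keys in the same order.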
  9. KIBA ETL

    ▸ Code-centric ETL.
    ▸ Versioned in git (branches, yay!).
    ▸ Testable components, with clear separation of concerns.
    ▸ Reusable components across jobs.
    ▸ ETL jobs easy to maintain in the very long run.
    ▸ Ruby ecosystem (tap into gems for extra features).
    ▸ Blueprints and components soon available in Kiba Pro.