Processing data with Ruby and Kiba ETL

In this talk: why processing data matters, and how to do it with Kiba ETL in Ruby.

Check out the recording, which includes a live coding session, at
https://www.youtube.com/watch?v=6hmNuXXIKf4

Thibaut Barrère

November 21, 2015

Transcript

  1. PROCESSING DATA WITH KIBA ETL
    by Thibaut Barrère
    www.kiba-etl.org http://thibautbarrere.com

  2. WHY PROCESS DATA?
    Photo: Raymond Bryson

  3. BUILDING PRODUCTS THAT MAKE DATA SPEAK
    LEGACY / CRM / HEALTHCARE
    (unable to answer questions despite having the data)
    ETL
    MODERN PRODUCT / APP
    (tailored to answer those questions)

  4. LEVERAGING DATA SYNC AS INTEGRATION MARKETING
    Your SaaS app
    Well established SaaS app with large customer base

  5. YOU CAN DO MORE WITH ETL & DATA PROCESSING…
    ▸ Migrate your data from one schema to another (with grace).
    ▸ Generate reports automatically.
    ▸ Synchronize all or part of the data between two of your apps.
    ▸ Prepare and clean data before indexing it for full-text search.
    ▸ Aggregate data from multiple sources inside your searchable app.
    ▸ Geocode records to present them online to your users.
    ▸ Implement a data import or export for your users.

  6. A USEFUL TECHNIQUE
    TO PROCESS DATA?
    Photo: Raymond Bryson

  7. www.kiba-etl.org

  8. SOURCE
    TRANSFORM
    TRANSFORM
    TRANSFORM
    DESTINATION
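
The source, transform, destination flow on this slide can be sketched in plain Ruby. This is not Kiba's actual implementation, only a minimal illustration of the idea; `ArraySource`, `ArrayDestination`, and `run_pipeline` are hypothetical names, and the sample rows are made up.

```ruby
# Hypothetical in-memory source: yields each row (a Hash) to the caller.
class ArraySource
  def initialize(rows)
    @rows = rows
  end

  def each
    @rows.each { |row| yield row }
  end
end

# Hypothetical in-memory destination: collects the rows it receives.
class ArrayDestination
  attr_reader :rows

  def initialize
    @rows = []
  end

  def write(row)
    @rows << row
  end

  def close; end
end

# Pass every row from the source through each transform in order.
# A transform returning nil drops the row from the pipeline.
def run_pipeline(source, transforms, destination)
  source.each do |row|
    transforms.each do |transform|
      row = transform.call(row)
      break if row.nil?
    end
    destination.write(row) unless row.nil?
  end
  destination.close
end

source = ArraySource.new([{ name: 'Ann', year: 1985 }, { name: 'Bob', year: 2004 }])
transforms = [
  ->(row) { row.merge(name: row[:name].upcase) },
  ->(row) { row[:year] < 2000 ? row : nil }
]
destination = ArrayDestination.new

run_pipeline(source, transforms, destination)
p destination.rows
```

Only the first sample row reaches the destination, because the second transform returns nil for rows with a year of 2000 or later.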

  9. require_relative 'my_components'
    source CSVSource, filename: 'patients.csv'
    transform ParseDate, from: 'bd_001', to: :birth_date
    transform do |row|
      # returning nil drops the row
      row[:birth_date].year < 2000 ? row : nil
    end
    destination InsertToDatabase, db_config, table: 'patients'

  10. $ bundle exec kiba my-etl-script.etl

  11. SOURCE
    class YourSource
      def initialize(parameters)
      end
      def each
        # yield each row (a Hash) to the block
      end
    end
    source YourSource, setting: 'value'

  12. require 'csv'
    class CSVSource
      def initialize(filename:)
        @filename = filename
      end
      def each
        csv = CSV.open(@filename, headers: true)
        csv.each do |row|
          yield(row.to_hash)
        end
        csv.close
      end
    end
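
Because a source is a plain Ruby class, it can be exercised outside of any ETL job. The sketch below repeats the `CSVSource` class from the slide so it runs on its own; the `read_sample_rows` helper and the sample file contents are assumptions made for illustration.

```ruby
require 'csv'
require 'tempfile'

# Same source as on the slide, repeated here so the example is self-contained.
class CSVSource
  def initialize(filename:)
    @filename = filename
  end

  def each
    csv = CSV.open(@filename, headers: true)
    csv.each do |row|
      yield(row.to_hash)
    end
    csv.close
  end
end

# Hypothetical helper: writes a small made-up CSV file, then reads it back
# through the source and returns the rows as an array of hashes.
def read_sample_rows
  Tempfile.create(['patients', '.csv']) do |file|
    file.write("name,bd_001\nAnn,1985-03-02\nBob,2004-11-30\n")
    file.flush

    rows = []
    CSVSource.new(filename: file.path).each { |row| rows << row }
    rows
  end
end

read_sample_rows.each { |row| puts row.inspect }
```

Each row arrives as a Hash keyed by the CSV header strings, with every value still a String; parsing is left to the transforms.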

  13. require 'recurly'
    class RecurlyInvoices
      def initialize(from:, to:, fields:, cache: NullCache.new)
        @range = (from..to)
        @cache = cache
        @fields = fields
      end
      def each
        @range.each do |number|
          cache_key = ([number] + @fields).map(&:to_s).join(':')
          row = @cache.fetch(cache_key) do
            invoice = Recurly::Invoice.find(number)
            invoice.attributes.slice(*@fields)
          end
          yield row.dup
        end
      end
    end
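
The slide does not show `NullCache` or what `my_cache` might be. The sketch below is one possible shape for them, assuming only the `fetch(key) { ... }` interface the source relies on; both classes are hypothetical and not taken from the talk.

```ruby
# Hypothetical no-op cache: always runs the block and stores nothing.
class NullCache
  def fetch(_key)
    yield
  end
end

# Hypothetical in-memory cache: runs the block only on the first lookup of a key.
class MemoryCache
  def initialize
    @store = {}
  end

  def fetch(key)
    return @store[key] if @store.key?(key)

    @store[key] = yield
  end
end

cache = MemoryCache.new
calls = 0
2.times do
  cache.fetch('1000:invoice_number') do
    calls += 1
    { 'invoice_number' => 1000 }
  end
end
puts calls # prints 1: the second lookup is served from the cache
```

With a cache like this, the `row.dup` in the source presumably keeps downstream transforms from mutating the cached Hash.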

  14. recurly_fields = [
      'invoice_number',
      'total_in_cents'
    ]
    source RecurlyInvoices,
      from: 1000,
      to: 2000,
      fields: recurly_fields,
      cache: my_cache

  15. TRANSFORM
    class YourTransform
      def initialize(parameters)
      end
      def process(row)
        # modify the row here; return nil instead to drop it
        row
      end
    end
    transform YourTransform, setting: 'value'

  16. require 'date'
    class ParseDate
      def initialize(from:, to:, format:)
        @from, @to = from, to
        @format = format
      end
      def process(row)
        row[@to] = Date.strptime(row[@from], @format)
        row
      end
    end
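
A transform like this can be called directly on a single row, which is what makes it easy to test in isolation. The sketch below repeats `ParseDate` from the slide so it is self-contained; the sample row and the `'%d/%m/%Y'` format are assumptions chosen for illustration.

```ruby
require 'date'

# Same transform as on the slide, repeated here so the example runs on its own.
class ParseDate
  def initialize(from:, to:, format:)
    @from, @to = from, to
    @format = format
  end

  def process(row)
    row[@to] = Date.strptime(row[@from], @format)
    row
  end
end

# Made-up input row; the date format is an assumption for this example.
row = { 'bd_001' => '02/03/1985' }
transform = ParseDate.new(from: 'bd_001', to: :birth_date, format: '%d/%m/%Y')
parsed = transform.process(row)

puts parsed[:birth_date].iso8601
```

The original `'bd_001'` field is left in place; the parsed `Date` is added under the `:birth_date` key.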

  17. DESTINATION
    class YourDestination
      def initialize(parameters)
      end
      def write(row)
        # persist the row
      end
      def close
        # release resources (files, connections)
      end
    end
    destination YourDestination, setting: 'value'

  18. require 'csv'
    class CSVDestination
      def initialize(output_file)
        @csv = CSV.open(output_file, 'w')
        @headers_written = false
      end
      def write(row)
        unless @headers_written
          @headers_written = true
          @csv << row.keys
        end
        @csv << row.values
      end
      def close
        @csv.close
      end
    end
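
To see the header handling at work, the destination can be driven by hand. The sketch below repeats `CSVDestination` from the slide; the `write_sample_rows` helper and the sample rows are assumptions made for illustration.

```ruby
require 'csv'
require 'tempfile'

# Same destination as on the slide, repeated here so the example runs on its own.
class CSVDestination
  def initialize(output_file)
    @csv = CSV.open(output_file, 'w')
    @headers_written = false
  end

  def write(row)
    unless @headers_written
      @headers_written = true
      @csv << row.keys
    end
    @csv << row.values
  end

  def close
    @csv.close
  end
end

# Hypothetical helper: writes two made-up rows and returns the file contents.
def write_sample_rows
  Tempfile.create(['out', '.csv']) do |file|
    destination = CSVDestination.new(file.path)
    destination.write({ name: 'Ann', year: 1985 })
    destination.write({ name: 'Bob', year: 2004 })
    destination.close
    File.read(file.path)
  end
end

puts write_sample_rows
```

The header line is taken from the keys of the first row only, so this destination assumes every row has the same keys in the same order.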

  19. KIBA ETL
    ▸ Code-centric ETL.
    ▸ Versioned in git (branches, yay!).
    ▸ Testable components, with clear separation of concerns.
    ▸ Reusable components across jobs.
    ▸ ETL jobs that are easy to maintain in the very long run.
    ▸ Ruby ecosystem (tap into gems for extra features).
    ▸ Blueprints and components soon available in Kiba Pro.

  20. Building a data-backed SaaS app?
    Get in touch, I can work with you.
    Thibaut Barrère ([email protected])
