Slide 1

Slide 1 text

PROCESSING DATA WITH KIBA ETL
www.kiba-etl.org
by Thibaut Barrère (http://thibautbarrere.com)

Slide 2

Slide 2 text

WHY PROCESS DATA? Photo: Raymond Bryson

Slide 3

Slide 3 text

BUILDING PRODUCTS THAT MAKE DATA SPEAK
LEGACY / CRM / HEALTHCARE (unable to answer questions despite having the data) → ETL → MODERN PRODUCT / APP (tailored to answer those questions)

Slide 4

Slide 4 text

LEVERAGING DATA SYNC AS INTEGRATION MARKETING
Your SaaS app ↔ Well-established SaaS app with a large customer base

Slide 5

Slide 5 text

YOU CAN DO MORE WITH ETL & DATA PROCESSING…
▸ Migrate your data from one schema to another (with grace).
▸ Generate reports automatically.
▸ Synchronize all or part of the data between two of your apps.
▸ Prepare and clean data before indexing it for full-text search.
▸ Aggregate data from multiple sources inside your searchable app.
▸ Geocode records to present them online to your users.
▸ Implement a data import or export for your users.

Slide 6

Slide 6 text

A USEFUL TECHNIQUE TO PROCESS DATA? Photo: Raymond Bryson

Slide 7

Slide 7 text

www.kiba-etl.org

Slide 8

Slide 8 text

SOURCE → TRANSFORM → TRANSFORM → TRANSFORM → DESTINATION
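
The pipeline above can be sketched in plain Ruby. This is only an illustration of how rows flow, not Kiba's actual runner: `run_pipeline`, `ArrayDestination`, the lambdas and the sample rows are all hypothetical names introduced here. Each row passes through the transforms in order, and a transform that returns nil drops the row, which is the behavior the block transform on the next slide relies on.

```ruby
# Illustrative sketch only: NOT Kiba's real runner.
# Assumptions: `source` responds to #each, each transform responds to #call
# (plain lambdas here), and `destination` responds to #write and #close.
def run_pipeline(source, transforms, destination)
  source.each do |row|
    transforms.each do |transform|
      row = transform.call(row)
      break if row.nil? # a nil result drops the row
    end
    destination.write(row) unless row.nil?
  end
  destination.close
end

# Hypothetical in-memory destination, handy for inspecting results.
class ArrayDestination
  attr_reader :rows

  def initialize
    @rows = []
  end

  def write(row)
    @rows << row
  end

  def close; end
end

rows = [{ name: 'alice', age: 30 }, { name: 'bob', age: 17 }]
upcase_name = ->(row) { row.merge(name: row[:name].upcase) }
adults_only = ->(row) { row[:age] >= 18 ? row : nil }

destination = ArrayDestination.new
run_pipeline(rows, [upcase_name, adults_only], destination)
p destination.rows # only the "ALICE" row remains
```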

Slide 9

Slide 9 text

require_relative 'my_components'

source CSVSource, filename: 'patients.csv'

transform ParseDate, from: 'bd_001', to: :birth_date

transform do |row|
  row[:birth_date].year < 2000 ? row : nil
end

destination InsertToDatabase, db_config, table: 'patients'

Slide 10

Slide 10 text

$ bundle exec kiba my-etl-script.etl

Slide 11

Slide 11 text

SOURCE

class YourSource
  def initialize(parameters)
  end

  def each
    # yield each row here
  end
end

source YourSource, setting: 'value'

Slide 12

Slide 12 text

require 'csv'

class CSVSource
  def initialize(filename:)
    @filename = filename
  end

  def each
    csv = CSV.open(@filename, headers: true)
    csv.each do |row|
      yield(row.to_hash)
    end
    csv.close
  end
end
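
Because a source is a plain Ruby class, it can be exercised without any ETL runner. The sketch below repeats the `CSVSource` class so it runs on its own; the temporary file and its contents are made up for illustration. Note that the rows come out as hashes with string keys taken from the CSV header.

```ruby
require 'csv'
require 'tempfile'

# Same class as on the slide, repeated so the example is self-contained.
class CSVSource
  def initialize(filename:)
    @filename = filename
  end

  def each
    csv = CSV.open(@filename, headers: true)
    csv.each do |row|
      yield(row.to_hash)
    end
    csv.close
  end
end

# Hypothetical sample data written to a temporary file.
file = Tempfile.new(['patients', '.csv'])
file.write("name,bd_001\nAlice,1985-03-12\n")
file.close

rows = []
CSVSource.new(filename: file.path).each { |row| rows << row }
file.unlink

p rows # one hash per CSV line, keyed by header name
```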

Slide 13

Slide 13 text

require 'recurly'

class RecurlyInvoices
  def initialize(from:, to:, fields:, cache: NullCache.new)
    @range = (from..to)
    @cache = cache
    @fields = fields
  end

  def each
    @range.each do |number|
      cache_key = ([number] + @fields).map(&:to_s).join(':')
      row = @cache.fetch(cache_key) do
        invoice = Recurly::Invoice.find(number)
        invoice.attributes.slice(*@fields)
      end
      yield row.dup
    end
  end
end
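
The `NullCache` default is not shown on the slides. A minimal sketch, assuming the cache only needs a `fetch(key)` method that takes a block (the single call `RecurlyInvoices#each` makes): `NullCache` always runs the block, while `MemoryCache` is a hypothetical in-memory variant that remembers results.

```ruby
# Assumed interface: fetch(key) { compute }. Not taken from the original slides.
class NullCache
  # Never stores anything: the block runs on every call.
  def fetch(_key)
    yield
  end
end

# Hypothetical in-memory cache exposing the same interface.
class MemoryCache
  def initialize
    @store = {}
  end

  def fetch(key)
    return @store[key] if @store.key?(key)

    @store[key] = yield
  end
end

null_calls = 0
null_cache = NullCache.new
2.times { null_cache.fetch('1000:invoice_number') { null_calls += 1 } }

memory_calls = 0
memory_cache = MemoryCache.new
2.times do
  memory_cache.fetch('1000:invoice_number') do
    memory_calls += 1
    { 'invoice_number' => 1000 }
  end
end

puts null_calls   # block ran on both calls
puts memory_calls # block ran only on the first call
```

With a cache that hands back the same stored object each time, yielding `row.dup` from the source keeps later transforms from mutating the cached copy.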

Slide 14

Slide 14 text

recurly_fields = [
  'invoice_number',
  'total_in_cents'
]

source RecurlyInvoices,
  from: 1000, to: 2000,
  fields: recurly_fields,
  cache: my_cache

Slide 15

Slide 15 text

TRANSFORM

class YourTransform
  def initialize(parameters)
  end

  def process(row)
    # return the modified row, or nil to drop it
  end
end

transform YourTransform, setting: 'value'

Slide 16

Slide 16 text

require 'date'

class ParseDate
  def initialize(from:, to:, format:)
    @from, @to = from, to
    @format = format
  end

  def process(row)
    row[@to] = Date.strptime(row[@from], @format)
    row
  end
end
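
A transform can also be called directly, which makes it easy to check in isolation. In this sketch the class is repeated so the example runs on its own; the sample row and the `'%Y-%m-%d'` format are assumptions chosen for illustration, not values from the slides.

```ruby
require 'date'

# Same class as on the slide, repeated so the example is self-contained.
class ParseDate
  def initialize(from:, to:, format:)
    @from, @to = from, to
    @format = format
  end

  def process(row)
    row[@to] = Date.strptime(row[@from], @format)
    row
  end
end

# Hypothetical input row and date format.
transform = ParseDate.new(from: 'bd_001', to: :birth_date, format: '%Y-%m-%d')
row = transform.process({ 'bd_001' => '1985-03-12' })

puts row[:birth_date].year # → 1985
```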

Slide 17

Slide 17 text

DESTINATION

class YourDestination
  def initialize(parameters)
  end

  def write(row)
    # persist the row
  end

  def close
    # release resources
  end
end

destination YourDestination, setting: 'value'

Slide 18

Slide 18 text

require 'csv'

class CSVDestination
  def initialize(output_file)
    @csv = CSV.open(output_file, 'w')
    @headers_written = false
  end

  def write(row)
    # Emit the header line once, using the keys of the first row.
    unless @headers_written
      @headers_written = true
      @csv << row.keys
    end
    @csv << row.values
  end

  def close
    @csv.close
  end
end
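
The destination can be driven by hand in the same way. The sketch below repeats the class so it runs on its own and writes two made-up rows to a temporary file; the header line comes from the keys of the first row, so every row is assumed to carry the same keys in the same order.

```ruby
require 'csv'
require 'tempfile'

# Same class as on the slide, repeated so the example is self-contained.
class CSVDestination
  def initialize(output_file)
    @csv = CSV.open(output_file, 'w')
    @headers_written = false
  end

  def write(row)
    unless @headers_written
      @headers_written = true
      @csv << row.keys
    end
    @csv << row.values
  end

  def close
    @csv.close
  end
end

# Hypothetical output location and rows.
file = Tempfile.new(['patients', '.csv'])
file.close

destination = CSVDestination.new(file.path)
destination.write({ 'name' => 'Alice', 'age' => 30 })
destination.write({ 'name' => 'Bob', 'age' => 25 })
destination.close

output = File.read(file.path)
file.unlink

puts output
```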

Slide 19

Slide 19 text

KIBA ETL
▸ Code-centric ETL.
▸ Versioned in git (branches, yay!).
▸ Testable components, with clear separation of concerns.
▸ Reusable components across jobs.
▸ ETL jobs that are easy to maintain over the very long run.
▸ Ruby ecosystem (tap into gems for extra features).
▸ Blueprints and components soon available in Kiba Pro.

Slide 20

Slide 20 text

Building a data-backed SaaS app? 
 Get in touch, I can work with you. Thibaut Barrère ([email protected])