Kiba ETL v2 - RubyKaigi 2018

Kiba 2 - Past, present & future of data processing with Ruby.

Kiba ETL (http://www.kiba-etl.org) is a lightweight, generic data processing framework for Ruby, initially released in 2015 & now in v2.

In this talk, I highlight why Kiba was created, how it is used for low-maintenance data preparation and processing in the enterprise, why and how version 2 (leveraging Ruby's Enumerator) brings a massive improvement in authoring reusable & composable data processing components, and why I'm optimistic about the future of data processing with Ruby.

Thibaut Barrère

May 31, 2018

Transcript

  1. KIBA ETL
    Past, present & future of
    data processing with Ruby
    RubyKaigi 2018 - Thibaut Barrère (France), independent consultant
    [email protected] / twitter.com/thibaut_barrere

  2. KIBA ETL?
    lightweight, generic
    data processing framework
    for Ruby
    initially released in 2015 & now at v2.

  3. Kiba == Fang (scrunch your data)

  4. EXTRACT
    TRANSFORM
    LOAD
    (see http://thibautbarrere.com)

  5. (image-only slide)

  6. Back in time
    Why was Kiba ETL created?

  7. 2006
    ❤ activewarehouse-etl
    Thanks Anthony Eden! (now founder at DNSimple)

  8. source :in, {
       :file => 'extracted/crm-changes.csv',
       :parser => :delimited,
       :skip_lines => 1
     }, [:first_name, :last_name]
     after_read :ensure_fields_presence, { :fields => fields }
     transform(:id_partenaire) { |n,v,r| some_computation }
     destination :out, {
       :file => output_file,
       :include_headers => true,
       :separator => ";"
     }

  9. transform(:email_provider) do |name, value, row|
       row.fetch(:email).downcase.split('@').last
     end
     transform :email_provider, :default,
       :default_value => 'Unknown'
     before_write do |r|
       r[:email_provider] =~ /Hotmail/ ? nil : r
     end

  10. before_write do |row|
        position, status = geocode_with_redis_cache(row)
        row[:geocoding_status] = status
        row[:latitude] = position.latitude
        row[:longitude] = position.longitude
        row
      end

  11. $ etl my_etl_script.etl

  12. Use cases 2006-2012
    Extract CSV data from a ($$$) CRM
    Import to MySQL (low-cost Business Intelligence)
    Geocode & export to "nearby search" Rails app
    IBAN validation & export to COBOL back-end
    More details available in RuLu 2012 talk¹
    ¹ See YouTube video & SpeakerDeck slides

  13. 2011 Took over maintenance

  14. ETL::Parser::Parser: SaxParser, FixedWidthParser, NokogiriXmlParser, XmlParser, ExcelParser,
    CsvParser
    ETL::Processor::Processor: FtpUploaderProcessor, FtpDownloaderProcessor, EscapeCsvProcessor,
    ZipFileProcessor, SftpDownloaderProcessor, SftpUploaderProcessor,
    ImapattachmentDownloaderProcessor, EncodeProcessor, TruncateProcessor, BulkImportProcessor,
    Pop3attachmentDownloaderProcessor
    ETL::Processor::RowProcessor: BlockProcessor, EnsureFieldsPresenceProcessor,
    DatabaseJoinProcessor, RequireNonBlankProcessor, CopyFieldProcessor, SequenceProcessor,
    PrintRowProcessor, SurrogateKeyProcessor, FilterRowProcessor, CheckExistProcessor,
    HierarchyExploderProcessor, CheckUniqueProcessor, RenameProcessor
    ETL::Transform: StringToDateTimeTransform, DateToStringTransform, HierarchyLookupTransform,
    SplitFieldsTransform, DecodeTransform, StringToTimeTransform, Sha1Transform, TypeTransform,
    OrdinalizeTransform, BlockTransform, StringToDateTransform, Md5Transform, CalculationTransform,
    TrimTransform, ForeignKeyLookupTransform
    Source: ModelSource, DatabaseSource, FileSource
    Destination: CsvDestination, ExcelDestination, YamlDestination, DatabaseDestination,
    InsertUpdateDatabaseDestination, UpdateDatabaseDestination, FileDestination

  15. module ETL #:nodoc:
       module Parser #:nodoc:
         # Parses CSV files
         class CsvParser < ETL::Parser::Parser
           # Initialize the parser
           # * source: The Source object
           # * options: Hash of options for the parser, defaults to an empty hash
           def initialize(source, options={})
             super
             configure
           end

           attr_reader :validate_rows

           def get_fields_names(file)
             File.open(file) do |input|
               fields = CSV.parse(input.readline, options).first
               new_fields = []
               fields.each_with_index do |field, index|
                 # compute the index of occurrence of this specific occurrence of the field (usually, will be 1)
                 occurrence_index = fields[0..index].find_all { |e| e == field }.size
                 number_of_occurrences = fields.find_all { |e| e == field }.size
                 new_field = field + (number_of_occurrences > 1 ? "_#{occurrence_index}" : "")
                 new_fields << Field.new(new_field.to_sym)
               end
               return new_fields
             end
           end

           # Returns each row.
           def each
             Dir.glob(file).each do |file|
               ETL::Engine.logger.debug "parsing #{file}"
               if fields.length == 0
                 ETL::Engine.logger.debug "no columns specified so reading names from first line of #{file}"
                 @fields = get_fields_names(file)
               end
               line = 0
               lines_skipped = 0
               CSV.foreach(file, options) do |raw_row|
                 if lines_skipped < source.skip_lines
                   ETL::Engine.logger.debug "skipping line"
                   lines_skipped += 1
                   next
                 end
                 line += 1
                 row = {}
                 validate_row(raw_row, line, file) if self.validate_rows
                 raw_row.each_with_index do |value, index|
                   f = fields[index]
                   row[f.name] = value
                 end
                 yield row
               end
             end
           end

           # Get an array of defined fields
           def fields
             @fields ||= []
           end

           private

           def validate_row(row, line, file)
             ETL::Engine.logger.debug "validating line #{line} in file #{file}"
             if row.length != fields.length
               raise_with_info( MismatchError,
                 "The number of columns from the source (#{row.length}) does not match the number of columns in the definition (#{fields.length})",
                 line, file
               )
             end
           end

           def configure
             @validate_rows = if source.configuration.has_key?(:validate_rows)
               source.configuration[:validate_rows]
             else
               true
             end
             source.definition.each do |options|
               case options
               when Symbol
                 fields << Field.new(options)
               when Hash
                 fields << Field.new(options[:name])
               else
                 raise DefinitionError, "Each field definition must either be a symbol or a hash"
               end
             end
           end

           class Field #:nodoc:
             attr_reader :name
             def initialize(name)
               @name = name
             end
           end
         end
       end
     end

  16. Photo Credit: National Fire Protection Association

  17. 2013 - Maintenance stopped
    Too many features
    Yet, features lacking flexibility
    Custom components complicated to implement
    Costly to maintain (code + CI)

  18. But I still have data processing needs!
    source
    transform
    transform
    transform
    destination
    and that row-based DSL syntax is really great for maintenance

  19. 2008 TinyTL (a tiny ETL)
    Clean-room implementation
    Minimalistic (113 lines total)
    2 small production apps

  20. 2015 Kiba ETL v0.5.0
    Lightweight.
    Flexible (write your own components)
    Simple assumptions (PORO)
    Standalone (no ActiveRecord dependency)
    Strong focus on documentation & articles
    Easy to "keep alive" as a maintainer

  21. 5 keywords only
    pre_process
    source
    transform
    destination
    post_process
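
     For reference, a minimal job sketch exercising all five keywords might look like this (CSVSource / CSVDestination are the example components detailed on the next slides; the rest is illustrative):

     pre_process do
       puts "Starting job"
     end

     source CSVSource,
       filename: 'data/extract.csv',
       csv_options: { col_sep: ';' }

     # a block transform receives each row and returns 0 or 1 row
     transform do |row|
       row.merge(imported_at: Time.now)
     end

     destination CSVDestination, filename: 'data/output.csv'

     post_process do
       puts "Job done"
     end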

  22. source CSVSource,
        filename: 'data/extract.csv',
        csv_options: {
          col_sep: ';',
          encoding: 'ISO-8859-1:UTF-8'
        }
      transform MyLookup, config: { xxx }
      transform do |row|
        {
          siren: row.fetch('SIREN'),
          libapen: row.fetch('LIBAPEN')
        }
      end
      destination CSVDestination,
        filename: 'data/output.csv'

  23. Source
    def initialize(*args)
    def each (yields N rows)

  24. class CSVSource
        def initialize(options)
          @filename = options.fetch(:filename)
          @csv_options = options.fetch(:csv_options)
        end

        def each
          CSV.foreach(@filename, @csv_options) do |row|
            yield(row)
          end
        end
      end

  25. Transform (as class)
    def initialize(*args)
    def process(row) (returns 1 or 0 row)
    Transform (as block)
    transform { |row| xxx } (returns 1 or 0 row)
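
     As an illustration (not from the deck), a class-based transform honouring this contract could be as small as:

     class NormalizeEmail
       def initialize(field: :email)
         @field = field
       end

       # return the row (possibly modified), or nil to discard it
       def process(row)
         value = row[@field]
         return nil if value.nil?
         row.merge(@field => value.strip.downcase)
       end
     end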

  26. Destination
    def initialize(*args)
    def write(row) (writes 1 row)
    def close

  27. class CSVDestination
        def initialize(filename:, csv_options: {}, headers: nil)
          @filename = filename
          @csv_options = csv_options
          @headers = headers
        end

        def write(row)
          @csv ||= CSV.open(@filename, 'wb', @csv_options)
          @headers ||= row.keys
          @headers_written ||= (@csv << @headers ; true)
          @csv << row.fetch_values(*@headers)
        end

        def close
          @csv&.close
        end
      end

  28. 2015-2018
    Numerous production uses
    Almost no core changes to Kiba
    Components easy to author & maintain
    "Long Term Support" of gem possible as solo dev

  29. Sustainable project evolution

  30. #1 Micro-batches
    < 50k rows per job
    Very near realtime sync

  31. (image-only slide)

  32. Push mode

  33. Pull mode (with push notify)

  34. Don't shell out from HTTP or Sidekiq
    ❌ ❌ ❌
    def perform
    system("kiba my-import-job.etl")
    end
    ❌ ❌ ❌

  35. Use the Programmatic API (e.g. call from Sidekiq²)
      class PartnersUploadProcessorWorker
        include Sidekiq::Worker

        def perform(options)
          job = Kiba.parse do
            source CSVSource, filename: options.fetch('filename')
            transform ...
            transform ...
            transform ...
            destination SQLUpsert, connection: SEQUEL_DB, table: xxx
          end
          Kiba.run(job)
        end
      end
      ² Kiba ETL Wiki: Considerations for running Kiba jobs programmatically

  36. Extract the job declaration
      module ETL
        module SyncPartners
          module_function

          def setup(source_file, sequel_connection, logger)
            Kiba.parse do
              source
              transform
              transform
              transform
              destination
            end
          end
        end
      end
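
     A worker can then simply build and run the extracted job — a sketch assuming the SEQUEL_DB constant from the earlier Sidekiq example:

     class SyncPartnersWorker
       include Sidekiq::Worker

       def perform(source_file)
         job = ETL::SyncPartners.setup(source_file, SEQUEL_DB, Sidekiq.logger)
         Kiba.run(job)
       end
     end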

  37. #2 Multistep batch processing
    (enterprise data aggregation)
    N different ERP systems (1 per SaaS client)
    Different extraction methods
    Different input formats (fields & data)
    Target: a single, common schema

  38. (image-only slide)

  39. (image-only slide)

  40. Keep things simple (cron + bash)
    #!/usr/bin/env bash
    set -e
    bundle exec kiba common/s3_downloader.etl
    bundle exec kiba client_acme/extractor.etl
    bundle exec kiba client_acme/transformer.etl
    bundle exec kiba common/api_uploader.etl
    bundle exec kiba common/cleaner.etl

  41. (image-only slide)

  42. Amazon S3 = inbox
    Email from ERP (CSV/XLSX) -> Amazon SES -> S3
    ERP -> SFTP export -> S3
    Capybara -> XLSX download from ERP -> S3

  43. Pre-extraction from S3 to local file system
    Generic local download step (Kiba script)
    Move from /unprocessed to /processed
    S3 lifecycle rules to ensure data removal
     config = {
       aws_region: ENV.fetch('ACME_CORP_AWS_REGION'),
       aws_bucket: ENV.fetch('ACME_CORP_AWS_BUCKET'),
       aws_kms_key_id: ENV.fetch('ACME_CORP_AWS_KMS_KEY_ID')
     }
    source ETL::S3::ListObjects, config
    transform ETL::S3::Downloader
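
     The ETL::S3 components are project-specific and not shown in the deck; here is a hypothetical sketch of a listing source following Kiba's Source contract, using the aws-sdk-s3 gem (the unprocessed/ prefix and row shape are assumptions):

     require 'aws-sdk-s3'

     class ListObjects
       # aws_kms_key_id is accepted for symmetry with the config hash above,
       # but would only matter for the actual download step
       def initialize(aws_region:, aws_bucket:, aws_kms_key_id: nil, prefix: 'unprocessed/')
         @client = Aws::S3::Client.new(region: aws_region)
         @bucket = aws_bucket
         @prefix = prefix
       end

       # yield one row per S3 object, for a later transform to download
       def each
         @client.list_objects_v2(bucket: @bucket, prefix: @prefix).contents.each do |object|
           yield(bucket: @bucket, key: object.key, size: object.size)
         end
       end
     end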

  44. Amazon S3 = message passing
    between machines (scalability!)

  45. Keep things DRY & enforce re-use
    Sources, transforms, destinations
    "Meta-transforms"
    Steps (whole Kiba scripts)

  46. Actual component re-use between pipelines
      module DSLExtensions
        module DefaultValue
          def default_value(field, default_value)
            transform do |row|
              value = row.fetch(field)
              row.merge(field => value.blank? ? default_value : value)
            end
          end
        end
      end

      extend DSLExtensions::DefaultValue

      default_value :some_field, 'Unknown'

  47. More actual component re-use between pipelines
    AddTimestamps
    DefaultValue
    EnforceValuesChoice
    RenameField
    HashLookup
    RaiseOnBlanks
    RemapBoolean
    (more...)
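
     These components are application code rather than part of Kiba itself; a RenameField transform in that spirit might be as small as:

     class RenameField
       def initialize(from:, to:)
         @from = from
         @to = to
       end

       def process(row)
         row[@to] = row.delete(@from)
         row
       end
     end

     # usage in a job declaration (field names are illustrative):
     # transform RenameField, from: :libapen, to: :label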

  48. #3 Automation of internal tasks

  49. (image-only slide)

  50. (image-only slide)

  51. (image-only slide)

  52. source ETL::Sources::HTTPPaginatingFetcher,
        base_url: bank_base_url,
        headers: { Authorization: bank_auth }
      assert_equal :content_type, 'application/json'
      transform { |r| r.fetch(:body) }
      transform { |r| JSON.parse(r) }
      transform { |r| r.fetch('transactions') }
      transform Kiba::Common::Transforms::EnumerableExploder
      assert_equal 'currency', 'EUR'
      # SNIP
      destination Kiba::Pro::Destinations::SQLUpsert,
        database: ENV.fetch('DATABASE_URL'),
        unique_key: :fit_id,
        table: :bank_transactions

  53. source ETL::Sources::OFXSource,
        dir_pattern: "files/*.ofx"
      transform do |row|
        {
          fit_id: row.fit_id,
          memo: row.memo,
          name: row.name,
          posted_at: row.posted_at.to_date,
          amount: row.amount
        }
      end
      destination Kiba::Pro::Destinations::SQLUpsert,
        database: ENV.fetch('DATABASE_URL'),
        unique_key: :fit_id,
        table: :bank_transactions

  54. Source = PDF files

  55. PDF download & transform
    pre_process do
    system!("wget -A pdf ... https://vat-site.com")
    end
    source Kiba::Common::Enumerable, -> { Dir["downloads/*.pdf"] }
    transform { |r| { filename: r } }
    transform PDFToText, :filename_key => :filename
    # SNIP - extraction, verification
    destination CSVDestination, file: 'vat_rates.csv'

  56. class PDFToText
        # keyword args so the transform can be declared as on the previous slide;
        # content_key and pdftotext_args get illustrative defaults here
        def initialize(filename_key:, content_key: :content, pdftotext_args: '')
          @filename = filename_key
          @content = content_key
          @pdftotext_args = pdftotext_args
        end

        def process(row)
          cmd = "pdftotext #{@pdftotext_args} #{row[@filename]} -"
          output = system_with_output!(cmd) # helper running the command & returning stdout
          row[@content] = output
          row
        end
      end

  57. #4 Big rewrites & data migrations

  58. (image-only slide)

  59. (image-only slide)

  60. Use MiniTest & Sequel for "data screens"
      class Screens < ETL::BaseTest
        def test_well_known_record
          amount = db[:orders].where(id: 12456).get(:amount_with_vat)
          assert_equal 100.27, amount
        end

        def test_counts
          count = db[:orders].count
          assert_in_delta(230_000, count, 10_000)
        end

        def test_states
          states = db[:orders].distinct.select_map(:state).sort
          assert_equal ['complete', 'pending'], states
        end
      end

  61. Key takeaways from data migrations
      Use "Bulk insert" (Kiba Pro, Sequel, ...)
      Use deterministic ids & reset the sequence (see the sketch below)
      Bypass ActiveRecord validations:
      validate_uniqueness_of :email, unless: -> { @etl_running }
      Work on a data subset for iteration
      Measure & optimize continuously
      Use screens to test specific & global data
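
     A sketch of the "deterministic ids & reset sequence" point, assuming Sequel on PostgreSQL and an orders table (names are illustrative):

     require 'sequel'

     db = Sequel.connect(ENV.fetch('DATABASE_URL'))

     # rows were inserted with explicit, deterministic ids;
     # realign the id sequence so later application inserts do not collide
     db.run <<~SQL
       SELECT setval('orders_id_seq', (SELECT COALESCE(MAX(id), 1) FROM orders))
     SQL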

  62. Why Kiba v2?
    mostly a drop-in replacement

  63. class XMLSource
        def initialize(...)
          @dir_pattern = dir_pattern
        end

        def each
          Dir[@dir_pattern].sort.each do |file|
            doc = Nokogiri::XML(IO.binread(file))
            doc.search('/invoices/invoice').each do |item|
              yield(item)
            end
          end
        end
      end

  64. class CSVSource
        def initialize(...)
          @dir_pattern = dir_pattern
        end

        def each
          Dir[@dir_pattern].sort.each do |file|
            CSV.foreach(file, csv_options) do |row|
              yield(row)
            end
          end
        end
      end

  65. Kiba v2 "StreamingRunner"
     def transform_stream(stream, t)
       Enumerator.new do |y|
         stream.each do |input_row|
           returned_row = t.process(input_row) do |yielded_row|
             y << yielded_row
           end
           y << returned_row if returned_row
         end
       end
     end
    ❤ Ruby's Enumerator
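
     Conceptually, the runner then folds every transform over the source stream — a simplified sketch rather than the verbatim Kiba internals (source_stream stands in for however the sources are turned into an Enumerator):

     stream = source_stream(sources)
     stream = transforms.inject(stream) { |s, t| transform_stream(s, t) }
     stream.each { |row| destinations.each { |d| d.write(row) } }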

  66. v2 transforms can yield N rows, not just sources
      More component re-use + more composition
      class EnumerableExploder
        def process(row)
          row.each { |item| yield(item) }
          nil
        end
      end

      source DirectoryLister, dir_pattern: '*.csv'
      transform XMLReader
      transform XMLSearcher, selector: '/invoices/invoice'
      transform EnumerableExploder

  67. Happy with current data processing toolkit
    Kiba (core, OSS)
    Kiba Common (extra generic components, OSS)
    Kiba Pro (fast SQL, S3, profilers...)
    Tons of gems for pretty much any format

  68. More performance available soon
    TruffleRuby
    Ruby 2.6.0
    JIT improvements
    GIL & Guilds

  69. Ruby ETL performance trend⁴
      ⁴ https://github.com/thbar/kiba-ruby-benchmarks

  70. Data + Ruby = ❤
    Great asset for enterprise data glue
    Efficient code reuse between data pipelines
    Components can be tested quite easily (PORO)
    Wealth of libraries to tap into
    Raw speed (single threaded) good & improving
    Concurrency story will improve too!
