impedance was important: a relational DB was not a good fit
- Clustering is built in
- JSON document base for easy-ish doc translation
- Lucene query syntax is supported (see the example below)
- Fast!
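On the Lucene point: as a rough illustration, this is the kind of query string that syntax allows, shown here as a Go string constant. The field names and values are made up for the example, not taken from the actual system.

    // Illustrative only: a Lucene-style query string (field:value terms,
    // quoted phrases, ranges, boolean operators).
    const exampleQuery = `title:"annual report" AND year:[2010 TO 2013] AND NOT status:draft`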
    @manager = manager
  end

  def fetch(query)
    # BEWARE: Sending the results one at a time
    # to be transformed is much, much slower than
    # in batch!
    results = @connection.fetch(query)
    if results.size > 0
      puts "Found #{results.size} documents."
      @manager.async.transform(results)
    end
  end
end
  def transform(documents = [])
    puts "Processing #{documents.size} documents."
    # Use map (rather than each) so the transformed documents are collected
    docs = documents.map do |document|
      # Make the transformation
      document
    end
    @manager.async.load(docs)
  end
end
  def load(documents = [])
    puts "Sending documents to target datastore."
    documents.each do |document|
      # Load the document into the target datastore
    end
    # Tell the manager we're done
    @manager.async.done(documents.size)
  end
end
# Generate the 28K or so queries
queries = generate_queries
manager = Manager.new

# Send each query off to the worker pipeline
queries.each { |query| manager.async.extract(query) }

# TODO: find a way to know when all jobs are finished
while true # !manager.done?
  sleep 0.1
end

manager.shutdown
puts "Finished."
    var queries chan string
    // The stage channels must be created before the workers can send on them
    extracted := make(chan Document)
    transformed := make(chan string)

    // GetQueries returns a channel and
    // will close the channel when it's finished
    queries = GetQueries()
    db := GetConnection()

    wg.Add(3)
    startExtractors(db, queries, extracted, &wg)
    // Transformers read from extracted and write to transformed
    startTransformers(extracted, transformed, &wg)
    startLoaders(transformed, &wg)

    // Use the WaitGroup to block until all
    // workers have finished with all channels
    wg.Wait()
}
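The comment above describes a generator: GetQueries hands back a channel immediately and closes it once every query has been sent, which is what lets the extractors' range loops terminate. A minimal sketch under that assumption; the SQL text is invented, the 28K count comes from the Ruby version above, and fmt is assumed to be imported.

    func GetQueries() chan string {
        out := make(chan string)
        go func() {
            // Feed the ~28K queries into the channel from a goroutine
            for i := 0; i < 28000; i++ {
                out <- fmt.Sprintf("SELECT * FROM legacy_docs WHERE batch_id = %d", i)
            }
            // Closing the channel is the "I'm finished" signal the extractors rely on
            close(out)
        }()
        return out
    }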
func startExtractors(db *sqlx.DB, queries chan string, extracted chan Document,
    wg *sync.WaitGroup) {
    var localWg sync.WaitGroup
    go func() {
        for i := 0; i < 15; i++ {
            // Track each worker in the pool's local WaitGroup
            localWg.Add(1)
            go extract(db, queries, extracted, &localWg)
        }
        // Wait until all extractors call Done,
        // which happens after the queries channel is drained
        // and each worker finishes
        localWg.Wait()
        // Now tell the manager we are finished
        wg.Done()
        // When queries closes and all extracts finish,
        // wg stops blocking and the channel is closed
        close(extracted)
    }()
}
// This is the inner func called by the worker pool
func extract(db *sqlx.DB, queries chan string, extracted chan Document,
    wg *sync.WaitGroup) {
    // When the func finishes, tell the WaitGroup
    defer wg.Done()
    // Listen on the queries channel until it is closed
    for sql := range queries {
        // Create a slice of structs to hold the results
        documents := []Document{}
        err := db.Select(&documents, sql)
        if err != nil {
            fmt.Println(err)
        }
        // Send these documents off to be transformed
        for _, doc := range documents {
            extracted <- doc
        }
    }
}
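db.Select filling a []Document looks like sqlx's API, which scans result columns into struct fields via `db` tags. A sketch of what the Document type might look like under that assumption; the fields are invented, since the legacy schema isn't shown.

    // Illustrative only: a Document struct for sqlx's db.Select to scan into.
    // The real fields depend on the legacy schema.
    type Document struct {
        ID    int    `db:"id"`
        Title string `db:"title"`
        Body  string `db:"body"`
    }

The transform and load stages aren't shown here, but they follow the same shape as startExtractors: a pool tracked by a local WaitGroup, with the stage's output channel closed once its input is drained. A hedged sketch of startTransformers, with the pool size and the transformation itself as placeholders (fmt is assumed to be imported).

    func startTransformers(extracted chan Document, transformed chan string, wg *sync.WaitGroup) {
        var localWg sync.WaitGroup
        go func() {
            for i := 0; i < 15; i++ {
                localWg.Add(1)
                go func() {
                    defer localWg.Done()
                    // Drain extracted until the extractors close it
                    for doc := range extracted {
                        // Placeholder transformation: render the Document as a string
                        transformed <- fmt.Sprintf("%+v", doc)
                    }
                }()
            }
            localWg.Wait()
            // This stage is finished: release main's WaitGroup and tell
            // the loaders that no more transformed documents are coming
            wg.Done()
            close(transformed)
        }()
    }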
303% cpu 3:00.56 total

Ruby 2.0.0-p247
Finished 4456992 documents.
./bin/extractor  242.21s user  20.98s system  104% cpu  4:11.05 total

Ruby 2.0.0-p247
sidekiq -r ./bin/extractor-worker.rb -c 45  194.24s user  28.70s system  100% cpu  3:42.47 total

Go
./bin/extractor2  68.41s user  4.73s system  362% cpu  20.150 total
the system

NSQ is a good external queue solution for Go (a small sketch follows these notes).

In Ruby, using many Sidekiq processes can help with batch management and CPU utilization. This could also be achieved by forking or other such process spawning - it's just that Sidekiq makes it pretty simple, and it's easy enough to have its workers run your Celluloid actors.

Instead of shelling out, a message bus or queue can be used to coordinate the activities of heterogeneous processes. Context dependent, of course.

MarkLogic is another early document-based store, based on XML and XQuery. The company was founded in 2001, though it's not immediately clear when the database was released.
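For the NSQ note above: a minimal sketch of what pushing pipeline work through an external queue could look like, assuming the go-nsq client (github.com/nsqio/go-nsq) and an nsqd running on its default port. The topic, channel, and message contents are made up for the example.

    package main

    import (
        "fmt"
        "log"

        "github.com/nsqio/go-nsq"
    )

    func main() {
        config := nsq.NewConfig()

        // Consumer side: subscribe to an "extracted" topic on a "transform" channel
        consumer, err := nsq.NewConsumer("extracted", "transform", config)
        if err != nil {
            log.Fatal(err)
        }
        consumer.AddHandler(nsq.HandlerFunc(func(m *nsq.Message) error {
            // Transform/load the document carried in the message body here
            fmt.Printf("got %d bytes\n", len(m.Body))
            return nil // nil marks the message as successfully handled
        }))
        if err := consumer.ConnectToNSQD("127.0.0.1:4150"); err != nil {
            log.Fatal(err)
        }

        // Producer side: an extractor process publishes documents to the same topic
        producer, err := nsq.NewProducer("127.0.0.1:4150", config)
        if err != nil {
            log.Fatal(err)
        }
        if err := producer.Publish("extracted", []byte(`{"id": 1}`)); err != nil {
            log.Fatal(err)
        }

        // A real program would block here (e.g. on a signal) before shutting down
        producer.Stop()
        consumer.Stop()
    }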