
ETL with Ruby & Go

A presentation to NYC.rb discussing concurrency in Ruby and Go as used in a client ETL process.

David Michael

July 09, 2013

Transcript

  1. Extract data from a source
     Transform it to fit “operational needs”
     Load it into the target data store
     (*Thanks, Wikipedia)
  2. Improve the functioning of the database
     Greater query flexibility
     Centralized, consistent view of the data
     Faceted search
     Reduced reliance on Java/JRuby
  3. Also considered: MongoDB, RethinkDB, Riak, Solr, and the relational
     options PostgreSQL and MySQL
     Data impedance was important: a relational DB was not a good fit
     Clustering built in
     JSON document base for easy-ish doc translation
     Lucene query syntax is supported
     Fast!
  4. 1. Retrieve document(s) from MySQL
     2. For each document, find all the relationships
     3. Retrieve all related documents, grouped by “relationship type”
     4. Repeat
     (A rough sketch of this walk follows below.)
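
     Not from the slides: a rough sketch of this walk in Go, assuming
     hypothetical Document/Relationship types, table names, and column
     names; the real schema is never shown in the talk.

     package main

     import "github.com/jmoiron/sqlx"

     // Hypothetical types standing in for the real schema
     type Document struct {
       ID   int    `db:"id"`
       JSON string `db:"json"`
     }

     type Relationship struct {
       DocID     int    `db:"doc_id"`
       RelatedID int    `db:"related_id"`
       Type      string `db:"type"`
     }

     // walk retrieves a document, its relationships, and the related
     // documents grouped by relationship type (steps 1-3 above)
     func walk(db *sqlx.DB, id int) (map[string][]Document, error) {
       var doc Document
       if err := db.Get(&doc, "SELECT * FROM documents WHERE id = ?", id); err != nil {
         return nil, err
       }
       rels := []Relationship{}
       if err := db.Select(&rels, "SELECT * FROM relationships WHERE doc_id = ?", id); err != nil {
         return nil, err
       }
       related := map[string][]Document{}
       for _, rel := range rels {
         var rdoc Document
         if err := db.Get(&rdoc, "SELECT * FROM documents WHERE id = ?", rel.RelatedID); err != nil {
           return nil, err
         }
         related[rel.Type] = append(related[rel.Type], rdoc)
       }
       // step 4: repeat the walk for each related document as needed
       return related, nil
     }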
  5. 1. Parse JSON column
     2. Add metadata columns to JSON
     3. Embed and flatten related documents
     (Sketched below.)
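
     Not from the slides: a minimal sketch of such a transform in Go;
     the map-based document shape and parameter names are assumptions.

     package main

     import "encoding/json"

     // transformDoc parses the JSON column, merges in metadata
     // columns, and embeds related documents under their
     // relationship type, returning the flattened JSON
     func transformDoc(raw string, meta map[string]interface{},
       related map[string][]json.RawMessage) ([]byte, error) {
       // 1. Parse the JSON column
       doc := map[string]interface{}{}
       if err := json.Unmarshal([]byte(raw), &doc); err != nil {
         return nil, err
       }
       // 2. Add the metadata columns to the JSON
       for k, v := range meta {
         doc[k] = v
       }
       // 3. Embed the related documents, keyed by relationship type
       for relType, docs := range related {
         doc[relType] = docs
       }
       return json.Marshal(doc)
     }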
  6. JRuby has native threads and no GIL, but can be slower than YARV
     for other reasons, like JDBC
  7. class Extractor
       include Celluloid

       def initialize(connection, manager)
         @connection = connection
         @manager = manager
       end

       def fetch(query)
         # BEWARE: Sending the results one at a time
         # to be transformed is much, much slower than
         # in batch!
         results = @connection.fetch(query)
         if results.size > 0
           puts "Found #{results.size} documents."
           @manager.async.transform(results)
         end
       end
     end
  8. class Transformer
       include Celluloid

       def initialize(manager)
         @manager = manager
       end

       def transform(documents = [])
         puts "Processing #{documents.size} documents."
         # map (not each) so the transformed documents are collected
         docs = documents.map do |document|
           # Make the transformation
           document
         end
         @manager.async.load(docs)
       end
     end
  9. class Loader
       include Celluloid

       def initialize(manager)
         @manager = manager
       end

       def load(documents = [])
         puts "Sending documents to target datastore."
         documents.each do |document|
           # Send the document to the target datastore
         end
         # Tell the manager we're done
         @manager.async.done(documents.size)
       end
     end
  10. class Manager
        include Celluloid

        # Let's use Sequel's threaded connection pooling
        def self.connection
          @connection ||= Sequel.connect
        end

        def initialize
          @extractors   = Extractor.pool(size: 10, args: [self.class.connection, current_actor])
          @transformers = Transformer.pool(size: 10, args: [current_actor])
          @loaders      = Loader.pool(size: 10, args: [current_actor])
        end

        # The query should result in documents
        def extract(query)
          @extractors.async.fetch(query)
        end

        # (done? and shutdown elided on the slide)
      end
  11. #!/usr/bin/env ruby
      require 'rubygems'
      require 'bundler/setup'
      require 'celluloid/autostart'

      # Create the 28K or so queries
      queries = generate_queries
      manager = Manager.new

      # Send each query off to the worker pipeline
      queries.each { |query| manager.async.extract(query) }

      # TODO: find a way to know when all jobs are finished
      while true # !manager.done?
        sleep 0.1
      end

      manager.shutdown
      puts "Finished."
  12. package main

      import (
        "fmt"
        "time"
      )

      func sleepy(name string) {
        fmt.Println("Zzzz ...")
        time.Sleep(1 * time.Second)
      }

      func main() {
        for i := 0; i < 5; i++ {
          go sleepy("boring!")
        }
        // NOTE: main returns immediately, so the program may
        // exit before any of the goroutines get to run
      }
  13. Channels allow you to pass references to data structures between
      goroutines.* (A minimal example follows below.)

      * http://golang.org/doc/codewalk/sharemem/
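
      Not on the slide, but a minimal example of the idea: the sender
      hands each value off through the channel, so only one goroutine
      touches the data at a time.

      package main

      import "fmt"

      func main() {
        ch := make(chan string)

        go func() {
          for _, doc := range []string{"doc-1", "doc-2"} {
            ch <- doc // hand the value off to the receiver
          }
          close(ch) // signal that no more values are coming
        }()

        // range drains the channel until it is closed
        for doc := range ch {
          fmt.Println("received", doc)
        }
      }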
  14. func main() {
        var wg sync.WaitGroup

        // Channels must be created with make before use;
        // sending on a nil channel blocks forever
        extracted := make(chan Document)
        transformed := make(chan string)

        // GetQueries returns a channel and will
        // close it when it is finished
        queries := GetQueries()
        db := GetConnection()

        wg.Add(3)
        startExtractors(db, queries, extracted, &wg)
        startTransformers(extracted, transformed, &wg)
        startLoaders(transformed, &wg)

        // Use the WaitGroup to block until all
        // workers have finished with all channels
        wg.Wait()
      }
  15. func startExtractors(
        db *sqlx.DB,
        queries chan string,
        extracted chan Document,
        wg *sync.WaitGroup) {

        var localWg sync.WaitGroup
        go func() {
          for i := 0; i < 15; i++ {
            localWg.Add(1)
            go extract(db, queries, extracted, &localWg)
          }
          // Wait until all extractors call Done, which happens
          // after the queries channel is drained
          localWg.Wait()
          // All extracts have finished, so the downstream
          // channel can be closed
          close(extracted)
          // Now tell the manager we are finished
          wg.Done()
        }()
      }
  16. // This is the inner func called by the worker pool
      func extract(
        db *sqlx.DB,
        queries chan string,
        extracted chan Document,
        wg *sync.WaitGroup) {

        // When the func finishes, tell the WaitGroup
        defer wg.Done()

        // Listen on the queries channel until it is closed
        for sql := range queries {
          // Create a slice of structs to hold the results
          documents := []Document{}
          err := db.Select(&documents, sql)
          if err != nil {
            fmt.Println(err)
          }
          // Send these documents off to be transformed
          for _, doc := range documents {
            extracted <- doc
          }
        }
      }
  17. func transform(
        extracted chan Document,
        transformed chan string,
        wg *sync.WaitGroup) {

        // When the func finishes, tell the WaitGroup
        defer wg.Done()

        for document := range extracted {
          json, err := doSomethingCrazy(document)
          if err != nil {
            fmt.Println(err)
          } else {
            transformed <- json
          }
        }
      }
  18. func load(transformed chan string, wg *sync.WaitGroup) {
        // When the func finishes, tell the WaitGroup
        defer wg.Done()

        for json := range transformed {
          err := elasticsearch.Send(json)
          if err != nil {
            fmt.Println(err)
          } else {
            // hmm
          }
        }
      }
  19. Shelling out to Go lets you do things like put a nice web face
      on very fast processing (a sketch of a shell-out-friendly
      interface follows below)
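
      Not from the talk: one way to keep the Go binary friendly to a
      wrapping web process is a plain stdin/stdout interface, as in
      this sketch (the one-query-per-line protocol is an assumption).

      package main

      import (
        "bufio"
        "fmt"
        "os"
      )

      func main() {
        // The wrapping process writes one query per line on stdin
        // and reads progress lines back from stdout
        scanner := bufio.NewScanner(os.Stdin)
        for scanner.Scan() {
          query := scanner.Text()
          // ... run the pipeline for this query ...
          fmt.Printf("done: %s\n", query)
        }
        if err := scanner.Err(); err != nil {
          fmt.Fprintln(os.Stderr, err)
          os.Exit(1)
        }
      }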
  20. JRuby 1.7.4
        Finished 4456992 documents.
        ./bin/extractor  535.46s user  12.87s system  303% cpu  3:00.56 total

      Ruby 2.0.0-p247
        Finished 4456992 documents.
        ./bin/extractor  242.21s user  20.98s system  104% cpu  4:11.05 total

      Ruby 2.0.0-p247 (Sidekiq)
        sidekiq -r ./bin/extractor-worker.rb -c 45
          194.24s user  28.70s system  100% cpu  3:42.47 total

      Go
        ./bin/extractor2  68.41s user  4.73s system  362% cpu  20.150 total
  21. Shelling out makes it difficult to manage the resource footprint
      of the system. NSQ is a good external queue solution for Go. In
      Ruby, using many Sidekiq processes can help with batch management
      and CPU utilization; this could also be achieved by forking or
      other such process spawning, it's just that Sidekiq makes it
      pretty simple, and it's easy enough to have its workers run your
      Celluloid actors. Instead of shelling out, a message bus or queue
      can be used to coordinate the activities of heterogeneous
      processes (context dependent, of course; a sketch follows below).
      MarkLogic is another early document-based store, based on XML and
      XQuery; the company was founded in 2001, though it's not
      immediately clear when the database was released.
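
      Not from the talk: a minimal NSQ consumer sketch in Go. The topic
      and channel names and the nsqd address are made up, and the
      import path is the current go-nsq one (in 2013 it lived under
      bitly/go-nsq).

      package main

      import (
        "log"

        nsq "github.com/nsqio/go-nsq"
      )

      func main() {
        config := nsq.NewConfig()
        // subscribe to the hypothetical "queries" topic
        consumer, err := nsq.NewConsumer("queries", "extractors", config)
        if err != nil {
          log.Fatal(err)
        }

        consumer.AddHandler(nsq.HandlerFunc(func(m *nsq.Message) error {
          // each message body would be one extraction query
          log.Printf("got query: %s", m.Body)
          return nil // returning nil marks the message finished
        }))

        if err := consumer.ConnectToNSQD("127.0.0.1:4150"); err != nil {
          log.Fatal(err)
        }
        select {} // block forever; real code would handle signals
      }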