Bulk load data with AR & NoSQL - ConferenciaRoR 2016

In this talk I want to share my experience with bulk data loading on different projects using Active Record and NoSQL. I will talk about speed considerations, migrating millions of rows, external data from other systems, problems encountered, performance tuning, and some tricks learned along the way…

Carlos Sánchez Pérez

December 10, 2016

Transcript

  1. Bulk load data with ActiveRecord and NoSQL

     By @carlossanchezp - ConferenciaRoR 2016
     The only card I need is....... The Ace Of Spades

  2. Carlos Sánchez Pérez

     Pragmatic developer, Ruby/Rails choice at the moment, learning Scala,
     Elixir/Phoenix & DDD. Currently working as Web Dev at @The_Cocktail.
     Let's make it happen.
     Blog (only in Spanish): carlossanchezperez.wordpress.com
     Twitter: @carlossanchezp

  3. Overview

     1.- Project called Searcher:
         Bulk load of 300k per week
         CSV file data, Active Record
         Medium traffic volume
         Create / Update / Delete
         MySQL - 300k rows
         Thinking Sphinx full-text search tool
     2.- Project called E-commerce:
         Bulk load of 300k per day
         Connection with external services
         High traffic volume
         Create / Update / Logical deletion of products
         CouchDB - 1M products
         ElasticSearch search tool

  4. RoadMap

     1.- Project called Searcher:
         1. Architecture
         2. Models
         3. My first solution - Arrays
         4. Mistakes lead to good things
         5. Tips & Ideas
         6. With & without Active Record
     2.- Project called E-commerce:
         1. Architecture
         2. Real time
         3. Document design - NoSQL
         4. Superimpose processes
         5. Data load & synchronization
     3.- Logs & the power of information
     4.- Conclusions

  5. The reason why

     1) Penalized performance: "not in" with a big array (a lot of elements)
     2) "uniq" because we can have duplicated elements in the array
     3) "map" because we only need IDs to destroy them

  6. We realised we had to go over all these things:

     - Bad performance
     - The CSV file was read 3 times
     - Big arrays with IDs of processed data & "not in" SQL
     - Several SQL "SELECT"s were executed per CSV line

  7. Quick idea (see the sketch below)

     - Read the CSV file
     - Processed professional IDs collected into an Array
     - Delete the "not processed" records using "not in"
     - More than 24h to finish
     - A lot of SQL for the same model per line read

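     A minimal sketch of that first approach (model and column names are
     illustrative, not from the deck):

         require 'csv'

         # First approach: collect processed IDs into an Array, then
         # delete everything not seen using one big "not in".
         processed_ids = []

         CSV.foreach('professionals.csv', headers: true) do |row|
           # several SELECTs per CSV line to locate the record
           professional = Professional.find_or_initialize_by(external_id: row['external_id'])
           professional.update!(first_name: row['first_name'], last_name: row['last_name'])
           processed_ids << professional.id
         end

         # "uniq" because the file can contain duplicates; "not in"
         # over a huge array is what penalized performance.
         Professional.where.not(id: processed_ids.uniq).destroy_all
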
  8. Check IDs from file into SET

     # IDs with SET
     pop_ids(@set_ids_centers, center.id)
     pop_ids(@set_ids_professionals, professional.id)

     # Pops IDs to delete at the end
     def pop_ids(ids, id)
       ids.delete(id) if ids.include? id
     end

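     The final step is not shown on this slide; a minimal sketch of how the
     leftover IDs could be destroyed at the end, assuming the sets start out
     holding every existing ID (see slide 15):

         # Whatever is still in the sets was not present in the file,
         # so those records are the ones to delete.
         Center.where(id: @set_ids_centers.to_a).destroy_all
         Professional.where(id: @set_ids_professionals.to_a).destroy_all
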
  9. Array vs Set

     Benchmark.bm do |x|
       x.report "Set" do
         a = Set.new
         1000.times { a.add(rand(100)) }
       end
       x.report "Array" do
         a = []
         1000.times { r = rand(100); a << r unless a.include?(r) }
       end
     end

            user      system    total     real
     Set    0.000000  0.000000  0.000000 ( 0.001127)
     Array  0.000000  0.000000  0.000000 ( 0.000829)

  10. Array vs Set

      Benchmark.bm do |x|
        x.report "Set" do
          a = Set.new
          1000.times { a.add(rand(1000)) }
        end
        x.report "Array" do
          a = []
          1000.times { r = rand(1000); a << r unless a.include?(r) }
        end
      end

             user      system    total     real
      Set    0.000000  0.000000  0.000000 ( 0.003271)
      Array  0.000000  0.000000  0.000000 ( 0.003250)

  11. Array vs Set

      Benchmark.bm do |x|
        x.report "Set" do
          a = Set.new
          1000.times { a.add(rand(10000)) }
        end
        x.report "Array" do
          a = []
          1000.times { r = rand(10000); a << r unless a.include?(r) }
        end
      end

             user      system    total     real
      Set    0.000000  0.000000  0.000000 ( 0.001240)
      Array  0.010000  0.000000  0.010000 ( 0.010081)

             user      system    total     real
      Set    0.000000  0.000000  0.000000 ( 0.001545)
      Array  0.010000  0.010000  0.020000 ( 0.008941)

  12. The best way to use both of them

      1) With a small number of elements an Array is better
      2) With a large number of elements a Set is better
      (Set#include? is a hash lookup, while Array#include? scans the whole
      array, so the Set wins as the collection grows.)

  13. Pluck vs Map - examples

      users = User.all
      => [#<User id: 1, email: 'user1@example.com', active: true>,
          #<User id: 2, email: 'user2@example.com', active: false>]
      users.map(&:email)
      => ['user1@example.com', 'user2@example.com']

      # what I always used:
      User.all.map(&:email)

      emails = User.select(:email)
      => [#<User email: 'user1@example.com'>, #<User email: 'user2@example.com'>]
      emails.map(&:email)
      => ['user1@example.com', 'user2@example.com']

      User.pluck(:email)
      => ['user1@example.com', 'user2@example.com']
      User.where(active: true).pluck(:email)

  14. Benchmark Pluck vs map

      ActiveRecord::Base.logger.level = 1
      n = 1000
      Benchmark.bm do |x|
        x.report('Country.all.map(&:name): ') { n.times { Country.all.map(&:name) } }
        x.report('Country.pluck(:name):    ') { n.times { Country.pluck(:name) } }
      end

      ## Score
                                 user      system    total     real
      Country.all.map(&:name):   3.830000  0.140000  3.970000 ( 4.328655)
      Country.pluck(:name):      1.550000  0.040000  1.590000 ( 1.879490)

      pluck wins because it selects only the column and skips instantiating
      an ActiveRecord object per row.

  15. Load elements into SET

      # How to load
      set_ids_centers = Set.new get_ids_centers
      set_ids_professionals = Set.new get_ids_professionals

      # Methods only return IDs
      def get_ids_centers
        Center.pluck(:id)
      end

      def get_ids_professionals
        Professional.pluck(:id)
      end

  16. Validate - Database

      create_table "professionals", force: true do |t|
        t.string  "email",          null: false
        t.string  "first_name",     null: false
        t.string  "last_name",      null: false
        t.string  "personal_web",   default: "http://"
        t.string  "telephone"
        t.boolean "show_telephone", default: true, null: false
        t.boolean "show_email",     default: true, null: false
        t.text    "cv"
        t.integer "update_check",   default: 0
        t.boolean "delta",          default: true, null: false
        t.integer "type_id",        default: 0,    null: false
        t.string  "languages"
        t.string  "twitter"
        t.string  "numbercol",      limit: 30
        t.boolean "active",         default: true, null: false
      end

  17. Validate - Models

      class Professional < ActiveRecord::Base
        include NestedAttributeList, FriendlyId

        # Attributes
        friendly_id :full_name, use: :slugged

        # Validations
        validates :email, uniqueness: { case_sensitive: false }, allow_blank: true
        validates :first_name, presence: true
        validates :last_name, presence: true
        validates :type_id, presence: true
      end

  18. Delegate to the DB

      def skip_validations
        # Skip presence validations while loading, and delegate to DB validations
        skip_presence_validation(Address, :country)
        skip_presence_validation(Address, :province)
        skip_presence_validation(Address, :city)
        skip_presence_validation(Skill, :professional)
        skip_presence_validation(SpecialitySpecialist, :speciality)
        skip_presence_validation(SpecialitySpecialist, :specialist)
        skip_presence_validation(ProfessionalCenter, :professional)
        skip_presence_validation(ProfessionalCenter, :center)
        skip_presence_validation(InsuranceCompanyPartner, :insurance_company)
        skip_presence_validation(InsuranceCompanyPartner, :partner)
      end

  19. Delegate to the DB

      def skip_presence_validation(model_class, field)
        validators = model_class._validators[field]
        validators.reject! do |validator|
          validator.is_a?(ActiveRecord::Validations::PresenceValidator)
        end
        model_class._validators.delete(field) if validators.empty?

        empty_callbacks = []
        callbacks = model_class._validate_callbacks
        callbacks.each do |callback|
          if callback.name == :validate && callback.filter.is_a?(ActiveRecord::Validations::PresenceValidator)
            callback.filter.attributes.delete(field)
            empty_callbacks << callback if callback.filter.attributes.empty?
          end
        end
        empty_callbacks.each { |c| callbacks.delete(c) }
      end

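      A minimal usage sketch, assuming skip_validations runs once before the
      bulk load starts (the rake task body and loader class are illustrative,
      not from the deck):

          namespace :load do
            task parse: :environment do
              # Drop the presence validators up front and let the
              # NOT NULL constraints in the DB catch bad rows instead.
              skip_validations
              BulkLoader.new('professionals.csv').run  # hypothetical loader class
            end
          end
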
  20. Delegate to the DB

      class Address < ActiveRecord::Base
        # Relations
        belongs_to :country, inverse_of: :addresses
        belongs_to :city, inverse_of: :addresses
        belongs_to :province, inverse_of: :addresses
        has_many :centers, inverse_of: :address

        # Validations
        validates :country, :city, :province, presence: true
        validates :name, presence: true

        # Delegations
        delegate :name, :code, to: :country, prefix: true, allow_nil: true
        delegate :name, to: :city, prefix: true, allow_nil: true
        delegate :name, to: :province, prefix: true, allow_nil: true
      end

  21. Delegate to the DB

      class SpecialitySpecialist < ActiveRecord::Base
        # Relations
        belongs_to :speciality, inverse_of: :speciality_specialists
        belongs_to :specialist, polymorphic: true

        # Validations
        validates :speciality, presence: true
        validates :specialist, presence: true
      end

  22. Cached tables - Hash

      def cached_tables
        {
          cities: City.all.index_by { |c| "#{c.province_id}-#{c.external_id}" },
          provinces: Province.all.index_by(&:external_id),
          countries: Country.all.index_by(&:external_id),
          specialities: Speciality.all.index_by(&:external_id),
          insurance_companies: InsuranceCompany.all.to_a,
        }
      end

  23. Cached tables - Methods

      def cities
        @caches[:cities]
      end

      def provinces
        @caches[:provinces]
      end

      def countries
        @caches[:countries]
      end

      def specialities
        @caches[:specialities]
      end

      def insurance_companies
        @caches[:insurance_companies]
      end

  24. Method

      def find_or_create_city(province, country)
        city = cities["#{province.id}-#{row.city_attributes[:external_id]}"] || City.new
        city.attributes = row.city_attributes.
          merge(province: province, country: country)
        city.save! if city.changed?
        cities["#{city.province_id}-#{city.external_id}"] = city
        city
      end

  25. Score

      - Processing time: roughly 4 hours
      - 400,000 fewer AR SELECTs
      - Code was more readable

  26. CODE

      TIMES = 10000

      def do_inserts
        TIMES.times { User.create(:user_id => 1, :sku => 12, :delta => 1) }
      end

      Benchmark.measure { ActiveRecord::Base.transaction { do_inserts } }
      Benchmark.measure { do_inserts }

  27. CODE

      CONN = ActiveRecord::Base.connection
      TIMES = 10000

      def raw_sql
        TIMES.times {
          CONN.execute "INSERT INTO `user` (`delta`, `updated_at`, `sku`, `user_id`) VALUES(1, '2015-11-21 20:21:13', 12, 1)"
        }
      end

      Benchmark.measure { ActiveRecord::Base.transaction { raw_sql } }
      Benchmark.measure { raw_sql }

  28. CODE

      CONN = ActiveRecord::Base.connection
      TIMES = 10000

      def mass_insert
        inserts = []
        TIMES.times do
          inserts.push "(1, '2015-11-21 20:21:13', 12, 1)"
        end
        sql = "INSERT INTO user (`delta`, `updated_at`, `sku`, `user_id`) VALUES #{inserts.join(", ")}"
        CONN.execute sql
      end

      Benchmark.measure { mass_insert }

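      One caveat: a single INSERT with hundreds of thousands of VALUES tuples
      can exceed MySQL's max_allowed_packet. A sketch of batching the same
      insert in slices (the 5_000 batch size is an assumption, not from the
      deck):

          # Keep each INSERT under the max_allowed_packet limit by
          # sending the tuples in batches.
          inserts.each_slice(5_000) do |batch|
            CONN.execute "INSERT INTO user (`delta`, `updated_at`, `sku`, `user_id`) " \
                         "VALUES #{batch.join(', ')}"
          end
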
  29. Score

      ActiveRecord without transaction: 14.930000  0.640000  15.570000 ( 18.898352)
      ActiveRecord with transaction:    13.420000  0.310000  13.730000 ( 14.619136)   1.29x faster than base
      Raw SQL without transaction:       0.920000  0.170000   1.090000 (  3.731032)   5.07x faster than base
      Raw SQL with transaction:          0.870000  0.150000   1.020000 (  1.648834)  11.46x faster than base
      Only Mass Insert:                  0.000000  0.000000   0.000000 (  0.268634)  70.35x faster than base

  30. crontab (OS)

      */30 * * * * flock -n /tmp/cron.txt.lock sh -c 'cd /var/www/project/current && bundle exec rake load:parse' || sh -c 'echo MyProject already running; ps; ls /tmp/*.lock'

      flock -n takes the lock file without blocking, so the run scheduled
      every 30 minutes is skipped (and reported) while the previous load is
      still running.

  31. Logs should answer questions like these:

      - How many items have been loaded?
      - How many errors have you had?
      - How many elements of a specific class have been loaded?
      - How many items have been removed / changed?
      - How many items did not meet load conditions?
      - .....

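      A minimal sketch of a counter-based summary that could answer those
      questions (the counters and keys are hypothetical):

          # Counters incremented during the load, dumped as one summary
          # line at the end so the log answers the questions above.
          stats = Hash.new(0)
          # ... inside the load loop:
          #   stats[:loaded] += 1; stats[:errors] += 1; stats[:skipped] += 1

          Rails.logger.info "LOAD SUMMARY loaded=#{stats[:loaded]} " \
                            "errors=#{stats[:errors]} skipped=#{stats[:skipped]}"
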
  32. LOG data

      - Transaction ID
      - Time spent in Web Services
      - Different logs
      - We need to store more info from the beginning of the project

  33. Rails logger

      # warnings, errors or info
      Rails.logger.warn "careful: ... is not available"
      Rails.logger.error "Error in..."
      Rails.logger.info "RESPONSE TOKEN: #{token_info["access_token"]}"

      # way to use
      Rails.logger.tagged "MYPROJECT" do
        Rails.logger.tagged "GET_OFFERS_BY_BLOCK" do
        end
      end

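      For reference, the tags show up as prefixes on every line written inside
      the blocks (the message here is illustrative):

          Rails.logger.tagged "MYPROJECT" do
            Rails.logger.tagged "GET_OFFERS_BY_BLOCK" do
              Rails.logger.info "requesting offers block"
              # => [MYPROJECT] [GET_OFFERS_BY_BLOCK] requesting offers block
            end
          end
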
  34. Final summary

      - The more ActiveRecord you know, the better
      - Things to take into consideration: speed, amount of data,
        real time or not, and the architecture components
      - Design of your documents and Models
      - Cache as much as possible
      - Re-think everything and refactor
      - The power of Logging