Bulk load data with AR & NoSQL - ConferenciaRoR 2016

In this talk I want to share my experience with bulk data loading on different projects using Active Record and NoSQL. I will talk about speed considerations, migrating millions of rows, external data from other systems, problems encountered, performance tuning, and some tricks learned along the way…

Carlos Sánchez Pérez

December 10, 2016

Transcript

  1. Bulk load data with ActiveRecord and NoSQL

     By @carlossanchezp - ConferenciaRoR 2016
     The only card I need is....... The Ace Of Spades

  2. Carlos Sánchez Pérez

     Pragmatic developer, Ruby/Rails choice at the moment, learning Scala,
     Elixir/Phoenix & DDD. Currently working as Web Dev at @The_Cocktail.
     Let's make it happen.
     Blog (only in Spanish): carlossanchezperez.wordpress.com
     Twitter: @carlossanchezp

  3. Overview

     1.- Project called Searcher:
         Bulk load of 300k per week
         CSV file data, Active Record
         Medium traffic volume
         Create / Update / Delete
         MySQL - 300k rows
         Thinking Sphinx full-text search tool
     2.- Project called E-commerce:
         Bulk load of 300k per day
         Connection with external services
         High traffic volume
         Create / Update / Logical deletion of products
         CouchDB - 1M products
         ElasticSearch search tool

  4. RoadMap

     1.- Project called Searcher:
         1. Architecture
         2. Models
         3. My first solution - Arrays
         4. Mistakes lead to good things
         5. Tips & Ideas
         6. With & without Active Record
     2.- Project called E-commerce:
         1. Architecture
         2. Real time
         3. Document design - NoSQL
         4. Superimpose processes
         5. Data load & synchronization
     3.- Logs & the power of information
     4.- Conclusions

  5. The reason why

     1) Penalized performance: "not in" with a big array (a lot of elements)
     2) "uniq" because we can have duplicated elements in the array
     3) "map" because we only need IDs to destroy them

  6. We realised we had to go over all these things:

     - Bad performance
     - The CSV file was read 3 times
     - Big arrays with IDs of processed data & "not in" SQL
     - Several SQL "SELECT"s were executed per CSV line

  7. Quick idea (see the sketch below)

     - Read the CSV file
     - Processed professional IDs collected into an Array
     - Delete the "not processed" records using "not in"
     - More than 24h to finish
     - A lot of SQL for the same model per line read

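     A minimal sketch of that first approach (model and column names are
     illustrative, not from the deck):

         require 'csv'

         # First approach: collect processed IDs into an Array, then
         # delete everything not seen using one big "not in".
         processed_ids = []

         CSV.foreach('professionals.csv', headers: true) do |row|
           # several SELECTs per CSV line to locate the record
           professional = Professional.find_or_initialize_by(external_id: row['external_id'])
           professional.update!(first_name: row['first_name'], last_name: row['last_name'])
           processed_ids << professional.id
         end

         # "uniq" because the file can contain duplicates; "not in"
         # over a huge array is what penalized performance.
         Professional.where.not(id: processed_ids.uniq).destroy_all
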
  8. Check IDs from file into SET

     # IDs with SET
     pop_ids(@set_ids_centers, center.id)
     pop_ids(@set_ids_professionals, professional.id)

     # Pops IDs to delete at the end
     def pop_ids(ids, id)
       ids.delete(id) if ids.include? id
     end

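     The final step is not shown on this slide; a minimal sketch of how the
     leftover IDs could be destroyed at the end, assuming the sets start out
     holding every existing ID (see slide 15):

         # Whatever is still in the sets was not present in the file,
         # so those records are the ones to delete.
         Center.where(id: @set_ids_centers.to_a).destroy_all
         Professional.where(id: @set_ids_professionals.to_a).destroy_all
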
  9. Array vs Set

     Benchmark.bm do |x|
       x.report "Set" do
         a = Set.new
         1000.times { a.add(rand(100)) }
       end
       x.report "Array" do
         a = []
         1000.times { r = rand(100); a << r unless a.include?(r) }
       end
     end

            user      system    total     real
     Set    0.000000  0.000000  0.000000 ( 0.001127)
     Array  0.000000  0.000000  0.000000 ( 0.000829)

  10. Array vs Set

      Benchmark.bm do |x|
        x.report "Set" do
          a = Set.new
          1000.times { a.add(rand(1000)) }
        end
        x.report "Array" do
          a = []
          1000.times { r = rand(1000); a << r unless a.include?(r) }
        end
      end

             user      system    total     real
      Set    0.000000  0.000000  0.000000 ( 0.003271)
      Array  0.000000  0.000000  0.000000 ( 0.003250)

  11. Array vs Set

      Benchmark.bm do |x|
        x.report "Set" do
          a = Set.new
          1000.times { a.add(rand(10000)) }
        end
        x.report "Array" do
          a = []
          1000.times { r = rand(10000); a << r unless a.include?(r) }
        end
      end

             user      system    total     real
      Set    0.000000  0.000000  0.000000 ( 0.001240)
      Array  0.010000  0.000000  0.010000 ( 0.010081)

             user      system    total     real
      Set    0.000000  0.000000  0.000000 ( 0.001545)
      Array  0.010000  0.010000  0.020000 ( 0.008941)

  12. The best way to use both of them

      1) With a small number of elements an Array is better
      2) With a large number of elements a Set is better
      (Set#include? is a hash lookup, while Array#include? scans the whole
      array, so the Set wins as the collection grows.)

  13. Pluck vs Map - examples

      users = User.all
      => [#<User id: 1, email: 'user1@example.com', active: true>,
          #<User id: 2, email: 'user2@example.com', active: false>]
      users.map(&:email)
      => ['user1@example.com', 'user2@example.com']

      # what I always used:
      User.all.map(&:email)

      emails = User.select(:email)
      => [#<User email: 'user1@example.com'>, #<User email: 'user2@example.com'>]
      emails.map(&:email)
      => ['user1@example.com', 'user2@example.com']

      User.pluck(:email)
      => ['user1@example.com', 'user2@example.com']
      User.where(active: true).pluck(:email)

  14. Benchmark Pluck vs map

      ActiveRecord::Base.logger.level = 1
      n = 1000
      Benchmark.bm do |x|
        x.report('Country.all.map(&:name): ') { n.times { Country.all.map(&:name) } }
        x.report('Country.pluck(:name):    ') { n.times { Country.pluck(:name) } }
      end

      ## Score
                                 user      system    total     real
      Country.all.map(&:name):   3.830000  0.140000  3.970000 ( 4.328655)
      Country.pluck(:name):      1.550000  0.040000  1.590000 ( 1.879490)

      pluck wins because it selects only the column and skips instantiating
      an ActiveRecord object per row.

  15. Load elements into SET

      # How to load
      set_ids_centers = Set.new get_ids_centers
      set_ids_professionals = Set.new get_ids_professionals

      # Methods only return IDs
      def get_ids_centers
        Center.pluck(:id)
      end

      def get_ids_professionals
        Professional.pluck(:id)
      end

  16. Validate - Database

      create_table "professionals", force: true do |t|
        t.string  "email",          null: false
        t.string  "first_name",     null: false
        t.string  "last_name",      null: false
        t.string  "personal_web",   default: "http://"
        t.string  "telephone"
        t.boolean "show_telephone", default: true, null: false
        t.boolean "show_email",     default: true, null: false
        t.text    "cv"
        t.integer "update_check",   default: 0
        t.boolean "delta",          default: true, null: false
        t.integer "type_id",        default: 0,    null: false
        t.string  "languages"
        t.string  "twitter"
        t.string  "numbercol",      limit: 30
        t.boolean "active",         default: true, null: false
      end

  17. Validate - Models

      class Professional < ActiveRecord::Base
        include NestedAttributeList, FriendlyId

        # Attributes
        friendly_id :full_name, use: :slugged

        # Validations
        validates :email, uniqueness: { case_sensitive: false }, allow_blank: true
        validates :first_name, presence: true
        validates :last_name, presence: true
        validates :type_id, presence: true
      end

  18. Delegate to the DB

      def skip_validations
        # Skip presence validations while loading, and delegate to DB validations
        skip_presence_validation(Address, :country)
        skip_presence_validation(Address, :province)
        skip_presence_validation(Address, :city)
        skip_presence_validation(Skill, :professional)
        skip_presence_validation(SpecialitySpecialist, :speciality)
        skip_presence_validation(SpecialitySpecialist, :specialist)
        skip_presence_validation(ProfessionalCenter, :professional)
        skip_presence_validation(ProfessionalCenter, :center)
        skip_presence_validation(InsuranceCompanyPartner, :insurance_company)
        skip_presence_validation(InsuranceCompanyPartner, :partner)
      end

  19. Delegate to the DB

      def skip_presence_validation(model_class, field)
        validators = model_class._validators[field]
        validators.reject! do |validator|
          validator.is_a?(ActiveRecord::Validations::PresenceValidator)
        end
        model_class._validators.delete(field) if validators.empty?

        empty_callbacks = []
        callbacks = model_class._validate_callbacks
        callbacks.each do |callback|
          if callback.name == :validate && callback.filter.is_a?(ActiveRecord::Validations::PresenceValidator)
            callback.filter.attributes.delete(field)
            empty_callbacks << callback if callback.filter.attributes.empty?
          end
        end
        empty_callbacks.each { |c| callbacks.delete(c) }
      end

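      A minimal usage sketch, assuming skip_validations runs once before the
      bulk load starts (the rake task body and loader class are illustrative,
      not from the deck):

          namespace :load do
            task parse: :environment do
              # Drop the presence validators up front and let the
              # NOT NULL constraints in the DB catch bad rows instead.
              skip_validations
              BulkLoader.new('professionals.csv').run  # hypothetical loader class
            end
          end
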
  20. Delegate to the DB

      class Address < ActiveRecord::Base
        # Relations
        belongs_to :country, inverse_of: :addresses
        belongs_to :city, inverse_of: :addresses
        belongs_to :province, inverse_of: :addresses
        has_many :centers, inverse_of: :address

        # Validations
        validates :country, :city, :province, presence: true
        validates :name, presence: true

        # Delegations
        delegate :name, :code, to: :country, prefix: true, allow_nil: true
        delegate :name, to: :city, prefix: true, allow_nil: true
        delegate :name, to: :province, prefix: true, allow_nil: true
      end

  21. Delegate to the DB

      class SpecialitySpecialist < ActiveRecord::Base
        # Relations
        belongs_to :speciality, inverse_of: :speciality_specialists
        belongs_to :specialist, polymorphic: true

        # Validations
        validates :speciality, presence: true
        validates :specialist, presence: true
      end

  22. Cached tables - Hash

      def cached_tables
        {
          cities: City.all.index_by { |c| "#{c.province_id}-#{c.external_id}" },
          provinces: Province.all.index_by(&:external_id),
          countries: Country.all.index_by(&:external_id),
          specialities: Speciality.all.index_by(&:external_id),
          insurance_companies: InsuranceCompany.all.to_a,
        }
      end

  23. Cached tables - Methods

      def cities
        @caches[:cities]
      end

      def provinces
        @caches[:provinces]
      end

      def countries
        @caches[:countries]
      end

      def specialities
        @caches[:specialities]
      end

      def insurance_companies
        @caches[:insurance_companies]
      end

  24. Method

      def find_or_create_city(province, country)
        city = cities["#{province.id}-#{row.city_attributes[:external_id]}"] || City.new
        city.attributes = row.city_attributes.
          merge(province: province, country: country)
        city.save! if city.changed?
        cities["#{city.province_id}-#{city.external_id}"] = city
        city
      end

  25. Score

      - Processing time: roughly 4 hours
      - 400,000 fewer AR SELECTs
      - Code was more readable

  26. CODE

      TIMES = 10000

      def do_inserts
        TIMES.times { User.create(:user_id => 1, :sku => 12, :delta => 1) }
      end

      Benchmark.measure { ActiveRecord::Base.transaction { do_inserts } }
      Benchmark.measure { do_inserts }

  27. CODE

      CONN = ActiveRecord::Base.connection
      TIMES = 10000

      def raw_sql
        TIMES.times {
          CONN.execute "INSERT INTO `user` (`delta`, `updated_at`, `sku`, `user_id`) VALUES(1, '2015-11-21 20:21:13', 12, 1)"
        }
      end

      Benchmark.measure { ActiveRecord::Base.transaction { raw_sql } }
      Benchmark.measure { raw_sql }

  28. CODE

      CONN = ActiveRecord::Base.connection
      TIMES = 10000

      def mass_insert
        inserts = []
        TIMES.times do
          inserts.push "(1, '2015-11-21 20:21:13', 12, 1)"
        end
        sql = "INSERT INTO user (`delta`, `updated_at`, `sku`, `user_id`) VALUES #{inserts.join(", ")}"
        CONN.execute sql
      end

      Benchmark.measure { mass_insert }

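      One caveat: a single INSERT with hundreds of thousands of VALUES tuples
      can exceed MySQL's max_allowed_packet. A sketch of batching the same
      insert in slices (the 5_000 batch size is an assumption, not from the
      deck):

          # Keep each INSERT under the max_allowed_packet limit by
          # sending the tuples in batches.
          inserts.each_slice(5_000) do |batch|
            CONN.execute "INSERT INTO user (`delta`, `updated_at`, `sku`, `user_id`) " \
                         "VALUES #{batch.join(', ')}"
          end
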
  29. Score

      ActiveRecord without transaction: 14.930000  0.640000  15.570000 ( 18.898352)
      ActiveRecord with transaction:    13.420000  0.310000  13.730000 ( 14.619136)   1.29x faster than base
      Raw SQL without transaction:       0.920000  0.170000   1.090000 (  3.731032)   5.07x faster than base
      Raw SQL with transaction:          0.870000  0.150000   1.020000 (  1.648834)  11.46x faster than base
      Only Mass Insert:                  0.000000  0.000000   0.000000 (  0.268634)  70.35x faster than base

  30. crontab (OS)

      */30 * * * * flock -n /tmp/cron.txt.lock sh -c 'cd /var/www/project/current && bundle exec rake load:parse' || sh -c 'echo MyProject already running; ps; ls /tmp/*.lock'

      flock -n takes the lock file without blocking, so the run scheduled
      every 30 minutes is skipped (and reported) while the previous load is
      still running.

  31. Logs should answer questions like these:

      - How many items have been loaded?
      - How many errors have you had?
      - How many elements of a specific class have been loaded?
      - How many items have been removed / changed?
      - How many items did not meet load conditions?
      - .....

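      A minimal sketch of a counter-based summary that could answer those
      questions (the counters and keys are hypothetical):

          # Counters incremented during the load, dumped as one summary
          # line at the end so the log answers the questions above.
          stats = Hash.new(0)
          # ... inside the load loop:
          #   stats[:loaded] += 1; stats[:errors] += 1; stats[:skipped] += 1

          Rails.logger.info "LOAD SUMMARY loaded=#{stats[:loaded]} " \
                            "errors=#{stats[:errors]} skipped=#{stats[:skipped]}"
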
  32. LOG data

      - Transaction ID
      - Time spent in Web Services
      - Different logs
      - We need to store more info from the beginning of the project

  33. Rails logger

      # warnings, errors or info
      Rails.logger.warn "careful: ... is not available"
      Rails.logger.error "Error in..."
      Rails.logger.info "RESPONSE TOKEN: #{token_info["access_token"]}"

      # way to use
      Rails.logger.tagged "MYPROJECT" do
        Rails.logger.tagged "GET_OFFERS_BY_BLOCK" do
        end
      end

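      For reference, the tags show up as prefixes on every line written inside
      the blocks (the message here is illustrative):

          Rails.logger.tagged "MYPROJECT" do
            Rails.logger.tagged "GET_OFFERS_BY_BLOCK" do
              Rails.logger.info "requesting offers block"
              # => [MYPROJECT] [GET_OFFERS_BY_BLOCK] requesting offers block
            end
          end
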
  34. Final summary

      - The more ActiveRecord you know, the better
      - Things to take into consideration: speed, amount of data,
        real time or not, and the architecture components
      - Design of your documents and Models
      - Cache as much as possible
      - Re-think everything and refactor
      - The power of Logging