
Collecting and processing book reading statistics @ Bookmate


The talk was given at DevConf 2015

#Elasticsearch #MapReduce #ELK #MongoDB #RabbitMQ #Ruby #Rails


exAspArk

June 19, 2015

Transcript

  1. Collecting and processing book reading statistics Evgeny Li exAspArk

  2. None
  3. • 2 million users • 500 thousand books • 25 billion read characters per month
  4. 15 million printed A4 sheets

  5. 1.5 km in height 17 x

  6. 1 read page = 1 "reading"

     {
       "book":           { "id": 123123, "language": "en" },
       "user":           { "id": 1864864, "birthday_at": 630513585000, "gender": "m" },
       "legal_owner_id": 435435,
       "size":           965,
       "progress":       { "from": 34.6655, "to": 36.5425 },
       "created_at":     1430513785829,
       "read_at":        1430513585000,
       "country_code":   "RU",
       "city":           "Default",
       "ip":             "127.128.129.130",
       "app":            { "name": "Bookmate", "version": "3.3.14" },
       "model":          { "name": "iPhone", "version": "4.1" },
       "os":             { "name": "iPhoneOS", "version": "6.1.3" },
       ...
     }
  7. ~ 1,000 readings / minute

  8. Current architecture

  9. http://cdn.themetapicture.com/pic/images/2015/03/18/funny-gif-professor-asking-class-hands.gif

  10. None
  11. • 4 servers • Nginx • Unicorn • HTTP API • Ruby on Rails 4
  12. None
  13. • Performance • Scale • Asynchronous • Loose coupling • Durable
  14. • 2 servers • 10 processes for reading from the queue • Ruby on Rails 4 • AMQP + EventMachine
  15. EventMachine.run do
        connection = AMQP.connect(
          host: RabbitmqSettings.host,
          port: RabbitmqSettings.port
        )
        channel = AMQP::Channel.new(connection)
        queue = channel.queue(
          RabbitmqSettings.readings_queue_name,
          durable: true
        )

        queue.subscribe do |payload|
          Stats::Reading.create_from_json(payload)
        end
      end
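     Only the consumer side is shown above; the publisher side (the HTTP API pushing readings onto the queue) is not in the deck. A minimal sketch of what it could look like, assuming the bunny gem and the same RabbitmqSettings object:

       require "bunny"

       # Hypothetical publisher: connect, declare the same durable queue,
       # and publish each reading as a persistent JSON message.
       connection = Bunny.new(host: RabbitmqSettings.host, port: RabbitmqSettings.port)
       connection.start

       channel = connection.create_channel
       queue   = channel.queue(RabbitmqSettings.readings_queue_name, durable: true)

       reading_params = { book: { id: 123123 }, size: 965 } # example payload
       queue.publish(reading_params.to_json, persistent: true)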
  16. None
  17. • Document-oriented, BSON • Sharding / Replication • MapReduce • Aggregation Pipeline
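     The Aggregation Pipeline mentioned above can be run straight from Ruby. A minimal sketch, assuming Mongoid and purely illustrative field names (user_id, size, read_at), that sums read characters per user:

       Stats::Reading.collection.aggregate([
         { "$match" => { "read_at" => { "$gte" => Time.utc(2015, 5, 1) } } },
         { "$group" => { "_id" => "$user_id", "total_size" => { "$sum" => "$size" } } }
       ])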
  18. Let’s store everything in MongoDB!

  19. http://giphy.com/gifs/funny-sand-bulldog-hiNJEpsTBTwhG

  20. A replica can’t catch up with the primary after going down

  21. Solutions?

  22. Reduce the size of data

  23. Use short field names http://docs.mongodb.org/manual/faq/developers/#how-do-i-optimize-storage-use-for-small-documents
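     With Mongoid this doesn't have to hurt readability: documents can be stored with one-letter keys while the Ruby code keeps descriptive names via field aliases. A minimal sketch (this particular model is just an example):

       class Stats::UserReadingDaily
         include Mongoid::Document

         # Short keys on disk, readable accessors in Ruby.
         field :u_id, as: :user_id,     type: Integer
         field :d_id, as: :document_id, type: Integer
         field :r_at, as: :read_at,     type: Time
         field :s,    as: :size,        type: Integer
       end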

  24. None
  25. Upgrade MongoDB to 3.0

  26. New storage engine – WiredTiger • Before: 111 GB • After: 23 GB – almost 5x less! http://docs.mongodb.org/manual/release-notes/3.0/#wiredtiger
  27. https://github.com/mongoid/mongoid/pull/3941 https://www.mongodb.com/blog/post/announcing-ruby-driver-20-rewrite

  28. https://github.com/mongoid/mongoid/pull/3941 https://www.mongodb.com/blog/post/announcing-ruby-driver-20-rewrite

  29. Dump readings from MongoDB

  30. Files

  31. #!/usr/bin/env python
      import gzip
      from time import strftime

      from pymongo import MongoClient
      from bson.json_util import dumps

      client = MongoClient(HOSTS, slave_okay=False)
      db = client.publisher

      collection = db["stats.readings"]
      collection_tmp = db["stats.readings_tmp"]
      collection.rename("stats.readings_tmp")

      filepath = strftime("%Y%m%d") + ".json.gz"
      file = gzip.open(filepath, "ab")

      for i in collection_tmp.find():
          print >> file, dumps(i)

      file.close()
      collection_tmp.drop()
  32. • 1.5 million readings per day • 750 MB of disk space • 100 MB in the archive
  33. Store aggregated data

  34. class Stats::Reading
        include Mongoid::Document

        INTERVAL_READING_CLASSES = [
          Stats::ReadingDaily,
          Stats::ReadingMonthly,
          Stats::UserReadingDaily,
          Stats::UserReadingMonthly,
          Stats::CityReadingHourly,
          Stats::CityReadingDaily,
          ...
        ]

        after_create :update_interval_readings

        private

        def update_interval_readings
          INTERVAL_READING_CLASSES.each do |interval_reading_class|
            interval_reading_class.increment_reading_size!(self)
          end
        end
      end
  35. class Stats::UserReadingMonthly
        include Mongoid::Document

        def self.increment_reading_size!(reading)
          read_at = reading.read_at
          canonical_read_at = Time.utc(read_at.year, read_at.mon)

          collection.where({
            u_id: reading.user_id,
            d_id: reading.document_id,
            r_at: canonical_read_at
          }).find_and_modify({
            "$inc" => { s: reading.size }
          }, [:upsert])
        end
      end
  36. • 500 MB for 1.5 million documents • 500 MB for 4 indices
  37. And it worked for a while, until…

  38. We decided to collect more varied statistics

  39. https://dribbble.com/shots/1849628-Dashboard-for-Publishers

  40. None
  41. Calculate user sessions similar to Google Analytics
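     The idea is the same as Google Analytics sessions: consecutive readings belong to one session until there is a long enough gap of inactivity. A minimal Ruby sketch of the grouping (the 30-minute timeout is an assumption, not a number from the talk):

       SESSION_TIMEOUT = 30 * 60 # seconds of inactivity that closes a session (assumed)

       # Group sorted read_at timestamps into sessions and return each
       # session's duration in seconds.
       def session_durations(read_ats)
         sessions = []
         read_ats.sort.each do |read_at|
           if sessions.empty? || read_at - sessions.last.last > SESSION_TIMEOUT
             sessions << [read_at]
           else
             sessions.last << read_at
           end
         end
         sessions.map { |session| session.last - session.first }
       end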

  42. Elasticsearch

  43. • Document-oriented, based on Lucene (Java) • Sharding / Replication • REST API and JSON • Nested aggregations • Scripting and plugins
  44. Who uses it?

  45. • Built on Lucene (Java) • REST API and JSON • Sharding • Replication • Full-text search • Aggregations

      SQL        Elasticsearch
      Database   Index
      Table      Type
      Row        Document
      Column     Field
      Schema     Mapping
      Index      Everything
      SQL        Query DSL
  46. Sharding / Replication – a cluster of 2 nodes, 1 replica, 3 shards: primaries P1–P3 and replicas R1–R3 spread across the two nodes
  47. Adding one node – 3 nodes, 1 replica, 3 shards: the shards rebalance onto the new node
  48. Adding one replica – 3 nodes, 2 replicas, 3 shards: every node now holds one primary and two replica shards
  49. The number of shards can’t be changed after index creation
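     So the shard count has to be picked when the index is created (only the replica count can be changed later). A minimal sketch with the elasticsearch-ruby client; the index name and numbers are just an example:

       require "elasticsearch"

       client = Elasticsearch::Client.new

       client.indices.create(
         index: "readings-production-2015.02.01",
         body:  { settings: { number_of_shards: 3, number_of_replicas: 1 } }
       )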

  50. • Near real-time analytics! • Full-text search! • ELK (Elasticsearch + Logstash + Kibana)!
  51. http://giphy.com/gifs/funny-pics-expectations-O8zj64CXHojja

  52. More RAM!

  53. • A machine with 64 GB of RAM is the ideal sweet spot, but 32 GB and 16 GB machines are also common • Consider giving 32 GB to Elasticsearch and letting Lucene use the rest of memory via the OS filesystem cache; all that memory will cache segments and lead to blisteringly fast full-text search (50% for the heap, 50% for Lucene) https://www.elastic.co/guide/en/elasticsearch/guide/current/hardware.html https://www.elastic.co/guide/en/elasticsearch/guide/current/heap-sizing.html
  54. • 2 x 64 GB nodes • 1 replica and 1 shard • 1 day = 1 index • 180 million documents for 4 months
  55. 110 MB for 1.5 million documents!

  56. How did we shrink the data more than 4x?

  57. Mapping

  58. • Avoid nested types – each nested object is indexed as a separate document • Disable _source – stores the original JSON that was used to index the document • Disable _all – concatenates the text of the other fields into one big indexed field • Disable analyzers and norms – they are only needed to compute relevance scores • Use doc_values – field data lives on disk instead of in heap memory
  59. Before and after

      Before:

      {
        "reading": {
          "properties": {
            "from":    { "type": "float" },
            "to":      { "type": "float" },
            "size":    { "type": "integer" },
            "read_at": { "type": "date" },
            "user": {
              "type": "nested",
              "include_in_parent": true,
              "properties": {
                "id":          { "type": "integer" },
                "birthday_at": { "type": "date" },
                "gender":      { "type": "string", "index": "string_lowercase" }
              }
            },
            ...
          }
        }
      }

      After:

      {
        "reading": {
          "_all":    { "enabled": false },
          "_source": { "enabled": false },
          "properties": {
            "from":             { "type": "float" },
            "to":               { "type": "float" },
            "size":             { "type": "integer" },
            "read_at":          { "type": "date" },
            "user_id":          { "type": "integer" },
            "user_birthday_at": { "type": "date" },
            "user_gender":      { "type": "string", "index": "not_analyzed" },
            ...
          }
        }
      }
  60. Indexing

  61. 2,500 documents / second – not a problem at all

  62. • Use the bulk API (tune the batch size empirically) • Increase or disable refresh_interval and refresh manually • Temporarily disable replication • Delay or disable flushes • Increase the thread pool size for index and bulk operations • Use templates for creating new indices
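     Two of the tweaks above – refresh_interval and the replica count – are plain index settings that can be toggled around a bulk load. A minimal sketch with elasticsearch-ruby (the index pattern and values are assumptions):

       # Before a big import: stop refreshing and drop replicas.
       client.indices.put_settings(
         index: "readings-production-*",
         body:  { index: { refresh_interval: -1, number_of_replicas: 0 } }
       )

       # ... bulk indexing happens here ...

       # Afterwards: restore the settings and refresh once.
       client.indices.put_settings(
         index: "readings-production-*",
         body:  { index: { refresh_interval: "1s", number_of_replicas: 1 } }
       )
       client.indices.refresh(index: "readings-production-*")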
  63. Parallelize indexing

      file_mask = "201502*"
      day_filepaths = Dir.glob("#{ READINGS_PATH }/#{ file_mask }").sort

      day_filepaths.each_slice(MAX_PROCESS_COUNT) do |day_filepaths|
        Parallel.each(day_filepaths, in_processes: day_filepaths.size) do |day_filepath|
          index(day_filepath)
        end
      end

      https://github.com/grosser/parallel
  64. Bulk API and templates

      TEMPLATE = "readings-#{ Rails.env }-*".freeze

      def index(day_filepath)
        file = MultipleFilesGzipReader.new(File.open(day_filepath))
        bulk = []

        file.each_line do |line|
          reading = parsed_reading(line)

          bulk << {
            index: {
              _type:  "reading",
              _index: TEMPLATE.sub("*", reading["read_at"].strftime("%Y.%m.%d")),
              _id:    reading["_id"],
              data:   serialized_reading(reading)
            }
          }

          if bulk.size == BULK_SIZE
            client.bulk(body: bulk)
            bulk = []
          end
        end

        client.bulk(body: bulk) unless bulk.empty? # flush the last partial batch
      end

      https://github.com/exAspArk/multiple_files_gzip_reader
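     The daily indices created by the loader above can pick up their settings and mappings automatically from an index template. A minimal sketch (the template body is an illustration, not the actual Bookmate mapping):

       client.indices.put_template(
         name: "readings",
         body: {
           template: "readings-production-*",
           settings: { number_of_shards: 1, number_of_replicas: 1 },
           mappings: {
             reading: {
               _all:    { enabled: false },
               _source: { enabled: false }
             }
           }
         }
       )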
  65. • Try to load all the data you need at once • Use sets and hashes http://spin.atomicobject.com/2012/09/04/when-is-a-set-better-than-an-array-in-ruby/
  66. USER_ATTR_NAMES = %w(id gender birthday_at)

      def user_attrs_by_id
        @user_attrs_by_id ||= User.pluck(*USER_ATTR_NAMES).inject({}) do |result, user_attr_values|
          result[user_attr_values.first] = Hash[USER_ATTR_NAMES.zip(user_attr_values)]
          result
        end
      end

      def serialized_reading(reading)
        user = user_attrs_by_id[reading["user_id"]]

        {
          from:             reading["from"].to_f,
          to:               reading["to"].to_f,
          size:             reading["size"].to_i,
          read_at:          Time.at(reading["read_at"].to_i / 1_000).utc,
          user_id:          user["id"],
          user_birthday_at: user["birthday_at"],
          user_gender:      user["gender"].downcase,
          ...
        }
      end
  67. Marvel https://www.elastic.co/products/marvel

  68. Building requests is fun!

  69. Kibana https://www.elastic.co/products/kibana

  70. MapReduce with scripting

      aggs: {
        user_ids: {
          terms: { field: "user_id" },
          aggs: {
            by_days: {
              date_histogram: { field: "read_at", interval: "1d" },
              aggs: {
                sessions: {
                  scripted_metric: {
                    init_script:    "_agg['read_ats'] = []",
                    map_script:     "_agg.read_ats.add(doc['read_at'].value)",
                    combine_script: oneliner(%Q{
                      sessions = []
                      if (_agg.read_ats.size() < 2) { return sessions }
                      for (read_at in _agg.read_ats) {
                        sessions << ...
                      }
                      return sessions
                    }),
                    reduce_script: oneliner(%Q{
                      sessions = []
                      stats = [:]
                      for (shard_sessions in _aggs) { sessions.addAll(shard_sessions) }
                      if (sessions.size() == 0) { return stats }
                      stats.average = sessions.sum() / sessions.size()
                      return stats
                    })

      https://gist.github.com/exAspArk/c325bb9a75dcda5c8212
  71. Percolator – the inverse of search: a normal search runs a query against indexed documents, while percolation matches a document against a set of registered queries https://www.elastic.co/guide/en/elasticsearch/reference/current/search-percolate.html
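     A minimal sketch of percolation with elasticsearch-ruby and the Elasticsearch 1.x API used at the time (the registered query and the document are made-up examples): queries are indexed under the special .percolator type, then a document is sent in to find which of them match.

       # Register a query.
       client.index(
         index: "readings-production-2015.02.01",
         type:  ".percolator",
         id:    "long-readings",
         body:  { query: { range: { size: { gte: 10_000 } } } }
       )

       # Percolate a document: the response lists the ids of matching queries.
       client.percolate(
         index: "readings-production-2015.02.01",
         type:  "reading",
         body:  { doc: { size: 12_345, user_id: 1 } }
       )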
  72. Warmers • Filter cache • Filesystem cache • Loading field data for fields https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-warmers.html
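     Warmers are registered per index so that new segments have these caches populated before they serve searches. A minimal sketch for the Elasticsearch 1.x API of that era (the warmed aggregation is just an example):

       client.indices.put_warmer(
         index: "readings-production-2015.02.01",
         name:  "user_sizes",
         body:  {
           query: { match_all: {} },
           aggs:  { user_ids: { terms: { field: "user_id" } } }
         }
       )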
  73. Plugins • Analysis (Russian, Chinese, Unicode) • Transport (Redis, ZeroMQ) • Scripting (JavaScript, Clojure) • Site (Elasticsearch HQ, BigDesk) • Snapshot/Restore (Hadoop HDFS, AWS S3) • Search for similar images • Mahout Taste-based recommendations • Render HTML from Elasticsearch https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-plugins.html
  74. Pay attention to

  75. Don’t use it as your primary datastore

  76. Don’t trust it 100% (read the docs carefully) https://www.elastic.co/blog/count-elasticsearch

  77. Some statistics from Elasticsearch

  78. How men and women read differently

  79. Readers broken down by age

  80. Ruby • https://github.com/elastic/elasticsearch-ruby • https://github.com/elastic/elasticsearch-rails • https://github.com/karmi/retire (retired) • https://github.com/ankane/searchkick • https://github.com/toptal/chewy • https://github.com/stretcher/stretcher • etc.
  81. Conclusion – Elasticsearch works well when you don’t know all your queries in advance
  82. Thank you!