Slide 1

Collecting and processing book reading statistics
Evgeny Li (exAspArk)

Slide 2

No content

Slide 3

• 2 million users
• 500 thousand books
• 25 billion read characters per month

Slide 4

15 million printed A4 sheets

Slide 5

1.5 km in height (17x)

Slide 6

1 read page = 1 "reading"

{
  "book": { "id": 123123, "language": "en" },
  "user": {
    "id": 1864864,
    "birthday_at": 630513585000,
    "gender": "m"
  },
  "legal_owner_id": 435435,
  "size": 965,
  "progress": { "from": 34.6655, "to": 36.5425 },
  "created_at": 1430513785829,
  "read_at": 1430513585000,
  "country_code": "RU",
  "city": "Default",
  "ip": "127.128.129.130",
  "app": { "name": "Bookmate", "version": "3.3.14" },
  "model": { "name": "iPhone", "version": "4.1" },
  "os": { "name": "iPhoneOS", "version": "6.1.3" },
  ...
}

Slide 7

~ 1,000 readings / minute

Slide 8

Current architecture

Slide 9

http://cdn.themetapicture.com/pic/images/2015/03/18/funny-gif-professor-asking-class-hands.gif

Slide 10

No content

Slide 11

• 4 servers
• Nginx
• Unicorn
• HTTP API
• Ruby on Rails 4

Slide 12

No content

Slide 13

• Performance
• Scale
• Asynchronous
• Loose coupling
• Durable

Slide 14

• 2 servers
• 10 processes reading from the queue
• Ruby on Rails 4
• AMQP + EventMachine
(A hedged sketch of the producer side follows; the consumer code is on the next slide.)
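The slides only show the consumer side of the queue. For context, here is a minimal sketch of what the producing side, the Rails HTTP API pushing a reading into RabbitMQ, might look like. It assumes the Bunny gem and an explicit connection and queue name; the deck does not say how publishing is actually done.

# Hypothetical producer sketch (not from the slides): publish one reading to the durable queue.
require "bunny"
require "json"

# Illustrative connection settings; the consumer code reads them from RabbitmqSettings.
connection = Bunny.new(host: "localhost", port: 5672)
connection.start

channel = connection.create_channel
queue   = channel.queue("readings", durable: true)

reading = { book: { id: 123123, language: "en" }, size: 965, read_at: 1430513585000 }
queue.publish(reading.to_json, persistent: true) # persistent messages survive a broker restart

connection.close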

Slide 15

EventMachine.run do
  connection = AMQP.connect(
    host: RabbitmqSettings.host,
    port: RabbitmqSettings.port
  )

  channel = AMQP::Channel.new(connection)
  queue = channel.queue(
    RabbitmqSettings.readings_queue_name,
    durable: true
  )

  queue.subscribe do |payload|
    Stats::Reading.create_from_json(payload)
  end
end

Slide 16

No content

Slide 17

• Document-oriented, BSON
• Sharding / Replication
• MapReduce
• Aggregation Pipeline

Slide 18

Let’s store everything in MongoDB!

Slide 19

http://giphy.com/gifs/funny-sand-bulldog-hiNJEpsTBTwhG

Slide 20

A replica can't catch up with the master after going down

Slide 21

Solutions?

Slide 22

Reduce the size of the data

Slide 23

Use short field names
http://docs.mongodb.org/manual/faq/developers/#how-do-i-optimize-storage-use-for-small-documents
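MongoDB stores field names inside every BSON document, so with millions of tiny documents the names themselves take real space. A minimal Mongoid sketch of the idea, using field aliases so the short names only exist in the database; the exact fields and types are illustrative, not taken from the slides:

class Stats::Reading
  include Mongoid::Document

  # Stored as "u_id" / "s" / "r_at" in MongoDB, readable names in Ruby.
  field :u_id, as: :user_id, type: Integer
  field :s,    as: :size,    type: Integer
  field :r_at, as: :read_at, type: Time
end

# The alias is transparent in code and queries:
# Stats::Reading.create!(user_id: 1, size: 965, read_at: Time.now.utc)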

Slide 24

No content

Slide 25

Upgrade MongoDB to 3.0

Slide 26

New storage engine: WiredTiger
• Before: 111 GB
• After: 23 GB, almost 5x less!
http://docs.mongodb.org/manual/release-notes/3.0/#wiredtiger

Slide 27

https://github.com/mongoid/mongoid/pull/3941 https://www.mongodb.com/blog/post/announcing-ruby-driver-20-rewrite

Slide 28

https://github.com/mongoid/mongoid/pull/3941 https://www.mongodb.com/blog/post/announcing-ruby-driver-20-rewrite

Slide 29

Dump readings from MongoDB

Slide 30

Files

Slide 31

#!/usr/bin/env python
import gzip
from time import strftime

from pymongo import MongoClient
from bson.json_util import dumps

client = MongoClient(HOSTS, slave_okay=False)
db = client.publisher
collection = db["stats.readings"]
collection_tmp = db["stats.readings_tmp"]

# Swap the collection aside so new readings keep going into a fresh one.
collection.rename("stats.readings_tmp")

filepath = strftime("%Y%m%d") + ".json.gz"
file = gzip.open(filepath, "ab")
for i in collection_tmp.find():
    print >> file, dumps(i)
file.close()

collection_tmp.drop()

Slide 32

• 1.5 million readings per day
• 750 MB of disk space
• 100 MB compressed

Slide 33

Store aggregated data

Slide 34

class Stats::Reading
  include Mongoid::Document

  INTERVAL_READING_CLASSES = [
    Stats::ReadingDaily,
    Stats::ReadingMonthly,
    Stats::UserReadingDaily,
    Stats::UserReadingMonthly,
    Stats::CityReadingHourly,
    Stats::CityReadingDaily,
    ...
  ]

  after_create :update_interval_readings

  private

  def update_interval_readings
    INTERVAL_READING_CLASSES.each do |interval_reading_class|
      interval_reading_class.increment_reading_size!(self)
    end
  end
end

Slide 35

class Stats::UserReadingMonthly
  include Mongoid::Document

  def self.increment_reading_size!(reading)
    read_at = reading.read_at
    canonical_read_at = Time.utc(read_at.year, read_at.mon)

    where(
      u_id: reading.user_id,
      d_id: reading.document_id,
      r_at: canonical_read_at
    ).find_and_modify(
      { "$inc" => { s: reading.size } },
      upsert: true
    )
  end
end

Slide 36

• 500 MB for 1.5 million documents
• 500 MB for 4 indices

Slide 37

And it worked for a while, until…

Slide 38

We decided to collect more varied statistics

Slide 39

https://dribbble.com/shots/1849628-Dashboard-for-Publishers

Slide 40

No content

Slide 41

Calculate user sessions, similar to Google Analytics
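A "session" in the Google Analytics sense is a run of events with no gap longer than some timeout. A minimal plain-Ruby sketch of that idea over a list of read_at timestamps; the 30-minute timeout is an assumption, not a number from the slides, and the deck later computes the equivalent inside Elasticsearch with a scripted_metric aggregation:

# Split timestamps into sessions: a gap longer than the timeout starts a new session.
SESSION_TIMEOUT = 30 * 60 # seconds, illustrative

def sessions(read_ats, timeout: SESSION_TIMEOUT)
  read_ats.sort.slice_when { |previous, current| current - previous > timeout }.to_a
end

read_ats = [0, 120, 300, 7_200, 7_260].map { |s| Time.at(s) }
p sessions(read_ats).map(&:size) # => [3, 2]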

Slide 42

Elasticsearch

Slide 43

• Document-oriented, based on Lucene (Java)
• Sharding / Replication
• REST API and JSON
• Nested aggregations
• Scripting and plugins

Slide 44

Who uses it?

Slide 45

• Runs on Lucene (Java)
• REST API and JSON
• Sharding
• Replication
• Full-text search
• Aggregations

SQL        Elasticsearch
Database   Index
Table      Type
Row        Document
Column     Field
Schema     Mapping
Index      Everything
SQL        Query DSL

Slide 46

Sharding / Replication: cluster diagram with 2 nodes, 1 replica, 3 shards (primary shards P1, P2, P3 and replica shards R1, R2, R3 spread across Node 1 and Node 2).

Slide 47

Adding one node: cluster diagram with 3 nodes, 1 replica, 3 shards (the shards rebalance so that the new Node 3 takes over part of the data).

Slide 48

Adding one replica: cluster diagram with 3 nodes, 2 replicas, 3 shards (each node now holds one primary shard and replica copies of the other two).

Slide 49

The number of shards can’t be changed after index creation
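Because of this, the shard count has to be chosen when the index is created; changing it later means reindexing into a new index. A minimal sketch with the official elasticsearch-ruby client, with an illustrative index name and numbers:

require "elasticsearch"

client = Elasticsearch::Client.new(host: "localhost:9200")

# number_of_shards is fixed for the lifetime of the index;
# number_of_replicas can still be changed later.
client.indices.create(
  index: "readings-2015.05.01",
  body: {
    settings: {
      number_of_shards:   1,
      number_of_replicas: 1
    }
  }
)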

Slide 50

• Near real-time analytics!
• Full-text search!
• ELK (Elasticsearch + Logstash + Kibana)!

Slide 51

http://giphy.com/gifs/funny-pics-expectations-O8zj64CXHojja

Slide 52

More RAM!

Slide 53

• A machine with 64 GB of RAM is the ideal sweet spot, but 32 GB and 16 GB machines are also common.
• Consider giving 32 GB to Elasticsearch and letting Lucene use the rest of the memory via the OS filesystem cache. All that memory will cache segments and lead to blisteringly fast full-text search (50% for the heap, 50% for Lucene).
https://www.elastic.co/guide/en/elasticsearch/guide/current/hardware.html
https://www.elastic.co/guide/en/elasticsearch/guide/current/heap-sizing.html

Slide 54

• 2 x 64 GB nodes
• 1 replica and 1 shard
• 1 day = 1 index
• 180 million documents for 4 months

Slide 55

110 MB for 1.5 million documents!

Slide 56

How did we make the data more than 4x smaller?

Slide 57

Mapping

Slide 58

• Avoid nested – every nested object is indexed as a separate hidden document
• Disable _source – the original JSON that was used to index the document
• Disable _all – a catch-all field that concatenates the text of the other fields in the document
• Disable analyzers and norms – they are only needed for computing relevance scores
• Use doc_values – field data lives on disk instead of in heap memory

Slide 59

Before and after

Before:

{
  "reading": {
    "properties": {
      "from": { "type": "float" },
      "to": { "type": "float" },
      "size": { "type": "integer" },
      "read_at": { "type": "date" },
      "user": {
        "type": "nested",
        "include_in_parent": true,
        "properties": {
          "id": { "type": "integer" },
          "birthday_at": { "type": "date" },
          "gender": { "type": "string", "index": "string_lowercase" }
        }
      },
      ...
    }
  }
}

After:

{
  "reading": {
    "_all": { "enabled": false },
    "_source": { "enabled": false },
    "properties": {
      "from": { "type": "float" },
      "to": { "type": "float" },
      "size": { "type": "integer" },
      "read_at": { "type": "date" },
      "user_id": { "type": "integer" },
      "user_birthday_at": { "type": "date" },
      "user_gender": { "type": "string", "index": "not_analyzed" },
      ...
    }
  }
}

Slide 60

Indexing

Slide 61

2,500 documents / second – not a problem at all

Slide 62

• Use the bulk API (tune the batch size empirically)
• Increase or disable refresh_interval and refresh manually
• Temporarily disable replication
• Delay or disable flushes
• Increase the thread pool size for index and bulk operations
• Use templates for creating new indices
(A sketch of a few of these settings follows below.)
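A minimal elasticsearch-ruby sketch of some of the knobs above, assuming the official client; the index and template names are illustrative, and the settings should be restored once the bulk load finishes:

require "elasticsearch"

client = Elasticsearch::Client.new(host: "localhost:9200")

# Template: every new daily index matching readings-* gets these settings and mappings.
client.indices.put_template(
  name: "readings",
  body: {
    template: "readings-*",
    settings: { number_of_shards: 1 },
    mappings: {
      reading: { _source: { enabled: false }, _all: { enabled: false } }
    }
  }
)

# During the bulk load: no periodic refresh, no replication.
client.indices.put_settings(
  index: "readings-2015.05.01",
  body:  { index: { refresh_interval: -1, number_of_replicas: 0 } }
)

# ... bulk indexing happens here ...

# Afterwards: restore normal settings and refresh once.
client.indices.put_settings(
  index: "readings-2015.05.01",
  body:  { index: { refresh_interval: "1s", number_of_replicas: 1 } }
)
client.indices.refresh(index: "readings-2015.05.01")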

Slide 63

Parallelize indexing

file_mask = "201502*"
day_filepaths = Dir.glob("#{ READINGS_PATH }/#{ file_mask }").sort

day_filepaths.each_slice(MAX_PROCESS_COUNT) do |day_filepaths|
  Parallel.each(day_filepaths, in_processes: day_filepaths.size) do |day_filepath|
    index(day_filepath)
  end
end

https://github.com/grosser/parallel

Slide 64

Bulk API and templates

TEMPLATE = "readings-#{ Rails.env }-*".freeze

def index(day_filepath)
  file = MultipleFilesGzipReader.new(File.open(day_filepath))
  bulk = []

  file.each_line do |line|
    reading = parsed_reading(line)

    bulk << {
      index: {
        _type:  "reading",
        _index: TEMPLATE.sub("*", reading["read_at"].strftime("%Y.%m.%d")),
        _id:    reading["_id"],
        data:   serialized_reading(reading)
      }
    }

    if bulk.size == BULK_SIZE
      client.bulk(body: bulk)
      bulk = []
    end
  end

  client.bulk(body: bulk) if bulk.any? # flush the last partial batch
end

https://github.com/exAspArk/multiple_files_gzip_reader

Slide 65

• Try to load all data you need at once
• Use sets and hashes
http://spin.atomicobject.com/2012/09/04/when-is-a-set-better-than-an-array-in-ruby/

Slide 66

USER_ATTR_NAMES = %w(id gender birthday_at)

def user_attrs_by_id
  @user_attrs_by_id ||= begin
    users_attr_values = User.pluck(*USER_ATTR_NAMES)

    users_attr_values.inject({}) do |result, user_attr_values|
      result[user_attr_values.first] = Hash[USER_ATTR_NAMES.zip(user_attr_values)]
      result
    end
  end
end

def serialized_reading(reading)
  user = user_attrs_by_id[reading["user_id"]]

  {
    from:             reading["from"].to_f,
    to:               reading["to"].to_f,
    size:             reading["size"].to_i,
    read_at:          Time.at(reading["read_at"].to_i / 1_000).utc,
    user_id:          user["id"],
    user_birthday_at: user["birthday_at"],
    user_gender:      user["gender"].downcase,
    ...
  }
end

Slide 67

Marvel https://www.elastic.co/products/marvel

Slide 68

Building requests is fun!

Slide 69

Kibana https://www.elastic.co/products/kibana

Slide 70

MapReduce with scripting

aggs: {
  user_ids: {
    terms: { field: "user_id" },
    aggs: {
      by_days: {
        date_histogram: { field: "read_at", interval: "1d" },
        aggs: {
          sessions: {
            scripted_metric: {
              init_script:    "_agg['read_ats'] = []",
              map_script:     "_agg.read_ats.add(doc['read_at'].value)",
              combine_script: oneliner(%Q{
                sessions = []
                if (_agg.read_ats.size() < 2) { return sessions }
                for (read_at in _agg.read_ats) {
                  sessions << ...
                }
                return sessions
              }),
              reduce_script: oneliner(%Q{
                sessions = []
                stats = [:]
                for (shard_sessions in _aggs) { sessions.addAll(shard_sessions) }
                if (sessions.size() == 0) { return stats }
                stats.average = sessions.sum() / sessions.size()
                return stats
              })
            }
          }
        }
      }
    }
  }
}

https://gist.github.com/exAspArk/c325bb9a75dcda5c8212

Slide 71

Percolator: diagram contrasting a normal search (a query is run against stored documents D1, D2 and returns a response) with percolation (a document is run against stored queries and the matching queries are returned).
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-percolate.html
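A minimal sketch of the idea with the 1.x-era elasticsearch-ruby client, given here as an assumption rather than code from the talk: register a query by indexing it into the special .percolator type, then percolate a candidate document against the registered queries. The index, field, and query are illustrative:

require "elasticsearch"

client = Elasticsearch::Client.new(host: "localhost:9200")

# Register a query: "readings coming from Russia".
client.index(
  index: "readings-2015.05.01",
  type:  ".percolator",
  id:    "readings-from-ru",
  body:  { query: { term: { country_code: "ru" } } }
)

# Percolate a document: which of the registered queries does it match?
response = client.percolate(
  index: "readings-2015.05.01",
  type:  "reading",
  body:  { doc: { country_code: "ru", size: 965 } }
)

p response["matches"] # => [{ "_index" => "readings-2015.05.01", "_id" => "readings-from-ru" }]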

Slide 72

Warmers
• Filter cache
• Filesystem cache
• Loading field data for fields
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-warmers.html

Slide 73

Plugins
• Analysis (Russian, Chinese, Unicode)
• Transport (Redis, ZeroMQ)
• Scripting (JavaScript, Clojure)
• Site (Elasticsearch HQ, BigDesk)
• Snapshot/Restore (Hadoop HDFS, AWS S3)
• Search for similar images
• Mahout Taste-based recommendation
• Render HTML from Elasticsearch
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-plugins.html

Slide 74

Things to pay attention to

Slide 75

Don’t use it as your primary datastore

Slide 76

Don't trust it 100% (in other words, read the docs carefully)
https://www.elastic.co/blog/count-elasticsearch

Slide 77

Some statistics from Elasticsearch

Slide 78

How men and women read differently

Slide 79

Readers broken down by age

Slide 80

Ruby
• https://github.com/elastic/elasticsearch-ruby
• https://github.com/elastic/elasticsearch-rails
• https://github.com/karmi/retire (retired)
• https://github.com/ankane/searchkick
• https://github.com/toptal/chewy
• https://github.com/stretcher/stretcher
• etc.
(A tiny usage sketch of the official client follows below.)
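For a flavor of the official elasticsearch-ruby client, a minimal hedged sketch; the index name and the aggregation are illustrative rather than queries from the talk, though the user_gender and size fields match the mapping shown earlier:

require "elasticsearch"

client = Elasticsearch::Client.new(host: "localhost:9200")

# Total characters read per gender for one day, via a terms + sum aggregation.
response = client.search(
  index: "readings-2015.05.01",
  body: {
    size: 0, # only the aggregations are needed, not the hits
    aggs: {
      genders: {
        terms: { field: "user_gender" },
        aggs:  { total_size: { sum: { field: "size" } } }
      }
    }
  }
)

p response["aggregations"]["genders"]["buckets"]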

Slide 81

Conclusion

Elasticsearch works well when you don't know your queries in advance

Slide 82

Thank you!