Collecting and processing book reading statistics @ Bookmate

Evgeny Li (exAspArk)

• 2 million users
• 500 thousand books
• 25 billion read characters per month

15 million printed A4 sheets

1.5 km in height 17 x

1 read page = 1 “reading”

{
  "book": { "id": 123123, "language": "en" },
  "user": { "id": 1864864, "birthday_at": 630513585000, "gender": "m" },
  "legal_owner_id": 435435,
  "size": 965,
  "progress": { "from": 34.6655, "to": 36.5425 },
  "created_at": 1430513785829,
  "read_at": 1430513585000,
  "country_code": "RU",
  "city": "Default",
  "ip": "127.128.129.130",
  "app": { "name": "Bookmate", "version": "3.3.14" },
  "model": { "name": "iPhone", "version": "4.1" },
  "os": { "name": "iPhoneOS", "version": "6.1.3" },
  ...
}
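A minimal sketch of how one such reading could be parsed, assuming the payload shape shown above and Unix-millisecond timestamps. `ReadingSummary` and `summarize_reading` are hypothetical names used for illustration only, not Bookmate code.

```ruby
require "json"
require "time"

# Hypothetical value object; not part of the original code base.
ReadingSummary = Struct.new(:user_id, :book_id, :size, :progress_delta, :read_at)

# Parses one "reading" payload and extracts a few fields.
# Assumes timestamps are Unix milliseconds, as in the sample above.
def summarize_reading(json)
  reading = JSON.parse(json)
  progress = reading.fetch("progress")

  ReadingSummary.new(
    reading.fetch("user").fetch("id"),
    reading.fetch("book").fetch("id"),
    reading.fetch("size"),
    (progress.fetch("to") - progress.fetch("from")).round(4),
    Time.at(reading.fetch("read_at") / 1000).utc
  )
end

payload = <<~JSON
  { "book": { "id": 123123, "language": "en" },
    "user": { "id": 1864864, "gender": "m" },
    "size": 965,
    "progress": { "from": 34.6655, "to": 36.5425 },
    "read_at": 1430513585000 }
JSON

summary = summarize_reading(payload)
puts summary.progress_delta   # difference between progress.to and progress.from
puts summary.read_at.iso8601  # read_at converted to a UTC time
```

Using `fetch` instead of `[]` makes a malformed payload fail loudly rather than silently produce `nil` values.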

~ 1,000 readings / minute

Current architecture

http://cdn.themetapicture.com/pic/images/2015/03/18/funny-gif-professor-asking-class-hands.gif

• 4 servers
• Nginx
• Unicorn
• HTTP API
• Ruby on Rails 4

• Performance
• Scale
• Asynchronous
• Loose coupling
• Durable

• 2 servers
• 10 processes for reading from the queue
• Ruby on Rails 4
• AMQP + EventMachine

EventMachine.run do
  connection = AMQP.connect({
    host: RabbitmqSettings.host,
    port: RabbitmqSettings.port
  })
  channel = AMQP::Channel.new(connection)
  queue = channel.queue(
    RabbitmqSettings.readings_queue_name,
    durable: true
  )

  queue.subscribe do |payload|
    Stats::Reading.create_from_json(payload)
  end
end

• Document-oriented, BSON
• Sharding / Replication
• MapReduce
• Aggregation Pipeline

Let’s store everything in MongoDB!

http://giphy.com/gifs/funny-sand-bulldog-hiNJEpsTBTwhG

The replica can’t catch up with the master after falling behind

Solutions?

Reduce the size of data

Use short field names http://docs.mongodb.org/manual/faq/developers/#how-do-i-optimize-storage-use-for-small-documents
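MongoDB stores field names inside every document, so shorter names shrink each stored record. A rough sketch of the idea in plain Ruby, assuming a hypothetical `SHORT_FIELD_NAMES` mapping; the short names mirror the ones that appear later in the deck (`u_id`, `d_id`, `r_at`, `s`), and JSON size is used only as a stand-in for the BSON size.

```ruby
require "json"

# Hypothetical mapping from descriptive names to short stored names.
SHORT_FIELD_NAMES = {
  "user_id"     => "u_id",
  "document_id" => "d_id",
  "read_at"     => "r_at",
  "size"        => "s"
}.freeze

# Returns a copy of the document with known keys replaced by their short names.
# Unknown keys are kept as they are.
def shorten_keys(document)
  document.each_with_object({}) do |(key, value), result|
    result[SHORT_FIELD_NAMES.fetch(key, key)] = value
  end
end

document = { "user_id" => 1864864, "document_id" => 123123,
             "read_at" => 1430513585000, "size" => 965 }
short = shorten_keys(document)

puts document.to_json.bytesize  # serialized size with long names
puts short.to_json.bytesize     # serialized size with short names
```

The saving per document is small, but it is multiplied by every reading stored.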

Upgrade MongoDB to 3.0

New storage engine – WiredTiger
• Before: 111 GB
• After: 23 GB – almost 5x less!
http://docs.mongodb.org/manual/release-notes/3.0/#wiredtiger

https://github.com/mongoid/mongoid/pull/3941 https://www.mongodb.com/blog/post/announcing-ruby-driver-20-rewrite

Dump readings from MongoDB

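#!/usr/bin/env python
# Python 2 script: renames the live collection, then dumps it to a gzipped JSON file.
import gzip
from time import strftime

from bson.json_util import dumps  # assumed source of dumps(): BSON-aware JSON
from pymongo import MongoClient

client = MongoClient(HOSTS, slave_okay=False)  # HOSTS is defined elsewhere
db = client.publisher
collection = db["stats.readings"]
collection_tmp = db["stats.readings_tmp"]

# After the rename, new readings land in a fresh "stats.readings" collection
collection.rename("stats.readings_tmp")

filepath = strftime("%Y%m%d") + ".json.gz"
file = gzip.open(filepath, "ab")
for i in collection_tmp.find():
    print >> file, dumps(i)
file.close()

collection_tmp.drop()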

• 1.5 million readings per day
• 750 MB of disk space
• 100 MB in archive
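A quick back-of-the-envelope check of these numbers, assuming 1 MB = 1024 × 1024 bytes and that the 750 MB figure refers to the uncompressed daily dump (both are assumptions, the slide does not say):

```ruby
READINGS_PER_DAY      = 1_500_000
RAW_BYTES_PER_DAY     = 750 * 1024 * 1024  # assumed: uncompressed daily dump
ARCHIVE_BYTES_PER_DAY = 100 * 1024 * 1024  # gzipped archive

bytes_per_reading   = RAW_BYTES_PER_DAY / READINGS_PER_DAY
compression_ratio   = RAW_BYTES_PER_DAY.to_f / ARCHIVE_BYTES_PER_DAY
readings_per_minute = READINGS_PER_DAY / (24 * 60)

puts bytes_per_reading    # average bytes per stored reading
puts compression_ratio    # how many times smaller the archive is
puts readings_per_minute  # average rate over a full day
```

The per-minute rate that comes out of this is in line with the "~ 1,000 readings / minute" figure quoted earlier in the deck.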

Store aggregated data

class Stats::Reading
  include Mongoid::Document

  INTERVAL_READING_CLASSES = [
    Stats::ReadingDaily,
    Stats::ReadingMonthly,
    Stats::UserReadingDaily,
    Stats::UserReadingMonthly,
    Stats::CityReadingHourly,
    Stats::CityReadingDaily,
    ...
  ]

  after_create :update_interval_readings

  private

  def update_interval_readings
    INTERVAL_READING_CLASSES.each do |interval_reading_class|
      interval_reading_class.increment_reading_size!(self)
    end
  end
end

class Stats::UserReadingMonthly
  include Mongoid::Document

  def self.increment_reading_size!(reading)
    read_at = reading.read_at
    canonical_read_at = Time.utc(read_at.year, read_at.mon)

    collection.where({
      u_id: reading.user_id,
      d_id: reading.document_id,
      r_at: canonical_read_at
    }).find_and_modify({ "$inc" => { s: reading.size } }, [:upsert])
  end
end
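The upsert-and-increment pattern above can be illustrated without a database. This is only an in-memory sketch: a `Hash` with a default of 0 stands in for the MongoDB collection, and `Reading` and `increment_monthly!` are hypothetical names.

```ruby
# In-memory stand-in for the monthly collection: missing keys start at 0,
# which mimics an upsert followed by "$inc".
monthly_sizes = Hash.new(0)

# Hypothetical reading record; only the fields needed for aggregation.
Reading = Struct.new(:user_id, :document_id, :read_at, :size)

def increment_monthly!(store, reading)
  # Truncate the timestamp to the first instant of its month (UTC),
  # so all readings from the same month share one key.
  canonical_read_at = Time.utc(reading.read_at.year, reading.read_at.mon)
  store[[reading.user_id, reading.document_id, canonical_read_at]] += reading.size
end

readings = [
  Reading.new(1, 10, Time.utc(2015, 5, 1, 20, 53), 965),
  Reading.new(1, 10, Time.utc(2015, 5, 17, 8, 0), 500),
  Reading.new(1, 10, Time.utc(2015, 6, 2, 9, 30), 700)
]
readings.each { |reading| increment_monthly!(monthly_sizes, reading) }

monthly_sizes.each do |(user_id, document_id, month), size|
  puts "user=#{user_id} doc=#{document_id} month=#{month.strftime('%Y-%m')} size=#{size}"
end
```

Two readings from May collapse into one counter and the June reading gets its own, which is why the aggregated collections stay much smaller than the raw readings.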

• 500 MB for 1.5 million documents
• 500 MB for 4 indices

And it worked for a while, until…

We decided to collect a wider variety of statistics

https://dribbble.com/shots/1849628-Dashboard-for-Publishers

Calculate user sessions similar to Google Analytics
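One common way to derive sessions from raw timestamps is to start a new session whenever the gap between two consecutive events exceeds an inactivity timeout (Google Analytics uses 30 minutes by default). The sketch below assumes timestamps in seconds; `sessions_for` is a hypothetical helper and not the implementation used at Bookmate.

```ruby
SESSION_TIMEOUT = 30 * 60 # seconds of inactivity that end a session

# Groups timestamps (in seconds) into sessions: a gap longer than the timeout
# starts a new session. Returns an array of [started_at, ended_at] pairs.
def sessions_for(read_ats, timeout = SESSION_TIMEOUT)
  read_ats.sort.each_with_object([]) do |read_at, sessions|
    if sessions.any? && read_at - sessions.last[1] <= timeout
      sessions.last[1] = read_at      # extend the current session
    else
      sessions << [read_at, read_at]  # start a new session
    end
  end
end

read_ats = [0, 600, 1_200, 10_000, 10_300, 50_000]
sessions = sessions_for(read_ats)
durations = sessions.map { |started_at, ended_at| ended_at - started_at }

puts sessions.inspect
puts durations.inspect
```

A session with a single reading has a duration of 0 here; whether such sessions should be counted or dropped is a product decision.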

Elasticsearch

• Document-oriented, based on Lucene (Java)
• Sharding / Replication
• REST API and JSON
• Nested aggregations
• Scripting and plugins

Who uses it?

• Runs on Lucene (Java)
• REST API and JSON
• Sharding
• Replication
• Full-text search
• Aggregations

SQL         Elasticsearch
Database    Index
Table       Type
Row         Document
Column      Field
Schema      Mapping
Index       Everything is indexed
SQL         Query DSL

Sharding / Replication
Cluster: 2 nodes, 1 replica, 3 shards
[Diagram: Node 1 and Node 2 hold primaries P1, P2, P3 and replicas R1, R2, R3]

Adding one node
Cluster: 3 nodes, 1 replica, 3 shards
[Diagram: the shards are rebalanced across Node 1, Node 2 and Node 3; Node 3 now holds R2 and P3]

Adding one replica
Cluster: 3 nodes, 2 replicas, 3 shards
[Diagram: every node holds a copy of all three shards: P1 R2 R3, R1 P2 R3, R1 R2 P3]

The number of shards can’t be changed after index creation
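The reason is routing: a document's shard is derived from a hash of its routing value (the id by default) modulo the number of primary shards. The sketch below illustrates what would happen if that number changed. CRC32 is only a dependency-free stand-in for the hash function Elasticsearch actually uses, and `shard_for` is a hypothetical helper.

```ruby
require "zlib"

# Stand-in routing function: hash of the id modulo the number of primary shards.
# Elasticsearch uses its own hash function; CRC32 is an assumption of this sketch.
def shard_for(document_id, number_of_shards)
  Zlib.crc32(document_id.to_s) % number_of_shards
end

ids = (1..1_000).map { |i| "reading-#{i}" }

# Count the documents that would be looked up on a different shard
# if the index went from 3 to 4 primary shards.
moved = ids.count { |id| shard_for(id, 3) != shard_for(id, 4) }
puts "#{moved} of #{ids.size} documents would map to a different shard"
```

Since most documents would no longer be found on the shard where they were written, changing the shard count means reindexing into a new index.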

• Near real-time analytics!
• Full-text search!
• ELK (Elasticsearch + Logstash + Kibana)!

http://giphy.com/gifs/funny-pics-expectations-O8zj64CXHojja

More RAM!

• A machine with 64 GB of RAM is the ideal sweet spot, but 32 GB and 16 GB machines are also common.
• Consider giving 32 GB to Elasticsearch and letting Lucene use the rest of memory via the OS filesystem cache. All that memory will cache segments and lead to blisteringly fast full-text search (50% for heaps, and 50% for Lucene).
https://www.elastic.co/guide/en/elasticsearch/guide/current/hardware.html
https://www.elastic.co/guide/en/elasticsearch/guide/current/heap-sizing.html

• 2 x 64 GB nodes
• 1 replica and 1 shard
• 1 day = 1 index
• 180 million documents for 4 months
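With one index per day, both indexing and querying need to turn dates into index names. A small sketch, assuming a hypothetical `readings-production` prefix and the `%Y.%m.%d` date suffix that the indexing code in this deck uses:

```ruby
require "date"

# Hypothetical prefix; the environment part would normally come from the app config.
INDEX_PREFIX = "readings-production".freeze

# Index name for the day a reading happened (read_at in Unix milliseconds).
def index_name_for(read_at_ms)
  "#{INDEX_PREFIX}-#{Time.at(read_at_ms / 1000).utc.strftime('%Y.%m.%d')}"
end

# All daily index names covering an inclusive date range, e.g. for a search request.
def index_names_between(from_date, to_date)
  (from_date..to_date).map { |date| "#{INDEX_PREFIX}-#{date.strftime('%Y.%m.%d')}" }
end

puts index_name_for(1430513585000)
puts index_names_between(Date.new(2015, 4, 29), Date.new(2015, 5, 1)).join(",")
```

Daily indices also make retention simple: dropping old data becomes deleting whole indices instead of deleting documents.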

110 MB for 1.5 million documents!

How did we make the data more than 4 times smaller?

Mapping

• Avoid nested: each nested object is indexed as a separate document
• Disable _source: the actual JSON that was used as the indexed document
• Disable _all: a field that includes the text of one or more other fields of the indexed document
• Disable analyzers and norms: norms are used to compute the relevance score of a document
• Use doc_values: they live on disk instead of in heap memory

Before and after

Before:

{
  "reading": {
    "properties": {
      "from": { "type": "float" },
      "to": { "type": "float" },
      "size": { "type": "integer" },
      "read_at": { "type": "date" },
      "user": {
        "type": "nested",
        "include_in_parent": true,
        "properties": {
          "id": { "type": "integer" },
          "birthday_at": { "type": "date" },
          "gender": { "type": "string", "index": "string_lowercase" }
        }
      },
      ...
    }
  }
}

After:

{
  "reading": {
    "_all": { "enabled": false },
    "_source": { "enabled": false },
    "properties": {
      "from": { "type": "float" },
      "to": { "type": "float" },
      "size": { "type": "integer" },
      "read_at": { "type": "date" },
      "user_id": { "type": "integer" },
      "user_birthday_at": { "type": "date" },
      "user_gender": { "type": "string", "index": "not_analyzed" },
      ...
    }
  }
}

Indexing

2,500 documents / second – not a problem at all

• Use the bulk API (optimize the batch size empirically)
• Increase or disable refresh_interval and refresh manually
• Temporarily disable replication
• Delay or disable flushes
• Increase the thread pool size for index and bulk operations
• Use templates for creating new indices
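To make the first point concrete, here is a sketch of what a bulk request body looks like on the wire: newline-delimited JSON with an action line followed by a source line for each document. `bulk_bodies` is a hypothetical helper, the tiny `BULK_SIZE` is for illustration only, and `_type` reflects the Elasticsearch versions current when this deck was written.

```ruby
require "json"

BULK_SIZE = 2 # tiny on purpose; real batch sizes are tuned empirically

# Builds newline-delimited bulk request bodies: one action line followed by
# one source line per document, ending with the trailing newline the format requires.
def bulk_bodies(documents, index:, type:, bulk_size: BULK_SIZE)
  documents.each_slice(bulk_size).map do |batch|
    batch.flat_map do |document|
      action = { index: { _index: index, _type: type, _id: document.fetch(:id) } }
      source = document.reject { |key, _| key == :id }
      [action.to_json, source.to_json]
    end.join("\n") + "\n"
  end
end

documents = [
  { id: "a", size: 965 },
  { id: "b", size: 500 },
  { id: "c", size: 700 }
]

bodies = bulk_bodies(documents, index: "readings-test-2015.05.01", type: "reading")
puts bodies.size   # number of HTTP requests that would be sent
puts bodies.first  # body of the first request
```

In practice a client library builds these bodies for you; the point is that one request carries many documents, so per-request overhead is paid once per batch.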

Parallelize indexing

file_mask = "201502*"
day_filepaths = Dir.glob("#{ READINGS_PATH }/#{ file_mask }").sort

day_filepaths.each_slice(MAX_PROCESS_COUNT) do |day_filepaths|
  Parallel.each(day_filepaths, in_processes: day_filepaths.size) do |day_filepath|
    index(day_filepath)
  end
end

https://github.com/grosser/parallel

Bulk API and templates

TEMPLATE = "readings-#{ Rails.env }-*".freeze

def index(day_filepath)
  file = MultipleFilesGzipReader.new(File.open(day_filepath))
  bulk = []

  file.each_line do |line|
    reading = parsed_reading(line)
    bulk << {
      index: {
        _type: "reading",
        _index: TEMPLATE.sub("*", reading["read_at"].strftime("%Y.%m.%d")),
        _id: reading["_id"],
        data: serialized_reading(reading)
      }
    }

    if bulk.size == BULK_SIZE
      client.bulk(body: bulk)
      bulk = []
    end
  end

  # Send the last, partially filled batch
  client.bulk(body: bulk) if bulk.any?
end

https://github.com/exAspArk/multiple_files_gzip_reader

• Try to load all data you need at once
• Use sets and hashes
http://spin.atomicobject.com/2012/09/04/when-is-a-set-better-than-an-array-in-ruby/
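A small illustration of both points, using made-up data: the rows are loaded once, indexed in a `Hash` by id, and membership checks go through a `Set` instead of scanning an array for every reading. All names and values here are hypothetical.

```ruby
require "set"

# Hypothetical rows, shaped like the result of a single bulk query.
user_rows = [
  [1, "m", 630_513_585_000],
  [2, "f", 662_688_000_000]
]
blocked_user_ids = [2, 5, 8]

# Hash keyed by id: one pass to build, then a direct lookup per reading.
users_by_id = user_rows.each_with_object({}) do |(id, gender, birthday_at), result|
  result[id] = { "id" => id, "gender" => gender, "birthday_at" => birthday_at }
end

# Set for membership checks instead of Array#include?, which scans the array.
blocked = blocked_user_ids.to_set

readings = [{ "user_id" => 1 }, { "user_id" => 2 }, { "user_id" => 3 }]
indexable = readings.select do |reading|
  users_by_id.key?(reading["user_id"]) && !blocked.include?(reading["user_id"])
end

puts indexable.inspect
```

With millions of readings per file, replacing a per-reading database query or array scan with a hash lookup is usually where most of the indexing time is saved.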

USER_ATTR_NAMES = %w(id gender birthday_at)

def user_attrs_by_id
  @user_attrs_by_id ||= begin
    users_attr_values = User.pluck(*USER_ATTR_NAMES)
    user_attrs_by_id = users_attr_values.inject({}) do |result, user_attr_values|
      result[user_attr_values.first] = Hash[USER_ATTR_NAMES.zip(user_attr_values)]
      result
    end
  end
end

def serialized_reading(reading)
  user = user_attrs_by_id[reading["user_id"]]

  {
    from: reading["from"].to_f,
    to: reading["to"].to_f,
    size: reading["size"].to_i,
    read_at: Time.at(reading["read_at"].to_i / 1_000).utc,
    user_id: user["id"],
    user_birthday_at: user["birthday_at"],
    user_gender: user["gender"].downcase,
    ...
  }
end

Marvel https://www.elastic.co/products/marvel

Building requests is fun!

Kibana https://www.elastic.co/products/kibana

MapReduce with scripting

aggs: {
  user_ids: {
    terms: { field: "user_id" },
    aggs: {
      by_days: {
        date_histogram: { field: "read_at", interval: "1d" },
        aggs: {
          sessions: {
            scripted_metric: {
              init_script: "_agg['read_ats'] = []",
              map_script: "_agg.read_ats.add(doc['read_at'].value)",
              combine_script: oneliner(%Q{
                sessions = []
                if (_agg.read_ats.size() < 2) { return sessions }
                for (read_at in _agg.read_ats) { sessions << ... }
                return sessions
              }),
              reduce_script: oneliner(%Q{
                sessions = []
                stats = [:]
                for (shard_sessions in _aggs) { sessions.addAll(shard_sessions) }
                if (sessions.size() == 0) { return stats }
                stats.average = sessions.sum() / sessions.size()
                return stats
              })
            }
          }
        }
      }
    }
  }
}

https://gist.github.com/exAspArk/c325bb9a75dcda5c8212

Percolator
[Diagram comparing Search and Percolation; labels: Docs (D1, D2), Query, Response]
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-percolate.html

Warmers
• Filter cache
• Filesystem cache
• Loading field data for fields
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-warmers.html

Plugins
• Analysis (Russian, Chinese, Unicode)
• Transport (Redis, ZeroMQ)
• Scripting (JavaScript, Clojure)
• Site (Elasticsearch HQ, BigDesk)
• Snapshot/Restore (Hadoop HDFS, AWS S3)
• Search for similar images
• Mahout Taste-based recommendation
• Render HTML from Elasticsearch
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-plugins.html

Pay attention to

Don’t use it as your primary datastore

Don’t trust the results 100% (i.e. read the docs carefully)
https://www.elastic.co/blog/count-elasticsearch

Some statistics from Elasticsearch

How men and women read differently

Readers broken down by age

Ruby
• https://github.com/elastic/elasticsearch-ruby
• https://github.com/elastic/elasticsearch-rails
• https://github.com/karmi/retire (retired)
• https://github.com/ankane/searchkick
• https://github.com/toptal/chewy
• https://github.com/stretcher/stretcher
• etc.

Conclusion
Elasticsearch works well when you don’t know your queries in advance

Thank you!
