
Crawling, Categorizing, and Ranking 5 Billion Photos - Julio Viera, Pixable

mongodb
March 14, 2012

MongoDB New York User Group

How we did it in MySQL and why we are migrating to MongoDB.

We have two main challenges at Pixable. The first is how to access millions of photos per day from Facebook, Twitter, Instagram, and other services in the most efficient manner. The second is how to process, organize, index, and store all the metadata related to those photos. To accomplish this we rely heavily on Amazon Web Services, where we have 100+ servers running. As of today, we are processing the metadata of an average of 20 million new photos per day, which we need to compare, rank, and sort against the more than 5 billion already stored in our database. On top of this we have an API backend that serves thousands of requests per second, and a logging and analytics system that performs over 10K updates per second in our database.

To support these ever-growing database needs, we have built a highly customized MySQL cluster with semi-automatic shards and partitions, optimized every single request, and de-normalized most of our data. For large write rates we use our own publish/subscribe framework with various database levels and memory buffers (almost everything at Pixable is asynchronous). But in our quest to build a system that lets us scale in an easier, more flexible, and more solid way, we chose MongoDB as our next-generation data storage solution. In short, this is the story of how we got to this point with MySQL, what our current challenges are, and how we are progressively migrating our millions of users and billions of photos to MongoDB.

Transcript

  1. Crawling, categorizing and ranking 5 billion photos How we did

    it in MySQL and why we are migrating to MongoDB Julio Viera [email protected] | @julman99 VP of Engineering Pixable, Inc.
  2. Some of Pixable's technical challenges

    MySQL numbers: • 5 billion photos (~3 terabytes) • 20 million new photos a day • 50 million categories • 16 million writes/hour (~40 GB/hour) • 30 million reads/hour (~120 GB/hour)
    Logging and profiling: • 15k inserts/sec • Tables with more than 2 billion rows
  3. Some of Pixable's technical challenges

    • Crawling the photos from multiple services: Facebook, Twitter, Instagram, Flickr, The Telegraph, Lomography, Picplz • Storing all of the metadata in our database • Crawling offline so we can notify users about new content • Adding hashtags to the photos (1,500 hashtags/sec.)
  4. The entity-relationship model

    • Helps in understanding how to initially build a data model for a particular project • Based on normalization
    Does not take into account: • Database software characteristics • Number of rows (especially unbalanced growth of entities) • Sharding? • Traffic peaks?
  5. Hundreds of millions of rows & MySQL

    • Large joins become painful • Replication starts to lag • DELETE FROM <table> is expensive (and was even worse before 5.5) • Batch operations take hours to complete and affect the performance of the whole application • Non-primary keys work well until you need to partition or shard • Starting a new slave from scratch is a 100% manual operation • Sharding? • ALTER TABLE on huge tables: legacy columns and new columns
  6. Optimization and migration tips: basic normalized schema

    user(user): 1, 2, 3
    user_photo(user, photo): (1, A), (1, B), (1, C), (2, A), (3, A)
    photo(photo, url): (A, fb), (B, twitter), (C, flickr)
    Problems we faced with this setup: • On average, we store around 1,000 photos per user • With only 1 million users, user_photo will have over 1 billion records • Querying a single user's photos will hit several random places in the photo table (see the sketch below)
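
    A minimal sketch of the normalized read path this schema forces, assuming a pymysql connection and the table/column names from the slide; the connection parameters are placeholders.

        import pymysql

        # Hypothetical connection; host/user/password/database are placeholders.
        conn = pymysql.connect(host="db-host", user="app", password="secret", database="pixable")

        def photos_for_user(user_id):
            # Normalized read: join user_photo against photo. With ~1,000 photos
            # per user, the matching rows of `photo` are scattered all over the
            # table, so the query touches many random pages.
            sql = """
                SELECT p.photo, p.url
                FROM user_photo AS up
                JOIN photo AS p ON p.photo = up.photo
                WHERE up.user = %s
            """
            with conn.cursor() as cur:
                cur.execute(sql, (user_id,))
                return cur.fetchall()
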
  7. Optimization and migration tips: de-normalize

    Before (normalized): user(user), user_photo(user, photo), photo(photo, url)
    After (de-normalized): user(user): 1, 2, 3 and user_photo(user, photo, url): (1, A, facebook), (1, B, twitter), (1, C, flickr), (2, A, facebook), (3, A, facebook)
    • Reading all of a user's photos can now be done in one query (see the sketch below) • But user_photo will still be a count(users) x 1,000 row table
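
    A sketch of the de-normalized read path under the same assumptions (pymysql, placeholder connection parameters): the url column now lives in user_photo, so a single-table query is enough.

        import pymysql

        # Hypothetical connection; parameters are placeholders.
        conn = pymysql.connect(host="db-host", user="app", password="secret", database="pixable")

        def photos_for_user(user_id):
            # De-normalized read: no join, one table, still one row per photo.
            sql = "SELECT photo, url FROM user_photo WHERE user = %s"
            with conn.cursor() as cur:
                cur.execute(sql, (user_id,))
                return cur.fetchall()
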
  8. Optimization and migration tips: use fewer rows

    Before: user_photo(user, photo, url): (1, A, facebook), (1, B, twitter), (1, C, flickr), (2, A, facebook), (3, A, facebook)
    After: user_photos(user, photos): (1, {A: facebook, B: twitter, C: flickr}), (2, {A: facebook}), (3, {A: facebook})
    • Reading all of a user's photos can now be done in one query that retrieves a single row (see the sketch below) • count(user_photos) <= count(users) • Can't query by the url or photo columns anymore
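
    A sketch of the "fewer rows" variant: the photo map is serialized into one column, one row per user. The JSON serialization and the user_photos column names are assumptions for illustration.

        import json
        import pymysql

        # Hypothetical connection; parameters are placeholders.
        conn = pymysql.connect(host="db-host", user="app", password="secret", database="pixable")

        def save_user_photos(user_id, photos):
            # photos is a dict like {"A": "facebook", "B": "twitter"}; it is stored
            # as a single serialized blob, so count(user_photos) <= count(users).
            sql = "REPLACE INTO user_photos (user, photos) VALUES (%s, %s)"
            with conn.cursor() as cur:
                cur.execute(sql, (user_id, json.dumps(photos)))
            conn.commit()

        def load_user_photos(user_id):
            # One primary-key lookup returns everything; the trade-off is that the
            # blob cannot be filtered by photo id or url in SQL.
            with conn.cursor() as cur:
                cur.execute("SELECT photos FROM user_photos WHERE user = %s", (user_id,))
                row = cur.fetchone()
            return json.loads(row[0]) if row else {}
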
  9. Optimization and migration tips: partition your data, choosing the partition key wisely

    Table (photo*, tag, count): (A, #nyc, 3), (B, #nyc, 25), (C, #car, 30), (A, #party, 8), partitioned by photo across Partition 1, 2, 3
    Indexes: • photo: primary key • tag: secondary index
    Queries (see the sketch below): • get photo, tag where photo = A (will hit partition 1) • get all photos where tag = #nyc (will hit all partitions)
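
    One way to write this layout as MySQL DDL, shown here as Python string constants; the table name photo_tag, the column types, and the partition count are assumptions for illustration.

        # Partitioned by the photo key: lookups by photo prune to one partition,
        # while queries on the secondary index (tag) must scan every partition.
        CREATE_PHOTO_TAG = """
            CREATE TABLE photo_tag (
                photo VARCHAR(32) NOT NULL,
                tag   VARCHAR(64) NOT NULL,
                count INT NOT NULL DEFAULT 0,
                PRIMARY KEY (photo, tag),
                KEY idx_tag (tag)
            )
            PARTITION BY KEY (photo) PARTITIONS 3
        """

        # Hits a single partition: the partition key appears in the WHERE clause.
        BY_PHOTO = "SELECT photo, tag, count FROM photo_tag WHERE photo = 'A'"

        # Hits all partitions: tag is only a secondary index, so no pruning.
        BY_TAG = "SELECT photo FROM photo_tag WHERE tag = '#nyc'"
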
  10. Optimization and migration tips: inverted indexes

    Original table (photo*, tag, count): (A, #nyc, 3), (B, #nyc, 25), (C, #car, 30), (A, #party, 8)
    Inverted index (tag*, photo): (#car, C), (#nyc, A,B), (#party, A), partitioned by tag across Partition 1, 2, 3 (see the sketch below); the slide also shows an age_user table partitioned by user_id
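
    A sketch of the inverted-index table in the same style; the table name tag_photos and the serialized photo list are assumptions for illustration.

        # Inverted index: one row per tag, partitioned by tag, so the common
        # query "all photos for a tag" prunes down to a single partition.
        CREATE_TAG_PHOTOS = """
            CREATE TABLE tag_photos (
                tag    VARCHAR(64) NOT NULL,
                photos TEXT NOT NULL,        -- serialized list, e.g. 'A,B'
                PRIMARY KEY (tag)
            )
            PARTITION BY KEY (tag) PARTITIONS 3
        """

        # Hits exactly one partition, because tag is the partition key.
        PHOTOS_BY_TAG = "SELECT photos FROM tag_photos WHERE tag = '#nyc'"
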
  11. Optimization and migration tips: other

    • Optimize reads as much as you can • Always read by primary key • Avoid N:N relationships on huge tables • Make schema changes one at a time and move the data in "lazy mode" • Use the new and old architectures at the same time and move users over progressively (see the sketch below)
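
    One way to read the "lazy mode" idea, as a hypothetical sketch: read from the new architecture first, fall back to the old one, and backfill on the fly. The function and the dict-like stores are placeholders, not Pixable's actual code.

        def get_user_photos(user_id, new_store, old_store):
            # Lazy-mode migration: a user's data is copied the first time it is
            # read, so both architectures run side by side while users move over
            # progressively. new_store/old_store are any dict-like mappings.
            photos = new_store.get(user_id)
            if photos is not None:
                return photos
            photos = old_store.get(user_id)      # fall back to the old schema
            if photos is not None:
                new_store[user_id] = photos      # backfill the new schema
            return photos

        # Usage with plain dicts standing in for the two databases:
        old = {1: {"A": "facebook"}}
        new = {}
        get_user_photos(1, new, old)             # falls back to old, backfills new
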
  12. MongoDB: document based

    The MySQL row user_photos(user, photos): (1, {A: facebook, B: twitter, C: flickr}) becomes the MongoDB document (see the sketch below):
    { user_id: 1, photos: [ {id: "A", url: "facebook"}, {id: "B", url: "twitter"}, {id: "C", url: "flickr"} ] }
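
    The same document written from Python with pymongo; the URI and the database/collection names are illustrative.

        from pymongo import MongoClient

        client = MongoClient("mongodb://localhost:27017")   # placeholder URI
        user_photos = client["pixable"]["user_photos"]      # illustrative names

        # One document per user; the photo list is embedded instead of joined.
        user_photos.insert_one({
            "user_id": 1,
            "photos": [
                {"id": "A", "url": "facebook"},
                {"id": "B", "url": "twitter"},
                {"id": "C", "url": "flickr"},
            ],
        })

        # Reading a user's photos is a single lookup by user_id.
        doc = user_photos.find_one({"user_id": 1})
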
  13. MongoDB: built-in sharding

    MySQL: user_photos(user_id, photos): (1, {A: facebook, B: twitter, C: flickr}), (2, {B: twitter}), (3, {C: flickr, D: facebook}), (4, {E: facebook}), manually partitioned by user_id across Partition 1, 2, 3
    MongoDB, shard field user_id (see the sketch below): shard 1: {user_id: 1, photos: [...]}, {user_id: 4, photos: [...]} | shard 2: {user_id: 2, photos: [...]} | shard 3: {user_id: 3, photos: [...]}
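
    A minimal sketch of turning on MongoDB's built-in sharding for that collection from Python, assuming the client is connected to a mongos router; the URI and database/collection names are illustrative.

        from pymongo import MongoClient

        client = MongoClient("mongodb://mongos-host:27017")  # placeholder mongos URI

        # Shard user_photos on user_id; the balancer then spreads chunks of
        # users across the shards automatically (an index on the shard key is
        # required, and is created automatically on an empty collection).
        client.admin.command("enableSharding", "pixable")
        client.admin.command("shardCollection", "pixable.user_photos",
                             key={"user_id": 1})
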
  14. Why we are moving to MongoDB as our main data storage

    • Mongo has built-in sharding • Replica sets provide automatic data cloning, synchronization, and PRIMARY failover • Our data fits perfectly into the MongoDB document paradigm • Schema-less • Easier to run many small machines; failures or maintenance are less traumatic • Background index creation (see the sketch below)
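
    A small pymongo sketch of background index creation; the collection and field names are illustrative, and the background flag reflects the pre-4.2 MongoDB behavior the talk refers to (newer servers build all indexes in a non-blocking way and ignore it).

        from pymongo import MongoClient

        client = MongoClient("mongodb://localhost:27017")   # placeholder URI
        user_photos = client["pixable"]["user_photos"]      # illustrative names

        # Build a secondary index without blocking reads/writes on the collection.
        user_photos.create_index([("photos.id", 1)], background=True)
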
  15. Why we are moving to MongoDB as our main data storage

    • Building backups (not including photos): MySQL: 45 minutes, 1 slave down (not using Xtrabackup); MongoDB: 30 minutes, full non-blocking backup built in • Launching a new slave: MySQL: at least 3 hours, a lot of manual intervention; MongoDB: 20 minutes tops, very little manual intervention • Creating a new shard and balancing data across shards: MySQL: ???; MongoDB: 20 minutes tops
  16. Considerations: MySQL vs. MongoDB

    • MySQL (InnoDB) supports row-level locking; MongoDB, in contrast, uses global-level locking (diagram: one big MySQL machine vs. several small MongoDB machines)
  17. Considerations: MySQL vs. MongoDB

    • Start from the beginning with a sharded (1-shard) setup, at almost no extra cost
  18. Considerations: MySQL vs. MongoDB

    • Start from the beginning with a sharded (1-shard) setup, at almost no extra cost
  19. Considerations: MySQL vs. MongoDB

    • In MySQL, 123 is equal to "123"; in MongoDB it is not (see the sketch below) • Secondary indexes have the same effect on MySQL partitions as on MongoDB shards • Transactions, atomicity
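
    A small pymongo sketch of the type-strictness point; the collection and field names are illustrative.

        from pymongo import MongoClient

        client = MongoClient("mongodb://localhost:27017")   # placeholder URI
        coll = client["pixable"]["examples"]                 # illustrative names

        coll.insert_one({"user_id": "123"})                  # stored as a string

        # MySQL would match WHERE user_id = 123 against '123'; MongoDB also
        # compares types, so the integer query finds nothing.
        coll.find_one({"user_id": 123})     # -> None
        coll.find_one({"user_id": "123"})   # -> the document above
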