Crawling, Categorizing, and Ranking 5 Billion Photos - Julio Viera, Pixable

MongoDB New York User Group

How we did it in MySQL and why we are migrating to MongoDB. We have two main challenges at Pixable. The first is how to access millions of photos per day from Facebook, Twitter, Instagram, and other services in the most efficient manner. The second is how to process, organize, index, and store all the metadata related to those photos. To accomplish this, we rely heavily on Amazon Web Services, where we have 100+ servers running. As of today, we are processing the metadata of an average of 20 million new photos per day, which we need to compare, rank, and sort against the more than 5 billion already stored in our database. On top of this we have an API backend that serves thousands of requests per second, and a logging and analytics system that performs over 10K updates per second against our database. To support our ever-growing database needs, we have built a highly customized MySQL cluster with semi-automatic shards and partitions, optimized every single request, and de-normalized most of our data. For large write rates we use our own publish/subscribe frameworks with various database levels and memory buffers (almost everything at Pixable is asynchronous). But in our quest to build a system that lets us scale in an easier, more flexible, and more solid way, we chose MongoDB as our next-generation data storage solution. In short, this is the story of how we got to this point with MySQL, what our current challenges are, and how we are progressively migrating our millions of users and billions of photos to MongoDB.

mongodb

March 14, 2012

Transcript

  1. Crawling, categorizing and ranking 5 billion photos: How we did
     it in MySQL and why we are migrating to MongoDB
     Julio Viera [email protected] | @julman99, VP of Engineering, Pixable, Inc.
  2. Some of Pixable's technical challenges: MySQL numbers
     • 5 billion photos (~3 Terabytes)
     • 20 million new photos a day
     • 50 million categories
     • 16 million writes/hour (~40GB/hour)
     • 30 million reads/hour (~120GB/hour)
     Logging and Profiling:
     • 15k inserts/sec.
     • Tables with more than 2 billion rows
  3. Some of Pixable's technical challenges
     • Crawling the photos from multiple services:
       - Facebook
       - Twitter
       - Instagram, Flickr, The Telegraph, Lomography, Picplz
     • Storing all of the metadata in our database
     • Offline crawl so we can notify users about new content
     • Adding hashtags to the photos (1,500 hashtags/sec.)
  4. The entity-relationship model
     • Helps you understand how to initially build a data model for a particular project
     • Based on normalization
     Does not take into account:
     • Database software characteristics
     • Number of rows (especially unbalanced growth of entities)
     • Sharding?
     • Traffic peaks?
  5. Hundreds of millions of rows & MySQL
     • Large joins become painful
     • Replication starts to lag
     • DELETE FROM <table> is expensive (and was even worse before 5.5)
     • Batch operations take hours to complete and affect the whole application's performance
     • Non-primary keys work well until you need to partition or shard
     • Starting a new slave from scratch is a 100% manual operation
     • Sharding?
     • ALTER TABLE on huge tables: legacy columns and new columns
  6. Optimization and migration tips: basic normalized schema

     user (user)     user_photo (user, photo)     photo (photo, url)
     1               1  A                         A  fb
     2               1  B                         B  twitter
     3               1  C                         C  flickr
                     2  A
                     3  A

     Problems we faced with this setup:
     • On average, we store around 1000 photos per user.
     • With only 1MM users, user_photo will have >1 billion records.
     • Querying a single user's photos will hit several random places on the photo table.
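
     A minimal sketch of what this basic normalized schema and the per-user read could look
     like in MySQL, driven from Python. Only the table and column names come from the slide;
     the column types, the InnoDB engine, and the pixable database/connection details are
     assumptions for illustration, not Pixable's actual schema.

        # Hypothetical DDL for the basic normalized schema shown on the slide.
        import mysql.connector  # assumes a reachable MySQL server

        DDL = """
        CREATE TABLE user (
            user_id  BIGINT UNSIGNED NOT NULL PRIMARY KEY
        ) ENGINE=InnoDB;
        CREATE TABLE photo (
            photo_id VARCHAR(32)  NOT NULL PRIMARY KEY,   -- 'A', 'B', 'C' in the example
            url      VARCHAR(255) NOT NULL                -- 'fb', 'twitter', 'flickr'
        ) ENGINE=InnoDB;
        CREATE TABLE user_photo (
            user_id  BIGINT UNSIGNED NOT NULL,
            photo_id VARCHAR(32)     NOT NULL,
            PRIMARY KEY (user_id, photo_id)
        ) ENGINE=InnoDB;
        """

        # Reading one user's photos needs a join, so it touches random pages of photo.
        READ_USER_PHOTOS = """
        SELECT p.photo_id, p.url
          FROM user_photo up
          JOIN photo p ON p.photo_id = up.photo_id
         WHERE up.user_id = %s
        """

        conn = mysql.connector.connect(host="localhost", user="root", database="pixable")
        cur = conn.cursor()
        for stmt in filter(str.strip, DDL.split(";")):
            cur.execute(stmt)
        cur.execute(READ_USER_PHOTOS, (1,))
        print(cur.fetchall())
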
  7. Optimization and migration tips: de-normalize

     Normalized (as before):
     user (user)     user_photo (user, photo)     photo (photo, url)
     1               1  A                         A  fb
     2               1  B                         B  twitter
     3               1  C                         C  flickr
                     2  A
                     3  A

     De-normalized (url copied into user_photo):
     user (user)     user_photo (user, photo, url)
     1               1  A  facebook
     2               1  B  twitter
     3               1  C  flickr
                     2  A  facebook
                     3  A  facebook

     • Reading all of a user's photos can now be done in one query
     • But user_photo will still be a count(users) x 1000 table
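
     A sketch of the de-normalization step under the same assumed schema as the previous
     sketch: url is copied into user_photo at write time, so reads stop joining against
     photo. The ALTER/backfill statements and connection details are illustrative.

        import mysql.connector  # same hypothetical database as the previous sketch

        conn = mysql.connector.connect(host="localhost", user="root", database="pixable")
        cur = conn.cursor()

        # Copy url into user_photo so reads no longer need to join against photo.
        cur.execute("ALTER TABLE user_photo ADD COLUMN url VARCHAR(255) NOT NULL DEFAULT ''")
        cur.execute("""
            UPDATE user_photo up
              JOIN photo p ON p.photo_id = up.photo_id
               SET up.url = p.url
        """)  # one-time backfill; from now on every insert must also write url

        # Reading all of a user's photos is now a single-table, primary-key-prefix query...
        cur.execute("SELECT photo_id, url FROM user_photo WHERE user_id = %s", (1,))
        print(cur.fetchall())
        # ...but user_photo still holds roughly count(users) * 1000 rows.
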
  8. Optimization and migration tips: use fewer rows

     user_photo (user, photo, url)     user_photos (user, photos)
     1  A  facebook                    1  {A: facebook, B: twitter, C: flickr}
     1  B  twitter                     2  {A: facebook}
     1  C  flickr                      3  {A: facebook}
     2  A  facebook
     3  A  facebook

     • Reading all of a user's photos can now be done in one query, retrieving one single row
     • count(user_photos) <= count(users)
     • Can't query by the url or photo columns
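
     One way to collapse this to a single row per user is to serialize the photo-to-url map
     into one column. The slide does not say which serialization Pixable used; JSON here is
     an assumption, as are the column types and connection details.

        import json
        import mysql.connector  # hypothetical database, as in the earlier sketches

        conn = mysql.connector.connect(host="localhost", user="root", database="pixable")
        cur = conn.cursor()

        # One row per user; the photo -> url map is stored as an opaque serialized blob.
        cur.execute("""
            CREATE TABLE user_photos (
                user_id BIGINT UNSIGNED NOT NULL PRIMARY KEY,
                photos  MEDIUMTEXT      NOT NULL
            ) ENGINE=InnoDB
        """)

        cur.execute("INSERT INTO user_photos (user_id, photos) VALUES (%s, %s)",
                    (1, json.dumps({"A": "facebook", "B": "twitter", "C": "flickr"})))
        conn.commit()

        # Reading all of a user's photos now fetches exactly one row by primary key...
        cur.execute("SELECT photos FROM user_photos WHERE user_id = %s", (1,))
        print(json.loads(cur.fetchone()[0]))
        # ...but the database can no longer filter by photo id or url.
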
  9. Optimization and migration tips: partition your data, wisely choosing a partition key

     Table (photo*, tag, count), partitioned by photo across Partition 1, 2 and 3:
     A  #nyc    3
     B  #nyc    25
     C  #car    30
     A  #party  8

     Indexes:
     • photo: primary key
     • tag: secondary index
     Queries:
     • get photo, tag where photo = A (will hit partition 1)
     • get all photo where tag = #nyc (will hit all partitions)
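
     A sketch of the partitioned table in MySQL. The table name photo_tag, the column types,
     and the PARTITION BY KEY ... PARTITIONS 3 clause are assumptions; the slide only shows
     the columns, photo leading the key, tag as a secondary index, and three partitions.

        import mysql.connector  # hypothetical database, as in the earlier sketches

        conn = mysql.connector.connect(host="localhost", user="root", database="pixable")
        cur = conn.cursor()

        cur.execute("""
            CREATE TABLE photo_tag (
                photo   VARCHAR(32)  NOT NULL,
                tag     VARCHAR(64)  NOT NULL,
                `count` INT UNSIGNED NOT NULL,
                PRIMARY KEY (photo, tag),        -- photo leads the key, as on the slide
                KEY idx_tag (tag)                -- secondary index, local to each partition
            ) ENGINE=InnoDB
            PARTITION BY KEY (photo) PARTITIONS 3
        """)

        # get photo, tag where photo = 'A'  -> pruned to the single partition holding 'A'
        cur.execute("SELECT photo, tag FROM photo_tag WHERE photo = %s", ("A",))

        # get all photo where tag = '#nyc'  -> the secondary index only exists per partition,
        # so MySQL has to look at every partition
        cur.execute("SELECT photo FROM photo_tag WHERE tag = %s", ("#nyc",))
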
  10. Optimization and migration tips: inverted indexes

      Original table (photo*, tag, count), partitioned by photo:
      A  #nyc    3
      B  #nyc    25
      C  #car    30
      A  #party  8

      Inverted index (tag*, photo), partitioned by tag:
      #car    C
      #nyc    A,B
      #party  A

      (Also on the slide: a table partitioned by user_id across Partition 1, 2 and 3, and an
      age_user inverted index.)
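
      A sketch of maintaining the inverted index as its own table, partitioned by the key we
      actually query by. The table name, types, the comma-separated photo list, and the
      upsert statement are illustrative, not the deck's exact implementation.

        import mysql.connector  # hypothetical database, as in the earlier sketches

        conn = mysql.connector.connect(host="localhost", user="root", database="pixable")
        cur = conn.cursor()

        # Inverted index: one row per tag, photos kept de-normalized in a text column,
        # and the table partitioned by the column we query by.
        cur.execute("""
            CREATE TABLE tag_photos (
                tag    VARCHAR(64) NOT NULL PRIMARY KEY,
                photos TEXT        NOT NULL
            ) ENGINE=InnoDB
            PARTITION BY KEY (tag) PARTITIONS 3
        """)

        # Maintain the list as photos get tagged ('#nyc' -> 'A,B', as on the slide).
        cur.execute("""
            INSERT INTO tag_photos (tag, photos) VALUES (%s, %s)
            ON DUPLICATE KEY UPDATE photos = CONCAT(photos, ',', VALUES(photos))
        """, ("#nyc", "A"))
        conn.commit()

        # "All photos with tag X" is now a primary-key read that hits exactly one partition.
        cur.execute("SELECT photos FROM tag_photos WHERE tag = %s", ("#nyc",))
        print(cur.fetchone())
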
  11. Optimization and migration tips: other
      • Optimize reads as much as you can.
      • Always read by primary key.
      • Avoid N-N relationships on huge tables.
      • Make changes to your schema one at a time and move the data in "lazy mode".
      • Use the new and old architectures at the same time and move users progressively
        (a lazy-migration sketch follows below).
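
      A sketch of the "lazy mode" migration mentioned above, using the assumed user_photo and
      user_photos tables from the earlier sketches: reads are served from the new schema when
      the user is already there, otherwise the old rows are read, copied over, and served.

        import json
        import mysql.connector  # hypothetical database, as in the earlier sketches

        conn = mysql.connector.connect(host="localhost", user="root", database="pixable")
        cur = conn.cursor()

        def get_user_photos(user_id):
            """Serve from the new schema if this user was already migrated,
            otherwise read the old schema, copy the data over, then serve it."""
            cur.execute("SELECT photos FROM user_photos WHERE user_id = %s", (user_id,))
            row = cur.fetchone()
            if row:                              # already moved to the new architecture
                return json.loads(row[0])

            cur.execute("SELECT photo_id, url FROM user_photo WHERE user_id = %s", (user_id,))
            photos = {photo_id: url for photo_id, url in cur.fetchall()}

            cur.execute("INSERT INTO user_photos (user_id, photos) VALUES (%s, %s)",
                        (user_id, json.dumps(photos)))    # move this user lazily
            conn.commit()
            return photos
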
  12. MongoDB: document based

      MySQL row, user_photos (user, photos):
      1  {A: facebook, B: twitter, C: flickr}

      MongoDB document:
      {
        user_id: 1,
        photos: [
          {id: "A", url: "facebook"},
          {id: "B", url: "twitter"},
          {id: "C", url: "flickr"}
        ]
      }
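
      The same data loaded as a MongoDB document through pymongo. The pixable database, the
      user_photos collection name, and the local connection string are illustrative.

        from pymongo import MongoClient  # assumes a local mongod

        db = MongoClient("mongodb://localhost:27017")["pixable"]

        # The MySQL row  1 | {A: facebook, B: twitter, C: flickr}  becomes one document:
        db.user_photos.insert_one({
            "user_id": 1,
            "photos": [
                {"id": "A", "url": "facebook"},
                {"id": "B", "url": "twitter"},
                {"id": "C", "url": "flickr"},
            ],
        })

        # All of a user's photos come back in a single document read:
        print(db.user_photos.find_one({"user_id": 1}))

        # Unlike the serialized MySQL column, the embedded photos stay queryable:
        print(db.user_photos.find_one({"photos.url": "twitter"}))
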
  13. MongoDB: built-in sharding

      MySQL: user_photos (user_id, photos), partitioned by user_id across Partition 1, 2 and 3:
      1  {A: facebook, B: twitter, C: flickr}
      2  {B: twitter}
      3  {C: flickr, D: facebook}
      4  {E: facebook}

      MongoDB, shard field user_id:
      shard 1: {user_id: 1, photos: [...]}, {user_id: 4, photos: [...]}
      shard 2: {user_id: 2, photos: [...]}
      shard 3: {user_id: 3, photos: [...]}
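
      What built-in sharding looks like from the driver, as a sketch: instead of hand-rolled
      partitions, you declare the shard key once and MongoDB places and balances the
      documents. Assumes a sharded cluster reachable through a mongos router; the database
      and collection names and the connection string are illustrative.

        from pymongo import MongoClient  # assumes a mongos router for a sharded cluster

        client = MongoClient("mongodb://localhost:27017")

        # Declare the shard key once; MongoDB distributes and balances the chunks.
        client["pixable"].user_photos.create_index("user_id")  # shard key needs an index
        client.admin.command("enableSharding", "pixable")
        client.admin.command("shardCollection", "pixable.user_photos", key={"user_id": 1})

        # Queries that include the shard key are routed to a single shard:
        print(client["pixable"].user_photos.find_one({"user_id": 3}))
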
  14. Why we are moving to MongoDB as our main data storage
      • Mongo has built-in sharding
      • ReplicaSets feature automatic data cloning, synchronization and PRIMARY failover
      • Our data fits perfectly in the MongoDB document paradigm
      • Schema-less
      • Easier to have many small machines; failures or maintenance are less traumatic
      • Background index creation (see the sketch below)
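
      A small sketch of the background index creation mentioned above, via pymongo; the
      collection, field name, and connection string are illustrative.

        from pymongo import MongoClient  # assumes a local mongod

        db = MongoClient("mongodb://localhost:27017")["pixable"]

        # Build a secondary index without blocking the collection while it builds
        # (the background option of the MongoDB versions of that era).
        db.user_photos.create_index([("photos.url", 1)], background=True)
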
  15. Why we are moving to MongoDB as our main data storage
      • Building backups (not including photos):
        MySQL: 45 minutes, 1 slave down (not using Xtrabackup).
        MongoDB: 30 minutes, full non-blocking backup built in.
      • Launching a new slave:
        MySQL: at least 3 hours, a lot of manual intervention.
        MongoDB: 20 minutes tops, very little manual intervention.
      • Creating a new shard and balancing data across shards:
        MySQL: ???
        MongoDB: 20 minutes tops.
  16. Considerations: mysql / mongodb
      • MySQL (InnoDB) supports row-level locking; MongoDB, instead, does global-level locking.
      (Diagram: one big MySQL machine vs. several small Mongo machines.)
  17. Considerations: mysql / mongodb
      • Start from the beginning with a sharded (1 shard) setup, at almost no extra cost.
  18. Considerations: mysql / mongodb
      • Start from the beginning with a sharded (1 shard) setup, at almost no extra cost.
  19. Considerations: mysql / mongodb
      • In MySQL, 123 is equal to "123"; in MongoDB it is not (see the sketch below).
      • Secondary indexes have the same effect on MySQL partitions as on MongoDB shards.
      • Transactions, atomicity.
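
      A small sketch of the first point: MySQL coerces 123 and "123" in comparisons, while
      MongoDB matches by type as well as value. Names and connection string are illustrative.

        from pymongo import MongoClient  # assumes a local mongod

        db = MongoClient("mongodb://localhost:27017")["pixable"]
        db.user_photos.insert_one({"user_id": 1, "photos": []})

        print(db.user_photos.find_one({"user_id": 1}))    # matches the document
        print(db.user_photos.find_one({"user_id": "1"}))  # None: the string "1" is not the number 1
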