
MongoDB - data distribution (2014)

The NoSQL database MongoDB is a good foundation for a flexible and scalable application. In particular, scaling your backend with MongoDB can be quite straightforward if you choose a good distribution criterion. This talk explains the idea behind sharding and shows different ways data can be distributed.

This deck is based on the version originally published in November 2013, but contains updates made in 2014.

Stefan Rudnitzki

June 02, 2014

Transcript

  1. _talk:me • Stefan Rudnitzki • job: developer @hypoport • spare time: organizer @MUGBerlin • interests: Java, search, distributed systems, NoSQL, Vagrant, Puppet
  2. _talk:basic intro • documents stored in collections (like tables in relational databases) • reliable (built-in replication) • scalable (see this talk)
  3. _talk:aim • "Less than 10 % of MongoDB users are using sharding" • main goal: reduce anxiety
  4. _talk:agenda • Sharding • Pros/Cons • UseCase • Practical Experience https://www.iconfinder.com/icons/171757/calendar_icon
  5. _sharding:terminology • mongod (data) • shard (subset of data) • replica set (replication) [diagram: shard01 = a replica set of three mongod processes]
  6. _sharding:example • server: 1 TB disk, 64 GB RAM • possible dataset: < 950 GB, est. index size < 56 GB https://www.iconfinder.com/icons/171754/data_icon
  7. _sharding:example • dataset grows to 2.3 TB, est. index size 275 GB • distribution approaches • 3 shards (dataset) • 6 shards (index size) (rough arithmetic in a note after the transcript) https://www.iconfinder.com/icons/171754/data_icon
  8. _sharding:activity [diagram: a mongos router sends incoming documents (d, u, n) to Shard01, Shard02 and Shard03, which hold key ranges a..b, c..g, h..l, m..r, s..t and u..z]
  9. _sharding:pros • scale to handle "load" • distribute reads • distribute writes • self-defined distribution criteria
  10. _sharding:cons • only a single opportunity to define the distribution criteria (the shard key) • RAM limits • monitoring is a key factor • sharding does not speed up everything • sharding does not make things easier!
  11. _usecase:setup • Try it and test with real data! • example with Vagrant/Puppet (3 mongod, mongos, config server) • online (with documentation): https://github.com/strud/vagrant-machines (setup sketch after the transcript)
  12. _usecase:data • URL • title (text) • abstract (full text) (example document after the transcript) https://www.iconfinder.com/icons/171735/note_icon
  13. _usecase:candidates • artificial keys, e.g. count(docs) % number of shards, or a random ID • optimal distribution while importing • but: what happens when adding shards? • alternative: _id (sketch after the transcript)
  14. _sharding:_id • a lot of balancer action • not evenly balanced [chart: document count per shard (shard00, shard01, shard02), scale 0 to 3,000,000]
  15. _sharding:title • balancer running at the beginning (to level out) • not balanced • best performance with title queries (range-key sketch after the transcript) [chart: document count per shard (shard00, shard01, shard02), scale 0 to 1,800,000]
  16. _sharding:url • balancer running at the beginning (to level out) • not balanced • best performance with url queries [chart: document count per shard (shard00, shard01, shard02), scale 0 to 1,800,000]
  17. _sharding:hash(title) • (no) balancer action • (nearly) balanced • max. insert performance • but: not every query possible (hashed-key sketch after the transcript) [chart: document count per shard (shard00, shard01, shard02), scale 0 to 1,800,000]
  18. ?

  19. _talk:conclusion • Sharding can increase the amount of data a system can handle! • Sharding may increase query performance!
  20. _talk:conclusion • shard as early as possible • early experiments yield lots of insights • fast reset possible • even on one server • Scaling is easier because you are already sharded!
  21. _talk:conclusion • hashed sharding keys = great for easy balancing • not all queries supported • performance is not optimal • queries cannot profit from the hash-based index
  22. _talk:conclusion • self-defined shard key = best performance • a bad sharding-key decision = possible problems • recommendation: a few random bytes and then a use-case-specific part (sketch after the transcript)
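
Rough arithmetic behind the shard counts on slide 7, assuming the per-server limits from slide 6 (about 950 GB of data and about 56 GB of RAM for indexes per machine): 2.3 TB / 950 GB is roughly 2.4, so at least 3 shards are needed to hold the data; 275 GB / 56 GB is roughly 4.9, so about 5 to 6 shards are needed if the indexes should stay in RAM, which is how the deck arrives at 6 shards when sizing by index.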
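
The sketches below are run in a mongo shell connected to the mongos of a test cluster like the one on slide 11 (3 mongod shards, a mongos and a config server). The host names, the database name crawl and the collection name pages are assumptions for illustration; they do not come from the deck.

    // register the three shards with the mongos and enable sharding for a
    // database ("crawl" and the host names are placeholders)
    sh.addShard("shard01.local:27017")
    sh.addShard("shard02.local:27017")
    sh.addShard("shard03.local:27017")
    sh.enableSharding("crawl")
    sh.status()   // shows shards, databases and how chunks are distributed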
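
A hypothetical document matching the data model from slide 12 (URL, title, abstract); the field values are made up:

    db.pages.insert({
      url: "http://example.org/article-1",            // URL
      title: "Example article",                       // title (text)
      abstract: "Full-text abstract of the page ..."  // abstract (full text)
    })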
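
Sharding on _id as discussed on slides 13 and 14 is a single shell command; a minimal sketch, assuming the crawl.pages namespace from above:

    // range-based sharding on _id; with ObjectIds the key grows
    // monotonically, so new documents always hit the chunk with the highest
    // range, which explains the heavy balancer activity shown on slide 14
    sh.shardCollection("crawl.pages", { _id: 1 })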
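
Slides 15 and 16 use a plain range-based key on a document field; the same command with title (or url) as the key, again assuming the crawl.pages namespace:

    // range-based shard key on title (use { url: 1 } for slide 16)
    sh.shardCollection("crawl.pages", { title: 1 })

    // queries filtering on title can be routed to a single shard
    db.pages.find({ title: "Example article" })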
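
The hashed shard key from slide 17 (available since MongoDB 2.4); a sketch with the assumed namespace:

    // hashed shard key on title: inserts spread almost evenly across the
    // shards, so there is little balancer work and good insert performance
    sh.shardCollection("crawl.pages", { title: "hashed" })

    // equality queries on title are still targeted to one shard
    db.pages.find({ title: "Example article" })
    // range queries on title are scatter-gather across all shards
    db.pages.find({ title: { $gte: "A", $lt: "B" } })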
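
One way to read the recommendation on slide 22 (a few random bytes plus a use-case-specific part) is a compound shard key whose first field the application fills with a small random value; the field names prefix and url are assumptions, not from the deck:

    // compound shard key: a small random prefix spreads inserts across the
    // shards, the use-case-specific field comes second
    sh.shardCollection("crawl.pages", { prefix: 1, url: 1 })

    // the application generates the random prefix itself on insert
    db.pages.insert({
      prefix: Math.floor(Math.random() * 256),   // "a few random bytes"
      url: "http://example.org/article-1",
      title: "Example article"
    })
    // note: queries that do not include prefix are scatter-gather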