MongoDB - data distribution (2014)

The NoSQL database MongoDB is a great basis for a flexible and scalable application. Scaling your backend with MongoDB can be quite easy, provided you choose a good distribution criterion. This talk explains the idea behind sharding and shows how data can be distributed.

This deck is based on the one originally published in November 2013, but contains updates made in 2014.

Stefan Rudnitzki

June 02, 2014

Transcript

  1. MongoDB data distribution Stefan Rudnitzki

  2. _talk:me • Stefan Rudnitzki • job: developer @hypoport • spare

    time: organizer @MUGBerlin • interests: Java, search, distributed systems, NoSQL, Vagrant, Puppet
  3. _talk:basic intro • MongoDB • document oriented NoSQL store •

    query language • indexes
  4. _talk:basic intro • documents stored in collections (tables in relational)

    • reliable (built-in replication) • scalable (see this talk)
  5. _talk:aim • How do you scale your data with MongoDB?

    • Sharding!
  6. _talk:aim • „Less than 10 % of MongoDB users are

    using sharding“ • main goal: reduce anxiety
  7. _talk:agenda • Sharding • Pros/Cons • UseCase • Practical Experience

    https://www.iconfinder.com/icons/171757/calendar_icon
  8. Sharding

  9. _sharding:basics • scale • spread documents over multiple nodes •

    optimize reads/writes
  10. _sharding:terminology

  11. _sharding:terminology • mongod (data) • shard (subset of data)

    • (diagram: shard01 consisting of three mongod processes)
  12. _sharding:terminology • mongod (data) • shard (subset of data) •

    replica set (replication) • (diagram: shard01 as a replica set of three mongod processes)
  13. _sharding:terminology • mongos (sharding proxy) • configserver (metadata)

    • (diagram: a mongos in front of the three configservers config01, config02 and config03)
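
    A minimal mongo-shell sketch of how these pieces are wired together, assuming hypothetical hostnames and the default ports:

      // Run against the mongos router, which is started against the three
      // configservers, e.g.:
      //   mongos --configdb config01:27019,config02:27019,config03:27019
      // Register a shard, here the replica set "shard01" with three mongod members:
      sh.addShard("shard01/mongod-a:27018,mongod-b:27018,mongod-c:27018")
      // Overview of shards, configserver metadata and chunk distribution:
      sh.status()
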
  14. _sharding:terminology

  15. _sharding:terminology • sharding-key: criteria for distribution • chunk: physical piece

    of data ?
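
    A hedged mongo-shell sketch of these two terms, assuming a hypothetical wiki database with an abstracts collection:

      // Sharding is enabled per database, then per collection with a sharding-key:
      sh.enableSharding("wiki")
      sh.shardCollection("wiki.abstracts", { title: 1 })   // range-based sharding-key
      // Documents are grouped into chunks (by default up to 64 MB each),
      // each chunk covering a contiguous range of the sharding-key.
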
  16. _sharding:example • server: 1 TB, 64 GB RAM • possible

    dataset: < 950 GB, est. index size < 56 GB https://www.iconfinder.com/icons/171754/data_icon
  17. _sharding:example • dataset grows to 2.3 TB, est. index size

    275 GB • distribution approaches • 3 shards (dataset) • 6 shards (index size) https://www.iconfinder.com/icons/171754/data_icon
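
    The rough arithmetic behind the two approaches, spelled out as a sketch (the split into 6 shards presumably leaves some headroom beyond the pure division):

      // Numbers from the two example slides:
      var datasetTB = 2.3, usablePerServerTB = 0.95   // < 950 GB usable data per server
      var indexGB   = 275, usableRamGB       = 56     // est. index size vs. usable RAM
      Math.ceil(datasetTB / usablePerServerTB)        // 3    -> 3 shards by dataset size
      indexGB / usableRamGB                           // ~4.9 -> at least 5, planned: 6 shards
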
  18. _sharding:activity • (diagram: three shards Shard01, Shard02, Shard03 holding the

    key ranges a..c, d..u and v..z; a mongos with three incoming documents whose target shards are still unknown)
  19. _sharding:activity • (diagram: the mongos routes the documents d, u and n to

    the shards whose key ranges a..c, d..u and v..z match)
  20. _sharding:activity • (diagram: the balancer starts redistributing chunks between

    Shard01, Shard02 and Shard03)
  21. _sharding:activity • (diagram: after balancing, the key ranges a..b, c..g, h..l,

    m..r, s..t and u..z are spread over the three shards; the mongos still routes the documents d, u and n to the matching ranges)
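
    What these activity slides show can be inspected from the mongo shell; a small sketch, assuming the hypothetical wiki.abstracts collection from above:

      sh.getBalancerState()     // is the balancer enabled at all?
      sh.isBalancerRunning()    // is a balancing round in progress right now?
      // Number of chunks for the collection, taken from the configserver metadata:
      db.getSiblingDB("config").chunks.count({ ns: "wiki.abstracts" })
      // sh.status() lists which key ranges (e.g. a..c, d..u, v..z) live on which shard:
      sh.status()
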
  22. Pros/Cons https://www.iconfinder.com/icons/171728/settings_icon + - /

  23. _sharding:pros • scale to handle „load“ • distribute reads •

    distribute writes • self-defined distribution criteria +
  24. _sharding:cons • single opportunity to define distribution criteria • RAM

    limits • monitoring is a key factor • sharding does not speed up everything • sharding does not make things easier !
  25. UseCase https://www.iconfinder.com/icons/171729/search_icon

  26. _usecase:setup • Try it and test with real data! •

    example with Vagrant/Puppet (3 mongod, mongos, configserver) • online (with documentation): https://github.com/strud/vagrant-machines
  27. _usecase:vagrant

  28. _usecase:puppet

  29. _usecase:data • wikipedia dump • „real world“ data • ~

    4.4 million abstracts (3.8 GB XML)
  30. _usecase:data • URL • title (text) • abstract (full text)

    https://www.iconfinder.com/icons/171735/note_icon
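
    A sketch of what one imported document could look like, using the three fields from this slide (the exact field names in the import code may differ):

      db.getSiblingDB("wiki").abstracts.insert({
        url:      "https://en.wikipedia.org/wiki/MongoDB",          // URL
        title:    "MongoDB",                                        // title (text)
        abstract: "MongoDB is a document-oriented NoSQL store ..."  // abstract (full text)
      })
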
  31. _usecase:data • Java parsing and import logic • online:

    https://github.com/strud/db_evaluation
  32. _usecase:candidates • title • text • distribution uncertain

  33. _usecase:candidates • artificial, e.g. count(docs) % no. shards and random

    ID • optimal distribution while importing • but: adding shards? • alternative: _id
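
    A sketch of the _id alternative mentioned here, and of why it behaves the way the later measurement slides show:

      // Range-based sharding on the auto-generated ObjectId:
      sh.shardCollection("wiki.abstracts", { _id: 1 })
      // New ObjectIds are (roughly) monotonically increasing, so fresh inserts
      // all land on the shard owning the upper-most chunk, and the balancer
      // has to keep moving chunks away from that shard.
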
  34. _usecase:candidates • URL • structured • can be mapped to

    a tree data structure
  35. _usecase:candidates • since 2.4: Hashed Index • random • created

    by MongoDB
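
    A sketch of a hashed sharding-key, again assuming the wiki.abstracts collection; the hash value is created by MongoDB itself, as the slide says:

      // Hashed index on title (available since MongoDB 2.4):
      db.getSiblingDB("wiki").abstracts.ensureIndex({ title: "hashed" })
      // Shard on the hashed value; documents spread (pseudo-)randomly over the chunks:
      sh.shardCollection("wiki.abstracts", { title: "hashed" })
      // Trade-off: equality queries on title can still be routed to one shard,
      // but range queries on title have to hit every shard.
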
  36. Practical Experience https://www.iconfinder.com/icons/171728/settings_icon

  37. _sharding:_id • a lot of balancer action • not evenly

    balanced • (chart: document count per shard for shard00, shard01, shard02)
  38. _sharding:title • balancer running at the beginning (level out) •

    not balanced • best performance with title queries • (chart: document count per shard for shard00, shard01, shard02)
  39. _sharding:url • balancer running at the beginning (level out) •

    not balanced • best performance with url queries • (chart: document count per shard for shard00, shard01, shard02)
  40. _sharding:hash(title) • (no) balancer action • (nearly) balanced • max.

    insert performance • but: not every query possible • (chart: document count per shard for shard00, shard01, shard02)
  41. _sharding:hash(title) • 1,457,725 • 1,462,456 • 1,465,252 • can be

    balanced (appears to depend)
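
    The per-shard numbers on this slide (document counts; they sum to roughly the 4.4 million abstracts) can be reproduced from the shell; a small sketch:

      // Prints document count, data size and number of chunks per shard,
      // plus totals and percentages for the whole collection:
      db.getSiblingDB("wiki").abstracts.getShardDistribution()
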
  42. ?

  43. _talk:conclusion • Sharding can increase the amount of data a

    system can handle! • Sharding may increase query performance!
  44. _talk:conclusion • shard as early as possible • early experiments

    and lots of insights • fast reset possible • even on one server • Scaling is easier because you are already sharded!
  45. _talk:conclusion • hashed sharding-keys = great for easy balancing •

    not all queries supported • performance is not optimal • queries cannot profit from the hash-based index (e.g. range queries)
  46. _talk:conclusion • self-defined = best performance • bad sharding-key decision

    = possible problems • recommendation: a few random bytes and then a use-case-specific part
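
    One possible reading of this recommendation, as a hedged sketch with hypothetical field names: a small random prefix for even distribution, combined with a field the use case actually queries on.

      // Compound self-defined sharding-key: random prefix + use-case-specific field.
      sh.shardCollection("wiki.abstracts", { shardPrefix: 1, title: 1 })
      db.getSiblingDB("wiki").abstracts.insert({
        shardPrefix: Math.floor(Math.random() * 256),   // "a few random bytes"
        title:       "MongoDB",
        url:         "https://en.wikipedia.org/wiki/MongoDB",
        abstract:    "..."
      })
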
  47. _talk:conclusion • „Know your data!“ • „Test with your data!“

    • „Know your UseCases!“
  48. _talk:conclusion • „Consultant speech“? • Do not fear! Try it!

  49. Questions? • https://github.com/strud/vagrant-machines • https://github.com/strud/db_evaluation • http://www.meetup.com/MUGBerlin

    • Twitter: @StRud2nd
  50. http://www.wordle.net