MongoDB - data distribution (2014)

The NoSQL database MongoDB is a great basis for a flexible and scalable application. Scaling your backend with MongoDB can be quite easy, provided you choose a good distribution criterion. This talk explains the idea behind sharding and shows how data can be distributed.

This deck is based on the one originally published in November 2013, but contains updates made in 2014.

Stefan Rudnitzki

June 02, 2014

Transcript

  1. MongoDB data distribution Stefan Rudnitzki

  2. _talk:me • Stefan Rudnitzki • job: developer @hypoport • spare

    time: organizer @MUGBerlin • interests: Java, search, distributed systems, NoSQL, Vagrant, Puppet
  3. _talk:basic intro • MongoDB • document oriented NoSQL store •

    query language • indexes
  4. _talk:basic intro • documents stored in collections (tables in relational)

    • reliable (built-in replication) • scalable (see this talk)
  5. _talk:aim • How do you scale your data with MongoDB?

    • Sharding!
  6. _talk:aim • „Less than 10 % of MongoDB users are

    using sharding“ • main goal: reduce anxiety
  7. _talk:agenda • Sharding • Pros/Cons • UseCase • Practical Experience

    https://www.iconfinder.com/icons/171757/calendar_icon
  8. Sharding

  9. _sharding:basics • scale • spread documents over multiple nodes •

    optimize reads/writes
  10. _sharding:terminology

  11. _sharding:terminology • mongod (data) • shard (subset of data)

    • (diagram: shard01 consisting of three mongod processes)
  12. _sharding:terminology • mongod (data) • shard (subset of data) •

    replica set (replication) • (diagram: shard01 as a replica set of three mongod processes)
  13. _sharding:terminology • mongos (sharding proxy) • configserver (metadata)

    • (diagram: a mongos in front of the three configservers config01, config02 and config03)
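
    A minimal mongo-shell sketch of how these pieces are wired together, assuming hypothetical hostnames and the default ports:

      // Run against the mongos router, which is started against the three
      // configservers, e.g.:
      //   mongos --configdb config01:27019,config02:27019,config03:27019
      // Register a shard, here the replica set "shard01" with three mongod members:
      sh.addShard("shard01/mongod-a:27018,mongod-b:27018,mongod-c:27018")
      // Overview of shards, configserver metadata and chunk distribution:
      sh.status()
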
  14. _sharding:terminology

  15. _sharding:terminology • sharding-key: criteria for distribution • chunk: physical piece

    of data ?
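
    A hedged mongo-shell sketch of these two terms, assuming a hypothetical wiki database with an abstracts collection:

      // Sharding is enabled per database, then per collection with a sharding-key:
      sh.enableSharding("wiki")
      sh.shardCollection("wiki.abstracts", { title: 1 })   // range-based sharding-key
      // Documents are grouped into chunks (by default up to 64 MB each),
      // each chunk covering a contiguous range of the sharding-key.
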
  16. _sharding:example • server: 1 TB, 64 GB RAM • possible

    dataset: < 950 GB, est. index size < 56 GB https://www.iconfinder.com/icons/171754/data_icon
  17. _sharding:example • dataset grows to 2.3 TB, est. index size

    275 GB • distribution approaches • 3 shards (dataset) • 6 shards (index size) https://www.iconfinder.com/icons/171754/data_icon
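
    The rough arithmetic behind the two approaches, spelled out as a sketch (the split into 6 shards presumably leaves some headroom beyond the pure division):

      // Numbers from the two example slides:
      var datasetTB = 2.3, usablePerServerTB = 0.95   // < 950 GB usable data per server
      var indexGB   = 275, usableRamGB       = 56     // est. index size vs. usable RAM
      Math.ceil(datasetTB / usablePerServerTB)        // 3    -> 3 shards by dataset size
      indexGB / usableRamGB                           // ~4.9 -> at least 5, planned: 6 shards
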
  18. _sharding:activity • (diagram: three shards Shard01, Shard02, Shard03 holding the

    key ranges a..c, d..u and v..z; a mongos with three incoming documents whose target shards are still unknown)
  19. _sharding:activity • (diagram: the mongos routes the documents d, u and n to

    the shards whose key ranges a..c, d..u and v..z match)
  20. _sharding:activity • (diagram: the balancer starts redistributing chunks between

    Shard01, Shard02 and Shard03)
  21. _sharding:activity • (diagram: after balancing, the key ranges a..b, c..g, h..l,

    m..r, s..t and u..z are spread over the three shards; the mongos still routes the documents d, u and n to the matching ranges)
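
    What these activity slides show can be inspected from the mongo shell; a small sketch, assuming the hypothetical wiki.abstracts collection from above:

      sh.getBalancerState()     // is the balancer enabled at all?
      sh.isBalancerRunning()    // is a balancing round in progress right now?
      // Number of chunks for the collection, taken from the configserver metadata:
      db.getSiblingDB("config").chunks.count({ ns: "wiki.abstracts" })
      // sh.status() lists which key ranges (e.g. a..c, d..u, v..z) live on which shard:
      sh.status()
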
  22. Pros/Cons https://www.iconfinder.com/icons/171728/settings_icon + - /

  23. _sharding:pros • scale to handle „load“ • distribute reads •

    distribute writes • self-defined distribution criteria +
  24. _sharding:cons • single opportunity to define distribution criteria • RAM

    limits • monitoring is a key factor • sharding does not speed up everything • sharding does not make things easier !
  25. UseCase https://www.iconfinder.com/icons/171729/search_icon

  26. _usecase:setup • Try it and test with real data! •

    example with Vagrant/Puppet (3 mongod, mongos, configserver) • online (with documentation): https://github.com/strud/vagrant-machines
  27. _usecase:vagrant

  28. _usecase:puppet

  29. _usecase:data • wikipedia dump • „real world“ data • ~

    4.4 million abstracts (3.8 GB XML)
  30. _usecase:data • URL • title (text) • abstract (full text)

    https://www.iconfinder.com/icons/171735/note_icon
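
    A sketch of what one imported document could look like, using the three fields from this slide (the exact field names in the import code may differ):

      db.getSiblingDB("wiki").abstracts.insert({
        url:      "https://en.wikipedia.org/wiki/MongoDB",          // URL
        title:    "MongoDB",                                        // title (text)
        abstract: "MongoDB is a document-oriented NoSQL store ..."  // abstract (full text)
      })
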
  31. _usecase:data • Java parsing and import logic • online:

    https://github.com/strud/db_evaluation
  32. _usecase:candidates • title • text • distribution uncertain

  33. _usecase:candidates • artificial, e.g. count(docs) % no. shards and random

    ID • optimal distribution while importing • but: adding shards? • alternative: _id
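
    A sketch of the _id alternative mentioned here, and of why it behaves the way the later measurement slides show:

      // Range-based sharding on the auto-generated ObjectId:
      sh.shardCollection("wiki.abstracts", { _id: 1 })
      // New ObjectIds are (roughly) monotonically increasing, so fresh inserts
      // all land on the shard owning the upper-most chunk, and the balancer
      // has to keep moving chunks away from that shard.
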
  34. _usecase:candidates • URL • structured • can be mapped to

    a tree data structure
  35. _usecase:candidates • since 2.4: Hashed Index • random • created

    by MongoDB
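
    A sketch of a hashed sharding-key, again assuming the wiki.abstracts collection; the hash value is created by MongoDB itself, as the slide says:

      // Hashed index on title (available since MongoDB 2.4):
      db.getSiblingDB("wiki").abstracts.ensureIndex({ title: "hashed" })
      // Shard on the hashed value; documents spread (pseudo-)randomly over the chunks:
      sh.shardCollection("wiki.abstracts", { title: "hashed" })
      // Trade-off: equality queries on title can still be routed to one shard,
      // but range queries on title have to hit every shard.
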
  36. Practical Experience https://www.iconfinder.com/icons/171728/settings_icon

  37. _sharding:_id • a lot of balancer action • not evenly

    balanced • (chart: document count per shard for shard00, shard01, shard02)
  38. _sharding:title • balancer running at the beginning (level out) •

    not balanced • best performance with title queries • (chart: document count per shard for shard00, shard01, shard02)
  39. _sharding:url • balancer running at the beginning (level out) •

    not balanced • best performance with url queries • (chart: document count per shard for shard00, shard01, shard02)
  40. _sharding:hash(title) • (no) balancer action • (nearly) balanced • max.

    insert performance • but: not every query possible • (chart: document count per shard for shard00, shard01, shard02)
  41. _sharding:hash(title) • 1,457,725 • 1,462,456 • 1,465,252 • can be

    balanced (appears to depend)
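
    The per-shard numbers on this slide (document counts; they sum to roughly the 4.4 million abstracts) can be reproduced from the shell; a small sketch:

      // Prints document count, data size and number of chunks per shard,
      // plus totals and percentages for the whole collection:
      db.getSiblingDB("wiki").abstracts.getShardDistribution()
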
  42. ?

  43. _talk:conclusion • Sharding can increase the amount of data a

    system can handle! • Sharding may increase query performance!
  44. _talk:conclusion • shard as early as possible • early experiments

    and lots of insights • fast reset possible • even on one server • Scaling is easier because you are already sharded!
  45. _talk:conclusion • hashed sharding-keys = great for easy balancing •

    not all queries supported • performance is not optimal • queries cannot profit from the hash-based index (e.g. range queries)
  46. _talk:conclusion • self-defined = best performance • bad sharding-key decision

    = possible problems • recommendation: a few random bytes and then a use-case-specific part
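
    One possible reading of this recommendation, as a hedged sketch with hypothetical field names: a small random prefix for even distribution, combined with a field the use case actually queries on.

      // Compound self-defined sharding-key: random prefix + use-case-specific field.
      sh.shardCollection("wiki.abstracts", { shardPrefix: 1, title: 1 })
      db.getSiblingDB("wiki").abstracts.insert({
        shardPrefix: Math.floor(Math.random() * 256),   // "a few random bytes"
        title:       "MongoDB",
        url:         "https://en.wikipedia.org/wiki/MongoDB",
        abstract:    "..."
      })
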
  47. _talk:conclusion • „Know your data!“ • „Test with your data!“

    • „Know your UseCases!“
  48. _talk:conclusion • „Consultant speech“? • Do not fear! Try it!

  49. Questions? • https://github.com/strud/vagrant-machines • https://github.com/strud/db_evaluation • http://www.meetup.com/MUGBerlin

    • Twitter: @StRud2nd
  50. http://www.wordle.net