_sharding:pros
• scale to handle "load"
• distribute reads
• distribute writes
• self-defined distribution criteria
_sharding:cons
• single opportunity to define distribution criteria
• RAM limits
• monitoring is a key factor
• sharding does not speed up everything
• sharding does not make things easier
_usecase:setup
• Try it and test with real data!
• example with Vagrant/Puppet (3 mongod, mongos, configserver)
• online (with documentation): https://github.com/strud/vagrant-machines
_usecase:vagrant
_usecase:puppet
_usecase:data
• wikipedia dump
• "real world" data
• ~4.4 million abstracts (3.8 GB XML)
_usecase:candidates
• title
• text
• distribution unknown
_usecase:candidates
• artificial, e.g. count(docs) % number of shards, plus a random ID
• optimal distribution while importing
• but: adding shards?
• alternative: _id
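A minimal sketch of the artificial key idea, assuming one bucket value per shard. The field name `bucket`, the helper `make_doc` and the sample titles are hypothetical and not taken from the talk. It also hints at the "adding shards?" problem: documents imported with three buckets never land in a fourth one.

```python
import uuid
from collections import Counter

def make_doc(n, title, num_shards):
    """Build a document whose artificial key is the running count modulo the shard count."""
    return {
        "_id": uuid.uuid4().hex,    # random ID
        "bucket": n % num_shards,   # hypothetical shard-key field
        "title": title,
    }

# Import 9 documents while the cluster has 3 shards: perfectly even.
initial = [make_doc(n, f"Article {n}", 3) for n in range(9)]
before = Counter(d["bucket"] for d in initial)
print(dict(before))  # {0: 3, 1: 3, 2: 3}

# A fourth shard is added; only newly written documents can use bucket 3.
later = [make_doc(n, f"Article {n}", 4) for n in range(9, 13)]
after = Counter(d["bucket"] for d in initial + later)
print(dict(after))   # {0: 4, 1: 4, 2: 4, 3: 1}
```

The old documents keep their stored bucket value, so the new bucket only fills up with new writes.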
_usecase:candidates
• URL
• structured
• can be mapped to a tree data structure
_usecase:candidates
• since 2.4: Hashed Index
• random
• created by MongoDB
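As a sketch, a hashed shard key is requested by giving the key field the value "hashed" in the `shardCollection` admin command. The namespace `wiki.abstracts` and the helper name are assumptions for illustration; the block only builds the command documents and does not connect to a cluster.

```python
def hashed_shard_commands(namespace, field):
    """Return the admin command documents for sharding `namespace` on a hashed `field`."""
    db_name = namespace.split(".", 1)[0]
    return [
        {"enableSharding": db_name},
        {"shardCollection": namespace, "key": {field: "hashed"}},
    ]

# Hypothetical namespace and field, chosen to match the Wikipedia use case.
cmds = hashed_shard_commands("wiki.abstracts", "title")
for cmd in cmds:
    print(cmd)

# With PyMongo, each document could be sent to a mongos roughly like this
# (assumes a reachable router, so it is left commented out):
#   from pymongo import MongoClient
#   client = MongoClient("mongodb://localhost:27017")
#   for cmd in cmds:
#       client.admin.command(cmd)
```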
_sharding:_id
• a lot of balancer action
• not evenly balanced
(chart: shard00, shard01, shard02; axis 0 to 3,000,000)
_sharding:title
• balancer running at the beginning (level out)
• not balanced
• best performance with title queries
(chart: shard00, shard01, shard02; axis 0 to 1,800,000)
_sharding:url
• balancer running at the beginning (level out)
• not balanced
• best performance with url queries
(chart: shard00, shard01, shard02; axis 0 to 1,800,000)
_sharding:hash(title)
• (no) balancer action
• (nearly) balanced
• max. insert performance
• but: not every query possible
(chart: shard00, shard01, shard02; axis 0 to 1,800,000)
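Why a hashed key comes out (nearly) balanced can be illustrated with a small simulation. This is only a sketch: the first 8 bytes of MD5 stand in for the hash function (not claimed to be MongoDB's exact one), the hash space is split into three equal ranges, and the titles are synthetic.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 3          # assumption: three shards, as in the demo setup
HASH_SPACE = 2 ** 64

def hash64(value: str) -> int:
    """Stand-in 64-bit hash: the first 8 bytes of the MD5 digest."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")

def shard_for(title: str) -> int:
    """Map the hash into one of NUM_SHARDS equal-sized ranges of the hash space."""
    return hash64(title) * NUM_SHARDS // HASH_SPACE

titles = [f"Article {i}" for i in range(30_000)]   # synthetic titles
counts = Counter(shard_for(t) for t in titles)
for shard in range(NUM_SHARDS):
    print(f"shard{shard:02d}: {counts[shard]}")
```

Each shard should receive close to a third of the titles, without any chunk migration being needed to get there.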
_sharding:hash(title)
• 1,457,725
• 1,462,456
• 1,465,252
• can be balanced (appears to depend)
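A quick check of how even these three counts are. The figures are the ones from the slide; the order of the shards is not stated there, so the list is left unlabeled.

```python
doc_counts = [1_457_725, 1_462_456, 1_465_252]   # documents per shard, from the slide

total = sum(doc_counts)
shares = [round(100 * c / total, 2) for c in doc_counts]
spread = max(doc_counts) - min(doc_counts)

print(total)    # 4385433
print(shares)   # [33.24, 33.35, 33.41]
print(spread)   # 7527
```

The largest and smallest shard differ by 7,527 documents, roughly half a percent of the average shard size.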
?
_talk:conclusion
• Sharding can increase the amount of data a system can handle!
• Sharding may increase query performance!
_talk:conclusion
• shard as early as possible
• early experiments and lots of insights
• fast reset possible
• even on one server
• Scaling is easier because you are already sharded!
_talk:conclusion
• hashed shard keys = great for easy balancing
• not all queries supported
• performance is not optimal
• queries cannot profit from a range-based index
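The query limitation can be sketched as follows: an exact title hashes to exactly one shard, while titles that are neighbours in sort order end up scattered, so a range or prefix query has to ask every shard. The hash (first 8 bytes of MD5), the three-way split and the sample titles are assumptions for illustration only.

```python
import hashlib

NUM_SHARDS = 3   # assumption: three shards

def shard_for(title: str) -> int:
    """Stand-in placement: a 64-bit hash mapped onto equal ranges of the hash space."""
    h = int.from_bytes(hashlib.md5(title.encode("utf-8")).digest()[:8], "big")
    return h * NUM_SHARDS // 2 ** 64

# Equality query: the router can hash the value and target a single shard.
equality_targets = {shard_for("Berlin 42")}
print(len(equality_targets))   # 1

# Range/prefix query ("Berlin 0" .. "Berlin 99"): matching titles are spread out.
range_targets = {shard_for(f"Berlin {i}") for i in range(100)}
print(sorted(range_targets))
```

With 100 neighbouring titles the second set is expected to contain all three shards.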
_talk:conclusion
• self-defined = best performance
• bad shard-key decision = possible problems
• recommendation: a few random bytes, followed by a use-case-specific part
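One possible reading of this recommendation is a compound shard key: a short random prefix for spreading writes, followed by a field the application actually queries. The field names `prefix` and `title`, the two-byte prefix length and the helper are hypothetical.

```python
import os

# Hypothetical compound key specification: random part first, use-case part second.
SHARD_KEY_SPEC = {"prefix": 1, "title": 1}

def with_shard_key(doc: dict, random_bytes: int = 2) -> dict:
    """Return a copy of `doc` with a random hex prefix added for the shard key."""
    keyed = dict(doc)
    keyed["prefix"] = os.urandom(random_bytes).hex()   # "a few random bytes"
    return keyed

doc = with_shard_key({"title": "Berlin", "text": "Berlin is the capital of Germany."})
print(doc["prefix"], doc["title"])
```

How many random bytes to use, and which field follows them, depends on the use case and should be tested with real data.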
_talk:conclusion
• "Know your data!"
• "Test with your data!"
• "Know your use cases!"
_talk:conclusion
• "Consultant speech"?
• Do not fear! Try it!