MongoDB - data distribution (2014)

Slide 1

Slide 1 text

MongoDB data distribution Stefan Rudnitzki

Slide 2

Slide 2 text

_talk:me • Stefan Rudnitzki • job: developer @hypoport • spare time: organizer @MUGBerlin • interests: Java, search, distributed systems, NoSQL, Vagrant, Puppet

Slide 3

Slide 3 text

_talk:basic intro • MongoDB • document oriented NoSQL store • query language • indexes

Slide 4

Slide 4 text

_talk:basic intro • documents stored in collections (tables in relational) • reliable (built-in replication) • scalable (see this talk)

Slide 5

Slide 5 text

_talk:aim • How do you scale your data with MongoDB? • Sharding!

Slide 6

Slide 6 text

_talk:aim • „Less than 10 % of MongoDB users are using sharding“ • main goal: reduce anxiety

Slide 7

Slide 7 text

_talk:agenda • Sharding • Pros/Cons • UseCase • Practical Experience https://www.iconfinder.com/icons/171757/calendar_icon

Slide 8

Slide 8 text

Sharding

Slide 9

Slide 9 text

_sharding:basics • scale • spread documents over multiple nodes • optimize reads/writes

Slide 10

Slide 10 text

_sharding:terminology

Slide 11

Slide 11 text

_sharding:terminology • mongod (data) • shard (subset of data) shard01 mongod mongod mongod

Slide 12

Slide 12 text

_sharding:terminology • mongod (data) • shard (subset of data) • replica set (replication) shard01 mongod mongod mongod

Slide 13

Slide 13 text

_sharding:terminology • mongos (sharding proxy) • configserver (metadata) configservers config01 config02 config03 mongos

Slide 14

Slide 14 text

_sharding:terminology

Slide 15

Slide 15 text

_sharding:terminology • sharding-key: criteria for distribution • chunk: physical piece of data

Slide 16

Slide 16 text

_sharding:example • server: 1 TB, 64 GB RAM • possible dataset: < 950 GB, est. index size < 56 GB https://www.iconfinder.com/icons/171754/data_icon

Slide 17

Slide 17 text

_sharding:example • dataset grows to 2.3 TB, est. index size 275 GB • distribution approaches • 3 shards (dataset) • 6 shards (index size) https://www.iconfinder.com/icons/171754/data_icon
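
The shard counts on this slide can be approximated with a ceiling division. A minimal sketch, assuming decimal units (2.3 TB = 2300 GB) and taking the per-server limits from slide 16 (950 GB of data, 56 GB of index) as the capacity of one shard; `shards_needed` is a hypothetical helper, not part of any MongoDB tooling.

```python
import math


def shards_needed(total_gb: float, per_shard_gb: float) -> int:
    """Smallest number of shards so that each shard stays within its limit."""
    if per_shard_gb <= 0:
        raise ValueError("per_shard_gb must be positive")
    return math.ceil(total_gb / per_shard_gb)


# Assumption: 2.3 TB is treated as 2300 GB.
by_dataset = shards_needed(2300, 950)  # limited by disk space
by_index = shards_needed(275, 56)      # limited by index size that should fit in RAM

print(by_dataset, by_index)
```

The dataset estimate matches the 3 shards on the slide. The naive index estimate is a lower bound that comes out below the slide's 6 shards; the slides do not say how the larger figure was derived, so it may include headroom.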

Slide 18

Slide 18 text

_sharding:activity • [diagram] mongos in front of Shard01, Shard02, Shard03 • key ranges: d .. u, v .. z, a .. c • incoming documents: ?, ?, ?

Slide 19

Slide 19 text

_sharding:activity • [diagram] mongos in front of Shard01, Shard02, Shard03 • key ranges: d .. u, v .. z, a .. c • incoming documents: d, u, n

Slide 20

Slide 20 text

_sharding:activity • [diagram] Balancer acting on Shard01, Shard02, Shard03 • key ranges: d .. u, v .. z, a .. c

Slide 21

Slide 21 text

_sharding:activity • [diagram] mongos in front of Shard01, Shard02, Shard03 after balancing • chunks: m .. r, h .. l, c .. g, s .. t, a .. b, u .. z • incoming documents: d, u, n
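
The routing shown on slides 18 to 21 can be sketched as a lookup of the key's first letter in a sorted list of ranges. This is a simplified illustration, not how mongos is implemented. Assumption: the ranges are assigned to shards in the order the labels appear in the slide text, which may not match the original drawing; `make_router` is a hypothetical helper.

```python
from bisect import bisect_right


def make_router(chunks):
    """chunks: list of (lower, upper, shard) tuples, sorted by lower bound."""
    lowers = [lower for lower, _, _ in chunks]

    def route(key: str):
        letter = key[0].lower()
        # Find the last range whose lower bound is <= the key's first letter.
        i = bisect_right(lowers, letter) - 1
        if i < 0 or letter > chunks[i][1]:
            raise KeyError(f"no range covers {key!r}")
        return chunks[i]

    return route


# Before balancing: three coarse ranges (shard assignment is an assumption).
before = make_router([
    ("a", "c", "Shard03"),
    ("d", "u", "Shard01"),
    ("v", "z", "Shard02"),
])

# After balancing: six smaller chunks (shard assignment is an assumption).
after = make_router([
    ("a", "b", "Shard02"),
    ("c", "g", "Shard03"),
    ("h", "l", "Shard02"),
    ("m", "r", "Shard01"),
    ("s", "t", "Shard01"),
    ("u", "z", "Shard03"),
])

for key in ["d", "u", "n"]:
    print(key, before(key), after(key))
```

With the coarse ranges, the documents d, u and n all fall into the same range d .. u. After the split into smaller chunks they fall into three different chunks, which gives the balancer units it can move between shards.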

Slide 22

Slide 22 text

Pros/Cons https://www.iconfinder.com/icons/171728/settings_icon

Slide 23

Slide 23 text

_sharding:pros • scale to handle „load“ • distribute reads • distribute writes • self-defined distribution criteria

Slide 24

Slide 24 text

_sharding:cons • single opportunity to define distribution criteria • RAM limits • monitoring is a key factor • sharding does not speed up everything • sharding does not make things easier

Slide 25

Slide 25 text

UseCase https://www.iconfinder.com/icons/171729/search_icon

Slide 26

Slide 26 text

_usecase:setup • Try it and test with real data! • example with Vagrant/Puppet (3 mongod, mongos, configserver) • online (with documentation): https://github.com/strud/vagrant-machines

Slide 27

Slide 27 text

_usecase:vagrant

Slide 28

Slide 28 text

_usecase:puppet

Slide 29

Slide 29 text

_usecase:data • Wikipedia dump • „real world“ data • ~4.4 million abstracts (3.8 GB XML)

Slide 30

Slide 30 text

_usecase:data • URL • title (text) • abstract (full text) https://www.iconfinder.com/icons/171735/note_icon

Slide 31

Slide 31 text

_usecase:data • Java parsing and import logic • online: https://github.com/strud/db_evaluation

Slide 32

Slide 32 text

_usecase:candidates • title • text • distribution uncertain

Slide 33

Slide 33 text

_usecase:candidates • artificial, e.g. count(docs) % no. shards and random ID • optimal distribution while importing • but: adding shards? • alternative: _id
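
The "but: adding shards?" point can be illustrated with a small simulation of a modulo-based key. This sketch assumes that the shard is derived directly from a running document counter and that the shard count changes from 3 to 4; both function names are hypothetical. It models the idea on the slide, not MongoDB's own chunk handling.

```python
from collections import Counter


def shard_for(doc_counter: int, shard_count: int) -> int:
    """Artificial distribution: running document counter modulo shard count."""
    return doc_counter % shard_count


def moved_fraction(doc_total: int, old_shards: int, new_shards: int) -> float:
    """Share of documents whose target shard changes when shards are added."""
    moved = sum(
        1
        for i in range(doc_total)
        if shard_for(i, old_shards) != shard_for(i, new_shards)
    )
    return moved / doc_total


# Perfectly even while importing into 3 shards ...
distribution = Counter(shard_for(i, 3) for i in range(12000))
print(distribution)

# ... but most documents would need a new home with a fourth shard.
print(moved_fraction(12000, 3, 4))  # 0.75
```

The distribution is optimal during the import, yet three quarters of the documents would map to a different shard once a fourth shard is added, which is the weakness the slide points at.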

Slide 34

Slide 34 text

_usecase:candidates • URL • structured • can be mapped to a tree data structure

Slide 35

Slide 35 text

_usecase:candidates • since 2.4: Hashed Index • random • created by MongoDB
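
Why a hashed key spreads documents almost evenly can be sketched with any well-mixing hash function. Assumptions: MD5 truncated to 64 bits stands in for the hash function here (no claim is made that this reproduces MongoDB's exact hashing), the hash space is cut into three equal ranges, and the titles are synthetic.

```python
import hashlib
from collections import Counter

SHARDS = ["shard00", "shard01", "shard02"]
# Width of one hash range when the 64-bit space is cut into equal parts.
RANGE_WIDTH = 2**64 // len(SHARDS) + 1


def hashed_key(title: str) -> int:
    """64-bit hash of the title (MD5 used as a stand-in)."""
    digest = hashlib.md5(title.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def shard_for(title: str) -> str:
    """Pick the shard whose hash range contains the hashed title."""
    return SHARDS[hashed_key(title) // RANGE_WIDTH]


# Synthetic, strictly ascending titles: a worst case for range-based keys.
titles = [f"Article {i:05d}" for i in range(30000)]
counts = Counter(shard_for(t) for t in titles)
print(counts)
```

Neighbouring titles end up on unrelated shards, which is good for insert distribution but also means a range query on the title can no longer be answered by a single shard.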

Slide 36

Slide 36 text

Practical Experience https://www.iconfinder.com/icons/171728/settings_icon

Slide 37

Slide 37 text

_sharding:_id • a lot of balancer action • not evenly balanced • [chart: documents per shard for shard00, shard01, shard02; axis 0 to 3,000,000]

Slide 38

Slide 38 text

_sharding:title • balancer running at the beginning (level out) • not balanced • best performance with title queries • [chart: documents per shard for shard00, shard01, shard02; axis 0 to 1,800,000]

Slide 39

Slide 39 text

_sharding:url • balancer running at the beginning (level out) • not balanced • best performance with url queries • [chart: documents per shard for shard00, shard01, shard02; axis 0 to 1,800,000]

Slide 40

Slide 40 text

_sharding:hash(title) • (no) balancer action • (nearly) balanced • max. insert performance • but: not every query possible • [chart: documents per shard for shard00, shard01, shard02; axis 0 to 1,800,000]

Slide 41

Slide 41 text

_sharding:hash(title) • 1,457,725 • 1,462,456 • 1,465,252 • can be balanced (appears to depend)

Slide 42

Slide 42 text

Slide 43

Slide 43 text

_talk:conclusion • Sharding can increase the amount of data a system can handle! • Sharding may increase query performance!

Slide 44

Slide 44 text

_talk:conclusion • shard as early as possible • early experiments and lots of insights • fast reset possible • even on one server • Scaling is easier because you are already sharded!

Slide 45

Slide 45 text

_talk:conclusion • hashed sharding-keys = great for easy balancing • not all queries supported • performance is not optimal • queries cannot profit from a range-based index

Slide 46

Slide 46 text

_talk:conclusion • self-defined = best performance • bad sharding-key decision = possible problems • recommendation: a few random bytes, then a use-case-specific part
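
The recommended key layout can be sketched as a small random prefix followed by a use-case-specific value. The field names `prefix` and `value`, the default of two random bytes, and the helper `make_shard_key` are illustrative assumptions; the slides do not prescribe a concrete format.

```python
import secrets


def make_shard_key(use_case_value: str, random_bytes: int = 2) -> dict:
    """Build a compound key: a few random bytes, then the use-case field."""
    if random_bytes < 1:
        raise ValueError("random_bytes must be at least 1")
    return {
        # Random part: spreads inserts across the key space.
        "prefix": secrets.token_hex(random_bytes),
        # Use-case part: keeps related values ordered within one prefix.
        "value": use_case_value,
    }


key = make_shard_key("http://en.wikipedia.org/wiki/Berlin")
print(key)
```

The prefix has to be stored with the document. A query that knows only the use-case value, not the prefix, cannot be targeted at one shard, so the number of random bytes is a trade-off between write distribution and targeted reads.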

Slide 47

Slide 47 text

_talk:conclusion • „Know your data!“ • „Test with your data!“ • „Know your UseCases!“

Slide 48

Slide 48 text

_talk:conclusion • „Consultant speech“? • Do not fear! Try it!

Slide 49

Slide 49 text

Questions? • https://github.com/strud/vagrant-machines • https://github.com/strud/db_evaluation • http://www.meetup.com/MUGBerlin • Twitter: @StRud2nd

Slide 50

Slide 50 text

http://www.wordle.net