This four-hour online conference will introduce you to some MongoDB basics and get you up to speed on why and how you should choose MongoDB for your next project.
• Quick introduction to MongoDB
• Data modeling in MongoDB: queries, geospatial, updates, and map/reduce
• Using a location-based app as an example
• Examples work in the MongoDB JS shell
MongoDB is a scalable, high-performance, open source, document-oriented database.
• Fast querying
• In-place updates
• Full index support
• Replication / high availability
• Auto-sharding
• Aggregation; map/reduce
• GridFS
MongoDB is implemented in C++
• Runs on Windows, Linux, Mac OS X, Solaris
Drivers are available in many languages
• 10gen supported: C, C# (.NET), C++, Erlang, Haskell, Java, JavaScript, Perl, PHP, Python, Ruby, Scala, Node.js
• Multiple community-supported drivers
• JSON has a powerful but limited set of datatypes
  – MongoDB extends these with Date, integer types, ObjectId, …
• MongoDB stores data in BSON
• BSON is a binary representation of JSON
  – Optimized for performance and navigational abilities
  – Also compression
See: bsonspec.org
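A minimal sketch in the JS shell of a few BSON types that plain JSON lacks; the "events" collection and its field names are hypothetical, not part of the talk:

// Hypothetical collection; field names are illustrative
> db.events.insert({
    name: "launch",                        // string
    at: new Date(),                        // BSON Date
    attendees: NumberInt(42),              // 32-bit integer (shell numbers default to doubles)
    total_views: NumberLong("5000000000"), // 64-bit integer
    ref: ObjectId()                        // 12-byte ObjectId
  })
> db.events.findOne()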
• Intrinsic support for fast, iterative development
• Super low latency access to your data
• Very little CPU overhead
• No additional caching layer required
• Built-in replication and horizontal scaling support
"As a user I want to be able to find other locations nearby"
• Need to store locations (offices, restaurants, etc.)
  – name, address, tags
  – coordinates
  – user-generated content, e.g. tips / notes
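One way such a location could be stored and queried, sketched in the JS shell; the "locations" collection, its field names, and the coordinates are assumptions for illustration. It uses a 2d geospatial index and $near:

> db.locations.insert({
    name: "Café Allegro",
    address: "4214 University Way NE",
    tags: [ "coffee", "wifi" ],
    latlong: [ 47.66, -122.31 ],                      // legacy [x, y] coordinate pair
    tips: [ { user: "kyle", note: "try the mocha" } ] // user-generated content
  })
// geospatial index on the coordinate pair
> db.locations.ensureIndex( { latlong: "2d" } )
// find the 10 closest locations to a point
> db.locations.find( { latlong: { $near: [ 47.65, -122.30 ] } } ).limit(10)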
"As a user I want to be able to 'checkin' to a location"
Checkins
– User should be able to 'check in' to a location
– Want to be able to generate statistics:
  • Recent checkins
  • Popular locations
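A possible shape for this, sketched in the JS shell; the "checkins" collection and its fields are assumptions, not the talk's exact schema:

// one document per checkin
> db.checkins.insert({
    user: "alice",
    location_id: ObjectId("4c4ba5c0672c685e5e8aabf3"),
    ts: new Date()
  })
// recent checkins: newest first
> db.checkins.ensureIndex( { ts: -1 } )
> db.checkins.find().sort( { ts: -1 } ).limit(10)
// popular locations: count checkins per location
> db.checkins.group({
    key: { location_id: 1 },
    initial: { count: 0 },
    reduce: function(doc, out) { out.count++; }
  })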
• Single server – need a strong backup plan
• Replica sets – high availability, automatic failover
• Sharded – horizontal scaling, auto-balancing
@mongodb – conferences, appearances, and meetups: http://www.10gen.com/events
Facebook | Twitter | LinkedIn: http://bit.ly/mongofb | http://linkd.in/joinmongo
Download at mongodb.org
Support, training, and this talk brought to you by 10gen
Normalization
Goals:
• Avoid anomalies when inserting, updating, or deleting
• Minimize redesign when extending the schema
• Avoid bias toward a particular query
• Make use of all SQL features
In MongoDB:
• Similar goals apply, but the rules are different
• Denormalization for optimization is an option: most features still exist, contrary to BLOBs
Collections Basics
• Equivalent to a table in SQL
• Cheap to create (max 24000)
• Collections don't have a fixed schema
  – Common for documents in a collection to share a schema
  – Document schema can evolve
• Consider using multiple related collections tied together by a naming convention, e.g. LogData-2011-02-08
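A small sketch of the naming-convention idea in the JS shell; the collection prefix and fields are illustrative:

// collections are created lazily on first insert
> var day = "2011-02-08"
> db[ "LogData-" + day ].insert( { level: "info", msg: "app started", ts: new Date() } )
// list the related collections
> db.getCollectionNames()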
Document Basics
• Elements are name/value pairs, equivalent to a column value in SQL
• Elements can be nested
• Rich data types for values
• JSON for the human eye, BSON for all internals
• 16MB maximum size (enough for many books)
• What you see is what is stored
> db.blogs.find()
{ _id: ObjectId("4c4ba5c0672c685e5e8aabf3"),
  author: "Hergé",
  date: ISODate("2011-09-18T09:56:06.298Z"),
  text: "Destination Moon",
  tags: [ "comic", "adventure" ]
}
Find the document
Notes:
• _id must be unique, but can be anything you'd like
• MongoDB will generate a default _id if one is not supplied
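For example, a sketch against the blogs collection above:

// no _id supplied: the shell/driver generates an ObjectId
> db.blogs.insert( { author: "Hergé", text: "Explorers on the Moon" } )
// _id can be any unique value you choose
> db.blogs.insert( { _id: "destination-moon", author: "Hergé" } )
// re-inserting the same _id fails with a duplicate key error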
Extending the Schema
// create index on nested documents:
> db.blogs.ensureIndex( { "comments.author": 1 } )
> db.blogs.find( { "comments.author": "Kyle" } )
// find last 5 posts:
> db.blogs.find().sort( { date: -1 } ).limit(5)
// most commented post:
> db.blogs.find().sort( { comments_count: -1 } ).limit(1)
When sorting, check if you need an index
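For the sorts above, possible indexes might look like the sketch below; whether you actually need them depends on your data size and query mix:

> db.blogs.ensureIndex( { date: -1 } )           // supports the "last 5 posts" sort
> db.blogs.ensureIndex( { comments_count: -1 } ) // supports the "most commented" sort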
Alternative
// Each product lists the IDs of its categories
products:
{ _id: 10, name: "Destination Moon",
  category_ids: [ 20, 30 ] }

// Association not stored on the categories
categories:
{ _id: 20,
  name: "adventure" }

// All products for a given category
> db.products.ensureIndex( { category_ids: 1 } )  // yes!
> db.products.find( { category_ids: 20 } )
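The reverse direction works with one extra query; a small sketch:

// all categories of a given product
> var p = db.products.findOne( { _id: 10 } )
> db.categories.find( { _id: { $in: p.category_ids } } )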
Array of Ancestors
// Store all ancestors of a node
{ _id: "a" }
{ _id: "b", tree: [ "a" ], retweet: "a" }
{ _id: "c", tree: [ "a", "b" ], retweet: "b" }
{ _id: "d", tree: [ "a", "b" ], retweet: "b" }
{ _id: "e", tree: [ "a" ], retweet: "a" }
{ _id: "f", tree: [ "a", "e" ], retweet: "e" }
// find all direct retweets of "b"
> db.tweets.find( { retweet: "b" } )
// find all retweets of "e" anywhere in tree
> db.tweets.find( { tree: "e" } )
Trees as Paths
Store hierarchy as a path expression
• Separate each node by a delimiter, e.g. ","
• Use a regular-expression search to find parts of a tree
• The search must be left-rooted (and not case-insensitive) to use an index

{ _id: "a", text: "initial tweet",     path: "a" }
{ _id: "b", text: "retweet with comment", path: "a,b" }
{ _id: "c", text: "reply to retweet",  path: "a,b,c" }

// Find the conversations "a" started
> db.tweets.find( { path: /^a/ } )
// Find the conversations under a branch
> db.tweets.find( { path: /^a,b/ } )
BSON Storage
• Sequence of key/value pairs
• NOT a hash map
• Optimized to scan quickly
[Diagram: a single document holding one counter per minute of the day, keys 0, 1, 2, 3, … 1439]
What is the cost of updating the minute before midnight?
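A sketch of the flat layout the question refers to (field names are assumed: one counter per minute of the day), and why the last minute is the expensive one:

// one key per minute, 0 .. 1439
{ _id: "20111209-1231",
  daily: 67,
  0: 0, 1: 7, 2: 3,
  // ...
  1439: 4 }
// increment the minute before midnight
> db.votes.update( { _id: "20111209-1231" }, { $inc: { "1439": 1 } } )
// BSON is a sequence of key/value pairs, not a hash map: to reach "1439"
// the server scans past the ~1439 keys before it on every such update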
Time Series
Use more of a tree structure by nesting:
// Time series buckets, each hour a sub-document
{ _id: "20111209-1231",
  ts: ISODate("2011-12-09T00:00:00.000Z"),
  daily: 67,
  minute: {
    0:  { 0: 0, 1: 7, ... 59: 2 },
    ...
    23: { 0: 15, ... 59: 6 }
  }
}
// Add one to the last minute before midnight
> db.votes.update(
    { _id: "20111209-1231" },
    { $inc: { "minute.23.59": 1 } } )
Duplicate data
Document to represent a shopping order:
{ _id: 1234,
  ts: ISODate("2011-12-09T00:00:00.000Z"),
  customerId: 67,
  total_price: 1050,
  items: [
    { sku: 123, quantity: 2, price: 50,
      name: "macbook", thumbnail: "macbook.png" },
    { sku: 234, quantity: 1, price: 20,
      name: "iphone", thumbnail: "iphone.png" },
    ...
  ]
}
The item information is duplicated in every order that references it.
MongoDB's flexible schema makes it easy!
Duplicate data
Pros:
• Only 1 query to get all information needed to display the order
• Processing on the db is as fast as a BLOB
• Can achieve much higher performance
Cons:
• More storage used … cheap enough
• Updates are much more complicated … just consider the fields immutable
Summary
• Basic data design principles stay the same …
• But MongoDB is more flexible and brings possibilities:
  – Embed or duplicate data to speed up operations
  – Cut down the number of collections and indexes
  – Watch for documents growing too large
  – Make sure to use the proper indexes for querying and sorting
  – The schema should feel natural to your application!
• High availability (auto-failover)
• Read scaling (extra copies to read from)
• Backups
  – Online, delayed copy (protects against fat-finger mistakes)
  – Point-in-time (PiT) backups
• Use a (hidden) replica for secondary workloads
  – Analytics
  – Data processing
  – Integration with external systems
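A sketch of a replica set configuration with a hidden member for that secondary workload; the hostnames and set name are illustrative:

> rs.initiate({
    _id: "myset",
    members: [
      { _id: 0, host: "db1.example.net:27017" },
      { _id: 1, host: "db2.example.net:27017" },
      // hidden members replicate but never become primary and are
      // invisible to normal clients: point analytics / backups here
      { _id: 2, host: "db3.example.net:27017", priority: 0, hidden: true }
    ]
  })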
Planned
– Hardware upgrade
– O/S or file-system tuning
– Relocation of data to new file-system / storage
– Software upgrade
Unplanned
– Hardware failure
– Data center failure
– Region outage
– Human error
– Application corruption
• A cluster of N servers
• All writes to the primary
• Reads can go to the primary (default) or a secondary
• Any (one) node can be primary
• Consensus election of primary
• Automatic failover
• Automatic recovery
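Directing reads at a secondary, sketched in the JS shell; the checkins collection is borrowed from the earlier example:

// older shells: allow this connection to read from secondaries
> rs.slaveOk()
> db.checkins.find().sort( { ts: -1 } ).limit(10)
// newer shells/drivers: per-query read preference
> db.checkins.find().readPref("secondaryPreferred")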
[Diagram: Primary (priority 1) and Secondary (priority 1) in San Francisco; Secondary (priority 0) in Dallas]
The Dallas member is a disaster recovery data center and will never become primary automatically.
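The priorities above could be set with a reconfiguration like this sketch; member order is assumed to match the diagram:

> var cfg = rs.conf()
> cfg.members[0].priority = 1   // San Francisco, primary-eligible
> cfg.members[1].priority = 1   // San Francisco, primary-eligible
> cfg.members[2].priority = 0   // Dallas DR member: never elected primary
> rs.reconfig(cfg)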
[Diagram, 3 steps: a replica set with a Primary, an Arbiter, and a Secondary; a new Secondary is added and performs a full sync from the Primary]
Uh oh. Full sync is going to use a lot of resources on the primary, so I may have downtime or degraded performance.
[Diagram, 3 steps: a replica set with a Primary and several Secondaries; the new Secondary performs its full sync from an existing Secondary]
Sync can happen from a secondary, which will not impact traffic on the primary.
• Avoid single points of failure
  – Separate racks
  – Separate data centers
• Avoid long recovery downtime
  – Use journaling
  – Use 3+ replicas
• Keep your actives close
  – Use priority to control where failovers happen
• You are using, or want to use, MongoDB
  – What benefits?
  – Potential use cases
  – Steering the adoption of MongoDB
• Why is MongoDB safe?
  – Execution
  – Operational
  – Financial
• Why 10gen?
  – People
  – Company
  – Future
• "NoSQL databases are proving valuable for scaling out cloud and on-premises uses of numerous content types, and document-oriented open-source solutions are emerging as one of the leading choices."
• Reassuring the Ops team
• Reassuring the Business team
• Start with low stakes – learn to trust
• Grow towards a mission-critical use case
• LET US HELP YOU! → [email protected]
• Less code
• More productive coding
• Easier to maintain
• Contingency plans for turnover
• Commodity hardware
• No upfront license, pay for value over time
• Cost visibility for growth of usage
Wordnik uses MongoDB as the foundation for its "live" dictionary that stores its entire text corpus – 3.5T of data in 20 billion records.

Problem
§ Analyze a staggering amount of data for a system built on a continuous stream of high-quality text pulled from online sources
§ Adding too much data too quickly resulted in outages; tables locked for tens of seconds during inserts
§ Initially launched entirely on MySQL but quickly hit performance road blocks

Why MongoDB
§ Migrated 5 billion records in a single day with zero downtime
§ MongoDB powers every website request: 20m API calls per day
§ Ability to eliminate the memcached layer, creating a simplified system that required fewer resources and was less prone to error

Impact
§ Reduced code by 75% compared to MySQL
§ Fetch time cut from 400ms to 60ms
§ Sustained insert speed of 8k words per second, with frequent bursts of up to 50k per second
§ Significant cost savings and 15% reduction in servers

"Life with MongoDB has been good for Wordnik. Our code is faster, more flexible and dramatically smaller. Since we don't spend time worrying about the database, we can spend more time writing code for our application."
– Tony Tam, Vice President of Engineering and Technical Co-founder
Dwight Merriman – CEO: Founder and CTO of DoubleClick
Max Schireson – President: COO of MarkLogic, 9 years at Oracle
Eliot Horowitz – CTO: Co-founder of ShopWiki, DoubleClick
Erik Frieberg – VP Marketing: HP Software, Borland, BEA
Ben Sabrin – VP of Sales: VP of Sales at JBoss, over 9 years of open source experience