Slide 1

Slide 1 text

Max Schireson President, 10gen [email protected] @mschireson maxschireson.com

Slide 2

Slide 2 text

My background •  Oracle from July 1994 to June 2003 •  MarkLogic from July 2003 to Feb 2011 •  10gen (makers of MongoDB) since Feb 2011

Slide 3

Slide 3 text

In this talk •  Why is everyone and their brother inventing a new database •  Why do they all look so different from each other and what we’re used to •  Use MongoDB as an example of some of the choices involved in designing a database •  Talk about some use cases

Slide 4

Slide 4 text

Where I stand •  Longtime user of RDBMS •  Think they are useful for many applications •  Don’t think they’re the only game in town •  Think new types of databases will become an important part of the industry •  Skeptical of some approaches currently being taken

Slide 5

Slide 5 text

Since the dawn of the RDBMS: 1970 vs. 2012
•  Main memory: Intel 1103, 1k bits (1970) vs. 4GB of RAM for $25.99 (2012)
•  Mass storage: IBM 3330 Model 1, 100 MB (1970) vs. 3TB Superspeed USB for $129 (2012)
•  Microprocessor: nearly there, with the 4004 being developed at 4 bits and 92,000 instructions per second (1970) vs. Westmere EX with 10 cores, 30MB L3 cache, running at 2.4GHz (2012)

Slide 6

Slide 6 text

More recent changes: a decade ago vs. now
•  Faster: buy a bigger server (a decade ago) vs. buy more servers (now)
•  Faster storage: a SAN with more spindles (a decade ago) vs. SSD (now)
•  More reliable storage: more expensive SAN (a decade ago) vs. more copies of local storage (now)
•  Deployed in: your data center (a decade ago) vs. the cloud, private or public (now)
•  Large data set: millions of rows (a decade ago) vs. billions to trillions of rows (now)
•  Development: waterfall (a decade ago) vs. iterative (now)

Slide 7

Slide 7 text

Questioning assumptions about databases •  ACID •  Data models

Slide 8

Slide 8 text

Assumptions behind today’s DBMS •  Relational data model •  Third normal form •  ACID •  Multi-statement transactions •  SQL •  RAM is small and disks are slow •  If it’s too slow you can buy a faster computer

Slide 9

Slide 9 text

Yesterday’s assumptions in today’s world •  Scaleout is hard •  Distributed joins are hard •  Making two-phase commits fast is hard •  Custom solutions proliferate •  Too slow? Just add a cache •  ORM tools everywhere •  More computers and disk are nearly free but SAN and faster computers are expensive

Slide 10

Slide 10 text

Challenging some assumptions •  Do you need a database at all •  How does it handle transactions and consistency •  How does it scale out •  How should it model data •  How do you query it •  Is it enterprise software, open source, an appliance, or a cloud service?

Slide 11

Slide 11 text

Do you need a database at all •  Can you better solve your problem with an in memory object store/cache •  Can you better solve your problem with a batch processing framework

Slide 12

Slide 12 text

The CAP Theorem •  Common sense •  It says that if a distributed system is partitioned, you can’t both accept updates everywhere and keep the data consistent •  Duh •  So, either allow inconsistency or limit where updates can be applied

Slide 13

Slide 13 text

Two choices for consistency •  Eventual consistency •  Allow updates when a system has been partitioned •  Resolve conflicts later •  Example: CouchDB, Cassandra •  Immediate consistency •  Limit the application of updates to a single master node for a given slice of data •  Avoids the possibility of conflicts •  Example: MongoDB
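
The two choices above can be sketched with toy in-memory replicas. This is a minimal illustration, not the protocol of CouchDB, Cassandra, or MongoDB: the class names, the last-write-wins merge policy, and the byte-sum routing rule are all assumptions made for the example.

```python
# Illustrative sketch only: toy replicas, not any real database's protocol.


class EventualReplica:
    """Accepts writes locally even when partitioned; conflicts are resolved at merge time."""

    def __init__(self):
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        self.data[key] = (timestamp, value)

    def merge(self, other):
        # Last-write-wins is one assumed resolution policy among many.
        for key, (ts, value) in other.data.items():
            if key not in self.data or ts > self.data[key][0]:
                self.data[key] = (ts, value)


class SingleMasterStore:
    """Routes every write for a key to one master, so conflicting versions never arise."""

    def __init__(self, masters):
        self.names = sorted(masters)
        self.masters = {name: {} for name in self.names}
        self.reachable = set(self.names)

    def master_for(self, key):
        # Hypothetical routing rule: byte sum of the key modulo the master count.
        return self.names[sum(key.encode()) % len(self.names)]

    def write(self, key, value):
        name = self.master_for(key)
        if name not in self.reachable:
            # Immediate consistency is kept by refusing the write.
            raise ConnectionError(f"master {name} for {key!r} is unreachable")
        self.masters[name][key] = value


# Eventual consistency: both sides of a partition accept a write, then reconcile.
a, b = EventualReplica(), EventualReplica()
a.write("cart", ["book"], timestamp=1)
b.write("cart", ["pen"], timestamp=2)
a.merge(b)
b.merge(a)
```

After the merge both replicas agree, but the earlier write to `a` has been discarded; that loss is the price of this particular resolution policy. The single-master store never has to reconcile, at the cost of rejecting writes whenever the master for a key cannot be reached.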

Slide 14

Slide 14 text

Transactions •  Do they exist •  At what level of granularity •  MongoDB example •  Transactions are document-level •  Those short transactions are atomic, consistent, isolated and durable
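
A rough sketch of what document-level granularity means in practice, using a toy in-memory store rather than MongoDB itself. `DocumentStore` and its methods are hypothetical names, and durability is not modeled; the point is only that every change to one document lands together or not at all.

```python
import copy
import threading


class DocumentStore:
    """Toy store: each update to a single document is all-or-nothing."""

    def __init__(self):
        self._docs = {}
        self._locks = {}
        self._guard = threading.Lock()

    def insert(self, doc_id, doc):
        with self._guard:
            self._docs[doc_id] = copy.deepcopy(doc)
            self._locks[doc_id] = threading.Lock()

    def update(self, doc_id, mutate):
        # The per-document lock isolates concurrent writers to the same document.
        with self._locks[doc_id]:
            working = copy.deepcopy(self._docs[doc_id])
            mutate(working)  # if this raises, the stored document is untouched
            self._docs[doc_id] = working

    def get(self, doc_id):
        with self._locks[doc_id]:
            return copy.deepcopy(self._docs[doc_id])


def add_pen(order):
    # Two fields change together inside one document.
    order["items"].append({"sku": "pen", "price": 2})
    order["total"] += 2


def failing_update(order):
    order["items"].append({"sku": "ink", "price": 5})
    raise ValueError("payment declined")  # the total was never updated


store = DocumentStore()
store.insert("order-1", {"items": [], "total": 0})
store.update("order-1", add_pen)
try:
    store.update("order-1", failing_update)
except ValueError:
    pass  # the half-applied change was discarded
```

An update that spans two documents would need two separate `update` calls, and nothing here makes that pair atomic. That gap is why related data is often kept inside a single document.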

Slide 15

Slide 15 text

Scaleout architecture •  How do you distribute data among many servers •  Some examples: •  Amazon Dynamo: Hash-based ring •  Google Bigtable: Key-range partitioning •  Complication: secondary indexes •  Tradeoff: cluster rebalancing ease vs performance optimization •  MongoDB example: Bigtable-style, with key-range segments being logical
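
The two partitioning styles named above can be compared in a few lines. This is a simplified sketch under stated assumptions: `HashRing` and `KeyRangePartitioner` are hypothetical names, the ring has no virtual nodes or replication, and neither class reflects how Dynamo, Bigtable, or MongoDB is actually implemented.

```python
import bisect
import hashlib


def _hash(value):
    # Any stable hash works; SHA-256 is used here only for determinism.
    return int(hashlib.sha256(value.encode()).hexdigest(), 16)


class HashRing:
    """Hash-based ring: a key belongs to the first node clockwise from its hash."""

    def __init__(self, nodes):
        self._ring = sorted((_hash(node), node) for node in nodes)
        self._points = [point for point, _ in self._ring]

    def node_for(self, key):
        index = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[index][1]


class KeyRangePartitioner:
    """Key-range partitioning: sorted split points carve the key space into ranges."""

    def __init__(self, split_points, nodes):
        if len(nodes) != len(split_points) + 1:
            raise ValueError("need exactly one more node than split points")
        self._splits = list(split_points)
        self._nodes = list(nodes)

    def node_for(self, key):
        return self._nodes[bisect.bisect_right(self._splits, key)]

    def nodes_for_range(self, low, high):
        # A range scan only touches the nodes whose ranges overlap [low, high].
        first = bisect.bisect_right(self._splits, low)
        last = bisect.bisect_right(self._splits, high)
        return self._nodes[first:last + 1]


ring = HashRing(["n1", "n2", "n3"])
ranges = KeyRangePartitioner(["h", "p"], ["n1", "n2", "n3"])
```

With the range partitioner, `ranges.nodes_for_range("a", "c")` touches a single node, while the hash ring scatters adjacent keys and would have to ask every node for the same scan. The ring, in turn, spreads load without anyone having to choose split points.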

Slide 16

Slide 16 text

Scaleout – no free lunch •  With a large cluster: •  No known solution to fast distributed joins •  No known solution to fast distributed transactions

Slide 17

Slide 17 text

Data Model - Preamble •  I am not attempting to start a religious war •  I spent many years of my life dealing with relational data modeling issues •  I think the RDBMS is very useful and will be around for a long time •  I think that when you are using an RDBMS you would be well served to normalize your data

Slide 18

Slide 18 text

Data model - however •  Relational minus joins and multi-statement transactions is much less useful •  Therefore alternatives are worth considering for distributed systems •  Common alternatives •  Key-value •  Document •  Graph •  Column-family •  MongoDB example: JSON-based document oriented

Slide 19

Slide 19 text

Change one assumption •  First normal form: no repeating groups •  Why? •  What if that is not a requirement? •  You need many fewer joins •  Transactions are often simplified •  Data locality is often increased •  But at a cost •  Much further theory is now moot •  Implementation complexity •  From a different initial assumption, different rules apply
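
A small sketch of what dropping the "no repeating groups" rule changes. The data and function names are invented for illustration: the same person is stored once as two normalized tables and once as a single document with an embedded list.

```python
# First normal form: the repeating group (phone numbers) lives in its own table.
people = [{"person_id": 1, "name": "Ada"}]
phones = [
    {"person_id": 1, "kind": "home", "number": "555-0100"},
    {"person_id": 1, "kind": "work", "number": "555-0101"},
]


def load_person_relational(person_id):
    # Reassembling one person requires a join across both tables.
    person = next(p for p in people if p["person_id"] == person_id)
    numbers = [
        {"kind": row["kind"], "number": row["number"]}
        for row in phones
        if row["person_id"] == person_id
    ]
    return {"name": person["name"], "phones": numbers}


# Document form: the repeating group is embedded, so one lookup returns everything.
person_docs = {
    1: {
        "name": "Ada",
        "phones": [
            {"kind": "home", "number": "555-0100"},
            {"kind": "work", "number": "555-0101"},
        ],
    }
}


def load_person_document(person_id):
    return person_docs[person_id]
```

Both functions return the same structure. The document version needs no join and keeps the person's data in one place, while the relational version makes it easier to ask questions that cut across people, such as listing every work number.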

Slide 20

Slide 20 text

Querying a database •  By primary key only •  Ad-hoc queries •  SQL or otherwise, but language details are a minor choice •  Via map-reduce •  MongoDB example: ad-hoc queries (based on JSON) and map-reduce
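
To make the last two query styles concrete, here is a toy pure-Python version of each. It is not MongoDB's query engine: `matches`, `find`, and `map_reduce` are hypothetical helpers, and only equality plus `$gt` and `$lt` comparisons are handled. The query itself is a JSON-like dictionary, which is the idea the slide refers to.

```python
from collections import defaultdict


def matches(doc, query):
    """Return True if doc satisfies every field condition in the JSON-like query."""
    for field, condition in query.items():
        value = doc.get(field)
        if isinstance(condition, dict):
            if "$gt" in condition and not (value is not None and value > condition["$gt"]):
                return False
            if "$lt" in condition and not (value is not None and value < condition["$lt"]):
                return False
        elif value != condition:
            return False
    return True


def find(docs, query):
    # Ad-hoc query: any field can be filtered on, not just the primary key.
    return [doc for doc in docs if matches(doc, query)]


def map_reduce(docs, map_fn, reduce_fn):
    # Map emits (key, value) pairs; reduce folds the values collected per key.
    groups = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


checkins = [
    {"user": "ann", "venue": "cafe", "stars": 5},
    {"user": "bob", "venue": "cafe", "stars": 3},
    {"user": "ann", "venue": "park", "stars": 4},
]

good_cafe_visits = find(checkins, {"venue": "cafe", "stars": {"$gt": 3}})
visits_per_venue = map_reduce(
    checkins,
    lambda doc: [(doc["venue"], 1)],
    lambda venue, counts: sum(counts),
)
```

The ad-hoc query answers a question about individual records, while the map-reduce pass aggregates across all of them.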

Slide 21

Slide 21 text

How to package a database •  Enterprise software •  Tried and true, worked well for Larry •  Much harder for a new entrant to become mainstream •  Open source •  Much faster adoption •  Lower price point •  Customers are not trapped by support •  Online service •  Increasingly important •  My opinion: an important option but shouldn’t be exclusive •  Appliance •  Tightly coupled to proprietary HW or preintegrated with standard HW

Slide 22

Slide 22 text

Now, on to usage •  Examples will be MongoDB based

Slide 23

Slide 23 text

NoSQL Started on the Web •  Foursquare •  Craigslist •  Shutterfly •  eBay

Slide 24

Slide 24 text

NoSQL goes Enterprise •  NY Times •  Viacom •  Disney •  SAP

Slide 25

Slide 25 text

Even Governments! •  White House •  UK National Archives •  Indian National ID •  Major US Intelligence Agency

Slide 26

Slide 26 text

Warning: It’s Habit Forming •  Telefonica •  Major electronics manufacturer •  Disney •  Guardian

Slide 27

Slide 27 text

Thank you Max Schireson [email protected] @mschireson maxschireson.com
