NoSQL in the Enterprise

mongodb
April 30, 2012

This is a two-part session that will take place on Monday and Tuesday. The database landscape is evolving as new, scalable data stores emerge. Key-value stores, large tabular stores, and document-oriented databases offer a compelling alternative to the traditional relational database. By removing joins and loosening ACID constraints, this new class of non-relational or "NoSQL" solutions gains the ability to scale horizontally. This presentation will introduce attendees to the key concepts required to understand and evaluate NoSQL data stores, and conclude with an in-depth examination of the document-oriented database MongoDB.


Transcript

  1. My background
    •  Oracle from July 1994 to June 2003
    •  MarkLogic from July 2003 to Feb 2011
    •  10gen (makers of MongoDB) since Feb 2011
  2. In this talk
    •  Why is everyone and their brother inventing a new database?
    •  Why do they all look so different from each other and from what we’re used to?
    •  Use MongoDB as an example of some of the choices involved in designing a database
    •  Talk about some use cases
  3. Where I stand
    •  Longtime user of RDBMS
    •  Think they are useful for many applications
    •  Don’t think they’re the only game in town
    •  Think new types of databases will become an important part of the industry
    •  Skeptical of some approaches currently being taken
  4. Since the dawn of the RDBMS

    |                | 1970 | 2012 |
    |----------------|------|------|
    | Main memory    | Intel 1103, 1k bits | 4GB of RAM costs $25.99 |
    | Mass storage   | IBM 3330 Model 1, 100 MB | 3TB Superspeed USB for $129 |
    | Microprocessor | Nearly: 4004 being developed; 4 bits and 92,000 instructions per second | Westmere EX has 10 cores, 30MB L3 cache, runs at 2.4GHz |
  5. More recent changes

    |                       | A decade ago | Now |
    |-----------------------|--------------|-----|
    | Faster                | Buy a bigger server | Buy more servers |
    | Faster storage        | A SAN with more spindles | SSD |
    | More reliable storage | More expensive SAN | More copies of local storage |
    | Deployed in           | Your data center | The cloud (private or public) |
    | Large data set        | Millions of rows | Billions to trillions of rows |
    | Development           | Waterfall | Iterative |
  6. Assumptions behind today’s DBMS
    •  Relational data model
    •  Third normal form
    •  ACID
    •  Multi-statement transactions
    •  SQL
    •  RAM is small and disks are slow
    •  If it’s too slow, you can buy a faster computer
  7. Yesterday’s assumptions in today’s world
    •  Scaleout is hard
    •  Distributed joins are hard
    •  Making two-phase commits fast is hard
    •  Custom solutions proliferate
    •  Too slow? Just add a cache
    •  ORM tools everywhere
    •  More computers and disks are nearly free, but SANs and faster computers are expensive
  8. Challenging some assumptions
    •  Do you need a database at all?
    •  How does it handle transactions and consistency?
    •  How does it scale out?
    •  How should it model data?
    •  How do you query it?
    •  Is it enterprise software, open source, an appliance, or a cloud service?
  9. Do you need a database at all?
    •  Can you better solve your problem with an in-memory object store/cache?
    •  Can you better solve your problem with a batch processing framework?
  10. The CAP Theorem
    •  Common sense
    •  It says that if a distributed system is partitioned, you can’t both accept updates everywhere and stay consistent
    •  Duh
    •  So, either allow inconsistency or limit where updates can be applied
  11. Two choices for consistency
    •  Eventual consistency
      •  Allow updates when a system has been partitioned
      •  Resolve conflicts later
      •  Example: CouchDB, Cassandra
    •  Immediate consistency
      •  Limit the application of updates to a single master node for a given slice of data
      •  Avoids the possibility of conflicts
      •  Example: MongoDB
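The two choices on slide 11 can be sketched in a few lines of Python. This is a toy illustration, not how CouchDB, Cassandra, or MongoDB are implemented: the class names, the slice names, and the last-write-wins conflict policy are all assumptions made for the example.

```python
# Hypothetical sketch: every name here, and the last-write-wins policy,
# is an illustrative assumption rather than any product's behaviour.

class EventualReplica:
    """Accepts writes locally even while partitioned; conflicts are resolved later."""

    def __init__(self):
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        self.data[key] = (timestamp, value)

    def merge(self, other):
        # After the partition heals, keep the newest write for each key.
        for key, versioned in other.data.items():
            if key not in self.data or versioned[0] > self.data[key][0]:
                self.data[key] = versioned


class SingleMasterCluster:
    """Routes every write for a slice of data to exactly one master node."""

    def __init__(self, masters):
        self.masters = masters         # slice name -> dict acting as that master's store
        self.reachable = set(masters)  # slices whose master can currently be reached

    def write(self, slice_name, key, value):
        if slice_name not in self.reachable:
            raise ConnectionError("master for slice %r is unreachable" % slice_name)
        self.masters[slice_name][key] = value


# Eventual consistency: both sides of a partition accept a write to the same key.
a, b = EventualReplica(), EventualReplica()
a.write("cart", ["book"], timestamp=1)
b.write("cart", ["pen"], timestamp=2)
a.merge(b)
b.merge(a)
eventual_result = a.data["cart"][1]

# Immediate consistency: a write is refused rather than allowed to conflict.
cluster = SingleMasterCluster({"users_a_to_m": {}, "users_n_to_z": {}})
cluster.write("users_a_to_m", "alice", {"plan": "pro"})
cluster.reachable.discard("users_n_to_z")  # simulate a partition
try:
    cluster.write("users_n_to_z", "zoe", {"plan": "free"})
    refused = False
except ConnectionError:
    refused = True
```

In the eventual case both replicas converge, but the earlier write is silently discarded by the chosen policy. In the single-master case nothing conflicts, at the price of refusing writes when a slice's master cannot be reached.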
  12. Transactions
    •  Do they exist?
    •  At what level of granularity?
    •  MongoDB example
      •  Transactions are document-level
      •  Those short transactions are atomic, consistent, isolated and durable
  13. Scaleout architecture
    •  How do you distribute data among many servers?
    •  Some examples:
      •  Amazon Dynamo: hash-based ring
      •  Google Bigtable: key-range partitioning
    •  Complication: secondary indexes
    •  Tradeoff: cluster rebalancing ease vs performance optimization
    •  MongoDB example: Bigtable style, with key-range segments being logical
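To make the two distribution schemes on slide 13 concrete, here is a minimal Python sketch. It assumes hypothetical node names and split points, and it uses md5 only as a convenient stable hash; it is not Dynamo's, Bigtable's, or MongoDB's actual algorithm.

```python
import bisect
import hashlib


def stable_hash(text):
    # md5 is used only because it is stable across runs, not for security.
    return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hash-based ring: a key belongs to the first node clockwise from its hash."""

    def __init__(self, nodes):
        self.ring = sorted((stable_hash(n), n) for n in nodes)
        self.points = [point for point, _ in self.ring]

    def node_for(self, key):
        # Wrap around to the first node when the key hashes past the last point.
        i = bisect.bisect_right(self.points, stable_hash(key)) % len(self.ring)
        return self.ring[i][1]


class RangePartitioner:
    """Key-range partitioning: each node owns a contiguous, sorted range of keys."""

    def __init__(self, split_points, nodes):
        # split_points must be sorted; nodes has one more entry than split_points.
        assert len(nodes) == len(split_points) + 1
        self.split_points = split_points
        self.nodes = nodes

    def node_for(self, key):
        return self.nodes[bisect.bisect_right(self.split_points, key)]


# Hypothetical cluster layouts used only for illustration.
ring = HashRing(["node-a", "node-b", "node-c"])
ranges = RangePartitioner(["h", "q"], ["shard0", "shard1", "shard2"])

placement = {name: ranges.node_for(name) for name in ["alice", "bob", "mallory", "zoe"]}
```

With ranges, neighbouring keys such as "alice" and "bob" land on the same shard, so a range scan touches few nodes; the ring spreads keys by hash value instead, which tends to even out load but scatters neighbouring keys.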
  14. Scaleout - no free lunch
    •  With a large cluster:
      •  No known solution to fast distributed joins
      •  No known solution to fast distributed transactions
  15. Data Model - Preamble
    •  I am not attempting to start a religious war
    •  I spent many years of my life dealing with relational data modeling issues
    •  I think the RDBMS is very useful and will be around for a long time
    •  I think that when you are using an RDBMS you would be well served to normalize your data
  16. Data model - however
    •  Relational minus joins and multi-statement transactions is much less useful
    •  Therefore alternatives are worth considering for distributed systems
    •  Common alternatives
      •  Key-value
      •  Document
      •  Graph
      •  Column-family
    •  MongoDB example: JSON-based, document-oriented
  17. Change one assumption
    •  First normal form: no repeating groups
    •  Why?
    •  What if that is not a requirement?
      •  You need many fewer joins
      •  Transactions are often simplified
      •  Data locality is often increased
    •  But at a cost
      •  Much further theory is now moot
      •  Implementation complexity
    •  From a different initial assumption, different rules apply
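A small Python sketch of what dropping the "no repeating groups" rule from slide 17 looks like in practice. The people and phones data, the field names, and both loader functions are invented for illustration; plain lists and dicts stand in for tables and a document collection.

```python
# Relational shape: first normal form forbids repeating groups, so phone
# numbers live in their own table and are reassembled with a join.
people = [{"person_id": 1, "name": "Ada"}]
phones = [
    {"person_id": 1, "kind": "home", "number": "555-0100"},
    {"person_id": 1, "kind": "work", "number": "555-0199"},
]


def load_person_relational(person_id):
    person = next(p for p in people if p["person_id"] == person_id)
    # This comprehension plays the role of the join against the phones table.
    numbers = [
        {"kind": ph["kind"], "number": ph["number"]}
        for ph in phones
        if ph["person_id"] == person_id
    ]
    return {"name": person["name"], "phones": numbers}


# Document shape: the repeating group is embedded in the parent document,
# so one lookup by key returns everything and no join is needed.
person_doc = {
    "_id": 1,
    "name": "Ada",
    "phones": [
        {"kind": "home", "number": "555-0100"},
        {"kind": "work", "number": "555-0199"},
    ],
}
documents = {person_doc["_id"]: person_doc}


def load_person_document(person_id):
    doc = documents[person_id]
    return {"name": doc["name"], "phones": doc["phones"]}
```

Both loaders return the same logical record. The document version reads one object that is stored together, which is where the "fewer joins" and "data locality" points come from; the cost is that questions cutting across documents (for example, all work numbers) no longer map onto a simple table scan.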
  18. Querying a database
    •  By primary key only
    •  Ad-hoc queries
      •  SQL or otherwise, but language details are a minor choice
    •  Via map-reduce
    •  MongoDB example: ad-hoc queries (based on JSON) and map-reduce
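The two query styles named on slide 18 can be illustrated with a toy in-memory version in Python. The find and map_reduce helpers and the orders data are hypothetical; MongoDB's own query language and map-reduce interface are richer and are not reproduced here.

```python
from collections import defaultdict

# Hypothetical sample collection of JSON-like documents.
orders = [
    {"customer": "ada", "status": "shipped", "total": 30},
    {"customer": "ada", "status": "open", "total": 12},
    {"customer": "bob", "status": "shipped", "total": 8},
]


def find(collection, query):
    """Ad-hoc query by example: return documents whose fields equal the query's."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in query.items())
    ]


def map_reduce(collection, map_fn, reduce_fn):
    """Group the (key, value) pairs emitted by map_fn, then reduce each group."""
    groups = defaultdict(list)
    for doc in collection:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# Ad-hoc query: the query itself is a JSON-like document.
shipped = find(orders, {"status": "shipped"})

# Map-reduce: total order value per customer.
totals = map_reduce(
    orders,
    map_fn=lambda doc: [(doc["customer"], doc["total"])],
    reduce_fn=lambda key, values: sum(values),
)
```

The ad-hoc query needs no code beyond the query document, while map-reduce requires writing two functions but can express aggregations that a simple field match cannot.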
  19. How to package a database
    •  Enterprise software
      •  Tried and true, worked well for Larry
      •  Much harder for a new entrant to become mainstream
    •  Open source
      •  Much faster adoption
      •  Lower price point
      •  Customers are not trapped by support
    •  Online service
      •  Increasingly important
      •  My opinion: an important option but shouldn’t be exclusive
    •  Appliance
      •  Tightly coupled to proprietary HW or preintegrated with standard HW