
Navigating the NoSQL Landscape


I gave this talk at ConFESS 2013 (https://2013.con-fess.com/). It was not originally scheduled: another talk, about Lego Mindstorms programming with Java, was cancelled (that's why it's mentioned in the slides), and I jumped in.

It gives a fairly broad overview of the "current" NoSQL landscape and is meant as a starting point for people's own investigations.

Michael Nitschinger

April 05, 2013

Transcript

  1. Navigating the NoSQL Landscape using Lego Mindstorms and Java
     Michael Nitschinger, Developer Advocate, Couchbase Inc.
  3. {“about”: “me”}
     • Developer Advocate at Couchbase, Inc.
     • Maintainer of the Couchbase Java SDK
     • Speaking at conferences and meetups
     • Living and working here in Vienna, Austria
  4. What we’ll talk about
     • What are the limits of RDBMS solutions?
     • What are the different NoSQL taxonomies?
     • Which NoSQL solution is right for me?
  5. Growth is the New Reality
     • Instagram gained nearly 1 million users overnight when they expanded to Android
  6. Does it work with an RDBMS backend?
     • Application scales out: just add more commodity web servers
     • Database scales up: get a bigger, more complex server
     Note: Relational database technology is great for what it is great for, but it is not great for this.
  7. Some alternatives to scale out your RDBMS
     Scale out your RDBMS:
     • Run many SQL servers
     • Data is sharded (on the app level!)
     • Memcached/cache for faster response time
     • Writes are still slow
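The "sharded on the app level" bullet can be sketched roughly as follows. This is a minimal illustration, not code from the talk: the shard names and the helper functions `shard_for` and `moved_keys` are hypothetical, and a real setup would hold database connections instead of strings.

```python
import hashlib

# Hypothetical shard list; a real application would keep connection handles here.
SHARDS = ["sql-server-0", "sql-server-1", "sql-server-2"]


def shard_for(user_id, shards=SHARDS):
    """Pick a shard by hashing the key. The application owns this routing logic."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(shards)
    return shards[index]


def moved_keys(keys, old_shards, new_shards):
    """Count keys that land on a different shard after the shard list changes."""
    return sum(
        1 for key in keys
        if shard_for(key, old_shards) != shard_for(key, new_shards)
    )


if __name__ == "__main__":
    keys = [f"user-{i}" for i in range(1000)]
    print(shard_for("user-42"))
    print(moved_keys(keys, SHARDS, SHARDS + ["sql-server-3"]))
```

With plain modulo hashing, going from three to four shards relocates about three quarters of the keys in expectation, which is one reason resharding by hand is painful on a live system.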
  8. Scale out with RDBMS
     Is this a good approach to scale?
     • Lots of components to deploy
     • Scale by hand
       - Caching
       - Sharding/Replication
     Learn from others: this scenario costs time and money. Scaling SQL is potentially disastrous when going viral: it is a very risky time for major code changes and migrations. You have no time when skyrocketing up!
  9. The Relational Model
     • Formulated and proposed by Edgar Codd in 1969.
       - http://en.wikipedia.org/wiki/Relational_model
     • Based on Relational Algebra
       - which is based on Set Theory
     • Not all problems fit into Set Theory
       - e.g. Graph Theory
       - Relationships
       - Recommendations
     http://en.wikipedia.org/wiki/Honeywell_316
  10. Lacking market solutions, users forced to invent
      • Dynamo: October 2007
      • Cassandra: August 2008
      • Voldemort: February 2009
      • Bigtable: November 2006
      Very few organizations want to (fewer can) build and maintain database software technology. But every organization building interactive web applications needs this technology.
      • No schema required before inserting data
      • No schema change required to change data format
      • Auto-sharding without application participation
      • Distributed queries
      • Integrated main memory caching
      • Data synchronization (mobile, multi-datacenter)
  11. Survey: Schema inflexibility #1 adoption driver
      What is the biggest data management problem driving your use of NoSQL in the coming year?
      • Lack of flexibility/rigid schemas: 49%
      • Inability to scale out data: 35%
      • High latency/low performance: 29%
      • Costs: 16%
      • All of these: 12%
      • Other: 11%
      Source: Couchbase NoSQL Survey, December 2011, n=1351
  12. NoSQL database matches application logic tier architecture
      Data layer now scales with linear cost and constant performance
      • Application scales out: just add more commodity web servers
      • Database scales out: just add more commodity data servers
      Scaling out flattens the cost and performance curves.
      [Diagram: NoSQL database servers]
  13. The CAP Theorem
      • In a distributed system:
        - Consistency
        - Availability
        - Partition Tolerance
      • When a partition happens
        - Choose either Consistency (only respond to subset)
        - or Availability (accept stale data and conflicting writes)
      Conflict resolution!
  14. Clarification
      • Big Data
        - Large-scale datastore (“>= 100TB or Petabytes”)
        - Optimized for batch processing
        - Data warehouse
      • Big Users
        - Very high get/set rate (thousands of ops/s)
        - Working set in RAM
        - Latency and throughput matter most
        - (Near) real-time use cases
  15. The Key-Value Store / “Cache” – the foundation of NoSQL
      [Diagram: a key mapped to an opaque binary value]
  16. Memcached – the NoSQL precursor
      [Diagram: a key mapped to an opaque binary value]
      Memcached:
      • In-memory only
      • Limited set of operations
        - Blob storage: Set, Add, Replace, CAS
        - Retrieval: Get
        - Structured data: Append, Increment
      “Simple and fast.”
      Challenges:
      - Cold cache
      - Disruptive elasticity
      - Missing persistence
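To make the difference between Set, Add, Replace and CAS concrete, here is a small in-memory toy that mimics those command semantics. It is only a sketch: `ToyCache` and its method names are invented for illustration and are not the API of memcached or of any client library.

```python
class ToyCache:
    """In-memory toy mimicking memcached-style command semantics (not a real client)."""

    def __init__(self):
        self._items = {}      # key -> (value, cas_token)
        self._next_cas = 1

    def _store(self, key, value):
        # Every successful store gets a fresh token, so stale CAS attempts fail.
        self._items[key] = (value, self._next_cas)
        self._next_cas += 1
        return True

    def set(self, key, value):
        """Store unconditionally."""
        return self._store(key, value)

    def add(self, key, value):
        """Store only if the key does not exist yet."""
        return key not in self._items and self._store(key, value)

    def replace(self, key, value):
        """Store only if the key already exists."""
        return key in self._items and self._store(key, value)

    def gets(self, key):
        """Return (value, cas_token), or (None, None) for a missing key."""
        return self._items.get(key, (None, None))

    def get(self, key):
        return self.gets(key)[0]

    def cas(self, key, value, token):
        """Store only if nobody has changed the item since `token` was read."""
        current = self._items.get(key)
        return current is not None and current[1] == token and self._store(key, value)

    def incr(self, key, delta=1):
        """Add delta to a numeric value; returns None when the key is missing."""
        if key not in self._items:
            return None
        new_value = int(self._items[key][0]) + delta
        self._store(key, str(new_value))
        return new_value


if __name__ == "__main__":
    cache = ToyCache()
    cache.add("visits", "1")
    value, token = cache.gets("visits")
    cache.set("visits", "5")                 # a concurrent writer gets in first
    print(cache.cas("visits", "2", token))   # stale token, so the CAS is rejected
    print(cache.incr("visits", 2))
```

The CAS token is what lets several application servers update the same key without a lock: a writer whose token is stale simply re-reads and retries.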
  17. Redis – More “Structured Data” commands
      [Diagram: a key mapped to “data structures”: Blob, List, Set, Hash, …]
      Redis:
      • Disk persistence (eventual consistency on the disk)
      • Vast set of operations
        - Blob storage: Set, Add, Replace, CAS
        - Retrieval: Get, Pub-Sub
        - Structured data: Strings, Hashes, Lists, Sets, Sorted sets
      Challenges:
      - Clustering (to come)
      - RAM limit (no eviction)
  18. NoSQL catalog
      Tiers: Cache (memory only), Database (memory/disk)
      • Key-Value: Memcached
      • Data Structure: Redis
  19. Membase – From key-value cache to database
      • Disk-based with built-in memcached cache
      • Cache refill on restart
      • Memcached compatible (drop-in replacement)
      • Highly available (data replication)
      • Add or remove capacity to live cluster
      “Simple, fast, elastic.”
      [Diagram: Membase, a key mapped to an opaque binary value]
  20. NoSQL catalog
      Tiers: Cache (memory only), Database (memory/disk)
      • Key-Value: Memcached, Membase
      • Data Structure: Redis
  21. Couchbase – Document-oriented database
      [Diagram: Couchbase, a key mapped to a JSON or opaque object (“document”)]
        {
          “string” : “string”,
          “string” : value,
          “string” : { “string” : “string”, “string” : value },
          “string” : [ array ]
        }
      • Auto-sharding
      • Disk-based with built-in memcached cache
      • Cache refill on restart
      • Memcached compatible (drop-in replacement)
      • Highly available (data replication)
      • Add or remove capacity to live cluster
      When values are JSON objects (“documents”): create indices and views, and query against the views
      Chooses Consistency over Availability
  22. NoSQL catalog
      Tiers: Cache (memory only), Database (memory/disk)
      • Key-Value: Memcached, Membase
      • Data Structure: Redis
      • Document: Couchbase
  23. MongoDB – Document-oriented database
      [Diagram: MongoDB, a key mapped to a BSON object (“document”)]
        {
          “string” : “string”,
          “string” : value,
          “string” : { “string” : “string”, “string” : value },
          “string” : [ array ]
        }
      • Disk-based with in-memory “caching”
      • BSON (“binary JSON”) format and wire protocol
      • Master-slave replication
      • Auto-sharding
      • Values are BSON objects
      • Supports ad hoc queries, best when indexed
      • More similar to RDBMS modeling than to caches
      • Scaling over sharding requires special nodes
  24. NoSQL catalog
      Tiers: Cache (memory only), Database (memory/disk)
      • Key-Value: Memcached, Membase
      • Data Structure: Redis
      • Document: Couchbase, MongoDB
  25. Cassandra – Column overlays
      • Disk-based system
      • Clustered
      • External caching required for low-latency reads
      • “Columns” are overlaid on the data
      • Not all rows must have all columns
      • Supports efficient queries on columns
      • Restart required when adding columns
      • Multi-data-center replication supported
      • Column model may be complex to start with
      Chooses Availability over Consistency
      [Diagram: Cassandra, a key mapped to an opaque binary value overlaid with Column 1, Column 2 and Column 3 (not present)]
  26. NoSQL catalog
      Tiers: Cache (memory only), Database (memory/disk)
      • Key-Value: Memcached, Membase
      • Data Structure: Redis
      • Document: Couchbase, MongoDB
      • Column: Cassandra
  27. Neo4j – Graph database
      • Disk-based system
      • External caching required for low-latency reads
      • Nodes, relationships and paths
      • Properties on nodes
      • Delete, Insert, Traverse, etc.
      [Diagram: Neo4j, several key / opaque binary value pairs]
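The "nodes, relationships and paths" vocabulary can be illustrated with a tiny in-memory graph and a breadth-first traversal. This is a plain-Python sketch, not Neo4j's API: the sample people, the `KNOWS` relationship type and the helpers `neighbours` and `traverse` are all made up for the example.

```python
from collections import deque

# Nodes carry properties; relationships are (source, type, target) triples.
nodes = {
    "ann": {"city": "Vienna"},
    "bob": {"city": "Graz"},
    "carol": {"city": "Linz"},
    "dave": {"city": "Salzburg"},
}
relationships = [
    ("ann", "KNOWS", "bob"),
    ("bob", "KNOWS", "carol"),
    ("carol", "KNOWS", "dave"),
]


def neighbours(node, rel_type):
    """Targets reachable from `node` over one relationship of `rel_type`."""
    return [dst for src, rel, dst in relationships if src == node and rel == rel_type]


def traverse(start, rel_type, max_depth):
    """Breadth-first traversal; returns {node: depth} for nodes within max_depth hops."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if seen[current] == max_depth:
            continue  # do not expand beyond the requested depth
        for nxt in neighbours(current, rel_type):
            if nxt not in seen:
                seen[nxt] = seen[current] + 1
                queue.append(nxt)
    return seen


if __name__ == "__main__":
    # "Friends of friends" of ann: everything within two KNOWS hops.
    print(traverse("ann", "KNOWS", 2))
```

A relational schema would need one self-join per hop for the same question, which is the kind of problem the talk describes as a poor fit for set theory.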
  28. NoSQL catalog
      Tiers: Cache (memory only), Database (memory/disk)
      • Key-Value: Memcached, Membase
      • Data Structure: Redis
      • Document: Couchbase, MongoDB
      • Column: Cassandra
      • Graph: Neo4j
  29. NoSQL catalog
      Tiers: Cache (memory only), Database (memory/disk)
      • Key-Value: Memcached, Membase, Riak, Coherence
      • Data Structure: Redis
      • Document: Couchbase, MongoDB
      • Column: Cassandra, HBase
      • Graph: Neo4j, InfiniteGraph
  30. Hadoop: Big Data Swiss Army Knife
      • Oozie: Workflow, coordination
      • Sqoop: Data connector to import/export data
      • Hive: SQL-like interface
      • Pig: High-level programming language
      • Mahout: Machine learning library
      • Whirr: Hadoop management tools for cloud services
      • Flume: Aggregator
      • Map Reduce: Framework to process large volumes of data
      • HBase: Key-value data store
      • Zookeeper: Centralized configuration management
      • HDFS: Distributed file system
  31. So what? Connecting Hadoop
      [Diagram with three numbered flows; labels:]
      • Click stream events
      • Profiles, campaigns
      • Profiles, real-time campaign statistics
      • 40 milliseconds to respond with the decision.
  32. Survey: Schema inflexibility #1 adoption driver
      What is the biggest data management problem driving your use of NoSQL in the coming year?
      • Lack of flexibility/rigid schemas: 49%
      • Inability to scale out data: 35%
      • High latency/low performance: 29%
      • Costs: 16%
      • All of these: 12%
      • Other: 11%
      Source: Couchbase NoSQL Survey, December 2011, n=1351
  33. Lack of Flexibility / Rigid Schema
      • Aggregate Data Models (Martin Fowler)
        - Flexible data structure
        - Optimized access
        - Easy to distribute data
      o::1001
        {
          "uid": "ji22jd",
          "customer": "Ann",
          "line_items": [
            { "sku": "0321293533", "quan": 3, "unit_price": 48.0 },
            { "sku": "0321601912", "quan": 1, "unit_price": 39.0 },
            { "sku": "0131495054", "quan": 1, "unit_price": 51.0 }
          ],
          "payment": { "type": "Amex", "expiry": "04/2001", "last5": 12345 }
        }
      http://martinfowler.com/bliki/AggregateOrientedDatabase.html
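As a rough illustration of why an aggregate gives "optimized access", the order from the slide can be stored and read back as one unit under its key. The sketch below uses a plain dict as a stand-in for a key-value or document store; the `store` variable and the `order_total` helper are assumptions made for the example.

```python
import json

# The order aggregate from the slide, with strings quoted so it is valid JSON.
order = {
    "uid": "ji22jd",
    "customer": "Ann",
    "line_items": [
        {"sku": "0321293533", "quan": 3, "unit_price": 48.0},
        {"sku": "0321601912", "quan": 1, "unit_price": 39.0},
        {"sku": "0131495054", "quan": 1, "unit_price": 51.0},
    ],
    "payment": {"type": "Amex", "expiry": "04/2001", "last5": 12345},
}

# A dict stands in for the database: one key, one whole aggregate.
store = {"o::1001": json.dumps(order)}


def order_total(doc):
    """Sum quantity times unit price over the embedded line items."""
    return sum(item["quan"] * item["unit_price"] for item in doc["line_items"])


if __name__ == "__main__":
    loaded = json.loads(store["o::1001"])  # a single lookup returns everything
    print(order_total(loaded))             # prints 234.0
```

In a normalized relational schema the same read would touch an orders table, a line-items table and a payments table; here the whole order travels together, which is also what makes it easy to place on a single shard.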
  34. Use Cases
      Key-Value
      • Session management
      • User profile/preferences
      • Shopping cart
      Document
      • Event logging
      • Content management
      • Web analytics
      • E-commerce application
      Columns
      • Event logging
      • Content management
      • Counters
      Graph
      • Connected data / social networks
      • Routing, dispatch
      • Recommendations based on social graph
  35. How do I want to scale out?
      • Modifying the cluster topology should be simple
        - Add, remove, configure nodes on a running system
      • What is the impact of topology changes?
        - Sharding, caching of the data
        - Availability of the service during cluster changes
      • More hardware = more failures
        - Availability, reliability of the system: failover support
  36. Add Nodes to Cluster
      • Two servers added
        - One-click operation
      • Docs automatically rebalanced across cluster
        - Even distribution of docs
        - Minimum doc movement
      • Cluster map updated
      • App database calls now distributed over larger number of servers
      [Diagram: Couchbase Server cluster with Servers 1 to 5, each holding ACTIVE and REPLICA docs (user-configured replica count = 1); App Servers 1 and 2 each use the Couchbase client library with a cluster map for READ/WRITE/UPDATE]
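The rebalance described on this slide can be sketched with a fixed set of partitions and a cluster map from partition to server. This is a simplified model under stated assumptions, not Couchbase's actual algorithm: the partition count of 64, the CRC32 hash and the helper names are chosen for illustration only.

```python
import zlib

NUM_PARTITIONS = 64  # assumption for this sketch; kept fixed while servers come and go


def partition_of(key):
    """Map a document key to a partition; this never changes with cluster size."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def initial_map(servers):
    """Cluster map: partition number -> server name, assigned round robin."""
    return {p: servers[p % len(servers)] for p in range(NUM_PARTITIONS)}


def server_for(key, cluster_map):
    """What a client library does with its copy of the cluster map."""
    return cluster_map[partition_of(key)]


def rebalance(cluster_map, servers):
    """Spread partitions evenly over `servers`, moving only the surplus ones."""
    base, extra = divmod(len(cluster_map), len(servers))
    # The first `extra` servers may hold one partition more than the rest.
    quota = {s: base + (1 if i < extra else 0) for i, s in enumerate(servers)}
    new_map, homeless = {}, []
    for p in sorted(cluster_map):
        owner = cluster_map[p]
        if quota.get(owner, 0) > 0:
            new_map[p] = owner      # partition stays where it is
            quota[owner] -= 1
        else:
            homeless.append(p)      # owner is gone or already at its quota
    for s in servers:
        while quota[s] > 0:
            new_map[homeless.pop()] = s
            quota[s] -= 1
    return new_map


if __name__ == "__main__":
    before = initial_map(["server1", "server2", "server3"])
    after = rebalance(before, ["server1", "server2", "server3", "server4", "server5"])
    moved = sum(1 for p in before if before[p] != after[p])
    print(moved, "of", NUM_PARTITIONS, "partitions moved")
    print(server_for("doc-7", after))
```

Because keys map to partitions and only the partition-to-server table changes, clients just need the updated cluster map; no key has to be rehashed.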
  37. Fail Over Node
      • App servers accessing docs
      • Requests to Server 3 fail
      • Cluster detects server failed
        - Promotes replicas of docs to active
        - Updates cluster map
      • Requests for docs now go to appropriate server
      • Typically rebalance would follow
      [Diagram: Couchbase Server cluster with Servers 1 to 5, each holding ACTIVE and REPLICA docs (user-configured replica count = 1); App Servers 1 and 2 each use the Couchbase client library with a cluster map]
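Replica promotion can be sketched in the same style. Again this is only a model of the steps listed on the slide: the two maps and the `fail_over` function are hypothetical, and the sketch assumes a replica count of 1 with each replica living on a different server than its active copy.

```python
def fail_over(active_map, replica_map, failed):
    """Promote replicas for partitions whose active copy lived on `failed`.

    Both maps go from partition number to server name. Returns new maps;
    a replica entry of None means "no replica until a rebalance recreates one".
    """
    new_active = dict(active_map)
    new_replica = dict(replica_map)
    for partition, owner in active_map.items():
        if owner == failed:
            new_active[partition] = replica_map[partition]  # replica becomes active
            new_replica[partition] = None
    for partition, holder in replica_map.items():
        if holder == failed:
            new_replica[partition] = None                    # replica copy was lost
    return new_active, new_replica


if __name__ == "__main__":
    active = {0: "server1", 1: "server2", 2: "server3"}
    replica = {0: "server2", 1: "server3", 2: "server1"}
    new_active, new_replica = fail_over(active, replica, "server3")
    print(new_active)   # partition 2 is now served by server1
    print(new_replica)  # partitions 1 and 2 have no replica for now
```

The missing replicas are why the slide notes that a rebalance typically follows: until then, a second failure could lose data for those partitions.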
  38. Performance
      • What is my working set?
        - Different patterns based on the application
        - Social games vs. analytics
      • What do I need to cache / how often?
        - Put your data in RAM
        - Read/write rates
      • How to design my data model?
        - Trim towards your “hot code path”
        - Aggregate model
        - Easy to change
  39. Management and Monitoring
      • Do not forget about operations!
        - Service Reliability Engineering Team will thank you!
      • Manage your cluster easily:
        - Command line, administration console to change cluster topology
      • Monitor “your NoSQL”
        - Analyze the overall status of your cluster
        - View and fix bottlenecks
  40. Conclusion
      • One size does not fit all
      • Overview of the NoSQL types
      • Choose the right solution for your application
      • Don’t mix Big Data with Big Users!
  41. Thank you!
      [email protected]
      @daschl
      Get Couchbase Server at http://www.couchbase.com/download