Slide 1

Slide 1 text

Monday, June 10, 13

Slide 2

Slide 2 text

Goto Night CPH, June 6th 2013
How to integrate Hadoop with your NoSQL database?
Tugdual “Tug” Grall, Technical Evangelist

Slide 3

Slide 3 text

About Me
• Tugdual “Tug” Grall
  - Couchbase
    • Technical Evangelist
  - eXo
    • CTO
  - Oracle
    • Developer/Product Manager
    • Mainly Java/SOA
  - Developer in consulting firms
• Web
  - @tgrall
  - http://blog.grallandco.com
  - tgrall
• NantesJUG co-founder
• Pet Project:
  - http://www.resultri.com

Slide 4

Slide 4 text

Big Data: High Data Variety and Velocity
[Chart: trillions of gigabytes (zettabytes), 0 to 2.00, for 2000, 2006 and 2011; series: Structured Data, Unstructured and Semi-Structured Data]
Unstructured and semi-structured data: text, log files, click streams, blogs, tweets, audio, video, etc.
More Flexible Data Model Required
Source: IDC 2011 Digital Universe Study (http://www.emc.com/collateral/demos/microsites/emc-digital-universe-2011/index.htm)

Slide 5

Slide 5 text

$30B Database Market Being Disrupted
All new database growth will be NoSQL
[Chart: database market split into Relational Technology, NoSQL Technology and Other; chart labels: 2013, 2027, 95%, <50%?]

Slide 6

Slide 6 text

Operational vs. Analytic Databases
• Analytic Databases (Cloudera, Hortonworks): get insights from data
• Real-time, Interactive Databases (NoSQL: Couchbase, Mongo): fast access to data

Slide 7

Slide 7 text

• Lack of flexibility/rigid schemas: 49%
• Inability to scale out data: 35%
• Performance challenges: 29%
• Cost: 16%
• All of these: 12%
• Other: 11%
Source: Couchbase Survey, December 2011, n = 1351.

Slide 8

Slide 8 text

Hadoop

Slide 9

Slide 9 text

What is Hadoop?
• Highly scalable
• Unstructured data
• Open source
• Big Data Operating System
• Changing the World One Petabyte at a Time

Slide 10

Slide 10 text

What is Hadoop?
• Simplest unit of compute and storage
[Diagram labels: CPU, Disks, Application, Data]

Slide 11

Slide 11 text

What is Hadoop?
• And when it grows?
[Diagram labels: Application, Data]

Slide 12

Slide 12 text

What is Hadoop?
• And when it grows more?

Slide 13

Slide 13 text

What is Hadoop?
• NoSQL to the rescue
[Diagram labels: Application, Data]

Slide 14

Slide 14 text

What is Hadoop?
• Hadoop is a different paradigm
[Diagram labels: Application, Data]

Slide 15

Slide 15 text


Slide 16

Slide 16 text

Hadoop and NoSQL

Slide 17

Slide 17 text

Ad and offer targeting
[Diagram with numbered steps 1, 2, 3; labels: events; profiles, campaigns; profiles, real time campaign statistics]
40 milliseconds to respond with the decision.

Slide 18

Slide 18 text

Moving Parts
[Diagram: Ad Targeting Platform, Couchbase Server Cluster, Hadoop Cluster, Logs; connections labeled sqoop import, sqoop export, flume flow]

Slide 19

Slide 19 text

Content & Recommendation Targeting
[Diagram with numbered steps 1, 2, 3; labels: events, user profiles, make recommendations; components: Content Oriented Site, Legacy Relational Database]

Slide 20

Slide 20 text

Moving Parts
[Diagram: Content Driven Web Site, Couchbase Server Cluster, Hadoop Cluster, Original RDBMS, Logs; connections labeled sqoop import, sqoop export, flume flow]
To keep up with changing needs for richer, more targeted content delivered very quickly to ever larger audiences, the data behind content-driven sites is shifting to Couchbase.
Hadoop excels at complex analytics, which may involve multiple processing steps that incorporate a number of different data sources.

Slide 21

Slide 21 text

What is Sqoop?
Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.
Source: sqoop.apache.org

Slide 22

Slide 22 text

What is Sqoop?
• Traditional ETL
[Diagram labels: Application, Data, Data, T]

Slide 23

Slide 23 text

What is Sqoop?
• A different paradigm
[Diagram labels: Data, Application, Data]

Slide 24

Slide 24 text

What is Sqoop?
• A very scalable different paradigm
[Diagram: one Data source feeding three Application/Data nodes]

Slide 25

Slide 25 text

What is Sqoop?
• Where did the Transform go?
[Diagram labels: Application, Data, and twelve T (Transform) markers]

Slide 26

Slide 26 text

What is Sqoop?
• Sqoop “SQL-Hadoop”
  - Default connection is via JDBC
• Lots of custom connectors
  - Couchbase, VoltDB, Vertica
  - Teradata, Netezza
  - Oracle, MySQL, Postgres

Slide 27

Slide 27 text

Sqoop: Import
    sqoop import --connect jdbc:mysql://rdbms1.demo.com/CRM --table customers

Slide 28

Slide 28 text

Sqoop: Export
    sqoop export --connect jdbc:mysql://rdbms1.demo.com/ANALYTICS \
      --table sales \
      --export-dir /user/hive/warehouse/zip_profits \
      --input-fields-terminated-by '\0001'
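The '\0001' delimiter in the export command is the Ctrl-A character (code point 1), which Hive uses by default to separate fields in its text files. As a rough sketch of what one exported row looks like before it is mapped to table columns, the Python below splits such a line. The column names and the sample values are made up for illustration and do not come from the slides.

```python
# Hive's default field delimiter is Ctrl-A (code point 1), written '\0001'
# on the Sqoop command line.
FIELD_DELIMITER = "\x01"

# Hypothetical column names for the target table; not from the slides.
COLUMNS = ["zip", "profit"]


def parse_hive_row(line, columns=COLUMNS):
    """Split one Hive text-file line into a dict keyed by column name."""
    values = line.rstrip("\n").split(FIELD_DELIMITER)
    if len(values) != len(columns):
        raise ValueError(f"expected {len(columns)} fields, got {len(values)}")
    return dict(zip(columns, values))


if __name__ == "__main__":
    sample = "2100" + FIELD_DELIMITER + "1234.50\n"  # made-up sample row
    print(parse_hive_row(sample))
```

If the delimiter passed to Sqoop does not match the one used in the files under the export directory, each line parses as a single field and the export fails, which is why the flag is spelled out in the command.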

Slide 29

Slide 29 text

Sqoop: Import
    sqoop import --connect http://localhost:8091/pools --table DUMP

Slide 30

Slide 30 text

Sqoop: Import
[Diagram: Sqoop Client, Metadata, Launches, MapReduce Job with three Map tasks, each paired with HDFS]
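For a JDBC source such as the earlier MySQL example, Sqoop by default parallelizes an import by looking up the minimum and maximum of a split column (normally the primary key) and giving each map task one slice of that range. The Python below is a simplified sketch of that idea, not Sqoop's actual code; the function name `split_ranges` is hypothetical.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Divide the inclusive range [min_id, max_id] into contiguous slices,
    one per map task, with sizes differing by at most one."""
    if num_mappers < 1 or max_id < min_id:
        raise ValueError("need at least one mapper and a non-empty range")
    total = max_id - min_id + 1
    base, extra = divmod(total, num_mappers)
    ranges = []
    start = min_id
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # more mappers than rows: remaining mappers get nothing
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


if __name__ == "__main__":
    # Each tuple is the key range one map task would read and write to HDFS.
    for low, high in split_ranges(1, 100, 4):
        print(f"map task reads ids {low}..{high}")
```

Even slices only balance the work when the key values are spread evenly; a skewed key column leaves some map tasks with far more rows than others.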

Slide 31

Slide 31 text

Sqoop: Export
    sqoop export --connect http://localhost:8091/pools \
      --table DUMP \
      --export-dir /user/hive/profiles/recommendation \
      --username social

Slide 32

Slide 32 text

Sqoop: Export
[Diagram: Sqoop Client, Metadata, Launches, MapReduce Job with three Map tasks, each paired with HDFS]

Slide 33

Slide 33 text

Demonstration

Slide 34

Slide 34 text

Couchbase

Slide 35

Slide 35 text

Couchbase Server Core Principles
• Easy Scalability: Grow cluster without application changes, without downtime with a single click
• Consistent High Performance: Consistent sub-millisecond read and write response times with consistent high throughput
• Always On 24x365: No downtime for software upgrades, hardware maintenance, etc.
• Flexible Data Model: JSON document model with no fixed schema.

Slide 36

Slide 36 text

Couchbase Handles Real World Scale

Slide 37

Slide 37 text

Couchbase Server 2.0
[Architecture diagram; components and ports listed below]
Data Manager
• Query API (port 8092), Query Engine
• Moxi: port 11211 (Memcapable 1.0)
• Memcached with Couchbase EP Engine: port 11210 (Memcapable 2.0)
• Storage interface, New Persistence Layer
Cluster Manager (Erlang/OTP)
• On each node: Heartbeat, Process monitor, Configuration manager
• One per cluster: Global singleton supervisor, Rebalance orchestrator, Node health monitor, vBucket state and replication manager
• REST management API/Web UI: HTTP, port 8091
• Erlang port mapper: port 4369
• Distributed Erlang: ports 21100 - 21199

Slide 38

Slide 38 text

Couchbase Server 2.0
[Same architecture diagram as the previous slide, shown without the Data Manager and Cluster Manager group labels]

Slide 39

Slide 39 text

The Classic Order Entry Structure
Relational databases were not designed with clusters in mind, which is why people have cast around for an alternative. Storing aggregates as fundamental units makes a lot of sense for running on a cluster.
Source: http://martinfowler.com/bliki/AggregateOrientedDatabase.html

Slide 40

Slide 40 text

Aggregate by Comparison
o::1001
{
  uid: "ji22jd",
  customer: "Ann",
  line_items: [
    { sku: 0321293533, quan: 3, unit_price: 48.0 },
    { sku: 0321601912, quan: 1, unit_price: 39.0 },
    { sku: 0131495054, quan: 1, unit_price: 51.0 }
  ],
  payment: { type: "Amex", expiry: "04/2001", last5: 12345 }
}
• Easy to distribute data
• Makes sense to application programmers
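Because the whole order lives in one document, an application can answer questions about it without joins. A minimal Python sketch of that point follows, using the order from the slide. Two assumptions are made for illustration: the keys are quoted and the SKUs are stored as strings so the document is valid JSON, and the helper `order_total` is hypothetical rather than part of any Couchbase API.

```python
import json

# The aggregate from the slide, with quoted keys and string SKUs (assumption)
# so that it round-trips as valid JSON.
order = {
    "uid": "ji22jd",
    "customer": "Ann",
    "line_items": [
        {"sku": "0321293533", "quan": 3, "unit_price": 48.0},
        {"sku": "0321601912", "quan": 1, "unit_price": 39.0},
        {"sku": "0131495054", "quan": 1, "unit_price": 51.0},
    ],
    "payment": {"type": "Amex", "expiry": "04/2001", "last5": 12345},
}


def order_total(doc):
    """Sum quantity * unit price over the line items embedded in the order."""
    return sum(item["quan"] * item["unit_price"] for item in doc["line_items"])


if __name__ == "__main__":
    stored = json.dumps(order)    # what would be written under key "o::1001"
    loaded = json.loads(stored)   # what a single key lookup would return
    print(order_total(loaded))
```

One key lookup returns everything needed to compute the total, which is what makes the aggregate easy to place on a single node of a cluster.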

Slide 41

Slide 41 text

Basic Operations
• Docs distributed evenly across servers
• Each server stores both active and replica docs
  - Only one server active at a time
• Client library provides app with simple interface to database
• Cluster map provides map to which server doc is on
  - App never needs to know
• App reads, writes, updates docs
• Multiple app servers can access same document at same time
User Configured Replica Count = 1
[Diagram: App Server 1 and App Server 2, each with a Couchbase Client Library and Cluster Map, send READ/WRITE/UPDATE to a Couchbase Server Cluster of Servers 1 to 3, each holding ACTIVE and REPLICA docs]
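The cluster map idea can be sketched in a few lines of Python: hash the document key to a partition (a vBucket), then look the partition up in a table that names the active and replica server. This is a simplified illustration under stated assumptions, not the client library's code: the real hash differs in detail, the real map is supplied by the cluster rather than computed by the client, and the 1024 partition count, server names and function names here are all assumptions.

```python
import zlib

NUM_VBUCKETS = 1024  # assumed partition count for this sketch


def vbucket_for_key(key, num_vbuckets=NUM_VBUCKETS):
    """Hash a document key to a vBucket number (simplified CRC32 scheme)."""
    return zlib.crc32(key.encode("utf-8")) % num_vbuckets


def build_cluster_map(servers, num_vbuckets=NUM_VBUCKETS):
    """Assign each vBucket an active server round-robin and put its single
    replica on the next server, matching a replica count of 1."""
    n = len(servers)
    return [(servers[vb % n], servers[(vb + 1) % n]) for vb in range(num_vbuckets)]


def locate(key, cluster_map):
    """Return (active_server, replica_server) for a document key."""
    return cluster_map[vbucket_for_key(key, len(cluster_map))]


if __name__ == "__main__":
    cluster_map = build_cluster_map(["server1", "server2", "server3"])
    active, replica = locate("o::1001", cluster_map)
    print(f"o::1001 is active on {active}, replicated on {replica}")
```

The application only ever calls something like `locate` indirectly through the client library, which is why it never needs to know which server holds a document.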

Slide 42

Slide 42 text

Indexing and Query
• Indexing work is distributed amongst nodes
• Large data set possible
• Parallelize the effort
• Each node has index for data stored on it
• Queries combine the results from required nodes
[Diagram: App Server 1 and App Server 2, each with a Couchbase Client Library and Cluster Map, send a Query to a Couchbase Server Cluster of Servers 1 to 3, each holding ACTIVE and REPLICA docs]
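The last two bullets describe a scatter-gather pattern: each node answers from its own local index, and the partial results are merged into one ordered answer. A minimal Python sketch of that pattern is shown here; the node names, index contents and function names are invented for illustration and do not reflect Couchbase's internal implementation.

```python
import heapq

# Hypothetical local indexes: each node holds sorted (index_key, doc_id)
# pairs for the documents it stores.
node_indexes = {
    "server1": [("Ann", "o::1001"), ("Dan", "o::1004")],
    "server2": [("Bob", "o::1002"), ("Eve", "o::1005")],
    "server3": [("Cid", "o::1003")],
}


def query_node(index, start_key, end_key):
    """Range scan over one node's local index (already sorted by key)."""
    return [(k, doc_id) for k, doc_id in index if start_key <= k <= end_key]


def query_cluster(indexes, start_key, end_key):
    """Scatter the range query to every node, then merge the sorted partial
    results into a single sorted result."""
    partials = [query_node(idx, start_key, end_key) for idx in indexes.values()]
    return list(heapq.merge(*partials))


if __name__ == "__main__":
    for key, doc_id in query_cluster(node_indexes, "Ann", "Dan"):
        print(key, doc_id)
```

Because each partial result is already sorted, the merge step is cheap, and the indexing work itself stays on the node that owns the data.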

Slide 43

Slide 43 text

Demonstration

Slide 44

Slide 44 text

Map Reduce ...
Hadoop:
• Deal with “Big Data”
• “More” is better than “Faster”
• Batch Oriented
• Usually used to “extract/transform” data
• Fully distributed
  - Map, Shuffle, Reduce
≠
Couchbase:
• Distributed
• Executed where the document is
• Deal with “indexing” data
• As fast as possible
• Use to query the data in the Database
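The three phases named on the slide (Map, Shuffle, Reduce) can be illustrated with a small single-process Python sketch that counts page views in a click log. This only shows the data flow; a real Hadoop job distributes each phase across machines, and the log format and function names here are made up for the example.

```python
from collections import defaultdict


def clicks_mapper(line):
    """Map: emit (page, 1) for each 'user page' log line."""
    _user, page = line.split()
    yield page, 1


def sum_reducer(_key, values):
    """Reduce: add up all counts emitted for one page."""
    return sum(values)


def run_job(records, mapper, reducer):
    # Map phase: apply the mapper to every input record.
    pairs = [pair for record in records for pair in mapper(record)]
    # Shuffle phase: group all values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce phase: collapse each group to a single result.
    return {key: reducer(key, values) for key, values in groups.items()}


if __name__ == "__main__":
    log = ["ann /home", "bob /home", "ann /cart"]  # made-up click log
    print(run_job(log, clicks_mapper, sum_reducer))
```

In Hadoop the shuffle moves data between machines, which suits large batch jobs; an index-oriented map/reduce instead runs next to each document and keeps its results ready for low-latency queries.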

Slide 45

Slide 45 text

Conclusion
• Big Data and Big Users working together
• Use Hadoop to store “everything”
  - Batch oriented
  - Complex data processing
    • MapReduce
• Expose a subset of the dataset to your application
  - Real time analytics
  - Low latency
  - Simple data interactions and queries

Slide 46

Slide 46 text

Q&A
We’re Hiring! couchbase.com/careers
@tgrall
[email protected]

Slide 47

Slide 47 text

Q&A