
Big Data for the Rest of us


OSBC 2014: slides from the session

Marcus Ross

May 06, 2014
Transcript

  1. Big Data for the Rest of Us: Understanding the Emerging Hadoop Ecosystem. Marcus Ross + Peter Dickten
  2. Who are we? The crazy Germans from OSBC 2012 are back ☺
     – Peter Dickten (@pe_d)
       • CEO of a development company (DCS)
       • Lots of development for market research
     – Marcus Ross (@zahlenhelfer)
       • Trainer + consultant for database systems / BI
  3. Today's journey
     • What is Big Data?
     • What is Hadoop and how does it work?
     • Ecosystem
       – Hadoop add-ons / tools
       – Services and infrastructure
     • Summary
  4. "Big  Data"?   •  2012-­‐?  are  the  years  of  

    Big  Data   •  Thanks  to  markeXng   almost  everything  is   now  a  big  data  soluXon  
  5. What's the Data in Big Data?
     • All kinds of structured information, e.g.
       – Web server log files
       – Flight data
       – Purchase data
     • Most of the time: simply structured
     • But: an insane amount of records ("big")
  6. What do you mean by "big"? Much more information than a single system can efficiently store/process.
     • "Efficiently" (time) depends on the use case (e.g. real-time analysis for fraud detection)
     • "Efficiently" (cost): a database system for petabytes of data could get extremely expensive
  7. Basic idea: use thousands of cheap computers with cheap storage instead of one/few expensive supercomputers.
     ...and yes, that sounds like Google:
     • Map/Reduce (US Patent 7,650,331)
     • GFS (closed source)
  8. Downsides of 1,000 cheap computers
     • MTBF: from ~50 years down to ~21 days => store data redundantly (e.g. in 3 different places)
     • Splitting a computation into 1,000 computations can be difficult
     • Only works if every machine only needs a small subset of the data ("data locality") to do its job
     MTBF (mean time between failures) = average lifetime, e.g. 500K hours for a hard disk. With 1,000 machines, the expected time until some disk fails drops to 500,000 h / 1,000 = 500 h ≈ 21 days.
  9. Worst problem of all: transport times

     Activity                               Time in ns
     Reading from L1 cache                         0.5
     Reading from memory                           100
     Reading 4K randomly from SSD (*)          150,000
     Round trip within same datacenter         500,000
     Send packet CA -> Netherlands -> CA   150,000,000

     Source: http://norvig.com/21-days.html#answers
     (*) assuming 1 GB/s reads
     => Data retrieval from other machines will KILL performance
  10. The  "magic"  of  Hadoop  (simplified)   •  Hadoop  will  store

     data  mulXple   Xmes  (HDFS)  and  keep  track  of   failing  machines   •  Hadoop  will  split  up  work  to  many   machines  (map/reduce)   •  Hadoop  delegates  work  to   machines  which  already  have  the   data  needed  for  the  job  
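     To make the HDFS side concrete (not from the slides): a minimal Java sketch of a client copying a file into the cluster and requesting threefold replication. The file names and replication factor are invented; the calls are the standard org.apache.hadoop.fs API.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class HdfsExample {
            public static void main(String[] args) throws Exception {
                // Picks up fs.defaultFS etc. from core-site.xml on the classpath
                Configuration conf = new Configuration();
                FileSystem fs = FileSystem.get(conf);

                // Copy a local file into the cluster; HDFS splits it into blocks
                Path src = new Path("server.log");        // hypothetical local file
                Path dst = new Path("/logs/server.log");  // hypothetical HDFS path
                fs.copyFromLocalFile(src, dst);

                // Ask for 3 copies of each block (the redundancy from slide 8)
                fs.setReplication(dst, (short) 3);
                fs.close();
            }
        }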
  11. What Hadoop is not
     • A drop-in replacement for SQL
     • A chart/report generator
     • An Excel add-in
     • A plug'n'play BI suite
     • An SAP data warehouse
  12. Example: splitting up work
     • Question: what is the most used word in "Hamlet" by Shakespeare? (Hint: it's "the", at 3.73%)
     • Hadoop distributes a different portion (e.g. page/sentence) of the text to each machine (distribute)
     • Every machine counts the words in its portion (map)
     • Hadoop combines the results of the machines and picks the highest count (reduce)
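     A sketch of this word count as actual map/reduce code in Java, using the standard org.apache.hadoop.mapreduce API (class names are ours; tokenization is deliberately simplified):

        import java.io.IOException;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;

        // Map: emit (word, 1) for every word in this machine's portion of the text
        public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                for (String token : line.toString().toLowerCase().split("\\W+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reduce: sum the per-machine counts for each word
        class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) sum += c.get();
                context.write(word, new IntWritable(sum));
            }
        }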
  13. Real-world use case: energy discovery. Chevron uses Hadoop to sort and process data from ships that troll the ocean collecting seismic data that might signify the presence of oil reserves.
  14. Real-world use case: IT security. ipTrust uses Hadoop to assign reputation scores to IP addresses, which lets other security products decide whether to accept traffic from those IP addresses or not.
  15. This looks tedious... is there an app for that? Lots of them... prepare for the ride:
     • Pig
     • Hive
     • Flume
     • Mahout
     • ZooKeeper
     • Sqoop
     • HBase
     • Ambari
  16. Apache Pig – the Data Omnivore
     • Developed by Yahoo, now an Apache project
     • The "Pig Latin" language is translated to map/reduce
     • Basic idea: "data flow". The data is transformed step by step using built-in/self-written steps (e.g. filter, group-by, join, foreach...). The output of each step can be used as input for the next step (using variables)
     • Similar to interactive shells / read-eval-print-loop tools
  17. Apache Pig – example

     log  = LOAD 'server.log' AS (user, time, query);
     grpd = GROUP log BY user;
     cntd = FOREACH grpd GENERATE group, COUNT(log);
     STORE cntd INTO 'output.txt';

     Computation only starts once the result is requested by STORE (or other commands like DUMP).
  18. Hive – select * from Hadoop
     • Developed by Facebook, now Apache
     • An SQL-like query language (HiveQL) for Hadoop
     • Can be extended with map/reduce code written in Java
     • Similar to SQL (but not intended for interactive ad-hoc queries)
  19. Hive – example

     LOAD DATA LOCAL INPATH "cities.txt" OVERWRITE INTO TABLE cite;
     SELECT * FROM cite LIMIT 10;
     INSERT OVERWRITE TABLE cite_count SELECT cited, COUNT(citing) FROM cite GROUP BY cited;
     SELECT * FROM cite_count WHERE count > 10;
  20. Apache Mahout: a collection of machine learning algorithms, (mostly) based on Hadoop, focused on
     • Collaborative filtering (recommendation)
     • Clustering (grouping of entities based on similar characteristics)
     • Classification (sorting entities into pre-existing groups)
     https://twitter.com/JulianHi/status/457668218753392642/photo/1
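     For illustration only: a minimal user-based recommender with Mahout's Taste API. The file ratings.csv and the user ID are hypothetical; each line would hold userID,itemID,rating.

        import java.io.File;
        import java.util.List;
        import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
        import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
        import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
        import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
        import org.apache.mahout.cf.taste.model.DataModel;
        import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
        import org.apache.mahout.cf.taste.recommender.RecommendedItem;
        import org.apache.mahout.cf.taste.recommender.Recommender;
        import org.apache.mahout.cf.taste.similarity.UserSimilarity;

        public class RecommenderExample {
            public static void main(String[] args) throws Exception {
                // ratings.csv: one "userID,itemID,rating" line per preference (hypothetical file)
                DataModel model = new FileDataModel(new File("ratings.csv"));
                UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
                UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
                Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

                // Top 3 recommendations for (made-up) user 42
                List<RecommendedItem> items = recommender.recommend(42L, 3);
                for (RecommendedItem item : items) {
                    System.out.println(item.getItemID() + " " + item.getValue());
                }
            }
        }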
  21. Apache Flume
     • A distributed, reliable, and highly available service for efficiently collecting, aggregating, and moving large amounts of log data (some kind of ETL)
     • It uses three parts:
       – Agent (receives data from an application/log)
       – Processor (intermediate processing)
       – Collector (writes data to permanent storage)
     • Use it as a framework for imports instead of developing an importer for each source
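     A hedged sketch of the application side: handing one log line to a Flume agent through Flume's Java RPC client. Host name and port are placeholders; the agent is assumed to expose an Avro source there.

        import java.nio.charset.Charset;
        import org.apache.flume.Event;
        import org.apache.flume.api.RpcClient;
        import org.apache.flume.api.RpcClientFactory;
        import org.apache.flume.event.EventBuilder;

        public class FlumeClientExample {
            public static void main(String[] args) throws Exception {
                // Connect to a Flume agent's source (hypothetical host/port)
                RpcClient client = RpcClientFactory.getDefaultInstance("flume-agent.local", 41414);
                try {
                    // One log line becomes one Flume event
                    Event event = EventBuilder.withBody("GET /index.html 200", Charset.forName("UTF-8"));
                    client.append(event);  // the agent forwards it towards permanent storage
                } finally {
                    client.close();
                }
            }
        }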
  22. Apache ZooKeeper
     • Provides an open source distributed
       – Configuration service
       – Synchronization service
       – Naming registry for large distributed systems
     • Its architecture supports high availability through redundant services
     • ZooKeeper is used by companies like Rackspace, Yahoo! and eBay
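     As a minimal illustration of the configuration-service idea (ensemble address, znode path, and value are all made up): publishing and reading a value with ZooKeeper's Java client.

        import org.apache.zookeeper.CreateMode;
        import org.apache.zookeeper.ZooDefs;
        import org.apache.zookeeper.ZooKeeper;

        public class ZkExample {
            public static void main(String[] args) throws Exception {
                // Connect to a ZooKeeper ensemble (hypothetical address), 3s session timeout
                ZooKeeper zk = new ZooKeeper("zk1.local:2181", 3000, event -> {});

                // Publish a piece of configuration as a znode...
                zk.create("/demo-config", "jdbc:mysql://db1/prod".getBytes(),
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

                // ...which every machine in the cluster can read (and watch for changes)
                byte[] data = zk.getData("/demo-config", false, null);
                System.out.println(new String(data));
                zk.close();
            }
        }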
  23. Apache Sqoop
     • Efficiently transfers bulk data between Hadoop and structured datastores (relational databases)
     • For example, to import a table and store it as CSV files in an HDFS directory:

     sqoop import \
       --connect <JDBC connection string> \
       --table <tablename> \
       --username <username> \
       --password <password>
  24. HBase
     • HBase is for random, realtime read/write access to Big Data
     • The project's goal is to host very large tables
     • Use it for billions of rows x millions of columns
     • Hosted on clusters of commodity hardware
     • It's a distributed, versioned, non-relational database modeled after Google's Bigtable
     • It provides Bigtable-like capabilities on top of Hadoop/HDFS
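     What "random, realtime read/write" looks like from Java, sketched with the older HTable client API that was current around 2014 (table name, column family, and row key are invented):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.client.Get;
        import org.apache.hadoop.hbase.client.HTable;
        import org.apache.hadoop.hbase.client.Put;
        import org.apache.hadoop.hbase.client.Result;
        import org.apache.hadoop.hbase.util.Bytes;

        public class HBaseExample {
            public static void main(String[] args) throws Exception {
                Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml
                HTable table = new HTable(conf, "access_log");     // hypothetical table

                // Write one cell: row key, column family:qualifier, value
                Put put = new Put(Bytes.toBytes("row-2014-05-06"));
                put.add(Bytes.toBytes("d"), Bytes.toBytes("hits"), Bytes.toBytes("42"));
                table.put(put);

                // Random read of the same row
                Result result = table.get(new Get(Bytes.toBytes("row-2014-05-06")));
                System.out.println(Bytes.toString(
                        result.getValue(Bytes.toBytes("d"), Bytes.toBytes("hits"))));
                table.close();
            }
        }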
  25. Ambari
     • Aimed at making Hadoop management simpler
     • Use it for Hadoop clusters to
       – Provision
       – Manage
       – Monitor
     • Easy-to-use management web UI
     • Plus RESTful APIs
  26. Oozie
     • Oozie is a workflow scheduler system to manage Apache Hadoop jobs
     • Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions
     • Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability
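     A hedged sketch of submitting such a workflow with Oozie's Java client. The server URL, HDFS paths, and property values are placeholders, and the workflow.xml defining the action DAG is assumed to already sit in HDFS.

        import java.util.Properties;
        import org.apache.oozie.client.OozieClient;

        public class OozieExample {
            public static void main(String[] args) throws Exception {
                OozieClient oozie = new OozieClient("http://oozie.local:11000/oozie");

                // Points at an HDFS directory containing workflow.xml (the action DAG)
                Properties conf = oozie.createConfiguration();
                conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode/user/demo/wordcount-wf");
                conf.setProperty("nameNode", "hdfs://namenode:8020");  // hypothetical parameters
                conf.setProperty("jobTracker", "jobtracker:8021");

                String jobId = oozie.run(conf);  // submit and start the workflow
                System.out.println("Workflow job ID: " + jobId);
            }
        }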
  27. Distributions
     • You get an out-of-the-box system
     • Most Hadoop tools are already installed
     • A complete Linux system + enhancements
  28. Suites
     • No more coding in Java needed
     • But mostly only "one purpose" apps
     • Software packages with Big Data inside:
       – Splunk
       – Talend
       – Pentaho
       – Teradata
       – Microsoft HDInsight
  29. Not only on premise
     • Hadoop can run in the cloud
       – Amazon
       – Microsoft
     • No infrastructure needed
     • Scale your processing time to your needs
     http://www.chg-computer.de/uploads/mediapool/cloud.jpg
  30. Management Summary
     • Hadoop can help you with your data!
     • Hadoop and RDBMS will coexist
     • The ecosystem reduces development costs/time
     • Multiple flavors of Hadoop:
       – Cloud / hosted / on premise
       – Ready-to-use distributions