
What MinuteSort-record-crushing and enterprise-grade Hadoop have in common

Talk at Google Cloud Platform Developer Tour, Berlin

http://cloud-platform-tour.appspot.com/locations-berlin.html

Michael Hausenblas

October 01, 2013

Transcript

  1. What MinuteSort-record-crushing and enterprise-grade Hadoop have in common
     2013-10-01, Google Cloud Platform Developer Tour, Berlin
     Michael Hausenblas, Chief Data Engineer EMEA, MapR Technologies
  2. Company Profile
     Offices: MapR HQ (San Jose, US), MapR UK, MapR SE & Benelux, MapR DACH, MapR Nordics, MapR Japan, MapR Hyderabad
     § Founded in 2009
     § Came out of stealth in 2011
     § Deep management bench with extensive analytic, storage, virtualization and open source experience
       – Google, EMC, Microsoft, Informatica, Cisco, VMware, NetApp, IBM, Apache Foundation, Aster Data, Brio
     § Worldwide presence
       – Engineering and support in California and Hyderabad
       – Sales and field engineering in US, UK, France, Germany, Sweden, Singapore, Japan, Korea, Australia
     § 1000s of deployments, including 10+ of the Fortune 100 companies in production
  3. One Platform for Big Data
     [Diagram] Platform qualities: 99.999% HA, data protection, disaster recovery, scalability & performance, enterprise integration, multi-tenancy
     Workloads: MapReduce, file-based applications, SQL, database, search, stream processing
     Batch use cases: log file analysis, data warehouse offload, fraud detection, clickstream analytics
     Real-time use cases: sensor analysis, "Twitter scraping", telematics, process optimization
     Interactive use cases: forensic analysis, analytic modeling, BI user focus
  4. MapR Distribution for Apache Hadoop
     § Complete Hadoop distribution
       – 12+ Apache projects
     § Open source Hadoop + additional innovation
       – Enterprise-grade
       – Industry-standard interfaces
       – Comprehensive management suite
       – Higher performance
  5. Dependable: Lights-Out Data Center Ready
     Reliable Compute:
     § Automated stateful failover
     § Automated re-replication
     § Self-healing from HW and SW failures
     § Load balancing
     § Rolling upgrades
     § No lost jobs or data
     § 99.999% uptime
     Dependable Storage:
     § Business continuity with snapshots and mirrors
     § Recover to a point in time
     § End-to-end checksumming
     § Strong consistency
     § Data safe
     § Mirror across sites to meet Recovery Time Objectives
  6. MapR is POSIX Compliant
     § MapR is POSIX compliant
       – Random reads/writes
       – Simultaneous reading and writing to a file
       – Compression is automatic and transparent
     § Industry-standard NFS interface (in addition to the HDFS API)
       – Stream data into the cluster
       – Leverage thousands of tools and applications
       – Easier to use non-Java programming languages
       – No need for most proprietary Hadoop connectors
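In practice, POSIX compliance plus the NFS mount means ordinary file APIs work against the cluster with no Hadoop client at all. A minimal sketch in Java, assuming the cluster is NFS-mounted under the conventional /mapr/<cluster-name> prefix (the cluster name and file path below are hypothetical):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Plain java.io against an NFS-mounted MapR volume: no HDFS client,
// no proprietary connector, just POSIX file semantics.
public class NfsRandomWrite {
    public static void main(String[] args) throws IOException {
        // Hypothetical mount point; adjust to your cluster's NFS mount.
        String path = "/mapr/my.cluster.com/user/demo/events.log";
        try (RandomAccessFile file = new RandomAccessFile(path, "rw")) {
            file.seek(file.length());     // random write: append at the end
            file.writeBytes("new event\n");
            file.seek(0);                 // random read: jump back to the start
            System.out.println(file.readLine());
        }
    }
}
```

The same file can be read and written simultaneously by other processes, which is what makes logging directly into the cluster (next slide) possible.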
  7. Direct Access NFS™
     [Diagram: file browsers access the cluster directly ("drag & drop"); applications do random reads and random writes and log directly into the cluster; standard Linux commands and tools such as grep, sed, sort and tar operate on cluster data.]
  8. MapR Control System
     § Most comprehensive management suite for Hadoop
       – Health monitoring
       – Cluster administration
       – Application resource provisioning
       – Job monitoring and management
       – Job and data placement control
       – Security
     § Multiple interfaces: GUI, REST API, CLI
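The REST interface makes those management operations scriptable. A minimal sketch, assuming a node-listing endpoint of the kind MapR's REST API exposes; the hostname, port, credentials and a trusted TLS certificate are all placeholders or assumptions here:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import javax.xml.bind.DatatypeConverter;

// List cluster nodes and their health via the MCS REST API.
public class McsNodeList {
    public static void main(String[] args) throws Exception {
        // Placeholder host and credentials; port 8443 is assumed.
        URL url = new URL("https://mcs-host:8443/rest/node/list");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        String auth = DatatypeConverter.printBase64Binary(
                "admin:secret".getBytes("UTF-8"));
        conn.setRequestProperty("Authorization", "Basic " + auth);
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // JSON document describing each node
            }
        }
    }
}
```

The CLI exposes the same calls, so anything visible in the GUI can also be automated from scripts.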
  9. High Availability Everywhere
     • No-NameNode architecture: distributed metadata can self-heal; no practical limit on # of files
     • JobTracker HA: jobs are not impacted by failures; meet your data processing SLAs
     • NFS HA: high throughput and resilience for NFS-based data ingestion, import/export and multi-client access
     • Instant recovery: files and tables are accessible within seconds of a node failure or cluster restart
     • Rolling upgrades: upgrade the software with no downtime
     • HA is built in: no special configuration to enable HA; all MapR customers operate with HA
  10. No-NameNode Architecture
     Other distributions (HDFS federation):
     § Single point of failure
     § Limited to 50M files per NameNode
     § Performance bottleneck
     § Metadata must fit in memory
     MapR:
     § HA with automatic failover and re-replication
     § Up to 1T files (> 5000x advantage)
     § Higher performance
     § Metadata is persisted to disk
     [Diagram: a NameNode fronting DataNodes versus MapR, where metadata (chunks A–F) is distributed and replicated across all data nodes.]
  11. Data Protection: Replication and Snapshots
     Replication:
     • Protects from hardware failures
     • File chunks, table regions and metadata are automatically replicated (3x by default)
     • At least one replica on a different rack
     Snapshots:
     • Protect from user and application errors
     • Point-in-time recovery
     • No data duplication
     • No performance or scale impact
     • Read files and tables directly from the snapshot
     [Diagram: chunks C1–C7 replicated across nodes; an active volume and its snapshot.]
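Because a snapshot is readable in place, restoring a file is an ordinary copy rather than a recovery procedure. A minimal sketch, assuming snapshots appear under the volume's .snapshot directory on the NFS mount (the cluster, volume and snapshot names below are hypothetical):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Restore one file by reading it straight out of a point-in-time snapshot.
public class SnapshotRestore {
    public static void main(String[] args) throws IOException {
        // Hypothetical volume mount and snapshot name.
        Path inSnapshot = Paths.get(
            "/mapr/my.cluster.com/projects/.snapshot/nightly-2013-10-01/report.csv");
        Path live = Paths.get("/mapr/my.cluster.com/projects/report.csv");
        // The snapshot is just a read-only directory tree; no special tooling.
        Files.copy(inSnapshot, live, StandardCopyOption.REPLACE_EXISTING);
    }
}
```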
  12. Disaster Recovery: Mirroring
     § Efficient
       – Block-level (8KB) deltas
       – Automatic compression
       – No performance impact
     § Safe
       – Point-in-time consistency
       – End-to-end checksums
     § Easy
       – Graceful handling of network issues
       – Access the mirror volume directly (not a cold standby)
       – Schedules at the volume level
     [Diagram: a production cluster mirrored over the WAN into a GCE region.]
  13. Fast: Optimized ROI, Better Performance
     Why is MapR more efficient?
     – No redundant layers
     – C/C++ (higher performance, no garbage collection freezes)
     – Distributed metadata
     – Native compression
     – Optimized shuffle
     – Advanced cache manager
     – Port scaling (multi-NIC support) and high-speed RPC
  14. Big Data Platform for Hadoop Workloads
     Workloads: file-based applications, batch processing, OLTP, interactive query (SQL), stream processing, search
     Use cases: supply chain management, logistics, social media 360, log file analysis, fraud detection, ETL off-load, customer insights, forensics, drug discovery
     Processing: MapReduce, Apache Hive, Apache Pig, Cascading, Apache HBase, Apache Drill, Storm, Solr, ElasticSearch; machine learning with Apache Mahout and Skytree
     Storage: MapR Distributed File System (structured, semi-structured and unstructured data; POSIX compliant), Direct Access NFS™
     Management: MCS for configuration and monitoring; HA, DR, multi-tenancy, security (PAM/Kerberos)
     Example node: 64GB RAM, 12 cores, 10GbE, 12x3TB SATA HDD; on-premise and/or cloud
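Of those processing engines, MapReduce is the one behind the sort records at the end of this deck. For orientation, here is the canonical Hadoop word-count job against the standard MapReduce API; it runs unchanged on MapR, and the input and output paths (placeholders passed as arguments below) can equally be plain /mapr NFS paths:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// The canonical Hadoop word-count job.
public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // emit (word, 1) per token
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();            // total count for this word
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Submitted as usual with `hadoop jar wordcount.jar WordCount <input> <output>`.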
  15. Benefits of MapR on Google Compute Engine
     § Elastic resource allocation
     § Launch your first cluster in minutes
     § Only pay for what you use
     § No upfront expenses or long-term commitments
     § Launch parallel clusters for simultaneous access by different users
     § If your needs change ... no problem! It's easy to change cluster size, node types, etc.
     § No need to worry about launching and managing Hadoop clusters
  16. TeraSort & MinuteSort World Record
     2013 MinuteSort record: sorting 15 billion 100-byte records totaling 1.5 TB in 59 seconds
     http://www.mapr.com/blog/record-setting-hadoop-in-the-cloud
     http://www.mapr.com/blog/hadoop-minutesort-record
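A quick back-of-the-envelope check on that record: 15 × 10⁹ records × 100 bytes = 1.5 × 10¹² bytes = 1.5 TB, so finishing in 59 seconds works out to roughly 1.5 TB / 59 s ≈ 25 GB/s of aggregate sort throughput across the cluster.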