
What MinuteSort-record-crushing and enterprise-grade Hadoop have in common

Talk at Google Cloud Platform Developer Tour, Berlin

http://cloud-platform-tour.appspot.com/locations-berlin.html

Michael Hausenblas

October 01, 2013

Transcript

  1. What MinuteSort-record-crushing and enterprise-grade Hadoop have in common
     2013-10-01, Google Cloud Platform Developer Tour, Berlin
     Michael Hausenblas, Chief Data Engineer EMEA, MapR Technologies
  2. Company Profile
     Offices: MapR HQ (San Jose, US), MapR UK, MapR SE & Benelux, MapR DACH, MapR Nordics, MapR Japan, MapR Hyderabad
     § Founded in 2009
     § Came out of stealth in 2011
     § Deep management bench with extensive analytic, storage, virtualization and open source experience
       – Google, EMC, Microsoft, Informatica, Cisco, VMware, NetApp, IBM, Apache Foundation, Aster Data, Brio
     § Worldwide presence
       – Engineering and support in California and Hyderabad
       – Sales and field engineering in US, UK, France, Germany, Sweden, Singapore, Japan, Korea, Australia
     § 1000s of deployments, including 10+ of the Fortune 100 companies in production
  3. One Platform for Big Data
     [Diagram] Platform qualities: 99.999% HA, data protection, disaster recovery, scalability & performance, enterprise integration, multi-tenancy
     Workloads: MapReduce, file-based applications, SQL, database, search, stream processing
     Batch use cases: log file analysis, data warehouse offload, fraud detection, clickstream analytics
     Real-time use cases: sensor analysis, "Twitter scraping", telematics, process optimization
     Interactive use cases: forensic analysis, analytic modeling, BI user focus
  4. MapR Distribution for Apache Hadoop
     § Complete Hadoop distribution
       – 12+ Apache projects
     § Open source Hadoop + additional innovation
       – Enterprise-grade
       – Industry-standard interfaces
       – Comprehensive management suite
       – Higher performance
  5. Dependable: Lights-Out Data Center Ready
     Reliable Compute:
     § Automated stateful failover
     § Automated re-replication
     § Self-healing from HW and SW failures
     § Load balancing
     § Rolling upgrades
     § No lost jobs or data
     § 99.999% uptime
     Dependable Storage:
     § Business continuity with snapshots and mirrors
     § Recover to a point in time
     § End-to-end checksumming
     § Strong consistency
     § Data safe
     § Mirror across sites to meet Recovery Time Objectives
  6. MapR is POSIX Compliant
     § MapR is POSIX compliant
       – Random reads/writes
       – Simultaneous reading and writing to a file
       – Compression is automatic and transparent
     § Industry-standard NFS interface (in addition to the HDFS API)
       – Stream data into the cluster
       – Leverage thousands of tools and applications
       – Easier to use non-Java programming languages
       – No need for most proprietary Hadoop connectors
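In practice, POSIX compliance plus the NFS mount means ordinary file APIs work against the cluster with no Hadoop client at all. A minimal sketch in Java, assuming the cluster is NFS-mounted under the conventional /mapr/<cluster-name> prefix (the cluster name and file path below are hypothetical):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Plain java.io against an NFS-mounted MapR volume: no HDFS client,
// no proprietary connector, just POSIX file semantics.
public class NfsRandomWrite {
    public static void main(String[] args) throws IOException {
        // Hypothetical mount point; adjust to your cluster's NFS mount.
        String path = "/mapr/my.cluster.com/user/demo/events.log";
        try (RandomAccessFile file = new RandomAccessFile(path, "rw")) {
            file.seek(file.length());     // random write: append at the end
            file.writeBytes("new event\n");
            file.seek(0);                 // random read: jump back to the start
            System.out.println(file.readLine());
        }
    }
}
```

The same file can be read and written simultaneously by other processes, which is what makes logging directly into the cluster (next slide) possible.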
  7. Direct Access NFS™
     [Diagram: file browsers access the cluster directly ("drag & drop"); applications do random reads and random writes and log directly into the cluster; standard Linux commands and tools such as grep, sed, sort and tar operate on cluster data.]
  8. MapR Control System
     § Most comprehensive management suite for Hadoop
       – Health monitoring
       – Cluster administration
       – Application resource provisioning
       – Job monitoring and management
       – Job and data placement control
       – Security
     § Multiple interfaces: GUI, REST API, CLI
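The REST interface makes those management operations scriptable. A minimal sketch, assuming a node-listing endpoint of the kind MapR's REST API exposes; the hostname, port, credentials and a trusted TLS certificate are all placeholders or assumptions here:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import javax.xml.bind.DatatypeConverter;

// List cluster nodes and their health via the MCS REST API.
public class McsNodeList {
    public static void main(String[] args) throws Exception {
        // Placeholder host and credentials; port 8443 is assumed.
        URL url = new URL("https://mcs-host:8443/rest/node/list");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        String auth = DatatypeConverter.printBase64Binary(
                "admin:secret".getBytes("UTF-8"));
        conn.setRequestProperty("Authorization", "Basic " + auth);
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // JSON document describing each node
            }
        }
    }
}
```

The CLI exposes the same calls, so anything visible in the GUI can also be automated from scripts.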
  9. High Availability Everywhere
     • No-NameNode architecture: distributed metadata can self-heal; no practical limit on # of files
     • JobTracker HA: jobs are not impacted by failures; meet your data processing SLAs
     • NFS HA: high throughput and resilience for NFS-based data ingestion, import/export and multi-client access
     • Instant recovery: files and tables are accessible within seconds of a node failure or cluster restart
     • Rolling upgrades: upgrade the software with no downtime
     • HA is built in: no special configuration to enable HA; all MapR customers operate with HA
  10. No-NameNode Architecture
     Other distributions (HDFS federation):
     § Single point of failure
     § Limited to 50M files per NameNode
     § Performance bottleneck
     § Metadata must fit in memory
     MapR:
     § HA with automatic failover and re-replication
     § Up to 1T files (> 5000x advantage)
     § Higher performance
     § Metadata is persisted to disk
     [Diagram: a NameNode fronting DataNodes versus MapR, where metadata (chunks A–F) is distributed and replicated across all data nodes.]
  11. Data Protection: Replication and Snapshots
     Replication:
     • Protects from hardware failures
     • File chunks, table regions and metadata are automatically replicated (3x by default)
     • At least one replica on a different rack
     Snapshots:
     • Protect from user and application errors
     • Point-in-time recovery
     • No data duplication
     • No performance or scale impact
     • Read files and tables directly from the snapshot
     [Diagram: chunks C1–C7 replicated across nodes; an active volume and its snapshot.]
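Because a snapshot is readable in place, restoring a file is an ordinary copy rather than a recovery procedure. A minimal sketch, assuming snapshots appear under the volume's .snapshot directory on the NFS mount (the cluster, volume and snapshot names below are hypothetical):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Restore one file by reading it straight out of a point-in-time snapshot.
public class SnapshotRestore {
    public static void main(String[] args) throws IOException {
        // Hypothetical volume mount and snapshot name.
        Path inSnapshot = Paths.get(
            "/mapr/my.cluster.com/projects/.snapshot/nightly-2013-10-01/report.csv");
        Path live = Paths.get("/mapr/my.cluster.com/projects/report.csv");
        // The snapshot is just a read-only directory tree; no special tooling.
        Files.copy(inSnapshot, live, StandardCopyOption.REPLACE_EXISTING);
    }
}
```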
  12. Disaster Recovery: Mirroring
     § Efficient
       – Block-level (8KB) deltas
       – Automatic compression
       – No performance impact
     § Safe
       – Point-in-time consistency
       – End-to-end checksums
     § Easy
       – Graceful handling of network issues
       – Access the mirror volume directly (not a cold standby)
       – Schedules at the volume level
     [Diagram: a production cluster mirrored over the WAN into a GCE region.]
  13. Fast: Optimized ROI, Better Performance
     Why is MapR more efficient?
     – No redundant layers
     – C/C++ (higher performance, no garbage collection freezes)
     – Distributed metadata
     – Native compression
     – Optimized shuffle
     – Advanced cache manager
     – Port scaling (multi-NIC support) and high-speed RPC
  14. Big Data Platform for Hadoop Workloads
     Workloads: file-based applications, batch processing, OLTP, interactive query (SQL), stream processing, search
     Use cases: supply chain management, logistics, social media 360, log file analysis, fraud detection, ETL off-load, customer insights, forensics, drug discovery
     Processing: MapReduce, Apache Hive, Apache Pig, Cascading, Apache HBase, Apache Drill, Storm, Solr, ElasticSearch; machine learning with Apache Mahout and Skytree
     Storage: MapR Distributed File System (structured, semi-structured and unstructured data; POSIX compliant), Direct Access NFS™
     Management: MCS for configuration and monitoring; HA, DR, multi-tenancy, security (PAM/Kerberos)
     Example node: 64GB RAM, 12 cores, 10GbE, 12x3TB SATA HDD; on-premise and/or cloud
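Of those processing engines, MapReduce is the one behind the sort records at the end of this deck. For orientation, here is the canonical Hadoop word-count job against the standard MapReduce API; it runs unchanged on MapR, and the input and output paths (placeholders passed as arguments below) can equally be plain /mapr NFS paths:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// The canonical Hadoop word-count job.
public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // emit (word, 1) per token
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();            // total count for this word
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Submitted as usual with `hadoop jar wordcount.jar WordCount <input> <output>`.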
  15. Benefits of MapR on Google Compute Engine
     § Elastic resource allocation
     § Launch your first cluster in minutes
     § Only pay for what you use
     § No upfront expenses or long-term commitments
     § Launch parallel clusters for simultaneous access by different users
     § If your needs change ... no problem! It's easy to change cluster size, node types, etc.
     § No need to worry about launching and managing Hadoop clusters
  16. TeraSort & MinuteSort World Record
     2013 MinuteSort record: sorting 15 billion 100-byte records totaling 1.5 TB in 59 seconds
     http://www.mapr.com/blog/record-setting-hadoop-in-the-cloud
     http://www.mapr.com/blog/hadoop-minutesort-record
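A quick back-of-the-envelope check on that record: 15 × 10⁹ records × 100 bytes = 1.5 × 10¹² bytes = 1.5 TB, so finishing in 59 seconds works out to roughly 1.5 TB / 59 s ≈ 25 GB/s of aggregate sort throughput across the cluster.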