Big Data, Open Source and Google Cloud Platform

Dennis Huo, Senior Software Engineer

Big Data, Open Source, and Google Cloud Platform
• Ubiquitous Data
• Wide Assortment of Tools
• Machinery to put it all together
• Innovation

Hadoop Ecosystem on Google Cloud Platform
Themes: Making the most of high-level Google services; Making the most of Google infrastructure options
Diagram labels: Hadoop Applications (Hadoop, Hive, Pig, HBase), BigQuery, Cloud Storage, Compute Engine
• Scale Compute/Storage Independently
• Pay for granular time increments
• Buy only the resources that you need
• Seamless interoperability
• Low barriers to entry
• Share data amongst diverse toolset
Hadoop, Pig, HBase, and Hive are trademarks of the Apache Software Foundation.
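The effect of "pay for granular time increments" can be illustrated with a little arithmetic. The following is a minimal sketch: the price and runtime are hypothetical values chosen for illustration and are not taken from the talk, and `billed_cost` is a helper name invented here.

```python
import math


def billed_cost(runtime_minutes, price_per_hour, increment_minutes):
    """Cost of one VM when usage is rounded up to the billing increment."""
    increments = math.ceil(runtime_minutes / increment_minutes)
    return increments * increment_minutes * price_per_hour / 60.0


# Hypothetical numbers, for illustration only (not real price quotes).
PRICE_PER_HOUR = 0.10
RUNTIME_MINUTES = 70  # a job that slightly overruns one hour

hourly = billed_cost(RUNTIME_MINUTES, PRICE_PER_HOUR, increment_minutes=60)
per_minute = billed_cost(RUNTIME_MINUTES, PRICE_PER_HOUR, increment_minutes=1)

print(f"hourly increments:     ${hourly:.4f}")
print(f"per-minute increments: ${per_minute:.4f}")
```

Under these assumed numbers the job is charged for two full hours with hourly increments but only for 70 minutes with per-minute increments, which is why short-lived, on-demand clusters benefit most from fine-grained billing.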

Hadoop and the Big Data Ecosystem
A Framework
• Fault tolerance
• Distributed file system/storage
• Resource management
• Task coordination and execution
Open Standardization
• De-facto standards around key primitives
• Big Data beyond Hadoop/MapReduce

Hadoop on Google Cloud Platform
[Diagram: on Google Compute Engine, a Master Node (Name Node optional) and Work Nodes (HDFS optional) reach Google Cloud Storage through the Google Cloud Storage connector for Hadoop.]

Google Cloud Storage Connector
• hadoop fs (Hadoop built-in CLI)
• gsutil (from gcloud SDK)
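Because both tools address the same `gs://` objects, a listing can be issued from either side. The sketch below only builds and prints the two command lines; it does not run them. The bucket and prefix are hypothetical, and `listing_commands` is a helper name made up for this example.

```python
import shlex


def listing_commands(bucket, prefix=""):
    """Build equivalent 'list objects' commands for both CLIs."""
    uri = f"gs://{bucket}/{prefix}"
    return {
        # Hadoop's built-in file system shell, going through the GCS connector.
        "hadoop": ["hadoop", "fs", "-ls", uri],
        # gsutil, talking to Google Cloud Storage directly.
        "gsutil": ["gsutil", "ls", uri],
    }


# Hypothetical bucket and prefix, for illustration only.
commands = listing_commands("example-bucket", "logs/")
for tool, argv in commands.items():
    print(f"{tool}: {shlex.join(argv)}")
```

To actually execute one of them, the argument list could be handed to `subprocess.run`, assuming the corresponding tool is installed and authenticated on the machine.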

Google Cloud Storage Connector Overview
Cloud interoperability
• “Cloud native” semantics
• Consistent access from everywhere
• Interoperability with other Google services
Hadoop interoperability
• Hadoop Distributed FileSystem (HDFS) as de-facto standard
• Interface carried into next-generation technologies
• Pig, Hive, Spark...
Ease of Operations
• Data is highly durable and highly available
• Data is not tied to single processing stack
• GCE VMs serve application logic, not data blocks
Performance
• Scalability trumps locality
• “Price performance” vs “raw performance”
Diagram labels: GCS, HDFS, GUI, CLI, API
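Hadoop interoperability here means that the connector plugs into Hadoop's file system interface, so `gs://` paths can be used wherever `hdfs://` paths are. As a rough sketch, the snippet below generates the kind of `core-site.xml` fragment that registers the connector. The two implementation class names are the ones the connector is commonly configured with; the project ID is a hypothetical placeholder, `core_site_fragment` is a helper invented for this example, and a real deployment (for instance one made with bdutil) would normally write this configuration for you.

```python
import xml.etree.ElementTree as ET


def core_site_fragment(properties):
    """Render name/value pairs as a Hadoop <configuration> element."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return root


GCS_PROPERTIES = {
    # Maps the gs:// scheme to the connector's FileSystem implementation.
    "fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
    "fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
    # Hypothetical project ID, for illustration only.
    "fs.gs.project.id": "example-project",
}

config = core_site_fragment(GCS_PROPERTIES)
print(ET.tostring(config, encoding="unicode"))
```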

Accenture Study: Price Performance of Cloud
• Total Cost of Ownership analysis
• Document Clustering on 300M web pages
• ~31,000 files, 3TB of input data
• Google Cloud Storage for initial input, final output
• HDFS for intermediate data
• More case studies available at www.accenture.com
[Chart: Document clustering execution times]
Credit: Accenture http://www.accenture.com/SiteCollectionDocuments/PDF/Accenture-Cloud-Based-Hadoop-Deployments-Benefits-and-Considerations.pdf

BigQuery and Datastore Connectors
Mix and match storage and computation from OSS and Google Cloud Platform
[Diagram: Hadoop connects to Google Cloud Storage, BigQuery, and Datastore through the Cloud Storage Connector, BigQuery Connector, and Datastore Connector.]

Deployment, Configuration, and Toolkits
bdutil - Thin wrapper around gcloud command-line tools
• Deploy Hadoop cluster on-demand
• Installs and configures GCS connector
• Extensible - add scripts to run during deployment
[Diagram: bdutil calls the Compute Engine API to create the Hadoop Master and Hadoop Workers.]
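An on-demand deployment with bdutil comes down to a single command line. The sketch below assembles one without running it. The `-b` (configuration bucket), `-n` (number of workers), and `-P` (instance name prefix) flags are written from memory of bdutil's usage text and should be checked against `./bdutil --help`; the bucket name, cluster prefix, and the `bdutil_command` helper are hypothetical.

```python
import shlex


def bdutil_command(action, bucket, num_workers, prefix):
    """Assemble a bdutil invocation (flags are assumptions; verify with --help)."""
    return [
        "./bdutil",
        "-b", bucket,            # GCS bucket used for cluster configuration
        "-n", str(num_workers),  # number of Hadoop worker VMs
        "-P", prefix,            # prefix for the created instance names
        action,                  # e.g. "deploy" or "delete"
    ]


# Hypothetical values, for illustration only.
deploy = bdutil_command("deploy", "example-config-bucket", 4, "demo-cluster")
delete = bdutil_command("delete", "example-config-bucket", 4, "demo-cluster")

print(shlex.join(deploy))
print(shlex.join(delete))
```

Pairing every `deploy` with a matching `delete` is what makes the "cluster on-demand" model work together with fine-grained billing: the cluster exists only while the job runs, and the data stays in Cloud Storage.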

Demo: Shark on Google Cloud Platform
Culmination of several interoperable pieces of Open Source software
• Shark: Hive on Spark
• Hive: SQL on Hadoop
• Spark: Next-Generation Data Processing on Hadoop data sources
Pluggable into Google Cloud Platform
• Inherits GCS Connector support
• “External Tables” for seamless data portability
• Hive metadata in Google CloudSQL for lifetime beyond cluster
• Multi-tenancy via multiple personal Hive clusters
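An external table keeps the data at a location outside the warehouse directory, so with the GCS connector in place that location can be a `gs://` path that outlives any single cluster. The sketch below renders a HiveQL statement of that shape. The table name, columns, and bucket path are hypothetical and are not the ones used in the demo; `external_table_ddl` is a helper invented for this example.

```python
def external_table_ddl(table, columns, location, delimiter="\\t"):
    """Render a HiveQL CREATE EXTERNAL TABLE statement for delimited text files."""
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{location}';"
    )


# Hypothetical schema and location, for illustration only.
ddl = external_table_ddl(
    table="pageviews",
    columns=[("url", "STRING"), ("views", "BIGINT")],
    location="gs://example-bucket/pageviews/",
)
print(ddl)
```

Dropping an external table removes only its metadata, not the files under `LOCATION`, which is what lets several short-lived Hive or Shark clusters share the same data.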

Ashish Thusoo, CEO, Qubole

Google Cloud: Superior Platform
For customers and for Qubole
Cheaper
• The equivalent of our most popular node type is 11% cheaper than Cloud X.
• Qubole’s auto termination + pay by the minute
More performant and stable
• Object store showed significantly lower variance
• TPC-H was 20% faster out of the box
More Hadoop-friendly storage
• Fewer eventual consistency issues
• Significantly faster move operations
• Less complexity to handle
Much quicker VM launch times
• Machines become usable 2-4x faster, which makes auto-scaling easier

Deeply integrated in the platform
Qubole on GCE - GA on April 7th.
Qubole integrates with GCS
• Replace HDFS - Use GCS as the long-term store, with ephemeral Hadoop clusters
Qubole integrates with BigQuery
• Export to BQ - Allow processed data to be served up from a low-latency query engine
• Extract from BQ - For joins during ETL processing

cloud.google.com

