Big Data, Open Source and Google Cloud Platform

from Google Cloud Platform Live 2014
YouTube Video: https://www.youtube.com/watch?v=WARB0e2Wj_E

Kazunori Sato

April 24, 2014

Transcript

  1. Big Data, Open Source, and Google Cloud Platform

    • Ubiquitous Data
    • Wide Assortment of Tools
    • Machinery to put it all together
    • Innovation
  2. Hadoop Ecosystem on Google Cloud Platform

    Making the most of Google infrastructure options
    Making the most of high-level Google services
    • Scale Compute/Storage Independently
    • Pay for granular time increments
    • Buy only the resources that you need
    • Seamless interoperability
    • Low barriers to entry
    • Share data amongst diverse toolset
    Diagram labels: HBase, Hive, Hadoop, Pig, Hadoop Applications, Cloud Storage, Compute Engine, BigQuery
    Hadoop, Pig, HBase, and Hive are trademarks of the Apache Software Foundation.
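The "Pay for granular time increments" point on slide 2 can be made concrete with a little arithmetic. The sketch below is illustrative only: the node count, job runtime, hourly rate, and 10-minute minimum charge are assumed values chosen for the example, not figures from the talk, and `billed_cost` is a hypothetical helper.

```python
import math


def billed_cost(runtime_minutes, hourly_rate, increment_minutes, minimum_minutes=0):
    """Cost of one VM when usage is rounded up to a billing increment.

    All parameters are illustrative assumptions, not published prices.
    """
    # Apply the minimum charge first, then round up to whole increments.
    billable = max(runtime_minutes, minimum_minutes)
    increments = math.ceil(billable / increment_minutes)
    return increments * increment_minutes * hourly_rate / 60.0


# Hypothetical cluster: 10 nodes, a 75-minute job, $0.10 per node-hour.
nodes, runtime, rate = 10, 75, 0.10

hourly = nodes * billed_cost(runtime, rate, increment_minutes=60)
per_minute = nodes * billed_cost(runtime, rate, increment_minutes=1, minimum_minutes=10)

print(f"hourly billing:     ${hourly:.2f}")
print(f"per-minute billing: ${per_minute:.2f}")
```

With these assumed numbers, hourly rounding bills each node for 120 minutes while per-minute billing charges for the 75 minutes actually used. The gap widens for short-lived, on-demand clusters.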
  3. Hadoop and the Big Data Ecosystem

    A Framework
    • Fault tolerance
    • Distributed file system/storage
    • Resource management
    • Task coordination and execution
    Open Standardization
    • De-facto standards around key primitives
    • Big Data beyond Hadoop/MapReduce
  4. Hadoop on Google Cloud Platform

    Diagram labels: Google Compute Engine; Master Node; Name Node (optional); Work Nodes; HDFS (optional); Google Cloud Storage connector for Hadoop; Google Cloud Storage
  5. Google Cloud Storage Connector Overview

    Cloud interoperability
    • “Cloud native” semantics
    • Consistent access from everywhere
    • Interoperability with other Google services
    Hadoop interoperability
    • Hadoop Distributed FileSystem (HDFS) as de-facto standard
    • Interface carried into next-generation technologies
    • Pig, Hive, Spark...
    Ease of Operations
    • Data is highly durable and highly available
    • Data is not tied to single processing stack
    • GCE VMs serve application logic, not data blocks
    Performance
    • Scalability trumps locality
    • “Price performance” vs “raw performance”
    Diagram labels: GCS, HDFS, GUI, CLI, API
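Because the connector plugs in behind the Hadoop file system interface, enabling it is largely a matter of configuration. As a minimal sketch, the Python below generates a `core-site.xml` fragment. The property names `fs.gs.impl` and `fs.gs.project.id` and the class name are taken from the connector's public documentation as best recalled and should be checked against the connector version in use. The project ID is a placeholder and `gcs_connector_site` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def gcs_connector_site(project_id):
    """Build a core-site.xml fragment that maps gs:// paths to the connector.

    Property names are assumptions to verify against the connector docs.
    """
    props = {
        # FileSystem implementation Hadoop should use for the gs:// scheme.
        "fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
        # Cloud project that owns the buckets (placeholder value below).
        "fs.gs.project.id": project_id,
    }
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


site_xml = gcs_connector_site("example-project")
print(site_xml)
```

Once a mapping like this is in place, jobs can refer to `gs://` paths wherever they would otherwise use `hdfs://` paths, which is what lets Pig, Hive, and Spark inherit the connector without changes of their own.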
  6. Accenture Study: Price Performance of Cloud

    Credit: Accenture http://www.accenture.com/SiteCollectionDocuments/PDF/Accenture-Cloud-Based-Hadoop-Deployments-Benefits-and-Considerations.pdf
    • Total Cost of Ownership analysis
    • Document Clustering on 300M web pages
    • ~31,000 files, 3TB of input data
    • Google Cloud Storage for initial input, final output
    • HDFS for intermediate data
    • More case studies available at www.accenture.com
    Chart: Document clustering execution times
  7. BigQuery and Datastore Connectors

    Mix and match storage and computation from OSS and Google Cloud Platform
    Diagram labels: Hadoop; BigQuery Connector; Datastore Connector; Cloud Storage Connector; BigQuery; Datastore; Google Cloud Storage
  8. Deployment, Configuration, and Toolkits

    bdutil - Thin wrapper around gcloud command-line tools
    • Deploy Hadoop cluster on-demand
    • Installs and configures GCS connector
    • Extensible - add scripts to run during deployment
    Diagram labels: bdutil; Compute Engine API; Hadoop Master; Hadoop Workers
  9. Demo: Shark on Google Cloud Platform

    Culmination of several interoperable pieces of Open Source software
    • Shark: Hive on Spark
    • Hive: SQL on Hadoop
    • Spark: Next-Generation Data Processing on Hadoop data sources
    Pluggable into Google Cloud Platform
    • Inherits GCS Connector support
    • “External Tables” for seamless data portability
    • Hive metadata in Google CloudSQL for lifetime beyond cluster
    • Multi-tenancy via multiple personal Hive clusters
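The “External Tables” point on slide 9 rests on a standard Hive feature: an external table records only a schema and a location, so data stored under a `gs://` path outlives any one cluster. The sketch below builds such a statement as a string. The table name, columns, and bucket path are made up for illustration, `external_table_ddl` is a hypothetical helper, and the statement is not taken from the demo itself.

```python
def external_table_ddl(table, columns, location, delimiter="\\t"):
    """Return a HiveQL statement defining an external table over existing files.

    `columns` is a list of (name, hive_type) pairs; all values are illustrative.
    """
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} ({cols})\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )


# Hypothetical table over tab-separated files in a hypothetical bucket.
ddl = external_table_ddl(
    "page_views",
    [("url", "STRING"), ("views", "BIGINT")],
    "gs://example-bucket/page_views/",
)
print(ddl)
```

Dropping an external table removes only the metadata and leaves the files in place, so several short-lived clusters can define tables over the same objects.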
  10. Google Cloud: Superior Platform

    For customers and for Qubole
    Cheaper
    • The equivalent of our most popular node type is 11% cheaper than Cloud X.
    • Qubole’s auto termination + pay by the minute
    More performant and stable
    • Object store showed significantly lower variance
    • TPCH was 20% faster out of the box
    More Hadoop friendly storage
    • fewer eventual consistency issues
    • significantly faster move operations
    • less complexity to handle
    Much quicker VM launch times
    • 2 - 4x faster in machine becoming usable. Makes auto-scaling easier
  11. Deeply integrated in the platform

    Qubole on GCE - GA on April 7th.
    Qubole integrates with GCS
    • Replace HDFS - Use as long term store and ephemeral Hadoop clusters
    Qubole integrates with BigQuery
    • Export to BQ - Allow processed data to be served up from a low latency query engine
    • Extract from BQ - For joins during ETL processing