Google Cloud Platform & Big Data - Austin Data Meetup

Making Your Data Work
Sandeep Parikh, Solutions Architect, @crcsmnky

Stuff I Want To Talk About
• Me
• Google and Big Data
• Google Cloud Platform and Big Data
• Open Source
• Questions

Hey That’s Me!
I spend my time working on repeatable architectural patterns and guidance, in the form of papers, code, and architectures, for people interested in using Google Cloud Platform. I typically spend time talking about Big Data and Containers. Before Google, I was at MongoDB, Ravel, 21CT, Affinegy, and Apple. I’ve been in Austin for ~12 years, so I get to complain about everything. Find me on Twitter: @crcsmnky

Google and Data

Organize the world’s information and make it universally accessible and useful.

It’s not just about picking up your workload and running it on Google Cloud

Google wants to change how you work with data

For the past 15 years, Google has been building out the fastest, most powerful, highest-quality cloud infrastructure on the planet.

Google’s network is huge

Google Technologies

Open Source Technologies

Google Cloud Platform and Data

Google Cloud Platform
Management Mobile Services Compute Data Networking Storage Developer Tools

Capabilities
• Collaboration and exploration: data exploration in the cloud (Cloud Datalab)
• Analytics: fast and economical data warehouse for large-scale data analytics (Google BigQuery)
• Intelligence: mainstream cloud-based artificial intelligence and machine learning (Cloud Machine Learning, Translate API)
• Data processing: flexible, scalable and reliable data processing; streaming/batch processing, Hadoop/Spark (Cloud Dataflow, Cloud Dataproc)
• Cloud Databases: databases for all kinds of applications; relational, key-value, NoSQL (Cloud Bigtable, Cloud SQL, Cloud Datastore)
• Storage Services: proven storage platform (GCS Standard, GCS DRA, GCS Nearline)
• Messaging Services: reliable, large-scale messaging (Cloud Pub/Sub)

Data Lifecycle: Capture, Process, Store, Analyze
Mobile Phones, In-Game Analytics, Clickstream, Object Store, HDFS, Hadoop, Spark, Storm, SQL, Pig, Hive

Data Lifecycle on Google Cloud Platform: Capture, Process, Store, Analyze
App Engine, Cloud Pub/Sub, Cloud Storage, Cloud SQL, Cloud Bigtable, Cloud Datastore, Cloud Dataflow, Cloud Dataproc, BigQuery, Cloud Datalab

Cloud Pub/Sub
• Globally redundant
• Low latency (sub sec.)
• Batched read/write
• Custom labels
• Push & Pull
• Auto expiration
[Diagram: publishers (Pub A, Pub B, Pub C) send Messages 1, 2 and 3 to Topics A, B and C; subscriptions (Sub A, Sub B, Sub C1, Sub C2) deliver them to Subscribers X, Y and Z]
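The fan-out in the diagram can be sketched in plain Java. This is only an in-memory model, not the Cloud Pub/Sub client library: the class and method names are hypothetical, and it assumes (as the names Sub C1 and Sub C2 suggest) that a topic with two subscriptions hands each of them its own copy of every message.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// In-memory model of topic/subscription fan-out. Hypothetical names;
// this is not the Cloud Pub/Sub client library.
class PubSubSketch {
    // topic name -> names of the subscriptions attached to it
    private final Map<String, List<String>> subscriptionsByTopic = new HashMap<>();
    // subscription name -> messages waiting to be pulled
    private final Map<String, Queue<String>> backlog = new HashMap<>();

    void createSubscription(String topic, String subscription) {
        subscriptionsByTopic.computeIfAbsent(topic, t -> new ArrayList<>()).add(subscription);
        backlog.put(subscription, new ArrayDeque<>());
    }

    // Publishing copies the message into every subscription on the topic.
    void publish(String topic, String message) {
        List<String> subs = subscriptionsByTopic.getOrDefault(topic, Collections.<String>emptyList());
        for (String sub : subs) {
            backlog.get(sub).add(message);
        }
    }

    // Pull delivery: returns the next waiting message, or null if there is none.
    String pull(String subscription) {
        Queue<String> queue = backlog.get(subscription);
        return queue == null ? null : queue.poll();
    }

    public static void main(String[] args) {
        PubSubSketch bus = new PubSubSketch();
        bus.createSubscription("Topic C", "Sub C1");
        bus.createSubscription("Topic C", "Sub C2");
        bus.publish("Topic C", "Message 3");
        System.out.println("Sub C1 got: " + bus.pull("Sub C1"));
        System.out.println("Sub C2 got: " + bus.pull("Sub C2"));
    }
}
```

Keeping one backlog per subscription rather than per topic is what lets subscribers consume at their own pace without affecting each other.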

Cloud Dataflow
• Autoscaling mid-job
• Fully managed - No-Ops
• Intuitive Data Processing Framework
• Batch and Stream Processing in one
• Dynamic rebalancing mid-job

Cloud Dataflow

Batch:

    Pipeline p = Pipeline.create();
    p.begin()
        .apply(TextIO.Read.from("gs://…"))        // read input from Cloud Storage
        .apply(ParDo.of(new ExtractTags()))
        .apply(Count.create())
        .apply(ParDo.of(new ExpandPrefixes()))
        .apply(Top.largestPerKey(3))              // keep the top 3 per key
        .apply(TextIO.Write.to("gs://…"));        // write results to Cloud Storage
    p.run();

Batch to Streaming: the same pipeline, reading from and writing to Cloud Pub/Sub instead of Cloud Storage, with a five-minute fixed window added:

        .apply(PubsubIO.Read.from("input_topic"))
        .apply(Window.<Integer>by(FixedWindows.of(5, MINUTES)))
        .apply(PubsubIO.Write.to("output_topic"));
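As a rough illustration of what a five-minute fixed window does to a stream, the sketch below assigns event timestamps to windows and counts events per window. It is plain Java rather than the Dataflow SDK; the class, method names and sample timestamps are made up for the example.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of fixed (non-overlapping) windows.
// Hypothetical names; this does not use the Dataflow SDK.
class FixedWindowSketch {
    // Window size of 5 minutes, matching FixedWindows.of(5, MINUTES) on the slide.
    static final long WINDOW_MILLIS = 5 * 60 * 1000L;

    // Start (epoch millis) of the window that contains the given timestamp.
    static long windowStart(long timestampMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, WINDOW_MILLIS);
    }

    // Counts events per window, keyed by window start and sorted by time.
    static Map<Long, Integer> countPerWindow(long[] timestamps) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long ts : timestamps) {
            counts.merge(windowStart(ts), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample event times in millis: three fall in the first window,
        // one in the second, one in the third.
        long[] events = {0L, 60_000L, 299_999L, 300_000L, 610_000L};
        System.out.println(countPerWindow(events)); // {0=3, 300000=1, 600000=1}
    }
}
```

Because every event maps to exactly one window, an unbounded stream is cut into finite chunks that can each be aggregated the way a batch would be.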

Cloud Dataproc (managed Hadoop & Spark)
• Preemptible VMs are 70% cheaper
• Spin up clusters of any size in 90 seconds
• Separation of storage and compute
• Run clusters segregated by job or function
• Per-minute billing
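To see how per-minute billing and the 70% preemptible discount combine, here is a small arithmetic sketch. The $0.10 hourly price, the worker count and the job length are hypothetical placeholders, not Google Cloud prices; only the 70% figure comes from the slide.

```java
// Back-of-the-envelope cost comparison. All prices and sizes are hypothetical;
// only the 70% discount is taken from the slide.
class DataprocCostSketch {
    static final double PREEMPTIBLE_DISCOUNT = 0.70; // "70% cheaper"

    // Cost of running `workers` VMs for `minutes`, billed per minute.
    static double cost(int workers, int minutes, double hourlyRate) {
        return workers * minutes * (hourlyRate / 60.0);
    }

    // Hourly rate of a preemptible VM given the regular hourly rate.
    static double preemptibleRate(double hourlyRate) {
        return hourlyRate * (1.0 - PREEMPTIBLE_DISCOUNT);
    }

    public static void main(String[] args) {
        double hourlyRate = 0.10; // hypothetical price per VM-hour
        int workers = 10;         // hypothetical cluster size
        int minutes = 25;         // hypothetical job duration

        double regular = cost(workers, minutes, hourlyRate);
        double preemptible = cost(workers, minutes, preemptibleRate(hourlyRate));
        System.out.printf("regular: $%.4f, preemptible: $%.4f%n", regular, preemptible);
    }
}
```

With per-minute billing the 25-minute job is charged for 25 minutes rather than a full hour, which is what makes short-lived, per-job clusters practical.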

[Diagram: Cloud Dataproc clusters for Hive for IT Business Reporting, Spark MLlib, Hive for Analysts and MapReduce ETL, with Google Cloud Storage]

Google BigQuery
• Petabyte scale data warehouse
• Supports SQL
• Fast and scales automatically
• No setup or administration
• Stream up to 100,000 rows/sec
• Integrates with third-party software like Tableau

Cloud Datalab
• Notebooks: Launch notebooks to explore, transform and process data on Google Cloud Platform or locally.
• Built on Jupyter: Built on IPython/Jupyter, which already has a thriving ecosystem of modules and a huge knowledge base.
• Choose your language: Write code in multiple languages: Python, SQL and JavaScript.
• Fully Integrated: It leverages the power of Cloud Storage, BigQuery, Cloud Datastore and Cloud SQL for analyses.

Cloud Datalab
• Increase Productivity: Increased productivity through interactive tools and availability of third-party libraries.
• Simplicity: Explore and analyze data with ad hoc queries and visualizations.
• Collaboration: Explore, transform and process data collaboratively or publish data as reports, dashboards or APIs.
• Reach: Makes Google’s Big Data capabilities easier to use and therefore more accessible across the company.

Open Source

Open Source
• Cloud Dataflow SDK released
• Cloud Dataflow Runner for Spark and Flink (thanks, Cloudera and community!)
• Cloud Datalab built on Jupyter
• Cloud Bigtable supports HBase 1.0 API
• Cloud Dataproc: open source Hadoop and Spark
• Kubernetes: completely open source

Solutions on Google Cloud Platform

A Few Solutions
• Build backend services for mobile apps
• Cloud Bigtable Schema Design for Time Series Data
• Analyzing Financial Time Series using BigQuery
• Processing Logs at Scale using Cloud Dataflow
• Real-time data analysis with Kubernetes, Redis, and BigQuery
• Reliable Task Scheduling on Google Compute Engine
• Distributed Load Testing Using Kubernetes
• Deploying Microservices on Google App Engine
• Automated Image Builds with Jenkins, Packer, and Kubernetes
• Internal Load Balancing using HAProxy on Google Compute Engine

Processing Logs using Cloud Dataflow

Thanks!

Questions & Comments
Sandeep Parikh
[email protected]
@crcsmnky
