Slide 1

Slide 1 text

Google Cloud Data Infrastructure
Sandeep Parikh, Solutions Architect
@crcsmnky

Slide 2

Slide 2 text

Stuff I Want To Talk About
● Me
● Google and Big Data
● Google Cloud Platform and Big Data
● Open Source
● Questions

Slide 3

Slide 3 text

Google Cloud Platform
Hey, That’s Me!
● I spend my time working on repeatable architectural patterns and guidance, in the form of papers, code, and architectures, for people interested in using Google Cloud Platform.
● I typically spend time talking about Big Data and Containers.
● Before Google, I was at MongoDB, Ravel, 21CT, Affinegy, and Apple.
● I’ve been in Austin for ~12 years, so I get to complain about everything.
● Find me on Twitter: @crcsmnky

Slide 4

Slide 4 text

Google and Data

Slide 5

Slide 5 text

Organize the world’s information and make it universally accessible and useful.

Slide 6

Slide 6 text

Google Research in Data Technologies
[Timeline figure, axis 2002 to 2013: GFS, MapReduce, BigTable, Colossus, Dremel, Flume, Megastore, Spanner, Millwheel, PubSub, F1]
Google Research publications referenced are available here: http://research.google.com/pubs/papers.html
The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, 2009: http://research.google.com/pubs/pub35290.html

Slide 7

Slide 7 text

Open Source Ecosystem
[Timeline figure, axis 2002 to 2013: GFS, MapReduce, BigTable, Colossus, Dremel, Flume, Megastore, Spanner, Millwheel, PubSub, F1]

Slide 8

Slide 8 text

Cloud Platform Data Infrastructure
[Timeline figure, axis 2002 to 2013, labels in order: Cloud Storage, Dataproc, Bigtable, Cloud Storage, BigQuery, Dataflow, Datastore, Spanner, Dataflow, PubSub, F1]

Slide 9

Slide 9 text

Google Cloud Platform and Data

Slide 10

Slide 10 text

Management Mobile Services Compute Data Networking Storage Developer Tools

Slide 11

Slide 11 text

Capabilities
● Collaboration and exploration: Data exploration in the Cloud (Cloud Datalab)
● Analytics: Fast & economical data warehouse for large-scale data analytics (Google BigQuery)
● Intelligence: Mainstream Cloud based artificial intelligence and machine learning (Cloud ML, Vision API, Prediction API, Translate API)
● Data processing: Flexible, scalable and reliable data processing; streaming/batch processing, Hadoop/Spark (Cloud Dataflow, Cloud Dataproc)
● Cloud Databases: Cloud Databases for all kinds of applications; relational, key-value, NoSQL (Cloud Bigtable, Cloud SQL, Cloud Datastore)
● Storage Services: Proven storage platform (GCS Standard, GCS DRA, GCS Nearline)
● Messaging Services: Reliable, large scale messaging (Cloud Pub/Sub)

Slide 12

Slide 12 text

Data Lifecycle
Stages: Capture, Process, Store, Analyze
[Diagram labels: Mobile Phones, In-Game Analytics, Clickstream; Object Store, HDFS; Hadoop, Spark, Storm; SQL, Pig, Hive]

Slide 13

Slide 13 text

Data Lifecycle
Stages: Capture, Process, Store, Analyze
[Diagram labels: App Engine, Cloud Pub/Sub; Cloud Storage, Cloud SQL, Cloud Bigtable, Cloud Datastore; Cloud Dataflow, Cloud Dataproc; BigQuery, Cloud Datalab]

Slide 14

Slide 14 text

Cloud Pub/Sub
● Globally redundant
● Low latency (sub-second)
● Batched read/write
● Custom labels
● Push & Pull
● Auto expiration
[Diagram: publishers Pub A, Pub B, Pub C send Message 1, Message 2, Message 3 through Cloud Pub/Sub topics Topic A, Topic B, Topic C; subscriptions Sub A, Sub B, Sub C1, Sub C2 deliver them to Subscriber X, Subscriber Y, Subscriber Z; Message 3 appears more than once on the subscriber side]
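The topic and subscription layout in the diagram can be sketched with a toy in-memory model. This is only an illustration of the fan-out idea (every subscription attached to a topic receives its own copy of each message, which would explain why Message 3 shows up twice for Sub C1 and Sub C2). It is not the Cloud Pub/Sub client API: the class `PubSubSketch` and its methods are hypothetical names, and acknowledgements, push delivery, and expiration are left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical toy model of topics and pull subscriptions (not a real client library).
final class PubSubSketch {
    // Topic name -> names of the subscriptions attached to it.
    private final Map<String, List<String>> subscriptionsByTopic = new HashMap<>();
    // Subscription name -> messages waiting to be pulled.
    private final Map<String, Queue<String>> backlog = new HashMap<>();

    void createTopic(String topic) {
        subscriptionsByTopic.putIfAbsent(topic, new ArrayList<>());
    }

    void createSubscription(String topic, String subscription) {
        List<String> subs = subscriptionsByTopic.get(topic);
        if (subs == null) {
            throw new IllegalArgumentException("unknown topic: " + topic);
        }
        subs.add(subscription);
        backlog.put(subscription, new ArrayDeque<>());
    }

    // Fan-out: each subscription on the topic gets its own copy of the message.
    void publish(String topic, String message) {
        List<String> subs = subscriptionsByTopic.get(topic);
        if (subs == null) {
            throw new IllegalArgumentException("unknown topic: " + topic);
        }
        for (String subscription : subs) {
            backlog.get(subscription).add(message);
        }
    }

    // Pull delivery: returns the oldest waiting message, or null if none is waiting.
    String pull(String subscription) {
        Queue<String> queue = backlog.get(subscription);
        if (queue == null) {
            throw new IllegalArgumentException("unknown subscription: " + subscription);
        }
        return queue.poll();
    }

    public static void main(String[] args) {
        PubSubSketch pubsub = new PubSubSketch();
        pubsub.createTopic("topic-c");
        pubsub.createSubscription("topic-c", "sub-c1");
        pubsub.createSubscription("topic-c", "sub-c2");
        pubsub.publish("topic-c", "message-3");
        System.out.println("sub-c1 pulled: " + pubsub.pull("sub-c1"));
        System.out.println("sub-c2 pulled: " + pubsub.pull("sub-c2"));
    }
}
```

Keeping one backlog per subscription, rather than per topic, is what lets independent consumers read the same stream at their own pace.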

Slide 15

Slide 15 text

Cloud Dataflow
● Fully managed, No-Ops
● Intuitive data processing framework
● Batch and stream processing in one
● Autoscaling mid-job
● Dynamic rebalancing mid-job

Slide 16

Slide 16 text

Cloud Dataflow: Batch to Streaming

Batch pipeline:

    Pipeline p = Pipeline.create();
    p.begin()
        .apply(TextIO.Read.from("gs://…"))
        .apply(ParDo.of(new ExtractTags()))
        .apply(Count.create())
        .apply(ParDo.of(new ExpandPrefixes()))
        .apply(Top.largestPerKey(3))
        .apply(TextIO.Write.to("gs://…"));
    p.run();

Lines the slide swaps in to turn the same pipeline into a streaming one:

        .apply(PubsubIO.Read.from("input_topic"))
        .apply(Window.by(FixedWindows.of(5, MINUTES)))
        .apply(PubsubIO.Write.to("output_topic"));
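To make the steps of the slide pipeline concrete, here is a plain-Java sketch of the same stages (extract tags, count, expand prefixes, keep the top N per prefix) plus the arithmetic behind a fixed window. It deliberately does not use the Dataflow SDK. The class `TopTagsSketch`, its method names, the "#word" tag format, and the tie-breaking rule are all assumptions made for illustration; the slide does not show what `ExtractTags` or `ExpandPrefixes` actually do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical single-machine stand-in for the pipeline stages on the slide.
final class TopTagsSketch {
    // Assumption: a tag is '#' followed by word characters.
    private static final Pattern TAG = Pattern.compile("#(\\w+)");

    // Stage 1: pull the tags out of one line of text.
    static List<String> extractTags(String line) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(line);
        while (m.find()) {
            tags.add(m.group(1).toLowerCase());
        }
        return tags;
    }

    // Stage 2: count how often each tag occurs across all lines.
    static Map<String, Long> countTags(List<String> lines) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : lines) {
            for (String tag : extractTags(line)) {
                counts.merge(tag, 1L, Long::sum);
            }
        }
        return counts;
    }

    // Stages 3 and 4: key every tag by each of its prefixes, then keep the n most frequent per prefix.
    static Map<String, List<String>> topPerPrefix(Map<String, Long> counts, int n) {
        Map<String, List<Map.Entry<String, Long>>> byPrefix = new TreeMap<>();
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            String tag = entry.getKey();
            for (int i = 1; i <= tag.length(); i++) {
                byPrefix.computeIfAbsent(tag.substring(0, i), k -> new ArrayList<>()).add(entry);
            }
        }
        // Highest count first; ties broken alphabetically (an assumption, for determinism).
        Comparator<Map.Entry<String, Long>> byCountDescThenTag = (a, b) -> {
            int c = Long.compare(b.getValue(), a.getValue());
            return c != 0 ? c : a.getKey().compareTo(b.getKey());
        };
        Map<String, List<String>> top = new TreeMap<>();
        for (Map.Entry<String, List<Map.Entry<String, Long>>> e : byPrefix.entrySet()) {
            List<Map.Entry<String, Long>> candidates = e.getValue();
            candidates.sort(byCountDescThenTag);
            List<String> best = new ArrayList<>();
            for (Map.Entry<String, Long> c : candidates.subList(0, Math.min(n, candidates.size()))) {
                best.add(c.getKey());
            }
            top.put(e.getKey(), best);
        }
        return top;
    }

    // Fixed windows: every timestamp falls into the window starting at the previous multiple of the size.
    static long fixedWindowStart(long timestampMillis, long windowSizeMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, windowSizeMillis);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("loving #cloud and #cloud", "#clouds today", "#cat");
        System.out.println(topPerPrefix(countTags(lines), 3));
        long fiveMinutes = 5 * 60 * 1000L;
        System.out.println(fixedWindowStart(7 * 60 * 1000L, fiveMinutes));
    }
}
```

In the streaming variant the same counting would run once per window rather than once over the whole input, which is why the windowing step is the main addition on the slide.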

Slide 17

Slide 17 text

Cloud Dataproc (managed Hadoop & Spark)
● Spin up clusters of any size in 90 seconds
● Per-minute billing
● Preemptible VMs are 70% cheaper
● Separation of storage and compute
● Run clusters segregated by job or function

Slide 18

Slide 18 text

[Diagram labels: Hive for IT Business Reporting, Spark MLlib, Hive for Analysts, MapReduce ETL, Google Cloud Storage, Cloud Dataproc]

Slide 19

Slide 19 text

Cloud Bigtable
● Massively scalable NoSQL database
● Easily scale to hundreds of PBs
● Low-latency and high-throughput
● SSDs or HDDs depending on need
● Easy to integrate with Dataflow, Dataproc

Slide 20

Slide 20 text

Cloud Bigtable
[Diagram: two Bigtable boxes linked by Replication, with the labels Low Latency and High Throughput; operations shown: Put, Increment, Append; Gets, Short Scan; Bulk Import; Full Scan, Map Reduce]
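The operation names in the diagram (put, increment, get, short scan) can be illustrated with a sorted map, since Bigtable keeps rows ordered by row key. The sketch below is a single-machine toy, not the Cloud Bigtable or HBase client API: the class `SortedRowStoreSketch`, its methods, the one-number-per-row simplification, and the `sensor#timestamp` key format are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical toy store: one numeric value per row, rows kept sorted by key.
final class SortedRowStoreSketch {
    private final NavigableMap<String, Long> rows = new TreeMap<>();

    // Put: write or overwrite the value for a row key.
    void put(String rowKey, long value) {
        rows.put(rowKey, value);
    }

    // Increment: add delta to the stored value (a missing row starts at delta).
    long increment(String rowKey, long delta) {
        return rows.merge(rowKey, delta, Long::sum);
    }

    // Get: point lookup by exact row key; null when the row does not exist.
    Long get(String rowKey) {
        return rows.get(rowKey);
    }

    // Short scan: because keys are sorted, all rows sharing a prefix are contiguous,
    // so the scan can start at the prefix and stop at the first non-matching key.
    List<String> scanPrefix(String prefix) {
        List<String> keys = new ArrayList<>();
        for (String key : rows.tailMap(prefix, true).keySet()) {
            if (!key.startsWith(prefix)) {
                break;
            }
            keys.add(key);
        }
        return keys;
    }

    public static void main(String[] args) {
        SortedRowStoreSketch store = new SortedRowStoreSketch();
        store.put("sensor1#001", 10);
        store.put("sensor1#002", 20);
        store.put("sensor2#001", 5);
        System.out.println(store.scanPrefix("sensor1#"));
        System.out.println(store.increment("sensor1#001", 5));
    }
}
```

The point of the sketch is the key design: putting the entity first and the time component second keeps one entity's rows adjacent, so a short scan touches only the rows it needs.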

Slide 21

Slide 21 text

Cloud Bigtable in Context
Categories shown: NoSQL, Key Value, Blob, SQL
● Cloud Storage. Good for: Structured and unstructured binary or object data. Such as: Images, large media files, backups.
● Cloud Datastore. Good for: Getting started, App Engine, serve use cases. Such as: User profiles, product catalog.
● Cloud SQL. Good for: Web frameworks, existing applications. Such as: User credentials, customer orders.
● Cloud Bigtable. Good for: Heavy read + write, events, and analytical data. Such as: AdTech, Financial and IoT data.
● Memcache. Good for: Web/mobile applications, gaming. Such as: Game state, user sessions.

Slide 22

Slide 22 text

Google BigQuery
● Petabyte scale data warehouse
● Supports SQL
● Fast and scales automatically
● No setup or administration
● Stream up to 100,000 rows/sec
● Integrates with third-party software like Tableau

Slide 23

Slide 23 text

Cloud Datalab
● Notebooks: Launch notebooks to explore, transform and process data on Google Cloud Platform or locally.
● Built on Jupyter: Built on IPython/Jupyter, which already has a thriving ecosystem of modules and a huge knowledge base.
● Choose your language: Write code in multiple languages: Python, SQL and JavaScript.
● Fully Integrated: It leverages the power of Cloud Storage, BigQuery, Cloud Datastore and Cloud SQL for analyses.

Slide 24

Slide 24 text

Cloud Datalab
● Increase Productivity: Increased productivity through interactive tools and availability of third party libraries.
● Simplicity: Explore and analyze data with ad hoc queries and visualizations.
● Collaboration: Explore, transform and process data collaboratively or publish data as reports, dashboards or APIs.
● Reach: Makes Google’s Big Data capabilities easier to use and therefore more accessible across the company.

Slide 25

Slide 25 text

Cloud Datalab

Slide 26

Slide 26 text

Open Source

Slide 27

Slide 27 text

Open Source
● Cloud Dataflow → Apache Beam: incubating at Apache, open source SDKs, open source runners for Spark and Flink
● Cloud Datalab is built on Jupyter
● Cloud Bigtable supports the HBase 1.0 API
● Cloud Dataproc runs open source Hadoop and Spark
● Kubernetes is completely open source

Slide 28

Slide 28 text

Solutions on Google Cloud Platform

Slide 29

Slide 29 text

A Few Solutions
● Reverse Geocoding using Cloud and Maps
● Scalable Geolocation Telemetry using Cloud
● Machine Learning with Financial Time Series
● Cloud Bigtable Schema Design for Time Series Data
● Analyzing Financial Time Series using BigQuery
● Processing Logs at Scale using Cloud Dataflow
● Reliable Task Scheduling on Google Compute Engine
● Real-Time Inventory using Google Cloud
● Distributed Load Testing Using Kubernetes
● Deploying Microservices on Google App Engine
● Automated Image Builds with Jenkins, Packer, and Kubernetes
● Real-time data analysis with Kubernetes, Redis, and BigQuery

Slide 30

Slide 30 text

Processing Logs using Cloud Dataflow

Slide 31

Slide 31 text

Complex Event Processing

Slide 32

Slide 32 text

Real-Time Inventory

Slide 33

Slide 33 text

Thanks!

Slide 34

Slide 34 text

Questions & Comments
Sandeep Parikh
parikhs@google.com
@crcsmnky