Slide 1

The Power of Big Data on Google Cloud Platform
William Vambenepe, Senior Product Manager
Jim Caputo, Engineering Manager

Slide 2

Key drivers in the growth of Big Data

Data availability
• Applications at the heart of business interactions
• Devices and sensors
• Lower cost of storage & ingestion

Ability to process
• New programming models
• New scale and capabilities for SQL
• Easily available software (Open Source)

Cloud consumption model
• Easy on-ramp, cost-effective experimentation
• Unlimited scale, low TCO
• Combine Open Source software and platform services

Slide 3

An integrated data processing platform: Open Source Big Data on GCE, BigQuery, and Cloud Platform storage services.

Slide 4

An integrated data processing platform: Open Source Big Data on GCE, BigQuery, and Cloud Platform storage services. (See the Open Source Big Data presentation at 3:30.)

Slide 5

BigQuery: Big Data Analytics in the Cloud

Unrivaled Performance and Scale
• Scan multiple TBs in seconds
• Interactive query performance
• No limits on amount of data

Ease of Use and Adoption
• No administration / provisioning
• Convenience of SQL
• Open interfaces (REST, WebUI, ODBC)

Advanced “Big Data” Storage
• Familiar database structure
• Easy data management and ACLs
• Fast, atomic imports

Slide 6

BigQuery Innovation Momentum

(Timeline of feature launches, Q2 2012 through Q4 2013: Launch; Batch Processing; Big JOIN; Big Aggregates; Timestamp; JSON Import; Nested / Repeated Fields; Datastore Import; Excel Connector; Large Query Results; Query Caching; Analytic functions; Google Analytics Integration; Streaming API; Table Decorators; 1000x streaming rate; Table Views; Table Wildcards; JSON functions; SQL improvements.)

Slide 7

BigQuery Ecosystem

(Partner logos, including Chartio.)

Slide 8

BigQuery Streaming

Ease of use
• Simplified infrastructure for real-time use cases
• Stream events row-by-row via a simple API (see the sketch below)

Use cases
• Server logs, mobile apps, gaming, in-app real-time analytics

Real-time availability of data. 100,000 rows per second. Low cost: $0.01 per 100,000 rows.

Customer example:
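A minimal sketch of driving the streaming path, assuming the Node.js client library (@google-cloud/bigquery); the dataset name, table name, and row fields are illustrative, and the table is assumed to already exist with a matching schema.

const {BigQuery} = require('@google-cloud/bigquery');

async function streamEvent() {
  const bigquery = new BigQuery();
  // insert() issues a streaming (tabledata.insertAll) request;
  // rows become available for query shortly after the call returns.
  await bigquery
    .dataset('my_dataset')   // illustrative dataset
    .table('events')         // illustrative table
    .insert([{user: 'u123', action: 'level_up', ts: new Date().toISOString()}]);
}

streamEvent().catch(console.error);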

Slide 9

Google Analytics + BigQuery

(Diagram: Google Analytics Premium Platform → Data Pipeline → Google BigQuery Project; a native data pipeline to load data into BigQuery.)

Slide 10

Google Analytics Use Cases

Unsampled, detail-level data enables many possibilities.

Analyze user behaviour over a longer period of time
• “Between 2010-2014, how has conversion rate changed based on the average page load latency?”

Data mashups for deeper, broader insights
• “Which customers that spent at least $1,000 within the last year visited the site this past month but did not purchase?”

Complex real-life questions can be answered over Big Data
• “Which coupon codes are used most frequently by our referrer sites, and do those customers generate repeat business?”

Slide 11

Google Analytics + BigQuery Customers

Slide 12

BigQuery in Action

“The interactive performance of Google BigQuery, combined with Tableau’s intuitive visualization tools, enabled our analysts to interactively explore huge quantities of data – hundreds of millions of rows – with incredible efficiency. Previously, analyses would require hours or days to complete, if they would even complete at all. With Google BigQuery it takes minutes, if that, to process. This time-to-insight was previously impossible.”
– Giovanni DeMeo, Vice President, Global Marketing and Analytics

“It is incredibly fast and easy to use. Our data was already quite big (at least we like to think so) but we can’t help feeling that BQ has a lot more to offer and would be able to work with 100 times that amount without breaking a sweat. It’s got a short learning curve that allows for quick iterations and rapid product development. The SQL-like query language is an easy transition to make for any engineer, and much quicker and easier than using a MapReduce model.”
– Graham Polley, Shine Technologies

Slide 13

BigQuery Data Warehousing

(Architecture diagram: data from phones and other sources flows into BigQuery Storage; workflows built on Hadoop MapReduce, Compute Engine, App Engine, and Cloud Storage read from and write to BigQuery; results are consumed by business analysts, applications, and visualizations.)

Slide 14

BigQuery Storage: More than just file storage

Cost effective, durable, flexible and fast

Durability & Fast Reads
• Highly available and durable
• Optimized columnar format and file management
• Fast table reads (without querying)

Rich Metadata
• Familiar database structure
• TTLs, project and dataset ACLs
• Table Decorators - “time travel” (see the sketch below)

Data Imports
• High-throughput, low-latency streaming
• Fast and high-frequency bulk imports
• Atomic create / append / replace operations
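As a small, hedged illustration of the “time travel” bullet, assuming the legacy-SQL snapshot-decorator syntax, where a negative @ value addresses the table as of that many milliseconds in the past; the table name is illustrative.

// Query yesterday's state of a table via a snapshot decorator
// (@-86400000 = 24 hours ago, in milliseconds).
const query = "SELECT COUNT(*) FROM [my_dataset.events@-86400000]"; // illustrative table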

Slide 15

BigQuery Storage: More than just file storage

SQL with differentiated functions for quick data analysis

Optimized for SQL
• HOST, DOMAIN, REGEXP_MATCH, Analytic functions, etc.

Structured Data and Flexibility
• JSON, JOINs, Nested / Repeated fields

Example (see the sketch below for running it from a client):
JSON_EXTRACT("{ 'book': { 'category': 'fiction', 'title': 'Harry Potter' } }", "$.book.title");
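To show the function in context, here is a sketch of running a JSON_EXTRACT query through the Node.js client library; the client usage and the useLegacySql flag are assumptions about environment setup, not part of the slide.

const {BigQuery} = require('@google-cloud/bigquery');

async function extractTitle() {
  const bigquery = new BigQuery();
  const [rows] = await bigquery.query({
    // JSON_EXTRACT returns the matched value as a JSON string.
    query: `SELECT JSON_EXTRACT('{"book": {"category": "fiction", "title": "Harry Potter"}}',
                                '$.book.title') AS title`,
    useLegacySql: true,  // the functions on this slide are legacy-SQL functions
  });
  console.log(rows[0].title);
}

extractTitle().catch(console.error);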

Slide 16

Google Analytics Data

Analytics Table schema:
  visitId
  visitStartTime
  ...
  hits : REPEATED
    hits.time
    hits.page.pagePath
    ...
    hits.customVariables : REPEATED
      hits.customVariables.index
      hits.customVariables.customVarName
      hits.customVariables.customVarValue

Slide 17

Google Analytics Data

visitId     startTime            hits_time  hits_page_pagePath
1391301917  2014-02-02 00:45:16  0          /table/publicdata:samples.github_nested
                                 6234       /table/publicdata:samples.github_timeline
                                 27892      /table/publicdata:samples.wikipedia
                                 53204      /table/publicdata:samples.natality
                                 104234     /table/publicdata:samples.shakespeare

Slide 18

SQL Limits

Find the average duration spent transitioning from page A to page B.

SQL>
SELECT
  COUNT(visitId) AS count_visitors,
  AVG(hitTime_lead - hitTime) / 1000 AS average_duration_seconds
FROM (
  SELECT
    visitId, hitTime, path,
    LEAD(path, 1) OVER (PARTITION BY visitId ORDER BY hitTime) AS path_lead,
    LEAD(hitTime, 1) OVER (PARTITION BY visitId ORDER BY hitTime) AS hitTime_lead
  FROM (
    SELECT visitId, hits.time AS hitTime, hits.page.pagePath AS path
    FROM [bigquerytestdefault:demo.analytics]
    OMIT RECORD IF
      EVERY(hits.page.pagePath CONTAINS 'publicdata:samples.github_nested')
      OR EVERY(hits.page.pagePath CONTAINS 'publicdata:samples.github_timeline')
  )
)
WHERE path CONTAINS 'publicdata:samples.github_nested'
  AND path_lead CONTAINS 'publicdata:samples.github_timeline'

Slides 19–22

(The same query as on Slide 18, repeated verbatim across four build slides.)

Slide 23

SQL Limits

SQL challenges…
• Some business logic can be very difficult to express in SQL
• Other scenarios are nearly impossible

Procedural language
• Unblocks difficult scenarios
• Often easier to express with less code

Slide 24

Escape hatch?

Find the average duration spent transitioning from page A to page B.

Script>
// For each consecutive pair of hits in the session...
for (var i = 0; i < row.hits.length - 1; i++) {
  if (row.hits[i].page.pagepath.indexOf('github_nested') != -1 &&
      row.hits[i + 1].page.pagepath.indexOf('github_timeline') != -1) {
    // ...record the visit and the time spent between page A and page B.
    result.push({
      visitid: row.visitid,
      duration: row.hits[i + 1].time - row.hits[i].time
    });
  }
}
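A minimal harness for the loop above, assuming one session row shaped like the Slide 16 schema and populated with the Slide 17 sample values; row and result are the free variables the snippet expects.

var result = [];
var row = {
  visitid: 1391301917,
  hits: [
    {time: 0,    page: {pagepath: '/table/publicdata:samples.github_nested'}},
    {time: 6234, page: {pagepath: '/table/publicdata:samples.github_timeline'}}
  ]
};
// ...run the loop above against this row...
// result is then [{visitid: 1391301917, duration: 6234}]  (duration in ms)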

Slide 25

BigQuery Table Reads

Reading data directly from BigQuery Storage (pseudocode):

Script>
prepare table for parallel read
wait for job to start
while (more data exists):
    read rows
    for row in rows:
        print findDuration(row)
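One way the pseudocode might look in practice, sketched with the Node.js client library: getRows() wraps the tabledata.list read path and pages through the table without running a query (so no job is needed in this sketch). The table reference is the demo table from the earlier slides; findDuration stands in for the Slide 24 logic.

const {BigQuery} = require('@google-cloud/bigquery');

async function readTable() {
  const bigquery = new BigQuery();
  // Read rows straight from storage; getRows() pages through the
  // table via the tabledata.list API under the hood. Parallel readers
  // would each take a slice of the table instead of reading it all.
  const [rows] = await bigquery.dataset('demo').table('analytics').getRows();
  for (const row of rows) {
    console.log(findDuration(row)); // findDuration: the Slide 24 logic
  }
}

readTable().catch(console.error);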

Slide 26

Demo

Slide 27

BigQuery User Defined Functions

SQL + JavaScript → BigQuery: Simplicity, Scale, Performance

Coming Soon! Contact your Sales Rep for early access.

(JavaScript is a trademark or registered trademark of Oracle in the U.S. and other countries.)
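The feature was unreleased when this deck was written, so the sketch below only extrapolates the shape from the Slide 24 snippet: a plain JavaScript function that receives one input record and emits zero or more output records. The function name, fields, and the emit callback are illustrative assumptions, not a confirmed API.

// Hypothetical UDF shape: one input row in, zero or more rows out.
function transitionDuration(row, emit) {
  for (var i = 0; i < row.hits.length - 1; i++) {
    if (row.hits[i].page.pagepath.indexOf('github_nested') != -1 &&
        row.hits[i + 1].page.pagepath.indexOf('github_timeline') != -1) {
      emit({visitid: row.visitid,
            duration: row.hits[i + 1].time - row.hits[i].time});
    }
  }
}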

Slide 28

BigQuery Storage

Fast Table Reads
• Parallel MapReduce or equivalent operating over BigQuery tables

Optimized for SQL
• Database-like structure, with advanced SQL

SQL + JavaScript User Defined Functions
• Powerful, simple, fast!

Slide 29

Pricing Announcements

Slide 30

Key Drivers For Pricing Changes

Customer and Partner Feedback
• Reduced price for lower barrier to entry
• Predictability in costs

New Usage Models
• Data warehousing
• Streaming ingestion
• UDFs coming soon

Slide 31

Lower...

Query:     $5/TB (from $35/TB), an 85% price reduction
Streaming: $0.01 per 100K rows (from $0.01 per 10K rows), a 90% price reduction
Storage:   $26/TB/month (from $80/TB/month), a 65% price reduction

Slide 32

On-Demand Query Pricing

Existing usage model
• Ideal for interactive, ad-hoc analysis of Big Data
• Leverages shared pool of resources
• Concurrent query quota applies

Pricing
• Flat $5/TB processed (reduced from $35/TB)
• No contracts. Pay as you go!

Slide 33

Reserved Capacity Pricing

Capacity Reservations
• Increments of 5 GB per second of query throughput
• Larger, consistent workloads
• Monthly commitment

Reservation Sizing
• 5 GB per second: $20,000 per month
• ~13,000 TB of processing per month
• 95% reduction off the current $35/TB price

Consistent performance. No concurrent query quotas. Predictable costs.

Slide 34

Example Customer Scenario
• Customer data: 50 TB stored
• Average query size: 2.5 TB (5%)
  - Columnar data: process only the columns referenced by the query
  - Table partitioning combined with Table Wildcards
  - Table Decorators for accessing only recent data
• Number of analysts: 2
• Queries per analyst per day: 75
• Estimated total processing: 2.5 TB * 150 queries = 375 TB per day

Reservation Capability: how much reserved capacity is needed?
• Reservation: 5 GB per second
• Total capability: 5 GB per second * 86,400 seconds per day = 432 TB per day (checked in the sketch below)
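The sizing arithmetic above, written out as a quick check (decimal units, as on the slide):

// Demand: 2 analysts * 75 queries/day * 2.5 TB/query
const dailyDemandTB = 2 * 75 * 2.5;              // 375 TB per day
// Capacity: 5 GB/s sustained for a full day, in decimal TB
const dailyCapacityTB = 5 * 86400 / 1000;        // 432 TB per day
console.log(dailyDemandTB <= dailyCapacityTB);   // true: one 5 GB/s reservation fits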

Slide 35

Reserved Capacity Model

Bursting beyond your reservation
• BigQuery’s multi-tenant architecture enables the optional ability to leverage On-Demand resources

Scenarios
• Volatility in workload throughout the day
• End-of-quarter financials
• Product launch increases load

Consistency, with Flexibility
• Predictable costs
• Guaranteed capacity
• Additional resources when needed

Slide 36

Daily Query Volume

(Chart: query throughput over one day; total query processing = 375 TB.)

Slide 37

Reserved Capacity – “No Burst”

(Chart: query throughput over one day against a 5 GB per second reserved capacity line; total query processing = 375 TB. The maximum is reached and queries are slowed.)

Slide 38

Reserved Capacity – “Burst with Cap”

(Chart: query throughput over one day; peaks above the 5 GB per second reservation spill into On-Demand burst; total query processing = 375 TB.)

Slide 39

Reserved Capacity – “Burst with Cap”

(Chart: query throughput over one day; On-Demand burst above the 5 GB per second reservation until the 430 TB maximum is reached; total query processing = 430 TB.)

Slide 40

Reserved Capacity – “Burst with No Cap”

(Chart: query throughput over one day; uncapped On-Demand burst above the 5 GB per second reservation; total query processing = 510 TB.)

Slide 41

BigQuery Reserved + On-Demand

Scenario #1 “No Burst”
• Users experienced slower queries during peak utilization
• Usage was throttled. Helpful in that it avoids the risk of using the entire budget in a short time
• TB processed: 375 TB

Scenario #2 “Burst with Cap”
• Queries above Reserved Capacity leveraged On-Demand, so no slowdown for users
• TB processed: 375 TB
• On-Demand cost: $0

Scenario #3 “Burst with Cap”
• Queries above Reserved Capacity leveraged On-Demand, but were halted after the daily cap was hit
• TB processed: 430 TB (the maximum amount possible for a 5 GB per second reservation)
• On-Demand cost: $0

Scenario #4 “Burst with No Cap”
• Queries above Reserved Capacity leveraged On-Demand, and total usage continued to grow
• TB processed: 510 TB (80 TB more than the 430 TB reservation maximum)
• On-Demand cost: $400 (80 TB * $5 per TB On-Demand price)

Slide 42

Conclusion

Google BigQuery is your easy-to-use Cloud analytics platform:
• Fully managed, lower TCO
• High performance & scalability
• Rich query capability
• Predictable costs meet flexibility
• Standards support for easy adoption
• Interactive and batch workloads

Slide 43

cloud.google.com

Slide 44

END

(The remaining slides are no longer used; kept for reference.)