
The Power of Big Data on Google Cloud Platform


from Google Cloud Platform Live 2014
YouTube Video: https://www.youtube.com/watch?v=GrD7ymUPt3M

Kazunori Sato

April 24, 2014

Transcript

  1. The Power of Big Data on Google Cloud Platform
     William Vambenepe, Senior Product Manager
     Jim Caputo, Engineering Manager
  2. Key drivers in the growth of Big Data
     Data availability • Applications at the heart of business interactions • Devices and sensors • Lower cost of storage & ingestion
     Ability to process • New programming models • New scale and capabilities for SQL • Easily available software (Open Source)
     Cloud consumption model • Easy on-ramp, cost effective experimentation • Unlimited scale, low TCO • Combine Open Source software and platform services
  3. An integrated data processing platform: Open Source Big Data on GCE, BigQuery, and Cloud Platform storage services.
  4. An integrated data processing platform: Open Source Big Data on GCE, BigQuery, and Cloud Platform storage services. See the Open Source Big Data presentation at 3:30.
  5. BigQuery: Big Data Analytics in the Cloud
     Unrivaled Performance and Scale • Scan multiple TBs in seconds • Interactive query performance • No limits on amount of data
     Ease of Use and Adoption • No administration / provisioning • Convenience of SQL • Open interfaces (REST, WebUI, ODBC)
     Advanced “Big Data” Storage • Familiar database structure • Easy data management and ACLs • Fast, atomic imports
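     To make the “Convenience of SQL” point concrete, here is a minimal legacy-SQL sketch (not from the deck) against the publicdata:samples.shakespeare public table that appears later in these slides; the alias is illustrative:

       -- Ten most frequent words across Shakespeare's works
       SELECT word, SUM(word_count) AS total_mentions
       FROM [publicdata:samples.shakespeare]
       GROUP BY word
       ORDER BY total_mentions DESC
       LIMIT 10;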
  6. BigQuery Innovation Momentum: a timeline of releases running quarter by quarter from the Q2 2012 launch through today. Milestones along the way: Launch; Batch Processing; Excel Connector; Big JOIN; Big Aggregates; Timestamp; JSON Import; Nested / Repeated Fields; Datastore Import; Streaming API; Table Decorators; Large Query Results; Query Caching; Analytic functions; Google Analytics Integration; 1000x Streaming rate; Table Views; Table Wildcards; JSON functions; SQL Improvements.
  7. BigQuery Streaming: 100,000 rows per second, real-time availability of data, low cost ($0.01 per 100,000 rows)
     Ease of use • Simplified infrastructure for real-time use cases • Stream events row-by-row via simple API
     Use cases • Server logs, mobile apps, gaming, in-app real-time analytics
     Customer example:
  8. Google Analytics + BigQuery (diagram): the Google Analytics Premium platform provides a native data pipeline that loads data into a Google BigQuery project.
  9. Google Analytics Use Cases: unsampled, detail-level data enables many possibilities
     Analyze user behaviour over longer periods of time: “Between 2010-2014, how has conversion rate changed based on the average page load latency?”
     Data mashups for deeper, broader insights: “Which customers that spent at least $1,000 within the last year visited the site this past month but did not purchase?”
     Complex real-life questions can be answered over Big Data: “Which coupon codes are used most frequently by our referrer sites, and do those customers generate repeat business?”
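     As an illustration (not from the deck), the second question could be sketched in 2014-era legacy BigQuery SQL roughly as follows; the orders and visits tables, their columns, and the date bounds are all hypothetical:

       -- Customers who spent >= $1,000 in the past year and visited this
       -- past month, minus those who purchased this past month
       -- (all table and column names are invented for the example)
       SELECT big_spenders.customer_id
       FROM (
         SELECT customer_id, SUM(order_total) AS spend
         FROM [mydataset.orders]
         WHERE order_date >= '2013-04-24'
         GROUP BY customer_id
         HAVING spend >= 1000) AS big_spenders
       JOIN (
         SELECT customer_id
         FROM [mydataset.visits]
         WHERE visit_date >= '2014-03-24'
         GROUP BY customer_id) AS recent_visitors
       ON big_spenders.customer_id = recent_visitors.customer_id
       WHERE big_spenders.customer_id NOT IN (
         SELECT customer_id
         FROM [mydataset.orders]
         WHERE order_date >= '2014-03-24');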
  10. BigQuery in Action
     “The interactive performance of Google BigQuery, combined with Tableau’s intuitive visualization tools, enabled our analysts to interactively explore huge quantities of data – hundreds of millions of rows – with incredible efficiency. Previously, analyses would require hours or days to complete, if they would even complete at all. With Google BigQuery it takes minutes, if that, to process. This time-to-insight was previously impossible.” – Giovanni DeMeo, Vice President Global Marketing and Analytics
     “It is incredibly fast and easy to use. Our data was already quite big (at least we like to think so) but we can’t help feeling that BQ has a lot more to offer and would be able to work with 100 times that amount without breaking a sweat. It’s got a short learning curve that allows for quick iterations and rapid product development. The SQL-like query language is an easy transition to make for any engineer and much quicker and easier than using a MapReduce model.” – Graham Polley, Shine Technologies
  11. BigQuery Data Warehousing (architecture diagram): data flows into BigQuery Storage from phones, Hadoop MapReduce workflows on Compute Engine, App Engine, and Cloud Storage; BigQuery then serves • Business Analysts • Applications • Visualizations
  12. BigQuery Storage: more than just file storage. Cost effective, durable, flexible and fast.
     Durability & Fast Reads • Highly available and durable • Optimized columnar format and file management • Fast table reads (without querying)
     Rich Metadata • Familiar database structure • TTLs, project and dataset ACLs • Table Decorators for “time travel” (see the sketch after this slide)
     Data Imports • High-throughput, low-latency streaming • Fast and high-frequency bulk imports • Atomic create / append / replace operations
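     A minimal sketch (not from the deck) of the “time travel” idea using a snapshot table decorator, which legacy BigQuery expresses as a relative offset in milliseconds; the table name is hypothetical:

       -- Count rows as the table existed one hour ago
       -- (@-3600000 is a relative snapshot decorator in milliseconds)
       SELECT COUNT(*) AS row_count_one_hour_ago
       FROM [mydataset.events@-3600000];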
  13. BigQuery Storage: more than just file storage. SQL with differentiated functions for quick data analysis.
     Optimized for SQL • HOST, DOMAIN, REGEXP_MATCH, Analytic functions, etc. (illustrated after this slide)
     Structured Data and Flexibility • JSON, JOINs, Nested / Repeated fields
     JSON_EXTRACT("{ 'book': { 'category': 'fiction', 'title': 'Harry Potter' } }", "$.book.title");
  14. Google Analytics Data: Analytics table schema (excerpt)
     visitId
     visitStartTime
     ...
     hits : REPEATED
     hits.time
     hits.page.pagePath
     ...
     hits.customVariables : REPEATED
     hits.customVariables.index
     hits.customVariables.customVarName
     hits.customVariables.customVarValue
  15. Google Analytics Data: sample rows (repeated hits flattened)
     visitId     startTime            hits_time  hits_page_pagePath
     1391301917  2014-02-02 00:45:16  0          /table/publicdata:samples.github_nested
                                      6234       /table/publicdata:samples.github_timeline
                                      27892      /table/publicdata:samples.wikipedia
                                      53204      /table/publicdata:samples.natality
                                      104234     /table/publicdata:samples.shakespeare
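     Rows like these can be produced directly, since legacy BigQuery SQL implicitly flattens repeated leaf fields when they are selected; a minimal sketch against the demo table used on the next slide:

       -- One output row per hit within each visit
       SELECT visitId, visitStartTime, hits.time, hits.page.pagePath
       FROM [bigquerytestdefault:demo.analytics]
       LIMIT 5;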
  16. SQL Limits: find the average duration spent transitioning from page A to page B.
     SQL>
       SELECT
         COUNT(visitId) AS count_visitors,
         AVG(hitTime_lead - hitTime) / 1000 AS average_duration_seconds
       FROM (
         SELECT
           visitId, hitTime, path,
           LEAD(path, 1) OVER (PARTITION BY visitId ORDER BY hitTime) AS path_lead,
           LEAD(hitTime, 1) OVER (PARTITION BY visitId ORDER BY hitTime) AS hitTime_lead
         FROM (
           SELECT visitId, hits.time AS hitTime, hits.page.pagePath AS path
           FROM [bigquerytestdefault:demo.analytics]
           OMIT RECORD IF
             EVERY(hits.page.pagePath CONTAINS 'publicdata:samples.github_nested')
             OR EVERY(hits.page.pagePath CONTAINS 'publicdata:samples.github_timeline')))
       WHERE path CONTAINS 'publicdata:samples.github_nested'
         AND path_lead CONTAINS 'publicdata:samples.github_timeline'
  21. SQL Limits
     SQL challenges • Some business logic can be very difficult to express in SQL • Other scenarios are nearly impossible
     Procedural language • Unblocks difficult scenarios • Often easier to express with less code
  22. Escape hatch? Find the average duration spent transitioning from page A to page B.
     Script>
       // Walk consecutive hit pairs within one visit row and record the
       // time spent on each github_nested -> github_timeline transition.
       for (var i = 0; i < row.hits.length - 1; i++) {
         if (row.hits[i].page.pagepath.indexOf('github_nested') != -1 &&
             row.hits[i + 1].page.pagepath.indexOf('github_timeline') != -1) {
           result.push({
             visitid: row.visitid,
             duration: row.hits[i + 1].time - row.hits[i].time
           });
         }
       }
  23. BigQuery Table Reads: reading data directly from BigQuery Storage
     Script> (pseudo code)
       prepare table for parallel read
       wait for job to start
       while (more data exists):
         read rows
         for row in rows:
           print findDuration(row)
  24. BigQuery User Defined Functions: SQL + JavaScript on BigQuery. Simplicity, Scale, Performance. Coming Soon! Contact your Sales Rep for early access.
     (JavaScript is a trademark or registered trademark of Oracle in the U.S. and other countries.)
  25. BigQuery Storage
     Fast Table Reads • Parallel MapReduce or equivalent operating over BigQuery tables
     Optimized for SQL • Database-like structure, with advanced SQL
     SQL + JavaScript User Defined Functions • Powerful, simple, fast!
  26. Key Drivers For Pricing Changes
     Customer and Partner Feedback • Reduced price for lower barrier to entry • Predictability in costs
     New Usage Models • Data warehousing • Streaming ingestion • UDFs coming soon
  27. Lower...
     Lower Query: $5/TB (from $35/TB), an 85% price reduction
     Lower Streaming: $0.01 per 100K rows (from $0.01 per 10K rows), a 90% price reduction
     Lower Storage: $26/TB/month (from $80/TB/month), a 65% price reduction
  28. On-Demand Query Pricing
     Existing usage model • Ideal for interactive, ad-hoc analysis of Big Data • Leverages shared pool of resources • Concurrent query quota applies
     Pricing • Flat $5/TB processed (reduced from $35/TB) • No contracts. Pay as you go!
  29. Reserved Capacity Pricing: consistent performance, no concurrent query quotas, predictable costs
     Capacity Reservations • Increments of 5 GB per second of query throughput • Larger, consistent workloads • Monthly commitment
     Reservation Sizing • 5 GB per second: $20,000 per month • ~13,000 TB of processing per month • 95% reduction off the current $35/TB price
  30. Example Customer Scenario
     • Customer data: 50 TB stored
     • Average query size: 2.5 TB (5%), kept small by: columnar data (process only the columns referenced by the query), table partitioning combined with Table Wildcards (see the sketch after this slide), and Table Decorators for accessing only recent data
     • Number of analysts: 2
     • Queries per analyst per day: 75
     • Estimated total processing: 2.5 TB * 150 queries = 375 TB per day
     How much reserved capacity is needed?
     • Reservation: 5 GB per second
     • Total capability: 5 GB per second * 86,400 seconds per day = 432 TB per day
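     A hedged sketch (not from the deck) of the partitioning idea: with one table per day, the legacy-SQL wildcard function TABLE_DATE_RANGE scans only the days a query needs; the daily tables named events_YYYYMMDD are hypothetical:

       -- Scan one week of daily tables instead of all history
       SELECT COUNT(*) AS weekly_events
       FROM TABLE_DATE_RANGE([mydataset.events_],
                             TIMESTAMP('2014-04-17'),
                             TIMESTAMP('2014-04-24'));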
  31. Reserved Capacity Model: consistency, with flexibility
     Bursting beyond your reservation • BigQuery’s multi-tenant architecture enables the optional ability to leverage On-Demand resources
     Scenarios • Volatility in workload throughout the day • End of quarter financials • Product launch increases load
     Consistency, with Flexibility • Predictable costs • Guaranteed capacity • Additional resources when needed
  32. Reserved Capacity – “No Burst” (chart): query throughput over one day against the 5 GB per second reserved capacity; once the maximum is reached, queries are slowed. Total query processing = 375 TB.
  33. Reserved Capacity – “Burst with Cap” (chart): query throughput over one day; queries above the 5 GB per second reservation spill into On-Demand burst. Total query processing = 375 TB.
  34. Reserved Capacity – “Burst with Cap” (chart): query throughput over one day with On-Demand burst above the 5 GB per second reservation; processing halts once the 430 TB daily maximum is reached. Total query processing = 430 TB.
  35. Reserved Capacity – “Burst with No Cap” (chart): query throughput over one day; On-Demand burst above the 5 GB per second reservation with no daily cap. Total query processing = 510 TB.
  36. BigQuery Reserved + On-Demand
     Scenario #1 “No Burst” • Users experienced slower queries during peak utilization • Usage was throttled, which is helpful in that it avoids the risk of using the entire budget in a short time • TB processed: 375 TB
     Scenario #2 “Burst with Cap” • Queries above Reserved Capacity leveraged On-Demand, so no slowdown for users • TB processed: 375 TB • On-Demand cost: $0
     Scenario #3 “Burst with Cap” • Queries above Reserved Capacity leveraged On-Demand, but were halted after the daily cap was hit • TB processed: 430 TB (the maximum amount possible for a 5 GB per second reservation) • On-Demand cost: $0
     Scenario #4 “Burst with No Cap” • Queries above Reserved Capacity leveraged On-Demand, and total usage continued to grow • TB processed: 510 TB (80 TB more than the 430 TB reservation maximum) • On-Demand cost: $400 (80 TB * $5 per TB On-Demand price)
  37. Conclusion: Google BigQuery is your easy-to-use Cloud analytics platform
     • Fully managed, lower TCO
     • High performance & scalability
     • Rich query capability
     • Predictable costs with flexibility
     • Standards support for easy adoption
     • Interactive and batch workloads