
The Power of Big Data on Google Cloud Platform


from Google Cloud Platform Live 2014
YouTube Video: https://www.youtube.com/watch?v=GrD7ymUPt3M

Kazunori Sato

April 24, 2014

Transcript

  1. The Power of Big Data on Google Cloud Platform
     William Vambenepe, Senior Product Manager
     Jim Caputo, Engineering Manager
  2. Key drivers in the growth of Big Data
     Data availability • Applications at the heart of business interactions • Devices and sensors • Lower cost of storage & ingestion
     Ability to process • New programming models • New scale and capabilities for SQL • Easily available software (Open Source)
     Cloud consumption model • Easy on-ramp, cost effective experimentation • Unlimited scale, low TCO • Combine Open Source software and platform services
  3. An integrated data processing platform: Open Source Big Data on GCE, BigQuery, and Cloud Platform storage services.
  4. An integrated data processing platform: Open Source Big Data on GCE, BigQuery, and Cloud Platform storage services. See the Open Source Big Data presentation at 3:30.
  5. BigQuery: Big Data Analytics in the Cloud
     Unrivaled Performance and Scale • Scan multiple TBs in seconds • Interactive query performance • No limits on amount of data
     Ease of Use and Adoption • No administration / provisioning • Convenience of SQL • Open interfaces (REST, WebUI, ODBC)
     Advanced “Big Data” Storage • Familiar database structure • Easy data management and ACLs • Fast, atomic imports
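     To make the “Convenience of SQL” point concrete, here is a minimal legacy-SQL sketch (not from the deck) against the publicdata:samples.shakespeare public table that appears later in these slides; the alias is illustrative:

       -- Ten most frequent words across Shakespeare's works
       SELECT word, SUM(word_count) AS total_mentions
       FROM [publicdata:samples.shakespeare]
       GROUP BY word
       ORDER BY total_mentions DESC
       LIMIT 10;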
  6. BigQuery Innovation Momentum: a timeline of releases running quarter by quarter from the Q2 2012 launch through today. Milestones along the way: Launch; Batch Processing; Excel Connector; Big JOIN; Big Aggregates; Timestamp; JSON Import; Nested / Repeated Fields; Datastore Import; Streaming API; Table Decorators; Large Query Results; Query Caching; Analytic functions; Google Analytics Integration; 1000x Streaming rate; Table Views; Table Wildcards; JSON functions; SQL Improvements.
  7. BigQuery Streaming: 100,000 rows per second, real-time availability of data, low cost ($0.01 per 100,000 rows)
     Ease of use • Simplified infrastructure for real-time use cases • Stream events row-by-row via simple API
     Use cases • Server logs, mobile apps, gaming, in-app real-time analytics
     Customer example:
  8. Google Analytics + BigQuery (diagram): the Google Analytics Premium platform provides a native data pipeline that loads data into a Google BigQuery project.
  9. Google Analytics Use Cases: unsampled, detail-level data enables many possibilities
     Analyze user behaviour over longer periods of time: “Between 2010-2014, how has conversion rate changed based on the average page load latency?”
     Data mashups for deeper, broader insights: “Which customers that spent at least $1,000 within the last year visited the site this past month but did not purchase?”
     Complex real-life questions can be answered over Big Data: “Which coupon codes are used most frequently by our referrer sites, and do those customers generate repeat business?”
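     As an illustration (not from the deck), the second question could be sketched in 2014-era legacy BigQuery SQL roughly as follows; the orders and visits tables, their columns, and the date bounds are all hypothetical:

       -- Customers who spent >= $1,000 in the past year and visited this
       -- past month, minus those who purchased this past month
       -- (all table and column names are invented for the example)
       SELECT big_spenders.customer_id
       FROM (
         SELECT customer_id, SUM(order_total) AS spend
         FROM [mydataset.orders]
         WHERE order_date >= '2013-04-24'
         GROUP BY customer_id
         HAVING spend >= 1000) AS big_spenders
       JOIN (
         SELECT customer_id
         FROM [mydataset.visits]
         WHERE visit_date >= '2014-03-24'
         GROUP BY customer_id) AS recent_visitors
       ON big_spenders.customer_id = recent_visitors.customer_id
       WHERE big_spenders.customer_id NOT IN (
         SELECT customer_id
         FROM [mydataset.orders]
         WHERE order_date >= '2014-03-24');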
  10. BigQuery in Action
     “The interactive performance of Google BigQuery, combined with Tableau’s intuitive visualization tools, enabled our analysts to interactively explore huge quantities of data – hundreds of millions of rows – with incredible efficiency. Previously, analyses would require hours or days to complete, if they would even complete at all. With Google BigQuery it takes minutes, if that, to process. This time-to-insight was previously impossible.” – Giovanni DeMeo, Vice President Global Marketing and Analytics
     “It is incredibly fast and easy to use. Our data was already quite big (at least we like to think so) but we can’t help feeling that BQ has a lot more to offer and would be able to work with 100 times that amount without breaking a sweat. It’s got a short learning curve that allows for quick iterations and rapid product development. The SQL-like query language is an easy transition to make for any engineer and much quicker and easier than using a MapReduce model.” – Graham Polley, Shine Technologies
  11. BigQuery Data Warehousing (architecture diagram): data flows into BigQuery Storage from phones, Hadoop MapReduce workflows on Compute Engine, App Engine, and Cloud Storage; BigQuery then serves • Business Analysts • Applications • Visualizations
  12. BigQuery Storage: more than just file storage. Cost effective, durable, flexible and fast.
     Durability & Fast Reads • Highly available and durable • Optimized columnar format and file management • Fast table reads (without querying)
     Rich Metadata • Familiar database structure • TTLs, project and dataset ACLs • Table Decorators for “time travel” (see the sketch after this slide)
     Data Imports • High-throughput, low-latency streaming • Fast and high-frequency bulk imports • Atomic create / append / replace operations
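     A minimal sketch (not from the deck) of the “time travel” idea using a snapshot table decorator, which legacy BigQuery expresses as a relative offset in milliseconds; the table name is hypothetical:

       -- Count rows as the table existed one hour ago
       -- (@-3600000 is a relative snapshot decorator in milliseconds)
       SELECT COUNT(*) AS row_count_one_hour_ago
       FROM [mydataset.events@-3600000];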
  13. BigQuery Storage: more than just file storage. SQL with differentiated functions for quick data analysis.
     Optimized for SQL • HOST, DOMAIN, REGEXP_MATCH, Analytic functions, etc. (illustrated after this slide)
     Structured Data and Flexibility • JSON, JOINs, Nested / Repeated fields
     JSON_EXTRACT("{ 'book': { 'category': 'fiction', 'title': 'Harry Potter' } }", "$.book.title");
  14. Google Analytics Data: Analytics table schema (excerpt)
     visitId
     visitStartTime
     ...
     hits : REPEATED
     hits.time
     hits.page.pagePath
     ...
     hits.customVariables : REPEATED
     hits.customVariables.index
     hits.customVariables.customVarName
     hits.customVariables.customVarValue
  15. Google Analytics Data: sample rows (repeated hits flattened)
     visitId     startTime            hits_time  hits_page_pagePath
     1391301917  2014-02-02 00:45:16  0          /table/publicdata:samples.github_nested
                                      6234       /table/publicdata:samples.github_timeline
                                      27892      /table/publicdata:samples.wikipedia
                                      53204      /table/publicdata:samples.natality
                                      104234     /table/publicdata:samples.shakespeare
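     Rows like these can be produced directly, since legacy BigQuery SQL implicitly flattens repeated leaf fields when they are selected; a minimal sketch against the demo table used on the next slide:

       -- One output row per hit within each visit
       SELECT visitId, visitStartTime, hits.time, hits.page.pagePath
       FROM [bigquerytestdefault:demo.analytics]
       LIMIT 5;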
  16. SQL Limits: find the average duration spent transitioning from page A to page B.
     SQL>
       SELECT
         COUNT(visitId) AS count_visitors,
         AVG(hitTime_lead - hitTime) / 1000 AS average_duration_seconds
       FROM (
         SELECT
           visitId, hitTime, path,
           LEAD(path, 1) OVER (PARTITION BY visitId ORDER BY hitTime) AS path_lead,
           LEAD(hitTime, 1) OVER (PARTITION BY visitId ORDER BY hitTime) AS hitTime_lead
         FROM (
           SELECT visitId, hits.time AS hitTime, hits.page.pagePath AS path
           FROM [bigquerytestdefault:demo.analytics]
           OMIT RECORD IF
             EVERY(hits.page.pagePath CONTAINS 'publicdata:samples.github_nested')
             OR EVERY(hits.page.pagePath CONTAINS 'publicdata:samples.github_timeline')))
       WHERE path CONTAINS 'publicdata:samples.github_nested'
         AND path_lead CONTAINS 'publicdata:samples.github_timeline'
  21. SQL Limits
     SQL challenges • Some business logic can be very difficult to express in SQL • Other scenarios are nearly impossible
     Procedural language • Unblocks difficult scenarios • Often easier to express with less code
  22. Escape hatch? Find the average duration spent transitioning from page A to page B.
     Script>
       // Walk consecutive hit pairs within one visit row and record the
       // time spent on each github_nested -> github_timeline transition.
       for (var i = 0; i < row.hits.length - 1; i++) {
         if (row.hits[i].page.pagepath.indexOf('github_nested') != -1 &&
             row.hits[i + 1].page.pagepath.indexOf('github_timeline') != -1) {
           result.push({
             visitid: row.visitid,
             duration: row.hits[i + 1].time - row.hits[i].time
           });
         }
       }
  23. BigQuery Table Reads: reading data directly from BigQuery Storage
     Script> (pseudo code)
       prepare table for parallel read
       wait for job to start
       while (more data exists):
         read rows
         for row in rows:
           print findDuration(row)
  24. BigQuery User Defined Functions: SQL + JavaScript on BigQuery. Simplicity, Scale, Performance. Coming Soon! Contact your Sales Rep for early access.
     (JavaScript is a trademark or registered trademark of Oracle in the U.S. and other countries.)
  25. BigQuery Storage
     Fast Table Reads • Parallel MapReduce or equivalent operating over BigQuery tables
     Optimized for SQL • Database-like structure, with advanced SQL
     SQL + JavaScript User Defined Functions • Powerful, simple, fast!
  26. Key Drivers For Pricing Changes
     Customer and Partner Feedback • Reduced price for lower barrier to entry • Predictability in costs
     New Usage Models • Data warehousing • Streaming ingestion • UDFs coming soon
  27. Lower...
     Lower Query: $5/TB (from $35/TB), an 85% price reduction
     Lower Streaming: $0.01 per 100K rows (from $0.01 per 10K rows), a 90% price reduction
     Lower Storage: $26/TB/month (from $80/TB/month), a 65% price reduction
  28. On-Demand Query Pricing
     Existing usage model • Ideal for interactive, ad-hoc analysis of Big Data • Leverages shared pool of resources • Concurrent query quota applies
     Pricing • Flat $5/TB processed (reduced from $35/TB) • No contracts. Pay as you go!
  29. Reserved Capacity Pricing: consistent performance, no concurrent query quotas, predictable costs
     Capacity Reservations • Increments of 5 GB per second of query throughput • Larger, consistent workloads • Monthly commitment
     Reservation Sizing • 5 GB per second: $20,000 per month • ~13,000 TB of processing per month • 95% reduction off the current $35/TB price
  30. Example Customer Scenario
     • Customer data: 50 TB stored
     • Average query size: 2.5 TB (5%), kept small by: columnar data (process only the columns referenced by the query), table partitioning combined with Table Wildcards (see the sketch after this slide), and Table Decorators for accessing only recent data
     • Number of analysts: 2
     • Queries per analyst per day: 75
     • Estimated total processing: 2.5 TB * 150 queries = 375 TB per day
     How much reserved capacity is needed?
     • Reservation: 5 GB per second
     • Total capability: 5 GB per second * 86,400 seconds per day = 432 TB per day
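     A hedged sketch (not from the deck) of the partitioning idea: with one table per day, the legacy-SQL wildcard function TABLE_DATE_RANGE scans only the days a query needs; the daily tables named events_YYYYMMDD are hypothetical:

       -- Scan one week of daily tables instead of all history
       SELECT COUNT(*) AS weekly_events
       FROM TABLE_DATE_RANGE([mydataset.events_],
                             TIMESTAMP('2014-04-17'),
                             TIMESTAMP('2014-04-24'));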
  31. Reserved Capacity Model: consistency, with flexibility
     Bursting beyond your reservation • BigQuery’s multi-tenant architecture enables the optional ability to leverage On-Demand resources
     Scenarios • Volatility in workload throughout the day • End of quarter financials • Product launch increases load
     Consistency, with Flexibility • Predictable costs • Guaranteed capacity • Additional resources when needed
  32. Reserved Capacity – “No Burst” (chart): query throughput over one day against the 5 GB per second reserved capacity; once the maximum is reached, queries are slowed. Total query processing = 375 TB.
  33. Reserved Capacity – “Burst with Cap” (chart): query throughput over one day; queries above the 5 GB per second reservation spill into On-Demand burst. Total query processing = 375 TB.
  34. Reserved Capacity – “Burst with Cap” (chart): query throughput over one day with On-Demand burst above the 5 GB per second reservation; processing halts once the 430 TB daily maximum is reached. Total query processing = 430 TB.
  35. Reserved Capacity – “Burst with No Cap” (chart): query throughput over one day; On-Demand burst above the 5 GB per second reservation with no daily cap. Total query processing = 510 TB.
  36. BigQuery Reserved + On-Demand
     Scenario #1 “No Burst” • Users experienced slower queries during peak utilization • Usage was throttled, which is helpful in that it avoids the risk of using the entire budget in a short time • TB processed: 375 TB
     Scenario #2 “Burst with Cap” • Queries above Reserved Capacity leveraged On-Demand, so no slowdown for users • TB processed: 375 TB • On-Demand cost: $0
     Scenario #3 “Burst with Cap” • Queries above Reserved Capacity leveraged On-Demand, but were halted after the daily cap was hit • TB processed: 430 TB (the maximum amount possible for a 5 GB per second reservation) • On-Demand cost: $0
     Scenario #4 “Burst with No Cap” • Queries above Reserved Capacity leveraged On-Demand, and total usage continued to grow • TB processed: 510 TB (80 TB more than the 430 TB reservation maximum) • On-Demand cost: $400 (80 TB * $5 per TB On-Demand price)
  37. Conclusion: Google BigQuery is your easy-to-use Cloud analytics platform
     • Fully managed, lower TCO
     • High performance & scalability
     • Rich query capability
     • Predictable costs with flexibility
     • Standards support for easy adoption
     • Interactive and batch workloads