simple data analytics. We use Dremel = Google BigQuery

  SELECT top(appId, 20) AS app, count(*) AS count
  FROM installlog.2012
  ORDER BY count DESC

Scans 100B rows in ~20 sec, no index used
rows

  SELECT language, SUM(views) AS views
  FROM (
    SELECT title, language, MAX(views) AS views
    FROM [helixdata2:benchmark.Wiki100B]
    WHERE REGEXP_MATCH(title, "G.*o.*")
    GROUP EACH BY title, language
  )
  GROUP EACH BY language
  ORDER BY views DESC
Example: RegEx + GROUP BY on 100B rows
• 4 TB of data read
• 100 billion regular expressions run
• 278 GB shuffled
With a single server (estimated numbers):
• 11.6 hours to read 4 TB from disk
• 27 hours to run 100 billion regexps
• 37 minutes to shuffle 278 GB
*These figures are an example of Google BigQuery performance in a specific case, but do not represent a performance guarantee
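The single-server estimates above can be sanity-checked with simple arithmetic. This sketch converts each slide figure into the per-second rate it implies (assuming decimal units, 1 TB = 1e12 bytes); the resulting rates are plausible for one machine, which is what makes the 20-second distributed result striking.

```python
# Back-of-envelope check: turn each single-server estimate
# into the per-second throughput it implies.

def implied_rate(total_work, seconds):
    """Units of work completed per second."""
    return total_work / seconds

read_mbps = implied_rate(4e12, 11.6 * 3600) / 1e6   # bytes/s -> MB/s
regex_per_s = implied_rate(100e9, 27 * 3600)        # regexps/s
shuffle_mbps = implied_rate(278e9, 37 * 60) / 1e6   # bytes/s -> MB/s

print(f"disk read: {read_mbps:.0f} MB/s")     # ~96 MB/s
print(f"regex:     {regex_per_s:,.0f}/s")     # ~1M regexps/s
print(f"shuffle:   {shuffle_mbps:.0f} MB/s")  # ~125 MB/s
```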
Fast Aggregation by Tree Structure
[diagram: mixer tree fanning out to leaf shards reading ColumnIO on Colossus; each shard computes a partial COUNT(*) GROUP BY state]

  SELECT state, COUNT(*) AS count_babies
  WHERE year >= 1980 AND year < 1990
  GROUP BY state
  ORDER BY count_babies DESC
  LIMIT 10
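The tree aggregation in the diagram can be sketched in a few lines. This is a hypothetical illustration, not BigQuery's actual code: each leaf shard scans only its own rows and emits a small partial COUNT(*) GROUP BY state table, mixers merge those partials on the way up, and the root applies the final ORDER BY ... LIMIT 10 to a handful of tiny dictionaries rather than 100B rows.

```python
from collections import Counter

def leaf_aggregate(rows):
    """One shard: partial COUNT(*) GROUP BY state, WHERE filter applied locally."""
    return Counter(r["state"] for r in rows if 1980 <= r["year"] < 1990)

def mixer_merge(partials):
    """Mixer/root node: merge partial counts from its children."""
    merged = Counter()
    for p in partials:
        merged.update(p)
    return merged

def root_top10(partials):
    """Root: final ORDER BY count DESC LIMIT 10."""
    return mixer_merge(partials).most_common(10)

# Toy data spread across three shards
shards = [
    [{"state": "CA", "year": 1985}, {"state": "TX", "year": 1975}],
    [{"state": "CA", "year": 1982}, {"state": "NY", "year": 1989}],
    [{"state": "TX", "year": 1981}],
]
print(root_top10(leaf_aggregate(s) for s in shards))
```

Only the small per-shard Counters travel up the tree, which is why the fan-in stays cheap no matter how many leaf shards scan data in parallel.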
How to use BigQuery?
[diagram: Import → Analyze → Export]
• Import: Google Analytics, ETL tools, Connectors, Google Cloud
• Export: BI tools and Visualization, Google Cloud, Spreadsheets, R, Hadoop
Benefits of BigQuery
• Scans 100B rows in ~20 sec
• Low Cost: storage $0.020 per GB per month; queries $5 per TB
• Fully Managed: use thousands of servers with zero-ops
• SQL: simple and intuitive SQL
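The pricing model above makes monthly cost easy to estimate: storage scales with bytes kept, queries with bytes scanned. A rough sketch using only the two list prices on this slide (real billing has more detail, e.g. free tiers and long-term storage discounts):

```python
# List prices from the slide
STORAGE_PER_GB_MONTH = 0.020  # $/GB stored per month
QUERY_PER_TB = 5.0            # $/TB scanned by queries

def monthly_cost(stored_gb, queried_tb):
    """Rough monthly bill: storage + query scanning."""
    return stored_gb * STORAGE_PER_GB_MONTH + queried_tb * QUERY_PER_TB

# e.g. 500 GB stored and 2 TB scanned in a month
print(f"${monthly_cost(500, 2):.2f}")  # $20.00
```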
Pipeline
…of data flow, just like:
• MapReduce/Hadoop jobs
• Apache Spark jobs
Multiple Inputs and Outputs:
• Cloud Storage
• BigQuery tables
• Cloud Datastore
• Cloud Pub/Sub
• Cloud Bigtable
• … and any external systems
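The pipeline idea above can be sketched in plain Python. This is an illustration only, not the Cloud Dataflow SDK (all class and method names here are made up): a pipeline is a chain of transforms that can fan in from several inputs and fan out to several outputs, like a MapReduce or Spark job graph.

```python
class Pipeline:
    """Toy pipeline: merge inputs, chain transforms, fan out to sinks."""

    def __init__(self, *sources):
        # Fan-in: merge multiple inputs (e.g. Cloud Storage files + Pub/Sub messages)
        self.data = [row for src in sources for row in src]

    def map(self, fn):
        self.data = [fn(x) for x in self.data]
        return self

    def filter(self, pred):
        self.data = [x for x in self.data if pred(x)]
        return self

    def write(self, *sinks):
        # Fan-out: deliver the same result to multiple outputs
        for sink in sinks:
            sink.extend(self.data)
        return self

storage_input = ["3", "1", "4"]   # stand-in for Cloud Storage lines
pubsub_input = ["1", "5"]         # stand-in for Pub/Sub messages
table_out, log_out = [], []       # stand-ins for a BigQuery table and a log sink

Pipeline(storage_input, pubsub_input).map(int).filter(lambda n: n > 1).write(table_out, log_out)
print(table_out)  # [3, 4, 5]
```

The chained-transform shape is the point: the same `map`/`filter` chain runs unchanged whichever sources and sinks are plugged in at the ends.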
Benefits of Cloud Dataflow
• Scalability, fault tolerance, and optimization
• Batch + Streaming: integrates the two paradigms with a single logic
• Easy Pipeline Design: use Java/Python to define an abstract pipeline
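The "single logic for batch and streaming" point can be illustrated with ordinary Python (again a sketch, not the Dataflow SDK): define the transform once, then feed it either a bounded batch or an unbounded-style stream without changing the logic.

```python
from collections import Counter

def count_words(records):
    """The single piece of pipeline logic, written once."""
    counts = Counter()
    for line in records:   # works for lists and generators alike
        counts.update(line.split())
    return counts

batch = ["a b a", "b c"]   # bounded input (batch mode)

def stream():              # unbounded-style input (streaming mode)
    yield "a b a"
    yield "b c"

assert count_words(batch) == count_words(stream())
print(count_words(batch))  # Counter({'a': 2, 'b': 2, 'c': 1})
```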