Google BigQuery: processing big data without the infrastructure
This presentation covers the basics of Google BigQuery and how developers can leverage its capabilities for analysing BI (Business Intelligence) and IoT (Internet of Things) data.
Big Data: data that is so large and complex that it becomes difficult to process using traditional tools.
• Widely characterised in terms of three dimensions (the 3Vs):
◦ Volume: massive datasets (terabytes, petabytes, ...)
◦ Variety: different kinds of data (structured, semi-structured and unstructured)
◦ Velocity: the rate at which the data is generated, from batch to real time
BigQuery is a cloud service for the interactive query/analysis of massive datasets.
• No need for infrastructure setup, indexes, partitioning, DB tuning, clustering, replication, sharding, etc.
• Pay per data processed: the first terabyte per month is free, then $5 per terabyte, plus a monthly storage cost.
• Queries use an SQL-like syntax.
• You just need your data and your queries (data + SQL), as the sketch below illustrates.
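To make this concrete, here is a minimal sketch of running a query from Python. It assumes the google-cloud-bigquery client library, default credentials and a public sample dataset; none of these are prescribed by the slides.

# Minimal sketch: run an SQL query against a public BigQuery dataset.
# Assumes the google-cloud-bigquery library and application-default
# credentials (e.g. via `gcloud auth application-default login`).
from google.cloud import bigquery

client = bigquery.Client()  # picks up application-default credentials

sql = """
    SELECT word, SUM(word_count) AS total
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY word
    ORDER BY total DESC
    LIMIT 5
"""

# BigQuery bills per byte processed by the query, not per row returned.
for row in client.query(sql).result():
    print(row.word, row.total)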
• Supports storing nested datasets.
• Supports a range of datatypes: String, Integer, Float, Boolean, Timestamp, Record.
• Data can be streamed in at a rate of up to 100,000 rows per second (see the streaming sketch below).
• Can be accessed through a web UI, a command-line tool or the BigQuery REST API.
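A hedged sketch of a streaming insert using the Python client library's insert_rows_json call; the table and field names are hypothetical, chosen only for illustration.

# Sketch of a streaming insert; streamed rows become queryable within
# seconds, and any errors are reported per row.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.sensors.readings"  # hypothetical table

rows = [
    {"device_id": "dev-1", "temp": 21.5, "ts": "2014-06-01T12:00:00Z"},
    {"device_id": "dev-2", "temp": 19.8, "ts": "2014-06-01T12:00:01Z"},
]

errors = client.insert_rows_json(table_id, rows)
if errors:
    print("Insert errors:", errors)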
• Dremel, the engine behind BigQuery, has been in production use inside Google since 2006 (analysis of crawled web documents, spam analysis, OCR results from Google Books, etc.).
• Dremel is based on columnar storage and is able to process complex nested data.
• Can process trillion-record, multi-terabyte datasets at interactive speed.
• Interoperates with Google's data management tools (MapReduce, GFS, etc.).
• Outperforms MapReduce by an order of magnitude on nested data.
Row vs. columnar storage for the query: Select Name from Customer Where Bio like '%blah%'

Traditional row storage (RDBMS) keeps whole records together:
Id | Name       | Homepage         | Bio
10 | John Smith | http://blah.com  | blah blah
30 | John Doe   | http://somewhere | blah blah

Columnar storage databases keep each column together:
Id:       10, 30
Name:     John Smith, John Doe
Homepage: http://blah.com, http://somewhere
Bio:      blah blah, blah blah

The query only has to read the Name and Bio columns, so far less data is scanned (a toy sketch follows).
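A toy Python illustration (plain Python, not BigQuery code) contrasting the two layouts; the data mirrors the example table above.

# The same customers stored row-wise and column-wise.
row_store = [
    {"Id": 10, "Name": "John Smith", "Homepage": "http://blah.com", "Bio": "blah blah"},
    {"Id": 30, "Name": "John Doe", "Homepage": "http://somewhere", "Bio": "blah blah"},
]

column_store = {
    "Id": [10, 30],
    "Name": ["John Smith", "John Doe"],
    "Homepage": ["http://blah.com", "http://somewhere"],
    "Bio": ["blah blah", "blah blah"],
}

# SELECT Name FROM Customer WHERE Bio LIKE '%blah%':
# row storage must read every whole record; the columnar layout
# only touches the Name and Bio columns.
names = [name for name, bio in zip(column_store["Name"], column_store["Bio"])
         if "blah" in bio]
print(names)  # ['John Smith', 'John Doe']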
[Diagram: Dremel's serving tree. The root server receives Select Name from Customer Where Bio like '%blah%' over T = {/gfs/1, /gfs/2, ..., /gfs/100000}, intermediate servers handle rewritten partial queries over smaller tablet sets such as T = {/gfs/1, /gfs/2, ..., /gfs/1000}, and leaf servers scan individual tablets, e.g. T = {/gfs/1}.]
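A conceptual Python sketch of the scatter-gather idea behind the serving tree; this illustrates the concept only and is not Dremel's actual implementation.

# Each level splits its tablet list among children until leaves scan
# single tablets; partial results are merged on the way back up.
def scan_tablet(tablet):
    """Leaf server: scan one tablet and return matching names."""
    return [row["Name"] for row in tablet if "blah" in row["Bio"]]

def serve(tablets, fanout=10):
    """Root/intermediate server: fan the tablet list out to children,
    recurse, and merge the partial results."""
    if len(tablets) == 1:
        return scan_tablet(tablets[0])
    chunk = max(1, len(tablets) // fanout)
    results = []
    for i in range(0, len(tablets), chunk):
        results.extend(serve(tablets[i:i + chunk], fanout))
    return results

tablets = [[{"Name": "John Smith", "Bio": "blah blah"}],
           [{"Name": "John Doe", "Bio": "blah blah"}]]
print(serve(tablets))  # ['John Smith', 'John Doe']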
Data can be loaded from several sources:
• Google Cloud Storage.
• Data posted in insert (load) jobs or streaming inserts.
• Google Cloud Datastore backups.
• Data needs to be in CSV or JSON format; JSON can be used for nested data.
• The destination table and schema can be specified on the load request (a load-job sketch follows).
[Diagram: data flows into BigQuery from Cloud Datastore, Cloud Storage, or direct import/stream.]
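A hedged sketch of a load job pulling a CSV file from Cloud Storage, using the Python client library; the bucket, table and schema are hypothetical.

# Load a CSV file from Google Cloud Storage into a BigQuery table,
# specifying the destination schema on the load request.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    schema=[
        bigquery.SchemaField("device_id", "STRING"),
        bigquery.SchemaField("temp", "FLOAT"),
        bigquery.SchemaField("ts", "TIMESTAMP"),
    ],
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/readings.csv",   # hypothetical source file
    "my-project.sensors.readings",   # hypothetical destination table
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish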
• BigQuery SQL supports regular expressions, data aggregation, sorting, etc.
• Supports table decorators: query table data @<time>, or @<time1>-<time2> for table data added between time1 and time2 (see the sketch below).
• Query results are automatically saved into temporary tables (cached for 24 hours).
• Users can save query results into named tables.
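A hedged sketch of a table-decorator query whose result is saved into a named table, assuming the Python client library; decorators require BigQuery's legacy SQL dialect, and the table names here are hypothetical.

# Query only the rows added in the last hour and persist the result.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig(
    use_legacy_sql=True,  # table decorators are a legacy-SQL feature
    destination="my-project.sensors.recent_readings",  # named result table
)

# @-3600000- is a relative range decorator: rows added in the last
# 3,600,000 ms (one hour) up to now.
sql = "SELECT device_id, temp FROM [my-project:sensors.readings@-3600000-]"
client.query(sql, job_config=job_config).result()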
• Accessible from many languages through the Google API client libraries (.NET, Java, Objective-C, Python, etc.).
• Easy access from Compute Engine VMs (service accounts).
• Integration is similar to other Google APIs, so code can be re-used.
• Authentication uses OAuth2 (see the sketch below).
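A hedged sketch of authenticating with OAuth2 service-account credentials from Python; the key file path is hypothetical.

# Explicit OAuth2 service-account authentication, for code running
# outside Google's infrastructure.
from google.cloud import bigquery
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    "/path/to/key.json")  # hypothetical key file
client = bigquery.Client(credentials=credentials,
                         project=credentials.project_id)

# On a Compute Engine VM with a service account attached, the default
# constructor bigquery.Client() is enough.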
Demo: the Terabyte challenge (https://www.youtube.com/watch?v=LSLU8Gxt-rc).
• A query with regular expressions, grouping and nested queries over 30 terabytes of data in 30 billion rows of information.
• The query scans 6.24 TB in 5.6 minutes!
• Your data and your queries, nothing else.
• No upfront cost, infrastructure setup, complex configuration or development.
• Supports real-time data ingestion using streaming.
• Rapid prototyping of informational dashboards, for example with App Engine applications.
• Interactive analysis of massive datasets (as opposed to batch analysis).
Example use case: tracking goods across warehouses and vehicles globally using RFID.
[Diagram: mobile data and local RFID readers feed an aggregation server, which streams into BigQuery; App Engine applications on top provide decisioning, alerting, rule-based planning, etc.]
A sketch of the aggregation step follows.
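Purely as an illustration of the aggregation step, a hedged Python sketch of a server batching RFID reads and streaming them into BigQuery; all names and the schema are hypothetical.

# Aggregation server: collect RFID read events and stream them in batches.
from google.cloud import bigquery

client = bigquery.Client()
TABLE = "my-project.logistics.rfid_reads"  # hypothetical table

def flush(batch):
    """Stream a batch of RFID read events into BigQuery."""
    if batch:
        errors = client.insert_rows_json(TABLE, batch)
        if errors:
            print("Insert errors:", errors)
        batch.clear()

batch = []
events = [{"tag": "tag-42", "reader": "warehouse-7",
           "ts": "2014-06-01T12:00:00Z"}]  # stand-in for a live feed
for event in events:
    batch.append(event)
    if len(batch) >= 500:  # batch rows to stay well under streaming limits
        flush(batch)
flush(batch)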
References:
• S. Melnik et al., "Dremel: Interactive Analysis of Web-Scale Datasets," Proceedings of the VLDB Endowment, 3(1-2), pp. 330-339, 2010. (http://research.google.com/pubs/pub36632.html)
• K. Sato, "An Inside Look at Google BigQuery," white paper, 2012. (https://cloud.google.com/files/BigQueryTechnicalWP.pdf)