Google BigQuery: processing big data without the infrastructure

Slide 1

Slide 1 text

Google BigQuery
Processing big data without the infrastructure
Omer Dawelbeit | @omerio | +OmerDawelbeit

Slide 2

Slide 2 text

What is Big Data?
● Big data is data that is so large and complex that it becomes difficult to process using traditional tools.
● Widely characterised in terms of three dimensions (3Vs):
  ○ Volume: massive datasets (terabytes, petabytes,...)
  ○ Variety: different kinds of data (structured, semi-structured and unstructured).
  ○ Velocity: the rate at which the data is generated, from batch to real time.

Slide 3

Slide 3 text

What is BigQuery?
● Cloud-based service for the interactive query/analysis of massive datasets.
● No need for infrastructure setup, indexes, partitioning, db tuning, clustering, replication, sharding, etc.
● Pay per data processed: the first terabyte each month is free, then $5 per terabyte, plus a monthly storage cost.
● Queries use an SQL-like syntax.
● You just need your data and queries (data + SQL).
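The pay-per-data-processed model can be illustrated with a little arithmetic. This is a minimal sketch using only the figures quoted on the slide (first terabyte free each month, then $5 per terabyte); the function name and the use of decimal terabytes are assumptions, and storage cost is left out because the slide gives no figure for it.

```python
FREE_TB_PER_MONTH = 1.0   # first terabyte processed each month is free (from the slide)
PRICE_PER_TB_USD = 5.0    # price per terabyte beyond the free tier (from the slide)
BYTES_PER_TB = 10 ** 12   # assumption: decimal terabytes


def monthly_query_cost(bytes_processed: int) -> float:
    """Estimate the monthly query cost in USD, excluding storage."""
    tb_processed = bytes_processed / BYTES_PER_TB
    billable_tb = max(0.0, tb_processed - FREE_TB_PER_MONTH)
    return billable_tb * PRICE_PER_TB_USD


if __name__ == "__main__":
    for tb in (0.5, 1, 3.5):
        cost = monthly_query_cost(int(tb * BYTES_PER_TB))
        print(f"{tb} TB processed -> ${cost:.2f}")
```

Anything at or below the free terabyte costs nothing under these assumptions; only the excess is billed.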

Slide 4

Slide 4 text

What is BigQuery?
● Provides append-only table storage.
● Supports storing nested datasets.
● Supports a range of datatypes: String, Integer, Float, Boolean, Timestamp, Record.
● Data can be streamed in at a rate of 100,000 rows per second.
● Can be accessed using a web UI, a command-line tool or the BigQuery REST API.
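To make the datatypes and nested storage more concrete, here is a hedged sketch of a table schema written as plain Python data. The field layout (name, type, mode, fields) follows the JSON schema shape used by the BigQuery REST API as best understood here; the table itself (a customer with repeated addresses) and the helper function are hypothetical.

```python
# Hypothetical schema using each datatype named on the slide.
# A RECORD field holds nested fields; mode REPEATED allows many values per row.
CUSTOMER_SCHEMA = [
    {"name": "id", "type": "INTEGER", "mode": "REQUIRED"},
    {"name": "name", "type": "STRING"},
    {"name": "score", "type": "FLOAT"},
    {"name": "active", "type": "BOOLEAN"},
    {"name": "created", "type": "TIMESTAMP"},
    {"name": "addresses", "type": "RECORD", "mode": "REPEATED", "fields": [
        {"name": "city", "type": "STRING"},
        {"name": "postcode", "type": "STRING"},
    ]},
]


def column_paths(fields, prefix=""):
    """Flatten a (possibly nested) schema into dotted leaf column paths."""
    paths = []
    for field in fields:
        path = prefix + field["name"]
        if field["type"] == "RECORD":
            # Recurse into the nested fields of the record.
            paths.extend(column_paths(field["fields"], path + "."))
        else:
            paths.append(path)
    return paths


if __name__ == "__main__":
    print(column_paths(CUSTOMER_SCHEMA))
```

The dotted paths (for example addresses.city) are the leaf columns that a nested record contributes to the table.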

Slide 5

Slide 5 text

Dremel
● BigQuery is based on Dremel, a technology used inside Google (analysis of crawled web documents, spam analysis, OCR results from Google Books, etc.).
● Dremel is based on columnar storage and is able to process complex nested data.
● Can process trillion-record, multi-terabyte datasets at interactive speed.
● Interoperates with Google's data management tools (MapReduce, GFS, etc.).
● Outperforms MapReduce by an order of magnitude on nested data.

Slide 6

Slide 6 text

Dremel

Traditional row storage (RDBMS):

Id  Name        Homepage          Bio
10  John Smith  http://blah.com   blah blah
30  John Doe    http://somewhere  blah blah

Query: Select Name from Customer Where Bio like '%blah%';

Columnar storage databases:

Id:        10, 30
Name:      John Smith, John Doe
Homepage:  http://blah.com, http://somewhere
Bio:       blah blah, blah blah
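The difference between the two layouts can be sketched in a few lines of Python. This is an illustration only, not how Dremel is implemented: the sample records come from the slide, the substring check stands in for like '%blah%', and the "values read" counter is a simplified assumption about what each layout has to touch.

```python
# Sample records from the slide, held the way a row store keeps them.
ROWS = [
    {"Id": 10, "Name": "John Smith", "Homepage": "http://blah.com", "Bio": "blah blah"},
    {"Id": 30, "Name": "John Doe", "Homepage": "http://somewhere", "Bio": "blah blah"},
]

# The same data pivoted so that each column is stored contiguously.
COLUMNS = {name: [row[name] for row in ROWS] for name in ROWS[0]}


def query_row_store(rows):
    """Scan whole records: every field of every row is read."""
    values_read = sum(len(row) for row in rows)
    names = [row["Name"] for row in rows if "blah" in row["Bio"]]
    return names, values_read


def query_column_store(columns):
    """Scan only the two columns the query mentions (Name and Bio)."""
    values_read = len(columns["Name"]) + len(columns["Bio"])
    names = [name for name, bio in zip(columns["Name"], columns["Bio"]) if "blah" in bio]
    return names, values_read


if __name__ == "__main__":
    print(query_row_store(ROWS))
    print(query_column_store(COLUMNS))
```

Both functions return the same names, but the columnar version never touches Id or Homepage, which is also why pricing by column size makes sense for this kind of storage.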

Slide 7

Slide 7 text

Dremel

Root server:
Select Name from Customer Where Bio like '%blah%'
T={/gfs/1, /gfs/2,...,/gfs/100000}

Intermediate servers:
Select Name from Customer Where Bio like '%blah%'
T={/gfs/1, /gfs/2,...,/gfs/1000}

Leaf servers (1000s):
Select Name from Customer Where Bio like '%blah%'
T={/gfs/1}
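The serving tree on this slide can be approximated as a simple fan-out and gather. The sketch below is a toy model under stated assumptions: tablets are in-memory lists rather than GFS files, the servers are plain functions that run sequentially, and the record for "Jane Roe" is a hypothetical non-matching row added for illustration.

```python
# Each inner list stands in for one tablet (for example /gfs/1).
TABLETS = [
    [{"Name": "John Smith", "Bio": "blah blah"}],
    [{"Name": "Jane Roe", "Bio": "nothing here"}],   # hypothetical non-matching row
    [{"Name": "John Doe", "Bio": "blah blah"}],
]


def leaf_server(tablet):
    """Scan a single tablet and return the matching names."""
    return [row["Name"] for row in tablet if "blah" in row["Bio"]]


def intermediate_server(tablets):
    """Send each tablet to a leaf server and merge the partial results."""
    results = []
    for tablet in tablets:
        results.extend(leaf_server(tablet))
    return results


def root_server(tablets, fanout):
    """Split the full tablet list into groups, one per intermediate server."""
    results = []
    for start in range(0, len(tablets), fanout):
        results.extend(intermediate_server(tablets[start:start + fanout]))
    return results


if __name__ == "__main__":
    print(root_server(TABLETS, fanout=2))
```

In the real system the leaves work in parallel across thousands of machines; the sequential loops here only show how the tablet set shrinks at each level of the tree.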

Slide 8

Slide 8 text

Importing Data
Data can be imported from:
● Objects in Google Cloud Storage.
● Data posted in insert jobs or streaming inserts.
● Google Cloud Datastore backup.
● Data needs to be in CSV or JSON format. JSON can be used for nested data.
● Destination table and schema can be specified on the load request.
(Diagram: Cloud Datastore; Cloud Storage; direct import/stream)
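As a rough illustration of a load request that names its destination table and schema, the sketch below builds a load job body as a Python dictionary and serialises nested rows as newline-delimited JSON. The field names follow the BigQuery REST API v2 load configuration as best understood here and should be checked against the API reference; the project, dataset, table and bucket names are hypothetical, and nothing is sent over the network.

```python
import json


def build_load_job(project_id, dataset_id, table_id, source_uri, schema_fields):
    """Build the body of a load job that reads JSON from Cloud Storage."""
    return {
        "configuration": {
            "load": {
                "sourceUris": [source_uri],
                "sourceFormat": "NEWLINE_DELIMITED_JSON",
                "schema": {"fields": schema_fields},
                "destinationTable": {
                    "projectId": project_id,
                    "datasetId": dataset_id,
                    "tableId": table_id,
                },
            }
        }
    }


def to_ndjson(rows):
    """Serialise rows as one JSON object per line (suitable for nested data)."""
    return "\n".join(json.dumps(row) for row in rows)


if __name__ == "__main__":
    schema = [
        {"name": "id", "type": "INTEGER"},
        {"name": "tags", "type": "STRING", "mode": "REPEATED"},
    ]
    job = build_load_job("my-project", "my_dataset", "events",
                         "gs://my-bucket/events.json", schema)
    print(json.dumps(job, indent=2))
    print(to_ndjson([{"id": 1, "tags": ["a", "b"]}, {"id": 2, "tags": []}]))
```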

Slide 9

Slide 9 text

Running Queries
● Pricing is based on the sizes of the columns processed.
● BigQuery SQL supports regular expressions, data aggregation, sorting, etc.
● Supports table decorators, for example query table data @<time> for a snapshot at a point in time, or @<time1>-<time2> for table data added between time1 and time2.
● Query results are automatically saved into temporary tables (cached for 24 hours).
● Users can save query results into named tables.
(Figure label: 1 billion rows)
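Table decorators are easier to follow with concrete values. The sketch below assumes the decorator form used by BigQuery's original (legacy) SQL dialect, where times are given in milliseconds since the Unix epoch and the table reference is wrapped in square brackets; the helper names and the table name mydataset.events are hypothetical.

```python
from datetime import datetime, timezone


def to_millis(moment):
    """Convert a timezone-aware datetime to milliseconds since the Unix epoch."""
    return int(moment.timestamp() * 1000)


def snapshot_decorator(table, at):
    """Reference the table as it was at a single point in time."""
    return f"[{table}@{to_millis(at)}]"


def range_decorator(table, start, end):
    """Reference only the table data added between start and end."""
    return f"[{table}@{to_millis(start)}-{to_millis(end)}]"


if __name__ == "__main__":
    start = datetime(2015, 1, 1, tzinfo=timezone.utc)
    end = datetime(2015, 1, 2, tzinfo=timezone.utc)
    print("SELECT COUNT(*) FROM " + snapshot_decorator("mydataset.events", start))
    print("SELECT COUNT(*) FROM " + range_decorator("mydataset.events", start, end))
```

Because a range decorator restricts the query to recently added data, it can also reduce the amount of data processed and therefore the cost.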

Slide 10

Slide 10 text

Integration
● Developers can integrate with BigQuery using the BigQuery client libraries (.NET, Java, Objective-C, Python, etc.).
● Easy access from Compute Engine VMs (service accounts).
● Similar to integration with other Google APIs, so code can be reused.
● Authentication uses OAuth2.
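For readers who want to see what the client libraries wrap, here is a hedged sketch that prepares (but does not send) a query request against the REST API using only the Python standard library. The endpoint path reflects the v2 synchronous query method as best understood here; the project id and the access token are placeholders, and obtaining a real OAuth2 token (for example from a service account) is outside this sketch.

```python
import json
import urllib.request

API_ROOT = "https://www.googleapis.com/bigquery/v2"  # assumed API root


def build_query_request(project_id, sql, access_token):
    """Prepare an authenticated POST request that would run a query."""
    url = f"{API_ROOT}/projects/{project_id}/queries"
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            # OAuth2 access tokens are passed as a bearer token.
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    request = build_query_request(
        "my-project",                      # hypothetical project id
        "SELECT Name FROM Customer",       # hypothetical table
        "PLACEHOLDER_TOKEN",               # not a real token
    )
    print(request.get_method(), request.full_url)
    # Sending it would be: urllib.request.urlopen(request)
```

In practice the official client libraries handle token refresh, retries and result paging, so a raw request like this is mainly useful for understanding what happens underneath.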

Slide 11

Slide 11 text

How Powerful is BigQuery?
● Shine with BigQuery: The 30 Terabyte challenge (https://www.youtube.com/watch?v=LSLU8Gxt-rc).
● A query with regular expressions, grouping and nested queries on 30 terabytes of data in 30 billion rows of information.
● Query scans 6.24 TB in 5.6 minutes!
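The headline figure translates into a scan rate with a quick calculation. This sketch assumes decimal units (1 TB = 1000 GB) and uses only the two numbers quoted on the slide; the function name is made up for illustration.

```python
def scan_throughput_gb_per_s(terabytes, minutes):
    """Average scan rate in gigabytes per second (decimal units assumed)."""
    gigabytes = terabytes * 1000
    seconds = minutes * 60
    return gigabytes / seconds


if __name__ == "__main__":
    rate = scan_throughput_gb_per_s(6.24, 5.6)
    print(f"{rate:.1f} GB/s")
```

Under these assumptions the query averages roughly 18.6 GB scanned per second.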

Slide 12

Slide 12 text

Why use BigQuery?
BigQuery vs. Hadoop vs. Amazon Redshift
● Your data and your queries, nothing else.
● No upfront cost, infrastructure setup, complex configuration or development.
● Ability to support real-time data ingestion using streaming.
● Rapid prototyping of informational dashboards, for example with App Engine applications.
● Interactive analysis of massive datasets (compared to batch analysis).

Slide 13

Slide 13 text

Example Use Case
● Tracking of millions of products in warehouses and vehicles globally using RFID.
(Diagram: mobile data; local RFID reader; aggregation server; App Engine applications for decisioning, alerting, rule-based planning, etc.)

Slide 14

Slide 14 text

Example Use Case
● Power BI dashboards as a result of aggregation and analysis of massive enterprise data using BigQuery.
(Diagram: transactions data; ERP/CRM data; analytics data)

Slide 15

Slide 15 text

Resources
● DevBytes - What is BigQuery? (https://www.youtube.com/watch?v=aupC-Wj7XDY)
● Querying Massive Datasets using Google BigQuery (https://www.youtube.com/watch?v=1vOAzXYo6Eg)
● BigQuery Getting Started Documentation (https://cloud.google.com/bigquery/sign-up)
● Query Reference (https://cloud.google.com/bigquery/query-reference)
● BigQuery Sample Tables (https://cloud.google.com/bigquery/docs/sample-tables?hl=en)
● BigQuery API Documentation (https://cloud.google.com/bigquery/client-libraries)
● Shine with BigQuery: The 30 Terabyte challenge (https://www.youtube.com/watch?v=LSLU8Gxt-rc)

Slide 16

Slide 16 text

Questions
You can also email [email protected]
Stay in touch
Google+: +OmerDawelbeit
Twitter: @omerio
Slides: https://goo.gl/NvPxJQ

Slide 17

Slide 17 text

References
● Melnik, S., Gubarev, A. and Long, J., 2010. Dremel: interactive analysis of web-scale datasets. Proceedings of the VLDB Endowment, 3(1-2), pp.330–339. (http://research.google.com/pubs/pub36632.html)
● Sato, K., 2012. An Inside Look at Google BigQuery. White paper. (https://cloud.google.com/files/BigQueryTechnicalWP.pdf)