Google Bigquery, processing big data without the infrastructure

Google Bigquery, processing big data without the infrastructure

This presentation covers the basics of Google Bigquery and how developers can leverage its capabilities in analysing BI (Business Intelligence) and IoT (Internet of Things) data.

81f81af7cc5d1e767303b83bd76a1e46?s=128

Omer Dawelbeit

August 05, 2015
Tweet

Transcript

  1. Google BigQuery Processing big data without the infrastructure Omer Dawelbeit

    | @omerio | +OmerDawelbeit
  2. What is Big Data? • Big data is data that

    is so large and complex that it becomes difficult to process using traditional tools. • Widely characterised in terms of three dimensions (3Vs): ◦ Volume: massive datasets (terabytes, petabytes,...) ◦ Variety: different kinds of data (structured, semi- structured and unstructured). ◦ Velocity: the rate at which the data is generated from batch to real time.
  3. What is BigQuery? • Cloud based service for the interactive

    query/analysis of massive datasets. • No need for infrastructure setup, indexes, partitioning, db tuning, clustering, replication, sharding, etc… • Pay per data processed, 1st Terabyte free per month is free then $5 per Terabyte. Plus month storage cost. • Queries using SQL like syntax. • You just need your data and queries. + SQL
  4. What is BigQuery? • Provide an append-only table storage. •

    Supports storing nested datasets. • Supports a range of datatypes: String, Integer, Float, Boolean, Timestamp, Record. • Data can be streamed in at a rate of 100,000 rows per second. • Can be accessed using a web UI, command-line tool or through the BigQuery REST API.
  5. Dremel • BigQuery is based on Dremel, a technology used

    inside Google (analysis of crawled web documents, spam analysis, OCR results from Google Books, etc…). • Dremel is based on Columnar storage and is able to process complex nested data. • Can process Trillion-record, multi-terabyte datasets at interactive speed. • Interoperates with Google’s data management tools (MapReduce, GFS, etc…) • Outperforms MapReduce by an order of magnitude on nested data.
  6. Dremel Id Name Homepage Bio 10 John Smith http://blah.com blah

    blah 30 John Doe http://somewhere blah blah Traditional row storage RDBMS Select Name from Customer Where Bio like ‘%blah%’; Id 10 30 Name John Smith John Doe Homepage http://blah.com http://somewhere Bio blah blah blah blah Columnar storage databases
  7. Dremel Intermediate servers Root server Leaf servers (1000s) Select Name

    from Customer Where Bio like ‘%blah%’ T={/gfs/1, gfs/2,...,/gfs/100000} Select Name from Customer Where Bio like ‘%blah%’ T={/gfs/1, gfs/2,...,/gfs/1000} Select Name from Customer Where Bio like ‘%blah%’ T={/gfs/1}
  8. Importing Data Data can be imported from: • Objects in

    Google Cloud Storage. • Data posted in insert jobs or streaming inserts. • Google Cloud Datastore backup. • Data needs to be in CSV or JSON format. JSON can be used for nested data. • Destination table and schema can be specified on the load request. Cloud Datastore Cloud Storage Direct import/stream
  9. Running Queries • Pricing based on sizes of columns processed.

    • BigQuery SQL supports regular expressions, data aggregation, sorting, etc… 1 billion rows • Supports table decorators, for example query table data @<time> or @<time1>-<time2> for table data added between time1 and time2. • Query results can are automatically saved into temp tables (cached for 24 hours). • Users can save query results into names tables.
  10. Integration • Developer can integrate with BigQuery using the BigQuery

    client libraries (.NET, Java, Objective-C, Python, etc…). • Easy access from Compute Engine VMs (service accounts). • Similar to integration with other Google APIs so can benefit from code re-use. • Authentication using OAuth2.
  11. How Powerful is BigQuery? • Shine with BigQuery: The 30

    Terabyte challenge (https://www.youtube. com/watch?v=LSLU8Gxt-rc). • A query with regular expressions, grouping, nested queries on 30 terabytes of data in 30 billion rows of information. • Query scans 6.24TB in 5.6 minutes!
  12. Why use BigQuery? BigQuery vs. Hadoop vs. Amazon Redshift •

    Your data and your queries, nothing else. • No upfront cost, infrastructure setup, complex configurations or development. • Ability to support real time data ingestion using streaming. • Rapid prototyping of informational dashboard, for example with App Engine applications. • Interactive analysis of massive datasets (compared to batch analysis).
  13. Example Use Case • Tracking of millions of products in

    warehouses and vehicles globally using RFID. Mobile data local RFID reader Aggregation server App Engine applications for Decisioning, alerting, rule- based planning, etc...
  14. Example Use Case • Power BI dashboards as a result

    of aggregation and analysis of massive enterprise data using BigQuery. Transactions data ERP/CRM data Analytics data
  15. Resources • DevBytes - What is BigQuery? (https://www.youtube.com/watch?v=aupC-Wj7XDY) • Querying

    Massive Datasets using Google BigQuery (https://www.youtube. com/watch?v=1vOAzXYo6Eg) • BigQuery Getting Started Documentation (https://cloud.google.com/bigquery/sign- up) • Query Reference (https://cloud.google.com/bigquery/query-reference) • BigQuery Sample Tables (https://cloud.google.com/bigquery/docs/sample-tables? hl=en) • BigQuery API Documentation (https://cloud.google.com/bigquery/client-libraries) • Shine with BigQuery: The 30 Terabyte challenge (https://www.youtube.com/watch? v=LSLU8Gxt-rc)
  16. Questions You can also email omer.dawelbeit@gmail.com Stay in touch Google+:

    +OmerDawelbeit Twitter: @omerio Slides: https://goo.gl/NvPxJQ
  17. References • Melnik, S., Gubarev, A. and Long, J., 2010.

    Dremel: interactive analysis of web-scale datasets. Proceedings of the VLDB Endowment, 3(1-2), pp.330–339. (http://research. google.com/pubs/pub36632.html) • K. Sato, ”An Inside Look at Google BigQuery,” White paper, https://cloud.google.com/files/BigQueryTechnicalWP.pdf, 2012.