Google BigQuery, processing big data without the infrastructure

This presentation covers the basics of Google BigQuery and how developers can use it to analyse BI (Business Intelligence) and IoT (Internet of Things) data.

Omer Dawelbeit

August 05, 2015

Transcript

  1. Google BigQuery
    Processing big data without the
    infrastructure
    Omer Dawelbeit | @omerio | +OmerDawelbeit

  2. What is Big Data?
    ● Big data is data that is so large and
    complex that it becomes difficult to process
    using traditional tools.
    ● Widely characterised in terms of three dimensions (3Vs):
    ○ Volume: massive datasets (terabytes, petabytes,...)
    ○ Variety: different kinds of data (structured, semi-
    structured and unstructured).
    ○ Velocity: the rate at which the data is generated from
    batch to real time.

  3. What is BigQuery?
    ● Cloud based service for the interactive query/analysis of
    massive datasets.
    ● No need for infrastructure setup, indexes, partitioning, db
    tuning, clustering, replication, sharding, etc…
    ● Pay per data processed: the first terabyte per month is free,
    then $5 per terabyte, plus a monthly storage cost.
    ● Queries using SQL like syntax.
    ● You just need your data and queries.
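The pricing bullet on this slide can be turned into a quick estimate. This is a minimal sketch using the 2015 figures quoted on the slide (first terabyte per month free, then $5 per terabyte); the function name `monthly_query_cost` and its parameters are hypothetical, and storage cost is deliberately left out.

```python
def monthly_query_cost(tb_processed, free_tb=1.0, price_per_tb=5.0):
    """Estimate the monthly query bill in dollars (storage excluded).

    Defaults are the figures quoted on the slide, not current pricing.
    """
    # Only data processed beyond the free allowance is billable.
    billable_tb = max(0.0, tb_processed - free_tb)
    return billable_tb * price_per_tb


if __name__ == "__main__":
    for tb in (0.5, 1.0, 11.0):
        print(f"{tb} TB processed -> ${monthly_query_cost(tb):.2f}")
```

Under these assumptions, a month with 11 TB processed is billed for 10 TB.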

  4. What is BigQuery?
    ● Provides append-only table
    storage.
    ● Supports storing nested datasets.
    ● Supports a range of datatypes:
    String, Integer, Float, Boolean,
    Timestamp, Record.
    ● Data can be streamed in at a rate of 100,000 rows per second.
    ● Can be accessed using a web UI, command-line tool or
    through the BigQuery REST API.

  5. Dremel
    ● BigQuery is based on Dremel, a technology
    used inside Google (analysis of crawled web
    documents, spam analysis, OCR results from
    Google Books, etc…).
    ● Dremel is based on Columnar storage and is able to process
    complex nested data.
    ● Can process trillion-record, multi-terabyte datasets at
    interactive speed.
    ● Interoperates with Google’s data management tools
    (MapReduce, GFS, etc…)
    ● Outperforms MapReduce by an order of magnitude on nested
    data.

  6. Dremel
    Traditional row storage RDBMS:
    Id  Name        Homepage          Bio
    10  John Smith  http://blah.com   blah blah
    30  John Doe    http://somewhere  blah blah

    Select Name from Customer Where Bio like '%blah%';

    Columnar storage databases:
    Id        10               30
    Name      John Smith       John Doe
    Homepage  http://blah.com  http://somewhere
    Bio       blah blah        blah blah
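The difference between the two layouts on this slide can be sketched in a few lines. This is an illustration only, not how Dremel is implemented: the same two customers are held once as a list of rows and once as one list per column, and the slide's query is run against each. With the columnar layout the query reads only the Name and Bio columns and never touches Id or Homepage.

```python
# Row layout: each record is stored together.
rows = [
    {"Id": 10, "Name": "John Smith", "Homepage": "http://blah.com", "Bio": "blah blah"},
    {"Id": 30, "Name": "John Doe", "Homepage": "http://somewhere", "Bio": "blah blah"},
]

# Column layout: each column is stored together.
columns = {key: [row[key] for row in rows] for key in rows[0]}


def query_rows(rows):
    """Select Name ... Where Bio like '%blah%' over whole rows."""
    return [row["Name"] for row in rows if "blah" in row["Bio"]]


def query_columns(columns):
    """Same query, but only the Name and Bio columns are read."""
    return [
        name
        for name, bio in zip(columns["Name"], columns["Bio"])
        if "blah" in bio
    ]


if __name__ == "__main__":
    print(query_rows(rows))
    print(query_columns(columns))
```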

  7. Dremel
    Root server:
      Select Name from Customer Where Bio like '%blah%'
      T={/gfs/1, /gfs/2,...,/gfs/100000}
    Intermediate servers:
      Select Name from Customer Where Bio like '%blah%'
      T={/gfs/1, /gfs/2,...,/gfs/1000}
    Leaf servers (1000s):
      Select Name from Customer Where Bio like '%blah%'
      T={/gfs/1}
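The serving tree on this slide can be imitated with a toy model, assuming each leaf scans a single tablet and each level simply concatenates the results from the level below. The function names, the `fanout` parameter and the sample tablets are all made up for illustration; a real tree runs the levels in parallel across many machines.

```python
def leaf_scan(tablet):
    """Leaf server: scan one tablet and return matching names."""
    return [name for name, bio in tablet if "blah" in bio]


def intermediate_server(tablets):
    """Intermediate server: hand one tablet to each leaf, merge results."""
    names = []
    for tablet in tablets:
        names.extend(leaf_scan(tablet))
    return names


def root_server(tablets, fanout):
    """Root server: split the tablet list into groups of `fanout`."""
    names = []
    for start in range(0, len(tablets), fanout):
        names.extend(intermediate_server(tablets[start:start + fanout]))
    return names


# Hypothetical tablets, each a list of (Name, Bio) pairs.
tablets = [
    [("John Smith", "blah blah")],
    [("Jane Roe", "nothing here")],
    [("John Doe", "blah blah")],
]

if __name__ == "__main__":
    print(root_server(tablets, fanout=2))
```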

  8. Importing Data
    Data can be imported from:
    ● Objects in Google Cloud Storage.
    ● Data posted in insert jobs or
    streaming inserts.
    ● Google Cloud Datastore backup.
    ● Data needs to be in CSV or JSON
    format. JSON can be used for
    nested data.
    ● Destination table and schema can
    be specified on the load request.
    (Diagram labels: Cloud Datastore; Cloud Storage; direct import/stream.)
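To make the JSON option concrete: load jobs expect JSON as one object per line (newline-delimited), and nested data maps to a RECORD field in the schema. The sketch below uses only the standard library to produce such a file body and a matching schema in BigQuery's JSON schema layout. The field names and sample records are hypothetical.

```python
import json

# Hypothetical schema: "address" is a nested RECORD with two sub-fields.
schema = [
    {"name": "id", "type": "INTEGER", "mode": "REQUIRED"},
    {"name": "name", "type": "STRING"},
    {"name": "address", "type": "RECORD", "fields": [
        {"name": "city", "type": "STRING"},
        {"name": "country", "type": "STRING"},
    ]},
]

# Hypothetical sample records matching the schema above.
records = [
    {"id": 10, "name": "John Smith", "address": {"city": "London", "country": "UK"}},
    {"id": 30, "name": "John Doe", "address": {"city": "Paris", "country": "FR"}},
]


def to_newline_delimited_json(records):
    """Serialise each record as one JSON object per line."""
    return "\n".join(json.dumps(record, sort_keys=True) for record in records)


if __name__ == "__main__":
    print(to_newline_delimited_json(records))
    print(json.dumps(schema, indent=2))
```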

  9. Running Queries
    ● Pricing based on sizes of
    columns processed.
    ● BigQuery SQL supports
    regular expressions, data
    aggregation, sorting, etc…
    ● Supports table decorators, for example query table
    data @<time> for a snapshot, or @<time1>-<time2> for table data
    added between time1 and time2.
    ● Query results are automatically saved into temporary
    tables (cached for 24 hours).
    ● Users can save query results into named tables.
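As an illustration of the table decorators mentioned on this slide, the helpers below build the bracketed table references used in BigQuery's original (legacy) SQL dialect, where times are given in milliseconds, either absolute since the epoch or negative for "relative to now". The helper names and the `mydataset.events` table are hypothetical.

```python
ONE_HOUR_MS = 60 * 60 * 1000


def snapshot_decorator(table, time_ms):
    """Reference the table as it was at a single point in time."""
    return f"[{table}@{time_ms}]"


def range_decorator(table, start_ms, end_ms=None):
    """Reference data added between start_ms and end_ms.

    Leaving end_ms out gives an open-ended range (up to now).
    """
    end = "" if end_ms is None else str(end_ms)
    return f"[{table}@{start_ms}-{end}]"


if __name__ == "__main__":
    # Data added during the last hour.
    table_ref = range_decorator("mydataset.events", -ONE_HOUR_MS)
    print(f"SELECT COUNT(*) FROM {table_ref}")
```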

  10. Integration
    ● Developers can integrate with BigQuery using the BigQuery
    client libraries (.NET, Java, Objective-C, Python, etc…).
    ● Easy access from Compute Engine VMs (service accounts).
    ● Integration is similar to other Google APIs, so existing
    code can be re-used.
    ● Authentication using OAuth2.
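For readers who want to see what the integration looks like without a client library, the sketch below builds (but does not send) a synchronous query request against the BigQuery v2 REST API using only the standard library. It assumes an OAuth2 access token has already been obtained elsewhere; the project id, token and function name are placeholders.

```python
import json
import urllib.request


def build_query_request(project_id, sql, access_token):
    """Build a POST request for the BigQuery v2 jobs.query endpoint."""
    url = f"https://www.googleapis.com/bigquery/v2/projects/{project_id}/queries"
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            # The OAuth2 access token is sent as a bearer token.
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    request = build_query_request("my-project", "SELECT 1", "PLACEHOLDER_TOKEN")
    print(request.get_method(), request.full_url)
    # Sending it would be: urllib.request.urlopen(request)
```

In practice the client libraries listed on the slide wrap this call and handle token refresh, so a hand-built request like this is mainly useful for understanding what goes over the wire.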

  11. How Powerful is BigQuery?
    ● Shine with BigQuery: The 30 Terabyte
    challenge (https://www.youtube.com/watch?v=LSLU8Gxt-rc).
    ● A query with regular expressions,
    grouping, nested queries on 30 terabytes
    of data in 30 billion rows of information.
    ● Query scans 6.24TB in 5.6 minutes!
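The scan figure on this slide can be put in perspective with a little arithmetic. The sketch below converts it to a throughput, assuming decimal units (1 TB = 1000 GB); the function name is hypothetical.

```python
def scan_throughput_gb_per_s(terabytes, minutes):
    """Average scan rate in GB/s, assuming 1 TB = 1000 GB."""
    return terabytes * 1000 / (minutes * 60)


if __name__ == "__main__":
    rate = scan_throughput_gb_per_s(6.24, 5.6)
    print(f"{rate:.1f} GB/s")
```

With the slide's numbers this works out to roughly 18.6 GB per second.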

  12. Why use BigQuery?
    BigQuery vs. Hadoop vs. Amazon Redshift
    ● Your data and your queries, nothing else.
    ● No upfront cost, infrastructure setup, complex configurations
    or development.
    ● Ability to support real time data ingestion using streaming.
    ● Rapid prototyping of informational dashboards, for example
    with App Engine applications.
    ● Interactive analysis of massive datasets (compared to batch
    analysis).

  13. Example Use Case
    ● Tracking of millions of products in warehouses
    and vehicles globally using RFID.
    (Diagram labels: mobile data; local RFID reader; aggregation
    server; App Engine applications for decisioning, alerting,
    rule-based planning, etc.)
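One way the aggregation server in this use case might hand RFID readings to BigQuery is through streaming inserts. The sketch below only builds the request body in the shape used by the REST API's tabledata.insertAll method; it does not send anything. The reading fields (`reading_id`, `tag_id`, `reader_id`, `seen_at`) are invented for illustration and are not taken from the deck.

```python
import json


def build_insert_all_body(readings):
    """Wrap readings in a tabledata.insertAll request body.

    Each row carries an insertId, which BigQuery uses for
    best-effort de-duplication of retried inserts.
    """
    return {
        "kind": "bigquery#tableDataInsertAllRequest",
        "rows": [
            {"insertId": reading["reading_id"], "json": reading}
            for reading in readings
        ],
    }


# Hypothetical RFID readings collected by an aggregation server.
readings = [
    {"reading_id": "r-001", "tag_id": "TAG-42", "reader_id": "dock-7",
     "seen_at": "2015-08-05 10:15:00"},
    {"reading_id": "r-002", "tag_id": "TAG-43", "reader_id": "van-12",
     "seen_at": "2015-08-05 10:15:02"},
]

if __name__ == "__main__":
    print(json.dumps(build_insert_all_body(readings), indent=2))
```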

  14. Example Use Case
    ● Power BI dashboards as a result of aggregation and
    analysis of massive enterprise data using BigQuery.
    (Diagram labels: transactions data; ERP/CRM data; analytics data.)

  15. Resources
    ● DevBytes - What is BigQuery? (https://www.youtube.com/watch?v=aupC-Wj7XDY)
    ● Querying Massive Datasets using Google BigQuery
    (https://www.youtube.com/watch?v=1vOAzXYo6Eg)
    ● BigQuery Getting Started Documentation
    (https://cloud.google.com/bigquery/sign-up)
    ● Query Reference (https://cloud.google.com/bigquery/query-reference)
    ● BigQuery Sample Tables
    (https://cloud.google.com/bigquery/docs/sample-tables?hl=en)
    ● BigQuery API Documentation (https://cloud.google.com/bigquery/client-libraries)
    ● Shine with BigQuery: The 30 Terabyte challenge
    (https://www.youtube.com/watch?v=LSLU8Gxt-rc)

  16. Questions
    You can also email [email protected]
    Stay in touch
    Google+: +OmerDawelbeit
    Twitter: @omerio
    Slides: https://goo.gl/NvPxJQ

  17. References
    ● Melnik, S., Gubarev, A. and Long, J., 2010. Dremel:
    interactive analysis of web-scale datasets. Proceedings of
    the VLDB Endowment, 3(1-2), pp.330–339.
    (http://research.google.com/pubs/pub36632.html)
    ● Sato, K., 2012. An Inside Look at Google BigQuery. White paper.
    (https://cloud.google.com/files/BigQueryTechnicalWP.pdf)
