Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Getting started with Google BigQuery

Getting started with Google BigQuery

Python Pereira

March 25, 2020
Tweet

More Decks by Python Pereira

Other Decks in Programming

Transcript

  1. Who am I? William Gómez Engineer at Huge (Full stack

    wannabe web engineer at Huge) Python Medellin meetups co-organizer Google Cloud Certified Data Engineer Freelancer Machine learning lover Data science FEM - Mentor
  2. What is it? BigQuery is Google's fully managed, petabyte scale,

    low cost analytics data warehouse. Remarkable features: - NoOps - Pay-as-you-go model - Securely share insights. - Streaming ingestion captures and real time analyses. - Analyze up to 1 TB of data and store 10 GB of data for free each month.
  3. Features Serverless: Google does all resource provisioning. Real-time analytics: (Near

    real-time) Automatic high availability: Replicated storage in multiple locations Standard SQL: Reduces code rewrites (Columnar database) Federated query and logical data warehousing: Several external sources Storage and compute separation - Every column is stored in a separated file. Meant for immutable pretty large datasets. (Not a transactional DB)
  4. Features Automatic backup and easy restore: Seven-day history of changes.

    Geospatial data types and functions: SQL support for GeoJSON and WKT. Data transfer service Big data ecosystem integration: Dataproc and Dataflow connectors Petabyte scale Flexible pricing models Data governance and security: Security with fine-grained identity
  5. Internet of Things Google Cloud IoT is a complete set

    of tools to connect, process, store, and analyze data both at the edge and in the cloud. Use cases: 1. Predictive maintenance 2. Real-time asset tracking 3. Logistics & supply chain management 4. Smart Cities & Buildings
  6. Optimizing Large-Scale Ingestion of Analytics Events and Logs Benefits: 1.

    Log integrity. No logs are lost due to streaming quota limits or sampling. 2. Cost reduction. Logs inserted in Cloud Storage using batch jobs. 3. Reserved query resources. Moving lower-priority logs to batch loading. Cold path: Batch process. Hot path: Streaming input.
  7. Large-scale events and log analytics Hot path: Immediate analysis, e.g.,

    an event might indicate undesired client behavior or bad actors Cold path: Events that need to be tracked and analyzed on an hourly or daily basis
  8. Large-scale events and log analytics Hot path: Critical logs. Cold

    path: Don't require near real-time analysis
  9. Commercial Datasets Commercial data providers are accelerating your time to

    insight by hosting their data offerings directly in BigQuery, Cloud Storage and Cloud Pub/Sub. Examples: • AccuWeather: min/max temperatures, precipitation, and snowfall • Dow Jones: historical News Snapshots • HouseCanary: historical home price • Xignite: financial market data
  10. Public datasets A public dataset is any dataset that is

    stored in BigQuery and made available to the general public through the Google Cloud Public Dataset Program.
  11. Storage, inserts BigQuery charges for data storage, streaming inserts, and

    for querying data, but loading and exporting data are free of charge
  12. Tricks to increase the performance: Faster and cheap • Denormalizing

    -> Repeating fields in the data to gain processing performance. • Always check how many bytes are you reading. • How many bytes are you sending to the next stage. • How many bytes are you writing? • Dont use: SELECT * , What are the relevant columns? • Filter as early as possible. Use WHERE clauses as early as possible. • JOINs => Bigger JOIN first smaller JOIN later. • Check if there is a reasonable approximate functions. • Sorting => Outer most query.
  13. Specify an schema BigQuery allows you to specify a table's

    schema when you load data into a table, and when you create an empty table. When you specify a table schema, you must supply each column's name and data type. You may optionally supply a column's description and mode. A column name cannot use any of the following prefixes: • _TABLE_ • _FILE_ • _PARTITION
  14. Data types Integer: Numeric values without fractional components Floating point:

    Approximate numeric values with fractional components Numeric: Exact numeric values with fractional components Boolean: TRUE or FALSE (case insensitive) String: Variable-length character (Unicode) data Bytes: Variable-length binary data Stackoverflow: what-is-the-difference- between-numeric-and-float-in-bigquery
  15. Data types Date: A logical calendar date Datetime: A year,

    month, day, hour, minute, second, and subsecond Time: A time, independent of a specific date Timestamp: An absolute point in time, with microsecond precision Struct (Record): Container of ordered fields each with a type (required) and field name (optional) Geography: A pointset on the Earth's surface (a set of points, lines and polygons on the WGS84 reference spheroid, with geodesic edges)
  16. Process Data collection Brainstorming Which questions can be answered with

    the data collected? Examples: • How does climate affect the way people move? • Is there any relationship between weather and accidents? • Is there a relationship between speed and accidents? • Is traffic linked to pollution? Open datasets Accidents data (2014 - 2018) Weather data Pressure, humidity, wind, temperature and precipitation (2014-2018) Traffic and alerts data Waze data (2017-2018)