Slide 1

Slide 1 text

September 24, 2017 Hello Getting started with BigQuery

Slide 2

Slide 2 text

Who am I? William Gómez Engineer at Huge (Full stack wannabe web engineer at Huge) Python Medellin meetups co-organizer Google Cloud Certified Data Engineer Freelancer Machine learning lover Data science FEM - Mentor

Slide 3

Slide 3 text

Agenda. 1. BigQuery 2. Features 3. Architecture 4. Use cases 5. Costs 6. Performance 7. Examples

Slide 4

Slide 4 text

BigQuery 1

Slide 5

Slide 5 text

What is it? BigQuery is Google's fully managed, petabyte scale, low cost analytics data warehouse. Remarkable features: - NoOps - Pay-as-you-go model - Securely share insights. - Streaming ingestion captures and real time analyses. - Analyze up to 1 TB of data and store 10 GB of data for free each month.

Slide 6

Slide 6 text

Performance evolution Three years ago: Last year: Enabling clustering Google Next 19 Youtube video: watch?v=eOQ3YJKgvHE

Slide 7

Slide 7 text

Traditional data warehousing

Slide 8

Slide 8 text

Modern data warehousing

Slide 9

Slide 9 text

Required eyebrow.

Slide 10

Slide 10 text

Modern data warehousing

Slide 11

Slide 11 text

Modern data warehousing

Slide 12

Slide 12 text

BigQuery ML

Slide 13

Slide 13 text

AutoML Tables vs. BigQuery ML

Slide 14

Slide 14 text

Customers

Slide 15

Slide 15 text

Features 2

Slide 16

Slide 16 text

Features Serverless: Google does all resource provisioning. Real-time analytics: (Near real-time) Automatic high availability: Replicated storage in multiple locations Standard SQL: Reduces code rewrites (Columnar database) Federated query and logical data warehousing: Several external sources Storage and compute separation - Every column is stored in a separated file. Meant for immutable pretty large datasets. (Not a transactional DB)

Slide 17

Slide 17 text

Features Automatic backup and easy restore: Seven-day history of changes. Geospatial data types and functions: SQL support for GeoJSON and WKT. Data transfer service Big data ecosystem integration: Dataproc and Dataflow connectors Petabyte scale Flexible pricing models Data governance and security: Security with fine-grained identity

Slide 18

Slide 18 text

Additional Features BigQuery ML (beta) BigQuery GIS

Slide 19

Slide 19 text

Architecture 3

Slide 20

Slide 20 text

Data warehousing architecture

Slide 21

Slide 21 text

Use cases 4

Slide 22

Slide 22 text

Internet of Things Google Cloud IoT is a complete set of tools to connect, process, store, and analyze data both at the edge and in the cloud. Use cases: 1. Predictive maintenance 2. Real-time asset tracking 3. Logistics & supply chain management 4. Smart Cities & Buildings

Slide 23

Slide 23 text

Optimizing Large-Scale Ingestion of Analytics Events and Logs Benefits: 1. Log integrity. No logs are lost due to streaming quota limits or sampling. 2. Cost reduction. Logs inserted in Cloud Storage using batch jobs. 3. Reserved query resources. Moving lower-priority logs to batch loading. Cold path: Batch process. Hot path: Streaming input.

Slide 24

Slide 24 text

Large-scale events and log analytics Hot path: Immediate analysis, e.g., an event might indicate undesired client behavior or bad actors Cold path: Events that need to be tracked and analyzed on an hourly or daily basis

Slide 25

Slide 25 text

Large-scale events and log analytics Hot path: Critical logs. Cold path: Don't require near real-time analysis

Slide 26

Slide 26 text

Datasets 5

Slide 27

Slide 27 text

Commercial Datasets Commercial data providers are accelerating your time to insight by hosting their data offerings directly in BigQuery, Cloud Storage and Cloud Pub/Sub. Examples: • AccuWeather: min/max temperatures, precipitation, and snowfall • Dow Jones: historical News Snapshots • HouseCanary: historical home price • Xignite: financial market data

Slide 28

Slide 28 text

Public datasets A public dataset is any dataset that is stored in BigQuery and made available to the general public through the Google Cloud Public Dataset Program.

Slide 29

Slide 29 text

Costs 6

Slide 30

Slide 30 text

Storage, inserts BigQuery charges for data storage, streaming inserts, and for querying data, but loading and exporting data are free of charge

Slide 31

Slide 31 text

Querying data When querying data, you can choose from two different pricing options:

Slide 32

Slide 32 text

Performance 7

Slide 33

Slide 33 text

Tricks to increase the performance: Faster and cheap • Denormalizing -> Repeating fields in the data to gain processing performance. • Always check how many bytes are you reading. • How many bytes are you sending to the next stage. • How many bytes are you writing? • Dont use: SELECT * , What are the relevant columns? • Filter as early as possible. Use WHERE clauses as early as possible. • JOINs => Bigger JOIN first smaller JOIN later. • Check if there is a reasonable approximate functions. • Sorting => Outer most query.

Slide 34

Slide 34 text

Example 8

Slide 35

Slide 35 text

Specify an schema BigQuery allows you to specify a table's schema when you load data into a table, and when you create an empty table. When you specify a table schema, you must supply each column's name and data type. You may optionally supply a column's description and mode. A column name cannot use any of the following prefixes: • _TABLE_ • _FILE_ • _PARTITION

Slide 36

Slide 36 text

Data types Integer: Numeric values without fractional components Floating point: Approximate numeric values with fractional components Numeric: Exact numeric values with fractional components Boolean: TRUE or FALSE (case insensitive) String: Variable-length character (Unicode) data Bytes: Variable-length binary data Stackoverflow: what-is-the-difference- between-numeric-and-float-in-bigquery

Slide 37

Slide 37 text

Data types Date: A logical calendar date Datetime: A year, month, day, hour, minute, second, and subsecond Time: A time, independent of a specific date Timestamp: An absolute point in time, with microsecond precision Struct (Record): Container of ordered fields each with a type (required) and field name (optional) Geography: A pointset on the Earth's surface (a set of points, lines and polygons on the WGS84 reference spheroid, with geodesic edges)

Slide 38

Slide 38 text

Loading data in BigQuery Let’s load a CSV.

Slide 39

Slide 39 text

Loading data in BigQuery

Slide 40

Slide 40 text

Some queries Let’s get all records for 2014: Let’s sum the USD by month:

Slide 41

Slide 41 text

Some queries Let’s get all records for 2014:

Slide 42

Slide 42 text

Data Studio Let’s get all records for 2014:

Slide 43

Slide 43 text

Data Studio

Slide 44

Slide 44 text

Mobility data

Slide 45

Slide 45 text

Process Data collection Brainstorming Which questions can be answered with the data collected? Examples: • How does climate affect the way people move? • Is there any relationship between weather and accidents? • Is there a relationship between speed and accidents? • Is traffic linked to pollution? Open datasets Accidents data (2014 - 2018) Weather data Pressure, humidity, wind, temperature and precipitation (2014-2018) Traffic and alerts data Waze data (2017-2018)

Slide 46

Slide 46 text

Process Storage, cleaning, and transformation

Slide 47

Slide 47 text

Insights A correlation between accidents events and weather data for certain hours.

Slide 48

Slide 48 text

Insights The increase of accidents in hours where there are peaks of traffic.

Slide 49

Slide 49 text

Social Networks Username: williamegomezo

Slide 50

Slide 50 text

Done. September 24, 2017 Getting started with BigQuery