DDM's Journey into Big Data with BigQuery with KSL and Deseret News

This talk is part use-case and part intro into BigQuery.

Three years ago, Deseret Digital Media had a serious problem: they had a lot of data in Google Analytics and Google Ad Manager (then called DFP), but couldn't run the in-depth analysis they needed because of limitations in those tools' reporting. They had crucial questions they couldn't answer, so they did what everyone else was doing: guessing and hoping for the best.

Then one day Justin discovered BigQuery, and the rest is history. Without having to build an entire data-warehousing team, Justin was able to easily ingest 30 GB of data a day into BigQuery. They went from vague reports to precise queries, and DDM's reports ran incredibly fast. They soon hooked BigQuery up to Data Studio and provided insights to the entire company.

Justin will cover the basics of BigQuery and share the best strategies for getting started with it. BigQuery is fast, simple, and affordable, and you're going to love learning more about it.


Justin Carmony

November 06, 2018

Transcript

  1. Justin Carmony - Sr. Director of Engineering - GCP SLC Meetup - Nov 2018 - DDM’s Journey into Big Data with BigQuery for KSL and Deseret News
  2. Justin Carmony, Sr. Director of Engineering - KSL.com News & Analytics - Deseret Digital Media
  3. None
  4. About Presentation • Feel free to ask on-topic questions • Q&A at the end • Always feel free to email/tweet me (jcarmony@deseretdigital.com, @JustinCarmony) • Will post slides online
  5. Who are you?

  6. Experience w/ “Big Data” • Lots of experience? • Used it a little bit? • Tinkered with it? • Totally new?
  7. Data Size? • 1’s Gigabytes • 100’s Gigabytes • 1’s Terabytes • 100’s Terabytes • 1+ Petabytes
  8. Data Size? • 1’s Gigabytes • 100’s Gigabytes • 1’s Terabytes • 100’s Terabytes • 1+ Petabytes (DDM Nov 2017)
  9. Data Size? • 1’s Gigabytes • 100’s Gigabytes • 1’s Terabytes • 100’s Terabytes • 1+ Petabytes (DDM Nov 2017, Nov 2018)
  10. Let’s Start w/ a Story

  11. None
  12. None
  13. None
  14. None
  15. Avg. 22 Million Ad Impressions Per Day

  16. Avg. 29 GB Data Per Day

  17. How did we view this data?

  18. None
  19. Problems w/ Just DFP Reporting • Ran into limitations • Unable to perform complex filters • Initial report -> CSV -> Excel, slow for large datasets • Reporting was very labor-intensive and manual • Difficult to join with other datasets (i.e. Google Analytics data)
  20. DFP Solution: Data Transfer Files!

  21. None
  22. None
  23. None
  24. None
  25. None
  26. None
  27. None
  28. None
  29. First few failed attempts • Wrote some PHP scripts to read CSV and create reports • This was SLOW, and required a developer for new reports • Import a single month into MySQL • Hahahahahahahaha… yeah, no, MySQL died trying. (note: I’m sure I could have gotten this to work with an advanced, complex setup.)
  30. Use Big Data! ( Duh… )

  31. None
  32. Running it Ourselves • Complex setup • Large upfront investment • Always running (means always $$$)
  33. I just want my data …

  34. None
  35. What is BigQuery? • Enterprise data warehouse • Fully managed service • Incredibly fast, scalable, and cost effective • Based on Dremel, Google’s internal big data tool • Uses SQL for querying data
  36. What does it cost? Storage: • $0.02 per GB per month • $0.01 per GB per month for long-term storage (data that hasn’t changed in 90 days) • $0.05 per GB for streaming inserts • First 10 GB free each month. Analysis: • $5 per TB of data processed • First 1 TB of analysis free each month
  37. What doesn’t cost money? • Running / maintaining a cluster • Bulk importing of data (this is huge!!!) • Copying data • Exporting data
  38. What makes BigQuery Special • Extremely cost effective (extremely cheap) • Incredibly fast • Full SQL:2011 compliance • Advanced data types: records, repeated fields
  39. Our Journey, 3 Phases: Crawl - Walk - Run

  40. Crawl https://www.flickr.com/photos/ndanger/4425407800/

  41. Getting Started • Create a Google Cloud Platform project • Enable the BigQuery API
  42. BigQuery Basics • Create a Dataset - a collection of tables & views • Create a Table w/ a defined schema • Two options for loading in data: imports & streaming inserts
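The dataset-and-table setup above can be sketched with BigQuery's standard SQL DDL (the project, dataset, and column names here are hypothetical, not from the talk):

```sql
-- A dataset is a collection of tables & views; CREATE SCHEMA creates one.
CREATE SCHEMA `my-project.ad_data`;

-- A table with a defined schema, ready for import jobs or streaming inserts.
CREATE TABLE `my-project.ad_data.impressions` (
  event_time TIMESTAMP,
  ad_unit    STRING,
  revenue    NUMERIC
);
```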
  43. None
  44. Import vs Streaming • Import jobs: • Free • 1,000 loads per day for a single table • 50,000 loads per day per project • CSV, newline-delimited JSON, Avro • Streaming inserts: • $0.05 per GB • Real-time (in my experience, within 3 seconds) • Stream through the API
  45. First Attempt Loading • Used a homegrown Node.js script to ETL files • Generated newline-delimited JSON files • Initially used repeated records, switched to JSON strings • Now, with Standard SQL and a better understanding, I’d go back and use repeated records again • Biggest hangup: data type conflicts
  46. Lessons Learned Importing • Newline-delimited JSON files are largest • If you can, use Avro: • Allows for compression and attaching the schema • Imports are extra fast • You can import a wildcard list of files: • Instead of gs://example-bucket/upload.avro • Use multiple files like gs://example-bucket/upload-*
  47. Lessons Learned Importing • Use unique names for imports • Bad: gs://example/daily-import-* • Better: gs://example/daily-import-2018-01-01-x8fd62/part-*
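A wildcard load from a uniquely named prefix can be sketched with BigQuery's `LOAD DATA` SQL statement (a newer alternative to the 2018-era load-job API; the table name here is hypothetical):

```sql
-- Load every Avro part file under a uniquely named import prefix.
-- Avro carries its own schema and compresses well, so imports are fast.
LOAD DATA INTO `ddm-example.ads.impressions`
FROM FILES (
  format = 'AVRO',
  uris = ['gs://example/daily-import-2018-01-01-x8fd62/part-*']
);
```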
  48. None
  49. None
  50. How Query Billing Works • You are billed for the size of the columns your query reads. Period. • What doesn’t impact billing: • Time to run • Number of rows returned • Complexity • Anything else…
  51. SELECT id FROM `ddm-example.users` WHERE status = 'active'

      id | username       | status | fav_icecream
      1  | justin_carmony | active | pralines and cream
      2  | brett_atkinson | active | birthday cake
      3  | greg_dolan     | active | chocolate
      4  | mike_peterson  | active | cookies and cream
  52. Same table and query: only the columns the query references (id and status) are scanned and billed
  53. 28.46 TB * $5.00 per TB = $142.30

  54. 0.61 TB * $5.00 per TB = $3.05
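The on-demand billing math above is simple enough to sketch in a few lines of Python (the $5-per-TB rate is the talk's 2018 pricing and may have changed since):

```python
# On-demand query pricing: billed per TB of column data scanned.
# Rate as quoted in the talk (2018); check current BigQuery pricing.
PRICE_PER_TB_USD = 5.00

def query_cost_usd(tb_processed: float) -> float:
    """Cost of an on-demand query that scanned `tb_processed` TB."""
    return tb_processed * PRICE_PER_TB_USD

print(f"${query_cost_usd(28.46):.2f}")  # scanning every column of a wide table
print(f"${query_cost_usd(0.61):.2f}")   # selecting only the needed columns
```

Selecting only the columns you need is the single biggest cost lever, as the two slides above show.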

  55. None
  56. None
  57. None
  58. None
  59. 23,398,994,173

  60. Table Partitioning • Option 1, multiple tables: • `tablename_YYYYMMDD` • Use _TABLE_SUFFIX to select a range, e.g. `dataset.table_201810*` • Pros: defined boundaries • Option 2, single partitioned table: • On import, define the row’s partition date • Beta: clustered tables • Pros: treat it as a single table, with a WHERE clause to limit query scope
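Selecting a date range across sharded tables with `_TABLE_SUFFIX` might look like this (table and column names are hypothetical):

```sql
-- Scan only the October 2018 shards of a date-sharded table;
-- shards outside the suffix range are never read, so never billed.
SELECT ad_unit, COUNT(*) AS impressions
FROM `ddm-example.ads.impressions_2018*`
WHERE _TABLE_SUFFIX BETWEEN '1001' AND '1031'
GROUP BY ad_unit;
```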
  61. At This Point… • We had our advertising data in BigQuery • Using the BigQuery UI to execute ad-hoc queries • Started to get great insights into our data! • It was amazing!
  62. But We Needed to Get Better! • All analysis had to go through me • For every question answered, 10 more questions! • “How can we expose this to the rest of the organization?”
  63. Walk https://www.flickr.com/photos/thomasleuthard/12087829163/

  64. None
  65. Example Dashboard Dummy Data

  66. None
  67. None
  68. Using More BigQuery Features • Scheduled queries to create summary and subset tables • Views to create derived fields for other BI tools • Streaming inserts + Google Cloud Functions for tracking browser timings on DeseretNews.com • Built internal analytics tools
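A view that pre-computes fields for a BI tool could be as simple as the following sketch (all names are hypothetical):

```sql
-- A view exposing a daily revenue rollup to Data Studio or another BI tool;
-- analysts query the view without touching the raw impressions table.
CREATE VIEW `ddm-example.reporting.daily_ad_revenue` AS
SELECT
  DATE(event_time) AS day,
  ad_unit,
  SUM(revenue)     AS revenue
FROM `ddm-example.ads.impressions`
GROUP BY day, ad_unit;
```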
  69. Run https://www.flickr.com/photos/80517909@N04/30600455558/

  70. Running with BigQuery • Teach analysts SQL • Started to use RStudio & Python notebooks to streamline analysis • Create Data Studio dashboards for every single product • Migrate internal analytics tools to BigQuery • Using user-defined functions for advanced analysis
  71. Where We Are Today • 106 TB of data • 69.7 billion rows • 21k tables • 600+ TB of analysis
  72. Demo Time

  73. Fly https://www.flickr.com/photos/80517909@N04/30600455558/