Slide 1

Slide 1 text

Justin Carmony - Sr. Director of Engineering - GCP SLC Meetup - Nov 2018 DDM’s Journey into Big Data with BigQuery for KSL and Deseret News

Slide 2

Slide 2 text

Justin Carmony Sr. Director of Engineering
 KSL.com News & Analytics
 Deseret Digital Media

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

About Presentation • Feel free to ask questions on-topic • Q&A at the End • Always feel free to email/tweet me 
 ([email protected], @JustinCarmony) • Will post slides on-line

Slide 5

Slide 5 text

Who are you?

Slide 6

Slide 6 text

Experience w/ “Big Data” • Lots of Experience? • Used it a little bit? • Tinkered with it? • Totally new?

Slide 7

Slide 7 text

Data Size? • 1’s Gigabytes • 100’s Gigabytes • 1’s Terabytes • 100’s Terabytes • 1+ Petabytes

Slide 8

Slide 8 text

Data Size? • 1’s Gigabytes • 100’s Gigabytes • 1’s Terabytes • 100’s Terabytes • 1+ Petabytes DDM Nov 2017

Slide 9

Slide 9 text

Data Size? • 1’s Gigabytes • 100’s Gigabytes • 1’s Terabytes • 100’s Terabytes • 1+ Petabytes DDM Nov 2017 Nov 2018

Slide 10

Slide 10 text

Let’s Start w/ a Story

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

Avg. 22 Million Ad Impressions Per Day

Slide 16

Slide 16 text

Avg. 29 GB Data Per Day

Slide 17

Slide 17 text

How did we view this data?

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

Problems w/ Just DFP Reporting • Ran into Limitations • Unable to perform complex filters • Initial report -> csv -> Excel, slow for large datasets • Reporting was very labor intensive, manual • Difficult to join in w/ other Datasets (i.e. Google Analytics data)

Slide 20

Slide 20 text

DFP Solution: Data Transfer Files!

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

No content

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

Few first failed attempts • Wrote some PHP Scripts to read CSV, create reports • This was SLOW, and required developer for new reports • Import a single month into MySQL • Hahahahahahahaha… yeah, no, MySQL died trying.
 (note: I’m sure I could have gotten this to work with an advanced, complex setup.)

Slide 30

Slide 30 text

Use Big Data! ( Duh… )

Slide 31

Slide 31 text

No content

Slide 32

Slide 32 text

Running it Ourselves • Complex Setup • Large Upfront Investment • Always running (means always $$$)

Slide 33

Slide 33 text

I just want my data …

Slide 34

Slide 34 text

No content

Slide 35

Slide 35 text

What is BigQuery • Enterprise Data Warehouse • Fully Managed Service • Incredibly Fast, Scalable, and Cost Effective • Based off of Dremel, Google’s Internal Big Data tool • Uses SQL for querying data

Slide 36

Slide 36 text

What does it cost? Storage • $0.02 per GB per Month • $0.01 per GB per Month for LTS, data that hasn’t changed in 90 days • $0.05 per GB per Streaming Inserts • First 10 GB is free each month Analysis • $5 per TB of data processed • First 1 TB of Analysis Free

Slide 37

Slide 37 text

What doesn’t cost money? • Running / Maintaining a Cluster • Bulk Importing of data (this is huge!!!) • Copying Data • Exporting Data

Slide 38

Slide 38 text

What makes BigQuery Special • Extremely cost effective (extremely cheap) • Incredibly Fast • Full SQL 2011 Compliance • Advanced Data Types: Records, Repeated Fields

Slide 39

Slide 39 text

Our Journey, 3 Phases: Crawl - Walk - Run

Slide 40

Slide 40 text

Crawl https://www.flickr.com/photos/ndanger/4425407800/

Slide 41

Slide 41 text

Getting Started • Create Google Cloud Platform Project • Enable BigQuery API

Slide 42

Slide 42 text

BigQuery Basics • Create a Dataset - Collection of Tables & Views • Create a Table w/ a Defined Schema • Two options for loading in data: Imports & Streaming Inserts

Slide 43

Slide 43 text

No content

Slide 44

Slide 44 text

Import vs Streaming Import Jobs • Free • 1,000 loads per day for single Table • 50,000 loads per day for Project • CSV, New Line Delimited JSON, AVRO Streaming Inserts • $0.05 per GB • Real-time (my experience, within 3 seconds) • Stream through the API

Slide 45

Slide 45 text

First Attempt Loading • Used homegrown Node.js Script to ETL files • Generated New-line Delimited JSON files • Initially used repeated records, switched to JSON strings • Now with Standard SQL, and we understand it better, I’d go back and use repeated records again. • Biggest Hangup: Data Types Conflicts

Slide 46

Slide 46 text

Lessons Learned Importing • Newline Delimited JSON files are Largest • If you can, use AVRO: • Allows for compression, attaching the schema • Imports are extra fast • You can import a wildcard list of file: • Instead of gs://example-bucket/upload.avro • Use multiple files like gs://example-bucket/upload-*

Slide 47

Slide 47 text

Lessons Learned Importing • Use unique names of imports • Bad: gs://example/daily-import-* • Better: gs://example/daily-import-2018-01-01-x8fd62/part-*

Slide 48

Slide 48 text

No content

Slide 49

Slide 49 text

No content

Slide 50

Slide 50 text

How Query Billing Works • You are billed for the size of columns your query uses. Period. • What doesn’t impact billing: • Time to run • Number of rows returned • Complexity • Anything else….

Slide 51

Slide 51 text

id username status fav_icecream 1 justin_carmony active pralines and cream 2 brett_atkinson active birthday cake 3 greg_dolan active chocolate 4 mike_peterson active cookies and cream SELECT id FROM `ddm-example.users` WHERE status = 'active'

Slide 52

Slide 52 text

id username status fav_icecream 1 justin_carmony active pralines and cream 2 brett_atkinson active birthday cake 3 greg_dolan active chocolate 4 mike_peterson active cookies and cream SELECT id FROM `ddm-example.users` WHERE status = 'active'

Slide 53

Slide 53 text

28.46 TB * $5.00 per TB = $142.46

Slide 54

Slide 54 text

0.61 TB * $5.00 per TB = $3.05

Slide 55

Slide 55 text

No content

Slide 56

Slide 56 text

No content

Slide 57

Slide 57 text

No content

Slide 58

Slide 58 text

No content

Slide 59

Slide 59 text

23,398,994,173

Slide 60

Slide 60 text

Table Partition Multiple Tables • `tablename_YYYYMMDD` • _TABLE_SUFFIX to select range • `dataset.table_201810*` • Pros: • Defined boundaries Single Table • Treat as a single table • On import define row partition date • Beta: cluster tables • Pros: • Treat as a single table, WHERE statement to limit query scope.

Slide 61

Slide 61 text

At This Point … • We had our Advertising Data in BigQuery • Using BigQuery UI to execute ad-hoc queries • Started to get great insights into our data! • It was amazing!

Slide 62

Slide 62 text

But Needed to Get Better! • Only analysis could be done through me • For every question answered, 10 more questions! • “How can we expose this to the rest of the organization?”

Slide 63

Slide 63 text

Walk https://www.flickr.com/photos/thomasleuthard/12087829163/

Slide 64

Slide 64 text

No content

Slide 65

Slide 65 text

Example Dashboard Dummy Data

Slide 66

Slide 66 text

No content

Slide 67

Slide 67 text

No content

Slide 68

Slide 68 text

Using More BigQuery Features • Scheduled Queries to create Summaries, Subset Tables • Use Views to created fields for other BI Tools • Streaming Inserts + Google Cloud Functions for tracking browser timings on DeseretNews.com • Built Internal Analytics Tools

Slide 69

Slide 69 text

Run https://www.flickr.com/photos/80517909@N04/30600455558/

Slide 70

Slide 70 text

Running with BigQuery • Teach analysts SQL • Started to use RStudio & Python Notebooks to streamline analysis • Create Data Studio Dashboards for every single product • Migrate internal analytics tools to BigQuery • Using User Defined Functions for advanced analysis

Slide 71

Slide 71 text

Where we are at today • 106 TB of Data • 69.7 Billion Rows • 21k Tables • 600+ TB of Analysis

Slide 72

Slide 72 text

Demo Time

Slide 73

Slide 73 text

Fly https://www.flickr.com/photos/80517909@N04/30600455558/