Slide 1

Slide 1 text

Grounding Big Data
Joe Hellerstein, Vikram Sreekanti
UC Berkeley

Slide 2

Slide 2 text

REMEMBERING THE PAST
Data Warehouse
Single Source of Truth
Enterprise Information Architecture
Golden Master
…
Truth

Slide 3

Slide 3 text

REMEMBERING THE PAST “There is no point in bringing data … into the data warehouse environment without integrating it.” — Bill Inmon, Building the Data Warehouse Truth

Slide 4

Slide 4 text

Big data took us to a new world

Slide 5

Slide 5 text

Big data took us to a new world
There were changes in volume, velocity and variety, which were challenging.

Slide 6

Slide 6 text

Big data took us to a new world
There were changes in volume, velocity and variety, which were challenging.
The real challenge now is the meaning and value of data, which depend critically on context.

Slide 7

Slide 7 text

ground: A broader context for big data

Slide 8

Slide 8 text

OUTLINE
• Motivation: What is Different?
• Putting Big Data in Context
• Ground: Data Context Services
• Examples
• Challenges

Slide 9

Slide 9 text

Isn’t this just metadata?
Metadata: The last thing anybody wants to work on

Slide 10

Slide 10 text

CONTEXT IS SO MUCH MORE
Metadata: The last thing anybody wants to work on
Data context services: The final frontier

Slide 11

Slide 11 text

WHAT IS DIFFERENT?
Shift in technology: Data representations
Shift in behavior: Data-driven organizations

Slide 12

Slide 12 text

Shift in behavior
 Data-driven organizations

Slide 13

Slide 13 text

Data in products Started with the Internet. Now, the Internet of Things

Slide 14

Slide 14 text

Data in marketing
GARTNER GROUP
By 2017: marketing spends more on tech than IT does.
By 2020: 90% of IT budget controlled outside of IT.

Slide 15

Slide 15 text

MANY USE CASES MANY CONSTITUENCIES MANY INCENTIVES MANY CONTEXTS

Slide 16

Slide 16 text

WHAT IS DIFFERENT?
Shift in technology: Data representations
Shift in behavior: Data-driven organizations

Slide 17

Slide 17 text

Shift in technology
 Data representations

Slide 18

Slide 18 text

Raw data in the data lake
• Simplifies capture
• Encourages exploration
What does it mean? It depends on the context.

Slide 19

Slide 19 text

A LITTLE SCENARIO HDFS

Slide 20

Slide 20 text

BITS A web log from a retail site

Slide 21

Slide 21 text

BITS All the web logs from last year

Slide 22

Slide 22 text

VIEWS, MODELS, CODE A script to extract orders. To be used for Market Basket analysis.

Slide 23

Slide 23 text

VIEWS, MODELS, CODE A Hive table of orders. To be used for Market Basket analysis.

Slide 24

Slide 24 text

BITS All the web logs from last year

Slide 25

Slide 25 text

VIEWS, MODELS, CODE Code to extract abandoned user sessions

Slide 26

Slide 26 text

VIEWS, MODELS, CODE A retargeting model

Slide 27

Slide 27 text

VIEWS, MODELS, CODE
A Hive table of orders
A retargeting model

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

MANY SCRIPTS MANY MODELS MANY APPLICATIONS MANY CONTEXTS

Slide 30

Slide 30 text

OUTLINE
• Motivation: What is Different?
• Putting Big Data in Context
• Ground: Data Context Services
• Examples
• Challenges

Slide 31

Slide 31 text

THE MEANING AND VALUE OF DATA DEPEND ON CONTEXT
Application Context: Views, models, code
Behavioral Context: Data lineage & usage
Historical Context: In and over time

Slide 32

Slide 32 text

APPLICATION CONTEXT
Metadata
Models for interpreting the data for use
• Data structures
• Semantic structures
• Statistical structures
Theme: An unopinionated model of context

Slide 33

Slide 33 text

HISTORICAL CONTEXT
Versions
• Web logs
• Code to extract user/movie rentals
• Recommender for movie licensing
Point in time: A promising new movie is similar to older hot movies at time of release!
Trends over time: How does a movie with these features fare over time?
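The "point in time" question above can be sketched as an as-of lookup over a versioned history. This is a minimal illustration only: the `HISTORY` records, the year-based timestamps, and the `as_of` helper are assumptions made for the example and are not part of Ground.

```python
from bisect import bisect_right

# Hypothetical history of one feature (say, a movie's popularity score).
# Each entry is (year the version took effect, value); sorted by year.
HISTORY = [(2010, 0.9), (2012, 0.6), (2015, 0.2)]


def as_of(history, when):
    """Return the value that was current at time `when`, or None if none existed yet."""
    times = [t for t, _ in history]
    # Index of the first version that took effect strictly after `when`.
    i = bisect_right(times, when)
    if i == 0:
        return None
    return history[i - 1][1]


def trend(history):
    """Return the values in time order, for 'trends over time' questions."""
    return [value for _, value in history]


print(as_of(HISTORY, 2011))
print(trend(HISTORY))
```

Keeping every version, rather than overwriting the latest value, is what makes both kinds of question answerable from the same store.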

Slide 34

Slide 34 text

BEHAVIORAL CONTEXT Why Dora?! Lineage & Usage

Slide 35

Slide 35 text

BEHAVIORAL CONTEXT
Lineage & Usage
• Data Science Recommenders: “You should compare with book sales from last year.”
• Curation Tips: “Logistics staff checks weather data the 1st Monday of every month.”
• Proactive Impact Analysis: “The Twitter analysis script changed. You should check the boss’ dashboard!”
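Proactive impact analysis can be read as a downstream traversal of a lineage graph. The sketch below assumes lineage is available as a plain adjacency mapping; the artifact names in `LINEAGE` and the `impacted` function are hypothetical and only illustrate the idea.

```python
from collections import deque

# Hypothetical lineage edges: each artifact maps to the artifacts derived from it.
LINEAGE = {
    "twitter_analysis_script": ["sentiment_table"],
    "sentiment_table": ["boss_dashboard", "weekly_report"],
    "weather_data": ["logistics_report"],
}


def impacted(changed, lineage):
    """Return every artifact downstream of `changed`, in breadth-first order."""
    seen = {changed}
    order = []
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


print(impacted("twitter_analysis_script", LINEAGE))
```

With lineage recorded as data, a change to the script is enough to flag the dashboard, without anyone having to remember the dependency.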

Slide 36

Slide 36 text

THE BIG CONTEXT
A NEW WORLD NEEDS NEW SERVICES

Slide 37

Slide 37 text

OUTLINE
• Motivation: What is Different?
• Putting Big Data in Context
• Ground: Data Context Services
• Examples
• Challenges

Slide 38

Slide 38 text

WHAT ARE WE BUILDING?
Grounding philosophy
• Start useful, stay useful.
• Stay general.
• Design for scale.

Slide 39

Slide 39 text

ABOVEGROUND API TO APPLICATIONS: Parsing & Featurization, Catalog & Discovery, Wrangling, Analytics & Vis, Reference Data, Data Quality, Reproducibility, Model Serving
COMMON GROUND: CONTEXT MODEL
UNDERGROUND API TO SERVICES: Scavenging and Ingestion, Search & Query, Scheduling & Workflow, Versioned Storage, ID & Auth

Slide 40

Slide 40 text

ABOVEGROUND API TO APPLICATIONS: Parsing & Featurization, Catalog & Discovery, Wrangling, Analytics & Vis, Reference Data, Data Quality, Reproducibility, Model Serving
COMMON GROUND: CONTEXT MODEL
UNDERGROUND API TO SERVICES: Scavenging and Ingestion, Search & Query, Scheduling & Workflow, Versioned Storage, ID & Auth
Pachyderm, Chronos

Slide 41

Slide 41 text

COMMON GROUND An unopinionated context model Versions Models Usage

Slide 42

Slide 42 text

COMMON GROUND: The metamodel
Models, Versions, Usage
Model Graphs

Slide 43

Slide 43 text

Models, Versions, Usage
[Figure: two example model graphs]
RELATIONAL SCHEMA: Schema 1; Table 1 (Column 1 … Column c); Table t (Column 1 … Column d); foreign key
JSON DOCUMENT: Root; Object 2; member k1: string; member k2: number; member k11: string; member k12; element 1, element 2, element 3
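One way to read the metamodel is that both a relational schema and a JSON document reduce to the same labeled graph of nodes and edges. The sketch below is an assumption-laden illustration, not Ground's actual data model: the `ModelGraph` class, its method names, the node kinds, and the choice of which columns the foreign key connects are all made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ModelGraph:
    """A generic graph: nodes carry a kind, edges carry a relationship label."""
    nodes: dict = field(default_factory=dict)  # node id -> kind
    edges: list = field(default_factory=list)  # (source, label, target)

    def add_node(self, node_id, kind):
        self.nodes[node_id] = kind

    def add_edge(self, source, label, target):
        # Refuse edges between nodes that were never declared.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append((source, label, target))

    def children(self, node_id, label):
        return [t for s, l, t in self.edges if s == node_id and l == label]


# A relational schema expressed as a model graph.
relational = ModelGraph()
relational.add_node("Schema 1", "schema")
tables = {
    "Table 1": ["Column 1", "Column c"],
    "Table t": ["Column 1", "Column d"],
}
for table, columns in tables.items():
    relational.add_node(table, "table")
    relational.add_edge("Schema 1", "contains", table)
    for column in columns:
        column_id = f"{table}.{column}"
        relational.add_node(column_id, "column")
        relational.add_edge(table, "contains", column_id)
# Assumed endpoints: the slide only shows that a foreign key edge exists.
relational.add_edge("Table 1.Column c", "foreign key", "Table t.Column 1")

# A JSON document expressed with the same structure.
document = ModelGraph()
document.add_node("Root", "object")
document.add_node("k1", "string")
document.add_node("k2", "number")
document.add_edge("Root", "member", "k1")
document.add_edge("Root", "member", "k2")

print(relational.children("Schema 1", "contains"))
print(document.children("Root", "member"))
```

Because the graph itself has no built-in notion of tables or JSON members, new kinds of models can be added without changing the underlying store, which is the sense in which the model stays unopinionated.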

Slide 44

Slide 44 text

COMMON GROUND: The versioning model
Models, Versions, Usage
Model Graphs
Version Graphs

Slide 45

Slide 45 text

COMMON GROUND: The model
Models, Versions, Usage
Model Graphs
Version Graphs

Slide 46

Slide 46 text

VERSION GRAPHS
Models, Versions, Usage
Directed Acyclic Graphs (partial orders)
• In this order
• In no particular order
Version IDs shown:
a3eb4b765520b0d0ab90594dcf2373c1ce5dbb0b0
0e9233e8e99cccd6861d304968efa4c945a0b918
3e64220f08374629ad43ca652d4ce7cef0bdbbca
3e0bada008655fe32d7d136eac0a3f333d23ed80
fd75a4ba16f96d11f3f954854acc2d739054233
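A version graph as a DAG defines a partial order: some versions are ordered by ancestry ("in this order") and others are not comparable ("in no particular order"). The sketch below illustrates this with Python's standard `graphlib`; the short version IDs in `PARENTS` and the helper names are hypothetical stand-ins for the hashes on the slide.

```python
from graphlib import TopologicalSorter

# Hypothetical version graph: each version id maps to its parent versions.
# v2 and v3 both branch from v1; v4 merges them.
PARENTS = {
    "v1": [],
    "v2": ["v1"],
    "v3": ["v1"],
    "v4": ["v2", "v3"],
}


def is_ancestor(older, newer, parents):
    """True if `older` is reachable from `newer` by following parent edges."""
    stack = list(parents[newer])
    seen = set()
    while stack:
        version = stack.pop()
        if version == older:
            return True
        if version not in seen:
            seen.add(version)
            stack.extend(parents[version])
    return False


def comparable(a, b, parents):
    """True if the partial order relates a and b in either direction."""
    return a == b or is_ancestor(a, b, parents) or is_ancestor(b, a, parents)


# One valid linearization: every version appears after all of its parents.
order = list(TopologicalSorter(PARENTS).static_order())
print(order)
print(comparable("v1", "v4", PARENTS))  # ordered by ancestry
print(comparable("v2", "v3", PARENTS))  # siblings, no defined order
```

The topological order is only one of several valid linearizations; the position of `v2` relative to `v3` in it carries no meaning.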

Slide 47

Slide 47 text

COMMON GROUND: The usage model
Models, Versions, Usage
Model Graphs
Version Graphs
Usage Graphs: Lineage

Slide 48

Slide 48 text

USAGE GRAPHS
Everything can participate in usage
Models, Versions, Usage

Slide 49

Slide 49 text

COMMON GROUND: The model
Models: Model Graphs
Versions: Version Graphs
Usage: Usage Graphs (Lineage)

Slide 50

Slide 50 text

OUTLINE
• Motivation: What is Different?
• Putting Big Data in Context
• Ground: Data Context Services
• Examples
• Challenges

Slide 51

Slide 51 text

GROUND vZERO
Goals
• Exercise flexibility of Common Ground
• Proof of concept for starting useful
Examples
• Grit: Ground+Git
• Apiary: A Grounded Hive Metastore

Slide 52

Slide 52 text

GRIT SCENARIO
CS 186 @ Berkeley
500 students
• Now the 6th-largest upper-division course at Berkeley
CS 186 homework submissions through GitHub
Goal: Track students’ submission history. E.g.:
• Track tardiness
• Prevent submission time spoofing
• Long-term: Analyze homework turn-in patterns

Slide 53

Slide 53 text

TECHNICAL ISSUES
Ideally, you track git history as it’s created
• GitHub’s Webhooks API!
• Reports back to Ground on every push.
Unfortunately, some wrinkles here
• Webhooks API doesn’t report the full version lineage
• So can’t rely on GitHub. Track the git repo ourselves.
• A topic for future collaboration perhaps. (FWIW, Google Docs is even messier!)
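Tracking the git repo directly could look roughly like the sketch below, which rebuilds the version lineage from the output of `git rev-list --parents --all` (each line is a commit ID followed by its parent IDs). This is not Grit's implementation: the function names are hypothetical, and `SAMPLE` uses shortened, made-up commit IDs in place of real hashes.

```python
import subprocess

# Made-up, shortened commit ids in the "commit parent..." format that
# `git rev-list --parents` prints. c3 is a merge of c2 and c1.
SAMPLE = "c3 c2 c1\nc2 c0\nc1 c0\nc0\n"


def parse_rev_list(text):
    """Turn rev-list output into a mapping of commit id -> list of parent ids."""
    graph = {}
    for line in text.splitlines():
        fields = line.split()
        if fields:
            graph[fields[0]] = fields[1:]
    return graph


def roots(graph):
    """Commits with no parents, i.e. the starting points of the history."""
    return [commit for commit, parents in graph.items() if not parents]


def read_repo(path):
    """Read the full lineage of a local clone (requires git on the PATH)."""
    result = subprocess.run(
        ["git", "-C", path, "rev-list", "--parents", "--all"],
        capture_output=True, text=True, check=True,
    )
    return parse_rev_list(result.stdout)


lineage = parse_rev_list(SAMPLE)
print(lineage)
print(roots(lineage))
```

Reading the clone captures merges and branches in full, which is the lineage the slide says the webhook reports leave out.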

Slide 54

Slide 54 text

APIARY
A Grounded Hive Metastore
Hive Metastore: the de facto catalog for structured big data
Apiary: Ground as the backing store for Hive Metastore
Relational catalog: a design pattern above our basic context model
[Figure: Schema 1; Table 1 (Column 1 … Column c); Table t (Column 1 … Column d); foreign key]

Slide 55

Slide 55 text

PUTTING IT TOGETHER
Apiary: versioned metadata storage
HDFS: append-mostly data storage
Grit: versioned code
= Fully-versioned context store!

Slide 56

Slide 56 text

APIARY + GRIT + HDFS Different versions of data (HDFS) Different versions of code (Grit) + + = Different versions of metadata (Apiary)

Slide 57

Slide 57 text

DEMO

Slide 58

Slide 58 text

OUTLINE
• Motivation: What is Different?
• Putting Big Data in Context
• Ground: Data Context Services
• Examples
• Challenges

Slide 59

Slide 59 text

INITIAL FOCUS AREAS

Slide 60

Slide 60 text

INITIAL FOCUS AREAS
ABOVEGROUND API TO APPLICATIONS: Parsing & Featurization, Catalog & Discovery, Wrangling, Analytics & Vis, Reference Data, Data Quality, Reproducibility, Model Serving
COMMON GROUND: CONTEXT MODEL
UNDERGROUND API TO SERVICES: Scavenging and Ingestion, Search & Query, Scheduling & Workflow, Versioned Storage, ID & Auth

Slide 61

Slide 61 text

INITIAL FOCUS AREAS: Parsing & Featurization, Model Serving, Reproducibility
ABOVEGROUND API TO APPLICATIONS: Parsing & Featurization, Catalog & Discovery, Wrangling, Analytics & Vis, Reference Data, Data Quality, Reproducibility, Model Serving
COMMON GROUND: CONTEXT MODEL
UNDERGROUND API TO SERVICES: Scavenging and Ingestion, Search & Query, Scheduling & Workflow, Versioned Storage, ID & Auth

Slide 62

Slide 62 text

INITIAL FOCUS AREAS: Versioned Storage
ABOVEGROUND API TO APPLICATIONS: Parsing & Featurization, Catalog & Discovery, Wrangling, Analytics & Vis, Reference Data, Data Quality, Reproducibility, Model Serving
COMMON GROUND: CONTEXT MODEL
UNDERGROUND API TO SERVICES: Scavenging and Ingestion, Search & Query, Scheduling & Workflow, Versioned Storage, ID & Auth

Slide 63

Slide 63 text

BROAD ARENA FOR INNOVATION AT ALL LEVELS!
ABOVEGROUND API TO APPLICATIONS: Parsing & Featurization, Catalog & Discovery, Wrangling, Analytics & Vis, Reference Data, Data Quality, Reproducibility, Model Serving
COMMON GROUND: CONTEXT MODEL
UNDERGROUND API TO SERVICES: Scavenging and Ingestion, Search & Query, Scheduling & Workflow, Versioned Storage, ID & Auth

Slide 64

Slide 64 text

GROUNDING BIG DATA WITH CONTEXT SERVICES
It’s time to establish a bigger context for big data.
Application context: Because truth is subjective
Behavioral context: Because behavior determines meaning
Historical context: Because things change

Slide 65

Slide 65 text

INPUT AND FURTHER CONTEXT

Slide 66

Slide 66 text

ground Learn more at: http://www.ground-metadata.org @joe_hellerstein @vsreekanti