Slide 1

Slide 1 text

People, data, computation: an evolving trifecta Joe Hellerstein Berkeley | Trifacta

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

What is the biggest problem with databases today? (Hint: it’s not the web.) 1995

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

What is the biggest problem with databases today? 1995

Slide 11

Slide 11 text

1999

Slide 12

Slide 12 text

CONTROL: Continuous Output, Navigation and Transformation with Refinement On-Line
“Of all men's miseries, the bitterest is this: to know so much and have control over nothing” – Herodotus
• Requirements for Control systems
  – Early answers
  – Refinement over time
  – Interaction and ad-hoc control
• “Crystal Ball” vs. Black Box

Slide 13

Slide 13 text

Goals for Online Processing
• New “greedy” performance regime
  – Maximize 1st derivative of the “mirth index”
  – Mirth defined on-the-fly
  – Therefore need FEEDBACK and CONTROL
[Figure: axes Time and ☺ (up to 100%); curves: Online, Traditional]

Slide 14

Slide 14 text

Example: Online Aggregation
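A minimal sketch of the "early answers, refinement over time" idea behind online aggregation, assuming rows arrive in random order and that a normal-approximation interval is acceptable. The function name `online_mean`, its parameters, and the synthetic data are illustrative assumptions, not taken from the original system.

```python
import math
import random


def online_mean(values, z=1.96, report_every=1000):
    """Yield (rows_seen, running_mean, half_width) while scanning rows.

    Assumes `values` is already in random order, so the rows seen so far
    behave like a random sample of the whole table.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's method)
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n > 1 and n % report_every == 0:
            variance = m2 / (n - 1)
            # Normal-approximation interval; no finite-population correction.
            half_width = z * math.sqrt(variance / n)
            yield n, mean, half_width


random.seed(0)
data = [random.gauss(100.0, 15.0) for _ in range(20000)]  # synthetic column
random.shuffle(data)

estimates = list(online_mean(data))
for n, mean, half_width in estimates[:3]:
    print(f"{n:>6} rows: AVG ~ {mean:.2f} +/- {half_width:.2f}")
```

Each report is usable on its own, so a user who is satisfied with the current interval can stop the scan early; letting it run to the end gives the exact average.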

Slide 15

Slide 15 text

CLOUDS

Slide 16

Slide 16 text

Potter’s Wheel

Slide 17

Slide 17 text

CONTROL Lessons
• Dream about UIs, work on systems
• Systems and statistics intertwine
• All 3 go together naturally
  – User desires and behavior: 2 more things to model, predict
  – “Performance” metrics need to reflect key user needs
“What unlike things must meet and mate…” -- Art, Herman Melville
The Trifecta

Slide 18

Slide 18 text

The Trifecta c. 2000
DB Systems: Pipelined query processing + sampling, declarative DSLs
Statistics: Approximations and confidence bounds for queries
UI: Immediacy, visualization of confidence, visual DSLs

Slide 19

Slide 19 text

Fast Forward 16 Years The Trifecta

Slide 20

Slide 20 text

The DBO Database Sys
Florin Rusu, Fei Xu, Luis Perez, Mingxi Wu, Ravi Jampani, Chris Jermaine, Alin Dobra
University of Florida, Gainesville

What is DBO?
- DBO (version 0.2) is a prototype database engine for analytic, statistical processing
- Key innovations:
  • Within seconds after query is issued, DBO gives statistically valid guess + bounds
  • Accuracy increases as query is executed; 100% accuracy at query completion
  • Works for arbitrary SELECT-FROM-WHERE-GROUP BY aggregate queries
  • For some queries (almost all single-table scans) 99%+ accuracy after only seconds
- Key idea: Happy with the current estimate? Then kill the query!
- DBO extends “classic” online aggregation to full, disk-based query plans; see our SIGMOD ‘07 paper

How Does DBO Work?
- Data are clustered randomly on the disk, so tuples flow through engine in random order
- During processing, DBO finds “lucky” output tuples whose parts happen to be in memory
- DBO uses those “lucky” tuples that it finds to guess final answer to the query
- Example: SUM l_price (lineitem JOIN orders ON l_okey = o_okey AND o_shipdate > ‘1-1-97’)
  • Happen to have ($12.82, 1234) from lineitem, (1234, ‘2-12-98’) from orders in memory
  • So if probability of finding ($12.82, 1234, 1234, ‘2-12-98’) is p, add (12.82 / p) to estimate
- By statistically characterizing what “lucky” means, can provide confidence bounds on estimate

Levelwise QP in DBO
- To search for output tuples, operations communicate their internal state with one another
- Recognizing output tuples generally requires data from all input relations
- Thus, all relational operations at each level of the query plan search for lucky tuples in a coordinated fashion
- Called a Levelwise Step
- Each levelwise step produces an estimate Ni
[Figure: levelwise evaluation of a query plan over relations R1 to R8, with intermediate results R12, R34, R56, R78, R1234, R5678 and estimators N1, N2, N3. Panel captions: (1) Original query plan. (2) All bottom-level joins evaluated concurrently in levelwise step #1. This step produces an estimator N1. (3) Remaining query plan. Captions (4) and (6) are cut off: "All bottom-level joins evaluated concurrently in", "Final join evaluated in".]

The Demonstration
- DBO has a simple, GUI front end that shows estima
- We have prepared five queries over TPC-H benchm
- For comparison, have two identical machines; one

Query 1:
SELECT l_returnflag, l_linestatus, sum(l_extendedprice*(1-l_discount) * (1+l_tax))
FROM lineitem
WHERE l_shipdate < '1998-09-01'
GROUP BY l_returnflag, l_linestatus

Query 2:
SELECT n_name, sum(l_extendedprice * (1-l_discount))
FROM customer, orders, lineitem, nation
WHERE c_custkey = o_custkey AND l_orderkey = o_orderkey AND l_returnflag = 'R'
  AND (o_orderdate < '1994-01-01') AND (o_orderdate > '1993-09-30')
  AND c_nationkey = n_nationkey
GROUP BY n_name

Query 3:
SELECT n_name, sum(l_extendedprice * (1 - l_discount))
FROM customer, orders, lineitem, supplier, nation, reg
WHERE c_custkey=o_custkey AND l_orderkey = o_orderkey AND l_suppkey = s_suppkey
  AND c_nationkey = s_nationkey AND s_nationkey = n_nationkey AND n_regionkey = r_regionkey
  AND l_discount > 0.08 AND r_name = 'ASIA'
  AND (o_orderdate > '1993-12-31') AND (o_orderdate < '1995-01-01')
GROUP BY n_name
Description (cut off in the source): "Find the revenue fr heavily discounted p 1994 that were sold t tomer in the Asian re per-country basis"

Query 4:
SELECT n1.nationname, n2.nationname, extract(year from l_shipdate) as l_year, SUM(l_extendedprice * (1 - l_d
FROM supplier, lineitem, orders, customer, nation n1, n
WHERE s_suppkey = l_suppkey AND o_orderkey = l_orderkey AND c_custkey = o_custkey AND s_nationkey = n1.n_natio
  AND c_nationkey = n2.n_nationkey
  AND ((n1.n_name = 'GERMANY' AND n2.n_name = 'FRANCE') OR (n1.n_name = 'FRANCE' AND n2.n_name = 'GERMANY'))
  AND (l_shipdate < '1997-01-01') AND (l_shipdate > '1995-01-01') AND l_discount > 0.08
GROUP BY n1.nationname, n2.nationname, l_year

Query 5:
SELECT l_shipmode, extract(year from l_shipdate) as l_y
FROM orders, lineitem
WHERE o_orderkey = l_orderkey AND o_orderprior > '1-URG
  AND (l_receiptdate > '1993-12-31') AND (l_receiptdate < '1995-01-01')
  AND (l_commitdate < l_receiptdate) AND (l_shipdate < l_commitdate)
GROUP BY l_shipmode, l_year
Description (cut off in the source): "Find the numb shipment mode 1994 that did n"
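The poster's rule "if probability of finding the tuple is p, add (value / p) to estimate" can be sketched as an inverse-probability-weighted sum. This is a simplified illustration, not DBO's actual estimator: it assumes every output tuple is found independently with the same known probability `p`, and the names `scaled_sum`, `prices`, and `found` are hypothetical.

```python
import random


def scaled_sum(found):
    """Estimate a SUM from 'lucky' tuples given as (value, probability) pairs.

    Each found value is divided by the probability of having found it,
    mirroring the slide's "add (12.82 / p) to estimate" step.
    """
    return sum(value / p for value, p in found)


random.seed(1)
# Synthetic stand-in for the l_price values of all true output tuples.
prices = [round(random.uniform(1.0, 100.0), 2) for _ in range(50000)]
true_sum = sum(prices)

p = 0.05  # assumption: uniform, known probability of a tuple being "lucky"
found = [(price, p) for price in prices if random.random() < p]

estimate = scaled_sum(found)
print(f"true SUM = {true_sum:.2f}")
print(f"estimate from {len(found)} lucky tuples = {estimate:.2f}")
```

Dividing by `p` makes the estimate unbiased under these assumptions: each tuple contributes its full value in expectation, even though only a fraction of tuples is ever seen.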

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

From There to Here

Slide 24

Slide 24 text

datapeople
Jeff Heer (Stanford), Tapan Parikh (Berkeley), Maneesh Agrawala (Berkeley), Joe Hellerstein (Berkeley)
Sean Kandel, Diana MacLean, Ravi Parikh, Kuang Chen, Nicholas Kong, Wesley Willett
2009
http://deepresearch.org

Slide 25

Slide 25 text

Data in the First Mile

Slide 26

Slide 26 text

Database

Slide 27

Slide 27 text

Shreddr

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

Shreddr: Data Entry ⇒ Prediction Confirmation
[Screenshot prompt: “Select the values are not: Michael”]

Slide 30

Slide 30 text

Shreddr: Columnar Data Entry
[Figure: fields ID, Village and Age from form 1, form 2 and form 3, distributed among worker 1, worker 2 and worker 3]
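One reading of the columnar data entry figure is that each worker receives the same field from many forms rather than one whole form. The sketch below illustrates that regrouping under this assumption. The function `columnar_batches`, the image-snippet identifiers, and the one-field-per-worker assignment are all hypothetical and not taken from Shreddr itself.

```python
from collections import defaultdict


def columnar_batches(forms):
    """Regroup per-form fields into one batch per field name.

    `forms` maps a form id to {field name: image snippet id}. The result
    maps each field name to a list of (form id, snippet id) pairs.
    """
    batches = defaultdict(list)
    for form_id, fields in forms.items():
        for field_name, snippet in fields.items():
            batches[field_name].append((form_id, snippet))
    return dict(batches)


# Hypothetical scanned forms; snippet ids stand in for cropped field images.
forms = {
    "form 1": {"ID": "img-1-id", "Village": "img-1-village", "Age": "img-1-age"},
    "form 2": {"ID": "img-2-id", "Village": "img-2-village", "Age": "img-2-age"},
    "form 3": {"ID": "img-3-id", "Village": "img-3-village", "Age": "img-3-age"},
}

batches = columnar_batches(forms)
workers = ["worker 1", "worker 2", "worker 3"]
# Assumed assignment: one field (column) per worker.
assignment = dict(zip(workers, batches.items()))

for worker, (field_name, items) in assignment.items():
    print(worker, "enters", field_name, "for", [form for form, _ in items])
```

Grouping by field means a worker sees many values of the same type in a row, which is what makes the predicted-value confirmation style of entry on the previous slide practical.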

Slide 31

Slide 31 text

A Data Entry Trifecta, c. 2012
People: Predictive, correlated, encoded data entry
Data: Processing data with compute + crowd power
Computation: Learning, modeling, prediction

Slide 32

Slide 32 text

Wrangler

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

PREDICTIVE INTERACTION
[Diagram labels, in extraction order: DSL; Visual; Data Vis; Visual Results; visualization; compilation; Data; Results; coding; ambiguous interaction; Response; Visualization of probable Next Steps; disambiguation; Prediction; guide; decide]

Slide 35

Slide 35 text

A Data Wrangling Trifacta c. 2015
People: Interactive visual profiling, predictive interaction
Data: Data-centric DSLs + compilation for multiple scales & engines
Computation: Sampling, approximation, learning, inference, prediction

Slide 36

Slide 36 text

A Trifecta of Topics going forward
People
• Perception & Comprehension: from Experts to Amateurs
• Interaction across scales: from Lake to Atom
Data (and metadata!)
• Lineage across pipelines
• Harvesting usage
• Building and leveraging organizational context
Computation
• Programmability of interactive data systems
• Prediction of data and code reuse/adaptation
• Platform variety and capability

Slide 37

Slide 37 text

References
• Online Aggregation. [HHW SIGMOD 97]
• Interactive Data Analysis: The CONTROL Project. [HAC+ Computer 99]
• Potter’s Wheel: An Interactive Data Cleaning System. [RH VLDB 01]
• Data in the First Mile. [CHP CIDR 11]
• Shreddr: pipelined paper digitization for low-resource organizations. [CKY+ ACM Dev 12]
• Wrangler: Interactive visual specification of data transformation scripts. [KPHH CHI 11]
• Profiler: Integrated statistical analysis and visualization for data quality assessment. [KPP+ AVI 12]
• Predictive Interaction for Data Transformation. [HHK CIDR 15]

Slide 38

Slide 38 text

BACKUP

Slide 39

Slide 39 text

TinyDB & Model-Driven Data Acquisition
• Data as a scarce resource
• Querying the physical world
• Statistical modeling
The Trifecta

Slide 40

Slide 40 text

The Trifecta c. 2000
• Systems: Minimizing duty cycles and communication for queries
• Statistics: Modeling, approximation, interpolation and prediction
• UX: SQL as an interface to the physical world

Slide 41

Slide 41 text

Potter’s Wheel

Slide 42

Slide 42 text

Potter’s Wheel

Slide 43

Slide 43 text

Usher

Slide 44

Slide 44 text

Hard constraint Soft constraint friction Usher Satisficr

Slide 45

Slide 45 text

Profiler