People, data, computation: an evolving trifecta

Joe Hellerstein, Berkeley | Trifacta

What is the biggest problem with databases today? (Hint: it’s not the web.) 1995

What is the biggest problem with databases today? 1995

CONTROL: Continuous Output, Navigation and Transformation with Refinement On-Line
“Of all men's miseries, the bitterest is this: to know so much and have control over nothing” –  Herodotus
•  Requirements for Control systems
–  Early answers
–  Refinement over time
–  Interaction and ad-hoc control
•  “Crystal Ball” vs. Black Box

Goals for Online Processing
•  New “greedy” performance regime
–  Maximize 1st derivative of the “mirth index”
–  Mirth defined on-the-fly
–  Therefore need FEEDBACK and CONTROL
[Figure: mirth (up to 100%) versus Time, comparing Online and Traditional processing]

Example: Online Aggregation

CLOUDS

Potter’s Wheel

CONTROL Lessons
•  Dream about UIs, work on systems
•  Systems and statistics intertwine
•  All 3 go together naturally
–  User desires and behavior: 2 more things to model, predict
–  “Performance” metrics need to reflect key user needs
“What unlike things must meet and mate…” -- Art, Herman Melville
The Trifecta

The Trifecta c. 2000
DB Systems: Pipelined query processing + sampling, declarative DSLs
Statistics: Approximations and confidence bounds for queries
UI: Immediacy, visualization of confidence, visual DSLs
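
The "sampling" and "confidence bounds" entries on this slide can be illustrated with a minimal sketch of an online AVG. It assumes rows arrive in random order and uses a normal-approximation interval, which is a simplification chosen for this example and not necessarily the bound used in the Online Aggregation work. The function name online_average and its parameters are hypothetical.

```python
import math
import random


def online_average(values, z=1.96, report_every=1000):
    """Yield (rows_seen, running_mean, half_width) while scanning `values`.

    Assumes `values` arrive in random order, so every prefix is a uniform
    sample. The interval is a normal approximation (z = 1.96 for ~95%)
    and ignores the finite-population correction.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's method)
    for v in values:
        n += 1
        delta = v - mean
        mean += delta / n
        m2 += delta * (v - mean)
        if n > 1 and n % report_every == 0:
            variance = m2 / (n - 1)
            yield n, mean, z * math.sqrt(variance / n)


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.uniform(0, 100) for _ in range(20000)]
    rng.shuffle(data)  # stand-in for random-order delivery of tuples
    for rows, mean, half_width in online_average(data, report_every=5000):
        print(f"{rows:6d} rows: AVG ~ {mean:.2f} +/- {half_width:.2f}")
```

A consumer can stop iterating as soon as the interval is narrow enough, which is the "early answers, refinement over time" behavior the CONTROL slides ask for.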

Fast Forward 16 Years The Trifecta

The DBO Database Sys
Florin Rusu, Fei Xu, Luis Perez, Mingxi Wu, Ravi Jampani, Chris Jermaine, Alin Dobra
University of Florida, Gainesville

What is DBO?
- DBO (version 0.2) is a prototype database engine for analytic, statistical processing
- Key innovations:
• Within seconds after query is issued, DBO gives statistically valid guess + bounds
• Accuracy increases as query is executed; 100% accuracy at query completion
• Works for arbitrary SELECT-FROM-WHERE-GROUP BY aggregate queries
• For some queries (almost all single-table scans) 99%+ accuracy after only seconds
- Key idea: Happy with the current estimate? Then kill the query!
- DBO extends “classic” online aggregation to full, disk-based query plans; see our SIGMOD ‘07 paper

How Does DBO Work?
- Data are clustered randomly on the disk, so tuples flow through engine in random order
- During processing, DBO finds “lucky” output tuples whose parts happen to be in memory
- DBO uses those “lucky” tuples that it finds to guess final answer to the query
- Example: SUM l_price (lineitem JOIN orders ON l_okey = o_okey AND o_shipdate > ‘1-1-97’)
• Happen to have ($12.82, 1234) from lineitem, (1234, ‘2-12-98’) from orders in memory
• So if probability of finding ($12.82, 1234, 1234, ‘2-12-98’) is p, add (12.82 / p) to estimate
- By statistically characterizing what “lucky” means, can provide confidence bounds on estimate

Levelwise QP in DBO
- To search for output tuples, operations communicate their internal state with one another
- Recognizing output tuples generally requires data from all input relations
- Thus, all relational operations at each level of the query plan search for lucky tuples in a coordinated fashion
- Called a Levelwise Step
- Each levelwise step produces an estimate Ni
[Figure: levelwise evaluation of a query plan over relations R1 to R8, with intermediate joins R12, R34, R56, R78 and R1234, R5678, and estimators N1, N2, N3]
(1) Original query plan
(2) All bottom-level joins evaluated concurrently in levelwise step #1. This step produces an estimator N1
(3) Remaining query plan
(4) All bottom-level joins evaluated concurrently in …
(6) Final join evaluated in …

- We have prepared five queries over TPC-H benchm…
- For comparison, have two identical machines; one …

SELECT l_returnflag, l_linestatus, sum(l_extendedprice*(1-l_discount) * (1+l_tax))
FROM lineitem
WHERE l_shipdate < '1998-09-01'
GROUP BY l_returnflag, l_linestatus

SELECT n_name, sum(l_extendedprice * (1-l_discount))
FROM customer, orders, lineitem, nation
WHERE c_custkey = o_custkey AND l_orderkey = o_orderkey AND l_returnflag = 'R' AND (o_orderdate < '1994-01-01') AND (o_orderdate > '1993-09-30') AND c_nationkey = n_nationkey
GROUP BY n_name

SELECT n_name, sum(l_extendedprice * (1 - l_discount))
FROM customer, orders, lineitem, supplier, nation, reg…
WHERE c_custkey=o_custkey AND l_orderkey = o_orderkey AND l_suppkey = s_suppkey AND c_nationkey = s_nationkey AND s_nationkey = n_nationkey AND n_regionkey = r_regionkey AND l_discount > 0.08 AND r_name = 'ASIA' AND (o_orderdate > '1993-12-31') AND (o_orderdate < '1995-01-01')
GROUP BY n_name
"Find the revenue fr… heavily discounted p… 1994 that were sold t… tomer in the Asian re… per-country basis"

SELECT n1.nationname, n2.nationname, extract(year from l_shipdate) as l_year, SUM(l_extendedprice * (1 - l_d…
FROM supplier, lineitem, orders, customer, nation n1, n…
WHERE s_suppkey = l_suppkey AND o_orderkey = l_orderkey AND c_custkey = o_custkey AND s_nationkey = n1.n_natio… AND c_nationkey = n2.n_nationkey AND ((n1.n_name = 'GERMANY' AND n2.n_name = 'FRANCE') OR (n1.n_name = 'FRANCE' AND n2.n_name = 'GERMANY')) AND (l_shipdate < '1997-01-01') AND (l_shipdate > '1995-01-01') AND l_discount > 0.08
GROUP BY n1.nationname, n2.nationname, l_year

SELECT l_shipmode, extract(year from l_shipdate) as l_y…
FROM orders, lineitem
WHERE o_orderkey = l_orderkey AND o_orderprior… > '1-URG… AND (l_receiptdate > '1993-12-31') AND (l_receiptdate < '1995-01-01') AND (l_commitdate < l_receiptdate) AND (l_shipdate < l_commitdate)
GROUP BY l_shipmode, l_year
"Find the numb… shipment mode… 1994 that did n…

The Demonstration
- DBO has a simple, GUI front end that shows estima…
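
The "add (12.82 / p) to estimate" rule on the poster can be sketched as an inverse-probability-weighted sum. The sketch below is an illustration only, not DBO's implementation. It assumes each table's in-memory rows are an independent uniform sample, so p is the product of the two sampling fractions. The function names, the synthetic tables, and the ship-year filter are all hypothetical.

```python
import random


def lucky_tuple_estimate(lucky_tuples):
    """Sum value / p over (value, p) pairs for the output tuples found so far."""
    total = 0.0
    for value, p in lucky_tuples:
        if not 0.0 < p <= 1.0:
            raise ValueError("probability of finding a tuple must be in (0, 1]")
        total += value / p
    return total


def simulate(frac_lineitem, frac_orders, seed=0):
    """Return (exact SUM, estimate) for a synthetic lineitem JOIN orders query."""
    rng = random.Random(seed)
    orders = {okey: rng.randint(1995, 1999) for okey in range(1000)}  # okey -> ship year
    lineitem = [(round(rng.uniform(1, 100), 2), rng.randrange(1000)) for _ in range(5000)]
    exact = sum(price for price, okey in lineitem if orders[okey] > 1996)

    # Rows that "happen to be in memory": independent uniform samples per table.
    mem_lineitem = rng.sample(lineitem, int(frac_lineitem * len(lineitem)))
    mem_orders = set(rng.sample(sorted(orders), int(frac_orders * len(orders))))

    # Under that assumption, an output tuple is found with probability p.
    p = (len(mem_lineitem) / len(lineitem)) * (len(mem_orders) / len(orders))
    lucky = [(price, p) for price, okey in mem_lineitem
             if okey in mem_orders and orders[okey] > 1996]
    return exact, lucky_tuple_estimate(lucky)


if __name__ == "__main__":
    exact, estimate = simulate(frac_lineitem=0.2, frac_orders=0.5)
    print(f"exact SUM = {exact:.2f}, estimate from lucky tuples = {estimate:.2f}")
```

Dividing by p keeps the estimate unbiased: a tuple that is found only rarely stands in for the many similar tuples that were not found. When both fractions are 1.0, every output tuple is lucky and the estimate equals the exact answer.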

From There to Here

datapeople (2009) http://deepresearch.org
Jeff Heer (Stanford), Tapan Parikh (Berkeley), Maneesh Agrawala (Berkeley), Joe Hellerstein (Berkeley)
Sean Kandel, Diana MacLean, Ravi Parikh, Kuang Chen, Nicholas Kong, Wesley Willett

Data in the First Mile

Database

Shreddr

Shreddr: Data Entry ⇒ Prediction Confirmation
[Screenshot prompt: "Select the values are not: Michael"]

Shreddr: Columnar Data Entry
[Figure: forms 1 to 3, each with ID, Village, and Age fields; workers 1 to 3]

A Data Entry Trifecta, c. 2012
People: Predictive, correlated, encoded data entry
Data: Processing data with compute + crowd power
Computation: Learning, modeling, prediction

Wrangler

PREDICTIVE INTERACTION
[Diagram labels: DSL, Data, Results, Data Vis, Visual Results; coding, compilation, visualization; ambiguous interaction, Prediction, Response, Visualization of probable Next Steps, disambiguation; guide, decide]

A Data Wrangling Trifacta c. 2015
People: Interactive visual profiling, predictive interaction
Data: Data-centric DSLs + compilation for multiple scales & engines
Computation: Sampling, approximation, learning, inference, prediction

A Trifecta of Topics going forward
People
•  Perception & Comprehension: from Experts to Amateurs
•  Interaction across scales: from Lake to Atom
Data (and metadata!)
•  Lineage across pipelines
•  Harvesting usage
•  Building and leveraging organizational context
Computation
•  Programmability of interactive data systems
•  Prediction of data and code reuse/adaptation
•  Platform variety and capability

References
•  Online Aggregation. [HHW SIGMOD 97]
•  Interactive Data Analysis: The CONTROL Project. [HAC+ Computer 99]
•  Potter’s Wheel: An Interactive Data Cleaning System. [RH VLDB 01]
•  Data in the First Mile. [CHP CIDR 11]
•  Shreddr: Pipelined paper digitization for low-resource organizations. [CKY+ ACM Dev 12]
•  Wrangler: Interactive visual specification of data transformation scripts. [KPHH CHI 11]
•  Profiler: Integrated statistical analysis and visualization for data quality assessment. [KPP+ AVI 12]
•  Predictive Interaction for Data Transformation. [HHK CIDR 15]

BACKUP

TinyDB & Model-Driven Data Acquisition
•  Data as a scarce resource
•  Querying the physical world
•  Statistical modeling
The Trifecta

The Trifecta c. 2000
•  Systems: Minimizing duty cycles and communication for queries
•  Statistics: Modeling, approximation, interpolation and prediction
•  UX: SQL as an interface to the physical world

Potter’s Wheel

Hard constraint, Soft constraint, friction
Usher
Satisficr

Profiler
