People, data, computation: an evolving trifecta

Keynote talk for Data Systems for Interactive Analysis, workshop at IEEE Vis 2015 http://www.interactive-analysis.org

Joe Hellerstein

October 26, 2015

Transcript

  1. CONTROL: Continuous Output, Navigation and Transformation with Refinement On-Line

    “Of all men's miseries, the bitterest is this: to know so much and have control over nothing” – Herodotus
    •  Requirements for Control systems
      –  Early answers
      –  Refinement over time
      –  Interaction and ad-hoc control
    •  “Crystal Ball” vs. Black Box
  2. Goals for Online Processing

    •  New “greedy” performance regime
      –  Maximize 1st derivative of the “mirth index”
      –  Mirth defined on-the-fly
      –  Therefore need FEEDBACK and CONTROL
    [Chart: “mirth” (up to 100%) over Time, Online vs. Traditional]
  3. CONTROL Lessons

    •  Dream about UIs, work on systems
    •  Systems and statistics intertwine
    •  All 3 go together naturally
      –  User desires and behavior: 2 more things to model, predict
      –  “Performance” metrics need to reflect key user needs
    “What unlike things must meet and mate…” -- Art, Herman Melville
    The Trifecta
  4. The Trifecta c. 2000

    DB Systems: Pipelined query processing + sampling, declarative DSLs
    Statistics: Approximations and confidence bounds for queries
    UI: Immediacy, visualization of confidence, visual DSLs
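
The pairing of pipelined sampling with confidence bounds on this slide can be illustrated with a small sketch. This is not the CONTROL implementation. It assumes rows arrive in random order and that the table size is known up front, and it uses a normal-approximation interval with a finite-population correction. The names `online_sum`, `total_rows`, and `report_every` are hypothetical.

```python
import math


def online_sum(values, total_rows, z=1.96, report_every=1000):
    """Yield (rows_seen, sum_estimate, half_width) while scanning `values`.

    Assumptions (not from the slides): `values` is a random-order scan of a
    table with `total_rows` rows, and a CLT-style interval is acceptable.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's method)
    for v in values:
        n += 1
        delta = v - mean
        mean += delta / n
        m2 += delta * (v - mean)
        if n % report_every == 0 or n == total_rows:
            var = m2 / (n - 1) if n > 1 else 0.0
            # Finite-population correction: the interval shrinks to zero
            # once every row has been seen.
            fpc = max(0.0, (total_rows - n) / total_rows)
            half_width = z * total_rows * math.sqrt(var / n * fpc)
            yield n, total_rows * mean, half_width
```

A caller can stop consuming the generator as soon as the half-width is small enough, which is the "early answers, refinement over time" behavior the slides describe.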
  5. The DBO Database Sys

    Florin Rusu, Fei Xu, Luis Perez, Mingxi Wu, Ravi Jampani, Chris Jermaine, Alin Dobra
    University of Florida, Gainesville

    What is DBO?
    - DBO (version 0.2) is a prototype database engine for analytic, statistical processing
    - Key innovations:
      • Within seconds after query is issued, DBO gives statistically valid guess + bounds
      • Accuracy increases as query is executed; 100% accuracy at query completion
      • Works for arbitrary SELECT-FROM-WHERE-GROUP BY aggregate queries
      • For some queries (almost all single-table scans) 99%+ accuracy after only seconds
    - Key idea: Happy with the current estimate? Then kill the query!
    - DBO extends “classic” online aggregation to full, disk-based query plans; see our SIGMOD ‘07 paper

    How Does DBO Work?
    - Data are clustered randomly on the disk, so tuples flow through engine in random order
    - During processing, DBO finds “lucky” output tuples whose parts happen to be in memory
    - DBO uses those “lucky” tuples that it finds to guess final answer to the query
    - Example: SUM l_price (lineitem JOIN orders ON l_okey = o_okey AND o_shipdate > ‘1-1-97’)
      • Happen to have ($12.82, 1234) from lineitem, (1234, ‘2-12-98’) from orders in memory
      • So if probability of finding ($12.82, 1234, 1234, ‘2-12-98’) is p, add (12.82 / p) to estimate
    - By statistically characterizing what “lucky” means, can provide confidence bounds on estimate

    Levelwise QP in DBO
    - To search for output tuples, operations communicate their internal state with one another
    - Recognizing output tuples generally requires data from all input relations
    - Thus, all relational operations at each level of the query plan search for lucky tuples in a coordinated fashion
    - Called a Levelwise Step
    - Each levelwise step produces an estimate Ni

    [Figure: query plan over R1 ... R8 with intermediate results R12, R34, R56, R78, R1234, R5678 and estimators N1, N2, N3. Panels: (1) Original query plan; (2) All bottom-level joins evaluated concurrently in levelwise step #1. This step produces an estimator N1; (3) Remaining query plan; (4) All bottom-level joins evaluated concurrently in; (6) Final join evaluated in]

    - We have prepared five queries over TPC-H benchm
    - For comparison, have two identical machines; one

      SELECT l_returnflag, l_linestatus, sum(l_extendedprice*(1-l_discount) * (1+l_tax))
      FROM lineitem
      WHERE l_shipdate < '1998-09-01'
      GROUP BY l_returnflag, l_linestatus

      SELECT n_name, sum(l_extendedprice * (1-l_discount))
      FROM customer, orders, lineitem, nation
      WHERE c_custkey = o_custkey AND l_orderkey = o_orderkey AND l_returnflag = 'R'
        AND (o_orderdate < '1994-01-01') AND (o_orderdate > '1993-09-30')
        AND c_nationkey = n_nationkey
      GROUP BY n_name

      SELECT n_name, sum(l_extendedprice * (1 - l_discount))
      FROM customer, orders, lineitem, supplier, nation, reg
      WHERE c_custkey=o_custkey AND l_orderkey = o_orderkey AND l_suppkey = s_suppkey
        AND c_nationkey = s_nationkey AND s_nationkey = n_nationkey AND n_regionkey = r_regionkey
        AND l_discount > 0.08 AND r_name = 'ASIA'
        AND (o_orderdate > '1993-12-31') AND (o_orderdate < '1995-01-01')
      GROUP BY n_name
      "Find the revenue fr heavily discounted p 1994 that were sold t tomer in the Asian re per-country basis"

      SELECT n1.nationname, n2.nationname, extract(year from l_shipdate) as l_year, SUM(l_extendedprice * (1 - l_d
      FROM supplier, lineitem, orders, customer, nation n1, n
      WHERE s_suppkey = l_suppkey AND o_orderkey = l_orderkey AND c_custkey = o_custkey
        AND s_nationkey = n1.n_natio AND c_nationkey = n2.n_nationkey
        AND ((n1.n_name = 'GERMANY' AND n2.n_name = 'FRANCE') OR (n1.n_name = 'FRANCE' AND n2.n_name = 'GERMANY'))
        AND (l_shipdate < '1997-01-01') AND (l_shipdate > '1995-01-01') AND l_discount > 0.08
      GROUP BY n1.nationname, n2.nationname, l_year

      SELECT l_shipmode, extract(year from l_shipdate) as l_y
      FROM orders, lineitem
      WHERE o_orderkey = l_orderkey AND o_orderprior > '1-URG
        AND (l_receiptdate > '1993-12-31') AND (l_receiptdate < '1995-01-01')
        AND (l_commitdate < l_receiptdate) AND (l_shipdate < l_commitdate)
      GROUP BY l_shipmode, l_year
      "Find the numb shipment mode 1994 that did n

    The Demonstration
    - DBO has a simple, GUI front end that shows estima
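
The "add (12.82 / p) to estimate" rule on the poster is inverse-probability weighting. The sketch below illustrates that rule under a deliberately simplified model and is not DBO's actual algorithm: each lineitem and each order is assumed to be in memory independently with a known probability, so a joined output tuple is found with probability `p_line * p_order`. The function name, its parameters, and the ISO date strings are assumptions made for the example.

```python
import random


def ht_sum_estimate(lineitems, orders, p_line, p_order, rng, cutoff="1997-01-01"):
    """Estimate SUM(price) over lineitem JOIN orders with shipdate > cutoff.

    lineitems: list of (price, okey); orders: dict okey -> ISO shipdate string.
    Simplifying assumption: tuples are in memory independently with
    probabilities p_line and p_order (hypothetical parameters).
    """
    if p_line <= 0 or p_order <= 0:
        raise ValueError("inclusion probabilities must be positive")
    # Simulate which tuples happen to be in memory right now.
    mem_lines = [t for t in lineitems if rng.random() < p_line]
    mem_orders = {k: d for k, d in orders.items() if rng.random() < p_order}
    p = p_line * p_order  # chance that a given output tuple is "lucky"
    estimate = 0.0
    for price, okey in mem_lines:
        shipdate = mem_orders.get(okey)
        if shipdate is not None and shipdate > cutoff:
            estimate += price / p  # weight each lucky tuple by 1 / p
    return estimate
```

With both probabilities set to 1.0 every tuple is found and the estimate equals the exact sum. With smaller probabilities each lucky tuple stands in for the roughly 1/p output tuples that were not found.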
  6. datapeople

    Jeff Heer (Stanford), Tapan Parikh (Berkeley), Maneesh Agrawala (Berkeley), Joe Hellerstein (Berkeley)
    Sean Kandel, Diana MacLean, Ravi Parikh, Kuang Chen, Nicholas Kong, Wesley Willett
    2009
    http://deepresearch.org
  7. Shreddr: Columnar Data Entry

    [Figure: form 1, form 2, form 3, each with fields ID, Village, Age; worker 1, worker 2, worker 3]
  8. A Data Entry Trifecta, c. 2012

    People: Predictive, correlated, encoded data entry
    Data: Processing data with compute + crowd power
    Computation: Learning, modeling, prediction
  9. PREDICTIVE INTERACTION

    [Diagram labels: DSL, Data, Results, Data Vis, Visual Results, visualization, compilation, coding, ambiguous interaction, Response, Prediction, Visualization of probable Next Steps, disambiguation, guide, decide]
  10. A Data Wrangling Trifacta c. 2015

    People: Interactive visual profiling, predictive interaction
    Data: Data-centric DSLs + compilation for multiple scales & engines
    Computation: Sampling, approximation, learning, inference, prediction
  11. A Trifecta of Topics going forward

    People
    •  Perception & Comprehension: from Experts to Amateurs
    •  Interaction across scales: from Lake to Atom
    Data (and metadata!)
    •  Lineage across pipelines
    •  Harvesting usage
    •  Building and leveraging organizational context
    Computation
    •  Programmability of interactive data systems
    •  Prediction of data and code reuse/adaptation
    •  Platform variety and capability
  12. References

    •  Online Aggregation. [HHW SIGMOD 97]
    •  Interactive Data Analysis: The CONTROL Project. [HAC+ Computer 99]
    •  Potter’s Wheel: An Interactive Data Cleaning System. [RH VLDB 01]
    •  Data in the First Mile. [CHP CIDR 11]
    •  Shreddr: pipelined paper digitization for low-resource organizations. [CKY+ ACM Dev 12]
    •  Wrangler: Interactive visual specification of data transformation scripts. [KPHH CHI 11]
    •  Profiler: Integrated statistical analysis and visualization for data quality assessment. [KPP+ AVI 12]
    •  Predictive Interaction for Data Transformation. [HHK CIDR 15]
  13. TinyDB & Model-Driven Data Acquisition

    •  Data as a scarce resource
    •  Querying the physical world
    •  Statistical modeling
    The Trifecta
  14. The Trifecta c. 2000

    •  Systems: Minimizing duty cycles and communication for queries
    •  Statistics: Modeling, approximation, interpolation and prediction
    •  UX: SQL as an interface to the physical world