Slide 1

Guided Interaction over Large Datasets Arnab Nandi Computer Science & Engineering The Ohio State University

Slide 2

“Big Data”

Slide 3

Interacting with Large Datasets • Users want to explore and interact with the data when analyzing it • Data is too “big” • Slow to interact with • Unfamiliar • Hard to manage

Slide 4

Revisiting Status Quo • Databases have become really fast / efficient in going from query to result • Then why are we still unhappy? • Does this efficiency solve the overall user need? [Diagram: Query Intent → Interact → Query → Optimize → Plan → Execute → Result; "frontend" tasks take O(minutes), the typical database system O(seconds)]

Slide 5

Outline • Motivating Example • Challenges • Principles of Guided Interaction • Large-scale Browsing: Skimmer

Slide 6

Motivating Example Naïve user Alex Database Expert Bob Manager • Alex and Bob meet a Senior Manager • Forget name, need to look up contact info. • All they remember: manager of small group of senior researchers

Slide 7

Motivating Example: Naïve Alex • Visits corporate social network website 1. Browses all the “advanced search” forms 2. Uses Faceted Search interface to naively query for everyone in the company 3. Realizes you can’t drill down by seniority - There isn’t a “seniority” field, but age… 4. Goes back to “Birthday Search” form - Figures out senior employees are ~50 5. Adds age range, drills further, finds person Naïve user Alex

Slide 8

Motivating Example: Expert Bob • Opens up SQL Console to employee DB 1. SHOW TABLES; // reads… 2. DESC TABLES; // reads more… 3. SELECT emp.project, COUNT(*) AS c, AVG(emp.age) AS a FROM emp JOIN dept ON (emp.deptID = dept.ID) GROUP BY emp.project ORDER BY c ASC, a DESC LIMIT 3 4. SELECT emp.name, emp.cubicleID FROM emp JOIN dept ON (emp.deptID = dept.ID) WHERE dept.name='Research' AND emp.project='DatabasePrj' AND emp.designation='Manager' [Callouts: query 3 computes the average age & count per group; query 4 uses "DatabasePrj" from the previous query]

Slide 9

Motivating Example • Both users spent more time constructing and issuing sub-queries than executing them • Issued redundant / wrong queries • On a standard server, most queries take < 1 min • Session takes several minutes to an hour! • Most time was spent constructing the right query

Slide 10

Outline • Motivating Example • Challenges • Principles of Guided Interaction • Large-scale Browsing: Skimmer

Slide 11

Challenges • User’s lack of Knowledge • Dependency of Information • Iterative and Incremental Querying • Imprecise User Query Intent

Slide 12

Challenges Lack of Knowledge • Both users didn’t know about the • Schema • Data • Naïve user Alex did not know about • Query Language either • All 3 are needed to effectively issue queries • Otherwise, most time is spent issuing trial-and-error queries to learn more about the DB

Slide 13

Challenges Dependency of Information 3. Realizes you can’t drill down by seniority - There isn’t a “seniority” field, but age… 4. Goes back to “Birthday Search” form - Figures out senior employees are ~50 Naïve user Alex SELECT emp.project, COUNT(*) AS c, AVG(emp.age) AS a FROM emp JOIN dept ON (emp.deptID = dept.ID) GROUP BY emp.project ORDER BY c ASC, a DESC LIMIT 3 Database Expert Bob Average age & count per group

Slide 14

Challenges Dependency of Information • Finding out what age “Senior” meant required a secondary query • Cannot really write as a subquery • Dependency exists between final query and intermediate query results

Slide 15

Challenges Iterative & Incremental Querying • Observation: Users construct queries by first executing smaller parts • Cognitive capacity of users is limited • Query may be declarative, but users prefer iterative / incremental construction • Leads to a lot of requerying

Slide 16

Challenges Imprecise Query Intent • DB Expert Bob was looking for some notion of a small "group" of people • Hard to translate imprecise intents unless we're aware of the data • Only solution is to execute and see if the answer worked Average age & count per group SELECT emp.project, COUNT(*) AS c, AVG(emp.age) AS a FROM emp JOIN dept ON (emp.deptID = dept.ID) GROUP BY emp.project ORDER BY c ASC, a DESC LIMIT 3

Slide 17

Challenges • Our example was a simple one • Challenges become much harder with complex needs • n-way JOINs, Nested queries, complex aggregates… • Any database use-case with a human in the loop will face these problems

Slide 18

Solutions so far • Application-level • Slick UIs, customized to use case • No principled approach to solving overall user needs • Where are my standardized operators for overall data interaction? • Set of rules I can follow when building such a system? • Related work: • QBE, VizQL(Tableau), AQUA, CONTROL, Telegraph and more • Solve thin slices of the overall problem

Slide 19

Outline • Motivating Example • Challenges • Principles of Guided Interaction • Large-scale Browsing: Skimmer

Slide 20

Guided Interaction • Principled approach to solving these problems • More holistic thinking • To be included inside the database [Diagram: rapid iteration between the user's Query Intent / Interact step and the database's Optimize → Plan → Execute → Result pipeline]

Slide 21

Guided Interaction • Set of 3 design principles • Enumeration • Insights • Responsiveness • Database systems that keep these in mind can avoid the challenges discussed Example system: Skimmer

Slide 22

Guided Interaction Enumeration • The database is responsible for effectively enumerating all possible valid interactions with the data. • Removes the burden of schema / data / language knowledge from the user

Slide 23

Guided Interaction Enumeration: Example • What does an enumeration-enabled query system look like? • Important • This is one possible implementation • Focus on the concepts, not the implementation! • Portrays a simple use case • Many, far more complex systems can be built using these principles

Slide 24

Guided Interaction Enumeration: Example • Consider a SQL query interface • With Partial Query Completion • Typing in "em" has exposed projection, join, and selection options. [Figure: autocomplete for "WHERE em" showing prefix suggestions — "emily hanson", contacts.email, employee — annotated with type (COLUMN / TABLE) and cardinalities (4, 45K, 100K)]
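Such prefix completion over a schema catalog can be sketched with a sorted name list and binary search. This is a minimal sketch, not the system described on the slide; the catalog entries, kinds, and cardinalities below are illustrative assumptions.

```python
import bisect

# Illustrative schema catalog: (name, kind, cardinality), sorted by name.
CATALOG = sorted([
    ("contacts.email", "COLUMN", 45_000),
    ("emp.name",       "COLUMN", 4),
    ("employee",       "TABLE",  100_000),
    ("dept",           "TABLE",  200),
])

NAMES = [name for name, _, _ in CATALOG]

def suggest(prefix, limit=10):
    """Return catalog entries whose name starts with `prefix`."""
    lo = bisect.bisect_left(NAMES, prefix)
    hi = bisect.bisect_right(NAMES, prefix + "\uffff")  # end of prefix range
    return CATALOG[lo:hi][:limit]

# Typing "em" exposes both column and table completions:
for name, kind, card in suggest("em"):
    print(name, kind, card)
```

Because the catalog is kept sorted, each keystroke costs only two binary searches plus a slice, which is how an interface can stay responsive while the user types.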

Slide 25

Guided Interaction Insights • The database must attempt to surface as many insights from the data as possible. • Removes informational dependencies • Aids expression of query intent • Note: Should not overwhelm the user

Slide 26

Guided Interaction Insights: Example • Consider a SQL interface with range / numeric value selection • Visual / interactive feedback saves a dependent query • Does my DB let me build something like this? [Figure: "WHERE emp.age > 60" with the distribution of values in the column shown inline]
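The inline distribution behind such a range selector can be sketched as a single bucketing pass over the column. This is an illustrative sketch; the bucket count and the sample ages are made-up data.

```python
from collections import Counter

def histogram(values, lo, hi, buckets=10):
    """Bucket counts for rendering an inline distribution next to a range slider."""
    width = (hi - lo) / buckets
    counts = Counter()
    for v in values:
        if lo <= v < hi:
            counts[int((v - lo) // width)] += 1
        elif v == hi:            # closed upper edge goes to the last bucket
            counts[buckets - 1] += 1
    return [counts.get(b, 0) for b in range(buckets)]

# Hypothetical ages; the UI can highlight buckets matching `emp.age > 60`.
ages = [23, 31, 35, 44, 47, 52, 58, 61, 64, 70]
bars = histogram(ages, 0, 100, buckets=10)
```

The bucket list is cheap to compute (or to precompute as a database histogram), which is what lets the interface show feedback instead of forcing the user to issue a separate exploratory query.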

Slide 27

Slide 27 text

Guided Interaction Responsiveness • All interactions must be instantaneous even if inaccurate. • Fluid data interaction is key to getting insights • Tradeoff accuracy for near-instantaneous responses (i.e. <100ms*) * R. Miller. “Response time in man-computer conversational transactions” FJCC, 1968.

Slide 28

Guided Interaction Responsiveness: Example • SQL query interface, Partial Query Completion • Need to deliver results in <100ms [Figure: the same "WHERE em" prefix-suggestion popup, with type + cardinality annotations]

Slide 29

Guided Interaction solves shortcomings in the Query-Result Model • Enumeration • Insights • Responsiveness

Slide 30

Outline • Motivating Example • Challenges • Principles of Guided Interaction • Large-scale Browsing: Skimmer

Slide 31

Outline • Motivating Example • Challenges • Principles of Guided Interaction • Large-scale Browsing: Skimmer • Overview • Interface • Algorithms • Evaluation

Slide 32

Outline • Motivating Example • Challenges • Principles of Guided Interaction • Large-scale Browsing: Skimmer • Overview • Interface • Algorithms • Evaluation

Slide 33

Skimmer: Large-scale Browsing Naïve user Alex Database Expert Bob Manager • Alex and Bob look for a Senior Manager • Solution: Let’s skim the entire employee directory!

Slide 34

Skimmer: Large-scale Browsing • Often more efficient than formulating an articulate query • Results presented can overwhelm both the system and the user • SELECT * FROM emp JOIN dept ON (emp.deptID = dept.ID) ORDER BY emp.age

Slide 35

A scrolling interface • Scrolling is a widely used interface • Constraints in fast scrolling: • System constraints: Data distortion • User constraints: Visual perception, memory retention etc.

Slide 36

Solution: Skimmer • Guided Interaction principles • Enumeration • Intuitive actions: page up, page down, change speed • Insights • Maximize the amount of information (not tuples) • Be sure not to overwhelm the user • Responsiveness • Efficiently surface insights • Reduce interface-to-data overhead (network, display)

Slide 37

Outline • Motivating Example • Challenges • Principles of Guided Interaction • Large-scale Browsing: Skimmer • Overview • Interface • Algorithms • Evaluation

Slide 38

User Interface

Slide 39

User Interface • Input: Sorted query result R • Output: R requires S pages {P1, P2, …, PS} for display • Display representatives: {D1, D2, …, DS} • Di ⊆ Pi, computed based on: • User's current scrolling speed • Contents of page Pi • User's current browsing history • Benefit: Reduces information overload by showing summarized, non-redundant and diverse information

Slide 40

Goodness Metric: Information Loss • Tuplewise information loss of a non-displayed tuple t_nd from P_i, where t_d is the most similar tuple from D_i ∪ H(sid): TIL(t_nd, sid) = V(t_nd, t_d) • Pagewise information loss score of page P_i: PIL(P_i, sid) = Σ_{t_p ∈ P_i} TIL(t_p, sid) • Cumulative information loss for result set R and scroll log SL: CIL(SL, R) = Σ_{sid=1}^{|SL|} PIL(P_i, sid)
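The information-loss metric on this slide can be sketched directly from its definitions. This is a minimal 1-D sketch: the dissimilarity function V (Euclidean here) and the tiny example tuples are illustrative assumptions, and the display/history sets are passed in explicitly rather than looked up by session id.

```python
import math

def V(a, b):
    """Illustrative dissimilarity between two numeric tuples (Euclidean)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def TIL(t_nd, shown):
    """Tuplewise loss: distance from t_nd to the most similar shown tuple
    (shown = current display representatives D_i plus history H(sid))."""
    return min(V(t_nd, t_d) for t_d in shown)

def PIL(page, displayed, history):
    """Pagewise loss: sum of tuplewise losses over the page's tuples
    (displayed tuples contribute zero, since they are their own match)."""
    shown = displayed + history
    return sum(TIL(t, shown) for t in page)

def CIL(scroll_log):
    """Cumulative loss: total pagewise loss over the scroll log SL."""
    return sum(PIL(p, d, h) for p, d, h in scroll_log)

# Tiny example: a page of three 1-D tuples, with only (1.0,) displayed.
page = [(0.0,), (1.0,), (2.0,)]
loss = PIL(page, displayed=[(1.0,)], history=[])
```

Lower CIL means the representatives shown during scrolling preserved more of the page's information, which is exactly what the sampling algorithms below try to minimize.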

Slide 41

Outline • Motivating Example • Challenges • Principles of Guided Interaction • Large-scale Browsing: Skimmer • Overview • Interface • Algorithms • Evaluation

Slide 42

Naïve Sampling • Compute set Di of Ki tuples from page Pi • Ki is determined based on the user's current scrolling speed • Random sampling • Pick Ki random tuples from Pi • Uniform sampling • Pick Ki evenly spaced tuples from Pi
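The two naïve strategies can be sketched in a few lines, with the page given as a Python list and Ki as `k`:

```python
import random

def random_sample(page, k, seed=None):
    """Pick k tuples from the page uniformly at random (page order preserved)."""
    rng = random.Random(seed)
    idx = sorted(rng.sample(range(len(page)), k))
    return [page[i] for i in idx]

def uniform_sample(page, k):
    """Pick k evenly spaced tuples from the page."""
    n = len(page)
    return [page[i * n // k] for i in range(k)]
```

Both run in O(page size), but neither looks at tuple contents, so similar or duplicate rows are as likely to be shown as informative ones; that is the gap the clustering-based methods address.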

Slide 43

K-Medoid • Clustering algorithm that partitions a dataset D, containing N elements, into K partitions • Each partition is represented by an actual sample point • It minimizes the absolute-error criterion: E_KMedoids(P) = Σ_{j=1}^{K} Σ_{p ∈ C_j} V(p, o_j) • Best known heuristic solutions: PAM, CLARA and CLARANS

Slide 44

Local K-Medoid (LKMed) • Di = PAM(Pi, Ki) • PAM Algorithm: • Initialize cluster centers • Repeat until convergence • Assignment: Assign each point to its nearest cluster • Update: Swap-based greedy update of cluster centers • CLARA and CLARANS not suitable for small datasets [Diagram: current representatives A, B]
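The swap-based greedy update can be sketched as follows. This is a simplified PAM-style sketch, not the paper's implementation: initialization by the first k tuples and first-improvement swap order are assumptions, and `dist` is supplied by the caller.

```python
def pam(points, k, dist):
    """Greedy swap-based K-medoid on one page (PAM-style sketch)."""
    medoids = list(points[:k])              # simplistic initialization
    def cost(meds):
        # Absolute-error criterion: each point pays distance to nearest medoid.
        return sum(min(dist(p, m) for m in meds) for p in points)
    best = cost(medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for p in points:
                if p in medoids:
                    continue
                cand = medoids[:i] + [p] + medoids[i + 1:]
                c = cost(cand)
                if c < best:                # accept any improving swap
                    medoids, best, improved = cand, c, True
    return medoids
```

On two well-separated 1-D clusters this converges to the middle element of each cluster; each sweep costs O(K·(N−K)) cost evaluations, which is why the slide flags PAM as expensive for large pages.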

Slide 45

Importance of History • Our goal: Show non-redundant, diverse information to the user [Figure: representatives shown on page 1 vs page 2]

Slide 46

Historical K-Medoid (HKMed) • Di = HKMed(Pi, Ki) • Minimizes the exact PIL score • HKMed Algorithm • Initialize the cluster centers • Repeat until convergence • Assignment: Assign each point to its nearest cluster • Update: Update unfixed cluster centers based on greedy swap [Diagram: historical representatives and current representatives A, B, C, D]
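The history-aware variant can be sketched by letting historical representatives contribute to the cost while never being swapped. As with the PAM sketch, initialization and the 1-D example data are illustrative assumptions.

```python
def hkmed(page, k, history, dist):
    """K-medoid where historical representatives act as fixed centers;
    only the k new ('unfixed') medoids are updated by greedy swaps."""
    def cost(meds):
        shown = meds + history              # history reduces the cost for free
        return sum(min(dist(p, m) for m in shown) for p in page)
    medoids = list(page[:k])
    best = cost(medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for p in page:
                if p in medoids or p in history:
                    continue
                cand = medoids[:i] + [p] + medoids[i + 1:]
                c = cost(cand)
                if c < best:
                    medoids, best, improved = cand, c, True
    return medoids
```

Because tuples near a historical representative are already "covered", the new medoids are pushed toward regions the user has not seen yet, which is exactly the non-redundancy the slide asks for.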

Slide 47

Performance Issues: Responsiveness • Computational constraints: Satisfy user's non-linear scrolling behavior • Next page representative is selected based on: • Past displayed content • User's current scroll rate • Desired computation time: Less than 100 ms • PAM: O(K·(N−K)²) distance computations per iteration

Slide 48

Approximate K-Medoid • K-Means is an efficient partition-based clustering algorithm that divides a dataset into K partitions • It is O(K·N) per iteration, compared to O(K·(N−K)²) for K-Medoid • Each partition is represented by its centroid • It minimizes the square-error criterion: E_KMeans(P) = Σ_{j=1}^{K} Σ_{p ∈ C_j} |p − m_j|² • It can only be used for numerical attributes and the Euclidean distance function

Slide 49

Local K-Means (LKMeans) • Algorithm • KCenters = KMeans (Pi , Ki ) • Di = NN (KCenters, Pi ) • KMeans Algorithm • Initialize cluster centers • Repeat until convergence • Assignment: Assign each point to nearest cluster. • Update: New cluster centers by computing mean of all assigned points.
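A 1-D sketch of the two steps on this slide: run K-means, then snap each centroid to the nearest actual tuple (the NN step), so every displayed representative is a real row. Initialization by the first k tuples and the fixed iteration count are simplifications.

```python
def kmeans(points, k, iters=20):
    """Plain K-means on 1-D numeric data (illustrative)."""
    centers = list(points[:k])
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assignment: each point joins its nearest center's cluster.
            clusters[min(range(k), key=lambda j: abs(p - centers[j]))].append(p)
        # Update: new centers are cluster means (keep old center if empty).
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers

def lkmeans(page, k):
    """Local K-Means: Di = NN(KMeans(Pi, Ki), Pi)."""
    return sorted({min(page, key=lambda p: abs(p - c)) for c in kmeans(page, k)})
```

The nearest-neighbor snap matters because a centroid is a synthetic average, not a tuple the user could actually click on.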

Slide 50

Historical K-Means (HKMeans) • Similar motivation as historical K-Medoid • Algorithm • KCenters = HKMeans(Pi, Ki) • Di = NN(KCenters, Pi) • HKMeans Algorithm • Initialize cluster centers • Repeat until convergence • Assignment: Assign each point to its nearest cluster • Update: New unfixed cluster centers by computing the mean of all assigned points [Diagram: historical representative held fixed]

Slide 51

Effect of Initialization • HKMeans is worse than LKMeans in terms of CIL score • Unlike HKMed, HKMeans can get caught in a local minimum • Bad initial cluster centers • Representatives determined by outliers [Diagram: historical representative]

Slide 52

Two-Phase K-Means (TPKMeans) • Phase 1 • Choose good initial cluster centers using LKMeans • Phase 2 • Select non-redundant representatives using HKMeans • Benefits • Information quality quite close to HKMed • Runs almost N times faster than K-Medoid-based algorithms
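The two phases can be sketched end-to-end for 1-D data. Initialization, the fixed iteration counts, and the final snap-to-tuple step are illustrative simplifications of the method described on the slide.

```python
def tpkmeans(page, k, history, iters=20):
    """Two-Phase K-Means sketch (1-D, illustrative):
    Phase 1: plain K-means for good initial centers (Local K-Means).
    Phase 2: refine with historical reps as fixed centers (Historical K-Means)."""
    # --- Phase 1: local K-means for initialization ---
    centers = list(page[:k])
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in page:
            clusters[min(range(k), key=lambda j: abs(p - centers[j]))].append(p)
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    # --- Phase 2: historical K-means; history centers stay fixed ---
    for _ in range(iters):
        allc = centers + list(history)
        clusters = [[] for _ in allc]
        for p in page:
            clusters[min(range(len(allc)), key=lambda j: abs(p - allc[j]))].append(p)
        # Only the first k (unfixed) centers are updated.
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters[:k])]
    # Snap centers to actual tuples for display.
    return sorted({min(page, key=lambda p: abs(p - c)) for c in centers})
```

Phase 1 supplies the good initialization that plain HKMeans lacks, so Phase 2 can steer the representatives away from history without collapsing onto outliers.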

Slide 53

Two-Phase K-Means (TPKMeans) [Figure: Phase 1 Local K-Means followed by Phase 2 Historical K-Means]

Slide 54

Outline • Motivating Example • Challenges • Principles of Guided Interaction • Large-scale Browsing: Skimmer • Overview • Interface • Algorithms • Evaluation: • Performance • Information Quality • User Study

Slide 55

Experimental Goals • Computational Performance • Page size • Number of dimensions • Sampling rate • Information Quality • User Study

Slide 56

Performance • HKMed and LKMed need more time • Not suitable for large page sizes or high sampling rates • HKMed is faster than LKMed • All algorithms satisfy the interactive response constraint

Slide 57

Experimental Goals • Computational Performance • Information Quality • Information Gain: we use Random Sampling as baseline B: IG(A, B) = CIL_B(SL, R) / CIL_A(SL, R) • Page size • Number of dimensions • Sampling rate • User Study

Slide 58

Information Quality • HKMed is best, followed by TPKMeans and LKMed • HKMeans is close to random sampling • Information gain decreases with an increasing number of dimensions

Slide 59

Summary Recommendations (by sampling rate × page size): Two-Phase K-Means in three of the four settings; Historical K-Medoids when both page size and sampling rate are small

Slide 60

Experimental Goals • Computational Performance • Information Quality • User Study • Users’ efficiency and quality of response to three tasks

Slide 61

User Study • Tasks: Interesting Patterns, Regression Task, Discriminating Features • Similar or better quality of response for all three tasks • Users are able to do the tasks 1.5–2 times faster • Less stress due to reduced information

Slide 62

Skimmer: Recap • Scrolling-aware browsing: Introduced the idea of selecting representative tuples to enable variable-speed scrolling through relational data • Information loss metric: Quantified the loss of information incurred by browsing representative tuples • Algorithms: Developed and compared five new scrolling-based sampling algorithms that minimize information loss • Interaction constraints: Proposed efficiently computable algorithms that satisfy the fast-scrolling requirement

Slide 63

Outline • Motivating Example • Challenges • Principles of Guided Interaction • Large-scale Browsing: Skimmer

Slide 64

Conclusion • Interacting with Large Datasets is hard! • Challenges • Principles of Guided Interaction • Enumeration • Insights • Responsiveness • Large-scale Browsing: Skimmer • Scrolling & history-aware, information-based clustering of tabular data

Slide 65

Co-authors • Skimmer: Rapid Scrolling of Relational Query Results – SIGMOD 2012 • Manish Singh, Arnab Nandi, H. V. Jagadish • Guided Interaction: Rethinking the Query-Result Paradigm – VLDB 2011 • Arnab Nandi, H. V. Jagadish • Assisted querying using instant-response interfaces – SIGMOD 2007 (demo) • Arnab Nandi, H. V. Jagadish

Slide 66

http://arnab.org