Guided Interaction Over Large Datasets

Guided Interaction over Large Datasets Arnab Nandi Computer Science &
Engineering The Ohio State University

“Big Data”

Interacting with Large Datasets • Users want to explore and
interact with the data when analyzing it • Data is too “big” • Slow to interact with • Unfamiliar • Hard to manage

Revisiting Status Quo • Databases have become really fast /
efficient in going from query to result • Then why are we still unhappy? • Does this efficiency solve the overall user need? [Figure: stages Query Intent, Interact, Query, Optimize, Plan, Execute, Result; “frontend” tasks: O(minutes); typical database system: O(seconds)]

Outline • Motivating Example • Challenges • Principles of Guided
Interaction • Large-scale Browsing: Skimmer

Motivating Example Naïve user Alex Database Expert Bob Manager •
Alex and Bob meet a Senior Manager • Forget name, need to look up contact info. • All they remember: manager of small group of senior researchers

Motivating Example: Naïve Alex • Visits corporate social network website
1. Browses all the “advanced search” forms 2. Uses Faceted Search interface to naively query for everyone in the company 3. Realizes you can’t drill down by seniority - There isn’t a “seniority” field, but age… 4. Goes back to “Birthday Search” form - Figures out senior employees are ~50 5. Adds age range, drills further, finds person Naïve user Alex

Motivating Example: Expert Bob • Opens up SQL Console to
employee DB 1. SHOW TABLES; -- reads… 2. DESC TABLES; -- reads more… 3. Average age & count per group: SELECT emp.project, COUNT(*) AS c, AVG(emp.age) AS a FROM emp JOIN dept ON (emp.deptID = dept.ID) GROUP BY emp.project ORDER BY c ASC, a DESC LIMIT 3; 4. Uses 'DatabasePrj' from the previous query: SELECT emp.name, emp.cubicleID FROM emp JOIN dept ON (emp.deptID = dept.ID) WHERE dept.name = 'Research' AND emp.project = 'DatabasePrj' AND emp.designation = 'Manager'; Database Expert Bob

Motivating Example • Both users spent most of their time constructing and
issuing sub-queries • Issued redundant / wrong queries • On a standard server, most queries take < 1 min • Whole session takes several minutes to an hour! • Most time was spent constructing the right query

Challenges • User’s lack of Knowledge • Dependency of Information
• Iterative and Incremental Querying • Imprecise User Query Intent

Challenges Lack of Knowledge • Both users didn’t know about
the • Schema • Data • Naïve user Alex did not know about • Query Language either • All 3 are needed to effectively issue queries • Otherwise, most time is spent issuing trial-and-error queries to learn more about the DB

Challenges Dependency of Information 3. Realizes you can’t drill down
by seniority - There isn’t a “seniority” field, but age… 4. Goes back to “Birthday Search” form - Figures out senior employees are ~50 Naïve user Alex SELECT emp.project, COUNT(*) AS c, AVG(emp.age) AS a FROM emp JOIN dept ON (emp.deptID = dept.ID) GROUP BY emp.project ORDER BY c ASC, a DESC LIMIT 3 Database Expert Bob Average age & count per group

Challenges Dependency of Information • Finding out what age “Senior”
meant required a secondary query • Cannot really write as a subquery • Dependency exists between final query and intermediate query results

Challenges Iterative & Incremental Querying • Observation: Users construct queries
by first executing smaller parts • Cognitive capacity of users is limited • Query may be declarative, but users prefer iterative / incremental construction • Leads to a lot of requerying

Challenges Imprecise Query Intent • DB Expert Bob was
looking for some notion of a “small group” of people • Hard to translate imprecise intents unless we’re aware of the data • Only solution is to execute and see if the answer worked • Average age & count per group: SELECT emp.project, COUNT(*) AS c, AVG(emp.age) AS a FROM emp JOIN dept ON (emp.deptID = dept.ID) GROUP BY emp.project ORDER BY c ASC, a DESC LIMIT 3

Challenges • Our example was a simple one • Challenges
become much harder with complex needs • n-way JOINs, Nested queries, complex aggregates… • Any database use-case with a human in the loop will face these problems

Solutions so far • Application-level • Slick UIs, customized to
use case • No principled approach to solving overall user needs • Where are my standardized operators for overall data interaction? • Set of rules I can follow when building such a system? • Related work: • QBE, VizQL(Tableau), AQUA, CONTROL, Telegraph and more • Solve thin slices of the overall problem

Guided Interaction • Principled Approach to solving these problems •
More holistic thinking • To be included inside database Interact Optimize Execute Query Plan Result Rapid Iteration Interact Query Intent Database

Guided Interaction • Set of 3 design principles • Enumeration
• Insights • Responsiveness • Database systems that keep these in mind can avoid the challenges discussed Example system: Skimmer

Guided Interaction Enumeration • The database is responsible for effectively
enumerating all possible valid interactions with the data. • Removes burden of schema / data / language knowledge off the user

Guided Interaction Enumeration: Example • What does an enumeration-enabled query
system look like? • Important • One possible implementation • Focus on the concepts, not this particular implementation! • Portrays a simple use case • Many far more complex systems can be built using these principles

Guided Interaction Enumeration: Example • Consider SQL query interface •
With Partial Query Completion • Typing in “em” has exposed projection, join, and selection options. [Figure: query box containing WHERE with a dropdown of prefix suggestions, each annotated with type + cardinality: emily hanson (emp.name, 4), contacts.email (COLUMN, 45K), employee (TABLE, 100K)]
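A minimal Python sketch of the prefix-suggestion idea, not the system shown in the talk. The catalog contents, the `complete` function, the token-level matching rule, and the ordering by cardinality are all assumptions made for illustration; a real implementation would need an index (for example a trie) rather than a linear scan to stay interactive.

```python
import re

# Hypothetical catalog of (suggestion, type, cardinality) entries.
# The first three rows echo the slide; "dept" and its count are made up.
CATALOG = [
    ("emily hanson", "emp.name", 4),
    ("contacts.email", "COLUMN", 45_000),
    ("employee", "TABLE", 100_000),
    ("dept", "TABLE", 12),
]

def complete(prefix):
    """Return entries that have a token starting with prefix, rarest first."""
    p = prefix.lower()
    hits = [
        entry for entry in CATALOG
        if any(tok.startswith(p) for tok in re.split(r"[.\s]+", entry[0].lower()))
    ]
    # Ordering by ascending cardinality is an assumption, not the talk's rule.
    return sorted(hits, key=lambda entry: entry[2])

suggestions = complete("em")
for text, kind, cardinality in suggestions:
    print(f"{text:16} {kind:10} {cardinality}")
```

Matching on tokens rather than on the whole string is what lets "em" reach contacts.email as well as employee, so one keystroke exposes values, columns, and tables at once.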

Guided Interaction Insights • The database must attempt to surface
as many insights from the data as possible. • Removes informational dependencies • Aids expression of query intent • Note: Should not overwhelm the user

Guided Interaction Insights: Example • Consider SQL interface with range
/ numeric value selection • Visual / interactive feedback saves dependent query • Does my DB let me build something like this? [Figure: WHERE emp.age > 60, with the distribution of values in the column shown alongside]
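As a rough illustration of the kind of insight meant here, the Python sketch below builds a fixed-width histogram of a column so the user can see where values fall before choosing a threshold. The `histogram` helper, the bin width, and the sample ages are invented for the example.

```python
from collections import Counter

def histogram(values, bin_width):
    """Count values per fixed-width bin; keys are each bin's lower bound."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Made-up sample of emp.age values.
ages = [23, 27, 31, 34, 38, 41, 45, 52, 58, 61, 63]

hist = histogram(ages, 10)
# Fraction of rows that a predicate such as "emp.age > 60" would keep.
share_over_60 = sum(v > 60 for v in ages) / len(ages)

for lower, count in hist.items():
    print(f"{lower}-{lower + 9}: {'#' * count}")
```

Showing the distribution next to the predicate answers "what does senior mean in this data?" without a separate exploratory query.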

Guided Interaction Responsiveness • All interactions must be instantaneous, even
if inaccurate. • Fluid data interaction is key to getting insights • Trade off accuracy for near-instantaneous responses (i.e. <100ms*) * R. Miller. “Response time in man-computer conversational transactions” FJCC, 1968.

Guided Interaction Responsiveness: Example • SQL query interface, Partial Query
Completion • Need to deliver results in <100ms [Figure: query box containing WHERE with prefix suggestions and their type + cardinality: emily hanson (emp.name, 4), contacts.email (COLUMN, 45K), employee (TABLE, 100K)]

Guided Interaction solves shortcomings in the Query-Result Model • Enumeration
• Insights • Responsiveness

Interaction • Large-scale Browsing: Skimmer • Overview • Interface • Algorithms • Evaluation

Skimmer: Large-scale Browsing Naïve user Alex Database Expert Bob Manager
• Alex and Bob look for a Senior Manager • Solution: Let’s skim the entire employee directory!

Skimmer: Large-scale Browsing • Often more efficient than formulating articulate
query • Results presented can overwhelm both the system and the user • SELECT * FROM emp JOIN dept ON (emp.deptID = dept.ID) ORDER BY emp.age

A scrolling interface • Scrolling is a widely used interface
• Constraints in fast scrolling: • System constraints: Data distortion • User constraints: Visual perception, memory retention etc.

Solution: Skimmer • Guided Interaction principles • Enumeration • Intuitive
actions: page up, page down, change speed • Insights • Maximize the amount of information (not tuples) • Be sure not to overwhelm the user • Responsiveness • Efficiently surface insights • Reduce interface—data overhead (network, display)

User Interface

User Interface • Input: Sorted query result R • Output:
R requires S pages {P1, P2, …, PS} for display • Display representatives: {D1, D2, …, DS} • Di ⊆ Pi, computed based on: • User’s current scrolling speed • Contents of page Pi • User’s current browsing history • Benefit: Reduces information overload by showing summarized, non-redundant and diverse information

Goodness Metric: Information Loss • Tuplewise information loss of a
non-displayed tuple $t_{nd}$ from $P_i$: $TIL(t_{nd}, sid) = V(t_{nd}, t_d)$, where $t_d$ is the most similar tuple from $D_i \cup H(sid)$ • Pagewise information loss score of page $P_i$: $PIL(P_i, sid) = \sum_{t_p \in P_i} TIL(t_p, sid)$ • Cumulative information loss for result set R and scroll log SL: $CIL(SL, R) = \sum_{sid=1}^{|SL|} PIL(P_i, sid)$
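The three scores can be sketched in Python as follows. This is an illustration under assumptions: tuples are numeric, V is taken to be Euclidean distance (the slide does not fix V), and the scroll log is modeled as a list of (page, representatives, history) triples. Requires Python 3.8+ for `math.dist`.

```python
import math

def til(t_nd, shown):
    """Tuplewise loss: distance from a hidden tuple to its most similar shown tuple."""
    return min(math.dist(t_nd, t_d) for t_d in shown)

def pil(page, reps, history=()):
    """Pagewise loss: sum of TIL over the tuples of the page that are not displayed."""
    shown = list(reps) + list(history)  # D_i together with H(sid)
    return sum(til(t, shown) for t in page if t not in reps)

def cil(scroll_log):
    """Cumulative loss over a scroll log of (page, reps, history) steps."""
    return sum(pil(page, reps, history) for page, reps, history in scroll_log)

page = [(0, 0), (0, 1), (10, 10)]
reps = [(0, 0)]
loss_without_history = pil(page, reps)
loss_with_history = pil(page, reps, history=[(10, 9)])
```

In this toy page the hidden tuple (10, 10) is far from the only representative, but a previously shown tuple (10, 9) already covers it, so the loss drops once history is taken into account.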

Naïve Sampling • Compute set Di = Ki tuples from
page Pi • Ki is determined based on user’s current scrolling speed • Random sampling • Pick Ki random tuples from Pi • Uniform sampling • Pick Ki evenly spaced tuples from Pi
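Both baselines are easy to sketch in Python. The function names and the choice of taking the midpoint of each of the Ki equal slices are assumptions; the slide only says "random" and "evenly spaced". Both functions assume 1 <= k <= len(page).

```python
import random

def random_sample(page, k, seed=None):
    """Pick k random tuples from the page, kept in page order."""
    rng = random.Random(seed)
    indices = sorted(rng.sample(range(len(page)), k))
    return [page[i] for i in indices]

def uniform_sample(page, k):
    """Pick k evenly spaced tuples: the midpoint of each of k equal slices."""
    n = len(page)
    return [page[(2 * i + 1) * n // (2 * k)] for i in range(k)]

page = list(range(10))          # stand-in for ten sorted tuples
evenly_spaced = uniform_sample(page, 3)
randomly_picked = random_sample(page, 4, seed=0)
```

Neither method looks at the tuple contents, which is why they serve as baselines for the clustering-based approaches.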

K-Medoid • Clustering algorithm that partitions a dataset D, containing
N elements, into K partitions • Each partition is represented by an actual sample point • It minimizes the following absolute error criterion: $E_{KMedoids}(P) = \sum_{j=1}^{K} \sum_{p \in C_j} V(p, o_j)$, where $o_j$ is the representative of cluster $C_j$ • Best known heuristic solutions: PAM, CLARA and CLARANS

Local K-Medoid (LKMed) • Di = PAM (Pi , Ki
) • PAM Algorithm: • Initialize cluster centers • Repeat until convergence • Assignment: Assign each point to nearest cluster • Update: Swap-based greedy update of cluster centers • CLARA and CLARANS not suitable for small datasets [Figure: points A and B; legend: Current Representative]
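A compact Python sketch of a PAM-style greedy swap, assuming numeric tuples and Euclidean distance. It is a simplified variant for illustration (naive initialization, accept any improving swap), not the exact PAM procedure or the implementation used in Skimmer.

```python
import math

def cost(points, medoids):
    """Absolute-error criterion: each point's distance to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(points, k):
    medoids = list(points[:k])          # naive initialization (assumption)
    best = cost(points, medoids)
    improved = True
    while improved:                     # repeat until no swap lowers the cost
        improved = False
        for i in range(k):
            for cand in points:
                if cand in medoids:
                    continue
                # Try swapping medoid i for a non-medoid point.
                trial = medoids[:i] + [cand] + medoids[i + 1:]
                c = cost(points, trial)
                if c < best:
                    medoids, best, improved = trial, c, True
    return medoids

page = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
reps = pam(page, 2)
```

Every cost evaluation touches all points for every candidate swap, which is where the quadratic per-iteration cost of K-Medoid comes from.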

Importance of History • Our goal: Show non-redundant, diverse information
to the user page 1 page 2

Historical K-Medoid (HKMed) • Di = HKMed (Pi , Ki
) • Minimizes the exact PIL score • HKMed Algorithm • Initialize the cluster centers • Repeat until convergence • Assignment: Assign each point to nearest cluster. • Update: Update unfixed cluster centers based on greedy swap D C A B Historical Representative Current Representative
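One way to picture the historical variant in Python: tuples already shown act as fixed centers that take part in the cost but are never swapped. The `hkmed` signature with an explicit `history` argument, Euclidean distance, and the naive initialization are assumptions for this sketch.

```python
import math

def hkmed(page, k, history):
    """Pick k medoids from page; history tuples are fixed, already-shown centers."""
    fixed = list(history)

    def cost(medoids):
        centers = medoids + fixed
        return sum(min(math.dist(p, c) for c in centers) for p in page)

    candidates = [p for p in page if p not in fixed]
    medoids = candidates[:k]            # naive initialization (assumption)
    best = cost(medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for cand in candidates:
                if cand in medoids:
                    continue
                # Only the unfixed centers are ever swapped.
                trial = medoids[:i] + [cand] + medoids[i + 1:]
                c = cost(trial)
                if c < best:
                    medoids, best, improved = trial, c, True
    return medoids

page = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
after_low_history = hkmed(page, 1, history=[(0, 0)])
after_high_history = hkmed(page, 1, history=[(10, 10)])
```

With one representative to spend, the choice moves to whichever cluster the user has not yet seen, which is the non-redundancy the history is meant to provide.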

Performance Issues: Responsiveness • Computational constraints: Satisfy user’s non-linear scrolling
behavior • Next page representative is selected based on: • Past displayed content • User’s current scroll rate • Desired computation time: Less than 100 ms • PAM: $O(K(N-K)^2)$ distance computations per iteration

Approximate K-Medoid • K-Means is an efficient partition-based clustering
algorithm. It divides a dataset into K partitions. • It is $O(KN)$ as compared to $O(K(N-K)^2)$ in K-Medoid • Each partition is represented by the partition centroid. • It minimizes the following square-error criterion: $E_{KMeans}(P) = \sum_{j=1}^{K} \sum_{p \in C_j} |p - m_j|^2$, where $m_j$ is the centroid of cluster $C_j$ • It can only be used for numerical attributes and a Euclidean distance function.
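A quick back-of-the-envelope comparison of the two per-iteration costs quoted above. The values N = 100 and K = 10 are illustrative and not taken from the talk.

```latex
\frac{K\,(N-K)^2}{K\,N} \;=\; \frac{(N-K)^2}{N} \;\approx\; N \quad (K \ll N),
\qquad
\text{e.g. } N = 100,\ K = 10:\quad
\frac{10 \cdot 90^2}{10 \cdot 100} \;=\; \frac{81\,000}{1\,000} \;=\; 81 .
```

So for small K relative to the page size, a K-Means iteration needs roughly N times fewer distance computations than a K-Medoid iteration.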

Local K-Means (LKMeans) • Algorithm • KCenters = KMeans (Pi
, Ki ) • Di = NN (KCenters, Pi ) • KMeans Algorithm • Initialize cluster centers • Repeat until convergence • Assignment: Assign each point to nearest cluster. • Update: New cluster centers by computing mean of all assigned points.
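The two steps (cluster, then map each centroid back to a real tuple) can be sketched in Python as below. The evenly spread initialization, the iteration cap, and Euclidean distance are assumptions; the sketch also does not guard against two centroids mapping to the same tuple.

```python
import math

def kmeans(points, k, iters=50):
    """Plain Lloyd iterations; returns k centroids."""
    # Spread the initial centers over the sorted page (assumption).
    centers = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assignment: each point goes to its nearest center.
            j = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[j].append(p)
        # Update: each center becomes the mean of its assigned points.
        new = [
            tuple(sum(x) / len(c) for x in zip(*c)) if c else centers[j]
            for j, c in enumerate(clusters)
        ]
        if new == centers:
            break
        centers = new
    return centers

def lkmeans(page, k):
    """Di = NN(KCenters, Pi): nearest actual tuple to each centroid."""
    return [min(page, key=lambda p: math.dist(p, c)) for c in kmeans(page, k)]

page = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
reps = lkmeans(page, 2)
```

The nearest-neighbor step is needed because a centroid is an average and generally not a tuple that exists in the result set.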

Historical K-Means (HKMeans) • Similar motivation as that of historical
K-Medoid. • Algorithm • KCenters = HKMeans (Pi , Ki ) • Di = NN (KCenters, Pi ) • HKMeans Algorithm • Initialize cluster centers • Repeat until convergence • Assignment: Assign each point to nearest cluster. • Update: New unfixed cluster centers by computing mean of all assigned points. Historical Representative

Effect of Initialization • HKMeans is worse than LKMeans in terms
of CIL score • Unlike HKMed, HKMeans can get caught in a local minimum • Bad initial cluster centers • Representatives being determined based on the outliers [Figure legend: Historical Representative]

Two-Phase K-Means (TPKMeans) • Phase 1 • Choose good
initial cluster centers using LKMeans • Phase 2 • Select non-redundant representatives using HKMeans • Benefits • Information quality quite close to HKMed • Runs almost N times faster than K-Medoid-based algorithms
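A hedged Python sketch of how the two phases might fit together. The `tpkmeans` signature, the way historical representatives are prepended as fixed centers, the initialization, and Euclidean distance are all assumptions made for illustration rather than details confirmed by the slides.

```python
import math

def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))

def mean(cluster):
    return tuple(sum(x) / len(cluster) for x in zip(*cluster))

def kmeans(points, centers, n_fixed=0, iters=50):
    """Lloyd iterations; the first n_fixed centers never move."""
    centers = list(centers)
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        new = [
            c if j < n_fixed or not clusters[j] else mean(clusters[j])
            for j, c in enumerate(centers)
        ]
        if new == centers:
            break
        centers = new
    return centers

def tpkmeans(page, k, history):
    # Phase 1 (LKMeans): plain K-Means on the page gives good initial centers.
    init = [page[i * len(page) // k] for i in range(k)]
    local = kmeans(page, init)
    # Phase 2 (HKMeans): historical representatives join as fixed centers.
    centers = kmeans(page, list(history) + local, n_fixed=len(history))
    free = centers[len(history):]
    # Map each unfixed center to its nearest actual tuple on the page.
    return [min(page, key=lambda p: math.dist(p, c)) for c in free]

page = [(0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10)]
reps = tpkmeans(page, 2, history=[(0, 0)])
```

Starting phase 2 from the phase 1 centers is what addresses the bad-initialization problem: the historical step refines an already reasonable clustering instead of starting from arbitrary centers.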

Two Phase K-Means (TPKMeans) Local K-Means Historical K-Means

Interaction • Large-scale Browsing: Skimmer • Overview • Interface • Algorithms • Evaluation: • Performance • Information Quality • User Study

Experimental Goals • Computational Performance • Page size • Number
of dimensions • Sampling rate • Information Quality • User Study

Performance • HKMed and LKMed need more time • Not
suitable for large page size or high sampling rate • HKMed is faster than LKMed • All algorithms satisfy interactive response constraint

Experimental Goals • Computational Performance • Information Quality • Information
Gain: We use Random Sampling as baseline B: $IG(A, B) = CIL_B(SL, R) / CIL_A(SL, R)$ • Page size • Number of dimensions • Sampling rate • User Study

Information Quality • HKMed is best, followed by TPKMeans and
LKMed • HKMeans is close to random sampling • Information gain decreases with increasing # dimensions

Summary Recommendations [Figure: recommended algorithm by sampling rate and page size; labels: Two-Phase K-Means (×3), Historical K-Medoids]

Experimental Goals • Computational Performance • Information Quality • User
Study • Users’ efficiency and quality of response to three tasks

User Study • Tasks: Interesting Patterns, Regression Task, Discriminating Features • Similar
or better quality of response for all three tasks • Users are able to do the tasks 1.5 - 2 times faster • Less stress due to reduced information

Skimmer: Recap • Scrolling-aware browsing: Introduced the idea of selecting
representative tuples to enable variable-speed scrolling through relational data • Information loss metric: Quantified the loss of information incurred by browsing representative tuples • Algorithms: Developed and compared five new scrolling-based sampling algorithms that minimize information loss • Interaction constraints: Proposed efficiently computable algorithms that satisfy the fast-scrolling requirement.

Conclusion • Interacting with Large Datasets is hard! • Challenges
• Principles of Guided Interaction • Enumeration • Insights • Responsiveness • Large-scale Browsing: Skimmer • Scrolling & history-aware, information-based clustering of tabular data

Co-authors • Skimmer: Rapid Scrolling of Relational Query Results –
SIGMOD 2012 • Manish Singh, Arnab Nandi, H. V. Jagadish • Guided Interaction: Rethinking the Query-Result Paradigm – VLDB 2011 • Arnab Nandi, H. V. Jagadish • Assisted querying using instant-response interfaces – SIGMOD 2007 (demo) • Arnab Nandi, H. V. Jagadish

http://arnab.org
