Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Guided Interaction Over Large Datasets

Arnab Nandi
April 05, 2013
29

Guided Interaction Over Large Datasets

Many decades of research, coupled with continuous increases in computing power, have enabled highly efficient execution of queries on large databases. In consequence, for many databases, far more time is spent by users formulating queries than by the system evaluating them. It stands to reason that, looking at the overall query experience we provide users, we should pay attention to how we can assist users in the holistic process of obtaining the information they desire from the database, and not just the constituent activity of efficiently generating a result given a complete precise query.

We examine the conventional query-result paradigm employed by databases and demonstrate challenges encountered when following this paradigm for an informa- tion seeking task. We recognize that the process of query specification itself is a major stumbling block. With current computational abilities, we are at a point where we can make use of the data in the database to aid in this process.

To this end, we propose a new paradigm, guided interaction, to solve the noted challenges, by using interaction to guide the user through the query formulation, query execution and result examination processes.

Arnab Nandi

April 05, 2013
Tweet

More Decks by Arnab Nandi

Transcript

  1. Interacting with Large Datasets • Users want to explore and

    interact with the data when analyzing it • Data is too “big” • Slow to interact with • Unfamiliar • Hard to manage
  2. Revisiting Status Quo • Databases have become really fast /

    efficient in going from query to result • Then why are we still unhappy? • Does this efficiency solve the overall user need? Interact Optimize Execute Query Plan Result “frontend” tasks: O(minutes) typical database system: O(seconds) Query Intent
  3. Outline • Motivating Example • Challenges • Principles of Guided

    Interaction • Large-scale Browsing: Skimmer
  4. Motivating Example Naïve user Alex Database Expert Bob Manager •

    Alex and Bob meet a Senior Manager • Forget name, need to look up contact info. • All they remember: manager of small group of senior researchers
  5. Motivating Example: Naïve Alex • Visits corporate social network website

    1. Browses all the “advanced search” forms 2. Uses Faceted Search interface to naively query for everyone in the company 3. Realizes you can’t drill down by seniority - There isn’t a “seniority” field, but age… 4. Goes back to “Birthday Search” form - Figures out senior employees are ~50 5. Adds age range, drills further, finds person Naïve user Alex
  6. Motivating Example: Expert Bob • Opens up SQL Console to

    employee DB 1. SHOW TABLES; // reads… 2. DESC TABLES; // reads more… 3. SELECT emp.project, COUNT(*) AS c, AVG(emp.age) AS a FROM emp JOIN dept ON (emp.deptID = dept.ID) GROUP BY emp.project ORDER BY c ASC, a DESC LIMIT 3 4. SELECT emp.name,emp.cubicleID FROM emp JOIN dept ON (emp.deptID = dept.ID) WHERE dept.name=‘Research’ AND emp.project=’DatabasePrj’ AND emp.designation=’Manager’ Database Expert Bob Average age & count per group Use “DatabasePrj” from prev query
  7. Motivating Example • Both users spent more time constructing and

    issuing sub queries • Issued redundant / wrong queries • On standard server, most queries take < 1 min • Session takes several minutes – hour! • Most time was spent in constructing the right query
  8. Outline • Motivating Example • Challenges • Principles of Guided

    Interaction • Large-scale Browsing: Skimmer
  9. Challenges • User’s lack of Knowledge • Dependency of Information

    • Iterative and Incremental Querying • Imprecise User Query Intent
  10. Challenges Lack of Knowledge • Both users didn’t know about

    the • Schema • Data • Naïve user Alex did not know about • Query Language either • All 3 are needed to effectively issue queries • Otherwise, most time is spent issuing trial-and-error queries to learn more about the DB
  11. Challenges Dependency of Information 3. Realizes you can’t drill down

    by seniority - There isn’t a “seniority” field, but age… 4. Goes back to “Birthday Search” form - Figures out senior employees are ~50 Naïve user Alex SELECT emp.project, COUNT(*) AS c, AVG(emp.age) AS a FROM emp JOIN dept ON (emp.deptID = dept.ID) GROUP BY emp.project ORDER BY c ASC, a DESC LIMIT 3 Database Expert Bob Average age & count per group
  12. Challenges Dependency of Information • Finding out what age “Senior”

    meant required a secondary query • Cannot really write as a subquery • Dependency exists between final query and intermediate query results
  13. Challenges Iterative & Incremental Querying • Observation: Users construct queries

    by first executing smaller parts • Cognitive capacity of users is limited • Query may be declarative, but users prefer iterative / incremental construction • Leads to a lot of requerying
  14. Challenges Imprecise Query Intent • • DB Expert Bob was

    looking for some notion of “group” of small people • Hard to translate imprecise intents unless we’re aware of data • Only solution is to execute and see if answer worked Average age & count per group SELECT emp.project, COUNT(*) AS c, AVG(emp.age) AS a FROM emp JOIN dept ON (emp.deptID = dept.ID) GROUP BY emp.project ORDER BY c ASC, a DESC LIMIT 3
  15. Challenges • Our example was a simple one • Challenges

    become much harder with complex needs • n-way JOINs, Nested queries, complex aggregates… • Any database use-case with a human in the loop will face these problems
  16. Solutions so far • Application-level • Slick UIs, customized to

    use case • No principled approach to solving overall user needs • Where are my standardized operators for overall data interaction? • Set of rules I can follow when building such a system? • Related work: • QBE, VizQL(Tableau), AQUA, CONTROL, Telegraph and more • Solve thin slices of the overall problem
  17. Outline • Motivating Example • Challenges • Principles of Guided

    Interaction • Large-scale Browsing: Skimmer
  18. Guided Interaction • Principled Approach to solving these problems •

    More holistic thinking • To be included inside database Interact Optimize Execute Query Plan Result Rapid Iteration Interact Query Intent Database
  19. Guided Interaction • Set of 3 design principles • Enumeration

    • Insights • Responsiveness • Database systems that keep these in mind can avoid the challenges discussed Example system: Skimmer
  20. Guided Interaction Enumeration • The database is responsible for effectively

    enumerating all possible valid interactions with the data. • Removes burden of schema / data / language knowledge off the user
  21. Guided Interaction Enumeration: Example • What does an enumeration-enabled query

    system look like? • Important • One possible implementation • Focus on the concepts, not the idea! • Portray simple use case • Can have many, far more complex systems built using these principles
  22. Guided Interaction Enumeration: Example • Consider SQL query interface •

    With Partial Query Completion • Typing in “em” has exposed projection, join, and selection options. WHERE emily hanson contacts.email employee prefix suggestions type + cardinalities 4 .  emp.name 45K .  COLUMN 100K .  TABLE
  23. Guided Interaction Insights • The database must attempt to surface

    as many insights from the data as possible. • Removes informational dependencies • Aids expression of query intent • Note: Should not overwhelm the user
  24. Guided Interaction Insights: Example • Consider SQL interface with range

    / numeric value selection • Visual / interactive feedback saves dependent query • Does my DB let me build something like this? !"#$ %$ &%%$ WHERE emp.age > 60 distribution
 of values 
 in column Distribution of values in column
  25. Guided Interaction Responsiveness • All interactions must be instantaneous even

    if inaccurate. • Fluid data interaction is key to getting insights • Tradeoff accuracy for near-instantaneous responses (i.e. <100ms*) * R. Miller. “Response time in man-computer conversational transactions” FJCC, 1968.
  26. Guided Interaction Responsiveness: Example • SQL query interface, Partial Query

    Completion • Need to deliver results in <100ms WHERE emily hanson contacts.email employee prefix suggestions type + cardinalities 4 .  emp.name 45K .  COLUMN 100K .  TABLE
  27. Outline • Motivating Example • Challenges • Principles of Guided

    Interaction • Large-scale Browsing: Skimmer
  28. Outline • Motivating Example • Challenges • Principles of Guided

    Interaction • Large-scale Browsing: Skimmer • Overview • Interface • Algorithms • Evaluation
  29. Outline • Motivating Example • Challenges • Principles of Guided

    Interaction • Large-scale Browsing: Skimmer • Overview • Interface • Algorithms • Evaluation
  30. Skimmer: Large-scale Browsing Naïve user Alex Database Expert Bob Manager

    • Alex and Bob look for a Senior Manager • Solution: Let’s skim the entire employee directory!
  31. Skimmer: Large-scale Browsing • Often more efficient than formulating articulate

    query • Results presented can overwhelm both the system and the user • SELECT * FROM emp JOIN dept ON (emp.deptID = dept.ID) ORDER BY emp.age
  32. A scrolling interface • Scrolling is a widely used interface

    • Constraints in fast scrolling: • System constraints: Data distortion • User constraints: Visual perception, memory retention etc.
  33. Solution: Skimmer • Guided Interaction principles • Enumeration • Intuitive

    actions: page up, page down, change speed • Insights • Maximize the amount of information (not tuples) • Be sure not to overwhelm the user • Responsiveness • Efficiently surface insights • Reduce interface—data overhead (network, display)
  34. Outline • Motivating Example • Challenges • Principles of Guided

    Interaction • Large-scale Browsing: Skimmer • Overview • Interface • Algorithms • Evaluation
  35. User Interface • Input: Sorted query result R • Output:

    R requires S pages {P1 , P2 ,…, PS } for display • Display representatives: {D1 , D2 ,…, DS } • Di Í Pi and it is computed based on: • User’s current scrolling speed • Contents of page Pi • User’s current browsing history • Benefit: Reduces information overload by showing summarized, non-redundant and diverse information
  36. Goodness Metric: Information Loss • Tuplewise information loss of a

    non-displayed tuple, tnd from Pi where td is most similar tuple from Di U H(sid) • Pagewise information loss score of page Pi : • Cumulative information loss for result set R and scroll log SL ) , ( ) ( , d nd nd t t V sid t TIL = å Î = ) ( ) ( , , i p P t p i sid t TIL sid P PIL å = = | | 1 , ) ( ) , ( SL sid i sid P PIL R SL CIL
  37. Outline • Motivating Example • Challenges • Principles of Guided

    Interaction • Large-scale Browsing: Skimmer • Overview • Interface • Algorithms • Evaluation
  38. Naïve Sampling • Compute set Di = Ki tuples from

    page Pi • Ki is determined based on user’s current scrolling speed • Random sampling • Pick Ki random tuples from Pi • Uniform sampling • Pick Ki evenly spaced tuples from Pi
  39. K-Medoid • Clustering algorithm that partitions a dataset D, containing

    N elements, into K partitions • Each partition is represented by an actual sample point • It minimizes the following absolute error criterion: • Best known heuristic solutions: PAM, CLARA and CLARANS å å = Î = K j C p KMedoids j j o p V P E 1 ) , ( ) (
  40. Local K-Medoid (LKMed) • Di = PAM (Pi , Ki

    ) • PAM Algorithm: • Initialize clusters centers • Repeat until convergence • Assignment: Assign each point to nearest cluster • Update: Swap based greedy update of cluster centers • CLARA and CLARANS not suitable for small datasets A B Current Representative
  41. Historical K-Medoid (HKMed) • Di = HKMed (Pi , Ki

    ) • Minimizes the exact PIL score • HKMed Algorithm • Initialize the cluster centers • Repeat until convergence • Assignment: Assign each point to nearest cluster. • Update: Update unfixed cluster centers based on greedy swap D C A B Historical Representative Current Representative
  42. Performance Issues: Responsiveness • Computational constraints: Satisfy user’s non-linear scrolling

    behavior • Next page representative is selected based on: • Past displayed content • User’s current scroll rate • Desired computation time: Less than 100 ms • PAM : O(K*(N-K)2) dist computations per iteration
  43. Approximate K-Medoid • K-Means is an efficient partition based clustering

    algorithm. It divides a dataset into ‘K’ partitions. • It is O(K*N) as compared to O(K(N-K)2) in K-Medoid • Each partition is represented by partition centroid. • It minimizes the following square-error criterion: • It can only be used for numerical attributes and Euclidean distance function. å å = Î - = K j C p j i m p P EKMeans 1 2 | | ) (
  44. Local K-Means (LKMeans) • Algorithm • KCenters = KMeans (Pi

    , Ki ) • Di = NN (KCenters, Pi ) • KMeans Algorithm • Initialize cluster centers • Repeat until convergence • Assignment: Assign each point to nearest cluster. • Update: New cluster centers by computing mean of all assigned points.
  45. Historical K-Means (HKMeans) • Similar motivation as that of historical

    K-Medoid. • Algorithm • KCenters = HKMeans (Pi , Ki ) • Di = NN (KCenters, Pi ) • HKMeans Algorithm • Initialize cluster centers • Repeat until convergence • Assignment: Assign each point to nearest cluster. • Update: New unfixed cluster centers by computing mean of all assigned points. Historical Representative
  46. Effect of Initialization • HKMeans worse than LKMeans in terms

    of CIL Score • Unlike HKMed, HKMeans can get caught in local minimum • Bad initial cluster centers • Representatives being determined based on the outliers Historical Representative
  47. Two-Phase K- Means (TPKMeans) • Phase 1 • Choose good

    initial cluster centers using LKMeans • Phase 2 • Select non-redundant representatives using HKMeans • Benefits • Information quality quite close to HKMed • Runs almost N times faster as compared to K- Medoids based algorithms
  48. Outline • Motivating Example • Challenges • Principles of Guided

    Interaction • Large-scale Browsing: Skimmer • Overview • Interface • Algorithms • Evaluation: • Performance • Information Quality • User Study
  49. Experimental Goals • Computational Performance • Page size • Number

    of dimensions • Sampling rate • Information Quality • User Study
  50. Performance } HKMed and LKMed need more time } Not

    suitable for large page size or high sampling rate } HKMed is faster than LKMed } All algorithms satisfy interactive response constraint
  51. Experimental Goals • Computational Performance • Information Quality • Information

    Gain: We use Random Sampling as baseline B • Page size • Number of dimensions • Sampling rate • User Study ) , ( ) , ( ) , ( R SL CIL R SL CIL B A IG A B =
  52. Information Quality } HKMed is best followed by TPKMeans and

    LKMed } HKMeans is almost close to random sampling } Information gain decreases with increasing # dimensions
  53. Experimental Goals • Computational Performance • Information Quality • User

    Study • Users’ efficiency and quality of response to three tasks
  54. User Study Interesting Patterns Regression Task Discriminating Features } Almost

    similar or better quality of response for all three tasks } Users are able to do the tasks 1.5 - 2 times faster } Less stress due to reduced information
  55. Skimmer: Recap • Scrolling-aware browsing: Introduced the idea of selecting

    representative tuples to enable variable-speed scrolling through relational data • Information loss metric: Quantified loss of information incurred due to browsing representative tuples • Algorithms: Developed and compared five new scrolling based sampling algorithms that minimize information loss • Interaction constraints: Proposed efficiently computable algorithms that satisfy fast scrolling requirement.
  56. Outline • Motivating Example • Challenges • Principles of Guided

    Interaction • Large-scale Browsing: Skimmer
  57. Conclusion • Interacting with Large Datasets is hard! • Challenges

    • Principles of Guided Interaction • Enumeration • Insights • Responsiveness • Large-scale Browsing: Skimmer • Scrolling & history-aware, information-based clustering of tabular data
  58. Co-authors • Skimmer: Rapid Scrolling of Relational Query Results –

    SIGMOD 2012 • Manish Singh, Arnab Nandi, H. V. Jagadish • Guided Interaction: Rethinking the Query- Result Paradigm – VLDB 2011 • Arnab Nandi, H. V. Jagadish • Assisted querying using instant-response interfaces – SIGMOD 2007 (demo) • Arnab Nandi, H. V. Jagadish