Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Guided Interaction Over Large Datasets

Arnab Nandi
April 05, 2013
4

Guided Interaction Over Large Datasets

Many decades of research, coupled with continuous increases in computing power, have enabled highly efficient execution of queries on large databases. In consequence, for many databases, far more time is spent by users formulating queries than by the system evaluating them. It stands to reason that, looking at the overall query experience we provide users, we should pay attention to how we can assist users in the holistic process of obtaining the information they desire from the database, and not just the constituent activity of efficiently generating a result given a complete precise query.

We examine the conventional query-result paradigm employed by databases and demonstrate challenges encountered when following this paradigm for an informa- tion seeking task. We recognize that the process of query specification itself is a major stumbling block. With current computational abilities, we are at a point where we can make use of the data in the database to aid in this process.

To this end, we propose a new paradigm, guided interaction, to solve the noted challenges, by using interaction to guide the user through the query formulation, query execution and result examination processes.

Arnab Nandi

April 05, 2013
Tweet

Transcript

  1. Guided Interaction
    over Large Datasets
    Arnab Nandi
    Computer Science & Engineering
    The Ohio State University

    View Slide

  2. “Big Data”

    View Slide

  3. Interacting with Large Datasets
    • Users want to
    explore and interact
    with the data when
    analyzing it
    • Data is too “big”
    • Slow to interact with
    • Unfamiliar
    • Hard to manage

    View Slide

  4. Revisiting Status Quo
    • Databases have become really fast / efficient
    in going from query to result
    • Then why are we still unhappy?
    • Does this efficiency solve the overall user need?
    Interact Optimize Execute
    Query Plan Result
    “frontend” tasks: O(minutes) typical database system: O(seconds)
    Query
    Intent

    View Slide

  5. Outline
    • Motivating Example
    • Challenges
    • Principles of Guided Interaction
    • Large-scale Browsing: Skimmer

    View Slide

  6. Motivating Example
    Naïve user
    Alex
    Database Expert
    Bob Manager
    • Alex and Bob meet a Senior Manager
    • Forget name, need to look up contact info.
    • All they remember: manager of small group of
    senior researchers

    View Slide

  7. Motivating Example: Naïve Alex
    • Visits corporate social network website
    1. Browses all the “advanced search” forms
    2. Uses Faceted Search interface to naively query
    for everyone in the company
    3. Realizes you can’t drill down by seniority
    - There isn’t a “seniority” field, but age…
    4. Goes back to “Birthday Search” form
    - Figures out senior employees are ~50
    5. Adds age range, drills further, finds person
    Naïve user
    Alex

    View Slide

  8. Motivating Example: Expert Bob
    • Opens up SQL Console to employee DB
    1. SHOW TABLES; // reads…
    2. DESC TABLES; // reads more…
    3. SELECT emp.project, COUNT(*) AS c, AVG(emp.age) AS a
    FROM emp JOIN dept ON (emp.deptID = dept.ID)
    GROUP BY emp.project ORDER BY c ASC, a DESC LIMIT 3
    4. SELECT emp.name,emp.cubicleID
    FROM emp JOIN dept ON (emp.deptID = dept.ID)
    WHERE dept.name=‘Research’ AND
    emp.project=’DatabasePrj’ AND
    emp.designation=’Manager’
    Database Expert
    Bob
    Average age & count per group
    Use “DatabasePrj” from
    prev query

    View Slide

  9. Motivating Example
    • Both users spent more time
    constructing and issuing sub queries
    • Issued redundant / wrong queries
    • On standard server, most queries take < 1 min
    • Session takes several minutes – hour!
    • Most time was spent in constructing the
    right query

    View Slide

  10. Outline
    • Motivating Example
    • Challenges
    • Principles of Guided Interaction
    • Large-scale Browsing: Skimmer

    View Slide

  11. Challenges
    • User’s lack of Knowledge
    • Dependency of Information
    • Iterative and Incremental Querying
    • Imprecise User Query Intent

    View Slide

  12. Challenges
    Lack of Knowledge
    • Both users didn’t know about the
    • Schema
    • Data
    • Naïve user Alex did not know about
    • Query Language
    either
    • All 3 are needed to effectively issue queries
    • Otherwise, most time is spent issuing trial-and-error
    queries to learn more about the DB

    View Slide

  13. Challenges
    Dependency of Information
    3. Realizes you can’t drill down by seniority
    - There isn’t a “seniority” field, but age…
    4. Goes back to “Birthday Search” form
    - Figures out senior employees are ~50
    Naïve user
    Alex
    SELECT emp.project, COUNT(*) AS c, AVG(emp.age) AS a
    FROM emp JOIN dept ON (emp.deptID = dept.ID)
    GROUP BY emp.project ORDER BY c ASC, a DESC LIMIT 3
    Database Expert Bob
    Average age & count per group

    View Slide

  14. Challenges
    Dependency of Information
    • Finding out what age “Senior” meant
    required a secondary query
    • Cannot really write as a subquery
    • Dependency exists between final query
    and intermediate query results

    View Slide

  15. Challenges
    Iterative & Incremental Querying
    • Observation: Users construct queries by first
    executing smaller parts
    • Cognitive capacity of users is limited
    • Query may be declarative, but users prefer
    iterative / incremental construction
    • Leads to a lot of requerying

    View Slide

  16. Challenges
    Imprecise Query Intent

    • DB Expert Bob was looking for some notion of
    “group” of small people
    • Hard to translate imprecise intents unless we’re
    aware of data
    • Only solution is to execute and see if answer
    worked
    Average age & count per group
    SELECT emp.project, COUNT(*) AS c, AVG(emp.age) AS a
    FROM emp JOIN dept ON (emp.deptID = dept.ID)
    GROUP BY emp.project ORDER BY c ASC, a DESC LIMIT 3

    View Slide

  17. Challenges
    • Our example was a simple one
    • Challenges become much harder with
    complex needs
    • n-way JOINs, Nested queries, complex aggregates…
    • Any database use-case with a human in the loop
    will face these problems

    View Slide

  18. Solutions so far
    • Application-level
    • Slick UIs, customized to use case
    • No principled approach to solving overall user needs
    • Where are my standardized operators for overall data
    interaction?
    • Set of rules I can follow when building such a system?
    • Related work:
    • QBE, VizQL(Tableau), AQUA, CONTROL, Telegraph and more
    • Solve thin slices of the overall problem

    View Slide

  19. Outline
    • Motivating Example
    • Challenges
    • Principles of Guided Interaction
    • Large-scale Browsing: Skimmer

    View Slide

  20. Guided Interaction
    • Principled Approach to solving these problems
    • More holistic thinking
    • To be included inside database
    Interact Optimize Execute
    Query Plan Result
    Rapid Iteration
    Interact
    Query
    Intent
    Database

    View Slide

  21. Guided Interaction
    • Set of 3 design principles
    • Enumeration
    • Insights
    • Responsiveness
    • Database systems that
    keep these in mind
    can avoid the challenges discussed
    Example system: Skimmer

    View Slide

  22. Guided Interaction
    Enumeration
    • The database is responsible for
    effectively enumerating
    all possible valid interactions
    with the data.
    • Removes burden of
    schema / data / language knowledge
    off the user

    View Slide

  23. Guided Interaction
    Enumeration: Example
    • What does an enumeration-enabled
    query system look like?
    • Important
    • One possible implementation
    • Focus on the concepts, not the idea!
    • Portray simple use case
    • Can have many, far more complex systems built using these principles

    View Slide

  24. Guided Interaction
    Enumeration: Example
    • Consider SQL query interface
    • With Partial Query Completion
    • Typing in “em” has exposed projection, join, and
    selection options.
    WHERE emily hanson
    contacts.email
    employee
    prefix
    suggestions
    type +
    cardinalities
    4
    .

    emp.name
    45K
    .

    COLUMN
    100K
    .

    TABLE

    View Slide

  25. Guided Interaction
    Insights
    • The database must attempt to
    surface as many insights from the
    data as possible.
    • Removes informational dependencies
    • Aids expression of query intent
    • Note: Should not overwhelm the user

    View Slide

  26. Guided Interaction
    Insights: Example
    • Consider SQL interface with range / numeric value selection
    • Visual / interactive feedback saves dependent query
    • Does my DB let me build something like this?
    !"#$
    %$
    &%%$
    WHERE emp.age > 60
    distribution

    of values 

    in column
    Distribution
    of values in
    column

    View Slide

  27. Guided Interaction
    Responsiveness
    • All interactions must be
    instantaneous
    even if inaccurate.
    • Fluid data interaction is key to getting insights
    • Tradeoff accuracy for near-instantaneous
    responses (i.e. <100ms*)
    * R. Miller. “Response time in man-computer conversational transactions” FJCC, 1968.

    View Slide

  28. Guided Interaction
    Responsiveness: Example
    • SQL query interface, Partial Query Completion
    • Need to deliver results in <100ms
    WHERE emily hanson
    contacts.email
    employee
    prefix
    suggestions
    type +
    cardinalities
    4
    .

    emp.name
    45K
    .

    COLUMN
    100K
    .

    TABLE

    View Slide

  29. Guided Interaction
    solves shortcomings in the Query-Result
    Model
    • Enumeration
    • Insights
    • Responsiveness

    View Slide

  30. Outline
    • Motivating Example
    • Challenges
    • Principles of Guided Interaction
    • Large-scale Browsing: Skimmer

    View Slide

  31. Outline
    • Motivating Example
    • Challenges
    • Principles of Guided Interaction
    • Large-scale Browsing: Skimmer
    • Overview
    • Interface
    • Algorithms
    • Evaluation

    View Slide

  32. Outline
    • Motivating Example
    • Challenges
    • Principles of Guided Interaction
    • Large-scale Browsing: Skimmer
    • Overview
    • Interface
    • Algorithms
    • Evaluation

    View Slide

  33. Skimmer: Large-scale Browsing
    Naïve user
    Alex
    Database Expert
    Bob Manager
    • Alex and Bob look for a Senior Manager
    • Solution: Let’s skim the entire employee
    directory!

    View Slide

  34. Skimmer: Large-scale Browsing
    • Often more efficient than formulating
    articulate query
    • Results presented can overwhelm both
    the system and the user
    • SELECT *
    FROM emp JOIN dept ON
    (emp.deptID = dept.ID)
    ORDER BY emp.age

    View Slide

  35. A scrolling interface
    • Scrolling is a widely
    used interface
    • Constraints in fast
    scrolling:
    • System constraints:
    Data distortion
    • User constraints: Visual
    perception, memory
    retention etc.

    View Slide

  36. Solution: Skimmer
    • Guided Interaction principles
    • Enumeration
    • Intuitive actions: page up, page down, change speed
    • Insights
    • Maximize the amount of information (not tuples)
    • Be sure not to overwhelm the user
    • Responsiveness
    • Efficiently surface insights
    • Reduce interface—data overhead (network, display)

    View Slide

  37. Outline
    • Motivating Example
    • Challenges
    • Principles of Guided Interaction
    • Large-scale Browsing: Skimmer
    • Overview
    • Interface
    • Algorithms
    • Evaluation

    View Slide

  38. User Interface

    View Slide

  39. User Interface
    • Input: Sorted query result R
    • Output: R requires S pages {P1
    , P2
    ,…, PS
    } for display
    • Display representatives: {D1
    , D2
    ,…, DS
    }
    • Di
    Í Pi
    and it is computed based on:
    • User’s current scrolling speed
    • Contents of page Pi
    • User’s current browsing history
    • Benefit: Reduces information overload by
    showing summarized, non-redundant and diverse
    information

    View Slide

  40. Goodness Metric: Information Loss
    • Tuplewise information loss of a non-displayed tuple, tnd
    from Pi
    where td
    is most similar tuple from Di
    U H(sid)
    • Pagewise information loss score of page Pi
    :
    • Cumulative information loss for result set R and scroll
    log SL
    )
    ,
    (
    )
    ( , d
    nd
    nd
    t
    t
    V
    sid
    t
    TIL =
    å
    Î
    = )
    (
    )
    ( ,
    ,
    i
    p P
    t
    p
    i sid
    t
    TIL
    sid
    P
    PIL
    å
    =
    =
    |
    |
    1
    , )
    (
    )
    ,
    (
    SL
    sid
    i sid
    P
    PIL
    R
    SL
    CIL

    View Slide

  41. Outline
    • Motivating Example
    • Challenges
    • Principles of Guided Interaction
    • Large-scale Browsing: Skimmer
    • Overview
    • Interface
    • Algorithms
    • Evaluation

    View Slide

  42. Naïve Sampling
    • Compute set Di
    = Ki
    tuples from page Pi
    • Ki
    is determined based on user’s current scrolling
    speed
    • Random sampling
    • Pick Ki
    random tuples from Pi
    • Uniform sampling
    • Pick Ki
    evenly spaced tuples from Pi

    View Slide

  43. K-Medoid
    • Clustering algorithm that partitions a dataset D,
    containing N elements, into K partitions
    • Each partition is represented by an actual sample
    point
    • It minimizes the following absolute error criterion:
    • Best known heuristic solutions: PAM, CLARA and
    CLARANS
    å å
    = Î
    =
    K
    j C
    p
    KMedoids
    j
    j
    o
    p
    V
    P
    E
    1
    )
    ,
    (
    )
    (

    View Slide

  44. Local K-Medoid (LKMed)
    • Di
    = PAM (Pi
    , Ki
    )
    • PAM Algorithm:
    • Initialize clusters centers
    • Repeat until convergence
    • Assignment: Assign each
    point to nearest cluster
    • Update: Swap based
    greedy update of cluster
    centers
    • CLARA and CLARANS
    not suitable for small
    datasets
    A
    B
    Current Representative

    View Slide

  45. Importance of History
    • Our goal: Show non-redundant, diverse
    information to the user
    page 1
    page 2

    View Slide

  46. Historical K-Medoid (HKMed)
    • Di
    = HKMed (Pi
    , Ki
    )
    • Minimizes the exact PIL
    score
    • HKMed Algorithm
    • Initialize the cluster
    centers
    • Repeat until
    convergence
    • Assignment: Assign
    each point to nearest
    cluster.
    • Update: Update unfixed
    cluster centers based on
    greedy swap
    D
    C
    A
    B
    Historical Representative
    Current Representative

    View Slide

  47. Performance Issues: Responsiveness
    • Computational constraints: Satisfy user’s
    non-linear scrolling behavior
    • Next page representative is selected based
    on:
    • Past displayed content
    • User’s current scroll rate
    • Desired computation time: Less than 100 ms
    • PAM : O(K*(N-K)2) dist computations per
    iteration

    View Slide

  48. Approximate K-Medoid
    • K-Means is an efficient partition based clustering
    algorithm. It divides a dataset into ‘K’ partitions.
    • It is O(K*N) as compared to O(K(N-K)2) in K-Medoid
    • Each partition is represented by partition centroid.
    • It minimizes the following square-error criterion:
    • It can only be used for numerical attributes and
    Euclidean distance function.
    å å
    = Î
    -
    =
    K
    j C
    p j
    i
    m
    p
    P
    EKMeans
    1
    2
    |
    |
    )
    (

    View Slide

  49. Local K-Means (LKMeans)
    • Algorithm
    • KCenters = KMeans (Pi
    , Ki
    )
    • Di
    = NN (KCenters, Pi
    )
    • KMeans Algorithm
    • Initialize cluster centers
    • Repeat until
    convergence
    • Assignment: Assign each
    point to nearest cluster.
    • Update: New cluster
    centers by computing mean
    of all assigned points.

    View Slide

  50. Historical K-Means (HKMeans)
    • Similar motivation as that of
    historical K-Medoid.
    • Algorithm
    • KCenters = HKMeans (Pi
    , Ki
    )
    • Di
    = NN (KCenters, Pi
    )
    • HKMeans Algorithm
    • Initialize cluster centers
    • Repeat until convergence
    • Assignment: Assign each
    point to nearest cluster.
    • Update: New unfixed
    cluster centers by
    computing mean of all
    assigned points.
    Historical Representative

    View Slide

  51. Effect of Initialization
    • HKMeans worse than
    LKMeans in terms of
    CIL Score
    • Unlike HKMed,
    HKMeans can get
    caught in local
    minimum
    • Bad initial cluster
    centers
    • Representatives being
    determined based on
    the outliers
    Historical Representative

    View Slide

  52. Two-Phase K- Means (TPKMeans)
    • Phase 1
    • Choose good initial cluster centers using LKMeans
    • Phase 2
    • Select non-redundant representatives using
    HKMeans
    • Benefits
    • Information quality quite close to HKMed
    • Runs almost N times faster as compared to K-
    Medoids based algorithms

    View Slide

  53. Two Phase K-Means (TPKMeans)
    Local K-Means Historical K-Means

    View Slide

  54. Outline
    • Motivating Example
    • Challenges
    • Principles of Guided Interaction
    • Large-scale Browsing: Skimmer
    • Overview
    • Interface
    • Algorithms
    • Evaluation:
    • Performance
    • Information Quality
    • User Study

    View Slide

  55. Experimental Goals
    • Computational Performance
    • Page size
    • Number of dimensions
    • Sampling rate
    • Information Quality
    • User Study

    View Slide

  56. Performance
    } HKMed and LKMed need more time
    } Not suitable for large page size or high
    sampling rate
    } HKMed is faster than LKMed
    } All algorithms satisfy interactive
    response constraint

    View Slide

  57. Experimental Goals
    • Computational Performance
    • Information Quality
    • Information Gain: We use Random Sampling as
    baseline B
    • Page size
    • Number of dimensions
    • Sampling rate
    • User Study
    )
    ,
    (
    )
    ,
    (
    )
    ,
    (
    R
    SL
    CIL
    R
    SL
    CIL
    B
    A
    IG
    A
    B
    =

    View Slide

  58. Information Quality
    } HKMed is best followed by
    TPKMeans and LKMed
    } HKMeans is almost close to random
    sampling
    } Information gain decreases with
    increasing # dimensions

    View Slide

  59. Summary Recommendations
    Sampling Rates
    Page Size
    Two-Phase K-Means Two Phase K-Means
    Two Phase K-Means
    Historical K-Medoids

    View Slide

  60. Experimental Goals
    • Computational Performance
    • Information Quality
    • User Study
    • Users’ efficiency and quality of response to
    three tasks

    View Slide

  61. User Study
    Interesting Patterns Regression Task
    Discriminating Features
    } Almost similar or better
    quality of response for all
    three tasks
    } Users are able to do the tasks
    1.5 - 2 times faster
    } Less stress due to reduced
    information

    View Slide

  62. Skimmer: Recap
    • Scrolling-aware browsing: Introduced the idea of
    selecting representative tuples to enable variable-speed
    scrolling through relational data
    • Information loss metric: Quantified loss of information
    incurred due to browsing representative tuples
    • Algorithms: Developed and compared five new scrolling
    based sampling algorithms that minimize information
    loss
    • Interaction constraints: Proposed efficiently computable
    algorithms that satisfy fast scrolling requirement.

    View Slide

  63. Outline
    • Motivating Example
    • Challenges
    • Principles of Guided Interaction
    • Large-scale Browsing: Skimmer

    View Slide

  64. Conclusion
    • Interacting with Large Datasets is hard!
    • Challenges
    • Principles of Guided Interaction
    • Enumeration
    • Insights
    • Responsiveness
    • Large-scale Browsing: Skimmer
    • Scrolling & history-aware, information-based
    clustering of tabular data

    View Slide

  65. Co-authors
    • Skimmer: Rapid Scrolling of Relational Query
    Results – SIGMOD 2012
    • Manish Singh, Arnab Nandi, H. V. Jagadish
    • Guided Interaction: Rethinking the Query-
    Result Paradigm – VLDB 2011
    • Arnab Nandi, H. V. Jagadish
    • Assisted querying using instant-response
    interfaces – SIGMOD 2007 (demo)
    • Arnab Nandi, H. V. Jagadish

    View Slide

  66. http://arnab.org

    View Slide