Slide 1

Slide 1 text

Evaluating Interactive Data Systems: Workloads, Metrics, and Guidelines. Lilong Jiang, Protiva Rahman, Arnab Nandi. Interactive Data Systems Research Group, The Ohio State University

Slide 2

Slide 2 text

interactive data systems • Fast, low-latency, fluid, rapid query-response • Iterative, session-oriented, ad-hoc • Human-in-the-loop

Slide 3

Slide 3 text

interactivity can accelerate the discovery of insights

Slide 4

Slide 4 text

but is today’s data infrastructure sufficient for interactivity?

Slide 5

Slide 5 text

Not just a UI problem: Impacts multiple layers of the stack COGNITION & PERCEPTION CACHING & REUSE FEEDBACK & GUIDANCE OPTIMIZATION & EXECUTION QUERY INTERFACES

Slide 6

Slide 6 text

Some of our attempts Data Tweening [VLDB 17] Perceptually-aware Visualizations [DSIA 15] SnapToQuery: Query Feedback [VLDB 15] Skimmer: Rapid Browsing [SIGMOD 12] Guided Interaction [VLDB 11] FluxQuery: Main memory execution engine, cyclic shared scans [SIGMOD 16] DICE: Interactive / Approx. Cubing [ICDE 14] Result Reuse for NLP [DNIS 15] Structured Autocompletion [SIGMOD 07] Querying Beyond Keyboards [VLDB14, CHI13] Growing community of researchers doing some amazing work on these fronts

Slide 7

Slide 7 text

A growing community • Interactive data systems are becoming popular • Mention of “interactive” in SIGMOD / VLDB papers over the last 20 years (normalized by mention of “database”) • Related themes • Database usability • Query Interfaces • Human-in-the-loop [chart: normalized mentions of “interactive”, rising from 1985 to 2015]

Slide 8

Slide 8 text

Not just databases • Co-located with VIS • Co-located with KDD • Co-located with SIGMOD

Slide 9

Slide 9 text

Building interactive data systems: evaluation is a critical component • Important to measure systems correctly • and measure the right things • Good way to build a community • Catalog best practices • Improve upon each other's work • Lower barrier to entry to work in this area • Hard / expensive to perform user-driven evaluation 8

Slide 10

Slide 10 text

Some recent related work • IDEBench: Eichmann, Binnig, Kraska, Zgraggen • Focus: ad-hoc OLAP, approximation • VizPerf: Battle, Chang, Heer, Stonebraker • Focus: VIS + DB, standardized benchmark • Vis + DB Dagstuhl Evaluation Cohort • Discussed at this HILDA 9

Slide 11

Slide 11 text

Purpose • https://github.com/ixlab/eval • 130+ paper .bib file • Metrics (in context) • Case studies • Traces (coming soon) • Reference for anyone trying to design and evaluate their interactive system • Resource to encourage more evaluation work 10

Slide 12

Slide 12 text

What this is not • Not a standardized benchmark • Not the only way to evaluate interactive data systems • Reason: heterogeneity • Not a canonical list of metrics • Your system may need to measure something more as well • Not evaluating the user • Instead, evaluating the system • Not redoing HCI / CHI • Let’s learn from human factors, cognitive science, other fields and enrich systems evaluation 11

Slide 13

Slide 13 text

Focus of this work / Outline • Salient Features / Challenges in Evaluation • Guidelines (large-scale survey of related work) • Metrics based on user behavior • Biases • Case Studies • Inertial Scroll • Filter and Map • Crossfiltering • Open Problems 12

Slide 14

Slide 14 text

Outline • Salient Features / Challenges in Evaluation • Guidelines (large-scale survey of related work) • Metrics • Biases • Case Studies • Inertial Scroll • Filter and Map • Crossfiltering • Open Problems 13

Slide 15

Slide 15 text

Characteristics of Interactive Workloads for Devices and Interfaces • Devices and interfaces • sensing rate • map, slider • Continuous gestures • slider, linking & brushing • interactivity • Ambiguous Query Intent • Exploratory Analysis • Session behavior • adjacent queries: related, identical, similar results

Slide 16

Slide 16 text

Devices and Interfaces • Devices have different sensing rates -> different query issuing frequencies • Device-Interface combination generates different workloads • Sliders are more intensive than text • Zooming on map has two predicate changes • Examples: • TouchViz • DBTouch 15

Slide 17

Slide 17 text

TouchViz (Steven M. Drucker, Danyel Fisher, Ramik Sadana, Jessica Herron, M.C. Schraefel, CHI 2013) • Need for comparison on mouse-based devices • “Comparisons using mouse based interaction. The two conditions that we explored are both on a touch device, and we do not compare these interfaces with operations on a mouse based/desktop display. … It would be interesting to see if the results in this study hold for mouse based interaction in addition to touch.” • Expanding to multi-device interfaces requires additional studies. • “Longitudinal studies moving back and forth between desktop and touch device. Users often move between different applications and different devices with different affordances…. This study showed that there is also clearly some benefit of tuning the interaction to the affordance of a particular device. Future work should examine this problem in a more holistic fashion, perhaps as a longitudinal study where users need to move back and forth between the desktop and a touch oriented device.” 16

Slide 18

Slide 18 text

DBTouch (Stratos Idreos, Erietta Liarou, CIDR 2013) • Varying gesture speeds affects amount of data processed • “Varying Gesture Speed. .. observes what happens with a varying speed of applying the basic slide gesture for an interactive summaries query... As we slow down the speed of the gesture, we are able to observe/process more data. dbTouch captures more touch input and it can map this input to object identifiers of the underlying data.” • Screen size affects amount of data processed • “Varying Object Size. ..we test what happens as the size of a data object changes…. By adjusting the object size we allow for more detail; as the size increases, the same gesture speed allows the inspection of more data....via adjustments of the object size a user can interactively get a more fine grained or a more high level view of the data on demand.” 17

Slide 19

Slide 19 text

Continuous Action • Human-in-the-loop • Continuous manipulation -> continuous query generation • Immediate feedback • Focus on low latency • Examples: • Retrospective Adaptive Prefetching • Prefetching for Visual Data Exploration [figure: a slider being dragged, with histograms updating for ranges 0–50, 50–70, and 150–300]

Slide 20

Slide 20 text

• “In traditional systems, once a query is posed, the database controls the data flow, i.e., it is in full control regarding which data it processes and in what order, such as to compute the result to the user query.” • “In dbTouch, a query is a session of one or more continuous gestures and the system needs to react to every touch, while the user is now in control of the data flow.” 19 DBTouch (Stratos Idreos, Erietta Liarou, CIDR 2013)

Slide 21

Slide 21 text

Retrospective Adaptive Prefetching (RAP) (Serdar Yeşilmurat, Veysi İşler, Geoinformatica 2013) • Low latency through prefetching • “As expected, the map refresh time increases when more tiles are requested from the map server… Since some tiles are already in the prefetching cache, they are loaded from the cache, which reduces the refresh time.” • Simulating user behavior as a sequence of navigations • “The tests are completely automated by simulating user actions... a one-second delay is inserted between navigations to reflect average user behavior. For each test, 21 randomly generated navigations are simulated … 100 tests are executed, and in each test, a unique list of 21 navigations is implemented. … They all contain 21 navigations to ensure consistency across comparisons.” • Related: ForeCache (Leilani Battle, Remco Chang, Michael Stonebraker, SIGMOD 2016) 20

Slide 22

Slide 22 text

Prefetching for Visual Data Exploration (Punit R. Doshi, Elke A. Rundensteiner and Matthew O. Ward, SIGMOD 2002) • Simulation studies with varying navigation patterns • “We study the effect of differences in navigation patterns by (1) varying the number of hot regions, (2) erratic versus directional navigation patterns and (3) delay between user requests.” • Traces from real user studies used to validate simulations • “We have performed a user-study with real users during which time our logging tool collected the traces of user explorations…. These traces consisted of 30 minutes each for 20 different users. These traces when given as input to our tool under various system settings gave results similar to our synthetic user traces, confirming our conclusions outlined above.” 21

Slide 23

Slide 23 text

Ambiguous Query Intent “I can’t tell you what I want… …but I’ll know it when I see it!” • assumption: • User can express what they want • There exists exactly one well-formed query • solution: • Interactive UIs: Let user play / poke at data • Guide them to their intended result using feedback • Exploratory and Interactive Querying • Another concern: Sensitivity and Jitter • No haptic feedback • Unintended, noisy, repeated queries • Examples: • SeeDB • GestureDB [figure: Query Intent Model (FluxQuery, Ebenstein et al., SIGMOD 2016)]

Slide 24

Slide 24 text

GestureDB (Arnab Nandi, Lilong Jiang, Michael Mandel, VLDB 2014) • Anticipating the intended query • “we compare the ability of three different classifiers to anticipate a user’s intent. The first uses only proximity of the UI elements to predict the desired query. The second uses proximity and schema compatibility of attributes and the third uses proximity, schema compatibility, and data compatibility.” 24

Slide 25

Slide 25 text

Session Behavior • Users are interested in answering a question during session • Most queries in a session are related to each other • Opportunities for optimizations • Examples: • Forecache • Sesame 25

Slide 26

Slide 26 text

ForeCache (Leilani Battle, Remco Chang, Michael Stonebraker, SIGMOD 2016) • Actions are session/task dependent • “The average number of requests per task are as follows: 35 tiles for Task 1, 25 tiles for Task 2, and 17 tiles for Task 3. The mountain ranges in Tasks 2 and 3 (Europe and South America) were closer together and had less snow than those in Task 1 (US and Southern Canada). Thus, users spent less time on these tasks, shown by the decrease in total requests.” • Users have similar behavior • “We also found that large groups of users shared similar browsing patterns. These groupings further reinforce the reasoning behind our analysis phases, showing that most users can be categorized by a small number of specific patterns within each task, and even across tasks.” 26

Slide 27

Slide 27 text

Sesame (Niranjan Kamat, Arnab Nandi, TKDD 2016) • Benefit of session-aware speculation: • “Response Time across Varying Data Sizes. The benefit of our system is evident: ALGOSESAME is typically at least an order of magnitude faster than traditional database querying. As an anecdotal example, with 192 shards, ALGOSESAME is 18× faster than traditional execution for WorkloadTPCDS and 25× faster for WorkloadReal.” • “Overall Speculation Benefit. The Overall Speculation Benefit is similar to the query speedup initially, which increases as a greater fraction of time is spent actually running the query compared to the time spent in setting up the query execution, whose increase is relatively slower. Notably, the Overall Speculation Benefit is consistently greater than 1.” 27

Slide 28

Slide 28 text

Salient Features of Interactive Systems • Devices and Interfaces • Continuous Actions • Ambiguous Query Intent • Session Behavior 28

Slide 29

Slide 29 text

Outline • Salient Features / Challenges in Evaluation • Guidelines (large-scale survey of related work) • Metrics • Biases • Case Studies • Inertial Scroll • Filter and Map • Crossfiltering • Open Problems 29

Slide 30

Slide 30 text

Traditional Benchmarking – TPC • Ideal for transactional systems • Need to consider humans-in-the-loop • Performance • TPC-C, TPC-E: OLTP • TPC-DI: ETL • TPC-DS, TPC-H: OLAP • Human-in-the-loop systems require richer latency constraints • Simply measuring max / median / mean does not cut it • Need metrics for human factors 30

Slide 31

Slide 31 text

Metrics • Performance • Throughput • Latency • Scalability • Cache Hit Rate • Query Issuing Frequency • Latency Constraint Violation 31 • Human Factors • Quantitative • Usability • Learnability • Discoverability • Accuracy • Qualitative • User Feedback • Design Study

Slide 32

Slide 32 text

Metrics in literature (https://github.com/ixlab/eval)
query interface metrics:
• usability: how easy it is to use a system [145]; evaluated via query specification or task completion time [33, 43, 46, 88, 141, 147, 157, 185, 207, 227, 237, 269, 270, 281], number of iterations or navigation cost [47, 159, 160, 181], miss times [152], ease of getting more or unique insights [101, 225, 237, 264], accuracy or success in finishing tasks [46, 65, 127, 166, 185, 237, 269, 270], etc.
• learnability: ease of learning the user actions given prior instruction: [43, 46, 207, 225], etc.
• discoverability: ability to discover the user actions without prior instruction: [152, 192, 207], etc.
• survey / questionnaire: survey questions: [65, 127, 141, 155, 157, 159, 185, 207, 225, 264], etc.
• case study: do real tasks, demonstrate feasibility and in-context usefulness: [53, 106, 108, 170, 193, 216, 230, 261, 266, 271, 272, 279], etc.
• subjective feedback: feedback / suggestions: [46, 57, 65, 147, 227, 237, 261], etc.
• behavior analysis: sequences of mouse press-drag-release [147], event state sequence [174], etc.
performance metrics:
• throughput: transactions / requests / tasks per second: TPC-C, TPC-E [30], [70], etc.
• latency: the execution time of the query or frame rate per second: [70, 88, 104, 152, 155, 159, 184, 188, 189], etc.
• scalability: performance with increasing data size, number of dimensions, number of machines, etc.: [81, 88, 155, 189], etc.
• cache hit rate: [48, 155, 245], etc.

Slide 33

Slide 33 text

Throughput • Traditional metric • TPC-C, TPC-E • Can be measured as: • Transactions per second • Requests per second • Tasks per second • Example: Atlas System 33

Slide 34

Slide 34 text

Atlas (Sye-Min Chan, Ling Xiao, John Gerth, Pat Hanrahan, VAST 2008) – Throughput • System for interactively visualizing large-scale temporal data • load balancing among distributed servers • Speedup – increase in query throughput over the baseline (1 server) as the number of servers increases • Number of queries processed by the database 34

Slide 35

Slide 35 text

Latency • Query response = a lot more than query execution time • start = submit • Send query – network time • Query scheduling • Query execution • Summarization • Receive response – network time • Rendering 35
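To make this breakdown concrete, here is a minimal client-side instrumentation sketch; it is not from any system in this talk, and `run_query` and `render` are hypothetical placeholders. The point is to report network, execution, and rendering time separately instead of a single "query latency" number.

```python
# Minimal sketch (our own illustration): split end-to-end response time into
# execution, network, and rendering components, as listed on the slide above.
import time

def timed_request(run_query, render, sql):
    t_submit = time.perf_counter()
    result, server_exec_ms = run_query(sql)          # server reports its own scheduling + execution time
    t_received = time.perf_counter()
    render(result)
    t_rendered = time.perf_counter()

    round_trip_ms = (t_received - t_submit) * 1000
    return {
        "execution_ms": server_exec_ms,                   # query scheduling + execution
        "network_ms": round_trip_ms - server_exec_ms,     # send + receive overhead
        "render_ms": (t_rendered - t_received) * 1000,    # client-side summarization / rendering
        "end_to_end_ms": (t_rendered - t_submit) * 1000,  # what the user actually experiences
    }
```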

Slide 36

Slide 36 text

Motivation for Low Latency in Literature 36 • Assessing Simulator Sickness in a See-Through HMD: Effects of Time Delay, Time on Task, and Task Complexity (Nelson et al., IMAGE Conference 2000) – latencies of 50-100ms affect ability of participants to visually follow an object with a head mounted device • Assessing Target Acquisition and Tracking Performance for Moving Targets in the Presence of Latency and Jitter (Pavlovych et al. Graphics Interface 2012) – Error rates significantly increase with latency above 110ms in mouse target acquisition tasks • How Fast is Fast Enough? A Study of the Effects of Latency in Direct-Touch Pointing Tasks, (Ricardo Jota, Albert Ng, Paul Dietz and Daniel Wigdor, CHI 2013) – “performance decreases with latency; ability to perceive latency in feedback to the land-on event range from 20 to 100ms”

Slide 37

Slide 37 text

The Effects of Interactive Latency on Exploratory Visual Analysis (Zhicheng Liu and Jeffrey Heer, TVCG 2014) • Controlled user study comparing latency with 500ms difference per operation • Key findings: • “ (1) the additional delay results in reduced interaction and reduced dataset coverage during analysis;” • “(2) the rate at which users make observations, draw generalizations and generate hypotheses also declines due to the delay;” • “(3) initial exposure to delays can negatively impact overall performance even when the delay is removed in a later session.” 37

Slide 38

Slide 38 text

Facetor (Abhijith Kashyap, Vagelis Hristidis, Michalis Petropoulos, CIKM 2010) - Latency • Reduces the expected navigation cost during faceted exploration • Latency interpreted as query execution time • “This experiment aims to show that UniformSuggestions is fast enough to be used in real-time.” 38

Slide 39

Slide 39 text

Scalability • Performance • Scale up: put everything in faster disk / main memory, increase CPU speed • Scale out: distributed systems, shard / split • Bottlenecks • Overhead cost • post-aggregation (highlighting, ranking) • Cognitive ability of users • Size and complexity of the data • summarization 39

Slide 40

Slide 40 text

DICE (Niranjan Kamat, Prasanth Jayachandran, Karthik Tunga, Arnab Nandi, ICDE 2014) - Scalability 40 • Distributed interactive cube exploration • Example of scaling out • Performance improvement bottoms out after a point

Slide 41

Slide 41 text

Cache Hit Rate • Fraction of requests for which the item was already in the cache • Accuracy of Speculation • Cache Location • Frontend cache – reduces database load but hard to maintain – cache invalidation • Backend Cache – Predictable latency • Caching Strategy • Eviction-based – LRU, FIFO - ineffective • Predictive caching • Example: Scout 41
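As a rough illustration of how hit rate is measured, here is a minimal sketch of an LRU cache that tracks its own hit rate. It is not tied to any of the systems above; a predictive cache would differ only in how entries are populated.

```python
# Minimal sketch (our own illustration): LRU cache with hit-rate accounting.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()
        self.hits = 0
        self.requests = 0

    def get(self, key, fetch):
        """Return the cached result, or fetch and cache it on a miss."""
        self.requests += 1
        if key in self.cache:
            self.hits += 1
            self.cache.move_to_end(key)        # mark as recently used
            return self.cache[key]
        value = fetch(key)                      # e.g., run the query against the backend
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)      # evict the least recently used entry
        return value

    @property
    def hit_rate(self):
        return self.hits / self.requests if self.requests else 0.0
```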

Slide 42

Slide 42 text

Scout (Farhan Tauheed, Thomas Heinis, Felix Schurmann, Henry Markram , Anastasia Ailamaki, VLDB 2012) – Cache Hit Rate • System for exploring spatial data – content-aware prefetching • Baselines • Hilbert Prefetch – prefetches nearest neighbors • straight line – extrapolates from previous moves • EWMA – weights recent moves as higher • Sensitivity analysis 42

Slide 43

Slide 43 text

Query Issuing Frequency (QIF) • Sensing rate for devices has increased • iPad: 30Hz, 120Hz with pencil • Number of queries issued per time interval • Tradeoff between responsiveness and query load on the database 43
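A minimal sketch of how QIF can be computed from a trace, assuming the trace is simply a list of query issue timestamps in milliseconds; the one-second windowing is our own choice for illustration.

```python
# Minimal sketch (our own illustration): queries issued per time window.
from collections import Counter

def query_issuing_frequency(timestamps_ms, window_ms=1000):
    buckets = Counter(int(t // window_ms) for t in timestamps_ms)
    return {bucket * window_ms: count for bucket, count in sorted(buckets.items())}

# e.g. query_issuing_frequency([0, 8, 16, 24, 1010]) -> {0: 4, 1000: 1}
```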

Slide 44

Slide 44 text

Latency Constraint Violation (LCV) • State of the art = mean / median / max latency • Challenge: UIs are composed of a sequence of queries (in a workload, Qi+1 is dependent on the result of Qi) • Not captured in latency metrics • Dependent queries • Can cause cascading failures • Incorrect conclusions • Measured as binary or number of violations • Crossfiltering case study 44
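A minimal sketch of counting latency constraint violations over a trace of (issue time, response time) pairs; the 100 ms constraint below is an illustrative default, not a prescribed value.

```python
# Minimal sketch (our own illustration): count queries that exceed the
# interactivity constraint, to be reported alongside mean/median latency.
def latency_constraint_violations(trace, constraint_ms=100):
    violations = 0
    for issued_ms, answered_ms in trace:
        if answered_ms - issued_ms > constraint_ms:
            violations += 1
    return violations
```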

Slide 45

Slide 45 text

Metrics • Performance • Throughput • Latency • Scalability • Cache Hit Rate • Query Issuing Frequency • Latency Constraint Violation 45 • Human Factors • Quantitative • Usability • Learnability • Discoverability • Accuracy • Qualitative • User Feedback • Design Study

Slide 46

Slide 46 text

Usability • Captures ease of use of the interface • Proxied by • task completion time • number of iterations • navigation cost • number of insights • uniqueness of insights • Example: Dataplay 46

Slide 47

Slide 47 text

Dataplay (Azza Abouzied, Joseph M. Hellerstein, Avi Silberschatz, UIST 2012) - Usability • User study comparing 2 features (autocomplete query correction vs. direct manipulation of query tree) of the system - 13 participants • Task Completion time - 3 tasks • “Task 1: We gave users a query tree that finds (a) students who got A’s in some courses. We asked users to fix the query to find (a) students with all A’s. The complexity of this task is ‘1-tweak’.” • “Task 2: We gave users a query tree that finds (a) students who took courses in any of three areas. We asked users to fix the query to find (a) students who took courses in all and only the three areas. The complexity of this task is ‘2-tweaks’." • “Task 3: We gave users a query tree that finds (a) students who took any of three specific courses and got an A in any. We asked users to fix the query to find (a) students who took all three courses with A’s in them ignoring grades in other courses. The complexity of this task is ‘3-tweaks’.” 47

Slide 48

Slide 48 text

Learnability • Ability to retain user actions with instructions • Usable vs. Learnable • Cockpit • Audience Dependent • DBA vs. consumer • Example: Kinetica 48

Slide 49

Slide 49 text

Kinetica (Jeffrey M. Rzeszotarski, Aniket Kittur, CHI 2014) - Learnability • System that leverages touch interactions and physics based affordances for data visualization. • “In the case of Kinetica, participants followed a built-in tutorial that goes over each tool with a use case example” • “For the Excel participants, the study observer presented two well-viewed YouTube video tutorials on Excel and pivot tables/charts.” 49

Slide 50

Slide 50 text

Discoverability • Ability to find user actions without instructions • Airport Kiosk is discoverable • Affordances – usage clues • Example: GestureDB 50

Slide 51

Slide 51 text

GestureDB (Arnab Nandi, Lilong Jiang, Michael Mandel, VLDB 2014) - Discoverability • “Thus, our second study compares VISUAL QUERY BUILDER to GESTUREQUERY in terms of the discoverability of the JOIN action, i.e., whether an untrained user is able to intuit how to successfully perform gestural interaction from the interface and its usability affordances.” • “Each subject was provided a task described in natural language, and asked to figure out and complete the query task on each system within 15 minutes.” • The task involved a PREVIEW, FILTER, and JOIN, described to the subjects as answering the question, “What are the titles of the albums created by the artist ‘Black Sabbath’?” 51

Slide 52

Slide 52 text

Accuracy • Database contract: • Old: Correct answer, unbounded latency • New: Approximate answer, strict latency • Mainly for systems • Sampling • Approximate Query Processing • Online aggregation • Measures error between sample result and true result • Example: Incvisage 52

Slide 53

Slide 53 text

IncVisage (Sajjadur Rahman et al., VLDB 2017) – Online Aggregation (Hellerstein et al., SIGMOD 1997) – Accuracy • System that shows the user incremental visualizations that preserve its salient features. • Performance experiment: Compared avg. mean squared error across iterations against baselines • User Study: Quiz style evaluation – compared against online aggregation (OLA) • “The extrema-based questions asked a participant to find the highest or lowest values in a visualization. The range-based questions asked a participant to estimate the average value over a time range (e.g., months, days of the week).” • Accuracy measured as normalized difference from correct value. 53

Slide 54

Slide 54 text

Metrics • Performance • Throughput • Latency • Scalability • Cache Hit Rate • Query Issuing Frequency • Latency Constraint Violation 54 • Human Factors • Quantitative • Usability • Learnability • Discoverability • Accuracy • Qualitative • User Feedback • Design Study

Slide 55

Slide 55 text

User Feedback • Comments, suggestions, questionnaires/surveys • Pilot study to inform quantitative metrics and UI design • Can be quantitative – Likert scale • Insight based • Think aloud protocols • Anecdotal comments • Example: Scented Widget 55

Slide 56

Slide 56 text

Scented Widgets (Wesley Willett, Jeffrey Heer, and Maneesh Agrawala, TVCG 2007) • UIs embedded with visualizations • “After completing the tasks, subjects filled out a survey that asked them to rate the scenting conditions on perceived utility and user experience.” 56

Slide 57

Slide 57 text

Design Study • Extended interviews with practitioners for task selection • Not a metric • Task Definition – articulate problem space • Example: Zenvisage 57

Slide 58

Slide 58 text

Zenvisage (Tarique Siddiqui, Albert Kim, John Lee, Karrie Karahalios, Aditya Parameswaran, VLDB 2017) – Design Study • Taxonomy of tasks in literature: “The exploration tasks in Amar et al. include: filtering (f), sorting (s), determining range (r), characterizing distribution (d), finding anomalies (a), clustering (c), correlating attributes (co), retrieving value (v), computing derived value (dv), and finding extrema (e).” • Meet with experts: “We hired seven data analysts via Upwork, a freelancing platform—we found these analysts by searching for freelancers who had the keywords analyst or tableau in their profile.” • Workflow analysis: “We conducted one hour interviews with them to understand how they perform data exploration tasks. The interviewees had 3—10 years of prior experience, and told about every step of their workflow; from receiving the dataset to presenting the analysis to clients.” • Validate by experts: “When we asked the data analysts which tasks they use in their workflow, the responses were consistent in that all of them use all of these tasks, except for three exceptions—c, reported by four participants, and e, d, reported by six participants.” 58

Slide 59

Slide 59 text

Outline • Salient Features / Challenges in Evaluation • Guidelines (large-scale survey of related work) • Metrics • Confounding factors and Biases • Case Studies • Inertial Scroll • Filter and Map • Crossfiltering • Open Problems 59

Slide 60

Slide 60 text

Confounding Factors • Learning • Interference • Fatigue 60

Slide 61

Slide 61 text

Learning – Practice Effects • Within-subject study • Same task – different system • Different task - same system • Multiple datasets, multiple systems • Improved performance on second system 61

Slide 62

Slide 62 text

Accounting for Learning • Randomize or Counterbalance • Counterbalancing - Example: Voyager: Exploratory Analysis via Faceted Browsing of Visualization Recommendations (Wongsuphasawat et al., TVCG 2015) • Each participant conducted two exploratory analysis sessions, each with a different visualization tool and dataset. We counterbalanced the presentation order of tools and datasets across subjects. • Different Users for each task - Example: SnapToQuery (Lilong Jiang, Arnab Nandi, VLDB 2015) • “It should be noted that in order to avoid bias and memory effects, these users are recruited separately from the previous experiment.” 62
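A minimal sketch of 2×2 counterbalancing in the style of the Voyager study; the tool and dataset names below are placeholders, not the ones used in that paper.

```python
# Minimal sketch (our own illustration): rotate tool order and tool-dataset
# pairing across participants so that practice effects cancel out.
tools = ["SystemA", "Baseline"]      # placeholder condition names
datasets = ["movies", "flights"]     # placeholder dataset names

def assign_conditions(participant_id):
    """Each participant sees both tools, each with a different dataset."""
    flip_order = participant_id % 2              # counterbalance which tool comes first
    flip_pairing = (participant_id // 2) % 2     # counterbalance which dataset goes with which tool
    t = tools[::-1] if flip_order else tools
    d = datasets[::-1] if flip_pairing else datasets
    return list(zip(t, d))                       # [(first tool, its dataset), (second tool, its dataset)]

# Participants 0..3 together cover all four order/pairing combinations.
```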

Slide 63

Slide 63 text

Interference • Within-subject studies • Diminished performance in second task • Confusing functionalities • Randomization or Counterbalancing • Between-subjects user study • Example: Related Worksheets 63

Slide 64

Slide 64 text

Related Worksheets (Eirik Bakke, David R. Karger, and Robert C. Miller, CHI 2011) • Random Split into two groups • “The subjects were divided randomly into two groups of 26 subjects each, and invited to participate in the main part of the study. Of these, 18 recruits from the Excel group and 18 recruits from the Related Worksheets group completed the main task.” • Same task • “The tasks and instructions in the two main forms were identical except in the description of the tool used to solve the tasks.” 64

Slide 65

Slide 65 text

Fatigue • Long tasks can lead to users performing poorly towards the end • Tasks need to be broken into small chunks • Breaks in between • Example: SIEUFERD (Eirik Bakke, David R. Karger, SIGMOD 2016) • Usability survey in between 20 min tasks 65

Slide 66

Slide 66 text

Biases 66 • 100+ cognitive biases • Participants: • Social desirability bias: Users tell you what you want to hear • Anchoring bias: Preference for first item seen • Researcher: • Framing effect: Do not ask leading questions • Selection bias: Choosing non-random study population

Slide 67

Slide 67 text

A Framework for Studying Biases in Visualization Research (Andre Calero Valdez, Martina Ziefle, Michael Sedlmair, DECISIVe 2017) • Need to understand biases • “carefully studying low-level perceptual and action biases will make up for a good underpinning, not only for better understanding high-level phenomena eventually, but also as a way to better understand decision making with visualization in general.” • Transparency • “Good practice such as reproducibility through publishing all data, codes and experimental setup, using confidence intervals to allow for meta-analysis, and reporting negative findings will be essential in this process.” • Counteracting biases: • “how far is it valid to correct for these biases? Challenging current views, should a visualization “lie” to counteract biases and improve decision making? While for perceptual biases the answer might be quite clear, what about higher-level biases? Should the visualization decide what is in the best interest of the user? For example, may a visualization override the users preference to not know unpleasant information and counteract the ostrich effect?” 67

Slide 68

Slide 68 text

Outline • Salient Features / Challenges in Evaluation • Guidelines (large-scale survey of related work) • Metrics • Biases • Case Studies • Inertial Scroll • Filter and Map • Crossfiltering • Open Problems 68

Slide 69

Slide 69 text

Outline • Salient Features / Challenges in Evaluation • Guidelines (large-scale survey of related work) • Metrics • Biases • Case Studies • Inertial Scroll • Filter and Map • Crossfiltering • Open Problems 69

Slide 70

Slide 70 text

Case Study Design Philosophy • Capture common scenarios for interfaces and devices • Interfaces: map, slider, etc. • Devices: mouse, touch, gestural, etc. • Maximize coverage of query types and interaction techniques • Query types: select, join, etc. • Interaction techniques: zoom, linking & brushing, etc. • Enough variations for workloads • Data size, data shape, etc. 70

Slide 71

Slide 71 text

Workloads Overview
• inertial scrolling: device: touch (trackpad); query interface: scroll; interaction technique: browsing; trace: {timestamp, scrollTop, scrollNum, delta}; query: select, join
• crossfiltering: device: mouse, touch (iPad), gesture (Leap Motion); query interface: slider; interaction technique: linking & brushing; trace: {timestamp, minVal, maxVal, sliderIdx}; query: count aggregation
• filter and map: device: mouse; query interfaces: slider, map, textbox, checkbox; interaction techniques: filtering & navigation; trace: {timestamp, tabURL, requestId, resourceType, type, status}; query: select, join

Slide 72

Slide 72 text

Outline • Salient Features / Challenges in Evaluation • Guidelines (large-scale survey of related work) • Metrics • Biases • Case Studies • Inertial Scroll • Filter and Map • Crossfiltering • Open Problems 72

Slide 73

Slide 73 text

Inertial Scrolling • acceleration helps the user scroll smoothly • widely adopted in mobile devices, touchpads, trackpads, etc. 73

Slide 74

Slide 74 text

Implementing Inertial Scroll • Lazy load • Expensive / impossible to load the full dataset • As an extent is about to come into view, fetch it from the server • SELECT title, year, … FROM imdb LIMIT 100 OFFSET 60 74

Slide 75

Slide 75 text

Overwhelmed Database! • New workload: exponential number of queries issued • Query Scheduling • Which one to serve first? • Too fast = user discards result anyway • Query Execution • Do we need 100% accurate results? 75
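One scheduling policy hinted at above is "latest-only": if the user has already scrolled past a request, drop it and serve only the freshest one. The sketch below is our own illustration, not the case-study implementation; `execute` is a placeholder for running the SQL.

```python
# Minimal sketch (our own illustration): coalesce a burst of scroll-triggered
# queries so only the most recent one reaches the database.
import queue

pending = queue.Queue()

def producer(offset):
    # called on every scroll event
    pending.put(f"SELECT title, year FROM imdb LIMIT 100 OFFSET {offset}")

def consumer(execute):
    while True:
        sql = pending.get()                  # block until a request arrives
        try:
            while True:
                sql = pending.get_nowait()   # drop anything older than the newest request
        except queue.Empty:
            pass
        execute(sql)                         # only the freshest query hits the database
```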

Slide 76

Slide 76 text

Inertial Scrolling Experiments • Goal: Understand impact of scrolling behavior on the DB • Asked 15 users to skim IMDB movie records and select interesting movies • Record scroll / wheel events, tracking timestamp, scrollTop (and pixel delta), and the number of tuples scrolled • SELECT title, year, … FROM imdb LIMIT 100 OFFSET 60 76

Slide 77

Slide 77 text

Inertial Scrolling: Scrolling Speeds • User scrolls much larger extents with inertial • y-axis: 400 vs 4 • Lazy load is not practical: either too little or too wasteful • User often reaches end of page before items are loaded = UI is blocked [figure: wheel delta vs. timestamp (ms), inertial vs. non-inertial]

Slide 78

Slide 78 text

Inertial Scrolling: Scroll Speed [figure: per-user counts of movies selected vs. backscrolled selections] • (left) Some users scroll more wildly than others – non-uniform audience • (right) Number of backscrolls > number of movies selected • Users forget / overshoot and then return to revisit

Slide 79

Slide 79 text

Inertial Scrolling: Performance • Lazy loading strategies • Naïve: fetch when tuple placeholder is in view • Blocking operation, user waits • Event: at each scroll event, check cache, prefetch • Computationally expensive • Timer: prefetch every n milliseconds • Need to tune parameter based on usage 79
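As an illustration of the timer strategy above, here is a minimal sketch; the names, page size, and the 250 ms interval are our own assumptions, not values from the case study.

```python
# Minimal sketch (our own illustration): timer-based prefetch for lazy loading.
import threading

PAGE = 100
cache = {}          # offset -> rows already fetched

def timer_prefetch(get_scroll_offset, run_query, interval_ms=250):
    def tick():
        # prefetch the next page below the current viewport if not cached
        offset = (get_scroll_offset() // PAGE + 1) * PAGE
        if offset not in cache:
            cache[offset] = run_query(
                f"SELECT title, year FROM imdb LIMIT {PAGE} OFFSET {offset}")
        threading.Timer(interval_ms / 1000, tick).start()   # re-arm the timer
    tick()
```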

Slide 80

Slide 80 text

Inertial Scrolling: Latency 80 • Average latency, vary # tuples fetched • Event fetch: insensitive to # tuples fetched • Timer fetch: ~60 seconds when # tuples is low, fast when # tuples is high

Slide 81

Slide 81 text

Outline • Salient Features / Challenges in Evaluation • Guidelines (large-scale survey of related work) • Metrics • Biases • Case Studies • Inertial Scroll • Brushing and Linking with Maps • Crossfiltering • Open Problems 81

Slide 82

Slide 82 text

Outline • Salient Features / Challenges in Evaluation • Guidelines (large-scale survey of related work) • Metrics • Biases • Case Studies • Inertial Scroll • Filter and Map • Crossfiltering • Open Problems 82

Slide 83

Slide 83 text

Filter and Map • Interfaces • map • slider • Guidelines: • parameters • benchmark 83

Slide 84

Slide 84 text

Filter and Map: Behavior • Study users browsing on AirBnB • Find short-term housing • At least 20 mins • Look at • Query Actions • Map Zoom Behavior • Map Dragging Behavior 84

Slide 85

Slide 85 text

Filter and Map: Query Actions [bar chart: share of query actions by widget – Map, Slider/Checkbox, Button, Text Box] • Unsurprisingly, map interactions are a very popular way for query refinement

Slide 86

Slide 86 text

Filter and Map: Map Zoom Behavior 86 • Most users converge at zoom levels 11–13 • Change in zoom levels is at most 3 • Insight can be used to prefetch / precompute

Slide 87

Slide 87 text

Filter and Map: Map Dragging (Panning) Behavior 87 • At deeper levels, users move smaller distances (confirming intuition) • Can be used to prefetch / create ideal tile resolutions

Slide 88

Slide 88 text

Filter and Map: Performance 88 • 70% of queries have 4 filter conditions • Precompute at least ~ C(n,4) * 2^4 result sets (see the sketch below) • Request time is much lower than exploration time • Good case for building a prefetching layer
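A quick back-of-the-envelope for the precomputation estimate, under our reading that 4 filters are chosen out of n filterable attributes and that the 2^4 factor accounts for per-filter states; this interpretation is an assumption, not stated on the slide.

```python
# Rough sketch (our own arithmetic, interpretation assumed): size of the
# precomputation space for 4-way filter combinations over n attributes.
import math

n = 10                                   # illustrative number of filterable attributes
combinations = math.comb(n, 4) * 2**4    # 210 * 16 = 3,360 result sets to precompute
```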

Slide 89

Slide 89 text

Outline • Salient Features / Challenges in Evaluation • Guidelines (large-scale survey of related work) • Metrics • Biases • Case Studies • Inertial Scroll • Filter and Map • Crossfiltering • Open Problems 89

Slide 90

Slide 90 text

Outline • Salient Features / Challenges in Evaluation • Guidelines (large-scale survey of related work) • Metrics • Biases • Case Studies • Inertial Scroll • Filter and Map • Crossfiltering • Open Problems 90

Slide 91

Slide 91 text

Crossfiltering 91 • Each histogram corresponds to one attribute / dimension • Histograms for other attributes are updated synchronously while the user is manipulating one slider • Multiple (n – 1) queries are issued at the same time (see the sketch below)
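A minimal sketch of the query pattern described above, with illustrative table and column names (the case study later in the deck uses a 3D road network with longitude, latitude, and height). Whether a histogram is also filtered by its own range is a design choice; here each histogram is filtered only by the other attributes' selections.

```python
# Minimal sketch (our own illustration): when one slider moves, build the
# n-1 count-aggregation (histogram) queries for the other attributes.
def crossfilter_queries(active_attr, ranges, bounds, bins=20, table="road_network"):
    """ranges: current slider selections {attr: (lo, hi)};
    bounds: full attribute domains {attr: (min, max)} used for fixed bin widths."""
    queries = []
    for attr, (lo, hi) in bounds.items():
        if attr == active_attr:
            continue                                   # the manipulated histogram itself is not recomputed
        preds = " AND ".join(
            f"{a} BETWEEN {r_lo} AND {r_hi}"
            for a, (r_lo, r_hi) in ranges.items() if a != attr) or "TRUE"
        width = (hi - lo) / bins
        queries.append(
            f"SELECT floor(({attr} - {lo}) / {width}) AS bin, count(*) "
            f"FROM {table} WHERE {preds} GROUP BY 1 ORDER BY 1")
    return queries
```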

Slide 92

Slide 92 text

Crossfilter demo 92

Slide 93

Slide 93 text

Crossfilter experiments • How do different UI devices impact crossfilter workloads? 93 Leap Motion Mouse Touch

Slide 94

Slide 94 text

SnapToQuery In Action “SnapToQuery”: VLDB 2015

Slide 95

Slide 95 text

Crossfiltering: Experimental Setup • Dataset: 3D road network dataset (three attributes: longitude, latitude, height; 434,874 tuples) • Configuration: PostgreSQL vs. MemSQL on a Linux machine • Task & Users: 30 traces from SnapToQuery • Trace: timestamp, range values, slider idx • Query: Multiple histogram queries 95

Slide 96

Slide 96 text

Crossfiltering: Behavior • Sliding Behavior • Traces for three devices • Querying Behavior • Two behavior-driven optimizations • Performance metrics [figure: slider being dragged with histograms for ranges 0–50, 50–70, 150–300]

Slide 97

Slide 97 text

Crossfiltering: Sliding Behavior [trace plots: mouse, touch, Leap Motion] • Leap Motion presents more jitter than the mouse and touch.

Slide 98

Slide 98 text

Crossfiltering: Optimizations • Interface-driven (skip): skip queries that were already skipped by the frontend • Result-driven (KL>0 or KL>0.2): skip queries whose result is the same as or similar to the previous one (see the sketch below) • Both ideas are areas of future inquiry • (aka please steal these ideas and write papers on them!) 98
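A minimal sketch of the result-driven check: compare the previous and new histogram for a dimension using KL divergence and skip the update when the change is below the threshold (0 or 0.2 in the case study). The smoothing and normalization choices are our own assumptions; in the evaluation this comparison is applied to results that are already available (e.g., offline trace analysis).

```python
# Minimal sketch (our own illustration): result-driven skipping via KL divergence.
import math

def kl_divergence(prev_counts, new_counts, eps=1e-9):
    p_total = sum(prev_counts) or 1
    q_total = sum(new_counts) or 1
    kl = 0.0
    for p_c, q_c in zip(prev_counts, new_counts):
        p = p_c / p_total + eps        # smooth to avoid log(0); bins are aligned by position
        q = q_c / q_total + eps
        kl += p * math.log(p / q)
    return kl

def should_skip(prev_counts, new_counts, threshold=0.2):
    """Skip the histogram update if the result barely changed."""
    return kl_divergence(prev_counts, new_counts) <= threshold
```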

Slide 99

Slide 99 text

Crossfiltering: Performance • Metrics • Query Issuing Frequency • Query Reduction • Latency • Latency Constraint Violation • Factors: databases (PostgreSQL, MemSQL), devices (mouse, touch, Leap Motion), and optimization methods (raw, KL>0, KL>0.2, skip) 99

Slide 100

Slide 100 text

Crossfiltering: Query Issuing Frequency 100 • The number of queries issued by Leap Motion is much larger than for mouse and touch (y-axis scale: 2500 vs 120) • Result-driven issuing (KL > 0 and KL > 0.2) drastically reduces the number of queries • Even when issuing queries selectively, the dominant query issuing interval varies little, especially for Leap Motion: intervals for Leap Motion concentrate at 20–25 ms, while for mouse and touch the distribution is broader

Slide 101

Slide 101 text

Crossfiltering: Latency 101 • MemSQL can maintain a latency of 10–50 ms. After some optimization (KL>0.2 or skip), PostgreSQL can maintain a latency of 100–1000 ms (sub-second). • Leap Motion has a denser workload than the mouse and touch.

Slide 102

Slide 102 text

Crossfiltering: Query Reduction 102 • For the skip strategy, the percentage of queries reduced depends more on the database type than on the device type • For the result-driven strategy (KL>0 and KL>0.2), the percentage depends more on the device than on the database type

Slide 103

Slide 103 text

Latency Constraint Violations • Current Approach: Min / Max / Average Latency • Problem: Does not capture full picture, especially in complex UIs / session-based analysis • Solution: Measure Latency Constraint Violations 103

Slide 104

Slide 104 text

Latency Constraint Violations [timeline figure: queries Q1–Q3 over time]

Slide 105

Slide 105 text

Latency Constraint Violations [timeline figure: queries Q1–Q3 over time, highlighting Q1]

Slide 106

Slide 106 text

Crossfiltering: Querying Behavior [timeline figure: user issuing vs. database issuing of Q1–Q4, with latency, issuing interval, execution time, execution delay, and result timestamps] Observation 1: • Q1, Q2, Q3, Q4 are issued one after another • Query issuing frequency

Slide 107

Slide 107 text

Crossfiltering: Querying Behavior [same timeline figure] Observation 2: • Execution delay will become larger and larger • Latency & latency constraint violation

Slide 108

Slide 108 text

Crossfiltering: Adjacent Queries [same timeline figure] Observation 3: • Adjacent queries: same, identical, similar • Result-driven: skip Q2, Q3, run Q4

Slide 109

Slide 109 text

Crossfiltering: Skipped Queries [same timeline figure] Observation 4: • Queries are already skipped in the frontend • Interface-driven: skip Q2, Q3, run Q4

Slide 110

Slide 110 text

Crossfiltering: Latency Constraint Violation 110 • Fewer queries violate the latency constraint for MemSQL than for PostgreSQL. • For MemSQL, issuing queries with KL>0 reduces the number of violating queries by about half. • For PostgreSQL, issuing queries with KL>0.2 decreases violations by about 30% for the mouse and touch and by 17% for Leap Motion.

Slide 111

Slide 111 text

Outline • Salient Features / Challenges in Evaluation • Guidelines (large-scale survey of related work) • Metrics • Biases • Case Studies • Inertial Scroll • Filter and Map • Crossfiltering • Open Problems 111

Slide 112

Slide 112 text

Recap • Salient Features / Challenges in Evaluation • Guidelines (large-scale survey of related work) • Metrics • Biases • Case Studies • Inertial Scroll • Filter and Map • Crossfiltering • New Metrics • Latency Constraint Violations • QIF 112

Slide 113

Slide 113 text

Outline • Salient Features / Challenges in Evaluation • Guidelines (large-scale survey of related work) • Metrics • Biases • Case Studies • Inertial Scroll • Filter and Map • Crossfiltering • Open Problems 113

Slide 114

Slide 114 text

Open Problem: Multimodal UIs • How to optimally combine speech + gestures + keyboard + ______ ? • Mixed-initiative interfaces • How do we measure performance across all interfaces? [chart: response time (slow to fast) vs. vocabulary (simple to rich), positioning speech, touch, and “the perfect interface”]

Slide 115

Slide 115 text

Open Problems: Recognizing Human Limits [chart: capability vs. time (decades) for human perception & cognition and for computer science, marking “right now”]

Slide 116

Slide 116 text

Recognizing Human Limits • Context: • Interactive Visualizations • Intuition: • If you can’t tell the difference, why compute it? • Approach: • Measure human limitations in perception via an MTurk study (derive perceptual functions) • Push perceptual functions down to the DB for optimizations • “Approximate User” for evaluations [architecture diagram (InterVis): frontend (interaction, pixels, visualization system) and backend (DBMS with perceptual execution) exchanging queries, perceptual functions, and result sets over the network] • DSIA@VIS 2015 w/ Eugene Wu • Open Problems: VLDB 2017 w/ Joe Hellerstein

Slide 117

Slide 117 text

Outline • Introduction • Guidelines • Salient Features • Metrics • Biases • Case Studies • Inertial Scroll • Filter and Map • Crossfiltering • Open Problems • Conclusion 117

Slide 118

Slide 118 text

Conclusion • Salient Features in Interactive Data Systems • Very important to model user interaction in both system design and evaluation! • Guidelines (large-scale survey of related work) • Metrics (new metrics) • Biases • Case Studies, Behavioral Analysis, Optimizations • Inertial Scroll • Brushing and Linking with Maps • Crossfiltering • Open Problems 118

Slide 119

Slide 119 text

Thank you! papers, videos and more at http://arnab.org