Slide 1

Assessing Subjective Quality of Web Interaction with Neural Network as Context of Use Model
Maxim Bakaev ([email protected]), Vladimir Khvorostov, Tatiana Laricheva
Novosibirsk State Technical University (Russia)
Tools and Methods for Program Analysis (TMPA-2017), Moscow

Slide 2

Introduction – Motivation
- Error removal is the most time-consuming phase of the software life cycle (effort grows non-linearly with complexity/scale)
- The level of advancement of QA technologies imposes a limit on software size (currently on the order of 10^8 lines of code)
- Analysis, testing, and error-removal tools have traditionally received less attention than, e.g., CASE tools for requirements or design
- Web-based software:
  - is ubiquitous (over 100 million active websites)
  - is often built by small and/or inexperienced development teams
  - is nevertheless increasing in complexity (even legacy IS migrate to the web)
  - is under pressure for rapid development, frequent updates and fixes
- Automation/support of QA activities is therefore highly desired

Slide 3

Introduction – Web QA Automation
- Unit and integration testing – good (auto-generation of tests is even emerging)
- Validation/verification – fair (GUI testing automation tools, e.g. Selenium scripts driving the browser)
- Load testing – excellent (naturally automatable)
- Visual appearance (of web pages) – poor (highly subjective, while image analysis is complex)
- Usability and interaction quality:
  - Objective dimension: effectiveness (functions performed), accessibility (compliance with standards), speed – fair
  - Subjective dimension: satisfaction, trust, etc. – poor (relative to the context of use, i.e. user- and task-dependent)

Slide 4

RESEARCH QUESTION: How accurately can we predict users’ subjective impression of a website without an actual user?
METHODS:
- Artificial Neural Network (NN) to model user preferences
- Data collection in an experimental survey, to train the network

Slide 5

Automated Usability Evaluation
- Interaction-based UE: based on data from interactions
  - mouse cursor behavior, “optimality” of interaction from logs, etc.
  - real users must do real work with potentially poor designs
- Metric-based UE: defined and quantified website metrics
  - e.g. conformance to guidelines, complexity, ratio of graphics to text
  - what is the effect of user tasks and contexts on the metrics’ significance?
- Model-based UE: simulation, CBR, data mining
  - relies on models (mainly Domain, User, and Tasks) and general knowledge about human behavior and web technologies
  - high computational complexity and work effort to maintain the models
- Hybrid approaches: AI, machine learning
  - real-time processing of interaction data for interface augmentation, trained usability evaluation models, etc.

Slide 6

Artificial Neural Networks
- An NN is a sophisticated way of specifying a function
- Structure: input layer, one or more hidden layers, output layer
- Learning: find a function with the smallest possible “cost”
- Model quality: classification accuracy (sum-of-squares relative error)
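The slide's view of an NN as "a function" can be made concrete with a minimal forward pass: one sigmoid hidden layer, an identity output, and a squared-error cost that learning would minimize. This is an illustrative sketch with random weights, not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    # Logistic activation used in the hidden layer
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    # One hidden layer (sigmoid) followed by an identity output layer
    h = sigmoid(W1 @ x + b1)
    return W2 @ h + b2

def cost(y_pred, y_true):
    # Squared-error "cost" that training tries to make as small as possible
    return float(np.mean((y_pred - y_true) ** 2))

rng = np.random.default_rng(0)
x = rng.normal(size=3)                            # 3 input features
W1, b1 = rng.normal(size=(10, 3)), np.zeros(10)   # 10 hidden neurons
W2, b2 = rng.normal(size=(1, 10)), np.zeros(1)    # 1 output neuron
y_pred = forward(x, W1, b1, W2, b2)
print(cost(y_pred, np.array([4.0])))
```

Gradient descent would repeatedly adjust `W1, b1, W2, b2` in the direction that lowers this cost.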

Slide 7

NNs for Web Usability Assessment
- NNs in software testing automation:
  - predicting defects, resulting quality and costs; acting as a test oracle, etc.
  - mostly focused on functional requirements
  - exception: Kansei Engineering, which can use NNs to link design parameters (input neurons) with subjective evaluations on emotional scales (output neurons)
- For our research problem:
  - input: factors of the context of use – User, Platform, and Environment – plus analysis of the factors’ importance
  - output: subjective interaction quality attributes

Slide 8

The NN Model Structure
- Domain: fixed (Education and Career) – university websites, 11 from Germany and 10 from Russia
- User-related factors:
  - age, gender, language/culture group
- Platform-related factors:
  - website country group, number of website sections, Flesch-Kincaid Grade Level (https://readability-score.com), number of errors plus warnings (https://validator.w3.org)
- Environment-related factors:
  - page load time, global rank, bounce rate (http://alexa.com)
- Output: evaluations on the “emotional” scales
  - Beautiful, Evident, Fun, Trustworthy, and Usable
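Of the platform-related factors, the Flesch-Kincaid Grade Level has a standard closed-form definition; the slide obtained it from a web service, but it can be computed directly from word, sentence, and syllable counts. The counts below are made-up illustrative numbers.

```python
def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    # Standard Flesch-Kincaid Grade Level formula:
    # 0.39 * (words/sentence) + 11.8 * (syllables/word) - 15.59
    return (0.39 * (total_words / total_sentences)
            + 11.8 * (total_syllables / total_words)
            - 15.59)

# e.g. a page excerpt with 100 words, 5 sentences, 160 syllables
print(round(flesch_kincaid_grade(100, 5, 160), 2))  # → 11.09
```

The result approximates the US school grade needed to understand the text, which is why it serves here as a content-complexity factor.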

Slide 9

Experiment – Description
- Participants (82):

| | German users | Russian users |
|---|---|---|
| Total number of participants | 40 | 42 |
| Gender: Male | 90% | 71.4% |
| Gender: Female | 10% | 28.6% |
| Program: Bachelor | 35% | 54.8% |
| Program: Master | 65% | 45.2% |
| Age: Range | 19–33 | 20–28 |
| Age: Mean | 24.5 | 21.7 |
| Age: SD | 3.19 | 0.89 |
| Native language: Common | German: 75% | Russian: 90.5% |
| Native language: Others | 25% | 9.5% |

- Procedure: each user evaluated 10 websites randomly selected from the 21, on the five subjective scales, with values for each ranging from 1 (worst) to 7 (best).

Slide 10

Results – Descriptive Statistics
- In total, 820 evaluations (divided into NN training, testing, and holdout datasets as 80% / 10% / 10%).

Mean (SD) evaluations per scale:

| Scale | German users | Russian users | Difference |
|---|---|---|---|
| Beautiful | 3.83 (0.59) | 4.10 (0.98) | |
| Trustworthy | 4.10 (0.37) | 4.53 (0.76) | p=0.058 |
| Fun | 3.54 (0.50) | 4.20 (0.97) | p=0.025 |
| Evident | 3.79 (0.27) | 4.60 (0.66) | p<0.001 |
| Usable | 3.78 (0.36) | 4.52 (0.79) | p=0.003 |

Correlations between the scales:

| Scales | Beautiful | Trustworthy | Fun | Evident | Usable |
|---|---|---|---|---|---|
| Beautiful | 1 | 0.670 | 0.767 | 0.563 | 0.605 |
| Trustworthy | | 1 | 0.673 | 0.648 | 0.675 |
| Fun | | | 1 | 0.638 | 0.661 |
| Evident | | | | 1 | 0.813 |
| Usable | | | | | 1 |

Slide 11

Results – the NN Model
- Multilayer Perceptron: 1 hidden layer with 10 neurons; optimization algorithm – gradient descent; activation functions – sigmoid (hidden) and identity (output).
- Overall relative error: 0.737 (moderate)
- Residuals for Usable evaluations (converted from ordinal to scale)

| Scale | Relative error |
|---|---|
| Beautiful | 0.656 |
| Fun | 0.707 |
| Evident | 0.762 |
| Trustworthy | 0.786 |
| Usable | 0.792 |
| Model | 0.737 |
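The architecture on this slide maps directly onto a standard MLP regressor, and the reported sum-of-squares relative error is the ratio SSE / SStotal. The sketch below uses scikit-learn and synthetic stand-in data (10 random features for the 10 context-of-use factors); it mirrors the stated configuration but is not the authors' actual model or data.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))                  # stand-in for the 10 factors
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=200)  # synthetic target

# One hidden layer with 10 neurons, sigmoid ("logistic") hidden activation,
# gradient descent ("sgd"); MLPRegressor's output activation is identity
model = MLPRegressor(hidden_layer_sizes=(10,), activation="logistic",
                     solver="sgd", learning_rate_init=0.05,
                     max_iter=2000, random_state=0).fit(X, y)

pred = model.predict(X)
# Sum-of-squares relative error: residual sum of squares over
# total sum of squares around the mean (0 = perfect, 1 = no better than mean)
rel_error = np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
print(round(float(rel_error), 3))
```

On this reading, the slide's overall 0.737 means the NN explains roughly a quarter of the variance in the subjective evaluations.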

Slide 12

Results – Factors and Scales
- Prediction error is lower (better) for “common” scales (Beautiful, Fun) than for “specialized” ones (Trustworthy, Usable).
- Factors’ importance (in building the NN models):
  - Demographics: age diversity is important, unlike gender.
  - Cross-cultural: diversity in websites and users is moderately important.
  - Websites: diversity of content and technical quality is important, but scale much less so.

| Factor | Normalized importance |
|---|---|
| Flesch-Kincaid Grade Level | 100.0% |
| User Age | 90.9% |
| Alexa Bounce Rate | 75.9% |
| Page Load Time | 73.0% |
| Website Group | 61.4% |
| Errors + Warnings | 59.4% |
| User Language Group | 39.0% |
| Alexa Global Rank | 30.6% |
| Number of Sections | 24.6% |
| User Gender | 8.8% |
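The slides do not say how the normalized importances were computed. One common model-agnostic way to obtain such a ranking is permutation importance: shuffle one factor's column, measure how much the model's error grows, and scale so the largest effect is 100%. The sketch below is purely illustrative, with a toy linear "model" in place of the trained NN.

```python
import numpy as np

def normalized_importance(model_fn, X, y, seed=0):
    # Permutation importance: error increase when one factor's
    # column is shuffled, normalized so the top factor scores 100%
    rng = np.random.default_rng(seed)
    base = np.mean((model_fn(X) - y) ** 2)
    raw = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])
        raw.append(np.mean((model_fn(Xp) - y) ** 2) - base)
    raw = np.maximum(raw, 0.0)
    return 100.0 * raw / raw.max()

# Toy "model" that depends strongly on factor 0, weakly on factor 1
X = np.random.default_rng(2).normal(size=(500, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1]
imp = normalized_importance(lambda M: 3.0 * M[:, 0] + 0.5 * M[:, 1], X, y)
print(int(np.argmax(imp)))  # → 0 (the dominant factor)
```

Shuffling an important factor breaks the model's use of it, so a large error increase signals high importance, matching the ranking logic of the table above.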

Slide 13

Conclusions
- Can we automate web interaction quality assessment?
  - websites are abundant, but experts’ and users’ time is not
- An ANN predicts subjective evaluations for a fixed domain (university websites) and target user group (students):
  - input: factors of the context of use – User, Platform, Environment
  - output: subjective interaction quality scales
- Experimental survey to collect data for the NN training:
  - 82 users, 21 websites, 5 scales (820 full evaluations)
- Moderate predictive potential (relative error 0.737), but better for more “simple” scales (e.g. Beautiful: 0.656).
- Further research prospects:
  - varying the domain and target-user factors
  - analyzing the effect of richer training data on the NN quality