Measuring the Cognitive Load of Software Developers: A Systematic Mapping Study

Bruno C. da Silva

May 25, 2019

Transcript

  1. Measuring the Cognitive Load of Software Developers: A Systematic Mapping Study

    Lucian Gonçales, Kleinner Farias (Unisinos, Brazil); Bruno da Silva, Jonathan Fessler (Cal Poly, USA). 27th IEEE/ACM International Conference on Program Comprehension, Sat 25 - Sun 26 May 2019, Montreal, QC, Canada.
  2. We’ve come a long way since we started to measure software artifacts

    And recently, researchers in SE have started to measure the human body. https://www.quora.com/How-can-you-measure-the-electrical-activity-in-the-brain
  4. Our Objectives

    Provide a classification and a thematic analysis of studies on the measurement of developers’ cognitive load.
    Pinpoint gaps and possible research directions for future work.
  13. Systematic Mapping Study (SMS)

    1. What are the types of sensors for measuring the cognitive load of developers?
    2. What metrics have been used to measure developers’ cognitive load?
    3. What algorithms have been used to classify developers’ cognitive load?
    4. For what purpose?
    5. Which tasks have been used to measure developers’ cognitive load?
    6. What were the artifacts used on cognitive tasks?
    7. How many participants did the studies recruit to measure developers’ cognitive load?
    8. Which research methods have been used to investigate cognitive load in software development tasks?
    9. Where have the studies been published?
  16. Search String

    (“brain computer interfaces” OR sensors OR devices)
    AND (“psychophysiological indicators” OR “brain synchronization” OR “cognitive load” OR emotions OR biometrics)
    AND (“software engineering” OR “software development” OR “software testing” OR “software maintenance” OR “computer programming” OR diagram OR code)
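The search string is three OR-groups joined by AND, so it can be assembled from term lists, which may help when the query has to be adapted to each database. A minimal sketch, assuming straight double quotes around multi-word phrases; the helper names `quote`, `or_group`, and `build_query` are hypothetical and not part of the authors' tooling:

```python
# Term groups copied from the slide's search string.
DEVICE_TERMS = ["brain computer interfaces", "sensors", "devices"]
MEASURE_TERMS = [
    "psychophysiological indicators", "brain synchronization",
    "cognitive load", "emotions", "biometrics",
]
CONTEXT_TERMS = [
    "software engineering", "software development", "software testing",
    "software maintenance", "computer programming", "diagram", "code",
]


def quote(term):
    """Wrap multi-word phrases in double quotes; leave single words bare."""
    return f'"{term}"' if " " in term else term


def or_group(terms):
    """Join the terms of one group with OR and parenthesize the group."""
    return "(" + " OR ".join(quote(t) for t in terms) + ")"


def build_query(*groups):
    """Join the parenthesized groups with AND."""
    return " AND ".join(or_group(g) for g in groups)


query = build_query(DEVICE_TERMS, MEASURE_TERMS, CONTEXT_TERMS)
print(query)
```

Keeping the groups as lists makes it easy to see which facet (device, measure, context) each term belongs to.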
  17. Paper databases and initial search

    Databases: ACM Digital Library, CiteSeerX, Google Scholar, IEEE Xplore, Inspec, Microsoft Academic, PubMed, Scopus, Science Direct, Springer Link, Wiley Online Library.
    Initial search with the search string: 2,612 articles.
  18. Results: RQ1 - Sensors

    Software engineering researchers have preferred EEG sensors to collect data related to cognitive load.
    # of studies per sensor type: EEG 18; Combined sensors 12; Eye-trackers 2; fMRI 1.
  20. Results: RQ1 - Sensors

    Many studies have combined sensors. A trend to improve accuracy. Signal groups shown with the cited example: pupil size, fixation, blinks; EDA (electrodermal activity), skin temperature, heart rate, BVP (blood volume pulse); EEG waves, frequency bands (FBs), attention, meditation.

    S. C. Müller and T. Fritz, "Stuck and Frustrated or in Flow and Happy: Sensing Developers' Emotions and Progress," 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, Florence, 2015, pp. 688-699.
  21. Results: RQ1 - Sensors

    Gap: application of high resolution devices, such as EEGs with 128 and 256 channels. https://www.cognionics.net/mobile-128
  22. Results: RQ2 - Metrics

    # of studies per metric: Combination of metrics 13; Frequency bands 4; Power spectrum 3; ERD 2; ERP 2; FD 2; VOI 2; Eye fixation 1; ERSP 1; IAF 1; ICA 1; SSVEP 1.
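To make the "Frequency bands" and "Power spectrum" metrics concrete, the sketch below computes a power spectrum with a naive discrete Fourier transform and sums it over bands. It is an illustration only: the signal is synthetic (not EEG data from any study), the 128 Hz sampling rate is an assumption, and the band limits (theta 4-8 Hz, alpha 8-13 Hz, beta 13-30 Hz) are one common convention that varies between authors.

```python
import cmath
import math


def power_spectrum(signal, sample_rate):
    """Return (frequencies, power) for the non-negative half of a naive DFT."""
    n = len(signal)
    freqs, power = [], []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        freqs.append(k * sample_rate / n)
        power.append(abs(coeff) ** 2 / n)
    return freqs, power


def band_power(freqs, power, low, high):
    """Sum the power of all frequency bins with low <= f < high."""
    return sum(p for f, p in zip(freqs, power) if low <= f < high)


SAMPLE_RATE = 128  # Hz, assumed for this example
# Synthetic 2-second signal: a strong 10 Hz component plus a weak 20 Hz one.
signal = [
    math.sin(2 * math.pi * 10 * t / SAMPLE_RATE)
    + 0.3 * math.sin(2 * math.pi * 20 * t / SAMPLE_RATE)
    for t in range(2 * SAMPLE_RATE)
]

# Band limits in Hz; an assumed convention, not taken from the slides.
BANDS = {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}

freqs, power = power_spectrum(signal, SAMPLE_RATE)
band_powers = {
    name: band_power(freqs, power, low, high)
    for name, (low, high) in BANDS.items()
}
dominant = max(band_powers, key=band_powers.get)
print(dominant)  # alpha, because the 10 Hz component is the strongest
```

The naive DFT keeps the example dependency-free; a real analysis would normally use an FFT routine and windowing.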
  23. Results: RQ2 - Metrics

    40% of primary studies applied multiple metrics to measure developers’ cognitive load.
  24. Results: RQ2 - Metrics

    42% of studies (14/33) focused on measures that are related to EEGs.
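The percentages on the RQ2 slides follow from the 33 primary studies. A short check, assuming the 13-study bar in the metrics chart belongs to "Combination of metrics" (an assumption read off the chart values; only 14/33 is stated explicitly):

```python
TOTAL_STUDIES = 33  # primary studies in the mapping


def share(count, total=TOTAL_STUDIES):
    """Percentage of studies, as a float."""
    return 100 * count / total


eeg_share = share(14)       # stated on the slide as 14/33
combined_share = share(13)  # assumption: 13 studies combined several metrics

print(f"EEG-related measures: {eeg_share:.1f}%")      # 42.4%
print(f"Multiple metrics:     {combined_share:.1f}%")  # 39.4%
```

Under that assumption the "40%" figure is 39.4% rounded up, while the 42% figure matches 14/33 directly.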
  25. Results: RQ2 - Metrics

    Gap: there’s no clear distinction between ‘cognitive load’ and ‘mental effort’ found in software engineering papers.
  26. Results: RQ3 - Algorithms/ML

    # of studies per technique: Basic stats 13; SVM 5; Naive Bayes 5; Multi Algos 3; Decision Tree 1; K-means 1; Logistic Regression 1; Neural Network 1; Random Forest 1; RF Learners 1; RVM 1; Linear Regression 1.
  27. Results: RQ3 - Algorithms/ML

    Many studies did not use any specific ML algorithm (40%).
  28. Results: RQ3 - Algorithms/ML

    More occurrences of classification techniques compared to regression.
  29. Results: RQ3 - Algorithms/ML

    Gap: All models were built from data collected in highly controlled settings. Would those models hold high accuracy in real scenarios with less controlled settings? How? To what extent?
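As an illustration of the supervised classification setup that the RQ3 slides refer to, here is a minimal Gaussian Naive Bayes classifier in plain Python. Everything in it is an assumption made for the example: the two features (an alpha-power ratio and a pupil diameter in mm), the "low"/"high" labels, and the toy values are invented and do not come from any of the mapped studies.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(samples, labels):
    """Estimate, per class, the prior and each feature's mean and variance."""
    by_class = defaultdict(list)
    for features, label in zip(samples, labels):
        by_class[label].append(features)
    model = {}
    for label, rows in by_class.items():
        columns = list(zip(*rows))
        means = [sum(col) / len(col) for col in columns]
        # A small constant keeps every variance strictly positive.
        variances = [
            sum((x - m) ** 2 for x in col) / len(col) + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (len(rows) / len(samples), means, variances)
    return model


def predict(model, features):
    """Return the class with the highest log posterior."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for x, m, v in zip(features, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        return score
    return max(model, key=log_posterior)


# Invented toy data: [alpha_power_ratio, pupil_diameter_mm] per task window.
train_x = [[0.8, 3.1], [0.7, 3.3], [0.9, 3.0], [0.4, 4.2], [0.3, 4.5], [0.5, 4.1]]
train_y = ["low", "low", "low", "high", "high", "high"]

model = fit_gaussian_nb(train_x, train_y)
print(predict(model, [0.75, 3.2]))  # low
print(predict(model, [0.35, 4.4]))  # high
```

A model like this fits cleanly on lab data; the gap raised on the slide is whether such accuracy would survive noisier recordings outside controlled settings.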
  30. Results: RQ4 - Purpose

    # of studies per purpose: Code comprehension 8; Emotion recognition 6; Task difficulty 6; Cognitive demand 3; Productivity 2; Stress level 2; Authentication 1; Code quality 1; Interruptibility 1; Pair-dynamic level 1; Performance 1; Satisfaction 1.
  31. Results: RQ4 - Purpose

    Researchers looking for a more reliable way of measuring program comprehension compared to task completion time or correctness.
  32. Results: RQ4 - Purpose

    Researchers looking for emotion awareness in programming/software engineering.
  33. Results: RQ4 - Purpose

    Researchers looking for patterns of task difficulty by measuring the human body.
  34. Results: RQ4 - Purpose

    Gap: What about looking at the human side as an end, not just as a means? (e.g., better understanding of developers’ burnout)
  35. Results: RQ5 - Tasks & RQ6 - Artifacts

    # of studies per task: Programming 16; Observing 6; Doing math 3; Multitasks 3; Listening 2; Reading 2; Making choices 1.
    # of studies per artifact: Code 16; Images 4; Math equations 3; Text 3; Localization 1; Product 1; Reports 1; Sounds 1; Sounds and Images 1; Video 1.
  36. Results: RQ5 - Tasks & RQ6 - Artifacts

    Not surprisingly, programming is the most common task and code the most common artifact.
  37. Results: RQ5 - Tasks & RQ6 - Artifacts

    Interesting to see other important skills typically exercised by engineers (observing, doing math, multitasking, listening, reading, making choices).
  38. Results: RQ5 - Tasks & RQ6 - Artifacts

    Gap: What about… Testing? Design process/artifacts? Code review? Resolving merge conflicts?
  39. Results: RQ7 - # Participants & RQ8 - Research methods

    # of studies per number of participants: 11-20: 16; 21-30: 6; 0-10: 4; 31-40: 4; 41-50: 3.
    # of studies per research method: Controlled Experiment 24; Proposal only 8; Opinion paper 1.
  43. Results: RQ7 - # Participants & RQ8 - Research methods

    Participants: 11-20 is the most common range. Gap: How to increase the # of participants when the sensors are “invasive”?
    Research methods: Controlled settings are the most common approach. Gap: Why not measure developers’ cognitive load in real scenarios (in situ)?
  52. In conclusion…

    Facets | Common trends found | Gaps/Challenges & Opportunity for future work
    Sensors | EEGs (low resolution); Combination of sensors | High resolution EEGs; More combination of sensors; fMRI (?)
    Metrics | EEG-related metrics; Combination of multiple metrics | Distinguish mental effort from cognitive load measurement
    Research method | Controlled experiments | Build new models on real world industry settings
    Algorithms | No numerical/computational analysis; Supervised ML (classification) | Test existing models on real world industry settings
    Purpose | Code comprehension; Task difficulty; Emotion recognition | Better understanding and improving the human side of software dev
    Tasks | Programming | Context/Task switch, Code review, Merge conflicts
    Artifacts | Source code | Source code + other code-related artifacts and dev tools
    Participants | 11-20 participants | How to increase the number of participants using “invasive” sensors?
    Publications | Upward trend; Many papers after the release of Emotiv and Neurosky | Only one paper at ICPC (why?)
  53. Measuring the Cognitive Load of Software Developers: A Systematic Mapping Study

    Search string: (“brain computer interfaces” OR sensors OR devices) AND (“psychophysiological indicators” OR “brain synchronization” OR “cognitive load” OR emotions OR biometrics) AND (“software engineering” OR “software development” OR “software testing” OR “software maintenance” OR “computer programming” OR diagram OR code)
    2,612 articles retrieved, 33 articles selected: findings, common trends, gaps, challenges.
    Lucian Gonçales, Kleinner Farias, Bruno da Silva, Jonathan Fessler
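The closing slide pairs the 2,612 articles returned by the search with the 33 articles that remained. A small sketch of that ratio, using only those two numbers; the function name `selection_rate` is hypothetical, and the intermediate filtering steps are not listed in the transcript, so none are modeled here:

```python
RETRIEVED = 2612  # articles returned by the initial search
SELECTED = 33     # articles kept as primary studies


def selection_rate(selected, retrieved):
    """Percentage of retrieved articles that were kept."""
    if retrieved <= 0:
        raise ValueError("retrieved must be positive")
    return 100 * selected / retrieved


rate = selection_rate(SELECTED, RETRIEVED)
print(f"{SELECTED} of {RETRIEVED} articles kept ({rate:.2f}%)")
```

Roughly one retrieved article in eighty ended up in the final set, which is why the broad search string had to be paired with manual screening.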