Measuring the Cognitive Load of Software Developers: A Systematic Mapping Study

Lucian Gonçales, Kleinner Farias (Unisinos, Brazil)
Bruno da Silva, Jonathan Fessler (Cal Poly, USA)
[email protected]
27th IEEE/ACM International Conference on Program Comprehension
Sat 25 - Sun 26 May 2019, Montreal, QC, Canada

Software developers are strongly involved in activities that affect and demand attention

We’ve come a long way since we started to measure software artifacts.
And recently, researchers in SE have started to measure the human body.
https://www.quora.com/How-can-you-measure-the-electrical-activity-in-the-brain

https://www.ncbi.nlm.nih.gov/pubmed/29574776
https://www.emotiv.com/

Eye tracking
https://honestversion.com/global-eye-tracking-devices-market-survey-regional-supply-and-value-chain-analysis-2024/
https://www.thejambar.com/eye-catching-technology-eye-tracking-studies-help-future-programming-students/
https://eyegaze.com/

fMRI (Functional Magnetic Resonance Imaging)
https://today.uconn.edu/2014/01/fmri-machine-will-expand-research-capabilities/
https://www.neurologyadvisor.com/topics/epilepsy/guidelines-for-using-fmri-for-presurgical-evaluation-of-epilepsy/

Our Objectives
Provide a classification and a thematic analysis of studies on the measurement of developers’ cognitive load.
Pinpoint gaps and possible research directions for future work.

Systematic Mapping Study (SMS)

1. What are the types of sensors for measuring the cognitive load of developers?
2. What metrics have been used to measure developers’ cognitive load?
3. What algorithms have been used to classify developers’ cognitive load?
4. For what purpose?
5. Which tasks have been used to measure developers’ cognitive load?
6. What were the artifacts used on cognitive tasks?
7. How many participants did the studies recruit to measure developers’ cognitive load?
8. Which research methods have been used to investigate cognitive load in software development tasks?
9. Where have the studies been published?

Search String
(“brain computer interfaces” OR sensors OR devices)
AND (“psychophysiological indicators” OR “brain synchronization” OR “cognitive load” OR emotions OR biometrics)
AND (“software engineering” OR “software development” OR “software testing” OR “software maintenance” OR “computer programming” OR diagram OR code)
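The search string is a conjunction of three OR-groups, so it can be assembled from term lists, which makes it easier to adapt to each database. The following is a minimal sketch, not the authors' tooling: the group names (`DEVICE_TERMS`, `PHENOMENON_TERMS`, `DOMAIN_TERMS`) and the helpers `or_group` and `build_query` are hypothetical, and straight quotes replace the typographic quotes of the slide.

```python
# Term groups copied from the slide; the group names are illustrative labels only.
DEVICE_TERMS = ['"brain computer interfaces"', "sensors", "devices"]
PHENOMENON_TERMS = [
    '"psychophysiological indicators"',
    '"brain synchronization"',
    '"cognitive load"',
    "emotions",
    "biometrics",
]
DOMAIN_TERMS = [
    '"software engineering"',
    '"software development"',
    '"software testing"',
    '"software maintenance"',
    '"computer programming"',
    "diagram",
    "code",
]


def or_group(terms):
    """Join alternative terms into one parenthesised OR-group."""
    return "(" + " OR ".join(terms) + ")"


def build_query(*groups):
    """Require one match from every group by AND-ing the OR-groups."""
    return " AND ".join(or_group(group) for group in groups)


query = build_query(DEVICE_TERMS, PHENOMENON_TERMS, DOMAIN_TERMS)
print(query)
```

Each database has its own query syntax, so a string built this way would likely still need per-database adjustment.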

Paper databases and initial search
ACM Digital Library, CiteSeerX, Google Scholar, IEEE Xplore, Inspec, Microsoft Academic, PubMed, Scopus, Science Direct, Springer Link, Wiley Online Library
Initial search with the search string: 2,612 articles
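To put the size of the initial search in perspective, the sketch below computes the share of retrieved articles that ended up as primary studies. The figure of 33 selected articles comes from the deck's closing slide; the variable names are illustrative.

```python
RETRIEVED = 2612  # articles returned by the initial search
SELECTED = 33     # primary studies kept, as reported on the closing slide

selection_rate = SELECTED / RETRIEVED
print(f"{selection_rate:.2%} of the retrieved articles were selected")
```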

Process overview

RESULTS

Results: RQ1 - Sensors
Number of studies per sensor type (bar chart): EEG 18; Combined sensors 12; Eye-trackers 2; fMRI 1.
Software engineering researchers have preferred EEG sensors to collect data related to cognitive load.
Many studies have combined sensors. A trend to improve accuracy.
S. C. Müller and T. Fritz, "Stuck and Frustrated or in Flow and Happy: Sensing Developers' Emotions and Progress," 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, Florence, 2015, pp. 688-699.
Measures shown with this study: pupil size, fixation, blinks; EDA, skin temp, heart rate, BVP; EEG waves, FBs, attention, meditation.
Gap: application of high resolution devices, such as EEGs with 128 and 256 channels.
https://www.cognionics.net/mobile-128
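The chart counts can be turned into shares of the 33 primary studies. This is a small sketch under one assumption: the pairing of counts to sensor types (EEG 18, combined sensors 12, eye-trackers 2, fMRI 1) is read from the flattened chart and matches the slide's remark that EEG is the preferred sensor. `SENSOR_COUNTS` and `shares` are illustrative names.

```python
# Counts per sensor type as read from the bar chart (pairing is an assumption).
SENSOR_COUNTS = {
    "EEG": 18,
    "Combined sensors": 12,
    "Eye-trackers": 2,
    "fMRI": 1,
}


def shares(counts):
    """Return each category's percentage of the total, rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


for name, pct in shares(SENSOR_COUNTS).items():
    print(f"{name}: {pct}%")
```

Under that reading, EEG alone accounts for a little over half of the studies.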

Results: RQ2 - Metrics
Number of studies per metric (bar chart): Combination of metrics 13; Frequency bands 4; Power spectrum 3; ERD 2; ERP 2; FD 2; VOI 2; Eye fixation 1; ERSP 1; IAF 1; ICA 1; SSVEP 1.
40% of primary studies applied multiple metrics to measure developers’ cognitive load.
42% of studies (14/33) focused on measures that are related to EEGs.
Gap: there’s no clear distinction between ‘cognitive load’ and ‘mental effort’ found in software engineering papers.
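Frequency bands and power spectrum are among the EEG metrics listed for RQ2. As a rough illustration of what such a metric computes, the sketch below sums spectral power inside a band for a synthetic signal. Everything here is an assumption for illustration: the signal is artificial, the band limits (alpha 8-13 Hz, beta 13-30 Hz) are conventional values not taken from the deck, and a plain DFT stands in for whatever signal processing the primary studies used.

```python
import cmath
import math


def power_spectrum(samples, sample_rate):
    """Return (frequency_hz, power) pairs for the non-negative DFT frequencies."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples)
        )
        spectrum.append((k * sample_rate / n, abs(coeff) ** 2 / n))
    return spectrum


def band_power(spectrum, low_hz, high_hz):
    """Sum the power of all bins with low_hz <= frequency < high_hz."""
    return sum(power for freq, power in spectrum if low_hz <= freq < high_hz)


SAMPLE_RATE = 128  # samples per second; one second of synthetic data below

# Strong 10 Hz component (alpha range) plus a weak 20 Hz component (beta range).
signal = [
    math.sin(2 * math.pi * 10 * t / SAMPLE_RATE)
    + 0.2 * math.sin(2 * math.pi * 20 * t / SAMPLE_RATE)
    for t in range(SAMPLE_RATE)
]

spectrum = power_spectrum(signal, SAMPLE_RATE)
alpha = band_power(spectrum, 8, 13)
beta = band_power(spectrum, 13, 30)
print(f"alpha power: {alpha:.2f}, beta power: {beta:.2f}")
```

Real recordings would need artifact removal and windowing before band powers are comparable across participants.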

Results: RQ3 - Algorithms/ML
Number of studies per algorithm (bar chart): Basic stats 13; SVM 5; Naive Bayes 5; Multi Algos 3; Decision Tree 1; K-means 1; Logistic Regression 1; Neural Network 1; Random Forest 1; RF Learners 1; RVM 1; Linear Regression 1.
Many studies did not use any specific ML algorithm (40%).
More occurrences of classification techniques compared to regression.
Gap: All were applied in highly controlled settings. Would those models hold high accuracy in real scenarios with less controlled settings? How? To what extent?
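Naive Bayes is one of the classifiers named on this slide. To make "classifying cognitive load" concrete, here is a generic Gaussian Naive Bayes sketch in plain Python. It is not the pipeline of any surveyed study: the two features (an alpha band power and a pupil diameter), the feature values, and the labels "low" and "high" are invented for illustration and carry no empirical meaning.

```python
import math
from collections import defaultdict


def fit(rows, labels):
    """Estimate a prior and per-feature (mean, variance) for every class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            # Small constant keeps the variance strictly positive.
            var = sum((v - mean) ** 2 for v in column) / len(column) + 1e-9
            stats.append((mean, var))
        model[label] = (len(members) / len(rows), stats)
    return model


def predict(model, row):
    """Return the class with the highest log-posterior for one feature vector."""

    def log_posterior(label):
        prior, stats = model[label]
        score = math.log(prior)
        for value, (mean, var) in zip(row, stats):
            score += -0.5 * math.log(2 * math.pi * var)
            score -= (value - mean) ** 2 / (2 * var)
        return score

    return max(model, key=log_posterior)


# Invented training data: (alpha band power, pupil diameter in mm).
TRAIN_X = [(30.0, 3.1), (28.5, 3.0), (31.2, 3.3), (12.0, 4.6), (10.5, 4.9), (13.1, 4.4)]
TRAIN_Y = ["low", "low", "low", "high", "high", "high"]

model = fit(TRAIN_X, TRAIN_Y)
print(predict(model, (29.0, 3.2)))
print(predict(model, (11.0, 4.7)))
```

The gap raised on this slide applies directly to a model like this one: accuracy obtained on clean, controlled data says little about noisy in-situ recordings.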

Results: RQ4 - Purpose
Number of studies per purpose (bar chart): Code comprehension 8; Emotion recognition 6; Task difficulty 6; Cognitive demand 3; Productivity 2; Stress level 2; Authentication 1; Code quality 1; Interruptibility 1; Pair-dynamic level 1; Performance 1; Satisfaction 1.
Researchers looking for a more reliable way of measuring program comprehension compared to task completion time or correctness.
Researchers looking for emotion awareness in programming/software engineering.
Researchers looking for patterns of task difficulty by measuring the human body.
Gap: What about looking at the human side as an end, not just as a means? (e.g. better understanding of developers’ burnout)

Results: RQ5 - Tasks & RQ6 - Artifacts
Number of studies per task (bar chart): Programming 16; Observing 6; Doing math 3; Multitasks 3; Listening 2; Reading 2; Making choices 1.
Number of studies per artifact (bar chart): Code 16; Images 4; Math equations 3; Text 3; Localization 1; Product 1; Reports 1; Sounds 1; Sounds and Images 1; Video 1.
Not surprisingly…
Interesting to see other important skills typically exercised by engineers.
Gap: What about… Testing? Design process/artifacts? Code review? Resolving merge conflicts?

Results: RQ7 - # Participants & RQ8 - Research methods
Number of studies per participant range (bar chart): 11-20 participants 16; 21-30 participants 6; 0-10 participants 4; 31-40 participants 4; 41-50 participants 3.
Number of studies per research method (bar chart): Controlled Experiment 24; Proposal only 8; Opinion paper 1.
Most common range: 11-20 participants.
Controlled settings are the most common approach.
Gap: How to improve the # of participants when the sensors are “invasive”?
Gap: Why not measure dev cognitive load in real scenarios (in situ)?

Results: RQ9 - Publications

Results: Taxonomy

In conclusion…
For each facet: common trends found, then gaps/challenges and opportunity for future work.

Sensors
Common trends: EEGs (low resolution); Combination of sensors
Gaps/opportunities: High resolution EEGs; More combination of sensors; fMRI (?)

Metrics
Common trends: EEG-related metrics; Combination of multiple metrics
Gaps/opportunities: Distinguish mental effort from cognitive load measurement

Research method
Common trends: Controlled experiments
Gaps/opportunities: Build new models on real world industry settings

Algorithms
Common trends: No numerical/computational analysis; Supervised ML (classification)
Gaps/opportunities: Test existing models on real world industry settings

Purpose
Common trends: Code comprehension; Task difficulty; Emotion recognition
Gaps/opportunities: Better understanding and improving the human side of software dev

Tasks
Common trends: Programming
Gaps/opportunities: Context/Task switch, Code review, Merge conflicts

Artifacts
Common trends: Source code
Gaps/opportunities: Source code + other code-related artifacts and dev tools

Participants
Common trends: 11-20 participants
Gaps/opportunities: How to increase the number of participants using “invasive” sensors?

Publications
Common trends: Upward trend; Many papers after the release of Emotiv and Neurosky
Gaps/opportunities: Only one paper at ICPC (why?)

From 2,612 articles to 33 articles: findings, common trends, gaps, challenges
Lucian Gonçales, Kleinner Farias, Bruno da Silva, Jonathan Fessler
[email protected]
