
OpenTalks.AI - Ольга Перепелкина, Affective computing

OpenTalks.AI

March 01, 2018

Transcript

  1. Plan 1. Affective computing and Social signal processing 2. Emotions: why do we use a multimodal approach? 3. Steps for automatic emotion recognition 4. Getting data: what kind of data can we use and how do we annotate it? 5. Feature extraction approaches 6. Multichannel data fusion 7. Next steps and trends in affective computing
  2. Plan 1. Affective computing and Social signal processing 2. Emotions: why do we use a multimodal approach? 3. Steps for automatic emotion recognition 4. Getting data: what kind of data can we use and how do we annotate it? 5. Feature extraction approaches 6. Multichannel data fusion 7. Next steps and trends in affective computing
  3. Affective Computing: Picard, 1995. Social Signal Processing: Vinciarelli et al., 2009. Affective Computing lab at MIT (MIT Media Lab, Cambridge): http://affect.media.mit.edu/ Intelligent Behaviour Understanding Group (Imperial College London): https://ibug.doc.ic.ac.uk/
  4. Affective Computing • Affective computing (artificial emotional intelligence, or emotion AI) is the study and development of systems and devices that can recognize, interpret, process, and simulate human affects.
  5. Social Signal Processing (diagram): nonverbal behavior cues (interpersonal distance, gesture, forward posture, height, mutual gaze, vocal behavior) combine into a social signal. Vinciarelli et al. Social signal processing: Survey of an emerging domain, 2009.
  6. Plan 1. Affective computing and Social signal processing 2. Emotions: why do we use a multimodal approach? 3. Steps for automatic emotion recognition 4. Getting data: what kind of data can we use and how do we annotate it? 5. Feature extraction approaches 6. Multichannel data fusion 7. Next steps and trends in affective computing
  7. Emotions: how do people recognize them? • People can recognize emotions from separate channels: voice, body language, touch, faces (the best accuracy). • Other modalities such as smell and taste are also taken into account during emotion recognition. • Visual and auditory modalities affect each other (e.g., facial movements around the mouth region impact vocal acoustics). • fMRI and ERP studies: compared with unimodal presentations (e.g., face only), multimodal presentations (e.g., face and voice) yield faster and more accurate emotion judgments. Schirmer, Annett, and Ralph Adolphs. "Emotion perception from face, voice, and touch: comparisons and convergence." Trends in Cognitive Sciences (2017).
  8. Multimodal affective computing • Most research focuses on faces, less on voices, and even less on text, body, and physiology. • Accuracy on multimodal data is higher than on unimodal data: by 9.83% on average, for 85% of systems. • We do not know which channels, and how many of them, we need for the best classification. • The contribution of individual channels can differ: for example, models based only on audio recognized fear better, while models based on visual cues recognized disgust better. D'Mello et al., 2015; Al Osman et al., 2017.
  9. Multimodal Affective Computing (pipeline diagram): speech data (vocal affect), visual data (body gestures, facial expressions, eye movements), and physiology data (EDA / GSR, blood pressure) feed into feature extraction and then emotion classification.
  10. Multimodal Affective Computing (pipeline diagram): the same pipeline, with computer vision marked as the technology behind the visual channels (body gestures, facial expressions, eye movements).
  11. Plan 1. Affective computing and Social signal processing 2. Emotions: why do we use a multimodal approach? 3. Steps for automatic emotion recognition 4. Getting data: what kind of data can we use and how do we annotate it? 5. Feature extraction approaches 6. Multichannel data fusion 7. Next steps and trends in affective computing
  12. Steps for automatic emotion recognition: Choose Data → Define Categories → Annotate Data → Preprocess Data → Extract Features → Model → Fusion
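The same steps can be written as a minimal Python sketch. All of the function names below (load_recordings, annotate, preprocess, extract_features, train_model, fuse) are hypothetical placeholders for whatever tooling a concrete system uses; only the ordering of the stages comes from the slide.

```python
# Hypothetical end-to-end skeleton of the emotion-recognition pipeline above.
EMOTION_CATEGORIES = ["anger", "disgust", "fear", "happiness",
                      "sadness", "surprise", "neutral"]  # assumed label set

def build_emotion_recognizer(raw_recordings):
    data = load_recordings(raw_recordings)       # 1. choose data (audio, video, physiology)
    categories = EMOTION_CATEGORIES              # 2. define categories
    labels = annotate(data, categories)          # 3. annotate data (human raters)
    clean = preprocess(data)                     # 4. preprocess (alignment, denoising, resampling)
    features = {ch: extract_features(clean, channel=ch)   # 5. per-channel features
                for ch in ("audio", "video", "physio")}
    models = {ch: train_model(features[ch], labels)       # 6. one model per channel
              for ch in features}
    return fuse(models)                          # 7. multichannel fusion (see later slides)
```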
  13. Plan 1. Affective computing and Social signal processing 2. Emotions: why do we use a multimodal approach? 3. Steps for automatic emotion recognition 4. Getting data: what kind of data can we use and how do we annotate it? 5. Feature extraction approaches 6. Multichannel data fusion 7. Next steps and trends in affective computing
  14. Plan 1. Affective computing and Social signal processing 2. Emotions: why do we use a multimodal approach? 3. Steps for automatic emotion recognition 4. Getting data: what kind of data can we use and how do we annotate it? 5. Feature extraction approaches 6. Multichannel data fusion 7. Next steps and trends in affective computing
  15. Face & Eyes recognition Human face Face Face Detection (Faster

    RCNN) Face Identification (ResNet 50) Feature Extraction Eyes Eyes and Nose Detection (OpenCV + CNN) Open/closed Eyes
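A minimal sketch of the face-and-eyes detection step using OpenCV's bundled Haar cascades, assuming opencv-python is installed; the slide's Faster R-CNN face detector and ResNet-50 identification model would replace these classical detectors in the full pipeline.

```python
import cv2

# Pre-trained Haar cascades shipped with opencv-python.
face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")

def detect_face_and_eyes(image_path):
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    results = []
    for (x, y, w, h) in faces:
        roi = gray[y:y + h, x:x + w]          # search for eyes inside the face box only
        eyes = eye_cascade.detectMultiScale(roi)
        results.append({"face": (x, y, w, h),
                        "eyes": [tuple(e) for e in eyes]})
    return results
```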
  16. Body pose estimation • Human 2D pose estimation: the problem of localizing anatomical keypoints (body parts). • Challenges: • Each image may contain an unknown number of people • Interaction between people: contact, occlusion, limb articulation, etc. • Runtime complexity tends to grow with the number of people in the image • Input: a color image; output: the 2D locations of anatomical keypoints for each person in the image.
  17. Body pose estimation approaches. Top-down: • Detect the person => find body parts • Single-person pose estimation • Runtime is proportional to the number of people • If the person detector fails, there is no way to recover. Bottom-up: • Find keypoints & connections => construct the person • Multi-person pose estimation • Robust to early commitment, and can decouple runtime complexity from the number of people in the image. Cao Z. et al. Realtime multi-person 2D pose estimation using part affinity fields. CVPR, 2017.
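A control-flow sketch contrasting the two strategies; detect_people, estimate_single_pose, detect_keypoints, and group_into_people are hypothetical placeholders for the actual models (e.g., a person detector plus a single-person pose network vs. an OpenPose-style keypoint and part-affinity-field network).

```python
def top_down_poses(image):
    # Runtime grows with the number of detected people, and a person
    # missed by the detector can never be recovered downstream.
    return [estimate_single_pose(image, box) for box in detect_people(image)]

def bottom_up_poses(image):
    # All keypoints and their pairwise connections are predicted in one pass,
    # then grouped into individual people, so runtime is largely independent
    # of how many people appear in the image.
    keypoints, connections = detect_keypoints(image)
    return group_into_people(keypoints, connections)
```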
  18. Plan 1. Affective computing and Social signal processing 2. Emotions: why do we use a multimodal approach? 3. Steps for automatic emotion recognition 4. Getting data: what kind of data can we use and how do we annotate it? 5. Feature extraction approaches 6. Multichannel data fusion 7. Next steps and trends in affective computing
  19. Feature-based fusion • Integrates features immediately after they are extracted. • A single model is trained. • Problems: • Features from different channels have different time scales • A large set of features from different channels => higher computational load • Vulnerable to missing data
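A minimal feature-level (early) fusion sketch with scikit-learn: per-channel feature vectors are concatenated and a single classifier is trained on the joint vector. The arrays audio_feats, video_feats, physio_feats, and labels are assumed to be pre-extracted and already time-aligned, which is exactly the alignment problem noted above.

```python
import numpy as np
from sklearn.svm import SVC

def early_fusion_train(audio_feats, video_feats, physio_feats, labels):
    # Concatenate per-channel features into one (large) joint feature vector.
    fused = np.hstack([audio_feats, video_feats, physio_feats])
    clf = SVC(probability=True)   # a single model over all channels
    return clf.fit(fused, labels)
```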
  20. Decision-based fusion • Performs integration after each of the modalities has made a decision. • Averaging, voting schemes, weighted sum, etc. • Problem: • Decision-based fusion ignores the low-level interaction between the modalities.
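A minimal decision-level (late) fusion sketch as a weighted sum of per-modality class probabilities. Here models is assumed to map a channel name to an already fitted scikit-learn-style classifier, and weights to per-channel reliabilities; a missing channel can simply be skipped, which is why late fusion copes better with missing data.

```python
import numpy as np

def late_fusion_predict(models, features, weights):
    # features: dict mapping channel -> (n_samples, n_features) array
    # weights:  dict mapping channel -> scalar reliability (assumed to sum to 1)
    combined = sum(weights[ch] * models[ch].predict_proba(features[ch])
                   for ch in models)
    return np.argmax(combined, axis=1)   # winning class per sample
```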
  21. Hybrid fusion • Combines outputs from early fusion and individual unimodal predictors. • E.g., a two-step combination: 1) feature-level fusion of audio + video, 2) decision-level fusion of the first classifier + one more classifier for the physiological data.
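A sketch of that two-step hybrid scheme: step 1 is feature-level fusion of audio + video inside one classifier, step 2 is decision-level fusion of its output with a separate physiology classifier. av_model and physio_model are assumed to be fitted classifiers exposing predict_proba, and the weights are purely illustrative.

```python
import numpy as np

def hybrid_fusion_predict(av_model, physio_model,
                          audio_feats, video_feats, physio_feats,
                          w_av=0.7, w_physio=0.3):
    # Step 1: feature-level fusion of audio + video in one classifier.
    av_probs = av_model.predict_proba(np.hstack([audio_feats, video_feats]))
    # Step 2: decision-level fusion with the physiology classifier.
    ph_probs = physio_model.predict_proba(physio_feats)
    combined = w_av * av_probs + w_physio * ph_probs
    return np.argmax(combined, axis=1)
```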
  22. Multimodal fusion • Lingenfelser et al. conducted a systematic study of feature-level, decision-level, and hybrid fusion approaches and did not find evidence that any of the techniques has an advantage over the others. • Lingenfelser, Florian, Johannes Wagner, and Elisabeth André. "A systematic discussion of fusion techniques for multi-modal affect recognition tasks." Proceedings of the 13th International Conference on Multimodal Interfaces. ACM, 2011. • Nevertheless, the decision-level approach seems more reasonable, because it deals better with missing data. • Al Osman, Hussein, and Tiago H. Falk. "Multimodal Affect Recognition: Current Approaches and Challenges." Emotion and Attention Recognition Based on Biological Signals and Images. InTech, 2017.
  23. Plan 1. Affective computing and Social signal processing 2. Emotions: why do we use a multimodal approach? 3. Steps for automatic emotion recognition 4. Getting data: what kind of data can we use and how do we annotate it? 5. Feature extraction approaches 6. Multichannel data fusion 7. Next steps and trends in affective computing
  24. Affective computing: trends • Natural emotions • Mixed emotions • Social signals • Multimodal data (face, eyes, …, voice, body) • Wearable gadgets & smartphones