CLASSIFICATION OF SPEECH EMOTION CONSIDERING LONG CYCLE FLUCTUATION BY DYNAMIC MODE DECOMPOSITION

AGENDA
▸ 1) Utilization of Virtual Human Agent by Classification of Emotions (Why?)
  ▸ What is “Humanness” in the first place?
▸ 2) Classification of emotions of speech by convolution of Graph (How to?)
  ▸ Possibility of Graph as Unified feature quantity
▸ 3) Classification of emotions of speech by Dynamic Mode Decomposition (How to?)
  ▸ Classification of speech emotions by Dynamic Mode Decomposition on Deep Learning

WHAT IS BROUGHT TO RACHEL IN SENTIMENT ANALYSIS?

▸ Add emotional identification to the Virtual Human Agent
  ▸ FOR EXAMPLE, THE VIRTUAL HUMAN AGENT RETURNS “IT WAS GOOD! (EMPATHY)” FOR “I GOT A TICKET TODAY (HAPPY!)”

WHAT IS THE “HUMANNESS”?

▸ “Outer receptacle” and “Inner receptacle”
  ▸ The “inner receptacle” is a physical internal reaction involving the internal organs, such as motivation associated with tension.
  ▸ As an organ, it refers to an autonomous system that “cannot be controlled by oneself” (heart rate, blood pressure, etc.).
  ▸ It is a remarkable system as an organ. It is “necessary to be conscious” of this oneself; the focal point of view is an example.
  ▸ The “outer receptacle” is closely related to psychological reactions represented by cerebral reactions and the Circumplex Model.
  [Figure labels: INNER RECEPTACLE; OUTER RECEPTACLE; EXPERIENCE + PREFERENCE; HUMANNESS, PERSONALITY; “Beautiful!”]

▸ “Active” system and “Passive” system
  ▸ From the viewpoint of outer and inner reception, there are active systems that “react autonomously” to cerebral function and passive systems that wait for stimulation.
▸ Tendency of cerebral site reaction
  ▸ Outer receptions tend to be located inside the cerebrum; the typical organ is the “insular cortex”.
  ▸ In vision, the organs that capture moving bodies (inner receptions) tend to be on the outer side of the cerebrum, while outer receptions, which perform “higher-order cognition accompanied by sensibility in more detail”, tend to be inside the cerebrum.
  [Figure labels: Inner receptions: “SOMETHING LIKE GREEN”; Outer receptions: “ELEPHANT!”]

▸ Difference in reaction rate of cerebral region
  ▸ The “inner reception” reaction is twice as fast as the “outer reception” reaction. From this comparison of speed, it is thought that the recognition accompanied by sensibility in outer receptions is performing higher-order behavior.
▸ Hypothesis of definition of sensitivity
  ▸ For these intrinsic and extrinsic receptors, we assume that a “response with a large difference is a sensory response”. For example, when the inner reception shows no reaction but a large response is observed in the outer reception, the reaction is one the person is “aware” of.
  [Figure labels: INNER RECEPTION (Fast): “HOT!”, NOTHING TO THINK ABOUT; OUTER RECEPTION (Slow): “BEAUTIFUL!”, EXPERIENCE, PREFERENCE; NOT HUMANNESS vs. HUMANNESS according to the difference between inner and outer reception (“THIS IS!”)]

▸ “Sensibility” from the viewpoint of modeling
▸ High-level relationship of sensitivity
  ▸ From the low level upward, we take the phases of “reaction”, “emotion”, and “sensitivity”, and treat sensitivity as a higher-order brain function.
  ▸ This shows that the range of individual differences in reaction, which depend on “personality”, appears as the level becomes higher. At the same time, it shows that the higher order depends on the lower order, indicating the range of responsiveness of “sensibility” due to circumstances such as exercise and low temperature, and to environmental factors.
  [Figure: Reaction → Emotion → Sensibility → Humanness, Personality]

▸ “Sensibility” from the viewpoint of modeling
▸ Contrast with the Circumplex Model
  ▸ The circumplex model expresses emotion with two large axes: the pleasure axis and the active axis.
  ▸ In this case, we add a subjective axis (psychological axis) and expand the range of explanation from emotion to the higher-order sense of sensitivity. With this axis, we explain the change from the inner-reception reaction “beat” to the outer-reception reaction “excitement”.
  [Figure labels: EMOTION; SENSIBILITY; Subjective axis (psychological axis); BEAT; EXCITEMENT; Individual difference]

▸ Classification of emotional analysis of “individual's utterance”
▸ Good datasets including individual differences, and the accuracy of their identification
  ▸ If there is a discrimination layer for individual differences and the model answers both “1) individual differences” and “2) classifications” correctly, it is assumed that a human-like reaction has been made, and its accuracy is evaluated in NLP.
  [Figure labels: A AND B'S UTTERANCE; EMOTION CLASSIFIER (Active/Pleasure axis): ACCURACY (F1-VALUE); INDIVIDUAL CLASSIFIER (Subjective axis / psychological axis): ACCURACY (F1-VALUE); TOTAL ACCURACY (F1-VALUE): GOOD OR BAD]
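
The evaluation described on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `macro_f1` and `joint_accuracy`, the toy labels, and in particular the rule that an utterance counts as correct only when both the emotion and the individual are predicted correctly are assumptions. The slide does not say how the two F1 values are combined into the total.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Macro-averaged F1: the unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def joint_accuracy(emo_true, emo_pred, who_true, who_pred):
    """Share of utterances with BOTH labels correct (assumed combination rule)."""
    hits = sum(
        et == ep and wt == wp
        for et, ep, wt, wp in zip(emo_true, emo_pred, who_true, who_pred)
    )
    return hits / len(emo_true)


# Hypothetical toy labels for four utterances by individuals A and B.
emo_true = ["happy", "sad", "happy", "angry"]
emo_pred = ["happy", "sad", "sad", "angry"]
who_true = ["A", "B", "A", "B"]
who_pred = ["A", "B", "A", "A"]

emotion_f1 = macro_f1(emo_true, emo_pred)
individual_f1 = macro_f1(who_true, who_pred)
total = joint_accuracy(emo_true, emo_pred, who_true, who_pred)
```

Requiring both labels to be correct is the strictest reading of “answers both correctly”; averaging the two F1 values would be a softer alternative.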

WHICH DATA SET DO YOU USE?

▸ RAVDESS Dataset (Ref. https://smartlaboratory.org/ravdess/)
  Item            | Count
  Songs / Normal  | 2
  Classes (*)     | 8
  Persons (*)     | 24
  Utterance       | 2
  Strength (*)    | 2
  Repeat          | 2
  (*) Classes: neutral, calm, happy, sad, angry, fearful, disgust, surprised
  (*) Persons: 12 men and 12 women
  (*) Strength: Normal and Strong
▸ Spectrogram and changes in time series for each category
  ▸ SPEECH CONSISTING OF COMPLICATED WAVEFORMS
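
The factors listed on this slide correspond to fields encoded in RAVDESS file names. The sketch below assumes the seven-field naming convention documented on the dataset page (modality, vocal channel, emotion, intensity, statement, repetition, actor; odd-numbered actors male, even-numbered female). The helper name `parse_ravdess_name` and the example file name are hypothetical and not taken from the slides.

```python
from pathlib import Path

# Emotion codes as documented for RAVDESS (assumed, not stated on the slide).
EMOTIONS = {
    1: "neutral", 2: "calm", 3: "happy", 4: "sad",
    5: "angry", 6: "fearful", 7: "disgust", 8: "surprised",
}


def parse_ravdess_name(path):
    """Split a RAVDESS-style file name into its labelled fields."""
    parts = Path(path).stem.split("-")
    if len(parts) != 7:
        raise ValueError(f"expected 7 dash-separated fields, got {len(parts)}")
    _modality, channel, emotion, intensity, statement, repetition, actor = map(int, parts)
    return {
        "vocal_channel": "speech" if channel == 1 else "song",
        "emotion": EMOTIONS[emotion],
        "intensity": "normal" if intensity == 1 else "strong",
        "statement": statement,
        "repetition": repetition,
        "actor": actor,
        "gender": "male" if actor % 2 == 1 else "female",
    }


# Hypothetical example file name.
info = parse_ravdess_name("03-01-03-02-01-02-12.wav")
```

Parsing labels from the file name keeps the emotion label and the speaker identity together, which is what the individual-difference evaluation needs.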

WHICH FEATURE DO YOU USE?

▸ Mel-Frequency Cepstrum Coefficients
  ▸ LOW FREQUENCY EXPANSION WITH VOICE FEATURES
▸ Tonality analysis (HPCP / Harmonic Pitch Class Profile)
  ▸ PITCH (MUSICAL SCALE, A / B / C, ETC.) AS BAND, AND TIME SERIES BASED FEATURE QUANTITY
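
Both features rest on a frequency mapping: the mel scale stretches low frequencies, and a pitch class profile folds every frequency onto the 12 notes of the scale. The sketch below is a simplified illustration under stated assumptions. It uses the common HTK-style mel formula and equal temperament with A = 440 Hz, and it omits the harmonic weighting and spectral whitening of a full HPCP. The function names and the example peaks are hypothetical.

```python
import math

PITCH_CLASSES = ["A", "A#", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#"]


def hz_to_mel(f_hz):
    """HTK-style mel scale: near-linear below 1 kHz, logarithmic above."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def hz_to_pitch_class(f_hz, ref_hz=440.0):
    """Nearest equal-tempered pitch class, with the octave folded away."""
    semitones = round(12.0 * math.log2(f_hz / ref_hz))
    return PITCH_CLASSES[semitones % 12]


def pitch_class_profile(peaks):
    """Sum peak magnitudes per pitch class and normalise the largest bin to 1."""
    profile = dict.fromkeys(PITCH_CLASSES, 0.0)
    for f_hz, magnitude in peaks:
        profile[hz_to_pitch_class(f_hz)] += magnitude
    top = max(profile.values())
    return {name: value / top for name, value in profile.items()} if top else profile


# Hypothetical spectral peaks as (frequency in Hz, magnitude).
profile = pitch_class_profile([(440.0, 1.0), (880.0, 0.5), (261.63, 0.75)])
```

Computing such a profile per frame gives a 12-band time series, which is the “pitch as band” representation the slide refers to.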

CLASSIFICATION OF EMOTIONS OF SPEECH BY CONVOLUTION OF GRAPH (HOW TO?)

▸ Graph Convolution (Abstract)
  ▸ Convolution using graph structure
  ▸ A graph can express the luminance of pixels (image), the value of each element of a spectrogram (sound), and the connections between words (NLP), so graph convolution is expected to yield “unified feature values”.
  ▸ HOWEVER, THE CONNECTION OF THE GRAPH IS UNSPECIFIED AND CANNOT BE REPRESENTED AS IT IS

▸ Graph Convolution (Details)
  ▸ ESSENTIAL ACQUISITION BY ORTHOGONAL BASIS: linear fitting on basis vectors e_1, e_2, …, e_n (U_1, U_2, …, U_n; Δx_1, Δx_2, …, Δx_n)
  ▸ USE OF ORTHOGONAL BASIS FUNCTIONS: functions extended to infinite dimensions, f_1(.), f_2(.), …, f_n(.) and g_1(.), g_2(.), …, g_n(.), with <F, G> = 0
  ▸ WAVEFORM BY ORTHOGONAL BASIS FUNCTION: Fourier series expansion with terms a_k cos and b_k sin (a_1, a_2, …, a_k; b_1, b_2, …, b_k), each coefficient obtained by integration
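
For reference, the Fourier series expansion and the orthogonality condition named on this slide can be written out as follows. This is the standard textbook form, assuming a 2π-periodic, integrable function f. The constant term a_0 and the integration interval are assumptions, since the slide shows only the a_k and b_k terms.

```latex
% Fourier series of a 2\pi-periodic function f (assumed setting)
\[
f(x) = \frac{a_0}{2} + \sum_{k=1}^{\infty} \left( a_k \cos kx + b_k \sin kx \right),
\qquad
a_k = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x) \cos kx \, dx,
\quad
b_k = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x) \sin kx \, dx
\]

% Orthogonality of distinct basis functions, written <F, G> = 0 on the slide
\[
\langle F, G \rangle = \int_{-\pi}^{\pi} F(x)\, G(x) \, dx = 0
\quad \text{for distinct } F, G \in \{ \cos kx, \ \sin kx \}_{k \ge 1}
\]
```

Because the basis functions are orthogonal, each coefficient can be computed independently by a single integral, which is the property that carries over when the basis is replaced by Laplacian eigenfunctions.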

▸ Graph Convolution (Details)
  ▸ REPLACE BASIS FUNCTIONS WITH LAPLACIAN EIGENFUNCTIONS
  ▸ Graph Laplacian & Spectral Clustering: GRAPH DECOMPOSITION IS POSSIBLE BY LAPLACE EIGENFUNCTION (eigenvalue decomposition into Laplace eigenfunctions e^1, e^2, e^3, …, e^n)
  ▸ Graph Fourier Transform: GRAPH FILTERING WITH GRAPH FOURIER TRANSFORM (coefficients ξ_1, ξ_2, …, ξ_k; filters H_0, H_1, …, H_n)
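
The steps on this slide (graph Laplacian, eigenvalue decomposition, graph Fourier transform, filtering) can be sketched with NumPy. This is a minimal illustration assuming an undirected graph and the unnormalised Laplacian L = D − A. The 4-node path graph, the low-pass response 1 / (1 + λ), and all function names are hypothetical choices, not the filters H_n learned in the deck.

```python
import numpy as np


def graph_laplacian(adjacency):
    """Unnormalised graph Laplacian L = D - A of an undirected graph."""
    degree = np.diag(adjacency.sum(axis=1))
    return degree - adjacency


def graph_fourier_basis(adjacency):
    """Eigenvalues (graph frequencies) and eigenvectors (Laplacian eigenfunctions)."""
    return np.linalg.eigh(graph_laplacian(adjacency))


def spectral_filter(signal, adjacency, response):
    """Filter a node signal in the graph Fourier domain."""
    eigvals, basis = graph_fourier_basis(adjacency)
    coeffs = basis.T @ signal                    # graph Fourier transform
    return basis @ (response(eigvals) * coeffs)  # scale, then inverse transform


# Hypothetical 4-node path graph 0-1-2-3 with a rapidly alternating signal.
A = np.array(
    [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]],
    dtype=float,
)
x = np.array([1.0, -1.0, 1.0, -1.0])

smooth = spectral_filter(x, A, lambda lam: 1.0 / (1.0 + lam))   # low-pass
unchanged = spectral_filter(x, A, lambda lam: np.ones_like(lam))  # all-pass
```

An all-pass response returns the input signal, while the low-pass response damps the components that vary quickly across edges.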

POSSIBILITY OF GRAPH AS UNIFIED FEATURE QUANTITY

▸ Example of accuracy as classification of images by MNIST
▸ TruePositive (Training) / TruePositive (Validation)
  Epochs | Train Acc | Train Loss | Validation Loss | Validation Acc
  1      | 0.907     | 0.315      | 0.136           | 0.963
  25     | 0.993     | 0.025      | 0.063           | 0.986
  50     | 0.995     | 0.017      | 0.077           | 0.986
  75     | 0.995     | 0.016      | 0.078           | 0.986
  100    | 0.996     | 0.014      | 0.089           | 0.986
  [Chart: Train Loss, Train Acc, Validation Loss, Validation Acc (Value, 0 to 1) over Epochs (1 to 100)]
▸ Filter state of graph convolution (H_n, n=36)

▸ Accuracy as classification of speech based on mel-cepstrum coefficient
▸ TruePositive (Training) / TruePositive (Validation)
  Epochs | Train Acc | Train Loss | Validation Loss | Validation Acc
  1      | 0.188     | 2.056      | 1.721           | 0.347
  25     | 1.000     | 0.059      | 0.921           | 0.715
  50     | 1.000     | 0.010      | 1.104           | 0.715
  75     | 1.000     | 0.005      | 1.225           | 0.681
  100    | 1.000     | 0.003      | 1.274           | 0.701
  [Chart: Train Loss, Train Acc, Validation Loss, Validation Acc (Value, 0 to 1) over Epochs (1 to 100)]
▸ Filter state of graph convolution (H_n, n=30)

▸ Accuracy as classification of speech based on HPCP coefficient
▸ TruePositive (Training) / TruePositive (Validation)
  Epochs | Train Acc | Train Loss | Validation Loss | Validation Acc
  1      | 0.404     | 0.400      | 0.374           | 0.300
  25     | 0.116     | 0.000      | 0.170           | 0.200
  50     | 0.116     | 0.000      | 0.080           | 0.200
  75     | 0.116     | 0.000      | 0.034           | 0.200
  100    | 0.116     | 0.000      | 0.013           | 0.200
  [Chart: Train Loss, Train Acc, Validation Loss, Validation Acc (Value, 0 to 1) over Epochs (1 to 100)]
▸ Filter state of graph convolution (H_n, n=30)

▸ Consideration
  ▸ Since the convolution filter H represents the smoothness of the frequency, it corresponds to the resolution. Therefore, higher resolution, and thus higher accuracy, can be obtained by obtaining more edge directions (H_0, H_1, …, H_n: HIGHER ACCURACY).
  ▸ At the same time, there is a limit to the decomposition of time series and band.

CLASSIFICATION OF EMOTIONS OF SPEECH BY DYNAMIC MODE DECOMPOSITION (HOW TO?)

▸ Dynamic mode decomposition, DMD (Abstract)
  ▸ Mode decomposition focusing on time series variation
  ▸ The characteristics of sound appear especially in time series fluctuations. Therefore, convolving the phase and the band at the same time in the graph is insufficient, and we “pay attention to the time series”.
  ▸ PREDICTION AND ACQUISITION OF LONG CYCLE VARIATION

▸ Dynamic mode decomposition (Details)
  ▸ Time series variation: DEFINITION OF TIME EVOLUTION
  ▸ Low rank approximation: SINGULAR VALUE DECOMPOSITION
  ▸ Eigen mode calculation: EIGENVALUE DECOMPOSITION
  [Figure labels: DIFFERENCE; ON FEATURE ENGINEERING; MAPPING IN LINEAR SPACE]
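
The three steps on this slide (time evolution, low-rank approximation by SVD, eigenmode calculation) match the standard SVD-based “exact” DMD formulation, sketched below with NumPy. This is an illustration under that assumption and may differ from what the authors implemented. The function name `dmd`, the synthetic two-frequency signal, and the choice of rank are hypothetical.

```python
import numpy as np


def dmd(snapshots, rank, dt=1.0):
    """Exact DMD of a (features x time) snapshot matrix."""
    # 1) Time evolution: pair each snapshot with its successor, Y ~= A X.
    X, Y = snapshots[:, :-1], snapshots[:, 1:]
    # 2) Low-rank approximation: truncated SVD of X.
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    U, s, Vh = U[:, :rank], s[:rank], Vh[:rank, :]
    YV = Y @ Vh.conj().T / s            # Y V S^-1 (scales column j by 1 / s_j)
    A_tilde = U.conj().T @ YV           # reduced operator U* Y V S^-1
    # 3) Eigenmode calculation.
    eigvals, W = np.linalg.eig(A_tilde)
    modes = YV @ W                      # exact DMD modes
    omega = np.log(eigvals.astype(complex)) / dt  # continuous-time exponents
    return eigvals, modes, omega


# Hypothetical signal: a slow (0.2 Hz) and a fast (2.0 Hz) oscillation.
dt = 0.1
t = np.arange(100) * dt
slow_hz, fast_hz = 0.2, 2.0
signal = np.vstack([
    np.cos(2 * np.pi * slow_hz * t), np.sin(2 * np.pi * slow_hz * t),
    np.cos(2 * np.pi * fast_hz * t), np.sin(2 * np.pi * fast_hz * t),
])

eigvals, modes, omega = dmd(signal, rank=4, dt=dt)
freqs_hz = np.abs(omega.imag) / (2 * np.pi)   # oscillation frequency per mode
long_cycle_hz = freqs_hz.min()                # slowest oscillation = long cycle
```

Each eigenvalue gives one mode's frequency and growth rate, so selecting the modes with the smallest frequency is one way to isolate long cycle variation.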

▸ Dynamic mode decomposition (Details)
  ▸ Long cycle variation: TIME EVOLUTION FROM DMD MODE
  ▸ Tucker decomposition: TENSOR EXPANSION OF SVD AND EIGH
  ▸ Emotional classification: FROM THE OBTAINED MODE TO CLASSIFICATION
  [Figure labels: ON FEATURE ENGINEERING; ON DEEP LEARNING; MAPPING AND DECOMPOSING ALL DATA ON ONE AXIS]
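
The slide names the Tucker decomposition as the tensor counterpart of SVD. One common way to compute it is the higher-order SVD (HOSVD), sketched below with NumPy. The slides do not say which algorithm was used, so this choice is an assumption. The tensor layout (for example utterance × band × time), the random data, and all function names are hypothetical.

```python
import numpy as np


def unfold(tensor, mode):
    """Mode-n unfolding: the chosen axis becomes rows, all others are flattened."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def mode_product(tensor, matrix, mode):
    """Multiply `matrix` (r x I_mode) into `tensor` along axis `mode`."""
    moved = np.moveaxis(tensor, mode, 0)
    result = np.tensordot(matrix, moved, axes=(1, 0))
    return np.moveaxis(result, 0, mode)


def hosvd(tensor, ranks):
    """Truncated HOSVD: one factor matrix per mode, plus a core tensor."""
    factors = []
    for mode, rank in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(U[:, :rank])
    core = tensor
    for mode, U in enumerate(factors):
        core = mode_product(core, U.T, mode)  # project onto each factor basis
    return core, factors


def reconstruct(core, factors):
    """Expand the core tensor back through the factor matrices."""
    out = core
    for mode, U in enumerate(factors):
        out = mode_product(out, U, mode)
    return out


# Hypothetical 3-way tensor with random contents.
rng = np.random.default_rng(0)
tensor = rng.standard_normal((3, 4, 5))

core_full, factors_full = hosvd(tensor, ranks=(3, 4, 5))  # lossless
core_small, factors_small = hosvd(tensor, ranks=(2, 3, 4))  # compressed
```

With full ranks the reconstruction is exact; with smaller ranks the core tensor is a compressed representation that could be passed to a classifier.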

ACCURACY OBTAINED FROM FEATURE QUANTITIES BY DYNAMIC MODE DECOMPOSITION

▸ Accuracy as classification of speech based on mel-cepstrum coefficient
▸ TruePositive (Training) / TruePositive (Validation)
  Epochs | Train Acc | Train Loss | Validation Loss | Validation Acc
  1      | 0.077     | 2.128      | 2.107           | 0.100
  25     | 0.876     | 1.834      | 2.121           | 0.300
  50     | 0.969     | 1.806      | 2.123           | 0.300
  75     | 0.932     | 1.810      | 2.123           | 0.400
  100    | 0.939     | 1.812      | 2.123           | 0.400
  [Chart: Train Loss, Train Acc, Validation Loss, Validation Acc (Value, 0 to 1) over Epochs (1 to 100)]
▸ Result of DMD convolution (Mode ≦ 13)
  [Figure panels: Mode=0, Mode=1, Mode=2]

▸ Accuracy as classification of speech based on HPCP coefficient
▸ TruePositive (Training) / TruePositive (Validation)
  Epochs | Train Acc | Train Loss | Validation Loss | Validation Acc
  1      | 0.875     | 0.379      | 0.373           | 0.875
  25     | 0.875     | 0.372      | 0.367           | 0.875
  50     | 0.875     | 0.371      | 0.366           | 0.875
  75     | 0.875     | 0.370      | 0.366           | 0.875
  100    | 0.875     | 0.370      | 0.366           | 0.875
  [Chart: Train Loss, Train Acc, Validation Loss, Validation Acc (Value, 0 to 1) over Epochs (1 to 100)]
▸ Result of DMD convolution (Mode ≦ 12)
  [Figure panels: Mode=0, Mode=1, Mode=5]

▸ Consideration
  ▸ Emotion in speech appears strongly in pitch and has large features in the time series. For that reason, validation accuracy improves with DMD and convolution focused on them.
  ▸ FEATURES IN PITCH AND TIME SERIES
  ▸ ACQUISITION OF CLASSIFICATION OF HAPPY

SUMMARY

▸ Graph Convolution
  ▸ High accuracy in images (over 98%)
  ▸ Accuracy about sound (70%)
  ▸ However, when targeting emotions, convolution is less effective due to phase and bandwidth (10%-30%)
▸ Dynamic mode decomposition Convolution
  ▸ Acquiring emotional features from spectrograms is less accurate (40%)
  ▸ However, from the relationship between pitch and emotion, high generalization performance can be obtained by paying attention to their time series (around 90%)

MANY THANKS FOR YOUR ATTENTION!
