Ishizuka and T. Hashimoto, "Large Language Model-Based Emotional Speech Annotation Using Context and Acoustic Feature for Speech Emotion Recognition," ICASSP 2024

• Evaluation on the IEMOCAP dataset showed that, when given the conversational context and a text representation of the acoustic features, an LLM can annotate emotions with nearly the same accuracy as human annotators (J. Santoso, ICASSP 2024)

[Figure: annotation pipeline] Input Speech → Acoustic feature extractor → acoustic feature set → Conversion of acoustic feature to text → Description of acoustic feature (text) + Text content (transcription) → Prompt → LLM → Emotion Class

acoustic feature set
• loudness
• pitch
• speaking rate
• etc.

(single utterance prompt example)
Answer with either one of [neutral, happy, sad, angry]. M speaks “Who did you marry?” with high pitch. How does M feel? M feels

(conversation prompt example)
Answer with either one of [neutral, happy, sad, angry].
Given the following conversation sequence:
M (high pitch): “So what’s up? What’s new?”
F (low pitch): “Well Vegas was awesome.”
M (normal pitch): “Yeah. I heard.”
F (high pitch): “And, um, I got married.”
M (high pitch): “Shut up. No-in Vegas?”
F (high pitch): “Yeah. In the old town part.”
M (high pitch): “Who did you marry?”
How does M feel? M feels

(description of acoustic feature example)
Speaking rate: (slow / normal / fast) speaking rate
Articulation rate: (slow / normal / fast) articulation rate
Pitch: (low / normal / high) pitch
Loudness: (quiet / normal loudness / loud)
Intensity: (low / normal / high) intensity
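The prompt construction illustrated on this slide can be sketched in Python roughly as follows. This is a minimal illustration, not the authors' implementation: the `Utterance` class, the function names, and especially the pitch cut-offs (120 Hz and 220 Hz) are hypothetical, since the slide does not say how numeric features are mapped to low / normal / high. Only pitch is covered; the other features (speaking rate, loudness, etc.) would presumably be handled the same way.

```python
from dataclasses import dataclass

LABELS = ["neutral", "happy", "sad", "angry"]
INSTRUCTION = f"Answer with either one of [{', '.join(LABELS)}]."


def bin_level(value: float, low: float, high: float) -> str:
    """Map a numeric feature value to 'low', 'normal', or 'high'.

    The cut-offs are assumptions made for illustration only.
    """
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "normal"


@dataclass
class Utterance:
    speaker: str     # e.g. "M" or "F"
    text: str        # transcription of the utterance
    pitch_hz: float  # mean pitch, assumed to come from an acoustic feature extractor


def pitch_description(utt: Utterance, low: float = 120.0, high: float = 220.0) -> str:
    """Turn the numeric pitch into a short text description (hypothetical thresholds)."""
    return f"{bin_level(utt.pitch_hz, low, high)} pitch"


def single_utterance_prompt(utt: Utterance) -> str:
    """Build a prompt from one utterance, following the single-utterance example."""
    return (
        f'{INSTRUCTION} {utt.speaker} speaks "{utt.text}" '
        f"with {pitch_description(utt)}. "
        f"How does {utt.speaker} feel? {utt.speaker} feels"
    )


def conversation_prompt(history: list[Utterance]) -> str:
    """Build a prompt from a dialog; the last utterance is the one to annotate."""
    target = history[-1]
    turns = [f'{u.speaker} ({pitch_description(u)}): "{u.text}"' for u in history]
    return "\n".join(
        [
            INSTRUCTION,
            "Given the following conversation sequence:",
            *turns,
            f"How does {target.speaker} feel? {target.speaker} feels",
        ]
    )


if __name__ == "__main__":
    dialog = [
        Utterance("F", "And, um, I got married.", 260.0),
        Utterance("M", "Who did you marry?", 240.0),
    ]
    print(single_utterance_prompt(dialog[-1]))
    print(conversation_prompt(dialog))
```

The returned string would be sent to an LLM, and the completion after "M feels" read back as the emotion class. Ending the prompt with the unfinished sentence nudges the model to answer with a single label from the list.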