Slide 1

MediaGnosis: The next-generation media processing artificial intelligence
Ryo Masumura, NTT Corporation, Japan

Slide 2

Self-introduction

Ryo Masumura, Ph.D.

Biography
• 2011.04: Joined Nippon Telegraph and Telephone Corporation (NTT)
• 2015.04-2016.09: Ph.D. student, Tohoku University
• Now: Distinguished research scientist at NTT Human Information Labs.

Research topics
• Speech processing (speech recognition, classification, etc.)
• Natural language processing (classification, generation, etc.)
• Computer vision (classification, detection, captioning, etc.)
• Cross-modal processing (joint modeling, etc.)

My goal: establishing general-purpose media processing AI

Slide 3

Overview of my presentation

I present “MediaGnosis,” the next-generation media processing AI being developed at NTT Corporation.

What is MediaGnosis?
• How does MediaGnosis differ from general AIs?
• What technology is the key enabler of MediaGnosis?

What is possible with MediaGnosis?
• App. 1: Multi-modal conversation sensing application
• App. 2: Multi-modal personality factor measurement application

Slide 4

What is MediaGnosis?

Slide 5

Overview of MediaGnosis

MediaGnosis is a multi-modal foundation model that can handle various functions and modalities within a single brain in an integrated manner, covering speech and audio processing, image and video processing, natural language processing, and cross-modal processing.

The name “MediaGnosis” originates from the idea of “treating all sorts of media (records of information) as gnosis (knowledge) in an integrated manner, like humans, and making a diagnosis (judgment) based on that knowledge.”

Slide 6

Problems in general media processing AIs

General AIs work with an independent brain for each function: speech recognition, face recognition, speaker recognition, emotion recognition, first impression recognition, gender and age estimation, LLMs, and so on.
• Each of speech recognition, face recognition, emotion recognition, etc. has its own independent brain.
• Knowledge acquired in one AI function is not utilized by the others.

Slide 7

Problems in general media processing AIs

It is difficult to combine multiple functions in a complex manner, because the technologies are individually built on different concepts: for example, speech recognition provided by corp. A as a cloud REST API, face recognition provided by corp. B as a cloud WebSocket API, emotion recognition provided by corp. C as a standalone Python module, and attribute estimation provided by corp. D as WebAssembly.

While the market is demanding new services that utilize multiple modalities and multiple AI functions, this difficulty is a major bottleneck in service development.

Slide 8

Strength of MediaGnosis

MediaGnosis stores cross-functional knowledge in a unified multi-modal foundation model and processes various types of information with that shared knowledge (speech recognition, face recognition, speaker recognition, emotion recognition, first impression recognition, gender and age estimation, LLM, and so on).

Even if the training data for each function is limited, the sharable knowledge allows for efficient learning and growth.

※ This is an example showing only a part of the functions.

Slide 9

Strength of MediaGnosis

MediaGnosis can provide various functions in an all-in-one manner and realize complex AI inference combining multiple modalities and multiple tasks (e.g., speech recognition, translation, LLM, gender and age estimation, emotion recognition, and first impression recognition from the same inputs).

This makes it easy to offer new services that utilize multiple modalities and multiple kinds of AI processing.

※ This is an example showing only a part of the functions.

Slide 10

Key technology in MediaGnosis

Multi-modal foundation modeling is based on our home-grown multi-modal and multi-task joint modeling techniques.

Joint modeling enables knowledge sharing by representing various AI functions in a unified model structure and performing co-learning. For example:
• The speech and audio understanding module can be shared between speech recognition, speaker recognition, speech-based emotion recognition, etc.
• The attribute classification module can be shared between speech-based, face-based, and audio-visual attribute estimation.
• The cross-modal understanding module can be shared among all AI functions, feeding heads such as text generation (“It is sunny today”), emotion classification (happy, sad, neutral), and attribute classification (male/female; elder/adult/child).

※ This is an example showing only a part of the modules.
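
To make the module-sharing idea concrete, here is a minimal sketch in PyTorch (not the actual MediaGnosis implementation; every module, head, and dimension below is a hypothetical stand-in) of how two different functions can be expressed in one unified structure:

```python
import torch
import torch.nn as nn

class JointModel(nn.Module):
    """Hypothetical unified structure: shared encoders + task heads."""
    def __init__(self, dim=256, n_emotions=3, n_genders=2):
        super().__init__()
        # Shared understanding modules, co-learned across all functions
        self.speech_enc = nn.GRU(80, dim, batch_first=True)   # speech/audio
        self.image_enc = nn.Linear(2048, dim)                  # image/video
        self.crossmodal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
            num_layers=2)
        # Lightweight task-specific heads
        self.emotion_head = nn.Linear(dim, n_emotions)   # happy/sad/neutral
        self.gender_head = nn.Linear(dim, n_genders)     # male/female

    def encode_speech(self, feats):            # feats: (batch, frames, 80)
        out, _ = self.speech_enc(feats)
        return out

    def forward_emotion(self, speech):
        # Speech-based emotion recognition reuses the shared encoders
        h = self.crossmodal(self.encode_speech(speech))
        return self.emotion_head(h.mean(dim=1))

    def forward_av_gender(self, speech, image):
        # Audio-visual attribute estimation shares the very same modules
        h = torch.cat([self.encode_speech(speech),
                       self.image_enc(image)], dim=1)
        return self.gender_head(self.crossmodal(h).mean(dim=1))
```

Because both forward paths back-propagate into the same speech and cross-modal encoders, knowledge learned for one function becomes available to the others, which is the co-learning effect described above.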

Slide 11

Training in MediaGnosis

MediaGnosis jointly utilizes various datasets to train the architecture: datasets for speech recognition, face emotion recognition, machine translation, and so on.

It uses both unpaired datasets (text-only, speech-only, and image-only datasets) and input-output paired datasets.

※ This is an example showing only a part of the modules.
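
A minimal sketch, reusing the hypothetical JointModel above, of how paired and unpaired datasets might be interleaved in one training run; the losses (cross-entropy for paired batches, masked-frame reconstruction for speech-only batches) are illustrative stand-ins for the actual objectives:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, batch):
    optimizer.zero_grad()
    if batch["task"] == "emotion":            # paired: speech -> emotion label
        loss = F.cross_entropy(model.forward_emotion(batch["speech"]),
                               batch["labels"])
    elif batch["task"] == "av_gender":        # paired: audio-visual -> gender
        loss = F.cross_entropy(
            model.forward_av_gender(batch["speech"], batch["image"]),
            batch["labels"])
    else:                                     # unpaired speech-only batch
        h = model.encode_speech(batch["speech"])
        masked = h.clone()
        masked[:, ::4] = 0.0                  # hide every 4th frame
        loss = F.mse_loss(model.crossmodal(masked), h.detach())
    loss.backward()
    optimizer.step()
    return loss.item()

def co_train(model, optimizer, loaders, epochs=1):
    # Round-robin: one batch per dataset per step, so every function
    # (and the shared modules beneath it) grows in the same run.
    for _ in range(epochs):
        for batches in zip(*loaders):
            for batch in batches:
                train_step(model, optimizer, batch)
```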

Slide 12

Inference in MediaGnosis

It is possible to implement single-modal and multi-modal functions by extracting only the modules needed for the target function, without using all modules:
• Case 1: Speech recognition (speech and audio understanding → cross-modal understanding → text generation: “It is sunny today”)
• Case 2: Face-based gender and age estimation (image and video understanding → cross-modal understanding → attribute classification: male/female, elder/adult/child)
• Case 3: Audio-visual emotion recognition (speech and audio understanding + image and video understanding → cross-modal understanding → emotion classification: happy, sad, neutral)

※ This is an example showing only a part of the modules.
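
Continuing the same hypothetical sketch, inference touches only the modules each case needs; the shapes below are invented for illustration:

```python
import torch

model = JointModel().eval()        # hypothetical model sketched earlier

with torch.no_grad():
    img = torch.randn(1, 4, 2048)      # e.g., 4 face-region features
    speech = torch.randn(1, 100, 80)   # e.g., 100 frames of filterbanks

    # Case 2: face-based gender estimation uses only the image path:
    # image/video understanding -> cross-modal understanding -> attribute head
    gender = model.gender_head(
        model.crossmodal(model.image_enc(img)).mean(dim=1))

    # Case 3: audio-visual emotion recognition combines both paths:
    # speech + image understanding -> cross-modal understanding -> emotion head
    h = torch.cat([model.encode_speech(speech), model.image_enc(img)], dim=1)
    emotion = model.emotion_head(model.crossmodal(h).mean(dim=1))
```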

Slide 13

Our major technical point (1)

Cross-modal Transformer-based joint modeling [Masumura+ INTERSPEECH2022] [Takashima+ INTERSPEECH2022]

We jointly model multiple single-modal tasks and multi-modal tasks using a shared cross-modal Transformer architecture:
• Speech, text, and speech-text cross-modal joint modeling for text generation [Masumura+ INTERSPEECH2022]
• Speech, image, and speech-image cross-modal joint modeling for classification [Takashima+ INTERSPEECH2022]
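
As a rough illustration of the flavor of such an architecture (a simplified guess, not the cited papers' exact design), a single attention block can serve both single-modal and cross-modal tasks:

```python
import torch
import torch.nn as nn

class CrossModalLayer(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, query_mod, context_mod=None):
        # Single-modal task: self-attention within one modality.
        # Multi-modal task: cross-attention into the other modality.
        ctx = query_mod if context_mod is None else context_mod
        h, _ = self.attn(query_mod, ctx, ctx)
        x = self.norm1(query_mod + h)
        return self.norm2(x + self.ffn(x))

speech = torch.randn(2, 100, 256)   # speech frames
image = torch.randn(2, 49, 256)     # image patches
layer = CrossModalLayer()
speech_only = layer(speech)               # shared weights serve the ASR-style task
speech_with_image = layer(speech, image)  # and the audio-visual task
```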

Slide 14

Our major technical point (2)

Grounding other modalities into a text-based LM for cross-modal modeling [Masumura+ ASRU2019] [Masumura+ INTERSPEECH2020] [Masumura+ EUSIPCO2023]

We leverage knowledge gained from large amounts of text to improve speech or image processing with limited training data:
• A text-based LM improves speech recognition [Masumura+ INTERSPEECH2020]
• A text-based LM improves image captioning [Masumura+ EUSIPCO2023]
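
A minimal sketch of the grounding idea under assumed names and sizes: pre-train a text decoder as a language model on abundant text-only data, then transfer its weights into the speech or image model before fine-tuning on limited paired data. The transfer-by-initialization step is one simple instance of the approach, not the exact published recipe:

```python
import torch.nn as nn

class TextDecoder(nn.Module):
    def __init__(self, vocab=8000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, tokens, state=None):   # tokens: (batch, length)
        h, state = self.rnn(self.embed(tokens), state)
        return self.out(h), state

lm = TextDecoder()
# ... pre-train `lm` as a next-token LM on large text-only corpora ...
asr_decoder = TextDecoder()
asr_decoder.load_state_dict(lm.state_dict())  # transfer textual knowledge
# ... then fine-tune asr_decoder jointly with the speech encoder
#     on the (limited) paired speech-text data ...
```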

Slide 15

Our major technical point (3)

Self-supervised representation learning for multi-domain joint modeling [Masumura+ SLT2021] [Ihori+ ICASSP2021] [Tanaka+ INTERSPEECH2022] [Tanaka+ ICASSP2023]

We utilize unpaired data collected from a variety of domains for self-supervised learning:
• Domain-adversarial self-supervised representation learning, considering multi-domain datasets [Tanaka+ INTERSPEECH2022]
• Cross-lingual self-supervised representation learning, considering cross-lingual datasets [Tanaka+ ICASSP2023]
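
A minimal sketch of the domain-adversarial ingredient: a gradient-reversal layer trains the shared encoder toward domain-invariant representations while a small classifier tries to identify the source domain. This is the generic technique; the cited papers' setups differ in detail:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity forward; reversed (scaled) gradient on the way back."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None

encoder = nn.GRU(80, 256, batch_first=True)   # shared representation model
domain_clf = nn.Linear(256, 4)                # e.g., 4 source domains

def adversarial_loss(speech, domain_label, lam=0.1):
    h, _ = encoder(speech)
    # Reversed gradients push the encoder to *hide* the domain identity
    rev = GradReverse.apply(h.mean(dim=1), lam)
    return nn.functional.cross_entropy(domain_clf(rev), domain_label)
```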

Slide 16

Our major technical point (4)

Special-token-based inference control in joint modeling [Ihori+ INTERSPEECH2021] [Tanaka+ INTERSPEECH2021] [Ihori+ COLING2022] [Orihashi+ ICIP2022]

We control the output generation style at inference time while jointly training multiple tasks:
• Switching-token-based inference control for text-style conversion [Ihori+ INTERSPEECH2021]
• Style-token-based inference control for speech recognition [Tanaka+ INTERSPEECH2021]
• Auxiliary-token-based inference control for scene-text recognition [Orihashi+ ICIP2022]
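
A minimal sketch of special-token control, reusing the hypothetical TextDecoder interface from point (2): one jointly trained decoder serves several output styles, and the style token placed in the decoding prefix selects the behavior. The token inventory is invented:

```python
import torch

BOS = 0
STYLE = {"<verbatim>": 1, "<readable>": 2}   # hypothetical special tokens

def decode(decoder, style, max_len=50):
    # The style token in the prefix steers everything generated after it.
    tokens = torch.tensor([[BOS, STYLE[style]]])
    logits, state = decoder(tokens, None)
    for _ in range(max_len):
        nxt = logits[:, -1].argmax(dim=-1, keepdim=True)  # greedy step
        tokens = torch.cat([tokens, nxt], dim=1)
        logits, state = decoder(nxt, state)
    return tokens

# One jointly trained decoder, two behaviors:
#   decode(asr_decoder, "<verbatim>")  -> outputs with fillers kept
#   decode(asr_decoder, "<readable>")  -> cleaned, written-style outputs
```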

Slide 17

Our major technical point (5)

Joint inference of multiple kinds of information with auto-regressive joint modeling [Masumura+ INTERSPEECH2022] [Ihori+ INTERSPEECH2023] [Makishima+ INTERSPEECH2023]

We jointly generate multiple functions’ outputs in one inference pass and model the relationships between those functions:
• Joint inference of multiple talkers’ gender, age, and transcriptions [Masumura+ INTERSPEECH2022]
• Joint inference of dual-style outputs, generating spoken-style and written-style text together [Ihori+ INTERSPEECH2023]
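
A toy sketch of the serialization behind auto-regressive joint inference: the targets of several functions are flattened into one token stream, so a single decoding pass emits all of them and later fields can condition on earlier ones. The tag format is purely illustrative:

```python
def serialize(turns):
    # e.g., turns = [("female", "adult", "hello"), ("male", "child", "hi dad")]
    parts = []
    for gender, age, text in turns:
        parts += [f"<gender:{gender}>", f"<age:{age}>", text, "<sep>"]
    return " ".join(parts)

print(serialize([("female", "adult", "hello"), ("male", "child", "hi dad")]))
# <gender:female> <age:adult> hello <sep> <gender:male> <age:child> hi dad <sep>
```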

Slide 18

Our major technical point (6)

Multi-modal end-to-end modeling for challenging domains [Yamazaki+ AAAI2022] [Hojo+ INTERSPEECH2023]

We enhance multi-modal scene-context awareness and conversation-context awareness:
• Audio-visual scene-aware dialog modeling with a cross-modal Transformer, handling conversation histories and the conversation video scene [Yamazaki+ AAAI2022]
• Audio-visual conversation understanding modeling with a cross-modal Transformer, handling the conversation context between a seller and a buyer [Hojo+ INTERSPEECH2023]

Slide 19

Our major technical point (7)

Joint modeling of recognition and generation modules [Masumura+ APSIPA2019] [Masumura+ INTERSPEECH2019] [Makishima+ INTERSPEECH2022]

We consider speech or image generation in order to properly recognize speech or image information:
• Considering a reconstruction criterion in inference [Masumura+ INTERSPEECH2019]
• Considering a reconstruction criterion in training [Makishima+ INTERSPEECH2022]
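
A minimal sketch of a reconstruction criterion in training, with assumed modules and weighting: a generation module rebuilds the input features from the recognizer's hidden states, and its error regularizes the recognizer so the encoder retains enough information to "explain" the input:

```python
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.GRU(80, 256, batch_first=True)   # recognition side
reconstructor = nn.Linear(256, 80)            # generation side

def joint_loss(speech, recog_loss_fn):
    # speech: (batch, frames, 80); recog_loss_fn: task loss on hidden states
    h, _ = encoder(speech)
    recog_loss = recog_loss_fn(h)                    # e.g., CTC/attention loss
    recon_loss = F.mse_loss(reconstructor(h), speech)  # rebuild the input
    return recog_loss + 0.1 * recon_loss             # illustrative weighting
```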

Slide 20

Our major technical point (8)

Training techniques to jointly use unpaired and paired data [Masumura+ ICASSP2020] [Takashima+ APSIPA2020] [Orihashi+ INTERSPEECH2020] [Suzuki+ ICIP2023]

Unpaired data can be utilized for semi-supervised learning in the training phase and for unsupervised adaptation at run time:
• Semi-supervised learning for sequence-to-sequence modeling with unlabeled data [Masumura+ ICASSP2020]
• Online domain adaptation for Transformer-based object detection with unlabeled data [Suzuki+ ICIP2023]
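
A minimal sketch of consistency-style semi-supervised training on unpaired data: two perturbed views of the same unlabeled input should yield matching predictions. The augmentation (small Gaussian noise) and the KL objective are illustrative stand-ins for the cited methods:

```python
import torch
import torch.nn.functional as F

def consistency_loss(model_forward, speech):
    # Two stochastic views of one unlabeled input
    noisy1 = speech + 0.01 * torch.randn_like(speech)
    noisy2 = speech + 0.01 * torch.randn_like(speech)
    p1 = F.log_softmax(model_forward(noisy1), dim=-1)
    p2 = F.softmax(model_forward(noisy2), dim=-1)
    # Penalize disagreement between the two predictions
    return F.kl_div(p1, p2.detach(), reduction="batchmean")
```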

Slide 21

What is possible with MediaGnosis?

Slide 22

Our vision in industry fields

We aim to provide solutions across multiple modalities and domains using MediaGnosis (our multi-modal foundation model): “AI anywhere” powered by MediaGnosis, spanning smart communication, smart office, smart city, and other fields, with use cases such as communication practice, automated customer service, AI-based RPA, AI collaboration, monitoring, and city robots.

Slide 23

Applications powered by MediaGnosis

We have developed various single-modal and multi-modal applications. Here we pick two multi-modal applications for the smart communication field.

Slide 24

App. 1: Remote communication support

This application facilitates remote meetings by sensing multi-modal, multi-party information.

We often encounter stumbling blocks in remote meetings:
• I can’t sense the attitude of the other participants…
• Only a limited number of people speak…
• The discussion is deadlocked…

It is important to ensure that such meetings proceed smoothly.

Slide 25

Demo

Slide 26

Analysis sheet

The application senses various aspects of a meeting and visualizes its changing state through audio-visual multi-party analyses: the speaker’s audio-visual emotion, the listeners’ visual emotion, and transcriptions (which can be translated).

This helps to improve the overall quality of the meeting.

Slide 27

App. 2: Personality measurement

This application measures your personality and helps you discover your potential charm by sensing multi-modal information.
• Select a situation that you like and perform a role-play for that situation for about 60 seconds (e.g., apologizing to your lover).
• The system analyzes your responses, expresses them numerically, and classifies them into categories to produce your analysis result.

Slide 28

Demo

Slide 29

Analysis sheet

The sheet explains your popularity factors, which represent a person’s most charming points, and also gives advice about how to further enhance your charm.

It represents how your popularity factors differ from the averages (e.g., more energetic than average) and compares the differences with others in various aspects.

Slide 30

Summary

Slide 31

Summary

MediaGnosis is a multi-modal foundation model that can handle various functions and modalities (speech and audio processing, image and video processing, natural language processing, and cross-modal processing) within a single brain in an integrated manner.

It can provide various functions in an all-in-one manner and realize complex AI inference combining multiple modalities and multiple tasks.

We aim to provide solutions across multiple modalities and domains.

Slide 32

References

[Masumura+ INTERSPEECH2022] Ryo Masumura, Yoshihiro Yamazaki, Saki Mizuno, Naoki Makishima, Mana Ihori, Mihiro Uchida, Hiroshi Sato, Tomohiro Tanaka, Akihiko Takashima, Satoshi Suzuki, Shota Orihashi, Takafumi Moriya, Nobukatsu Hojo and Atsushi Ando, "End-to-End Joint Modeling of Conversation History-Dependent and Independent ASR Systems with Multi-History Training", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.3218-3222, 2022.

[Takashima+ INTERSPEECH2022] Akihiko Takashima, Ryo Masumura, Atsushi Ando, Yoshihiro Yamazaki, Mihiro Uchida and Shota Orihashi, "Interactive Co-Learning with Cross-Modal Transformer for Audio-Visual Emotion Recognition", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.4740-4744, 2022.

[Masumura+ ASRU2019] Ryo Masumura, Mana Ihori, Tomohiro Tanaka, Atsushi Ando, Ryo Ishii, Takanobu Oba, Ryuichiro Higashinaka, "Improving Speech-Based End-of-Turn Detection via Cross-Modal Representation Learning with Punctuated Text Data", In Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp.1062-1069, 2019.

[Masumura+ INTERSPEECH2020] Ryo Masumura, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi, "Phoneme-to-Grapheme Conversion Based Large-Scale Pre-Training for End-to-End Automatic Speech Recognition", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.2822-2826, 2020.

[Masumura+ EUSIPCO2023] Ryo Masumura, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi, "Text-to-Text Pre-Training with Paraphrasing for Improving Transformer-based Image Captioning", In Proc. European Signal Processing Conference (EUSIPCO), pp.516-520, 2023.

[Masumura+ SLT2021] Ryo Masumura, Naoki Makishima, Mana Ihori, Tomohiro Tanaka, Akihiko Takashima and Shota Orihashi, "Large-Context Conversational Representation Learning: Self-Supervised Learning for Conversational Documents", In Proc. IEEE Spoken Language Technology Workshop (SLT), pp.1012-1019, 2021.

Slide 33

References

[Ihori+ ICASSP2021] Mana Ihori, Naoki Makishima, Tomohiro Tanaka, Akihiko Takashima, Shota Orihashi, Ryo Masumura, "MAPGN: MAsked Pointer-Generator Network for Sequence-to-Sequence Pre-training", In Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp.7563-7567, 2021.

[Tanaka+ INTERSPEECH2022] Tomohiro Tanaka, Ryo Masumura, Hiroshi Sato, Mana Ihori, Kohei Matsuura, Takanori Ashihara and Takafumi Moriya, "Domain Adversarial Self-Supervised Speech Representation Learning for Improving Unknown Domain Downstream Tasks", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.1066-1070, 2022.

[Tanaka+ ICASSP2023] Tomohiro Tanaka, Ryo Masumura, Mana Ihori, Hiroshi Sato, Taiga Yamane, Takanori Ashihara, Kohei Matsuura, Takafumi Moriya, "Leveraging Language Embeddings for Cross-Lingual Self-Supervised Speech Representation Learning", In Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023.

[Ihori+ INTERSPEECH2021] Mana Ihori, Naoki Makishima, Tomohiro Tanaka, Akihiko Takashima, Shota Orihashi and Ryo Masumura, "Zero-Shot Joint Modeling of Multiple Spoken-Text-Style Conversion Tasks using Switching Tokens", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.776-780, 2021.

[Tanaka+ INTERSPEECH2021] Tomohiro Tanaka, Ryo Masumura, Mana Ihori, Akihiko Takashima, Shota Orihashi and Naoki Makishima, "End-to-End Rich Transcription-Style Automatic Speech Recognition with Semi-Supervised Learning", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.4458-4462, 2021.

[Ihori+ COLING2022] Mana Ihori, Hiroshi Sato, Tomohiro Tanaka, Ryo Masumura, "Multi-Perspective Document Revision", In Proc. International Conference on Computational Linguistics (COLING), pp.6128-6138, 2022.

[Orihashi+ ICIP2022] Shota Orihashi, Yoshihiro Yamazaki, Mihiro Uchida, Akihiko Takashima, Ryo Masumura, "Fully Sharable Scene Text Recognition Modeling for Horizontal and Vertical Writing", In Proc. International Conference on Image Processing (ICIP), pp.2636-2640, 2022.

Slide 34

References

[Ihori+ INTERSPEECH2023] Mana Ihori, Hiroshi Sato, Tomohiro Tanaka, Ryo Masumura, Saki Mizuno, Nobukatsu Hojo, "Transcribing Speech as Spoken and Written Dual Text Using an Autoregressive Model", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.461-465, 2023.

[Makishima+ INTERSPEECH2023] Naoki Makishima, Keita Suzuki, Satoshi Suzuki, Atsushi Ando, Ryo Masumura, "Joint Autoregressive Modeling of End-to-End Multi-Talker Overlapped Speech Recognition and Utterance-level Timestamp Prediction", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.2913-2917, 2023.

[Yamazaki+ AAAI2022] Yoshihiro Yamazaki, Shota Orihashi, Ryo Masumura, Mihiro Uchida, Akihiko Takashima, "Audio Visual Scene-Aware Dialog Generation with Transformer-based Video Representations", In Proc. DSTC Workshop at AAAI Conference on Artificial Intelligence (AAAI), No.35, 2022.

[Hojo+ INTERSPEECH2023] Nobukatsu Hojo, Saki Mizuno, Satoshi Kobashikawa, Ryo Masumura, Mana Ihori, Hiroshi Sato, Tomohiro Tanaka, "Audio-Visual Praise Estimation for Conversational Video based on Synchronization-Guided Multimodal Transformer", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.2663-2667, 2023.

[Masumura+ APSIPA2019] Ryo Masumura, Yusuke Ijima, Satoshi Kobashikawa, Takanobu Oba, Yushi Aono, "Can We Simulate Generative Process of Acoustic Modeling Data? Towards Data Restoration for Acoustic Modeling", In Proc. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp.655-661, 2019.

Slide 35

References

[Masumura+ INTERSPEECH2019] Ryo Masumura, Hiroshi Sato, Tomohiro Tanaka, Takafumi Moriya, Yusuke Ijima, Takanobu Oba, "End-to-End Automatic Speech Recognition with a Reconstruction Criterion Using Speech-to-Text and Text-to-Speech Encoder-Decoders", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.1606-1610, 2019.

[Makishima+ INTERSPEECH2022] Naoki Makishima, Satoshi Suzuki, Atsushi Ando and Ryo Masumura, "Speaker consistency loss and step-wise optimization for semi-supervised joint training of TTS and ASR using unpaired text data", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.526-530, 2022.

[Masumura+ ICASSP2020] Ryo Masumura, Mana Ihori, Akihiko Takashima, Takafumi Moriya, Atsushi Ando, Yusuke Shinohara, "Sequence-level Consistency Training for Semi-Supervised End-to-End Automatic Speech Recognition", In Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp.7049-7053, 2020.

[Takashima+ APSIPA2020] Akihiko Takashima, Naoki Makishima, Mana Ihori, Tomohiro Tanaka, Shota Orihashi, Ryo Masumura, "Unsupervised Domain Adversarial Training in Angular Space for Facial Expression Recognition", In Proc. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp.1054-1059, 2020.

[Orihashi+ INTERSPEECH2020] Shota Orihashi, Mana Ihori, Tomohiro Tanaka, Ryo Masumura, "Unsupervised Domain Adaptation for Dialogue Sequence Labeling Based on Hierarchical Adversarial Training", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.1575-1579, 2020.

[Suzuki+ ICIP2023] Satoshi Suzuki, Taiga Yamane, Naoki Makishima, Keita Suzuki, Atsushi Ando, Ryo Masumura, "ONDA-DETR: Online Domain Adaptation for Detection Transformers with Self-Training Framework", In Proc. International Conference on Image Processing (ICIP), pp.1780-1784, 2023.