
MediaGnosis IEEE ICIP2023 Industry Seminar

Ryo Masumura
October 11, 2023

Transcript

  1. 1 Copyright NTT CORPORATION Self-introduction  Ryo Masumura, Ph.D. 

    Biography • 2011.04: Joined Nippon Telegraph and Telephone Corporation (NTT) • 2015.04-2016.09: Ph.D. student, Tohoku University • Now: Distinguished research scientist at NTT Human Information Labs.  Research Topics • Speech processing (speech recognition, classification, etc.) • Natural language processing (classification, generation, etc.) • Computer vision (classification, detection, captioning, etc.) • Cross-modal processing (joint modeling, etc.) My goal: establishing general-purpose media-processing AI
  2. 2 Copyright NTT CORPORATION Overview of my presentation  Present

    the next-generation media-processing AI “MediaGnosis,” being developed at NTT Corporation  What is MediaGnosis? • How does MediaGnosis differ from general AIs? • What technology is the key enabler of MediaGnosis?  What is possible with MediaGnosis? • App. 1: multi-modal conversation sensing application • App. 2: multi-modal personality factor measurement application
  3. 4 Copyright NTT CORPORATION Overview of MediaGnosis  MediaGnosis is

    a multi-modal foundation model that can handle various functions and modalities within a single brain in an integrated manner  “MediaGnosis” originates from the idea of “treating all sorts of media (records of information) as gnosis (knowledge) in an integrated manner like humans and making a diagnosis (judgment) based on that knowledge.” The multi-modal foundation model spans speech and audio processing, image and video processing, natural language processing, and cross-modal processing
  4. 5 Copyright NTT CORPORATION Problems in general media processing AIs

    General AIs work with an independent brain for each function: separate models for speech recognition, LLMs, first impression recognition, face recognition, speaker recognition, emotion recognition, gender and age estimation, and so on  Knowledge acquired in each general AI function is not mutually utilized
  5. 6 Copyright NTT CORPORATION Problems in general media processing AIs

     Difficult to combine multiple functions in a complex manner: the individual technologies are built on different concepts and delivered in different forms (e.g., speech recognition from corp. A, face recognition from corp. B, emotion recognition from corp. C, and attribute estimation from corp. D, delivered variously as cloud REST APIs, cloud WebSocket APIs, standalone Python modules, or WebAssembly)  While the market demands new services that utilize multiple modalities and multiple AI functions, this difficulty is a major bottleneck in building a service with multiple AI functions
  6. 7 Copyright NTT CORPORATION Strength of MediaGnosis

     Store cross-functional knowledge in a unified multi-modal foundation model and process various types of information (speech recognition, LLM, first impression recognition, face recognition, speaker recognition, emotion recognition, gender and age estimation, etc.) with the shared knowledge  Even if the training data for each function is limited, sharable knowledge allows for efficient learning and growth ※ This is an example showing only a part of the functions
  7. 8 Copyright NTT CORPORATION Strength of MediaGnosis  Can provide various functions in

    an all-in-one manner and realize complex AI inference combining multi-modal, multi-task processing (e.g., speech recognition, gender and age estimation, translation, LLM, emotion recognition, first impression recognition)  Easily offer new services that utilize multiple modalities and multiple AI functions ※ This is an example showing only a part of the functions
  8. 9 Copyright NTT CORPORATION Key Technology in MediaGnosis  Multi-modal

    foundation modeling based on our in-house multi-modal and multi-task joint modeling techniques  Joint modeling enables knowledge sharing by representing various AI functions in a unified model structure and performing co-learning. The model consists of modules such as Speech and Audio Understanding, Image and Video Understanding, Natural Language Understanding, Cross-modal Understanding, Text Generation (e.g., “It is sunny today”), Emotion Classification (happy, sad, neutral), and Attribute Classification (male, female; elder, adult, child). For example, the Speech and Audio Understanding module can be shared between speech recognition, speaker recognition, speech-based emotion recognition, etc.; the Attribute Classification module can be shared between speech-based, face-based, and audio-visual attribute estimation; and the Cross-modal Understanding module can be shared between all AI functions ※ This is an example showing only a part of the modules
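To make the joint-modeling idea concrete, here is a minimal PyTorch sketch, not NTT's actual MediaGnosis code: modality-specific encoders feed one shared cross-modal encoder, and lightweight task heads branch off the shared representation. All module choices, feature sizes, and head names are assumptions for exposition.

```python
import torch
import torch.nn as nn

class JointMultiModalModel(nn.Module):
    """Modality encoders -> shared cross-modal encoder -> task heads."""
    def __init__(self, d_model=256, vocab_size=1000):
        super().__init__()
        # Modality-specific "understanding" modules
        self.speech_encoder = nn.GRU(input_size=80, hidden_size=d_model,
                                     batch_first=True)         # log-mel frames
        self.image_encoder = nn.Linear(2048, d_model)           # pooled CNN features
        self.text_encoder = nn.Embedding(vocab_size, d_model)   # token ids
        # Shared cross-modal understanding module (shared by all functions)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.crossmodal_encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Task-specific heads
        self.emotion_head = nn.Linear(d_model, 3)       # happy / sad / neutral
        self.attribute_head = nn.Linear(d_model, 2)     # male / female
        self.text_generation_head = nn.Linear(d_model, vocab_size)

    def encode(self, speech=None, image=None, text=None):
        parts = []
        if speech is not None:
            parts.append(self.speech_encoder(speech)[0])
        if image is not None:
            parts.append(self.image_encoder(image).unsqueeze(1))
        if text is not None:
            parts.append(self.text_encoder(text))
        x = torch.cat(parts, dim=1)      # concatenate along the sequence axis
        return self.crossmodal_encoder(x)

    def classify_emotion(self, **modalities):
        h = self.encode(**modalities).mean(dim=1)    # pool over the sequence
        return self.emotion_head(h)
```

Because every task's gradients flow through the shared cross-modal encoder, knowledge learned for one function (e.g., speech recognition) benefits the others (e.g., speech-based emotion recognition), even when each function's own training data is limited.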
  9. 10 Copyright NTT CORPORATION Training in MediaGnosis  MediaGnosis jointly

    utilizes various datasets to train the architecture: paired datasets for individual functions (datasets for speech recognition, face emotion recognition, machine translation, etc.)  Utilize both unpaired datasets (text-only, speech-only, and image-only datasets) and input-output paired datasets ※ This is an example showing only a part of the modules
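A rough sketch of one such co-training step, under assumed per-task loss interfaces (`asr_loss`, `emotion_loss`, `lm_loss` are hypothetical names, not the MediaGnosis API): each optimization step simply sums the losses of whatever paired and unpaired batches are available, so every dataset updates the shared modules.

```python
import torch

def co_training_step(model, optimizer, batches):
    """One update mixing paired (task) and unpaired (single-modal) batches."""
    assert batches, "expects at least one task batch"
    optimizer.zero_grad()
    loss = torch.zeros(())
    if "asr" in batches:                  # paired: speech -> transcript
        speech, text = batches["asr"]
        loss = loss + model.asr_loss(speech, text)
    if "emotion" in batches:              # paired: audio-visual -> emotion label
        speech, image, label = batches["emotion"]
        loss = loss + model.emotion_loss(speech, image, label)
    if "text_only" in batches:            # unpaired: language-model loss on text
        loss = loss + model.lm_loss(batches["text_only"])
    loss.backward()
    optimizer.step()
    return loss.item()
```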
  10. 11 Copyright NTT CORPORATION Inference in MediaGnosis  Possible to

    implement single-modal and multi-modal functions by extracting only the modules needed for the target function, without using all modules. Case 1: speech recognition (Speech and Audio Understanding → Cross-modal Understanding → Text Generation, e.g., “It is sunny today”). Case 2: face-based gender and age estimation (Image and Video Understanding → Cross-modal Understanding → Attribute Classification, e.g., male/female, elder/adult/child). Case 3: audio-visual emotion recognition (Speech and Audio Understanding + Image and Video Understanding → Cross-modal Understanding → Emotion Classification, e.g., happy/sad/neutral) ※ This is an example showing only a part of the modules
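A minimal sketch of this module extraction, with the full model stored as a dictionary of modules so each function chains only the subset it needs. The module names mirror the slide; everything else (layer types, sizes) is an illustrative assumption.

```python
import torch
import torch.nn as nn

# The full joint model as a bag of named modules (illustrative stand-ins).
modules = nn.ModuleDict({
    "speech_understanding": nn.Linear(80, 256),
    "image_understanding": nn.Linear(2048, 256),
    "crossmodal_understanding": nn.Linear(256, 256),
    "text_generation": nn.Linear(256, 1000),
    "attribute_classification": nn.Linear(256, 2),
    "emotion_classification": nn.Linear(256, 3),
})

def run_pipeline(names, x):
    """Chain only the extracted modules needed for the target function."""
    for name in names:
        x = modules[name](x)
    return x

# Case 1: speech recognition
speech = torch.randn(1, 80)       # dummy speech features
asr_logits = run_pipeline(
    ["speech_understanding", "crossmodal_understanding", "text_generation"], speech)

# Case 2: face-based gender and age estimation
face = torch.randn(1, 2048)       # dummy pooled face features
attr_logits = run_pipeline(
    ["image_understanding", "crossmodal_understanding", "attribute_classification"], face)

# Case 3 (audio-visual emotion recognition) would additionally fuse the
# speech and image encoder outputs before the cross-modal module.
```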
  11. 12 Copyright NTT CORPORATION Our major technical point (1)  Cross-modal

    Transformer-based joint modeling [Masumura+ INTERSPEECH2022] [Takashima+ INTERSPEECH2022]  Jointly model multiple single-modal tasks and multi-modal tasks using a shared cross-modal Transformer architecture: speech, text, and speech-text cross-modal joint modeling for text generation [Masumura+ INTERSPEECH2022], and speech, image, and speech-image cross-modal joint modeling for classification [Takashima+ INTERSPEECH2022]
  12. 13 Copyright NTT CORPORATION Our major technical point (2) 

    Grounding other modalities into a text-based LM for cross-modal modeling [Masumura+ ASRU2019] [Masumura+ INTERSPEECH2020] [Masumura+ EUSIPCO2023]  Leverage knowledge gained from large amounts of text to improve speech or image processing with limited training data: the text-based LM improves speech recognition [Masumura+ INTERSPEECH2020] and image captioning [Masumura+ EUSIPCO2023]
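A minimal sketch of the transfer idea, with illustrative sizes: pre-train a decoder as a text-based LM on abundant text, then warm-start the speech-recognition or image-captioning decoder from those weights before fine-tuning on limited paired data. The cited papers use their own pre-training tasks (e.g., phoneme-to-grapheme conversion, paraphrasing); the weight copy below just illustrates the mechanism.

```python
import torch.nn as nn

vocab_size, d_model = 1000, 256   # illustrative sizes

# Text-based LM decoder, assumed already pre-trained on large text-only data.
text_lm_decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=4)

# The speech-recognition / image-captioning decoder has the same shape,
# so the text-LM weights can be copied in as a warm start.
captioning_decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=4)
captioning_decoder.load_state_dict(text_lm_decoder.state_dict())
# Then fine-tune on the limited paired (speech, transcript) or (image, caption) data.
```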
  13. 14 Copyright NTT CORPORATION Our major technical point (3) 

    Self-supervised representation learning for multi-domain joint modeling [Masumura+ SLT2021] [Ihori+ ICASSP2021] [Tanaka+ INTERSPEECH2022] [Tanaka+ ICASSP2023]  Utilize unpaired data collected from a variety of domains for self-supervised learning: domain adversarial self-supervised representation learning considering multi-domain datasets [Tanaka+ INTERSPEECH2022] and cross-lingual self-supervised representation learning considering cross-lingual datasets [Tanaka+ ICASSP2023]
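Domain-adversarial training is typically realized with a gradient reversal layer; the sketch below shows that standard mechanism (whether the cited paper uses exactly this form is an assumption).

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the backward
    pass, so the feature extractor learns to *fool* the domain classifier."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None   # reversed gradient for x, none for lam

def domain_adversarial_loss(features, domain_labels, domain_classifier, lam=1.0):
    """Domain classifier tries to predict the domain; the reversed gradient
    pushes the upstream features toward domain invariance."""
    reversed_feats = GradReverse.apply(features, lam)
    logits = domain_classifier(reversed_feats)
    return nn.functional.cross_entropy(logits, domain_labels)
```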
  14. 15 Copyright NTT CORPORATION Our major technical point (4) 

    Special token based inference control in joint modeling [Ihori+ INTERSPEECH2021] [Tanaka+ INTERSPEECH2021] [Ihori+ COLING 2022] [Orihashi+ ICIP2022]  Control the output generation style at inference time while jointly training multiple tasks: style-token-based inference control for speech recognition [Tanaka+ INTERSPEECH2021], switching-token-based inference control for text-style conversion [Ihori+ INTERSPEECH2021], and auxiliary-token-based inference control for scene-text recognition [Orihashi+ ICIP2022]
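A minimal sketch of the control mechanism, assuming a hypothetical `decode_step` API and made-up token ids: the chosen style token is placed at the start of the decoder input, and the same jointly trained model then generates in the requested style.

```python
import torch

STYLE_TOKENS = {"verbatim": 1, "readable": 2}   # hypothetical token ids
BOS, EOS = 0, 3

def decode_with_style(model, encoder_states, style, max_len=50):
    """Greedy decoding conditioned on a special style token."""
    ys = torch.tensor([[BOS, STYLE_TOKENS[style]]])   # control token after BOS
    for _ in range(max_len):
        logits = model.decode_step(encoder_states, ys)   # assumed interface
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
        ys = torch.cat([ys, next_token], dim=1)
        if next_token.item() == EOS:
            break
    return ys
```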
  15. 16 Copyright NTT CORPORATION Our major technical point (5) 

    Joint inference of multiple pieces of information with autoregressive joint modeling [Masumura+ INTERSPEECH2022] [Ihori+ INTERSPEECH2023] [Makishima+ INTERSPEECH2023]  Jointly generate multiple functions' outputs in one inference pass and consider the relationships between the functions: joint inference of multi-talker gender, age, and transcriptions [Masumura+ INTERSPEECH2022], and joint inference of dual-style outputs, jointly generating spoken text and written text [Ihori+ INTERSPEECH2023]
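As a sketch of the idea, the multiple outputs can be serialized into a single target token sequence, so one autoregressive pass produces all fields and later tokens can condition on earlier ones. The serialization format and parser below are illustrative assumptions, not the papers' exact formats.

```python
# One talker's target sequence might be serialized as:
#   <gender> male <age> adult <transcript> it is sunny today <eos>
# For multi-talker input, per-talker blocks are concatenated in one sequence.

def parse_joint_output(tokens):
    """Split one decoded token stream back into per-field outputs."""
    fields, key = {}, None
    for tok in tokens:
        if tok.startswith("<") and tok.endswith(">"):
            key = tok.strip("<>")      # a field marker token starts a new field
            fields[key] = []
        elif key is not None:
            fields[key].append(tok)    # ordinary tokens belong to the open field
    return {k: " ".join(v) for k, v in fields.items()}

print(parse_joint_output(
    "<gender> male <age> adult <transcript> it is sunny today".split()))
# -> {'gender': 'male', 'age': 'adult', 'transcript': 'it is sunny today'}
```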
  16. 17 Copyright NTT CORPORATION Our major technical point (6) 

    Multi-modal end-to-end modeling for challenging domains [Yamazaki+ AAAI2022] [Hojo+ INTERSPEECH2023]  Enhance multi-modal scene-context awareness and conversation-context awareness: audio-visual scene-aware dialog modeling with a cross-modal Transformer, handling conversation histories and the conversation's video scene [Yamazaki+ AAAI2022], and audio-visual conversation understanding modeling with a cross-modal Transformer, handling the conversation context between a seller and a buyer [Hojo+ INTERSPEECH2023]
  17. 18 Copyright NTT CORPORATION Our major technical point (7) 

    Joint modeling of recognition and generation modules [Masumura+ APSIPA2019] [Masumura+ INTERSPEECH2019] [Makishima+ INTERSPEECH2022]  Consider speech generation or image generation in order to properly recognize speech or image information: considering the reconstruction criterion in training [Makishima+ INTERSPEECH2022] and considering the reconstruction criterion in inference [Masumura+ INTERSPEECH2019]
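A minimal sketch of the training-time variant, under assumed ASR/TTS interfaces (the returned states and loss weighting are hypothetical): the recognizer is penalized when its text-side representation cannot reconstruct the input speech, which encourages recognition outputs consistent with generation.

```python
import torch

def recognition_with_reconstruction_loss(asr, tts, speech, text, alpha=0.5):
    """ASR loss plus a reconstruction term: the recognized text representation
    must allow a TTS module to re-generate the input speech features."""
    asr_loss, text_states = asr(speech, text)   # assumed: returns loss + states
    reconstructed = tts(text_states)            # reconstruct the speech features
    recon_loss = torch.nn.functional.l1_loss(reconstructed, speech)
    return asr_loss + alpha * recon_loss
```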
  18. 19 Copyright NTT CORPORATION Our major technical point (8) 

    Training techniques to jointly use unpaired and paired data [Masumura+ ICASSP2020] [Takashima+ APSIPA2020] [Orihashi+ INTERSPEECH2020] [Suzuki+ ICIP2023]  Unpaired (unlabeled) data can be utilized for semi-supervised learning in the training phase and for unsupervised adaptation at run time: semi-supervised learning for sequence-to-sequence modeling [Masumura+ ICASSP2020] and online domain adaptation for Transformer-based object detection [Suzuki+ ICIP2023]
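A minimal sketch of the semi-supervised side, using confidence-thresholded pseudo-labels (self-training); the threshold and interfaces are illustrative assumptions, not the cited papers' exact recipes.

```python
import torch

def self_training_step(model, optimizer, labeled, unlabeled, threshold=0.9):
    """Supervised loss on labeled data plus a loss on confident pseudo-labels."""
    x, y = labeled
    loss = torch.nn.functional.cross_entropy(model(x), y)

    with torch.no_grad():                       # pseudo-label the unlabeled batch
        probs = model(unlabeled).softmax(dim=-1)
        conf, pseudo = probs.max(dim=-1)
        mask = conf > threshold                 # keep confident predictions only
    if mask.any():
        loss = loss + torch.nn.functional.cross_entropy(
            model(unlabeled[mask]), pseudo[mask])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The same pseudo-labeling loop, run continuously on incoming data, is the basic shape of unsupervised adaptation at run time.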
  19. 21 Copyright NTT CORPORATION Our vision in industry fields  Smart

    Communication, Smart Office, Smart City, Smart XXX, and beyond: use cases such as communication practice, automated customer service, city robots, monitoring, AI collaboration, and AI-based RPA  Provide solutions across multiple modalities and domains using MediaGnosis (our multi-modal foundation model)
  20. 22 Copyright NTT CORPORATION Applications powered by MediaGnosis  Have

    developed various single-modal and multi-modal applications  Pick up two multi-modal applications for the smart communication field
  21. 23 Copyright NTT CORPORATION App. 1: Remote communication support 

    Facilitate a remote meeting by sensing multi-modal multi-party information  We often encounter stumbling blocks: • I can't sense the attitude of the other participants… • Only a limited number of people speak… • The discussion is deadlocked…  It is important to ensure that such meetings proceed smoothly
  22. 25 Copyright NTT CORPORATION Analysis sheet  Sense various aspects

    of a meeting and visualize its changing state: the speaker's audio-visual emotion, the listeners' visual emotion, transcriptions (which can be translated), and audio-visual multiparty analyses  Help to improve the overall quality of the meeting
  23. 26 Copyright NTT CORPORATION App. 2: Personality measurement  Measure

    your personality and help you discover your potential charm by sensing multi-modal information  Select a situation that you like and perform a role-play against it for about 60 seconds (e.g., apologize to your lover), then receive your analysis result  Analyze your responses, express them numerically, and classify them into categories
  24. 28 Copyright NTT CORPORATION Analysis sheet  Explain your popularity

    factors, which represent the most charming points of a person, and also give advice about how to further enhance your charm  Represent how your popularity factors differ from the averages (e.g., more energetic than average) and compare the differences with others in various aspects
  25. 30 Copyright NTT CORPORATION Summary  MediaGnosis is a multi-modal

    foundation model that can handle various functions and modalities within a single brain in an integrated manner, covering speech and audio processing, image and video processing, natural language processing, and cross-modal processing  Can provide various functions in an all-in-one manner and realize complex AI inference combining multi-modal, multi-task processing  Aim to provide solutions across multiple modalities and domains
  26. 31 Copyright NTT CORPORATION References [Masumura+ INTERSPEECH2022] Ryo Masumura, Yoshihiro

    Yamazaki, Saki Mizuno, Naoki Makishima, Mana Ihori, Mihiro Uchida, Hiroshi Sato, Tomohiro Tanaka, Akihiko Takashima, Satoshi Suzuki, Shota Orihashi, Takafumi Moriya, Nobukatsu Hojo and Atsushi Ando, "End-to-End Joint Modeling of Conversation History-Dependent and Independent ASR Systems with Multi-History Training", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.3218-3222, 2022. [Takashima+ INTERSPEECH2022] Akihiko Takashima, Ryo Masumura, Atsushi Ando, Yoshihiro Yamazaki, Mihiro Uchida and Shota Orihashi, "Interactive Co-Learning with Cross-Modal Transformer for Audio-Visual Emotion Recognition", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.4740-4744, 2022. [Masumura+ ASRU2019] Ryo Masumura, Mana Ihori, Tomohiro Tanaka, Atsushi Ando, Ryo Ishii, Takanobu Oba, Ryuichiro Higashinaka, "Improving Speech-Based End-of-Turn Detection via Cross-Modal Representation Learning with Punctuated Text Data", In Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp.1062-1069, 2019. [Masumura+ INTERSPEECH2020] Ryo Masumura, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi, "Phoneme-to-Grapheme Conversion Based Large-Scale Pre-Training for End-to-End Automatic Speech Recognition", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.2822-2826, 2020. [Masumura+ EUSIPCO2023] Ryo Masumura, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi, "Text-to-Text Pre-Training with Paraphrasing for Improving Transformer-based Image Captioning", In Proc. European Signal Processing Conference (EUSIPCO), pp.516-520, 2023. [Masumura+ SLT2021] Ryo Masumura, Naoki Makishima, Mana Ihori, Tomohiro Tanaka, Akihiko Takashima and Shota Orihashi, "Large-Context Conversational Representation Learning: Self-Supervised Learning for Conversational Documents", In Proc. IEEE Spoken Language Technology Workshop (SLT), pp.1012-1019, 2021.
  27. 32 Copyright NTT CORPORATION References [Ihori+ ICASSP2021] Mana Ihori, Naoki

    Makishima, Tomohiro Tanaka, Akihiko Takashima, Shota Orihashi, Ryo Masumura, "MAPGN: MAsked Pointer-Generator Network for Sequence-to-Sequence Pre-training", In Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp.7563-7567, 2021. [Tanaka+ INTERSPEECH2022] Tomohiro Tanaka, Ryo Masumura, Hiroshi Sato, Mana Ihori, Kohei Matsuura, Takanori Ashihara and Takafumi Moriya, "Domain Adversarial Self-Supervised Speech Representation Learning for Improving Unknown Domain Downstream Tasks", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.1066-1070, 2022. [Tanaka+ ICASSP2023] Tomohiro Tanaka, Ryo Masumura, Mana Ihori, Hiroshi Sato, Taiga Yamane, Takanori Ashihara, Kohei Matsuura, Takafumi Moriya, "Leveraging Language Embeddings for Cross-Lingual Self-Supervised Speech Representation Learning", In Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023. [Ihori+ INTERSPEECH2021] Mana Ihori, Naoki Makishima, Tomohiro Tanaka, Akihiko Takashima, Shota Orihashi and Ryo Masumura, "Zero-Shot Joint Modeling of Multiple Spoken-Text-Style Conversion Tasks using Switching Tokens", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.776-780, 2021. [Tanaka+ INTERSPEECH2021] Tomohiro Tanaka, Ryo Masumura, Mana Ihori, Akihiko Takashima, Shota Orihashi and Naoki Makishima, "End-to-End Rich Transcription-Style Automatic Speech Recognition with Semi-Supervised Learning", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.4458-4462, 2021. [Ihori+ COLING 2022] Mana Ihori, Hiroshi Sato, Tomohiro Tanaka, Ryo Masumura, "Multi-Perspective Document Revision", In Proc. International Conference on Computational Linguistics (COLING), pp.6128-6138, 2022. [Orihashi+ ICIP2022] Shota Orihashi, Yoshihiro Yamazaki, Mihiro Uchida, Akihiko Takashima, Ryo Masumura, "Fully Sharable Scene Text Recognition Modeling for Horizontal and Vertical Writing", In Proc. International Conference on Image Processing (ICIP), pp.2636-2640, 2022.
  28. 33 Copyright NTT CORPORATION References [Ihori+ INTERSPEECH2023] Mana Ihori, Hiroshi

    Sato, Tomohiro Tanaka, Ryo Masumura, Saki Mizuno, Nobukatsu Hojo, "Transcribing Speech as Spoken and Written Dual Text Using an Autoregressive Model", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.461-465, 2023. [Makishima+ INTERSPEECH2023] Naoki Makishima, Keita Suzuki, Satoshi Suzuki, Atsushi Ando, Ryo Masumura, "Joint Autoregressive Modeling of End-to-End Multi-Talker Overlapped Speech Recognition and Utterance-level Timestamp Prediction", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.2913-2917, 2023. [Yamazaki+ AAAI2022] Yoshihiro Yamazaki, Shota Orihashi, Ryo Masumura, Mihiro Uchida, Akihiko Takashima, "Audio Visual Scene-Aware Dialog Generation with Transformer-based Video Representations", In Proc. DSTC Workshop at AAAI Conference on Artificial Intelligence (AAAI), No.35, 2022. [Hojo+ INTERSPEECH2023] Nobukatsu Hojo, Saki Mizuno, Satoshi Kobashikawa, Ryo Masumura, Mana Ihori, Hiroshi Sato, Tomohiro Tanaka, "Audio-Visual Praise Estimation for Conversational Video based on Synchronization-Guided Multimodal Transformer", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.2663-2667, 2023. [Masumura+ APSIPA2019] Ryo Masumura, Yusuke Ijima, Satoshi Kobashikawa, Takanobu Oba, Yushi Aono, "Can We Simulate Generative Process of Acoustic Modeling Data? Towards Data Restoration for Acoustic Modeling", In Proc. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp.655-661, 2019.
  29. 34 Copyright NTT CORPORATION References [Masumura+ INTERSPEECH2019] Ryo Masumura, Hiroshi

    Sato, Tomohiro Tanaka, Takafumi Moriya, Yusuke Ijima, Takanobu Oba, "End-to-End Automatic Speech Recognition with a Reconstruction Criterion Using Speech-to-Text and Text-to-Speech Encoder-Decoders", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.1606-1610, 2019. [Makishima+ INTERSPEECH2022] Naoki Makishima, Satoshi Suzuki, Atsushi Ando and Ryo Masumura, "Speaker consistency loss and step-wise optimization for semi-supervised joint training of TTS and ASR using unpaired text data", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.526-530, 2022. [Masumura+ ICASSP2020] Ryo Masumura, Mana Ihori, Akihiko Takashima, Takafumi Moriya, Atsushi Ando, Yusuke Shinohara, "Sequence-level Consistency Training for Semi-Supervised End-to-End Automatic Speech Recognition", In Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp.7049-7053, 2020. [Takashima+ APSIPA2020] Akihiko Takashima, Naoki Makishima, Mana Ihori, Tomohiro Tanaka, Shota Orihashi, Ryo Masumura, "Unsupervised Domain Adversarial Training in Angular Space for Facial Expression Recognition", In Proc. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp.1054-1059, 2020. [Orihashi+ INTERSPEECH2020] Shota Orihashi, Mana Ihori, Tomohiro Tanaka, Ryo Masumura, "Unsupervised Domain Adaptation for Dialogue Sequence Labeling Based on Hierarchical Adversarial Training", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.1575-1579, 2020. [Suzuki+ ICIP2023] Satoshi Suzuki, Taiga Yamane, Naoki Makishima, Keita Suzuki, Atsushi Ando, Ryo Masumura, "ONDA-DETR: Online Domain Adaptation for Detection Transformers with Self-Training Framework", In Proc. International Conference on Image Processing (ICIP), pp.1780-1784, 2023.