
Project Review Report 4 - Robust Speech Recognition in Malayalam

Kurian Benoy
February 24, 2024

Transcript

  1. CONTENTS
 Introduction – Motivation and Project Objectives
 Updated Literature Survey
 Project Objective 1 – Develop an Open-Source ASR
 Project Objective 2 – Long-Form Audio Speech Transcription
 Project Objective 3 – Benchmarking
 Big mistake in the benchmarking process and the need to benchmark again in Malayalam
 Fixing the mistake in benchmarking
 Indic Subtitler
 Conclusion and Further Plans
  2. Phase-1 Work
 The goal of obtaining a Word Error Rate (WER) of less than 0.15 with our model weights (kurianbenoy/vegam-whisper-medium-ml) has been achieved on the SMC Malayalam Speech Corpus dataset. Though the same score was not attained on the Common Voice dataset, we reached a Character Error Rate (CER) of less than 0.15 there.
 The project has embarked on the task of long-form speech transcription, where the initial findings are encouraging. We built a model on top of the whisperX architecture for Malayalam to do long-form audio transcription (a usage sketch follows below).
 We have successfully benchmarked 17 ASR models using our benchmarking toolkit, Malayalam_asr_benchmarking, which is published on PyPI along with the whisper_normalizer package.
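
A minimal sketch of how this whisperX-based pipeline can be driven (the checkpoint name and audio path below are placeholders, and the stock whisperX API is assumed rather than our exact setup):

    import whisperx

    device = "cuda"

    # Placeholder: a fine-tuned Malayalam checkpoint could be passed in
    # place of the stock "large-v2" weights.
    model = whisperx.load_model("large-v2", device, compute_type="float16", language="ml")

    audio = whisperx.load_audio("lecture.mp3")  # placeholder path
    result = model.transcribe(audio, batch_size=16)

    # Each segment carries the start/end timestamps needed for subtitles.
    for segment in result["segments"]:
        print(f"[{segment['start']:.2f}s -> {segment['end']:.2f}s] {segment['text']}")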
  3. Review comments
1. In terms of the threshold defined for an ASR system to be satisfactory, why did you go with a WER and CER of 0.15?
2. What is meant by long-form audio transcription and by robustness of speech?
3. Why are you pursuing the open-source route for building models?
4. What is your experimental setup?
5. Expand the literature survey.
  4. How we addressed review comments
 Feedback (1-4) has been addressed in the updated project report.
 We are working on expanding the literature review by adding more papers in this review as well as in the upcoming project reviews.
  5. Motivation
1. In the Malayalam language there are at the moment no automatic speech recognition models which support long-form audio speech transcription, i.e. transcribing extended spoken content with timestamps. This is an essential component in creating subtitles for academic lectures, interviews, movies, serials etc.
2. Even though there has been a lot of work on Malayalam speech-to-text, it is most of the time not open source. By leveraging open-source methodologies, the system intends to provide access to datasets, model architectures, and algorithms, promoting transparency, reproducibility, and collaboration in the development of Malayalam ASR technology.
3. Many works claim to have achieved 90 percent accuracy, even on datasets which are not available in the public domain and are kept proprietary. Yet only an apples-to-apples comparison can establish whether model A or model B is better for Malayalam speech.
  6. 1. Problem Objectives
Develop an Open-Source ASR System: The project aims to design and implement an open-source ASR system for Malayalam that overcomes the limitations of existing speech-to-text techniques. By leveraging open-source methodologies, the system intends to provide access to datasets, model architectures, and algorithms, promoting transparency, reproducibility, and collaboration in the development of Malayalam ASR technology. A key goal of the project is to achieve a Word Error Rate (WER) of less than 0.15 with the developed ASR system.
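
For reference, WER is the ratio of word-level substitutions (S), deletions (D) and insertions (I) to the number of words N in the reference: WER = (S + D + I) / N. A minimal illustration with the jiwer library (jiwer is used here purely as an example; it is not part of our toolkit):

    import jiwer

    reference = "speech recognition in malayalam"
    hypothesis = "speech recognitions in malayalam"

    # One substitution out of four reference words -> WER = 0.25.
    print(jiwer.wer(reference, hypothesis))
    # CER applies the same edit-distance idea at the character level.
    print(jiwer.cer(reference, hypothesis))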
  7. 2&3. Problem Objectives
Support Long-Form Audio Speech Transcription: In addressing the dearth of specialized provisions for transcribing long-form audio with timestamps in Malayalam, the project endeavors to develop features and capabilities that cater to the specific requirements of transcribing extended spoken content.
Benchmark Various ASR Models: The project seeks to compare and benchmark multiple ASR models to evaluate their performance in the context of Malayalam speech-to-text processing. By conducting systematic comparisons, the project aims to identify the strengths and limitations of different ASR methodologies, leading to insights that can inform the selection of appropriate models for specific use cases.
  8. Seamless M4T Paper
 SeamlessM4T[2] from Meta is a Massively Multilingual & Multimodal Machine Translation model: a single model that supports speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation, and automatic speech recognition for up to 100 languages.
 To build this, they used 1 million hours of open speech audio data to learn self-supervised speech representations with the w2v-BERT 2.0 architecture.
 It comes in various model sizes: SeamlessM4T medium, large and large-v2.
Barrault, L., Chung, Y. A., Meglioli, M. C., et al. “SeamlessM4T-Massively Multilingual & Multimodal Machine Translation.” In: AI Meta Publications, 2023 [2]
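
A sketch of running SeamlessM4T for Malayalam ASR via Hugging Face transformers; the checkpoint name and audio path are assumptions for illustration:

    import torchaudio
    from transformers import AutoProcessor, SeamlessM4TForSpeechToText

    ckpt = "facebook/hf-seamless-m4t-medium"  # assumed checkpoint name
    processor = AutoProcessor.from_pretrained(ckpt)
    model = SeamlessM4TForSpeechToText.from_pretrained(ckpt)

    # SeamlessM4T expects 16 kHz mono audio; "sample.wav" is a placeholder.
    waveform, sr = torchaudio.load("sample.wav")
    waveform = torchaudio.functional.resample(waveform, sr, 16_000)

    inputs = processor(audios=waveform.squeeze().numpy(), sampling_rate=16_000, return_tensors="pt")
    tokens = model.generate(**inputs, tgt_lang="mal")  # "mal" = Malayalam
    print(processor.decode(tokens[0], skip_special_tokens=True))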
  9. Seamless M4T Training data
Malayalam was trained in Seamless M4T with 110 hours of speech-to-text data and 103 hours of translation data.
Barrault, L., Chung, Y. A., Meglioli, M. C., et al. “SeamlessM4T-Massively Multilingual & Multimodal Machine Translation.” In: AI Meta Publications, 2023 [2]
  10. Seamless M4T ASR performance
 The performance metrics of ASR in Malayalam are not available and will need further benchmarking.
 In the paper, ASR is evaluated on the Fleurs dataset in 77 and in 54 languages.
 It outperforms other ASR models on Fleurs-77, while MMS[3] slightly outperforms it on the other language set. It is a better model than Whisper[1] in all the tests.
Barrault, L., Chung, Y. A., Meglioli, M. C., et al. “SeamlessM4T-Massively Multilingual & Multimodal Machine Translation.” In: AI Meta Publications, 2023 [2]
  11. Investigating End-to-End ASR Architectures for Long Form Audio Transcription
 The paper "Investigating End-to-End ASR Architectures for Long Form Audio Transcription" compares different end-to-end ASR models for accurately transcribing long-form audio. Three categories of models are studied: convolutional, convolutional with squeeze-and-excitation (SE), and convolutional with attention. The self-attention model with local attention and global tokens achieved the highest accuracy among the tested models. Comparing CTC and RNNT decoders, the authors find CTC-based models to be more robust and efficient on long-form audio.
 The paper contributes valuable insights into improving the performance of end-to-end ASR models on long-form audio transcription, providing guidance for future development efforts in this area.
Koluguri, Nithin Rao, et al. "Investigating End-to-End ASR Architectures for Long Form Audio Transcription." In: NVIDIA NeMo website (2023). [6]
  12. Investigating End-to-End ASR Architectures for Long Form Audio Transcription: Methodology
1. Studied three classes of ASR models: convolutional, convolutional with SE, and convolutional with attention.
2. Tested the models on the Earnings-21, Earnings-22, CORAAL, and TED-LIUM3 datasets, measuring WER, maximum audio length, and real-time factor (RTF) for each model.
3. Assessed the impact of global context on long-form audio transcription accuracy.
4. Compared models with CTC and RNNT decoders to determine their relative strengths and weaknesses on long-form audio.
Koluguri, Nithin Rao, et al. "Investigating End-to-End ASR Architectures for Long Form Audio Transcription." In: NVIDIA NeMo website (2023). [6]
  13. 1. Problem Objectives
Develop an Open-Source ASR System: The project aims to design and implement an open-source ASR system for Malayalam that overcomes the limitations of existing speech-to-text techniques. By leveraging open-source methodologies, the system intends to provide access to datasets, model architectures, and algorithms, promoting transparency, reproducibility, and collaboration in the development of Malayalam ASR technology. A key goal of the project is to achieve a Word Error Rate (WER) or Character Error Rate (CER) of less than 0.15 with the developed ASR system.
  14. kurianbenoy/Malwhisper-v1-medium
Available in an open-source repository: https://huggingface.co/kurianbenoy/Malwhisper-v1-medium
LICENSE: MIT
 Fine-tuned on the Whisper-medium architecture on the IMaSC dataset[8]. This corpus contains 34,473 text-audio pairs of Malayalam sentences spoken by 8 speakers, totaling approximately 50 hours of audio.
Experiment setup
 GPUs: A100, 40 GB
 Steps: 3000
 Training time: 15 hours
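
Since the checkpoint is on the Hugging Face Hub, it should be usable through the standard transformers ASR pipeline; a minimal sketch (the audio path is a placeholder):

    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model="kurianbenoy/Malwhisper-v1-medium",
    )

    # Whisper operates on 30-second windows; chunking handles longer clips.
    result = asr("malayalam_sample.wav", chunk_length_s=30)
    print(result["text"])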
  15. Evaluation of Malwhisper-v1-medium
On the Common Voice 11 dataset:
 WER: 61.84
 CER: 15.41
On the SMC Malayalam Speech Corpus:
 WER: 70.49
 CER: 17.0
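
A sketch of how such numbers can be reproduced with jiwer and the datasets library (illustrative only; our actual runs used the Malayalam_asr_benchmarking toolkit, and Common Voice is gated behind Hugging Face authentication):

    import jiwer
    from datasets import load_dataset
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model="kurianbenoy/Malwhisper-v1-medium")
    ds = load_dataset("mozilla-foundation/common_voice_11_0", "ml", split="test")

    references, hypotheses = [], []
    for sample in ds:
        references.append(sample["sentence"])
        hypotheses.append(asr(sample["audio"])["text"])

    # jiwer returns fractions; multiply by 100 to compare with the
    # percentage-style numbers above.
    print("WER:", jiwer.wer(references, hypotheses))
    print("CER:", jiwer.cer(references, hypotheses))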
  16. 2. Problem Objectives
Support Long-Form Audio Speech Transcription: In addressing the dearth of specialized provisions for transcribing long-form audio with timestamps in Malayalam, the project endeavors to develop features and capabilities that cater to the specific requirements of transcribing extended spoken content.
  17. Dataset collection
 Started collecting a dataset of long-form audio in Malayalam. As initial dataset collection, we started with 4 audios in Malayalam with durations of 3.5 minutes, 5 minutes, 10 minutes and 55 minutes.
 Started the process of creating ground truth for the respective audios.
 Compared to benchmarking datasets, which usually consist of short audios recorded in studio-like conditions, these datasets are more real-world in nature.
The dataset is hosted under GPL-v2.0 on Hugging Face (it can be loaded as shown below):
https://huggingface.co/datasets/kurianbenoy/Indic-subtitler-audio_evals
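
The evaluation set can be pulled with the datasets library; a minimal sketch (the repository's split layout is an assumption):

    from datasets import load_dataset

    ds = load_dataset("kurianbenoy/Indic-subtitler-audio_evals")
    print(ds)  # inspect the available splits and columns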
  18. 3. Problem Objectives
 Benchmark Various ASR Models: The project seeks to compare and benchmark multiple ASR models to evaluate their performance in the context of Malayalam speech-to-text processing. By conducting systematic comparisons, the project aims to identify the strengths and limitations of different ASR methodologies, leading to insights that can inform the selection of appropriate models for specific use cases.
  19. Normalization algorithms
Why should we normalize/standardize text?
 In ASR systems it is important to normalize the text to reduce unintentional penalties in metrics like WER, CER etc.
 Text normalization/standardization is the process of converting texts in different styles into a standardized form. It is a best-effort attempt to penalize only when a word error is caused by actually mistranscribing a word, and not by formatting or punctuation differences.
What is Whisper's normalization algorithm? (from paper [1])
- BasicTextNormalizer: usually for multilingual text
- EnglishTextNormalizer: for English
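
Our whisper_normalizer package [9] repackages these normalizers; a minimal sketch assuming its documented basic/english module layout:

    from whisper_normalizer.basic import BasicTextNormalizer
    from whisper_normalizer.english import EnglishTextNormalizer

    basic = BasicTextNormalizer()
    english = EnglishTextNormalizer()

    # BasicTextNormalizer lower-cases and strips punctuation and
    # bracketed annotations.
    print(basic("Hello, World!"))  # hello world
    # EnglishTextNormalizer additionally standardizes English spellings,
    # numbers and common abbreviations.
    print(english("Mr. Quilter is the apostle of the middle classes"))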
  20. How to handle edge cases
- Use better libraries like indic-nlp-library[11] or the libindic Normalizer[10], which have support for Malayalam text normalization (see the sketch below).
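
For example, indic-nlp-library [11] provides a script-aware normalizer with a Malayalam mode; a minimal sketch assuming its documented factory API:

    from indicnlp.normalize.indic_normalize import IndicNormalizerFactory

    factory = IndicNormalizerFactory()
    normalizer = factory.get_normalizer("ml")  # "ml" selects the Malayalam rules

    # Canonicalizes alternate Unicode encodings of the same Malayalam
    # text, e.g. different chillu representations.
    print(normalizer.normalize("മലയാളം"))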
  21. How to handle edge cases
These libraries are not perfect though:
- More text cleanup is required.
- Numbers in Malayalam, and in even more Indic languages, need handling.
- Abbreviations such as currency, numbers and fractions need expanding for ASR and TTS tasks.
Kavya Manohar and I are teaming up to work on and solve these issues.
  22. Updated Benchmarking Numbers
 The benchmarking is in progress and only a few models have been benchmarked on the Common Voice 11 dataset and the SMC MSC dataset.
 In this benchmarking round, we will also include numbers from models like:
1. Seamless M4T [2]
2. Meta MMS [3]
3. ASR Malayalam by Kavya Manohar, Kaldi based [4]
4. WhisperX [7]
5. Faster-whisper [5]
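
As one example from this list, faster-whisper [5] exposes a simple timestamped transcription API; a sketch with placeholder model size and audio path:

    from faster_whisper import WhisperModel

    # CTranslate2-backed Whisper; a fine-tuned Malayalam checkpoint path
    # could be passed instead of the stock "medium" size.
    model = WhisperModel("medium", device="cuda", compute_type="float16")

    segments, info = model.transcribe("audio.mp3", language="ml")
    for segment in segments:
        print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")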
  23. Indic-Subtitler
 It's an open-source subtitling platform for transcribing and translating videos/audios in Indic languages.
 We are building this for an open-source AI hackathon sponsored by Meta, for which we were shortlisted.
 Support for transcribing and translating in 10+ Indic languages, including Malayalam, with SeamlessM4T[2], WhisperX[7] and faster-whisper[5].
Let me demo it: https://indicsubtitler.vercel.app/
  24. Conclusion
 We have created a new open-source model, called Malwhisper-v1-medium.
 Started working on collecting ground truth and evaluating results.
 Identified a big bug in the benchmarking pipeline, and we are re-evaluating the results. This bug helped us identify the language complexity of Malayalam and build improvements in existing tech for normalization of Malayalam text.
 Created a Python package, whisper_normalizer, and Indic Subtitler as a sub-product.
  25. Future Plans
 Complete Malayalam ASR model benchmarking
 Create a benchmarking leaderboard for Malayalam
 Further improve open ASR models by refining the architecture and incorporating additional data to push the boundaries of the current state of the art
 Complete collection and creation of ground truth for the evaluation dataset
 Build ASR models for long-form transcription
  26. REFERENCES
1. Radford, Alec, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. "Robust speech recognition via large-scale weak supervision." In: International Conference on Machine Learning, pp. 28492-28518. PMLR, 2023.
2. Barrault, L., Chung, Y. A., Meglioli, M. C., et al. "SeamlessM4T-Massively Multilingual & Multimodal Machine Translation." In: AI Meta Publications, 2023. https://ai.meta.com/research/publications/seamlessm4t-massively-multilingual-multimodal-machine-translation/
3. Pratap, Vineel, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky et al. "Scaling speech technology to 1,000+ languages." In: AI Meta Publications, 2023.
4. Manohar, Kavya, et al. ASR for Malayalam. https://gitlab.com/kavyamanohar/asr-malayalam
5. Klein, Guillaume, et al. faster-whisper. https://github.com/SYSTRAN/faster-whisper
6. Koluguri, Nithin Rao, et al. "Investigating End-to-End ASR Architectures for Long Form Audio Transcription." NVIDIA NeMo website, 2023. https://nvidia.github.io/NeMo/blogs/2024/2024-01-parakeet/
  27. REFERENCES
7. Bain, Max, Jaesung Huh, Tengda Han, and Andrew Zisserman. "WhisperX: Time-accurate speech transcription of long-form audio." In: Interspeech, 2023.
8. Gopinath, Deepa P., and Vrinda V. Nair. "IMaSC--ICFOSS Malayalam Speech Corpus." arXiv preprint arXiv:2211.12796, 2022.
9. Benoy, Kurian, et al. whisper_normalizer. https://github.com/kurianbenoy/whisper_normalizer
10. Dinesh, Akshay S., Thottingal, Santhosh, et al. libindic normalizer. https://github.com/libindic/normalizer
11. Kunchukuttan, Anoop, et al. indic_nlp_library. https://github.com/anoopkunchukuttan/indic_nlp_library