Slide 1


Robust Speech Recognition System for Malayalam
Kurian Benoy (2021MCS120014)
4th May, 2024

Slide 2


What is ASR? (Pictures from IndiaCart, The Verge)

Slide 3


ASR Metrics
ASR is evaluated by comparing the ground-truth transcript against the ASR output. Two common metrics are:
• Word Error Rate (WER)
• Character Error Rate (CER)
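Both metrics reduce to an edit distance between reference and hypothesis; a minimal sketch (real evaluations typically use a library such as jiwer):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, using a rolling row."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, substitution (or match when r == h)
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def wer(reference, hypothesis):
    """Word Error Rate: word-level edits / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)

def cer(reference, hypothesis):
    """Character Error Rate: character-level edits / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, `wer("the cat sat", "the cat sit")` is 1/3: one substitution out of three reference words.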

Slide 4


Motivation
1. At the moment, Malayalam has no automatic speech recognition models that support long-form audio transcription with timestamps. This is an essential component in creating subtitles for academic lectures, interviews, movies, serials, etc.
2. Although there has been a lot of work on Malayalam speech-to-text, most of it is not open source. By leveraging open-source methodologies, this system intends to provide access to datasets, model architectures, and algorithms, promoting transparency, reproducibility, and collaboration in the development of Malayalam ASR technology.
3. Many works claim 90 percent accuracy, often on datasets that are kept proprietary and are not available in the public domain. Only an apples-to-apples comparison on shared benchmarks can establish whether model A or model B is better for Malayalam speech.

Slide 5


Actual Motivation
❑ Whisper from OpenAI delivered fantastic results for English. Can we get even better results for Malayalam by fine-tuning?
❑ The Whisper paper is titled "Robust Speech Recognition via Large-Scale Weak Supervision", which inspired the title of this work.
❑ I knew good results were possible through fine-tuning: by December 2022, Thennal (a BTech student at IIIT Kottayam) had shown we can get an ASR model with 0.1 WER.
❑ Build an ASR for the Malayalam diaspora of 3 crore+ users.

Slide 6


Problem Objectives

Slide 7


1. Problem Objectives
Develop an Open-Source ASR System: The project aims to design and implement an open-source ASR system for Malayalam that overcomes the limitations of existing speech-to-text techniques. By leveraging open-source methodologies, the system intends to provide access to datasets, model architectures, and algorithms, promoting transparency, reproducibility, and collaboration in the development of Malayalam ASR technology. A key goal of the project is to achieve a Word Error Rate (WER) below 0.15 in the developed speech-to-text model.

Slide 8


No content

Slide 9


Whisper Fine-tuning
❑ We fine-tuned the Whisper small and medium checkpoints.
❑ GPU poor vs. GPU rich

Slide 10


Whisper Fine-tuning
Small-model checkpoint, evaluated on the SMC MSC dataset:
WER: 73.56, CER: 17.82
https://huggingface.co/smcproject/Malwhisper-v1-small
Medium-model checkpoint, evaluated on the SMC MSC dataset:
WER: 70.49, CER: 17.0
https://huggingface.co/smcproject/Malwhisper-v1-medium

Slide 11


Quantized model weights
❑ Using quantization, it is possible to optimize one of the best available ASR models for efficiency and high performance. Through the faster-whisper framework, Whisper models support the int8_float16, float16, int8, and int16 quantization formats, ensuring efficient processing and transcription of speech data without compromising accuracy much.
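To illustrate the underlying idea (this is a toy sketch of symmetric int8 quantization, not the actual CTranslate2/faster-whisper implementation):

```python
import random

def quantize_int8(weights):
    """Map floats into [-127, 127] integers with a shared scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate float weights at compute time."""
    return [q * scale for q in quantized]

# Toy "weight matrix": 1000 Gaussian floats
random.seed(0)
weights = [random.gauss(0, 1) for _ in range(1000)]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
# int8 storage needs 1 byte per weight vs. 4 for float32 (about 4x smaller),
# and the worst-case rounding error is bounded by scale / 2
```

This shows why quantization barely hurts accuracy: each weight moves by at most half a quantization step, while memory and bandwidth drop sharply.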

Slide 12


2. Problem Objectives
Support Long-Form Audio Speech Transcription: To address the lack of specialized provisions for transcribing long-form Malayalam audio with timestamps, the project endeavors to develop features and capabilities that cater to the specific requirements of transcribing extended spoken content.
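Conceptually, long-form transcription works by windowing the audio into Whisper's native 30-second context and offsetting each window's timestamps by its start time. A hedged sketch (`make_windows` is a hypothetical helper, not Whisper's actual chunking logic):

```python
def make_windows(duration_s, window_s=30.0):
    """Yield (start, end) boundaries, in seconds, covering the recording.

    Each window is transcribed independently; a segment timestamped t
    seconds into a window maps to (window start + t) in the full audio,
    which is what subtitle files need.
    """
    start = 0.0
    while start < duration_s:
        yield start, min(start + window_s, duration_s)
        start += window_s
```

For a 75-second clip this yields (0, 30), (30, 60), (60, 75); a phrase 5 seconds into the third window gets the global timestamp 65 s.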

Slide 13


Long Form Transcription

Slide 14


Indic-Subtitler
❑ It's an open-source subtitling platform 💻 for transcribing and translating videos/audio in Indic languages.
❑ We are building this for an open-source AI hackathon sponsored by Meta, for which we were shortlisted.
❑ Supports transcribing and translating 10+ Indic languages, including Malayalam, with SeamlessM4T[2], WhisperX[6], and faster-whisper[5].
❑ Let me demo it: https://indicsubtitler.in/

Slide 15


No content

Slide 16


No content

Slide 17


No content

Slide 18


Let’s do a demo

Slide 19


3. Problem Objectives ❑ Benchmark Various ASR Models: The project seeks to compare and benchmark multiple ASR models to evaluate their performance in the context of Malayalam speech-to-text processing. By conducting systematic comparisons, the project aims to identify the strengths and limitations of different ASR methodologies, leading to insights that can inform the selection of appropriate models for specific use cases.

Slide 20


Methodology
1. Release as a Python library for benchmarking Whisper-based transformer models.
2. Calculation of WER, CER, model size, and the time required to benchmark each model on the selected datasets.
3. A reproducible methodology, so benchmarking results can be saved as a dataset.
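Steps 1 and 2 can be sketched as a small harness; here `transcribe` and `dataset` are stand-ins for a real Whisper model and speech corpus, and the returned rows would afterwards be scored for WER/CER and saved as the results dataset of step 3:

```python
import time

def benchmark(transcribe, dataset):
    """Run a transcription callable over (audio, reference) pairs.

    `transcribe` and `dataset` are placeholders for a real model and
    corpus. Each row records the hypothesis and wall-clock time, so the
    collection can later be scored (WER/CER) and saved as a dataset.
    """
    rows = []
    for audio, reference in dataset:
        start = time.perf_counter()
        hypothesis = transcribe(audio)
        rows.append({
            "reference": reference,
            "hypothesis": hypothesis,
            "seconds": time.perf_counter() - start,
        })
    return rows
```

Keeping raw (reference, hypothesis, time) rows rather than only aggregate scores is what makes the benchmark reproducible: anyone can re-score the saved rows with a different normalizer or metric.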


Slide 21


Benchmarking in CommonVoice 11

Slide 22


Benchmarking in CommonVoice 11

Slide 23


Benchmarking in CommonVoice 11

Slide 24


Benchmarking in MSC

Slide 25


Benchmarking in MSC

Slide 26


Benchmarking in MSC

Slide 27


Normalization algorithms
Why should we normalize/standardize text?
❑ In ASR systems it is important to normalize text so that metrics like WER and CER are not penalized unintentionally.
❑ Text normalization/standardization is the process of converting texts in different styles into a standardized form: a best-effort attempt to penalize only genuine mis-transcriptions of words, not formatting or punctuation differences.
What are Whisper's normalization algorithms? (from paper [1])
- BasicTextNormalizer: used for multilingual text
- EnglishTextNormalizer: used for English
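The danger of applying BasicTextNormalizer to Malayalam can be reproduced in a few lines. This is a simplified sketch of its category-based symbol filtering, not the exact Whisper code: dropping every Unicode Mark-category character deletes Malayalam vowel signs and the anusvara, garbling the word.

```python
import unicodedata

def strip_marks(text):
    """Simplified sketch of BasicTextNormalizer-style filtering:
    remove every character whose Unicode category starts with 'M' (Mark)."""
    return "".join(
        c for c in unicodedata.normalize("NFKD", text)
        if not unicodedata.category(c).startswith("M")
    )

word = "മലയാളം"             # "Malayalam"
stripped = strip_marks(word)  # vowel sign ാ (Mc) and anusvara ം (Mc) are lost
```

For English this filtering only removes stray accents and punctuation, but for Malayalam the Mark-category characters are load-bearing parts of the script.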

Slide 28


What does BasicTextNormalizer do? (From the Whisper paper [1])

Slide 29


The whisper_normalizer[9] package

Slide 30


Big bug found by Kavya Manohar

Slide 31


BasicTextNormalizer

Slide 32


Unicode category Mark:

Slide 33


Malayalam Unicode Characters

Slide 34


So what is the solution?

Slide 35


How to handle edge cases
- Use better libraries like indic-nlp-library[11] or the libindic Normalizer[10], which support Malayalam text normalization.
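Where those libraries are unavailable, a Malayalam-safe fallback is to strip only punctuation and collapse whitespace while leaving Mark-category characters untouched. A hedged sketch (this is not the libindic or indic-nlp-library implementation):

```python
import unicodedata

def normalize_malayalam(text):
    """Hedged fallback normalizer: replace punctuation with spaces and
    collapse whitespace, but keep Mark-category characters (vowel signs,
    virama, anusvara) intact so Malayalam words survive normalization."""
    kept = "".join(
        " " if unicodedata.category(c).startswith("P") else c
        for c in unicodedata.normalize("NFC", text)
    )
    return " ".join(kept.split())
```

Unlike the Mark-stripping behavior shown earlier, this keeps "മലയാളം" intact while still removing punctuation, so WER/CER comparisons penalize only genuine transcription errors.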

Slide 36


Paper submitted to NLDB 2024
❑ We submitted a paper to NLDB 2024 titled "An Open Source Platform for Generating Subtitles for Indian Languages".

Slide 37


No content

Slide 38


No content

Slide 39


REFERENCES
1. Radford, Alec, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. "Robust speech recognition via large-scale weak supervision." In International Conference on Machine Learning, pp. 28492-28518. PMLR, 2023.
2. Barrault, L., Chung, Y. A., Meglioli, M. C., et al. "SeamlessM4T: Massively Multilingual & Multimodal Machine Translation." AI at Meta Publications, 2023. https://ai.meta.com/research/publications/seamlessm4t-massively-multilingual-multimodal-machine-translation/
3. Pratap, Vineel, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, et al. "Scaling speech technology to 1,000+ languages." AI at Meta publication, 2023.
4. Manohar, Kavya, et al. "ASR for Malayalam." https://gitlab.com/kavyamanohar/asr-malayalam
5. Klein, Guillaume, et al. "faster-whisper." https://github.com/SYSTRAN/faster-whisper
6. Koluguri, Nithin Rao, et al. "Investigating End-to-End ASR Architectures for Long Form Audio Transcription." NVIDIA NeMo blog, 2023. https://nvidia.github.io/NeMo/blogs/2024/2024-01-parakeet/

Slide 40


REFERENCES
7. Bain, Max, Jaesung Huh, Tengda Han, and Andrew Zisserman. "WhisperX: Time-accurate speech transcription of long-form audio." In Interspeech, 2023.
8. Gopinath, Deepa P., and Vrinda V. Nair. "IMaSC: ICFOSS Malayalam Speech Corpus." arXiv preprint arXiv:2211.12796, 2022.
9. Benoy, Kurian, et al. "whisper_normalizer." https://github.com/kurianbenoy/whisper_normalizer
10. Dinesh, Akshay S., Santhosh Thottingal, et al. "libindic normalizer." https://github.com/libindic/normalizer
11. Kunchukuttan, Anoop, et al. "indic_nlp_library." https://github.com/anoopkunchukuttan/indic_nlp_library
12. Yang, S.-W., P.-H. Chi, Y.-S. Chuang, C.-I. J. Lai, K. Lakhotia, Y. Y. Lin, A. T. Liu, J. Shi, X. Chang, G.-T. Lin, T.-H. Huang, W.-C. Tseng, K.-T. Lee, D.-R. Liu, Z. Huang, S. Dong, S.-W. Li, S. Watanabe, A. Mohamed, and H.-Y. Lee. "SUPERB: Speech Processing Universal PERformance Benchmark." In Proc. Interspeech 2021, pp. 1194-1198, 2021.

Slide 41


REFERENCES
13. Kurian, Cini, and Kannan Balakrishnan. "Speech recognition of Malayalam numbers." In 2009 World Congress on Nature & Biologically Inspired Computing (NaBIC).
14. Mohamed, Anuj, and K. N. Nair. "HMM/ANN hybrid model for continuous Malayalam speech recognition." Procedia Engineering, Volume 30, pp. 616-622.
15. Manohar, Kavya, A. R. Jayan, and Rajeev Rajan. "Syllable Subword Tokens for Open Vocabulary Speech Recognition in Malayalam." In NSURL 2022, pages 1-7, Trento, Italy.
16. Manohar, Kavya, A. R. Jayan, and Rajeev Rajan. "Quantitative analysis of the morphological complexity of Malayalam language." In International Conference on Text, Speech and Dialogue, pp. 71-78.
17. Gandhi, S., P. von Platen, and A. M. Rush. "ESB: A benchmark for multi-domain end-to-end speech recognition."