Slide 1

Slide 1 text

Language Complexity and Speech Recognition Accuracy: Orthographic Complexity Hurts, Phonological Complexity Doesn't
Chihiro Taguchi, David Chiang (田口智大, 蔣偉)
NLP Colloquium (presented at ACL 2024), September 11, 2024
[Links: Paper, LinkedIn]

Slide 2

Slide 2 text

About me
• 2015-2019: Faculty of Law, Keio University (language policy and language endangerment in Ikema, Okinawa)
• 2020-2022: MA in Engineering, Nara Institute of Science and Technology (NLP for the Tatar language)
• 2021-2022: MScR in Linguistics, University of Edinburgh (Tatar syntax)
• 2022-present: PhD in Computer Science, University of Notre Dame (NLP for documenting endangered languages)
Why move from the humanities/social sciences to NLP?
• A constant interest in languages
• Luck

Slide 3

Slide 3 text

Research interests
NLP:
• Multilingual NLP: NLP for the documentation of endangered languages
• Automatic speech recognition (ASR)
• Machine translation (MT)
• Corpora (Universal Dependencies)
Linguistics:
• Descriptive linguistics, field methods
• Syntax (Tatar, Kichwa, Japanese)

Slide 4

Slide 4 text

Documenting Kichwa
Kichwa (< Quechuan language family)
• Fieldwork
• Building Kichwa ASR with the community: dataset, model, and a paper at LREC-COLING 2024
• Kichwa syntax: paper at LSA 2024
[Image: Kichwa-speaking area in South America (Wikimedia Commons, CC-BY)]
[Image: with my informants (Quito, Ecuador); removed for privacy reasons]

Slide 5

Slide 5 text

Background of today's talk
I was a newbie in ASR… (2022-)
• Project in 2022-23: Speech-to-IPA (Interspeech 2023: https://arxiv.org/abs/2308.03917)
• NLP coursework taught by my supervisor in Fall 2023: final project
• Submission to ARR while down with appendicitis (February 2024)

Slide 6

Slide 6 text

Language Complexity and Speech Recognition Accuracy: Orthographic Complexity Hurts, Phonological Complexity Doesn't
Chihiro Taguchi, David Chiang (田口智大, 蔣偉)
NLP Colloquium (presented at ACL 2024), September 11, 2024
㊗ Outstanding Paper Award ㊗ Senior Area Chair's Award
[Links: Paper, LinkedIn]

Slide 7

Slide 7 text

What makes speech recognition hard? For humans:
1. The number of characters (graphemes): too many characters → difficult prediction
(Image: https://www.nippon.com/hk/views/b05601/)

Slide 8

Slide 8 text

What makes speech recognition hard? For humans:
1. The number of characters
2. Inconsistent spelling (logographicity)
• English: /raɪt/ → right, write, rite, Wright
• Chinese: /shìshí/ → 事實 "fact", 適時 "timely", 是時 "at this time", 嗜食 "to have a predilection for certain food", …
• Japanese: /senkoo/ → 先行 "preceding", 専攻 "major", 選考 "screening", 閃光 "sparkle", 先攻 "bat first", 潜航 "cruising underwater", 穿孔 "perforation", …
• Thai: /kasètsàat/ → เกษตรศาสตร, spelled with the letters ⟨e k ʂ t r ɕ ā s t r ʻ⟩ but pronounced as if written กะเส็ดสาด ⟨k a e s d s ā d⟩
(Image: https://www.zabaan.com/blog/whats-wrong-with-english-spelling/)

Slide 9

Slide 9 text

Logographicity?
Writing systems form a continuum from phonography to morphography within each script type (alphabetic, abugida, abjad, moraic, syllabic):
• Highly phonographic: Finnish (alphabetic), Hindi (abugida), Phoenician (abjad), Japanese kana (moraic), Modern Yi (syllabic)
• More morphographic: English, German, French, Korean, Modern Tibetan, Thai, Persian
• Most morphographic: Japanese (kanji), Chinese, Akkadian, Egyptian (hieroglyphs)
Modified from Sproat (2008), Computational Theory of Writing Systems, p. 138

Slide 10

Slide 10 text

What makes speech recognition hard? For humans:
1. The number of characters
2. Inconsistent spelling
3. The number of sounds (phonemes): e.g., Japanese has 5 vowels vs. more than 10 in English
(Image: Abkhaz alphabet, https://omniglot.com/writing/abkhaz.htm)

Slide 11

Slide 11 text

What makes speech recognition hard… for machines?
Today's topic: Do machines also struggle with these linguistic complexities? If so, what are the factors?

Slide 12

Slide 12 text

Let's test it with Wav2Vec2-XLSR-53
What is Wav2Vec2-XLSR-53? (Conneau et al., 2020)
• Encoder-only speech model pretrained on 53 languages
• Self-supervised multilingual model (like mBERT)
• Adaptable to unseen languages (a fine-tuning sketch follows below)
https://huggingface.co/facebook/wav2vec2-large-xlsr-53
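For concreteness, here is a minimal fine-tuning sketch with Hugging Face transformers, following the standard XLSR recipe: it assumes a character-level vocab.json has already been built from the target language's transcripts, and all hyperparameter choices here are placeholders, not the paper's exact configuration.

```python
from transformers import (Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor,
                          Wav2Vec2Processor, Wav2Vec2ForCTC)

# Character-level tokenizer over the target orthography
# (vocab.json is assumed to exist, built from the training transcripts).
tokenizer = Wav2Vec2CTCTokenizer("vocab.json", unk_token="[UNK]", pad_token="[PAD]")
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1, sampling_rate=16000,
    padding_value=0.0, return_attention_mask=True)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# Load the self-supervised multilingual encoder and attach a fresh CTC head
# sized to the target language's grapheme inventory.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
model.freeze_feature_encoder()  # common practice: keep the CNN feature encoder fixed
```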

Slide 13

Slide 13 text

Setup
Now, we want to see what makes speech recognition hard for Wav2Vec2-XLSR-53.
→ Fine-tune it on languages with different writing systems:
• Japanese: Kanji (日本語), Kana (ニホンゴ), Romaji (nihongo)
• Chinese: Hanzi (漢語), Zhuyin (注音: ㄏㄢˋㄩˇ), Pinyin (拼音: hànyǔ)
• Korean: Hangul syllabary (한국어), Hangul Jamo (ㅎㅏㄴㄱㅜㄱㅇㅓ) (see the Unicode sketch below)
• Phonographic languages: Thai, Arabic, English, French, Italian, Czech, Swedish, Dutch, German
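As an aside, the Hangul syllabary-to-Jamo conversion falls out of standard Unicode normalization: NFD decomposes each precomposed syllable block into its conjoining Jamo (the slide displays compatibility Jamo like ㅎㅏㄴ, a different code block, but the idea is the same). A minimal sketch:

```python
import unicodedata

word = "한국어"
jamo = unicodedata.normalize("NFD", word)  # decompose syllable blocks into conjoining Jamo
print(list(jamo))                          # ['ᄒ', 'ᅡ', 'ᆫ', 'ᄀ', 'ᅮ', 'ᆨ', 'ᄋ', 'ᅥ']
print(unicodedata.normalize("NFC", jamo))  # recompose: '한국어'
```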

Slide 14

Slide 14 text

Data for speech recognition
The same amount of training data across all languages:
• Common Voice 16.1 (Ardila et al., 2020), except:
  – English: LibriSpeech (Panayotov et al., 2015)
  – Korean: Zeroth-Korean (https://github.com/goodatlas/zeroth)
• Training data: 10,000 seconds per language
• 12 languages, 10 writing systems
A data-selection sketch follows below.
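A sketch of how such a fixed-duration subset might be drawn with the Hugging Face datasets library; the dataset name and language code are illustrative (the Common Voice dataset requires authentication on the Hub), and only the 10,000-second budget comes from the slide.

```python
from datasets import load_dataset, Audio

# Illustrative: take Japanese Common Voice 16.1 and keep ~10,000 seconds of audio.
ds = load_dataset("mozilla-foundation/common_voice_16_1", "ja", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

budget, total, kept = 10_000.0, 0.0, []
for i, ex in enumerate(ds):
    dur = len(ex["audio"]["array"]) / ex["audio"]["sampling_rate"]
    if total + dur > budget:
        break
    total += dur
    kept.append(i)
subset = ds.select(kept)
print(f"{len(subset)} clips, {total:.0f} seconds")
```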

Slide 15

Slide 15 text

Setup
Metrics:
• Character Error Rate (CER):
  $\mathrm{CER} = \frac{S + I + D}{N}$
  where S: #substitutions, I: #insertions, D: #deletions, N: reference length
• 🆕 Calibrated Errors per Second (CEPS):
  $\mathrm{CEPS} = \hat{\lambda} = -\frac{\ln(1 - p)}{\tau}, \quad \tau = \frac{\mathrm{VAD}(a)}{N}$
  where VAD: voice activity detection, a: audio length (derivation on the next two slides)
  – Calibrated to mitigate the error bias caused by orthographic differences
  – Accounts for the possibility of multiple errors within a single slice (e.g., character)
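For reference, CER can be computed with the jiwer library (an assumption on my part; the slides don't name an implementation):

```python
import jiwer

reference = "right"
hypothesis = "write"
# Character error rate = (S + I + D) / N; here 4 edits against a
# 5-character reference, so 0.8.
print(jiwer.cer(reference, hypothesis))
```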

Slide 16

Slide 16 text

Details on Calibrated Errors per Second (CEPS)
Assumptions:
• All languages convey the same amount of information per second
• Speech is divided into equal-length slices of τ seconds each
• An ASR error is an event that occurs at a single point in time
• Errors are Poisson-distributed
Then the probability that a slice (of τ seconds) has k errors is
  $P(k) = \frac{(\lambda\tau)^k e^{-\lambda\tau}}{k!}$
where λ: calibrated errors per second, τ: seconds per character, λτ: expected number of errors in a slice.
We want to estimate λ by maximum likelihood estimation (MLE). With p: error rate and n: total number of slices, the likelihood function for all observations is the product of the per-slice probabilities.

Slide 17

Slide 17 text

Details on Calibrated Errors per Second (CEPS)
Observing only whether each slice contains errors, with p: error rate (CER, WER, etc.), each of the n(1-p) slices with no errors contributes a factor $e^{-\lambda\tau}$ and each of the np slices with at least one error contributes $1 - e^{-\lambda\tau}$. Log-likelihood:
  $\ell(\lambda) = -n(1-p)\,\lambda\tau + np\,\ln(1 - e^{-\lambda\tau})$
Estimate λ by setting $d\ell/d\lambda = 0$ and solving:
  $\hat{\lambda} = -\frac{\ln(1 - p)}{\tau}$
In the implementation, we use p = CER and τ = VAD(a)/N (see the sketch below).
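Putting the estimate into code, a minimal sketch of my own following the formula above (one caveat the slides don't address: $\ln(1-p)$ is undefined for CER ≥ 1, so the sketch clips p):

```python
import math

def ceps(cer: float, voiced_seconds: float, ref_len: int) -> float:
    """Calibrated Errors Per Second: lambda = -ln(1 - p) / tau."""
    tau = voiced_seconds / ref_len  # tau = VAD(a) / N: seconds per reference character
    p = min(cer, 0.999)             # guard: ln(1 - p) is undefined for p >= 1
    return -math.log(1.0 - p) / tau

# Example: CER = 0.25 over 2 seconds of voiced audio, 16 reference characters
print(ceps(0.25, 2.0, 16))  # ~2.30 errors per second
```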

Slide 18

Slide 18 text

CEPS and CER: example
Reference: 腹減りすぎて3分待てなかったやべ
Prediction: 腹減り全て3分待てなかった矢部
(assume the utterance took 2 seconds)
Source: HikakinTV (2021) “【6年ぶり】YouTubeの自動字幕実況したら爆笑が止まらないwww【ヒカキンTV・セイキンTV・マスオTV】”, 3:14. Retrieved September 8, 2024. https://www.youtube.com/watch?v=kLHc0c3Yv7U
[Screenshot removed for copyright reasons.]
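A worked computation under the slide's setup (the alignment is mine; the slide shows only the strings): aligning the two strings gives すぎ → 全 (one substitution plus one deletion) and やべ → 矢部 (two substitutions), so S + I + D = 4 against a reference of N = 16 characters. Then:
  $\mathrm{CER} = 4/16 = 0.25$
  $\tau = 2\,\mathrm{s} / 16\ \text{characters} = 0.125$ s per character
  $\mathrm{CEPS} = -\ln(1 - 0.25)/0.125 \approx 2.30$ errors per second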

Slide 19

Slide 19 text

Setup
Compare CER and CEPS with:
• Grapheme distribution
  – Number of grapheme types
  – Unigram character entropy: $H = -\sum_{c} p(c) \log_2 p(c)$, summing over grapheme types c (a sketch follows after this list)
• Logographicity
  – Attention-based measure (Sproat & Gutkin, 2021)
• Phoneme distribution
  – Number of phoneme types, from Phoible 2.0 (Moran and McCloy, 2019)
But how can we measure logographicity with attention?
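A sketch of the unigram character entropy computation (my own; the base-2 logarithm, giving bits, is an assumption):

```python
import math
from collections import Counter

def unigram_entropy(text: str) -> float:
    # Empirical distribution over grapheme (character) types
    counts = Counter(text)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

print(unigram_entropy("abracadabra"))  # entropy of the character distribution, in bits
```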

Slide 20

Slide 20 text

But how can we measure logographicity?
Logographicity: the irregularity of the mapping from pronunciation to spelling (Sproat and Gutkin, 2021)
→ If a writing system is logographic, one must consider the context to determine the correct spelling!
• on your /raɪt/ hand side → right
• can you /raɪt/ it down → write
• Stravinsky's “The /raɪt/ of Spring” → rite
• the /raɪt/ brothers invented the airplane → Wright

Slide 21

Slide 21 text

Attention-based measure of logographicity
Check how much the attention is spread out across the context!
[Figure: a model maps the phoneme sequence (with its context) to the target word in the orthography; the attention columns for the target word itself are masked (zeroed out)]

Slide 22

Slide 22 text

Attention-based measure of logographicity
To compute the logographicity of a language:
1. Train a seq-to-seq model that converts a phoneme sequence into the target word in the orthography
2. Use the model to get the (last) attention matrix
3. Mask the target word's positions in the matrix
4. Compute the ratio of the attention mass in the masked matrix to that in the original matrix
[Figure: attention matrix vs. masked attention matrix]
A sketch of steps 3-4 follows below.
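A minimal sketch of steps 3 and 4, assuming we already have the last-layer attention matrix (rows: output characters; columns: input phonemes; rows summing to 1) and the column indices covering the target word's own phonemes. This is my reading of the measure, not Sproat & Gutkin's exact code:

```python
import numpy as np

def logographicity(attn: np.ndarray, target_cols: list[int]) -> float:
    """Fraction of attention mass falling on the context rather than on
    the target word's own phonemes (higher = more logographic)."""
    masked = attn.copy()
    masked[:, target_cols] = 0.0      # step 3: zero out attention to the target word
    return masked.sum() / attn.sum()  # step 4: masked-to-original attention-mass ratio

# Toy example: 2 output characters x 4 input phonemes, target word = columns 1-2
attn = np.array([[0.1, 0.6, 0.2, 0.1],
                 [0.2, 0.3, 0.4, 0.1]])
print(logographicity(attn, [1, 2]))  # 0.25: a quarter of the attention is on context
```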

Slide 23

Slide 23 text

Results: Same language, different writing systems
• Japanese: Kanji (日本語), Kana (ニホンゴ), Romaji (nihongo)
• Chinese: Hanzi (漢語), Zhuyin (ㄏㄢˋㄩˇ), Pinyin (hànyǔ)
• Korean: Hangul syllabary (한국어), Hangul Jamo (ㅎㅏㄴㄱㅜㄱㅇㅓ)
Would they show different scores?

Slide 24

Slide 24 text

Results

Language | Writing system | CER↓  | CEPS↓ | #Graphemes | Unigram entropy | Logographicity | #Phonemes
Japanese | Kanji + Kana   | 58.12 | 7.21  | 1702       | 7.74            | 44.98          | 27
         | Kana           | 29.71 | 3.48  |   92       | 5.63            | 41.22          |
         | Romaji         | 17.09 | 2.91  |   27       | 3.52            | 29.46          |
Chinese  | Hanzi          | 62.81 | 2.65  | 2155       | 9.47            | 41.59          | 39.5
         | Zhuyin         |  9.71 | 1.04  |   49       | 4.81            | 24.32          |
         | Pinyin         |  9.17 | 1.01  |   56       | 5.02            | 22.50          |
Korean   | Hangul         | 28.21 | 2.63  |  965       | 7.98            | 25.27          | 42.5
         | Jamo           | 16.72 | 3.23  |   62       | 4.90            | 15.99          |

“Simpler” writing systems get better scores!

Slide 25

Slide 25 text

Logographic orthographies are hard to learn
• Japanese: slower learning of Kanji than Kana/Romaji
• Korean: slower learning of Hangul than Jamo
• Chinese: slower learning of Hanzi than Zhuyin/Pinyin
[Figure: learning curves for Japanese, Korean, and Chinese]

Slide 26

Slide 26 text

Results (contd.): Phonographic languages

Language | Writing system | CER   | CEPS | #Graphemes | Unigram entropy | Logographicity | #Phonemes
Thai     | Thai           | 19.77 | 1.80 | 67         | 5.24            | 20.55          | 20.67
Arabic   | Arabic         | 40.59 | 4.78 | 53         | 4.77            | 21.57          | 37
English  | Latin          |  3.17 | 0.58 | 27         | 4.17            | 19.17          | 41.22
French   | Latin          | 19.64 | 2.79 | 69         | 4.42            | 20.37          | 36.75
Italian  | Latin          | 14.82 | 1.84 | 48         | 4.27            | 21.28          | 43.33
Czech    | Latin          | 16.89 | 1.86 | 46         | 4.92            | 20.57          | 39
Swedish  | Latin          | 20.31 | 2.71 | 34         | 4.52            | 19.81          | 35
Dutch    | Latin          | 12.35 | 1.77 | 36         | 4.20            | 19.67          | 49.38
German   | Latin          |  7.61 | 1.03 | 48         | 4.18            | 18.03          | 40

Slide 27

Slide 27 text

Any correlation?
Correlation matrix:

                | CER  | CEPS | #Graphemes | Unigram entropy | Logographicity | #Phonemes
CER             | 1.00 | 0.77 | 0.85       | 0.81            | 0.76           | -0.37
CEPS            |      | 1.00 | 0.49       | 0.41            | 0.61           | -0.66
#Graphemes      |      |      | 1.00       | 0.93            | 0.72           | -0.14
Unigram entropy |      |      |            | 1.00            | 0.67           | -0.08
Logographicity  |      |      |            |                 | 1.00           | -0.60

• Significant correlation between CER and the orthography-related variables
• CEPS has weaker correlation with the orthography-related variables
• No significant correlation between CER and the number of phonemes
(a sketch of computing such a matrix follows below)
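Such a matrix is straightforward to reproduce with pandas (a sketch; the rows shown are the first few from the results table above, and the column names are mine):

```python
import pandas as pd

# Illustrative: per-(language, writing system) results assembled into a frame
df = pd.DataFrame({
    "CER":             [58.12, 29.71, 17.09, 62.81],
    "CEPS":            [7.21, 3.48, 2.91, 2.65],
    "n_graphemes":     [1702, 92, 27, 2155],
    "unigram_entropy": [7.74, 5.63, 3.52, 9.47],
})
print(df.corr())  # pairwise Pearson correlation matrix
```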


Slide 30

Slide 30 text

Conclusion
What makes automatic speech recognition hard?
• Orthographic complexity:
  – worse performance
  – slower learning
  – Calibrated Errors Per Second (CEPS) can mitigate the orthographic bias
• Phonological complexity does not affect performance

Slide 31

Slide 31 text

Why is this finding interesting?
• Speech recognition for low-resource logographic languages
  – Some low-resource languages have complex orthographies (e.g., Yi: https://www.omniglot.com/writing/yi.htm; Inuktitut: https://www.omniglot.com/writing/inuktitut.htm)
  – Better accuracy with transliterated data? e.g., Taiwanese Hokkien:
    goa2 si7 jit8 pun2 lang5
    ↓ (rule-based or seq2seq conversion)
    我是日本人 (“I am Japanese”)
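A toy sketch of the rule-based direction, assuming a syllable-to-morpheme lexicon; the tiny dictionary below is hypothetical and covers only the slide's example (real Hokkien-to-Hanzi conversion needs segmentation and disambiguation):

```python
# Hypothetical syllable-to-character lexicon for the slide's example.
LEXICON = {
    ("goa2",): "我",
    ("si7",): "是",
    ("jit8", "pun2"): "日本",
    ("lang5",): "人",
}

def romanized_to_hanzi(sentence: str) -> str:
    syllables = sentence.split()
    out, i = [], 0
    while i < len(syllables):
        # Greedy longest match against the lexicon (2 syllables, then 1)
        for span in (2, 1):
            key = tuple(syllables[i:i + span])
            if key in LEXICON:
                out.append(LEXICON[key])
                i += span
                break
        else:
            out.append(syllables[i])  # pass unknown syllables through
            i += 1
    return "".join(out)

print(romanized_to_hanzi("goa2 si7 jit8 pun2 lang5"))  # 我是日本人
```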

Slide 32

Slide 32 text

Why is this finding interesting?
• Similarities to children's first language acquisition
  – Babies learn the phonology of their first language perfectly, regardless of its phonological complexity
  – Children need a great deal of conscious effort to learn writing
  – Is fine-tuning a pretrained multilingual self-supervised ASR model somewhat like first language acquisition?
  – Choi et al. (2024): “Self-Supervised Speech Representations are More Phonetic than Semantic”