Thirty Years of Progress in Speech Synthesis: A Personal Perspective on the Past, Present, and Future

Keiichi Tokuda, Nagoya Institute of Technology
December 5, 2025
Symposium on Speech & Behavior Informatics

Challenges in speech synthesis
• Converting text into speech: Text-to-Speech (TTS)
• Realizing machines that speak like humans, with capabilities such as:
  • Voice characteristics of arbitrary speakers
  • Various speaking styles (e.g., reading style, conversational style)
  • Emotional expression (e.g., joyful, sad)
  • Emphasis on specific words
  • Timing control and fillers
  • Other types of nonverbal information
  • And in any language!
  • Furthermore, even singing and rap performances!!

A brief history of speech synthesis
• Rule-based speech synthesis (–1980s)
  • Methods based on human expert knowledge
• Concatenative speech synthesis (1990s)
  • Data-driven methods based on waveform concatenation
• Statistical speech synthesis (mid-1990s–)
  • Machine-learning-based methods
  • Later accelerated by AI technologies (deep learning)
  • We have been working on this

Rule-based / formant speech synthesis
• Approaches based on hand-crafted rules and formant synthesizers
• KlattTalk: formant synthesizer [Klatt, JASA 1980]
• MITalk: text-to-speech system [Allen+, Cambridge University Press 1987]
• DECtalk: commercial product [Digital Equipment Corp. 1984]
Audio: DECtalk demo, Wikimedia Commons (CC BY-SA 3.0); “Daisy Bell” sung by DECtalk, Wikipedia (CC0 1.0)

Concatenative synthesis (fixed-unit)
Diphone synthesis
• PSOLA [Moulines+, SPECOM 1990]
• MBROLA [Dutoit+, ICSLP 1996]
Parametric concatenation
• Units stored as acoustic parameters (LPC, cepstrum, F0, etc.)
• Connected / interpolated in the parameter domain, and resynthesized with a vocoder [Imai, IECE-JA 1978]
Prosody modeling
• Fujisaki model (Japanese prosody generation) [Fujisaki+, IEICE-A 1993]
[Figure: stored units such as /a/, /i/, /u/ concatenated into synthesized speech]

Concatenative synthesis (unit selection)
• Automatically selects and concatenates waveform segments from a large database of recorded speech
• Target cost: how well does a unit match the target?
• Concatenation cost: how smoothly are adjacent units connected?
• Minimizes the total cost at runtime using dynamic programming
• Systems:
  • ν-talk [Sagisaka+, ATR, ICSLP 1992]
  • CHATR [Hunt+, ATR, ICASSP 1996]
  • Festival [Black+, Edinburgh, SPECOM 1998]
  • NextGen [Syrdal+, AT&T, ICSLP 2000]
[Figure: units for the target “a i u” selected from the database and joined into synthesized speech; audio examples labeled “works well” and “goes wrong”]
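The cost minimization on this slide can be illustrated with a small Viterbi-style search. This is only a sketch: the unit representation (a phone label with a pitch value), the two cost functions, and the function name `select_units` are hypothetical choices made for illustration, not those of any of the systems cited above.

```python
def select_units(targets, candidates, target_cost, concat_cost):
    """Pick one candidate unit per target position, minimizing
    the sum of target costs and concatenation costs (dynamic programming)."""
    # best[i][j] = (cheapest total cost of a path ending in candidates[i][j],
    #               index of the predecessor on that path)
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for c in candidates[i]:
            # Cheapest way to reach c from any unit at the previous position.
            prev_cost, prev_j = min(
                (best[i - 1][k][0] + concat_cost(p, c), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((prev_cost + target_cost(targets[i], c), prev_j))
        best.append(row)

    # Backtrack from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return path[::-1], total


# Toy example: units are (phone, pitch in Hz); both costs are assumptions.
targets = [("a", 100), ("i", 110), ("u", 120)]
candidates = [
    [("a", 95), ("a", 130)],
    [("i", 100), ("i", 112)],
    [("u", 119), ("u", 90)],
]
pitch_mismatch = lambda target, unit: abs(target[1] - unit[1])
pitch_jump = lambda prev, cur: 0.5 * abs(prev[1] - cur[1])

path, total = select_units(targets, candidates, pitch_mismatch, pitch_jump)
print(path, total)
```

Real systems use much richer costs (spectral distance at the join, phonetic context, duration), but the search structure is the same: the cost of the whole utterance is minimized jointly rather than unit by unit.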

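Statistical formulation of speech synthesis
• Modular pipeline: Text $\boldsymbol{w}$ (grapheme level) → Text analysis $P(\boldsymbol{l} \mid \boldsymbol{w}, \lambda_L)$ → Linguistic feature $\boldsymbol{l}$ (phoneme level) → Acoustic model $p(\boldsymbol{o} \mid \boldsymbol{l}, \lambda_A)$ → Acoustic feature $\boldsymbol{o}$ (frame level) → Vocoder $p(\boldsymbol{x} \mid \boldsymbol{o})$ → Speech $\boldsymbol{x}$ (sample level)
• Single generative model: Text $\boldsymbol{w}$ (grapheme level) → Generative model $p(\boldsymbol{x} \mid \boldsymbol{w}, \lambda)$ → Speech $\boldsymbol{x}$ (sample level)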
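One way to relate the modular pipeline to the single generative model is to marginalize over the intermediate representations. The block below is a sketch of that relationship, assuming the three modules are chained as on the slide; the hatted symbols are notation introduced here for the stepwise approximation and do not appear on the slide.

```latex
% Exact relationship: sum over linguistic features, integrate over acoustic features
p(\boldsymbol{x} \mid \boldsymbol{w}, \lambda)
  = \sum_{\boldsymbol{l}} \int
      p(\boldsymbol{x} \mid \boldsymbol{o})\,
      p(\boldsymbol{o} \mid \boldsymbol{l}, \lambda_A)\,
      P(\boldsymbol{l} \mid \boldsymbol{w}, \lambda_L)\, d\boldsymbol{o}

% Stepwise approximation used by a cascaded system
\hat{\boldsymbol{l}} = \arg\max_{\boldsymbol{l}} P(\boldsymbol{l} \mid \boldsymbol{w}, \lambda_L), \qquad
\hat{\boldsymbol{o}} = \arg\max_{\boldsymbol{o}} p(\boldsymbol{o} \mid \hat{\boldsymbol{l}}, \lambda_A), \qquad
\hat{\boldsymbol{x}} = \arg\max_{\boldsymbol{x}} p(\boldsymbol{x} \mid \hat{\boldsymbol{o}})
```

Read this way, each later slide that replaces or merges a module can be seen as removing one of the stepwise approximations.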

Hidden Markov model (HMM)-based approach
• Vocoder: mel-cepstral analysis + MLSA filter [Imai, ICASSP 1983], [Tokuda+, IEICE-JD 1991], [Fukada+, ICASSP 1992]
• Acoustic model: hidden Markov model (HMM) [Tokuda+, ICASSP 1995], …
• Pipeline: Text $\boldsymbol{w}$ (grapheme level) → Text analysis $P(\boldsymbol{l} \mid \boldsymbol{w}, \lambda_L)$ → Linguistic feature $\boldsymbol{l}$ (phoneme level) → Acoustic model $p(\boldsymbol{o} \mid \boldsymbol{l}, \lambda_A)$ → Acoustic feature $\boldsymbol{o}$ (frame level) → Vocoder $p(\boldsymbol{x} \mid \boldsymbol{o})$ → Speech $\boldsymbol{x}$ (sample level)

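Mel-cepstral analysis / MLSA filter [Imai, ICASSP 1983], [Tokuda+, IEICE-JD 1991], [Fukada+, ICASSP 1992]
• ML estimation of the mel-cepstrum: $\hat{\boldsymbol{c}} = \arg\max_{\boldsymbol{c}} p(\boldsymbol{x} \mid \boldsymbol{c})$
• When $\boldsymbol{x}$ is a Gaussian process, the negative log-likelihood $-\log p(\boldsymbol{x} \mid \boldsymbol{c})$ is convex with respect to $\boldsymbol{c}$, so the ML solution is a global optimum
• Spectral model: $H(e^{j\omega}) = \exp \sum_{m=0}^{M} c(m)\, e^{-j\tilde{\omega} m}$, with mel-cepstrum $\boldsymbol{c} = [c(0), c(1), \ldots, c(M)]^{\mathsf{T}}$
• Frequency transformation by a first-order all-pass function ($z = e^{j\omega}$): $e^{-j\tilde{\omega}} = \dfrac{e^{-j\omega} - \alpha}{1 - \alpha e^{-j\omega}}$
• Warped frequency: $\tilde{\omega} = \beta(\omega) = \tan^{-1} \dfrac{(1 - \alpha^2) \sin\omega}{(1 + \alpha^2) \cos\omega - 2\alpha}$
• $H(z)$ can be implemented by the MLSA filter structure
[Figure: warped frequency $\tilde{\omega} = \beta(\omega)$ (rad) versus frequency $\omega$ (rad), compared with the mel-scale frequency]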
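The frequency warping by the first-order all-pass function can be checked numerically. The sketch below evaluates the warped frequency in closed form and compares it with the phase of the all-pass function itself. The function names are hypothetical, and the value α = 0.42 is an assumption here: it is a commonly quoted choice for approximating the mel scale at a 16 kHz sampling rate and is not stated on the slide.

```python
import cmath
import math


def warp(omega, alpha):
    """Warped frequency beta(omega) of the first-order all-pass function.
    atan2 keeps the result in [0, pi] for omega in [0, pi]."""
    return math.atan2(
        (1 - alpha ** 2) * math.sin(omega),
        (1 + alpha ** 2) * math.cos(omega) - 2 * alpha,
    )


def allpass(omega, alpha):
    """First-order all-pass function evaluated on the unit circle;
    it equals exp(-j * warped frequency)."""
    z_inv = cmath.exp(-1j * omega)
    return (z_inv - alpha) / (1 - alpha * z_inv)


alpha = 0.42  # assumed value, see the note above
for k in range(0, 9):
    omega = math.pi * k / 8
    print(f"omega = {omega:.3f} rad -> warped = {warp(omega, alpha):.3f} rad")
```

With a positive α the low-frequency region is stretched, so the cepstral coefficients spend more of their resolution where hearing is more sensitive; with α = 0 the warping reduces to the identity.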

Spectral estimation example
[Figure: log magnitude (dB, axis 0 to 100) versus frequency (kHz, axis 0 to 8)]

Hidden Markov model (HMM)
[Figure: three-state left-to-right HMM with self-transitions $a_{11}$, $a_{22}$, $a_{33}$, forward transitions $a_{12}$, $a_{23}$, and output probabilities $b_1(\boldsymbol{o}_t)$, $b_2(\boldsymbol{o}_t)$, $b_3(\boldsymbol{o}_t)$; the observation sequence $\boldsymbol{o} = (\boldsymbol{o}_1, \boldsymbol{o}_2, \ldots, \boldsymbol{o}_T)$ is aligned with a state sequence $\boldsymbol{q}$ such as 1, 1, 1, 1, 2, 2, 3, 3]
• $a_{ij}$: state transition probability
• $b_q(\boldsymbol{o}_t)$: state output probability
• Each state output probability $b_q(\boldsymbol{o}_t)$ is controlled by a regression tree conditioned on the linguistic feature $\boldsymbol{l}$

Challenges in HMM-based speech synthesis
• Conditional independence assumption of output probabilities
  • Parameter generation algorithm
• Modeling of F0 (fundamental frequency)
  • Multi-space distribution HMM (state output distributions for F0)
  • Multi-stream / full-context clustering
• Modeling of state duration
  • Hidden semi-Markov model (HSMM)
• Simultaneous modeling of spectrum, F0, and duration
References: [Tokuda+, ICASSP 1995, EUROSPEECH 1995, ICASSP 2000], [Yoshimura+, ICSLP 1998], [Tokuda+, ICASSP 1999], [Yoshimura+, EUROSPEECH 1999]
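The idea behind parameter generation with dynamic features can be sketched for a one-dimensional static feature. The code assumes that each frame's observation is the static value plus a delta computed as 0.5 (c[t+1] − c[t−1]), with zeros beyond the edges, and that all variances are diagonal; it then solves the weighted least-squares system (WᵀΣ⁻¹W) c = WᵀΣ⁻¹μ. The window, the function names, and the plain Gaussian elimination are illustrative assumptions, not the implementation used in any particular toolkit.

```python
def build_window_matrix(T):
    """Rows 2t and 2t+1 map the static sequence c to (c[t], delta c[t])."""
    W = []
    for t in range(T):
        static = [0.0] * T
        static[t] = 1.0
        delta = [0.0] * T
        if t > 0:
            delta[t - 1] = -0.5
        if t < T - 1:
            delta[t + 1] = 0.5
        W.append(static)
        W.append(delta)
    return W


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        residual = M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = residual / M[i][i]
    return x


def mlpg(means, variances):
    """means/variances are interleaved per frame: [static_0, delta_0, static_1, ...]."""
    T = len(means) // 2
    W = build_window_matrix(T)
    rows = range(2 * T)
    A = [[sum(W[r][i] * W[r][j] / variances[r] for r in rows) for j in range(T)]
         for i in range(T)]
    b = [sum(W[r][i] * means[r] / variances[r] for r in rows) for i in range(T)]
    return solve(A, b)


# Step-like static means with zero delta means: the delta constraint
# couples neighboring frames, so the result need not follow the step exactly.
step_means = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
print(mlpg(step_means, [1.0] * len(step_means)))
```

Without the delta rows the solution would simply be the sequence of state means, which is piecewise constant; the dynamic features are what tie adjacent frames together.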

Generated speech parameter trajectory
[Figure: generated parameter trajectory for the sequence /sil/ /a/ /i/ /sil/]

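Trajectory HMM [Tokuda+, EUROSPEECH 2003], [Zen+, CSL 2007]
• $\dfrac{1}{Z_{\boldsymbol{c}}} P(\boldsymbol{o} \mid \boldsymbol{q}, \hat{\lambda}) = \mathcal{N}(\boldsymbol{c} \mid \bar{\boldsymbol{c}}_{\boldsymbol{q}}, \boldsymbol{P}_{\boldsymbol{q}})$
• $Z_{\boldsymbol{c}} = \displaystyle\int P(\boldsymbol{o} \mid \boldsymbol{q}, \hat{\lambda})\, d\boldsymbol{c}$
[Figure: mean trajectory $\bar{\boldsymbol{c}}_{\boldsymbol{q}}$ and temporal covariance matrix $\boldsymbol{P}_{\boldsymbol{q}}$ over time (frames 5 to 55) for the sequence sil a i d a sil, with and without dynamic features]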

Structure of state output (observation) vector
• Spectrum part: spectral parameters (e.g., mel-cepstrum) and their dynamic features
• Excitation part (e.g., F0): log F0 with voiced / unvoiced information, and its dynamic features
• Dynamic features correspond to time derivatives [Furui, IEEE TASSP 1986]
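As a small illustration of what "corresponds to the time derivative" means, dynamic features can be computed as a weighted difference of neighboring frames. The window coefficients and the edge handling (repeating the first and last frame) are assumptions chosen for simplicity; actual systems may use different windows.

```python
def dynamic_feature(seq, window=(-0.5, 0.0, 0.5)):
    """Apply a 3-tap window centered on each frame.
    The default window approximates the first time derivative."""
    T = len(seq)
    out = []
    for t in range(T):
        acc = 0.0
        for k, w in enumerate(window):
            idx = min(max(t + k - 1, 0), T - 1)  # repeat edge frames
            acc += w * seq[idx]
        out.append(acc)
    return out


static = [0.0, 1.0, 4.0, 9.0, 16.0]
delta = dynamic_feature(static)                            # first-order dynamic feature
delta2 = dynamic_feature(static, window=(1.0, -2.0, 1.0))  # second-order dynamic feature

# One observation vector per frame: static value followed by its dynamic features.
observations = list(zip(static, delta, delta2))
print(observations)
```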

Stream-dependent tree-based clustering
• HMM: regression trees for spectrum parameters and regression trees for F0 parameters
• State duration model: regression tree for state duration models (three-dimensional Gaussian)
• Trained using the EM algorithm
• Each regression tree is conditioned on the linguistic feature $\boldsymbol{l}$

Flexibility to control speech variations
• Speaker adaptation (mimicking voices)
  • [Tamura+, ESCA SSW 1998], [Tamura+, ICASSP 2001], [Yamagishi+, ICASSP 2003], …
• Speaker interpolation (mixing voices)
  • [Yoshimura+, EUROSPEECH 1997], …
• Eigenvoice (producing voices)
  • [Shichiri+, ICSLP 2002], [Kazumi+, ICASSP 2010], …
• Cross-lingual (speaking in another language)
  • [Wu+, ISCSLP 2008], [Oura+, ICASSP 2010], …
(Citations are limited to publications by the HTS working group.)

My Personal History in Statistical Speech Synthesis
• –1995: Research on speech spectrum analysis, speech coding, and adaptive filters
• 1995–: Rise of unit-selection speech synthesis
• 1995: Proposal of the algorithm for generating parameters from HMMs
• 1995–1999: Developed a complete system, and enjoyed the journey
• 2001–2002: Sabbatical at Carnegie Mellon University (global dissemination of English TTS systems)
• 2002: Release of HTS version 1.0
• 2002: IEEE Speech Synthesis Workshop (introducing the English system)
• 2005: Blizzard Challenge started
• 2005–: Practical applications began to appear (Voice Signal, SVOX, iFLYTEK, ATR, Nuance Communications, KDDI Labs, NTT DOCOMO, Google Android, etc.)
• 2008–2011: EU FP7 EMIME Project (Edinburgh, Cambridge, Helsinki Tech, IDIAP, Nokia, NITech)
• 2011–2017: JST CREST uDialogue Project (NITech, Edinburgh, NII)
• 2013: Real-world deployment: CeVIO Project, JOYSOUND Vocal Assist
• 2013: Emergence of DNN-based speech synthesis
• 2014–2015: Sabbatical at Google (DNN-based waveform generation)
• 2016: WaveNet

Blizzard Challenge 2005 [Black+, INTERSPEECH 2005]
Discussions with Prof. Alan Black during my sabbatical stay at CMU (2001–2002):
• The performance of a TTS system strongly depends on the speech database
• It is difficult to fairly compare speech synthesis techniques themselves
• Hence the need for an evaluation campaign for speech synthesis systems using a common dataset
Alan: “I’ve recorded the data. Let’s do it.” (at the ISCA Speech Synthesis Workshop 2004)

Improvements introduced through the Blizzard Challenge
1. Introduction of the hidden semi-Markov model (HSMM): joint training of duration models [Zen+, IEICE-D 2007]
2. GV (global variance)-based parameter generation: recovering from the over-smoothing caused by acoustic models [Toda+, INTERSPEECH 2005]
3. Introduction of STRAIGHT: pitch-synchronous analysis and band aperiodicity measures [Kawahara+, SPECOM 1999]
Key persons: Tomoki Toda, Heiga Zen
Pipeline: Text $\boldsymbol{w}$ → Text analysis $P(\boldsymbol{l} \mid \boldsymbol{w}, \lambda_L)$ → Linguistic feature $\boldsymbol{l}$ → Acoustic model $p(\boldsymbol{o} \mid \boldsymbol{l}, \lambda_A)$ → Acoustic feature $\boldsymbol{o}$ → Vocoder $p(\boldsymbol{x} \mid \boldsymbol{o})$ → Speech $\boldsymbol{x}$

2024 IEEE James L. Flanagan Speech and Audio Processing Award
For contributions to statistical speech synthesis and speech signal processing Recognizing work that laid the foundations for modern neural speech generation.  

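Introduction of deep neural networks
• DNN-based speech synthesis [Zen+, ICASSP 2013]
• LSTM-based speech synthesis [Fan+, INTERSPEECH 2014], etc.
• Pipeline: Text $\boldsymbol{w}$ → Text analysis $P(\boldsymbol{l} \mid \boldsymbol{w}, \lambda_L)$ → Linguistic feature $\boldsymbol{l}$ → Acoustic model $p(\boldsymbol{o} \mid \boldsymbol{l}, \lambda_A)$ (FFNN, LSTM) → Acoustic feature $\boldsymbol{o}$ → Vocoder $p(\boldsymbol{x} \mid \boldsymbol{o})$ → Speech $\boldsymbol{x}$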

Sabbatical Leave (2014–2015): one-year stay at Google
• I decided to pursue risky research, since I was temporarily free from daily responsibilities.
  “Wait… isn’t risky research what we are supposed to do at universities?”
• Abundant resources
  • Google’s computational resources
  • Google’s software tools
  • My former student Heiga Zen (working at Google)
  • And some of my own time
• DNNs were used for acoustic modeling, but quality was still limited by vocoders
• Research topic: direct modeling of speech waveforms using neural networks

Direct Modeling of Speech Waveforms
• Training neural networks to directly maximize the likelihood of speech waveforms
• Directly modeling speech waveforms by neural networks [Tokuda+, ICASSP 2015]
• Directly modeling voiced and unvoiced components by neural networks [Tokuda+, ICASSP 2016]
• Pipeline: Text $\boldsymbol{w}$ → Text analysis $P(\boldsymbol{l} \mid \boldsymbol{w}, \lambda_L)$ → Linguistic feature $\boldsymbol{l}$ → Acoustic model $p(\boldsymbol{o} \mid \boldsymbol{l}, \lambda_A)$ (FFNN, LSTM) → Acoustic feature $\boldsymbol{o}$ → Vocoder $p(\boldsymbol{x} \mid \boldsymbol{o})$ (source-filter model; direct waveform modeling) → Speech $\boldsymbol{x}$

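Speech signal model (signal model for unvoiced + voiced sounds)
• Pulse train $p(n)$ → filter $h_v(n)$ → voiced component $v(n) = p(n) * h_v(n)$
• White Gaussian noise $e(n) \sim \mathcal{N}(0, 1)$ → filter $h_u(n)$ → unvoiced component $u(n) = e(n) * h_u(n)$
• Speech signal: $x(n) = v(n) + u(n)$
• $h_v(n) = \dfrac{1}{2\pi} \displaystyle\int_{-\pi}^{\pi} H_v(e^{j\omega})\, e^{j\omega n}\, d\omega$
• $h_u(n) = \dfrac{1}{2\pi} \displaystyle\int_{-\pi}^{\pi} H_u(e^{j\omega})\, e^{j\omega n}\, d\omega$
• Filter phase labels shown on the slide: minimum phase, mixed phase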
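The two-branch signal model can be sketched in a few lines. The code assumes that the voiced branch filters a pulse train and the unvoiced branch filters white Gaussian noise, and that the speech signal is their sum. The short impulse responses, the pitch period, and the function names are made-up illustrative values, not parameters from the cited papers.

```python
import random


def pulse_train(n_samples, period):
    """Unit pulses every `period` samples, zeros elsewhere."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def convolve(signal, impulse_response):
    """Causal FIR filtering: out[n] = sum_k h[k] * signal[n - k]."""
    out = [0.0] * len(signal)
    for n in range(len(signal)):
        for k, h in enumerate(impulse_response):
            if n - k >= 0:
                out[n] += h * signal[n - k]
    return out


def synthesize(n_samples, period, h_v, h_u, seed=0):
    """Return (x, v, u): speech signal, voiced component, unvoiced component."""
    rng = random.Random(seed)
    p = pulse_train(n_samples, period)                   # pulse train p(n)
    e = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]  # white Gaussian e(n)
    v = convolve(p, h_v)                                 # voiced: p(n) * h_v(n)
    u = convolve(e, h_u)                                 # unvoiced: e(n) * h_u(n)
    x = [a + b for a, b in zip(v, u)]
    return x, v, u


# Hypothetical filters: a decaying voiced response and a weak noise shaping filter.
x, v, u = synthesize(n_samples=16, period=4, h_v=[1.0, 0.5, 0.25], h_u=[0.1, 0.05])
print([round(s, 3) for s in x])
```

In the direct waveform modeling work cited on the previous slide, the filters are not fixed like this but are predicted by neural networks and trained by maximizing the waveform likelihood; the sketch only shows the generative structure.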

Direct Modeling of Speech Waveforms: “WaveNet”
• WaveNet: A Generative Model for Raw Audio [van den Oord+, INTERSPEECH 2016]
• Pipeline: Text $\boldsymbol{w}$ → Text analysis $P(\boldsymbol{l} \mid \boldsymbol{w}, \lambda_L)$ → Linguistic feature $\boldsymbol{l}$ → Autoregressive generative model $p(\boldsymbol{x} \mid \boldsymbol{l}, \lambda_{AV})$ (direct waveform modeling) → Speech $\boldsymbol{x}$

WaveNet vocoder [Tamamori+, INTERSPEECH 2017]
• Neural vocoder (model parameter: $\lambda_V$)
• Pipeline: Text $\boldsymbol{w}$ → Text analysis $P(\boldsymbol{l} \mid \boldsymbol{w}, \lambda_L)$ → Linguistic feature $\boldsymbol{l}$ → Acoustic model $p(\boldsymbol{o} \mid \boldsymbol{l}, \lambda_A)$ → Acoustic feature $\boldsymbol{o}$ → Vocoder $p(\boldsymbol{x} \mid \boldsymbol{o}, \lambda_V)$ (direct waveform modeling) → Speech $\boldsymbol{x}$

WaveNet
• Autoregressive generative model using a convolutional NN
• Directly modeling the speech waveform
• Dilated causal convolution
[Figure legend: the waveform is modeled by a CNN, conditioned on acoustic and linguistic features]
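Dilated causal convolution can be illustrated without any neural network library. The sketch below implements a single-channel layer in which each output depends only on the current and past inputs, and computes the receptive field of a stack of such layers. The all-ones weights are placeholders used to make the reach of the stack visible; gating, residual connections, and conditioning are omitted.

```python
def causal_dilated_conv(x, weights, dilation):
    """y[t] = sum_k weights[k] * x[t - k * dilation], treating x[t] = 0 for t < 0.
    No future sample is ever used, which makes the layer causal."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        y.append(acc)
    return y


def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence one output sample."""
    return 1 + (kernel_size - 1) * sum(dilations)


# Push an impulse through four layers with doubling dilation.
signal = [1.0] + [0.0] * 19
for d in [1, 2, 4, 8]:
    signal = causal_dilated_conv(signal, [1.0, 1.0], d)
print(signal)
print(receptive_field(2, [1, 2, 4, 8]))
```

Doubling the dilation at each layer makes the receptive field grow exponentially with depth while the number of weights grows only linearly, which is what makes sample-level modeling of long contexts practical.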

Speech signal generation model
• Conventional model: excitation signal $e(n)$ → long-term predictor $P_L(z)$ (controlled by F0) and short-term predictor $P_S(z)$ (controlled by the spectrum parameter) → speech signal $x(n)$
• WaveNet: excitation signal $e(n)$ → non-linear predictor → speech signal $x(n)$, conditioned on F0 and the spectrum parameter, i.e., $\boldsymbol{o}$ (initially, linguistic feature $\boldsymbol{l}$), or on a mel-spectrogram
• It works!
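For contrast with the non-linear predictor, the conventional linear structure can be sketched as two recursive filters in cascade. The code assumes a single-tap long-term (pitch) predictor P_L(z) = g z^(−lag) followed by a short-term (LPC) predictor P_S(z) = Σ a_k z^(−k), each used inside a synthesis filter of the form 1 / (1 − P(z)). This is one common textbook arrangement and is an assumption about what the slide depicts; the coefficients below are arbitrary.

```python
def synthesize(excitation, lpc, pitch_lag, pitch_gain):
    """Generate speech x(n) from excitation e(n) with linear predictors.

    r(n) = e(n) + pitch_gain * r(n - pitch_lag)   (long-term prediction)
    x(n) = r(n) + sum_k lpc[k-1] * x(n - k)       (short-term prediction)
    """
    r = []  # output of the long-term synthesis filter
    x = []  # output of the short-term synthesis filter
    for n, e in enumerate(excitation):
        long_term = pitch_gain * r[n - pitch_lag] if n >= pitch_lag else 0.0
        r.append(e + long_term)
        short_term = sum(
            a * x[n - k] for k, a in enumerate(lpc, start=1) if n - k >= 0
        )
        x.append(r[n] + short_term)
    return x


# A single excitation pulse; arbitrary predictor coefficients.
speech = synthesize([1.0, 0.0, 0.0, 0.0, 0.0, 0.0], lpc=[0.5], pitch_lag=3, pitch_gain=0.5)
print(speech)
```

Both predictors here are linear in past samples. In the WaveNet view, a single non-linear predictor takes over both roles and receives F0 and spectral information as conditioning rather than as filter coefficients.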

Famous words in speech technology (1980s) “Every time I fire
a linguist, the performance of the speech recognizer goes up” by Frederick Jelinek “Every time I fire a speech synthesis researcher, the performance of the speech synthesizer goes up” by ????? ?????

Various Approaches to Neural Waveform Modeling
• Autoregressive
  • WaveNet, SampleRNN, WaveRNN, …
• Normalizing flow
  • WaveGlow, Parallel WaveNet, ClariNet, FloWaveNet, …
• Non-autoregressive (+GAN) (+upsampling)
  • Parallel WaveNet, Parallel WaveGAN, HiFi-GAN
• Diffusion probabilistic model / flow matching
  • WaveGrad, PriorGrad, SpecGrad, …
• Combining with source-filter model
  • LPCNet, ExcitNet, GlotNet, LP-WaveNet, …
• Introducing signal processing techniques
  • SubbandWaveNet, FFTNet, iSTFTNet, …
• Non-upsampling
  • VOCOS, WaveNeXt
Excitation-driven approach ⇕ upsampling-based approach, combined with signal processing techniques, where models are designed for parallel computation

PeriodNet: neural vocoder based on periodic/aperiodic decomposition [Hono+, ICASSP 2020]
① Periodic component generated by sinusoidal excitation
② Aperiodic component generated by noise excitation
③ Final speech signal
[Figure: acoustic features condition the generation of ① and ②, which lead to ③]

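Modeling temporal structure (omitted in this talk)
• Explicit duration models (FastSpeech-type, diffusion-based models)
• Attention-based models (Tacotron-type)
• Monotonic alignment search (e.g., VITS)
• HMM / HSMM-based models (e.g., deep-HSMM)
• Pipeline: Text $\boldsymbol{w}$ (grapheme level) → Text analysis $P(\boldsymbol{l} \mid \boldsymbol{w}, \lambda_L)$ → Linguistic feature $\boldsymbol{l}$ (phoneme level) → Acoustic model $p(\boldsymbol{o} \mid \boldsymbol{l}, \lambda_A)$ → Acoustic feature $\boldsymbol{o}$ (frame level) → Vocoder $p(\boldsymbol{x} \mid \boldsymbol{o})$ → Speech $\boldsymbol{x}$ (sample level)
• The temporal structure is the mapping inside the acoustic model: phoneme level ⇓ frame level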

EnCodec High Fidelity Neural Audio Compression https://github.com/facebookresearch/encodec

Neural Codec Language Models https://www.microsoft.com/en-us/research/project/vall-e-x/ https://arxiv.org/abs/2301.02111    

Multimodal Large Language Models https://arxiv.org/abs/2306.13549v4

Large-scale pretrained models (+𝛼)
• Wav2vec 2.0, HuBERT, Spin (content representation)
• SoundStream, EnCodec (neural audio codec / discretization)
• ECAPA-TDNN (speaker embedding)
• BigVGAN (neural vocoder)
• VALL-E (speech generation)
• Whisper (speech recognition / representation learning)
• BERT/RoBERTa (text representation)
• …

Other important technical issues (not covered today)
• Text analysis
• Shared / common datasets
• Text normalization
• Voice conversion / speech conversion
• Physical simulation of sound production
• Increasing complexity of user interfaces

Societal and Ethical Issues
• Welfare and healthcare applications
  • Visual impairment / speech disorders
• Overcoming language barriers
  • Cross-lingual dubbing (voice-preserving)
• Detection of fake / synthetic speech
• Training with unauthorized data
• Relationship between voice professionals and speech technology

Summary
• We will probably continue to experience moments like “Wow, it really works!” and “I never imagined it could be used this way.”
• The tension between “explicit knowledge about speech and language” and “the power of data” will remain a key driving force.
• What makes speech truly fascinating is its deep connection to human perception and emotion.
• The social impact and importance of speech technology will continue to grow.
“Is speech research ever ending?”

Special thanks
• Supervisors: Satoshi Imai, Tadashi Kitamura, Takao Kobayashi
• Colleagues and students: Takashi Masuko, Noboru Miyazaki, Takayoshi Yoshimura, Shinji Sako, Masatsune Tamura, Junichi Yamagishi, Tomoki Toda, Heiga Zen, Kazuhito Koishida, Tetsuya Yamada, Nobuaki Mizutani, Ryuta Terashima, Akinobu Lee, Keiichiro Oura, Keijiro Saino, Kenichi Nakamura, Yi-Jian Wu, Ling-Hui Chen, Shifeng Pan, Yoshihiko Nankaku, Ranniery Maia, Sayaka Shiota, Chiyomi Miyajima, Kei Hashimoto, Shinji Takaki, Kazuhiro Nakamura, Kei Sawada, Takenori Yoshimura, Daisuke Yamamoto, Yukiya Hono, Takato Fujimoto, …
• Collaborators, mentors, and research colleagues in the speech research community: Junichi Takami, Naoto Iwahashi, Mike Schuster, Satoshi Nakamura, Frank Soong, Michael Picheny, Simon King, Steve Young, Mari Ostendorf, Alan Black, Alex Acero, Bill Byrne, Phil Woodland, Thomas Hain, Phil Garner, Masataka Goto, Shigeru Katagiri, Sadaoki Furui, Hideki Kenmochi, Kazuya Takeda, Tatsuya Kawahara, Seiichi Nakagawa, Keikichi Hirose, Tetsunori Kobayashi, Mikko Kurimo, Shigeki Sagayama, Kiyohiro Shikano, Hisashi Kawai, Nobuyuki Nishizawa, Minoru Tsuzaki, Yoichi Yamashita, Nobuaki Minematsu, Mat Shannon, Mark Gales, Kai Yu, John Dines, …
Listed in no particular order. My apologies if I have missed anyone.

Thank you for your kind attention!
