Development and Operation of Expressive Speech Synthesis System

Kosuke Futamata / LINE
Ryo Terashima / LINE

Agenda
- Introduction
- Emotional TTS from Neutral Data
- TTS System for Product

Agenda
- Introduction
  - What is “TTS”?
  - Our TTS System “Coharis”
- Emotional TTS from Neutral Data
- TTS System for Product

What is “TTS”? Text to Speech: text (e.g., “Hello”) is converted into speech.

What is “TTS”?
Text (ありがとう, “thank you”) → Text processing → Linguistic features [ a- ri^ ga- to- o- ] → Acoustic model → Acoustic features → Vocoder → Speech


Coharis: COntrollable, High-quAlity, and expRessIve TTS

Expressive TTS System Coharis
Text (ありがとう) → Text processing → Acoustic model → Vocoder
Expressive styles: Neutral, Conversation, Narration, Happy, Sad, Angry
Focus on Emotion (Happy, Sad, Angry)

Use cases of Coharis
- Audiobook
- Narration
- Telephone Answering

Problems in building emotional TTS
- Emotional Data: Recording required
- Burden on Speakers: Acting skills needed
- Low Resource Data: Only neutral data

Emotional TTS using neutral data
- Speech database (A) with emotional data → Emotional TTS
- Speech database (B) with only neutral data

Emotional TTS using neutral data
- Speech database (A) with emotional data → Emotional TTS
- Speech database (B) with pseudo emotional data, obtained by emotion transfer → Emotional TTS!

Emotional TTS using neutral data
- Data augmentation via pitch-shifting
- Data augmentation via voice conversion

Data augmentation via voice conversion: VC model training
- Source speaker: Neutral, Happiness, Sadness
- Target speaker: Neutral
- Target speaker (VC): Neutral, Happiness, Sadness

Data augmentation via voice conversion: Generation of pseudo-emotional data using VC
- Source speaker: Neutral, Happiness, Sadness
- Target speaker: Neutral
- Target speaker (VC): Neutral, Happiness, Sadness, generated by VC (Style: Source speaker, Identity: Target speaker)

Data augmentation via pitch-shifting
Generate data for various pitches: the original speech is pitch-shifted from -3 to +12 semitones.
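
The shift range on this slide can be read as a set of frequency ratios: shifting by n semitones multiplies the fundamental frequency by 2^(n/12). The sketch below is only an illustration, assuming one-semitone steps between -3 and +12 (the slide gives the range, not the step size). The names `semitone_ratio`, `shift_f0`, and `SHIFTS` are hypothetical and are not part of Coharis.

```python
def semitone_ratio(n_semitones: float) -> float:
    """Frequency ratio for a shift of n semitones (equal temperament)."""
    return 2.0 ** (n_semitones / 12.0)


def shift_f0(f0_hz: float, n_semitones: float) -> float:
    """Fundamental frequency after shifting by n semitones."""
    return f0_hz * semitone_ratio(n_semitones)


# Assumed grid: every integer step from -3 to +12 semitones,
# skipping 0 because that is the original recording.
SHIFTS = [n for n in range(-3, 13) if n != 0]

if __name__ == "__main__":
    for n in SHIFTS:
        print(f"{n:+d} semitones -> x{semitone_ratio(n):.3f}")
```

A shift of +12 semitones doubles the pitch (one octave up), while -3 semitones lowers it by roughly 16%.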

Performance evaluations: compared systems
- MS-TTS (Multi-speaker): Source speaker: Neutral, Happiness, Sadness; Target speaker: Neutral
- VC-TTS (Single-speaker with VC): Target speaker: Neutral, plus Happiness and Sadness generated via VC
- VC-TTS-PS (Single-speaker with PS & VC): Target speaker: Neutral, Neutral w/ PS, plus Happiness and Sadness generated via VC

Performance evaluations: Performance on Naturalness (score axis from 1 to 5)
- Happiness: MS-TTS 2.85, VC-TTS 3.88, VC-TTS-PS 4.20
- Sadness: MS-TTS 2.74, VC-TTS 4.00, VC-TTS-PS 4.00

Performance evaluations: Performance on Emotional Similarity (score axis from 1 to 4)
- Happiness: MS-TTS 2.48, VC-TTS 3.33, VC-TTS-PS 3.82
- Sadness: MS-TTS 2.72, VC-TTS 3.63, VC-TTS-PS 3.67

Performance evaluations: Experiments with less than half of the recorded speech data of the target speaker
Naturalness (score axis from 1 to 5)
- Happiness: VC-TTS-PS 4.20, VC-TTS-PS (40%) 4.08
- Sadness: VC-TTS-PS 4.00, VC-TTS-PS (40%) 3.96
Emotional Similarity (score axis from 1 to 4)
- Happiness: VC-TTS-PS 3.82, VC-TTS-PS (40%) 3.87
- Sadness: VC-TTS-PS 3.67, VC-TTS-PS (40%) 3.63

Summary
- We proposed a cross-speaker emotion style transfer method that combines PS-based and VC-based data augmentation.
- Our method improved the performance of emotional TTS. In particular, the experiments showed that our method significantly enhanced the naturalness and emotional similarity of speech with a happiness style.
Audio samples: https://ryojerky.github.io/demo_vc-tts-ps/

TTS system for product: About the architecture of Coharis
- How Coharis works
- Introduce microservice architecture on Coharis
- Benefits and difficulties after rebuilding

TTS flow with monolithic architecture
Coharis server: raw text (“こんにちは”, “hello”) in, raw speech out. The steps run in this order:
- Text segmentation / normalization
- Grapheme-to-phoneme
- Phrase break prediction
- Accent prediction
- Mix up linguistic features
- Acoustic model
- Vocoder
The first steps are for text processing; the acoustic model and vocoder are for TTS.

Input: Raw text with acoustic parameters is given to the Coharis server through REST.

Text segmentation / normalization: Split and normalize sentences, e.g., 10:00 am → 午前10時
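
The normalization example on this slide (10:00 am → 午前10時) can be imitated with a single rewrite rule. The sketch below is a toy illustration only: it assumes a regex-based rule for 12-hour clock times, whereas the slide does not say how Coharis implements normalization. `normalize_time` is a hypothetical name.

```python
import re

# Hypothetical rule: 12-hour clock times followed by am/pm.
# Lookarounds are used instead of \b because Japanese characters also
# count as word characters, so \b would not match next to them.
_TIME = re.compile(r"(?<!\d)(\d{1,2}):(\d{2})\s*(am|pm)(?![A-Za-z])", re.IGNORECASE)


def normalize_time(text: str) -> str:
    """Rewrite times such as '10:00 am' into a Japanese reading form."""
    def repl(m):
        hour, minute = int(m.group(1)), int(m.group(2))
        prefix = "午前" if m.group(3).lower() == "am" else "午後"
        # Omit the minutes when they are zero, as in the slide's example.
        suffix = f"{minute}分" if minute else ""
        return f"{prefix}{hour}時{suffix}"
    return _TIME.sub(repl, text)


if __name__ == "__main__":
    print(normalize_time("10:00 am"))
```

A production normalizer would need many more rules (dates, currencies, units, abbreviations); this sketch covers only the one pattern shown on the slide.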

Grapheme-to-phoneme: Convert raw text into the appropriate pronunciation, e.g., 午前10時 → ゴゼンジュージ

Phrase break prediction: Predict appropriate pause positions, e.g., あのねあそこにね → あのね / あそこにね

Accent prediction: Predict appropriate accents, e.g., 音声 → オ_H / ン_L / セ_L / ー_L

Mix up linguistic features: Align and combine all linguistic features for TTS

Acoustic model: Convert linguistic features into acoustic features

Vocoder: Convert acoustic features into speech
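
The monolithic flow walked through on these slides amounts to one process running a fixed chain of stages. The sketch below only illustrates that structure: every stage is a stub that records its name and does no real processing, and `Utterance`, `make_stage`, `PIPELINE`, and `synthesize` are hypothetical names, not Coharis code.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Utterance:
    text: str
    trace: List[str] = field(default_factory=list)  # stages applied so far


Stage = Callable[[Utterance], Utterance]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records its name."""
    def stage(u: Utterance) -> Utterance:
        u.trace.append(name)
        return u
    return stage


# Stage order as listed on the slides; every stage here is a stub.
PIPELINE: List[Stage] = [make_stage(n) for n in (
    "text segmentation / normalization",
    "grapheme-to-phoneme",
    "phrase break prediction",
    "accent prediction",
    "mix up linguistic features",
    "acoustic model",
    "vocoder",
)]


def synthesize(text: str) -> Utterance:
    """Run all stages in order inside a single process."""
    u = Utterance(text)
    for stage in PIPELINE:
        u = stage(u)
    return u


if __name__ == "__main__":
    print(synthesize("こんにちは").trace)
```

Because all stages live in one list inside one process, replacing a single stage means rebuilding and redeploying the whole server.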

Problems in monolithic architecture
- Tightly coupled: Difficult to update individual modules quickly
- Scalability: Cannot deploy each service independently
- Slow development: Takes a long time to build and deploy

Why microservice in our case?
- Increase the number of App-specific APIs
- Fast development
- GPU resources must be used efficiently, since the number of available GPUs is limited
- Let ML engineers focus only on developing ML modules (not server/infrastructure)
- Less time to update models and dictionaries compared with the monolithic architecture
- Scale up each module independently

Coharis with microservice architecture
- BACKEND MICROSERVICE: Preprocessing, G2P, Phrase break, Accent, (TTS components)
- PROXY API SERVICE: PROXY SERVICE
- APP SERVICE: CLOVA SERVICE, AUDIOBOOK SERVICE, APP SERVICE 1, APP SERVICE 2
- CLIENTS: CLOVA VOICE, AUDIOBOOK, APP1, APP2

Scale up each module independently: only the modules with high GPU usage need to be scaled up, and they can be scaled independently of the others.

Increase the number of App-specific APIs: the API service aggregates data coming from each backend service to offer App-specific APIs.

Each App service sends different requests and receives different responses, and the resource usage of each backend service differs.

App-specific API example: Generate speech from raw text (the diagram marks the activated modules).

App-specific API example: Generate linguistic features from raw text (the diagram marks the activated modules; NAVER GATEWAY appears among the App services).

App-specific API example: Generate speech from linguistic features (the diagram marks the activated modules; EXTERNAL SERVICE appears among the App services).

Fast development: updating the backend modules takes less time because they are independent of each other.

Techniques to speed up inference time
- Support adaptive batching: Requests arriving at short intervals are batched together in the proxy server
- Support gRPC protocol: The services are connected by gRPC to reduce network delay
- Use CPU-based workers to handle sharp increases in requests: CPU-based TTS workers are supported alongside the GPU-based ones
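
The adaptive batching idea (requests arriving at short intervals are grouped in the proxy) can be sketched as an offline simulation over arrival timestamps. This is an assumption-laden illustration rather than the Coharis implementation: the window length `max_wait`, the cap `max_batch_size`, and the function name `batch_requests` are all hypothetical, and a real proxy would use timers on live traffic instead of a pre-sorted list.

```python
from typing import List, Sequence, Tuple

Request = Tuple[float, str]  # (arrival time in seconds, text)


def batch_requests(requests: Sequence[Request], max_wait: float = 0.01,
                   max_batch_size: int = 8) -> List[List[str]]:
    """Group time-ordered requests into batches.

    A batch is closed when it is full, or when the next request arrives
    more than `max_wait` seconds after the first request in the batch.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    window_start = 0.0
    for arrival, text in requests:
        if current and (arrival - window_start > max_wait
                        or len(current) >= max_batch_size):
            batches.append(current)
            current = []
        if not current:
            window_start = arrival  # a new batch starts with this request
        current.append(text)
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    stream = [(0.000, "a"), (0.004, "b"), (0.009, "c"), (0.050, "d")]
    print(batch_requests(stream))
```

Larger windows give bigger batches and better GPU utilization, at the cost of extra latency for the first request in each batch.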

Microservice architecture on Coharis
- App services: call the Proxy service over REST
- Proxy service: calls the backend services over gRPC
- BACKEND SERVICES: Segmentation service, Normalization service, Phrase break service, Accent service, Text formatter service, Acoustic model service, Vocoder service
- Redis: Cache storage; Queue TTS jobs
- CPU-based workers
- Model storage: Load pre-trained models

API call flow (shown on the same architecture diagram: App services, Proxy service, backend services, Redis, CPU-based workers, Model storage)

CPU-based TTS workers (shown on the same architecture diagram; TTS jobs are queued in Redis)

Model loading and cache mechanism (shown on the same architecture diagram; pre-trained models are loaded from Model storage, and Redis serves as Cache storage)
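
The slide does not say what the cache stores. One common choice, assumed here purely for illustration, is to cache synthesized audio keyed by a hash of the request so that repeated requests skip inference. In the sketch below an in-memory dict stands in for the cache storage, and `cache_key` and `CachedSynthesizer` are hypothetical names.

```python
import hashlib
import json
from typing import Callable, Dict


def cache_key(text: str, params: Dict[str, float]) -> str:
    """Stable key built from the text and the sorted synthesis parameters."""
    payload = json.dumps({"text": text, "params": params},
                         sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CachedSynthesizer:
    """Wrap a synthesis function with a result cache (assumed design)."""

    def __init__(self, synthesize: Callable[[str, Dict[str, float]], bytes]):
        self._synthesize = synthesize
        self._cache: Dict[str, bytes] = {}  # in-memory stand-in for a cache store
        self.misses = 0

    def __call__(self, text: str, params: Dict[str, float]) -> bytes:
        key = cache_key(text, params)
        if key not in self._cache:
            self.misses += 1  # only cache misses trigger real synthesis
            self._cache[key] = self._synthesize(text, params)
        return self._cache[key]


if __name__ == "__main__":
    tts = CachedSynthesizer(lambda text, params: text.encode("utf-8"))
    tts("こんにちは", {"speed": 1.0})
    tts("こんにちは", {"speed": 1.0})
    print(tts.misses)
```

Including the acoustic parameters in the key matters: the same text with a different speed or style must not return the earlier audio.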

Focus only on developing ML services: the backend services are wrapped by an ML serving tool, e.g., Triton, TorchServe, or Seldon Core.

Pros and Cons
Pros
- Scalability: Only the backend services under high load need to be scaled up, so GPUs are used effectively
- Effective development: ML engineers can focus on backend services without affecting other services
Cons
- Release procedure: The release procedure needs to be considered carefully
- Setup of a local environment for researchers: The previous Coharis is still used for research purposes since it is easier to set up

Summary
- We introduced a TTS system for a production-ready environment.
- The new architecture lets ML engineers focus on developing TTS.
- Renewing the architecture results in faster development and easier scaling.

Thank you!!


More Decks by Tech-Verse2022
