Emotional TTS using neutral data

[Diagram: a speech database (A) with emotional data trains an emotional TTS directly; a speech database (B) with pseudo emotional data, produced by emotion transfer, trains an emotional TTS for a speaker with neutral recordings only.]

Compared systems:
- MS-TTS (multi-speaker): source speaker provides Neutral, Happiness, and Sadness; target speaker provides Neutral.
- VC-TTS (single-speaker with VC): target speaker's Happiness and Sadness are generated via voice conversion (VC).
- VC-TTS-PS (single-speaker with PS and VC): target speaker's Happiness and Sadness are generated via VC with pitch shifting (PS).
Performance evaluations

[Figure: MOS on a 1–5 scale comparing VC-TTS-PS and VC-TTS-PS (40%), the latter trained with less than half (40%) of the target speaker's recorded speech. Naturalness scores: 4.20, 4.08, 4.00, 3.96. Emotional similarity scores: 3.82, 3.87 (Happiness); 3.67, 3.63 (Sadness).]
Summary
- We proposed a cross-speaker emotion style transfer method that combines PS-based and VC-based data augmentation.
- Our method improved the performance of emotional TTS. In particular, the experiments showed that it significantly enhanced the naturalness and emotional similarity of speech in the happiness style.

Audio samples: https://ryojerky.github.io/demo_vc-tts-ps/
Coharis: Controllable, High-quAlity, and expRessIve TTS

About the architecture of the Coharis TTS system for production:
- How Coharis works
- Introducing the microservice architecture of Coharis
- Benefits and difficulties after rebuilding
TTS flow with monolithic architecture

The Coharis server receives raw text (e.g., 「こんにちは」) together with acoustic parameters through REST and returns raw speech. Text-processing modules run first, followed by the TTS modules:

1. Text segmentation / normalization: split and normalize sentences, e.g., 10:00 am → 午前10時
2. Grapheme-to-phoneme: convert raw text into the appropriate pronunciation, e.g., 午前10時 → ゴゼンジュージ
3. Phrase break prediction: predict appropriate pause positions, e.g., あのねあそこにね → あのね / あそこにね
4. Accent prediction: predict appropriate accents, e.g., 音声 → オ_H / ン_L / セ_L / ー_L
5. Mix up linguistic features: align and mix all linguistic features for TTS
6. Acoustic model: convert linguistic features into acoustic features
7. Vocoder: convert acoustic features into speech
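The monolithic flow above can be sketched as a chain of functions. Every name and return value below is a toy stand-in for illustration, not the actual Coharis code.

```python
# Toy sketch of the monolithic TTS flow: each stage is one function, and the
# server composes them in order. All names and return values are illustrative
# stand-ins, not the real Coharis modules.

def normalize(text):
    # Text segmentation / normalization, e.g. "10:00 am" -> "午前10時"
    return text.replace("10:00 am", "午前10時")

def g2p(text):
    # Grapheme-to-phoneme, e.g. "午前10時" -> "ゴゼンジュージ"
    table = {"午前10時": "ゴゼンジュージ"}
    return table.get(text, text)

def phrase_break(phonemes):
    # Phrase break prediction: split into phrases at predicted pauses
    return [phonemes]  # the toy model predicts no pauses

def accent(phrases):
    # Accent prediction: tag each mora High or Low
    return [[(mora, "H" if i == 0 else "L") for i, mora in enumerate(p)]
            for p in phrases]

def acoustic_model_and_vocoder(features):
    # Acoustic model + vocoder stand-in: pretend the waveform length
    # is proportional to the number of morae
    return sum(len(p) for p in features)

def tts(raw_text):
    # The monolithic server runs all stages in one process
    return acoustic_model_and_vocoder(
        accent(phrase_break(g2p(normalize(raw_text)))))
```

Because every stage lives in one process, updating any single module means rebuilding and redeploying the whole chain, which is the coupling problem discussed next.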
Problems with the monolithic architecture
- Tightly coupled: difficult to update individual modules quickly
- Scalability: cannot deploy each service independently
- Slow development: builds and deployments take a long time
Why microservices in our case?
- An increasing number of app-specific APIs
- Fast development: less time to update models and dictionaries than with the monolithic architecture
- Efficient use of GPU resources, since the number of available GPUs is limited: each module can be scaled up independently
- ML engineers can focus only on developing ML modules, not the server/infrastructure
Coharis with microservice architecture

Clients (CLOVA Voice, Audiobook, App1, App2) call their app services (CLOVA service, Audiobook service, App service 1, App service 2), which go through a proxy service to the backend microservices: the TTS components (preprocessing, G2P, phrase break, accent, and so on).

- Scale up each module independently: only the modules with high GPU usage need to be scaled up.
- Increase the number of app-specific APIs: each app service aggregates data coming from the backend services to offer an app-specific API. Requests and responses differ per app, and so does the resource usage of each backend service. For example:
  - CLOVA Voice generates speech from raw text (all modules activated).
  - The NAVER gateway generates linguistic features from raw text (text-processing modules only).
  - An external service generates speech from linguistic features (synthesis modules only).
- Fast development: updating modules takes less time since they are independent.
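The per-app activation above can be pictured as a dispatch table in the proxy: each app service triggers only the subset of backend modules it needs. The app names follow the slides, but the table and function below are invented for illustration.

```python
# Hypothetical sketch of per-app module activation: each app service runs
# only the backend modules it needs. App names follow the slides; the
# dispatch table itself is an invented illustration.

TEXT_FRONTEND = ["preprocessing", "g2p", "phrase_break", "accent"]
SYNTHESIS = ["acoustic_model", "vocoder"]

ACTIVATED_MODULES = {
    "clova_voice": TEXT_FRONTEND + SYNTHESIS,  # raw text -> speech
    "naver_gateway": TEXT_FRONTEND,            # raw text -> linguistic features
    "external_service": SYNTHESIS,             # linguistic features -> speech
}

def modules_for(app):
    # Look up which backend services this app's requests should hit
    return ACTIVATED_MODULES[app]
```

Because resource usage differs per app (the gateway never touches the GPU-heavy vocoder, for instance), splitting the pipeline this way is what lets each backend service be scaled to its own load.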
Techniques to speed up inference time
- Adaptive batching: requests coming in short intervals are batched together in the proxy server.
- gRPC protocol: each service is connected via gRPC to reduce network delay.
- CPU-based workers: CPU-based TTS workers complement the GPU-based ones to handle sharp increases in requests.
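Adaptive batching, the first technique above, can be sketched as follows: the proxy collects requests arriving within a short window and sends them to the model as one batch. The class, timings, and worker loop are illustrative assumptions, not the actual proxy implementation.

```python
import queue
import threading
import time

# Minimal sketch of adaptive batching, assuming the proxy holds requests for
# a short window (wait_ms) and forwards them to the model as one batch.
# Class and parameter names are invented for illustration.

class AdaptiveBatcher:
    def __init__(self, model_fn, max_batch=8, wait_ms=10):
        self.model_fn = model_fn          # takes a list of inputs, returns a list
        self.max_batch = max_batch
        self.wait = wait_ms / 1000.0
        self.q = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, item):
        # Called per request; blocks until the batched result is ready
        done = threading.Event()
        slot = {"in": item, "out": None, "done": done}
        self.q.put(slot)
        done.wait()
        return slot["out"]

    def _loop(self):
        while True:
            batch = [self.q.get()]  # block until the first request arrives
            deadline = time.monotonic() + self.wait
            while len(batch) < self.max_batch:
                timeout = deadline - time.monotonic()
                if timeout <= 0:
                    break
                try:
                    batch.append(self.q.get(timeout=timeout))
                except queue.Empty:
                    break
            # One model call for the whole batch amortizes GPU overhead
            outs = self.model_fn([s["in"] for s in batch])
            for slot, out in zip(batch, outs):
                slot["out"] = out
                slot["done"].set()
```

The trade-off is a small added latency (up to `wait_ms`) per request in exchange for higher GPU throughput under load.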
Microservice architecture on Coharis

Backend services: segmentation, normalization, phrase break, accent, text formatter, acoustic model, and vocoder services.

- API call flow: app services reach the proxy service over REST; the proxy talks to the backend services over gRPC.
- CPU-based TTS workers: Redis queues TTS jobs for the CPU-based workers.
- Model loading and cache mechanism: services load pre-trained models from model storage and use Redis as cache storage.
- Focus only on developing ML services: each ML service is wrapped by an ML serving tool, e.g., Triton, TorchServe, or Seldon Core.
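The model loading and cache mechanism above can be sketched as a small cache in front of model storage: a service loads each pre-trained model once and reuses it afterwards. The slides use Redis as the shared cache; the in-process dict below stands in for it, and all names are invented for illustration.

```python
# Sketch of the model loading/cache mechanism, assuming workers pull
# pre-trained models from model storage once and keep them in a cache
# (Redis in the slides; a plain dict here). Names are illustrative.

class ModelCache:
    def __init__(self, storage):
        self.storage = storage  # stand-in for model storage (e.g. object store)
        self.cache = {}         # stand-in for the Redis cache
        self.loads = 0          # counts expensive loads, for demonstration

    def get(self, name):
        if name not in self.cache:
            # Load the pre-trained model only on a cache miss
            self.cache[name] = self.storage[name]
            self.loads += 1
        return self.cache[name]
```

Keeping the cache outside the ML service itself is what allows the service to stay a thin wrapper around a serving tool such as Triton or TorchServe.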
Pros and cons

Pros:
- Effective development: ML engineers can focus on backend services without affecting other services.
- Scalability: only backend services under high load need to be scaled up, making effective use of GPUs.

Cons:
- Release procedure: releases need to be planned carefully.
- Local setup for researchers: the previous Coharis is still used for research purposes since it is easier to set up.
Summary
- We introduced a TTS system for a production-ready environment.
- The new architecture lets ML engineers focus on developing TTS.
- Renewing the architecture results in faster development and easier scaling.