Slide 1

Slide 1 text

Development and Operation of Expressive Speech Synthesis System
Kosuke Futamata / LINE
Ryo Terashima / LINE

Slide 2

Slide 2 text

Agenda
- Introduction
- Emotional TTS from Neutral Data
- TTS System for Product

Slide 3

Slide 3 text

Agenda
- Introduction
  - What is “TTS”?
  - Our TTS System “Coharis”
- Emotional TTS from Neutral Data
- TTS System for Product

Slide 4

Slide 4 text

What is “TTS”?
Text to Speech (example input text: “Hello”)

Slide 5

Slide 5 text

What is “TTS”?
Text (ありがとう, "thank you") → Text processing → Linguistic features [ a- ri^ ga- to- o- ] → Acoustic model → Acoustic features → Vocoder → Speech

Slide 6

Slide 6 text

Agenda
- Introduction
  - What is “TTS”?
  - Our TTS System “Coharis”
- Emotional TTS from Neutral Data
- TTS System for Product

Slide 7

Slide 7 text

COntrollable, High-quAlity, and expRessIve TTS

Slide 8

Slide 8 text

Expressive TTS System Coharis
Text (ありがとう, "thank you") → Text processing → Acoustic model → Vocoder
Expressive styles: Neutral, Conversation, Narration, Happy, Sad, Angry
Focus on Emotion

Slide 9

Slide 9 text

Use cases of Coharis: Audiobook, Narration, Telephone Answering

Slide 10

Slide 10 text

Agenda
- Introduction
- Emotional TTS from Neutral Data
- TTS System for Product

Slide 11

Slide 11 text

Problems in building emotional TTS
- Emotional Data: Recording required
- Burden on Speakers: Acting skills needed
- Low Resource Data: Only neutral data

Slide 12

Slide 12 text

Emotional TTS using neutral data
- Speech database (A) with emotional data → Emotional TTS
- Speech database (B) with only neutral data

Slide 13

Slide 13 text

Emotional TTS using neutral data
- Speech database (A) with emotional data → Emotional TTS
- Emotion transfer from (A) to (B)
- Speech database (B) with pseudo emotional data → Emotional TTS!

Slide 14

Slide 14 text

Emotional TTS using neutral data
- Data augmentation via pitch-shifting
- Data augmentation via voice conversion

Slide 15

Slide 15 text

Data augmentation via voice conversion: VC model training
- Source speaker: Neutral, Happiness, Sadness
- Target speaker: Neutral
- Target speaker (VC): Neutral, Happiness, Sadness

Slide 16

Slide 16 text

Data augmentation via voice conversion: Generation of pseudo-emotional data using VC
- Source speaker: Neutral, Happiness, Sadness
- Target speaker: Neutral
- Target speaker (VC): Neutral, Happiness, Sadness
- VC output: Style: Source speaker; Identity: Target speaker

Slide 17

Slide 17 text

Data augmentation via pitch-shifting
Generate data for various pitches: the original speech is pitch-shifted from -3 to +12 semitones.
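The semitone range on this slide maps to frequency ratios through the equal-temperament relation ratio = 2^(n/12). The sketch below is only an illustration of that arithmetic: the one-semitone step, the function names, and the convention that 0.0 marks an unvoiced frame are assumptions, not details from the slides.

```python
def semitone_ratio(semitones: float) -> float:
    """Frequency ratio for a shift of `semitones` in 12-tone equal temperament."""
    return 2.0 ** (semitones / 12.0)


def shift_f0(f0_hz, semitones):
    """Scale voiced F0 values; 0.0 (assumed to mark unvoiced frames) is left unchanged."""
    ratio = semitone_ratio(semitones)
    return [f * ratio if f > 0 else 0.0 for f in f0_hz]


# Assumed step of one semitone across the slide's range, -3 to +12.
SHIFTS = list(range(-3, 13))

if __name__ == "__main__":
    for n in SHIFTS:
        print(f"{n:+3d} semitones -> ratio {semitone_ratio(n):.4f}")
```

A shift of +12 semitones doubles the fundamental frequency (one octave up), while -3 semitones lowers it by roughly 16%.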

Slide 18

Slide 18 text

Performance evaluations: compared systems
- MS-TTS (Multi-speaker): Source speaker's Neutral, Happiness, and Sadness data, plus Target speaker's Neutral data
- VC-TTS (Single-speaker with VC): Target speaker's Neutral data, plus Neutral, Happiness, and Sadness data generated via VC
- VC-TTS-PS (Single-speaker with PS & VC): Target speaker's Neutral data, plus Neutral, Happiness, and Sadness data generated via VC w/ PS

Slide 19

Slide 19 text

Performance evaluations: Performance on Naturalness (chart axis 1 to 5)
System       Happiness  Sadness
MS-TTS       2.85       2.74
VC-TTS       3.88       4.00
VC-TTS-PS    4.20       4.00

Slide 20

Slide 20 text

Performance evaluations: Performance on Emotional Similarity (chart axis 1 to 4)
System       Happiness  Sadness
MS-TTS       2.48       2.72
VC-TTS       3.33       3.63
VC-TTS-PS    3.82       3.67

Slide 21

Slide 21 text

Performance evaluations: Experiments with less than half of the recorded speech data of the target speaker
Naturalness (chart axis 1 to 5)
System             Happiness  Sadness
VC-TTS-PS          4.20       4.00
VC-TTS-PS (40%)    4.08       3.96
Emotional Similarity (chart axis 1 to 4)
System             Happiness  Sadness
VC-TTS-PS          3.82       3.67
VC-TTS-PS (40%)    3.87       3.63
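To see how little the scores move when only 40% of the target speaker's data is used, the differences can be computed directly. This is a small sketch: the pairing of each value with an emotion and a system is read from the flattened chart text and should be treated as an assumption, and the names `SCORES` and `score_deltas` are illustrative.

```python
# (full data, 40% data) pairs as read from the slide's two charts.
SCORES = {
    "naturalness": {"happiness": (4.20, 4.08), "sadness": (4.00, 3.96)},
    "emotional_similarity": {"happiness": (3.82, 3.87), "sadness": (3.67, 3.63)},
}


def score_deltas(scores):
    """Return (40% score - full score) per metric and emotion, rounded to 2 places."""
    return {
        metric: {
            emotion: round(reduced - full, 2)
            for emotion, (full, reduced) in by_emotion.items()
        }
        for metric, by_emotion in scores.items()
    }


if __name__ == "__main__":
    for metric, by_emotion in score_deltas(SCORES).items():
        print(metric, by_emotion)
```

Under this reading, every difference is at most 0.12 points in magnitude.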

Slide 22

Slide 22 text

Summary
- We proposed a cross-speaker emotion style transfer method that combines PS-based and VC-based data augmentation.
- Our method improved the performance of emotional TTS. In particular, the experiments showed that our method significantly enhanced the naturalness and emotional similarity of speech with a happiness style.
Audio samples: https://ryojerky.github.io/demo_vc-tts-ps/

Slide 23

Slide 23 text

Agenda
- Introduction
- Emotional TTS from Neutral Data
- TTS System for Product

Slide 24

Slide 24 text

TTS system for product
About the architecture of Coharis (COntrollable, High-quAlity, and expRessIve TTS)
- How Coharis works
- Introducing a microservice architecture to Coharis
- Benefits and difficulties after rebuilding

Slide 25

Slide 25 text

TTS flow with monolithic architecture
Raw text (“こんにちは”, "hello") → Coharis server → Raw speech
Inside the Coharis server:
- For text processing: Text segmentation / normalization → Grapheme-to-phoneme → Phrase break prediction → Accent prediction → Mix up linguistic features
- For TTS: Acoustic model → Vocoder

Slide 26

Slide 26 text

TTS flow with monolithic architecture [diagram as on Slide 25]
Input: Raw text with acoustic parameters is given through REST

Slide 27

Slide 27 text

TTS flow with monolithic architecture [diagram as on Slide 25]
Text segmentation / normalization: Split and normalize sentences, e.g., 10:00 am → 午前10時
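The normalization example on this slide can be imitated with a single rewrite rule. The sketch below is a toy under stated assumptions: it handles only `H:MM am/pm` patterns, and the regular expression and the function name `normalize_time` are illustrative, not how the Coharis normalizer is implemented.

```python
import re

# Toy pattern: 1-2 digit hour, 2 digit minute, optional space, am/pm.
_TIME = re.compile(r"(?<!\d)(\d{1,2}):(\d{2})\s*(am|pm)", re.IGNORECASE)


def normalize_time(text: str) -> str:
    """Rewrite English-style clock times into Japanese written form."""

    def repl(match: re.Match) -> str:
        hour = int(match.group(1))
        minute = int(match.group(2))
        prefix = "午前" if match.group(3).lower() == "am" else "午後"
        # Omit the minute part for times on the hour, as in the slide's example.
        minutes = f"{minute}分" if minute else ""
        return f"{prefix}{hour}時{minutes}"

    return _TIME.sub(repl, text)


if __name__ == "__main__":
    print(normalize_time("10:00 am"))
    print(normalize_time("3:30 pm"))
```

A production normalizer would need many more rules (dates, currencies, units, and so on); this covers only the single case shown.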

Slide 28

Slide 28 text

TTS flow with monolithic architecture [diagram as on Slide 25]
Grapheme-to-phoneme: Convert raw text into appropriate pronunciation, e.g., 午前10時 → ゴゼンジュージ

Slide 29

Slide 29 text

TTS flow with monolithic architecture [diagram as on Slide 25]
Phrase break prediction: Predict appropriate pause positions, e.g., あのねあそこにね → あのね / あそこにね

Slide 30

Slide 30 text

TTS flow with monolithic architecture [diagram as on Slide 25]
Accent prediction: Predict appropriate accents, e.g., 音声 → オ_H / ン_L / セ_L / ー_L
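The H/L labels in this example follow the textbook Tokyo-dialect pitch pattern, which can be expanded from an accent nucleus position. The sketch below shows only that expansion step. How Coharis predicts accents is not described in the slides, and the function name and the `nucleus` convention (0 for no pitch drop, otherwise the 1-based index of the last high mora) are assumptions.

```python
def accent_labels(moras, nucleus):
    """Expand an accent nucleus position into per-mora H/L labels.

    nucleus == 0: no pitch drop (first mora low, the rest high).
    nucleus == 1: first mora high, the rest low.
    nucleus >= 2: first mora low, high through the nucleus, low afterwards.
    """
    if not 0 <= nucleus <= len(moras):
        raise ValueError("nucleus must be between 0 and the number of moras")
    labels = []
    for i in range(1, len(moras) + 1):
        if nucleus == 1:
            high = i == 1
        elif nucleus == 0:
            high = i > 1
        else:
            high = 1 < i <= nucleus
        labels.append("H" if high else "L")
    return " / ".join(f"{m}_{l}" for m, l in zip(moras, labels))


if __name__ == "__main__":
    # 音声 (オンセー) with the nucleus on the first mora, as on the slide.
    print(accent_labels(["オ", "ン", "セ", "ー"], 1))
```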

Slide 31

Slide 31 text

TTS flow with monolithic architecture [diagram as on Slide 25]
Mix up linguistic features: Align and mix up all linguistic features for TTS

Slide 32

Slide 32 text

TTS flow with monolithic architecture [diagram as on Slide 25]
Acoustic model: Convert linguistic features into acoustic features

Slide 33

Slide 33 text

TTS flow with monolithic architecture [diagram as on Slide 25]
Vocoder: Convert acoustic features into speech

Slide 34

Slide 34 text

Problems in monolithic architecture
- Tightly coupled: Difficult to update individual modules quickly
- Scalability: Cannot deploy each service independently
- Slow development: Takes a long time to build and deploy

Slide 35

Slide 35 text

Why microservice in our case?
- Increase the number of App specific APIs
- Fast development
- Scale up each module independently
- GPU resources must be used efficiently, since the number of available GPUs is limited
- Let ML engineers focus only on developing ML modules (not server/infrastructure)
- Less time to update models and dictionaries compared with the monolithic architecture

Slide 36

Slide 36 text

Coharis with microservice architecture
- CLIENTS: CLOVA VOICE, AUDIOBOOK, APP1, APP2
- APP SERVICE: CLOVA SERVICE, AUDIOBOOK SERVICE, APP SERVICE 1, APP SERVICE 2
- PROXY API SERVICE: PROXY SERVICE
- BACKEND MICROSERVICE: Preprocessing, G2P, Phrase break, Accent, (TTS components)

Slide 37

Slide 37 text

Scale up each module independently [diagram as on Slide 36]
We can scale up only the modules with high GPU usage, independently of the rest

Slide 38

Slide 38 text

Increase the number of App specific APIs [diagram as on Slide 36, with the app layer labeled APP API SERVICE]
Aggregate data coming from each backend service to offer an App specific API

Slide 39

Slide 39 text

Increase the number of App specific APIs [diagram as on Slide 36]
Different requests and responses are given, and the resource usage of each backend service is different

Slide 40

Slide 40 text

Increase the number of App specific APIs [diagram as on Slide 36, with the activated modules marked]
Generate speech from raw text

Slide 41

Slide 41 text

Increase the number of App specific APIs [diagram as on Slide 36, with NAVER GATEWAY in place of CLOVA SERVICE and the activated modules marked]
Generate linguistic features from raw text

Slide 42

Slide 42 text

Increase the number of App specific APIs [diagram as on Slide 36, with EXTERNAL SERVICE in place of AUDIOBOOK SERVICE and the activated modules marked]
Generate speech from linguistic features
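The three App specific APIs on Slides 40 to 42 differ only in which backend modules they activate. The sketch below illustrates that composition idea with stand-in functions. Every name in it is hypothetical, and the stubs do no real text processing or synthesis.

```python
from typing import List


def text_processing(raw_text: str) -> List[str]:
    """Stand-in for Preprocessing, G2P, Phrase break, and Accent.

    Emits one fake linguistic feature per character.
    """
    return [f"ling({ch})" for ch in raw_text]


def tts_components(linguistic: List[str]) -> bytes:
    """Stand-in for the TTS components: returns fake audio bytes."""
    return "|".join(linguistic).encode("utf-8")


def api_text_to_speech(raw_text: str) -> bytes:
    """Activates every module: raw text in, speech out."""
    return tts_components(text_processing(raw_text))


def api_text_to_linguistic(raw_text: str) -> List[str]:
    """Activates only the text-processing modules."""
    return text_processing(raw_text)


def api_linguistic_to_speech(linguistic: List[str]) -> bytes:
    """Activates only the TTS components."""
    return tts_components(linguistic)


if __name__ == "__main__":
    features = api_text_to_linguistic("ab")
    print(features)
    print(api_linguistic_to_speech(features) == api_text_to_speech("ab"))
```

Because each API is a thin composition over shared stages, a new App specific API can be added without changing the backend modules themselves.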

Slide 43

Slide 43 text

Fast development [diagram as on Slide 36]
Updating the modules takes less time since they are independent

Slide 44

Slide 44 text

Techniques to speed up inference
- Support adaptive batching: Requests arriving at short intervals are batched together in the proxy server
- Support gRPC protocol: Services are connected by gRPC to reduce network delay
- Use CPU-based workers to handle sharp increases in requests: CPU-based TTS workers are supported alongside the GPU-based ones
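The batching idea (requests that arrive close together are grouped into one batch) can be sketched as a time-window policy. This is an offline illustration over recorded arrival times, not the proxy server's actual implementation. The window length, the batch-size cap, and the function name are assumptions.

```python
from typing import List, Tuple


def batch_by_window(
    arrivals: List[Tuple[float, str]], window_s: float, max_batch: int
) -> List[List[str]]:
    """Group (arrival_time, request) pairs into batches.

    A batch closes when the next request arrives more than `window_s`
    seconds after the batch's first request, or when the batch already
    holds `max_batch` requests.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    window_start = 0.0
    for t, request in sorted(arrivals):
        if current and (t - window_start > window_s or len(current) >= max_batch):
            batches.append(current)
            current = []
        if not current:
            window_start = t
        current.append(request)
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    arrivals = [(0.000, "a"), (0.004, "b"), (0.009, "c"), (0.050, "d")]
    print(batch_by_window(arrivals, window_s=0.010, max_batch=8))
```

A longer window yields larger batches and better GPU utilization at the cost of added latency for the first request in each batch, so the window is a tuning parameter.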

Slide 45

Slide 45 text

Microservice architecture on Coharis
- BACKEND SERVICES: Phrase break service, Accent service, Normalization service, Segmentation service, Text formatter service, Acoustic model service, Vocoder service
- Other components: App services, Proxy service, Redis, CPU-based workers, Model storage
- Diagram labels: REST, gRPC, Cache storage, Load pre-trained models, Queue TTS jobs

Slide 46

Slide 46 text

API call flow [architecture diagram as on Slide 45]

Slide 47

Slide 47 text

CPU-based TTS workers [architecture diagram as on Slide 45]
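One way to picture CPU-based workers absorbing a spike is an overflow queue: when the GPU path is saturated, extra jobs are queued for CPU workers. The sketch below assumes that routing policy, which the slides do not spell out, and uses Python's in-process `queue.Queue` as a stand-in for the Redis job queue. `Dispatcher`, `gpu_capacity`, and `cpu_worker_step` are hypothetical names.

```python
import queue


class Dispatcher:
    """Route jobs to the GPU path until it is full, then queue them for CPU workers."""

    def __init__(self, gpu_capacity: int):
        self.gpu_capacity = gpu_capacity
        self.gpu_in_flight = 0
        self.cpu_queue: "queue.Queue[str]" = queue.Queue()  # stand-in for Redis

    def submit(self, job: str) -> str:
        if self.gpu_in_flight < self.gpu_capacity:
            self.gpu_in_flight += 1
            return "gpu"
        self.cpu_queue.put(job)
        return "cpu-queued"

    def gpu_done(self) -> None:
        """Mark one GPU job as finished, freeing a slot."""
        self.gpu_in_flight = max(0, self.gpu_in_flight - 1)


def cpu_worker_step(dispatcher: Dispatcher, synthesize):
    """Pop one queued job, if any, and run `synthesize` on it; return None when idle."""
    try:
        job = dispatcher.cpu_queue.get_nowait()
    except queue.Empty:
        return None
    return synthesize(job)


if __name__ == "__main__":
    d = Dispatcher(gpu_capacity=1)
    print(d.submit("job-1"), d.submit("job-2"))
    print(cpu_worker_step(d, lambda job: f"cpu-synth({job})"))
```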

Slide 48

Slide 48 text

Model loading and cache mechanism [architecture diagram as on Slide 45]

Slide 49

Slide 49 text

Focus only on developing ML services [architecture diagram as on Slide 45]
Wrapped by an ML serving tool, e.g., Triton, TorchServe, Seldon Core

Slide 50

Slide 50 text

Pros and Cons
Pros
- Scalability: Only the backend services with high load need to be scaled up, so GPUs are used effectively
- Effective development: ML engineers can focus on backend services without affecting other services
Cons
- Release procedure: Need to consider the release procedure carefully
- Setting up a local environment for researchers: The previous Coharis is still used for research purposes since it is easier to set up

Slide 51

Slide 51 text

Summary
- We introduced a TTS system for a production-ready environment
- The new architecture lets ML engineers focus on developing TTS
- Renewing the architecture results in faster development and easier scaling

Slide 52

Slide 52 text

Thank you!!