Development and Operation of Expressive Speech Synthesis System

Ryo Terashima (LINE / Voice Team / Machine Learning Engineer)
Kosuke Futamata (LINE / Voice Team / Machine Learning Engineer)

https://tech-verse.me/ja/sessions/55
https://tech-verse.me/en/sessions/55
https://tech-verse.me/ko/sessions/55

Tech-Verse2022

January 12, 2023

Transcript

  1. Agenda
    - Introduction
    - What is “TTS”?
    - Our TTS System “Coharis”
    - Emotional TTS from Neutral Data
    - TTS System for Product
  2. What is “TTS”?
    Pipeline: text (ありがとう, "thank you") → Text processing → linguistic features [ a- ri^ ga- to- o- ] → Acoustic model → acoustic features → Vocoder
  3. Agenda
    - Introduction
    - What is “TTS”?
    - Our TTS System “Coharis”
    - Emotional TTS from Neutral Data
    - TTS System for Product
  4. Expressive TTS System Coharis
    Pipeline: ありがとう ("thank you") → Text processing → Acoustic model → Vocoder
    Expressive styles: Neutral, Conversation, Narration, Happy, Sad, Angry
    Focus on Emotion
  5. Problems in building emotional TTS
    - Emotional data: recording required
    - Burden on speakers: acting skills needed
    - Low-resource data: only neutral data
  6. Emotional TTS using neutral data
    - Speech database (A) with emotional data → Emotional TTS
    - Speech database (B) with only neutral data
  7. Emotional TTS using neutral data
    - Speech database (A) with emotional data → Emotional TTS
    - Speech database (B) with pseudo emotional data (via emotion transfer) → Emotional TTS!
  8. Data augmentation via voice conversion: VC model training
    - Source speaker: Neutral, Happiness, Sadness
    - Target speaker: Neutral
    - Target speaker (VC): Neutral, Happiness, Sadness
  9. Data augmentation via voice conversion: generation of pseudo-emotional data using VC
    - Source speaker: Neutral, Happiness, Sadness
    - Target speaker: Neutral
    - Target speaker (VC): Neutral, Happiness, Sadness (style: source speaker; identity: target speaker)
  10. Data augmentation via pitch-shifting
    Generate data for various pitches: the original speech is pitch-shifted from -3 to +12 semitones.
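A shift of n semitones corresponds, in 12-tone equal temperament, to multiplying the fundamental frequency by 2^(n/12). The sketch below only enumerates the shift range named on the slide and the resulting frequency ratios. The 1-semitone step and the function names are assumptions made for illustration; the slides do not say which tool performs the actual pitch shifting.

```python
from typing import List


def semitone_ratio(n_semitones: float) -> float:
    """Frequency ratio for a shift of n semitones (12-tone equal temperament)."""
    return 2.0 ** (n_semitones / 12.0)


def augmentation_shifts(low: int = -3, high: int = 12, step: int = 1) -> List[int]:
    """Shift amounts in semitones, excluding 0 (the original pitch).

    The -3..+12 range comes from the slide; the step size is an assumption.
    """
    return [n for n in range(low, high + 1, step) if n != 0]


if __name__ == "__main__":
    for n in augmentation_shifts():
        print(f"{n:+3d} semitones -> F0 x {semitone_ratio(n):.3f}")
```

With these assumed settings, each original utterance would yield 15 pitch-shifted variants, from a ratio of about 0.84 (-3 semitones) up to 2.0 (+12 semitones, one octave).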
  11. Performance evaluations: compared systems
    - MS-TTS (multi-speaker): source speaker with Neutral, Happiness, Sadness; target speaker with Neutral
    - VC-TTS (single-speaker with VC): target speaker with Neutral, plus Neutral, Happiness, Sadness via VC
    - VC-TTS-PS (single-speaker with PS & VC): target speaker with Neutral, plus Neutral, Happiness, Sadness via VC w/ PS
  12. Performance evaluations: performance on naturalness (chart; score axis 1 to 5)
    - Happiness: MS-TTS 2.85, VC-TTS 3.88, VC-TTS-PS 4.20
    - Sadness: MS-TTS 2.74, VC-TTS 4.00, VC-TTS-PS 4.00
  13. Performance evaluations: performance on emotional similarity (chart; score axis 1 to 4)
    - Happiness: MS-TTS 2.48, VC-TTS 3.33, VC-TTS-PS 3.82
    - Sadness: MS-TTS 2.72, VC-TTS 3.63, VC-TTS-PS 3.67
  14. Performance evaluations: experiments with less than half of the recorded speech data of the target speaker
    - Naturalness (score axis 1 to 5): Happiness: VC-TTS-PS 4.20, VC-TTS-PS (40%) 4.08; Sadness: VC-TTS-PS 4.00, VC-TTS-PS (40%) 3.96
    - Emotional similarity (score axis 1 to 4): Happiness: VC-TTS-PS 3.82, VC-TTS-PS (40%) 3.87; Sadness: VC-TTS-PS 3.67, VC-TTS-PS (40%) 3.63
  15. Summary
    - We proposed a cross-speaker emotion style transfer method that combines PS-based and VC-based data augmentation.
    - Our method improved the performance of emotional TTS. In particular, the experiments showed that our method significantly enhanced the naturalness and emotional similarity of speech with a happiness style.
    - Audio samples: https://ryojerky.github.io/demo_vc-tts-ps/
  16. TTS system for product: about the architecture of Coharis
    Coharis: Controllable, High-quAlity, and expRessIve TTS
    - How Coharis works
    - Introduce microservice architecture on Coharis
    - Benefits and difficulties after rebuilding
  17. TTS flow with monolithic architecture
    Coharis server: raw text (“こんにちは”, "hello") → raw speech
    Stages, in order: Text segmentation / normalization → Grapheme-to-phoneme → Phrase break prediction → Accent prediction → Mix up linguistic features → Acoustic model → Vocoder
    The stages are grouped into those for text processing and those for TTS.
  18. TTS flow with monolithic architecture: input
    Raw text with acoustic parameters is given through REST.
  19. TTS flow with monolithic architecture: text segmentation / normalization
    Split and normalize sentences, e.g., 10:00 am → 午前10時
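The normalization example on this slide (10:00 am → 午前10時) can be reproduced with a single rewrite rule. This is a toy, rule-based sketch, not the Coharis implementation, which the slides do not describe; the function name and the regular expression are assumptions, and a production normalizer would need many more patterns (dates, currencies, units, and so on).

```python
import re

# Matches clock times such as "10:00 am" or "3:30 PM" (hypothetical pattern).
_TIME = re.compile(r"(\d{1,2}):(\d{2})\s*(am|pm)", re.IGNORECASE)


def normalize_time(text: str) -> str:
    """Rewrite English-style clock times into Japanese notation.

    Special handling of 12 o'clock (noon / midnight) is deliberately omitted.
    """
    def repl(match: "re.Match[str]") -> str:
        hour = int(match.group(1))
        minute = int(match.group(2))
        prefix = "午前" if match.group(3).lower() == "am" else "午後"
        minutes = f"{minute}分" if minute else ""  # drop ":00"
        return f"{prefix}{hour}時{minutes}"

    return _TIME.sub(repl, text)


if __name__ == "__main__":
    print(normalize_time("10:00 am"))
    print(normalize_time("3:30 PM"))
```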
  20. TTS flow with monolithic architecture: grapheme-to-phoneme
    Convert raw text into appropriate pronunciation, e.g., 午前10時 → ゴゼンジュージ
  21. TTS flow with monolithic architecture: phrase break prediction
    Predict appropriate pause positions, e.g., あのねあそこにね → あのね / あそこにね
  22. TTS flow with monolithic architecture: accent prediction
    Predict appropriate accents, e.g., 音声 → オ_H / ン_L / セ_L / ー_L
  23. TTS flow with monolithic architecture: mix up linguistic features
    Align and mix up all linguistic features for TTS
  24. TTS flow with monolithic architecture: acoustic model
    Convert linguistic features into acoustic features
  25. TTS flow with monolithic architecture: vocoder
    Convert acoustic features into speech
  26. Problems in monolithic architecture
    - Tightly coupled: difficult to update part of the modules quickly
    - Scalability: cannot deploy each service independently
    - Slow development: takes much time to build and deploy
  27. Why microservice in our case?
    - Increase the number of app-specific APIs
    - Fast development
    - GPU resources must be used efficiently, since the number of available GPUs is limited
    - Let ML engineers focus only on developing ML modules (not server/infrastructure)
    - Less time to update models and dictionaries compared with the monolithic architecture
    - Scale up each module independently
  28. Coharis with microservice architecture
    - Backend microservices: Preprocessing, G2P, Phrase break, Accent, (TTS components)
    - Proxy API service: Proxy service
    - App services: CLOVA service, Audiobook service, App service 1, App service 2
    - Clients: CLOVA Voice, Audiobook, App1, App2
  29. Scale up each module independently
    Only the modules with high GPU usage need to be scaled up; they can be scaled independently of the others.
  30. Increase the number of App specific APIs
    The proxy API service aggregates data coming from each backend service to offer app-specific APIs.
  31. Increase the number of App specific APIs
    Each app service sends different requests and receives different responses, and resource usage differs for each backend service.
  32. Increase the number of App specific APIs
    Activated modules: generate speech from raw text.
  33. Increase the number of App specific APIs
    Activated modules: generate linguistic features from raw text (the app service layer here shows NAVER GATEWAY in place of CLOVA SERVICE).
  34. Increase the number of App specific APIs
    Activated modules: generate speech from linguistic features (the app service layer here shows EXTERNAL SERVICE in place of AUDIOBOOK SERVICE).
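The three app-specific APIs on the preceding slides differ in which backend modules they activate. A minimal sketch of that idea follows, with every backend stubbed out as a function that records its own name. The route names, the stage names, and the exact set of modules activated per route are assumptions read off the diagram labels, not the actual Coharis proxy.

```python
from typing import Callable, Dict, List

Payload = Dict[str, object]
Stage = Callable[[Payload], Payload]


def make_stage(name: str) -> Stage:
    """Stub backend service: appends its name to a trace instead of doing real work."""
    def stage(payload: Payload) -> Payload:
        out = dict(payload)
        out["trace"] = list(payload.get("trace", [])) + [name]
        return out
    return stage


# Module names follow the diagram labels; "(TTS components)" is assumed to mean
# the acoustic model and the vocoder.
TEXT_MODULES: List[str] = ["preprocessing", "g2p", "phrase_break", "accent"]
TTS_MODULES: List[str] = ["acoustic_model", "vocoder"]
STAGES: Dict[str, Stage] = {n: make_stage(n) for n in TEXT_MODULES + TTS_MODULES}

# Hypothetical route names for the three use cases shown on the slides.
ROUTES: Dict[str, List[str]] = {
    "text_to_speech": TEXT_MODULES + TTS_MODULES,
    "text_to_features": TEXT_MODULES,
    "features_to_speech": TTS_MODULES,
}


def handle(route: str, payload: Payload) -> Payload:
    """Run only the backend modules that the given route activates, in order."""
    result = payload
    for name in ROUTES[route]:
        result = STAGES[name](result)
    return result


if __name__ == "__main__":
    print(handle("text_to_features", {"text": "こんにちは"})["trace"])
```

Keeping the route table separate from the stages means that a new app-specific API is a new entry in `ROUTES` rather than a change to any backend module.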
  35. Fast development
    Backend services take less time to update since they are independent.
  36. Techniques to speed up inference time
    - Support adaptive batching: requests coming in short intervals are batched together in the proxy server
    - Support gRPC protocol: each service is connected by gRPC to reduce network delay
    - Use CPU-based workers to handle sharp increases in requests: CPU-based TTS workers are supported alongside the GPU-based ones
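Adaptive batching, as described on this slide, groups requests that arrive within a short interval so that the model runs once per group. The sketch below shows only the grouping logic over timestamped requests. The window length, the batch-size cap, and all names are assumptions, and a real proxy would batch live traffic with timers rather than a pre-collected list.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Request:
    arrival_ms: float  # arrival time in milliseconds
    text: str


def adaptive_batches(
    requests: Iterable[Request],
    max_wait_ms: float = 10.0,   # assumed batching window
    max_batch_size: int = 4,     # assumed cap on batch size
) -> List[List[Request]]:
    """Group requests that arrive within max_wait_ms of the first request in a batch."""
    batches: List[List[Request]] = []
    current: List[Request] = []
    for req in sorted(requests, key=lambda r: r.arrival_ms):
        window_expired = bool(current) and req.arrival_ms - current[0].arrival_ms > max_wait_ms
        batch_full = len(current) >= max_batch_size
        if window_expired or batch_full:
            batches.append(current)  # flush the finished batch
            current = []
        current.append(req)
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    reqs = [Request(t, f"utterance {i}") for i, t in enumerate([0, 3, 8, 30, 31])]
    for batch in adaptive_batches(reqs):
        print([r.arrival_ms for r in batch])
```

The trade-off is latency against throughput: a longer window yields larger batches and better GPU utilization, but every request in a batch waits for the window to close.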
  37. Microservice architecture on Coharis
    - Backend services: Phrase break, Accent, Normalization, Segmentation, Text formatter, Acoustic model, and Vocoder services
    - Other components: App services, Proxy service, Redis, CPU-based workers, Model storage, Cache storage
    - Connection labels: REST, gRPC, "Load pre-trained models", "Queue TTS jobs"
  38. API call flow
    (Diagram: same components as the microservice architecture on Coharis: app services, proxy service, backend services, Redis, CPU-based workers, model storage, cache storage.)
  39. CPU-based TTS workers
    (Diagram: same components as the microservice architecture on Coharis; TTS jobs are queued for the CPU-based workers.)
  40. Model loading and cache mechanism
    (Diagram: same components as the microservice architecture on Coharis; pre-trained models are loaded from model storage, and a cache storage is used.)
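The slides name a cache storage but do not say what is cached or how keys are built. One plausible reading is that results are cached per request so that repeated requests skip inference. The sketch below illustrates only that assumed behaviour with an in-memory dictionary standing in for the cache storage; the class, the key scheme, and the `synthesize` callback are all hypothetical.

```python
import hashlib
import json
from typing import Callable, Dict


class SynthesisCache:
    """In-memory stand-in for a cache storage keyed by (text, parameters)."""

    def __init__(self, synthesize: Callable[[str, dict], bytes]) -> None:
        self._synthesize = synthesize       # the expensive call to be avoided
        self._store: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(text: str, params: dict) -> str:
        # Serialize deterministically so identical requests map to the same key.
        blob = json.dumps({"text": text, "params": params},
                          sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(blob.encode("utf-8")).hexdigest()

    def get(self, text: str, params: dict) -> bytes:
        key = self._key(text, params)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        audio = self._synthesize(text, params)
        self._store[key] = audio
        return audio


if __name__ == "__main__":
    cache = SynthesisCache(lambda text, params: text.encode("utf-8"))
    cache.get("こんにちは", {"speed": 1.0})
    cache.get("こんにちは", {"speed": 1.0})
    print(cache.hits, cache.misses)
```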
  41. Focus only on developing ML services
    The backend services are wrapped by an ML serving tool, e.g., Triton, TorchServe, or Seldon Core.
    (Diagram: same components as the microservice architecture on Coharis.)
  42. Pros and Cons
    Pros:
    - Scalability: only backend services with high load can be scaled up and can use GPU effectively
    - Effective development: ML engineers can focus on backend services without the effects on other services
    Cons:
    - Release procedure: need to consider the release procedure carefully
    - Setup of local environment for researchers: the previous Coharis is still used for research purposes since it is easier to set up
  43. Summary
    - We introduced a TTS system for a product-ready environment
    - The new architecture lets ML engineers focus on developing TTS
    - Renewing the architecture results in faster development and easier scaling