Development and Operation of Expressive Speech Synthesis System

Ryo Terashima (LINE / Voice Team / Machine Learning Engineer)
Kosuke Futamata (LINE / Voice Team / Machine Learning Engineer)

https://tech-verse.me/ja/sessions/55
https://tech-verse.me/en/sessions/55
https://tech-verse.me/ko/sessions/55

Tech-Verse2022

January 12, 2023
Transcript

  1. Development and Operation of Expressive Speech Synthesis System
    Kosuke Futamata / LINE
    Ryo Terashima / LINE

  2. Agenda
    - Introduction
    - Emotional TTS from Neutral Data
    - TTS System for Product

  3. Agenda
    - Introduction
    - What is “TTS” ?
    - Our TTS System “Coharis”
    - Emotional TTS from Neutral Data
    - TTS System for Product

  4. What is “TTS” ?
    Text to Speech
    Example text: Hello

  5. What is “TTS” ?
    Text: ありがとう (“thank you”)
    → Text processing → Linguistic features: [ a- ri^ ga- to- o- ]
    → Acoustic model → Acoustic features
    → Vocoder

  6. Agenda
    - Introduction
    - What is “TTS” ?
    - Our TTS System “Coharis”
    - Emotional TTS from Neutral Data
    - TTS System for Product

  7. COntrollable, High-quAlity, and expRessIve TTS

  8. Expressive TTS System Coharis
    Text: ありがとう (“thank you”) → Text processing → Acoustic model → Vocoder
    Speaking styles: Neutral, Conversation, Narration, Happy, Sad, Angry
    Expressive: focus on emotion

  9. Use cases of Coharis
    - Audiobook narration
    - Telephone answering

  10. Agenda
    - Introduction
    - Emotional TTS from Neutral Data
    - TTS System for Product

  11. Problems in building emotional TTS
    - Emotional data: recording required
    - Burden on speakers: acting skills needed
    - Low-resource data: only neutral data

  12. Emotional TTS using neutral data
    Emotional TTS
    Speech database (A)
    with emotional data
    Speech database (B)
    with only neutral data

  13. Emotional TTS using neutral data
    Speech database (A) with emotional data → Emotional TTS
    Emotion transfer from (A) to (B)
    Speech database (B) with pseudo-emotional data → Emotional TTS!

  14. Emotional TTS using neutral data
    Data augmentation via pitch-shifting
    Data augmentation via voice conversion

  15. Data augmentation via voice conversion
    VC model training
    Source speaker: Neutral, Happiness, Sadness
    Target speaker: Neutral
    Target speaker (VC): Neutral, Happiness, Sadness

  16. Data augmentation via voice conversion
    Generation of pseudo-emotional data using VC
    Source speaker: Neutral, Happiness, Sadness
    Target speaker: Neutral
    VC output: style of the source speaker, identity of the target speaker
    Target speaker (VC): Neutral, Happiness, Sadness

  17. Data augmentation via pitch-shifting
    Generate data for various pitches
    Pitch-shift the original speech by -3 to +12 semitones
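
As a rough illustration of what the -3 to +12 semitone range means in terms of frequency, the sketch below converts semitone offsets into frequency ratios using the equal-temperament relation ratio = 2^(n/12). The step of one semitone and the 200 Hz base pitch are assumptions made for this example; the slides do not state which shift values or which pitch-shifting tool were used.

```python
def semitone_ratio(semitones: float) -> float:
    """Frequency ratio for a pitch shift of `semitones` (12-tone equal temperament)."""
    return 2.0 ** (semitones / 12.0)


# Range quoted on the slide: -3 to +12 semitones.
# A step of one semitone is an assumption for this sketch.
shifts = list(range(-3, 13))
ratios = {s: semitone_ratio(s) for s in shifts}

base_f0_hz = 200.0  # hypothetical average F0 of the original recording
for s in (-3, 0, 12):
    shifted = base_f0_hz * ratios[s]
    print(f"{s:+d} semitones: ratio {ratios[s]:.3f}, F0 {shifted:.1f} Hz")
```

A shift of +12 semitones doubles the fundamental frequency (one octave up), while -3 semitones lowers it by roughly 16%.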






  18. Performance evaluations
    Systems compared:
    - MS-TTS (multi-speaker): Source speaker: Neutral, Happiness, Sadness; Target speaker: Neutral
    - VC-TTS (single-speaker with VC): Target speaker: Neutral, plus Neutral, Happiness, Sadness generated via VC
    - VC-TTS-PS (single-speaker with PS & VC): Target speaker: Neutral, plus Neutral, Happiness, Sadness generated via VC w/ PS

  19. Performance evaluations
    Performance on Naturalness (axis: 1 to 5)
    Happiness: MS-TTS 2.85, VC-TTS 3.88, VC-TTS-PS 4.20
    Sadness: MS-TTS 2.74, VC-TTS 4.00, VC-TTS-PS 4.00

  20. Performance evaluations
    Performance on Emotional Similarity (axis: 1 to 4)
    Happiness: MS-TTS 2.48, VC-TTS 3.33, VC-TTS-PS 3.82
    Sadness: MS-TTS 2.72, VC-TTS 3.63, VC-TTS-PS 3.67

  21. Performance evaluations
    Experiments with less than half of the recorded speech data of the target speaker
    Naturalness (axis: 1 to 5)
    Happiness: VC-TTS-PS 4.20, VC-TTS-PS (40%) 4.08
    Sadness: VC-TTS-PS 4.00, VC-TTS-PS (40%) 3.96
    Emotional Similarity (axis: 1 to 4)
    Happiness: VC-TTS-PS 3.82, VC-TTS-PS (40%) 3.87
    Sadness: VC-TTS-PS 3.67, VC-TTS-PS (40%) 3.63
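
To make the comparison on this slide easier to read, the small sketch below computes the change in each score when only 40% of the target speaker's data is used. The assignment of the values to Happiness and Sadness is an assumption based on the order in which they appear and on the VC-TTS-PS scores reported on the two preceding slides.

```python
# Scores as (VC-TTS-PS, VC-TTS-PS (40%)); the emotion assignment is assumed
# from the slide order and the scores on the preceding slides.
scores = {
    ("naturalness", "happiness"): (4.20, 4.08),
    ("naturalness", "sadness"): (4.00, 3.96),
    ("emotional similarity", "happiness"): (3.82, 3.87),
    ("emotional similarity", "sadness"): (3.67, 3.63),
}

# Difference between the reduced-data system and the full-data system.
deltas = {
    key: round(reduced - full, 2) for key, (full, reduced) in scores.items()
}

for (metric, emotion), delta in deltas.items():
    print(f"{metric} / {emotion}: {delta:+.2f}")
```

Under this reading, every score moves by 0.12 points or less when the target speaker's data is reduced to 40%.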

  22. Summary
    - We proposed a cross-speaker emotion style transfer method that combines PS-based and VC-based data augmentation.
    - Our method improved the performance of emotional TTS. In particular, the experiments showed that our method significantly enhanced the naturalness and emotional similarity of speech with a happiness style.
    Audio samples: https://ryojerky.github.io/demo_vc-tts-ps/

  23. Agenda
    - Introduction
    - Emotional TTS from Neutral Data
    - TTS System for Product

  24. About the architecture of Coharis
    TTS system for product
    ⎯ How Coharis works
    ⎯ Introduce microservice architecture on Coharis
    ⎯ Benefits and difficulties after rebuilding
    Controllable, High-quAlity, and expRessIve TTS

  25. TTS flow with monolithic architecture
    Raw text (“こんにちは”, “hello”) → Coharis server → Raw speech
    Modules inside the Coharis server:
    - Text segmentation / normalization
    - Grapheme-to-phoneme
    - Phrase break prediction
    - Accent prediction
    - Mix up linguistic features
    - Acoustic model
    - Vocoder
    Legend: modules for text processing / modules for TTS

  26. TTS flow with monolithic architecture
    Raw text with acoustic parameters is given to the Coharis server through REST

  27. TTS flow with monolithic architecture
    Text segmentation / normalization: split and normalize sentences
    e.g., 10:00 am → 午前10時 (“10 o'clock a.m.”)
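
The normalization example on this slide can be sketched with a small rule. The function below is a toy, regex-based normalizer written for illustration only; the name `normalize_time` and the single time pattern it handles are assumptions, and the slides do not describe how Coharis actually implements normalization.

```python
import re

# Matches clock times such as "10:00 am" or "3:30 PM".
_TIME = re.compile(r"(\d{1,2}):(\d{2})\s*(am|pm)", re.IGNORECASE)


def normalize_time(text: str) -> str:
    """Rewrite 12-hour clock times into a Japanese reading-friendly form."""

    def repl(match: re.Match) -> str:
        hour = int(match.group(1))
        minute = int(match.group(2))
        # 午前 = a.m., 午後 = p.m.
        prefix = "午前" if match.group(3).lower() == "am" else "午後"
        out = f"{prefix}{hour}時"  # 時 = o'clock
        if minute:
            out += f"{minute}分"  # 分 = minutes
        return out

    return _TIME.sub(repl, text)


print(normalize_time("10:00 am"))
print(normalize_time("3:30 PM"))
```

A production normalizer would cover many more categories (dates, currencies, units), which is one reason this step is a separate module in the flow.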

  28. TTS flow with monolithic architecture
    Grapheme-to-phoneme: convert raw text into the appropriate pronunciation
    e.g., 午前10時 → ゴゼンジュージ (“gozen juuji”)

  29. TTS flow with monolithic architecture
    Phrase break prediction: predict appropriate pause positions
    e.g., あのねあそこにね → あのね / あそこにね (“you know, / over there”)

  30. TTS flow with monolithic architecture
    Accent prediction: predict appropriate accents
    e.g., 音声 (“speech”) → オ_H / ン_L / セ_L / ー_L (H = high pitch, L = low pitch)

  31. TTS flow with monolithic architecture
    Mix up linguistic features: align and combine all linguistic features for TTS
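
One way to picture the alignment step is to attach the accent label and the phrase-break flag to each unit of the pronunciation. The sketch below parses the accent notation used in the earlier slide example (オ_H / ン_L / セ_L / ー_L) and merges it with a set of break positions. The `Mora` record, the function names, and the index-based break positions are all hypothetical; the actual feature format of Coharis is not given in the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mora:
    kana: str          # pronunciation unit, e.g. "オ"
    pitch: str         # "H" (high) or "L" (low)
    break_after: bool  # True if a phrase break follows this unit


def parse_accent(labelled: str) -> list:
    """Parse 'オ_H / ン_L' into [('オ', 'H'), ('ン', 'L')]."""
    pairs = []
    for token in labelled.split("/"):
        kana, pitch = token.strip().split("_")
        pairs.append((kana, pitch))
    return pairs


def mix_features(accented: list, break_positions: set) -> list:
    """Align accent labels with phrase breaks, given as indices of the
    units after which a pause is predicted."""
    return [
        Mora(kana, pitch, index in break_positions)
        for index, (kana, pitch) in enumerate(accented)
    ]


features = mix_features(
    parse_accent("オ_H / ン_L / セ_L / ー_L"), break_positions={3}
)
for mora in features:
    print(mora)
```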

  32. TTS flow with monolithic architecture
    Acoustic model: convert linguistic features into acoustic features

  33. TTS flow with monolithic architecture
    Vocoder: convert acoustic features into speech
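
The whole monolithic flow walked through on the preceding slides can be summarized as a chain of stages running inside one process. The sketch below shows only that structure: every stage body is a placeholder invented for this example (no real G2P, accent model, acoustic model, or vocoder is involved), and the dictionary keys are hypothetical names.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Stage = Callable[[State], State]


def segment_and_normalize(s: State) -> State:
    return {**s, "normalized": s["raw_text"].strip()}


def grapheme_to_phoneme(s: State) -> State:
    # Placeholder: a real G2P returns a reading; characters are passed through.
    return {**s, "pronunciation": list(s["normalized"])}


def predict_phrase_breaks(s: State) -> State:
    return {**s, "breaks": []}  # placeholder: no pauses predicted


def predict_accents(s: State) -> State:
    return {**s, "accents": ["L"] * len(s["pronunciation"])}  # placeholder


def mix_linguistic_features(s: State) -> State:
    return {**s, "linguistic": list(zip(s["pronunciation"], s["accents"]))}


def acoustic_model(s: State) -> State:
    # Placeholder: one dummy acoustic frame per linguistic feature.
    return {**s, "acoustic": [[0.0] for _ in s["linguistic"]]}


def vocoder(s: State) -> State:
    # Placeholder: one silent sample per acoustic frame.
    return {**s, "speech": [0.0 for _ in s["acoustic"]]}


# In the monolithic server, all stages live in one process and one release.
PIPELINE: List[Stage] = [
    segment_and_normalize,
    grapheme_to_phoneme,
    predict_phrase_breaks,
    predict_accents,
    mix_linguistic_features,
    acoustic_model,
    vocoder,
]


def synthesize(raw_text: str) -> State:
    state: State = {"raw_text": raw_text}
    for stage in PIPELINE:
        state = stage(state)
    return state


result = synthesize("こんにちは")
print(sorted(result.keys()))
```

Because the stages are chained in a single deployable unit, changing any one of them means rebuilding and redeploying all of them.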

  34. Problems in monolithic architecture
    - Tightly coupled: difficult to update some of the modules quickly
    - Scalability: cannot deploy each service independently
    - Slow development: takes a long time to build and deploy

  35. Why microservice in our case?
    - Increase the number of app-specific APIs
    - Fast development: less time to update models and dictionaries than with the monolithic architecture
    - Scale up each module independently: GPU resources must be used efficiently because the number of available GPUs is limited
    - Let ML engineers focus only on developing ML modules (not server/infrastructure)

  36. Coharis with microservice architecture
    CLIENTS: CLOVA VOICE, AUDIOBOOK, APP1, APP2
    APP SERVICE: CLOVA SERVICE, AUDIOBOOK SERVICE, APP SERVICE 1, APP SERVICE 2
    PROXY API SERVICE: PROXY SERVICE
    BACKEND MICROSERVICE: Preprocessing, G2P, Phrase break, Accent, (TTS components)

  37. Scale up each module independently
    Only the modules with high GPU usage need to be scaled up, independently of the others

  38. Increase the number of App specific APIs
    The APP API SERVICE layer aggregates data coming from each backend service to offer app-specific APIs

  39. Increase the number of App specific APIs
    Requests and responses differ between apps, and resource usage differs between backend services

  40. Increase the number of App specific APIs
    Generate speech from raw text
    (The diagram highlights the modules activated for this API)

  41. Increase the number of App specific APIs
    Generate linguistic features from raw text
    (The diagram highlights the modules activated for this API)
    App services on this slide: NAVER GATEWAY, AUDIOBOOK SERVICE, APP SERVICE 1, APP SERVICE 2

  42. Increase the number of App specific APIs
    Generate speech from linguistic features
    (The diagram highlights the modules activated for this API)
    App services on this slide: CLOVA SERVICE, EXTERNAL SERVICE, APP SERVICE 1, APP SERVICE 2
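
The three app-specific APIs shown on these slides differ in which backend modules they activate. The mapping below is a minimal sketch of that idea. The API names and module identifiers are hypothetical, and treating "(TTS components)" as the acoustic model plus the vocoder is an assumption based on the earlier monolithic flow.

```python
# Backend modules named in the architecture diagram (identifiers are hypothetical).
TEXT_PROCESSING = ["preprocessing", "g2p", "phrase_break", "accent"]
# Assumption: "(TTS components)" stands for the acoustic model and the vocoder.
TTS_COMPONENTS = ["acoustic_model", "vocoder"]

# Which modules each app-specific API activates.
API_MODULES = {
    "speech_from_text": TEXT_PROCESSING + TTS_COMPONENTS,
    "linguistic_features_from_text": TEXT_PROCESSING,
    "speech_from_linguistic_features": TTS_COMPONENTS,
}


def activated_modules(api_name: str) -> list:
    """Return the backend modules that a given API needs to call."""
    return list(API_MODULES[api_name])


for name in API_MODULES:
    print(name, "->", activated_modules(name))
```

Since each API touches a different subset of modules, the load on each backend service can differ, which is what makes independent scaling useful.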

  43. Fast development
    Backend services take less time to update because they are independent

  44. Techniques to speed up inference
    - Support adaptive batching: requests arriving at short intervals are batched together in the proxy server
    - Support the gRPC protocol: services are connected by gRPC to reduce network delay
    - Use CPU-based workers to handle sharp increases in requests: CPU-based TTS workers are supported alongside the GPU-based ones
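
Adaptive batching, as described on this slide, groups requests that arrive close together so that they can be processed in one model call. The sketch below shows one common way to do this with a queue, a maximum batch size, and a maximum waiting time. The function name and both parameters (`max_batch_size`, `max_wait_s`) are assumptions for illustration; the slides do not give the actual batching policy of the proxy server.

```python
import queue
import time


def collect_batch(requests, max_batch_size=8, max_wait_s=0.01):
    """Block for the first request, then keep gathering requests until the
    batch is full or `max_wait_s` has elapsed since the first one arrived."""
    batch = [requests.get()]
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break
    return batch


# Three requests arrive almost at the same time.
pending = queue.Queue()
for text in ["こんにちは", "ありがとう", "おはよう"]:
    pending.put(text)

first_batch = collect_batch(pending, max_batch_size=2, max_wait_s=0.5)
second_batch = collect_batch(pending, max_batch_size=2, max_wait_s=0.05)
print(first_batch, second_batch)
```

The waiting time bounds the extra latency a request can incur, while the batch size bounds the memory needed per model call.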

  45. Microservice architecture on Coharis
    App services → (REST) → Proxy service → (gRPC) → BACKEND SERVICES
    BACKEND SERVICES: Segmentation service, Normalization service, Phrase break service, Accent service, Text formatter service, Acoustic model service, Vocoder service
    Redis: queue TTS jobs for the CPU-based workers
    Model storage: load pre-trained models
    Cache storage

  46. API call flow
    (Same architecture diagram as slide 45: App services, Proxy service, BACKEND SERVICES, Redis, CPU-based workers, Model storage, Cache storage)

  47. CPU-based TTS workers
    (Same architecture diagram as slide 45; TTS jobs are queued in Redis for the CPU-based workers)
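
The role of the CPU-based workers can be sketched as overflow handling: requests go to the GPU-backed service while it has capacity, and the rest are queued as jobs for CPU workers. In the sketch below, an in-process `queue.Queue` stands in for Redis, and the capacity-based routing rule, the `dispatch` function, and the placeholder synthesis are all assumptions; the slides only state that TTS jobs are queued and that CPU workers absorb sharp increases in requests.

```python
import queue
import threading

job_queue = queue.Queue()  # stand-in for the Redis job queue
results = {}
results_lock = threading.Lock()


def cpu_worker():
    """Consume queued TTS jobs until a None sentinel is received."""
    while True:
        text = job_queue.get()
        if text is None:
            break
        with results_lock:
            results[text] = f"cpu-speech({text})"  # placeholder synthesis


def dispatch(text, gpu_in_flight, gpu_capacity):
    """Hypothetical routing rule: use the GPU service while it has capacity,
    otherwise queue the job for the CPU-based workers."""
    if gpu_in_flight < gpu_capacity:
        return "gpu"
    job_queue.put(text)
    return "cpu-queue"


workers = [threading.Thread(target=cpu_worker) for _ in range(2)]
for worker in workers:
    worker.start()

# Four requests arrive while the GPU service can only take two.
routes = [dispatch(f"req{i}", gpu_in_flight=i, gpu_capacity=2) for i in range(4)]

for _ in workers:
    job_queue.put(None)  # one shutdown sentinel per worker
for worker in workers:
    worker.join()

print(routes, sorted(results))
```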

  48. Model loading and cache mechanism
    (Same architecture diagram as slide 45; pre-trained models are loaded from Model storage, and Cache storage is attached to the system)
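
A model loading and cache mechanism of this kind can be sketched in two layers: a loader that fetches each pre-trained model from storage only once, and a result cache keyed by the request. In the sketch below, plain dictionaries stand in for the model storage and the cache storage, and caching synthesized audio by (voice, text) is an assumption; the slides do not say what the cache storage holds.

```python
import hashlib
from functools import lru_cache

MODEL_STORAGE = {"voice_a": b"weights-a"}  # stand-in for the model storage
CACHE_STORAGE = {}                         # stand-in for the cache storage
load_count = 0


@lru_cache(maxsize=4)
def load_model(name):
    """Load a pre-trained model once and keep it in memory afterwards."""
    global load_count
    load_count += 1
    return MODEL_STORAGE[name]


def cache_key(text, voice):
    return hashlib.sha256(f"{voice}\n{text}".encode("utf-8")).hexdigest()


def synthesize(text, voice):
    key = cache_key(text, voice)
    if key in CACHE_STORAGE:  # cache hit: skip inference entirely
        return CACHE_STORAGE[key]
    model = load_model(voice)
    audio = model + b":" + text.encode("utf-8")  # placeholder for inference
    CACHE_STORAGE[key] = audio
    return audio


first = synthesize("こんにちは", "voice_a")
second = synthesize("こんにちは", "voice_a")  # served from the cache
third = synthesize("ありがとう", "voice_a")   # new text, model already loaded
print(load_count, len(CACHE_STORAGE))
```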

  49. Focus only on developing ML services
    (Same architecture diagram as slide 45)
    Backend services are wrapped by an ML serving tool, e.g., Triton, TorchServe, Seldon Core

  50. Pros and Cons
    Pros
    - Scalability: only the backend services under high load are scaled up, so GPUs are used effectively
    - Effective development: ML engineers can focus on backend services without affecting other services
    Cons
    - Release procedure: the release procedure needs to be considered carefully
    - Local environment setup for researchers: the previous Coharis is still used for research purposes because it is easier to set up

  51. Summary
    - We introduced a TTS system for a production-ready environment
    - The new architecture lets ML engineers focus on developing TTS
    - Renewing the architecture results in faster development and easier scaling

  52. Thank you!!
