Adapting Text-based Dialogue State Tracker for Spoken Dialogues

Seunghyun Hwang

September 11, 2023

Transcript

  1. Adapting Text-based Dialogue State Tracker for Spoken Dialogues

    Presented by Seunghyun Hwang
    Kim Jaechul Graduate School of AI, KAIST, Seoul, Republic of Korea
    2023. 9. 11.
    Jaeseok Yoon, Seunghyun Hwang, Ran Han, Jeonguk Bang, Kee-Eung Kim
    DSTC11 Track 3 - Speech-aware dialog systems
  2. Task-Oriented Dialogue System

    Dialogue System
    EX) Part of hotel reservation
    Can you help me book a hotel near Prague castle?
    For how many people?
    For two people, thanks!
    How about “OREA Hotel”?
  3. Speech-Aware Dialog System (TOD System)

    Dialogue System
    EX) Part of hotel reservation
    Can you help me book a hotel near Prague castle?
    For how many people?
    For two people, thanks!
    How about “OREA Hotel”?
  4. Related Works

    • TRADE[1] (TRAnsferable Dialogue statE generator) uses shared parameters and a copy mechanism for robust, cross-domain dialogue state tracking to avoid forgetting previously learned tasks.
    • UBAR[2] fine-tunes a GPT-2 model on entire dialog sessions, achieving state-of-the-art performance and transferability to new domains.
    • D3ST[3] (Description Driven Dialog State Tracking) uses natural language descriptions for task schemata, leading to better understanding and higher performance in state tracking.

    [1] Wu, Chien-Sheng, et al. "Transferable multi-domain state generator for task-oriented dialogue systems." (ACL 2019)
    [2] Yang, Yunyi, Yunhao Li, and Xiaojun Quan. "UBAR: Towards fully end-to-end task-oriented dialog system with GPT-2." (AAAI 2021)
    [3] Zhao, Jeffrey, et al. "Description-driven task-oriented dialog modeling." arXiv preprint arXiv:2201.08904 (2022)
  5. DSTC11 Track3 - Speech-aware Dialog Systems

    Speech-aware dialog systems
    • Overcome the challenges in implementing a good speech dialogue system, based on the MultiWOZ[1] dataset
    • Dataset with four audio features
      1. Raw audio in the standard PCM format
      2. Audio encoder output from the ASR system
      3. Transcripts from the ASR system (ASR hypothesis)
      4. Time alignment describing how the recognized words map to the encoder output sequences

    [1] Budzianowski, Paweł, et al. "MultiWOZ--A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling." (EMNLP 2018)
  6. Why Is Speech-aware Dialog Difficult?

    Speech-aware dialog systems
    Extending Text Based Dialogue Systems to Speech Input by ASR Error Correction Transformer

    Written Conversation
    User: I need a hotel in Fisherman’s Wharf
    Agent: Is there a particular price range you are looking for?
    User: I’m looking in the expensive
    Agent: The Suite at Fisherman’s Wharf may work for you
    User: Do you know how much the parking is?
    Agent: It would be 25 dollars per day.

    Spoken Conversation
    User: hi i’m looking for a place at to stay at fisherman’s wharf at a hotel in the expensive
    Agent: sure let me see ok so there is one called the suites at fisherman’s wharf is that something that would be interesting to you
    User: can you tell me how much parking
    Agent: sure okay this hotel charges twenty five dollars per day

    Slide callouts: price range, pressure engine, ummm, uhhh, cost, coast
    Legend: BLUE: Disfluencies, RED: Speech recognition error
  7. ASR Error Correction Transformer

    Our approach

    Training Phase
    Source: “[noise] please get me a ticket for one that leaves at 12:49 a m a m and send me the reference number”
    Target: “please get me a ticket for one that leaves at 0:49 am and send me the reference number.”
    Corrections: Remove noise, Fix time format error, Add period

    Inference: Correct ASR errors
    Source: “[noise] i want to depart form joshua and arrive to waggoner by 12:53 a m a m”
    Prediction: “i want to depart from joshua and arrive to waggoner by 0:53 am.”
    Corrections: Remove noise, Fix speech recognition error, Fix time format error, Add period
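The surface edits named on slide 7 (remove noise, fix time format, add period) can be illustrated with a small rule-based sketch. This is not the authors' method: the slides describe a trained T5 model, and rules like these cannot fix recognition errors such as "form" for "from". The function names and regular expressions below are hypothetical and only mirror the source/target pair shown on the slide.

```python
import re

# Hypothetical patterns, chosen only to match the examples on the slide.
NOISE_TAG = re.compile(r"\[noise\]\s*")
# A clock time followed by one or more spelled-out meridiem markers,
# e.g. "12:49 a m a m" or "12:49 am am".
SPOKEN_TIME = re.compile(r"\b(\d{1,2}):(\d{2})((?:\s+[ap]\s?m)+)\b")


def normalize_time(match):
    """Collapse repeated 'a m' markers and write 12:xx am as 0:xx am."""
    hour, minute = int(match.group(1)), match.group(2)
    meridiem = match.group(3).split()[0][0]  # 'a' or 'p'
    if hour == 12 and meridiem == "a":
        hour = 0  # the slide targets use 0:49 am rather than 12:49 am
    return f"{hour}:{minute} {meridiem}m"


def normalize_utterance(text):
    """Apply the three surface edits: drop noise tags, fix time, add period."""
    text = NOISE_TAG.sub("", text).strip()
    text = SPOKEN_TIME.sub(normalize_time, text)
    if text and text[-1] not in ".?!":
        text += "."
    return text


source = ("[noise] please get me a ticket for one that leaves at "
          "12:49 a m a m and send me the reference number")
print(normalize_utterance(source))
```

A learned corrector is still needed for the "fix speech recognition error" step, which is why the slides train a sequence-to-sequence model on source/target pairs instead of relying on rules.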
  8. Text-Based Dialogue System

    Our approach
    • Our Input data format

    Description
    0 destination of the train
    1 departure location of the train
    2 arrival time of the train
    …
    34 star rating of the hotel
    35 length of stay at the hotel
    36 number of people for the hotel booking

    Conversation history
    [user] i need a train leaving form little mountain this thursday
    [system] in order to better assist you, may i please have your destination?
    …
    [user] [noise] please get me a ticket for one that leaves at 12:49 am am and send me the reference number

    Revised Conversation history
    [user] i need a train leaving from little mountain this thursday.
    [system] in order to better assist you, may i please have your destination?
    …
    [user] please get me a ticket for one that leaves at 0:49 am and send me the reference number.
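As a rough illustration of the input format on slide 8, the sketch below joins indexed slot descriptions with the revised conversation history into one string. The function name, the `index:description` separator, and the single-space joining are assumptions made for illustration; the slides do not specify the exact serialization fed to the model.

```python
def build_dst_input(slot_descriptions, turns):
    """Serialize slot descriptions and dialogue turns into one input string.

    slot_descriptions: list of natural-language slot descriptions; the list
        position is used as the slot index (assumed format "i:description").
    turns: list of (speaker, text) pairs, with speaker "user" or "system".
    """
    description_part = " ".join(
        f"{index}:{description}"
        for index, description in enumerate(slot_descriptions)
    )
    history_part = " ".join(f"[{speaker}] {text}" for speaker, text in turns)
    return f"{description_part} {history_part}"


descriptions = [
    "destination of the train",
    "departure location of the train",
    "arrival time of the train",
]
history = [
    ("user", "i need a train leaving from little mountain this thursday."),
    ("system", "in order to better assist you, may i please have your destination?"),
]
print(build_dst_input(descriptions, history))
```

Indexing slots by position keeps the prompt independent of slot names, which is the idea behind description-driven tracking: the model reads what each slot means and refers back to it by index.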
  9. Adapting Text-based Dialogue State Tracker for Spoken Dialogues

    Our approach
    [Figure labels: Special char, wrong verb; Wrong proper noun; D3ST text based model]
  10. Experiment Settings

    • Use two T5[1] models from Hugging Face[2]
      • for ASR error correction (input length 256, output length 256)
      • for the text-based dialogue system (input length 1024, output length 512)
    • Dataset: MultiWOZ[3] (DSTC11 Track 3 revised)
      • Over 56,000 examples in the training set
      • 7,373 examples in the validation set
      • 7,371 examples in the test set
      • Text data and audio features are included
    • Use the AdamW[4] optimizer with β1 = 0.9, β2 = 0.999, weight decay = 0.01
      • learning rate 0.3
      • batch size 8

    [1] Raffel, Colin, et al. "Exploring the limits of transfer learning with a unified text-to-text transformer." (JMLR 2020)
    [2] Hugging Face T5 - https://huggingface.co/docs/transformers/model_doc/t5
    [3] Budzianowski, Paweł, et al. "MultiWOZ--A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling." (EMNLP 2018)
    [4] Loshchilov, Ilya, and Frank Hutter. "Decoupled weight decay regularization." (ICLR 2019)
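To make the optimizer settings on slide 10 concrete, here is a minimal sketch of one AdamW update for a single scalar parameter, using the hyperparameters listed on the slide. The function name and the `eps` value of 1e-8 are assumptions (the slide gives no epsilon), and a real training run would use a library optimizer instead of this hand-written step.

```python
import math


def adamw_step(param, grad, m, v, t,
               lr=0.3, beta1=0.9, beta2=0.999,
               weight_decay=0.01, eps=1e-8):
    """One AdamW update for a scalar parameter.

    m and v are the running first and second moment estimates,
    t is the 1-based step count. eps is an assumed value.
    """
    # Decoupled weight decay: shrink the parameter directly,
    # independent of the gradient-based update.
    param = param - lr * weight_decay * param
    # Exponential moving averages of the gradient and its square.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction for the zero-initialized moments.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


param, m, v = adamw_step(param=1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(param, m, v)
```

The weight decay term is applied to the parameter itself and not folded into the gradient, which is what distinguishes AdamW from Adam with L2 regularization.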
  11. ASR Error Correction Result

    Experimental Results
    Figure 1: The comparison of sentence error rate performance depending on whether ASR error correction is applied.
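Sentence error rate is commonly computed as the fraction of sentences whose hypothesis does not exactly match the reference. The sketch below assumes that definition and compares whitespace-tokenized sentences; the slides do not state the exact matching rule used for Figure 1, and the example sentences are taken from slide 7 only for illustration.

```python
def sentence_error_rate(references, hypotheses):
    """Fraction of sentences where the hypothesis differs from the reference.

    Sentences are compared after whitespace tokenization (an assumption),
    so differences in spacing alone do not count as errors.
    """
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    if not references:
        raise ValueError("at least one sentence is required")
    errors = sum(
        1 for ref, hyp in zip(references, hypotheses)
        if ref.split() != hyp.split()
    )
    return errors / len(references)


references = [
    "i want to depart from joshua and arrive to waggoner by 0:53 am.",
    "please get me a ticket for one that leaves at 0:49 am and send me the reference number.",
]
uncorrected = [
    "[noise] i want to depart form joshua and arrive to waggoner by 12:53 a m a m",
    "[noise] please get me a ticket for one that leaves at 12:49 a m a m and send me the reference number",
]
print(sentence_error_rate(references, uncorrected))
```

Because a single wrong token makes the whole sentence count as an error, this metric is stricter than word error rate and is sensitive to formatting differences such as time notation and final punctuation.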