Adapting Text-based Dialogue State Tracker for Spoken Dialogues

Jaeseok Yoon, Seunghyun Hwang, Ran Han, Jeonguk Bang, Kee-Eung Kim
DSTC11 Track 3 - Speech-aware dialog systems
Presented by Seunghyun Hwang
Kim Jaechul Graduate School of AI, KAIST, Seoul, Republic of Korea
2023. 9. 11.

Task-Oriented Dialogue System
EX) Part of a hotel reservation
User: Can you help me book a hotel near Prague castle?
System: For how many people?
User: For two people, thanks!
System: How about “OREA Hotel”?

Speech-Aware Dialog System (TOD System)
EX) Part of a hotel reservation
User: Can you help me book a hotel near Prague castle?
System: For how many people?
User: For two people, thanks!
System: How about “OREA Hotel”?

Related Works
• TRADE[1] (TRAnsferable Dialogue statE generator) uses shared parameters and a copy mechanism for robust, cross-domain dialogue state tracking to avoid forgetting previously learned tasks.
• UBAR[2] fine-tunes a GPT-2 model on entire dialog sessions, achieving state-of-the-art performance and transferability to new domains.
• D3ST[3] (Description Driven Dialog State Tracking) uses natural language descriptions for task schemata, leading to better understanding and higher performance in state tracking.
[1] Wu, Chien-Sheng, et al. "Transferable multi-domain state generator for task-oriented dialogue systems." (ACL 2019)
[2] Yang, Yunyi, Yunhao Li, and Xiaojun Quan. "UBAR: Towards fully end-to-end task-oriented dialog system with GPT-2." (AAAI 2021)
[3] Zhao, Jeffrey, et al. "Description-driven task-oriented dialog modeling." arXiv preprint arXiv:2201.08904 (2022)

DSTC11 Track 3 - Speech-aware Dialog Systems
• Overcome the challenges of implementing a good speech dialogue system, based on the MultiWOZ[1] dataset
• Dataset with four audio features
  1. Raw audio in the standard PCM format
  2. Audio encoder output from the ASR system
  3. Transcripts from the ASR system (ASR hypothesis)
  4. Time alignment describing how the recognized words map to the encoder output sequences
[1] Budzianowski, Paweł, et al. "MultiWOZ - A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling." (EMNLP 2018)

Why Is Speech-aware Dialog Difficult? (Speech-aware dialog systems)
Extending Text Based Dialogue Systems to Speech Input by ASR Error Correction Transformer
Written Conversation
User: I need a hotel in Fisherman’s Wharf
Agent: Is there a particular price range you are looking for?
User: I’m looking in the expensive
Agent: The Suite at Fisherman’s Wharf may work for you
User: Do you know how much the parking is?
Agent: It would be 25 dollars per day.
Spoken Conversation
User: hi i’m looking for a place at to stay at fisherman’s wharf at a hotel in the expensive
Agent: sure let me see ok so there is one called the suites at fisherman’s wharf is that something that would be interesting to you
User: can you tell me how much parking
Agent: sure okay this hotel charges twenty five dollars per day
Highlighted on the slide:
• Disfluencies (blue), e.g. "ummm", "uhhh"
• Speech recognition errors (red), e.g. "price range" vs. "pressure engine", "cost" vs. "coast"

Adapting Text-based Dialogue State Tracker for Spoken Dialogues (Our approach)
[Figure: overview of the approach, with components labeled [1], [2], [3]]

ASR Error Correction Transformer (Our approach)

ASR Error Correction Transformer (Our approach)
Training phase
Source: “[noise] please get me a ticket for one that leaves at 12:49 a m a m and send me the reference number”
Target: “please get me a ticket for one that leaves at 0:49 am and send me the reference number.”
• Remove noise
• Fix time format error
• Add period
Inference: correct ASR errors
Source: “[noise] i want to depart form joshua and arrive to waggoner by 12:53 a m a m”
Prediction: “i want to depart from joshua and arrive to waggoner by 0:53 am.”
• Remove noise
• Fix speech recognition error
• Fix time format error
• Add period
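The annotated corrections can be illustrated with a few hand-written rules. This is only a sketch of the target normalization, not the authors' method: the deck trains a transformer for this step, and the function name `normalize_asr_text` and its regular expressions are assumptions made for illustration.

```python
import re


def normalize_asr_text(text: str) -> str:
    """Apply rule-based versions of three annotated corrections (hypothetical helper)."""
    # Remove noise markers such as "[noise]".
    text = re.sub(r"\[noise\]", " ", text)
    # Collapse spelled-out, possibly repeated meridiem tokens: "a m a m" -> "am".
    text = re.sub(r"\b([ap])\s?m(?:\s+\1\s?m)*\b", r"\1m", text)
    # Fix the time format shown on the slide: "12:49 am" -> "0:49 am".
    text = re.sub(r"\b12:(\d{2}) am\b", r"0:\1 am", text)
    # Tidy whitespace and add a final period if no end punctuation is present.
    text = " ".join(text.split())
    if text and text[-1] not in ".?!":
        text += "."
    return text


source = ("[noise] please get me a ticket for one that leaves at "
          "12:49 a m a m and send me the reference number")
print(normalize_asr_text(source))
```

Rules like these cannot repair a genuine recognition error such as "form" for "from", which is one reason a learned correction model is a natural fit for this step.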

Text-Based Dialogue System (Our approach)

Text-Based Dialogue System (Our approach)
• Our input data format
Description:
0 destination of the train
1 departure location of the train
2 arrival time of the train
…
34 star rating of the hotel
35 length of stay at the hotel
36 number of people for the hotel booking
Conversation history:
[user] i need a train leaving form little mountain this thursday
[system] in order to better assist you, may i please have your destination?
…
[user] [noise] please get me a ticket for one that leaves at 12:49 am am and send me the reference number
Revised conversation history:
[user] i need a train leaving from little mountain this thursday.
[system] in order to better assist you, may i please have your destination?
…
[user] please get me a ticket for one that leaves at 0:49 am and send me the reference number.
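The input format above, indexed slot descriptions followed by the conversation history, could be assembled roughly as follows. This is a minimal sketch: the function name `build_dst_input`, the single-space separators, and the ordering of the two parts are assumptions, since the slide shows the pieces but not the exact serialization.

```python
from typing import List, Tuple


def build_dst_input(descriptions: List[str], turns: List[Tuple[str, str]]) -> str:
    """Join indexed slot descriptions and speaker-tagged turns into one string."""
    # Prefix each description with its index, e.g. "0 destination of the train".
    schema = " ".join(f"{i} {desc}" for i, desc in enumerate(descriptions))
    # Prefix each utterance with its speaker tag, e.g. "[user] ...".
    history = " ".join(f"[{speaker}] {utterance}" for speaker, utterance in turns)
    return f"{schema} {history}"


descriptions = [
    "destination of the train",
    "departure location of the train",
    "arrival time of the train",
]
turns = [
    ("user", "i need a train leaving from little mountain this thursday."),
    ("system", "in order to better assist you, may i please have your destination?"),
]
print(build_dst_input(descriptions, turns))
```

In the pipeline described here, the turns passed in would be the revised (ASR-corrected) conversation history rather than the raw transcripts.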

Named Entity Post-Processing (Our approach)

Adapting Text-based Dialogue State Tracker for Spoken Dialogues (Our approach)
[Figure: overview of the approach. Labels: special char, wrong verb; wrong proper noun; D3ST text-based model]
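The deck names wrong proper nouns as an error type and lists a named entity post-processing step, but the transcript does not describe how that step works. One plausible realization, offered purely as an assumption, is to snap each predicted slot value to the closest name in a list of known entities. The helper `snap_to_known_entity`, the `cutoff` value, and the entity list below are all hypothetical.

```python
import difflib
from typing import List


def snap_to_known_entity(value: str, known_entities: List[str], cutoff: float = 0.8) -> str:
    """Return the closest known entity name, or the value unchanged if none is close."""
    # get_close_matches ranks candidates by difflib's similarity ratio.
    matches = difflib.get_close_matches(value, known_entities, n=1, cutoff=cutoff)
    return matches[0] if matches else value


# Hypothetical entity list built from names that appear in the deck's examples.
known_entities = ["orea hotel", "fisherman's wharf", "little mountain"]

print(snap_to_known_entity("fishermans warf", known_entities))
print(snap_to_known_entity("zzz", known_entities))
```

A high cutoff keeps the step conservative: values with no near match are passed through rather than replaced with an unrelated entity.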

Experiment Settings
• Use two T5[1] models from Hugging Face[2]
  • for ASR error correction (input length 256, output length 256)
  • for text-based dialogue system (input length 1024, output length 512)
• Dataset: MultiWOZ[3] (DSTC11 Track 3 revised)
  • Over 56,000 examples in the training set
  • 7,373 examples in the validation set
  • 7,371 examples in the test set
  • Text data and audio features are included
• Use AdamW[4] optimizer with β1 = 0.9, β2 = 0.999, weight decay = 0.01
  • learning rate 0.3
  • batch size 8
[1] Raffel, Colin, et al. "Exploring the limits of transfer learning with a unified text-to-text transformer." (JMLR 2020)
[2] Hugging Face T5 - https://huggingface.co/docs/transformers/model_doc/t5
[3] Budzianowski, Paweł, et al. "MultiWOZ - A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling." (EMNLP 2018)
[4] Loshchilov, Ilya, and Frank Hutter. "Decoupled weight decay regularization." (ICLR 2019)
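For reference, the settings listed above can be collected in a small configuration object. This is a sketch only: the class and constant names are assumptions, the values are copied from the slide as stated, and no model size or training loop is implied. The keys returned by `adamw_kwargs` follow the keyword names that `torch.optim.AdamW` accepts (`lr`, `betas`, `weight_decay`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Seq2SeqLengths:
    """Maximum token lengths for one of the two T5 models."""
    max_input_length: int
    max_output_length: int


@dataclass(frozen=True)
class OptimizerSettings:
    """Optimizer and batch settings as stated on the slide."""
    learning_rate: float = 0.3
    beta1: float = 0.9
    beta2: float = 0.999
    weight_decay: float = 0.01
    batch_size: int = 8

    def adamw_kwargs(self) -> dict:
        # Keyword arguments in the shape torch.optim.AdamW expects.
        return {
            "lr": self.learning_rate,
            "betas": (self.beta1, self.beta2),
            "weight_decay": self.weight_decay,
        }


ASR_ERROR_CORRECTION = Seq2SeqLengths(max_input_length=256, max_output_length=256)
TEXT_BASED_DIALOGUE_SYSTEM = Seq2SeqLengths(max_input_length=1024, max_output_length=512)
OPTIMIZER = OptimizerSettings()

print(OPTIMIZER.adamw_kwargs())
```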

ASR Error Correction Result (Experimental Results)
Figure 1: Comparison of sentence error rate with and without ASR error correction.
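Sentence error rate is commonly computed as the fraction of sentences that do not match their reference exactly. The sketch below follows that common definition; the function name and the whitespace stripping are assumptions, and the exact text normalization used for Figure 1 is not given in the deck.

```python
from typing import Sequence


def sentence_error_rate(hypotheses: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of hypotheses that differ from their reference sentence."""
    if len(hypotheses) != len(references):
        raise ValueError("hypotheses and references must have the same length")
    if not references:
        raise ValueError("at least one sentence is required")
    # A sentence counts as an error if it is not an exact match after stripping.
    errors = sum(1 for hyp, ref in zip(hypotheses, references) if hyp.strip() != ref.strip())
    return errors / len(references)


references = [
    "i want to depart from joshua and arrive to waggoner by 0:53 am.",
    "please get me a ticket for one that leaves at 0:49 am and send me the reference number.",
]
raw_asr = [
    "[noise] i want to depart form joshua and arrive to waggoner by 12:53 a m a m",
    "[noise] please get me a ticket for one that leaves at 12:49 a m a m and send me the reference number",
]
print(sentence_error_rate(raw_asr, references))
print(sentence_error_rate(references, references))
```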

Main Result: Ablation Study (Experimental Results)

Main Result: Competition Score (Experimental Results)
