
Continual Prompt Tuning for Dialog State Tracking

Seunghyun Hwang

February 16, 2023

Transcript

  1. Continual Prompt Tuning for Dialog State Tracking Presented by Seunghyun

    Hwang 2023. 2. 16. 1 Qi Zhu, Bing Li, Fei Mi, Xiaoyan Zhu, Minlie Huang. ACL 2022. Reading club 3 Research outcome 0
  2. Contents 1. Continual Prompt Tuning for Dialogue State Tracking -

    Overview 2. Background Information 1. Continual learning 2. Prompt-based tuning 3. Dialogue state tracking 3. Continual Prompt Tuning for Dialogue State Tracking - Model Structure (Method) 4. Experiment Result 2
  3. Contents 1. Continual Prompt Tuning for Dialogue State Tracking -

    Overview 2. Background Information 1. Continual learning 2. Prompt-based tuning 3. Dialogue state tracking 3. Continual Prompt Tuning for Dialogue State Tracking - Model Structure (Method) 4. Experiment Result 3
  4. Continual Prompt Tuning for DST 5 Overview: Continual Learning, Prompt

    Tuning, Dialogue State Tracking Model - Catastrophic forgetting problem - The model should support new domains and services - Parameter-efficient, to avoid forgetting - Knowledge transfer between tasks - It is crucial for a dialog system to continually learn new tasks - A deployed dialog system is often required to do the above
  5. Contents 1. Continual Prompt Tuning for Dialogue State Tracking -

    Overview 2. Background Information 1. Continual learning 2. Prompt-based tuning 3. Dialogue state tracking 3. Continual Prompt Tuning for Dialogue State Tracking - Model Structure (Method) 4. Experiment Result 8
  6. Continual Learning 9 • Similar concept to incremental learning •

    Continually acquiring knowledge from a data stream and reusing it for future learning while avoiding forgetting • Three families of continual learning methods • Rehearsal methods[1] • Regularization methods[2] • Architectural methods[3] [1] Rebuffi, Sylvestre-Alvise, et al. "iCaRL: Incremental classifier and representation learning.", CVPR 2017 [2] Kirkpatrick, James, et al. "Overcoming catastrophic forgetting in neural networks.", PNAS 2017 [3] Rusu, Andrei A., et al. "Progressive neural networks.", 2016 Background Information
  7. Continual Learning in dialogue systems 10 • Various general CL

    methods have been applied[1] • AdapterCL[2] • Most closely related to this paper • Freezes the pre-trained model and learns an adapter • This paper's method is more parameter-efficient [1] Lee, Sungjin. "Toward continual learning for conversational agents.", 2017 [2] Madotto, Andrea, et al. "Continual learning in task-oriented dialogue systems.", 2021 Background Information
  8. Prompt Tuning 11 • Using a textual prompt to convert

    downstream tasks into the format of the pre-training task lets a language model perform them without full fine-tuning[1] • Soft prompts whose embeddings are learned through back-propagation[2] • Prompt tuning is parameter-efficient and becomes more competitive with fine-tuning as the model size grows[3] [1] Brown, Tom, et al. "Language models are few-shot learners.", NeurIPS 2020 [2] Liu, Xiao, et al. "GPT understands, too.", 2021 [3] Lester, Brian, Rami Al-Rfou, and Noah Constant. "The power of scale for parameter-efficient prompt tuning.", 2021 Background Information
  9. Prompt Tuning 12 • Prompt tuning differs from an embedding adapter[1]

    • The embedding adapter transforms all token embeddings but does not affect the transformer layers' computation • Gu[2] and Vu[3] further explore the transferability of soft prompts across tasks • They consider one-step adaptation; this paper extends prompt transfer to the continual learning setting [1] Zhu, Yaoming, et al. "Counter-interference adapter for multilingual machine translation.", 2021 [2] Gu, Yuxian, et al. "PPT: Pre-trained prompt tuning for few-shot learning.", 2021 [3] Vu, Tu, et al. "SPoT: Better frozen model adaptation through soft prompt transfer.", 2021 Background Information
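A minimal sketch of the soft prompt tuning described on the two slides above, in plain PyTorch. The `backbone` seq2seq model, its `embed` layer, and the `inputs_embeds`/`labels` call signature are assumptions standing in for whatever frozen pre-trained model is used; only the prompt embeddings receive gradients.

```python
import torch
import torch.nn as nn

class SoftPromptModel(nn.Module):
    """Prepend m trainable prompt embeddings to the input; keep the backbone frozen."""

    def __init__(self, backbone, num_prompt_tokens=100):
        super().__init__()
        self.backbone = backbone                  # hypothetical pre-trained seq2seq model
        for p in self.backbone.parameters():
            p.requires_grad = False               # prompt tuning: backbone is never updated
        d_model = self.backbone.embed.embedding_dim   # assumed token-embedding layer
        self.prompt = nn.Parameter(0.02 * torch.randn(num_prompt_tokens, d_model))

    def forward(self, input_ids, labels):
        tok_emb = self.backbone.embed(input_ids)                  # (B, L, d)
        prompt = self.prompt.unsqueeze(0).expand(tok_emb.size(0), -1, -1)
        inputs_embeds = torch.cat([prompt, tok_emb], dim=1)       # (B, m + L, d)
        # Only self.prompt receives gradients; the call signature below is an assumption.
        return self.backbone(inputs_embeds=inputs_embeds, labels=labels)
```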
  10. Dialogue State Tracking (DST) 13 [Pipeline diagram with modules: NLU (Natural

    Language Understanding), DST (Dialogue State Tracking), NLG (Natural Language Generation), DP (Dialogue Policy learning); example states: Restaurant_Book(Area=Hoegi), then Restaurant_Book(Area=Hoegi, people_num=5)] DST is a dialogue-level task that maps partial dialogues into dialogue states. • Input: a dialogue / a turn • Output: a dialogue state (e.g. slot-value pairs) Example: "Can you help me book a restaurant near Hoegi Station?" "For five people, thanks!" Dialogue state tracking
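For concreteness, the slide's running example expressed as data: each user turn maps the partial dialogue so far to an updated set of slot-value pairs. The slot names below are illustrative, following the slide's Restaurant_Book example.

```python
# Dialogue state after each user turn (slot -> value), following the slide's example.
turn_1 = "Can you help me book a restaurant near Hoegi Station?"
state_1 = {"Restaurant_Book-area": "Hoegi"}

turn_2 = "For five people, thanks!"
state_2 = {
    "Restaurant_Book-area": "Hoegi",        # carried over from turn 1
    "Restaurant_Book-people_num": "5",      # newly filled in turn 2
}
```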
  11. Dialogue State Tracking 14 • Generation-based models either generate all

    (slot, value) pairs in one pass[1] • or generate a value for each given slot separately[2] • Trade-off: efficiency vs. incorporating more information • This paper integrates multiple slot descriptions into a single query and generates all values in one pass [1] Madotto, Andrea, et al. "Continual learning in task-oriented dialogue systems.", 2021 [2] Wu, Chien-Sheng, et al. "Transferable multi-domain state generator for task-oriented dialogue systems.", 2019 Background Information
  12. Contents 1. Continual Prompt Tuning for Dialogue State Tracking -

    Overview 2. Background Information 1. Continual learning 2. Prompt-based tuning 3. Dialogue state tracking 3. Continual Prompt Tuning for Dialogue State Tracking - Model Structure (Method) 4. Experiment Result 15
  13. Problem setting 16 Model Overview Sequence of tasks T_1,

    ..., T_t and datasets D_1, ..., D_t. Predict target y given input x and task T_k: model f : X × T → Y. Each task T_k has a set of n_k slots S_k = {s_1, ..., s_{n_k}}. Input x is a dialogue and output y contains slot-value pairs {(s_1, v_1), ..., (s_{n_k}, v_{n_k})}. Prompt Tuning: freeze the pre-trained model and add new soft prompt tokens; each task T_k gets P_k = P_k^1 P_k^2 ... P_k^m, i.e. m new soft prompt tokens. Model Structure
  14. Problem setting 17 DST as Masked Spans Recovering Function g_k

    : X × Y → V* × V*, where V is the vocabulary. Each sample of task T_k's training data is transformed: (x̃, ỹ) = g_k(x, y), where x̃ and ỹ are the model input and output. Q_k = "d_1^k: <M_1> ... d_{n_k}^k: <M_{n_k}>" (slot descriptions followed by sentinel tokens), x̃ = [x; Q_k; P_k], ỹ = "<M_1> v_1^k ... <M_{n_k}> v_{n_k}^k". Training objective: L_{θ_{P_k}}(D_k) = − Σ_{j=1}^{|D_k|} log p_θ(ỹ_j^k | [x_j^k; Q_k; P_k]). Model Structure
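A sketch of the transformation g_k described above: the dialogue is concatenated with a query Q_k that lists each slot description followed by a sentinel token, and the target recovers each sentinel with its slot value. The helper name and the description strings are made up for illustration; the soft prompt P_k is prepended at the embedding level rather than as text.

```python
def serialize_example(dialogue, slot_descriptions, values):
    """Build (x_tilde, y_tilde): DST as masked-span recovering over one task's slots."""
    query = " ".join(f"{desc}: <M{i + 1}>" for i, desc in enumerate(slot_descriptions))
    x_tilde = f"{dialogue} {query}"   # the soft prompt P_k is prepended as embeddings, not text
    y_tilde = " ".join(f"<M{i + 1}> {value}" for i, value in enumerate(values))
    return x_tilde, y_tilde

x_tilde, y_tilde = serialize_example(
    "Can you help me book a restaurant near Hoegi Station? For five people, thanks!",
    ["area of the restaurant", "number of people"],
    ["Hoegi", "5"],
)
# x_tilde ends with "... area of the restaurant: <M1> number of people: <M2>"
# y_tilde == "<M1> Hoegi <M2> 5"
```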
  15. Continual learning : Forward Transfer 19 • Continual Prompt Initialization

    • CLInit – selects the last task's prompt P_{k-1} to initialize the current task's prompt P_k • SelectInit – selects the previous prompt with the lowest loss to initialize P_k • Query Fusion (sketched below) • Sample n_1 slots from S_k randomly, where n_1 is sampled from [1, |S_k|] uniformly • Sample n_2 slots from ∪_{i<k} S_i randomly, where n_2 is sampled from [1, n_1] uniformly • Combine the n_1 and n_2 slots' descriptions in a random order: Q_k' Model Structure
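A minimal sketch of the Query Fusion sampling procedure from the slide above, assuming slots are plain Python lists; the uniform draws use the standard `random` module.

```python
import random

def query_fusion(current_slots, previous_slots):
    """Mix current-task slots with slots from earlier tasks to build a fused query Q_k'."""
    n1 = random.randint(1, len(current_slots))          # uniform over [1, |S_k|]
    sampled_current = random.sample(current_slots, n1)
    sampled_previous = []
    if previous_slots:                                  # the first task has no earlier slots
        n2 = random.randint(1, n1)                      # uniform over [1, n1]
        sampled_previous = random.sample(previous_slots, min(n2, len(previous_slots)))
    fused = sampled_current + sampled_previous
    random.shuffle(fused)                               # combine descriptions in a random order
    return fused
```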
  16. Continual learning : Forward Transfer 20 • Memory Replay •

    Store a few samples for each task and replay them when training on new tasks • Store |M| samples M_i for each task T_i • Change the loss function to L_{θ_{P_k}}(D_k ∪ M_{<k}) (sketched below) Model Structure
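A sketch of the memory replay described above: keep |M| examples per finished task and mix all stored examples from earlier tasks into the new task's training data, so the loss is computed over D_k ∪ M_<k. How the stored examples are chosen is an assumption here (uniform sampling).

```python
import random

memory = {}  # task_id -> stored (input, output) examples

def store_memory(task_id, dataset, samples_per_task=50):
    """After finishing a task, keep |M| of its examples (uniform sampling assumed)."""
    memory[task_id] = random.sample(dataset, min(samples_per_task, len(dataset)))

def replay_training_set(task_id, current_data):
    """Train the new prompt on D_k together with all earlier tasks' memories M_<k."""
    replayed = [ex for tid, examples in memory.items() if tid < task_id for ex in examples]
    return current_data + replayed
```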
  17. Continual learning : Backward Transfer 21 • Memory-Guided Backward Transfer

    • For each previous task T_i, i < k, initialize a new prompt P_i^(k) to P_i • Train it on the current task's data D_k with the memory M_i as regularization • Compute one gradient from the current data and one from the memory • Update with a combination of the two gradients (equation on the slide; one possible realization is sketched below) Model Structure
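The update equation itself is not reproduced in the transcript. The sketch below shows one common way to combine a current-task gradient with a memory gradient, a GEM/A-GEM-style projection that keeps the update from increasing the memory loss; it is an illustrative stand-in, not necessarily the paper's exact rule.

```python
import torch

def combine_gradients(g_data, g_memory):
    """GEM/A-GEM-style projection over flattened gradients (illustrative, not necessarily
    the paper's exact rule): keep g_data unless it conflicts with the memory gradient,
    otherwise project out the conflicting component so the memory loss is not increased."""
    dot = torch.dot(g_data, g_memory)
    if dot >= 0:
        return g_data
    return g_data - (dot / torch.dot(g_memory, g_memory)) * g_memory
```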
  18. Contents 1. Continual Prompt Tuning for Dialogue State Tracking -

    Overview 2. Background Information 1. Continual learning 2. Prompt-based tuning 3. Dialogue state tracking 3. Continual Prompt Tuning for Dialogue State Tracking - Model Structure (Method) 4. Experiment Result 23
  19. Experiment setting 24 • Dataset • Schema-Guided Dialog dataset (SGD)[1] •

    Evaluation Method • Joint Goal Accuracy (JGA)[2] • Effect of forward transfer (FWT)[3] • Effect of backward transfer (BWT)[3] Experiment Result [1] Rastogi, Abhinav, et al. "Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset.", AAAI 2020 [2] Wu, Chien-Sheng, et al. "Transferable multi-domain state generator for task-oriented dialogue systems.", 2019 [3] Lopez-Paz, David, and Marc'Aurelio Ranzato. "Gradient episodic memory for continual learning.", NeurIPS 2017
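Joint Goal Accuracy counts a dialogue turn as correct only when the full predicted state matches the gold state exactly; a minimal sketch:

```python
def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of turns whose predicted dialogue state matches the gold state exactly."""
    correct = sum(pred == gold for pred, gold in zip(predicted_states, gold_states))
    return correct / len(gold_states)
```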
  20. Experiment result 27 Experiment Result [Figures: JGA with different model

    sizes and prompt lengths; FWT with different model sizes and prompt lengths]