Shotaro Ishihara (2023). Training Data Extraction From Pre-trained Language Models: A Survey. In Proceedings of the Third Workshop on Trustworthy Natural Language Processing.
Nicholas Carlini et al. (2021). Extracting Training Data from Large Language Models. In USENIX Security 2021.
Nicholas Carlini et al. (2023). Quantifying Memorization Across Neural Language Models. In ICLR 2023.
Training Data Extraction From Pre-trained Language Models: A Survey
● This study is the first to provide a comprehensive survey of training data extraction from Pre-trained Language Models (PLMs).
● Our review covers more than 100 key papers in NLP and related fields:
○ Preliminary knowledge
○ Taxonomy of memorization, attacks, defenses, and findings
○ Future research directions
Shotaro Ishihara (Nikkei Inc., [email protected]) / arXiv preprint: https://arxiv.org/abs/2305.16157
Attacks, defenses, and findings
● The attack consists of two steps: candidate generation and membership inference (see the sketch after this list).
● The pioneering work identified that personal information can be extracted from pre-trained GPT-2 models (Carlini et al., 2021).
● Experiments show that memorization is related to model size, prompt length, and duplication in the training data (Carlini et al., 2023).
● Defenses include approaches at the pre-processing, training, and post-processing stages.
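Below is a minimal sketch of the two-step attack, assuming the publicly available Hugging Face "gpt2" checkpoint as the target model; the prompt, the sampling settings, and the use of plain perplexity as the membership-inference score are illustrative assumptions rather than the exact configuration of any surveyed paper.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def generate_candidates(prompt: str, n: int = 10, max_new_tokens: int = 64):
    # Step 1 (candidate generation): sample many continuations of a prompt.
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,
        top_k=40,
        max_new_tokens=max_new_tokens,
        num_return_sequences=n,
        pad_token_id=tokenizer.eos_token_id,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

def perplexity(text: str) -> float:
    # Step 2 (membership inference): lower perplexity is weak evidence that
    # the candidate was memorized from the training data.
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

candidates = generate_candidates("My phone number is")  # hypothetical prompt
for text in sorted(candidates, key=perplexity)[:3]:
    print(f"{perplexity(text):8.2f}  {text!r}")

In practice the surveyed attacks refine both steps, e.g., by sampling diverse prompts from web text and by calibrating the membership score (comparing perplexity against a reference model or zlib entropy) instead of using raw perplexity.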
Definition of memorization
With the advent of approximate memorization, the concern has become similar to the well-known issue of model inversion attacks.
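As a concrete illustration of the definitions, the following is a minimal sketch of a memorization check by prefix prompting, assuming the Hugging Face "gpt2" checkpoint and a made-up example string: an exactly reproduced continuation corresponds to verbatim memorization, while the token-overlap score is a stand-in for the similarity thresholds used in approximate-memorization definitions.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def memorization_check(example: str, prefix_tokens: int = 10):
    # Prompt the model with a prefix of a (supposed) training example and
    # compare its greedy continuation with the true continuation.
    ids = tokenizer(example, return_tensors="pt").input_ids
    prefix, target = ids[:, :prefix_tokens], ids[0, prefix_tokens:]
    with torch.no_grad():
        output = model.generate(
            prefix,
            do_sample=False,                     # greedy decoding
            max_new_tokens=target.shape[0],
            pad_token_id=tokenizer.eos_token_id,
        )
    continuation = output[0, prefix_tokens:]
    n = min(continuation.shape[0], target.shape[0])
    overlap = (continuation[:n] == target[:n]).float().mean().item()
    verbatim = n == target.shape[0] and overlap == 1.0
    return verbatim, overlap

# Made-up string standing in for a training example containing sensitive data.
verbatim, overlap = memorization_check(
    "My secret API key is sk-0000-1111-2222 and it must never be shared with anyone."
)
print(f"verbatim: {verbatim}, token overlap: {overlap:.2f}")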
Future research directions
● Is memorization always evil?
○ memorization vs association
○ memorization vs performance
● Toward broader research fields
○ model inversion attacks
○ plagiarism detection
○ image similarity
● Evaluation schema
○ benchmark dataset
○ evaluation metrics
● Model variation
○ masked language models