Training Data Extraction From Pre-trained Language Models: A Survey
Overview
● This study is the first to provide a comprehensive survey
of training data extraction from Pre-trained Language
Models (PLMs).
● Our review covers more than 100 key papers in the
fields of NLP and security:
○ Preliminary knowledge
○ Taxonomy of memorization, attacks, defenses, and
empirical findings
○ Future research directions
Shotaro Ishihara (Nikkei Inc., [email protected])
arXiv preprint: https://arxiv.org/abs/2305.16157
Attacks, defenses, and findings
● The attack consists of two steps: candidate generation
and membership inference (a minimal sketch follows
this list).
● Pioneering work demonstrated that personal
information can be extracted from pre-trained
GPT-2 models [1].
● Experiments show that memorization increases with
model size, prompt length, and duplication in the
training data [2].
● Defenses are categorized into pre-processing,
training, and post-processing (a toy pre-processing
example also follows).
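Below is a minimal sketch of the two-step attack in the spirit of [1], assuming the HuggingFace transformers API and the public gpt2 checkpoint. The prompt, sampling settings, and the use of raw perplexity as the membership-inference signal are illustrative simplifications of the metrics used in [1], not a prescribed recipe.

```python
# Sketch of the two-step extraction attack:
# (1) candidate generation by sampling from the PLM,
# (2) membership inference by ranking candidates with perplexity.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def generate_candidates(prompt: str, n: int = 10, max_length: int = 64) -> list[str]:
    """Step 1: sample n continuations of the prompt from the model."""
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,
        top_k=40,                # top-k sampling, as in [1]
        max_length=max_length,
        num_return_sequences=n,
        pad_token_id=tokenizer.eos_token_id,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

def perplexity(text: str) -> float:
    """Step 2: lower perplexity suggests the text may be memorized."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

# The prompt is a hypothetical example; rank candidates so the
# lowest-perplexity generations surface as extraction candidates.
candidates = generate_candidates("My phone number is")
for ppl, text in sorted((perplexity(c), c) for c in candidates)[:3]:
    print(f"{ppl:8.1f}  {text!r}")
```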
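As an illustration of a pre-processing defense, here is a toy exact n-gram deduplication pass over a training corpus, motivated by the finding in [2] that duplicated training sequences are memorized more readily. The n-gram length and the drop-whole-document policy are arbitrary choices for this sketch.

```python
# Toy pre-processing defense: drop any document whose n-grams overlap
# an earlier document, removing the duplication that drives memorization [2].
def deduplicate(corpus: list[str], n: int = 8) -> list[str]:
    seen: set[tuple[str, ...]] = set()
    kept: list[str] = []
    for doc in corpus:
        tokens = doc.split()
        ngrams = {tuple(tokens[i:i + n])
                  for i in range(max(len(tokens) - n + 1, 1))}
        if ngrams & seen:
            continue  # overlaps an earlier document; drop it
        seen |= ngrams
        kept.append(doc)
    return kept

corpus = ["the quick brown fox jumps over the lazy dog",
          "the quick brown fox jumps over the lazy dog"]
print(len(deduplicate(corpus)))  # duplicate removed -> 1
```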
Definition of memorization
Exact memorization requires the model to
reproduce a training example verbatim, whereas
approximate memorization only requires the
output to be sufficiently similar. With the advent
of approximate memorization, the concern has
become closely related to the well-known
model inversion attack.
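A schematic contrast of the two notions; the similarity measure (difflib's ratio) and the 0.75 threshold are arbitrary illustrative choices, as the surveyed papers use a variety of similarity metrics.

```python
# Schematic contrast between exact and approximate memorization.
from difflib import SequenceMatcher

def is_exact_memorization(generation: str, training_example: str) -> bool:
    # Exact memorization: the training example appears verbatim.
    return training_example in generation

def is_approx_memorization(generation: str, training_example: str,
                           threshold: float = 0.75) -> bool:
    # Approximate memorization: the generation is merely "close enough".
    # Both the similarity measure and the threshold are illustrative.
    similarity = SequenceMatcher(None, generation, training_example).ratio()
    return similarity >= threshold

print(is_exact_memorization("call me at 555-0100 today", "555-0100"))        # True
print(is_approx_memorization("call me at 555-0100", "call me at 555-0199"))  # True
```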
Future research directions
● Is memorization always evil?
○ memorization vs association
○ memorization vs performance
● Toward broader research fields
○ model inversion attacks
○ plagiarism detection
○ image similarity
● Evaluation schema
○ benchmark dataset
○ evaluation metrics
● Model variation
○ masked language models
References
[1] Nicholas Carlini et al. Extracting training data from large language models. In USENIX Security 2021.
[2] Nicholas Carlini et al. Quantifying memorization across neural language models. In ICLR 2023.