Slide 1
Quantifying Memorization of Domain-Specific Pre-trained
Language Models using Japanese Newspaper and Paywalls
Shotaro Ishihara (Nikkei Inc.) https://arxiv.org/abs/2404.17143
Research Question:
Do Japanese PLMs memorize their training data in the same way that English PLMs do?
Approach:
We pre-trained GPT-2 models on Japanese newspaper articles. The opening string
of each article (public) is used as the prompt, and the remaining string behind
the paywall (private) is used as the reference for evaluation.
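A minimal sketch of this evaluation in Python, assuming a Hugging Face Transformers GPT-2 checkpoint; the model name ("gpt2"), the decoding settings, and the character-level prefix match are illustrative assumptions, not the paper's exact configuration.

# Sketch: prompt with the public prefix, greedily continue, and count how many
# leading characters of the continuation match the paywalled ground truth.
# (Checkpoint name and settings are placeholders, not the paper's setup.)
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for a domain-specific Japanese GPT-2
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def memorized_prefix_length(public_prefix: str, private_continuation: str,
                            max_new_tokens: int = 100) -> int:
    inputs = tokenizer(public_prefix, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,  # greedy decoding
        pad_token_id=tokenizer.eos_token_id,
    )
    generated = tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:],
        skip_special_tokens=True,
    )
    # Longest common prefix (in characters) between generation and ground truth.
    n = 0
    for g, t in zip(generated, private_continuation):
        if g != t:
            break
        n += 1
    return n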
Findings:
1. Japanese PLMs sometimes “copy and paste” training data on a large scale.
2. We replicated the empirical finding that memorization is related to
duplication, model size, and prompt length.
[Figure: example output with memorized strings highlighted in green (48 characters memorized).]
The more epochs (i.e., more duplication), the larger the model size, and the longer the prompt, the more memorization.
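The prompt-length trend in this takeaway can be checked with a simple sweep. The sketch below is illustrative only: it assumes a list of full article texts and a scoring function such as the memorized_prefix_length helper sketched earlier; the character cut-offs and the 50-character threshold are hypothetical choices.

# Sketch: memorization rate as a function of prompt length (cut-offs and the
# "memorized" threshold are hypothetical, not the paper's definition).
from typing import Callable, List

def memorization_rate(articles: List[str], prompt_chars: int,
                      score: Callable[[str, str], int],
                      threshold: int = 50) -> float:
    hits = 0
    for text in articles:
        prompt, continuation = text[:prompt_chars], text[prompt_chars:]
        if score(prompt, continuation) >= threshold:
            hits += 1
    return hits / len(articles) if articles else 0.0

# Expected trend: rates increase with prompt length, e.g.
# rates = {n: memorization_rate(articles, n, memorized_prefix_length) for n in (50, 100, 200)}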