Slide 1
Quantifying Memorization of Domain-Specific Pre-trained
Language Models using Japanese Newspaper and Paywalls
Shotaro Ishihara (Nikkei Inc.) https://arxiv.org/abs/2404.17143
Research Question:
Do Japanese PLMs memorize their training data in the same way that English PLMs do?
Approach:
We pre-trained GPT-2 models on Japanese newspaper articles. The opening string
of each article (public) is used as the prompt, and the remaining string behind
the paywall (private) is used as the reference for evaluation.
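A minimal sketch of this evaluation in Python, assuming a Hugging Face Transformers GPT-2 checkpoint; the model name ("gpt2"), the decoding settings, and the character-level prefix match are illustrative assumptions, not the paper's exact configuration.

# Sketch: prompt with the public prefix, greedily continue, and count how many
# leading characters of the continuation match the paywalled ground truth.
# (Checkpoint name and settings are placeholders, not the paper's setup.)
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for a domain-specific Japanese GPT-2
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def memorized_prefix_length(public_prefix: str, private_continuation: str,
                            max_new_tokens: int = 100) -> int:
    inputs = tokenizer(public_prefix, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,  # greedy decoding
        pad_token_id=tokenizer.eos_token_id,
    )
    generated = tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:],
        skip_special_tokens=True,
    )
    # Longest common prefix (in characters) between generation and ground truth.
    n = 0
    for g, t in zip(generated, private_continuation):
        if g != t:
            break
        n += 1
    return n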
Findings:
1. Japanese PLMs sometimes “copy and paste” training data on a large scale.
2. We replicated the empirical finding that memorization is related to
duplication, model size, and prompt length.
[Figure: example output with memorized strings highlighted in green (48 characters memorized).]
The more epochs (i.e., more duplication), the larger the model size, and the longer the prompt, the more memorization.
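The prompt-length trend in this takeaway can be checked with a simple sweep. The sketch below is illustrative only: it assumes a list of full article texts and a scoring function such as the memorized_prefix_length helper sketched earlier; the character cut-offs and the 50-character threshold are hypothetical choices.

# Sketch: memorization rate as a function of prompt length (cut-offs and the
# "memorized" threshold are hypothetical, not the paper's definition).
from typing import Callable, List

def memorization_rate(articles: List[str], prompt_chars: int,
                      score: Callable[[str, str], int],
                      threshold: int = 50) -> float:
    hits = 0
    for text in articles:
        prompt, continuation = text[:prompt_chars], text[prompt_chars:]
        if score(prompt, continuation) >= threshold:
            hits += 1
    return hits / len(articles) if articles else 0.0

# Expected trend: rates increase with prompt length, e.g.
# rates = {n: memorization_rate(articles, n, memorized_prefix_length) for n in (50, 100, 200)}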