... using Japanese Newspaper

Shotaro Ishihara (Nikkei Inc.) and Hiromu Takahashi

Research Questions:
1) Do Japanese PLMs memorize their training data as much as English PLMs do?
2) Is the memorized training data as detectable as it is for English PLMs?

Approach:
1. Pre-train GPT-2 models on Japanese newspaper articles.
2. Quantify memorization by comparing generated candidates against the reference continuations (see the first sketch below).
3. Run membership inference attacks using the generated candidates (see the second sketch below).

Findings:
1. Japanese PLMs sometimes "copy and paste" on a large scale (up to 48 characters).
2. We replicated the English empirical finding that memorization is related to duplication, model size, and prompt length: more epochs (i.e., more duplication), larger models, and longer prompts all lead to more memorization.
3. Experiments demonstrated that training data can be detected from PLMs in Japanese as well (AUC 0.60); the more duplicates and the longer the prompt, the easier the detection.
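The sketch below illustrates one way to quantify memorization as in approach step 2: prompt the model with the beginning of a training article, generate a candidate continuation, and measure the longest character span it shares with the true (reference) continuation. This is a minimal sketch, not the paper's exact procedure; the checkpoint name `rinna/japanese-gpt2-small`, the prompt/reference split, and the generation settings are all illustrative assumptions.

```python
# Minimal sketch of quantifying memorization via verbatim overlap.
# Assumption: "rinna/japanese-gpt2-small" stands in for the newspaper-trained
# GPT-2 models used in the paper; the article text below is illustrative.
from difflib import SequenceMatcher

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "rinna/japanese-gpt2-small"  # assumed placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def longest_copied_substring(prompt: str, reference: str, max_new_tokens: int = 64) -> int:
    """Generate a continuation for `prompt` and return the length (in characters)
    of the longest substring it shares with the reference continuation."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        output_ids = model.generate(
            input_ids,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy decoding makes verbatim copying easy to spot
            pad_token_id=tokenizer.eos_token_id,
        )
    candidate = tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True)
    match = SequenceMatcher(None, candidate, reference).find_longest_match(
        0, len(candidate), 0, len(reference)
    )
    return match.size


# Example: the first part of a (hypothetical) training article as the prompt,
# the rest as the reference continuation.
article = "日経平均株価は続伸し、前日比で大幅に上昇した。市場では半導体関連株が買われた。"
prompt, reference = article[:20], article[20:]
print(longest_copied_substring(prompt, reference))
```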
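For approach step 3, the paper attacks membership using the generated candidate; the sketch below instead shows a simpler, commonly used loss-based baseline, where each text is scored by the model's negative log-likelihood and the score's ability to separate members from non-members is summarized by AUC. The checkpoint name and the member/non-member texts are illustrative assumptions, not the paper's data.

```python
# Minimal sketch of a loss-based membership inference baseline (not the
# paper's exact attack). Assumption: "rinna/japanese-gpt2-small" is a
# placeholder checkpoint; `members` / `non_members` are illustrative texts.
import torch
from sklearn.metrics import roc_auc_score
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "rinna/japanese-gpt2-small"  # assumed placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def neg_log_likelihood(text: str) -> float:
    """Average per-token negative log-likelihood of `text` under the model."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return out.loss.item()


members = ["学習データに含まれていた記事の一文です。"]      # texts seen in pre-training
non_members = ["学習データに含まれていない記事の一文です。"]  # held-out texts

# Lower loss suggests the text was seen in training, so negate it as the score.
scores = [-neg_log_likelihood(t) for t in members + non_members]
labels = [1] * len(members) + [0] * len(non_members)
print("AUC:", roc_auc_score(labels, scores))
```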