
最先端NLP paper introduction: A Watermark for Large Language Models


tatsuki kuribayashi

August 27, 2023



Transcript

  1. A Watermark for Large Language Models
    John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz,
    Ian Miers, Tom Goldstein (ICML2023)
    Presenter: Tatsuki Kuribayashi (MBZUAI)
    2023/8/27 最先端NLP2023


  2. It would be great to live in a world where we could detect whether a given
    text was generated by a language model.
    Why?
    • To detect abuse of automatically generated text (e.g., fake-news creation)
      - Also a defense against unfounded complaints that "your language model is causing trouble"
    • To filter training data when training new language models
    • As one signal for judging the credibility of a text (whether a human wrote it)
    (Presenter's pick; ICML outstanding paper)


  3. What this paper does / does not do
    [What it does]
    • Embed a "watermark" when generating text from a language model
    --- proactive imprinting
    [Figure: the language model generates text using a certain key; anyone who knows
    the key can decode the text and check whether the watermark is present
    ("Generated by LM").]
    [What it does not do]
    • Build a model that classifies whether a text was written by a language model or a human (e.g., GPTZero)
    --- post-hoc detection
    • Add a watermark to arbitrary text after the fact (a long-standing approach; out of scope here)
    --- post-hoc imprinting
    • Have the API provider store the model's generation history and detect by matching against it (raises privacy issues)
    --- retrieval-based detection


  4. Approach of the proposed method
    Apply a special bias when sampling words from the language model.
    [Figure: a rough image of the proposed idea. Next-word distributions
    p(w_t), p(w_{t+1}), p(w_{t+2}) over tokens (a, to, the, you, it);
    special information is embedded in how the words are chosen.]


  5. Approach of the proposed method
    Apply a special bias when sampling words from the language model.
    We want the bias to satisfy the following as much as possible:
    • It does not degrade text quality
    • The watermark is hard to remove
    • Watermark detection is low-cost
      - cf. a bad case: detection itself requires a large language model
    • Detection power is high (Type II errors are rare; the detector does not miss model generations)
      - detection works even with few tokens
    A bad example (sketched below): sample words so that tokens with more characters are
    preferred (embedding the watermark), and detect the watermark by measuring the
    average word length.
    👎 The generated text would likely be extremely hard to read.
    👎 If the prefer-long-words policy can be guessed, the watermark can be erased by replacing words with shorter ones.
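    To make the failure mode concrete, here is a minimal sketch of this bad scheme
    (illustrative only; the helper names and the threshold are hypothetical, and
    nothing here comes from the paper):

    ```python
    import torch

    def longword_watermark_step(logits: torch.Tensor, token_strings: list[str],
                                bias: float = 0.5) -> int:
        """Naive 'bad example' watermark: bias sampling toward longer tokens."""
        lengths = torch.tensor([len(t) for t in token_strings], dtype=torch.float)
        probs = torch.softmax(logits + bias * lengths, dim=-1)
        return int(torch.multinomial(probs, num_samples=1))

    def longword_detect(tokens: list[str], avg_len_threshold: float = 6.0) -> bool:
        """Detect by average token length; trivially defeated by shortening words."""
        return sum(len(t) for t in tokens) / len(tokens) > avg_len_threshold
    ```

    Because the detector only looks at average word length, an attacker who guesses
    the policy can paraphrase with short synonyms and erase the signal, which is
    exactly the critique above.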

    [Figure (repeated from slide 4): rough image of the proposed idea; next-word
    distributions p(w_t), p(w_{t+1}), p(w_{t+2}) over tokens (a, to, the, you, it);
    special information is embedded in how the words are chosen.]


  6. Proposed method (hard-constraint version)
    [How the watermark is embedded]
    The candidate (green) list G used when generating each word is determined from
    the previous word, a hash function, and a random number generator:
    [Figure: prompt "I want", so $w_{t-1} = \text{want}$ (preceding context);
    ① generate a random seed from the previous word;
    ② use it to partition the vocabulary 𝒱 into a restricted (red) list R and a
    candidate (green) list G, with $|G| = \gamma|𝒱|$, $0 < \gamma < 1$;
    ③ sample the next word from G: $w_t$ = "you".]
    [Excerpt from the paper]
    • The watermark cannot be removed without modifying a significant fraction of
    the generated tokens.
    • We can compute a rigorous statistical measure of confidence that the
    watermark has been detected.

    1.1. Notation & Language model basics
    Language models have a "vocabulary" $V$ containing words or word fragments
    known as "tokens." Typical vocabularies contain $|V| = 50{,}000$ tokens or more
    (Radford et al., 2019; Liu et al., 2019). Consider a sequence of $T$ tokens
    $\{s^{(t)}\} \in V^T$. Entries with negative indices,
    $s^{(-N_p)}, \dots, s^{(-1)}$, represent a "prompt" of length $N_p$, and
    $s^{(0)}, \dots, s^{(T)}$ are tokens generated by an AI system in response to
    the prompt.

    A language model (LM) for next word prediction is a function $f$, often
    parameterized by a neural network, that accepts as input a sequence of known
    tokens $s^{(-N_p)}, \dots, s^{(t-1)}$, which contains a prompt and the first
    $t-1$ tokens already produced by the language model, and then outputs a vector
    of $|V|$ logits, one for each word in the vocabulary. These logits are then
    passed through a softmax operator to convert them into a discrete probability
    distribution over the vocabulary. The next token at position $t$ is then
    sampled from this distribution using either standard multinomial sampling, or
    greedy sampling (greedy decoding) of the single most likely next token.
    Additionally, a procedure such as beam search can be employed to consider
    multiple possible sequences before selecting the one with the overall highest
    score.

    We start out by describing a simple "hard" red list watermark in Algorithm 1
    that is easy to analyze, easy to detect and hard to remove. The simplicity of
    this approach comes at the cost of poor generation quality on low entropy
    sequences. We will discuss more sophisticated strategies later.

    Algorithm 1: Text Generation with Hard Red List
    Input: prompt, $s^{(-N_p)}, \dots, s^{(-1)}$
    for $t = 0, 1, \dots$ do
      1. Apply the language model to prior tokens $s^{(-N_p)}, \dots, s^{(t-1)}$
         to get a probability vector $p^{(t)}$ over the vocabulary.
      2. Compute a hash of token $s^{(t-1)}$, and use it to seed a random number
         generator.
      3. Using this seed, randomly partition the vocabulary into a "green list" G
         and a "red list" R of equal size.
      4. Sample $s^{(t)}$ from G, never generating any token in the red list.
    end for

    The method works by generating a pseudo-random red list of tokens that are
    barred from appearing as $s^{(t)}$. The red list generator is seeded with the
    prior token $s^{(t-1)}$, enabling the red list to be reproduced later without
    access to the entire generated sequence.
    ※ The follow-up [Kirchenbauer+23] also examines seeding the RNG from context
    other than just the previous word; the simple scheme above works roughly as
    well. (A code sketch of Algorithm 1 follows below.)
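    A minimal Python sketch of Algorithm 1 (not the authors' code; the function
    names, the hashing scheme, and the `key` parameter are illustrative
    assumptions):

    ```python
    import hashlib

    import torch

    def green_list_mask(prev_token_id: int, vocab_size: int,
                        gamma: float, key: int = 0) -> torch.Tensor:
        """Seed an RNG from (secret key, previous token) and mark a gamma-fraction
        of the vocabulary as the green list G; the rest is the red list R."""
        seed = int(hashlib.sha256(f"{key}:{prev_token_id}".encode()).hexdigest(), 16) % (2**31)
        gen = torch.Generator().manual_seed(seed)
        perm = torch.randperm(vocab_size, generator=gen)
        mask = torch.zeros(vocab_size, dtype=torch.bool)
        mask[perm[: int(gamma * vocab_size)]] = True
        return mask

    def hard_watermark_step(logits: torch.Tensor, prev_token_id: int,
                            gamma: float = 0.5) -> int:
        """Algorithm 1, step t: ban every red-list token and sample s(t) from G only."""
        mask = green_list_mask(prev_token_id, logits.shape[-1], gamma)
        logits = logits.masked_fill(~mask, float("-inf"))  # hard constraint on R
        return int(torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1))
    ```

    Seeding the partition only from the previous token is what lets a detector
    that knows the key re-derive G later without rerunning the language model.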


  7. Proposed method (hard-constraint version)
    [How the watermark is embedded]
    The green list G used when generating each word is determined from the
    previous word, a hash function, and a random number generator.
    [Detection method]
    For each adjacent pair $(w_{t-1}, w_t)$, recompute G from $w_{t-1}$ and check
    whether $w_t \in G$. Then run a binomial test of whether green-list hits occur
    significantly more often than chance (null hypothesis $\pi_0 = \gamma$); see
    the sketch below.
    All the detector needs is the hash function and the random set partitioner
    (👍 the original language model is not required).
    ※ With the hard version, the generated text consists entirely of green tokens.
    ※ An attacker must rewrite at least 1/4 of all tokens just to bring the
    green-token fraction down to 1/2.
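    The detection side, sketched under the same assumptions and reusing
    `green_list_mask` from the sketch above (`VOCAB_SIZE` and the threshold are
    placeholders). The paper phrases the test as a one-proportion z-test,
    $z = (|s|_G - \gamma T)/\sqrt{T\gamma(1-\gamma)}$:

    ```python
    import math

    VOCAB_SIZE = 50_000  # assumed; must match the generator's vocabulary

    def detect_watermark(token_ids: list[int], gamma: float = 0.5,
                         z_threshold: float = 4.0) -> tuple[float, bool]:
        """Count how often w_t lands in the green list seeded by w_{t-1}."""
        T = len(token_ids) - 1                  # number of scored (w_{t-1}, w_t) pairs
        green_hits = 0
        for prev_id, cur_id in zip(token_ids, token_ids[1:]):
            mask = green_list_mask(prev_id, VOCAB_SIZE, gamma)  # same partitioner
            green_hits += int(mask[cur_id])
        # One-proportion z-test against the null pi_0 = gamma:
        z = (green_hits - gamma * T) / math.sqrt(T * gamma * (1 - gamma))
        return z, z > z_threshold
    ```

    Note that this loop never calls the language model, which is exactly the
    low-cost-detection property the slides emphasize.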
    [Figure (as on slide 6): prompt "I want", so $w_{t-1} = \text{want}$
    (preceding context); ① seed the RNG from the previous word; ② partition the
    vocabulary 𝒱 into a red list R and a green list G ($|G| = \gamma|𝒱|$;
    $0 < \gamma < 1$); ③ sample $w_t$ = "you" from G; next-word distribution
    $p(w_t \mid w_{<t})$.]
    A Watermark for Large Language Models
    John Kirchenbauer*, Jonas Geiping*, Yuxin Wen, Jonathan Katz, Ian Miers, Tom Goldstein

    Abstract: Potential harms of large language models can be mitigated by
    watermarking model output, i.e., embedding signals into generated text that
    are invisible to humans but algorithmically detectable from a short span of
    tokens. We propose a watermarking framework for proprietary language models.
    The watermark can be embedded with negligible impact on text quality, and can
    be detected using an efficient open-source algorithm without access to the
    language model API or parameters. The watermark works by selecting a
    randomized set of "green" tokens before a word is generated, and then softly
    promoting use of green tokens during sampling. We propose a statistical test
    for detecting the watermark with interpretable p-values, and derive an
    information-theoretic framework for analyzing the sensitivity of the
    watermark. We test the watermark using a multi-billion parameter model from
    the Open Pretrained Transformer …
    [Figure 1 from the paper. A prompt ending "…The watermark detection algorithm
    can be made public, enabling third parties (e.g., social media platforms) to
    run it themselves, or it can be kept private and run behind an API. We seek a
    watermark with the following properties:" is completed twice:

    No watermark (Num tokens 56, z-score .31, p-value .38):
    "Extremely efficient on average term lengths and word frequencies on
    synthetic, microamount text (as little as 25 words) … Very small and
    low-resource key/hash (e.g., 140 bits per key is sufficient for
    99.999999999% of the Synthetic Internet"

    With watermark (Num tokens 36, z-score 7.4, p-value 6e-14):
    "- minimal marginal probability for a detection attempt. - Good speech
    frequency and energy rate reduction. - messages indiscernible to humans.
    - easy for humans to verify."]


  8. Limitations
    • Embedding the watermark in low-entropy spans degrades text quality.
    [Figure: given the context 「あけまして」 ("Happy New…"), the only natural
    continuation 「おめでとう」 ("…Year") may, after the vocabulary split, happen
    to land in the restricted (red) list; sampling from G is then forced onto junk
    tokens such as 「、」「ー」「!」. When the "only possible" next-word candidate
    is red, text quality can drop.]
    [Excerpt from the paper] "for(i=0;i<n;i++) sum+=array[i]; … Were they produced
    by a human or by a language model? Determining this is fundamentally hard
    because these sequences have low entropy; the first few tokens strongly
    determine the following tokens. … Low entropy text creates two problems for
    watermarking …"
    Especially in settings like program completion, a uniquely determined next
    token can genuinely occur.


  9. Proposed method (soft-constraint version)
    • Instead of excluding the restricted (red) list, add a boost δ to the logits
    of the green-list tokens. The watermark then lands at high-entropy positions,
    which should prevent the degradation of text quality.
    [Figure: after 「あけまして」, the dominant candidate 「おめでとう」 can still
    be sampled ($w_t$ = おめでとう) even if it is red, because green tokens merely
    receive +δ.]
    Introduce the spike entropy $S$ of the next-word distribution (similar in
    spirit to Shannon entropy) and derive a lower bound on the expected number of
    green-list tokens $|s|_G$ in a generated text of length $T$:

    $$\mathbb{E}\,|s|_G \;\ge\; \frac{\gamma T e^{\delta}}{1 + (e^{\delta} - 1)\gamma}\, S$$
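    For reference, a sketch of the underlying statement (Theorem 4.2 of the
    paper) together with the spike-entropy definition; the exact modulus inside
    $S^{\star}$ is quoted from memory and should be checked against the paper:

    ```latex
    % Spike entropy of a distribution p with modulus z:
    %   S(p, z) = \sum_k p_k / (1 + z p_k)
    % Bound on the expected green-token count (alpha = e^delta); the modulus
    % below is an assumption, not verified against the paper:
    \[
      \mathbb{E}\,|s|_G \;\ge\; \frac{\gamma \alpha T}{1 + (\alpha - 1)\gamma}\, S^{\star},
      \qquad \alpha = e^{\delta},\quad
      S^{\star} = \frac{1}{T}\sum_{t} S\!\left(p^{(t)},\;
        \frac{(1-\gamma)(\alpha-1)}{1+(\alpha-1)\gamma}\right).
    \]
    ```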
    The higher the entropy of a text, the more watermark can be embedded (fewer
    Type II errors, i.e., watermarked text is less likely to be missed).
    ※ The follow-up [Kirchenbauer+23] also examines seeding the RNG from context
    other than just the previous word; the simple scheme above works roughly as
    well. (A sketch of the soft step follows below.)
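    The soft version in the same sketch style, reusing `green_list_mask` and
    `torch` from the earlier hard-constraint sketch (parameter defaults are
    illustrative assumptions):

    ```python
    def soft_watermark_step(logits: torch.Tensor, prev_token_id: int,
                            gamma: float = 0.5, delta: float = 2.0) -> int:
        """Soft constraint: boost green-list logits by delta, never ban red tokens."""
        mask = green_list_mask(prev_token_id, logits.shape[-1], gamma)
        # In flat (high-entropy) spots the +delta tips sampling toward green tokens;
        # a dominant (low-entropy) candidate still wins even if it is red,
        # which is how text quality is preserved.
        probs = torch.softmax(logits + delta * mask.float(), dim=-1)
        return int(torch.multinomial(probs, num_samples=1))
    ```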


  10. Experiments and analysis (excerpt)
    [Verifying the claimed properties (no surprises)]
    • The "strength" of the watermark (green-list fraction γ, logit boost δ)
    trades off against PPL.
      - Even a strongly embedded watermark has little effect on human evaluation
      (preference between watermarked and unwatermarked text) [follow-up, Appendix 9.1].
    • The longer the text, the higher the z-score of watermarked text, so
    detection succeeds more easily.
      - With 128 tokens, 98.4% of watermarked texts were detected empirically (OPT).
    [Robustness]
    • Merely rewriting watermarked text with GPT-3.5 or by hand does not fully
    erase the watermark [follow-up].
      - The hard case is watermarked text that is copied and blended into a
      partly unwatermarked document.
    [Figure 2 from the paper: average z-score vs. oracle-model PPL for green-list
    boosts δ ∈ {0.0, 1.0, 2.0, 5.0, 10.0} and green-list fractions
    γ ∈ {0.1, 0.25, 0.5, 0.75, 0.9}. Cropped caption: "The tradeoff between
    average z-score and language mod… (right) Greedy and beam search with 4 and
    8 beams for γ = .5. Be… with smaller impact to model quality (perplexity, PPL)."]

    [Excerpt from the paper] "… length of tokens from the end and treat them as a
    'baseline' completion. The remaining tokens are a prompt. A larger oracle
    language model (OPT-2.7B) is used to compute perplexity (PPL) for the
    generated completions and for the human baseline.
    Watermark Strength vs Text Quality. One can achieve a very strong watermark
    for short sequences by choosing a small green list size γ and a large green
    list bias δ. However, creating a stronger watermark may distort generated
    text. Figure 2 (left) shows the tradeoff between watermark strength (z-score)
    and text quality (perplexity) for various combinations of watermarking
    parameters. We compute …"

    [Other figure annotations: low PPL; z-score of watermarked text vs. text
    length (longer text → easier detection); attack types: rewriting, copy-paste.]


  11. Thoughts
    • Although titled "A Watermark for Large Language Models," the mechanism
    applies to any unidirectional decoder and is easy to implement.
    • Deploying this in the real world would need more discussion of operations.
      - Is the safe play to release a watermark-detector API alongside the
      watermarked model API?
        - General users could then detect the watermark without the key leaking.
      - If the model weights are released, users themselves would have to apply
      the watermark.
        - What incentive does a user have to go out of their way to watermark?
      - If separate keys (hash functions and RNGs) proliferate, detection must
      repeat the test once per key (and who manages the keys?).
    • Also check the follow-up: On the Reliability of Watermarks for Large
    Language Models (Kirchenbauer+, 2023; arXiv)
      - Variants of the proposed method, improved detection tests, comparisons
      with other methods, etc.
