
最先端NLP paper introduction: A Watermark for Large Language Models


tatsuki kuribayashi

August 27, 2023



Transcript

  1. A Watermark for Large Language Models
    John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz,
    Ian Miers, Tom Goldstein (ICML2023)
    Presenter: Tatsuki Kuribayashi (MBZUAI)
    2023/8/27 最先端NLP2023


  2. It would be great to live in a world where we could detect whether a given
    text was generated by a language model.
    Why?
    • To detect abuse of automatically generated text (e.g., fake-news creation)
      - Also a defense against unfounded complaints that "your language model is causing trouble"
    • To filter training data when training new language models
    • As one signal for judging the credibility of a text (whether a human wrote it)
    (Presenter's pick; ICML outstanding paper)


  3. What this paper does / does not do
    [What it does]
    • Embed a "watermark" when generating text from a language model
    --- proactive imprinting
    [Figure: the language model generates text using a certain key; anyone who knows
    the key can decode the text and check whether the watermark is present
    ("Generated by LM").]
    [What it does not do]
    • Build a model that classifies whether a text was written by a language model or a human (e.g., GPTZero)
    --- post-hoc detection
    • Add a watermark to arbitrary text after the fact (a long-standing approach; out of scope here)
    --- post-hoc imprinting
    • Have the API provider store the model's generation history and detect by matching against it (raises privacy issues)
    --- retrieval-based detection


  4. Approach of the proposed method
    Apply a special bias when sampling words from the language model.
    [Figure: a rough image of the proposed idea. Next-word distributions
    p(w_t), p(w_{t+1}), p(w_{t+2}) over tokens (a, to, the, you, it);
    special information is embedded in how the words are chosen.]


  5. Approach of the proposed method
    Apply a special bias when sampling words from the language model.
    We want the bias to satisfy the following as much as possible:
    • It does not degrade text quality
    • The watermark is hard to remove
    • Watermark detection is low-cost
      - cf. a bad case: detection itself requires a large language model
    • Detection power is high (Type II errors are rare; the detector does not miss model generations)
      - detection works even with few tokens
    A bad example (sketched below): sample words so that tokens with more characters are
    preferred (embedding the watermark), and detect the watermark by measuring the
    average word length.
    👎 The generated text would likely be extremely hard to read.
    👎 If the prefer-long-words policy can be guessed, the watermark can be erased by replacing words with shorter ones.
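    To make the failure mode concrete, here is a minimal sketch of this bad scheme
    (illustrative only; the helper names and the threshold are hypothetical, and
    nothing here comes from the paper):

    ```python
    import torch

    def longword_watermark_step(logits: torch.Tensor, token_strings: list[str],
                                bias: float = 0.5) -> int:
        """Naive 'bad example' watermark: bias sampling toward longer tokens."""
        lengths = torch.tensor([len(t) for t in token_strings], dtype=torch.float)
        probs = torch.softmax(logits + bias * lengths, dim=-1)
        return int(torch.multinomial(probs, num_samples=1))

    def longword_detect(tokens: list[str], avg_len_threshold: float = 6.0) -> bool:
        """Detect by average token length; trivially defeated by shortening words."""
        return sum(len(t) for t in tokens) / len(tokens) > avg_len_threshold
    ```

    Because the detector only looks at average word length, an attacker who guesses
    the policy can paraphrase with short synonyms and erase the signal, which is
    exactly the critique above.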

    [Figure (repeated from slide 4): rough image of the proposed idea; next-word
    distributions p(w_t), p(w_{t+1}), p(w_{t+2}) over tokens (a, to, the, you, it);
    special information is embedded in how the words are chosen.]


  6. Proposed method (hard-constraint version)
    [How the watermark is embedded]
    The candidate (green) list G used when generating each word is determined from
    the previous word, a hash function, and a random number generator:
    [Figure: prompt "I want", so $w_{t-1} = \text{want}$ (preceding context);
    ① generate a random seed from the previous word;
    ② use it to partition the vocabulary 𝒱 into a restricted (red) list R and a
    candidate (green) list G, with $|G| = \gamma|𝒱|$, $0 < \gamma < 1$;
    ③ sample the next word from G: $w_t$ = "you".]
    [Excerpt from the paper]
    • The watermark cannot be removed without modifying a significant fraction of
    the generated tokens.
    • We can compute a rigorous statistical measure of confidence that the
    watermark has been detected.

    1.1. Notation & Language model basics
    Language models have a "vocabulary" $V$ containing words or word fragments
    known as "tokens." Typical vocabularies contain $|V| = 50{,}000$ tokens or more
    (Radford et al., 2019; Liu et al., 2019). Consider a sequence of $T$ tokens
    $\{s^{(t)}\} \in V^T$. Entries with negative indices,
    $s^{(-N_p)}, \dots, s^{(-1)}$, represent a "prompt" of length $N_p$, and
    $s^{(0)}, \dots, s^{(T)}$ are tokens generated by an AI system in response to
    the prompt.

    A language model (LM) for next word prediction is a function $f$, often
    parameterized by a neural network, that accepts as input a sequence of known
    tokens $s^{(-N_p)}, \dots, s^{(t-1)}$, which contains a prompt and the first
    $t-1$ tokens already produced by the language model, and then outputs a vector
    of $|V|$ logits, one for each word in the vocabulary. These logits are then
    passed through a softmax operator to convert them into a discrete probability
    distribution over the vocabulary. The next token at position $t$ is then
    sampled from this distribution using either standard multinomial sampling, or
    greedy sampling (greedy decoding) of the single most likely next token.
    Additionally, a procedure such as beam search can be employed to consider
    multiple possible sequences before selecting the one with the overall highest
    score.

    We start out by describing a simple "hard" red list watermark in Algorithm 1
    that is easy to analyze, easy to detect and hard to remove. The simplicity of
    this approach comes at the cost of poor generation quality on low entropy
    sequences. We will discuss more sophisticated strategies later.

    Algorithm 1: Text Generation with Hard Red List
    Input: prompt, $s^{(-N_p)}, \dots, s^{(-1)}$
    for $t = 0, 1, \dots$ do
      1. Apply the language model to prior tokens $s^{(-N_p)}, \dots, s^{(t-1)}$
         to get a probability vector $p^{(t)}$ over the vocabulary.
      2. Compute a hash of token $s^{(t-1)}$, and use it to seed a random number
         generator.
      3. Using this seed, randomly partition the vocabulary into a "green list" G
         and a "red list" R of equal size.
      4. Sample $s^{(t)}$ from G, never generating any token in the red list.
    end for

    The method works by generating a pseudo-random red list of tokens that are
    barred from appearing as $s^{(t)}$. The red list generator is seeded with the
    prior token $s^{(t-1)}$, enabling the red list to be reproduced later without
    access to the entire generated sequence.
    ※ The follow-up [Kirchenbauer+23] also examines seeding the RNG from context
    other than just the previous word; the simple scheme above works roughly as
    well. (A code sketch of Algorithm 1 follows below.)
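    A minimal Python sketch of Algorithm 1 (not the authors' code; the function
    names, the hashing scheme, and the `key` parameter are illustrative
    assumptions):

    ```python
    import hashlib

    import torch

    def green_list_mask(prev_token_id: int, vocab_size: int,
                        gamma: float, key: int = 0) -> torch.Tensor:
        """Seed an RNG from (secret key, previous token) and mark a gamma-fraction
        of the vocabulary as the green list G; the rest is the red list R."""
        seed = int(hashlib.sha256(f"{key}:{prev_token_id}".encode()).hexdigest(), 16) % (2**31)
        gen = torch.Generator().manual_seed(seed)
        perm = torch.randperm(vocab_size, generator=gen)
        mask = torch.zeros(vocab_size, dtype=torch.bool)
        mask[perm[: int(gamma * vocab_size)]] = True
        return mask

    def hard_watermark_step(logits: torch.Tensor, prev_token_id: int,
                            gamma: float = 0.5) -> int:
        """Algorithm 1, step t: ban every red-list token and sample s(t) from G only."""
        mask = green_list_mask(prev_token_id, logits.shape[-1], gamma)
        logits = logits.masked_fill(~mask, float("-inf"))  # hard constraint on R
        return int(torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1))
    ```

    Seeding the partition only from the previous token is what lets a detector
    that knows the key re-derive G later without rerunning the language model.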


  7. Proposed method (hard-constraint version)
    [How the watermark is embedded]
    The green list G used when generating each word is determined from the
    previous word, a hash function, and a random number generator.
    [Detection method]
    For each adjacent pair $(w_{t-1}, w_t)$, recompute G from $w_{t-1}$ and check
    whether $w_t \in G$. Then run a binomial test of whether green-list hits occur
    significantly more often than chance (null hypothesis $\pi_0 = \gamma$); see
    the sketch below.
    All the detector needs is the hash function and the random set partitioner
    (👍 the original language model is not required).
    ※ With the hard version, the generated text consists entirely of green tokens.
    ※ An attacker must rewrite at least 1/4 of all tokens just to bring the
    green-token fraction down to 1/2.
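    The detection side, sketched under the same assumptions and reusing
    `green_list_mask` from the sketch above (`VOCAB_SIZE` and the threshold are
    placeholders). The paper phrases the test as a one-proportion z-test,
    $z = (|s|_G - \gamma T)/\sqrt{T\gamma(1-\gamma)}$:

    ```python
    import math

    VOCAB_SIZE = 50_000  # assumed; must match the generator's vocabulary

    def detect_watermark(token_ids: list[int], gamma: float = 0.5,
                         z_threshold: float = 4.0) -> tuple[float, bool]:
        """Count how often w_t lands in the green list seeded by w_{t-1}."""
        T = len(token_ids) - 1                  # number of scored (w_{t-1}, w_t) pairs
        green_hits = 0
        for prev_id, cur_id in zip(token_ids, token_ids[1:]):
            mask = green_list_mask(prev_id, VOCAB_SIZE, gamma)  # same partitioner
            green_hits += int(mask[cur_id])
        # One-proportion z-test against the null pi_0 = gamma:
        z = (green_hits - gamma * T) / math.sqrt(T * gamma * (1 - gamma))
        return z, z > z_threshold
    ```

    Note that this loop never calls the language model, which is exactly the
    low-cost-detection property the slides emphasize.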
    [Figure (as on slide 6): prompt "I want", so $w_{t-1} = \text{want}$
    (preceding context); ① seed the RNG from the previous word; ② partition the
    vocabulary 𝒱 into a red list R and a green list G ($|G| = \gamma|𝒱|$;
    $0 < \gamma < 1$); ③ sample $w_t$ = "you" from G; next-word distribution
    $p(w_t \mid w_{<t})$.]
    A Watermark for Large Language Models
    John Kirchenbauer*, Jonas Geiping*, Yuxin Wen, Jonathan Katz, Ian Miers, Tom Goldstein

    Abstract: Potential harms of large language models can be mitigated by
    watermarking model output, i.e., embedding signals into generated text that
    are invisible to humans but algorithmically detectable from a short span of
    tokens. We propose a watermarking framework for proprietary language models.
    The watermark can be embedded with negligible impact on text quality, and can
    be detected using an efficient open-source algorithm without access to the
    language model API or parameters. The watermark works by selecting a
    randomized set of "green" tokens before a word is generated, and then softly
    promoting use of green tokens during sampling. We propose a statistical test
    for detecting the watermark with interpretable p-values, and derive an
    information-theoretic framework for analyzing the sensitivity of the
    watermark. We test the watermark using a multi-billion parameter model from
    the Open Pretrained Transformer …
    [Figure 1 from the paper. A prompt ending "…The watermark detection algorithm
    can be made public, enabling third parties (e.g., social media platforms) to
    run it themselves, or it can be kept private and run behind an API. We seek a
    watermark with the following properties:" is completed twice:

    No watermark (Num tokens 56, z-score .31, p-value .38):
    "Extremely efficient on average term lengths and word frequencies on
    synthetic, microamount text (as little as 25 words) … Very small and
    low-resource key/hash (e.g., 140 bits per key is sufficient for
    99.999999999% of the Synthetic Internet"

    With watermark (Num tokens 36, z-score 7.4, p-value 6e-14):
    "- minimal marginal probability for a detection attempt. - Good speech
    frequency and energy rate reduction. - messages indiscernible to humans.
    - easy for humans to verify."]


  8. Limitations
    • Embedding the watermark in low-entropy spans degrades text quality.
    [Figure: given the context 「あけまして」 ("Happy New…"), the only natural
    continuation 「おめでとう」 ("…Year") may, after the vocabulary split, happen
    to land in the restricted (red) list; sampling from G is then forced onto junk
    tokens such as 「、」「ー」「!」. When the "only possible" next-word candidate
    is red, text quality can drop.]
    [Excerpt from the paper] "for(i=0;i<n;i++) sum+=array[i]; … Were they produced
    by a human or by a language model? Determining this is fundamentally hard
    because these sequences have low entropy; the first few tokens strongly
    determine the following tokens. … Low entropy text creates two problems for
    watermarking …"
    Especially in settings like program completion, a uniquely determined next
    token can genuinely occur.


  9. Proposed method (soft-constraint version)
    • Instead of excluding the restricted (red) list, add a boost δ to the logits
    of the green-list tokens. The watermark then lands at high-entropy positions,
    which should prevent the degradation of text quality.
    [Figure: after 「あけまして」, the dominant candidate 「おめでとう」 can still
    be sampled ($w_t$ = おめでとう) even if it is red, because green tokens merely
    receive +δ.]
    Introduce the spike entropy $S$ of the next-word distribution (similar in
    spirit to Shannon entropy) and derive a lower bound on the expected number of
    green-list tokens $|s|_G$ in a generated text of length $T$:

    $$\mathbb{E}\,|s|_G \;\ge\; \frac{\gamma T e^{\delta}}{1 + (e^{\delta} - 1)\gamma}\, S$$
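    For reference, a sketch of the underlying statement (Theorem 4.2 of the
    paper) together with the spike-entropy definition; the exact modulus inside
    $S^{\star}$ is quoted from memory and should be checked against the paper:

    ```latex
    % Spike entropy of a distribution p with modulus z:
    %   S(p, z) = \sum_k p_k / (1 + z p_k)
    % Bound on the expected green-token count (alpha = e^delta); the modulus
    % below is an assumption, not verified against the paper:
    \[
      \mathbb{E}\,|s|_G \;\ge\; \frac{\gamma \alpha T}{1 + (\alpha - 1)\gamma}\, S^{\star},
      \qquad \alpha = e^{\delta},\quad
      S^{\star} = \frac{1}{T}\sum_{t} S\!\left(p^{(t)},\;
        \frac{(1-\gamma)(\alpha-1)}{1+(\alpha-1)\gamma}\right).
    \]
    ```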
    The higher the entropy of a text, the more watermark can be embedded (fewer
    Type II errors, i.e., watermarked text is less likely to be missed).
    ※ The follow-up [Kirchenbauer+23] also examines seeding the RNG from context
    other than just the previous word; the simple scheme above works roughly as
    well. (A sketch of the soft step follows below.)
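    The soft version in the same sketch style, reusing `green_list_mask` and
    `torch` from the earlier hard-constraint sketch (parameter defaults are
    illustrative assumptions):

    ```python
    def soft_watermark_step(logits: torch.Tensor, prev_token_id: int,
                            gamma: float = 0.5, delta: float = 2.0) -> int:
        """Soft constraint: boost green-list logits by delta, never ban red tokens."""
        mask = green_list_mask(prev_token_id, logits.shape[-1], gamma)
        # In flat (high-entropy) spots the +delta tips sampling toward green tokens;
        # a dominant (low-entropy) candidate still wins even if it is red,
        # which is how text quality is preserved.
        probs = torch.softmax(logits + delta * mask.float(), dim=-1)
        return int(torch.multinomial(probs, num_samples=1))
    ```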


  10. Experiments and analysis (excerpt)
    [Verifying the claimed properties (no surprises)]
    • The "strength" of the watermark (green-list fraction γ, logit boost δ)
    trades off against PPL.
      - Even a strongly embedded watermark has little effect on human evaluation
      (preference between watermarked and unwatermarked text) [follow-up, Appendix 9.1].
    • The longer the text, the higher the z-score of watermarked text, so
    detection succeeds more easily.
      - With 128 tokens, 98.4% of watermarked texts were detected empirically (OPT).
    [Robustness]
    • Merely rewriting watermarked text with GPT-3.5 or by hand does not fully
    erase the watermark [follow-up].
      - The hard case is watermarked text that is copied and blended into a
      partly unwatermarked document.
    [Figure 2 from the paper: average z-score vs. oracle-model PPL for green-list
    boosts δ ∈ {0.0, 1.0, 2.0, 5.0, 10.0} and green-list fractions
    γ ∈ {0.1, 0.25, 0.5, 0.75, 0.9}. Cropped caption: "The tradeoff between
    average z-score and language mod… (right) Greedy and beam search with 4 and
    8 beams for γ = .5. Be… with smaller impact to model quality (perplexity, PPL)."]

    [Excerpt from the paper] "… length of tokens from the end and treat them as a
    'baseline' completion. The remaining tokens are a prompt. A larger oracle
    language model (OPT-2.7B) is used to compute perplexity (PPL) for the
    generated completions and for the human baseline.
    Watermark Strength vs Text Quality. One can achieve a very strong watermark
    for short sequences by choosing a small green list size γ and a large green
    list bias δ. However, creating a stronger watermark may distort generated
    text. Figure 2 (left) shows the tradeoff between watermark strength (z-score)
    and text quality (perplexity) for various combinations of watermarking
    parameters. We compute …"

    [Other figure annotations: low PPL; z-score of watermarked text vs. text
    length (longer text → easier detection); attack types: rewriting, copy-paste.]


  11. Thoughts
    • Although titled "A Watermark for Large Language Models," the mechanism
    applies to any unidirectional decoder and is easy to implement.
    • Deploying this in the real world would need more discussion of operations.
      - Is the safe play to release a watermark-detector API alongside the
      watermarked model API?
        - General users could then detect the watermark without the key leaking.
      - If the model weights are released, users themselves would have to apply
      the watermark.
        - What incentive does a user have to go out of their way to watermark?
      - If separate keys (hash functions and RNGs) proliferate, detection must
      repeat the test once per key (and who manages the keys?).
    • Also check the follow-up: On the Reliability of Watermarks for Large
    Language Models (Kirchenbauer+, 2023; arXiv)
      - Variants of the proposed method, improved detection tests, comparisons
      with other methods, etc.
