JSAI NeurIPS 2024 Report Session (AI Alignment)

Akifumi Wachi
February 07, 2025

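The 96th Artificial Intelligence Seminar (2025.2.7): "AI Trends and Top Conference Report Session (NeurIPS 2024): understand the world's most advanced AI research and development trends in a single day!"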
- https://www.ai-gakkai.or.jp/event/ai-seminar/no96_jsai_seminar/

Transcript

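  1. Akifumi Wachi (和地 瞭良)

    Career
    • …: IBM Research Tokyo, Research Scientist
    • … to present: LINE Yahoo, Chief Research Scientist
    Research area
    • Reinforcement learning × (AI safety, natural language processing)
    Book (co-authored)
    • 『強化学習から信頼できる意思決定へ』 (From Reinforcement Learning to Trustworthy Decision Making)
    Papers accepted at NeurIPS 2024
    • Lead author: Stepwise Alignment for Constrained Language Model Policy Optimization
      • https://arxiv.org/abs/…
    • Co-author: Flipping-based Policy for Chance-Constrained Markov Decision Processes
      • https://arxiv.org/abs/…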
  2. Agenda

    • Impressions and atmosphere of NeurIPS 2024
    • Trends in AI safety and alignment research
    • Paper introduction: PRISM Alignment Dataset (Best Paper)
      • Kirk et al. "The PRISM Alignment Dataset: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models"
      • https://arxiv.org/abs/…
    • Paper introduction: Aligner (Oral)
      • Ji et al. "Aligner: Efficient Alignment by Learning to Correct"
      • https://arxiv.org/abs/…
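  3. Impressions and atmosphere of NeurIPS 2024

    [Timeline figure: OpenAI releases by month (ChatGPT, GPT…, DALL-E…, Sora, GPT…o, OpenAI o…, Pro), with the NeurIPS conference dates and the NeurIPS paper deadline marked]
    • NeurIPS … and NeurIPS … had very different atmospheres!
    • Overlaying the NeurIPS conference dates on OpenAI's release timeline shows why.
    • In fact, the situation differs even between the paper deadline and the conference itself.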
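  4. How AI safety research has changed

    Impression 1: the field has become more "serious"
    • The distance between research and practical use has narrowed.
    • AI capabilities are starting to reach the level required in the real world.
    • A larger share of research now deals with high-performance AI such as LLMs.
    • This is because AI has become "dangerous" in the true sense of the word.
    [Figures: generated images as of 2014 (from Goodfellow et al.) and as of 2024 (from Tian et al.)]
    Technical progress has also increased the risks.

    Goodfellow et al. "Generative adversarial nets." In NeurIPS (2014).
    Tian et al. "Visual autoregressive modeling: Scalable image generation via next-scale prediction." In NeurIPS (2024).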
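  5. Distribution of AI safety research

    • Official NeurIPS paper visualization tool
      • https://neurips….vizhub.ai
      • Keyword: Safe
    • Many of the papers are on LLMs.
    • The work has branched out to other modalities and to multimodal models (Text-to-Video, etc.).
    • The topic remains persistently popular in reinforcement learning (and has been for quite a while).
    [Figure labels: LLM, multimodal, reinforcement learning]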
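  6. Distribution of AI safety research

    • Official NeurIPS paper visualization tool
      • https://neurips….vizhub.ai
      • Keyword: Alignment
    • Many of the papers are on LLMs.
    • The work has branched out to other modalities and to multimodal models (Text-to-Image, etc.).
    • The topic is also popular in the reinforcement learning community.
      • The popularity of RLHF and DPO is the likely reason.
    [Figure labels: LLM, image / video / multimodal, reinforcement learning]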
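  7. Distribution of AI safety research

    [Figure labels: LLM, image / video / multimodal, reinforcement learning]
    • Actively studied for LLMs → branching out to other modalities and to multimodal models
    • The increase is especially marked in reinforcement learning (the influence of RLHF and DPO).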
  8. Extending and improving existing alignment methods

    [Figure from Wachi et al. (our paper): SACPO. Labels: helpfulness data; safety data; maximum likelihood (e.g., DPO, KTO); reference LM policy; reward-aligned LM policy; final LM policy]
    [Figure from Yang et al.]
    • Solving safety-constrained problems → Wachi et al. (2024), Huang et al. (2024)
    • Solving multi-objective optimization problems → Yang et al. (2024), Shi et al. (2024)

    Wachi et al. "Stepwise Alignment for Constrained Language Model Policy Optimization." In NeurIPS (2024).
    Huang et al. "One-Shot Safety Alignment for Large Language Models via Optimal Dualization." In NeurIPS (2024).
    Yang et al. "Metaaligner: Towards generalizable multi-objective alignment of language models." In NeurIPS (2024).
    Shi et al. "Decoding-time language model alignment with multiple objectives." In NeurIPS (2024).
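As a rough illustration of the stepwise idea behind the SACPO figure, the sketch below computes the standard DPO loss for a single preference pair and applies it twice: once on a helpfulness pair against the original reference policy, then once on a safety pair against the stage-one policy. The log-probability values, `beta`, and every function and variable name are made up for illustration. Treating the reward-aligned policy as the second-stage reference is an assumption read off the figure labels, not the authors' implementation.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    Arguments are sequence log-probabilities of the chosen and rejected
    responses under the policy being trained and under a frozen reference.
    """
    margin = beta * ((policy_chosen - ref_chosen)
                     - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)) is the same as softplus(-margin)
    return softplus(-margin)


# Stage 1 (assumed): helpfulness pair, reference = original LM policy.
# All log-probabilities below are hypothetical numbers.
stage1_loss = dpo_loss(policy_chosen=-12.0, policy_rejected=-15.0,
                       ref_chosen=-13.0, ref_rejected=-14.0)

# Stage 2 (assumed): safety pair, reference = the stage-1 (reward-aligned) policy.
stage2_loss = dpo_loss(policy_chosen=-9.0, policy_rejected=-20.0,
                       ref_chosen=-10.0, ref_rejected=-18.0)

print(stage1_loss, stage2_loss)
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2. The loss falls below that value as the policy favors the chosen response more strongly than the reference does.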
  9. What is alignment?

    "LLM alignment refers to the process of ensuring that LLMs generate outputs that are consistent with human values, goals, and ethical standards."
    https://www.turing.com/resources/llm-alignment-and-safety-guide
  10. Who are the "humans"?

    "LLM alignment refers to the process of ensuring that LLMs generate outputs that are consistent with human values, goals, and ethical standards."
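  11. In practice, the distribution of "humans" is skewed

    • An LLM from a given country or company reflects the values of the "humans" of that country or company.
      • The outputs of US LLMs (e.g., GPT, Gemini) and Chinese LLMs (e.g., DeepSeek) differ substantially.
    • Details about the data are usually not disclosed.
      • How was the data collected?
      • By whom? When? Where?
    • There is no doubt that the data is "biased" to a greater or lesser degree.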
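  12. Findings I personally found interesting

    [Figure from Kirk et al.]
    Preferences differ from person to person
    • Women and non-binary people talk with LLMs about "gender and LGBTQ+" topics more than men do.
    • Older people tend to discuss politics and travel more than younger people do.
    • White people tend to discuss climate change more than Black people do.
    Reflecting only the values of a particular group lowers the satisfaction of other users
    → Pluralistic alignment is important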
  13. Safety of Text-to-Image

    • Aligning Text-to-Image models so that they do not generate harmful images
    • Making the model unlearn the ability to generate harmful images → Park et al. (2024), Pan et al. (2024)
    [Figure from Park et al.; viewer discretion advised]

    Park et al. "Direct unlearning optimization for robust and safe text-to-image models." In NeurIPS (2024).
    Pan et al. "Leveraging Catastrophic Forgetting to Develop Safe Diffusion Models against Malicious Finetuning." In NeurIPS (2024).
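  14. Safety of Text-to-Video

    • Both papers propose benchmarks for evaluating the safety of video generation models.
    • … categories, including pornography, violence, and discrimination
    • Papers with the same goal were accepted at the same conference → a sign of how fierce the competition is
    [Figures from Dai et al. and from Miao et al.; viewer discretion advised]

    Dai et al. "SafeSora: Towards Safety Alignment of Text2Video Generation via a Human Preference Dataset." In NeurIPS (2024).
    Miao et al. "T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models." In NeurIPS (2024).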
  15. AI agents

    • Provides a dataset called AndroidControl (from Google).
    • Covers a wide variety of tasks (… tasks across … Android apps).
    [Figure from Wei et al.]

    Wei et al. "On the Effects of Data Scale on Computer Control Agents." In NeurIPS (2024).
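  16. Safety of AI agents

    • A workshop dedicated to the safety of AI agents
      • https://www.mlsafety.org/events/neurips/…
    • Because the AI can carry out actions, the risks are of a qualitatively different kind.
    • The number of papers in the main conference is also expected to grow from next year onward.
    [Figure from Wei et al.]

    Wei et al. "On the Effects of Data Scale on Computer Control Agents." In NeurIPS (2024).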
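  17. Aligner

    What are the drawbacks of existing alignment methods (e.g., RLHF, DPO)?
    • They can only be used with open LLMs.
      • They cannot be applied to GPT or Claude.
      • The neural network weights must actually be updated.
    • The computational cost is high.
      • Many GPU hours are required.
    For details on RLHF and DPO, please see the video of last year's JSAI NeurIPS report session.
    • YouTube: https://www.youtube.com/watch?v=…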
  18. Weak-to-Strong Correction

    [Figures from the original papers: the concept introduced in Burns et al. (2023), and the Weak-to-Strong Correction proposed by Aligner]
    • A positive result: a weak supervisor (Aligner) can control and correct a strong student (e.g., GPT).

    Burns et al. "Weak-to-strong generalization: Eliciting strong capabilities with weak supervision." arXiv preprint arXiv:2312.09390 (2023).
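The correction setup described on the Aligner slides can be sketched as a thin wrapper around a black-box model. In this minimal sketch, an upstream LLM (whose weights are never touched) produces a draft, and a separate, smaller corrector rewrites that draft given the query. The type aliases, the function `aligned_answer`, and both toy models are hypothetical stand-ins for illustration only, not the Aligner authors' code or any real API.

```python
from typing import Callable

# Hypothetical interfaces: the upstream model is treated as a black box.
Generator = Callable[[str], str]        # query -> draft answer
Corrector = Callable[[str, str], str]   # (query, draft answer) -> corrected answer


def aligned_answer(query: str, upstream: Generator, corrector: Corrector) -> str:
    """Get a draft from the upstream model, then let the corrector revise it."""
    draft = upstream(query)
    return corrector(query, draft)


# Toy stand-ins so the sketch runs without any model.
def toy_upstream(query: str) -> str:
    return f"draft answer to: {query}"


def toy_corrector(query: str, draft: str) -> str:
    # A real corrector would be a trained model; this one only edits the text.
    return draft.replace("draft", "corrected")


print(aligned_answer("How do I reset my password?", toy_upstream, toy_corrector))
```

Because only the corrector would be trained in such a setup, the upstream model can stay closed and unchanged, which is the property the slides contrast with RLHF and DPO.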
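  19. Summary

    • Impressions and atmosphere of NeurIPS 2024
    • Trends in AI safety and alignment research
    • Paper introduction: PRISM Alignment Dataset (Best Paper)
    • Paper introduction: Aligner (Oral)

    If you have any questions about this material or spot any mistakes, please contact me by e-mail.
    wachiakifumi [at] gmail.com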