[Timeline figure: OpenAI release dates (GPT-4, DALL-E, Sora, GPT-4o, OpenAI o1) overlaid on the NeurIPS schedule; month labels and the "NeurIPS paper submission ↑" markers not recoverable]
• The atmosphere at NeurIPS 2024 is quite different from the NeurIPS before it!
• Overlaying the NeurIPS schedule on OpenAI's release timeline reveals why...
• In fact, the landscape changes even between paper submission (May) and the conference itself (December)
[Diagram: reference LM policy → final LM policy via maximum likelihood vs. reward-aligned objectives (e.g., DPO, KTO); images borrowed from Yang et al. and from Wachi et al. (our paper)]
Wachi et al. "Stepwise Alignment for Constrained Language Model Policy Optimization." In NeurIPS (2024).
Huang et al. "One-Shot Safety Alignment for Large Language Models via Optimal Dualization." In NeurIPS (2024).
Yang et al. "MetaAligner: Towards Generalizable Multi-Objective Alignment of Language Models." In NeurIPS (2024).
Ruizhe et al. "Decoding-Time Language Model Alignment with Multiple Objectives." In NeurIPS (2024).
• Solving alignment under safety constraints → Wachi et al., Huang et al.
• Solving multi-objective optimization → Yang (Kailai) et al., Ruizhe et al.
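The safety-constrained setting above can be written generically as a constrained, KL-regularized policy-optimization problem. This is a sketch in my own notation (reward $r$, safety cost $c$, budget $b$, KL weight $\beta$), not the exact formulation of any one cited paper:

```latex
% Generic safety-constrained LM alignment objective (sketch):
% maximize reward while staying close to the reference policy,
% subject to an expected-safety-cost budget b.
\max_{\pi} \;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\!\left[ r(x, y) \right]
  - \beta \, \mathbb{D}_{\mathrm{KL}}\!\left( \pi \,\big\|\, \pi_{\mathrm{ref}} \right)
\quad \text{s.t.} \quad
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\!\left[ c(x, y) \right] \le b
```

The cited approaches differ mainly in how the constraint is handled, e.g., stepwise alignment versus dualizing the constrained problem into an unconstrained one.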
"...that LLMs generate outputs that are consistent with human values, goals, and ethical standards."
LLM alignment refers to the process of ensuring that LLMs generate outputs consistent with human values, goals, and ethical standards.
https://www.turing.com/resources/llm-alignment-and-safety-guide
Park et al. "...and safe text-to-image models." In NeurIPS (2024).
Pan et al. "Leveraging Catastrophic Forgetting to Develop Safe Diffusion Models against Malicious Fine-tuning." In NeurIPS (2024).
• Aligning text-to-image models so that they do not generate harmful images
• Making the model forget its ability to generate harmful images → Park et al., Pan et al.
[Figure: example images borrowed from Park et al.; viewer discretion advised]
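The "forgetting" idea above can be sketched as an unlearning-style objective: push the loss up on harmful data while preserving behavior on benign data. This is a generic sketch in my own notation ($\mathcal{D}_h$ harmful set, $\mathcal{D}_b$ benign set, $\lambda$ a trade-off weight), not the specific method of either cited paper:

```latex
% Generic unlearning-style objective (sketch): the negated first term
% performs gradient ascent on harmful examples (erasing that capability),
% while the second term retains performance on benign examples.
\min_{\theta} \;
  - \lambda \, \mathbb{E}_{(x, y) \sim \mathcal{D}_h}\!\left[ \ell\!\left( f_\theta(x), y \right) \right]
  + \mathbb{E}_{(x, y) \sim \mathcal{D}_b}\!\left[ \ell\!\left( f_\theta(x), y \right) \right]
```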
Dai et al. "...Text2Video Generation via a Human Preference Dataset." In NeurIPS (2024).
Miao et al. "T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models." In NeurIPS (2024).
• Both propose benchmarks for evaluating the safety of video-generation models
• Categories include pornography, violence, discrimination, and more
• Papers with the same goal accepted at the same conference → speaks to how fierce the competition is
[Figures borrowed from Dai et al. and Miao et al.; viewer discretion advised]