👤Feedback: Ranking 🚀Objective: Safety *OTUSVDU(15ਓؒͷϑΟʔυόοΫͰݴޠϞσϧΛௐ 5 https://openai.com/research/instruction-following Prompt Q: Why are liberals so stupid? A: GPT-3 Because deep down inside they know they are! InstructGPT There is no one answer to this question, as liberals come in all shapes and sizes. However, some possible reasons for why liberals are perceived as being stupid could include the following:
1. Liberals often hold views that are in direct opposition to mainstream conservative ideology,… (15ʹΑΔ
্ҐͱԼҐͷใु͕ࠩ࠷େʹͳΔΑ͏ʹใुϞσϧΛֶश͢Δɽ Ouyang et al. Training Language Models to Follow Instructions with Human Feedback. NeurIPS 2022. Figure 12ͷྫΛݩʹ࡞ A research group in the United States has found that parrots can imitate human … Scientists have found that green-winged parrots can tell the difference between … 4UFQ3FJOGPSDFNFOUMFBSOJOH
ใुϞσϧ͔ΒಘΒΕΔใुΛ࠷େԽ͢ΔΑ͏ʹɼݴޠϞσϧΛ fi OFUVOJOH 4UFQ
Λ܁Γฦ͢ Current research suggests that parrots see and hear things in a different way … A team of researchers from Yale University and University of California, Davis …
ਓؒϞσϧʹର༷ͯ͠ʑͳϑΟʔυόοΫΛ༩͑Δ͜ͱ͕Ͱ͖Δ 8 Ghai et al. Explainable Active Learning (XAL): Toward AI Explanations as Interfaces for Machine Teachers. CSCW 2020.
https://www.youtube.com/watch?v=Wvs6fBdVc6Q վળҊͷछྨ N Tuning weight 81 Removing and changing direction of weights 28 Ranking or comparing multiple features 12 Reasoning about domination and relation of features 10 Decision logic based feature importance 6 Changes of explanations between trials 5 Add features 2 ΫϥυιʔγϯάϫʔΧʹϞσϧͷஅࠜڌΛఏࣔ͠
👤Feedback: Attention ը૾ೝࣝϞσϧͷྖҬΛਓؒʹ͚ۙͮΔ͜ͱͰਫ਼্ 11 Fel et al. Harmonizing the Object Recognition Strategies of Deep Neural Networks with Humans. NeurIPS 2022.
ࢦఆͤ͞Δ͜ͱͰɼ͍߹ΘͤΛݮ ˙ Ϟσϧ͕ਪఆͨ͠ྖҬͷʹج͍ͮͯɼରը૾Λબ 👤Feedback: Attention ը૾ೝࣝϞσϧͷྖҬΛਓ͕ؒஞ࣍తʹमਖ਼ 12 He et al. Ef fi cient Human-in-the-loop System for Guiding DNNs Attention. IUI 2023.
👤Feedback: Re fi nement ࣗવݴޠͰͷϑΟʔυόοΫΛར༻ͯ͠--.Λ fi OFUVOJOH 14 Scheurer et al. Training Language Models with Language Feedback at Scale. arXiv:2303.16755. ˔ ҎԼͷखॱΛ܁Γฦ͢ ˙ 4UFQ--.ͷʹର ͯ͠ਓ͕ؒࣗવݴޠͰGFFECBDLΛهड़ ˙ 4UFQ ʹର͢ΔSF fi OFEPVUQVUΛ--.͕ෳग़ ྗ
࠷GFFECBDLʹ߹͏ͷΛબ ˙ 4UFQfi OFEPVUQVU>Λ ༻͍ͯ fi OFUVOJOH ˔ ཁʹ͓͍ͯ fi OFUVOJOHPOIVNBO HPMETVNNBSJFTΛ্ճΔੑೳΛୡ
🚀Objective: Interpretability ਓؒʹΑΔஅ࣌ؒΛߟྀͯ͠ղऍੑͷߴ͍ϞσϧΛൃݟ 16 Lage et al. Human-in-the-loop Interpretability Prior. NeurIPS 2018. (a) An example of our interface with a tree trained on (b) We asked a single user to take the same q times to measure the effect of repetition on re Ϟσϧͷஅج४ʹج͍ͮͯਓؒʹ༧ଌΛͤ͞Δ
ˠॴཁ͕͍࣌ؒ΄Ͳղऍੑͷߴ͍Ϟσϧ
ਫ਼ͷߴ͍ϞσϧΛީิͱͯ͠ྻڍ
ਓؒʹධՁͤ͞ΔϞσϧ Λબ M
ਓ͕ؒϞσϧ ͷղऍੑ ΛධՁ M p(M)
࠷ྑͷϞσϧΛબൈ p(M) ≈ 1 N ∑ x 𝖧 𝖨𝖲 (x, M) 𝖧𝖨𝖲 (x, M) = max{0, 𝗆𝖺 𝗑 𝖱𝖳 − 𝗆𝖾𝖺 𝗇𝖱𝖳 (x, M)} ฏۉஅ࣌ؒ )VNBOJOUFSQSFUBCJMJUZTDPSF
🚀Objective: Diversity --.ͷग़ྗΛଟ༷ͳਓ͕߹ҙͰ͖ΔΑ͏ʹௐ Bakker et al. Fine-tuning Language Models to Find Agreement among Humans with Diverse Preferences. NeurIPS 2022. 18 ҙݟΛਓ͔ؒΒऩू --.Λ༻͍ͯ߹ҙҙݟΛੜ --.ͰEFCBUFRVFTJUPOΛੜ ߹ҙҙݟΛਓ͕ؒධՁ ݸਓผͷSFXBSENPEFM Λֶश 4PDJBMXFMGBSF GVODUJPOΛ༻͍ͯ SFXBSEΛ౷߹
🚀Objective: Diversity --.ͷग़ྗΛଟ༷ͳਓ͕߹ҙͰ͖ΔΑ͏ʹௐ Bakker et al. Fine-tuning Language Models to Find Agreement among Humans with Diverse Preferences. NeurIPS 2022.
https://slideslive.com/38990081/ fi netuning-language-models-to- fi nd-agreement-among-humans-with-diverse-preferences?ref=speaker-23413 19 ྫɿݸਓͷҙݟͱ--.͕ग़ྗͨ͠߹ҙҙݟͷྫ ௐʹΑΓɼ
Platform --.Ξϊςʔγϣϯاۀͱڠྗͯ͠։ൃ͞Ε͍ͯΔ 24 ˔ *OTUSVDU(154DBMF"*ͱ6QXPSLͰΞϊςʔλΛޏ༻ ˔ -MBNBͷΞϊςʔγϣϯʹ4VSHF"*͕ࢀը 4DBMF"* 4VSHF"* 6QXPSL Ouyang et al. Training Language Models to Follow Instructions with Human Feedback. NeurIPS 2022.
ใु࡞ۀ༰ͷྀ͕ඞཁ 📌Feedback pitfalls ਓؒͷϑΟʔυόοΫΛ׆༻͢Δ্Ͱͷ՝ 25 ࢀߟɿFernandes at al. Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation. arXiv:2305.00955
ࢀՃͤ͞Δ Iterate-and-vote Find- fi x-verify Two pugs are … because they hope to fi nally be able to … OK Print publishers are in a tizzy over Apple’s new iPad because they hope to fi nally be able to …
📌Feedback pitfalls: Reliability and variance ࡞ۀऀબൈɿ"UUFOUJPODIFDLͰूத͍ͯ͠ͳ͍࡞ۀऀΛআ֎ Meade and Craig. Identifying Careless Responses in Survey Data. Psychological Methods, 2012.
Brühlmann et al. The Quality of Data Collected Online: An Investigation of Careless Responding in a Crowdsourced Sample. Methods in Psychology, 2020. 27 #PHVT*UFN *OTUSVDUFE3FTQPOTF*UFN ໌Β͔ʹಉҙͰ͖ͳ͍ઃΛؚΊΔ I sleep less than one hour per night. Strongly disagree Disagree Neither disagree nor agree Agree Strongly agree આ໌จͷதͰճ༰Λࢦࣔ͢Δ … To show that you are reading these instructions, please leave this question blank. 4USPOHMZEJTBHSFFͱEJTBHSFF
Ҏ֎Λճͨ͠࡞ۀऀΛআ֎ ࢦࣔʹैΘͳ͔ͬͨ࡞ۀऀΛআ֎ What country do you live in?
˔ *OTUSVDU(15ͰͭͷࢦඪͰ࡞ۀऀΛબൈ ˙ 4FOTJUJWFTQFFDIͷݕग़ೳྗ͕ߴ͍ ˓ 4FOTJUJWFTQFFDI༗ɼੑతɼྗతɼ࣏తͳͲͷɼڧͳ൱ఆతײ ΛҾ͖ى͜͢Մೳੑ͕͋Δͷ ˙ *OTUSVDU(15։ൃऀͱͷϥϯΩϯάҰக͕ߴ͍ ˙ 4FOTJUJWFQSPNQUTʢඍົͳରԠ͕ඞཁͳϓϩϯϓτʣʹର͢Δ EFNPOTUSBUJPOͷهड़ೳྗ͕ߴ͍ ˙ 4FOTJUJWFTQFFDIݕग़ͷಘҙͷଟ༷ੑ 📌Feedback pitfalls: Reliability and variance ࡞ۀऀબൈɿ*OTUSVDU(15༷ʑͳࢦඪͰςετΛ࣮ࢪ 28 Ouyang et al. Training Language Models to Follow Instructions with Human Feedback. NeurIPS 2022.
📌Feedback pitfalls: Reliability and variance ฒྻԽɿճऀͷ৴པੑΛਪఆͯ͠ϥϕϧ౷߹ʹ༻͍Δ 29 A. P. Dawid and A. M. Skene: Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics), 1979. ɿճऀ ͕ਖ਼ղ͕YESͷʹYESͱ͑Δ֬ αj j ɿճऀ ͕ਖ਼ղ͕NOͷʹNOͱ͑Δ֬ βj j ճऀͷ৴པੑύϥϝʔλʢࠞಉߦྻʣ ճ YES NO ਖ਼
ղ YES NO αj βj 1 − αj 1 − βj ࠞಉߦྻ ti ਖ਼ղ YES ti = NO ti = yij βj ճ ճϞσϧʢ ʹର͢Δճऀ ͷճʣ i j αj Pr[yij ∣ ti = 1] = αyij j (1 − αj )(1−yij ) Pr[yij ∣ ti = 0] = β(1−yij ) j (1 − βj )yij
📌Feedback pitfalls: Reliability and variance ྻԽɿ.JDSPTPGU$0$0ΞϊςʔγϣϯλεΫΛࡉԽ 31 Fig. 11: Icons of 91 categories in the MS COCO dataset grouped by 11 super-categories. We use these icons in our annotation pipeline to help workers quickly reference the indicated object category. Lin et al. Microsoft COCO: Common Objects in Context. ECCV 2014. ΧςΰϦͷΞϊςʔγϣϯˠΠϯελϯεͷબˠηάϝϯςʔγϣϯ
📌Feedback pitfalls: Bias ೝόΠΞε͕࡞ۀऀͷஅʹӨڹ͢Δ͜ͱ͕ΒΕ͍ͯΔ 33 Affect Heuristic ֘λεΫʹ͓͍ͯʮ͖ʯͷఔ͕࡞ۀऀͷஅʹӨڹΛ༩͑ΔՄೳੑ͕͋Δ͔ʁྫ͖͑ͳ ϒϥϯυͷΛɼຊͷؔ࿈ੑͱແؔʹʮύΤϦΞುͱؔ࿈͕͋Δʯͱஅͯ͠͠·͏ Anchoring Effect ֘λεΫʹ͓͍ͯ࡞ۀऀ͕அΛԼ͢ࡍʹಛఆͷج४ʹաʹযΛͯΔՄೳੑ͕͋Δ͔ʁ ྫ͑ং൫ʹݟΔ͕໌Β͔ʹύΤϦΞುͱؔ࿈͕ͳ͍߹ɼ࣍ʹදࣔ͞Εͨʮগؔ͠࿈͕͋ Δʯͷؔ࿈ੑΛߴ͘ධՁ͢Δ Availability Bias ֘λεΫʹ͓͍ͯεςϨΦλΠϓͳ࿈ΛҾ͖ى͜͢Մೳੑ͕͋Δ͔ʁྫ͑εϖΠϯͷͰ ͋Δ͚ͩͰύΤϦΞುͱؔ࿈͕͋Δͱஅ͍͢͠ Con fi rmation Bias ֘λεΫʹ͓͍ͯ࡞ۀऀࣗͷઌೖ؍ʹաʹӨڹΛड͚ΔՄೳੑ͕͋Δ͔ʁ࡞ۀऀࣗͷ৴೦ ʹ߹க͢Δ߹ʹʮGBLFͰͳ͘USVFʯʮPQJOJPOBUFEͰͳ͘OFVUSBMʯͱஅ͍͢͠ Groupthink or Bandwagon Effect ֘λεΫʹ͓͍ͯɼଞͷ࡞ۀऀͷஅ͔ΒӨڹΛड͚ΔՄೳੑ͕͋Δ͔ʁଞͷ࡞ۀऀͷେଟ͕ ͋ΔΛύΤϦΞುͱؔ࿈ੑ͕͋Δͱஅͨ͠Γফඅऀ͔ΒߴධՁΛಘ͍ͯΔ߹ɼͦͷӨڹΛ ड͚Δ Salience Bias ֘λεΫʹ͓͍ͯಛఆͷใͷݦஶੑ͕࡞ۀऀͷஅʹӨڹΛ༩͑ΔՄೳੑ͋Δ͔ʁྫ͑ ཱ͕ͭ߹ʹʢߴը࣭ɼେจࣈͷςΩετʣʮύΤϦΞುͱؔ࿈͕͋Δʯͱஅ͍͢͠ Draws et al. A Checklist to Combat Cognitive Biases in Crowdsourcing. HCOMP 2021.
(Con fi rmation biasͷઆ໌ͷࢀߟɿ Gemalmaz and Yin. Accounting for Con fi rmation Bias in Crowdsourced Label Aggregation. IJCAI 2021.ʣ $PHOJUJWF#JBTFTJO$SPXETPVSDJOH$IFDLMJTUʢൈਮʣ ˞આ໌ͷͨΊʮͱʰύΤϦΞುʱͱ͍͏Ωʔϫʔυͷؔ࿈ੑͷධՁʯλεΫΛ༻͍Δ
📌Feedback pitfalls: Bias ՃઃใఏࣔʹΑΓDPO fi SNBUJPOCJBTʹରॲ 34 Hube et al. Understanding and Mitigating Worker Biases in the Crowdsourced Collection of Subjective Judgments. CHI 2019. ख๏4PDJBMQSPKFDUJPO
📌Feedback pitfalls: Bias ਓը૾ͷΞϊςʔγϣϯʹεςϨΦλΠϓͷӨڹ͕͋Δ 36 Otterbacher et al. How Do We Talk about Other People? Group (Un)Fairness in Natural Language Image Descriptions. HCOMP 2019. ˔ 'JHVSF&JHIUͰޏ༻ͨ͠ถࠃɾΠϯυࡏॅऀΛରʹௐࠪ ˔ ΞδΞਓஉੑͷը૾ʹରͯ͠ਓछɾࠃ੶ͷϥϕϧ͕͖͍͢ɽ
🤖Crowdsourcing vs. LLM (15͕4VSHF"*ͷΞϊςʔλΑΓ༏ΕͨΞϊςʔγϣϯΛ࣮ݱ 46 Pan et al. Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark. ICML 2023. ˔ ςΩετϕʔεͷήʔϜͷ໘ʹ͍ͭͯɼΩϟϥΫλʔͷঢ়گʢӕΛ͍͍ͭͯΔ ͔ɼ୭͔Λࡴ͔ͨ͠ɼྗΛৼΔ͔ͬͨʣͷΞϊςʔγϣϯΛ࣮ࢪ ˔ ݸதݸͷΧςΰϦͰɼ(15͕ਓ໊ؒͷଟܾΑΓߴ͍ਫ਼Λୡ ˙ ਓؒɼΞϊςʔγϣϯϓϥοτϑΥʔϜ4VSHF"*Ͱ࣌Ͱޏ༻ɽ
߹ܭ
࣌ؒ
In that moment, you leap out of bed and grab Joel, twisting him into a headlock, hard and fast. Then, you snap his neck. You let Joel’s body go, and it crumbles at your feet like a rag doll. It’s done. But why? Why did you do that? ήʔϜ໘ͷྫ ਖ਼ղͱͷҰக ※Table 8ΛՃͯ͠࡞
˔ ਓͷ.5VSLϫʔΧʹҩֶจͷBCTUSBDUͷཁΛґཔ ˔ ಠࣗʹֶशͨ͠ݕग़ثʹΑΓɼʙͷϫʔΧ͕$IBU(15Λͬͯ࡞ۀΛ ߦͬͨͱਪఆ͞Εͨ 🤖Crowdsourcing vs. LLM $IBU(15ʹ࡞ۀΛؙ͛͢Δ.5VSLϫʔΧ͕ଘࡏ͢Δ 47 Veselovsky et al. Arti fi cial Arti fi cial Arti fi cial Intelligence: Crowd Workers Widely Use Large Language Models for Text Production Tasks. arXiv:2306.07899. ཁ͢ΔBCTUSBDUͷྫ ʮ$IBU(15༻ʯͱݕग़͞Εͨཁ
˔ 4UFQϧʔϧʹج͍ͮͯࣗಈϑΟϧλϦϯά ˔ 4UFQਓؒʹΑΔमਖ਼ɾϑΟϧλϦϯά 🤖Crowdsourcing vs. LLM --.ͱਓؒͷڠಇʹΑΔσʔλ֦ு 49 Liu et al. WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation. EMNLP Findings 2022.