詳解 強化学習 / In-depth Guide to Reinforcement Learning

PRIN Lab
March 19, 2026

A summary of the textbook below, plus some extra material.
https://www.it-book.co.jp/books/147.html

Transcript

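  1. Reference book (Principles of Robot Intelligence from Nature/Nurture)

    https://www.amazon.co.jp/dp/4910558276
    Organizes and introduces recent reinforcement learning algorithms from a practical standpoint.
    1. What is reinforcement learning?
    2. The basic problem setting of reinforcement learning
    3. Basic learning algorithms
    4. Advances in policy gradient methods: the aims and theory behind recent standards such as PPO and SAC
    5. Model-based reinforcement learning: how to build and train world models, with use cases
    6. Challenges in reward design and countermeasures: techniques around the reward function, the biggest practical pain point
    7. Future outlook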
  2. Typical AI

    Supervised learning on a dataset containing many pairs of input data and the correct output data, in order to predict the output for new input data.
    [Figure: black-box function (oracle) mapping Input to Output; example labels: Gorilla, Deer, Sunfish, Tuna, Flan, Tai-yaki]
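  3. What is reinforcement learning?

    An agent, through trial and error in an unknown environment, learns a state-dependent policy that maximizes the return it will obtain in the future.
    ◆ Agent: the entity that learns (robot, game AI, etc.)
    ◆ Environment: the stage on which the agent acts
    ◆ Return: the cumulative value of the per-time-step evaluation results (rewards) given by an evaluation mechanism hidden in the environment
    ◆ State: data that concisely describes the situation of the environment (+ agent)
    ◆ Policy: the decision-making mechanism for how the agent intervenes in the environment (its actions)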
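  4. Why reinforcement learning is hard

    ◆ Indirect instruction
      ➢ Unlike supervised learning, the correct action for a given state is not provided...
      ➢ Instead, the agent searches for the best action relying on the good/bad signal of the reward
    ◆ Data collection
      ➢ Unlike supervised learning, no training dataset is prepared in advance...
      ➢ Instead, the agent itself tries various actions to gather data
    ◆ Return prediction
      ➢ The reward indicates only how good the present is, not how good the future will be...
      ➢ Instead, the agent predicts the return, which also depends on its own policy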
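  5. Example answers

    ◆ Move your hand so that it gets ahead of the direction in which the compass is falling
    ◆ Move your hand periodically first, so that the compass motion becomes a pattern
    ◆ Stick the compass needle into your palm
      ➢ All of these act so as to "steer toward a favorable outcome while taking the future into account"!
    This is (perhaps) the most important premise in reinforcement learning.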
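  6. Preconditions under which reinforcement learning can be solved (without extra tricks)

    It must be possible to predict, probabilistically, how the state transitions under an action
    -> Markov Decision Process (MDP)
    Example: shortest-path planning
      State: position on a grid
      Action: move forward/backward/left/right
      Reward: distance to the goal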
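  7. General state transitions

    The transition of state s follows the history of states so far
    -> Prediction that refers to the entire history is difficult...
    [Figure: a) state transition of a general environment: timeline past ... s_{t-1}, s_t (current time t), s_{t+1}, s_{t+2}, ... future; the state-transition probability is conditioned on the whole past, and the next state is sampled from it]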
  8. With a properly defined state...

    State transitions become predictable even without the history of past states!
    -> This is called a Markov process (or a simple Markov chain)
    [Figure: a) state transition of a general environment (conditioned on the whole past); b) Markov process (the past is unnecessary; the condition is the current state only); c) Markov decision process]
  9. With a properly defined state... (continued)

    State transitions become predictable even without the history of past states!
    -> This is called a Markov process (or a simple Markov chain)
    [Figure: same as the previous slide, with the state-transition probability written with the time index omitted]
  10. When actions can intervene in the state transition...

    A phenomenon that merely evolves passively over time becomes a system whose future can be manipulated (to some extent)!
    -> In a Markov decision process, the action a can be decided depending only on the state s
    [Figure: b) Markov process; c) Markov decision process, where the agent's action a_t intervenes in the state transition. Which action drives the system to the desired state?]
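  11. The reinforcement learning framework

    In a Markov decision process, the reward function that indicates how good the present is can be written as r(s, a, s')!
    What probability distribution, concretely, is the policy π(a|s)...?
    [Figure: agent π(a|s) sends action a to the unknown environment p_e(s'|s, a), r(s, a, s'), which returns state s and reward r]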
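  12. Policies in a discrete action space

    When the action is defined by a finite set of choices...
    -> Optimize the preference θ_i of the i-th choice and design the selection probabilities from it!
    a) ε-greedy policy: only the choice with the largest θ_i gets a high probability; the others are uniform
    b) Softmax policy: the magnitudes of θ_i are reflected in the probabilities
    [Figure: selection probability (up to 1) over the action choices for each policy]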
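    The two designs above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the deck's own code: the function names, the `epsilon` and `temperature` parameters, and the example preferences are assumptions made for the sketch.

```python
import math

def epsilon_greedy_probs(theta, epsilon=0.1):
    """The best choice gets 1 - epsilon plus its uniform share; the rest share epsilon uniformly."""
    n = len(theta)
    best = max(range(n), key=lambda i: theta[i])
    probs = [epsilon / n] * n
    probs[best] += 1.0 - epsilon
    return probs

def softmax_probs(theta, temperature=1.0):
    """Selection probability proportional to exp(theta_i / temperature)."""
    m = max(theta)  # subtract the max for numerical stability
    exps = [math.exp((t - m) / temperature) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

theta = [1.0, 2.0, 0.5]  # hypothetical preferences for three choices
print(epsilon_greedy_probs(theta))
print(softmax_probs(theta))
```

    Replacing `theta` with action values Q(s, a) gives the value-based policies discussed later in the deck.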
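  13. Policies in a continuous action space

    When the action is defined as a real-valued vector (within some range)...
    -> A simple normal distribution is the standard choice, but the distribution can also be designed for desired properties!
    † Recently, diffusion models and the like are also used to secure sufficient expressiveness
    [Figure: probability density over the domain of the control input]
      ➢ Normal distribution: simplicity
      ➢ Student's t-distribution: heavy tails
      ➢ Beta distribution: boundedness
      ➢ Mixture distribution: multimodality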
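  14. Why the policy needs to be a probability distribution

    For the agent to gain diverse experience through trial and error, stochastic behavior is simple yet effective!
    † Beware of the exploration-exploitation dilemma
    [Figure annotations: "You will not find it unless you explore patiently..." / "This is not necessarily the optimum..."]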
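  15. Optimizing the policy = maximizing the return

    Optimize the current policy so as to maximize the cumulative reward (the return) obtained from the environment from now on!
    -> If we cannot predict what return we will get, we cannot know how to maximize it...
    Return: the cumulative sum risks numerical divergence, so future rewards are valued less to stabilize it numerically, using a discount rate γ ∈ [0, 1)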
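    As a small worked example of the discounted return, the sketch below accumulates rewards backwards in time. The function name and the sample reward sequences are assumptions for illustration.

```python
def discounted_return(rewards, gamma=0.99):
    """Return r_0 + gamma * r_1 + gamma^2 * r_2 + ... for a finite reward sequence."""
    g = 0.0
    for r in reversed(rewards):  # accumulate from the last step backwards
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
# With gamma < 1 and rewards bounded by 1, the return stays below 1 / (1 - gamma):
print(discounted_return([1.0] * 1000, gamma=0.9))
```

    With γ = 0.9 the second value stays below 10 no matter how long the sequence is, which is the numerical stabilization the slide refers to.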
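  16. Introducing value functions

    The expected return over the stochastic trajectory τ, traced by repeatedly transitioning under the state-transition probability p_e and the policy π, is turned into a function!
    - (State) value function
    - (State-)action value function
    [Figure: trajectories τ from state s_t over time t, colored by return from high to low; the expectation is taken over them]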
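  17. Tips: how to satisfy the Markov decision process

    Design the state and action spaces appropriately.
    [Flowchart: specify the task + select the possible actions -> verify each required state s_i (i = 1, 2, ...) -> Observable? Yes: observe with a sensor. No: Estimable? Yes: introduce an estimator or substitute sensor. No: revise the environment]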
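  18. Tips: selecting the possible actions (≈ setting the boundary between environment and agent)

    Choosing actions directly tied to the task to be achieved tends to improve learning efficiency
    (e.g. object manipulation -> end-effector position/velocity, or the object to grasp)
    Candidate actions (example: object manipulation with a robot arm)
    ◆ Continuous action space
      ➢ Joint angle, angular velocity, torque
      ➢ End-effector position, velocity, force
      ➢ PD gains, impedance parameters
      ➢ and so on...
    ◆ Discrete action space
      ➢ ID of the joint to drive + rotation direction
      ➢ End-effector movement direction (left/right, up/down, back/front)
      ➢ Preset parameters
      ➢ ID of the object to grasp
      ➢ and so on...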
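  19. Tips: verifying the required states

    The states needed to predict the future differ depending on the task and on the designed action space
    -> Actively use domain knowledge to clarify the causal relationships first, even if only qualitatively!
    Example: object manipulation with a robot arm
    ◆ Task: reaching a fixed object
      ➢ Action = end-effector velocity: -> joint angles or end-effector position
      ➢ Action = joint torque: -> joint angles and angular velocities, or end-effector position and velocity
    ◆ Task: reaching a randomly placed object
      ➢ Action = end-effector velocity: -> joint angles or end-effector position + object position
    ◆ Task: grasping a target object while avoiding collisions
      ➢ Action = end-effector velocity: -> joint angles or end-effector position + target object position and orientation + grasping force + obstacle map around the trajectory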
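  20. Tips: a more concrete example

    Pendulum-v0† († https://gymnasium.farama.org/environments/classic_control/pendulum/)
    ◆ Task
      ➢ Swing up and balance the pendulum = reward: pendulum angle pointing straight up
    ◆ Action
      ➢ Torque τ on the axis at the base
    ◆ State
      ➢ Pendulum angle θ
      ➢ Pendulum angular velocity θ̇
      ➢ (Constants are not needed in the state)
    ◆ Observation
      ➢ Angle: encoder
      ➢ Angular velocity: gyro sensor
    Equation of motion: J(m, l) θ̈ = τ + m g l sin(θ)
    Update by Euler's method: θ̇ ← θ̇ + θ̈ Δt, θ ← θ + θ̇ Δt
    Constants: inertia term J(m, l), coefficient of the gravity term m g l, time-step interval Δt, target value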
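    The equation of motion and Euler update above can be sketched as a single step function. This is a simplified illustration under stated assumptions: `inertia`, `gravity_coef`, and `dt` are placeholder constants, not the values used by the Gymnasium environment, and torque limits and reward are omitted.

```python
import math

def pendulum_step(theta, theta_dot, torque, inertia=1.0, gravity_coef=9.8, dt=0.05):
    """One Euler step of J * theta_ddot = torque + gravity_coef * sin(theta).

    theta = 0 is taken as the upright position. The constants are illustrative
    assumptions and are not part of the state.
    """
    theta_ddot = (torque + gravity_coef * math.sin(theta)) / inertia
    theta_dot = theta_dot + theta_ddot * dt   # update the angular velocity first
    theta = theta + theta_dot * dt            # then the angle, using the new velocity
    return theta, theta_dot

state = (0.1, 0.0)  # slightly tilted, at rest
for _ in range(3):
    state = pendulum_step(*state, torque=0.0)
print(state)  # the tilt grows because the upright equilibrium is unstable
```

    Because the next (θ, θ̇) depends only on the current (θ, θ̇) and the torque, this pair is a sufficient state for the Markov property.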
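  21. Tips: Partially Observable Markov Decision Process (Partially Observable MDP)

    When the system lacks the sensors needed to observe the entire state...
    -> The state must be estimated from other information (RNNs are often used)
    Examples:
    - Angular velocity estimated by backward difference: θ̇_t ≃ (θ_t − θ_{t−1}) / Δt
    - † Angle estimated from a camera image: θ_t ← Image_t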
  22. Learning value functions: supervised learning

    If we really want to, the return that serves as the correct answer can be computed, but alternatives are also conceivable...
    [Figure: tree of sampled states, actions, and rewards branching from (s_t, a_t, r_t), marking the quantity being learned and its training target for a) the Monte Carlo method and b) the TD method]
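  23. The Monte Carlo method (in general)

    An expectation can be approximated by the average of function values over random variables sampled from the probability distribution!
    -> This makes most of the expectations that appear frequently in reinforcement learning numerically computable
    Examples: 1) rolling dice, 2) a lottery simulator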
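    The dice example can be sketched as follows. The helper name, sample count, and seed are assumptions chosen for the illustration.

```python
import random

def monte_carlo_expectation(sample, f, n=100_000, seed=0):
    """Approximate E[f(x)] by averaging f over n samples drawn with sample(rng)."""
    rng = random.Random(seed)
    return sum(f(sample(rng)) for _ in range(n)) / n

def roll_die(rng):
    return rng.randint(1, 6)  # fair six-sided die

print(monte_carlo_expectation(roll_die, lambda x: x))      # near the exact mean 3.5
print(monte_carlo_expectation(roll_die, lambda x: x * x))  # near the exact value 91/6
```

    The estimate is unbiased, but its error shrinks only with the square root of the number of samples, which is the variance issue raised for Monte Carlo value learning.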
  24. The Monte Carlo method for learning value functions

    Generate a trajectory until the trial is terminated, and compute the return R_t from state s_t
    -> Since this is the true value there is no bias, but the randomness is strong and the variance is large
    [Figure: a) Monte Carlo method: one episode through the tree of sampled states, actions, and rewards, with the quantity being learned and its training target; loss function of the value function under the Monte Carlo method]
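  25. Recursive structure of the return (Bellman equation)

    The return and the value function can be expressed with the reward and the return/value function at the next time step!
    -> Because it carries reliable information, to the extent of the reward actually obtained from the environment, it can serve as supervised training data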
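    For reference, the recursion can be written out in its standard textbook form. The notation below (discount rate γ, environment transition p_e, policy π) follows the earlier slides; the exact form on the original slide is an image and is not reproduced here, so treat this as an assumed reconstruction.

```latex
\begin{align}
R_t &= \sum_{k=0}^{\infty} \gamma^{k} r_{t+k} = r_t + \gamma R_{t+1} \\
V^{\pi}(s) &= \mathbb{E}_{a \sim \pi(a \mid s),\; s' \sim p_e(s' \mid s, a)}
  \left[ r(s, a, s') + \gamma V^{\pi}(s') \right] \\
Q^{\pi}(s, a) &= \mathbb{E}_{s' \sim p_e(s' \mid s, a)}
  \left[ r(s, a, s') + \gamma \, \mathbb{E}_{a' \sim \pi(a' \mid s')} \left[ Q^{\pi}(s', a') \right] \right]
\end{align}
```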
  26. Learning value functions with the TD method

    Drive (right-hand side − left-hand side) = the Temporal Difference (TD) error δ to zero!
    -> Because it includes an estimate, bias arises, but randomness is reduced and the variance is small
    [Figure: b) TD method: only one step of the tree is used; loss function of the value function under the TD method, where the target is used as a numerical value (its gradient is not computed)]
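    A tabular sketch of this update, assuming a dictionary-based value table; the function name, learning rate `alpha`, and the toy values are illustrative assumptions.

```python
def td0_update(values, s, r, s_next, done, alpha=0.1, gamma=0.99):
    """Move V(s) toward r + gamma * V(s_next) and return the TD error delta."""
    # The target is used as a plain number: no gradient flows through V(s_next).
    target = r if done else r + gamma * values[s_next]
    delta = target - values[s]
    values[s] += alpha * delta
    return delta

values = {0: 0.0, 1: 1.0}
delta = td0_update(values, s=0, r=0.5, s_next=1, done=False, alpha=0.5, gamma=0.9)
print(delta, values[0])
```

    In a deep RL setting the same idea appears as a squared TD error loss with the target detached from the computation graph.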
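  27. Intuition for the TD method

    The predicted return differs between time steps, but...
    -> Once the return is predicted correctly, the TD error converges (ideally) to zero!
    [Figure: prediction at the current time vs. at the next time after action a_t; case "current > next"]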
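  28. Intuition for the TD method (continued)

    The predicted return differs between time steps, but...
    -> Once the return is predicted correctly, the TD error converges (ideally) to zero!
    [Figure: prediction at the current time vs. at the next time after action a_t; case "current < next"]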
  29. (Expected) SARSA / Q-learning

    The algorithm branches depending on how the next action a' is given...
    ◆ SARSA
      ➢ The action actually selected: a' = a'_t
    ◆ Expected SARSA
      ➢ The action selected according to the current policy π: a' ∼ π(a|s'_t)
    ◆ Q-learning
      ➢ The action believed to be the best: a' = argmax_a Q(s'_t, a)
    DQN (the deep reinforcement learning version of Q-learning): https://youtu.be/V1eYniJ0Rnk
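    The three choices of a' lead to three TD targets, sketched below for a tabular Q. The function names and the table layout (`q[state]` is a list indexed by action) are assumptions; Expected SARSA is written with the full expectation over π, of which sampling a' ∼ π is a one-sample estimate.

```python
def sarsa_target(q, r, s_next, a_next, gamma=0.99):
    """Uses the action that was actually taken in the next state."""
    return r + gamma * q[s_next][a_next]

def expected_sarsa_target(q, r, s_next, policy_probs, gamma=0.99):
    """Averages the next action value under the current policy pi(a | s_next)."""
    expected = sum(p * qv for p, qv in zip(policy_probs, q[s_next]))
    return r + gamma * expected

def q_learning_target(q, r, s_next, gamma=0.99):
    """Uses the action currently believed to be the best."""
    return r + gamma * max(q[s_next])

q = {1: [1.0, 3.0]}  # hypothetical action values in state 1
print(sarsa_target(q, 1.0, 1, a_next=0, gamma=0.5))
print(expected_sarsa_target(q, 1.0, 1, [0.5, 0.5], gamma=0.5))
print(q_learning_target(q, 1.0, 1, gamma=0.5))
```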
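  30. Generalization: n-step TD method / TD(λ) method

    Thanks to the recursive structure, rewards up to n steps ahead can also be taken into account
    + The n-step TD targets (n = 1, 2, ..., ∞) are averaged with weights determined by λ ∈ [0, 1]
    -> The bias-variance trade-off can be tuned!
    [Equations on slide: n-step TD method; TD(λ) method as a weighted sum; an alternative expression of TD(λ)]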
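    A minimal sketch of both targets for a finite trajectory. It assumes `values` holds one more entry than `rewards` (the value after the final step), and it assigns the leftover λ-weight to the longest available return; names and conventions are illustrative, not the deck's.

```python
def n_step_return(rewards, values, t, n, gamma=0.99):
    """Sum n discounted rewards from time t, then bootstrap with V at step t + n."""
    horizon = min(n, len(rewards) - t)
    g = sum(gamma**k * rewards[t + k] for k in range(horizon))
    return g + gamma**horizon * values[t + horizon]

def lambda_return(rewards, values, t, lam=0.95, gamma=0.99):
    """Average the n-step returns with weights (1 - lam) * lam**(n - 1)."""
    max_n = len(rewards) - t
    g = 0.0
    for n in range(1, max_n):
        g += (1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
    # The remaining weight lam**(max_n - 1) goes to the longest return.
    g += lam ** (max_n - 1) * n_step_return(rewards, values, t, max_n, gamma)
    return g

rewards = [1.0, 1.0]
values = [0.0, 2.0, 0.0]
print(lambda_return(rewards, values, 0, lam=0.0, gamma=0.5))  # 1-step TD target
print(lambda_return(rewards, values, 0, lam=1.0, gamma=0.5))  # full-length return
print(lambda_return(rewards, values, 0, lam=0.5, gamma=0.5))  # in between
```

    λ = 0 recovers the low-variance, biased 1-step target; λ = 1 recovers the high-variance Monte Carlo-like return.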
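  31. The optimal policy function

    The goal is to maximize the expected return = the value function
    -> Because the value function depends on the policy, a greedy policy can get it wrong...
    [Equations on slide: value function with its dependence on the policy made explicit; definition of the optimal policy; probability of generating the trajectory from the current state onward]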
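  32. Policy design based on the action value function in a discrete action space

    While keeping a stochastic component for exploration, it basically suffices to choose the action with a high action value!
    -> The design is complete once θ below is replaced by Q
    a) ε-greedy policy: only the choice with the largest θ_i gets a high probability; the others are uniform
    b) Softmax policy: the magnitudes of θ_i are reflected in the probabilities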
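  33. Policy design based on the action value function in a discrete action space (continued)

    While keeping a stochastic component for exploration, it basically suffices to choose the action with a high action value!
    -> The design is complete once θ below is replaced by Q
    [Figure: the same a) ε-greedy and b) softmax plots, with each preference θ_i now labeled as an action value Q_i]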
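  34. The continuous action space case: policy gradient methods

    The gradient with respect to the parameters θ of the probability distribution that models the policy is computed recursively
    † Also applicable to discrete action spaces (replace the integral with a summation)
    [Equation annotations: recursive structure; probability of reaching s; replacing the value with the return R removes the need for a value function; resulting policy gradient method]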
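  35. (Advantage) Actor-Critic method: A2C

    A value function is introduced as a baseline that adds no bias to the policy gradient, which suppresses its variance
    This can be interpreted as a maximum-likelihood estimation problem for the policy function weighted by the advantage function A = Q − V!
    The advantage can be approximated by the TD error δ
    [Figure: (Advantage) Actor-Critic method: the actor π outputs action a; the critic (V or Q) receives state s (next state s') and reward r and provides the update amount to the actor]
    Since the policy cannot be optimized without a correct evaluation, the actor's learning rate is generally set no larger than the critic's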
  36. Tips: introducing a baseline

    Adding a scalar function b(s) that depends only on the state does not change the gradient!
    -> The value function V(s) can be used as a baseline
    [Equation annotations: ∇_θ ln π = ∇_θ π / π; the policy integrates to 1]
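    The two annotations combine into the usual short argument, written out here as an assumed reconstruction for a policy π_θ(a|s) with parameters θ (the derivation on the slide itself is an image):

```latex
\begin{align}
\mathbb{E}_{a \sim \pi_\theta(a \mid s)} \left[ b(s) \, \nabla_\theta \ln \pi_\theta(a \mid s) \right]
  &= b(s) \int \pi_\theta(a \mid s) \, \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)} \, da \\
  &= b(s) \, \nabla_\theta \int \pi_\theta(a \mid s) \, da
   = b(s) \, \nabla_\theta 1 = 0
\end{align}
```

    Subtracting b(s) therefore leaves the expected gradient unchanged while it can reduce the variance of its sample estimate.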
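  37. Learning an unknown function ≈ function approximation

    Value functions and policy functions are learned with a function approximator whose shape can be changed through its parameters!
    Example: linear function approximation y = w^⊤ φ(x)
    - Basis functions (features) φ map the input space X to real values
    - The height of each basis is adjusted by its weight
    - The function is approximated by the sum of the adjusted bases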
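  38. Powerful function approximation ≈ deep learning

    Parameterizing the features as well and learning them simultaneously yields high approximation performance!
    † Learning easily becomes unstable, so countermeasures are needed... (discussed later)
    [Figure: input layer x -> hidden layer 1 with n_1 neurons and activation function -> hidden layer 2 with n_2 neurons and activation function -> ... -> hidden layer L with n_L neurons and activation function -> output layer y; the L hidden layers compute the features; arrows denote linear transformations by weights and biases]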
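  39. Deep reinforcement learning

    e.g. https://skrl.readthedocs.io/en/latest/
    Deep learning is used for the function approximation inside reinforcement learning, and the loss function is minimized with backpropagation + stochastic gradient descent!
    Code in the style of a deep learning library (PyTorch), reused as-is across algorithms:
      optimizer.zero_grad()
      loss = objective_RL(data, net)
      loss.backward()      # backpropagation: efficient computation of the gradient
      optimizer.step()     # gradient descent: minimization of the loss function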
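  40. Key techniques in deep reinforcement learning: experience replay

    Efficiency improves by storing past experiences in a buffer and reusing them for learning many times
    † Note that only off-policy algorithms, which do not depend on the current policy, can be combined with it
    [Figure: experiences (s, a, s', r) from the agent-environment loop enter a FIFO buffer D; the oldest experiences are deleted first; experiences are selected at random and reused for learning many times]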
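    A minimal FIFO replay buffer can be sketched with the standard library. The class name, capacity, and the integer placeholder experiences are assumptions for illustration.

```python
import random
from collections import deque

class ReplayBuffer:
    """FIFO buffer of experiences (s, a, s_next, r) with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # the oldest experience is dropped first
        self.rng = random.Random(seed)

    def add(self, s, a, s_next, r):
        self.buffer.append((s, a, s_next, r))

    def sample(self, batch_size):
        # Uniform sampling without replacement; the same experience may be
        # drawn again in later batches, so it is reused many times.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=3, seed=0)
for i in range(5):
    buf.add(i, 0, i + 1, 1.0)
print(len(buf), buf.sample(2))
```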
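  41. Key techniques in deep reinforcement learning: target network

    A separate network that slowly follows the main network is used to estimate the supervision signal for TD learning and similar methods
    -> Fluctuation of the supervision signal is suppressed and learning becomes stable! (Note that overdoing it delays learning)
    [Figure: the main network is updated by backpropagation from the comparison between its output and the target network's output; the target network is updated by copying from the main network]
    Example: φ̄ ← (1 − τ) φ̄ + τ φ (with τ ≪ 1)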
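    The update rule in the example can be sketched on plain lists of parameters. The function name and the toy parameter values are assumptions.

```python
def soft_update(target_params, main_params, tau=0.005):
    """Return target <- (1 - tau) * target + tau * main, element by element."""
    return [(1.0 - tau) * t + tau * m for t, m in zip(target_params, main_params)]

target = [0.0, 1.0]
main = [1.0, 1.0]
for _ in range(3):
    target = soft_update(target, main, tau=0.1)
print(target)  # the first entry creeps toward 1.0; the second stays at 1.0
```

    A small `tau` keeps the supervision signal steady at the price of a slower reaction to improvements in the main network.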
  42. Key techniques in deep reinforcement learning: ensemble learning

    The computational cost increases, but comparing multiple outputs suppresses the error of the function approximation!
    + The variance of the outputs is also useful for estimating under-trained states and for promoting exploration (discussed later)
    [Figure: a) ensemble of K separate networks with the same architecture, each mapping the input through hidden layers 1..L to its own output; b) ensemble that shares the layers up to the hidden layers and has K output heads]
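  43. Key techniques: (relative) entropy

    Quantitatively evaluates the ambiguity of a probability distribution and its divergence from another probability distribution!
    † Depending on the model of the probability distribution, the expectation can be solved analytically
    - Entropy H(p) (also called differential entropy for continuous values): H(p) ≃ 0 for a sharp distribution, H(p) ≫ 1 for a broad one
    - Relative entropy (Kullback-Leibler divergence) KL(p_1 || p_2) ≥ 0, defined through the density ratio: ≃ 0 when the distributions overlap, ≫ 1 when they are far apart
    [Figure: probability density (or probability mass) over the random variable x for each case]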
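    For discrete distributions both quantities are short sums, sketched below. The function names and the example distributions are assumptions; continuous distributions would need integrals or closed forms instead.

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) ln p(x); terms with p(x) = 0 contribute nothing."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) ln(p(x) / q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

sharp = [0.98, 0.01, 0.01]
uniform = [1 / 3, 1 / 3, 1 / 3]
print(entropy(sharp), entropy(uniform))  # the uniform distribution is more ambiguous
print(kl_divergence(sharp, sharp), kl_divergence(sharp, uniform))
```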
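  44. Key techniques: importance sampling

    Compensates for the difference between the theoretical probability distribution in an expectation and the distribution actually used for sampling, while theoretically guaranteeing equivalence!
    The integrand is multiplied by q(x)/q(x) = 1, and the density ratio p(x)/q(x) then appears as a weight
    [Figure: output y = f(x) and the probability densities (or probability masses) p(x) and q(x) over the random variable x]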
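    A small numerical sketch: estimating an expectation under a biased coin p while sampling only from a fair coin q. All names and the two coin distributions are assumptions for illustration.

```python
import random

def importance_sampling_mean(f, p, q, sample_q, n=100_000, seed=0):
    """Estimate E_p[f(x)] from samples of q by weighting each sample with p(x) / q(x)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = sample_q(rng)
        total += (p(x) / q(x)) * f(x)  # density ratio times function value
    return total / n

def p(x):  # target: biased coin with P(x = 1) = 0.8
    return 0.8 if x == 1 else 0.2

def q(x):  # sampling distribution: fair coin
    return 0.5

def sample_q(rng):
    return rng.randint(0, 1)

print(importance_sampling_mean(lambda x: x, p, q, sample_q))  # near E_p[x] = 0.8
```

    In policy-gradient methods p plays the role of the current policy and q the older policy that collected the data.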
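  45. Key techniques: the reparameterization trick

    To backpropagate through a variable sampled from a probability distribution (i.e. to keep the computational graph), the sampling noise and the distribution parameters are separated and then recombined!
    [Figure: ordinary sampling: x -> p(y|x) -> y -> loss L, where backpropagation stops at the sampling step; reparameterization trick: x -> distribution parameters, combined with noise ε -> y -> loss L, where backpropagation passes through]
    Normal distribution: y = μ(x) + σ(x) ⊙ ε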
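  46. A2C that accounts for policy change

    The policy π_old at the time the experience was collected and the current policy π to be trained do not necessarily coincide...
    -> Use importance sampling to equivalently transform the policy gradient and the loss function into a form that matches the implementation!
    [Equations on slide: policy gradient; the corresponding loss function; Monte Carlo approximation; only π is differentiated]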
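  47. Restricting the policy update

    † https://proceedings.mlr.press/v37/schulman15.html
    If the policy changes abruptly, both behavior and learning become unstable...
    -> Restrict the policy change before and after learning with the KL divergence!
    Policy learning with an inequality constraint; solving this approximately is TRPO†
    [Figure: probability density (or probability mass) over the random variable x: the optimal policy, and the policy obtained when the restriction is imposed]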
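  48. Proximal Policy Optimization: PPO† (Ver. 1)

    † https://arxiv.org/abs/1707.06347
    More conveniently, a regularization term is introduced in place of the inequality constraint!
    + The regularization weight β is adjusted by a heuristic method
    [Equations on slide: PPO (Ver. 1) objective; update rule for the weight]
    The KL term is obtained either analytically, by keeping π_old, or by Monte Carlo approximation using the actions a that were sampled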
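  49. Proximal Policy Optimization: PPO† (Ver. 2)

    † https://arxiv.org/abs/1707.06347
    Even more conveniently, the density ratio is clipped under a certain condition (= the gradient is set to zero)!
    -> Simply setting a constant threshold ε ≃ 0.1 to 0.3 enables fairly stable learning
    [Figure: data collection and learning phases; loss versus density ratio for A < 0 and for A > 0, flat (regularized) outside the interval from 1 − ε to 1 + ε]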
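    The clipping can be sketched per sample as the clipped surrogate objective from the PPO paper (the loss is its negative). The function name and the example numbers are assumptions.

```python
def ppo_clip_objective(ratio, advantage, epsilon=0.2):
    """min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) for one sample.

    ratio is the density ratio pi(a|s) / pi_old(a|s); the training loss is the
    negative of this value averaged over a mini-batch.
    """
    clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped * advantage)

# A > 0: raising the ratio beyond 1 + eps brings no further gain (zero gradient).
print(ppo_clip_objective(1.5, advantage=1.0))
# A < 0: lowering the ratio below 1 - eps brings no further gain either.
print(ppo_clip_objective(0.5, advantage=-1.0))
# Moves in the harmful direction are not clipped and are still penalized.
print(ppo_clip_objective(1.5, advantage=-1.0))
```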
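  50. Tips: tricks built into PPO

    ◆ Regularization of the policy entropy
      ➢ A term that directly maximizes the entropy is added to the loss so that the policy can keep exploring
      † Adding policy-entropy regularization in a theoretically grounded way is discussed later
    ◆ TD(λ) method (also called GAE)
      ➢ Multiple trajectories are collected, and the TD error (advantage function) is computed with the TD(λ) method
      ➢ Mini-batch learning by random sampling from the buffer in units of state-transition data
      † PPO is considered, in theory, incompatible with experience replay, so the buffer is reset after learning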
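  51. Computing the policy gradient directly

    Policy gradient methods are needed because the error cannot be backpropagated from a sampled action to the policy...
    -> Make full use of the reparameterization trick and maximize the action value function more efficiently!
    [Equation annotations: gradient, with respect to the parameters φ, in the direction that maximizes the action value; equivalent to a ∼ π(a|s); the states need not depend on π]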
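  52. Deep Deterministic Policy Gradient: DDPG†

    † https://arxiv.org/abs/1509.02971
    The policy is made a deterministic function a = μ_φ(s), and exploration noise is added only during data collection
    An Ornstein–Uhlenbeck (OU) process or similar is sometimes used for the noise
    - Data collection: the scale σ of the exploration noise is fixed -> entropy regularization is unnecessary
    - Learning: the action that maximizes the action value is used to compute the TD error ≈ equivalent to Q-learning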
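  53. Improvement to Twin Delayed DDPG (TD3)†

    † https://proceedings.mlr.press/v80/fujimoto18a.html
    ◆ Desynchronize the learning timing of the policy and the value function
      Problem: the policy cannot be optimized unless the value function has been learned first...
      ➢ Reduce the learning frequency of the policy below that of the value function (half by default)
    ◆ Change from a Q-learning equivalent to an Expected SARSA equivalent
      Problem: Q-learning tends to overestimate future value and mistake a wrong action for the optimal one...
      ➢ Smooth the future value to suppress overestimation
    ◆ Ensemble learning of the action value function
      Problem: same as above when approximation error causes overestimation
      ➢ Learn two independent action value functions and adopt the smaller future value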
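    The second and third items can be sketched as a target computation. This is an illustrative scalar-action version: the function names, the noise clip, and the action bounds are assumptions, and the two next-state values are passed in as plain numbers instead of network outputs.

```python
def smoothed_target_action(mu_next, noise, noise_clip=0.5, low=-1.0, high=1.0):
    """Add clipped noise to the target policy's action, then clip to the action range."""
    noise = max(-noise_clip, min(noise, noise_clip))
    return max(low, min(mu_next + noise, high))

def td3_target(r, q1_next, q2_next, done, gamma=0.99):
    """Use the smaller of the two next-state action values to curb overestimation."""
    if done:
        return r
    return r + gamma * min(q1_next, q2_next)

a_next = smoothed_target_action(mu_next=0.9, noise=0.7)
print(a_next)  # clipped to the upper action bound
print(td3_target(r=1.0, q1_next=2.0, q2_next=3.0, done=False, gamma=0.5))
```

    Averaging the target over noisy actions is what makes it resemble an Expected SARSA target instead of a hard maximum.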
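  54. Maximizing the policy entropy

    The policy entropy is added to the original reward function
    -> The relationship between the optimal policy and the value functions is theoretically redefined!
    [Equation annotations: redefinition of the reward; temperature parameter (the higher, the more random); redefinition of the optimal policy; a negative log-likelihood is added to the usual relationship]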
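  55. The soft Bellman equation

    As before, the recursive structure of the return is used to estimate the supervision signal for the value function!
    -> Note that this differs from simply adding the policy entropy to the reward
    [Equation annotations: soft Bellman equation; policy entropy evaluated at the next state]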
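  56. Soft Q-learning: SQL†

    † https://proceedings.mlr.press/v70/haarnoja17a.html
    The action value function is learned using the soft Bellman equation!
    -> In a discrete action space, the optimal policy becomes a softmax policy
    [Equation annotations: loss function of SQL; theoretical optimal policy; normalization constant (makes the total equal to 1)]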
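  57. Soft Actor-Critic: SAC†

    † https://proceedings.mlr.press/v80/haarnoja18b.html
    The policy is learned by minimizing its divergence from the optimal policy in SQL!
    -> That said, like the policy gradient methods so far, this amounts to maximizing the value function
    [Equation annotations: policy update of SAC; a term unrelated to π; minimizing the negative value function = maximizing the value function]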
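  58. Tips: tricks built into SAC

    ◆ Ensemble learning of the action value function
      ➢ As in TD3, the smaller of two action values is adopted
    ◆ Use of the reparameterization trick
      ➢ As in DDPG/TD3, the policy gradient is computed directly
    ◆ Conversion to a bounded policy
      ➢ The action sampled from a normal distribution is transformed with the tanh function
      ➢ The likelihood is also computed with the probability-distribution model that includes this transformation
    ◆ Automatic tuning of the temperature parameter
      ➢ Interpreted as constrained optimization and turned into a Lagrange multiplier (next slide)
    [Figure: data collection and learning phases]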
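    The third item, the likelihood under the tanh transformation, follows from the change-of-variables formula. The sketch below handles a one-dimensional action; the function name and the small constant `eps` added for numerical safety are assumptions.

```python
import math

def squashed_gaussian_log_prob(u, mean, std, eps=1e-6):
    """Log-density of a = tanh(u) when u ~ Normal(mean, std), for a scalar action.

    Change of variables: log pi(a) = log N(u; mean, std) - log(1 - tanh(u)^2).
    """
    log_normal = (-0.5 * ((u - mean) / std) ** 2
                  - math.log(std) - 0.5 * math.log(2.0 * math.pi))
    a = math.tanh(u)
    log_prob = log_normal - math.log(1.0 - a * a + eps)
    return a, log_prob

print(squashed_gaussian_log_prob(0.0, mean=0.0, std=1.0))
print(squashed_gaussian_log_prob(2.0, mean=0.0, std=1.0))  # action close to the bound +1
```

    Near the bounds the correction term becomes large and positive, because tanh compresses a wide range of u into a narrow range of a.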
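  59. Tips: automatic tuning of the temperature parameter

    https://arxiv.org/abs/2303.04356
    The policy-entropy regularization is first interpreted as an inequality constraint
    -> The method of Lagrange multipliers is applied (while converting it into an equality constraint)!
    [Equation annotations: optimization problem with a policy-entropy constraint; the desired lower bound; the original optimal policy; method of Lagrange multipliers]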
  60. Tips: a side effect of maximizing the policy entropy

    By accumulating diverse experience, the policy becomes more robust to uncertainty!
    - Road surfaces not seen in training: https://youtu.be/KOObeIjzXTY
    - Loads on the end-effector not seen in training: https://youtu.be/EH3xVtlVaJw
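  61. Aside: smoothing the functions†

    † https://ieeexplore.ieee.org/document/9981812
    If the policy and value functions are not smooth, behavior and learning become unstable...
    -> Add a regularization that smooths the functions (within a range that does not impair expressiveness)!
    [Figure: a non-smooth function to be smoothed; in the state space S, a local region around the state s and its next state s' within which changes of the output are suppressed]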
  62. Aside: Locally Lipschitz Continuous Constraint (L2C2)†

    † https://ieeexplore.ieee.org/document/9981812
    Actions become smoother while the same level of control remains possible!
    -> Smooth actions also make it easier to deploy the policy on a real robot
  63. Aside: application to humanoids

    Humanoids have many joints and thus a high-dimensional action space, so smoothness becomes even more important...
    -> Methods such as L2C2 are used as auxiliary components within the framework!
    - HoST: https://youtu.be/Yruh-3CFwE4
    - PhysHSI: https://youtu.be/dTj6FjoQ5u0
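  64. Model-"free" reinforcement learning

    The policy π is optimized to maximize the return while p_e and r of the unknown environment remain unknown
    -> Obtaining diverse trajectories is costly and this is not supervised learning either, so the efficiency is extremely poor...
    [Figure: agent π(a|s) sends action a to the unknown environment p_e(s'|s, a), r(s, a, s'), which returns state s and reward r]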
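  65. Model-"based" reinforcement learning

    A (world) model that simulates the environment inside the agent's head is learned explicitly
    -> Efficiency improves dramatically by supplementing the missing experience in the agent's head!
    [Figure: the agent-environment loop fills a data buffer of (a, s, r, s'); the buffer is used to learn the world model, and the world model is used by the agent]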
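  66. Designing the world model (when observation ≈ state)

    Either approximate the state-transition probability p_e and the reward function r separately as they are,
    OR drop the next state s' from the arguments of the reward function and approximate both together as one probability distribution
    (the world model ≃ the true environment)
    a) Separate learning: (state, action) (s, a) -> state-transition probability; (state, action, next state) (s, a, s') -> reward r
    b) Joint learning: (state, action) (s, a) -> state-transition probability and reward probability
    In practice, the networks output distribution parameters (e.g. the mean and variance of a normal distribution)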
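  67. Training the world model (when observation ≈ state)

    The true input-output data are available, so plain supervised learning suffices
    † Long-horizon prediction, which feeds the predicted next state back as input, can require extra care (discussed later)
    [Figure: (s_t, a_t, s'_t, r_t) is sampled from the data buffer; the world model takes (s_t, a_t) and predicts (s'_t, r_t); the loss (e.g. the negative log-likelihood − ln p(s'_t, r_t | s_t, a_t)) is backpropagated]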
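  68. When observation ≠ state: acquiring a time-series representation

    When the current observation alone lacks information, the information must be complemented with the observation history
    -> Learn a time-series representation using a Recurrent Neural Network (RNN) or similar!
    [Figure: an RNN unit takes the input x_{t+1} and the past state h_t and outputs the internal state h_{t+1}; unrolled over time from the initial state h_0 = 0 with inputs x_1, x_2, ..., x_{t+1}]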
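  69. When observation ≠ state: acquiring a low-dimensional representation

    When the important information is buried in the current observation, it must be extracted by dimensionality reduction
    -> Learn a low-dimensional representation using a Variational Autoencoder (VAE) or similar!
    [Figure: input x (e.g. an RGB image) -> latent variable z (with |z| ≪ |x|) -> reconstruction f(x)]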
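  70. A world model combined with representation learning: PlaNet†

    † https://proceedings.mlr.press/v97/hafner19a.html
    The Recurrent State-Space Model (RSSM), which combines an RNN and a VAE, acquires the state
    + simultaneously learns the state-transition probability (and reward probability) based on that state!
    [Figure: at each time step an encoder samples the latent state from the RNN hidden state h and the observation; the dynamics (RNN) carries h forward using the previous latent state and action; a decoder reconstructs the observation and reward from (h, latent state); a prior predicts the latent state from h alone and is trained to approximate the encoder's output]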
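  71. Training the RSSM

    A variational lower bound to be maximized for time-series data of length T is derived approximately
    + Initially, a trick called Latent Overshooting, which raises long-horizon prediction accuracy, was used alongside it
    [Equation annotations: (approximate) variational lower bound for time-series data; Latent Overshooting: the RNN predicts d = 1, 2, ..., D steps ahead]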
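  72. Successors of PlaNet: the Dreamer series

    ◆ Discrete distribution for the latent variables
      Problem: assuming a Gaussian distribution gives poor expressiveness...
      ➢ Sufficient discretization secures multimodal expressiveness and more!
    ◆ KL balancing
      Problem: regularizing toward the prior while updating the prior at the same time makes it hard to reach good performance...
      ➢ Split the loss into two terms and weight them to adjust the balance!
    ◆ Transformation of the reward function
      Problem: generalizing across reward scales that differ between tasks is difficult...
      ➢ The symlog transformation leaves values near zero as they are while stabilizing large ones with a logarithmic law!
    ◆ Use of the latest models
      Problem: RNNs and explicitly modeled probability distributions lack prediction accuracy... (and are hard to scale...)
      ➢ Actively adopt Transformers, flow matching, and the like!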
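    The symlog transformation named in the third item, and its inverse symexp, are simple enough to sketch directly. The function names follow common usage; the sample values are arbitrary.

```python
import math

def symlog(x):
    """sign(x) * ln(|x| + 1): close to the identity near zero, logarithmic for large |x|."""
    return math.copysign(math.log1p(abs(x)), x)

def symexp(y):
    """Inverse of symlog: sign(y) * (exp(|y|) - 1)."""
    return math.copysign(math.expm1(abs(y)), y)

for reward in (0.001, 1.0, -1000.0):
    squashed = symlog(reward)
    print(reward, squashed, symexp(squashed))
```

    A model can then be trained to predict `symlog(reward)`, and its prediction mapped back with `symexp`, so that one set of hyperparameters copes with very different reward scales.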
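  73. Ways to use a world model

    ◆ Return estimation
      ➢ Predict the rewards up to n steps ahead and learn the value function from the n-step return
      ➢ Pro: efficiently combines the best of the TD and Monte Carlo methods
      ➢ Con: prediction errors accumulate, so long-horizon prediction is difficult
    ◆ Generation of virtual experience
      ➢ Generate pseudo-data through the world model and use it for learning
      ➢ Pro: for interpolation, accurately generated data promotes and stabilizes learning
      ➢ Con: for extrapolation, there is a danger of generating wrong data that hinders learning
    ◆ Planning
      ➢ Estimate the optimal action sequence from simulation results up to H steps ahead
      ➢ Pro: no additional learning of a value function or policy model is needed
      ➢ Con: the computational cost is enormous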
  74. Using the world model: return estimation

    The n-step TD method (or an equivalent of the TD(λ) method) can be computed from (s_t, a_t, s'_t, r_t) alone!
    Predictions are rolled out k = 1, 2, ..., H steps ahead starting from s_{t+1} = s'_t
    - n-step TD method: Model-based Value Expansion (MVE) https://arxiv.org/abs/1803.00101
    - TD(λ) equivalent: Stochastic Ensemble Value Expansion (STEVE), which also uses ensemble learning https://proceedings.neurips.cc/paper/2018/hash/f02208a057804ee16ac72ff4d3cec53b-Abstract.html
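  75. Using the world model: generation of virtual experience

    † https://proceedings.neurips.cc/paper/2019/hash/5faf461eff3099671ad63c6f3f094f7f-Abstract.html
    Starting from actual states, multiple trajectories are generated stochastically and added to the experience-replay buffer!
    [Figure†: real data D_env comes from the agent-environment loop (action a, state s / reward r); initial states drawn from it are rolled out with the policy π and the world model to produce virtual data D_model]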
  76. Representative result: DayDreamer

    https://youtu.be/xAXvfVTgqr0
    Succeeded in learning the gait of a quadruped robot in about one hour!
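  77. Using the world model: planning

    Evaluate action sequences with the world model and find the optimal one!
    1. Sample candidate action sequences from the proposal distribution (policy)
    2. Predict up to H steps ahead (the initial state is obtained from the real environment; after that, predicted values are used)
    3. Improve the proposal distribution according to the evaluation value of each candidate
    Repeat until convergence, then apply the first action of the best sequence to the real environment
    [Figure: probability over action sequences and their evaluation values]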
  78. An example solver: Cross Entropy Method (CEM)

    e.g. https://martius-lab.github.io/iCEM/
    [Figure: cost and probability over the action u; sampling from the proposal π(u; θ_k), selecting the elite data, and moving to π(u; θ_{k+1}) so that the proposal approaches the optimal action]
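    The sample, select-elites, refit loop in the figure can be sketched for a one-dimensional action with a Gaussian proposal. Everything here is an illustrative assumption: the quadratic cost stands in for a world-model rollout, and real planners optimize whole action sequences.

```python
import random
import statistics

def cem_minimize(cost, mean=0.0, std=1.0, n_samples=64, n_elite=8, n_iters=20, seed=0):
    """Cross Entropy Method with a 1-D Gaussian proposal.

    Each iteration samples candidates, keeps the n_elite lowest-cost ones,
    and refits the proposal's mean and standard deviation to those elites.
    """
    rng = random.Random(seed)
    for _ in range(n_iters):
        samples = [rng.gauss(mean, std) for _ in range(n_samples)]
        elites = sorted(samples, key=cost)[:n_elite]
        mean = statistics.fmean(elites)
        std = statistics.pstdev(elites) + 1e-6  # keep a little exploration noise
    return mean

# Hypothetical cost whose optimal action is u = 2.
best = cem_minimize(lambda u: (u - 2.0) ** 2)
print(best)
```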
  79. Representative result: Model Predictive Path Integral (MPPI)

    https://youtu.be/f2at-cqaJMM
    Succeeded in real-time control of very high-speed autonomous driving!
  80. Making active use of simulators

    A robot's CAD data can be imported into a dynamics simulator if converted appropriately
    -> Parallel simulation on GPUs makes it possible to collect large amounts of data in a short time!
    - Isaac Gym/Sim/Lab: https://youtu.be/wcGt7EAkdVg
    - RaiSim: https://youtu.be/Xw3bsh0tWFU
    - Others: MuJoCo MJX (Google DeepMind), Genesis, and so on...
  81. The sim-to-real gap

    Differences between the simulation model and the real robot:
    ◆ Differences in appearance
      ➢ Lighting conditions
      ➢ Textures
      ➢ Shapes
      ➢ and so on...
    ◆ Differences in behavior
      ➢ Friction
      ➢ Actuator delay
      ➢ Observation noise
      ➢ and so on...
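  82. System identification as far as possible

    Pursue a simulation whose behavior is closer to that of the real-world robot!
    -> In particular, motor characteristics directly affect behavior and are easy to identify, so identifying them in advance is standard practice
    Example of a friction countermeasure: Better Actuator Models (BAM) https://youtu.be/5XPEEKDnQEM
    Typical sim-to-real gaps in motors
    ◆ Delay
      ➢ Communication, circuit response, time until the command is determined, etc.
    ◆ Friction
      ➢ Coulomb, viscous, Stribeck curve, etc.
    ◆ Others
      ➢ Cogging torque, back electromotive force (T-N curve), backlash, etc.
    † The gap may be smaller for recent QDD actuators; it cannot be ignored for conventional servo motors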
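  83. Domain randomization

    e.g. https://ieeexplore.ieee.org/document/8202133
    Data are collected from environments given various simulation parameters
    -> A policy marginalized over the simulation parameters is learned!
    [Figure: simulation parameters e_1, e_2, ..., e_N are sampled from a probability distribution over e that should cover the real environment; the agent interacts with each environment (action a, state s + reward r), and the dataset of (s_t, a_t, s'_t, r_t) is collected]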
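  84. Domain adaptation (teacher-student architecture)

    e.g. https://arxiv.org/abs/2107.04034
    A policy that takes the simulation parameters (privileged information) into account is learned first,
    and then the privileged information (or equivalent features) is estimated from the observation history in the real world and given to the policy!
    [Figure: during training, the simulator's privileged information is encoded into features that feed the policy and value function, which are improved by RL, while a second encoder learns to produce the same features from the observation history by loss minimization; at inference, the real-world observation history is encoded and fed to the policy, which outputs the action]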
  85. Representative result: robust walking control of a quadruped robot

    https://youtu.be/zXbb6KQ0xV8
    A policy applicable to the real robot was acquired by simulating 4000 robots in parallel!
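  86. Aside: residual reinforcement learning

    https://ieeexplore.ieee.org/document/8794127
    A controller for the robot is designed from the information known in advance
    + The residual of the action that stems from the missing information is compensated by reinforcement learning!
    [Figure: the existing controller and the RL policy a ∼ π both receive the state s, and their outputs are combined; from the RL policy's viewpoint, the original environment plus the existing controller forms a corrected environment that returns the reward r]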
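  87. Why reward design is hard

    ◆ Design freedom
      ➢ Anything of the form r = r(s, a, s') is allowed, so multiple reward designs are conceivable for the same goal
        Example: reaching a target g with the robot's end-effector position x...
        r = f(||x − g||_p) (a monotonically decreasing function f of the L_p norm)
        f(x) = {−x, x^{−1}, exp(−x), 1(x < δ), −x + c, ...}
    ◆ Multi-objective nature
      ➢ Achieving one task involves multiple sub-tasks
        Example: straight walking of a legged robot...
        walking speed (forward displacement), not drifting sideways, not falling over, not applying excessive load, distributing load evenly...
    ◆ Cases that are hard to quantify
      ➢ Tasks that craftspeople grasp only intuitively cannot be expressed numerically as a reward
        Example: aesthetic evaluation of paintings, seasoning of food, ...
    ◆ Learning difficulty
      ➢ The gap is too large for a completely random policy to learn a highly difficult motion
        Example: moving through gaps in rubble, playing against top professionals, ...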
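  88. Tips: offsets on the reward

    † https://dl.acm.org/doi/abs/10.5555/645528.657613
    An offset is just a translation of the reward, so the theoretical optimal policy should remain the same†, but...
    -> Considering how termination is handled, a positive reward may be the safer choice...?
    In the commonly used TD error (update direction), rewards after termination are assumed to be 0:
    - If the designed reward is always positive...
      ➢ Trajectories that reach termination are underestimated -> promotes exploration
    - If the designed reward is always negative...
      ➢ Trajectories that reach termination are overestimated -> promotes exploitation
    Example countermeasure: https://www.sciencedirect.com/science/article/pii/S2666720725000165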
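  89. Dense / sparse rewards

    ◆ Dense reward
      ➢ Increases monotonically toward the maximum
      ➢ Pro: the direction in which to update the policy is easy to see
      ➢ Con: the task tends to be left half-finished (especially with multiple objectives)
      ➢ Suited to pursuing added value
      ➢ Example: distance traveled per unit energy
    ◆ Sparse reward
      ➢ Non-zero only when the requirement is met, 0 otherwise
      ➢ Pro: the task is achieved in a way that is sure to earn the reward
      ➢ Con: exploration must go on endlessly until the requirement is met
      ➢ Suited to realizing mandatory conditions
      ➢ Example: grasping a target object
    [Figures: before/after illustrations for each example]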
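  90. Intrinsic motivation: Hindsight Experience Replay (HER)

    https://proceedings.neurips.cc/paper/2017/hash/453fadbd8a1a3af50a9df4df899537b5-Abstract.html
    We want to design a task-independent bonus that makes the agent actively visit and explore unlearned regions
    -> Set a point that the trajectory reached later as a pseudo-goal!
    [Figure: data collection in the state space from start S to end point E, which misses the goal G; selection of a pseudo-goal on the trajectory]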
  91. Intrinsic motivation: Hindsight Experience Replay (HER) (continued)

    https://proceedings.neurips.cc/paper/2017/hash/453fadbd8a1a3af50a9df4df899537b5-Abstract.html
    We want to design a task-independent bonus that makes the agent actively visit and explore unlearned regions
    -> Set a point that the trajectory reached later as a pseudo-goal!
    [Figure: the collected trajectory in the state space from S to E, relabeled with the pseudo-goal instead of the original goal G]
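    The relabeling step can be sketched as follows, using a "future" strategy in which each transition receives as pseudo-goal a state reached at the same or a later step of the same trajectory. The tuple layout, the function names, and the 0 / −1 sparse reward are assumptions for illustration.

```python
import random

def her_relabel(trajectory, reward_fn, rng=None):
    """Relabel transitions with goals that the trajectory actually reached.

    trajectory: list of (state, action, next_state) tuples.
    Returns a list of (state, action, next_state, goal, reward) tuples.
    """
    rng = rng or random.Random(0)
    relabeled = []
    for t, (s, a, s_next) in enumerate(trajectory):
        future = rng.randrange(t, len(trajectory))  # this step or a later one
        goal = trajectory[future][2]                # a state reached in hindsight
        relabeled.append((s, a, s_next, goal, reward_fn(s_next, goal)))
    return relabeled

def sparse_reward(state, goal):
    return 0.0 if state == goal else -1.0

trajectory = [(0, +1, 1), (1, +1, 2), (2, +1, 3)]  # toy 1-D walk that missed its real goal
for transition in her_relabel(trajectory, sparse_reward):
    print(transition)
```

    Even a trajectory that never reached the original goal now contains transitions with a success reward, which gives a sparse-reward learner something to learn from.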
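  92. Intrinsic motivation: Disagreement

    e.g. https://proceedings.mlr.press/v97/pathak19a.html
    We want to design a task-independent bonus that makes the agent actively visit and explore unlearned regions
    -> For example, focus on the prediction variance in ensemble learning!
    [Figure: predictions f_i(s, a), i = 1, ..., K, over (state, action) (s, a); they agree in the learned region and spread out in the unlearned region (epistemic uncertainty), as opposed to the noise inherent in the data (aleatoric uncertainty)]
    The gradient of the variance with respect to the policy parameters can also be computed to maximize it directly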
  93. Robot application of intrinsic motivation

    https://youtu.be/Qob2k_ldLuw
    Loco-manipulation succeeded by promoting exploration with intrinsic motivation under a sparse reward!
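  94. Multi-objective nature

    Simply adding up the rewards corresponding to each requirement leads to...
    -> achieving only the single dominant one, or converging to a half-hearted local solution
    A commonly used multi-objective reward function: a weighted sum with weights w_i > 0
    Example: grasping a target object -> reaching + grasping; the agent may settle for reaching alone...
    [Figure: rewards 1 and 2 over the state; one case maximizes only reward 1, the other achieves both only halfway]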
  95. Hierarchical reinforcement learning

    e.g. https://www.sciencedirect.com/science/article/abs/pii/S0952197620302207
    Each requirement is achieved by its own policy + the policies are structured appropriately
    † A poor structure can degrade learning performance...
    Example: hierarchical structure of a walking task
    - Commanded task: Walk (acquired behavior)
    - Sub task, perception level: direction control, stability control
    - Sub task, template level: footstep control, support leg control, upper body control
    - Sub task, actuator level: joint 1 control, ..., joint N control
    (All of this is a black box when using end-to-end learning)
    The levels are connected so that the upper-level policy generates target commands for the lower level
    (such an upper-level policy is also called an option)
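  96. Multi-objective reinforcement learning

    The reward is extended from a scalar to a vector, and multiple value functions and/or policies are learned and combined!
    -> The desired preferred solution must be obtained through appropriate weighting
    a) Weighted sum: only convex parts of the Pareto front can be selected
    b) Chebyshev method: non-convex parts can also be selected
    [Figure: feasible set in the plane of the two objectives' values, its Pareto front, the weight vector w, contour lines of the scalarized objective, the ideal point, and the preferred solutions (the unique intersection) for weights w_1 and w_2]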
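    The difference between the two scalarizations can be checked on a toy Pareto front. The three candidate points, the ideal point, and the function names are assumptions; point C is non-dominated but lies in a non-convex part of the front.

```python
def weighted_sum(values, weights):
    """Scalarize by sum_i w_i * v_i (to be maximized)."""
    return sum(w * v for w, v in zip(weights, values))

def chebyshev(values, weights, ideal):
    """Scalarize by max_i w_i * |ideal_i - v_i| (to be minimized)."""
    return max(w * abs(z - v) for w, v, z in zip(weights, values, ideal))

# Hypothetical Pareto-optimal value pairs for two objectives.
candidates = {"A": (1.0, 0.0), "B": (0.0, 1.0), "C": (0.4, 0.4)}
ideal = (1.0, 1.0)
weights = (0.5, 0.5)

best_sum = max(candidates, key=lambda k: weighted_sum(candidates[k], weights))
best_cheb = min(candidates, key=lambda k: chebyshev(candidates[k], weights, ideal))
print(best_sum, best_cheb)
```

    With the weighted sum, no choice of positive weights ever selects C, since an extreme point always scores at least 0.5 against C's 0.4; the Chebyshev scalarization with balanced weights selects it.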
  97. Robot application of multi-objective reinforcement learning

    https://youtu.be/gQidYj-AKaA
    The reward terms for the desired motion are kept separate, and the weights are adjusted after learning to obtain the preferred solution!
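  98. Tips: product instead of sum?

    If the objectives are not in a trade-off relationship and each reward is defined to be non-negative,
    might the product of the rewards, more than their sum, yield a policy that achieves all objectives in a balanced way?!
    Sum: r = Σ_{i=1}^{N} w_i r_i
    Product: r = Π_{i=1}^{N} r_i^{w_i}
    e.g. SkillMimic (body and ball geometry, relative relationships, velocity suppression, etc.) https://ingrid789.github.io/SkillMimic/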
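    A small numerical comparison of the two formulas above, with made-up reward values: the product collapses when any single objective is neglected, whereas the sum still pays out.

```python
import math

def reward_sum(rewards, weights):
    """r = sum_i w_i * r_i"""
    return sum(w * r for w, r in zip(weights, rewards))

def reward_product(rewards, weights):
    """r = prod_i r_i ** w_i, assuming every r_i is non-negative."""
    return math.prod(r ** w for w, r in zip(weights, rewards))

weights = [0.5, 0.5]
one_sided = [1.0, 0.0]  # objective 1 fully achieved, objective 2 ignored
balanced = [0.5, 0.5]   # both objectives half achieved

print(reward_sum(one_sided, weights), reward_sum(balanced, weights))          # tie
print(reward_product(one_sided, weights), reward_product(balanced, weights))  # balanced wins
```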
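  99. Safe reinforcement learning: constrained MDP

    The conditions for satisfying safety are expressed as costs, which are estimated and bounded separately as constraints!
    -> For example, solve the problem using the KKT conditions while optimizing the Lagrange multiplier at the same time
    [Equation annotations: maximum allowable cumulative cost; cost at each time step; optimization of the cost weight]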
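    One common way to write this problem and its Lagrangian relaxation is shown below. The symbols c_t (cost at each time step), d (maximum allowable cumulative cost), and λ (cost weight) are assumed names matching the annotations; the slide's own equations are images and may differ in detail.

```latex
\begin{align}
\max_{\pi} \;& \mathbb{E}_{\pi} \Big[ \sum_{t} \gamma^{t} r_t \Big]
  \quad \text{s.t.} \quad \mathbb{E}_{\pi} \Big[ \sum_{t} \gamma^{t} c_t \Big] \le d \\
\min_{\lambda \ge 0} \max_{\pi} \;& \mathbb{E}_{\pi} \Big[ \sum_{t} \gamma^{t} \left( r_t - \lambda c_t \right) \Big] + \lambda d
\end{align}
```

    The multiplier λ grows while the cost constraint is violated and shrinks toward zero once it is satisfied, so the cost weight is tuned automatically.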
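  100. Safe reinforcement learning: (Dynamic) Shielding

    e.g. https://ieeexplore.ieee.org/document/9392290
    Safety is guaranteed by replacing actions that cannot satisfy the cost constraint with safe ones!
    † Learning is possible even when the recovery policy and the switching criterion are unknown
    [Figure: the RL policy a ∼ π and the recovery policy (shielding) both receive the state s; a switching criterion selects which action is applied; from the RL policy's viewpoint, the original environment plus the shield forms a safe environment that returns the reward r]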
  101. Robot application of safe reinforcement learning

    https://youtu.be/B2PiFF-MhJI
    Safe human-robot interaction succeeded by using shielding!
  102. Hard to quantify?

    In the worst case, a human can judge the success or failure of the task and give a sparse reward,
    but it is difficult for a system to densely quantify something whose human evaluation criteria are ambiguous...
    Example: walking "in the style of XX"??
    https://github.com/BandaiNamcoResearchInc/Bandai-Namco-Research-Motiondataset
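  103. Imitation learning: initializing the policy by behavioral cloning

    By reproducing the expert's policy from data, exploration starts near the optimal policy!
    † The value function is still untrained, so measures (e.g. regularization) are needed so as not to destroy the imitated policy
    [Figure: an expert with policy π_e interacts with the unknown environment p_e(s'|s, a), r(s, a, s') and produces a dataset of (s, a); the dataset initializes the agent π(a|s), which then continues learning in the environment]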
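  104. Imitation learning: inverse reinforcement learning

    † https://proceedings.neurips.cc/paper_files/paper/2016/hash/cc7e2b878868cbae992d1fb743995d8f-Abstract.html
    The reward function that the expert was trying to maximize is estimated from data!
    -> In recent years, borrowing the structure of adversarial learning has enabled highly accurate imitation
    a) Generative Adversarial Network (GAN): the discriminator receives real data selected from the true dataset and fake data produced by the generator and judges real or fake; the discriminator learns to tell real data apart, and the generator learns to fool the discriminator
    b) Generative Adversarial Imitation Learning (GAIL)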
  105. Imitation learning: inverse reinforcement learning (continued)

    † https://proceedings.neurips.cc/paper_files/paper/2016/hash/cc7e2b878868cbae992d1fb743995d8f-Abstract.html
    The reward function that the expert was trying to maximize is estimated from data!
    -> In recent years, borrowing the structure of adversarial learning has enabled highly accurate imitation†
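  106. Tips: the mathematical relationship between GAN and GAIL

    The generator and the discriminator are introduced by expanding the Jensen-Shannon divergence!
    The expansion coincides with the log-likelihood of a Bernoulli distribution that discriminates which of the two distributions generated x
    [Equation annotations: the discriminator is obtained by maximum-likelihood estimation; terms that do not depend on π are excluded from the reward]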
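  107. Guided reinforcement learning

    e.g. https://www.science.org/doi/10.1126/scirobotics.ads6790
    A reward on the similarity to demonstration trajectories is designed and added,
    while the task reward is maximized efficiently!
    r = λ r_imitation + (1 − λ) r_task
    - r_imitation: does the agent's behavior resemble the expert's?
    - r_task: is the task being accomplished?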
  108. Tips: retargeting

    † https://github.com/YanjieZe/GMR
    Human demonstrations are easy to measure, but the human body structure differs somewhat from the robot's...
    -> Convert them into posture and joint data such that key points, e.g. the hands, are aligned!
    General Motion Retargeting (GMR)† pipeline:
      Step 1: Human-Robot KeyBody Matching
      Step 2: Human-Robot Cartesian Space Alignment
      Step 3: Non-Uniform Local Scaling of the human data
      Step 4: Solving Robot IK with Rotation Constraint
      Step 5: Solving Robot IK with Rotation & Translation Constraint
      (input: human body translation & rotation; output: robot root pose & joint positions)
    [Figure: user study in which participants watched a reference motion video and chose which retargeted video was more similar to it]
  109. Robot application of guided reinforcement learning

    https://youtu.be/LQizdUn5Z1k
    The motion is efficiently corrected, in the neighborhood of the expert's trajectory, into one the robot can actually realize!
  110. Learning difficulty

    When a very difficult task is learned end-to-end,
    the skills needed to achieve the task may never be discovered...
    "I do not even know how to walk, yet I am asked to dodge rubble and run through at top speed...??"
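  111. Adjusting difficulty: curriculum learning

    † https://proceedings.neurips.cc/paper/2020/hash/68a9750337a418a86fe06c1991a1d64c-Abstract.html
    The task difficulty is raised gradually from easy to hard, refining the policy step by step!
    -> Curricula are often hand-made, but they can also be automated through an optimization problem
    Self-Paced Deep Reinforcement Learning (SPDL)†: the distribution of the context used for difficulty adjustment is shifted gradually from a difficulty at which a high return can be expected toward the desired difficulty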
  112. Adjusting difficulty: self-play

    https://youtu.be/chMwFy6kXhs
    For competitive tasks, making the opponent's policy (almost) equal to the agent's own
    naturally raises the task difficulty little by little!
  113. Aside: reward design through human collective knowledge

    Without defining the reward by hand and without preparing demonstration trajectories,
    AI is used to estimate the reward of the task to be achieved!
    - e.g. Eureka: optimizes the reward function with an LLM. https://eureka-research.github.io
      [Figure 2 of the paper (ICLR 2024): EUREKA takes unmodified environment source code and a language task description as context to zero-shot generate executable reward functions from a coding LLM. Then, it iterates between reward sampling, GPU-accelerated reward evaluation, and reward reflection to progressively improve its reward outputs.]
    - e.g. RIME: estimates the reward from the results of a teacher's comparisons and preferences. https://proceedings.mlr.press/v235/cheng24k.html
      [Figure 1 of the paper, "RIME: Robust Preference-based Reinforcement Learning with Noisy Preferences": overview of RIME. Phase 1 pre-trains the agent and reward model, warm-starting the reward model with intrinsic rewards; phase 2 trains the reward model online, using a denoising discriminator to screen the noisy preferences of a fallible teacher before reward learning.]
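  114. Summary

    ✓ In reinforcement learning, a setting that satisfies the Markov decision process is important
      ➢ If it cannot be satisfied, revise the design or complement it with time-series information
    ✓ In deep reinforcement learning, using multiple tricks for efficiency and stability is important
      ➢ Note, however, that they come with applicability conditions and drawbacks
    ✓ In recent policy gradient methods, maintaining exploration ability and keeping updates and gradients smooth are important
      ➢ There are many hyperparameters, so using the recommended values is advisable
    ✓ To reduce trial and error on real robots, learning and using (world) models is important
      ➢ The prediction accuracy of the model contributes greatly to performance in every usage
    ✓ For rewards, which tend to become complex in real applications, suitable design measures are important
      ➢ Note that merely mixing multiple reward terms has its limits
    □ The techniques introduced here are improving rapidly, so remember to survey the latest work before using them...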