
Self-Supervised Learning

Naoki Okamoto
January 24, 2023

Slides on self-supervised learning (SSL), created January 24, 2023.
Naoki Okamoto (Chubu University, Machine Perception & Robotics Research Group)


Transcript

  1. Self-Supervised Learning
    Naoki Okamoto, Hironobu Fujiyoshi, Tsubasa Hirakawa, Takayoshi Yamashita (Chubu University, Machine Perception & Robotics Research Group)
    Masanori Suganuma (Tohoku University)
    http://mprg.jp


  2. Self-supervised learning (SSL: Self-supervised Learning)
    • Trains on large amounts of unlabeled data via artificial problems (pretext tasks).
      A model trained with self-supervised learning is then used as a pre-trained model.
    • Representative approaches:
      Contrastive learning:   SimCLR [Chen+, ICML20], MoCo [He+, CVPR20]
      Negative-free:          BYOL [Grill+, NeurIPS20], SimSiam [Chen+, CVPR21]
      Masked Image Modeling:  SimMIM [Xie+, CVPR22], MAE [He+, CVPR22]

    ① Build a pre-trained model from large unlabeled data, then ② fine-tune the SSL pre-trained model on the target task.
    (Diagram: large unlabeled data → self-supervised learning → pre-trained model; the pre-trained model plus an FC layer or task head is then trained with supervised labels such as "Pelican" to obtain a classification model or an object detection model. Figure excerpt shown as the backbone illustration: ViT paper, "Figure 6: Representative examples of attention from the output token to the input space.")
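    To make the two-stage recipe concrete, here is a minimal PyTorch-style sketch; `ssl_pretrain` stands in for whichever pretext-task routine is used (SimCLR, MAE, ...) and is an assumed name, not something from the slides:

    ```python
    # Sketch of the SSL recipe: (1) pretrain an encoder on unlabeled images
    # with a pretext task, (2) attach a task head and fine-tune with labels.
    import torch
    import torch.nn as nn
    import torchvision

    encoder = torchvision.models.resnet50(weights=None)
    encoder.fc = nn.Identity()  # keep the network without its output layer

    # (1) Pretext-task pretraining (method-specific; hypothetical routine)
    # ssl_pretrain(encoder, unlabeled_loader)

    # (2) Fine-tuning on the target task with a fresh FC head
    num_classes = 10
    model = nn.Sequential(encoder, nn.Linear(2048, num_classes))
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

    def finetune_step(images, labels):
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        return loss.item()
    ```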


  3. Pretext tasks
    • Problems created by applying data augmentation, such as geometric transforms, to images.
    • Example pretext task: SimCLR (random crop + color transform).
      Random crop: creates a same-position prediction problem and a nearby-position prediction problem.
      Color transform: creates a color prediction problem, and prevents the position problems from being solved from color cues alone.
    (Diagram: same-position prediction, nearby-position prediction, color prediction; random crop, color transform.)
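    The random-crop + color-transform pair above is easy to express with torchvision; a sketch of building a positive pair (the jitter strengths here are illustrative assumptions, not SimCLR's exact settings):

    ```python
    import torchvision.transforms as T

    # Two independently augmented "views" of one image: random crops create
    # the position-prediction problems, color jitter/grayscale create the
    # color problem and suppress color shortcuts for the position problems.
    augment = T.Compose([
        T.RandomResizedCrop(224),
        T.RandomHorizontalFlip(),
        T.ColorJitter(0.8, 0.8, 0.8, 0.2),
        T.RandomGrayscale(p=0.2),
        T.ToTensor(),
    ])

    def make_positive_pair(image):
        return augment(image), augment(image)
    ```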


  4. Evaluating self-supervised models
    • Evaluate the feature representation acquired by self-supervised learning:
      kNN evaluation
      linear evaluation
    • Evaluate transferability as a pre-trained model:
      fine-tuning


  5. Evaluating self-supervised models
    • kNN evaluation
      Apply a kNN classifier using the features the model extracts together with the ground-truth labels.
      Accuracy changes little with hyperparameters, so it enables a unified comparison across methods.
    (Diagram: the model trained with self-supervised learning extracts features from the dataset used for SSL.
     Training data: features with labels such as "Airplane", "Cat". Evaluation data: features with labels such as "Cat", "Dog".
     Evaluation by kNN: classify each evaluation feature from its K nearest neighbors.)
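    A minimal sketch of this kNN evaluation with scikit-learn, assuming a feature-extracting `encoder` and PyTorch-style loaders (names are assumptions):

    ```python
    import numpy as np
    import torch
    from sklearn.neighbors import KNeighborsClassifier

    @torch.no_grad()
    def extract_features(encoder, loader, device="cuda"):
        encoder.eval()
        feats, labels = [], []
        for x, y in loader:
            feats.append(encoder(x.to(device)).cpu().numpy())
            labels.append(y.numpy())
        return np.concatenate(feats), np.concatenate(labels)

    def knn_eval(encoder, train_loader, test_loader, k=20):
        # Classify each evaluation feature from its K nearest training
        # features; K is essentially the only knob, hence the unified comparison.
        x_tr, y_tr = extract_features(encoder, train_loader)
        x_te, y_te = extract_features(encoder, test_loader)
        knn = KNeighborsClassifier(n_neighbors=k, metric="cosine")
        return knn.fit(x_tr, y_tr).score(x_te, y_te)
    ```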


  6. Evaluating self-supervised models
    • Linear evaluation
      Train an FC layer with supervision on the features the model extracts, using the ground-truth labels.
      The best supervised-training hyperparameters differ between self-supervised learning methods.
    (Diagram: the model trained with self-supervised learning extracts features from the dataset used for SSL;
     a randomly initialized FC layer is then trained with supervision on features and labels such as "Airplane", "Cat".)
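    Linear evaluation in the same setting, as a sketch: freeze the encoder and train only a randomly initialized FC layer (the schedule and optimizer here are placeholder choices; as the slide notes, the best hyperparameters vary by SSL method):

    ```python
    import torch
    import torch.nn as nn

    def linear_probe(encoder, feat_dim, num_classes, loader, epochs=10):
        for p in encoder.parameters():
            p.requires_grad = False          # encoder stays fixed
        encoder.eval()
        fc = nn.Linear(feat_dim, num_classes)  # randomly initialized FC layer
        opt = torch.optim.SGD(fc.parameters(), lr=0.1, momentum=0.9)
        ce = nn.CrossEntropyLoss()
        for _ in range(epochs):
            for x, y in loader:
                with torch.no_grad():
                    z = encoder(x)           # frozen features
                loss = ce(fc(z), y)
                opt.zero_grad()
                loss.backward()
                opt.step()
        return fc
    ```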


  7. Evaluating self-supervised models
    • fine-tuning
      Fine-tune on a dataset different from the one used during self-supervised learning (a downstream task).
    (Diagram: the model trained with self-supervised learning is combined with a task-specific model structure and trained with supervision;
     e.g. pre-trained model + FC layer with labels such as "Pelican" → classification model; pre-trained model + Head → object detection model.)
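    The difference from linear evaluation is that every parameter is updated; a minimal sketch (dimensions and head are assumptions):

    ```python
    import torch.nn as nn

    def build_downstream_model(pretrained_encoder, num_classes, feat_dim=2048):
        # Attach a task-specific structure to the SSL pre-trained model and
        # train the whole thing with supervision; for detection, the encoder
        # would instead serve as the backbone of a detector, not feed an FC head.
        return nn.Sequential(pretrained_encoder, nn.Linear(feat_dim, num_classes))
    ```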


  8. Representative self-supervised learning methods (overview map)

    Improving the pretext task (SSL for CNNs):
      Context Prediction [C. Doersch+, ICCV15]: predict the relative position between patches of a patch-divided image
      Jigsaw [M. Noroozi and P. Favaro, ECCV16]: jigsaw puzzles
      Colorization [R. Zhang+, ECCV16]: predict color information
      Context Encoders [D. Pathak+, CVPR16]: predict the pixels of a masked region
      Counting [M. Noroozi+, ICCV17]: train so the sum of per-patch outputs matches the whole-image output
      Image Rotations [S. Gidaris+, ICLR18]: predict the rotation angle
      Jigsaw++ [M. Noroozi+, CVPR18]: mix the puzzles of two images
      Spot Artifacts [S. Jenni and P. Favaro, CVPR18]
      Instance Discrimination [Z. Wu+, CVPR18]: learn with the number of images as the number of classes; reuse past outputs as negative pairs
      Embedding Learning [M. Ye+, CVPR19]: unsupervised embedding learning

    From natural language processing (predicting adjacent words, applied to images):
      Word2vec [T. Mikolov+, arXiv13], BERT [J. Devlin+, NAACL19]

    Contrastive learning:
      CPC [A. v.d. Oord+, arXiv18]: build pairs between patches and learn contrastively
      CPC v2 [O. J. Hénaff+, ICML20]: improves pair construction, model structure, etc.
      PIRL [I. Misra and L. Maaten, CVPR20]: introduces jigsaw puzzles
      SimCLR [T. Chen+, ICML20] → SimCLR v2 [T. Chen+, NeurIPS20]: introduces large-scale networks
      MoCo [K. He+, CVPR20] → MoCo v2 [X. Chen+, arXiv20]: introduces SimCLR's techniques
      PCL [J. Li+, ICLR21]: introduces prototypes
      Barlow Twins [J. Zbontar+, ICML21]: batch dimension
      InfoMin [Y. Tian+, NeurIPS20] (analysis): analyzes how positive pairs should be combined

    Negative-free:
      BYOL [J. Grill+, NeurIPS20]: learns from positive pairs only
      SimSiam [X. Chen+, CVPR21]: proposes an even simpler training scheme
      SwAV [M. Caron+, NeurIPS20]: estimates the cluster a positive pair belongs to
      Analysis: BYOL works even without batch statistics [P. Richemond+, arXiv20]
        Are batch-norm statistics acting as implicit negative pairs? → what matters is that normalization stabilizes training.

    SSL for ViTs:
      MoBY (MoCo + BYOL) [Z. Xie+, arXiv21]: a training method for ViT
      MoCo v3 [X. Chen+, ICCV21]: evaluates effectiveness on ViT
      DINO [M. Caron+, ICCV21]: contrastive learning with data augmentation and multiple views; predicts local-to-global and global-to-global
      EsViT [C. Li+, ICLR22]: also predicts local-to-local

    Masked Image Modeling (MIM), i.e. Masked Language Modeling (MLM) applied to images:
      Predict features of masked regions (masking applied to features): BEiT [H. Bao+, ICLR22], iBOT [J. Zhou+, ICLR22]
      Predict pixels of masked regions: MAE [K. He+, CVPR22], SimMIM [Z. Xie+, CVPR22]

    Multimodal (image + text):
      MCT [X. Yuan+, CVPR21]: extends contrastive training to multimodal data
      CLIP [A. Radford+, ICML21]: proposes simple contrastive learning; zero-shot transfer
      VoLTA [S. Pramanick+, arXiv22]: local feature alignment

    Analyses:
      Understanding the Behaviour of Contrastive Loss [F. Wang and H. Liu, CVPR21]: analyzes loss design and training effects
      How Well Do Self-Supervised Models Transfer? [L. Ericsson+, CVPR21]: evaluates transfer to various problem settings
      When Does Contrastive Visual Representation Learning Work? [E. Cole+, CVPR22]: analyzes the relationship with the dataset


  9. Representative self-supervised learning methods
    (The same overview map as slide 8, shown again.)


  10. Improving the pretext task
    • Colorful Image Colorization [Zhang+, ECCV16]
      Create a grayscale image from a color image.
      Predict the ab values of the Lab color space from the grayscale image.
    • Predicting Image Rotations [Gidaris+, ICLR18]
      Apply one of the rotations 0, 90, 180, or 270 degrees to an image.
      Predict which of the four rotations was applied (4-class classification).
    (Figures: Zhang et al.'s colorization network, blocks of 2 or 3 conv+ReLU layers followed by BatchNorm, with no pooling layers and all resolution changes done by spatial down/upsampling; Gidaris et al.'s pipeline, in which a shared ConvNet F(.) maximizes the probability of the correct rotation class y ∈ {0, 1, 2, 3}.)
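    The rotation pretext task takes only a few lines; a minimal PyTorch sketch (function names are assumptions, not the paper's code):

    ```python
    import torch

    def rotation_batch(images):  # images: (B, C, H, W)
        # Rotate every image by 0/90/180/270 degrees; the label is the number
        # of quarter-turns, giving a 4-class classification pretext task.
        views, labels = [], []
        for k in range(4):
            views.append(torch.rot90(images, k, dims=(2, 3)))
            labels.append(torch.full((images.size(0),), k, dtype=torch.long))
        return torch.cat(views), torch.cat(labels)

    # usage: x, y = rotation_batch(batch)
    #        loss = torch.nn.functional.cross_entropy(model(x), y)
    ```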


  11. Improving the pretext task
    • Solving Jigsaw Puzzles [Noroozi and Favaro, ECCV16]
      Create 9 tile-shaped patches and shuffle them.
      Predict the index of the shuffle order within a predefined set of permutations.
    • Context Encoders [Pathak+, CVPR16]
      Predict the masked region with an encoder-decoder model.
    (Figures: the Context-Free Network used to solve the jigsaw task; the Context Encoder, where the context image passes through an encoder, a channel-wise fully connected layer, and a decoder that produces the missing pixels. The paper notes that its supervisory signal is richer than discriminative patch tasks: roughly 15,000 real values per training example versus 1 option among 8 choices.)
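    A sketch of how jigsaw training examples can be built. The paper pre-selects a fixed permutation set by maximizing Hamming distance; sampling the set at random, as done here, is a simplification:

    ```python
    import itertools
    import random
    import numpy as np

    # Fixed dictionary of shuffle orders for the 9 tiles; the class label of
    # an example is the index of the permutation that was applied.
    rng = random.Random(0)
    PERMUTATIONS = rng.sample(list(itertools.permutations(range(9))), 100)

    def jigsaw_example(tiles):  # tiles: list of 9 patch arrays (H, W, C)
        label = rng.randrange(len(PERMUTATIONS))
        shuffled = [tiles[i] for i in PERMUTATIONS[label]]
        return np.stack(shuffled), label  # predict `label` from the tiles
    ```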










  12. Improving the pretext task
    • Learning to Count [Noroozi+, ICCV17]
      Train so that the feature of the whole image and the aggregated features of its patches agree.
    • Non-Parametric Instance Discrimination [Wu+, CVPR18]
      Train so that the feature of every image is distinct (learning with the number of images as the number of classes).
    (Figures: the counting objective, where shared-weight feature extractors φ are applied to a downsampled image D∘x and to four tiles T_j∘x, the tile-feature sum is trained to match the whole-image feature via |d − t|², with a contrastive margin term max{0, M − |c − t|²} against a different image y; the Instance Discrimination pipeline, where a CNN backbone encodes each image as a 128-d L2-normalized feature on the unit sphere, classified by a non-parametric softmax over a memory bank, with the temperature τ controlling the concentration of the distribution.)
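    The non-parametric softmax at the heart of Instance Discrimination can be sketched as below; the real method maintains a memory bank of per-image features and uses an NCE approximation to stay tractable, which this simplification omits:

    ```python
    import torch
    import torch.nn.functional as F

    def instance_discrimination_loss(z, image_indices, memory_bank, tau=0.07):
        # Each image is its own class: logits are similarities between the
        # L2-normalized feature z and every stored per-image feature on the
        # unit sphere, sharpened by the temperature tau.
        z = F.normalize(z, dim=1)              # (B, D)
        logits = z @ memory_bank.t() / tau     # (B, N) for N training images
        return F.cross_entropy(logits, image_indices)
    ```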


  13. Improving the pretext task
    • Context Prediction [Doersch+, ICCV15]
      Predict the relative position between patches cropped in a tile layout.
      (Figure: the algorithm receives two patches in one of eight possible spatial arrangements, without any context, and must classify which configuration was sampled, a visual analogue of predicting neighboring words.)
    • Contrastive Predictive Coding (CPC v2) [Hénaff+, ICML20]
      From a patch feature, predict the feature of the patch k positions ahead.
      (Figure: a patched ResNet-161 encodes a 256×256 image into a 7×7 grid of features z; a masked ConvNet produces context c, trained with the InfoNCE loss. The pre-trained network, fixed or tuned, is then evaluated by linear classification, efficient classification with 1% to 100% of labels, and transfer learning with Faster-RCNN, against a supervised ResNet-152 baseline.)


  14. Representative self-supervised learning methods
    (The same overview map as slide 8, shown again.)


  15. Contrastive learning: SimCLR [Chen+, ICML20]
    • SimCLR: a Simple Framework for Contrastive Learning of Visual Representations.
    • Trains so that the similarity between features from the same source image is large, and the similarity with features from different images is small.
      Problem setting: find the pairs of features whose source image is the same.
    (Diagram: mini-batch → data augmentation → encoder (network) → features → projector (MLP) → NT-Xent loss.
     Encoder: the network without its output layer. Projector: a 2-layer MLP.)


  16. Contrastive learning: SimCLR [Chen+, ICML20]
    • SimCLR: a Simple Framework for Contrastive Learning of Visual Representations.
    • Trains so that the similarity between features from the same source image is large, and the similarity with features from different images is small.
      Problem setting: find the pairs of features whose source image is the same.
    (Diagram: mini-batch → data augmentation → encoder (network) → features → projector (MLP) → NT-Xent loss;
     features from the same image are pulled together (positive pair), features from different images are pushed apart (negative pair).
     Encoder: the network without its output layer. Projector: a 2-layer MLP.)


  17. Contrastive learning: SimCLR [Chen+, ICML20]
    (Same content as slide 16.)


  18. Contrastive learning: SimCLR [Chen+, ICML20]
    (Same content as slide 16.)


  19. Contrastive learning: SimCLR [Chen+, ICML20]
    (Same diagram as slide 16, with the backprop path shown: the NT-Xent loss is backpropagated to update the projector and the encoder.)


  20. Contrastive learning: SimCLR [Chen+, ICML20]
    • Augmentation analysis: measures how linear-evaluation accuracy changes with the way two data augmentations are combined.
    • Accuracy changes with the combination of augmentations.
      Crop + color transform is the best combination; it became the standard setting for methods after SimCLR.
    (Figure: SimCLR's matrix of linear-evaluation accuracy over pairs of augmentations.)


  21. Contrastive learning: SimCLR [Chen+, ICML20]
    • Uses the Normalized Temperature-scaled Cross Entropy loss (NT-Xent):
      the similarity relations between samples are expressed as a probability distribution, and a cross-entropy loss is computed.

      $$\mathcal{L}_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$$

      sim(·, ·): cosine similarity; (z_i, z_j): the positive pair; the denominator sums the similarities of all pairs; τ: the temperature parameter.
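    A direct NT-Xent implementation matching the formula (a sketch; it assumes the 2N projected features are stacked so that rows i and i+N are the two augmented views of the same image):

    ```python
    import torch
    import torch.nn.functional as F

    def nt_xent(z, tau=0.5):
        # z: (2N, D) projector outputs; rows i and i+N form the positive pair.
        n = z.size(0) // 2
        z = F.normalize(z, dim=1)            # dot products = cosine similarity
        sim = z @ z.t() / tau                # temperature-scaled logits
        sim.fill_diagonal_(float("-inf"))    # the 1[k != i] mask (drop self-pairs)
        targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
        return F.cross_entropy(sim, targets) # -log p at the positive pair, averaged
    ```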


  22. Contrastive learning: SimCLR [Chen+, ICML20]
    • Uses the Normalized Temperature-scaled Cross Entropy loss (NT-Xent):
      the similarity relations between samples are expressed as a probability distribution, and a cross-entropy loss is computed.

      $$\mathcal{L}_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$$

    Reading the loss as softmax + cross entropy (example: samples 1 and 2 form the positive pair, i.e. i = 1, j = 2):
    the similarities sim(z_1, z_2), sim(z_1, z_3), ..., sim(z_1, z_{2N}) are logits that a temperature-scaled softmax turns into a probability distribution,

      $$p_{i,j} = \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$$

    which is compared with the one-hot labels y_{1,2}, y_{1,3}, ..., y_{1,2N} by the cross-entropy loss:

      $$\mathcal{L}_{i,j} = -\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \, y_{i,k} \log p_{i,k} = -\log p_{i,j}$$

  23. Contrastive learning: SimCLR [Chen+, ICML20]
    • Effect of the temperature on the softmax over the logits sim(z_1, z_2), ..., sim(z_1, z_{2N}) (example: samples 1 and 2 form the positive pair, i = 1, j = 2), compared with the ordinary softmax (τ = 1.0):
      τ < 1.0: sharpens the probability distribution.
      τ > 1.0: flattens the probability distribution.
    • SimCLR uses temperature values τ below 1.0.
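    A tiny numeric illustration of the temperature's effect on the softmax over similarity logits (the values are made up):

    ```python
    import torch

    logits = torch.tensor([0.9, 0.3, 0.2, 0.1])   # e.g. sim(z1, zk) values
    for tau in (0.1, 1.0, 5.0):
        print(tau, torch.softmax(logits / tau, dim=0))
    # tau=0.1 -> ~[0.996, 0.002, 0.001, 0.000]  sharp: one pair dominates
    # tau=1.0 -> ~[0.40, 0.22, 0.20, 0.18]      ordinary softmax
    # tau=5.0 -> ~[0.28, 0.25, 0.24, 0.24]      flat: pairs barely distinguished
    ```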


  24. Analysis of contrastive learning: What Makes for Good Views for Contrastive Learning [Tian+, NeurIPS20]
    • Analyzes, from the viewpoint of mutual information (MI), what makes a good positive pair.
    • Shows experimentally that the MI of a pair should be neither too large nor too small.

    Effect of the crop position in random cropping: the InfoNCE loss lower-bounds the MI between the two views,

      $$I(v_1; v_2) \ge \log(K) - \mathcal{L}_{\mathrm{NCE}} = I_{\mathrm{NCE}}(v_1; v_2)$$

    Views are created by cropping two patches at various offsets; as the patch distance grows, the estimated MI shrinks, and downstream accuracy first increases and then decreases, a reverse-U shape (self-supervised learning on DIV2K → linear evaluation on CIFAR-10).
    (Figure 2 of the paper: as the MI between views is changed, information about the downstream task and nuisance variables can be selectively included or excluded, biasing the learned representation. The sweet spot is I(v_1; v_2) = I(x; y): with excess MI, nuisance information is kept; with too little, task-relevant signal is missing.)

    View Slide

25. • Analyzes how the temperature parameter of the contrastive loss relates to the learned feature representations
  - Small temperature: positive and negative pairs are pushed far apart
  - Large temperature: the similarity of positive pairs approaches 1
• A moderate temperature achieves good performance (0.3 gives the best accuracy for the ordinary contrastive loss in the table below)
Analysis of contrastive learning: Understanding the Behaviour of Contrastive Loss [Wang and Liu, CVPR21]

Accuracy change with the temperature parameter (Table 1 of the paper: linear-classification accuracy, uniformity, and tolerance for the ordinary, simple, hard, and hard simple contrastive losses at representative temperatures 0.07 / 0.3 / 0.7 / 1.0; more results are in their supplementary material):

Dataset      Metric      Contrastive (0.07/0.3/0.7/1.0)   Simple   Hard contrastive (0.07/0.3/0.7/1.0)   Hard simple
CIFAR10      accuracy    79.75 / 83.27 / 82.69 / 82.21    74.83    79.2 / 83.63 / 84.19 / 84.19          84.84
             uniformity  3.86 / 3.60 / 3.17 / 2.96        1.68     3.88 / 3.89 / 3.87 / 3.86             3.85
             tolerance   0.04 / 0.178 / 0.333 / 0.372     0.61     0.034 / 0.0267 / 0.030 / 0.030        0.030
CIFAR100     accuracy    51.82 / 56.44 / 50.99 / 48.33    39.31    50.77 / 56.55 / 57.54 / 56.77         55.71
             uniformity  3.86 / 3.60 / 3.18 / 2.96        2.12     3.87 / 3.88 / 3.87 / 3.86             3.86
             tolerance   0.10 / 0.269 / 0.331 / 0.343     0.39     0.088 / 0.124 / 0.158 / 0.172         0.174
SVHN         accuracy    92.55 / 95.47 / 94.17 / 92.07    70.83    91.82 / 94.79 / 95.02 / 95.26         94.99
             uniformity  3.88 / 3.65 / 3.27 / 3.05        1.50     3.89 / 3.91 / 3.90 / 3.88             3.85
             tolerance   0.032 / 0.137 / 0.186 / 0.197    0.074    0.025 / 0.021 / 0.021 / 0.023         0.026
ImageNet100  accuracy    71.53 / 75.10 / 69.03 / 63.57    48.09    68.33 / 74.21 / 74.70 / 74.28         74.31
             uniformity  3.917 / 3.693 / 3.323 / 3.08     1.742    3.929 / 3.932 / 3.927 / 3.923         3.917
             tolerance   0.093 / 0.380 / 0.427 / 0.456    0.528    0.067 / 0.096 / 0.121 / 0.134         0.157
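For reference, a minimal PyTorch-style sketch of the InfoNCE (NT-Xent) loss with a temperature parameter tau, the quantity varied in the analysis above; the function name and batch layout are illustrative assumptions, not code from the paper:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, tau=0.3):
    """NT-Xent / InfoNCE loss for a batch of positive pairs (z1[i], z2[i]).

    z1, z2: (N, D) projector outputs for two augmented views.
    tau:    temperature; smaller values sharpen the similarity distribution.
    """
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)           # (2N, D)
    sim = z @ z.t() / tau                    # cosine similarities scaled by 1/tau
    sim.fill_diagonal_(float("-inf"))        # exclude self-similarity
    n = z1.size(0)
    # positives: row i pairs with i+n, and row i+n pairs with i
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```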

26. • Analyzes how the temperature parameter of the contrastive loss relates to the learned feature representations
  - Small temperature: positive and negative pairs are pushed far apart
  - Large temperature: the similarity of positive pairs approaches 1
• A moderate temperature achieves good performance
Analysis of contrastive learning: Understanding the Behaviour of Contrastive Loss [Wang and Liu, CVPR21]

(Figure 8 of the paper: similarity distributions of positive samples and of the top-10 nearest negative samples, plotted for several temperatures.)
Relationship between the temperature parameter and the feature similarity of positive/negative pairs

27. Representative self-supervised learning methods

Self-supervised learning for CNNs → self-supervised learning for ViTs (the slide shows a timeline; grouped here by branch):

Origins in NLP:
• Word2vec [T. Mikolov, arXiv13]: adjacent-word prediction, later applied to images
• BERT [J. Devlin, NAACL19]: Masked Language Modeling (MLM), later applied to images as Masked Image Modeling (MIM)

Improved pretext tasks:
• Context Prediction [C. Doersch, ICCV15]: predict the relative position between patches of a patch-divided image
• Jigsaw [M. Noroozi and P. Favaro, ECCV16]: solve jigsaw puzzles
• Colorization [R. Zhang, ECCV16]: predict color information
• Context Encoders [D. Pathak, CVPR16]: predict the pixels of masked regions
• Counting [M. Noroozi, ICCV17]: train so the sum of per-patch outputs matches the whole-image output
• Jigsaw++ [M. Noroozi, CVPR18]: mix puzzle pieces from two images
• Spot Artifacts [S. Jenni and P. Favaro, CVPR18]
• Image Rotations [S. Gidaris, ICLR18]: predict the rotation angle

Contrastive learning:
• Instance Discrimination [Z. Wu, CVPR18]: treat each image as its own class
• Embedding Learning [M. Ye, CVPR19]: unsupervised embedding learning
• CPC [A. van den Oord, arXiv18]: contrastive learning over pairs created between patches
• CPCv2 [O. J. Hénaff, ICML20]: improves pair creation, model architecture, etc.
• PIRL [I. Misra and L. van der Maaten, CVPR20]: introduces jigsaw puzzles
• MoCo [K. He, CVPR20]: reuses past outputs as negative pairs
• SimCLR [T. Chen, ICML20]: proposes simple contrastive learning with data augmentation
• MoCov2 [X. Chen, arXiv20]: adopts SimCLR's techniques
• SimCLRv2 [T. Chen, NeurIPS20]: introduces large-scale networks
• PCL [J. Li, ICLR21]: introduces prototypes

Negative-free:
• BYOL [J. Grill, NeurIPS20]: learns from positive pairs only
• SwAV [M. Caron, NeurIPS20]: estimates the cluster a positive pair belongs to
• SimSiam [X. Chen, CVPR21]: proposes an even simpler training scheme
• Barlow Twins [J. Zbontar, ICML21]: decorrelates features along the batch dimension
• BYOL works even without batch statistics [P. Richemond, arXiv20]: are batch-norm statistics implicit negative pairs? → stabilization of training through normalization is what matters

Self-supervised learning for ViTs:
• MoBY [Z. Xie, arXiv21]: MoCo + BYOL
• MoCov3 [X. Chen, ICCV21]: evaluates effectiveness on ViT
• DINO [M. Caron, ICCV21]: training method for ViT using data augmentation and multiple crops
• EsViT [C. Li, ICLR22]: also predicts local from local

Masked Image Modeling (MIM):
• BEiT [H. Bao, ICLR22], iBOT [J. Zhou, ICLR22]: predict the features of masked regions
• MAE [K. He, CVPR22], SimMIM [Z. Xie, CVPR22]: predict the pixels of masked regions

Multimodal (image + text):
• MCT [X. Yuan, CVPR21]: extends contrastive training to multimodal data
• CLIP [A. Radford, ICML21]: zero-shot transfer
• VoLTA [S. Pramanick, arXiv22]: local feature alignment

Analysis:
• InfoMin [Y. Tian, NeurIPS20]: analyzes combinations of positive pairs
• Understanding the Behaviour of Contrastive Loss [F. Wang and H. Liu, CVPR21]: analyzes loss design and its effect on training
• How Well Do Self-Supervised Models Transfer? [L. Ericsson, CVPR21]: evaluates transferability across many problem settings
• When Does Contrastive Visual Representation Learning Work? [E. Cole, CVPR22]: analyzes the relationship with dataset properties

28. • BYOL: Bootstrap Your Own Latent
• Uses two networks: an online network and a target network
• Trained to increase the similarity between features extracted from the same source image (only positive pairs are used)
  - Task: predict, from the online network's features, the target network's features of a different view
Negative-free: BYOL [Grill+, NeurIPS20]

Architecture (diagram): a minibatch is augmented into two views; the online network (encoder → projector → predictor) predicts the target network's projection; the loss is the MSE between the two features, with stop-grad on the target side.
  - Encoder: the backbone network without its output layer
  - Projector: an MLP (2 layers)
  - Predictor: an MLP (2 layers)

29. • The target network's parameters are updated as an exponential moving average (EMA) of the online network's parameters
  - λ follows a cosine scheduler with 0.996 ≤ λ ≤ 1
Negative-free: BYOL [Grill+, NeurIPS20]

    θ_t ← λ θ_t + (1 − λ) θ_o

  - θ_t: target parameters
  - θ_o: online parameters
  - λ: weight on the previous target parameters
Architecture (diagram): as on the previous slide; the online branch is trained by backprop, while the target branch receives stop-grad and is updated by the EMA.
  - Encoder: the backbone network without its output layer
  - Projector: an MLP (2 layers)
  - Predictor: an MLP (2 layers)
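A minimal sketch of the EMA update and the cosine schedule for λ, assuming two PyTorch modules `online` and `target` with identical architectures (names are illustrative; the schedule formula follows the BYOL paper):

```python
import math
import torch

@torch.no_grad()
def ema_update(online: torch.nn.Module, target: torch.nn.Module, lam: float):
    """theta_t <- lam * theta_t + (1 - lam) * theta_o, applied parameter-wise."""
    for p_o, p_t in zip(online.parameters(), target.parameters()):
        p_t.mul_(lam).add_(p_o, alpha=1.0 - lam)

def cosine_lambda(step: int, total_steps: int, base_lam: float = 0.996) -> float:
    """Anneal lambda from base_lam toward 1 with a cosine schedule."""
    return 1.0 - (1.0 - base_lam) * (math.cos(math.pi * step / total_steps) + 1.0) / 2.0
```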

30. Negative-free: BYOL [Grill+, NeurIPS20]

• Accuracy comparison on ImageNet-1K (linear evaluation of classification performance)
(Figure: ImageNet top-1 accuracy, 68-80%, vs. number of parameters, 25M-400M, for Sup., InfoMin, SimCLR, MoCo(v2), CPCv2-L, CMC, AMDIM, and BYOL at widths 1×/2×/4×.)
→ With large parameter counts, BYOL matches the accuracy of supervised training

• Accuracy comparison under finetuning
  - Self-supervised pre-training: ImageNet-1K
  - Object detection: VOC (AP50)
  - Semantic segmentation: VOC (mIoU)

Transfer results in semantic segmentation and object detection (Table 4a of the paper):
Method             AP50   mIoU
Supervised-IN [9]  74.4   74.4
MoCo [9]           74.9   72.5
SimCLR (repro)     75.2   75.2
BYOL (ours)        77.5   76.3

→ BYOL surpasses the ImageNet supervised pre-trained model
(The paper also evaluates depth estimation on NYU with relative error, RMSE, and threshold accuracy below 1.25^n; BYOL is better than or on par with the other methods.)

31. • SimSiam: simple Siamese networks
• Proposes a simpler alternative to BYOL
  - None of the extra machinery of prior methods, such as exponential moving averages or clustering, is needed
  - Trainable with a small batch size and few training epochs
Negative-free: SimSiam [Chen+, CVPR21]

Architecture (diagram): two augmented views of a minibatch go through a shared encoder and projector; a predictor on one branch predicts the other branch's feature; the loss is the negative cosine similarity, with stop-grad on the branch without the predictor and backprop through the other.
  - Encoder: the backbone network without its output layer
  - Projector: an MLP
  - Predictor: an MLP (bottleneck structure)
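A minimal sketch of SimSiam's symmetric negative-cosine loss with stop-grad, assuming `encoder` already includes the projector (module names are illustrative assumptions):

```python
import torch.nn.functional as F

def simsiam_loss(encoder, predictor, x1, x2):
    """Symmetric SimSiam loss for two augmented views x1, x2 of the same images."""
    z1, z2 = encoder(x1), encoder(x2)        # projector outputs
    p1, p2 = predictor(z1), predictor(z2)    # predictor outputs

    def neg_cos(p, z):
        # stop-grad on z: the target branch receives no gradient
        return -F.cosine_similarity(p, z.detach(), dim=1).mean()

    return 0.5 * neg_cos(p1, z2) + 0.5 * neg_cos(p2, z1)
```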

32. • SimSiam can be viewed as a common framework underlying contrastive and negative-free methods
• Adding or removing existing components turns SimSiam into the other methods:
  - SimCLR: add negative pairs; remove the predictor and stop-grad
  - BYOL: add an exponential-moving-average (momentum) encoder
  - SwAV: add Sinkhorn-Knopp clustering; remove the predictor
Negative-free: SimSiam [Chen+, CVPR21]

(Figure from the paper: block diagrams of SimCLR, BYOL, SwAV, and SimSiam, showing where similarity/dissimilarity losses, predictors, momentum encoders, Sinkhorn-Knopp, moving averages, and gradient flow differ.)

33. • Accuracy comparison on ImageNet-1K
  - Self-supervised pre-training on ImageNet-1K → linear evaluation on ImageNet-1K
  - Network: ResNet-50
Negative-free: SimSiam [Chen+, CVPR21]

ImageNet linear classification (Table 4 of the paper; all methods use ResNet-50 pre-trained with two 224×224 views, evaluated on a single crop; all competitors are the authors' reproductions, "+" denotes an improved reproduction):
method             batch  negative pairs  momentum encoder  100 ep  200 ep  400 ep  800 ep
SimCLR (repro.+)   4096   yes             no                66.5    68.3    69.8    70.4
MoCo v2 (repro.+)  256    yes             yes               67.4    69.9    71.0    72.2
BYOL (repro.)      4096   no              yes               66.5    70.6    73.2    74.3
SwAV (repro.+)     4096   no              no                66.5    69.1    70.7    71.8
SimSiam            256    no              no                68.1    70.0    70.8    71.3

→ SimSiam achieves strong performance with a small batch size and few training epochs

34. • Accuracy comparison on downstream tasks
  - Self-supervised pre-training on ImageNet-1K → finetuning on the downstream task
  - Methods are compared after 200-epoch self-supervised pre-training
Negative-free: SimSiam [Chen+, CVPR21]

Transfer learning (Table 5 of the paper; VOC 07 detection: Faster R-CNN fine-tuned on VOC 2007 trainval; VOC 07+12 detection: Faster R-CNN fine-tuned on VOC 2007 trainval + 2012 train, both evaluated on VOC 2007 test; COCO detection and instance segmentation: Mask R-CNN (1× schedule) fine-tuned on COCO 2017 train, evaluated on COCO 2017 val; all models use the C4 backbone; VOC results are averages over 5 trials; values are AP50/AP/AP75):
pre-train            VOC07 det        VOC07+12 det     COCO det         COCO inst. seg
scratch              35.9/16.8/13.0   60.2/33.8/33.1   44.0/26.4/27.8   46.9/29.3/30.8
ImageNet supervised  74.4/42.4/42.7   81.3/53.5/58.8   58.2/38.2/41.2   54.7/33.3/35.2
SimCLR (repro.+)     75.9/46.8/50.1   81.8/55.5/61.4   57.7/37.9/40.9   54.6/33.3/35.3
MoCo v2 (repro.+)    77.1/48.5/52.5   82.3/57.0/63.3   58.8/39.2/42.5   55.5/34.3/36.6
BYOL (repro.)        77.1/47.0/49.9   81.4/55.3/61.1   57.8/37.9/40.9   54.3/33.2/35.0
SwAV (repro.+)       75.5/46.5/49.6   81.5/55.4/61.4   57.6/37.6/40.3   54.2/33.1/35.1
SimSiam, base        75.5/47.0/50.2   82.0/56.4/62.8   57.5/37.9/40.9   54.2/33.2/35.2
SimSiam, optimal     77.3/48.5/52.5   82.4/57.0/63.7   59.3/39.2/42.1   56.0/34.4/36.7

→ With a simple training scheme, SimSiam matches the performance of prior methods

35. Representative self-supervised learning methods (the roadmap of slide 27, repeated)

36. • Analyzes transferability on five types of tasks and the correlation between ImageNet accuracy and downstream performance
  - Many-shot recognition: multiple datasets
  - Few-shot recognition: multiple datasets
  - Object detection
  - Semantic segmentation
  - Surface normal estimation
Transferability of self-supervised learning: How Well Do Self-Supervised Models Transfer? [Ericsson+, CVPR21]

→ Which downstream tasks a model excels at differs across self-supervised learning methods

37. Representative self-supervised learning methods (the roadmap of slide 27, repeated)

38. • DINO: self-distillation with no labels
• The student network's output distribution is trained to match the teacher network's output distribution
• Probability distributions are obtained by applying a temperature-scaled softmax to the features
A training method for ViT: DINO [Caron+, ICCV21]

Architecture (diagram): data augmentation produces local crops (hard problems) fed to the student and wide crops (easy problems) fed to the teacher; both networks are a ViT encoder plus an MLP projector; the teacher output is centered and sharpened before the softmax; the loss compares the two probability distributions, the teacher receives stop-grad, and its weights are the exponential moving average of the student's.
  - Encoder: a CNN without its output layer, or a ViT without its MLP head
  - Projector: an MLP (bottleneck structure)

39. • sharpening: adjusts the distribution so that a single feature is emphasized
• centering: adjusts so that no single feature is emphasized for every image
• centering value c is updated from the batch statistics
A training method for ViT: DINO [Caron+, ICCV21]

Student network:
    P_s(x)^(i) = exp(g_{θ_s}(x)^(i) / τ_s) / Σ_{k=1..K} exp(g_{θ_s}(x)^(k) / τ_s)
Teacher network:
    P_t(x)^(i) = exp((g_{θ_t}(x)^(i) − c) / τ_t) / Σ_{k=1..K} exp((g_{θ_t}(x)^(k) − c) / τ_t)
Centering update:
    c ← m c + (1 − m) (1/B) Σ_{i=1..B} g_{θ_t}(x_i)

  - τ_s, τ_t: student/teacher temperature parameters
  - c: centering value
  - m: hyperparameter (momentum)
  - B: batch size
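A minimal sketch of the DINO loss with teacher-side centering and sharpening, plus the centering update; tensor shapes and default temperatures are illustrative assumptions based on the formulas above:

```python
import torch
import torch.nn.functional as F

def dino_loss(student_logits, teacher_logits, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between the centered+sharpened teacher and the student.

    student_logits, teacher_logits: (B, K) projector outputs g(x).
    center: (K,) running center c for the teacher outputs.
    """
    p_t = F.softmax((teacher_logits - center) / tau_t, dim=1)  # centering + sharpening
    log_p_s = F.log_softmax(student_logits / tau_s, dim=1)
    return -(p_t * log_p_s).sum(dim=1).mean()

@torch.no_grad()
def update_center(center, teacher_logits, m=0.9):
    """c <- m*c + (1-m) * batch mean of g_t(x)."""
    return m * center + (1 - m) * teacher_logits.mean(dim=0)
```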

40. • The teacher's parameters are updated as an exponential moving average of the student's parameters
  - λ follows a cosine scheduler with 0.996 ≤ λ ≤ 1
A training method for ViT: DINO [Caron+, ICCV21]

    θ_t ← λ θ_t + (1 − λ) θ_s

  - θ_t: teacher parameters
  - θ_s: student parameters
  - λ: weight on the previous teacher parameters
Architecture (diagram): as on slide 38; the student branch is trained by backprop, while the teacher branch receives stop-grad, applies centering and sharpening before the softmax, and is updated by the EMA.

41. • Accuracy comparison on ImageNet-1K
  - Self-supervised pre-training on ImageNet-1K → linear evaluation and k-NN evaluation on ImageNet-1K
A training method for ViT: DINO [Caron+, ICCV21]

(Paper implementation notes: models are pre-trained on ImageNet without labels using AdamW with batch size 1024 over 16 GPUs for ViT-S/16; the learning rate is warmed up linearly for 10 epochs to lr = 0.0005 × batchsize/256, then decayed with a cosine schedule; weight decay follows a cosine schedule from 0.04 to 0.4; τ_s = 0.1 while τ_t is warmed up linearly from 0.04 to 0.07 over the first 30 epochs; augmentations follow BYOL (color jittering, Gaussian blur, solarization) plus multi-crop.)

Linear and k-NN classification on ImageNet (Table 2 of the paper; top-1 accuracy on the validation set; * = run by the authors; throughput measured on an NVIDIA V100 with 128 samples per forward; parameters are of the feature extractor):
Method      Arch.       Param.(M)  im/s  Linear  k-NN
Supervised  RN50        23         1237  79.3    79.3
SCLR        RN50        23         1237  69.1    60.7
MoCov2      RN50        23         1237  71.1    61.9
InfoMin     RN50        23         1237  73.0    65.3
BarlowT     RN50        23         1237  73.2    66.0
OBoW        RN50        23         1237  73.8    61.9
BYOL        RN50        23         1237  74.4    64.8
DCv2        RN50        23         1237  75.2    67.1
SwAV        RN50        23         1237  75.3    65.7
DINO        RN50        23         1237  75.3    67.5
Supervised  ViT-S       21         1007  79.8    79.8
BYOL*       ViT-S       21         1007  71.4    66.6
MoCov2*     ViT-S       21         1007  72.7    64.4
SwAV*       ViT-S       21         1007  73.5    66.3
DINO        ViT-S       21         1007  77.0    74.5
Comparison across architectures:
SCLR        RN50w4      375        117   76.8    69.3
SwAV        RN50w2      93         384   77.3    67.3
BYOL        RN50w2      93         384   77.4    –
DINO        ViT-B/16    85         312   78.2    76.1
SwAV        RN50w5      586        76    78.5    67.1
BYOL        RN50w4      375        117   78.6    –
BYOL        RN200w2     250        123   79.6    73.9
DINO        ViT-S/8     21         180   79.7    78.3
SCLRv2      RN152w3+SK  794        46    79.8    73.1
DINO        ViT-B/8     85         63    80.1    77.4

→ ResNet: DINO performs on par with prior methods
→ Vision Transformer (ViT): DINO surpasses prior methods

42. • Accuracy comparison on downstream tasks
  - Sup.: supervised pre-training on ImageNet-1K → finetuning on the downstream task
  - DINO: self-supervised pre-training on ImageNet-1K → finetuning on the downstream task
A training method for ViT: DINO [Caron+, ICCV21]

Transfer learning by finetuning pretrained models on different datasets (Table 6 of the paper; top-1 accuracy):
Backbone        Cifar10  Cifar100  INat18  INat19  Flwrs  Cars  INet
ViT-S/16 Sup.   99.0     89.5      70.7    76.6    98.2   92.1  79.9
ViT-S/16 DINO   99.0     90.5      72.0    78.2    98.5   93.0  81.5
ViT-B/16 Sup.   99.0     90.8      73.2    77.7    98.4   92.1  81.8
ViT-B/16 DINO   99.1     91.7      72.6    78.6    98.8   93.0  82.8

→ DINO pre-training surpasses the supervised pre-trained model

(The paper also reports: Figure 4, segmentation masks obtained by thresholding self-attention maps to keep 60% of the mass, where DINO ViT-S/8 reaches a much higher Jaccard similarity with ground truth on PASCAL VOC12 than its supervised counterpart; Table 7, ablations showing that without momentum the framework does not work and Sinkhorn-Knopp is needed to avoid collapse, whereas with momentum SK has little impact, and that multi-crop and the cross-entropy loss matter; Figure 5, smaller patches greatly improve performance without adding parameters but at the expense of throughput, e.g. 44 im/s for 5×5 patches vs. 180 im/s for 8×8.)

43. • Visualizes the attention weights of a DINO-trained ViT for the [CLS] token
  - The head of the final multi-head self-attention layer that attends most to the foreground is visualized
• Visualization after thresholding the attention weights
A training method for ViT: DINO [Caron+, ICCV21]

(Figure 1 of "Emerging Properties in Self-Supervised Vision Transformers" [Caron+, ICCV21]: self-attention of the [CLS] token on the heads of the last layer of an 8×8-patch ViT trained with no supervision; this token is attached to no label or supervision, yet the maps show the model automatically learns class-specific features leading to unsupervised object segmentations.)

→ Accurate object regions emerge without any label information
→ Compared to supervised training, DINO's attention concentrates on object regions

44. Attention maps of a supervised model vs. attention maps of DINO (image-only comparison slide)

45. Representative self-supervised learning methods (the roadmap of slide 27, repeated)

46. • CLIP: Contrastive Language-Image Pre-training
• Proposes multimodal self-supervised learning with images and text
  - Contrastive learning that treats the image-text pairs provided in the dataset as positive pairs
• Applicable to zero-shot image classification
  - The similarity between the image feature and the feature of the prompt template "A photo of a {class name}." is used as the class score
Multimodal: CLIP [Radford+, ICML21]

(Figure 1 of the paper: (1) contrastive pre-training aligns image embeddings I_i and text embeddings T_j so that the matched pairs I_i·T_i on the diagonal score highest; (2) a dataset classifier is created from label text via the prompt template; (3) zero-shot prediction picks the class whose text embedding is most similar to the image embedding.)
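A minimal sketch of CLIP-style zero-shot classification, assuming generic `image_encoder`/`text_encoder` callables that map images and prompt strings to embeddings (all names are illustrative; this is not the released CLIP API):

```python
import torch
import torch.nn.functional as F

def zero_shot_scores(image_encoder, text_encoder, images, class_names):
    """Class scores = cosine similarity between image and prompt embeddings."""
    prompts = [f"A photo of a {name}." for name in class_names]
    img = F.normalize(image_encoder(images), dim=1)   # (B, D)
    txt = F.normalize(text_encoder(prompts), dim=1)   # (C, D)
    return img @ txt.t()                              # (B, C) similarity scores

# predicted class index per image: zero_shot_scores(...).argmax(dim=1)
```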

47. • Self-supervised training on the WebImageText (WIT) dataset built by the paper's authors
  - Composed of 400 million image-text pairs collected from the internet
• Compares zero-shot transfer accuracy against a supervised model
  - Supervised baseline: a linear classifier on ResNet-50 features
Multimodal: CLIP [Radford+, ICML21]

→ Accuracy improves on 16 of the 27 evaluated datasets
→ Accuracy drops substantially on complex tasks such as satellite-image or road-traffic-sign classification

48. • Compares downstream transferability (linear evaluation) against supervised learning and prior self-supervised methods
Multimodal: CLIP [Radford+, ICML21]

→ CLIP pre-training performs strongly regardless of the backbone, ResNet or Vision Transformer (ViT)

49. • VoLTA: Vision-Language Transformer with weakly-supervised local feature Alignment
• Proposes multimodal self-supervised learning with images and text (captions)
  - Learns fine-grained, relational information inside the image from captions alone, without bounding boxes
  - Introduces gated cross-attention, reducing the parameter count
• Trained with multiple self-supervised tasks (Barlow Twins losses, graph optimal transport, MLM, and ITM; see the following slides)
Multimodal: VoLTA [Pramanick+, arXiv22]

50. • Introduces gated cross-attention into each modality's self-attention blocks
  - A learnable gating scalar α is introduced
  - Setting α = 0 turns the cross-attention mechanism off
  - The gate is switched on or off depending on the self-supervised task
• No extra cross-modal fusion layers are needed, so the parameter count can be reduced
Multimodal: VoLTA [Pramanick+, arXiv22]

Processing in the image model (the text model is symmetric):
    x̂ = SelfAtt(x)
    x = x + x̂ + α · CrossAtt(x̂, y)
    x = x + FFN(x)

  - x: per-patch image features
  - y: per-token caption features
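A minimal sketch of a gated cross-attention block following the three equations above (layer norms and dropout omitted for brevity; class and argument names are illustrative assumptions, not the VoLTA implementation):

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Transformer block with a gated cross-attention branch (alpha = 0 disables it)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.alpha = nn.Parameter(torch.zeros(1))  # learnable gating scalar

    def forward(self, x, y, use_cross: bool = True):
        h, _ = self.self_attn(x, x, x)       # x_hat = SelfAtt(x)
        x = x + h
        if use_cross:                        # task-dependent on/off switch
            c, _ = self.cross_attn(h, y, y)  # CrossAtt(x_hat, y)
            x = x + self.alpha * c
        return x + self.ffn(x)
```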

51. • Each encoder is trained with a three-step procedure:
  - Step 1: turn cross-attention off, feed the image and caption, and compute the losses L_BT^{IT'}, L_BT^{II'}, L_BT^{I'T}, L_BT^{TT'}, and L_GOT
  - Step 2: turn cross-attention on, feed the image and caption, and compute the losses L_MLM and L_ITM
  - Step 3: sum all losses and backprop
Multimodal: VoLTA [Pramanick+, arXiv22]

52. Intra- and inter-modal contrastive learning (Barlow Twins)
  - Barlow Twins-style training makes each dimension of the feature vector encode an independent feature
Multimodal: VoLTA [Pramanick+, arXiv22]

    L_BT^{AB} = Σ_i (1 − C_ii)² + λ Σ_i Σ_{j≠i} C_ij²

    C_ij = Σ_b z^A_{b,i} z^B_{b,j} / ( √(Σ_b (z^A_{b,i})²) √(Σ_b (z^B_{b,j})²) )

  - i, j: indices over feature-vector dimensions
  - b: index over the minibatch
  - z^A, z^B: feature vectors
  - λ: positive weighting factor
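A minimal sketch of the Barlow Twins loss, standardizing along the batch as in the Barlow Twins paper (the slide's C_ij uses norm normalization, which is equivalent after standardization; names and the default λ are illustrative):

```python
import torch

def barlow_twins_loss(z_a, z_b, lam=5e-3, eps=1e-9):
    """Decorrelate feature dimensions via the cross-correlation matrix.

    z_a, z_b: (B, D) embeddings of two views (or two modalities, as in VoLTA).
    """
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + eps)  # standardize along the batch
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + eps)
    c = z_a.t() @ z_b / z_a.size(0)                 # (D, D) cross-correlation matrix
    on_diag = (1 - torch.diagonal(c)).pow(2).sum()  # push diagonal toward 1
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # off-diagonal toward 0
    return on_diag + lam * off_diag
```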

53. Intra- and inter-modal contrastive learning (Barlow Twins), illustrated
  - Same equations as the previous slide; the diagram shows the cross-correlation entry C_ij computed between dimension i of z^A and dimension j of z^B across the minibatch
→ Reduces redundancy between the dimensions of the feature vectors
Multimodal: VoLTA [Pramanick+, arXiv22]

54. Graph optimal transport (Wasserstein distance)
  - Relations between image patches and between caption tokens are represented as graphs
  - Nodes are the patch/token feature vectors; edges are the similarities between feature vectors
  - Local features are aligned across modalities via optimal transport of the nodes (Wasserstein) and of the edges (Gromov-Wasserstein)
Multimodal: VoLTA [Pramanick+, arXiv22]

    D_W(φ, ψ) = min_{T ∈ Π(u,v)} Σ_i Σ_j T_ij · c(x_i, y_j)

    D_GW(φ, ψ) = min_{T̂ ∈ Π(u,v)} Σ_{i,i',j,j'} T̂_ij T̂_{i'j'} ‖c₁(x_i, x_{i'}) − c₂(y_j, y_{j'})‖

    L_GOT(φ, ψ) = γ D_W(φ, ψ) + (1 − γ) D_GW(φ, ψ)

  - x_i: image patch features
  - y_j: caption token features
  - T, T̂: transport plans
  - c(·,·), c₁(·,·), c₂(·,·): cosine similarities
  - γ: weight balancing the two losses
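A minimal sketch of the Wasserstein term D_W computed with Sinkhorn iterations over the cosine-distance cost; this is an entropic approximation of the minimum over transport plans (a common way to make it differentiable), with illustrative names and hyperparameters:

```python
import torch
import torch.nn.functional as F

def sinkhorn_wasserstein(x, y, n_iters=50, eps=0.05):
    """Entropic OT distance between patch features x (n, d) and token features y (m, d)."""
    cost = 1 - F.normalize(x, dim=1) @ F.normalize(y, dim=1).t()  # cosine distance (n, m)
    n, m = cost.shape
    mu = torch.full((n,), 1.0 / n, device=x.device)  # uniform node weights
    nu = torch.full((m,), 1.0 / m, device=x.device)
    K = torch.exp(-cost / eps)                       # Gibbs kernel
    u = torch.ones_like(mu)
    for _ in range(n_iters):                         # Sinkhorn fixed-point updates
        v = nu / (K.t() @ u)
        u = mu / (K @ v)
    T = torch.diag(u) @ K @ torch.diag(v)            # approximate transport plan
    return (T * cost).sum()
```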

55. Graph optimal transport (Wasserstein distance), illustrated
  - Diagram of the same equations as the previous slide: the node cost c(x_i, y_j) matches patches to tokens across modalities, while the edge cost ‖c₁(x_i, x_{i'}) − c₂(y_j, y_{j'})‖ matches intra-modal relations
Multimodal: VoLTA [Pramanick+, arXiv22]

56. Masked Language Modeling (MLM)
  - A fixed percentage of the tokens in the caption is masked
  - An MLM head predicts the masked tokens from the caption token features
Image-Text Matching (ITM)
  - The caption paired with an image is randomly replaced
  - An ITM head predicts, from both modalities' features, whether the image and caption are a correct pair
Multimodal: VoLTA [Pramanick+, arXiv22]

57. • Accuracy comparison on ImageNet-1K
  - Self-supervised pre-training on the COCO dataset
  - Linear evaluation of the image encoder with cross-attention turned off
  - w/o CMAF: trained with cross-attention turned off for all self-supervised tasks
Multimodal: VoLTA [Pramanick+, arXiv22]

Linear probing on the ImageNet validation set, top-1 accuracy (Table 1 of the paper; † = re-implemented by Yuan et al. (2021), ‡ = re-implemented by the VoLTA authors; the paper also reports VOC07 SVM/MLP mAP and COCO multi-label F1 scores):
Method            Pre-train  Arch.   Supervision  Top-1 (%)
Sup.              IN-1K      RN50    Label        76.5
Sup.              IN-100     RN50    Label        53.3†
MoCo              COCO       RN50    –            44.5†
MoCo-v2           COCO       RN50    –            49.3†
VirTex            COCO       RN50    Caption      52.8
ICMLM             COCO       RN50    Caption      51.9
MCT               COCO       RN50    Caption      54.9
MCT               COCO       RN50    Caption+Tag  55.3
VoLTA (w/o CMAF)  COCO       RN50    Caption      55.3
VoLTA (w/o CMAF)  COCO       Swin-T  Caption      56.3
VoLTA (w/o CMAF)  COCO       Swin-B  Caption      62.5
VoLTA             COCO       Swin-B  Caption      62.5

→ ResNet: high accuracy is achieved using captions alone
→ Swin Transformer: high accuracy is also possible with ViT-style models

58. • Visualizes the caption-token-to-image-patch matching produced by graph optimal transport
  - Image patches matched to the caption token shown in red are highlighted in red
• Uses a model self-supervised on the COCO dataset
Multimodal: VoLTA [Pramanick+, arXiv22]

→ Image-caption correspondences are acquired from captions alone

59. Representative self-supervised learning methods (the roadmap of slide 27, repeated)

60. Representative Masked Image Modeling (MIM) methods

• Origin: Masked Language Modeling (MLM) in NLP — BERT [J. Devlin, NAACL19], applied to images
• Predict the features of masked regions: BEiT [H. Bao, ICLR22], iBOT [J. Zhou, ICLR22]
• Predict the pixels of masked regions (image reconstruction): MAE [K. He, CVPR22], SimMIM [Z. Xie, CVPR22]
• Tokenizer-free targets: Masked Feature Prediction [C. Wei, CVPR22] — predicts HOG features
• Mask improvements:
  - SdAE [Y. Chen, ECCV22]: multi-fold masking strategy
  - Attention-Guided MIM [I. Kakogeorgiou, ECCV22]: masks built from attention weights
  - I-JEPA [M. Assran, arXiv23]: multi-block masking strategy
• Architecture improvements: MCMAE [P. Gao, NeurIPS22] — acquires multi-scale features
• Introducing contrastive learning (multi-task):
  - SiT [S. Atito, arXiv21]: SimMIM + contrastive learning
  - CMAE [J. Mao, arXiv22]: MAE + contrastive learning
  - FLIP [Y. Li, arXiv22]: CLIP with masked images
  - MSN [M. Assran, ECCV22]: negative-free learning with masked images
• Application to other modalities:
  - Audio: Masked Autoencoders that Listen [P. Huang, NeurIPS22]
  - Video: MAE As Spatiotemporal Learners [C. Feichtenhofer, NeurIPS22]
  - Multimodal: MultiMAE [R. Bachmann, ECCV22]

61. Representative Masked Image Modeling (MIM) methods (the roadmap of slide 60, repeated)

62. • BERT: Bidirectional Encoder Representations from Transformers
• A bidirectional Transformer trained in two steps: pre-training and fine-tuning
• Pre-training covers two tasks:
  - Masked Language Modeling: mask 15% of the tokens and predict the words at the masked positions
  - Next Sentence Prediction: predict whether Sentence B is the continuation of Sentence A
Masked Language Modeling: BERT [Devlin+, NAACL19]

(Figure 1 of the BERT paper: pre-training with Mask LM and NSP on unlabeled sentence pairs, then fine-tuning the same architecture on downstream tasks such as SQuAD question answering, NER, and MNLI.)
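A minimal sketch of BERT-style input masking; the 15% rate and the 80/10/10 split follow the BERT paper, while the tokenizer and vocabulary handling are illustrative assumptions:

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mask_prob=0.15):
    """Return (masked_inputs, labels) for masked-language-model training."""
    device = input_ids.device
    labels = input_ids.clone()
    # choose 15% of token positions for prediction
    selected = torch.rand(input_ids.shape, device=device) < mask_prob
    labels[~selected] = -100  # ignore unselected positions in the loss
    # of the selected: 80% -> [MASK], 10% -> random token, 10% -> keep original
    r = torch.rand(input_ids.shape, device=device)
    inputs = input_ids.clone()
    inputs[selected & (r < 0.8)] = mask_token_id
    rand_pos = selected & (r >= 0.8) & (r < 0.9)
    inputs[rand_pos] = torch.randint(vocab_size, (int(rand_pos.sum()),), device=device)
    return inputs, labels
```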

63. • BEiT: Bidirectional Encoder representation from Image Transformers
• Predicts the visual tokens (discrete features) of masked patches
• The output of a tokenizer trained with an autoencoder structure is used as the ground truth
  - Tokenizer: the pre-trained discrete variational autoencoder of DALL-E [Ramesh+, ICML21]
Masked Image Modeling: BEiT [Bao+, ICLR22]

(Figure 1 of the paper: the original image is split into patches and, in parallel, mapped to visual tokens by the tokenizer; blockwise masking replaces some patch embeddings with a mask token [M]; the BEiT encoder plus a masked-image-modeling head predicts the visual tokens at the masked positions; the tokenizer's decoder is unused during pre-training.)

64. • iBOT: image BERT pre-Training with Online Tokenizer
• Predicts the features of masked patches and the class token of a different, unmasked view
  - Trained with two losses: patch-feature prediction (L_MIM) and cross-view class-token prediction (L_[CLS])
• The output of an exponential-moving-average model (the online tokenizer) is used as the ground truth
Masked Image Modeling: iBOT [Zhou+, ICLR22]

(Figure of the paper: two views are fed to the student and to the EMA online tokenizer; stop-grad is applied on the tokenizer side, and L_MIM and L_[CLS] compare the patch tokens and the [CLS] tokens, respectively.)

65. Representative Masked Image Modeling (MIM) methods (the roadmap of slide 60, repeated)

66. • MAE: Masked Autoencoder
• Predicts the pixels of masked patches
  - Encoder: a ViT that takes only the unmasked patches as input
  - Decoder: a small ViT that reconstructs the image from the patch tokens and mask tokens
Image reconstruction: MAE [He+, CVPR22]

(Diagram: the input goes through the encoder with positional embeddings (PE); the decoder receives the encoded patch tokens plus mask tokens; the MSE loss is computed only on outputs at the mask-token positions; after self-supervised training, only the encoder is used.)
  67. w *NBHF/FU,ͷධՁ༻σʔλʹର͢Δ෮ݩ݁Ռ
67. • Reconstruction results on ImageNet-1K validation data
• Accuracy comparison on ImageNet-1K (baseline MAE: MAE pre-training + finetuning)
Image reconstruction: MAE [He+, CVPR22]

(Figure: input (masked) image, reconstruction, and original image for several validation examples.)
→ The whole image can be reconstructed from the unmasked patches

(Paper notes: a high masking ratio of 75% works well for both fine-tuning and linear probing; ViT-L top-1 on ImageNet-1K: trained from scratch (original recipe) 76.5, from scratch (their improved recipe) 82.5, fine-tuned from the baseline MAE 84.9 — and training supervised ViT-L from scratch is nontrivial.)

68. • SimMIM: a Simple Framework for Masked Image Modeling
• Both the masked and the unmasked patches are fed to the encoder
• A single linear layer is used as the decoder
• Experimentally analyzes how the masking scheme and mask ratio change accuracy
  - AvgDist: the average Euclidean distance from each masked pixel to its nearest unmasked pixel
Image reconstruction: SimMIM [Xie+, CVPR22]

69. Representative Masked Image Modeling (MIM) methods (the roadmap of slide 60, repeated)

70. • Predicts the features of masked patches
• HOG features are used as the ground truth
Tokenizer-free: Masked Feature Prediction [Wei+, CVPR22]

(Figure 2 of the paper: input space-time cubes of a video are randomly replaced with a [MASK] token, and features (e.g., HOG) of the masked regions are regressed directly through a linear head; after pre-training, the Transformer is fine-tuned on end tasks. Tables 1 and 2 compare target features — pixels, HOG, dVAE tokens, unsupervised/supervised features, pseudo-labels — for video (K400) and images (IN-1K); HOG needs no pre-trained teacher, unlike dVAE or DINO/supervised features.)

→ Predicting HOG features achieves accuracy on par with predicting a pre-trained model's features

71. Representative Masked Image Modeling (MIM) methods (the roadmap of slide 60, repeated)

72. • Creates masks based on the teacher's attention weights for the class token
  - AttMask-High: mask the regions with high attention weight
  - AttMask-Hint: mask high-attention regions while leaving part of them visible
  - AttMask-Low: mask the regions with low attention weight
• The teacher is an exponential-moving-average model of the student
Mask improvements: Attention-Guided MIM [Kakogeorgiou+, ECCV22]

(Figure 1 of the paper: (a) input image, (b) random (30), (c) random (75), (d) block-wise, (e) attention map, (f) AttMask-High — the default, masking the most highly attended regions — and (g) AttMask-Low.)
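A minimal sketch of attention-guided masking in the AttMask-High style: rank patches by the teacher's [CLS]-to-patch attention and mask the top fraction. How the attention is extracted varies by ViT implementation, and the names and mask ratio here are illustrative:

```python
import torch

def attmask_high(cls_attn, mask_ratio=0.4):
    """cls_attn: (B, N) [CLS]-to-patch attention of the teacher, averaged over heads.

    Returns a boolean mask of shape (B, N); True = patch is masked.
    """
    B, N = cls_attn.shape
    n_mask = int(N * mask_ratio)
    idx = cls_attn.argsort(dim=1, descending=True)[:, :n_mask]  # most-attended patches
    mask = torch.zeros(B, N, dtype=torch.bool, device=cls_attn.device)
    mask.scatter_(1, idx, True)
    return mask
```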

73. • Evaluates attention-weight-based masking strategies plugged into iBOT
• Accuracy comparison on ImageNet-1K
  - Self-supervised pre-training on ImageNet-1K → k-NN evaluation, linear evaluation, and finetuning
Mask improvements: Attention-Guided MIM [Kakogeorgiou+, ECCV22]

Masking strategies for iBOT pre-training on 20% of ImageNet (Table 1 of the paper; top-1 accuracy; † = default iBOT masking from BEiT, ‡ = aggressive random masking from MAE):
iBOT masking        Ratio (%)  IN-1K k-NN  IN-1K linear  CIFAR10 ft  CIFAR100 ft
Random Block-Wise†  10-50      46.7        56.4          98.0        86.0
Random‡             75         47.3        55.5          97.7        85.5
Random              10-50      47.8        56.7          98.0        86.1
AttMask-Low         10-50      44.0        53.4          97.6        84.6
AttMask-Hint        10-50      49.5        57.5          98.1        86.6
AttMask-High        10-50      49.7        57.9          98.2        86.6

→ Masking the regions with high attention weight improves accuracy (the paper also shows AttMask-High reaching the baseline's k-NN accuracy with 42% fewer epochs)

  74. Improving the mask: SdAE [Chen+, ECCV'22]

    • SdAE: Self-distillated Masked Autoencoder
    • Masking is also introduced when creating the target features (a loss sketch follows below)
      - Student: predicts the features of the masked patches with an MAE-style structure
      - Teacher: masks a different set of patches than the student and extracts their features (creating the targets)
    • An exponential moving average of the student is used as the teacher

    [Architecture diagram: input → multi-fold mask → student encoder and decoder; an EMA copy of the encoder serves as the teacher; the matching features are selected, normalized, and compared by cosine similarity.]

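    The normalize-and-compare step of the diagram can be written compactly; a minimal sketch assuming `student_pred` and `teacher_feat` are [B, N, D] features at corresponding patch positions (the negative-cosine form mirrors the diagram's Normalize + Cosine Similarity blocks, not the authors' exact code):

        import torch
        import torch.nn.functional as F

        def sdae_loss(student_pred, teacher_feat):
            s = F.normalize(student_pred, dim=-1)
            t = F.normalize(teacher_feat, dim=-1).detach()  # no gradient to the teacher
            return -(s * t).sum(dim=-1).mean()              # maximize cosine similarity

        loss = sdae_loss(torch.randn(4, 49, 768), torch.randn(4, 49, 768))
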
  75. Improving the mask: SdAE [Chen+, ECCV'22]

    • Accuracy comparison on ImageNet-1K
      - Self-supervised pre-training on ImageNet-1K → fine-tuning and linear evaluation on ImageNet-1K

    Fine-tuning accuracy improves over contrastive learning and previous MIM. Compared with the recently proposed CAE, SdAE gains 0.8% top-1 accuracy, and with only 100 pre-training epochs it is comparable to MAE.

    Table 2 (SdAE paper). ILSVRC-2012 ImageNet classification, top-1 accuracy, ViT-B. "Epochs" is the number of pre-training epochs; MoCo v3 and DINO use multi-crop augmentation (MoCo v3: 2 global 224x224 crops; DINO: 2 global 224x224 and 10 local 96x96 crops).
    Method             | Epochs | Crops | Finetune | Linear
    Train from Scratch | 300    | -     | 81.8     | -
    MoCo v3            | 300    | 2     | 83.2     | 76.2
    DINO               | 400    | 12    | 83.3     | 77.3
    BEiT               | 300    | 1     | 83.0     | 49.4
    MAE                | 100    | 1     | 82.1     | 54.8
    MAE                | 300    | 1     | 82.9     | 61.5
    MAE                | 1600   | 1     | 83.6     | 67.8
    CAE                | 300    | 1     | 83.3     | 64.2
    SdAE               | 100    | 1     | 83.5     | 60.3
    SdAE               | 300    | 1     | 84.1     | 64.9

  76. Improving the mask: I-JEPA [Assran+, arXiv]

    • I-JEPA: Image-based Joint-Embedding Predictive Architecture
    • Several target regions are sampled, and the patch features of each target region are predicted separately (a block-sampling sketch follows below)
      - target : target blocks are sampled with aspect ratio in [0.75, 1.5] and scale in [0.15, 0.2] (4 target blocks in the paper's default setting)
      - context: a context block is sampled with unit aspect ratio and scale in [0.85, 1.0], and any region overlapping a target is removed
    • An exponential moving average of the context encoder is used as the target encoder

    [Architecture diagram: context → context encoder f_q → predictor g_f (one pass per target block); original image → target encoder f̄_q; L2 loss between predicted and target patch features.]

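    A minimal sketch of this multi-block sampling on a 14x14 patch grid; the rounding heuristics and helper names are illustrative, only the scale/aspect-ratio ranges follow the paper:

        import math, random

        def sample_block(grid=14, scale=(0.15, 0.2), aspect=(0.75, 1.5)):
            """Return (top, left, height, width) of one rectangular patch block."""
            s = random.uniform(*scale) * grid * grid      # block area, in patches
            a = random.uniform(*aspect)                   # height / width ratio
            h = max(1, min(grid, round(math.sqrt(s * a))))
            w = max(1, min(grid, round(math.sqrt(s / a))))
            top, left = random.randint(0, grid - h), random.randint(0, grid - w)
            return top, left, h, w

        # 4 target blocks; one large context block, minus any overlap with targets.
        targets = [sample_block() for _ in range(4)]
        context = sample_block(scale=(0.85, 1.0), aspect=(1.0, 1.0))
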
  77. Improving the mask: I-JEPA [Assran+, arXiv]

    • Feature extraction
      - target : the input image, with no mask applied, is fed to the target encoder
      - context: only the patches inside the context block are fed to the context encoder
    • Predicting the target blocks (a predictor sketch follows below)
      - For each target block, the predictor predicts the patch features from the context patch tokens together with mask tokens
      - A small ViT is used as the predictor

    [Same architecture diagram as the previous slide: context encoder f_q, predictor g_f, target encoder f̄_q, L2 loss.]

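    A minimal sketch of the predictor step for one target block; a generic Transformer encoder stands in for the small ViT predictor, and all shapes and names are illustrative assumptions:

        import torch
        import torch.nn.functional as F

        B, D, Nc, Nt = 8, 384, 120, 20              # batch, dim, context / target tokens
        predictor = torch.nn.TransformerEncoder(     # stand-in for the small ViT predictor
            torch.nn.TransformerEncoderLayer(D, nhead=6, batch_first=True), num_layers=2)
        ctx = torch.randn(B, Nc, D)                  # encoded context tokens
        pos_emb = torch.randn(196, D)                # positional embeddings of all patches
        idx = torch.arange(Nt)                       # patch indices of one target block
        target_feat = torch.randn(B, 196, D)         # target-encoder (EMA model) output

        mask_tok = torch.zeros(B, Nt, D) + pos_emb[idx]            # mask tokens + positions
        pred = predictor(torch.cat([ctx, mask_tok], 1))[:, -Nt:]   # predict target features
        loss = F.mse_loss(pred, target_feat[:, idx].detach())      # L2 loss in feature space
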
  78. Improving the mask: I-JEPA [Assran+, arXiv]

    • Accuracy comparison on ImageNet-1K
      - Self-supervised pre-training on ImageNet-1K → linear evaluation on ImageNet-1K

    Table 1 (I-JEPA paper). Linear evaluation on ImageNet-1K. ViT-H/16_448 is pre-trained at 448x448. I-JEPA significantly improves linear probing over methods that do not rely on hand-crafted data augmentations during pre-training (MAE, data2vec), scales well, and its larger models match view-invariance approaches without view augmentations.
    Method    | Arch.        | Epochs | Top-1
    Methods without view data augmentations:
    data2vec  | ViT-L/16     | 1600   | 53.5
    MAE       | ViT-B/16     | 1600   | 68.0
    MAE       | ViT-L/16     | 1600   | 76.0
    MAE       | ViT-H/14     | 1600   | 77.2
    I-JEPA    | ViT-B/16     | 600    | 72.9
    I-JEPA    | ViT-L/16     | 600    | 77.5
    I-JEPA    | ViT-H/14     | 300    | 79.3
    I-JEPA    | ViT-H/16_448 | 300    | 81.1
    Methods using extra view data augmentations:
    SimCLR v2 | RN152 (2x)   | 800    | 79.1
    DINO      | ViT-B/8      | 300    | 80.1
    iBOT      | ViT-L/16     | 250    | 81.0

    Table 2 (I-JEPA paper). Semi-supervised evaluation on ImageNet-1K with only 1% of the labels (fine-tuning or linear probing, whichever works best per method).
    Method    | Arch.        | Epochs | Top-1
    Methods without view data augmentations:
    data2vec  | ViT-L/16     | 1600   | 73.3
    MAE       | ViT-L/16     | 1600   | 67.1
    MAE       | ViT-H/14     | 1600   | 71.5
    I-JEPA    | ViT-L/16     | 600    | 69.4
    I-JEPA    | ViT-H/14     | 300    | 73.3
    I-JEPA    | ViT-H/16_448 | 300    | 77.3
    Methods using extra view data augmentations:
    iBOT      | ViT-B/16     | 250    | 69.7
    DINO      | ViT-B/8      | 300    | 70.0
    SimCLR v2 | RN151 (2x)   | 800    | 70.2
    BYOL      | RN200 (2x)   | 800    | 71.2
    MSN       | ViT-B/4      | 300    | 75.7

    [Plot: ImageNet linear-evaluation top-1 vs. pre-training GPU hours for I-JEPA and MAE across ViT-B/16, ViT-L/16, and ViT-H/14.]

    Accuracy improves over previous MIM, contrastive, and negative-free methods; compared with MAE, I-JEPA reaches higher accuracy with less training time.

  79. Representative Masked Image Modeling methods (the taxonomy of slide 71, repeated as a section divider; next topic: introducing contrastive learning)

  80. Introducing contrastive learning: SiT [Atito+, arXiv]

    • SiT: Self-supervised vIsion Transformer
    • Adds pair prediction (contrastive learning) to SimMIM-style pixel prediction of masked regions
      - Random noise is used as the mask
      - A contrastive token is added to the ViT and used for the contrastive objective

    [Fig. 1 (SiT paper): pixel-corrupted image → linear projection of flattened patches + position embedding → Vision Transformer → projection to image space (reconstructed image) and contrastive head (contrastive embedding).]

  81. Introducing contrastive learning: SiT [Atito+, arXiv]

    • Predicting the masked regions (a corruption sketch follows below)
      - Patches are randomly replaced with noise and fed to the ViT
      - The ViT's output patch tokens go through a decoder (a light MLP) that reconstructs the pixels of each patch
      - The loss is the L1 loss between the original image and the reconstructed image

    [Same SiT figure as the previous slide, highlighting the reconstruction branch.]

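    A minimal sketch of this noise corruption; the image size, patch size, and 30% corruption ratio are illustrative assumptions:

        import torch

        def corrupt_patches(img, patch=16, ratio=0.3):
            """Replace a random subset of patches with uniform noise.
            img: [B, 3, H, W] with H, W divisible by `patch`."""
            B, C, H, W = img.shape
            gh, gw = H // patch, W // patch
            hit = torch.rand(B, gh, gw) < ratio                   # True = corrupt
            mask = hit.repeat_interleave(patch, 1).repeat_interleave(patch, 2)
            noise = torch.rand_like(img)
            return torch.where(mask.unsqueeze(1), noise, img)

        x = corrupt_patches(torch.rand(4, 3, 224, 224))
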
  82. Introducing contrastive learning: SiT [Atito+, arXiv]

    • Predicting the pair (a loss sketch follows below)
      - Two views forming a positive pair are created from one image by data augmentation
      - View 1: patches are randomly replaced with noise and fed to the ViT
      - View 2: no masking is applied, and the view is fed to an exponential moving average of the ViT
      - The outputs for the contrastive token are used for contrastive learning
      - The loss is the normalized temperature-scaled cross-entropy loss

    [Same SiT figure as the previous slides, highlighting the contrastive head.]

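    A minimal sketch of the normalized temperature-scaled cross-entropy (NT-Xent) loss over the contrastive-token embeddings of the two views; in-batch negatives, and the temperature value is illustrative:

        import torch
        import torch.nn.functional as F

        def nt_xent(z1, z2, tau=0.2):
            """z1, z2: [B, D] contrastive embeddings of the paired views."""
            z = F.normalize(torch.cat([z1, z2]), dim=1)      # [2B, D]
            sim = z @ z.t() / tau                            # cosine similarities
            sim.fill_diagonal_(float('-inf'))                # drop self-pairs
            B = z1.size(0)
            target = torch.cat([torch.arange(B) + B, torch.arange(B)])
            return F.cross_entropy(sim, target)              # positives: the other view

        loss = nt_xent(torch.randn(8, 128), torch.randn(8, 128))
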
  83. Introducing contrastive learning: SiT [Atito+, arXiv]

    • Accuracy comparison on downstream tasks
      - Self-supervised pre-training on ImageNet-1K → fine-tuning on the downstream tasks

    Table 2 (SiT paper). Pre-trained and fine-tuned on the target dataset itself (no external data), ViT-S/16.
    Method       | Flowers | Pets | CUB  | Aircraft | STL10 | Cars | CIFAR10 | CIFAR100
    Random init. | 68.8    | 47.5 | 25.3 | 31.1     | 77.1  | 27.4 | 96.9    | 77.8
    MoCo-v3      | 88.9    | 69.0 | 53.1 | 62.5     | 95.4  | 84.0 | 97.3    | 83.4
    DINO         | 82.4    | 58.0 | 43.6 | 49.3     | 92.1  | 73.0 | 96.8    | 78.9
    MAE          | 86.9    | 73.0 | 59.4 | 69.0     | -     | 91.0 | -       | -
    SiT          | 92.8    | 84.7 | 71.2 | 77.8     | 96.5  | 92.1 | 98.2    | 85.2

    Table 3 (SiT paper). Domain transfer of SiT pre-trained on ImageNet-1K.
    ViT-S/16:
    Method       | Flowers | Pets | CUB  | Aircraft | STL10 | Cars | CIFAR10 | CIFAR100 | IN-1K
    Random init. | 68.8    | 47.5 | 25.3 | 31.1     | 77.1  | 27.4 | 96.9    | 77.8     | -
    Supervised   | 98.1    | 91.1 | 82.7 | 80.8     | 98.2  | 91.7 | 98.3    | 86.9     | 79.9
    MoCo-v3      | 97.7    | 92.3 | 82.6 | 87.3     | 98.0  | 93.0 | 98.2    | 86.6     | 81.4
    DINO*        | 97.8    | 89.4 | 80.8 | 83.8     | 96.7  | 93.1 | 98.6    | 87.1     | 81.5
    SiT          | 98.2    | 92.6 | 84.6 | 87.6     | 98.8  | 93.2 | 99.0    | 90.8     | 82.0
    ViT-B/16:
    MoCo-v3      | 98.3    | 93.7 | 84.1 | 87.2     | 98.4  | 93.4 | 98.2    | 87.3     | 83.2
    DINO         | 98.4    | 90.2 | 80.7 | 81.5     | 97.2  | 93.0 | 98.2    | 87.1     | 82.8

    Accuracy improves over supervised pre-training and single-task self-supervised methods.

  84. Introducing contrastive learning: CMAE [Mao+, arXiv]

    • CMAE: Contrastive Masked Autoencoders
    • Adds pair prediction (contrastive learning) to MAE
      - A Feature Decoder that predicts the features of each patch is added
      - A masked image and a pixel-shifted view of the same source image form a positive pair for contrastive learning

    [Architecture diagram: input → masked image → online encoder → pixel decoder (reconstruction loss) and feature decoder → projection head; pixel-shifted view → target encoder → projection head; contrastive loss between the two branches.]

  85. Introducing contrastive learning: CMAE [Mao+, arXiv]

    • Online Encoder
      - Applies patch-level masking and takes only the unmasked patches as input
    • Target Encoder / Projection Head
      - Takes all patches as input, with no masking applied

    [Same CMAE architecture diagram as the previous slide.]

  86. Introducing contrastive learning: CMAE [Mao+, arXiv]

    • Pixel Decoder
      - Takes the online encoder's patch tokens together with mask tokens and outputs the pixels of each patch
    • Feature Decoder
      - Takes the online encoder's patch tokens together with mask tokens and outputs the features of each patch

    [Same CMAE architecture diagram as the previous slides.]

  87. Introducing contrastive learning: CMAE [Mao+, arXiv]

    • Reconstruction Loss (a sketch of both losses follows below)
      - MSE loss computed only between the outputs for the mask tokens and the corresponding patches of the input
    • Contrastive Loss
      - InfoNCE loss with the masked image and the pixel-shifted view of the same source image as the positive pair
      - The mean of the patch tokens is used as the feature of the whole image

    [Same CMAE architecture diagram as the previous slides.]

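    A minimal sketch of the two objectives; the shapes, the temperature, and the unit loss weighting are illustrative assumptions, and `pred_pix`/`target_pix` are taken to cover only the masked positions, per the slide:

        import torch
        import torch.nn.functional as F

        def cmae_losses(pred_pix, target_pix, online_tokens, target_tokens, tau=0.1):
            # Reconstruction: MSE on the masked patches only
            rec = F.mse_loss(pred_pix, target_pix)
            # Contrastive: mean-pool the patch tokens into one embedding per image
            q = F.normalize(online_tokens.mean(dim=1), dim=1)   # [B, D]
            k = F.normalize(target_tokens.mean(dim=1), dim=1)   # [B, D]
            logits = q @ k.t() / tau                            # in-batch InfoNCE
            labels = torch.arange(q.size(0))                    # positives on the diagonal
            return rec + F.cross_entropy(logits, labels)

        loss = cmae_losses(torch.rand(4, 50, 768), torch.rand(4, 50, 768),
                           torch.randn(4, 196, 768), torch.randn(4, 196, 768))
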
  88. Introducing contrastive learning: CMAE [Mao+, arXiv]

    • Parameter updates (an EMA sketch follows below)
      - Online encoder, decoders, projection head: updated by gradient descent on the loss
      - Target encoder: updated as an exponential moving average of the online encoder's parameters

    [Same CMAE architecture diagram as the previous slides.]

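    A minimal sketch of the target branch's exponential-moving-average update; the momentum value and the tiny stand-in module are illustrative:

        import copy
        import torch

        online = torch.nn.Linear(8, 8)             # stand-in for the online encoder
        target = copy.deepcopy(online)             # target starts as a copy
        for p in target.parameters():
            p.requires_grad_(False)                # no gradients flow to the target

        @torch.no_grad()
        def ema_update(m=0.996):
            for po, pt in zip(online.parameters(), target.parameters()):
                pt.mul_(m).add_(po, alpha=1 - m)   # pt = m * pt + (1 - m) * po

        ema_update()                               # called once per training step
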
  89. Introducing contrastive learning: CMAE [Mao+, arXiv]

    • Accuracy comparison on ImageNet-1K
      - Self-supervised pre-training on ImageNet-1K → fine-tuning on ImageNet-1K

    Table 2 (CMAE paper). Comparison with existing methods on ViT-B (86M parameters), top-1 accuracy.
    Method    | Pre-training epochs | Supervision         | Accuracy
    MoCo-v3   | 300                 | RGB                 | 83.2
    DINO      | 300                 | RGB                 | 82.8
    CIM       | 300                 | RGB                 | 83.3
    BEiT      | 800                 | DALLE               | 83.2
    SimMIM    | 800                 | RGB                 | 83.8
    PeCo      | 800                 | Perceptual Codebook | 84.5
    MaskFeat  | 1600                | HOG                 | 84.0
    CAE       | 1600                | DALLE+RGB           | 83.9
    iBOT      | 1600                | RGB                 | 84.0
    SIM       | 1600                | RGB                 | 83.8
    MAE       | 1600                | RGB                 | 83.6
    CMAE      | 800                 | RGB                 | 84.4
    CMAE      | 1600                | RGB                 | 84.7
    ConvMAE*  | 800                 | RGB                 | 84.6
    ConvMAE*  | 1600                | RGB                 | 84.6
    CMAE*     | 800                 | RGB                 | 85.0
    CMAE*     | 1600                | RGB                 | 85.3

    Accuracy improves over single-task self-supervised methods.

  90. Introducing contrastive learning: MSN [Assran+, ECCV'22]

    • MSN: Masked Siamese Networks
    • Proposes a negative-free method that uses masked views
      - Trained so that the probability distributions of a masked view and a different, unmasked view match
      - The distribution is built from class scores given by the similarity between the class token and each prototype (a sketch follows below)
      - The prototypes are learnable parameters updated together with the ViT

    [Architecture diagram: anchor view → patchify & mask → f_q → [CLS] representation z → softmax over prototypes → prediction p; target view → patchify → f̄_q (EMA) → z⁺ → target p⁺; cross-entropy H(p⁺, p) on the cluster assignments.]

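    A minimal sketch of the prototype scoring; the prototype count, dimensions, and temperatures are illustrative, and MSN's mean-entropy regularization is omitted:

        import torch
        import torch.nn.functional as F

        K, D = 1024, 256
        prototypes = torch.nn.Parameter(torch.randn(K, D))   # learned with the ViT

        def assign(cls_embed, tau):
            """Softmax over cosine similarities between [CLS] embeddings [B, D]
            and the K prototypes -> class-score distribution [B, K]."""
            z = F.normalize(cls_embed, dim=1)
            p = F.normalize(prototypes, dim=1)
            return F.softmax(z @ p.t() / tau, dim=1)

        anchor = assign(torch.randn(8, D), tau=0.1)           # masked view
        with torch.no_grad():
            target = assign(torch.randn(8, D), tau=0.025)     # unmasked view (EMA branch)
        loss = -(target * anchor.log()).sum(dim=1).mean()     # H(target, anchor)
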
  91. Introducing contrastive learning: MSN [Assran+, ECCV'22]

    • Accuracy comparison on ImageNet-1K
      - Self-supervised pre-training on ImageNet-1K → linear evaluation on ImageNet-1K

    Table 3 (MSN paper). Linear evaluation on ImageNet-1K using 100% of the labels.
    Method    | Architecture  | Params. | Epochs | Top-1
    Comparing similar architectures:
    SimCLRv2  | RN50          | 24M     | 800    | 71.7
    BYOL      | RN50          | 24M     | 1000   | 74.4
    DINO      | ViT-S/16      | 22M     | 800    | 77.0
    iBOT      | ViT-S/16      | 22M     | 800    | 77.9
    MSN       | ViT-S/16      | 22M     | 600    | 76.9
    Comparing larger architectures:
    MAE       | ViT-H/14      | 632M    | 1600   | 76.6
    BYOL      | RN200 (2x)    | 250M    | 800    | 79.6
    SimCLRv2  | RN151+SK (3x) | 795M    | 800    | 79.8
    iBOT      | ViT-B/16      | 86M     | 400    | 79.4
    DINO      | ViT-B/8       | 86M     | 300    | 80.1
    MoCov3    | ViT-BN-L/7    | 304M    | 300    | 81.0
    MSN       | ViT-L/7       | 304M    | 200    | 80.7
    → the linear classifier here is trained with 100% of the training data.

    [Table 4 (MSN paper), partially recovered: end-to-end fine-tuning of a ViT-B/16 encoder on ImageNet-1K with 100% of the labels; DINO 83.6 (800 ep.), BEiT 83.2 (800 ep.), iBOT 83.8 (800 ep.), MAE 83.6 (1600 ep.), SimMIM 83.8, MaskFeat 84.0. MSN is competitive with both joint-embedding and auto-encoding approaches.]

    Table 1 (MSN paper). Extreme low-shot: label efficiency of models pre-trained on ImageNet-1K; mean top-1 accuracy and std over 3 random splits, with 1, 2, or 5 labeled images per class.
    Method | Architecture | Epochs | 1 img/cls  | 2 imgs/cls | 5 imgs/cls
    iBOT   | ViT-S/16     | 800    | 40.4 ± 0.5 | 50.8 ± 0.8 | 59.9 ± 0.2
    iBOT   | ViT-B/16     | 400    | 46.1 ± 0.3 | 56.2 ± 0.7 | 64.7 ± 0.3
    DINO   | ViT-S/16     | 800    | 38.9 ± 0.4 | 48.9 ± 0.3 | 58.5 ± 0.1
    DINO   | ViT-B/16     | 400    | 41.8 ± 0.3 | 51.9 ± 0.6 | 61.4 ± 0.2
    DINO   | ViT-S/8      | 800    | 45.5 ± 0.4 | 56.0 ± 0.7 | 64.7 ± 0.4
    DINO   | ViT-B/8      | 300    | 45.8 ± 0.5 | 55.9 ± 0.6 | 64.6 ± 0.2
    MAE    | ViT-B/16     | 1600   | 8.2 ± 0.3  | 25.0 ± 0.3 | 40.5 ± 0.2
    MAE    | ViT-L/16     | 1600   | 12.3 ± 0.2 | 19.3 ± 1.8 | 42.3 ± 0.3
    MAE    | ViT-H/14     | 1600   | 11.6 ± 0.4 | 18.6 ± 0.2 | 32.8 ± 0.2
    MSN    | ViT-S/16     | 800    | 47.1 ± 0.1 | 55.8 ± 0.6 | 62.8 ± 0.3
    MSN    | ViT-B/16     | 600    | 49.8 ± 0.2 | 58.9 ± 0.4 | 65.5 ± 0.3
    MSN    | ViT-B/8      | 600    | 55.1 ± 0.1 | 64.9 ± 0.7 | 71.6 ± 0.3
    MSN    | ViT-B/4      | 300    | 54.3 ± 0.4 | 64.6 ± 0.7 | 72.4 ± 0.3
    MSN    | ViT-L/7      | 200    | 57.1 ± 0.6 | 66.4 ± 0.6 | 72.1 ± 0.2
    → the linear classifier here is trained with only 1 to 5 samples per class.

    In this few-shot setting, accuracy improves over contrastive learning and previous MIM methods.

  92. Introducing contrastive learning: FLIP [Li+, arXiv]

    • FLIP: Fast Language-Image Pre-training
    • Proposes CLIP trained with masked images (a masking sketch follows below)
      - A Vision Transformer is used as the image encoder
      - Only the unmasked image patches are fed to the image encoder during training
    • After training, an unmasked tuning strategy runs a few steps of FLIP (i.e. CLIP) with 0% masking
      - This absorbs the gap between masked and unmasked images

    [Plot: zero-shot ImageNet accuracy vs. training time for masking ratios 0% (CLIP reproduction), 50%, and 75%; masking yields a 3.7x speedup.]

    Using the masking strategy shortens training time.

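    A minimal sketch of the image-side masking: only the visible patch embeddings are gathered before the ViT blocks, which is what cuts the compute. The shapes are illustrative; the 50% ratio matches the paper's default:

        import torch

        def random_visible(tokens, ratio=0.5):
            """tokens: [B, N, D] patch embeddings. Keep a random (1 - ratio)
            subset per image and return only the visible tokens."""
            B, N, D = tokens.shape
            n_keep = int(N * (1 - ratio))
            idx = torch.rand(B, N).argsort(dim=1)[:, :n_keep]    # random subset
            return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, D))

        visible = random_visible(torch.randn(8, 196, 768))       # -> [8, 98, 768]
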
  93. Introducing contrastive learning: FLIP [Li+, arXiv]

    • Accuracy comparison on ImageNet-1K
      - Self-supervised pre-training on LAION-400M → zero-shot transfer, linear evaluation, and fine-tuning on ImageNet-1K
    • Accuracy evaluation on downstream classification tasks
      - Self-supervised pre-training on LAION-400M → zero-shot transfer to the downstream tasks

    Table 2 (FLIP paper). Zero-shot accuracy on ImageNet-1K classification vs. various CLIP baselines (image size 224; entries marked "different data" are pre-trained on another dataset). FLIP uses a 64k batch, 50% masking ratio, and unmasked tuning.
    case             | data       | epochs | B/16 | L/16 | L/14 | H/14
    CLIP             | WIT-400M   | 32     | 68.6 | -    | 75.3 | -
    OpenCLIP         | LAION-400M | 32     | 67.1 | -    | 72.8 | -
    CLIP, our repro. | LAION-400M | 32     | 68.2 | 72.4 | 73.1 | -
    FLIP             | LAION-400M | 32     | 68.0 | 74.3 | 74.6 | 75.5

    Table 3 (FLIP paper). Linear probing and fine-tuning accuracy on ImageNet-1K. †: CLIP [52] optimizes the linear probe with L-BFGS; the reproduction uses SGD.
    case               | data       | epochs | model | zero-shot | linear probe | fine-tune
    CLIP               | WIT-400M   | 32     | L/14  | 75.3      | 83.9†        | -
    CLIP, our transfer | WIT-400M   | 32     | L/14  | 75.3      | 83.0         | 87.4
    OpenCLIP           | LAION-400M | 32     | L/14  | 72.8      | 82.1         | 86.2
    CLIP, our repro.   | LAION-400M | 32     | L/16  | 72.4      | 82.6         | 86.3
    FLIP               | LAION-400M | 32     | L/16  | 74.3      | 83.6         | 86.9

    Table 4 (FLIP paper). Zero-shot accuracy on 25 more classification datasets (ViT-L/14, image size 224). Columns: CLIP (WIT-400M) / CLIP, our eval. (WIT-400M) / OpenCLIP (LAION-400M) / CLIP, our repro. (LAION-400M) / FLIP (LAION-400M).
    Food101      : 92.9 / 91.0 / 87.4 / 88.1 / 89.3
    CIFAR10      : 96.2 / 95.2 / 94.1 / 96.0 / 97.2
    CIFAR100     : 77.9 / 75.6 / 77.1 / 81.3 / 84.1
    Birdsnap     : 48.3 / 51.2 / 61.3 / 60.5 / 63.0
    SUN397       : 67.7 / 66.6 / 70.7 / 72.3 / 73.1
    Cars         : 77.3 / 75.0 / 86.2 / 89.1 / 90.7
    Aircraft     : 36.1 / 32.3 / 21.8 / 25.8 / 29.1
    VOC2007      : 84.1 / 83.3 / 83.5 / 81.1 / 83.1
    DTD          : 55.3 / 55.0 / 54.9 / 59.3 / 60.4
    Oxford Pets  : 93.5 / 93.6 / 90.8 / 93.2 / 92.6
    Caltech101   : 92.6 / 92.4 / 94.0 / 93.2 / 93.8
    Flowers102   : 78.7 / 77.7 / 72.1 / 74.6 / 75.0
    MNIST        : 87.2 / 76.0 / 71.5 / 69.1 / 80.3
    STL10        : 99.3 / 99.3 / 98.2 / 96.5 / 98.5
    EuroSAT      : 59.9 / 62.0 / 53.3 / 50.7 / 53.5
    RESISC45     : 71.6 / 71.6 / 67.7 / 69.2 / 70.8
    GTSRB        : 50.3 / 51.6 / 47.3 / 50.2 / 41.4
    KITTI        : 23.1 / 26.9 / 29.3 / 29.4 / 34.8
    Country211   : 32.7 / 30.9 / 21.6 / 21.4 / 23.1
    PCam         : 58.8 / 51.6 / 51.1 / 53.1 / 50.3
    UCF101       : 76.2 / 76.1 / 71.3 / 71.5 / 74.1
    Kinetics700  : 60.3 / 59.5 / 50.5 / 53.5 / 55.8
    CLEVR        : 24.3 / 22.2 / 22.0 / 18.5 / 22.7
    HatefulMemes : 63.3 / 55.3 / 55.3 / 53.3 / 54.0
    SST2         : 64.0 / 67.3 / 57.1 / 57.2 / 58.5

    Zero-shot accuracy improves on most of the 25 datasets among the LAION-400M models, and both zero-shot transfer and linear evaluation improve on ImageNet-1K.

  94. Representative Masked Image Modeling methods (the taxonomy of slide 71, repeated as a section divider; next topic: improving the architecture)

  95. Improving the architecture: MCMAE [Gao+, NeurIPS'22]

    • MCMAE: Masked Convolution Meets Masked Autoencoders
    • Proposes an MAE that can acquire multi-scale features
      - The mask is created to align with the patch regions of the Transformer blocks (block-wise masking, upsampled to the convolutional stages)
      - Masked Convolution is introduced so that the features of masked and unmasked regions do not mix (a sketch follows below)

    [Architecture diagram: input H×W×3 → Stage 1 (patch embedding + 2 masked convolution blocks, H/4×W/4×C1) → Stage 2 (patch embedding + 2 masked convolution blocks, H/8×W/8×C2) → Stage 3 (patch embedding + 11 Transformer blocks, (H/16×W/16)×C3) with block-wise masking; strided-conv features from Stages 1-2 are fused multi-scale into the decoder. A masked convolution block applies a depthwise convolution and an FFN, with the mask applied around each step.]

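    A minimal sketch of a masked depthwise convolution; re-applying the mask after the convolution is what keeps features from leaking between masked and visible regions. The module layout is illustrative, not the authors' implementation:

        import torch
        import torch.nn as nn

        class MaskedDWConv(nn.Module):
            def __init__(self, dim):
                super().__init__()
                self.conv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # depthwise
            def forward(self, x, mask):
                """x: [B, C, H, W]; mask: [B, 1, H, W], 1 = visible, 0 = masked."""
                x = x * mask          # zero out masked positions before the conv
                x = self.conv(x)
                return x * mask       # drop anything that leaked into masked positions

        y = MaskedDWConv(64)(torch.randn(2, 64, 56, 56), torch.ones(2, 1, 56, 56))
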
  96. Representative Masked Image Modeling methods (the taxonomy of slide 71, repeated as a section divider; next topic: applying different modalities)

  97. Applying different modalities: MultiMAE [Bachmann+, ECCV'22]

    • MultiMAE: Multi-modal Multi-task Masked Autoencoders
    • Extends MAE to multiple modalities

    [Figure 2 (MultiMAE paper), left: a small subset of randomly sampled patches from multiple modalities (RGB, depth, semantic segmentation) is linearly projected to tokens and encoded by a shared Transformer; task-specific decoders reconstruct the masked patches. Right: the pre-trained MultiMAE encoder is fine-tuned on single-modal or multi-modal downstream tasks with task-specific heads.]

  98. Applying different modalities: MultiMAE [Bachmann+, ECCV'22]

    • Multimodal data
      - Depth and semantic segmentation maps are created from the RGB images by pseudo labeling
      - The pseudo labels are the outputs of models pre-trained on the segmentation and depth tasks (data that must be prepared in advance)
    • Encoder
      - The unmasked patches of all modalities are concatenated and fed in together

    [Figure 2 (MultiMAE paper): RGB / depth / semantic patches → per-modality linear projections → shared Transformer encoder; task-specific decoders reconstruct the masked-out patches by first performing a cross-attention step from queries to the encoded tokens, followed by a shallow Transformer; the queries consist of mask tokens, with the task-specific encoded tokens added at their respective positions.]

  99. Applying different modalities: MultiMAE [Bachmann+, ECCV'22]

    • Decoder (a cross-attention sketch follows below)
      - Positional and modality embeddings are added to the linearly projected encoder outputs
      - Cross-attention produces tokens that account for the relations between modalities before the Transformer blocks
        - Query     : the projected tokens of the decoder's own modality
        - Key, Value: the projected tokens of all modalities

    [Figure 7 (MultiMAE paper): following MAE, each decoder has a linear projection to the decoder dimension, adds sine-cosine positional embeddings and learned modality embeddings, and then applies a cross-attention layer, an MLP, and two Transformer blocks.]

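    A minimal sketch of that cross-attention step for one modality's decoder; nn.MultiheadAttention stands in for the decoder's attention layer, and all dimensions are illustrative:

        import torch
        import torch.nn as nn

        D = 256
        cross_attn = nn.MultiheadAttention(D, num_heads=8, batch_first=True)

        queries = torch.randn(2, 196, D)     # this modality: projected tokens + mask
                                             # tokens (+ positional/modality embeddings)
        all_tokens = torch.randn(2, 588, D)  # projected tokens of all three modalities
        out, _ = cross_attn(query=queries, key=all_tokens, value=all_tokens)
        # `out` then passes through an MLP and two Transformer blocks, per the figure.
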
  100. Applying different modalities: MultiMAE [Bachmann+, ECCV'22]

    • Visualization of the reconstructed images
      - Across the three modalities, 5/6 of the total patches are randomly masked and reconstructed

    [Figure 1 (MultiMAE paper): masked inputs, MultiMAE predictions, and targets for RGB, depth, and semantic segmentation; 1/6 of all 16×16 patches across the modalities are kept visible and the remaining 5/6 are reconstructed. Validation examples from ImageNet.]

  101. Applying different modalities: MultiMAE [Bachmann+, ECCV'22]

    • Accuracy comparison on downstream tasks (models self-supervised on ImageNet-1K, then fine-tuned)

    Table 1 (MultiMAE paper). Fine-tuning with RGB only: top-1 accuracy on ImageNet-1K classification (C), mIoU on ADE20K, Hypersim, and NYUv2 semantic segmentation (S), and δ1 accuracy on NYUv2 depth (D). All methods pre-trained on ImageNet-1K (with pseudo labels for MultiMAE).
    Method     | IN-1K (C) | ADE20K (S) | Hypersim (S) | NYUv2 (S) | NYUv2 (D)
    Supervised | 81.8      | 45.8       | 33.9         | 50.1      | 80.7
    DINO       | 83.1      | 44.6       | 32.5         | 47.9      | 81.3
    MoCo-v3    | 82.8      | 43.7       | 31.7         | 46.6      | 80.9
    MAE        | 83.3      | 46.2       | 36.5         | 50.8      | 85.1
    MultiMAE   | 83.3      | 46.2       | 37.0         | 52.0      | 86.4

    Table 2 (MultiMAE paper). Fine-tuning with RGB and ground-truth depth: semantic segmentation mIoU. MultiMAE can effectively leverage additional modalities such as depth, while MAE cannot.
               | Hypersim (S)         | NYUv2 (S)
    Method     | RGB  | D    | RGB-D  | RGB  | D    | RGB-D
    MAE        | 36.5 | 32.5 | 36.9   | 50.8 | 23.4 | 49.3
    MultiMAE   | 37.0 | 38.5 | 47.6   | 52.0 | 41.4 | 56.0

    Table 3 (MultiMAE paper). Fine-tuning with RGB and pseudo labels: semantic segmentation mIoU with pseudo-labeled depth (pD) and semantic segmentation (pS) as extra inputs. MultiMAE benefits much more than MAE from pseudo-labeled modalities.
    ADE20K (S):
    Method   | RGB  | pD   | RGB-pD | RGB-pS | RGB-pD-pS
    MAE      | 46.2 | 20.0 | 46.3   | 46.2   | 46.3
    MultiMAE | 46.2 | 34.4 | 46.8   | 45.7   | 47.1
    Hypersim (S):
    Method   | RGB  | pD   | RGB-pD | RGB-pS | RGB-pD-pS
    MAE      | 36.5 | 21.0 | 36.9   | 37.7   | 37.3
    MultiMAE | 37.0 | 30.6 | 37.9   | 38.4   | 40.1
    NYUv2 (S):
    Method   | RGB  | pD   | RGB-pD | RGB-pS | RGB-pD-pS
    MAE      | 50.1 | 23.8 | 49.1   | 50.1   | 49.3
    MultiMAE | 52.0 | 39.9 | 53.6   | 53.5   | 54.0

    (pD): depth obtained by pseudo labeling; (pS): semantic segmentation obtained by pseudo labeling.
    Fine-tuning with all three modalities yields high recognition performance.

  102. Applying different modalities

    • MAE with video and audio data
      - MAE As Spatiotemporal Learners [Feichtenhofer+, NeurIPS'22]: masking applied to the video
      - MAE that Listen [Huang+, NeurIPS'22]: masking applied to the spectrogram

    [Figure 1 (MAE As Spatiotemporal Learners): a large subset (e.g. 90%) of random space-time patches is masked; the encoder operates on the set of visible patches, and a small decoder processes the full set of encoded patches and mask tokens to reconstruct the input. Apart from patch and positional embeddings, neither the encoder, the decoder, nor the masking strategy has any spatiotemporal inductive bias. With a 90% masking ratio the encoder's time and memory complexity drops below 1/10, giving a theoretical 7.7x reduction in computation and a measured 4.1x wall-clock speedup; on Kinetics-400 it improves ViT-L by an absolute 13% over training from scratch.]

    [Figure 1 (Audio-MAE): an audio recording is transformed into a spectrogram and split into patches; 80% of the patch embeddings are masked, the encoder operates on the visible 20%, and a decoder processes the order-restored embeddings and mask tokens to reconstruct the input, minimizing the MSE on the masked portion of the spectrogram.]

    [Figures 2-3 (video MAE): reconstructions on the Kinetics-400 validation set at 90% and 95% masking; video size 16x224x224 with 2x16x16 space-time patches, i.e. 8x14x14 = 1568 tokens, of which 156 are visible.]

  103. Applying different modalities

    • Effect of MAE As Spatiotemporal Learners (the video MAE)
      - Training-time comparison: MAE pre-training + fine-tuning vs. training from scratch

    [Figure 5 (video MAE paper): Kinetics-400 validation accuracy vs. wall-clock training time on 128 A100 GPUs; MAE pre-training (800 epochs) plus fine-tuning (100 epochs) is both faster and more accurate than 400 epochs of training from scratch (1-view and multi-view evaluation).]

    MAE pre-training + fine-tuning reaches high performance in a short training time.

  104. Summary

    • Self-supervised learning trains on large amounts of unlabeled data through pseudo problems (pretext tasks)
    • The resulting model is used as a pre-trained model
    • Self-supervised learning targeting CNNs
      - Representative methods: improved pretext tasks → contrastive learning → negative-free methods
      - Contrastive learning improved performance substantially
      - In some settings, such as object detection and segmentation, it surpasses supervised pre-training
    • Self-supervised learning targeting ViTs
      - Representative methods: contrastive learning and negative-free methods → Masked Image Modeling
      - Methods are designed to exploit the structure of the Vision Transformer