
The 59th Nagoya CV/PRML Study Group: ICCV 2023 Paper Introductions (Self-Supervised Learning)

Naoki Okamoto
November 20, 2023


Slides used for the ICCV 2023 paper introductions at the 59th Nagoya CV/PRML Study Group held on November 11. Six papers on self-supervised learning are introduced.


Transcript

  1. The 59th Nagoya CV/PRML Study Group
     ICCV 2023 Paper Introductions: Self-Supervised Learning
     Naoki Okamoto (Chubu University, Machine Perception and Robotics Group)
     http://mprg.jp/

  2. About Me

     Naoki Okamoto
     @naokok / https://github.com/naok
     Research topics: ensemble learning, knowledge distillation, and self-supervised learning
     Second-year doctoral student, Department of Robotic Science and Technology,
     Graduate School of Engineering, Chubu University
     Member of Hironobu Fujiyoshi's laboratory
     I will be introducing papers on self-supervised learning!

  3. Self-Supervised Learning (SSL)

     • Learn from large amounts of unlabeled data by solving an artificial problem (a pretext task)
     • A model trained with SSL is then used as a pretrained model

     ① Build a pretrained model from a large amount of unlabeled data
     ② Transfer / fine-tune the SSL pretrained model on the target task:
       attach a new head to the pretrained model and train it with labels
       (e.g. an FC head with labels such as "Pelican" for a classification model,
       or a detection head for an object detection model)

     [Figure: the two-step pipeline (unlabeled data → pretrained model → classification / detection models), plus a cropped illustration from the ViT paper showing patch-embedding principal components and representative attention examples]
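The two-step workflow on this slide (pretrain on unlabeled data, then reuse the encoder with a fresh task head) can be sketched in a few lines. Everything here is illustrative: the class names, the toy "encoder", and the pretext step are placeholders, not any paper's actual code.

```python
class Encoder:
    """Stand-in for a backbone such as a ViT; here it only scales its input."""
    def __init__(self):
        self.weight = 1.0

    def __call__(self, x):
        return [self.weight * v for v in x]

def pretrain(encoder, unlabeled_data):
    # Step 1: solve a pretext task on the unlabeled data (details omitted);
    # the only artifact we keep afterwards is the updated encoder.
    encoder.weight = 0.5  # pretend the pretext task moved the weights here
    return encoder

def build_classifier(encoder, num_classes):
    # Step 2: reuse the pretrained encoder and attach a fresh head
    # (the FC layer in the slide); the head would be fine-tuned on labels.
    def classify(x):
        feats = encoder(x)
        return feats.index(max(feats)) % num_classes
    return classify

encoder = pretrain(Encoder(), unlabeled_data=[[1.0, 2.0]] * 100)
classifier = build_classifier(encoder, num_classes=2)
```

The point of the sketch is the hand-off: step ① produces only encoder weights, and step ② decides what head sits on top of them.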

  4. Representative Self-Supervised Learning Methods

     • Masked Image Modeling (MIM)
       Predict the features of masked patches: BEiT [H. Bao+, ICLR 2022], BEiT v2 [Z. Peng+, arXiv 2022]
       Predict the pixels of masked patches: MAE [K. He+, CVPR 2022], SimMIM [Z. Xie+, CVPR 2022]
     • Contrastive Learning (CL)
       Compare image-level features extracted from multiple different views: MoCo v3 [X. Chen+, ICCV 2021], DINO [M. Caron+, ICCV 2021]

     [Figure: MAE training (masking → Encoder → mask-token insertion → Decoder → reconstruction loss) and DINO training (Local/Global views → student Encoder and exponential-moving-average teacher Encoder → MLP → Softmax with centering → loss)]
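A minimal sketch of the MIM idea in its MAE flavour: hide patches, predict their pixels, and score only the hidden ones. The `predict` argument stands in for the encoder-decoder; the mean-of-visible-patches "decoder" below is purely illustrative.

```python
import random

def mim_loss(patches, predict, mask_ratio=0.75, seed=0):
    """Hide a random subset of patches, ask `predict` to reconstruct them,
    and compute the mean squared error on the masked patches only."""
    rng = random.Random(seed)
    masked = sorted(rng.sample(range(len(patches)), int(len(patches) * mask_ratio)))
    visible = [p for i, p in enumerate(patches) if i not in masked]
    preds = predict(visible, masked)  # reconstructions for the hidden patches
    errors = [(pred - patches[i]) ** 2 for pred, i in zip(preds, masked)]
    return sum(errors) / len(errors)

# Toy "decoder": predict every masked patch as the mean of the visible ones.
mean_decoder = lambda visible, masked: [sum(visible) / len(visible)] * len(masked)
loss = mim_loss([0.0, 1.0, 0.0, 1.0], mean_decoder, mask_ratio=0.5)
```

Feature-predicting variants (BEiT) swap the pixel targets for tokenized features, but the masked-only loss has the same shape.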

  5. Representative Self-Supervised Learning Methods: SSL for ViT

     • Masked Image Modeling (MIM): predict the features (BEiT, BEiT v2) or the pixels (MAE, SimMIM) of masked patches
     • Contrastive Learning (CL): compare image-level features from multiple different views (MoCo v3, DINO)
     • Multimodal CL: CL on image-text pairs: CLIP [A. Radford+, ICML 2021]
     • Multimodal MIM: MIM on RGB, depth, and semantics simultaneously: MultiMAE [R. Bachmann+, ECCV 2022]
     • Hybrid: train CL and MIM simultaneously: iBOT [J. Zhou+, ICLR 2022], DINOv2 [M. Oquab+, arXiv 2023]

     [Figure: the same MAE and DINO training diagrams as on the previous slide]
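The multimodal CL idea (CLIP) reduces to comparing image and text embeddings pairwise; matched pairs should score highest in their row. A toy sketch with hand-made embeddings (the vectors and the temperature value are arbitrary):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def clip_logits(image_embs, text_embs, temperature=0.07):
    """All pairwise image-text similarities; CLIP's contrastive loss pushes
    the diagonal (matched pairs) above the rest of each row and column."""
    return [[cosine(im, tx) / temperature for tx in text_embs] for im in image_embs]

images = [[1.0, 0.0], [0.0, 1.0]]   # toy image embeddings
texts = [[0.9, 0.1], [0.1, 0.9]]    # captions in the same order
logits = clip_logits(images, texts)
```

In the real model the embeddings come from separate image and text encoders, and a symmetric cross-entropy over rows and columns supplies the gradient.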

  6. Representative Self-Supervised Learning Methods

     https://speakerdeck.com/naok/self-supervised-learning

  7. ICCV 2023 Paper Survey

     • Searched for papers whose titles contain keywords related to self-supervised learning,
       tallying hits for "Self-supervised", "Contrastive", "Masked" / "MAE", "CLIP", and "Open Vocabulary"
     • Introduces the papers Okamoto found interesting or whose approaches do not appear in prior work

  8. Papers Introduced

     • Scaling up the dataset yields a stronger pretraining effect:
       The Effectiveness of MAE Pre-Pretraining for Billion-Scale Pretraining
     • Improving a foundation model by distillation from a foundation model (self-distillation):
       Unmasked Teacher: Towards Training-Efficient Video Foundation Models
       Improving CLIP Fine-tuning Performance
     • A new prompt focused on the image side of CLIP models:
       What does CLIP know about a red circle? Visual prompt engineering for VLMs
     • Teaching CLIP models new relationships through additional training:
       CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say No
       Teaching CLIP to Count to Ten

  9. The Effectiveness of MAE Pre-Pretraining for Billion-Scale Pretraining [M. Singh+, ICCV 2023]

     • Investigates how well MAE pretraining works at billion-sample scale,
       using billions of Instagram images with weak labels (the hashtags attached to each post)
     • Two-stage pretraining, MAE → weakly supervised learning, gives a stronger pretraining effect;
       both the MAE and the weakly supervised stages are trained for only a small number of epochs
     • Evaluation: freeze the image model, train a language model in CLIP style, and evaluate the image model zero-shot

     From the paper: "ViT models at various scales in terms of number of parameters, including ViT-B (86M), ViT-L (307M), and ViT-H (632M). We also train on larger 1.9B and 6.5B parameter ViT models, which we call ViT-2B and ViT-6.5B, respectively (Appendix Table 8). As is common practice, we train models of sizes ViT-B, ViT-L with a patch size of 16 and larger models with a patch size of 14. We pretrain with a 224 × 224 resolution for all models. Pre-pretraining (MAE) learns visual representations from image datasets without using any labels."

     Table 1: Evaluation datasets used to evaluate MAE→WSP on image classification, object detection, and video action recognition

       Dataset                        Task               #cls   #train   #val
       ImageNet-1k (IN1k)             Image cls.         1000   1M       50K
       iNaturalist-18 (iNat18)        Fine-grained cls.  8142   437K     24K
       ImageNetv2 (INv2)              Image cls.         1000   –        10K
       ImageNet-ReaL (IN-ReaL)        Image cls.         1000   –        50K
       ObjectNet (ON)                 Image cls.         113    –        19K
       Food-101 (F-101)               Image cls.         101    N/A      25K
       COCO                           Obj. det.          80     118K     5K
       LVIS                           Obj. det.          1K     100K     20K
       Kinetics-400 (K400)            Action cls.        400    220K     20K
       Something Something-v2 (SSv2)  Action cls.        174    169K     25K
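The zero-shot evaluation protocol in the last bullet can be sketched as: embed each class name with the CLIP-style text encoder, then classify a frozen image feature by nearest cosine similarity. The embeddings below are hand-made stand-ins for encoder outputs.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def zero_shot_classify(image_feature, class_text_features):
    """Pick the class whose text embedding is most cosine-similar to the
    frozen image model's feature; no classifier head is trained."""
    img = normalize(image_feature)
    scores = [sum(a * b for a, b in zip(img, normalize(t)))
              for t in class_text_features]
    return max(range(len(scores)), key=scores.__getitem__)

# toy text embeddings for prompts like "a photo of a {class}"
class_texts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
pred = zero_shot_classify([0.1, 0.9, 0.0], class_texts)
```

Because only the text side is trained, this protocol measures the quality of the frozen image features themselves.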

  10. The Effectiveness of MAE Pre-Pretraining for Billion-Scale Pretraining [M. Singh+, ICCV 2023] (cont.)

      From the authors at the poster session:
      • "It does not converge, no matter how long you train."
      • "A small number of epochs already gives sufficient performance ('reasonably enough')."

  11. Unmasked Teacher: Towards Training-Efficient Video Foundation Models [K. Li+, ICCV 2023]

      • Image foundation models are difficult to apply directly to video
      • With an image foundation model as the teacher, a video foundation model is trained from scratch with self-supervised learning:
        Stage 1: train so that the features of the unmasked patches match the teacher's
        Stage 2: train with video-language contrastive learning and token matching, plus masked language modeling on the text
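Stage 1 above aligns the student's unmasked-token features with the image teacher's. A minimal sketch, assuming a plain MSE on visible tokens (the method's actual architecture and loss details are not reproduced here):

```python
def unmasked_alignment_loss(student_tokens, teacher_tokens, visible_indices):
    """Only tokens left unmasked for the student are pushed toward the
    teacher's corresponding token features; masked tokens are ignored."""
    total, count = 0.0, 0
    for i in visible_indices:
        for s, t in zip(student_tokens[i], teacher_tokens[i]):
            total += (s - t) ** 2
            count += 1
    return total / count

student = [[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]]
teacher = [[0.0, 0.0], [2.0, 2.0], [9.0, 9.0]]  # token 2 is masked for the student
loss = unmasked_alignment_loss(student, teacher, visible_indices=[0, 1])
```

Restricting the loss to visible tokens is what makes the teacher "unmasked": the student sees a masked video, the teacher sees the full frames.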

  12. Improving CLIP Fine-tuning Performance [Y. Wei+, ICCV 2023]

      • CLIP image models reach lower fine-tuning accuracy than MIM-trained models;
        the paper attributes this to differences between the CLIP and MIM training recipes
        (input, target, loss, and the presence of semantic labels)
      • Proposes FD-CLIP, a training method that keeps CLIP's semantic information while incorporating MIM's properties:
        with the CLIP image model as teacher, a student is trained from scratch by intermediate-layer distillation of every token
        → fine-tuning performance improves while CLIP's zero-shot performance is retained

      Table 3: Ingredients comparison between CLIP and MIM methods from the perspective of input ratios, training target granularity and loss format

        Method    Input     Target        Loss            Semantics
        BeiT      Partial   Token-level   Cross-entropy
        MAE       Partial   Token-level   Regression
        CLIP      Full      Image-level   Cross-entropy   ✓
        FD-CLIP   Full      Token-level   Regression      ✓

      [Figure: differences between prior MIM methods and CLIP training; the FD-CLIP training pipeline]
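FD-CLIP's recipe per Table 3 (full input, token-level target, regression loss) can be sketched as token-wise regression onto the frozen CLIP teacher. The smooth-L1 form and the toy features are assumptions for illustration, not the paper's exact loss:

```python
def smooth_l1(x, beta=1.0):
    """Quadratic near zero, linear in the tails (a common regression loss)."""
    ax = abs(x)
    return 0.5 * x * x / beta if ax < beta else ax - 0.5 * beta

def fd_style_loss(student_tokens, teacher_tokens):
    """Regress every student token feature onto the frozen CLIP teacher's
    corresponding token feature: token-level target plus regression loss."""
    total, count = 0.0, 0
    for s_tok, t_tok in zip(student_tokens, teacher_tokens):
        for s, t in zip(s_tok, t_tok):
            total += smooth_l1(s - t)
            count += 1
    return total / count

loss = fd_style_loss([[0.0, 2.0]], [[0.0, 0.0]])
```

Since the teacher stays frozen, its semantic (zero-shot) knowledge survives, while the dense token-level targets give the student MIM-like supervision.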

  13. What does CLIP know about a red circle? Visual prompt engineering for VLMs [A. Shtedritski+, ICCV 2023]

      • Studies prompt engineering focused on the image modality in CLIP
      • Proposes, as an image-side prompt, drawing a red circle around an object in the image
      • Adding the red circle to the image raises the similarity to text related to the object inside the circle

      [Figure: red-circle prompting applied to referring-expression comprehension ("the cub on the right"), classification ("this is a cat / bird"), and keypoint naming ("the ear / eye / nose of a bear"), each answered by a VLM]
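The visual prompt is literally drawn in pixel space. A toy sketch on a nested-list "image" (a real experiment would edit an actual photo and then run CLIP on it):

```python
def add_red_circle(image, cx, cy, radius, thickness=1):
    """Draw a red ring around (cx, cy): pixels whose squared distance from
    the centre falls within the ring's band are set to pure red."""
    out = [row[:] for row in image]  # leave the input image untouched
    for y, row in enumerate(out):
        for x in range(len(row)):
            d2 = (x - cx) ** 2 + (y - cy) ** 2
            if (radius - thickness) ** 2 <= d2 <= (radius + thickness) ** 2:
                row[x] = (255, 0, 0)
    return out

black = [[(0, 0, 0)] * 9 for _ in range(9)]
prompted = add_red_circle(black, cx=4, cy=4, radius=3)
```

The striking part of the paper is that no model change is needed: the marked image is fed to CLIP exactly like any other image.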

  14. CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say No [H. Wang+, ICCV 2023]

      • Studies zero-shot out-of-distribution (OOD) detection for images using CLIP class-name prompts
      • Proposes OOD detection with both positive and negative prompts:
        introduces "no" class prompts and a "no" text encoder initialized with the weights of CLIP's text encoder
      • The "no" text encoder is fine-tuned using the CC3M dataset and "no" texts
      • OOD detection uses the similarities to the class prompts and to the "no" class prompts

      [Figure: (a) the feature space of standard methods, where hard-to-distinguish OOD samples overlap the in-domain classes, vs. (b) the feature space of CLIPN with a separate "no"-logit region; the pipeline pairs standard class prompts ("a photo of a cat") with learnable "no" class prompts ("a photo of no cat") and combines the ID probabilities with the saying-"no" probabilities via two decision rules, competing-to-win and agreeing-to-differ]
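The figure's "agreeing-to-differ" combination can be sketched roughly as follows. The exact formulation in the paper differs in details, so treat this as an illustration of the idea: class probabilities are discounted by the "no" probabilities, and the leftover mass votes for an extra OOD class.

```python
def clipn_style_ood(id_probs, no_probs):
    """Down-weight each in-distribution class probability by its "no"
    probability; the mass that remains unclaimed becomes an OOD class."""
    kept = [p * (1.0 - q) for p, q in zip(id_probs, no_probs)]
    p_ood = 1.0 - sum(kept)
    return p_ood > max(kept), p_ood

# numbers loosely follow the slide's figure (cow / cat / fish, OOD image)
is_ood, p_ood = clipn_style_ood(id_probs=[0.25, 0.5, 0.25],
                                no_probs=[0.6, 0.8, 0.6])
```

When the "no" encoder agrees that every class name is absent, no class keeps enough mass and the sample is flagged as OOD; a confident ID prediction with a low "no" probability keeps its mass and wins.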

  15. Teaching CLIP to Count to Ten [R. Paiss+, ICCV 2023]

      • CLIP is weak at text about the number of objects in an image:
        creates training data whose captions carry the correct object count, and
        proposes CountBench, a new benchmark for evaluating a model's counting ability
      • Designs a training objective that distinguishes captions with the correct count from captions with a wrong count, and fine-tunes the CLIP model with it

      [Figure: (b) "Teaching CLIP to count": swapping the number in a caption yields a counterfactual caption ("Four running horses..." → "Three running horses..."), giving a counting loss on the counting subset alongside the standard CLIP loss on the non-counting subset; (a) image-retrieval results and relevancy maps for queries like "Five dogs" comparing the original CLIP model with the fine-tuned one]
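The counting objective in the last bullet contrasts the true caption against its number-swapped counterfactual. A minimal sketch of such a two-way contrastive term (the similarity values below are arbitrary):

```python
import math

def counting_loss(sim_correct, sim_counterfactual):
    """Softmax-style contrastive term: the image should match the caption
    with the correct count better than the number-swapped counterfactual."""
    num = math.exp(sim_correct)
    return -math.log(num / (num + math.exp(sim_counterfactual)))

well_trained = counting_loss(sim_correct=5.0, sim_counterfactual=1.0)
confused = counting_loss(sim_correct=1.0, sim_counterfactual=5.0)
```

The loss only shrinks when the model's similarity to the correct count exceeds the counterfactual one, which is exactly the discrimination ability the slide describes.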

  16. Summary

      • Scaling up the dataset yields a stronger pretraining effect:
        The Effectiveness of MAE Pre-Pretraining for Billion-Scale Pretraining
      • Improving a foundation model by distillation from a foundation model (self-distillation):
        Unmasked Teacher: Towards Training-Efficient Video Foundation Models
        Improving CLIP Fine-tuning Performance
      • A new prompt focused on the image side of CLIP models:
        What does CLIP know about a red circle? Visual prompt engineering for VLMs
      • Teaching CLIP models new relationships through additional training:
        CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say No
        Teaching CLIP to Count to Ten