
Rethinking ABN with a Coding Agent



A record of revisiting ABN, a CVPR 2019 paper, using Claude Code, together with some reflections (as of May 4, 2026).


Hironobu Fujiyoshi

May 04, 2026



Transcript

1. • The trigger: a post on X by Alex Kendall (CEO at Wayve)

   [Screenshot of the first pages of "Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics", Alex Kendall, Yarin Gal, Roberto Cipolla (arXiv:1705.07115). The CVPR paper learns each task's homoscedastic uncertainty to weight the losses, jointly improving scene geometry estimation and semantic understanding.]

   His conclusions:
   • Coding agents are transformative not only for software engineering but also for research.
   • Detailed implementation, visualization, and setting up the coding environment are now abstracted away.
   • The bottleneck is no longer coding but asking better questions, which makes research feel faster and more enjoyable.

   "I uploaded the PDF to Codex and instructed GPT: 'Re-implement this paper as a toy task on the MNIST dataset. Make it multi-task.' Incredibly, it generated a zero-shot implementation that worked correctly, including our novel uncertainty-weighted loss. Next, when I asked the agent to optimize the training parameters to improve performance, it tuned the learning rate, weight decay, warmup and so on and squeezed out a further improvement."
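Since the post is about re-implementing this paper, a minimal sketch of its uncertainty-weighted multi-task loss may help; this uses the commonly seen simplified form with learnable log-variances, and the class and variable names are mine, not taken from the authors' code.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Homoscedastic-uncertainty weighting (Kendall et al.): each task loss L_i
    is scaled by exp(-s_i) and a regularizer s_i is added, where s_i is a
    learnable log-variance. Tasks with high estimated noise get down-weighted."""
    def __init__(self, num_tasks: int):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses):
        total = torch.zeros((), device=self.log_vars.device)
        for i, loss in enumerate(task_losses):
            total = total + torch.exp(-self.log_vars[i]) * loss + self.log_vars[i]
        return total

# Usage sketch: weigh a classification loss and a regression loss.
# criterion = UncertaintyWeightedLoss(num_tasks=2)
# total_loss = criterion([ce_loss, depth_mse_loss])
```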
2. • One of my favorite papers at MPRG
   • A CVPR oral paper
   • Authors: Hiroshi Fukui, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi

   • Citations: … (Google Scholar, as of …)
   • Content: a CNN with an attention mechanism that improves accuracy while visualizing its decision rationale as an attention map.
   • We want to revisit ABN with a coding agent!
     Q1: Can a coding agent reproduce the work from the paper PDF alone?
     Q2: How far can ABN be improved with the knowledge gained since 2019?
     Q3: Which parts of the research process are automated, and where is human judgment still needed?

   Attention Branch Network [Fukui+, CVPR 2019]
   [Screenshot of the first page of "Attention Branch Network: Learning of Attention Mechanism for Visual Explanation", Hiroshi Fukui, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi, Chubu University (arXiv:1812.10025). ABN extends a response-based visual-explanation model (CAM) with an attention branch, is trainable end to end for both visual explanation and image recognition, and outperforms baseline models on image classification, fine-grained recognition, and multiple facial attribute recognition while generating attention maps. Code: https://github.com/machine-perception-robotics-group/attention_branch_network]
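To make the ABN structure concrete, here is a minimal PyTorch sketch of the idea: an attention branch that produces both class scores and a one-channel attention map, and a perception branch that classifies the attention-weighted features. The layer sizes and names are illustrative and simplified, not the official implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyABN(nn.Module):
    """Illustrative ABN-style model: feature extractor, attention branch, perception branch."""
    def __init__(self, num_classes: int = 10, channels: int = 64):
        super().__init__()
        self.extractor = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
        )
        # Attention branch: per-class response maps, pooled for its own prediction,
        # and collapsed to a single attention map.
        self.att_conv = nn.Conv2d(channels, num_classes, kernel_size=1)
        self.att_map = nn.Sequential(
            nn.Conv2d(num_classes, 1, kernel_size=1), nn.BatchNorm2d(1), nn.Sigmoid(),
        )
        # Perception branch: classifies the attention-weighted features.
        self.perception = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(channels, num_classes),
        )

    def forward(self, x):
        feat = self.extractor(x)
        cam = self.att_conv(feat)                               # class response maps
        att_logits = F.adaptive_avg_pool2d(cam, 1).flatten(1)   # attention-branch prediction
        m = self.att_map(cam)                                   # (B, 1, H, W) attention map
        per_logits = self.perception(feat * m + feat)           # attention mechanism: (1 + M(x)) * f(x)
        return att_logits, per_logits, m

# Training combines both branch losses, e.g.:
# att_logits, per_logits, _ = model(images)
# loss = F.cross_entropy(att_logits, labels) + F.cross_entropy(per_logits, labels)
```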
3. • Setup
   • Coding agent: Claude Code
   • Compute: DGX Spark
   • Configuration of the research-support coding agent
     Components in the diagram: Claude app, Visual Studio Code, SSH, and Claude Code running remotely on DGX Spark.

     From a smartphone: check experiment progress and give instructions at any time.
     From a PC over SSH: initial setup, project creation, instructions and approvals.
4. • Goal
   • Set up Node.js, the Claude Code CLI, and tmux so that Claude Code can be used on DGX Spark.
   • Steps
     1. After connecting via SSH, install nvm and the Node.js LTS release.
     2. Install the Claude Code CLI with npm install -g @anthropic-ai/claude-code.
     3. Create a work session with tmux new -s research-agent.
     4. Launch Claude Code with the claude command.

     5. Authenticate with a Claude account and complete the terms-of-use and safety confirmations.
   https://github.com/KoyoImai/AI_Research_Agent/tree/main/Step…
   Step 1: setting up the Claude Code environment

   [Terminal capture: mprg@spark-09ef:~$ claude launches Claude Code v2.1.117 (Sonnet 4.6, Claude Pro) with the usual welcome screen and a tip to run /init to create a CLAUDE.md instruction file.]
5. • Goal
   • Prepare the Docker environment, the research folder layout, and a first step toward introducing MCP.
   • Steps
     1. Create the working directory.
     2. Launch Claude in the working directory.
     3. Create the CLAUDE.md instruction file with the /init command.
     4. Create the Docker container.

     5. Install an MCP server (filesystem MCP).
   https://github.com/KoyoImai/AI_Research_Agent/tree/main/Step…
   Step 2: preparing to build the agent

   Working directory layout:
   ~/research/project1/
   ├── papers/            # paper PDFs
   ├── experiments/       # experiment management
   ├── outputs/
   │   ├── figures/       # visualization results
   │   └── notes.md       # experiment notes
   └── .claude/
       └── commands/      # custom commands
6. • Goal
   • Build the research-support agent and use it to reproduce the ABN paper.
   • Steps
     1. Design the CLAUDE.md instruction file.
     2. Define the agent's behavioral principles.
     3. Set up custom commands.

     /implement: supports the whole path from a paper PDF to a reproduction implementation.
     Verification experiment on the ABN paper:
     1. Place the ABN paper PDF at papers/abn.pdf.
     2. Run /implement papers/abn.pdf.
     3. Implement a ResNet-based ABN for CIFAR-10.
     4. Evaluate the attention maps, top-1 accuracy, and the training curves.
   https://github.com/KoyoImai/AI_Research_Agent/tree/main/Step…
   Step 3: building the research-support agent and verifying it works

   Instruction file: CLAUDE.md
   # Research AI Agent for semi-automated research
   ## Project overview
   Build an AI agent that semi-automatically runs the loop "hypothesis → implementation/execution → result analysis/improvement", based on the Attention Branch Network (ABN).
   ## Environment
   - Execution environment: DGX Spark (NVIDIA GB10 / Blackwell GPU)
   - Docker container: research-dev (always running, reused)
   - Container operations: docker exec research-dev python ...
   - Mount: ~/research/project1 ↔ /workspace
   ## Behavioral principles
   1. Before running any experiment, always create experiments/exp_NNN/design.md and obtain the user's approval
   2. Never run an experiment without approval
   3. When reporting results, always propose several follow-up questions and improvement ideas
   4. After reading a new paper, leave a summary in papers/
   5. Prefer lightweight datasets such as CIFAR-10/CIFAR-100; use ImageNet only in the final stage
   ## Current research
   - Paper: Attention Branch Network (ABN), arXiv:1812.10025
   - Phase: implementation and verification
   - Datasets: CIFAR-10 (first priority), CIFAR-100 (second priority)
   ## Directory layout
   - papers/: paper PDFs and summaries
   - experiments/exp_NNN/: one directory per experiment
     - design.md: experiment design document (approval gate)
     - run.py: experiment code
     - results/: experiment results
   - outputs/figures/: visualization results
   - outputs/notes.md: experiment notes
   - .claude/commands/: custom slash commands
7. exp_001: initial ABN implementation
   Prompt: train and evaluate ResNet (baseline) and ResNet + ABN under identical conditions, and quantitatively compare the effect of ABN.

   Results (CIFAR-10, ResNet):
   Model              | Error | Acc | Δ
   Baseline           |   …   |  …  | -
   ABN                |   …   |  …  | (no effect)
   ABN (paper value)  |   …   |  …  | -

   Discussion and next step:
   • Details of the official implementation, such as the Bottleneck block (not used here), bn_att, and att_conv, are not reflected.
   → Fix the implementation by consulting the official GitHub repository (exp_002).
   → Unfortunately, zero-shot reproduction failed.
8. exp_001: initial ABN implementation (continued)
   • Asking the agent why it could not reproduce the results zero-shot...
   Conclusion: not an error in the paper, but insufficient granularity in its description.
   "The content of the paper (PDF) is accurate, but the information is insufficient to fully reproduce the implementation."
   1. What the paper does describe: the conceptual structure of the Attention Branch (branch position, sigmoid, loss function) and the training settings (SGD,

   epochs, LR, etc.).
   2. Implementation details the paper does not describe (found in the GitHub code).
   "This is a problem common to deep-learning papers in general: a paper explains what to do and usually omits the details of how to implement it. It is not unusual that reproduction only succeeds after consulting the official code."

   Item               | exp_001 implementation | GitHub
   Block type         | BasicBlock             | Bottleneck
   BN                 | absent                 | present
   attention conv     | 1×1                    | 1×1 → BN → sigmoid
   weight_decay       | …e-…                   | …e-…
   Size of att_layer  | 1 block                | 18 blocks
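The attention-conv row in the table above is exactly the kind of detail that only shows up in code; a small sketch of what the two variants might look like (the channel counts are placeholders, not taken from the official repository):

```python
import torch.nn as nn

num_classes = 10  # placeholder

# exp_001-style head as the table describes it: a bare 1x1 convolution.
naive_att_head = nn.Sequential(
    nn.Conv2d(num_classes, 1, kernel_size=1),
)

# Official-style head per the table: 1x1 convolution -> BatchNorm -> sigmoid,
# which bounds the attention map to [0, 1] and stabilizes its scale in training.
official_att_head = nn.Sequential(
    nn.Conv2d(num_classes, 1, kernel_size=1),
    nn.BatchNorm2d(1),
    nn.Sigmoid(),
)
```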
9. exp_002: implementation following the official code
   Prompt: consult the official GitHub repository (machine-perception-robotics-group), fix the implementation, and rerun the experiment.

   Results (CIFAR-10, ResNet):
   Model     | Error | Acc | Δ
   Baseline  |   …   |  …  | -
   ABN       |   …   |  …  | (the effect is now visible)

   Main fixes:
   Item            | exp_001      | exp_002 (fixed)
   Block type      | BasicBlock   | Bottleneck (depth ≥ …)
   BN after att    | absent       | present
   attention conv  | 1×1          | 1×1 → BN → sigmoid
   weight_decay    | …e-…         | …e-…
   batch_size      | …            | …

   Discussion and next step:
   → Aim for further accuracy gains with Cutout, an LR scheduler, and regularization (exp_003).
   → Reproduction succeeded!
10. exp_003: hyperparameter search
    Prompt: combine Cutout, Cosine Annealing, and Label Smoothing to improve accuracy. Target error: …%
    → Target (…%) achieved! (CIFAR-10, ResNet)

    Config design:
    Config | Changes
    C1     | add Cutout (…×…)
    C2     | Cosine Annealing + …-epoch warmup
    C3     | C1 + C2 (Cutout + Cosine)
    C4     | C3 + Label Smoothing

    Results:
    Config    | Error | Acc | vs exp_002 ABN
    C1        |   …   |  …  | …
    C2        |   …   |  …  | …
    C3        |   …   |  …  | …
    C4 (best) |   …   |  …  | …

    Discussion and next step:
    → Starting from the C4 settings, introduce a new idea to push ABN accuracy further (exp_004).
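A sketch of how the three exp_003 ingredients combine in a typical PyTorch training setup; the crop/erasing sizes, smoothing factor, learning rate, and epoch count are placeholders, since the slide's actual values were not preserved, and resnet18 stands in for the ABN model.

```python
import torch
import torch.nn as nn
import torchvision.transforms as T
from torchvision.models import resnet18

# Cutout-style occlusion via RandomErasing (applied after ToTensor).
train_transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
    T.RandomErasing(p=1.0, scale=(0.1, 0.25)),
])

model = resnet18(num_classes=10)                          # stand-in for the ABN model
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)      # Label Smoothing
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
num_epochs = 200                                          # placeholder
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)  # Cosine Annealing

# Per epoch: run the training loop, then call scheduler.step().
```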
11. exp_004: Diverse ABN (diversity constraint)
    Prompt: extend ABN to two Attention Branches and make them attend to different regions by minimizing the cosine similarity between their maps (weight λ = …). Improve accuracy with ensemble inference. Target error: …%
    → Fell just short of the target (…%), but the idea seems to help. (CIFAR-10, ResNet)

    Architecture:
    backbone
     ├─ AttentionBranch_1 → map_1 → pred_att_1 ─┐
     ├─ AttentionBranch_2 → map_2 → pred_att_2 ─┤ diversity loss (cosine similarity)
     └─ PerceptionBranch (weighted by map_1 + map_2) ─┘
    final_pred = (pred_att_1 + pred_att_2 + pred_per) / 3

    Results:
    Config                               | Error | Acc | vs C4
    C4: ABN (original)                   |   …   |  …  | …
    D1: Diverse ABN (2-branch ensemble)  |   …   |  …  | -

    Discussion and next step:
    → Extend to CIFAR-100 to verify the effect (exp_005).
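The diversity constraint in the architecture sketch can be written compactly; a minimal version, where lambda_div is a placeholder weight for the cosine-similarity term:

```python
import torch
import torch.nn.functional as F

def diversity_loss(map_1: torch.Tensor, map_2: torch.Tensor) -> torch.Tensor:
    """Mean cosine similarity between two flattened attention maps (B, 1, H, W).
    Minimizing it pushes the two branches to attend to different regions."""
    return F.cosine_similarity(map_1.flatten(1), map_2.flatten(1), dim=1).mean()

# Total loss following the sketch (ce = F.cross_entropy, lambda_div is a placeholder):
# loss = ce(pred_att_1, y) + ce(pred_att_2, y) + ce(pred_per, y) \
#        + lambda_div * diversity_loss(map_1, map_2)
# Inference: final_pred = (pred_att_1 + pred_att_2 + pred_per) / 3
```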
12. exp_005: extension to CIFAR-100
    Prompt: apply D (Diverse ABN), the best configuration on CIFAR-10, to CIFAR-100 and check its effect with 10× as many classes.

    Experimental setup: inherit the exp_004 Diverse ABN settings unchanged; the only change is the FC output size, 10 → 100 classes. (CIFAR-100, ResNet)

    Results:
    Model            | Error | Acc | Δ
    Baseline         |   …   |  …  | -
    ABN              |   …   |  …  | …
    Diverse ABN      |   …   |  …  | large improvement

    Discussion and next step:
    • Compared with CIFAR-10 (…%), the gain on CIFAR-100 (…%) is much larger.
    • This suggests that the more classes there are, the more important diverse attention becomes.
    → Toward further accuracy gains, improve the ensemble-inference method (exp_006, exp_007).
13. exp_006: confidence-based dynamic weighting
    Prompt: use each branch's output confidence (maximum probability or negative entropy) as a per-sample weight to improve the quality of the ensemble. No additional training.

    Results (CIFAR-10 ResNet, CIFAR-100 ResNet):
    Dataset    | W0 (uniform) | W1 (MaxProb) | W2 (negative entropy)
    CIFAR-10   |      …       |  … ↓         |  … ↓
    CIFAR-100  |      …       |  … ↑ worse   |  … ↑ worse
    → Effective on CIFAR-10, not effective on CIFAR-100.

    Discussion and next step:
    • On CIFAR-100 the results actually got worse: because the branches share a backbone, their outputs are highly correlated and confidence weighting brings no benefit.
    • → Try TTA, which adds diversity on the input side (exp_007).
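A sketch of the per-sample confidence weighting described in the prompt; how the weights are normalized across branches is not stated on the slide, so the softmax over branches here is an assumption.

```python
import torch
import torch.nn.functional as F

def confidence_weighted_ensemble(logits_list, mode="maxprob"):
    """Combine branch outputs with per-sample confidence weights.
    mode="maxprob": confidence = max class probability (W1)
    mode="negent":  confidence = negative entropy (W2)"""
    probs = [F.softmax(l, dim=1) for l in logits_list]                        # each (B, C)
    if mode == "maxprob":
        conf = torch.stack([p.max(dim=1).values for p in probs])              # (K, B)
    else:
        conf = torch.stack([(p * p.clamp_min(1e-8).log()).sum(dim=1) for p in probs])
    weights = F.softmax(conf, dim=0)                                          # normalize over branches (assumption)
    return (weights.unsqueeze(-1) * torch.stack(probs)).sum(dim=0)            # (B, C)
```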
14. exp_007: Test-Time Augmentation (TTA)
    Prompt: to work around the high correlation between branches, try TTA, which adds diversity on the input side. No additional training; reuse the existing checkpoints as they are.

    Config design:
    Config | Content
    T0     | no TTA (baseline)
    T1     | average over horizontal flips (… views)
    T2     | crop × flip (… views)

    Results (CIFAR-10 ResNet, CIFAR-100 ResNet):
    Dataset    | T0 (base) | T1 (flip) | T2 (crop + flip)
    CIFAR-10   |     …     |   … ↓     |   … ↓
    CIFAR-100  |     …     |   … ↓     |   … ↓
    → The gains are small, but the effect is confirmed on both CIFAR-10 and CIFAR-100.
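A minimal sketch of the T1-style TTA, assuming a model that returns class logits (the actual Diverse ABN returns several outputs, which would be averaged the same way); T2 would add shifted crops as extra views.

```python
import torch

@torch.no_grad()
def tta_predict(model, x):
    """Average softmax outputs over the original image and its horizontal flip."""
    views = [x, torch.flip(x, dims=[3])]                     # original + horizontal flip
    probs = [torch.softmax(model(v), dim=1) for v in views]
    return torch.stack(probs).mean(dim=0)
```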
15. Comparing the research process: 2019 vs 2026
    • How does the research process differ between 2019 (when the original paper was written) and 2026?

    Research phase         | 2019                                                               | 2026
    Implementation         | A student spends several weeks; consulting the paper and the official code is manual work | The agent implements it in a few hours; paper PDF → code is semi-automated
    Hyperparameter search  | Manual, based on rules of thumb; one trial verified at a time     | The agent searches systematically; Cutout / Cosine / LS combined methodically
    Idea verification      | One hypothesis verified carefully; failure is costly              | Several ideas verified in parallel; Diverse ABN, confidence weighting, and TTA explored side by side
    Author's role          | Implementation + design + judgment (coding is the main work)      | Focus on framing questions + judgment (coding is abstracted away)
    Timeline (rough)       | Several months to half a year to a paper                          | About 1 week to a draft
    Bottleneck             | Implementation skill, compute, time                               | The ability to ask good questions (prompt design / value judgment)

    The research bottleneck is shifting from coding to prompting.
    → Implementation is abstracted away, and the researcher's essential role converges on framing questions and making value judgments.
16. • Q1: Can a coding agent reproduce the work from the paper PDF alone?
    Answer: partially yes; a complete reproduction, no.
    1. What the paper describes (the branch structure of the Attention Branch, the loss function, and training settings such as SGD, epochs, and LR) was implemented correctly zero-shot

    (exp_001).
    2. However, because of the limited granularity of the paper's description, the accuracy could not be reproduced (Acc …% vs. the paper's …%).
    3. Reproduction only succeeded after consulting the official GitHub code (exp_002, Acc …%).
    Implication: papers describe what to do, but the details of how to implement it (BasicBlock vs. Bottleneck, the kernel size of the attention conv, weight_decay, and so on) are usually omitted.
    → The wall in zero-shot reproduction lies not in the agent's ability but in the conventions of how papers are written.
    Summary: rethinking ABN with a coding agent
17. • Q2: How far can ABN be improved with the knowledge gained since 2019?
    Answer: the error was reduced to about …% on CIFAR-10 and about …% on CIFAR-100.
    1. Combining existing techniques (Cutout + Cosine Annealing + Label Smoothing)

    brought a large improvement (exp_003).
    2. A new idea, Diverse ABN (multiple attention branches with a diversity constraint), improved things further (exp_004, exp_005).
    3. Especially on CIFAR-100 the gain was … pt: the more classes there are, the more pronounced the benefit of diverse attention.
    Implication: given ideas the authors themselves had not thought of at the time, such as a diversity constraint over multiple branches, the agent proposed an appropriate implementation. For revisiting a method, an outside perspective unconstrained by the assumptions of the time is effective.
    Summary: rethinking ABN with a coding agent
18. • Q3: Which parts of the research process are automated, and where is human judgment needed?
    Answer: implementation and search can be automated; framing the questions and judging value remain the human's role.
    1. What the agent is good at (can be automated):
    • Implementation from a paper PDF (with the official code as a reference)
    • Planning and running hyperparameter searches

    • Designing ablation experiments and running them in parallel
    • Aggregating results, visualization, and drafting the LaTeX manuscript
    2. Where human judgment is needed (hard to automate):
    • Framing the research question: what should be revisited, and where the gaps are
    • Steering new ideas: on its own, the agent tends to stay within combinations of existing techniques
    • Interpreting results: e.g., why Diverse ABN worked on CIFAR-100
    • Judging the value of the work: positioning the contribution, ensuring reproducibility and integrity
    Implication: the bottleneck shifts from coding to prompting (asking good questions). This is consistent with Alex Kendall's claim that research feels faster and more fun, but how to guarantee the work as research remains an open problem.
    Summary: rethinking ABN with a coding agent
19. • Summary of the questions and answers
    Summary: rethinking ABN with a coding agent

    Q1: Is zero-shot reproduction possible? | Q2: Revisiting ABN with today's knowledge | Q3: Automation and the researcher's role
    Partially possible                      | Large improvement                         | Clear division of roles
    Structure reproduced, accuracy not      | CIFAR-10: … → …%; CIFAR-100: … pt gain    | Automation: implementation and search; humans: questions and value
    The wall is the granularity of paper descriptions | The bottleneck shifts to prompting | 
    → exp_001, exp_002                      | → exp_003 onward                          | → across all experiments
20. Personal impressions
    • Pros:
    1. From deciding to try this to writing the draft took only about one week.
    2. It feels like doing research together with an excellent student; content at the level of an undergraduate thesis is feasible.
    3. It is fun to ask "how is the progress?" from the smartphone app again and again, around the clock. (There is real excitement when checking the results; it brings back the joy of when I first started doing research.)

    • Cons:
    1. As expected, the Pro plan hit its usage limits, so I switched to the Max plan.
    2. To submit this as a research paper, how do we guarantee the content and the code?
    3. Basic strength as a researcher is still required (research training remains necessary).