Slide ICCV2023 Constructing Image Text Pair Dataset from Books
- Constructing Image Text Pair Dataset from Books
- DataComp Workshop
- Towards the Next Generation of Computer Vision Datasets October 3, 2023 - ICCV, Paris
- Yamato OKAMOTO
Cloud Corp., WORKS MOBILE JAPAN Corp.)
- Haruto Toyonaga (Doshisha University)
- Yoshihisa Ijiri (LINE Corporation)
- Hirokatsu Kataoka (LINE Corporation)
customary activities.
• To protect these valuable books, digital archiving is now expanding widely.
• As the next step after archiving, we should discuss how to exploit them.
archived books can be considered multi-modal data.
• As a novel way to utilize digital archives, we automatically constructed an image-text pair dataset from them.
• We then trained machine learning models on this dataset to acquire knowledge from books, just as humans do by reading.
2. Layout Analyzer (extracts only the caption text)
3. Object Detection (detects illustration areas)
4. Matching nearest-neighbor pairs (caption and illustration)
(※ Each model was trained on an annotated book-image dataset.)
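The nearest-neighbor matching step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slide does not say how distance is measured, so the sketch assumes Euclidean distance between box centers, and the names `center` and `match_nearest` and the `(x_min, y_min, x_max, y_max)` box format are hypothetical.

```python
import math


def center(box):
    """Return the (x, y) center of a box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def match_nearest(illustrations, captions):
    """Pair each illustration box with the text of the closest caption box.

    illustrations: list of boxes from the object detector
    captions:      list of (box, text) tuples from the layout analyzer
    Assumption: "nearest" means smallest Euclidean distance between box centers.
    """
    pairs = []
    if not captions:
        return pairs
    for ill_box in illustrations:
        ill_center = center(ill_box)
        # Pick the caption whose center lies closest to this illustration.
        _, nearest_text = min(
            captions, key=lambda c: math.dist(ill_center, center(c[0]))
        )
        pairs.append((ill_box, nearest_text))
    return pairs


# Toy page: two illustrations, each with a caption directly below it.
page_illustrations = [(100, 100, 500, 400), (100, 600, 500, 900)]
page_captions = [
    ((100, 410, 500, 440), "Caption A"),
    ((100, 910, 500, 940), "Caption B"),
]
page_pairs = match_nearest(page_illustrations, page_captions)
```

On the toy page, the first illustration is paired with "Caption A" and the second with "Caption B". A real pipeline would likely also need a distance threshold so that illustrations without any nearby caption are discarded rather than mismatched.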
photo books.
✓ From the period 1868 to 1945
✓ 175 photo books (containing a total of 12,640 book images)
✓ Photographs of locations or buildings from almost every prefecture in Japan
• Ultimately, we obtained 9,516 image-text pairs.
system using CLIP.
• Using ViT-B/32 for initialization, we trained CLIP on the constructed dataset.
Result
• Training improved retrieval performance, especially in the old-Japanese domain.
• This suggests that digital archives provide CLIP with new domain-specific knowledge.
• The trained CLIP retrieved items based on specific Japanese location or building names.
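Text-to-image retrieval with a CLIP-style model usually comes down to ranking image embeddings by cosine similarity to the query's text embedding. The sketch below shows only that ranking step, assuming the embeddings have already been produced by the trained encoders. The function names and toy vectors are hypothetical, and the slides do not describe the actual retrieval code.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def rank_images(text_embedding, image_embeddings):
    """Return image indices sorted by descending cosine similarity to the query.

    Assumption: all embeddings are non-zero and share the same dimension.
    """
    query = l2_normalize(text_embedding)
    scored = []
    for idx, emb in enumerate(image_embeddings):
        unit = l2_normalize(emb)
        # The dot product of unit vectors is the cosine similarity.
        similarity = sum(q * e for q, e in zip(query, unit))
        scored.append((similarity, idx))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [idx for _, idx in scored]


# Toy 2-D embeddings standing in for encoder outputs.
query_embedding = [1.0, 0.0]
gallery = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]
ranking = rank_images(query_embedding, gallery)
```

Here image 1 ranks first because it points almost in the same direction as the query, followed by image 2 and then image 0.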
on the constructed dataset.
• Analyzing the model provides us with new insights.
Result
• t-SNE visualization showed which cities are unique and which are similar.
• Grad-CAM visualization showed which elements likely represent city identities.
archives by creating an image-text pair dataset.
• We demonstrated the effectiveness of model training on this dataset.
• This is the first step toward machine learning that acquires knowledge autonomously, just as humans do by reading books.

All book images presented in this document are reproduced from the NDL-DocL dataset. https://github.com/ndl-lab/layout-dataset