Constructing Image Text Pair Dataset from Books
Yamato OKAMOTO (NAVER Cloud Corp., WORKS MOBILE JAPAN Corp.)
Haruto Toyonaga (Doshisha University)
Yoshihisa Ijiri (LINE Corporation)
Hirokatsu Kataoka (LINE Corporation)
DataComp Workshop
Towards the Next Generation of Computer Vision Datasets
October 3, 2023 - ICCV, Paris

Motivation: Utilize Digital Archiving
• Books record historical, cultural, and customary activities.
• To protect these valuable books, digital archiving is now expanding widely.
• As the next step after archiving, we should discuss how to make use of the archived books.

Purpose: To make AI acquire knowledge from books
• Digitally archived books can be considered multi-modal data.
• As a novel way to utilize digital archives, we automatically constructed an image-text pair dataset from them.
• We then trained machine learning models on this dataset to acquire knowledge from books, much as humans do by reading.

Developed: Dataset Construction Pipeline
1. OCR (detect and recognize text)
2. Layout Analyzer (extract only the caption text)
3. Object Detection (detect illustration areas)
4. Nearest-neighbor matching (pair each illustration with the closest caption)
(※ Each model was trained on an annotated book-image dataset.)
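The final pipeline step, nearest-neighbor matching, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the slides do not say which distance is used or how conflicts are resolved, so the sketch assumes Euclidean distance between box centers and a greedy one-to-one assignment. The `Box` class, `center_distance`, `match_nearest`, and the example coordinates and caption strings are all hypothetical.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical format)."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def center(self):
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)


def center_distance(a, b):
    """Euclidean distance between box centers (an assumed distance measure)."""
    (ax, ay), (bx, by) = a.center, b.center
    return hypot(ax - bx, ay - by)


def match_nearest(illustrations, captions):
    """Greedily pair illustrations with captions, closest pairs first.

    illustrations: list of Box (from the object detector)
    captions: list of (Box, text) (from OCR plus the layout analyzer)
    Returns a sorted list of (illustration_index, caption_text).
    """
    candidates = sorted(
        (center_distance(ill, cap_box), i, j)
        for i, ill in enumerate(illustrations)
        for j, (cap_box, _text) in enumerate(captions)
    )
    used_illustrations, used_captions, pairs = set(), set(), []
    for _dist, i, j in candidates:
        # Each illustration and each caption is used at most once.
        if i in used_illustrations or j in used_captions:
            continue
        used_illustrations.add(i)
        used_captions.add(j)
        pairs.append((i, captions[j][1]))
    return sorted(pairs)


# Made-up page layout: two photos, each with a caption directly below it.
illustrations = [Box(0, 0, 100, 100), Box(0, 200, 100, 300)]
captions = [
    (Box(0, 105, 100, 120), "caption A"),
    (Box(0, 305, 100, 320), "caption B"),
]
pairs = match_nearest(illustrations, captions)
```

A one-to-one greedy assignment is only one possible design choice; a page where several photos share one caption would need a different rule.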

Experiments: Dataset Construction
• Applied our pipeline to old Japanese photo books.
  ✓ From the period 1868 to 1945.
  ✓ 175 photo books (containing a total of 12,640 book images).
  ✓ Photographs of locations or buildings from almost every prefecture in Japan.
• Ultimately, we obtained 9,516 image-text pairs.

Experiments: Image-Text Retrieval
Setting
• We constructed a cross-modal retrieval system using CLIP.
• We initialized CLIP with ViT-B/32 and trained it on the constructed dataset.
Result
• Training enhanced retrieval performance, especially in the old Japanese domain.
• This suggests that digital archives provide CLIP with new domain-specific knowledge.
• The trained CLIP retrieved items based on specific Japanese location or building names.
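Cross-modal retrieval with a CLIP-style model reduces to ranking image embeddings by their similarity to a text embedding. The sketch below illustrates that ranking step and one common way to score it. It makes several assumptions: the embeddings are toy 2-D vectors rather than real CLIP encoder outputs, cosine similarity is used as the score, and recall@k is chosen as the metric even though the slides do not name the metric that was used. The function names `cosine`, `retrieve`, and `recall_at_k` are hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm


def retrieve(query_emb, image_embs, k=1):
    """Return indices of the k images most similar to the text query."""
    ranked = sorted(
        range(len(image_embs)),
        key=lambda i: cosine(query_emb, image_embs[i]),
        reverse=True,
    )
    return ranked[:k]


def recall_at_k(text_embs, image_embs, k):
    """Fraction of texts whose paired image (same index) is in the top k."""
    hits = sum(
        1 for i, text_emb in enumerate(text_embs)
        if i in retrieve(text_emb, image_embs, k)
    )
    return hits / len(text_embs)


# Toy stand-ins for encoder outputs; text i is paired with image i.
image_embs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
text_embs = [[0.9, 0.1], [0.1, 0.9], [0.6, 0.8]]

top2_for_third_caption = retrieve(text_embs[2], image_embs, k=2)
score = recall_at_k(text_embs, image_embs, k=1)
```

Comparing such a score before and after training on the constructed dataset is one way to quantify the kind of improvement described in the Result bullets.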

Experiments: Insight Extraction
Setting
• Trained a city classification model on the constructed dataset.
• Analyzing the model provides us with new insights.
Result
• t-SNE visualization showed which cities are distinctive and which are similar.
• Grad-CAM visualization showed which elements likely represent each city's identity.

Conclusion
• We proposed a new approach for leveraging digital archives by creating an image-text pair dataset.
• We demonstrated the effectiveness of model training on this dataset.
• This is a first step toward machine learning that acquires knowledge autonomously, much as humans do by reading books.
All book images presented in this document are reproduced from the NDL-DocL dataset.