Slide 1

Slide 1 text

Takuya Asano (takuya-a)
Distill-VQ: Learning Retrieval Oriented Vector Quantization By Distilling Knowledge from Dense Embeddings
IR Reading 2022 Fall

Slide 2

Slide 2 text

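Vector Search and ANN [Background]
• Vector search over embeddings is becoming widespread
  • Used in search engines, recommender systems, and similar applications
  • Documents are selected by the similarity between the query embedding and the document embeddings
• Approximate nearest neighbor (ANN) search is the key component of large-scale vector search
  • Linear search is not practical at real-world scale
  • ANN provides a trade-off among speed, memory usage, and accuracy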
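As a point of reference for the linear search mentioned above, the following is a minimal sketch of exact search by inner product. The function names and the toy data are illustrative assumptions, not taken from the slides; the cost grows linearly with the number of documents, which is what ANN methods try to avoid.

```python
def inner_product(u, v):
    """Similarity between two embeddings (inner product is an assumed choice)."""
    return sum(a * b for a, b in zip(u, v))


def linear_search(query, docs, k):
    """Score every document against the query and return the top-k indices."""
    ranked = sorted(
        range(len(docs)),
        key=lambda i: inner_product(query, docs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-dimensional embeddings (hypothetical data).
docs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
query = [1.0, 0.2]
top2 = linear_search(query, docs, k=2)
print(top2)
```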

Slide 3

Slide 3 text

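Vector Quantization (VQ) [Background]
• A data structure and algorithm for ANN
• Represents a set of vectors by K centroids
  • The centroids are computed from the vector set in advance
• To encode a vector, find its nearest centroid and record only that centroid's ID
  • The original vector is encoded by an ID alone, so the representation is compact
• Increasing K improves the approximation accuracy, but it becomes correspondingly slower and uses more memory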
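The encoding step above can be sketched as follows. This is a minimal illustration that assumes the centroids are obtained with a few k-means (Lloyd) iterations and a naive initialization from the first K vectors; the slides do not say how the centroids are computed, and all names here are hypothetical.

```python
def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest(vec, centroids):
    """ID of the centroid closest to vec."""
    return min(range(len(centroids)), key=lambda c: sq_dist(vec, centroids[c]))


def train_centroids(vectors, k, iters=10):
    """Plain k-means; initialization from the first k vectors is an assumption."""
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[nearest(v, centroids)].append(v)
        for c, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


vectors = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0],
           [1.0, 0.0], [10.0, 11.0], [11.0, 10.0]]
centroids = train_centroids(vectors, k=2)
codes = [nearest(v, centroids) for v in vectors]  # one ID per vector
print(codes)
```

Each vector is stored as a single integer ID, and the centroid with that ID serves as its approximation.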

Slide 4

Slide 4 text

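Product Quantization (PQ) [Background]
• Split the vector's dimensions into M parts and apply vector quantization to each part
  • The set of representative vectors for each part is called a codebook
• Split the input vector into M sub-vectors, find the nearest representative vector in each codebook, and record its ID
  • The input vector is encoded by M IDs alone
• Good memory efficiency and good approximation accuracy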
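A minimal sketch of PQ encoding and decoding follows. It assumes the codebooks are already trained and that the dimension is divisible by M; the codebook values and function names are hypothetical.

```python
def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def split(vec, m):
    """Split vec into m equal-length sub-vectors (len(vec) must be divisible by m)."""
    step = len(vec) // m
    return [vec[i * step:(i + 1) * step] for i in range(m)]


def pq_encode(vec, codebooks):
    """Return one codeword ID per sub-vector."""
    subs = split(vec, len(codebooks))
    return [
        min(range(len(cb)), key=lambda j: sq_dist(sub, cb[j]))
        for sub, cb in zip(subs, codebooks)
    ]


def pq_decode(code, codebooks):
    """Concatenate the selected codewords to approximate the original vector."""
    out = []
    for j, cb in zip(code, codebooks):
        out.extend(cb[j])
    return out


# M = 2 codebooks with 2 codewords each (hypothetical values).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],  # codebook for dimensions 0-1
    [[0.0, 5.0], [5.0, 0.0]],  # codebook for dimensions 2-3
]
vec = [0.9, 1.2, 4.0, 0.5]
code = pq_encode(vec, codebooks)
approx = pq_decode(code, codebooks)
print(code, approx)
```

With K codewords per codebook, M IDs can distinguish up to K^M reconstructions while storing only M small integers per vector.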

Slide 5

Slide 5 text

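Inverted File (IVF) [Background]
• Uses an inverted index as an auxiliary data structure
  • Nearby vectors are grouped into the same inverted list
  • Search is fast because few elements are scanned
• First apply a coarse quantization to obtain an inverted list
• Then scan the inverted list to compute the nearest-neighbor vectors
H. Jégou, M. Douze and C. Schmid, "Product Quantization for Nearest Neighbor Search," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 1, pp. 117-128, Jan. 2011, doi: 10.1109/TPAMI.2010.57.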
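The two-stage lookup above can be sketched as follows. This is a simplified illustration: the inverted lists here hold indices of raw vectors, whereas a combined IVF and PQ index would store compressed codes. The `nprobe` parameter (number of lists scanned) and all names are assumptions for the example.

```python
def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def build_ivf(vectors, centroids):
    """Coarse quantization: put each vector index into its nearest centroid's list."""
    lists = {c: [] for c in range(len(centroids))}
    for idx, v in enumerate(vectors):
        c = min(range(len(centroids)), key=lambda j: sq_dist(v, centroids[j]))
        lists[c].append(idx)
    return lists


def ivf_search(query, vectors, centroids, lists, nprobe=1, k=1):
    """Scan only the nprobe inverted lists whose centroids are closest to the query."""
    probe = sorted(
        range(len(centroids)), key=lambda j: sq_dist(query, centroids[j])
    )[:nprobe]
    candidates = [i for c in probe for i in lists[c]]
    return sorted(candidates, key=lambda i: sq_dist(query, vectors[i]))[:k]


centroids = [[0.0, 0.0], [10.0, 10.0]]  # hypothetical coarse centroids
vectors = [[0.0, 1.0], [1.0, 0.0], [10.0, 11.0], [11.0, 10.0]]
lists = build_ivf(vectors, centroids)
result = ivf_search([9.0, 9.5], vectors, centroids, lists, nprobe=1, k=1)
print(lists, result)
```

Raising `nprobe` scans more lists, which trades speed for a lower chance of missing the true nearest neighbor.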

Slide 6

Slide 6 text

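Distill-VQ [Summary]
• Performs ANN with vector quantization that combines IVF and PQ
• Optimizes the following with contrastive learning:
  • the IVF centroids
  • the PQ codebooks
  • the query embedding encoder
• Training uses well-trained dense embeddings as the teacher and the components above as the students
• SOTA on multiple datasets and multiple tasks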

Slide 7

Slide 7 text

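Distill-VQ: Workflow [Method]
• Preparation
  • Compute embeddings for all documents (these are fixed in Distill-VQ)
  • Initialize IVF and PQ from the document embeddings (compute the centroids)
  • Prepare a well-trained query encoder for computing the teacher scores
    • Compute the query embeddings with this query encoder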

Slide 8

Slide 8 text

Distill-VQ: Workflow [Method]
1. Compute the query embeddings
Shitao Xiao, Zheng Liu, Weihao Han, Jianjin Zhang, Defu Lian, Yeyun Gong, Qi Chen, Fan Yang, Hao Sun, Yingxia Shao, and Xing Xie. 2022. Distill-VQ: Learning Retrieval Oriented Vector Quantization By Distilling Knowledge from Dense Embeddings. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22). Association for Computing Machinery, New York, NY, USA, 1513–1523. https://doi.org/10.1145/3477495.3531799

Slide 9

Slide 9 text

Distill-VQ: Workflow [Method]
2. Sample from the document embeddings prepared in advance

Slide 10

Slide 10 text

Distill-VQ: Workflow [Method]
3. Compute the student scores using IVF

Slide 11

Slide 11 text

Distill-VQ: Workflow [Method]
4. Compute the student scores using PQ

Slide 12

Slide 12 text

Distill-VQ: Workflow [Method]
5. Compute the teacher scores

Slide 13

Slide 13 text

Distill-VQ: Workflow [Method]
6. Compute the similarity between the student scores and the teacher scores, and update the model

Slide 14

Slide 14 text

Distill-VQ: Detailed Algorithm [Method]
• Training algorithm (L4 to L7 are line numbers in the algorithm listing shown on the slide)
  • L4: Sample candidate documents from the document collection D
  • L5: Compute the teacher scores
  • L6: Compute the student scores using IVF and PQ
  • L7: Train IVF, PQ, and the query encoder
• f: similarity function
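To make the role of the similarity function f concrete, here is a minimal sketch of one training signal. It assumes inner-product scores and takes KL divergence between softmax-normalized teacher and student scores as f. The quantized document vectors are hypothetical stand-ins for IVF or PQ reconstructions, and the same query embedding is used for both teacher and student as a simplification. The gradient update of IVF, PQ, and the query encoder is not shown.

```python
import math


def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_scores, student_scores):
    """KL(teacher || student) over the sampled candidate documents."""
    p = softmax(teacher_scores)
    q = softmax(student_scores)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


query = [1.0, 0.5]                                      # query embedding
docs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]             # dense document embeddings
quantized_docs = [[0.8, 0.2], [0.2, 0.8], [0.5, 0.5]]   # hypothetical reconstructions

teacher = [inner_product(query, d) for d in docs]            # teacher scores
student = [inner_product(query, d) for d in quantized_docs]  # student scores
loss = kl_divergence(teacher, student)
print(loss)
```

The loss is zero when the student reproduces the teacher's score distribution over the candidates and grows as the two distributions diverge, so minimizing it needs no relevance labels.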

Slide 15

Slide 15 text

Experiment Settings [Experiments]
• Datasets
  • MS MARCO Passage retrieval
    • Queries from Bing Search
  • Natural Questions (NQ)
    • Queries from Google Search
• Baselines
  • Existing vector quantization methods (IVFPQ, IVFOPQ, ScaNN)
  • Recent joint learning methods (Poeem, JPQ, RepCONC)

Slide 16

Slide 16 text

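Experiment Settings [Experiments]
• Two well-trained encoders were tried as the teacher model for Distill-VQ
  • AR2-G
  • CoCondenser
  • These models are the most accurate on MS MARCO and NQ
• Document embeddings
  • The same embeddings are used for all methods to keep the comparison fair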

Slide 17

Slide 17 text

Overall Performance [Experiments]
• Compares the impact on retrieval quality against existing methods
• Consistently and significantly higher performance, and SOTA
  • Two encoders: AR2-G, CoCondenser
  • Two datasets: MS MARCO, NQ

Slide 18

Slide 18 text

Explorations of Knowledge Distillation [Experiments]
• Distill-VQ leaves open the choice of similarity function and of document sampling method, so the experiments vary both
• Ranking-aware similarity functions (KL-Div, ListNet, RankNet) perform better
• Combining in-batch sampling with Top-K (IB + Top-K) outperforms using labeled data (GT) (!)
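For readers unfamiliar with the ranking-aware similarity functions named above, the following sketch shows textbook forms of the ListNet (top-one) and RankNet losses applied to teacher and student scores. These are assumed standard definitions and may differ in detail from the formulations used in the paper. The score values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def listnet_loss(teacher, student):
    """Cross-entropy between the teacher and student top-one distributions."""
    p = softmax(teacher)
    q = softmax(student)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))


def ranknet_loss(teacher, student):
    """Mean pairwise logistic loss over pairs the teacher ranks i above j."""
    total, pairs = 0.0, 0
    for i in range(len(teacher)):
        for j in range(len(teacher)):
            if teacher[i] > teacher[j]:
                total += math.log1p(math.exp(-(student[i] - student[j])))
                pairs += 1
    return total / pairs if pairs else 0.0


teacher = [1.0, 0.5, 0.75]
good_student = [0.9, 0.6, 0.75]  # same ordering as the teacher
bad_student = [0.6, 0.9, 0.75]   # top and bottom documents swapped

print(listnet_loss(teacher, good_student), listnet_loss(teacher, bad_student))
print(ranknet_loss(teacher, good_student), ranknet_loss(teacher, bad_student))
```

Both losses penalize the student that reverses the teacher's ordering more than the one that preserves it, which is the sense in which they take ranking into account.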

Slide 19

Slide 19 text

Efficiency and Retrieval Quality [Experiments]
• Compares the speed-recall trade-off against the original IVFOPQ in FAISS
• Distill-VQ outperformed IVFOPQ in every setting
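The recall side of this trade-off is commonly measured as recall@k. The sketch below is a generic definition under that assumption; the slides do not state the exact metric or cutoff used, and the document IDs are hypothetical.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


retrieved = [3, 7, 1, 9]  # ranked document IDs returned by an index
relevant = {7, 9, 4}      # ground-truth relevant document IDs
r2 = recall_at_k(retrieved, relevant, k=2)
r4 = recall_at_k(retrieved, relevant, k=4)
print(r2, r4)
```

Plotting this value against query latency for several index settings gives a speed-recall curve of the kind the comparison above refers to.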

Slide 20

Slide 20 text

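Personal Impressions
• Very interesting as a method that can improve ANN performance even without labeled data
  • It applies without changing the data structures or algorithms, so it does not hurt query processing speed
  • Promising, since significant improvements were confirmed on MS MARCO Passage and other datasets
• For real applications, how to retrain IVF, PQ, and the query encoder is a concern
  • The whole ANN index would presumably have to be rebuilt, so training time is also a concern
  • (This is not a problem for a static index that is never updated, but such applications are rare)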