Research Paper Introduction in IR Reading 2022 Fall

Takuya Asano
November 12, 2022

IR Reading 2022 Fall: https://sigir.jp/post/2022-11-12-irreading_2022fall/

Shitao Xiao, Zheng Liu, Weihao Han, Jianjin Zhang, Defu Lian, Yeyun Gong, Qi Chen, Fan Yang, Hao Sun, Yingxia Shao, and Xing Xie. 2022. Distill-VQ: Learning Retrieval Oriented Vector Quantization By Distilling Knowledge from Dense Embeddings. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22). Association for Computing Machinery, New York, NY, USA, 1513–1523. https://doi.org/10.1145/3477495.3531799

Transcript

  1. Takuya Asano (takuya-a)
    Distill-VQ: Learning Retrieval Oriented
    Vector Quantization By Distilling
    Knowledge from Dense Embeddings
    IR Reading 2022 Fall

  2. Vector Search and ANN
    Background
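    • Vector search over embeddings is becoming widespread
    • Used in search engines, recommender systems, and similar applications
    • Documents are selected by the similarity between the query embedding and the document embeddings
    • Approximate nearest neighbor search (ANN) is a key component of large-scale vector search
    • In real-world settings, linear search is not practical
    • ANN offers a trade-off among speed, memory usage, and accuracy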
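
For reference, the exact linear search that ANN replaces can be sketched as follows. This is a minimal illustration, not code from the paper: the function name `brute_force_top_k`, the random data, and the choice of inner product on unit-normalized vectors are all assumptions made for the example.

```python
import numpy as np

def brute_force_top_k(query, docs, k):
    """Exact search: score every document by inner product, return the top-k IDs."""
    scores = docs @ query                      # one dot product per document: O(N * d)
    top = np.argpartition(-scores, k - 1)[:k]  # the k best documents, in arbitrary order
    return top[np.argsort(-scores[top])]       # order those k by descending score

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 16)).astype(np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)   # unit-normalize (assumption)
query = docs[42]                                      # hypothetical query equal to document 42
top_ids = brute_force_top_k(query, docs, k=5)
```

The cost grows linearly with the number of documents, which is why this approach stops being practical at scale.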

  3. Vector Quantization (VQ)
    Background
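    • A data structure and algorithm for ANN
    • Represents a set of vectors by K centroids
    • The centroids are computed from the vector set in advance
    • To encode a vector, find its nearest centroid and store only that centroid's ID
    • The original vector is encoded by an ID alone, so the representation is compact
    • A larger K improves approximation accuracy, but encoding becomes slower and memory usage grows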
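
The encoding scheme above can be sketched in a few lines of NumPy. The deck does not say how the centroids are computed. Plain k-means (Lloyd's algorithm) is assumed here, and the helper names `train_centroids`, `encode`, and `decode` are hypothetical.

```python
import numpy as np

def encode(vectors, centroids):
    """Return, for each vector, the ID of its nearest centroid (squared L2 distance)."""
    dists = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def decode(ids, centroids):
    """Approximate each vector by the centroid its ID points to."""
    return centroids[ids]

def train_centroids(vectors, k, n_iter=10, seed=0):
    """Plain k-means as one simple way to obtain K centroids (an assumption)."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)].copy()
    for _ in range(n_iter):
        ids = encode(vectors, centroids)
        for c in range(k):
            members = vectors[ids == c]
            if len(members) > 0:               # keep the old centroid if a cluster is empty
                centroids[c] = members.mean(axis=0)
    return centroids

rng = np.random.default_rng(1)
data = rng.normal(size=(500, 8))
centroids = train_centroids(data, k=16)
codes = encode(data, centroids)      # one integer ID per vector
approx = decode(codes, centroids)    # lossy reconstruction
```

Each 8-dimensional vector is stored as a single ID in the range 0 to 15, at the cost of a reconstruction error that shrinks as K grows.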

  4. Product Quantization (PQ)
    Background
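    • Split the vector's dimensions into M parts and apply Vector Quantization to each part
    • The set of representative vectors for each part is called a codebook
    • Split the input vector into M sub-vectors, find the nearest representative vector in each codebook, and store its ID
    • The input vector is encoded by only M IDs
    • Good memory efficiency and good approximation accuracy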
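
A minimal sketch of PQ encoding and decoding, under stated assumptions: M = 4 sub-vectors, K = 256 codewords per codebook, and codebooks built by sampling data points rather than by the per-subspace k-means that is normally used. The names `pq_encode` and `pq_decode` are hypothetical.

```python
import numpy as np

def pq_encode(vectors, codebooks):
    """codebooks has shape (M, K, d // M); returns codes of shape (N, M)."""
    m, k, sub_dim = codebooks.shape
    parts = vectors.reshape(len(vectors), m, sub_dim)   # split each vector into M sub-vectors
    # Squared L2 distance from every sub-vector to every codeword of its own codebook.
    dists = ((parts[:, :, None, :] - codebooks[None, :, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=-1).astype(np.uint8)       # K <= 256, so one byte per ID

def pq_decode(codes, codebooks):
    """Concatenate the M selected codewords to reconstruct each vector."""
    m = codebooks.shape[0]
    parts = codebooks[np.arange(m), codes]              # shape (N, M, d // M)
    return parts.reshape(len(codes), -1)

rng = np.random.default_rng(0)
M, K, dim = 4, 256, 8
data = rng.normal(size=(1000, dim))
picked = rng.choice(len(data), size=K, replace=False)   # stand-in for trained codebooks
codebooks = data[picked].reshape(K, M, dim // M).transpose(1, 0, 2).copy()
codes = pq_encode(data, codebooks)     # (1000, 4) uint8: 4 bytes per vector
approx = pq_decode(codes, codebooks)   # (1000, 8) reconstruction
```

With these settings each vector is stored in 4 bytes, while the M codebooks together can represent 256^4 distinct reconstructions.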

  5. Inverted File (IVF)
    Background
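    • Uses an inverted index as an auxiliary data structure
    • Nearby vectors are grouped into the same inverted list
    • Fast, because only a few elements need to be scanned
    • First apply coarse quantization to obtain the inverted lists
    • Then scan those inverted lists to find the nearest-neighbor vectors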
    H. Jégou, M. Douze and C. Schmid, "Product Quantization for Nearest Neighbor
    Search," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.
    33, no. 1, pp. 117-128, Jan. 2011, doi: 10.1109/TPAMI.2010.57.
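
The two-stage lookup described on this slide can be sketched as follows. The sketch assumes squared L2 distance and uses sampled data points as stand-in centroids. The names `build_ivf`, `ivf_search`, and the `n_probe` parameter (how many inverted lists to scan) are illustrative, not taken from the paper.

```python
import numpy as np

def build_ivf(docs, centroids):
    """Coarse quantization: put each document ID into the list of its nearest centroid."""
    dists = ((docs[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    assignment = dists.argmin(axis=1)
    return [np.flatnonzero(assignment == c) for c in range(len(centroids))]

def ivf_search(query, docs, centroids, lists, n_probe, k):
    """Scan only the n_probe inverted lists whose centroids are closest to the query."""
    coarse = ((centroids - query) ** 2).sum(axis=-1)
    probed = np.argsort(coarse)[:n_probe]
    candidates = np.concatenate([lists[c] for c in probed])
    dists = ((docs[candidates] - query) ** 2).sum(axis=-1)
    return candidates[np.argsort(dists)[:k]]

rng = np.random.default_rng(0)
docs = rng.normal(size=(2000, 8))
centroids = docs[rng.choice(len(docs), size=32, replace=False)]  # stand-in for k-means centroids
lists = build_ivf(docs, centroids)
hits = ivf_search(docs[7], docs, centroids, lists, n_probe=4, k=3)
```

A small `n_probe` scans fewer candidates and is faster but can miss true neighbors. Setting `n_probe` to the number of lists reduces to exact search.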

  6. Distill-VQ
    Summary
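    • Performs ANN with vector quantization that combines IVF and PQ
    • Optimizes the following with contrastive learning:
    • IVF centroids
    • PQ codebooks
    • The query embedding encoder
    • Well-trained dense embeddings act as the teacher, and the components above are trained as the student
    • SOTA on multiple datasets and multiple tasks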

  7. Distill-VQ: Workflow
    Method
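    • Preparation
    • Compute embeddings for all documents (these are fixed in Distill-VQ)
    • Initialize IVF and PQ from the document embeddings (compute the centroids)
    • Prepare a well-trained query encoder for computing teacher scores
    • Use this query encoder to compute the query embeddings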

  8. Distill-VQ: Workflow
    Method
    1. Compute the query embedding

  9. Distill-VQ: Workflow
    Method
    2. Sample from the precomputed document embeddings

  10. Distill-VQ: Workflow
    Method
    3. Compute the student score using IVF

  11. Distill-VQ: Workflow
    Method
    4. Compute the student score using PQ

  12. Distill-VQ: Workflow
    Method
    5. Compute the teacher score

  13. Distill-VQ: Workflow
    Method
    6. Compute the similarity between the student scores and the teacher score, then update the model

  14. Distill-VQ: Detailed Algorithm
    Method
    • Training algorithm
    • L4: Sample candidate documents from the document collection D
    • L5: Compute the teacher scores
    • L6: Compute the student scores using IVF and PQ
    • L7: Train IVF, PQ, and the query encoder
    • f: similarity function
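
One training step can be sketched numerically as follows. This is not the paper's implementation. It assumes inner-product scores and takes f to be the KL divergence between softmax-normalized teacher and student scores. It also replaces the real IVF centroids and PQ reconstructions with noisy copies of the candidate embeddings. The gradient update of step L7 is omitted, since it needs an autodiff framework.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def kl_div(teacher_scores, student_scores):
    """KL(softmax(teacher) || softmax(student)); zero when the two distributions match."""
    p, q = softmax(teacher_scores), softmax(student_scores)
    return float(np.sum(p * (np.log(p) - np.log(q))))

rng = np.random.default_rng(0)
dim, n_cand = 8, 5
query = rng.normal(size=dim)                 # query embedding
cand = rng.normal(size=(n_cand, dim))        # sampled candidate document embeddings (fixed)
# Stand-ins (assumptions): the centroid each candidate is assigned to, and its PQ reconstruction.
ivf_centroid = cand + 0.3 * rng.normal(size=cand.shape)
pq_recon = cand + 0.1 * rng.normal(size=cand.shape)

teacher = cand @ query                       # scores from the dense embeddings
student_ivf = ivf_centroid @ query           # scores seen through IVF
student_pq = pq_recon @ query                # scores seen through PQ
loss = kl_div(teacher, student_ivf) + kl_div(teacher, student_pq)
```

Minimizing such a loss pushes the quantized scores toward the teacher's score distribution without requiring relevance labels.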



  15. Experiment Settings
    Experiments
    • Datasets
    • MS MARCO Passage retrieval
    • Queries from Bing Search

    • Natural Questions (NQ)
    • Queries from Google Search

    • Baselines
    • Existing vector quantization methods (IVFPQ, IVFOPQ, ScaNN)
    • Recent joint learning methods (Poeem, JPQ, RepCONC)

  16. Experiment Settings
    Experiments
    • Two well-trained encoders were tried as teacher models for Distill-VQ
    • AR2-G
    • CoCondenser
    • These models are the most accurate on MS MARCO and NQ
    • Document embeddings
    • The same embeddings are used for all methods for a fair comparison

  17. Overall Performance
    Experiments
    • Compares the impact on retrieval quality against existing methods
    • Consistently and significantly higher performance, and SOTA
    • Two encoders: AR2-G, CoCondenser
    • Two datasets: MS MARCO, NQ

  18. Explorations of Knowledge Distillation
    Experiments
    • Distill-VQ leaves the similarity function and the document sampling method open to choice, so the experiments vary both
    • Ranking-aware similarity functions (KL-Div, ListNet, RankNet) perform better
    • Combining batch sampling with Top-K (IB + Top-K) outperforms training with labeled data (GT) (!)
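
To make "ranking-aware" concrete, here are textbook forms of two of the named losses, written as a sketch. The exact formulations used in the paper may differ, and the function names and toy scores below are assumptions.

```python
import numpy as np

def log_softmax(x):
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def listnet_loss(teacher, student):
    """Top-one ListNet: cross entropy between softmax(teacher) and softmax(student)."""
    return float(-(np.exp(log_softmax(teacher)) * log_softmax(student)).sum())

def ranknet_loss(teacher, student):
    """Pairwise logistic loss over the pairs where the teacher ranks i above j."""
    diff = student[:, None] - student[None, :]       # s_i - s_j for every pair
    prefer = teacher[:, None] > teacher[None, :]     # True where the teacher prefers i to j
    return float(np.log1p(np.exp(-diff[prefer])).mean())

teacher = np.array([3.0, 1.0, 2.0])
agree = np.array([3.0, 1.0, 2.0])        # same ordering as the teacher
disagree = np.array([1.0, 3.0, 2.0])     # ordering reversed
```

Both losses are lower for `agree` than for `disagree`, so they penalize students whose ordering of candidates departs from the teacher's.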

  19. Efficiency and Retrieval Quality
    Experiments
    • Compares the speed/recall trade-off against the original IVFOPQ in FAISS
    • Outperforms IVFOPQ in every setting

  20. Personal Impressions
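    • Very interesting as a method that can improve ANN performance even without labeled data
    • It applies without changing the data structures or algorithms, so query processing speed is not hurt
    • Promising, since significant improvements were confirmed on MS MARCO Passage and other datasets
    • For real applications, how to retrain IVF, PQ, and the query encoder is a concern
    • The whole ANN index would probably have to be rebuilt, so training time is also a concern
    • (A static index that is never updated would be fine, but such applications are rare)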
