Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval

Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval

A42dd3541cd40296dcd8a5e6b4a01bef?s=128

Scatter Lab Inc.

August 07, 2020
Tweet

Transcript

  1. MLࣁ޷ա S6E3 Approximate Nearest Neighbor 
 Negative Contrastive Learning for

    Dense Text Retrieval ӣળࢿ ML Research Scientist, Pingpong
  2. ݾର ݾର 1. Introduction 1. ޙઁ ੿੄ 2. ӝઓ ӝߨ੄

    ೠ҅ 2. Approach 1. ੽Ӕ ߑߨ ࣗѐ 2. ࠺زӝ ೟ण ܖ౯ 3. Experiment 1. प೷ ࢸ҅ 2. प೷ Ѿҗ 3. ҳഅ ࣁࠗࢎ೦
  3. • ࠄ ֤ޙীࢲ Ҿӓ੸ਵ۽ ಽҊ੗ ೞח ޙઁח Open-Domain Question Answering

    (QA) పझ௼ • Open-Domain QAח যڃ بݫੋী Ҵೠغয ੓૑ ঋ਷ ૕ޙਸ ؍઎ਸ ٸ, 
 ࠁਬೞҊ ੓ח ׮۝(~1M+)੄ ޙࢲٜ о਍ؘ ನೣغয ੓ח ੿׹ਸ ଺ח పझ௼۽ ੿੄ೡ ࣻ ੓णפ׮. • ৘ܳ ٜݶ ਤఃೖ٣ইী ઓ੤ೞח ݽٚ ޙࢲܳ ଵઑೡ ࣻ ੓׮ח о੿ ೞী
 “ఋ֢झח ੹ ਋઱੄ ݻಌࣃ౟੄ ࢤݺ୓ܳ લ৓য?” ী ؀ೠ ׹ਸ ଺ח Ѫ ੑפ׮. ޙઁ ੿੄ [1/2]
  4. • ٩۞׬ ӝ߈੄ ݽ؛ਸ ੉ਊ೧ࢲ ࠁ׮ ੿ഛೠ ੿׹ਸ ଺ਸ ࣻ

    ੓૑݅, ݽٚ ޙࢲ(+Nর)ী ؀೧ 
 ো࢑ਸ ࣻ೯ೞח Ѫ਷ ݒ਋ ࠺ബਯ ੸੉Ҋ, पदр ࢲ࠺झо ࠛоמೞ׮ח ೠ҅੼੉ ੓঻णפ׮. • ӝઓ੄ োҳٜ਷ ࣘب੸ ೠ҅੼ਸ ӓࠂೞӝ ਤ೧ ௼ѱ فо૑ stage ۽ ܻ࠙ೞৈ ޙઁܳ ಽҊ੗ ೞ৓णפ׮ • 1. Document Retrieval: ઱য૓ ૕੄ী ؀೧ࢲ ҙ۲੉ ੓ח ޙࢲٜਸ ଺ח ױ҅ • 2. Reading Comprehension: ઱য૓ ૕੄ী ؀ೠ ҳ୓੸ੋ ׹ਸ ҙ۲ ޙࢲܳ ଵઑೞৈ ب୹ೞח ݽ؛ • য়ט ࣗѐ೧ ܾ٘ ֤ޙ਷ ੉ Document Retrieval ੄ ࢿמ ೱ࢚ী ҙೠ ߑߨਸ ઁউ೤פ׮. ޙઁ ੿੄ [2/2]
  5. • ӝઓ੄ ؀ࠗ࠙੄ োҳীࢲח Document Retrieval ী Lexical Feature ܳ

    ઱۽ ࢎਊೞ৓णפ׮. • ৘द) BM25, TF-IDF, Keyword Matching ١١ (Elastic Search੄ ઱ػ ӝמ) • ೞ૑݅ ੉۞ೠ ߑߨ਷ ૕੄ ੗୓੄ ೣ୷੸ ੄޷(Semantic)ܳ ੉೧ೞҊ ҙ۲ػ ׹߸ਸ ଺ਸ ࣻח হणפ׮. • ৘द) Q. ־о పठۄ੄ ؀੢੉ঠ? 
 -> (పठۄ, ؀੢) ਵ۽ Ѩ࢝೧ب ف ఃਕ٘ܳ ನೣೞח ޙࢲܳ ଺ਸ ࣻ হ਺.. ӝઓ ߑߨ੄ ೠ҅੼ [1/3]
  6. • ୭Ӕ੄ োҳٜ਷(Lee et al., 2019; Guu et al., 2020;

    Seo et al. 2019) ૕੄৬ ޙࢲܳ BERTܳ ੉ਊ೧ Representation ਵ۽ ಴അೞৈ ࠁ׮ Semantic ೠ ੿ࠁܳ ನ଱ೡ ࣻ ੓ח ߑߨਸ ઁউೞ৓਺. • ੉۞ೠ ߑߨٜ਷ BI-Encoder ҳઑ੄ ݽ؛ਸ ࢎਊೞݴ, In-Batch Negative ۽ ೟णਸ ࣻ೯೤פ׮. • ೟ण੉ ৮ܐػ ੉റীח Document Encoderܳ ੉ਊ೧ࢲ ޷ܻ ޙࢲٜਸ encoding ೧ ֬਺ • Inference दীח ૕੄݅ BERT۽ Representation ਸ ҅࢑ೞҊ FAISS ৬ э਷ Approximate Nearest Neighbor Search ోਸ ੉ਊ೧ ߄۽ ૕੄ Representation җ оө਍ Top-Kѐ੄ ޙࢲܳ ଺਺ ӝઓ ߑߨ੄ ೠ҅੼ [2/3]
  7. Bi-encoder ૕੄ ޙࢲ

  8. ೟णߑߨ: In-Batch Negative Q1 D1 Q2 D2 Q3 D3 Q4

    D4 ೟ण ؘ੉ఠࣇ
  9. ೟णߑߨ: In-Batch Negative Q1 D1 Q2 D2 Q3 D3 Q4

    D4 ೟ण ؘ੉ఠࣇ Q: (4, 512) D: (4, 512)
  10. ೟णߑߨ: In-Batch Negative Q1 D1 Q2 D2 Q3 D3 Q4

    D4 ೟ण ؘ੉ఠࣇ Q: (4, 512) D: (4, 512) Q ⋅ DT -> (4,4)
  11. ೟णߑߨ: In-Batch Negative Q1 Q2 Q3 Q4 D1 D2 D3

    D4 Q1 D1 Q2 D2 Q3 D3 Q4 D4 ೟ण ؘ੉ఠࣇ Q: (4, 512) D: (4, 512) Q ⋅ DT -> (4,4)
  12. ೟णߑߨ: In-Batch Negative Q1 Q2 Q3 Q4 D1 D2 D3

    D4 Q1 D1 Q2 D2 Q3 D3 Q4 D4 ೟ण ؘ੉ఠࣇ Q: (4, 512) D: (4, 512) 0.5 0.6 0.4 0.7 0.2 0.1 0.2 0.1 0.2 0.1 0.3 0.1 0.2 0.1 0.1 0.1 Softmax Q ⋅ DT п Row ߹۽ Softmaxܳ ஂೣ -> (4,4)
  13. ೟णߑߨ: In-Batch Negative Q1 Q2 Q3 Q4 D1 D2 D3

    D4 Q1 D1 Q2 D2 Q3 D3 Q4 D4 ೟ण ؘ੉ఠࣇ Q: (4, 512) D: (4, 512) 0.99 0.99 0.01 0.99 0.99 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 Q ⋅ DT ೟ण ݾ಴: 
 п Row ীࢲ ؀਽غח ޙࢲо ઁੌ ֫਷ чਸ ыب۾ -> (4,4)
  14. • ੷੗ח Dense Retrieval ݽ؛ਸ ೟णೡ ٸ ࢎਊೞח In-Batch Negativeী

    ޙઁо ੓਺ਸ ૑੸೤פ׮. • In-Batch Negative ೟ण ߑߨ਷ যו੿ب ਬࢎೠ ޙࢲٜਸ ୶ܻחؘীח ਬബೞ૑݅, 
 ૓૞ ҙ۲੉ ੓ח ޙࢲܳ ੿ഛೞѱ ఐ࢝ೞӝীח Ӕࠄ੸ੋ ೠ҅о ੓ਸ Ѫ੉ۄח оࢸਸ ࣁ਒פ׮. • ৵ջೞݶ ৮੹ ҙ۲੉ হח റࠁٜ ઺ী, ҙ۲੉ ੓ח ೞա੄ ޙࢲܳ ࡳب۾ ೟णೞח Ѫҗ
 ҙ۲ࢿ ੓ח റࠁٜ ઺ীࢲ ૓૞ ҙ۲੉ ੓ח ೞա੄ ޙࢲܳ ࡳب۾ ೟णೞח Ѫ਷ ׮ܰӝ ٸޙੑפ׮. ӝઓ ߑߨ੄ ೠ҅੼ [2/3]
  15. • negative sample ٜ੄ representation ਸ t-SNEਵ۽ दпചೞৈ ࠙ࢳਸ ࣻ೯ೞ৓णפ׮.

    • ӝઓী ઱۽ ࢎਊೞ؍ Random, BM25 ӝ߈੄ Negative ٜ਷ पઁ Relevant Document ৬ ࠙ನ ର੉о ब೮਺ • ژೠ Random Negative ۽ ೟णػ ݽ؛۽ Dense Retrieval ਸ ࣻ೯द, पઁ ҙ۲ ޙࢲٜਸ ந஖ೞ૑ ޅ೮਺. ӝઓ ߑߨ੄ ೠ҅੼ [2/3]
  16. • negative sample ٜ੄ representation ਸ t-SNEਵ۽ दпചೞৈ ࠙ࢳਸ ࣻ೯ೞ৓णפ׮.

    • ӝઓী ઱۽ ࢎਊೞ؍ Random, BM25 ӝ߈੄ Negative ٜ਷ पઁ Relevant Document ৬ ࠙ನ ର੉о ब೮਺ • ژೠ Random Negative ۽ ೟णػ ݽ؛۽ Dense Retrieval ਸ ࣻ೯द, पઁ ҙ۲ ޙࢲٜਸ ந஖ೞ૑ ޅ೮਺. ӝઓ ߑߨ੄ ೠ҅੼ [2/3] “੉ উীࢲ ޤо ૓૞ ҙ۲ ޙࢲջ!”
 ؀ೠ Ѫب ೟णਸ ࣻ೯೧ঠ ೠ׮!
  17. • ࠄ ֤ޙীࢲח ೟णद ࢎਊغח negative sampleਸ ࡳח ࢜۽਍ ߑߨਸ

    ઁউ೤פ׮ • Approximate nearest neighbor Negative Contrastive Estimation(ANCE) • ೟ण ઺р ݽ؛੄ retrieval ػ Ѿҗܳ ੉ਊ೧ࢲ য۰਍ negative sampleਸ ݅٘ח ߑߨੑפ׮. • ࠺زӝ੸ਵ۽ faiss index ܳ N step ݃׮ সؘ੉౟ೞҊ, negative sample ਸ ૑ࣘ੸ਵ۽ јन೤פ׮ Approach
  18. Approach

  19. • ಣо పझ௼ TREC 2019 Deep Learning Track ܳ ࢎਊೞ৓णפ׮.

    • Ѩ࢝ ূ૓ Bing ਵ۽ ٜযৡ ߔ݅ѐ ੉࢚੄ ૕੄ী ؀೧ࢲ ҙ۲ػ ޙࢲо ۨ੉࠶݂ غয ੓ח ؘ੉ఠࣇ • ੉ ؘ੉ఠࣇਸ ࢶఖೠ ੉ਬ۽ ௼Ҋ, ୭न੉Ҋ, о੢ അप੸ੋ ࢚ടਸ ੜ ߈৔೮ӝ ⮶ޙী ࢎਊ೮׮Ҋ ੷੗ח ӝࣿೞ৓णפ׮. • ಣо ݫ౟ܼ਷ MRRҗ Recall@1k, NDCGܳ ࢎਊೞ৓णפ׮. • ؀ࠗ࠙੄ ࢿמ਷ Retrieval ী ؀ೠ ࢿמਸ ஏ੿ೞ৓Ҋ, ୶о੸ਵ۽ ઱য૓ 100ѐ੄ candidate ղীࢲ DR ݽ؛ਸ ੉ਊ೧ 
 ҙ۲ػ ޙࢲٜਸ Rerank ೞח מ۱ب э੉ Ѩૐೞ৓णפ׮. (಴ীࢲ RerankۄҊ ա৬ ੓ח ࠗ࠙) • DPRҗ زੌೞѱ, بݫੋ੄ ઁೠ੉ হח QAؘ੉ఠࣇੋ OpenQA task ؘ੉ఠࣇਵ۽ب ಣоܳ ࣻ೯ೞ৓णפ׮. 
 ಣо ߑध਷ Top-Nউী पઁ۽ ਋ܻо ఋѶ౴ ೞח passage о ನೣغয ੓ח૑ ইצ૑ ಣоೞח ݫ౟ܼਸ ࢎਊೞ৓णפ׮ Experiment
  20. Experiment

  21. • ӝઓ੄ ߑߨ਷ BM25۽ Document Retrieval ࣻ೯റ, BERT ۽ Reranking

    ೞח Two-Stage ߑߨਸ ࢎਊೞ৓णפ׮ • Inference दী ୨ 1.42 ୡ Ѧ۷णפ׮. • ߈ݶী ࠄ ֤ޙ਷ ANN ӝ߈ Dense Retrieval ਸ ࢎਊ೮ӝ ٸޙী ࠁ׮ ࡅܲ ࣘب੄ Inference о оמ೤פ׮.
 -> Inference दী 11.6ms ߆ী Ѧܻ૑ ঋ਺. Ӓۢীب Two-Stage ࠁ׮ ֫਷ ࢿמਸ ࠁৈષ Experiment
  22. • Dense Retrievalਸ In-Batch Negative ߑधਵ۽݅ ೟ण ೞח Ѫ਷ ೠ҅੼੉

    ࠙ݺ ઓ੤ೠ׮ • റࠁٜ р੄ ਋ࢶࣽਤܳ Ѿ੿ೞח מ۱੉ ࠗ઒ೞ׮. • ೟ण җ੿ীࢲ ഁтܻח റࠁ ޙࢲٜ੉ աৢ Ѫਸ о੿೧ࢲ, ૓૞ ੿׹੉ о੢ оӰب۾ ೟णਸ ೧ঠ ೠ׮. • ੉ܳ ਤ೧ࢲ ೟ण җ੿ীࢲ ୶ۿҗ زੌೞѱ ANN indexing ਸ ࣻ೯ೞҊ, negative ٜਸ retrieval۽ ࡳ ח ߑߨਸ ઁউೠ׮. ӒܻҊ ੉ܳ ࠺زӝ੸ਵ۽ ࣻ೯ೞৈࢲ োࣘ੸ੋ ೟णਸ ೡ ࣻ ੓ب۾ ೠ׮ • प೷ Ѿҗ ઁউೞח ೟ण ߑध੉ पઁ పझ௼ীࢲ ࠁ׮ ਋ࣻೠ ࢿ੸ਸ ࠁৈ઱঻׮. • Ѩ࢝ Retrieval పझ௼৬, Open-Domain QAীࢲ Document Retrieval ࢿמਸ ಣоೞ৓׮ Conclusion
  23. • https://codertimo.github.io/2020/07/20/ANN-negative-contrastive-learning/ ଵҊ੗ܐ