Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Approximate Nearest Neighbor Negative Contrasti...

Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval

Scatter Lab Inc.

August 07, 2020
Tweet

More Decks by Scatter Lab Inc.

Other Decks in Research

Transcript

  1. MLࣁ޷ա S6E3 Approximate Nearest Neighbor 
 Negative Contrastive Learning for

    Dense Text Retrieval ӣળࢿ ML Research Scientist, Pingpong
  2. ݾର ݾର 1. Introduction 1. ޙઁ ੿੄ 2. ӝઓ ӝߨ੄

    ೠ҅ 2. Approach 1. ੽Ӕ ߑߨ ࣗѐ 2. ࠺زӝ ೟ण ܖ౯ 3. Experiment 1. प೷ ࢸ҅ 2. प೷ Ѿҗ 3. ҳഅ ࣁࠗࢎ೦
  3. • ࠄ ֤ޙীࢲ Ҿӓ੸ਵ۽ ಽҊ੗ ೞח ޙઁח Open-Domain Question Answering

    (QA) పझ௼ • Open-Domain QAח যڃ بݫੋী Ҵೠغয ੓૑ ঋ਷ ૕ޙਸ ؍઎ਸ ٸ, 
 ࠁਬೞҊ ੓ח ׮۝(~1M+)੄ ޙࢲٜ о਍ؘ ನೣغয ੓ח ੿׹ਸ ଺ח పझ௼۽ ੿੄ೡ ࣻ ੓णפ׮. • ৘ܳ ٜݶ ਤఃೖ٣ইী ઓ੤ೞח ݽٚ ޙࢲܳ ଵઑೡ ࣻ ੓׮ח о੿ ೞী
 “ఋ֢झח ੹ ਋઱੄ ݻಌࣃ౟੄ ࢤݺ୓ܳ લ৓য?” ী ؀ೠ ׹ਸ ଺ח Ѫ ੑפ׮. ޙઁ ੿੄ [1/2]
  4. • ٩۞׬ ӝ߈੄ ݽ؛ਸ ੉ਊ೧ࢲ ࠁ׮ ੿ഛೠ ੿׹ਸ ଺ਸ ࣻ

    ੓૑݅, ݽٚ ޙࢲ(+Nর)ী ؀೧ 
 ো࢑ਸ ࣻ೯ೞח Ѫ਷ ݒ਋ ࠺ബਯ ੸੉Ҋ, पदр ࢲ࠺झо ࠛоמೞ׮ח ೠ҅੼੉ ੓঻णפ׮. • ӝઓ੄ োҳٜ਷ ࣘب੸ ೠ҅੼ਸ ӓࠂೞӝ ਤ೧ ௼ѱ فо૑ stage ۽ ܻ࠙ೞৈ ޙઁܳ ಽҊ੗ ೞ৓णפ׮ • 1. Document Retrieval: ઱য૓ ૕੄ী ؀೧ࢲ ҙ۲੉ ੓ח ޙࢲٜਸ ଺ח ױ҅ • 2. Reading Comprehension: ઱য૓ ૕੄ী ؀ೠ ҳ୓੸ੋ ׹ਸ ҙ۲ ޙࢲܳ ଵઑೞৈ ب୹ೞח ݽ؛ • য়ט ࣗѐ೧ ܾ٘ ֤ޙ਷ ੉ Document Retrieval ੄ ࢿמ ೱ࢚ী ҙೠ ߑߨਸ ઁউ೤פ׮. ޙઁ ੿੄ [2/2]
  5. • ӝઓ੄ ؀ࠗ࠙੄ োҳীࢲח Document Retrieval ী Lexical Feature ܳ

    ઱۽ ࢎਊೞ৓णפ׮. • ৘द) BM25, TF-IDF, Keyword Matching ١١ (Elastic Search੄ ઱ػ ӝמ) • ೞ૑݅ ੉۞ೠ ߑߨ਷ ૕੄ ੗୓੄ ೣ୷੸ ੄޷(Semantic)ܳ ੉೧ೞҊ ҙ۲ػ ׹߸ਸ ଺ਸ ࣻח হणפ׮. • ৘द) Q. ־о పठۄ੄ ؀੢੉ঠ? 
 -> (పठۄ, ؀੢) ਵ۽ Ѩ࢝೧ب ف ఃਕ٘ܳ ನೣೞח ޙࢲܳ ଺ਸ ࣻ হ਺.. ӝઓ ߑߨ੄ ೠ҅੼ [1/3]
  6. • ୭Ӕ੄ োҳٜ਷(Lee et al., 2019; Guu et al., 2020;

    Seo et al. 2019) ૕੄৬ ޙࢲܳ BERTܳ ੉ਊ೧ Representation ਵ۽ ಴അೞৈ ࠁ׮ Semantic ೠ ੿ࠁܳ ನ଱ೡ ࣻ ੓ח ߑߨਸ ઁউೞ৓਺. • ੉۞ೠ ߑߨٜ਷ BI-Encoder ҳઑ੄ ݽ؛ਸ ࢎਊೞݴ, In-Batch Negative ۽ ೟णਸ ࣻ೯೤פ׮. • ೟ण੉ ৮ܐػ ੉റীח Document Encoderܳ ੉ਊ೧ࢲ ޷ܻ ޙࢲٜਸ encoding ೧ ֬਺ • Inference दীח ૕੄݅ BERT۽ Representation ਸ ҅࢑ೞҊ FAISS ৬ э਷ Approximate Nearest Neighbor Search ోਸ ੉ਊ೧ ߄۽ ૕੄ Representation җ оө਍ Top-Kѐ੄ ޙࢲܳ ଺਺ ӝઓ ߑߨ੄ ೠ҅੼ [2/3]
  7. ೟णߑߨ: In-Batch Negative Q1 D1 Q2 D2 Q3 D3 Q4

    D4 ೟ण ؘ੉ఠࣇ Q: (4, 512) D: (4, 512)
  8. ೟णߑߨ: In-Batch Negative Q1 D1 Q2 D2 Q3 D3 Q4

    D4 ೟ण ؘ੉ఠࣇ Q: (4, 512) D: (4, 512) Q ⋅ DT -> (4,4)
  9. ೟णߑߨ: In-Batch Negative Q1 Q2 Q3 Q4 D1 D2 D3

    D4 Q1 D1 Q2 D2 Q3 D3 Q4 D4 ೟ण ؘ੉ఠࣇ Q: (4, 512) D: (4, 512) Q ⋅ DT -> (4,4)
  10. ೟णߑߨ: In-Batch Negative Q1 Q2 Q3 Q4 D1 D2 D3

    D4 Q1 D1 Q2 D2 Q3 D3 Q4 D4 ೟ण ؘ੉ఠࣇ Q: (4, 512) D: (4, 512) 0.5 0.6 0.4 0.7 0.2 0.1 0.2 0.1 0.2 0.1 0.3 0.1 0.2 0.1 0.1 0.1 Softmax Q ⋅ DT п Row ߹۽ Softmaxܳ ஂೣ -> (4,4)
  11. ೟णߑߨ: In-Batch Negative Q1 Q2 Q3 Q4 D1 D2 D3

    D4 Q1 D1 Q2 D2 Q3 D3 Q4 D4 ೟ण ؘ੉ఠࣇ Q: (4, 512) D: (4, 512) 0.99 0.99 0.01 0.99 0.99 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 Q ⋅ DT ೟ण ݾ಴: 
 п Row ীࢲ ؀਽غח ޙࢲо ઁੌ ֫਷ чਸ ыب۾ -> (4,4)
  12. • ੷੗ח Dense Retrieval ݽ؛ਸ ೟णೡ ٸ ࢎਊೞח In-Batch Negativeী

    ޙઁо ੓਺ਸ ૑੸೤פ׮. • In-Batch Negative ೟ण ߑߨ਷ যו੿ب ਬࢎೠ ޙࢲٜਸ ୶ܻחؘীח ਬബೞ૑݅, 
 ૓૞ ҙ۲੉ ੓ח ޙࢲܳ ੿ഛೞѱ ఐ࢝ೞӝীח Ӕࠄ੸ੋ ೠ҅о ੓ਸ Ѫ੉ۄח оࢸਸ ࣁ਒פ׮. • ৵ջೞݶ ৮੹ ҙ۲੉ হח റࠁٜ ઺ী, ҙ۲੉ ੓ח ೞա੄ ޙࢲܳ ࡳب۾ ೟णೞח Ѫҗ
 ҙ۲ࢿ ੓ח റࠁٜ ઺ীࢲ ૓૞ ҙ۲੉ ੓ח ೞա੄ ޙࢲܳ ࡳب۾ ೟णೞח Ѫ਷ ׮ܰӝ ٸޙੑפ׮. ӝઓ ߑߨ੄ ೠ҅੼ [2/3]
  13. • negative sample ٜ੄ representation ਸ t-SNEਵ۽ दпചೞৈ ࠙ࢳਸ ࣻ೯ೞ৓णפ׮.

    • ӝઓী ઱۽ ࢎਊೞ؍ Random, BM25 ӝ߈੄ Negative ٜ਷ पઁ Relevant Document ৬ ࠙ನ ର੉о ब೮਺ • ژೠ Random Negative ۽ ೟णػ ݽ؛۽ Dense Retrieval ਸ ࣻ೯द, पઁ ҙ۲ ޙࢲٜਸ ந஖ೞ૑ ޅ೮਺. ӝઓ ߑߨ੄ ೠ҅੼ [2/3]
  14. • negative sample ٜ੄ representation ਸ t-SNEਵ۽ दпചೞৈ ࠙ࢳਸ ࣻ೯ೞ৓णפ׮.

    • ӝઓী ઱۽ ࢎਊೞ؍ Random, BM25 ӝ߈੄ Negative ٜ਷ पઁ Relevant Document ৬ ࠙ನ ର੉о ब೮਺ • ژೠ Random Negative ۽ ೟णػ ݽ؛۽ Dense Retrieval ਸ ࣻ೯द, पઁ ҙ۲ ޙࢲٜਸ ந஖ೞ૑ ޅ೮਺. ӝઓ ߑߨ੄ ೠ҅੼ [2/3] “੉ উীࢲ ޤо ૓૞ ҙ۲ ޙࢲջ!”
 ؀ೠ Ѫب ೟णਸ ࣻ೯೧ঠ ೠ׮!
  15. • ࠄ ֤ޙীࢲח ೟णद ࢎਊغח negative sampleਸ ࡳח ࢜۽਍ ߑߨਸ

    ઁউ೤פ׮ • Approximate nearest neighbor Negative Contrastive Estimation(ANCE) • ೟ण ઺р ݽ؛੄ retrieval ػ Ѿҗܳ ੉ਊ೧ࢲ য۰਍ negative sampleਸ ݅٘ח ߑߨੑפ׮. • ࠺زӝ੸ਵ۽ faiss index ܳ N step ݃׮ সؘ੉౟ೞҊ, negative sample ਸ ૑ࣘ੸ਵ۽ јन೤פ׮ Approach
  16. • ಣо పझ௼ TREC 2019 Deep Learning Track ܳ ࢎਊೞ৓णפ׮.

    • Ѩ࢝ ূ૓ Bing ਵ۽ ٜযৡ ߔ݅ѐ ੉࢚੄ ૕੄ী ؀೧ࢲ ҙ۲ػ ޙࢲо ۨ੉࠶݂ غয ੓ח ؘ੉ఠࣇ • ੉ ؘ੉ఠࣇਸ ࢶఖೠ ੉ਬ۽ ௼Ҋ, ୭न੉Ҋ, о੢ അप੸ੋ ࢚ടਸ ੜ ߈৔೮ӝ ⮶ޙী ࢎਊ೮׮Ҋ ੷੗ח ӝࣿೞ৓णפ׮. • ಣо ݫ౟ܼ਷ MRRҗ Recall@1k, NDCGܳ ࢎਊೞ৓णפ׮. • ؀ࠗ࠙੄ ࢿמ਷ Retrieval ী ؀ೠ ࢿמਸ ஏ੿ೞ৓Ҋ, ୶о੸ਵ۽ ઱য૓ 100ѐ੄ candidate ղীࢲ DR ݽ؛ਸ ੉ਊ೧ 
 ҙ۲ػ ޙࢲٜਸ Rerank ೞח מ۱ب э੉ Ѩૐೞ৓णפ׮. (಴ীࢲ RerankۄҊ ա৬ ੓ח ࠗ࠙) • DPRҗ زੌೞѱ, بݫੋ੄ ઁೠ੉ হח QAؘ੉ఠࣇੋ OpenQA task ؘ੉ఠࣇਵ۽ب ಣоܳ ࣻ೯ೞ৓णפ׮. 
 ಣо ߑध਷ Top-Nউী पઁ۽ ਋ܻо ఋѶ౴ ೞח passage о ನೣغয ੓ח૑ ইצ૑ ಣоೞח ݫ౟ܼਸ ࢎਊೞ৓णפ׮ Experiment
  17. • ӝઓ੄ ߑߨ਷ BM25۽ Document Retrieval ࣻ೯റ, BERT ۽ Reranking

    ೞח Two-Stage ߑߨਸ ࢎਊೞ৓णפ׮ • Inference दী ୨ 1.42 ୡ Ѧ۷णפ׮. • ߈ݶী ࠄ ֤ޙ਷ ANN ӝ߈ Dense Retrieval ਸ ࢎਊ೮ӝ ٸޙী ࠁ׮ ࡅܲ ࣘب੄ Inference о оמ೤פ׮.
 -> Inference दী 11.6ms ߆ী Ѧܻ૑ ঋ਺. Ӓۢীب Two-Stage ࠁ׮ ֫਷ ࢿמਸ ࠁৈષ Experiment
  18. • Dense Retrievalਸ In-Batch Negative ߑधਵ۽݅ ೟ण ೞח Ѫ਷ ೠ҅੼੉

    ࠙ݺ ઓ੤ೠ׮ • റࠁٜ р੄ ਋ࢶࣽਤܳ Ѿ੿ೞח מ۱੉ ࠗ઒ೞ׮. • ೟ण җ੿ীࢲ ഁтܻח റࠁ ޙࢲٜ੉ աৢ Ѫਸ о੿೧ࢲ, ૓૞ ੿׹੉ о੢ оӰب۾ ೟णਸ ೧ঠ ೠ׮. • ੉ܳ ਤ೧ࢲ ೟ण җ੿ীࢲ ୶ۿҗ زੌೞѱ ANN indexing ਸ ࣻ೯ೞҊ, negative ٜਸ retrieval۽ ࡳ ח ߑߨਸ ઁউೠ׮. ӒܻҊ ੉ܳ ࠺زӝ੸ਵ۽ ࣻ೯ೞৈࢲ োࣘ੸ੋ ೟णਸ ೡ ࣻ ੓ب۾ ೠ׮ • प೷ Ѿҗ ઁউೞח ೟ण ߑध੉ पઁ పझ௼ীࢲ ࠁ׮ ਋ࣻೠ ࢿ੸ਸ ࠁৈ઱঻׮. • Ѩ࢝ Retrieval పझ௼৬, Open-Domain QAীࢲ Document Retrieval ࢿמਸ ಣоೞ৓׮ Conclusion