Scatter Lab Inc.
August 07, 2020

# Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval

## Transcript

1. ### ML Seminar S6E3: Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval

Junseong Kim, ML Research Scientist, Pingpong
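2. ### Contents

• 1. Introduction: (1) problem definition, (2) limitations of existing methods
• 2. Approach: (1) overview of the approach, (2) asynchronous training routine
• 3. Experiment: (1) experimental design, (2) results, (3) implementation details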
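3. ### Problem Definition [1/2]

• The problem this paper ultimately aims to solve is the Open-Domain Question Answering (QA) task.
• Open-Domain QA can be defined as follows: given a question that is not restricted to any particular domain, find the answer contained somewhere in a large collection (~1M+) of documents.
• For example, assuming every document on Wikipedia can be consulted, the task is to find the answer to "What percentage of all life in the universe did Thanos kill?"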
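4. ### Problem Definition [2/2]

• Deep-learning models can find more accurate answers, but running the model over every document (hundreds of millions or more) is very inefficient and makes real-time serving impossible.
• To overcome this speed limitation, prior work splits the problem into two stages:
• 1. Document Retrieval: find the documents that are relevant to the given query.
• 2. Reading Comprehension: derive the specific answer to the query by reading the relevant documents.
• The paper introduced today proposes a method for improving the performance of the Document Retrieval stage.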
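5. ### Limitations of Existing Methods [1/3]

• Most prior work relies mainly on lexical features for Document Retrieval.
• Examples: BM25, TF-IDF, keyword matching, and so on (the core functionality of Elastic Search).
• However, these methods cannot understand the implied meaning (semantics) of the query itself and find relevant answers on that basis.
• Example: Q. "Who is the boss of Tesla?" -> Searching with the keywords (Tesla, boss) does not find a document that contains both keywords.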
6. ### Limitations of Existing Methods [2/3]

• Recent work (Lee et al., 2019; Guu et al., 2020; Seo et al., 2019) encodes queries and documents into representations with BERT, which captures more semantic information.
• These methods use a Bi-Encoder model and are trained with In-Batch Negatives.
• After training, the Document Encoder is used to encode all documents in advance.
• At inference time only the query is encoded with BERT, and an Approximate Nearest Neighbor Search tool such as FAISS directly returns the Top-K documents closest to the query representation.
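The inference flow on this slide can be sketched in a few lines. This is a minimal illustration, not the paper's code: the vectors are hypothetical toy values standing in for BERT encoder outputs, and `top_k` is an exact inner-product search used here as a stand-in for an approximate index such as FAISS.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vec: Vector, doc_vecs: List[Vector], k: int) -> List[Tuple[int, float]]:
    """Return (doc index, score) for the k highest-scoring documents.

    Exact brute-force search; an ANN index would approximate this ranking
    without scoring every document.
    """
    scored = [(i, dot(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Document embeddings are computed once, offline (hypothetical toy values).
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]

# At query time only the query is encoded (hypothetical toy value).
query_vec = [0.9, 0.1]

results = top_k(query_vec, doc_vecs, k=2)
top_ids = [doc_id for doc_id, _ in results]
```

Because the document side is precomputed, the per-query cost is one encoder pass plus the nearest-neighbor lookup.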

9. ### Training Method: In-Batch Negative

Training dataset: four query-document pairs (Q1, D1), (Q2, D2), (Q3, D3), (Q4, D4). The encoded batch has shapes Q: (4, 512) and D: (4, 512).
10. ### Training Method: In-Batch Negative

Training dataset: (Q1, D1), (Q2, D2), (Q3, D3), (Q4, D4), with Q: (4, 512) and D: (4, 512). The score matrix is $Q \cdot D^T$, of shape (4, 4).
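11. ### Training Method: In-Batch Negative

The (4, 4) matrix $Q \cdot D^T$ has one row per query (Q1 to Q4) and one column per document (D1 to D4), so the entry in row i and column j is the score of query Qi against document Dj.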
12. ### Training Method: In-Batch Negative

A softmax is applied to each row of $Q \cdot D^T$, giving a (4, 4) matrix of probabilities. In the slide's example the matching pairs on the diagonal receive 0.5, 0.6, 0.4, and 0.7, and the remaining entries of each row are small values (0.1 to 0.3) so that every row sums to 1.
13. ### Training Method: In-Batch Negative

Training objective: in each row, the document paired with that query should receive the highest value. In the slide's target matrix the diagonal entries are 0.99 and all other entries are 0.01.
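The in-batch negative objective walked through on the slides above can be sketched as a row-wise softmax cross-entropy with the diagonal as the target. This is a minimal pure-Python illustration under the slides' setup (row i of `Q` is paired with row i of `D`); the function name and the toy vectors are assumptions for illustration, and real training would run on batched tensors.

```python
import math
from typing import List, Sequence

Matrix = List[Sequence[float]]


def in_batch_negative_loss(Q: Matrix, D: Matrix) -> float:
    """Mean cross-entropy over rows of softmax(Q @ D^T) with diagonal targets.

    For query i, document i is the positive and every other document in the
    batch serves as a negative.
    """
    n = len(Q)
    total = 0.0
    for i in range(n):
        # Row i of Q @ D^T: score of query i against every document in the batch.
        scores = [sum(q * d for q, d in zip(Q[i], D[j])) for j in range(n)]
        # Numerically stable log-sum-exp of the row.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        # Negative log-probability of the paired document.
        total += log_z - scores[i]
    return total / n


# Toy batches (hypothetical values): matched pairs vs. deliberately swapped pairs.
aligned_loss = in_batch_negative_loss([[5.0, 0.0], [0.0, 5.0]], [[5.0, 0.0], [0.0, 5.0]])
swapped_loss = in_batch_negative_loss([[5.0, 0.0], [0.0, 5.0]], [[0.0, 5.0], [5.0, 0.0]])
```

The loss is close to zero when each query scores its own document highest and grows when another document in the batch outscores it.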
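14. ### Limitations of Existing Methods [2/3]

• The authors point out a problem with the In-Batch Negatives used to train Dense Retrieval models.
• Their hypothesis is that In-Batch Negative training is effective for narrowing down roughly similar documents, but has a fundamental limitation when it comes to pinpointing the truly relevant document.
• The reason is that learning to pick the one relevant document out of completely unrelated candidates is different from learning to pick the truly relevant document out of candidates that all look relevant.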
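15. ### Limitations of Existing Methods [2/3]

• The authors analyze the representations of negative samples by visualizing them with t-SNE.
• The Random and BM25-based negatives commonly used so far are distributed very differently from the actual relevant documents.
• In addition, a model trained with Random negatives failed to catch the actually relevant documents when used for Dense Retrieval.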
16. ### Limitations of Existing Methods [2/3]

• The model must also be trained on the question "Which of these is the truly relevant document?"
17. ### Approach

• This paper proposes a new way of selecting the negative samples used during training.
• Approximate nearest neighbor Negative Contrastive Estimation (ANCE)
• It builds hard negative samples from the retrieval results of the model as it stands partway through training.
• The FAISS index is updated asynchronously every N steps, and the negative samples are refreshed continuously.
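The negative-selection idea on this slide can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the embeddings are hypothetical toy values, the search is exact rather than a FAISS index, and the refresh is shown as a simple step schedule, whereas the slide describes it running asynchronously alongside training. The names `mine_hard_negatives` and `refresh_steps` are made up for this sketch.

```python
from typing import List, Sequence, Set

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mine_hard_negatives(
    query_vec: Vector,
    doc_vecs: List[Vector],
    positive_ids: Set[int],
    num_negatives: int,
) -> List[int]:
    """Pick the top-ranked documents, under the current embeddings, that are not positives.

    These are the documents the current model confuses with the answer,
    which makes them harder negatives than random or in-batch ones.
    """
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: dot(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return [i for i in ranked if i not in positive_ids][:num_negatives]


def refresh_steps(total_steps: int, refresh_every: int) -> List[int]:
    """Steps at which the index and negatives would be rebuilt (every N steps)."""
    return [s for s in range(total_steps) if s % refresh_every == 0]


# Hypothetical embeddings from a mid-training checkpoint.
doc_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.3]]
query_vec = [1.0, 0.0]

hard_negatives = mine_hard_negatives(query_vec, doc_vecs, positive_ids={0}, num_negatives=2)
schedule = refresh_steps(total_steps=10, refresh_every=4)
```

Each refresh re-encodes the corpus with the latest checkpoint, so the mined negatives track what the model currently finds confusing.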

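19. ### Experiment

• The evaluation task is the TREC 2019 Deep Learning Track.
• It is a dataset of more than one million queries issued to the Bing search engine, each labeled with relevant documents.
• The authors state that they chose this dataset because it is large, recent, and reflects the most realistic setting.
• The evaluation metrics are MRR, Recall@1k, and NDCG.
• Most of the reported numbers measure retrieval performance. The authors additionally verify the DR model's ability to rerank relevant documents within a given set of 100 candidates (the part labeled "Rerank" in the table).
• As in DPR, they also evaluate on OpenQA task datasets, which are QA datasets with no domain restriction. The metric there checks whether the passage being targeted is included in the Top-N results.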
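The metrics named on this slide are straightforward to compute from a ranked list. The sketch below shows reciprocal rank (averaged over queries to obtain MRR), Recall@k, and the Top-N hit check; NDCG is omitted for brevity. The `cutoff` parameter, the function names, and the document ids are illustrative assumptions, since the slide does not state the cutoffs used.

```python
from typing import List, Set


def reciprocal_rank(ranked_ids: List[str], relevant_ids: Set[str], cutoff: int) -> float:
    """1 / rank of the first relevant document within the cutoff, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:cutoff], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


def top_n_hit(ranked_ids: List[str], target_ids: Set[str], n: int) -> bool:
    """True if any target passage appears in the top n results."""
    return any(doc_id in target_ids for doc_id in ranked_ids[:n])


# Hypothetical ranking for a single query.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

rr = reciprocal_rank(ranked, relevant, cutoff=10)
recall_2 = recall_at_k(ranked, relevant, k=2)
hit_1 = top_n_hit(ranked, relevant, n=1)
```

MRR over a query set is the mean of `reciprocal_rank` across queries, and the Top-N accuracy is the fraction of queries for which `top_n_hit` is true.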

21. ### Experiment

• The existing approach is a Two-Stage method: Document Retrieval with BM25, followed by Reranking with BERT.
• Its inference takes 1.42 seconds in total.
• In contrast, this paper uses ANN-based Dense Retrieval, which allows much faster inference.
• Inference takes only 11.6 ms, and the method still outperforms the Two-Stage approach.
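As a quick check on the two latency figures on this slide, the ratio works out to roughly two orders of magnitude (the inputs are the slide's numbers; the rounding is ours):

$$
\frac{1.42\ \text{s}}{11.6\ \text{ms}} = \frac{1420\ \text{ms}}{11.6\ \text{ms}} \approx 122
$$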
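22. ### Conclusion

• Training Dense Retrieval with In-Batch Negatives alone has a clear limitation.
• The resulting model lacks the ability to decide the priority among candidates.
• Training should assume that confusing candidate documents will appear, and should push the true answer to be the closest one.
• To this end, the paper proposes performing ANN indexing during training in the same way as at inference, and selecting negatives through retrieval. This is done asynchronously so that training can proceed continuously.
• In the experiments, the proposed training scheme achieved better results on real tasks.
• Document Retrieval performance was evaluated on a search retrieval task and on Open-Domain QA.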