Slide 1

Slide 1 text

Full-Text Search Philipp Krenn @xeraa

Slide 2

Slide 2 text

ViennaDB Papers We Love Vienna

Slide 3

Slide 3 text

Infrastructure | Developer Advocate

Slide 4

Slide 4 text

Full-Text Search (FTS) Databases vs Full-Text Search

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

Before FTS
Regular expressions !

Slide 7

Slide 7 text

Before FTS
Arrays with relevant terms !
https://docs.mongodb.com/manual/tutorial/model-data-for-keyword-search/

Slide 8

Slide 8 text

{
  title : "Moby-Dick",
  author : "Herman Melville",
  published : 1851,
  ISBN : 0451526996,
  topics : [ "whaling", "allegory", "revenge", "American", "novel", "nautical", "voyage", "Cape Cod" ]
}
db.volumes.createIndex({ topics: 1 })
db.volumes.findOne({ topics : "voyage" }, { title: 1 })

Slide 9

Slide 9 text

No
Stemming
Fuzziness
Synonyms
Ranking
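A minimal sketch of these gaps in plain JavaScript, with no MongoDB involved. The helper `matchesTopic` is hypothetical; it only imitates the exact equality match that a query on the `topics` array performs, and the misspelled and synonym search words are made up for illustration.

```javascript
// Plain-JavaScript stand-in for an equality match on a keyword array.
// `matchesTopic` is a hypothetical helper, not a MongoDB API.
const volume = {
  title: "Moby-Dick",
  topics: ["whaling", "allegory", "revenge", "American", "novel", "nautical", "voyage", "Cape Cod"],
};

function matchesTopic(doc, term) {
  // Exact string comparison only: no stemming, fuzziness, or synonyms.
  return doc.topics.includes(term);
}

const results = {
  exact: matchesTopic(volume, "voyage"),    // stored keyword
  plural: matchesTopic(volume, "voyages"),  // would need stemming
  typo: matchesTopic(volume, "voyaeg"),     // would need fuzziness
  synonym: matchesTopic(volume, "journey"), // would need synonyms
};
console.log(results);
```

Each lookup is a plain yes or no, so there is also nothing to rank the matching documents by.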

Slide 10

Slide 10 text

Finally FTS
230+ votes
Created 2009
Resolved 2013
https://jira.mongodb.org/browse/SERVER-380

Slide 11

Slide 11 text

FTS in MongoDB
Beta since 2.4
Stable since 3.0
80% solution; for more, use Elasticsearch

Slide 12

Slide 12 text

FTS in MongoDB
In Latin alphabets
Case insensitive (default in 3.2)
[A-z] other characters removed (default in 3.2)

Slide 13

Slide 13 text

$text
Updated version in MongoDB 3.2
{
  $text: {
    $search: "<string>",
    $language: "<string>",
    $caseSensitive: <boolean>,
    $diacriticSensitive: <boolean>
  }
}

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

Example text
These are not the droids you are looking for.

Slide 16

Slide 16 text

Tokenizer
these are not the droids you are looking for

Slide 17

Slide 17 text

Stop Words
droids looking

Slide 18

Slide 18 text

Stemming
droid look
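The three steps above (tokenizer, stop words, stemming) can be sketched as a toy pipeline in plain JavaScript. The stop-word list and the suffix rules below are assumptions chosen to reproduce the example sentence; they are far smaller than a real analyzer and are not the stemmer MongoDB ships.

```javascript
// Toy analysis chain: tokenize -> drop stop words -> stem.
// Stop-word list and suffix rules are illustrative assumptions only.
const STOP_WORDS = new Set(["these", "are", "not", "the", "you", "for"]);

function tokenize(text) {
  // Lowercase, then split on anything that is not a letter or digit.
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

function removeStopWords(tokens) {
  return tokens.filter((t) => !STOP_WORDS.has(t));
}

function stem(token) {
  // Naive suffix stripping; a real stemmer applies many more rules.
  if (token.length > 5 && token.endsWith("ing")) return token.slice(0, -3);
  if (token.length > 3 && token.endsWith("s")) return token.slice(0, -1);
  return token;
}

function analyze(text) {
  return removeStopWords(tokenize(text)).map(stem);
}

const tokens = tokenize("These are not the droids you are looking for.");
const kept = removeStopWords(tokens);
const stemmed = kept.map(stem);
console.log(tokens.join(" "));  // after the tokenizer
console.log(kept.join(" "));    // after stop-word removal
console.log(stemmed.join(" ")); // after stemming
```

Running the same `analyze` function on the search input is what would let a query for "looks" or "looking" meet the stored token "look".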

Slide 19

Slide 19 text

Let's try it
> db.starwars.ensureIndex({ quote: "text" })
{
  "createdCollectionAutomatically": true,
  "numIndexesBefore": 1,
  "numIndexesAfter": 2,
  "ok": 1
}

Slide 20

Slide 20 text

Let's try it
> db.starwars.getIndices()
[
  {
    "v": 1,
    "key": { "_id": 1 },
    "name": "_id_",
    "ns": "starwars.starwars"
  },
  {
    "v": 1,
    "key": { "_fts": "text", "_ftsx": 1 },
    "name": "quote_text",
    "ns": "starwars.starwars",
    "weights": { "quote": 1 },
    "default_language": "english",
    "language_override": "language",
    "textIndexVersion": 3
  }
]

Slide 21

Slide 21 text

Let's try it
> db.starwars.insert({ quote: "These are not the droids you are looking for." })
Inserted 1 record(s) in 39ms
WriteResult({ "nInserted": 1 })

Slide 22

Slide 22 text

Let's try it
> db.starwars.find({ $text: { $search: "droid" }})
{
  "_id": ObjectId("574c50c3920246255ce2ad81"),
  "quote": "These are not the droids you are looking for."
}
Fetched 1 record(s) in 35ms
> db.starwars.find({ $text: { $search: "look" }})
{
  "_id": ObjectId("574c50c3920246255ce2ad81"),
  "quote": "These are not the droids you are looking for."
}
Fetched 1 record(s) in 4ms
> db.starwars.find({ $text: { $search: "you" }})
Fetched 0 record(s) in 1ms

Slide 23

Slide 23 text

Let's try it
> db.starwars.find({ $text: { $search: "look" }}).explain()
{
  "queryPlanner": {
    "plannerVersion": 1,
    "namespace": "starwars.starwars",
    "indexFilterSet": false,
    "parsedQuery": {
      "$text": {
        "$search": "look",
        "$language": "english",
        "$caseSensitive": false,
        "$diacriticSensitive": false
      }
    },
    ...

Slide 24

Slide 24 text

Let's try it
...
"parsedTextQuery": {
  "terms": [ "look" ],
  "negatedTerms": [ ],
  "phrases": [ ],
  "negatedPhrases": [ ]
},
...

Slide 25

Slide 25 text

Let's try it
> db.starwars.find({ $text: { $search: "-look" } }).explain()
...
"parsedTextQuery": {
  "terms": [ ],
  "negatedTerms": [ "look" ],
  "phrases": [ ],
  "negatedPhrases": [ ]
},
...
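The four buckets in `parsedTextQuery` can be imitated with a simplified sketch in plain JavaScript. `parseSearch` is a hypothetical helper, not MongoDB's parser: it skips stemming and stop-word removal, and because the slides do not show how the words inside a phrase are treated, it keeps them out of `terms`.

```javascript
// Simplified sketch: split a $search string into the four buckets
// that explain() reports. Hypothetical helper, not MongoDB's parser.
function parseSearch(search) {
  const parsed = { terms: [], negatedTerms: [], phrases: [], negatedPhrases: [] };
  // An optional leading "-", then either a quoted phrase or a bare word.
  const pattern = /(-?)"([^"]*)"|(-?)([^\s"]+)/g;
  let m;
  while ((m = pattern.exec(search)) !== null) {
    if (m[2] !== undefined) {
      // Quoted phrase, negated when prefixed with "-".
      (m[1] === "-" ? parsed.negatedPhrases : parsed.phrases).push(m[2]);
    } else {
      // Bare word, negated when prefixed with "-".
      (m[3] === "-" ? parsed.negatedTerms : parsed.terms).push(m[4].toLowerCase());
    }
  }
  return parsed;
}

console.log(parseSearch("look"));
console.log(parseSearch("-look"));
console.log(parseSearch('"look droid" -father'));
```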

Slide 26

Slide 26 text

Let's try it
> db.starwars.find({ $text: { $search: "look" }}, {score: {$meta: "textScore"}})
{
  "_id": ObjectId("574c50c3920246255ce2ad81"),
  "quote": "These are not the droids you are looking for.",
  "score": 0.75
}
Fetched 1 record(s) in 8ms
> db.starwars.find({ $text: { $search: "looks" }}, {score: {$meta: "textScore"}})
    .sort({ score: { $meta: "textScore" } }).limit(1)
{
  "_id": ObjectId("574c50c3920246255ce2ad81"),
  "quote": "These are not the droids you are looking for.",
  "score": 0.75
}
Fetched 1 record(s) in 5ms

Slide 27

Slide 27 text

Indexing
String or array of strings only
Optional language or translations
Optional weighting if multiple fields indexed

Slide 28

Slide 28 text

Indexing
> db.allFields.ensureIndex({ "$**": "text"}, { default_language: "german" })

Slide 29

Slide 29 text

Queries
// OR
> db.starwars.find({ $text: { $search: "look droid" } })
// AND but without input stemming
> db.starwars.find({ $text: { $search: "\"look\" \"droid\"" } })
// Negation
> db.starwars.find({ $text: { $search: "look -droid" } })
// Phrase
> db.starwars.find({ $text: { $search: "\"look droid\"" } })
// Translation
> db.starwars.find({ $text: { $search: "buscar", $language: "es" } })

Slide 30

Slide 30 text

Score
> db.starwars.find({ $text: { $search: "father look" } }, { score: { $meta: "textScore" } })
{
  "_id": ObjectId("574c83c9920246255ce2ad82"),
  "quote": "These are not the droids you are looking for",
  "score": 0.75
}
{
  "_id": ObjectId("574c8712920246255ce2ad83"),
  "quote": "I am your father",
  "score": 1
}
{
  "_id": ObjectId("574c8763920246255ce2ad84"),
  "quote": "Look at me father",
  "score": 1.5
}

Slide 31

Slide 31 text

Score
https://github.com/mongodb/mongo/blob/v3.2/src/mongo/db/fts/fts_spec.cpp#L219
double coeff = (0.5 * data.count / numTokens) + 0.5;
data.count: matches
numTokens: stemmed words

Slide 32

Slide 32 text

Score
father look
"These are not the droids you are looking for"
droid look == 1 match, 2 tokens
coeff:

Slide 33

Slide 33 text

Score
father look
"I am your father"
father == 1 match, 1 token
coeff:

Slide 34

Slide 34 text

Score
https://github.com/mongodb/mongo/blob/v3.2/src/mongo/db/fts/fts_spec.cpp#L228
score += (weight * data.freq * coeff * adjustment);
weight: method parameter
data.freq, adjustment: 1

Slide 35

Slide 35 text

Score
father look
"Look at me father"
look father == 1 match, 2 tokens
look father == 1 match, 2 tokens
coeff:

Slide 36

Slide 36 text

Score
father look
each score:
Sum:
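The arithmetic on the scoring slides can be replayed with a small sketch in plain JavaScript that applies the two quoted formulas to tokens that are already stemmed. `scoreDocument` is a hypothetical helper, not MongoDB code. It assumes a weight of 1 (the index weight shown for `quote`) and fixes `data.freq` and `adjustment` at 1 as the slide states, so it only covers terms that occur once per document.

```javascript
// Replays the two formulas quoted from fts_spec.cpp on already-analyzed tokens.
// Assumptions: weight = 1, and data.freq and adjustment fixed at 1 as on the slide.
// `scoreDocument` is a hypothetical helper, not MongoDB code.
function scoreDocument(docTokens, queryTerms, weight = 1) {
  const numTokens = docTokens.length; // stemmed words in the document
  let score = 0;
  for (const term of new Set(queryTerms)) {
    const count = docTokens.filter((t) => t === term).length; // matches
    if (count === 0) continue; // term does not occur in this document
    const coeff = (0.5 * count) / numTokens + 0.5;
    const freq = 1;       // per the slide; only valid for single occurrences
    const adjustment = 1; // per the slide
    score += weight * freq * coeff * adjustment;
  }
  return score;
}

const query = ["father", "look"];
const scores = {
  droids: scoreDocument(["droid", "look"], query),    // "These are not the droids you are looking for"
  father: scoreDocument(["father"], query),           // "I am your father"
  lookAtMe: scoreDocument(["look", "father"], query), // "Look at me father"
};
console.log(scores);
```

Under these assumptions the sketch yields 0.75, 1, and 1.5, the same values as the `textScore` output shown for the "father look" query.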

Slide 37

Slide 37 text

Limitations
B-tree vs inverted index

Slide 38

Slide 38 text

Missing features
Fuzziness
Suggestions
Highlighting
Synonyms

Slide 39

Slide 39 text

Thanks! Questions? @xeraa