Dealing with a search engine in your application - a Solr approach for beginners

Slide 1

Slide 1 text

Dealing with a search engine in your application a Solr approach for beginners Elaine Naomi Watanabe

Slide 2

Slide 2 text

Elaine Naomi Watanabe Full-stack developer (Playax) Master's degree in Computer Science (IME-USP) Passionate about: Web Development, Agile, Cloud Computing, DevOps, NoSQL and RDBMS

Slide 3

Slide 3 text

ANALYZING BILLIONS OF DATA TO HELP ARTISTS AND MUSIC PROFESSIONALS TO DEVELOP THEIR AUDIENCE BIG DATA + MUSIC + TECH = <3

Slide 4

Slide 4 text

AGENDA

Slide 5

Slide 5 text

Searching Problem Introduction Information Retrieval Basic concepts Apache Solr How to configure Sunspot Gem Integrating with Ruby on Rails Next Steps Including references SPOILER ALERT

Slide 6

Slide 6 text

THE SEARCHING PROBLEM

Slide 7

Slide 7 text

or did you mean... HOW TO SEARCH LIKE THE GOOGLE?

Slide 8

Slide 8 text

A LIST OF SONGS… TITLES, ARTISTS, LYRICS... IMAGINE IN OUR CONTEXT…

Slide 9

Slide 9 text

"Imagine all the people living life in peace" (Imagine - John Lennon) "I'm a radioactive, radioactive" (Radioactive - Imagine Dragons) "Welcome to the jungle, watch it bring you to your knees" (Welcome to the jungle - Guns N' Roses) A little little sample of songs...

Slide 10

Slide 10 text

A SQL LIKE statement is enough? SEARCHING TERM: "IMAGINE" SELECT * FROM songs WHERE title LIKE '%IMAGINE%' OR artist LIKE '%IMAGINE%' OR lyrics LIKE '%IMAGINE%';

Slide 11

Slide 11 text

Slide 12

Slide 12 text

A SQL LIKE statement is really enough?? SEARCHING TERM: "IMAGINE PEOPLE"

Slide 13

Slide 13 text

Slide 14

Slide 14 text

A SQL LIKE statement is really enough?? SEARCHING TERM: "IMAGINE PEOPLE" SELECT * FROM songs WHERE title LIKE '%IMAGINE%PEOPLE%' OR artist LIKE '%IMAGINE%PEOPLE%' OR lyrics LIKE '%IMAGINE%PEOPLE%';

Slide 15

Slide 15 text

USER x YOUR APP A BUG SEARCH TOOL

Slide 16

Slide 16 text

When LIKE STATEMENT is not enough... SEARCH TERMS: "Dragons Imagine" "Imagine John" "Imagine JONH" <- TYPO!

Slide 17

Slide 17 text

When only Yahoo! Answers is the solution... Ueca tudi diango... tanananananananaann nisss ♬ welcome to the jungle, watch it bring you to your knees ♬ (╯°▽°)╯ ︵ ┻━┻

Slide 18

Slide 18 text

INFORMATION RETRIEVAL

Slide 19

Slide 19 text

Unstructured data Large number of documents

Slide 20

Slide 20 text

IN THE PAST... List all documents that match a search query was enough… However, in a Big Data era…

Slide 21

Slide 21 text

NOWADAYS … Ranking documents by their relevance for a search query is the most important goal.

Slide 22

Slide 22 text

Basic concepts

Slide 23

Slide 23 text

TOKENIZATION: Tokens ~> Words A list words of "A list of words!" Tokens semantic units

Slide 24

Slide 24 text

DON T PANIC "DON'T PANIC" Is it enough to remove punctuation and spaces? "do not panic" do not panic How to tokenize contractions? Are all of them semantic units? Are they same tokens? don't = do not DON'T PANIC

Slide 25

Slide 25 text

Imagine Dragons "Imagine Dragons" Imagine Dragons Is it enough to remove punctuation and spaces? or

Slide 26

Slide 26 text

you know "you-know-who" you-know-who Is it enough to remove punctuation and spaces? or who

Slide 27

Slide 27 text

"Fullstack developer" "Full-stack dev" "Full stack developer" Is it enough to remove punctuation and spaces? Fullstack Full developer stack developer Full-stack developer For a user, these terms should return the same documents, isn't it?

Slide 28

Slide 28 text

30 seconds to Mars Thirty seconds to Mars November, 18th, 2017 2017-11-18 SP São Paulo How to deal with numbers and abbreviations?

Slide 29

Slide 29 text

Kaminari Is a gem or thunder in japanese? Windows Is it plural of window or about the company? About the semantics of the original term and its normalized token...

Slide 30

Slide 30 text

音楽 ONGAKU おんがく SAME LANGUAGE, SAME PRONUNCIATION DIFFERENT ALPHABETS

Slide 31

Slide 31 text

STOP WORDS: extremely common words In English: a, an, the, and, or, are, as, at, by, for, from, of ... In Portuguese: um, uma, a, o, as, os, é, são, por, de, da, do, se …

Slide 32

Slide 32 text

STOP WORDS: extremely common words A list words of list words meaningful tokens

Slide 33

Slide 33 text

Stop words, diacritics, case folding... Stop word removal Case folding normalization Diacritics removal HELLO WORLD Hello World hello world hello world naive naïve naive roses are red red roses roses red

Slide 34

Slide 34 text

When not to normalize tokens… The Who (a band) Se (Brazilian song, from Djavan) Strings solely composed by stop words Different meanings for words with and without diacritics In Spanish: peña means a "cliff" pena means "sorrow" When not to set all characters to lowercase General Motors Windows Apple

Slide 35

Slide 35 text

LEMMATIZATION / STEMMING To reduce a token to its base form

Slide 36

Slide 36 text

LEMMATIZATION: based on a vocabulary am, are, is be sou, somos, foi, é ser car, cars, car’s, cars’ car English Portuguese carros, carro carro

Slide 37

Slide 37 text

STEMMING Heuristic process that chops off the ends of words cats cat ponies poni Increase the number of returned documents. However, harming precision...

Slide 38

Slide 38 text

STEMMING Heuristic process that chops off the ends of words amor amor amores amora operating operat system Portuguese system English It means love It's a Brazilian berry not so meaningful tokens

Slide 39

Slide 39 text

SYNONYMS bike bicycle indivíduo pessoa

Slide 40

Slide 40 text

Bag of words: List of keywords Ordering of words is ignored! e.g. Imagine Dragons Dragons Imagine Phrase queries: Order matters! Restrict searches e.g. "Imagine Dragons"

Slide 41

Slide 41 text

RELEVANCE term frequency (tf) total of occurrences of a term in a document inverse document frequency (idf) how rare is a term in all indexed documents

Slide 42

Slide 42 text

RELEVANCE tf-idf = tf x idf function that balances the term frequency in a document within how rare is term in a collection

Slide 43

Slide 43 text

Boolean Model Probabilistic Model PageRank ...

Slide 44

Slide 44 text

Evaluating a searching model

Slide 45

Slide 45 text

Evaluation method We need a test dataset with: 1. A document collection 2. A collection of queries 3. A set of relevance judgments, for each query, a list of relevant and non-relevant documents TP: True Positive TN: True Negative FP: False Positive FN: False Negative

Slide 46

Slide 46 text

ACCURACY TP + TN TP + FP + TN + FN

Slide 47

Slide 47 text

PRECISION TP TP + FP # Corrected Matches / # Total Results Returned

Slide 48

Slide 48 text

RECALL TP TP + FN # Corrected Matches / (# Corrected Matches + # Missed Matches)

Slide 49

Slide 49 text

F1 SCORE 2 * (RECALL + PRECISION) (RECALL + PRECISION)

Slide 50

Slide 50 text

When a model is good enough for an app? You can choose the model with the best F1 score, for example. However, there is no universal solution It is an incremental process You should tune it based on users' information needs Usability tests is also a good way to evaluate a model

Slide 51

Slide 51 text

Adding a search engine to my app

Slide 52

Slide 52 text

FULL TEXT SEARCH IN MARIADB... CREATE TABLE `songs` ( `id` int NOT NULL AUTO_INCREMENT PRIMARY KEY, `title` varchar(300), `artist` varchar(255), `genre` varchar(255), `lyrics` text ) ENGINE=InnoDB; CREATE FULLTEXT INDEX songs_title_idx ON songs (title); CREATE FULLTEXT INDEX songs_artist_idx ON songs (artist); CREATE FULLTEXT INDEX songs_lyrics_idx ON songs (lyrics); CREATE FULLTEXT INDEX songs_genre_idx ON songs (genre); FTS

Slide 53

Slide 53 text

FULL TEXT SEARCH IN MARIADB... SELECT * FROM songs WHERE MATCH (title,artist, lyrics) AGAINST ('imagine' IN NATURAL LANGUAGE MODE); SELECT * FROM songs WHERE MATCH (title,artist, lyrics) AGAINST ('imagine' IN BOOLEAN MODE); CREATE FULLTEXT INDEX songs_all_idx ON songs (title,artist,lyrics); default mode

Slide 54

Slide 54 text

FULL TEXT SEARCH IN MARIADB... SELECT * FROM songs WHERE MATCH (title,artist, lyrics) AGAINST ('imagine dragons'); Returned rows: Radioactive - Imagine Dragons Imagine - John Lennon

Slide 55

Slide 55 text

FULL TEXT SEARCH IN MARIADB... SELECT * FROM songs WHERE MATCH (title,artist, lyrics) AGAINST ('+imagine +dragons') IN BOOLEAN MODE); Radioactive - Imagine Dragons

Slide 56

Slide 56 text

FULL TEXT SEARCH IN MARIADB... SELECT * FROM songs WHERE MATCH (title,artist, lyrics) AGAINST ('"imagine dragons"'); Radioactive - Imagine Dragons

Slide 57

Slide 57 text

FULL TEXT SEARCH IN MARIADB... SELECT * FROM songs WHERE MATCH(genre) AGAINST('alternative'); SELECT * FROM songs WHERE MATCH(genre) AGAINST('music'); SELECT * FROM songs WHERE MATCH(genre) AGAINST('alternative' WITH QUERY EXPANSION); SELECT * FROM songs WHERE MATCH(genre) AGAINST('music' WITH QUERY EXPANSION); Imagine Dragons John Lennon Imagine Dragons John Lennon Imagine Dragons - Alternative Rock John Lennon - Rock music, Pop music

Slide 58

Slide 58 text

Why to use an external search engine? Spell checking! Spell checking! Or did you mean… search like Google? ♡

Slide 59

Slide 59 text

Why to use an external search engine? You can use spell checking! You can also: - Add multivalued fields (document oriented database) - Add new algorithms to the databases - Customize stop words, stemming analyzers - Use fuzziness functions - Boost some documents/fields according to the search

Slide 60

Slide 60 text

Apache Solr and ElasticSearch Based on Apache Lucene Document oriented databases (welcome to polyglot persistence!) It is not a relational database, ok? No ACID, sorry! Developed to be scalable Apache Solr has a better documentation +50 ES has native support to Structured Query DSL +1 ES is better for analytic queries

Slide 61

Slide 61 text

ElasticSearch DSL // artist = John Lennon AND (genres = rock OR genres = pop) // AND NOT(nome = imagine) GET /songs/v1/_search { "query" : { "bool": { "must": {"match": {"artist": "John Lennon" }}, "should": [ {"match": {"genres": "rock" }}, {"match": {"genres": "pop" }} ], "must_not": {"match": {"nome": "imagine"}} } } }

Slide 62

Slide 62 text

Our choice: Apache Solr Apache Solr is Open Source and Open Development +1000 Latest release: 7.1.0 (October 17th, 2017)

Slide 63

Slide 63 text

Apache Solr

Slide 64

Slide 64 text

Installing for development environment... docker run --name my_solr -p 8983:8983 -d solr https://hub.docker.com/r/risdenk/docker-solr/

Slide 65

Slide 65 text

localhost:8983

Slide 66

Slide 66 text

Creating a core docker exec -it my_solr solr create_core -c development core ~> database or table document ~> a row from a table schemaless!! core name

Slide 67

Slide 67 text

List of all cores

Slide 68

Slide 68 text

Menu options for each core

Slide 69

Slide 69 text

Creating a document... curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/development/update/json/docs' --data-binary ' { "id": "1", "title": "Song 1" }'

Slide 70

Slide 70 text

Zero documents?? Check!

Slide 71

Slide 71 text

Commit!! curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/development/update?commit=true' --data-binary ' { "commit": {} }'

Slide 72

Slide 72 text

Creating a document... curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/development/update?commit=true' --data-binary ' [{ "id": "1", "title": "Song 1" },{ "title": "Song 2" }]' Optional in insert

Slide 73

Slide 73 text

Our new documents!

Slide 74

Slide 74 text

Our new documents! title and title_str? dynamic fields *_str, *_i, ...

Slide 75

Slide 75 text

Updating a document... curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/development/update?commit=true' --data-binary ' [{ "id": "1", "title": "Song 3" }, { "title": "Song 3" }]'

Slide 76

Slide 76 text

id: 1 new doc

Slide 77

Slide 77 text

Documents menu - JSON

Slide 78

Slide 78 text

Deleting a document... curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/development/update' --data-binary ' { "delete": { "id":"1" }, "commit": {}, }'

Slide 79

Slide 79 text

Deleting ALL documents curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/development/update?commit=true' --data-binary ' { "delete": { "query": "*:*" } }'

Slide 80

Slide 80 text

Documents menu - Solr command

Slide 81

Slide 81 text

Searching

Slide 82

Slide 82 text

q (query) main query parameter fq (filter query) filter query (to reduce the dataset) fl (filter list) list of fields to return sort list of fields to sort the dataset Results are paginated QUERY

Slide 83

Slide 83 text

Basic queries List all documents (with pagination) curl 'http://localhost:8983/solr/development/select?q=*:*'

Slide 84

Slide 84 text

Basic queries List all documents (with pagination) curl http://localhost:8983/solr/development/select -d ' { query:"*:*" }'

Slide 85

Slide 85 text

My documents { "docs": [ { "title": ["Song 1"], "genre": "Rock", "year": 2010 }, { "title": ["Song 2"], "genre": "MPB", "year": 1990 }, { "title": ["Other music Rock"], "genre": "Pop", "year": 1970 }, { "title": ["My favorite songs"], "genre": "Rock Music", "year": 2011 } ] }

Slide 86

Slide 86 text

Fuzzy matching title:Song* 3 documents title:Song? 1 document title:Sonjs 0 documents title:Sonjs~1 1 document title:Sonjs~2 3 documents title:(my songs) 1 document title:"my songs" 0 documents title:"my songs"~2 1 document title:(-favorite +song*) 2 documents *:* 4 documents Wildcards: ? one letter * any number letter ~ query slop ( ) keyword search " " phrase query

Slide 87

Slide 87 text

Fuzzy matching title:"song" AND genre:"rock" 1 document (title:"song" AND genre:"rock") OR title:"track" 2 documents year: [1980 TO *] 3 documents genre:[Pop TO *] 3 documents Boosting: (title:music OR title:Rock)^1.5 (genre:music OR genre: Rock) 3 documents 1st: "Other music rock" (title:music OR title:Rock) (genre:music OR genre: Rock)^1.5 3 documents 1st: "My favorite songs"

Slide 88

Slide 88 text

Searching in all fields In your schema.xml, add: You can add but it is not recommended: Then, you can search without defining the default field

Slide 89

Slide 89 text

Analysis: list all indexing and querying transformations Indexing Transform. Querying Transform.

Slide 90

Slide 90 text

Customizing fields and their analyzers (schema.xml)

Slide 91

Slide 91 text

Spell Checking

Slide 92

Slide 92 text

Building the spell checking indexing curl --request GET --url 'http://localhost:8983/solr/development/select?q=*:*&spellcheck.build=t rue&spellcheck=true'

Slide 93

Slide 93 text

Suggestion: IMAGINE Searching: IMAGINA

Slide 94

Slide 94 text

Searching: DRAGOONS Suggestion: DRAGONS

Slide 95

Slide 95 text

Searching: IMAGINA DRAGOONS Suggestion: IMAGINE DRAGONS

Slide 96

Slide 96 text

Integrating with Ruby on Rails

Slide 97

Slide 97 text

Connecting through a REST Client … params = {q: 'title:song' } response = RestClient.get "http://localhost:8983/solr/development/select?#{params.to_param}" response_json = JSON.parse(response.body) items = response_json["response"]["docs"] [{"title"=>["Song 1"], "id"=>"eeb507c6-461f-4219-9f5a-50528340d84d", "_version_"=>1584234836063682560, "title_str"=>["Song 1"]}, {"title"=>["Song 2"], "id"=>"1b8bacc1-9ed9-4c85-922d-71b3472f9d44", "_version_"=>1584234836065779712, "title_str"=>["Song 2"]}] ヽ(•́o•̀)ノ

Slide 98

Slide 98 text

Sunspot Gem V. 2.2.7

Slide 99

Slide 99 text

Installing... gem 'sunspot_rails' rails generate sunspot_rails:install development: solr: hostname: solr port: 8983 path: /solr/playax log_level: INFO auto_index_callback: after_commit auto_remove_callback: after_commit config/sunspot.yml

Slide 100

Slide 100 text

Sunspot needs its own schema.xml. Follow this example in: elainenaomi/search_engine

Slide 101

Slide 101 text

Sunspot DSL - Defining the indexed fields class Song < ActiveRecord::Base searchable do text :title, stored: true text :lyrics, stored: false text :artist, stored: true string :genre, multiple: true, stored: true do genre.split(',') end end end Sunspot.index! Song.all

Slide 102

Slide 102 text

Bag of words: search = Song.search do fulltext 'imagine dragons' with :genre, 'Rock' without :genre, 'Pop' with(:year).less_than 2014 field_list :title, :artist order_by :title, :asc end songs = search.results Imagine (John Lennon) Radioactive ( Imagine Dragons)

Slide 103

Slide 103 text

Phrase queries: search = Song.search do fulltext "\"imagine dragons\"" with :genre, 'Rock' without :genre, 'Pop' with(:year).less_than 2014 field_list :title, :artist order_by :title, :asc end songs = search.results Radioactive ( Imagine Dragons)

Slide 104

Slide 104 text

Query Phrase Slop # Two words can appear between the words in the phrase, so # "imagine all the people" also matches, in addition to "imagine people" Song.search do fulltext '"imagine people"' do fields :lyrics query_phrase_slop 2 end end

Slide 105

Slide 105 text

Minimum Match Song.search do fulltext "dragons imagine test" do fields :artist, :title minimum_match '70%' end end Song.search do fulltext 'dragons imagine test' do fields :artist, :title boost_fields title: 2.0 minimum_match '60%' end end 1 document: Radioactive ( Imagine Dragons) 2 documents: 1st: Imagine (John Lennon) 2nd: Radioactive (Imagine Dragons) boost rounded down

Slide 106

Slide 106 text

Spell checking search = Sunspot.search(Song) do keywords 'Imagina Dragoons' spellcheck :count => 3 end search.spellcheck_suggestion_for('imagina') # => 'imagine' search.spellcheck_suggestions # => [{"word"=>"imagine", "freq"=>3}, {"word"=>"dragons", "freq"=>1}]

Slide 107

Slide 107 text

To test or not to test?

Slide 108

Slide 108 text

To test or not to test? Unit tests? No. Integration tests? Maybe… Search engines depends on terms frequency to ranking docs You will need all your dataset to compute precision, recall.. You can test only filter queries, indexing callbacks…

Slide 109

Slide 109 text

Summary

Slide 110

Slide 110 text

Summary The searching problem ● User: a bug search tool Adding a search engine to my app ● Full text search in MariaDB ● Apache Solr x ElasticSearch Apache Solr ● How to create cores ● CRUD operations Integrating with Rails ● Sunspot gem ● How to index, search and test

Slide 111

Slide 111 text

Keep in mind Always verify the user's information needs from your app E.g.: check if removing stop words, synonymous should be applied "No" Meghan Trainor "I am" - P.O.D E.g: which transformations your search engine should apply - Phonetic transformations? Custom language analyzers?

Slide 112

Slide 112 text

Keep in mind The information is not only on text files but also in audios, videos, images, etc.

Slide 113

Slide 113 text

Suggested topics for studying - Evaluation of available analyzers for FTS - Optimization of Performance (such as soft commit, lazy build indexes) - Distribution and replication through SolrCloud - Using of Machine Learning algorithms - Creation of custom function queries - Authentication - Integrating with Logstash and Kibana - Geospatial searches

Slide 114

Slide 114 text

Thank you! <3 github.com/elainenaomi slideshare.net/elainenaomi @elaine_nw

Slide 115

Slide 115 text

References Introduction to Information Retrieval Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze (2008) Solr in action Grainger, Trey, Timothy Potter, and Yonik Seeley (2014) Sunspot gem http://sunspot.github.io/ Uma introdução ao tema recuperação de informações textuais. Barth, F. J. (2013) 10 Reasons to Choose Apache Solr Over Elasticsearch (2016) https://dzone.com/articles/10-reasons-to-choose-apache-solr-over-elasticsearc

Slide 116

Slide 116 text

References Apache Solr vs Elasticsearch http://solr-vs-elasticsearch.com/ When to consider Solr https://stackoverflow.com/questions/4960952/when-to-consider-solr Indexing for full text search in PostgreSQL https://www.compose.com/articles/indexing-for-full-text-search-in-postgresql/ PolyglotPersistence https://martinfowler.com/bliki/PolyglotPersistence.html Yahoo! Answers: Qual o nome desta Música? https://br.answers.yahoo.com/question/index?qid=20080627085726AAJM9Wa

Slide 117

Slide 117 text

References Full-Text Index in MariaDB https://mariadb.com/kb/en/library/full-text-index-overview/ Natural Language Full-Text Searches (MySQL) https://dev.mysql.com/doc/refman/5.7/en/fulltext-natural-language.html Postgres full-text search is Good Enough! (2015) http://rachbelaid.com/postgres-full-text-search-is-good-enough/ Text Indexes in MongoDB https://docs.mongodb.com/manual/core/index-text/ Full-Text Index Stopwords for MariaDB https://mariadb.com/kb/en/library/full-text-index-stopwords/