Dealing with a search engine in your application - a Solr approach for beginners


Elaine Naomi Watanabe Full-stack developer (Playax) Master's degree in Computer
Science (IME-USP) Passionate about: Web Development, Agile, Cloud Computing, DevOps, NoSQL and RDBMS

ANALYZING BILLIONS OF DATA TO HELP ARTISTS AND MUSIC PROFESSIONALS
TO DEVELOP THEIR AUDIENCE BIG DATA + MUSIC + TECH = <3

AGENDA

- Searching Problem: Introduction
- Information Retrieval: Basic concepts
- Apache Solr: How to configure
- Sunspot Gem: Integrating with Ruby on Rails
- Next Steps: Including references
SPOILER ALERT

THE SEARCHING PROBLEM

or did you mean... HOW TO SEARCH LIKE GOOGLE?

IMAGINE IN OUR CONTEXT… A LIST OF SONGS… TITLES, ARTISTS, LYRICS...

A small sample of songs...
- "Imagine all the people living life in peace" (Imagine - John Lennon)
- "I'm radioactive, radioactive" (Radioactive - Imagine Dragons)
- "Welcome to the jungle, watch it bring you to your knees" (Welcome to the Jungle - Guns N' Roses)

Is a SQL LIKE statement enough? SEARCHING TERM: "IMAGINE"
SELECT * FROM songs
WHERE title LIKE '%IMAGINE%'
   OR artist LIKE '%IMAGINE%'
   OR lyrics LIKE '%IMAGINE%';

Searching for "Imagine"
- "Imagine all the people living life in peace" (Imagine - John Lennon)
- "I'm radioactive, radioactive" (Radioactive - Imagine Dragons)
- "Welcome to the jungle, watch it bring you to your knees" (Welcome to the Jungle - Guns N' Roses)

Is a SQL LIKE statement really enough?? SEARCHING TERM: "IMAGINE PEOPLE"
SELECT * FROM songs
WHERE title LIKE '%IMAGINE%PEOPLE%'
   OR artist LIKE '%IMAGINE%PEOPLE%'
   OR lyrics LIKE '%IMAGINE%PEOPLE%';

USER x YOUR APP: A BUG SEARCH TOOL

When a LIKE statement is not enough... SEARCH TERMS:
- "Dragons Imagine"
- "Imagine John"
- "Imagine JONH" <- TYPO!

When only Yahoo! Answers is the solution... Ueca tudi diango...
tanananananananaann nisss ♬ welcome to the jungle, watch it bring you to your knees ♬ (╯°▽°)╯ ︵ ┻━┻

INFORMATION RETRIEVAL

Unstructured data Large number of documents

IN THE PAST... Listing all documents that match a search query was enough… However, in a Big Data era…

NOWADAYS … Ranking documents by their relevance for a search
query is the most important goal.

Basic concepts

TOKENIZATION: Tokens ~> Words
"A list of words!" -> A, list, of, words
Tokens: semantic units

"DON'T PANIC" -> DON, T, PANIC
"do not panic" -> do, not, panic
Is it enough to remove punctuation and spaces? How to tokenize contractions? Are all of them semantic units? Are they the same tokens? (don't = do not)

"Imagine Dragons" -> Imagine, Dragons or Imagine Dragons?
Is it enough to remove punctuation and spaces?

"you-know-who" -> you, know, who or you-know-who?
Is it enough to remove punctuation and spaces?

"Fullstack developer" -> Fullstack, developer
"Full stack developer" -> Full, stack, developer
"Full-stack dev" -> Full-stack, developer
Is it enough to remove punctuation and spaces? For a user, these terms should return the same documents, shouldn't they?
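
The questions above can be tried out with a deliberately naive tokenizer. This is only a sketch: `naive_tokenize` is a hypothetical helper (not part of Solr or Sunspot) that treats every run of punctuation or spaces as a separator, which is exactly the strategy the slides question.

```ruby
# Naive tokenizer: every run of characters that are not letters or digits
# is treated as a separator. Hypothetical helper for illustration only.
def naive_tokenize(text)
  text.split(/[^\p{L}\p{N}]+/).reject(&:empty?)
end

p naive_tokenize("A list of words!")     # => ["A", "list", "of", "words"]
p naive_tokenize("DON'T PANIC")          # => ["DON", "T", "PANIC"]
p naive_tokenize("you-know-who")         # => ["you", "know", "who"]
p naive_tokenize("Fullstack developer")  # => ["Fullstack", "developer"]
p naive_tokenize("Full-stack dev")       # => ["Full", "stack", "dev"]
```

The last two calls produce different tokens for what a user would consider the same search, which is why real analyzers need more than punctuation removal.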

30 seconds to Mars = Thirty seconds to Mars
November, 18th, 2017 = 2017-11-18
SP = São Paulo
How to deal with numbers and abbreviations?

Kaminari: is it a gem or "thunder" in Japanese?
Windows: is it the plural of window or the operating system?
About the semantics of the original term and its normalized token...

音楽 ONGAKU おんがく SAME LANGUAGE, SAME PRONUNCIATION DIFFERENT ALPHABETS

STOP WORDS: extremely common words In English: a, an, the,
and, or, are, as, at, by, for, from, of ... In Portuguese: um, uma, a, o, as, os, é, são, por, de, da, do, se …

STOP WORDS: extremely common words
"A list of words" -> A, list, of, words -> list, words (meaningful tokens)

Stop words, diacritics, case folding...
- Case folding normalization: HELLO WORLD, Hello World, hello world -> hello world
- Diacritics removal: naïve, naive -> naive
- Stop word removal: roses are red, red roses -> roses, red
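
A minimal sketch of these three normalization steps in plain Ruby. The helper names (`normalize_token`, `analyze`) and the short stop word list are assumptions made for illustration; Solr applies equivalent steps through its configured analyzers, not through this code.

```ruby
require "set"

# Small English stop word list, taken from the examples in this talk.
STOP_WORDS = Set.new(%w[a an the and or are as at by for from of]).freeze

# Case folding + diacritics removal: decompose accented characters (NFD)
# and drop the combining marks that the decomposition produces.
def normalize_token(token)
  token.downcase.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

# Tokenize naively, normalize each token, then remove stop words.
def analyze(text)
  text.split(/[^\p{L}\p{N}]+/)
      .reject(&:empty?)
      .map { |token| normalize_token(token) }
      .reject { |token| STOP_WORDS.include?(token) }
end

p analyze("HELLO World")    # => ["hello", "world"]
p analyze("naïve")          # => ["naive"]
p analyze("Roses are red")  # => ["roses", "red"]
```

Note that the same pipeline turns the Spanish "peña" into "pena", so blind normalization can merge words with different meanings.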

When not to normalize tokens…
- Strings solely composed of stop words: The Who (a band), Se (Brazilian song, by Djavan)
- Different meanings for words with and without diacritics. In Spanish: peña means "cliff", pena means "sorrow"
- When not to set all characters to lowercase: General Motors, Windows, Apple

LEMMATIZATION / STEMMING To reduce a token to its base
form

LEMMATIZATION: based on a vocabulary
- English: am, are, is -> be; car, cars, car's, cars' -> car
- Portuguese: sou, somos, foi, é -> ser; carros, carro -> carro

STEMMING: heuristic process that chops off the ends of words
- cats -> cat
- ponies -> poni
It increases the number of returned documents, however at the cost of precision...

STEMMING: heuristic process that chops off the ends of words
- Portuguese: amor, amores, amora -> amor ("amor" means love; "amora" is a Brazilian berry)
- English: operating system -> operat system (not so meaningful tokens)
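
To make "chops off the ends of words" concrete, here is a toy suffix stripper. It is not the Porter algorithm nor any stemmer shipped with Solr; `toy_stem` and its three English-only rules are assumptions chosen just to reproduce the examples above.

```ruby
# Toy suffix-stripping stemmer (NOT the Porter algorithm): three English
# rules, tried in order. Hypothetical helper for illustration only.
def toy_stem(token)
  word = token.downcase
  if word.end_with?("ies")
    word.sub(/ies\z/, "i")   # ponies -> poni
  elsif word.end_with?("ing") && word.length > 5
    word.sub(/ing\z/, "")    # operating -> operat
  elsif word.end_with?("s") && !word.end_with?("ss")
    word.chomp("s")          # cats -> cat
  else
    word
  end
end

%w[cats ponies operating system].each do |word|
  puts "#{word} -> #{toy_stem(word)}"
end
```

Because the rules know nothing about vocabulary, stems such as "poni" and "operat" are not real words, and unrelated words can collapse into the same stem.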

SYNONYMS
- bike = bicycle
- indivíduo (individual) = pessoa (person)

Bag of words: list of keywords. Ordering of words is ignored!
e.g. Imagine Dragons, Dragons Imagine
Phrase queries: order matters! They restrict searches.
e.g. "Imagine Dragons"
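
The difference between the two query styles can be sketched as follows. Both predicates are hypothetical helpers, and the bag-of-words version assumes that every keyword must be present; a search engine may instead accept any keyword and rank the results.

```ruby
# Lowercased word tokens of a text.
def tokens(text)
  text.downcase.scan(/[\p{L}\p{N}]+/)
end

# Bag of words: all query terms must occur, in any order (assumption).
def bag_of_words_match?(document, query)
  (tokens(query) - tokens(document)).empty?
end

# Phrase query: the query terms must occur contiguously and in order.
def phrase_match?(document, query)
  phrase = tokens(query)
  return false if phrase.empty?
  tokens(document).each_cons(phrase.size).include?(phrase)
end

doc = "Radioactive - Imagine Dragons"
p bag_of_words_match?(doc, "Dragons Imagine")  # => true
p phrase_match?(doc, "Dragons Imagine")        # => false
p phrase_match?(doc, "Imagine Dragons")        # => true
```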

RELEVANCE
- term frequency (tf): number of occurrences of a term in a document
- inverse document frequency (idf): how rare a term is across all indexed documents

RELEVANCE
tf-idf = tf x idf
A function that balances the term frequency in a document against how rare the term is in the collection
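
A small worked example of tf-idf over the sample lyrics. The formula used for idf, log(N / df), is one common variant and an assumption here; Lucene and Solr use their own scoring formulas, so the numbers below are only illustrative.

```ruby
# Lowercased word tokens of a text.
def tokenize(text)
  text.downcase.scan(/[\p{L}\p{N}]+/)
end

# tf: how many times the term occurs in one tokenized document.
def tf(term, doc_tokens)
  doc_tokens.count(term)
end

# idf (assumed variant): log(N / df), where df is the number of
# documents that contain the term.
def idf(term, corpus)
  df = corpus.count { |doc_tokens| doc_tokens.include?(term) }
  return 0.0 if df.zero?
  Math.log(corpus.size.to_f / df)
end

def tf_idf(term, doc_tokens, corpus)
  tf(term, doc_tokens) * idf(term, corpus)
end

corpus = [
  "Imagine all the people living life in peace",
  "I'm radioactive, radioactive",
  "Welcome to the jungle, watch it bring you to your knees"
].map { |lyrics| tokenize(lyrics) }

puts tf_idf("radioactive", corpus[1], corpus)  # frequent here, rare elsewhere
puts tf_idf("the", corpus[0], corpus)          # common across documents
```

"radioactive" scores higher than "the" because it occurs twice in its document and in no other one.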

Boolean Model, Probabilistic Model, PageRank, ...

Evaluating a searching model

Evaluation method
We need a test dataset with:
1. A document collection
2. A collection of queries
3. A set of relevance judgments: for each query, a list of relevant and non-relevant documents
TP: True Positive, TN: True Negative, FP: False Positive, FN: False Negative

ACCURACY: $\frac{TP + TN}{TP + FP + TN + FN}$

PRECISION: $\frac{TP}{TP + FP}$ = # Correct Matches / # Total Results Returned

RECALL: $\frac{TP}{TP + FN}$ = # Correct Matches / (# Correct Matches + # Missed Matches)

F1 SCORE: $\frac{2 \cdot \text{RECALL} \cdot \text{PRECISION}}{\text{RECALL} + \text{PRECISION}}$
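
These metrics are easy to compute once the relevance judgments are counted. The sketch below uses hypothetical helper names and made-up counts (8 relevant documents returned, 2 irrelevant returned, 8 relevant missed, 82 irrelevant correctly left out). F1 is the harmonic mean of precision and recall, 2PR / (P + R).

```ruby
def precision(tp, fp)
  tp.to_f / (tp + fp)
end

def recall(tp, fn)
  tp.to_f / (tp + fn)
end

def accuracy(tp, tn, fp, fn)
  (tp + tn).to_f / (tp + tn + fp + fn)
end

# Harmonic mean of precision and recall.
def f1_score(precision, recall)
  return 0.0 if (precision + recall).zero?
  2 * precision * recall / (precision + recall)
end

# Illustrative counts only, not measurements from a real system.
tp, fp, fn, tn = 8, 2, 8, 82
p_value = precision(tp, fp)  # 8 of the 10 returned documents are relevant
r_value = recall(tp, fn)     # 8 of the 16 relevant documents were found
puts "precision=#{p_value} recall=#{r_value}"
puts "f1=#{f1_score(p_value, r_value).round(3)} accuracy=#{accuracy(tp, tn, fp, fn)}"
```

With these counts accuracy looks high (0.9) while recall is only 0.5, which is why accuracy alone is a poor guide when most documents are irrelevant to a query.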

When is a model good enough for an app?
You can choose the model with the best F1 score, for example. However:
- There is no universal solution
- It is an incremental process
- You should tune it based on users' information needs
- Usability tests are also a good way to evaluate a model

Adding a search engine to my app

FULL TEXT SEARCH (FTS) IN MARIADB...
CREATE TABLE `songs` (
  `id` int NOT NULL AUTO_INCREMENT PRIMARY KEY,
  `title` varchar(300),
  `artist` varchar(255),
  `genre` varchar(255),
  `lyrics` text
) ENGINE=InnoDB;
CREATE FULLTEXT INDEX songs_title_idx ON songs (title);
CREATE FULLTEXT INDEX songs_artist_idx ON songs (artist);
CREATE FULLTEXT INDEX songs_lyrics_idx ON songs (lyrics);
CREATE FULLTEXT INDEX songs_genre_idx ON songs (genre);

FULL TEXT SEARCH IN MARIADB...
CREATE FULLTEXT INDEX songs_all_idx ON songs (title,artist,lyrics);
-- NATURAL LANGUAGE MODE is the default mode
SELECT * FROM songs WHERE MATCH (title,artist,lyrics) AGAINST ('imagine' IN NATURAL LANGUAGE MODE);
SELECT * FROM songs WHERE MATCH (title,artist,lyrics) AGAINST ('imagine' IN BOOLEAN MODE);

SELECT * FROM songs WHERE MATCH (title,artist,lyrics) AGAINST ('imagine dragons');
Returned rows:
- Radioactive - Imagine Dragons
- Imagine - John Lennon

SELECT * FROM songs WHERE MATCH (title,artist,lyrics) AGAINST ('+imagine +dragons' IN BOOLEAN MODE);
Returned rows:
- Radioactive - Imagine Dragons

SELECT * FROM songs WHERE MATCH (title,artist,lyrics) AGAINST ('"imagine dragons"');
Returned rows:
- Radioactive - Imagine Dragons

Sample genres: Imagine Dragons - Alternative Rock; John Lennon - Rock music, Pop music
SELECT * FROM songs WHERE MATCH(genre) AGAINST('alternative');
SELECT * FROM songs WHERE MATCH(genre) AGAINST('music');
SELECT * FROM songs WHERE MATCH(genre) AGAINST('alternative' WITH QUERY EXPANSION);
SELECT * FROM songs WHERE MATCH(genre) AGAINST('music' WITH QUERY EXPANSION);
Returned artists: Imagine Dragons, John Lennon, Imagine Dragons, John Lennon

Why use an external search engine? Spell checking! Or did you mean… search like Google? ♡

Why use an external search engine? You can use spell checking! You can also:
- Add multivalued fields (document oriented database)
- Add new algorithms to the databases
- Customize stop words, stemming analyzers
- Use fuzziness functions
- Boost some documents/fields according to the search

Apache Solr and ElasticSearch
- Based on Apache Lucene
- Document oriented databases (welcome to polyglot persistence!)
- It is not a relational database, ok? No ACID, sorry!
- Developed to be scalable
- Apache Solr has better documentation (+50)
- ES has native support for a structured query DSL (+1)
- ES is better for analytic queries

ElasticSearch DSL
// artist = John Lennon AND (genres = rock OR genres = pop)
// AND NOT(nome = imagine)
GET /songs/v1/_search
{
  "query": {
    "bool": {
      "must": {"match": {"artist": "John Lennon"}},
      "should": [
        {"match": {"genres": "rock"}},
        {"match": {"genres": "pop"}}
      ],
      "must_not": {"match": {"nome": "imagine"}}
    }
  }
}

Our choice: Apache Solr Apache Solr is Open Source and
Open Development +1000 Latest release: 7.1.0 (October 17th, 2017)

Apache Solr

Installing for a development environment...
docker run --name my_solr -p 8983:8983 -d solr
https://hub.docker.com/r/risdenk/docker-solr/

localhost:8983

Creating a core
docker exec -it my_solr solr create_core -c development
- development: core name
- core ~> database or table
- document ~> a row from a table
- schemaless!!

List of all cores

Menu options for each core

Creating a document...
curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/development/update/json/docs' --data-binary '
{ "id": "1", "title": "Song 1" }'

Zero documents?? Check!

Commit!!
curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/development/update?commit=true' --data-binary '
{ "commit": {} }'

Creating a document...
curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/development/update?commit=true' --data-binary '
[{ "id": "1", "title": "Song 1" },
 { "title": "Song 2" }]'
The "id" field is optional in an insert

Our new documents!

title and title_str? Dynamic fields: *_str, *_i, ...

Updating a document...
curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/development/update?commit=true' --data-binary '
[{ "id": "1", "title": "Song 3" },
 { "title": "Song 3" }]'

The document with id: 1 is overwritten; the one without an id is added as a new doc

Documents menu - JSON

Deleting a document...
curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/development/update' --data-binary '
{ "delete": { "id": "1" }, "commit": {} }'

Deleting ALL documents
curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/development/update?commit=true' --data-binary '
{ "delete": { "query": "*:*" } }'
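
Hand-written JSON bodies are easy to get wrong (a stray comma is enough), so the same payloads can be generated from Ruby hashes. A minimal sketch: the three helper names are hypothetical, and only the JSON shapes already shown in the curl commands are assumed.

```ruby
require "json"

# Body for adding (or overwriting) a list of documents.
def add_docs_payload(docs)
  JSON.generate(docs)
end

# Body for deleting one document by id and committing in the same request.
def delete_by_id_payload(id)
  JSON.generate({ "delete" => { "id" => id }, "commit" => {} })
end

# Body for deleting every document that matches a query.
def delete_by_query_payload(query)
  JSON.generate({ "delete" => { "query" => query } })
end

puts add_docs_payload([{ "id" => "1", "title" => "Song 1" }, { "title" => "Song 2" }])
puts delete_by_id_payload("1")
puts delete_by_query_payload("*:*")
```

Each string can be sent as the `--data-binary` body of the corresponding request.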

Documents menu - Solr command

Searching

QUERY
- q (query): main query parameter
- fq (filter query): filter query (to reduce the dataset)
- fl (field list): list of fields to return
- sort: list of fields to sort the dataset
Results are paginated
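
A sketch of how these parameters could be assembled into a query string from Ruby. `solr_query_string` and its keyword arguments are hypothetical; `start` and `rows` are the Solr parameters that control pagination.

```ruby
require "uri"

# Builds the query string for Solr's /select handler.
# page is 1-based; Solr itself expects an offset (start) and a page size (rows).
def solr_query_string(q:, fq: [], fl: [], sort: nil, page: 1, rows: 10)
  params = { "q" => q, "start" => (page - 1) * rows, "rows" => rows }
  params["fq"] = fq unless fq.empty?             # one fq parameter per filter
  params["fl"] = fl.join(",") unless fl.empty?   # comma-separated field list
  params["sort"] = sort if sort
  URI.encode_www_form(params)
end

query_string = solr_query_string(
  q: "title:song",
  fq: ["genre:Rock"],
  fl: %w[id title],
  sort: "year desc",
  page: 2
)
puts "http://localhost:8983/solr/development/select?#{query_string}"
```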

Basic queries List all documents (with pagination) curl 'http://localhost:8983/solr/development/select?q=*:*'

Basic queries
List all documents (with pagination), using the JSON request body
curl http://localhost:8983/solr/development/select -d ' { query:"*:*" }'

My documents
{ "docs": [
  { "title": ["Song 1"], "genre": "Rock", "year": 2010 },
  { "title": ["Song 2"], "genre": "MPB", "year": 1990 },
  { "title": ["Other music Rock"], "genre": "Pop", "year": 1970 },
  { "title": ["My favorite songs"], "genre": "Rock Music", "year": 2011 }
] }

Fuzzy matching
- title:Song* -> 3 documents
- title:Song? -> 1 document
- title:Sonjs -> 0 documents
- title:Sonjs~1 -> 1 document
- title:Sonjs~2 -> 3 documents
- title:(my songs) -> 1 document
- title:"my songs" -> 0 documents
- title:"my songs"~2 -> 1 document
- title:(-favorite +song*) -> 2 documents
- *:* -> 4 documents
Operators: ? one character; * any number of characters; ~ fuzziness (maximum edit distance) after a term, slop after a phrase; ( ) keyword search; " " phrase query
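
The ~1 and ~2 results can be understood through edit distance. The sketch below implements plain Levenshtein distance as a stand-in; Lucene's fuzzy queries are based on a Damerau-Levenshtein variant, so treat this as an approximation of the idea rather than Solr's exact behaviour.

```ruby
# Plain Levenshtein distance: minimum number of single-character
# insertions, deletions and substitutions turning a into b.
def levenshtein(a, b)
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost].min
    end
    previous = current
  end
  previous.last
end

p levenshtein("sonjs", "songs")  # => 1
p levenshtein("sonjs", "song")   # => 2
```

"songs" is one edit away from "sonjs", so Sonjs~1 finds only "My favorite songs"; "song" is two edits away, so Sonjs~2 also reaches "Song 1" and "Song 2".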

Fuzzy matching
- title:"song" AND genre:"rock" -> 1 document
- (title:"song" AND genre:"rock") OR title:"track" -> 2 documents
- year:[1980 TO *] -> 3 documents
- genre:[Pop TO *] -> 3 documents
Boosting:
- (title:music OR title:Rock)^1.5 (genre:music OR genre:Rock) -> 3 documents, 1st: "Other music Rock"
- (title:music OR title:Rock) (genre:music OR genre:Rock)^1.5 -> 3 documents, 1st: "My favorite songs"

Searching in all fields
In your schema.xml, add:
<copyField source="*_txt" dest="_text_" />
<copyField source="*_text" dest="_text_" />
You can also add the following, but it is not recommended:
<copyField source="*" dest="_text_" />
Then you can search without defining the default field

Analysis: lists all indexing and querying transformations (indexing transformations | querying transformations)

Customizing fields and their analyzers (schema.xml)
<fieldtype name="phonetic" stored="false" indexed="true" class="solr.TextField" >
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.DoubleMetaphoneFilterFactory" inject="false"/>
    <filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_pt.txt" />
  </analyzer>
</fieldtype>

Spell Checking

Building the spell checking index
curl --request GET --url 'http://localhost:8983/solr/development/select?q=*:*&spellcheck.build=true&spellcheck=true'

Searching: IMAGINA Suggestion: IMAGINE

Searching: DRAGOONS Suggestion: DRAGONS

Searching: IMAGINA DRAGOONS Suggestion: IMAGINE DRAGONS

Integrating with Ruby on Rails

Connecting through a REST Client …
require 'rest-client'
require 'json'

params = { q: 'title:song' }
# Hash#to_param comes from ActiveSupport (available inside Rails)
response = RestClient.get "http://localhost:8983/solr/development/select?#{params.to_param}"
response_json = JSON.parse(response.body)
items = response_json["response"]["docs"]
# => [{"title"=>["Song 1"], "id"=>"eeb507c6-461f-4219-9f5a-50528340d84d", "_version_"=>1584234836063682560, "title_str"=>["Song 1"]},
#     {"title"=>["Song 2"], "id"=>"1b8bacc1-9ed9-4c85-922d-71b3472f9d44", "_version_"=>1584234836065779712, "title_str"=>["Song 2"]}]
ヽ(•́o•̀)ノ
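
The same request can be prepared with Ruby's standard library only. In this sketch `select_uri` and `parse_select_response` are hypothetical helpers, and `SAMPLE_BODY` is a canned response in the shape Solr returns, so the example runs without a Solr instance.

```ruby
require "json"
require "uri"

SOLR_BASE = "http://localhost:8983/solr"

# URI of the /select handler of a core, with URL-encoded parameters.
def select_uri(core, params)
  uri = URI.parse("#{SOLR_BASE}/#{core}/select")
  uri.query = URI.encode_www_form(params)
  uri
end

# Extracts the total number of hits and the current page of documents.
def parse_select_response(body)
  response = JSON.parse(body).fetch("response")
  { total: response.fetch("numFound"), docs: response.fetch("docs") }
end

# Canned body for illustration; a real one would come from
# Net::HTTP.get(select_uri(...)) against a running Solr.
SAMPLE_BODY = '{"response":{"numFound":2,"start":0,"docs":[' \
              '{"id":"1","title":["Song 1"]},{"id":"2","title":["Song 2"]}]}}'

puts select_uri("development", { "q" => "title:song" })
result = parse_select_response(SAMPLE_BODY)
puts "#{result[:total]} documents, first title: #{result[:docs].first['title'].first}"
```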

Sunspot Gem V. 2.2.7

Installing...
gem 'sunspot_rails'
rails generate sunspot_rails:install

config/sunspot.yml
development:
  solr:
    hostname: solr
    port: 8983
    path: /solr/playax
    log_level: INFO
  auto_index_callback: after_commit
  auto_remove_callback: after_commit

Sunspot needs its own schema.xml. Follow this example in: elainenaomi/search_engine

Sunspot DSL - Defining the indexed fields
class Song < ActiveRecord::Base
  searchable do
    text :title, stored: true
    text :lyrics, stored: false
    text :artist, stored: true
    string :genre, multiple: true, stored: true do
      genre.split(',')
    end
  end
end

Sunspot.index! Song.all

Bag of words:
search = Song.search do
  fulltext 'imagine dragons'
  with :genre, 'Rock'
  without :genre, 'Pop'
  with(:year).less_than 2014
  field_list :title, :artist
  order_by :title, :asc
end
songs = search.results
Results: Imagine (John Lennon), Radioactive (Imagine Dragons)

Phrase queries:
search = Song.search do
  fulltext "\"imagine dragons\""
  with :genre, 'Rock'
  without :genre, 'Pop'
  with(:year).less_than 2014
  field_list :title, :artist
  order_by :title, :asc
end
songs = search.results
Results: Radioactive (Imagine Dragons)

Query Phrase Slop
# Two words can appear between the words in the phrase, so
# "imagine all the people" also matches, in addition to "imagine people"
Song.search do
  fulltext '"imagine people"' do
    fields :lyrics
    query_phrase_slop 2
  end
end
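
The effect of slop can be imitated with a simplified check for two-word phrases. This is a sketch under stated assumptions: `two_word_phrase_match?` is a hypothetical helper that only handles two words in their original order, whereas Lucene's slop also covers longer phrases and reordered terms.

```ruby
# Lowercased word tokens of a text.
def tokens(text)
  text.downcase.scan(/[\p{L}\p{N}]+/)
end

# True when `first` is followed by `second` with at most `slop`
# other tokens between them (in-order, two-word simplification).
def two_word_phrase_match?(text, first, second, slop)
  words = tokens(text)
  first_positions = words.each_index.select { |i| words[i] == first.downcase }
  second_positions = words.each_index.select { |i| words[i] == second.downcase }
  first_positions.any? do |i|
    second_positions.any? { |j| j > i && (j - i - 1) <= slop }
  end
end

lyrics = "Imagine all the people living life in peace"
p two_word_phrase_match?(lyrics, "imagine", "people", 2)  # => true
p two_word_phrase_match?(lyrics, "imagine", "people", 1)  # => false
```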

Minimum Match
Song.search do
  fulltext "dragons imagine test" do
    fields :artist, :title
    minimum_match '70%'
  end
end
# 70% of 3 terms = 2.1, rounded down to 2 terms
# 1 document: Radioactive (Imagine Dragons)

Song.search do
  fulltext 'dragons imagine test' do
    fields :artist, :title
    boost_fields title: 2.0
    minimum_match '60%'
  end
end
# 60% of 3 terms = 1.8, rounded down to 1 term; title matches are boosted
# 2 documents: 1st: Imagine (John Lennon), 2nd: Radioactive (Imagine Dragons)
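
The rounding rule behind minimum match can be sketched in a few lines. `required_matches` and `search` are hypothetical helpers over an in-memory list; they only reproduce the "percentage of terms, rounded down" filter and do not implement boosting or ranking.

```ruby
# Lowercased word tokens of a text.
def tokens(text)
  text.downcase.scan(/[\p{L}\p{N}]+/)
end

# Number of query terms that must match; integer division rounds down.
def required_matches(term_count, percent)
  (term_count * percent) / 100
end

# Keeps the documents that contain at least the required number of terms.
def search(docs, query, percent)
  terms = tokens(query).uniq
  needed = required_matches(terms.size, percent)
  docs.select do |doc|
    doc_terms = tokens(doc.values.join(" "))
    (terms & doc_terms).size >= needed
  end
end

SONGS = [
  { title: "Imagine", artist: "John Lennon" },
  { title: "Radioactive", artist: "Imagine Dragons" }
].freeze

p search(SONGS, "dragons imagine test", 70).map { |song| song[:title] }  # => ["Radioactive"]
p search(SONGS, "dragons imagine test", 60).map { |song| song[:title] }  # => ["Imagine", "Radioactive"]
```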

Spell checking
search = Sunspot.search(Song) do
  keywords 'Imagina Dragoons'
  spellcheck :count => 3
end
search.spellcheck_suggestion_for('imagina') # => 'imagine'
search.spellcheck_suggestions # => [{"word"=>"imagine", "freq"=>3}, {"word"=>"dragons", "freq"=>1}]

To test or not to test?

Unit tests? No. Integration tests? Maybe…
- Search engines depend on term frequency to rank docs
- You would need your whole dataset to compute precision, recall...
- You can test only filter queries, indexing callbacks…

Summary

The searching problem
• User: a bug search tool
Adding a search engine to my app
• Full text search in MariaDB
• Apache Solr x ElasticSearch
Apache Solr
• How to create cores
• CRUD operations
Integrating with Rails
• Sunspot gem
• How to index, search and test

Keep in mind
Always verify what information the users need from your app
- E.g.: check whether stop word removal and synonyms should be applied: "No" (Meghan Trainor), "I am" (P.O.D)
- E.g.: decide which transformations your search engine should apply: phonetic transformations? Custom language analyzers?

Information lives not only in text files but also in audio, video, images, etc.

Suggested topics for studying
- Evaluation of available analyzers for FTS
- Performance optimization (such as soft commits, lazily built indexes)
- Distribution and replication through SolrCloud
- Using Machine Learning algorithms
- Creation of custom function queries
- Authentication
- Integrating with Logstash and Kibana
- Geospatial searches

Thank you! <3 github.com/elainenaomi slideshare.net/elainenaomi @elaine_nw

References
- Introduction to Information Retrieval. Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze (2008)
- Solr in Action. Grainger, Trey, Timothy Potter, and Yonik Seeley (2014)
- Sunspot gem. http://sunspot.github.io/
- Uma introdução ao tema recuperação de informações textuais. Barth, F. J. (2013)
- 10 Reasons to Choose Apache Solr Over Elasticsearch (2016). https://dzone.com/articles/10-reasons-to-choose-apache-solr-over-elasticsearc
- Apache Solr vs Elasticsearch. http://solr-vs-elasticsearch.com/
- When to consider Solr. https://stackoverflow.com/questions/4960952/when-to-consider-solr
- Indexing for full text search in PostgreSQL. https://www.compose.com/articles/indexing-for-full-text-search-in-postgresql/
- PolyglotPersistence. https://martinfowler.com/bliki/PolyglotPersistence.html
- Yahoo! Answers: Qual o nome desta Música? https://br.answers.yahoo.com/question/index?qid=20080627085726AAJM9Wa
- Full-Text Index in MariaDB. https://mariadb.com/kb/en/library/full-text-index-overview/
- Natural Language Full-Text Searches (MySQL). https://dev.mysql.com/doc/refman/5.7/en/fulltext-natural-language.html
- Postgres full-text search is Good Enough! (2015). http://rachbelaid.com/postgres-full-text-search-is-good-enough/
- Text Indexes in MongoDB. https://docs.mongodb.com/manual/core/index-text/
- Full-Text Index Stopwords for MariaDB. https://mariadb.com/kb/en/library/full-text-index-stopwords/
