Dealing with a search engine
in your application
a Solr approach for beginners
Elaine Naomi Watanabe
Slide 2
Slide 2 text
Elaine Naomi Watanabe
Full-stack developer
(Playax)
Master's degree in Computer Science
(IME-USP)
Passionate about:
Web Development, Agile,
Cloud Computing, DevOps,
NoSQL and RDBMS
Slide 3
Slide 3 text
ANALYZING BILLIONS OF DATA TO HELP
ARTISTS AND MUSIC PROFESSIONALS TO
DEVELOP THEIR AUDIENCE
BIG DATA + MUSIC + TECH = <3
Slide 4
Slide 4 text
AGENDA
Slide 5
Slide 5 text
Searching
Problem
Introduction
Information
Retrieval
Basic concepts
Apache
Solr
How to configure
Sunspot
Gem
Integrating with
Ruby on Rails
Next
Steps
Including
references
SPOILER ALERT
Slide 6
Slide 6 text
THE SEARCHING
PROBLEM
Slide 7
Slide 7 text
or did you mean...
HOW TO SEARCH LIKE THE
GOOGLE?
Slide 8
Slide 8 text
A LIST OF SONGS…
TITLES, ARTISTS, LYRICS...
IMAGINE IN OUR CONTEXT…
Slide 9
Slide 9 text
"Imagine all the people living life in peace"
(Imagine - John Lennon)
"I'm a radioactive, radioactive"
(Radioactive - Imagine Dragons)
"Welcome to the jungle,
watch it bring you to your knees"
(Welcome to the jungle - Guns N' Roses)
A little little sample of songs...
Slide 10
Slide 10 text
A SQL LIKE statement is enough?
SEARCHING TERM: "IMAGINE"
SELECT *
FROM songs
WHERE
title LIKE '%IMAGINE%' OR
artist LIKE '%IMAGINE%' OR
lyrics LIKE '%IMAGINE%';
Slide 11
Slide 11 text
"Imagine all the people living life in peace"
(Imagine - John Lennon)
"I'm a radioactive, radioactive"
(Radioactive - Imagine Dragons)
"Welcome to the jungle,
watch it bring you to your knees"
(Welcome to the jungle - Guns N' Roses)
Searching for "Imagine"
Slide 12
Slide 12 text
A SQL LIKE statement is really enough??
SEARCHING TERM:
"IMAGINE PEOPLE"
Slide 13
Slide 13 text
"Imagine all the people living life in peace"
(Imagine - John Lennon)
"I'm a radioactive, radioactive"
(Radioactive - Imagine Dragons)
"Welcome to the jungle,
watch it bring you to your knees"
(Welcome to the jungle - Guns N' Roses)
A little little sample of songs...
Slide 14
Slide 14 text
A SQL LIKE statement is really enough??
SEARCHING TERM: "IMAGINE PEOPLE"
SELECT *
FROM songs
WHERE
title LIKE '%IMAGINE%PEOPLE%' OR
artist LIKE '%IMAGINE%PEOPLE%' OR
lyrics LIKE '%IMAGINE%PEOPLE%';
Slide 15
Slide 15 text
USER x YOUR APP
A BUG SEARCH TOOL
Slide 16
Slide 16 text
When LIKE STATEMENT is not enough...
SEARCH TERMS:
"Dragons Imagine"
"Imagine John"
"Imagine JONH" <- TYPO!
Slide 17
Slide 17 text
When only Yahoo! Answers is the solution...
Ueca tudi diango...
tanananananananaann nisss
♬ welcome to the jungle,
watch it bring you to your knees ♬
(╯°▽°)╯ ︵ ┻━┻
Slide 18
Slide 18 text
INFORMATION
RETRIEVAL
Slide 19
Slide 19 text
Unstructured data
Large number of documents
Slide 20
Slide 20 text
IN THE PAST...
List all documents that match a search query
was enough…
However, in a Big Data era…
Slide 21
Slide 21 text
NOWADAYS …
Ranking documents by their relevance
for a search query is the most important goal.
Slide 22
Slide 22 text
Basic concepts
Slide 23
Slide 23 text
TOKENIZATION: Tokens ~> Words
A list words
of
"A list of words!"
Tokens
semantic units
Slide 24
Slide 24 text
DON T PANIC
"DON'T PANIC"
Is it enough to remove punctuation and spaces?
"do not panic" do not panic
How to tokenize contractions?
Are all of them semantic units?
Are they same tokens?
don't = do not
DON'T PANIC
Slide 25
Slide 25 text
Imagine Dragons
"Imagine Dragons"
Imagine Dragons
Is it enough to remove punctuation and spaces?
or
Slide 26
Slide 26 text
you know
"you-know-who" you-know-who
Is it enough to remove punctuation and spaces?
or
who
Slide 27
Slide 27 text
"Fullstack developer"
"Full-stack dev"
"Full stack developer"
Is it enough to remove punctuation and spaces?
Fullstack
Full developer
stack
developer
Full-stack developer
For a user, these terms should return the same documents, isn't it?
Slide 28
Slide 28 text
30 seconds to Mars Thirty seconds to Mars
November, 18th, 2017 2017-11-18
SP São Paulo
How to deal with numbers and abbreviations?
Slide 29
Slide 29 text
Kaminari
Is a gem or thunder in
japanese?
Windows
Is it plural of window or
about the company?
About the semantics of the original term and its normalized token...
Slide 30
Slide 30 text
音楽
ONGAKU
おんがく
SAME LANGUAGE, SAME PRONUNCIATION
DIFFERENT ALPHABETS
Slide 31
Slide 31 text
STOP WORDS: extremely common words
In English:
a, an, the, and, or, are, as, at, by,
for, from, of ...
In Portuguese:
um, uma, a, o, as, os, é, são, por, de,
da, do, se …
Slide 32
Slide 32 text
STOP WORDS: extremely common words
A list words
of list words
meaningful tokens
Slide 33
Slide 33 text
Stop words, diacritics, case folding...
Stop word
removal
Case folding
normalization
Diacritics
removal
HELLO WORLD
Hello World
hello world
hello
world
naive
naïve
naive
roses are red
red roses
roses
red
Slide 34
Slide 34 text
When not to normalize tokens…
The Who (a band)
Se (Brazilian song, from Djavan)
Strings solely composed by stop
words
Different meanings for words
with and without
diacritics
In Spanish:
peña means a "cliff"
pena means "sorrow"
When not to set all characters to
lowercase
General Motors
Windows
Apple
Slide 35
Slide 35 text
LEMMATIZATION / STEMMING
To reduce a token to its base form
Slide 36
Slide 36 text
LEMMATIZATION: based on a vocabulary
am, are, is be
sou, somos, foi, é ser
car, cars, car’s, cars’ car
English
Portuguese
carros, carro carro
Slide 37
Slide 37 text
STEMMING
Heuristic process that chops off the ends of words
cats cat
ponies poni
Increase the number of returned documents.
However, harming precision...
Slide 38
Slide 38 text
STEMMING
Heuristic process that chops off the ends of words
amor
amor
amores
amora
operating operat
system
Portuguese
system
English
It means love
It's a Brazilian berry
not so meaningful tokens
Slide 39
Slide 39 text
SYNONYMS
bike bicycle
indivíduo pessoa
Slide 40
Slide 40 text
Bag of words:
List of keywords
Ordering of words is ignored!
e.g.
Imagine Dragons
Dragons Imagine
Phrase queries:
Order matters!
Restrict searches
e.g.
"Imagine Dragons"
Slide 41
Slide 41 text
RELEVANCE
term frequency (tf)
total of occurrences of a term in a document
inverse document frequency (idf)
how rare is a term in all indexed documents
Slide 42
Slide 42 text
RELEVANCE
tf-idf = tf x idf
function that balances the term frequency in a
document within how rare is term in a collection
Slide 43
Slide 43 text
Boolean Model
Probabilistic Model
PageRank
...
Slide 44
Slide 44 text
Evaluating a
searching model
Slide 45
Slide 45 text
Evaluation method
We need a test dataset with:
1. A document collection
2. A collection of queries
3. A set of relevance judgments, for each query, a list of relevant and
non-relevant documents
TP: True Positive
TN: True Negative
FP: False Positive
FN: False Negative
Slide 46
Slide 46 text
ACCURACY
TP + TN
TP + FP + TN + FN
Slide 47
Slide 47 text
PRECISION
TP
TP + FP
# Corrected Matches / # Total Results Returned
F1 SCORE
2 * (RECALL + PRECISION)
(RECALL + PRECISION)
Slide 50
Slide 50 text
When a model is good enough for an app?
You can choose the model with the best F1 score, for
example.
However, there is no universal solution
It is an incremental process
You should tune it based on users' information needs
Usability tests is also a good way to evaluate a model
Slide 51
Slide 51 text
Adding a search
engine to my app
Slide 52
Slide 52 text
FULL TEXT SEARCH IN MARIADB...
CREATE TABLE `songs` (
`id` int NOT NULL AUTO_INCREMENT PRIMARY KEY,
`title` varchar(300),
`artist` varchar(255),
`genre` varchar(255),
`lyrics` text
) ENGINE=InnoDB;
CREATE FULLTEXT INDEX songs_title_idx ON songs (title);
CREATE FULLTEXT INDEX songs_artist_idx ON songs (artist);
CREATE FULLTEXT INDEX songs_lyrics_idx ON songs (lyrics);
CREATE FULLTEXT INDEX songs_genre_idx ON songs (genre);
FTS
Slide 53
Slide 53 text
FULL TEXT SEARCH IN MARIADB...
SELECT *
FROM songs
WHERE MATCH (title,artist, lyrics) AGAINST ('imagine' IN NATURAL
LANGUAGE MODE);
SELECT *
FROM songs
WHERE MATCH (title,artist, lyrics) AGAINST ('imagine' IN BOOLEAN MODE);
CREATE FULLTEXT INDEX songs_all_idx ON songs (title,artist,lyrics);
default mode
Slide 54
Slide 54 text
FULL TEXT SEARCH IN MARIADB...
SELECT *
FROM songs
WHERE MATCH (title,artist, lyrics) AGAINST ('imagine dragons');
Returned rows:
Radioactive - Imagine Dragons
Imagine - John Lennon
Slide 55
Slide 55 text
FULL TEXT SEARCH IN MARIADB...
SELECT *
FROM songs
WHERE MATCH (title,artist, lyrics)
AGAINST ('+imagine +dragons') IN BOOLEAN MODE);
Radioactive - Imagine Dragons
Slide 56
Slide 56 text
FULL TEXT SEARCH IN MARIADB...
SELECT *
FROM songs
WHERE MATCH (title,artist, lyrics) AGAINST ('"imagine dragons"');
Radioactive - Imagine Dragons
Slide 57
Slide 57 text
FULL TEXT SEARCH IN MARIADB...
SELECT * FROM songs WHERE MATCH(genre) AGAINST('alternative');
SELECT * FROM songs WHERE MATCH(genre) AGAINST('music');
SELECT * FROM songs WHERE MATCH(genre)
AGAINST('alternative' WITH QUERY EXPANSION);
SELECT * FROM songs WHERE MATCH(genre)
AGAINST('music' WITH QUERY EXPANSION);
Imagine Dragons
John Lennon
Imagine Dragons
John Lennon
Imagine Dragons - Alternative Rock
John Lennon - Rock music, Pop music
Slide 58
Slide 58 text
Why to use an external search engine?
Spell checking!
Spell checking! Or did you mean…
search like Google? ♡
Slide 59
Slide 59 text
Why to use an external search engine?
You can use spell checking!
You can also:
- Add multivalued fields (document oriented database)
- Add new algorithms to the databases
- Customize stop words, stemming analyzers
- Use fuzziness functions
- Boost some documents/fields according to the search
Slide 60
Slide 60 text
Apache Solr and ElasticSearch
Based on Apache Lucene
Document oriented databases (welcome to polyglot persistence!)
It is not a relational database, ok? No ACID, sorry!
Developed to be scalable
Apache Solr has a better documentation +50
ES has native support to Structured Query DSL +1
ES is better for analytic queries
Slide 61
Slide 61 text
ElasticSearch DSL
// artist = John Lennon AND (genres = rock OR genres = pop)
// AND NOT(nome = imagine)
GET /songs/v1/_search
{
"query" : {
"bool": {
"must": {"match": {"artist": "John Lennon" }},
"should": [
{"match": {"genres": "rock" }},
{"match": {"genres": "pop" }}
],
"must_not": {"match": {"nome": "imagine"}}
}
}
}
Slide 62
Slide 62 text
Our choice: Apache Solr
Apache Solr is Open Source and Open Development +1000
Latest release: 7.1.0 (October 17th, 2017)
Slide 63
Slide 63 text
Apache Solr
Slide 64
Slide 64 text
Installing for development environment...
docker run --name my_solr -p 8983:8983 -d solr
https://hub.docker.com/r/risdenk/docker-solr/
Slide 65
Slide 65 text
localhost:8983
Slide 66
Slide 66 text
Creating a core
docker exec -it my_solr solr create_core -c development
core ~> database or table
document ~> a row from a table
schemaless!!
core name
Slide 67
Slide 67 text
List of all cores
Slide 68
Slide 68 text
Menu options for each core
Slide 69
Slide 69 text
Creating a document...
curl -X POST -H 'Content-Type: application/json'
'http://localhost:8983/solr/development/update/json/docs' --data-binary
'
{
"id": "1",
"title": "Song 1"
}'
Deleting a document...
curl -X POST -H 'Content-Type: application/json'
'http://localhost:8983/solr/development/update' --data-binary '
{
"delete": { "id":"1" },
"commit": {},
}'
Slide 79
Slide 79 text
Deleting ALL documents
curl -X POST -H 'Content-Type: application/json'
'http://localhost:8983/solr/development/update?commit=true' --data-binary '
{
"delete": {
"query": "*:*"
}
}'
Slide 80
Slide 80 text
Documents menu - Solr command
Slide 81
Slide 81 text
Searching
Slide 82
Slide 82 text
q (query) main query parameter
fq (filter query) filter query
(to reduce the dataset)
fl (filter list) list of fields to return
sort list of fields to sort the
dataset
Results are paginated
QUERY
Slide 83
Slide 83 text
Basic queries
List all documents (with pagination)
curl 'http://localhost:8983/solr/development/select?q=*:*'
Slide 84
Slide 84 text
Basic queries
List all documents (with pagination)
curl http://localhost:8983/solr/development/select -d '
{
query:"*:*"
}'
Fuzzy matching
title:"song" AND genre:"rock" 1 document
(title:"song" AND genre:"rock") OR title:"track" 2 documents
year: [1980 TO *] 3 documents
genre:[Pop TO *] 3 documents
Boosting:
(title:music OR title:Rock)^1.5 (genre:music OR genre: Rock) 3 documents
1st: "Other music rock"
(title:music OR title:Rock) (genre:music OR genre: Rock)^1.5 3 documents
1st: "My favorite songs"
Slide 88
Slide 88 text
Searching in all fields
In your schema.xml, add:
You can add but it is not recommended:
Then, you can search without defining the default field
Slide 89
Slide 89 text
Analysis: list all indexing and querying transformations
Indexing
Transform.
Querying
Transform.
Slide 90
Slide 90 text
Customizing fields and their analyzers (schema.xml)
Slide 91
Slide 91 text
Spell Checking
Slide 92
Slide 92 text
Building the spell checking indexing
curl --request GET --url
'http://localhost:8983/solr/development/select?q=*:*&spellcheck.build=t
rue&spellcheck=true'
Sunspot needs its own schema.xml.
Follow this example in:
elainenaomi/search_engine
Slide 101
Slide 101 text
Sunspot DSL - Defining the indexed fields
class Song < ActiveRecord::Base
searchable do
text :title, stored: true
text :lyrics, stored: false
text :artist, stored: true
string :genre, multiple: true, stored: true do
genre.split(',')
end
end
end
Sunspot.index! Song.all
Slide 102
Slide 102 text
Bag of words:
search = Song.search do
fulltext 'imagine dragons'
with :genre, 'Rock'
without :genre, 'Pop'
with(:year).less_than 2014
field_list :title, :artist
order_by :title, :asc
end
songs = search.results
Imagine
(John Lennon)
Radioactive
( Imagine Dragons)
Slide 103
Slide 103 text
Phrase queries:
search = Song.search do
fulltext "\"imagine dragons\""
with :genre, 'Rock'
without :genre, 'Pop'
with(:year).less_than 2014
field_list :title, :artist
order_by :title, :asc
end
songs = search.results
Radioactive
( Imagine Dragons)
Slide 104
Slide 104 text
Query Phrase Slop
# Two words can appear between the words in the phrase, so
# "imagine all the people" also matches, in addition to "imagine people"
Song.search do
fulltext '"imagine people"' do
fields :lyrics
query_phrase_slop 2
end
end
Slide 105
Slide 105 text
Minimum Match
Song.search do
fulltext "dragons imagine test" do
fields :artist, :title
minimum_match '70%'
end
end
Song.search do
fulltext 'dragons imagine test' do
fields :artist, :title
boost_fields title: 2.0
minimum_match '60%'
end
end
1 document:
Radioactive
( Imagine Dragons)
2 documents:
1st: Imagine
(John Lennon)
2nd: Radioactive
(Imagine Dragons)
boost
rounded down
To test or not to test?
Unit tests? No.
Integration tests? Maybe…
Search engines depends on terms frequency to ranking docs
You will need all your dataset to compute precision, recall..
You can test only filter queries, indexing callbacks…
Slide 109
Slide 109 text
Summary
Slide 110
Slide 110 text
Summary
The searching
problem
● User: a bug search tool
Adding a search
engine to my app
● Full text search in MariaDB
● Apache Solr x ElasticSearch
Apache Solr
● How to create cores
● CRUD operations
Integrating with Rails
● Sunspot gem
● How to index, search and test
Slide 111
Slide 111 text
Keep in mind
Always verify the user's information needs from your app
E.g.: check if removing stop words, synonymous should be applied
"No" Meghan Trainor
"I am" - P.O.D
E.g: which transformations your search engine should apply
- Phonetic transformations? Custom language analyzers?
Slide 112
Slide 112 text
Keep in mind
The information is not only on text files but also in
audios, videos, images, etc.
Slide 113
Slide 113 text
Suggested topics for studying
- Evaluation of available analyzers for FTS
- Optimization of Performance (such as soft commit, lazy build indexes)
- Distribution and replication through SolrCloud
- Using of Machine Learning algorithms
- Creation of custom function queries
- Authentication
- Integrating with Logstash and Kibana
- Geospatial searches
References
Introduction to Information Retrieval
Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze (2008)
Solr in action
Grainger, Trey, Timothy Potter, and Yonik Seeley (2014)
Sunspot gem
http://sunspot.github.io/
Uma introdução ao tema recuperação de informações textuais.
Barth, F. J. (2013)
10 Reasons to Choose Apache Solr Over Elasticsearch (2016)
https://dzone.com/articles/10-reasons-to-choose-apache-solr-over-elasticsearc
Slide 116
Slide 116 text
References
Apache Solr vs Elasticsearch
http://solr-vs-elasticsearch.com/
When to consider Solr
https://stackoverflow.com/questions/4960952/when-to-consider-solr
Indexing for full text search in PostgreSQL
https://www.compose.com/articles/indexing-for-full-text-search-in-postgresql/
PolyglotPersistence
https://martinfowler.com/bliki/PolyglotPersistence.html
Yahoo! Answers: Qual o nome desta Música?
https://br.answers.yahoo.com/question/index?qid=20080627085726AAJM9Wa
Slide 117
Slide 117 text
References
Full-Text Index in MariaDB
https://mariadb.com/kb/en/library/full-text-index-overview/
Natural Language Full-Text Searches (MySQL)
https://dev.mysql.com/doc/refman/5.7/en/fulltext-natural-language.html
Postgres full-text search is Good Enough! (2015)
http://rachbelaid.com/postgres-full-text-search-is-good-enough/
Text Indexes in MongoDB
https://docs.mongodb.com/manual/core/index-text/
Full-Text Index Stopwords for MariaDB
https://mariadb.com/kb/en/library/full-text-index-stopwords/