Elaine Naomi
November 18, 2017
Dealing with a search engine in your application - a Solr approach for beginners

Introduction to the basic concepts of information retrieval, such as tokenization methods, the removal of stop words and special characters, and the addition of a thesaurus, among others. Presentation of methods to verify the quality of the obtained results. The pros and cons of using Apache Solr, a free alternative to Elasticsearch. Integration with Solr through the Sunspot gem, demonstrating how to create indexes, search documents and use the spell checking feature.

Presented at RubyConfBR 2017

Available on: https://en.eventials.com/locaweb/dealing-with-a-search-engine-in-your-application-a-solr-approach-for-beginners-com-elaine-naomi-watanabe/

Transcript

  1. Dealing with a search engine in your application: a Solr approach for beginners

    Elaine Naomi Watanabe
  2. Elaine Naomi Watanabe

    Full-stack developer (Playax)
    Master's degree in Computer Science (IME-USP)
    Passionate about: Web Development, Agile, Cloud Computing, DevOps, NoSQL and RDBMS
  3. ANALYZING BILLIONS OF DATA POINTS TO HELP ARTISTS AND MUSIC PROFESSIONALS DEVELOP THEIR AUDIENCE

    BIG DATA + MUSIC + TECH = <3
  4. SPOILER ALERT

    Searching Problem: Introduction
    Information Retrieval: Basic concepts
    Apache Solr: How to configure
    Sunspot Gem: Integrating with Ruby on Rails
    Next Steps: Including references
  5. "Imagine all the people living life in peace" (Imagine -

    John Lennon) "I'm a radioactive, radioactive" (Radioactive - Imagine Dragons) "Welcome to the jungle, watch it bring you to your knees" (Welcome to the jungle - Guns N' Roses) A little little sample of songs...
  6. Is a SQL LIKE statement enough?

    SEARCHING TERM: "IMAGINE"
    SELECT * FROM songs
    WHERE title LIKE '%IMAGINE%'
       OR artist LIKE '%IMAGINE%'
       OR lyrics LIKE '%IMAGINE%';
  7. "Imagine all the people living life in peace" (Imagine -

    John Lennon) "I'm a radioactive, radioactive" (Radioactive - Imagine Dragons) "Welcome to the jungle, watch it bring you to your knees" (Welcome to the jungle - Guns N' Roses) Searching for "Imagine"
  8. "Imagine all the people living life in peace" (Imagine -

    John Lennon) "I'm a radioactive, radioactive" (Radioactive - Imagine Dragons) "Welcome to the jungle, watch it bring you to your knees" (Welcome to the jungle - Guns N' Roses) A little little sample of songs...
  9. Is a SQL LIKE statement really enough??

    SEARCHING TERM: "IMAGINE PEOPLE"
    SELECT * FROM songs
    WHERE title LIKE '%IMAGINE%PEOPLE%'
       OR artist LIKE '%IMAGINE%PEOPLE%'
       OR lyrics LIKE '%IMAGINE%PEOPLE%';
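The brittleness of the LIKE pattern above can be reproduced outside the database. The following is a minimal Ruby sketch that mimics `LIKE '%IMAGINE%PEOPLE%'` with a regular expression; the `like_match?` helper is hypothetical and assumes a case-insensitive collation, so it only approximates what a real database does.

```ruby
# Hypothetical helper: mimics a case-insensitive SQL LIKE pattern,
# where % matches any run of characters (the _ wildcard is not handled).
def like_match?(text, pattern)
  regex_source = pattern.split('%', -1).map { |part| Regexp.escape(part) }.join('.*')
  Regexp.new("\\A#{regex_source}\\z", Regexp::IGNORECASE | Regexp::MULTILINE).match?(text)
end

lyrics = 'Imagine all the people living life in peace'

puts like_match?(lyrics, '%IMAGINE%PEOPLE%')  # true: same word order as the lyrics
puts like_match?(lyrics, '%PEOPLE%IMAGINE%')  # false: same words, reversed order
puts like_match?(lyrics, '%IMAGINE%PEOPEL%')  # false: a single typo breaks the match
```

The pattern only matches when the user types the words in the stored order and without typos, which is the gap a search engine is meant to close.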
  10. When only Yahoo! Answers is the solution... Ueca tudi diango...

    tanananananananaann nisss ♬ welcome to the jungle, watch it bring you to your knees ♬ (╯°▽°)╯ ︵ ┻━┻
  11. IN THE PAST...

    Listing all documents that matched a search query was enough…
    However, in the Big Data era…
  12. "DON'T PANIC"

    Is it enough to remove punctuation and spaces?
    DON'T PANIC → DON T PANIC
    "do not panic" → do not panic
    How should contractions be tokenized? Are all of the pieces semantic units? Are they the same tokens?
    don't = do not
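The contraction question above can be made concrete with a small Ruby sketch. It compares a naive tokenizer with one that expands known contractions; the `CONTRACTIONS` table and both helper names are illustrative assumptions, not something taken from Solr.

```ruby
# Illustrative expansion table; a real analyzer would need a much larger one.
CONTRACTIONS = { "don't" => %w[do not], "i'm" => %w[i am] }.freeze

# Naive tokenizer: every non-alphanumeric character is treated as a separator.
def naive_tokens(text)
  text.downcase.split(/[^a-z0-9]+/).reject(&:empty?)
end

# Contraction-aware tokenizer: keeps apostrophes inside words,
# then expands the contractions it knows about.
def contraction_aware_tokens(text)
  text.downcase
      .scan(/[a-z0-9]+(?:'[a-z]+)?/)
      .flat_map { |token| CONTRACTIONS.fetch(token, [token]) }
end

p naive_tokens("DON'T PANIC")               # => ["don", "t", "panic"]
p contraction_aware_tokens("DON'T PANIC")   # => ["do", "not", "panic"]
p contraction_aware_tokens('do not panic')  # => ["do", "not", "panic"]
```

With the expansion step, both spellings produce the same tokens, so a query for one form can match documents written with the other.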
  13. "Fullstack developer", "Full-stack dev", "Full stack developer"

    Is it enough to remove punctuation and spaces?
    Fullstack developer / Full stack developer / Full-stack developer
    For a user, these terms should return the same documents, shouldn't they?
  14. How to deal with numbers and abbreviations?

    30 seconds to Mars = Thirty seconds to Mars
    November, 18th, 2017 = 2017-11-18
    SP = São Paulo
  15. About the semantics of the original term and its normalized token...

    Kaminari: is it a gem or thunder in Japanese?
    Windows: is it the plural of window or about the company?
  16. STOP WORDS: extremely common words In English: a, an, the,

    and, or, are, as, at, by, for, from, of ... In Portuguese: um, uma, a, o, as, os, é, são, por, de, da, do, se …
  17. Stop words, diacritics, case folding...

    Case folding normalization: HELLO WORLD, Hello World, hello world → hello world
    Diacritics removal: naive, naïve → naive
    Stop word removal: roses are red, red roses → roses red
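The three normalization steps above can be sketched as a tiny Ruby pipeline. This is a minimal illustration that assumes a short English stop word list; the method names are hypothetical and this is not how Solr's analyzers are implemented.

```ruby
# Small illustrative stop word list (a subset of the English examples in the talk).
STOP_WORDS = %w[a an the and or are as at by for from of].freeze

# Case folding: map every character to lowercase.
def case_fold(text)
  text.downcase
end

# Diacritics removal: decompose accented characters (NFD) and drop the combining marks.
def remove_diacritics(text)
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
end

# Full pipeline: fold case, strip diacritics, tokenize, drop stop words.
def normalize(text)
  tokens = remove_diacritics(case_fold(text)).scan(/[[:alnum:]]+/)
  tokens.reject { |token| STOP_WORDS.include?(token) }
end

p normalize('HELLO WORLD')    # => ["hello", "world"]
p normalize('naïve')          # => ["naive"]
p normalize('Roses are red')  # => ["roses", "red"]
```

The same pipeline has to run both at index time and at query time, otherwise the query tokens will not line up with the indexed ones.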
  18. When not to normalize tokens…

    Strings composed solely of stop words: The Who (a band), Se (Brazilian song, by Djavan)
    Different meanings for words with and without diacritics: in Spanish, peña means "cliff" and pena means "sorrow"
    When not to set all characters to lowercase: General Motors, Windows, Apple
  19. LEMMATIZATION: based on a vocabulary

    English: am, are, is → be
    English: car, cars, car’s, cars’ → car
    Portuguese: sou, somos, foi, é → ser
    Portuguese: carros, carro → carro
  20. STEMMING

    Heuristic process that chops off the ends of words
    cats → cat
    ponies → poni
    Increases the number of returned documents, but harms precision...
  21. STEMMING

    Heuristic process that chops off the ends of words
    Portuguese: amor, amores, amora → amor (amor means love; amora is a Brazilian berry)
    English: operating → operat, system → system (not so meaningful tokens)
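To illustrate the "chop off the ends" heuristic, here is a deliberately tiny Ruby stemmer. The three suffix rules are an assumption made for this sketch and cover only the English examples; real stemmers such as Porter's apply many more rules and conditions.

```ruby
# Toy suffix rules, tried in order. These are illustrative only.
SUFFIX_RULES = [
  [/ies\z/, 'i'],  # ponies    -> poni
  [/ing\z/, ''],   # operating -> operat
  [/s\z/, '']      # cats      -> cat
].freeze

# Applies the first matching rule; very short words are left untouched.
def toy_stem(word)
  return word if word.length <= 3

  SUFFIX_RULES.each do |suffix, replacement|
    return word.sub(suffix, replacement) if word.match?(suffix)
  end
  word
end

%w[cats ponies operating system].each do |word|
  puts "#{word} -> #{toy_stem(word)}"
end
```

Because the rules know nothing about vocabulary, unrelated words can collapse into the same stem, which is the loss of precision mentioned above.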
  22. Bag of words: list of keywords

    Ordering of words is ignored!
    e.g. Imagine Dragons = Dragons Imagine
    Phrase queries: order matters! They restrict searches
    e.g. "Imagine Dragons"
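The difference between the two query styles can be sketched in plain Ruby. This is a simplified model that assumes a bag-of-words query matches when any term is present (ranking is ignored); the helper names are hypothetical.

```ruby
# Lowercased alphanumeric tokens of a text.
def tokens(text)
  text.downcase.scan(/[a-z0-9]+/)
end

# Bag of words: any query term may appear anywhere, in any order.
def bag_of_words_match?(query, text)
  (tokens(query) & tokens(text)).any?
end

# Phrase query: the terms must appear adjacent and in the same order.
def phrase_match?(phrase, text)
  words = tokens(phrase)
  return false if words.empty?

  tokens(text).each_cons(words.size).any? { |window| window == words }
end

songs = ['Imagine - John Lennon', 'Radioactive - Imagine Dragons']

p songs.select { |song| bag_of_words_match?('Dragons Imagine', song) }  # both songs
p songs.select { |song| phrase_match?('Imagine Dragons', song) }        # only the Imagine Dragons song
p songs.select { |song| phrase_match?('Dragons Imagine', song) }        # no songs
```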
  23. RELEVANCE

    term frequency (tf): total number of occurrences of a term in a document
    inverse document frequency (idf): how rare a term is across all indexed documents
  24. RELEVANCE

    tf-idf = tf x idf
    A function that balances how frequent a term is in a document against how rare the term is in the collection
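As a worked example of tf x idf, the sketch below scores terms over the three sample lyrics from the beginning of the talk. It assumes the textbook form idf = ln(N / df); Lucene-based engines use refined scoring formulas, so the numbers are illustrative only.

```ruby
# tf: raw number of occurrences of the term in one document.
def term_frequency(term, doc_tokens)
  doc_tokens.count(term)
end

# idf: ln(total documents / documents containing the term); 0.0 if the term is unseen.
def inverse_document_frequency(term, corpus)
  docs_with_term = corpus.count { |tokens| tokens.include?(term) }
  return 0.0 if docs_with_term.zero?

  Math.log(corpus.size.to_f / docs_with_term)
end

def tf_idf(term, doc_tokens, corpus)
  term_frequency(term, doc_tokens) * inverse_document_frequency(term, corpus)
end

corpus = [
  %w[imagine all the people living life in peace],
  %w[i am radioactive radioactive],
  %w[welcome to the jungle watch it bring you to your knees]
]

# "radioactive" is frequent in its document and rare in the collection: high score.
puts tf_idf('radioactive', corpus[1], corpus).round(3)
# "the" appears in two of the three documents: low score.
puts tf_idf('the', corpus[0], corpus).round(3)
```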
  25. Evaluation method

    We need a test dataset with:
    1. A document collection
    2. A collection of queries
    3. A set of relevance judgments: for each query, a list of relevant and non-relevant documents
    TP: True Positive
    TN: True Negative
    FP: False Positive
    FN: False Negative
  26. RECALL

    Recall = TP / (TP + FN) = # Correct Matches / (# Correct Matches + # Missed Matches)
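The recall formula, together with the standard definitions of precision (TP / (TP + FP)) and F1 (the harmonic mean of precision and recall), can be computed from a list of retrieved documents and a set of relevance judgments. The sketch below uses made-up sample judgments; the `evaluate` helper is hypothetical.

```ruby
require 'set'

# Computes precision, recall and F1 for one query.
# retrieved: documents returned by the engine; relevant: documents judged relevant.
def evaluate(retrieved, relevant)
  retrieved = retrieved.to_set
  relevant = relevant.to_set

  tp = (retrieved & relevant).size  # correct matches
  fp = (retrieved - relevant).size  # returned but not relevant
  fn = (relevant - retrieved).size  # missed matches

  precision = (tp + fp).zero? ? 0.0 : tp.to_f / (tp + fp)
  recall = (tp + fn).zero? ? 0.0 : tp.to_f / (tp + fn)
  f1 = (precision + recall).zero? ? 0.0 : 2 * precision * recall / (precision + recall)

  { precision: precision, recall: recall, f1: f1 }
end

# Made-up judgments, for illustration only.
retrieved = ['Imagine', 'Radioactive', 'Welcome to the Jungle']
relevant = ['Imagine', 'Radioactive', 'Song A', 'Song B']

scores = evaluate(retrieved, relevant)
puts scores  # 2 TP, 1 FP, 2 FN: precision 2/3, recall 1/2, F1 4/7
```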
  27. When is a model good enough for an app?

    You can choose the model with the best F1 score, for example.
    However, there is no universal solution
    It is an incremental process
    You should tune it based on users' information needs
    Usability tests are also a good way to evaluate a model
  28. FULL TEXT SEARCH (FTS) IN MARIADB...

    CREATE TABLE `songs` (
      `id` int NOT NULL AUTO_INCREMENT PRIMARY KEY,
      `title` varchar(300),
      `artist` varchar(255),
      `genre` varchar(255),
      `lyrics` text
    ) ENGINE=InnoDB;
    CREATE FULLTEXT INDEX songs_title_idx ON songs (title);
    CREATE FULLTEXT INDEX songs_artist_idx ON songs (artist);
    CREATE FULLTEXT INDEX songs_lyrics_idx ON songs (lyrics);
    CREATE FULLTEXT INDEX songs_genre_idx ON songs (genre);
  29. FULL TEXT SEARCH IN MARIADB...

    CREATE FULLTEXT INDEX songs_all_idx ON songs (title,artist,lyrics);
    -- NATURAL LANGUAGE MODE is the default mode
    SELECT * FROM songs WHERE MATCH (title,artist, lyrics) AGAINST ('imagine' IN NATURAL LANGUAGE MODE);
    SELECT * FROM songs WHERE MATCH (title,artist, lyrics) AGAINST ('imagine' IN BOOLEAN MODE);
  30. FULL TEXT SEARCH IN MARIADB... SELECT * FROM songs WHERE

    MATCH (title,artist, lyrics) AGAINST ('imagine dragons'); Returned rows: Radioactive - Imagine Dragons Imagine - John Lennon
  31. FULL TEXT SEARCH IN MARIADB...

    SELECT * FROM songs WHERE MATCH (title,artist, lyrics) AGAINST ('+imagine +dragons' IN BOOLEAN MODE);
    Returned rows: Radioactive - Imagine Dragons
  32. FULL TEXT SEARCH IN MARIADB... SELECT * FROM songs WHERE

    MATCH (title,artist, lyrics) AGAINST ('"imagine dragons"'); Radioactive - Imagine Dragons
  33. FULL TEXT SEARCH IN MARIADB... SELECT * FROM songs WHERE

    MATCH(genre) AGAINST('alternative'); SELECT * FROM songs WHERE MATCH(genre) AGAINST('music'); SELECT * FROM songs WHERE MATCH(genre) AGAINST('alternative' WITH QUERY EXPANSION); SELECT * FROM songs WHERE MATCH(genre) AGAINST('music' WITH QUERY EXPANSION); Imagine Dragons John Lennon Imagine Dragons John Lennon Imagine Dragons - Alternative Rock John Lennon - Rock music, Pop music
  34. Why use an external search engine?

    Spell checking! Spell checking!
    Or "did you mean…" search like Google? ♡
  35. Why use an external search engine?

    You can use spell checking! You can also:
    - Add multivalued fields (document oriented database)
    - Add new algorithms to the databases
    - Customize stop words and stemming analyzers
    - Use fuzziness functions
    - Boost some documents/fields according to the search
  36. Apache Solr and ElasticSearch

    Based on Apache Lucene
    Document oriented databases (welcome to polyglot persistence!)
    They are not relational databases, ok? No ACID, sorry!
    Developed to be scalable
    Apache Solr has better documentation +50
    ES has native support for a structured query DSL +1
    ES is better for analytic queries
  37. ElasticSearch DSL

    // artist = John Lennon AND (genres = rock OR genres = pop)
    // AND NOT(nome = imagine)
    GET /songs/v1/_search
    {
      "query": {
        "bool": {
          "must": {"match": {"artist": "John Lennon"}},
          "should": [
            {"match": {"genres": "rock"}},
            {"match": {"genres": "pop"}}
          ],
          "must_not": {"match": {"nome": "imagine"}}
        }
      }
    }
  38. Our choice: Apache Solr Apache Solr is Open Source and

    Open Development +1000 Latest release: 7.1.0 (October 17th, 2017)
  39. Installing for development environment...

    docker run --name my_solr -p 8983:8983 -d solr
    https://hub.docker.com/r/risdenk/docker-solr/
  40. Creating a core

    docker exec -it my_solr solr create_core -c development
    "development" is the core name
    core ~> database or table
    document ~> a row from a table
    schemaless!!
  41. Creating a document...

    curl -X POST -H 'Content-Type: application/json' \
      'http://localhost:8983/solr/development/update?commit=true' \
      --data-binary '[{ "id": "1", "title": "Song 1" },{ "title": "Song 2" }]'
    The "id" field is optional on insert
  42. QUERY

    q (query): main query parameter
    fq (filter query): filter query (to reduce the dataset)
    fl (field list): list of fields to return
    sort: list of fields to sort the dataset
    Results are paginated
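The query parameters above can be combined into a select URL with Ruby's standard library. This is a minimal sketch that assumes the local `development` core from the earlier slides; `solr_select_url` is a hypothetical helper, and `start`/`rows` are Solr's pagination parameters.

```ruby
require 'uri'

# Assumes the local "development" core created earlier in the talk.
SOLR_SELECT_URL = 'http://localhost:8983/solr/development/select'

# Hypothetical helper that builds a Solr select URL from the common parameters.
def solr_select_url(q:, fq: [], fl: [], sort: nil, start: 0, rows: 10)
  params = [['q', q]]
  fq.each { |filter| params << ['fq', filter] }    # fq may be repeated
  params << ['fl', fl.join(',')] unless fl.empty?  # comma-separated field list
  params << ['sort', sort] if sort
  params << ['start', start] << ['rows', rows]     # pagination window
  "#{SOLR_SELECT_URL}?#{URI.encode_www_form(params)}"
end

puts solr_select_url(
  q: 'title:song',
  fq: ['genre:Rock'],
  fl: %w[id title],
  sort: 'year desc'
)
```

Only the URL is built here; sending it requires a running Solr instance and an HTTP client.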
  43. My documents { "docs": [ { "title": ["Song 1"], "genre":

    "Rock", "year": 2010 }, { "title": ["Song 2"], "genre": "MPB", "year": 1990 }, { "title": ["Other music Rock"], "genre": "Pop", "year": 1970 }, { "title": ["My favorite songs"], "genre": "Rock Music", "year": 2011 } ] }
  44. Fuzzy matching

    title:Song* → 3 documents
    title:Song? → 1 document
    title:Sonjs → 0 documents
    title:Sonjs~1 → 1 document
    title:Sonjs~2 → 3 documents
    title:(my songs) → 1 document
    title:"my songs" → 0 documents
    title:"my songs"~2 → 1 document
    title:(-favorite +song*) → 2 documents
    *:* → 4 documents
    Wildcards: ? matches one letter, * matches any number of letters
    ~ fuzzy edit distance on a term, query slop on a phrase
    ( ) keyword search
    " " phrase query
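The `Sonjs~1` and `Sonjs~2` results above follow from edit distance. The sketch below uses plain Levenshtein distance over the lowercased title terms of the sample documents; Lucene's fuzzy queries are based on a closely related distance, so treat this as an approximation rather than Solr's implementation.

```ruby
# Levenshtein distance: minimum number of single-character insertions,
# deletions and substitutions needed to turn a into b.
def edit_distance(a, b)
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost].min
    end
    previous = current
  end
  previous.last
end

# Lowercased terms from the titles of the sample documents.
terms = %w[song other music rock my favorite songs]

[1, 2].each do |max_edits|
  matches = terms.select { |term| edit_distance('sonjs', term) <= max_edits }
  puts "sonjs~#{max_edits}: #{matches.inspect}"
end
# sonjs~1: ["songs"]
# sonjs~2: ["song", "songs"]
```

With one edit only "songs" is reachable (one document); with two edits "song" is reachable too, which brings in the two "Song N" titles (three documents).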
  45. Fuzzy matching

    title:"song" AND genre:"rock" → 1 document
    (title:"song" AND genre:"rock") OR title:"track" → 2 documents
    year:[1980 TO *] → 3 documents
    genre:[Pop TO *] → 3 documents
    Boosting:
    (title:music OR title:Rock)^1.5 (genre:music OR genre:Rock) → 3 documents, 1st: "Other music rock"
    (title:music OR title:Rock) (genre:music OR genre:Rock)^1.5 → 3 documents, 1st: "My favorite songs"
  46. Searching in all fields

    In your schema.xml, add:
    <copyField source="*_txt" dest="_text_" />
    <copyField source="*_text" dest="_text_" />
    You can also add the following, but it is not recommended:
    <copyField source="*" dest="_text_" />
    Then, you can search without defining the default field
  47. Customizing fields and their analyzers (schema.xml)

    <fieldtype name="phonetic" stored="false" indexed="true" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.DoubleMetaphoneFilterFactory" inject="false"/>
        <filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_pt.txt" />
      </analyzer>
    </fieldtype>
  48. Connecting through a REST Client …

    params = { q: 'title:song' }
    response = RestClient.get "http://localhost:8983/solr/development/select?#{params.to_param}"
    response_json = JSON.parse(response.body)
    items = response_json["response"]["docs"]
    # => [{"title"=>["Song 1"], "id"=>"eeb507c6-461f-4219-9f5a-50528340d84d", "_version_"=>1584234836063682560, "title_str"=>["Song 1"]},
    #     {"title"=>["Song 2"], "id"=>"1b8bacc1-9ed9-4c85-922d-71b3472f9d44", "_version_"=>1584234836065779712, "title_str"=>["Song 2"]}]
    ヽ(•́o•̀)ノ
  49. Installing...

    gem 'sunspot_rails'
    rails generate sunspot_rails:install
    config/sunspot.yml:
    development:
      solr:
        hostname: solr
        port: 8983
        path: /solr/playax
        log_level: INFO
      auto_index_callback: after_commit
      auto_remove_callback: after_commit
  50. Sunspot DSL - Defining the indexed fields

    class Song < ActiveRecord::Base
      searchable do
        text :title, stored: true
        text :lyrics, stored: false
        text :artist, stored: true
        string :genre, multiple: true, stored: true do
          genre.split(',')
        end
      end
    end
    Sunspot.index! Song.all
  51. Bag of words:

    search = Song.search do
      fulltext 'imagine dragons'
      with :genre, 'Rock'
      without :genre, 'Pop'
      with(:year).less_than 2014
      field_list :title, :artist
      order_by :title, :asc
    end
    songs = search.results
    # Imagine (John Lennon)
    # Radioactive (Imagine Dragons)
  52. Phrase queries:

    search = Song.search do
      fulltext "\"imagine dragons\""
      with :genre, 'Rock'
      without :genre, 'Pop'
      with(:year).less_than 2014
      field_list :title, :artist
      order_by :title, :asc
    end
    songs = search.results
    # Radioactive (Imagine Dragons)
  53. Query Phrase Slop

    # Two words can appear between the words in the phrase, so
    # "imagine all the people" also matches, in addition to "imagine people"
    Song.search do
      fulltext '"imagine people"' do
        fields :lyrics
        query_phrase_slop 2
      end
    end
  54. Minimum Match

    Song.search do
      fulltext "dragons imagine test" do
        fields :artist, :title
        minimum_match '70%'
      end
    end
    # 1 document: Radioactive (Imagine Dragons)
    Song.search do
      fulltext 'dragons imagine test' do
        fields :artist, :title
        boost_fields title: 2.0  # boost
        minimum_match '60%'      # rounded down
      end
    end
    # 2 documents: 1st: Imagine (John Lennon), 2nd: Radioactive (Imagine Dragons)
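The "rounded down" note explains why 70% and 60% behave differently for a three-term query. A minimal sketch of that arithmetic, assuming the percentage is applied to the number of query terms and truncated (`required_matches` is a hypothetical helper, not part of Sunspot):

```ruby
# Number of query terms that must match for a percentage minimum_match.
# Integer division truncates, which models the "rounded down" behaviour.
def required_matches(term_count, percent)
  term_count * percent / 100
end

terms = 'dragons imagine test'.split

puts required_matches(terms.size, 70)  # 3 * 70 / 100 = 2 terms required
puts required_matches(terms.size, 60)  # 3 * 60 / 100 = 1 term required
```

With two required terms only "Radioactive (Imagine Dragons)" qualifies; with one required term "Imagine (John Lennon)" qualifies as well.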
  55. Spell checking

    search = Sunspot.search(Song) do
      keywords 'Imagina Dragoons'
      spellcheck :count => 3
    end
    search.spellcheck_suggestion_for('imagina') # => 'imagine'
    search.spellcheck_suggestions # => [{"word"=>"imagine", "freq"=>3}, {"word"=>"dragons", "freq"=>1}]
  56. To test or not to test?

    Unit tests? No. Integration tests? Maybe…
    Search engines depend on term frequency to rank documents
    You will need your whole dataset to compute precision, recall...
    You can test only filter queries, indexing callbacks…
  57. Summary The searching problem • User: a bug search tool

    Adding a search engine to my app • Full text search in MariaDB • Apache Solr x ElasticSearch Apache Solr • How to create cores • CRUD operations Integrating with Rails • Sunspot gem • How to index, search and test
  58. Keep in mind

    Always verify the information needs of your app's users
    E.g.: check whether stop word removal and synonyms should be applied: "No" (Meghan Trainor), "I am" (P.O.D)
    E.g.: decide which transformations your search engine should apply: Phonetic transformations? Custom language analyzers?
  59. Keep in mind

    Information is found not only in text files but also in audio, video, images, etc.
  60. Suggested topics for studying

    - Evaluation of available analyzers for FTS
    - Performance optimization (such as soft commits, lazily built indexes)
    - Distribution and replication through SolrCloud
    - Use of Machine Learning algorithms
    - Creation of custom function queries
    - Authentication
    - Integration with Logstash and Kibana
    - Geospatial searches
  61. References

    Introduction to Information Retrieval. Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze (2008)
    Solr in Action. Grainger, Trey, Timothy Potter, and Yonik Seeley (2014)
    Sunspot gem http://sunspot.github.io/
    Uma introdução ao tema recuperação de informações textuais. Barth, F. J. (2013)
    10 Reasons to Choose Apache Solr Over Elasticsearch (2016) https://dzone.com/articles/10-reasons-to-choose-apache-solr-over-elasticsearc
  62. References

    Apache Solr vs Elasticsearch http://solr-vs-elasticsearch.com/
    When to consider Solr https://stackoverflow.com/questions/4960952/when-to-consider-solr
    Indexing for full text search in PostgreSQL https://www.compose.com/articles/indexing-for-full-text-search-in-postgresql/
    PolyglotPersistence https://martinfowler.com/bliki/PolyglotPersistence.html
    Yahoo! Answers: Qual o nome desta Música? https://br.answers.yahoo.com/question/index?qid=20080627085726AAJM9Wa
  63. References

    Full-Text Index in MariaDB https://mariadb.com/kb/en/library/full-text-index-overview/
    Natural Language Full-Text Searches (MySQL) https://dev.mysql.com/doc/refman/5.7/en/fulltext-natural-language.html
    Postgres full-text search is Good Enough! (2015) http://rachbelaid.com/postgres-full-text-search-is-good-enough/
    Text Indexes in MongoDB https://docs.mongodb.com/manual/core/index-text/
    Full-Text Index Stopwords for MariaDB https://mariadb.com/kb/en/library/full-text-index-stopwords/