Search Intelligence @elo7.com

Fernando Meyer

March 09, 2013

Transcript

  1. Outline

    Some data about our data
    Some history
    Apache Solr
    How Lucene Works
    Examples
      Terms
      Inverted index
    How a result is scored against a query in Lucene
      Lucene conceptual Scoring formula [?]
  2. Search Intelligence

    How have we optimized our index
    How to declare a Solr index
    Infrastructure Upgrade
      version 2 - single node
      version 3 - current infrastructure
    Frenzy API
    Example of product operation
    Content recommendation
    Architecture

    http://elo7.com 2013 3/29
  3. About

    Fernando Meyer - Undergraduate degree in Applied Mathematics from the University of São Paulo. He has more than 12 years of experience in R&D, deploying cool systems for companies like Red Hat (JBoss), Globo and Locaweb. He is currently focusing his research and interests on machine learning, information retrieval and statistics.

    Felipe Besson - B.S. in Information Systems and Master's in Computer Science from the University of São Paulo, Brazil. His research focused on automated testing of web service compositions. Now he is expanding his horizons by working with search, data mining, machine learning and other geek stuff.
  4. Some data about our data

    • 3000 (avg.) queries per second
    • from 3500 to 4200 users on site per minute
    • 15000 requests per minute on AppServer
    • 160000 (avg.) bot/requests per day
    • 1200000 indexed products
    • 20000 active sellers
  5. Some history

    • Search v0.0 - select * from product where text like '%query%'
    • Search v0.1 - Sphinx
      – No delta index
      – Poor index/query performance for large scale dataset
    • Search v1.0 - Apache Solr
  6. Apache Solr

    Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Jetty. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language.
  7. How Lucene Works

    Lucene is an inverted full-text index. This means that it takes all the documents, splits them into words, and then builds an index entry for each word listing where that word occurs. Because a lookup is an exact, unordered string match against those words, it can be extremely fast.
  8. Examples

    Terms

      T[0] = "it is what it is"
      T[1] = "what is it"
      T[2] = "it is a banana"

    Inverted index

      "a":      {(2, 2)}
      "banana": {(2, 3)}
      "is":     {(0, 1), (0, 4), (1, 1), (2, 1)}
      "it":     {(0, 0), (0, 3), (1, 2), (2, 0)}
      "what":   {(0, 2), (1, 0)}
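The example above can be reproduced with a few lines of Python. This is only a minimal sketch of the idea, not how Lucene stores its index: it assumes plain whitespace tokenization, and the names `build_inverted_index` and `search` are made up for illustration. Each entry is a (document number, word position) pair, as in the slide.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to a list of (document id, position) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        # Whitespace tokenization is an assumption of this sketch.
        for position, term in enumerate(text.split()):
            index[term].append((doc_id, position))
    return dict(index)


def search(index, term):
    """Return the sorted ids of the documents that contain the term."""
    return sorted({doc_id for doc_id, _ in index.get(term, [])})


T = ["it is what it is", "what is it", "it is a banana"]
index = build_inverted_index(T)
```

Looking up a term is then a single dictionary access, which is why exact term matches are cheap: `search(index, "what")` only touches the postings for "what".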
  9. How a result is scored against a query in Lucene

    A.K.A. the answer to the dollar question: why isn't this product appearing when searching for "bleh"?

    Lucene conceptual scoring formula [?]

    $$
    \mathrm{score}(q,d) = \mathrm{coord\text{-}factor}(q,d) \cdot \mathrm{query\text{-}boost}(q) \cdot \frac{A \cdot B}{\|A\|\,\|B\|} \cdot \mathrm{doc\text{-}len\text{-}norm}(d) \cdot \mathrm{score}(d)
    $$
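To see how the factors of the conceptual formula combine, here is a rough sketch. It assumes A and B are raw term-count vectors of the query and the document (Lucene itself uses tf-idf style weights), treats the coordination factor as the fraction of query terms found in the document, and leaves the boost and norm factors as plain parameters defaulting to 1. The function names are made up for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """A.B / (|A| |B|) for two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def coord_factor(query_terms, doc_terms):
    """Fraction of distinct query terms that occur in the document."""
    distinct = set(query_terms)
    if not distinct:
        return 0.0
    return len(distinct & set(doc_terms)) / len(distinct)


def conceptual_score(query, document, query_boost=1.0,
                     doc_len_norm=1.0, doc_score=1.0):
    """Multiply the factors of the conceptual formula together."""
    q_terms, d_terms = query.split(), document.split()
    similarity = cosine_similarity(Counter(q_terms), Counter(d_terms))
    return (coord_factor(q_terms, d_terms) * query_boost * similarity
            * doc_len_norm * doc_score)
```

With this sketch a product whose text shares no term with the query always scores 0, which is one common answer to the "why isn't this product appearing" question.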
  10. How have we optimized our index

    <fieldType name="text_pt_br" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1"
                catenateWords="1" catenateNumbers="1"
                catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <filter class="com.elo7.solr.analysis.OrengoStemmerFilterFa
  11.
                exceptionList="stemmerignore.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1"
                catenateWords="0" catenateNumbers="0"
                catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonym ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
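The effect of such an analyzer chain can be approximated in plain Python. The sketch below imitates only four of the configured steps (whitespace tokenizing, stop-word removal with ignoreCase, lower-casing and ASCII folding); word-delimiter splitting, synonyms, stemming and duplicate removal are left out. The stop-word set is a made-up stand-in for the contents of stopwords.txt, and `analyze` and `ascii_fold` are hypothetical helper names.

```python
import unicodedata

# Hypothetical stop words; the real list lives in stopwords.txt.
STOPWORDS = {"de", "da", "do", "para", "com"}


def ascii_fold(token):
    """Strip diacritics, roughly what ASCII folding does for accented Latin letters."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text, stopwords=STOPWORDS):
    """Run a text through a simplified version of the filter chain."""
    tokens = text.split()                                        # whitespace tokenizer
    tokens = [t for t in tokens if t.lower() not in stopwords]   # stop filter, ignoreCase
    tokens = [t.lower() for t in tokens]                         # lower-case filter
    return [ascii_fold(t) for t in tokens]                       # ASCII folding filter
```

Because product text and queries end up in the same normalized form, a query typed without accents, such as `analyze("ceramica")`, can match a title indexed from "Abajur de Cerâmica".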
  12. How to declare a Solr index

    <field name="id" type="int" indexed="true" stored="true" required="true" />
    <field name="title" type="text_pt_br" indexed="true" stored="true"/>
    <field name="description" type="text_pt_br" indexed="true" stored="false" />
    <field name="tags" type="text_pt_br" indexed="true" stored="true" multiValued="true"/>
  13. Infrastructure Upgrade: version 2 - single node

    • Scaling issues
    • m1.xlarge => m2.2xlarge => c1.xlarge, 90% CPU
    • Solr 3.6
    • Full index with Ruby scripts (a full index takes 3.5 h)
  14. version 3 - current infrastructure

    • 3 m1.xlarge (20% CPU usage) behind an Amazon ELB
    • 1 m1.xlarge Search API (staging, 50% of logged-in users)
    • Solr Data Importer (a full index takes 15 min)
  15. Frenzy API

    Solr environment evolution
    • Operations: searching, indexing and deleting
    • Resources: products, stores, auto-complete suggestions and categories
    • Recommendations

    Advantages
    • Removing search and indexing logic from the marketplace
    • Providing a search service to other applications (e.g., mobile)
  16. Example of product operation

    Searching
    • input (GET): query term
      – filters: city, min. price and max. price
      – sort: featured, organic, oldest, newest, ...
    • output (JSON)
      – metadata (query status, response time and hits)
      – list of products
      – references (previous and next page URLs)
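A client could assemble such a search request along these lines. The slides do not give the actual endpoint or parameter names, so the base URL and the keys `q`, `city`, `min_price`, `max_price` and `sort` are all hypothetical, chosen only to mirror the inputs listed above.

```python
from urllib.parse import urlencode


def build_search_url(base_url, term, city=None, min_price=None,
                     max_price=None, sort=None):
    """Build a GET URL from a query term plus optional filters and sort order."""
    params = {"q": term}  # hypothetical parameter name for the query term
    optional = {
        "city": city,
        "min_price": min_price,
        "max_price": max_price,
        "sort": sort,
    }
    # Only send the filters the caller actually set.
    params.update({k: v for k, v in optional.items() if v is not None})
    return f"{base_url}?{urlencode(params)}"


# Hypothetical base URL, for illustration only.
url = build_search_url("http://example.com/frenzy/products", "abajur",
                       city="curitiba", min_price=10.0, max_price=15.0,
                       sort="newest")
```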
  17. Content recommendation

    • Collaborative filtering (user similarity)
    • Based on the products each user has favorited

    Input (GET)
    • frenzy/users/:id/recommendations

    Output: (similar to search output)
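User-similarity collaborative filtering over favorites can be sketched as follows. The slides do not say which similarity measure is used, so Jaccard similarity between sets of favorited products is an assumption here, and the user ids, product ids and function names are made up for illustration.

```python
def jaccard(a, b):
    """Similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user_id, favorites, top_n=5):
    """Suggest products favorited by similar users but not yet by this user."""
    target = favorites[user_id]
    scores = {}
    for other, items in favorites.items():
        if other == user_id:
            continue
        similarity = jaccard(target, items)
        if similarity == 0:
            continue  # users with nothing in common contribute nothing
        for item in items - target:
            scores[item] = scores.get(item, 0.0) + similarity
    # Highest score first; ties broken by product id for a stable order.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


# Made-up favorites: user id -> set of favorited product ids.
favorites = {
    "u1": {201209, 204439},
    "u2": {201209, 204439, 202234},
    "u3": {231324},
}
```

Here `recommend("u1", favorites)` suggests product 202234, because u2 shares two favorites with u1, while u3 has no overlap and is ignored. A user with no overlapping favorites gets an empty list, which is the cold-start case.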
  18. Current Scenario

    • Experimental stage
    • Search operations are being integrated
    • 50% of searches by logged-in users go through the API
    • The recommendation API is still evolving
  19. Future Work: Content Tracker

    We need to understand, track, analyse and take advantage of our users' navigation patterns.
    • Every user receives a unique ID
    • This ID follows all of the user's interactions with the website
    • Whenever a user interacts with a product (views, adds to favorites, shares socially, adds to cart or buys), we trigger a conversion action.
  20.

    SearchID  UserID  Term      pgN  Filters
    A376AC    e00c59  "abajur"  1    Nil
    A376AD    e00c59  "abajur"  1    "pr:[10.0,15.0]"
    A376AE    e00c59  "abajur"  1    "pr:[10.0,15.0] city:curitiba"

    Table 1: Search Action logger
  21.

    ViewID  SearchID  PRDID   PPP
    000001  A376AE    201209  1
    000002  A37FED    204439  5
    000003  EDA342    202234  1
    000004  EFDBC1    231324  5
    000005  EDA563    214512  2
    000006  EFA564    264553  13

    Table 2: Product View logger
  22.

    ActionID  ViewID  type
    000001    000001  cart
    000002    000002  fav
    000003    000005  cart
    000004    000004  social
    000005    000003  ship
    000006    000006  contact

    Table 3: Product Action logger
  23.

    ActionID  convert
    000001    true
    000002    true
    000003    false
    000004    false
    000005    false
    000006    true

    Table 4: Action to convert
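The four logs are linked by their ids (ActionID, then ViewID, then SearchID), so a converting action can be traced back to the product and the search that led to it. The sketch below joins the rows shown in Tables 1 to 4 using plain dictionaries. The dictionary layout and the function name are assumptions for illustration, and searches that do not appear in the Table 1 excerpt resolve to `None`.

```python
# SearchID -> Term (Table 1)
searches = {"A376AC": "abajur", "A376AD": "abajur", "A376AE": "abajur"}

# ViewID -> (SearchID, PRDID) (Table 2)
views = {
    "000001": ("A376AE", "201209"),
    "000002": ("A37FED", "204439"),
    "000003": ("EDA342", "202234"),
    "000004": ("EFDBC1", "231324"),
    "000005": ("EDA563", "214512"),
    "000006": ("EFA564", "264553"),
}

# ActionID -> (ViewID, type) (Table 3)
actions = {
    "000001": ("000001", "cart"),
    "000002": ("000002", "fav"),
    "000003": ("000005", "cart"),
    "000004": ("000004", "social"),
    "000005": ("000003", "ship"),
    "000006": ("000006", "contact"),
}

# ActionID -> convert (Table 4)
converts = {
    "000001": True, "000002": True, "000003": False,
    "000004": False, "000005": False, "000006": True,
}


def converting_products(searches, views, actions, converts):
    """List (product id, action type, search term) for every converting action."""
    rows = []
    for action_id, (view_id, action_type) in sorted(actions.items()):
        if not converts.get(action_id, False):
            continue
        search_id, product_id = views[view_id]
        rows.append((product_id, action_type, searches.get(search_id)))
    return rows


rows = converting_products(searches, views, actions, converts)
```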
  24. BigData Analytics

    • Product conversion per channel
    • Consumer behaviour
    • Trends
    • Better recommendation (including new users)
    • Better email marketing (attractiveness)
    • Per product stats (Clicks/Impressions/CTR)