Elasticsearch & PeopleSearch

Presented on 10/11/12 at the Boston Elasticsearch meetup held at the Microsoft New England Research & Development Center. This talk gave a very high-level overview of Elasticsearch to newcomers and explained why ES is a good fit for Traackr's use case.

George P. Stathis

October 11, 2012

Transcript

  1. Elasticsearch & “PeopleSearch” Leveraging Elasticsearch @

  2. About Traackr

    A search engine
    A people discovery engine
    Subscription-based
    Migrated from Solr to Elasticsearch in Q3 ’12
  3. About me

    14+ years of experience building full-stack web software systems, with a past focus on e-commerce and publishing
    VP Engineering @ Traackr, responsible for building engineering capability to enable Traackr's growth goals
    about.me/george-stathis
  4. About this talk

    Short intro to Elasticsearch
    How search is done @ Traackr
    Why Elasticsearch was the right fit
  5. About Elasticsearch

    Lucene under the covers
    Distributed from the ground up
    Full support for Lucene Near Real-Time search
    Native JSON Query DSL
    Automatic schema detection (“schema-less”)
    Supports document types
  6. Elasticsearch - Distributed

    Indices broken into shards
    shards have 0 or more replicas
    data nodes hold one or more shards
    data nodes can coordinate/forward requests
    automatic routing & rebalancing, but overrides available
    Default mode is multicast (zen discovery); unicast available for multicast-unfriendly networks
    AWS plug-in available
    Zookeeper plug-in available, made possible by Sonian
    YouTube demo: http://youtu.be/l4ReamjCxHo
    Source: https://confluence.oceanobservatories.org/display/CIDev/Indexing+with+ElasticSearch
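The automatic routing on slide 6 can be pictured as hashing a routing key (the document id by default) and taking the result modulo the number of primary shards. The sketch below is only an illustration of that idea: the function name `pick_shard`, the sample ids, and the use of CRC32 are assumptions made for the example, not Elasticsearch's actual hash function or API.

```python
import zlib


def pick_shard(routing_key: str, num_primary_shards: int) -> int:
    """Map a routing key to a shard number (illustrative only)."""
    if num_primary_shards < 1:
        raise ValueError("an index needs at least one primary shard")
    # CRC32 stands in for the engine's real hash function (assumption).
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


# Overriding the routing key (here: a hypothetical author id instead of the
# post id) sends all of one author's posts to the same shard.
posts = [("post-1", "author-a"), ("post-2", "author-a"), ("post-3", "author-b")]
shards = {post_id: pick_shard(author_id, 5) for post_id, author_id in posts}
```

Because the shard is derived from the key alone, any node can work out where a document lives and forward the request there, which is what lets every data node act as a coordinator.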
  7. Elasticsearch - NRT

    Uses Lucene’s IndexReader.open(IndexWriter writer, boolean applyAllDeletes)
    Opens a near real time IndexReader from the IndexWriter
    By default, flushes and makes new updates available every second
  8. Elasticsearch - JSON DSL

    # Query String
    curl 'localhost:9200/test/_search?pretty=1' -d '{
      "query" : {
        "query_string" : {
          "query" : "tags:scala"
        }
      }
    }'

    # Range
    curl 'localhost:9200/test/_search?pretty=1' -d '{
      "query" : {
        "range" : {
          "price" : { "gt" : 15 }
        }
      }
    }'

    Source: https://github.com/kimchy/talks/blob/master/2011/wsnparis/06-search-querydsl.sh
  9. Elasticsearch - JSON DSL (cont)

    # Filtered Query
    # Filters are similar to queries, except they do no scoring
    # and are easily cached.
    # There are many filter types as well, including range and term
    curl 'localhost:9200/test/_search?pretty=1' -d '{
      "query" : {
        "filtered" : {
          "query" : {
            "query_string" : {
              "query" : "tags:scala"
            }
          },
          "filter" : {
            "range" : {
              "price" : { "gt" : 15 }
            }
          }
        }
      }
    }'

    Source: https://github.com/kimchy/talks/blob/master/2011/wsnparis/06-search-querydsl.sh
  10. Elasticsearch - Schema

    Dynamic object mapping with intelligent defaults
    Can be turned off
    Can be overridden globally or on a per index basis:

    {
      "_default_" : {
        "date_formats" : ["yyyy-MM-dd", "dd-MM-yyyy", "date_optional_time"]
      }
    }
  11. Elasticsearch Demo

  12. Search @ Traackr Answering authors by searching posts

  13. Traackr search requirements

    Posts are coming in at about 1 million a day
    Each author averages several hundred posts
    Posts need to be available for search immediately
    Relevance and sorting have to be rolled up/grouped at the author level
  14. Early approach to search

    search posts
    group matched posts by author
    for each grouped set, add up the Lucene scores of the posts
    combine sum of post scores with author social and website metrics for final group score
    sort groups (i.e. authors)
    try to do this quickly!
    Performance hit
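The steps on slide 14 amount to an application-side roll-up of post hits into author scores. A minimal sketch follows, under stated assumptions: the function name `rank_authors`, the sample data, and in particular the additive combination with a `metric_weight` factor are hypothetical, since the talk does not say how post scores and author metrics were actually combined.

```python
from collections import defaultdict


def rank_authors(post_hits, author_metrics, metric_weight=1.0):
    """Roll post-level hits up to authors and sort by a combined score.

    post_hits: iterable of (author_id, lucene_score) pairs.
    author_metrics: dict mapping author_id to a precomputed metric.
    """
    # Group matched posts by author and add up their relevance scores.
    score_sums = defaultdict(float)
    for author_id, score in post_hits:
        score_sums[author_id] += score

    # Combine each sum with the author-level metric (assumed: weighted sum).
    combined = {
        author_id: total + metric_weight * author_metrics.get(author_id, 0.0)
        for author_id, total in score_sums.items()
    }

    # Sort groups (authors) by final score, best first.
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


hits = [("a1", 3.0), ("a2", 2.0), ("a1", 1.5), ("a2", 0.5), ("a3", 4.0)]
metrics = {"a1": 1.0, "a2": 5.0, "a3": 0.0}
ranking = rank_authors(hits, metrics)
```

Every matching post has to be pulled out of the engine before the final order is known, so the work grows with the number of matched posts rather than the number of authors shown. That is the performance hit the slide refers to.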
  15. Room for improvement

    How can we avoid the “late binding” performance penalty?
    Get the search engine to do as much of the scoring as possible
    Store all data needed for displaying results in the search engine (i.e. no db calls)
  16. Alternatives - Denormalize?

    Index authors and their posts together under one document.
    Pros
      straightforward
      built-in post relevance sum
    Cons
      each profile change would trigger the reindexing of all the author’s posts
      each new post would trigger the reindexing of all the author’s posts + profile
      a non-starter for real-time search
  17. Alternatives - Solr Join?

    “In many cases, documents have relationships between them and it is too expensive to denormalize them. Thus, a join operation is needed. Preserving the document relationship allows documents to be updated independently without having to reindex large numbers of denormalized documents.” - http://wiki.apache.org/solr/Join
    E.g. Find all post docs matching "search engines", then join them against author docs and return that list of authors:
    ...?q={!join+from=author_id+to=id}search+engines
    Pros
      addresses the issue of loading author profiles from db
    Cons
      Does not preserve the post relevance scores -> non-starter
      Submit patch to get scores? Wouldn’t touch SOLR-2272 with a ten-foot pole
  18. Alternatives - Solr Grouping?

    Groups results by a given document field (e.g. author_id)
    http://wiki.apache.org/solr/FieldCollapsing
    ...&q=real+time+search&group=true&group.field=author_id

    [...]
    "grouped":{
      "author_id":{
        "matches":2,
        "groups":[{
          "groupValue":"04e3bc5078344ad1a065815f0bb9f14d",
          "doclist":{"maxScore":3.456747,"numFound":1,"start":0,"docs":[
            {
              "id":"5d09240934eb331bada1ff3f0b773153",
              "title":"Refresh API",
              "url":"http://www.elasticsearch.org/guide/reference/api/admin-indices-refresh.html",
              "author_id":"04e3bc5078344ad1a065815f0bb9f14d"}]
          }},
          {
          "groupValue":"9e4f40e1aa82f2e1a9368748d1268082",
          "doclist":{"maxScore":2.456747,"numFound":2,"start":0,"docs":[
            {
              "id":"831ce82bdff34abeb495f260bc7d67d2",
              "title":"Realtime Search: Solr vs Elasticsearch",
              "url":"http://blog.socialcast.com/realtime-search-solr-vs-elasticsearch/",
              "author_id":"9e4f40e1aa82f2e1a9368748d1268082"},
            [...]]
          }}]}}
  19. Alternatives - Solr Grouping?

    Pros
      Faster than doing grouping at the app layer: no need for post counting
      Possible to sort groups by sum of post relevance scores inside the engine (with some custom work)
    Cons
      No concept of author; author profiles still need to be fetched from db, so still suffers from some performance penalty
      Submit patch for group sort options? Not a lot of interest in sorting groups by anything other than max score
      Don’t want to be stuck maintaining custom Solr code (been there, done that with HBase: http://www.slideshare.net/gstathis/finding-the-right-nosql-db-for-the-job-the-path-to-a-nonrdbms-solution-at-traackr )
  20. Alternatives - Elasticsearch!

    Supports document types and parent/child document mappings: http://www.elasticsearch.org/guide/reference/mapping/parent-field.html

    {
      "post" : {
        "_parent" : {
          "type" : "author"
        }
      }
    }

    Out-of-the-box support for querying child documents and obtaining their parents: http://www.elasticsearch.org/guide/reference/query-dsl/top-children-query.html
    Con: memory heavy
    Parent documents can be sorted by sum/avg/max of child document scores.

    curl 'localhost:9200/traackr/_search?pretty=1' -d '{
      "query": {
        "top_children": {
          "type": "post",
          "query": {
            "query_string": {
              "query": "elasticsearch NRT"
            }
          },
          "score": "sum"
        }
      }
    }'

    can order parent results by sum of child scores
    Big win
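To make the parent/child setup on slide 20 concrete, here is a minimal sketch that only builds the request pieces and sends nothing over HTTP. The helper names and the sample ids are hypothetical; the mapping and query bodies mirror the ones on the slide, and the `parent` URL parameter is the one the index API of that era used to attach a child document to its parent. It assumes the 2012-era `top_children` query shown in the talk.

```python
import json
from urllib.parse import quote

INDEX = "traackr"  # index name taken from the slide's curl example
SCORE_MODES = ("max", "sum", "avg")


def post_mapping():
    """Mapping that declares 'author' as the parent type of 'post'."""
    return {"post": {"_parent": {"type": "author"}}}


def child_index_path(post_id, author_id):
    """URL path for indexing a post as a child of an author document."""
    return "/{}/post/{}?parent={}".format(
        INDEX, quote(post_id, safe=""), quote(author_id, safe="")
    )


def top_children_query(text, score_mode="sum"):
    """Search posts, return their parent authors scored by score_mode."""
    if score_mode not in SCORE_MODES:
        raise ValueError("score_mode must be one of {}".format(SCORE_MODES))
    return {
        "query": {
            "top_children": {
                "type": "post",
                "query": {"query_string": {"query": text}},
                "score": score_mode,
            }
        }
    }


body = json.dumps(top_children_query("elasticsearch NRT"), indent=2)
path = child_index_path("p1", "a 1")
```

Since a post only carries a pointer to its author, a new post or a profile change touches a single document, which avoids the reindexing cost listed under the denormalized alternative.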
  21. Top Children Demo

  22. Other Elasticsearch benefits

    Lucene: don’t have to give up query syntax if you come from Solr
    In-JVM nodes: can use Java API to unit test different permutations of indexing configurations (e.g. different analyzers and tokenizers): great help for testing search on a qualitative basis; allows for embedded ES instances
    Index API and Cluster API: a great deal of cluster and index configuration changes can be made on the fly through curl API calls without restarting the cluster; very convenient for testing and cluster management
    Warmer API: significant help in avoiding search time drops due to segment merges; https://github.com/elasticsearch/elasticsearch/issues/1913
    Percolators: register queries and let the engine tell you which queries match on a given document; great potential for real-time; http://www.elasticsearch.org/guide/reference/api/percolate.html
  23. Q&A