Elasticsearch & PeopleSearch

Presented on 10/11/12 at the Boston Elasticsearch meetup held at the Microsoft New England Research & Development Center. This talk gave a very high-level overview of Elasticsearch to newcomers and explained why ES is a good fit for Traackr's use case.

George P. Stathis

October 11, 2012

Transcript

  1. Elasticsearch & “PeopleSearch”
    Leveraging Elasticsearch @ Traackr

  2. About Traackr
    A search engine
    A people discovery engine
    Subscription-based
    Migrated from Solr to
    Elasticsearch in Q3 ’12

  3. About me
    14+ years of experience building
    full-stack web software systems
    with a past focus on e-commerce
    and publishing
    VP Engineering @ Traackr,
    responsible for building
    engineering capability to enable
    Traackr's growth goals
    about.me/george-stathis

  4. About this talk
    Short intro to Elasticsearch
    How search is done @ Traackr
    Why Elasticsearch was the right fit

  5. About Elasticsearch
    Lucene under the covers
    Distributed from the ground up
    Full support for Lucene Near Real-Time search
    Native JSON Query DSL
    Automatic schema detection (“schema-less”)
    Supports document types

  6. Elasticsearch - Distributed
    Indices broken into shards
    shards have 0 or more replicas
    data nodes hold one or more shards
    data nodes can coordinate/forward
    requests
    automatic routing & rebalancing but
    overrides available
    Default discovery is zen with multicast;
    unicast is available for
    multicast-unfriendly networks. An AWS
    plug-in is available, as is a Zookeeper
    plug-in made possible by Sonian.
    YouTube demo: http://youtu.be/l4ReamjCxHo
    Source: https://confluence.oceanobservatories.org/display/CIDev/Indexing+with+ElasticSearch
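
The shard layout above can be pictured with a toy routing function. This is a minimal sketch, not Elasticsearch's actual implementation: the name `pick_shard`, the CRC32 hash, and the shard count of 5 are assumptions made for the example. The idea it illustrates is that a document is routed to one shard of the index from its ID by default, and that a custom routing value can override this.

```python
import zlib

NUM_SHARDS = 5  # assumed number of primary shards for the example


def pick_shard(doc_id, routing=None, num_shards=NUM_SHARDS):
    """Map a document to a shard number.

    Uses the routing value when given (the "override"), else the doc ID.
    CRC32 is a stand-in hash chosen for this sketch.
    """
    key = routing if routing is not None else doc_id
    return zlib.crc32(key.encode("utf-8")) % num_shards


# Posts routed by their author's ID all land on the same shard.
shards = {pick_shard(post_id, routing="author-42") for post_id in ("p1", "p2", "p3")}
```

Routing related documents by a shared key, as in the last line, keeps them together on one shard.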

  7. Elasticsearch - NRT
    Uses Lucene’s IndexReader.open(IndexWriter
    writer, boolean applyAllDeletes)
    Opens a near real time IndexReader from the
    IndexWriter
    By default, Elasticsearch refreshes the reader and makes
    new updates available for search every second
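
The near real-time behaviour can be illustrated with a toy model. This is a sketch under simplifying assumptions, not how Lucene or Elasticsearch are implemented: the class `ToyNrtIndex` and its methods are hypothetical, and matching is a plain substring test. It only shows that newly indexed documents become searchable once the reader is refreshed.

```python
class ToyNrtIndex:
    """Toy model of near real-time search: writes become visible on refresh."""

    def __init__(self):
        self._buffer = []      # indexed but not yet searchable
        self._searchable = []  # visible to searches

    def index(self, doc):
        self._buffer.append(doc)

    def refresh(self):
        """Reopen the 'reader': expose everything written so far."""
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, term):
        # Substring match stands in for a real query.
        return [d for d in self._searchable if term in d]


idx = ToyNrtIndex()
idx.index("elasticsearch nrt search")
before = idx.search("nrt")  # nothing is visible until the next refresh
idx.refresh()               # Elasticsearch triggers this periodically
after = idx.search("nrt")
```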

  8. Elasticsearch - JSON DSL
    # Query String
    curl 'localhost:9200/test/_search?pretty=1' -d '{
      "query" : {
        "query_string" : {
          "query" : "tags:scala"
        }
      }
    }'

    # Range
    curl 'localhost:9200/test/_search?pretty=1' -d '{
      "query" : {
        "range" : {
          "price" : { "gt" : 15 }
        }
      }
    }'
    Source: https://github.com/kimchy/talks/blob/master/2011/wsnparis/06-search-querydsl.sh

  9. Elasticsearch - JSON DSL (cont)
    # Filtered Query
    # Filters are similar to queries, except they do no scoring
    # and are easily cached.
    # There are many filter types as well, including range and term
    curl 'localhost:9200/test/_search?pretty=1' -d '{
      "query" : {
        "filtered" : {
          "query" : {
            "query_string" : {
              "query" : "tags:scala"
            }
          },
          "filter" : {
            "range" : {
              "price" : { "gt" : 15 }
            }
          }
        }
      }
    }'
    Source: https://github.com/kimchy/talks/blob/master/2011/wsnparis/06-search-querydsl.sh
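
Because the DSL is plain JSON, request bodies like the one above are easy to build programmatically. The sketch below assembles the same filtered query as a Python dictionary and serializes it. The helper name `filtered_query` and its parameters are hypothetical, and no request is actually sent.

```python
import json


def filtered_query(query_text, field, greater_than):
    """Build a 'filtered' query body: a scored query plus an unscored range filter."""
    return {
        "query": {
            "filtered": {
                "query": {"query_string": {"query": query_text}},
                "filter": {"range": {field: {"gt": greater_than}}},
            }
        }
    }


# Same body as the curl example: tags:scala, filtered to price > 15.
body = json.dumps(filtered_query("tags:scala", "price", 15), indent=2)
print(body)
```

The resulting string is what would be passed to `-d` in the curl call.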

  10. Elasticsearch - Schema
    Dynamic object mapping with intelligent defaults
    Can be turned off
    Can be overridden globally or on a per-index basis:
    {
      "_default_" : {
        "date_formats" : ["yyyy-MM-dd", "dd-MM-yyyy", "date_optional_time"]
      }
    }
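
To make the "intelligent defaults" concrete, here is a rough sketch of how a type might be guessed for each field of an incoming JSON document. It is an illustration only, not Elasticsearch's detection code: the function `detect_type` is hypothetical, the `strptime` patterns are assumed equivalents of the slide's `yyyy-MM-dd` and `dd-MM-yyyy`, and `date_optional_time` is left out.

```python
from datetime import datetime

# Assumed strptime equivalents of the slide's "yyyy-MM-dd" and "dd-MM-yyyy".
DATE_FORMATS = ["%Y-%m-%d", "%d-%m-%Y"]


def detect_type(value):
    """Guess a field type for a JSON value, roughly as dynamic mapping would."""
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        for fmt in DATE_FORMATS:
            try:
                datetime.strptime(value, fmt)
                return "date"
            except ValueError:
                continue
        return "string"
    return "object"


doc = {"title": "Refresh API", "published": "11-10-2012", "price": 15}
mapping = {field: detect_type(value) for field, value in doc.items()}
```

A string is treated as a date only when it parses with one of the configured formats, which is why overriding the format list changes what gets detected.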

  11. Elasticsearch Demo

  12. Search @ Traackr
    Answering authors by searching posts

  13. Traackr search requirements
    Posts are coming in at about 1 million a day
    Each author averages several hundred posts
    Posts need to be available for search immediately
    Relevance and sorting have to be rolled up/grouped at
    the author level

  14. Early approach to search
    search posts
    group matched posts by author
    for each grouped set, add up the
    Lucene scores of the posts
    combine sum of post scores with
    author social and website metrics
    for final group score
    sort groups (i.e. authors)
    try to do this quickly!
    Performance hit
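
The steps above can be sketched as application-layer code. This is a minimal illustration, not Traackr's implementation: the function `rank_authors`, the weighted-sum formula for combining post scores with author metrics, and the sample numbers are all assumptions.

```python
from collections import defaultdict


def rank_authors(post_hits, author_metrics, metric_weight=1.0):
    """Roll post hits up to authors and sort authors by a combined score.

    post_hits: iterable of (author_id, lucene_score) for matched posts.
    author_metrics: author_id -> social/website metric (hypothetical scale).
    """
    summed = defaultdict(float)
    for author_id, score in post_hits:
        summed[author_id] += score  # add up the Lucene scores per author

    combined = {
        author_id: total + metric_weight * author_metrics.get(author_id, 0.0)
        for author_id, total in summed.items()
    }
    # Highest combined score first.
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


hits = [("a1", 1.5), ("a2", 3.0), ("a1", 2.0)]
ranking = rank_authors(hits, {"a1": 0.5, "a2": 0.25})
```

Every matched post has to pass through the application before authors can be sorted, which is where the performance hit comes from.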

  15. Room for improvement
    How can we avoid the “late binding” performance
    penalty?
    Get the search engine to do as much of the scoring
    as possible
    Store all data needed for displaying results in the
    search engine (i.e. no db calls)

  16. Alternatives - Denormalize?
    Index authors and their posts together
    under one document.
    Pros
    straightforward
    built-in post relevance sum
    Cons
    each profile change would trigger the
    reindexing of all the author’s posts
    each new post would trigger the
    reindexing of all the author’s posts +
    profile
    a non-starter for real-time search

  17. Alternatives - Solr Join?
    “In many cases, documents have relationships between them and it is too expensive to denormalize
    them. Thus, a join operation is needed. Preserving the document relationship allows documents to
    be updated independently without having to reindex large numbers of denormalized documents.” -
    http://wiki.apache.org/solr/Join
    E.g. find all post docs matching "search engines", then join them against author docs and return
    that list of authors:
    ...?q={!join+from=author_id+to=id}search+engines
    Pros
    addresses the issue of loading author profiles from db
    Cons
    Does not preserve the post relevance scores -> non-starter
    Submit patch to get scores? Wouldn’t touch SOLR-2272 with a ten-foot pole.
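
The join request above uses Solr's local-params syntax inside the `q` parameter, so it has to be URL-encoded when sent. The sketch below builds that query string with the standard library. The helper name `join_query` is hypothetical, and nothing is sent to a server.

```python
from urllib.parse import urlencode, parse_qs


def join_query(from_field, to_field, text):
    """Build the URL-encoded q parameter for a Solr join on two fields."""
    q = "{!join from=%s to=%s}%s" % (from_field, to_field, text)
    return urlencode({"q": q})


# Join post docs (author_id) against author docs (id) for "search engines".
qs = join_query("author_id", "id", "search engines")
decoded = parse_qs(qs)["q"][0]
```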

  18. Alternatives - Solr Grouping?
    Groups results by a given document field (e.g. author_id)
    http://wiki.apache.org/solr/FieldCollapsing
    ...&q=real+time+search&group=true&group.field=author_id
    [...]
    "grouped":{
      "author_id":{
        "matches":2,
        "groups":[{
            "groupValue":"04e3bc5078344ad1a065815f0bb9f14d",
            "doclist":{"maxScore":3.456747,"numFound":1,"start":0,"docs":[
              {
                "id":"5d09240934eb331bada1ff3f0b773153",
                "title":"Refresh API",
                "url":"http://www.elasticsearch.org/guide/reference/api/admin-indices-refresh.html",
                "author_id":"04e3bc5078344ad1a065815f0bb9f14d"}]
            }},
          {
            "groupValue":"9e4f40e1aa82f2e1a9368748d1268082",
            "doclist":{"maxScore":2.456747,"numFound":2,"start":0,"docs":[
              {
                "id":"831ce82bdff34abeb495f260bc7d67d2",
                "title":"Realtime Search: Solr vs Elasticsearch",
                "url":"http://blog.socialcast.com/realtime-search-solr-vs-elasticsearch/",
                "author_id":"9e4f40e1aa82f2e1a9368748d1268082"},
              [...]]
            }}]}}
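
A grouped response of this shape can be reduced to one row per author with a few lines of code. The sketch below is illustrative only: the function `summarize_groups` is hypothetical, and the sample is a trimmed-down response in the shape shown on the slide, not real output.

```python
def summarize_groups(response, field):
    """Return (group value, max score, hit count) per group, best group first."""
    groups = response["grouped"][field]["groups"]
    rows = [
        (g["groupValue"], g["doclist"]["maxScore"], g["doclist"]["numFound"])
        for g in groups
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Trimmed-down sample in the shape of a grouped Solr response.
sample = {
    "grouped": {
        "author_id": {
            "matches": 3,
            "groups": [
                {"groupValue": "9e4f40e1aa82f2e1a9368748d1268082",
                 "doclist": {"maxScore": 2.456747, "numFound": 2, "docs": []}},
                {"groupValue": "04e3bc5078344ad1a065815f0bb9f14d",
                 "doclist": {"maxScore": 3.456747, "numFound": 1, "docs": []}},
            ],
        }
    }
}
rows = summarize_groups(sample, "author_id")
```

Each group carries only its maximum document score, so a sum of post scores per author is not available from this response without extra work.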

  19. Alternatives - Solr Grouping?
    Pros
    Faster than doing grouping at the app layer: no
    need for post counting
    Possible to sort groups by sum of post relevance
    scores inside the engine (with some custom work)
    Cons
    No concept of author; author profiles still need to
    be fetched from db, so still suffers from some
    performance penalty
    Submit patch for group sort options? Not a lot of
    interest in sorting groups by anything other than
    max score.
    Don’t want to be stuck maintaining custom
    Solr code (been there, done that with HBase:
    http://www.slideshare.net/gstathis/finding-the-right-nosql-db-for-the-job-the-path-to-a-nonrdbms-solution-at-traackr )



  20. Alternatives - Elasticsearch!
    Supports document types and parent/child document mappings:
    http://www.elasticsearch.org/guide/reference/mapping/parent-field.html
    {
      "post" : {
        "_parent" : {
          "type" : "author"
        }
      }
    }
    Out-of-the-box support for querying child documents and obtaining their parents:
    http://www.elasticsearch.org/guide/reference/query-dsl/top-children-query.html
    Con: memory heavy
    Parent documents can be sorted by sum/avg/max of child document scores.
    curl 'localhost:9200/traackr/_search?pretty=1' -d '{
      "query": {
        "top_children": {
          "type": "post",
          "query": {
            "query_string": {
              "query": "elasticsearch NRT"
            }
          },
          "score": "sum"
        }
      }
    }'
    Can order parent results by sum of child scores!
    Big win
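
The effect of the `score` setting can be shown with a toy roll-up of child hits to their parents. This is a sketch of the idea only, not the engine's implementation: the function `top_parents`, the `SCORE_MODES` table, and the sample scores are assumptions made for the example.

```python
from collections import defaultdict

# The three roll-up modes named on the slide.
SCORE_MODES = {
    "sum": sum,
    "max": max,
    "avg": lambda scores: sum(scores) / len(scores),
}


def top_parents(child_hits, score_mode="sum"):
    """Roll child hits up to parents, scoring each parent by sum/avg/max.

    child_hits: iterable of (parent_id, child_score) pairs.
    """
    by_parent = defaultdict(list)
    for parent_id, score in child_hits:
        by_parent[parent_id].append(score)
    combine = SCORE_MODES[score_mode]
    scored = [(parent_id, combine(scores)) for parent_id, scores in by_parent.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


post_hits = [("author-1", 1.0), ("author-2", 2.5), ("author-1", 2.0)]
by_sum = top_parents(post_hits, "sum")
by_max = top_parents(post_hits, "max")
```

With `sum`, an author with several moderately relevant posts can outrank an author with one strong post. With `max`, the order flips.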

  21. Top Children Demo

  22. Other Elasticsearch benefits
    Lucene: don’t have to give up query syntax if you come from Solr
    In-JVM nodes: can use Java API to unit test different permutations of indexing
    configurations (e.g. different analyzers and tokenizers): great help for testing search
    on a qualitative basis; allows for embedded ES instances
    Index API and Cluster API: a great deal of cluster and index configuration changes
    can be made on the fly through curl API calls without restarting the cluster; very
    convenient for testing and cluster management
    Warmer API: significant help in avoiding search time drops due to segment merges;
    https://github.com/elasticsearch/elasticsearch/issues/1913
    Percolators: register queries and let the engine tell you which queries match on a
    given document; great potential for real-time;
    http://www.elasticsearch.org/guide/reference/api/percolate.html
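
Percolation inverts the usual search flow: queries are stored and documents are matched against them. The toy class below illustrates that idea only. `ToyPercolator` and its methods are hypothetical and unrelated to the real percolate API, and a "query" here is just a set of terms that must all appear in the document.

```python
class ToyPercolator:
    """Toy percolator: stored queries are matched against incoming documents."""

    def __init__(self):
        self._queries = {}  # query name -> set of required terms

    def register(self, name, terms):
        self._queries[name] = {t.lower() for t in terms}

    def percolate(self, text):
        """Return names of registered queries whose terms all occur in text."""
        words = set(text.lower().split())
        return sorted(name for name, terms in self._queries.items() if terms <= words)


p = ToyPercolator()
p.register("es-nrt", ["elasticsearch", "nrt"])
p.register("solr", ["solr"])
matches = p.percolate("Elasticsearch NRT search explained")
```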

  23. Q&A
