
Avoiding The Search Hall Of Shame

Given at PyCon CA 2012. Introduces how to approach search by choosing the right solution & improving on it from there.

daniellindsley

November 10, 2012

Transcript

  1. Avoiding The
    Search Hall Of
    Shame


  2. There are lots of
    ways to do search
    * Well, let me correct that...


  3. There are lots of
    WRONG ways to do
    search


  4. Who?


  5. Who?
    • Daniel Lindsley
    * I was running a small web consultancy, but now I’m looking for a new job


  6. Who?
    • Daniel Lindsley
    • Pythonista by passion, Djangonaut by
    trade
    * I’ve long adored Python
    * I spend a lot of time using Django, but I’m still just writing Python
    * Albeit slightly magical Python...


  7. Who?
    • Daniel Lindsley
    • Pythonista by passion, Djangonaut by
    trade
    • Author of Haystack (& Tastypie)
    + =
    * Haystack is a semi-popular search library for Django
    * This talk isn’t directly about Django, but anyone writing Python in some capacity should benefit


  8. But most of all, I <3
    search.
    * Well, and RESTful APIs, but that’s a different talk altogether...


  9. Picking The Right
    Approach Matters
    A Lot


  10. Considerations
    * Frequency - tens of times a day? hundreds? thousands?
    * Intended - just another feature? frequent? core/indispensable?
    * Amount - few records or many
    * Space - tiny, single field documents? or huge docs with lots of fields & content?
    * CPU - how intensive are the searches?


  11. Considerations
    • Frequency of use
    * Frequency - tens of times a day? hundreds? thousands?
    * Intended - just another feature? frequent? core/indispensable?
    * Amount - few records or many
    * Space - tiny, single field documents? or huge docs with lots of fields & content?
    * CPU - how intensive are the searches?


  12. Considerations
    • Frequency of use
    • Intended use(ages)
    * Frequency - tens of times a day? hundreds? thousands?
    * Intended - just another feature? frequent? core/indispensable?
    * Amount - few records or many
    * Space - tiny, single field documents? or huge docs with lots of fields & content?
    * CPU - how intensive are the searches?


  13. Considerations
    • Frequency of use
    • Intended use(ages)
    • Amount of data
    * Frequency - tens of times a day? hundreds? thousands?
    * Intended - just another feature? frequent? core/indispensable?
    * Amount - few records or many
    * Space - tiny, single field documents? or huge docs with lots of fields & content?
    * CPU - how intensive are the searches?


  14. Considerations
    • Frequency of use
    • Intended use(ages)
    • Amount of data
    • Space to store the data
    * Frequency - tens of times a day? hundreds? thousands?
    * Intended - just another feature? frequent? core/indispensable?
    * Amount - few records or many
    * Space - tiny, single field documents? or huge docs with lots of fields & content?
    * CPU - how intensive are the searches?


  15. Considerations
    • Frequency of use
    • Intended use(ages)
    • Amount of data
    • Space to store the data
    • CPU needed to run the search
    * Frequency - tens of times a day? hundreds? thousands?
    * Intended - just another feature? frequent? core/indispensable?
    * Amount - few records or many
    * Space - tiny, single field documents? or huge docs with lots of fields & content?
    * CPU - how intensive are the searches?


  16. Assumptions


  17. Assumptions


  18. Assumptions
    • You know Python


  19. Assumptions
    • You know Python
    • You know a little about SQL & I/O


  20. Assumptions
    • You know Python
    • You know a little about SQL & I/O
    • You understand words like "tokenize"


  21. Assumptions
    • You know Python
    • You know a little about SQL & I/O
    • You understand words like "tokenize"
    • You want to make things better


  22. WARNING:
    Hurtful words
    follow.
    * Don't take it personally
    * We've all been there (except GvR)
    * Sometimes the crappy way is A-OK
    * We’re going to start with the most wrong & improve from there.


  23. The MOST wrong
    of wrong ways
    * Steel yourselves. The code you are about to see is the worst of this talk...


  24. for the_file in our_files:
        the_data = open(the_file).read()
        if query in the_data:
            # ...
    Why is this wrong?
    * Python is SLOW (compared to other things)
    * So very RAM/CPU inefficient
    * I/O wait
    * Worst way to look for the actual text


  25. Slightly less wrong
    but STILL really
    wrong


  26. SELECT *
    FROM mytable
    WHERE big_text_blob ILIKE '%banana%';
    Why is this wrong?
    * Really, really inefficient
    * Sequence scan (READ ALL THE ROWS)
    * No index will help you here
    * ``ILIKE`` hurts
    * Double wildcard hurts


  27. El Wrong-o


  28. import subprocess

    command = 'grep "{0}" our_files/*'.format(query)
    subprocess.check_output(command, shell=True)
    Why is this wrong?
    * Still having to read in all the data into RAM
    * Grep is smart enough to stream chunks of the data off the disk to not consume all the data
    at once
    * Shelling out hurts


  29. What are the
    commonalities?


  30. What are the
    commonalities?
    * Reading everything into RAM is bad because that’s a lot of unused data
    * Substrings suck because you have to search the whole thing, perhaps character by character
    * Running through everything on every query is tons of slow I/O


  31. What are the
    commonalities?
    • Reading everything into RAM
    * Reading everything into RAM is bad because that’s a lot of unused data
    * Substrings suck because you have to search the whole thing, perhaps character by character
    * Running through everything on every query is tons of slow I/O


  32. What are the
    commonalities?
    • Reading everything into RAM
    • Manually looking for a substring
    * Reading everything into RAM is bad because that’s a lot of unused data
    * Substrings suck because you have to search the whole thing, perhaps character by character
    * Running through everything on every query is tons of slow I/O


  33. What are the
    commonalities?
    • Reading everything into RAM
    • Manually looking for a substring
    • Happens for every query
    * Reading everything into RAM is bad because that’s a lot of unused data
    * Substrings suck because you have to search the whole thing, perhaps character by character
    * Running through everything on every query is tons of slow I/O


  34. We need a better
    approach.


  35. Inverted Indexes
    to the rescue!
    But what is an inverted index?


  36. doc1 = "Hello world"
    doc2 = "World travelers welcome!"

    index = {
        "hello": ['doc1'],
        "world": ['doc1', 'doc2'],
        "travel": ['doc2'],
        "welcome": ['doc2'],
        # ...
    }
    * Think a Python dictionary
    * Split your documents on whitespace to tokenize up the content
    * Talk about stemming, stop words, etc.
    * Keys in the dictionary are the (now-unique) words
    * Values in the dictionary are document ids
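    * Roughly how such an index gets built, as a plain-Python sketch (stop words handled, stemming left out):

    STOP_WORDS = set(['a', 'an', 'and', 'or', 'the'])

    def tokenize(text):
        # Lowercase, strip punctuation, drop stop words.
        # A real engine would also stem ("travelers" -> "travel"); skipped here.
        words = [w.strip('!?,.') for w in text.lower().split()]
        return [w for w in words if w and w not in STOP_WORDS]

    index = {}
    for doc_id, text in [('doc1', "Hello world"),
                         ('doc2', "World travelers welcome!")]:
        for word in tokenize(text):
            index.setdefault(word, set()).add(doc_id)

    # index == {'hello': set(['doc1']), 'world': set(['doc1', 'doc2']),
    #           'travelers': set(['doc2']), 'welcome': set(['doc2'])}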


  37. Better but still
    inefficient
    Let’s start looking at things that don’t hurt when searching.


  38. # At startup...
    index = {}
    for the_file in our_files:
        words = set(open(the_file).read().split())
        for word in words:
            index.setdefault(word, set())
            index[word].add(the_file)

    # Later...
    return index.get(query, set())
    Why is this inefficient?
    * RAM-hungry
    * Ugly global state
    * Lose the whole index between restarts
    * Only exact word matching
    * No complex queries


  39. Ever so slightly
    better


  40. $ pip install microsearch
    * Written as a teaching tool
    * Covers the fundamentals
    * You should read the source
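    * Usage looks roughly like this (from memory of the project README; check the source for the exact API):

    from microsearch import Microsearch

    ms = Microsearch('/tmp/microsearch')  # the index lives on disk, so it survives restarts
    ms.index('doc1', {'text': u'Hello world'})
    ms.index('doc2', {'text': u'World travelers welcome!'})
    results = ms.search(u'world')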


  41. microsearch
    Why?
    • Uses less RAM
    • Persistent index
    • Better word/query
    facilities
    Why Not?
    • Still inefficient
    • Slow
    • I/O heavy (but better
    than before)


  42. Good!
    Moving on up in the world to something even better.


  43. $ pip install whoosh
    * Roughly a Python port of Lucene
    * Written by a CANADIAN (Matt Chaput)


  44. # Indexing
    from whoosh.index import create_in
    from whoosh.fields import *

    schema = Schema(content=TEXT)
    ix = create_in("indexdir", schema)
    writer = ix.writer()
    writer.add_document(content=u"This is the first document we've added!")
    writer.add_document(content=u"The second one is even more interesting!")
    writer.commit()


  45. # Searching
    from whoosh.qparser import QueryParser

    with ix.searcher() as searcher:
        query = QueryParser("content", ix.schema).parse("first")
        results = searcher.search(query)
        results[0]


  46. Why?
    • Far superior querying
    • Faster
    • More awesome features
    Why Not?
    • Still slow
    • Still Python


  47. Pretty Good!


  48. Why?
    • Decent Python bindings
    • Fast
    • Good featureset
    • Good query support
    Why Not?
    • Requires filesystem
    access
    • Expects a structured
    query object
    * Filesystem access means distributed setups hurt
    * Replication of the index between servers
    * Things falling out of sync
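    * This slide doesn’t name the engine, but the notes (filesystem index, structured query object) & the later Xapian mention suggest it’s Xapian. A rough sketch with its Python bindings, assuming an on-disk "indexdir", would look something like:

    import xapian

    # Indexing
    db = xapian.WritableDatabase("indexdir", xapian.DB_CREATE_OR_OPEN)
    term_gen = xapian.TermGenerator()
    term_gen.set_stemmer(xapian.Stem("en"))
    doc = xapian.Document()
    term_gen.set_document(doc)
    term_gen.index_text(u"Hello world")
    doc.set_data(u"Hello world")
    db.add_document(doc)
    db.commit()  # flush() on older versions

    # Searching - note the structured query object
    parser = xapian.QueryParser()
    parser.set_stemmer(xapian.Stem("en"))
    parser.set_stemming_strategy(parser.STEM_SOME)
    enquire = xapian.Enquire(xapian.Database("indexdir"))
    enquire.set_query(parser.parse_query(u"world"))
    matches = enquire.get_mset(0, 10)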


  49. Pretty Freaking
    Good!


  50. $ pip install pysolr


  51. import pysolr

    solr = pysolr.Solr('http://localhost:8983/solr/')
    solr.add([
        {'id': 'doc1', 'text': u'Hello world'},
    ])
    results = solr.search(u"world")


  52. Why?
    • Really fast
    • Really featureful
    • Great query support
    Why Not?
    • The Land of XML
    • Distributed setups hurt
    • JVM noms the RAMs
    • You need a server
    * There’s a good Haystack backend for this!


  53. The Best!


  54. $ pip install pyelasticsearch
    * Similar to pysolr, there’s a thin wrapper library for Elasticsearch


  55. from pyelasticsearch import ElasticSearch

    conn = ElasticSearch('http://localhost:9200/')
    conn.index(
        {'id': 'doc1', 'text': u'Hello world'},
        "pycon",
        "examples",
        1
    )
    results = conn.search(u"world")


  56. Why?
    • Really fast
    • Really featureful
    • Great query support
    • Awesome ops story
    • NO XML!
    Why Not?
    • JVM noms the RAMs
    • Many servers are best
    * I’d be remiss not to mention this also has a Haystack backend


  57. Regardless of
    engine, here are
    some good practices.


  58. Good Practices
    • Denormalize!
    * This isn’t a RDBMS. There are no joins. Don’t be afraid to denorm on extra data that makes
    querying easier.
    * Use the classic BlogPost/Comments example
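    * e.g. for the BlogPost/Comments case, index one flattened document per post. A sketch (assuming Django-ish models & the pysolr connection from earlier; field names are illustrative):

    def make_search_doc(post):
        # One flat document per blog post, with the comment text folded in,
        # so no join is needed at query time.
        return {
            'id': 'post-{0}'.format(post.pk),
            'title': post.title,
            'text': u'\n'.join([post.body] + [c.body for c in post.comments.all()]),
            'author': post.author.username,
            'created': post.created,
        }

    # Then: solr.add([make_search_doc(p) for p in BlogPost.objects.all()])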


  59. Good Practices
    • Denormalize!
    • Add on fields for narrowing
    * Putting more metadata on the records can help when running the searches
    * Examples: author information, publish dates, comment/view counts, etc.


  60. Good Practices
    • Denormalize!
    • Add on fields for narrowing
    • Good search needs good content
    * Garbage in, garbage out
    * Clean data makes for a happy engine
    * Strip out HTML, unnecessary data, meaningless numbers, etc.


  61. Good Practices
    • Denormalize!
    • Add on fields for narrowing
    • Good search needs good content
    • Feed the beast
    * Give the engine (and/or the process) plenty of RAM
    * Caching can give some huge wins


  62. Good Practices
    • Denormalize!
    • Add on fields for narrowing
    • Good search needs good content
    • Feed the beast
    • Update documents out of process
    * Updates should happen in a different thread/process/queue
    * Especially in a web environment (don’t block the response waiting on the engine)
    * No real reason to make the user wait, especially on data we already have/can rebuild
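    * A minimal in-process sketch of the idea, reusing the pysolr connection from earlier (a real setup would more likely hand this to Celery/RQ & a proper broker):

    import threading
    import Queue  # ``queue`` on Python 3

    updates = Queue.Queue()

    def index_worker():
        while True:
            doc = updates.get()
            solr.add([doc])  # the slow part now happens off the request path
            updates.task_done()

    worker = threading.Thread(target=index_worker)
    worker.daemon = True
    worker.start()

    # In the view/request code, this returns immediately:
    updates.put({'id': 'doc1', 'text': u'Hello world'})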


  63. So...


  64. Where do we
    stand?
    * Except for the JVM. Jerk.


  65. Where do we
    stand?
    • Our queries should now be fast &
    featureful


  66. Where do we
    stand?
    • Our queries should now be fast &
    featureful
    • We should be sipping RAM, not gulping
    * Except for the JVM. Jerk.


  67. Where do we
    stand?
    • Our queries should now be fast &
    featureful
    • We should be sipping RAM, not gulping
    • I/O should be tons better
    * No longer reading in *EVERYTHING* on *EVERY* query.


  68. Where do we
    stand?
    • Our queries should now be fast &
    featureful
    • We should be sipping RAM, not gulping
    • I/O should be tons better
    • We’ve got clean, rich documents being
    indexed


  69. Done.


  70. ...


  71. WRONG!
    * Having technically solid search should be the *beginning*, not the *end*
    * If your UI/UX isn't good, your search isn't good


  72. In the beginning...
    * I’m going to tell you a story.
    * Back to an innocent time
    * When wild searches roamed the earth
    * There was Man & Woman


  73. * Man just returned links, and it was crap
    * No context, no clues where to look
    * Ugh


  74. * Woman brought highlighted text, and it was slightly better
    * More context as to where the query appeared & why the document was considered relevant


  75. * Man saw this & was jealous. He wanted to one-up woman.
    * And so, he introduced advanced query syntax.


  76. * Woman, realizing Man was serious & feeling bad for him, introduced faceting.
    * And Man was in awe.


  77. * Man, not to be one-upped, added spelling suggestions.


  78. * Woman, determined not to be outdone, added autocomplete.


  79. * God, deciding the contest was pointless, hands you the data before you've finished deciding
    what you were going to search for.


  80. Welp, we can’t be
    God...
    But everything else is doable with Whoosh or
    better.
    * So you have no excuse :D
    * For ease, I'm just going to show the pysolr variants
    * But pretty much doable everywhere


  81. Just the results

    import pysolr

    solr = pysolr.Solr('http://localhost:8983/solr/')
    results = solr.search(u"world")

    for result in results:
        print '{1}'.format(
            result['id'],
            result['title']
        )
    * Just the results is pretty easy
    * We could be denorming/storing other data for display here


  82. With Highlights

    kwargs = {
        'hl': 'true',
        'hl.fragsize': '200',
    }
    results = solr.search(u"world", **kwargs)

    for result in results:
        print results.highlighting[result['id']]
    * With highlights, things get a little more interesting
    * Most engines can create some (non-standard) HTML
    * Kinda yuck, but can be liveable


  83. Advanced Queries
    results = solr.search(u"((world OR hello) AND created:[* TO 2012-10-31])")
    * Shown here is the Lucene-style syntax (Whoosh, Solr, ES)
    * Xapian has similar facilities, but you go about it in a very different way
    * Can express many & varied queries here


  84. Faceting

    kwargs = {
        'facet': 'on',
        'facet.field': ['author', 'category'],
    }
    results = solr.search(u"world", **kwargs)
    # ...
    print results.facets['facet_fields']['author']
    Caveats:
    * You need to be storing additional fields
    * You need to be storing exact (non-stemmed/post-processed) data in those fields
    * This kills searching in those fields, so you may need to duplicate


  85. Spelling Suggestions

    kwargs = {
        'spellcheck': 'true',
        'spellcheck.dictionary': 'suggest',
        'spellcheck.onlyMorePopular': 'true',
        'spellcheck.count': 5,
        'spellcheck.collate': 'true',
    }
    results = solr.search(u"werld", **kwargs)
    # ...
    print results.spellcheck['suggestions']
    Some caveats:
    * Requires additional configuration (see the Solr wiki)
    * ES doesn’t support suggestions


  86. Autocomplete
    results = solr.search(u"completeme:wor")
    Caveats:
    * You need a new (edge) n-gram field called “completeme”
    * It’ll store lots of data as it passes a “window” over the content, which become the new terms
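    * At index time you just copy whatever you want completed into that field; the edge n-gram analysis itself is schema config & not shown here:

    solr.add([{
        'id': 'doc1',
        'title': u'World travelers welcome!',
        # The schema's edge n-gram analyzer expands this into "w", "wo", "wor", ...
        'completeme': u'World travelers welcome!',
    }])
    results = solr.search(u"completeme:wor")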


  87. OH NOES, TEH
    djangoes!


  88. Obligatory
    Haystack Example!
    # For the Djangonauts...
    from datetime import date
    from haystack.query import SearchQuerySet, SQ

    sqs = (SearchQuerySet()
           .auto_query("bananas")
           .filter(SQ(text='hello') | SQ(text="world"))
           .filter(created__lte=date(2012, 10, 31))
           .facet('author', 'category')
           .filter(completeme="wor"))
    suggestions = sqs.spelling_suggestion("werld")
    Because Djangoes!


  89. Where are we now?


  90. Where are we now?
    • Our search is fast & efficient


  91. Where are we now?
    • Our search is fast & efficient
    • We have modern search functionality


  92. Where are we now?
    • Our search is fast & efficient
    • We have modern search functionality
    • Hopefully we have happy users


  93. But what about
    THE FUTURE?!!


  94. (image-only slide)

  95. The future of
    search (IMO)


  96. The future of
    search (IMO)
    • Related content suggestions
    * More Like This
    * Talk about how MLT works
    * Similar content wins in terms of interest/page views/etc.
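    * With pysolr (& Solr's MoreLikeThis handler enabled), it's roughly:

    # Documents whose 'text' field resembles doc1's.
    similar = solr.more_like_this(q=u'id:doc1', mltfl='text')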


  97. The future of
    search (IMO)
    • Related content suggestions
    • Geospatial search
    * Teaching computers about the world we live in
    * Both Solr & Elasticsearch have this right now & it’s easy to implement


  98. The future of
    search (IMO)
    • Related content suggestions
    • Geospatial search
    • Specialization
    * Searching within different media types


  99. The future of
    search (IMO)
    • Related content suggestions
    • Geospatial search
    • Specialization
    • Contextual search
    * Something the engines don’t specifically have but you can easily build
    * Gives the user the ability to search within a given “silo”
    * Even better, start to narrow it based on where they already are
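    * A cheap way to build it: store a "section"-style field on every document, then narrow with a filter query based on where the user already is (the field name here is illustrative):

    # The user is already browsing the docs section, so only search there.
    kwargs = {
        'fq': 'section:docs',
    }
    results = solr.search(u"world", **kwargs)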


  100. The future of
    search (IMO)
    • Related content suggestions
    • Geospatial search
    • Specialization
    • Contextual search
    • Real-time
    * The fresher the results, the better
    * Be careful not to swamp yourself
    * Sometimes you can fake it (within 1 minute, within 5 minutes, etc)


  101. The future of
    search (IMO)
    • Related content suggestions
    • Geospatial search
    • Specialization
    • Contextual search
    • Real-time
    • New ways to present search results
    * Everyone is used to Google-style results (though they’re getting richer)
    * Provide more context
    * Make them appear in new places


  102. Thank you!
    Now ask me questions, dammit
    @daniellindsley
    [email protected]
    https://github.com/toastdriven


  103. Photo Credits
    • http://www.flickr.com/photos/olivireland/2402838557
    • http://www.flickr.com/photos/kioan/3260355830
