
Avoiding The Search Hall Of Shame


Given at PyCon CA 2012. Introduces how to approach search, from choosing the right solution to improving what you already have.

daniellindsley

November 10, 2012

Transcript

  1. Who? • Daniel Lindsley * I was running a small

    web consultancy, but now I’m looking for a new job
  2. Who? • Daniel Lindsley • Pythonista by passion, Djangonaut by

    trade * I’ve long adored Python * I spend a lot of time using Django, but I’m still just writing Python * Albeit slightly magical Python...
  3. Who? • Daniel Lindsley • Pythonista by passion, Djangonaut by

    trade • Author of Haystack (& Tastypie) * Haystack is a semi-popular search library for Django * This talk isn’t directly about Django, but everyone writing Python code in some capacity should benefit
  4. But most of all, I <3 search. * Well, and

    RESTful APIs, but that’s a different talk altogether...
  5. Considerations * Frequency - tens of times a day? hundreds?

    thousands? * Intended - just another feature? frequent? core/indispensable? * Amount - few records or many * Space - tiny, single field documents? or huge docs with lots of fields & content? * CPU - how intensive are the searches?
  6. Considerations • Frequency of use * Frequency - tens of

    times a day? hundreds? thousands? * Intended - just another feature? frequent? core/indispensable? * Amount - few records or many * Space - tiny, single field documents? or huge docs with lots of fields & content? * CPU - how intensive are the searches?
  7. Considerations • Frequency of use • Intended use(ages) * Frequency

    - tens of times a day? hundreds? thousands? * Intended - just another feature? frequent? core/indispensable? * Amount - few records or many * Space - tiny, single field documents? or huge docs with lots of fields & content? * CPU - how intensive are the searches?
  8. Considerations • Frequency of use • Intended use(ages) • Amount

    of data * Frequency - tens of times a day? hundreds? thousands? * Intended - just another feature? frequent? core/indispensable? * Amount - few records or many * Space - tiny, single field documents? or huge docs with lots of fields & content? * CPU - how intensive are the searches?
  9. Considerations • Frequency of use • Intended use(ages) • Amount

    of data • Space to store the data * Frequency - tens of times a day? hundreds? thousands? * Intended - just another feature? frequent? core/indispensable? * Amount - few records or many * Space - tiny, single field documents? or huge docs with lots of fields & content? * CPU - how intensive are the searches?
  10. Considerations • Frequency of use • Intended use(ages) • Amount

    of data • Space to store the data • CPU needed to run the search * Frequency - tens of times a day? hundreds? thousands? * Intended - just another feature? frequent? core/indispensable? * Amount - few records or many * Space - tiny, single field documents? or huge docs with lots of fields & content? * CPU - how intensive are the searches?
  11. Assumptions • You know Python • You know a little

    about SQL & I/O • You understand words like "tokenize"
  12. Assumptions • You know Python • You know a little

    about SQL & I/O • You understand words like "tokenize" • You want to make things better
  13. WARNING: Hurtful words follow. * Don't take it personally *

    We've all been there (except GvR) * Sometimes the crappy way is A-OK * We’re going to start with the most wrong & improve from there.
  14. The MOST wrong of wrong ways * Steel yourselves. The

    code you are about to see is the worst of this talk...
  15. for the_file in our_files:
          the_data = open(the_file).read()
          if query in the_data:
              # ...

      Why is this wrong? * Python is SLOW (compared to other things) * So very RAM/CPU inefficient * I/O wait * Worst way to look for the actual text
  16. SELECT * FROM mytable WHERE big_text_blob ILIKE '%banana%';

      Why is this wrong? * Really, really inefficient * Sequence scan (READ ALL THE ROWS) * No index will help you here * ``ILIKE`` hurts * Double wildcard hurts
  17. import subprocess

      command = 'grep "{0}" our_files/*'.format(query)
      subprocess.check_output(command, shell=True)

      Why is this wrong? * Still reads through all the data on every query * Grep is smart enough to stream chunks off the disk rather than consuming it all at once * Shelling out hurts
  18. What are the commonalities? * Reading everything into RAM is

    bad because that’s a lot of unused data * Substrings suck because you have to search the whole thing, perhaps character by character * Running through everything on every query is tons of slow I/O
  19. What are the commonalities? • Reading everything into RAM *

    Reading everything into RAM is bad because that’s a lot of unused data * Substrings suck because you have to search the whole thing, perhaps character by character * Running through everything on every query is tons of slow I/O
  20. What are the commonalities? • Reading everything into RAM •

    Manually looking for a substring * Reading everything into RAM is bad because that’s a lot of unused data * Substrings suck because you have to search the whole thing, perhaps character by character * Running through everything on every query is tons of slow I/O
  21. What are the commonalities? • Reading everything into RAM •

    Manually looking for a substring • Happens for every query * Reading everything into RAM is bad because that’s a lot of unused data * Substrings suck because you have to search the whole thing, perhaps character by character * Running through everything on every query is tons of slow I/O
  22. doc1 = "Hello world" doc2 = "World travelers welcome!" index

    = { "hello": ['doc1'], "world": ['doc1', 'doc2'], "travel": ['doc2'], "welcome": ['doc2'], # ... } * Think a Python dictionary * Split your documents on whitespace to tokenize up the content * Talk about stemming, stop words, etc. * Keys in the dictionary are the (now-unique) words * Values in the dictionary are document ids
  23. # At startup...
      index = {}
      for the_file in our_files:
          words = set(open(the_file).read().split())
          for word in words:
              index.setdefault(word, set())
              index[word].add(the_file)

      # Later...
      return index.get(query, set())

      Why is this inefficient? * RAM-hungry * Ugly global state * Lose the whole index between restarts * Only exact word matching * No complex queries
  24. $ pip install microsearch * Written as a teaching tool

    * Covers the fundamentals * You should read the source
  25. microsearch Why? • Uses less RAM • Persistent index •

    Better word/query facilities Why Not? • Still inefficient • Slow • I/O heavy (but better than before)
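A minimal usage sketch, assuming the Microsearch class and the index()/search() methods its README documents; verify against the source, which the previous slide tells you to read anyway:

      from microsearch import Microsearch

      # Point the engine at a directory it can persist its index into.
      ms = Microsearch('/tmp/microsearch')

      # Index a couple of tiny documents: an id plus a dict of fields.
      ms.index('doc1', {'text': 'Hello world'})
      ms.index('doc2', {'text': 'World travelers welcome!'})

      # Query it; matching documents come back ranked.
      results = ms.search('world')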
  26. $ pip install whoosh * Roughly a Python port of

    Lucene * Written by a CANADIAN (Matt Chaput)
  27. # Indexing
      from whoosh.index import create_in
      from whoosh.fields import *

      schema = Schema(content=TEXT)
      ix = create_in("indexdir", schema)
      writer = ix.writer()
      writer.add_document(content=u"This is the first document we've added!")
      writer.add_document(content=u"The second one is even more interesting!")
      writer.commit()
  28. # Searching
      from whoosh.qparser import QueryParser

      with ix.searcher() as searcher:
          query = QueryParser("content", ix.schema).parse("first")
          results = searcher.search(query)
          results[0]
  29. Why? • Far superior querying • Faster • More awesome

    features Why Not? • Still slow • Still Python
  30. Why? • Decent Python bindings • Fast • Good featureset

    • Good query support Why Not? • Requires filesystem access • Expects a structured query object * Filesystem access means distributed setups hurt * Replication of the index between servers * Things falling out of sync
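From the notes (filesystem access, structured query objects, decent Python bindings), this slide appears to describe Xapian. A rough sketch of indexing & querying through its Python bindings, based on the standard getting-started examples; treat the details as approximate:

      import xapian

      # An on-disk index: this is the filesystem access mentioned above.
      db = xapian.WritableDatabase("indexdir", xapian.DB_CREATE_OR_OPEN)

      term_gen = xapian.TermGenerator()
      term_gen.set_stemmer(xapian.Stem("en"))

      doc = xapian.Document()
      term_gen.set_document(doc)
      term_gen.index_text(u"Hello world")
      doc.set_data(u"Hello world")
      db.add_document(doc)

      # Queries are structured objects, not bare strings.
      parser = xapian.QueryParser()
      parser.set_stemmer(xapian.Stem("en"))
      query = parser.parse_query("world")

      enquire = xapian.Enquire(db)
      enquire.set_query(query)
      for match in enquire.get_mset(0, 10):
          print match.document.get_data()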
  31. Why? • Really fast • Really featureful • Great query

    support Why Not? • The Land of XML • Distributed setups hurt • JVM noms the RAMs • You need a server * There’s a good Haystack backend for this!
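This slide is describing Solr. The later examples only show querying through pysolr, so here is a rough sketch of getting documents in with it; the core URL and the id/title/text fields are assumptions about your Solr schema:

      import pysolr

      solr = pysolr.Solr('http://localhost:8983/solr/')

      # Documents are plain dicts keyed by field name; the schema decides how they're analyzed.
      solr.add([
          {'id': 'doc1', 'title': 'Hello', 'text': u'Hello world'},
          {'id': 'doc2', 'title': 'Travel', 'text': u'World travelers welcome!'},
      ])

      results = solr.search(u'world')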
  32. from pyelasticsearch import ElasticSearch

      conn = ElasticSearch('http://localhost:9200/')
      conn.index(
          {'id': 'doc1', 'text': u'Hello world'},
          "pycon", "examples", 1
      )
      results = conn.search(u"world")
  33. Why? • Really fast • Really featureful • Great query

    support • Awesome ops story • NO XML! Why Not? • JVM noms the RAMs • Many servers are best * I’d be remiss not to mention this also has a Haystack backend
  34. Good Practices • Denormalize! * This isn’t a RDBMS. There

    are no joins. Don’t be afraid to denorm on extra data that makes querying easier. * Use the classic BlogPost/Comments example
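A sketch of the BlogPost/Comments example: rather than trying to join at query time, fold the comment text (and any other useful metadata) into the post's search document when you index it. The object attributes and field names here are made up for illustration:

      def prepare_search_doc(post):
          # Denormalize: a match inside a comment should still return the post.
          comment_text = ' '.join(comment.body for comment in post.comments)

          return {
              'id': 'post.{0}'.format(post.pk),
              'title': post.title,
              'text': u'{0}\n{1}'.format(post.body, comment_text),
              # Extra metadata fields make narrowing cheap later on.
              'author': post.author_name,
              'created': post.created,
              'comment_count': len(post.comments),
          }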
  35. Good Practices • Denormalize! • Add on fields for narrowing

    * Putting more metadata on the records can help when running the searches * Examples: author information, publish dates, comment/view counts, etc.
  36. Good Practices • Denormalize! • Add on fields for narrowing

    • Good search needs good content * Garbage in, garbage out * Clean data makes for a happy engine * Strip out HTML, unnecessary data, meaningless numbers, etc.
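One way to do that cleanup before indexing; this sketch leans on lxml to strip the HTML (any HTML-to-text approach will do) and is purely illustrative:

      import re

      from lxml import html

      def clean_for_index(raw_html):
          # Keep only the visible text, then collapse the leftover whitespace.
          text = html.fromstring(raw_html).text_content()
          return re.sub(r'\s+', ' ', text).strip()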
  37. Good Practices • Denormalize! • Add on fields for narrowing

    • Good search needs good content • Feed the beast * Give the engine (and/or the process) plenty of RAM * Caching can give some huge wins
  38. Good Practices • Denormalize! • Add on fields for narrowing

    • Good search needs good content • Feed the beast • Update documents out of process * Updates should happen in a different thread/process/queue * Especially in a web environment (don’t block the response waiting on the engine) * No real reason to make the user wait, especially on data we already have/can rebuild
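A minimal sketch of pushing updates out of the request/response cycle using only the standard library; a real setup would more likely hand this to a task queue (Celery, etc.), and all the names here are hypothetical:

      import threading
      import Queue  # Python 2, to match the rest of the examples

      import pysolr

      solr = pysolr.Solr('http://localhost:8983/solr/')
      to_index = Queue.Queue()

      def index_worker():
          # Runs in the background; the web request only enqueues the document.
          while True:
              doc = to_index.get()
              solr.add([doc])
              to_index.task_done()

      worker = threading.Thread(target=index_worker)
      worker.daemon = True
      worker.start()

      # In the view / save handler: hand the document off & return immediately.
      to_index.put({'id': 'doc42', 'text': u'Fresh content here'})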
  39. Where do we stand? • Our queries should now be

    fast & featureful • We should be sipping RAM, not gulping * Except for the JVM. Jerk.
  40. Where do we stand? • Our queries should now be

    fast & featureful • We should be sipping RAM, not gulping • I/O should be tons better * No longer reading in *EVERYTHING* on *EVERY* query.
  41. Where do we stand? • Our queries should now be

    fast & featureful • We should be sipping RAM, not gulping • I/O should be tons better • We’ve got clean, rich documents being indexed
  42. ...

  43. WRONG! * Having technically solid search should be the *beginning*,

    not the *end* * If your UI/UX isn't good, your search isn't good
  44. In the beginning... * I’m going to tell you a

    story. * Back to an innocent time * When wild searches roamed the earth * There was Man & Woman
  45. * Man just returned links, and it was crap *

    No context, no clues where to look * Ugh
  46. * Woman brought highlighted text, and it was slightly better

    * More context as to where the query appeared & why the document was considered relevant
  47. * Man saw this & was jealous. He wanted to

    one-up woman. * And so, he introduced advanced query syntax.
  48. * Woman, realizing Man was serious & feeling bad for

    him, introduced faceting. * And Man was in awe.
  49. * God, deciding the contest was pointless, hands you the

    data before you've finished deciding what you were going to search for.
  50. Welp, we can’t be God... But everything else is doable

    with Whoosh or better. * So you have no excuse :D * For ease, I'm just going to show the pysolr variants * But pretty much doable everywhere
  51. Just the results

      import pysolr

      solr = pysolr.Solr('http://localhost:8983')
      results = solr.search(u"world")

      for result in results:
          print '<a href="/docs/{0}">{1}</a>'.format(
              result['id'],
              result['title']
          )

      * Just the results is pretty easy * We could be denorming/storing other data for display here
  52. With Highlights

      kwargs = {
          'hl': 'true',
          'hl.fragsize': '200',
      }
      results = solr.search(u"world", **kwargs)

      for result in results:
          print results.highlighting[result['id']]

      * With highlights, things get a little more interesting * Most engines can create some (non-standard) HTML * Kinda yuck, but can be liveable
  53. Advanced Queries results = solr.search(u"((world OR hello) AND created:[* TO

    2012-10-31])") * Shown here is the Lucene-style syntax (Whoosh, Solr, ES) * Xapian has similar faculities, but you go about it a very different way * Can express many & varied queries here
  54. Faceting

      kwargs = {
          'facet': 'on',
          'facet.field': ['author', 'category'],
      }
      results = solr.search(u"world", **kwargs)
      # ...
      print results.facets['facet_fields']['author']

      Caveats: * You need to be storing additional fields * You need to be storing exact (non-stemmed/post-processed) data in those fields * This kills searching in those fields, so you may need to duplicate
  55. Spelling Suggestions

      kwargs = {
          'spellcheck': 'true',
          'spellcheck.dictionary': 'suggest',
          'spellcheck.onlyMorePopular': 'true',
          'spellcheck.count': 5,
          'spellcheck.collate': 'true',
      }
      results = solr.search(u"werld", **kwargs)
      # ...
      print results.spellcheck['suggestions']

      Some caveats: * Requires additional configuration (see the Solr wiki) * ES doesn’t support suggestions
  56. Autocomplete results = solr.search(u"completeme:wor") Caveats: * You need a new

    (edge) n-gram field called “completeme” * It’ll store lots of data as it passes a “window” over the content; the windowed fragments become the new terms
  57. Obligatory Haystack Example!

      # For the Djangonauts...
      from datetime import date
      from haystack.query import SearchQuerySet, SQ

      sqs = (
          SearchQuerySet()
          .auto_query("bananas")
          .filter(SQ(text='hello') | SQ(text="world"))
          .filter(created__lte=date(2012, 10, 31))
          .facet('author', 'category')
          .filter(completeme="wor")
      )
      suggestions = sqs.spelling_suggestion("werld")

      Because Djangoes!
  58. Where are we now? • Our search is fast &

    efficient • We have modern search functionality
  59. Where are we now? • Our search is fast &

    efficient • We have modern search functionality • Hopefully we have happy users
  60. The future of search (IMO) • Related content suggestions *

    More Like This * Talk about how MLT works * Similar content wins in terms of interest/page views/etc.
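pysolr exposes Solr's MoreLikeThis handler directly; a rough sketch, assuming MLT is enabled in your solrconfig.xml and that 'text' is the field to compare on:

      import pysolr

      solr = pysolr.Solr('http://localhost:8983/solr/')

      # Documents similar to doc1, judged by the contents of the 'text' field.
      similar = solr.more_like_this(q='id:doc1', mltfl='text')

      for result in similar:
          print result['id']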
  61. The future of search (IMO) • Related content suggestions •

    Geospatial search * Teaching computers about the world we live in * Both Solr & Elasticsearch have this right now & it’s easy to implement
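On the Solr side, spatial filtering is just a few more query parameters; a sketch assuming your schema has a 'location' field holding "lat,lon" points:

      import pysolr

      solr = pysolr.Solr('http://localhost:8983/solr/')

      # "coffee" within 10 km of downtown Toronto (pt is "lat,lon", d is the radius in km).
      kwargs = {
          'fq': '{!geofilt}',
          'sfield': 'location',
          'pt': '43.65,-79.38',
          'd': '10',
      }
      results = solr.search(u'coffee', **kwargs)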
  62. The future of search (IMO) • Related content suggestions •

    Geospatial search • Specialization * Searching within different media types
  63. The future of search (IMO) • Related content suggestions •

    Geospatial search • Specialization • Contextual search * Something the engines don’t specifically have but you can easily build * Gives the user the ability to search within a given “silo” * Even better, start to narrow it based on where they already are
  64. The future of search (IMO) • Related content suggestions •

    Geospatial search • Specialization • Contextual search • Real-time * The fresher the results, the better * Be careful not to swamp yourself * Sometimes you can fake it (within 1 minute, within 5 minutes, etc)
  65. The future of search (IMO) • Related content suggestions •

    Geospatial search • Specialization • Contextual search • Real-time • New ways to present search results * Everyone is used to Google-style results (though they’re getting richer) * Provide more context * Make them appear in new places