Who? • Daniel Lindsley • Pythonista by passion, Djangonaut by trade * I’ve long adored Python * I spend a lot of time using Django, but I’m still just writing Python * Albeit slightly magical Python...
Who? • Daniel Lindsley • Pythonista by passion, Djangonaut by trade • Author of Haystack (& Tastypie) * Haystack is a semi-popular search library for Django * This talk isn’t directly about Django, but everyone writing Python code in some capacity should benefit
Considerations • Frequency of use • Intended use(ages) • Amount of data • Space to store the data • CPU needed to run the search * Frequency - tens of times a day? hundreds? thousands? * Intended - just another feature? frequent? core/indispensable? * Amount - a few records or many? * Space - tiny, single-field documents? or huge docs with lots of fields & content? * CPU - how intensive are the searches?
WARNING: Hurtful words follow. * Don't take it personally * We've all been there (except GvR) * Sometimes the crappy way is A-OK * We’re going to start with the most wrong & improve from there.
for the_file in our_files:
    the_data = open(the_file).read()

    if query in the_data:
        # ...

Why is this wrong? * Python is SLOW (compared to other things) * So very RAM/CPU inefficient * I/O wait * Worst way to look for the actual text
SELECT * FROM mytable WHERE big_text_blob ILIKE '%banana%'; Why is this wrong? * Really, really inefficient * Sequence scan (READ ALL THE ROWS) * No index will help you here * ``ILIKE`` hurts * Double wildcard hurts
import subprocess

command = 'grep "{0}" our_files/*'.format(query)
subprocess.check_output(command, shell=True)

Why is this wrong? * Still having to read through all the data on every query * Grep is smart enough to stream chunks of the data off the disk rather than consume it all at once * Shelling out hurts
What are the commonalities? • Reading everything into RAM • Manually looking for a substring • Happens for every query * Reading everything into RAM is bad because that’s a lot of unused data * Substrings suck because you have to search the whole thing, perhaps character by character * Running through everything on every query is tons of slow I/O
doc1 = "Hello world" doc2 = "World travelers welcome!" index = { "hello": ['doc1'], "world": ['doc1', 'doc2'], "travel": ['doc2'], "welcome": ['doc2'], # ... } * Think a Python dictionary * Split your documents on whitespace to tokenize up the content * Talk about stemming, stop words, etc. * Keys in the dictionary are the (now-unique) words * Values in the dictionary are document ids
# At startup...
index = {}

for the_file in our_files:
    words = set(open(the_file).read().split())

    for word in words:
        index.setdefault(word, set())
        index[word].add(the_file)

# Later...
return index.get(query, set())

Why is this inefficient? * RAM-hungry * Ugly global state * Lose the whole index between restarts * Only exact word matching * No complex queries
# Indexing
from whoosh.index import create_in
from whoosh.fields import *

schema = Schema(content=TEXT)
ix = create_in("indexdir", schema)
writer = ix.writer()
writer.add_document(content=u"This is the first document we've added!")
writer.add_document(content=u"The second one is even more interesting!")
writer.commit()
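The slide only shows indexing; the query side (not on the slide, but roughly following Whoosh's own quickstart) looks something like this:

# Searching
from whoosh.index import open_dir
from whoosh.qparser import QueryParser

ix = open_dir("indexdir")

with ix.searcher() as searcher:
    query = QueryParser("content", ix.schema).parse(u"first")
    results = searcher.search(query)
    print len(results)  # number of matching documents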
Why? • Decent Python bindings • Fast • Good featureset • Good query support Why Not? • Requires filesystem access • Expects a structured query object * Filesystem access means distributed setups hurt * Replication of the index between servers * Things falling out of sync
Why? • Really fast • Really featureful • Great query support Why Not? • The Land of XML • Distributed setups hurt • JVM noms the RAMs • You need a server * There’s a good Haystack backend for this!
Why? • Really fast • Really featureful • Great query support • Awesome ops story • NO XML! Why Not? • JVM noms the RAMs • Many servers are best * I’d be remiss not to mention this also has a Haystack backend
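The slides don't show any Elasticsearch code; purely as a hedged sketch using the official elasticsearch Python client (not necessarily what the talk used, and the exact keyword arguments vary between client versions), the shape of it is:

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# The "posts" index name is made up for this sketch.
es.index(index="posts", id=1, body={"content": "Hello world"})
es.index(index="posts", id=2, body={"content": "World travelers welcome!"})

results = es.search(index="posts", body={"query": {"match": {"content": "world"}}})

for hit in results["hits"]["hits"]:
    print hit["_source"]["content"]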
Good Practices • Denormalize! * This isn’t a RDBMS. There are no joins. Don’t be afraid to denorm on extra data that makes querying easier. * Use the classic BlogPost/Comments example
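To make the BlogPost/Comments example concrete (not from the slides, field names invented): copy the comment text onto the post's search document, so a single query can match either with no join at search time.

import pysolr

solr = pysolr.Solr('http://localhost:8983/solr')

comments = [
    {"body": "Great post!"},
    {"body": "Very helpful, thanks."},
]

# One flat, denormalized document per post.
post_doc = {
    "id": "post-42",
    "title": "Why search matters",
    "content": "Full body of the post...",
    "author": "daniel",
    "comment_text": " ".join(comment["body"] for comment in comments),
    "comment_count": len(comments),
}

solr.add([post_doc])  # pysolr's add() takes a list of dicts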
Good Practices • Denormalize! • Add on fields for narrowing * Putting more metadata on the records can help when running the searches * Examples: author information, publish dates, comment/view counts, etc.
Good Practices • Denormalize! • Add on fields for narrowing • Good search needs good content * Garbage in, garbage out * Clean data makes for a happy engine * Strip out HTML, unnecessary data, meaningless numbers, etc.
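One hedged way to do that clean-up with just the standard library (a regex tag-strip is good enough for a sketch, not for hostile markup; a real pipeline would use a proper HTML parser):

import re

def clean_for_indexing(raw_html):
    # Strip tags, then collapse runs of whitespace.
    text = re.sub(r'<[^>]+>', ' ', raw_html)
    return ' '.join(text.split())

clean_for_indexing('<p>Hello <b>world</b>!</p>')  # -> 'Hello world !'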
Good Practices • Denormalize! • Add on fields for narrowing • Good search needs good content • Feed the beast * Give the engine (and/or the process) plenty of RAM * Caching can give some huge wins
Good Practices • Denormalize! • Add on fields for narrowing • Good search needs good content • Feed the beast • Update documents out of process * Updates should happen in a different thread/process/queue * Especially in a web environment (don’t block the response waiting on the engine) * No real reason to make the user wait, especially on data we already have/can rebuild
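In practice this is usually a task queue (Celery, RQ, etc.); as a minimal, hedged sketch of the idea with just the standard library and a pysolr connection like the other examples use:

import threading
import Queue  # `queue` on Python 3

import pysolr

solr = pysolr.Solr('http://localhost:8983/solr')
to_index = Queue.Queue()

def index_worker():
    while True:
        doc = to_index.get()
        solr.add([doc])  # the slow part happens off the request/response cycle
        to_index.task_done()

worker = threading.Thread(target=index_worker)
worker.daemon = True
worker.start()

# In the web view: hand the document off & return immediately.
to_index.put({"id": "post-42", "title": "Why search matters"})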
Where do we stand? • Our queries should now be fast & featureful • We should be sipping RAM, not gulping • I/O should be tons better • We’ve got clean, rich documents being indexed * No longer reading in *EVERYTHING* on *EVERY* query.
Welp, we can’t be God... But everything else is doable with Whoosh or better. * So you have no excuse :D * For ease, I'm just going to show the pysolr variants * But pretty much doable everywhere
Just the results

import pysolr

solr = pysolr.Solr('http://localhost:8983/solr')
results = solr.search(u"world")

for result in results:
    print '{0}: {1}'.format(result['id'], result['title'])

* Just the results is pretty easy * We could be denorming/storing other data for display here
With Highlights

kwargs = {
    'hl': 'true',
    'hl.fragsize': '200',
}
results = solr.search(u"world", **kwargs)

for result in results:
    print results.highlighting[result['id']]

* With highlights, things get a little more interesting * Most engines can create some (non-standard) HTML * Kinda yuck, but can be liveable
Advanced Queries

results = solr.search(u"((world OR hello) AND created:[* TO 2012-10-31])")

* Shown here is the Lucene-style syntax (Whoosh, Solr, ES) * Xapian has similar facilities, but you go about it in a very different way * Can express many & varied queries here
Faceting

kwargs = {
    'facet': 'on',
    'facet.field': ['author', 'category'],
}
results = solr.search(u"world", **kwargs)
# ...
print results.facets['facet_fields']['author']

Caveats: * You need to be storing additional fields * You need to be storing exact (non-stemmed/post-processed) data in those fields * This kills searching in those fields, so you may need to duplicate
Autocomplete

results = solr.search(u"completeme:wor")

Caveats: * You need a new (edge) n-gram field called "completeme" * It’ll store lots of data as it passes a "window" over the content, and the windows become the new terms
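To make the "window" idea concrete (not from the slides): edge n-grams of a single term are just its leading slices, each stored as its own term.

def edge_ngrams(term, min_gram=2, max_gram=15):
    # Every leading slice of the term becomes an indexed term of its own.
    return [term[:size] for size in range(min_gram, min(len(term), max_gram) + 1)]

edge_ngrams("world")  # -> ['wo', 'wor', 'worl', 'world']

A query for "wor" then matches literally, which is also why these fields eat so much index space.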
The future of search (IMO) • Related content suggestions * More Like This * Talk about how MLT works * Similar content wins in terms of interest/page views/etc.
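With Solr this goes through the MoreLikeThis handler; pysolr has a thin wrapper for it, roughly like this (assumes the handler is enabled in solrconfig.xml, that "content" is the field to compare on, and reuses the pysolr connection from the earlier examples):

similar = solr.more_like_this(q='id:post-42', mltfl='content')

for result in similar:
    print result['id']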
The future of search (IMO) • Related content suggestions • Geospatial search * Teaching computers about the world we live in * Both Solr & Elasticsearch have this right now & it’s easy to implement
The future of search (IMO) • Related content suggestions • Geospatial search • Specialization • Contextual search * Something the engines don’t specifically have but you can easily build * Gives the user the ability to search within a given “silo” * Even better, start to narrow it based on where they already are
The future of search (IMO) • Related content suggestions • Geospatial search • Specialization • Contextual search • Real-time * The fresher the results, the better * Be careful not to swamp yourself * Sometimes you can fake it (within 1 minute, within 5 minutes, etc)
The future of search (IMO) • Related content suggestions • Geospatial search • Specialization • Contextual search • Real-time • New ways to present search results * Everyone is used to Google-style results (though they’re getting richer) * Provide more context * Make them appear in new places