Who? • Daniel Lindsley • Pythonista by passion, Djangonaut by trade * I’ve long adored Python * I spend a lot of time using Django, but I’m still just writing Python * Albeit slightly magical Python...
Who? • Daniel Lindsley • Pythonista by passion, Djangonaut by trade • Author of Haystack (& Tastypie) * Haystack is a semi-popular search library for Django * This talk isn’t directly about Django, but everyone writing Python code in some capacity should benefit
Considerations • Frequency of use • Intended use(ages) • Amount of data • Space to store the data • CPU needed to run the search * Frequency - tens of times a day? hundreds? thousands? * Intended - just another feature? frequent? core/indispensable? * Amount - a few records or many? * Space - tiny, single-field documents? or huge docs with lots of fields & content? * CPU - how intensive are the searches?
WARNING: Hurtful words follow. * Don't take it personally * We've all been there (except GvR) * Sometimes the crappy way is A-OK * We’re going to start with the most wrong & improve from there.
for the_file in our_files:
    the_data = open(the_file).read()

    if query in the_data:
        # ...

Why is this wrong? * Python is SLOW (compared to other things) * So very RAM/CPU inefficient * I/O wait * Worst way to look for the actual text
SELECT * FROM mytable WHERE big_text_blob ILIKE '%banana%'; Why is this wrong? * Really, really inefficient * Sequence scan (READ ALL THE ROWS) * No index will help you here * ``ILIKE`` hurts * Double wildcard hurts
import subprocess

command = 'grep "{0}" our_files/*'.format(query)
subprocess.check_output(command, shell=True)

Why is this wrong? * Still having to read through all the data on every query * Grep is smart enough to stream chunks of the data off the disk rather than consume it all at once * Shelling out hurts
What are the commonalities? • Reading everything into RAM • Manually looking for a substring • Happens for every query * Reading everything into RAM is bad because that’s a lot of unused data * Substrings suck because you have to search the whole thing, perhaps character by character * Running through everything on every query is tons of slow I/O
doc1 = "Hello world" doc2 = "World travelers welcome!" index = { "hello": ['doc1'], "world": ['doc1', 'doc2'], "travel": ['doc2'], "welcome": ['doc2'], # ... } * Think a Python dictionary * Split your documents on whitespace to tokenize up the content * Talk about stemming, stop words, etc. * Keys in the dictionary are the (now-unique) words * Values in the dictionary are document ids
# At startup...
index = {}

for the_file in our_files:
    words = set(open(the_file).read().split())

    for word in words:
        index.setdefault(word, set())
        index[word].add(the_file)

# Later...
return index.get(query, set())

Why is this inefficient? * RAM-hungry * Ugly global state * Lose the whole index between restarts * Only exact word matching * No complex queries
# Indexing
from whoosh.index import create_in
from whoosh.fields import *

schema = Schema(content=TEXT)
ix = create_in("indexdir", schema)
writer = ix.writer()
writer.add_document(content=u"This is the first document we've added!")
writer.add_document(content=u"The second one is even more interesting!")
writer.commit()
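The slide only shows indexing; the query side (not on the slide, but roughly following Whoosh's own quickstart) looks something like this:

# Searching
from whoosh.index import open_dir
from whoosh.qparser import QueryParser

ix = open_dir("indexdir")

with ix.searcher() as searcher:
    query = QueryParser("content", ix.schema).parse(u"first")
    results = searcher.search(query)
    print len(results)  # number of matching documents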
Why? • Decent Python bindings • Fast • Good featureset • Good query support Why Not? • Requires filesystem access • Expects a structured query object * Filesystem access means distributed setups hurt * Replication of the index between servers * Things falling out of sync
Why? • Really fast • Really featureful • Great query support Why Not? • The Land of XML • Distributed setups hurt • JVM noms the RAMs • You need a server * There’s a good Haystack backend for this!
Why? • Really fast • Really featureful • Great query support • Awesome ops story • NO XML! Why Not? • JVM noms the RAMs • Many servers are best * I’d be remiss not to mention this also has a Haystack backend
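The slides don't show any Elasticsearch code; purely as a hedged sketch using the official elasticsearch Python client (not necessarily what the talk used, and the exact keyword arguments vary between client versions), the shape of it is:

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# The "posts" index name is made up for this sketch.
es.index(index="posts", id=1, body={"content": "Hello world"})
es.index(index="posts", id=2, body={"content": "World travelers welcome!"})

results = es.search(index="posts", body={"query": {"match": {"content": "world"}}})

for hit in results["hits"]["hits"]:
    print hit["_source"]["content"]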
Good Practices • Denormalize! * This isn’t a RDBMS. There are no joins. Don’t be afraid to denorm on extra data that makes querying easier. * Use the classic BlogPost/Comments example
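To make the BlogPost/Comments example concrete (not from the slides, field names invented): copy the comment text onto the post's search document, so a single query can match either with no join at search time.

import pysolr

solr = pysolr.Solr('http://localhost:8983/solr')

comments = [
    {"body": "Great post!"},
    {"body": "Very helpful, thanks."},
]

# One flat, denormalized document per post.
post_doc = {
    "id": "post-42",
    "title": "Why search matters",
    "content": "Full body of the post...",
    "author": "daniel",
    "comment_text": " ".join(comment["body"] for comment in comments),
    "comment_count": len(comments),
}

solr.add([post_doc])  # pysolr's add() takes a list of dicts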
Good Practices • Denormalize! • Add on fields for narrowing * Putting more metadata on the records can help when running the searches * Examples: author information, publish dates, comment/view counts, etc.
Good Practices • Denormalize! • Add on fields for narrowing • Good search needs good content * Garbage in, garbage out * Clean data makes for a happy engine * Strip out HTML, unnecessary data, meaningless numbers, etc.
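One hedged way to do that clean-up with just the standard library (a regex tag-strip is good enough for a sketch, not for hostile markup; a real pipeline would use a proper HTML parser):

import re

def clean_for_indexing(raw_html):
    # Strip tags, then collapse runs of whitespace.
    text = re.sub(r'<[^>]+>', ' ', raw_html)
    return ' '.join(text.split())

clean_for_indexing('<p>Hello <b>world</b>!</p>')  # -> 'Hello world !'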
Good Practices • Denormalize! • Add on fields for narrowing • Good search needs good content • Feed the beast * Give the engine (and/or the process) plenty of RAM * Caching can give some huge wins
Good Practices • Denormalize! • Add on fields for narrowing • Good search needs good content • Feed the beast • Update documents out of process * Updates should happen in a different thread/process/queue * Especially in a web environment (don’t block the response waiting on the engine) * No real reason to make the user wait, especially on data we already have/can rebuild
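In practice this is usually a task queue (Celery, RQ, etc.); as a minimal, hedged sketch of the idea with just the standard library and a pysolr connection like the other examples use:

import threading
import Queue  # `queue` on Python 3

import pysolr

solr = pysolr.Solr('http://localhost:8983/solr')
to_index = Queue.Queue()

def index_worker():
    while True:
        doc = to_index.get()
        solr.add([doc])  # the slow part happens off the request/response cycle
        to_index.task_done()

worker = threading.Thread(target=index_worker)
worker.daemon = True
worker.start()

# In the web view: hand the document off & return immediately.
to_index.put({"id": "post-42", "title": "Why search matters"})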
Where do we stand? • Our queries should now be fast & featureful • We should be sipping RAM, not gulping • I/O should be tons better • We’ve got clean, rich documents being indexed * No longer reading in *EVERYTHING* on *EVERY* query.
Welp, we can’t be God... But everything else is doable with Whoosh or better. * So you have no excuse :D * For ease, I'm just going to show the pysolr variants * But pretty much doable everywhere
Just the results

import pysolr

solr = pysolr.Solr('http://localhost:8983/solr')
results = solr.search(u"world")

for result in results:
    print '{0}: {1}'.format(result['id'], result['title'])

* Just the results is pretty easy * We could be denorming/storing other data for display here
With Highlights

kwargs = {
    'hl': 'true',
    'hl.fragsize': '200',
}
results = solr.search(u"world", **kwargs)

for result in results:
    print results.highlighting[result['id']]

* With highlights, things get a little more interesting * Most engines can create some (non-standard) HTML * Kinda yuck, but can be liveable
Advanced Queries

results = solr.search(u"((world OR hello) AND created:[* TO 2012-10-31])")

* Shown here is the Lucene-style syntax (Whoosh, Solr, ES) * Xapian has similar facilities, but you go about it in a very different way * Can express many & varied queries here
Faceting

kwargs = {
    'facet': 'on',
    'facet.field': ['author', 'category'],
}
results = solr.search(u"world", **kwargs)
# ...
print results.facets['facet_fields']['author']

Caveats: * You need to be storing additional fields * You need to be storing exact (non-stemmed/post-processed) data in those fields * This kills searching in those fields, so you may need to duplicate
Autocomplete

results = solr.search(u"completeme:wor")

Caveats: * You need a new (edge) n-gram field called "completeme" * It’ll store lots of data as it passes a "window" over the content, and the windows become the new terms
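To make the "window" idea concrete (not from the slides): edge n-grams of a single term are just its leading slices, each stored as its own term.

def edge_ngrams(term, min_gram=2, max_gram=15):
    # Every leading slice of the term becomes an indexed term of its own.
    return [term[:size] for size in range(min_gram, min(len(term), max_gram) + 1)]

edge_ngrams("world")  # -> ['wo', 'wor', 'worl', 'world']

A query for "wor" then matches literally, which is also why these fields eat so much index space.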
The future of search (IMO) • Related content suggestions * More Like This * Talk about how MLT works * Similar content wins in terms of interest/page views/etc.
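With Solr this goes through the MoreLikeThis handler; pysolr has a thin wrapper for it, roughly like this (assumes the handler is enabled in solrconfig.xml, that "content" is the field to compare on, and reuses the pysolr connection from the earlier examples):

similar = solr.more_like_this(q='id:post-42', mltfl='content')

for result in similar:
    print result['id']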
The future of search (IMO) • Related content suggestions • Geospatial search * Teaching computers about the world we live in * Both Solr & Elasticsearch have this right now & it’s easy to implement
The future of search (IMO) • Related content suggestions • Geospatial search • Specialization • Contextual search * Something the engines don’t specifically have but you can easily build * Gives the user the ability to search within a given “silo” * Even better, start to narrow it based on where they already are
The future of search (IMO) • Related content suggestions • Geospatial search • Specialization • Contextual search • Real-time * The fresher the results, the better * Be careful not to swamp yourself * Sometimes you can fake it (within 1 minute, within 5 minutes, etc)
The future of search (IMO) • Related content suggestions • Geospatial search • Specialization • Contextual search • Real-time • New ways to present search results * Everyone is used to Google-style results (though they’re getting richer) * Provide more context * Make them appear in new places