Slide 1

Avoiding The Search Hall Of Shame

Slide 2

There are lots of ways to do search * Well, let me correct that...

Slide 3

There are lots of WRONG ways to do search

Slide 4

Who?

Slide 5

Who? • Daniel Lindsley * I was running a small web consultancy, but now I’m looking for a new job

Slide 6

Who? • Daniel Lindsley • Pythonista by passion, Djangonaut by trade * I've long adored Python * I spend a lot of time using Django, but I'm still just writing Python * Albeit slightly magical Python...

Slide 7

Who? • Daniel Lindsley • Pythonista by passion, Djangonaut by trade • Author of Haystack (& Tastypie) * Haystack is a semi-popular search library for Django * This talk isn't directly about Django, but everyone writing Python code in some capacity should benefit

Slide 8

But most of all, I <3 search. * Well, and RESTful APIs, but that’s a different talk altogether...

Slide 9

Picking The Right Approach Matters A Lot

Slide 10

Considerations

Slide 11

Considerations • Frequency of use

Slide 12

Considerations • Frequency of use • Intended use(ages)

Slide 13

Considerations • Frequency of use • Intended use(ages) • Amount of data

Slide 14

Considerations • Frequency of use • Intended use(ages) • Amount of data • Space to store the data

Slide 15

Considerations • Frequency of use • Intended use(ages) • Amount of data • Space to store the data • CPU needed to run the search * Frequency - tens of times a day? hundreds? thousands? * Intended - just another feature? frequent? core/indispensable? * Amount - few records or many * Space - tiny, single field documents? or huge docs with lots of fields & content? * CPU - how intensive are the searches?

Slide 16

Assumptions

Slide 17

Assumptions

Slide 18

Assumptions • You know Python

Slide 19

Assumptions • You know Python • You know a little about SQL & I/O

Slide 20

Assumptions • You know Python • You know a little about SQL & I/O • You understand words like "tokenize"

Slide 21

Assumptions • You know Python • You know a little about SQL & I/O • You understand words like "tokenize" • You want to make things better

Slide 22

WARNING: Hurtful words follow. * Don't take it personally * We've all been there (except GvR) * Sometimes the crappy way is A-OK * We’re going to start with the most wrong & improve from there.

Slide 23

The MOST wrong of wrong ways * Steel yourselves. The code you are about to see is the worst of this talk...

Slide 24

for the_file in our_files:
    the_data = open(the_file).read()
    if query in the_data:
        # ...

Why is this wrong?
* Python is SLOW (compared to other things)
* So very RAM/CPU inefficient
* I/O wait
* Worst way to look for the actual text

Slide 25

Slightly less wrong but STILL really wrong

Slide 26

SELECT * FROM mytable
WHERE big_text_blob ILIKE '%banana%';

Why is this wrong?
* Really, really inefficient
* Sequential scan (READ ALL THE ROWS)
* No index will help you here
* ``ILIKE`` hurts
* Double wildcard hurts

Slide 27

El Wrong-o

Slide 28

import subprocess

# shell=True so the glob expands (and yes, interpolating a raw query
# into a shell command is its own problem)
command = 'grep "{0}" our_files/*'.format(query)
subprocess.check_output(command, shell=True)

Why is this wrong?
* Still having to read in all the data
* Grep is smart enough to stream chunks of the data off the disk, so it doesn't consume it all at once
* Shelling out hurts

Slide 29

What are the commonalities?

Slide 30

What are the commonalities?

Slide 31

What are the commonalities? • Reading everything into RAM

Slide 32

What are the commonalities? • Reading everything into RAM • Manually looking for a substring

Slide 33

What are the commonalities? • Reading everything into RAM • Manually looking for a substring • Happens for every query * Reading everything into RAM is bad because that’s a lot of unused data * Substrings suck because you have to search the whole thing, perhaps character by character * Running through everything on every query is tons of slow I/O

Slide 34

We need a better approach.

Slide 35

Inverted Indexes to the rescue! But what is an inverted index?

Slide 36

doc1 = "Hello world"
doc2 = "World travelers welcome!"

index = {
    "hello": ['doc1'],
    "world": ['doc1', 'doc2'],
    "travel": ['doc2'],
    "welcome": ['doc2'],
    # ...
}

* Think of it as a Python dictionary
* Split your documents on whitespace to tokenize the content
* Talk about stemming, stop words, etc.
* Keys in the dictionary are the (now-unique) words
* Values in the dictionary are document ids

Slide 37

Better but still inefficient Let’s start looking at things that don’t hurt when searching.

Slide 38

# At startup...
index = {}

for the_file in our_files:
    words = set(open(the_file).read().split())

    for word in words:
        index.setdefault(word, set())
        index[word].add(the_file)

# Later...
return index.get(query, set())

Why is this inefficient?
* RAM-hungry
* Ugly global state
* Lose the whole index between restarts
* Only exact word matching
* No complex queries

Slide 39

Ever so slightly better

Slide 40

$ pip install microsearch * Written as a teaching tool * Covers the fundamentals * You should read the source
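
A minimal usage sketch (the Microsearch class & method signatures here are recalled from the project's README; treat them as an assumption & check the source, which is worth reading anyway):

import microsearch

ms = microsearch.Microsearch('/tmp/microsearch')  # the index lives on disk
ms.index('doc1', {'text': u'Hello world'})        # document id + a dict of fields
results = ms.search(u'world')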

Slide 41

microsearch

Why? • Uses less RAM • Persistent index • Better word/query facilities

Why Not? • Still inefficient • Slow • I/O heavy (but better than before)

Slide 42

Good! Moving on up in the world to something even better.

Slide 43

$ pip install whoosh * Roughly a Python port of Lucene * Written by a CANADIAN (Matt Chaput)

Slide 44

# Indexing
from whoosh.index import create_in
from whoosh.fields import *

schema = Schema(content=TEXT)
ix = create_in("indexdir", schema)
writer = ix.writer()
writer.add_document(content=u"This is the first document we've added!")
writer.add_document(content=u"The second one is even more interesting!")
writer.commit()

Slide 45

# Searching
from whoosh.qparser import QueryParser

with ix.searcher() as searcher:
    query = QueryParser("content", ix.schema).parse("first")
    results = searcher.search(query)
    results[0]

Slide 46

Why? • Far superior querying • Faster • More awesome features

Why Not? • Still slow • Still Python

Slide 47

Pretty Good!

Slide 48

Why? • Decent Python bindings • Fast • Good featureset • Good query support

Why Not? • Requires filesystem access • Expects a structured query object

* Filesystem access means distributed setups hurt
* Replication of the index between servers
* Things falling out of sync
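
The extracted text never names the engine on this slide, but the notes (filesystem access, structured query objects) & the mention of Xapian later in the talk suggest it's Xapian. A minimal sketch with the xapian bindings, to the best of my recollection of that API:

import xapian

# Indexing: write documents into an on-disk database
db = xapian.WritableDatabase("indexdir", xapian.DB_CREATE_OR_OPEN)
doc = xapian.Document()
doc.set_data(u"Hello world")
termgen = xapian.TermGenerator()
termgen.set_document(doc)
termgen.index_text(u"Hello world")
db.add_document(doc)

# Searching: build a structured Query object, then ask for matches
enquire = xapian.Enquire(db)
enquire.set_query(xapian.QueryParser().parse_query(u"world"))
matches = enquire.get_mset(0, 10)  # first 10 results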

Slide 49

Pretty Freaking Good!

Slide 50

$ pip install pysolr

Slide 51

import pysolr

# (pysolr expects the full URL to the core/handler, hence the '/solr')
solr = pysolr.Solr('http://localhost:8983/solr')
solr.add([
    {'id': 'doc1', 'text': u'Hello world'},
])
results = solr.search(u"world")

Slide 52

Why? • Really fast • Really featureful • Great query support

Why Not? • The Land of XML • Distributed setups hurt • JVM noms the RAMs • You need a server

* There's a good Haystack backend for this!

Slide 53

The Best!

Slide 54

$ pip install pyelasticsearch * Similar to pysolr, there’s a thin wrapper library for Elasticsearch

Slide 55

from pyelasticsearch import ElasticSearch

conn = ElasticSearch('http://localhost:9200/')
conn.index(
    {'id': 'doc1', 'text': u'Hello world'},
    "pycon", "examples", 1
)
results = conn.search(u"world")

Slide 56

Why? • Really fast • Really featureful • Great query support • Awesome ops story • NO XML!

Why Not? • JVM noms the RAMs • Many servers are best

* I'd be remiss not to mention this also has a Haystack backend

Slide 57

Regardless of engine, here are some good practices.

Slide 58

Good Practices • Denormalize! * This isn't an RDBMS. There are no joins. Don't be afraid to denorm extra data that makes querying easier. * Use the classic BlogPost/Comments example
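
For instance, a denormalized document for the classic BlogPost/Comments case might look like this (the field names are hypothetical):

# Comment text is flattened onto the post's search document,
# so no join is needed at query time
doc = {
    'id': 'post_42',
    'title': u'Hello world',
    'text': u'...the post body...',
    'author': u'daniel',
    'comment_text': u'First! Great post. More like this, please.',
    'comment_count': 3,
}
solr.add([doc])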

Slide 59

Good Practices • Denormalize! • Add on fields for narrowing * Putting more metadata on the records can help when running the searches * Examples: author information, publish dates, comment/view counts, etc.
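
With metadata fields like those in place, narrowing becomes a cheap filter query (Solr syntax shown; the field names are assumed):

kwargs = {
    # only daniel's documents, from 2012 onward
    'fq': 'author:daniel AND created:[2012-01-01T00:00:00Z TO *]',
}
results = solr.search(u"world", **kwargs)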

Slide 60

Good Practices • Denormalize! • Add on fields for narrowing • Good search needs good content * Garbage in, garbage out * Clean data makes for a happy engine * Strip out HTML, unnecessary data, meaningless numbers, etc.
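
A sketch of the kind of cleanup meant here, assuming lxml is available:

from lxml import html

def clean_for_index(raw):
    # strip the markup, keeping only the visible text
    text = html.fromstring(raw).text_content()
    # collapse the whitespace the tags leave behind
    return u' '.join(text.split())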

Slide 61

Good Practices • Denormalize! • Add on fields for narrowing • Good search needs good content • Feed the beast * Give the engine (and/or the process) plenty of RAM * Caching can give some huge wins

Slide 62

Good Practices • Denormalize! • Add on fields for narrowing • Good search needs good content • Feed the beast • Update documents out of process * Updates should happen in a different thread/process/queue * Especially in a web environment (don’t block the response waiting on the engine) * No real reason to make the user wait, especially on data we already have/can rebuild
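
One common shape for this, sketched with Celery (the task & field names are hypothetical):

import pysolr
from celery import shared_task

@shared_task
def update_index(doc_id, text):
    # runs in a worker process, not in the request/response cycle
    solr = pysolr.Solr('http://localhost:8983/solr')
    solr.add([{'id': doc_id, 'text': text}])

# In the view: fire & forget, then return the response immediately
# update_index.delay(post.id, post.body)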

Slide 63

So...

Slide 64

Where do we stand?

Slide 65

Where do we stand? • Our queries should now be fast & featureful

Slide 66

Where do we stand? • Our queries should now be fast & featureful • We should be sipping RAM, not gulping * Except for the JVM. Jerk.

Slide 67

Where do we stand? • Our queries should now be fast & featureful • We should be sipping RAM, not gulping • I/O should be tons better * No longer reading in *EVERYTHING* on *EVERY* query.

Slide 68

Where do we stand? • Our queries should now be fast & featureful • We should be sipping RAM, not gulping • I/O should be tons better • We’ve got clean, rich documents being indexed

Slide 69

Done.

Slide 70

...

Slide 71

WRONG! * Having technically solid search should be the *beginning*, not the *end* * If your UI/UX isn't good, your search isn't good

Slide 72

In the beginning... * I’m going to tell you a story. * Back to an innocent time * When wild searches roamed the earth * There was Man & Woman

Slide 73

* Man just returned links, and it was crap * No context, no clues where to look * Ugh

Slide 74

* Woman brought highlighted text, and it was slightly better * More context as to where the query appeared & why the document was considered relevant

Slide 75

* Man saw this & was jealous. He wanted to one-up Woman. * And so, he introduced advanced query syntax.

Slide 76

* Woman, realizing Man was serious & feeling bad for him, introduced faceting. * And Man was in awe.

Slide 77

* Man, not to be one-upped, added spelling suggestions.

Slide 78

* Woman, determined not to be outdone, added autocomplete.

Slide 79

* God, deciding the contest was pointless, hands you the data before you've finished deciding what you were going to search for.

Slide 80

Welp, we can’t be God... But everything else is doable with Whoosh or better. * So you have no excuse :D * For ease, I'm just going to show the pysolr variants * But pretty much doable everywhere

Slide 81

Just the results

import pysolr

solr = pysolr.Solr('http://localhost:8983/solr')
results = solr.search(u"world")

for result in results:
    print '{0}: {1}'.format(
        result['id'], result['title']
    )

* Just the results is pretty easy
* We could be denorming/storing other data for display here

Slide 82

With Highlights

kwargs = {
    'hl': 'true',
    'hl.fragsize': '200',
}
results = solr.search(u"world", **kwargs)

for result in results:
    print results.highlighting[result['id']]

* With highlights, things get a little more interesting
* Most engines can create some (non-standard) HTML
* Kinda yuck, but can be liveable

Slide 83

Advanced Queries

results = solr.search(u"((world OR hello) AND created:[* TO 2012-10-31])")

* Shown here is the Lucene-style syntax (Whoosh, Solr, ES)
* Xapian has similar facilities, but you go about it a very different way
* Can express many & varied queries here

Slide 84

Faceting

kwargs = {
    'facet': 'on',
    'facet.field': ['author', 'category'],
}
results = solr.search(u"world", **kwargs)
# ...
print results.facets['facet_fields']['author']

Caveats:
* You need to be storing additional fields
* You need to be storing exact (non-stemmed/post-processed) data in those fields
* This kills searching in those fields, so you may need to duplicate

Slide 85

Spelling Suggestions

kwargs = {
    'spellcheck': 'true',
    'spellcheck.dictionary': 'suggest',
    'spellcheck.onlyMorePopular': 'true',
    'spellcheck.count': 5,
    'spellcheck.collate': 'true',
}
results = solr.search(u"werld", **kwargs)
# ...
print results.spellcheck['suggestions']

Some caveats:
* Requires additional configuration (see the Solr wiki)
* ES doesn't support suggestions

Slide 86

Autocomplete

results = solr.search(u"completeme:wor")

Caveats:
* You need a new (edge) n-gram field called "completeme"
* It'll store lots of data as it passes a "window" over the content; those fragments become the new terms

Slide 87

OH NOES, TEH djangoes!

Slide 88

Obligatory Haystack Example!

# For the Djangonauts...
from datetime import date
from haystack.query import SearchQuerySet, SQ

sqs = (
    SearchQuerySet()
    .auto_query("bananas")
    .filter(SQ(text='hello') | SQ(text="world"))
    .filter(created__lte=date(2012, 10, 31))
    .facet('author', 'category')
    .filter(completeme="wor")
)
suggestions = sqs.spelling_suggestion("werld")

Because Djangoes!

Slide 89

Where are we now?

Slide 90

Where are we now? • Our search is fast & efficient

Slide 91

Where are we now? • Our search is fast & efficient • We have modern search functionality

Slide 92

Where are we now? • Our search is fast & efficient • We have modern search functionality • Hopefully we have happy users

Slide 93

But what about THE FUTURE?!!

Slide 94

Slide 95

The future of search (IMO)

Slide 96

The future of search (IMO) • Related content suggestions * More Like This * Talk about how MLT works * Similar content wins in terms of interest/page views/etc.
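
pysolr exposes More Like This directly; a minimal sketch (the field names are assumed):

# find documents similar to doc1, comparing on the 'text' field
similar = solr.more_like_this(q='id:doc1', mltfl='text')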

Slide 97

The future of search (IMO) • Related content suggestions • Geospatial search * Teaching computers about the world we live in * Both Solr & Elasticsearch have this right now & it’s easy to implement
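
For example, Solr's geofilt narrows results to a radius around a point, assuming an indexed spatial field (here called 'location'):

kwargs = {
    'fq': '{!geofilt}',    # filter results by distance
    'sfield': 'location',  # the spatial field (assumed name)
    'pt': '38.9,-77.0',    # the center point, as lat,lon
    'd': '10',             # radius, in km
}
results = solr.search(u"*:*", **kwargs)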

Slide 98

The future of search (IMO) • Related content suggestions • Geospatial search • Specialization * Searching within different media types

Slide 99

The future of search (IMO) • Related content suggestions • Geospatial search • Specialization • Contextual search * Something the engines don’t specifically have but you can easily build * Gives the user the ability to search within a given “silo” * Even better, start to narrow it based on where they already are

Slide 100

The future of search (IMO) • Related content suggestions • Geospatial search • Specialization • Contextual search • Real-time * The fresher the results, the better * Be careful not to swamp yourself * Sometimes you can fake it (within 1 minute, within 5 minutes, etc)

Slide 101

The future of search (IMO) • Related content suggestions • Geospatial search • Specialization • Contextual search • Real-time • New ways to present search results * Everyone is used to Google-style results (though they’re getting richer) * Provide more context * Make them appear in new places

Slide 102

Thank you! Now ask me questions, dammit. @daniellindsley • daniel@toastdriven.com • https://github.com/toastdriven

Slide 103

Photo Credits • http://www.flickr.com/photos/olivireland/2402838557 • http://www.flickr.com/photos/kioan/3260355830