Whoosh, an open source pure Python search library by Matt Chaput

[email protected]
Whoosh
An open-source pure-Python search library
Friday, 15 March, 13

Agenda
• Who am I?
• What is Whoosh?
• How does an inverted index work?
• Demo
• Advanced features
• What’s next?

Matt Chaput
• Technical writer, graphic designer, and UI designer
• Self-taught programmer: BASIC, Logo, Scheme, Smalltalk, JavaScript, Java, Python
• Information retrieval dilettante

Side Effects Software
• Houdini, a high-end 3D animation and special effects package
• Used in film, games, and commercials
• Twice recognized by the Academy of Motion Picture Arts and Sciences
• Uses Python as its scripting language

Before Whoosh
• Java Lucene in background process
• Customers HATED Java requirement
• Shipping Java was a big hassle with Sun

Why write a pure Python search library?
• Compiled libraries: installation problems and crashes, cross-platform issues
• Whoosh works where Python works
• I got 99 problems but make install ain’t one

What happened was...
• Wrote a search library for use in Houdini help system
• Would this be useful to anyone else?
• Open sourced in 2007
• Front page of Hacker News
• Whoosh was pretty slow
• Now it’s... much less slow ;)

Who uses Whoosh?
• Houdini online help
• MoinMoin
• Django-Haystack
• Lots of users on Bitbucket and mailing list

The basics
• Programming library
• Toolkit for building your own search engine
• Not a web crawler (bring your own text)
• Thread and multiprocess safe
• Two-clause BSD license (GPL compatible)
• Python 2.5+, Python 3 compatible

Features
• Fields
• Text analysis
• Spell checker
• “More like this”
• Pluggable components
• Powerful queries
• Term vectors
• Faceting and sorting
• Highlighted snippets
• Nested searching
• Efficient numeric and date fields
• Documentation & tests

Inverted Index

Indexing
• Text analysis
• Translate text into terms
• List of all postings in all documents
• External merge sort
• Map of terms to posting lists
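
The indexing steps above can be sketched in a few lines of plain Python. This is a toy illustration, not Whoosh's implementation: the `analyze` and `build_index` names are hypothetical, the analyzer only lowercases and splits on non-alphanumerics, and an in-memory sort stands in for the external merge sort a real indexer would use on data too large for RAM.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer: lowercase the text and split it into word terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to its postings: {docnum: [positions]}."""
    # Step 1: emit one posting per term occurrence in every document
    postings = []
    for docnum, text in enumerate(docs):
        for pos, term in enumerate(analyze(text)):
            postings.append((term, docnum, pos))
    # Step 2: sort by term, then document, then position. A real indexer
    # would spill sorted runs to disk and merge them (external merge sort).
    postings.sort()
    # Step 3: group the sorted postings into a term -> posting list map
    index = defaultdict(dict)
    for term, docnum, pos in postings:
        index[term].setdefault(docnum, []).append(pos)
    return dict(index)


docs = [
    "Whoosh is a search library",
    "Lucene is a Java search library",
    "Python works where Python works",
]
inverted = build_index(docs)
print(inverted["search"])   # {0: [3], 1: [4]}
print(inverted["python"])   # {2: [0, 3]}
```

Storing positions alongside document numbers is what makes phrase and proximity queries possible later; an index that only needs to answer "which documents contain this term?" could keep document numbers alone.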

Searching
• Parse user query
• Run search
• Grouping/sorting
• Paging
• Spelling corrections
• Highlighted snippets
• “More like this”
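
The first few search steps (parse the query, run it, page the results) might look like the following toy sketch. It is an assumption-laden illustration rather than Whoosh code: `analyze`, `build_index`, and `search` are hypothetical helpers, every query term is ANDed together, and results are ordered by document number instead of by relevance score.

```python
import re


def analyze(text):
    """Toy analyzer shared by indexing and query parsing."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document numbers containing it."""
    index = {}
    for docnum, text in enumerate(docs):
        for term in analyze(text):
            index.setdefault(term, set()).add(docnum)
    return index


def search(index, query, page=1, pagelen=2):
    """AND all query terms, then return one page of document numbers."""
    terms = analyze(query)
    if not terms:
        return []
    # Intersect the posting sets; a term missing from the index matches nothing
    matches = set(index.get(terms[0], ()))
    for term in terms[1:]:
        matches &= index.get(term, set())
    ordered = sorted(matches)
    start = (page - 1) * pagelen
    return ordered[start:start + pagelen]


docs = [
    "Whoosh is a search library",
    "Lucene is a Java search library",
    "Houdini uses Python for scripting",
    "A pure Python search library",
]
index = build_index(docs)
print(search(index, "search library"))          # [0, 1]
print(search(index, "search library", page=2))  # [3]
print(search(index, "python search"))           # [3]
```

Running the query text through the same analyzer used at indexing time is the important design point: if the two disagree on case or tokenization, terms in the query will never match terms in the index.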

Defining the schema

    from whoosh import fields

    # Declarative style: subclass SchemaClass
    class MySchema(fields.SchemaClass):
        title = fields.TEXT(stored=True, sortable=True)
        content = fields.TEXT(spelling=True)
        path = fields.STORED
        modified = fields.DATETIME(stored=True, sortable=True)

    # or pass the fields as keyword arguments
    myschema = fields.Schema(
        title=fields.TEXT(stored=True, sortable=True),
        content=fields.TEXT(spelling=True),
        path=fields.STORED,
        modified=fields.DATETIME(stored=True, sortable=True),
    )

Indexing

    from datetime import datetime

    from whoosh import fields, index

    class MySchema(fields.SchemaClass):
        title = fields.TEXT(stored=True, sortable=True)
        content = fields.TEXT
        indexed_on = fields.DATETIME(stored=True, sortable=True)
        summary = fields.STORED

    # The "indexdir" directory must already exist
    ix = index.create_in("indexdir", MySchema)

    # myfile is an already-open text file (not shown on the slide);
    # the writer commits when the with block exits
    with ix.writer() as w:
        w.add_document(title="PyCon 2013", content=myfile.read(),
                       indexed_on=datetime.now(),
                       summary="Conference report.")

Searching

    from whoosh import index, qparser

    ix = index.open_dir("indexdir")

    # Parse the user's query against the "content" field by default
    qp = qparser.QueryParser("content", ix.schema)
    q = qp.parse("guido")

    with ix.searcher() as s:
        results = s.search(q)
        for hit in results:
            print(hit["title"])

Query types
• Term
• And, Or, Not
• DisjunctionMax
• Nested
• Phrase
• Near, Contains, etc.
• Range
• Prefix
• Wildcard
• Regex
• Fuzzy
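
One common way to think about the prefix, wildcard, regex, and fuzzy query types is as term expansion: the pattern is matched against the index's sorted term list, and the query then behaves like an Or over the matching terms. The sketch below illustrates that idea with the standard library only. It is not Whoosh's implementation, and `TERMS` plus every `expand_*` helper are hypothetical names invented for the example.

```python
import bisect
import fnmatch
import re

# Stand-in for the sorted term dictionary of one field
TERMS = sorted(["guide", "guido", "guild", "python", "pythons", "search"])


def expand_prefix(terms, prefix):
    """Terms starting with prefix, located by binary search."""
    start = bisect.bisect_left(terms, prefix)
    out = []
    for term in terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can match
        out.append(term)
    return out


def expand_wildcard(terms, pattern):
    """Terms matching a shell-style pattern such as 'pyth*'."""
    return [t for t in terms if fnmatch.fnmatchcase(t, pattern)]


def expand_regex(terms, pattern):
    """Terms matched in full by a regular expression."""
    rx = re.compile(pattern)
    return [t for t in terms if rx.fullmatch(t)]


def edit_distance(a, b):
    """Levenshtein distance via the row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def expand_fuzzy(terms, word, maxdist=1):
    """Terms within maxdist edits of word."""
    return [t for t in terms if edit_distance(t, word) <= maxdist]


print(expand_prefix(TERMS, "gui"))      # ['guide', 'guido', 'guild']
print(expand_wildcard(TERMS, "pyth*"))  # ['python', 'pythons']
print(expand_regex(TERMS, r"gui.d"))    # ['guild']
print(expand_fuzzy(TERMS, "guido"))     # ['guide', 'guido']
```

Only the prefix case can exploit the sorted order directly; the other three scan every term here, which is the kind of cost that on-disk automata (FSA/FST) are meant to reduce.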

Demo

Advanced features
• Multiprocessing for faster indexing
• Sorting/grouping
• Custom collectors (e.g. which queries matched?)
• Hierarchical searching
• Custom column types
• Codecs

Pure Python advantages
• Fun!
• No compilation - just works
• Fast iteration as you design the index
• Start up an interpreter and inspect the index, try queries, etc.
• Easy integration with other Python code

Pain points
• Performance
• Try to touch as much C as possible
• Do some silly things (full binary numbers, cPickle) because they’re fast
• Python 3 transition (u, bytes, str)
• Dynamic typing in the large

Future directions
• Explore manipulating byte array slices instead of strings
• Keep more in RAM
• Experimental codecs
• More choice between fast and compact index
• Split out sub-systems

Interesting code
• “StructFile” class
• Fast on-disk hash table
• Flexible date parser
• Text analyzers
• On-disk FSA/FST
• Cross-platform file lock
• Space-efficient integer sets
• On-disk columnar storage

Resources
• Bitbucket repo: https://bitbucket.org/mchaput/whoosh
• Documentation: https://whoosh.readthedocs.org/en/latest/
• Mailing list: http://groups.google.com/group/whoosh
• [email protected]

Thank you!


PyCon 2013
