Developing your search engine in Python

Rapid development of website search in Python
PyCon India, Bangalore, September 2012
Vishal Kanaujia, Chetan Giridhar

For whom?
If you are:
• an experienced developer who has implemented search solutions
• currently dirtying your hands prototyping website search for your startup
• dreading learning Java
• just curious

Think web development
• Core functionality
• Design patterns
• Web interface
• Usability
• Scalability
• Performance
• …?

Search
• Often considered ‘good to have’
• Enhances user experience
• Focused information
• Relevance
• Interaction
• Ranked searching

Typical Search Engine
• Designing a schema
– Convert your data to documents and store them in the index
– A document is a set of fields
– A field is a name=value pair, e.g. {title = “python”, content = “computer”, tag = “language”}
• Analyzers
– “Parse” each field of your data into indexable “tokens” or keywords
– For example, “Welcome to Pycon” produces the list [“welcome”, “to”, “Pycon”]
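The document and analyzer ideas on this slide can be sketched in a few lines of plain Python. This is a toy illustration, not the Whoosh or PyLucene API: the `analyze` function and the dict-based document are assumptions made for the example. Unlike the slide's example, which keeps "Pycon" capitalised, this sketch lowercases every token.

```python
import re

# A document modelled as a plain dict of field name -> value (illustrative only).
doc = {"title": "python", "content": "computer", "tag": "language"}


def analyze(text, lowercase=True):
    """Split text into word tokens; a toy stand-in for a real analyzer."""
    tokens = re.findall(r"\w+", text)
    return [t.lower() for t in tokens] if lowercase else tokens


# Tokenise a sample sentence, then every field of the document.
tokens = analyze("Welcome to Pycon")
field_tokens = {name: analyze(value) for name, value in doc.items()}
print(tokens)
print(field_tokens)
```

Real analyzers usually chain several steps (tokenising, lowercasing, stop-word removal, stemming); the single regular expression here only covers the first two.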

Typical Search Engine
• Indexing
– Adding documents to the index
• Query and query parsers
– Prepare query
– Parse
– Analyze
• Searching
– Lookup index
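The indexing, query parsing and lookup steps listed above can be illustrated with a small inverted index. This is a minimal sketch under simplifying assumptions: `ToyIndex` and its methods are hypothetical names, queries are treated as an AND of all terms, and nothing is persisted to disk.

```python
import re
from collections import defaultdict


def analyze(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


class ToyIndex:
    """Maps each token to the set of document ids that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = {}

    def add_document(self, doc_id, **fields):
        # Indexing: analyze every field and record where each token occurs.
        self.docs[doc_id] = fields
        for value in fields.values():
            for token in analyze(value):
                self.postings[token].add(doc_id)

    def search(self, query):
        # Query handling: analyze the query, then intersect the posting sets.
        terms = analyze(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


index = ToyIndex()
index.add_document(1, title="python", content="computer language")
index.add_document(2, title="java", content="computer language")
print(index.search("Python computer"))
```

The query goes through the same `analyze` function as the documents, which is why "Python" in the query matches "python" in the index.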

Indexing & Committing
[Diagram: input files, schema-based document (Field1, Field2, Field3), analyzer, index writer, in-memory index, committed index]
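The Indexing & Committing slide separates adding documents to an in-memory index from committing them. The sketch below imitates that split with a hypothetical `ToyWriter` that buffers documents in a list and only writes them to a JSON file on `commit()`. The file format and all names are assumptions for illustration; real engines persist much more compact index structures.

```python
import json
import os
import tempfile


def load_committed(path):
    """Return the documents already committed to disk (empty if none)."""
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as f:
        return json.load(f)


class ToyWriter:
    def __init__(self, path):
        self.path = path
        self.pending = []  # buffered in memory, not yet visible to readers

    def add_document(self, **fields):
        self.pending.append(fields)

    def commit(self):
        # Merge the buffered documents into the on-disk file, then clear the buffer.
        committed = load_committed(self.path)
        committed.extend(self.pending)
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(committed, f)
        self.pending = []


path = os.path.join(tempfile.mkdtemp(), "index.json")
writer = ToyWriter(path)
writer.add_document(title="python")
before = load_committed(path)  # nothing committed yet
writer.commit()
after = load_committed(path)   # the document is now on disk
print(before, after)
```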

Searching
[Diagram: input query, query parser, analyzer, index searcher, index, results]

Development: Considerations
• Sourcing input data set
• Handling input queries
• How to search
• Algorithms
• How to display results
• Customization

Development: Options
• Apache Solr: Sunburnt
• Haystack
• Xapian: Xappy
• Elasticsearch
• Whoosh
• Lucene: PyLucene

Talking PyLucene & Whoosh
• Pythonic APIs
• Deployment
– Large-scale and medium-sized web sites
• Rapid
– Minimal installation
– Clear documentation
– Quick setup
– Ease of integration

PyLucene
• PyLucene: Python wrappers for Lucene
• Lucene: an open-source, pure-Java search engine library
• The de facto standard for search engine libraries
• Embeds a Java VM with Lucene in a Python process

PyLucene
• Simple API
• High-performance indexing
• Scalable to millions of documents
• Efficient and feature-rich search algorithms
• Cross-platform

Whoosh
• Whoosh is a search engine library
• Fast indexing and search
• One of the fastest Python search engines
• 100% Python code
• Extensible code
• No external dependencies
• Active development and support

Whoosh
• Easy to set up
• Neutral to web frameworks
• Powerful query language
• Feature-rich
• Intuitive APIs

PyLucene and Whoosh: core classes
• PyLucene: Document, Field, IndexWriter, QueryParser, Analyzer, IndexSearcher
• Whoosh: fields.Schema, index.Index, qparser.QueryParser, analysis.Analyzer, searching.Searcher

Designing search in websites
Search design should be:
• an independent component
• pluggable
• platform independent
• minimal in its external dependencies
• easily extensible
• seamless to integrate
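One way to read these design goals is as a thin interface between the website and whichever engine sits behind it. The sketch below is an assumption about how such a layer could look, not the speakers' implementation: `SearchBackend`, `InMemoryBackend` and `SiteSearch` are hypothetical names, and the in-memory backend does naive substring matching purely to keep the example self-contained.

```python
from abc import ABC, abstractmethod


class SearchBackend(ABC):
    """The contract a pluggable engine wrapper would need to satisfy."""

    @abstractmethod
    def index(self, doc_id, text):
        """Add or replace one document."""

    @abstractmethod
    def search(self, query):
        """Return a list of matching document ids."""


class InMemoryBackend(SearchBackend):
    """Toy backend: substring match over lowercased text."""

    def __init__(self):
        self._docs = {}

    def index(self, doc_id, text):
        self._docs[doc_id] = text.lower()

    def search(self, query):
        needle = query.lower()
        return sorted(d for d, text in self._docs.items() if needle in text)


class SiteSearch:
    """Website-facing component; it only knows the SearchBackend interface."""

    def __init__(self, backend):
        self.backend = backend

    def add_page(self, url, body):
        self.backend.index(url, body)

    def find(self, query):
        return self.backend.search(query)


site = SiteSearch(InMemoryBackend())
site.add_page("/about", "About PyCon India")
site.add_page("/talks", "Talks on search in Python")
print(site.find("pycon"))
```

Swapping engines then means writing another `SearchBackend` subclass; the website code that calls `SiteSearch` stays unchanged.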

Abstraction in Search
[Diagram: Search.py, fsMgr]

Comparing Engines
• Basis of comparison
– Indexing, committing and searching
• Dataset
– 1 GB of data
– ~5000 files
– File sizes ranging from 1 KB to 50 MB
• Setup
– Intel® Core™2 Duo CPU P8600 @ 2.40GHz × 2
– 3 GB RAM
– Ubuntu 12.04 (precise), 32-bit
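A comparison like this needs a way to time each phase separately. The helper below is a minimal sketch using `time.perf_counter`. The `timed` and `build_index` functions and the tiny synthetic dataset are assumptions for illustration, not the harness or data used for the talk's measurements.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def build_index(texts):
    """Toy indexing workload: word -> set of positions in the input list."""
    index = {}
    for i, text in enumerate(texts):
        for word in text.split():
            index.setdefault(word, set()).add(i)
    return index


texts = ["python search engine"] * 1000
index, index_seconds = timed(build_index, texts)
hits, search_seconds = timed(index.get, "python", set())
print(f"index: {index_seconds:.4f}s  search: {search_seconds:.6f}s  hits: {len(hits)}")
```

With a real engine, the same `timed` wrapper would be placed around the add-document loop, the commit call and the search call, so the three phases can be reported separately.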

Indexing
[Bar chart: time to index, in seconds, PyLucene vs. Whoosh; axis from 0 to 500 s]

Committing
[Bar chart: time to commit, in seconds, PyLucene vs. Whoosh; axis from 0 to 300 s]

Searching
[Bar chart: time to search, in seconds, PyLucene vs. Whoosh; axis from 0 to 0.01 s]

Recommendations Search Engine Library No one solution fits all problems
Scalability is critical Rapid to setup, develop and tweak Understand and use 

Web development in Python
• Getting faster and easier by the day
• Web frameworks: Django, Pylons
• HTTP servers: Tornado, Gunicorn
• Support for SQL/NoSQL databases: MySQL-python, pymongo
• Template engines: Cheetah, Jinja2
• Search: PyLucene, Whoosh

References
• Whoosh: https://bitbucket.org/mchaput/whoosh/wiki/Home
• PyLucene: http://lucene.apache.org/pylucene/ and http://lucene.apache.org/core/3_6_1/api/all/index.html
• Xappy: http://code.google.com/p/xappy/
• Elasticsearch: http://www.elasticsearch.org/guide/reference/api/

Q and A

Backup

Whoosh v/s Haystack v/s Xapian
• Whoosh is suitable for a small project. Limited scalability for search and indexing
– A good beginning
• Haystack is appropriate with Django
• Xapian is ultra fast, but is not as feature rich as Solr
• Lucene is not distributed; has external dependency

Lucene v/s Database search
• There are a number of query types that RDBMSs in general do not support without vendor extensions:
– Fuzzy queries, in which "fuzzy" and "wuzzy" are considered matches
– Word stemming queries, which consider "take," "took," and "taken" to be identical
– Sound-like queries, which consider "cat" and "kat" to be identical
– Synonym queries, which consider "jump," "hop," and "leap" to be identical
– Queries on binary BLOB data types, such as PDF documents, Microsoft Word or Excel documents, or HTML and XML documents
• More disappointingly, SQL search results are not ranked by match-relevance scores. The SQL standard is simply not intended for full-text querying.
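The synonym and ranking points can be seen with a plain `LIKE` query. The example below uses Python's built-in `sqlite3` module with an in-memory table; the `pages` table and its rows are made up for illustration, and no full-text extension is enabled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO pages (body) VALUES (?)",
    [
        ("jump over the fence",),
        ("hop on one leg",),
        ("leap of faith",),
        ("jump, jump, jump",),
    ],
)

# Plain substring match: no synonyms, no relevance score.
rows = conn.execute(
    "SELECT id, body FROM pages WHERE body LIKE ? ORDER BY id",
    ("%jump%",),
).fetchall()
matched_ids = [row[0] for row in rows]
print(rows)
conn.close()
```

Only the rows that literally contain "jump" come back. The "hop" and "leap" rows are missed, and the row that mentions "jump" three times is not ranked ahead of the row that mentions it once.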

Typical search engine
• Indexing
– Convert files to a format for quick look up
– Fast random access to stored words
• Searching
– Specify keywords
• Displaying
– Lookup documents that are relevant
– Ranking
– Different types of queries
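Ranking, listed under displaying results, can be illustrated with a simple term-frequency times inverse-document-frequency score. This is a generic textbook-style sketch, not the scoring formula used by Whoosh or Lucene; the `rank` function and the sample documents are assumptions for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def rank(query, docs):
    """Return doc ids ordered by a simple TF-IDF score, best first."""
    counts_by_doc = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    n_docs = len(docs)
    scores = {}
    for term in tokenize(query):
        # Document frequency: how many documents contain the term at all.
        df = sum(1 for counts in counts_by_doc.values() if term in counts)
        if df == 0:
            continue
        idf = math.log(n_docs / df) + 1.0
        for doc_id, counts in counts_by_doc.items():
            if counts[term]:
                scores[doc_id] = scores.get(doc_id, 0.0) + counts[term] * idf
    # Sort by descending score, breaking ties by doc id for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))


docs = {
    "a": "python search",
    "b": "python python search engine",
    "c": "java search",
}
ranking = rank("python", docs)
print(ranking)
```

Document "b" mentions the query term twice and so ranks above "a"; document "c" does not contain it and is left out of the results.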

Advanced Searching
• MoreLikeThis
• DidYouMean
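A "did you mean" suggestion can be approximated with the standard library's `difflib.get_close_matches`. This is only a sketch of the idea: the `did_you_mean` helper, the vocabulary and the 0.7 cutoff are assumptions, and search libraries typically implement spelling suggestions against the index itself.

```python
import difflib

# Words assumed to be present in the index (illustrative vocabulary).
vocabulary = ["python", "lucene", "whoosh", "search", "index"]


def did_you_mean(word, vocabulary, cutoff=0.7):
    """Return the closest known word, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


suggestion = did_you_mean("pyhton", vocabulary)
print(suggestion)
```

Raising the cutoff makes suggestions more conservative; lowering it returns looser matches.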
