Slide 1

Slide 1 text

Rapid development of website search in Python
PyCon India, Bangalore, Sept ’12
Vishal Kanaujia, Chetan Giridhar

Slide 2

Slide 2 text

For whom?
If you are:
• an experienced developer who has implemented search solutions
• currently getting your hands dirty prototyping website search for your startup
• dreading having to learn Java
• just curious

Slide 3

Slide 3 text

Think web development
• Core functionality
• Design patterns
• Web interface
• Usability
• Scalability
• Performance
• …?

Slide 4

Slide 4 text

Search
• Often considered a "good to have"
• Enhances user experience
• Focused information
• Relevance
• Interaction
• Ranked searching

Slide 5

Slide 5 text

Typical Search Engine
• Designing a schema
– Convert your data into documents and store them in the index
– A document is a set of fields
– A field is a name=value pair
– {title = "python", content = "computer", tag = "language"}
• Analyzers
– Parse each field of your data into indexable tokens (keywords)
– For example, a lowercasing analyzer turns "Welcome to Pycon" into the list ["welcome", "to", "pycon"]
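The document and analyzer ideas above can be sketched in a few lines of plain Python. This is only an illustration: `analyze` and `tokens_by_field` are hypothetical names, and the regular-expression tokenizer is an assumption, not the analyzer that PyLucene or Whoosh ships.

```python
import re


def analyze(text):
    """Split text into lowercase word tokens (a simple stand-in for an analyzer)."""
    return [token.lower() for token in re.findall(r"\w+", text)]


# A document is a set of name=value fields, as on the slide.
document = {"title": "python", "content": "computer", "tag": "language"}

# Analyze every field into indexable tokens.
tokens_by_field = {name: analyze(value) for name, value in document.items()}

print(analyze("Welcome to Pycon"))  # ['welcome', 'to', 'pycon']
print(tokens_by_field)
```

Real analyzers usually do more than this sketch, for example dropping stop words or stemming.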

Slide 6

Slide 6 text

Typical Search Engine
• Indexing
– Adding documents to the index
• Query and query parsers
– Prepare query
– Parse
– Analyze
• Searching
– Lookup index
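The indexing and searching steps above can be sketched with a tiny inverted index. This is a minimal pure-Python illustration, not the PyLucene or Whoosh API; `build_index`, `search` and the sample documents are hypothetical, and queries are assumed to mean "all terms must match".

```python
import re
from collections import defaultdict


def analyze(text):
    """Lowercase word tokenizer shared by indexing and querying."""
    return [t.lower() for t in re.findall(r"\w+", text)]


def build_index(documents):
    """Indexing: map each token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Searching: analyze the query, then intersect the posting sets."""
    tokens = analyze(query)
    if not tokens:
        return set()
    postings = [index.get(token, set()) for token in tokens]
    return set.intersection(*postings)


documents = {
    1: "Welcome to PyCon India",
    2: "Python is a programming language",
    3: "Website search in Python",
}
index = build_index(documents)
print(sorted(search(index, "python")))         # [2, 3]
print(sorted(search(index, "Python search")))  # [3]
```

Running the same analyzer over both documents and queries is what lets "Python" in a query match "python" in the index.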

Slide 7

Slide 7 text

Indexing & Committing
[Diagram: input files → schema-based document (Field1, Field2, Field3) → analyzer → index writer → in-memory index → committed index]
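The split between adding documents and committing them can be sketched as follows. `SketchIndexWriter` is a hypothetical class written for illustration; it assumes whitespace tokenization and is not the IndexWriter of either library.

```python
from collections import defaultdict


class SketchIndexWriter:
    """Hypothetical writer: buffers documents in memory until commit()."""

    def __init__(self):
        self.committed = defaultdict(set)  # token -> doc ids visible to searches
        self.pending = []                  # (doc_id, fields) not yet committed

    def add_document(self, doc_id, **fields):
        """Buffer a schema-style document (keyword arguments are its fields)."""
        self.pending.append((doc_id, fields))

    def commit(self):
        """Analyze buffered documents and merge them into the committed index."""
        for doc_id, fields in self.pending:
            for value in fields.values():
                for token in value.lower().split():
                    self.committed[token].add(doc_id)
        count = len(self.pending)
        self.pending = []
        return count


writer = SketchIndexWriter()
writer.add_document(1, title="python", content="computer language")
print("python" in writer.committed)  # False: nothing committed yet
writer.commit()
print(writer.committed["python"])    # {1}
```

Buffering first and merging on commit is one reason indexing time and commit time can be measured separately.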

Slide 8

Slide 8 text

Searching
[Diagram: input query → query parser → analyzer → index searcher (reads the index) → results]

Slide 9

Slide 9 text

Development: Considerations
• Sourcing input data set
• Handling input queries
• How to search
• Algorithms
• How to display results
• Customization

Slide 10

Slide 10 text

Development: Options
• Apache Solr: Sunburnt
• Haystack
• Xapian: Xappy
• ElasticSearch
• Whoosh
• Lucene: PyLucene

Slide 11

Slide 11 text

Talking PyLucene & Whoosh
• Pythonic APIs
• Deployment
– Large-scale and medium-sized websites
• Rapid
– Minimal installation
– Clear documentation
– Quick setup
– Ease of integration

Slide 12

Slide 12 text

PyLucene
• PyLucene: Python wrappers for Lucene
• Lucene: an open source, pure Java search engine library
• The de facto standard among search engine libraries
• Embeds a Java VM with Lucene into a Python process

Slide 13

Slide 13 text

PyLucene
• Simple API
• High-performance indexing
• Scalable to millions of documents
• Efficient and feature-rich search algorithms
• Cross-platform

Slide 14

Slide 14 text

Whoosh
• Whoosh is a search engine library
• Fast indexing and search
• One of the fastest Python search engines
• 100% Python code
• Extensible code
• No external dependencies
• Active development and support

Slide 15

Slide 15 text

Whoosh
• Easy to set up
• Neutral to web frameworks
• Powerful query language
• Feature rich
• Intuitive APIs

Slide 16

Slide 16 text

PyLucene
• Document
• Field
• IndexWriter
• QueryParser
• Analyzer
• IndexSearcher
Whoosh
• fields.Schema
• index.Index
• qparser.QueryParser
• analysis.Analyzer
• searching.Searcher

Slide 17

Slide 17 text

Designing search in websites
• Search design should be:
– An independent component
– Pluggable
– Platform independent
– Minimally dependent on external components
– Easily extensible
– Seamlessly integrated

Slide 18

Slide 18 text

Abstraction in Search
[Diagram: Search.py, fsMgr]
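One way to keep search an independent, pluggable component is to hide the engine behind a small interface. The sketch below is an assumption about what such an abstraction could look like; it does not reproduce the presenters' Search.py or fsMgr, and `SearchBackend`, `InMemoryBackend` and `SiteSearch` are hypothetical names.

```python
from abc import ABC, abstractmethod


class SearchBackend(ABC):
    """Hypothetical interface that any engine wrapper would implement."""

    @abstractmethod
    def index(self, doc_id, text):
        """Add one document to the backend."""

    @abstractmethod
    def search(self, query):
        """Return the ids of matching documents."""


class InMemoryBackend(SearchBackend):
    """Toy backend using substring matching; a real one would wrap an engine."""

    def __init__(self):
        self._docs = {}

    def index(self, doc_id, text):
        self._docs[doc_id] = text.lower()

    def search(self, query):
        needle = query.lower()
        return sorted(d for d, text in self._docs.items() if needle in text)


class SiteSearch:
    """Facade used by the website; the backend can be swapped without changes here."""

    def __init__(self, backend):
        self._backend = backend

    def add_page(self, url, body):
        self._backend.index(url, body)

    def find(self, query):
        return self._backend.search(query)


site = SiteSearch(InMemoryBackend())
site.add_page("/about", "About PyCon India")
site.add_page("/talks", "Talks on Python search")
print(site.find("python"))  # ['/talks']
```

With this shape, switching engines means writing another `SearchBackend` subclass rather than touching the website code.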

Slide 19

Slide 19 text

Demo

Slide 20

Slide 20 text

Comparing Engines
• Basis of comparison
– Indexing, committing and searching
• Dataset
– 1 GB of data
– ~5000 files
– File sizes ranging from 1 KB to 50 MB
• Setup
– Intel® Core™2 Duo CPU P8600 @ 2.40GHz × 2
– 3 GB RAM
– Ubuntu Release 12.04 (precise) 32-bit
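A comparison like this needs each phase timed separately. The harness below is a minimal sketch of how that could be done; it is not the benchmark code used for the talk, and the toy `build_index` function and sample texts are assumptions standing in for a real engine and dataset.

```python
import time


def timed(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def build_index(texts):
    """Toy stand-in for an engine's indexing step."""
    index = {}
    for doc_id, text in enumerate(texts):
        for token in text.lower().split():
            index.setdefault(token, set()).add(doc_id)
    return index


texts = ["python search engine", "website search", "java library"] * 1000
index, index_seconds = timed(build_index, texts)
hits, search_seconds = timed(index.get, "search", set())
print(f"indexing: {index_seconds:.4f}s, "
      f"searching: {search_seconds:.6f}s, hits: {len(hits)}")
```

For stable numbers, each measurement would normally be repeated several times on the same dataset and machine.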

Slide 21

Slide 21 text

Indexing
[Bar chart: "Time to Index", time (s) on a 0 to 500 scale, comparing pylucene and whoosh]

Slide 22

Slide 22 text

Committing
[Bar chart: "Time to Commit", time (s) on a 0 to 300 scale, comparing pylucene and whoosh]

Slide 23

Slide 23 text

Searching
[Bar chart: "Time to Search", time (s) on a 0 to 0.01 scale, comparing pylucene and whoosh]

Slide 24

Slide 24 text

Recommendations Search Engine Library No one solution fits all problems Scalability is critical Rapid to setup, develop and tweak Understand and use 

Slide 25

Slide 25 text

Web development in Python
• Getting faster and easier by the day
• Web frameworks
– Django, Pylons
• HTTP servers
– Tornado, Gunicorn
• Support for SQL/NoSQL databases
– MySQL-python, pymongo
• Template engines
– Cheetah, jinja2
• Search
– PyLucene, Whoosh

Slide 26

Slide 26 text

References
• Whoosh
– https://bitbucket.org/mchaput/whoosh/wiki/Home
• PyLucene
– http://lucene.apache.org/pylucene/
– http://lucene.apache.org/core/3_6_1/api/all/index.html
• Xappy
– http://code.google.com/p/xappy/
• ElasticSearch
– http://www.elasticsearch.org/guide/reference/api/

Slide 27

Slide 27 text

Q and A

Slide 28

Slide 28 text

Backup

Slide 29

Slide 29 text

Whoosh v/s Haystack v/s Xapian
• Whoosh is suitable for a small project; its scalability for search and indexing is limited
– A good beginning
• Haystack is appropriate for Django projects
• Xapian is ultra fast, but is not as feature rich as Solr
• Lucene is not distributed and has an external dependency

Slide 30

Slide 30 text

Lucene v/s Database search
• There are a number of query types that RDBMSs in general do not support without vendor extensions:
– Fuzzy queries, in which "fuzzy" and "wuzzy" are considered matches
– Word stemming queries, which consider "take," "took," and "taken" to be identical
– Sound-like queries, which consider "cat" and "kat" to be identical
– Synonym queries, which consider "jump," "hop," and "leap" to be identical
– Queries on binary BLOB data types, such as PDF documents, Microsoft Word or Excel documents, or HTML and XML documents
• More disappointingly, SQL search results are not ranked by match-relevance scores. The SQL standard is simply not intended for full-text querying.
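To make the fuzzy query idea concrete, one common way to define "close enough" is edit distance. The sketch below computes Levenshtein distance in plain Python; it illustrates the concept only and is not how Lucene implements fuzzy queries. `fuzzy_match` and its `max_edits` parameter are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_match(term, candidate, max_edits=1):
    """Treat two terms as a match when they differ by at most max_edits edits."""
    return edit_distance(term, candidate) <= max_edits


print(edit_distance("fuzzy", "wuzzy"))  # 1
print(fuzzy_match("cat", "kat"))        # True
print(fuzzy_match("jump", "leap"))      # False
```

Note that "jump" and "leap" fail here: synonyms are unrelated as strings, so they need a synonym list rather than edit distance.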

Slide 31

Slide 31 text

Typical search engine
• Indexing
– Convert files to a format for quick look up
– Fast random access to stored words
• Searching
– Specify keywords
• Displaying
– Lookup documents that are relevant
– Ranking
– Different types of queries
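Ranking can be illustrated with a simple TF-IDF score: terms that are frequent in a document but rare across the collection count for more. This is a minimal sketch under that assumption; the exact formula, the `rank` function and the sample documents are illustrative and do not reproduce the scoring used by Lucene or Whoosh.

```python
import math
from collections import Counter


def rank(documents, query):
    """Return doc ids ordered by a simple TF-IDF score, best first."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in documents.items()}
    total = len(tokenized)
    scores = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            doc_freq = sum(1 for other in tokenized.values() if term in other)
            if doc_freq == 0 or counts[term] == 0:
                continue  # term absent from the collection or from this document
            tf = counts[term] / len(tokens)
            idf = math.log(total / doc_freq) + 1.0
            score += tf * idf
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


documents = {
    "a": "python search python",
    "b": "search engines in java",
    "c": "python web frameworks and search",
}
print(rank(documents, "python"))  # ['a', 'c']
```

Document "a" ranks first because "python" makes up a larger share of its tokens than it does in "c".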

Slide 32

Slide 32 text

Advanced Searching
• More like this
• Did you mean
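Both features can be approximated with the standard library. The sketch below uses `difflib.get_close_matches` for "did you mean" and token overlap (Jaccard similarity) for "more like this". The function names, vocabulary and pages are hypothetical, and real engines use their own, more refined methods.

```python
import difflib


def did_you_mean(term, vocabulary):
    """Suggest the closest indexed term, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(term.lower(), vocabulary, n=1, cutoff=0.6)
    return matches[0] if matches else None


def more_like_this(doc_id, documents):
    """Return the id of the other document sharing the most tokens (Jaccard)."""
    base = set(documents[doc_id].lower().split())

    def overlap(other_id):
        other = set(documents[other_id].lower().split())
        union = base | other
        return len(base & other) / len(union) if union else 0.0

    others = [d for d in documents if d != doc_id]
    return max(others, key=overlap) if others else None


vocabulary = ["python", "pycon", "lucene", "whoosh", "search"]
print(did_you_mean("pythn", vocabulary))  # python
print(did_you_mean("zzzz", vocabulary))   # None

pages = {
    1: "python search library",
    2: "python search engine",
    3: "java web server",
}
print(more_like_this(1, pages))  # 2
```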