Developing your search engine in Python

Presented at PyCon, India

vishalkanaujia

February 23, 2014

Transcript

  1. Rapid development of website search in Python
      • PyCon India, Bangalore, Sept '12
      • Vishal Kanaujia, Chetan Giridhar
  2. For whom? If you are:
      • an experienced developer who has implemented search solutions
      • currently getting your hands dirty prototyping website search for your startup
      • dreading having to learn Java
      • just curious
  3. Think web development
      • Core functionality
      • Design patterns
      • Web interface
      • Usability
      • Scalability
      • Performance
      • …?
  4. Search
      • Often considered 'good to have'
      • Enhances user experience
      • Focused information
      • Relevance
      • Interaction
      • Ranked searching
  5. Typical Search Engine
      • Designing a schema
      • Convert your data into Documents and store them in the index
          – A Document is a set of fields
          – A Field is a name=value pair, e.g. {title = "python", content = "computer", tag = "language"}
      • Analyzers "parse" each field of your data into indexable "tokens" or keywords
          – "Welcome to Pycon" produces the list ["welcome", "to", "pycon"]
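A minimal sketch of the document and analyzer ideas on this slide, using only the standard library. The `analyze` function is a hypothetical toy, not the Whoosh or PyLucene analyzer API; it assumes an analyzer that lowercases every token, so the last token comes out as "pycon".

```python
import re

def analyze(text):
    """Toy analyzer (an assumption for illustration): lowercase the text
    and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

# A document is a set of name=value fields (the example from the slide).
document = {"title": "python", "content": "computer", "tag": "language"}

# Run the analyzer over every field to get its indexable tokens.
tokens_by_field = {name: analyze(value) for name, value in document.items()}

print(analyze("Welcome to Pycon"))
print(tokens_by_field)
```

Real analyzers usually do more than this (stop-word removal, stemming), which is one reason to use a library instead of hand-rolling the step.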
  6. Typical Search Engine
      • Indexing
          – Adding documents to the index
      • Query and query parsers
          – Prepare query
          – Parse
          – Analyze
      • Searching
          – Look up the index
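The indexing, query parsing, and lookup steps above can be sketched with a toy inverted index. `ToyIndex` and its methods are hypothetical names for illustration; the sketch assumes whitespace tokenization and AND semantics across query terms, which is only one of several possible query models.

```python
from collections import defaultdict

def analyze(text):
    """Toy analyzer: lowercase and split on whitespace."""
    return text.lower().split()

class ToyIndex:
    """Hypothetical in-memory inverted index: token -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = {}

    def add_document(self, doc_id, **fields):
        # Indexing: analyze every field and record where each token occurs.
        self.docs[doc_id] = fields
        for value in fields.values():
            for token in analyze(value):
                self.postings[token].add(doc_id)

    def search(self, query):
        # Query handling: analyze the query, then intersect the posting sets
        # so that only documents containing every term are returned.
        tokens = analyze(query)
        if not tokens:
            return set()
        result = set(self.postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self.postings.get(token, set())
        return result

index = ToyIndex()
index.add_document(1, title="Python", content="a programming language")
index.add_document(2, title="Java", content="another programming language")

print(index.search("Programming Language"))
print(index.search("python language"))
```

Running the query through the same analyzer as the documents is what lets "Programming" match the lowercase token stored in the index.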
  7. Indexing & Committing
      [Diagram labels: Input files; Analyzer; Schema-based document (Field1, Field2, Field3); Index Writer; In-memory Index; Committed]
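One way to read the diagram is that a writer buffers documents in memory and only makes them visible to searches on commit. The sketch below illustrates that reading; `ToyIndexWriter` is a hypothetical class, not the PyLucene or Whoosh writer, and the buffering behaviour is an assumption drawn from the diagram labels.

```python
class ToyIndexWriter:
    """Hypothetical writer: buffers documents until commit() is called."""

    def __init__(self, committed):
        self.committed = committed  # documents visible to searchers
        self.pending = []           # in-memory buffer, not yet visible

    def add_document(self, **fields):
        # Added documents stay in the in-memory buffer.
        self.pending.append(fields)

    def commit(self):
        # Publish all buffered documents at once and clear the buffer.
        self.committed.extend(self.pending)
        self.pending = []

committed_docs = []
writer = ToyIndexWriter(committed_docs)
writer.add_document(title="python", content="computer")

visible_before = len(committed_docs)  # nothing is visible before commit
writer.commit()
visible_after = len(committed_docs)   # the document is visible after commit

print(visible_before, visible_after)
```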
  8. Development: Considerations
      • Sourcing the input data set
      • Handling input queries
      • How to search
      • Algorithms
      • How to display results
      • Customization
  9. Development: Options
      • Apache Solr: Sunburnt
      • Haystack
      • Xapian: Xappy
      • ElasticSearch
      • Whoosh
      • Lucene: PyLucene
  10. Talking PyLucene & Whoosh
      • Pythonic APIs
      • Deployment: large-scale and medium-sized web sites
      • Rapid: minimal installation, clear documentation, quick setup, ease of integration
  11. PyLucene
      • PyLucene: Python wrappers for Lucene
      • Lucene: an open source, pure Java search engine library
      • The de facto standard among search engine libraries
      • Embeds a Java VM with Lucene into a Python process
  12. PyLucene
      • Simple API
      • High-performance indexing
      • Scalable to millions of documents
      • Efficient and feature-rich search algorithms
      • Cross-platform
  13. Whoosh
      • Whoosh is a search engine library
      • Fast indexing and search
      • One of the fastest Python search engines
      • 100% Python code
      • Extensible code
      • No external dependencies
      • Active development and support
  14. Whoosh
      • Easy to set up
      • Neutral to web frameworks
      • Powerful query language
      • Feature-rich
      • Intuitive APIs
  15. PyLucene vs. Whoosh
      • PyLucene: Document, Field, IndexWriter, QueryParser, Analyzer, IndexSearcher
      • Whoosh: fields.Schema, index.Index, qparser.QueryParser, analysis.Analyzer, searching.Searcher
  16. Designing search in websites
      Search design should be:
      • An independent component
      • Pluggable
      • Platform independent
      • Minimal in its external dependencies
      • Easily extensible
      • Seamless to integrate
  17. Comparing Engines
      • Basis of comparison
          – Indexing, committing and searching
      • Dataset
          – 1 GB data
          – ~5000 files
          – File sizes ranging from 1 KB to 50 MB
      • Setup
          – Intel® Core™2 Duo CPU P8600 @ 2.40GHz × 2
          – 3 GB RAM
          – Ubuntu Release 12.04 (precise) 32-bit
  18. Recommendations: Search Engine Library
      • No one solution fits all problems
      • Scalability is critical
      • Rapid to set up, develop and tweak
      • Understand and use
  19. Web development in Python
      • Getting faster and easier by the day
      • Web frameworks: Django, Pylons
      • HTTP servers: Tornado, Gunicorn
      • Support for SQL/NoSQL databases: MySQL-python, pymongo
      • Template engines: Cheetah, jinja2
      • Search: PyLucene, Whoosh
  20. References
      • Whoosh: https://bitbucket.org/mchaput/whoosh/wiki/Home
      • PyLucene: http://lucene.apache.org/pylucene/ and http://lucene.apache.org/core/3_6_1/api/all/index.html
      • Xappy: http://code.google.com/p/xappy/
      • ElasticSearch: http://www.elasticsearch.org/guide/reference/api/
  21. Whoosh vs. Haystack vs. Xapian
      • Whoosh is suitable for a small project. Limited scalability for search and indexing
          – A good beginning
      • Haystack is appropriate with Django
      • Xapian is ultra fast, but is not as feature-rich as Solr
      • Lucene is not distributed and has an external dependency
  22. Lucene vs. Database search
      • There are a number of query types that RDBMSs in general do not support without vendor extensions:
          – Fuzzy queries, in which "fuzzy" and "wuzzy" are considered matches
          – Word stemming queries, which consider "take," "took," and "taken" to be identical
          – Sound-like queries, which consider "cat" and "kat" to be identical
          – Synonym queries, which consider "jump," "hop," and "leap" to be identical
          – Queries on binary BLOB data types, such as PDF documents, Microsoft Word or Excel documents, or HTML and XML documents
      • More disappointingly, SQL search results are not ranked by match-relevance scores. The SQL standard is simply not intended for full-text querying.
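As a rough illustration of the fuzzy query idea, the sketch below matches "fuzzy" against a small vocabulary with the standard library's `difflib`. This is a stand-in technique, not how Lucene implements fuzzy queries; the vocabulary, the `fuzzy_lookup` helper, and the 0.7 cutoff are assumptions chosen for the example.

```python
import difflib

# Hypothetical vocabulary of indexed terms.
vocabulary = ["wuzzy", "python", "search", "index"]

def fuzzy_lookup(term, words, cutoff=0.7):
    """Return indexed words whose similarity ratio to `term` is at least
    `cutoff`, best match first (difflib's sequence-matching ratio)."""
    return difflib.get_close_matches(term, words, n=3, cutoff=cutoff)

matches = fuzzy_lookup("fuzzy", vocabulary)
print(matches)
```

An exact-match SQL predicate such as `WHERE term = 'fuzzy'` would return nothing for this vocabulary, which is the gap the slide describes.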
  23. Typical search engine
      • Indexing
          – Convert files to a format for quick lookup
          – Fast random access to stored words
      • Searching
          – Specify keywords
      • Displaying
          – Look up documents that are relevant
          – Ranking
          – Different types of queries
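The ranking step can be sketched by scoring each document on how often the query terms occur in it. Raw term frequency is a deliberately simple stand-in here; real engines use more elaborate relevance scoring. The `documents` data and the `rank` helper are hypothetical.

```python
from collections import Counter

# Hypothetical document collection: id -> text.
documents = {
    "doc1": "python search engine in python",
    "doc2": "java search engine",
    "doc3": "cooking recipes",
}

def rank(query, docs):
    """Score each document by the total count of query terms it contains,
    drop documents with no match, and sort by descending score."""
    query_tokens = query.lower().split()
    scores = {}
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        score = sum(counts[token] for token in query_tokens)
        if score > 0:
            scores[doc_id] = score
    # Ties are broken by document id so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

ranked = rank("python search", documents)
print(ranked)
```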