
Discovering python search engines

jmortegac
October 09, 2017


Transcript

  1. Discovering python
    search engine
    José Manuel Ortega


  2. ● Introduction to search engines
    ● ElasticSearch, Whoosh, django-haystack
    ● ElasticSearch example
    ● Other solutions & tools
    ● Conclusions


  3. Search engines


  4. Search engines
    ● Document based
    ● A document is the unit of searching in a full text
    search system.
    ● A document can be a JSON object or a Python dictionary


  5. Core concepts


  6. ● Index: named collection of documents that have
    similar characteristics (like a database)
    ● Type: logical partition of an index that contains
    documents with common fields (like a table)
    ● Document: basic unit of information (like a row)
    ● Mapping: field properties (data type, token extraction).
    Includes information about how fields are stored in the
    index


  7. ● Relevance: the algorithms used to rank the
    results based on the query
    ● Corpus: the collection of all documents in the
    index
    ● Segments: sharded data storing the inverted
    index. They allow searching the index in an
    efficient way


  8. Inverted index
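As a rough illustration of the inverted index idea, the sketch below maps every term to the ids of the documents that contain it, then answers a query by intersecting those id lists. The tokenizer, the sample documents and the function names are assumptions made for this example; real engines add analysis steps such as stemming and stop-word removal.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each term to the sorted ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, query):
    """Return the ids of documents that contain every query term (AND)."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical sample corpus, keyed by document id.
docs = {
    1: "Python search engines",
    2: "Elasticsearch is a search server",
    3: "Whoosh is a pure Python library",
}
index = build_inverted_index(docs)
```

Looking up a term is a single dictionary access, which is why searching the index is cheap compared with scanning every document.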


  11. ElasticSearch


  14. ● Open source search server based on
    Apache Lucene
    ● Written in Java
    ● Cross-platform
    ● Communication with the search server is
    done through an HTTP REST API
    ● curl -X<VERB>
    http://localhost:9200/<index>/<type>/<id>


  15. ● You can add a document without creating
    an index
    ● ElasticSearch will create the
    index, mapping type and fields
    automatically
    ● ElasticSearch will infer the data types
    based on the document’s data


  17. ● TF-IDF (Term Frequency-Inverse Document Frequency)
    ● TF-IDF = TF * IDF
    ● TF = number of appearances of the term in
    a document
    ● IDF = log (N / DF)
    ● N = total_document_count
    ● DF = number of documents in which the term
    appears
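The formula above can be worked through in a few lines of Python. This is a minimal sketch of the textbook definition only, and not necessarily the exact scoring that Elasticsearch applies internally. The toy corpus and the function name `tf_idf` are assumptions for the example.

```python
import math
from collections import Counter


def tf_idf(term, document, corpus):
    """Score `term` for one tokenized document against the whole corpus."""
    tf = Counter(document)[term]                  # appearances in this document
    df = sum(1 for doc in corpus if term in doc)  # documents containing the term
    if df == 0:
        return 0.0                                # term is absent from the corpus
    idf = math.log(len(corpus) / df)              # IDF = log(N / DF)
    return tf * idf


# Hypothetical corpus: each document is already split into terms.
corpus = [
    ["python", "search", "engine"],
    ["python", "python", "library"],
    ["java", "search", "server"],
]

score_common = tf_idf("python", corpus[1], corpus)   # TF = 2, DF = 2
score_rare = tf_idf("library", corpus[1], corpus)    # TF = 1, DF = 1
```

A term that appears in few documents ("library") gets a larger IDF than one that appears in most of them ("python"), so rare terms weigh more in the ranking.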


  19. Creating an Index
    curl -XPUT 'localhost:9200/myindex' -d '{
      "settings": {..},
      "mappings": {..}
    }'
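The same request can be prepared from Python with only the standard library. The sketch below builds the PUT request without sending it. The settings and mappings are placeholders invented for this example (one shard, a `title` text field under the `mydocument` type), and the host is assumed to be a local node.

```python
import json
from urllib.request import Request


def build_create_index_request(index_name, settings, mappings,
                               host="http://localhost:9200"):
    """Prepare (but do not send) the PUT request that creates an index."""
    body = json.dumps({"settings": settings, "mappings": mappings})
    return Request(
        url=f"{host}/{index_name}",
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical settings and mappings, for illustration only.
request = build_create_index_request(
    "myindex",
    settings={"number_of_shards": 1, "number_of_replicas": 0},
    mappings={"mydocument": {"properties": {"title": {"type": "text"}}}},
)
```

With a node running, passing `request` to `urllib.request.urlopen` would send it. The layout of the mappings depends on the Elasticsearch version in use.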


  23. Searching a document
    curl -XGET 'localhost:9200/myindex/mydocument/_search?q=elasticSearch'
    curl -XGET 'localhost:9200/myindex/mydocument/_search?pretty' -d '{
      "query": {
        "match": {
          "_all": "elasticSearch"
        }
      }
    }'
    Query DSL


  25. Searching a document
    ● Search can get much more complex
    ○ Multiple terms
    ○ Multi-match (match query on specific fields)
    ○ Bool (boolean combinations of clauses)
    ○ Range
    ○ RegExp
    ○ GeoPoint, GeoShapes
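To give an idea of how these query types combine, here is a sketch that builds a Query DSL body as a plain Python dictionary: a `bool` query with a `match` clause that must hold and a `range` clause used as a filter. The field names `title` and `year` and the helper name are assumptions for the example.

```python
import json


def build_bool_query(text, field, range_field, gte, lte):
    """Combine a full-text match with a range filter in one bool query."""
    return {
        "query": {
            "bool": {
                # Documents must match the text in the given field...
                "must": [{"match": {field: text}}],
                # ...and fall inside the range, without affecting the score.
                "filter": [{"range": {range_field: {"gte": gte, "lte": lte}}}],
            }
        }
    }


# Hypothetical fields: a "title" text field and a numeric "year" field.
query = build_bool_query("elasticsearch", "title", "year", 2015, 2017)
query_json = json.dumps(query, indent=2)
```

The resulting JSON is what would go after `-d` in the curl call, or in the `body` argument of a client's search method.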


  26. ElasticSearch python client
    ● The official low-level client is elasticsearch-py
    ○ pip install elasticsearch
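A minimal sketch of how the client is typically used follows. It assumes a client from the same era as the curl examples above, where `index` and `search` accept `doc_type` and `body` arguments and the `_all` field exists. Newer releases changed these parameters, so check the documentation for your version. The helper names are invented for this example, and `es` is expected to be an `elasticsearch.Elasticsearch` instance.

```python
def index_document(es, index, doc_type, doc_id, document):
    """Store one document (a plain dict) under the given index and type."""
    return es.index(index=index, doc_type=doc_type, id=doc_id, body=document)


def search_all_fields(es, index, text):
    """Run a match query on the _all field and return the stored documents."""
    body = {"query": {"match": {"_all": text}}}
    response = es.search(index=index, body=body)
    # Each hit carries the original document under the "_source" key.
    return [hit["_source"] for hit in response["hits"]["hits"]]
```

With a node running on localhost, usage would look roughly like `es = Elasticsearch()` followed by `index_document(es, "myindex", "mydocument", 1, {"title": "Hello"})` and `search_all_fields(es, "myindex", "hello")`.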


  27. ElasticSearch-py API


  28. ElasticSearch-py API


  34. Geo queries
    ● ElasticSearch supports two types of geo fields
    ○ geo_point (lat, lon)
    ○ geo_shape (points, lines, polygons)
    ● Perform geographical searches
    ○ Finding points of interest and GPS coordinates
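A geographical search of this kind can be expressed as a `geo_distance` filter. The sketch below only builds the query body. The field name `location` is hypothetical and would need to be mapped as `geo_point`, and the coordinates are sample values.

```python
def build_geo_distance_query(field, lat, lon, distance):
    """Match documents whose geo_point `field` lies within `distance` of a point."""
    return {
        "query": {
            "bool": {
                "filter": {
                    "geo_distance": {
                        "distance": distance,            # e.g. "5km"
                        field: {"lat": lat, "lon": lon},
                    }
                }
            }
        }
    }


# Hypothetical example: points of interest within 5 km of a sample location.
geo_query = build_geo_distance_query("location", 40.4168, -3.7038, "5km")
```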


  37. https://github.com/jmortega/python_discover_search_engine


  48. Whoosh


  49. ● Pure-python full-text indexing and searching library
    ● Library of classes and functions for indexing text and
    then searching the index.
    ● It allows you to develop custom search engines for
    your content.
    ● Mainly focused on index and search definition using
    schemas
    ● Python 2.5 and Python 3


  50. Schema


  51. Create index and insert document


  52. Searching single field


  53. Searching multiple fields


  54. Django-haystack


  56. ● Multiple backends (you have a Solr & a Whoosh
    index, or a master Solr & a slave Solr, etc.)
    ● An Elasticsearch backend
    ● Big query improvements
    ● Geospatial search (Solr & Elasticsearch only)
    ● The addition of Signal Processors for better control
    ● Input types for improved control over queries
    ● Rich Content Extraction in Solr


  62. ● Create the index
    ○ Run ./manage.py rebuild_index to create the new
    search index.
    ● Update the index
    ○ ./manage.py update_index will add new entries to the
    index.
    ○ ./manage.py rebuild_index will recreate the index
    from scratch.


  63. Other solutions


  64. Other solutions
    ● https://xapian.org
    ● https://docs.djangoproject.com/en/1.11/ref/contrib/postgres/search/
    ● https://www.postgresql.org/docs/9.6/static/textsearch.html


  65. pysolr


  67. Conclusions


  68. ● Elasticsearch's Query DSL syntax is really flexible and
    makes it pretty easy to write complex queries; other
    solutions don't have an equivalent
    ● Elasticsearch is faster and more flexible than other solutions
    like PostgreSQL full-text search or Solr
    ● Aggregations in ES for searching by category are another
    interesting feature that other solutions lack
    ● Solr requires more configuration than ES
    ● Whoosh is suitable for a small project. Limited scalability
    for search and indexing.


  69. Other tools


  75. References
    ● http://elasticsearch-py.readthedocs.io/en/master/
    ● https://whoosh.readthedocs.io/en/latest
    ● http://django-haystack.readthedocs.io/en/master/
    ● http://solr-vs-elasticsearch.com/
    ● https://wiki.apache.org/solr/SolPython
    ● https://github.com/django-haystack/pysolr


  78. Thanks!
    jmortega.github.io
    @jmortegac
