
Discovering python search engines

jmortegac
October 09, 2017



Transcript

  1. Discovering python search engines, by José Manuel Ortega

  2. • Introduction to search engines
    • Elasticsearch, Whoosh, django-haystack
    • Elasticsearch example
    • Other solutions & tools
    • Conclusions
  3. Search engines

  4. Search engines
    • Document based
    • A document is the unit of searching in a full-text search system.
    • A document can be a JSON object or a Python dictionary.
  5. Core concepts

  6. • Index: named collection of documents that have similar characteristics (like a database)
    • Type: logical partition of an index that contains documents with common fields (like a table)
    • Document: basic unit of information (like a row)
    • Mapping: field properties (data type, token extraction), including how fields are stored in the index
  7. • Relevance: the algorithms used to rank the results based on the query
    • Corpus: the collection of all documents in the index
    • Segments: sharded data storing the inverted index; they allow the index to be searched efficiently
  8. Inverted index
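
As a rough illustration of the idea, an inverted index maps every term to the documents that contain it, so a query only has to look up its own terms instead of scanning every document. The sketch below is a toy version in plain Python; the tokenizer, the function names and the sample documents are assumptions made for the example, not what Lucene or Whoosh do internally.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each term to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, query):
    """Return ids of documents that contain every term of the query (AND)."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical sample corpus, keyed by document id.
documents = {
    1: "Elasticsearch is a search server",
    2: "Whoosh is a pure-python search library",
    3: "Python clients for Elasticsearch",
}
index = build_inverted_index(documents)
```

Real engines also store term positions and frequencies in the postings so that phrase queries and relevance scoring are possible; this sketch keeps only document ids.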

  9. None
  10. None
  11. ElasticSearch

  12. None
  13. None
  14. • Open source search server based on Apache Lucene
    • Written in Java
    • Cross-platform
    • Communication with the search server is done through an HTTP REST API
    • curl -X<GET|POST|PUT|DELETE> http://localhost:9200/<index>/<type_document>/<id>
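
The URL pattern in the curl line can be reproduced from Python with the standard library alone. The following is a minimal sketch that only builds the request object; `build_request` is a name made up for this example, and `myindex` / `mydocument` are the placeholder names used elsewhere in the deck.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:9200"  # default Elasticsearch HTTP port, as on the slide


def build_request(method, index, doc_type, doc_id, document=None):
    """Build (but do not send) an HTTP request for /<index>/<type>/<id>."""
    url = "{}/{}/{}/{}".format(BASE_URL, index, doc_type, doc_id)
    data = None
    headers = {}
    if document is not None:
        # Elasticsearch expects the document as a JSON body.
        data = json.dumps(document).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return Request(url, data=data, headers=headers, method=method)


# Equivalent of: curl -XPUT http://localhost:9200/myindex/mydocument/1 -d '{...}'
request = build_request("PUT", "myindex", "mydocument", 1, {"title": "Hello"})
# Sending it needs a running server: urllib.request.urlopen(request)
```
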
  15. • You can add a document without creating an index first
    • Elasticsearch will create the index, mapping type and fields automatically
    • Elasticsearch will infer the data types from the document's data
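
To make the type inference concrete, here is a toy imitation of dynamic mapping in plain Python. It follows the documented defaults (integers become `long`, floating point numbers `float`, booleans `boolean`, nested objects `object`, strings `text`), but it is a simplified sketch: date detection, numeric detection and the `keyword` sub-field are left out, and the function names are invented for the example.

```python
def infer_field_type(value):
    """Guess an Elasticsearch field type from a Python value (simplified)."""
    # bool must be tested before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "text"  # assumption: date detection is ignored here
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        # Arrays take the type of their first non-null element.
        for item in value:
            if item is not None:
                return infer_field_type(item)
    return None  # null or empty array: no field is mapped yet


def infer_mapping(document):
    """Return {field: type} for every field whose type can be inferred."""
    mapping = {}
    for field, value in document.items():
        field_type = infer_field_type(value)
        if field_type is not None:
            mapping[field] = field_type
    return mapping
```
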
  16. None
  17. • TF-IDF (Term Frequency-Inverse Document Frequency)
    • TF-IDF = TF * IDF
    • TF = number of appearances of the term in a document
    • IDF = log(N / DF)
    • N = total document count
    • DF = number of documents in which the term appears
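
The formula can be checked by hand with a few lines of Python. This sketch takes TF as the raw count of the term within a single document (the usual definition) and IDF as log(N / DF); the tokenized corpus is invented for the example, and the scoring Elasticsearch actually applies is more elaborate than this.

```python
import math


def tf_idf(term, document, corpus):
    """Score `term` for one tokenized document against the whole corpus."""
    tf = document.count(term)                      # appearances in this document
    df = sum(1 for doc in corpus if term in doc)   # documents containing the term
    if df == 0:
        return 0.0                                 # term never appears anywhere
    idf = math.log(len(corpus) / df)               # IDF = log(N / DF)
    return tf * idf


# Hypothetical corpus: each document is already split into tokens.
corpus = [
    ["python", "search", "engine"],
    ["python", "python", "client"],
    ["java", "search", "server"],
]
```

A term that appears in every document gets IDF = log(1) = 0, which is how the formula discounts words that do not help tell documents apart.
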
  18. None
  19. Creating an index
    curl -XPUT 'localhost:9200/myindex' -d '{ "settings": {..}, "mappings": {..} }'

  20. None
  21. None
  22. None
  23. Searching a document
    curl -XGET 'localhost:9200/myindex/mydocument/_search?q=elasticSearch'
    Query DSL:
    curl -XGET 'localhost:9200/myindex/mydocument/_search?pretty' -d '{ "query": { "match": { "_all": "elasticSearch" } } }'
  24. None
  25. Searching a document
    • Search can get much more complex
      ◦ Multiple terms
      ◦ Multi-match (match query on specific fields)
      ◦ Bool (true, false)
      ◦ Range
      ◦ RegExp
      ◦ GeoPoint, GeoShapes
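
Since a Query DSL body is just JSON, the more complex queries listed above can be assembled as ordinary Python dictionaries. The sketch below combines a `match` clause and a `range` filter inside a `bool` query; the helper functions and the `title` / `year` field names are assumptions made for the example.

```python
import json


def match_query(field, text):
    """Full-text match on a single field."""
    return {"match": {field: text}}


def range_query(field, gte=None, lte=None):
    """Range clause; only the bounds that are given are included."""
    bounds = {}
    if gte is not None:
        bounds["gte"] = gte
    if lte is not None:
        bounds["lte"] = lte
    return {"range": {field: bounds}}


def bool_query(must=(), should=(), must_not=(), filter_=()):
    """Combine clauses into a bool query, omitting empty sections."""
    sections = {
        "must": list(must),
        "should": list(should),
        "must_not": list(must_not),
        "filter": list(filter_),
    }
    return {"bool": {name: clauses for name, clauses in sections.items() if clauses}}


# Documents whose title matches "elasticsearch", published between 2015 and 2017.
search_body = {
    "query": bool_query(
        must=[match_query("title", "elasticsearch")],
        filter_=[range_query("year", gte=2015, lte=2017)],
    )
}
print(json.dumps(search_body, indent=2))
```

The resulting dictionary can be sent as the `-d` payload of a `_search` request or passed as the request body to a client library.
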
  26. Elasticsearch Python client
    • The official low-level client is elasticsearch-py
      ◦ pip install elasticsearch
  27. ElasticSearch-py API

  28. ElasticSearch-py API
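
The code shown on these slides did not survive in the transcript, so the following is only a sketch of a typical indexing and search round trip with elasticsearch-py. It assumes a client version contemporary with the talk (5.x or 6.x), where `index()` still takes `doc_type` and `body`; newer releases changed these parameters. The helper names are invented, and the import is kept inside `make_client` so the module loads even where the library is not installed.

```python
def make_client(hosts=("localhost:9200",)):
    """Create a client for a local node (requires: pip install elasticsearch)."""
    from elasticsearch import Elasticsearch
    return Elasticsearch(list(hosts))


def index_then_search(es, index, doc_type, doc_id, document, field, text):
    """Index one document, refresh the index, then run a match query."""
    # Store the document; the index is created automatically if missing.
    es.index(index=index, doc_type=doc_type, id=doc_id, body=document)
    # Make the new document visible to search immediately.
    es.indices.refresh(index=index)
    response = es.search(index=index, body={"query": {"match": {field: text}}})
    # Each hit carries the original document under "_source".
    return [hit["_source"] for hit in response["hits"]["hits"]]
```

With a node running locally, a call could look like `index_then_search(make_client(), "myindex", "mydocument", 1, {"title": "hello"}, "title", "hello")`.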

  29. None
  30. None
  31. None
  32. None
  33. None
  34. Geo queries
    • Elasticsearch supports two types of geo fields
      ◦ geo_point (lat, lon)
      ◦ geo_shape (points, lines, polygons)
    • Perform geographical searches
      ◦ Finding points of interest and GPS coordinates
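
As an illustration of a geographical search, the sketch below builds a mapping that declares a `geo_point` field and a `geo_distance` filter that finds documents within a radius of a coordinate. The `place` type, the `location` field, the coordinates and the helper names are all assumptions for the example; the mapping layout with a type name follows the typed indices used throughout the deck.

```python
def geo_point_mapping(doc_type, field):
    """Index body declaring `field` as a geo_point on the given type."""
    return {
        "mappings": {
            doc_type: {"properties": {field: {"type": "geo_point"}}}
        }
    }


def geo_distance_query(field, lat, lon, distance):
    """Search body matching documents within `distance` of (lat, lon)."""
    return {
        "query": {
            "bool": {
                "filter": {
                    "geo_distance": {
                        "distance": distance,            # e.g. "10km"
                        field: {"lat": lat, "lon": lon},  # centre of the search
                    }
                }
            }
        }
    }


# Hypothetical example: points of interest within 10 km of a coordinate.
mapping = geo_point_mapping("place", "location")
query = geo_distance_query("location", 40.4168, -3.7038, "10km")
```
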
  35. None
  36. None
  37. https://github.com/jmortega/python_discover_search_engine

  38. None
  39. None
  40. None
  41. None
  42. None
  43. None
  44. None
  45. None
  46. None
  47. None
  48. Whoosh

  49. • Pure-Python full-text indexing and searching library
    • Library of classes and functions for indexing text and then searching the index
    • It allows you to develop custom search engines for your content
    • Mainly focused on index and search definition using schemas
    • Python 2.5 and Python 3
  50. Schema

  51. Create index and insert document

  52. Searching single field

  53. Searching multiple fields
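
The code for the Whoosh slides (schema, index creation, single-field and multi-field search) is not in the transcript, so here is a minimal sketch of that workflow. The schema fields, the function name and the sample data are assumptions; the Whoosh imports are kept inside the function so the module loads even where Whoosh is not installed (`pip install whoosh`).

```python
def whoosh_index_and_search(index_dir, docs, fields, text):
    """Index `docs` into `index_dir` and return the paths of matching documents."""
    from whoosh.fields import ID, TEXT, Schema
    from whoosh.index import create_in
    from whoosh.qparser import MultifieldParser

    # Schema: which fields exist, how they are indexed, and which are stored.
    schema = Schema(
        path=ID(stored=True, unique=True),
        title=TEXT(stored=True),
        content=TEXT,
    )

    # Create the index on disk and insert the documents.
    ix = create_in(index_dir, schema)
    writer = ix.writer()
    for doc in docs:
        writer.add_document(**doc)
    writer.commit()

    # Parse the query against one or several fields, then search.
    with ix.searcher() as searcher:
        query = MultifieldParser(fields, ix.schema).parse(text)
        return [hit["path"] for hit in searcher.search(query)]
```

Passing a single field name in `fields` gives a single-field search; for that case Whoosh also offers `QueryParser("content", ix.schema)`.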

  54. Django-haystack

  55. None
  56. • Multiple backends (you have a Solr & a Whoosh index, or a master Solr & a slave Solr, etc.)
    • An Elasticsearch backend
    • Big query improvements
    • Geospatial search (Solr & Elasticsearch only)
    • The addition of Signal Processors for better control
    • Input types for improved control over queries
    • Rich Content Extraction in Solr
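
In django-haystack, backends are selected in the Django settings through `HAYSTACK_CONNECTIONS`, and a signal processor controls when the index is updated. The fragment below is a sketch of a `settings.py` excerpt with an Elasticsearch connection and a Whoosh connection side by side; the connection aliases, index name and paths are assumptions, and the Elasticsearch engine path differs between Haystack and Elasticsearch versions.

```python
import os

BASE_DIR = os.getcwd()  # stand-in for the project's BASE_DIR setting

HAYSTACK_CONNECTIONS = {
    # Default connection: an Elasticsearch node on the local machine.
    "default": {
        "ENGINE": "haystack.backends.elasticsearch_backend.ElasticsearchSearchEngine",
        "URL": "http://127.0.0.1:9200/",
        "INDEX_NAME": "haystack",
    },
    # Second connection: a file-based Whoosh index (hypothetical alias).
    "whoosh": {
        "ENGINE": "haystack.backends.whoosh_backend.WhooshEngine",
        "PATH": os.path.join(BASE_DIR, "whoosh_index"),
    },
}

# Update the index as soon as a model instance is saved or deleted.
HAYSTACK_SIGNAL_PROCESSOR = "haystack.signals.RealtimeSignalProcessor"
```
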
  57. None
  58. None
  59. None
  60. None
  61. None
  62. • Create the index
      ◦ Run ./manage.py rebuild_index to create the new search index.
    • Update the index
      ◦ ./manage.py update_index will add new entries to the index.
      ◦ ./manage.py rebuild_index will recreate the index from scratch.
  63. Other solutions

  64. Other solutions
    • https://xapian.org
    • https://docs.djangoproject.com/en/1.11/ref/contrib/postgres/search/
    • https://www.postgresql.org/docs/9.6/static/textsearch.html

  65. pysolr

  66. None
  67. Conclusions

  68. • Elasticsearch's Query DSL syntax is really flexible and makes it pretty easy to write complex queries; other solutions don't have an equivalent
    • Elasticsearch is faster and more flexible than other solutions like PostgreSQL full-text search or Solr
    • Aggregations in ES for searching by category are another interesting feature that other solutions lack
    • Solr requires more configuration than ES
    • Whoosh is suitable for a small project; it has limited scalability for search and indexing
  69. Other tools

  70. None
  71. None
  72. None
  73. None
  74. None
  75. References
    • http://elasticsearch-py.readthedocs.io/en/master/
    • https://whoosh.readthedocs.io/en/latest
    • http://django-haystack.readthedocs.io/en/master/
    • http://solr-vs-elasticsearch.com/
    • https://wiki.apache.org/solr/SolPython
    • https://github.com/django-haystack/pysolr
  76. None
  77. None
  78. Thanks! jmortega.github.io @jmortegac