
Discovering Python search engines

jmortegac
October 09, 2017


Transcript

  1. Discovering Python search engines, by José Manuel Ortega

  2. • Introduction to search engines
     • Elasticsearch, Whoosh, django-haystack
     • Elasticsearch example
     • Other solutions & tools
     • Conclusions
  3. Search engines

  4. Search engines
     • Document based
     • A document is the unit of searching in a full-text search system.
     • A document can be a JSON object or a Python dictionary.
  5. Core concepts

  6. • Index: a named collection of documents that have similar characteristics (like a database)
     • Type: a logical partition of an index that contains documents with common fields (like a table)
     • Document: the basic unit of information (like a row)
     • Mapping: field properties (data type, token extraction), including information about how fields are stored in the index
  7. • Relevance: the algorithms used to rank the results based on the query
     • Corpus: the collection of all documents in the index
     • Segments: sharded data storing the inverted index; they allow the index to be searched efficiently
  8. Inverted index
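The inverted index named on this slide can be illustrated with a minimal sketch. It assumes a simple lowercase word tokenizer and three made-up documents; the function names are hypothetical, and real engines such as Lucene also store positions, frequencies and per-segment data.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately naive analyzer)."""
    return re.findall(r"\w+", text.lower())


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical sample corpus, used only for illustration.
docs = {
    1: "Elasticsearch is a search server",
    2: "Whoosh is a pure-Python search library",
    3: "Django is a web framework",
}
index = build_inverted_index(docs)
```

Looking up a term is then a dictionary access rather than a scan over every document, which is what makes searching the index efficient.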

  11. Elasticsearch

  14. • Open source search server based on Apache Lucene
      • Written in Java
      • Cross-platform
      • Communication with the search server is done through an HTTP REST API
      • curl -X<GET|POST|PUT|DELETE> http://localhost:9200/<index>/<type_document>/<id>
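The URL pattern above can also be built from Python. The following is a rough sketch using only the standard library; `build_request` is a hypothetical helper, and the index, type and id values are placeholders. It only constructs the request and does not contact a server.

```python
from urllib.request import Request


def build_request(method, index, doc_type, doc_id, host="http://localhost:9200"):
    """Build (but do not send) an HTTP request for a single document."""
    url = "{}/{}/{}/{}".format(host, index, doc_type, doc_id)
    return Request(url, method=method)


# Placeholder names; they mirror <index>/<type_document>/<id> in the curl pattern.
req = build_request("GET", "myindex", "mydocument", 1)
```

Passing `req` to `urllib.request.urlopen` would send it, assuming a server is listening on that host and port.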
  15. • You can add a document without creating an index first
      • Elasticsearch will create the index, mapping type and fields automatically
      • Elasticsearch will infer the data types from the document's data
  17. • TF-IDF (Term Frequency-Inverse Document Frequency)
      • TF-IDF = TF * IDF
      • TF = number of appearances of the term in a document
      • IDF = log(N / DF)
      • N = total_document_count
      • DF = number of documents in which the term appears
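The formulas on this slide can be worked through in a few lines. This sketch assumes already tokenized documents and a made-up corpus; it follows the slide's definitions literally, whereas the scoring Lucene applies in practice adds further normalization.

```python
import math

# Hypothetical corpus: each document is a list of tokens.
docs = [
    ["python", "search", "engine"],
    ["python", "web", "framework"],
    ["search", "search", "server"],
]


def tf(term, doc):
    """Term frequency: how many times the term appears in one document."""
    return doc.count(term)


def idf(term, docs):
    """Inverse document frequency: log(N / DF), or 0.0 if the term never appears."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df) if df else 0.0


def tf_idf(term, doc, docs):
    """TF-IDF = TF * IDF, as defined on the slide."""
    return tf(term, doc) * idf(term, docs)
```

A term that is frequent in one document but rare across the corpus scores highest, which is why `"engine"` outranks `"python"` for the first document.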
  19. Creating an index
      curl -XPUT 'localhost:9200/myindex' -d '{ "settings": {..}, "mappings": {..} }'
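The slide elides the contents of `settings` and `mappings`. As one possible illustration, the sketch below builds such a body in Python and serializes it to JSON. The shard counts and the `title` and `year` fields are assumptions, not taken from the talk, and the typed mapping layout matches the Elasticsearch versions current in 2017 (later versions removed mapping types).

```python
import json


def build_index_body(shards=1, replicas=0):
    """Return a request body for index creation (field names are hypothetical)."""
    return {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        },
        "mappings": {
            # "mydocument" is the mapping type used in the slides' URLs.
            "mydocument": {
                "properties": {
                    "title": {"type": "text"},
                    "year": {"type": "integer"},
                }
            }
        },
    }


body = build_index_body()
payload = json.dumps(body)  # the string that would follow curl's -d option
```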

  23. Searching a document
      curl -XGET 'localhost:9200/myindex/mydocument/_search?q=elasticSearch'
      Query DSL:
      curl -XGET 'localhost:9200/myindex/mydocument/_search?pretty' -d '{ "query": { "match": { "_all": "elasticSearch" } } }'
  25. Searching a document
      • Search can get much more complex
        ◦ Multiple terms
        ◦ Multi-match (match query on specific fields)
        ◦ Bool (true, false)
        ◦ Range
        ◦ RegExp
        ◦ GeoPoint, GeoShapes
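To make the query types listed above more concrete, here is a small sketch that combines a multi-match clause with an optional range filter inside a bool query. The helper name and the `year` field are hypothetical; the nested keys (`bool`, `must`, `filter`, `multi_match`, `range`, `gte`, `lte`) are standard Query DSL.

```python
def build_bool_query(text, fields, year_from=None, year_to=None):
    """Build a bool query: multi_match on several fields plus an optional year range."""
    query = {
        "bool": {
            "must": [
                {"multi_match": {"query": text, "fields": list(fields)}}
            ]
        }
    }
    year_range = {}
    if year_from is not None:
        year_range["gte"] = year_from
    if year_to is not None:
        year_range["lte"] = year_to
    if year_range:
        # A filter clause restricts the results without affecting the score.
        query["bool"]["filter"] = [{"range": {"year": year_range}}]
    return {"query": query}


body = build_bool_query("python search", ["title", "content"], year_from=2015)
```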
  26. Elasticsearch Python client
      • The official low-level client is elasticsearch-py
        ◦ pip install elasticsearch
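A minimal sketch of how the client might be used follows. The two helper functions are hypothetical and take the client as a parameter, so the block does not need a running server. The `doc_type` and `body` arguments reflect elasticsearch-py releases from around the time of the talk and are an assumption here, since newer releases deprecate or drop them.

```python
def index_document(es, doc_id, document, index="myindex", doc_type="mydocument"):
    """Store one document; `es` is expected to be an elasticsearch.Elasticsearch client."""
    return es.index(index=index, doc_type=doc_type, id=doc_id, body=document)


def search_titles(es, text, index="myindex"):
    """Run a match query on the (hypothetical) title field and return matching titles."""
    body = {"query": {"match": {"title": text}}}
    response = es.search(index=index, body=body)
    return [hit["_source"]["title"] for hit in response["hits"]["hits"]]


# With a server on localhost:9200 this could be driven as follows (assumption):
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch()
#   index_document(es, 1, {"title": "Discovering Python search engines"})
#   print(search_titles(es, "python"))
```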
  27. elasticsearch-py API

  28. elasticsearch-py API

  34. Geo queries
      • Elasticsearch supports two types of geo fields
        ◦ geo_point (lat, lon)
        ◦ geo_shape (points, lines, polygons)
      • Perform geographical searches
        ◦ Finding points of interest and GPS coordinates
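As an illustration of a geographical search, the sketch below builds a `geo_point` mapping fragment and a `geo_distance` filter as plain dictionaries. The `location` field name, the coordinates and the helper name are assumptions made for the example; `geo_point` and `geo_distance` are the standard Elasticsearch names.

```python
# Mapping fragment declaring a geo_point field (the field name is hypothetical).
location_mapping = {"properties": {"location": {"type": "geo_point"}}}


def build_geo_distance_query(lat, lon, distance="10km", field="location"):
    """Build a query matching documents within `distance` of the given point."""
    return {
        "query": {
            "bool": {
                "filter": {
                    "geo_distance": {
                        "distance": distance,
                        field: {"lat": lat, "lon": lon},
                    }
                }
            }
        }
    }


# Example coordinates, chosen arbitrarily.
body = build_geo_distance_query(40.4168, -3.7038, distance="5km")
```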
  37. https://github.com/jmortega/python_discover_search_engine

  48. Whoosh

  49. • Pure-Python full-text indexing and searching library
      • Library of classes and functions for indexing text and then searching the index
      • It allows you to develop custom search engines for your content
      • Mainly focused on index and search definition using schemas
      • Supports Python 2.5 and Python 3
  50. Schema

  51. Create index and insert document

  52. Searching single field

  53. Searching multiple fields

  54. Django-haystack

  56. • Multiple backends (you have a Solr & a Whoosh index, or a master Solr & a slave Solr, etc.)
      • An Elasticsearch backend
      • Big query improvements
      • Geospatial search (Solr & Elasticsearch only)
      • The addition of Signal Processors for better control
      • Input types for improved control over queries
      • Rich Content Extraction in Solr
  62. • Create the index
        ◦ Run ./manage.py rebuild_index to create the new search index.
      • Update the index
        ◦ ./manage.py update_index will add new entries to the index.
        ◦ ./manage.py rebuild_index will recreate the index from scratch.
  63. Other solutions

  64. Other solutions
      • https://xapian.org
      • https://docs.djangoproject.com/en/1.11/ref/contrib/postgres/search/
      • https://www.postgresql.org/docs/9.6/static/textsearch.html

  65. pysolr

  67. Conclusions

  68. • Elasticsearch's Query DSL syntax is really flexible and makes it pretty easy to write complex queries; other solutions don't have an equivalent
      • Elasticsearch is faster and more flexible than other solutions like PostgreSQL full-text search or Solr
      • Aggregations in ES for searching by category are another interesting feature that other solutions don't have
      • Solr requires more configuration than ES
      • Whoosh is suitable for a small project; it has limited scalability for search and indexing
  69. Other tools

  75. References
      • http://elasticsearch-py.readthedocs.io/en/master/
      • https://whoosh.readthedocs.io/en/latest
      • http://django-haystack.readthedocs.io/en/master/
      • http://solr-vs-elasticsearch.com/
      • https://wiki.apache.org/solr/SolPython
      • https://github.com/django-haystack/pysolr
  78. Thanks! jmortega.github.io @jmortegac