Slide 1

Slide 1 text

Discovering Python search engine
José Manuel Ortega

Slide 2

Slide 2 text

● Introduction to search engines
● ElasticSearch, Whoosh, django-haystack
● ElasticSearch example
● Other solutions & tools
● Conclusions

Slide 3

Slide 3 text

Search engines

Slide 4

Slide 4 text

Search engines
● Document based
● A document is the unit of searching in a full-text search system.
● A document can be a JSON object or a Python dictionary
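To make the "JSON or Python dictionary" point concrete, here is a minimal sketch using only the standard library. The field names and values are invented for illustration and do not come from the talk.

```python
import json

# A hypothetical document: field names and values are illustrative only.
document = {
    "title": "Discovering Python search engine",
    "author": "José Manuel Ortega",
    "tags": ["python", "search"],
}

# Serialize the dictionary to JSON, the form a search server receives over HTTP.
as_json = json.dumps(document, ensure_ascii=False)

# Parsing the JSON back yields an equal dictionary.
round_trip = json.loads(as_json)
```

The same dictionary can therefore be indexed by a pure-Python library or sent as a JSON body to a search server.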

Slide 5

Slide 5 text

Core concepts

Slide 6

Slide 6 text

● Index: named collection of documents that have similar characteristics (like a database)
● Type: logical partition of an index that contains documents with common fields (like a table)
● Document: basic unit of information (like a row)
● Mapping: field properties (datatype, token extraction). Includes information about how fields are stored in the index

Slide 7

Slide 7 text

● Relevance: the algorithms used to rank the results based on the query
● Corpus: the collection of all documents in the index
● Segments: sharded data storing the inverted index. They allow searching the index in an efficient way

Slide 8

Slide 8 text

Inverted index
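As a rough illustration of the idea, an inverted index maps each term to the documents that contain it, so a query only has to look up its own terms. The sketch below is a toy version with naive whitespace tokenization; the sample documents and the function names are assumptions made for this example, not how Lucene or Whoosh implement it.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Return the ids of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return []
    postings = [set(index.get(term, ())) for term in terms]
    return sorted(set.intersection(*postings))


# Hypothetical sample corpus.
docs = {
    1: "python search engine",
    2: "elasticsearch is a search server",
    3: "whoosh is a python library",
}
index = build_inverted_index(docs)
hits = search(index, "python search")
```

Real engines add analysis steps (stemming, stop words) and store positions and frequencies alongside each posting.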

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

ElasticSearch

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

● Open source search server based on Apache Lucene
● Written in Java
● Cross-platform
● Communication with the search server is done through an HTTP REST API
● curl -X http://localhost:9200///id

Slide 15

Slide 15 text

● You can add a document without creating an index
● ElasticSearch will create the index, mapping type and fields automatically
● ElasticSearch will infer the data types based on the document's data

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

● TF-IDF (Term Frequency-Inverse Document Frequency)
● TF-IDF = TF * IDF
● TF = number of appearances of the term in a document
● IDF = log(N / DF)
● N = total_document_count
● DF = number of documents in which the term appears
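The formula can be worked through on a tiny corpus. This is a minimal sketch under two assumptions: TF is counted within a single document (the usual convention), and the logarithm is the natural one, since the slide does not state a base. The corpus is invented, and real engines such as Lucene use refined variants of this score.

```python
import math


def tf_idf(term, doc_tokens, corpus):
    """Score `term` for one document: TF * log(N / DF)."""
    tf = doc_tokens.count(term)                          # appearances in this document
    df = sum(1 for tokens in corpus if term in tokens)   # documents containing the term
    if df == 0:
        return 0.0                                       # term absent from the corpus
    idf = math.log(len(corpus) / df)                     # N = len(corpus)
    return tf * idf


# Hypothetical, already tokenized corpus.
corpus = [
    ["python", "search", "engine"],
    ["python", "python", "library"],
    ["search", "server"],
]
score = tf_idf("python", corpus[1], corpus)  # TF = 2, DF = 2, N = 3
```

A term that appears in every document gets IDF = log(1) = 0, so it contributes nothing to the ranking.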

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

Creating an index
curl -XPUT 'localhost:9200/myindex' -d '{ "settings": {..}, "mappings": {..} }'
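The same PUT request can be prepared from Python with the standard library. In this sketch the settings and mappings contents are placeholders chosen for illustration (the slide elides them), the mapping layout depends on the Elasticsearch version, and the helper name is hypothetical. The request is only built, not sent, because sending requires a running server.

```python
import json
import urllib.request


def build_create_index_request(host, index_name, settings, mappings):
    """Prepare (but do not send) a PUT request that creates an index."""
    body = json.dumps({"settings": settings, "mappings": mappings}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}/{index_name}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Assumed example values; the original slide does not show them.
request = build_create_index_request(
    "localhost:9200",
    "myindex",
    settings={"number_of_shards": 1},
    mappings={"properties": {"title": {"type": "text"}}},
)
# urllib.request.urlopen(request) would send it to a running server.
```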

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

Searching a document
curl -XGET 'localhost:9200/myindex/mydocument/_search?q=elasticSearch'
Query DSL:
curl -XGET 'localhost:9200/myindex/mydocument/_search?pretty' -d '{ "query": { "match": { "_all": "elasticSearch" } } }'
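Both search forms can be assembled in Python before being sent. This sketch only builds the URL for the query-string form and the JSON body for the Query DSL form; the helper name is hypothetical, and the `_all` field is taken from the slide and may not exist in newer Elasticsearch versions.

```python
import json
from urllib.parse import urlencode

BASE_URL = "http://localhost:9200/myindex/mydocument/_search"

# Form 1: URI search, the query travels in the query string.
uri_search_url = f"{BASE_URL}?{urlencode({'q': 'elasticSearch'})}"


def match_query(field, text):
    """Form 2: Query DSL body for a full-text match on one field."""
    return {"query": {"match": {field: text}}}


body = match_query("_all", "elasticSearch")
payload = json.dumps(body)  # JSON string to send as the request body
```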

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

Searching a document
● Search can get much more complex
  ○ Multiple terms
  ○ Multi-match (match query on specific fields)
  ○ Bool (true, false)
  ○ Range
  ○ RegExp
  ○ GeoPoint, GeoShapes
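As a sketch of how these query types combine, the body below nests a multi_match clause and a range filter inside a bool query. The field names (title, description, year) and the helper are assumptions made for the example.

```python
def bool_query(must=None, filters=None):
    """Build a Query DSL body combining scored clauses and non-scored filters."""
    return {
        "query": {
            "bool": {
                "must": list(must or []),
                "filter": list(filters or []),
            }
        }
    }


# Hypothetical fields: "title", "description" and "year".
body = bool_query(
    must=[
        {"multi_match": {"query": "python search", "fields": ["title", "description"]}}
    ],
    filters=[{"range": {"year": {"gte": 2015, "lte": 2017}}}],
)
```

Clauses under "must" contribute to the relevance score, while clauses under "filter" only include or exclude documents.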

Slide 26

Slide 26 text

ElasticSearch Python client
● The official low-level client is elasticsearch-py
  ○ pip install elasticsearch

Slide 27

Slide 27 text

ElasticSearch-py API

Slide 28

Slide 28 text

ElasticSearch-py API

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

No content

Slide 31

Slide 31 text

No content

Slide 32

Slide 32 text

No content

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

Geo queries
● Elasticsearch supports two types of geo fields
  ○ geo_point (lat, lon)
  ○ geo_shape (points, lines, polygons)
● Perform geographical searches
  ○ Finding points of interest and GPS coordinates
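A point-of-interest search could look like the following sketch, which builds a geo_point mapping and a geo_distance filter as plain dictionaries. The field name "location", the coordinates and the helper are assumptions for illustration, and the mapping layout depends on the Elasticsearch version.

```python
def geo_distance_query(field, lat, lon, distance):
    """Build a Query DSL body matching documents within `distance` of a point."""
    return {
        "query": {
            "bool": {
                "filter": {
                    "geo_distance": {
                        "distance": distance,
                        field: {"lat": lat, "lon": lon},
                    }
                }
            }
        }
    }


# Hypothetical mapping: the "location" field is declared as a geo_point.
mapping = {"properties": {"location": {"type": "geo_point"}}}

# Documents within 10 km of an example coordinate.
body = geo_distance_query("location", 40.4168, -3.7038, "10km")
```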

Slide 35

Slide 35 text

No content

Slide 36

Slide 36 text

No content

Slide 37

Slide 37 text

https://github.com/jmortega/python_discover_search_engine

Slide 38

Slide 38 text

No content

Slide 39

Slide 39 text

No content

Slide 40

Slide 40 text

No content

Slide 41

Slide 41 text

No content

Slide 42

Slide 42 text

No content

Slide 43

Slide 43 text

No content

Slide 44

Slide 44 text

No content

Slide 45

Slide 45 text

No content

Slide 46

Slide 46 text

No content

Slide 47

Slide 47 text

No content

Slide 48

Slide 48 text

Whoosh

Slide 49

Slide 49 text

● Pure-Python full-text indexing and searching library
● Library of classes and functions for indexing text and then searching the index.
● It allows you to develop custom search engines for your content.
● Mainly focused on index and search definition using schemas
● Supports Python 2.5 and Python 3

Slide 50

Slide 50 text

Schema

Slide 51

Slide 51 text

Create index and insert document

Slide 52

Slide 52 text

Searching single field

Slide 53

Slide 53 text

Searching multiple fields

Slide 54

Slide 54 text

Django-haystack

Slide 55

Slide 55 text

No content

Slide 56

Slide 56 text

● Multiple backends (you have a Solr & a Whoosh index, or a master Solr & a slave Solr, etc.)
● An Elasticsearch backend
● Big query improvements
● Geospatial search (Solr & Elasticsearch only)
● The addition of Signal Processors for better control
● Input types for improved control over queries
● Rich Content Extraction in Solr

Slide 57

Slide 57 text

No content

Slide 58

Slide 58 text

No content

Slide 59

Slide 59 text

No content

Slide 60

Slide 60 text

No content

Slide 61

Slide 61 text

No content

Slide 62

Slide 62 text

● Create the index
  ○ Run ./manage.py rebuild_index to create the new search index.
● Update the index
  ○ ./manage.py update_index will add new entries to the index.
  ○ ./manage.py rebuild_index will recreate the index from scratch.

Slide 63

Slide 63 text

Other solutions

Slide 64

Slide 64 text

Other solutions
● https://xapian.org
● https://docs.djangoproject.com/en/1.11/ref/contrib/postgres/search/
● https://www.postgresql.org/docs/9.6/static/textsearch.html

Slide 65

Slide 65 text

pysolr

Slide 66

Slide 66 text

No content

Slide 67

Slide 67 text

Conclusions

Slide 68

Slide 68 text

● Elasticsearch's Query DSL syntax is very flexible and makes complex queries fairly easy to write; the other solutions do not have an equivalent
● Elasticsearch is faster and more flexible than other solutions such as PostgreSQL full-text search or Solr
● Aggregations in ES, useful for searching by category, are another interesting feature that the other solutions lack
● Solr requires more configuration than ES
● Whoosh is suitable for a small project. Limited scalability for search and indexing.

Slide 69

Slide 69 text

Other tools

Slide 70

Slide 70 text

No content

Slide 71

Slide 71 text

No content

Slide 72

Slide 72 text

No content

Slide 73

Slide 73 text

No content

Slide 74

Slide 74 text

No content

Slide 75

Slide 75 text

References
● http://elasticsearch-py.readthedocs.io/en/master/
● https://whoosh.readthedocs.io/en/latest
● http://django-haystack.readthedocs.io/en/master/
● http://solr-vs-elasticsearch.com/
● https://wiki.apache.org/solr/SolPython
● https://github.com/django-haystack/pysolr

Slide 76

Slide 76 text

No content

Slide 77

Slide 77 text

No content

Slide 78

Slide 78 text

Thanks! jmortega.github.io @jmortegac