Search Engines using Python and Elasticsearch

Slide 1

Slide 1 text

Slide 2

Slide 2 text

ABOUT ME https://spekerdeck.com/pattyvader/pybr12

Slide 3

Slide 3 text

Athena: my project https://github.com/pattyvader/athena

Slide 4

Slide 4 text

Roadmap
1. Documents search
2. Indexing
3. Crawling the web

Slide 5

Slide 5 text

Documents search

Slide 6

Slide 6 text

Documents search Store result in Elasticsearch Search term

Slide 7

Slide 7 text

Documents search Django server GET Browser (HTML)

Slide 8

Slide 8 text

Django server Documents search GET Browser (HTML)

Slide 9

Slide 9 text

Documents search Django server

Slide 10

Slide 10 text

Documents search: in views.py, the Django server calls the “search” function.

Slide 11

Slide 11 text

Documents search: in views.py, the Django server calls the “search” function, which invokes the “search_term” function.
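The slides name the “search” view and the “search_term” function but do not show their bodies. As a minimal sketch of what such a function might send to Elasticsearch, the helper below builds a full-text query body. The helper name `build_search_body` and the field names `title` and `description` are assumptions made for illustration; they are not taken from the Athena code.

```python
def build_search_body(term, fields=("title", "description"), size=10):
    """Build an Elasticsearch query body for a full-text search.

    `fields` holds hypothetical field names; replace them with the
    fields that actually exist in your index.
    """
    term = term.strip()
    if not term:
        # An empty search box should match nothing rather than everything.
        return {"size": size, "query": {"match_none": {}}}
    return {
        "size": size,
        "query": {
            # multi_match runs the same text query against several fields.
            "multi_match": {
                "query": term,
                "fields": list(fields),
            }
        },
    }
```

A Django view could pass the resulting dictionary to the Elasticsearch client's search call and render the returned hits in the HTML template.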

Slide 12

Slide 12 text

Indexing https://www.elastic.co/downloads/elasticsearch Relevance score, RESTful protocol, JSON messages

Slide 13

Slide 13 text

Indexing Relevance score

Slide 14

Slide 14 text

Indexing RESTful/JSON: PUT, GET
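To make the PUT and GET idea concrete, here is a small sketch that only builds the two HTTP requests, without sending them. It assumes a node at `http://localhost:9200` (the default local address) and the `/<index>/_doc/<id>` endpoint used by recent Elasticsearch versions; older versions used a type name in place of `_doc`. The index name `pages` in the usage note is hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import Request

BASE_URL = "http://localhost:9200"  # assumption: default local Elasticsearch address


def index_request(index, doc_id, document):
    """Build a PUT request that stores a JSON document at /<index>/_doc/<id>."""
    return Request(
        url=f"{BASE_URL}/{quote(index)}/_doc/{quote(str(doc_id), safe='')}",
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def search_request(index, term):
    """Build a GET request for /<index>/_search?q=<term> (URI search)."""
    return Request(
        url=f"{BASE_URL}/{quote(index)}/_search?q={quote(term)}",
        method="GET",
    )
```

Passing either object to `urllib.request.urlopen` would send it, for example `urlopen(index_request("pages", 1, {"title": "Home"}))`, and the response body would come back as JSON.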

Slide 15

Slide 15 text

Indexing https://www.elastic.co/use-cases

Slide 16

Slide 16 text

Indexing indexer.py
In the indexing stage, we use a library called elasticsearch-py to integrate Elasticsearch and Python.
https://pypi.python.org/pypi/elasticsearch
https://elasticsearch-py.readthedocs.io/en/master/
https://github.com/elastic/elasticsearch-py
The code creates the index and adds a new document to the index.
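The two operations named on this slide, creating the index and adding a document, might look roughly like the sketch below. It is not the code from indexer.py. The functions take the client as a parameter, and the `document=` keyword is the name used by recent elasticsearch-py releases; older releases use `body=` instead, so treat the exact keyword as an assumption to check against your installed version.

```python
def ensure_index(es, index_name):
    """Create the index if it does not exist yet."""
    if not es.indices.exists(index=index_name):
        es.indices.create(index=index_name)


def add_document(es, index_name, doc_id, document):
    """Add one document to the index, overwriting any document with the same id."""
    return es.index(index=index_name, id=doc_id, document=document)


# Usage, assuming the elasticsearch package is installed and a node is running locally:
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch("http://localhost:9200")
#   ensure_index(es, "pages")  # "pages" is a hypothetical index name
#   add_document(es, "pages", "https://example.com/", {"title": "Example"})
```

Using the page URL as the document id is one possible design choice: re-crawling the same page then overwrites the old document instead of creating a duplicate.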

Slide 17

Slide 17 text

Indexing scraper.py The scraper extracts data from HTML files using the BeautifulSoup library. https://www.crummy.com/software/BeautifulSoup/

Slide 18

Slide 18 text

Indexing scraper.py HTML meta tags
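As an illustration of pulling the title and meta tags out of a page, the sketch below uses Python's standard-library `html.parser` as a stand-in for BeautifulSoup, so that it runs without extra packages. It is not the code from scraper.py, and the names `MetaTagParser` and `scrape` are made up for this example.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect the <title> text and the name/content pairs of <meta> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Accept both name="description" and property="og:title" styles.
            key = attrs.get("name") or attrs.get("property")
            if key and attrs.get("content"):
                self.meta[key.lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def scrape(html):
    """Return a dict with the page title plus every meta tag that was found."""
    parser = MetaTagParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), **parser.meta}
```

The returned dictionary has the shape of a JSON document, so it could be handed to the indexing step unchanged.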

Slide 19

Slide 19 text

Crawling the web crawler.py

Slide 20

Slide 20 text

Crawling the web
1. Read robots.txt
2. Download HTML
3. Extract new links
4. Extract data
5. Insert data in Elasticsearch
Files: crawler.py, scraper.py, indexer.py
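The five steps can be tied together in a single loop. The sketch below is one possible arrangement, not the code from crawler.py: each step is passed in as a function, so all parameter names (`is_allowed`, `fetch`, `extract_links`, `scrape`, `index`) and the `max_pages` limit are assumptions made for this example.

```python
from collections import deque


def crawl(seed_url, is_allowed, fetch, extract_links, scrape, index, max_pages=50):
    """Breadth-first crawl starting at seed_url; returns the URLs that were indexed."""
    queue = deque([seed_url])
    seen = {seed_url}
    indexed = []
    while queue and len(indexed) < max_pages:
        url = queue.popleft()
        if not is_allowed(url):                 # 1. check robots.txt
            continue
        html = fetch(url)                       # 2. download HTML
        for link in extract_links(html, url):   # 3. extract new links
            if link not in seen:
                seen.add(link)
                queue.append(link)
        index(url, scrape(html))                # 4. extract data, 5. insert it
        indexed.append(url)
    return indexed
```

The `seen` set keeps the crawler from visiting the same page twice, and `max_pages` bounds the run so that a large site cannot keep it going indefinitely.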

Slide 21

Slide 21 text

Crawling the web Before crawling a site, always check its robots.txt file for permission. This is good practice. crawler.py
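A permission check of this kind can be sketched with the standard library's `urllib.robotparser` (in Python 2 the module is called `robotparser`). This is an illustration, not the code from crawler.py; the user agent string `athena-crawler` and the function names are made up for the example.

```python
from urllib.robotparser import RobotFileParser


def build_robot_rules(robots_txt):
    """Parse the text of a robots.txt file into a rule set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(rules, url, user_agent="athena-crawler"):
    """Return True if the rules let this user agent fetch the URL."""
    return rules.can_fetch(user_agent, url)
```

For a live site, `RobotFileParser` can also download the file itself: call `set_url()` with the address of the site's robots.txt, then `read()`.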

Slide 22

Slide 22 text

Crawling the web The “urllib2” library has functions that help download the HTML for a given URL. crawler.py
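“urllib2” is a Python 2 module; in Python 3 the same functionality lives in `urllib.request`. A minimal download helper might look like the sketch below. The function name `fetch_html`, the 10-second timeout, and the UTF-8 fallback are assumptions made for this example.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Download a page and return its body as text."""
    with urlopen(url, timeout=timeout) as response:
        # Use the charset declared in the Content-Type header, if there is one.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

A real crawler would usually also catch `urllib.error.URLError`, so that one unreachable page does not stop the whole crawl.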

Slide 23

Slide 23 text

Crawling the web crawler.py New links are extracted only if they belong to the same domain as the seed URL.
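One way to keep the crawl inside the seed domain is to resolve every link against the current page and compare host names. The sketch below uses only the standard library and is an illustration rather than the crawler.py implementation; `extract_links` and its parameters are names chosen for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class _LinkParser(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links(html, page_url, seed_url):
    """Return absolute, de-duplicated links that stay on the seed URL's host."""
    seed_host = urlparse(seed_url).netloc.lower()
    parser = _LinkParser()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        # Resolve relative links against the page and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlparse(absolute)
        same_host = parts.netloc.lower() == seed_host
        if parts.scheme in ("http", "https") and same_host and absolute not in links:
            links.append(absolute)
    return links
```

Comparing `netloc` values treats `www.example.com` and `example.com` as different hosts; whether that is the desired behavior depends on the site being crawled.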

Slide 24

Slide 24 text

Crawling the web Extracting data scraper.py crawler.py

Slide 25

Slide 25 text

Crawling the web Indexing documents in Elasticsearch. indexer.py crawler.py

Slide 26

Slide 26 text

Summing up... Browser (HTML) GET Django server scraper.py crawler.py indexer.py GET internet

Slide 27

Slide 27 text

Contact
https://github.com/pattyvader
https://br.linkedin.com/in/patricia-regina-18790040
[email protected]
Designed by Freepik from www.flaticon.com