Search Engines using Python and Elasticsearch

Patty Vader
December 03, 2017

Transcript

  1. Indexing (indexer.py): In the indexing stage, we use a library called Elasticsearch-py to integrate Elasticsearch and Python. The script creates the index and adds a new document to the index.

    https://pypi.python.org/pypi/elasticsearch
    https://elasticsearch-py.readthedocs.io/en/master/
    https://github.com/elastic/elasticsearch-py
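The two operations named on this slide (creating the index and adding a document) might look roughly like the sketch below. The helper names `ensure_index` and `add_document` are hypothetical and are not taken from the original indexer.py. The sketch assumes a recent Elasticsearch-py client, where `index()` takes a `document=` argument; older client versions use `body=` instead and may also require a `doc_type`.

```python
# Hypothetical helpers, not the original indexer.py.
# `es` is expected to be an elasticsearch.Elasticsearch client, for example:
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch("http://localhost:9200")  # assumed local node


def ensure_index(es, index_name):
    """Create the index if it does not exist; return True if it was created."""
    if es.indices.exists(index=index_name):
        return False
    es.indices.create(index=index_name)
    return True


def add_document(es, index_name, doc_id, document):
    """Add (or overwrite) one document stored under the given id."""
    # Assumption: recent client versions accept `document=`;
    # older ones expect `body=` here.
    return es.index(index=index_name, id=doc_id, document=document)
```

Using the page URL as `doc_id` is one way to keep a re-crawled page from being indexed twice, since indexing the same id again overwrites the earlier document.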
  2. Indexing (scraper.py): The scraper extracts data from HTML files using the BeautifulSoup library.

    https://www.crummy.com/software/BeautifulSoup/
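As an illustration of the extraction step, here is a minimal sketch that pulls a title and the readable text out of an HTML string with BeautifulSoup. The function name `scrape` and the choice of fields are assumptions. The slide does not say which data the original scraper.py extracts.

```python
from bs4 import BeautifulSoup


def scrape(html):
    """Return the page title and the readable text of an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    # Remove script and style elements so their contents do not end up in the text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Join all remaining text fragments with single spaces.
    text = soup.get_text(separator=" ", strip=True)
    return {"title": title, "text": text}
```

The returned dictionary can be passed directly to Elasticsearch as a document, which is why the sketch returns a plain dict rather than a custom class.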
  3. Crawling the web: The crawl runs in five steps, split across crawler.py, scraper.py and indexer.py:

    1. Read robots.txt
    2. Download HTML
    3. Extract new links
    4. Extract data
    5. Insert data in Elasticsearch
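The steps on this slide can be tied together in a single loop. The following is only a sketch under stated assumptions: the function `crawl` and its four callable parameters (`is_allowed`, `fetch`, `scrape`, `index`) are hypothetical stand-ins for whatever crawler.py, scraper.py and indexer.py actually provide. Link extraction here uses the standard library's `html.parser`, which is not necessarily what the original code does.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html):
    """Return absolute URLs for all links found in the HTML."""
    parser = LinkParser()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.hrefs]


def crawl(start_url, is_allowed, fetch, scrape, index, max_pages=10):
    """Breadth-first crawl; the four callables are supplied by the caller."""
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if not is_allowed(url):                 # step 1: robots.txt check
            continue
        html = fetch(url)                       # step 2: download HTML
        for link in extract_links(url, html):   # step 3: extract new links
            if link not in seen:
                seen.add(link)
                queue.append(link)
        index(url, scrape(html))                # steps 4 and 5: extract, insert
        visited.append(url)
    return visited
```

The `seen` set keeps the crawler from queueing the same URL twice, and `max_pages` bounds the run. This sketch follows links to any host; a real crawler would usually also restrict the crawl to chosen domains.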
  4. Crawling the web (crawler.py): Before crawling a site, always check the robots.txt file for permission. It is good practice.
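One way to perform this check is with `urllib.robotparser` from the Python 3 standard library. This is a sketch, not the code from crawler.py: the helper names and the user agent string `"example-crawler"` are made up for illustration. The rules are parsed from a string here so the example runs offline.

```python
from urllib.robotparser import RobotFileParser


def build_robot_parser(robots_txt):
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, url, user_agent="example-crawler"):
    """Return True if the parsed rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


# Example rules: everything under /private/ is off limits to all crawlers.
rules = build_robot_parser("User-agent: *\nDisallow: /private/\n")
print(is_allowed(rules, "http://example.com/index.html"))
print(is_allowed(rules, "http://example.com/private/page.html"))
```

Against a live site, `RobotFileParser` can fetch the file itself: call `set_url()` with the address of the site's robots.txt and then `read()`.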
  5. Crawling the web (crawler.py): The library “urllib2” has functions that help to retrieve the HTML for a given URL.
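The download step could be sketched as follows. Note that `urllib2` exists only in Python 2; in Python 3 its functionality lives in `urllib.request`, which this sketch uses. The function name `download_html`, the default user agent string and the timeout value are assumptions for illustration.

```python
from urllib.request import Request, urlopen


def download_html(url, user_agent="example-crawler/0.1", timeout=10):
    """Fetch a URL and return its body decoded as text."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=timeout) as response:
        # Use the charset declared in the Content-Type header, if any.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Decoding with `errors="replace"` keeps one badly encoded page from stopping the crawl, at the cost of a few substituted characters in the text.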