Developing a Crawler API with Scrapy and Klein - PyCon Colombia 2020

Betina Costa
February 09, 2020

Today we will develop an API to search for quotes by tag on the site http://quotes.toscrape.com/. Our API should receive a tag as a parameter, scrape the page, and return JSON containing a list of the quotes and authors that belong to that tag.

Transcript

  1. DEVELOPING A CRAWLER
    API WITH SCRAPY AND
    KLEIN
    PYCON COLOMBIA 2020
    BETINA COSTA

  2. Betina Costa
    SOFTWARE ENGINEER
    BRAZILIAN
    SPEAKER
    CRAZY CAT LADY
    POLE DANCER
    INTROVERT
    MOVIE GEEK

  3. Tutorial Goal
    Today we will develop an API to search
    for quotes by tag on the site
    http://quotes.toscrape.com/. Our API
    should receive a tag as a parameter,
    scrape the page, and return JSON
    containing a list of the quotes and
    authors that belong to that tag.
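
To make the goal concrete, here is a minimal sketch of the kind of JSON document such an API could return. The field names (`tag`, `quotes`, `text`, `author`), the helper `build_response`, and the sample values are assumptions made for illustration; the slides do not specify the response format.

```python
import json


def build_response(tag, quotes):
    """Wrap scraped (text, author) pairs in a JSON document.

    The field names are hypothetical; the talk only says the API returns
    a list of quotes and authors for the requested tag.
    """
    payload = {
        "tag": tag,
        "quotes": [{"text": text, "author": author} for text, author in quotes],
    }
    return json.dumps(payload, indent=2)


if __name__ == "__main__":
    # Placeholder data standing in for whatever the spider scrapes.
    sample = [("Sample quote text.", "Sample Author")]
    print(build_response("change", sample))
```

Keeping the response a plain list of small dictionaries means the spider's items can be serialized as they are, without an intermediate model.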

  4. Workshop Summary
    Points to Cover
    Introduction and Setup
    Scrapy Spiders and Selectors
    Building the Spider Exercise
    Handle Scrapy async behaviour with Klein
    Building the API Exercise
    Wrapping up and Questions

  5. System Requirements
    PYTHON 3
    PIPENV
    $ pip install pipenv

  6. What is Scrapy?
    SCRAPY IS A FREE, OPEN SOURCE WEB-
    CRAWLING FRAMEWORK WRITTEN IN
    PYTHON.
    It is currently maintained by
    Scrapinghub, a web-scraping
    development and services company.

  7. Why Scrapy?
    It's open source and free to use;
    It's easy to build and scale;
    It has a tool called Selector for data
    extraction;
    Handles calls asynchronously and
    quickly;

  9. SPIDERS AND SELECTORS
    Let's dive into some HTML and CSS... Please, don't run away

  10. Spiders and Selectors
    SPIDERS
    Spiders are classes that we define and that
    Scrapy uses to crawl information on websites.
    SELECTORS
    Scrapy comes with its own mechanism for
    extracting data. They’re called selectors because
    they “select” certain parts of the HTML
    document specified either
    by XPath or CSS expressions.

  11. PYCON COLOMBIA

  12. PYCON COLOMBIA

  13. PYCON COLOMBIA
    "change"

  14. LET'S GET TO WORK!
    http://bit.ly/workshop_py2020

  15. HANDLING ASYNC BEHAVIOUR
    With Klein \o/

  16. Why Klein?
    KLEIN IS A MICRO-FRAMEWORK FOR
    DEVELOPING PRODUCTION-READY WEB
    SERVICES WITH PYTHON.
    It’s built on widely used and well-tested
    components like Werkzeug and
    Twisted, and has near-complete test
    coverage.

  17. Why Klein?
    TWISTED
    Twisted is an event-driven networking
    engine written in Python.

  18. Why Klein?
    REMEMBER THAT SCRAPY HANDLES
    CALLS ASYNCHRONOUSLY?
    For that reason, it usually doesn't
    work well with frameworks that
    handle requests synchronously.
    But Klein can help with that!

  19. BACK TO WORK!
    http://bit.ly/workshop_py2020
    WE WILL HAVE LUNCH SOON, STAY WITH ME

  20. THANK YOU!