Slide 1

Web Scraping with Scrapy
Sigit Dewanto, 27 February 2021 @Ngalam Backend Community

Slide 2

About me
Tukang Koding Indonesia @ Zyte (2014 - now)
○ Developing and maintaining Scrapy spiders
Community Organizer @ PythonID Jogja (2019 - now)
My Web Data Extraction Research Projects
● https://github.com/seagatesoft/webdext
○ Implementation of the AutoRM & DAG-MTM algorithms
○ XPath wrapper induction
● https://github.com/seagatesoft/sde
○ An implementation of the DEPTA algorithm

Slide 3

Talk outline
● Introduction to web scraping and Scrapy
● Demo

Slide 4

Intro to Web Scraping & Scrapy

Slide 5

Why web scraping?
● The web contains a huge amount of data: products, articles, job postings, etc.
● These data are presented as web pages (HTML) and are intended to be consumed by humans.
● The data need to be extracted from the web pages before they can be processed by a computer program.

Slide 6

Web crawling: how to get the target web pages

Slide 7

Web data extraction
{
  "title": "The Da Vinci Code",
  "author": "Robert Langdon",
  "price": "22.96",
  "stock": 3,
  "rating": 2.0,
  ...
}

Slide 8

Scrapy
● Why use Scrapy?
○ As a framework, Scrapy provides us with a battle-tested software architecture for web scraping
○ Provides an abstraction on top of Twisted
○ Many built-in features: request scheduler, cookie handling, redirects, proxies, caching, downloading files and images, storing data to S3/FTP, etc.
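Many of these built-in features are switched on through project settings rather than code. Below is a minimal sketch of a settings.py, assuming a standard Scrapy project; the bucket name and image directory are placeholder values.

# settings.py (sketch; values are placeholders)
HTTPCACHE_ENABLED = True    # cache downloaded responses on disk
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,    # download and store images
}
IMAGES_STORE = "images"    # directory where ImagesPipeline saves files
FEEDS = {
    "s3://my-bucket/books.jl": {"format": "jsonlines"},    # export items to S3 (needs botocore)
}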

Slide 9

Working with HTTP
● To get the target web page, our crawler needs to communicate with the web server using the Hypertext Transfer Protocol (HTTP)
● Scrapy provides the scrapy.http module to deal with HTTP
● With Scrapy, we write a Spider that sends scrapy.http.Request objects and receives scrapy.http.Response objects
● Scrapy manages multiple HTTP requests/responses asynchronously
[Diagram: crawler/spider exchanges HTTP requests and responses with the website]
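A minimal sketch of this request/response cycle, with a hypothetical spider name; the callback receives the scrapy.http.Response for the Request it scheduled.

from scrapy import Request, Spider

class HttpDemoSpider(Spider):
    name = "http_demo"    # hypothetical name

    def start_requests(self):
        # the spider sends scrapy.http.Request objects...
        yield Request("https://books.toscrape.com", callback=self.parse)

    def parse(self, response):
        # ...and Scrapy delivers the matching scrapy.http.Response to the callback
        self.logger.info("%s returned status %d (%d bytes)",
                         response.url, response.status, len(response.body))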

Slide 10

Extracting Data from a Web Page
● HTML => parsel.Selector
○ CSS selectors
○ XPath
● String
○ regular expressions
● JSON
○ JMESPath
● JavaScript
○ js2xml
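A short sketch of the first two options, using parsel directly on a made-up HTML snippet; the markup here is invented for illustration.

from parsel import Selector

html = '<div class="book"><h3>The Da Vinci Code</h3><p class="price">£22.96</p></div>'
sel = Selector(text=html)

sel.css("div.book > h3::text").get()                # CSS selector -> 'The Da Vinci Code'
sel.xpath('//div[@class="book"]/h3/text()').get()   # XPath        -> 'The Da Vinci Code'
sel.css("p.price::text").re_first(r"[\d.]+")        # regex on top -> '22.96'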

Slide 11

Demo

Slide 12

Demo
We will scrape books data from http://books.toscrape.com
1. Install Scrapy
2. Inspect web pages using the Scrapy shell
3. Create a Scrapy spider to extract the list of books from http://books.toscrape.com
4. Run the Scrapy spider
5. Crawl pagination
6. Crawl detail pages
7. Extract data from detail pages
8. [Scrapy best practices: using Item, ItemLoader, and LinkExtractor]
9. [Request all pagination at once]
10. [Add breadcrumbs to the Book item]

Slide 13

1. Install Scrapy
● pip
○ pip install Scrapy
● conda
○ conda install -c conda-forge scrapy
● Encountering issues? See https://doc.scrapy.org/en/latest/intro/install.html

Slide 14

2. Inspect web pages using the Scrapy shell

scrapy shell "https://books.toscrape.com"

# open the HTML page in a web browser
>>> view(response)
# get the list of book elements
>>> response.css("ol.row > li")
# verify the number of elements
>>> len(response.css("ol.row > li"))
>>> books = response.css("ol.row > li")
# get the list of book titles
>>> books.css("h3 > a::attr(title)").extract()
>>> len(books.css("h3 > a::attr(title)").extract())

Slide 15

3. Create a Scrapy spider

from scrapy import Spider


class BookSpider(Spider):
    name = 'books_toscrape_com'
    allowed_domains = ['toscrape.com']
    start_urls = ['https://books.toscrape.com']

    def parse(self, response):
        book_elements = response.css("ol.row > li")
        book_items = []
        ...
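The body of parse() is left unfinished on the slide. One way it could be completed, reusing the list-page selectors from the Scrapy shell session above; the price selector is an assumption about the page's markup.

    def parse(self, response):
        book_elements = response.css("ol.row > li")
        for book in book_elements:
            # yield one item per book element on the listing page
            yield {
                "title": book.css("h3 > a::attr(title)").get(),
                "price": book.css("p.price_color::text").get(),    # assumed selector
            }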

Slide 16

4. Run the Scrapy spider

# run the spider in books_toscrape_com.py, store items to the file books.jl
# in JSON Lines format, and store the log to books.txt
scrapy runspider books_toscrape_com.py -o books.jl --logfile=books.txt

Slide 17

5. Crawl pagination

from scrapy import Request

def request_next_page(self, response):
    next_url = response.css("li.next > a::attr(href)").get()
    if next_url:
        next_url = response.urljoin(next_url)
        return Request(next_url, callback=self.parse)

def parse(self, response):
    # extract list of books
    ...
    yield self.request_next_page(response)

Slide 18

6. Crawl detail pages

from scrapy import Request

def parse(self, response):
    detail_urls = response.css("ol.row > li h3 > a::attr(href)").getall()
    for detail_url in detail_urls:
        yield Request(
            response.urljoin(detail_url),
            callback=self.parse_detail,
        )
    yield self.request_next_page(response)

Slide 19

7. Extract data from detail pages

def parse_detail(self, response):
    book_item = {}
    book_item['title'] = response.css("").get()
    book_item['price'] = response.css("").get()
    book_item['stock'] = response.css("").get()
    book_item['rating'] = response.css("").get()
    ...
    return book_item
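The CSS selectors are left blank on the slide so they can be filled in during the demo. A sketch with selectors that should match the books.toscrape.com detail pages; treat them as assumptions and verify them in the Scrapy shell.

def parse_detail(self, response):
    book_item = {}
    # the selectors below are assumptions about books.toscrape.com markup
    book_item['title'] = response.css("div.product_main > h1::text").get()
    book_item['price'] = response.css("p.price_color::text").get()
    # availability text looks like "In stock (22 available)"; keep only the number
    book_item['stock'] = response.css("p.availability::text").re_first(r"\d+")
    # the rating is encoded in the class attribute, e.g. "star-rating Three"
    book_item['rating'] = response.css("p.star-rating::attr(class)").get()
    return book_item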

Slide 20

What’s next?
8. Scrapy best practices (see the sketch below)
a. Create a project
b. Item, ItemLoader, input processors, output processors
c. LinkExtractor
9. Request all pagination at once
10. Learn XPath: it is more powerful than CSS selectors
11. Want more of a challenge? Try https://quotes.toscrape.com
12. Scrapy pipelines, middlewares, and extensions
13. Need browser rendering? Try Splash: https://github.com/scrapinghub/splash
14. Need a cloud platform for Scrapy spiders? Try https://app.scrapinghub.com
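For item 8, a rough sketch of how Item, ItemLoader (with input/output processors), and LinkExtractor fit together, reusing the book fields from the demo; the processors and selectors here are illustrative assumptions, not the demo's actual code.

from scrapy import Field, Item, Spider
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from itemloaders.processors import MapCompose, TakeFirst

class BookItem(Item):
    title = Field()
    price = Field()

class BookLoader(ItemLoader):
    default_item_class = BookItem
    default_output_processor = TakeFirst()    # output processor: keep the first value
    price_in = MapCompose(str.strip, lambda p: p.lstrip("£"))    # input processor for price

class BooksSpider(Spider):
    name = "books_best_practices"    # hypothetical name
    start_urls = ["https://books.toscrape.com"]
    link_extractor = LinkExtractor(restrict_css="ol.row > li h3")    # links to detail pages

    def parse(self, response):
        for link in self.link_extractor.extract_links(response):
            yield response.follow(link.url, callback=self.parse_detail)

    def parse_detail(self, response):
        loader = BookLoader(response=response)
        loader.add_css("title", "div.product_main > h1::text")
        loader.add_css("price", "p.price_color::text")
        yield loader.load_item()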

Slide 21

Resources
● Scrapy documentation: https://doc.scrapy.org
● Demo source code: https://gist.github.com/seagatesoft/f98c9f6a5acec819deff3448cdd4da11

Slide 22

Thank you!
Sigit Dewanto, 27 February 2021
https://twitter.com/seagatesoft
https://github.com/seagatesoft
https://id.linkedin.com/in/sigitdewanto
We are hiring! http://zyte.com/jobs