
Web Scraping with Scrapy

Hi, friends! 🙌
How have your days at home been? Getting more productive, or just lounging around? We hope you are staying productive. Keep up the spirit! 😆

This is the slide deck from our first speaker:
⭐ Sigit Dewanto, Python Developer at Zyte (formerly Scrapinghub), presenting "Web Scraping with Scrapy"

Ngalam Backend Community

April 21, 2021

Transcript

  1. About me
     • Coder ("Tukang Koding") from Indonesia @ Zyte (2014 - now): developing and maintaining Scrapy spiders
     • Community Organizer @ PythonID Jogja (2019 - now)
     • My web data extraction research projects:
       ◦ https://github.com/seagatesoft/webdext: implementation of the AutoRM & DAG-MTM algorithms, XPath wrapper induction
       ◦ https://github.com/seagatesoft/sde: an implementation of the DEPTA algorithm
  2. Why web scraping?
     • The web contains a huge amount of data: products, articles, job postings, etc.
     • Those data are presented as web pages (HTML) and are intended to be consumed by humans.
     • Those data need to be extracted from the web pages before they can be processed by a computer program.
  3. Web data extraction
     {
         "title": "The Da Vinci Code",
         "author": "Robert Langdon",
         "price": "22.96",
         "stock": 3,
         "rating": 2.0,
         ...
     }
  4. Scrapy
     • Why use Scrapy?
       ◦ As a framework, Scrapy provides us with a battle-tested software architecture for web scraping.
       ◦ It provides an abstraction on top of Twisted.
       ◦ Many built-in features: request scheduler, cookie handling, redirects, proxies, caching, downloading files and images, storing data to S3/FTP, etc.
  5. Working with HTTP
     • To get the target web page, our crawler needs to communicate with the web server using the Hypertext Transfer Protocol (HTTP).
     • Scrapy provides the scrapy.http module to deal with HTTP.
     • In Scrapy, we write a Spider that sends scrapy.http.Request and receives scrapy.http.Response (see the sketch below).
     • Scrapy manages multiple HTTP requests/responses asynchronously.
     [diagram: crawler/spider exchanging HTTP requests and responses with the website]
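
     A minimal sketch of that Request/Response flow, assuming an illustrative spider name and URLs; response.url and response.status are standard scrapy.http.Response attributes.

     from scrapy import Request, Spider

     class HttpDemoSpider(Spider):
         name = "http_demo"
         start_urls = ["https://books.toscrape.com"]

         def parse(self, response):
             # response is the scrapy.http.Response for the scrapy.http.Request that Scrapy sent
             self.logger.info("Fetched %s with status %s", response.url, response.status)
             # yield another Request; Scrapy's scheduler sends it asynchronously
             yield Request("https://books.toscrape.com/index.html", callback=self.parse)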
  6. Extracting Data from Web Page
     • HTML => parsel.Selector
       ◦ CSS selector
       ◦ XPath
     • String => regular expression
     • JSON => JMESPath
     • JavaScript => js2xml
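
     Illustrating the parsel.Selector entry above: the same data extracted with a CSS selector, with XPath, and with a regular expression. The HTML snippet is invented for illustration.

     from parsel import Selector

     html = '<ol class="row"><li><h3><a title="A Light in the Attic">A Light...</a></h3></li></ol>'
     sel = Selector(text=html)

     # CSS selector
     print(sel.css("ol.row > li h3 > a::attr(title)").getall())
     # equivalent XPath
     print(sel.xpath('//ol[@class="row"]/li//h3/a/@title').getall())
     # regular expression applied to the extracted string
     print(sel.css("a::attr(title)").re_first(r"A Light.*"))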
  7. Demo: we will scrape book data from http://books.toscrape.com
     1. Install Scrapy
     2. Inspect web pages using the Scrapy shell
     3. Create a Scrapy spider to extract the list of books from http://books.toscrape.com
     4. Run the Scrapy spider
     5. Crawl pagination
     6. Crawl detail pages
     7. Extract data from detail pages
     8. [Scrapy best practices: using Item, ItemLoader, and LinkExtractor]
     9. [Request all pagination at once]
     10. [Add breadcrumbs to the Book item]
  8. Step 1: Install Scrapy
     • pip: pip install Scrapy
     • conda: conda install -c conda-forge scrapy
     • Encountering issues? See https://doc.scrapy.org/en/latest/intro/install.html
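
     A quick sanity check after installation (hedged: recent Scrapy releases expose a __version__ attribute):

     import scrapy
     print(scrapy.__version__)  # prints the installed version if the installation succeeded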
  9. Step 2: Inspect web pages using the Scrapy shell
     scrapy shell "https://books.toscrape.com"

     # open HTML page on web browser
     >>> view(response)
     # get list of book elements
     >>> response.css("ol.row > li")
     # verify the number of elements
     >>> len(response.css("ol.row > li"))
     >>> books = response.css("ol.row > li")
     # get list of book titles
     >>> books.css("h3 > a::attr(title)").extract()
     >>> len(books.css("h3 > a::attr(title)").extract())
  10. Step 3: Create a Scrapy spider
      from scrapy import Spider

      class BookSpider(Spider):
          name = 'books_toscrape_com'
          allowed_domains = ['toscrape.com']
          start_urls = ['https://books.toscrape.com']

          def parse(self, response):
              book_elements = response.css("ol.row > li")
              book_items = []
              ...
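
      The parse body is left unfinished on the slide. One hedged way to complete it, reusing the title selector from the shell session in step 2 (the yielded field names are illustrative):

      from scrapy import Spider

      class BookSpider(Spider):
          name = 'books_toscrape_com'
          allowed_domains = ['toscrape.com']
          start_urls = ['https://books.toscrape.com']

          def parse(self, response):
              # each <li> under ol.row is one book on the listing page
              for book in response.css("ol.row > li"):
                  yield {
                      'title': book.css("h3 > a::attr(title)").get(),
                      'url': response.urljoin(book.css("h3 > a::attr(href)").get()),
                  }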
  11. Step 4: Run the Scrapy spider
      # run the spider in books_toscrape_com.py, store items to the file books.jl
      # in JSON Lines format, and store the log to books.txt
      scrapy runspider books_toscrape_com.py -o books.jl --logfile=books.txt
  12. Step 5: Crawl pagination
      from scrapy import Request

      def request_next_page(self, response):
          next_url = response.css("li.next > a::attr(href)").get()
          if next_url:
              next_url = response.urljoin(next_url)
              return Request(next_url, callback=self.parse)

      def parse(self, response):
          # extract list of books
          ...
          yield self.request_next_page(response)
  13. Step 6: Crawl detail pages
      from scrapy import Request

      def parse(self, response):
          detail_urls = response.css("ol.row > li h3 > a::attr(href)").getall()
          for detail_url in detail_urls:
              yield Request(
                  response.urljoin(detail_url),
                  callback=self.parse_detail,
              )
          yield self.request_next_page(response)
  14. Step 7: Extract data from detail page
      def parse_detail(self, response):
          book_item = {}
          book_item['title'] = response.css("").get()
          book_item['price'] = response.css("").get()
          book_item['stock'] = response.css("").get()
          book_item['rating'] = response.css("").get()
          ...
          return book_item
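
      The selectors are intentionally left blank, to be filled in during the live demo. One possible completion; the selectors below are assumptions based on the books.toscrape.com markup, not part of the original slide:

      def parse_detail(self, response):
          book_item = {}
          # assumed selectors; verify them in the Scrapy shell first
          book_item['title'] = response.css("div.product_main h1::text").get()
          book_item['price'] = response.css("p.price_color::text").get()
          # availability text looks like "In stock (22 available)"
          book_item['stock'] = response.css("p.availability::text").re_first(r"\((\d+) available\)")
          # the rating is encoded in the class attribute, e.g. "star-rating Three"
          book_item['rating'] = response.css("p.star-rating::attr(class)").get()
          return book_item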
  15. What's next?
      8. Scrapy best practices
         a. Create a project
         b. Item, ItemLoader, input processors, output processors (see the sketch below)
         c. LinkExtractor
      9. Request all pagination at once
      10. Learn XPath: it is more powerful than CSS selectors
      11. Want more of a challenge? Try https://quotes.toscrape.com
      12. Scrapy pipelines, middlewares, and extensions
      13. Need browser rendering? Try Splash: https://github.com/scrapinghub/splash
      14. Need a cloud platform for Scrapy spiders? Try https://app.scrapinghub.com
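
      For point 8b, a hedged sketch of an Item used with an ItemLoader; the field names, processors, and selectors are illustrative, and the import assumes a Scrapy version that ships the itemloaders package:

      from itemloaders.processors import MapCompose, TakeFirst
      from scrapy import Field, Item
      from scrapy.loader import ItemLoader

      class Book(Item):
          title = Field()
          price = Field()

      class BookLoader(ItemLoader):
          default_item_class = Book
          default_output_processor = TakeFirst()  # keep the first extracted value per field
          # input processor: strip whitespace and the currency symbol from the price
          price_in = MapCompose(str.strip, lambda value: value.lstrip('£'))

      def parse_detail(self, response):
          loader = BookLoader(response=response)
          loader.add_css('title', "div.product_main h1::text")  # assumed selector
          loader.add_css('price', "p.price_color::text")        # assumed selector
          return loader.load_item()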