Introduction to web scraping with scrapy, remote python pizza 2020

narboom

April 25, 2020
Transcript

  1. # Install scrapy in your python env
     $ pip install scrapy

     # Start a scrapy project named 'python_pizza'
     $ scrapy startproject python_pizza
  2. You send requests, which have a URL and a method (GET and POST are the most common ones). You receive back responses, which have a body and a status code (2xx means it was a success).
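The 2xx rule from the slide can be written as a tiny helper (an illustration, not code from the talk):

```python
def is_success(status_code: int) -> bool:
    """Return True for 2xx responses, the 'it worked' range."""
    return 200 <= status_code < 300

print(is_success(200), is_success(404))  # True False
```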
  3. The basic parts of a spider:

     from scrapy import Spider

     class MyFirstSpider(Spider):
         name = 'python_pizza'  # unique identifier
         start_urls = ['https://remote.python.pizza/']  # first request url

         def parse(self, response):  # callback
             ...
  4. Use selectors to extract data from structured response bodies:

     <div id="topping">
         <p>Pineapple</p>
     </div>

     # css selector
     >>> response.css('div#topping p::text').get()
     'Pineapple'

     # xpath selector
     >>> response.xpath('//div[@id="topping"]/p/text()').get()
     'Pineapple'
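Outside a running spider, the same XPath idea can be tried with the standard library's xml.etree, a rough stand-in for scrapy's selectors (this snippet is an illustration, not from the slides):

```python
import xml.etree.ElementTree as ET

# the snippet the slide queries, wrapped in a root element so it parses
html = "<body><div id='topping'><p>Pineapple</p></div></body>"
tree = ET.fromstring(html)
# ElementTree supports a small XPath subset, enough for this query
print(tree.find(".//div[@id='topping']/p").text)  # Pineapple
```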
  5. from scrapy import Spider

     class MyFirstSpider(Spider):
         name = 'python_pizza'
         start_urls = ['https://remote.python.pizza/']

         def parse(self, response):
             talk_selector_list = response.css('div.schedule-item--info')
             for talk_selector in talk_selector_list:
                 yield {
                     'social': talk_selector.css('a::attr(href)').get(),
                 }
  6. This is how the website actually looks without javascript, and what we receive in our first request. The request from our spider is just the first one needed to load the final page!
  7. But this is javascript, so we can't use selectors like before! If only there were an external library that could parse this for us...

     # install js2py in your python env
     $ pip install js2py
  8. Our updated spider:

     from scrapy import Spider

     class MyFirstSpider(Spider):
         name = 'python_pizza'
         start_urls = ['https://remote.python.pizza/']

         def parse(self, response):
             js_path = response.xpath('//script[contains(@src, "js/index")]/@src').get()
             # js_path is now '/assets/js/index.<hash>.chunk.js'
             return response.follow(js_path, callback=self.parse_js)
             # you could also format the url yourself and send a scrapy request:
             # return scrapy.Request('https://remote.python.pizza' + js_path, callback=self.parse_js)

         def parse_js(self, response):
             ...
  9. import js2py

     def parse_js(self, response):
         # Extract the speakers array
         speakers_data = response.css('*').re_first(r'SPEAKERS=(\[[^\]]*\])')
         # Remove function calls
         cleaned_speakers_data = speakers_data.replace('n(', '(')
         speakers_array = js2py.eval_js(cleaned_speakers_data)
         # Get fields from this array
         for speaker in speakers_array:
             yield {
                 'social': speaker.get('social'),
             }
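When the embedded array happens to be valid JSON, the standard library can do this step without js2py; a minimal sketch over a made-up SPEAKERS literal (the names and URL here are invented for illustration):

```python
import json
import re

# hypothetical inline script content, mimicking the SPEAKERS array on the slide
js_source = 'var SPEAKERS=[{"name": "Ada", "social": "https://example.com/ada"}];'
match = re.search(r'SPEAKERS=(\[[^\]]*\])', js_source)
speakers = json.loads(match.group(1))  # works because this literal is valid JSON
print(speakers[0]['social'])  # https://example.com/ada
```

Real bundled javascript usually isn't valid JSON, which is exactly why the talk reaches for js2py.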
  10. You can also export your items to json and csv!

      $ scrapy crawl python_pizza -o speakers.csv

      # speakers.csv
      social
      https://hynek.me/
      https://twitter.com/sigmapie8
      https://twitter.com/clleew
      https://twitter.com/Mridu__
      https://twitter.com/Jayesh_Ahire1
      https://twitter.com/ongchinhwee
      https://twitter.com/olgamatoula
      https://twitter.com/lordmauve
      ...
  11. Web Scraping for price analysis to stay competitive!

      # prices.json
      [
          {"price": "$12", "name": "carbonara"},
          {"price": "$200", "name": "pineapple"},
          {"price": "$20", "name": "margarita"},
          ...
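Turning those scraped price strings into numbers is a small post-processing step; a sketch over the same shape of data (values copied from the example file, analysis invented here):

```python
# items shaped like the prices.json example from the slide
prices = [
    {"price": "$12", "name": "carbonara"},
    {"price": "$200", "name": "pineapple"},
    {"price": "$20", "name": "margarita"},
]

# strip the currency symbol so prices compare numerically
cheapest = min(prices, key=lambda item: float(item["price"].lstrip("$")))
print(cheapest["name"])  # carbonara
```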
  12. Web Scraping for opinion mining!

      # reviews.json
      [
          {"rating": 0, "content": "worst pizza ever"},
          {"rating": 5, "content": "5/5 MUST TRY!"},
          {"rating": 4.5, "content": "lorem ipsum pizza is overpriced"},
          ...
  13. from operator import methodcaller

      from scrapy.loader import ItemLoader
      from scrapy.loader.processors import TakeFirst, Compose

      class SpeakerItemLoader(ItemLoader):
          default_item_class = SpeakerItem  # a scrapy Item defined elsewhere
          default_output_processor = TakeFirst()
          social_out = Compose(TakeFirst(), methodcaller("strip"))
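What that output processor does can be previewed without scrapy: Compose just chains callables, so the social_out pipeline behaves roughly like stripping the first collected value (a stdlib-only sketch, not the talk's code; TakeFirst additionally skips None and empty values):

```python
from operator import methodcaller

strip = methodcaller("strip")  # the same callable used in social_out

def social_out(values):
    # rough equivalent of Compose(TakeFirst(), methodcaller("strip"))
    return strip(values[0])

print(social_out(["  https://twitter.com/lordmauve \n"]))  # https://twitter.com/lordmauve
```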