Slide 1


Webscraping with Asyncio José Manuel Ortega @jmortegac

Slide 2


Python conferences https://speakerdeck.com/jmortega

Slide 3


Python conferences http://jmortega.github.io/

Slide 4


GitHub repository https://github.com/jmortega/webscraping_asyncio_2016

Slide 5


Agenda ▶ Web scraping Python tools ▶ Requests vs aiohttp ▶ Introduction to asyncio ▶ Async client/server ▶ Building a web crawler with asyncio ▶ Alternatives to asyncio

Slide 6


Webscraping

Slide 7


Python tools ➢ Requests ➢ Beautiful Soup 4 ➢ Pyquery ➢ Webscraping ➢ Scrapy

Slide 8


Python tools ➢ Mechanize ➢ Robobrowser ➢ Selenium

Slide 9


Requests http://docs.python-requests.org/en/latest

Slide 10


Web scraping with Python 1. Download the web page with an HTTP module (requests, urllib, aiohttp) 2. Parse the page with BeautifulSoup/lxml 3. Select elements with regular expressions, XPath or CSS selectors 4. Store the results in a database, CSV or JSON
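A minimal sketch of these four steps using requests and BeautifulSoup; the target URL, the CSS selector and the output filename are illustrative placeholders, not from the talk:

import csv
import requests
from bs4 import BeautifulSoup

# 1. Download the web page
response = requests.get('http://python.org')

# 2. Parse it
soup = BeautifulSoup(response.text, 'html.parser')

# 3. Select elements (here: all links, via a CSS selector)
links = [a.get('href') for a in soup.select('a[href]')]

# 4. Store the results (here: a CSV file)
with open('links.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for link in links:
        writer.writerow([link])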

Slide 11


BeautifulSoup

Slide 12


BeautifulSoup ▶ from bs4 import BeautifulSoup ▶ soup = BeautifulSoup(html_doc, 'html.parser') ▶ Print all: print(soup.prettify()) ▶ Print text: print(soup.get_text())

Slide 13


BeautifulSoup functions ▪ find_all('a') → Returns all links ▪ find('title') → Returns the first matching element ▪ get('href') → Returns the value of the href attribute ▪ (element).text → Returns the text inside an element

for link in soup.find_all('a'):
    print(link.get('href'))
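A small self-contained sketch of these functions; the sample HTML document is made up for illustration:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>PyCon</title></head>
<body>
  <a href="http://python.ie/pycon-2016/">PyCon Ireland</a>
  <a href="http://pycon.org">PyCon</a>
</body></html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.find('title'))        # first matching element: <title>PyCon</title>
print(soup.find('title').text)   # text inside the element: PyCon

for link in soup.find_all('a'):  # all links
    print(link.get('href'))      # value of the href attribute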

Slide 14


External/internal links

Slide 15


External/internal links http://python.ie/pycon-2016/

Slide 16


BeautifulSoup PyCon

Slide 17


BeautifulSoup PyCon Output

Slide 18


Parsers Comparison

Slide 19


PyQuery

Slide 20


PyQuery

Slide 21


PyQuery output

Slide 22


Spiders / crawlers ▶ A Web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing. A Web crawler may also be called a Web spider. https://en.wikipedia.org/wiki/Web_crawler

Slide 23


Spiders / crawlers scrapinghub.com

Slide 24


Scrapy https://pypi.python.org/pypi/Scrapy/1.1.2

Slide 25


Scrapy ▶ Uses a mechanism based on XPath expressions called XPath selectors. ▶ Uses the lxml parser to find elements. ▶ Uses Twisted for asynchronous operations.

Slide 26


Scrapy advantages ▶ Faster than mechanize because it uses Twisted for asynchronous operations. ▶ Scrapy has better support for HTML parsing. ▶ Scrapy has better support for Unicode characters, redirections, gzipped responses and encodings. ▶ You can export the extracted data directly to JSON, XML and CSV.

Slide 27


Export data ▶ scrapy crawl <spider_name> ▶ $ scrapy crawl <spider_name> -o items.json -t json ▶ $ scrapy crawl <spider_name> -o items.csv -t csv ▶ $ scrapy crawl <spider_name> -o items.xml -t xml
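For context, a minimal sketch of a spider that such commands could be run against; the spider name, start URL and item fields are illustrative, not taken from the talk:

import scrapy

class LinksSpider(scrapy.Spider):
    name = 'links'                      # used as: scrapy crawl links -o items.json -t json
    start_urls = ['http://python.org']

    def parse(self, response):
        # XPath selector extracting every link on the page
        for href in response.xpath('//a/@href').extract():
            yield {'url': href}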

Slide 28


Scrapy concurrency

Slide 29


The concurrency problem ▶ Different approaches: ▶ Multiple processes ▶ Threads ▶ Separate distributed machines ▶ Asynchronous programming (event loop)

Slide 30


Requests problems ▶ Requests operations block the main thread ▶ The program pauses until the operation completes ▶ We need one thread for each request if we want non-blocking operations

Slide 31


Threads problems ▶ Overhead ▶ Stack size ▶ Context switches ▶ Synchronization

Slide 32


Solution ▶ DO NOT USE THREADS ▶ USE ONE THREAD ▶ + EVENT LOOP

Slide 33


New concepts ▶ Event loop ▶ Async ▶ Await ▶ Futures ▶ Coroutines ▶ Tasks ▶ Executors

Slide 34


Event loop implementations ▶ Asyncio ▶ https://docs.python.org/3.4/library/asyncio.html ▶ Tornado web server ▶ http://www.tornadoweb.org/en/stable ▶ Twisted ▶ https://twistedmatrix.com ▶ Gevent ▶ http://www.gevent.org

Slide 35


Asyncio def.

Slide 36


Asyncio ▶ Python >= 3.3 ▶ Event-loop framework ▶ Asynchronous I/O ▶ Non-blocking approach with sockets ▶ All requests in one thread ▶ Event-driven switching ▶ aiohttp module for making requests asynchronously

Slide 37


Asyncio ▶ Interoperability with other frameworks

Slide 38


Requests vs aiohttp

aiohttp (asynchronous):

#!/usr/local/bin/python3.5
import asyncio
from aiohttp import ClientSession

async def hello():
    async with ClientSession() as session:
        async with session.get("http://httpbin.org/headers") as response:
            response = await response.read()
            print(response)

loop = asyncio.get_event_loop()
loop.run_until_complete(hello())

requests (blocking):

import requests

def hello():
    return requests.get("http://httpbin.org/get")

print(hello())

Slide 39


Event Loop ▶ An event loop allows us to write asynchronous code using callbacks or coroutines. ▶ The event loop works like a task switcher, just the way operating systems switch between active tasks on the CPU. ▶ The idea is that we have an event loop running until all scheduled tasks are completed. ▶ Futures and tasks are created through the event loop.

Slide 40


Event Loop ▶ An event loop is used to orchestrate the execution of the coroutines. ▶ loop = asyncio.get_event_loop() ▶ loop.run_until_complete(coroutine or future) ▶ loop.run_forever() ▶ loop.stop()
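A minimal sketch of driving a single coroutine with the event loop (Python 3.5 syntax; the coroutine is illustrative):

import asyncio

async def say_hello():
    # stand-in for asynchronous work
    await asyncio.sleep(1)
    print('hello')

loop = asyncio.get_event_loop()
loop.run_until_complete(say_hello())   # run one coroutine until it finishes
loop.close()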

Slide 41


Starting Event Loop

Slide 42


Coroutines ▶ Coroutines are functions that allow for multitasking without requiring multiple threads or processes. ▶ Coroutines are like functions, but they can be suspended and resumed at certain points in the code. ▶ Coroutines let us write asynchronous code that combines the efficiency of callbacks with the classic good looks of multithreaded code.

Slide 43


Coroutines 3.4 vs 3.5

Python 3.4:

import asyncio

@asyncio.coroutine
def fetch(self, url):
    response = yield from self.session.get(url)
    body = yield from response.read()

Python 3.5:

import asyncio

async def fetch(self, url):
    response = await self.session.get(url)
    body = await response.read()

Slide 44


Coroutines in event loop

#!/usr/local/bin/python3.5
import asyncio
import aiohttp

async def get_page(url):
    response = await aiohttp.request('GET', url)
    body = await response.read()
    print(body)

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait([get_page('http://python.org'),
                                      get_page('http://pycon.org')]))

Slide 45


Requests in event loop

# equivalent methods
async def getpage_with_requests(url):
    return await loop.run_in_executor(None, requests.get, url)

async def getpage_with_aiohttp(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.read()

Slide 46


Tasks ▶ The asyncio.Task class is a subclass of asyncio.Future that encapsulates and manages coroutines. ▶ Tasks allow independently running coroutines to run concurrently with other tasks on the same event loop. ▶ When a coroutine is wrapped in a task, it connects the task to the event loop.
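A short sketch of wrapping coroutines in tasks so they run concurrently on the same loop; the coroutine and its arguments are illustrative stand-ins for real requests:

import asyncio

async def fetch(name, delay):
    await asyncio.sleep(delay)          # stand-in for an HTTP request
    print('{} finished after {}s'.format(name, delay))
    return name

loop = asyncio.get_event_loop()
# wrapping the coroutines in tasks schedules them on the event loop
tasks = [loop.create_task(fetch('URL1', 2)),
         loop.create_task(fetch('URL2', 1))]
loop.run_until_complete(asyncio.wait(tasks))
print([task.result() for task in tasks])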

Slide 47


Tasks

Slide 48


Tasks

Slide 49


Tasks

Slide 50


Tasks execution

Slide 51


Futures ▶ To manage a Future object in asyncio, we must declare the following: ▶ import asyncio ▶ future = asyncio.Future() ▶ https://docs.python.org/3/library/asyncio-task.html#future ▶ https://docs.python.org/3/library/concurrent.futures.html

Slide 52


Futures ▶ The asyncio.Future class is essentially a promise of a result. ▶ A Future returns the result when it is available, and once it receives a result, it passes it along to all the registered callbacks. ▶ Each future is a task to be executed in the event loop.
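A minimal sketch of a Future with a registered callback, close to the pattern in the Python documentation; the slow_operation coroutine is an illustrative stand-in for a real request:

import asyncio

async def slow_operation(future):
    await asyncio.sleep(1)                 # stand-in for a slow request
    future.set_result('Future is done!')   # fulfil the promise

def got_result(future):
    print(future.result())                 # the registered callback gets the result
    loop.stop()

loop = asyncio.get_event_loop()
future = asyncio.Future()
asyncio.ensure_future(slow_operation(future))
future.add_done_callback(got_result)
try:
    loop.run_forever()
finally:
    loop.close()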

Slide 53


Futures

Slide 54


Semaphores ▶ Adding synchronization ▶ Limiting the number of concurrent requests. ▶ The argument indicates the number of simultaneous requests we want to allow.

sem = asyncio.Semaphore(5)

with (await sem):
    page = await get(url, compress=True)
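A slightly fuller sketch of limiting concurrent aiohttp requests with a semaphore, using the async with form of the semaphore; the URL list and the limit of 5 are illustrative:

import asyncio
import aiohttp

sem = asyncio.Semaphore(5)                 # allow at most 5 requests in flight

async def get(session, url):
    async with sem:                        # acquired before the request, released after
        async with session.get(url) as response:
            return await response.read()

async def fetch_all(urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*[get(session, url) for url in urls])

urls = ['http://python.org', 'http://pycon.org']
loop = asyncio.get_event_loop()
pages = loop.run_until_complete(fetch_all(urls))
print([len(page) for page in pages])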

Slide 55


Async client/server ▶ asyncio.start_server ▶ server = asyncio.start_server(handle_connection, host=HOST, port=PORT)
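A minimal echo-server sketch built around asyncio.start_server; handle_connection, HOST and PORT are placeholder names, as on the slide:

import asyncio

HOST, PORT = '127.0.0.1', 8888

async def handle_connection(reader, writer):
    data = await reader.read(1024)        # read a request from the client
    writer.write(data)                    # echo it back
    await writer.drain()
    writer.close()

loop = asyncio.get_event_loop()
server = loop.run_until_complete(
    asyncio.start_server(handle_connection, host=HOST, port=PORT))
try:
    loop.run_forever()                    # serve until interrupted
finally:
    server.close()
    loop.run_until_complete(server.wait_closed())
    loop.close()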

Slide 56


Async client/server ▶ asyncio.open_connection
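And a matching client sketch using asyncio.open_connection; the host, port and message are illustrative and assume the echo server sketched above:

import asyncio

async def send_message(message):
    reader, writer = await asyncio.open_connection('127.0.0.1', 8888)
    writer.write(message.encode())
    data = await reader.read(1024)        # wait for the echoed reply
    print('Received:', data.decode())
    writer.close()

loop = asyncio.get_event_loop()
loop.run_until_complete(send_message('hello'))
loop.close()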

Slide 57


Async client/server

Slide 58


Async Web crawler

Slide 59


Async Web crawler ▶ Send asynchronous requests to all the links on a web page and add the responses to a queue to be processed as we go. ▶ Coroutines allow running independent tasks and processing their results in 3 ways: ▶ Using asyncio.as_completed → by processing the results as they come in. ▶ Using asyncio.gather → only once they have all finished loading. ▶ Using asyncio.ensure_future → by wrapping each coroutine in a task and waiting for them with asyncio.wait.

Slide 60


Async Web crawler (asyncio.as_completed)

import asyncio
import random

@asyncio.coroutine
def get_url(url):
    wait_time = random.randint(1, 4)
    yield from asyncio.sleep(wait_time)
    print('Done: URL {} took {}s to get!'.format(url, wait_time))
    return url, wait_time

@asyncio.coroutine
def process_results_as_come_in():
    coroutines = [get_url(url) for url in ['URL1', 'URL2', 'URL3']]
    for coroutine in asyncio.as_completed(coroutines):
        url, wait_time = yield from coroutine
        print('Coroutine for {} is done'.format(url))

def main():
    loop = asyncio.get_event_loop()
    print("Process results as they come in:")
    loop.run_until_complete(process_results_as_come_in())

if __name__ == '__main__':
    main()

Slide 61


Async Web crawler execution

Slide 62


Async Web crawler (asyncio.gather)

import asyncio
import random

@asyncio.coroutine
def get_url(url):
    wait_time = random.randint(1, 4)
    yield from asyncio.sleep(wait_time)
    print('Done: URL {} took {}s to get!'.format(url, wait_time))
    return url, wait_time

@asyncio.coroutine
def process_once_everything_ready():
    coroutines = [get_url(url) for url in ['URL1', 'URL2', 'URL3']]
    results = yield from asyncio.gather(*coroutines)
    print(results)

def main():
    loop = asyncio.get_event_loop()
    print("Process results once they are all ready:")
    loop.run_until_complete(process_once_everything_ready())

if __name__ == '__main__':
    main()

Slide 63


asyncio.gather

From the Python documentation, this is what asyncio.gather does:

asyncio.gather(*coros_or_futures, loop=None, return_exceptions=False)

Return a future aggregating results from the given coroutine objects or futures. All futures must share the same event loop. If all the tasks are done successfully, the returned future’s result is the list of results (in the order of the original sequence, not necessarily the order of results arrival). If return_exceptions is True, exceptions in the tasks are treated the same as successful results, and gathered in the result list; otherwise, the first raised exception will be immediately propagated to the returned future.

Slide 64


Async Web crawler (asyncio.ensure_future)

import asyncio
import random

@asyncio.coroutine
def get_url(url):
    wait_time = random.randint(1, 4)
    yield from asyncio.sleep(wait_time)
    print('Done: URL {} took {}s to get!'.format(url, wait_time))
    return url, wait_time

@asyncio.coroutine
def process_ensure_future():
    tasks = [asyncio.ensure_future(get_url(url)) for url in ['URL1', 'URL2', 'URL3']]
    results = yield from asyncio.wait(tasks)
    print(results)

def main():
    loop = asyncio.get_event_loop()
    print("Process ensure future:")
    loop.run_until_complete(process_ensure_future())

if __name__ == '__main__':
    main()

Slide 65


Async Web crawler execution

Slide 66


Async Web downloader

Slide 67


Async Web downloader faster

Slide 68


Async Web downloader ▶ With get_partial_content ▶ With download_coroutine

Slide 69


Async Extracting links with regular expressions

Slide 70


Async Extracting links with bs4

Slide 71


Async Extracting links execution ▶ With bs4 ▶ With regex

Slide 72


Alternatives to asyncio ▶ ThreadPoolExecutor ▶ https://docs.python.org/3.5/library/concurrent.futures.html#concurrent.futures.ThreadPoolExecutor ▶ ProcessPoolExecutor ▶ https://docs.python.org/3.5/library/concurrent.futures.html#concurrent.futures.ProcessPoolExecutor ▶ Parallel Python ▶ http://www.parallelpython.com
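A short ThreadPoolExecutor sketch that downloads pages with requests from a pool of threads instead of an event loop; the URLs and worker count are illustrative:

import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

urls = ['http://python.org', 'http://pycon.org']

with ThreadPoolExecutor(max_workers=4) as executor:
    # submit one download per URL and handle results as they finish
    futures = {executor.submit(requests.get, url): url for url in urls}
    for future in as_completed(futures):
        response = future.result()
        print(futures[future], len(response.content))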

Slide 73


Parallel Python ▶ SMP (symmetric multiprocessing) architecture with multiple cores in the same machine ▶ Distributes tasks across multiple machines ▶ Cluster

Slide 74


ProcessPoolExecutor number_of_cpus = cpu_count()
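A minimal ProcessPoolExecutor sketch that sizes the pool with cpu_count(); the CPU-bound parse_page function and the sample pages are illustrative placeholders:

from multiprocessing import cpu_count
from concurrent.futures import ProcessPoolExecutor

def parse_page(html):
    # placeholder for CPU-bound work (parsing, regex matching, ...)
    return len(html)

number_of_cpus = cpu_count()
pages = ['<html>one</html>', '<html>two</html>', '<html>three</html>']

if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=number_of_cpus) as executor:
        for size in executor.map(parse_page, pages):
            print(size)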

Slide 75


References ▶ http://www.crummy.com/software/BeautifulSoup ▶ http://scrapy.org ▶ http://docs.webscraping.com ▶ https://github.com/KeepSafe/aiohttp ▶ http://aiohttp.readthedocs.io/en/stable/ ▶ https://docs.python.org/3.4/library/asyncio.html ▶ https://github.com/REMitchell/python-scraping

Slide 76


Books

Slide 77


Books

Slide 78


Thank you! @jmortegac http://speakerdeck.com/jmortega http://github.com/jmortega