
Webscraping with asyncio

jmortegac
November 06, 2016

Transcript

  1. Webscraping
    with Asyncio
    José Manuel Ortega
    @jmortegac

  2. Python conferences
    https://speakerdeck.com/jmortega

  3. Python conferences
    http://jmortega.github.io/

  4. Github repository
    https://github.com/jmortega/webscraping_asyncio_2016

  5. Agenda
    ▶ Web scraping Python tools
    ▶ Requests vs aiohttp
    ▶ Introduction to asyncio
    ▶ Async client/server
    ▶ Building a web crawler with asyncio
    ▶ Alternatives to asyncio

  6. Webscraping

  7. Python tools

    Requests

    Beautiful Soup 4

    Pyquery

    Webscraping

    Scrapy

  8. Python tools

    Mechanize

    Robobrowser

    Selenium

  9. Requests http://docs.python-requests.org/en/latest

  10. Web scraping with Python
    1. Download the web page with an HTTP
    module (requests, urllib, aiohttp)
    2. Parse the page with
    BeautifulSoup/lxml
    3. Select elements with regular
    expressions, XPath or CSS selectors
    4. Store the results in a database, CSV or
    JSON (a sketch of all four steps follows)
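
    A minimal sketch of these four steps using requests, BeautifulSoup, CSS
    selectors and the csv module (the URL and selector are illustrative
    placeholders, not from the talk):

    import csv
    import requests
    from bs4 import BeautifulSoup

    # 1. Download the web page with an HTTP module
    response = requests.get('http://quotes.toscrape.com')

    # 2. Parse the page with BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # 3. Select elements with a CSS selector
    quotes = [tag.get_text() for tag in soup.select('span.text')]

    # 4. Store the results in a CSV file
    with open('quotes.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        for quote in quotes:
            writer.writerow([quote])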

  11. BeautifulSoup

  12. BeautifulSoup
    from bs4 import BeautifulSoup
    ▶ soup = BeautifulSoup(html_doc, 'html.parser')
    ▶ Print all: print(soup.prettify())
    ▶ Print text: print(soup.get_text())

  13. BeautifulSoup functions
    ▪ find_all('a') → Returns all links
    ▪ find('title') → Returns the first matching element
    ▪ get('href') → Returns the value of the href attribute
    ▪ element.text → Returns the text inside an element
    for link in soup.find_all('a'):
        print(link.get('href'))
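
    A small runnable sketch that combines the functions above (the HTML
    snippet is made up for illustration):

    from bs4 import BeautifulSoup

    html_doc = ('<html><head><title>PyCon</title></head>'
                '<body><a href="http://python.ie">Python Ireland</a></body></html>')
    soup = BeautifulSoup(html_doc, 'html.parser')

    print(soup.find('title'))         # first matching element: <title>PyCon</title>
    print(soup.find('title').text)    # text inside the element: PyCon
    for link in soup.find_all('a'):   # all links
        print(link.get('href'))       # value of the href attribute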

  14. External/internal links

  15. External/internal links
    http://python.ie/pycon-2016/

  16. BeautifulSoup PyCon

  17. BeautifulSoup PyCon Output

  18. Parsers Comparison

  19. PyQuery

  20. PyQuery
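    The original slide shows PyQuery in an image; roughly, link extraction
    with PyQuery might look like this sketch (the URL is only an example):

    from pyquery import PyQuery as pq

    # Load a document directly from a URL (PyQuery fetches it for us)
    doc = pq(url='http://python.ie/pycon-2016/')

    # jQuery-style CSS selection: print the href of every link
    for link in doc('a').items():
        print(link.attr('href'))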

  21. PyQuery output

  22. Spiders /crawlers
    ▶ A Web crawler is an Internet bot that
    systematically browses the World Wide Web,
    typically for the purpose of Web indexing. A
    Web crawler may also be called a Web
    spider.
    https://en.wikipedia.org/wiki/Web_crawler

  23. Spiders /crawlers
    scrapinghub.com

  24. Scrapy
    https://pypi.python.org/pypi/Scrapy/1.1.2

  25. Scrapy
    ▶ Uses a mechanism based on XPath
    expressions called XPath selectors
    ▶ Uses the lxml parser to find elements
    ▶ Uses Twisted for asynchronous
    operations
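
    A minimal spider sketch illustrating XPath selectors (the spider name,
    start URL and item field are hypothetical, not from the talk):

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = 'quotes'                              # hypothetical spider name
        start_urls = ['http://quotes.toscrape.com']  # hypothetical start URL

        def parse(self, response):
            # XPath selectors, backed by the lxml parser
            for quote in response.xpath('//span[@class="text"]/text()'):
                yield {'text': quote.extract()}

    It would be run and exported with the scrapy crawl commands shown on the
    "Export data" slide below.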

  26. Scrapy advantages
    ▶ Faster than Mechanize because it
    uses Twisted for asynchronous operations.
    ▶ Scrapy has better support for HTML
    parsing.
    ▶ Scrapy has better support for Unicode
    characters, redirections, gzipped
    responses and encodings.
    ▶ You can export the extracted data directly
    to JSON, XML and CSV.

  27. Export data
    ▶ $ scrapy crawl <spider>
    ▶ $ scrapy crawl <spider> -o items.json -t json
    ▶ $ scrapy crawl <spider> -o items.csv -t csv
    ▶ $ scrapy crawl <spider> -o items.xml -t xml

  28. Scrapy concurrency

  29. The concurrency problem
    ▶ Different approaches:
    ▶ Multiple processes
    ▶ Threads
    ▶ Separate distributed machines
    ▶ Asynchronous programming (event loop)

  30. Requests problems
    ▶ Requests operations block the
    main thread
    ▶ The program pauses until the operation completes
    ▶ We need one thread per request if
    we want non-blocking operations

  31. Threads problems
    ▶ Overhead
    ▶ Stack size
    ▶ Context switches
    ▶ Synchronization

  32. Solution
    ▶ DON'T USE THREADS
    ▶ USE ONE THREAD
    ▶ + AN EVENT LOOP

  33. New concepts
    ▶ Event loop
    ▶ Async
    ▶ Await
    ▶ Futures
    ▶ Coroutines
    ▶ Tasks
    ▶ Executors

  34. Event loop implementations
    ▶ Asyncio
    ▶ https://docs.python.org/3.4/library/asyncio.html
    ▶ Tornado web server
    ▶ http://www.tornadoweb.org/en/stable
    ▶ Twisted
    ▶ https://twistedmatrix.com
    ▶ Gevent
    ▶ http://www.gevent.org

  35. Asyncio def.

  36. Asyncio
    ▶ Python >= 3.3
    ▶ Event-loop framework
    ▶ Asynchronous I/O
    ▶ Non-blocking approach with sockets
    ▶ All requests in one thread
    ▶ Event-driven switching
    ▶ aiohttp module for making requests
    asynchronously

  37. Asyncio
    ▶ Interoperability with other frameworks

  38. Requests vs aiohttp
    #!/usr/local/bin/python3.5
    # aiohttp (asynchronous, one thread)
    import asyncio
    from aiohttp import ClientSession

    async def hello():
        async with ClientSession() as session:
            async with session.get("http://httpbin.org/headers") as response:
                body = await response.read()
                print(body)

    loop = asyncio.get_event_loop()
    loop.run_until_complete(hello())

    # requests (blocking)
    import requests

    def hello():
        return requests.get("http://httpbin.org/get")

    print(hello())

  39. Event Loop
    ▶ An event loop allows us to write asynchronous
    code using callbacks or coroutines.
    ▶ The event loop works like a task switcher, just the way
    operating systems switch between active tasks on the
    CPU.
    ▶ The idea is that we have an event loop running until all
    scheduled tasks are completed.
    ▶ Futures and tasks are created through the event loop.

  40. Event Loop
    ▶ An event loop is used to orchestrate the
    execution of the coroutines.
    ▶ loop = asyncio.get_event_loop()
    ▶ loop.run_until_complete(coroutine or future)
    ▶ loop.run_forever()
    ▶ loop.stop()

  41. Starting Event Loop
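    The original slide shows this as an image; a minimal sketch of starting
    the event loop could be:

    import asyncio

    async def greet():
        await asyncio.sleep(1)
        print('Hello from the event loop')

    loop = asyncio.get_event_loop()       # get the default event loop
    loop.run_until_complete(greet())      # run until the coroutine finishes
    loop.close()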

  42. Coroutines
    ▶ Coroutines are functions that allow for
    multitasking without requiring multiple
    threads or processes.
    ▶ Coroutines are like functions, but they can be
    suspended and resumed at certain points in the
    code.
    ▶ Coroutines let us write asynchronous code that
    combines the efficiency of callbacks with the
    readability of classic multithreaded code.

  43. Coroutines 3.4 vs 3.5
    # Python 3.4
    import asyncio

    @asyncio.coroutine
    def fetch(self, url):
        response = yield from self.session.get(url)
        body = yield from response.read()

    # Python 3.5
    import asyncio

    async def fetch(self, url):
        response = await self.session.get(url)
        body = await response.read()

  44. Coroutines in event loop
    #!/usr/local/bin/python3.5
    import asyncio
    import aiohttp

    async def get_page(url):
        response = await aiohttp.request('GET', url)
        body = await response.read()
        print(body)

    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.wait([get_page('http://python.org'),
                                          get_page('http://pycon.org')]))

  45. Requests in event loop
    # equivalent methods
    async def getpage_with_requests(url):
        return await loop.run_in_executor(None, requests.get, url)

    async def getpage_with_aiohttp(url):
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                return await response.read()

  46. Tasks
    ▶ The asyncio.Task class is a subclass of
    asyncio.Future that encapsulates and manages
    coroutines.
    ▶ It allows independent tasks to run
    concurrently with other tasks on the same event
    loop.
    ▶ When a coroutine is wrapped in a task, the
    task is connected to the event loop.
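
    The task slides that follow are shown as images; a short sketch of
    wrapping coroutines in tasks, as an assumption of what they illustrate:

    import asyncio

    async def count(name, n):
        for i in range(n):
            print(name, i)
            await asyncio.sleep(1)

    loop = asyncio.get_event_loop()
    # Wrapping the coroutines in tasks schedules them on the event loop
    tasks = [asyncio.ensure_future(count('task1', 2)),
             asyncio.ensure_future(count('task2', 3))]
    loop.run_until_complete(asyncio.wait(tasks))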

  47. Tasks

  48. Tasks

  49. Tasks

  50. Tasks execution

  51. Futures
    ▶ To create a Future object in asyncio, we
    declare the following:
    ▶ import asyncio
    ▶ future = asyncio.Future()
    ▶ https://docs.python.org/3/library/asyncio-task.html#future
    ▶ https://docs.python.org/3/library/concurrent.futures.html

  52. Futures
    ▶ The asyncio.Future class is essentially a
    promise of a result.
    ▶ A Future returns its result when it is
    available, and once it has one, it passes it
    along to all the registered callbacks.
    ▶ Each future is a task to be executed in the
    event loop.
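
    A small sketch of a Future with a result callback (not from the slides):

    import asyncio

    def callback(future):
        print('Result:', future.result())

    async def compute(future):
        await asyncio.sleep(1)
        future.set_result('scraping done')    # fulfil the promise

    loop = asyncio.get_event_loop()
    future = asyncio.Future()
    future.add_done_callback(callback)        # runs once the result is available
    loop.run_until_complete(compute(future))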

  53. Futures

  54. Semaphores
    ▶ Adding synchronization
    ▶ Limiting number of concurrent requests.
    ▶ The argument indicates the number of
    simultaneous requests we want to allow.
    sem = asyncio.Semaphore(5)
    with (await sem):
        page = await get(url, compress=True)
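
    Put together with aiohttp, limiting concurrency might look like this
    sketch (the URLs are placeholders; async with sem is the Python 3.5+
    spelling of the slide's with (await sem)):

    import asyncio
    import aiohttp

    sem = asyncio.Semaphore(5)      # at most 5 simultaneous requests

    async def get(url):
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                return await response.read()

    async def bounded_get(url):
        async with sem:             # wait for a free slot before requesting
            return await get(url)

    loop = asyncio.get_event_loop()
    urls = ['http://python.org', 'http://pycon.org']
    loop.run_until_complete(asyncio.wait([bounded_get(u) for u in urls]))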

  55. Async Client / Server
    ▶ asyncio.start_server
    ▶ server = asyncio.start_server(handle_connection, host=HOST, port=PORT)

  56. Async Client / Server
    ▶ asyncio.open_connection
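
    A minimal echo client/server sketch combining asyncio.start_server and
    asyncio.open_connection (the handler and message are illustrative):

    import asyncio

    HOST, PORT = '127.0.0.1', 8888

    async def handle_connection(reader, writer):
        data = await reader.read(100)     # read up to 100 bytes from the client
        writer.write(data)                # echo it back
        await writer.drain()
        writer.close()

    async def client():
        reader, writer = await asyncio.open_connection(HOST, PORT)
        writer.write(b'hello')
        print(await reader.read(100))
        writer.close()

    loop = asyncio.get_event_loop()
    server = loop.run_until_complete(
        asyncio.start_server(handle_connection, host=HOST, port=PORT))
    loop.run_until_complete(client())
    server.close()
    loop.run_until_complete(server.wait_closed())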

  57. Async Client / Server

  58. Async Web crawler

  59. Async Web crawler
    ▶ Send asynchronous requests to all the links
    on a web page and add the responses to a
    queue to be processed as we go.
    ▶ Coroutines allow running independent tasks and
    processing their results in three ways:
    ▶ Using asyncio.as_completed → process the
    results as they arrive.
    ▶ Using asyncio.gather → process them only once
    they have all finished.
    ▶ Using asyncio.ensure_future → wrap each
    coroutine in a task explicitly.

  60. Async Web crawler
    # asyncio.as_completed
    import asyncio
    import random

    @asyncio.coroutine
    def get_url(url):
        wait_time = random.randint(1, 4)
        yield from asyncio.sleep(wait_time)
        print('Done: URL {} took {}s to get!'.format(url, wait_time))
        return url, wait_time

    @asyncio.coroutine
    def process_results_as_come_in():
        coroutines = [get_url(url) for url in ['URL1', 'URL2', 'URL3']]
        for coroutine in asyncio.as_completed(coroutines):
            url, wait_time = yield from coroutine
            print('Coroutine for {} is done'.format(url))

    def main():
        loop = asyncio.get_event_loop()
        print("Process results as they come in:")
        loop.run_until_complete(process_results_as_come_in())

    if __name__ == '__main__':
        main()

  61. Async Web crawler execution

  62. Async Web crawler
    # asyncio.gather
    import asyncio
    import random

    @asyncio.coroutine
    def get_url(url):
        wait_time = random.randint(1, 4)
        yield from asyncio.sleep(wait_time)
        print('Done: URL {} took {}s to get!'.format(url, wait_time))
        return url, wait_time

    @asyncio.coroutine
    def process_once_everything_ready():
        coroutines = [get_url(url) for url in ['URL1', 'URL2', 'URL3']]
        results = yield from asyncio.gather(*coroutines)
        print(results)

    def main():
        loop = asyncio.get_event_loop()
        print("Process results once they are all ready:")
        loop.run_until_complete(process_once_everything_ready())

    if __name__ == '__main__':
        main()

  63. asyncio.gather
    From Python documentation, this is what asyncio.gather does:
    asyncio.gather(*coros_or_futures, loop=None,
    return_exceptions=False)
    Return a future aggregating results from the given coroutine
    objects or futures.
    All futures must share the same event loop. If all the tasks
    are done successfully, the returned future’s result is the
    list of results (in the order of the original sequence, not
    necessarily the order of results arrival). If
    return_exceptions is True, exceptions in the tasks are
    treated the same as successful results, and gathered in the
    result list; otherwise, the first raised exception will be
    immediately propagated to the returned future.

  64. Async Web crawler
    # asyncio.ensure_future
    import asyncio
    import random

    @asyncio.coroutine
    def get_url(url):
        wait_time = random.randint(1, 4)
        yield from asyncio.sleep(wait_time)
        print('Done: URL {} took {}s to get!'.format(url, wait_time))
        return url, wait_time

    @asyncio.coroutine
    def process_ensure_future():
        tasks = [asyncio.ensure_future(get_url(url)) for url in
                 ['URL1', 'URL2', 'URL3']]
        results = yield from asyncio.wait(tasks)
        print(results)

    def main():
        loop = asyncio.get_event_loop()
        print("Process ensure future:")
        loop.run_until_complete(process_ensure_future())

    if __name__ == '__main__':
        main()

  65. Async Web crawler execution

  66. Async Web downloader

  67. Async Web downloader faster

  68. Async Web downloader
    ▶ With get_partial_content
    ▶ With download_coroutine
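
    The implementations are shown as images; a download coroutine along those
    lines might look like this sketch (the body and file naming are an
    assumption, only the name download_coroutine comes from the slide):

    import os
    import asyncio
    import aiohttp

    async def download_coroutine(url):
        # Fetch the URL and write the body to a local file named after it
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                filename = os.path.basename(url) or 'index.html'
                with open(filename, 'wb') as f:
                    f.write(await response.read())
                return filename

    loop = asyncio.get_event_loop()
    urls = ['http://python.org', 'http://pycon.org']
    loop.run_until_complete(asyncio.wait([download_coroutine(u) for u in urls]))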

  69. Async Extracting links with regular expressions

  70. Async Extracting links with bs4

  71. Async Extracting links execution
    ▶ With bs4
    ▶ With regex
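
    The two versions are shown as images; both approaches might look roughly
    like this sketch (the regex is deliberately simplified):

    import re
    import asyncio
    import aiohttp
    from bs4 import BeautifulSoup

    async def fetch(url):
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                return await response.text()

    def links_with_bs4(html):
        soup = BeautifulSoup(html, 'html.parser')
        return [a.get('href') for a in soup.find_all('a')]

    def links_with_regex(html):
        # naive pattern, good enough for a demo but not for arbitrary HTML
        return re.findall(r'href="(http[^"]+)"', html)

    loop = asyncio.get_event_loop()
    html = loop.run_until_complete(fetch('http://python.org'))
    print(links_with_bs4(html))
    print(links_with_regex(html))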

  72. Alternatives to asyncio
    ▶ ThreadPoolExecutor (sketch below)
    ▶ https://docs.python.org/3.5/library/concurrent.futures.html#concurrent.futures.ThreadPoolExecutor
    ▶ ProcessPoolExecutor
    ▶ https://docs.python.org/3.5/library/concurrent.futures.html#concurrent.futures.ProcessPoolExecutor
    ▶ Parallel Python
    ▶ http://www.parallelpython.com
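
    For comparison, a small ThreadPoolExecutor sketch that runs blocking
    requests on a pool of worker threads (the worker count is arbitrary):

    import requests
    from concurrent.futures import ThreadPoolExecutor

    urls = ['http://python.org', 'http://pycon.org']

    def fetch(url):
        return url, requests.get(url).status_code

    # Each blocking request runs on its own worker thread
    with ThreadPoolExecutor(max_workers=5) as executor:
        for url, status in executor.map(fetch, urls):
            print(url, status)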

  73. Parallel Python
    ▶ SMP (symmetric multiprocessing)
    architecture with multiple cores in the same
    machine
    ▶ Distributes tasks across multiple machines
    ▶ Cluster

  74. ProcessPoolExecutor
    number_of_cpus = cpu_count()
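
    The slide's full example is an image; a sketch along those lines,
    spreading CPU-bound work over the available cores (the work function is
    a placeholder):

    from multiprocessing import cpu_count
    from concurrent.futures import ProcessPoolExecutor

    number_of_cpus = cpu_count()

    def work(n):
        # stand-in for a CPU-bound job
        return sum(i * i for i in range(n))

    if __name__ == '__main__':
        # one worker process per CPU core
        with ProcessPoolExecutor(max_workers=number_of_cpus) as executor:
            print(list(executor.map(work, [10**5, 10**6, 10**7])))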

  75. References
    ▶ http://www.crummy.com/software/BeautifulSoup
    ▶ http://scrapy.org
    ▶ http://docs.webscraping.com
    ▶ https://github.com/KeepSafe/aiohttp
    ▶ http://aiohttp.readthedocs.io/en/stable/
    ▶ https://docs.python.org/3.4/library/asyncio.html
    ▶ https://github.com/REMitchell/python-scraping

  76. Books

  77. Books

  78. Thank you!
    @jmortegac
    http://speakerdeck.com/jmortega
    http://github.com/jmortega
