
There will be Data: Scraping the Web with Python by Andrew Collier

Pycon ZA
October 11, 2019


Web scraping is an essential weapon for every Data Scientist to have in their arsenal. Whether you're creating a new dataset from scratch or augmenting an existing dataset, there are reams of data available to be harvested.

In this practical talk I'll show how to use CSS selectors to isolate the relevant portions of a web page and then demonstrate how to use BeautifulSoup to retrieve the associated data.
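As a minimal sketch of that idea (not the code from the talk): the HTML fragment, class names and values below are invented for illustration, and the built-in `html.parser` backend is an assumed choice. CSS selectors passed to `select()` and `select_one()` isolate the rows and cells of interest, and `get_text()` retrieves the data.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment, invented purely for illustration.
HTML = """
<html><body>
  <div id="nav"><a href="/">Home</a></div>
  <table class="products">
    <tr class="product"><td class="name">Widget</td><td class="price">9.99</td></tr>
    <tr class="product"><td class="name">Gadget</td><td class="price">24.50</td></tr>
  </table>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# The CSS selector isolates only the product rows and ignores the navigation.
rows = soup.select("table.products tr.product")

# Pull the text out of the cells of each row and convert the price to a number.
products = [
    {
        "name": row.select_one("td.name").get_text(strip=True),
        "price": float(row.select_one("td.price").get_text(strip=True)),
    }
    for row in rows
]

print(products)
```

On a real site the selectors would come from inspecting the page in the browser's developer tools, and the HTML would be downloaded rather than embedded in the script.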



Transcript

  1. Andrew B. Collier andrew@exegetic.biz | @datawookie

  2. Data Science Consulting / Training. We build a lot of bespoke Web Scrapers!

  3. What?

  4. Automatically extracting data from websites and storing it in a structured format.

  5. Why?

  6. Many websites have vast volumes of useful and valuable data.

  7. You could do it manually, but... Sometimes on a single page. Often distributed across many pages.

  8. How?

  9. Site has an API? Use it!

  10. Site has no API? Scrape it!

  11. SAAS. A few sites offering Web Scraping services. For example: https://listly.io https://webhose.io/ https://scrapinghub.com/ https://webscraping.com/
    Pros: Quick and easy. Ideal for a well-specified project. Some have a (limited) free tier.
    Cons: No access to scraper code. Subscription (need to subscribe for feed). Pay for maintenance. Poor for evolving specifications. Poor for evolving websites.
    [*] The "other" SAAS: Scraping As A Service.

  12. What's the alternative?

  13. (No transcribed text.)

  14. Tools. There are other options, but IMHO these are the best.

  15. (No transcribed text.)

  16. Do I need to know HTML? Yes. Do I need to know CSS? Yes. Do I need to know JavaScript? Nope.

  17. @datawookie This tweet is entirely bogus. Tweet me sites you'd like to see scraped. If we have time then we'll do one at the end. [*] Wikipedia pages are always a good option. What about Game of Thrones?

  18. Let's go!

  19. @datawookie andrew@exegetic.biz https://www.exegetic.biz https://datawookie.netlify.com
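Slides 4 and 7 describe data that is spread across many pages and needs to end up in a structured format. The following is a rough sketch of that workflow under stated assumptions, not the scraper demonstrated in the talk: the two-page listing, the `a.next` pagination link and the helper names `fetch`, `scrape` and `to_csv` are all hypothetical, and the pages are held in memory so the example runs without network access.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical two-page listing, kept in memory so the sketch runs offline.
PAGES = {
    "/items?page=1": """
        <ul>
          <li class="item"><span class="name">Alpha</span><span class="price">10.50</span></li>
          <li class="item"><span class="name">Beta</span><span class="price">7.25</span></li>
        </ul>
        <a class="next" href="/items?page=2">Next</a>
    """,
    "/items?page=2": """
        <ul>
          <li class="item"><span class="name">Gamma</span><span class="price">3.00</span></li>
        </ul>
    """,
}


def fetch(url):
    """Stand-in for an HTTP request; a real scraper would download the page."""
    return PAGES[url]


def scrape(start_url):
    """Follow 'next' links from start_url, collecting one record per item."""
    records = []
    url = start_url
    while url is not None:
        soup = BeautifulSoup(fetch(url), "html.parser")
        for item in soup.select("li.item"):
            records.append({
                "name": item.select_one("span.name").get_text(strip=True),
                "price": float(item.select_one("span.price").get_text(strip=True)),
            })
        # Stop when the page has no link to a following page.
        next_link = soup.select_one("a.next")
        url = next_link["href"] if next_link else None
    return records


def to_csv(records):
    """Serialise the records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = scrape("/items?page=1")
print(to_csv(records))
```

Keeping `fetch` separate from the parsing logic means the in-memory lookup could later be swapped for a real download without touching the selectors.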