There will be Data: Scraping the Web with Python by Andrew Collier

Pycon ZA
October 11, 2019

Web scraping is an essential weapon for every Data Scientist to have in their arsenal. Whether you're creating a new dataset from scratch or augmenting an existing dataset, there are reams of data available to be harvested.

In this practical talk I'll show how to use CSS selectors to isolate the relevant portions of a web page, and then demonstrate how to use BeautifulSoup to retrieve the associated data.
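To make the idea concrete, here is a minimal sketch of that workflow. It is not taken from the talk's slides: the HTML snippet, the class names, the `talks` id and the `extract_talks` helper are all made up for illustration, and a real scraper would first download the page rather than use an inline string.

```python
from bs4 import BeautifulSoup

# Hypothetical page content (an assumption for this sketch). A real scraper
# would fetch this HTML over HTTP before parsing it.
HTML = """
<html><body>
  <div class="nav"><a href="/about">About</a></div>
  <table id="talks">
    <tr class="talk">
      <td class="title">Example Talk A</td>
      <td class="speaker">Speaker One</td>
    </tr>
    <tr class="talk">
      <td class="title">Example Talk B</td>
      <td class="speaker">Speaker Two</td>
    </tr>
  </table>
</body></html>
"""


def extract_talks(html):
    """Return one dict per table row matched by the CSS selector."""
    # "html.parser" is the parser bundled with Python, so no extra
    # parser library is needed.
    soup = BeautifulSoup(html, "html.parser")
    talks = []
    # The CSS selector isolates the relevant rows and ignores the
    # navigation markup elsewhere on the page.
    for row in soup.select("table#talks tr.talk"):
        title = row.select_one("td.title").get_text(strip=True)
        speaker = row.select_one("td.speaker").get_text(strip=True)
        talks.append({"title": title, "speaker": speaker})
    return talks


talks = extract_talks(HTML)
for talk in talks:
    print(talk["title"], "by", talk["speaker"])
```

The selector does the isolating and BeautifulSoup does the retrieving. If the site changes its markup, usually only the selector strings need to be updated.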

Transcript

  1. You could do it manually, but... Sometimes on a single page. Often distributed across many pages.
  2. SAAS [*]. A few sites offer web scraping services. For example: https://listly.io https://webhose.io/ https://scrapinghub.com/ https://webscraping.com/
    Pros: Quick and easy. Ideal for a well-specified project. Some have a (limited) free tier.
    Cons: No access to scraper code. Subscription (need to subscribe for feed). Pay for maintenance. Poor for evolving specifications. Poor for evolving websites.
    [*] The "other" SAAS: Scraping As A Service.
  3. Do I need to know HTML? Yes. Do I need to know CSS? Yes. Do I need to know JavaScript? Nope.
  4. @datawookie This tweet is entirely bogus. Tweet me sites you'd like to see scraped. If we have time then we'll do one at the end. [*] Wikipedia pages are always a good option. What about Game of Thrones?