There will be Data: Scraping the Web with Python (CSS Slides) by Andrew Collier

Pycon ZA
October 11, 2019

Web scraping is an essential weapon for every Data Scientist to have in their arsenal. Whether you're creating a new dataset from scratch or augmenting an existing dataset, there are reams of data available to be harvested.

In this practical talk I'll show how to use CSS to isolate the relevant portions of a web page and then demonstrate how to use BeautifulSoup to retrieve the associated data.

Transcript

  1. HTML = HyperText Markup Language. What is the content of the page?

    CSS = Cascading Style Sheets. How should the contents appear? CSS is what makes the internet look good.
  2. Inline Style: <html> <body> <!-- Nobody does this anymore! --> <p style="font-weight: bold;">I'm bold!</p> </body> </html>

    Embedded Styles: <html> <head> <style> /* Styles go here! */ </style> </head> </html>

    Linked Styles: <html> <head> <link rel="stylesheet" type="text/css" href="styles.css" /> </head> </html>
  3. CSS is a set of rules. Each rule has the following structure: <selectors> { <declarations> }

    Selectors: Selectors are used to target specific portions of the document. "To which elements does this apply?"

    Declarations: The declaration block contains one or more declarations, each of which has the form <property>: <value>; "What must be done with the selected elements?"
  4. p { color: red; }

    The selector is p. This tells us which portion of the document the rule applies to. The property is color and the value is red. The effect of this rule is that all paragraph content (within p tags) is rendered as red text.
  5. Tags

    CSS: p { color: red; }

    HTML: <p> All paragraph text will be red. </p>
  6. Identifier: An identifier rule begins with a "#" and targets one specific element in the document. An identifier is unique within a document.

    CSS: #introduction { font-style: italic; }

    HTML: <div id="introduction"> This is the introduction. </div> <div> This is not the introduction! </div>
  7. Class: A class rule begins with a ".". A class rule can be applied to multiple elements in a document.

    CSS: .big { font-size: 2rem; } .huge { font-size: 4rem; }

    HTML: <p> This is normal text. </p> <p class="big"> This is big text. </p> <p class="huge"> This is HUGE text! </p>
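The tag, identifier and class selectors above carry over directly to scraping. A minimal sketch, assuming the bs4 package is installed and using the standard library's html.parser backend; the variable names are only illustrative, and the markup combines the snippets from the slides above:

```python
from bs4 import BeautifulSoup

# Markup taken from the identifier and class slides.
html = """
<div id="introduction">This is the introduction.</div>
<div>This is not the introduction!</div>
<p>This is normal text.</p>
<p class="big">This is big text.</p>
<p class="huge">This is HUGE text!</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Tag selector: every <p> element, in document order.
paragraphs = [p.get_text() for p in soup.select("p")]

# Identifier selector: at most one match, so select_one() is convenient.
intro = soup.select_one("#introduction").get_text()

# Class selector: every element that carries the class "big".
big = [p.get_text() for p in soup.select(".big")]

print(paragraphs)
print(intro)
print(big)
```

select() always returns a list (possibly empty), while select_one() returns the first match or None, so real scraping code should check for None before calling get_text().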
  8. Grouping: Multiple selectors can be grouped together, separated by a ",". The declarations are applied to every element matched by any selector in the group.

    CSS: h1, h2 { font-style: italic; }

    HTML: <h1>Chapter Heading</h1> <h2>First Section</h2> <h2>Second Section</h2>
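Grouping works the same way when selecting rather than styling. A small sketch, again assuming bs4 with the html.parser backend:

```python
from bs4 import BeautifulSoup

html = """
<h1>Chapter Heading</h1>
<h2>First Section</h2>
<h2>Second Section</h2>
<p>Body text that the group does not match.</p>
"""

soup = BeautifulSoup(html, "html.parser")

# One query for both heading levels; matches come back in document order.
headings = [(h.name, h.get_text()) for h in soup.select("h1, h2")]
print(headings)
```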
  9. Descendants: Descendant selectors identify tags which are descendants (but not necessarily direct descendants!) of another tag in the document tree.

    CSS: .outer p { }

    HTML: <div class="outer"> <p>Some paragraph text.</p> <div class="inner"> <p>Some more paragraph text.</p> </div> </div> <p> This paragraph is not selected! </p>
  10. Children: Child selectors identify tags which are direct descendants of a parent tag.

    CSS: .outer > p { }

    HTML: <div class="outer"> <p>I'm a direct descendant.</p> <div class="inner"> <p>I'm not a direct descendant.</p> </div> </div>
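The difference between the descendant and child selectors is easiest to see by running both against the same markup. A sketch assuming bs4 with the html.parser backend, reusing the HTML from the descendants slide:

```python
from bs4 import BeautifulSoup

html = """
<div class="outer">
  <p>Some paragraph text.</p>
  <div class="inner">
    <p>Some more paragraph text.</p>
  </div>
</div>
<p>This paragraph is not selected!</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Descendant selector: any <p> nested anywhere below .outer.
descendants = [p.get_text() for p in soup.select(".outer p")]

# Child selector: only <p> elements whose parent is .outer itself.
children = [p.get_text() for p in soup.select(".outer > p")]

print(descendants)
print(children)
```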
  11. Adjacent Sibling: Adjacent sibling selectors identify a tag which immediately follows another tag at the same level in the document tree. Here only the first paragraph after the .big paragraph is selected.

    CSS: .big + p { }

    HTML: <p class="big"> This is big text. </p> <p>Some paragraph text.</p> <p>Some more paragraph text.</p>
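A quick check of the adjacent sibling selector on the slide's markup, assuming bs4 with the html.parser backend:

```python
from bs4 import BeautifulSoup

html = """
<p class="big">This is big text.</p>
<p>Some paragraph text.</p>
<p>Some more paragraph text.</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Only the <p> directly after the .big element matches; the third
# paragraph is a sibling too, but it is not adjacent to .big.
adjacent = [p.get_text() for p in soup.select(".big + p")]
print(adjacent)
```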
  12. Attribute: Attribute selectors identify tags which have specific attribute values.

    CSS: img[width="50%"] { }

    HTML: <img width="50%"> <img width="100%">
  13. Attribute Pattern: You can also match on attribute patterns.

    CSS: /* "begins with" */ a[href^="mailto://"] { } /* "ends with" */ a[href$=".pdf"] { } /* "contains" */ img[src*="logo"] { }

    HTML: <a href="http://"> <a href="mailto://"> <a href="ftp://"> <a href="logo.png"> <a href="report.pdf"> <a href="report.docx"> <img src="avatar.png"> <img src="company-logo.png"> <img src="banner.jpg">
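Attribute patterns are particularly handy when scraping links and images. The sketch below assumes bs4 with the html.parser backend and uses the attribute values from the slide; the closing tags and link text are added only to keep the markup well formed:

```python
from bs4 import BeautifulSoup

html = """
<a href="http://">web</a>
<a href="mailto://">mail</a>
<a href="ftp://">ftp</a>
<a href="logo.png">logo link</a>
<a href="report.pdf">pdf</a>
<a href="report.docx">docx</a>
<img src="avatar.png">
<img src="company-logo.png">
<img src="banner.jpg">
"""

soup = BeautifulSoup(html, "html.parser")

# "begins with": href starts with the given prefix.
mail_links = [a["href"] for a in soup.select('a[href^="mailto://"]')]

# "ends with": href ends with the given suffix.
pdf_links = [a["href"] for a in soup.select('a[href$=".pdf"]')]

# "contains": src contains the substring. The <a href="logo.png"> link is
# not matched because the selector asks for an <img> with a src attribute.
logo_images = [img["src"] for img in soup.select('img[src*="logo"]')]

print(mail_links, pdf_links, logo_images)
```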
  14. First, Last & Only Child

    CSS: p:first-child { } p:last-child { } p:only-child { }

    HTML: <div> <p></p> <p></p> <p></p> </div> <div> <p></p> <p></p> <p></p> </div> <div> <p></p> </div>
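The slide shows these three pseudo-classes without describing which elements they pick out, so a small experiment helps. This sketch assumes bs4 with the html.parser backend; the paragraph labels (a1, b1 and so on) are made up so that the matches can be told apart:

```python
from bs4 import BeautifulSoup

html = """
<div><p>a1</p><p>a2</p><p>a3</p></div>
<div><p>b1</p></div>
"""

soup = BeautifulSoup(html, "html.parser")

def texts(selector):
    """Return the text of every element matched by a CSS selector."""
    return [tag.get_text() for tag in soup.select(selector)]

first = texts("p:first-child")  # first child of each parent
last = texts("p:last-child")    # last child of each parent
only = texts("p:only-child")    # a child with no element siblings

print(first, last, only)
```

Note that b1 appears in all three results: a lone child is simultaneously the first, the last and the only child of its parent.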
  15. nth Child: The :nth-child(n) selector matches an element that is the nth child of its parent, counting siblings of every type.

    n can be a number, a keyword (odd or even) or a formula. A formula has the form an + b, with n starting at 0.

    CSS: li:nth-child(2) { } li:nth-child(odd) { } li:nth-child(even) { } li:nth-child(3n) { } li:nth-child(2n+1) { }

    HTML: <ul> <li></li> <li></li> <li></li> </ul>
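To make the an + b formula concrete, the sketch below works out which 1-based positions a formula hits and compares that with what BeautifulSoup selects. It assumes bs4 with the html.parser backend; matches_formula, selected and predicted are hypothetical helpers written for this illustration, and the six-item list is invented so that 3n has more than one match:

```python
from bs4 import BeautifulSoup

def matches_formula(position, a, b):
    """True if position == a*n + b for some integer n >= 0 (1-based positions)."""
    if a == 0:
        return position == b
    offset = position - b
    return offset % a == 0 and offset // a >= 0

# Six list items labelled "item 1" .. "item 6".
html = "<ul>" + "".join(f"<li>item {i}</li>" for i in range(1, 7)) + "</ul>"
soup = BeautifulSoup(html, "html.parser")

def selected(selector):
    """Text of the elements BeautifulSoup matches for a selector."""
    return [li.get_text() for li in soup.select(selector)]

def predicted(a, b):
    """Text of the items the formula an + b should match."""
    return [f"item {p}" for p in range(1, 7) if matches_formula(p, a, b)]

print(selected("li:nth-child(3n)"), predicted(3, 0))
print(selected("li:nth-child(2n+1)"), predicted(2, 1))
print(selected("li:nth-child(even)"), predicted(2, 0))
```

For 3n, n = 0 gives position 0, which does not exist, so the first match is the third item. The keywords odd and even are shorthand for 2n+1 and 2n.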
  16. nth of Type: The :nth-of-type(n) selector matches the nth element of a specific type (with the same parent!).

    CSS: table:nth-of-type(2) { }

    HTML: <p></p> <table></table> <p></p> <table></table> <p></p> <table></table>
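Comparing :nth-of-type with :nth-child on the same markup shows why the distinction matters when a page interleaves tables with other elements. The sketch assumes bs4 with the html.parser backend; the wrapping div and the id attributes are added here only so that the matches can be identified:

```python
from bs4 import BeautifulSoup

html = """
<div>
  <p>one</p>
  <table id="t1"></table>
  <p>two</p>
  <table id="t2"></table>
  <p>three</p>
  <table id="t3"></table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Second <table> among its siblings, ignoring the <p> elements.
second_table = [t["id"] for t in soup.select("table:nth-of-type(2)")]

# A <table> that is the second child overall: the <p> elements count too.
second_child = [t["id"] for t in soup.select("table:nth-child(2)")]

print(second_table, second_child)
```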