Scraping the Web with PHP

Marcus Moore
February 21, 2018

Feb. 21, 2018 SDPHP Meeting.

Download the PDF to click the links inside of the presentation.

Transcript

  1. WHY? (slide 4)
    > Google...
    > "I want that data but there's no API!"
    > "There is an API but it sucks..."
  2. CAVEATS (slide 5)
    > PHP is not the best tool for the job...
    > JavaScript is a potential issue...
    > Ethics...
  3. (slide 7: no text in the transcript)

  4. OVERALL CONCEPTS - BASICS (slide 8)
    1. Crawler - retrieves resources via HTTP(S) and handles data ingestion
    2. Parser - extracts specific information from the resource (this can be HTML, JSON, images, etc.)
    3. Database - stores the resource for future reference, comparison, or parsing
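The three roles on this slide can be sketched in plain PHP. This is only an illustration under stated assumptions, not code from the talk: `crawl()`, `parseTitle()`, and `PageStore` are hypothetical names, the fetcher is a stand-in that returns canned HTML instead of making an HTTP request, and an in-memory array plays the part of the database.

```php
<?php
declare(strict_types=1);

// Crawler: retrieves a resource. The fetcher is injected so this sketch runs
// without network access; a real one might wrap file_get_contents() or Guzzle.
function crawl(string $url, callable $fetch): string
{
    return $fetch($url);
}

// Parser: extracts one specific piece of information (the <title> text).
function parseTitle(string $html): ?string
{
    // Collect libxml warnings quietly; real-world markup is rarely perfect.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titles = $doc->getElementsByTagName('title');
    return $titles->length > 0 ? trim($titles->item(0)->textContent) : null;
}

// Database: keeps the raw resource plus parsed data for later comparison.
// Hypothetical in-memory store standing in for a real database.
final class PageStore
{
    private array $rows = [];

    public function save(string $url, string $html, ?string $title): void
    {
        // The hash makes it cheap to detect whether a page changed between crawls.
        $this->rows[$url] = ['html' => $html, 'title' => $title, 'hash' => sha1($html)];
    }

    public function get(string $url): ?array
    {
        return $this->rows[$url] ?? null;
    }
}

// Assumed fetcher: always returns the same canned document.
$fakeFetch = function (string $url): string {
    return '<html><head><title> Example </title></head><body><p>Hi</p></body></html>';
};

$store = new PageStore();
$url = 'https://example.com/';
$html = crawl($url, $fakeFetch);
$store->save($url, $html, parseTitle($html));

echo $store->get($url)['title'], PHP_EOL;
```

Keeping the raw HTML alongside the parsed fields means the parser can be changed later and re-run without crawling again.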
  5. Spatie/Crawler (slide 11)
    > Uses Guzzle promises to crawl multiple URLs at the same time
    > Is capable of rendering JavaScript via Puppeteer
    > Example repo: laravel-link-checker
  6. CRAWLER CONFIGURATION (slide 13)
    > ->setMaximumCrawlCount(5)
    > ->setMaximumDepth(2)
    > ->setMaximumResponseSize(1024 * 1024 * 3)
      > defaults to 2 MB for avoiding PDFs and mp3s
    > ->setConcurrency(2)
    > ->executeJavascript()
    > shouldCrawl()
      > Allows you to limit what is crawled
    > Free to implement your own queue
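To make the effect of these limits concrete, here is a minimal sketch of a home-grown crawl queue with a total-count limit, a depth limit, and a should-crawl filter. It is not Spatie's implementation and does not use its API: `crawlWithLimits()` is a hypothetical function, the start page is assumed to be at depth 0, and a hard-coded site map replaces real HTTP requests.

```php
<?php
declare(strict_types=1);

/**
 * Breadth-first crawl with a total-count limit, a depth limit, and a filter.
 *
 * @param callable $linksOf     fn(string $url): string[] links found on a page
 * @param callable $shouldCrawl fn(string $url): bool whether to queue a URL
 * @return string[] URLs in the order they were visited
 */
function crawlWithLimits(
    string $start,
    callable $linksOf,
    callable $shouldCrawl,
    int $maxCount,
    int $maxDepth
): array {
    $queue = new SplQueue();
    $queue->enqueue([$start, 0]);
    $seen = [$start => true];
    $visited = [];

    while (!$queue->isEmpty() && count($visited) < $maxCount) {
        [$url, $depth] = $queue->dequeue();
        $visited[] = $url;

        if ($depth >= $maxDepth) {
            continue; // visit the page, but do not follow its links
        }
        foreach ($linksOf($url) as $next) {
            if (!isset($seen[$next]) && $shouldCrawl($next)) {
                $seen[$next] = true; // never queue the same URL twice
                $queue->enqueue([$next, $depth + 1]);
            }
        }
    }

    return $visited;
}

// Hypothetical site map used instead of real requests.
$site = [
    '/'    => ['/a', '/b', 'https://other.test/'],
    '/a'   => ['/a/1', '/b'],
    '/b'   => ['/file.pdf'],
    '/a/1' => ['/a/1/x'],
];
$linksOf = function (string $url) use ($site): array {
    return $site[$url] ?? [];
};
// Only follow internal links that are not PDFs.
$internalHtmlOnly = function (string $url): bool {
    return strpos($url, '/') === 0 && substr($url, -4) !== '.pdf';
};

print_r(crawlWithLimits('/', $linksOf, $internalHtmlOnly, 5, 2));
```

With these inputs the external link and the PDF are never queued, and `/a/1/x` is skipped because its parent already sits at the depth limit.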
  7. NAVIGATING VIA INDEX OR RELATIONSHIPS (slide 15)

    $crawler->filter('body > p')->eq(1);
    $crawler->filter('body > p')->first();
    $crawler->filter('body > p')->last();
    $crawler->filter('body > p')->siblings();
    $crawler->filter('body')->children();
    $crawler->filter('body > p')->parents();

    ACCESSING VALUES

    $tag = $crawler->filterXPath('//body/*')->nodeName();
    $crawler->filterXPath('//body/p')->text();
    $crawler->filterXPath('//body/p')->attr('class');
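For readers without the crawler library at hand, the same kinds of lookups can be tried with PHP's bundled DOM extension. This is a swapped-in technique (`DOMDocument` and `DOMXPath`), not the `$crawler` API shown on the slide, and the sample markup is made up for the example.

```php
<?php
declare(strict_types=1);

// Made-up sample document.
$html = '<html><body>'
    . '<p class="intro">First</p>'
    . '<p class="detail">Second</p>'
    . '<p class="outro">Third</p>'
    . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// All <p> elements that are direct children of <body> (the 'body > p' selector).
$paragraphs = $xpath->query('//body/p');

// Navigating by index: DOMNodeList is zero-based, so item(1) is the second <p>.
$first  = $paragraphs->item(0);
$second = $paragraphs->item(1);
$last   = $paragraphs->item($paragraphs->length - 1);

// Accessing values: node name, text content, and an attribute.
echo $second->nodeName, PHP_EOL;
echo $second->textContent, PHP_EOL;
echo $second->getAttribute('class'), PHP_EOL;

// Navigating by relationship: the parent, and the <p> siblings that follow.
echo $first->parentNode->nodeName, PHP_EOL;
echo $xpath->query('following-sibling::p', $first)->length, PHP_EOL;
```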
  8. LINKS (slide 16)
    You can find a link by its text, or an image link by the image's alt attribute:

    $crawler->selectLink('Terms of Use')->link();

    // getUri() returns the cleaned-up URL, including query parameters and anchors
    $crawler->selectLink('Terms of Use')->link()->getUri();
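The idea of selecting a link by its text or by a nested image's alt attribute can also be expressed as a single XPath query. The sketch below uses PHP's bundled DOM extension rather than the `$crawler` API; `findLinkHref()` is a hypothetical helper, the markup is invented, and it returns the raw `href` value without the URI clean-up that `getUri()` performs.

```php
<?php
declare(strict_types=1);

// Returns the href of the first <a> whose text, or whose nested image's alt
// attribute, equals $label. Returns null when nothing matches.
function findLinkHref(DOMXPath $xpath, string $label): ?string
{
    // Assumption: $label contains no double quote, since it is embedded
    // directly in the XPath expression.
    $query = sprintf('//a[normalize-space(.) = "%1$s" or .//img[@alt = "%1$s"]]', $label);
    $nodes = $xpath->query($query);

    return $nodes->length > 0 ? $nodes->item(0)->getAttribute('href') : null;
}

// Invented sample markup: one text link and one image link.
$doc = new DOMDocument();
$doc->loadHTML(
    '<html><body>'
    . '<a href="/terms?lang=en#top">Terms of Use</a>'
    . '<a href="/home"><img src="logo.png" alt="Home"></a>'
    . '</body></html>'
);
$xpath = new DOMXPath($doc);

echo findLinkHref($xpath, 'Terms of Use'), PHP_EOL;
echo findLinkHref($xpath, 'Home'), PHP_EOL;
```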
  9. FORMS (slide 18)

    // Get form via submit button text. Could also use #id
    $form = $crawler->selectButton('Submit')->form();

    // Fill the form:
    $form = $crawler->selectButton('Submit')->form([
        'email' => '[email protected]',
    ]);

    > Works with multi-dimensional fields
    > Handles checkboxes, radio buttons and selects
    > Can be used to submit forms via an external client... Goutte
  10. GOUTTE (slide 19)
    Simple API for crawling websites and extracting data from the responses.
    > Wraps up Guzzle and DomCrawler
  11. GETTING BLOCKED... (slide 20)
    > Rate-limiting
    > User-Agent / Cookies
    > Pages might be changing information based on usage
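One way to stay under a rate limit and identify the scraper honestly is to space out requests and send an explicit User-Agent. The following is a rough sketch under stated assumptions: `Throttle` is a hypothetical class, the one-request-per-0.1-seconds interval and the User-Agent string are placeholders, and the actual HTTP call is left commented out.

```php
<?php
declare(strict_types=1);

// Enforces a minimum delay between requests. The clock and sleeper are
// injectable so the behaviour can be checked without actually waiting.
final class Throttle
{
    private float $minInterval;
    private ?float $lastRequestAt = null;
    /** @var callable returns the current time in seconds */
    private $now;
    /** @var callable pauses for the given number of seconds */
    private $sleep;

    public function __construct(float $minIntervalSeconds, ?callable $now = null, ?callable $sleep = null)
    {
        $this->minInterval = $minIntervalSeconds;
        $this->now = $now ?? function (): float {
            return microtime(true);
        };
        $this->sleep = $sleep ?? function (float $seconds): void {
            usleep((int) round($seconds * 1000000));
        };
    }

    // Pauses until the minimum interval has passed since the previous call
    // and returns how many seconds were spent waiting.
    public function wait(): float
    {
        $now = ($this->now)();
        $waited = 0.0;
        if ($this->lastRequestAt !== null) {
            $waited = max(0.0, $this->lastRequestAt + $this->minInterval - $now);
            if ($waited > 0.0) {
                ($this->sleep)($waited);
            }
        }
        $this->lastRequestAt = $now + $waited;

        return $waited;
    }
}

// Stream context that identifies the scraper; the User-Agent is a placeholder.
$context = stream_context_create([
    'http' => [
        'header'  => "User-Agent: ExampleScraper/0.1 (+https://example.com/bot)\r\n",
        'timeout' => 10,
    ],
]);

$throttle = new Throttle(0.1);
foreach (['https://example.com/a', 'https://example.com/b'] as $url) {
    $throttle->wait();
    // $html = file_get_contents($url, false, $context); // real request, disabled here
    echo "would fetch $url", PHP_EOL;
}
```

A fixed interval is the simplest policy; a site's robots.txt or its rate-limit response headers, where present, are a better guide to the right delay.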
  12. OTHER RESOURCES (slide 24)
    > HackerNews post on web scraping overview
    > Violating a Website’s Terms of Service Is Not a Crime
    > Schema.org
    > Fast Web Scraping with ReactPHP