Scraping the Web with PHP

Hi. I'm Marcus.

WHAT?
> "...is data scraping used for extracting data from websites." - Wikipedia

WHY?
> Google...
> "I want that data but there's no API!"
> "There is an API but it sucks..."

CAVEATS
> PHP is not the best tool for the job...
> JavaScript is a potential issue...
> Ethics...

robots.txt
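As a rough illustration of respecting robots.txt before fetching a page, here is a minimal sketch using only core PHP. The function name `isPathAllowed` is hypothetical, and the matching is deliberately simplified: exact user-agent names, plain prefix matching, no wildcard support. A production scraper would want a fuller parser.

```php
<?php
/**
 * Simplified robots.txt check (assumption: prefix matching only, no wildcards).
 * Returns true when $path may be fetched by $userAgent.
 */
function isPathAllowed(string $robotsTxt, string $path, string $userAgent = '*'): bool
{
    $rules = [];          // agent => list of [directive, pathPrefix]
    $currentAgents = [];
    $inRules = false;     // true once the current group has started listing rules

    foreach (preg_split('/\R/', $robotsTxt) as $line) {
        $line = trim((string) preg_replace('/#.*/', '', $line)); // drop comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            if ($inRules) {            // a User-agent line after rules starts a new group
                $currentAgents = [];
                $inRules = false;
            }
            $currentAgents[] = strtolower($value);
        } elseif ($field === 'allow' || $field === 'disallow') {
            $inRules = true;
            foreach ($currentAgents as $agent) {
                $rules[$agent][] = [$field, $value];
            }
        }
    }

    // Use the group for this exact agent name, else fall back to the "*" group.
    $group = $rules[strtolower($userAgent)] ?? $rules['*'] ?? [];

    // The longest matching prefix wins; an empty rule value matches nothing.
    $allowed = true;
    $bestLength = -1;
    foreach ($group as [$directive, $prefix]) {
        if ($prefix === '' || strpos($path, $prefix) !== 0) {
            continue;
        }
        if (strlen($prefix) > $bestLength) {
            $bestLength = strlen($prefix);
            $allowed = ($directive === 'allow');
        }
    }
    return $allowed;
}

$sampleRobots = "User-agent: *\nDisallow: /private/\nAllow: /private/public/\n";
var_dump(isPathAllowed($sampleRobots, '/private/data')); // bool(false)
```

The sample rules are made up for the demonstration; in practice the text would come from the target site's /robots.txt.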

OVERALL CONCEPTS - BASICS
1. Crawler - Retrieves resources via HTTP(S) and handles data ingestion
2. Parser - Extracts specific information from the resource (this can be HTML, JSON, images, etc.)
3. Database - Stores the resource for future reference, comparison, or parsing
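The three roles above can be sketched end to end with nothing but PHP's built-in DOM extension. This is an illustrative outline, not how any particular library does it: the function names are hypothetical, the "crawler" returns a fixed HTML string so the sketch runs offline, and a plain array stands in for the database.

```php
<?php
// Crawler step: a real scraper would make an HTTP request here (e.g. via Guzzle).
// A fixed HTML string stands in for the response so the sketch runs offline.
function fetchResource(string $url): string
{
    return '<html><head><title>Example page</title></head>'
         . '<body><a href="/about">About</a><a href="/contact">Contact</a></body></html>';
}

// Parser step: extract the page title and link targets with DOMDocument/DOMXPath.
function parseResource(string $html): array
{
    $document = new DOMDocument();
    // Suppress warnings that imperfect real-world markup would trigger.
    @$document->loadHTML($html);
    $xpath = new DOMXPath($document);

    $links = [];
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $links[] = $anchor->getAttribute('href');
    }

    return [
        'title' => trim($xpath->evaluate('string(//title)')),
        'links' => $links,
    ];
}

// Database step: an array keyed by URL stands in for a real datastore.
function storeResource(array &$store, string $url, array $parsed): void
{
    $store[$url] = $parsed + ['fetched_at' => time()];
}

$store = [];
$url = 'https://example.com/';
storeResource($store, $url, parseResource(fetchResource($url)));
print_r($store[$url]['links']);
```

Keeping the stages separate means the stored raw or parsed data can be re-processed later without fetching the page again.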

OVERALL CONCEPTS - ADVANCED
1. Advanced Queueing
2. Proxy infrastructure

PHP TOOLS
> guzzlehttp/guzzle
> spatie/crawler
> symfony/dom-crawler and symfony/css-selector
> fabpot/goutte

Spatie/Crawler
> Uses Guzzle promises to crawl multiple URLs at the same time
> Is capable of rendering JavaScript via Puppeteer
> Example repo: laravel-link-checker

CRAWLER EXAMPLE...

CRAWLER CONFIGURATION
> ->setMaximumCrawlCount(5)
> ->setMaximumDepth(2)
> ->setMaximumResponseSize(1024 * 1024 * 3)
  > defaults to 2 MB for avoiding PDFs and MP3s
> ->setConcurrency(2)
> ->executeJavascript()
> shouldCrawl()
  > Allows you to limit what is crawled
> Free to implement your own queue

DOM-CRAWLER

NAVIGATING VIA INDEX OR RELATIONSHIPS
$crawler->filter('body > p')->eq(1);
$crawler->filter('body > p')->first();
$crawler->filter('body > p')->last();
$crawler->filter('body > p')->siblings();
$crawler->filter('body')->children();
$crawler->filter('body > p')->parents();

ACCESSING VALUES
$tag = $crawler->filterXPath('//body/*')->nodeName();
$crawler->filterXPath('//body/p')->text();
$crawler->filterXPath('//body/p')->attr('class');

LINKS
You can find a link by its text, or an image link by the image's alt attribute:
$crawler->selectLink('Terms of Use')->link();
// getUri() cleans up the URL and keeps any query parameters or anchors
$crawler->selectLink('Terms of Use')->link()->getUri();

IMAGES
Images can be found by their alt attribute:
$crawler->selectImage('Kitten')->image()->getUri();

FORMS
// Get the form via its submit button text. Could also use the button's id.
$form = $crawler->selectButton('Submit')->form();
// Fill the form:
$form = $crawler->selectButton('Submit')->form([
    'email' => '[email protected]',
]);
> Works with multi-dimensional fields
> Handles checkboxes, radio buttons and selects
> Can be used to submit forms via an external client such as Goutte

GOUTTE
Simple API for crawling websites and extracting data from the responses.
> Wraps up Guzzle and DomCrawler

GETTING BLOCKED...
> Rate-limiting
> User-Agent / Cookies
> Pages might be changing information based on usage
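One way to reduce the chance of being rate-limited is to space requests out and identify the scraper honestly. The sketch below uses only core PHP; the `Throttle` class, the `buildContext` helper, the 0.2 second interval, and the User-Agent string are all hypothetical choices for illustration, and the actual network call is left commented out.

```php
<?php
// Hypothetical helper: enforces a minimum delay between consecutive requests.
final class Throttle
{
    private float $minIntervalSeconds;
    private ?float $lastRequestAt = null;

    public function __construct(float $minIntervalSeconds)
    {
        $this->minIntervalSeconds = $minIntervalSeconds;
    }

    // Seconds the caller should still wait before a request made at time $now.
    public function secondsToWait(float $now): float
    {
        if ($this->lastRequestAt === null) {
            return 0.0;
        }
        return max(0.0, $this->minIntervalSeconds - ($now - $this->lastRequestAt));
    }

    // Sleeps if needed, then records the time of the request.
    public function wait(): void
    {
        $delay = $this->secondsToWait(microtime(true));
        if ($delay > 0) {
            usleep((int) round($delay * 1000000));
        }
        $this->lastRequestAt = microtime(true);
    }
}

// Stream context that sends an identifying User-Agent header (placeholder value).
function buildContext(string $userAgent)
{
    return stream_context_create([
        'http' => ['header' => "User-Agent: {$userAgent}\r\n", 'timeout' => 10],
    ]);
}

$throttle = new Throttle(0.2);
$context = buildContext('ExampleScraper/1.0 (+https://example.com/bot)');
foreach (['https://example.com/a', 'https://example.com/b'] as $url) {
    $throttle->wait();
    // $html = file_get_contents($url, false, $context); // network call left out
}
```

A fixed interval is the simplest policy; backing off when the server answers with HTTP 429 would be a natural extension.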

OTHER PHP TOOLS
> paquettg/php-html-parser
> NetResearch/JSONMapper
> Rican7/incoming

TOOLING IN NODE
> Puppeteer

TOOLING IN PYTHON
> Scrapy
> ToApi

OTHER RESOURCES
> HackerNews post on web scraping overview
> Violating a Website's Terms of Service Is Not a Crime
> Schema.org
> Fast Web Scraping with ReactPHP

QUESTIONS?

THANKS!

Marcus Moore
