Scraping the web

#sfugbln 2016 Web Scraping: The Good, The Bad and The Exciting
@fanatique

Alex Thomas, @fanatique, /fanatique
robust software, resilient teams, php, js, berlin, sailing, …

Web Scraping: reverse engineering a web frontend in order to extract structured data from a website that doesn’t offer an API…

Web Scraping
    { "title": "Symfony", "author": "Potencier", "language": "PHP", ... ... ... }
Gear image by Verdy_p

Can this be legal?

✓ Personal Use*
*according to my understanding.

X Unlicensed Commercial Use!*
*according to my understanding.

~
• Unlicensed Commercial Use
• Only Facts (hard to prove)
(I wouldn’t recommend it!)*
Example: Wannsee Weather This Afternoon: Temperature: 11° C, Wind: 3 – 4 Bft.
*according to my understanding.

weather_scraper.php
    $html = file_get_contents($url);
    $regex = '/rain:(.*)/';
    // $match[0] is the whole match, $match[1] the captured group.
    if (preg_match($regex, $html, $match) && trim($match[1]) === 'none') {
        sendPushMessage(trim($match[1]));
    }
1. Harvesting 2. Extracting 3. Processing

1. Harvesting, aka obtaining the DOM

file_get_contents gets you nowhere
• very limited support for headers, cookies etc.
• no way to evaluate return values in detail
• only sync requests
    $opts = array(
        'http' => array(
            'method' => "GET",
            'header' => "Accept-language: en\r\n" .
                        "Cookie: foo=bar\r\n"
        )
    );
    $context = stream_context_create($opts);
    $htmlStr = file_get_contents($url, false, $context);

Do yourself a favour: use at least guzzle with curl
• Async requests
• Detailed header manipulation possible
• Simple interface
    // Send an asynchronous request.
    $client = new \GuzzleHttp\Client();
    $request = new \GuzzleHttp\Psr7\Request('GET', $url);
    $promise = $client->sendAsync($request)->then(
        function ($response) {
            $result = extractData($response->getBody());
        }
    );
    $promise->wait(); // block until the request has completed

And always store what you harvested!
[Diagram: http://example.com ... $$$ </> </> </>]
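
One way to store what was harvested, as a minimal sketch: it assumes the raw pages go to the local file system. The helper name storeHarvest(), the directory and the hashing scheme are hypothetical and not taken from the talk.

```php
<?php
// Hypothetical helper: persist the raw response before any extraction happens.
// Directory layout and file naming are assumptions made for illustration.
function storeHarvest(string $url, string $html, string $dir): string
{
    if (!is_dir($dir) && !mkdir($dir, 0777, true) && !is_dir($dir)) {
        throw new RuntimeException("Cannot create directory: $dir");
    }
    // Hash the URL so that every page gets a stable, filesystem-safe name.
    $path = $dir . '/' . sha1($url) . '.html';
    if (file_put_contents($path, $html) === false) {
        throw new RuntimeException("Cannot write file: $path");
    }
    return $path;
}

// Usage: store once, then re-run the extraction from disk as often as needed.
$rawHtml = '<html><body>rain: none</body></html>';
$path = storeHarvest('http://example.com', $rawHtml, sys_get_temp_dir() . '/harvest');
$html = file_get_contents($path);
```

With the raw HTML on disk, a changed selector or a bug in the extractor does not force another request against the target site.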

Unfortunately, that still does not solve your hardest problem in harvesting: obtaining the right DOM!

This is how guzzle sees your target site: [image]

While in reality it looks like this: [image]

The only solution (I know) is: Browser automation through Selenium!
    $welcomeText = $driver
        ->findElement(WebDriverBy::id('id'))
        ->getText();
    $driver
        ->findElement(WebDriverBy::id('id'))
        ->click();
[Diagram: Selenium Server]

Scraping with Selenium
✓ Can handle JS and XHR!!
✓ Proper cookie handling
✓ Loads assets
✓ Can do screenshots(!)
- Veeeryy sloow
- Response headers are not accessible
- Heavyweight infrastructure
All in all it behaves like a real browser, because it is a real browser.

facebook/php-webdriver
• Is the most current implementation of the Selenium WebDriver protocol in PHP
• Actively maintained
• Not very well documented (imho)…
    $driver->get($this->getTestPath('index.html'));
    $welcomeText = $driver
        ->findElement(WebDriverBy::id('welcome'))
        ->getText();

Usual challenges:
• IP blocking (because of too many requests)
• Bandwidth throttling / no responses
• Fake responses
• …
Maybe your scraper should be a bit nicer!
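
Being "a bit nicer" could mean spacing requests out, for instance. The following is only a sketch under that assumption: the Throttle class and the delay value are hypothetical, and a real scraper would pick a delay that suits the target site.

```php
<?php
// Hypothetical throttle: makes sure a minimum delay lies between two requests.
final class Throttle
{
    private $minDelay;
    private $last = 0.0;

    public function __construct(float $minDelay)
    {
        $this->minDelay = $minDelay;
    }

    /** Sleeps if needed and returns the number of seconds waited. */
    public function wait(): float
    {
        $elapsed = microtime(true) - $this->last;
        $pause = max(0.0, $this->minDelay - $elapsed);
        if ($pause > 0) {
            usleep((int) round($pause * 1000000));
        }
        $this->last = microtime(true);
        return $pause;
    }
}

// Usage: call wait() right before every request to the same host.
$throttle = new Throttle(0.2); // assumed value, in seconds
$first = $throttle->wait();    // no previous request, so no pause
$second = $throttle->wait();   // pauses for roughly the minimum delay
```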

2. Extracting (spoiler: you don’t use regex anymore)

There is a simple solution for extracting DOM data in PHP: the Symfony DomCrawler component
• Lets you filter the DOM using CSS selectors
• Contains convenience methods (like first(), siblings(), parents() etc.)
• Supports XPath, which is even more powerful!
    use Symfony\Component\DomCrawler\Crawler;

    $crawler = new Crawler($htmlStr);
    $results = $crawler->filter('body > p');
    foreach ($results as $result) {
        doCleverStuffWith($result->textContent);
    }

Inject a remote JS file into a target site.
    var s = document.createElement('script');
    s.setAttribute('src', '//xmpl.co/scrpr.js');
    document.head.appendChild(s);
Result:
    { "title": "Symfony", "author": "Potencier", "language": "PHP", ... ... ... }

Common Challenges:
• shuffling DOM elements or
• storing content in images (especially prices)

Solution: [image]

3. Processing (you’ve done stuff like this before)

Data extracted using CSS selectors or XPath usually looks like this:
    Preis: 3,98 EUR
    - Amount: 1 - Color: Black - Weight: 1kg
    3,7 - 23,4
    mail(at)example.com
    ...
You always have to validate and transform the results!
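
Validation and transformation of such raw strings could look roughly like this. It is a sketch only: parsePrice() and parseEmail() are hypothetical helpers, and the accepted formats are assumptions derived from the sample values on the slide.

```php
<?php
/** Turns "Preis: 3,98 EUR" into ['amount' => 3.98, 'currency' => 'EUR'], or null. */
function parsePrice(string $raw): ?array
{
    // Assumed format: decimal comma, two decimals, three-letter currency code.
    if (!preg_match('/(\d+),(\d{2})\s*([A-Z]{3})/', $raw, $m)) {
        return null; // validation failed: not a price we understand
    }
    return ['amount' => (float) ($m[1] . '.' . $m[2]), 'currency' => $m[3]];
}

/** Turns "mail(at)example.com" into a validated address, or null. */
function parseEmail(string $raw): ?string
{
    $candidate = str_replace('(at)', '@', trim($raw));
    $valid = filter_var($candidate, FILTER_VALIDATE_EMAIL);
    return $valid === false ? null : $valid;
}

$price = parsePrice('Preis: 3,98 EUR');
$email = parseEmail('mail(at)example.com');
$broken = parsePrice('3,7 - 23,4'); // does not match the assumed price format
```

Returning null for anything that does not validate keeps broken values out of the processed data instead of guessing what they meant.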

Validation & transformation of input data in Symfony?
bubble icon by irene hoffman

Shortcuts

Shortcuts – internal APIs: Always look for internal APIs

Shortcuts – internal APIs: Internal APIs are also provided by Tag Managers
    $host = 'http://localhost:4444/wd/hub';
    $driver = RemoteWebDriver::create($host, DesiredCapabilities::firefox());
    $driver->get('http://a.website.de');
    // Read the tag manager's data layer from the page's JS context.
    $dataLayer = $driver->executeScript("return dataLayer;", []);
    var_dump($dataLayer[0]);

Shortcuts – schema.org: the semantic web is your friend!
• It’s used to embed structured data on web sites for use by search engines and other applications.
• Standardised description of entities (e.g. address, product, recipe) and their relationships.
    <div itemscope itemtype="http://schema.org/PostalAddress">
        Example Inc. P.O. Box 4321 Some Town, GB 12212 Takatuka Land
    </div>
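
Picking up such markup could be sketched as follows. It uses PHP's built-in DOM extension rather than any library from the talk; findItems() is a hypothetical helper, and it only returns the text of each item, not its individual properties.

```php
<?php
// Hypothetical helper: returns the text of every element that declares
// an item of the given schema.org type.
function findItems(string $html, string $type): array
{
    $doc = new DOMDocument();
    // Suppress warnings about sloppy real-world markup.
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    $items = [];
    // $type is assumed not to contain double quotes.
    foreach ($xpath->query('//*[@itemscope and @itemtype="' . $type . '"]') as $node) {
        // Collapse whitespace so the text is easier to process later.
        $items[] = trim(preg_replace('/\s+/', ' ', $node->textContent));
    }
    return $items;
}

$html = '<div itemscope itemtype="http://schema.org/PostalAddress">
    Example Inc. P.O. Box 4321 Some Town, GB 12212 Takatuka Land
</div>';
$items = findItems($html, 'http://schema.org/PostalAddress');
```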

Shortcuts – PhantomJs Cloud

Shortcuts – Scrapinghub

All Together Now
[Diagram: Harvester, Extractor and Processor connected through a Worker Queue. Labels: the internetz, raw data, store extraction job, fetch job, structured data, store processing job, fetch job.]

Thank you!
