Scraping the web - Speaker Deck

Scraping the web

by a.thomas

Slide 1

Slide 1 text

#sfugbln 2016 Web Scraping: The Good, The Bad and The Exciting @fanatique

Slide 2

Slide 2 text

#sfugbln 2016 A. Thomas 2 Alex Thomas @fanatique /fanatique robust software, resilient teams, php, js, berlin, sailing, … !

Slide 3

Slide 3 text

Web Scraping: reverse engineering a web frontend in order to extract structured data from a website that doesn’t offer an API…

Slide 4

Slide 4 text

Web Scraping
{ "title": "Symfony", "author": "Potencier", "language": "PHP", ... ... ... }
Gear image by Verdy_p

Slide 5

Slide 5 text

Can this be legal?

Slide 6

Slide 6 text

✓ Personal Use*
*according to my understanding.

Slide 7

Slide 7 text

X Unlicensed Commercial Use!*
*according to my understanding.

Slide 8

Slide 8 text

~
• Unlicensed Commercial Use
• Only Facts (hard to prove)
(I wouldn’t recommend it!)*
Wannsee Weather This Afternoon: Temperature: 11° C, Wind: 3 – 4 Bft.
*according to my understanding.

Slide 9

Slide 9 text

weather_scraper.php

    // 1. Harvesting
    $html = file_get_contents($url);

    // 2. Extracting
    $regex = '/rain:(.*)/';

    // 3. Processing
    // $match[1] holds the captured group; $match[0] would be the whole match.
    if (preg_match($regex, $html, $match) && trim($match[1]) === 'none') {
        sendPushMessage($match[1]);
    }

Slide 10

Slide 10 text

1. Harvesting, aka obtaining the DOM

Slide 11

Slide 11 text

1. Harvesting
file_get_contents gets you nowhere
• very limited support for headers, cookies etc.
• no way to evaluate return values in detail
• only synchronous requests

    $opts = array(
        'http' => array(
            'method' => "GET",
            'header' => "Accept-language: en\r\n" .
                        "Cookie: foo=bar\r\n"
        )
    );
    $context = stream_context_create($opts);
    $htmlStr = file_get_contents($url, false, $context);

Slide 12

Slide 12 text

1. Harvesting
Do yourself a favour: use at least Guzzle with cURL
• Async requests
• Detailed header manipulation possible
• Simple interface

    // Send an asynchronous request.
    $client = new \GuzzleHttp\Client();
    $request = new \GuzzleHttp\Psr7\Request('GET', $url);
    $promise = $client->sendAsync($request)->then(
        function ($response) {
            // extractData() stands for your own extraction step.
            return extractData($response->getBody());
        }
    );
    // Block until the request has completed and fetch the extracted data.
    $result = $promise->wait();

Slide 13

Slide 13 text

1. Harvesting
And always store what you harvested!
http://example.com ... $$$
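One way to store what you harvested is to write every raw response to disk before any extraction runs, so the extraction step can be repeated later without requesting the page again. The following is only a minimal sketch: the function name storeHarvest(), the directory layout and the file naming scheme are assumptions made for illustration, not part of the talk.

```php
<?php
// Hypothetical helper: persist a raw response so that extraction can be
// re-run later without hitting the target site again.
function storeHarvest(string $dir, string $url, string $body, ?int $timestamp = null): string
{
    $timestamp = $timestamp ?? time();
    if (!is_dir($dir) && !mkdir($dir, 0777, true) && !is_dir($dir)) {
        throw new RuntimeException("Cannot create directory: $dir");
    }
    // One file per URL and fetch time; the hash keeps file names filesystem-safe.
    $path = sprintf('%s/%s-%d.html', rtrim($dir, '/'), sha1($url), $timestamp);
    if (file_put_contents($path, $body) === false) {
        throw new RuntimeException("Cannot write file: $path");
    }
    return $path;
}

// Usage: store a (here hard-coded) response body in a temporary directory.
$body = '<html><body>raw page</body></html>';
$dir  = sys_get_temp_dir() . '/harvest-' . uniqid();
$path = storeHarvest($dir, 'http://example.com', $body, 1456790400);
echo $path, PHP_EOL;
```

Keeping the fetch time in the file name means that several snapshots of the same URL can sit side by side.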

Slide 14

Slide 14 text

Unfortunately, that still does not solve your hardest problem in harvesting: obtaining the right DOM!

Slide 15

Slide 15 text

1. Harvesting
This is how Guzzle sees your target site:

Slide 16

Slide 16 text

1. Harvesting
While in reality it looks like this:

Slide 17

Slide 17 text

1. Harvesting
The only solution (I know) is: browser automation through Selenium!

    $welcomeText = $driver
        ->findElement(WebDriverBy::id('id'))
        ->getText();

    // Clicking works on the element, not on the driver.
    $driver
        ->findElement(WebDriverBy::id('id'))
        ->click();

Selenium Server

Slide 18

Slide 18 text

1. Harvesting
Scraping with Selenium
✓ Can handle JS and XHR!!
✓ Proper cookie handling
✓ Loads assets
✓ Can take screenshots(!)
- Veeeryy sloow
- Response headers are not accessible
- Heavyweight infrastructure
All in all it behaves like a real browser – because it is a real browser.

Slide 19

Slide 19 text

1. Harvesting
facebook/php-webdriver
• Is the most current implementation of the Selenium WebDriver protocol in PHP
• Actively maintained
• Not very well documented (imho)…

    $driver->get($this->getTestPath('index.html'));
    $welcomeText = $driver
        ->findElement(WebDriverBy::id('welcome'))
        ->getText();

Slide 20

Slide 20 text

1. Harvesting
Usual challenges:
• IP blocking (because of too many requests)
• Bandwidth throttling / no responses
• Fake responses
• …
Maybe your scraper should be a bit nicer!
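A "nicer" scraper can start with something as small as a guaranteed pause between two requests to the same host. The sketch below shows one possible way to do that; the class name Throttle, its methods and the delay value are assumptions for illustration, and the right pause depends entirely on the target site.

```php
<?php
// Hypothetical throttle: guarantees a minimum pause between two requests.
final class Throttle
{
    private $minDelay;            // minimum pause in seconds
    private $lastRequest = null;  // time of the previous request, if any

    public function __construct(float $minDelaySeconds)
    {
        $this->minDelay = $minDelaySeconds;
    }

    // How long the caller still has to wait at time $now (in seconds).
    public function secondsToWait(float $now): float
    {
        if ($this->lastRequest === null) {
            return 0.0;
        }
        return max(0.0, $this->lastRequest + $this->minDelay - $now);
    }

    // Record that a request was sent at time $now.
    public function markRequest(float $now): void
    {
        $this->lastRequest = $now;
    }

    // Sleep if necessary, then record the request.
    public function wait(): void
    {
        $pause = $this->secondsToWait(microtime(true));
        if ($pause > 0) {
            usleep((int) round($pause * 1000000));
        }
        $this->markRequest(microtime(true));
    }
}

// Usage: at most one request every 0.2 seconds.
$throttle = new Throttle(0.2);
foreach (['http://example.com/a', 'http://example.com/b'] as $url) {
    $throttle->wait();
    // Fetch $url here, e.g. with Guzzle.
    echo "fetching $url", PHP_EOL;
}
```

Separating secondsToWait() from wait() keeps the timing logic testable without actually sleeping.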

Slide 21

Slide 21 text

2. Extracting (spoiler: you don’t use regex anymore)

Slide 22

Slide 22 text

2. Extracting
There is a simple solution for extracting DOM data in PHP: the Symfony DomCrawler component.
• Lets you filter the DOM using CSS selectors
• Contains convenience methods (like first(), siblings(), parents() etc.)
• Supports XPath – which is even more powerful!

    use Symfony\Component\DomCrawler\Crawler;

    $crawler = new Crawler($htmlStr);
    $results = $crawler->filter('body > p');
    foreach ($results as $result) {
        doCleverStuffWith($result->textContent);
    }
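To illustrate the XPath point, here is a small sketch that swaps in PHP's built-in DOM extension (DOMDocument and DOMXPath) so that it runs without Composer. The same kind of expression can be passed to the Crawler's filterXPath() method. The sample markup is made up for the example.

```php
<?php
// Made-up sample markup; in practice this is the harvested HTML string.
$htmlStr = <<<HTML
<html><body>
  <p class="intro">Hello</p>
  <div><p>nested</p></div>
  <p>World</p>
</body></html>
HTML;

$doc = new DOMDocument();
// Real-world markup is rarely valid: collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($htmlStr);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Same result as the CSS selector "body > p": direct <p> children of <body>.
$texts = [];
foreach ($xpath->query('/html/body/p') as $node) {
    $texts[] = trim($node->textContent);
}

// Something plain CSS selectors cannot express: select by text content.
$world = $xpath->query('//p[contains(text(), "World")]');

print_r($texts);
echo $world->length, PHP_EOL;
```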

Slide 23

Slide 23 text

2. Extracting
Inject a remote JS file into a target site.

    var s = document.createElement('script');
    s.setAttribute('src', '//xmpl.co/scrpr.js');
    document.head.appendChild(s);

    { "title": "Symfony", "author": "Potencier", "language": "PHP", ... ... ... }

Slide 24

Slide 24 text

2. Extracting
Common Challenges:
• shuffling DOM elements or
• storing content in images (especially prices)

Slide 25

Slide 25 text

Solution:

Slide 26

Slide 26 text

3. Processing (you’ve done stuff like this before)

Slide 27

Slide 27 text

3. Processing
• Extracted data using CSS selectors or XPath usually looks like this:

    Preis: 3,98 EUR
    - Amount: 1 - Color: Black - Weight: 1kg
    3,7 - 23,4
    mail(at)example.com
    ...

You always have to validate and transform the results!
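As a rough illustration of "validate and transform", the sketch below turns three of the raw strings above into typed values and rejects anything that does not fit. The function names and the regular expressions are assumptions about these particular input formats, not a general-purpose parser.

```php
<?php
// Hypothetical helpers; the patterns are assumptions about the sample inputs.

// "Preis: 3,98 EUR" -> ['amount' => 3.98, 'currency' => 'EUR']
function parsePrice(string $raw): array
{
    if (!preg_match('/(\d+(?:,\d+)?)\s*([A-Z]{3})/', $raw, $m)) {
        throw new InvalidArgumentException("Not a price: $raw");
    }
    // German decimal comma -> decimal point before casting.
    return ['amount' => (float) str_replace(',', '.', $m[1]), 'currency' => $m[2]];
}

// "3,7 - 23,4" -> [3.7, 23.4]
function parseRange(string $raw): array
{
    if (!preg_match('/^\s*(\d+(?:,\d+)?)\s*-\s*(\d+(?:,\d+)?)\s*$/', $raw, $m)) {
        throw new InvalidArgumentException("Not a range: $raw");
    }
    $min = (float) str_replace(',', '.', $m[1]);
    $max = (float) str_replace(',', '.', $m[2]);
    if ($min > $max) {
        throw new InvalidArgumentException("Range is reversed: $raw");
    }
    return [$min, $max];
}

// "mail(at)example.com" -> "mail@example.com"
function parseEmail(string $raw): string
{
    $candidate = str_replace('(at)', '@', trim($raw));
    if (filter_var($candidate, FILTER_VALIDATE_EMAIL) === false) {
        throw new InvalidArgumentException("Not an e-mail address: $raw");
    }
    return $candidate;
}

var_dump(parsePrice('Preis: 3,98 EUR'));
var_dump(parseRange('3,7 - 23,4'));
var_dump(parseEmail('mail(at)example.com'));
```

Throwing on anything unexpected makes layout changes on the target site show up as errors instead of silently producing wrong data.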

Slide 28

Slide 28 text

3. Processing
Validation & transformation of input data in Symfony?
bubble icon by irene hoffman

Slide 29

Slide 29 text

Shortcuts

Slide 30

Slide 30 text

Shortcuts – internal APIs
Always look for internal APIs

Slide 31

Slide 31 text

Shortcuts – internal APIs
Always look for internal APIs

Slide 32

Slide 32 text

Shortcuts – internal APIs
Always look for internal APIs

Slide 33

Slide 33 text

Shortcuts – internal APIs
Internal APIs are also provided by Tag Managers

Slide 34

Slide 34 text

Shortcuts – internal APIs
Internal APIs are also provided by Tag Managers

    $host = 'http://localhost:4444/wd/hub';
    $driver = RemoteWebDriver::create($host, DesiredCapabilities::firefox());
    $driver->get('http://a.website.de');
    // Read the tag manager's global dataLayer variable from the page.
    $dataLayer = $driver->executeScript("return dataLayer;", []);
    var_dump($dataLayer[0]);

Slide 35

Slide 35 text


Example Inc. P.O. Box 4321 Some Town, GB 12212 Takatuka Land

Shortcuts – schema.org
schema.org – the semantic web is your friend!
• It’s used to embed structured data on web sites for use by search engines – and other applications.
• Standardised description of entities (e.g. address, product, recipe) and their relationships.
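When a page carries schema.org microdata, extraction can key on the itemprop attributes instead of on layout-specific selectors. The sketch below assumes the address above was marked up as a schema.org PostalAddress. The markup is a reconstruction for illustration only (the transcript lost the original tags), and the code uses PHP's built-in DOM extension.

```php
<?php
// Assumed markup: a reconstruction of the address example as schema.org
// microdata. The property names are real PostalAddress properties.
$html = <<<HTML
<div itemscope itemtype="http://schema.org/PostalAddress">
  <span itemprop="name">Example Inc.</span>
  P.O. Box <span itemprop="postOfficeBoxNumber">4321</span>
  <span itemprop="addressLocality">Some Town</span>,
  <span itemprop="addressRegion">GB</span>
  <span itemprop="postalCode">12212</span>
  <span itemprop="addressCountry">Takatuka Land</span>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);  // tolerate imperfect real-world HTML
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Collect every itemprop inside the PostalAddress item as name => value.
$address = [];
$query = '//*[@itemtype="http://schema.org/PostalAddress"]//*[@itemprop]';
foreach ($xpath->query($query) as $node) {
    $address[$node->getAttribute('itemprop')] = trim($node->textContent);
}

print_r($address);
```

Because the property names are standardised, the same extraction code can in principle be reused across sites that use this vocabulary.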

Slide 36

Slide 36 text

Shortcuts – PhantomJs Cloud

Slide 37

Slide 37 text

Shortcuts – Scrapinghub

Slide 38

Slide 38 text

All Together Now
[Architecture diagram. Components: the internetz, Harvester, raw data, Extractor, Processor, Structured data, Worker Queue. Job labels: fetch job, store extraction job, store processing job.]

Slide 39

Slide 39 text

Thank you!