Scraping the web

a.thomas
October 27, 2016

Web scraping techniques are a bit like porn. A lot of us use them, but no one really talks about them. Maybe that's why scraping applications often end up as a bunch of scripts hacked together by some guy who left the company ages ago.
How sad! Let me show you that reverse engineering a frontend in order to extract data is actually quite a fun challenge, and that there are lots of amazing tools out there that do most of the hard work for you.


Transcript

  1. #sfugbln 2016 Web Scraping: The Good, The Bad and The Exciting
    me@alexander-thomas.net @fanatique
  2. Alex Thomas @fanatique /fanatique
    robust software, resilient teams, php, js, berlin, sailing, …
  3. Web Scraping: reverse engineering a web frontend in order to extract structured data from a website that doesn’t offer an API…
  4. Web Scraping
    { "title": "Symfony", "author": "Potencier", "language": "PHP", ... }
    Gear image by Verdy_p
  5. Can this be legal?

  6. ✓ Personal Use*
    *according to my understanding.
  7. X Unlicensed Commercial Use!*
    *according to my understanding.
  8. ~
    • Unlicensed Commercial Use
    • Only Facts (hard to prove)
    (I wouldn’t recommend it!)*
    Wannsee Weather This Afternoon: Temperature: 11° C, Wind: 3 – 4 Bft.
    *according to my understanding.
  9. weather_scraper.php
    $html = file_get_contents($url);
    $regex = '/rain:(.*)/';
    // $match[0] is the whole match, $match[1] the captured group
    if (preg_match($regex, $html, $match) && trim($match[1]) === 'none') {
        sendPushMessage($match[1]);
    }
    Steps: 1. Harvesting, 2. Extracting, 3. Processing
  10. 1. Harvesting, aka obtaining the DOM

  11. file_get_contents gets you nowhere
    • very limited support for headers, cookies etc.
    • no way to evaluate return values in detail
    • only sync requests
    $opts = array(
        'http' => array(
            'method' => "GET",
            'header' => "Accept-language: en\r\n" . "Cookie: foo=bar\r\n"
        )
    );
    $context = stream_context_create($opts);
    $htmlStr = file_get_contents($url, false, $context);
  12. Do yourself a favour: use at least guzzle with curl
    • Async requests
    • Detailed header manipulation possible
    • Simple interface
    // Send an asynchronous request.
    $client = new \GuzzleHttp\Client();
    $request = new \GuzzleHttp\Psr7\Request('GET', $url);
    $promise = $client->sendAsync($request)->then(
        function ($response) {
            $result = extractData($response->getBody());
        }
    );
    $promise->wait(); // block until the request has completed
  13. And always store what you harvested!
    [Diagram: http://example.com ... $$$ ... </> </> </>]
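A minimal sketch of what "store what you harvested" could look like, so that extraction can be re-run later without hitting the target site again. The function name `storeHarvest`, the storage directory and the file naming scheme are assumptions for illustration and do not come from the talk.

```php
<?php
// Hypothetical helper: persist the raw response before any extraction happens.
function storeHarvest(string $url, string $body, string $storageDir, ?int $timestamp = null): string
{
    $timestamp = $timestamp ?? time();

    // Create the storage directory on first use.
    if (!is_dir($storageDir) && !mkdir($storageDir, 0775, true) && !is_dir($storageDir)) {
        throw new RuntimeException("Cannot create $storageDir");
    }

    // Hash the URL so any URL maps to a safe file name; keep the timestamp
    // so repeated fetches of the same page do not overwrite each other.
    $path = sprintf('%s/%s-%d.html', rtrim($storageDir, '/'), sha1($url), $timestamp);

    if (file_put_contents($path, $body) === false) {
        throw new RuntimeException("Cannot write $path");
    }

    return $path;
}
```

Keeping the raw document means a broken selector only costs a re-extraction, not a second round of requests.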
  14. Unfortunately, that still does not solve your hardest problem in harvesting: obtaining the right DOM!
  15. This is how guzzle sees your target site:
  16. While in reality it looks like this:
  17. The only solution (I know) is: browser automation through Selenium!
    $welcomeText = $driver
        ->findElement(WebDriverBy::id('id'))
        ->getText();
    $driver
        ->findElement(WebDriverBy::id('id'))
        ->click();
    (The script talks to a Selenium Server.)
  18. Scraping with Selenium
    ✓ Can handle JS and XHR!!
    ✓ Proper cookie handling
    ✓ Loads assets
    ✓ Can do screenshots(!)
    - Veeeryy sloow
    - Response headers are not accessible
    - Heavyweight infrastructure
    All in all it behaves like a real browser.
    – Because it is a real browser.
  19. facebook/php-webdriver
    • Is the most current implementation of the Selenium WebDriver protocol in PHP
    • Actively maintained
    • Not very well documented (imho)…
    $driver->get($this->getTestPath('index.html'));
    $welcomeText = $driver
        ->findElement(WebDriverBy::id('welcome'))
        ->getText();
  20. Usual challenges:
    • IP blocking (because of too many requests)
    • Bandwidth throttling / no responses
    • Fake responses
    • …
    Maybe your scraper should be a bit nicer!
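One way a scraper could be "a bit nicer" is to slow down when the target stops answering. The sketch below uses exponential backoff, a technique swapped in here for illustration and not named in the talk. `backoffDelay` and `politeFetch` are hypothetical names, and `$fetch` is assumed to be any callable that returns the body or `null` on failure.

```php
<?php
// Hypothetical helper: seconds to wait before the next attempt.
// attempt 0 -> base, 1 -> 2 * base, 2 -> 4 * base, ... capped at $maxSeconds.
function backoffDelay(int $attempt, float $baseSeconds = 1.0, float $maxSeconds = 60.0): float
{
    return min($maxSeconds, $baseSeconds * (2 ** $attempt));
}

// Hypothetical wrapper: retry a fetch, waiting longer after each failure.
function politeFetch(callable $fetch, string $url, int $maxAttempts = 4, float $baseSeconds = 1.0): ?string
{
    for ($attempt = 0; $attempt < $maxAttempts; $attempt++) {
        $body = $fetch($url);
        if ($body !== null) {
            return $body;
        }
        // Only wait if another attempt will follow.
        if ($attempt < $maxAttempts - 1) {
            usleep((int) (backoffDelay($attempt, $baseSeconds) * 1000000));
        }
    }

    return null; // give up instead of hammering the site
}
```

With the defaults, a failing URL is retried after roughly 1, 2 and 4 seconds and then dropped.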
  21. 2. Extracting (spoiler: you don’t use regex anymore)

  22. There is a simple solution for extracting DOM data in PHP:
    use Symfony\Component\DomCrawler\Crawler;
    $crawler = new Crawler($htmlStr);
    $results = $crawler->filter('body > p');
    foreach ($results as $result) {
        doCleverStuffWith($result->textContent);
    }
    • Allows filtering the DOM using CSS selectors
    • Contains convenience methods (like first(), siblings(), parents() etc.)
    • Supports XPath – which is even more powerful!
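To make the XPath point concrete, here is a small sketch that selects the same nodes as the CSS selector `body > p`. It deliberately uses PHP's built-in DOM extension (`DOMDocument`, `DOMXPath`) instead of the DomCrawler component, so that it runs without Composer. The function name `extractParagraphs` is an assumption.

```php
<?php
// Hypothetical helper: return the text of every <p> that is a direct child of <body>.
function extractParagraphs(string $htmlStr): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($htmlStr);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $texts = [];

    // XPath equivalent of the CSS selector 'body > p'.
    foreach ($xpath->query('//body/p') as $node) {
        $texts[] = trim($node->textContent);
    }

    return $texts;
}
```

XPath can go further than CSS selectors, for example by filtering on text content or walking up to a parent node.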
  23. Inject a remote JS file into a target site.
    var s = document.createElement('script');
    s.setAttribute('src','//xmpl.co/scrpr.js');
    document.head.appendChild(s);
    { "title": "Symfony", "author": "Potencier", "language": "PHP", ... }
  24. Common Challenges:
    • shuffling DOM elements or
    • storing content in images (especially prices)
  25. Solution:

  26. 3. Processing (you’ve done stuff like this before)

  27. Extracted data using CSS selectors or XPath usually looks like this:
    Preis: 3,98 EUR
    - Amount: 1 - Color: Black - Weight: 1kg
    3,7 - 23,4
    mail(at)example.com
    ...
    You always have to validate and transform the results!
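As a rough illustration of "validate and transform", the sketch below normalises three of the raw strings shown on this slide. The function names are hypothetical, and the parsers only cover the exact shapes shown (decimal comma, no thousands separators). Anything else yields `null` so it can be flagged rather than silently stored.

```php
<?php
// "Preis: 3,98 EUR" -> 3.98; null when no decimal-comma EUR price is found.
function parsePrice(string $raw): ?float
{
    if (!preg_match('/(\d+),(\d{2})\s*EUR/', $raw, $m)) {
        return null;
    }
    return (float) ($m[1] . '.' . $m[2]);
}

// "3,7 - 23,4" -> [3.7, 23.4]; null when the string is not a numeric range.
function parseRange(string $raw): ?array
{
    if (!preg_match('/^\s*(\d+(?:,\d+)?)\s*-\s*(\d+(?:,\d+)?)\s*$/', $raw, $m)) {
        return null;
    }
    return [
        (float) str_replace(',', '.', $m[1]),
        (float) str_replace(',', '.', $m[2]),
    ];
}

// "mail(at)example.com" -> "mail@example.com"; null when the result is not a valid address.
function parseEmail(string $raw): ?string
{
    $candidate = str_replace('(at)', '@', trim($raw));
    $valid = filter_var($candidate, FILTER_VALIDATE_EMAIL);
    return $valid === false ? null : $valid;
}
```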
  28. Validation & transformation of input data in Symfony?
    bubble icon by irene hoffman
  29. Shortcuts

  30. Shortcuts – internal APIs: Always look for internal APIs
  31. Shortcuts – internal APIs: Always look for internal APIs
  32. Shortcuts – internal APIs: Always look for internal APIs
  33. Shortcuts – internal APIs: Internal APIs are also provided by Tag Managers
  34. Shortcuts – internal APIs: Internal APIs are also provided by Tag Managers
    $host = 'http://localhost:4444/wd/hub';
    $driver = RemoteWebDriver::create($host, DesiredCapabilities::firefox());
    $driver->get('http://a.website.de');
    $dataLayer = $driver->executeScript("return dataLayer", []);
    var_dump($dataLayer[0]);
  35. Shortcuts – schema.org: the semantic web is your friend!
    • It’s used to embed structured data on web sites for use by search engines – and other applications.
    • Standardised description of entities (e.g. address, product, recipe) and their relationships.
    <div itemscope itemtype="http://schema.org/PostalAddress">
      <span itemprop="name">Example Inc.</span>
      P.O. Box <span itemprop="postOfficeBoxNumber">4321</span>
      <span itemprop="addressLocality">Some Town</span>,
      <span itemprop="addressRegion">GB</span>
      <span itemprop="postalCode">12212</span>
      <span itemprop="addressCountry">Takatuka Land</span>
    </div>
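Markup like this can be read without guessing at CSS classes. The sketch below collects the `itemprop` values of every element with a given `itemtype`, using PHP's built-in DOM extension. The function name `extractMicrodata` is hypothetical. Two assumptions: nested item scopes are simply flattened, and `$itemType` contains no double quotes.

```php
<?php
// Hypothetical helper: one associative array (itemprop => text) per matching item.
function extractMicrodata(string $htmlStr, string $itemType): array
{
    $doc = new DOMDocument();

    // Suppress warnings about imperfect HTML while parsing.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($htmlStr);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $items = [];

    // Find every element declaring the requested schema.org type.
    foreach ($xpath->query(sprintf('//*[@itemtype="%s"]', $itemType)) as $scope) {
        $item = [];
        // Collect all properties below that element.
        foreach ($xpath->query('.//*[@itemprop]', $scope) as $prop) {
            $item[$prop->getAttribute('itemprop')] = trim($prop->textContent);
        }
        $items[] = $item;
    }

    return $items;
}
```

Called with the address markup from this slide and `http://schema.org/PostalAddress`, it would return a single item keyed by `name`, `postOfficeBoxNumber`, `addressLocality` and so on.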
  36. Shortcuts – PhantomJs Cloud

  37. Shortcuts – Scrapinghub

  38. All Together Now
    [Diagram labels: the internetz, Harvester, raw data store, Extractor, Processor, structured data, Worker Queue with fetch, extraction and processing jobs]
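The three stages and the worker queue can be sketched in a single process as follows. This is only an in-memory illustration of the job flow. The function name `runPipeline`, the job array shape and the three callables are assumptions; a real setup would presumably use a persistent queue and store.

```php
<?php
// Hypothetical in-memory pipeline: fetch job -> extraction job -> processing job.
function runPipeline(array $urls, callable $harvest, callable $extract, callable $process): array
{
    $queue = new SplQueue();
    $rawStore = [];  // url => raw document, kept so extraction can be repeated
    $results = [];   // url => structured data

    foreach ($urls as $url) {
        $queue->enqueue(['type' => 'fetch', 'url' => $url]);
    }

    while (!$queue->isEmpty()) {
        $job = $queue->dequeue();
        $url = $job['url'];

        switch ($job['type']) {
            case 'fetch':
                // Harvester: obtain the document and store it untouched.
                $rawStore[$url] = $harvest($url);
                $queue->enqueue(['type' => 'extract', 'url' => $url]);
                break;
            case 'extract':
                // Extractor: pull the interesting fragments out of the stored document.
                $data = $extract($rawStore[$url]);
                $queue->enqueue(['type' => 'process', 'url' => $url, 'data' => $data]);
                break;
            case 'process':
                // Processor: validate and transform into structured data.
                $results[$url] = $process($job['data']);
                break;
        }
    }

    return $results;
}
```

Because each stage only talks to the queue and the store, a failed extraction can be re-queued without fetching the page again.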
  39. Thank you!