Scraping the web

a.thomas
October 27, 2016

Web scraping techniques are a bit like porn. A lot of us are using them, but no one's really talking about it. Maybe that's why scraping applications often end up as a bunch of scripts hacked together by some guy who left the company ages ago.
How sad! Let me show you that reverse engineering a frontend in order to extract data is actually quite a fun challenge and that there are lots of amazing tools out there that do most of the hard work for you.

Transcript

  1. #sfugbln 2016 A. Thomas: Alex Thomas, @fanatique, /fanatique.
    robust software, resilient teams, php, js, berlin, sailing, …
  2. Web Scraping: reverse engineering a web frontend in order to extract structured data from a website that doesn’t offer an API…
  3. Web Scraping (gear image by Verdy_p):
        { "title": "Symfony", "author": "Potencier", "language": "PHP", ... ... ... }
  4. ~
    • Unlicensed Commercial Use
    • Only Facts (hard to prove) (I wouldn’t recommend it!)*
    Wannsee Weather This Afternoon: Temperature: 11 °C, Wind: 3 – 4 Bft.
    *according to my understanding.
  5. weather_scraper.php
        $html = file_get_contents($url);
        $regex = '/rain:(.*)/';
        // $match[1] holds the captured group; $match[0] is the whole match.
        if (preg_match($regex, $html, $match) && trim($match[1]) === 'none') {
            sendPushMessage(trim($match[1]));
        }
    Three steps: 1. Harvesting 2. Extracting 3. Processing
  6. 1. Harvesting: file_get_contents gets you nowhere
    • very limited support for headers, cookies etc.
    • no way to evaluate return values in detail
    • only sync requests
        $opts = array(
            'http' => array(
                'method' => "GET",
                'header' => "Accept-language: en\r\n" .
                            "Cookie: foo=bar\r\n"
            )
        );
        $context = stream_context_create($opts);
        $htmlStr = file_get_contents($url, false, $context);
  7. 1. Harvesting: Do yourself a favour: use at least Guzzle with curl
    • Async requests
    • Detailed header manipulation possible
    • Simple interface
        // Send an asynchronous request.
        $client = new \GuzzleHttp\Client();
        $request = new \GuzzleHttp\Psr7\Request('GET', $url);
        $promise = $client->sendAsync($request)->then(
            function ($response) {
                $result = extractData($response->getBody());
            }
        );
        // Block until the request has completed and the callback has run.
        $promise->wait();
  8. 1. Harvesting: And always store what you harvested!
    [Diagram labels: http://example.com ..., $$$, </>]
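One way to "store what you harvested" could look like the sketch below. It assumes plain files on disk are good enough as a raw store; the function names `storeRawResponse` and `loadLatestRawResponse` and the directory layout are hypothetical, not from the talk.

```php
<?php
// Hypothetical helper: persist a raw response body so that extraction can be
// re-run later without hitting the target site again.
function storeRawResponse(string $baseDir, string $url, string $body, ?int $fetchedAt = null): string
{
    $fetchedAt = $fetchedAt ?? time();
    // One directory per URL (hashed, so any URL is a safe directory name).
    $dir = rtrim($baseDir, '/') . '/' . sha1($url);
    if (!is_dir($dir) && !mkdir($dir, 0777, true) && !is_dir($dir)) {
        throw new RuntimeException("Cannot create directory: $dir");
    }
    // One file per fetch, named by UTC timestamp, so older snapshots survive.
    $path = $dir . '/' . gmdate('Ymd\THis\Z', $fetchedAt) . '.html';
    if (file_put_contents($path, $body) === false) {
        throw new RuntimeException("Cannot write file: $path");
    }
    return $path;
}

// Return the most recently stored body for a URL, or null if none exists.
function loadLatestRawResponse(string $baseDir, string $url): ?string
{
    $files = glob(rtrim($baseDir, '/') . '/' . sha1($url) . '/*.html');
    if (!$files) {
        return null;
    }
    sort($files); // timestamped names sort chronologically
    $body = file_get_contents(end($files));
    return $body === false ? null : $body;
}
```

Keeping every snapshot means a broken selector can be fixed and replayed against the stored HTML instead of re-fetching the site.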
  9. Unfortunately, that still does not solve your hardest problem in harvesting: obtaining the right DOM!
  10. 1. Harvesting: The only solution (I know) is: browser automation through Selenium!
        $welcomeText = $driver
            ->findElement(WebDriverBy::id('id'))
            ->getText();
        // Clicking is a method of the located element.
        $driver
            ->findElement(WebDriverBy::id('id'))
            ->click();
    [Diagram label: Selenium Server]
  11. 1. Harvesting: Scraping with Selenium
    ✓ Can handle JS and XHR!!
    ✓ Proper cookie handling
    ✓ Loads assets
    ✓ Can do screenshots(!)
    - Veeeryy sloow
    - Response headers are not accessible
    - Heavyweight infrastructure
    All in all it behaves like a real browser. Because it is a real browser.
  12. 1. Harvesting: facebook/php-webdriver
    • Is the most current implementation of the Selenium WebDriver protocol in PHP
    • Actively maintained
    • Not very well documented (imho)…
        $driver->get($this->getTestPath('index.html'));
        $welcomeText = $driver
            ->findElement(WebDriverBy::id('welcome'))
            ->getText();
  13. 1. Harvesting: Usual challenges:
    • IP blocking (because of too many requests)
    • Bandwidth throttling / no responses
    • Fake responses
    • …
    Maybe your scraper should be a bit nicer!
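A scraper that is "a bit nicer" might pause between requests and back off when the site starts failing. The following is a minimal sketch under that assumption; `politeDelay`, `harvestPolitely` and the default delays are hypothetical choices, not values from the talk.

```php
<?php
// Hypothetical helper: how long to wait before the next request to a host.
// $baseDelay is the normal pause between requests; after consecutive
// failures (timeouts, 429/503 responses) the pause doubles each time,
// capped at $maxDelay. All values are in seconds.
function politeDelay(int $consecutiveFailures, float $baseDelay = 1.0, float $maxDelay = 60.0): float
{
    $failures = max(0, $consecutiveFailures);
    $delay = $baseDelay * (2 ** min($failures, 30)); // cap exponent to avoid overflow
    return min($delay, $maxDelay);
}

// Fetch a list of URLs one at a time, pausing between requests.
// $fetch is any callable returning the body, or null on failure.
// $sleep can be replaced (e.g. in tests) to avoid really sleeping.
function harvestPolitely(array $urls, callable $fetch, ?callable $sleep = null): array
{
    $sleep = $sleep ?? function (float $seconds): void {
        usleep((int) round($seconds * 1000000));
    };
    $results = [];
    $failures = 0;
    foreach ($urls as $url) {
        $body = $fetch($url);
        // Reset the failure counter on success, increase it on failure.
        $failures = ($body === null) ? $failures + 1 : 0;
        $results[$url] = $body;
        $sleep(politeDelay($failures));
    }
    return $results;
}
```

Requests are deliberately sequential here; with async clients the same delay function could gate how quickly new requests are started.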
  14. 2. Extracting: There is a simple solution for extracting DOM data in PHP (the Symfony DomCrawler component):
    • Lets you filter the DOM using CSS selectors
    • Contains convenience methods (like first(), siblings(), parents() etc.)
    • Supports XPath, which is even more powerful!
        use Symfony\Component\DomCrawler\Crawler;

        $crawler = new Crawler($htmlStr);
        $results = $crawler->filter('body > p');
        foreach ($results as $result) {
            doCleverStuffWith($result->textContent);
        }
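To illustrate why XPath can be more powerful than CSS selectors, here is a small sketch that selects a cell by the text of its neighbouring header, something CSS selectors cannot express. It uses PHP's bundled DOM extension (`DOMDocument`, `DOMXPath`) rather than DomCrawler so that it runs without Composer; the helper `extractTexts` and the sample markup are made up for illustration.

```php
<?php
// Return the trimmed text of every node matched by an XPath expression.
function extractTexts(string $html, string $xpathExpression): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parse warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = (new DOMXPath($doc))->query($xpathExpression);
    if ($nodes === false) {
        throw new InvalidArgumentException("Invalid XPath: $xpathExpression");
    }
    $texts = [];
    foreach ($nodes as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

// Made-up sample markup.
$html = '<html><body><table>'
    . '<tr><th>Title</th><td>Symfony</td></tr>'
    . '<tr><th>Author</th><td>Potencier</td></tr>'
    . '</table></body></html>';

// Select the cell that follows the header cell labelled "Author".
$author = extractTexts($html, '//th[normalize-space()="Author"]/following-sibling::td[1]');
```

Selecting by label text like this tends to survive layout changes better than positional selectors such as "second row, second cell".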
  15. 2. Extracting: Inject a remote JS file into a target site.
        var s = document.createElement('script');
        s.setAttribute('src', '//xmpl.co/scrpr.js');
        document.head.appendChild(s);
    Result:
        { "title": "Symfony", "author": "Potencier", "language": "PHP", ... ... ... }
  16. 2. Extracting: Common Challenges:
    • shuffling DOM elements or
    • storing content in images (especially prices)
  17. 3. Processing: Data extracted using CSS selectors or XPath usually looks like this:
        Preis: 3,98 EUR
        - Amount: 1 - Color: Black - Weight: 1kg
        3,7 - 23,4
        mail(at)example.com
        ...
    You always have to validate and transform the results!
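Validation and transformation of such raw strings could be sketched as follows. The function names are hypothetical, and the parsing rules (decimal comma, three-letter currency code, kg and g only, "(at)" as the only obfuscation) are assumptions read off the sample values above, not rules from the talk.

```php
<?php
// Parse a price such as "Preis: 3,98 EUR" into integer cents plus currency.
// Assumes a decimal comma and a three-letter currency code, as in the sample.
function parsePrice(string $raw): ?array
{
    if (!preg_match('/(\d+)(?:,(\d{1,2}))?\s*([A-Z]{3})\b/', $raw, $m)) {
        return null;
    }
    // Pad "9" to "90" so that "3,9" becomes 390 cents, not 309.
    $cents = (int) $m[1] * 100 + (int) str_pad($m[2], 2, '0');
    return ['cents' => $cents, 'currency' => $m[3]];
}

// Parse a weight such as "Weight: 1kg" into grams (kg and g only).
function parseWeightInGrams(string $raw): ?int
{
    if (!preg_match('/(\d+(?:[.,]\d+)?)\s*(kg|g)\b/i', $raw, $m)) {
        return null;
    }
    $value = (float) str_replace(',', '.', $m[1]);
    return (int) round($value * (strtolower($m[2]) === 'kg' ? 1000 : 1));
}

// Turn an obfuscated address such as "mail(at)example.com" into a valid
// e-mail address, or null if the result does not validate.
function parseEmail(string $raw): ?string
{
    $candidate = preg_replace('/\s*[\(\[]at[\)\]]\s*/i', '@', trim($raw));
    $valid = filter_var($candidate, FILTER_VALIDATE_EMAIL);
    return $valid === false ? null : $valid;
}
```

Each parser returns null instead of a guess when the input does not match, so unexpected formats show up as gaps rather than silently wrong data.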
  18. 3. Processing: Validation & transformation of input data in Symfony? (bubble icon by irene hoffman)
  19. Shortcuts – internal APIs: Internal APIs are also provided by Tag Managers
        use Facebook\WebDriver\Remote\DesiredCapabilities;
        use Facebook\WebDriver\Remote\RemoteWebDriver;

        $host = 'http://localhost:4444/wd/hub';
        $driver = RemoteWebDriver::create($host, DesiredCapabilities::firefox());
        $driver->get('http://a.website.de');
        // Read the tag manager's global dataLayer array from the page.
        $dataLayer = $driver->executeScript("return dataLayer", []);
        var_dump($dataLayer[0]);
  20. Shortcuts – schema.org: the semantic web is your friend!
    • It’s used to embed structured data on web sites for use by search engines and other applications.
    • Standardised description of entities (e.g. address, product, recipe)
    • and their relationships.
        <div itemscope itemtype="http://schema.org/PostalAddress">
          <span itemprop="name">Example Inc.</span>
          P.O. Box <span itemprop="postOfficeBoxNumber">4321</span>
          <span itemprop="addressLocality">Some Town</span>,
          <span itemprop="addressRegion">GB</span>
          <span itemprop="postalCode">12212</span>
          <span itemprop="addressCountry">Takatuka Land</span>
        </div>
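Reading such schema.org microdata back out can be quite short. The sketch below uses PHP's bundled DOM extension; `extractMicrodata` is a hypothetical helper that only handles flat items (no nested `itemscope`) and assumes `$itemType` contains no double quotes.

```php
<?php
// Collect itemprop => text pairs from the first element that declares the
// given schema.org itemtype. Nested items are not handled in this sketch.
function extractMicrodata(string $html, string $itemType): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Assumption: $itemType contains no double quotes.
    $items = $xpath->query('//*[@itemtype="' . $itemType . '"]');
    if ($items === false || $items->length === 0) {
        return [];
    }
    $data = [];
    foreach ($xpath->query('.//*[@itemprop]', $items->item(0)) as $node) {
        $data[$node->getAttribute('itemprop')] = trim($node->textContent);
    }
    return $data;
}

// The postal address markup from the slide.
$html = '<div itemscope itemtype="http://schema.org/PostalAddress">'
    . '<span itemprop="name">Example Inc.</span> P.O. Box '
    . '<span itemprop="postOfficeBoxNumber">4321</span> '
    . '<span itemprop="addressLocality">Some Town</span>, '
    . '<span itemprop="addressRegion">GB</span> '
    . '<span itemprop="postalCode">12212</span> '
    . '<span itemprop="addressCountry">Takatuka Land</span>'
    . '</div>';

$address = extractMicrodata($html, 'http://schema.org/PostalAddress');
```

Because the property names are standardised, the same helper works on any site that marks up its addresses this way, independent of its layout.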
  21. All Together Now
    [Architecture diagram. Components: the internetz, Harvester, Extractor, Processor, Worker Queue. Data: raw data (</>), structured data. Queue interactions: store extraction job, fetch job, store processing job, fetch job.]
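One possible reading of this architecture, sketched with a single in-memory `SplQueue` standing in for the worker queue. The job format, the function `runPipeline` and the three callables are hypothetical; a real setup would presumably use a persistent queue and separate worker processes.

```php
<?php
// Run a three-stage scraping pipeline over one in-memory job queue.
// $harvest, $extract and $process are placeholders for the real stages.
function runPipeline(array $urls, callable $harvest, callable $extract, callable $process): array
{
    $queue = new SplQueue();
    foreach ($urls as $url) {
        $queue->enqueue(['type' => 'fetch', 'url' => $url]);
    }

    $rawStore = [];   // url => raw HTML, kept so extraction can be repeated
    $results = [];    // url => processed record

    while (!$queue->isEmpty()) {
        $job = $queue->dequeue();
        $url = $job['url'];
        switch ($job['type']) {
            case 'fetch':
                // Harvester: store the raw data, then queue an extraction job.
                $rawStore[$url] = $harvest($url);
                $queue->enqueue(['type' => 'extract', 'url' => $url]);
                break;
            case 'extract':
                // Extractor: turn raw data into structured data, then queue processing.
                $queue->enqueue([
                    'type' => 'process',
                    'url' => $url,
                    'data' => $extract($rawStore[$url]),
                ]);
                break;
            case 'process':
                // Processor: validate and transform the structured data.
                $results[$url] = $process($job['data']);
                break;
        }
    }
    return $results;
}
```

Decoupling the stages through jobs means a slow harvester does not block extraction of pages that are already stored, and each stage can be retried on its own.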