
Scraping the web with Laravel, Dusk, Docker, and PHP

Jumpstart your web scraping automation in the cloud with Laravel Dusk, Docker, and friends. We will discuss the types of web scraping tools, the best tools for the job, and how to deal with running Selenium in Docker.

Code examples @ https://github.com/paulredmond/scraping-with-laravel-dusk

Paul Redmond

June 27, 2017

Transcript

  1. Scraping the Web with Laravel Dusk, Docker, and PHP
    By: Paul Redmond @paulredmond paulredmond
  2. What You’ll Learn
    • Different types of scraping and when to use them
    • Use Laravel Dusk for rapid browser automation
    • Different ways to run browser automation
    • Run browser automation in a server environment
  3. What is Web Scraping?
    It’s a dirty job.
    Gathering data from HTML and other media for the purposes of testing, data enrichment, and collection.
    https://flic.kr/p/8EZMNk
  4. Hundreds of Billions
    Google “scrapes” hundreds of billions (or more) of pages and other media on the web.
    https://www.google.com/search/howsearchworks/crawling-indexing/
  5. Why Do We Need Scraping?
    • Market analysis
    • Gain a competitive advantage
    • Increase learning and understanding
    • Monitor trends
    • Combine multiple offers into one portal (e.g., shopping comparisons)
    • Analytics
  6. Other Types of Data Scraping
    • Competitor scanning
    • Military intelligence
    • Surveillance
    • Metering
  7. Is Web Scraping Legitimate?
    • Yes, it can be.
    • Scraping can have a negative connotation, so...
      ◦ Don’t do bad / illegal stuff
      ◦ Be nice
      ◦ Be careful
      ◦ Be respectful
  8. Keeping Web Scraping Legitimate
    • Speed. Go slow (watch requests/second).
    • Caution. Code mistakes could create unintended load!
    • Intent. Even if your intention is pure, always question it.
    • Empathy. Put yourself in the shoes of website owners.
    • Honesty. Don’t steal stuff (PII, copyrighted material, etc.).
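
One way to "go slow" is to enforce a minimum gap between requests. The sketch below is a hypothetical `Throttle` helper (not part of Laravel or Dusk) that sleeps just long enough before each request; the 0.05-second gap in the usage note is an arbitrary illustration, and a polite real-world value is usually much larger.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: spaces out requests so a scraper never exceeds
 * a chosen rate. Not part of Laravel, Dusk, or any library.
 */
final class Throttle
{
    /** @var float Minimum number of seconds between two requests. */
    private $minGapSeconds;

    /** @var float Timestamp of the last permitted request (0.0 = none yet). */
    private $lastRequestAt = 0.0;

    public function __construct(float $minGapSeconds)
    {
        $this->minGapSeconds = $minGapSeconds;
    }

    /** Sleeps until the minimum gap has passed, then records the time. */
    public function wait(): void
    {
        $remaining = $this->minGapSeconds - (microtime(true) - $this->lastRequestAt);

        if ($remaining > 0) {
            usleep((int) round($remaining * 1000000));
        }

        $this->lastRequestAt = microtime(true);
    }
}
```

Call `$throttle->wait()` immediately before every HTTP request or page visit, for example `$throttle = new Throttle(0.05);` at the top of a scraping command. A bug that loops forever will then still hit the target site at a bounded rate.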
  9. Keep robots.txt in Mind... Be a Good Bot
    • https://www.google.com/robots.txt
    • https://www.yahoo.com/robots.txt
    • https://github.com/robots.txt (see the top comment)
    * PHP Robots Parser: https://github.com/webignition/robots-txt-file
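
To make the "good bot" idea concrete, here is a deliberately simplified sketch of a robots.txt check in plain PHP. The function names `disallowedPrefixes` and `isPathAllowed` are made up for illustration. It only honors `Disallow` prefixes in the `User-agent: *` group and ignores `Allow` rules, wildcards, and agent-specific groups, so a dedicated parser such as the one linked above is the better choice for real work.

```php
<?php

declare(strict_types=1);

/**
 * Returns the Disallow path prefixes that apply to every bot ("User-agent: *").
 * Simplified on purpose: no Allow rules, no wildcards, no agent-specific groups.
 */
function disallowedPrefixes(string $robotsTxt): array
{
    $prefixes = [];
    $inWildcardGroup = false;
    $lastWasUserAgent = false;

    foreach (preg_split('/\R/', $robotsTxt) as $line) {
        // Drop comments and surrounding whitespace.
        $line = trim(preg_replace('/#.*$/', '', $line));
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }

        list($field, $value) = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            // Consecutive User-agent lines share one group; otherwise a new group starts.
            if (!$lastWasUserAgent) {
                $inWildcardGroup = false;
            }
            if ($value === '*') {
                $inWildcardGroup = true;
            }
            $lastWasUserAgent = true;
            continue;
        }
        $lastWasUserAgent = false;

        // An empty Disallow value means "nothing is disallowed", so skip it.
        if ($field === 'disallow' && $inWildcardGroup && $value !== '') {
            $prefixes[] = $value;
        }
    }

    return $prefixes;
}

/** True when no wildcard-group Disallow prefix matches the start of $path. */
function isPathAllowed(string $robotsTxt, string $path): bool
{
    foreach (disallowedPrefixes($robotsTxt) as $prefix) {
        if (strncmp($path, $prefix, strlen($prefix)) === 0) {
            return false;
        }
    }

    return true;
}
```

A scraper could fetch the site's robots.txt once per run, cache the text, and call `isPathAllowed($robotsTxt, '/some/path')` before each request.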
  10. When Do We Scrape?
    • What is the purpose?
    • Can we live without the data?
    • Do they have an API?
    • If yes, does the API have everything we need?
    • Do they allow scraping?
  11. Downsides of Scraping
    • Changes in the HTML/DOM break scrapers
    • Changes in the HTML/DOM break scrapers
    • Changes in the HTML/DOM break scrapers
    • Changes in the HTML/DOM break scrapers
    • Rich JavaScript apps can cause headaches
    • Scraping can be process-, memory-, and time-intensive
    • More manual processing/formatting of collected data than with an API
    • Changes in the HTML/DOM break scrapers
  12. How Do We Overcome the Downsides?
    • Match DOM/selectors defensively
    • It's a bit of an art that takes practice and experience
    • Make sure that you handle failure
    • Good alerting, notifications, and reporting
      ◦ https://www.bugsnag.com/
      ◦ https://sentry.io/
    • Learn to accept that scraping will break sometimes
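
A minimal sketch of what "matching defensively" and "handling failure" might look like with PHP's built-in DOM extension: try several candidate XPath expressions in order, and throw when none match so the failure can reach your alerting. The function name `extractFirstMatch` and the sample selectors are assumptions for illustration only.

```php
<?php

declare(strict_types=1);

/**
 * Tries each XPath expression in order and returns the first non-empty text match.
 * Throws when nothing matches so a markup change surfaces as a reportable failure.
 */
function extractFirstMatch(string $html, array $xpathCandidates): string
{
    $document = new DOMDocument();

    // Real-world HTML is rarely valid; collect parse warnings instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($document);

    foreach ($xpathCandidates as $expression) {
        $nodes = $xpath->query($expression);
        if ($nodes !== false && $nodes->length > 0) {
            $text = trim($nodes->item(0)->textContent);
            if ($text !== '') {
                return $text;
            }
        }
    }

    throw new RuntimeException('No selector matched: ' . implode(' | ', $xpathCandidates));
}
```

Ordering the candidates from most specific (an `id`) to most forgiving (a partial class match) lets the scraper survive small markup changes, while the exception gives an error tracker something concrete to report when every candidate fails.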
  13. 3 Categories of Web Scraping
    • Anonymous HTTP requests (HTML, images, XML, etc.)
    • Testing elements, asserting expected behavior
    • Full browser automation tasks
  14. Anonymous Scraping - HTML, Images, etc.
    • Fastest
    • Easy to run and reproduce
    • Just speaking HTTP
    • PHP has good DOM parsing tools (Goutte)
  15. Testing elements / asserting expected behavior
    • May use HTTP to make basic response assertions
    • May use a full browser (think testing rich JavaScript apps)
    • Useful for user acceptance testing and browser testing
  16. Full Browser Automation
    • Like testing, but used for scraping
    • Real browser or headless browser
    • The closest thing to a real user
    • Requires more tooling (e.g., Selenium, WebDriver, PhantomJS)
    • Generally runs slowly
  17. (Some) PHP Tools You Can Use for Scraping
    • cURL
    • Goutte (goot)
    • Guzzle
    • Httpful
    • PHP-Webdriver
    • file_get_contents()
  18. Goutte Overview
    Section: HTTP Scraping
    • Uses Symfony/BrowserKit to simulate the browser
    • Uses Symfony/DomCrawler for DOM traversal/filtering
    • Uses Guzzle for HTTP requests
    • Get and set cookies
    • History (allows you to go back, forward, clear)
    Reference: https://github.com/FriendsOfPHP/Goutte
  19. Goutte Capabilities
    • Click on links and navigate the web
    • Extract data / filter data
    • Submit forms
    • Follows redirects (by default)
    • Requests return an instance of Symfony\Component\DomCrawler\Crawler
  20. Ways you might use web scraping for testing
    Section: Testing and Web Scrapers
    • Test bulk site redirects before a migration
      ◦ Request the old URLs
      ◦ Assert a 3xx response
      ◦ Assert the redirect location returns a 200
    • Functional test suites (e.g., Symfony/Laravel)
    • Healthcheck probes / HTTP validation (e.g., a 200 response)
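
The three redirect steps above can be sketched as a small function. To keep it runnable without a network, the HTTP call is injected as `$fetch`, a hypothetical callable that must not follow redirects and returns an array with `status` and `location` keys; the function name `checkRedirect` and the result shape are likewise assumptions.

```php
<?php

declare(strict_types=1);

/**
 * Checks one old URL: expects a 3xx with a Location header, then expects
 * the redirect target to answer 200.
 *
 * $fetch is a hypothetical callable taking a URL and returning
 * ['status' => int, 'location' => string|null]. It must NOT follow redirects.
 */
function checkRedirect(string $oldUrl, callable $fetch): array
{
    // Step 1: request the old URL.
    $first = $fetch($oldUrl);

    // Step 2: assert a 3xx response that names a target.
    if ($first['status'] < 300 || $first['status'] > 399) {
        return ['ok' => false, 'reason' => "expected 3xx, got {$first['status']}"];
    }
    if (empty($first['location'])) {
        return ['ok' => false, 'reason' => 'redirect without a Location header'];
    }

    // Step 3: assert the redirect location returns a 200.
    $target = $fetch($first['location']);
    if ($target['status'] !== 200) {
        return ['ok' => false, 'reason' => "target returned {$target['status']}"];
    }

    return ['ok' => true, 'reason' => 'redirects to ' . $first['location']];
}
```

In a real run, `$fetch` could wrap cURL with `CURLOPT_FOLLOWLOCATION` disabled, or Guzzle with `'allow_redirects' => false`, and the function would be called once per row of the migration's URL list.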
  21. Why do we need full browser automation tools?
    Section: Full Browser Automation
    • Simulate real browsers
    • Test/work with async JavaScript applications
    • Automate testing that applications work as expected
    • Replace repetitive manual QA with automation
    • Run tests in multiple browsers
    • Advanced web scraping (e.g., filtered reports)
  22. Notable Tools in Browser Automation
    • Selenium
    • W3C WebDriver (https://www.w3.org/TR/webdriver/)
    • Headless browsers
      ◦ PhantomJS
      ◦ Chrome --headless*
      ◦ ZombieJS
    * Chromedriver isn’t quite working with --headless yet, at least for me ¯\_(ツ)_/¯
  23. Notable PHP Tools in Browser Automation
    • Behat / Mink
    • PHP-Webdriver
      ◦ Codeception
      ◦ Laravel Dusk (recently)
    • Steward
    • Any others you consider notable?
  24. Notables in Other Languages...
    • Python
      ◦ Selenium WebDriver bindings
      ◦ BeautifulSoup
      ◦ Requests: HTTP for Humans
      ◦ Scrapy
    • Ruby
      ◦ Capybara
      ◦ Nokogiri (DOM parsing)
      ◦ Mechanize gem
  25. Notables in Other Languages...
    • JavaScript
      ◦ Nightwatch.js
      ◦ Zombie
      ◦ PhantomJS
      ◦ Webdriver.io
      ◦ CasperJS
      ◦ SlimerJS
  26. Why Use PHP for Web Browser Automation?
    • Developers don’t have to learn a new language (good/bad)
    • More participation in teams already writing PHP
    • Reduce cross-language mental overhead
    • Browser automation can be closer to your domain logic
    • PHP-Webdriver is Good Enough™ (and backed by Facebook)
  27. How Do I Run PHP Browser Automation?
    Section: How Do I Run This Stuff?
    • `chrome --headless` - as of Chrome 59
    • Standalone Selenium
    • WebDriver
    • PhantomJS
    • Any other ways?
  28. Run Chrome Headless (Chrome 59 Stable)
    $ alias chrome="/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome"
    $ chrome --headless --disable-gpu --print-to-pdf https://www.github.com/
    $ open output.pdf
    $ chrome --headless --disable-gpu --dump-dom
    $ chrome --headless --disable-gpu --repl https://www.chromestatus.com/
    Reference: https://developers.google.com/web/updates/2017/04/headless-chrome
  29. Techniques for Triggering Browser Automation
    • Eager tasks - run on a schedule
    • On-demand - one-off console commands
    • Event trigger - event queue
    • What are some other ways?
  30. Intro to Laravel Dusk
    • Browser testing for Laravel projects (primary use case)
    • Browser abstraction on top of PHP-Webdriver <3
    • Doesn’t require JDK or Selenium (you can still use them)
    • Uses standalone ChromeDriver
  31. But I am going to show you why it's great for web automation stuff...
  32. Key Laravel Features for Browser Automation
    • Scheduler to run commands on a schedule (eager)
    • Create custom console commands (one-off)
    • Built-in queues (triggered)
    • Database migrations for quick modeling of data storage
    • Service container for browser automation classes
  33. Custom Console Commands
    • Easily run one-off commands
    • Scheduler uses commands, giving you both
    • Laravel uses the Symfony Console and adds conveniences
    • Commands run my browser scraping
  34. Queues
    • Easily trigger web scraping jobs
    • Queue jobs can trigger console commands
    • Laravel has a built-in queue worker
    • Redis is my preferred queue driver
  35. XVFB. What the What!?
    “Xvfb (short for X virtual framebuffer) is an in-memory display server for UNIX-like operating system (e.g., Linux). It enables you to run graphical applications without a display (e.g., browser tests on a CI server) while also having the ability to take screenshots.”
    Reference: http://elementalselenium.com/tips/38-headless
  36. Our Requirements for a Docker Scheduler
    Section: Running in Docker
    • Google Chrome Stable
    • Chromedriver
    • Xvfb
    • PHP
    • Entrypoint to run the scheduler
  37. Our Docker Setup
    • Docker Official php:7.1.6-cli (Scheduler)
    • Docker Official php:7.1.6-fpm (Web Container)
    • Docker Compose
    • Redis
    • MySQL
  38. Why Not the Official Selenium Image?
    • If you need file downloads through Chrome
    • Downloads through volumes aren’t ideal
    • If you want the same PHP installation on app and scheduler (I do)
  39. Scheduler Dockerfile
    • Extends php:7.1.6-cli
    • Installs Chrome Stable + a script to take Chrome out of sandbox mode
    • Installs Chromedriver
    • Installs required PHP modules
    • Copies application files
    • Runs a custom entrypoint script
  40. Extending Dusk Browser - Hooking it Together
    • Provide our own Browser class
    • A DownloadsManager class for Chrome downloads
    • A DownloadedFile class to work with downloaded files
    • Service container bindings in AppServiceProvider
    • Example command
    • Let's see it in action...
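
The deck does not show how its DownloadsManager works, so the following is only a guess at one piece such a class might need: waiting for Chrome to finish a download. It relies on Chrome writing in-progress downloads with a `.crdownload` suffix. The function name `waitForDownload` and its parameters are hypothetical and are not taken from the linked repository.

```php
<?php

declare(strict_types=1);

/**
 * Polls a directory until a completed file appears or the timeout passes.
 * Chrome names in-progress downloads "*.crdownload", so those are skipped.
 * Returns the path of the first completed file, or null on timeout.
 * Hypothetical sketch; not the author's DownloadsManager.
 */
function waitForDownload(string $directory, float $timeoutSeconds, int $pollMilliseconds = 100): ?string
{
    $deadline = microtime(true) + $timeoutSeconds;

    do {
        foreach (scandir($directory) ?: [] as $name) {
            $path = $directory . DIRECTORY_SEPARATOR . $name;

            // "." and ".." fail is_file(); partial downloads fail the suffix check.
            if (is_file($path) && substr($name, -11) !== '.crdownload') {
                return $path;
            }
        }

        usleep($pollMilliseconds * 1000);
    } while (microtime(true) < $deadline);

    return null;
}
```

This assumes the download directory starts empty for each job (for example, one temporary directory per scrape), so the first completed file is the one the browser just fetched.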
  41. My Projects
    Lumen Programming Guide
    http://www.apress.com/la/book/9781484221860
    You will learn to write test-driven (TDD) microservices, REST APIs, and web service APIs with PHP using the Lumen micro-framework.
    * Zero bugs in the book source code ;)
  42. My Projects
    Docker for PHP Developers
    https://leanpub.com/docker-for-php-developers
    A hands-on guide to learning how to use Docker as your primary development environment. It covers a diverse range of topics and scenarios you will face as a PHP developer picking up Docker.