Slide 1

Slide 1 text

Scraping the Web with Laravel Dusk, Docker, and PHP By: Paul Redmond @paulredmond paulredmond

Slide 2

Slide 2 text

What You’ll Learn?
● Different types of scraping and when to use them
● Use Laravel Dusk for rapid browser automation
● Different Ways to Run Browser Automation
● Run Browser Automation in a Server Environment

Slide 3

Slide 3 text

What is Web Scraping?
It’s a dirty job.
Gathering data from HTML and other media for the purposes of testing, data enrichment, and collection.
https://flic.kr/p/8EZMNk

Slide 4

Slide 4 text

Hundreds of Billions
Google “scrapes” hundreds of billions (or more) of pages and other media on the web.
https://www.google.com/search/howsearchworks/crawling-indexing/

Slide 5

Slide 5 text

Why Do We Need Scraping?
● Market analysis
● Gain a competitive advantage
● Increase learning and understanding
● Monitor trends
● Combine multiple offers into one portal (e.g., shopping comparisons)
● Analytics

Slide 6

Slide 6 text

Other Types of Data Scraping
● Competitor Scanning
● Military Intelligence
● Surveillance
● Metering

Slide 7

Slide 7 text

Other Types of Data Scraping

Slide 8

Slide 8 text

Other Types of Data Scraping

Slide 9

Slide 9 text

Is Web Scraping Legitimate?
● Yes, it can be.
● Scraping can have a negative/bad connotation, so...
  ○ Don’t do bad / illegal stuff
  ○ Be nice
  ○ Be careful
  ○ Be respectful

Slide 10

Slide 10 text

Keeping Web Scraping Legitimate
● Speed
● Caution
● Intent
● Empathy
● Honesty

Slide 11

Slide 11 text

Keeping Web Scraping Legitimate
● Speed. Go slow (watch requests/second)
● Caution. Code mistakes could create unintended load!
● Intent. Even if your intentions are pure, always question them.
● Empathy. Put yourself in the shoes of website owners
● Honesty. Don’t steal stuff (PII, copyrighted material, etc.)
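
The "go slow" advice can be sketched as a fixed pause between requests. This is a minimal illustration, not code from the talk: the function names (`delayMicroseconds`, `pause`, `crawlPolitely`) and the default rate of one request every two seconds are assumptions, and `$fetch` stands in for whatever HTTP client you actually use.

```php
<?php
declare(strict_types=1);

/**
 * Microseconds to wait between requests for a target rate.
 * (Hypothetical helper, named for this example only.)
 */
function delayMicroseconds(float $requestsPerSecond): int
{
    if ($requestsPerSecond <= 0) {
        throw new InvalidArgumentException('Rate must be positive.');
    }
    return (int) round(1000000 / $requestsPerSecond);
}

/**
 * Sleeps for the given number of microseconds. Whole seconds go through
 * sleep() because usleep() may not accept values above one second on
 * every platform.
 */
function pause(int $microseconds): void
{
    sleep(intdiv($microseconds, 1000000));
    usleep($microseconds % 1000000);
}

/**
 * Calls $fetch once per URL, pausing between calls (not before the first).
 * $fetch is any callable taking a URL string; its return value is stored
 * under that URL.
 */
function crawlPolitely(array $urls, callable $fetch, float $requestsPerSecond = 0.5): array
{
    $delay = delayMicroseconds($requestsPerSecond);
    $results = [];
    foreach (array_values($urls) as $i => $url) {
        if ($i > 0) {
            pause($delay);
        }
        $results[$url] = $fetch($url);
    }
    return $results;
}
```

A fixed delay is the simplest possible policy. A real scraper would likely also back off when it sees errors or slow responses from the site.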

Slide 12

Slide 12 text

Keep Robots.txt in Mind...Be a Good Bot
● https://www.google.com/robots.txt
● https://www.yahoo.com/robots.txt
● https://github.com/robots.txt (see the top comment)
* PHP Robots Parser: https://github.com/webignition/robots-txt-file
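
To make the idea concrete, here is a deliberately naive robots.txt check in plain PHP. It is a sketch under stated assumptions: it only honors the `User-agent: *` group, treats `Disallow` values as plain path prefixes, and ignores `Allow` rules and wildcards. The function name `isAllowedByRobots` is made up for this example, and it does not use the API of the parser library linked above, which is the better choice for real work.

```php
<?php
declare(strict_types=1);

/**
 * Naive check of a path against the "User-agent: *" group of a robots.txt
 * body. Ignores Allow rules, wildcards, and agent-specific groups.
 */
function isAllowedByRobots(string $robotsTxt, string $path): bool
{
    $inWildcardGroup = false;
    $prevWasAgent = false;

    foreach (preg_split('/\R/', $robotsTxt) as $line) {
        // Drop comments and surrounding whitespace.
        $line = trim((string) preg_replace('/#.*$/', '', $line));
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }

        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            // Consecutive User-agent lines belong to the same group.
            $inWildcardGroup = ($prevWasAgent && $inWildcardGroup) || $value === '*';
            $prevWasAgent = true;
            continue;
        }
        $prevWasAgent = false;

        // An empty Disallow value means "nothing is disallowed".
        if ($inWildcardGroup && $field === 'disallow' && $value !== ''
            && strpos($path, $value) === 0) {
            return false;
        }
    }

    return true;
}

$robots = "User-agent: *\nDisallow: /private/ # keep out\n\nUser-agent: BadBot\nDisallow: /\n";
$canFetchReport = isAllowedByRobots($robots, '/private/report');
$canFetchPage = isAllowedByRobots($robots, '/public/page');
```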

Slide 13

Slide 13 text

When Do We Scrape?
● What is the purpose?
● Can we live without the data?
● Do they have an API?
● If yes, does the API have everything we need?
● Do they allow scraping?

Slide 14

Slide 14 text

Downsides of Scraping
● Changes in the HTML/DOM break scrapers
● Changes in the HTML/DOM break scrapers
● Changes in the HTML/DOM break scrapers
● Changes in the HTML/DOM break scrapers
● Rich JavaScript apps can cause headaches
● Scraping can be process/memory and time intensive
● More manual processing/formatting of collected data than an API
● Changes in the HTML/DOM break scrapers

Slide 15

Slide 15 text

How Do We Overcome the Downsides?
● Match DOM/Selectors defensively
● It's a bit of an art that takes practice and experience
● Make sure that you handle failure
● Good alerting, notifications, and reporting
  ○ https://www.bugsnag.com/
  ○ https://sentry.io/
● Learn to accept that scraping will break sometimes
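
One way to read "match selectors defensively" and "handle failure" is to try several selectors in order and treat "nothing matched" as a normal outcome to report rather than a crash. The sketch below uses PHP's built-in DOM extension. The function name `firstMatchText`, the sample HTML, and the XPath queries are assumptions chosen for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Returns the trimmed text of the first node matched by any of the given
 * XPath queries, tried in order. Returns null when nothing matches, so the
 * caller can alert instead of crashing.
 */
function firstMatchText(string $html, array $xpathQueries): ?string
{
    if (trim($html) === '') {
        return null;
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; keep parse warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach ($xpathQueries as $query) {
        $nodes = $xpath->query($query);
        if ($nodes !== false && $nodes->length > 0) {
            $text = trim($nodes->item(0)->textContent);
            if ($text !== '') {
                return $text;
            }
        }
    }

    return null;
}

$html = '<div><span class="amount">$19.99</span></div>';

// Preferred selector first, looser fallback second.
$price = firstMatchText($html, [
    '//*[@id="price"]',
    '//span[contains(@class, "amount")]',
]);

if ($price === null) {
    // This is where alerting (Bugsnag, Sentry, etc.) would hook in.
    error_log('Price selectors no longer match; the page markup may have changed.');
}
```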

Slide 16

Slide 16 text

Scraping Tools

Slide 17

Slide 17 text

3 Categories of Web Scraping
● Anonymous HTTP Requests (HTML, Images, XML, etc.)
● Testing elements, asserting expected behavior
● Full Browser Automation Tasks

Slide 18

Slide 18 text

Anonymous Scraping - HTML, Images, etc.
● Fastest
● Easy to run and reproduce
● Just speaking HTTP
● PHP has good DOM parsing tools (Goutte)

Slide 19

Slide 19 text

Testing elements / asserting expected behavior
● May use HTTP to make basic response assertions
● May use a full browser (think testing Rich JavaScript Apps)
● Useful for user acceptance testing and browser testing

Slide 20

Slide 20 text

Full Browser Automation
● Like testing, but used for scraping
● Real browser or headless browser
● The closest thing to a real user
● Requires more tooling (e.g., Selenium, WebDriver, PhantomJS)
● Generally runs slowly

Slide 21

Slide 21 text

(Some) PHP Tools You Can Use for Scraping
● cURL
● Goutte (goot)
● Guzzle
● Httpful
● PHP-Webdriver
● file_get_contents()

Slide 22

Slide 22 text

What Other Tools Have You Used?

Slide 23

Slide 23 text

HTTP Scraping

Slide 24

Slide 24 text

Goutte is the Best Option (in my opinion)
Pronounced “goot”

Slide 25

Slide 25 text

Goutte Overview
● Uses Symfony/BrowserKit to Simulate the Browser
● Uses Symfony/DomCrawler for DOM Traversal/Filtering
● Uses Guzzle for HTTP Requests
● Get and Set Cookies
● History (allows you to go back, forward, clear)
Reference: https://github.com/FriendsOfPHP/Goutte

Slide 26

Slide 26 text

Goutte Capabilities
● Click on links and navigate the web
● Extract data / filter data
● Submit forms
● Follows redirects (by default)
● Requests return an instance of Symfony\Component\DomCrawler\Crawler

Slide 27

Slide 27 text

Let’s Look at Some Examples of HTTP Scraping
Goutte Examples on GitHub

Slide 28

Slide 28 text

Testing and Web Scrapers

Slide 29

Slide 29 text

Ways you might use web scraping for testing
● Test bulk site redirects before a migration
  ○ Request the old URLs
  ○ Assert a 3xx response
  ○ Assert the redirect location returns a 200
● Functional test suites (e.g., Symfony/Laravel)
● Healthcheck Probes / HTTP validation (e.g., a 200 response)
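
The three redirect-check steps can be sketched in plain PHP. To keep the example self-contained, the HTTP call is passed in as a callable: `$fetch` is a hypothetical function that takes a URL and returns an array with `status` and `location` keys, and it must not follow redirects itself. The URLs and the stubbed responses below are made up. In practice you would back `$fetch` with Guzzle, cURL, or Goutte.

```php
<?php
declare(strict_types=1);

/**
 * Checks that $oldUrl answers with a 3xx redirect and that the redirect
 * target answers with a 200. Relative Location headers are not resolved
 * in this sketch.
 */
function checkRedirect(callable $fetch, string $oldUrl): array
{
    $first = $fetch($oldUrl);
    if ($first['status'] < 300 || $first['status'] > 399 || empty($first['location'])) {
        return ['ok' => false, 'reason' => "Expected 3xx with Location, got {$first['status']}"];
    }

    $target = $fetch($first['location']);
    if ($target['status'] !== 200) {
        return ['ok' => false, 'reason' => "Redirect target returned {$target['status']}"];
    }

    return ['ok' => true, 'reason' => null];
}

// Stubbed responses standing in for real HTTP calls.
$responses = [
    'http://old.test/about' => ['status' => 301, 'location' => 'http://new.test/about'],
    'http://new.test/about' => ['status' => 200, 'location' => null],
];

$fetch = function (string $url) use ($responses): array {
    return $responses[$url] ?? ['status' => 404, 'location' => null];
};

$good = checkRedirect($fetch, 'http://old.test/about');
$bad = checkRedirect($fetch, 'http://old.test/missing');
```

Injecting the fetcher also makes the check easy to run over a long list of old URLs with the throttling of your choice.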

Slide 30

Slide 30 text

Example Functional Test Asserting HTML
http://symfony.com/doc/current/testing.html#your-first-functional-test

Slide 31

Slide 31 text

Example Functional Test Asserting Status
https://laravel.com/docs/5.4/http-tests#introduction

Slide 32

Slide 32 text

Example Functional Browser Test
https://laravel.com/docs/5.4/dusk#getting-started

Slide 33

Slide 33 text

Full Browser Automation

Slide 34

Slide 34 text

Why do we need full browser automation tools?

Slide 35

Slide 35 text

Why do we need full browser automation tools?
● Simulate real browsers
● Test/Work with Async JavaScript applications
● Automate testing that applications work as expected
● Replace repetitive manual QA with automation
● Run tests in multiple browsers
● Advanced Web Scraping (e.g., filtered reports)

Slide 36

Slide 36 text

Notable Tools in Browser Automation
● Selenium
● W3C WebDriver (https://www.w3.org/TR/webdriver/)
● Headless Browsers
  ○ PhantomJS
  ○ Chrome --headless*
  ○ ZombieJS
* Chromedriver isn’t quite working with --headless yet, at least for me ¯\_(ツ)_/¯

Slide 37

Slide 37 text

Notable PHP Tools in Browser Automation
● Behat / Mink
● PHP-Webdriver
  ○ Codeception
  ○ Laravel Dusk (recently)
● Steward
● Any others you consider notable?

Slide 38

Slide 38 text

Notables in Other Languages...
● Python
  ○ Selenium WebDriver Bindings
  ○ BeautifulSoup
  ○ Requests: HTTP for Humans
  ○ Scrapy
● Ruby
  ○ Capybara
  ○ Nokogiri (DOM Parsing)
  ○ Mechanize Gem

Slide 39

Slide 39 text

Notables in Other Languages...
● JavaScript
  ○ Nightwatch.js
  ○ Zombie
  ○ PhantomJS
  ○ Webdriver.io
  ○ CasperJS
  ○ SlimerJS

Slide 40

Slide 40 text

Why Use PHP for Web Browser Automation?
● Developers don’t have to learn a new language (good/bad)
● More participation in teams already writing PHP
● Reduce cross-language mental overhead
● Browser Automation can be closer to your domain logic
● PHP-Webdriver is Good Enough™ (and backed by Facebook)

Slide 41

Slide 41 text

How Do I Run PHP Browser Automation?

Slide 42

Slide 42 text

How Do I Run PHP Browser Automation?
● `chrome --headless` - as of Chrome 59
● Standalone Selenium
● WebDriver
● PhantomJS
● Any other ways?
How Do I Run This Stuff?

Slide 43

Slide 43 text

Run Chrome Headless (Chrome 59 Stable)
$ alias chrome="/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome"
$ chrome --headless --disable-gpu --print-to-pdf https://www.github.com/
$ open output.pdf
$ chrome --headless --disable-gpu --dump-dom
$ chrome --headless --disable-gpu --repl https://www.chromestatus.com/
Reference: https://developers.google.com/web/updates/2017/04/headless-chrome

Slide 44

Slide 44 text

Getting to Know PHP-WebDriver
WebDriver Examples on GitHub

Slide 45

Slide 45 text

Running the Chromedriver/Phantom Process

Slide 46

Slide 46 text

Techniques for Triggering Browser Automation
● Eager tasks - run on a schedule
● On-demand - one-off console commands
● Event trigger - event queue
● What are some other ways?

Slide 47

Slide 47 text

Intro to Laravel Dusk

Slide 48

Slide 48 text

Intro to Laravel Dusk
● Browser testing for Laravel projects (primary use case)
● Browser abstraction on top of PHP-Webdriver <3
● Doesn’t require JDK or Selenium (you can still use them)
● Uses standalone ChromeDriver

Slide 49

Slide 49 text

Do I HAVE to use Laravel to Use Dusk!?

Slide 50

Slide 50 text

Do I HAVE to use Laravel to Use Dusk!?

Slide 51

Slide 51 text

But I am going to show you why it's great for web automation stuff...

Slide 52

Slide 52 text

Dusk Basics: Elements

Slide 53

Slide 53 text

Dusk Basics: Links/Events

Slide 54

Slide 54 text

Dusk Basics: Form Inputs

Slide 55

Slide 55 text

Dusk Basics: Waiting for Elements

Slide 56

Slide 56 text

Quick Comparison to Our Earlier Vanilla PHP-Webdriver Example
Webdriver Dusk Examples on GitHub

Slide 57

Slide 57 text

Running Browser Automation

Slide 58

Slide 58 text

Key Laravel Features for Browser Automation
● Scheduler to run Commands on a schedule (eager)
● Create Custom Console Commands (one-off)
● Built-in Queues (triggered)
● Database Migrations for quick modeling of data storage
● Service Container for browser automation classes

Slide 59

Slide 59 text

Scheduler (app/Console/Kernel.php)

Slide 60

Slide 60 text

Custom Console Commands
● Easily run one-off commands
● Scheduler uses commands, giving you both
● Laravel uses the Symfony Console and adds conveniences
● Commands run my browser scraping

Slide 61

Slide 61 text

Queues
● Easily trigger web scraping jobs
● Queue jobs can trigger console commands
● Laravel has a built-in queue worker
● Redis is my preferred queue driver

Slide 62

Slide 62 text

Queues

Slide 63

Slide 63 text

Queues

Slide 64

Slide 64 text

Running Browser Automation in Docker

Slide 65

Slide 65 text

How Do I Run PHP Browser Automation on a Server!?

Slide 66

Slide 66 text

How Do I Run PHP Browser Automation on a Server!?
XVFB

Slide 67

Slide 67 text

XVFB. What the What!?
“Xvfb (short for X virtual framebuffer) is an in-memory display server for UNIX-like operating system (e.g., Linux). It enables you to run graphical applications without a display (e.g., browser tests on a CI server) while also having the ability to take screenshots.”
Reference: http://elementalselenium.com/tips/38-headless

Slide 68

Slide 68 text

Example Xvfb Usage
$ Xvfb :99 -screen 0 1920x1200x16 &

Slide 69

Slide 69 text

Example Xvfb Usage

Slide 70

Slide 70 text

Our Requirements for a Docker Scheduler
● Google Chrome Stable
● Chromedriver
● Xvfb
● PHP
● Entrypoint to run the scheduler
Running in Docker

Slide 71

Slide 71 text

Our Docker Setup
● Docker Official php:7.1.6-cli (Scheduler)
● Docker Official php:7.1.6-fpm (Web Container)
● Docker Compose
● Redis
● MySQL

Slide 72

Slide 72 text

Why Not the Official Selenium Image?
● If you need File Downloads through Chrome
● Downloads through volumes aren’t ideal
● If you want the same PHP installation on app and scheduler (I do)

Slide 73

Slide 73 text

Scheduler Dockerfile
● Extends php:7.1.6-cli
● Installs Chrome Stable + a script to take Chrome out of sandbox mode
● Installs Chromedriver
● Installs Required PHP Modules
● Copies Application Files
● Runs a custom entrypoint script

Slide 74

Slide 74 text

Scheduler Dockerfile
Review the Scheduler Docker Files

Slide 75

Slide 75 text

How Do I Download Files through Chrome?

Slide 76

Slide 76 text

Extending Dusk Browser - Hooking it Together
● Provide our own Browser class
● A DownloadsManager class for Chrome downloads
● A DownloadedFile class to work with downloaded files
● Service Container Bindings in AppServiceProvider
● Example Command
● Let's see it in action...

Slide 77

Slide 77 text

Full Docker Setup in Action (Demo)

Slide 78

Slide 78 text

My Projects
Lumen Programming Guide
http://www.apress.com/la/book/9781484221860
You will learn to write test-driven (TDD) microservices, REST APIs, and web service APIs with PHP using the Lumen micro-framework.
* Zero bugs in the book source code ;)

Slide 79

Slide 79 text

My Projects
Docker for PHP Developers
https://leanpub.com/docker-for-php-developers
A hands-on guide to learning how to use Docker as your primary development environment. It covers a diverse range of topics and scenarios you will face as a PHP developer picking up Docker.

Slide 80

Slide 80 text

Final Questions?

Slide 81

Slide 81 text

Thank You!