
Crawling the Web with the New Symfony Components

When developing an application, we often need to gather data from other sources. This data comes in various forms, is usually unstructured, and often sits behind a JavaScript application, which makes it harder to reach.

To make things worse, as the application evolves, we need to get more data, from even more sources. But don't worry, things can be easier!

In this talk we'll use Symfony's HttpClient, Messenger and Panther to build a crawler, first as a simple console application, then evolving it into a distributed one.

Adiel Cristo

November 21, 2019

Transcript

  1. Schedule

    ✔ Crawlers, Spiders, Bots…
    ✔ Choosing a Tool
    ✔ Building a Crawler
    ✔ Review
    ✔ Next Steps
  2. Crawlers, Spiders, Bots…

    “Internet bot that systematically browses the web, typically for the purpose of web indexing.” Wikipedia
  3. … Scrapers

    “A web scraper is an API or tool to extract data from a web site.” Wikipedia
  4. And Which Is Which?!

    Summing up...
    A crawler downloads content.
    A scraper extracts data from downloaded content.
  5. cURL / libcurl

    ✔ Command-line tool for getting or sending data, including files, using URL syntax
    ✔ Uses libcurl, so it supports a range of common network protocols
    ✔ libcurl has wrappers for a lot of languages, including PHP
    ✔ Is probably behind almost every download you do
  6. Guzzle

    ✔ PHP HTTP client
    ✔ Can send both synchronous and asynchronous requests using the same interface
    ✔ Uses PSR-7 interfaces for requests, responses, and streams
    ✔ Middleware system allows you to augment and compose client behavior
  7. HttpClient

    ✔ Low-level HTTP client with support for both PHP stream wrappers and cURL
    ✔ Provides utilities to consume APIs and supports synchronous and asynchronous operations
    ✔ Provides out-of-the-box clients for PSR-18, mocking, caching, and scoping features
    ✔ Supports HTTP/2
  8. Getting the Symfony Versions

    And getting the Symfony versions can be as easy as...

    {
        "lts": "3.4.35",
        "latest": "4.3.8",
        "dev": "4.4.0-RC1",
        …
        "4.0": "4.0.15",
        "4.1": "4.1.12",
        "4.2": "4.2.12",
        "4.3": "4.3.8",
        "4.4": "4.4.0-RC1",
        …
    }
  9. Getting the Symfony Versions

    $httpClient = HttpClient::create(); // not using DI yet :(
    $response = $httpClient->request('GET', 'https://symfony.com/versions.json');

    if (Response::HTTP_OK === $response->getStatusCode()) {
        return $response->toArray();
    }

    throw new RuntimeException('Could not retrieve the Symfony versions.');
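
The request above needs network access. As a minimal sketch of the "mocking" feature listed for HttpClient, the same logic can be exercised offline with `MockHttpClient`. The helper name `fetchSymfonyVersions` and the shortened JSON body are illustrative assumptions, and the snippet assumes `symfony/http-client` is installed through Composer with the autoloader at `vendor/autoload.php`.

```php
<?php

require __DIR__.'/vendor/autoload.php';

use Symfony\Component\HttpClient\MockHttpClient;
use Symfony\Component\HttpClient\Response\MockResponse;
use Symfony\Contracts\HttpClient\HttpClientInterface;

// Hypothetical helper: same logic as the slide, but the client is injected.
function fetchSymfonyVersions(HttpClientInterface $httpClient): array
{
    $response = $httpClient->request('GET', 'https://symfony.com/versions.json');

    if (200 !== $response->getStatusCode()) {
        throw new RuntimeException('Could not retrieve the Symfony versions.');
    }

    // toArray() decodes the JSON body into a PHP array.
    return $response->toArray();
}

// Canned response (a shortened copy of the JSON shown earlier), so no request leaves the machine.
$body = '{"lts": "3.4.35", "latest": "4.3.8", "dev": "4.4.0-RC1"}';
$mockClient = new MockHttpClient(new MockResponse($body, [
    'http_code' => 200,
    'response_headers' => ['Content-Type: application/json'],
]));

$versions = fetchSymfonyVersions($mockClient);

foreach ($versions as $name => $version) {
    echo $name.': '.$version.PHP_EOL;
}
```

Injecting the client through `HttpClientInterface` is a design choice of this sketch: it lets the real client from `HttpClient::create()` and the mock be swapped without touching the function.
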
  10. Getting the Symfony Versions

    $ console app:symfony:version
    ========== Symfony Versions ==========
    lts: 3.4.35
    latest: 4.3.8
    dev: 4.4.0-RC1
    ...
    4.3: 4.3.8
    4.4: 4.4.0-RC1
  11. Getting the Symfony Blog Posts (v1)

    Now the goal is to get only the latest posts from the blog. It should be easy... :D
  13. Getting the Symfony Blog Posts (v1)

    To get started, we need to understand the page structure and which DOM elements we need.
  14. Getting the Symfony Blog Posts (v1)

    Adding the symfony/skeleton to borrow DI and basic config... :P

    composer create-project symfony/skeleton crawler
  15. Getting the Symfony Blog Posts (v1)

    Adding the dom-crawler for easy HTML DOM navigation.

    composer require symfony/dom-crawler
  16. Getting the Symfony Blog Posts (v1)

    And… adding css-selector for... CSS selection.

    composer require symfony/css-selector
  17. Getting the Symfony Blog Posts (v1)

    $response = $this->httpClient->request('GET', $url);

    if (Response::HTTP_OK === $response->getStatusCode()) {
        $content = $response->getContent();
        $crawler = new Crawler($content);

        // Create an object to handle the target DOM elements
        return new BlogPostSearchPage($crawler);
    }
  18. Getting the Symfony Blog Posts (v1)

    $response = $this->httpClient->request('GET', $url);

    if (Response::HTTP_OK === $response->getStatusCode()) {
        $content = $response->getContent();
        // The second argument is the URI of the page; relative links are resolved against it
        $crawler = new Crawler($content, 'https://symfony.com');

        // Create an object to handle the target DOM elements
        return new BlogPostSearchPage($crawler);
    }
  19. Getting the Symfony Blog Posts (v1)

    class BlogPostSearchPage
    {
        // ...

        public function getBlogPostLinks(): array
        {
            return $this->crawler->filter(
                'div.container div.post__excerpt h2.m-b-5 a'
            )->links();
        }
    }
  20. Getting the Symfony Blog Posts (v1)

    And voilà!

    $ console app:symfony:blog:latest
    ========== Latest Posts - Symfony Blog ==========
    https://symfony.com/blog/symfony-4-4-0-released
    https://symfony.com/blog/new-in-symfony-4-4-service-container-linter
    https://symfony.com/blog/new-in-symfony-4-4-ip-address-anonymizer
    ...
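
To try the link extraction without touching the live site, here is a small sketch that runs the same selector against inline HTML. The markup is a simplified stand-in for the blog listing (an assumption, not the real page), and the snippet assumes `symfony/dom-crawler` and `symfony/css-selector` are installed through Composer.

```php
<?php

require __DIR__.'/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Simplified stand-in for the blog listing markup (an assumption, not the real page).
$html = <<<'HTML'
<html><body>
<div class="container">
  <div class="post__excerpt">
    <h2 class="m-b-5"><a href="/blog/symfony-4-4-0-released">First post</a></h2>
  </div>
  <div class="post__excerpt">
    <h2 class="m-b-5"><a href="/blog/new-in-symfony-4-4-service-container-linter">Second post</a></h2>
  </div>
</div>
</body></html>
HTML;

// The second argument is the URI of the page; relative hrefs are resolved against it.
$crawler = new Crawler($html, 'https://symfony.com');

$uris = [];
foreach ($crawler->filter('div.container div.post__excerpt h2.m-b-5 a')->links() as $link) {
    // getUri() returns the absolute URI of the link.
    $uris[] = $link->getUri();
}

print_r($uris);
```
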
  21. Getting the Symfony Blog Posts (v2)

    So, if we have the links, we can get the blog posts, right?!
  22. Getting the Symfony Blog Posts (v2)

    Yep! We just need to add some code to the crawler...
  23. Getting the Symfony Blog Posts (v2)

    public function getBlogPost(string $url): BlogPostPage
    {
        $response = $this->httpClient->request('GET', $url);

        if (Response::HTTP_OK === $response->getStatusCode()) {
            $content = $response->getContent();
            $crawler = new Crawler($content, 'https://symfony.com');

            // Create an object to handle the target DOM elements
            return new BlogPostPage($crawler);
        }

        // Without this, a non-200 response would violate the return type
        throw new RuntimeException('Could not retrieve the blog post.');
    }
  24. Getting the Symfony Blog Posts (v2)

    class BlogPostPage
    {
        public function getTitle(): string
        {
            return $this->crawler->filter('div.container div.row main.col-sm-9 h1')->text();
        }

        // ...
    }
  25. Getting the Symfony Blog Posts (v2)

    And there it is!

    $ console app:symfony:blog:post \
        https://symfony.com/blog/symfony-4-4-0-released
    ========== Post - Symfony Blog ==========
    Symfony 4.4.0 released
    <p><a class="reference external" href="https://github.com/symfony/symfony/pull/34471">Symfony 4.4.0</a> has just been released. Here is a list of the most important changes:</p>
    ...
  26. Getting the Symfony Blog Posts (v3)

    And we can get the contents of a specific post...
  27. Getting the Symfony Blog Posts (v3)

    We need to check for more links, besides the ones we have from the first page. Suggestions?!
  28. Getting the Symfony Blog Posts (v3)

    class BlogPostSearchPage
    {
        // ...

        public function hasNextSearchPageLink(): bool
        {
            return count($this->crawler->filter('div.container div.row main.col-sm-9 ul.pager li.text-right a')) > 0;
        }
    }
  29. Getting the Symfony Blog Posts (v3)

    class BlogPostSearchPage
    {
        // ...

        public function getNextSearchPageLink(): Link
        {
            return $this->crawler->filter('div.container div.row main.col-sm-9 ul.pager li.text-right a')->link();
        }
    }
  30. Getting the Symfony Blog Posts (v3)

    $ console app:symfony:blog:all
    ========== All Posts - Symfony Blog ==========
    https://symfony.com/blog/symfony-5-0-0-released
    https://symfony.com/blog/symfony-4-4-0-released
    https://symfony.com/blog/new-in-symfony-4-4-service-container-...
    https://symfony.com/blog/new-in-symfony-4-4-ip-address-...
    https://symfony.com/blog/new-in-symfony-4-4-httpclient-...
    https://symfony.com/blog/symfony-5-0-0-rc1-released
    https://symfony.com/blog/symfony-4-4-0-rc1-released
    ...
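
One way the "has next page / get next page" checks could drive a loop is sketched below. This is not the talk's actual command: `collectPostLinks`, the `$maxPages` cap, the shortened selectors, and the two canned pages are all assumptions made so the example runs offline with `MockHttpClient`. It assumes `symfony/http-client`, `symfony/dom-crawler` and `symfony/css-selector` are installed.

```php
<?php

require __DIR__.'/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;
use Symfony\Component\HttpClient\MockHttpClient;
use Symfony\Component\HttpClient\Response\MockResponse;
use Symfony\Contracts\HttpClient\HttpClientInterface;

// Hypothetical helper: collects post links page by page, following "next" links.
function collectPostLinks(HttpClientInterface $httpClient, string $startUrl, int $maxPages = 10): array
{
    $postUrls = [];
    $url = $startUrl;

    // The page cap keeps a broken pager from looping forever.
    for ($page = 0; null !== $url && $page < $maxPages; ++$page) {
        $response = $httpClient->request('GET', $url);
        $crawler = new Crawler($response->getContent(), $url);

        foreach ($crawler->filter('div.post__excerpt h2 a')->links() as $link) {
            $postUrls[] = $link->getUri();
        }

        // Same idea as hasNextSearchPageLink() / getNextSearchPageLink().
        $next = $crawler->filter('ul.pager li.text-right a');
        $url = count($next) > 0 ? $next->link()->getUri() : null;
    }

    return $postUrls;
}

// Two canned pages (simplified, assumed markup) so the sketch runs offline.
$pages = [
    'https://symfony.com/blog/' => '<html><body><div class="post__excerpt"><h2><a href="/blog/post-a">A</a></h2></div>'
        .'<ul class="pager"><li class="text-right"><a href="/blog/?page=2">Next</a></li></ul></body></html>',
    'https://symfony.com/blog/?page=2' => '<html><body><div class="post__excerpt"><h2><a href="/blog/post-b">B</a></h2></div></body></html>',
];

$mockClient = new MockHttpClient(function (string $method, string $url) use ($pages): MockResponse {
    return new MockResponse($pages[$url] ?? '', ['http_code' => isset($pages[$url]) ? 200 : 404]);
});

$postUrls = collectPostLinks($mockClient, 'https://symfony.com/blog/');
print_r($postUrls);
```
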
  31. Getting the Symfony Blog Posts (v4)

    But you do NOT want to suddenly hit the target site with thousands of requests...
  32. Getting the Symfony Blog Posts (v4)

    Envelope: A wrapper for messages. It lets you attach useful information to a message through envelope stamps.
  33. Getting the Symfony Blog Posts (v4)

    Envelope Stamps: Pieces of information you need to attach to your message.
  34. Getting the Symfony Blog Posts (v4)

    Sender: Responsible for serializing and sending messages to something. This something can be a message broker or a third-party API, for example.
  35. Getting the Symfony Blog Posts (v4)

    Receiver: Responsible for retrieving, deserializing and forwarding messages to handler(s). This can be a message queue puller or an API endpoint, for example.
  36. Getting the Symfony Blog Posts (v4)

    Handler: Responsible for handling messages using the business logic applicable to the messages.
  37. Getting the Symfony Blog Posts (v4)

    Middleware: Can access the message and its wrapper (the envelope) while it is dispatched through the bus. Literally "the software in the middle", middleware is not about the business logic of an application.
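
The concepts above can be seen working together in a small in-process sketch. The `CrawlBlogPost` message and its handler are hypothetical names, not the talk's code, and the bus here handles the message synchronously; with an asynchronous transport configured, a sender and receiver would sit between dispatch and handling. It assumes `symfony/messenger` 4.3 or later is installed through Composer.

```php
<?php

require __DIR__.'/vendor/autoload.php';

use Symfony\Component\Messenger\Handler\HandlersLocator;
use Symfony\Component\Messenger\MessageBus;
use Symfony\Component\Messenger\Middleware\HandleMessageMiddleware;
use Symfony\Component\Messenger\Stamp\DelayStamp;
use Symfony\Component\Messenger\Stamp\HandledStamp;

// Hypothetical message: "please crawl this URL".
final class CrawlBlogPost
{
    private $url;

    public function __construct(string $url)
    {
        $this->url = $url;
    }

    public function getUrl(): string
    {
        return $this->url;
    }
}

// Hypothetical handler: a real one would download and scrape the post.
final class CrawlBlogPostHandler
{
    public function __invoke(CrawlBlogPost $message): string
    {
        return 'crawled '.$message->getUrl();
    }
}

// A bus with a single middleware that calls the matching handler(s).
$bus = new MessageBus([
    new HandleMessageMiddleware(new HandlersLocator([
        CrawlBlogPost::class => [new CrawlBlogPostHandler()],
    ])),
]);

// dispatch() wraps the message in an Envelope; the DelayStamp (in milliseconds)
// travels with it. An asynchronous transport may use it to postpone delivery;
// this in-process bus handles the message immediately.
$envelope = $bus->dispatch(
    new CrawlBlogPost('https://symfony.com/blog/symfony-4-4-0-released'),
    [new DelayStamp(5000)]
);

// The handling middleware records each handler's return value in a HandledStamp.
$result = $envelope->last(HandledStamp::class)->getResult();
$delay = $envelope->last(DelayStamp::class)->getDelay();

echo $result.PHP_EOL;
echo 'requested delay: '.$delay.' ms'.PHP_EOL;
```

Spacing crawl requests out with a delay stamp is one possible way to avoid hitting the target site all at once; whether the delay is honored depends on the transport in use.
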
  38. What Panther Can Do

    1. Executes the JavaScript code contained in webpages
    2. Supports everything that Chrome (or Firefox) implements
    3. Can take screenshots
    4. Can wait for asynchronously loaded elements to show up
    5. Lets you run your own JS code or XPath queries in the context of the loaded page
    6. Supports custom Selenium server installations
    7. Supports remote browser testing services including SauceLabs and BrowserStack
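
A few of these capabilities (running the page's JavaScript, waiting for an element, taking a screenshot) might be combined as in the sketch below. The helper name `fetchRenderedText` and the screenshot file name are assumptions, and the code needs `symfony/panther` plus a local Chrome and ChromeDriver, so it only does real work when a URL and a selector are passed on the command line.

```php
<?php

require __DIR__.'/vendor/autoload.php';

use Symfony\Component\Panther\Client;

// Hypothetical helper: loads a page in headless Chrome and returns the text of
// the first element matching $selector once JavaScript has rendered it.
function fetchRenderedText(string $url, string $selector): string
{
    $client = Client::createChromeClient();

    try {
        $client->request('GET', $url);
        // Block until the element exists in the DOM (asynchronously loaded content).
        $client->waitFor($selector);
        $client->takeScreenshot('page.png');

        return $client->getCrawler()->filter($selector)->text();
    } finally {
        // Stop the browser and the WebDriver process.
        $client->quit();
    }
}

// Usage from the command line: php panther.php <url> <css-selector>
if (PHP_SAPI === 'cli' && isset($argv[1], $argv[2])) {
    echo fetchRenderedText($argv[1], $argv[2]).PHP_EOL;
}
```
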
  41. Review

    1. Define the first target page
    2. Write a crawler for that page
    3. Write a scraper for the content you need
    4. Do whatever you need to with the content
    5. Rinse and repeat
  44. Next Steps

    1. Add tests (hint: Blackfire Player)
    2. Leverage the http-client to do async calls
    3. Look out for security breaches
    4. Look out for (more) performance bottlenecks
    5. (...) The list increases with the demands
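
For the async step, a minimal sketch of concurrent requests with HttpClient could look like this. It uses `MockHttpClient` with made-up response bodies so it runs offline; swapping in `HttpClient::create()` would send real requests. It assumes `symfony/http-client` is installed through Composer.

```php
<?php

require __DIR__.'/vendor/autoload.php';

use Symfony\Component\HttpClient\MockHttpClient;
use Symfony\Component\HttpClient\Response\MockResponse;

// Mocked client with made-up bodies, so the sketch runs offline.
$httpClient = new MockHttpClient(function (string $method, string $url): MockResponse {
    return new MockResponse('content of '.$url);
});

$urls = [
    'https://symfony.com/blog/symfony-4-4-0-released',
    'https://symfony.com/blog/symfony-5-0-0-released',
];

// request() returns immediately: responses are lazy, so all requests are
// started here before any body is read.
$responses = [];
foreach ($urls as $url) {
    $responses[] = $httpClient->request('GET', $url);
}

// stream() yields chunks as they arrive, in whatever order the responses complete.
$contents = [];
foreach ($httpClient->stream($responses) as $response => $chunk) {
    if ($chunk->isLast()) {
        // The body is complete at this point, so getContent() does not block.
        $contents[$response->getInfo('url')] = $response->getContent();
    }
}

ksort($contents);
print_r($contents);
```
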
  45. Links

    ✔ https://symfony.com/doc/current
    ✔ https://assets.andreiabohner.org/symfony/sf43-httpclient-cheat-sheet.pdf
    ✔ https://speakerdeck.com/nicolasgrekas/symfony-httpclient-what-else
    ✔ https://speakerdeck.com/raphaeldealmeida/web-scraping-with-symfony-panther
    ✔ https://blackfire.io/docs/player/index
    ✔ https://blackfire.io/docs/player/index#writing-expectations