Web scraping with Symfony Panther

Talk presented at Darkmira Tour PHP 2019
https://php.darkmiratour.rocks/2019/

Tools like Guzzle and DomCrawler make it easy to build bots that navigate other systems and extract data from plain HTML. But the web has evolved: SPAs are built with JavaScript and executed client-side, so interacting with alerts and other dynamic elements has become a problem for bots.

This presentation features Symfony Panther, an end-to-end (e2e) testing tool and web crawler that can run JavaScript and interact with visual elements, simulating a user with high fidelity.
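As a rough illustration of the kind of scraping described above, here is a minimal sketch that uses Panther to read text from a page after its JavaScript has run. It is not taken from the talk: the helper name `fetchRenderedText`, the URL, and the `h1` selector are hypothetical, and the script assumes `symfony/panther` is installed via Composer and that Chrome and ChromeDriver are available locally.

```php
<?php
// Minimal sketch, not the talk's code. Assumes `composer require symfony/panther`
// and a local Chrome + ChromeDriver installation.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\Panther\Client;

/**
 * Hypothetical helper: load a page in a real browser, wait until
 * JavaScript has rendered the element, and return its text.
 */
function fetchRenderedText(Client $client, string $url, string $selector): string
{
    $client->request('GET', $url);

    // waitFor() blocks until the selector is present in the DOM
    // (or throws on timeout) and returns a fresh crawler.
    $crawler = $client->waitFor($selector);

    return $crawler->filter($selector)->text();
}

// Placeholder URL and selector, chosen only for illustration.
$client = Client::createChromeClient();
try {
    echo fetchRenderedText($client, 'https://example.com/', 'h1'), PHP_EOL;
} finally {
    // Always shut down the browser process, even when scraping fails.
    $client->quit();
}
```

Unlike a plain HTTP client, the browser executes the page's scripts before the selector is queried, which is why the explicit wait matters on SPA-style pages.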

raphaeldealmeida

June 09, 2019

Transcript

  1. Web scraping with Symfony Panther

  2. Raphael de Almeida

  3. https://t.me/phprio

  4. Take pictures of the conference and publish them • Make a new friend • Give feedback to the speakers
  5. In an ideal world, all applications share data via APIs

  6. Data returned is structured

  7. Not only in JSON

  8. Richardson Maturity Model

  10. But that isn’t the real world

  11. We need to extract structured data from unstructured sites

  12. Web Scraping

  13. Web indexing • Data mining • Automatic form filling • Price comparison • Change detection

  14. This is easy, cURL solves it • I'm hacky

  16. How do we simulate a user reading HTML and handling forms and cookies?
  17. Guzzle • DomCrawler

  19. The web evolves, and with it come SPAs

  21. https://www.linkedin.com/jobs/

  22. source-code

  24. WebDriver (W3C protocol) • Headless Google Chrome

  27. Show me the code

  29. https://github.com/raphaeldealmeida/pokemon-memory-game-player

  30. Tips

  31. Use the Page Object pattern

  33. Create logs for all important actions

  34. Monolog

  36. Monitor the execution of your service

  37. Monit

  38. Handle timeout cases

  39. References
    • https://github.com/symfony/panther
    • https://martinfowler.com/bliki/PageObject.html
    • https://symfony.com/doc/current/components/dom_crawler.html
    • http://docs.guzzlephp.org/en/stable/
    • http://wttr.in/sao%20paulo
    • https://developers.google.com/web/updates/2017/04/headless-chrome
    • https://vue-pokemon-memory-game.vinicius73.dev/
    • https://mmonit.com/monit/
    • https://github.com/Seldaek/monolog
    • https://pt.wikipedia.org/wiki/Willis_Carrier
  40. Support http://bit.ly/2Ita3d9

  41. Give me feedback https://joind.in/talk/e74d7

  42. THANK YOU @raph_almeida https://joind.in/talk/78c98