Web scraping with Symfony Panther

Talk presented at Darkmira Tour PHP 2019
https://php.darkmiratour.rocks/2019/

Tools like Guzzle and DomCrawler make it easy to build bots that navigate other systems and extract data from plain HTML. But the web has evolved: many sites are now SPAs built with JavaScript and rendered client-side, so interacting with alerts and other dynamic elements has become a problem for bots.

This presentation features Symfony Panther, an end-to-end testing tool and web crawler that runs JavaScript and interacts with visual elements, simulating a user with high fidelity.

raphaeldealmeida

June 09, 2019

Transcript

  1. Web scraping
    with Symfony
    Panther

  2. Raphael
    de Almeida

  3. https://t.me/phprio

  4. Take pictures of the conference
    and publish them
    Make a new friend
    Give feedback to the speakers

  5. In an ideal world, all
    applications share data
    via APIs

  6. Data returned is
    structured

  7. Not only in JSON

  8. Richardson Maturity
    Model

  10. But that isn’t the real world

  11. We need to extract
    structured data from
    unstructured sites

  12. Web Scraping

  13. web indexing
    data mining
    automatic form filling
    price comparison
    change detection

  14. This is easy, cURL
    solves it
    I'm Hacky

  16. How to simulate a user
    reading HTML, handling
    forms and cookies?

  17. Guzzle
    DomCrawler

  19. The web evolves, and
    with it come SPAs

  21. https://www.linkedin.com/jobs/

  22. source-code

  24. WebDriver W3C protocol
    Google Chrome Headless

  27. Show me the code

  29. https://github.com/raphaeldealmeida/pokemon-memory-game-player

  30. Tips

  31. Use
    Page Object

  33. Create logs for all
    important actions

  34. Monolog

  36. Monitor the execution of
    your service

  37. Monit

  38. Handle timeout cases

  39. References
    ● https://github.com/symfony/panther
    ● https://martinfowler.com/bliki/PageObject.html
    ● https://symfony.com/doc/current/components/dom_crawler.html
    ● http://docs.guzzlephp.org/en/stable/
    ● http://wttr.in/sao%20paulo
    ● https://developers.google.com/web/updates/2017/04/headless-chrome
    ● https://vue-pokemon-memory-game.vinicius73.dev/
    ● https://mmonit.com/monit/
    ● https://github.com/Seldaek/monolog
    ● https://pt.wikipedia.org/wiki/Willis_Carrier

  40. Support
    http://bit.ly/2Ita3d9

  41. Give me
    feedback
    https://joind.in/talk/e74d7

  42. THANK YOU
    @raph_almeida
    https://joind.in/talk/78c98
