Spiders, Webbots and Scrapers with Geb

Talk at Codemotion 2016

Sergio del Amo

November 18, 2016

Transcript

  1. SPIDERS, WEBBOTS AND SCRAPERS WITH GEB · MADRID · NOV 18-19, 2016
  2. SERGIO DEL AMO · me@sergiodelamo.com · @sdelamo

  4. GEB · http://gebish.org

  5. SCRAPING WITH GEB · STEP 1: CREATE MODEL OBJECTS TO STORE THE INFORMATION WHICH YOU AIM TO SCRAPE
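
Step 1 could look roughly like the following in Groovy. This is a minimal sketch: the `Talk` class and its fields are hypothetical, loosely modelled on a conference agenda entry, and are not taken from the slides.

```groovy
import groovy.transform.Canonical

// Hypothetical model object: one entry of a conference agenda.
// @Canonical generates equals, hashCode, toString and a tuple constructor.
@Canonical
class Talk {
    String title
    String speaker
    String track
}

// Named-argument construction; fields that are not set stay null.
def talk = new Talk(title: 'Spiders, Webbots and Scrapers with Geb', speaker: 'Sergio del Amo')
assert talk.track == null
```

Keeping the model free of any Geb or Selenium types means the scraping code can change without touching whatever consumes the data.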
  6. SCRAPING WITH GEB · STEP 2: UNDERSTAND HOW HTML IS BUILT AND ENCAPSULATE HTML IN GEB PAGE AND MODULES
  7. GEB PAGES · Define the interesting parts of your pages in a concise, maintainable and extensible manner
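
A Geb page definition might be sketched as follows. The class name, URL and CSS selectors are placeholders rather than anything shown in the talk, and the snippet assumes Geb is on the classpath.

```groovy
import geb.Page

// Hypothetical page object; the URL and selectors are illustrative placeholders.
class LatestPostsPage extends Page {
    // Path appended to the configured base URL when navigating to this page.
    static url = '/blog'

    // Check used by Geb to verify that the browser is on this page.
    static at = { title.contains('Blog') }

    // Named content: the interesting parts of the page.
    static content = {
        postTitles { $('article h2') }
    }

    // Returns the text of every matched heading.
    List<String> fetchPostTitles() {
        postTitles*.text()
    }
}
```

Callers only see `fetchPostTitles()`, so a change in the site's markup is absorbed by the `content` block.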
  8. GEB PAGES ARE BLUEPRINTS FOR YOUR HTML PAGES (Source: Wikia)

    def browser = new Browser()
    browser.go 'http://sergiodelamo.es'
    def hPage = browser.page HomePage
    hPage.subscribeToGroovyCalamari('me@sergiodelamo.com')
    def latestPostsPage = browser.page WordpressLatestPostsPage
    def posts = latestPostsPage.fetchPosts()
  9. GEB MODULES · Modules are re-usable definitions of content that can be used across multiple pages
  10. SCRAPING WITH GEB · STEP 3: CREATE A FETCHER, ORCHESTRATE NAVIGATION IN THE WEBSITE
  11. SCRAPING WITH GEB · STEP 4: OUTPUT THE INFORMATION
    ‣ java -jar output-all.jar
    ‣ EXPOSE AN API (E.G. AWS LAMBDA + API GATEWAY)
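
Whichever of the two options is chosen, the scraped models have to be serialised somehow. The slides do not name a format; the sketch below assumes JSON and uses Groovy's bundled `groovy.json.JsonOutput`. The data is made up for illustration.

```groovy
import groovy.json.JsonOutput

// Hypothetical scraped data, represented as plain maps for brevity.
def talks = [
    [title: 'Talk A', speaker: 'Speaker A'],
    [title: 'Talk B', speaker: 'Speaker B']
]

// Serialise to JSON and pretty-print it for the console.
String json = JsonOutput.prettyPrint(JsonOutput.toJson(talks))
println json
```

The same string could be written to standard output by a runnable jar or returned as the body of an API response.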
  12. GRADLE SHADOW & APPLICATION
    https://github.com/johnrengelman/shadow
    https://docs.gradle.org/current/userguide/application_plugin.html

  13. EXAMPLE: CODEMOTION AGENDA · https://github.com/sdelamo/GEBWEBBOT_CODEMOTION2016

  15. A .KA-TAB-LI A .KA-TAB-LI

  16. THEAD .KA-TABLE-H .KA-TABLE-H .KA-TABLE-H .KA-TABLE-H

  17. EXAMPLE: PAGINATION · https://github.com/sdelamo/WEBBOT_GEB_MEETUP_MEMBERS

  18. DYNAMIC URL · http://www.meetup.com/es-ES/madrid-gug/members/49149882/
    BASE URL: http://www.meetup.com
    MEETUP GROUP SLUG: madrid-gug
    MEMBER ID: 28938802
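
The URL on this slide is assembled from a base URL, a group slug and a member ID. A plain Groovy sketch of that composition follows; the `memberUrl` helper and its `locale` parameter are hypothetical names introduced here for illustration.

```groovy
// Hypothetical helper: composes a member URL from the parts listed on the slide.
// The locale segment ('es-ES') appears in the slide's URL and is treated as a default.
String memberUrl(String baseUrl, String groupSlug, long memberId, String locale = 'es-ES') {
    "${baseUrl}/${locale}/${groupSlug}/members/${memberId}/"
}

assert memberUrl('http://www.meetup.com', 'madrid-gug', 49149882L) ==
        'http://www.meetup.com/es-ES/madrid-gug/members/49149882/'
```

In a Geb page, similar logic would typically live in an override of `convertToPath`, so that the arguments passed when navigating to the page are turned into the path.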
  20. PAGINATION .PAGINATION .NAV-NEXT

  21. PAGINATION

  23. PAGINATION MODULE
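
A pagination module might be sketched along these lines. The `.nav-next` selector comes from the labels on the earlier pagination slide (its lowercase spelling is an assumption); the class and method names are hypothetical, and Geb is assumed to be on the classpath.

```groovy
import geb.Module

// Sketch of a reusable pagination module; names are illustrative.
class PaginationModule extends Module {
    static content = {
        // required: false lets the content be absent on the last page
        // instead of raising an error.
        nextLink(required: false) { $('.nav-next') }
    }

    // True while a "next" link is present.
    boolean hasNext() {
        !nextLink.isEmpty()
    }

    // Follows the "next" link.
    void next() {
        nextLink.click()
    }
}
```

A page could then expose it with a content definition such as `pagination { $('.pagination').module(PaginationModule) }`, and a fetcher could loop while `pagination.hasNext()` holds.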

  24. HARVEST AND VISIT

  25. HARVEST LINKS

  27. EXAMPLE HIDDEN CONTENT AND ON MOUSE OVER EVENTS

  30. FAILS: HIDDEN CONTENT

  31. CALL A JS METHOD

  32. MOVE TO ELEMENT

  33. INCLUDE LIBRARY

  34. TIPS & TRICKS

  35. GEB EXAMPLE GRADLE · https://github.com/geb/geb-example-gradle · MARCIN ERDMANN
    The following commands will launch the tests with the individual browsers:
    ./gradlew chromeTest
    ./gradlew firefoxTest
    ./gradlew phantomJsTest
    To run with all, you can run:
    ./gradlew test
  36. GEB.CONFIG

  37. DIFFERENT BROWSERS
    Run in HtmlUnit:
    $ ./gradlew -Dgeb.env=htmlUnit test
    Run in PhantomJS:
    $ ./gradlew -Dgeb.env=phantomJs -Dphantomjs.binary.path=./phantomjs-2.1.1-macosx/bin/phantomjs test
    Run in Firefox:
    $ ./gradlew -Dgeb.env=firefox test
    Run in Chrome:
    $ ./gradlew -Dgeb.env=chrome -Dwebdriver.chrome.driver=./chromedriver test
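
The `geb.env` values above select an environment block in Geb's configuration script. A sketch of what such a `GebConfig.groovy` could contain is shown below; it is not the file from the talk and assumes the matching Selenium driver dependencies are on the classpath.

```groovy
// GebConfig.groovy (sketch): one environment per value passed via -Dgeb.env
import org.openqa.selenium.chrome.ChromeDriver
import org.openqa.selenium.firefox.FirefoxDriver
import org.openqa.selenium.htmlunit.HtmlUnitDriver
import org.openqa.selenium.phantomjs.PhantomJSDriver

environments {
    // Each driver is declared as a closure so it is only created when selected.
    htmlUnit { driver = { new HtmlUnitDriver() } }
    phantomJs { driver = { new PhantomJSDriver() } }
    firefox { driver = { new FirefoxDriver() } }
    chrome { driver = { new ChromeDriver() } }
}
```

The driver binaries themselves are located through the system properties shown in the commands, such as `webdriver.chrome.driver` and `phantomjs.binary.path`.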
  38. USER AGENT SPOOFING

  39. USER AGENT SPOOFING

  40. USER AGENT SPOOFING · https://github.com/sdelamo/GEBWEBBOT_USERAGENT

  41. COOKIES

  42. MAXIMIZE WINDOW

  43. OBTAIN CURRENT PAGE HTML

  44. groovycalamari.com · A "weekly" curated email newsletter full of interesting, relevant links about the Groovy Ecosystem
  45. ?

  46. EXAMPLE: GREACH API · https://github.com/sdelamo/GREACHAPI

  47. FOOTER A A A A DIV.CREDITS

  48. .PTP-PRICING-TABLE .PTP-ITEM-CONTAINER A .PTP-CTA .PTP-ITEM-CONTAINER .PTP-ITEM-CONTAINER .PTP-PLAN .PTP-PRICE .PTP-BULLET-ITEM .PTP-BULLET-ITEM .PTP-BULLET-ITEM
  50. MODEL

  51. DESIRED OUTPUT