Spiders, Webbots and Scrapers with Geb - Madrid GUG - Nov 2016

Deck of the talk presented at Madrid Groovy User Group by Sergio del Amo.

Sergio del Amo

November 08, 2016

Transcript

  1. SPIDERS, WEBBOTS AND SCRAPERS WITH GEB

  2. SERGIO DEL AMO ME@SERGIODELAMO.COM @SDELAMO

  3. None
  4. GEB HTTP://GEBISH.ORG

  5. http://www.webbotsspidersscreenscrapers.com

  6. WHAT CAN YOU DO?
    ▸ PRICE-MONITORING WEBBOTS
    ▸ IMAGE-CAPTURING WEBBOTS
    ▸ LINK VERIFICATION WEBBOTS
    ▸ WEBBOTS THAT SEND EMAIL
    ▸ WEBBOTS THAT CONVERT A WEBSITE INTO AN API
    ▸ SNIPERS
  7. EXAMPLE 1 GREACH API HTTPS://GITHUB.COM/SDELAMO/GREACHAPI

  8. GEB PAGES
    Define the interesting parts of your pages in a concise, maintainable and extensible manner
  9. GEB PAGES ARE BLUEPRINTS FOR YOUR HTML PAGES
    def browser = new Browser()
    browser.go 'http://sergiodelamo.es'
    def hPage = browser.page HomePage
    hPage.subscribeToGroovyCalamari('me@sergiodelamo.com')
    def latestPostsPage = browser.page WordpressLatestPostsPage
    def posts = latestPostsPage.fetchPosts()
    Source: Wikia
  10. None
  11. FOOTER A A A A DIV.CREDITS

  12. GEB MODULES
    Modules are re-usable definitions of content that can be used across multiple pages
  13. .PTP-PRICING-TABLE .PTP-ITEM-CONTAINER A .PTP-CTA .PTP-ITEM-CONTAINER .PTP-ITEM-CONTAINER .PTP-PLAN .PTP-PRICE .PTP-BULLET-ITEM .PTP-BULLET-ITEM

    .PTP-BULLET-ITEM
  14. None
  15. DEMO

  16. MODEL

  17. GRADLE SHADOW & APPLICATION
    HTTPS://GITHUB.COM/JOHNRENGELMAN/SHADOW
    HTTPS://DOCS.GRADLE.ORG/CURRENT/USERGUIDE/APPLICATION_PLUGIN.HTML

  18. java -jar output-all.jar

  19. DESIRED OUTPUT

  20. EXAMPLE 2 PAGINATION HTTPS://GITHUB.COM/SDELAMO/WEBBOT_GEB_MEETUP_MEMBERS

  21. DYNAMIC URL
    http://www.meetup.com/es-ES/madrid-gug/members/49149882/
    BASE URL: http://www.meetup.com
    MEETUP GROUP SLUG: madrid-gug
    MEMBER ID: 28938802
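
The three parts named on the slide can be combined into a member URL with a small helper. This is a minimal sketch, not the code from the talk: the function name `memberUrl`, its parameter names, and the `locale` default (taken from the `es-ES` segment visible in the slide's URL) are assumptions made for illustration.

```groovy
// Hypothetical helper: builds a member profile URL from the parts listed on the slide.
// The 'es-ES' default mirrors the locale segment seen in the slide's example URL.
String memberUrl(String baseUrl, String groupSlug, long memberId, String locale = 'es-ES') {
    "${baseUrl}/${locale}/${groupSlug}/members/${memberId}/"
}

println memberUrl('http://www.meetup.com', 'madrid-gug', 28938802L)
```

Keeping the base URL, group slug, and member id as separate arguments lets the same webbot walk any group or member without string surgery on a full URL.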
  22. None
  23. PAGINATION .PAGINATION .NAV-NEXT

  24. PAGINATION

  25. None
  26. PAGINATION MODULE

  27. HARVEST AND VISIT

  28. HARVEST LINKS
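
The harvest-and-visit loop over paginated listings can be sketched in plain Groovy, independent of any browser. Everything here is an assumption for illustration: `harvestLinks`, the `fetchPage` closure, and the in-memory `fakeSite` are hypothetical stand-ins. In a Geb webbot, `fetchPage` would typically navigate to the URL, read the links of interest, and take the next-page URL from an element such as `.pagination .nav-next`.

```groovy
// fetchPage is a hypothetical closure: given a URL it returns a map with the
// links found on that page ('links') and the next page URL ('next', null on the last page).
List<String> harvestLinks(String startUrl, Closure<Map> fetchPage, int maxPages = 100) {
    List<String> harvested = []
    Set<String> seen = [] as Set
    String url = startUrl
    int visited = 0
    // Stop on the last page, on a page already visited (loop guard), or at the page cap.
    while (url && !seen.contains(url) && visited < maxPages) {
        seen << url
        Map page = fetchPage(url)
        harvested.addAll(page.links as List<String>)
        url = page.next
        visited++
    }
    harvested
}

// In-memory stand-in for a paginated site, so the sketch runs without a browser.
def fakeSite = [
    '/members?page=1': [links: ['/members/1', '/members/2'], next: '/members?page=2'],
    '/members?page=2': [links: ['/members/3'], next: null],
]
def links = harvestLinks('/members?page=1') { String u -> fakeSite[u] }
println links
```

Separating the harvest phase from the visit phase means the collected links can be stored, deduplicated, or split between several webbots before any detail page is opened.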

  29. None
  30. EXAMPLE 3 HIDDEN CONTENT AND ON MOUSE OVER EVENTS

  31. None
  32. None
  33. FAILS: HIDDEN CONTENT

  34. CALL A JS METHOD

  35. MOVE TO ELEMENT

  36. INCLUDE LIBRARY

  37. TIPS & TRICKS

  38. GEB EXAMPLE GRADLE
    HTTPS://GITHUB.COM/GEB/GEB-EXAMPLE-GRADLE
    The following commands will launch the tests with the individual browsers:
    ./gradlew chromeTest
    ./gradlew firefoxTest
    ./gradlew phantomJsTest
    To run with all, you can run:
    ./gradlew test
    MARCIN ERDMANN
  39. GEB.CONFIG

  40. DIFFERENT BROWSERS
    Run in HtmlUnit:
    $ ./gradlew -Dgeb.env=htmlUnit test
    Run in PhantomJS:
    $ ./gradlew -Dgeb.env=phantomJs -Dphantomjs.binary.path=./phantomjs-2.1.1-macosx/bin/phantomjs test
    Run in Firefox:
    $ ./gradlew -Dgeb.env=firefox test
    Run in Chrome:
    $ ./gradlew -Dgeb.env=chrome -Dwebdriver.chrome.driver=./chromedriver test
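
The `-Dgeb.env` values above select an environment block in `GebConfig.groovy`. The fragment below is a sketch of what such a file might look like, not the configuration shown in the talk: the environment names are assumed to match the values passed on the command line, and each driver class requires the corresponding Selenium driver dependency on the classpath.

```groovy
// GebConfig.groovy (sketch; environment names assumed to match the -Dgeb.env values)
import org.openqa.selenium.chrome.ChromeDriver
import org.openqa.selenium.firefox.FirefoxDriver
import org.openqa.selenium.htmlunit.HtmlUnitDriver
import org.openqa.selenium.phantomjs.PhantomJSDriver

environments {
    // Each closure is evaluated lazily, so only the selected browser is started.
    htmlUnit { driver = { new HtmlUnitDriver() } }
    phantomJs { driver = { new PhantomJSDriver() } }
    firefox { driver = { new FirefoxDriver() } }
    chrome { driver = { new ChromeDriver() } }
}
```

The `phantomjs.binary.path` and `webdriver.chrome.driver` system properties from the commands above are read by the respective drivers to locate their binaries.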
  41. USER AGENT SPOOFING

  42. USER AGENT SPOOFING

  43. USER AGENT SPOOFING HTTPS://GITHUB.COM/SDELAMO/GEBWEBBOT_USERAGENT

  44. COOKIES

  45. MAXIMIZE WINDOW

  46. OBTAIN CURRENT PAGE HTML

  47. UI INTERACTION

  48. KEYBOARD

  49. SLIDERS

  50. SPLIT LOAD BETWEEN WEBBOTS HTTPS://HTTPSTATUSDOGS.COM

  51. SPLIT LOAD BETWEEN WEBBOTS
    (Figure: grid of the ids 1 to 50)
    def sublist(def ids, def webbotIndex, def webbotsInParallel) {
        int total = ids.size()
        def sublistsSize = (total / webbotsInParallel) as int
        ids.collate(sublistsSize)[webbotIndex]
    }
    def ids = 1..50
    def webbotsInParallel = 6
    sublist(ids, 3, webbotsInParallel)
    [1, 2, 3, 4, 5, 6, 7, 8]
    [9, 10, 11, 12, 13, 14, 15, 16]
    [17, 18, 19, 20, 21, 22, 23, 24]
    [25, 26, 27, 28, 29, 30, 31, 32]
    [33, 34, 35, 36, 37, 38, 39, 40]
    [41, 42, 43, 44, 45, 46, 47, 48]
    [49, 50]
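
The output on the slide lists seven chunks for six webbots, because the chunk size is rounded down (50 / 6 gives 8) and the remainder `[49, 50]` spills into an extra chunk that no index from 0 to 5 reaches. One possible variant, offered as a sketch rather than the author's code, rounds the chunk size up instead. The name `sublistFor` is hypothetical.

```groovy
// Variant of the slide's sublist(): rounding the chunk size up keeps the number
// of chunks at or below webbotsInParallel, so every id is assigned to some webbot.
List sublistFor(List ids, int webbotIndex, int webbotsInParallel) {
    if (ids.isEmpty()) return []
    int chunkSize = (int) Math.ceil(ids.size() / (double) webbotsInParallel)
    List<List> chunks = ids.collate(chunkSize)
    // With few ids there may be fewer chunks than webbots; the spare webbots get nothing.
    webbotIndex < chunks.size() ? chunks[webbotIndex] : []
}

def ids = (1..50).toList()
def webbots = 6
def parts = (0..<webbots).collect { sublistFor(ids, it, webbots) }
parts.each { println it }
```

With 50 ids and 6 webbots the chunk size becomes 9, so the first five webbots take nine ids each and the last takes the remaining five.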
  52. STEALTH MEANS SIMULATING HUMAN PATTERNS
    ▸ BE KIND TO YOUR RESOURCES
    ▸ RUN YOUR WEBBOTS DURING BUSY HOURS
    ▸ DON’T RUN YOUR WEBBOTS AT THE SAME TIME EACH DAY
    ▸ DON’T RUN YOUR WEBBOT ON HOLIDAYS AND WEEKENDS
    ▸ USE RANDOM, INTRA-FETCH DELAYS
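
Random intra-fetch delays can be sketched with a small helper that picks a pause uniformly between two bounds. The name `randomDelayMillis`, its parameters, and the placeholder page names are assumptions for illustration; the short bounds in the usage loop only keep the example quick, and a real webbot would more plausibly wait several seconds between fetches.

```groovy
// Hypothetical helper: returns a pause in milliseconds, uniformly distributed
// in [minMillis, maxMillis]. Passing a seeded Random makes the sequence repeatable.
long randomDelayMillis(long minMillis, long maxMillis, Random random = new Random()) {
    if (minMillis < 0 || maxMillis < minMillis) {
        throw new IllegalArgumentException("Expected 0 <= min <= max, got $minMillis and $maxMillis")
    }
    minMillis + (long) (random.nextDouble() * (maxMillis - minMillis + 1))
}

def pages = ['page-1', 'page-2', 'page-3'] // placeholder names, not real URLs
pages.each { page ->
    // The fetch itself would go here.
    long pause = randomDelayMillis(50, 150)
    println "Fetched ${page}, pausing ${pause} ms"
    sleep(pause)
}
```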
  53. SIMULATE HUMAN CLICK RHYTHM

  54. GROOVYCALAMARI.COM
    A “weekly” curated email newsletter full of interesting, relevant links about the Groovy Ecosystem
  55. ?