Scraping with Geb


GR8Conf EU 2016


Sergio del Amo

June 02, 2016

Transcript

  1. SCRAPING WITH GEB ▸ June 1st - 3rd 2016, Copenhagen, Denmark

  2. SERGIO DEL AMO ME@SERGIODELAMO.COM @SDELAMO

  4. IOS APP APP STORE

  5. GR8CONFAGENDA GOOGLEPLAY

  6. GROOVYCALAMARI.COM A “weekly” curated email newsletter full of interesting, relevant links about the Groovy Ecosystem
  7. GEB HTTP://GEBISH.ORG

  8. http://www.webbotsspidersscreenscrapers.com

  9. WHAT CAN YOU DO?

    ▸ PRICE-MONITORING WEBBOTS
    ▸ IMAGE-CAPTURING WEBBOTS
    ▸ LINK VERIFICATION WEBBOTS
    ▸ WEBBOTS THAT SEND EMAIL
    ▸ WEBBOTS THAT CONVERT A WEBSITE INTO AN API
    ▸ SNIPERS
  10. EXAMPLE 1 CREW SCRAPING HTTPS://GITHUB.COM/SDELAMO/GEBWEBBOT_GR8CONF

  11. GR8CONF API ▸ Only for agenda, no sponsors or crew

  12. java -jar gebwebbot_gr8conf-all.jar crew|sponsors destinationFolder outputFilename sqlite|plist|csv phantomjs.binary.path
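
The five positional arguments on this slide could be validated along the following lines. This is a minimal sketch, not the code from the gebwebbot_gr8conf repository: the class name `WebbotArgs`, its fields, and the error messages are hypothetical, and only the argument order and the allowed values (`crew|sponsors`, `sqlite|plist|csv`) come from the slide.

```groovy
// Hypothetical holder for the five positional arguments shown on the slide.
class WebbotArgs {
    static final List<String> TARGETS = ['crew', 'sponsors']
    static final List<String> FORMATS = ['sqlite', 'plist', 'csv']

    String target               // what to scrape: crew or sponsors
    String destinationFolder    // where the output file is written
    String outputFilename       // file name, without assumptions about extension
    String format               // sqlite, plist or csv
    String phantomJsBinaryPath  // path to the PhantomJS executable

    // Validates the argument count and the two enumerated values.
    static WebbotArgs parse(List<String> args) {
        if (args.size() != 5) {
            throw new IllegalArgumentException(
                'usage: crew|sponsors destinationFolder outputFilename sqlite|plist|csv phantomjs.binary.path')
        }
        if (!(args[0] in TARGETS)) {
            throw new IllegalArgumentException("unknown target '${args[0]}', expected one of ${TARGETS}")
        }
        if (!(args[3] in FORMATS)) {
            throw new IllegalArgumentException("unknown format '${args[3]}', expected one of ${FORMATS}")
        }
        new WebbotArgs(
            target: args[0],
            destinationFolder: args[1],
            outputFilename: args[2],
            format: args[3],
            phantomJsBinaryPath: args[4])
    }
}
```

Failing early on a bad target or format keeps a long scraping run from starting with arguments it cannot honour.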

  13. GRADLE SHADOW HTTPS://GITHUB.COM/JOHNRENGELMAN/SHADOW

  15. DESIRED OUTPUT

  16. MODEL

  17. MODEL

  18. TEST YOUR FETCHER

  19. SPOCK - TEST YOUR FETCHER

  22. GEB PAGES ▸ Define the interesting parts of your pages in a concise, maintainable and extensible manner
  23. DIV.CREW DIV.CREW DIV.CREW DIV.CREW DIV.CREW DIV.CREW DIV.CREW DIV.CREW DIV.CREW

  25. GEB MODULES ▸ Modules are re-usable definitions of content that can be used across multiple pages
  27. OUTPUT

  30. EXAMPLE 2 SPONSORS SCRAPING HTTPS://GITHUB.COM/SDELAMO/GEBWEBBOT_GR8CONF

  31. GR8CONF API ▸ Only for agenda, no sponsors or crew

  32. DIV.SPONSORS DIV.SPONSORS H4 DIV.SPONSOR DIV.SPONSOR DIV.SPONSOR H4

  33. DIV.SPONSORS DIV.SPONSORS H4 DIV.SPONSOR DIV.SPONSOR DIV.SPONSOR H4

  34. DIV.SPONSORS DIV.SPONSORS H4 DIV.SPONSOR DIV.SPONSOR DIV.SPONSOR H4

  35. DIV.SPONSORS DIV.SPONSORS H4 DIV.SPONSOR DIV.SPONSOR DIV.SPONSOR H4

  38. GEB EXAMPLE GRADLE HTTPS://GITHUB.COM/GEB/GEB-EXAMPLE-GRADLE (MARCIN ERDMANN)

    The following commands will launch the tests with the individual browsers:

        ./gradlew chromeTest
        ./gradlew firefoxTest
        ./gradlew phantomJsTest

    To run with all browsers:

        ./gradlew test
  39. GEB.CONFIG
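
The content of this slide was not captured in the transcript, so the following is not the author's configuration. It is a sketch of what a `GebConfig.groovy` with one environment per browser can look like, loosely following the layout used by geb-example-gradle. The timeout value and the default driver are assumptions, and the Selenium driver classes are assumed to be on the classpath.

```groovy
// GebConfig.groovy: a sketch, not the configuration shown in the talk.
import org.openqa.selenium.chrome.ChromeDriver
import org.openqa.selenium.firefox.FirefoxDriver
import org.openqa.selenium.phantomjs.PhantomJSDriver

// How long waitFor {} keeps polling before giving up, in seconds (assumed value).
waiting {
    timeout = 5
}

// One block per browser; the active block is selected with the geb.env system property.
environments {
    chrome { driver = { new ChromeDriver() } }
    firefox { driver = { new FirefoxDriver() } }
    phantomJs { driver = { new PhantomJSDriver() } }
}

// Driver used when no environment is selected (an assumption for this sketch).
driver = { new PhantomJSDriver() }
```

Wrapping each driver in a closure defers browser start-up until Geb actually needs it.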

  40. EXAMPLE 3 PAGINATION HTTPS://GITHUB.COM/SDELAMO/WEBBOT_GEB_MEETUP_MEMBERS

  42. DYNAMIC URL http://www.meetup.com/es-ES/madrid-gug/members/49149882/

    ▸ BASE URL: http://www.meetup.com/
    ▸ MEETUP GROUP SLUG: madrid-gug
    ▸ MEMBER ID: 28938802
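
The composition of a member URL from its three parts can be sketched in plain Groovy. The function name `memberUrl` and the optional `locale` parameter are hypothetical; the path layout (`<slug>/members/<id>/`, optionally prefixed by a locale such as `es-ES`) is taken from the URL on the slide. In a Geb page, the same logic would typically live in the page's `url` together with an overridden `convertToPath`.

```groovy
// Builds a member profile URL from the base URL, the group slug and the member id.
// The locale segment is optional because the slide's URL shows one (es-ES).
String memberUrl(String baseUrl, String groupSlug, long memberId, String locale = null) {
    // Tolerate a base URL given with or without the trailing slash.
    String base = baseUrl.endsWith('/') ? baseUrl : baseUrl + '/'
    String localePart = locale ? "${locale}/" : ''
    "${base}${localePart}${groupSlug}/members/${memberId}/"
}

println memberUrl('http://www.meetup.com/', 'madrid-gug', 28938802)
```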
  44. PAGINATION .PAGINATION .NAV-NEXT

  45. PAGINATION

  47. PAGINATION MODULE

  48. HARVEST AND VISIT

  49. HARVEST LINKS

  51. SPLIT LOAD BETWEEN WEBBOTS HTTPS://HTTPSTATUSDOGS.COM

  52. SPLIT LOAD BETWEEN WEBBOTS [diagram: ids 1 to 50]

    def ids = 1..50
    def webbotIndex = 3        // zero-based index of this webbot
    def webbotsInParallel = 6
    int total = ids.size()
    // Round up so a remainder does not spill into an extra, unassigned sublist
    def sublistsSize = Math.ceil(total / webbotsInParallel) as int
    def s = ids.collate(sublistsSize)[webbotIndex]
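
As a sketch of how this slicing could be wrapped and checked, the hypothetical helper `sliceFor` below collates the ids and returns the sublist for one webbot. It rounds the sublist size up, an assumption made here so that a total that is not divisible by the number of webbots does not leave ids in an extra sublist that no webbot picks up.

```groovy
// Returns the ids assigned to the webbot with the given zero-based index.
List sliceFor(List ids, int webbotIndex, int webbotsInParallel) {
    if (ids.isEmpty()) {
        return []
    }
    // Rounding up yields at most webbotsInParallel sublists.
    int sublistSize = Math.ceil(ids.size() / webbotsInParallel) as int
    // A webbot whose index is past the last sublist simply gets nothing to do.
    ids.collate(sublistSize)[webbotIndex] ?: []
}

def ids = (1..50).toList()
def slices = (0..<6).collect { sliceFor(ids, it, 6) }

// Every id is handled by exactly one webbot, in order.
assert slices.flatten() == ids
println sliceFor(ids, 3, 6)
```

Each webbot only needs its own index and the total count, so the processes can be started independently without any coordination.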
  53. USER AGENT SPOOFING

  54. USER AGENT SPOOFING HTTPS://GITHUB.COM/SDELAMO/GEBWEBBOT_USERAGENT

  55. USER AGENT SPOOFING

  56. EXAMPLE 4 HIDDEN CONTENT AND ON MOUSE OVER EVENTS

  59. FAILS: HIDDEN CONTENT

  60. CALL A JS METHOD

  61. MOVE TO ELEMENT

  62. INCLUDE LIBRARY

  63. UI INTERACTION

  64. KEYBOARD

  65. SLIDERS

  66. STEALTH MEANS SIMULATING HUMAN PATTERNS

    ▸ BE KIND TO YOUR RESOURCES
    ▸ RUN YOUR WEBBOTS DURING BUSY HOURS
    ▸ DON’T RUN YOUR WEBBOTS AT THE SAME TIME EACH DAY
    ▸ DON’T RUN YOUR WEBBOT ON HOLIDAYS AND WEEKENDS
    ▸ USE RANDOM, INTRA-FETCH DELAYS
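
The last point, random intra-fetch delays, can be sketched in a few lines of Groovy. The function names `randomDelayMillis` and `politePause` and the bounds used in the example are hypothetical. The `Random` instance is a parameter so that the delay can be made reproducible in tests.

```groovy
// Picks a delay in milliseconds, uniformly between minMillis and maxMillis inclusive.
long randomDelayMillis(long minMillis, long maxMillis, Random random = new Random()) {
    if (minMillis < 0 || maxMillis < minMillis) {
        throw new IllegalArgumentException("invalid delay bounds: ${minMillis}..${maxMillis}")
    }
    long span = maxMillis - minMillis + 1
    // nextDouble() is in [0, 1), so the result never exceeds maxMillis.
    minMillis + (long) (random.nextDouble() * span)
}

// Sleeps for a random interval; meant to be called between two page fetches.
void politePause(long minMillis, long maxMillis) {
    Thread.sleep(randomDelayMillis(minMillis, maxMillis))
}

// Example bounds (an assumption): wait between 2 and 5 seconds.
println randomDelayMillis(2000, 5000)
```

Calling `politePause` between fetches avoids the fixed request cadence that makes a webbot easy to spot in server logs.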
  67. SIMULATE HUMAN CLICK RHYTHM

  68. ?