Scraping with Geb

GR8Conf Eu 2016

Sergio del Amo

June 02, 2016

Transcript

  1. June 1st - 3rd 2016
    Copenhagen, Denmark
    SCRAPING WITH
    GEB

  2. SERGIO DEL AMO
    [email protected]
    @SDELAMO

  4. IOS APP
    APP STORE

  5. GR8CONFAGENDA
    GOOGLEPLAY

  6. GROOVYCALAMARI.COM
    A “weekly” curated email
    newsletter full of
    interesting, relevant links
    about the Groovy
    Ecosystem

  7. GEB
    HTTP://GEBISH.ORG

  8. http://www.webbotsspidersscreenscrapers.com

  9. WHAT CAN YOU DO?
    ▸ PRICE-MONITORING WEBBOTS
    ▸ IMAGE-CAPTURING WEBBOTS
    ▸ LINK VERIFICATION WEBBOTS
    ▸ WEBBOTS THAT SEND EMAIL
    ▸ WEBBOTS THAT CONVERT A WEBSITE INTO AN API
    ▸ SNIPERS

  10. EXAMPLE 1
    CREW SCRAPING
    HTTPS://GITHUB.COM/SDELAMO/GEBWEBBOT_GR8CONF

  11. GR8CONF API
    ▸ Only for agenda, no
    sponsors or crew

  12. java -jar gebwebbot_gr8conf-all.jar
    crew|sponsors
    destinationFolder
    outputFilename
    sqlite|plist|csv
    phantomjs.binary.path

  13. GRADLE SHADOW
    HTTPS://GITHUB.COM/JOHNRENGELMAN/SHADOW

  15. DESIRED OUTPUT

  16. MODEL

  17. MODEL

  18. TEST YOUR FETCHER

  19. SPOCK - TEST YOUR FETCHER

  22. Define the interesting
    parts of your pages in
    a concise, maintainable
    and extensible manner
    GEB PAGES

  23. DIV.CREW (label repeated nine times on the slide)
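
A minimal sketch of a Geb page built around the div.crew selector labelled on this slide, assuming Geb is on the classpath. Only the selector comes from the deck; the class name `CrewPage`, the relative `url`, the `at` check and the `h3` name selector are hypothetical and would have to match the real markup.

```groovy
import geb.Page

// Hypothetical page object for the crew listing
class CrewPage extends Page {

    // Assumed path, resolved against the browser's configured baseUrl
    static url = 'crew'

    // Assumed "at" check: the page counts as loaded once a crew block exists
    static at = { $('div.crew').size() > 0 }

    static content = {
        // Navigator matching every div.crew element on the page
        crewMembers { $('div.crew') }
        // Assumes each crew block holds the person's name in an h3
        crewNames { crewMembers.collect { it.find('h3').text() } }
    }
}
```

With a configured browser, something like `Browser.drive { to CrewPage; println crewNames }` would load the page and print the harvested names.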

  25. Modules are re-usable
    definitions of content that
    can be used across
    multiple pages
    GEB MODULES
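
As an illustration of that definition, here is a hedged sketch of a module wrapping one div.crew element, reused from a page through `moduleList`. It assumes Geb 0.12 or later on the classpath; the class names and the inner selectors (`h3`, `img`) are hypothetical and not taken from the talk's repository.

```groovy
import geb.Module
import geb.Page

// Hypothetical module describing a single div.crew element
class CrewMemberModule extends Module {
    static content = {
        // Selectors are assumptions about the markup inside div.crew
        fullName { $('h3').text() }
        photoUrl { $('img').attr('src') }
    }
}

// Hypothetical page that reuses the module for every crew block
class CrewListPage extends Page {
    static content = {
        // moduleList creates one CrewMemberModule per matched element
        crew { $('div.crew').moduleList(CrewMemberModule) }
    }
}
```

Any other page that lists people in the same markup could declare the same `moduleList(CrewMemberModule)` content instead of repeating the selectors.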

  27. OUTPUT

  30. EXAMPLE 2
    SPONSORS SCRAPING
    HTTPS://GITHUB.COM/SDELAMO/GEBWEBBOT_GR8CONF

  31. GR8CONF API
    ▸ Only for agenda, no
    sponsors or crew

  32. DIV.SPONSORS
    DIV.SPONSORS
    H4
    DIV.SPONSOR
    DIV.SPONSOR
    DIV.SPONSOR
    H4

  38. GEB EXAMPLE GRADLE
    HTTPS://GITHUB.COM/GEB/GEB-EXAMPLE-GRADLE
    The following commands will launch the tests with the
    individual browsers:
    ./gradlew chromeTest
    ./gradlew firefoxTest
    ./gradlew phantomJsTest
    To run with all, you can run:
    ./gradlew test
    MARCIN ERDMANN

  39. GEB.CONFIG
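
The slide's own configuration is not in the transcript. A sketch of what a `GebConfig.groovy` with one environment per browser could look like, in the spirit of the geb-example-gradle project mentioned above, is shown here. It assumes the Selenium Chrome, Firefox and PhantomJS driver libraries are on the classpath and that the environment is selected through the `geb.env` system property.

```groovy
// GebConfig.groovy (sketch, not the configuration shown in the talk)
import org.openqa.selenium.chrome.ChromeDriver
import org.openqa.selenium.firefox.FirefoxDriver
import org.openqa.selenium.phantomjs.PhantomJSDriver

// Driver used when no geb.env system property is set
driver = { new FirefoxDriver() }

environments {
    // Selected with -Dgeb.env=chrome
    chrome {
        driver = { new ChromeDriver() }
    }
    // Selected with -Dgeb.env=firefox
    firefox {
        driver = { new FirefoxDriver() }
    }
    // Selected with -Dgeb.env=phantomJs; needs the phantomjs.binary.path system property
    phantomJs {
        driver = { new PhantomJSDriver() }
    }
}
```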

  40. EXAMPLE 3
    PAGINATION
    HTTPS://GITHUB.COM/SDELAMO/WEBBOT_GEB_MEETUP_MEMBERS

  42. DYNAMIC URL
    http://www.meetup.com/es-ES/madrid-gug/members/49149882/
    BASE URL: http://www.meetup.com/
    MEETUP GROUP SLUG: madrid-gug
    MEMBER ID: 28938802
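
The three parts named on this slide can be combined with plain string interpolation. The helper below is a hypothetical sketch, not code from the talk's repository, and it leaves out the locale segment (es-ES) of the sample URL, which is an assumption about what the site accepts.

```groovy
// Builds a member profile URL from base URL, group slug and member id
String memberUrl(String baseUrl, String groupSlug, long memberId) {
    // Tolerate a base URL with or without a trailing slash
    String base = baseUrl.endsWith('/') ? baseUrl : baseUrl + '/'
    "${base}${groupSlug}/members/${memberId}/"
}

assert memberUrl('http://www.meetup.com/', 'madrid-gug', 28938802) ==
        'http://www.meetup.com/madrid-gug/members/28938802/'
```

In Geb itself, the arguments given to `to` are appended to the page's `url`, so a hypothetical `MemberPage` with `static url = 'madrid-gug/members'` should reach the same kind of address through `to MemberPage, 28938802`.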

  44. PAGINATION
    .PAGINATION
    .NAV-NEXT

  45. PAGINATION

  47. PAGINATION MODULE
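
The module shown on this slide is not in the transcript. A sketch of what a pagination module could look like, using the .pagination and .nav-next selectors from the earlier slide, follows. It assumes Geb on the classpath and that the .nav-next link sits inside the .pagination block; the class and method names are hypothetical.

```groovy
import geb.Module

// Hypothetical module for the .pagination block
class PaginationModule extends Module {
    static content = {
        // required: false so the last page, which has no next link, does not throw
        nextLink(required: false) { $('.nav-next') }
    }

    // True while a "next" link is present inside the pagination block
    boolean hasNextPage() {
        !nextLink.isEmpty()
    }

    // Follows the "next" link
    void goToNextPage() {
        nextLink.click()
    }
}
```

A page would expose it with a content definition such as `pagination { $('.pagination').module(PaginationModule) }`.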

  48. HARVEST AND VISIT

  49. HARVEST LINKS
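
One way to read "harvest and visit" is as two phases: first walk the paginated listing and keep every link as a plain string, then visit the stored links one by one. The sketch below shows the harvest phase in plain Groovy. The two closures are hypothetical hooks standing in for browser calls, and the in-memory `pages` list replaces a real site.

```groovy
// Collects link targets from every result page as plain strings.
// linksOnPage returns the hrefs of the current page; goToNextPage moves on
// and returns false when there is no further page.
List<String> harvestLinks(Closure<List<String>> linksOnPage, Closure<Boolean> goToNextPage) {
    def harvested = new LinkedHashSet<String>()   // keeps order, drops duplicates
    harvested.addAll(linksOnPage())
    while (goToNextPage()) {
        harvested.addAll(linksOnPage())
    }
    harvested.toList()
}

// Stand-in for a paginated site: three pages of links held in memory
def pages = [['/members/1', '/members/2'], ['/members/3'], ['/members/3', '/members/4']]
int current = 0
def nextPage = {
    if (current >= pages.size() - 1) { return false }
    current++
    return true
}

def links = harvestLinks({ pages[current] }, nextPage)
assert links == ['/members/1', '/members/2', '/members/3', '/members/4']
```

Storing strings rather than page elements means the visit phase does not depend on elements from pages the browser has already left.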

  51. SPLIT LOAD BETWEEN WEBBOTS
    HTTPS://HTTPSTATUSDOGS.COM

  52. SPLIT LOAD BETWEEN WEBBOTS
    [Diagram: the ids 1 to 50, shown in ten blocks of five consecutive ids]
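    def ids = 1..50
    def webbotIndex = 3          // zero-based index of this webbot
    def webbotsInParallel = 6
    int total = ids.size()
    // Round up so the ids fit in exactly webbotsInParallel sublists;
    // rounding down would leave ids 49 and 50 in a seventh, unassigned sublist
    int sublistsSize = (total + webbotsInParallel - 1).intdiv(webbotsInParallel)
    // Ids handled by this webbot: 28..36
    def s = ids.collate(sublistsSize)[webbotIndex]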

  53. USER AGENT SPOOFING

  54. USER AGENT SPOOFING
    HTTPS://GITHUB.COM/SDELAMO/GEBWEBBOT_USERAGENT

  55. USER AGENT SPOOFING
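
The code on these slides is not in the transcript, and the linked repository may take a different route. One common way to change the user agent is to pass Chrome's `--user-agent` switch from `GebConfig.groovy`, sketched here under the assumption that the Selenium Chrome driver is on the classpath. The user agent string is only an example value.

```groovy
// GebConfig.groovy (sketch): start Chrome with a custom user agent
import org.openqa.selenium.chrome.ChromeDriver
import org.openqa.selenium.chrome.ChromeOptions

driver = {
    def options = new ChromeOptions()
    // Example string only; substitute the browser you want to imitate
    options.addArguments('--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36')
    new ChromeDriver(options)
}
```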

  56. EXAMPLE 4
    HIDDEN CONTENT AND
    ON MOUSE OVER EVENTS

  59. FAILS: HIDDEN CONTENT

  60. CALL A JS METHOD

  61. MOVE TO ELEMENT
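
WebDriver only returns the text of visible elements, which is why reading hidden content fails. The sketch below shows the two workarounds named on these slides side by side: hovering over a trigger with Geb's `interact` block, and reading the text through JavaScript. It assumes Geb on the classpath; the page class and the `#menu` selectors are hypothetical.

```groovy
import geb.Page

// Hypothetical page with a panel that only appears on mouse over
class HoverPage extends Page {
    static content = {
        trigger { $('#menu') }            // element that reacts to the mouse
        panel { $('#menu .details') }     // hidden until the trigger is hovered
    }

    // Option 1: move the mouse over the trigger, then wait for the panel to show
    String panelTextAfterHover() {
        interact { moveToElement(trigger) }
        waitFor { panel.displayed }
        panel.text()
    }

    // Option 2: read the text with JavaScript, which ignores CSS visibility
    String panelTextViaJs() {
        js.exec(panel.firstElement(), 'return arguments[0].textContent;') as String
    }
}
```

The hover variant exercises the page's own mouse-over handler, while the JavaScript variant skips it and may return text the site never intended to show.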

  62. INCLUDE LIBRARY

  63. UI INTERACTION

  64. KEYBOARD

  65. SLIDERS

  66. STEALTH MEANS SIMULATING HUMAN
    PATTERNS
    ▸ BE KIND TO YOUR RESOURCES
    ▸ RUN YOUR WEBBOTS DURING BUSY HOURS
    ▸ DON’T RUN YOUR WEBBOTS AT THE SAME TIME EACH
    DAY
    ▸ DON’T RUN YOUR WEBBOT ON HOLIDAYS AND
    WEEKENDS
    ▸ USE RANDOM, INTRA-FETCH DELAYS
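
The last bullet, random delays between fetches, can be sketched in plain Groovy. The helper names and the default range of 2 to 8 seconds are assumptions chosen for illustration, not values from the talk.

```groovy
import java.util.concurrent.ThreadLocalRandom

// Picks a pause between minMillis and maxMillis, both inclusive
long randomDelayMillis(long minMillis, long maxMillis) {
    ThreadLocalRandom.current().nextLong(minMillis, maxMillis + 1)
}

// Visits each url, sleeping a random time before every fetch except the first
void visitWithDelays(List<String> urls, Closure visit, long minMillis = 2000, long maxMillis = 8000) {
    urls.eachWithIndex { String url, int index ->
        if (index > 0) {
            Thread.sleep(randomDelayMillis(minMillis, maxMillis))
        }
        visit(url)
    }
}

// Short delays here so the example finishes quickly
def visited = []
visitWithDelays(['/a', '/b', '/c'], { visited << it }, 10, 30)
assert visited == ['/a', '/b', '/c']
```

In a real webbot the `visit` closure would drive the browser, for example by calling `go(url)` inside `Browser.drive`.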

  67. SIMULATE HUMAN CLICK
    RHYTHM

  68. ?
