Scraping with Geb

GR8Conf EU 2016

Sergio del Amo

June 02, 2016

Transcript

  1. June 1st - 3rd 2016
    Copenhagen, Denmark
    SCRAPING WITH
    GEB

  2. IOS APP
    APP STORE

  3. GR8CONFAGENDA
    GOOGLEPLAY

  4. GROOVYCALAMARI.COM
    A “weekly” curated email
    newsletter full of
    interesting, relevant links
    about the Groovy
    Ecosystem

  5. GEB
    HTTP://GEBISH.ORG

  6. http://www.webbotsspidersscreenscrapers.com

  7. WHAT CAN YOU DO?
    ▸ PRICE-MONITORING WEBBOTS
    ▸ IMAGE-CAPTURING WEBBOTS
    ▸ LINK VERIFICATION WEBBOTS
    ▸ WEBBOTS THAT SEND EMAIL
    ▸ WEBBOTS THAT CONVERT A WEBSITE INTO AN API
    ▸ SNIPERS

  8. EXAMPLE 1
    CREW SCRAPING
    HTTPS://GITHUB.COM/SDELAMO/GEBWEBBOT_GR8CONF

  9. GR8CONF API
    ▸ Only for agenda, no
    sponsors or crew

  10. java -jar gebwebbot_gr8conf-all.jar
    crew|sponsors
    destinationFolder
    outputFilename
    sqlite|plist|csv
    phantomjs.binary.path

  11. GRADLE SHADOW
    HTTPS://GITHUB.COM/JOHNRENGELMAN/SHADOW

  12. DESIRED OUTPUT

  13. TEST YOUR FETCHER

  14. SPOCK - TEST YOUR FETCHER

  15. GEB PAGES
    Define the interesting
    parts of your pages in
    a concise, maintainable
    and extensible manner

  16. DIV.CREW (label repeated nine times)

  17. GEB MODULES
    Modules are re-usable
    definitions of content that
    can be used across
    multiple pages

  18. EXAMPLE 2
    SPONSORS SCRAPING
    HTTPS://GITHUB.COM/SDELAMO/GEBWEBBOT_GR8CONF

  19. GR8CONF API
    ▸ Only for agenda, no
    sponsors or crew

  20. DIV.SPONSORS
    DIV.SPONSORS
    H4
    DIV.SPONSOR
    DIV.SPONSOR
    DIV.SPONSOR
    H4

  21. DIV.SPONSORS
    DIV.SPONSORS
    H4
    DIV.SPONSOR
    DIV.SPONSOR
    DIV.SPONSOR
    H4

  22. DIV.SPONSORS
    DIV.SPONSORS
    H4
    DIV.SPONSOR
    DIV.SPONSOR
    DIV.SPONSOR
    H4

  23. DIV.SPONSORS
    DIV.SPONSORS
    H4
    DIV.SPONSOR
    DIV.SPONSOR
    DIV.SPONSOR
    H4

  24. GEB EXAMPLE GRADLE
    HTTPS://GITHUB.COM/GEB/GEB-EXAMPLE-GRADLE
    The following commands will launch the tests with the
    individual browsers:
    ./gradlew chromeTest
    ./gradlew firefoxTest
    ./gradlew phantomJsTest
    To run with all, you can run:
    ./gradlew test
    MARCIN ERDMANN

  25. EXAMPLE 3
    PAGINATION
    HTTPS://GITHUB.COM/SDELAMO/WEBBOT_GEB_MEETUP_MEMBERS

  26. DYNAMIC URL
    http://www.meetup.com/es-ES/madrid-gug/members/49149882/
    BASE URL: http://www.meetup.com/
    MEETUP GROUP SLUG: madrid-gug
    MEMBER ID: 28938802
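
The parts listed on this slide can be reassembled into a member URL with plain string handling. The following is a minimal sketch, not code from the talk's repository: the helper name `memberUrl`, its parameters and the default `es-ES` locale segment are assumptions made for illustration, and the member id used is the one that appears in the slide's full URL.

```groovy
// Hypothetical helper: name, parameters and default locale are illustrative only.
String memberUrl(String baseUrl, String groupSlug, long memberId, String locale = 'es-ES') {
    // Drop a trailing slash so that segments are joined with exactly one '/'
    String base = baseUrl.endsWith('/') ? baseUrl[0..-2] : baseUrl
    "${base}/${locale}/${groupSlug}/members/${memberId}/"
}

// Rebuilds the URL shown on the slide from its three parts
println memberUrl('http://www.meetup.com/', 'madrid-gug', 49149882L)
```

Keeping the slug and the member id as parameters lets one webbot visit any member of any group by changing only the values it passes in.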

  27. PAGINATION
    .PAGINATION
    .NAV-NEXT
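
Following a "next" link until it disappears is the core of a pagination crawl. The sketch below shows only that loop in plain Groovy, under stated assumptions: the `site` map is made-up stand-in data, and `harvestAll` and its `fetchPage` closure are hypothetical names. In a Geb webbot the closure would wrap the browser calls that read the page content and the `.pagination .nav-next` link.

```groovy
// Made-up stand-in for a paginated site: each page lists its members and
// the target of its "next" link (null on the last page).
def site = [
    '/members?page=1': [members: ['ana', 'bob'], next: '/members?page=2'],
    '/members?page=2': [members: ['eva', 'dan'], next: '/members?page=3'],
    '/members?page=3': [members: ['kim'],        next: null]
]

// Hypothetical helper: fetchPage is any closure that returns a map
// with a `members` list and a `next` link.
List harvestAll(String startUrl, Closure fetchPage) {
    List harvested = []
    Set visited = [] as Set
    String url = startUrl
    // Stop when there is no next link, or when a link leads back to a visited page
    while (url && visited.add(url)) {
        def page = fetchPage(url)
        harvested.addAll(page.members)
        url = page.next
    }
    harvested
}

def members = harvestAll('/members?page=1') { url -> site[url] }
println members   // prints [ana, bob, eva, dan, kim]
```

The `visited` set is a small safeguard: if a site's last page links back to an earlier one, the loop ends instead of crawling forever.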

  28. PAGINATION MODULE

  29. HARVEST AND VISIT

  30. HARVEST LINKS

  31. SPLIT LOAD BETWEEN WEBBOTS
    HTTPS://HTTPSTATUSDOGS.COM

  32. SPLIT LOAD BETWEEN WEBBOTS
    [Diagram: ids 1 to 50 shown in ten groups of five]
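    def ids = 1..50
    def webbotIndex = 3          // zero-based index of this webbot
    def webbotsInParallel = 6
    int total = ids.size()
    // Round up so the remainder does not end up in an extra, unassigned sublist
    int sublistsSize = (total + webbotsInParallel - 1).intdiv(webbotsInParallel)
    def s = ids.collate(sublistsSize)[webbotIndex]   // ids handled by this webbot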
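
When ids are split between webbots, it is worth checking that every id ends up with exactly one of them. The following sketch wraps the split in a function and runs that check. The function name `idsFor` is hypothetical and not taken from the talk's code, and it assumes the chunk size should be rounded up so that a remainder is not left over.

```groovy
// Hypothetical helper: returns the ids assigned to one webbot (zero-based index).
List idsFor(List ids, int webbotIndex, int webbotsInParallel) {
    if (!ids) return []
    // Round the chunk size up so that no id is left without a webbot
    int chunkSize = (ids.size() + webbotsInParallel - 1).intdiv(webbotsInParallel)
    List chunks = ids.collate(chunkSize)
    // A webbot whose index is past the last chunk simply has nothing to do
    webbotIndex < chunks.size() ? chunks[webbotIndex] : []
}

def ids = (1..50).toList()
def assigned = (0..<6).collect { idsFor(ids, it, 6) }

// Every id is assigned exactly once, in the original order
assert assigned.flatten() == ids
println assigned[3]   // the share of the webbot with index 3
```

With 50 ids and 6 webbots the chunk size is 9, so the first five webbots get nine ids each and the last one gets the remaining five.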

  33. USER AGENT SPOOFING

  34. USER AGENT SPOOFING
    HTTPS://GITHUB.COM/SDELAMO/GEBWEBBOT_USERAGENT

  35. USER AGENT SPOOFING

  36. EXAMPLE 4
    HIDDEN CONTENT AND
    ON MOUSE OVER EVENTS

  37. FAILS: HIDDEN CONTENT

  38. CALL A JS METHOD

  39. MOVE TO ELEMENT

  40. INCLUDE LIBRARY

  41. UI INTERACTION

  42. STEALTH MEANS SIMULATING HUMAN
    PATTERNS
    ▸ BE KIND TO YOUR RESOURCES
    ▸ RUN YOUR WEBBOTS DURING BUSY HOURS
    ▸ DON’T RUN YOUR WEBBOTS AT THE SAME TIME EACH
    DAY
    ▸ DON’T RUN YOUR WEBBOT ON HOLIDAYS AND
    WEEKENDS
    ▸ USE RANDOM, INTRA-FETCH DELAYS
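
The last point, random intra-fetch delays, can be sketched in a few lines of plain Groovy. This is an illustration under assumptions rather than code from the talk: the helper name `randomDelayMillis`, the page names and the delay bounds are all hypothetical, and the bounds are kept very short here so the example finishes quickly.

```groovy
// Hypothetical helper: picks a pause between minMillis and maxMillis, both inclusive.
long randomDelayMillis(long minMillis, long maxMillis, Random random = new Random()) {
    // nextInt's bound is exclusive, so add 1 to make maxMillis reachable
    minMillis + random.nextInt((maxMillis - minMillis + 1) as int)
}

['page-1', 'page-2', 'page-3'].each { page ->
    println "fetching ${page}"               // stand-in for the real fetch
    long pause = randomDelayMillis(20, 80)   // a real webbot would use bounds of several seconds
    sleep(pause)                             // wait a different amount of time after each fetch
}
```

Because each pause is drawn independently, the requests do not arrive at the fixed interval that makes a webbot easy to spot in server logs.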

  43. SIMULATE HUMAN CLICK
    RHYTHM
