Spiders, Webbots and Scrapers with Geb - Madrid GUG - Nov 2016

Deck of the talk presented at Madrid Groovy User Group by Sergio del Amo.

Sergio del Amo

November 08, 2016

Transcript

  1. SPIDERS, WEBBOTS AND
    SCRAPERS WITH GEB

  2. GEB
    HTTP://GEBISH.ORG

  3. http://www.webbotsspidersscreenscrapers.com

  4. WHAT CAN YOU DO?
    PRICE-MONITORING WEBBOTS
    IMAGE-CAPTURING WEBBOTS
    LINK VERIFICATION WEBBOTS
    WEBBOTS THAT SEND EMAIL
    WEBBOTS THAT CONVERT A WEBSITE INTO AN API
    SNIPERS

  5. EXAMPLE 1
    GREACH API
    HTTPS://GITHUB.COM/SDELAMO/GREACHAPI

  6. Define the interesting
    parts of your pages in
    a concise, maintainable
    and extensible manner
    GEB PAGES

  7. def browser = new Browser()
    browser.go 'http://sergiodelamo.es'
    def hPage = browser.page HomePage
    hPage.subscribeToGroovyCalamari('[email protected]')
    def latestPostsPage = browser.page WordpressLatestPostsPage
    def posts = latestPostsPage.fetchPosts()
    Source: Wikia
    GEB PAGES ARE
    BLUEPRINTS FOR
    YOUR HTML PAGES
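
A page object such as the HomePage used above could look roughly like the following sketch. It assumes geb-core is on the classpath. The selectors, the at-check and the field names are hypothetical; the deck does not show the real markup of the site.

```groovy
import geb.Page

// Hypothetical page object for the home page used on the slide.
// Selectors and the at-check are assumptions, not the real markup.
class HomePage extends Page {
    static url = 'http://sergiodelamo.es'

    // Checked by Geb to verify the browser is really on this page
    static at = { title.contains('Sergio') }

    // Named content: the "interesting parts" of the page
    static content = {
        emailField { $('input', name: 'email') }
        subscribeButton { $('input', type: 'submit') }
    }

    // Page-level behaviour built on top of the content definitions
    void subscribeToGroovyCalamari(String email) {
        emailField.value(email)
        subscribeButton.click()
    }
}
```

Keeping selectors inside the page class means a markup change is fixed in one place, while the calling script keeps reading like the slide.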

  8. FOOTER
    A A A A
    DIV.CREDITS

  9. Modules are re-usable
    definitions of content that
    can be used across
    multiple pages
    GEB MODULES

  10. .PTP-PRICING-TABLE
    .PTP-ITEM-CONTAINER
    A
    .PTP-CTA
    .PTP-ITEM-CONTAINER .PTP-ITEM-CONTAINER
    .PTP-PLAN
    .PTP-PRICE
    .PTP-BULLET-ITEM
    .PTP-BULLET-ITEM
    .PTP-BULLET-ITEM
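
The pricing-table structure above maps naturally onto a Geb module. The sketch below assumes geb-core is on the classpath, that the CSS classes are lower-case in the real markup, and that the Geb version in use provides `moduleList` on navigators. The class and content names are hypothetical.

```groovy
import geb.Module
import geb.Page

// One .ptp-item-container, i.e. a single pricing plan (names are illustrative)
class PricingPlanModule extends Module {
    static content = {
        plan { $('.ptp-plan').text() }
        price { $('.ptp-price').text() }
        bullets { $('.ptp-bullet-item')*.text() }   // one String per bullet
        callToAction { $('a.ptp-cta') }
    }
}

// Hypothetical page that exposes every plan in the table as a module
class PricingPage extends Page {
    static content = {
        plans { $('.ptp-pricing-table .ptp-item-container').moduleList(PricingPlanModule) }
    }
}
```

With this in place a script could iterate `plans` and read `plan`, `price` and `bullets` from each item without repeating any selector.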

  11. GRADLE SHADOW & APPLICATION
    HTTPS://GITHUB.COM/JOHNRENGELMAN/SHADOW
    HTTPS://DOCS.GRADLE.ORG/CURRENT/USERGUIDE/APPLICATION_PLUGIN.HTML

  12. java -jar output-all.jar

  13. DESIRED OUTPUT

  14. EXAMPLE 2
    PAGINATION
    HTTPS://GITHUB.COM/SDELAMO/WEBBOT_GEB_MEETUP_MEMBERS

  15. DYNAMIC URL
    http://www.meetup.com/es-ES/madrid-gug/members/49149882/
    BASE URL: http://www.meetup.com
    MEETUP GROUP SLUG: madrid-gug
    MEMBER ID: 28938802
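
The URL decomposition above can be sketched in plain Groovy. The helper name `memberUrl` and the default locale segment are illustrative assumptions; the example rebuilds the URL shown at the top of the slide.

```groovy
// Builds the URL of a member page from its variable parts.
// Helper name and default locale are assumptions made for this sketch.
String memberUrl(String baseUrl, String groupSlug, long memberId, String locale = 'es-ES') {
    "${baseUrl}/${locale}/${groupSlug}/members/${memberId}/"
}

assert memberUrl('http://www.meetup.com', 'madrid-gug', 49149882) ==
        'http://www.meetup.com/es-ES/madrid-gug/members/49149882/'
```

In a Geb page, the same parts would typically be split between the page's `url` and the arguments passed when navigating to it.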

  16. PAGINATION
    .PAGINATION
    .NAV-NEXT

  17. PAGINATION MODULE

  18. HARVEST AND VISIT

  19. HARVEST LINKS
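
The harvest-and-visit idea can be sketched independently of Geb: keep following the "next" link and collect the links found on each page. In this minimal sketch `harvestAll` and `fetchPage` are hypothetical names, and the site is an in-memory stand-in. In a real webbot the closure would drive the browser and read the pagination module.

```groovy
// Follows "next" links page by page and collects every harvested link.
// fetchPage is a hypothetical closure: given a URL it returns a map with the
// links found on that page and the URL of the next page (null on the last one).
List<String> harvestAll(String firstUrl, Closure<Map> fetchPage) {
    List<String> harvested = []
    Set<String> visited = []            // guards against pagination loops
    String url = firstUrl
    while (url && visited.add(url)) {
        Map page = fetchPage(url)
        harvested.addAll(page.links)
        url = page.next
    }
    harvested
}

// In-memory stand-in for a paginated members list
def fakeSite = [
    '/members?page=1': [links: ['/members/1', '/members/2'], next: '/members?page=2'],
    '/members?page=2': [links: ['/members/3'], next: null],
]
def links = harvestAll('/members?page=1') { String u -> fakeSite[u] }
assert links == ['/members/1', '/members/2', '/members/3']
```

Harvesting all links first and visiting them afterwards keeps the pagination state simple, because the bot never has to navigate back to the list.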

  20. EXAMPLE 3
    HIDDEN CONTENT AND
    ON MOUSE OVER EVENTS

  21. FAILS: HIDDEN CONTENT

  22. CALL A JS METHOD

  23. MOVE TO ELEMENT

  24. INCLUDE LIBRARY

  25. TIPS & TRICKS

  26. GEB EXAMPLE GRADLE
    HTTPS://GITHUB.COM/GEB/GEB-EXAMPLE-GRADLE
    The following commands will launch the tests with the
    individual browsers:
    ./gradlew chromeTest
    ./gradlew firefoxTest
    ./gradlew phantomJsTest
    To run with all, you can run:
    ./gradlew test
    MARCIN ERDMANN

  27. DIFFERENT BROWSERS
    Run in HtmlUnit: $ ./gradlew -Dgeb.env=htmlUnit test
    Run in PhantomJS: $ ./gradlew -Dgeb.env=phantomJs -Dphantomjs.binary.path=./phantomjs-2.1.1-macosx/bin/phantomjs test
    Run in Firefox: $ ./gradlew -Dgeb.env=firefox test
    Run in Chrome: $ ./gradlew -Dgeb.env=chrome -Dwebdriver.chrome.driver=./chromedriver test
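
The `geb.env` values above select an environment block in `GebConfig.groovy`. A minimal configuration matching those four names might look like this sketch. It assumes the corresponding Selenium driver artifacts are on the classpath, and the deck's actual configuration may differ.

```groovy
// GebConfig.groovy (sketch): one driver factory per -Dgeb.env value
import org.openqa.selenium.chrome.ChromeDriver
import org.openqa.selenium.firefox.FirefoxDriver
import org.openqa.selenium.htmlunit.HtmlUnitDriver
import org.openqa.selenium.phantomjs.PhantomJSDriver

environments {
    htmlUnit {
        driver = { new HtmlUnitDriver() }
    }
    phantomJs {
        // reads the phantomjs.binary.path system property
        driver = { new PhantomJSDriver() }
    }
    firefox {
        driver = { new FirefoxDriver() }
    }
    chrome {
        // reads the webdriver.chrome.driver system property
        driver = { new ChromeDriver() }
    }
}
```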

  28. USER AGENT SPOOFING

  29. USER AGENT SPOOFING

  30. USER AGENT SPOOFING
    HTTPS://GITHUB.COM/SDELAMO/GEBWEBBOT_USERAGENT

  31. MAXIMIZE WINDOW

  32. OBTAIN CURRENT PAGE
    HTML

  33. UI INTERACTION

  34. SPLIT LOAD BETWEEN WEBBOTS
    HTTPS://HTTPSTATUSDOGS.COM

  35. SPLIT LOAD BETWEEN WEBBOTS
    [Diagram: ids 1 to 50 drawn in ten blocks of five]
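    // Returns the chunk of ids assigned to the webbot at webbotIndex (0-based)
    def sublist(def ids, def webbotIndex, def webbotsInParallel) {
        int total = ids.size()
        // Chunk size is rounded down: 50 ids / 6 webbots gives 8
        def sublistsSize = (total / webbotsInParallel) as int
        // collate cuts ids into consecutive chunks; a remainder lands in an extra, shorter chunk
        ids.collate(sublistsSize)[webbotIndex]
    }

    def ids = 1..50
    def webbotsInParallel = 6
    sublist(ids, 3, webbotsInParallel) // returns [25, 26, 27, 28, 29, 30, 31, 32]

    // Chunks produced by ids.collate(8); the seventh is not picked up by any of the 6 webbots:
    [1, 2, 3, 4, 5, 6, 7, 8]
    [9, 10, 11, 12, 13, 14, 15, 16]
    [17, 18, 19, 20, 21, 22, 23, 24]
    [25, 26, 27, 28, 29, 30, 31, 32]
    [33, 34, 35, 36, 37, 38, 39, 40]
    [41, 42, 43, 44, 45, 46, 47, 48]
    [49, 50]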
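
Because the chunk size above is rounded down, 50 ids and 6 webbots produce a seventh chunk that no webbot processes. One possible variant, shown here as a sketch rather than the deck's method, rounds the chunk size up so there are never more chunks than webbots. The name `sublistCeil` is hypothetical.

```groovy
// Variant of sublist that never yields more chunks than webbots:
// the chunk size is rounded up instead of down.
List sublistCeil(List ids, int webbotIndex, int webbotsInParallel) {
    int chunkSize = Math.max(1, Math.ceil(ids.size() / (double) webbotsInParallel) as int)
    def chunks = ids.collate(chunkSize)
    // A webbot beyond the last chunk simply gets no work
    webbotIndex < chunks.size() ? chunks[webbotIndex] : []
}

def ids = (1..50).toList()
assert sublistCeil(ids, 0, 6) == (1..9).toList()
assert sublistCeil(ids, 5, 6) == [46, 47, 48, 49, 50]
// Every id is handled by exactly one of the 6 webbots
assert (0..<6).collectMany { sublistCeil(ids, it, 6) } == ids
```

The trade-off is that the last webbot gets a shorter list (5 ids instead of 9), which is usually preferable to leaving ids unprocessed.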

  36. STEALTH MEANS SIMULATING HUMAN
    PATTERNS
    ▸ BE KIND TO YOUR RESOURCES
    ▸ RUN YOUR WEBBOTS DURING BUSY HOURS
    ▸ DON’T RUN YOUR WEBBOTS AT THE SAME TIME EACH
    DAY
    ▸ DON’T RUN YOUR WEBBOT ON HOLIDAYS AND
    WEEKENDS
    ▸ USE RANDOM, INTRA-FETCH DELAYS

  37. SIMULATE HUMAN CLICK
    RHYTHM

  38. GROOVYCALAMARI.COM
    A “weekly” curated email
    newsletter full of
    interesting, relevant links
    about the Groovy
    Ecosystem
