Arañas, Webbots y Scrapers con Geb (Spiders, Webbots and Scrapers with Geb)

Talk given at CodeMotion 2016

Sergio del Amo

November 18, 2016

Transcript

  1. ARAÑAS, WEBBOTS Y
    SCRAPERS CON GEB
    MADRID · NOV 18-19 · 2016

  2. GEB
    HTTP://GEBISH.ORG

  3. STEP 1
    CREATE MODEL OBJECTS TO
    STORE THE INFORMATION WHICH
    YOU AIM TO SCRAPE
    SCRAPING WITH GEB
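
A model object for Step 1 can be very small. The sketch below assumes the target is a conference agenda; the `Talk` class and its fields are hypothetical names chosen for illustration and are not taken from the talk's repositories.

```groovy
import groovy.transform.EqualsAndHashCode
import groovy.transform.ToString

// Hypothetical model: one entry of a conference agenda.
// The field names are assumptions for illustration only.
@ToString(includeNames = true)
@EqualsAndHashCode
class Talk {
    String title
    String speaker
    String room
}

// Groovy's named-argument constructor fills only the fields that were scraped.
def talk = new Talk(title: 'Arañas, Webbots y Scrapers con Geb', speaker: 'Sergio del Amo')
assert talk.room == null   // a value missing on the page simply stays null
println talk
```

Keeping the model free of any Geb types means the output step can serialise it without knowing how it was scraped.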

  4. STEP 2
    UNDERSTAND HOW HTML IS BUILT
    AND ENCAPSULATE HTML IN GEB
    PAGE AND MODULES
    SCRAPING WITH GEB

  5. Define the interesting
    parts of your pages in
    a concise, maintainable
    and extensible manner
    GEB PAGES

  6. def browser = new Browser()
    browser.go 'http://sergiodelamo.es'
    def hPage = browser.page HomePage
    hPage.subscribeToGroovyCalamari('[email protected]')
    def latestPostsPage = browser.page WordpressLatestPostsPage
    def posts = latestPostsPage.fetchPosts()
    Source: Wikia
    GEB PAGES ARE
    BLUEPRINTS FOR
    YOUR HTML PAGES
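
The snippet on this slide uses page classes without showing their definitions. As a rough sketch of what such a blueprint might look like, here is one possible `WordpressLatestPostsPage`. The CSS selector and the shape of the returned data are assumptions, not the author's code; it needs `geb-core` and a WebDriver implementation on the classpath.

```groovy
import geb.Page

// Hypothetical page object. The selector is a guess at a typical
// WordPress theme and would need adjusting for a real site.
class WordpressLatestPostsPage extends Page {

    static content = {
        // required: false makes a page without posts yield an empty navigator
        // instead of raising a required-content error.
        postLinks(required: false) { $('article h2 a') }
    }

    // Returns one map per post: link text and target URL.
    List<Map<String, String>> fetchPosts() {
        postLinks.collect { link ->
            [title: link.text(), url: link.attr('href')]
        }
    }
}
```

All knowledge about the HTML lives in the `content` block, so a markup change only touches the page class and not the code that calls `fetchPosts()`.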

  7. Modules are re-usable
    definitions of content that
    can be used across
    multiple pages
    GEB MODULES

  8. STEP 3
    CREATE A FETCHER TO ORCHESTRATE
    NAVIGATION IN THE WEBSITE
    SCRAPING WITH GEB
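
One way to read Step 3: the fetcher owns the browser session and decides which pages are visited, while the page objects only describe HTML. The sketch below is a minimal illustration with a single page; `LinksPage` and `LinkTextFetcher` are hypothetical names, and a real fetcher would chain several pages. It assumes `geb-core` and a WebDriver implementation on the classpath.

```groovy
import geb.Browser
import geb.Page

// Hypothetical page object: exposes every link on the page.
class LinksPage extends Page {
    static content = {
        links(required: false) { $('a') }
    }
}

// Hypothetical fetcher: opens the browser, drives the navigation,
// and returns plain data that no longer depends on the browser.
class LinkTextFetcher {
    List<String> fetch(String startUrl) {
        def browser = new Browser()
        try {
            browser.go startUrl
            LinksPage page = browser.page(LinksPage)
            return page.links.collect { it.text() }
        } finally {
            browser.quit()   // always release the WebDriver session
        }
    }
}
```

Returning plain strings rather than navigators keeps the result usable after the browser has been closed.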

  9. STEP 4
    OUTPUT THE INFORMATION
    ‣ JAVA -JAR OUTPUT-ALL.JAR
    ‣ EXPOSE AN API (E.G. AWS LAMBDA + API GATEWAY)
    SCRAPING WITH GEB
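
For the command-line variant of Step 4, printing JSON to standard output is probably the simplest format to consume. This sketch uses `groovy.json.JsonOutput` from the Groovy standard library; the sample data is made up and stands in for whatever the fetcher returns.

```groovy
import groovy.json.JsonOutput

// Made-up data standing in for the scraped result.
def talks = [
    [title: 'Talk A', speaker: 'Speaker A'],
    [title: 'Talk B', speaker: 'Speaker B'],
]

// Compact form, for example as an API response body.
String json = JsonOutput.toJson(talks)

// Indented form, easier to read on a terminal.
println JsonOutput.prettyPrint(json)
```

The same `json` string could be returned from an API handler, so both output options on the slide can share the serialisation code.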

  10. GRADLE SHADOW & APPLICATION
    HTTPS://GITHUB.COM/JOHNRENGELMAN/SHADOW
    HTTPS://DOCS.GRADLE.ORG/CURRENT/USERGUIDE/APPLICATION_PLUGIN.HTML
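
A `build.gradle` fragment combining the two plugins might look roughly like this. The plugin version and the main class name are assumptions (version 8.1.1 of the Shadow plugin targets Gradle 8); repositories and dependencies are omitted, and the author's repositories should be checked for the configuration actually used.

```groovy
// build.gradle (fragment, not a complete build file)
plugins {
    id 'groovy'
    id 'application'
    // Assumed version; pick the one matching your Gradle release.
    id 'com.github.johnrengelman.shadow' version '8.1.1'
}

application {
    // Hypothetical entry point of the webbot.
    mainClass = 'webbot.Main'
}
```

With this in place, `./gradlew shadowJar` builds a self-contained jar whose name ends in `-all.jar` under `build/libs`, which is the kind of artifact the `java -jar` bullet on the previous slide refers to.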

  11. EXAMPLE
    CODEMOTION AGENDA
    HTTPS://GITHUB.COM/SDELAMO/
    GEBWEBBOT_CODEMOTION2016

  12. A
    .KA-TAB-LI
    A
    .KA-TAB-LI

  13. THEAD
    .KA-TABLE-H .KA-TABLE-H
    .KA-TABLE-H .KA-TABLE-H

  14. EXAMPLE
    PAGINATION
    HTTPS://GITHUB.COM/SDELAMO/WEBBOT_GEB_MEETUP_MEMBERS

  15. DYNAMIC URL
    http://www.meetup.com/es-ES/madrid-gug/members/49149882/
    BASE URL: http://www.meetup.com
    MEETUP GROUP SLUG: madrid-gug
    MEMBER ID: 28938802
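
In Geb, URLs with variable parts can be handled by overriding `convertToPath` in a page class: the arguments passed to `browser.to(...)` are turned into a path that is appended to the page's `url`. The `MemberPage` class below is a hypothetical sketch of that idea for the URL on this slide, not the code from the talk.

```groovy
import geb.Page

// Hypothetical page object for a Meetup member profile.
class MemberPage extends Page {

    // Resolved against the browser's baseUrl (http://www.meetup.com/ here).
    static url = 'es-ES'

    // Expects two arguments: the group slug and the member id.
    @Override
    String convertToPath(Object... args) {
        "/${args[0]}/members/${args[1]}/"
    }
}

// Usage sketch (needs a WebDriver on the classpath):
//   def browser = new geb.Browser(baseUrl: 'http://www.meetup.com/')
//   browser.to(MemberPage, 'madrid-gug', 28938802)
```

The path logic can be checked without opening a browser, because `convertToPath` is an ordinary method on the page instance.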

  16. PAGINATION
    .PAGINATION
    .NAV-NEXT

  17. PAGINATION MODULE
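
The slide's code is not part of this transcript. A pagination module built on the two selectors from the previous slide could be sketched as follows; it assumes `.nav-next` sits inside `.pagination`, and the class and method names are hypothetical.

```groovy
import geb.Module
import geb.Page

// Hypothetical module wrapping the pagination controls.
class PaginationModule extends Module {

    static content = {
        // required: false: on the last page the navigator is simply empty.
        nextLink(required: false) { $('.nav-next') }
    }

    boolean hasNextPage() {
        !nextLink.empty
    }

    void goToNextPage() {
        nextLink.click()
    }
}

// Hypothetical page that reuses the module.
class MembersPage extends Page {
    static content = {
        pagination { $('.pagination').module(PaginationModule) }
    }
}
```

A fetcher can then loop with `while (page.pagination.hasNextPage()) { page.pagination.goToNextPage() }`, collecting data on each iteration. The methods are deliberately not called `next()` or `hasNext()`, since `next()` already exists on Geb navigators with a different meaning.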

  18. HARVEST AND VISIT
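
A common reading of "harvest and visit" is to first copy all target URLs into plain strings and only then navigate to them, because element references usually go stale once the browser leaves the page. The function below is a sketch under that assumption; its name, parameters, and the selector passed in are hypothetical.

```groovy
import geb.Browser

// Harvests the href of every element matching linkSelector on startUrl,
// then visits each URL in turn and hands the browser to the callback.
List<String> harvestAndVisit(Browser browser, String startUrl, String linkSelector, Closure visit) {
    browser.go startUrl

    // Harvest: plain strings survive navigation, navigators do not.
    List<String> urls = browser.$(linkSelector)
            .collect { it.attr('href') }
            .findAll { it }      // drop links without an href
            .unique()

    // Visit: one navigation per harvested URL.
    urls.each { String url ->
        browser.go url
        visit(browser)
    }
    urls
}

// Usage sketch (needs a WebDriver on the classpath):
//   def browser = new Browser()
//   harvestAndVisit(browser, 'http://example.com', 'a.member') { b -> println b.title }
//   browser.quit()
```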

  19. HARVEST LINKS

  20. EXAMPLE
    HIDDEN CONTENT AND
    ON MOUSE OVER EVENTS

  21. FAILS: HIDDEN CONTENT

  22. CALL A JS METHOD
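
WebDriver reports only visible text, so reading a hidden element through `text()` tends to return an empty string. One workaround is to ask the page itself via Geb's `js.exec`. The helper below is a sketch of that approach; the function name and the selector in the usage notes are hypothetical, and the slide's actual code may differ.

```groovy
import geb.Browser

// Returns the textContent of the first element matching cssSelector,
// whether or not the element is currently displayed.
String hiddenText(Browser browser, String cssSelector) {
    // With js.exec the script comes last; the leading values are passed
    // to it as arguments[0], arguments[1], ...
    browser.js.exec(cssSelector,
        'var el = document.querySelector(arguments[0]); return el ? el.textContent : null;')
}

// Usage sketch (needs a WebDriver on the classpath):
//   def browser = new Browser()
//   browser.go 'http://example.com'
//   println hiddenText(browser, '#details')
//   browser.quit()
```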

  23. MOVE TO ELEMENT
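
When content only appears on mouse-over, the hover can be simulated with Geb's `interact` block, which wraps WebDriver's actions API. The helper below is a sketch; its name and both selector parameters are hypothetical.

```groovy
import geb.Browser

// Hovers over the element matching triggerSelector, waits until the element
// matching revealedSelector is displayed, and returns its text.
String textAfterHover(Browser browser, String triggerSelector, String revealedSelector) {
    def trigger = browser.$(triggerSelector)

    // Moves the virtual mouse pointer onto the trigger element.
    browser.interact {
        moveToElement(trigger)
    }

    // The revealed content may be rendered asynchronously.
    browser.waitFor { browser.$(revealedSelector).displayed }
    browser.$(revealedSelector).text()
}

// Usage sketch (needs a WebDriver on the classpath):
//   def browser = new Browser()
//   browser.go 'http://example.com'
//   println textAfterHover(browser, '.menu-item', '.tooltip')
//   browser.quit()
```

Hovering generally requires a driver that renders the page in a real browser; headless emulations may not fire the mouse-over handlers.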

  24. INCLUDE LIBRARY

  25. TIPS & TRICKS

  26. GEB EXAMPLE GRADLE
    HTTPS://GITHUB.COM/GEB/GEB-EXAMPLE-GRADLE
    The following commands will launch the tests with the
    individual browsers:
    ./gradlew chromeTest
    ./gradlew firefoxTest
    ./gradlew phantomJsTest
    To run with all, you can run:
    ./gradlew test
    MARCIN ERDMANN

  27. DIFFERENT BROWSERS
    Run in HtmlUnit: $ ./gradlew -Dgeb.env=htmlUnit test
    Run in PhantomJS: $ ./gradlew -Dgeb.env=phantomJs -Dphantomjs.binary.path=./phantomjs-2.1.1-macosx/bin/phantomjs test
    Run in Firefox: $ ./gradlew -Dgeb.env=firefox test
    Run in Chrome: $ ./gradlew -Dgeb.env=chrome -Dwebdriver.chrome.driver=./chromedriver test

  28. USER AGENT SPOOFING

  29. USER AGENT SPOOFING

  30. USER AGENT SPOOFING
    HTTPS://GITHUB.COM/SDELAMO/GEBWEBBOT_USERAGENT
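
The slides' code is not included in this transcript; the linked repository has the author's version. As one possible approach, a `GebConfig.groovy` can create the driver itself and pass a custom user agent to Chrome through a command-line switch. The user-agent string below is a made-up example.

```groovy
// GebConfig.groovy (sketch)
import org.openqa.selenium.chrome.ChromeDriver
import org.openqa.selenium.chrome.ChromeOptions

driver = {
    def options = new ChromeOptions()
    // Made-up user agent; replace with the one the webbot should present.
    options.addArguments('--user-agent=MyWebbot/1.0 (+http://example.com/bot)')
    new ChromeDriver(options)
}
```

Other drivers expose the setting differently, so the equivalent configuration for Firefox or PhantomJS would not use this switch.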

  31. MAXIMIZE WINDOW

  32. OBTAIN CURRENT PAGE
    HTML
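
Both this tip and the window-maximizing tip on the previous slide can be done through the underlying WebDriver instance, which Geb exposes as `browser.driver`. The helper below is a sketch with a hypothetical name, not the slide's code.

```groovy
import geb.Browser

// Maximizes the browser window, then returns the HTML of the current page
// as the driver reports it.
String maximizeAndGrabHtml(Browser browser) {
    // Plain Selenium call; some responsive sites hide content in small windows.
    browser.driver.manage().window().maximize()

    // Groovy property access for WebDriver.getPageSource().
    browser.driver.pageSource
}

// Usage sketch (needs a WebDriver on the classpath):
//   def browser = new Browser()
//   browser.go 'http://example.com'
//   new File('page.html').text = maximizeAndGrabHtml(browser)
//   browser.quit()
```

Saving the page source to a file is a handy way to debug selectors offline when a scrape returns nothing.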

  33. GROOVYCALAMARI.COM
    A “weekly” curated email
    newsletter full of
    interesting, relevant links
    about the Groovy
    Ecosystem

  34. EXAMPLE
    GREACH API
    HTTPS://GITHUB.COM/SDELAMO/GREACHAPI

  35. FOOTER
    A A A A
    DIV.CREDITS

  36. .PTP-PRICING-TABLE
    .PTP-ITEM-CONTAINER
    A
    .PTP-CTA
    .PTP-ITEM-CONTAINER .PTP-ITEM-CONTAINER
    .PTP-PLAN
    .PTP-PRICE
    .PTP-BULLET-ITEM
    .PTP-BULLET-ITEM
    .PTP-BULLET-ITEM

  37. DESIRED OUTPUT
