Slide 1

Slide 1 text

SPIDERS, WEBBOTS AND SCRAPERS WITH GEB
MADRID · NOV 18-19 · 2016

Slide 2

Slide 2 text

SERGIO DEL AMO [email protected] @SDELAMO

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

GEB HTTP://GEBISH.ORG

Slide 5

Slide 5 text

SCRAPING WITH GEB
STEP 1: CREATE MODEL OBJECTS TO STORE THE INFORMATION WHICH YOU AIM TO SCRAPE
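A minimal sketch of such a model object, assuming the target is a conference agenda. The class name `Talk` and its fields are hypothetical; the slides do not show the real model.

```groovy
import groovy.transform.Canonical

// Hypothetical model for one scraped agenda entry.
// @Canonical generates equals, hashCode, toString and constructors.
@Canonical
class Talk {
    String title
    String speaker
    String room
}

// Sample values are made up for illustration
def talk = new Talk(title: 'Webbots with Geb', speaker: 'Sergio del Amo', room: 'Track 1')
println talk.title
```

Keeping the model free of any Geb or WebDriver types means the scraped data can outlive the browser session and be serialised later.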

Slide 6

Slide 6 text

STEP 2: UNDERSTAND HOW THE HTML IS BUILT AND ENCAPSULATE IT IN GEB PAGES AND MODULES

Slide 7

Slide 7 text

GEB PAGES
Define the interesting parts of your pages in a concise, maintainable and extensible manner.

Slide 8

Slide 8 text

GEB PAGES ARE BLUEPRINTS FOR YOUR HTML PAGES
    import geb.Browser
    // Open the site and bind page objects to the current browser page
    def browser = new Browser()
    browser.go 'http://sergiodelamo.es'
    def hPage = browser.page HomePage
    hPage.subscribeToGroovyCalamari('[email protected]')
    def latestPostsPage = browser.page WordpressLatestPostsPage
    def posts = latestPostsPage.fetchPosts()
Source: Wikia
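A possible shape for a page object such as `WordpressLatestPostsPage`, sketched under assumptions: Geb is on the classpath, and the CSS selector `article h2 a` is a guess at the blog's markup, not taken from the slides. Only the class name and the `fetchPosts()` method name come from the code above.

```groovy
import geb.Page

// Hypothetical sketch of a page object for a list of blog posts.
// The selector is an assumption about the page's HTML.
class WordpressLatestPostsPage extends Page {

    static content = {
        // Every post headline link on the page
        postLinks { $('article h2 a') }
    }

    // Copies the matched links into plain maps that outlive the browser session
    List<Map> fetchPosts() {
        postLinks.collect { link ->
            [title: link.text(), url: link.getAttribute('href')]
        }
    }
}
```

If the markup changes, only the selector inside `content` needs updating; callers of `fetchPosts()` stay untouched.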

Slide 9

Slide 9 text

GEB MODULES
Modules are re-usable definitions of content that can be used across multiple pages.
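A minimal sketch of a module and a page that uses it, assuming Geb 1.0 or later (for `moduleList` on a navigator). `PostModule`, `LatestPostsPage` and all selectors are hypothetical.

```groovy
import geb.Module
import geb.Page

// One blog post teaser; selectors are relative to the module's base element
class PostModule extends Module {
    static content = {
        headline { $('h2').text() }
        permalink { $('h2 a').getAttribute('href') }
    }
}

// Any page that lists posts can reuse the same module definition
class LatestPostsPage extends Page {
    static content = {
        // One PostModule per <article> element
        posts { $('article').moduleList(PostModule) }
    }
}
```

With this in place, `page.posts.collect { [it.headline, it.permalink] }` would turn the page into plain data.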

Slide 10

Slide 10 text

STEP 3: CREATE A FETCHER TO ORCHESTRATE NAVIGATION IN THE WEBSITE

Slide 11

Slide 11 text

STEP 4: OUTPUT THE INFORMATION
‣ java -jar output-all.jar
‣ EXPOSE AN API (E.G. AWS LAMBDA + API GATEWAY)
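For the command-line option, a minimal sketch of an entry point that prints the scraped records as JSON. The class name `OutputAll` and the sample record are hypothetical; a real run would obtain the records from the fetcher instead.

```groovy
import groovy.json.JsonOutput

class OutputAll {

    // Serialises scraped records (plain maps or simple objects) to pretty-printed JSON
    static String render(List records) {
        JsonOutput.prettyPrint(JsonOutput.toJson(records))
    }

    static void main(String[] args) {
        // Stand-in data for illustration only
        def records = [[title: 'Webbots with Geb', speaker: 'Sergio del Amo']]
        println render(records)
    }
}
```

The same `render` method could back an HTTP endpoint if the scraper is exposed as an API.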

Slide 12

Slide 12 text

GRADLE SHADOW & APPLICATION
https://github.com/johnrengelman/shadow
https://docs.gradle.org/current/userguide/application_plugin.html
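A sketch of how the two plugins might be combined in `build.gradle`. The plugin version and the main class name are assumptions, and project dependencies are left out.

```groovy
// build.gradle (sketch)
plugins {
    id 'groovy'
    id 'application'
    // Version is an assumption; pick the release that matches your Gradle version
    id 'com.github.johnrengelman.shadow' version '1.2.4'
}

// Hypothetical entry point, recorded in the fat jar's manifest
mainClassName = 'webbot.OutputAll'

repositories {
    mavenCentral()
}
```

Running `./gradlew shadowJar` then builds a single jar with all dependencies bundled; by default its file name ends in `-all.jar`.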

Slide 13

Slide 13 text

EXAMPLE: CODEMOTION AGENDA
HTTPS://GITHUB.COM/SDELAMO/GEBWEBBOT_CODEMOTION2016

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

A .KA-TAB-LI A .KA-TAB-LI

Slide 16

Slide 16 text

THEAD .KA-TABLE-H .KA-TABLE-H .KA-TABLE-H .KA-TABLE-H

Slide 17

Slide 17 text

EXAMPLE PAGINATION HTTPS://GITHUB.COM/SDELAMO/WEBBOT_GEB_MEETUP_MEMBERS

Slide 18

Slide 18 text

DYNAMIC URL
http://www.meetup.com/es-ES/madrid-gug/members/49149882/
BASE URL: http://www.meetup.com
MEETUP GROUP SLUG: madrid-gug
MEMBER ID: 28938802
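The URL parts above can be assembled with a small helper. This is a plain Groovy sketch; the function name `memberUrl` is hypothetical, and the `es-ES` locale segment is copied from the sample URL.

```groovy
// Builds a member profile URL from its three variable parts.
// The locale segment defaults to the one seen in the sample URL.
String memberUrl(String baseUrl, String groupSlug, String memberId, String locale = 'es-ES') {
    "${baseUrl}/${locale}/${groupSlug}/members/${memberId}/"
}

println memberUrl('http://www.meetup.com', 'madrid-gug', '49149882')
```

In a Geb page the same idea is usually expressed by overriding `convertToPath(Object[] args)`, so that `to MemberPage, slug, id` builds the path.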

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

PAGINATION .PAGINATION .NAV-NEXT

Slide 21

Slide 21 text

PAGINATION
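The control flow behind pagination can be sketched without a browser. The three closures are hypothetical hooks that a Geb page or module would supply; here they are backed by an in-memory list so the sketch runs on its own.

```groovy
// Generic pagination loop: scrape, then advance while a next page exists
List scrapeAllPages(Closure<List> scrapeCurrentPage, Closure<Boolean> hasNextPage, Closure goToNextPage) {
    def results = []
    results.addAll(scrapeCurrentPage())
    while (hasNextPage()) {
        goToNextPage()
        results.addAll(scrapeCurrentPage())
    }
    results
}

// Simulated three-page member list (made-up names)
def pages = [['ana', 'luis'], ['marta'], ['pedro', 'sara']]
int current = 0
def members = scrapeAllPages(
        { pages[current] },              // scrape the page we are on
        { current < pages.size() - 1 },  // is there a "next" link?
        { current++ })                   // follow it
println members
```

With a real site, the second and third closures would check for and click the "next" link.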

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

PAGINATION MODULE
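A minimal sketch of what a pagination module could look like, assuming Geb 1.0 or later and the `.pagination` / `.nav-next` selectors from the earlier pagination slide. The class and method names are hypothetical.

```groovy
import geb.Module
import geb.Page

class PaginationModule extends Module {
    static content = {
        // required: false yields an empty navigator on the last page instead of failing
        nextLink(required: false) { $('.nav-next') }
    }

    boolean hasNextPage() { !nextLink.empty }

    void goToNextPage() { nextLink.click() }
}

// Any paginated page can embed the module on its pagination element
class MembersPage extends Page {
    static content = {
        pagination { $('.pagination').module(PaginationModule) }
    }
}
```

The method names avoid `next()`, which a Geb navigator already defines for sibling traversal.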

Slide 24

Slide 24 text

HARVEST AND VISIT

Slide 25

Slide 25 text

HARVEST LINKS
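Harvest-and-visit in two phases, sketched with hypothetical helper names and assuming Geb on the classpath. The links are copied into plain strings first, because element references become stale once the browser navigates away.

```groovy
import geb.Browser

// Harvest: collect the href of every element matching the selector
List<String> harvestLinks(Browser browser, String selector) {
    browser.$(selector)
            .collect { it.getAttribute('href') }
            .findAll { it }   // drop missing or empty hrefs
            .unique()
}

// Visit: open each harvested URL and let the callback scrape it
List visitAll(Browser browser, List<String> urls, Closure scrape) {
    urls.collect { url ->
        browser.go(url)
        scrape(browser)
    }
}
```

A call might look like `visitAll(browser, harvestLinks(browser, 'a.member'), { it.title })`, where `a.member` is a made-up selector.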

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

EXAMPLE HIDDEN CONTENT AND ON MOUSE OVER EVENTS

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

FAILS: HIDDEN CONTENT

Slide 31

Slide 31 text

CALL A JS METHOD

Slide 32

Slide 32 text

MOVE TO ELEMENT
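Both techniques, calling JavaScript and moving the mouse to an element, can be sketched as follows. The URL and the `h1` selector are placeholders, and a WebDriver implementation with JavaScript support is assumed.

```groovy
import geb.Browser

Browser.drive {
    go 'http://example.com'   // placeholder URL

    // Run JavaScript in the page; the script is the last argument to exec
    def pageTitle = js.exec('return document.title;')
    println pageTitle

    // Hover over an element so that content shown on mouse-over gets rendered
    interact {
        moveToElement($('h1'))
    }
}.quit()
```

After the hover, content that was hidden can be read with the usual `text()` call.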

Slide 33

Slide 33 text

INCLUDE LIBRARY

Slide 34

Slide 34 text

TIPS & TRICKS

Slide 35

Slide 35 text

GEB EXAMPLE GRADLE
HTTPS://GITHUB.COM/GEB/GEB-EXAMPLE-GRADLE
The following commands will launch the tests with the individual browsers:
    ./gradlew chromeTest
    ./gradlew firefoxTest
    ./gradlew phantomJsTest
To run with all, you can run:
    ./gradlew test
MARCIN ERDMANN

Slide 36

Slide 36 text

GEB.CONFIG
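A sketch of a `GebConfig.groovy` with one environment per browser, selected through the `geb.env` system property. The environment names are chosen to match the commands on the next slide; the default driver is an assumption, and each driver's library must be on the classpath.

```groovy
// GebConfig.groovy (sketch)
import org.openqa.selenium.chrome.ChromeDriver
import org.openqa.selenium.firefox.FirefoxDriver
import org.openqa.selenium.htmlunit.HtmlUnitDriver
import org.openqa.selenium.phantomjs.PhantomJSDriver

// Used when no geb.env is given (assumption)
driver = { new HtmlUnitDriver() }

environments {
    htmlUnit { driver = { new HtmlUnitDriver() } }
    phantomJs { driver = { new PhantomJSDriver() } }
    firefox { driver = { new FirefoxDriver() } }
    chrome { driver = { new ChromeDriver() } }
}
```

Geb picks this file up from the classpath root, so scraper code never has to name a browser.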

Slide 37

Slide 37 text

DIFFERENT BROWSERS
Run in HtmlUnit:
    $ ./gradlew -Dgeb.env=htmlUnit test
Run in PhantomJS:
    $ ./gradlew -Dgeb.env=phantomJs -Dphantomjs.binary.path=./phantomjs-2.1.1-macosx/bin/phantomjs test
Run in Firefox:
    $ ./gradlew -Dgeb.env=firefox test
Run in Chrome:
    $ ./gradlew -Dgeb.env=chrome -Dwebdriver.chrome.driver=./chromedriver test

Slide 38

Slide 38 text

USER AGENT SPOOFING

Slide 39

Slide 39 text

USER AGENT SPOOFING

Slide 40

Slide 40 text

USER AGENT SPOOFING HTTPS://GITHUB.COM/SDELAMO/GEBWEBBOT_USERAGENT
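One way to spoof the user agent, sketched for Chrome only. This is not necessarily the approach used in the linked repository, and the user-agent string is a made-up example.

```groovy
// GebConfig.groovy (sketch)
import org.openqa.selenium.chrome.ChromeDriver
import org.openqa.selenium.chrome.ChromeOptions

driver = {
    def options = new ChromeOptions()
    // Example value; replace with the user agent you want to present
    options.addArguments('--user-agent=Mozilla/5.0 (compatible; ExampleBot/1.0)')
    new ChromeDriver(options)
}
```

Other browsers need their own mechanism, for example a profile preference in Firefox.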

Slide 41

Slide 41 text

COOKIES

Slide 42

Slide 42 text

MAXIMIZE WINDOW

Slide 43

Slide 43 text

OBTAIN CURRENT PAGE HTML
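The last three tips (cookies, maximizing the window, reading the page HTML) all go through the underlying WebDriver, which Geb exposes as `browser.driver`. A minimal sketch, with a placeholder URL and a made-up cookie:

```groovy
import geb.Browser
import org.openqa.selenium.Cookie

def browser = new Browser()
browser.go 'http://example.com'   // placeholder URL

def manage = browser.driver.manage()

// Cookies: add one for the current domain, then list all of them
manage.addCookie(new Cookie('lang', 'es'))
manage.cookies.each { println "${it.name}=${it.value}" }

// Maximize the browser window
manage.window().maximize()

// Current page HTML as a string
String html = browser.driver.pageSource
println html.size()

browser.quit()
```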

Slide 44

Slide 44 text

GROOVYCALAMARI.COM A “weekly” curated email newsletter full of interesting, relevant links about the Groovy Ecosystem

Slide 45

Slide 45 text

?

Slide 46

Slide 46 text

EXAMPLE GREACH API HTTPS://GITHUB.COM/SDELAMO/GREACHAPI

Slide 47

Slide 47 text

FOOTER A A A A DIV.CREDITS

Slide 48

Slide 48 text

.PTP-PRICING-TABLE .PTP-ITEM-CONTAINER A .PTP-CTA .PTP-ITEM-CONTAINER .PTP-ITEM-CONTAINER .PTP-PLAN .PTP-PRICE .PTP-BULLET-ITEM .PTP-BULLET-ITEM .PTP-BULLET-ITEM

Slide 49

Slide 49 text

No content

Slide 50

Slide 50 text

MODEL

Slide 51

Slide 51 text

DESIRED OUTPUT