Slide 1

Slide 1 text

SPIDERS, WEBBOTS AND SCRAPERS WITH GEB

Slide 2

Slide 2 text

SERGIO DEL AMO [email protected] @SDELAMO

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

GEB HTTP://GEBISH.ORG

Slide 5

Slide 5 text

http://www.webbotsspidersscreenscrapers.com

Slide 6

Slide 6 text

WHAT CAN YOU DO?
▸ PRICE-MONITORING WEBBOTS
▸ IMAGE-CAPTURING WEBBOTS
▸ LINK VERIFICATION WEBBOTS
▸ WEBBOTS THAT SEND EMAIL
▸ WEBBOTS THAT CONVERT A WEBSITE INTO AN API
▸ SNIPERS

Slide 7

Slide 7 text

EXAMPLE 1 GREACH API HTTPS://GITHUB.COM/SDELAMO/GREACHAPI

Slide 8

Slide 8 text

GEB PAGES
Define the interesting parts of your pages in a concise, maintainable and extensible manner

Slide 9

Slide 9 text

GEB PAGES ARE BLUEPRINTS FOR YOUR HTML PAGES

def browser = new Browser()
browser.go 'http://sergiodelamo.es'
def hPage = browser.page HomePage
hPage.subscribeToGroovyCalamari('[email protected]')
def latestPostsPage = browser.page WordpressLatestPostsPage
def posts = latestPostsPage.fetchPosts()

Source: Wikia
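The page objects used in this snippet might be defined along the following lines. This is only a sketch: the class and method names come from the slide, but every CSS selector and the shape of the returned post maps are assumptions, and it presumes geb-core is on the classpath.

```groovy
import geb.Page

class HomePage extends Page {
    static url = 'http://sergiodelamo.es'

    static content = {
        // Hypothetical selectors: adjust them to the real newsletter form
        emailField   { $('input[type=email]') }
        submitButton { $('input[type=submit]') }
    }

    void subscribeToGroovyCalamari(String email) {
        emailField.value(email)
        submitButton.click()
    }
}

class WordpressLatestPostsPage extends Page {
    static content = {
        // Hypothetical selector for post title links in a WordPress theme
        postLinks { $('h2.entry-title a') }
    }

    // Returns one map per post with its title and URL (assumed shape)
    List<Map> fetchPosts() {
        postLinks.collect { [title: it.text(), url: it.attr('href')] }
    }
}
```

Keeping the selectors inside the page classes means a markup change is fixed in one place, while the calling script stays untouched.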

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

FOOTER A A A A DIV.CREDITS

Slide 12

Slide 12 text

GEB MODULES
Modules are re-usable definitions of content that can be used across multiple pages

Slide 13

Slide 13 text

.PTP-PRICING-TABLE .PTP-ITEM-CONTAINER A .PTP-CTA .PTP-ITEM-CONTAINER .PTP-ITEM-CONTAINER .PTP-PLAN .PTP-PRICE .PTP-BULLET-ITEM .PTP-BULLET-ITEM .PTP-BULLET-ITEM
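The selectors listed on this slide could be wrapped in a module roughly as follows. This is a sketch under assumptions: the class names are taken from the slide in lowercase, the nesting (one module per `.ptp-item-container`) and the names `PricingItemModule` and `PricingPage` are hypothetical, and `moduleList` presumes Geb 1.0 or later.

```groovy
import geb.Module
import geb.Page

// One pricing column: plan name, price, feature bullets and call-to-action link
class PricingItemModule extends Module {
    static content = {
        plan    { $('.ptp-plan').text() }
        price   { $('.ptp-price').text() }
        bullets { $('.ptp-bullet-item')*.text() }
        cta     { $('a.ptp-cta') }
    }
}

class PricingPage extends Page {
    static content = {
        // Each item container becomes one PricingItemModule instance
        items { $('.ptp-pricing-table .ptp-item-container').moduleList(PricingItemModule) }
    }
}
```

With such a page, `items.collect { [plan: it.plan, price: it.price] }` would give one map per pricing column.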

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

DEMO

Slide 16

Slide 16 text

MODEL

Slide 17

Slide 17 text

GRADLE SHADOW & APPLICATION
https://github.com/johnrengelman/shadow
https://docs.gradle.org/current/userguide/application_plugin.html

Slide 18

Slide 18 text

java -jar output-all.jar

Slide 19

Slide 19 text

DESIRED OUTPUT

Slide 20

Slide 20 text

EXAMPLE 2 PAGINATION HTTPS://GITHUB.COM/SDELAMO/WEBBOT_GEB_MEETUP_MEMBERS

Slide 21

Slide 21 text

DYNAMIC URL
http://www.meetup.com/es-ES/madrid-gug/members/49149882/
BASE URL: http://www.meetup.com
MEETUP GROUP SLUG: madrid-gug
MEMBER ID: 49149882
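Assembling such a URL from its parts can be sketched in plain Groovy. The function name `memberUrl` and the `locale` parameter are assumptions made for illustration; the values are the ones visible in the URL on the slide.

```groovy
// Builds a member page URL from the base URL, the group slug and the member id.
// The locale segment ('es-ES' in the slide's URL) is treated as an optional parameter.
String memberUrl(String baseUrl, String groupSlug, long memberId, String locale = 'es-ES') {
    "${baseUrl}/${locale}/${groupSlug}/members/${memberId}/"
}

def url = memberUrl('http://www.meetup.com', 'madrid-gug', 49149882)
assert url == 'http://www.meetup.com/es-ES/madrid-gug/members/49149882/'
```

In a Geb page the same logic could probably live in an overridden `convertToPath` method, so that `to MemberPage, slug, id` builds the path.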

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

PAGINATION .PAGINATION .NAV-NEXT

Slide 24

Slide 24 text

PAGINATION

Slide 25

Slide 25 text

No content

Slide 26

Slide 26 text

PAGINATION MODULE
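A pagination module built on the `.pagination` and `.nav-next` selectors shown earlier might look like this. It is a sketch, not the code from the talk: the names `PaginationModule`, `MembersPage`, `hasNextPage` and `goToNextPage` are hypothetical, and it assumes the `.nav-next` element is absent on the last page.

```groovy
import geb.Module
import geb.Page

class PaginationModule extends Module {
    static content = {
        // required: false lets this resolve to an empty navigator on the last page
        nextLink(required: false) { $('.nav-next') }
    }

    boolean hasNextPage() {
        !nextLink.empty
    }

    void goToNextPage() {
        nextLink.click()
    }
}

class MembersPage extends Page {
    static content = {
        // The module is rooted at the .pagination element
        pagination { $('.pagination').module(PaginationModule) }
    }
}
```

A webbot can then loop with `while (page.pagination.hasNextPage()) { ...; page.pagination.goToNextPage() }`, scraping each page before moving on.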

Slide 27

Slide 27 text

HARVEST AND VISIT

Slide 28

Slide 28 text

HARVEST LINKS
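The harvest step can be sketched as a page method that collects every target URL before any navigation happens. The class name `MemberListPage` and the selector `a.member-link` are hypothetical.

```groovy
import geb.Page

class MemberListPage extends Page {
    static content = {
        // Hypothetical selector for the links to individual member pages
        memberLinks(required: false) { $('a.member-link') }
    }

    // Harvest step: read every href first, dropping empty values and duplicates
    List<String> harvestLinks() {
        memberLinks*.attr('href').findAll { it }.unique()
    }
}
```

Reading the links into a plain list first matters because the navigators become stale once the browser leaves the page; the visit step can then call `browser.go(link)` for each harvested URL.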

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

EXAMPLE 3 HIDDEN CONTENT AND ON MOUSE OVER EVENTS

Slide 31

Slide 31 text

No content

Slide 32

Slide 32 text

No content

Slide 33

Slide 33 text

FAILS: HIDDEN CONTENT

Slide 34

Slide 34 text

CALL A JS METHOD

Slide 35

Slide 35 text

MOVE TO ELEMENT
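Both ways of revealing hidden content, calling a JavaScript function and moving the mouse over an element, can be sketched inside a page class. The page name, the selectors and the `showMenu` function are all hypothetical; `js.exec` and `interact` are the Geb facilities assumed here.

```groovy
import geb.Page

class MenuPage extends Page {
    static content = {
        menu       { $('#menu') }             // hypothetical selector
        hiddenItem { $('#menu .submenu a') }  // hypothetical selector
    }

    // Option 1: call a JavaScript function that the page itself defines
    void openMenuWithJs() {
        js.exec('showMenu();')   // showMenu is a hypothetical page function
    }

    // Option 2: hover over the menu so the page's own mouse-over handler fires
    void openMenuWithHover() {
        interact {
            moveToElement(menu)
        }
    }
}
```

After either call, `hiddenItem.text()` should return the text that WebDriver reports as empty while the element is hidden.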

Slide 36

Slide 36 text

INCLUDE LIBRARY

Slide 37

Slide 37 text

TIPS & TRICKS

Slide 38

Slide 38 text

GEB EXAMPLE GRADLE
https://github.com/geb/geb-example-gradle
The following commands will launch the tests with the individual browsers:
./gradlew chromeTest
./gradlew firefoxTest
./gradlew phantomJsTest
To run with all, you can run:
./gradlew test
MARCIN ERDMANN

Slide 39

Slide 39 text

GEB.CONFIG

Slide 40

Slide 40 text

DIFFERENT BROWSERS
Run in HtmlUnit:
$ ./gradlew -Dgeb.env=htmlUnit test
Run in PhantomJS:
$ ./gradlew -Dgeb.env=phantomJs -Dphantomjs.binary.path=./phantomjs-2.1.1-macosx/bin/phantomjs test
Run in Firefox:
$ ./gradlew -Dgeb.env=firefox test
Run in Chrome:
$ ./gradlew -Dgeb.env=chrome -Dwebdriver.chrome.driver=./chromedriver test
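The `geb.env` values used in these commands are typically resolved by an `environments` block in `GebConfig.groovy`. A minimal sketch, assuming the matching driver libraries are on the classpath and that the environment names are exactly those passed on the command line:

```groovy
// GebConfig.groovy (sketch): one driver factory per geb.env value
import org.openqa.selenium.chrome.ChromeDriver
import org.openqa.selenium.firefox.FirefoxDriver
import org.openqa.selenium.htmlunit.HtmlUnitDriver
import org.openqa.selenium.phantomjs.PhantomJSDriver

environments {
    htmlUnit  { driver = { new HtmlUnitDriver() } }
    phantomJs { driver = { new PhantomJSDriver() } }
    firefox   { driver = { new FirefoxDriver() } }
    chrome    { driver = { new ChromeDriver() } }
}
```

The drivers are wrapped in closures so that only the browser for the selected environment is actually started.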

Slide 41

Slide 41 text

USER AGENT SPOOFING

Slide 42

Slide 42 text

USER AGENT SPOOFING

Slide 43

Slide 43 text

USER AGENT SPOOFING HTTPS://GITHUB.COM/SDELAMO/GEBWEBBOT_USERAGENT
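One way to override the user agent is through the driver options in `GebConfig.groovy`. This sketch assumes Selenium 3 or later (`FirefoxOptions`, `ChromeOptions`); the user agent string is a placeholder, and the linked repository may use a different approach.

```groovy
// GebConfig.groovy (sketch): same user agent override for Firefox and Chrome
import org.openqa.selenium.chrome.ChromeDriver
import org.openqa.selenium.chrome.ChromeOptions
import org.openqa.selenium.firefox.FirefoxDriver
import org.openqa.selenium.firefox.FirefoxOptions

// Placeholder value: replace it with the user agent the webbot should report
def userAgent = 'Mozilla/5.0 (hypothetical webbot user agent)'

environments {
    firefox {
        driver = {
            def options = new FirefoxOptions()
            // Firefox reads the override from this about:config preference
            options.addPreference('general.useragent.override', userAgent)
            new FirefoxDriver(options)
        }
    }
    chrome {
        driver = {
            def options = new ChromeOptions()
            // Chrome takes the override as a command-line switch
            options.addArguments('--user-agent=' + userAgent)
            new ChromeDriver(options)
        }
    }
}
```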

Slide 44

Slide 44 text

COOKIES

Slide 45

Slide 45 text

MAXIMIZE WINDOW

Slide 46

Slide 46 text

OBTAIN CURRENT PAGE HTML

Slide 47

Slide 47 text

UI INTERACTION

Slide 48

Slide 48 text

KEYBOARD

Slide 49

Slide 49 text

SLIDERS

Slide 50

Slide 50 text

SPLIT LOAD BETWEEN WEBBOTS HTTPS://HTTPSTATUSDOGS.COM

Slide 51

Slide 51 text

SPLIT LOAD BETWEEN WEBBOTS
(Diagram: the ids 1 to 50 laid out in blocks of five.)

def sublist(def ids, def webbotIndex, def webbotsInParallel) {
    int total = ids.size()
    def sublistsSize = (total / webbotsInParallel) as int
    ids.collate(sublistsSize)[webbotIndex]
}

def ids = 1..50
def webbotsInParallel = 6
sublist(ids, 3, webbotsInParallel)

Sublists produced by collate:
[1, 2, 3, 4, 5, 6, 7, 8]
[9, 10, 11, 12, 13, 14, 15, 16]
[17, 18, 19, 20, 21, 22, 23, 24]
[25, 26, 27, 28, 29, 30, 31, 32]
[33, 34, 35, 36, 37, 38, 39, 40]
[41, 42, 43, 44, 45, 46, 47, 48]
[49, 50]

Note: with 50 ids and 6 webbots the sublist size rounds down to 8, so a seventh sublist [49, 50] is produced that none of the webbot indexes 0 to 5 picks up.
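A variant that rounds the sublist size up keeps the number of sublists within the number of webbots, so no id is left unassigned. This is a sketch of one possible adjustment, not the version from the talk; returning an empty list for an index without a sublist is an assumption.

```groovy
// Variant of sublist that rounds the chunk size up instead of down
List sublist(List ids, int webbotIndex, int webbotsInParallel) {
    if (!ids) {
        return []
    }
    int total = ids.size()
    // Integer ceiling division: 50 ids over 6 webbots gives chunks of 9
    int sublistSize = (total + webbotsInParallel - 1).intdiv(webbotsInParallel)
    def chunks = ids.collate(sublistSize)
    // A webbot whose index has no chunk simply gets nothing to do
    webbotIndex < chunks.size() ? chunks[webbotIndex] : []
}

def ids = (1..50).toList()
assert sublist(ids, 0, 6) == [1, 2, 3, 4, 5, 6, 7, 8, 9]
assert sublist(ids, 5, 6) == [46, 47, 48, 49, 50]
```

The last webbot gets a shorter list (5 ids instead of 9), which is usually an acceptable imbalance compared with silently skipping ids.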

Slide 52

Slide 52 text

STEALTH MEANS SIMULATING HUMAN PATTERNS
▸ BE KIND TO YOUR RESOURCES
▸ RUN YOUR WEBBOTS DURING BUSY HOURS
▸ DON’T RUN YOUR WEBBOTS AT THE SAME TIME EACH DAY
▸ DON’T RUN YOUR WEBBOT ON HOLIDAYS AND WEEKENDS
▸ USE RANDOM, INTRA-FETCH DELAYS

Slide 53

Slide 53 text

SIMULATE HUMAN CLICK RHYTHM
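Random delays between fetches or clicks can be sketched in plain Groovy. The function names and the default bounds of 2 to 7 seconds are assumptions chosen for illustration.

```groovy
// Returns a random pause in milliseconds between minMillis and maxMillis (inclusive)
long randomDelayMillis(long minMillis, long maxMillis, Random random = new Random()) {
    minMillis + (long) (random.nextDouble() * (maxMillis - minMillis + 1))
}

// Sleeps for a random interval; call it between two fetches or two clicks
void humanPause(long minMillis = 2000, long maxMillis = 7000) {
    Thread.sleep(randomDelayMillis(minMillis, maxMillis))
}
```

Calling `humanPause()` before each `click()` or `go` spreads the requests irregularly instead of at a fixed machine-like interval.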

Slide 54

Slide 54 text

GROOVYCALAMARI.COM
A “weekly” curated email newsletter full of interesting, relevant links about the Groovy Ecosystem

Slide 55

Slide 55 text

?