Spiders, Webbots and Scrapers with Geb - Madrid GUG - Nov 2016

Deck of the talk presented at Madrid Groovy User Group by Sergio del Amo.

Sergio del Amo

November 08, 2016

Transcript

  1. WHAT CAN YOU DO?
     ▸ PRICE-MONITORING WEBBOTS
     ▸ IMAGE-CAPTURING WEBBOTS
     ▸ LINK VERIFICATION WEBBOTS
     ▸ WEBBOTS THAT SEND EMAIL
     ▸ WEBBOTS THAT CONVERT A WEBSITE INTO AN API
     ▸ SNIPERS
  2. GEB PAGES
     Define the interesting parts of your pages in a concise, maintainable and extensible manner.
  3. GEB PAGES ARE BLUEPRINTS FOR YOUR HTML PAGES

     def browser = new Browser()
     browser.go 'http://sergiodelamo.es'
     def hPage = browser.page HomePage
     hPage.subscribeToGroovyCalamari('[email protected]')
     def latestPostsPage = browser.page WordpressLatestPostsPage
     def posts = latestPostsPage.fetchPosts()

     Source: Wikia
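The two page classes used on this slide are not shown in the transcript. A minimal sketch of what they might look like follows. It assumes geb-core and a WebDriver implementation are on the classpath, and every CSS selector and content name in it is hypothetical, chosen only for illustration; the real markup of the site may differ.

```groovy
import geb.Page

// Hypothetical page object for the home page; selectors are assumptions.
class HomePage extends Page {
    static url = 'http://sergiodelamo.es'

    static content = {
        // Assumed markup: a newsletter form with an email input and a submit button
        emailInput      { $('input', name: 'email') }
        subscribeButton { $('input', type: 'submit') }
    }

    void subscribeToGroovyCalamari(String email) {
        emailInput.value(email)
        subscribeButton.click()
    }
}

// Hypothetical page object for a WordPress "latest posts" listing.
class WordpressLatestPostsPage extends Page {
    static content = {
        // Assumed markup: each post title is a link inside an h2.entry-title
        postLinks { $('h2.entry-title a') }
    }

    // Returns one map per post with its title and target URL
    List<Map> fetchPosts() {
        postLinks.collect { [title: it.text(), url: it.attr('href')] }
    }
}
```

Keeping the selectors inside the `content` block means a change in the site's HTML only touches the page class, not the webbot logic that calls `subscribeToGroovyCalamari` or `fetchPosts`.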
  4. GEB EXAMPLE GRADLE
     https://github.com/geb/geb-example-gradle

     The following commands will launch the tests with the individual browsers:

     ./gradlew chromeTest
     ./gradlew firefoxTest
     ./gradlew phantomJsTest

     To run with all, you can run:

     ./gradlew test

     MARCIN ERDMANN
  5. DIFFERENT BROWSERS

     Run in HtmlUnit:
     $ ./gradlew -Dgeb.env=htmlUnit test

     Run in PhantomJS:
     $ ./gradlew -Dgeb.env=phantomJs -Dphantomjs.binary.path=./phantomjs-2.1.1-macosx/bin/phantomjs test

     Run in Firefox:
     $ ./gradlew -Dgeb.env=firefox test

     Run in Chrome:
     $ ./gradlew -Dgeb.env=chrome -Dwebdriver.chrome.driver=./chromedriver test
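The `geb.env` system property selects a named environment from the Geb configuration script. The deck does not show that script; the fragment below is a sketch of how a `GebConfig.groovy` with these four environment names could look, assuming the matching Selenium driver libraries (HtmlUnit, PhantomJS/GhostDriver, Firefox, Chrome) are declared as dependencies of the build.

```groovy
// GebConfig.groovy (sketch; environment names taken from the commands above)
import org.openqa.selenium.chrome.ChromeDriver
import org.openqa.selenium.firefox.FirefoxDriver
import org.openqa.selenium.htmlunit.HtmlUnitDriver
import org.openqa.selenium.phantomjs.PhantomJSDriver

environments {
    // Selected with -Dgeb.env=htmlUnit
    htmlUnit {
        driver = { new HtmlUnitDriver() }
    }
    // Selected with -Dgeb.env=phantomJs; reads -Dphantomjs.binary.path
    phantomJs {
        driver = { new PhantomJSDriver() }
    }
    // Selected with -Dgeb.env=firefox
    firefox {
        driver = { new FirefoxDriver() }
    }
    // Selected with -Dgeb.env=chrome; reads -Dwebdriver.chrome.driver
    chrome {
        driver = { new ChromeDriver() }
    }
}
```

Each driver is wrapped in a closure so that it is only instantiated for the environment that is actually selected.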
  6. SPLIT LOAD BETWEEN WEBBOTS

     [Diagram: the ids 1 to 50]

     def sublist(def ids, def webbotIndex, def webbotsInParallel) {
         int total = ids.size()
         def sublistsSize = (total / webbotsInParallel) as int
         ids.collate(sublistsSize)[webbotIndex]
     }

     def ids = 1..50
     def webbotsInParallel = 6
     sublist(ids, 3, webbotsInParallel)

     With 50 ids and 6 webbots the sublist size is 8, so collate produces the sublists below; the call above returns the one at index 3.

     [1, 2, 3, 4, 5, 6, 7, 8]
     [9, 10, 11, 12, 13, 14, 15, 16]
     [17, 18, 19, 20, 21, 22, 23, 24]
     [25, 26, 27, 28, 29, 30, 31, 32]
     [33, 34, 35, 36, 37, 38, 39, 40]
     [41, 42, 43, 44, 45, 46, 47, 48]
     [49, 50]

     Because the sublist size is rounded down, there are seven sublists for six webbots; the remainder [49, 50] sits at index 6.
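One possible variation, not taken from the deck, is to round the sublist size up instead of down so that the number of sublists never exceeds the number of webbots. The sketch below uses a hypothetical function name, `sublistFor`, and returns an empty list when a webbot index has nothing left to process.

```groovy
// Hypothetical variant of sublist(): ceiling division keeps the number
// of chunks at or below the number of webbots.
List sublistFor(List ids, int webbotIndex, int webbotsInParallel) {
    if (!ids) {
        return []
    }
    // Integer ceiling of ids.size() / webbotsInParallel
    int chunkSize = (ids.size() + webbotsInParallel - 1).intdiv(webbotsInParallel)
    def chunks = ids.collate(chunkSize)
    // An out-of-range index yields null on a Groovy list; fall back to an empty list
    chunks[webbotIndex] ?: []
}

def ids = 1..50
def webbotsInParallel = 6
println sublistFor(ids, 3, webbotsInParallel)   // prints [28, 29, 30, 31, 32, 33, 34, 35, 36]
```

The trade-off is that the last webbot receives a shorter list (here ids 46 to 50) rather than every webbot receiving exactly eight ids with two left over.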
  7. STEALTH MEANS SIMULATING HUMAN PATTERNS
     ▸ BE KIND TO YOUR RESOURCES
     ▸ RUN YOUR WEBBOTS DURING BUSY HOURS
     ▸ DON’T RUN YOUR WEBBOTS AT THE SAME TIME EACH DAY
     ▸ DON’T RUN YOUR WEBBOT ON HOLIDAYS AND WEEKENDS
     ▸ USE RANDOM, INTRA-FETCH DELAYS
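Two of these guidelines, random intra-fetch delays and skipping weekends and holidays, can be sketched in plain Groovy. The function names, the delay bounds and the holiday set below are all assumptions made for illustration; the deck does not prescribe specific values.

```groovy
import java.time.DayOfWeek
import java.time.LocalDate
import java.util.concurrent.ThreadLocalRandom

// Random pause between two fetches, in milliseconds (bounds are assumptions)
long randomDelayMillis(long minMillis = 2000, long maxMillis = 10000) {
    ThreadLocalRandom.current().nextLong(minMillis, maxMillis + 1)
}

// True when the webbot should run on the given date:
// not a Saturday or Sunday, and not in the caller-supplied holiday set
boolean shouldRunOn(LocalDate date, Set<LocalDate> holidays = [] as Set) {
    boolean weekend = date.dayOfWeek in [DayOfWeek.SATURDAY, DayOfWeek.SUNDAY]
    !weekend && !(date in holidays)
}

// Hypothetical fetch loop: 'fetch' is whatever closure retrieves one URL
void fetchAll(List<String> urls, Closure fetch) {
    urls.each { url ->
        fetch(url)
        sleep(randomDelayMillis())   // pause a random amount before the next fetch
    }
}
```

A caller would typically check `shouldRunOn(LocalDate.now(), holidays)` before invoking `fetchAll`, and start the job at a randomised time so it does not run at the same moment each day.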
  8. ?