Scraping: 10 mistakes to avoid @ Breizhcamp 2016

From website to storage: learn web scraping

#webscraping #tricks

Fabien Vauchelles

March 24, 2016

Transcript

  1. FABIEN VAUCHELLES: developer for 16 years; CTO of …; expert in data extraction (scraping); creator of Scrapoxy.io
  2. EXAMPLES: no API!; API with a request limit; prices; emails; profiles; training machine learning models; addresses; face recognition
  3. THE LEGAL PATH (yes/no decision chart): Can we track the data? Does the company tend to sue? Is the data private? Is the company in France? Does the data provide added value?
  8. USE A FRAMEWORK: limit concurrent requests per site; limit speed; change the user agent; follow redirects; export results to CSV or JSON; etc. Only 15 minutes to extract structured data!
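The slide does not name a framework. As a rough sketch, assuming Scrapy (an illustrative choice, not confirmed by the deck), the listed features map to a few entries in a project's `settings.py`. All values and file names below are placeholders, and the `FEEDS` setting assumes Scrapy 2.1 or later.

```python
# settings.py fragment for a Scrapy project.
# Assumption: Scrapy is used only as an example; the slide names no framework.
# Every value below is an illustrative placeholder.

# Limit concurrent requests per site.
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Limit speed: seconds to wait between requests to the same site.
DOWNLOAD_DELAY = 3.0

# Change the user agent sent with every request.
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/50.0.2661.37 Safari/537.36"
)

# Follow redirects (this is already Scrapy's default).
REDIRECT_ENABLED = True

# Export results to JSON and CSV (hypothetical output file names).
FEEDS = {
    "items.json": {"format": "json"},
    "items.csv": {"format": "csv"},
}
```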
  9. IDENTIFY AS A DESKTOP BROWSER. Chrome: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.37 Safari/537.36. Responses illustrated: HTTP 200 and HTTP 503.
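A minimal sketch of setting that user agent with Python's standard library. The helper name `build_request` and the `example.com` URL are placeholders for illustration; nothing is sent over the network here.

```python
from urllib.request import Request

# User agent string taken from the slide (Chrome 50 on OS X 10.11.3).
CHROME_USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/50.0.2661.37 Safari/537.36"
)


def build_request(url: str) -> Request:
    """Return a request that identifies itself as a desktop Chrome browser."""
    return Request(url, headers={"User-Agent": CHROME_USER_AGENT})


# Placeholder target; the request is only built, not sent.
request = build_request("https://example.com/")

# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```

Passing the request to `urllib.request.urlopen` would send it with this header instead of the default `Python-urllib/x.y` identifier.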
  10. TYPES OF BLACKLISTING: HTTP status changes (200 -> 503); HTTP 200 but the content changes (login page); CAPTCHA; slower responses; and many others!
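The signals listed above can be checked on each response. The sketch below is one possible heuristic, not a method from the deck: the `Response` type, the `expected_marker` parameter, and the 5-second threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Response:
    """Minimal stand-in for an HTTP response (hypothetical type for this sketch)."""
    status: int
    body: str
    elapsed_seconds: float


def blacklist_signal(
    response: Response,
    expected_marker: str,
    slow_threshold_seconds: float = 5.0,
) -> Optional[str]:
    """Return the first blacklisting signal found, or None if the response looks normal.

    expected_marker is a string that a genuine page is assumed to contain,
    for example a CSS class name used on every product page.
    """
    if response.status != 200:
        return "status changed"      # e.g. 200 -> 503
    if "captcha" in response.body.lower():
        return "captcha"
    if expected_marker not in response.body:
        return "content changed"     # e.g. a login page served with HTTP 200
    if response.elapsed_seconds > slow_threshold_seconds:
        return "slow response"
    return None


print(blacklist_signal(Response(200, "<form id='login'>", 0.2), "class=\"title\""))
```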
  11. ESTIMATE IP FLOW (scraper -> proxy -> target): 10 requests / IP / minute ✔; 20 requests / IP / minute ✔
  12. ESTIMATE IP FLOW (scraper -> proxy -> target): 10 requests / IP / minute ✔; 20 requests / IP / minute ✔; 30 requests / IP / minute ✘
  13. ESTIMATE IP FLOW: The flow is 20 requests / IP / minute. I want to refresh 200 items every minute. I need 200 / 20 = 10 proxies!
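The arithmetic on this slide generalizes to a small helper. This sketch assumes one request per item and rounds up when the division is not exact; the function and parameter names are illustrative.

```python
import math


def proxies_needed(items_per_minute: int, requests_per_ip_per_minute: int) -> int:
    """Number of proxies needed, assuming one request per item.

    requests_per_ip_per_minute is the highest rate that was observed to work
    for a single IP (20 on the slide).
    """
    if requests_per_ip_per_minute <= 0:
        raise ValueError("requests_per_ip_per_minute must be positive")
    return math.ceil(items_per_minute / requests_per_ip_per_minute)


print(proxies_needed(200, 20))  # 10, as on the slide
```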
  14. 2 METHODS TO EXTRACT DATA

        <div class="parts">
          <div class="part experience">
            <div class="year">2014</div>
            <div class="title">Data Engineer</div>
          </div>
        </div>

      How to get the job title?
  15. #1. BY POSITION

        <div class="parts">
          <div class="part experience">
            <div class="year">2014</div>
            <div class="title">Data Engineer</div>
          </div>
        </div>

      /div/div/div[2] (with XPath parser)
  16. #1. BY POSITION

        <div class="parts">
          <div class="part experience">
            <div class="year">2014</div>
            <div class="location">Paris</div>
            <div class="title">Data Engineer</div>
          </div>
        </div>

      /div/div/div[2] (with XPath parser) now selects "Paris" instead of the job title.
  17. #2. BY FEATURE

        <div class="parts">
          <div class="part experience">
            <div class="year">2014</div>
            <div class="title">Data Engineer</div>
          </div>
        </div>

      .experience .title (with CSS parser)
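The difference between the two methods can be sketched with Python's standard library. `xml.etree.ElementTree` supports only a subset of XPath and has no CSS selector engine, so `by_feature` below emulates `.experience .title` by matching class names by hand; a real project would more likely use a dedicated HTML parser. The function names are illustrative.

```python
import xml.etree.ElementTree as ET

ORIGINAL = (
    '<div class="parts">'
    '<div class="part experience">'
    '<div class="year">2014</div>'
    '<div class="title">Data Engineer</div>'
    '</div>'
    '</div>'
)

# Same markup after the site inserts a location before the title.
CHANGED = ORIGINAL.replace(
    '<div class="title">',
    '<div class="location">Paris</div><div class="title">',
)


def by_position(markup: str) -> str:
    """Equivalent of /div/div/div[2], written relative to the root <div>."""
    root = ET.fromstring(markup)
    return root.find("./div/div[2]").text


def by_feature(markup: str) -> str:
    """Emulates the CSS selector `.experience .title` by matching class names."""
    root = ET.fromstring(markup)
    for experience in root.iter("div"):
        if "experience" in experience.get("class", "").split():
            for node in experience.iter("div"):
                if "title" in node.get("class", "").split():
                    return node.text
    raise LookupError("no .experience .title element found")


print(by_position(ORIGINAL), "|", by_position(CHANGED))
print(by_feature(ORIGINAL), "|", by_feature(CHANGED))
```

Selecting by position returns a different field once the layout changes, while selecting by class keeps returning the job title.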