Scraping: 10 mistakes to avoid @ Breizhcamp 2016
Fabien Vauchelles
March 24, 2016
Science
From website to storage: learn web scraping
#webscraping #tricks
Transcript
Fabien VAUCHELLES zelros.com / [email protected] / @fabienv http://bit.ly/breizhscraping (24/03/2016)
FABIEN VAUCHELLES
Developer for 16 years
CTO of
Expert in data extraction (scraping)
Creator of Scrapoxy.io
What is Scraping?
“Scraping transforms a human-readable webpage into machine-readable data.” Neo
Why do we do Scraping?
EXAMPLES
No API!
API with a request limit
Prices
Emails
Profiles
Train machine learning models
Addresses
Face recognition
“I used Scraping to create my client list!” Walter White
1. FORGET THE LAW
THE LEGAL PATH
Can we track the data?
Does the company tend to sue?
Is the data private?
Is the company in France?
Does the data provide added value?
[Decision flowchart: each question branches on yes / no]
LET'S STUDY THE RUBBER DUCK E-MARKET
2. BUILD YOUR OWN SCRIPT
USE A FRAMEWORK
Limit concurrent requests per site
Limit speed
Change user agent
Follow redirects
Export results to CSV or JSON
etc.
Only 15 minutes to extract structured data!
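The framework features listed above map onto a handful of configuration lines. A minimal sketch, assuming the framework is Scrapy (the deck does not name one on this slide, but it mentions Scrapy's default user agent later); every value and the user-agent string are illustrative, not taken from the talk:

```python
# settings.py for a Scrapy project (assumed framework; all values are illustrative)

# Limit concurrent requests per site
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Limit speed: seconds to wait between two requests to the same site
DOWNLOAD_DELAY = 1.0

# Change the user agent (hypothetical value)
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0"

# Follow redirects
REDIRECT_ENABLED = True

# Export results to JSON and CSV (the FEEDS setting needs Scrapy 2.1 or later)
FEEDS = {
    "items.json": {"format": "json"},
    "items.csv": {"format": "csv"},
}
```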
USE THE ECOSYSTEM
Tools: Frontera, ScrapyRT, PhantomJS, Selenium
Categories: PROXY, EMULATION, HELPER, STORAGE
3. RUSH ON THE FIRST DATA SOURCE
FIND THE EXPORT BUTTON
TAKE TIME TO FIND DATA
How to find a developer in Rennes
#1. GO TO BREIZHCAMP
#2. SCRAPE GITHUB
#3. SCRAPE GITHUB ARCHIVE
#4. USE GOOGLE BIGQUERY
4. KEEP THE DEFAULT USER-AGENT
DEFAULT USER-AGENT
SCRAPY: Scrapy/1.0.3 (+http://scrapy.org)
URLLIB2 (Python): Python-urllib/2.1
IDENTIFY AS A DESKTOP BROWSER
CHROME: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.37 Safari/537.36
[Slide shows HTTP statuses 200 and 503]
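Replacing the default user agent is a one-line change in most HTTP clients. A minimal sketch using Python 3's standard `urllib.request` (the successor of the urllib2 module named on the slide); the `build_request` helper and the example URL are hypothetical, and no request is actually sent:

```python
import urllib.request

# Chrome user agent copied from the slide
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/50.0.2661.37 Safari/537.36"
)


def build_request(url):
    """Return a Request that announces itself as a desktop browser."""
    return urllib.request.Request(url, headers={"User-Agent": BROWSER_USER_AGENT})


request = build_request("http://example.com/")
# urllib normalizes header names, so the key is read back as "User-agent"
print(request.get_header("User-agent"))
```

Passing the request to `urllib.request.urlopen` would then send it with the browser-like header instead of `Python-urllib/x.y`.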
5. SCRAPE WITH YOUR DSL ACCESS
BLACKLISTED
What is Blacklisting?
TYPES OF BLACKLISTING
Change of HTTP status (200 -> 503)
HTTP 200 but the content changes (login page)
CAPTCHA
Longer response times
And many others!
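The blacklisting types above can be checked for on every response. A rough sketch, assuming simple heuristics; the function name, the text markers, and the 5-second threshold are all hypothetical and would need tuning for a real target site:

```python
def blacklist_signals(status, body, elapsed_seconds, slow_threshold=5.0):
    """Return the list of blacklisting hints found in one HTTP response."""
    signals = []
    # HTTP status changed (for example 200 -> 503)
    if status != 200:
        signals.append("status")
    lowered = body.lower()
    # CAPTCHA page served instead of the content (assumed marker)
    if "captcha" in lowered:
        signals.append("captcha")
    # HTTP 200 but the content is a login page (assumed marker)
    if 'type="password"' in lowered:
        signals.append("login_page")
    # The site takes much longer to respond than usual
    if elapsed_seconds > slow_threshold:
        signals.append("slow")
    return signals


print(blacklist_signals(503, "", 0.1))
print(blacklist_signals(200, "<p>rubber duck</p>", 0.2))
```

An empty list means none of these hints were found; it does not prove the scraper is undetected.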
USE A PROXY
SCRAPER -> PROXY -> TARGET
88.77.66.55 44.33.22.11 1.2.3.411
TYPES OF PROXIES
PUBLIC
PRIVATE
HIDE BEHIND SCRAPOXY
SCRAPERS -> SCRAPOXY -> TARGET
http://scrapoxy.io
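Routing a scraper through a proxy usually only requires pointing the HTTP client at it. A minimal sketch with Python's standard `urllib.request`; the address `http://localhost:8888` is an assumption for a locally running proxy (check your proxy's own configuration), and `build_proxy_opener` is a hypothetical helper:

```python
import urllib.request

# Assumption: a proxy (for example a local Scrapoxy instance) listens here
PROXY_URL = "http://localhost:8888"


def build_proxy_opener(proxy_url):
    """Return an opener that sends HTTP and HTTPS requests through the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


opener = build_proxy_opener(PROXY_URL)
# opener.open("http://example.com/") would now reach the target via the proxy,
# so the target sees the proxy's IP address instead of the scraper's.
```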
6. TRIGGER ALERTS ON THE REMOTE SITE
STAY OFF THE RADAR
ESTIMATE IP FLOW
SCRAPER -> PROXY -> TARGET
10 requests / IP / minute: OK
20 requests / IP / minute: OK
30 requests / IP / minute: blocked
ESTIMATE IP FLOW
The flow is 20 requests / IP / minute
I want to refresh 200 items every minute
I need 200 / 20 = 10 proxies!
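The proxy estimate above is a single division, rounded up when the numbers do not divide evenly. A small sketch; the function and parameter names are hypothetical:

```python
import math


def proxies_needed(items_per_minute, safe_requests_per_ip_per_minute):
    """Return how many proxy IPs are needed to stay under the safe flow."""
    if safe_requests_per_ip_per_minute <= 0:
        raise ValueError("the safe flow must be positive")
    # Round up: a partially used proxy is still a whole proxy
    return math.ceil(items_per_minute / safe_requests_per_ip_per_minute)


# Numbers from the slide: 200 items per minute at 20 requests / IP / minute
print(proxies_needed(200, 20))  # 10
```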
7. MIX UP SCRAPERS AND CRAWLERS
SCRAPERS ARE NOT CRAWLERS
FOCUS ON THE ESSENTIALS
What is the URL frontier?
The URL frontier is the list of URLs to fetch.
TYPES OF URL FRONTIER
FIXED
SEQUENTIAL
TREE
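The three frontier types can each be written as a small generator. This sketch reflects one reading of the slide, not the speaker's code: "fixed" as a known list of URLs, "sequential" as numbered pages, and "tree" as links discovered while fetching. All function names, the URL template, and the `get_links` callback are hypothetical:

```python
from collections import deque


def fixed_frontier(urls):
    """Fixed: the URLs are known in advance."""
    yield from urls


def sequential_frontier(template, first, last):
    """Sequential: the URLs follow a counter, such as page numbers."""
    for page in range(first, last + 1):
        yield template.format(page=page)


def tree_frontier(root_url, get_links, max_urls=100):
    """Tree: each fetched page reveals new URLs (breadth-first, no repeats)."""
    seen = {root_url}
    pending = deque([root_url])
    emitted = 0
    while pending and emitted < max_urls:
        url = pending.popleft()
        yield url
        emitted += 1
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                pending.append(link)


# Tiny in-memory site standing in for real pages
SITE = {"/": ["/a", "/b"], "/a": ["/c", "/"], "/b": ["/c"], "/c": []}
print(list(tree_frontier("/", lambda url: SITE.get(url, []))))
```

The `max_urls` cap is what keeps a tree frontier from turning the scraper into a crawler.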
8. STORE ONLY PARSED RESULTS
SCRAPING IS AN ITERATIVE PROCESS
SCRAPE DATA -> EXTRACT AND CLEAN DATA -> USE DATA -> REFACTOR
SCRAPE EVERYTHING... AGAIN?
STORE THE FULL HTML PAGE
SCRAPING IS AN ITERATIVE PROCESS
SCRAPE DATA -> EXTRACT ALL -> CLEAN DATA -> USE DATA -> REFACTOR
9. STORE WEBPAGES ONE BY ONE
STORAGE CAN'T MANAGE MILLIONS OF SMALL FILES!
BLOCK IS THE NEW STORAGE
STORE HTML IN 128 MB ZIPPED FILES
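Grouping pages into large zipped blocks can be done with Python's standard `zipfile` module. A minimal sketch, not the speaker's implementation: the `BlockWriter` class and file naming are hypothetical, and the size limit is measured on uncompressed bytes for simplicity. The demo uses a 16-byte limit so the rollover is visible; the default is 128 MB.

```python
import os
import tempfile
import zipfile


class BlockWriter:
    """Group many small HTML pages into a few large zip files (blocks)."""

    def __init__(self, directory, max_block_bytes=128 * 1024 * 1024):
        self.directory = directory
        self.max_block_bytes = max_block_bytes  # limit on uncompressed bytes
        self.block_index = 0
        self.current_bytes = 0
        self.archive = None

    def _open_next_block(self):
        path = os.path.join(self.directory, "block-%05d.zip" % self.block_index)
        self.archive = zipfile.ZipFile(path, "w", compression=zipfile.ZIP_DEFLATED)
        self.block_index += 1
        self.current_bytes = 0

    def add_page(self, name, html):
        data = html.encode("utf-8")
        block_is_full = (
            self.current_bytes > 0
            and self.current_bytes + len(data) > self.max_block_bytes
        )
        if self.archive is None or block_is_full:
            self.close()
            self._open_next_block()
        self.archive.writestr(name, data)
        self.current_bytes += len(data)

    def close(self):
        if self.archive is not None:
            self.archive.close()
            self.archive = None


# Demo: three 8-byte pages with a 16-byte limit end up in two blocks
with tempfile.TemporaryDirectory() as directory:
    writer = BlockWriter(directory, max_block_bytes=16)
    for i in range(3):
        writer.add_page("page-%d.html" % i, "<p>%d</p>" % i)
    writer.close()
    block_names = sorted(os.listdir(directory))
    with zipfile.ZipFile(os.path.join(directory, block_names[0])) as first_block:
        first_block_pages = first_block.namelist()

print(block_names)
```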
10. PARSING IS SIMPLE!
PARSERS
There are a lot of parsers!
XPATH
CSS
REGEX
TAGS
TAG CLEANER
2 METHODS TO EXTRACT DATA
<div class="parts">
  <div class="part experience">
    <div class="year">2014</div>
    <div class="title">Data Engineer</div>
  </div>
</div>
How to get the job title?
#1. BY POSITION
<div class="parts">
  <div class="part experience">
    <div class="year">2014</div>
    <div class="title">Data Engineer</div>
  </div>
</div>
/div/div/div[2] (with XPath parser)
#1. BY POSITION
<div class="parts">
  <div class="part experience">
    <div class="year">2014</div>
    <div class="location">Paris</div>
    <div class="title">Data Engineer</div>
  </div>
</div>
/div/div/div[2] (with XPath parser): now returns the location (Paris) instead of the job title
#2. BY FEATURE
<div class="parts">
  <div class="part experience">
    <div class="year">2014</div>
    <div class="title">Data Engineer</div>
  </div>
</div>
.experience .title (with CSS parser)
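The difference between the two methods shows up as soon as the page layout changes. A small sketch using Python's standard `xml.etree.ElementTree`, which supports a limited XPath subset; because it has no CSS selectors, the "by feature" lookup is written as an attribute-based path that stands in for `.experience .title`. The function names are hypothetical:

```python
import xml.etree.ElementTree as ET

BEFORE = """<div class="parts">
  <div class="part experience">
    <div class="year">2014</div>
    <div class="title">Data Engineer</div>
  </div>
</div>"""

# Same page after the site adds a location line above the title
AFTER = BEFORE.replace(
    '<div class="title">',
    '<div class="location">Paris</div>\n    <div class="title">',
)


def by_position(html):
    """Second div inside the first block; mirrors /div/div/div[2]."""
    root = ET.fromstring(html)  # root is the outer <div class="parts">
    return root.find("./div/div[2]").text


def by_feature(html):
    """Div marked as title inside the experience block."""
    root = ET.fromstring(html)
    path = ".//div[@class='part experience']/div[@class='title']"
    return root.find(path).text


print(by_position(BEFORE), "|", by_position(AFTER))
print(by_feature(BEFORE), "|", by_feature(AFTER))
```

The positional lookup silently returns the wrong field after the change, while the feature-based lookup keeps returning the job title.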
LET’S RECAP !
STEP BY STEP
FIND A SOURCE
LIMIT THE URL FRONTIER
SCRAPE AND STORE
PARSE BLOCKS
ARCHITECTURE
URL FRONTIER -> QUEUE -> SCRAPERS -> PROXIES -> TARGET
SCRAPERS -> STORAGE -> PARSERS -> DATABASE
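The architecture components can be wired together in a few lines to show how data moves between them. A deliberately simplified, single-threaded sketch under stated assumptions: the flow order is inferred from the component names, `run_pipeline`, `fetch`, and `parse` are hypothetical, proxies are left out, and an in-memory dict stands in for the target site:

```python
import queue


def run_pipeline(frontier, fetch, parse):
    """Frontier -> queue -> scraper -> storage -> parser -> database."""
    url_queue = queue.Queue()
    for url in frontier:
        url_queue.put(url)

    # Scrapers: fetch each queued URL and keep the raw page in storage,
    # so parsing can be redone later without scraping again
    storage = {}
    while not url_queue.empty():
        url = url_queue.get()
        storage[url] = fetch(url)

    # Parsers: read stored pages and write structured rows to the database
    database = [parse(url, html) for url, html in storage.items()]
    return storage, database


# Fake target site so the sketch runs without network access
PAGES = {
    "http://example.com/a": "<h1>A</h1>",
    "http://example.com/b": "<h1>B</h1>",
}
storage, database = run_pipeline(
    frontier=list(PAGES),
    fetch=lambda url: PAGES[url],
    parse=lambda url, html: {"url": url, "title": html[4:-5]},
)
print(database)
```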
Fabien VAUCHELLES zelros.com / [email protected] / @fabienv http://bit.ly/breizhscraping
The best open-source proxy for scraping!