Improving Data Gathering And Research
Luca Matteis
November 26, 2011
Programming
How to improve data gathering using web scraping methodologies.
Transcript
IMPROVING DATA GATHERING & RESEARCH
Luca Matteis
What is Research?
"In the broadest sense of the word, the definition of
research includes any gathering of data, information and facts for the advancement of knowledge."
"Research is a process of steps used to collect and
analyze information to increase our understanding of a topic or issue"
Data is essential for research
Where do we get data from?
Einstein got his data from his own experiments and from other people's experiments.
Information exchange took weeks, if not months.
Today we have the internet! Information exchange takes milliseconds.
It works much better than anything Einstein had.
BUT THERE ARE STILL ISSUES
DATA IS SCATTERED ALL OVER THE WEB
http://science.com/paper.... http://newton.com/research... http://national.com/goo... http://biology.com/science... http://newscientist.com/neutrinodiscovery... http://astronomynow.com/themoon
http://space.com/november2001 http://science.com/paper.... http://newton.com/research... http://science.com/paper.... http://space.com/astro... http://space.com/astro... http://space.com/astro... http://science.com/paper....
Information that can be extremely valuable lives somewhere online, and
we don't know it because we can't find it
EVEN WITH GOOGLE, IT’S STILL HARD TO FIND WHAT WE
NEED
Scientific data searching is facilitated if there is a central
repository or data bank
http://science.com/paper.... http://newton.com/research... http://national.com/goo... http://biology.com/science... http://newscientist.com/neutrinodiscovery... http://astronomynow.com/themoon
http://space.com/november2001 http://science.com/paper.... http://newton.com/research... http://science.com/paper.... http://space.com/astro... http://space.com/astro... http://space.com/astro... http://science.com/paper....
When our information is centralized by context, we can more
easily find what we’re looking for
We already have websites that centralize this information
And allow us to find data that Google couldn’t
BUT THERE’S ROOM FOR IMPROVEMENT
How is this data currently being centralized?
Each center sends us their data in the form of
Excel or Access files, through FTP or Email
THIS IS AN ENTIRELY MANUAL PROCESS
Is this sustainable? This process needs to be automated
What are the advantages of automating the data exchange process?
• no human intervention
• fewer communication hassles
• fewer human errors
• more accurate data
• more data
How do we automate? Centers no longer have to send
us anything. We get it directly from their websites.
There’s no secret. Google, hotel sites, flight search engines and
many others do this. It is called web scraping.
How does it work?
We automatically navigate to the centers' websites and fetch the
information that we need. This is done by little scripts called spiders or web crawlers.
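The fetch-and-extract step can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code behind the prototype. It assumes a center publishes its records as an HTML table whose first row holds the column names. The sample markup, the column names and the `fetch` helper are all hypothetical, and `fetch` is never called so the example runs offline.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TableExtractor(HTMLParser):
    """Collects the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left open by malformed markup


def scrape_table(html):
    """Use the first row as column names and return the rest as dicts."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


def fetch(url):
    """Download a page as text (hypothetical helper, unused below)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


# Hypothetical markup standing in for a center's web page.
SAMPLE_PAGE = """
<table>
  <tr><th>accession</th><th>species</th><th>country</th></tr>
  <tr><td>ACC-001</td><td>Oryza sativa</td><td>Philippines</td></tr>
  <tr><td>ACC-002</td><td>Zea mays</td><td>Mexico</td></tr>
</table>
"""

records = scrape_table(SAMPLE_PAGE)
```

In a real spider, `scrape_table(fetch(url))` would replace the sample page, and each center would likely need its own extraction rules, since page layouts differ.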
What? Spiders?
“A Web crawler (or spider) is a computer program that
browses the World Wide Web in a methodical, automated manner or in an orderly fashion.”
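To make "methodical" concrete, here is a small sketch of one common crawling strategy: a breadth-first walk that stays on one site and never visits a page twice. It is only an illustration. The `crawl` function, the `fetch` callable it takes and the `center.example` pages are assumptions made for this example, and the pages are served from an in-memory dictionary instead of the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Records the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=50):
    """Visit pages breadth-first, staying on the starting host.

    `fetch` is any callable that maps a URL to its HTML, or to None
    when the page is unavailable.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}   # URLs already queued, so nothing is fetched twice
    order = []           # pages in the order they were visited
    while queue and len(order) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        order.append(url)
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            target = urljoin(url, href)  # resolve relative links
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return order


# Hypothetical three-page site used in place of real HTTP requests.
FAKE_SITE = {
    "http://center.example/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://center.example/a": '<a href="/b">B</a> <a href="http://other.example/">x</a>',
    "http://center.example/b": '<a href="/">home</a>',
}

visited = crawl("http://center.example/", FAKE_SITE.get)
```

A crawler run against real sites would also need to respect `robots.txt` and pause between requests, which this sketch leaves out.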
This process allows us to reach more centers and gather
more data
The main requirement: each center must have a website that displays its
information. Without a website we wouldn’t be able to automate this exchange.
Working prototype: http://seeds.iriscouch.com/
• PASSPORT DATA
• CHARACTERIZATION
• OTHER...
RECAP
• Automation of the data exchange process is the only sustainable solution
• With new technologies, web scraping has become a very reliable system
• The process is modular and will allow us to plug in systems such as GRIN-Global
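One way to picture "modular" is a registry of interchangeable data sources that all return records in the same shape. The sketch below is only an assumption about what such a design could look like. The `register` decorator, the source names and the sample records are hypothetical, and nothing here reflects the actual interface of GRIN-Global or of the prototype.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]
Source = Callable[[], Iterable[Record]]

# Every data source plugs in under a name and yields plain records.
SOURCES: Dict[str, Source] = {}


def register(name: str):
    """Decorator that adds a source function to the registry."""
    def decorator(func: Source) -> Source:
        SOURCES[name] = func
        return func
    return decorator


@register("scraped_center")
def scraped_center() -> Iterable[Record]:
    # Stand-in for the output of a spider (hypothetical data).
    yield {"accession": "ACC-001", "species": "Oryza sativa"}


@register("external_system")
def external_system() -> Iterable[Record]:
    # Stand-in for an adapter around another system (hypothetical data).
    yield {"accession": "EXT-9", "species": "Zea mays"}


def gather() -> List[Record]:
    """Merge records from all sources, tagging each with its origin."""
    merged = []
    for name, source in SOURCES.items():
        for record in source():
            merged.append({**record, "source": name})
    return merged


all_records = gather()
```

Adding a new system then means writing one more registered function, without touching `gather` or the existing sources.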
THANK YOU