Improving Data Gathering And Research
Luca Matteis
November 26, 2011
How to improve data gathering using web scraping methodologies.
Transcript
IMPROVING DATA GATHERING & RESEARCH
Luca Matteis
What is Research?
"In the broadest sense of the word, the definition of
research includes any gathering of data, information and facts for the advancement of knowledge."
"Research is a process of steps used to collect and
analyze information to increase our understanding of a topic or issue"
Data is essential for research
Where do we get data from? Einstein got his data from his own experiments
and from other people's experiments. Information exchange took weeks, if not months.
Today we have the internet! Information exchange takes milliseconds.
It works much better than anything Einstein had.
BUT THERE ARE STILL ISSUES
DATA IS SCATTERED ALL OVER THE WEB
http://science.com/paper.... http://newton.com/research... http://national.com/goo... http://biology.com/science... http://newscientist.com/neutrinodiscovery... http://astronomynow.com/themoon
http://space.com/november2001 http://science.com/paper.... http://newton.com/research... http://science.com/paper.... http://space.com/astro... http://space.com/astro... http://space.com/astro... http://science.com/paper....
Information that can be extremely valuable lives somewhere online, and
we don't know it because we can't find it
EVEN WITH GOOGLE, IT’S STILL HARD TO FIND WHAT WE
NEED
Scientific data searching is facilitated if there is a central
repository or data bank
http://science.com/paper.... http://newton.com/research... http://national.com/goo... http://biology.com/science... http://newscientist.com/neutrinodiscovery... http://astronomynow.com/themoon
http://space.com/november2001 http://science.com/paper.... http://newton.com/research... http://science.com/paper.... http://space.com/astro... http://space.com/astro... http://space.com/astro... http://science.com/paper....
When our information is centralized by context, we can more
easily find what we’re looking for
We already have websites that centralize this information
And allow us to find data that Google couldn’t
BUT THERE’S ROOM FOR IMPROVEMENT
How is this data currently being centralized?
Each center sends us their data in the form of
Excel or Access files, through FTP or Email
THIS IS AN ENTIRELY MANUAL PROCESS
Is this sustainable? This process needs to be automated
What are the advantages of automating the data exchange process?
• no human interference
• fewer communication hassles
• fewer human errors
• more accurate data
• more data
How do we automate? Centers no longer have to send
us anything. We get it directly from their website
There’s no secret. Google, hotel sites, flight search engines and
many others do this. It is called web scraping.
How does it work?
We automatically navigate to the centers' websites and fetch the
information that we need. This is done by little scripts called spiders or web crawlers.
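The "fetch the information" step can be sketched as follows. This is a minimal illustration, not the deck's actual spider: it assumes a center publishes its records as an HTML table, and the names `TableScraper`, `scrape_records` and `SAMPLE_PAGE`, along with the sample accession data, are hypothetical. The page is embedded as a string so the example runs offline; a real spider would download it from the center's site first.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_records(html):
    """Use the first table row as field names and the rest as records."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical page content standing in for a center's website.
SAMPLE_PAGE = """
<table>
  <tr><th>accession</th><th>species</th><th>country</th></tr>
  <tr><td>ACC-001</td><td>Oryza sativa</td><td>PH</td></tr>
  <tr><td>ACC-002</td><td>Zea mays</td><td>MX</td></tr>
</table>
"""

records = scrape_records(SAMPLE_PAGE)
for record in records:
    print(record)
```

Each record comes out as a plain dictionary keyed by the table's header row, which makes it straightforward to load into a central repository.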
What? Spiders?
“A Web crawler (or spider) is a computer program that
browses the World Wide Web in a methodical, automated manner or in an orderly fashion.”
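The "methodical, automated" browsing in that definition can be sketched as a breadth-first traversal of links. This is an illustration under stated assumptions: `crawl`, `LinkExtractor` and the in-memory `FAKE_SITE` (with its `center.example` URLs) are hypothetical names, and the `fetch` argument stands in for a real HTTP download so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl that stays on the start URL's host.

    `fetch` is any callable mapping a URL to HTML text, or None if the
    page is missing. Returns the URLs in the order they were fetched.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            absolute = urljoin(url, href)  # resolve relative links
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Hypothetical three-page site held in memory instead of on the web.
FAKE_SITE = {
    "http://center.example/":
        '<a href="/accessions">list</a> <a href="http://other.example/">away</a>',
    "http://center.example/accessions":
        '<a href="/accessions/1">first</a> <a href="/">home</a>',
    "http://center.example/accessions/1": "<p>Passport data</p>",
}

order = crawl("http://center.example/", FAKE_SITE.get)
print(order)
```

The `seen` set keeps the spider from revisiting pages, and the host check keeps it from wandering off the center's site. A production crawler would also add politeness measures such as rate limiting and honoring robots.txt.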
This process allows us to reach more centers and gather
more data
The main requirement: for each center to have a website that displays their
information. Without a website we wouldn’t be able to automate this exchange.
Working prototype: http://seeds.iriscouch.com/ (PASSPORT DATA, CHARACTERIZATION, OTHER...)
RECAP
• Automation of the data exchange process is the only sustainable solution
• With new technologies, web scraping has become a very reliable system
• The process is modular and will allow us to plug in systems such as GRIN-Global
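One way to picture the modular, plug-in idea is a registry of data sources that all yield records in a common shape. This is only a sketch: the `register` decorator, the source names and the sample records are hypothetical, and the GRIN-Global entry is a stub that does not call any real GRIN-Global interface.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]
Source = Callable[[], Iterable[Record]]

# Registry of pluggable sources, keyed by a short name.
SOURCES: Dict[str, Source] = {}


def register(name: str):
    """Decorator that adds a source function to the registry."""
    def decorator(func: Source) -> Source:
        SOURCES[name] = func
        return func
    return decorator


@register("scraped_center")
def scraped_center():
    # Stands in for records produced by a spider (hypothetical data).
    yield {"accession": "ACC-001", "species": "Oryza sativa"}


@register("grin_global_stub")
def grin_global_stub():
    # Placeholder only: a real plug-in would read from the external system.
    yield {"accession": "GG-42", "species": "Zea mays"}


def gather() -> List[Record]:
    """Run every registered source and tag each record with its origin."""
    merged = []
    for name in sorted(SOURCES):
        for record in SOURCES[name]():
            merged.append({**record, "source": name})
    return merged


all_records = gather()
for record in all_records:
    print(record)
```

Adding another system then means writing one more function and registering it; the gathering code itself stays unchanged.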
THANK YOU