Improving Data Gathering And Research

IMPROVING DATA GATHERING & RESEARCH
Luca Matteis

What is Research?

"In the broadest sense of the word, the definition of
research includes any gathering of data, information and facts for the advancement of knowledge."

"Research is a process of steps used to collect and
analyze information to increase our understanding of a topic or issue"

Data is essential for research

Where do we get data from?
Einstein got his data from his own experiments and from other people's experiments.
Information exchange took weeks, if not months.

Today we have the internet!
Information exchange takes milliseconds.
Works much better than anything Einstein had.

BUT THERE ARE STILL ISSUES

DATA IS SCATTERED ALL OVER THE WEB

http://science.com/paper....
http://newton.com/research...
http://national.com/goo...
http://biology.com/science...
http://newscientist.com/neutrinodiscovery...
http://astronomynow.com/themoon
http://space.com/november2001
http://science.com/paper....
http://newton.com/research...
http://science.com/paper....
http://space.com/astro...
http://space.com/astro...
http://space.com/astro...
http://science.com/paper....

Information that can be extremely valuable lives somewhere online, and
we don’t know it because we can’t find it

EVEN WITH GOOGLE, IT’S STILL HARD TO FIND WHAT WE
NEED

Scientific data searching is facilitated if there is a central
repository or data bank


When our information is centralized by context, we can more
easily find what we’re looking for

We already have websites that centralize this information

And allow us to find data that Google couldn’t

BUT THERE’S ROOM FOR IMPROVEMENT

How is this data currently being centralized?

Each center sends us their data in the form of
Excel or Access files, through FTP or email

THIS IS AN ENTIRELY MANUAL PROCESS

Is this sustainable?
This process needs to be automated.

What are the advantages of automating the data exchange process?
• no human interference
• fewer communication hassles
• fewer human errors
• more accurate data
• more data

How do we automate?
Centers no longer have to send us anything. We get it directly from their websites.

There’s no secret. Google, hotel sites, flight search engines and many others do this.
It is called web scraping.

How does it work?

We automatically navigate to the centers' websites and fetch the information that we need.
This is done by little scripts called spiders or web crawlers.
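The fetch-and-extract step can be sketched roughly as follows. This is a minimal illustration in Python using only the standard library, not the code behind the prototype. It assumes a center publishes its records as a plain HTML table; the page content, column names and the `TableRows` and `scrape_table` names are all hypothetical.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return the table as a list of dicts keyed by the header row."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical page content; a real spider would download this
# from a center's website instead of using a string literal.
SAMPLE_PAGE = """
<table>
  <tr><th>accession</th><th>species</th><th>country</th></tr>
  <tr><td>ACC-001</td><td>Oryza sativa</td><td>PHL</td></tr>
  <tr><td>ACC-002</td><td>Zea mays</td><td>MEX</td></tr>
</table>
"""

records = scrape_table(SAMPLE_PAGE)
```

Each entry in `records` is a dictionary such as `{"accession": "ACC-001", "species": "Oryza sativa", "country": "PHL"}`, which can then be stored centrally. Real pages are rarely this tidy, so each center would likely need its own small extraction rule.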

What? Spiders?

“A Web crawler (or spider) is a computer program that
browses the World Wide Web in a methodical, automated manner or in an orderly fashion.”
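As a rough illustration of that definition, here is a small breadth-first crawler in Python, standard library only. It is a sketch under assumptions, not the prototype's spider: the function names, the `max_pages` limit and the `center.example` pages are hypothetical, and the fetch function is passed in as a parameter so the crawl logic can be tried without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_over_http(url):
    """Download a page as text (default fetcher for real sites)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_over_http, max_pages=50):
    """Breadth-first crawl limited to the host of start_url.

    Returns a dict mapping each successfully fetched URL to its HTML.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except OSError:
            continue  # unreachable page: skip it and keep crawling
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# Hypothetical in-memory "website" used instead of real HTTP requests.
FAKE_SITE = {
    "http://center.example/":
        '<a href="/passport">data</a> <a href="http://other.example/">x</a>',
    "http://center.example/passport":
        '<a href="/">home</a> <a href="/missing">gone</a>',
}


def fake_fetch(url):
    if url not in FAKE_SITE:
        raise OSError("not found: " + url)
    return FAKE_SITE[url]


pages = crawl("http://center.example/", fetch=fake_fetch)
```

Here `pages` ends up holding the two pages on `center.example`; the link to another host is ignored and the missing page is skipped. A spider pointed at real sites should also respect each site's robots.txt and pause between requests.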

This process allows us to reach more centers and gather
more data

The main requirement: for each center to have a website that displays their information.
Without a website we wouldn’t be able to automate this exchange.

Working prototype: http://seeds.iriscouch.com/
• PASSPORT DATA
• CHARACTERIZATION
• OTHER...

RECAP
• Automation of the data exchange process is the only sustainable solution
• With new technologies, web scraping has become a very reliable system
• The process is modular and will allow us to plug in systems such as GRIN-Global
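One way such modularity could look is a small registry of data sources that all return records in the same shape. The sketch below is in Python and is purely illustrative: the registry, the source names and the sample records are hypothetical, and the "external-system" function only stands in for a system such as GRIN-Global, whose real interface is not shown here.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]
Source = Callable[[], Iterable[Record]]

# Registry of pluggable data sources, keyed by a short name.
SOURCES: Dict[str, Source] = {}


def register(name: str):
    """Decorator that plugs a source function into the registry."""
    def decorator(func: Source) -> Source:
        SOURCES[name] = func
        return func
    return decorator


@register("scraped-center")
def scraped_center() -> List[Record]:
    # Stand-in for the output of a spider run against a center's website.
    return [{"accession": "ACC-001", "species": "Oryza sativa"}]


@register("external-system")
def external_system() -> List[Record]:
    # Stand-in for an external system; hypothetical data, no real API used.
    return [{"accession": "EXT-042", "species": "Zea mays"}]


def gather() -> List[Record]:
    """Pull records from every registered source and tag their origin."""
    combined = []
    for name, source in SOURCES.items():
        for record in source():
            combined.append({**record, "source": name})
    return combined


all_records = gather()
```

Adding a new provider then means writing one more function and registering it; `gather` and everything downstream stay unchanged.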

THANK YOU
