Improving Data Gathering And Research

Slide 1

Slide 1 text

IMPROVING DATA GATHERING & RESEARCH
Luca Matteis

Slide 2

Slide 2 text

What is Research?

Slide 3

Slide 3 text

"In the broadest sense of the word, the definition of research includes any gathering of data, information and facts for the advancement of knowledge."

Slide 4

Slide 4 text

"Research is a process of steps used to collect and analyze information to increase our understanding of a topic or issue"

Slide 5

Slide 5 text

Data is essential for research

Slide 6

Slide 6 text

Where do we get data from?
Einstein got his data from his own experiments and from other people's experiments.
Information exchange took weeks, if not months.

Slide 7

Slide 7 text

Today we have the internet!
Information exchange takes milliseconds.
It works much better than anything Einstein had.

Slide 8

Slide 8 text

BUT THERE ARE STILL ISSUES

Slide 9

Slide 9 text

DATA IS SCATTERED ALL OVER THE WEB

Slide 10

Slide 10 text

http://science.com/paper....
http://newton.com/research...
http://national.com/goo...
http://biology.com/science...
http://newscientist.com/neutrinodiscovery...
http://astronomynow.com/themoon
http://space.com/november2001
http://science.com/paper....
http://newton.com/research...
http://science.com/paper....
http://space.com/astro...
http://space.com/astro...
http://space.com/astro...
http://science.com/paper....

Slide 11

Slide 11 text

Information that can be extremely valuable lives somewhere online, and we don’t know it because we can’t find it.

Slide 12

Slide 12 text

EVEN WITH GOOGLE, IT’S STILL HARD TO FIND WHAT WE NEED

Slide 13

Slide 13 text

Searching for scientific data is easier when there is a central repository or data bank.

Slide 14

Slide 14 text

Slide 15

Slide 15 text

When our information is centralized by context, we can more easily find what we’re looking for.

Slide 16

Slide 16 text

We already have websites that centralize this information

Slide 17

Slide 17 text

And allow us to find data that Google couldn’t

Slide 18

Slide 18 text

BUT THERE’S ROOM FOR IMPROVEMENT

Slide 19

Slide 19 text

How is this data currently being centralized?

Slide 20

Slide 20 text

Each center sends us their data in the form of Excel or Access files, through FTP or email.

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

THIS IS AN ENTIRELY MANUAL PROCESS

Slide 23

Slide 23 text

Is this sustainable?

Slide 24

Slide 24 text

Is this sustainable?
This process needs to be automated.

Slide 25

Slide 25 text

What are the advantages of automating the data exchange process?
• no human interference
• fewer communication hassles
• fewer human errors
• more accurate data
• more data

Slide 26

Slide 26 text

How do we automate?
Centers no longer have to send us anything. We get the data directly from their websites.

Slide 27

Slide 27 text

There’s no secret. Google, hotel sites, flight search engines and many others do this.
It is called web scraping.

Slide 28

Slide 28 text

How does it work?

Slide 29

Slide 29 text

We automatically navigate to the centers’ websites and fetch the information that we need.

Slide 30

Slide 30 text

We automatically navigate to the centers’ websites and fetch the information that we need.
This is done by small scripts called spiders or web crawlers.

Slide 31

Slide 31 text

What? Spiders?

Slide 32

Slide 32 text

“A Web crawler (or spider) is a computer program that browses the World Wide Web in a methodical, automated manner or in an orderly fashion.”
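
To make the definition concrete, here is a minimal sketch of such a spider in Python, using only the standard library. It is an illustration under stated assumptions, not the prototype's actual code: the names `LinkParser`, `fetch_over_http` and `crawl` are hypothetical, and the crawl is limited to the starting host and to a maximum number of pages.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_over_http(url):
    """Default fetcher (an assumption): download a page and decode it as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_over_http, max_pages=50):
    """Breadth-first crawl that stays on the start URL's host.

    Returns a dict mapping each visited URL to its HTML.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except OSError:
            continue  # skip pages that cannot be fetched
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            link = urljoin(url, href).split("#")[0]
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

The `fetch` parameter is injectable so the traversal can be tried against an in-memory set of pages before it is pointed at a real website.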

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

This process allows us to reach more centers and gather more data

Slide 35

Slide 35 text

The main requirement: each center needs a website that displays its information.
Without a website, we wouldn’t be able to automate this exchange.
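
As a rough sketch of why a website is enough, the snippet below pulls records out of an HTML table using Python's standard `html.parser` module. Everything specific here is an assumption for illustration: the `TableParser` and `extract_records` names, the sample page, and its column names are hypothetical, and real center pages will be laid out differently.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Turns the rows of an HTML table into lists of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            if self._cell is not None and self._row is not None:
                self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr":
            if self._row is not None:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_records(html):
    """Return one dict per data row, keyed by the header row's cells."""
    parser = TableParser()
    parser.feed(html)
    if not parser.rows:
        return []
    header, *data = parser.rows
    return [dict(zip(header, row)) for row in data]


# Hypothetical page content, made up for this example.
SAMPLE_PAGE = """
<table>
  <tr><th>accession</th><th>species</th><th>origin</th></tr>
  <tr><td>ACC-001</td><td>Oryza sativa</td><td>PHL</td></tr>
  <tr><td>ACC-002</td><td>Zea mays</td><td>MEX</td></tr>
</table>
"""

records = extract_records(SAMPLE_PAGE)
```

Each entry in `records` is a plain dictionary, so it can be stored or compared without any Excel or Access file changing hands.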

Slide 36

Slide 36 text

Working prototype http://seeds.iriscouch.com/

Slide 37

Slide 37 text

Working prototype http://seeds.iriscouch.com/ PASSPORT DATA

Slide 38

Slide 38 text

Working prototype http://seeds.iriscouch.com/ PASSPORT DATA CHARACTERIZATION

Slide 39

Slide 39 text

Working prototype http://seeds.iriscouch.com/ PASSPORT DATA CHARACTERIZATION OTHER...

Slide 40

Slide 40 text

RECAP

Slide 41

Slide 41 text

RECAP
Automation of the data exchange process is the only sustainable solution.

Slide 42

Slide 42 text

RECAP
Automation of the data exchange process is the only sustainable solution.
With new technologies, web scraping has become a very reliable system.

Slide 43

Slide 43 text

RECAP
Automation of the data exchange process is the only sustainable solution.
With new technologies, web scraping has become a very reliable system.
The process is modular and will allow us to plug in systems such as GRIN-Global.
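
One way to picture this modularity is a small registry of data sources, sketched below in Python. This is an assumption about how such a design could look, not a description of the prototype or of GRIN-Global's interface: the `register` decorator, the source names, and the sample records are all hypothetical placeholders.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]
Source = Callable[[], Iterable[Record]]

# Registry of pluggable sources, keyed by a short name.
SOURCES: Dict[str, Source] = {}


def register(name: str) -> Callable[[Source], Source]:
    """Decorator that plugs a source function into the registry."""
    def decorator(func: Source) -> Source:
        SOURCES[name] = func
        return func
    return decorator


@register("center_website")
def scrape_center_website() -> List[Record]:
    # Placeholder: a real source would crawl and parse a center's website.
    return [{"accession": "ACC-001", "species": "Oryza sativa"}]


@register("other_system")
def read_other_system() -> List[Record]:
    # Placeholder for another system plugged in later; no real API is used here.
    return [{"accession": "ACC-900", "species": "Zea mays"}]


def gather_all() -> List[Record]:
    """Collect records from every registered source, tagging their origin."""
    combined: List[Record] = []
    for name, source in SOURCES.items():
        for record in source():
            combined.append({**record, "source": name})
    return combined
```

Adding a new system then means writing one more function and registering it; `gather_all` does not need to change.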

Slide 44

Slide 44 text

THANK YOU