Slide 1

Slide 1 text

Cloud Computing and NGS data analysis INTERCROSSING course August 2013 - Granada Welcome and Introduction Eduardo Pareja

Slide 2

Slide 2 text

Era7 Bioinformatics activity is based on: • NGS (Next Generation Sequencing) • Research • Focus on Bacterial Genomics • Cloud Computing

Slide 3

Slide 3 text

Era7 Bioinformatics activity is based on: • NGS (Next Generation Sequencing) • Research • Focus on Bacterial Genomics • Cloud Computing

Slide 4

Slide 4 text

Next Generation Sequencing. DNA sequences. GenBank: Walter Goad of the Theoretical Biology and Biophysics Group at Los Alamos National Laboratory and others established the Los Alamos Sequence Database in 1979, which culminated in 1982 with the creation of the public GenBank.[4] Funding was provided by the National Institutes of Health, the National Science Foundation, the Department of Energy, and the Department of Defense. LANL collaborated on GenBank with the firm Bolt, Beranek, and Newman, and by the end of 1983 more than 2,000 sequences were stored in it. In the mid 1980s, the Intelligenetics bioinformatics company at Stanford University managed the GenBank project in collaboration with LANL.[5]

Slide 5

Slide 5 text

1988

Slide 6

Slide 6 text

1988 19,044 loci 23,018 sequences 22,019,698 bases

Slide 7

Slide 7 text

1988 360 Kb

Slide 8

Slide 8 text

Some numbers related to DNA sequencing

Slide 9

Slide 9 text

Release  Date      Base Pairs  Entries
3        Dec 1982      680338      606
14       Nov 1983     2274029     2427
20       May 1984     3002088     3665
24       Sep 1984     3323270     4135
25       Oct 1984     3368765     4175
26       Nov 1984     3689752     4393
32       May 1985     4211931     4954
36       Sep 1985     5204420     5700
40       Feb 1986     5925429     6642
42       May 1986     6765476     7416
44       Aug 1986     8442357     8823
46       Nov 1986     9615371     9978
48       Feb 1987    10961380    10913
50       May 1987    13048473    12534
52       Aug 1987    14855145    14020
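
As a rough illustration of how fast the database grew, the sketch below estimates a doubling time from the first and last releases listed above. It assumes constant exponential growth between those two releases, which is a simplification; the function names are illustrative only.

```python
import math

# First and last rows of the release table above: (year, month, base pairs).
first = (1982, 12, 680_338)
last = (1987, 8, 14_855_145)

def months_between(a, b):
    """Whole months from date a to date b, each given as (year, month, ...)."""
    return (b[0] - a[0]) * 12 + (b[1] - a[1])

def doubling_time_months(start, end):
    """Doubling time under an assumed constant exponential growth rate."""
    elapsed = months_between(start, end)
    doublings = math.log2(end[2] / start[2])
    return elapsed / doublings

print(f"{doubling_time_months(first, last):.1f} months per doubling")
```

Under this assumption the number of stored base pairs doubled roughly once a year over the period shown.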

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

We were interested in DNA sequences and • NGS was introduced in 2005 • Era7 was founded in Sept 2004

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

http://omicsmaps.com/

Slide 16

Slide 16 text

http://omicsmaps.com/

Slide 17

Slide 17 text

http://omicsmaps.com/

Slide 18

Slide 18 text

http://omicsmaps.com/

Slide 19

Slide 19 text

http://omicsmaps.com/

Slide 20

Slide 20 text

What is the situation of NGS today?

Slide 21

Slide 21 text

• illumina • PacBio • Ion Torrent • Roche 454

Slide 22

Slide 22 text

>90 % of the DNA ever sequenced has been sequenced with illumina machines

Slide 23

Slide 23 text

The Next Big Thing could be Personal Sequencers. Similar, perhaps, to PCR in the 90s

Slide 24

Slide 24 text

MiSeq from illumina: Up to 15 Gb and 2 × 300 bp runs—with the highest data quality.

Slide 25

Slide 25 text

E coli 2x300 MiSeq
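
To put the MiSeq yield in perspective, the sketch below computes the mean coverage that a full run would give on a single E. coli genome, and how many genomes could share one run. The 15 Gb figure is the "up to" yield from the slide; the genome size of about 4.6 Mb and the 100x target depth are assumptions added here for illustration.

```python
# Figure from the slide: a MiSeq run yields up to 15 Gb (gigabases).
RUN_YIELD_BASES = 15_000_000_000
# Assumption, not from the slides: an E. coli genome of about 4.6 Mb.
ECOLI_GENOME_BASES = 4_600_000
# Hypothetical target depth per genome, chosen only for illustration.
TARGET_COVERAGE = 100

def mean_coverage(total_bases, genome_size):
    """Average number of times each genome position is sequenced."""
    return total_bases / genome_size

def genomes_per_run(total_bases, genome_size, target_coverage):
    """How many genomes fit in one run at the requested depth."""
    return total_bases // (genome_size * target_coverage)

print(f"One genome per run: about {mean_coverage(RUN_YIELD_BASES, ECOLI_GENOME_BASES):.0f}x")
print(f"Genomes per run at {TARGET_COVERAGE}x: "
      f"{genomes_per_run(RUN_YIELD_BASES, ECOLI_GENOME_BASES, TARGET_COVERAGE)}")
```

Under these assumptions a full run delivers far more data than one bacterial genome needs, so several genomes could share a run.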

Slide 26

Slide 26 text

Era7 Bioinformatics activity is based on: • NGS (Next Generation Sequencing) • Research • Focus on Bacterial Genomics • Cloud Computing

Slide 27

Slide 27 text

Research at Era7:

Slide 28

Slide 28 text

Some Projects: • INTERCROSSING • bio4j • BIOGRAPHIKA (bio4j related) • NEXTMICRO

Slide 29

Slide 29 text

Some Projects: • INTERCROSSING • bio4j • BIOGRAPHIKA (bio4j related) • NEXTMICRO

Slide 30

Slide 30 text

NEXTMICRO • AG7 Assembling Genomes: illumina and PacBio • BG7 Bacterial Genome Annotation (PLOS ONE Nov 2012) • CG7 Comparative Genomics

Slide 31

Slide 31 text

NEXTMICRO • Outbreaks • Different Steps in the Management • Managing Information about Clones

Slide 32

Slide 32 text

NEXTMICRO • Era7 Bioinformatics • Hospital Ramon y Cajal Madrid • Funded by CDTI

Slide 33

Slide 33 text

Era7 Bioinformatics activity is based on: • NGS (Next Generation Sequencing) • Research • Focus on Bacterial Genomics • Cloud Computing

Slide 34

Slide 34 text

Bacteria are all over the world

Slide 35

Slide 35 text

Focus on Bacterial Genomics: • Bacteria • Microbiome • Host-Pathogen relationships: Dual RNA-seq • Human and animal models • Biofuels • Food • Environmental • …

Slide 36

Slide 36 text

Era7 Bioinformatics activity is based on: • NGS (Next Generation Sequencing) • Research • Focus on Bacterial Genomics • Cloud Computing

Slide 37

Slide 37 text

Cloud Computing and NGS data analysis INTERCROSSING course

Slide 38

Slide 38 text

Objectives of the course: • To understand the meaning of Cloud Computing and its importance for data analysis in NGS and in science in general • To be able to design and use (basic) Cloud Solutions, so as not to be tied to current solutions

Slide 39

Slide 39 text

To reach these goals: 1. We will give an overview of what the cloud is and how it affects research in general and data analysis (NGS) in particular 2. We will introduce some of the work that we’re doing within INTERCROSSING, giving other partners the opportunity to find possible uses and collaborations through these developments 3. Hands-on approach: we want you to do something, and to do it by yourselves (with our help, of course). Don’t hide real, practical issues under the rug of thoroughly prepared artificial examples

Slide 40

Slide 40 text

Time           Monday 26       Tuesday 27   Wednesday 28    Thursday 29  Friday 30
10:00 - 11:00  T Welcome       T/P Problem  T Architecture  P Q&A III    P Presentations
11:00 - 11:30  break           break        break           break        break
11:30 - 12:30  T Introduction  T NGS        P nispero       P TW III     P Presentations
12:30 - 14:00  lunch           lunch        lunch           lunch        lunch
14:00 - 15:30  T Cloud What?   P statika    P bio4j         P TW IV      Conclusions
15:30 - 15:45  break           break        break           break
15:45 - 16:45  P AWS I         P Q&A I      P Q&A II        P Q&A IV
16:45 - 17:15  break           break        break           break
17:15 - END    P AWS II        P TW I       P TW II         P TW V

Slide 41

Slide 41 text

People:

Slide 42

Slide 42 text

People:

Slide 43

Slide 43 text

People:

Slide 44

Slide 44 text

People:

Slide 45

Slide 45 text

People:

Slide 46

Slide 46 text

Granada:

Slide 47

Slide 47 text

Granada:

Slide 48

Slide 48 text

Granada:

Slide 49

Slide 49 text

How did we get the idea of using Cloud Computing?

Slide 50

Slide 50 text

Nature News, Nov. 2006

Slide 51

Slide 51 text

From the news article in Nature: “You spend a few dollars, you have a computer farm and you get results”

Slide 52

Slide 52 text

From the news article in Nature: The South African National Bioinformatics Institute at the University of the Western Cape, Bellville, has already been testing Amazon’s system to power large-scale genome comparisons. “The pay-as-you-go system offers computing power and bandwidth that the Institute could not afford to maintain itself.”

Slide 53

Slide 53 text

From the news article in Nature: Running since August 2006, Amazon’s service enables customers to create multiple virtual computers for $0.10 per computing hour and to store data for $0.15 per gigabyte per month. Today it is even cheaper!
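
To make the pay-per-use arithmetic concrete, here is a minimal sketch using the 2006 prices quoted above. The workload (20 virtual computers for 10 hours, 100 GB stored for one month) is hypothetical, and the function is only an illustration; real billing has more components and current prices differ.

```python
# Prices quoted on the slide (Nature News, 2006); current AWS prices differ.
PRICE_PER_INSTANCE_HOUR = 0.10   # USD per virtual computer per hour
PRICE_PER_GB_MONTH = 0.15        # USD per gigabyte stored per month

def estimate_cost(instances, hours, storage_gb, months):
    """Pay-per-use estimate: compute charge plus storage charge, in USD."""
    compute = instances * hours * PRICE_PER_INSTANCE_HOUR
    storage = storage_gb * months * PRICE_PER_GB_MONTH
    return compute + storage

# Hypothetical workload: 20 virtual computers for 10 hours, 100 GB kept 1 month.
cost = estimate_cost(instances=20, hours=10, storage_gb=100, months=1)
print(f"Estimated cost: ${cost:.2f}")
```

At those prices the hypothetical job costs $20 of computing plus $15 of storage, which illustrates the "few dollars for a computer farm" point made in the article.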

Slide 54

Slide 54 text

From the news article in Nature: Industry supercomputer power on the desktop PC could have a big impact on scientific research. The main attraction is Amazon’s use of virtualization technologies, which many predict will change not just research but computing itself

Slide 55

Slide 55 text

Granada’s local provider:

Slide 56

Slide 56 text

So, it seemed that we could have Computing and Storage: • On-demand • Scalable • Pay-per-use

Slide 57

Slide 57 text

We discussed the news, and we started to work with AWS at Era7 Bioinformatics in 2007

Slide 58

Slide 58 text

AWS Services Today

Slide 59

Slide 59 text

Use cases: The New York Times. The New York Times Archives + Amazon Web Services = TimesMachine. TimesMachine is a collection of full-page image scans of the newspaper from 1851–1922

Slide 60

Slide 60 text

Use cases: 2008

Slide 61

Slide 61 text

Use cases: Telefonica (Spanish global telephone operator) uses AWS to prepare its bills once a month

Slide 62

Slide 62 text

Use cases: The Force.com Toolkit for Amazon Web Services makes it easy for developers to combine the functionality of Force.com (salesforce.com’s platform for building software-as-a-service applications) with Amazon Web Services to create innovative business applications in the cloud.

Slide 63

Slide 63 text

Use cases: DICOM Grid, Arizona, uses AWS to store, distribute and share medical images

Slide 64

Slide 64 text

Use cases: DNAnexus relies on Amazon Simple Storage Service (Amazon S3) to meet the company's extensive storage demand, which will grow from terabytes into petabytes of data

Slide 65

Slide 65 text

Use cases: Era7 Bioinformatics uses S3, EC2, … to assemble, annotate and compare bacterial genomes and to perform metagenomics studies

Slide 66

Slide 66 text

This is a very interesting use case based on AWS because the data is uploaded from the machines in real time, before the run has finished

Slide 67

Slide 67 text

No content

Slide 68

Slide 68 text

Is there any reason for not using AWS? Probably there could be a few. What I have found many times: Security and Privacy Concerns

Slide 69

Slide 69 text

The Health Insurance Portability and Accountability Act of 1996 (HIPAA) Privacy, Security and Breach Notification Rules

Slide 70

Slide 70 text

No content

Slide 71

Slide 71 text

But the privacy and security problem is not specific to the Cloud: there have been many laptop thefts in the USA involving patients’ data from medical records, clinical trials, etc. This would not happen in the Cloud

Slide 72

Slide 72 text

No content

Slide 73

Slide 73 text

2013 New version of HIPAA

Slide 74

Slide 74 text

Some discussion now from January 2013

Slide 75

Slide 75 text

No content

Slide 76

Slide 76 text

No content

Slide 77

Slide 77 text

“Foreign clouds in the European sky”

Slide 78

Slide 78 text

No content

Slide 79

Slide 79 text

There is a LinkedIn group for NGS:

Slide 80

Slide 80 text

There is also a LinkedIn group for this:

Slide 81

Slide 81 text

In summary: Welcome to Granada!! We will do our best to help you on your way to the Cloud

Slide 82

Slide 82 text

No content