Slide 1

Slide 1 text

Station Biologique de Roscoff, UMR7144 - CNRS and Sorbonne Université
PR2 release 4.11
Daniel Vaulot
November 3, 2018

Slide 2

Slide 2 text

Content
- What is PR2?
- What is PR2 used for?
- History of PR2
- PR2 now
- How is PR2 implemented and maintained?
- MySQL database
- R scripts
- Access to PR2
- What is next?

Slide 3

Slide 3 text

What is PR2?

Slide 8

Slide 8 text

What is PR2?
- PR2 = Protist Ribosomal Reference database.
- Open access database of eukaryotic 18S rRNA sequences.
- All sequences originate from GenBank.
- Sequences receive a detailed taxonomic assignment (8 levels).
- Taxonomic annotation for both strain and environmental sequences.
- 176,818 sequences
- 47,000 species

Slide 9

Slide 9 text

What is PR2 used for?

Slide 14

Slide 14 text

What is PR2 used for?
- Annotation of metabarcoding data
- Biogeography
- Sequence analysis
- Links between phylogeny and functional traits
- 220 papers citing PR2

Slide 22

Slide 22 text

What is PR2 used for? Metabarcoding
- Marine ecosystems
- Ballast waters
- River systems
- Hot springs
- Soil
- Farming systems
- Urban ecology
- Criminology

Slide 23

Slide 23 text

What is PR2 used for? Biogeography
Simon, N. et al. 2017. Revision of the Genus Micromonas Manton et Parke (Chlorophyta, Mamiellophyceae), of the Type Species M. pusilla (Butcher) Manton & Parke and of the Species M. commoda van Baren, Bachy and Worden and Description of Two New Species. Protist. 168:612–35.

Slide 24

Slide 24 text

What is PR2 used for? Sequence analysis
- Primer analysis: ongoing work with S. Geisen and D. Bass.

Slide 25

Slide 25 text

History of PR2

Slide 26

Slide 26 text

PR2 history
- 1997: Excel file created by D. Vaulot during L. Guillou's thesis
- 2000-2003: Access/ARB database maintained by D. Vaulot during PICODIV
- 2006-2010: KeyDNAtools developed by L. Guillou
- 2010-2013: Project BioMarks: creation of PR2 by L. Guillou; database maintained by R. Christen (ssu-rrna.org)
- mid-2016: Web site died
- 2016: D. Vaulot takes over maintenance; raw data deposited to Figshare
- 2017: Database moved to MySQL; development of R scripts to manage the database; repository on GitHub

Slide 27

Slide 27 text

PR2 history: PICODIV

Slide 28

Slide 28 text

PR2 history: PICODIV

Slide 29

Slide 29 text

PR2 history: Guillou et al. 2013 paper

Slide 30

Slide 30 text

PR2 now

Slide 31

Slide 31 text

Statistics: Taxonomic distribution

Slide 32

Slide 32 text

Statistics: Sequence size distribution

Slide 33

Slide 33 text

Statistics: Geographical distribution

Slide 34

Slide 34 text

Recent updates
Version | Date       | Who                               | Major group updated
4.11    | 30/10/2018 | D. Vaulot, A. Lopes               | Chloropicophyceae, Mamiellophyceae; EukRef Ciliates
4.9     | 20/02/2018 | S. Mordret, R. Piredda, D. Sarno  | Dinophyceae
4.7     | 27/09/2017 | C. Bachy, W.-T. Chen              | Ciliates (Spirotrichea)
4.4     | 10/11/2016 | D. Vaulot                         | Bolidophyceae
4.0     | 21/10/2015 | B. Edvardsen                      | Haptophyta
3.0     | 31/08/2015 | M. Tragin                         | Chlorophyta
2.0     | 07/02/2015 | T. Biard                          | Rhizaria

Slide 35

Slide 35 text

Recent updates: Dinoflagellates

Slide 36

Slide 36 text

Recent updates: Ciliates - EukRef

Slide 37

Slide 37 text

How is PR2 implemented and maintained?

Slide 40

Slide 40 text

Implementation
- MySQL database
- Processing done with R (tidyverse libraries)
- Data available on GitHub (and on Figshare, with a DOI)

Slide 44

Slide 44 text

MySQL database: Tables
- pr2_main: sequences assigned to species
- pr2_sequence: sequence of each entry
- pr2_metadata: metadata for each entry
- pr2_taxonomy: one line per species

Slide 45

Slide 45 text

MySQL database: Table pr2_main
- Each entry has a PR2 accession number (two entries may correspond to the same GenBank accession number, e.g. for genomes)
- Sequences are linked to taxonomy by species name
- Chimeras are annotated (and removed when PR2 is exported)
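The link by species name and the removal of chimeras at export can be sketched as follows. This is only an illustration using Python's built-in sqlite3 module rather than MySQL; the column names (pr2_accession, genbank_accession, species, genus, chimera) and all accession values are assumptions made for the example, not the actual PR2 schema or data.

```python
import sqlite3

# In-memory stand-in for the database; column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pr2_taxonomy (species TEXT PRIMARY KEY, genus TEXT);
CREATE TABLE pr2_main (
    pr2_accession     TEXT PRIMARY KEY,
    genbank_accession TEXT,
    species           TEXT REFERENCES pr2_taxonomy(species),
    chimera           INTEGER DEFAULT 0
);
""")

conn.executemany(
    "INSERT INTO pr2_taxonomy VALUES (?, ?)",
    [("Micromonas_pusilla", "Micromonas"), ("Micromonas_commoda", "Micromonas")],
)

# Made-up accessions: two entries share one GenBank accession (genome case),
# and one entry is flagged as a chimera.
conn.executemany(
    "INSERT INTO pr2_main VALUES (?, ?, ?, ?)",
    [
        ("AA000001.1.1700_U", "AA000001", "Micromonas_pusilla", 0),
        ("BB000001.1.1800_U", "BB000001", "Micromonas_commoda", 0),
        ("BB000001.5000.6800_U", "BB000001", "Micromonas_commoda", 0),
        ("CC000001.1.1500_U", "CC000001", "Micromonas_pusilla", 1),
    ],
)


def export_rows(connection):
    """Join sequences to taxonomy by species name and drop chimeras."""
    cursor = connection.execute("""
        SELECT m.pr2_accession, m.genbank_accession, t.genus, m.species
        FROM pr2_main AS m
        JOIN pr2_taxonomy AS t ON t.species = m.species
        WHERE m.chimera = 0
        ORDER BY m.pr2_accession
    """)
    return cursor.fetchall()


exported = export_rows(conn)
```

Keeping the chimera flag in the table, instead of deleting the rows, means the annotation is preserved while exports stay clean.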

Slide 46

Slide 46 text

MySQL database: Table pr2_metadata
- GenBank annotations (gb_ fields)
- Some gb_ fields have been manually edited (e.g. gb_strain and gb_clone)
- Manually curated annotations (e.g. sample_ fields)
- Fields computed from gb_ fields, such as longitude and latitude
- Phenotypic information (autotroph vs. heterotroph, mixotroph, etc.)
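As an illustration of a computed field, the sketch below converts a coordinate string into signed decimal latitude and longitude. It assumes the input follows the GenBank lat_lon qualifier style ("48.72 N 3.98 W"); the function name and the choice to return None for unparsable values are assumptions for the example, not the PR2 implementation.

```python
import re

# Matches strings such as "48.72 N 3.98 W" (GenBank lat_lon style).
_LAT_LON = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*([NS])\s+(\d+(?:\.\d+)?)\s*([EW])\s*$"
)


def parse_lat_lon(text):
    """Return (latitude, longitude) in signed decimal degrees, or None."""
    match = _LAT_LON.match(text)
    if match is None:
        return None
    lat, north_south, lon, east_west = match.groups()
    latitude = float(lat) * (1 if north_south == "N" else -1)
    longitude = float(lon) * (1 if east_west == "E" else -1)
    # Reject values outside the valid coordinate range.
    if abs(latitude) > 90 or abs(longitude) > 180:
        return None
    return latitude, longitude
```

South and west are stored as negative numbers, which is the convention most mapping tools expect.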

Slide 47

Slide 47 text

MySQL database: Table pr2_taxonomy
- 8 taxonomic levels (kingdom -> species)
- Follows the PR2 convention (_X, _XX, etc.)
- Contains 47,000 species
- Each name is unique (i.e. it does not appear in different columns or on different lines)
- Any daughter taxon has a unique mother taxon
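The uniqueness rules above lend themselves to an automated check. The following is a minimal sketch, assuming each taxonomy line is a tuple of 8 names ordered from kingdom to species; the intermediate level names in LEVELS and the function name check_taxonomy are assumptions for illustration.

```python
# Level names between kingdom and species are assumed for this example.
LEVELS = ["kingdom", "supergroup", "division", "class",
          "order", "family", "genus", "species"]


def check_taxonomy(rows):
    """Return a list of messages describing violations of the rules."""
    problems = []
    level_of = {}    # taxon name -> index of the level where it was first seen
    mother_of = {}   # taxon name -> name of its mother taxon
    seen_species = set()

    for row in rows:
        for index, name in enumerate(row):
            # A name must not appear in two different columns.
            if name in level_of and level_of[name] != index:
                problems.append(
                    f"{name} appears as {LEVELS[level_of[name]]} and {LEVELS[index]}"
                )
            level_of.setdefault(name, index)

            # A daughter taxon must always have the same mother taxon.
            if index > 0:
                mother = row[index - 1]
                if name in mother_of and mother_of[name] != mother:
                    problems.append(
                        f"{name} has two mothers: {mother_of[name]} and {mother}"
                    )
                mother_of.setdefault(name, mother)

        # One line per species.
        species = row[-1]
        if species in seen_species:
            problems.append(f"{species} is listed on more than one line")
        seen_species.add(species)

    return problems
```

For example, a table in which the genus Micromonas sits under Mamiellaceae on one line and under another family on a second line would produce one "two mothers" message.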

Slide 48

Slide 48 text

R scripts (using the tidyverse)
- Add new sequences from GenBank
- Correct taxonomy of existing sequences (EukRef output)
- Extract metadata from GenBank entries
- Check for sequence problems (short sequences, sequences with ambiguities)
- Analyze taxonomy
- Export data to a variety of formats (fasta, R data)
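The sequence checks listed above (short sequences, sequences with ambiguities) could look like the sketch below. It is written in Python rather than R, and the thresholds min_length and max_ambiguities are hypothetical values chosen for the example, not the cut-offs used by PR2.

```python
def sequence_problems(sequence, min_length=500, max_ambiguities=20):
    """Return a list of problem labels for one nucleotide sequence.

    Thresholds are illustrative assumptions, not PR2's actual cut-offs.
    """
    seq = sequence.upper()
    issues = []

    if len(seq) < min_length:
        issues.append("short")

    # Anything other than A, C, G, T or U counts as an ambiguity (N, R, Y...).
    n_ambiguous = sum(1 for base in seq if base not in "ACGTU")
    if n_ambiguous > max_ambiguities:
        issues.append("ambiguities")

    return issues
```

Returning labels instead of dropping sequences lets the flagged entries be reviewed before any decision is made.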

Slide 49

Slide 49 text

R scripts: Sequence processing

Slide 50

Slide 50 text

Access to PR2

Slide 51

Slide 51 text

GitHub
- https://github.com/vaulot/pr2database
- Releases: current version 4.11.0
- Wiki
- Issues

Slide 52

Slide 52 text

GitHub: Wiki

Slide 53

Slide 53 text

GitHub: Download formats
Export formats
- Metabarcode annotation: mothur, QIIME, dada2, USEARCH, VSEARCH
- BLAST: fasta files
- Metadata
- R dataset (new)
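To give an idea of what such an export involves, the sketch below writes records in the layout that dada2's assignTaxonomy expects for a training file: a fasta header made of the taxonomic levels separated and terminated by semicolons, followed by the sequence. The function name and the example record are assumptions for illustration; the files actually distributed with PR2 may differ in detail.

```python
def to_dada2_fasta(records):
    """Format (taxonomy, sequence) pairs as a dada2-style training fasta.

    `taxonomy` is a sequence of level names from kingdom down to species.
    """
    lines = []
    for taxonomy, sequence in records:
        # Header: levels joined by ';' with a trailing ';'.
        lines.append(">" + ";".join(taxonomy) + ";")
        lines.append(sequence)
    return "\n".join(lines) + "\n"


# Illustrative record; the sequence is a made-up fragment.
example = to_dada2_fasta([
    (
        ["Eukaryota", "Archaeplastida", "Chlorophyta", "Mamiellophyceae",
         "Mamiellales", "Mamiellaceae", "Micromonas", "Micromonas_pusilla"],
        "ACCTGGTTGATCCTGCCAG",
    )
])
```

Other tools need the same information arranged differently, which is why a separate export per tool is convenient.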

Slide 54

Slide 54 text

GitHub: R dataset

Slide 55

Slide 55 text

Figshare
- https://doi.org/10.6084/m9.figshare.5913181
- PhytoRef (16S plastid) is also on Figshare.

Slide 56

Slide 56 text

What is next?

Slide 64

Slide 64 text

Database
- Coordinate with EukRef
- Reference sequences
- Chimeras
- Reannotate environmental sequences (Wang/DECIPHER)
- Import more recent GenBank sequences
- Incorporate new metadata types (e.g. mixotrophs)
- Incorporate 16S plastid, ITS, SSU
- Provide alignments for specific groups

Slide 72

Slide 72 text

Web site
In the coming years, we will try to provide users with new functionalities. However, this can already be done easily using R and the pr2database library.
- Specific datasets
- Reference sequences (e.g. for alignments)
- Chimeras
- Taxonomic groups (e.g. diatoms)
- BLAST search
- Automatic metabarcode annotation using Wang classifier/DECIPHER
- Primer and probe specificity (cf. work with S. Geisen)
- Visualisation of metadata (e.g. position)

Slide 73

Slide 73 text

Web site: Example of interactive download of sequences.

Slide 74

Slide 74 text

ResearchGate: Follow PR2 on ResearchGate

Slide 75

Slide 75 text

Thank you for your attention