
Work Log - 12/07

Liang Bo Wang
December 05, 2012


Transcript

  1. Work Progress 12/7: database statistics, verification statistics

    Bioinformatics and Biostatistics Core, NTU Center of Genomic Medicine
  2. [Flowchart: processing pipeline, from the original datasets on GEO to the Excel table output]

    •  No read collapsing, for disk-usage and BLAST-function reasons
    •  Results are summarized by my own script
    •  Quality control will be added in the near future
    •  NGS QC Toolkit (建樂, a senior labmate) may be used
    •  Automation & parallelization

    Pipeline steps (tool: action; files):
    •  NCBI SRA-Toolkit: dump to fasta/q format. Input: original datasets on GEO (SRRxxx.sra, SRRyyy.sra, ...). Output: fastq format with QC and sequencing details (SRRxxx.fastq, SRRyyy.fastq, ...)
    •  FastX Toolkit: clip off 3' adapter. Output: only clipped sequences left (SRRxxx_clipped.fastq, SRRyyy_clipped.fastq, ...)
    •  Quality Control: discard low-score reads
    •  Fasta Converter: file format conversion. Output: simpler file format, fasta (SRRxxx_clipped.fasta, SRRyyy_clipped.fasta, ...)
    •  BLAST+: make BLAST database. Slide label: original datasets on GEO (SRRxxx.sra, SRRyyy.sra, ...)
    •  BLAST+ blastn: query every candidate on every dataset. Input: novel miRNA candidates (candidate01.fa, candidate02.fa, ...)
    •  Handmade script: summarize all queries. Input: detailed BLAST results for every query (candidate01_xxx.csv, candidate02_xxx.csv, …; candidate01_yyy.csv, candidate02_yyy.csv, …; …). Output: summarized read count for all candidates (candidate01-10_xxx-zzz.csv), as an Excel table
    •  Script automation covers the whole pipeline
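    The pipeline steps above can be sketched as a list of command lines per run. This is only a sketch under assumptions: the adapter sequence is a placeholder (the slides do not state it), the file names follow the SRRxxx pattern from the slide, the not-yet-added quality control step is left out, and `pipeline_commands` is a hypothetical helper name, not the author's script.

```python
from pathlib import Path

# Placeholder only: the slides do not give the 3' adapter sequence.
ADAPTER = "ADAPTER_SEQUENCE_HERE"


def pipeline_commands(run_id, candidate_fasta, adapter=ADAPTER):
    """Return the command lines for one SRA run, each as a list of arguments."""
    fastq = f"{run_id}.fastq"
    clipped_fastq = f"{run_id}_clipped.fastq"
    clipped_fasta = f"{run_id}_clipped.fasta"
    candidate = Path(candidate_fasta).stem
    return [
        # SRA Toolkit: dump the .sra archive to FASTQ
        ["fastq-dump", f"{run_id}.sra"],
        # FASTX-Toolkit: clip the 3' adapter, keep only clipped reads (-c),
        # and print the clipping report (-v)
        ["fastx_clipper", "-a", adapter, "-c", "-v",
         "-i", fastq, "-o", clipped_fastq],
        # FASTX-Toolkit: convert FASTQ to the simpler FASTA format
        ["fastq_to_fasta", "-i", clipped_fastq, "-o", clipped_fasta],
        # BLAST+: build a nucleotide database from the clipped reads
        ["makeblastdb", "-in", clipped_fasta, "-dbtype", "nucl", "-out", run_id],
        # BLAST+: query one candidate against the database, tabular output
        ["blastn", "-query", candidate_fasta, "-db", run_id,
         "-outfmt", "6", "-out", f"{candidate}_{run_id}.tsv"],
    ]
```

    Each list can be handed to `subprocess.run(cmd, check=True)` in order. Depending on the FASTQ quality encoding, the FASTX-Toolkit steps may need an extra quality-offset option; that is not covered here.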
  3. Tool for parallel computing & automation

    •  Parses output directly into CSV (comma-separated values), which can be imported into Excel directly
    •  Works like a usual command-line tool
    •  Takes arguments
    •  Built on the Python standard library
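    A minimal sketch of such a tool, using only the standard library (`argparse`, `csv`). It assumes the BLAST results are in tabular form (`-outfmt 6`), where the first column is the query ID and the second the subject (read) ID; the function names and the table layout are illustrative, not the author's actual script.

```python
import argparse
import csv
import sys
from collections import defaultdict


def count_reads(lines):
    """Count distinct matched reads per query from BLAST tabular lines."""
    reads = defaultdict(set)
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        # Column 1 is the query ID, column 2 the subject (read) ID.
        reads[fields[0]].add(fields[1])
    return {query: len(subjects) for query, subjects in reads.items()}


def write_summary(tables, out):
    """Write one CSV row per candidate and one column per dataset."""
    datasets = sorted(tables)
    candidates = sorted({c for counts in tables.values() for c in counts})
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["candidate"] + datasets)
    for cand in candidates:
        writer.writerow([cand] + [tables[d].get(cand, 0) for d in datasets])


def main(argv=None):
    parser = argparse.ArgumentParser(
        description="Summarize BLAST tabular results into one CSV table")
    parser.add_argument("results", nargs="+", help="BLAST tabular result files")
    parser.add_argument("-o", "--output", default="-",
                        help="output CSV path, or '-' for standard output")
    args = parser.parse_args(argv)
    tables = {}
    for path in args.results:
        with open(path) as handle:
            tables[path] = count_reads(handle)
    if args.output == "-":
        write_summary(tables, sys.stdout)
    else:
        with open(args.output, "w", newline="") as out:
            write_summary(tables, out)
```

    Calling `main(["a.tsv", "b.tsv", "-o", "summary.csv"])` would produce a table that Excel can open directly. Counting distinct subject IDs avoids counting one read twice when BLAST reports several alignments for it.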
  4. Dataset GSE33858 (lung)

    •  Lung cancer tissues versus adjacent lung tissues
    •  Illumina Genome Analyzer IIx
    •  32 patients and 62 samples in total
    •  21 adenocarcinoma patients (lung adenocarcinoma)
    •  11 squamous cell carcinoma patients
    •  Expression rate is normalized by sample size
    •  Start from raw sequence data
    •  Use the read count after 3' adapter clipping
    •  Adapter-only, too-short, N-containing, and non-clipped (no adapter found) reads are discarded
  5. [Figure: read count (unknown unit) per run, SRR372611 to SRR372672; series: N reads, non-clipped, adapter-only, too-short, output. Second panel: the same chart for SRR372611 to SRR372616 only.]

    •  Dataset sizes vary by about two orders of magnitude
    •  Different sequencing machines were used:
    •  SRRXXXX[11-16] (SRR372611 to SRR372616): HWI-EAS438_42AHVAAXX
    •  SRRXXXX[17-72] (SRR372617 to SRR372672): ILLUMINA-053F9F
  6. Further into these statistics

    •  At first I thought the tool logged the reads in terms of their length
    •  Then I computed the ratio of the logged input (output) count to the actual number of sequences
    •  All ratios are exactly 53
    •  Still don't know why, since all runs used the same arguments
    •  Probably due to the different platforms used
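    The ratio check described above could look like the sketch below. The run names and counts in the usage note are made up for illustration; only the observation that the ratio comes out as exactly 53 is from the slides.

```python
def reported_to_actual(reported, actual):
    """Return reported/actual read-count ratio for every run in `reported`."""
    return {run: reported[run] / actual[run] for run in reported}


def constant_ratio(ratios, tol=1e-9):
    """Return the shared ratio if all runs agree within `tol`, else None."""
    values = list(ratios.values())
    if not values:
        return None
    first = values[0]
    if all(abs(value - first) <= tol for value in values):
        return first
    return None
```

    With hypothetical inputs such as `reported = {"runA": 5300, "runB": 10600}` and `actual = {"runA": 100, "runB": 200}`, `constant_ratio(reported_to_actual(reported, actual))` returns `53.0`. A `None` result would signal that the inflation factor is not constant across runs.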
  7. Dataset statistics after adjustment

    •  For dataset SRRXXXX[17-72], the numbers of reads are all divided by 53
    •  Some datasets discard a large proportion of reads with no adapter sequence (non-clipped), especially dataset [11-16]
    •  Might some other 3' adapter have been used? (needs rechecking)

    [Figure: read count per run, SRR372611 to SRR372672; series: N reads, non-clipped, adapter-only, too-short, output.]
  8. GSE29173 (breast tumor)

    •  21 barcoded sequencing runs
    •  Thanks to 建樂 (a senior labmate) for great help and inspiration
    •  Information is in its supplementary file
    •  So each sample is relatively small
    •  Illumina Genome Analyzer IIx
    •  185 unique samples and 54 samples in replicate
    •  245 samples in total
    •  Includes IDC, ILC, DCIS, Apocrine, Adenoid, Metaplastic, Atypical Medullary => treated as the 'other' type of samples
    •  normal: 16, other: 229
  9. Unusually large read counts still appear, and are even worse

    •  The clipping statistics are produced by FastX Toolkit
    •  The ratio is not constant either
    •  Need to debug its source code
    •  Planning to replace this tool
    •  However, fastq-dump and BLAST+ makeblastdb gave correct and reasonable read counts
    •  Verified by counting the number of lines in the fasta file
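    An independent read count of this kind can be sketched as follows. The slide counts lines in the fasta file; this sketch counts header lines (those starting with `>`) instead, which gives the number of sequences even if a sequence wraps over several lines. The function names are illustrative.

```python
def count_fasta_records(lines):
    """Count sequences by counting FASTA header lines, which start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


def count_fasta_file(path):
    """Count the sequences in a FASTA file, reading it line by line."""
    with open(path) as handle:
        return count_fasta_records(handle)
```

    For a file with exactly one sequence line per record, this count is half the total number of lines, so it can be compared directly with the numbers reported by fastq-dump and makeblastdb.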
  10. [Figure: read count of samples in dataset GSE29173 for the BLAST database, per run from SRR191393 to SRR191633; series: seq. not mapped, output. Two panels: linear scale and log scale.]
  11. Verification Result

    •  Normalized by RPM = reads per million
    •  Average expression rate over samples of the same type
    •  However, not all samples of one type behave similarly

    $$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} \frac{x_i}{N_i} \times 10^6$$

    where $\bar{X}_n$ is the average expression rate, $n$ is the number of samples of the same type, $x_i$ is the read count in sample $i$, and $N_i$ is the size of sample $i$.
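    The RPM average defined above can be sketched in a few lines. The function names are illustrative, and the numbers in the usage note are made up; the sketch only assumes that each sample contributes its read count divided by its sample size, scaled to one million reads.

```python
def rpm(read_count, sample_size):
    """Reads per million: read count scaled to a sample of one million reads."""
    return read_count / sample_size * 1e6


def average_rpm(read_counts, sample_sizes):
    """Mean RPM over samples of the same type (x_i over N_i, averaged over n)."""
    if len(read_counts) != len(sample_sizes):
        raise ValueError("read_counts and sample_sizes must have the same length")
    if not read_counts:
        raise ValueError("at least one sample is required")
    rates = [rpm(x, size) for x, size in zip(read_counts, sample_sizes)]
    return sum(rates) / len(rates)
```

    For two hypothetical samples with 10 reads out of 1,000,000 and 30 reads out of 2,000,000, the per-sample rates are 10 and 15 RPM, so `average_rpm([10, 30], [1_000_000, 2_000_000])` is about 12.5.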
  12. [Figure: normalized average expression rate of miRNA candidates in dataset GSE33858; y-axis: expression rate in RPM (reads per million), shown from 0 to 0.5; series: normal, tumor.]

    •  Expression rates differ by several orders of magnitude
    •  Separate views follow
    •  Candidates shown: chr7_8791, chr11_13342, chr22_20736, chr13_14817, chr17_17828, chr20_19494, chr20_19450, chr18_18769, chr11_13709, chr2_2356, chr4_5692, chr11_12760, chr3_3910, chr1_692, chr10_12452, chr6_7548, chr1_944, chr6_8151, chr11_13239, chr7_8991, chr7_8849, chr22_20809, chr2_3487, chr7_8673, chr14_15459, chr20_19463, chr6_8250, chr19_19992, chr1_689, chr17_17785
  13. [Figure: normalized average expression rate of miRNA candidates in dataset GSE33858 for chr2_2356, chr22_20809, and chr19_19992; y-axis: expression rate in RPM (reads per million); series: tumor, normal. Two panels: linear scale and log scale.]

    •  The three candidates with the highest total expression
  14. [Figure: normalized average expression rate of miRNA candidates in dataset GSE29173, for the 30 candidates chr7_8791 through chr17_17785; y-axis: expression rate in RPM (reads per million), shown from 0 to 0.5; series: normal, other.]

    •  Relatively higher expression rates than in GSE33858
  15. [Figure: normalized average expression rate of miRNA candidates in dataset GSE29173, for the 30 candidates chr7_8791 through chr17_17785; y-axis: expression rate in RPM (reads per million), shown from 0 to 1.6; series: normal, other.]

    •  Expression rates span a much larger range
    •  Many samples are small
    •  Detailed view
  16. [Figure: normalized average expression rate in log scale of miRNA candidates in dataset GSE29173, for chr4_5692, chr3_3910, chr22_20809, chr2_3487, chr6_8250, chr19_19992, and chr17_17785; y-axis: expression rate in RPM (reads per million); series: normal, other.]
  17. [Figure: normalized average expression rate in log scale of miRNA candidates across datasets, for the 30 candidates chr7_8791 through chr17_17785; y-axis: expression rate in RPM (reads per million); series: GSE33858_normal, GSE33858_tumor, GSE29173_normal, GSE29173_other.]
  18. Current status of verification

    •  These statistics still need interpretation
    •  Is the "AVERAGE" expression rate the right summary?
    •  Define the different expression trends across cancer types
    •  Select possible miRNA candidates
    •  Not sure whether this method is correct
    •  Does the normalized expression rate differ with sample size?
    •  The original data should be included
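    One way to probe the sample-size question is to compare the mean of per-sample RPM values with a pooled RPM computed from summed counts. The pooled variant is not from the slides; it is introduced here only as a comparison, and the numbers in the usage note are made up.

```python
def mean_of_rpm(read_counts, sample_sizes):
    """Average the per-sample RPM values; every sample weighs the same."""
    rates = [x / size * 1e6 for x, size in zip(read_counts, sample_sizes)]
    return sum(rates) / len(rates)


def pooled_rpm(read_counts, sample_sizes):
    """RPM of all samples pooled together; large samples weigh more."""
    return sum(read_counts) / sum(sample_sizes) * 1e6
```

    For a hypothetical small sample with 1 read out of 10,000 and a large one with 100 reads out of 10,000,000, `mean_of_rpm` gives about 55 RPM while `pooled_rpm` gives about 10 RPM. A single read in a small sample can therefore dominate the plain average, which may matter for datasets with many small samples.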