Ruby on bioinformatics

tsechingho
April 26, 2014

RubyConf Taiwan 2014


Transcript

  1. High Throughput Data
    ❖ Big Data
    ❖ File size is small, but there are many files
    ❖ File size is large, but there are just a few files
    ❖ Data size of bioinformatics
    ❖ 1,000,000,000 records for a subject (person) is normal
  2. Data Size of Sequencing After 5 Years
    https://www.nanoporetech.com
    70,000 newborn babies × 500 GB = 35 PB
    30,000 patients × 10,000 cells × 500 GB = 1.5 × 10^11 GB = 150 EB
    Notes: 1. Counted using current NGS data sizes. 2. Civil medical institutes are not included.
    from Dr. Yu-Tai Wang
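
The two products on this slide can be checked with a few lines of Ruby. This is only a sketch that assumes decimal (SI) units, 1 PB = 10^6 GB and 1 EB = 10^9 GB; the helper names are made up for illustration. Under that assumption, 70,000 × 500 GB comes to 35,000,000 GB, which is petabyte scale.

```ruby
# Decimal (SI) storage units are assumed: 1 PB = 10**6 GB, 1 EB = 10**9 GB.
# Helper names are illustrative, not from the talk.
def total_gb(samples, gb_per_sample)
  samples * gb_per_sample
end

def gb_to_pb(gb)
  gb / 10**6
end

def gb_to_eb(gb)
  gb / 10**9
end

newborn_gb = total_gb(70_000, 500)          # newborn babies x 500 GB each
patient_gb = total_gb(30_000 * 10_000, 500) # patients x cells x 500 GB each

puts "Newborns: #{newborn_gb} GB = #{gb_to_pb(newborn_gb)} PB"
puts "Patients: #{patient_gb} GB = #{gb_to_eb(patient_gb)} EB"
```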
  3. Computing Power is Required
    ❖ HPC
    ❖ InfiniBand cluster
    ❖ Amazon EC2 cluster
    ❖ Hadoop cluster
    ❖ Many CPU cores
    ❖ Large memory
    ❖ High I/O efficiency
    http://arstechnica.com/business/2012/05/amazons-hpc-cloud-supercomputing-for-the-99/
  4. InfiniBand is good at I/O efficiency
    http://glennklockwood.blogspot.tw/2013/12/high-performance-virtualization-sr-iov_14.html
    • Interconnect speed
    • I/O performance
    • An InfiniBand system has about 3.8 GB/s of bandwidth
    • A 10 Gb network has about 400 MB/s of bandwidth
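
To get a feel for the two bandwidth figures, here is a rough sketch of how long one 500 GB dataset (the per-sample size quoted on the earlier slide) would take to move at each rate. It assumes the quoted bandwidth is fully sustained, which real transfers rarely achieve; the method name is illustrative.

```ruby
# Idealized transfer time: size divided by sustained bandwidth.
# Both arguments use gigabytes, so 400 MB/s is passed as 0.4 GB/s.
def transfer_seconds(size_gb, bandwidth_gb_per_s)
  size_gb / bandwidth_gb_per_s.to_f
end

infiniband_s = transfer_seconds(500, 3.8)
network_s    = transfer_seconds(500, 0.4)

puts format("InfiniBand (3.8 GB/s): %.0f s", infiniband_s)
puts format("10 Gb network (400 MB/s): %.0f s", network_s)
puts format("Ratio: %.1fx", network_s / infiniband_s)
```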
  5. Data Scientist Concerns
    ❖ Data quality
    ❖ Filtering factors
    ❖ Statistics
    ❖ Visualization
    ❖ Interpretation
  6. Programmer Concerns
    ❖ High-throughput data (Big Data) handling
    ❖ Data format / file format
    ❖ Data parsing
    ❖ Statistical tools
    ❖ Visualization
    ❖ Profit / markets
  7. ID Mapping of Databases
    http://biodbnet.abcc.ncifcrf.gov/dbInfo/netGraph.php
    Each node is a database.
    Each database has its own unique ID.
    These IDs are connected as a network.
    I think handling this complexity should be easy for the people sitting here.
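
One way to picture ID mapping over such a network is a shortest-path search between two databases. The sketch below uses a breadth-first search over a small made-up graph; the database names (`db_a` and so on) and the method name are hypothetical and do not come from bioDBnet.

```ruby
# Breadth-first search for the shortest chain of databases linking two ID types.
# graph: Hash mapping a database name to the databases it cross-references.
def shortest_mapping_path(graph, from, to)
  queue = [[from]]
  visited = { from => true }
  until queue.empty?
    path = queue.shift
    node = path.last
    return path if node == to

    graph.fetch(node, []).each do |neighbor|
      next if visited[neighbor]

      visited[neighbor] = true
      queue << (path + [neighbor])
    end
  end
  nil # no chain of cross-references connects the two databases
end

# Hypothetical network: names are placeholders, not real databases.
id_network = {
  "db_a" => %w[db_b db_c],
  "db_b" => %w[db_a db_d],
  "db_c" => %w[db_a],
  "db_d" => %w[db_b]
}

path = shortest_mapping_path(id_network, "db_a", "db_d")
puts path.join(" -> ")
```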
  8. Simple Mapping
    Steps: Sequence Data; Convert to SAM; Compress to BAM; Index, Sort, Remove PCR duplicates (Rmdup)
    Flow: Illumina exome sequence reads -> Aligner (soap2, bwa, bowtie, etc.) -> Aligned reads -> Aligned reads (SAM file) -> Aligned reads (BAM file) -> Useful reads data -> Call variants -> Visualization in browsers
    Sequence data: 1. .seq -> fastq 2. Illumina score -> Phred score
    Useful reads data: 1. cleaned BAM file 2. quality control, get statistics (mapped, unmapped, etc.)
    Call variants: 1. SNVs in VCFs 2. structural variants 3. copy number changes, etc.
    from Jui-Tse Hsu
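
The "Illumina score -> Phred score" step can be sketched in plain Ruby. This assumes the step means re-encoding quality strings from the older Illumina Phred+64 ASCII offset to the Sanger-style Phred+33 offset; the slide does not say which conversion was used, and the method name is illustrative.

```ruby
# Re-encode a FASTQ quality string from Phred+64 to Phred+33.
# Each character encodes one base quality as (score + offset) in ASCII.
def illumina64_to_phred33(quality)
  converted = quality.each_char.map do |ch|
    score = ch.ord - 64
    raise ArgumentError, "not a Phred+64 character: #{ch.inspect}" if score.negative?

    (score + 33).chr
  end
  converted.join
end

puts illumina64_to_phred33("hhhgfB")
```

A character such as `h` (ASCII 104) carries a score of 40 under the +64 offset and becomes `I` (ASCII 73) under the +33 offset.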
  9. C/C++
    ❖ Key algorithms
    ❖ Written in C/C++
    ❖ Foundation tools
    ❖ BWA
    ❖ Bowtie / Bowtie2
    ❖ samtools / bamtools
    ❖ GMAP / GSNAP
    ❖ BLAT
    ❖ Tophat
  10. Java
    ❖ Good parts of Java
    ❖ GATK
    ❖ Taverna
    ❖ Hadoop
    http://shop.oreilly.com/product/9780596803742.do
  11. R
    ❖ Statistical tools
    ❖ Bioconductor
    ❖ edgeR
    ❖ Data Mining and Analysis Books
    http://exploringdata.github.io/data-visualization-books/analysis/
  12. C bindings & wrappers
    ❖ bio-samtools
    ❖ bio-bwa
    ❖ bio-affy
    ❖ bio-faster
    ❖ mpi-ruby
    ❖ bio-grid
    ❖ gsl
    ❖ rb-gsl
    ❖ nmatrix
    ❖ sambamba (D language)
  13. Data parser / analyser
    ❖ bio-genomic-interval
    ❖ bio-blastxmlparser
    ❖ bio-assembly
    ❖ bio-gff3
    ❖ bio-gff3-pltools
    ❖ bio-alignment
    ❖ bio-maf
    ❖ bio-table
    ❖ bio-rdf
    ❖ bio-vcf
    ❖ bio-velvet
    ❖ bio-gngm
    ❖ bio-gag
    ❖ bio-dbsnp
  14. Data parser / analyser
    ❖ bio-phyloxml
    ❖ bio-jplace
    ❖ bio-gex
    ❖ bio-ipcress
    ❖ bio-stockholm
    ❖ bio-synreport
    ❖ bio-cigar
    ❖ bio-wolf_psort_wrapper
    ❖ bio-hmmer3_report
    ❖ bio-dbla-finder
    ❖ bio-newbler_outputs
    ❖ bio-sra_fastq_dumper
    ❖ bigbio
  15. Data parser / analyser - protein
    ❖ protk
    ❖ mascot-dat
    ❖ bio-protparam
    ❖ bio-plasmoap
    ❖ bio-signalp
    ❖ bio-exportpred
    ❖ bio-hydropathy
    ❖ bio-epitope
    ❖ bio-bio-orthomcl
    ❖ bio-isoelectric_point
    ❖ bio-octopus
    ❖ bio-tm_hmm
    ❖ bio-aliphatic_index
  16. Database / Web API
    ❖ ruby-ensembl-api
    ❖ bio-ucsc-api
    ❖ bio-liftover
    ❖ intermine
    ❖ bio-eupathdb
    ❖ bio-krona
    ❖ bio-sra
    ❖ bio-sradl
    http://www.ensembl.org
  17. Statistics
    ❖ statsample
    ❖ statsample-sem
    ❖ statsample-optimization
    ❖ statsample-timeseries
    ❖ distribution
    ❖ rinruby
    http://www.ncss.com/software/ncss/survival-analysis-in-ncss
  18. SVG & Graph
    ❖ rubyvis
    ❖ plotrb
    ❖ bio-svgenes
    ❖ bio-vis
    ❖ gnuplot
    http://rubyvis.rubyforge.org
  19. Ensembl Virtual Machine
    ❖ Powered by Veewee, Vagrant and Chef
    ❖ Automatically builds a versioned Ensembl system (Perl)
    ❖ Includes database, queuing services and analysis tools
    ❖ Multiple sites and multiple species in one virtual machine
    ❖ Helps to build local and custom systems
    from Tse-Ching Ho
  20. Ensembl Virtual Machine: Ensembl VM Automation
    Use an existing Vagrant box
    Prepare SOP for Chef recipes
    Provision VM with Chef recipes
    Write Chef recipes
    Export VM with VirtualBox
    Set up Vagrantfile
    Create Vagrant box with Veewee
    Write definition of Vagrant box with Veewee
    from Tse-Ching Ho
  21. DR. RAW
    ❖ Derived from DRAW and SneakPeek
    ❖ Composed of C/C++, bash, Perl, Java and Ruby
    ❖ Covers both DNA and RNA resequencing analysis
    ❖ Enhanced quality control for DNA and RNA
    ❖ Distributed computing pipeline
    ❖ Supports PBS, LSF and SGE platforms (queuing systems)
    from Hannah Lin
  22. DR. RAW
    Layers: Analysis Tools, Analysis Pipeline, Quality Control, Resource Manager System
    Input: DNA sequencing data, RNA sequencing data
    Quality control: DNA QC (Forward : Reverse), RNA QC (Forward : Reverse)
    Tools: BWA-0.7.7, Samtools-0.1.19, GATK-3.1, GSNAP-13-10-25, Cufflink-13-11, FusionGene …
    Resource managers: SGE (Sun Grid Engine), PBS (Portable Batch System), LSF (Load Sharing Facility)
    Legend: green = new components, red = updated components
    from Hannah Lin
  23. Neo4j - JRuby Data Parser
    ❖ Graph database for data integration of discrete clinical research documents
    ❖ Original data are Excel/CSV files collected at different times by different people
    ❖ Neo4j is good for cleaning up such massive data sets
    ❖ Cooperation between biologists and programmers
    from Wei-Ming Wu, Chia-Hsuan Lee
  24. API Server for Third Party Firm
    ❖ API server based on Rails, run by JRuby
    ❖ ActiveRecord models for Oracle database
    ❖ activerecord-oracle_enhanced-adapter gem
    ❖ Import Excel files into the third-party GUI client
    ❖ Third-party server sends XML requests to the API server
    from Wei-Ming Wu, Sean Wang
  25. API Server for Third Party Firm
    Groups: Third Party, Our Servers
    Components: TCHC server; API server (Rails, JRuby); CSIS (Java, Oracle); Windows GUI
    Actions: Send data by XML; Parse request; Write into database; Read data by client program; Upload data
    from Wei-Ming Wu, Sean Wang
  26. Daily Checking Rule
    ❖ Based on Rails, run by JRuby
    ❖ ActiveRecord models for Oracle database
    ❖ activerecord-oracle_enhanced-adapter gem
    ❖ Users can define rules for checking data, usually values in filled forms
    ❖ Checking rules run daily, not before forms are filled in
    from Wei-Ming Wu, Sean Wang
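
As a rough illustration of user-defined checking rules, the sketch below runs a list of rules over form records and reports violations. It is plain Ruby, not the project's code: the rule structure, field names and sample records are all hypothetical, and the real system reads its data through ActiveRecord models rather than in-memory hashes.

```ruby
# Apply every rule to every record and collect the failures.
# A rule is a Hash with a :name, the :field it inspects, and a :check lambda
# that returns true when the value is acceptable. (Hypothetical structure.)
def run_rules(rules, records)
  records.flat_map do |record|
    failed = rules.reject { |rule| rule[:check].call(record[rule[:field]]) }
    failed.map { |rule| { record_id: record[:id], rule: rule[:name] } }
  end
end

# Hypothetical rules and form records for illustration only.
rules = [
  { name: "age in range", field: :age,
    check: ->(v) { v.is_a?(Integer) && v.between?(0, 120) } },
  { name: "weight present", field: :weight_kg,
    check: ->(v) { !v.nil? } }
]

records = [
  { id: 1, age: 34, weight_kg: 61.5 },
  { id: 2, age: 340, weight_kg: nil }
]

run_rules(rules, records).each do |violation|
  puts "record #{violation[:record_id]} failed: #{violation[:rule]}"
end
```

Keeping each rule as data plus a small predicate makes it easy to run the whole set from a scheduled daily job.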
  27. Patient Randomization
    ❖ Based on Rails, run by JRuby
    ❖ ActiveRecord models for Oracle database
    ❖ activerecord-oracle_enhanced-adapter gem
    ❖ Assigns patients to different groups using a randomization method
    ❖ Cooperation between statisticians and programmers
    from Wei-Ming Wu, Sean Wang
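
The slide does not say which randomization method was used. As one common possibility, here is a minimal sketch of permuted-block randomization in plain Ruby; the method name, group labels and block size are assumptions made for illustration.

```ruby
# Permuted-block randomization: within each block of patients, every group
# receives the same number of assignments, in a shuffled order.
def block_randomize(patient_ids, groups, block_size:, rng: Random.new)
  unless (block_size % groups.size).zero?
    raise ArgumentError, "block_size must be a multiple of the number of groups"
  end

  per_group = block_size / groups.size
  assignments = {}
  patient_ids.each_slice(block_size) do |block|
    labels = (groups * per_group).shuffle(random: rng)
    block.each_with_index { |id, i| assignments[id] = labels[i] }
  end
  assignments
end

# Hypothetical patients 1..8 split between two arms in blocks of four.
# A seeded generator makes the run repeatable.
assignments = block_randomize((1..8).to_a, %w[treatment control],
                              block_size: 4, rng: Random.new(42))
assignments.each { |id, group| puts "patient #{id}: #{group}" }
```

Blocking keeps the group sizes balanced as enrollment proceeds, which is why it is often preferred over independent coin flips in small studies.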
  28. Database Statistics Dashboard
    ❖ Based on Rails, run by JRuby
    ❖ ActiveRecord models for Oracle database
    ❖ activerecord-oracle_enhanced-adapter gem
    ❖ google_visualr gem for visualization
    ❖ Counts the number of projects, forms, fields, records and patients
    from Wei-Ming Wu, Winnie Lui
  29. Topics to Take Action On
    ❖ Data generation and data management
    ❖ Data analysis and software
    ❖ Data processing and storage
    ❖ Application of bioinformatics in pharma research and development
    http://www.giichinese.com.tw/report/bc268909-bioinformatics-technologies-global-markets.html
  30. Health Care in Cloud
    ❖ Health promotion cloud
    ❖ Vaccination cloud
    ❖ Exercise cloud
    ❖ Workplace wellness
    ❖ Physical checkup cloud
    ❖ Welfare cloud
    from Dr. Chi-Hung Lin