Ruby on bioinformatics

tsechingho
April 26, 2014

RubyConf Taiwan 2014

Transcript

  1. Ruby Conference Taiwan 2014: Ruby on Bioinformatics

    Tse-Ching Ho (何澤清), @tsechingho, 2014/4/26
  2. Horse + Stripe = Zebra

  3. Biology + Informatics = Bioinformatics

  4. Age of Big Data

  5. Age of Data Science

  6. High-Throughput Data

    ❖ Big Data
    ❖ File size is small but there are many files
    ❖ File size is large but there are just a few files
    ❖ Data size of bioinformatics
    ❖ 1,000,000,000 records for a subject (person) is normal
  7. The Storage Demand is Increasing from Dr. Yu-Tai Wang

  8. Data Size of Sequencing After 5 Years

    https://www.nanoporetech.com
    70,000 newborn babies × 500 GB = 35 PB
    30,000 patients × 10,000 cells × 500 GB = 1.5 × 10^11 GB = 150 EB
    1. Counted from current NGS data
    2. Does not include civil medical institutes
    from Dr. Yu-Tai Wang
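The two estimates above can be checked with a few lines of Ruby. This is only a back-of-the-envelope sketch: it assumes decimal units (1 TB = 1,000 GB and so on), and the helper names are made up for illustration. Multiplied out, 70,000 × 500 GB comes to 35,000 TB, i.e. 35 PB.

```ruby
# Back-of-the-envelope check of the sequencing data volumes quoted on the slide.
# Assumption: decimal (SI) units, i.e. 1 TB = 1,000 GB, 1 PB = 1,000 TB, 1 EB = 1,000 PB.

GB_PER_SAMPLE = 500 # per-sample data size taken from the slide

UNITS = %w[GB TB PB EB].freeze

# Convert a size in GB to the largest unit that keeps the value >= 1.
def humanize_gb(gigabytes)
  value = gigabytes.to_f
  unit = UNITS.first
  UNITS.drop(1).each do |next_unit|
    break if value < 1000

    value /= 1000
    unit = next_unit
  end
  format('%g %s', value, unit)
end

# 70,000 newborn babies, one 500 GB data set each.
def newborn_total_gb(babies = 70_000)
  babies * GB_PER_SAMPLE
end

# 30,000 patients, 10,000 cells per patient, 500 GB per cell.
def single_cell_total_gb(patients = 30_000, cells = 10_000)
  patients * cells * GB_PER_SAMPLE
end

puts "Newborns:     #{humanize_gb(newborn_total_gb)}"
puts "Single cells: #{humanize_gb(single_cell_total_gb)}"
```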
  9. Computing Power is Required

    ❖ HPC
    ❖ InfiniBand cluster
    ❖ Amazon EC2 cluster
    ❖ Hadoop cluster
    ❖ Many CPU cores
    ❖ Large memory
    ❖ High I/O efficiency
    http://arstechnica.com/business/2012/05/amazons-hpc-cloud-supercomputing-for-the-99/
  10. $4,828.85 per hour

    http://arstechnica.com/business/2012/04/4829-per-hour-supercomputer-built-on-amazon-cloud-to-fuel-cancer-research/
    51,132 cores, 58.78 TB RAM
    6,742 Amazon EC2 instances
    2012
    Protein simulation
    Cycle Computing System
    Ganglia HPC clusters
    Deployed by Opscode Chef
  11. Is a 10 Gb network enough for I/O?

    http://www.hpcwire.com/2013/07/08/infiniband_snaps_up_strong_super_share/
    Embarrassingly parallel: the calculations are independent of each other.
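As a minimal sketch of what "independent of each other" means in practice, the example below computes the GC content of each read separately, so the reads can be split across workers without any communication. The reads, the GC-content task, and the method names are made-up illustrations, not material from the talk.

```ruby
# Embarrassingly parallel sketch: each read is processed on its own,
# so the work can be split into chunks with no communication between workers.
# The reads and the GC-content task are hypothetical examples.

READS = %w[ACGTACGT GGGCCCAT ATATATAT CGCGCGTA TTTTGGGG ACACACAC].freeze

# GC fraction of a single read; the result depends only on that read.
def gc_content(read)
  read.count('GC').fdiv(read.length)
end

# Split the reads into slices, process each slice in its own thread,
# then concatenate the per-slice results in the original order.
def parallel_gc(reads, workers: 3)
  slice_size = [(reads.size.to_f / workers).ceil, 1].max
  threads = reads.each_slice(slice_size).map do |slice|
    Thread.new { slice.map { |read| gc_content(read) } }
  end
  threads.flat_map(&:value)
end

serial_result   = READS.map { |read| gc_content(read) }
parallel_result = parallel_gc(READS)
puts parallel_result.inspect
puts "same result as serial run: #{serial_result == parallel_result}"
```

Threads keep the sketch dependency-free. On MRI the global VM lock means CPU-bound work like this would need separate processes (or a different Ruby implementation such as JRuby) to actually run on several cores; the structure of the split stays the same.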
  12. InfiniBand is good at I/O efficiency

    http://glennklockwood.blogspot.tw/2013/12/high-performance-virtualization-sr-iov_14.html
    • Interconnect speed
    • I/O performance
    • InfiniBand system: about 3.8 GB/s of bandwidth
    • 10 Gb network: about 400 MB/s of bandwidth
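To put the two bandwidth figures above in perspective, here is a rough transfer-time estimate. It is a sketch under stated assumptions: sustained bandwidth, decimal units (1 GB = 1,000 MB), a 500 GB data set (the per-sample size quoted earlier in the talk), and the reading of the slower link as a 10 Gb Ethernet network.

```ruby
# Rough transfer-time estimate using the two bandwidth figures from the slide.
# Assumptions: sustained bandwidth, 1 GB = 1,000 MB, and a 500 GB data set.

BANDWIDTH_MB_PER_S = {
  'InfiniBand'    => 3_800, # about 3.8 GB/s
  '10 Gb network' => 400    # about 400 MB/s
}.freeze

# Seconds needed to move size_gb gigabytes at the given bandwidth.
def transfer_seconds(size_gb, bandwidth_mb_per_s)
  size_gb * 1000.0 / bandwidth_mb_per_s
end

BANDWIDTH_MB_PER_S.each do |name, bandwidth|
  minutes = transfer_seconds(500, bandwidth) / 60
  puts format('%-14s %6.1f min', name, minutes)
end
```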
  13. Data science is about DATA!

  14. Data Scientist Concerns

    ❖ Data quality
    ❖ Filtering factors
    ❖ Statistics
    ❖ Visualization
    ❖ Interpretation
  15. Programmer Concerns

    ❖ High-throughput data (Big Data) handling
    ❖ Data format / file format
    ❖ Data parsing
    ❖ Statistical tools
    ❖ Visualization
    ❖ Profit / markets
  16. Biology

  17. http://businessintelligence.com/bi-insights/the-personalized-medicine-revolution-is-almost-here/

  18. A Dream of Personalized Medicine from Dr. Yen-Hua Huang

  19. Genomic Disease http://www1.imperial.ac.uk/computationalsystemsmedicine/biomolecularmedicine/personalised/

  20. Cure by Medicines http://scienceroll.com/2008/04/25/personalized-medicine-real-clinical-examples/

  21. Personalized Medicine http://www.genomicslawreport.com/index.php/tag/personalized-medicine/

  22. Personal Genomic Analysis http://www.thecureisnow.org/index.php/our-strategy/philosophy-of-tcin/personalized-medicine

  23. http://www.genengnews.com/insight-and-intelligence/personalized-medicine-not-quite-there-yet/77899649/

  24. DNA http://cisncancer.org/research/what_we_know/omics/personalized_medicine_02.html

  25. DNA Sequencing http://www.scq.ubc.ca/genome-projects-uncovering-the-blueprints-of-biology/ http://www.broadinstitute.org/blog/beyond-genome-new-uses-dna-sequencers

  26. No Teach for Reading DNA http://intellimedix.com

  27. Do The Right Things http://www.dnadirect.com

  28. ID Mapping of Databases

    http://biodbnet.abcc.ncifcrf.gov/dbInfo/netGraph.php
    Each node is a database.
    Each database has its own unique ID.
    These IDs are connected as a network.
    I think handling this complexity should be easy for the people sitting here.
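The network view above maps naturally onto a graph search. The sketch below treats each database as a node and each direct ID conversion as an edge, then uses breadth-first search to find the shortest chain of conversions. The edges are hypothetical placeholders chosen for illustration; they are not the real bioDBnet graph.

```ruby
# Sketch of ID mapping as a graph search. Each node is a database and each
# edge means "IDs can be converted directly between these two databases".
# The edges below are hypothetical placeholders, not the real bioDBnet graph.

ID_LINKS = {
  'GeneSymbol'  => %w[EntrezGene],
  'EntrezGene'  => %w[GeneSymbol EnsemblGene RefSeq],
  'EnsemblGene' => %w[EntrezGene UniProt],
  'RefSeq'      => %w[EntrezGene],
  'UniProt'     => %w[EnsemblGene PDB],
  'PDB'         => %w[UniProt]
}.freeze

# Breadth-first search: returns the shortest chain of databases to go through
# when converting an ID from `source` to `target`, or nil if none exists.
def mapping_path(links, source, target)
  queue = [[source]]
  visited = { source => true }
  until queue.empty?
    path = queue.shift
    return path if path.last == target

    links.fetch(path.last, []).each do |neighbor|
      next if visited[neighbor]

      visited[neighbor] = true
      queue << (path + [neighbor])
    end
  end
  nil
end

puts mapping_path(ID_LINKS, 'GeneSymbol', 'PDB').join(' -> ')
```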
  29. Bioinformatics Sites for Rubyists

  30. NCBI http://www.ncbi.nlm.nih.gov

  31. Ensembl http://www.ensembl.org

  32. Nature Biotechnology http://www.nature.com/nbt

  33. PLOS Computational Biology http://www.ploscompbiol.org

  34. Biostars https://www.biostars.org

  35. SEQanswers http://seqanswers.com

  36. Ruby Sites for Bioinformaticians

  37. GitHub https://github.com/

  38. RubyGems.org https://rubygems.org

  39. The Ruby Toolbox https://www.ruby-toolbox.com

  40. Biogems.info http://www.biogems.info

  41. BioRuby http://bioruby.org

  42. SciRuby http://sciruby.com

  43. What programming language is best for a bioinformatics beginner?

  44. Mapping Sequence Data from Jui-Tse Hsu

  45. Simple Mapping Sequence Data

    Data: Illumina exome sequence reads -> Aligned reads -> Aligned reads (sam file) -> Aligned reads (bam file) -> Useful reads data -> Call variants -> Visualization in browsers
    Steps: Aligner (soap2, bwa, bowtie, etc.); Convert to SAM; Compress to BAM; Index, Sort, Remove PCR duplicates (Rmdup)
    Sequence reads: 1. .seq -> fastq 2. Illumina score -> Phred score
    Useful reads data: 1. cleaned bam file 2. quality control, get statistics, mapped, unmapped, etc.
    Call variants: 1. SNVs in VCFs 2. structural variants 3. copy number changes, etc.
    from Jui-Tse Hsu
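The "Illumina score -> Phred score" step above can be sketched as a re-encoding of the FASTQ quality line. The slide does not say which encodings were involved, so this example assumes the older Illumina 1.3 to 1.7 encoding (Phred+64) as input and the Sanger / Illumina 1.8+ encoding (Phred+33) as output. The method names and the sample quality string are made up for illustration.

```ruby
# Sketch of the "Illumina score -> Phred score" step.
# Assumption: input is the older Illumina 1.3-1.7 encoding (Phred+64) and
# the target is the Sanger / Illumina 1.8+ encoding (Phred+33).

ILLUMINA_OFFSET = 64
SANGER_OFFSET   = 33

# Decode a Phred+64 quality string into numeric Phred scores.
def phred_scores(illumina_quality)
  illumina_quality.each_byte.map { |byte| byte - ILLUMINA_OFFSET }
end

# Re-encode a Phred+64 quality string as Phred+33.
def illumina_to_sanger(illumina_quality)
  phred_scores(illumina_quality).map { |q| (q + SANGER_OFFSET).chr }.join
end

# Probability that a base call with Phred score q is wrong: 10^(-q/10).
def error_probability(q)
  10.0**(-q / 10.0)
end

quality = 'hhhhgfeB' # made-up Phred+64 quality line
puts phred_scores(quality).inspect
puts illumina_to_sanger(quality)
```

In a real pipeline the input encoding should be detected or documented before converting, since applying the wrong offset silently shifts every score by 31.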
  46. C/C++

    ❖ Key algorithms
    ❖ Written in C/C++
    ❖ Foundation tools
    ❖ BWA
    ❖ Bowtie / Bowtie2
    ❖ samtools / bamtools
    ❖ GMAP / GSNAP
    ❖ BLAT
    ❖ Tophat
  47. Analysis Pipeline

    http://genomebiology.com/2010/11/12/220
    Overview of the RNA-seq analysis pipeline for detecting differential expression
  48. Perl

    ❖ First language
    ❖ BioPerl
    ❖ Ensembl
    http://millionchimpanzees.blogspot.tw/2011/09/book-review-learning-perl-sixth-edition.html

  49. Java

    ❖ Good part of Java
    ❖ GATK
    ❖ Taverna
    ❖ Hadoop
    http://shop.oreilly.com/product/9780596803742.do
  50. R

    ❖ Statistical tools
    ❖ Bioconductor
    ❖ edgeR
    ❖ Data Mining and Analysis Books
    http://exploringdata.github.io/data-visualization-books/analysis/
  51. Python

    ❖ Young people
    ❖ Galaxy
    http://news.oreilly.com/2008/08/python-for-unix-and-linux-syst.html

  52. The Ruby Way in Bioinformatics

  53. What kinds of libraries do you think are important?

  54. Foundation gems

    ❖ activerecord
    ❖ nokogiri
    ❖ ffi
    ❖ parallel
    ❖ bioruby
    ❖ sciruby
  55. C binding & wrapper

    ❖ bio-samtools
    ❖ bio-bwa
    ❖ bio-affy
    ❖ bio-faster
    ❖ mpi-ruby
    ❖ bio-grid
    ❖ gsl
    ❖ rb-gsl
    ❖ nmatrix
    ❖ sambamba - D language
  56. Data parser / analyser

    ❖ bio-genomic-interval
    ❖ bio-blastxmlparser
    ❖ bio-assembly
    ❖ bio-gff3
    ❖ bio-gff3-pltools
    ❖ bio-alignment
    ❖ bio-maf
    ❖ bio-table
    ❖ bio-rdf
    ❖ bio-vcf
    ❖ bio-velvet
    ❖ bio-gngm
    ❖ bio-gag
    ❖ bio-dbsnp
  57. Data parser / analyser

    ❖ bio-phyloxml
    ❖ bio-jplace
    ❖ bio-gex
    ❖ bio-ipcress
    ❖ bio-stockholm
    ❖ bio-synreport
    ❖ bio-cigar
    ❖ bio-wolf_psort_wrapper
    ❖ bio-hmmer3_report
    ❖ bio-dbla-finder
    ❖ bio-newbler_outputs
    ❖ bio-sra_fastq_dumper
    ❖ bigbio
  58. Data parser / analyser - protein

    ❖ protk
    ❖ mascot-dat
    ❖ bio-protparam
    ❖ bio-plasmoap
    ❖ bio-signalp
    ❖ bio-exportpred
    ❖ bio-hydropathy
    ❖ bio-epitope
    ❖ bio-bio-orthomcl
    ❖ bio-isoelectric_point
    ❖ bio-octopus
    ❖ bio-tm_hmm
    ❖ bio-aliphatic_index
  59. Database / Web API

    ❖ ruby-ensembl-api
    ❖ bio-ucsc-api
    ❖ bio-liftover
    ❖ intermine
    ❖ bio-eupathdb
    ❖ bio-krona
    ❖ bio-sra
    ❖ bio-sradl
    http://www.ensembl.org
  60. Statistics

    ❖ statsample
    ❖ statsample-sem
    ❖ statsample-optimization
    ❖ statsample-timeseries
    ❖ distribution
    ❖ rinruby
    http://www.ncss.com/software/ncss/survival-analysis-in-ncss
  61. SVG & Graph

    ❖ rubyvis
    ❖ plotrb
    ❖ bio-svgenes
    ❖ bio-vis
    ❖ gnuplot
    http://rubyvis.rubyforge.org
  62. Tools

    ❖ minimization
    ❖ integration
    ❖ quorum - rails engine

  63. I Am Not an Analyst, I Am a Programmer.

  64. What can I get involved in?

  65. Pipeline / Workflow

    Galaxy - Python
    Taverna - Java
    ??? - Ruby
  66. Web System

    ❖ Data warehouse
    ❖ Pipeline management
    ❖ Coordination center
    ❖ Visualisation
  67. Cloud / Distributed / Parallel http://www.mynamesnotmommy.com/yes-there-are-dumb-questions/question-mark/

  68. What Are We Doing with Ruby?

  69. Ensembl Virtual Machine

    ❖ Powered by Veewee, Vagrant and Chef
    ❖ Automatically builds a versioned Ensembl system (Perl)
    ❖ Includes database, queuing services and analysis tools
    ❖ Multiple sites, multiple species in one virtual machine
    ❖ Helps to build local and custom systems
    from Tse-Ching Ho
  70. Ensembl Virtual Machine: Ensembl VM Automation

    Use existing Vagrant box
    Prepare SOP for Chef recipes
    Provision VM with Chef recipes
    Write Chef recipes
    Export VM by VirtualBox
    Set up Vagrantfile
    Create Vagrant box by Veewee
    Write definition of Vagrant box by Veewee
    from Tse-Ching Ho
  71. Ensembl Virtual Machine Web view of Ensembl from Tse-Ching Ho

  72. DR. RAW

    ❖ Derived from DRAW and SneakPeek
    ❖ Composed of C/C++, bash, Perl, Java, Ruby
    ❖ Covers both DNA and RNA resequencing analysis
    ❖ Enhanced quality control for DNA and RNA
    ❖ Distributed computing pipeline
    ❖ Supports PBS, LSF, SGE platforms (queuing systems)
    from Hannah Lin
  73. DR. RAW

    Panels: Analysis Tools, Analysis Pipeline, Quality Control, Resource Manager System
    Input: DNA Sequencing data, RNA Sequencing data
    Quality control: DNA QC (Forward : Reverse), RNA QC (Forward : Reverse)
    Tools: BWA-0.7.7, Samtools-0.1.19, GATK-3.1, GSNAP-13-10-25, Cufflink-13-11, FusionGene …
    Resource managers: SGE (Sun Grid Engine), PBS (Portable Batch System), LSF (Load Sharing Facility)
    Legend: Green: new components; Red: updated components
    from Hannah Lin
  74. DR. RAW Web view by Rails from Hannah Lin

  75. Neo4j - JRuby Data Parser

    ❖ Graph database for data integration of discrete clinical research documents
    ❖ Original data are Excel/CSV files collected at different times by different people
    ❖ Neo4j is good for cleaning up such a massive data set
    ❖ Cooperation between biologist and programmer
    from Wei-Ming Wu, Chia-Hsuan Lee
  76. Neo4j - JRuby Data Parser from Wei-Ming Wu, Chia-Hsuan Lee

  77. Neo4j - JRuby Data Parser from Wei-Ming Wu, Chia-Hsuan Lee

    Collision Rate of Input Data: 1.3 %
  78. API Server for Third Party Firm

    ❖ API server based on Rails, run by JRuby
    ❖ ActiveRecord models for Oracle database
    ❖ activerecord-oracle_enhanced-adapter gem
    ❖ Import Excel files to third-party GUI client
    ❖ Third-party server sends XML requests to API server
    from Wei-Ming Wu, Sean Wang
  79. API Server for Third Party Firm

    Groups: Third Party; Our Servers
    Components: TCHC server; API server (rails, jruby); CSIS (java, oracle); Windows GUI
    Flows: Upload data; Send data by XML; Parse request; Write into database; Read data by client program
    from Wei-Ming Wu, Sean Wang
  80. Daily Checking Rule

    ❖ Based on Rails, run by JRuby
    ❖ ActiveRecord models for Oracle database
    ❖ activerecord-oracle_enhanced-adapter gem
    ❖ Users can define rules for checking data, usually values in filled-in forms
    ❖ Checking rules run daily, not while forms are being filled in
    from Wei-Ming Wu, Sean Wang
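A minimal sketch of how user-defined checking rules might be run as a batch is shown below. Everything in it is hypothetical: the field names, the rules, and the in-memory records stand in for the ActiveRecord models and Oracle tables of the real system, which are not described in the talk.

```ruby
# Sketch of user-defined checking rules run as a daily batch.
# Field names, rules, and records are hypothetical; the real system keeps
# its data in an Oracle database behind ActiveRecord, not reproduced here.

Rule = Struct.new(:name, :field, :check)

RULES = [
  Rule.new('age in range', :age, ->(v) { v.is_a?(Integer) && v.between?(0, 120) }),
  Rule.new('weight present', :weight_kg, ->(v) { !v.nil? }),
  Rule.new('systolic plausible', :systolic, ->(v) { v.nil? || v.between?(50, 300) })
].freeze

# Apply every rule to every record and collect the violations.
def run_checks(records, rules)
  records.flat_map do |record|
    rules.reject { |rule| rule.check.call(record[rule.field]) }
         .map { |rule| { id: record[:id], rule: rule.name, value: record[rule.field] } }
  end
end

FORMS = [
  { id: 1, age: 42,  weight_kg: 70.5, systolic: 120 },
  { id: 2, age: 130, weight_kg: nil,  systolic: 110 },
  { id: 3, age: 35,  weight_kg: 60.0, systolic: 400 }
].freeze

run_checks(FORMS, RULES).each do |violation|
  puts "form #{violation[:id]}: failed '#{violation[:rule]}' (value: #{violation[:value].inspect})"
end
```

Keeping each rule as data plus a small predicate means new rules can be added without touching the batch runner, which fits the idea of rules being defined by users.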
  81. Daily Checking Rule from Wei-Ming Wu, Sean Wang

  82. Daily Checking Rule from Wei-Ming Wu, Sean Wang

  83. Daily Checking Rule from Wei-Ming Wu, Sean Wang

  84. Daily Checking Rule from Wei-Ming Wu, Sean Wang

  85. Patient Randomization

    ❖ Based on Rails, run by JRuby
    ❖ ActiveRecord models for Oracle database
    ❖ activerecord-oracle_enhanced-adapter gem
    ❖ Assign patients into different groups by randomization method
    ❖ Cooperation between statistician and programmer
    from Wei-Ming Wu, Sean Wang
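The talk does not say which randomization method the system uses. As an illustration only, the sketch below uses permuted-block randomization, one common scheme that keeps group sizes balanced as patients are enrolled. The group names, block size, and method name are assumptions.

```ruby
# Sketch of permuted-block randomization, one common way to assign patients
# to groups. The method, group names, and block size are assumptions for
# illustration; the talk does not describe the actual algorithm.

GROUPS = %w[treatment control].freeze

# Each block contains every group equally often in shuffled order, so the
# groups are exactly balanced after every complete block.
def block_randomize(patient_ids, block_size: 4, seed: nil)
  unless block_size.positive? && (block_size % GROUPS.size).zero?
    raise ArgumentError, 'block_size must be a positive multiple of the group count'
  end

  rng = seed ? Random.new(seed) : Random.new
  per_group = block_size / GROUPS.size
  assignments = {}
  patient_ids.each_slice(block_size) do |block|
    labels = (GROUPS * per_group).shuffle(random: rng)
    block.each_with_index { |id, index| assignments[id] = labels[index] }
  end
  assignments
end

assignments = block_randomize((1..12).to_a, seed: 2014)
assignments.each { |id, group| puts "patient #{id}: #{group}" }
puts GROUPS.map { |group| [group, assignments.values.count(group)] }.to_h.inspect
```

Passing a seed makes the allocation reproducible for auditing; leaving it out gives a fresh random sequence on each run.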
  86. Patient Randomization from Wei-Ming Wu, Sean Wang

  87. Patient Randomization from Wei-Ming Wu, Sean Wang

  88. Patient Randomization

    Assign patients to treatment groups
    from Wei-Ming Wu, Sean Wang
  89. Database Statistics Dashboard

    ❖ Based on Rails, run by JRuby
    ❖ ActiveRecord models for Oracle database
    ❖ activerecord-oracle_enhanced-adapter gem
    ❖ google_visualr gem for visualization
    ❖ Count number of projects, forms, fields, records and patients
    from Wei-Ming Wu, Winnie Lui
  90. Database Statistics Dashboard from Wei-Ming Wu, Winnie Lui

  91. Education

  92. Learning Bioinformatics

    ❖ http://www.nature.com/nbt/journal/v31/n11/full/nbt.2740.html
    ❖ http://www.liacs.nl/~hoogeboo/mcb/nature_primer.html
    ❖ http://www.mygoblet.org - Python, R
    ❖ http://www.biotnet.org
  93. Books for Beginners http://practicalcomputing.org Python

  94. Python Book for Bioinformatics http://shop.oreilly.com/product/9780596154516.do

  95. Python is much more successful in teaching than Ruby

  96. Do we lack a killer application in Ruby? http://www.witardroadbaptist.org/im-new/im-not-sure-im-ready-for-church-yet/

  97. We Need People!!

  98. Are You Ready To Be A Data Scientist Or A Bioinformatician?

  99. Markets http://www.genengnews.com/gen-articles/personalized-medicine-health-economic-aspects/4824/

  100. Under development: does Asia have enough market share?

    http://www.bccresearch.com/market-research/biotechnology/bioinformatics-market-technology-bio051b.html
  101. Topics to take action on

    ❖ Data generation and data management
    ❖ Data analysis and software
    ❖ Data processing and storage
    ❖ Application of bioinformatics in pharma research and development
    http://www.giichinese.com.tw/report/bc268909-bioinformatics-technologies-global-markets.html
  102. Health Care in Cloud

    ❖ Health promotion cloud
    ❖ Vaccination cloud
    ❖ Exercise cloud
    ❖ Workplace wellness
    ❖ Physical checkup cloud
    ❖ Welfare cloud
    from Dr. Chi-Hung Lin
  103. Code For Bioinformatics

  104. Q & A