Ruby on bioinformatics

tsechingho
April 26, 2014

RubyConf Taiwan 2014


Transcript

  1. Ruby Conference Taiwan 2014
    Ruby on Bioinformatics
    Tse-Ching Ho
    何澤清
    @tsechingho
    2014 / 4 / 26

  2. Horse + Stripe = Zebra

  3. Biology + Informatics = Bioinformatics

  4. Age of Big Data

  5. Age of Data Science

  6. High-Throughput Data
    ❖ Big Data
      ❖ Small files, but very many of them
      ❖ Only a few files, but each one is large
    ❖ Data size in bioinformatics
      ❖ 1,000,000,000 records for a subject (person) is normal

  7. The Storage Demand is Increasing
    from Dr. Yu-Tai Wang

  8. Data Size of Sequencing After 5 Years
    https://www.nanoporetech.com
    70,000 newborn babies × 500 GB = 3.5 × 10^7 GB = 35 PB
    30,000 patients × 10,000 cells × 500 GB = 1.5 × 10^11 GB = 150 EB
    from Dr. Yu-Tai Wang
    1. Estimated from current NGS data sizes
    2. Does not include civil medical institutes
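
The two estimates on this slide can be checked with a few lines of Ruby. This is only a sketch of the arithmetic: it assumes decimal (SI) units, where 1 PB = 10^6 GB and 1 EB = 10^9 GB, and the constant names are made up for illustration.

```ruby
# Decimal (SI) storage units, expressed in gigabytes (assumed convention).
GB_PER_PB = 1_000_000
GB_PER_EB = 1_000_000_000

GB_PER_SAMPLE = 500 # per-sample data size quoted on the slide

# Newborn screening: one 500 GB data set per baby.
newborn_gb = 70_000 * GB_PER_SAMPLE

# Single-cell sequencing: 10,000 cells for each of 30,000 patients.
patient_gb = 30_000 * 10_000 * GB_PER_SAMPLE

puts "Newborns: #{newborn_gb} GB = #{newborn_gb / GB_PER_PB} PB"
puts "Patients: #{patient_gb} GB = #{patient_gb / GB_PER_EB} EB"
```

With these units the newborn estimate comes to 35 PB and the patient estimate to 150 EB.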

  9. Computing Power is Required
    ❖ HPC
    ❖ InfiniBand cluster
    ❖ Amazon EC2 cluster
    ❖ Hadoop cluster
    ❖ Many CPU cores
    ❖ Large memory
    ❖ High I/O efficiency
    http://arstechnica.com/business/2012/05/amazons-hpc-cloud-supercomputing-for-the-99/

  10. http://arstechnica.com/business/2012/04/4829-per-hour-supercomputer-built-on-amazon-cloud-to-fuel-cancer-research/
    $4,828.85 per hour
    51,132 cores, 58.78 TB RAM
    6,742 Amazon EC2 instances
    2012
    Protein simulation
    Cycle Computing System
    Ganglia HPC clusters
    Deployed by Opscode Chef

  11. http://www.hpcwire.com/2013/07/08/infiniband_snaps_up_strong_super_share/
    Is a 10 Gb network enough for I/O?
    Embarrassingly parallel: the calculations are independent of each other.
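
As a minimal sketch of what "independent of each other" means in practice, the Ruby below counts G and C bases chunk by chunk. The reads and the `gc_count` helper are hypothetical; the point is only that no chunk needs a result from any other chunk, so the partial counts can be computed in any order and summed at the end.

```ruby
# Hypothetical input: a handful of short DNA reads.
READS = %w[ACGTGGCA TTTTAAAA GGGCCCAT ATATGCGC CCGGTTAA GATTACAG].freeze

# Count G and C bases in one chunk. The method touches no shared state,
# so chunks can be processed in any order or at the same time.
def gc_count(chunk)
  chunk.sum { |read| read.count("GC") }
end

chunks = READS.each_slice(2).to_a

# One thread per chunk; each thread returns its own partial result.
partials = chunks.map { |chunk| Thread.new { gc_count(chunk) } }.map(&:value)

total_gc = partials.sum
puts "GC bases: #{total_gc} of #{READS.sum(&:length)}"
```

On MRI the threads share one interpreter lock, so CPU-bound work like this would need separate processes or JRuby to run truly in parallel; the structure of the computation stays the same.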

  12. http://glennklockwood.blogspot.tw/2013/12/high-performance-virtualization-sr-iov_14.html
    InfiniBand is good at I/O efficiency
    • Interconnect speed
    • I/O performance
    • An InfiniBand system delivers about 3.8 GB/s of bandwidth
    • A 10 Gb network delivers about 400 MB/s of bandwidth
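
To put the two bandwidth figures in perspective, here is a rough back-of-the-envelope sketch. The 500 GB payload is an assumption borrowed from the per-sample size quoted earlier in the deck, and the calculation ignores protocol overhead and disk speed.

```ruby
DATA_GB = 500.0              # assumed payload: one 500 GB data set
INFINIBAND_GB_PER_S = 3.8    # bandwidth quoted on the slide
ETHERNET_GB_PER_S = 0.4      # 400 MB/s quoted on the slide

# Ideal transfer time in seconds at a sustained rate.
def transfer_seconds(size_gb, rate_gb_per_s)
  size_gb / rate_gb_per_s
end

ib_seconds = transfer_seconds(DATA_GB, INFINIBAND_GB_PER_S)
eth_seconds = transfer_seconds(DATA_GB, ETHERNET_GB_PER_S)

puts format("InfiniBand:    %.0f s", ib_seconds)
puts format("10 Gb network: %.0f s", eth_seconds)
puts format("Ratio:         %.1fx", eth_seconds / ib_seconds)
```

Under these assumptions the same data set takes roughly two minutes over InfiniBand and about twenty minutes over the slower link.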

  13. Data science is about DATA!

  14. Data Scientist Concerns
    ❖ Data quality
    ❖ Filtering factors
    ❖ Statistics
    ❖ Visualization
    ❖ Interpretation

  15. Programmers Also Care About
    ❖ Handling high-throughput data (Big Data)
    ❖ Data format / file format
    ❖ Data parsing
    ❖ Statistical tools
    ❖ Visualization
    ❖ Profit / markets

  16. http://businessintelligence.com/bi-insights/the-personalized-medicine-revolution-is-almost-here/

  17. A Dream of Personalized Medicine
    from Dr. Yen-Hua Huang

  18. Genomic Disease
    http://www1.imperial.ac.uk/computationalsystemsmedicine/biomolecularmedicine/personalised/

  19. Cure by Medicines
    http://scienceroll.com/2008/04/25/personalized-medicine-real-clinical-examples/

  20. Personalized Medicine
    http://www.genomicslawreport.com/index.php/tag/personalized-medicine/

  21. Personal Genomic Analysis
    http://www.thecureisnow.org/index.php/our-strategy/philosophy-of-tcin/personalized-medicine

  22. http://www.genengnews.com/insight-and-intelligence/personalized-medicine-not-quite-there-yet/77899649/

  23. DNA
    http://cisncancer.org/research/what_we_know/omics/personalized_medicine_02.html

  24. DNA Sequencing
    http://www.scq.ubc.ca/genome-projects-uncovering-the-blueprints-of-biology/
    http://www.broadinstitute.org/blog/beyond-genome-new-uses-dna-sequencers

  25. No Teach for Reading DNA
    http://intellimedix.com

  26. Do The Right Things
    http://www.dnadirect.com

  27. http://biodbnet.abcc.ncifcrf.gov/dbInfo/netGraph.php
    ID mapping of databases
    Each node is a database.
    Each database has its own unique IDs.
    These IDs are connected as a network.

    I think handling this complexity should
    be easy for the people sitting here.
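
One way to picture the ID network is as a graph in which a breadth-first search finds the shortest chain of databases linking two ID types. The sketch below is illustrative only: the `LINKS` table is a tiny hand-written example, not the real network shown on the slide, and `mapping_path` is a hypothetical helper.

```ruby
# Illustrative cross-reference links between databases (assumed, not the
# real network from the slide, which is far larger).
LINKS = [
  ["Ensembl Gene", "Entrez Gene"],
  ["Entrez Gene", "RefSeq"],
  ["Entrez Gene", "UniProt"],
  ["UniProt", "PDB"]
].freeze

# Build an undirected adjacency list: database name => linked databases.
GRAPH = Hash.new { |hash, key| hash[key] = [] }
LINKS.each do |a, b|
  GRAPH[a] << b
  GRAPH[b] << a
end

# Breadth-first search: returns the shortest chain of databases from
# +from+ to +to+, or nil when the two are not connected.
def mapping_path(graph, from, to)
  queue = [[from]]
  seen = { from => true }
  until queue.empty?
    path = queue.shift
    node = path.last
    return path if node == to

    graph.fetch(node, []).each do |neighbor|
      next if seen[neighbor]

      seen[neighbor] = true
      queue << path + [neighbor]
    end
  end
  nil
end

p mapping_path(GRAPH, "Ensembl Gene", "PDB")
```

Each hop in the returned path corresponds to one ID translation step between two databases.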

  28. Bioinformatics Sites for Rubyists

  29. NCBI
    http://www.ncbi.nlm.nih.gov

  30. Ensembl
    http://www.ensembl.org

  31. Nature Biotechnology
    http://www.nature.com/nbt

  32. PLOS Computational Biology
    http://www.ploscompbiol.org

  33. Biostars
    https://www.biostars.org

  34. SEQanswers
    http://seqanswers.com

  35. Ruby Sites for Bioinformaticians

  36. GitHub
    https://github.com/

  37. RubyGems.org
    https://rubygems.org

  38. The Ruby Toolbox
    https://www.ruby-toolbox.com

  39. Biogems.info
    http://www.biogems.info

  40. BioRuby
    http://bioruby.org

  41. SciRuby
    http://sciruby.com

  42. What programming language is
    best for a bioinformatics beginner?

  43. Mapping Sequence Data
    from Jui-Tse Hsu

  44. Simple Mapping Sequence Data
    Illumina exome sequence reads
      1. .seq -> fastq
      2. Illumina score -> Phred score
    -> Aligner (soap2, bwa, bowtie, etc.)
    -> Aligned reads
    -> Convert to SAM
    -> Aligned reads (sam file)
    -> Compress to BAM
    -> Aligned reads (bam file)
    -> Index, sort, remove PCR duplicates (rmdup)
    -> Useful reads data
      1. cleaned bam file
      2. quality control, get statistics, mapped, unmapped, etc.
    -> Call variants
      1. SNVs in VCFs
      2. structural variants
      3. copy number changes, etc.
    -> Visualization in browsers
    from Jui-Tse Hsu
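
The "Illumina score -> Phred score" step can be sketched in Ruby as a re-encoding of the FASTQ quality string. This sketch assumes the input uses the Phred+64 ASCII encoding of Illumina pipeline versions 1.3 to 1.7 and that the target is the Phred+33 (Sanger) encoding; older Solexa scores use a different scale and are not handled here. The method names are hypothetical.

```ruby
ILLUMINA_OFFSET = 64 # assumption: Illumina 1.3-1.7 quality encoding
SANGER_OFFSET = 33   # Phred+33, the encoding most aligners expect

# Re-encode a Phred+64 quality string as Phred+33.
def illumina_to_sanger(quality)
  quality.each_char.map do |char|
    score = char.ord - ILLUMINA_OFFSET
    raise ArgumentError, "not a Phred+64 character: #{char.inspect}" if score.negative?

    (score + SANGER_OFFSET).chr
  end.join
end

# Decode a quality string into numeric Phred scores.
def phred_scores(quality, offset = SANGER_OFFSET)
  quality.each_char.map { |char| char.ord - offset }
end

illumina_quality = "hhhhgfed" # made-up quality line from a FASTQ record
sanger_quality = illumina_to_sanger(illumina_quality)

puts sanger_quality
p phred_scores(sanger_quality)
```

The numeric scores are unchanged by the conversion; only the ASCII offset used to store them differs.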

  45. C/C++
    ❖ Key algorithms
      ❖ Written in C/C++
    ❖ Foundation tools
      ❖ BWA
      ❖ Bowtie / Bowtie2
      ❖ samtools / bamtools
      ❖ GMAP / GSNAP
      ❖ BLAT
      ❖ TopHat

  46. http://genomebiology.com/2010/11/12/220
    Analysis Pipeline
    Overview of the RNA-seq analysis
    pipeline for detecting differential
    expression

  47. Perl
    ❖ First language
    ❖ BioPerl
    ❖ Ensembl
    http://millionchimpanzees.blogspot.tw/2011/09/book-review-learning-perl-sixth-edition.html

  48. Java
    ❖ The good parts of Java
    ❖ GATK
    ❖ Taverna
    ❖ Hadoop
    http://shop.oreilly.com/product/9780596803742.do

  49. R
    ❖ Statistical tools
    ❖ Bioconductor
    ❖ edgeR
    ❖ Data mining and analysis books
    http://exploringdata.github.io/data-visualization-books/analysis/

  50. Python
    ❖ Young people
    ❖ Galaxy
    http://news.oreilly.com/2008/08/python-for-unix-and-linux-syst.html

  51. The Ruby Way in Bioinformatics

  52. What kinds of libraries
    do you think are important?

  53. Foundation gems
    ❖ activerecord
    ❖ nokogiri
    ❖ ffi
    ❖ parallel
    ❖ bioruby
    ❖ sciruby

  54. C binding & wrapper
    ❖ bio-samtools
    ❖ bio-bwa
    ❖ bio-affy
    ❖ bio-faster
    ❖ mpi-ruby
    ❖ bio-grid
    ❖ gsl
    ❖ rb-gsl
    ❖ nmatrix
    ❖ sambamba - D language

  55. Data parser / analyser
    ❖ bio-genomic-interval
    ❖ bio-blastxmlparser
    ❖ bio-assembly
    ❖ bio-gff3
    ❖ bio-gff3-pltools
    ❖ bio-alignment
    ❖ bio-maf
    ❖ bio-table
    ❖ bio-rdf
    ❖ bio-vcf
    ❖ bio-velvet
    ❖ bio-gngm
    ❖ bio-gag
    ❖ bio-dbsnp

  56. Data parser / analyser
    ❖ bio-phyloxml
    ❖ bio-jplace
    ❖ bio-gex
    ❖ bio-ipcress
    ❖ bio-stockholm
    ❖ bio-synreport
    ❖ bio-cigar
    ❖ bio-wolf_psort_wrapper
    ❖ bio-hmmer3_report
    ❖ bio-dbla-finder
    ❖ bio-newbler_outputs
    ❖ bio-sra_fastq_dumper
    ❖ bigbio

  57. Data parser / analyser - protein
    ❖ protk
    ❖ mascot-dat
    ❖ bio-protparam
    ❖ bio-plasmoap
    ❖ bio-signalp
    ❖ bio-exportpred
    ❖ bio-hydropathy
    ❖ bio-epitope
    ❖ bio-orthomcl
    ❖ bio-isoelectric_point
    ❖ bio-octopus
    ❖ bio-tm_hmm
    ❖ bio-aliphatic_index

  58. Database / Web API
    ❖ ruby-ensembl-api
    ❖ bio-ucsc-api
    ❖ bio-liftover
    ❖ intermine
    ❖ bio-eupathdb
    ❖ bio-krona
    ❖ bio-sra
    ❖ bio-sradl
    http://www.ensembl.org

  59. Statistics
    ❖ statsample
    ❖ statsample-sem
    ❖ statsample-optimization
    ❖ statsample-timeseries
    ❖ distribution
    ❖ rinruby
    http://www.ncss.com/software/ncss/survival-analysis-in-ncss

  60. SVG & Graph
    ❖ rubyvis
    ❖ plotrb
    ❖ bio-svgenes
    ❖ bio-vis
    ❖ gnuplot
    http://rubyvis.rubyforge.org

  61. Tools
    ❖ minimization
    ❖ integration
    ❖ quorum - Rails engine

  62. I Am Not an Analyst,
    I Am a Programmer.

  63. What can I get involved in?

  64. Pipeline / Workflow
    Galaxy - Python
    Taverna - Java
    ??? - Ruby
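
To make the idea of a workflow engine concrete, here is a very small sketch in Ruby. It is not an existing tool and does not fill the "???" above; the `Workflow` class and the step names are hypothetical. It only shows the core of any pipeline runner: steps declare what they depend on, and the standard library's `TSort` module orders them so prerequisites run first.

```ruby
require "tsort"

# A tiny workflow: each step names the steps it depends on.
class Workflow
  include TSort

  def initialize
    @steps = {}
  end

  # Register a step with its prerequisites and the work to run.
  def step(name, after: [], &work)
    @steps[name] = { after: after, work: work }
    self
  end

  # Run every step once, prerequisites first; returns the order used.
  def run
    tsort.each { |name| @steps.fetch(name)[:work].call }
  end

  # TSort hook: enumerate all steps.
  def tsort_each_node(&block)
    @steps.each_key(&block)
  end

  # TSort hook: enumerate the prerequisites of one step.
  def tsort_each_child(name, &block)
    @steps.fetch(name)[:after].each(&block)
  end
end

log = []
flow = Workflow.new
flow.step(:call_variants, after: [:sort_bam]) { log << :call_variants }
flow.step(:align) { log << :align }
flow.step(:sort_bam, after: [:align]) { log << :sort_bam }

order = flow.run
p order
```

Steps can be registered in any order; a circular dependency makes `TSort` raise `TSort::Cyclic` instead of looping forever.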

  65. Web System
    ❖ Data warehouse
    ❖ Pipeline management
    ❖ Coordination center
    ❖ Visualisation

  66. Cloud / Distributed / Parallel
    http://www.mynamesnotmommy.com/yes-there-are-dumb-questions/question-mark/

  67. What Are We Doing with Ruby?

  68. Ensembl Virtual Machine
    ❖ Powered by Veewee, Vagrant and Chef
    ❖ Automatically builds a versioned Ensembl system (Perl)
    ❖ Includes database, queuing services and analysis tools
    ❖ Multiple sites and multiple species in one virtual machine
    ❖ Helps to build local and custom systems
    from Tse-Ching Ho

  69. Ensembl Virtual Machine
    Ensembl VM Automation
    ❖ Use an existing Vagrant box
    ❖ Prepare SOP for Chef recipes
    ❖ Provision VM with Chef recipes
    ❖ Write Chef recipes
    ❖ Export VM by VirtualBox
    ❖ Set up Vagrantfile
    ❖ Create Vagrant box by Veewee
    ❖ Write definition of Vagrant box by Veewee
    from Tse-Ching Ho

  70. Ensembl Virtual Machine
    Web view of Ensembl
    from Tse-Ching Ho

  71. DR. RAW
    ❖ Derived from DRAW and SneakPeek
    ❖ Composed of C/C++, Bash, Perl, Java and Ruby
    ❖ Covers both DNA and RNA resequencing analysis
    ❖ Enhanced quality control for DNA and RNA
    ❖ Distributed computing pipeline
    ❖ Supports PBS, LSF and SGE platforms (queuing systems)
    from Hannah Lin

  72. DR. RAW
    Layers: Analysis Tools, Analysis Pipeline, Quality Control, Resource Manager System
    Input: DNA sequencing data, RNA sequencing data
    Quality control: DNA QC (Forward : Reverse), RNA QC (Forward : Reverse)
    Tools: BWA-0.7.7, Samtools-0.1.19, GATK-3.1, GSNAP-13-10-25, Cufflink-13-11, FusionGene …
    Resource managers: SGE (Sun Grid Engine), PBS (Portable Batch System), LSF (Load Sharing Facility)
    Legend: green = new components, red = updated components
    from Hannah Lin

  73. DR. RAW
    Web view by Rails
    from Hannah Lin

  74. Neo4j - JRuby Data Parser
    ❖ Graph database for integrating data from discrete clinical
    research documents
    ❖ Source data are Excel/CSV files collected at different
    times by different people
    ❖ Neo4j is good for cleaning up such massive data sets
    ❖ Cooperation between biologist and programmer
    from Wei-Ming Wu, Chia-Hsuan Lee

  75. Neo4j - JRuby Data Parser
    from Wei-Ming Wu, Chia-Hsuan Lee

  76. Neo4j - JRuby Data Parser
    from Wei-Ming Wu, Chia-Hsuan Lee
    Collision Rate of Input Data: 1.3 %

  77. API Server for Third Party Firm
    ❖ API server based on Rails, running on JRuby
    ❖ ActiveRecord models for an Oracle database
    ❖ activerecord-oracle_enhanced-adapter gem
    ❖ Import Excel files into the third-party GUI client
    ❖ Third-party server sends XML requests to the API server
    from Wei-Ming Wu, Sean Wang

  78. API Server for Third Party Firm
    Third Party: Windows GUI, TCHC server
    Our Servers: API server (Rails, JRuby), CSIS (Java, Oracle)
    Flow: Upload data -> Send data by XML -> Parse request -> Write into database -> Read data by client program
    from Wei-Ming Wu, Sean Wang

  79. Daily Checking Rule
    ❖ Based on Rails, running on JRuby
    ❖ ActiveRecord models for an Oracle database
    ❖ activerecord-oracle_enhanced-adapter gem
    ❖ Users can define rules for checking data, usually values
    in filled-in forms
    ❖ Checking rules run daily, not at the time a form is filled in
    from Wei-Ming Wu, Sean Wang
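
The idea of user-defined checking rules run as a daily batch can be sketched in plain Ruby. This is not the Rails application described on the slide: the `Rule` struct, the field names and the thresholds are all hypothetical, and the forms are plain hashes rather than ActiveRecord models.

```ruby
# A rule pairs a form field with a predicate on that field's value.
Rule = Struct.new(:field, :message, :check)

# Hypothetical rules; in the real system users would define these.
RULES = [
  Rule.new(:age, "age must be between 0 and 120",
           ->(value) { value.is_a?(Integer) && value.between?(0, 120) }),
  Rule.new(:systolic_bp, "systolic pressure must be 50-300 mmHg",
           ->(value) { value.is_a?(Numeric) && value.between?(50, 300) })
].freeze

# Return the violation messages for one filled-in form (a Hash).
def violations(form, rules)
  rules.reject { |rule| rule.check.call(form[rule.field]) }.map(&:message)
end

# Check every stored form in one batch, as a daily job would.
def daily_report(forms, rules)
  forms.each_with_object({}) do |(id, form), report|
    found = violations(form, rules)
    report[id] = found unless found.empty?
  end
end

forms = {
  "P001" => { age: 42, systolic_bp: 120 },
  "P002" => { age: 142, systolic_bp: 118 },
  "P003" => { age: 35, systolic_bp: nil }
}

p daily_report(forms, RULES)
```

Because the rules are data, new checks can be added without touching the batch job itself.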

  80. Daily Checking Rule
    from Wei-Ming Wu, Sean Wang

  81. Daily Checking Rule
    from Wei-Ming Wu, Sean Wang

  82. Daily Checking Rule
    from Wei-Ming Wu, Sean Wang

  83. Daily Checking Rule
    from Wei-Ming Wu, Sean Wang

  84. Patient Randomization
    ❖ Based on Rails, running on JRuby
    ❖ ActiveRecord models for an Oracle database
    ❖ activerecord-oracle_enhanced-adapter gem
    ❖ Assign patients to different groups using a randomization
    method
    ❖ Cooperation between statistician and programmer
    from Wei-Ming Wu, Sean Wang
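
The slide does not say which randomization method the system uses. As one common example, the sketch below shows permuted-block randomization in plain Ruby; the `block_randomize` method, the patient IDs, the group names and the seed are all assumptions made for illustration.

```ruby
# Permuted-block randomization: every full block contains each group
# equally often, so group sizes stay balanced as patients enrol.
def block_randomize(patient_ids, groups:, block_size:, seed:)
  unless (block_size % groups.size).zero?
    raise ArgumentError, "block_size must be a multiple of the group count"
  end

  rng = Random.new(seed) # fixed seed makes the allocation reproducible
  per_group = block_size / groups.size

  patient_ids.each_slice(block_size).flat_map do |block|
    labels = (groups * per_group).shuffle(random: rng)
    block.zip(labels)
  end.to_h
end

patients = (1..12).map { |n| format("P%03d", n) }
assignment = block_randomize(patients,
                             groups: %w[treatment control],
                             block_size: 4,
                             seed: 2014)

assignment.each { |id, group| puts "#{id} -> #{group}" }
```

A statistician would normally choose the method, block size and stratification; the programmer's job is to make the allocation reproducible and auditable.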

  85. Patient Randomization
    from Wei-Ming Wu, Sean Wang

  86. Patient Randomization
    from Wei-Ming Wu, Sean Wang

  87. Patient Randomization
    from Wei-Ming Wu, Sean Wang
    Assign patients to treatment groups

  88. Database Statistics Dashboard
    ❖ Based on Rails, running on JRuby
    ❖ ActiveRecord models for an Oracle database
    ❖ activerecord-oracle_enhanced-adapter gem
    ❖ google_visualr gem for visualization
    ❖ Counts the number of projects, forms, fields, records and
    patients
    from Wei-Ming Wu, Winnie Lui

  89. Database Statistics Dashboard
    from Wei-Ming Wu, Winnie Lui

  90. Learning Bioinformatics
    ❖ http://www.nature.com/nbt/journal/v31/n11/full/nbt.2740.html
    ❖ http://www.liacs.nl/~hoogeboo/mcb/nature_primer.html
    ❖ http://www.mygoblet.org - Python, R
    ❖ http://www.biotnet.org

  91. Books for Beginners
    http://practicalcomputing.org
    Python

  92. Python Book for Bioinformatics
    http://shop.oreilly.com/product/9780596154516.do

  93. Python is far more successful
    than Ruby in teaching

  94. Do we lack a killer application written in Ruby?
    http://www.witardroadbaptist.org/im-new/im-not-sure-im-ready-for-church-yet/

  95. We Need People!!

  96. Are You Ready
    To Be a Data Scientist
    Or a Bioinformatician?

  97. Markets
    http://www.genengnews.com/gen-articles/personalized-medicine-health-economic-aspects/4824/

  98. http://www.bccresearch.com/market-research/biotechnology/bioinformatics-market-technology-bio051b.html
    Still developing: does Asia have enough market share?

  99. Topics to Act On
    ❖ Data generation and data management
    ❖ Data analysis and software
    ❖ Data processing and storage
    ❖ Application of bioinformatics in pharma research and
    development
    http://www.giichinese.com.tw/report/bc268909-bioinformatics-technologies-global-markets.html

  100. Health Care in Cloud
    ❖ Health promotion cloud
    ❖ Vaccination cloud
    ❖ Exercise cloud
    ❖ Workplace wellness
    ❖ Physical checkup cloud
    ❖ Welfare cloud
    from Dr. Chi-Hung Lin

  101. Code For Bioinformatics
