Slide 1

Slide 1 text

Ruby Conference Taiwan 2014
Ruby on Bioinformatics
Tse-Ching Ho (何澤清)
@tsechingho
2014/4/26

Slide 2

Slide 2 text

Horse + Stripe = Zebra

Slide 3

Slide 3 text

Biology + Informatics = Bioinformatics

Slide 4

Slide 4 text

Age of Big Data

Slide 5

Slide 5 text

Age of Data Science

Slide 6

Slide 6 text

High-Throughput Data
❖ Big Data
❖ File size is small but there are many files
❖ File size is large but there are just a few files
❖ Data size of bioinformatics
❖ 1,000,000,000 records for a subject (person) is normal

Slide 7

Slide 7 text

The Storage Demand is Increasing from Dr. Yu-Tai Wang

Slide 8

Slide 8 text

Data Size of Sequencing After 5 Years
https://www.nanoporetech.com
70,000 newborn babies × 500 GB = 35 PB
30,000 patients × 10,000 cells × 500 GB = 1.5 × 10^11 GB = 150 EB
1. Counted using current NGS data
2. Does not include civil medical institutes
from Dr. Yu-Tai Wang
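The estimates on this slide can be checked with a few lines of Ruby. This is only a back-of-the-envelope sketch: it assumes decimal units (1 PB = 10^6 GB, 1 EB = 10^9 GB) and takes the 500 GB per sample figure from the slide; the constant and variable names are illustrative. Under these assumptions, 70,000 samples of 500 GB come to 35 PB.

```ruby
# Back-of-the-envelope check of the storage estimates (decimal units assumed).
GB_PER_SAMPLE = 500        # figure quoted on the slide
GB_PER_PB = 10**6
GB_PER_EB = 10**9

newborn_gb = 70_000 * GB_PER_SAMPLE
patient_gb = 30_000 * 10_000 * GB_PER_SAMPLE

newborn_pb = newborn_gb.to_f / GB_PER_PB
patient_eb = patient_gb.to_f / GB_PER_EB

puts "Newborns: #{newborn_pb} PB"
puts "Patients: #{patient_eb} EB"
```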

Slide 9

Slide 9 text

Computing Power is Required
❖ HPC
❖ Infiniband cluster
❖ Amazon EC2 cluster
❖ Hadoop cluster
❖ Many CPU cores
❖ Large memory
❖ High I/O efficiency
http://arstechnica.com/business/2012/05/amazons-hpc-cloud-supercomputing-for-the-99/

Slide 10

Slide 10 text

$4,828.85 per hour
51,132 cores, 58.78 TB RAM
6,742 Amazon EC2 instances
2012
Protein simulation
Cycle Computing System
Ganglia HPC clusters
Deployed by Opscode Chef
http://arstechnica.com/business/2012/04/4829-per-hour-supercomputer-built-on-amazon-cloud-to-fuel-cancer-research/

Slide 11

Slide 11 text

Is a 10 Gb network enough for I/O?
Embarrassingly parallel: the calculations are independent of each other.
http://www.hpcwire.com/2013/07/08/infiniband_snaps_up_strong_super_share/
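A small Ruby sketch may help show what "independent of each other" means in practice. Here the GC content of each read is computed on its own, so the reads can be split across workers in any way. The reads and method names are hypothetical, and standard-library threads are used only to keep the sketch self-contained; for CPU-bound work on MRI, separate processes (for example via the parallel gem named on the Foundation gems slide) are the more usual choice.

```ruby
# Embarrassingly parallel sketch: every read is processed independently,
# so the input can be cut into chunks and handed to separate workers.
def gc_content(seq)
  return 0.0 if seq.empty?
  seq.count('GCgc').to_f / seq.length
end

def parallel_gc(reads, workers: 4)
  slice_size = [(reads.size.to_f / workers).ceil, 1].max
  threads = reads.each_slice(slice_size).map do |chunk|
    Thread.new { chunk.map { |read| gc_content(read) } }
  end
  # Thread#value waits for the thread and returns its block's result;
  # collecting in thread order keeps the results in input order.
  threads.flat_map(&:value)
end

reads = %w[ACGT GGCC ATAT GCGA] # hypothetical reads
p parallel_gc(reads, workers: 2)
```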

Slide 12

Slide 12 text

Infiniband is good at I/O efficiency
• Interconnect speed.
• I/O performance.
• An Infiniband system provides about 3.8 GB/s of bandwidth.
• A 10 Gb network provides about 400 MB/s of bandwidth.
http://glennklockwood.blogspot.tw/2013/12/high-performance-virtualization-sr-iov_14.html
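To put the two bandwidth figures in perspective, the sketch below estimates how long it would take to move one 500 GB sample (the per-sample size used earlier in the talk) over each link. It assumes the quoted bandwidths are sustained and ignores protocol and disk overhead, so the numbers are rough lower bounds rather than measurements.

```ruby
# Idealised transfer time: size divided by sustained bandwidth.
def transfer_seconds(size_gb, bandwidth_gb_per_s)
  size_gb / bandwidth_gb_per_s.to_f
end

SAMPLE_GB = 500 # per-sample size quoted earlier in the talk

infiniband_s = transfer_seconds(SAMPLE_GB, 3.8) # about 3.8 GB/s
ethernet_s = transfer_seconds(SAMPLE_GB, 0.4)   # about 400 MB/s

puts format('Infiniband: %.1f minutes', infiniband_s / 60)
puts format('10 Gb network: %.1f minutes', ethernet_s / 60)
```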

Slide 13

Slide 13 text

Data science is about DATA!

Slide 14

Slide 14 text

Data Scientist Concerns
❖ Data quality
❖ Filter factors
❖ Statistics
❖ Visualization
❖ Interpretation

Slide 15

Slide 15 text

Programmers Also Care About
❖ High-throughput data (Big Data) handling
❖ Data format / File format
❖ Data parsing
❖ Statistics tools
❖ Visualization
❖ Profit / Markets

Slide 16

Slide 16 text

Biology

Slide 17

Slide 17 text

http://businessintelligence.com/bi-insights/the-personalized-medicine-revolution-is-almost-here/

Slide 18

Slide 18 text

A Dream of Personalized Medicine from Dr. Yen-Hua Huang

Slide 19

Slide 19 text

Genomic Disease http://www1.imperial.ac.uk/computationalsystemsmedicine/biomolecularmedicine/personalised/

Slide 20

Slide 20 text

Cure by Medicines http://scienceroll.com/2008/04/25/personalized-medicine-real-clinical-examples/

Slide 21

Slide 21 text

Personalized Medicine http://www.genomicslawreport.com/index.php/tag/personalized-medicine/

Slide 22

Slide 22 text

Personal Genomic Analysis http://www.thecureisnow.org/index.php/our-strategy/philosophy-of-tcin/personalized-medicine

Slide 23

Slide 23 text

http://www.genengnews.com/insight-and-intelligence/personalized-medicine-not-quite-there-yet/77899649/

Slide 24

Slide 24 text

DNA http://cisncancer.org/research/what_we_know/omics/personalized_medicine_02.html

Slide 25

Slide 25 text

DNA Sequencing http://www.scq.ubc.ca/genome-projects-uncovering-the-blueprints-of-biology/ http://www.broadinstitute.org/blog/beyond-genome-new-uses-dna-sequencers

Slide 26

Slide 26 text

No Teach for Reading DNA http://intellimedix.com

Slide 27

Slide 27 text

Do The Right Things http://www.dnadirect.com

Slide 28

Slide 28 text

ID Mapping of Databases
http://biodbnet.abcc.ncifcrf.gov/dbInfo/netGraph.php
Each node is a database.
Each database has its own unique ID.
These IDs are connected as a network.
I think handling this complexity should be easy for the people sitting here.
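One way to picture the ID network is as a graph where each database is a node and each edge is a direct ID conversion. The sketch below finds the shortest chain of conversions with a breadth-first search. The database names and links are hypothetical placeholders chosen for illustration, not the actual bioDBnet graph.

```ruby
# Hypothetical cross-reference graph: an edge means IDs of one database
# can be converted directly into IDs of the other.
ID_LINKS = {
  'GeneSymbol' => %w[EnsemblGene EntrezGene],
  'EnsemblGene' => %w[GeneSymbol EnsemblTranscript],
  'EntrezGene' => %w[GeneSymbol RefSeq],
  'EnsemblTranscript' => %w[EnsemblGene UniProt],
  'RefSeq' => %w[EntrezGene],
  'UniProt' => %w[EnsemblTranscript]
}.freeze

# Breadth-first search for the shortest chain of ID conversions.
def mapping_path(links, from, to)
  queue = [[from]]
  seen = { from => true }
  until queue.empty?
    path = queue.shift
    return path if path.last == to
    links.fetch(path.last, []).each do |neighbor|
      next if seen[neighbor]
      seen[neighbor] = true
      queue << (path + [neighbor])
    end
  end
  nil # no chain of conversions connects the two databases
end

p mapping_path(ID_LINKS, 'RefSeq', 'UniProt')
```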

Slide 29

Slide 29 text

Bioinformatics Sites for Rubyists

Slide 30

Slide 30 text

NCBI http://www.ncbi.nlm.nih.gov

Slide 31

Slide 31 text

Ensembl http://www.ensembl.org

Slide 32

Slide 32 text

Nature Biotechnology http://www.nature.com/nbt

Slide 33

Slide 33 text

PLOS Computational Biology http://www.ploscompbiol.org

Slide 34

Slide 34 text

Biostars https://www.biostars.org

Slide 35

Slide 35 text

SEQanswers http://seqanswers.com

Slide 36

Slide 36 text

Ruby Sites for Bioinformaticians

Slide 37

Slide 37 text

GitHub https://github.com/

Slide 38

Slide 38 text

RubyGems.org https://rubygems.org

Slide 39

Slide 39 text

The Ruby Toolbox https://www.ruby-toolbox.com

Slide 40

Slide 40 text

Biogems.info http://www.biogems.info

Slide 41

Slide 41 text

BioRuby http://bioruby.org

Slide 42

Slide 42 text

SciRuby http://sciruby.com

Slide 43

Slide 43 text

What programming language is best for a bioinformatics beginner?

Slide 44

Slide 44 text

Mapping Sequence Data from Jui-Tse Hsu

Slide 45

Slide 45 text

Simple Mapping Sequence Data
Illumina Exome sequence reads
  1. .seq -> fastq
  2. Illumina score -> Phred score
Aligner (soap2, bwa, bowtie, etc.)
Aligned reads
Convert to SAM
Aligned reads (sam file)
Compress to BAM
Aligned reads (bam file)
Index, Sort, Remove PCR duplicates (Rmdup)
Useful reads data
  1. cleaned bam file
  2. quality control, get statistics, mapped, unmapped, etc.
Call variants
  1. SNVs in VCFs
  2. structural variants
  3. copy number changes, etc.
Visualization in browsers
from Jui-Tse Hsu
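The "Illumina score -> Phred score" step can be sketched in plain Ruby. The slide does not say which Illumina encoding is meant, so this sketch assumes the Illumina 1.3+ style where quality characters are Phred scores with an ASCII offset of 64, and converts them to the Sanger style with an offset of 33. Older Solexa-scaled scores would need a different formula. The method names are illustrative.

```ruby
ILLUMINA_OFFSET = 64 # assumed: Illumina 1.3+ (Phred+64) encoding
SANGER_OFFSET = 33   # Sanger / standard FASTQ (Phred+33) encoding

# Decode a Phred+64 quality string into integer Phred scores.
def illumina_to_phred_scores(quality)
  quality.each_char.map do |char|
    score = char.ord - ILLUMINA_OFFSET
    raise ArgumentError, "not a Phred+64 character: #{char.inspect}" if score.negative?
    score
  end
end

# Re-encode the same scores as a Phred+33 quality string.
def illumina_to_sanger(quality)
  illumina_to_phred_scores(quality).map { |q| (q + SANGER_OFFSET).chr }.join
end

# Phred definition: Q = -10 * log10(P), so P = 10^(-Q/10).
def error_probability(phred)
  10**(-phred / 10.0)
end

p illumina_to_phred_scores('hhhB')
puts illumina_to_sanger('hhhB')
```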

Slide 46

Slide 46 text

C/C++
❖ Key Algorithms
❖ Written in C/C++
❖ Foundation Tools
❖ BWA
❖ Bowtie / Bowtie2
❖ samtools / bamtools
❖ GMAP / GSNAP
❖ BLAT
❖ Tophat

Slide 47

Slide 47 text

Analysis Pipeline
Overview of the RNA-seq analysis pipeline for detecting differential expression
http://genomebiology.com/2010/11/12/220

Slide 48

Slide 48 text

Perl
❖ First language
❖ Bioperl
❖ Ensembl
http://millionchimpanzees.blogspot.tw/2011/09/book-review-learning-perl-sixth-edition.html

Slide 49

Slide 49 text

Java
❖ Good parts of Java
❖ GATK
❖ Taverna
❖ Hadoop
http://shop.oreilly.com/product/9780596803742.do

Slide 50

Slide 50 text

R
❖ Statistics tools
❖ Bioconductor
❖ edgeR
❖ Data Mining and Analysis Books
http://exploringdata.github.io/data-visualization-books/analysis/

Slide 51

Slide 51 text

Python
❖ Young people
❖ Galaxy
http://news.oreilly.com/2008/08/python-for-unix-and-linux-syst.html

Slide 52

Slide 52 text

The Ruby Way in Bioinformatics

Slide 53

Slide 53 text

What kinds of libraries do you think are important?

Slide 54

Slide 54 text

Foundation gems
❖ activerecord
❖ nokogiri
❖ ffi
❖ parallel
❖ bioruby
❖ sciruby

Slide 55

Slide 55 text

C binding & wrapper
❖ bio-samtools
❖ bio-bwa
❖ bio-affy
❖ bio-faster
❖ mpi-ruby
❖ bio-grid
❖ gsl
❖ rb-gsl
❖ nmatrix
❖ sambamba - D language

Slide 56

Slide 56 text

Data parser / analyser
❖ bio-genomic-interval
❖ bio-blastxmlparser
❖ bio-assembly
❖ bio-gff3
❖ bio-gff3-pltools
❖ bio-alignment
❖ bio-maf
❖ bio-table
❖ bio-rdf
❖ bio-vcf
❖ bio-velvet
❖ bio-gngm
❖ bio-gag
❖ bio-dbsnp
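To give a feel for what such a parser does, here is a minimal plain-Ruby sketch that reads the eight fixed columns of a VCF data line (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). It is not the API of bio-vcf or any other gem above; the struct, the method name and the sample line (including the rs123 identifier) are made up for illustration.

```ruby
VcfRecord = Struct.new(:chrom, :pos, :id, :ref, :alt, :qual, :filter, :info)

# Parse one tab-separated VCF data line; header lines return nil.
def parse_vcf_line(line)
  return nil if line.start_with?('#')

  # Any FORMAT/sample columns end up in a ninth element and are ignored here.
  chrom, pos, id, ref, alt, qual, filter, info = line.chomp.split("\t", 9)

  info_hash = {}
  unless info.nil? || info == '.'
    info.split(';').each do |pair|
      key, value = pair.split('=', 2)
      info_hash[key] = value.nil? ? true : value # flags carry no value
    end
  end

  VcfRecord.new(chrom, Integer(pos), id, ref, alt.split(','),
                qual == '.' ? nil : Float(qual), filter, info_hash)
end

record = parse_vcf_line("1\t12345\trs123\tA\tG,T\t50.0\tPASS\tDP=100;DB\n")
p record.alt
p record.info
```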

Slide 57

Slide 57 text

Data parser / analyser
❖ bio-phyloxml
❖ bio-jplace
❖ bio-gex
❖ bio-ipcress
❖ bio-stockholm
❖ bio-synreport
❖ bio-cigar
❖ bio-wolf_psort_wrapper
❖ bio-hmmer3_report
❖ bio-dbla-finder
❖ bio-newbler_outputs
❖ bio-sra_fastq_dumper
❖ bigbio
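As a small illustration of the kind of format these gems handle, the sketch below parses a SAM CIGAR string in plain Ruby and derives how many reference and read bases the alignment covers. It does not use or imitate the bio-cigar API; the method names and the example string are assumptions made for this sketch. The operation sets follow the SAM format: M, D, N, = and X consume the reference, while M, I, S, = and X consume the read.

```ruby
CIGAR_TOKEN = /\d+[MIDNSHP=X]/.freeze
REFERENCE_OPS = %w[M D N = X].freeze # operations that consume reference bases
QUERY_OPS = %w[M I S = X].freeze     # operations that consume read bases

# Split a CIGAR string into [length, operation] pairs.
def parse_cigar(cigar)
  tokens = cigar.scan(CIGAR_TOKEN)
  raise ArgumentError, "malformed CIGAR: #{cigar.inspect}" unless tokens.join == cigar

  tokens.map { |token| [token[0..-2].to_i, token[-1]] }
end

def reference_length(cigar)
  parse_cigar(cigar).sum { |length, op| REFERENCE_OPS.include?(op) ? length : 0 }
end

def query_length(cigar)
  parse_cigar(cigar).sum { |length, op| QUERY_OPS.include?(op) ? length : 0 }
end

cigar = '5S90M2I3D' # hypothetical alignment
p parse_cigar(cigar)
puts "reference bases: #{reference_length(cigar)}, read bases: #{query_length(cigar)}"
```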

Slide 58

Slide 58 text

Data parser / analyser - protein
❖ protk
❖ mascot-dat
❖ bio-protparam
❖ bio-plasmoap
❖ bio-signalp
❖ bio-exportpred
❖ bio-hydropathy
❖ bio-epitope
❖ bio-bio-orthomcl
❖ bio-isoelectric_point
❖ bio-octopus
❖ bio-tm_hmm
❖ bio-aliphatic_index
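Several of these gems compute simple per-sequence protein properties. As one example of the idea, the sketch below computes the GRAVY score (grand average of hydropathy), that is, the mean Kyte-Doolittle hydropathy value over all residues. It is a plain-Ruby illustration and does not reproduce the interface of bio-hydropathy or any other gem listed; the method name and test peptide are made up.

```ruby
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
  'A' => 1.8, 'R' => -4.5, 'N' => -3.5, 'D' => -3.5, 'C' => 2.5,
  'Q' => -3.5, 'E' => -3.5, 'G' => -0.4, 'H' => -3.2, 'I' => 4.5,
  'L' => 3.8, 'K' => -3.9, 'M' => 1.9, 'F' => 2.8, 'P' => -1.6,
  'S' => -0.8, 'T' => -0.7, 'W' => -0.9, 'Y' => -1.3, 'V' => 4.2
}.freeze

# GRAVY: sum of hydropathy values divided by sequence length.
# Unknown residue letters raise KeyError rather than being skipped.
def gravy(sequence)
  residues = sequence.upcase.chars
  raise ArgumentError, 'empty sequence' if residues.empty?

  residues.sum { |aa| KYTE_DOOLITTLE.fetch(aa) } / residues.size
end

puts gravy('ACDE').round(3) # hypothetical four-residue peptide
```

A positive score suggests a more hydrophobic sequence overall, a negative score a more hydrophilic one.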

Slide 59

Slide 59 text

Database / Web API
❖ ruby-ensembl-api
❖ bio-ucsc-api
❖ bio-liftover
❖ intermine
❖ bio-eupathdb
❖ bio-krona
❖ bio-sra
❖ bio-sradl
http://www.ensembl.org

Slide 60

Slide 60 text

Statistics
❖ statsample
❖ statsample-sem
❖ statsample-optimization
❖ statsample-timeseries
❖ distribution
❖ rinruby
http://www.ncss.com/software/ncss/survival-analysis-in-ncss

Slide 61

Slide 61 text

SVG & Graph
❖ rubyvis
❖ plotrb
❖ bio-svgenes
❖ bio-vis
❖ gnuplot
http://rubyvis.rubyforge.org

Slide 62

Slide 62 text

Tools
❖ minimization
❖ integration
❖ quorum - rails engine

Slide 63

Slide 63 text

I Am Not an Analyst,
I Am a Programmer.

Slide 64

Slide 64 text

What can I get involved in?

Slide 65

Slide 65 text

Pipeline / Workflow
Galaxy - Python
Taverna - Java
??? - Ruby
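To make the open "??? - Ruby" slot a little more concrete, here is a toy sketch of what the core of a Ruby workflow tool might look like: steps declare what they need, and the runner executes them in dependency order. The Workflow class and its methods are entirely hypothetical and are not an existing gem; real systems such as Galaxy or Taverna do far more (scheduling, provenance, retries).

```ruby
# Toy workflow runner: steps declare dependencies and run in a valid order.
class Workflow
  def initialize
    @steps = {}
  end

  # Register a step; +needs+ lists the steps that must run first.
  def step(name, needs: [], &action)
    @steps[name] = { needs: needs, action: action }
    self
  end

  # Depth-first topological sort: dependencies come before dependants.
  def order
    done = {}
    visiting = {}
    result = []
    visit = lambda do |name|
      next if done[name]
      raise ArgumentError, "cyclic dependency at #{name}" if visiting[name]

      visiting[name] = true
      @steps.fetch(name)[:needs].each { |dependency| visit.call(dependency) }
      visiting.delete(name)
      done[name] = true
      result << name
    end
    @steps.each_key { |name| visit.call(name) }
    result
  end

  def run
    order.each { |name| @steps[name][:action]&.call }
  end
end

log = []
flow = Workflow.new
flow.step(:call_variants, needs: [:sort]) { log << 'call variants' }
flow.step(:sort, needs: [:align]) { log << 'sort BAM' }
flow.step(:align) { log << 'align reads' }
flow.run
p log
```

Declaring dependencies instead of a fixed sequence is a deliberate choice here: it lets independent steps be reordered or, in a fuller implementation, dispatched in parallel.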

Slide 66

Slide 66 text

Web System
❖ Data warehouse
❖ Pipeline management
❖ Coordination center
❖ Visualisation

Slide 67

Slide 67 text

Cloud / Distributed / Parallel http://www.mynamesnotmommy.com/yes-there-are-dumb-questions/question-mark/

Slide 68

Slide 68 text

What Are We Doing with Ruby?

Slide 69

Slide 69 text

Ensembl Virtual Machine
❖ Powered by VeeWee, Vagrant and Chef
❖ Automatically builds a versioned Ensembl system (Perl)
❖ Includes database, queuing services and analysis tools
❖ Multiple sites and multiple species in one virtual machine
❖ Helps to build local and custom systems
from Tse-Ching Ho

Slide 70

Slide 70 text

Ensembl Virtual Machine
Ensembl VM Automation (diagram labels):
Use existing Vagrant box
Prepare SOP for Chef recipes
Provision VM with Chef recipes
Write Chef recipes
Export VM by VirtualBox
Set up Vagrantfile
Create Vagrant box by Veewee
Write definition of Vagrant box by Veewee
from Tse-Ching Ho

Slide 71

Slide 71 text

Ensembl Virtual Machine Web view of Ensembl from Tse-Ching Ho

Slide 72

Slide 72 text

DR. RAW
❖ Derived from DRAW and SneakPeek
❖ Composed of C/C++, bash, Perl, Java and Ruby
❖ Covers both DNA and RNA re-sequencing analysis
❖ Enhanced quality control for DNA and RNA
❖ Distributed computing pipeline
❖ Supports PBS, LSF and SGE platforms (queuing systems)
from Hannah Lin
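Supporting several queuing systems usually means hiding their different submit commands behind one interface. The sketch below shows one possible shape for that in Ruby. It is not DR. RAW's code: the SUBMITTERS table and submit_command method are hypothetical, and only the job-name options are used (qsub -N for PBS and SGE, bsub -J for LSF). Resource requests differ per site and are left out.

```ruby
require 'shellwords'

# Hypothetical adapter table: each scheduler builds its own argument list.
SUBMITTERS = {
  pbs: ->(name, script) { ['qsub', '-N', name, script] },
  sge: ->(name, script) { ['qsub', '-N', name, script] },
  lsf: ->(name, script) { ['bsub', '-J', name, script] }
}.freeze

# Build (but do not execute) the shell command that submits one job.
def submit_command(scheduler, name:, script:)
  builder = SUBMITTERS.fetch(scheduler) do
    raise ArgumentError, "unsupported scheduler: #{scheduler}"
  end
  Shellwords.join(builder.call(name, script))
end

puts submit_command(:pbs, name: 'align_sample1', script: './align.sh')
puts submit_command(:lsf, name: 'align_sample1', script: './align.sh')
```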

Slide 73

Slide 73 text

DR. RAW (architecture diagram labels)
Analysis Pipeline: DNA sequencing data, RNA sequencing data
Quality Control: DNA QC (Forward : Reverse), RNA QC (Forward : Reverse)
Analysis Tools: BWA-0.7.7, Samtools-0.1.19, GATK-3.1, GSNAP-13-10-25, Cufflink-13-11, FusionGene …
Resource Manager System: SGE (Sun Grid Engine), PBS (Portable Batch System), LSF (Load Sharing Facility)
Green: new components
Red: updated components
from Hannah Lin

Slide 74

Slide 74 text

DR. RAW Web view by Rails from Hannah Lin

Slide 75

Slide 75 text

Neo4j - JRuby Data Parser
❖ Graph database for data integration of discrete clinical research documents
❖ Original data are Excel/CSV files collected at different times by different people
❖ Neo4j is good for cleaning up such massive data sets
❖ Cooperation between biologist and programmer
from Wei-Ming Wu, Chia-Hsuan Lee

Slide 76

Slide 76 text

Neo4j - JRuby Data Parser from Wei-Ming Wu, Chia-Hsuan Lee

Slide 77

Slide 77 text

Neo4j - JRuby Data Parser from Wei-Ming Wu, Chia-Hsuan Lee Collision Rate of Input Data: 1.3 %
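The slide does not define how the collision rate was measured. As one possible reading, the sketch below merges rows from several sheets by a shared key and counts a collision whenever a later source disagrees with a value already stored for the same field. The method name, the field names and the sample rows are all hypothetical, and plain hashes stand in for the parsed Excel/CSV rows and the graph database.

```ruby
# Merge rows from several sources by key and measure how often they disagree.
# Assumed definition: collision rate = conflicting values / non-empty values seen.
def merge_with_collisions(sources, key:)
  merged = {}
  collisions = 0
  total = 0
  sources.each do |rows|
    rows.each do |row|
      record = (merged[row.fetch(key)] ||= {})
      row.each do |field, value|
        next if value.nil? || value.to_s.empty?

        total += 1
        if record.key?(field) && record[field] != value
          collisions += 1 # keep the first value, only count the conflict
        else
          record[field] = value
        end
      end
    end
  end
  rate = total.zero? ? 0.0 : collisions.to_f / total
  [merged, rate]
end

sheet_a = [{ id: 'P1', sex: 'F', age: '34' }]
sheet_b = [{ id: 'P1', sex: 'F', age: '35' }, { id: 'P2', sex: 'M', age: '51' }]
merged, rate = merge_with_collisions([sheet_a, sheet_b], key: :id)
p merged
puts format('collision rate: %.1f%%', rate * 100)
```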

Slide 78

Slide 78 text

API Server for Third Party Firm
❖ API server based on Rails, run by JRuby
❖ ActiveRecord models for Oracle database
❖ activerecord-oracle_enhanced-adapter gem
❖ Import Excel files to third party GUI client
❖ Third party server sends XML requests to API server
from Wei-Ming Wu, Sean Wang

Slide 79

Slide 79 text

API Server for Third Party Firm (diagram labels)
Components: TCHC server, API server (Rails, JRuby), CSIS (Java, Oracle), Windows GUI
Groups: Third Party, Our Servers
Actions: Send data by XML, Write into database, Read data by client program, Upload data, Parse request
from Wei-Ming Wu, Sean Wang

Slide 80

Slide 80 text

Daily Checking Rule
❖ Based on Rails, run by JRuby
❖ ActiveRecord models for Oracle database
❖ activerecord-oracle_enhanced-adapter gem
❖ Users can define rules for checking data, usually values in filled-in forms
❖ Checking rules run daily, not before forms are filled in
from Wei-Ming Wu, Sean Wang
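The idea of user-defined checking rules that run as a batch can be sketched without Rails. In the sketch below a rule is just a name, a field and a predicate, and a daily job would call run_rules over the stored form records and report violations. The Rule struct, the field names and the sample records are hypothetical; the real system stores rules and data in Oracle through ActiveRecord.

```ruby
# A rule names a field and a predicate the field's value must satisfy.
Rule = Struct.new(:name, :field, :check)

# Return one violation hash per (record, failed rule) pair.
def run_rules(rules, records)
  records.flat_map do |record|
    rules.reject { |rule| rule.check.call(record[rule.field]) }
         .map { |rule| { id: record[:id], rule: rule.name, value: record[rule.field] } }
  end
end

rules = [
  Rule.new('systolic in range', :systolic, ->(v) { v.is_a?(Numeric) && (50..250).cover?(v) }),
  Rule.new('birth year present', :birth_year, ->(v) { !v.nil? })
]

records = [
  { id: 1, systolic: 120, birth_year: 1980 },
  { id: 2, systolic: 400, birth_year: nil }
]

run_rules(rules, records).each { |violation| p violation }
```

Running the checks after the fact, rather than as form validation, means data entry is never blocked and questionable values are flagged for later review.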

Slide 81

Slide 81 text

Daily Checking Rule from Wei-Ming Wu, Sean Wang

Slide 82

Slide 82 text

Daily Checking Rule from Wei-Ming Wu, Sean Wang

Slide 83

Slide 83 text

Daily Checking Rule from Wei-Ming Wu, Sean Wang

Slide 84

Slide 84 text

Daily Checking Rule from Wei-Ming Wu, Sean Wang

Slide 85

Slide 85 text

Patient Randomization
❖ Based on Rails, run by JRuby
❖ ActiveRecord models for Oracle database
❖ activerecord-oracle_enhanced-adapter gem
❖ Assigns patients to different groups by a randomization method
❖ Cooperation between statistician and programmer
from Wei-Ming Wu, Sean Wang
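The slide does not say which randomization method the system uses. As one common option, the sketch below shows permuted-block randomization: patients are handled in blocks, and within each block the group labels are shuffled so the groups stay balanced. The method name, group labels and patient IDs are hypothetical, and the fixed seed is only there to make the example reproducible.

```ruby
# Permuted-block randomization (an assumed method, not necessarily the one
# used in the system described): each full block contains every group
# equally often, in random order.
def block_randomize(patient_ids, groups:, block_size:, seed:)
  unless (block_size % groups.size).zero?
    raise ArgumentError, 'block_size must be a multiple of the number of groups'
  end

  rng = Random.new(seed)
  block_template = groups * (block_size / groups.size)
  assignments = {}
  patient_ids.each_slice(block_size) do |block|
    arms = block_template.shuffle(random: rng)
    block.each_with_index { |id, index| assignments[id] = arms[index] }
  end
  assignments
end

ids = (1..8).map { |i| "P#{i}" } # hypothetical patient IDs
assignments = block_randomize(ids, groups: %w[treatment control], block_size: 4, seed: 42)
assignments.each { |id, arm| puts "#{id}: #{arm}" }
```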

Slide 86

Slide 86 text

Patient Randomization from Wei-Ming Wu, Sean Wang

Slide 87

Slide 87 text

Patient Randomization from Wei-Ming Wu, Sean Wang

Slide 88

Slide 88 text

Patient Randomization from Wei-Ming Wu, Sean Wang Assign patients to treatment groups

Slide 89

Slide 89 text

Database Statistics Dashboard
❖ Based on Rails, run by JRuby
❖ ActiveRecord models for Oracle database
❖ activerecord-oracle_enhanced-adapter gem
❖ google_visualr gem for visualization
❖ Counts the number of projects, forms, fields, records and patients
from Wei-Ming Wu, Winnie Lui

Slide 90

Slide 90 text

Database Statistics Dashboard from Wei-Ming Wu, Winnie Lui

Slide 91

Slide 91 text

Education

Slide 92

Slide 92 text

Learning Bioinformatics
❖ http://www.nature.com/nbt/journal/v31/n11/full/nbt.2740.html
❖ http://www.liacs.nl/~hoogeboo/mcb/nature_primer.html
❖ http://www.mygoblet.org - Python, R
❖ http://www.biotnet.org

Slide 93

Slide 93 text

Books for Beginners http://practicalcomputing.org Python

Slide 94

Slide 94 text

Python Book for Bioinformatics http://shop.oreilly.com/product/9780596154516.do

Slide 95

Slide 95 text

Python Is More Successful in Teaching than Ruby

Slide 96

Slide 96 text

Do we lack a killer application written in Ruby? http://www.witardroadbaptist.org/im-new/im-not-sure-im-ready-for-church-yet/

Slide 97

Slide 97 text

We Need People!!

Slide 98

Slide 98 text

Are You Ready
To Be a Data Scientist or a Bioinformatician?

Slide 99

Slide 99 text

Markets http://www.genengnews.com/gen-articles/personalized-medicine-health-economic-aspects/4824/

Slide 100

Slide 100 text

Still developing
Does Asia have enough market share?
http://www.bccresearch.com/market-research/biotechnology/bioinformatics-market-technology-bio051b.html

Slide 101

Slide 101 text

Topics to Take Action On
❖ Data generation and data management
❖ Data analysis and software
❖ Data processing and storage
❖ Application of bioinformatics in pharma research and development
http://www.giichinese.com.tw/report/bc268909-bioinformatics-technologies-global-markets.html

Slide 102

Slide 102 text

Health Care in Cloud
❖ Health promotion cloud
❖ Vaccination cloud
❖ Exercise cloud
❖ Workplace wellness
❖ Physical checkup cloud
❖ Welfare cloud
from Dr. Chi-Hung Lin

Slide 103

Slide 103 text

Code For Bioinformatics

Slide 104

Slide 104 text

Q & A