Nimbus ruby library.

An opensource library to implement random forests in genomic contexts
Oscar González-Recio Juanjo Bazán Selma Forni

The Problem

The Problem Massive amount of information from high troughput genotyping
platforms.

platforms. Need to extract knowledge from large, noisy, redundant, missing and fuzzy data.

platforms. Massive amount of information consumes the attention of its recipients. We need to allocate that attention efciently. Need to extract knowledge from large, noisy, redundant, missing and fuzzy data.

Why Random Forest?

Why Random Forest? Using Machine Learning techniques we can extract
hidden relationships that exist in these huge volumes of data and do not follow a particular parametric design.

hidden relationships that exist in these huge volumes of data and do not follow a particular parametric design. Random Forest have desirable statistical properties.

hidden relationships that exist in these huge volumes of data and do not follow a particular parametric design. Random Forest have desirable statistical properties. Random Forest scales well computationally.

Using Machine Learning techniques we can extract hidden relationships that
exist in these huge volumes of data and do not follow a particular parametric design. Random Forest have desirable statistical properties. Random Forest scales well computationally. Random Forest performs extremely well in a variety of possible complex domains (Breiman, 2001; Gonzalez-Recio & Forni, 2011). Why Random Forest?

The Algorithm

Ensemble methods: - Combination of diferent methods (usually simple models).
- They have very good predictive ability because use additivity of models performances. Based on Classifcation And Regression Trees (CART). Use Randomization and Bagging. Performs Feature Subset Selection. Convenient for classifcation problems. Fast computation. Simple interpretation of results for human minds. Previous work in genome-wide prediction (Gonzalez-Recio and Forni, 2011) The Algorithm

The Algorithm Perform bootstrap on data: Ψ* = (y, X)
Build a CART ( fi (y, X) = ht (x) ) using only mtry proportion of SNPs in each node. Repeat M times to reduce residuals by a factor of M. Average estimates c0 = μ ; ci = 1/M

Let Ψ = (y, X) be a set of data,
with y = vector of phenotypes (response variables) X = (x1, x2) = matrix of features y1 x11 x12 … … ... yi xi1 xi2 … … ... yn xn1 xn2 The Algorithm Classifcation and regression trees:

The Algorithm

Nimbus library

Nimbus library Written in Ruby www.ruby-lang.org Open source programming language
Syntax focused on simplicity Natural to read and easy to write

Nimbus library How to install: > gem install nimbus Prerequisites:
Ruby and Rubygems (default library manager) installed in the system

Nimbus library How to run: > nimbus Confguration: Via confg.yml
fle

Nimbus library confg.yml fle:

Nimbus library Input fles: training testing

Nimbus library Features, use cases: Training of a prediction forest
Training of a prediction forest Genomic Prediction of a testing sample Using a training set of individuals Nimbus creates a reutilizable forest Nimbus calculates SNP importances Generalization error are computed for every tree in the forest Using a custom/reused forest, specif i ed via conf i g.yml Using a new trained forest

Outputs Random Forest fle In standard YAML format

Outputs Predictions for the training sample

Outputs Predictions for the testing sample

Outputs SNP importances

More info: Nimbus website: Source code: Report bugs/request features: www.nimbusgem.org
www.github.com/xuanxu/nimbus/issues www.github.com/xuanxu/nimbus

Thank you!

Questions?

Nimbus ruby library.

Nimbus ruby library.

Juanjo Bazán

More Decks by Juanjo Bazán

Other Decks in Research

Featured

Transcript

An opensource library to implement random forests in genomic contexts

The Problem

The Problem Massive amount of information from high troughput genotyping

The Problem Massive amount of information from high troughput genotyping

The Problem Massive amount of information from high troughput genotyping

Why Random Forest?

Why Random Forest? Using Machine Learning techniques we can extract

Why Random Forest? Using Machine Learning techniques we can extract

Why Random Forest? Using Machine Learning techniques we can extract

Using Machine Learning techniques we can extract hidden relationships that

The Algorithm

Ensemble methods: - Combination of diferent methods (usually simple models).

The Algorithm Perform bootstrap on data: Ψ* = (y, X)

Let Ψ = (y, X) be a set of data,

The Algorithm

Nimbus library

Nimbus library Written in Ruby www.ruby-lang.org Open source programming language

Nimbus library How to install: > gem install nimbus Prerequisites:

Nimbus library How to run: > nimbus Confguration: Via confg.yml

Nimbus library confg.yml fle:

Nimbus library Input fles: training testing

Nimbus library Features, use cases: Training of a prediction forest

Outputs Random Forest fle In standard YAML format

Outputs Predictions for the training sample

Outputs Predictions for the testing sample

Outputs SNP importances

More info: Nimbus website: Source code: Report bugs/request features: www.nimbusgem.org

Thank you!

Questions?