Slide 1

Slide 1 text

An open-source library to implement random forests in genomic contexts
Oscar González-Recio, Juanjo Bazán, Selma Forni

Slide 2

Slide 2 text

The Problem

Slide 3

Slide 3 text

The Problem
- Massive amount of information from high-throughput genotyping platforms.

Slide 4

Slide 4 text

The Problem
- Massive amount of information from high-throughput genotyping platforms.
- Need to extract knowledge from large, noisy, redundant, missing and fuzzy data.

Slide 5

Slide 5 text

The Problem
- Massive amount of information from high-throughput genotyping platforms.
- A massive amount of information consumes the attention of its recipients. We need to allocate that attention efficiently.
- Need to extract knowledge from large, noisy, redundant, missing and fuzzy data.

Slide 6

Slide 6 text

Why Random Forest?

Slide 7

Slide 7 text

Why Random Forest?
- Using machine learning techniques we can extract hidden relationships that exist in these huge volumes of data and do not follow a particular parametric design.

Slide 8

Slide 8 text

Why Random Forest?
- Using machine learning techniques we can extract hidden relationships that exist in these huge volumes of data and do not follow a particular parametric design.
- Random Forest has desirable statistical properties.

Slide 9

Slide 9 text

Why Random Forest?
- Using machine learning techniques we can extract hidden relationships that exist in these huge volumes of data and do not follow a particular parametric design.
- Random Forest has desirable statistical properties.
- Random Forest scales well computationally.

Slide 10

Slide 10 text

Why Random Forest?
- Using machine learning techniques we can extract hidden relationships that exist in these huge volumes of data and do not follow a particular parametric design.
- Random Forest has desirable statistical properties.
- Random Forest scales well computationally.
- Random Forest performs extremely well in a variety of complex domains (Breiman, 2001; Gonzalez-Recio & Forni, 2011).

Slide 11

Slide 11 text

The Algorithm

Slide 12

Slide 12 text

The Algorithm
- Ensemble methods:
  - Combination of different methods (usually simple models).
  - They have very good predictive ability because they use the additivity of the models' performances.
- Based on Classification And Regression Trees (CART).
- Uses randomization and bagging.
- Performs feature subset selection.
- Convenient for classification problems.
- Fast computation.
- Simple interpretation of results for human minds.
- Previous work in genome-wide prediction (Gonzalez-Recio and Forni, 2011).

Slide 13

Slide 13 text

The Algorithm
- Perform bootstrap on data: Ψ* = (y, X)
- Build a CART ( f_i(y, X) = h_t(x) ) using only mtry proportion of SNPs in each node.
- Repeat M times to reduce residuals by a factor of M.
- Average estimates: c_0 = μ ; c_i = 1/M
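The steps above can be sketched in a few lines of Ruby. This is a toy illustration, not Nimbus code: it assumes SNPs coded 0/1/2, a numeric phenotype, squared error as the split criterion, and depth-1 trees (stumps) standing in for full CARTs. All method names (`fit_stump`, `fit_forest`, `predict_forest`) are hypothetical.

```ruby
# Toy bagged ensemble of regression stumps (depth-1 trees). Not Nimbus code.
# x: array of rows (one per individual), each row an array of SNP codes 0/1/2.
# y: array of numeric phenotypes.

# Sum of squared deviations from the mean (assumed split criterion).
def sse(values)
  return 0.0 if values.empty?
  mean = values.sum.to_f / values.size
  values.sum { |v| (v - mean)**2 }
end

# Fit one stump, trying only the SNPs listed in `candidates`.
def fit_stump(x, y, candidates)
  best = nil
  candidates.each do |snp|
    [0, 1].each do |threshold| # "code <= threshold" goes to the left child
      left, right = y.each_index.partition { |i| x[i][snp] <= threshold }
      next if left.empty? || right.empty?
      loss = sse(left.map { |i| y[i] }) + sse(right.map { |i| y[i] })
      next if best && best[:loss] <= loss
      best = { snp: snp, threshold: threshold, loss: loss,
               left: left.sum { |i| y[i] }.to_f / left.size,
               right: right.sum { |i| y[i] }.to_f / right.size }
    end
  end
  # No valid split among the candidates: every row goes left, to the sample mean.
  best || { snp: 0, threshold: 2, left: y.sum.to_f / y.size, right: 0.0 }
end

def predict_stump(stump, row)
  row[stump[:snp]] <= stump[:threshold] ? stump[:left] : stump[:right]
end

# Grow m_trees stumps, each on its own bootstrap sample and mtry random SNPs.
def fit_forest(x, y, m_trees:, mtry:, rng: Random.new(42))
  n = y.size
  Array.new(m_trees) do
    idx = Array.new(n) { rng.rand(n) } # draw n rows with replacement
    candidates = (0...x.first.size).to_a.sample(mtry, random: rng)
    fit_stump(idx.map { |i| x[i] }, idx.map { |i| y[i] }, candidates)
  end
end

# Forest prediction: plain average (weight 1/M) over the M trees.
def predict_forest(forest, row)
  forest.sum { |stump| predict_stump(stump, row) } / forest.size
end

x = [[0, 2], [0, 1], [1, 0], [2, 0], [2, 1], [1, 2]]
y = [1.0, 1.2, 2.1, 3.0, 3.1, 1.9]
forest = fit_forest(x, y, m_trees: 25, mtry: 1)
puts predict_forest(forest, [2, 0])
```

A stump has a single node, so drawing the mtry SNPs once per tree matches the per-node sampling described on the slide; a deeper tree would redraw the candidates at every node.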

Slide 14

Slide 14 text

The Algorithm
Classification and regression trees:
Let Ψ = (y, X) be a set of data, with
- y = vector of phenotypes (response variables)
- X = (x1, x2) = matrix of features

y1  x11  x12
…   …    …
yi  xi1  xi2
…   …    …
yn  xn1  xn2
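As a small worked example of how a tree node might use this data layout, the Ruby sketch below stores Ψ = (y, X) as plain arrays and scores candidate splits. The squared-error impurity, the 0/1/2 genotype coding and the names `sse` and `split_loss` are assumptions made for illustration; the slides do not state which criterion is used.

```ruby
# Psi = (y, X): y[i] is the phenotype of individual i, x[i][j] its code at SNP j.
y = [10.0, 12.0, 20.0, 22.0]
x = [[0, 1],
     [0, 2],
     [2, 1],
     [2, 0]]

# Sum of squared deviations from the mean (assumed node impurity).
def sse(values)
  return 0.0 if values.empty?
  mean = values.sum.to_f / values.size
  values.sum { |v| (v - mean)**2 }
end

# Impurity after sending individuals with code <= threshold at `snp`
# to the left child and everyone else to the right child.
def split_loss(x, y, snp, threshold)
  left, right = y.each_index.partition { |i| x[i][snp] <= threshold }
  sse(left.map { |i| y[i] }) + sse(right.map { |i| y[i] })
end

puts sse(y)                 # impurity of the unsplit node
puts split_loss(x, y, 0, 0) # split on the first SNP
puts split_loss(x, y, 1, 0) # split on the second SNP
```

In this toy data the unsplit node has impurity 104.0; splitting on the first SNP leaves 4.0 while splitting on the second leaves 56.0, so a greedy tree would choose the first SNP for this node.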

Slide 15

Slide 15 text

The Algorithm

Slide 16

Slide 16 text

Nimbus library

Slide 17

Slide 17 text

Nimbus library
- Written in Ruby (www.ruby-lang.org)
- Open source programming language
- Syntax focused on simplicity
- Natural to read and easy to write

Slide 18

Slide 18 text

Nimbus library
How to install:
> gem install nimbus
Prerequisites: Ruby and RubyGems (the default library manager) installed in the system

Slide 19

Slide 19 text

Nimbus library
How to run:
> nimbus
Configuration: via the config.yml file

Slide 20

Slide 20 text

Nimbus library
config.yml file:

Slide 21

Slide 21 text

Nimbus library
Input files: training, testing

Slide 22

Slide 22 text

Nimbus library
Features, use cases:
- Training of a prediction forest
  - Using a training set of individuals, Nimbus creates a reusable forest
  - Nimbus calculates SNP importances
  - Generalization errors are computed for every tree in the forest
- Genomic prediction of a testing sample
  - Using a newly trained forest
  - Using a custom/reused forest, specified via config.yml
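One common way to obtain a per-tree generalization error in a random forest is the out-of-bag (OOB) estimate: each tree is evaluated on the individuals that were not drawn into its bootstrap sample. The Ruby sketch below illustrates that idea under the assumption of a numeric phenotype and squared error; it is not taken from Nimbus, and `out_of_bag` and `oob_error` are hypothetical names.

```ruby
# Individuals never drawn into a bootstrap sample form that tree's OOB set.
def out_of_bag(n, bootstrap_idx)
  (0...n).to_a - bootstrap_idx
end

# Mean squared error of one tree on its out-of-bag individuals.
# `predictor` is any callable mapping a genotype row to a predicted phenotype.
def oob_error(x, y, bootstrap_idx, predictor)
  oob = out_of_bag(y.size, bootstrap_idx)
  return nil if oob.empty? # every individual was drawn: no estimate possible
  oob.sum { |i| (y[i] - predictor.call(x[i]))**2 } / oob.size
end

x = [[0], [1], [2], [1]]
y = [1.0, 2.0, 3.0, 2.5]
bootstrap_idx = [0, 0, 2, 2]     # one bootstrap draw of size 4
tree = ->(row) { 1.0 + row[0] }  # stand-in for a fitted tree
puts oob_error(x, y, bootstrap_idx, tree)
```

Here individuals 1 and 3 are out of bag, and the stand-in tree misses only the last one by 0.5, giving an OOB error of 0.125.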

Slide 23

Slide 23 text

Outputs
Random Forest file, in standard YAML format

Slide 24

Slide 24 text

Outputs
Predictions for the training sample

Slide 25

Slide 25 text

Outputs
Predictions for the testing sample

Slide 26

Slide 26 text

Outputs
SNP importances

Slide 27

Slide 27 text

More info:
- Nimbus website: www.nimbusgem.org
- Source code: www.github.com/xuanxu/nimbus
- Report bugs/request features: www.github.com/xuanxu/nimbus/issues

Slide 28

Slide 28 text

Thank you!

Slide 29

Slide 29 text

Questions?