Slide 1

Slide 1 text

RUBY AND R
Chang Sau Sheong, Director, Applied Research, HP Labs Singapore
© Copyright 2010 Hewlett-Packard Development Company, L.P.

Slide 2

Slide 2 text

Programming language and platform for statistical computing, licensed under GPL

Slide 3

Slide 3 text

Strengths in statistical processing and data visualization

Slide 4

Slide 4 text

Extensive library of statistical computing packages (CRAN) written by statisticians

Slide 5

Slide 5 text

Statistics is not just for statisticians

Slide 6

Slide 6 text

• Fingerprint identification
• Speech recognition
• Financial forecasting
• Face recognition
• Credit scoring
• Card fraud detection
• Spam detection
• Recommendation engine
• OCR
• Data mining

Slide 7

Slide 7 text

CRAN – Almost 2000 packages, mostly created by statisticians
• BiodiversityR – GUI for biodiversity and community ecology analysis
• Emu – analyze speech patterns
• GenABEL – study human genome
• Quantmod – quantitative financial modeling framework
• Ftrading – technical trading analysis
• Cyclones – cyclone identification
• DOSim – disease analysis toolkit for gene set
• Agricolae – statistical procedures for agricultural research

Slide 8

Slide 8 text

EXAMPLE R CODE
– EPL data from football-data.co.uk
– Show home/away goals distribution for 2011 season

Slide 9

Slide 9 text

Why Ruby and R?

Slide 10

Slide 10 text

Stand on shoulders of giants

Slide 11

Slide 11 text

– Ruby
  • Human focused programming!
  • Better general purpose programming capabilities
  • Great frameworks!
  • Great libraries (20,000+ gems in RubyGems)
– R
  • Focus on statistical computing/crunching
  • Lots of packages written by domain experts/statisticians
  • Great graphing libraries

Slide 12

Slide 12 text

Ruby and R integration

Slide 13

Slide 13 text

RINRUBY
– 100% Ruby
– Uses pipes to send commands and evals
– Uses TCP/IP sockets to send and retrieve data
– Pros:
  • Doesn't require anything but R
  • Works flawlessly on Windows
  • Works with Ruby 1.8, 1.9 and JRuby 1.5
  • All API tested
– Cons:
  • VERY SLOW in assigning
  • Very limited datatypes: only Vector and Matrix
  • Not released since 2009
  • Poor documentation

Slide 14

Slide 14 text

RSRUBY
– C extension for Ruby, linked to R's shared library
– Pros:
  • Blazing speed! 5-10 times faster than Rserve and 100-1000 times faster than RinRuby
  • Seamless integration with Ruby. Every method and object is treated like a Ruby object
– Cons:
  • Transformation between R and Ruby types isn't trivial
  • Dependent on operating system, Ruby implementation and R version
  • Not available for alternative implementations of Ruby (e.g. JRuby)
  • Not released since 2009
  • Poor documentation

Slide 15

Slide 15 text

RSERVE
– 100% Ruby
– Uses TCP/IP sockets to interchange data and commands
– Requires Rserve installed on the server machine
– Access with Ruby uses Ruby-Rserve-Client library
– Pros:
  • Works with Ruby 1.8, 1.9 and JRuby 1.5
  • Sessions allow data to be processed asynchronously
  • Fast: 5-10 times faster than RinRuby
  • Most recently updated (Jan 2011)
– Cons:
  • Requires Rserve
  • Limited features on Windows
  • Poor documentation

Slide 16

Slide 16 text

RAPACHE/RRACK
– Web service based
– Run R scripts as web services, consumed by Ruby front-end apps
– Pros:
  • Modular and separate (no direct integration)
  • Can be scalable, ‘cloud’-ready
– Cons:
  • Requires Rapache/rRack
  • rRack is very new (not accepted by CRAN yet, as of today!), requires R 2.13 (just released a few weeks ago)
  • Rapache specific to Apache web server only
  • Communications overhead for smaller integrations

Slide 17

Slide 17 text

Let’s look at some code! (I’m going to use Rserve)

Slide 18

Slide 18 text

Text classification

Slide 19

Slide 19 text

TEXT CLASSIFICATION
– Automatically sorting a set of documents into different categories from a predefined set
– Classic uses:
  • Spam filtering
  • Email prioritization
[Diagram labels: Training data, Test data, Classifier, category]

Slide 20

Slide 20 text


Slide 21

Slide 21 text

TEXT CLASSIFIER CODE
Prepare

Slide 22

Slide 22 text

Train classifier by counting frequency of each word in the document

Slide 23

Slide 23 text

Get word count

Slide 24

Slide 24 text

What you get
{"check"=>1, "result"=>3, "marissa"=>1, "experi"=>1, "click"=>1, "engin"=>1, "simpli"=>1, "mistakenli"=>1, "pick"=>1, "prevent"=>1, "40"=>1, "regularli"=>1, "place"=>1, "user"=>5, "prefer"=>1, "malevol"=>1, "access"=>1, "robust"=>1, "servic"=>1, "fault"=>1, "malici"=>1, "list"=>2, "hand"=>1, "internet"=>1, "attribut"=>1, "instal"=>1, "file"=>1, "unabl"=>1, "vice"=>1, "stopbadwareorg"=>2, "merit"=>1, "decid"=>1, "flag"=>2, "saturdai"=>2, "hit"=>2, "offici"=>1, "error"=>3, "work"=>1, "site"=>5, "happen"=>2, "incid"=>1, "technic"=>1, "advis"=>1, "put"=>1, "human"=>3, "harm"=>2, "softwar"=>1, "ms"=>1, "affect"=>1, "carefulli"=>1, "product"=>1, "presid"=>1, "complaint"=>1, "potenti"=>2, "googl"=>6, "comput"=>2, "peopl"=>1, "investig"=>2, "consum"=>1, "danger"=>2, "period"=>1, "wrote"=>2, "search"=>7, "ascertain"=>1, "blog"=>1, "warn"=>2, "problem"=>1, "updat"=>2, "minut"=>1, "mayer"=>2}
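A word-count hash of this shape could be produced roughly as follows. This is only a minimal sketch: the `word_count` method and the stop-word list are hypothetical, and stemming (which the hash above clearly uses, giving forms such as "googl" and "servic") is left out so the example needs nothing beyond plain Ruby.

```ruby
# Hypothetical stop-word list; the talk does not say which one was used.
STOP_WORDS = %w[a an and the of to in is it on for that this was].freeze

# Count how often each word appears in a document.
# Punctuation is dropped and no stemming is applied (an assumption of this sketch).
def word_count(text)
  words = text.downcase.gsub(/[^a-z0-9\s]/, "").split
  words.reject { |w| STOP_WORDS.include?(w) }
       .each_with_object(Hash.new(0)) { |w, counts| counts[w] += 1 }
end

counts = word_count("Google search flagged the site. The search was in error.")
p counts
```

Stripping punctuation before splitting is consistent with entries like "stopbadwareorg" in the hash, but the exact cleaning rules used in the talk are not shown.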

Slide 25

Slide 25 text

Generate training data for prediction

Slide 26

Slide 26 text

Training data

Slide 27

Slide 27 text

category,googl,report,search,user,review,court,mckinnon,year,internet,microsoft,site,softwar,warn,browser,oper,expert,rise,lawyer,digit,extradit,sharpli,error,group,result,system,rebel,econom,presid,crisi,find,year,accus,global,obama,china,civilian,shrink,hous,wall,street,quarter,white,heavi,lehman,economi,session,ey,time,davo,human
not_interesting,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
not_interesting,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,5,0,2,0,0,0,3,0,0,0,3,1,0,0,0,0,0,3,0,0,0,0,0,0,2
not_interesting,0,1,0,0,0,0,0,2,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,3,0,3,1,2,0,2,0,0,0,0,0,0,0,0,0,0,3,1,3,1,0,2,0
not_interesting,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
not_interesting,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,2,0,0,1,2,1,4,0,0,2,0,0,0,2,0,0,0,0,2,0,1,0
not_interesting,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,3,3,0,0,0,0,0,0,0,2,0,0
not_interesting,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,2,0,0,2,0,0,2,1,0,0,2,1,0,0,2,0,0,1,0,0
interesting,6,0,7,5,0,0,0,0,1,0,5,1,2,0,0,0,0,0,0,0,0,3,0,3,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3
interesting,0,7,0,0,2,0,0,0,0,0,0,0,1,0,0,1,0,0,3,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
interesting,0,1,0,0,0,0,0,3,3,1,0,1,1,1,0,3,3,0,1,0,3,0,1,0,2,0,1,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,3,0

The top 25 most frequent words in the training dataset

Slide 28

Slide 28 text

[Same training data as Slide 27]
Each line represents 1 document trained

Slide 29

Slide 29 text

[Same training data as Slide 27]
Categories set when the classifier is created

Slide 30

Slide 30 text

[Same training data as Slide 27]
Number indicates the number of times the word appears in that document
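Rows like the training data on these slides could be generated from per-document word counts along the following lines. This is a sketch under assumptions: the `training_rows` method, the `top_n` parameter and the two sample documents are made up for illustration, and the way the talk actually picked its "top 25" words is not shown.

```ruby
# Hypothetical input: a category plus word counts for each trained document.
docs = [
  { category: "interesting",     counts: { "googl" => 6, "search" => 7, "user" => 5 } },
  { category: "not_interesting", counts: { "econom" => 3, "crisi" => 3, "year" => 2 } }
]

# Build a header of the most frequent words, then one row per document
# holding how often each header word appears in that document.
def training_rows(docs, top_n)
  totals = Hash.new(0)
  docs.each { |d| d[:counts].each { |word, n| totals[word] += n } }
  # Most frequent first; ties are broken alphabetically to keep the order stable.
  features = totals.sort_by { |word, n| [-n, word] }.first(top_n).map(&:first)

  header = ["category", *features]
  rows = docs.map { |d| [d[:category], *features.map { |w| d[:counts].fetch(w, 0) }] }
  [header, *rows]
end

rows = training_rows(docs, 4)
puts rows.map { |r| r.join(",") }
```

Each document becomes one comma-separated line starting with its category, matching the layout of the slides: a header of words followed by per-document counts.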

Slide 31

Slide 31 text

Test data

Slide 32

Slide 32 text

category,googl,report,search,user,review,court,mckinnon,year,internet,microsoft,site,softwar,warn,browser,oper,expert,rise,lawyer,digit,extradit,sharpli,error,group,result,system,rebel,econom,presid,crisi,find,year,accus,global,obama,china,civilian,shrink,hous,wall,street,quarter,white,heavi,lehman,economi,session,ey,time,davo,human
category,0,0,0,2,0,0,0,2,1,4,0,2,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0
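An unseen document can be turned into a test row of this shape by looking up each header word in its word counts. A small sketch, with a shortened header and made-up counts (both are assumptions for illustration):

```ruby
# Feature words taken from the training header (shortened here).
features = %w[googl report search user]

# Hypothetical word counts for a document that has not been classified yet.
counts = { "user" => 2, "blog" => 1 }

# Header words missing from the document get 0; words outside the header are ignored.
# The leading "category" appears to act as a placeholder, since the label is unknown.
test_row = ["category", *features.map { |word| counts.fetch(word, 0) }]
puts test_row.join(",")
```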

Slide 33

Slide 33 text

Using different classification models

Slide 34

Slide 34 text

NAÏVE BAYES
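To make the idea concrete without an R installation, here is a minimal pure-Ruby sketch of multinomial Naive Bayes with add-one (Laplace) smoothing over word-count rows. It is not the talk's code, which used R through Rserve; the `TinyNaiveBayes` class, its methods and the tiny training rows are hypothetical.

```ruby
# Minimal multinomial Naive Bayes over word-count vectors.
# All names are hypothetical; this is not the talk's R-based implementation.
class TinyNaiveBayes
  def initialize
    @word_totals = Hash.new { |h, k| h[k] = Hash.new(0) } # category => column index => count
    @doc_counts  = Hash.new(0)                            # category => number of documents
    @vocab_size  = 0
  end

  # row is an array of word counts, one entry per feature column.
  def train(category, row)
    @doc_counts[category] += 1
    row.each_with_index { |n, i| @word_totals[category][i] += n }
    @vocab_size = row.size
  end

  # Returns the category with the highest log-probability for the given row.
  def classify(row)
    total_docs = @doc_counts.values.sum.to_f
    @doc_counts.keys.max_by do |cat|
      cat_total = @word_totals[cat].values.sum
      log_prior = Math.log(@doc_counts[cat] / total_docs)
      log_likelihood = row.each_with_index.map do |n, i|
        # Add-one smoothing avoids log(0) for words never seen in this category.
        n * Math.log((@word_totals[cat][i] + 1.0) / (cat_total + @vocab_size))
      end.sum
      log_prior + log_likelihood
    end
  end
end

nb = TinyNaiveBayes.new
nb.train("interesting",     [6, 7, 5, 0])
nb.train("not_interesting", [0, 0, 0, 3])
puts nb.classify([2, 1, 0, 0])
```

Working in log space keeps the product of many small probabilities from underflowing, which matters once rows have dozens of columns like the training data in the slides.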

Slide 35

Slide 35 text

SVM

Slide 36

Slide 36 text

RANDOM FOREST

Slide 37

Slide 37 text

NEURAL NETWORKS

Slide 38

Slide 38 text

Using the classifier

Slide 39

Slide 39 text


Slide 40

Slide 40 text


Slide 41

Slide 41 text

RESOURCES
– HP Labs Worldwide http://www.hpl.hp.com/
– R Project http://www.r-project.org/
– RsRuby https://github.com/alexgutteridge/rsruby
– RinRuby http://rinruby.ddahl.org/
– Rserve http://www.rforge.net/Rserve/
– Rserve-Ruby-Client https://github.com/clbustos/Rserve-Ruby-client
– rApache http://rapache.net/index.html
– rRack https://github.com/jeffreyhorner/rRack/

Slide 42

Slide 42 text

Thank you
http://twitter.com/sausheong
http://blog.saush.com