Ruby and R

Integrating Ruby and R

Sau Sheong Chang

May 01, 2010

Transcript

  1. RUBY AND R
    Chang Sau Sheong, Director, Applied Research, HP Labs Singapore
    © Copyright 2010 Hewlett-Packard Development Company, L.P.
  2. R: programming language and platform for statistical computing, licensed under the GPL
  3. Extensive library of statistical computing packages (CRAN), written by statisticians
  4. • Fingerprint identification
    • Speech recognition
    • Financial forecasting
    • Face recognition
    • Credit scoring
    • Card fraud detection
    • Spam detection
    • Recommendation engine
    • OCR
    • Data mining
  5. CRAN – almost 2000 packages, mostly created by statisticians
    • BiodiversityR – GUI for biodiversity and community ecology analysis
    • Emu – analyze speech patterns
    • GenABEL – study human genome
    • Quantmod – quantitative financial modeling framework
    • Ftrading – technical trading analysis
    • Cyclones – cyclone identification
    • DOSim – disease analysis toolkit for gene set
    • Agricolae – statistical procedures for agricultural research
  6. EXAMPLE R CODE
    – EPL data from football-data.co.uk
    – Show home/away goals distribution for 2011 season
  7. Ruby:
      • Human-focused programming!
      • Better general-purpose programming capabilities
      • Great frameworks!
      • Great libraries (20,000+ gems in RubyGems)
    R:
      • Focus on statistical computing/crunching
      • Lots of packages written by domain experts/statisticians
      • Great graphing libraries
  8. RINRUBY
    – 100% Ruby
    – Uses pipes to send commands and evals
    – Uses TCP/IP sockets to send and retrieve data
    – Pros:
      • Doesn't require anything but R
      • Works flawlessly on Windows
      • Works with Ruby 1.8, 1.9 and JRuby 1.5
      • All API tested
    – Cons:
      • VERY SLOW in assigning
      • Very limited datatypes: only Vector and Matrix
      • Not released since 2009
      • Poor documentation
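
The "pipes to send commands" idea can be pictured with a generic sketch. This is not RinRuby's own code: because R may not be installed where this runs, a child Ruby process stands in for the R interpreter, and `CHILD_SCRIPT` and `eval_over_pipe` are hypothetical names made up for the illustration.

```ruby
require "rbconfig"

# Stand-in for the R interpreter (assumption: R may not be installed here).
# The child reads one expression per line on stdin, evaluates it, and
# prints the result on stdout.
CHILD_SCRIPT = '$stdout.sync = true; while (line = $stdin.gets); puts eval(line); end'

# Send each command down the pipe and read one line of result back.
def eval_over_pipe(commands)
  IO.popen([RbConfig.ruby, "-e", CHILD_SCRIPT], "r+") do |pipe|
    commands.map do |command|
      pipe.puts(command)
      pipe.flush
      pipe.gets.chomp
    end
  end
end

results = eval_over_pipe(["1 + 2", "[1, 2, 3].sum"])
puts results.inspect
```

Every command costs a write and a read across a process boundary, which is one plausible reason a pipe-based bridge is slower than a library linked directly into the interpreter.
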
  9. RSRUBY
    – C extension for Ruby, linked to R's shared library
    – Pros:
      • Blazing speed! 5-10 times faster than Rserve and 100-1000 times faster than RinRuby
      • Seamless integration with Ruby: every method and object is treated like a Ruby object
    – Cons:
      • Transformation between R and Ruby types isn't trivial
      • Dependent on operating system, Ruby implementation and R version
      • Not available for alternative implementations of Ruby (e.g. JRuby)
      • Not released since 2009
      • Poor documentation
  10. RSERVE
    – 100% Ruby
    – Uses TCP/IP sockets to interchange data and commands
    – Requires Rserve installed on the server machine
    – Access from Ruby uses the Rserve-Ruby-Client library
    – Pros:
      • Works with Ruby 1.8, 1.9 and JRuby 1.5
      • Sessions allow data to be processed asynchronously
      • Fast: 5-10 times faster than RinRuby
      • Most recently updated (Jan 2011)
    – Cons:
      • Requires Rserve
      • Limited features on Windows
      • Poor documentation
  11. RAPACHE/RRACK
    – Web service based
    – Run R scripts as web services, consumed by Ruby front-end apps
    – Pros:
      • Modular and separate (no direct integration)
      • Can be scalable, ‘cloud’-ready
    – Cons:
      • Requires rApache/rRack
      • rRack is very new (not accepted by CRAN yet, as of today!), requires R 2.13 (just released a few weeks ago)
      • rApache is specific to the Apache web server only
      • Communications overhead for smaller integrations
  12. TEXT CLASSIFICATION
    – Automatically sorting a set of documents into different categories from a predefined set
    – Classic uses:
      • Spam filtering
      • Email prioritization
    – Diagram labels: Training data, Test data, Classifier, category
  13. Train classifier by counting frequency of each word in the document
  14. What you get
    {"check"=>1, "result"=>3, "marissa"=>1, "experi"=>1, "click"=>1, "engin"=>1, "simpli"=>1, "mistakenli"=>1, "pick"=>1, "prevent"=>1, "40"=>1, "regularli"=>1, "place"=>1, "user"=>5, "prefer"=>1, "malevol"=>1, "access"=>1, "robust"=>1, "servic"=>1, "fault"=>1, "malici"=>1, "list"=>2, "hand"=>1, "internet"=>1, "attribut"=>1, "instal"=>1, "file"=>1, "unabl"=>1, "vice"=>1, "stopbadwareorg"=>2, "merit"=>1, "decid"=>1, "flag"=>2, "saturdai"=>2, "hit"=>2, "offici"=>1, "error"=>3, "work"=>1, "site"=>5, "happen"=>2, "incid"=>1, "technic"=>1, "advis"=>1, "put"=>1, "human"=>3, "harm"=>2, "softwar"=>1, "ms"=>1, "affect"=>1, "carefulli"=>1, "product"=>1, "presid"=>1, "complaint"=>1, "potenti"=>2, "googl"=>6, "comput"=>2, "peopl"=>1, "investig"=>2, "consum"=>1, "danger"=>2, "period"=>1, "wrote"=>2, "search"=>7, "ascertain"=>1, "blog"=>1, "warn"=>2, "problem"=>1, "updat"=>2, "minut"=>1, "mayer"=>2}
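
The counting step that produces a word-to-count hash like the one above can be sketched in a few lines of Ruby. This is a simplified illustration, not the code from the talk: the keys in the deck's hash look stemmed (for example "googl" and "experi"), whereas this sketch does no stemming, and the stop-word list and the `word_counts` name are assumptions.

```ruby
# Hypothetical stop-word list; the deck does not show the one it used.
STOP_WORDS = %w[a an and the of to in is it that for on with].freeze

# Count how often each remaining word occurs in one document.
# Simplifications: lowercase everything, split on non-alphanumerics,
# drop stop words, and skip stemming.
def word_counts(text)
  text.downcase
      .scan(/[a-z0-9]+/)
      .reject { |word| STOP_WORDS.include?(word) }
      .each_with_object(Hash.new(0)) { |word, counts| counts[word] += 1 }
end

counts = word_counts("Google search flagged the site, and the search warned users.")
puts counts.inspect
```
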
  15. The top 25 most frequent words in the training dataset
    category,googl,report,search,user,review,court,mckinnon,year,internet,microsoft,site,softwar,warn,browser,oper,expert,rise,lawyer,digit,extradit,sharpli,error,group,result,system,rebel,econom,presid,crisi,find,year,accus,global,obama,china,civilian,shrink,hous,wall,street,quarter,white,heavi,lehman,economi,session,ey,time,davo,human
    not_interesting,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
    not_interesting,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,5,0,2,0,0,0,3,0,0,0,3,1,0,0,0,0,0,3,0,0,0,0,0,0,2
    not_interesting,0,1,0,0,0,0,0,2,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,3,0,3,1,2,0,2,0,0,0,0,0,0,0,0,0,0,3,1,3,1,0,2,0
    not_interesting,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
    not_interesting,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,2,0,0,1,2,1,4,0,0,2,0,0,0,2,0,0,0,0,2,0,1,0
    not_interesting,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,3,3,0,0,0,0,0,0,0,2,0,0
    not_interesting,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,2,0,0,2,0,0,2,1,0,0,2,1,0,0,2,0,0,1,0,0
    interesting,6,0,7,5,0,0,0,0,1,0,5,1,2,0,0,0,0,0,0,0,0,3,0,3,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3
    interesting,0,7,0,0,2,0,0,0,0,0,0,0,1,0,0,1,0,0,3,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
    interesting,0,1,0,0,0,0,0,3,3,1,0,1,1,1,0,3,3,0,1,0,3,0,1,0,2,0,1,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,3,0
    interesting,
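
Picking the most frequent words for the header row can be sketched as below. The deck does not show how its word list was assembled, so `top_words`, the alphabetical tie-break, and the tiny sample data are all assumptions made for illustration.

```ruby
# Sum per-document counts and return the `limit` most frequent words.
# Ties are broken alphabetically, an arbitrary choice made so the
# result is deterministic.
def top_words(count_hashes, limit)
  totals = Hash.new(0)
  count_hashes.each do |counts|
    counts.each { |word, n| totals[word] += n }
  end
  totals.sort_by { |word, n| [-n, word] }.first(limit).map { |word, _n| word }
end

# Hypothetical per-document counts for one category.
interesting_docs = [
  { "googl" => 6, "search" => 7, "user" => 5 },
  { "report" => 7, "review" => 2, "search" => 1 }
]
vocabulary = top_words(interesting_docs, 3)
puts vocabulary.inspect
```

Running the same selection once per category and concatenating the results would explain a header in which a word such as "year" appears twice, though that is only a guess about how the slide's header was built.
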
  16. Same CSV as slide 15, annotated: each line represents one training document
  17. Same CSV as slide 15, annotated: categories are set when the classifier is created
  18. Same CSV as slide 15, annotated: each number is the number of times the word appears in that document
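
The CSV layout described on these slides (one line per document, the category first, then one count per header word) can be reproduced with a short sketch. `feature_rows` and the sample documents are hypothetical; the deck does not show the code that wrote its file.

```ruby
# Build the header plus one CSV line per document: the category label,
# then the count of each vocabulary word (0 when the word is absent).
def feature_rows(vocabulary, labelled_docs)
  header = (["category"] + vocabulary).join(",")
  rows = labelled_docs.map do |category, counts|
    ([category] + vocabulary.map { |word| counts.fetch(word, 0) }).join(",")
  end
  [header] + rows
end

# Hypothetical vocabulary and word counts, far smaller than the deck's.
vocabulary = %w[googl search user obama china]
labelled_docs = [
  ["interesting",     { "googl" => 6, "search" => 7, "user" => 5 }],
  ["not_interesting", { "obama" => 4, "china" => 1, "hous" => 2 }]
]
lines = feature_rows(vocabulary, labelled_docs)
puts lines
```

Words outside the vocabulary ("hous" here) are simply dropped, which keeps every row the same width so R can read the file as a rectangular table.
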
  19. category,googl,report,search,user,review,court,mckinnon,year,internet,microsoft,site,softwar,warn,browser,oper,expert,rise,lawyer,digit,extradit,sharpli,error,group,result,system,rebel,econom,presid,crisi,find,year,accus,global,obama,china,civilian,shrink,hous,wall,street,quarter,white,heavi,lehman,economi,session,ey,time,davo,human
    category,0,0,0,2,0,0,0,2,1,4,0,2,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0
  20. RESOURCES
    – HP Labs Worldwide http://www.hpl.hp.com/
    – R Project http://www.r-project.org/
    – RsRuby https://github.com/alexgutteridge/rsruby
    – RinRuby http://rinruby.ddahl.org/
    – Rserve http://www.rforge.net/Rserve/
    – Rserve-Ruby-Client https://github.com/clbustos/Rserve-Ruby-client
    – rApache http://rapache.net/index.html
    – rRack https://github.com/jeffreyhorner/rRack/