Microsoft R Server: Analytics and Deployment at scale

tiagoh
January 30, 2017

Microsoft R Server (MRS) is the most broadly deployable enterprise-class analytics platform for R available today. With a variety of big data statistics, predictive modeling and machine learning capabilities, R Server supports the full range of analytics (exploration, analysis, visualization and modeling) based on open source R.

In this talk we will go through a technical overview of the capabilities of the most recent release, MRS 9.0, such as state-of-the-art machine learning algorithms, simplified operationalization of R models and additional support for Spark.

Bio: Tiago is an Advanced Analytics and Big Data Specialist at Microsoft, where he works with customers and partners across Europe to bring the value of Advanced Analytics into multiple business use cases. He focuses on the technical and architecture level, helping customers get better insights from their data across Azure, the Microsoft Intelligent Cloud. In the past Tiago worked for companies focused exclusively on the cloud, such as Opscaling and Amazon Web Services. Before that he worked at Bestiario, a world-leading interactive data visualization company.

Transcript

  1. Microsoft R Server: Analytics and Deployment at scale

    Tiago Henriques, Advanced Analytics & Big Data Specialist, Microsoft Corporation. Luxembourg Data Science Meetup, 30/01/2017.
  2. Transform data into intelligent action

    Layers shown: Information Management, Big Data Stores, Machine Learning and Analytics, Intelligence, Dashboards & Visualizations.
    Services shown: Event Hubs, Data Catalog, Data Factory, SQL Data Warehouse, Data Lake Store, HDInsight (Hadoop and Spark), Stream Analytics, Data Lake Analytics, Machine Learning, Power BI, Cortana, Bot Framework, Cognitive Services.
    Applications: Web, Mobile, Bots.
  3. Microsoft & Machine Learning Investments: answering questions with experience

    Milestones on the timeline, each with the question it answers:
    • Microsoft Research formed
    • Hotmail launches: Which email is junk?
    • Bing maps launches: What’s the best way home?
    • Bing search launches: Which searches are most relevant?
    • Kinect launches: What does that motion “mean”?
    • Skype Translator launches: What is that person saying?
    • Azure Machine Learning GA: What will happen next?
    Years on the timeline: 1991, 1997, 2008, 2009, 2010, 2014, 2015.
    Also shown: Revolution Analytics; CNTK (Computational Network Toolkit, deep learning); Cortana Analytics Suite (advanced analytics platform).
  4. Microsoft & Machine Learning Investments: answering questions with experience (Advanced Analytics)

    2016: TLC++, Notebooks, RML, Data Science Professional Degree, Join openAI.com
  5. Azure Machine Learning: what is the feature?

    Azure Machine Learning is a fully managed Platform as a Service in the cloud, integrated with data sources such as HDInsight, Azure SQL Database and SQL in a VM. Based mainly on the open source languages R and Python, it leverages algorithms from businesses like Bing and Xbox, in more than 350 packages. The resulting APIs can then be published in the marketplace.
    Azure ML APIs in the marketplace: Wealth Score, Giving Score, Frequently Bought Together, Recommendations, Anomaly Detection, Lexicon Based Sentiment Analysis, Forecasting-Exponential Smoothing, etc.
  6. What is R?

    • A statistics programming language
    • Data analysis & visualization capabilities
    • Majority of data scientists use R
    • Thriving user groups worldwide
    • Vibrant open source community
    • 9,000+ free algorithms in CRAN
    • New and recent grads use it
    • Strong ties to academia feed ever-growing machine learning capabilities
    • Constantly innovating
    Callouts: #1 Language Advanced Analytics; 2.5M+ Users; Open; Biggest Ecosystem.
  7. R Usage Growth (Rexer Data Miner Survey, 2007-2015) and Language Popularity (IEEE Spectrum Top Programming Languages, 2016)

    76% of analytic professionals report using R; 36% select R as their primary tool.
  8. R Open and R Server

    Diagram labels: R Server, R Open, Commercial Support, Community; platforms: SQL Server R Services, Linux, Hadoop, Teradata, Windows.
    R Open:
    • Free and open source R distribution
    • Enhanced and distributed by Microsoft
    • It is the foundation of MRS
  9. • Microsoft R Open uses the Intel Math Kernel Library (MKL)

    • More efficient, multi-threaded math computation
    • Benefits math-intensive processing
    • No benefit to program logic and data transforms
  10. • Enhanced open source R distribution

    • Compatible with all R-related software
    • Revolution's open source R packages
    • MRAN website: mran.revolutionanalytics.com
    • Open source (GPLv2 license): 100% free to download, use and share
  11. 21

  12. ?

  13. Introducing Microsoft R Server

    Microsoft R Server is a broadly deployable enterprise-class analytics platform based on R that is supported, scalable and secure. With a variety of big data statistics, predictive modeling and machine learning capabilities, R Server supports the full range of analytics: exploration, analysis, visualization and modeling.
    Diagram labels: R Open, R Server.
  14. R Open and Microsoft R Server components

    ConnectR
    • High-speed & direct connectors
    • Available for: high-performance XDF; SAS, SPSS, delimited & fixed format text data files; Hadoop HDFS (text & XDF); Teradata Database (TPT); EDWs and ADWs; ODBC
    ScaleR
    • Ready-to-use high-performance big data big analytics
    • Fully-parallelized analytics
    • Data prep & data distillation
    • Descriptive statistics & statistical tests
    • Range of predictive functions
    • User tools for distributing customized R algorithms across nodes
    • Wide data sets supported: thousands of variables
    DistributedR
    • Distributed computing framework
    • Delivers cross-platform portability
    R+CRAN
    • Open source R interpreter
    • R 3.3.2
    • Freely-available huge range of R algorithms
    • Algorithms callable by MSR
    • Embeddable in R scripts
    • 100% compatible with existing R scripts, functions and packages
    RevoR
    • Performance-enhanced R interpreter
    • Based on open source R
    • Adds high-performance math library to speed up linear algebra functions
  15. Stream data into RAM in blocks

    • “Big Data” can be any data size: we handle megabytes to gigabytes to terabytes.
    • Our ScaleR algorithms work inside multiple cores / nodes in parallel at high speed.
    • Interim results are collected and combined analytically to produce the output on the entire data set.
    • The XDF file format is optimised to work with the ScaleR library and significantly speeds up iterative algorithm processing.
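
    The block-wise idea on this slide can be illustrated with a minimal sketch in Python. It is not ScaleR code: the names `Partial`, `read_blocks`, `summarize_block`, `combine` and `streaming_summary` are hypothetical, and the merge rule is the standard pairwise formula for combining means and sums of squared deviations. Only one block is held in memory at a time, and the interim results are combined analytically into a result for the whole data set.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Partial:
    """Interim result for one block: count, mean, sum of squared deviations."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0


def read_blocks(values: Iterable[float], block_size: int) -> Iterator[List[float]]:
    """Yield the data one block at a time, so only one block is in memory."""
    block: List[float] = []
    for v in values:
        block.append(v)
        if len(block) == block_size:
            yield block
            block = []
    if block:
        yield block


def summarize_block(block: List[float]) -> Partial:
    """Compute the interim result for a single block."""
    n = len(block)
    mean = sum(block) / n
    m2 = sum((v - mean) ** 2 for v in block)
    return Partial(n, mean, m2)


def combine(a: Partial, b: Partial) -> Partial:
    """Merge two interim results without revisiting the raw data."""
    if a.n == 0:
        return b
    if b.n == 0:
        return a
    n = a.n + b.n
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta ** 2 * a.n * b.n / n
    return Partial(n, mean, m2)


def streaming_summary(values: Iterable[float], block_size: int) -> Partial:
    """Stream blocks, summarize each, and fold the interim results together."""
    total = Partial()
    for block in read_blocks(values, block_size):
        total = combine(total, summarize_block(block))
    return total


if __name__ == "__main__":
    result = streaming_summary(range(1, 11), block_size=3)
    print(result.n, result.mean, result.m2 / (result.n - 1))
```

    Because `combine` is associative, the per-block summaries could equally be computed on different cores or nodes and merged in any order.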
  16. R Server: scale-out R, Enterprise Class!

    • 100% compatible with open source R
    • Any code/package that works today with R will work in R Server
    • Wide range of scalable and distributed R functions
    • Examples: rxDataStep(), rxSummary(), rxGlm(), rxDForest(), rxPredict()
    • Ability to parallelize any R function
    • Ideal for parameter sweeps, simulation or multiple runs
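
    The parameter-sweep use case can be sketched generically in Python. This is not R Server code: `simulate`, `parameter_sweep` and `best_run` are hypothetical names, and the scoring function is a toy stand-in for a model fit. The sketch shows the pattern only: independent runs of one function over a grid, dispatched to a pool of workers.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple


def simulate(params: Tuple[float, int]) -> Dict[str, float]:
    """Hypothetical task: score one (alpha, depth) combination."""
    alpha, depth = params
    score = -((alpha - 0.3) ** 2) - ((depth - 4) ** 2) / 100
    return {"alpha": alpha, "depth": depth, "score": score}


def parameter_sweep(
    task: Callable[[Tuple], Dict[str, float]],
    grid: Dict[str, Sequence],
    executor: Executor,
) -> List[Dict[str, float]]:
    """Run the task once per grid combination; each run is independent."""
    combos = list(product(*grid.values()))
    return list(executor.map(task, combos))


def best_run(results: List[Dict[str, float]]) -> Dict[str, float]:
    """Pick the run with the highest score."""
    return max(results, key=lambda r: r["score"])


if __name__ == "__main__":
    grid = {"alpha": [0.1, 0.3, 0.5], "depth": [2, 4, 6]}
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = parameter_sweep(simulate, grid, pool)
    print(best_run(results))
```

    A thread pool keeps the sketch portable; for CPU-bound work in Python a process pool would be the usual substitute, since the function accepts any `Executor`.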
  17. New Machine Learning Package

    • GPU-accelerated DNNs
    • Fast linear learner, with support for L1 and L2 regularization
    • Fast boosted decision tree
    • Fast random forest
    • Logistic regression, with support for L1 and L2 regularization
    • Binary classification using a One-Class Support Vector Machine
    Operationalization Enhancements
    • Expose R models as web services with one line of code
    • Ease of application integration with Swagger support
    • Write once & deploy on multiple platforms
    • High availability
    Ease of use enhancements
    • Support for all three distributions of Hadoop on three flavors of Linux
    • Support for Hive and Parquet data sources
    • Support for OLAP cubes as data source
    • Generate T-SQL stored procedures with one line of code
    • New R Client 3.3.2
    New solution template for campaign optimization
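
    To make "L1 and L2 regularization" concrete, here is a small Python sketch of a logistic-regression loss with both penalties. It is a generic textbook formulation, not the package's implementation: the slides do not state how the package parameterizes its penalties, so the function names and the `l1` / `l2` weights here are assumptions.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def penalty(weights: Sequence[float], l1: float, l2: float) -> float:
    """L1 term (sum of |w|) plus L2 term (sum of w^2), each with its own weight."""
    return l1 * sum(abs(w) for w in weights) + l2 * sum(w * w for w in weights)


def regularized_log_loss(
    weights: Sequence[float],
    bias: float,
    rows: Sequence[Sequence[float]],
    labels: Sequence[int],
    l1: float = 0.0,
    l2: float = 0.0,
) -> float:
    """Mean logistic loss over the rows plus the L1/L2 penalty on the weights."""
    total = 0.0
    for x, y in zip(rows, labels):
        p = sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(rows) + penalty(weights, l1, l2)


if __name__ == "__main__":
    rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    labels = [1, 0, 1]
    print(regularized_log_loss([1.0, -2.0], 0.0, rows, labels, l1=0.5, l2=0.25))
```

    The L1 term pushes individual weights to exactly zero (sparse models), while the L2 term shrinks all weights smoothly; using both is a common compromise.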
  18. Write Once Deploy Anywhere

    ScaleR functions can run in-Hadoop or in-Database without any functional R recoding. The R script does not need to change to run across different platforms.

    Local Parallel (Linux or Windows):

        # SETUP LINUX ENVIRONMENT VARIABLES
        rxSetComputeContext("localpar")
        # CREATE LINUX, DIRECTORY AND FILE OBJECTS
        linuxFS <- RxNativeFileSystem()
        AirlineDataSet <- RxXdfData("AirlineDemoSmall.xdf", fileSystem = linuxFS)

    In SQL Server:

        # SETUP SQLSERVER ENVIRONMENT VARIABLES
        mySqlServer <- RxInSqlServer()
        # SQL SERVER COMPUTE CONTEXT AND TABLE REF
        rxSetComputeContext(mySqlServer)
        AirlineDataSet <- RxSqlServerData(table = "AirlineDemoSmall")

    In Hadoop:

        myHadoopCluster <- RxHadoopMR()
        rxSetComputeContext(myHadoopCluster)
        hdfsFS <- RxHdfsFileSystem()
        hdfsFS

    Shared by all three compute contexts:

        ### ANALYTICAL PROCESSING ###
        ### Statistical Summary of the data
        rxSummary(~ ArrDelay + DayOfWeek, data = AirlineDataSet, reportProgress = 1)
        ### CrossTab the data
        rxCrossTabs(ArrDelay ~ DayOfWeek, data = AirlineDataSet, means = TRUE)
        ### Linear Model and plot
        hdfsXdfArrLateLinMod <- rxLinMod(ArrDelay ~ DayOfWeek + 0, data = AirlineDataSet)
        plot(hdfsXdfArrLateLinMod$coefficients)
  19. Write Once, Deploy Anywhere: R Server portfolio

    R Server Technology across: Cloud, RDBMS, Desktops & Servers, Hadoop & Spark, EDW.
  20. Diagram: ScaleR, Production, RStudio Server Pro, Microsoft R Server

    Steps shown: 1. Copy 2. Stream 3. Send
  21. DistributedR: Hadoop Processing Methods

    Method 1 (“Beside” or “Edge”, copy to local file): local (Linux) parallel processing using all cores on one node, copying data from HDFS to store in the local Linux file system.
    Method 2: local (Linux) parallel processing using all cores on one node, streaming data from / to HDFS.
    Diagram labels (both methods): compute contexts Local Parallel and Hadoop; Linux (local) file system and HDFS, each holding Csv and Xdf data; read / write on the Linux FS; 1 edge node; 1:n data nodes; 1:n disks; 1:(n x number of nodes) disks.
  22. Method 3 (“inside”): R script sent to data nodes, HDFS read / write

    R model script sent to Master Node:
    1. Starts a master process
    2. Distribute work
    3. Master tasks for each node
    4. Master initiates distributed work
        1. Hadoop schedules mapper for each split
        2. Algorithm computes intermediate result
        3. Reducer combines intermediate results
    5. Master process evaluates completion
    6. Iterates as required by the algorithm
    7. Returns consolidated answer to script
    Diagram labels: compute contexts Local Parallel and Hadoop; Linux (local) file system and HDFS holding Csv and Xdf data; 1 edge node; 1:n nodes; 1:n disks; 1:(n x number of nodes) disks.
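
    The master / mapper / reducer loop described for Method 3 can be sketched in plain Python. This is an illustration of the control flow only, with no Hadoop involved: the choice of one-dimensional k-means as the iterative algorithm, and the names `mapper`, `reducer` and `master`, are assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Partial = List[Tuple[float, int]]  # per-center (sum of points, count)


def mapper(split: Sequence[float], centers: Sequence[float]) -> Partial:
    """Intermediate result for one split: assign points to the nearest center."""
    partial: Partial = [(0.0, 0) for _ in centers]
    for x in split:
        k = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
        s, c = partial[k]
        partial[k] = (s + x, c + 1)
    return partial


def reducer(partials: Sequence[Partial]) -> Partial:
    """Combine the intermediate results from all splits."""
    combined: Partial = [(0.0, 0)] * len(partials[0])
    for partial in partials:
        combined = [
            (s1 + s2, c1 + c2) for (s1, c1), (s2, c2) in zip(combined, partial)
        ]
    return combined


def master(
    splits: Sequence[Sequence[float]],
    centers: Sequence[float],
    tol: float = 1e-9,
    max_iter: int = 100,
) -> List[float]:
    """Distribute work, evaluate completion, and iterate as the algorithm requires."""
    current = list(centers)
    for _ in range(max_iter):
        partials = [mapper(split, current) for split in splits]  # one mapper per split
        combined = reducer(partials)
        updated = [s / c if c else old for (s, c), old in zip(combined, current)]
        if max(abs(a - b) for a, b in zip(updated, current)) < tol:
            return updated  # consolidated answer returned to the script
        current = updated
    return current


if __name__ == "__main__":
    splits = [[1.0, 2.0, 10.0], [11.0, 3.0], [12.0]]
    print(master(splits, centers=[0.0, 5.0]))
```

    Each mapper only ever sees its own split, and only the small per-center sums travel to the reducer, which is what makes the pattern suit data that stays on the data nodes.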
  23. Analytic data set size and processing complexity (e.g. simple summary statistics vs iterative algorithm) guide the use of Method 1 and 2 (Edge Node / Server Linux local processing) vs Method 3 (in-Hadoop processing)

    Chart axes: Processing Complexity (Low, Medium, High) against Data Size (Small Data < 10GB, Medium Data < 50GB, Bigger Data > 50GB).
    Legend: Edge Node Linux processing, In-Hadoop processing, Local Linux file-system, Hadoop file-system.
  24. Typical advanced analytics lifecycle

    Stages: Ingest, Transform, Explore, Model, Deploy, Score, Visualize, Measure.
    Phases: Preparation, Modeling, Operationalization.
  25. mrsdeploy

    • Ground-up re-architecture based on ASP.NET Core with simplified APIs
    • Streamlined install experience fully integrated into Microsoft R Server
    • Turn your R script into a web service with a single line of R code
    • Simplified integration experience using Swagger-based APIs; you can now consume R web services from applications in any language
    • High scale and high availability: active-active grid computing support
    • mrsdeploy package to run R scripts against a remote Microsoft R Server: you can execute R code and control your working directory files and packages
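
    As an illustration of consuming a published R web service from another language, the Python sketch below only builds the HTTP requests and sends nothing. Everything specific in it is an assumption to be checked against the Swagger file generated for the actual service: the `/login` and `/api/<service>/<version>` paths, the bearer-token header, port 12800, and the hypothetical service name `fraudScore` with its input fields.

```python
import json
from typing import Any, Dict
from urllib.request import Request


def build_login_request(host: str, username: str, password: str) -> Request:
    """Build (but do not send) a login request; the /login path is an assumption."""
    body = json.dumps({"username": username, "password": password}).encode("utf-8")
    return Request(
        f"{host}/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_service_request(
    host: str, service: str, version: str, token: str, inputs: Dict[str, Any]
) -> Request:
    """Build (but do not send) a call to a published service.

    The /api/<service>/<version> layout and the bearer token are assumptions;
    the service's Swagger document is the authority on the real contract.
    """
    body = json.dumps(inputs).encode("utf-8")
    return Request(
        f"{host}/api/{service}/{version}",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical host, service name, version, token and inputs.
    req = build_service_request(
        "http://localhost:12800",
        "fraudScore",
        "v1.0.0",
        token="<access token from login>",
        inputs={"balance": 1200.5, "numTrans": 14},
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

    Sending the request would be a matter of passing it to `urllib.request.urlopen`; a client generated from the Swagger file would replace this hand-written plumbing in practice.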
  26. Easy Setup, Easy Deployment, Easy Integration, Easy Consumption

    Easy Setup:
    • In-cloud or on-prem
    • Adding nodes to scale
    • High availability & load balancing
    • Remote execution server
    Diagram labels: Data Scientist, Developer, Microsoft R Client (mrsdeploy package), publishService, Microsoft R Server configured for operationalizing R analytics.
  27. Example: R as a service for BI / web apps

    • Example: fraud analytics deployed to BI tool. The size of each circle indicates credit card balance, and the darkness of the circle shows the prediction of fraud.
    • Example: Market Basket Analysis in HTML tool
    • Example: integration with Excel
  28. Example: C# web app business user front end for portfolio optimisation

    Parameters selected and entered by the user are used to run various R models and produce different R model outputs.
  29. New Package: MML

    • Microsoft Machine Learning package
    • New algorithms based on MS TLC (The Learning Code)
    • Currently available on Windows
    • Other platforms on road map
    • Supplements ScaleR: works with ScaleR data sources
    • Fast, parallel algorithms
    Scenarios and examples:
    • Text Analytics: sentiment analysis, ticket categorization, topic clustering
    • High Cardinality factors: high dimension categorical features
    • Anomaly Detection (with One-SVM): fraud/spam detection; other binary classification tasks
    Learners, strengths and sample applications (in slide order):
    • Deep Neural Networks: non-linear model that allows combining a medium number of features that have very strong correlations; GPU acceleration. Bing Ads Click Prediction ($50M per year revenue gain)
    • Fast Linear: L1 & L2 regularization
    • Logistic Regression: L1 & L2 regularization
    • Fast Boosted Tree / Fast Random Forest: state-of-the-art tree ensembles. Used in >200k experiments internally in 2014 alone
  30. Microsoft R Product Family

    Diagram labels: Microsoft R Open, R Client, R Server, ScaleR; platforms: SQL Server, Hadoop, Teradata, Linux.
  31. • Get started today on Azure

    • Data Science quick-start
    • Multiple languages: R, Python, Julia
    • Data Exploration and Visualization
    • Development Tools
    Available on Azure Marketplace: Linux, Windows
  32. • Combine the best of open source and Microsoft innovation

    • Simplify operationalization of new insights with enterprise-grade security
    • Future-proof advanced analytics investments
    • Deliver high performance analytics wherever data lives