Slide 1

Slide 1 text

© MapD 2018
The Need for Speed: How the Auto Industry Accelerates Machine Learning with Visual Analytics
Zach Izham, VW | Aaron Williams, MapD
March 27, 2018

Slide 2

Slide 2 text

Introductions
• Aaron Williams, VP of Global Community: @_arw_, /in/aaronwilliams/, /williamsaaron
• Zach Izham, Legend: @drizham, /in/dr-zach-izham-02090b5/, /drizham
• Asghar Ghorbani, Data Scientist: @ghorbani_asghar, /in/aghorbani/, /a-ghorbani
slides: https://speakerdeck.com/mapd/

Slide 3

Slide 3 text

Agenda
• A Real World Problem: Churn
• Partial Dependency Analysis - An Accelerated Review
• A Complete Machine Learning Pipeline
• Demo: Data Engineering + Training + Predictive Analytics + Black Box Interrogation
• The GPU Data Frame in Action
• GO.ai and MapD
• Q&A

Slide 4

Slide 4 text

“Every business will become a software business, build applications, use advanced analytics and provide SaaS services.” - Smart CEO Guy

Slide 5

Slide 5 text

The Evolution of Data for Competition
1. Collect It
2. Make It Actionable
3. Make It Predictive

Slide 6

Slide 6 text

Partial Dependency Analysis
An Accelerated Review

Slide 7

Slide 7 text

Example: Partial Dependency
Assume the following example, the failure rate of a machine component:
• Depends only on the hours of work (HoW) of the component, h
• Does not (within reason) depend on the age of the component
Assume the failure rate, f(x), depends only on hours of work and not on age:
where α and β are constants that depend on the machine operating conditions.
L. Greene et al. Simpson’s paradox: A cautionary tale in advanced analytics. Significance, 2012.

Slide 8

Slide 8 text

Simpson’s Paradox
Example: Partial Dependency

Slide 9

Slide 9 text

Impact of Each Variable on Target Value
Example: Partial Dependency
$f_S(X_S) = E_{X_C}\left[ f(X_S, X_C) \mid X_S \right]$
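A minimal sketch of why the conditional average can mislead on observed data. Everything here is an assumption made for illustration: the linear form of `failure_rate`, the constants `ALPHA` and `BETA`, and the synthetic fleet in which older components have simply worked more hours. The model ignores age entirely, yet the average conditioned on age still rises with age.

```python
import random
from collections import defaultdict

ALPHA, BETA = 0.02, 0.1  # hypothetical constants, not taken from the slides


def failure_rate(hours, age):
    """Assumed model: depends on hours of work only; age is ignored."""
    return ALPHA * hours + BETA


def observed_fleet(n=5000, seed=0):
    """Synthetic observations where older components have worked more hours."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.randint(1, 10)                   # years
        hours = age * 100 + rng.uniform(-50, 50)   # hours correlate with age
        rows.append((hours, age))
    return rows


def conditional_average_by_age(rows):
    """Estimate E[f(hours, age) | age] from the observed rows."""
    sums, counts = defaultdict(float), defaultdict(int)
    for hours, age in rows:
        sums[age] += failure_rate(hours, age)
        counts[age] += 1
    return {age: sums[age] / counts[age] for age in sorted(sums)}


cond = conditional_average_by_age(observed_fleet())
for age, value in cond.items():
    print(age, round(value, 2))
```

The apparent age effect comes only from the correlation between age and hours in the collected data, which is the kind of trap the Simpson's paradox slide warns about.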

Slide 10

Slide 10 text

Generating Data for the Complete State Space
The Failure Rate Generated from the Trained Black Box Model
T. Hastie et al. The Elements of Statistical Learning, 2001.
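Generating data for the complete state space can be sketched as a Cartesian product of per-variable grids, with the trained model queried at every point. The `black_box` function below is a hypothetical stand-in for the trained model, and the grid ranges are illustrative assumptions.

```python
from itertools import product


def linspace(lo, hi, n):
    """n evenly spaced values from lo to hi inclusive."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def black_box(hours, age):
    """Hypothetical stand-in for the trained model."""
    return 0.02 * hours + 0.1


def full_state_space(grids):
    """Yield every combination of the per-variable grid values."""
    names = list(grids)
    for values in product(*(grids[name] for name in names)):
        yield dict(zip(names, values))


grids = {
    "hours": linspace(0.0, 1000.0, 11),  # 11 grid values
    "age": linspace(1.0, 10.0, 10),      # 10 grid values
}

# One simulated row per grid point, labelled with the model's prediction.
simulated = [
    dict(point, failure_rate=black_box(**point))
    for point in full_state_space(grids)
]
print(len(simulated))
```

Unlike the collected data, every combination of hours and age appears exactly once, so the variables are decorrelated by construction.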

Slide 11

Slide 11 text

Investigating the System with the Simulated Data
Partial Dependency Analysis
Impact of each variable on target value:
$f_S(X_S) = E_{X_C}\left[ f(X_S, X_C) \right]$
T. Hastie et al. The Elements of Statistical Learning, 2001.
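The marginal average in this formula can be sketched in a few lines: fix the variable of interest at each grid value, vary the complement variables over their whole grid, and average the model output. The `black_box` model and the grids are hypothetical; a real pipeline would query the trained model instead.

```python
def black_box(hours, age):
    """Hypothetical stand-in for the trained model; it ignores age."""
    return 0.02 * hours + 0.1


def partial_dependence(model, target, target_grid, complement_grid):
    """Return [(x_s, f_S(x_s))], where f_S averages model(x_s, x_c) over x_c.

    complement_grid is a list of dicts, one per setting of the other variables.
    """
    curve = []
    for xs in target_grid:
        preds = [model(**{target: xs}, **xc) for xc in complement_grid]
        curve.append((xs, sum(preds) / len(preds)))
    return curve


ages = [float(a) for a in range(1, 11)]
hours = [100.0 * i for i in range(11)]

pd_age = partial_dependence(black_box, "age", ages, [{"hours": h} for h in hours])
pd_hours = partial_dependence(black_box, "hours", hours, [{"age": a} for a in ages])

print(pd_age[0], pd_age[-1])      # flat: age has no effect on the model
print(pd_hours[0], pd_hours[-1])  # rising: hours drive the failure rate
```

Because the average runs over the full grid rather than over the observed combinations, the curve for age is flat, which matches how the model was defined.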

Slide 12

Slide 12 text

$f_S(X_S) = E_{X_C}\left[ f(X_S, X_C) \right]$
$f_S(X_S) = E_{X_C}\left[ f(X_S, X_C) \mid X_S \right]$
Collect Data to Build a Model
Generate Data for the Whole State Space

Slide 13

Slide 13 text

Why Do We Need GPUs?

Slide 14

Slide 14 text

Data Size Explosion
Coarse Grid: Small Data
Dense Grid / Data dimensionality: Large Data
Grid resolution 10:
● 1 variable: 10
● 2 variables: 10 x 10 = 100
● ...
● 10 variables: 10^10 = 10,000,000,000
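The growth above is simply resolution raised to the number of variables. A short sketch makes the memory cost concrete; the 4 bytes per value (a 32-bit float) and the one-value-per-variable row layout are assumptions for illustration, not figures from the talk.

```python
def grid_points(resolution, n_variables):
    """Number of points in a full grid with the given per-variable resolution."""
    return resolution ** n_variables


def grid_bytes(resolution, n_variables, bytes_per_value=4):
    """Rough size of the materialised grid: one row per point, one value per variable."""
    return grid_points(resolution, n_variables) * n_variables * bytes_per_value


for n in (1, 2, 10):
    print(n, grid_points(10, n), grid_bytes(10, n))
```

Under these assumptions the 10-variable grid needs about 400 GB before any model output is stored.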

Slide 15

Slide 15 text

Analysis Logistics

Slide 16

Slide 16 text

Data Engineering
Some Background - Objective - Creating the Master Data Frame
• Relevant data reside in separate tables and databases
• Collecting, cleaning and curating the relevant data
• Creating a target variable: which cars will not be returning to the garage / service center for service
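One way such a target variable could be derived is sketched below. The slides do not say how churn was defined, so the rule used here (no service visit within `window_days` before a cutoff date), the 540-day default, and the record layout are all hypothetical.

```python
from datetime import date, timedelta


def label_churn(service_visits, cutoff, window_days=540):
    """Map car_id -> 1 if the car is treated as churned, else 0.

    service_visits: iterable of (car_id, visit_date) pairs.
    A car counts as churned when its most recent visit is more than
    window_days before the cutoff date (hypothetical rule).
    """
    last_visit = {}
    for car_id, visit in service_visits:
        if car_id not in last_visit or visit > last_visit[car_id]:
            last_visit[car_id] = visit
    threshold = cutoff - timedelta(days=window_days)
    return {car: int(last < threshold) for car, last in last_visit.items()}


visits = [
    ("A", date(2017, 1, 10)),
    ("A", date(2018, 2, 1)),   # A came back recently
    ("B", date(2015, 6, 1)),   # B has not been seen for years
]
labels = label_churn(visits, cutoff=date(2018, 3, 27))
print(labels)
```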

Slide 17

Slide 17 text

VW Data Pipeline
Getting Data to the Environment
1. Loading data into the MapD database:
   a. Table extracted from the database and exported as CSV.
   b. mapdql used to create the table and import the data.
2. Exploratory Data Analysis:
   a. MapD dashboard used to perform exploratory data analysis (EDA), which gives ‘spatial awareness’ of the data.
   b. Using MapD allows this on-the-fly investigation of large datasets.
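Step 1 can be sketched as generating the two SQL statements that would be run through mapdql. The table name, column names, column types and file path are hypothetical placeholders, and the `header` option on `COPY` is assumed to behave as in the MapD documentation of the time; this is not the VW team's actual script.

```python
def create_table_sql(table, columns):
    """Build a CREATE TABLE statement from (name, sql_type) pairs."""
    cols = ",\n  ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return f"CREATE TABLE {table} (\n  {cols}\n);"


def copy_from_csv_sql(table, csv_path):
    """Build a COPY statement that imports a CSV file with a header row."""
    return f"COPY {table} FROM '{csv_path}' WITH (header='true');"


# Hypothetical schema for illustration only.
columns = [("car_id", "INTEGER"), ("mileage", "FLOAT"), ("model", "TEXT ENCODING DICT")]
script = "\n".join([
    create_table_sql("service_visits", columns),
    copy_from_csv_sql("service_visits", "/data/service_visits.csv"),
])
print(script)
```

The printed script could then be saved to a file and fed to the mapdql client.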

Slide 18

Slide 18 text

Demo Data Stats
Rough Feeling of Data
• Number of rows: ~2.2 million
• 25 relevant columns
• 2 categorical columns
• 23 numerical columns

Slide 19

Slide 19 text

Hardware Stack
Details of Hardware Setup
On Microsoft Azure Instance:
• Instance Name: NC24S_V2 Standard
• vCPUs (Cores): 24
• Storage: 448GB
• Data Disks: 32
• GPUs: 4 Tesla P100-PCIE-16GB
• OS: CentOS

Slide 20

Slide 20 text

Demo Time!

Slide 21

Slide 21 text

Machine Learning Pipeline
Personas in Analytics Lifecycle (Illustrative)
Personas: Business Analyst, Data Scientist, Data Engineer, IT Systems Admin, Data Scientist / Business Analyst
Stages: Data Preparation, Data Discovery & Feature Engineering, Model & Validate, Predict, Operationalize, Monitoring & Refinement, Evaluate & Decide
GPUs

Slide 22

Slide 22 text

MapD is the analytics platform created for GPUs

Slide 23

Slide 23 text

Advanced memory management
Three-tier caching to GPU RAM for speed and to SSDs for persistent storage
COMPUTE LAYER
• GPU RAM (L1): 24GB to 256GB, 1000-6000 GB/sec. Hot Data: Speedup = 1500x to 5000x Over Cold Data
• CPU RAM (L2): 32GB to 3TB, 70-120 GB/sec. Warm Data: Speedup = 35x to 120x Over Cold Data
STORAGE LAYER
• SSD or NVRAM STORAGE (L3): 250GB to 20TB, 1-2 GB/sec. Cold Data
Data Lake/Data Warehouse/System Of Record
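To get a feel for these bandwidth figures, the sketch below estimates how long a single pass over a dataset would take at each tier. It uses only the bandwidth ranges quoted on the slide; the 100 GB dataset size is an assumption, and the estimate ignores compute, caching and transfer overheads, so it does not reproduce the slide's quoted speedup ranges.

```python
# Bandwidth ranges in GB/sec, as quoted on the slide: (low, high).
TIERS = {
    "GPU RAM (L1)": (1000, 6000),
    "CPU RAM (L2)": (70, 120),
    "SSD or NVRAM (L3)": (1, 2),
}


def scan_seconds(size_gb, bandwidth_gb_per_s):
    """Time to read size_gb once at the given bandwidth."""
    return size_gb / bandwidth_gb_per_s


def scan_time_ranges(size_gb):
    """Map tier -> (best case, worst case) scan time in seconds."""
    return {
        tier: (scan_seconds(size_gb, high), scan_seconds(size_gb, low))
        for tier, (low, high) in TIERS.items()
    }


ranges = scan_time_ranges(100)  # hypothetical 100 GB working set
for tier, (best, worst) in ranges.items():
    print(f"{tier}: {best:.3f} s to {worst:.3f} s")
```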

Slide 24

Slide 24 text

The GPU Open Analytics Initiative (GOAI)
Creating common data frameworks to accelerate data science on GPUs
• /mapd/pymapd
• /gpuopenanalytics/pygdf

Slide 25

Slide 25 text

The Time Is Now
1. Collect It
2. Make It Actionable
3. Make It Predictive

Slide 26

Slide 26 text

ML Examples
• We’ve published a few notebooks showing how to connect to a MapD database and use an ML algorithm to make predictions
• We’ve also published the notebook from the VW churn example
/gpuopenanalytics/demo-docker
/mapd/mapd-ml-demo

Slide 27

Slide 27 text

Next Steps
• community.mapd.com: Ask questions and share your experiences
• mapd.com/demos: Play with our demos
• mapd.com/platform/download-community/: Get our free Community Edition and start playing

Slide 28

Slide 28 text

Thanks to the whole team!
• Asghar Ghorbani
• Wamsi Viswanath
• Abraham Duplaa

Slide 29

Slide 29 text

Questions?
• Aaron Williams, VP of Global Community: @_arw_, /in/aaronwilliams/, /williamsaaron
• Zach Izham, Legend: @drizham, /in/dr-zach-izham-02090b5/, /drizham
• Asghar Ghorbani, Data Scientist: @ghorbani_asghar, /in/aghorbani/, /a-ghorbani
slides: https://speakerdeck.com/mapd/