Slide 1

Slide 1 text

Become a Data Scientist Francesco Tisiot Analytics Tech Lead

Slide 2

Slide 2 text

Francesco Tisiot, Analytics Tech Lead
• Verona, Italy
• Over 10 Years in Analytics
• Oracle ACE Director
• ITOUG Board President
• http://ritt.md/ftisiot [email protected] @FTisiot

Slide 3

Slide 3 text

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | bit.ly/OracleACEProgram 450+ Technical Experts Helping Peers Globally Nominate yourself or someone you know: acenomination.oracle.com

Slide 4

Slide 4 text

Data Engineering Analytics Data Science www.rittmanmead.com [email protected] @rittmanmead

Slide 5

Slide 5 text

Agenda
• OAC
• Data Scientist
• Become a Data Scientist

Slide 6

Slide 6 text

Oracle Analytics Cloud
• Platform Services (PaaS)
• Delivered entirely in the cloud:
  • No infrastructure footprint
  • Flexibility
  • Simplified, metered licensing
• Several options to suit your needs:
  • BYOL
  • Functionality bundled into 2 editions
    • Professional
    • Enterprise

Slide 7

Slide 7 text

Functions: OAC supports every type of analytics
• Classic
• Modern

Slide 8

Slide 8 text

Classic Enterprise BI
• Similar to OBIEE 12c
  • Centrally maintained & governed
  • Semantic model
• Interactive Dashboards
  • KPI measurement & monitoring
  • Guided navigation paths
• BI Publisher
  • Highly formatted, burst outputs
• Action Framework
  • Navigation actions
  • Scheduled agents

Slide 9

Slide 9 text

Modern Data Discovery
• Data Preparation
  • Acquire data
  • Clean/Enrich
  • Transform
  • Repeatable Flows
• Data Visualisation
  • Create visual insights rapidly
  • Construct narrated storyboards
  • Share findings

Slide 10

Slide 10 text

Unique Source of Truth Raw Data To Insights Specific Access Control Data Enrichment and Cleaning Unified Analytics Free Discovery Centralised Reporting https://speakerdeck.com/ftisiot/become-an-equilibrista-find-the-right-balance-in-the-analytics-tech-ecosystem

Slide 11

Slide 11 text

Augmented Analytics Data Enrichment Suggestions Explain One-Click Advanced Analytics Advanced Machine Learning Natural Language Processing

Slide 12

Slide 12 text

Data Scientist

Slide 13

Slide 13 text

Data Scientist: a person who has the knowledge and skills to conduct sophisticated and systematic analyses of data. A data scientist extracts insights from data sets, and evaluates and identifies strategic opportunities.
Source: https://bigdata-madesimple.com/what-is-a-data-scientist-14-definitions-of-a-data-scientist/

Slide 14

Slide 14 text

Data Scientist: a Data Analyst who lives in California!
Source: https://bigdata-madesimple.com/what-is-a-data-scientist-14-definitions-of-a-data-scientist/

Slide 15

Slide 15 text

Data Scientist Skills

Slide 16

Slide 16 text

https://www.oralytics.com/2012/06/data-science-is-multidisciplinary.html Brendan Tierney Oracle Ace Director

Slide 17

Slide 17 text

Data Scientist …Company Missing a Data Scientist

Slide 18

Slide 18 text

Low Hanging Fruit Theory Democratise Data Science

Slide 19

Slide 19 text

Basic Operations What are the Drivers for My Sales? Based on my Experience I can Guess…. Statistically Significant Drivers for Sales Are … Augmented Analytics

Slide 20

Slide 20 text

Basic Operations Is this Client going to accept the Offer? YES/NO 50% 70% Basic ML Model

Slide 21

Slide 21 text

Become a Data Scientist with OAC

Slide 22

Slide 22 text

Before Starting… Define the Problem!

Slide 23

Slide 23 text

Problem Definition: Predicting Wine Quality

Slide 24

Slide 24 text

TEP
• Task: Classify Good/Bad Wine
• Experience: Corpus of Wine Descriptions with Rating
• Performance: Accuracy
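The performance measure named on this slide, accuracy, is the share of wines whose predicted class matches the real one. A minimal Python sketch of that measure follows; the function name and the sample labels are invented for illustration and do not come from the wine dataset used in the talk.

```python
def accuracy(actual, predicted):
    """Return the share of predictions that match the actual labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)


# Hypothetical labels, invented for illustration only.
actual = ["Good", "Bad", "Good", "Bad", "Good"]
predicted = ["Good", "Bad", "Bad", "Bad", "Good"]

score = accuracy(actual, predicted)
print(f"Accuracy: {score:.0%}")
```

Accuracy is easy to read, but it can mislead when one class is much more common than the other, which is one reason to also look at other measures such as precision.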

Slide 25

Slide 25 text

Become a Data Scientist with OAC Connect

Slide 26

Slide 26 text

Connection Options in OAC Pre-Defined Data Models External Data Sources

Slide 27

Slide 27 text

Select Relevant Columns and Apply Filters

Slide 28

Slide 28 text

Become a Data Scientist with OAC Connect Clean

Slide 29

Slide 29 text

What He Really Does What Everybody Thinks a Data Scientist Does

Slide 30

Slide 30 text

https://www.infoworld.com/article/3228245/data-science/the-80-20-data-science-dilemma.html

Slide 31

Slide 31 text

Cleaning What?
• Missing Values: N/A
• Wrong Values: Mark <> MArk
• Irrelevant Observations: City “Rome”
• Handling Outliers: Role: CIO, Salary: 500 K$
• Train/Test Set Split: Train: 80%, Test: 20%
• Labelling Columns: Col1 -> Name
• Feature Scaling: 0-200k -> 0-1
• Aggregation: # of Clicks
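Outside OAC, the first few cleaning tasks on this slide could be sketched in plain Python roughly as below. This is only an illustration: the row layout, the column names `Col1`, `City` and `Clicks`, and the choice to keep only one city are assumptions, not something the slide specifies.

```python
# Markers treated as "missing"; the exact set is an assumption.
MISSING_MARKERS = {"N/A", ""}


def clean_rows(rows, keep_city):
    """Apply four of the cleaning tasks from the slide to a list of dicts."""
    cleaned = []
    for row in rows:
        # Irrelevant observations: keep only the city under study.
        if row["City"] != keep_city:
            continue
        # Wrong values: "Mark" and "MArk" become the same value.
        name = row["Col1"].strip().upper()
        # Missing values: turn the "N/A" marker into None.
        raw_clicks = row["Clicks"]
        clicks = None if raw_clicks in MISSING_MARKERS else int(raw_clicks)
        # Labelling columns: Col1 is stored under the clearer name "Name".
        cleaned.append({"Name": name, "City": row["City"], "Clicks": clicks})
    return cleaned


# Hypothetical raw rows, invented for illustration only.
raw_rows = [
    {"Col1": "Mark", "City": "Rome", "Clicks": "12"},
    {"Col1": "MArk", "City": "Rome", "Clicks": "N/A"},
    {"Col1": "Anna", "City": "Verona", "Clicks": "7"},
]

cleaned = clean_rows(raw_rows, keep_city="Rome")
print(cleaned)
```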

Slide 32

Slide 32 text

Cleaning How? Data Flows - Filter - Aggregate - Join

Slide 33

Slide 33 text

Cleaning What? (with the operations shown on the slide)
• Missing Values: N/A
• Wrong Values: Mark <> MArk
• Irrelevant Observations: City “Rome”
• Handling Outliers: Role: CIO, Salary: 500 K$
• Train/Test Set Split: Train: 80%, Test: 20%
• Labelling Columns: Col1 -> Name
• Feature Scaling: 0-200k -> 0-1
• Aggregation: # of Clicks -> COUNT
Operations: CASE … WHEN…, UPPER, FILTER, COLUMN RENAME, FILTER, KPI/(MAX-MIN), FILTER?
Automated (×3)
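Two of the operations on this slide, feature scaling and the 80/20 train/test split, can be sketched in plain Python as follows. The slide writes the scaling as KPI/(MAX-MIN); the sketch uses the general min-max form (value - min)/(max - min), which gives the same result when the minimum is 0, as in the 0-200k example. The function names, the seed and the sample values are assumptions made for illustration.

```python
import random


def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to scale, so return zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def train_test_split(rows, test_share=0.2, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    test_size = round(len(shuffled) * test_share)
    return shuffled[test_size:], shuffled[:test_size]


# Hypothetical salaries in the 0-200k range from the slide.
salaries = [0, 50_000, 100_000, 200_000]
scaled = min_max_scale(salaries)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]

train, test = train_test_split(range(10))
print(len(train), len(test))  # 8 2
```

Shuffling before splitting avoids a test set that only contains the last rows of the file, which matters when the data is sorted.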

Slide 34

Slide 34 text

Why Remove an Outlier?
Years Experience    Salary
1                   30.000
2                   32.000
3                   35.000
4                   35.500
5                   36.000
6                   40.000
7                   50.000
8                   70.000
9                   90.000
10                  500.000
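One way to see the effect of the last row is to compare the average salary with and without it. The sketch below uses the ten salaries from this slide, assuming the dot is a thousands separator (so 30.000 means 30,000); the variable names are made up for illustration.

```python
from statistics import mean

# Salaries from the slide, years of experience 1 to 10.
salaries = [30_000, 32_000, 35_000, 35_500, 36_000,
            40_000, 50_000, 70_000, 90_000, 500_000]

mean_all = mean(salaries)
mean_without_last = mean(salaries[:-1])

print(f"Mean with the 500,000 salary:    {mean_all:,.0f}")
print(f"Mean without the 500,000 salary: {mean_without_last:,.0f}")
```

A single extreme value roughly doubles the average here (91,850 against 46,500), and it would distort a model fitted on these rows in a similar way.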

Slide 35

Slide 35 text

How To Find Outliers? One Dimension
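The slide does not say which rule is used in the demo. A common one-dimensional approach is the interquartile range (IQR) rule, which flags values far outside the middle 50% of the data. The sketch below applies it to the salaries from the previous slide; the function name and the 1.5 multiplier are the usual convention, not something taken from the talk.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the first/third quartile."""
    q1, _median, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Salaries from the "Why Remove an Outlier?" slide.
salaries = [30_000, 32_000, 35_000, 35_500, 36_000,
            40_000, 50_000, 70_000, 90_000, 500_000]

outliers = iqr_outliers(salaries)
print(outliers)  # [500000]
```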

Slide 36

Slide 36 text

How To Find Outliers? Two Dimensions

Slide 37

Slide 37 text

Become a Data Scientist with OAC Connect Clean Transform & Enrich

Slide 38

Slide 38 text

Feature Engineering Location -> ZIP Code 2 Locations -> Distance Name -> Sex Day/Month/Year -> Date Data Flow Additional Data Sources?
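Two of the derived features listed on this slide can be sketched in plain Python: building a date from separate day, month and year columns, and turning two locations into a distance. The sketch uses the haversine (great-circle) formula as one possible distance measure; the function names and the approximate city coordinates are assumptions for illustration.

```python
from datetime import date
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def to_date(day, month, year):
    """Day/Month/Year -> Date."""
    return date(year, month, day)


def haversine_km(lat1, lon1, lat2, lon2):
    """2 Locations -> Distance: great-circle distance in kilometres."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Approximate coordinates, used only as sample input.
verona = (45.4384, 10.9916)
rome = (41.9028, 12.4964)

order_date = to_date(3, 5, 2019)
distance = haversine_km(*verona, *rome)
print(order_date.isoformat(), f"{distance:.0f} km")
```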

Slide 39

Slide 39 text

Data Preparation Recommendations

Slide 40

Slide 40 text

Spatial Enrichment Oracle Spatial Studio http://ritt.md/spatial-studio

Slide 41

Slide 41 text

Become a Data Scientist with OAC Connect Clean Transform & Enrich Analyse

Slide 42

Slide 42 text

Data Overview

Slide 43

Slide 43 text

Explain

Slide 44

Slide 44 text

Explain - Key Drivers

Slide 45

Slide 45 text

Become a Data Scientist with OAC Connect Clean Analyse Train & Evaluate Transform & Enrich

Slide 46

Slide 46 text

What Problem are we Trying to Solve?
• Supervised: “I want to predict the value of Y, here are some examples”
  • Classification
  • Regression
• Unsupervised: “Here is a dataset, make sense out of it!”
  • Clustering
Source: https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d

Slide 47

Slide 47 text

Easy Models

Slide 48

Slide 48 text

NLP

Slide 49

Slide 49 text

DataFlow Train Model

Slide 50

Slide 50 text

Which Model - Parameters?

Slide 51

Slide 51 text

Select, Try, Save, Change, Try, Save …..

Slide 52

Slide 52 text

Compare - Classification Real Value Predicted Value Good Bad Bad Good
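The Real Value / Predicted Value grid on this slide is a confusion matrix: it counts how often each real class was predicted as each class. A minimal Python sketch, with labels invented for illustration, could look like this.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count (real value, predicted value) pairs."""
    return Counter(zip(actual, predicted))


# Hypothetical labels, invented for illustration only.
actual = ["Good", "Good", "Bad", "Bad", "Good", "Bad"]
predicted = ["Good", "Bad", "Bad", "Good", "Good", "Bad"]

matrix = confusion_matrix(actual, predicted)
for real in ("Good", "Bad"):
    for guess in ("Good", "Bad"):
        print(f"real={real:<4} predicted={guess:<4} count={matrix[(real, guess)]}")
```

The diagonal cells (Good/Good and Bad/Bad) are the correct predictions; the other two cells are the two kinds of error, which may have very different business costs.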

Slide 53

Slide 53 text

There is No Single Truth… Precision: 896/(502+896) = 64.09% vs 866/(471+866) = 64.77%
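Precision is the share of positive predictions that are actually correct: true positives divided by true positives plus false positives. The slide does not label which count is which; the sketch below takes 896 and 866 as the true positives and 502 and 471 as the false positives, because those are the numerators that reproduce the two percentages shown. The model names are placeholders.

```python
def precision(true_positives, false_positives):
    """Share of positive predictions that were correct."""
    return true_positives / (true_positives + false_positives)


# Counts from the slide; the TP/FP assignment is an assumption (see above).
model_a = precision(true_positives=896, false_positives=502)
model_b = precision(true_positives=866, false_positives=471)

print(f"Model A precision: {model_a:.2%}")
print(f"Model B precision: {model_b:.2%}")
```

The second model makes fewer correct positive predictions in absolute terms yet scores slightly higher on precision, which is the point of the slide title: the "best" model depends on the measure chosen.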

Slide 54

Slide 54 text

Compare - Regression

Slide 55

Slide 55 text

Become a Data Scientist with OAC Connect Clean Analyse Train & Evaluate Predict Transform & Enrich

Slide 56

Slide 56 text

Use On the Fly

Slide 57

Slide 57 text

Step of a Data Flow

Slide 58

Slide 58 text

Congratulations! …You are now a Data Scientist!

Slide 59

Slide 59 text

Nearly There

Slide 60

Slide 60 text

Required Knowledge: 97% 95% 90% 80% 60% 50%

Slide 61

Slide 61 text

…But 80% > 50% Data Cleaning Model Creation & Evaluation Feature Engineering Feature Selection

Slide 62

Slide 62 text

ML Production Deployment Data Scientist ML -> Data Oracle Machine Learning

Slide 63

Slide 63 text

Become a Data Scientist with OAC http://ritt.md/OAC-datascience

Slide 64

Slide 64 text

ML in Action with OAC http://ritt.md/OAC-ML-Video

Slide 65

Slide 65 text

https://www.rittmanmead.com/insight-lab/ Insights Lab

Slide 66

Slide 66 text

Data Science OAC

Slide 67

Slide 67 text

Become a Data Scientist Francesco Tisiot Analytics Tech Lead