Slide 1

Slide 1 text

Data Science Methodology
Chapter 1 – A Living Parable
Fiqry Revadiansyah

Slide 2

Slide 2 text

Trainer Bio
Fiqry Revadiansyah
Email: [email protected] | LinkedIn: https://www.linkedin.com/in/fiqryrevadiansyah

Work Experience
• Data Scientist, Bukalapak (2018 – Present)
• Technical Content Reviewer, Packt Publishing (2019 – Present)

Teaching Experience
• Guest Lecturer, MB1201 Business Statistics at SBM ITB (April 2020)
• Part-time Teacher, Data Science, Purwadhika (2019)
• Workshop speaker: Introduction to ML for DS (June 2020) and Statistics for Business Analytics (Nov 2019)

Slide 3

Slide 3 text

Background

Slide 4

Slide 4 text

What do you think about planting?

Slide 5

Slide 5 text

Planting?

Slide 6

Slide 6 text

If you want to plant, what should you do?

Slide 7

Slide 7 text

Planting Activity
• Choose Plant Seeds
• Select Planting Method
• Determine the Planting Objective
• Optimize Plant Result

Slide 8

Slide 8 text

So, this is not a planting class, right? What is the relation to Data Science?

Slide 9

Slide 9 text

Analogy of Farming
My expression when I hear "Data Science = Farming"

Slide 10

Slide 10 text

Analogy of Farming
The Data Science Methodology is an iterative system of methods that guides data scientists toward the ideal approach to solving problems with data science, through a prescribed sequence of steps.

Slide 11

Slide 11 text

Data Science Methodology *John Rollins, SDS IBM

Slide 12

Slide 12 text

Data Science Methodology
1. Define the Goal, and Choose the Road
2. Gather the resource, and Identify its Characteristic
3. Do the Plan, and Evaluate it
4. Market to Public, and Regather the Opinion
Stages: Preparation Stage, Data Related Stage, Drive the Solution Stage

Slide 13

Slide 13 text

1. Business Understanding
What problem do you want to act on?
As a farmer
- Gain More Profit
- Live Healthy
As a Data Scientist
- Increase Visitor Numbers
- Reduce Fraudulent Acts

Slide 14

Slide 14 text

1. Business Understanding
What problem do you want to act on?
Impression
• Number of Visits
• Number of Active Visitors
Budget
• Cost Per Product
• Cost of Acquisition
Users
• Number of Retained Users
• Satisfaction Score

Slide 15

Slide 15 text

2. Analytic Approach
Which lane do you prefer to take?
As a farmer
Choose planting method: Hydroponics or Aquaponics
As a Data Scientist
Choose analytics method: Predictive Modeling (ML) or Diagnostic Analysis

Slide 16

Slide 16 text

2. Analytic Approach
Which lane do you prefer to take?
If the question is to determine the probability of an action
● Use a Predictive model
If the question is to show relationships
● Use a Descriptive model
If the question requires a yes/no answer
● Use a Classification model
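As a minimal sketch, the rule of thumb above can be written as a simple lookup from question type to model family. The keys (`probability`, `relationship`, `yes_no`) and the function name `choose_approach` are hypothetical labels introduced here for illustration only.

```python
# Hypothetical lookup: question type -> model family named on the slide.
APPROACH_BY_QUESTION = {
    "probability": "predictive model",       # determine the probability of an action
    "relationship": "descriptive model",     # show relationships
    "yes_no": "classification model",        # requires a yes/no answer
}


def choose_approach(question_type: str) -> str:
    """Return the model family suggested for a given question type."""
    try:
        return APPROACH_BY_QUESTION[question_type]
    except KeyError:
        raise ValueError(f"unknown question type: {question_type!r}") from None


print(choose_approach("yes_no"))
```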

Slide 17

Slide 17 text

3. Data Requirements
What kind of resource do you need?
As a farmer
Prepare the tools and seeds to achieve the farming goal
As a Data Scientist
Choose what kind of data might solve the problem

Slide 18

Slide 18 text

3. Data Requirements What kind of resource do you need?

Slide 19

Slide 19 text

4. Data Collection
How do you collect the resource?
As a farmer
Select a shop to purchase seeds from, either local or online
As a Data Scientist
Determine the data sources and collect the data from them

Slide 20

Slide 20 text

4. Data Collection
How do you collect the resource?
Local/Internal
• User Data
• Traffic History Data
Public/External
• Open Data Repository
Scraping
• Social Media
• Website

Slide 21

Slide 21 text

5. Data Understanding
What does the resource tell you?
As a farmer
Get to know what is happening with the soil, the plants, and the environment
As a Data Scientist
Get to know what the data tells us about the problem, and visualize it

Slide 22

Slide 22 text

5. Data Understanding
What does the resource tell you?
Data Visualization
Discover data trends, patterns, and other relevant signals
Descriptive Statistics
Decipher aggregate information such as the mean, median, missing values, etc.
Funnel Analysis
Uncover the hidden information
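The descriptive statistics mentioned on this slide can be computed with the Python standard library. This is a minimal sketch: the `daily_visits` values are made up for illustration, and `None` is assumed to mark a missing value.

```python
import statistics

# Hypothetical daily visit counts; None marks a missing observation.
daily_visits = [120, 135, None, 150, 110, None, 145]

# Keep only the observed values before aggregating.
observed = [v for v in daily_visits if v is not None]

summary = {
    "count": len(observed),
    "missing": len(daily_visits) - len(observed),
    "mean": statistics.mean(observed),
    "median": statistics.median(observed),
    "min": min(observed),
    "max": max(observed),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```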

Slide 23

Slide 23 text

6. Data Preparation
What do you have to do before taking action?
As a farmer
Prepare suitable soil for the selected plants and set up the growing medium well
As a Data Scientist
Handle data problems such as missing values, duplicates, and others

Slide 24

Slide 24 text

6. Data Preparation
What do you have to do before taking action?
Data problems to handle:
• Missing Values
• Duplicated Data
• Irregular Format
• Imbalanced Data
Result: Clean Data
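A minimal sketch of three of the problems listed on this slide (missing values, duplicated data, irregular format), using only the standard library. The rows, the field names, the two accepted date formats, and the "Unknown" placeholder are all assumptions made for illustration; imbalanced data is not covered here.

```python
from datetime import datetime

# Hypothetical raw rows with typical data problems.
raw_rows = [
    {"user_id": 1, "signup": "2020-06-01", "city": "Bandung"},
    {"user_id": 1, "signup": "2020-06-01", "city": "Bandung"},    # duplicate
    {"user_id": 2, "signup": "01/06/2020", "city": None},         # irregular date, missing city
    {"user_id": 3, "signup": "2020-06-03", "city": " jakarta "},  # irregular text
]


def parse_date(text):
    """Parse a date written in one of the two assumed formats."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def clean(rows):
    """Normalize formats, fill missing cities, and drop duplicate rows."""
    seen = set()
    cleaned = []
    for row in rows:
        city = (row["city"] or "unknown").strip().title()
        record = (row["user_id"], parse_date(row["signup"]), city)
        if record in seen:  # already kept an identical normalized row
            continue
        seen.add(record)
        cleaned.append({"user_id": record[0], "signup": record[1], "city": record[2]})
    return cleaned


clean_rows = clean(raw_rows)
for row in clean_rows:
    print(row)
```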

Slide 25

Slide 25 text

7. Modeling
How do you make a model from your data to solve the problem?
As a farmer
Plant the seeds, water the plants, etc.
As a Data Scientist
Model the data through a Machine Learning process

Slide 26

Slide 26 text

7. Modeling
How do you make a model from your data to solve the problem?
Choose ML Model
Determine the model based on the expected output (classification or regression)
Model Iteration
Iterate the modeling process with K-fold cross-validation
Ensemble Model
Combine ML models to gain better model accuracy
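The K-fold iteration named on this slide can be sketched as an index splitter. This is an illustrative, unshuffled version written from scratch; the function name `k_fold_indices` is introduced here and is not taken from any library.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds.

    Samples are split in order (no shuffling); the first folds get one
    extra sample when n_samples is not divisible by k.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Each sample lands in the validation set exactly once, so a model trained and scored on every fold is evaluated on all of the data.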

Slide 27

Slide 27 text

8. Evaluation
Has the model already answered the problem, or does it need to be improved?
As a farmer
Inspect the plants: are they free from pests and disease?
As a Data Scientist
Does the model have good fitting accuracy, or should it be enhanced?

Slide 28

Slide 28 text

8. Evaluation
Has the model already answered the problem, or does it need to be improved?
Model Interpretation
Interpret the model results so that other people can understand them
Model Evaluation
Validate the model performance with metrics suited to its problem type (Accuracy, Precision, Recall, RMSE, etc.)
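The metrics named on this slide can be computed by hand to see what each one measures. A minimal sketch, assuming binary labels where 1 is the positive class; the labels and predictions are made-up values for illustration.

```python
import math


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def rmse(y_true, y_pred):
    """Root mean squared error for a regression problem."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical classification results.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = [1, 0, 0, 1, 0, 1, 1, 0]
scores = classification_metrics(labels, predictions)
print(scores)

# Hypothetical regression results.
error = rmse([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
print(error)
```

Accuracy, precision, and recall apply to classification problems, while RMSE applies to regression problems.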

Slide 29

Slide 29 text

9. Deployment
Can you apply the model in real life?
As a farmer
Sell the vegetables to the market/store
As a Data Scientist
Integrate the ML model into the production ecosystem

Slide 30

Slide 30 text

9. Deployment
Can you apply the model in real life?

Slide 31

Slide 31 text

10. Feedback
Is there any input to your business solution?
As a farmer
Gather suggestions and comments from our customers
As a Data Scientist
Gather feedback from various parties such as end users, stakeholders, etc.

Slide 32

Slide 32 text

10. Feedback
Is there any input to your business solution?

METRICS              BEFORE        AFTER ML MODEL DEPLOYMENT
Daily Active Users   1000          1600 (+60%)
Cost Spent           1 mio/month   500k/month (-50%)
Revenue Gain         10 mio        30 mio (+200%)
SLA                  3 days        2 days (-33%)
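The percentage changes in the table can be recomputed from the before and after values. A minimal sketch, assuming "mio" means million and "500k" means 500,000; the function name `percent_change` is introduced here for illustration.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


# Before/after values taken from the table ("mio" assumed to be million).
metrics = {
    "Daily Active Users": (1000, 1600),
    "Cost Spent (per month)": (1_000_000, 500_000),
    "Revenue Gain": (10_000_000, 30_000_000),
    "SLA (days)": (3, 2),
}

changes = {name: percent_change(b, a) for name, (b, a) in metrics.items()}
for name, change in changes.items():
    print(f"{name}: {change:+.0f}%")
```

Note that a value that triples, such as 10 mio to 30 mio, is a +200% change.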

Slide 33

Slide 33 text

In short… the Data Science Methodology is…
Define the Goal, and Choose the Road
• What problem do you want to act on?
• Which lane do you prefer to take?
Gather the resource, and Identify its Characteristic
• What kind of resource do you need?
• How do you collect the resource?
• What does the resource tell you?
• What do you have to do before taking action?
Do the Plan, and Evaluate it
• How do you make a model from your data to solve the problem?
• Has the model already answered the problem, or does it need to be improved?
Market to Public, and Regather the Opinion
• Can you apply the model in real life?
• Is there any input to your business solution?

Slide 34

Slide 34 text

Let's take a simple exercise
Apply the Data Science Methodology to these cases:

Slide 35

Slide 35 text

Let's take a simple exercise
YouTube Case
Define the Goal, and Choose the Road
• What problem do you want to act on?
• Which lane do you prefer to take?
Gather the resource, and Identify its Characteristic
• What kind of resource do you need?
• How do you collect the resource?
• What does the resource tell you?
• What do you have to do before taking action?
Do the Plan, and Evaluate it
• How do you make a model from your data to solve the problem?
• Has the model already answered the problem, or does it need to be improved?
Market to Public, and Regather the Opinion
• Can you apply the model in real life?
• Is there any input to your business solution?

Slide 36

Slide 36 text

Let's take a simple exercise
Jenius Case
Define the Goal, and Choose the Road
• What problem do you want to act on?
• Which lane do you prefer to take?
Gather the resource, and Identify its Characteristic
• What kind of resource do you need?
• How do you collect the resource?
• What does the resource tell you?
• What do you have to do before taking action?
Do the Plan, and Evaluate it
• How do you make a model from your data to solve the problem?
• Has the model already answered the problem, or does it need to be improved?
Market to Public, and Regather the Opinion
• Can you apply the model in real life?
• Is there any input to your business solution?

Slide 37

Slide 37 text

Let's take a simple exercise
Gojek Case
Define the Goal, and Choose the Road
• What problem do you want to act on?
• Which lane do you prefer to take?
Gather the resource, and Identify its Characteristic
• What kind of resource do you need?
• How do you collect the resource?
• What does the resource tell you?
• What do you have to do before taking action?
Do the Plan, and Evaluate it
• How do you make a model from your data to solve the problem?
• Has the model already answered the problem, or does it need to be improved?
Market to Public, and Regather the Opinion
• Can you apply the model in real life?
• Is there any input to your business solution?

Slide 38

Slide 38 text

Data Science Methodology
Chapter 2 – A Lesson from Tech Industry
Fiqry Revadiansyah

Slide 39

Slide 39 text

Frameworks for How Each Division Works

Slide 40

Slide 40 text

Data Driven Framework
Cross-Division Lesson
Product Management: AAARRR Model
Product Design: Double Diamond

Slide 41

Slide 41 text

Data Driven Framework
Cross-Division Lesson
Business Development: Business Model Canvas
Engineering: Agile Project Management

Slide 42

Slide 42 text

Product Management
AAARRR Model
Journey Prototype
Describes the end-to-end process of how our customers go from first impression to producing a revenue stream
Develop Persistently
Focus on the channel where users drop off, and constantly evaluate it with the whole team (engineering, UX, data, etc.)
Data Driven!
They evaluate those channels based on data. Analytics is needed to enhance the decision-making process here.
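Finding the channel with the largest drop can be sketched as a step-by-step conversion calculation over the six AAARRR stages used in this deck. The user counts below are made up, and treating the stages as a strict funnel (each stage a subset of the previous one) is a simplifying assumption.

```python
# The six AAARRR stages named in this deck.
STAGES = ["Awareness", "Acquisition", "Activation", "Retention", "Referral", "Revenue"]

# Hypothetical number of users reaching each stage (illustrative only).
users_per_stage = [10_000, 4_000, 2_500, 1_000, 300, 150]


def step_conversions(counts):
    """Return the share of users who move from each stage to the next."""
    return [nxt / cur for cur, nxt in zip(counts, counts[1:])]


def weakest_step(stages, counts):
    """Return (from_stage, to_stage, rate) for the lowest conversion step."""
    rates = step_conversions(counts)
    i = min(range(len(rates)), key=rates.__getitem__)
    return stages[i], stages[i + 1], rates[i]


rates = step_conversions(users_per_stage)
src, dst, rate = weakest_step(STAGES, users_per_stage)
print(f"Largest drop: {src} -> {dst} ({rate:.0%} continue)")
```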

Slide 43

Slide 43 text

Product Design
Double Diamond Design Thinking
Careful on Iteration
A belief in double-checking progress: start from a helicopter view and end at the ant's view.
Exploration to Action
Focus on exploring the situation first, define hypotheses based on pain points, develop a product to solve them, and deliver it for evaluation
Data Driven!
From beginning to end, they use data to tell the story about our customers

Slide 44

Slide 44 text

Business Development
Business Model Canvas
Plan on Demand
List all funnels of the business process and set the subject of every key point to ensure reliability
Customer Satisfaction
Aside from the streams, this model also focuses on customer growth, such as how to maintain the relationship and how to segment customers
Data Driven!
Data is always needed to recap every key point of this model

Slide 45

Slide 45 text

Engineering
Agile Project Management
Iterative Approach
An approach to managing software development projects that focuses on continuous releases and incorporates customer feedback with every iteration
Scrum and Kanban
Scrum is focused on fixed-length project iterations; Kanban is focused on continuous releases.
Data Driven!
Data is needed to track and evaluate the process

Slide 46

Slide 46 text

Data Driven Framework
Data on Stakeholder Expectations
• “Okay, so what?”
• “OK, Thanks”
• “Seems interesting”
• “Good, I think we need to do an AB Test”
• “Impressive! Let's do A, B, C tomorrow!”

Slide 47

Slide 47 text

Data Driven Framework
Objective: Connect the dots to find significance

Slide 48

Slide 48 text

Data utilization through stakeholder expectations
“Okay, so what?”
“The Actionless Data”
Data handed over in a rough format, without any particular advantage
Examples:
• User Table
• Transaction Table

Slide 49

Slide 49 text

Data utilization through stakeholder expectations
“OK, Thanks”
“The Labeled Data”
Data with a particular group/label: aggregated information.
Examples:
• Daily Transacting Users
• The average Production Cost in June 2020

Slide 50

Slide 50 text

Data utilization through stakeholder expectations
“Seems interesting”
“The Data Combination”
Interconnected labeled data brings a better point of view.
Examples:
• Transaction Count by User Location
• Conversion Rate (CVR) of Voucher X
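Both examples on this slide combine two pieces of labeled data. A minimal sketch with made-up records: the field names are hypothetical, and CVR is assumed here to mean redemptions divided by views of the voucher.

```python
from collections import Counter

# Hypothetical transaction records joined with the user's location.
transactions = [
    {"user_id": 1, "location": "Bandung"},
    {"user_id": 2, "location": "Jakarta"},
    {"user_id": 1, "location": "Bandung"},
    {"user_id": 3, "location": "Jakarta"},
    {"user_id": 4, "location": "Jakarta"},
]

# Transaction count by user location.
count_by_location = Counter(t["location"] for t in transactions)


def conversion_rate(converted, exposed):
    """Share of exposed users who converted (0.0 when nobody was exposed)."""
    return converted / exposed if exposed else 0.0


# Hypothetical voucher funnel: 200 users saw the voucher, 30 redeemed it.
voucher_cvr = conversion_rate(converted=30, exposed=200)

print(dict(count_by_location))
print(f"CVR: {voucher_cvr:.1%}")
```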

Slide 51

Slide 51 text

Data utilization through stakeholder expectations
“Good, I think we need to do an AB Test”
“The Most Significant Data”
Aggregated/interconnected data that acts as the company's main metrics
Examples:
• Retention Cohort of Total Customers from City X
• Budget Allocation of Product X based on Customer Group
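A retention cohort like the one named above groups users by when they signed up and tracks what share of each group is still active in later periods. A minimal sketch with made-up data: months are represented as plain integers, and the user IDs and the function name `retention_table` are hypothetical.

```python
from collections import defaultdict

# Hypothetical signup month per user (months numbered 0, 1, 2, ...).
signup_month = {"u1": 0, "u2": 0, "u3": 0, "u4": 1, "u5": 1}

# Hypothetical activity log: (user, month in which the user was active).
activity = [
    ("u1", 0), ("u1", 1), ("u1", 2),
    ("u2", 0), ("u2", 1),
    ("u3", 0),
    ("u4", 1), ("u4", 2),
    ("u5", 1),
]


def retention_table(signup_month, activity):
    """Map (cohort, months_since_signup) to the share of the cohort active."""
    cohort_users = defaultdict(set)
    for user, month in signup_month.items():
        cohort_users[month].add(user)

    active_users = defaultdict(set)
    for user, month in activity:
        cohort = signup_month[user]
        active_users[(cohort, month - cohort)].add(user)

    return {
        key: len(users) / len(cohort_users[key[0]])
        for key, users in sorted(active_users.items())
    }


table = retention_table(signup_month, activity)
for (cohort, offset), share in table.items():
    print(f"cohort {cohort}, month +{offset}: {share:.0%}")
```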

Slide 52

Slide 52 text

Data utilization through stakeholder expectations
“Impressive! Let's do A, B, C tomorrow!”
“The Data Guru”
Interconnection of the most significant findings, used as a funnel to answer the missing gap (Funnel Type)
Examples:
• Retention Cohort based on User Type, Country of Origin, etc.
• Combination of Product Type, Location, and Revenue

Slide 53

Slide 53 text

Let's take a simple exercise
Frameworks Application

Slide 54

Slide 54 text

Product Management
Awareness: Trending Topics
Acquisition: Content Uniqueness
Activation: Content Quality
Retention: Content Stickiness
Referral: Gift Away
Revenue: Ads

Slide 55

Slide 55 text

Product Management Awareness Acquisition Activation Retention Referral Revenue Impression Trial Register Flexi Cash Monthly User Gift Promotions Interest Percentage

Slide 56

Slide 56 text

Product Management
Awareness: Impression
Acquisition: Product Trial
Activation: Sign Up
Retention: Daily User
Referral: Satisfaction
Revenue: Partner Gains

Slide 57

Slide 57 text

Are you ready to work like a Data Scientist?