Predict Clicked Ads Customer Classification by using Machine Learning

Agustina Sri Wardani

March 06, 2023

Transcript

  1. Predict Clicked Ads Customer Classification by using Machine Learning

    Created by: Agustina Sri Wardani [email protected] https://www.linkedin.com/in/agustinaswd/

    Hi, nice to meet you. This is the 4th mini project from the Data Science Bootcamp at Rakamin Academy. In this project, I act as a Data Analyst in a company. I'm responsible for finding insights about user behavior by visualizing this data, building machine learning models relevant to the company's needs, and making recommendations based on the findings.
  2. Overview

    "A company in Indonesia wants to know the effectiveness of an advertisement they are showing. The company needs to know how well the advertisements it has marketed attract customers to view them. Processing historical advertisement data and finding the insights and patterns in it can help the company determine its marketing targets. This case focuses on creating a machine learning classification model that identifies the right target customers."
  3. Data Understanding

    Adjust Data
    • Change data type of Timestamp to datetime
    • Change column name Male to Gender
    • Change column name Area Income to Income
    • Delete column Unnamed: 0

    Missing Value
    Daily Time Spent on Site, Area Income, and Daily Internet Usage have missing values, and we will drop them in the next step.

    Duplicated Value
    There are no duplicated values.

    You can check here for the source code
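The adjustments on this slide can be sketched with pandas as follows. This is a minimal illustration, not the original source code: only the column names come from the slide, the sample rows are made up, and it assumes that "drop them" means dropping the rows that contain missing values.

```python
import pandas as pd

# Made-up sample rows; only the column names come from the slide.
raw = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3, 4],
    "Daily Time Spent on Site": [68.9, None, 69.5, 74.2, 41.0],
    "Area Income": [432.8, 479.1, None, 383.6, 401.2],
    "Daily Internet Usage": [256.1, 193.8, 236.5, None, 120.4],
    "Male": ["Female", "Male", "Female", "Male", "Female"],
    "Timestamp": [
        "2016-03-27 00:53:00", "2016-04-04 01:39:00", "2016-03-13 20:35:00",
        "2016-01-10 02:31:00", "2016-06-03 03:36:00",
    ],
})


def adjust(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the four adjustments listed on the slide."""
    out = df.drop(columns=["Unnamed: 0"])                      # delete the index-like column
    out = out.rename(columns={"Male": "Gender", "Area Income": "Income"})
    out["Timestamp"] = pd.to_datetime(out["Timestamp"])        # string -> datetime
    return out


clean = adjust(raw)

# Count missing values per column and fully duplicated rows.
missing_per_column = clean.isna().sum()
n_duplicates = int(clean.duplicated().sum())

# Assumption: rows with any missing value are dropped.
clean = clean.dropna()

print(missing_per_column)
print("duplicated rows:", n_duplicates)
```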
  4. Exploratory Data Analysis: Univariate Analysis

    Daily Time Spent on Site
    • For both customers who clicked on ads and those who did not, the distribution is skewed
    • Customers who do not click ads have a much larger breaking point than customers who click ads
    • Column Daily Time Spent on Site has a bimodal distribution

    Age
    • For both customers who clicked on ads and those who did not, the distribution is normal
    • Overall, the distribution of Age is a normal distribution

    Daily Internet Usage
    • For both customers who clicked on ads and those who did not, the distribution is skewed
    • Column Daily Internet Usage has a bimodal distribution
    • Customers who clicked on ads and those who did not have almost the same peak in their distributions

    Area Income
    • For both customers who clicked on ads and those who did not, the distribution is normal
    • Overall, the distribution of Area Income is a normal distribution
  5. Exploratory Data Analysis: Bivariate Analysis

    Daily Time Spent on Site
    • Customers with a Daily Time Spent on Site of 35 - 45 more often clicked on ads
    • Customers with a Daily Time Spent on Site of 70 - 80 more often did not click on ads

    Age
    • Customers aged 35 - 45 more often clicked on ads
    • Customers aged 25 - 35 more often did not click on ads

    Daily Internet Usage
    • Customers with a Daily Internet Usage of 100 - 150 more often clicked on ads
    • Customers with a Daily Internet Usage of 175 - 225 more often did not click on ads

    Area Income
    • Customers with an income of around 380 - 460 million skipped the ads more often than customers in other income ranges
  6. Exploratory Data Analysis: Multivariate Analysis

    • Column Daily Internet Usage has a positive correlation of 0.52 with Daily Time Spent on Site and of 0.34 with Area Income
    • Columns Daily Internet Usage and Age have a negative correlation of -0.37
    • Column Daily Time Spent on Site has a negative correlation of -0.33 with Age
    • From the pairplot, for columns Daily Time Spent on Site and Daily Internet Usage, customers who clicked on ads and those who did not can be separated quite clearly
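Correlations like the ones quoted on this slide can be computed with pandas' `DataFrame.corr`. The sketch below assumes Pearson correlation (the slide does not say which method was used) and uses made-up values, so its output will not match the figures above.

```python
import pandas as pd

# Made-up values; only the column names come from the slides.
sample = pd.DataFrame({
    "Daily Time Spent on Site": [80.2, 68.4, 45.0, 38.5, 74.1, 41.3],
    "Age": [29, 31, 42, 45, 27, 39],
    "Area Income": [60.0, 58.0, 40.0, 35.0, 62.0, 45.0],
    "Daily Internet Usage": [225.0, 210.5, 120.3, 110.8, 230.1, 135.6],
})

# Pairwise Pearson correlation between all numeric columns.
corr = sample.corr(method="pearson")

print(corr.round(2))
```

Each cell of `corr` is the correlation between the row's column and the column's column, so the matrix is symmetric with ones on the diagonal.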
  7. Data Cleaning & Preprocessing: Feature Encoding

    Encode Strategy
    • Label Encoding: Gender, Clicked on Ad
    • One Hot Encoding: province, category
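The two encoding strategies can be sketched with pandas as below. The column names come from the slide; the category values are invented for illustration, and using `cat.codes` for label encoding is one possible choice, not necessarily what the original notebook does.

```python
import pandas as pd

# Invented category values; only the column names come from the slide.
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male"],
    "Clicked on Ad": ["No", "Yes", "Yes", "No"],
    "province": ["A", "B", "A", "C"],
    "category": ["x", "y", "y", "x"],
})

# Label encoding: replace each category with an integer code
# (codes follow the sorted order of the category values).
for col in ["Gender", "Clicked on Ad"]:
    df[col] = df[col].astype("category").cat.codes

# One-hot encoding: one 0/1 indicator column per category value.
encoded = pd.get_dummies(df, columns=["province", "category"], dtype=int)

print(encoded)
```

Label encoding suits the two-valued columns, while one-hot encoding avoids implying an order among the values of province and category.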
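  8. Data Modeling: Modeling Without Scaling

    | Metric | Logistic Regression | kNN | Decision Tree | XGBoost | Random Forest | CatBoost | AdaBoost |
    | --- | --- | --- | --- | --- | --- | --- | --- |
    | Accuracy (Test Set) | 0.49 | 0.68 | 0.94 | 0.95 | 0.95 | 0.95 | 0.94 |
    | Precision (Test Set) | 0.00 | 0.71 | 0.94 | 0.95 | 0.95 | 0.96 | 0.94 |
    | Recall (Test Set) | 0.00 | 0.64 | 0.94 | 0.94 | 0.95 | 0.94 | 0.94 |
    | F1-Score (Test Set) | 0.00 | 0.67 | 0.94 | 0.95 | 0.95 | 0.95 | 0.94 |
    | roc_auc (test-proba) | 0.73 | 0.7 | 0.94 | 0.98 | 0.99 | 0.99 | 0.98 |
    | roc_auc (test-proba) | 0.79 | 0.86 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
    | roc_auc (crossval train-mean) | 0.7692 | 0.8512 | 1.00 | 0.9998 | 1.0 | 1.0 | 0.0007 |
    | roc_auc (crossval test-mean) | 0.7693 | 0.7075 | 0.9359 | 0.9885 | 0.9898 | 0.9895 | 0.9842 |
    | roc_auc (crossval train-std) | 0.0046 | 0.004 | 0.0 | 0.0001 | 0.0 | 0.0 | 0.0001 |
    | roc_auc (crossval test-std) | 0.0174 | 0.0221 | 0.0115 | 0.007 | 0.0076 | 0.0075 | 0.0066 |

    The slide labels two rows roc_auc (test-proba); the second, with values close to 1.0, looks like the training-set score. The slide also highlights the best-fit model.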
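The cross-validated roc_auc rows (train/test mean and standard deviation) could be produced along the lines of the sketch below. It assumes scikit-learn's `cross_validate` with 5 folds and runs on synthetic data from `make_classification`; the fold count, hyperparameters, and the subset of models shown are assumptions, not the original setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the ads dataset (assumption, not the real data).
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=42
)

# A subset of the models compared on the slide, with illustrative settings.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

summary = {}
for name, model in models.items():
    scores = cross_validate(
        model, X, y, cv=5, scoring="roc_auc", return_train_score=True
    )
    summary[name] = {
        "train_mean": float(scores["train_score"].mean()),
        "test_mean": float(scores["test_score"].mean()),
        "train_std": float(scores["train_score"].std()),
        "test_std": float(scores["test_score"].std()),
    }

for name, row in summary.items():
    print(name, {k: round(v, 4) for k, v in row.items()})
```

A train mean near 1.0 next to a clearly lower test mean, as in the Decision Tree column above, is the usual sign of overfitting.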
  9. Data Modeling: Modeling With Scaling

    Random Forest
    • Accuracy (Test Set): 0.49
    • Precision (Test Set): 0.00
    • Recall (Test Set): 0.00
    • F1-Score (Test Set): 0.00
    • roc_auc (test-proba): 0.48
    • roc_auc (test-proba): 0.57
    • roc_auc (crossval train-mean): 1.0
    • roc_auc (crossval test-mean): 0.9898
    • roc_auc (crossval train-std): 0.0
    • roc_auc (crossval test-std): 0.0076

    We use the Accuracy metric to determine the performance of the models we build, and we chose Random Forest for our modeling. Because the model's accuracy with scaling is only 0.49, we used the dataset without scaling, where Random Forest reaches an accuracy of 0.95.
  10. Data Modeling: Confusion Matrix

    • A: Predicted to click on the ad, but didn't click in the actual data
    • B: Predicted not to click on the ad, but didn't click the ad in the actual data
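The metrics reported for the models (accuracy, precision, recall, F1-score) all follow from the four cells of a confusion matrix. The sketch below shows the arithmetic with hypothetical counts; the function name and the numbers are illustrative only, since the transcript does not include the actual matrix values.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    tp: predicted click, actually clicked
    fp: predicted click, actually did not click
    fn: predicted no click, actually clicked
    tn: predicted no click, actually did not click
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Report 0.0 instead of dividing by zero when a denominator is empty,
    # e.g. when the model never predicts a click.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for illustration.
balanced = classification_metrics(tp=8, fp=2, fn=2, tn=8)

# A model that never predicts a click: precision, recall and F1 all become 0.
never_clicks = classification_metrics(tp=0, fp=0, fn=5, tn=5)

print(balanced)
print(never_clicks)
```

The second case shows how a model can have roughly 50% accuracy with precision, recall, and F1 all at 0.00: it simply never predicts the positive class.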
  11. Data Modeling: SHAP Values

    • Daily Internet Usage and Daily Time Spent on Site are the most important features. We will use these two features to determine future marketing success
    • From EDA, we know that customers with a Daily Internet Usage of 100 - 150 and customers with a Daily Time Spent on Site of 35 - 45 more often clicked on ads
    • We can also use Age to determine future marketing success
    • From EDA, we know that customers aged 35 - 45 more often clicked on ads
    • Customers with an income of around 380 - 460 million skipped the ads more often than customers in other income ranges

    You can check here for the source code
  12. Business Recommendation – Daily Internet Usage

    • Customers with short daily internet usage click our ads out of curiosity, because they don't have much time to use the internet. We need to put a special promo in the ads so that they don't just click the ads but also buy our product.
    • Customers who use the internet for a relatively long time each day (211-240 minutes) rarely click our ads. To increase their likelihood of clicking, we need to show them well-targeted ads and do more analysis to find the best time to show the ads to them.
  13. Business Recommendation – Daily Time Spent on Site

    • We must retain the customers who spend 46-60 minutes daily on our site. We need to offer the right promo so that they don't just click the ads but also buy our product.
    • Customers who spend more than 75 minutes daily on the site rarely click on ads, maybe because they already know what they want to buy. To increase their likelihood of clicking our ads, we need to show ads that are customized to their needs.
  14. Business Scheme Simulation (Assumptions)

    • The digital marketing cost is IDR 10K per customer
    • The revenue per customer is IDR 13K when the customer clicks the ads
    • We will focus on the losses from costs the company has spent to display ads that generate no revenue because customers don't click them
  15. 300 Customers

    • Without ML: 300 customers, can't predict
    • With ML: 147 predicted to click the ads, 139 predicted not to click the ads
    • Actual: 153 click the ads, 147 don't click the ads

    We assume we have 300 customers, 51% of whom click the ads. Our ML model has 95% accuracy.
  16. Business Simulation for Cost, Revenue, & Profit

    Without Machine Learning
    Total 300 customers.
    • 153 customers (51%) click the ads
    • 147 customers (49%) don't click the ads
    • Marketing Cost: 300 x IDR 10K = IDR 3.000K
    • Revenue: 153 x IDR 13K = IDR 1.989K

    With Machine Learning
    Total 300 customers.
    • 147 customers (51%) click the ads
    • 139 customers (49%) don't click the ads
    • Marketing Cost: 147 x IDR 10K = IDR 1.470K
    • Revenue: 147 x IDR 13K = IDR 1.911K

    | IDR | Without Machine Learning | With Machine Learning |
    | --- | --- | --- |
    | Marketing Cost | 3.000K | 1.470K |
    | Revenue | 1.989K | 1.911K |
    | Profit | -1.011K | 441K |

    Profit with Machine Learning: IDR 441.000
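The cost, revenue, and profit figures on this slide can be reproduced with a few lines of arithmetic. The sketch below uses the slide's assumptions (IDR 10K cost per targeted customer, IDR 13K revenue per click); the function and variable names are illustrative.

```python
COST_PER_CUSTOMER = 10_000   # IDR, marketing cost per customer shown the ad
REVENUE_PER_CLICK = 13_000   # IDR, revenue per customer who clicks the ad


def simulate(customers_targeted: int, customers_clicking: int) -> dict:
    """Return cost, revenue and profit in IDR for one scenario."""
    cost = customers_targeted * COST_PER_CUSTOMER
    revenue = customers_clicking * REVENUE_PER_CLICK
    return {"cost": cost, "revenue": revenue, "profit": revenue - cost}


# Without ML: ads are shown to all 300 customers, 153 of them click.
without_ml = simulate(customers_targeted=300, customers_clicking=153)

# With ML: ads are shown only to the 147 customers predicted to click,
# and the slide counts all 147 as clicking.
with_ml = simulate(customers_targeted=147, customers_clicking=147)

print("Without ML:", without_ml)
print("With ML:   ", with_ml)
```

The gain comes almost entirely from the lower marketing cost: revenue drops slightly, but the company no longer pays to show ads to customers who are unlikely to click.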