Introduction to Machine Learning for Data Science

A gentle introduction to Machine Learning, covering Supervised & Unsupervised ML, the End-to-End Process of ML, Data Preprocessing and Feature Engineering, and Evaluation Metrics for ML. This is a workshop session.

Fiqry Revadiansyah

June 29, 2020

Transcript

  1. Workshop Eps. Introduction to Machine Learning for Data Science

    29th of June, 2020, Purwadhika Classroom @Hangout Google
    Hello data geeks!
  2. Greetings! Just call me Fiqry (with/without suffix)

    I am currently working as a Data Scientist @Bukalapak, and have also worked as a Technical Content Reviewer @Packt Publishing (working remotely).
    I am also passionate about Time Series Analytics, Immersive Computing (VR & AR), and Gamification Business. That’s it ya!
  3. Please expect these things:

    - Know the Workshop Scope: Everything will be delivered at an introductory level. Please don’t expect to become an expert Data Scientist/ML Engineer after attending this workshop.
    - Programming Language using Python: There will be a hands-on session, coding in Python. If you are not familiar with Python (or don’t have any programming background), please expect to understand only the selected syntax.
    - No Repetition for Disconnect Accident: There won’t be any content repetition during the workshop if you get disconnected. You can rewind the material later using the video recorded by Purwadhika.
  4. This talk will be conducted in Three Different levels:

    - High-Level (Module 1): Talk across various spectrums, from Technology, Business, Sociology, to Economy, to widen the point of view and paradigm. 100% Theory, 0% Practice @ 1 hour.
    - Med-Level (Module 2): Talk specifically about one or two domains, describe the process from upstream to downstream, and do a bit of coding. 50% Theory, 50% Practice @ 1 hour.
    - Low-Level (Module 3): Talk fundamentally about one domain, answering the “how” question. Get ready to get your hands dirty. 0% Theory, 100% Practice @ 1 hour.
  5. Bookmark The Star

    Slides that have a star in the top left corner are very important and deserve more focus.
  6. Table of contents

    - 01 Data Science, an unpredictable tale. High-level discussion: How to get inspired by data science from a different perspective.
    - 02 Dismantle Machine Learning Engine. Med-level discussion: An end-to-end look at how machine learning works. Try to code.
    - 03 Hands-On real industry case. Low-level discussion: Get your hands dirty on data forecasting and predictive modelling.
  7. 01 Data Science, an unpredictable tale

    High-level discussion: How to get inspired by data science from a different perspective.
  8. 02 Dismantle the Machine Learning Engine

    Med-Level Discussion: An end-to-end look at how machine learning works. Try to code.
  9. Dismantle the Machine Learning Engine: Understanding Level of Data Science

    - Entry: Understand the differences between Supervised, Unsupervised, and Deep Learning. Know how to determine and present the best model. Early level of EDA (Exploratory Data Analysis).
    - Medior: Understand how to increase model accuracy, handle data problems (imbalanced data, missing values), and be proficient in model selection. Able to do Feature Engineering.
    - Senior: Expert level of increasing model accuracy (new Deep Learning architectures), very proficient at handling data problems, able to propose new algorithms for different data cases.

    After Senior: Managerial Route or Technical Route.
  10. Dismantle the Machine Learning Engine: Things to Touch and Say Hello with

    - Say hello to Machine Learning: Greetings to Machine Learning types, such as Supervised Learning and Unsupervised Learning.
    - Tools and End-to-End Process: Get in touch with the end-to-end process of doing Machine Learning, along with the most recent tools.
    - Data, Metrics, and Model Selection: Understand the mechanism of data behavior and model selection by its metrics.
  11. Commonly Used Terms

    - Variable: a medium to store data/information/value.
    - Data: a value/information; can be numeric/alphabetic/picture/video/etc.
    - Machine Learning Model: a simplified program that can be taught with data (input) to predict an output.
  12. Commonly Used Terms

    - Variable X: an open variable that is used to predict Y (independent/feature variable).
    - Variable Y: a label variable that is determined by X (dependent/response variable).
  13. Commonly Used Terms

    - Algorithm: a sequence of steps to solve a problem.
    - Train & Testing Data: Train data is the data used to train the ML Model. Testing data is the data used to test the ML Model accuracy.
    - Feature Engineering: a specific discipline of producing new features based on existing features.
  14. What is Machine Learning?

    - A machine that can make its own decisions
    - A machine that learns from data
    - A machine that can predict data

    (living computer, algorithm, math and stats stuff)
  15. Say Hello to Machine Learning: Definition and Its Derivatives

    Traditional Programming:
    Input (Data) + Static Code/Syntax → Output (Data)

    Machine Learning Programming:
    Input (Data) + Output (Data) → Train ML Model (Learn) → ML Model (Program) → Prediction (from Input)
  16. Say Hello to Machine Learning: Definition and Its Derivatives

    Traditional Programming
    Input: Height = [145,154,177,150,170]
    Static Program: IF Logic
      IF Height < 150 then Short
      IF Height >= 150 and Height <= 175 then Average
      IF Height > 175 then Tall
    Output: [Short, Average, Tall, Average, Average]

    Machine Learning Programming
    Input: Height = [145,154,177,150,170]
    Classification_Label = [Short, Average, Tall, Average, Average]
    Machine Learning Algorithm: Train the Model using Height and Classification_Label
    Prediction using Trained ML Model: New Height = 190, New Classification Label = Tall
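The contrast on this slide can be sketched in a few lines of Python. This is only an illustration: the slide does not name a learning algorithm, so a 1-nearest-neighbour lookup is assumed here as a stand-in, and the function names are hypothetical.

```python
# Traditional programming: a human writes the rule by hand.
def classify_by_rule(height):
    if height < 150:
        return "Short"
    if height <= 175:  # i.e. 150 <= height <= 175
        return "Average"
    return "Tall"

# Machine learning programming: the mapping is learned from labeled examples.
# Assumption: 1-nearest-neighbour stands in for the unspecified ML algorithm.
def train(heights, labels):
    # "Training" here simply memorizes the (height, label) pairs.
    return list(zip(heights, labels))

def predict(model, new_height):
    # Return the label of the training height closest to the new height.
    nearest = min(model, key=lambda pair: abs(pair[0] - new_height))
    return nearest[1]

heights = [145, 154, 177, 150, 170]
labels = ["Short", "Average", "Tall", "Average", "Average"]

model = train(heights, labels)
prediction = predict(model, 190)
print([classify_by_rule(h) for h in heights])
print(prediction)
```

The hand-written rule reproduces the label list from the slide, while the learned model reaches "Tall" for a height of 190 without ever being told the thresholds.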
  17. Say Hello to Machine Learning: Definition and Its Derivatives

    - Supervised Machine Learning: Teach the ML Model by using Predictor Variable X and Label Variable Y.
    - Unsupervised Machine Learning: Teach the ML Model by using Predictor Variable X only, and let the model predict the Label Variable Y.
  18. Say Hello to Machine Learning: Supervised Machine Learning

    - COW DATA (Height, Color, Weight)
    - COW CLASS (A,B,C)
    - “Supervised” ML MODEL
    - COW Group A, COW Group B, COW Group C
  19. Data, Metrics, and Model Selection: Supervised Machine Learning Popular Algorithms

    - Linear Regression
    - XGBoost (XGB)
    - Random Forest (RF)
    - Support Vector Machine (SVM)
  20. Say Hello to Machine Learning: Supervised Machine Learning

    - Regression: Dynamic Pricing (Surge Price), House Price Prediction
    - Classification: Captcha Security, Email Spam Filtering
  21. Say Hello to Machine Learning: Unsupervised Machine Learning

    - COW DATA (Height, Color, Weight)
    - “Unsupervised” ML MODEL
    - COW CLASS (A,B,C)
    - COW Group A, COW Group B, COW Group C
  22. Say Hello to Machine Learning: Unsupervised Machine Learning

    - Clustering: User Segmentation
    - Dimensionality Reduction: User Segmentation
  23. Data, Metrics, and Model Selection: Unsupervised Machine Learning Popular Algorithms

    - Hierarchical Clustering
    - t-SNE
    - K-Means
    - Principal Component Analysis (PCA)
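To make the unsupervised idea concrete, here is a minimal sketch of K-Means on a single feature. The cow weights and the starting centroids are made up for illustration, and the function name is hypothetical. No labels are given to the algorithm; it only groups similar values.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Very small K-Means for one numeric feature."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Made-up cow weights in kg; no group labels are provided.
weights = [310, 320, 330, 500, 510, 520, 700, 710, 720]
centroids, clusters = kmeans_1d(weights, centroids=[300.0, 500.0, 800.0])
print(centroids)
print(clusters)
```

The three clusters that come out play the role of the cow groups A, B, and C, except that the model discovered them from the data alone.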
  24. Tools and End-to-End Process: Data Science Workflow in industry

    1. Ingestion and Analysis (50%): Starts from collecting data, preprocessing, and doing exploratory data analysis.
    2. Modeling and Evaluation (25%): Utilizing machine learning algorithms to build an automation, and evaluating the accuracy of the built model.
    3. Adjustment and Deployment (25%): Adjusting the model weight matrices to be stored in a microservice, and creating an architecture workflow to serve as the data pipeline, ready to deploy.
  25. Tools and End-to-End Process: Step 1 - Ingestion and Analysis

    1. Research Hypothesis: Conducting the research flow along with the hypotheses that might solve the problems.
    2. Data Query Schema: Determine which data to take, which tables, which features, etc.
    3. Data Retrieval: Retrieving data based on the query schema; it could come from a data warehouse or from scraping the internet.
    4. Data Preprocessing: Cleaning the whole data, such as controlling outliers, transforming or standardizing, handling null values, etc.
    5. Analysis and Visualization: Analyzing the preprocessed data, then deriving and verifying the research hypothesis by visualizing it with graphs or descriptive summaries.
  26. Tools and End-to-End Process: Step 2 - Modeling and Evaluation

    1. Research Methodology: Choose a proper ML Algorithm for the Research Objective.
    2. Feature Engineering: Produce new features from existing features.
    3. Train Machine Learning Model: Train the ML Model with the Train Data.
    4. Model Evaluation: Evaluate the Model Accuracy using the Test Data.
    5. Model Selection: Select the best model by highest accuracy/interpretation.
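The modeling steps above can be sketched end to end on a toy problem. Everything here is an assumption made for illustration: the rows are made-up plots of land given as (width, length, price), the model is a simple 1-nearest-neighbour lookup, and the metric is mean absolute error.

```python
# Made-up toy data: (width, length, price).
train_rows = [(2, 3, 60), (3, 3, 90), (4, 5, 200), (5, 2, 100), (6, 6, 360)]
test_rows = [(3, 6, 180), (5, 5, 250)]

# Step 2: feature engineering, a new feature built from existing ones.
def area(row):
    return row[0] * row[1]

def width(row):
    return row[0]

# Step 3: "train" a 1-nearest-neighbour model on a single feature.
def train(rows, feature):
    return [(feature(r), r[2]) for r in rows]

def predict(model, value):
    return min(model, key=lambda pair: abs(pair[0] - value))[1]

# Step 4: evaluate with mean absolute error on the test data.
def evaluate(model, rows, feature):
    errors = [abs(predict(model, feature(r)) - r[2]) for r in rows]
    return sum(errors) / len(errors)

# Step 5: select the candidate with the lowest test error.
candidates = {"width only": width, "engineered area": area}
scores = {
    name: evaluate(train(train_rows, feature), test_rows, feature)
    for name, feature in candidates.items()
}
best = min(scores, key=scores.get)
print(scores)
print(best)
```

In this toy setup the engineered feature wins because price follows area rather than width alone, which is the kind of gain feature engineering aims for.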
  27. Tools and End-to-End Process: Step 3 - Adjustment and Deployment

    1. Adjustment and Communication: Ensure the model pipeline runs smoothly from upstream to downstream.
    2. Deployment to Production: Store the model weight matrices in a container that runs their requirements and dependencies.
  28. Data, Metrics, and Model Selection: Data, and its derivative

    DATA
    - NUMERICAL (numbers)
      - DISCRETE [0, 1, 2, 3, 4, … N]
      - CONTINUOUS [0.1,0.001,...,1]
    - CATEGORICAL (text, alphabetic)
      - NOMINAL: no hierarchy [gender, address]
      - ORDINAL: have hierarchy [education level]

    No Encoding
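Most models need numbers as input, so categorical values are usually encoded before training. The sketch below shows two common choices, which are not taken from the slide: an order-preserving integer code for ordinal data and one-hot encoding for nominal data. The category lists and function names are hypothetical.

```python
# Ordinal data has a hierarchy, so an integer code can preserve the order.
education_order = ["primary", "secondary", "bachelor", "master"]  # hypothetical levels

def encode_ordinal(value, order):
    return order.index(value)

# Nominal data has no hierarchy, so each category gets its own 0/1 column.
cities = ["Jakarta", "Bandung", "Surabaya"]  # hypothetical categories

def encode_one_hot(value, categories):
    return [1 if value == c else 0 for c in categories]

ordinal_code = encode_ordinal("bachelor", education_order)
one_hot_code = encode_one_hot("Bandung", cities)
print(ordinal_code)
print(one_hot_code)
```

Using an integer code for nominal data would suggest an order that does not exist, which is why the two kinds are treated differently.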
  29. Data, Metrics, and Model Selection: Data, and its derivative

    DATA Problems
    - Missing Value
    - Duplicated
    - High-Value Gap
    - Imbalanced
  30. Data, Metrics, and Model Selection: Data, and its derivative

    Missing Value
    NA = Not Available. It might appear as NULL, NaN, etc. It is the global symbol of a missing value.
  31. Data, Metrics, and Model Selection: Data, and its derivative

    Missing Value
    1. Substitute with Statistical Ways: Mean | Median | Mode (Most Frequent)
    2. Drop Columns/Rows with Missing Values
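Both ways of handling missing values can be sketched with the standard library. The sample values are made up, `None` is assumed to mark a missing value, and the function names are hypothetical.

```python
from statistics import mean, median, mode

def impute(values, strategy):
    """Replace None with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

def drop_missing_rows(rows):
    """Keep only the rows that have no missing value."""
    return [row for row in rows if all(v is not None for v in row)]

ages = [20, None, 22, 22, None, 40]  # made-up column with two missing values
filled_mean = impute(ages, "mean")
filled_median = impute(ages, "median")
filled_mode = impute(ages, "mode")

rows = [(1, 2), (None, 3), (4, None), (5, 6)]  # made-up table
complete_rows = drop_missing_rows(rows)
print(filled_mean, filled_median, filled_mode, complete_rows)
```

Substituting keeps every row at the cost of some distortion, while dropping keeps only clean rows at the cost of losing data, so the choice depends on how much is missing.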
  32. Data, Metrics, and Model Selection: Data, and its derivative

    High Value Gap
    Columns: Target Variable, Predictor Variable
    Task: Predict the competition winner, given a set of predictor variables and a target variable!
  33. Data, Metrics, and Model Selection: Data, and its derivative

    High Value Gap
    Normalization Technique: Normalizing puts the variables on comparable scales, so the optimization does not have to cope with a massive variance and can converge feasibly.
  34. Data, Metrics, and Model Selection: Data, and its derivative

    High Value Gap
    Normalization Technique: Normalizing puts the variables on comparable scales, so the optimization does not have to cope with a massive variance and can converge feasibly.
    - min-max scaler
    - normal distribution
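The two techniques named on this slide can be sketched as follows. It is assumed here that "normal distribution" refers to z-score standardization (subtract the mean, divide by the standard deviation); the income values and function names are made up.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Z-score: shift to mean 0 and scale to standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

incomes = [1000, 2000, 3000, 4000, 5000]  # made-up feature with large values
scaled = min_max_scale(incomes)
standardized = standardize(incomes)
print(scaled)
print(standardized)
```

After either transformation a feature measured in thousands no longer dominates a feature measured in single digits.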
  35. Data, Metrics, and Model Selection: Data, and its derivative

    Train Test Split
    - Train: the part on which your ML algorithms are actually trained to build a model (60% of your data).
    - Validation: used to validate our various model fits (20% of your data).
    - Test: used to test our model hypothesis. It is left untouched and unseen until the model and hyperparameters are decided (20% of your data).
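A 60/20/20 split like the one described can be sketched as below. The fixed seed and the shuffle before splitting are assumptions added for reproducibility, and the function name is hypothetical.

```python
import random

def train_validation_test_split(rows, train=0.6, validation=0.2, seed=42):
    """Shuffle a copy of the rows, then cut it into three parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_validation = round(len(shuffled) * validation)
    train_part = shuffled[:n_train]
    validation_part = shuffled[n_train:n_train + n_validation]
    test_part = shuffled[n_train + n_validation:]  # whatever remains
    return train_part, validation_part, test_part

rows = list(range(100))  # stand-in for 100 data rows
train_rows, validation_rows, test_rows = train_validation_test_split(rows)
print(len(train_rows), len(validation_rows), len(test_rows))
```

Every row lands in exactly one part, so the test set stays unseen while the model and hyperparameters are being chosen.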
  36. Data, Metrics, and Model Selection: Metrics, and its derivative

    ML Metrics (Based on Target Variable)
    - NUMERICAL (numbers)
      - Distance-based (show condition): RMSE, MAE, ...
      - Percentage (interpretable): MAPE, R2, ...
    - CATEGORICAL (text, alphabetic)
      - Interpretability (Meaningful): Precision, Recall, ...
      - Reliability (Stable): AUC, ROC, ...
  37. Data, Metrics, and Model Selection: Metrics, and its derivative

    Numerical
    - Distance Principle: “The lower, the better”
    - Percentage Principle: “Have a rule”

    MAPE = (100% / n) × Σ |y_i − ŷ_i| / |y_i|

    Example: Model A evaluation:
    - MAPE = 7.9% ~ On average, the model’s predictions of Y are off by 7.9%.
    - R2 = 88% ~ The given variable X explains 88% of the variance of the target variable (Y).
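The numerical metrics named in this part of the deck can be computed by hand, using their standard textbook definitions. The actual and predicted values are made up and are not the Model A numbers from the slide.

```python
from math import sqrt
from statistics import mean

def mae(actual, predicted):
    """Mean Absolute Error (distance-based, lower is better)."""
    return mean([abs(a - p) for a, p in zip(actual, predicted)])

def rmse(actual, predicted):
    """Root Mean Squared Error (distance-based, lower is better)."""
    return sqrt(mean([(a - p) ** 2 for a, p in zip(actual, predicted)]))

def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent."""
    return 100 * mean([abs((a - p) / a) for a, p in zip(actual, predicted)])

def r2(actual, predicted):
    """Share of the variance of the target explained by the model."""
    y_bar = mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - y_bar) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [100, 200, 300, 400]      # made-up target values
predicted = [110, 190, 330, 380]   # made-up model output
print(mae(actual, predicted), rmse(actual, predicted))
print(mape(actual, predicted), r2(actual, predicted))
```

MAE and RMSE are in the units of the target, while MAPE and R2 are unit-free percentages, which is why the latter are easier to compare across problems.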
  38. Data, Metrics, and Model Selection: Metrics, and its derivative

    Categorical
    Case: Pregnancy (event), a person (man/woman)
    - False Positive (FP): Predict an event when there is no event (bad)
    - False Negative (FN): Predict no event when there is an event (bad)
    - True Positive (TP): Predict an event when there is an event (good)
    - True Negative (TN): Predict no event when there is no event (good)

    Event: Pregnancy
    Logic:
    - A man can’t be pregnant
    - A woman can be pregnant
    FP: The ML Model predicts that a man is pregnant
    FN: The ML Model predicts that a woman is not pregnant (but in reality she is pregnant)
  39. Data, Metrics, and Model Selection: Metrics, and its derivative

    Categorical
    Case: Pregnancy (event), a person (man/woman)
    - False Positive (FP): Predicted that a man is pregnant, actual is not pregnant
    - False Negative (FN): Predicted that a woman is not pregnant, actual is pregnant
    - True Positive (TP): Predicted that a woman is pregnant, actual is pregnant
    - True Negative (TN): Predicted that a man is not pregnant, actual is not pregnant

    Precision = TP / (TP + FP) → Stay Aggressive
    Recall = TP / (TP + FN) → Stay Careful
  40. Data, Metrics, and Model Selection: Metrics, and its derivative

    Categorical
    Precision = TP / (TP + FP) → Stay Aggressive
    Recall = TP / (TP + FN) → Stay Careful

    Precision: We want every event the ML model predicts to be correct (whether or not an event exists, a predicted event must be right), so false positives are kept low.
    Recall: We want the ML model to catch the events that really happen (an actual event must not be left unpredicted), so false negatives are kept low.

    Another Example: A rain prediction, a man
    - False Positive (FP): A man is told to bring an umbrella, but there is actually no rain the whole day.
    - False Negative (FN): A man is told not to bring an umbrella, but it actually rains the whole day.

    If you are a businessman, which risk will you minimize first? FP or FN?
  41. Data, Metrics, and Model Selection: Metrics, and its derivative

    Categorical
    Precision = TP / (TP + FP) → Stay Aggressive
    Recall = TP / (TP + FN) → Stay Careful

    Precision: We want every event the ML model predicts to be correct (whether or not an event exists, a predicted event must be right), so false positives are kept low.
    Recall: We want the ML model to catch the events that really happen (an actual event must not be left unpredicted), so false negatives are kept low.

    Another Example: An email, a spam flagger
    - False Positive (FP): An email is flagged as spam by the system, but it is actually not a spam message at all.
    - False Negative (FN): An email is not flagged as spam by the system, but it is actually spam and full of phishing links.

    If you are a businessman, which risk will you minimize first? FP or FN?
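The two formulas can be checked on a small made-up spam example, where 1 means "spam" (the event) and 0 means "not spam". The label lists and function names are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN, TN for binary labels (1 = event, 0 = no event)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

actual =    [1, 1, 1, 0, 0, 0, 0, 1]  # made-up truth: four spam emails
predicted = [1, 0, 0, 1, 0, 0, 0, 1]  # made-up flags from the spam filter

tp, fp, fn, tn = confusion_counts(actual, predicted)
print(tp, fp, fn, tn)
print(precision(tp, fp), recall(tp, fn))
```

Here two of the three flagged emails are really spam (precision 2/3), but only two of the four spam emails were caught (recall 1/2), so the filter misses more than it wrongly flags.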
  42. Data, Metrics, and Model Selection: Metrics, and its derivative

    Categorical
    AUC Score: The closer it is to 1.0, the better.
    *Best metric to describe the model’s reliability (imbalanced dataset)
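One way to see what the AUC score measures is its pairwise reading: the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below computes it that way on made-up scores; the function name is hypothetical.

```python
def auc_score(actual, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for a, s in zip(actual, scores) if a == 1]
    negatives = [s for a, s in zip(actual, scores) if a == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

actual = [1, 1, 0, 0, 0]            # made-up labels, imbalanced on purpose
scores = [0.9, 0.4, 0.5, 0.3, 0.1]  # made-up predicted probabilities
auc = auc_score(actual, scores)
print(auc)
```

Because the score depends only on how positives rank against negatives, it does not change when one class is much larger than the other.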
  43. Data, Metrics, and Model Selection: Model Selection, and its derivative

    Model Selection (Based on Bias-Variance Tradeoff)
    - Underfitting: High Bias, Low Variance
    - Overfitting: Low Bias, High Variance
  44. Data, Metrics, and Model Selection: Model Selection, and its derivative

    Bias Variance Tradeoff
    - Bias is the difference between the average prediction of our model and the correct value which we are trying to predict.
    - Variance is the variability of the model prediction for a given data point; it tells us how spread out the predictions are.
  45. Data, Metrics, and Model Selection: Model Selection, and its derivative

    Underfitting
    - The model is unable to capture the underlying pattern of the data
    - High bias, Low variance
    - Usually happens when there is too little training data
    - or, the model is too simple and has very few parameters
  46. Data, Metrics, and Model Selection: Model Selection, and its derivative

    Overfitting
    - The model captures the noise along with the underlying pattern in the data
    - Low bias, High variance
    - Usually happens when the model is trained heavily on a noisy dataset
    - or, the model is too complex and has many parameters
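The symptoms of both problems can be seen on a tiny made-up dataset whose true pattern is y = 2x, with noise added only to the training targets. The two models are illustrative assumptions: a constant model that is too simple, and a nearest-neighbour lookup that memorizes every training point including its noise.

```python
from statistics import mean

# Made-up data: true pattern y = 2x, training targets carry +3/-3 noise.
train_x = list(range(0, 20, 2))
train_y = [2 * x + (3 if i % 2 == 0 else -3) for i, x in enumerate(train_x)]
test_x = list(range(1, 20, 2))
test_y = [2 * x for x in test_x]

def too_simple_model(x):
    # Underfitting: always predicts the training mean, ignoring x.
    return mean(train_y)

def memorizing_model(x):
    # Overfitting: returns the (noisy) target of the nearest training point.
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

def mean_abs_error(model, xs, ys):
    return mean([abs(model(x) - y) for x, y in zip(xs, ys)])

underfit_train = mean_abs_error(too_simple_model, train_x, train_y)
underfit_test = mean_abs_error(too_simple_model, test_x, test_y)
overfit_train = mean_abs_error(memorizing_model, train_x, train_y)
overfit_test = mean_abs_error(memorizing_model, test_x, test_y)
print(underfit_train, underfit_test)
print(overfit_train, overfit_test)
```

The too-simple model is wrong on both sets, which is the high-bias signature. The memorizing model is perfect on the training data but not on unseen data, and that gap between train and test error is the usual warning sign of overfitting.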
  47. Special Parts: some notes on How to be an expert data scientist in this era

    - Take a serious focus on Explainable AI: Interpretable ML is good, but the explainable kind matters most. (This skillset is one of the most promising fields of ML.)
    - Know your Domain Science: Data Science is an iterative process. Everyone could be a DS as long as they follow the guided process. If you want to be different, show your domain expertise.
    - Throne your impact, not your certificate(s): Going through a lot of learning is awesome, but most importantly show the impact of your side projects/analyses, which can be measured and serve as strong proof that you are a Data Scientist.
  48. Special Parts: some notes on How to be an expert data scientist in this era

    Recommended Book/Course
    - Udacity Intro to ML
    - Coursera ML
    - E-Book