Slide 1

Slide 1 text

Machine Learning 101: Where to begin? Lutske de Leeuw, Groningen, 04-04-2024

Slide 2

Slide 2 text

So many ideas

Slide 3

Slide 3 text

The original problem

Slide 4

Slide 4 text

Table of contents
• About me
• Introduction to Machine Learning
• Types of Machine Learning
• Understanding the problem
• Data Collection and Preprocessing
• Model selection and training
• Model evaluation
• Real life examples
• Conclusion

Slide 5

Slide 5 text

https://www.linkedin.com/in/lutske/

Slide 6

Slide 6 text

Introduction to Machine Learning

Slide 7

Slide 7 text

Machine Learning: ‘teach’ the computer without explicitly programming it

Slide 8

Slide 8 text

Artificial Intelligence: imitates the human brain; includes machine learning

Slide 9

Slide 9 text

Deep Learning: neural networks; a subset of machine learning

Slide 10

Slide 10 text

Generative AI: generates content; a subset of deep learning

Slide 11

Slide 11 text

Nested view: Artificial Intelligence ⊃ Machine Learning ⊃ Deep Learning ⊃ Generative AI

Slide 12

Slide 12 text

Data Science: knowledge extracted from data; advise & predict

Slide 13

Slide 13 text

Diagram: Data Science overlapping with Artificial Intelligence ⊃ Machine Learning ⊃ Deep Learning ⊃ Generative AI

Slide 14

Slide 14 text

Types of Machine Learning

Slide 15

Slide 15 text

Types of learning: Supervised, Unsupervised, Semi-supervised, Reinforcement

Slide 16

Slide 16 text

Supervised learning: labeled input (Label: Cat / Label: Not a cat) → Model → prediction (“It’s a cat”)
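For illustration, a minimal supervised-learning sketch in scikit-learn; the tiny labeled dataset below (weights and ear lengths for cat vs. not-a-cat) is made up for this example.

from sklearn.neighbors import KNeighborsClassifier

# Made-up labeled examples: [weight_kg, ear_length_cm], label 1 = cat, 0 = not a cat
features = [[4.0, 6.5], [4.5, 7.0], [30.0, 12.0], [25.0, 11.0]]
labels = [1, 1, 0, 0]

model = KNeighborsClassifier(n_neighbors=3)
model.fit(features, labels)            # learn from the labeled examples
print(model.predict([[4.2, 6.8]]))     # predict the label of a new, unseen input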

Slide 17

Slide 17 text

Unsupervised learning: unlabeled input → Model
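By contrast, a minimal unsupervised sketch (clustering made-up, unlabeled points): the model only gets inputs and finds structure by itself.

import numpy as np
from sklearn.cluster import KMeans

# Unlabeled, made-up data: no labels are given to the model
data = np.array([[1.0, 1.1], [0.9, 1.0], [8.0, 8.2], [8.1, 7.9]])

model = KMeans(n_clusters=2, n_init=10, random_state=42)
print(model.fit_predict(data))         # the model groups similar points into clusters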

Slide 18

Slide 18 text

Semi-supervised learning: partly labeled input (Cat, Duck, ?) → Model → prediction (“It’s a cat”)

Slide 19

Slide 19 text

Reinforcement learning: the agent acts on the environment (Walk or Sit?), chooses an action (Sit), receives a reward, and updates its policy: Sit = good!
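A toy sketch of that loop in plain Python (no RL library; the actions, rewards, and learning rate are invented for illustration):

import random

actions = ["walk", "sit"]
policy = {"walk": 0.0, "sit": 0.0}                     # estimated value of each action

for episode in range(100):
    action = random.choice(actions)                    # the agent tries an action
    reward = 1.0 if action == "sit" else 0.0           # the environment rewards "sit"
    policy[action] += 0.1 * (reward - policy[action])  # update the policy estimate

print(policy)                                          # "sit" ends up valued highest: sit = good!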

Slide 20

Slide 20 text

Types of data
• Numerical: measured values, ordered series
• Categorical: Yes / No, blood groups
• Text data: sentiment analysis, translations, documents
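As a small illustration (with made-up values), the three kinds of data can live side by side in one pandas DataFrame:

import pandas as pd

df = pd.DataFrame({
    "measured_value": [1.8, 1.7, 1.9],                  # numerical
    "blood_group": pd.Categorical(["A", "B", "O"]),     # categorical
    "review": ["great", "okay", "terrible"],            # text data
})
print(df.dtypes)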

Slide 21

Slide 21 text

Understanding the problem

Slide 22

Slide 22 text

Understanding the problem
• Nature of the problem
• Type of data available
• Desired outcome

Slide 23

Slide 23 text

Example
• Problem: my two cats need different food
• Type of data: lots of cat pictures
• Desired outcome: recognize each cat to give it access to the correct bowl of food

Slide 24

Slide 24 text

Is machine learning the way?
• Could a big if-else statement do the job?
• Do I have enough data?
• Is there an expert to check the data?

Slide 25

Slide 25 text

Code example

pip install pandas numpy matplotlib scikit-learn jupyter

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

Slide 26

Slide 26 text

Data collection and preprocessing

Slide 27

Slide 27 text

Data collection
• Identify data sources (databases, APIs, web scraping, manual entry)
• Determine the data requirements (type, size, format)
• Collect the data
• Clean the data (see the sketch below)
• Validate the data
• Organize the data
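A minimal cleaning-and-validation sketch, assuming a hypothetical CSV of collected data (the file name and its columns are invented for illustration):

import pandas as pd

raw = pd.read_csv("collected_data.csv")        # hypothetical file
clean = raw.drop_duplicates().dropna()         # clean: drop duplicates and missing values
print(clean.describe())                        # validate: inspect ranges and basic statistics
print(clean.dtypes)                            # organize: check the type of each column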

Slide 28

Slide 28 text

Don’t reinvent the wheel: Kaggle, UCI, Google

Slide 29

Slide 29 text

Don’t reinvent the wheel: OpenML, Data.gov, Data.overheid.nl

Slide 30

Slide 30 text

Code example: Load dataset

from sklearn.datasets import fetch_california_housing

california = fetch_california_housing(as_frame=True)
dataFrame = pd.DataFrame(california.data, columns=california.feature_names)
dataFrame['prices'] = california.target

Slide 31

Slide 31 text

Code example: See the data

dataFrame.head()

#    MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  Prices
# 0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23   4.526
# 1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22   3.585
# 2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24   3.521
# 3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25   3.413
# 4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25   3.422

Slide 32

Slide 32 text

Preprocessing
• Data cleaning
• Data transformation
• Feature extraction
• Feature engineering (see the sketch below)
• Data splitting
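As an example of feature engineering on the California housing frame loaded earlier, one could derive a new ratio column (the BedrmsPerRoom feature is my own illustration, not from the slides):

dataFrame['BedrmsPerRoom'] = dataFrame['AveBedrms'] / dataFrame['AveRooms']
print(dataFrame[['AveRooms', 'AveBedrms', 'BedrmsPerRoom']].head())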

Slide 33

Slide 33 text

Code example: Prepare data

housing = dataFrame.drop('prices', axis=1)
pricing = dataFrame['prices']

Slide 34

Slide 34 text

Data processing

Slide 35

Slide 35 text

Gather data
• From your own resources
• From existing resources
• Data augmentation

Slide 36

Slide 36 text

Split the data: Dataset → Train / Validation / Test

Slide 37

Slide 37 text

Split the data
• Train 80% / Test 10% / Validate 10%
• Train 70% / Test 15% / Validate 15%
• Train 60% / Test 20% / Validate 20%
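One way to get such a three-way split with scikit-learn, using the 80/10/10 example and the dataFrame from the earlier slides (the two-step approach is a common convention, not prescribed by the slides):

from sklearn.model_selection import train_test_split

train_val, test = train_test_split(dataFrame, test_size=0.10, random_state=42)   # hold out 10% for testing
train, validation = train_test_split(train_val, test_size=1/9, random_state=42)  # 1/9 of 90% = 10% for validation
print(len(train), len(validation), len(test))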

Slide 38

Slide 38 text

Code example: Split & scale the data

housing_train, housing_test, pricing_train, pricing_test = train_test_split(housing, pricing, test_size=0.2, random_state=42)

scaler = StandardScaler()
housing_train_scaled = scaler.fit_transform(housing_train)
housing_test_scaled = scaler.transform(housing_test)

Slide 39

Slide 39 text

Algorithms

Slide 40

Slide 40 text

Linear Regression
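For reference, the usual form of the linear regression model (standard notation, not taken from the slide):

\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n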

Slide 41

Slide 41 text

Logistic Regression
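For reference, logistic regression passes the same linear combination through the sigmoid to get a probability (standard notation, not taken from the slide):

p = \sigma(z) = \frac{1}{1 + e^{-z}}, \quad z = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n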

Slide 42

Slide 42 text

Decision Tree: Is this person fit?
• Age < 30?
  • Yes → Eats a lot of pizzas?
    • Yes → Unfit
    • No → Fit
  • No → Exercises in the morning?
    • Yes → Fit
    • No → Unfit
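A minimal scikit-learn version of such a tree; the yes/no answers are encoded as made-up 0/1 features, so the learned splits only approximate the slide’s drawing:

from sklearn.tree import DecisionTreeClassifier, export_text

# Columns: [age < 30, eats a lot of pizzas, exercises in the morning]; 1 = yes, 0 = no
X = [[1, 1, 0], [1, 0, 1], [0, 1, 1], [0, 0, 0]]
y = ["unfit", "fit", "fit", "unfit"]

tree = DecisionTreeClassifier().fit(X, y)
print(export_text(tree, feature_names=["age<30", "pizza", "morning_exercise"]))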

Slide 43

Slide 43 text

Code example: Train the model & predict

model = LinearRegression()
model.fit(housing_train_scaled, pricing_train)
pricing_pred = model.predict(housing_test_scaled)

Slide 44

Slide 44 text

Model evaluation

Slide 45

Slide 45 text

Use metrics www.mathworks.com/discovery/overfitting.html

Slide 46

Slide 46 text

Confusion Matrix (rows = actual, columns = predicted)

                         Predicted: Positive (Cat)   Predicted: Negative (Dog)
Actual: Positive (Cat)   True Positive (Cat)          False Negative (Not a cat)
Actual: Negative (Dog)   False Positive (Cat)         True Negative (Dog)
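A small sketch with scikit-learn’s confusion_matrix; the cat/dog labels below are made up, with “cat” as the positive class:

from sklearn.metrics import confusion_matrix

actual    = ["cat", "cat", "dog", "dog", "cat", "dog"]
predicted = ["cat", "dog", "dog", "cat", "cat", "dog"]

# Rows = actual, columns = predicted: [[TP, FN], [FP, TN]] with "cat" as positive
print(confusion_matrix(actual, predicted, labels=["cat", "dog"]))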

Slide 47

Slide 47 text

Accuracy
• True Positive (TP)
• True Negative (TN)
• False Positive (FP)
• False Negative (FN)

Slide 48

Slide 48 text

Formulas
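The standard formulas built from these counts (standard definitions, shown here for reference):

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad
\text{Precision} = \frac{TP}{TP + FP} \qquad
\text{Recall} = \frac{TP}{TP + FN}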

Slide 49

Slide 49 text

Mean Squared Error

Slide 50

Slide 50 text

Mean Squared Error
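For reference, the standard definition, where y_i is the actual value and \hat{y}_i the prediction:

\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2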

Slide 51

Slide 51 text

Code example: Mean Squared Error

mse = mean_squared_error(pricing_test, pricing_pred)
print(f"Mean Squared Error: {mse:.2f}")

# Output: Mean Squared Error: 0.56

Slide 52

Slide 52 text

Improving the model

Slide 53

Slide 53 text

Improving the model
• Feature engineering
• Hyperparameter tuning (see the sketch below)
• Model selection
• Handling outliers
• Cross validation
• Collect more data
• Regularization
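A minimal sketch of hyperparameter tuning with cross-validation, reusing the scaled training data from the earlier slides; the parameter grid below is an arbitrary example, not from the slides:

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

param_grid = {"n_estimators": [50, 100], "max_depth": [None, 10]}   # example values
search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                      cv=5, scoring="neg_mean_squared_error")
search.fit(housing_train_scaled, pricing_train)                     # 5-fold cross-validation per combination
print(search.best_params_, -search.best_score_)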

Slide 54

Slide 54 text

Code example: Handling outliers

# Example: Winsorization
from scipy.stats.mstats import winsorize

dataFrame['prices'] = winsorize(dataFrame['prices'], limits=[0.05, 0.05])

Slide 55

Slide 55 text

Code example: Using another model

# Example: Decision tree
from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor()
model.fit(housing_train_scaled, pricing_train)
pricing_pred = model.predict(housing_test_scaled)

mse = mean_squared_error(pricing_test, pricing_pred)
print(f"Mean Squared Error: {mse:.2f}")

# Output: Mean Squared Error: 0.49

Slide 56

Slide 56 text

Code example: Using another model

# Example: Random forest regression
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(housing_train_scaled, pricing_train)
pricing_pred = model.predict(housing_test_scaled)

mse = mean_squared_error(pricing_test, pricing_pred)
print(f"Mean Squared Error: {mse:.2f}")

# Output: Mean Squared Error: 0.25

Slide 57

Slide 57 text

Do try this at home! https://tinyurl.com/485n4byj https://github.com/Lutske/ML101_where_to_begin

Slide 58

Slide 58 text

Real life examples

Slide 59

Slide 59 text

Real life examples

Slide 60

Slide 60 text

Conclusion

Slide 61

Slide 61 text

Key takeaways
• Collect a lot of data
• Don’t invent everything yourself
• Machine learning isn’t always the solution
• Verify the model

Slide 62

Slide 62 text

Questions?

Slide 63

Slide 63 text

Feedback

Slide 64

Slide 64 text

Thank you!