Machine Learning (ML) Tutorial - ML as a tool for Scientists

Slide 1

Slide 1 text

Machine Learning (ML) Tutorial - ML as a tool for Scientists 200316 @oyoroco

Slide 2

Slide 2 text

Machine Learning (ML)
Output
● Prediction
Input
● Data

Slide 3

Slide 3 text

Machine Learning (ML)
Output
● Prediction
● Score
Input
● Data (features)
● Task
  ○ Target
  ○ Regression or Classification
  ○ Metrics

Slide 4

Slide 4 text

ML as a tool
Output
● Prediction
● Score
Input
● Data (features)
● Task
  ○ Target
  ○ Regression or Classification
  ○ Metrics
Auto ML
1. Define a good task
2. Validate results
3. Accelerate your work!

Slide 5

Slide 5 text

Today’s goal
1. Make a simple ML model
2. Help you decide whether machine learning can be used in your work

Slide 6

Slide 6 text

ML flow chart
1. Exploratory Data Analysis (EDA)
2. Task & Metrics
3. Feature engineering
4. Modeling
5. Model validation
6. (Model tuning)
7. (Ensemble)
Ref. https://gihyo.jp/book/2019/978-4-297-10843-4

Slide 7

Slide 7 text

1. Exploratory Data Analysis (EDA)
● Contents of the data
● Prediction target

Slide 8

Slide 8 text

2. Task & Metrics
Task
● Regression
● Classification
  ○ Binary
  ○ Multi-class
Evaluation metrics
● Regression
  ○ Root mean square error: RMSE
  ○ Mean absolute error: MAE
● Classification
  ○ Confusion matrix
  ○ Log loss
  ○ AUC
Objective function
● Regression
  ○ RMSE
● Classification
  ○ Log loss
cf. Gradient descent, differentiable
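The metrics listed above can be written out in a few lines each. The following is a minimal sketch in plain Python, not code from the talk; the function names and the toy numbers are assumptions chosen for illustration. In practice a library such as scikit-learn provides equivalent, better-tested functions.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error (regression)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error (regression)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def log_loss(y_true, p_pred, eps=1e-15):
    """Binary log loss; p_pred holds predicted probabilities of class 1."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)


def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy values (made up for illustration only)
reg_true, reg_pred = [1.0, 2.0, 3.0], [1.0, 2.0, 5.0]
clf_true, clf_score = [0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]

print(rmse(reg_true, reg_pred), mae(reg_true, reg_pred))
print(log_loss(clf_true, clf_score), auc(clf_true, clf_score))
```

RMSE and log loss are smooth in the predictions, which is one reason they also serve as objective functions for gradient descent, whereas AUC depends only on the ranking of the scores.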

Slide 9

Slide 9 text

3. Feature engineering
● Missing values (NaN)
● Standardization (for regularization)
● Categorical features
  ○ One-hot encoding
  ○ Label encoding
● Dimension reduction
  ○ Principal component analysis: PCA
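Most of the feature engineering steps above are simple column transformations. The sketch below shows one possible plain-Python version of mean imputation, standardization, label encoding, and one-hot encoding. The helper names and the sample columns are hypothetical, and mean imputation is only one of several ways to handle missing values. PCA is left out because it is usually taken from a library.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def impute_mean(values):
    """Replace missing entries with the mean of the observed entries."""
    observed = [v for v in values if not is_missing(v)]
    mean = sum(observed) / len(observed)
    return [mean if is_missing(v) else v for v in values]


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


def label_encode(categories):
    """Map each category to an integer, numbering categories in sorted order."""
    mapping = {c: i for i, c in enumerate(sorted(set(categories)))}
    return [mapping[c] for c in categories], mapping


def one_hot_encode(categories):
    """Return one 0/1 column per category level, levels in sorted order."""
    levels = sorted(set(categories))
    rows = [[1 if c == level else 0 for level in levels] for c in categories]
    return rows, levels


# Hypothetical columns, for illustration only
temperature = impute_mean([1.0, float("nan"), 3.0])
temperature_z = standardize(temperature)
codes, mapping = label_encode(["b", "a", "b"])
rows, levels = one_hot_encode(["b", "a", "c"])
print(temperature, temperature_z, codes, rows)
```

Label encoding suits tree-based models, which split on thresholds, while one-hot encoding is the usual choice for linear models, where the integer codes would otherwise be read as an ordering.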

Slide 10

Slide 10 text

4. Modeling
● Choose model (& hyperparameters)
  ■ model = Model(params)
● Training
  ○ Data (features), Target
  ■ model.fit(train_x, train_y)
● Prediction
  ○ Test data
  ■ pred = model.predict(test_x)
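To make the Model(params) / fit / predict pattern above concrete, here is a minimal sketch of a one-feature least-squares regression that exposes the same three steps. The class SimpleLinearRegression, its fit_intercept parameter, and the toy data are assumptions made up for this illustration; they are not part of the talk or of any particular library.

```python
class SimpleLinearRegression:
    """One-feature least-squares regression with a fit/predict interface.

    Hypothetical illustration class, not a library API.
    """

    def __init__(self, fit_intercept=True):
        # "params" step: settings chosen before training
        self.fit_intercept = fit_intercept
        self.slope_ = 0.0
        self.intercept_ = 0.0

    def fit(self, train_x, train_y):
        """Estimate slope and intercept from training data."""
        n = len(train_x)
        if self.fit_intercept:
            mean_x = sum(train_x) / n
            mean_y = sum(train_y) / n
        else:
            mean_x = mean_y = 0.0  # forces the line through the origin
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
        sxx = sum((x - mean_x) ** 2 for x in train_x)
        self.slope_ = sxy / sxx
        self.intercept_ = mean_y - self.slope_ * mean_x
        return self

    def predict(self, test_x):
        """Apply the fitted line to new inputs."""
        return [self.intercept_ + self.slope_ * x for x in test_x]


# Toy data following y = 2x + 1 (made up for illustration)
train_x, train_y = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]
test_x = [4.0, 5.0]

model = SimpleLinearRegression(fit_intercept=True)
model.fit(train_x, train_y)
pred = model.predict(test_x)
print(pred)
```

Keeping every model behind the same fit/predict interface is what makes it easy to swap one model for another while the rest of the workflow stays unchanged.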

Slide 11

Slide 11 text

4. Modeling: Model
● Linear model
● k-nearest neighbor algorithm: kNN
● Random forest
  ○ Decision tree + bagging
● Neural network: NN
● Gradient boosting decision tree: GBDT
  ○ Decision tree + Gradient boosting
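Of the models listed above, kNN is the shortest to write from scratch, so it makes a convenient example of what a model does internally. The sketch below assumes Euclidean distance and majority voting; the class name KNNClassifier, the parameter k, and the toy points are hypothetical and chosen only for illustration. It requires Python 3.8 or later for math.dist.

```python
import math
from collections import Counter


class KNNClassifier:
    """k-nearest neighbor classifier (illustrative sketch, not a library API)."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, train_x, train_y):
        # kNN has no real training step: it only stores the data.
        self.train_x = [tuple(x) for x in train_x]
        self.train_y = list(train_y)
        return self

    def predict(self, test_x):
        preds = []
        for x in test_x:
            # Sort training points by Euclidean distance to x, keep the k closest.
            nearest = sorted(
                zip(self.train_x, self.train_y),
                key=lambda pair: math.dist(x, pair[0]),
            )[: self.k]
            # Majority vote among the k neighbors.
            votes = Counter(label for _, label in nearest)
            preds.append(votes.most_common(1)[0][0])
        return preds


# Two well-separated toy clusters (made up for illustration)
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]

knn = KNNClassifier(k=3).fit(train_x, train_y)
knn_pred = knn.predict([(0.5, 0.5), (5.5, 5.5)])
print(knn_pred)
```

Because kNN relies on distances, features on very different scales can dominate the result, which is one practical reason to standardize features first.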

Slide 12

Slide 12 text

5. Model validation
(Figure: all data split into training data, validation data, and test data)
Ref. https://scikit-learn.org/stable/modules/cross_validation.html
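One common way to obtain validation data is k-fold cross-validation, where each sample is used for validation exactly once. The sketch below generates the index splits in plain Python; the function name k_fold_indices and its parameters are assumptions for illustration, and it presumes a test set has already been held out beforehand. The scikit-learn page referenced on this slide documents ready-made splitters for the same purpose.

```python
import random


def k_fold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_idx, valid_idx) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed for reproducible splits
    # Deal the shuffled indices into n_splits folds of (nearly) equal size.
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i in range(n_splits):
        valid_idx = folds[i]
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, valid_idx


splits = list(k_fold_indices(n_samples=10, n_splits=5))
for train_idx, valid_idx in splits:
    # In a real workflow: fit on train_idx, score on valid_idx, then average the scores.
    print(sorted(train_idx), sorted(valid_idx))
```

Averaging the validation score over the folds gives a steadier estimate of model quality than a single train/validation split, at the cost of training the model several times.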

Slide 13

Slide 13 text

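Example:
● In the live session, from this point on I gave a demo based on real experimental data, running code in a notebook alongside the slides.
● The task was binary classification, and the evaluation metric was AUC.
● Number of samples: ~1600; number of features: ~200
● The models were logistic regression and LGBM
● The flow was: EDA -> task definition (+ explanation of AUC) -> modeling (explanation of logistic regression) -> train/predict; then, based on the results, EDA once more, followed by feature engineering -> experiments.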
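The notebook and data from the demo are not reproduced here. As a stand-in, the following sketch fits a one-feature logistic regression by gradient descent on the log loss, using a tiny made-up dataset. The function names, learning rate, iteration count, and data are all assumptions for illustration and are unrelated to the experimental data used in the talk.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, n_iter=1000):
    """Fit p(y=1|x) = sigmoid(w*x + b) by gradient descent on the mean log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        # Gradient of the mean log loss with respect to w and b
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy binary classification data (made up for illustration)
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

w, b = fit_logistic(xs, ys)
probs = [sigmoid(w * x + b) for x in xs]
print(w, b, probs)
```

The predicted probabilities can be passed directly to a ranking metric such as AUC, which is how a binary classifier of this kind would typically be scored.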

Slide 14

Slide 14 text

ML as a tool
1. Define a good task
2. Validate results
3. Accelerate your work!