Machine Learning (ML) Tutorial -ML as a tool for Scientists

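These are the slides from a session I organized to introduce machine learning to the members of my lab. Please note that they contain very little content.
The background is described here: https://link.medium.com/Xq3gXhwVd5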

oyoroco

March 16, 2020

Transcript

  1. Machine Learning (ML)
    Output
    • Prediction
    • Score
    Input
    • Data (features)
    • Task
      ◦ Target
      ◦ Regression or Classification
      ◦ Metrics
  2. ML as a tool
    Output
    • Prediction
    • Score
    Input
    • Data (features)
    • Task
      ◦ Target
      ◦ Regression or Classification
      ◦ Metrics
    Auto ML
    1. Define a good task
    2. Validate results
    3. Accelerate your work!
  3. Today’s goal
    1. Make a simple ML model
    2. Help you decide whether machine learning can be used in your work
  4. ML flow chart
    1. Exploratory Data Analysis (EDA)
    2. Task & Metrics
    3. Feature engineering
    4. Modeling
    5. Model validation
    6. (Model tuning)
    7. (Ensemble)
    Ref. https://gihyo.jp/book/2019/978-4-297-10843-4
  5. 2. Task & Metrics
    Task
    • Regression
    • Classification
      ◦ Binary
      ◦ Multi-class
    Evaluation metrics
    • Regression
      ◦ Root mean square error: RMSE
      ◦ Mean absolute error: MAE
    • Classification
      ◦ Confusion matrix
      ◦ Log loss
      ◦ AUC
    Objective function
    • Regression
      ◦ RMSE
    • Classification
      ◦ Log loss
    cf.) Gradient descent, Differentiable
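
The metrics listed on this slide can be written out in a few lines each. The block below is a minimal, dependency-free sketch rather than the code used in the session: the function names (`rmse`, `mae`, `log_loss`, `confusion_counts`) and the toy numbers are assumptions made for illustration, and libraries such as scikit-learn provide tested equivalents.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error: square root of the mean squared residual."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def log_loss(y_true, p_pred, eps=1e-15):
    """Binary log loss; p_pred holds predicted probabilities of class 1."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so that log(0) never occurs
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)


def confusion_counts(y_true, y_label):
    """Return (tn, fp, fn, tp) for binary labels encoded as 0/1."""
    tn = fp = fn = tp = 0
    for t, l in zip(y_true, y_label):
        if t == 1 and l == 1:
            tp += 1
        elif t == 1 and l == 0:
            fn += 1
        elif t == 0 and l == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


# Toy regression example (made-up numbers).
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 7.0]
print("RMSE:", rmse(y_true, y_pred))
print("MAE:", mae(y_true, y_pred))

# Toy binary classification example (made-up numbers).
print("Log loss:", log_loss([1, 0, 1], [0.9, 0.2, 0.6]))
print("Confusion (tn, fp, fn, tp):", confusion_counts([1, 0, 1, 0], [1, 1, 0, 0]))
```

RMSE and log loss are differentiable in the predictions, which is why they also appear under "Objective function": gradient descent needs a gradient to follow. A count-based summary such as the confusion matrix has none.
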
  6. 3. Feature engineering
    • Missing values (NaN)
    • Standardization (for regularization)
    • Categorical features
      ◦ One-hot encoding
      ◦ Label encoding
    • Dimension reduction
      ◦ Principal component analysis: PCA
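
As a rough illustration of the first three bullets, the sketch below fills missing values with the column mean, standardizes a numeric column, and encodes a categorical column in both ways. It is plain Python written for this transcript, not the session's notebook: the helper names and the sample values are hypothetical, mean imputation is only one of several possible strategies, and PCA is left out because it needs a linear algebra library.

```python
import math


def fill_missing_with_mean(values):
    """Replace NaN entries with the mean of the observed entries."""
    observed = [v for v in values if not math.isnan(v)]
    mean = sum(observed) / len(observed)
    return [mean if math.isnan(v) else v for v in values]


def standardize(values):
    """Rescale a column to zero mean and unit (population) variance."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


def label_encode(categories):
    """Map each category to an integer, in sorted order of the categories."""
    mapping = {c: i for i, c in enumerate(sorted(set(categories)))}
    return [mapping[c] for c in categories], mapping


def one_hot_encode(categories):
    """Map each category to a 0/1 indicator vector, one column per category."""
    levels = sorted(set(categories))
    rows = [[1 if c == level else 0 for level in levels] for c in categories]
    return rows, levels


# Hypothetical numeric column with one missing value.
column = fill_missing_with_mean([1.0, float("nan"), 3.0])
print(column)
print(standardize(column))

# Hypothetical categorical column.
print(label_encode(["a", "c", "b", "a"]))
print(one_hot_encode(["a", "c", "b", "a"]))
```

Label encoding imposes an arbitrary order on the categories, which tree-based models usually tolerate. One-hot encoding avoids that order at the cost of one column per category.
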
  7. 4. Modeling
    • Choose model (& hyperparameters)
      ▪ model = Model(params)
    • Training
      ◦ Data (Features), Target
      ▪ model.fit(train_x, train_y)
    • Prediction
      ◦ Test data
      ▪ pred = model.predict(test_x)
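
The three calls on this slide (`Model(params)`, `fit`, `predict`) follow the interface used by scikit-learn-style libraries. To show what sits behind that interface, here is a small sketch of a k-nearest neighbor classifier exposing the same methods. The class `KNNClassifier`, its `n_neighbors` parameter, and the toy data are all assumptions for illustration. k-nearest neighbor is used only because it is short to write, not because the session used it.

```python
from collections import Counter


class KNNClassifier:
    """Toy k-nearest neighbor classifier with a fit/predict interface."""

    def __init__(self, n_neighbors=3):
        # Hyperparameter: how many neighbors vote on each prediction.
        self.n_neighbors = n_neighbors

    def fit(self, train_x, train_y):
        # kNN has no real training step; it just memorizes the data.
        self.train_x = [list(row) for row in train_x]
        self.train_y = list(train_y)
        return self

    def predict(self, test_x):
        preds = []
        for x in test_x:
            # Squared Euclidean distance from x to every training row.
            dists = [
                sum((a - b) ** 2 for a, b in zip(x, row))
                for row in self.train_x
            ]
            nearest = sorted(range(len(dists)), key=dists.__getitem__)
            votes = Counter(self.train_y[i] for i in nearest[: self.n_neighbors])
            preds.append(votes.most_common(1)[0][0])
        return preds


# Made-up data: two well-separated clusters with labels 0 and 1.
train_x = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = [0, 0, 0, 1, 1, 1]
test_x = [[0.5, 0.5], [5.5, 5.5]]

model = KNNClassifier(n_neighbors=3)  # model = Model(params)
model.fit(train_x, train_y)           # training
pred = model.predict(test_x)          # prediction
print(pred)
```

Because every model shares this interface, swapping in a different one usually changes only the line that constructs `model`.
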
  8. 4. Modeling: Model
    • Linear model
    • k-nearest neighbor algorithm: kNN
    • Random forest
      ◦ Decision tree + bagging
    • Neural network: NN
    • Gradient boosting decision tree: GBDT
      ◦ Decision tree + Gradient boosting
  9. 5. Model validation
    [Figure: All data split into Training data, Validation data, and Test data]
    Ref. https://scikit-learn.org/stable/modules/cross_validation.html
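
The linked scikit-learn page covers cross-validation, so one plausible reading of this slide is a k-fold split of the training data into training and validation parts, with the test data held out separately. The sketch below generates such index splits in plain Python. The function name `k_fold_indices`, the default of five folds, and the fixed seed are assumptions, and a real project would more likely use the library's own splitter.

```python
import random


def k_fold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_idx, valid_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Deal the shuffled indices into n_splits folds of near-equal size.
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i in range(n_splits):
        valid_idx = folds[i]
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, valid_idx


# Each sample is used for validation exactly once across the folds.
splits = list(k_fold_indices(n_samples=10, n_splits=5))
for fold, (train_idx, valid_idx) in enumerate(splits):
    print(fold, sorted(train_idx), sorted(valid_idx))
```

A model is fitted on `train_idx` and scored on `valid_idx` for each fold, and the scores are averaged. The test data stays untouched until the very end, so it still gives an honest estimate.
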
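  10. Example:
    • On the day, from this point on I gave a demo based on real experimental data, running the code live with slides and a notebook.
    • The task was binary classification, and the evaluation metric was AUC.
    • Number of samples: ~1600; number of features: ~200
    • The models were logistic regression and LGBM.
    • The flow was: EDA -> task definition (+ explanation of AUC) -> modeling (explanation of logistic regression) -> train/predict. Then, based on the results, we did EDA once more, followed by feature engineering -> experiments.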
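
Since the demo was scored with AUC, a short sketch of that metric may help. AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as half. The function below computes exactly that by comparing every positive-negative pair. The name `roc_auc` and the toy scores are assumptions; the demo's data is not reproduced here.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores))  # 3 of the 4 pairs are ordered correctly: 0.75
```

The pairwise loop is quadratic in the sample count, which is acceptable for data on the order of the ~1600 samples mentioned above. Library implementations use a faster, sort-based computation.
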
  11. ML as a tool
    1. Define a good task
    2. Validate results
    3. Accelerate your work!