Slide 1

Quick Start Guide of Kaggle: Machine Learning Competitions with Python
SciPy Japan 2020, 0900-1230, October 30
Speaker: Shotaro Ishihara

Slide 2

Abstract
As you may know, Kaggle is a high-profile machine learning competition platform. On Kaggle, data scientists from all over the world use Python to build machine learning models. In this hands-on tutorial, you'll learn the basics of machine learning and Kaggle by running Notebook-style source code. The objective is to help participants learn how to compete and learn with Kaggle using Python.

Slide 3

Table of Contents
0900-0935 Introduction: What is machine learning & Kaggle?
0940-1045 Practice 1: From participation to submission
1100-1145 Practice 2: How to boost your score
1150-1230 Conclusion: Wrap up & future resources

Slide 4

Announcements
Resources: https://github.com/upura/scipy-japan-2020-kaggle-tutorial
Questions are welcome in both English and Japanese:
Zoom chat during the tutorial
GitHub Issues after the tutorial

Slide 5

Profile: Shotaro Ishihara
Data scientist at Nikkei
Kaggle "PetFinder.my Adoption Prediction" 1st place
Tier: Kaggle Master
Host of the "Kaggle Days Tokyo" offline competition
『PythonではじめるKaggleスタートブック』(講談社)
INMA "30 Under 30 Awards", Grand Prize in Asia/Pacific

Slide 6

Introduction
What is machine learning & Kaggle?

Slide 7

In this section, you will learn ...
☑ What is machine learning?
☑ What is Kaggle?

Slide 8

What is machine learning?

Slide 9

AI, Machine learning, and Deep learning
The diagram shows the relationship among them:
[Diagram: Artificial intelligence (AI) ⊃ Machine learning — supervised learning (logistic regression, random forest), deep learning, unsupervised learning, reinforcement learning]

Slide 10

What is artificial intelligence?
There are two main approaches to AI research:
1. Create a machine with human intelligence itself
2. Let a machine do a specific task that humans can do with their intelligence
Most current research takes the latter position.
https://www.ai-gakkai.or.jp/whatsai/AIwhats.html

Slide 11

What is machine learning?
A general term for technology that enables computers to acquire human-like learning ability
Artificial intelligence ⊃ Machine learning
The recent "artificial intelligence" boom has been driven by the rapid growth of machine learning research
杉山将, 『イラストで学ぶ機械学習』, 講談社

Slide 12

What can machine learning do?
Supervised learning
Unsupervised learning
Reinforcement learning

Slide 13

Supervised learning
Given several (problem, answer) pairs, the machine acquires the ability to answer questions it has not been taught.
石原ら, 『PythonではじめるKaggleスタートブック』, 講談社

Slide 14

Concept of supervised learning
https://www.hubertwang.me/machinelearning/intro-to-tf-for-ai-ml-and-dl

Slide 15

Exercise 1
1.1 Calculate the area of a circle with a radius of 3.
1.2 When a ball is dropped from a bridge, it reaches the surface of the water 2 seconds later. Calculate the distance between the bridge and the surface of the water, using the gravitational acceleration g.

Slide 16

Answer 1
There are well-known theories / formulas.
1.1 A = π × r²
1.2 y = (1/2) × g × t²
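These can be checked numerically; a minimal Python sketch (assuming g = 9.8 m/s², which the exercise leaves unspecified):

import math

# 1.1: area of a circle with radius 3
radius = 3
print(math.pi * radius ** 2)   # about 28.27

# 1.2: free-fall distance after 2 seconds (assuming g = 9.8 m/s^2)
g, t = 9.8, 2
print(0.5 * g * t ** 2)        # 19.6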

Slide 17

Exercise 2
f(0) = 3
f(1) = 5
f(2) = 7
f(3) = 9
f(4) = 11
f(5) = ?

Slide 18

Answer 2
f(x) = 2 × x + 3
f(0) = 3, f(1) = 5, f(2) = 7, f(3) = 9, f(4) = 11
f(5) = 13

Slide 19

Exercise 3
f(0.42) = 3.24
f(1.223) = 5.23
f(2.84) = 7.02
f(3.43) = 9.24
f(4.58) = 11.22
f(5) = ?

Slide 20

Answer 3
f(5) ≈ 12
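This is the core idea of supervised learning: from the noisy (input, output) pairs in Exercise 3, a model can estimate the underlying rule and predict f(5). A minimal sketch with scikit-learn (the choice of LinearRegression is mine, not part of the slides):

import numpy as np
from sklearn.linear_model import LinearRegression

# (problem, answer) pairs from Exercise 3
X = np.array([[0.42], [1.223], [2.84], [3.43], [4.58]])
y = np.array([3.24, 5.23, 7.02, 9.24, 11.22])

model = LinearRegression()
model.fit(X, y)

print(model.coef_, model.intercept_)  # approximate slope and intercept of the rule
print(model.predict([[5]]))           # close to 12, matching Answer 3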

Slide 21

Characteristics of supervised learning
It infers a rule from past results.
Even if we don't know the exact rule, the model gives us an answer that looks "good".

Slide 22

Examples of supervised learning
Image recognition (Google Image Search)
Voice recognition (Siri)
Automatic spam classification (Gmail)
Machine translation (DeepL)

Slide 23

Unsupervised learning
The goal is to gain useful knowledge from data.
There is a lot of unlabeled data in the world.

Slide 24

Examples of unsupervised learning
Clustering
Anomaly detection
https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
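In the spirit of the linked k-means visualization, a minimal clustering sketch with scikit-learn on synthetic, unlabeled data (the data and parameters are illustrative):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# 300 unlabeled points generated around 3 hidden centers
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# k-means groups the points into 3 clusters without using any labels
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels[:10])              # cluster id assigned to the first 10 points
print(kmeans.cluster_centers_)  # estimated centers of the 3 clusters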

Slide 25

Reinforcement learning
The goal is for the computer to acquire the same kind of ability as in supervised learning.
We give the computer not data but an "environment": action / state / reward.
Instead of considering each piece of data separately, the "reward" for the "state" reached after a series of "actions" is given as the correct answer.
久保隆宏, 『Pythonで学ぶ強化学習 入門から実践まで』, 講談社

Slide 26

Examples of reinforcement learning
Computer games (AlphaGo)
Automatic control of robots
Advertising distribution

Slide 27

What is Kaggle?

Slide 28

Kaggle: Machine Learning Competitions
A competition platform for testing the performance of different machine learning models, mainly for supervised learning.
Organized by Google-owned Kaggle.
Data scientists from around the world participate.

Slide 29

Overview of Kaggle
石原ら, 『PythonではじめるKaggleスタートブック』, 講談社

Slide 30

DeNA's Kaggle ranking system
https://dena.ai/kaggle/

Slide 31

What can we do on Kaggle?
Use Python or R to create machine learning models.
Stay tuned for the next Practice section!

Slide 32

In this section, you learned ...
☑ What is machine learning?
Supervised learning
Unsupervised learning
Reinforcement learning
☑ What is Kaggle?

Slide 33

Practice
Python tutorial with Kaggle Notebooks

Slide 34

In this section, you will learn ...
From participation to submission
☑ Participation in a competition
☑ How to use the Python environment in Kaggle
☑ Loading packages
☑ Loading datasets
☑ Feature engineering
☑ Training and prediction of machine learning algorithms
☑ Submission to the leaderboard

Slide 35

In this section, you will learn ...
How to boost your score
☑ Exploratory data analysis
☑ Adding hypothesis-based features
☑ Switching machine learning algorithms
☑ Hyperparameter tuning
☑ The importance of validation
☑ Ensembling

Slide 36

Titanic: Machine Learning from Disaster
Competition page: https://www.kaggle.com/c/titanic
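As a preview of the practice Notebooks, a minimal end-to-end baseline for this competition could look like the sketch below. The file paths follow the Kaggle Notebook convention (../input/titanic/), and the choice of features and model is illustrative, not the tutorial's official code.

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Load the competition data as mounted in a Kaggle Notebook
train = pd.read_csv("../input/titanic/train.csv")
test = pd.read_csv("../input/titanic/test.csv")

# Very simple feature engineering: encode Sex, fill missing Age and Fare
for df in (train, test):
    df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
    df["Age"] = df["Age"].fillna(df["Age"].median())
    df["Fare"] = df["Fare"].fillna(df["Fare"].median())

features = ["Pclass", "Sex", "Age", "Fare"]

# Train a model and predict for the test passengers
model = LogisticRegression(max_iter=1000)
model.fit(train[features], train["Survived"])
pred = model.predict(test[features])

# Write a file in the format expected by the leaderboard
submission = pd.DataFrame({"PassengerId": test["PassengerId"], "Survived": pred})
submission.to_csv("submission.csv", index=False)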

Slide 37

Exploratory data analysis
Get insight into the data and the task
There is no free lunch
What I did in a past competition: https://www.kaggle.com/c/petfinder-adoption-prediction/discussion/88773
石原ら, 『PythonではじめるKaggleスタートブック』, 講談社

Slide 38

Exercise 4
Let's come up with at least 3 hypotheses that might contribute to better predictions.
Tool: you can use helpful packages such as pandas_profiling, matplotlib, and seaborn.
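For example, a few quick checks with pandas and seaborn might look like this sketch (the hypotheses here are my own illustrations, not the expected answers; sns.histplot assumes seaborn 0.11 or later):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

train = pd.read_csv("../input/titanic/train.csv")

# Hypothesis: women survived more often than men
print(train.groupby("Sex")["Survived"].mean())

# Hypothesis: a higher ticket class (smaller Pclass) means higher survival
print(train.groupby("Pclass")["Survived"].mean())

# Hypothesis: the age distribution differs between survivors and non-survivors
sns.histplot(data=train, x="Age", hue="Survived", bins=30)
plt.show()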

Slide 39

Adding hypothesis-based features
Be careful about reproducibility
Then, add hypothesis-based features
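For instance, a family-size hypothesis on the Titanic data could be turned into features like the sketch below (the feature names FamilySize and IsAlone are common community choices, not prescribed here); fixing random seeds elsewhere in the pipeline keeps results reproducible.

import pandas as pd

train = pd.read_csv("../input/titanic/train.csv")
test = pd.read_csv("../input/titanic/test.csv")

for df in (train, test):
    # Hypothesis: people traveling with family had different survival chances
    df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
    df["IsAlone"] = (df["FamilySize"] == 1).astype(int)

# Quick check of the hypothesis against the label
print(train.groupby("IsAlone")["Survived"].mean())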

Slide 40

Switching machine learning algorithms
In general, LightGBM works better in tabular competitions.
sklearn: https://scikit-learn.org/stable/supervised_learning.html
LightGBM: https://lightgbm.readthedocs.io/en/latest/
Neural networks: https://keras.io/, https://pytorch.org/
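A minimal LightGBM sketch using its scikit-learn interface (hyperparameters and features are illustrative; LightGBM handles the remaining missing values in Age natively):

import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split

train = pd.read_csv("../input/titanic/train.csv")
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})
features = ["Pclass", "Sex", "Age", "Fare"]

X_tr, X_va, y_tr, y_va = train_test_split(
    train[features], train["Survived"], test_size=0.2, random_state=0
)

model = lgb.LGBMClassifier(n_estimators=100, learning_rate=0.1, random_state=0)
model.fit(X_tr, y_tr)
print(model.score(X_va, y_va))  # accuracy on the hold-out split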

Slide 41

Decision Trees

Slide 42

Gradient Boosting

Slide 43

Hyperparameter tuning
1. Hand tuning
2. Use Optuna: https://optuna.org/
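A minimal Optuna sketch that tunes a few LightGBM parameters via cross validation (the search ranges and trial count are illustrative assumptions):

import optuna
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import cross_val_score

train = pd.read_csv("../input/titanic/train.csv")
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})
features = ["Pclass", "Sex", "Age", "Fare"]

def objective(trial):
    params = {
        "num_leaves": trial.suggest_int("num_leaves", 8, 64),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "n_estimators": trial.suggest_int("n_estimators", 50, 300),
        "random_state": 0,
    }
    model = lgb.LGBMClassifier(**params)
    # Mean accuracy over 5-fold cross validation
    return cross_val_score(model, train[features], train["Survived"], cv=5).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params, study.best_value)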

Slide 44

The importance of validation
Why do we need validation?
The number of submissions is limited
The risk of overfitting
石原ら, 『PythonではじめるKaggleスタートブック』, 講談社

Slide 45

Hold-out
石原ら, 『PythonではじめるKaggleスタートブック』, 講談社

Slide 46

Cross validation
石原ら, 『PythonではじめるKaggleスタートブック』, 講談社
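A cross-validation sketch with StratifiedKFold, which keeps the label ratio similar in every fold (the fold count and model are illustrative):

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

train = pd.read_csv("../input/titanic/train.csv")
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})
train["Age"] = train["Age"].fillna(train["Age"].median())
features = ["Pclass", "Sex", "Age", "Fare"]

X = train[features].values
y = train["Survived"].values

scores = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, valid_idx in skf.split(X, y):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[valid_idx])
    scores.append(accuracy_score(y[valid_idx], pred))

print(np.mean(scores))  # average validation score over the 5 folds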

Slide 47

Ensembling
Diversity boosts the score
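A simple ensembling sketch: average the predicted probabilities of two different models, then threshold (the choice of models and equal weights is illustrative):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

train = pd.read_csv("../input/titanic/train.csv")
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})
train["Age"] = train["Age"].fillna(train["Age"].median())
features = ["Pclass", "Sex", "Age", "Fare"]

X_tr, X_va, y_tr, y_va = train_test_split(
    train[features], train["Survived"], test_size=0.2, random_state=0
)

# Two diverse models trained on the same data
lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Average the predicted probabilities, then threshold at 0.5
proba = (lr.predict_proba(X_va)[:, 1] + rf.predict_proba(X_va)[:, 1]) / 2
pred = (proba > 0.5).astype(int)
print(accuracy_score(y_va, pred))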

Slide 48

In this section, you learned ...
From participation to submission
☑ Participation in a competition
☑ How to use the Python environment in Kaggle
☑ Loading packages
☑ Loading datasets
☑ Feature engineering
☑ Training and prediction of machine learning algorithms
☑ Submission to the leaderboard

Slide 49

In this section, you learned ...
How to boost your score
☑ Exploratory data analysis
☑ Adding hypothesis-based features
☑ Switching machine learning algorithms
☑ Hyperparameter tuning
☑ The importance of validation
☑ Ensembling

Slide 50

Conclusion
Wrap up & future resources

Slide 51

In this section, you will learn ...
☑ Summary of this tutorial
☑ Future resources

Slide 52

Workflow in Kaggle competitions
1. Understand the competition and the task
2. Create a benchmark
   Simple feature engineering
   Training and prediction
   Validation
3. Exploratory data analysis and hypotheses
4. Check validation and submission scores
5. Go back to 3 (sometimes 1 or 2)

Slide 53

Recommended way for beginners to participate
1. Understand the competition and the task
   Read the competition page
   Read EDA Notebooks
2. Create a benchmark
   Utilize public Notebooks
3. Improve the benchmark
   Utilize public Notebooks and Discussions
4. Do ensembling

Slide 54

Which competitions?
There are several perspectives:
Medals
Type of dataset
Schedule
Code competitions
Competition platform
https://speakerdeck.com/upura/how-to-choose-kaggle-competitions

Slide 55

Future resources
『PythonではじめるKaggleスタートブック』(講談社)
「Kaggleに登録したら次にやること 〜 これだけやれば十分闘える!Titanicの先へ行く入門 10 Kernel 〜」
Weekly Kaggle News
『Kaggleで勝つデータ分析の技術』(技術評論社)
『Approaching (Almost) Any Machine Learning Problem』
kaggler-ja Slack

Slide 56

In this section, you learned ...
☑ Summary of this tutorial
☑ Future resources