Quick Start Guide of Kaggle: Machine Learning Competitions with Python

SciPy Japan 2020, 0900-1230, October 30
Speaker: Shotaro Ishihara

Abstract
Kaggle is a high-profile machine learning competition platform where data scientists from all over the world use Python to build machine learning models. In this hands-on tutorial, you'll learn the basics of machine learning and Kaggle by running Notebook-style source code. The objective is to help participants learn how to compete and learn with Kaggle using Python.

Table of Contents
0900-0935 Introduction: What is machine learning & Kaggle?
0940-1045 Practice 1: From participation to submission
1100-1145 Practice 2: How to boost your score
1150-1230 Conclusion: Wrap up & future resources

Announcements
Resources: https://github.com/upura/scipy-japan-2020-kaggle-tutorial
Questions are welcome in both English and Japanese:
  Zoom Chat during the tutorial
  GitHub Issues after the tutorial

Profile: Shotaro Ishihara
Data scientist at Nikkei
Kaggle "PetFinder.my Adoption Prediction" 1st place
Tier: Kaggle Master
Host of "Kaggle Days Tokyo" offline competition
『PythonではじめるKaggleスタートブック』 (講談社)
INMA "30 Under 30 Awards", Grand Prize in Asia/Pacific

Introduction
What is machine learning & Kaggle?

In this section, you will learn ...
☑ What is machine learning?
☑ What is Kaggle?

What is machine learning?

AI, Machine learning, and Deep learning
The relationship among these terms:
Artificial intelligence (AI)
Machine learning
Supervised learning
Logistic regression
Random Forest
Deep learning
Unsupervised learning
Reinforcement learning

What is artificial intelligence?
There are two main approaches to AI research:
1. Create a machine with human intelligence itself
2. Let a machine do a specific task that humans can do with their intelligence
Most current research takes the latter position.
https://www.ai-gakkai.or.jp/whatsai/AIwhats.html

What is machine learning?
A general term for technology that enables computers to acquire human-like learning ability
Artificial intelligence ∋ Machine Learning
The recent "artificial intelligence" boom has been driven by the rapid growth of machine learning research
杉山将, 『イラストで学ぶ機械学習』, 講談社

What can machine learning do?
Supervised learning
Unsupervised learning
Reinforcement learning

Supervised learning
Given several (problem, answer) pairs, a model can acquire the ability to answer questions that it has not been taught.
石原ら, 『PythonではじめるKaggleスタートブック』, 講談社

Concept of supervised learning
https://www.hubertwang.me/machinelearning/intro-to-tf-for-ai-ml-and-dl

Exercise 1
1.1 Calculate the area of a circle with a radius of 3.
1.2 When a ball was dropped from a bridge, it reached the surface of the water 2 seconds later. Calculate the distance between the bridge and the surface of the water with gravitational acceleration g.

Answer 1
There are well-known theories / formulas.
1.1 $A = \pi \times r^2$
1.2 $y = \frac{1}{2} \times g \times t^2$
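The two formulas in Answer 1 can be checked numerically. This is only a small sketch: the exercise leaves g symbolic, so the value g = 9.8 m/s^2 used here is an assumption, and the variable names are illustrative.

```python
import math

# 1.1 Area of a circle with radius 3: A = pi * r^2
radius = 3
area = math.pi * radius ** 2

# 1.2 Free fall from rest for t = 2 seconds: y = 1/2 * g * t^2
g = 9.8  # assumed value in m/s^2; the slide keeps g as a symbol
t = 2
distance = 0.5 * g * t ** 2

print(f"area = {area:.2f}")
print(f"distance = {distance:.1f} m")
```

With g left symbolic, answer 1.2 is simply y = 2g.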

Exercise 2
f(0) = 3
f(1) = 5
f(2) = 7
f(3) = 9
f(4) = 11
f(5) = ?

Answer 2
f(x) = 2 × x + 3
f(0) = 3
f(1) = 5
f(2) = 7
f(3) = 9
f(4) = 11
f(5) = 13

Exercise 3
f(0.42) = 3.24
f(1.223) = 5.23
f(2.84) = 7.02
f(3.43) = 9.24
f(4.58) = 11.22
f(5) = ?

Answer 3
f(5) ≈ 12
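One way to arrive at an answer like this is to fit a straight line to the noisy points from Exercise 3 and evaluate it at x = 5. The sketch below uses ordinary least squares written out by hand. The slides do not say which method was used, so treat the choice of a linear model and the helper name fit_line as assumptions.

```python
# The (x, f(x)) pairs from Exercise 3.
xs = [0.42, 1.223, 2.84, 3.43, 4.58]
ys = [3.24, 5.23, 7.02, 9.24, 11.22]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(xs, ys)
prediction = slope * 5 + intercept  # the learned rule applied to an unseen input

print(f"f(x) is roughly {slope:.2f} * x + {intercept:.2f}")
print(f"f(5) is roughly {prediction:.1f}")
```

The fitted slope and intercept land near the exact rule from Answer 2, which is the point of the exercise: the rule is recovered from examples rather than given.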

Characteristic of supervised learning
It derives a rule from past results.
Even if we don't know the exact rule, it gives us an answer that looks "good".

Examples of supervised learning
Image recognition (Google Image Search)
Voice recognition (Siri)
Automatic spam classification (Gmail)
Machine translation (DeepL)

Unsupervised learning
The goal is to gain useful knowledge from data.
There is a lot of unlabeled data in the world.

Examples of unsupervised learning
Clustering
Anomaly detection
https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
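The k-means page linked above groups unlabeled points around centers that move step by step. A minimal one-dimensional sketch of that idea follows. The data points, the initial centers, and the function name kmeans_1d are made up for illustration, and real projects would more likely use a library implementation.

```python
def kmeans_1d(points, centers, n_iter=10):
    """Plain k-means (Lloyd's algorithm) on 1-D data with given initial centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


# No labels are given; the grouping comes from the data alone.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
centers, clusters = kmeans_1d(points, centers=[0.0, 6.0])
print(centers)
print(clusters)
```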

Reinforcement learning
The goal is to get a computer to acquire the same ability as supervised learning.
Give the computer not data but an "environment": Action / State / Reward
Instead of considering each piece of data separately, we give a "reward" for the "state" reached after a series of "actions" as the correct answer.
久保隆宏, 『Pythonで学ぶ強化学習 入門から実践まで』, 講談社

Examples of reinforcement learning
Computer games (AlphaGo)
Automatic control of robots
Advertising distribution

What is Kaggle?

Kaggle: Machine Learning Competitions
A competition platform for testing the performance of different machine learning models, mainly for supervised learning.
Run by Kaggle, which is owned by Google.
Data scientists from around the world participate.

Overview of Kaggle
石原ら, 『PythonではじめるKaggleスタートブック』, 講談社

DeNA's Kaggle ranking system
https://dena.ai/kaggle/

What can we do in Kaggle?
Use Python or R to create machine learning models
Stay tuned for the next Practice section!

In this section, you learned ...
☑ What is machine learning?
  Supervised learning
  Unsupervised learning
  Reinforcement learning
☑ What is Kaggle?

Practice
Python tutorial with Kaggle Notebooks

In this section, you will learn ...
From participation to submission
☑ Participation in a competition
☑ How to use Python environment in Kaggle
☑ Loading packages
☑ Loading datasets
☑ Feature engineering
☑ Training and prediction of machine learning algorithms
☑ Submission to the leaderboard
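The steps from loading packages to writing a submission file can be sketched in a few lines. This is not the tutorial's Notebook. It is a hedged outline that assumes pandas and scikit-learn are available, uses a tiny hand-made table with Titanic-style column names in place of the real files, and introduces the helper name make_features purely for illustration.

```python
# Loading packages
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Loading datasets. On Kaggle you would read the competition files instead, e.g.
#   train = pd.read_csv("/kaggle/input/titanic/train.csv")  # assumed path
# A tiny made-up table with the same column names stands in here.
train = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 1, 3, 3],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, None],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 8.46],
})
test = pd.DataFrame({
    "PassengerId": [7, 8],
    "Pclass": [1, 3],
    "Sex": ["male", "female"],
    "Age": [54.0, None],
    "Fare": [51.86, 21.08],
})


def make_features(df: pd.DataFrame) -> pd.DataFrame:
    """Feature engineering: turn raw columns into numeric model inputs."""
    out = pd.DataFrame(index=df.index)
    out["Pclass"] = df["Pclass"]
    out["Sex"] = df["Sex"].map({"male": 0, "female": 1})
    out["Age"] = df["Age"].fillna(df["Age"].median())
    out["Fare"] = df["Fare"].fillna(df["Fare"].median())
    return out


X_train = make_features(train)
y_train = train["Survived"]
X_test = make_features(test)

# Training and prediction
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Submission: one row per test passenger with the predicted label
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": model.predict(X_test),
})
submission.to_csv("submission.csv", index=False)
```

The resulting submission.csv is the file you would upload (or commit from a Notebook) to get a leaderboard score.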

In this section, you will learn ...
How to boost your score
☑ Exploratory data analysis
☑ Adding hypothesis-based features
☑ Switching machine learning algorithms
☑ Hyperparameter tuning
☑ The importance of validation
☑ Ensembling

Titanic: Machine Learning from Disaster
Competition page: https://www.kaggle.com/c/titanic

Exploratory data analysis
Gain insight into the data and the task
There is no free lunch
What I did in a past competition: https://www.kaggle.com/c/petfinder-adoption-prediction/discussion/88773
石原ら, 『PythonではじめるKaggleスタートブック』, 講談社

Exercise 4
Come up with at least 3 hypotheses that might contribute to better predictions.
Tool: You can use packages such as pandas_profiling, matplotlib, and seaborn.

Adding hypothesis-based features
Be careful about reproducibility
Then, add hypothesis-based features
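As a hedged illustration of both points, the sketch below first fixes random seeds and then turns one hypothesis into columns. The hypothesis itself (that family size matters for survival), the column names FamilySize and IsAlone, and the helper seed_everything are assumptions chosen for this example, not something the slides specify. SibSp and Parch are columns of the Titanic dataset.

```python
import random

import numpy as np
import pandas as pd


def seed_everything(seed: int = 42) -> None:
    """Fix the random seeds so that repeated runs give the same result."""
    random.seed(seed)
    np.random.seed(seed)


seed_everything(42)

# A few made-up rows with the Titanic columns for siblings/spouses and parents/children.
df = pd.DataFrame({
    "SibSp": [1, 0, 0, 3],
    "Parch": [0, 0, 2, 1],
})

# Hypothesis: travelling alone or in a large family changes the chance of survival.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1  # +1 counts the passenger
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)

print(df)
```

Whether a feature like this helps is an empirical question, which is why it should be checked against a validation score rather than assumed.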

Switching machine learning algorithms
In general, LightGBM works well in tabular competitions.
sklearn: https://scikit-learn.org/stable/supervised_learning.html
LightGBM: https://lightgbm.readthedocs.io/en/latest/
Neural Network: https://keras.io/, https://pytorch.org/
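Because scikit-learn estimators share the same fit / predict interface, switching algorithms is mostly a matter of swapping one object. The sketch below compares two scikit-learn models on synthetic data with a single hold-out split. The dataset, the model settings, and the dictionary names are illustrative assumptions. LightGBM is left out only to keep the example dependency-light; its LGBMClassifier follows the same interface and could be added to the dictionary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic tabular data standing in for a competition dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)

# Each candidate exposes the same fit / score methods, so the loop body never changes.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_va, y_va)  # accuracy on the held-out part

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```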

Decision Trees

Gradient Boosting

Hyperparameter tuning
1. Hand tuning
2. Use Optuna: https://optuna.org/
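Hand tuning usually means trying a few values, scoring each with validation, and keeping the best. A minimal sketch of that loop follows, using scikit-learn on synthetic data. The choice of model, the max_depth candidates, and the variable names are assumptions for illustration. A tool such as Optuna automates the same search by proposing the candidate values itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Try a handful of values for one hyperparameter and record the mean CV accuracy.
results = {}
for max_depth in [2, 4, 8]:
    model = RandomForestClassifier(
        n_estimators=50, max_depth=max_depth, random_state=0
    )
    results[max_depth] = cross_val_score(model, X, y, cv=5).mean()

best_depth = max(results, key=results.get)
print(results)
print("best max_depth:", best_depth)
```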

The importance of validation
Why do we need validation?
The number of submissions is limited
The risk of overfitting
石原ら, 『PythonではじめるKaggleスタートブック』, 講談社

Hold out
石原ら, 『PythonではじめるKaggleスタートブック』, 講談社

Cross validation
石原ら, 『PythonではじめるKaggleスタートブック』, 講談社
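Cross validation repeats the hold-out idea several times so that every row is used for validation exactly once. The sketch below shows a 5-fold loop with scikit-learn's KFold on synthetic data. The number of folds, the model, and the variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, valid_idx in kf.split(X):
    # Train on four folds, score on the remaining one.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[valid_idx], y[valid_idx]))

# The average over folds is a steadier estimate than a single hold-out score.
cv_score = float(np.mean(fold_scores))
print(fold_scores)
print(f"CV accuracy: {cv_score:.3f}")
```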

Ensembling
Diversity boosts the score
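The simplest form of ensembling is to average the predicted probabilities of several models and then apply a threshold. The sketch below does this with made-up probabilities from three hypothetical models. The numbers, the 0.5 threshold, and the function name average_ensemble are all illustrative assumptions.

```python
def average_ensemble(prob_lists):
    """Average the predictions of several models, row by row."""
    n_models = len(prob_lists)
    return [sum(ps) / n_models for ps in zip(*prob_lists)]


# Made-up predicted probabilities for four test rows from three different models.
model_a = [0.9, 0.4, 0.2, 0.7]
model_b = [0.8, 0.6, 0.1, 0.4]
model_c = [0.7, 0.7, 0.3, 0.3]

avg = average_ensemble([model_a, model_b, model_c])
labels = [int(p >= 0.5) for p in avg]

print(labels)  # → [1, 1, 0, 0]
```

Note the second and fourth rows: the models disagree there, and the average settles the vote. Models that make different mistakes are what gives averaging its benefit.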

In this section, you learned ...
From participation to submission
☑ Participation in a competition
☑ How to use Python environment in Kaggle
☑ Loading packages
☑ Loading datasets
☑ Feature engineering
☑ Training and prediction of machine learning algorithms
☑ Submission to the leaderboard

In this section, you learned ...
How to boost your score
☑ Exploratory data analysis
☑ Adding hypothesis-based features
☑ Switching machine learning algorithms
☑ Hyperparameter tuning
☑ The importance of validation
☑ Ensembling

Conclusion
Wrap up & future resources

In this section, you will learn ...
☑ Summary of this tutorial
☑ Future resources

Workflow in Kaggle competitions
1. Understand the competition and the task
2. Create benchmark
   Simple feature engineering
   Training and prediction
   Validation
3. Exploratory data analysis and hypothesis
4. Check validation and submission scores
5. Back to 3 (Sometimes 1 or 2)

Recommended way for beginners to participate
1. Understand the competition and the task
   Read competition page
   Read EDA Notebooks
2. Create benchmark
   Utilize public Notebooks
3. Improve benchmark
   Utilize public Notebooks and Discussions
4. Do ensembling

Which competitions?
There are several perspectives
Medal
Type of dataset
Schedule
Code competitions
Competition platform
https://speakerdeck.com/upura/how-to-choose-kaggle-competitions

Future resources
『PythonではじめるKaggleスタートブック』 (講談社)
Kaggleに登録したら次にやること 〜これだけやれば十分闘える！Titanicの先へ行く入門 10 Kernel〜
Weekly Kaggle News
『Kaggleで勝つデータ分析の技術』 (技術評論社)
『Approaching (Almost) Any Machine Learning Problem』
kaggler-ja Slack

In this section, you learned ...
☑ Summary of this tutorial
☑ Future resources
