Abstract
Kaggle is a high-profile machine learning competition platform where data scientists from all over the world use Python to build machine learning models. In this hands-on tutorial, you'll learn the basics of machine learning and Kaggle by running notebook-style source code. The objective is to help participants start competing and learning on Kaggle using Python.
Table of Contents
0900-0935 Introduction: What is machine learning & Kaggle?
0940-1045 Practice 1: From participation to submission
1100-1145 Practice 2: How to boost your score
1150-1230 Conclusion: Wrap up & future resources
Announcements
Resources: https://github.com/upura/scipy-japan-2020-kaggle-tutorial
Questions are welcome in both English and Japanese
  Zoom Chat during the tutorial
  GitHub Issues after the tutorial
Profile: Shotaro Ishihara
Data scientist at Nikkei
Kaggle "PetFinder.my Adoption Prediction" 1st place
Tier: Kaggle Master
Host of "Kaggle Days Tokyo" offline competition
『PythonではじめるKaggleスタートブック』 (Kodansha)
INMA "30 Under 30 Awards", Grand Prize in Asia/Pacific
AI, Machine learning, and Deep learning
How they relate to one another:
Artificial intelligence (AI)
  Machine learning
    Supervised learning
      Logistic regression
      Random Forest
      Deep learning
    Unsupervised learning
    Reinforcement learning
What is artificial intelligence?
There are two main approaches to AI research:
1. Create a machine that has human intelligence itself
2. Have a machine perform a specific task that humans do using their intelligence
Most current research takes the latter approach.
https://www.ai-gakkai.or.jp/whatsai/AIwhats.html
What is machine learning?
A general term for technology that enables computers to acquire a human-like ability to learn
Artificial intelligence ∋ Machine learning
The recent "artificial intelligence" boom has been driven by the rapid growth of machine learning research
Masashi Sugiyama, 『イラストで学ぶ機械学習』, Kodansha
Supervised learning
Given several (problem, answer) pairs, a computer can acquire the ability to answer questions that it has not been taught.
Ishihara et al., 『PythonではじめるKaggleスタートブック』, Kodansha
Exercise 1
1.1 Calculate the area of a circle with a radius of 3.
1.2 A ball dropped from a bridge reached the surface of the water 2 seconds later. Calculate the distance between the bridge and the surface of the water in terms of the gravitational acceleration g.
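Both parts of Exercise 1 can be solved with a known rule, which is the contrast with machine learning that the exercise sets up. The following is a minimal Python sketch, assuming the ball starts from rest, air resistance is ignored, and g = 9.8 m/s^2 (the exercise leaves g symbolic, in which case the answer to 1.2 is 2g). The function names are illustrative only.

```python
import math


def circle_area(radius):
    """Area of a circle: pi * r^2."""
    return math.pi * radius ** 2


def fall_distance(seconds, g=9.8):
    """Distance fallen from rest after `seconds`: (1/2) * g * t^2.

    g = 9.8 m/s^2 is an assumed value; air resistance is ignored.
    """
    return 0.5 * g * seconds ** 2


print(circle_area(3))    # Exercise 1.1: 9 * pi
print(fall_distance(2))  # Exercise 1.2: 2 * g
```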
Characteristics of supervised learning
It induces a rule from past results.
Even when we don't know the exact rule, the model gives an answer that looks "good".
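Inducing a rule from past results can be sketched with the falling-ball setting from Exercise 1. The example below pretends we do not know the physical constant and estimates it from a few (time, distance) pairs by least squares. The measurements are made up for illustration, the model form d = k * t^2 is an assumption, and `fit_coefficient` is a hypothetical helper, not part of any library.

```python
def fit_coefficient(samples):
    """Least-squares estimate of k in d = k * t**2 from (t, d) pairs.

    Minimizing sum((d - k * t**2)**2) over k gives
    k = sum(t**2 * d) / sum(t**4).
    """
    numerator = sum((t ** 2) * d for t, d in samples)
    denominator = sum(t ** 4 for t, _ in samples)
    return numerator / denominator


# Made-up, slightly noisy (seconds, metres) measurements: the "past results".
past_results = [(1.0, 5.1), (3.0, 43.8), (4.0, 78.9)]

k = fit_coefficient(past_results)
predicted = k * 2.0 ** 2  # answer for an unseen question: t = 2 seconds
print(k, predicted)
```

The prediction is close to, but not exactly, the value the physical rule gives: an answer that looks "good" without the exact rule being known.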
Reinforcement learning
The goal is for a computer to acquire the same kind of ability as in supervised learning.
The computer is given an "environment" rather than data
  Action / State / Reward
Instead of labeling each piece of data separately, we give a "reward" for the "state" reached after a series of "actions", and this reward plays the role of the correct answer.
Takahiro Kubo, 『Pythonで学ぶ強化学習 入門から実践まで』, Kodansha
Kaggle: Machine Learning Competitions
A competition platform for testing the performance of different machine learning models, mainly for supervised learning.
Organized by Kaggle, which is owned by Google.
Data scientists from around the world participate.
In this section, you learned ...
☑ What is machine learning?
  Supervised learning
  Unsupervised learning
  Reinforcement learning
☑ What is Kaggle?
In this section, you will learn ...
From participation to submission
☑ Participation in a competition
☑ How to use the Python environment in Kaggle
☑ Loading packages
☑ Loading datasets
☑ Feature engineering
☑ Training and prediction of machine learning algorithms
☑ Submission to the leaderboard
In this section, you will learn ...
How to boost your score
☑ Exploratory data analysis
☑ Adding hypothesis-based features
☑ Switching machine learning algorithms
☑ Hyperparameter tuning
☑ The importance of validation
☑ Ensembling
Exploratory data analysis
Gain insight into the data and the task
There is no free lunch
What I did in a past competition: https://www.kaggle.com/c/petfinder-adoption-prediction/discussion/88773
Ishihara et al., 『PythonではじめるKaggleスタートブック』, Kodansha
Exercise 4
Come up with at least three hypotheses that might lead to better predictions.
Tool
You can use packages such as pandas_profiling, matplotlib, and seaborn.
The importance of validation
Why do we need validation?
  The number of submissions is limited
  The risk of overfitting
Ishihara et al., 『PythonではじめるKaggleスタートブック』, Kodansha
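One common way to validate without spending a submission is to hold out part of the training data and score the model on it locally. The following is a minimal hold-out split in plain Python; the 80/20 ratio, the fixed seed, and the `train_valid_split` name are assumptions for illustration, not the tutorial's own code.

```python
import random


def train_valid_split(rows, valid_ratio=0.2, seed=0):
    """Shuffle rows and split them into (train, valid) lists."""
    shuffled = list(rows)                   # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split reproducible
    n_valid = int(len(shuffled) * valid_ratio)
    return shuffled[n_valid:], shuffled[:n_valid]


rows = list(range(10))  # stand-in for 10 labeled training rows
train, valid = train_valid_split(rows)
print(len(train), len(valid))
```

A model is then fitted on `train` only and scored on `valid`. A large gap between the training score and the validation score is a sign of overfitting.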
In this section, you learned ...
From participation to submission
☑ Participation in a competition
☑ How to use the Python environment in Kaggle
☑ Loading packages
☑ Loading datasets
☑ Feature engineering
☑ Training and prediction of machine learning algorithms
☑ Submission to the leaderboard
In this section, you learned ...
How to boost your score
☑ Exploratory data analysis
☑ Adding hypothesis-based features
☑ Switching machine learning algorithms
☑ Hyperparameter tuning
☑ The importance of validation
☑ Ensembling
Workflow in Kaggle competitions
1. Understand the competition and the task
2. Create benchmark
  Simple feature engineering
  Training and prediction
  Validation
3. Exploratory data analysis and hypothesis
4. Check validation and submission scores
5. Back to 3 (sometimes 1 or 2)
Recommended way for beginners to participate
1. Understand the competition and the task
  Read competition page
  Read EDA Notebooks
2. Create benchmark
  Utilize public Notebooks
3. Improve benchmark
  Utilize public Notebooks and Discussions
4. Do ensembling
Which competitions?
There are several perspectives
  Medal
  Type of dataset
  Schedule
  Code competitions
  Competition platform
https://speakerdeck.com/upura/how-to-choose-kaggle-competitions