Quick Start Guide of Kaggle: Machine Learning Competitions with Python

Shotaro Ishihara
September 26, 2020

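"Quick Start Guide of Kaggle: Machine Learning Competitions with Python" (Japanese title: Pythonで機械学習コンペティション「Kaggle」をはじめよう, "Getting started with the machine learning competition Kaggle in Python"), a tutorial at SciPy Japan 2020, held on October 30.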
https://github.com/upura/scipy-japan-2020-kaggle-tutorial

Transcript

  1. Quick Start Guide of Kaggle: Machine Learning Competitions with Python

    SciPy Japan 2020, 0900-1230, October 30
    Speaker: Shotaro Ishihara
  2. Abstract

    As you may know, Kaggle is a high-profile machine learning competition platform. On Kaggle, data scientists from all over the world use Python to build machine learning models. In this hands-on tutorial, you'll learn the basics of machine learning and Kaggle by running Notebook-style source code. The objective is to help participants learn how to compete and learn with Kaggle using Python.
  3. Table of Contents

    0900-0935 Introduction: What is machine learning & Kaggle?
    0940-1045 Practice 1: From participation to submission
    1100-1145 Practice 2: How to boost your score
    1150-1230 Conclusion: Wrap up & future resources
  4. Profile: Shotaro Ishihara

    Data scientist at Nikkei
    Kaggle "PetFinder.my Adoption Prediction" 1st place
    Tier: Kaggle Master
    Host of "Kaggle Days Tokyo" offline competition
    『PythonではじめるKaggleスタートブック』(講談社)
    INMA "30 Under 30 Awards", Grand Prize in Asia/Pacific
  5. In this section, you will learn ...

    ☑ What is machine learning?
    ☑ What is Kaggle?
  6. AI, Machine learning, and Deep learning

    The relationship among them:
    Artificial intelligence (AI)
      Machine learning
        Supervised learning
          Logistic regression
          Random Forest
          Deep learning
        Unsupervised learning
        Reinforcement learning
  7. What is artificial intelligence?

    There are two main approaches to AI research:
    1. Create a machine that has human intelligence itself
    2. Let a machine do a specific task that humans do with their intelligence
    Most current research takes the latter approach.
    https://www.ai-gakkai.or.jp/whatsai/AIwhats.html
  8. What is machine learning?

    A general term for technology that enables computers to acquire a human-like learning ability.
    Artificial intelligence ∋ Machine learning
    The recent "artificial intelligence" boom has been driven by the rapid growth of machine learning research.
    杉山将, 『イラストで学ぶ機械学習』, 講談社
  9. Supervised learning

    Given several (problem, answer) pairs, a computer can acquire the ability to answer questions that it has not been taught.
    石原ら, 『PythonではじめるKaggleスタートブック』, 講談社
  10. Exercise 1

    1.1 Calculate the area of a circle with a radius of 3.
    1.2 When a ball was dropped from a bridge, it reached the surface of the water 2 seconds later. Calculate the distance between the bridge and the surface of the water, using the gravitational acceleration g.
  11. Answer 1

    There are well-known theories and formulas.
    1.1 $A = \pi \times r^2$
    1.2 $y = \frac{1}{2} \times g \times t^2$
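As a quick numerical check of the two formulas in Answer 1, the sketch below evaluates them in Python. The function names and the value g = 9.8 m/s² are assumptions made for illustration; the exercise leaves g symbolic, in which case the distance is simply 2g.

```python
import math

def circle_area(radius):
    """Area of a circle: A = pi * r^2."""
    return math.pi * radius ** 2

def fall_distance(seconds, g=9.8):
    """Distance fallen from rest: y = (1/2) * g * t^2.

    g = 9.8 m/s^2 is an assumed value, not given in the exercise.
    """
    return 0.5 * g * seconds ** 2

area = circle_area(3)        # 9 * pi, about 28.27
distance = fall_distance(2)  # 2 * g, about 19.6 m when g = 9.8
print(f"area = {area:.2f}, distance = {distance:.1f} m")
```

Both answers come from a known rule applied to the inputs, which is the contrast the following exercises build on.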
  12. Exercise 2

    f(0) = 3
    f(1) = 5
    f(2) = 7
    f(3) = 9
    f(4) = 11
    f(5) = ?
  13. Answer 2

    f(x) = 2 × x + 3
    f(0) = 3
    f(1) = 5
    f(2) = 7
    f(3) = 9
    f(4) = 11
    f(5) = 13
  14. Exercise 3

    f(0.42) = 3.24
    f(1.223) = 5.23
    f(2.84) = 7.02
    f(3.43) = 9.24
    f(4.58) = 11.22
    f(5) = ?
  15. Characteristic of supervised learning

    It derives a rule from past results.
    Even if we don't know the exact rule, it gives us an answer that looks "good".
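One way to make this concrete is to fit a straight line to the Exercise 3 data and use it to estimate f(5). The sketch below uses ordinary least squares, which is only one simple choice of model and is not necessarily the answer given in the tutorial; the helper name `fit_line` is hypothetical.

```python
# Exercise 3 data: noisy observations of an unknown rule.
xs = [0.42, 1.223, 2.84, 3.43, 4.58]
ys = [3.24, 5.23, 7.02, 9.24, 11.22]

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(xs, ys)
prediction = slope * 5 + intercept  # estimate for the unseen input x = 5
print(f"f(x) ~ {slope:.2f} * x + {intercept:.2f}, f(5) ~ {prediction:.2f}")
```

The fitted rule is learned from the (problem, answer) pairs alone, so the estimate for f(5) is plausible rather than exact.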
  16. Examples of supervised learning

    Image recognition (Google Image Search)
    Voice recognition (Siri)
    Automatic spam classification (Gmail)
    Machine translation (DeepL)
  17. Unsupervised learning

    The goal is to gain useful knowledge from data.
    There is a lot of data in the world without labels.
  18. Reinforcement learning

    The goal is to get a computer to acquire the same ability as in supervised learning.
    The computer is given not data but an "environment".
    Action / State / Reward
    Instead of considering each piece of data separately, we give, as the correct answer, a "reward" for the "state" reached after a series of "actions".
    久保隆宏, 『Pythonで学ぶ強化学習 入門から実践まで』, 講談社
  19. Kaggle: Machine Learning Competitions

    A competition platform for testing the performance of different machine learning models, mainly for supervised learning.
    Organized by Google-owned Kaggle.
    Data scientists from around the world participate.
  20. What can we do on Kaggle?

    Use Python or R to create machine learning models.
    Stay tuned for the next Practice section!
  21. In this section, you learned ...

    ☑ What is machine learning?
      Supervised learning
      Unsupervised learning
      Reinforcement learning
    ☑ What is Kaggle?
  22. In this section, you will learn ...

    From participation to submission
    ☑ Participation in a competition
    ☑ How to use the Python environment in Kaggle
    ☑ Loading packages
    ☑ Loading datasets
    ☑ Feature engineering
    ☑ Training and prediction of machine learning algorithms
    ☑ Submission to the leaderboard
  23. In this section, you will learn ...

    How to boost your score
    ☑ Exploratory data analysis
    ☑ Adding hypothesis-based features
    ☑ Switching machine learning algorithms
    ☑ Hyperparameter tuning
    ☑ The importance of validation
    ☑ Ensembling
  24. Exploratory data analysis

    Gain insight into the data and the task.
    There is no free lunch.
    What I did in a past competition: https://www.kaggle.com/c/petfinder-adoption-prediction/discussion/88773
    石原ら, 『PythonではじめるKaggleスタートブック』, 講談社
  25. Exercise 4

    Let's come up with at least 3 hypotheses that might contribute to better predictions.
    Tools: You can use packages such as pandas_profiling, matplotlib, and seaborn.
  26. Switching machine learning algorithms

    In general, LightGBM tends to work well in tabular competitions.
    sklearn: https://scikit-learn.org/stable/supervised_learning.html
    LightGBM: https://lightgbm.readthedocs.io/en/latest/
    Neural Network: https://keras.io/, https://pytorch.org/
  27. The importance of validation

    Why do we need validation?
    The number of submissions is limited.
    There is a risk of overfitting.
    石原ら, 『PythonではじめるKaggleスタートブック』, 講談社
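A local validation score lets you compare ideas without spending a submission. The sketch below shows one common scheme, K-fold splitting, written in plain Python so it runs anywhere. It is an illustration under stated assumptions and not necessarily the scheme used in the tutorial notebooks; the function name `kfold_indices` is hypothetical, and the split is not shuffled.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, valid_idx) pairs.

    Each sample index appears in exactly one validation fold.
    """
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1  # spread the remainder over the first folds
    start = 0
    for size in fold_sizes:
        valid_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, valid_idx
        start += size

folds = list(kfold_indices(n_samples=10, n_splits=3))
for train_idx, valid_idx in folds:
    print("train:", train_idx, "valid:", valid_idx)
```

A model is trained on each `train_idx` and scored on the matching `valid_idx`; averaging the fold scores gives an estimate that is less likely to reward overfitting than tuning against the leaderboard. In practice a library implementation such as scikit-learn's `KFold`, usually with shuffling, is the more typical choice.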
  28. In this section, you learned ...

    From participation to submission
    ☑ Participation in a competition
    ☑ How to use the Python environment in Kaggle
    ☑ Loading packages
    ☑ Loading datasets
    ☑ Feature engineering
    ☑ Training and prediction of machine learning algorithms
    ☑ Submission to the leaderboard
  29. In this section, you learned ...

    How to boost your score
    ☑ Exploratory data analysis
    ☑ Adding hypothesis-based features
    ☑ Switching machine learning algorithms
    ☑ Hyperparameter tuning
    ☑ The importance of validation
    ☑ Ensembling
  30. In this section, you will learn ...

    ☑ Summary of this tutorial
    ☑ Future resources
  31. Workflow in Kaggle competitions

    1. Understand the competition and the task
    2. Create benchmark
       Simple feature engineering
       Training and prediction
       Validation
    3. Exploratory data analysis and hypothesis
    4. Check validation and submission scores
    5. Back to 3 (Sometimes 1 or 2)
  32. Recommended way for beginners to participate

    1. Understand the competition and the task
       Read competition page
       Read EDA Notebooks
    2. Create benchmark
       Utilize public Notebooks
    3. Improve benchmark
       Utilize public Notebooks and Discussions
    4. Do ensembling
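The final step, ensembling, can be as simple as averaging the predictions of several models. The sketch below is a minimal illustration of that idea, not the specific method taught in the tutorial; the function name `average_ensemble` and the prediction values are hypothetical.

```python
def average_ensemble(predictions, weights=None):
    """Blend per-model prediction lists into one list by (weighted) averaging."""
    n_models = len(predictions)
    if weights is None:
        weights = [1.0 / n_models] * n_models  # equal weight for every model
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, sample_preds)) / total
        for sample_preds in zip(*predictions)
    ]

# Hypothetical predicted probabilities from three models on four test rows.
model_a = [0.9, 0.2, 0.6, 0.4]
model_b = [0.8, 0.1, 0.4, 0.6]
model_c = [0.7, 0.3, 0.7, 0.2]

blended = average_ensemble([model_a, model_b, model_c])
labels = [int(p >= 0.5) for p in blended]  # threshold to get class labels
print(blended, labels)
```

Averaging tends to help most when the individual models make different kinds of errors; the optional `weights` argument lets a stronger model count for more, with the weights ideally chosen on validation data rather than on the leaderboard.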
  33. Which competitions?

    There are several perspectives:
    Medal
    Type of dataset
    Schedule
    Code competitions
    Competition platform
    https://speakerdeck.com/upura/how-to-choose-kaggle-competitions
  34. Future resources

    『PythonではじめるKaggleスタートブック』(講談社)
    Kaggleに登録したら次にやること 〜 これだけやれば十分闘える!Titanicの先へ行く入門 10 Kernel 〜
    Weekly Kaggle News
    『Kaggleで勝つデータ分析の技術』(技術評論社)
    『Approaching (Almost) Any Machine Learning Problem』
    kaggler-ja Slack
  35. In this section, you learned ...

    ☑ Summary of this tutorial
    ☑ Future resources