Quick Start Guide of Kaggle: Machine Learning Competitions with Python

Shotaro Ishihara
September 26, 2020

"Quick Start Guide of Kaggle: Machine Learning Competitions with Python" (Pythonで機械学習コンペティション「Kaggle」をはじめよう), a tutorial at SciPy Japan 2020, held on October 30.
https://github.com/upura/scipy-japan-2020-kaggle-tutorial

Transcript

  1. Quick Start Guide of Kaggle:
    Machine Learning Competitions
    with Python
    SciPy Japan 2020
    0900-1230, October 30
    Speaker: Shotaro Ishihara

  2. Abstract
    As you may know, Kaggle is a high-profile machine learning
    competition platform. In Kaggle, data scientists from all over
    the world are using Python to build machine learning
    models.
    In this hands-on tutorial, you'll learn the basics of machine
    learning and Kaggle by running the Notebook-style source
    code.
    The objective is to help participants learn how to compete
    and learn with Kaggle using Python.

  3. Table of Contents
    0900-0935 Introduction: What is machine learning &
    Kaggle?
    0940-1045 Practice 1: From participation to submission
    1100-1145 Practice 2: How to boost your score
    1150-1230 Conclusion: Wrap up & future resources

  4. Announcements
    Resources: https://github.com/upura/scipy-japan-2020-kaggle-tutorial
    Questions are welcome in both English and Japanese:
      Zoom chat during the tutorial
      GitHub Issues after the tutorial

  5. Profile: Shotaro Ishihara
    Data scientist at Nikkei
    Kaggle "PetFinder.my Adoption Prediction" 1st place
    Tier: Kaggle Master
    Host of "Kaggle Days Tokyo" offline competition
    Co-author of 『PythonではじめるKaggleスタートブック』 (Kodansha)
    INMA "30 Under 30 Awards", Grand Prize in Asia/Pacific

  6. Introduction
    What is machine learning &
    Kaggle?

  7. In this section, you will learn ...
    ☑ What is machine learning?
    ☑ What is Kaggle?

  8. What is machine learning?

  9. AI, Machine learning, and Deep
    learning
    The relationship among them is as follows:
    Artificial intelligence (AI)
    Machine learning
    Supervised learning
    Logistic regression
    Random Forest
    Deep learning
    Unsupervised learning
    Reinforcement learning

  10. What is artificial intelligence?
    There are two main approaches to AI research:
    1. Create a machine that has human intelligence itself
    2. Let a machine do a specific task that humans do with
    their intelligence
    Most current research takes the latter approach.
    https://www.ai-gakkai.or.jp/whatsai/AIwhats.html

  11. What is machine learning?
    A general term for technologies that enable computers to
    acquire human-like learning ability
    Artificial intelligence ⊃ Machine learning
    The recent "artificial intelligence" boom has been driven by
    the rapid growth of machine learning research
    Masashi Sugiyama, 『イラストで学ぶ機械学習』, Kodansha

  12. What can machine learning do?
    Supervised learning
    Unsupervised learning
    Reinforcement learning

  13. Supervised learning
    Given many (problem, answer) pairs, a model can acquire the
    ability to answer questions that it has not been taught.
    Ishihara et al., 『PythonではじめるKaggleスタートブック』, Kodansha

  14. Concept of supervised learning
    https://www.hubertwang.me/machinelearning/intro-to-tf-for-ai-ml-and-dl

  15. Exercise 1
    1.1 Calculate the area of a circle with a radius of 3.
    1.2 When a ball was dropped from a bridge, it reached the
    surface of the water 2 seconds later. Calculate the distance
    between the bridge and the surface of the water with
    gravitational acceleration g.

  16. Answer 1
    There are well-known theories and formulas.
    1.1 $A = \pi \times r^2$
    1.2 $y = \frac{1}{2} \times g \times t^2$
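
Plugging the numbers from Exercise 1 into these two formulas can be sketched in a few lines of Python. The slide leaves g symbolic, so the value g = 9.8 m/s^2 below is an assumption, as is ignoring air resistance.

```python
import math

# Exercise 1.1: area of a circle with radius 3
radius = 3.0
area = math.pi * radius ** 2

# Exercise 1.2: distance fallen from rest after t seconds, y = (1/2) * g * t^2
# Assumption: g = 9.8 m/s^2 (the slide keeps g symbolic) and no air resistance.
g = 9.8
t = 2.0
distance = 0.5 * g * t ** 2

print(f"area = {area:.2f}")            # area = 28.27
print(f"distance = {distance:.1f} m")  # distance = 19.6 m
```

Both answers follow from a known rule. The exercises that follow remove that rule and leave only examples.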

  17. Exercise 2
    f(0) = 3
    f(1) = 5
    f(2) = 7
    f(3) = 9
    f(4) = 11
    f(5) =?

  18. Answer 2
    f(x) = 2 × x + 3
    f(0) = 3
    f(1) = 5
    f(2) = 7
    f(3) = 9
    f(4) = 11
    f(5) = 13

  19. Exercise 3
    f(0.42) = 3.24
    f(1.223) = 5.23
    f(2.84) = 7.02
    f(3.43) = 9.24
    f(4.58) = 11.22
    f(5) =?

  20. Answer 3
    f(5) ≈ 12
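
One way to arrive at an answer close to 12 is to fit a straight line to the five noisy points of Exercise 3. The sketch below uses ordinary least squares written out by hand. The choice of a linear model is an assumption carried over from Exercise 2, and `fit_line` is an illustrative helper, not something from the tutorial repository.

```python
# Observations from Exercise 3: noisy samples of an unknown function f
xs = [0.42, 1.223, 2.84, 3.43, 4.58]
ys = [3.24, 5.23, 7.02, 9.24, 11.22]

def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(xs, ys)
prediction = slope * 5 + intercept
print(f"f(x) ~ {slope:.2f} * x + {intercept:.2f}")
print(f"f(5) ~ {prediction:.2f}")
```

The fitted slope and intercept do not match any "true" rule exactly, but the prediction for x = 5 is still usable. That is supervised learning in miniature.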

  21. Characteristic of supervised
    learning
    It derives a rule from past results.
    Even if we don't know the exact rule, it gives us an answer
    that looks "good".

  22. Examples of supervised learning
    Image Recognition (Google Image Search)
    Voice recognition (Siri)
    Automatic spam classification (Gmail)
    Machine Translation (DeepL)

  23. Unsupervised learning
    To gain useful knowledge from data.
    There's a lot of data in the world without labels.

  24. Examples of unsupervised
    learning
    Clustering
    Anomaly detection
    https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
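
As a rough illustration of finding structure without labels, here is a minimal k-means sketch on one-dimensional points. The data, the starting centers, and the helper name `kmeans_1d` are all made up for this example. Real use would normally rely on a library implementation.

```python
def kmeans_1d(points, centers, n_iter=10):
    """Plain k-means on 1-D points: alternate assignment and mean update."""
    centers = list(centers)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its points
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Toy, made-up data: two loose groups and no labels.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
centers, clusters = kmeans_1d(points, centers=[0.0, 10.0])
print(centers)
print(clusters)
```

No answer column is given anywhere. The two groups emerge only from the distances between points.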

  25. Reinforcement learning
    The goal is for a computer to acquire the same kind of ability
    as in supervised learning.
    The computer is given an "environment" rather than data
    Action / State / Reward
    Instead of labeling each piece of data separately, the
    "reward" for the "state" reached after a series of "actions"
    serves as the correct answer.
    Takahiro Kubo, 『Pythonで学ぶ強化学習 入門から実践まで』, Kodansha
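
The action / state / reward loop can be sketched with tabular Q-learning on a tiny corridor. Everything here is a hypothetical toy: the five-cell environment, the reward of 1 at the right end, and the hyperparameter values are assumptions chosen only to keep the example short.

```python
import random

# Hypothetical toy environment: a corridor of 5 cells (states 0..4).
# The agent starts at cell 0; reaching cell 4 ends the episode with reward 1.
N_STATES = 5
ACTIONS = [-1, +1]  # move left or right

def step(state, action):
    """Return (next_state, reward, done) for one move in the corridor."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def train(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with an epsilon-greedy policy."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Explore with probability epsilon (or on ties), otherwise act greedily.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = step(state, ACTIONS[a])
            # Move Q(s, a) toward reward + discounted best future value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q

q = train()
policy = ["right" if right > left else "left" for left, right in q[:-1]]
print(policy)
```

The agent is never told which move is correct in a given cell. It only sees the reward at the end and propagates that signal backwards through the states.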

  26. Examples of reinforcement
    learning
    Computer games (AlphaGo)
    Automatic control of robots
    Advertising distribution

  27. What is Kaggle?

  28. Kaggle: Machine Learning
    Competitions
    A competition platform for testing the performance of
    different machine learning models, mainly for supervised
    learning.
    Organized by Google-owned Kaggle.
    Data scientists from around the world participate.

  29. Overview of Kaggle
    Ishihara et al., 『PythonではじめるKaggleスタートブック』, Kodansha

  30. DeNA's Kaggle ranking system
    https://dena.ai/kaggle/

  31. What can we do in Kaggle?
    Use Python or R to create machine learning models
    Stay tuned for the next Practice section!

  32. In this section, you learned ...
    ☑ What is machine learning?
    Supervised learning
    Unsupervised learning
    Reinforcement learning
    ☑ What is Kaggle?

  33. Practice
    Python tutorial with Kaggle
    Notebooks

  34. In this section, you will learn ...
    From participation to submission
    ☑ Participation in a competition
    ☑ How to use Python environment in Kaggle
    ☑ Loading packages
    ☑ Loading datasets
    ☑ Feature engineering
    ☑ Training and prediction of machine learning algorithms
    ☑ Submission to the leaderboard

  35. In this section, you will learn ...
    How to boost your score
    ☑ Exploratory data analysis
    ☑ Adding hypothesis-based features
    ☑ Switching machine learning algorithms
    ☑ Hyperparameter tuning
    ☑ The importance of validation
    ☑ Ensembling

  36. Titanic: Machine Learning from
    Disaster
    Competition page: https://www.kaggle.com/c/titanic
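
For the Titanic competition, the path from loaded rows to a submission file can be sketched with the standard library alone. The rows below are invented stand-ins for the competition's test.csv, and the "women survive" rule is only a common first baseline, not the model used in the tutorial Notebooks. The submission file needs the two columns `PassengerId` and `Survived`.

```python
import csv
import io

# Hypothetical rows shaped like the competition's test.csv (values invented).
test_rows = [
    {"PassengerId": "892", "Sex": "male"},
    {"PassengerId": "893", "Sex": "female"},
    {"PassengerId": "894", "Sex": "male"},
]

def predict(row):
    """Rule-based baseline: predict survival (1) for women, 0 otherwise."""
    return 1 if row["Sex"] == "female" else 0

# Write the two columns the Titanic leaderboard expects.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["PassengerId", "Survived"])
for row in test_rows:
    writer.writerow([row["PassengerId"], predict(row)])

submission = buffer.getvalue()
print(submission)
```

In a Kaggle Notebook the same steps are usually done with pandas and a trained model, with the result written to a CSV file instead of an in-memory buffer.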

  37. Exploratory data analysis
    Gain insight into the data and the task
    There is no free lunch
    What I did in a past competition:
    https://www.kaggle.com/c/petfinder-adoption-prediction/discussion/88773
    Ishihara et al., 『PythonではじめるKaggleスタートブック』, Kodansha

  38. Exercise 4
    Come up with at least three hypotheses that might lead to
    better predictions.
    Tool
    You can use packages such as pandas_profiling, matplotlib,
    and seaborn.
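
A typical first step toward a hypothesis is to compare the target across groups. The sketch below computes a survival rate per category on invented rows with Titanic-style column names. The numbers are not real data, and `survival_rate_by` is an illustrative helper. With pandas, a `groupby` followed by a mean would do the same job.

```python
from collections import defaultdict

# Invented toy rows with Titanic-style columns; not the real data.
rows = [
    {"Sex": "female", "Pclass": 1, "Survived": 1},
    {"Sex": "female", "Pclass": 3, "Survived": 1},
    {"Sex": "female", "Pclass": 3, "Survived": 0},
    {"Sex": "male", "Pclass": 1, "Survived": 1},
    {"Sex": "male", "Pclass": 3, "Survived": 0},
    {"Sex": "male", "Pclass": 3, "Survived": 0},
]

def survival_rate_by(rows, column):
    """Mean of Survived for each distinct value of `column`."""
    totals = defaultdict(lambda: [0, 0])  # value -> [survivors, count]
    for row in rows:
        totals[row[column]][0] += row["Survived"]
        totals[row[column]][1] += 1
    return {value: s / n for value, (s, n) in totals.items()}

print(survival_rate_by(rows, "Sex"))
print(survival_rate_by(rows, "Pclass"))
```

A large gap between groups is a hint that the column, or a feature derived from it, may help the model.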

  39. Adding hypothesis-based
    features
    Be careful about reproducibility
    Then, add hypothesis-based features
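
Both points can be sketched together: fix a seed so that random steps repeat exactly, then add a feature that encodes a hypothesis. The hypothesis used here, that family size affects survival, is only an example. `SibSp` and `Parch` are columns of the Titanic data, while `FamilySize`, `IsAlone`, and the helper names are introduced for illustration.

```python
import random

SEED = 42  # arbitrary value; fixing it makes random steps repeatable

def add_family_size(row):
    """Hypothesis: family size matters. SibSp + Parch + 1 counts the passenger too."""
    new_row = dict(row)
    new_row["FamilySize"] = row["SibSp"] + row["Parch"] + 1
    new_row["IsAlone"] = int(new_row["FamilySize"] == 1)
    return new_row

def shuffled_ids(ids, seed=SEED):
    """Shuffle with a local, seeded generator so every run gives the same order."""
    rng = random.Random(seed)
    ids = list(ids)
    rng.shuffle(ids)
    return ids

row = add_family_size({"PassengerId": 1, "SibSp": 1, "Parch": 0})
print(row["FamilySize"], row["IsAlone"])  # 2 0
assert shuffled_ids(range(10)) == shuffled_ids(range(10))
```

Without a fixed seed, a score change could come from randomness rather than from the new feature, which makes hypotheses hard to evaluate.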

  40. Switching machine learning
    algorithms
    In general, LightGBM tends to work well in competitions with
    tabular data.
    sklearn: https://scikit-learn.org/stable/supervised_learning.html
    LightGBM: https://lightgbm.readthedocs.io/en/latest/
    Neural Network: https://keras.io/, https://pytorch.org/

  41. Decision Trees

  42. Gradient Boosting
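
The idea behind gradient boosting can be sketched without any library: start from the mean, then repeatedly fit a very small decision tree (a one-split "stump") to the current residuals and add a scaled copy of it to the model. This is a simplified illustration for squared error on one feature, not how LightGBM is implemented. The function names and the values of `n_rounds` and `learning_rate` are assumptions.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on x that minimizes squared error of residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def predict(model, x):
    """Sum the base value and every stump's scaled contribution."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(
        left if x <= threshold else right for threshold, left, right in stumps)

def fit_boosting(xs, ys, n_rounds=50, learning_rate=0.3):
    """Gradient boosting for squared error: each stump fits current residuals."""
    model = (sum(ys) / len(ys), learning_rate, [])
    for _ in range(n_rounds):
        residuals = [y - predict(model, x) for x, y in zip(xs, ys)]
        model[2].append(fit_stump(xs, residuals))
    return model

xs = [0, 1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11, 13]  # f(x) = 2x + 3 from Exercise 2

def mse(model):
    """Mean squared error of a model on the training points."""
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

baseline = (sum(ys) / len(ys), 0.3, [])  # mean-only model, no stumps
model = fit_boosting(xs, ys)
print(mse(baseline), mse(model))
```

Each stump on its own is a weak model. The training error drops because every new stump corrects what the previous ones left over.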

  43. Hyperparameter tuning
    1. Hand tuning
    2. Use Optuna: https://optuna.org/
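
Hand tuning amounts to trying a few settings and keeping the one with the best validation score. The sketch below automates that as a small grid search. `validation_score` is a hypothetical stand-in for "train a model and evaluate it on validation data", and its formula is invented so that the example runs on its own. `learning_rate` and `num_leaves` are LightGBM parameter names, used here only as labels.

```python
import itertools

def validation_score(learning_rate, num_leaves):
    """Hypothetical stand-in for 'train a model and return its validation score'."""
    return -((learning_rate - 0.1) ** 2) - ((num_leaves - 31) / 100) ** 2

grid = {
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
    "num_leaves": [15, 31, 63],
}

best_params, best_score = None, float("-inf")
for values in itertools.product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    score = validation_score(**params)
    if score > best_score:
        best_params, best_score = params, score

print(best_params)
```

A grid grows quickly as parameters are added. Tools such as Optuna instead choose the next setting to try based on earlier results.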

  44. The importance of validation
    Why do we need validation?
    The number of submissions is limited
    The risk of overfitting
    Ishihara et al., 『PythonではじめるKaggleスタートブック』, Kodansha

  45. Hold out
    Ishihara et al., 『PythonではじめるKaggleスタートブック』, Kodansha
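
A hold-out split sets aside part of the training data and uses it only for scoring. A minimal sketch, assuming a 20% validation share and a fixed seed (both arbitrary choices), with `hold_out_split` as an illustrative helper:

```python
import random

def hold_out_split(n_samples, valid_ratio=0.2, seed=0):
    """Shuffle indices once and reserve the first n_valid of them for validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_valid = int(n_samples * valid_ratio)
    return indices[n_valid:], indices[:n_valid]

train_idx, valid_idx = hold_out_split(10)
print(len(train_idx), len(valid_idx))  # 8 2
```

The model is trained on `train_idx` only, so the score on `valid_idx` estimates performance on unseen data without spending a submission.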

  46. Cross validation
    Ishihara et al., 『PythonではじめるKaggleスタートブック』, Kodansha
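
Cross validation repeats the hold-out idea so that every sample is used for validation exactly once. The sketch below generates the index splits for k folds. Shuffling and stratification are left out to keep it short, and `k_fold_indices` is an illustrative helper rather than a library function.

```python
def k_fold_indices(n_samples, n_splits=5):
    """Yield (train_indices, valid_indices) so each sample is validated exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        valid = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, valid
        start += size

folds = list(k_fold_indices(10, n_splits=5))
for train, valid in folds:
    print(valid)
```

Training one model per fold and averaging the validation scores gives a more stable estimate than a single hold-out split, at the cost of more computation.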

  47. Ensembling
    Diversity boosts the score
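
The simplest form of ensembling is to average the predictions of several models. In the sketch below, the three probability lists are invented outputs of hypothetical models, and the 0.5 threshold is an assumption.

```python
def average_ensemble(prediction_lists, threshold=0.5):
    """Average predicted probabilities across models, then threshold to 0/1."""
    n_models = len(prediction_lists)
    averaged = [sum(preds) / n_models for preds in zip(*prediction_lists)]
    return [int(p >= threshold) for p in averaged]

# Invented probabilities from three hypothetical models for four passengers.
model_a = [0.9, 0.4, 0.2, 0.6]
model_b = [0.8, 0.6, 0.1, 0.4]
model_c = [0.7, 0.7, 0.3, 0.3]

print(average_ensemble([model_a, model_b, model_c]))  # [1, 1, 0, 0]
```

For the second and fourth passengers the models disagree, and the average settles the question. Averaging helps most when the models make different kinds of mistakes.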

  48. In this section, you learned ...
    From participation to submission
    ☑ Participation in a competition
    ☑ How to use Python environment in Kaggle
    ☑ Loading packages
    ☑ Loading datasets
    ☑ Feature engineering
    ☑ Training and prediction of machine learning algorithms
    ☑ Submission to the leaderboard

  49. In this section, you learned ...
    How to boost your score
    ☑ Exploratory data analysis
    ☑ Adding hypothesis-based features
    ☑ Switching machine learning algorithms
    ☑ Hyperparameter tuning
    ☑ The importance of validation
    ☑ Ensembling

  50. Conclusion
    Wrap up & future resources

  51. In this section, you will learn ...
    ☑ Summary of this tutorial
    ☑ Future resources

  52. Workflow in Kaggle
    competitions
    1. Understand the competition and the task
    2. Create a benchmark
       Simple feature engineering
       Training and prediction
       Validation
    3. Exploratory data analysis and hypothesis
    4. Check validation and submission scores
    5. Back to 3 (sometimes 1 or 2)

  53. Recommended way for
    beginners to participate
    1. Understand the competition and the task
       Read the competition page
       Read EDA Notebooks
    2. Create a benchmark
       Utilize public Notebooks
    3. Improve the benchmark
       Utilize public Notebooks and Discussions
    4. Do ensembling

  54. Which competitions?
    There are several perspectives
    Medal
    Type of dataset
    Schedule
    Code competitions
    Competition platform
    https://speakerdeck.com/upura/how-to-choose-kaggle-competitions

  55. Future resources
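    『PythonではじめるKaggleスタートブック』 (Kodansha)
    "Kaggleに登録したら次にやること 〜 これだけやれば十分闘える!Titanicの先へ行く入門 10 Kernel 〜"
    (What to do next after registering on Kaggle: 10 introductory
    Kernels that take you beyond Titanic)
    Weekly Kaggle News
    『Kaggleで勝つデータ分析の技術』 (Gijutsu-Hyohron)
    『Approaching (Almost) Any Machine Learning Problem』
    kaggler-ja Slack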

  56. In this section, you learned ...
    ☑ Summary of this tutorial
    ☑ Future resources