
Quick Start Guide of Kaggle: Machine Learning Competitions with Python

u++
September 26, 2020

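Slides for the tutorial “Quick Start Guide of Kaggle: Machine Learning Competitions with Python” (Japanese title: Pythonで機械学習コンペティション「Kaggle」をはじめよう, "Getting started with the machine learning competition Kaggle in Python"), given at SciPy Japan 2020 on October 30.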
https://github.com/upura/scipy-japan-2020-kaggle-tutorial


Transcript

  1. Quick Start Guide of Kaggle: Machine Learning Competitions with Python

    SciPy Japan 2020, 0900-1230, October 30. Speaker: Shotaro Ishihara
  2. Abstract

    Kaggle is a high-profile machine learning competition platform where data scientists from all over the world use Python to build machine learning models. In this hands-on tutorial, you will learn the basics of machine learning and Kaggle by running Notebook-style source code. The objective is to help participants learn how to compete on, and learn from, Kaggle using Python.
  3. Table of Contents

    0900-0935 Introduction: What is machine learning & Kaggle?
    0940-1045 Practice 1: From participation to submission
    1100-1145 Practice 2: How to boost your score
    1150-1230 Conclusion: Wrap up & future resources
  4. Announcements

    Resources: https://github.com/upura/scipy-japan-2020-kaggle-tutorial
    Questions are welcome in both English and Japanese: use Zoom chat during the tutorial and GitHub Issues after the tutorial.
  5. Profile: Shotaro Ishihara

    Data scientist at Nikkei
    Kaggle "PetFinder.my Adoption Prediction" 1st place
    Tier: Kaggle Master
    Host of the "Kaggle Days Tokyo" offline competition
    『PythonではじめるKaggleスタートブック』 ("Kaggle Start Book with Python", Kodansha)
    INMA "30 Under 30 Awards", Grand Prize in Asia/Pacific
  6. Introduction: What is machine learning & Kaggle?

  7. In this section, you will learn ...

    ☑ What is machine learning?
    ☑ What is Kaggle?
  8. What is machine learning?

  9. AI, Machine learning, and Deep learning

    The relationship among them:
    Artificial intelligence (AI)
      Machine learning
        Supervised learning
          Logistic regression
          Random Forest
          Deep learning
        Unsupervised learning
        Reinforcement learning
  10. What is Artificial intelligence?

    There are two main approaches to AI research:
    1. Create a machine that itself has human intelligence
    2. Have a machine perform a specific task that humans do using their intelligence
    Most current research takes the latter approach.
    https://www.ai-gakkai.or.jp/whatsai/AIwhats.html
  11. What is machine learning?

    A general term for technologies that enable computers to acquire a human-like ability to learn
    Artificial intelligence ∋ Machine learning
    The recent "artificial intelligence" boom has been driven by the rapid growth of machine learning research
    Masashi Sugiyama, 『イラストで学ぶ機械学習』, Kodansha
  12. What can machine learning do?

    Supervised learning
    Unsupervised learning
    Reinforcement learning
  13. Supervised learning

    Given a number of (problem, answer) pairs, a model acquires the ability to answer questions that it has not been taught.
    Ishihara et al., 『PythonではじめるKaggleスタートブック』, Kodansha
  14. Concept of supervised learning

    https://www.hubertwang.me/machinelearning/intro-to-tf-for-ai-ml-and-dl

  15. Exercise 1

    1.1 Calculate the area of a circle with a radius of 3.
    1.2 A ball dropped from a bridge reaches the surface of the water 2 seconds later. Calculate the distance between the bridge and the surface of the water, given gravitational acceleration g.
  16. Answer 1

    There are well-known theories / formulas.
    1.1 $A = \pi \times r^2 = 9\pi$
    1.2 $y = \frac{1}{2} \times g \times t^2 = 2g$
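
Both answers are direct applications of a known formula, so they can be written as ordinary functions. The sketch below is only an illustration: the function names are made up, and g = 9.8 m/s² is an assumed numeric value (the slide leaves g symbolic).

```python
import math


def circle_area(radius: float) -> float:
    """Area of a circle: A = pi * r^2."""
    return math.pi * radius ** 2


def free_fall_distance(seconds: float, g: float = 9.8) -> float:
    """Distance fallen from rest: y = (1/2) * g * t^2 (air resistance ignored)."""
    return 0.5 * g * seconds ** 2


area = circle_area(3)             # equals 9 * pi
distance = free_fall_distance(2)  # equals 2 * g
print(f"area = {area:.2f}, distance = {distance:.1f} m")
```

No data is needed here because the rule is known in advance; the next exercises look at the case where the rule has to be inferred from examples.
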
  17. Exercise 2

    f(0) = 3, f(1) = 5, f(2) = 7, f(3) = 9, f(4) = 11, f(5) = ?
  18. Answer 2

    f(x) = 2 × x + 3
    f(0) = 3, f(1) = 5, f(2) = 7, f(3) = 9, f(4) = 11, f(5) = 13
  19. Exercise 3

    f(0.42) = 3.24, f(1.223) = 5.23, f(2.84) = 7.02, f(3.43) = 9.24, f(4.58) = 11.22, f(5) = ?
  20. Answer 3

    f(5) ≈ 12
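
Exercises 2 and 3 can be reproduced by fitting a straight line to the given points. This is a minimal sketch using NumPy's least-squares polynomial fit; choosing a straight line as the model is an assumption suggested by Answer 2, and the helper name `fit_and_predict` is made up for illustration.

```python
import numpy as np

# Exercise 2: exact observations of f(x) = 2x + 3
x_exact = np.array([0, 1, 2, 3, 4], dtype=float)
y_exact = np.array([3, 5, 7, 9, 11], dtype=float)

# Exercise 3: observations that only roughly follow a straight line
x_noisy = np.array([0.42, 1.223, 2.84, 3.43, 4.58])
y_noisy = np.array([3.24, 5.23, 7.02, 9.24, 11.22])


def fit_and_predict(x, y, x_new):
    """Fit a straight line by least squares and evaluate it at x_new."""
    slope, intercept = np.polyfit(x, y, deg=1)  # highest degree first
    return slope * x_new + intercept


pred_exact = fit_and_predict(x_exact, y_exact, 5.0)
pred_noisy = fit_and_predict(x_noisy, y_noisy, 5.0)
print(round(pred_exact, 2), round(pred_noisy, 2))
```

The first prediction recovers 13 up to floating-point error, and the second lands close to 12, which matches the approximate answer on the slide.
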

  21. Characteristic of supervised learning

    It derives a rule from past results.
    Even when we don't know the exact rule, it gives us an answer that looks "good".
  22. Examples of supervised learning

    Image recognition (Google Image Search)
    Voice recognition (Siri)
    Automatic spam classification (Gmail)
    Machine translation (DeepL)
  23. Unsupervised learning

    The aim is to gain useful knowledge from data.
    There is a lot of data in the world without labels.
  24. Examples of unsupervised learning

    Clustering
    Anomaly detection
    https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
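
As a small illustration of learning without labels, the sketch below runs k-means (the algorithm visualized at the link above) with scikit-learn. The two point clouds are synthetic data generated just for this example, and the choice of two clusters is an assumption that happens to match how the data was made.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two synthetic, unlabeled blobs (made-up data for illustration)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b])

# k-means receives only the coordinates, never which blob a point came from
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

# Points from the same blob should end up with the same cluster id
print(labels[:5], labels[-5:])
```

The cluster ids themselves are arbitrary; only the grouping carries meaning.
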

  25. Reinforcement learning

    The goal is for a computer to acquire the same kind of ability as in supervised learning.
    The computer is given not data but an "environment": Action / State / Reward
    Instead of labeling each piece of data separately, the "reward" for the "state" reached after a series of "actions" serves as the correct answer.
    Takahiro Kubo, 『Pythonで学ぶ強化学習 入門から実践まで』, Kodansha
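
To make Action / State / Reward concrete, here is a minimal sketch of tabular Q-learning on a toy environment. Everything in it is an assumption for illustration and does not come from the tutorial: a five-cell corridor, a reward of 1 for reaching the rightmost cell, and purely random exploration.

```python
import numpy as np

N_STATES = 5          # positions 0..4 on a line; position 4 is the goal
LEFT, RIGHT = 0, 1


def step(state, action):
    """Environment: move left or right; reward 1 only on reaching the goal."""
    next_state = max(state - 1, 0) if action == LEFT else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


rng = np.random.default_rng(0)
q = np.zeros((N_STATES, 2))   # estimated value of each (state, action) pair
alpha, gamma = 0.5, 0.9       # learning rate and discount factor

for _ in range(300):          # episodes
    state, done = 0, False
    while not done:
        action = int(rng.integers(2))   # explore with random actions
        next_state, reward, done = step(state, action)
        # The reward arrives only at the end; discounting spreads it backwards
        target = reward + gamma * q[next_state].max() * (not done)
        q[state, action] += alpha * (target - q[state, action])
        state = next_state

policy = q[:-1].argmax(axis=1)   # greedy action in each non-goal state
print(policy)
```

No state is ever labeled with a correct action, yet the learned greedy policy moves right everywhere, because the single reward at the goal propagates back through the value table.
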
  26. Examples of reinforcement learning

    Computer games (AlphaGo)
    Automatic control of robots
    Advertising distribution
  27. What is Kaggle?

  28. Kaggle: Machine Learning Competitions

    A competition platform for testing the performance of different machine learning models, mainly for supervised learning.
    Operated by Kaggle, a Google-owned company.
    Data scientists from around the world participate.
  29. Overview of Kaggle

    Ishihara et al., 『PythonではじめるKaggleスタートブック』, Kodansha
  30. DeNA's Kaggle ranking system

    https://dena.ai/kaggle/
  31. What can we do in Kaggle?

    Use Python or R to create machine learning models
    Stay tuned for the next Practice section!
  32. In this section, you learned ...

    ☑ What is machine learning?
      Supervised learning
      Unsupervised learning
      Reinforcement learning
    ☑ What is Kaggle?
  33. Practice: Python tutorial with Kaggle Notebooks

  34. In this section, you will learn ...

    From participation to submission
    ☑ Participation in a competition
    ☑ How to use the Python environment in Kaggle
    ☑ Loading packages
    ☑ Loading datasets
    ☑ Feature engineering
    ☑ Training and prediction with machine learning algorithms
    ☑ Submission to the leaderboard
  35. In this section, you will learn ...

    How to boost your score
    ☑ Exploratory data analysis
    ☑ Adding hypothesis-based features
    ☑ Switching machine learning algorithms
    ☑ Hyperparameter tuning
    ☑ The importance of validation
    ☑ Ensembling
  36. Titanic: Machine Learning from Disaster

    Competition page: https://www.kaggle.com/c/titanic
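
For the Titanic competition, the path from loading data to producing a submission can be sketched as follows. This is not the tutorial's Notebook: the rows are a tiny made-up stand-in for the competition files so the snippet runs anywhere, and the helper `make_features` and the choice of logistic regression are assumptions for illustration. The column names follow the competition's data format.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the competition's train.csv / test.csv (made-up rows).
# On Kaggle you would load the real files with pd.read_csv instead.
train = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Survived":    [0, 1, 1, 1, 0, 0],
    "Pclass":      [3, 1, 3, 1, 3, 3],
    "Sex":         ["male", "female", "female", "female", "male", "male"],
    "Age":         [22.0, 38.0, 26.0, 35.0, 35.0, None],
})
test = pd.DataFrame({
    "PassengerId": [892, 893],
    "Pclass":      [3, 3],
    "Sex":         ["male", "female"],
    "Age":         [34.5, None],
})


def make_features(df: pd.DataFrame, age_fill: float) -> pd.DataFrame:
    """Simple feature engineering: encode Sex as 0/1 and fill missing Age."""
    out = pd.DataFrame({"Pclass": df["Pclass"]})
    out["Sex"] = df["Sex"].map({"male": 0, "female": 1})
    out["Age"] = df["Age"].fillna(age_fill)
    return out


age_median = train["Age"].median()   # computed on the training data only
X_train = make_features(train, age_median)
y_train = train["Survived"]
X_test = make_features(test, age_median)

# Training and prediction
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# The submission has two columns: PassengerId and Survived
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": model.predict(X_test),
})
csv_text = submission.to_csv(index=False)  # pass a file name to write a file
print(csv_text)
```

With the real data, the resulting CSV file is what gets uploaded to the leaderboard.
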

  37. Exploratory data analysis

    Gain insight into the data and the task
    There is no free lunch
    What I did in a past competition: https://www.kaggle.com/c/petfinder-adoption-prediction/discussion/88773
    Ishihara et al., 『PythonではじめるKaggleスタートブック』, Kodansha
  38. Exercise 4

    Come up with at least 3 hypotheses that might contribute to better predictions.
    Tool: You can use packages such as pandas_profiling, matplotlib, and seaborn.
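
One way to look for hypotheses is to compare the target across groups. The sketch below uses plain pandas rather than the packages named above, and the rows are made up with Titanic-style column names, so the numbers it prints say nothing about the real data.

```python
import pandas as pd

# Made-up rows with Titanic-style column names (an assumption for illustration)
train = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1],
    "Pclass":   [3, 1, 3, 1, 3, 3, 1, 2],
    "Sex":      ["male", "female", "female", "female",
                 "male", "male", "male", "female"],
    "Age":      [22.0, 38.0, 26.0, 35.0, 35.0, None, 54.0, 14.0],
})

print(train.describe())        # summary statistics of numeric columns
print(train.isnull().sum())    # number of missing values per column

# Hypothesis check: does the survival rate differ by Sex or by Pclass?
rate_by_sex = train.groupby("Sex")["Survived"].mean()
rate_by_class = train.groupby("Pclass")["Survived"].mean()
print(rate_by_sex)
print(rate_by_class)
```

A clear gap between groups is a hint that the column, or a feature derived from it, may help the model.
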
  39. Adding hypothesis-based features

    First, take care of reproducibility
    Then, add hypothesis-based features
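
A minimal sketch of both steps, under stated assumptions: `seed_everything` is a hypothetical helper that fixes the random seeds used here, and "family size" is just one example hypothesis built from the Titanic-style columns SibSp and Parch. The rows are made up.

```python
import random

import numpy as np
import pandas as pd


def seed_everything(seed: int = 42) -> None:
    """Fix the random seeds used in this sketch so reruns give the same result."""
    random.seed(seed)
    np.random.seed(seed)


seed_everything(42)

# Made-up rows using Titanic-style columns:
# SibSp = siblings/spouses aboard, Parch = parents/children aboard
df = pd.DataFrame({"SibSp": [1, 0, 0, 3], "Parch": [0, 0, 2, 1]})

# Hypothesis: family size is related to survival, so expose it as a feature
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1   # +1 counts the passenger
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)
print(df)
```

Models with their own randomness usually also take a seed argument (for example `random_state` in scikit-learn), which needs to be fixed as well.
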
  40. Switching machine learning algorithms

    In general, LightGBM tends to work well in tabular competitions.
    sklearn: https://scikit-learn.org/stable/supervised_learning.html
    LightGBM: https://lightgbm.readthedocs.io/en/latest/
    Neural Network: https://keras.io/, https://pytorch.org/
  41. Decision Trees

  42. Gradient Boosting
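
The idea behind gradient boosting with a squared-error loss can be sketched in a few lines: each new decision tree is fit to the residuals left by the trees before it. This is a simplified illustration on synthetic data, not how LightGBM is implemented, and the learning rate, tree depth, and number of rounds are arbitrary choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)   # synthetic target

learning_rate = 0.1
prediction = np.full_like(y, y.mean())   # start from a constant model
trees = []
errors = []

for _ in range(100):
    residual = y - prediction                       # what is still unexplained
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residual)                           # fit a small tree to it
    prediction += learning_rate * tree.predict(X)   # take a small step
    trees.append(tree)
    errors.append(float(np.mean((y - prediction) ** 2)))

print(f"MSE after 1 tree: {errors[0]:.4f}, after 100 trees: {errors[-1]:.4f}")
```

The training error shrinks as trees are added, which is also why validation matters: the error on unseen data does not have to follow.
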

  43. Hyperparameter tuning

    1. Hand tuning
    2. Use Optuna: https://optuna.org/
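
Hand tuning amounts to trying a few candidate values and comparing validation scores. The sketch below does this for one hyperparameter with scikit-learn; the data is synthetic and the candidate values are arbitrary. A tool such as Optuna automates the choice of which values to try next, but its API is not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data (an assumption; replace with competition features)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

results = {}
for max_depth in [2, 4, 8]:   # candidate values chosen by hand
    model = RandomForestClassifier(
        n_estimators=50, max_depth=max_depth, random_state=0
    )
    # Mean accuracy over 5 cross-validation folds
    results[max_depth] = cross_val_score(model, X, y, cv=5).mean()

best_depth = max(results, key=results.get)
print(results, "best max_depth:", best_depth)
```
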
  44. The importance of validation

    Why do we need validation?
    The number of submissions is limited
    There is a risk of overfitting
    Ishihara et al., 『PythonではじめるKaggleスタートブック』, Kodansha
  45. Hold out

    Ishihara et al., 『PythonではじめるKaggleスタートブック』, Kodansha
  46. Cross validation

    Ishihara et al., 『PythonではじめるKaggleスタートブック』, Kodansha
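
The two validation schemes can be compared side by side with scikit-learn. This is a minimal sketch on synthetic data; the model, the 80/20 split, and the 5 folds are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Hold-out: set aside one fixed validation split
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)
holdout_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
holdout_score = holdout_model.score(X_va, y_va)

# Cross validation: every row is used for validation exactly once
fold_scores = []
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, valid_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[valid_idx], y[valid_idx]))

cv_score = sum(fold_scores) / len(fold_scores)
print(f"hold-out: {holdout_score:.3f}, 5-fold CV: {cv_score:.3f}")
```

Hold-out is cheaper, while cross validation uses all rows for both training and validation at the cost of training the model several times.
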

  47. Ensembling

    Diversity boosts the score
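
A simple form of ensembling is to average the predicted probabilities of models of different kinds. The sketch below is an illustration on synthetic data; the two models and the equal weights are assumptions, and an average is not guaranteed to beat its members on every dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)

# Two models of different kinds, to get diverse predictions
linear = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

proba_linear = linear.predict_proba(X_va)[:, 1]
proba_forest = forest.predict_proba(X_va)[:, 1]
proba_blend = (proba_linear + proba_forest) / 2   # simple average

for name, proba in [
    ("linear", proba_linear),
    ("forest", proba_forest),
    ("blend", proba_blend),
]:
    print(name, accuracy_score(y_va, (proba >= 0.5).astype(int)))
```
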

  48. In this section, you learned ...

    From participation to submission
    ☑ Participation in a competition
    ☑ How to use the Python environment in Kaggle
    ☑ Loading packages
    ☑ Loading datasets
    ☑ Feature engineering
    ☑ Training and prediction with machine learning algorithms
    ☑ Submission to the leaderboard
  49. In this section, you learned ...

    How to boost your score
    ☑ Exploratory data analysis
    ☑ Adding hypothesis-based features
    ☑ Switching machine learning algorithms
    ☑ Hyperparameter tuning
    ☑ The importance of validation
    ☑ Ensembling
  50. Conclusion: Wrap up & future resources

  51. In this section, you will learn ...

    ☑ Summary of this tutorial
    ☑ Future resources
  52. Workflow in Kaggle competitions

    1. Understand the competition and the task
    2. Create a benchmark
       Simple feature engineering
       Training and prediction
       Validation
    3. Exploratory data analysis and hypothesis
    4. Check validation and submission scores
    5. Back to 3 (sometimes 1 or 2)
  53. Recommended way for beginners to participate

    1. Understand the competition and the task
       Read the competition page
       Read EDA Notebooks
    2. Create a benchmark
       Utilize public Notebooks
    3. Improve the benchmark
       Utilize public Notebooks and Discussions
    4. Do ensembling
  54. Which competitions?

    There are several perspectives
      Medal
      Type of dataset
      Schedule
      Code competitions
      Competition platform
    https://speakerdeck.com/upura/how-to-choose-kaggle-competitions
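  55. Future resources

    『PythonではじめるKaggleスタートブック』 ("Kaggle Start Book with Python", Kodansha)
    「Kaggleに登録したら次にやること 〜 これだけやれば十分闘える！Titanicの先へ行く入門 10 Kernel 〜」 ("What to do next after registering on Kaggle: 10 introductory Kernels that take you beyond Titanic and are enough to compete")
    Weekly Kaggle News
    『Kaggleで勝つデータ分析の技術』 ("Data Analysis Techniques to Win Kaggle", Gijutsu-Hyoron)
    『Approaching (Almost) Any Machine Learning Problem』
    kaggler-ja Slack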
  56. In this section, you learned ...

    ☑ Summary of this tutorial
    ☑ Future resources