
How to start your career in Data Science with Kaggle

Recorded session: https://youtu.be/hV8x9uCQdS0

Kaggle is the world's largest data science community with competitions tackling issues in various industries, from drug discovery to autonomous vehicles. Many aspiring data scientists focus on doing Kaggle competitions as a way to build their portfolios and kickstart their career in Data Science. There are, however, several preconceptions around Kaggle, including lack of relevance to the real world and overfitting.

For this talk, we would love to welcome Hiroshi Yoshihara, Machine Learning Engineer at Aillis Inc. and Kaggle Competitions Master, to share his experience with Kaggle and tips for beginners making their first steps in Data Science.

Key points 👉

- what is Kaggle and some common misunderstandings about it
- my experience in Kaggle: what did I learn from it and how did it benefit my career?
- how to use Kaggle for your business?
- tips & insights for Kaggle beginners

If you are interested in a career in Data Science, join us on Dec 17th for a one-hour talk and have all your questions answered!
____________

🚀 About Hiroshi Yoshihara 🚀

Hiroshi Yoshihara is a Machine Learning Engineer at Aillis Inc., a Japan-based startup developing an AI-powered medical device for early and accurate influenza detection. He is also a public health researcher in academia. His passion lies at the intersection of technology and healthcare. He holds the title of Kaggle Competitions Master.

🚀 About Le Wagon Tokyo 🚀

Le Wagon Tokyo (https://www.lewagon.com/tokyo) is the world's most acclaimed coding bootcamp for startups, creative people and tech entrepreneurs.
Our Web Development and Data Science bootcamps are designed for individuals who want to change careers, become freelancers, or launch their own venture!

Hiroshechka Y

January 07, 2023

Transcript

  1. Hiroshi Yoshihara, Kaggle Master, Machine Learning Engineer. How to start your career in Data Science with Kaggle. 17th Dec 2020 @ Le Wagon Tokyo
  2. Hiroshi Yoshihara, a.k.a. RabotniKuma (熊), Hiroshechka Y. Competition freak: iGEM (biology), ABU Robocon (robotics). ML engineer at Aillis Inc. Public health researcher at UTokyo. CTO at LiDAT Inc. Kaggle Competitions Master. @analokmaus
  3. Aillis is a Japan-based startup developing an AI-powered medical device for early and accurate influenza detection.
  4. 1-1. Kaggle. Kaggle is the world’s largest data science competition platform. More than 150k competitors. More than 40 competitions this year. A perfect place to learn machine learning and build a portfolio as a data scientist. So fun!
  5. 1-2. How does a competition work? Data: training set (with labels), public test set (no labels), private test set (no labels). 1. Build models using the training set. 2. Make predictions on the public and private test sets. 3. Submit predictions and get a score on the public test set. 4. Repeat steps 1 to 3. 5. The final standing is based on the score on the private test set.
  6. 1-3. What makes Kaggle so fun? Competitive spirit! Tiers based on your performance in past competitions. Prizes for winners. Cooperate with data scientists from all over the world and learn together. Free computational resources (including GPU/TPU).
  7. 2-1. Myths. Lack of relevance to the real world. The problem setting is overly ideal. A 0.01% improvement does not matter. Solutions are too complex to be used in the real world. Lack of academic value.
  8. 2-2. Is the problem setting too ideal? A competition problem is more or less simplified. A competition focuses on data processing and modeling. Every year, tens of companies and organizations pay Kaggle to host competitions. Extract and translate a real-world problem into a competition problem. Stages: problem statement, data collection, data processing, modeling, deployment.
  9. 2-3. Does 0.01% really matter? In most cases, slight differences in score are not important. BUT the insights you gain while improving the score are usually of great importance. Kaggle can be compared to F1. F1: how to run, turn, and stop the car safely at more than 300 kph. Kaggle: how to process and model the data to achieve 99.99% accuracy. https://www.raconteur.net/how-much-does-an-f1-car-cost/
  10. 2-4. Are solutions too complex? For tabular data competitions, top solutions are likely to be a mixture of tons of lightweight models (GBDTs). If you pursue only model accuracy, it is completely fine. Many competitions are code competitions, in which solutions are run on Kaggle servers with limited resources and time. Some innovative ideas were born in competitions, e.g. BERT in chemistry (https://www.kaggle.com/c/champs-scalar-coupling/discussion/106572).
  11. 2-5. Lack of academic value? It is true that the most innovative ideas, such as ResNet or AlphaFold2, are not likely to be invented on Kaggle. State-of-the-art (SOTA) models proposed in academia do not always perform well on external datasets, e.g. those on Kaggle. Kaggle is a good place to test SOTA models. Kaggle is of great practical value. There are also research competitions, which aim to advance SOTA in specific domains.
  12. 3-1. The very beginning. I got to know Kaggle in a boring data science lecture at university. My skills at that time: Programming languages: C and JavaScript. Data science: very basic level, no hands-on experience. Statistics: I could do Student’s t-test. Math: I got 53 out of 100 in a linear algebra exam. “Kaggle is like an MMO game. Cool!”
  13. 3-2. The first competition: VSB Power Line Fault Detection. Time series anomaly detection, one of the most difficult kinds of task :( Most competitors used LSTM, Attention, and many other complex neural networks which I didn’t understand well :( I modified a notebook which extracts a massive number of hand-engineered features. The complex neural networks mostly overfitted to the public test data, and I jumped up to a Bronze medal 🥉 :)
  14. 3-3. Learnings from the first competition. There are many Kaggle-specific keywords: CV, LB, Shakedown, Adversarial validation, etc. Don’t be lazy, google them! Kaggle discussions and notebooks are the best textbooks. Before you make models, you must build a reliable validation scheme. Complex models do not always win. (“Oh, this MMO game is quite a bit of fun”)
  15. 3-4. As a Kaggler. I asked my friends to join Kaggle and tackled many competitions with a fixed team. Finding teammates is always a good idea. Kaggle performance / tier matters when it comes to job offers.
  16. 3-5. Kaggler in industry. I joined an AI startup, Aillis Inc., through a referral from my friend. The CTO and the whole company were interested in the potential of Kaggle and supported my Kaggle challenges. “Kaggle is to Aillis what F1 is to Honda.” - CTO. Insights from the PANDA challenge helped a lot. Aillis Inc. CTO: Atsushi Fukuda (@fukumimi014)
  17. 3-6. What I learned from Kaggle. Theories, Mathematics, Statistics, Machine learning, Programming and coding, Data wrangling, Data intuition, Data visualization. https://blog.udacity.com/2014/11/data-science-job-skills.html
  18. Appx. My competition routine. 1. Careful Exploratory Data Analysis (EDA, i.e. torture the data). 2. Make a baseline model and validation scheme. 3. Search for useful materials, such as previous competitions with similar settings and papers related to the competition. 4. Make (update) a TODO list before doing experiments. 5. Experiment and experiment. 6. Pray :)
  19. Appx. Resources. “Machine Learning” by Andrew Ng (https://www.coursera.org/learn/machine-learning). “How to Win a Data Science Competition: Learn from Top Kagglers” by National Research University Higher School of Economics (https://www.coursera.org/learn/competitive-data-science?specialization=aml). “Approaching (Almost) Any Machine Learning Problem” by Abhishek Thakur (https://www.amazon.com/Approaching-Almost-Machine-Learning-Problem/dp/8269211508). “Kaggleで勝つデータ分析の技術” by Daisuke Kadowaki et al. (in Japanese, https://www.amazon.co.jp/dp/4297108437). “PythonではじめるKaggleスタートブック” by Shotaro Ishihara and Hideki Murata (in Japanese, https://www.amazon.co.jp/dp/B088R992TJ).
  20. 4-1. Kagglers are good at: Torturing data. Designing good validation schemes. Choosing proper algorithms. Building accurate models. Reading and implementing new algorithms from published papers.
  21. 4-2. Kagglers MAY NOT be good at: Designing the whole problem: what kind of data to collect, which metric is suitable, what the output should look like, etc. (Data, metric, and output format are fixed in competitions.) Data engineering: RDB, SQL, Cloud, etc. Data visualization (but Kaggle is a good place to practice it). Communication with non-engineers. Large-scale software development.
  22. 4-3. How to utilize Kagglers? Correctly understand what they are (not) good at. Kaggler + liaison = not good :( A good team = fleet: make up for each other’s weak points. Transform the original task into a competition-like task (if possible). Encourage them to Kaggle! Approve Kaggle-related activity as a part of work, etc. Winning a competition wins reputation for the company.
  23. 4-4. Kaggling companies. Many companies have a so-called “Kaggler team” as a specialist team in data analysis. Many companies encourage their employees to participate in Kaggle competitions.
  24. Aillis allows engineers to participate in Kaggle competitions during work. No limitation based on performance in Kaggle. The ratio of Kaggle-related activities is up to 40%. Participants are asked to share their learnings after the competition. Participants can use the company’s cloud GPU instances when they are not occupied.
  25. Aillis hosted an in-house competition as a part of development. We developed an online machine learning competition platform and published it as an open-source project. (https://github.com/AillisInc/ml_competition_platform)
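
Slides 5 and 14 describe the mechanics behind the "CV" and "LB" keywords: a local validation score computed on the training set, a public leaderboard score shown during the competition, and a private leaderboard score that decides the final standing. The following is a minimal sketch of that idea in plain Python, not code from the talk. The toy data, the threshold "model", the 30/70 public/private split, and all function names are assumptions made for illustration only.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def fit_threshold(xs, ys):
    """Toy 'model' (hypothetical): choose the cut-off on x with the best training accuracy."""
    candidates = sorted(set(xs))
    return max(candidates, key=lambda c: accuracy(ys, [int(x > c) for x in xs]))


def predict(threshold, xs):
    """Predict 1 when x is above the learned cut-off, else 0."""
    return [int(x > threshold) for x in xs]


def kfold_cv_score(xs, ys, k=5):
    """Mean out-of-fold accuracy: the local validation ('CV') score."""
    indices = list(range(len(xs)))
    scores = []
    for fold in range(k):
        valid = indices[fold::k]          # every k-th sample forms one fold
        valid_set = set(valid)
        train = [i for i in indices if i not in valid_set]
        thr = fit_threshold([xs[i] for i in train], [ys[i] for i in train])
        fold_pred = predict(thr, [xs[i] for i in valid])
        scores.append(accuracy([ys[i] for i in valid], fold_pred))
    return sum(scores) / k


def make_data(n, rng):
    """Synthetic 1-D data with noisy labels (an assumption, not competition data)."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [int(x + rng.gauss(0, 0.3) > 0) for x in xs]
    return xs, ys


rng = random.Random(0)
train_x, train_y = make_data(200, rng)
test_x, test_y = make_data(100, rng)      # test labels are hidden from competitors

# Step 1: build a model on the training set and validate it locally.
cv_score = kfold_cv_score(train_x, train_y)
model = fit_threshold(train_x, train_y)

# Step 2: predict on the whole test set (public and private parts together).
preds = predict(model, test_x)

# Step 3: only the public part is scored while the competition is running.
n_public = 30                              # assumed 30/70 public/private split
public_lb = accuracy(test_y[:n_public], preds[:n_public])

# Step 5: the final standing uses the private part.
private_lb = accuracy(test_y[n_public:], preds[n_public:])

print(f"CV: {cv_score:.3f}  public LB: {public_lb:.3f}  private LB: {private_lb:.3f}")
```

Because the public score here comes from only 30 samples, it is noisier than the cross-validation score. That is one way to read the advice on slide 14 to build a reliable validation scheme before tuning models against the leaderboard.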