
Kaggle Days Championship: Summary of All 12 Qualifying Rounds + Tips

shimacos
November 09, 2022



This is what we presented at the Kaggle Days Championship Final held in Barcelona on October 28-29.

Our team finished first overall across the 12 Kaggle Days Championship qualifying meetups. In this deck, I summarize the 12 meetup competitions and explain the general approach our team tried and tested in these short, 4-hour competitions.


Transcript

  1. Kohei
    sakami
    shimacos
    sugawarya
    Team SSSS
    Kaggle Days Championship competitions:
    summary of our approaches


  2. Self Introduction
    sakami
    NLP/tabular
    Kohei
    CV/tabular
    shimacos
    CV/NLP/tabular
    sugawarya
    CV/tabular


  3. Agenda
    1 Overview of each championship competition
    2 Our general approach in detail
    3 Summary


  4. Overview of each championship
    competition


  5. Don’t stop until you drop!
    1
    Competition outline
    • 6-class image classification
    • Predict types of yoga poses
    • micro F1
    Top solutions
    • Swin-Large
    • We used NFNet, EfficientNet


  6. Water, water everywhere, not a drop to drink!
    2
    Competition outline
    • Tabular competition (anonymous data)
    • Predict water quality by region
    • RMSLE
    Top solutions
    • Features that treat categories as
    numerical values (link)
    • An MLP trained with a small batch
    size (32-64) was strong


  7. Now you’re playing with power!
    3
    Competition outline
    • Tabular competition
    • Predict amount of electricity
    • RMSE
    Top solutions
    • Leak features that interpolate from the
    rows before and after (the data was a time
    series, but the train/test split was random)
    • Imputation of missing values


  8. Nature never goes out of style!
    4
    Competition outline
    • Tabular competition with NLP features
    • Predict view counts of images in an
    image retrieval system
    • Each image is converted into features
    • MAE
    Top solutions
    • Ensemble of GBDT models
    • The key was how to create text features


  9. We are all alike, on the inside.
    5
    Competition outline
    • Multilingual NLP competition
    • Predict categories from pairs of texts
    • micro F1
    Top solutions
    • RemBERT, xlm-roberta-large, etc.
    • Larger models are stronger


  10. Nowcasting
    6
    Competition outline
    • Time-series tabular competition
    • Predict precipitation from past weather
    • QWK
    Top solutions
    • Sequential NN e.g. Conv1d, Transformer
    • Lag features from the next sequence


  11. Gaps in Gaps
    7
    Competition outline
    • Tabular competition (anonymous data)
    • Predict the values to fill in the missing
    entries across 10 tables
    • MSE (numerical) +
    error rate (categorical)
    Top solutions
    • Repeated (iterative) imputation
    • Only two tables have enough data to predict
    with good accuracy. The remaining tables
    are filled with mean/mode.


  12. Which Book Should I Read?
    8
    Competition outline
    • Tabular competition (recommendation)
    • Predict rating values users give to books
    • User info: age, region, etc.
    • Book info: title, author, etc.
    • micro F1
    Top solutions
    • Catboost
    • Work on feature engineering


  13. Where there’s code, there’s bug
    9
    Competition outline
    • NLP competition
    • Predict whether given code has a bug
    • Two lines of code above and below are
    additionally provided as context
    • ROC-AUC
    Top solutions
    • CodeBERT, UniXcoder, etc.
    • Input only the 3rd line (the target line)
    • Logistic regression with tf-idf features scored
    higher than our language models (sketch below)


  14. Open your eyes for the beauty around you
    10
    Competition outline
    • CV competition (code submit)
    ○ No execution time limit
    ○ Unknown test data size, but can
    be run in 2-3 minutes per model
    • Predict beauty scores (?) of images
    • RMSE
    Top solutions
    • Swin-Large
    • Swin-Large needed to be trained for more
    epochs than the other models


  15. KaggleDays’s Flying Circus
    11
    Competition outline
    • CV competition
    • Predict people’s ages and genders
    • Sunglasses and beards are drawn over the faces
    • micro F1 (ages are categorized)
    Top solutions
    • Ensemble of EfficientNet, NFNet, ConvNeXt
    • Predict joint categories combining age & gender


  16. Was your stay worth its price?
    12
    Competition outline
    • Tabular competition with NLP features
    • Predict accommodation ratings
    • Multiple review texts are also given for
    each accommodation
    • micro F1
    Top solutions
    • Predict targets from review texts and
    aggregate them
    • Ensemble with MLP


  17. Our general approach in detail


  18. Workflow during a short-term competition
    0
    • EDA to decide how to split the folds
    • Create features and share them via GCS (Google Cloud Storage)
    • Experiment with different backbones and architectures across team members
    ○ Share OOF predictions via GCS
    • Ensemble (see the Nelder-Mead sketch below)
    ○ Weighted Average, Nelder-Mead, Optuna
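
A sketch of the ensemble-weight search with Nelder-Mead, assuming `oof_preds` (an `(n_samples, n_models)` array of out-of-fold predictions collected from GCS) and `y_true` are already loaded; the RMSE objective is illustrative and would be swapped for each competition's metric:

```python
import numpy as np
from scipy.optimize import minimize

def blend_rmse(weights: np.ndarray) -> float:
    # Project onto the simplex: non-negative weights that sum to 1.
    w = np.clip(weights, 0.0, None)
    w = w / max(w.sum(), 1e-9)
    blend = oof_preds @ w
    return float(np.sqrt(np.mean((y_true - blend) ** 2)))

n_models = oof_preds.shape[1]
result = minimize(
    blend_rmse,
    x0=np.full(n_models, 1.0 / n_models),  # start from a uniform average
    method="Nelder-Mead",
)
best_w = np.clip(result.x, 0.0, None)
best_w /= best_w.sum()
print("weights:", best_w, "rmse:", result.fun)
```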


  19. Tabular competitions | EDA
    1
    • Distribution of target
    ○ Normal distribution or not
    • Scatter plots of numerical features against the target
    • Existence of category columns
    ○ How much of them is covered by the test data?
    (coverage-check sketch below)
    • Time-series data or not
    • (Whether leakage is present)
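
A quick sketch of the category-coverage check mentioned above, with hypothetical file names:

```python
import pandas as pd

train = pd.read_csv("train.csv")  # hypothetical paths
test = pd.read_csv("test.csv")

# For each categorical column, what fraction of test categories
# also appears in train? Low coverage makes target encoding risky.
for col in train.select_dtypes(include="object").columns:
    train_cats = set(train[col].dropna().unique())
    test_cats = set(test[col].dropna().unique())
    coverage = len(test_cats & train_cats) / max(len(test_cats), 1)
    print(f"{col}: {coverage:.1%} of test categories appear in train")
```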


  20. Tabular competitions | Feature Engineering
    1
    • Target Encoding (basically always strong; OOF sketch below)
    • Aggregate numerical columns by category columns
    • Difference and percentage features
    • Count Encoding
    • Dimensionality reduction with SVD for co-occurrence matrices
    between categories
    • NLP Feature
    ○ Tf-idf, Universal Sentence Encoder, BERT
    • (Leak Feature)
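
Since target encoding computed on the full training set leaks the label, a sketch of the out-of-fold variant (column names are placeholders):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def target_encode(train: pd.DataFrame, col: str, target: str,
                  n_splits: int = 5, seed: int = 42) -> pd.Series:
    """Encode each row using target means computed on the other folds only."""
    encoded = pd.Series(np.nan, index=train.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for trn_idx, val_idx in kf.split(train):
        fold_means = train.iloc[trn_idx].groupby(col)[target].mean()
        encoded.iloc[val_idx] = (
            train[col].iloc[val_idx].map(fold_means).to_numpy()
        )
    # Categories unseen in a fold fall back to the global mean.
    return encoded.fillna(train[target].mean())

# Usage with hypothetical columns:
# train["city_te"] = target_encode(train, "city", "rating")
```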


  21. Tabular competitions | Model
    1
    • Boosting Model
    ○ LightGBM, XGBoost, CatBoost
    • MLP using the same features as the boosting models
    ○ Concatenate important features into the later layers
    ○ Use GRU (t=1) instead of Linear (TalkingData 3rd; sketch below)
    • RNN or 1DCNN or Transformer (time series data)
    • 1DCNN (MoA 2nd and Optiver 1st)
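
A sketch of the GRU (t=1) trick above in PyTorch: the feature vector is treated as a length-1 sequence, so the GRU's gating roughly acts as a learned, input-dependent skip connection in place of a plain Linear layer. Layer sizes are arbitrary placeholders:

```python
import torch
import torch.nn as nn

class GRUBlock(nn.Module):
    """A GRU applied to a length-1 'sequence' as a drop-in Linear replacement."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.gru = nn.GRU(in_dim, out_dim, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.gru(x.unsqueeze(1))  # shape (batch, t=1, in_dim)
        return out.squeeze(1)

class TabularNet(nn.Module):
    def __init__(self, n_features: int, n_out: int = 1):
        super().__init__()
        self.body = nn.Sequential(
            GRUBlock(n_features, 256),
            GRUBlock(256, 128),
        )
        self.head = nn.Linear(128, n_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.body(x))
```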


  22. Image competitions | EDA
    2
    • Look at the images and think about which augmentations would be most effective.
    ○ e.g., would a vertical flip turn the image into a different yoga pose?
    • Error Analysis by baseline models
    ○ Noise Reduction


  23. Image competitions | Model
    2
    • EfficientNet with large image size
    ○ Lightweight and fast training
    • Swin Transformer (BMS 3rd, Google Landmark Recognition 3rd)
    • NFNet (Shopee 1st, Shopee 2nd; timm sketch below)
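
All three backbone families are available through timm; the checkpoint names below are examples from its registry, not necessarily the exact checkpoints we used:

```python
import timm

NUM_CLASSES = 6  # e.g., the yoga-pose competition

swin = timm.create_model(
    "swin_large_patch4_window7_224", pretrained=True, num_classes=NUM_CLASSES
)
effnet = timm.create_model(
    "tf_efficientnet_b5", pretrained=True, num_classes=NUM_CLASSES
)
nfnet = timm.create_model(
    "eca_nfnet_l1", pretrained=True, num_classes=NUM_CLASSES
)
```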


  24. Image competitions | Experiments
    2
    • Large image size
    • Test time augmentation
    • SAM Optimizer
    • Mixup (sketch below)
    • Pseudo Labeling
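
Of these, mixup is the least obvious to wire into a training loop; a minimal sketch (alpha = 0.4 is a typical value, not a tuned one):

```python
import numpy as np
import torch

def mixup(images: torch.Tensor, targets: torch.Tensor, alpha: float = 0.4):
    """Blend each image (and its label) with a random partner in the batch."""
    lam = float(np.random.beta(alpha, alpha))
    index = torch.randperm(images.size(0), device=images.device)
    mixed = lam * images + (1.0 - lam) * images[index]
    return mixed, targets, targets[index], lam

# In the training step, the loss is mixed with the same coefficient:
#   mixed, t1, t2, lam = mixup(images, targets)
#   logits = model(mixed)
#   loss = lam * criterion(logits, t1) + (1 - lam) * criterion(logits, t2)
```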


  25. NLP competitions | EDA
    3
    • Text length
    • Token length (transformed by tokenizer)
    ○ If many of them exceed the model's max_length, we need
    to consider how to truncate them (head, tail, head + tail); sketch below
    ○ Longformer, DeBERTa
    • Error Analysis
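
A sketch of the token-length check, assuming a hypothetical `text` column; the checkpoint is just an example:

```python
import numpy as np
import pandas as pd
from transformers import AutoTokenizer

train = pd.read_csv("train.csv")  # hypothetical path and column name
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")

lengths = np.array(
    [len(tokenizer.encode(t)) for t in train["text"].astype(str)]
)
print(np.percentile(lengths, [50, 90, 99]))
# If a large tail exceeds max_length, compare head / tail / head+tail
# truncation, or switch to a long-context model.
```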


  26. NLP competitions | Model
    3
    • DeBERTa (Recently strong)
    ○ Feedback prize, NBME 1st
    • Models pretrained in the same domain
    ○ Cross-lingual language models (XLM)
    ○ CodeBERT in AI4Code
    ○ Indonesian-BERT in Shopee 2nd
    • Larger models are stronger
    • Ensemble is important


  27. NLP competitions | TIPS
    3
    • Learning rate needs to be set smaller than for images.
    • Linear Warmup Scheduling makes training stable.
    • Sequence Bucketing for fast training and inference.
    • Multi Sample Dropout (Google QUEST 1st; sketch below)
    • Adversarial Weight Perturbation (Feedback 1st, NBME 1st)
    • For regression tasks, set Dropout to 0. (Blog)
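
A sketch of a Multi Sample Dropout head in the spirit of the Google QUEST 1st place write-up, averaging logits over several dropout masks (sizes are placeholders):

```python
import torch
import torch.nn as nn

class MultiSampleDropoutHead(nn.Module):
    """Average logits over several independent dropout masks.

    At eval time dropout is the identity, so the output reduces to a
    single linear head."""
    def __init__(self, hidden: int, n_out: int,
                 n_samples: int = 5, p: float = 0.5):
        super().__init__()
        self.dropouts = nn.ModuleList(nn.Dropout(p) for _ in range(n_samples))
        self.fc = nn.Linear(hidden, n_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.stack([self.fc(d(x)) for d in self.dropouts], 0).mean(0)
```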


  28. Multi modal competitions | TIPS
    4
    • Train models using only images
    • Train models using only texts
    • Stacking using the above two predictions (sketch below)
    ○ In our experience, this gives better accuracy than training on images and texts simultaneously.
    ○ TReNDS 3rd, Shopee 10th
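
A sketch of the stacking step, assuming out-of-fold predictions `oof_image` and `oof_text` from the two single-modality models are already loaded; the Ridge stacker is illustrative, and a GBDT or small MLP slots in the same way:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

# oof_image, oof_text: OOF predictions on the train set (assumed loaded);
# test_image, test_text: corresponding test-set predictions; y: targets.
X_stack = np.column_stack([oof_image, oof_text])
stacker = Ridge(alpha=1.0)

# OOF score of the stacked model itself, for honest validation.
oof_stack = cross_val_predict(stacker, X_stack, y, cv=5)

stacker.fit(X_stack, y)  # refit on all data before predicting on test
test_pred = stacker.predict(np.column_stack([test_image, test_text]))
```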


  29. Summary


  30. Summary
    1
    • We won first place on the meetup leaderboard!
    • Tabular: Target Encoding, Text Feature, Neural Network, (Find Leak)
    • Image: Swin Large
    • NLP: Bigger model and Domain specific model
    • ALL: Ensemble is all you need!
    https://kaggledays.com/championship/leaderboard/


  31. THANK YOU
