
Kaggle Days Championship: Summary of All 12 Qualifying Rounds + TIPS

shimacos
November 09, 2022


This is the material we presented at the Kaggle Days Championship Final held in Barcelona on October 28-29.

Our team finished first overall across the 12 qualifying meetups of the Kaggle Days Championship. In this document, I summarize the 12 meetup competitions and explain the general approach our team tried and tested in these short 4-hour competitions.



Transcript

  1. Don’t stop until you drop!
     Competition outline
     • 6-class image classification
     • Predict types of yoga poses
     • Metric: micro F1
     Top solutions
     • Swin-Large
     • We used NFNet and EfficientNet
  2. Water, water everywhere, not a drop to drink!
     Competition outline
     • Tabular competition (anonymized data)
     • Predict water quality by region
     • Metric: RMSLE
     Top solutions
     • Features that treat categories as numerical values (link)
     • An MLP trained with a small batch size (32-64) was strong
  3. Now you’re playing with power!
     Competition outline
     • Tabular competition
     • Predict amount of electricity
     • Metric: RMSE
     Top solutions
     • Leakage features interpolated from the rows before and after (the data was a time series, but the train/test split was random)
     • Imputation of missing values
  4. Nature never goes out of style!
     Competition outline
     • Tabular competition with NLP features
     • Predict view counts of images in an image retrieval system
     • Each image is converted into features
     • Metric: MAE
     Top solutions
     • Ensemble of GBDT models
     • The key was how to create the text features
  5. We are all alike, on the inside.
     Competition outline
     • Multilingual NLP competition
     • Predict categories from pairs of texts
     • Metric: micro F1
     Top solutions
     • RemBERT, XLM-RoBERTa-large, etc.
     • Larger models are stronger
  6. Nowcasting
     Competition outline
     • Time-series tabular competition
     • Predict precipitation from past weather
     • Metric: QWK
     Top solutions
     • Sequential NNs, e.g. Conv1d or Transformer
     • Lag features from the next sequence
  7. Gaps in Gaps
     Competition outline
     • Tabular competition (anonymized data)
     • Predict the missing values in 10 tables
     • Metric: MSE (numerical) + error rate (categorical)
     Top solutions
     • Repeated imputation
     • Only two tables had enough data to predict with good accuracy; the remaining tables were filled with the mean/mode
  8. Which Book Should I Read?
     Competition outline
     • Tabular competition (recommendation)
     • Predict the ratings users give to books
     • User info: age, region, etc.
     • Book info: title, author, etc.
     • Metric: micro F1
     Top solutions
     • CatBoost
     • Focus on feature engineering
  9. Where there’s code, there’s bug
     Competition outline
     • NLP competition
     • Predict whether given code has a bug
     • Two lines of code above and below are additionally provided for context
     • Metric: ROC-AUC
     Top solutions
     • CodeBERT, UniXcoder, etc.
     • Input only the 3rd line (the target line itself)
     • Logistic regression on tf-idf features scored higher than our language models
  10. Open your eyes for the beauty around you
      Competition outline
      • CV competition (code submission)
        ◦ No execution time limit
        ◦ Test data size unknown, but each model could be run in 2-3 minutes
      • Predict beauty scores (?) of images
      • Metric: RMSE
      Top solutions
      • Swin-Large
      • Swin-Large needed to be trained for more epochs than the other models
  11. KaggleDays’s Flying Circus
      Competition outline
      • CV competition
      • Predict people’s ages and genders
      • Sunglasses and beards are scribbled onto the images, graffiti-style
      • Metric: micro F1 (ages are binned into categories)
      Top solutions
      • Ensemble of EfficientNet, NFNet, and ConvNeXt
      • Predict joint categories combining age & gender
  12. Was your stay worth its price?
      Competition outline
      • Tabular competition with NLP features
      • Predict accommodation ratings
      • Multiple review texts are also given for each accommodation
      • Metric: micro F1
      Top solutions
      • Predict the target from the review texts and aggregate the predictions
      • Ensemble with an MLP
  13. Workflow during a short-term competition
      • EDA to determine how to split the folds
      • Create features and share them via GCS (Google Cloud Storage)
      • Experiment with different backbones and architectures across team members
        ◦ Share OOF predictions via GCS
      • Ensemble
        ◦ Weighted average, Nelder-Mead, Optuna (see the sketch below)
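A minimal sketch of the Nelder-Mead weight search mentioned above, blending shared OOF predictions; the array names and the RMSE objective are illustrative assumptions, not our actual pipeline.

```python
import numpy as np
from scipy.optimize import minimize


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def search_weights(oof_preds, y_true):
    """oof_preds: list of (n_samples,) OOF prediction arrays, one per model."""
    def objective(weights):
        # Blend with the candidate weights and score against the targets.
        blended = np.average(oof_preds, axis=0, weights=np.abs(weights))
        return rmse(y_true, blended)

    init = np.ones(len(oof_preds)) / len(oof_preds)
    result = minimize(objective, init, method="Nelder-Mead")
    return np.abs(result.x) / np.abs(result.x).sum()


# Synthetic stand-ins for OOF files shared via GCS.
rng = np.random.default_rng(0)
y = rng.normal(size=100)
oofs = [y + rng.normal(scale=s, size=100) for s in (0.3, 0.5)]
print(search_weights(oofs, y))  # more weight on the less noisy model
```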
  14. Tabular competitions | EDA
      • Distribution of the target
        ◦ Normal distribution or not
      • Scatter plots of numerical features against the target
      • Presence of category columns
        ◦ How much of the test data do they cover? (see the sketch below)
      • Time-series data or not
      • (Whether leakage is present)
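A minimal sketch of two of the checks above, assuming pandas DataFrames `train` and `test` with a "target" column; the column names and toy data are illustrative.

```python
import pandas as pd

train = pd.DataFrame({"cat": ["a", "a", "b", "c"],
                      "num": [1.0, 2.0, 3.0, 4.0],
                      "target": [10.0, 20.0, 15.0, 80.0]})
test = pd.DataFrame({"cat": ["a", "b", "d"], "num": [1.5, 2.5, 9.0]})

# Is the target roughly normal, or skewed enough to need a log transform?
print("target skew:", train["target"].skew())

# For each category column, how much of the test data is covered by
# categories seen in train? Low coverage makes target encoding risky.
for col in ["cat"]:
    coverage = test[col].isin(train[col].unique()).mean()
    print(col, f"coverage in test: {coverage:.2%}")
```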
  15. Tabular competitions | Feature Engineering
      • Target encoding (basically strong; see the sketch below)
      • Aggregate numerical columns by category columns
      • Difference and percentage features
      • Count encoding
      • Dimensionality reduction with SVD on co-occurrence matrices between categories
      • NLP features
        ◦ Tf-idf, Universal Sentence Encoder, BERT
      • (Leak features)
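A minimal sketch of out-of-fold target encoding, the first item above; computing the category means only on the training fold avoids leaking the target. Column names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def target_encode(df, cat_col, target_col, n_splits=5, seed=42):
    encoded = pd.Series(np.nan, index=df.index)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, valid_idx in kf.split(df):
        # Category means from the training fold only.
        means = df.iloc[train_idx].groupby(cat_col)[target_col].mean()
        encoded.iloc[valid_idx] = df[cat_col].iloc[valid_idx].map(means).values
    # Categories unseen in a training fold fall back to the global mean.
    return encoded.fillna(df[target_col].mean())


df = pd.DataFrame({"cat": list("aabbccab"),
                   "target": [1, 0, 1, 1, 0, 0, 1, 0]})
df["cat_te"] = target_encode(df, "cat", "target")
print(df)
```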
  16. Tabular competitions | Model
      • Boosting models
        ◦ LightGBM, XGBoost, CatBoost
      • MLP using the same features as the boosting models
        ◦ Concatenate important features into the later layers
        ◦ Use GRU (t=1) instead of Linear (TalkingData 3rd; see the sketch below)
      • RNN / 1D-CNN / Transformer (for time-series data)
      • 1D-CNN (MoA 2nd and Optiver 1st)
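A minimal PyTorch sketch of the GRU (t=1) trick above: a GRUCell applied for a single timestep stands in for a Linear layer inside a tabular MLP. The layer sizes and the choice to feed the same tensor as input and hidden state are my assumptions about the technique, not the referenced solution's exact code.

```python
import torch
import torch.nn as nn


class TabularMLP(nn.Module):
    def __init__(self, n_features, hidden=256, n_targets=1):
        super().__init__()
        self.input = nn.Linear(n_features, hidden)
        # A GRUCell run for one step acts as a gated alternative to Linear.
        self.gru = nn.GRUCell(hidden, hidden)
        self.head = nn.Linear(hidden, n_targets)

    def forward(self, x):
        h = torch.relu(self.input(x))
        h = self.gru(h, h)  # single timestep: input and hidden state share h
        return self.head(h)


model = TabularMLP(n_features=10)
print(model(torch.randn(4, 10)).shape)  # torch.Size([4, 1])
```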
  17. Image competitions | EDA
      • Look at the images and think about which augmentations would be most effective
        ◦ Would a vertical flip turn the image into a different yoga pose? (see the sketch below)
      • Error analysis with baseline models
        ◦ Noise reduction
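A minimal sketch of task-aware augmentation, per the note above: for yoga-pose classification a vertical flip can change the label, so it is deliberately left out. This uses albumentations; the exact pipeline is an assumption for illustration, not the team's.

```python
import albumentations as A
import numpy as np

transform = A.Compose([
    A.HorizontalFlip(p=0.5),        # usually label-preserving for poses
    # A.VerticalFlip(p=0.5),        # omitted: would flip the pose upside down
    A.RandomBrightnessContrast(p=0.5),
])

image = np.zeros((224, 224, 3), dtype=np.uint8)  # dummy image
augmented = transform(image=image)["image"]
print(augmented.shape)  # (224, 224, 3)
```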
  18. Image competitions | Model
      • EfficientNet with a large image size (see the timm sketch below)
        ◦ Lightweight and fast to train
      • Swin Transformer (BMS 3rd, Google Landmark Recognition 3rd)
      • NFNet (Shopee 1st, Shopee 2nd)
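A minimal timm sketch instantiating the three backbone families above; the model names are real timm identifiers, but the class count and image size are illustrative.

```python
import timm
import torch

# One representative checkpoint name per backbone family from the slide.
for name in ["tf_efficientnet_b3", "swin_large_patch4_window7_224", "dm_nfnet_f0"]:
    model = timm.create_model(name, pretrained=False, num_classes=6)
    x = torch.randn(1, 3, 224, 224)  # Swin window7_224 expects 224x224 input
    print(name, model(x).shape)      # (1, 6) logits for each backbone
```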
  19. Image competitions | Experiments
      • Large image size
      • Test-time augmentation
      • SAM optimizer
      • Mixup (see the sketch below)
      • Pseudo-labeling
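A minimal PyTorch sketch of Mixup, one of the techniques listed above; the alpha value and loss wiring are standard choices, assumed here for illustration.

```python
import numpy as np
import torch
import torch.nn.functional as F


def mixup(x, y, alpha=0.4):
    # Blend each sample with a randomly permuted partner.
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[perm], y, y[perm], lam


def mixup_loss(logits, y_a, y_b, lam):
    # Loss is the same convex combination as the inputs.
    return lam * F.cross_entropy(logits, y_a) + (1 - lam) * F.cross_entropy(logits, y_b)


x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 6, (8,))
mixed_x, y_a, y_b, lam = mixup(x, y)
print(mixed_x.shape, round(lam, 3))
```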
  20. NLP competitions | EDA
      • Text length
      • Token length (after tokenization)
        ◦ If many texts exceed the model's max_length, consider how to truncate them (head, tail, head + tail; see the sketch below)
        ◦ LongFormer, DeBERTa
      • Error analysis
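A minimal sketch of the head + tail truncation option above; the head/tail split ratio is an assumption, and in practice you would reserve room for special tokens.

```python
def head_tail_truncate(token_ids, max_length, head_ratio=0.25):
    """Keep the first part and the last part of an over-long sequence."""
    if len(token_ids) <= max_length:
        return token_ids
    n_head = int(max_length * head_ratio)
    n_tail = max_length - n_head
    return token_ids[:n_head] + token_ids[-n_tail:]


# Toy example: 1000 token ids cut down to 8 (2 from the head, 6 from the tail).
print(head_tail_truncate(list(range(1000)), max_length=8))
```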
  21. NLP competitions | Model
      • DeBERTa (recently strong; see the sketch below)
        ◦ Feedback Prize, NBME 1st
      • Models pretrained on the same domain
        ◦ Cross-lingual models (XLM)
        ◦ CodeBERT in AI4Code
        ◦ Indonesian BERT in Shopee 2nd
      • Larger models are stronger
      • Ensembling is important
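A minimal transformers sketch for the models above; microsoft/deberta-v3-large is a real Hugging Face checkpoint (a domain model such as microsoft/codebert-base would slot in the same way), while num_labels and max_length are illustrative.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "microsoft/deberta-v3-large"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

batch = tokenizer(["def f(x): return x"], return_tensors="pt",
                  truncation=True, max_length=128)
print(model(**batch).logits.shape)  # torch.Size([1, 2])
```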
  22. NLP competitions | TIPS
      • The learning rate needs to be set smaller than for images
      • Linear warmup scheduling makes training stable
      • Sequence bucketing for fast training and inference
      • Multi Sample Dropout (Google QUEST 1st; see the sketch below)
      • Adversarial Weight Perturbation (Feedback 1st, NBME 1st)
      • For regression tasks, set dropout to 0 (Blog)
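A minimal PyTorch sketch of Multi Sample Dropout, referenced above: several independent dropout masks feed one shared classifier, and the resulting logits are averaged. The number of samples and dropout rate are illustrative.

```python
import torch
import torch.nn as nn


class MultiSampleDropoutHead(nn.Module):
    def __init__(self, hidden, n_classes, n_samples=5, p=0.5):
        super().__init__()
        self.dropouts = nn.ModuleList(nn.Dropout(p) for _ in range(n_samples))
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, features):
        # Average logits from independent dropout masks over the same head.
        return torch.stack(
            [self.classifier(d(features)) for d in self.dropouts]
        ).mean(dim=0)


head = MultiSampleDropoutHead(hidden=768, n_classes=3)
print(head(torch.randn(4, 768)).shape)  # torch.Size([4, 3])
```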
  23. Multimodal competitions | TIPS
      • Train models using only images
      • Train models using only texts
      • Stack using the two sets of predictions (see the sketch below)
        ◦ In our experience, this gives better accuracy than training on images and texts simultaneously
        ◦ TReNDS 3rd, Shopee 10th
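A minimal sketch of the two-stage setup above: OOF predictions from an image-only model and a text-only model become the features for a second-level model. The data here is synthetic stand-ins; any second-level model could replace the logistic regression.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)

# Stand-ins for OOF scores from the image-only and text-only models.
image_oof = y + rng.normal(scale=0.8, size=200)
text_oof = y + rng.normal(scale=0.6, size=200)

# Second level: stack the two prediction columns and fit a simple model,
# again out-of-fold so the stacker's scores stay unbiased.
X_stack = np.column_stack([image_oof, text_oof])
oof_stacked = cross_val_predict(LogisticRegression(), X_stack, y,
                                cv=5, method="predict_proba")
print(oof_stacked.shape)  # (200, 2)
```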
  24. Summary
      • We won first place in the meetup LB!
      • Tabular: target encoding, text features, neural networks, (find the leak)
      • Image: Swin-Large
      • NLP: bigger models and domain-specific models
      • ALL: Ensemble is all you need!
      https://kaggledays.com/championship/leaderboard/