Slide 1

Kaggle Days Championship Competitions: Summary of Our Approaches
Team SSSS: Kohei, sakami, shimacos, sugawarya

Slide 2

Self Introduction
• sakami: NLP/tabular
• Kohei: CV/tabular
• shimacos: CV/NLP/tabular
• sugawarya: CV/tabular

Slide 3

Agenda
1. Overview of each championship competition
2. Details of our general approach
3. Summary

Slide 4

Overview of each championship competition

Slide 5

Don’t stop until you drop! (Competition 1)

Competition outline
• 6-class image classification
• Predict types of yoga poses
• Metric: micro F1

Top solutions
• Swin-Large
• We used NFNet and EfficientNet

Slide 6

Water, water everywhere, not a drop to drink! (Competition 2)

Competition outline
• Tabular competition (anonymized data)
• Predict water quality by region
• Metric: RMSLE

Top solutions
• Features that treat categories as numerical values (link)
• An MLP trained with a small batch size (32-64) was strong

Slide 7

Now you’re playing with power! (Competition 3)

Competition outline
• Tabular competition
• Predict amount of electricity
• Metric: RMSE

Top solutions
• Leakage features that interpolate from the rows before and after (the data was a time series, but the train/test split was random)
• Imputation of missing values

Slide 8

Nature never goes out of style! (Competition 4)

Competition outline
• Tabular competition with NLP features
• Predict view counts of images in an image retrieval system
• Each image is converted into features
• Metric: MAE

Top solutions
• Ensemble of GBDT models
• The key was how to create text features

Slide 9

We are all alike, on the inside. (Competition 5)

Competition outline
• Multilingual NLP competition
• Predict categories from pairs of texts
• Metric: micro F1

Top solutions
• RemBERT, XLM-RoBERTa-large, etc.
• Larger models are stronger

Slide 10

Nowcasting (Competition 6)

Competition outline
• Time-series tabular competition
• Predict precipitation from past weather
• Metric: QWK

Top solutions
• Sequential NNs, e.g. Conv1d or Transformer
• Lag features from the next sequence

Slide 11

Gaps in Gaps (Competition 7)

Competition outline
• Tabular competition (anonymized data)
• Predict the missing values in 10 tables
• Metric: MSE (numerical) + error rate (categorical)

Top solutions
• Repeated imputation
• Only two tables had enough data to predict with good accuracy; the remaining tables were filled with the mean/mode

Slide 12

Which Book Should I Read? (Competition 8)

Competition outline
• Tabular competition (recommendation)
• Predict the ratings users give to books
• User info: age, region, etc.
• Book info: title, author, etc.
• Metric: micro F1

Top solutions
• CatBoost
• Careful feature engineering

Slide 13

Where there’s code, there’s bug (Competition 9)

Competition outline
• NLP competition
• Predict whether given code has a bug
• Two lines of code above and below are additionally provided for context
• Metric: ROC-AUC

Top solutions
• CodeBERT, UniXcoder, etc.
• Input only the 3rd line (the middle line)
• Logistic regression on tf-idf features scored higher than our language models

Slide 14

Open your eyes for the beauty around you (Competition 10)

Competition outline
• CV competition (code submission)
  ○ No execution time limit
  ○ Unknown test data size, but inference could be run in 2-3 minutes per model
• Predict beauty scores (?) of images
• Metric: RMSE

Top solutions
• Swin-Large
• Swin-Large needed to be trained for more epochs than other models

Slide 15

KaggleDays’s Flying Circus (Competition 11)

Competition outline
• CV competition
• Predict people’s ages and genders
• Sunglasses and beards are scribbled onto the faces
• Metric: micro F1 (ages are bucketed into categories)

Top solutions
• Ensemble of EfficientNet, NFNet, ConvNeXt
• Predict combined age & gender categories

Slide 16

Was your stay worth its price? (Competition 12)

Competition outline
• Tabular competition with NLP features
• Predict accommodation ratings
• Multiple review texts are also given for each accommodation
• Metric: micro F1

Top solutions
• Predict targets from the review texts and aggregate the predictions
• Ensemble with an MLP

Slide 17

Details of our general approach

Slide 18

Workflow during a short-term competition

• EDA to determine how to split the folds
• Create features and share them via GCS (Google Cloud Storage)
• Experiment with different backbones and architectures across team members
  ○ Share OOF predictions via GCS
• Ensemble (see the sketch below)
  ○ Weighted average, Nelder-Mead, Optuna
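
A minimal sketch of the ensembling step, assuming OOF prediction arrays and labels have already been loaded from the shared GCS files (the placeholder arrays below are illustrative):

```python
# Sketch: find blend weights by minimizing the OOF loss with Nelder-Mead.
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
y_true = rng.normal(size=1000)                  # placeholder labels
oof_preds = [y_true + rng.normal(size=1000),    # placeholder OOF predictions
             y_true + rng.normal(size=1000)]

def blend_loss(weights):
    weights = np.abs(weights) / np.abs(weights).sum()  # keep positive, sum to 1
    blend = sum(w * p for w, p in zip(weights, oof_preds))
    return mean_squared_error(y_true, blend)

result = minimize(blend_loss, x0=np.full(len(oof_preds), 0.5), method="Nelder-Mead")
print("blend weights:", np.abs(result.x) / np.abs(result.x).sum())
```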

Slide 19

Tabular competitions | EDA

• Distribution of the target
  ○ Normal distribution or not?
• Scatter plots of numerical features against the target
• Presence of category columns
  ○ How well are they covered by the test data? (see the sketch below)
• Time-series data or not?
• (Is leakage occurring or not?)
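
A quick check for the category-coverage point above (the column name is hypothetical):

```python
# Sketch: what fraction of test rows use a category value seen in train?
# Low coverage makes category-based features (e.g. target encoding) less reliable.
import pandas as pd

train = pd.DataFrame({"region": ["a", "a", "b", "c"]})  # placeholder data
test = pd.DataFrame({"region": ["a", "b", "d"]})

coverage = test["region"].isin(train["region"].unique()).mean()
print(f"test rows with a category seen in train: {coverage:.1%}")
```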

Slide 20

Tabular competitions | Feature Engineering

• Target encoding (basically strong; see the sketch below)
• Aggregate numerical columns by category columns
• Difference and percentage features
• Count encoding
• Dimensionality reduction with SVD on co-occurrence matrices between categories
• NLP features
  ○ Tf-idf, Universal Sentence Encoder, BERT
• (Leak features)
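
A hedged sketch of the first item, out-of-fold target encoding (the fold count and column names are illustrative):

```python
# Sketch: out-of-fold target encoding so a row's encoding never sees its own target.
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def target_encode(df, cat_col, target_col, n_splits=5, seed=42):
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for trn_idx, val_idx in kf.split(df):
        means = df.iloc[trn_idx].groupby(cat_col)[target_col].mean()
        encoded.iloc[val_idx] = df.iloc[val_idx][cat_col].map(means).to_numpy()
    return encoded.fillna(df[target_col].mean())  # unseen categories -> global mean

df = pd.DataFrame({"cat": list("aabbccdd") * 10, "y": np.arange(80) % 3})
df["cat_te"] = target_encode(df, "cat", "y")
```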

Slide 21

Tabular competitions | Model

• Boosting models
  ○ LightGBM, XGBoost, CatBoost
• MLP using the same features as the boosting models (see the sketch below)
  ○ Concatenate important features into the later layers
  ○ Use GRU (t=1) instead of Linear (TalkingData 3rd)
• RNN, 1D-CNN, or Transformer (time-series data)
• 1D-CNN (MoA 2nd and Optiver 1st)
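
A sketch of the MLP variant described above, re-injecting the important features before the head (sizes are illustrative; PyTorch assumed):

```python
# Sketch: MLP on the GBDT feature set, with the most important raw features
# concatenated again into a later layer.
import torch
import torch.nn as nn

class TabularMLP(nn.Module):
    def __init__(self, n_features, n_important, hidden=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(n_features, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
        )
        self.head = nn.Linear(hidden + n_important, 1)  # head sees raw features again

    def forward(self, x, x_important):
        return self.head(torch.cat([self.body(x), x_important], dim=1))

model = TabularMLP(n_features=100, n_important=10)
out = model(torch.randn(32, 100), torch.randn(32, 10))  # (batch, 1) predictions
```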

Slide 22

Image competitions | EDA

• Look at the images and think about which augmentations would be most effective (see the sketch below)
  ○ Could a vertical flip turn a pose into a different yoga pose?
• Error analysis with baseline models
  ○ Noise reduction
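
A sketch of that reasoning using albumentations; the transform list is an assumption, with VerticalFlip deliberately omitted because flipping could change the pose label:

```python
# Sketch: an augmentation pipeline that respects the label semantics of yoga poses.
import albumentations as A

train_transform = A.Compose([
    A.Resize(384, 384),
    A.HorizontalFlip(p=0.5),           # usually keeps the pose class intact
    A.RandomBrightnessContrast(p=0.3),
    A.Normalize(),
    # A.VerticalFlip intentionally left out: an upside-down pose may be a different class.
])
```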

Slide 23

Image competitions | Model

• EfficientNet with a large image size (see the sketch below)
  ○ Lightweight and fast to train
• Swin Transformer (BMS 3rd, Google Landmark Recognition 3rd)
• NFNet (Shopee 1st, Shopee 2nd)
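
A sketch of how these backbones would typically be created with timm (the model names are examples of timm identifiers, not necessarily the exact checkpoints we used):

```python
# Sketch: creating one of the backbones above as a classifier via timm.
import timm

model = timm.create_model("swin_large_patch4_window12_384", pretrained=True, num_classes=6)
# Alternatives in the same spirit: e.g. "tf_efficientnet_b5", "nfnet_l0".
```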

Slide 24

Image competitions | Experiments

• Large image size
• Test-time augmentation
• SAM optimizer
• Mixup (see the sketch below)
• Pseudo-labeling
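
A sketch of the mixup item (alpha is a common default, not from the slides):

```python
# Sketch: mixup on a batch; the loss then becomes
# lam * CE(pred, y_a) + (1 - lam) * CE(pred, y_b).
import numpy as np
import torch

def mixup(images, targets, alpha=0.4):
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(images.size(0))
    mixed = lam * images + (1 - lam) * images[perm]
    return mixed, targets, targets[perm], lam
```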

Slide 25

NLP competitions | EDA

• Text length
• Token length (after the tokenizer is applied; see the sketch below)
  ○ If many texts exceed the model’s max_length, we need to consider how to truncate them (head, tail, head + tail)
  ○ LongFormer, DeBERTa
• Error analysis
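
A sketch of the token-length check and head + tail truncation (the model name and lengths are illustrative):

```python
# Sketch: measure token lengths, then keep head + tail tokens for overlong texts.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
max_length = 512

def head_tail_ids(text, head=128, tail=382):
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    if len(ids) > head + tail:
        ids = ids[:head] + ids[-tail:]
    return ids

texts = ["example document ..."]
lengths = [len(tokenizer(t)["input_ids"]) for t in texts]
print(f"share over max_length: {sum(n > max_length for n in lengths) / len(lengths):.1%}")
```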

Slide 26

NLP competitions | Model

• DeBERTa (recently strong)
  ○ Feedback Prize, NBME 1st
• Models pretrained on the same domain (see the sketch below)
  ○ Cross-lingual language models (XLM)
  ○ CodeBERT in AI4Code
  ○ Indonesian BERT in Shopee 2nd
• Larger models are stronger
• Ensembling is important
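
A minimal sketch of loading a domain-matched pretrained model (the checkpoint name is one public example, not necessarily the one used):

```python
# Sketch: a domain-matched pretrained checkpoint for a classification task.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base",  # e.g. CodeBERT for a source-code task
    num_labels=2,
)
```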

Slide 27

NLP competitions | TIPS

• The learning rate needs to be set smaller than for image models
• Linear warmup scheduling makes training stable
• Sequence bucketing for fast training and inference
• Multi Sample Dropout (Google QUEST 1st; see the sketch below)
• Adversarial Weight Perturbation (Feedback 1st, NBME 1st)
• For regression tasks, set dropout to 0 (Blog)
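
A sketch of the Multi Sample Dropout head (the sample count and dropout rate follow common usage, not the slides):

```python
# Sketch: average logits over several dropout masks of the same hidden state.
import torch
import torch.nn as nn

class MultiSampleDropoutHead(nn.Module):
    def __init__(self, hidden_size, n_labels, n_samples=5, p=0.5):
        super().__init__()
        self.dropouts = nn.ModuleList(nn.Dropout(p) for _ in range(n_samples))
        self.fc = nn.Linear(hidden_size, n_labels)

    def forward(self, hidden):
        return torch.stack([self.fc(d(hidden)) for d in self.dropouts]).mean(dim=0)

head = MultiSampleDropoutHead(hidden_size=1024, n_labels=2)
logits = head(torch.randn(8, 1024))  # (8, 2)
```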

Slide 28

Multi-modal competitions | TIPS

• Train models using only images
• Train models using only texts
• Stack the two sets of predictions above (see the sketch below)
  ○ In our experience, this gives better accuracy than training on images and texts simultaneously
  ○ TReNDS 3rd, Shopee 10th
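
A minimal sketch of the stacking step, with placeholder arrays standing in for the real OOF predictions:

```python
# Sketch: stack image-only and text-only OOF predictions with a second-level model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
oof_image = y * 0.6 + rng.random(1000) * 0.4  # placeholder image-model OOF probs
oof_text = y * 0.5 + rng.random(1000) * 0.5   # placeholder text-model OOF probs

stacker = LogisticRegression().fit(np.column_stack([oof_image, oof_text]), y)
```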

Slide 29

Summary

Slide 30

Summary

• We won first place on the meetup LB!
• Tabular: target encoding, text features, neural networks, (find the leak)
• Image: Swin-Large
• NLP: bigger models and domain-specific models
• All: ensemble is all you need!

https://kaggledays.com/championship/leaderboard/

Slide 31

THANK YOU