
Kaggle Days Championship: Summary of All 12 Qualifying Rounds + Tips

shimacos
November 09, 2022



This is what we presented at the Kaggle Days Championship Final held in Barcelona on October 28-29.

Our team finished first overall across the 12 Kaggle Days Championship qualifying meetups. In this deck, I summarize the 12 meetup competitions and explain the general approach our team tried and tested in these short, 4-hour competitions.


Transcript

  1. Kohei
    sakami
    shimacos
    sugawarya
    Team SSSS
    Kaggle Days Championship competitions:
    summary of our approaches


  2. Self Introduction
    sakami
    NLP/tabular
    Kohei
    CV/tabular
    shimacos
    CV/NLP/tabular
    sugawarya
    CV/tabular


  3. Agenda
    1 Overview of each championship competition
    2 Our general approach in detail
    3 Summary


  4. Overview of each championship
    competition


  5. Don’t stop until you drop!
    1
    Competition outline
    • 6-class image classification
    • Predict types of yoga poses
    • micro F1
    Top solutions
    • Swin-Large
    • We used NFNet, EfficientNet


  6. Water, water everywhere, not a drop to drink!
    2
    Competition outline
    • Tabular competition (anonymous data)
    • Predict water quality by region
    • RMSLE
    Top solutions
    • Features that treat categories as
    numerical values (link)
    • An MLP trained with a small batch
    size (32-64) was strong


  7. Now you’re playing with power!
    3
    Competition outline
    • Tabular competition
    • Predict amount of electricity
    • RMSE
    Top solutions
    • Leak features that interpolate from the
    rows before and after (the data was a time
    series, but the train/test split was random)
    • Imputation of missing values


  8. Nature never goes out of style!
    4
    Competition outline
    • Tabular competition with NLP features
    • Predict view counts of images in an
    image retrieval system
    • Each image is converted into features
    • MAE
    Top solutions
    • Ensemble of GBDT models
    • The key was how to create text features


  9. We are all alike, on the inside.
    5
    Competition outline
    • Multilingual NLP competition
    • Predict categories from pairs of texts
    • micro F1
    Top solutions
    • RemBERT, xlm-roberta-large, etc.
    • Larger models are stronger


  10. Nowcasting
    6
    Competition outline
    • Time-series tabular competition
    • Predict precipitation from past weather
    • QWK
    Top solutions
    • Sequential NN e.g. Conv1d, Transformer
    • Lag features from the next sequence


  11. Gaps in Gaps
    7
    Competition outline
    • Tabular competition (anonymous data)
    • Predict the values to fill in the missing
    entries across 10 tables
    • MSE (numerical) +
    error rate (categorical)
    Top solutions
    • Repeated (iterative) imputation
    • Only two tables have enough data to predict
    with good accuracy. The remaining tables
    are filled with mean/mode.


  12. Which Book Should I Read?
    8
    Competition outline
    • Tabular competition (recommendation)
    • Predict rating values users give to books
    • User info: age, region, etc.
    • Book info: title, author, etc.
    • micro F1
    Top solutions
    • Catboost
    • Work on feature engineering


  13. Where there’s code, there’s bug
    9
    Competition outline
    • NLP competition
    • Predict whether given code has a bug
    • Two lines of code above and below are
    additionally provided as context
    • ROC-AUC
    Top solutions
    • CodeBERT, UniXcoder, etc.
    • Input only the 3rd line (the target line)
    • Logistic regression with tf-idf features scored
    higher than our language models (sketch below)


  14. Open your eyes for the beauty around you
    10
    Competition outline
    • CV competition (code submit)
    ○ No execution time limit
    ○ Unknown test data size, but can
    be run in 2-3 minutes per model
    • Predict beauty scores (?) of images
    • RMSE
    Top solutions
    • Swin-Large
    • Swin-Large needed to be trained for more
    epochs than the other models


  15. KaggleDays’s Flying Circus
    11
    Competition outline
    • CV competition
    • Predict people’s ages and genders
    • Sunglasses and beards are drawn over the faces
    • micro F1 (ages are categorized)
    Top solutions
    • Ensemble of EfficientNet, NFNet, ConvNeXt
    • Predict joint categories combining age & gender


  16. Was your stay worth its price?
    12
    Competition outline
    • Tabular competition with NLP features
    • Predict accommodation ratings
    • Multiple review texts are also given for
    each accommodation
    • micro F1
    Top solutions
    • Predict targets from review texts and
    aggregate them
    • Ensemble with MLP


  17. Our general approach in detail


  18. Workflow during a short-term competition
    0
    • EDA to decide how to split the folds
    • Create features and share them via GCS (Google Cloud Storage)
    • Experiment with different backbones and architectures across team members
    ○ Share OOF predictions via GCS
    • Ensemble (see the Nelder-Mead sketch below)
    ○ Weighted Average, Nelder-Mead, Optuna
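
A sketch of the ensemble-weight search with Nelder-Mead, assuming `oof_preds` (an `(n_samples, n_models)` array of out-of-fold predictions collected from GCS) and `y_true` are already loaded; the RMSE objective is illustrative and would be swapped for each competition's metric:

```python
import numpy as np
from scipy.optimize import minimize

def blend_rmse(weights: np.ndarray) -> float:
    # Project onto the simplex: non-negative weights that sum to 1.
    w = np.clip(weights, 0.0, None)
    w = w / max(w.sum(), 1e-9)
    blend = oof_preds @ w
    return float(np.sqrt(np.mean((y_true - blend) ** 2)))

n_models = oof_preds.shape[1]
result = minimize(
    blend_rmse,
    x0=np.full(n_models, 1.0 / n_models),  # start from a uniform average
    method="Nelder-Mead",
)
best_w = np.clip(result.x, 0.0, None)
best_w /= best_w.sum()
print("weights:", best_w, "rmse:", result.fun)
```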


  19. Tabular competitions | EDA
    1
    • Distribution of target
    ○ Normal distribution or not
    • Scatter plots of numerical features against the target
    • Existence of category columns
    ○ How much of them is covered by the test data?
    (coverage-check sketch below)
    • Time-series data or not
    • (Whether leakage is present)
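
A quick sketch of the category-coverage check mentioned above, with hypothetical file names:

```python
import pandas as pd

train = pd.read_csv("train.csv")  # hypothetical paths
test = pd.read_csv("test.csv")

# For each categorical column, what fraction of test categories
# also appears in train? Low coverage makes target encoding risky.
for col in train.select_dtypes(include="object").columns:
    train_cats = set(train[col].dropna().unique())
    test_cats = set(test[col].dropna().unique())
    coverage = len(test_cats & train_cats) / max(len(test_cats), 1)
    print(f"{col}: {coverage:.1%} of test categories appear in train")
```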


  20. Tabular competitions | Feature Engineering
    1
    • Target Encoding (basically always strong; OOF sketch below)
    • Aggregate numerical columns by category columns
    • Difference and percentage features
    • Count Encoding
    • Dimensionality reduction with SVD for co-occurrence matrices
    between categories
    • NLP Feature
    ○ Tf-idf, Universal Sentence Encoder, BERT
    • (Leak Feature)
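
Since target encoding computed on the full training set leaks the label, a sketch of the out-of-fold variant (column names are placeholders):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def target_encode(train: pd.DataFrame, col: str, target: str,
                  n_splits: int = 5, seed: int = 42) -> pd.Series:
    """Encode each row using target means computed on the other folds only."""
    encoded = pd.Series(np.nan, index=train.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for trn_idx, val_idx in kf.split(train):
        fold_means = train.iloc[trn_idx].groupby(col)[target].mean()
        encoded.iloc[val_idx] = (
            train[col].iloc[val_idx].map(fold_means).to_numpy()
        )
    # Categories unseen in a fold fall back to the global mean.
    return encoded.fillna(train[target].mean())

# Usage with hypothetical columns:
# train["city_te"] = target_encode(train, "city", "rating")
```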


  21. Tabular competitions | Model
    1
    • Boosting Model
    ○ LightGBM, XGBoost, CatBoost
    • MLP using the same features as the boosting models
    ○ Concatenate important features into the later layers
    ○ Use GRU (t=1) instead of Linear (TalkingData 3rd; sketch below)
    • RNN or 1DCNN or Transformer (time series data)
    • 1DCNN (MoA 2nd and Optiver 1st)
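
A sketch of the GRU (t=1) trick above in PyTorch: the feature vector is treated as a length-1 sequence, so the GRU's gating roughly acts as a learned, input-dependent skip connection in place of a plain Linear layer. Layer sizes are arbitrary placeholders:

```python
import torch
import torch.nn as nn

class GRUBlock(nn.Module):
    """A GRU applied to a length-1 'sequence' as a drop-in Linear replacement."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.gru = nn.GRU(in_dim, out_dim, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.gru(x.unsqueeze(1))  # shape (batch, t=1, in_dim)
        return out.squeeze(1)

class TabularNet(nn.Module):
    def __init__(self, n_features: int, n_out: int = 1):
        super().__init__()
        self.body = nn.Sequential(
            GRUBlock(n_features, 256),
            GRUBlock(256, 128),
        )
        self.head = nn.Linear(128, n_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.body(x))
```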


  22. Image competitions | EDA
    2
    • Look at the images and think about which augmentations would be most effective.
    ○ e.g., would a vertical flip turn the image into a different yoga pose?
    • Error Analysis by baseline models
    ○ Noise Reduction


  23. Image competitions | Model
    2
    • EfficientNet with large image size
    ○ Lightweight and fast training
    • Swin Transformer (BMS 3rd, Google Landmark Recognition 3rd)
    • NFNet (Shopee 1st, Shopee 2nd; timm sketch below)
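
All three backbone families are available through timm; the checkpoint names below are examples from its registry, not necessarily the exact checkpoints we used:

```python
import timm

NUM_CLASSES = 6  # e.g., the yoga-pose competition

swin = timm.create_model(
    "swin_large_patch4_window7_224", pretrained=True, num_classes=NUM_CLASSES
)
effnet = timm.create_model(
    "tf_efficientnet_b5", pretrained=True, num_classes=NUM_CLASSES
)
nfnet = timm.create_model(
    "eca_nfnet_l1", pretrained=True, num_classes=NUM_CLASSES
)
```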


  24. Image competitions | Experiments
    2
    • Large image size
    • Test time augmentation
    • SAM Optimizer
    • Mixup (sketch below)
    • Pseudo Labeling
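
Of these, mixup is the least obvious to wire into a training loop; a minimal sketch (alpha = 0.4 is a typical value, not a tuned one):

```python
import numpy as np
import torch

def mixup(images: torch.Tensor, targets: torch.Tensor, alpha: float = 0.4):
    """Blend each image (and its label) with a random partner in the batch."""
    lam = float(np.random.beta(alpha, alpha))
    index = torch.randperm(images.size(0), device=images.device)
    mixed = lam * images + (1.0 - lam) * images[index]
    return mixed, targets, targets[index], lam

# In the training step, the loss is mixed with the same coefficient:
#   mixed, t1, t2, lam = mixup(images, targets)
#   logits = model(mixed)
#   loss = lam * criterion(logits, t1) + (1 - lam) * criterion(logits, t2)
```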


  25. NLP competitions | EDA
    3
    • Text length
    • Token length (transformed by tokenizer)
    ○ If many of them exceed the model's max_length, we need
    to consider how to truncate them (head, tail, head + tail); sketch below
    ○ Longformer, DeBERTa
    • Error Analysis
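
A sketch of the token-length check, assuming a hypothetical `text` column; the checkpoint is just an example:

```python
import numpy as np
import pandas as pd
from transformers import AutoTokenizer

train = pd.read_csv("train.csv")  # hypothetical path and column name
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")

lengths = np.array(
    [len(tokenizer.encode(t)) for t in train["text"].astype(str)]
)
print(np.percentile(lengths, [50, 90, 99]))
# If a large tail exceeds max_length, compare head / tail / head+tail
# truncation, or switch to a long-context model.
```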


  26. NLP competitions | Model
    3
    • DeBERTa (Recently strong)
    ○ Feedback prize, NBME 1st
    • Models pretrained in the same domain
    ○ Cross-lingual language models (XLM)
    ○ CodeBERT in AI4Code
    ○ Indonesian-BERT in Shopee 2nd
    • Larger models are stronger
    • Ensemble is important


  27. NLP competitions | TIPS
    3
    • Learning rate needs to be set smaller than for images.
    • Linear Warmup Scheduling makes training stable.
    • Sequence Bucketing for fast training and inference.
    • Multi Sample Dropout (Google QUEST 1st; sketch below)
    • Adversarial Weight Perturbation (Feedback 1st, NBME 1st)
    • For regression tasks, set Dropout to 0. (Blog)
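
A sketch of a Multi Sample Dropout head in the spirit of the Google QUEST 1st place write-up, averaging logits over several dropout masks (sizes are placeholders):

```python
import torch
import torch.nn as nn

class MultiSampleDropoutHead(nn.Module):
    """Average logits over several independent dropout masks.

    At eval time dropout is the identity, so the output reduces to a
    single linear head."""
    def __init__(self, hidden: int, n_out: int,
                 n_samples: int = 5, p: float = 0.5):
        super().__init__()
        self.dropouts = nn.ModuleList(nn.Dropout(p) for _ in range(n_samples))
        self.fc = nn.Linear(hidden, n_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.stack([self.fc(d(x)) for d in self.dropouts], 0).mean(0)
```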


  28. Multi modal competitions | TIPS
    4
    • Train models using only images
    • Train models using only texts
    • Stacking using the above two predictions (sketch below)
    ○ In our experience, this gives better accuracy than training on images and texts simultaneously.
    ○ TReNDS 3rd, Shopee 10th
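
A sketch of the stacking step, assuming out-of-fold predictions `oof_image` and `oof_text` from the two single-modality models are already loaded; the Ridge stacker is illustrative, and a GBDT or small MLP slots in the same way:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

# oof_image, oof_text: OOF predictions on the train set (assumed loaded);
# test_image, test_text: corresponding test-set predictions; y: targets.
X_stack = np.column_stack([oof_image, oof_text])
stacker = Ridge(alpha=1.0)

# OOF score of the stacked model itself, for honest validation.
oof_stack = cross_val_predict(stacker, X_stack, y, cv=5)

stacker.fit(X_stack, y)  # refit on all data before predicting on test
test_pred = stacker.predict(np.column_stack([test_image, test_text]))
```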


  29. Summary


  30. Summary
    1
    • We won first place on the meetup leaderboard!
    • Tabular: Target Encoding, Text Feature, Neural Network, (Find Leak)
    • Image: Swin Large
    • NLP: Bigger model and Domain specific model
    • ALL: Ensemble is all you need!
    https://kaggledays.com/championship/leaderboard/


  31. THANK YOU
