Slide 1

Slide 1 text

APTOS 2019 Blindness Detection
Detect diabetic retinopathy to stop blindness before it's too late
Team: 2 AI Ophthalmologists (Maxwell and tereka)

Slide 2

Slide 2 text

1. Competition Overview
2. Result
3. Model Pipeline
4. What we did (worked / did not work)
5. Takeaway

Slide 3

Slide 3 text

1. Competition Overview
2. Result
3. Model Pipeline
4. What we did (worked / did not work)
5. Takeaway

Slide 4

Slide 4 text

Aravind Eye Hospital (Madurai, Tamil Nadu) / Asia Pacific Tele-Ophthalmology Society 2019
Detect diabetic retinopathy to stop blindness before it's too late.
Goal: build a model to help identify diabetic retinopathy automatically.
- Aravind Eye Hospital technicians travel to rural areas to capture retina images.
- There is a shortage of highly trained doctors to review the images and provide diagnoses in rural areas of India.

Slide 5

Slide 5 text

Various Localized Findings of Diabetic Retinopathy

Slide 6

Slide 6 text

Data
- Number of images: train 3,662 / public test 1,928 / private test ~11,000
- Target label:
  0: No DR
  1: Mild
  2: Moderate
  3: Severe
  4: Proliferative DR

Slide 7

Slide 7 text

Data Issues
1. Few images: external data (DRD, IDRiD) allowed
2. Target imbalance
3. Target distribution: train and public test differ; what about private?
4. Duplicated images
5. Various image sizes: train and test differ
6. Label inconsistency: different labels from each ophthalmologist

[Figures: duplicated images, target distributions, image size distributions. Credit: https://www.kaggle.com/currypurin/image-shape-distribution-previous-and-present]

Slide 8

Slide 8 text

Label Inconsistency
The diagnoses of ophthalmologists are inconsistent...
https://youtu.be/oOeZ7IgEN4o?t=145

Slide 9

Slide 9 text

Evaluation: Quadratic Weighted Kappa

$$\kappa = 1 - \frac{\sum_{i,j} \omega_{i,j} O_{i,j}}{\sum_{i,j} \omega_{i,j} E_{i,j}}, \qquad \omega_{i,j} = (i - j)^2$$

where $O$ is the observed (confusion-matrix) count and $E$ the expected count under chance agreement.

- Confusion-matrix-based metric: continuous predictions must be converted to labels (threshold optimization) => unstable
- Target distribution dependence: the score depends on the label distribution => unstable

An unstable metric! (A small computation sketch follows.)
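For reference, here is a minimal sketch of computing QWK, assuming scikit-learn is available; the label vectors are made up purely for illustration:

```python
# Minimal QWK sketch: scikit-learn's cohen_kappa_score with quadratic weights.
# The label vectors below are illustrative, not competition data.
import numpy as np
from sklearn.metrics import cohen_kappa_score

y_true = np.array([0, 1, 2, 3, 4, 2, 1, 0])
y_pred = np.array([0, 1, 2, 2, 4, 3, 1, 1])

# weights="quadratic" penalizes a miss by (i - j)^2, matching the formula above
qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")
print(f"QWK: {qwk:.4f}")
```

Because the metric is computed from a confusion matrix, moving a single decision threshold can re-bin many samples at once, which is one source of the instability noted above.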

Slide 10

Slide 10 text

1. Competition Overview
2. Result
3. Model Pipeline
4. What we did (worked / did not work)
5. Takeaway

Slide 11

Slide 11 text

Our Final Selected Submissions

Submission 2 (3 models blended)
- Public LB: 0.825 / 65th
- Private LB: 0.927 / 28th
- CV: 0.9363 on all of APTOS, 0.7542 on higher-grade APTOS

Submission 1 (2 models blended)
- Public LB: 0.832 / 38th
- Private LB: 0.925 / 47th-60th?
- CV: 0.9359 on all of APTOS, 0.7557 on higher-grade APTOS

Slide 12

Slide 12 text

Our Submission History
Hard to get a consistent correlation; inconsistent everywhere...
- Local CV vs. Public
- Public vs. Private

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

1. Competition Overview
2. Result
3. Model Pipeline
4. What we did (worked / did not work)
5. Takeaway

Slide 15

Slide 15 text

APTOS 2019 Blindness Detection: Model Pipeline (Copyright 2019 @ Maxwell_110)
Local CV 0.936 on APTOS train / Public LB 0.832 (38th) / Private LB 0.927 (28th)

Model 1: SE-ResNeXt50 (pre-trained on ImageNet), ordinal regression
- Preprocessing: remove black background, resize to 320 x 320, RGB brightness normalization (ImageNet base)
- Pre-training on DRD 2015 + IDRiD, then fine-tuning on APTOS (using all data)
- Augmentation: HorizontalFlip, Brightness, Contrast, RGBShift, Scale, Rotate, RandomErasing
- Training: stratified 5-fold, BCE loss, early stopping with BCE, Adam 1e-4
- TTA (3 times): HorizontalFlip, Brightness, Contrast, RGBShift, Scale, Rotate
- Prediction encoding

Models 2 and 3: EfficientNet B2 / B3 (pre-trained on ImageNet), regression
- Preprocessing: remove black background, CLAHE (adaptive histogram equalization), resize to 300 / 260
- Grade-balanced sampling: sampling rate changed with disease grade (1 : 2 : 2 : 2 : 2)
- Augmentation: HorizontalFlip, Brightness, Contrast, RGBShift, Scale, Rotate, Shear
- Training: stratified 5-fold, input normalized with BN, clipped MSE (CMSE), early stopping with CMSE, Adam 1e-3 / 5e-4
- TTA (3 / 3 times): Brightness, Contrast, RGBShift, Scale, Rotate

Prediction and Ensemble
- Blending coefficients: SE-ResNeXt50 0.469, EfficientNet B2 0.273, EfficientNet B3 0.258
- QWK optimization with the Nelder-Mead solver (a sketch follows)
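The pipeline names QWK optimization with a Nelder-Mead solver for the blending coefficients. Below is a hedged sketch of how such a search could look with SciPy; the function names, the out-of-fold prediction layout, and the fixed rounding thresholds are my own assumptions, not the team's code:

```python
# Sketch: search blending weights that maximize QWK with Nelder-Mead.
# Assumes each model's OOF predictions are already scalars in [0, 4];
# names and the 0.5/1.5/2.5/3.5 thresholds are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import cohen_kappa_score

def neg_qwk(weights, preds, y_true, thresholds=(0.5, 1.5, 2.5, 3.5)):
    """Blend (n_models, n_samples) predictions, bin to grades 0-4, negate QWK."""
    blended = np.tensordot(weights, preds, axes=1)
    grades = np.digitize(blended, thresholds)
    return -cohen_kappa_score(y_true, grades, weights="quadratic")

def fit_blend_weights(preds, y_true):
    n_models = preds.shape[0]
    result = minimize(
        neg_qwk,
        x0=np.full(n_models, 1.0 / n_models),  # start from a uniform blend
        args=(preds, y_true),
        method="Nelder-Mead",
    )
    w = np.abs(result.x)
    return w / w.sum()  # normalize so the coefficients sum to 1
```

Nelder-Mead is a natural fit here because the binned QWK is piecewise constant in the weights, so gradient-based solvers have nothing to follow.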

Slide 16

Slide 16 text

Where our strategy failed...
- QWK depends on the label distribution
- The label distributions of train and public test are different
- "OK... then private will also be different from train."

Our strategy was:
- Build 2 models trained with different label distributions (train-like and public-like)
- Blend them to get a robust prediction

But private turned out to be train-like...

Slide 17

Slide 17 text

1. Competition Overview
2. Result
3. Model Pipeline
4. What we did (worked / did not work)
5. Takeaway

Slide 18

Slide 18 text

4–1. Ordinal Regression?

Bear classification: 0 = [1, 0, 0], 1 = [0, 1, 0], 2 = [0, 0, 1]
Severity estimation: 0: No DR = [1, 0, 0], 1: Mild = [0, 1, 0], 2: Moderate = [0, 0, 1]

Cao et al., 2019. Rank-consistent Ordinal Regression for Neural Networks.

Slide 19

Slide 19 text

Bear classification: labels are independent of each other.
0 = [1, 0, 0], 1 = [0, 1, 0], 2 = [0, 0, 1]

Severity estimation: labels are dependent on each other.
0: No DR = [1, 0, 0], 1: Mild = [0, 1, 0], 2: Moderate = [0, 0, 1]
When the diagnosis is 2, the eye also contains the findings of diagnosis 1 (or equivalent).

Slide 20

Slide 20 text

Naive classification: FC layer => one-hot targets [1, 0, 0], [0, 1, 0], [0, 0, 1]; softmax loss (multi-class)

Rank-consistent ordinal regression: FC layer => cumulative binary targets [0, 0], [1, 0], [1, 1]; BCE loss (multi-label); decode by summation: 0 + 0 = 0, 1 + 0 = 1, 1 + 1 = 2 (a small encoding sketch follows)
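A minimal sketch of this encoding and decoding, assuming grades 0-4 (function names are illustrative):

```python
# Extended binary encoding for ordinal labels: grade k becomes K-1 binary
# labels [y > 0, y > 1, ..., y > K-2]; a prediction is decoded by summation.
import numpy as np

K = 5  # DR grades 0..4

def encode_ordinal(y, num_classes=K):
    """Grade 2 -> [1, 1, 0, 0]: one binary task per rank threshold."""
    return (y > np.arange(num_classes - 1)).astype(np.float32)

def decode_ordinal(binary_probs, threshold=0.5):
    """Sum of thresholded binary outputs recovers the grade."""
    return int((binary_probs > threshold).sum())

print(encode_ordinal(2))                                # [1. 1. 0. 0.]
print(decode_ordinal(np.array([0.8, 0.7, 0.4, 0.1])))   # 2
```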

Slide 21

Slide 21 text

Theory

Sigmoid: $s(z) = \frac{1}{1 + e^{-z}}$
Number of classes: $K$ (= 5)
Retina image features (input of the CNN): $x_i$
Rank label: $y_i \in \{0, 1, 2, 3, 4\}$

Each rank label is expanded into $K - 1$ binary labels:
$$y_i \Rightarrow \left(y_i^{(1)}, \dots, y_i^{(k)}, \dots, y_i^{(K-1)}\right), \qquad y_i^{(k)} = \mathbb{I}\{y_i > r_k\}$$
e.g. $y_i = 2 \Rightarrow (1, 1, 0, 0)$, i.e. $y_i^{(1)} = y_i^{(2)} = 1$ and $y_i^{(3)} = y_i^{(4)} = 0$.

The $K - 1$ binary tasks share the same weight parameters $\omega_j$ but have independent bias units $b_k$:
$$P\left(y_i^{(k)} = 1\right) = s\left(\sum_j^m \omega_j a_j + b_k\right) = s\left(g(x_i, W) + b_k\right)$$
where $g(x_i, W)$ is the output function of the CNN. The predicted rank is
$$h(x_i) = \sum_{k=1}^{K-1} f_k(x_i), \qquad f_k(x_i) = \mathbb{I}\left\{P\left(y_i^{(k)} = 1\right) > 0.5\right\}$$
e.g. $\left[P(y_i^{(1)} = 1), P(y_i^{(2)} = 1), P(y_i^{(3)} = 1), P(y_i^{(4)} = 1)\right] = [0.8, 0.7, 0.4, 0.1] \Rightarrow h(x_i) = 1 + 1 + 0 + 0 = 2$

Predicted rank condition: $f_1(x_i) \ge f_2(x_i) \ge \dots \ge f_{K-1}(x_i)$, required for the ordinal information and rank-monotonicity. Let's prove it in the next slide!
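A hedged PyTorch sketch of such an output head: one shared weight vector with independent per-rank biases, following the CORAL construction above. The class name and feature size are assumptions, not the team's code:

```python
# CORAL-style output head: all K-1 binary rank tasks share one weight
# vector (the nn.Linear with bias=False) and differ only in their bias b_k.
import torch
import torch.nn as nn

class CoralHead(nn.Module):
    def __init__(self, in_features: int, num_classes: int = 5):
        super().__init__()
        self.fc = nn.Linear(in_features, 1, bias=False)            # shared g(x, W)
        self.biases = nn.Parameter(torch.zeros(num_classes - 1))   # b_1..b_{K-1}

    def forward(self, x):
        # logit_k = g(x, W) + b_k, so P(y_i^(k) = 1) = sigmoid(logit_k)
        return self.fc(x) + self.biases  # shape: (batch, K-1)

head = CoralHead(in_features=2048)   # e.g. pooled SE-ResNeXt50 features
logits = head(torch.randn(4, 2048))
probs = torch.sigmoid(logits)        # P(y > r_k) for k = 1..K-1
rank = (probs > 0.5).sum(dim=1)      # predicted grade h(x) in 0..4
```

Training would pair this head with `nn.BCEWithLogitsLoss` against the cumulative binary targets from the earlier encoding sketch.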

Slide 22

Slide 22 text

Proof by contradiction (using the loss function and its optimal solution $(W^*, b^*)$).

Theorem: $f_1(x_i) \ge f_2(x_i) \ge \dots \ge f_{K-1}(x_i)$ is satisfied at the optimal solution $(W^*, b^*)$.
- $W^*$: optimal CNN weights on the train data
- $b^*$: optimal biases of the final layer on the train data

Binary cross entropy as the loss function:
$$L(W, b) = -\sum_{i=1}^{N} \sum_{k=1}^{K-1} \left[ y_i^{(k)} \log s\!\left(g(x_i, W) + b_k\right) + \left(1 - y_i^{(k)}\right) \log\!\left(1 - s\!\left(g(x_i, W) + b_k\right)\right) \right]$$

Sufficient condition:
$$P\left(y_i^{(1)} = 1\right) \ge \dots \ge P\left(y_i^{(K-1)} = 1\right) \;\Rightarrow\; f_1(x_i) \ge \dots \ge f_{K-1}(x_i)$$
$$P\left(y_i^{(1)} = 1\right) \ge \dots \ge P\left(y_i^{(K-1)} = 1\right) \;\Leftrightarrow\; b_1^* \ge b_2^* \ge \dots \ge b_{K-1}^* \quad \text{(show this!)}$$

Suppose $(W^*, b^*)$ is an optimal solution and $b_k < b_{k+1}$ for some $k$. Let
$$A_1 = \{n : y_n^{(k)} = y_n^{(k+1)} = 1\}, \quad A_2 = \{n : y_n^{(k)} = y_n^{(k+1)} = 0\}, \quad A_3 = \{n : y_n^{(k)} = 1,\, y_n^{(k+1)} = 0\}$$
with $A_1 \cup A_2 \cup A_3 = \{1, 2, \dots, N\}$. Denote $p_n(b_k) = s\!\left(g(x_n, W) + b_k\right)$ and
$$\delta_n = \log p_n(b_{k+1}) - \log p_n(b_k) > 0, \qquad \delta_n' = \log\!\left(1 - p_n(b_k)\right) - \log\!\left(1 - p_n(b_{k+1})\right) > 0$$

If we replace $b_k$ with $b_{k+1}$, the change of the loss is
$$\Delta_1 L = -\sum_{n \in A_1} \delta_n + \sum_{n \in A_2} \delta_n' - \sum_{n \in A_3} \delta_n$$
If we replace $b_{k+1}$ with $b_k$,
$$\Delta_2 L = \sum_{n \in A_1} \delta_n - \sum_{n \in A_2} \delta_n' - \sum_{n \in A_3} \delta_n'$$
Then
$$\Delta_1 L + \Delta_2 L = -\sum_{n \in A_3} \left(\delta_n + \delta_n'\right) < 0 \;\Rightarrow\; \Delta_1 L < 0 \text{ or } \Delta_2 L < 0$$
Contradiction: a more optimal solution exists (replace $b_k$ so that $b_1^* \ge b_2^* \ge \dots \ge b_{K-1}^*$).

Slide 23

Slide 23 text

$$P\left(y_i^{(1)} = 1\right) \ge P\left(y_i^{(2)} = 1\right) \ge \dots \ge P\left(y_i^{(K-1)} = 1\right)$$
This is satisfied only at the optimal solution $(W^*, b^*)$:
- $W^*$: optimal CNN weights on the train data (~14,000 examples)
- $b^*$: optimal biases of the final layer on the train data

When using mini-batch training and OOF prediction, this assumption will be violated a little.
e.g. $[p_1, p_2, p_3, p_4] = [0.90, 0.48, 0.52, 0.30]$: hard encoding with threshold 0.5 => diagnosis 3!?

Slide 24

Slide 24 text

Prediction Encoding (Soft Encoding)
To blend with the naive regression models, we need to encode the 4 probabilities into one scalar value:
$$P_{\text{encoded}} = P\left(y_i^{(1)} = 1\right) + P\left(y_i^{(2)} = 1\right) + P\left(y_i^{(3)} = 1\right) + P\left(y_i^{(4)} = 1\right), \qquad 0 \le P_{\text{encoded}} \le 4$$
e.g. $[p_1, p_2, p_3, p_4] = [0.90, 0.48, 0.52, 0.30]$
- Hard encoding with threshold 0.5 => diagnosis 3
- Soft encoding => 2.20
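In code the soft encoding is just a sum of the K - 1 probabilities; a minimal sketch using the slide's example vector:

```python
# Soft encoding: sum the K-1 rank probabilities into one scalar in [0, 4]
# so ordinal-regression outputs can be blended with plain regressors.
import numpy as np

probs = np.array([0.90, 0.48, 0.52, 0.30])  # P(y > r_k) from the slide

# Hard thresholding at 0.5 gives [1, 0, 1, 0], which is ambiguous because
# rank-monotonicity is violated; the soft sum sidesteps the ambiguity.
p_encoded = float(probs.sum())
print(p_encoded)  # 2.20
```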

Slide 25

Slide 25 text

4–2. Others
a. EfficientNet: https://arxiv.org/abs/1905.11946
b. MixUp and CutMix: https://arxiv.org/abs/1710.09412 , https://arxiv.org/abs/1905.04899
c. Ben's preprocessing
d. Statistics values (mean, std, quantile, ...), implemented as a `Lambda Layer`
e. Pseudo labeling
f. RAdam: https://arxiv.org/abs/1908.03265

Why did MixUp and CutMix not work? (They worked only on private.)
- MixUp: Diag1 * 0.5 + Diag2 * 0.5 => diagnosis = 1.5? Maybe not... I think it is more than 2, due to the non-linear mapping from findings to labels.
- CutMix: Diag1 * Mask1 + Diag2 * Mask2 (Diag 1 <--- ---> Diag 2) => diagnosis = 1.7? My guess is 2.0 or so...
A MixUp sketch follows.
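For reference, a minimal MixUp sketch (after Zhang et al., https://arxiv.org/abs/1710.09412) for a batch of images with scalar severity targets; it makes explicit the linear label mixing that the slide argues may be wrong for DR grades. The function name and alpha value are illustrative assumptions:

```python
# Minimal MixUp: convex-combine two shuffled batches of images and their
# scalar targets with lambda ~ Beta(alpha, alpha). Note the label is mixed
# linearly (grade 1 and grade 2 at 50/50 -> 1.5), which the slide suggests
# may understate the true mixed severity.
import torch

def mixup(images: torch.Tensor, targets: torch.Tensor, alpha: float = 0.4):
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed_images = lam * images + (1 - lam) * images[perm]
    mixed_targets = lam * targets + (1 - lam) * targets[perm]
    return mixed_images, mixed_targets
```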

Slide 26

Slide 26 text

1. Competition Overview
2. Result
3. Model Pipeline
4. What we did (worked / did not work)
5. Takeaway

Slide 27

Slide 27 text

- 24 GB GDDR6
- 1,770 MHz
- 576 Tensor Cores
- 4,608 CUDA cores