APTOS 2019 28th place solution

APTOS 2019 Blindness Detection
Detect diabetic retinopathy to stop blindness before it's too late
Team: 2 AI Ophthalmologists
Maxwell and tereka

1. Competition Overview
2. Result
3. Model Pipeline
4. What we did (worked / not worked)
5. Takeaway

Aravind Eye Hospital, Madurai, Tamil Nadu
Asia Pacific Tele-Ophthalmology Society 2019

Detect diabetic retinopathy to stop blindness before it's too late

- Build a model to help identify diabetic retinopathy automatically
- Aravind Eye Hospital technicians travel to rural areas to capture images
- In rural areas of India, there is a shortage of highly trained doctors to review the images and provide a diagnosis

Localized Various Findings of Diabetic Retinopathy

Data

- Number of images
  - train: 3,662
  - public: 1,928
  - private: ~11,000
- Target label
  - 0: No DR
  - 1: Mild
  - 2: Moderate
  - 3: Severe
  - 4: Proliferative DR

Data Issue

1. Few images
   - External data (DRD, IDRiD) allowed
2. Target imbalance
3. Target distribution
   - Train and public differ
   - How is private?
4. Duplicated images
5. Various image sizes
   - Train and test differ
6. Label inconsistency
   - Each ophthalmologist assigns different labels

Figures: Duplicated Images, Target Distributions, Image Size Distributions
Credit to: https://www.kaggle.com/currypurin/image-shape-distribution-previous-and-present

Label Inconsistency
https://youtu.be/oOeZ7IgEN4o?t=145
The diagnoses of ophthalmologists are inconsistent...

Evaluation: Quadratic Weighted Kappa

$\kappa = 1 - \dfrac{\sum_{i,j} \omega_{i,j} O_{i,j}}{\sum_{i,j} \omega_{i,j} E_{i,j}}, \qquad \omega_{i,j} = (i - j)^2$

- Confusion matrix-based metric: label prediction (threshold optimization) => unstable
- Target distribution dependence: depends on the label distribution => unstable

Unstable metric!
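The metric above can be sketched in plain Python as follows. This is a minimal illustration rather than the competition's scoring code: the function name `quadratic_weighted_kappa` is hypothetical, `O` is the observed confusion matrix, and `E` is the expected matrix built from the two label histograms, scaled to the same total count as `O`.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_classes=5):
    """QWK = 1 - sum(w * O) / sum(w * E) with w[i][j] = (i - j) ** 2."""
    n = len(y_true)
    # Observed confusion matrix O[i][j]: true grade i, predicted grade j.
    observed = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(num_classes))
                 for j in range(num_classes)]
    numerator = 0.0
    denominator = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            weight = (i - j) ** 2
            # Expected count if true and predicted labels were independent.
            expected = hist_true[i] * hist_pred[j] / n
            numerator += weight * observed[i][j]
            denominator += weight * expected
    # Note: denominator is zero if all labels fall into one single grade.
    return 1.0 - numerator / denominator


print(quadratic_weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 4]))
```

Because `E` is built from the label histograms, the same per-image errors can give a different score when the grade distribution changes, which is the distribution dependence noted on the slide.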

Our Final Selected Submissions

Submission 2 (3 models blended)
- Public LB: 0.825 / 65th
- Private LB: 0.927 / 28th
- CV: 0.9363 on APTOS all, 0.7542 on Higher Grade APTOS

Submission 1 (2 models blended)
- Public LB: 0.832 / 38th
- Private LB: 0.925 / 47th - 60th ?
- CV: 0.9359 on APTOS all, 0.7557 on Higher Grade APTOS

Our Submission History

Hard to get a consistent correlation. Everywhere inconsistent...
- Local CV vs Public
- Public vs Private

Model Pipeline (APTOS 2019 Blindness Detection)

Data: APTOS, IDRiD, DRD 2015, Test
Stage labels on the diagram: Pre-training, Fine tuning, Using all data

SE-ResNext 50 (Pre-trained), Ordinal Regression
- Preprocessing: remove black background, resize to 320 x 320, RGB brightness normalization (ImageNet base)
- Augmentation: HorizontalFlip, Brightness, Contrast, RGBShift, Scale, Rotate, RandomErasing
- Training: Stratified 5 fold, pretrained on ImageNet, BCE loss, early stopping with BCE, Adam 1e-4
- TTA (3 times): HorizontalFlip, Brightness, Contrast, RGBShift, Scale, Rotate
- Prediction Encoding

EfficientNet B2 (Pre-trained) and EfficientNet B3 (Pre-trained), Regression
- Preprocessing: remove black background, CLAHE (adaptive histogram equalization), resize (sizes 300 and 260 on the diagram)
- Grade Balanced Sampling: changing sampling rate with disease grade (1 : 2 : 2 : 2 : 2)
- Augmentation: HorizontalFlip, Brightness, Contrast, RGBShift, Scale, Rotate, Shear
- Training: Stratified 5 fold, pretrained on ImageNet, input normalized with BN, Clipped MSE (CMSE), early stopping with CMSE, Adam 1e-3 / 5e-4
- TTA (3 / 3 times): Brightness, Contrast, RGBShift, Scale, Rotate

Prediction and Ensemble (Blending)
- Blending coefficients: SE-ResNext 50: 0.469, EfficientNet B2: 0.273, EfficientNet B3: 0.258
- QWK optimization with Nelder-Mead solver

Result
- Local: 0.936 on APTOS train
- Public LB: 0.832 (Public: 38th)
- Private LB: 0.927 (Private: 28th)

Copyright 2019 @ Maxwell_110
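As a rough sketch of the final blending step, the snippet below applies the three blending coefficients listed on the slide to per-image scalar predictions and converts the blended score into a grade. The model keys, the helper names `blend` and `to_grade`, and the thresholds (0.5, 1.5, 2.5, 3.5) are assumptions made for illustration. The slide does not state the thresholds, and the Nelder-Mead QWK optimization itself is not reproduced here.

```python
# Blending coefficients taken from the slide (they sum to 1.0).
COEFFS = {
    "se_resnext50": 0.469,
    "efficientnet_b2": 0.273,
    "efficientnet_b3": 0.258,
}


def blend(preds):
    """Weighted average of per-model scalar predictions.

    preds maps a model name to a list of scores in [0, 4], one per image.
    """
    n = len(next(iter(preds.values())))
    return [sum(COEFFS[name] * preds[name][i] for name in COEFFS)
            for i in range(n)]


def to_grade(score, thresholds=(0.5, 1.5, 2.5, 3.5)):
    """Map a continuous score to a grade 0..4 (thresholds are hypothetical)."""
    return sum(1 for t in thresholds if score > t)


example = {
    "se_resnext50": [2.2],      # e.g. a soft-encoded ordinal prediction
    "efficientnet_b2": [1.0],
    "efficientnet_b3": [3.0],
}
blended = blend(example)
print(blended, [to_grade(s) for s in blended])
```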

Our strategy failed...

- QWK depends on the label distribution
- The distributions of train and public test are different
  "OK... Then private will also be different from train."

Our strategy was:
- Build 2 models trained with different label distributions (train-like and public-like)
- Blend those to get a robust prediction

Private was train-like...

4–1. Ordinal Regression?

Cao et al. 2019. Rank-consistent-Ordinal-Regression-Neural-Networks.

Bear classification: [ 1, 0, 0 ], [ 0, 0, 1 ], [ 0, 1, 0 ]

Severity estimation:
- 0: No DR = [ 1, 0, 0 ]
- 1: Mild = [ 0, 1, 0 ]
- 2: Moderate = [ 0, 0, 1 ]

Bear classification: [ 1, 0, 0 ], [ 0, 0, 1 ], [ 0, 1, 0 ]
Labels are independent of each other.

Severity estimation:
- 0: No DR = [ 1, 0, 0 ]
- 1: Mild = [ 0, 1, 0 ]
- 2: Moderate = [ 0, 0, 1 ]
Labels depend on each other. When diagnosis = 2, the eye will also contain the findings of diagnosis 1 or equivalent.

Naive Classification: FC layer, Soft-Max loss, multi-class
- Targets: [ 1, 0, 0 ], [ 0, 0, 1 ], [ 0, 1, 0 ]

Rank-Consistent Ordinal Regression: FC layer, BCE loss, multi-label
- [ 0, 0 ] = 0 + 0 = 0
- [ 1, 0 ] = 1 + 0 = 1
- [ 1, 1 ] = 1 + 1 = 2

Theory

- Sigmoid: $s(z) = \dfrac{1}{1 + e^{-z}}$
- The number of classes: $K$ (= 5)
- Retina image features (input of CNN): $x_i$
- Rank label: $y_i \in \{0, 1, 2, 3, 4\}$
- Extended labels: $y_i \Rightarrow y_i^{(1)}, \dots, y_i^{(k)}, \dots, y_i^{(K-1)}$ with $y_i^{(k)} = I\{y_i > r_k\}$
  e.g. $y_i = 2 \Rightarrow [1, 1, 0, 0]$, i.e. $y_i^{(1\ \text{or}\ 2)} = 1$ and $y_i^{(3\ \text{or}\ 4)} = 0$
- The $(K - 1)$ binary tasks share the same weight parameters $\omega_j$, but have independent bias units $b_k$.
- Output function of CNN: $P(y_i^{(k)} = 1) = s\left(\sum_{j}^{m} \omega_j a_j + b_k\right) = s(g(x_i, W) + b_k)$
- Binary decision per task: $f_k(x_i) = I\{P(y_i^{(k)} = 1) > 0.5\}$
- Predicted rank: $h(x_i) = \sum_{k=1}^{K-1} f_k(x_i)$
  e.g. $[P(y_i^{(1)} = 1), P(y_i^{(2)} = 1), P(y_i^{(3)} = 1), P(y_i^{(4)} = 1)] = [0.8, 0.7, 0.4, 0.1] \Rightarrow h(x_i) = 1 + 1 + 0 + 0 = 2$
- Condition: $f_1(x_i) \ge f_2(x_i) \ge \dots \ge f_{K-1}(x_i)$
  Required for the ordinal information and rank-monotonicity => Let's prove it in the next slide!
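The label extension and the predicted rank can be sketched in a few lines of Python. The helper names `extend_label` and `predicted_rank` are hypothetical, and the sketch assumes the rank thresholds are r_k = k - 1, which is what the slide's example (y = 2 gives [1, 1, 0, 0]) implies.

```python
def extend_label(y, num_classes=5):
    """Turn a rank label y in 0..K-1 into K-1 binary labels I{y > r_k}."""
    # Assumption: r_k = k - 1, so the thresholds are 0, 1, ..., K - 2.
    return [1 if y > r_k else 0 for r_k in range(num_classes - 1)]


def predicted_rank(probs, threshold=0.5):
    """h(x) = sum over k of f_k(x), where f_k(x) = I{P(y^(k) = 1) > threshold}."""
    return sum(1 for p in probs if p > threshold)


print(extend_label(2))                       # [1, 1, 0, 0]
print(predicted_rank([0.8, 0.7, 0.4, 0.1]))  # 2
```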

Proof by contradiction (using the loss function and its optimal solution $(W^*, b^*)$)

Theorem: $f_1(x_i) \ge f_2(x_i) \ge \dots \ge f_{K-1}(x_i)$ is satisfied with the optimal solution $(W^*, b^*)$.
- $W^*$: optimal CNN weights with train data
- $b^*$: optimal biases of the final layer with train data

Binary cross entropy as loss function:

$L(W, b) = -\sum_{i=1}^{N} \sum_{k=1}^{K-1} \left[ y_i^{(k)} \log s(g(x_i, W) + b_k) + (1 - y_i^{(k)}) \log\left(1 - s(g(x_i, W) + b_k)\right) \right]$

Sufficient condition:

$P(y_i^{(1)} = 1) \ge P(y_i^{(2)} = 1) \ge \dots \ge P(y_i^{(K-1)} = 1) \Rightarrow f_1(x_i) \ge f_2(x_i) \ge \dots \ge f_{K-1}(x_i)$

$P(y_i^{(1)} = 1) \ge P(y_i^{(2)} = 1) \ge \dots \ge P(y_i^{(K-1)} = 1) \Leftrightarrow b_1^* \ge b_2^* \ge \dots \ge b_{K-1}^*$

So it is enough to show $b_1^* \ge b_2^* \ge \dots \ge b_{K-1}^*$.

Suppose $(W^*, b^*)$ is an optimal solution and $b_k < b_{k+1}$ for some $k$. Let

$A_1 = \{n : y_n^{(k)} = y_n^{(k+1)} = 1\}$, $A_2 = \{n : y_n^{(k)} = y_n^{(k+1)} = 0\}$, $A_3 = \{n : y_n^{(k)} = 1,\ y_n^{(k+1)} = 0\}$, with $A_1 \cup A_2 \cup A_3 = \{1, 2, \dots, N\}$.

Denote $p_n(b) = s(g(x_n, W) + b)$ and

$\delta_n = \log p_n(b_{k+1}) - \log p_n(b_k) > 0$
$\delta_n' = \log(1 - p_n(b_k)) - \log(1 - p_n(b_{k+1})) > 0$

If replacing $b_k$ with $b_{k+1}$, the change of the loss is given as

$\Delta_1 L = -\sum_{n \in A_1} \delta_n + \sum_{n \in A_2} \delta_n' - \sum_{n \in A_3} \delta_n$

If replacing $b_{k+1}$ with $b_k$,

$\Delta_2 L = \sum_{n \in A_1} \delta_n - \sum_{n \in A_2} \delta_n' - \sum_{n \in A_3} \delta_n'$

Then we have (assuming $A_3$ is not empty)

$\Delta_1 L + \Delta_2 L = -\sum_{n \in A_3} (\delta_n + \delta_n') < 0 \Rightarrow \Delta_1 L < 0 \ \text{or}\ \Delta_2 L < 0$

This is contradictory: a more optimal solution exists!!! (Replace $b_k$ to satisfy $b_1^* \ge b_2^* \ge \dots \ge b_{K-1}^*$.)
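The key step of the proof can be checked numerically on a toy problem. The data below are made up purely for illustration (K = 3, three examples, shared outputs g(x_n, W) chosen by hand), and the names `bce_loss`, `delta1` and `delta2` are hypothetical. The biases deliberately violate b_1 >= b_2, and the script measures how the loss changes under the two replacements described above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_loss(g_values, ext_labels, biases):
    """Binary cross entropy summed over examples and the K-1 binary tasks."""
    total = 0.0
    for g, labels in zip(g_values, ext_labels):
        for y_k, b_k in zip(labels, biases):
            p = sigmoid(g + b_k)
            total -= y_k * math.log(p) + (1 - y_k) * math.log(1.0 - p)
    return total


# Hypothetical toy data: K = 3 classes, so 2 binary tasks.
g_values = [-1.0, 0.0, 1.0]            # shared outputs g(x_n, W)
ext_labels = [[0, 0], [1, 0], [1, 1]]  # extended labels for y = 0, 1, 2
biases = [0.0, 0.5]                    # violates b_1 >= b_2

base = bce_loss(g_values, ext_labels, biases)
delta1 = bce_loss(g_values, ext_labels, [0.5, 0.5]) - base  # replace b_1 with b_2
delta2 = bce_loss(g_values, ext_labels, [0.0, 0.0]) - base  # replace b_2 with b_1

# The sum is negative, so at least one replacement lowers the loss.
print(delta1, delta2, delta1 + delta2)
```

In this toy case the middle example is the only member of A_3, and it alone drives the sum below zero, matching the final inequality of the proof.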

This is satisfied only in the optimal solution $(W^*, b^*)$.
- $W^*$: optimal CNN weights with train data
- $b^*$: optimal biases of the final layer with train data

$P(y_i^{(1)} = 1) \ge P(y_i^{(2)} = 1) \ge \dots \ge P(y_i^{(K-1)} = 1)$

When using mini-batch training and OOF prediction, this assumption will be violated a little.
~14,000 examples

e.g. [p1, p2, p3, p4] = [0.90, 0.48, 0.52, 0.30]
Hard encoding with threshold 0.5 => diagnosis 3!?
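A simple way to see how often this happens is to test each predicted probability vector for monotonicity. This is only a sketch: the function names are hypothetical, and the batch is built from the two example vectors that appear in the slides.

```python
def is_rank_consistent(probs):
    """True if p1 >= p2 >= ... >= p_{K-1} holds for one prediction."""
    return all(a >= b for a, b in zip(probs, probs[1:]))


def count_violations(batch):
    """Number of predictions in a batch that break rank monotonicity."""
    return sum(1 for probs in batch if not is_rank_consistent(probs))


batch = [
    [0.80, 0.70, 0.40, 0.10],  # consistent example from the theory slide
    [0.90, 0.48, 0.52, 0.30],  # violating example: p2 < p3
]
print(count_violations(batch))
```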

Prediction Encoding (Soft Encoding)

To blend with naive regression models, we need to encode the 4 probabilities into one scalar value.

$P_{\text{encoded}} = P(y_i^{(1)} = 1) + P(y_i^{(2)} = 1) + P(y_i^{(3)} = 1) + P(y_i^{(4)} = 1), \qquad 0 \le P_{\text{encoded}} \le 4$

e.g. [p1, p2, p3, p4] = [0.90, 0.48, 0.52, 0.30]
- Hard encoding with threshold 0.5 => diagnosis 3
- Soft encoding => 2.20
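The soft encoding is just the sum of the four probabilities, as the sketch below shows. For the hard encoding, the slide reports diagnosis 3 for this example. One reading that reproduces this, and it is only an assumption here, is to take the highest rank whose probability exceeds 0.5. Counting the probabilities above 0.5, as in the predicted rank h(x), would give 2 instead. All function names are hypothetical.

```python
def soft_encode(probs):
    """Sum of the K-1 task probabilities: a scalar in [0, K-1]."""
    return sum(probs)


def hard_encode_highest(probs, threshold=0.5):
    """Assumed reading of the slide: highest rank k with p_k > threshold."""
    ranks = [k for k, p in enumerate(probs, start=1) if p > threshold]
    return max(ranks) if ranks else 0


def hard_encode_count(probs, threshold=0.5):
    """Alternative reading: number of tasks with p_k > threshold, i.e. h(x)."""
    return sum(1 for p in probs if p > threshold)


probs = [0.90, 0.48, 0.52, 0.30]
print(round(soft_encode(probs), 2))  # 2.2
print(hard_encode_highest(probs))    # 3
print(hard_encode_count(probs))      # 2
```

The soft value stays between the two hard readings and varies smoothly, which makes it easier to average with the regression outputs.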

4–2. Others

a. EfficientNet
   https://arxiv.org/abs/1905.11946
b. MixUp and CutMix
   https://arxiv.org/abs/1710.09412
   https://arxiv.org/abs/1905.04899
c. Ben's preprocessing
d. Statistics values (mean, std, quantile, ...), implemented as a `Lambda Layer`
e. Pseudo labeling
f. RAdam
   https://arxiv.org/abs/1908.03265

worked only on private

Why did MixUp and CutMix not work?
- MixUp (Diagnosis 1 and Diagnosis 2): Diag1 * 0.5 + Diag2 * 0.5, so diagnosis = 1.5? Maybe not... I think it's more than 2, due to the non-linear transformation for findings (label).
- CutMix: Diag1 * Mask1 + Diag2 * Mask2, so diagnosis = 1.7? My guess is 2.0 or so...
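To make the label arithmetic concrete, here is a small sketch of how MixUp and CutMix would assign a regression target when mixing a grade 1 image with a grade 2 image. The helper names and the 30 / 70 mask split are assumptions chosen so the CutMix target comes out at 1.7. The slides do not give the actual mask.

```python
def mixup_label(y1, y2, lam):
    """MixUp target: convex combination of the two labels."""
    return lam * y1 + (1.0 - lam) * y2


def cutmix_label(y1, y2, mask):
    """CutMix target: labels weighted by the pixel share of each image.

    mask holds 1 where the pixel comes from image 1 and 0 where it
    comes from image 2.
    """
    share1 = sum(mask) / len(mask)
    return share1 * y1 + (1.0 - share1) * y2


print(mixup_label(1, 2, lam=0.5))            # 1.5
mask = [1] * 3 + [0] * 7                     # hypothetical: 30% from image 1
print(round(cutmix_label(1, 2, mask), 2))    # 1.7
```

Both targets are linear in the labels, whereas the slide argues that the true grade of the mixed image is decided by which findings remain visible, so it is likely closer to 2 than these values suggest.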

5. Takeaway

- 24GB GDDR6
- 1770 MHz
- 576 Tensor Core
- 4608 CUDA core
