APTOS 2019 28th place solution

Maxwell
September 09, 2019

APTOS 2019 Blindness Detection
https://www.kaggle.com/c/aptos2019-blindness-detection/overview

Team 2 AI Ophthalmologists 28th place solution

twitter: https://twitter.com/Maxwell_110

Transcript

  1. APTOS 2019 Blindness Detection: Detect diabetic retinopathy to stop blindness before it's too late. Team: 2 AI Ophthalmologists (Maxwell and tereka)
  2. 1. Competition Overview 2. Result 3. Model Pipeline 4. What we did (Worked / Not Worked) 5. Takeaway
  3. 1. Competition Overview 2. Result 3. Model Pipeline 4. What we did (Worked / Not Worked) 5. Takeaway
  4. Aravind Eye Hospital (Madurai, Tamil Nadu) and the Asia Pacific Tele-Ophthalmology Society 2019: Detect diabetic retinopathy to stop blindness before it's too late. The task is to build a model that helps identify diabetic retinopathy automatically. Aravind Eye Hospital technicians travel to rural areas of India to capture images, but there is a shortage of highly trained doctors in those areas to review the images and provide a diagnosis.
  5. Data
     Number of images: train 3,662 / public test 1,928 / private test ~11,000
     Target labels: 0 : No DR, 1 : Mild, 2 : Moderate, 3 : Severe, 4 : Proliferative DR
  6. Data Issues
     1. Few images (external data allowed: DRD, IDRiD)
     2. Target imbalance
     3. Target distribution: train and public test differ; what about private?
     4. Duplicated images
     5. Various image sizes: train and test differ
     6. Label inconsistency: different labels from each ophthalmologist
     [Figures: duplicated image examples; target distributions; image size distributions. Credit: https://www.kaggle.com/currypurin/image-shape-distribution-previous-and-present]
  7. Evaluation: Quadratic Weighted Kappa
     κ = 1 − (Σ_{i,j} ω_{i,j} O_{i,j}) / (Σ_{i,j} ω_{i,j} E_{i,j}), with ω_{i,j} = (i − j)²
     - Confusion-matrix-based metric: label prediction requires threshold optimization => unstable
     - Target distribution dependence: the score depends on the label distribution => unstable
     An unstable metric!
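
     As a minimal sketch (not the competition's own implementation), QWK can be computed with scikit-learn, where weights="quadratic" applies the (i − j)² penalty above:

        import numpy as np
        from sklearn.metrics import cohen_kappa_score

        # Ground-truth grades and already-thresholded integer predictions
        y_true = np.array([0, 2, 4, 1, 3, 0, 2])
        y_pred = np.array([0, 2, 3, 1, 4, 0, 1])

        # weights="quadratic" penalizes a miss between grades i and j by (i - j)^2
        qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")
        print(f"QWK = {qwk:.4f}")
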
  8. 1. Competition Overview 2. Result 3. Model Pipeline 4. What we did (Worked / Not Worked) 5. Takeaway
  9. Our Final Selected Submissions
     Submission 1 (2 models blended): Public LB 0.832 (38th), Private LB 0.925 (47th - 60th?), CV 0.9359 on all APTOS, 0.7557 on higher-grade APTOS
     Submission 2 (3 models blended): Public LB 0.825 (65th), Private LB 0.927 (28th), CV 0.9363 on all APTOS, 0.7542 on higher-grade APTOS
  10. Our Submission History: hard to get a consistent correlation. Inconsistent everywhere: local CV vs. public, and public vs. private.
  11. 1. Competition Overview 2. Result 3. Model Pipeline 4. What we did (Worked / Not Worked) 5. Takeaway
  12. Model Pipeline (Public: 38th, Private: 28th; local CV 0.936 on APTOS train, Public LB 0.832, Private LB 0.927)
      Pre-training: DRD 2015 + IDRiD (using all data); fine-tuning: APTOS.

      Model 1: SE-ResNeXt50 (ordinal regression), input 320 x 320
      - Preprocessing: remove black background, resize, RGB brightness normalization (ImageNet base)
      - Augmentation: HorizontalFlip, Brightness, Contrast, RGBShift, Scale, Rotate, RandomErasing
      - Training: stratified 5-fold, pretrained on ImageNet, BCE loss, early stopping with BCE, Adam 1e-4
      - TTA (3 times): HorizontalFlip, Brightness, Contrast, RGBShift, Scale, Rotate
      - Prediction encoding (see slide 21)

      Models 2 / 3: EfficientNet B2 (input 260) / EfficientNet B3 (input 300), regression
      - Preprocessing: remove black background, resize, CLAHE (adaptive histogram equalization)
      - Grade-balanced sampling: sampling rate changed with disease grade (1 : 2 : 2 : 2 : 2)
      - Augmentation: HorizontalFlip, Brightness, Contrast, RGBShift, Scale, Rotate, Shear
      - Training: stratified 5-fold, pretrained on ImageNet, input normalized with BN, clipped MSE (CMSE) loss, early stopping with CMSE, Adam 1e-3 / 5e-4
      - TTA (3 / 3 times): Brightness, Contrast, RGBShift, Scale, Rotate

      Prediction and Ensemble: blending, with QWK optimization using the Nelder-Mead solver.
      Blending coefficients: SE-ResNeXt50 0.469, EfficientNet B2 0.273, EfficientNet B3 0.258.
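
     A minimal sketch of the blend-weight search with the Nelder-Mead solver from SciPy; the out-of-fold predictions and labels here are random placeholders, and the rounding step is a simple stand-in for the team's actual threshold optimization:

        import numpy as np
        from scipy.optimize import minimize
        from sklearn.metrics import cohen_kappa_score

        def negative_qwk(weights, oof_preds, y):
            """Negative QWK of the weighted blend (Nelder-Mead minimizes)."""
            blended = oof_preds @ np.asarray(weights)
            labels = np.clip(np.round(blended), 0, 4).astype(int)  # naive rounding
            return -cohen_kappa_score(y, labels, weights="quadratic")

        rng = np.random.default_rng(0)
        oof_preds = rng.uniform(0, 4, size=(1000, 3))  # placeholder soft predictions, one column per model
        y = rng.integers(0, 5, size=1000)              # placeholder grades

        res = minimize(negative_qwk, x0=[1 / 3, 1 / 3, 1 / 3],
                       args=(oof_preds, y), method="Nelder-Mead")
        print("blend weights:", res.x)
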
  13. Our strategy, which failed: QWK depends on the label distribution, and the label distributions of train and the public test set differ. OK... then private will also differ from train. Our strategy was therefore to build 2 models trained with different label distributions (train-like and public-like) and blend them for a robust prediction. But private turned out to be train-like...
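
     As an illustration of training toward a chosen label distribution, here is a hedged PyTorch sketch using WeightedRandomSampler; the 1 : 2 : 2 : 2 : 2 ratio is the grade sampling rate from the pipeline slide, the per-grade counts are the APTOS train counts (summing to the 3,662 images above), and the dataset is a toy stand-in:

        import numpy as np
        import torch
        from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

        # Grade counts of the APTOS train set (3,662 images total)
        labels = np.repeat([0, 1, 2, 3, 4], [1805, 370, 999, 193, 295])
        target_ratio = np.array([1, 2, 2, 2, 2], dtype=float)  # sampling rate per grade

        # Per-sample weight = desired sampling rate / observed class count
        sample_weights = (target_ratio / np.bincount(labels))[labels]

        sampler = WeightedRandomSampler(torch.as_tensor(sample_weights, dtype=torch.double),
                                        num_samples=len(labels), replacement=True)

        # Toy features standing in for the real image dataset
        dataset = TensorDataset(torch.randn(len(labels), 8), torch.as_tensor(labels))
        loader = DataLoader(dataset, batch_size=32, sampler=sampler)
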
  14. 1. Competition Overview 2. Result 3. Model Pipeline 4. What we did (Worked / Not Worked) 5. Takeaway
  15. 4-1. Ordinal Regression? Bear classification vs. severity estimation: in both tasks, naive one-hot encoding gives 0 : No DR = [1, 0, 0], 1 : Mild = [0, 1, 0], 2 : Moderate = [0, 0, 1]. (Cao et al. 2019, Rank-consistent Ordinal Regression Neural Networks.)
  16. Bear classification: the labels are independent of each other. Severity estimation: the labels are dependent on each other; when the diagnosis is 2, the eye also contains diagnosis-1 disease or equivalent findings. Yet the one-hot encoding is the same: 0 : No DR = [1, 0, 0], 1 : Mild = [0, 1, 0], 2 : Moderate = [0, 0, 1].
  17. Naive classification: FC layer + Soft-Max, multi-class loss, targets [1, 0, 0] / [0, 1, 0] / [0, 0, 1].
      Rank-consistent ordinal regression: FC layer + BCE, multi-label loss, targets [0, 0] / [1, 0] / [1, 1]; the predicted rank is the sum of the binary outputs: 0 + 0 = 0, 1 + 0 = 1, 1 + 1 = 2. A sketch of this target encoding follows below.
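
     A minimal sketch of the extended-target encoding for K ordinal classes (my illustration of the scheme, not the team's code):

        import numpy as np

        def ordinal_targets(y, num_classes=5):
            """Encode rank labels into K-1 cumulative binary targets.

            y = 2 with num_classes = 5 becomes [1, 1, 0, 0]:
            'grade > 0', 'grade > 1', 'grade > 2', 'grade > 3'.
            """
            y = np.asarray(y).reshape(-1, 1)
            thresholds = np.arange(num_classes - 1)  # ranks 0 .. K-2
            return (y > thresholds).astype(np.float32)

        print(ordinal_targets([0, 1, 2, 4]))
        # [[0. 0. 0. 0.]
        #  [1. 0. 0. 0.]
        #  [1. 1. 0. 0.]
        #  [1. 1. 1. 1.]]
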
  18. Theory
      Sigmoid: s(z) = 1 / (1 + e^{−z}). Number of classes: K (= 5).
      Retina image features (input of the CNN): x_i. Rank label: y_i ∈ {0, 1, 2, 3, 4}.
      Extended binary labels: y_i ⟹ (y_i^{(1)}, ..., y_i^{(k)}, ..., y_i^{(K−1)}), with y_i^{(k)} = I[y_i > r_k].
      e.g. y_i = 2 ⟹ (1, 1, 0, 0): y_i^{(1)} = y_i^{(2)} = 1, y_i^{(3)} = y_i^{(4)} = 0.
      The K − 1 binary tasks share the same weight parameters ω_j but have independent bias units b_k:
      P(y_i^{(k)} = 1) = s(Σ_j^m ω_j a_j + b_k) = s(g(x_i, W) + b_k), where g(x_i, W) is the output function of the CNN.
      Predicted rank: h(x_i) = Σ_{k=1}^{K−1} f_k(x_i), with f_k(x_i) = I[P(y_i^{(k)} = 1) > 0.5].
      e.g. [P(y_i^{(1)} = 1), P(y_i^{(2)} = 1), P(y_i^{(3)} = 1), P(y_i^{(4)} = 1)] = [0.8, 0.7, 0.4, 0.1] ⟹ h(x_i) = 1 + 1 + 0 + 0 = 2.
      Condition on the predicted rank: f_1(x_i) ≥ f_2(x_i) ≥ ... ≥ f_{K−1}(x_i).
      Required for the ordinal information and rank-monotonicity => let's prove it in the next slide!
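
     A hedged PyTorch sketch of the shared-weight / independent-bias output layer described above, assuming a backbone that emits a pooled feature vector (a minimal re-implementation of the idea, not the team's exact code):

        import torch
        import torch.nn as nn

        class CoralHead(nn.Module):
            """Shared-weight, independent-bias final layer (Cao et al. 2019).

            P(y^(k) = 1) = sigmoid(g(x, W) + b_k) for k = 1 .. K-1.
            """
            def __init__(self, in_features, num_classes=5):
                super().__init__()
                self.g = nn.Linear(in_features, 1, bias=False)          # shared weights W
                self.bias = nn.Parameter(torch.zeros(num_classes - 1))  # independent b_k

            def forward(self, features):
                # Broadcast (N, 1) + (K-1,) -> (N, K-1) logits; train with
                # nn.BCEWithLogitsLoss against the extended binary targets.
                return self.g(features) + self.bias

        head = CoralHead(in_features=2048)                  # e.g. SE-ResNeXt50 pooled features
        probs = torch.sigmoid(head(torch.randn(8, 2048)))   # (8, 4) cumulative probabilities
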
  19. Proof by contradiction (using the loss function and its optimal solution (W*, b*)).
      Theorem: f_1(x_i) ≥ f_2(x_i) ≥ ... ≥ f_{K−1}(x_i) is satisfied by the optimal solution (W*, b*),
      where W* are the optimal CNN weights and b* the optimal biases of the final layer on the train data.
      - Binary cross-entropy as the loss function:
        L(W, b) = − Σ_{i=1}^{N} Σ_{k=1}^{K−1} [ y_i^{(k)} log s(g(x_i, W) + b_k) + (1 − y_i^{(k)}) log(1 − s(g(x_i, W) + b_k)) ]
      - Sufficient condition:
        P(y_i^{(1)} = 1) ≥ P(y_i^{(2)} = 1) ≥ ... ≥ P(y_i^{(K−1)} = 1) ⇒ f_1(x_i) ≥ f_2(x_i) ≥ ... ≥ f_{K−1}(x_i)
        P(y_i^{(1)} = 1) ≥ P(y_i^{(2)} = 1) ≥ ... ≥ P(y_i^{(K−1)} = 1) ⇔ b_1* ≥ b_2* ≥ ... ≥ b_{K−1}*   <= show this!
      Suppose (W*, b*) is an optimal solution and b_k < b_{k+1} for some k. Let
        A1 = { n : y_n^{(k)} = y_n^{(k+1)} = 1 }, A2 = { n : y_n^{(k)} = y_n^{(k+1)} = 0 }, A3 = { n : y_n^{(k)} = 1, y_n^{(k+1)} = 0 },
      so that A1 ∪ A2 ∪ A3 = {1, 2, ..., N}. Denote p_n(b) = s(g(x_n, W) + b) and
        δ_n = log p_n(b_{k+1}) − log p_n(b_k) > 0,  δ'_n = log(1 − p_n(b_k)) − log(1 − p_n(b_{k+1})) > 0.
      If we replace b_k with b_{k+1}, the change in loss is Δ1 L = − Σ_{n∈A1} δ_n + Σ_{n∈A2} δ'_n − Σ_{n∈A3} δ_n.
      If we replace b_{k+1} with b_k, then Δ2 L = Σ_{n∈A1} δ_n − Σ_{n∈A2} δ'_n − Σ_{n∈A3} δ'_n.
      Then Δ1 L + Δ2 L = − Σ_{n∈A3} (δ_n + δ'_n) < 0 ⇔ Δ1 L < 0 or Δ2 L < 0: a contradiction, because a
      more optimal solution exists (replace b_k so that b_1* ≥ b_2* ≥ ... ≥ b_{K−1}* is satisfied).
  20. The rank-monotonicity P(y_i^{(1)} = 1) ≥ P(y_i^{(2)} = 1) ≥ ... ≥ P(y_i^{(K−1)} = 1) is guaranteed only at the optimal solution (W*, b*): the optimal CNN weights and final-layer biases for the full train data (~14,000 examples). When using mini-batch training and OOF prediction, this assumption will be violated a little. e.g. [p1, p2, p3, p4] = [0.90, 0.48, 0.52, 0.30] is not monotone (0.48 < 0.52), and hard encoding with threshold 0.5 => diagnosis 3!?
  21. Prediction Encoding (Soft Encoding). To blend with the naive regression models, we need to encode the 4 probabilities into one scalar value:
      P_encoded = P(y_i^{(1)} = 1) + P(y_i^{(2)} = 1) + P(y_i^{(3)} = 1) + P(y_i^{(4)} = 1), with 0 ≤ P_encoded ≤ 4
      e.g. [p1, p2, p3, p4] = [0.90, 0.48, 0.52, 0.30]: hard encoding with threshold 0.5 => diagnosis 3; soft encoding => 2.20
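
     A minimal sketch of the soft encoding on the slide's example (my illustration, not the team's code):

        import numpy as np

        def soft_encode(probs):
            """Soft encoding: sum of the K-1 cumulative probabilities, a scalar in [0, K-1]."""
            return float(np.sum(probs))

        probs = [0.90, 0.48, 0.52, 0.30]   # non-monotone OOF probabilities
        print(soft_encode(probs))          # 2.20, ready to blend with the regression models
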
  22. 4-2. Others
      a. EfficientNet: https://arxiv.org/abs/1905.11946
      b. MixUp and CutMix: https://arxiv.org/abs/1710.09412 / https://arxiv.org/abs/1905.04899
      c. Ben's preprocessing
      d. Statistics values (mean, std, quantile, ...), implemented as a `Lambda Layer`
      e. Pseudo labeling
      f. RAdam: https://arxiv.org/abs/1908.03265
      Why did MixUp and CutMix not work? MixUp: Diag1 * 0.5 + Diag2 * 0.5 => diagnosis = 1.5? Maybe not... I think it's more than 2, due to the non-linear transformation from findings to label. CutMix: Diag1 * Mask1 + Diag2 * Mask2 => diagnosis = 1.7? My guess is 2.0 or so, since the findings of both grades remain in the image. (These worked only on private.) A generic MixUp sketch follows below.
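
     For reference, a generic MixUp sketch in PyTorch (the cited technique, not the team's code), which shows why a 0.5 / 0.5 mix of a grade-1 and a grade-2 image gets the interpolated label 1.5 even though the combined findings might warrant a higher grade:

        import torch

        def mixup(images, labels, alpha=0.4):
            """MixUp (Zhang et al. 2017): convex-combine random pairs of images and labels."""
            lam = torch.distributions.Beta(alpha, alpha).sample().item()
            perm = torch.randperm(images.size(0))
            mixed_images = lam * images + (1 - lam) * images[perm]
            mixed_labels = lam * labels + (1 - lam) * labels[perm]
            return mixed_images, mixed_labels

        x = torch.randn(8, 3, 320, 320)                      # batch of retina images
        y = torch.tensor([1., 2., 0., 4., 3., 2., 1., 0.])   # regression-style grades
        mx, my = mixup(x, y)
        # my interpolates grades linearly, but severity does not add linearly:
        # a 0.5 / 0.5 mix of grade 1 and grade 2 findings is labeled 1.5,
        # while the actual combined findings could warrant grade 2 or more.
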
  23. 1. Competition Overview 2. Result 3. Model Pipeline 4. What we did (Worked / Not Worked) 5. Takeaway