2019
Detect Diabetic Retinopathy to stop blindness before it's too late
Build a model to help identify diabetic retinopathy automatically
Aravind Eye Hospital technicians travel to rural areas to capture images
Shortage of highly trained doctors to review the images and provide diagnoses in rural areas of India
Target Imbalance
3. Target Distribution: difference between train and public. How is private?
4. Duplicated Images
5. Various Image Sizes: difference between train and test
6. Label Inconsistency: different labels from each ophthalmologist

Data Issue
[Figure: Duplicated Images]
[Figure: Target Distributions]
[Figure: Image Size Distributions] Credit to: https://www.kaggle.com/currypurin/image-shape-distribution-previous-and-present
The label distributions of train and public test are different.
Our reasoning: "OK... then private will also be different from train."
Our strategy was:
Build 2 models trained with different label distributions (train-like and public-like)
Blend those to get a robust prediction
Outcome: private was train-like...
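The blending step above can be sketched as a weighted average of the two models' per-image predictions. This is only an illustration: the function name `blend`, the equal weight, and the sample prediction values are assumptions, since the slides do not state how the two models were combined.

```python
def blend(pred_train_like, pred_public_like, weight=0.5):
    """Weighted average of two models' per-image predictions.

    `weight` is the share given to the train-like model; 0.5 is an
    assumed value, not one reported in the slides.
    """
    if len(pred_train_like) != len(pred_public_like):
        raise ValueError("prediction lists must have the same length")
    return [
        weight * a + (1.0 - weight) * b
        for a, b in zip(pred_train_like, pred_public_like)
    ]


# Hypothetical scalar predictions for two images from each model.
train_like = [2.2, 0.4]
public_like = [2.6, 0.2]
blended = blend(train_like, public_like)
print(blended)
```

Averaging keeps the prediction between the two models, so neither label-distribution assumption dominates if the private set turns out to resemble the other one.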
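Number of classes: $K$ (= 5)
Retina image features (input of CNN): $x_i$
Rank label: $y_i \in \{0, 1, 2, 3, 4\}$

Each label is extended into $K-1$ binary labels:
$$y_i \;\Longrightarrow\; y_i^{(1)}, \dots, y_i^{(k)}, \dots, y_i^{(K-1)}, \qquad y_i^{(k)} = \mathbb{I}\{y_i > r_k\}$$
e.g. $y_i = 2 \Longrightarrow [1, 1, 0, 0]$, i.e. $y_i^{(1)} = y_i^{(2)} = 1$ and $y_i^{(3)} = y_i^{(4)} = 0$.

The $(K-1)$ binary tasks share the same weight parameters $\omega_j$ but have independent bias units $b_k$.

Output function of the CNN:
$$P\big(y_i^{(k)} = 1\big) = s\Big(\sum_{j}^{m} \omega_j a_j + b_k\Big) = s\big(g(x_i, W) + b_k\big)$$

Predicted rank:
$$h(x_i) = \sum_{k=1}^{K-1} f_k(x_i), \qquad f_k(x_i) = \mathbb{I}\big\{P\big(y_i^{(k)} = 1\big) > 0.5\big\}$$

Condition:
$$f_1(x_i) \ge f_2(x_i) \ge \dots \ge f_{K-1}(x_i)$$
Required for the ordinal information and rank-monotonicity => Let's prove it in the next slide!

e.g. $\big[P(y_i^{(1)} = 1),\, P(y_i^{(2)} = 1),\, P(y_i^{(3)} = 1),\, P(y_i^{(4)} = 1)\big] = [0.8,\, 0.7,\, 0.4,\, 0.1]$ gives $h(x_i) = 1 + 1 + 0 + 0 = 2$.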
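The label extension and the predicted-rank rule above can be sketched in a few lines of Python. The function names are hypothetical, and the thresholds are assumed to be r_k = k - 1 for ranks 0..4, which is the choice that reproduces the slide's example (y = 2 becomes [1, 1, 0, 0]).

```python
def encode_ordinal(y, num_classes=5):
    """Extended binary labels y^(k) = 1 if y > r_k, for k = 1..K-1.

    Assumes r_k = k - 1, i.e. ranks are the integers 0..K-1.
    """
    return [1 if y > r_k else 0 for r_k in range(num_classes - 1)]


def predict_rank(probs, threshold=0.5):
    """h(x) = sum over k of f_k(x), with f_k = 1 if P(y^(k) = 1) > threshold."""
    return sum(1 for p in probs if p > threshold)


labels = encode_ordinal(2)
rank = predict_rank([0.8, 0.7, 0.4, 0.1])
print(labels, rank)  # [1, 1, 0, 0] 2
```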
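solution (W*, b*))

Theorem: $f_1(x_i) \ge f_2(x_i) \ge \dots \ge f_{K-1}(x_i)$ is satisfied with the optimal solution $(W^*, b^*)$.
$W^*$: optimal CNN weights with train data
$b^*$: optimal biases of the final layer with train data

Binary cross entropy as loss function:
$$L(W, b) = -\sum_{i=1}^{N} \sum_{k=1}^{K-1} \Big[\, y_i^{(k)} \log s\big(g(x_i, W) + b_k\big) + \big(1 - y_i^{(k)}\big) \log\big(1 - s\big(g(x_i, W) + b_k\big)\big) \Big]$$

Sufficient condition:
$$P\big(y_i^{(1)} = 1\big) \ge P\big(y_i^{(2)} = 1\big) \ge \dots \ge P\big(y_i^{(K-1)} = 1\big) \;\Rightarrow\; f_1(x_i) \ge f_2(x_i) \ge \dots \ge f_{K-1}(x_i)$$
$$P\big(y_i^{(1)} = 1\big) \ge P\big(y_i^{(2)} = 1\big) \ge \dots \ge P\big(y_i^{(K-1)} = 1\big) \;\Leftrightarrow\; b_1^* \ge b_2^* \ge \dots \ge b_{K-1}^*$$

Proof. Suppose $(W^*, b^*)$ is an optimal solution and $b_k < b_{k+1}$ for some $k$. Let
$$A_1 = \{n : y_n^{(k)} = y_n^{(k+1)} = 1\}, \quad A_2 = \{n : y_n^{(k)} = y_n^{(k+1)} = 0\}, \quad A_3 = \{n : y_n^{(k)} = 1,\ y_n^{(k+1)} = 0\},$$
$$A_1 \cup A_2 \cup A_3 = \{1, 2, \dots, N\}.$$
Denote
$$p_n(b) = s\big(g(x_n, W^*) + b\big),$$
$$\delta_n = \log p_n(b_{k+1}) - \log p_n(b_k) > 0, \qquad \delta_n' = \log\big(1 - p_n(b_k)\big) - \log\big(1 - p_n(b_{k+1})\big) > 0.$$
If we replace $b_k$ with $b_{k+1}$, the change of the loss is
$$\Delta_1 L = -\sum_{n \in A_1} \delta_n + \sum_{n \in A_2} \delta_n' - \sum_{n \in A_3} \delta_n.$$
If we replace $b_{k+1}$ with $b_k$,
$$\Delta_2 L = \sum_{n \in A_1} \delta_n - \sum_{n \in A_2} \delta_n' - \sum_{n \in A_3} \delta_n'.$$
Then, provided $A_3$ is non-empty, we have
$$\Delta_1 L + \Delta_2 L = -\sum_{n \in A_3} \big(\delta_n + \delta_n'\big) < 0 \;\Rightarrow\; \Delta_1 L < 0 \ \text{or}\ \Delta_2 L < 0.$$
This is contradictory: a more optimal solution exists (replace $b_k$ so as to satisfy $b_1^* \ge b_2^* \ge \dots \ge b_{K-1}^*$).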
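The key step of the proof can be checked numerically. The sketch below is not part of the original solution: the shared logits, the labels, and the two misordered biases are made-up values, and r_k = k - 1 is assumed for the thresholds. It computes the two loss changes directly from the binary cross entropy and compares their sum with the A3 term.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def task_loss(logits, targets, bias):
    """Binary cross entropy of one binary task with a shared logit plus a bias."""
    total = 0.0
    for g, t in zip(logits, targets):
        p = sigmoid(g + bias)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total


random.seed(0)
n = 200
g = [random.gauss(0.0, 1.0) for _ in range(n)]   # made-up shared logits g(x_n, W)
ranks = [random.randrange(5) for _ in range(n)]  # made-up labels in 0..4
k = 2
ranks[0] = k  # a label equal to k guarantees that A3 is non-empty

y_k = [1 if r > k - 1 else 0 for r in ranks]   # y^(k),   assuming r_k = k - 1
y_k1 = [1 if r > k else 0 for r in ranks]      # y^(k+1)
b_k, b_k1 = -0.3, 0.4                          # deliberately misordered: b_k < b_{k+1}

base = task_loss(g, y_k, b_k) + task_loss(g, y_k1, b_k1)
delta1 = task_loss(g, y_k, b_k1) + task_loss(g, y_k1, b_k1) - base  # b_k -> b_{k+1}
delta2 = task_loss(g, y_k, b_k) + task_loss(g, y_k1, b_k) - base    # b_{k+1} -> b_k

# Right-hand side of the identity: sum over A3 of (delta_n + delta_n').
a3_term = 0.0
for g_n, r in zip(g, ranks):
    if r == k:  # y^(k) = 1 and y^(k+1) = 0
        p_lo, p_hi = sigmoid(g_n + b_k), sigmoid(g_n + b_k1)
        a3_term += (math.log(p_hi) - math.log(p_lo)) + (
            math.log(1.0 - p_lo) - math.log(1.0 - p_hi)
        )

print(delta1 + delta2, -a3_term)
```

The two printed numbers agree up to floating-point error and are negative, so at least one of the two replacements lowers the loss.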
$W^*$: optimal CNN weights with train data
$b^*$: optimal biases of the final layer with train data
When using mini-batch training and OOF prediction, this optimality assumption is violated a little, so the ordering
$$P\big(y_i^{(1)} = 1\big) \ge P\big(y_i^{(2)} = 1\big) \ge \dots \ge P\big(y_i^{(K-1)} = 1\big)$$
does not always hold.
~ 14,000 examples
e.g. $[p_1, p_2, p_3, p_4] = [0.90, 0.48, 0.52, 0.30]$
Hard encoding with threshold 0.5 => diagnosis 3!?
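A simple way to see how often the ordering breaks is to scan the predicted probabilities for adjacent pairs that increase. The helper name and the two sample rows below are illustrative assumptions; real OOF predictions would replace them.

```python
def monotonic_violations(probs):
    """Return the 1-based task indices k for which p_k < p_{k+1}."""
    return [k for k in range(1, len(probs)) if probs[k - 1] < probs[k]]


# Illustrative rows: the first is the violating example from the slide,
# the second is a rank-monotone prediction.
oof_predictions = [
    [0.90, 0.48, 0.52, 0.30],
    [0.80, 0.70, 0.40, 0.10],
]

violating_rows = [
    i for i, probs in enumerate(oof_predictions) if monotonic_violations(probs)
]
print(monotonic_violations(oof_predictions[0]), violating_rows)  # [2] [0]
```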
We need to encode the 4 probabilities into one scalar value.
$$P_{\text{encoded}} = P\big(y_i^{(1)} = 1\big) + P\big(y_i^{(2)} = 1\big) + P\big(y_i^{(3)} = 1\big) + P\big(y_i^{(4)} = 1\big), \qquad 0 \le P_{\text{encoded}} \le 4$$
e.g. $[p_1, p_2, p_3, p_4] = [0.90, 0.48, 0.52, 0.30]$
Hard encoding with threshold 0.5 => diagnosis 3
Soft encoding => 2.20
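The difference between the encodings can be reproduced on the slide's example. The slides do not spell out the hard-encoding rule, so two readings are sketched here as assumptions: counting the probabilities above the threshold, and taking the highest task index above the threshold. Only the second reading yields the "diagnosis 3" quoted above. All function names are hypothetical.

```python
def hard_count(probs, threshold=0.5):
    """Number of tasks whose probability exceeds the threshold."""
    return sum(1 for p in probs if p > threshold)


def hard_highest(probs, threshold=0.5):
    """Highest 1-based task index whose probability exceeds the threshold (0 if none)."""
    passed = [k for k, p in enumerate(probs, start=1) if p > threshold]
    return max(passed) if passed else 0


def soft_encode(probs):
    """Soft encoding: the sum of the K-1 probabilities, a scalar in [0, K-1]."""
    return sum(probs)


probs = [0.90, 0.48, 0.52, 0.30]
print(hard_count(probs), hard_highest(probs), round(soft_encode(probs), 2))
```

When the probabilities are rank-monotone the two hard readings agree; they only diverge on violating cases like this one, which is where the soft score is most useful.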
https://arxiv.org/abs/1905.04899
c. Ben's preprocessing
d. Statistics values: mean, std, quantile, ... (implemented as a `Lambda Layer`)
e. Pseudo Labeling
f. RAdam https://arxiv.org/abs/1908.03265

Why did MixUp and CutMix not work?
[Figure: Diagnosis 1 and Diagnosis 2 images with their findings]
MixUp: Diag1 * 0.5 + Diag2 * 0.5, so diagnosis = 1.5? Maybe not... I think it's more than 2, due to the non-linear transformation from findings to the label.
CutMix: Diag1 * Mask1 + Diag2 * Mask2, so diagnosis = 1.7? My guess is 2.0 or so...
worked only on private
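The label arithmetic behind the two guesses above can be sketched as follows. MixUp mixes labels with the same coefficient as the images, and CutMix weights labels by the area each image contributes. The function names, the 100 x 100 image size, and the 70% pasted area are assumptions chosen only to reproduce the 1.5 and 1.7 values on the slide.

```python
def mixup_label(y1, y2, lam):
    """MixUp label: lam * y1 + (1 - lam) * y2, matching the image mixing ratio."""
    return lam * y1 + (1.0 - lam) * y2


def cutmix_label(y1, y2, box_hw, image_hw):
    """CutMix label when a box cut from image 2 is pasted onto image 1.

    The label weight of image 1 is the fraction of its area that remains.
    """
    box_h, box_w = box_hw
    img_h, img_w = image_hw
    lam = 1.0 - (box_h * box_w) / (img_h * img_w)
    return lam * y1 + (1.0 - lam) * y2


mixed = mixup_label(1, 2, lam=0.5)
# Assumed geometry: 70% of a 100 x 100 image is replaced by the diagnosis-2 image.
cut = cutmix_label(1, 2, box_hw=(70, 100), image_hw=(100, 100))
print(mixed, round(cut, 2))  # 1.5 1.7
```

Both rules are linear in the labels, which is exactly the property the slide questions: a blended image that still shows the findings of a diagnosis-2 eye may deserve a label near 2 rather than the linear average.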