Human Protein Image Classification using PyTorch and fastai

William Horton

Or, Beyond Dogs vs. Cats

https://github.com/fastai/fastai/blob/master/courses/dl1/lesson1.ipynb

Basic CNN architecture https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53

What happens when you start working with real(er) data?

What happens when Dogs vs. Cats doesn’t cut it anymore?

Who am I?
• Backend Engineer on the Data Team at Compass (We’re hiring!)
• Python, Spark, Kafka, Airflow

Deep Learning
• fast.ai International Fellow (Spring 2018)
• Open-source contributor (PyTorch and fastai)
• Speaker, PyOhio 2018: “You Can Do Deep Learning”
• Tweeting at @hortonhearsafoo

Structure of a Machine Learning Problem
• Inputs
• Models
• Loss Function
• Thresholding
• Outputs

What was different from Dogs vs. Cats?

Inputs
Dogs vs. Cats
- Regular (3-channel) JPGs
- Even class distribution
Human Protein Classification
- 4 separate PNGs per sample
- Wildly uneven class distribution

Outputs
Dogs vs. Cats
- Single (binary) class
- Single label per image
Human Protein Classification
- Multi-class
- Multi-label

Inputs

How do I even read this data?
• 4 channels saved as separate PNGs
• blue, green, red, and yellow

Example https://www.kaggle.com/jschnab/exploring-the-human-protein-atlas-images

Two approaches
• Custom open function
• Preprocess into 4-channel PNG

Custom open function adapted from https://www.kaggle.com/iafoss/pretrained-resnet34-with-rgby-0-460-public-lb (fastai 0.7 version)
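A minimal sketch of what such an open function can look like, not the code from the linked kernel. It assumes the four files of a sample share a prefix and end in `_red.png`, `_green.png`, `_blue.png`, and `_yellow.png`; the function name `open_4_channel`, the channel order, and the injected `read_gray` loader are all hypothetical choices made for illustration.

```python
from typing import Callable

import numpy as np

# Assumed channel order; any fixed order works as long as it is used consistently.
CHANNELS = ("red", "green", "blue", "yellow")


def open_4_channel(
    sample_path: str,
    read_gray: Callable[[str], np.ndarray],
) -> np.ndarray:
    """Stack the four single-channel PNGs of one sample into an (H, W, 4) array.

    `sample_path` is the path prefix shared by the four files (assumed naming),
    and `read_gray` is any function that loads one grayscale image as a
    2-D uint8 array.
    """
    planes = [read_gray(f"{sample_path}_{color}.png") for color in CHANNELS]
    stacked = np.stack(planes, axis=-1).astype(np.float32)
    return stacked / 255.0  # scale 8-bit pixel values to [0, 1]
```

Passing the loader in keeps the sketch independent of a particular image library; with OpenCV, for example, `read_gray` could be `lambda p: cv2.imread(p, cv2.IMREAD_GRAYSCALE)`.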

Using the custom open function then...

Trade-offs
• Disk space
• Precomputing vs. during training
• Code complexity

"Extreme" augmentations
Based on the way the data was gathered
• Rotation all the way up to 90 degrees
• Large random scaling (up to 1.4-1.5x)

Problem
Common pretrained models are designed for 3-channel inputs

How do we adapt the architecture while retaining the benefits of pretrained weights?

Possible solutions
• Why not create a new layer of the right shape?
• Why not create a new layer of the right shape, and copy over the pretrained weights for the first three input channels?
• Why not create a new layer of the right shape, copy over the pretrained weights for the first three input channels, and reuse the first channel's weights for the fourth channel as well?

Custom 4-Channel Resnet adapted from https://www.kaggle.com/iafoss/pretrained-resnet34-with-rgby-0-460-public-lb
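The weight-copying idea can be sketched on the weight tensor alone. This is an illustration in NumPy under stated assumptions, not the linked kernel's code: the helper name `expand_to_four_channels` is hypothetical, and random numbers stand in for real pretrained weights. In PyTorch the same concatenation could be done with `torch.cat` on the first convolution's weight before assigning it to a new 4-input-channel `nn.Conv2d`.

```python
import numpy as np


def expand_to_four_channels(w_rgb: np.ndarray) -> np.ndarray:
    """Extend first-layer conv weights from 3 to 4 input channels.

    w_rgb has shape (out_channels, 3, kH, kW). The three pretrained channels
    are kept as-is, and the fourth input channel starts as a copy of the
    first channel's filters.
    """
    if w_rgb.ndim != 4 or w_rgb.shape[1] != 3:
        raise ValueError("expected weights of shape (out_channels, 3, kH, kW)")
    return np.concatenate([w_rgb, w_rgb[:, :1]], axis=1)


# Stand-in for pretrained weights: 64 filters of size 7x7 over 3 channels,
# the shape of a ResNet's first convolution.
rng = np.random.default_rng(0)
pretrained = rng.normal(size=(64, 3, 7, 7)).astype(np.float32)

w4 = expand_to_four_channels(pretrained)
print(w4.shape)  # (64, 4, 7, 7)
```

Starting the fourth channel from pretrained filters rather than random values means the first layer already responds to edges and textures in the extra channel from the first training step.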

What is the best architecture to use?
Answer: try them all! My best submission was a combination of resnet18 + resnet50.

Loss function

What makes this data unique?
• Multi-class and multi-label
• Highly imbalanced classes

Multi-class and multi-label
• Multi-class: more than two possible classes (i.e., not binary)
• Multi-label: each example can have multiple correct classes

Sigmoid and Softmax
• Sigmoid: "compresses" each output to between 0 and 1
• Softmax: "pushes" the largest value closer to 1 and the other values closer to 0
Example:
Output:  [ 0.6590094 , -1.1441954 , 0.19337232, 1.57017275, -0.13396834]
Sigmoid: [ 0.65903783, 0.24155091, 0.548193 , 0.82780823, 0.46655792]
Softmax: [0.21131754, 0.03481879, 0.13265143, 0.52559203, 0.09562021]
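The numbers in this example can be reproduced with a few lines of NumPy. The helper names `sigmoid` and `softmax` are written out here for illustration; deep learning libraries ship their own versions.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    """Squash each value independently into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x: np.ndarray) -> np.ndarray:
    """Turn a vector of scores into probabilities that sum to 1."""
    shifted = x - np.max(x)  # subtracting the max avoids overflow in exp
    e = np.exp(shifted)
    return e / e.sum()


output = np.array([0.6590094, -1.1441954, 0.19337232, 1.57017275, -0.13396834])

sig = sigmoid(output)
soft = softmax(output)
print(sig)
print(soft)
```

Sigmoid treats each class on its own, so several classes can be above 0.5 at once, which is what a multi-label problem needs. Softmax makes the classes compete for a total probability of 1.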

Loss functions
For binary classification (e.g., Dogs vs. Cats):

Loss functions
For multi-class, single-label classification (e.g., ImageNet):

What do we use for multi-class, multi-label data?

Loss functions
It’s like we’re seeing each class as a binary decision
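One way to read "each class as a binary decision" is binary cross-entropy applied independently to every class and then averaged. The sketch below assumes that reading; the function name `multilabel_bce` and the example values are made up for illustration. In PyTorch, `torch.nn.BCEWithLogitsLoss` combines the sigmoid and this loss in one step.

```python
import numpy as np


def multilabel_bce(probs: np.ndarray, targets: np.ndarray, eps: float = 1e-7) -> float:
    """Mean binary cross-entropy over every (example, class) pair.

    probs:   (N, C) sigmoid outputs in [0, 1]
    targets: (N, C) array of 0s and 1s; a row may contain several 1s
    """
    p = np.clip(probs, eps, 1.0 - eps)  # keep log() away from 0
    losses = -(targets * np.log(p) + (1.0 - targets) * np.log(1.0 - p))
    return float(losses.mean())


# One example with three classes, two of which are correct (illustrative values).
probs = np.array([[0.9, 0.1, 0.8]])
targets = np.array([[1.0, 0.0, 1.0]])
loss = multilabel_bce(probs, targets)
print(loss)
```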

Class imbalance https://www.kaggle.com/jschnab/exploring-the-human-protein-atlas-images

Evaluation metric
• Macro F1 score
• Each class has the same impact on the final metric
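A small sketch of macro F1 shows why rare classes matter so much under this metric. The function name `macro_f1` and the toy labels are illustrative, and scoring a class with no true or predicted positives as 0 is one convention among several.

```python
import numpy as np


def macro_f1(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Unweighted mean of per-class F1 for binary (N, C) arrays."""
    tp = np.sum((y_true == 1) & (y_pred == 1), axis=0)
    fp = np.sum((y_true == 0) & (y_pred == 1), axis=0)
    fn = np.sum((y_true == 1) & (y_pred == 0), axis=0)
    denom = 2 * tp + fp + fn
    # Classes with no true or predicted positives get F1 = 0 (a convention).
    f1 = np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)
    return float(f1.mean())


# Class 0 is common and predicted perfectly; class 1 is rare and missed.
y_true = np.array([[1, 0], [1, 0], [1, 0], [0, 1]])
y_pred = np.array([[1, 0], [1, 0], [1, 0], [0, 0]])
score = macro_f1(y_true, y_pred)
print(score)  # 0.5
```

Missing the single rare example costs half of the score here, even though three of the four rows are predicted exactly right.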

Weighted Loss

Calculating weights
• Direct weights may be too extreme
• Log-dampened weights

Log-dampened weights https://www.kaggle.com/c/human-protein-atlas-image-classification/discussion/74065
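The slides do not reproduce the formula from the linked thread, so the following is only a sketch of one common form of log dampening: the weight of a class is `log(mu * total / count)`, floored at 1. The function name, the `mu=0.5` default, the floor, and the class counts are all assumptions for illustration.

```python
import math


def log_dampened_weights(class_counts: dict, mu: float = 0.5, floor: float = 1.0) -> dict:
    """Per-class loss weights that grow with rarity, but only logarithmically.

    class_counts maps each class to its (positive) number of examples.
    mu and floor are tunable; the defaults here are assumptions.
    """
    total = sum(class_counts.values())
    return {
        label: max(floor, math.log(mu * total / count))
        for label, count in class_counts.items()
    }


# Made-up counts spanning three orders of magnitude.
counts = {"common": 10000, "uncommon": 1000, "rare": 10}
weights = log_dampened_weights(counts)
direct = {label: sum(counts.values()) / n for label, n in counts.items()}
print(weights)
print(direct)
```

With these counts the direct ratio for the rare class is above 1000, while the dampened weight stays in the single digits, which is the sense in which direct weights "may be too extreme".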

From my training notebook

Focal loss
From this object-detection paper (origin of RetinaNet): “The Focal Loss is designed to address the one-stage object detection scenario in which there is an extreme imbalance between foreground and background classes during training (e.g., 1:1000).”
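The paper defines focal loss as FL(p_t) = -(1 - p_t)^gamma * log(p_t), where p_t is the probability the model assigns to the true outcome. Below is a minimal NumPy sketch of the per-class binary form, leaving out the paper's optional alpha balancing term; the function name and example values are illustrative, and this is not the training code used in the talk.

```python
import numpy as np


def binary_focal_loss(
    probs: np.ndarray, targets: np.ndarray, gamma: float = 2.0, eps: float = 1e-7
) -> float:
    """Mean focal loss over every (example, class) pair, without the alpha term."""
    p = np.clip(probs, eps, 1.0 - eps)
    # p_t is the probability given to whichever outcome is actually correct.
    p_t = np.where(targets == 1, p, 1.0 - p)
    # (1 - p_t)^gamma shrinks the loss of examples that are already easy.
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))


probs = np.array([[0.9, 0.2]])
targets = np.array([[1, 0]])
print(binary_focal_loss(probs, targets, gamma=0.0))  # plain binary cross-entropy
print(binary_focal_loss(probs, targets, gamma=2.0))  # easy examples down-weighted
```

With `gamma=0` the expression reduces to ordinary binary cross-entropy; larger values shift the training signal toward the hard, often rare, examples.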

Weighted Samplers

Standard Sampling:
Weighted Random Sampling (maybe):
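A sketch of what weighted random sampling could look like for multi-label data. Weighting each example by the inverse frequency of its rarest label is an assumption made for this illustration, as are the function name and the toy labels; the sketch also assumes every example has at least one label. In PyTorch, per-example weights like these can be passed to `torch.utils.data.WeightedRandomSampler`.

```python
import numpy as np


def sample_probabilities(labels: np.ndarray) -> np.ndarray:
    """Per-example sampling probabilities for a binary (N, C) label matrix.

    Each example is weighted by the inverse frequency of its rarest label
    (an illustrative choice), then the weights are normalised to sum to 1.
    """
    class_counts = labels.sum(axis=0)                 # examples per class
    inv_freq = 1.0 / np.maximum(class_counts, 1)      # rarer class, larger value
    per_label = np.where(labels == 1, inv_freq, 0.0)  # (N, C)
    weights = per_label.max(axis=1)                   # rarest label wins
    return weights / weights.sum()


# Three examples of a common class, one example of a rare class.
labels = np.array([[1, 0], [1, 0], [1, 0], [0, 1]])
probs = sample_probabilities(labels)

rng = np.random.default_rng(0)
batch = rng.choice(len(labels), size=8, replace=True, p=probs)
print(probs)
print(batch)
```

Standard sampling would draw the rare example a quarter of the time; with these weights it is drawn about half of the time.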

Thresholds

Calculating thresholds
Model output for one example (after sigmoid): [0.9, 0.1, 0.2, 0.7, … C]
How to decide which are 1s and which are 0s?
• Most naive: 0.5
• Better: optimize threshold based on validation set
• Best: optimize separate threshold per class on validation set

Calculating thresholds
• all classes 0.5
• optimize on validation set
• optimize per class on validation set

Calculating thresholds
• Optimization problem: solve for C variables (one threshold per class) that maximize the metric (in this case, F1), or equivalently minimize its negative
• Combination of optimization methods L-BFGS-B and basin hopping from scipy.optimize (https://github.com/mratsim/Amazon-Forest-Computer-Vision)
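As a simpler stand-in for the L-BFGS-B and basin hopping approach, the sketch below uses a plain grid search per class. It relies on the fact that macro F1 is the mean of per-class F1 scores, so each class's threshold can be tuned on its own. The function names, the grid, and the validation data are all made up for illustration.

```python
import numpy as np


def f1_binary(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """F1 score for one class, given 0/1 vectors."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


def best_thresholds(probs: np.ndarray, y_true: np.ndarray, grid=None) -> np.ndarray:
    """For each class, pick the grid threshold with the highest validation F1."""
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)
    thresholds = np.empty(probs.shape[1])
    for c in range(probs.shape[1]):
        scores = [f1_binary(y_true[:, c], (probs[:, c] >= t).astype(int)) for t in grid]
        thresholds[c] = grid[int(np.argmax(scores))]  # first best value on ties
    return thresholds


# Toy validation set: the model is under-confident on the second (rare) class.
val_probs = np.array([[0.9, 0.30], [0.8, 0.10], [0.2, 0.05], [0.1, 0.02]])
val_true = np.array([[1, 1], [1, 0], [0, 0], [0, 0]])

thr = best_thresholds(val_probs, val_true)
print(thr)
```

At a fixed 0.5 the rare class is never predicted and its F1 is 0; a tuned threshold below 0.3 recovers it. Thresholds chosen this way are fitted to the validation set, so they can overfit when a class has very few validation examples.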

(Bonus: Ensembling)

Ensembling with averaging
Let’s say we have the outputs of 3 models for 1 example, with 5 possible classes:
[0.8, 0.1, 0.3, 0.9, 0.5]
[0.9, 0.3, 0.6, 0.1, 0.7]
[0.9, 0.2, 0.4, 0.9, 0.7]
Our output would be the average (then we’d apply thresholding post-averaging):
[0.86666667, 0.2 , 0.43333333, 0.63333333, 0.63333333]

Ensembling with max
[0.8, 0.1, 0.3, 0.9, 0.5]
[0.9, 0.3, 0.6, 0.1, 0.7]
[0.9, 0.2, 0.4, 0.9, 0.7]
Our output would be the max:
[0.9, 0.3, 0.6, 0.9, 0.7]

Ensembling with OR
Let’s say we’ve already thresholded to 0s and 1s:
[1, 0, 0, 1, 0]
[1, 0, 1, 0, 1]
[1, 0, 0, 1, 1]
Our output would be the logical OR (basically if any model predicted that class):
[1, 0, 1, 1, 1]
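All three ensembling strategies are one-line reductions over the model axis. The sketch below reuses the numbers from the slides; the variable names are illustrative.

```python
import numpy as np

# Sigmoid outputs of 3 models for 1 example with 5 classes (from the slides).
preds = np.array([
    [0.8, 0.1, 0.3, 0.9, 0.5],
    [0.9, 0.3, 0.6, 0.1, 0.7],
    [0.9, 0.2, 0.4, 0.9, 0.7],
])

averaged = preds.mean(axis=0)  # threshold after averaging
maxed = preds.max(axis=0)      # most confident model wins per class

# Predictions that have already been thresholded to 0s and 1s.
binary = np.array([
    [1, 0, 0, 1, 0],
    [1, 0, 1, 0, 1],
    [1, 0, 0, 1, 1],
])
any_model = binary.any(axis=0).astype(int)  # logical OR across models

print(averaged)
print(maxed)
print(any_model)
```

Averaging smooths out a single model's mistake, such as the 0.1 in the fourth column, while max and OR favor recall because one confident model is enough to predict a class.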

Conclusion

Structure of a Machine Learning Problem
• Inputs
• Models
• Loss Function
• Thresholding
• Outputs

Work with datasets that force you to explore the capabilities of your tools.

Further information
• My starter code: https://github.com/wdhorton/protein-atlas-fastai
• Kernel version: https://www.kaggle.com/hortonhearsafoo/fastai-v1-starter-pack-kernel-edition-lb-0-323
• Kaggle discussion: https://www.kaggle.com/c/human-protein-atlas-image-classification/discussion/71039

References
Kaggle is an amazing community! I tried to collect Kernels and Discussion threads that helped me in this competition (some already referenced in slides):
• https://www.kaggle.com/iafoss/pretrained-resnet34-with-rgby-0-460-public-lb (fastai v0.7)
• https://www.kaggle.com/c/human-protein-atlas-image-classification/discussion/75691
• https://www.kaggle.com/c/human-protein-atlas-image-classification/discussion/73938
• https://www.kaggle.com/c/human-protein-atlas-image-classification/discussion/74065
• https://www.kaggle.com/c/human-protein-atlas-image-classification/discussion/69984#436319

Q&A (@hortonhearsafoo for anything that doesn’t get answered!)
