OpenTalks.AI - Boris Lestsov, Person Detection in Crowds

Person detection in crowds Boris Lestsov Mail.Ru Group

Computer Vision Team We solve computer vision problems at Mail.Ru
Projects: 1) Vision (b2b) 2) Cloud 3) Mail 4) ...

Business case

Business case
1) Queue optimisation
   a) Open a lift (ski resort)
   b) Call the cashier
2) Waiting time estimation

Business requirements 1) Works in various setups 2) Real time
3) Robust on video

Challenges

Challenges Heavy occlusion

Challenges Pose, illumination, clothing variability.

Metrics and datasets

Intersection Over Union (IoU)
Measures detection quality for a single bounding box.
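IoU is the area of the intersection of two boxes divided by the area of their union. A minimal sketch, assuming axis-aligned boxes stored as `(x1, y1, x2, y2)` corner coordinates (the deck does not specify a box format):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

Identical boxes give 1.0 and disjoint boxes give 0.0.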

Benchmarks
CrowdHuman - our main dataset
Custom dataset similar to the target domain

CrowdHuman examples

Why build our own solution?
Existing solutions:
• Not adapted to crowds
• Slow inference

Approaches

Approaches
1) Classical CV (HOG, Deformable Part Models, Viola-Jones)
2) Motion-based detection (background subtraction)
3) CNN:
   a) Two-stage - Faster RCNN
   b) Single-stage - SSD, YOLO, RetinaNet

Faster RCNN

Faster RCNN Person?

Faster RCNN
+ Accurate
+ Bigger resolution => better result
- Slow
- More objects => more proposals => slower detection

Single Shot Detector

SSD (Single Shot Detector) Image

SSD (Single Shot Detector) Extract features

Single Shot Detector
Progressively reducing feature-map height & width => detection at different scales

Different Scales Smaller scale Bigger scale

Single Shot Detector Predict bounding boxes

Single Shot Detector Merge similar detections with NMS

Non-Maximum Suppression
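A minimal sketch of greedy non-maximum suppression, under stated assumptions: boxes are `(x1, y1, x2, y2)` tuples, and the `iou_threshold` of 0.5 is an illustrative value, not one taken from the deck. The highest-scoring box is kept first, and a lower-scoring box survives only if it does not overlap any already kept box too strongly.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes that survive greedy NMS, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep the box only if it does not overlap a higher-scoring kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep
```

In a crowd, two real people can overlap more than the threshold, so the threshold trades duplicate boxes against missed neighbours.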

Problems with SSD
• Backbone - VGG-16
• Fixed 512x512 input => breaks the aspect ratio

Architecture: RetinaNet
1) Backbone - ResNet
2) Feature Pyramid Network (FPN)
3) Focal Loss against class imbalance

Feature Pyramid Network Higher level features for smaller scales

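Focal Loss
Problem: class imbalance, 99 : 1
Cross Entropy (CE): $\mathrm{CE}(p_t) = -\log(p_t)$
Focal Loss (FL): $\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma} \log(p_t)$
$p_t$ - predicted probability of the g.t. class: $p_t = p$ if $y = 1$, otherwise $p_t = 1 - p$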

Focal Loss 1) Well classified examples => smaller contribution 2)
Analogue of Online Hard Example Mining
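To make the first point concrete, here is a small sketch comparing cross entropy, -log(p_t), with the focal loss, -(1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the ground-truth class. The value gamma = 2 is the default from the Focal Loss paper and is an assumption here, since the deck does not state which value was used.

```python
import math


def cross_entropy(p_t):
    """Cross entropy for one example; p_t is the probability of the true class."""
    return -math.log(p_t)


def focal_loss(p_t, gamma=2.0):
    """Focal loss: cross entropy scaled by (1 - p_t) ** gamma."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# Compare how much of the cross-entropy loss remains for an easy and a hard example.
for p_t in (0.9, 0.1):
    ratio = focal_loss(p_t) / cross_entropy(p_t)
    print(f"p_t={p_t}: CE={cross_entropy(p_t):.4f}, FL={focal_loss(p_t):.4f}, FL/CE={ratio:.2f}")
```

With gamma = 2, a well classified example (p_t = 0.9) keeps about 1% of its cross-entropy loss, while a hard one (p_t = 0.1) keeps about 81%, so the many easy background examples no longer dominate training.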

Other problems

Problems
How many people are in this image?

Small pedestrians
Bigger resolution => better result, but slower.
800x600: 30 fps, ~73.5% AP
1200x800: 15 fps, ~78.0% AP

Crop augmentations Good Crop Bad Crop Random Crop Random Crop
IoU is roughly the same!

Tuned prior boxes Removed horizontal boxes

Tuned prior boxes Smaller boxes
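A rough sketch of what this kind of prior-box tuning can look like. All numbers are hypothetical: the deck does not give its anchor sizes or aspect ratios. Here the aspect ratio is defined as height / width, so dropping ratios below 1 removes the horizontal boxes, and smaller base sizes yield smaller boxes.

```python
import math


def anchor_shapes(base_sizes, aspect_ratios):
    """Return (width, height) pairs; ratio = height / width, area = base_size ** 2."""
    shapes = []
    for size in base_sizes:
        for ratio in aspect_ratios:
            width = size / math.sqrt(ratio)
            height = size * math.sqrt(ratio)
            shapes.append((width, height))
    return shapes


# Hypothetical generic defaults: horizontal, square and vertical boxes.
default_ratios = [0.5, 1.0, 2.0]
# Tuned for standing people: drop the horizontal ratio.
person_ratios = [r for r in default_ratios if r >= 1.0]
# Hypothetical smaller base sizes (in pixels) for small pedestrians.
person_anchors = anchor_shapes(base_sizes=[16, 32, 64], aspect_ratios=person_ratios)
```

Fewer anchor shapes per location also means fewer predictions to score and suppress.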

Detection examples

Prediction Examples

Failure Case Extra box between people

Future work 1) Repulsion Loss 2) RetinaMask 3) Replace ResNet-50
with a better backbone model.

Tracking

Blinking problem

Tracking use cases
1) Tracking itself
2) Fewer false positives on a video stream
3) Dealing with “blinking” detections

SORT (Simple Online and Realtime Tracking)
• Association by IoU
• Kalman Filters
• Fast
We fine-tuned SORT
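A simplified sketch of IoU-based association and of how keeping a track alive for a few missed frames hides "blinking" detections. This is not the SORT implementation: SORT solves the assignment with the Hungarian algorithm and predicts each box with a Kalman filter, while this sketch uses greedy matching and no motion model. The names `Track`, `update_tracks`, `iou_threshold` and `max_age` are illustrative, and the threshold values are assumptions.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


class Track:
    def __init__(self, track_id, box):
        self.id = track_id
        self.box = box
        self.missed = 0  # consecutive frames without a matched detection


def update_tracks(tracks, detections, next_id, iou_threshold=0.3, max_age=2):
    """Match detections to tracks greedily by IoU; return (tracks, next_id)."""
    pairs = sorted(
        ((iou(t.box, d), ti, di)
         for ti, t in enumerate(tracks)
         for di, d in enumerate(detections)),
        reverse=True,
    )
    used_tracks, used_dets = set(), set()
    for score, ti, di in pairs:
        if score < iou_threshold:
            break  # remaining pairs overlap even less
        if ti in used_tracks or di in used_dets:
            continue
        tracks[ti].box = detections[di]
        tracks[ti].missed = 0
        used_tracks.add(ti)
        used_dets.add(di)
    for ti, track in enumerate(tracks):
        if ti not in used_tracks:
            track.missed += 1
    # A track survives short gaps, so a person does not vanish when one frame is missed.
    survivors = [t for t in tracks if t.missed <= max_age]
    for di, det in enumerate(detections):
        if di not in used_dets:
            survivors.append(Track(next_id, det))
            next_id += 1
    return survivors, next_id
```

Calling `update_tracks` once per frame with that frame's detections keeps the same track id across a frame in which the detector misses the person.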

With tracking

Tracking example

Queues in canteen 1) Multiple cameras 2) Zoning

Conclusion
1) Two-stage detectors are more accurate, but slower
2) Bigger resolution => better accuracy, slower
3) ResNet, FPN, Focal Loss => better result

Thanks! Questions?

Appendix

Metrics: AP - Average Precision (single class)
• False Positive (FP) - a predicted bbox that has no IoU>0.5 with any g.t. box.
• False Negative (FN) - a g.t. bbox that has no IoU>0.5 with any predicted box.
1) Compute predictions.
2) Plot Precision-Recall Curve, make
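A sketch of one common way to turn scored detections into an AP value, assuming each prediction has already been labelled as a true or false positive by the IoU>0.5 rule. The monotone interpolation of precision used here is one widespread convention and is an assumption, since the source slide does not spell out its exact procedure; `average_precision` and its parameters are illustrative names.

```python
def average_precision(scores, is_true_positive, num_gt):
    """Area under the interpolated precision-recall curve for a single class.

    scores: confidence of each predicted box.
    is_true_positive: True if that prediction matched a g.t. box.
    num_gt: number of g.t. boxes (assumed > 0); unmatched ones are false negatives.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # Interpolate: precision at each point becomes the best precision to its right.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    ap, prev_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap
```

Lowering the score threshold moves along the curve: recall can only grow, while precision drops whenever a false positive is admitted.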

Intuition about Kalman Filter in SORT
Box is represented with a vector:
• u, v - coordinates of the center
• s - box scale
• r - box aspect ratio
• $\dot{u}, \dot{v}, \dot{s}$ - the corresponding derivatives
Notes:
1. Linear prediction from frame to frame, with correction from the detector output.
2. In general, a Kalman filter can model a broad range of dynamic systems (fluid in a tank, the temperature of a car engine).
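A sketch of only the linear prediction step for this state vector, assuming a constant-velocity model. The covariance bookkeeping and the correction from the detector output, which make it a real Kalman filter, are left out. Treating `s` as the box area and `r` as width / height follows the convention of the public SORT reference code and is an assumption here.

```python
import math


def predict(state, dt=1.0):
    """Advance [u, v, s, r, du, dv, ds] by one step; r is assumed constant."""
    u, v, s, r, du, dv, ds = state
    return [u + du * dt, v + dv * dt, s + ds * dt, r, du, dv, ds]


def state_to_box(state):
    """Convert a state to (x1, y1, x2, y2), assuming s = w * h and r = w / h."""
    u, v, s, r = state[:4]
    w = math.sqrt(s * r)
    h = s / w
    return (u - w / 2, v - h / 2, u + w / 2, v + h / 2)


# A box centred at (10, 20) with area 200 and w/h = 0.5, drifting right and up.
state = [10.0, 20.0, 200.0, 0.5, 1.0, -2.0, 10.0]
next_state = predict(state)
```

The predicted box is what gets compared by IoU with the new detections on the next frame.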

Appendix: Repulsion Loss
Three components:
1) Attraction to the matched g.t. box.
2) Repulsion from other g.t. boxes.
3) Repulsion from other predicted boxes.
Technically, IoU is

RetinaMask
1) RetinaNet adapted to instance segmentation.
2) Mask prediction gives a good improvement in detection quality (~2.3% mAP on COCO).
3) Masks are predicted in the Faster-RCNN manner. Mask prediction can be discarded during inference to speed up the detector.
4) Tune masks on the COCO “Person” category, detection on CrowdHuman.
5) Code and models are available!

Resolution
1) SGD training instead of Adam.
2) Replacing SSD with the RetinaNet architecture.
3) Focal Loss
4) Bigger resolution (current models: 800x600 and 1200x800)
5) scale_by_aspect instead of simple resize.
6) Anchor box tuning.
7) Crop augmentations
8) Joint training with head detection.
9) Removing strides from convolutions in last

Things that did NOT work out
1) MTCNN for detecting small people.
2) Prediction of the full bounding box instead of the visible one.
