CodeFest
April 06, 2019

CodeFest 2019. Boris Lestsov (Mail.Ru Group): Detecting People in a Crowd

Detecting people in an image or video stream is a hard computer vision problem. Its main difficulties are the variety of possible detection scenarios, the large intra-class variability of people themselves (clothing, pose), and frequent occlusion of one person by another (crowds are an especially hard case). Many approaches have been proposed historically, but convolutional neural networks currently deliver the best quality. The talk covers building an in-house, production-ready person detection system that runs on convolutional neural networks in real time. It examines specific techniques (architectures, loss functions, training details) that substantially improve detection quality.


Transcript

  1. Computer Vision Team. We solve computer vision problems at Mail.Ru. Projects: 1) Vision (b2b) 2) Cloud 3) Mail 4) ...
  2. Business case. 1) Queue optimisation: a) Open the elevator (ski resort) b) Call the cashier. 2) Wait time estimation.
  3. Metrics: AP - Average Precision (single class). 1. Compute predictions. 2. Plot the precision-recall curve and make it non-increasing. 3. Compute the area under the curve. Multiclass: mean AP (mAP). Different IoU thresholds.
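The three AP steps on slide 3 can be sketched in Python as follows. This is a minimal illustration, not the evaluation code used in the talk. The function name `average_precision` and its arguments are hypothetical, and it assumes each prediction has already been marked as a true or false positive by matching it to ground truth at some fixed IoU threshold.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """AP for one class: area under the non-increasing precision-recall curve.

    Assumes predictions were already matched to ground truth, so each one
    carries a true-positive / false-positive flag, and num_ground_truth > 0.
    """
    # Step 1: rank predictions by descending confidence.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))

    # Step 2: make the curve non-increasing by sweeping from right to left.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])

    # Step 3: area under the curve as a sum of rectangles over recall steps.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap
```

For the multiclass case, mAP would be the plain mean of this value over all classes.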
  4. mAP - mean Average Precision (VOC). Compute the mean of AP over all classes. Problem: These detections give the same contribution to mAP.
  5. Metrics: mAP@[.5:.95] (COCO). 1) For each IoU threshold in [.5:.95] = [0.5, 0.55, 0.6, …, 0.9, 0.95], compute mAP. 2) Average these values to get mAP@[.5:.95]. Also: the log-average miss rate (mMR) is sometimes used.
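The COCO-style averaging on slide 5 can be sketched as below. The helper names (`iou`, `COCO_IOU_THRESHOLDS`, `map_over_thresholds`) are hypothetical, boxes are assumed to be `(x1, y1, x2, y2)` tuples, and `map_at` stands in for whatever routine computes mAP at a single IoU threshold.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


# The ten thresholds 0.5, 0.55, ..., 0.95 (rounded to avoid float drift).
COCO_IOU_THRESHOLDS = [round(0.5 + 0.05 * k, 2) for k in range(10)]


def map_over_thresholds(map_at, thresholds=COCO_IOU_THRESHOLDS):
    """Average a per-threshold mAP function over all IoU thresholds.

    map_at is a caller-supplied function: IoU threshold -> mAP at that threshold.
    """
    values = [map_at(t) for t in thresholds]
    return sum(values) / len(values)
```

Because a detection must pass every stricter threshold to keep counting, tightly localized boxes contribute more to this metric than loosely localized ones.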
  6. Approaches. 1) Classical CV (HOG, Deformable Part Models, Viola-Jones). 2) Motion-based detection (background subtraction). 3) CNN: a) Two-stage: Faster RCNN. b) Single-stage: SSD, YOLO, RetinaNet. c) Cascaded: MTCNN.
  7. Faster RCNN. + Accurate. + Bigger resolution => better result. - Slow. - More objects => more proposals => slower detection.
  8. FocalLoss. Problem: class imbalance 99 : 1. Cross Entropy (CE) vs. Focal Loss (FL), where p_t is the predicted probability of the g.t. class.
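The formulas from slide 8 did not survive in the transcript. As a reference point, here is a minimal sketch of the standard formulation from the Focal Loss paper (Lin et al.), written per example in terms of p_t. It is an assumption that the talk used this form. The class-balancing weight alpha is omitted, and `gamma=2.0` is the paper's default rather than a value confirmed by the slides.

```python
import math


def cross_entropy(p_t):
    """CE(p_t) = -log(p_t), where p_t is the probability of the true class."""
    return -math.log(p_t)


def focal_loss(p_t, gamma=2.0):
    """FL(p_t) = -(1 - p_t)^gamma * log(p_t).

    The (1 - p_t)^gamma factor shrinks the loss of well-classified examples,
    so the many easy background samples stop dominating the total loss.
    gamma=2.0 is the paper's default (assumed here, not taken from the talk).
    """
    return -((1.0 - p_t) ** gamma) * math.log(p_t)
```

With `gamma=0` the focal loss reduces to cross entropy. With `gamma=2` an easy example at p_t = 0.9 is down-weighted by a factor of about 100, while a hard example at p_t = 0.1 keeps about 81% of its cross-entropy loss.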
  9. Small pedestrians. Bigger resolution => better result, but slower. 800x600: 30 fps, ~73.5% AP. 1200x800: 15 fps, ~78.0% AP.
  10. Appendix. Repulsion Loss. Three components: 1) Attraction to the matched g.t. box. 2) Repulsion from other g.t. boxes. 3) Repulsion from other predicted boxes. Technically, IoU is maximized (attraction) or minimized (repulsion).
  11. RetinaMask. 1) RetinaNet adapted to instance segmentation. 2) Mask prediction improves detection quality (~2.3% mAP on COCO). 3) Masks are predicted in Mask-RCNN manner.
  12. RetinaMask. 4) Mask prediction can be discarded during inference to speed up the detector. 5) Code and models available!
  13. Tracking use cases. 1) Tracking itself. 2) Fewer false positives on a video stream. 3) Dealing with “blinking” detections.
  14. SORT (Simple Online and Realtime Tracking). • Association by IoU • Kalman Filters • Fast. We fine-tuned SORT.
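Association by IoU (slide 14) can be illustrated with the sketch below. Note the substitution: SORT itself solves the assignment with the Hungarian algorithm, whereas this example uses a simpler greedy matching to stay dependency-free. The names `iou` and `associate` and the threshold value 0.3 are illustrative assumptions, and boxes are assumed to be `(x1, y1, x2, y2)` tuples.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def associate(tracks, detections, iou_threshold=0.3):
    """Greedily pair predicted track boxes with detections by descending IoU.

    Greedy stand-in for the Hungarian assignment used by SORT.
    Returns (matches, unmatched_track_indices, unmatched_detection_indices).
    """
    pairs = sorted(
        ((iou(t, d), ti, di)
         for ti, t in enumerate(tracks)
         for di, d in enumerate(detections)),
        reverse=True,
    )
    matches, used_tracks, used_dets = [], set(), set()
    for score, ti, di in pairs:
        if score < iou_threshold:
            break  # every remaining pair overlaps even less
        if ti in used_tracks or di in used_dets:
            continue
        matches.append((ti, di))
        used_tracks.add(ti)
        used_dets.add(di)
    unmatched_tracks = [i for i in range(len(tracks)) if i not in used_tracks]
    unmatched_dets = [i for i in range(len(detections)) if i not in used_dets]
    return matches, unmatched_tracks, unmatched_dets
```

Unmatched detections would typically start new tracks, and tracks that stay unmatched for several frames would be dropped.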
  15. Intuition about Kalman Filter in SORT. The box is represented with the vector x = [u, v, s, r, u̇, v̇, ṡ]: • u, v - coordinates of the center • s - box scale • r - box aspect ratio • u̇, v̇, ṡ - corresponding derivatives
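A small sketch of how a corner-format box maps to the observed part of this vector and back. It assumes, as in the SORT paper, that the scale s is the box area and the aspect ratio r is width divided by height. The function names `box_to_z` and `z_to_box` are hypothetical.

```python
import math


def box_to_z(box):
    """Convert (x1, y1, x2, y2) to the observation [u, v, s, r].

    u, v: box center; s: scale (area); r: aspect ratio (width / height).
    """
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return [x1 + w / 2, y1 + h / 2, w * h, w / h]


def z_to_box(z):
    """Convert [u, v, s, r] back to (x1, y1, x2, y2)."""
    u, v, s, r = z
    w = math.sqrt(s * r)  # s * r = (w * h) * (w / h) = w^2
    h = s / w
    return [u - w / 2, v - h / 2, u + w / 2, v + h / 2]
```

The velocity components of the full state are not observed directly. The filter estimates them from successive detections.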
  16. Intuition about Kalman Filter in SORT. Notes: 1. Linear prediction with correction from the detector output. 2. Speed and aspect ratio are assumed constant. 3. Can model many dynamic systems (fluid amount in a tank, the temperature of a car engine).
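The "linear prediction" note can be made concrete with the prediction step of a constant-velocity model over the state [u, v, s, r, u̇, v̇, ṡ]. This sketch covers only the state propagation. The covariance propagation and the correction step from the detector output are omitted, and the helper names are hypothetical.

```python
def transition_matrix():
    """7x7 constant-velocity transition matrix for [u, v, s, r, du, dv, ds]."""
    n = 7
    F = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(3):
        F[i][i + 4] = 1.0  # u += du, v += dv, s += ds (one frame per step)
    return F


def predict(state, F=None):
    """One linear prediction step: x' = F x.

    The aspect ratio r and the velocities carry over unchanged, which is
    the 'constant speed, constant aspect ratio' assumption.
    """
    F = transition_matrix() if F is None else F
    return [sum(F[i][j] * state[j] for j in range(len(state)))
            for i in range(len(F))]
```

For example, a box centered at (100, 50) with area 3200 and velocities (2, -1, 10) is predicted at (102, 49) with area 3210 in the next frame, and its aspect ratio does not change.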
  17. Conclusion. 1) Two-stage detectors are more accurate, but slower. 2) Bigger resolution => better accuracy, but slower. 3) ResNet, FPN, Focal Loss => better result.
  18. Resolution. 1) SGD training instead of Adam. 2) Replacing SSD with RetinaNet arch. 3) FocalLoss. 4) Bigger resolution (current models: 800x600 and 1200x800). 5) scale_by_aspect instead of simple resize. 6) Anchor box tuning. 7) Crop augmentations. 8) Joint training with head detection. 9) Removing strides from convolutions in last stages of RetinaNet. 10) Synchronized Batchnorm (big resolution => small batch size).
  19. Things that did NOT work out. 1) MTCNN for detecting small people. 2) Prediction of the full bounding box instead of the visible one.