
CodeFest 2019. Boris Lestsov (Mail.Ru Group) — Person Detection in Crowds

CodeFest
April 06, 2019


Detecting people in an image or video stream is a hard computer vision problem. Its main difficulties are the variety of possible detection scenarios, the large intra-class variability of the people themselves (clothing, pose), and frequent occlusion between people (crowds are an especially hard case). Historically, many approaches have been devised to solve it, but at present convolutional neural networks demonstrate the best quality. This talk covers building your own production-ready person detection system based on convolutional neural networks and running in real time. It examines specific techniques (architectures, loss functions, training tricks) that substantially improve detection quality.





  1. Person detection in crowds Boris Lestsov Mail.Ru Group

  2. Computer Vision Team We solve computer vision problems at Mail.Ru

    Projects: 1) Vision (b2b) 2) Cloud 3) Mail 4) ...
  3. Business case

  4. Business case 1) Queue optimisation a) Open the elevator (ski

    resort) b) Call the cashier 2) Wait time estimation
  7. Business requirements 1) Works in various setups 2) Real time

    3) Robust on video
  8. Challenges

  9. Challenges Heavy occlusion

  10. Challenges Pose, illumination, clothing variability.

  11. Metrics and datasets

  12. Intersection Over Union (IoU) Measures detection quality for a single bounding

    box. Thresholding IoU gives FP and FN counts.
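The IoU definition on this slide can be sketched in a few lines (a minimal illustration, not the talk's code; boxes are assumed to be (x1, y1, x2, y2) tuples):

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # Union = sum of areas minus intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A detection then counts as a TP if its IoU with a ground-truth box exceeds the chosen threshold; unmatched detections are FP, unmatched ground truth are FN.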
  13. Metrics: AP - Average Precision (single class) 1. Compute predictions.

    2. Plot Precision-Recall Curve, make non-increasing 3. Compute Area Under Curve Multiclass: mean AP (mAP) Different IoU thresholds
  14. mAP - mean Average Precision (VOC) Compute mean of AP

    for all classes. Problem: These detections give the same contribution to mAP.
  15. Metrics: mAP@[.5:.95] (COCO) 1) For each IoU threshold in [.5:.95]

    = [0.5, 0.55, 0.6, …, 0.9, 0.95] compute mAP. 2) Average these values to get mAP@[.5:.95]: Also: log average miss-rate (mMR) is used sometimes
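The AP recipe from slides 13-15 (sort by score, build the precision-recall curve, make it non-increasing, integrate) can be sketched as follows; the function name is illustrative and `matches` (which detections hit a ground-truth box at the chosen IoU threshold) is an assumed input:

```python
def average_precision(scores, matches, num_gt):
    """AP for one class: PR curve with a non-increasing
    precision envelope, integrated over recall."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    precisions, recalls = [], []
    for i in order:
        if matches[i]:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # Make the precision curve non-increasing (right to left)
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Area under the stepwise PR curve
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_r)
        prev_r = r
    return ap
```

For mAP@[.5:.95] this computation is repeated at each IoU threshold in {0.5, 0.55, …, 0.95} and the results are averaged.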
  16. Benchmarks CrowdHuman - our main dataset Custom dataset similar to

    target domain
  17. CrowdHuman examples

  18. CrowdHuman examples

  19. CrowdHuman examples

  20. Why create our own solution? Existing solutions: • Not adapted to

    crowds • Slow inference
  21. Approaches

  22. Approaches 1) Classical CV (HOG, Deformable Part Models, Viola-Jones) 2)

    Motion-based detection (background subtraction) 3) CNN: a) Two stage - Faster RCNN b) Single stage - SSD, YOLO, RetinaNet. c) Cascaded - MTCNN
  23. Faster RCNN

  24. Faster RCNN

  25. Faster RCNN

  26. Faster RCNN

  27. Faster RCNN

  28. Faster RCNN

  29. Faster RCNN Person?

  30. Faster RCNN + Accurate + Bigger resolution => better result

    - Slow - More objects => more proposals => slower detection
  31. Single Shot Detector

  37. SSD (Single Shot Detector) Image

  38. SSD (Single Shot Detector) Extract features

  39. Single Shot Detector Reducing height & width => to detect

    at different scales
  40. Different Scales Smaller scale Bigger scale

  41. Predict displacement Predict ∆x, ∆y

  42. Refine bbox shape Predict Sx , Sy :
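The Δx, Δy and Sx, Sy predictions of slides 41-42 correspond to the standard SSD/Faster R-CNN box parameterisation: center offsets scaled by anchor size, and width/height refined in log space. A minimal decoding sketch (illustrative function name; boxes as (cx, cy, w, h)):

```python
import math

def decode_box(anchor, deltas):
    """Apply predicted (dx, dy, sw, sh) to an anchor (cx, cy, w, h).
    dx, dy shift the center in units of anchor width/height;
    sw, sh rescale width/height in log space."""
    cx, cy, w, h = anchor
    dx, dy, sw, sh = deltas
    new_cx = cx + dx * w
    new_cy = cy + dy * h
    new_w = w * math.exp(sw)
    new_h = h * math.exp(sh)
    return (new_cx, new_cy, new_w, new_h)
```

Zero deltas reproduce the anchor exactly, which is why well-placed prior boxes make the regression task easier.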

  43. Single Shot Detector Predict bounding boxes

  44. Single Shot Detector Merge similar detections with NMS

  45. Non Maximum Suppression
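Greedy NMS, used on slide 44 to merge similar detections, can be sketched like this (an O(n²) reference version, not the production implementation):

```python
def nms(boxes, scores, iou_thr=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box
    and drop all boxes overlapping it above iou_thr."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thr]
    return keep
```

Note that in crowds this greedy suppression is itself a failure mode: two heavily overlapping people can be merged into one detection, which motivates the crowd-specific tricks later in the talk.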

  46. Problems with SSD • Backbone - VGG-16 • 512x512 =>

    breaking aspect ratio
  47. Architecture: RetinaNet 1) Backbone - ResNet 2) Feature Pyramid Network

    (FPN) 3) FocalLoss against class imbalance
  48. Feature Pyramid Network Higher level features for smaller scales

  49. Focal Loss Problem: class imbalance 99 : 1 Cross Entropy (CE):

    CE(pt) = -log(pt) Focal Loss (FL): FL(pt) = -(1 - pt)^γ log(pt), where pt is the predicted probability of the g.t. class
  50. Focal Loss 1) Well classified examples => smaller contribution 2)

    Analogue of Online Hard Example Mining
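The down-weighting of well-classified examples follows directly from the formula (a per-example sketch; real implementations vectorise this over the batch):

```python
import math

def focal_loss(p_t, gamma=2.0):
    """FL(p_t) = -(1 - p_t)^gamma * log(p_t), where p_t is the
    predicted probability of the ground-truth class.
    gamma = 0 recovers plain cross entropy."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)
```

With gamma = 2, an easy example (p_t = 0.9) is scaled by (0.1)^2 = 0.01 relative to cross entropy, while a hard example (p_t = 0.1) keeps most of its loss, so the abundant easy backgrounds stop dominating the gradient.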
  51. Other problems

  52. Problems How many people on this image?

  54. Small pedestrians Bigger resolution => better result, but slower. 800x600

    : 30 fps, ~73.5% AP 1200x800: 15 fps, ~78.0% AP
  55. Crop augmentations Good Crop Bad Crop Random Crop Random Crop

  56. Crop augmentations Good Crop Bad Crop Random Crop Random Crop

    IoU is roughly the same!
  57. Tuned prior boxes Removed horizontal boxes

  58. Tuned prior boxes Smaller boxes
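Prior-box tuning like this amounts to restricting the generator's scales and aspect ratios. A minimal sketch of a prior-box generator (illustrative name; aspect ratio taken as h/w, so keeping only ratios > 1 drops the horizontal boxes mentioned above):

```python
def make_prior_boxes(fmap_w, fmap_h, stride, scales, aspect_ratios):
    """Generate (cx, cy, w, h) prior boxes for one feature map.
    aspect_ratio = h / w; the scale fixes the box area,
    so w = s / sqrt(ar), h = s * sqrt(ar)."""
    boxes = []
    for y in range(fmap_h):
        for x in range(fmap_w):
            # Center of this feature-map cell in image coordinates
            cx = (x + 0.5) * stride
            cy = (y + 0.5) * stride
            for s in scales:
                for ar in aspect_ratios:
                    w = s / ar ** 0.5
                    h = s * ar ** 0.5
                    boxes.append((cx, cy, w, h))
    return boxes
```

Tall, person-shaped priors with smaller scales then correspond to aspect_ratios > 1 and a reduced `scales` list.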

  59. Detection examples

  60. Prediction Examples

  61. Prediction Examples

  62. Failure Case Extra box between people

  63. Future work 1) Repulsion Loss 2) RetinaMask 3) Replace ResNet-50

    with a better backbone model.
  64. Appendix Repulsion Loss Three components: 1) Attraction to matched g.t.

    box. 2) Repulsion from other g.t. boxes. 3) Repulsion from other predicted boxes. Technically, IoU is maximized/minimized.
  65. RetinaMask 1) RetinaNet adapted to instance segmentation 2) Mask prediction

    improves detection quality (~2.3% mAP on COCO). 3) Masks are predicted in Mask-RCNN manner.
  66. RetinaMask 4) Mask prediction can be discarded during inference to

    speed up the detector. 5) Code and models available!
  67. Tracking

  68. Blinking problem

  69. Tracking use cases 1) Tracking itself 2) Less False Positives

    on a video stream. 3) Deal with “blinking” detections.
  70. SORT (Simple Online and Realtime Tracking) • Association by IoU

    • Kalman Filters • Fast We fine-tuned SORT
  71. Intuition about Kalman Filter in SORT Box is represented with

    vector: • u,v - coordinates of the center • s - box scale • r - box aspect ratio • dotted u, v, s - corresponding derivatives
  72. Intuition about Kalman Filter in SORT Notes: 1. Linear prediction

    with correction from detector output. 2. Speed, aspect ratio are constant. 3. Can model many dynamic systems (fluid amount in a tank, the temperature of a car engine).
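As a rough illustration of the predict/correct idea (not SORT's actual filter: this is a simplified alpha-beta filter with hand-picked gains, standing in for a full Kalman filter with covariance tracking):

```python
class ConstantVelocityFilter:
    """Toy stand-in for the Kalman filter in SORT.
    State is (u, v, s) plus their velocities; each step does a
    linear prediction, then a correction toward the detector's
    measurement. r (aspect ratio) is held constant, as in SORT."""

    def __init__(self, u, v, s, r, alpha=0.5, beta=0.1):
        self.pos = [u, v, s]          # box center and scale
        self.vel = [0.0, 0.0, 0.0]    # their derivatives
        self.r = r
        self.alpha, self.beta = alpha, beta

    def predict(self):
        # Linear (constant-velocity) prediction
        for i in range(3):
            self.pos[i] += self.vel[i]
        return (*self.pos, self.r)

    def update(self, u, v, s):
        # Blend the measurement into position and velocity
        for i, z in enumerate((u, v, s)):
            residual = z - self.pos[i]
            self.pos[i] += self.alpha * residual
            self.vel[i] += self.beta * residual
```

This captures why tracking smooths "blinking" detections: on a frame with no match, `predict()` still produces a plausible box from the accumulated velocity.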
  73. With tracking

  74. Tracking example

  75. Queues in canteen 1) Multiple cameras 2) Zoning

  76. Conclusion 1) Two stage detectors are more accurate, but slower

    2) Bigger resolution => better accuracy, slower 3) ResNet, FPN, Focal Loss => better result
  77. Thanks!

  78. Appendix

  80. Resolution 1) SGD training instead of Adam. 2) Replacing SSD

    with RetinaNet arch. 3) FocalLoss 4) Bigger resolution (current models: 800x600 and 1200x800) 5) scale_by_aspect instead of simple resize. 6) Anchor box tuning. 7) Crop augmentations 8) Joint training with head detection. 9) Removing strides from convolutions in last stages of RetinaNet. 10) Synchronized Batchnorm (big resolution => small batch size)
  81. Things that did NOT work out 1) MTCNN for detecting

    small people. 2) Prediction of full bounding box instead of the visible one.