
kaggle NFL 1st and Future - Impact Detection


My 14th place solution and summary of top 10 solutions.

Kazuyuki Miyazawa

January 22, 2021

Transcript

  1. Nov 16, 2020 - Jan 4, 2021 https://www.kaggle.com/c/nfl-impact-detection/overview


  2. Kazuyuki Miyazawa
    Group Leader
    AI R&D Group 2
    AI System Dept.
    Mobility Technologies Co., Ltd.
    Past Work Experience
    April 2019 - March 2020
    AI Research Engineer@DeNA Co., Ltd.
    April 2010 - March 2019
    Research Scientist@Mitsubishi Electric Corp.
    Education
    PhD in Information Science@Tohoku University
    @kzykmyzw


  3. ■ Detect helmet impacts that happen in NFL games using videos from sideline and endzone, and player
    tracking data
    ■ For training, 9947 still images are provided for helmet detection, and 60 video pairs (sideline and
    endzone) and player tracking data are provided for helmet impact detection
    ■ Video frame rate is 59.94, and duration is around 10 seconds
    Provided data (figure; the legend distinguishes items available in training only from items available in training and test):
    ■ images + helmet bboxes
    ■ videos from endzone + helmet bboxes w/ player ID, impact information
    ■ videos from sideline + helmet bboxes w/ player ID, impact information
    ■ player tracking data + player positions w/ player ID, players' speed, acceleration, etc.
    ■ time-synced


  4. ■ Labels for images
    ■ image: the image file name.
    ■ label: the label type (Helmet, Helmet-Blurred, Helmet-Difficult, Helmet-Sideline, Helmet-Partial).
    ■ [left/width/top/height]: the specification of the bounding box of the label, with left=0 and top=0 being the top left corner.
    ■ Labels for videos
    ■ gameKey: the ID code for the game.
    ■ playID: the ID code for the play.
    ■ view: the camera orientation.
    ■ video: the filename of the associated video.
    ■ frame: the frame number for this play.
    ■ label: the associated player's number.
    ■ [left/width/top/height]: the specification of the bounding box of the prediction.
    ■ impact: an indicator (1 = helmet impact) for bounding boxes associated with helmet impacts
    ■ confidence: 1 = Possible, 2 = Definitive, 3 = Definitive and Obvious
    ■ visibility: 0 = Not Visible from View, 1 = Minimum, 2 = Visible, 3 = Clearly Visible
    ■ impactType: a description of the type of helmet impact: helmet, shoulder, body, ground, etc.
    For the purposes of evaluation, definitive helmet impacts are defined as meeting three criteria:
    ● impact = 1
    ● confidence > 1
    ● visibility > 0
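
The three criteria above can be expressed as a simple row filter. This is a minimal sketch that assumes each video label row is a dict keyed by the column names listed on this slide; `is_definitive_impact` and the sample rows are hypothetical and not part of the competition code.

```python
def is_definitive_impact(row):
    """True if a video-label row meets all three criteria on the slide."""
    return row["impact"] == 1 and row["confidence"] > 1 and row["visibility"] > 0

# Hypothetical label rows for illustration only.
rows = [
    {"frame": 10, "impact": 1, "confidence": 3, "visibility": 2},  # definitive
    {"frame": 11, "impact": 1, "confidence": 1, "visibility": 2},  # only "Possible"
    {"frame": 12, "impact": 1, "confidence": 2, "visibility": 0},  # not visible
    {"frame": 13, "impact": 0, "confidence": 0, "visibility": 0},  # no impact
]
definitive_frames = [r["frame"] for r in rows if is_definitive_impact(r)]
```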


  5. Helmet w/o impact
    Helmet w/ impact


  6. Helmet w/o impact
    Helmet w/ impact


  7. ■ gameKey: the ID code for the game.
    ■ playID: the ID code for the play.
    ■ player: the player's ID code.
    ■ time: timestamp at 10 Hz.
    ■ x: player position along the long axis of the field.
    ■ y: player position along the short axis of the field.
    ■ s: speed in yards/second.
    ■ a: acceleration in yards/second^2.
    ■ dis: distance traveled from prior time point, in yards.
    ■ o: orientation of player (deg).
    ■ dir: angle of player motion (deg).
    ■ event: game events like a snap, whistle, etc.
    In test, we cannot directly map players’ positions in given tracking data to detected players in videos.


  8. Visualized by @hidehisaarai1213’s notebook


  9. ■ gameKey: the ID code for the game.
    ■ playID: the ID code for the play.
    ■ view: the camera orientation.
    ■ video: the filename of the associated video.
    ■ frame: the frame number for this play.
    ■ [left/width/top/height]: the specification of the bounding box of the prediction.


  10. ■ F1 score at an IoU threshold of 0.35
    ■ For a given ground truth impact, a prediction within +/- 4 frames (9 frames total) within the same play can
    be accepted as valid without necessarily degrading the score.
    ■ If one or more predictions are assigned to more than one ground truth box, the metric will optimize for
    the assignments between the prediction(s) and the ground truth boxes that lead to the highest total
    number of True Positives (thereby maximizing the F1 score). At most one prediction will be assigned to
    any ground truth box and vice versa.
    https://www.kaggle.com/c/nfl-impact-detection/discussion/197672
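
The metric described above can be sketched as a maximum bipartite matching between ground truth impacts and predictions. This is not the official scoring code: the function names, the use of `>=` at the IoU threshold, and the augmenting-path matching are assumptions chosen to follow the description (one-to-one assignment that maximizes true positives). Boxes use the (left, width, top, height) order from the label files.

```python
def iou(a, b):
    """IoU of two boxes given as (left, width, top, height)."""
    ax2, ay2 = a[0] + a[1], a[2] + a[3]
    bx2, by2 = b[0] + b[1], b[2] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[2], b[2]))
    inter = iw * ih
    union = a[1] * a[3] + b[1] * b[3] - inter
    return inter / union if union > 0 else 0.0

def f1_score(gts, preds, iou_thr=0.35, frame_tol=4):
    """gts/preds: lists of (frame, box) for one play and one view."""
    # candidates[i]: indices of predictions that may be matched to ground truth i
    candidates = [
        [j for j, (pf, pb) in enumerate(preds)
         if abs(pf - gf) <= frame_tol and iou(gb, pb) >= iou_thr]
        for gf, gb in gts
    ]
    match_of_pred = {}  # prediction index -> ground truth index

    def try_assign(i, seen):
        # Augmenting-path step: give ground truth i a prediction,
        # re-assigning earlier matches when that frees one up.
        for j in candidates[i]:
            if j in seen:
                continue
            seen.add(j)
            if j not in match_of_pred or try_assign(match_of_pred[j], seen):
                match_of_pred[j] = i
                return True
        return False

    tp = sum(try_assign(i, set()) for i in range(len(gts)))
    fp, fn = len(preds) - tp, len(gts) - tp
    return 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0

box = (0, 10, 0, 10)
gts = [(10, box), (4, box)]
preds = [(7, box), (13, box), (50, box)]  # the last one is a false positive
score = f1_score(gts, preds)
```

In this example the prediction at frame 7 is within 4 frames of both ground truths, so a greedy assignment could waste it; the matching step re-assigns it so that both ground truths are covered.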


  11. ■ Only use videos and images
    ■ 2 stage pipeline (detection + classification)
    ■ NMS using tracking results
    (Figure: helmet detection on every frame → classification on every bbox → post processing → results; figure label: ± n frames)


  12. ■ Use DetectoRS (no strong reason other than it’s implemented in MMDetection and is easy to use)
    ■ Train using images + 80% of training video and validate using the other 20%
    ■ Detection performance seemed to be quite high, so I didn't pursue further accuracy gains through ensembling or TTA


    ■ Crop every bbox with expansion
    ■ Also crop bboxes at the same position from ±N frames (2N + 1 bboxes total)
    ■ 2N + 1 bboxes are concatenated in channel direction and fed into ResNet-50
    ■ ResNet outputs the probability that the input bboxes include an impact
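    (Figure: a bbox of size w x h is expanded to a 4 x max(w, h) by 4 x max(w, h) crop; the crops from -N frames to +N frames are resized to 224 x 224 x 3 x (2N + 1) and fed into ResNet-50, which outputs impact or not)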
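
The crop geometry above can be sketched as follows. This is an illustration under assumptions: the crop is centered on the bbox center, frames are 0-indexed, and frame indices are clamped at the video boundaries (the slide does not say how boundaries are handled). `square_crop_box` and `neighbor_frames` are hypothetical helper names.

```python
def square_crop_box(left, top, w, h, scale=4):
    """Square crop of side scale * max(w, h), centered on the bbox center."""
    side = scale * max(w, h)
    cx, cy = left + w / 2, top + h / 2
    return (cx - side / 2, cy - side / 2, side, side)  # (left, top, width, height)

def neighbor_frames(frame, n, num_frames):
    """Frame indices frame-n .. frame+n, clamped to the video (an assumption)."""
    return [min(max(f, 0), num_frames - 1) for f in range(frame - n, frame + n + 1)]

crop = square_crop_box(left=100, top=50, w=20, h=30)
frames = neighbor_frames(frame=1, n=2, num_frames=600)
num_channels = 3 * len(frames)  # RGB crops concatenated in channel direction
```

With N = 2 the classifier input has 15 channels, and with N = 4 it has 27.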


  14. ■ Split endzone-sideline video pairs into 80% (48 pairs) train and 20% (12 pairs) val
    ■ Since positive samples (bboxes which have impact labels) are only 0.18% of total samples, I employed
    over-sampling to balance positive and negative samples
    ■ Augmentation: LR flip, color jitter, bbox position jitter, bbox size jitter
    ■ No cross validation because of time and computational resource limitations
    ■ Score calculation by @nvnn’s notebook
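
One common way to realize the over-sampling mentioned above is to draw training indices with class-balanced weights. The slide does not state how the over-sampling was implemented, so this is only a sketch of one option; `balanced_sample` is a hypothetical helper and it assumes both classes are present.

```python
import random

def balanced_sample(labels, k, seed=0):
    """Draw k indices so that positives and negatives are about equally likely."""
    rng = random.Random(seed)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Each class gets half of the total probability mass.
    weights = [0.5 / n_pos if y else 0.5 / n_neg for y in labels]
    return rng.choices(range(len(labels)), weights=weights, k=k)

labels = [1] * 18 + [0] * 9982  # 0.18% positives, matching the ratio on the slide
idx = balanced_sample(labels, k=10000)
pos_fraction = sum(labels[i] for i in idx) / len(idx)
```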


  15. ■ Train two ResNets for different types of labels
    ■ Type-I: Assign TRUE labels only for the bboxes of impact timing
    ■ Type-II: Assign TRUE labels for the bboxes of impact timing and ±4 consecutive bboxes
    ■ Type-I uses ±2 bboxes as inputs, and Type-II uses ±4 bboxes as inputs (N = 2 and N = 4 in P.18)
    ■ ResNet-Type-I and -Type-II achieve high precision and high recall, respectively, so their ensembling leads
    to performance gain (Type-I: 0.40 + Type-II: 0.38 → 0.46)
    ■ Tried adding other models such as EfficientNet, but ultimately kept only the two ResNets based on local validation
    (Figure: TRUE/FALSE label assignment around the impact frame for Type-I and Type-II; chart of Recall, Precision, and F1)
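
The two labeling schemes can be sketched for a single tracked helmet as follows. `make_labels` is a hypothetical helper; it assumes 0-indexed frames and that labels near the start or end of the video are simply truncated.

```python
def make_labels(num_frames, impact_frames, radius):
    """Per-frame TRUE/FALSE labels for one helmet.

    radius=0 marks only the impact frame (Type-I);
    radius=4 also marks the ±4 neighboring frames (Type-II).
    """
    labels = [False] * num_frames
    for f in impact_frames:
        for g in range(max(0, f - radius), min(num_frames, f + radius + 1)):
            labels[g] = True
    return labels

type1 = make_labels(20, [10], radius=0)
type2 = make_labels(20, [10], radius=4)
```

Type-II has nine positives per impact instead of one, which is consistent with the higher recall reported for that model.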


  16. ■ NMS in temporal domain is necessary since classifying all the detected bboxes produces a lot of false
    positives
    ■ Employ IoU-based tracking and pick the bbox which has the highest confidence value in each track
    ■ Remove bboxes which couldn't be tracked
    ■ Remove bboxes whose confidence value is less than the threshold
    (Figure: tracks along the time axis t; a track is kept when its max conf > threshold and removed when its max conf < threshold)
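
The post-processing above can be sketched as a greedy IoU tracker followed by a per-track maximum. The slide gives no thresholds or linking rules, so `iou_thr`, `conf_thr`, and the choice to link only consecutive frames are assumptions, and `temporal_nms` is a hypothetical name. Boxes are (left, width, top, height).

```python
def iou(a, b):
    """IoU of two boxes given as (left, width, top, height)."""
    iw = max(0.0, min(a[0] + a[1], b[0] + b[1]) - max(a[0], b[0]))
    ih = max(0.0, min(a[2] + a[3], b[2] + b[3]) - max(a[2], b[2]))
    inter = iw * ih
    union = a[1] * a[3] + b[1] * b[3] - inter
    return inter / union if union > 0 else 0.0

def temporal_nms(dets, iou_thr=0.3, conf_thr=0.5):
    """dets: list of (frame, box, conf). Returns one detection per kept track."""
    tracks = []  # each track is a list of detections in frame order
    for det in sorted(dets, key=lambda d: d[0]):
        frame, box, _ = det
        best, best_iou = None, iou_thr
        for tr in tracks:
            last_frame, last_box, _ = tr[-1]
            if frame - last_frame == 1:  # link only to the previous frame
                v = iou(box, last_box)
                if v >= best_iou:
                    best, best_iou = tr, v
        if best is None:
            tracks.append([det])
        else:
            best.append(det)
    kept = []
    for tr in tracks:
        if len(tr) < 2:  # bbox that could not be tracked
            continue
        top = max(tr, key=lambda d: d[2])
        if top[2] >= conf_thr:  # drop tracks whose best confidence is too low
            kept.append(top)
    return kept

dets = [
    (1, (0, 10, 0, 10), 0.2), (2, (1, 10, 0, 10), 0.9), (3, (2, 10, 0, 10), 0.4),
    (1, (100, 10, 100, 10), 0.3), (2, (101, 10, 100, 10), 0.4),  # low-confidence track
    (5, (200, 10, 200, 10), 0.95),                               # isolated bbox
]
result = temporal_nms(dets)
```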


  17. Score
    single, w/o post processing
    single, w/ post processing
    ensemble (2~4 models)
    finetune w/ val data (2~3 models)


  18. https://www.kaggle.com/c/nfl-impact-detection/discussion/209403
    Pipeline: detection → tracking → classification (16 frames) → pp → results
    ■ Detection: 1-class YOLOv5
    ■ Tracking: Track helmet and estimate the average helmet velocity over a few surrounding frames by optical flow.
    ■ Classification: Normalize size of helmet to 128 x 128 x 3 x 16.
    ■ Classification: Correct helmet movement by optical flow to differentiate (i) helmet at constant velocity and (ii) helmet during acceleration.
    ■ Classification: Ensemble of EfficientNet B0-B3, ResNet-18, and ResNet-34 with TSM (Temporal Shift Module).
    ■ Classification: Mark 3 frames around the impact as positive and use 5 or 10% positive samples. Add the false positive prediction from a few undertrained detection models.
    ■ Classification: Average predictions of multiple models from 4 folds.
    ■ pp: NMS in temporal direction using tracking results


  19. ■ Can be inserted into 2D CNN backbone to enable joint spatial-temporal modeling at no additional cost.
    ■ Shift part of the channels along the temporal dimension, thus facilitating information exchange among
    neighboring frames.
    ■ Support both offline (bi-direction) and online (uni-direction) video recognition.
    https://arxiv.org/abs/1811.08383
    After the competition, I evaluated TSM in my own pipeline, and it showed
    better performance compared to 2D CNN (0.380 → 0.436)
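
The channel shift described above can be illustrated on a toy feature array. This is a simplified sketch, not the TSM reference implementation: it drops the batch and spatial dimensions, uses plain lists indexed as `x[t][c]`, and assumes the bi-directional (offline) variant with zero padding and 1/8 of the channels shifted in each direction.

```python
def temporal_shift(x, fold_div=8):
    """x[t][c] is the value of channel c at frame t; returns a shifted copy."""
    T, C = len(x), len(x[0])
    fold = C // fold_div
    out = [[0.0] * C for _ in range(T)]
    for t in range(T):
        for c in range(C):
            if c < fold:            # first group: take the value from the next frame
                src = t + 1
            elif c < 2 * fold:      # second group: take the value from the previous frame
                src = t - 1
            else:                   # remaining channels stay in place
                src = t
            if 0 <= src < T:        # out-of-range positions keep the zero padding
                out[t][c] = x[src][c]
    return out

# Toy input: 3 frames, 8 channels, value = 10 * frame + channel.
x = [[10 * t + c for c in range(8)] for t in range(3)]
shifted = temporal_shift(x)
```

The shift itself only moves values between frames and has no learnable parameters, which is why it can be inserted into a 2D CNN without adding computation.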


  20. https://www.kaggle.com/c/nfl-impact-detection/discussion/208979
    Pipeline: detection → classification (9 frames) → pp → results
    ■ Detection: 1-class YOLOv5
    ■ Classification: Crop 2x width and height of the original bbox.
    ■ Classification: Ensemble of 6 different EfficientNets (B3 and B5) + horizontal flip TTA.
    ■ Classification: Replace the first 2D conv layers in the inverted residual blocks of EfficientNet with 3D conv layers.
    ■ Classification: Predict the different impact types for the center frame as output variable (no impact, helmet, shoulder, body, ground impact) and optimize a softmax loss with class weights split 0.8:0.2 (non-impact : impact).
    ■ Classification: Select all of the positive impact samples and a random sample of negative impact samples according to a specified ratio (0.99:0.01 non-impact:impact) at each epoch.
    ■ pp: Thresholding using stage 1 score.
    ■ pp: Filter out any frame earlier than 25.
    ■ pp: NMS based on IoU to filter out duplicate boxes in subsequent frames.
    ■ pp: Consider the top 19 predicted boxes based on their stage 2 score and remove boxes below a threshold of 0.15.
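
Part of this post-processing can be sketched as below. It covers only the frame filter, the top-19 selection, and the 0.15 threshold; the stage 1 thresholding and the IoU-based NMS are omitted. Applying the top-19 limit per video is an assumption (the slide does not state the scope), and `postprocess` is a hypothetical name.

```python
def postprocess(preds, min_frame=25, top_k=19, score_thr=0.15):
    """preds: list of (frame, stage2_score) for one video (scope is an assumption)."""
    late = [p for p in preds if p[0] >= min_frame]        # drop frames earlier than 25
    ranked = sorted(late, key=lambda p: p[1], reverse=True)[:top_k]
    return [p for p in ranked if p[1] >= score_thr]       # drop low-score boxes

kept = postprocess([(10, 0.9), (30, 0.1), (40, 0.5), (50, 0.2)])
many = postprocess([(100 + i, 0.5) for i in range(30)])
```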


  21. https://www.kaggle.com/c/nfl-impact-detection/discussion/208787
    Pipeline: detection → classification (9 frames) → pp → results
    ■ Detection: Ensemble of 7 EfficientDet models (WBF)
    ■ Detection: Detect helmet w/ impact and w/o impact separately
    ■ Detection: Recall is around 0.97
    ■ Detection: Detected helmets with impact can be candidates.
    ■ Classification: Crop 3x width and height of the original bbox.
    ■ Classification: Convert to grayscale.
    ■ Classification: Ensemble of 18 EfficientNets and ReXNets.
    ■ Classification: Regard the impact bbox and its ±1 frame corresponding bbox as positive.
    ■ pp: Tune separated thresholds depending on the predictions in the other view. For instance, the threshold for a certain Endzone frame depends on whether there is a predicted bbox in the Sideline view within ±1 frame. If yes, the threshold is lower (say 0.25); if not, the threshold is higher (say 0.45).
    ■ pp: NMS based on IoU to filter out duplicate boxes in subsequent frames.
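
The view-dependent threshold can be sketched as follows, using the example values 0.25 and 0.45 quoted above. How the solution decides that the other view "has a predicted bbox" is not specified on the slide, so `other_view_frames` is simply taken as given here, and `keep_with_cross_view` is a hypothetical name.

```python
def keep_with_cross_view(preds, other_view_frames, low=0.25, high=0.45, tol=1):
    """preds: list of (frame, score) in one view.

    other_view_frames: frames that have a predicted bbox in the other view.
    """
    kept = []
    for frame, score in preds:
        supported = any(abs(frame - f) <= tol for f in other_view_frames)
        thr = low if supported else high  # lower threshold when the other view agrees
        if score >= thr:
            kept.append((frame, score))
    return kept

endzone_preds = [(10, 0.3), (20, 0.3), (30, 0.5)]
sideline_frames = [11]
kept_endzone = keep_with_cross_view(endzone_preds, sideline_frames)
```

Here the prediction at frame 10 survives with a score of 0.3 because the sideline view has a prediction at frame 11, while the equally scored prediction at frame 20 is removed.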


  22. https://www.kaggle.com/c/nfl-impact-detection/discussion/208947
    Pipeline: detection → classification (20 frames) → pp → results, with two classification variants
    ■ Detection: 1-class Faster R-CNN
    ■ Classification (variant 1, 20 frames): A sequence of full frames is fed to the 3D CNN input, a feature map is calculated, and using the ROIAlign operation, features for the ROIs are extracted and classified.
    ■ Classification (variant 2, 20 frames): A sequence of frames is cropped around the target box, then the sequence of crops is fed to the 3D CNN input, and then the impact probability is calculated directly.
    ■ Use 5 input channels instead of 3 RGB channels. The first additional channel is the heatmap of the center of the helmet of interest. The second additional channel is the heatmap of the centers of all helmets.
    ■ pp: NMS based on IoU to filter out duplicate boxes in subsequent frames.


  23. https://www.kaggle.com/c/nfl-impact-detection/discussion/209235
    ■ Make patches of 224 x 224 using a grid with a constant step.
    ■ 3D feature maps from second to fifth blocks of the 3D CNN are passed to FPN.
    ■ FPN produces a 6 x 56 x 56 grid. 1: presence of helmet; 2-5: position and size of helmet; 6: presence of impact
    ■ 1st stage training: 8 frames, I3D FPN
    ■ 2nd stage training: 20 frames, I3D FPN (figure label: Freeze)
    ■ pp: NMS based on IoU to filter out duplicate boxes in subsequent frames.
    ■ pp: Remove predictions with low confidence if there are no predictions in the other view.


  24. https://www.kaggle.com/c/nfl-impact-detection/discussion/208833
    Pipeline: detection → classification (9 frames) → pp → results
    ■ Detection: 1-class EfficientDet-D5
    ■ Classification: Crop 3x width and height of the original bbox.
    ■ Classification: Resize to 112 x 112
    ■ Classification: Ensemble of 2D ResNet-18 and 3D ResNet-18
    ■ Classification: Augmentation: HorizontalFlip, RandomBrightness, RandomContrast, one of (MotionBlur, MedianBlur, GaussianBlur, GaussNoise), HueSaturationValue, ShiftScaleRotate, Cutout, Bbox jitter
    ■ pp: If helmets are detected in the same position within 4 frames, only the middle frame is kept.
    ■ pp: Ignore the first and last 10 frames of the video because few collisions are expected there.


  25. https://www.kaggle.com/c/nfl-impact-detection/discussion/208851
    Pipeline: detection → classification → pp → results
    ■ Detection: Combine 4 folds with TTA for CenterNet.
    ■ Detection: 8 consecutive frames are passed through the encoder individually, then the intermediate features are concatenated and fed through a UNet-like decoder to produce an output heatmap & impacts map for 8 frames.
    ■ Classification: 3D ResNet-50, 8 frames
    ■ pp: NMS based on IoU to filter out duplicate boxes in subsequent frames.


  26. https://www.kaggle.com/c/nfl-impact-detection/discussion/209012
    Pipeline: detection → classification (9 frames, every 2 frames) → pp → results
    ■ Detection: Detect helmet w/ impact and w/o impact separately using DetectoRS (train 1-class detector as warm-up)
    ■ Classification: Crop 1.22x width and height of the original bbox.
    ■ Classification: 3D CNN (I3D, SlowFast)
    ■ pp: Different thresholds for Endzone and Sideline views.
    ■ pp: Different thresholds over time.
    ■ pp: Use an IoU threshold and frame-difference threshold to cluster detections (through multiple frames) which belong to the same player and remove FP
    ■ pp: Impacts that are detected with a confidence lower than T are removed if no impact is found in the other view.


  27. https://www.kaggle.com/c/nfl-impact-detection/discussion/208773
    Pipeline: detection (15 frames) → pp → results
    ■ Detection: Stack multiple CenterNet heads on top of the feature extraction block (EfficientNet-B5). Each head is responsible for predicting helmets for one frame.
    ■ Detection: Calculate the loss independently for the 2 classes (helmet w/ impact and helmet w/o impact), then use a weighted sum of the 2 losses.
    ■ pp: NMS based on IoU to filter out duplicate boxes in subsequent frames.
    ■ pp: Dynamic confidence threshold: low threshold for the 30th-80th frames (impact is most likely to happen in this period), then slowly increase the threshold.

  30. https://www.kaggle.com/c/nfl-impact-detection/discussion/208767
    MAYBE YES


  31. https://hrmos.co/pages/mo-t/jobs


  32. Please refrain from reproducing or copying the text, images, or other content without permission.
