Dept. Mobility Technologies Co., Ltd.
Past Work Experience
▪ April 2019 - March 2020: AI Research Engineer @ DeNA Co., Ltd.
▪ April 2010 - March 2019: Research Scientist @ Mitsubishi Electric Corp.
Education
▪ PhD in Information Science @ Tohoku University
@kzykmyzw
▪ Data consist of still images, videos from sideline and endzone, and player tracking data
▪ For training, 9,947 still images are provided for helmet detection, and 60 video pairs (sideline and endzone) and player tracking data are provided for helmet impact detection
▪ Video frame rate is 59.94 fps, and duration is around 10 seconds
▪ Annotations: still images have helmet bboxes; training videos have helmet bboxes with player IDs and impact information; player tracking data give player positions with player IDs plus speed, acceleration, etc.
▪ Sideline/endzone videos and player tracking data are time-synced
▪ Still images are available only in training; videos and player tracking data are available in both training and test
▪ Labels for images
▪ label: the label type (Helmet, Helmet-Blurred, Helmet-Difficult, Helmet-Sideline, Helmet-Partial)
▪ [left/width/top/height]: the specification of the bounding box of the label, with left=0 and top=0 being the top left corner
▪ Labels for videos
▪ gameKey: the ID code for the game
▪ playID: the ID code for the play
▪ view: the camera orientation
▪ video: the filename of the associated video
▪ frame: the frame number for this play
▪ label: the associated player's number
▪ [left/width/top/height]: the specification of the bounding box of the prediction
▪ impact: an indicator (1 = helmet impact) for bounding boxes associated with helmet impacts
▪ confidence: 1 = Possible, 2 = Definitive, 3 = Definitive and Obvious
▪ visibility: 0 = Not Visible from View, 1 = Minimum, 2 = Visible, 3 = Clearly Visible
▪ impactType: a description of the type of helmet impact: helmet, shoulder, body, ground, etc.
For the purposes of evaluation, definitive helmet impacts are defined as meeting three criteria:
• impact = 1
• confidence > 1
• visibility > 0
▪ playID: the ID code for the play
▪ player: the player's ID code
▪ time: timestamp at 10 Hz
▪ x: player position along the long axis of the field
▪ y: player position along the short axis of the field
▪ s: speed in yards/second
▪ a: acceleration in yards/second^2
▪ dis: distance traveled from prior time point, in yards
▪ o: orientation of player (deg)
▪ dir: angle of player motion (deg)
▪ event: game events like a snap, whistle, etc.
In test, we cannot directly map players' positions in the given tracking data to detected players in the videos.
▪ playID: the ID code for the play
▪ view: the camera orientation
▪ video: the filename of the associated video
▪ frame: the frame number for this play
▪ [left/width/top/height]: the specification of the bounding box of the prediction
▪ For a given ground truth impact, a prediction within ±4 frames (9 frames total) within the same play can be accepted as valid without degrading the score
▪ If one or more predictions can be assigned to more than one ground truth box, the metric optimizes the assignment between predictions and ground truth boxes so as to maximize the total number of True Positives (thereby maximizing the F1 score); at most one prediction is assigned to any ground truth box and vice versa (see the sketch below)
https://www.kaggle.com/c/nfl-impact-detection/discussion/197672
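The matching rule can be made concrete with a small sketch. This is not the official metric code: the IoU threshold, the greedy (rather than globally optimal) assignment, and the box format [left, width, top, height] are assumptions for illustration.

```python
# Simplified sketch of the evaluation matching described above (not the
# official metric): a prediction counts as a True Positive if it overlaps a
# ground truth box and lies within +/-4 frames, with one-to-one assignment.
IOU_THRESH = 0.35  # assumed value; see the discussion link for details

def iou(a, b):
    # a, b: [left, width, top, height]
    ax1, ay1, ax2, ay2 = a[0], a[2], a[0] + a[1], a[2] + a[3]
    bx1, by1, bx2, by2 = b[0], b[2], b[0] + b[1], b[2] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[1] * a[3] + b[1] * b[3] - inter
    return inter / union if union > 0 else 0.0

def count_true_positives(preds, gts):
    """preds / gts: lists of dicts with keys 'frame' and 'box'."""
    pairs = []
    for i, p in enumerate(preds):
        for j, g in enumerate(gts):
            if abs(p["frame"] - g["frame"]) <= 4:  # +/-4 frame tolerance
                ov = iou(p["box"], g["box"])
                if ov >= IOU_THRESH:
                    pairs.append((ov, i, j))
    used_p, used_g, tp = set(), set(), 0
    for ov, i, j in sorted(pairs, reverse=True):   # best overlaps first
        if i not in used_p and j not in used_g:    # one-to-one assignment
            used_p.add(i)
            used_g.add(j)
            tp += 1
    return tp
```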
▪ Two-stage approach (detection + classification) with post-processing (NMS using tracking results)
▪ Pipeline: helmet detection on every frame → classification on every bbox using crops from ±n frames → post-processing → results
▪ Used a detector that is implemented in MMDetection and is easy to use
▪ Train using still images + 80% of the training videos and validate using the other 20% (see the split sketch below)
▪ Detection performance seems to be quite high, so I did not pursue further accuracy through ensembling or TTA
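A minimal sketch of the 80/20 split at the play level, so that the sideline and endzone views of the same play never straddle train and val. The file name "train_labels.csv" and the video naming pattern are assumptions about the data layout, not details from the deck.

```python
# Hedged sketch: split training videos 80/20 by play, assuming labels live in
# "train_labels.csv" with a "video" column like "<gameKey>_<playID>_<view>.mp4".
import numpy as np
import pandas as pd

labels = pd.read_csv("train_labels.csv")                 # assumed path
play_id = labels["video"].str.rsplit("_", n=1).str[0]    # "<gameKey>_<playID>"
plays = play_id.unique()

rng = np.random.default_rng(42)
rng.shuffle(plays)
n_train = int(len(plays) * 0.8)
train_df = labels[play_id.isin(plays[:n_train])]
val_df = labels[play_id.isin(plays[n_train:])]
```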
▪ Crop regions at the same position from ±N frames (2N + 1 crops total); crop size is 4 x max(w, h) for a bbox of width w and height h
▪ The 2N + 1 crops are resized to 224 x 224 and concatenated in the channel direction (224 x 224 x 3 x (2N + 1)) before being fed into ResNet-50 (see the sketch below)
▪ ResNet outputs the probability that the input bboxes include an impact
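A minimal PyTorch sketch of this classifier, assuming torchvision's ResNet-50 with its first conv widened to 3 × (2N + 1) input channels; the exact model surgery and preprocessing are not taken from the deck.

```python
import torch
import torch.nn as nn
import torchvision

N = 2  # +/-N frames around the target frame

class ImpactClassifier(nn.Module):
    def __init__(self, n_frames=2 * N + 1):
        super().__init__()
        # In practice an ImageNet-pretrained backbone would likely be used.
        self.backbone = torchvision.models.resnet50()
        # Widen the stem so it accepts 3 * n_frames channels instead of 3.
        self.backbone.conv1 = nn.Conv2d(3 * n_frames, 64, kernel_size=7,
                                        stride=2, padding=3, bias=False)
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, 1)

    def forward(self, x):
        # x: (batch, 3 * n_frames, 224, 224) -> impact probability
        return torch.sigmoid(self.backbone(x))

# crops: 2N + 1 crops of the same bbox position, each (3, 224, 224)
crops = [torch.rand(3, 224, 224) for _ in range(2 * N + 1)]
x = torch.cat(crops, dim=0).unsqueeze(0)   # (1, 3 * (2N + 1), 224, 224)
prob = ImpactClassifier()(x)
```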
▪ Train using 80% of the training video pairs and validate using the other 20% (12 pairs)
▪ Since positive samples (bboxes which have impact labels) are only 0.18% of all samples, employed over-sampling to balance positive and negative samples (see the sketch below)
▪ Augmentation: LR flip, color jitter, bbox position jitter, bbox size jitter
▪ No cross validation because of time and computational resource limitations
▪ Score calculation by @nvnn's notebook
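One common way to realize the over-sampling is PyTorch's WeightedRandomSampler with inverse-frequency weights; the author's exact sampling scheme is not specified, so treat this as a sketch with dummy labels.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Dummy 0/1 impact labels with ~0.18% positives, standing in for the crop dataset.
labels = np.random.binomial(1, 0.0018, size=100_000)
class_count = np.bincount(labels, minlength=2)
weights = 1.0 / np.maximum(class_count[labels], 1)   # inverse-frequency per sample

sampler = WeightedRandomSampler(torch.as_tensor(weights, dtype=torch.double),
                                num_samples=len(labels), replacement=True)
dataset = TensorDataset(torch.arange(len(labels)))   # stand-in for real crops
loader = DataLoader(dataset, batch_size=64, sampler=sampler)
```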
▪ Type-I: Assign TRUE labels only to the bboxes at the impact timing
▪ Type-II: Assign TRUE labels to the bboxes at the impact timing and the ±4 consecutive bboxes
▪ Type-I uses ±2 bboxes as inputs, and Type-II uses ±4 bboxes as inputs (N = 2 and N = 4 in P.18)
▪ ResNet-Type-I achieves high precision and ResNet-Type-II achieves high recall, so ensembling them leads to a performance gain (Type-I: 0.40 + Type-II: 0.38 → 0.46; a possible fusion is sketched below)
▪ Tried adding other models such as EfficientNet, but finally employed the two ResNets based on local val
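The deck does not spell out how the two ResNets are fused; averaging their per-bbox impact probabilities with equal weights is one plausible reading, sketched below (all names hypothetical).

```python
import torch

def ensemble_prob(model_type1, model_type2, x_n2, x_n4, threshold=0.5):
    """x_n2 / x_n4: channel-stacked crops built with N = 2 and N = 4."""
    with torch.no_grad():
        # Equal weights assumed; the actual fusion scheme is not described.
        p = 0.5 * model_type1(x_n2) + 0.5 * model_type2(x_n4)
    return p, p > threshold
```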
▪ Classifying the detected bboxes frame by frame produces a lot of false positives
▪ Employ IoU-based tracking and keep only the bbox which has the highest confidence value in a track (see the sketch below)
▪ Remove bboxes which could not be tracked
▪ Remove bboxes whose maximum confidence value in the track is less than a threshold
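A minimal sketch of this IoU-based temporal NMS: link bboxes across frames by IoU, keep only the highest-confidence bbox per track, and drop short or low-confidence tracks. The thresholds, the allowed frame gap, and the box format are assumptions.

```python
def box_iou(a, b):
    # a, b: (x1, y1, x2, y2)
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def temporal_nms(preds, iou_thresh=0.5, conf_thresh=0.3, min_track_len=2):
    """preds: list of dicts {'frame', 'box', 'score'} for one video."""
    tracks = []
    for p in sorted(preds, key=lambda d: d["frame"]):
        matched = None
        for tr in tracks:
            last = tr[-1]
            if (p["frame"] - last["frame"] <= 2 and
                    box_iou(p["box"], last["box"]) >= iou_thresh):
                matched = tr
                break
        if matched is None:
            tracks.append([p])
        else:
            matched.append(p)
    keep = []
    for tr in tracks:
        if len(tr) < min_track_len:        # remove bboxes that could not be tracked
            continue
        best = max(tr, key=lambda d: d["score"])
        if best["score"] >= conf_thresh:   # remove low-confidence tracks
            keep.append(best)              # keep the highest-confidence bbox per track
    return keep
```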
▪ Crop around the detected helmet and estimate the average helmet velocity over a few surrounding frames by optical flow (see the sketch below)
▪ Normalize the size of the helmet crop to 128 x 128 x 3 x 16
▪ Correct the helmet movement using optical flow to differentiate (i) a helmet moving at constant velocity and (ii) a helmet during acceleration
▪ Ensemble of EfficientNet B0-B3, ResNet-18, and ResNet-34 with TSM (Temporal Shift Module)
▪ Mark 3 frames around the impact as positive and use 5 or 10% positive samples
▪ Add the false positive predictions from a few under-trained detection models
▪ Average predictions of multiple models from 4 folds
▪ NMS in the temporal direction using tracking results
https://www.kaggle.com/c/nfl-impact-detection/discussion/209403
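A rough sketch of the optical-flow step: the mean dense flow inside the helmet bbox approximates the helmet's per-frame displacement, which can then be used to re-center the crops. The 1st-place write-up does not name the flow algorithm; Farneback is used here as a stand-in.

```python
import cv2

def helmet_velocity(prev_frame, next_frame, box):
    """box: (x1, y1, x2, y2) pixel coordinates of the helmet bbox."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    x1, y1, x2, y2 = box
    roi = flow[y1:y2, x1:x2]                # (h, w, 2): per-pixel (dx, dy)
    return roi.reshape(-1, 2).mean(axis=0)  # mean displacement in pixels/frame
```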
▪ Enables joint spatial-temporal modeling at no additional computation cost
▪ Shifts part of the channels along the temporal dimension, thus facilitating information exchange among neighboring frames (see the sketch below)
▪ Supports both offline (bi-directional) and online (uni-directional) video recognition
https://arxiv.org/abs/1811.08383
▪ After the competition, I evaluated TSM in my own pipeline, and it showed better performance than the 2D CNN (0.380 → 0.436)
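For reference, a minimal sketch of the shift operation from the TSM paper: part of the channels is shifted forward in time, part backward, and the rest is left untouched; in the actual module this sits inside the residual blocks of a 2D CNN.

```python
import torch

def temporal_shift(x, n_segments, fold_div=8):
    """x: (batch * n_segments, channels, h, w) frame-level features."""
    nt, c, h, w = x.shape
    x = x.view(nt // n_segments, n_segments, c, h, w)
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                   # shift towards the past
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # shift towards the future
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # no shift
    return out.view(nt, c, h, w)

feats = torch.rand(2 * 16, 64, 56, 56)   # 2 clips x 16 frames of features
shifted = temporal_shift(feats, n_segments=16)
```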
▪ Crop a fixed multiple of the width and height of the original bbox
▪ Ensemble of 6 different EfficientNets (B3 and B5) + horizontal flip TTA
▪ Replace the first 2D conv layers in the inverted residual blocks of EfficientNet with 3D conv layers (see the sketch below)
▪ Predict the different impact types for the center frame as the output variable (no impact, helmet, shoulder, body, ground impact) and optimize a softmax loss with class weights split 0.8:0.2 (non-impact : impact)
▪ Select all of the positive impact samples and a random sample of negative impact samples according to a specified ratio (0.99:0.01 non-impact : impact) at each epoch
▪ Post-processing: thresholding using the stage-1 score; filter out any frame earlier than 25; NMS based on IoU to filter out duplicate boxes in subsequent frames; consider the top 19 predicted boxes based on their stage-2 score and remove boxes below a threshold of 0.15
https://www.kaggle.com/c/nfl-impact-detection/discussion/208979
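One generic way to realize "replace the first 2D convs with 3D convs" is to inflate a Conv2d into a Conv3d, copying the 2D kernel into every temporal slice. The temporal kernel size of 3 and the weight-copy initialization are assumptions, not details from the 2nd-place write-up.

```python
import torch
import torch.nn as nn

def inflate_conv(conv2d: nn.Conv2d, time_k: int = 3) -> nn.Conv3d:
    conv3d = nn.Conv3d(conv2d.in_channels, conv2d.out_channels,
                       kernel_size=(time_k, *conv2d.kernel_size),
                       stride=(1, *conv2d.stride),
                       padding=(time_k // 2, *conv2d.padding),
                       groups=conv2d.groups,
                       bias=conv2d.bias is not None)
    with torch.no_grad():
        # Copy the 2D kernel into every temporal slice and rescale.
        w = conv2d.weight.unsqueeze(2).repeat(1, 1, time_k, 1, 1) / time_k
        conv3d.weight.copy_(w)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

conv3d = inflate_conv(nn.Conv2d(32, 96, kernel_size=1))  # e.g. an expand conv
```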
▪ Ensemble of detection models (WBF); detect helmets w/ impact and w/o impact separately
▪ Recall is around 0.97, so detected helmets with impact can be used as candidates
▪ Crop 3x the width and height of the original bbox and convert to grayscale
▪ Ensemble of 18 EfficientNets and ReXNets
▪ Regard the impact bbox and its corresponding bboxes within ±1 frame as positive
▪ Tune separate thresholds depending on the predictions in the other view: the threshold for a certain Endzone frame depends on whether there is a predicted bbox in the Sideline view within ±1 frame; if yes, the threshold is lower (say 0.25), if not, higher (say 0.45) (see the sketch below)
▪ NMS based on IoU to filter out duplicate boxes in subsequent frames
https://www.kaggle.com/c/nfl-impact-detection/discussion/208787
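A small sketch of the cross-view thresholding: an Endzone prediction is kept with a lower threshold when the Sideline view also fires within ±1 frame. The 0.25 / 0.45 values follow the example in the write-up; the helper structure is hypothetical.

```python
LOW_THRESH, HIGH_THRESH = 0.25, 0.45   # example values from the write-up

def keep_endzone_pred(pred, sideline_frames):
    """pred: {'frame', 'score'}; sideline_frames: set of frames with Sideline preds."""
    supported = any(pred["frame"] + d in sideline_frames for d in (-1, 0, 1))
    threshold = LOW_THRESH if supported else HIGH_THRESH
    return pred["score"] >= threshold
```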
▪ Classification over 20-frame sequences, followed by post-processing, with two approaches:
▪ (a) A sequence of full frames is fed to the 3D CNN, a feature map is calculated, and features for the ROIs are extracted with the ROIAlign operation and classified (see the sketch below)
▪ (b) A sequence of frames is cropped around the target box, the sequence of crops is fed to the 3D CNN, and the impact probability is calculated directly
▪ Use 5 input channels instead of 3 RGB channels: the first additional channel is the heatmap of the center of the helmet of interest, and the second additional channel is the heatmap of the centers of all helmets
▪ NMS based on IoU to filter out duplicate boxes in subsequent frames
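A minimal sketch of the full-frame approach: a 3D CNN feature map is pooled over time and ROIAlign extracts per-helmet features for classification. The temporal mean pooling, head size, and feature stride are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class RoiImpactHead(nn.Module):
    def __init__(self, in_ch=256):
        super().__init__()
        self.fc = nn.Linear(in_ch * 7 * 7, 1)

    def forward(self, feat3d, boxes, spatial_scale):
        # feat3d: (B, C, T, H, W); boxes: list of (K_i, 4) tensors in image coords
        feat2d = feat3d.mean(dim=2)                      # pool over time (assumed)
        rois = roi_align(feat2d, boxes, output_size=(7, 7),
                         spatial_scale=spatial_scale, aligned=True)
        return torch.sigmoid(self.fc(rois.flatten(1)))   # impact prob per ROI

feat = torch.rand(1, 256, 20, 45, 80)                    # e.g. stride-16 features
boxes = [torch.tensor([[100.0, 120.0, 140.0, 160.0]])]   # (x1, y1, x2, y2)
probs = RoiImpactHead()(feat, boxes, spatial_scale=1 / 16)
```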
▪ Frames are cropped into 224 x 224 patches using a grid with a constant step
▪ 3D feature maps from the second to fifth blocks of the 3D CNN are passed to an FPN
▪ The FPN produces a 6 x 56 x 56 grid: channel 1 is presence of a helmet, channels 2-5 are position and size of the helmet, and channel 6 is presence of an impact
▪ Two-stage training over 20-frame inputs (I3D + FPN), with part of the network frozen in the 2nd stage
▪ Post-processing: NMS based on IoU to filter out duplicate boxes in subsequent frames; remove predictions with low confidence if there are no predictions in the other view
▪ Crop 3x the width and height of the original bbox and resize to 112 x 112
▪ Ensemble of 2D ResNet-18 and 3D ResNet-18
▪ Augmentation: HorizontalFlip, RandomBrightness, RandomContrast, one of (MotionBlur, MedianBlur, GaussianBlur, GaussNoise), HueSaturationValue, ShiftScaleRotate, Cutout, bbox jitter
▪ If helmets are detected at the same position within 4 frames, only the middle frame is kept
▪ Ignore the first and last 10 frames of the video because impacts are unlikely to occur there
▪ Detection based on CenterNet: 8 consecutive frames are passed through the encoder individually, then the intermediate feature maps are concatenated and fed through a UNet-like decoder to produce an output heatmap and an impact map for the 8 frames
▪ 3D ResNet-50 applied to 8-frame inputs
▪ NMS based on IoU to filter out duplicate boxes in subsequent frames
▪ Detect helmets w/ impact and w/o impact separately using DetectoRS (train a 1-class detector as warm-up)
▪ Crop 1.22x the width and height of the original bbox
▪ 3D CNN (I3D, SlowFast)
▪ Different thresholds for Endzone and Sideline views; different thresholds over time
▪ Use an IoU threshold and a frame-difference threshold to cluster detections (across multiple frames) which belong to the same player and remove FPs
▪ Impacts detected with a confidence lower than T are removed if no impact is found in the other view
▪ Multiple prediction heads on top of the feature extraction block (EfficientNet-B5); each head is responsible for predicting helmets for one of the 15 input frames
▪ Calculate the loss independently for the 2 classes (helmet w/ impact and helmet w/o impact) and then use a weighted sum of the 2 losses
▪ NMS based on IoU to filter out duplicate boxes in subsequent frames
▪ Dynamic confidence threshold: low threshold for frames 30-80 (impacts are most likely to happen in this period), then slowly increase the threshold (see the sketch below)
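A small sketch of the dynamic confidence threshold: low inside the frame range where impacts are most likely (30-80), higher before it, and slowly ramping up after it. The numeric thresholds and ramp length are assumptions.

```python
def dynamic_threshold(frame, low=0.3, high=0.6, start=30, end=80, ramp_len=100):
    if frame < start:
        return high
    if frame <= end:
        return low                        # impacts most likely in frames 30-80
    t = min(1.0, (frame - end) / ramp_len)
    return low + t * (high - low)         # slowly increase after frame 80

predictions = [{"frame": 55, "score": 0.35}, {"frame": 150, "score": 0.40}]
kept = [p for p in predictions if p["score"] >= dynamic_threshold(p["frame"])]
```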