Slide 1

Automatic Dance Video Segmentation for Understanding Choreography
Koki Endo*†1, Shuhei Tsuchida*†2, Tsukasa Fukusato†3, Takeo Igarashi†1
†1 The University of Tokyo  †2 Ochanomizu University  †3 Waseda University
(* = authors contributed equally)
MOCO'24

Slide 2

Automatic Dance Video Segmentation for Understanding Choreography: Overview
1. Background
2. Related Work
3. Proposed Method
4. Dataset
5. Evaluation Experiment
6. Application
7. Future Work

Slide 3

Background / Learning Dance from Videos
• Learning dance becomes easier if the choreography is divided into short movements.
gLH_sFM_c01_d16_mLH0_ch01.mp4

Slide 4

Background / Learning Dance from Videos
• Typical dance videos lack segmentation
• Learners need to find the appropriate segmentation points themselves
  • Difficult for beginners
  • Tedious for experienced dancers
➢ We propose a method to automatically segment dance videos

Slide 5

Automatic Dance Video Segmentation for Understanding Choreography: Overview
1. Background
2. Related Work
3. Proposed Method
4. Dataset
5. Evaluation Experiment
6. Application
7. Future Work

Slide 6

Related Work / Dance Motion Segmentation

Detecting Dance Motion Structure Using Motion Capture and Musical Information [Shiratori et al. 2004]
• Input: Japanese dance motion data (Nihon-buyo)
• Musical information: Used
• Method: Rule-based
• Estimates: Motion segmentation points

Dance Motion Segmentation Method based on Choreographic Primitives [Okada et al. 2015]
• Input: Dance motion data
• Musical information: Only beat positions used (prior knowledge)
• Method: Rule-based
• Estimates: Motion segmentation points

Proposed Method
• Input: Dance video
• Musical information: Used
• Method: Neural network
• Estimates: Video segmentation points

Takaaki Shiratori, Atsushi Nakazawa and Katsushi Ikeuchi. Detecting Dance Motion Structure through Music Analysis. 2004.
Narumi Okada, Naoya Iwamoto, Tsukasa Fukusato and Shigeo Morishima. Dance Motion Segmentation Method based on Choreographic Primitives. 2015.

Slide 7

Automatic Dance Video Segmentation for Understanding Choreography: Overview
1. Background
2. Related Work
3. Proposed Method
4. Dataset
5. Evaluation Experiment
6. Application
7. Future Work

Slide 8

Proposed Method
• Dance video → bone vectors → fully connected NN → visual features 𝒗(𝑡)
• Dance video → Mel spectrogram → 2D CNN → auditory features 𝒂(𝑡)

Slide 9

Proposed Method / Visual Features
• Use AlphaPose [Fang et al. 2017] to detect keypoint positions
• Keypoints: 26 body points + 21 points for each hand = 68 keypoints
• → Convert to 67 bone vectors and normalize each bone's length to 0.5
• → Pass the bone vectors through a fully connected neural network to obtain visual features 𝒗(𝑡)
Keypoint positions → bone vectors → fully connected NN → 𝒗(𝑡)
Hao-Shu Fang, Shuqin Xie, Yu-Wing Tai and Cewu Lu. RMPE: Regional Multi-person Pose Estimation. 2017.
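
A minimal sketch of this conversion, assuming keypoints arrive as a (68, 2) NumPy array; SKELETON_EDGES is a hypothetical stand-in for AlphaPose's actual 67-edge skeleton topology:

```python
import numpy as np

# Hypothetical edge list: 67 (parent, child) pairs over the 68 keypoints
# (26 body + 21 per hand). The real topology comes from AlphaPose.
SKELETON_EDGES = [(0, 1), (1, 2)]  # ... 65 more pairs in practice

def keypoints_to_bone_vectors(keypoints: np.ndarray) -> np.ndarray:
    """Convert (68, 2) keypoint positions into bone vectors of length 0.5.

    Normalizing each bone discards absolute position and scale, so the
    feature depends only on the pose, not on where the dancer stands.
    """
    bones = []
    for parent, child in SKELETON_EDGES:
        v = keypoints[child] - keypoints[parent]
        norm = np.linalg.norm(v)
        # Guard against zero-length bones (e.g., overlapping keypoints).
        bones.append(0.5 * v / norm if norm > 1e-8 else np.zeros(2))
    return np.concatenate(bones)  # 2 values per bone, flattened
```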

Slide 10

Proposed Method / Auditory Features
• Convert the music in the video to a Mel spectrogram using a short-time Fourier transform (STFT)
• The Mel spectrogram is a 2D array representing the magnitude of each frequency component at each point in time
Audio data → STFT → Mel spectrogram 𝑆
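
As an illustration of this step, the Mel spectrogram could be computed with librosa; the sampling rate, FFT size, hop length, and number of Mel bands below are assumed values, not the paper's settings:

```python
import librosa

# Load the audio track extracted from the dance video (hypothetical file name).
y, sr = librosa.load("dance_audio.wav", sr=22050)

# STFT + Mel filter bank -> 2D array (n_mels, n_frames) of band magnitudes.
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=80)
S_db = librosa.power_to_db(S)  # log scaling is a common choice before a CNN
```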

Slide 11

Proposed Method / Auditory Features
• Compress the Mel spectrogram so that its number of samples matches the number of video frames
• For each video frame 𝑡, find the nearest Mel spectrogram index 𝑖
• Extract the 5-sample segment 𝑆[𝑖−2, 𝑖−1, 𝑖, 𝑖+1, 𝑖+2] centered at 𝑖, and apply a 2D CNN to obtain auditory features 𝒂(𝑡)
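
A sketch of the frame-to-spectrogram alignment, reusing the illustrative sr and hop_length from above; the actual mapping depends on the video frame rate and STFT parameters:

```python
import numpy as np

def mel_window_for_frame(S_db, t, fps=60.0, sr=22050, hop_length=512):
    """Return the 5-sample slice S[:, i-2 .. i+2] nearest to video frame t."""
    seconds = t / fps                          # video frame index -> time
    i = int(round(seconds * sr / hop_length))  # time -> Mel spectrogram index
    i = int(np.clip(i, 2, S_db.shape[1] - 3))  # keep the window in bounds
    return S_db[:, i - 2 : i + 3]              # (n_mels, 5), input to the 2D CNN
```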

Slide 12

Proposed Method
• Dance video → bone vectors → fully connected NN → visual features 𝒗(𝑡)
• Dance video → Mel spectrogram → 2D CNN → auditory features 𝒂(𝑡)
• 𝒗(𝑡), 𝒂(𝑡) → Temporal Convolutional Network (TCN) → segmentation possibility 𝑝(𝑡)

Slide 13

Proposed Method / TCN
• Temporal Convolutional Network (TCN) [Bai et al. 2018]
• 1D convolutions on time-series data
• Convolution dilation increases with deeper layers
[Diagram: input sequence at the bottom, output at the top, convolutions over the time axis]
Shaojie Bai, J. Zico Kolter and Vladlen Koltun. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. 2018.
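
A minimal PyTorch sketch of a TCN in this spirit; the layer count, channel width, and non-causal "same" padding are assumptions rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn

class TCN(nn.Module):
    """1D convolutions over time whose dilation doubles at each layer,
    so the receptive field grows exponentially with depth."""

    def __init__(self, in_channels=150, hidden=64, num_layers=4, kernel_size=3):
        super().__init__()
        layers, ch = [], in_channels
        for n in range(num_layers):
            dilation = 2 ** n  # 1, 2, 4, 8, ...
            layers += [
                nn.Conv1d(ch, hidden, kernel_size, dilation=dilation,
                          padding=dilation * (kernel_size - 1) // 2),
                nn.ReLU(),
            ]
            ch = hidden
        self.net = nn.Sequential(*layers)

    def forward(self, x):    # x: (batch, in_channels, T)
        return self.net(x)   # (batch, hidden, T), same sequence length
```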

Slide 14

Proposed Method / TCN
Input: 𝒗(0) … 𝒗(𝑇−1) and 𝒂(0) … 𝒂(𝑇−1), concatenated per frame into an array in ℝ^{150×𝑇} → TCN

Slide 15

Proposed Method / TCN
Input: 𝒗(0) … 𝒗(𝑇−1) and 𝒂(0) … 𝒂(𝑇−1), concatenated per frame into an array in ℝ^{150×𝑇} → TCN → fully connected layer → 𝑝(0), 𝑝(1), …, 𝑝(𝑇−1)
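
Continuing the sketch above, the per-frame head could look like this; the sigmoid output and the hidden width are assumptions:

```python
import torch
import torch.nn as nn

class SegmentationHead(nn.Module):
    """Maps each TCN output frame to a segmentation possibility p(t) in [0, 1]."""

    def __init__(self, hidden=64):
        super().__init__()
        self.fc = nn.Linear(hidden, 1)

    def forward(self, h):                         # h: (batch, hidden, T) from the TCN
        logits = self.fc(h.transpose(1, 2))       # (batch, T, 1)
        return torch.sigmoid(logits).squeeze(-1)  # (batch, T): p(0), ..., p(T-1)
```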

Slide 16

Proposed Method
• Dance video → bone vectors → fully connected NN → visual features 𝒗(𝑡)
• Dance video → Mel spectrogram → 2D CNN → auditory features 𝒂(𝑡)
• 𝒗(𝑡), 𝒂(𝑡) → Temporal Convolutional Network (TCN) → segmentation possibility 𝑝(𝑡)
• 𝑝(𝑡) → peak detection → segmentation points 𝑡₁, 𝑡₂, …

Slide 17

Proposed Method / Peak Detection
• Conditions for peak detection:
  1. The segmentation possibility 𝑝(𝑡) exceeds a certain threshold
  2. The segmentation possibility is a local maximum
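
Both conditions map directly onto SciPy's peak finder: find_peaks keeps only local maxima, and its height argument enforces the threshold. The 0.5 below is a placeholder, since the slides only mention "a certain threshold":

```python
import numpy as np
from scipy.signal import find_peaks

def detect_segmentation_points(p: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Frames where p(t) is a local maximum and exceeds the threshold."""
    peaks, _ = find_peaks(p, height=threshold)
    return peaks  # segmentation points t_1, t_2, ...
```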

Slide 18

Automatic Dance Video Segmentation for Understanding Choreography: Overview
1. Background
2. Related Work
3. Proposed Method
4. Dataset
5. Evaluation Experiment
6. Application
7. Future Work

Slide 19

Dataset
• Manual segmentation of videos from the AIST Dance Video Database [Tsuchida et al. 2019]
• 1200 basic dance videos + 210 freestyle dance videos = 1410 videos, approximately 10.7 hours
An example of basic dance / An example of advanced dance (audio off for explanation)
gLH_sFM_c01_d16_mLH0_ch01.mp4, gLH_sBM_c01_d16_mLH0_ch01.mp4
Shuhei Tsuchida, Satoru Fukayama, Masahiro Hamasaki and Masataka Goto. AIST Dance Video Database: Multi-genre, Multi-dancer, and Multi-camera Database for Dance Information Processing. 2019.

Slide 20

Dataset / Annotation Tool
Annotation using the training-data creation tool (audio off)
segToolDemo.mov

Slide 21

Dataset / Annotation Tool
• Annotation workers: the first author + 20 other experienced dancers
• For each video, three segmentation annotations were collected (the first author + two of the experienced dancers)

Worker                   Years of dance experience   Videos worked on
Author                   11                          1410
20 experienced dancers   5–19                        141 each

Slide 22

Dataset / Creating Ground Truth Labels
• The workers' intended segmentation points might be a few frames off from the actual annotated positions.
➢ Represent each worker's segmentation result as a sum of Gaussian distributions centered on the annotated positions 𝑡₁, 𝑡₂, …
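
A sketch of this label construction; the Gaussian width sigma (in frames) is an assumed value, as the slide does not state it:

```python
import numpy as np

def worker_label(positions, num_frames, sigma=3.0):
    """Sum of Gaussians centered on one worker's annotated frames t_1, t_2, ..."""
    t = np.arange(num_frames)
    label = np.zeros(num_frames)
    for t_k in positions:
        label += np.exp(-((t - t_k) ** 2) / (2.0 * sigma ** 2))
    return label
```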

Slide 23

Dataset / Creating Ground Truth Labels
• The three workers' label signals are averaged to create the ground truth labels.
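
Using the worker_label sketch from the previous slide, the averaging step could be written as follows (the annotated frame indices here are made up for illustration):

```python
import numpy as np

# Three workers' annotations for one video (hypothetical frame indices).
annotations = [[30, 92, 151], [31, 90, 150], [29, 93, 152]]
ground_truth = np.mean([worker_label(a, num_frames=200) for a in annotations], axis=0)
```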

Slide 24

Automatic Dance Video Segmentation for Understanding Choreography: Overview
1. Background
2. Related Work
3. Proposed Method
4. Dataset
5. Evaluation Experiment
6. Application
7. Future Work

Slide 25

Evaluation Experiment
• Experiment overview:
  1. Split the dataset into training, validation, and test sets (3:1:1)
  2. Train on the training data
  3. Stop training if the validation loss does not improve for 10 epochs
  4. Evaluate performance on the test data
• The predicted segmentation points achieved an F-score of 0.797 ± 0.013
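
Scoring segmentation points requires matching predictions to ground truth within some tolerance. The slides do not state the matching rule used, so the greedy matcher and the 6-frame tolerance below are purely illustrative:

```python
def segmentation_f_score(pred, truth, tol=6):
    """F-score with greedy one-to-one matching of predicted and true points."""
    unmatched = list(truth)
    tp = 0
    for t in sorted(pred):
        hits = [g for g in unmatched if abs(g - t) <= tol]
        if hits:
            unmatched.remove(min(hits, key=lambda g: abs(g - t)))  # closest match
            tp += 1
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(truth) if truth else 0.0
    return 2 * precision * recall / (precision + recall) if tp else 0.0
```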

Slide 26

Evaluation Results / Correct Predictions
(0.75x speed, audio off)
• Lock: Stop and Go (LO_stopAndGo.mov)
• House: Farmer (HO_farmer.mov)
• LA-style hip-hop: Slide (LH_slide.mov)
• Middle hip-hop: Brooklyn (MH_brooklyn.mov)

Slide 27

Evaluation Results / Incorrect Predictions
(played at 0.75x speed, audio off; videos marked with the correct split positions and the wrongly predicted split positions)
• Ballet Jazz, Chaines (JB_chaines.mov)
• Ballet Jazz, Paddbre (JB_paddbre.mov)

Slide 28

Evaluation Results / Feature Comparison
• Comparison of the proposed method (V+A) against models using only one feature type (V: visual only, A: auditory only)
• t-test at a 5% significance level: V < V+A (𝑝 < 0.05)
• No significant difference between A and V+A (𝑝 > 0.05)
• Possible causes: dataset bias, or too high a dimensionality of the visual features
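
Such a comparison can be run with SciPy, given per-run F-scores for each model; whether the paper used a paired or unpaired test is not stated on the slide, so both the test form and the scores below are assumptions:

```python
from scipy import stats

# Hypothetical per-run F-scores for each feature configuration.
f_v  = [0.71, 0.73, 0.70]  # visual only
f_a  = [0.78, 0.80, 0.79]  # auditory only
f_va = [0.79, 0.81, 0.80]  # visual + auditory (proposed)

_, p1 = stats.ttest_ind(f_v, f_va)  # V vs. V+A
_, p2 = stats.ttest_ind(f_a, f_va)  # A vs. V+A
print(p1 < 0.05, p2 < 0.05)         # significance at the 5% level
```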

Slide 29

Evaluation Results / Feature Comparison
• Segmentation results of two videos, (a) and (b), with different choreography but the same music.

Slide 30

Automatic Dance Video Segmentation for Understanding Choreography: Overview
1. Background
2. Related Work
3. Proposed Method
4. Dataset
5. Evaluation Experiment
6. Application
7. Future Work

Slide 31

Application / Dance Learning Support
(Audio off for explanation)
appDemo.mov

Slide 32

Application / User Test
• Users learned choreography using the application and provided feedback.
• Participants: 4 (2–4 years of dance experience).
• Usability and usefulness were rated on a 5-point Likert scale:
➢ All participants rated 4 (good) or 5 (very good).
[Photos: participants, the app, and the experiment setup (recreated by the author)]

Slide 33

Application / User Test Feedback
• Positive feedback:
  • Loop playback made repeated practice easy.
  • Convenient, as manual segmentation was not required.
  • The automatic segmentation positions matched my intuition.
• Improvement suggestions:
  • Adjustable break times between loop playbacks.
  • Ability to manually specify segmentation points.

Slide 34

Automatic Dance Video Segmentation for Understanding Choreography: Overview
1. Background
2. Related Work
3. Proposed Method
4. Dataset
5. Evaluation Experiment
6. Application
7. Future Work

Slide 35

Future Work
• Enhancing the dataset:
  • Videos from various genres, especially jazz and ballet.
  • Increasing the number of annotators.
• Adapting to non-static camera videos:
  • Handling camera movements and switches.
• Improving the application:
  • Detecting repetitions.
  • Semi-automatic adjustment of segmentation points based on users' dance experience and preferences.

Slide 36

Summary
• Proposed an automatic segmentation method for dance videos
  • Uses a Temporal Convolutional Network (TCN)
  • A general-purpose method that does not require genre-specific knowledge
• Created a dataset by manually annotating segmentation positions in 1410 videos from the AIST Dance Video Database
  • Author + 20 experienced dancers
• Evaluated the proposed method on the dataset
  • Confirmed its effectiveness on many street dance genres
  • Confirmed the validity of the visual and auditory features
• Proposed an application that supports dance learning through automatic segmentation
  • Segmented sections can be played back in a loop for repeated practice
  • Validity confirmed by a user test

Slide 37

Acknowledgements
This work was supported by:
• JST CREST Grant Number JPMJCR17A1, Japan
• JSPS Grant-in-Aid 23K17022, Japan
We would also like to thank all the participants who took part in our experiments.

Slide 38

Thank you
The AIST Dance Video Database (AIST Dance DB) is a shared database containing original street dance videos (1,410 dances) with copyright-cleared dance music (60 pieces).
Contact: Shuhei Tsuchida [email protected]