Automatic Dance Video Segmentation for Understanding Choreography

Koki Endo*†1, Shuhei Tsuchida*†2, Tsukasa Fukusato†3, Takeo Igarashi†1
†1 The University of Tokyo, †2 Ochanomizu University, †3 Waseda University (* = authors contributed equally)
MOCO'24

Overview
1. Background
2. Related Work
3. Proposed Method
4. Dataset
5. Evaluation Experiment
6. Application
7. Future Work

Background / Learning Dance from Videos
• Learning dance becomes easier if the choreography is divided into short movements.
(Video: gLH_sFM_c01_d16_mLH0_ch01.mp4)

Background / Learning Dance from Videos
• Typical dance videos lack segmentation
• Learners need to find the appropriate segmentation points themselves
• Difficult for beginners
• Tedious for experienced dancers
→ We propose a method to automatically segment dance videos


Related Work / Dance Motion Segmentation
Work | Input | Musical information | Method | What is estimated
Detecting Dance Motion Structure Using Motion Capture and Musical Information [Shiratori et al. 2004] | Japanese dance motion data (Nihon-buyo) | Used | Rule-based | Motion segmentation points
Dance Motion Segmentation Method based on Choreographic Primitives [Okada et al. 2015] | Dance motion data | Only beat positions used (prior knowledge) | Rule-based | Motion segmentation points
Proposed Method | Dance video | Used | Neural network | Video segmentation points
Takaaki Shiratori, Atsushi Nakazawa and Katsushi Ikeuchi. Detecting Dance Motion Structure through Music Analysis. 2004
Narumi Okada, Naoya Iwamoto, Tsukasa Fukusato and Shigeo Morishima. Dance Motion Segmentation Method based on Choreographic Primitives. 2015

Proposed Method
[Diagram: dance video → bone vectors → fully connected NN → visual features 𝒗(𝑡); dance video → Mel spectrogram → 2D CNN → auditory features 𝒂(𝑡)]

Proposed Method / Visual Features
• Using AlphaPose [Fang et al. 2017] to detect keypoint positions
• Keypoints: 26 body points + 21 points for each hand = 68 keypoints
• → Convert to bone vectors (67) and normalize length to 0.5
• → Pass through a fully connected neural network to obtain visual features 𝒗(𝑡), a real-valued vector
[Diagram: keypoint positions → bone vectors → fully connected NN → 𝒗(𝑡)]
Hao-Shu Fang, Shuqin Xie, Yu-Wing Tai and Cewu Lu. RMPE: Regional Multi-person Pose Estimation. 2017
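The conversion from keypoints to bone vectors can be sketched as follows. This is a minimal illustration, not the authors' code: the `bones` list of (parent, child) index pairs is a hypothetical skeleton topology, and the zero-vector fallback for coincident keypoints is an assumption. A tree-shaped skeleton over 68 keypoints has 67 edges, which is consistent with the 67 bone vectors on the slide.

```python
import math

def bone_vectors(keypoints, bones, length=0.5):
    """Convert 2D keypoints to bone vectors rescaled to a fixed length.

    keypoints: list of (x, y) positions.
    bones: list of (parent, child) index pairs (hypothetical topology).
    """
    vectors = []
    for parent, child in bones:
        dx = keypoints[child][0] - keypoints[parent][0]
        dy = keypoints[child][1] - keypoints[parent][1]
        norm = math.hypot(dx, dy)
        if norm == 0.0:
            # Assumption: coincident keypoints yield a zero vector.
            vectors.append((0.0, 0.0))
        else:
            scale = length / norm
            vectors.append((dx * scale, dy * scale))
    return vectors

# Toy skeleton: 3 keypoints joined in a chain give 2 bones.
toy_keypoints = [(0.0, 0.0), (3.0, 4.0), (3.0, 10.0)]
toy_bones = [(0, 1), (1, 2)]
toy_vectors = bone_vectors(toy_keypoints, toy_bones)
```

Normalizing every bone to the same length keeps only its direction, so the features do not depend on the dancer's body size or distance from the camera.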

Proposed Method / Auditory Features
• Convert the music in the video to a Mel spectrogram using a short-time Fourier transform (STFT)
• The Mel spectrogram is a 2D array representing the magnitude of each frequency component at a given time
[Diagram: audio data → STFT → Mel spectrogram 𝑆]
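As background on the frequency axis of a Mel spectrogram, the sketch below shows one common Hz-to-Mel convention (the HTK-style formula). The slides do not state which Mel variant, number of bands, or frequency range the authors used, so `num_bands`, `f_min`, and `f_max` are illustrative assumptions.

```python
import math

def hz_to_mel(freq_hz):
    """HTK-style conversion from frequency in Hz to the Mel scale."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_centers(num_bands, f_min, f_max):
    """Center frequencies (Hz) of bands spaced evenly on the Mel scale."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (num_bands + 1)
    return [mel_to_hz(m_min + step * (b + 1)) for b in range(num_bands)]

# Illustrative values only: 8 bands between 0 Hz and 8 kHz.
centers = mel_band_centers(num_bands=8, f_min=0.0, f_max=8000.0)
```

Bands that are evenly spaced on the Mel scale are narrow at low frequencies and wide at high frequencies, which roughly follows human pitch perception.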

Proposed Method / Auditory Features
• Compress the Mel spectrogram along the time axis so that each video frame has a matching sample
• For each frame 𝑡, find the nearest Mel spectrogram index 𝑖
• Extract a 5-sample segment centered at 𝑖, that is 𝑆[𝑖 − 2, 𝑖 − 1, 𝑖, 𝑖 + 1, 𝑖 + 2], and apply a 2D CNN to obtain auditory features 𝒂(𝑡), a real-valued vector
[Diagram: frame 𝑡 → index 𝑖 → extract 𝑆[𝑖 − 2, …, 𝑖 + 2] → 2D CNN → 𝒂(𝑡)]
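The frame-to-spectrogram alignment above can be sketched as follows. The frame rate, audio sample rate, and hop length are hypothetical values, the spectrogram is stored as a list of time columns, and clamping at the start and end of the clip is an assumption (the slides do not say how the boundaries are handled).

```python
def nearest_spec_index(frame, fps, sample_rate, hop_length, num_columns):
    """Index of the spectrogram column closest in time to a video frame."""
    time_sec = frame / fps
    index = round(time_sec * sample_rate / hop_length)
    return min(max(index, 0), num_columns - 1)

def window_around(spec_columns, center, half_width=2):
    """Return 2*half_width+1 columns centered at `center`.

    Indices outside the spectrogram are clamped to the nearest valid
    column (an assumption made for this sketch).
    """
    last = len(spec_columns) - 1
    return [spec_columns[min(max(j, 0), last)]
            for j in range(center - half_width, center + half_width + 1)]

# Toy spectrogram: 200 time columns, one value per column.
spec = [[c] for c in range(200)]
i = nearest_spec_index(frame=30, fps=30, sample_rate=48000,
                       hop_length=512, num_columns=len(spec))
segment = window_around(spec, i)   # S[i-2], ..., S[i+2]
start_segment = window_around(spec, 0)
```

Each `segment` is the small time-frequency patch that would then be fed to the 2D CNN to produce 𝒂(𝑡).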

Proposed Method
[Diagram: dance video → bone vectors → fully connected NN → visual features 𝒗(𝑡); dance video → Mel spectrogram → 2D CNN → auditory features 𝒂(𝑡); 𝒗(𝑡) and 𝒂(𝑡) → Temporal Convolutional Network (TCN) → segmentation possibility 𝑝(𝑡)]

Proposed Method / TCN
• Temporal Convolutional Network (TCN) [Bai et al. 2018]
• 1D convolutions on time series data
• Convolution dilation (the spacing between the input steps a filter reads) increases with deeper layers
[Diagram: input and output sequences along the time axis]
Shaojie Bai, J. Zico Kolter and Vladlen Koltun. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. 2018
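In the TCN of Bai et al., each layer is a dilated 1D convolution whose dilation grows with depth, so the output keeps the input length while each output step sees a long span of the input. The sketch below illustrates this with a plain-Python causal convolution. The kernel weights and the dilations (1, 2, 4) are toy values, and whether the authors' network is causal is not stated in the slides.

```python
def dilated_causal_conv1d(signal, kernel, dilation):
    """1D causal convolution with implicit zero padding on the left.

    The output has the same length as the input; output[t] depends only
    on signal[t], signal[t - dilation], signal[t - 2 * dilation], ...
    """
    out = []
    for t in range(len(signal)):
        total = 0.0
        for k, weight in enumerate(kernel):
            src = t - k * dilation
            if src >= 0:
                total += weight * signal[src]
        out.append(total)
    return out

def receptive_field(kernel_size, dilations):
    """Number of input steps that can influence one output step."""
    return 1 + (kernel_size - 1) * sum(dilations)

# Push an impulse at t=0 through three layers with dilations 1, 2, 4.
signal = [1.0] + [0.0] * 9
for d in (1, 2, 4):
    signal = dilated_causal_conv1d(signal, [1.0, 1.0], d)
rf = receptive_field(kernel_size=2, dilations=[1, 2, 4])
```

The impulse spreads over exactly `rf` time steps, which shows how doubling the dilation per layer widens the temporal context without shortening the sequence.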

Proposed Method / TCN
• Input: the per-frame features 𝒗(0), …, 𝒗(𝑇 − 1) and 𝒂(0), …, 𝒂(𝑇 − 1), stacked over the 𝑇 frames
• Output: 𝑝(0), 𝑝(1), …, 𝑝(𝑇 − 1), obtained through the TCN and a fully connected layer
[Diagram labels: 𝑇, 150, 𝑇]

Proposed Method
[Diagram: dance video → bone vectors → fully connected NN → visual features 𝒗(𝑡); dance video → Mel spectrogram → 2D CNN → auditory features 𝒂(𝑡); 𝒗(𝑡) and 𝒂(𝑡) → Temporal Convolutional Network (TCN) → segmentation possibility 𝑝(𝑡) → peak detection → segmentation points 𝑡₀, 𝑡₁, …]

Proposed Method / Peak Detection
• Conditions for peak detection:
1. Segmentation possibility exceeds a certain threshold
2. Segmentation possibility is a local maximum
[Plot: segmentation possibility 𝑝(𝑡)]
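The two conditions can be sketched as a simple scan over 𝑝(𝑡). The threshold value 0.5 is a hypothetical choice (the slides only say "a certain threshold"), and this sketch uses strict comparisons, so flat plateaus are not reported as peaks.

```python
def detect_peaks(p, threshold):
    """Return the frame indices where p exceeds `threshold` and is a
    strict local maximum (greater than both neighbours)."""
    peaks = []
    for t in range(len(p)):
        left = p[t - 1] if t > 0 else float("-inf")
        right = p[t + 1] if t < len(p) - 1 else float("-inf")
        if p[t] > threshold and p[t] > left and p[t] > right:
            peaks.append(t)
    return peaks

# Toy segmentation possibility for 10 frames.
p = [0.1, 0.3, 0.9, 0.4, 0.2, 0.6, 0.7, 0.3, 0.45, 0.2]
segmentation_points = detect_peaks(p, threshold=0.5)
```

Frame 8 is a local maximum in the toy data but stays below the threshold, so it is not returned; raising or lowering the threshold trades missed segmentation points against spurious ones.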


Dataset
• Manual segmentation of videos from the AIST Dance Video Database [Tsuchida et al. 2019]
• 1200 basic dance videos + 210 freestyle dance videos = 1410 videos, approximately 10.7 hours
(Videos, audio off for explanation: an example of basic dance and an example of advanced dance. Files: gLH_sFM_c01_d16_mLH0_ch01.mp4, gLH_sBM_c01_d16_mLH0_ch01.mp4)
Shuhei Tsuchida, Satoru Fukayama, Masahiro Hamasaki and Masataka Goto. AIST Dance Video Database: Multi-genre, Multi-dancer, and Multi-camera Database for Dance Information Processing. 2019.

Dataset / Annotation Tool
• Annotation using the training data creation tool (audio off)
(Video: segToolDemo.mov)

Dataset / Annotation Tool
• Annotation workers
• For each video, three segmentation annotations were collected.
• The first author + two other experienced dancers
Worker | Years of dance experience | Number of videos worked on
Author | 11 | 1410
20 experienced dancers | 5-19 | 141 each

Dataset / Creating Ground Truth Labels
• The intended segmentation points of the workers might be a few frames off from the actual annotated positions.
→ Represent individual segmentation results as a sum of Gaussian distributions centered on the annotated positions.
[Plot: segmentation positions 𝑡₀, 𝑡₁, … with a Gaussian centered on each of 𝑡₀, 𝑡₁, 𝑡₂]

Dataset / Creating Ground Truth Labels
• The Gaussian-smoothed segmentation results of the three workers are averaged to create the ground truth labels.
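The label construction described on these two slides can be sketched as follows. The Gaussian width `sigma` and the unnormalized peak height of 1 are assumptions; the slides do not give either value.

```python
import math

def gaussian_label(points, num_frames, sigma):
    """Per-frame label for one annotator: a sum of Gaussians, each
    centered on an annotated segmentation frame."""
    return [
        sum(math.exp(-((t - c) ** 2) / (2.0 * sigma ** 2)) for c in points)
        for t in range(num_frames)
    ]

def average_labels(labels):
    """Frame-wise mean of the annotators' label curves."""
    n = len(labels)
    return [sum(values) / n for values in zip(*labels)]

# Three annotators who agree to within one frame (toy data).
annotations = [[10, 30], [11, 30], [10, 29]]
labels = [gaussian_label(a, num_frames=40, sigma=2.0) for a in annotations]
ground_truth = average_labels(labels)
```

Because each annotation contributes a smooth bump rather than a single frame, small disagreements between annotators reinforce each other instead of producing separate spikes.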


Evaluation Experiment
• Experiment overview:
1. Split the dataset into training, validation, and test sets at a ratio of 3:1:1
2. Train on the training data
3. Stop training if validation loss does not improve for 10 epochs
4. Evaluate performance on the test data
• Predicted segmentation points had an F-score of 0.797 ± 0.013
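An F-score over segmentation points needs a rule for deciding when a predicted point counts as correct. The slides do not state that rule, so the sketch below assumes a common one: a prediction is a true positive if it lies within `tolerance` frames of a ground-truth point that has not already been matched. The tolerance of 3 frames is a hypothetical value.

```python
def f_score(predicted, truth, tolerance):
    """F-score with one-to-one matching of predicted and true points.

    A prediction is correct if the closest still-unmatched ground-truth
    point is at most `tolerance` frames away (assumed matching rule).
    """
    unmatched = sorted(truth)
    true_pos = 0
    for p in sorted(predicted):
        best = min(unmatched, key=lambda g: abs(g - p), default=None)
        if best is not None and abs(best - p) <= tolerance:
            unmatched.remove(best)
            true_pos += 1
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(truth)
    return 2.0 * precision * recall / (precision + recall)

# Toy example: 3 of 4 predictions match 3 of 5 true points.
score = f_score(predicted=[10, 31, 50, 80],
                truth=[12, 30, 52, 65, 90],
                tolerance=3)
```

In the toy example precision is 3/4 and recall is 3/5, and the F-score is their harmonic mean.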

Evaluation Results / Correct Predictions
(Videos played at 0.75x speed, audio off)
• Lock, Stop and Go (LO_stopAndGo.mov)
• House, Farmer (HO_farmer.mov)
• LA-style hip-hop, Slide (LH_slide.mov)
• Middle hip-hop, Brooklyn (MH_brooklyn.mov)

Evaluation Results / Incorrect Predictions
(Videos played at 0.75x speed, audio off; labels mark the correct split position and the wrongly predicted split positions)
• Ballet Jazz, Chaines (JB_chaines.mov)
• Ballet Jazz, Paddbre (JB_paddbre.mov)

Evaluation Results / Feature Comparison
• Comparison of the proposed method (V+A) against models using only one feature type (V for visual, A for auditory)
• t-test at a significance level of 5%: V+A scored significantly higher than V (𝑝 = 5.08×10^(…) < 0.05)
• No significant difference between A and V+A (𝑝 = 7.40×10^(…) > 0.05)
• Possible causes: dataset bias, or visual features whose dimensionality is too high.

Evaluation Results / Feature Comparison
• Segmentation results of two videos with different choreography but the same music.
[Figure: panels (a) and (b)]


Application / Dance Learning Support
(Video: appDemo.mov, audio off for explanation)

Application / User Test
• Users learned choreography using the application and provided feedback.
• Participants: 4 (2-4 years of dance experience).
• Usability and usefulness rated on a 5-point Likert scale:
→ All participants rated 4 (good) or 5 (very good).
[Figure: experiment setup showing the participants and the app (recreated by the author)]

Application / User Testing Feedback
• Positive feedback:
  • Loop playback made repeated practice easy.
  • Convenient as manual segmentation was not required.
  • The automatically detected segmentation positions matched my intuition.
• Improvement suggestions:
  • Adjustable break times between loop playbacks.
  • Ability to manually specify segmentation points.

Future Work
• Enhancing the dataset:
  • Videos from various genres, especially jazz and ballet.
  • Increasing the number of annotators.
• Adapting to non-static camera videos.
  • Handling camera movements and switches.
• Improving the application:
  • Detecting repetitions.
  • Semi-automatic adjustment of segmentation points based on users' dance experience and preferences.

Summary
• Proposed an automatic segmentation method for dance videos
  • Uses a Temporal Convolutional Network (TCN)
  • General-purpose method that does not require genre-specific knowledge
• Created a dataset by manually annotating the segmentation positions of 1410 dance videos in the AIST Dance Video Database
  • Author + 20 experienced dancers
• Evaluated the proposed method on the dataset in experiments
  • Confirmed the effectiveness of the proposed method on many street dances
  • Confirmed the validity of visual and auditory features
• Proposed an application to support dance learning by applying automatic segmentation
  • Segmented sections can be played back in a loop for repeated practice
  • Confirmed its validity through a user test

Acknowledgements
This work was supported by:
• JST, CREST Grant Number JPMJCR17A1, Japan
• JSPS Grant-in-Aid 23K17022, Japan
We would also like to thank all the participants who took part in our experiments.

AIST Dance Video Database (AIST Dance DB) is a shared database containing original street dance videos (1,410 dances) with copyright-cleared dance music (60).
Contact: Shuhei Tsuchida [email protected]
Thank you
