
Automatic Dance Video Segmentation for Understanding Choreography

Koki Endo*, Shuhei Tsuchida*, Tsukasa Fukusato, Takeo Igarashi (* = authors contributed equally)

Segmenting a dance video into short movements is a popular way to understand dance choreography more easily. However, this is currently done manually and requires significant effort from experts. In this paper, we propose a method to automatically segment a dance video into individual movements. Given a dance video as input, we first extract visual and audio features: the former are computed from the keypoints of the dancer in the video, and the latter from the Mel spectrogram of the music. Next, these features are passed to a Temporal Convolutional Network (TCN), and segmentation points are estimated by picking peaks of the network output.

Shuhei Tsuchida

June 01, 2024

Transcript

  1. Automatic Dance Video Segmentation for Understanding Choreography
     Koki Endo*†1, Shuhei Tsuchida*†2, Tsukasa Fukusato†3, Takeo Igarashi†1
     †1 The University of Tokyo, †2 Ochanomizu University, †3 Waseda University (* = authors contributed equally). MOCO'24
  2. Automatic Dance Video Segmentation for Understanding Choreography: Overview
     1. Background  2. Related Work  3. Proposed Method  4. Dataset  5. Evaluation Experiment  6. Application  7. Future Work
  3. Background / Learning Dance from Videos
     • Learning dance becomes easier if the choreography is divided into short movements.
     (Video: gLH_sFM_c01_d16_mLH0_ch01.mp4)
  4. Background / Learning Dance from Videos
     • Typical dance videos lack segmentation.
     • Learners need to find the appropriate segmentation points themselves:
       • Difficult for beginners.
       • Tedious for experienced dancers.
     → We propose a method to automatically segment dance videos.
  5. Automatic Dance Video Segmentation for Understanding Choreography: Overview
     1. Background  2. Related Work  3. Proposed Method  4. Dataset  5. Evaluation Experiment  6. Application  7. Future Work
  6. Related Work / Dance Motion Segmentation
     • Detecting Dance Motion Structure Using Motion Capture and Musical Information [Shiratori et al. 2004]
       Input: Japanese dance (Nihon-buyo) motion data · Musical information: used · Method: rule-based · Estimates: motion segmentation points
     • Dance Motion Segmentation Method based on Choreographic Primitives [Okada et al. 2015]
       Input: dance motion data · Musical information: only beat positions used (prior knowledge) · Method: rule-based · Estimates: motion segmentation points
     • Proposed method
       Input: dance video · Musical information: used · Method: neural network · Estimates: video segmentation points
     Takaaki Shiratori, Atsushi Nakazawa and Katsushi Ikeuchi. Detecting Dance Motion Structure through Music Analysis. 2004.
     Narumi Okada, Naoya Iwamoto, Tsukasa Fukusato and Shigeo Morishima. Dance Motion Segmentation Method based on Choreographic Primitives. 2015.
  7. Automatic Dance Video Segmentation for Understanding Choreography: Overview
     1. Background  2. Related Work  3. Proposed Method  4. Dataset  5. Evaluation Experiment  6. Application  7. Future Work
  8. Proposed Method
     Dance video → bone vectors → fully connected NN → visual features 𝒗(𝑡)
     Dance video → Mel spectrogram → 2D CNN → auditory features 𝒂(𝑡)
  9. Proposed Method / Visual Features
     • Use AlphaPose [Fang et al. 2017] to detect keypoint positions.
     • Keypoints: 26 body points + 21 points for each hand = 68 keypoints.
     • Convert the keypoints to 67 bone vectors and normalize each bone length to 0.5.
     • Pass the bone vectors through a fully connected neural network to obtain the visual features 𝒗(𝑡) (see the sketch below).
     Hao-Shu Fang, Shuqin Xie, Yu-Wing Tai and Cewu Lu. RMPE: Regional Multi-person Pose Estimation. 2017.
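A minimal sketch of this bone-vector preprocessing. The exact skeleton topology and feature layout are not given on the slide, so the parent indices below are a hypothetical placeholder:

```python
import numpy as np

# Hypothetical parent index per keypoint; -1 marks the root.
# AlphaPose gives 26 body + 21 + 21 hand keypoints = 68 in total,
# so a tree over them has 67 edges (bone vectors).
PARENTS = np.array([-1] + list(range(67)))  # placeholder chain topology

def keypoints_to_bones(keypoints: np.ndarray) -> np.ndarray:
    """Convert (68, 2) keypoint positions into 67 length-normalized bone vectors."""
    bones = []
    for child, parent in enumerate(PARENTS):
        if parent < 0:
            continue  # the root keypoint has no incoming bone
        vec = keypoints[child] - keypoints[parent]
        norm = np.linalg.norm(vec)
        if norm > 1e-8:
            vec = vec / norm * 0.5  # normalize every bone length to 0.5
        bones.append(vec)
    return np.concatenate(bones)  # flattened input to the fully connected NN
```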
  10. Proposed Method / Auditory Features
     • Convert the music in the video to a Mel spectrogram using a short-time Fourier transform (STFT); a minimal extraction sketch follows below.
     • The Mel spectrogram 𝑆 is a 2D array representing the magnitude of each frequency component at a given time.
     Audio data → STFT → Mel spectrogram 𝑆
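The paper does not state the exact STFT/Mel parameters, so the values below are assumptions; this is how such a spectrogram is typically extracted with librosa:

```python
import librosa
import numpy as np

def mel_spectrogram(audio_path: str) -> tuple[np.ndarray, int]:
    """Load the audio track and return (Mel spectrogram in dB, sample rate)."""
    y, sr = librosa.load(audio_path, sr=None)  # audio extracted from the video
    S = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=80  # assumed parameters
    )  # shape: (mel bins, time steps)
    return librosa.power_to_db(S), sr
```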
  11. Proposed Method / Auditory Features
     • Compress the Mel spectrogram so that its number of samples matches the number of video frames.
     • For each frame 𝑡, find the nearest Mel spectrogram index 𝑖.
     • Extract the 5-sample segment 𝑆[𝑖−2 … 𝑖+2] centered at 𝑖, and apply a 2D CNN to obtain the auditory features 𝒂(𝑡) (see the sketch below).
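A sketch of aligning spectrogram columns to video frames; the fps and hop_length parameters are assumptions carried over from the spectrogram sketch above:

```python
import numpy as np

def extract_audio_window(S: np.ndarray, t: int, fps: float,
                         sr: int, hop_length: int = 512) -> np.ndarray:
    """Return the 5-column spectrogram segment S[:, i-2 .. i+2] for video frame t."""
    # Nearest spectrogram column to the timestamp of frame t.
    i = int(round((t / fps) * sr / hop_length))
    i = int(np.clip(i, 2, S.shape[1] - 3))  # keep the 5-sample window in bounds
    return S[:, i - 2 : i + 3]              # fed to the 2D CNN to obtain a(t)
```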
  12. Proposed Method
     Dance video → bone vectors → fully connected NN → visual features 𝒗(𝑡)
     Dance video → Mel spectrogram → 2D CNN → auditory features 𝒂(𝑡)
     𝒗(𝑡), 𝒂(𝑡) → Temporal Convolutional Network (TCN) → segmentation possibility 𝑝(𝑡)
  13. Proposed Method / TCN
     • Temporal Convolutional Network (TCN) [Bai et al. 2018]:
       • 1D convolutions on time-series data.
       • The dilation factor increases with deeper layers, widening the receptive field (see the sketch below).
     Shaojie Bai, J. Zico Kolter and Vladlen Koltun. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. 2018.
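The slides do not reproduce the exact architecture; below is a minimal dilated-convolution stack in the spirit of Bai et al.'s TCN, written in PyTorch (the channel width, depth, and kernel size are assumptions):

```python
import torch
import torch.nn as nn

class TinyTCN(nn.Module):
    """Stack of 1D convolutions whose dilation doubles at each layer."""
    def __init__(self, in_ch: int = 150, hidden: int = 64, levels: int = 4):
        super().__init__()
        layers = []
        for k in range(levels):
            dilation = 2 ** k  # dilation (not stride) grows with depth
            layers += [
                nn.Conv1d(in_ch if k == 0 else hidden, hidden,
                          kernel_size=3, padding=dilation, dilation=dilation),
                nn.ReLU(),
            ]
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 150, T) -> (batch, hidden, T); sequence length is preserved.
        return self.net(x)
```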
  14. Proposed Method / TCN
     Input: [𝒗(0) … 𝒗(𝑇−1); 𝒂(0) … 𝒂(𝑇−1)] ∈ ℝ^(150×𝑇), i.e., a 150-dimensional concatenated feature vector per frame over 𝑇 frames, fed to the TCN.
  15. Proposed Method / TCN
     Input: [𝒗(0) … 𝒗(𝑇−1); 𝒂(0) … 𝒂(𝑇−1)] ∈ ℝ^(150×𝑇) → TCN → fully connected layer → 𝑝(0), 𝑝(1), …, 𝑝(𝑇−1) (see the sketch below).
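Continuing the sketch above, the per-frame segmentation possibility 𝑝(𝑡) could be read out with a fully connected layer applied at every time step; the hidden width and sigmoid head are assumptions, and TinyTCN is the hypothetical class from the previous sketch:

```python
import torch
import torch.nn as nn

tcn = TinyTCN(in_ch=150, hidden=64)      # from the sketch above
head = nn.Linear(64, 1)                  # fully connected layer, shared over time

x = torch.randn(1, 150, 240)             # concatenated v(t), a(t) for T = 240 frames
h = tcn(x).transpose(1, 2)               # (1, T, 64)
p = torch.sigmoid(head(h)).squeeze(-1)   # (1, T): p(0) ... p(T-1)
```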
  16. Proposed Method
     Dance video → bone vectors → fully connected NN → visual features 𝒗(𝑡)
     Dance video → Mel spectrogram → 2D CNN → auditory features 𝒂(𝑡)
     𝒗(𝑡), 𝒂(𝑡) → Temporal Convolutional Network (TCN) → segmentation possibility 𝑝(𝑡) → peak detection → segmentation points 𝑡₁, 𝑡₂, …
  17. Proposed Method / Peak Detection
     • A frame 𝑡 is chosen as a segmentation point when (see the sketch below):
       1. the segmentation possibility 𝑝(𝑡) exceeds a certain threshold, and
       2. 𝑝(𝑡) is a local maximum.
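Both conditions map directly onto scipy's peak finder; the threshold value 0.5 below is an assumption, not the paper's:

```python
import numpy as np
from scipy.signal import find_peaks

def pick_segmentation_points(p: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    # Condition 1: p(t) exceeds the threshold (the height argument).
    # Condition 2: p(t) is a local maximum (find_peaks' definition of a peak).
    peaks, _ = find_peaks(p, height=threshold)
    return peaks  # frame indices t1, t2, ...
```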
  18. Automatic Dance Video Segmentation for Understanding Choreography: Overview
     1. Background  2. Related Work  3. Proposed Method  4. Dataset  5. Evaluation Experiment  6. Application  7. Future Work
  19. Dataset
     • Manual segmentation of videos from the AIST Dance Video Database [Tsuchida et al. 2019].
     • 1200 basic dance videos + 210 freestyle dance videos = 1410 videos, approximately 10.7 hours.
     (Example videos, audio off for explanation: basic dance gLH_sFM_c01_d16_mLH0_ch01.mp4, advanced dance gLH_sBM_c01_d16_mLH0_ch01.mp4)
     Shuhei Tsuchida, Satoru Fukayama, Masahiro Hamasaki and Masataka Goto. AIST Dance Video Database: Multi-genre, Multi-dancer, and Multi-camera Database for Dance Information Processing. 2019.
  20. Dataset / Annotation Tool
     • For each video, three segmentation annotations were collected: one by the first author and two by other experienced dancers.
     Worker                  | Years of dance experience | Number of videos annotated
     Author                  | 11                        | 1410
     20 experienced dancers  | 5-19                      | 141 each
  21. Dataset / Creating Ground Truth Labels
     • The intended segmentation points of the workers might be a few frames off from the actual annotated positions.
     → Represent each worker's segmentation result as a sum of Gaussian distributions centered on the annotated positions 𝑡₀, 𝑡₁, … (see the sketch below).
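A sketch of building one worker's soft label curve as a sum of Gaussians; the standard deviation and the cap at 1.0 are assumptions, since the paper does not state them:

```python
import numpy as np

def soft_labels(positions: list[int], num_frames: int, sigma: float = 3.0) -> np.ndarray:
    """Sum of Gaussians centered on one worker's annotated frame positions."""
    t = np.arange(num_frames)
    label = np.zeros(num_frames)
    for t_k in positions:
        label += np.exp(-((t - t_k) ** 2) / (2 * sigma ** 2))
    return np.clip(label, 0.0, 1.0)  # cap where nearby Gaussians overlap (a choice)

# Ground truth (slide 22): average the three workers' curves, e.g.
# gt = np.mean([soft_labels(w, T) for w in worker_annotations], axis=0)
```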
  22. Dataset / Creating Ground Truth Labels
     • The resulting label curves of the three workers are averaged to create the ground truth labels.
  23. Automatic Dance Video Segmentation for Understanding Choreography: Overview
     1. Background  2. Related Work  3. Proposed Method  4. Dataset  5. Evaluation Experiment  6. Application  7. Future Work
  24. Evaluation Experiment
     • Experiment overview:
       1. Split the dataset into training, validation, and test sets in a 3:1:1 ratio.
       2. Train on the training data.
       3. Stop training if the validation loss does not improve for 10 epochs.
       4. Evaluate performance on the test data.
     • The predicted segmentation points achieved an F-score of 0.797 ± 0.013 (a hedged sketch of such point matching follows below).
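The slides do not specify how predicted points are matched to ground truth, but an F-score over segmentation points is commonly computed by matching each prediction to an unmatched ground-truth point within a tolerance window. A sketch under that assumption (the tolerance value and greedy matching are not the paper's):

```python
import numpy as np

def segmentation_f_score(pred: np.ndarray, gt: np.ndarray, tol: int = 5) -> float:
    """F-score where a prediction is correct if within tol frames of a free GT point."""
    matched_gt = set()
    tp = 0
    for p in pred:
        # Greedily match each prediction to the nearest unmatched ground-truth point.
        candidates = [(abs(p - g), j) for j, g in enumerate(gt) if j not in matched_gt]
        if candidates:
            d, j = min(candidates)
            if d <= tol:
                matched_gt.add(j)
                tp += 1
    precision = tp / max(len(pred), 1)
    recall = tp / max(len(gt), 1)
    return 2 * precision * recall / max(precision + recall, 1e-8)
```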
  25. Evaluation Results / Correct Predictions
     (Played at 0.75x speed, audio off)
     • Lock, "Stop and Go" (LO_stopAndGo.mov)
     • House, "Farmer" (HO_farmer.mov)
     • LA-style hip-hop, "Slide" (LH_slide.mov)
     • Middle hip-hop, "Brooklyn" (MH_brooklyn.mov)
  26. Evaluation Results / Incorrect Predictions
     (Played at 0.75x speed, audio off; correct and wrongly predicted split positions shown)
     • Ballet jazz, "Chaines" (JB_chaines.mov)
     • Ballet jazz, "Paddbre" (JB_paddbre.mov)
  27. Evaluation Results / Feature Comparison
     • Comparison of the proposed method (V+A) against models using only one feature type (V for visual, A for auditory).
     • t-test at a significance level of 5%: V < V+A (𝑝 < 0.05).
     • No significant difference between A and V+A (𝑝 > 0.05).
     • Possible causes: dataset bias, or the dimensionality of the visual features being too high.
  28. Evaluation Results / Feature Comparison
     • Segmentation results of two videos (a) and (b) with different choreography but the same music.
  29. Automatic Dance Video Segmentation for Understanding Choreography: Overview
     1. Background  2. Related Work  3. Proposed Method  4. Dataset  5. Evaluation Experiment  6. Application  7. Future Work
  30. Application / User Test
     • Users learned choreography using the application and provided feedback.
     • Participants: 4 (2-4 years of dance experience).
     • Usability and usefulness were rated on a 5-point Likert scale:
     → All participants rated 4 (good) or 5 (very good).
     (Photos: participants, app, and experiment setup; recreated by the author)
  31. Application / User Test Feedback
     • Positive feedback:
       • Loop playback made repeated practice easy.
       • Convenient, as manual segmentation was not required.
       • The automatic segmentation positions matched the participants' intuition.
     • Suggested improvements:
       • Adjustable break times between loop playbacks.
       • Ability to manually specify segmentation points.
  32. Automatic Dance Video Segmentation for Understanding Choreography: Overview
     1. Background  2. Related Work  3. Proposed Method  4. Dataset  5. Evaluation Experiment  6. Application  7. Future Work
  33. Future Work
     • Enhancing the dataset:
       • Videos from more genres, especially jazz and ballet.
       • Increasing the number of annotators.
     • Adapting to non-static camera videos:
       • Handling camera movements and switches.
     • Improving the application:
       • Detecting repetitions.
       • Semi-automatic adjustment of segmentation points based on users' dance experience and preferences.
  34. Summary
     • Proposed an automatic segmentation method for dance videos:
       • Uses a Temporal Convolutional Network (TCN).
       • General-purpose method that does not require genre-specific knowledge.
     • Created a dataset of 1410 dance videos from the AIST Dance Video Database by manually annotating the segmentation positions:
       • Annotated by the first author + 20 experienced dancers.
     • Evaluated the proposed method on the dataset:
       • Confirmed its effectiveness on many street dance genres.
       • Confirmed the validity of the visual and auditory features.
     • Proposed an application that supports dance learning by applying automatic segmentation:
       • Segmented sections can be played back in a loop for repeated practice.
       • Confirmed its validity through a user test.
  35. Acknowledgements
     This work was supported by:
     • JST CREST Grant Number JPMJCR17A1, Japan
     • JSPS Grant-in-Aid 23K17022, Japan
     We would also like to thank all the participants who took part in our experiments.
  36. AIST Dance Video Database (AIST Dance DB) is a shared database containing original street dance videos (1,410 dances) with copyright-cleared dance music (60 songs).
     Contact: Shuhei Tsuchida [email protected]
     Thank you