About Google Magenta

Music AI with Google Magenta
DSP Lab, Inha University, Aug 19, 2021
Taein Kim ([email protected]), Department of Electronic Engineering, Inha University, South Korea

Outline
• Music and Audio Data
• Google Magenta
  – Wave2Midi2Wave
  – MAESTRO dataset
  – Music Transformer
• Conclusion

Music and Audio Data 5 Time Domain
• Raw audio
[Figure: waveform; axes: Time, Amplitude]

Music and Audio Data 6 Symbolic Domain
• Symbolic representation (human readable)
• Sequence of:
  • Notes
  • Rhythm
  • Duration
  • Intensity
• Musical Score
[Figure: musical score; axes: Time, Pitch]

Music and Audio Data 7 Symbolic Domain
• Symbolic representation (computer readable)
• List of events:
  • Pitch
  • Start time (Onsets)
  • End time (Offsets)
  • Volume (Velocity)
• Piano Roll / Midi
[Figure: piano roll; axes: Time, Events]
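The event list above maps directly onto a small data structure. The following is a minimal sketch, assuming NumPy is available; the `Note` type, the `notes_to_piano_roll` helper, the frame rate of 100 frames per second, and the example notes are all hypothetical choices made for illustration, not part of Magenta.

```python
from typing import NamedTuple

import numpy as np


class Note(NamedTuple):
    pitch: int       # MIDI pitch number, 0-127
    onset: float     # start time in seconds
    offset: float    # end time in seconds
    velocity: int    # MIDI velocity (loudness), 1-127


def notes_to_piano_roll(notes, frames_per_second=100, n_pitches=128):
    """Render note events into a (pitch, time) matrix holding velocities."""
    duration = max(note.offset for note in notes)
    n_frames = int(np.ceil(duration * frames_per_second))
    roll = np.zeros((n_pitches, n_frames), dtype=np.int16)
    for note in notes:
        start = int(round(note.onset * frames_per_second))
        # Every note occupies at least one frame.
        end = max(start + 1, int(round(note.offset * frames_per_second)))
        roll[note.pitch, start:end] = note.velocity
    return roll


# Hypothetical example: C4, E4, G4 played one after another, 0.5 s each.
notes = [
    Note(pitch=60, onset=0.0, offset=0.5, velocity=80),
    Note(pitch=64, onset=0.5, offset=1.0, velocity=72),
    Note(pitch=67, onset=1.0, offset=1.5, velocity=90),
]
piano_roll = notes_to_piano_roll(notes)
print(piano_roll.shape)  # (128, 150)
```

The event list is compact and exact in time, while the piano roll is a fixed-rate 2D grid that is easier to feed to image-like models.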

Music and Audio Data 8 Time-Frequency Domain
• Audio as an image
• Lossy transformation
• Multiple parameters need to be tuned
[Figure: spectrogram; axes: Time, Frequency, Magnitude]
Image: https://stackoverflow.com/questions/41457036/make-matplotlib-pyplot-color-bar-span-two-rows-alongside-waveform-and-specgram
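To make the "parameters to tune" concrete, here is a minimal sketch of a magnitude spectrogram built with a short-time Fourier transform in plain NumPy. The function name, the frame length of 1024, the hop of 256, the Hann window, and the 16 kHz test tone are assumptions chosen for the example; real pipelines typically add further choices such as a mel filterbank and log scaling.

```python
import numpy as np


def magnitude_spectrogram(audio, frame_length=1024, hop_length=256):
    """Short-time Fourier transform magnitudes, shape (frequency, time)."""
    window = np.hanning(frame_length)
    n_frames = 1 + (len(audio) - frame_length) // hop_length
    frames = np.stack([
        audio[i * hop_length : i * hop_length + frame_length] * window
        for i in range(n_frames)
    ])
    # Keeping only magnitudes discards phase, one reason the transform is lossy.
    return np.abs(np.fft.rfft(frames, axis=1)).T


sample_rate = 16000   # Hz, an assumed value for this example
frame_length = 1024   # samples per analysis window
t = np.arange(sample_rate) / sample_rate   # one second of time stamps
audio = np.sin(2 * np.pi * 440.0 * t)      # a 440 Hz sine tone

spec = magnitude_spectrogram(audio, frame_length=frame_length)
peak_bin = int(spec.mean(axis=1).argmax())
peak_hz = peak_bin * sample_rate / frame_length
print(spec.shape)  # (513, 59)
print(peak_hz)     # frequency of the strongest bin, close to 440 Hz
```

A longer frame gives finer frequency resolution but blurs timing, and a shorter hop gives more time frames at higher cost, which is why these values have to be tuned per task.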

Music and Audio Data 9 Data Summary
                 (Raw) Audio   Score      Piano Roll (Midi)   Spectrogram
Example          [image]       [image]    [image]             [image]
Dimensionality   1D            N/A        2D                  2D
Domain           Time          Symbolic   Symbolic            Time-Frequency

Music and Audio Data 10 Challenges
• High dimensionality
  • 1 sec = 44 k data points
  • 1 min = 2.5 M data points
  • 1 song = 10 M data points
  • (Average novel = 100 k words) *
• Multilevel dependencies
  • Short time – Timbre and pitch
  • Medium term – Rhythm
  • Long term – Song structure
• Non-linear perception of sound
  • Similar waveforms can sound very different
  • Dissimilar waveforms can sound the same
Image: https://deepmind.com/blog/wavenet-generative-model-raw-audio/
* Source: https://self-publishingschool.com/how-many-words-in-a-novel/
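The dimensionality figures can be checked with a few lines of arithmetic. This sketch assumes CD-quality mono audio at 44.1 kHz, which the slide does not state explicitly; under that assumption one minute comes to roughly 2.6 million samples, the same order as the rounded figure above.

```python
SAMPLE_RATE = 44_100  # samples per second, assuming CD-quality mono audio

per_second = SAMPLE_RATE
per_minute = SAMPLE_RATE * 60
# Length in minutes of a song that contains 10 million samples.
song_minutes = 10_000_000 / per_minute

print(per_second)              # 44100
print(per_minute)              # 2646000
print(round(song_minutes, 1))  # 3.8
```

Stereo audio would double every count, which only widens the gap to the 100 k words of an average novel.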

Music and Audio Data 11 Challenges (same bullets as slide 10)
Image adapted from Hawthorne, 2019: https://youtube.videoken.com/embed/1ohtSlux9EQ?tocitem=38

Music and Audio Data 12 Challenges (same bullets as slide 10)
Which one of these waveforms sounds different?
Adapted from Jordi Pons, Jesse Engel slides, 2019

Google Magenta 13
"An open-source research project exploring the role of machine learning as a tool in the creative process."
• Focus on generative models
• Tools for artists and developers
• Open-source code
• Standalone demos

Google Magenta 14

Wave2Midi2Wave 15 Music audio with structured prior (notes)
[Figure labels: Score; Performance (Piano Roll / Midi); Audio]
Slide adapted from https://youtube.videoken.com/embed/1ohtSlux9EQ?tocitem=38
Image: jp.Fotolia.com

Wave2Midi2Wave 16 Overview

Wave2Midi2Wave 17 Transcription model (Encoder)

Wave2Midi2Wave 18 Performance matters
• MIDI data of scores is widely available
• The score is quantized
• Human performance adds:
  • Micro timings (variations in time)
  • Expression (variations in note velocity)
• Very few datasets available
[Figure panels: Score (Quantized); Performance (Unquantized)]
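The difference between a quantized score and an unquantized performance can be sketched numerically. The `quantize` helper, the tempo of 120 BPM, the 16th-note grid, and the onset times below are hypothetical values chosen for illustration.

```python
import numpy as np


def quantize(times, tempo_bpm=120.0, steps_per_beat=4):
    """Snap times (in seconds) to the nearest grid step, e.g. 16th notes."""
    step = 60.0 / tempo_bpm / steps_per_beat  # grid spacing in seconds
    return np.round(np.asarray(times) / step) * step


# Hypothetical performed onsets: slightly early or late relative to the grid.
performed_onsets = np.array([0.012, 0.497, 1.031, 1.489])
score_onsets = quantize(performed_onsets)
micro_timing = performed_onsets - score_onsets

print(score_onsets)   # grid positions: 0.0, 0.5, 1.0, 1.5
print(micro_timing)   # deviations that quantization discards
```

The `micro_timing` array is exactly the information a score-only dataset lacks, which is why recorded performances are needed.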

Wave2Midi2Wave 19 MAESTRO Dataset
MIDI and Audio Edited for Synchronous TRacks and Organization
Data:
• Recorded performances of virtuoso piano competitions
• Audio and Midi recordings
• Midi data collected using Yamaha Disklavier* pianos
• Audio and Midi aligned with ~3 ms accuracy
* Disklavier is a piano with a high-res MIDI capture system: https://en.wikipedia.org/wiki/Disklavier
http://piano-e-competition.com/
Statistics: 1,814 Performances • 430 Compositions • 172.3 Hours of Audio and MIDI • 102.8 GB • 6.18 Million Notes

Wave2Midi2Wave 20 MAESTRO Dataset
Statistics: 1,814 Performances • 430 Compositions • 172.3 Hours of Audio and MIDI • 102.8 GB • 6.18 Million Notes
Hawthorne, C., Stasyuk, A., Roberts, A., Simon, I., Huang, C. Z. A., Dieleman, S., ... & Eck, D. (2018). Enabling factorized piano music modeling and generation with the MAESTRO dataset. arXiv preprint arXiv:1810.12247.

Wave2Midi2Wave 21 Onsets and Frames
• Encoder, based on an improved Onsets and Frames model.
• Translates a mel-spectrogram into a piano roll.
• L_offsets: defined similarly to the onsets loss
[Figure: model diagram; input: Mel-Spectrogram]
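One way to read the interplay of the onset and frame outputs is as a decoding rule: a note may start only where an onset is detected, and it lasts while the frame output stays active. The sketch below is a simplified illustration under that assumption, not Magenta's implementation; the `decode_notes` function, the 0.5 threshold, and the probability arrays are hypothetical, and a note re-struck while still sounding is not split.

```python
import numpy as np


def decode_notes(onset_probs, frame_probs, threshold=0.5):
    """Return (pitch, start_frame, end_frame) tuples from (time, pitch) arrays."""
    onsets = np.asarray(onset_probs) > threshold
    frames = np.asarray(frame_probs) > threshold
    n_frames, n_pitches = frames.shape
    notes = []
    for pitch in range(n_pitches):
        start = None  # frame index where the current note began, if any
        for t in range(n_frames):
            if start is None:
                # A note may begin only if both detectors agree.
                if onsets[t, pitch] and frames[t, pitch]:
                    start = t
            elif not frames[t, pitch]:
                notes.append((pitch, start, t))
                start = None
        if start is not None:
            notes.append((pitch, start, n_frames))
    return notes


# Hypothetical model outputs: 6 frames, 2 pitches.
frame_probs = np.array([
    [0.1, 0.0],
    [0.9, 0.1],
    [0.8, 0.7],
    [0.9, 0.8],
    [0.2, 0.9],
    [0.1, 0.1],
])
onset_probs = np.zeros_like(frame_probs)
onset_probs[1, 0] = 0.95  # only pitch 0 has a detected onset

print(decode_notes(onset_probs, frame_probs))  # [(0, 1, 4)]
```

Pitch 1 has active frames but no onset, so it produces no note; this corresponds to the "frames without onset" case.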

Wave2Midi2Wave 22
[Figure legend]
• Purple – Correct Estimation
• Red – Missed Estimations (False Negative)
• Blue – Incorrect Estimation (False Positive)
• Magenta – Overlapping Onset and Frames
• Black – Onset Predictions
• Cyan – Frames without Onset

Wave2Midi2Wave 23
[Figure legend]
• Magenta – Overlapping Onset and Frames
• Black – Onset Predictions
• Cyan – Frames without Onset
• Purple – Correct Estimation
• Red – Missed Estimations (False Negative)
• Blue – Incorrect Estimation (False Positive)
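The correct, missed, and incorrect categories in the legend correspond to true positives, false negatives, and false positives, which can be counted per piano-roll cell. This is a minimal sketch of frame-level precision, recall, and F1; the `frame_metrics` helper and the two tiny binary piano rolls are hypothetical examples.

```python
import numpy as np


def frame_metrics(reference, estimate):
    """Frame-level precision, recall and F1 for two binary piano rolls."""
    reference = np.asarray(reference, dtype=bool)
    estimate = np.asarray(estimate, dtype=bool)
    tp = int(np.sum(reference & estimate))    # correct estimation
    fn = int(np.sum(reference & ~estimate))   # missed estimation
    fp = int(np.sum(~reference & estimate))   # incorrect estimation
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical piano rolls: 2 pitches by 4 frames.
reference = [[1, 1, 0, 0],
             [0, 1, 1, 0]]
estimate = [[1, 0, 0, 0],
            [0, 1, 1, 1]]

precision, recall, f1 = frame_metrics(reference, estimate)
print(precision, recall, f1)  # 0.75 0.75 0.75
```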

Wave2Midi2Wave 24
• Wave2Midi – Translates raw audio into Midi
• Captures performance nuances
• Problem: Very few datasets with matched audio and Midi

Wave2Midi2Wave 25 Language Model

Music Transformer 26 as Language Model
• Transformer with relative attention
• Modified to work with very long sequences
• Memory consumption reduced from O(L²D) to O(LD), where L = sequence length and D = hidden-state size
[Figure panels: Vanilla Attention; Relative Attention]
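The memory saving comes from how the relative-position term of the attention logits is computed. The cited Music Transformer paper describes a "skewing" step (pad, reshape, slice) that avoids materializing one embedding per query-key pair. The NumPy sketch below illustrates that idea for a single head under stated assumptions: the function names, the array sizes, and the convention that row L-1 of `rel_emb` holds distance 0 are choices made for this example, and entries above the diagonal are left to the causal mask.

```python
import numpy as np


def relative_logits_naive(q, rel_emb):
    """Gather one embedding per (query, key) pair: an (L, L, D) intermediate."""
    L = q.shape[0]
    i = np.arange(L)[:, None]
    j = np.arange(L)[None, :]
    # Row L-1 of rel_emb is distance 0; future keys (j > i) are clipped
    # here and would be masked out by causal attention anyway.
    idx = np.clip(j - i + L - 1, 0, L - 1)
    gathered = rel_emb[idx]                  # shape (L, L, D)
    return np.einsum("id,ijd->ij", q, gathered)


def relative_logits_skewed(q, rel_emb):
    """Same logits for j <= i, using only (L, D) and (L, L) arrays."""
    L = q.shape[0]
    qe = q @ rel_emb.T                       # (L, L): query i vs. distance index
    padded = np.pad(qe, ((0, 0), (1, 0)))    # prepend a zero column: (L, L + 1)
    return padded.reshape(L + 1, L)[1:]      # reshape, drop first row: (L, L)


rng = np.random.default_rng(0)
L, D = 6, 4
q = rng.normal(size=(L, D))
rel_emb = rng.normal(size=(L, D))

causal = np.tril(np.ones((L, L), dtype=bool))  # positions with j <= i
naive = relative_logits_naive(q, rel_emb)
skewed = relative_logits_skewed(q, rel_emb)
print(np.allclose(naive[causal], skewed[causal]))  # True
```

The naive version holds an (L, L, D) tensor, while the skewed version never needs more than the (L, D) embeddings plus the (L, L) logits that attention requires in any case.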

Music Transformer 27 Visualizing Self-Reference

Music Transformer 28 Sequence generation with an initial prime
[Figure panels: PerformanceRNN (LSTM); Vanilla Transformer; Music Transformer (Relative Attention)]

Music Transformer 29 Piano Synthesizer

Music Transformer 30 Piano Synthesis
• WaveNet-based model
• Other methods for synthesis (not based on machine learning):
  • Concatenative synthesis / sampling
  • Physical modeling
(Animated) https://storage.googleapis.com/deepmind-live-cms/documents/BlogPost-Fig2-Anim-160908-r01.gif
Image adapted from Aaron van den Oord et al., 2016

Music Transformer 31 WaveNet Demos
Frédéric Chopin - Mazurka in D Major, Op. 33, No. 2
[Audio samples: Original Audio; WaveNet; Other Synthesis]
https://storage.googleapis.com/magentadata/papers/maestro/index.html

Music Transformer 32 Performance RNN (LSTM) Music Transformer Motif phrase

Conclusion 33
• The Google Magenta project provides research and code for studying music AI
• DDSP makes it easy to manipulate audio and music data
• Wave2Midi can transcribe music audio into MIDI
• The MAESTRO dataset can be a starting point for learning piano performances
• Music Transformer can generate expert-level improvisation while maintaining the initial motif

Questions? 34/31

References 35
• Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse H. Engel, Douglas Eck, "Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset", ICLR 2019
• Curtis Hawthorne, Erich Elsen, Jialin Song, Adam Roberts, Ian Simon, Colin Raffel, Jesse H. Engel, Sageev Oore, Douglas Eck, "Onsets and Frames: Dual-Objective Piano Transcription", ISMIR 2018
• Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Ian Simon, Curtis Hawthorne, Noam Shazeer, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, Douglas Eck, "Music Transformer: Generating Music with Long-Term Structure", ICLR (Poster) 2019
• J. W. Kim and J. P. Bello, "Adversarial Learning for Improved Onsets and Frames Music Transcription", p. 8, 2019. http://archives.ismir.net/ismir2019/paper/000081.pdf
• Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, Koray Kavukcuoglu, "WaveNet: A Generative Model for Raw Audio", arXiv:1609.03499 [cs], September 12, 2016. http://arxiv.org/abs/1609.03499
• Curtis Hawthorne, Talk at ICLR 2019, https://youtube.videoken.com/embed/1ohtSlux9EQ?tocitem=38

