Slide 1

(no content)

Slide 2

LONG ET AL., INTRO
• Fully convolutional networks

Slide 3

LONG ET AL., INTRO

Slide 4

LONG ET AL., INTRO
• Q: How do you make a network fully convolutional?
• A: by making it fully convolutional, i.e., by replacing the fully connected layers with equivalent convolutions (an FC layer over a feature map is just a convolution whose kernel spans the whole map; see the sketch below)
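
A minimal PyTorch sketch of this "convolutionalization" (illustrative, not the authors' code; the 512-channel 7x7 feature map is an assumed VGG-style shape):

```python
import torch
import torch.nn as nn

# An FC layer that sees a 512-channel 7x7 feature map computes the same linear
# map as a 7x7 convolution with 512 input channels and one output channel per unit.
fc = nn.Linear(512 * 7 * 7, 1000)           # classification head
conv = nn.Conv2d(512, 1000, kernel_size=7)  # convolutional equivalent
conv.weight.data = fc.weight.data.view(1000, 512, 7, 7)
conv.bias.data = fc.bias.data

x = torch.randn(1, 512, 7, 7)
assert torch.allclose(fc(x.flatten(1)), conv(x).flatten(1), atol=1e-5)

# On a larger input, the conv head emits a spatial grid of class scores instead
# of a single vector: the network is now fully convolutional.
y = conv(torch.randn(1, 512, 16, 16))       # shape (1, 1000, 10, 10)
```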

Slide 5

LONG ET AL., INTRO
• Okay, but how do we obtain “dense” predictions, i.e., predictions for every pixel in the output?
  1. Shift-and-stitch, or equivalently the ‘à trous’ algorithm / dilated convolution
  2. Upsampling, AKA backwards convolution or deconvolution
  (both options are sketched below)
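
Both options in miniature (a hedged PyTorch sketch; channel counts are made up, and 21 classes follows the PASCAL VOC convention):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)

# 1. Dilated ("à trous") convolution: grow the receptive field without striding,
#    so the feature map keeps its resolution instead of needing shift-and-stitch.
dilated = nn.Conv2d(64, 64, kernel_size=3, dilation=2, padding=2)
print(dilated(x).shape)        # (1, 64, 32, 32) -- spatial size preserved

# 2. Backwards convolution / "deconvolution": a learnable transposed convolution
#    that upsamples coarse score maps toward input resolution.
coarse = torch.randn(1, 21, 8, 8)   # 21 classes, as in PASCAL VOC
up = nn.ConvTranspose2d(21, 21, kernel_size=4, stride=2, padding=1)
print(up(coarse).shape)        # (1, 21, 16, 16) -- 2x upsampling
```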

Slide 6

LONG ET AL., INTRO

Slide 7

LONG ET AL., RESULTS
• Converting classification nets to segmentation nets yielded state-of-the-art results on PASCAL VOC and other benchmarks

Slide 8

LONG ET AL., RESULTS
• Adding the “deep jet” with skip layers improved the segmentation detail (skip fusion sketched below)
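
A rough sketch of the skip ("deep jet") fusion in the FCN-16s spirit (shapes and layer names are illustrative assumptions, not the paper's code):

```python
import torch
import torch.nn as nn

num_classes = 21
pool4_feats = torch.randn(1, 512, 32, 32)            # finer, shallower features
coarse_scores = torch.randn(1, num_classes, 16, 16)  # coarser, deeper scores

score_pool4 = nn.Conv2d(512, num_classes, kernel_size=1)  # 1x1 scoring layer
up2x = nn.ConvTranspose2d(num_classes, num_classes,
                          kernel_size=4, stride=2, padding=1)

# Fuse fine detail with deep semantics by summing the two score maps; the full
# model then upsamples the fused result back to input resolution.
fused = score_pool4(pool4_feats) + up2x(coarse_scores)   # (1, 21, 32, 32)
```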

Slide 9

VAN DEN OORD ET AL.: INTRO
• A generative model for raw audio
  – “What if we used PixelCNN on audio data?”

Slide 10

VAN DEN OORD ET AL.: INTRO
• Even more secret ingredient: dilated causal convolution (sketched below)
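
A minimal sketch of one dilated causal 1-D convolution layer (assumed channel counts; the real WaveNet stacks many such layers with dilations doubling 1, 2, 4, ..., 512):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalConv1d(nn.Module):
    def __init__(self, channels, dilation):
        super().__init__()
        self.dilation = dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)

    def forward(self, x):
        # Pad on the left only, so the output at time t never sees inputs after t.
        x = F.pad(x, (self.dilation, 0))
        return self.conv(x)

x = torch.randn(1, 32, 1000)              # (batch, channels, time)
layer = DilatedCausalConv1d(32, dilation=4)
print(layer(x).shape)                      # (1, 32, 1000) -- length preserved
```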

Slide 11

VAN DEN OORD ET AL.: INTRO
• Even more secret ingredient: dilated causal convolution (cont.)

Slide 12

VAN DEN OORD ET AL.: INTRO
• Yet more secret ingredients:
  – Output is a softmax layer trained on µ-law-companded data
    • a non-linear transformation into 256 classes that can be mapped back to the full range of 16-bit audio output (sketched below)
  – Gated activation units
  – Residual and skip connections
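
A sketch of the µ-law companding behind that softmax (the formula is from the paper; µ = 255 gives 256 classes):

```python
import numpy as np

MU = 255  # 256 quantization levels, per the paper

def mu_law_encode(x):
    """Amplitude in [-1, 1] -> integer class in [0, 255]."""
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return np.round((y + 1) / 2 * MU).astype(np.int64)

def mu_law_decode(c):
    """Integer class -> amplitude in [-1, 1], rescalable to the 16-bit range."""
    y = 2 * (c.astype(np.float64) / MU) - 1
    return np.sign(y) * ((1 + MU) ** np.abs(y) - 1) / MU

x = np.linspace(-1.0, 1.0, 5)
print(mu_law_decode(mu_law_encode(x)))   # approximately recovers x
```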

Slide 13

VAN DEN OORD ET AL.: INTRO
• Your model needs a conditioner
  – global conditioning (e.g., a speaker identity applied at every timestep)
  – local conditioning (e.g., time-aligned linguistic features for TTS)
  (gated activation with conditioning sketched below)
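
A sketch combining the gated activation unit with a global conditioning vector h, e.g. a speaker embedding (layer sizes are illustrative assumptions): z = tanh(W_f * x + V_f h) ⊙ σ(W_g * x + V_g h).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConditionedLayer(nn.Module):
    """z = tanh(W_f*x + V_f h) * sigmoid(W_g*x + V_g h), with h broadcast over time."""
    def __init__(self, channels, cond_dim, dilation):
        super().__init__()
        self.dilation = dilation
        self.filt = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)
        self.gate = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)
        self.cond_f = nn.Linear(cond_dim, channels)   # V_f
        self.cond_g = nn.Linear(cond_dim, channels)   # V_g

    def forward(self, x, h):
        x = F.pad(x, (self.dilation, 0))              # causal padding
        f = self.filt(x) + self.cond_f(h).unsqueeze(-1)
        g = self.gate(x) + self.cond_g(h).unsqueeze(-1)
        return torch.tanh(f) * torch.sigmoid(g)

layer = GatedConditionedLayer(channels=32, cond_dim=16, dilation=2)
out = layer(torch.randn(1, 32, 100), torch.randn(1, 16))
print(out.shape)                                      # (1, 32, 100)
```

Local conditioning works the same way except that h is a time-aligned feature sequence (upsampled to audio rate) added per timestep rather than broadcast.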

Slide 14

VAN DEN OORD ET AL.: RESULTS
• 3.1: We got it to make up speech.
• 3.2: It did better than other models on text-to-speech (TTS)
  – the baselines are concatenative (HMM-driven unit selection) and statistical parametric (LSTM-RNN) synthesizers

Slide 15

VAN DEN OORD ET AL.: RESULTS
• 3.2: It did better than other models on text-to-speech (TTS) (cont.)
• 3.3: We got it to make music.