convolutional network trained by the VGG group at Oxford University

Deep learning tricks of the trade – tips to save you some time
When things go wrong – detecting problems and debugging
Predictions with the initial (untrained) parameters:

                       °F        K (target)   Prediction     Squared error (ϵ)
Boiling point of He    -452.1      4.22       -896.126276    810623.417462
Boiling point of N     -320.4     77.36       -635.078210    507568.203773
Melting point of H2O     32.0    273.20         63.428535     44004.067369
Body temperature         98.6    310.50        195.439175     13238.993532
Boiling point of H2O    212.0    373.20        420.214047      2210.320605

Predictions after training:

                       °F        K (target)   Prediction     Squared error (ϵ)
Boiling point of He    -452.1      4.22          4.202747         0.000298
Boiling point of N     -320.4     77.36         77.402958         0.001845
Melting point of H2O     32.0    273.20        273.270495         0.004969
Body temperature         98.6    310.50        310.287458         0.045174
Boiling point of H2O    212.0    373.20        373.316342         0.013535

True values: a=0.556, b=255.372
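The model behind these numbers is just a two-parameter linear fit, K ≈ a·F + b, presumably trained by gradient descent on the squared error. A minimal numpy sketch of such a fit (not the exact code behind the tables; the input is standardised here purely so that plain gradient descent converges in a few hundred steps):

```python
import numpy as np

# Temperatures in Fahrenheit (inputs) and Kelvin (targets) from the table above
f = np.array([-452.1, -320.4, 32.0, 98.6, 212.0])
k = np.array([4.22, 77.36, 273.20, 310.50, 373.20])

# Standardise the input so that plain gradient descent converges quickly
f_mean, f_std = f.mean(), f.std()
x = (f - f_mean) / f_std

a, b = 0.0, 0.0                      # parameters of k_pred = a * x + b
lr = 0.1                             # learning rate
for step in range(500):
    pred = a * x + b
    err = pred - k
    grad_a = 2.0 * np.mean(err * x)  # d(mean squared error)/da
    grad_b = 2.0 * np.mean(err)      # d(mean squared error)/db
    a -= lr * grad_a
    b -= lr * grad_b

# Undo the standardisation to express the fit in degrees Fahrenheit;
# the result should approach the true values a=0.556, b=255.372
a_f = a / f_std
b_f = b - a * f_mean / f_std
print(a_f, b_f)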
is fast
• Most likely easier to get going
• Bindings for MATLAB, Python, command-line access
• Less flexible; harder to extend (need to learn the architecture, manual differentiation)

Expression compiler (e.g. Theano)
• Extensible; a new layer type or cost function is no problem
• See what goes on under the hood
• Being adventurous is easier!
• Slower (Theano)
• Debugging can be tricky (compiled expressions are a step away from your code)
• Typically only works with one language (e.g. Python for Theano)
• Evaluate (run/execute) the network
• Measure the average error/cost across the mini-batch
• Use gradient descent to modify the parameters to reduce the cost
REPEAT ABOVE UNTIL DONE
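A minimal sketch of that loop in plain Python/numpy (forward and grad_fn are hypothetical stand-ins for whatever evaluates the network and computes the gradients of its cost; they are not from any particular library):

```python
import numpy as np

def train(params, X, y, forward, grad_fn, lr=0.01, batch_size=100, n_epochs=10):
    """Mini-batch gradient descent; forward and grad_fn are hypothetical callables."""
    n = X.shape[0]
    for epoch in range(n_epochs):
        order = np.random.permutation(n)          # visit samples in a new order each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            xb, yb = X[idx], y[idx]
            pred = forward(params, xb)            # evaluate (run) the network
            cost = np.mean((pred - yb) ** 2)      # average cost over the mini-batch (log it to monitor progress)
            grads = grad_fn(params, xb, yb)       # gradients of the cost w.r.t. each parameter
            params = [p - lr * g                  # gradient descent step
                      for p, g in zip(params, grads)]
    return params
```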
mini-batch results in regularization (due to noise), reaching lower error rates in the end [Goodfellow16]. When using very small mini-batches, you need to compensate with a lower learning rate and more epochs.
Slow due to low parallelism: does not use all the cores of the GPU.
Low memory usage: fewer neuron activations kept in RAM.
rate as with smaller batches, and may not learn at all.
Can be fast due to high parallelism: uses GPU parallelism (there are limits; gains are only achievable if there are unused CUDA cores).
High memory usage: lots of neuron activations kept around; can run out of RAM on large networks.
lots of experiments use ~100.
Effective training: learns reasonably quickly – in terms of improvement per epoch – and reaches an acceptable error rate or loss.
Medium performance: acceptable in many cases.
Medium memory usage: fine for modest-sized networks.
(A rough illustration of this batch-size trade-off follows below.)
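The sketch below is purely illustrative: the per-sample activation count is an assumed figure, not a measurement, but it shows how the memory held for activations grows linearly with the mini-batch size while the number of parameter updates per epoch shrinks.

```python
# Illustrative only: assumed sizes, not measurements from a real network.
n_samples = 50_000                 # training set size
floats_per_sample = 4_000_000      # activations kept per sample for backprop (assumed)

for batch_size in (1, 100, 10_000):
    updates_per_epoch = -(-n_samples // batch_size)               # ceil(n / batch)
    activation_mib = batch_size * floats_per_sample * 4 / 2**20   # float32 bytes -> MiB
    print(f"batch {batch_size:>6}: {updates_per_epoch:>6} updates/epoch, "
          f"~{activation_mib:,.0f} MiB of activations held at once")
```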
networks particularly. A model over-fits when it is very good at correctly predicting samples in the training set but fails to generalise to samples outside it.
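The symptom is easy to reproduce on a toy problem; the sketch below uses a high-degree polynomial fit rather than a network, with made-up data, but shows the same signature: training error collapses while held-out error does not.

```python
import numpy as np

rng = np.random.RandomState(0)

# Tiny synthetic regression task: a handful of noisy training points
x_train = np.linspace(0, 1, 8)
y_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.randn(8)
x_val = np.linspace(0, 1, 100)          # held-out, noise-free points
y_val = np.sin(2 * np.pi * x_val)

for degree in (3, 7):
    coeffs = np.polyfit(x_train, y_train, degree)   # fit on the training set only
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    val_err = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    # The degree-7 polynomial can interpolate all 8 noisy points, so its training
    # error collapses; the validation error, measured on data it never saw,
    # reveals whether it actually generalises.
    print(degree, train_err, val_err)
```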
15] Train two networks: one generates an image from random parameters; the other discriminates between a generated image and one from the training set.
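Structurally, the two objectives look something like the sketch below (d and g are hypothetical callables standing in for the discriminator and generator networks; this shows the losses only, not a full training loop, and the details vary between GAN formulations):

```python
import numpy as np

def bce(p, target):
    """Binary cross-entropy of probabilities p against a 0/1 target."""
    eps = 1e-7
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(target * np.log(p) + (1 - target) * np.log(1 - p))

def gan_losses(d, g, real_images, z):
    """d(x) -> probability that x is real; g(z) -> image generated from random input z."""
    fake_images = g(z)
    # Discriminator: push d(real) towards 1 and d(fake) towards 0
    d_loss = bce(d(real_images), 1.0) + bce(d(fake_images), 0.0)
    # Generator: push d(fake) towards 1, i.e. fool the discriminator
    g_loss = bce(d(fake_images), 1.0)
    return d_loss, g_loss
```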
model [Simonyan14] and extract texture features from one of the convolutional layers, given a target style/painting as input. Use gradient descent to iterate the photo – not the weights – so that its texture features match those of the target image.
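Texture features of this kind are commonly summarised as Gram matrices of a convolutional layer's responses. A small numpy sketch of the descriptor and the matching cost that gradient descent on the photo would minimise (the 1/(h·w) normalisation is an assumption here, and the original formulation sums such a loss over several layers):

```python
import numpy as np

def gram_matrix(features):
    """Texture descriptor from one convolutional feature map.

    features: array of shape (channels, height, width), e.g. the response of
    one VGG convolutional layer to an image.
    """
    c, h, w = features.shape
    f = features.reshape(c, h * w)
    # Correlations between channel responses, averaged over positions;
    # the spatial layout is discarded, only texture statistics remain.
    return f @ f.T / (h * w)

def texture_loss(photo_features, style_features):
    """Squared difference between the two Gram matrices; minimising this with
    respect to the pixels of the photo (not the network weights) is the
    gradient-descent iteration described above."""
    g_photo = gram_matrix(photo_features)
    g_style = gram_matrix(style_features)
    return np.sum((g_photo - g_style) ** 2)
```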