Image Recognition with DL4J II Dr. Javier Gonzalez-Sanchez [email protected] javiergs.engineering.asu.edu | javiergs.com PERALTA 230U Office Hours: By appointment
jgs Career Fair § Master’s and PhD Online Career Fair: Tuesday, Feb 15, 2022, 9 a.m.–4 p.m. § No lecture that day. § Faculty picnic with students: Thursday, Feb 24, 2022, 11:30 am (SER Faculty, SCAI Director, Dean of Students). There will be food. § I will start the lecture at 12:15 pm that day.
jgs Weight Initialization | Xavier § A too-large initialization leads to exploding gradients (partial derivatives). § A too-small initialization leads to vanishing gradients (partial derivatives). Advice: § The mean of the activations should be zero. § The variance of the activations should stay the same across every layer. // variance: a statistical measurement of the spread between numbers in a data set
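A minimal DL4J sketch of where Xavier initialization is set in a network configuration. The layer sizes (784 inputs, 100 hidden units, 10 classes) are illustrative assumptions, not values from the slides.

import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.layers.DenseLayer;
import org.deeplearning4j.nn.conf.layers.OutputLayer;
import org.deeplearning4j.nn.weights.WeightInit;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.lossfunctions.LossFunctions;

public class XavierInitExample {
  public static void main(String[] args) {
    MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
        // Xavier scales the initial weights so activation variance stays roughly constant across layers
        .weightInit(WeightInit.XAVIER)
        .list()
        .layer(0, new DenseLayer.Builder().nIn(784).nOut(100)
            .activation(Activation.RELU).build())
        // SoftMax output paired with negative log-likelihood, as discussed in the following slides
        .layer(1, new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
            .nIn(100).nOut(10).activation(Activation.SOFTMAX).build())
        .build();
    System.out.println(conf.toJson());
  }
}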
jgs Activation Functions | SoftMax § Sigmoid treats each output neuron independently; SoftMax normalizes across all output neurons. § Most popular activation function for output layers handling multiple classes. § Outputs can be read as probabilities (non-negative and summing to 1).
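For reference, the standard SoftMax definition for output i over K classes (not written out on the slide):

\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{k=1}^{K} e^{z_k}}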
jgs Error Function | Negative Log-Likelihood § The SoftMax function is used in tandem with the negative log-likelihood. § Likelihood L(y, w): how likely it is that the observed data y would be produced by parameter values w. § Likelihood values lie in the range 0 to 1. § Taking the log simplifies the derivatives. § Log-likelihood values then lie in the range -Infinity to 0. § Negating gives values in the range 0 to Infinity, which we minimize. https://hea-www.harvard.edu/AstroStat/aas227_2016/lecture1_Robinson.pdf
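A standard form of this loss for a one-hot target y and SoftMax output \hat{y} over K classes (not spelled out on the slide):

\mathrm{NLL}(y, \hat{y}) = -\sum_{i=1}^{K} y_i \log \hat{y}_i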
jgs Updater § Training mechanisms. § There are update methods that can result in much faster network training than 'vanilla' gradient descent. You can set the updater using the .updater(Updater) configuration option. § E.g., Momentum, RMSProp, AdaGrad, Adam, Nadam, and others.
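A minimal sketch of setting the updater in a DL4J configuration; in recent DL4J versions the updater is passed as a config object (Nesterovs, RmsProp, AdaGrad, Adam, Nadam). The learning-rate and momentum values here are illustrative assumptions.

import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.nd4j.linalg.learning.config.Nesterovs;

public class UpdaterExample {
  public static void main(String[] args) {
    // Momentum via Nesterovs: learning rate 0.01, momentum coefficient 0.9
    NeuralNetConfiguration.Builder builder = new NeuralNetConfiguration.Builder()
        .updater(new Nesterovs(0.01, 0.9));
    // Alternatives: new RmsProp(0.01), new AdaGrad(0.01), new Adam(0.001), new Nadam(0.001)
    System.out.println(builder.build());
  }
}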
jgs Updater § A limitation of gradient descent is that the progress of the search can slow down if the gradient becomes flat or has large curvature. § Momentum can be added to gradient descent to incorporate some inertia into the updates. // momentum: quantity of motion of a moving body (product of its mass and velocity)
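The standard momentum update, not written out on the slide, with velocity v, momentum coefficient \beta, learning rate \eta, and loss L:

v_t = \beta v_{t-1} - \eta \nabla_w L(w_{t-1})
w_t = w_{t-1} + v_t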
jgs Updater § Adaptive Moment Estimation (Adam): calculates a per-parameter learning rate and further smooths the search by using an exponentially decaying moving average of the gradient. § Nesterov Momentum + Adam = Nadam (Nesterov-accelerated Adaptive Moment Estimation).
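For reference, the standard Adam update for gradient g_t (not shown on the slide), with decay rates \beta_1, \beta_2, learning rate \eta, and a small constant \epsilon:

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
\hat{m}_t = m_t / (1 - \beta_1^t),   \hat{v}_t = v_t / (1 - \beta_2^t)
w_t = w_{t-1} - \eta \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)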
Javier Gonzalez-Sanchez, Ph.D. [email protected] Spring 2022. Copyright: These slides can only be used as study material for the class CSE205 at Arizona State University. They cannot be distributed or used for another purpose.