Building high-level features using large-scale unsupervised learning

Figure 3. Top: the top 48 stimuli of the best neuron from the test set. Bottom: the optimal stimulus according to numerical constraint optimization.

4.5. Invariance properties

We would like to assess the robustness of the face detector against common object transformations, e.g., translation, scaling and out-of-plane rotation. First, we chose a set of 10 face images and performed distortions ...

Figure: scale (left) and out-of-plane (3D) rotation (right) invariance properties of the best feature.

Figure 6. Visualization of the cat face neuron (left) and human body neuron (right).

... described in (Zhang et al., 2008); there are 10,000 positive images and 18,4... negative images (so that the positive-to-negative ratio ...).
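The distortion test described in Sec. 4.5 can be sketched as follows. This is a minimal sketch, not the paper's code: `neuron_response` is a hypothetical stand-in for the trained network's best neuron, and only translation is shown (scaling and out-of-plane rotation would require an interpolation or rendering library).

```python
import numpy as np

def translate(image, dx, dy):
    """Translate a 2-D image by (dx, dy) pixels, zero-padding the border."""
    out = np.zeros_like(image)
    h, w = image.shape
    src = image[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    out[max(0, dy):max(0, dy) + src.shape[0],
        max(0, dx):max(0, dx) + src.shape[1]] = src
    return out

def invariance_curve(images, neuron_response, shifts):
    """Mean response of the neuron over a set of images at each horizontal shift."""
    return [np.mean([neuron_response(translate(im, dx, 0)) for im in images])
            for dx in shifts]
```

A robust (invariant) feature would show a response curve that stays high over a range of shifts rather than dropping off immediately.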
A front-end of a single convolution-pooling-convolution filtering on the rectified input, followed by three locally-connected layers and two fully-connected layers. Colors illustrate the feature maps produced at each layer. The net includes more than 120 million parameters, of which more than 95% come from the locally- and fully-connected layers.

The first layers hold very few parameters; they merely expand the input into a set of simple local features. The subsequent layers (L4, L5 and L6) are instead locally connected [13, 16]: like a convolutional layer, they apply a filter bank, but every location in the feature map learns a different set of filters. Since different regions of an aligned image have different local statistics, the spatial-stationarity assumption of convolution cannot hold. For example, areas between the eyes and the eyebrows exhibit a very different appearance and have much higher discrimination ability than areas between the nose and the mouth. In other words, we customize the architecture of the DNN by leveraging the fact that the input images are aligned.

The goal of training is to maximize the probability of the correct class (face id). We achieve this by minimizing the cross-entropy loss for each training sample. If k is the index of the true label for a given input, the loss is L = −log p_k. The loss is minimized over the parameters by computing the gradient of L w.r.t. the parameters and updating them using stochastic gradient descent (SGD). The gradients are computed by standard backpropagation of the error [25, 21].

One interesting property of the features produced by this network is that they are very sparse: on average, 75% of the feature components in the topmost layers are exactly zero. This is mainly due to the use of the ReLU activation function.
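A minimal sketch of this training objective, with a toy linear classifier standing in for the full DNN: softmax probabilities p, loss L = −log p_k, and a plain SGD update using the backpropagated gradient (the well-known softmax/cross-entropy gradient, p − onehot_k).

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy_loss(z, k):
    """L = -log p_k for logits z and true class index k."""
    p = softmax(z)
    return -np.log(p[k]), p

def sgd_step(W, x, k, lr=0.1):
    """One SGD step for a toy linear classifier z = W @ x.
    The gradient of L w.r.t. z is (p - onehot_k); backprop it to W."""
    z = W @ x
    loss, p = cross_entropy_loss(z, k)
    grad_z = p.copy()
    grad_z[k] -= 1.0           # dL/dz = p - onehot_k
    W -= lr * np.outer(grad_z, x)
    return loss
```

Repeated calls to `sgd_step` on the same sample drive the loss toward zero, which is exactly the behavior SGD on the full network exhibits per mini-batch.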
Verification Metric

Verifying whether two input instances belong to the same class (identity) or not has been researched extensively in the domain of unconstrained face recognition, with supervised methods showing a clear performance advantage over unsupervised ones. By training on the target domain's training set, one can fine-tune a feature vector (or classifier) to perform better within the particular distribution of that dataset. For instance, LFW consists of about 75% males, mostly celebrities photographed by professional photographers. As has been demonstrated, training and testing within different domain distributions hurts performance considerably and requires further tuning of the representation (or classifier) to improve its generalization and performance.

Figure 3. The ROC curves on the LFW dataset. Best viewed in color.
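For illustration, a verification decision can be made by thresholding a similarity between the two deep feature vectors. Cosine similarity over L2-normalized features is one simple unsupervised choice; the function names and the threshold value here are illustrative, and in practice the threshold is tuned on validation pairs (supervised metrics trained on the target domain are what give the advantage noted above).

```python
import numpy as np

def cosine_similarity(f1, f2):
    """Similarity between two deep feature vectors after L2 normalization."""
    f1 = f1 / np.linalg.norm(f1)
    f2 = f2 / np.linalg.norm(f2)
    return float(f1 @ f2)

def same_identity(f1, f2, threshold=0.5):
    """Declare 'same identity' when similarity exceeds a tuned threshold."""
    return cosine_similarity(f1, f2) > threshold
```

Sweeping `threshold` over the validation pairs is what traces out an ROC curve like the one in Figure 3.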
arguably the easiest large-vocabulary continuous speech recognition task. We benchmark our system on two test sets from the Wall Street Journal (WSJ) corpus of read news articles. These are available in the LDC catalog as LDC94S13B and LDC93S6B. We also take advantage of the recently developed LibriSpeech corpus, constructed using audio books from the LibriVox project. Table 13 shows that the DS2 system outperforms humans on 3 out of the 4 test sets and is competitive on the fourth. Given this result, we suspect that there is little room for a generic speech system to further improve on clean read speech without further domain adaptation.

Test set                  DS1     DS2     Human
WSJ eval'92               4.94    3.60    5.03
WSJ eval'93               6.94    4.98    8.08
LibriSpeech test-clean    7.89    5.33    5.83
LibriSpeech test-other   21.74   13.25   12.69

Table 13: Comparison of WER for two speech systems and human-level performance on read speech.
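The WER figures in Table 13 are word error rates: the word-level edit distance (substitutions, insertions, deletions) between hypothesis and reference, normalized by reference length. A minimal sketch of the standard dynamic-programming computation:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub,           # substitution (or match)
                          d[i - 1][j] + 1,   # deletion
                          d[i][j - 1] + 1)   # insertion
    return d[len(ref)][len(hyp)] / len(ref)
```

Note that because insertions are counted, WER can exceed 100% on very noisy hypotheses.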
... of games against Pachi [23] and 12% against a slightly weaker program, Fuego [24].

Reinforcement learning of value networks. The final stage of the training pipeline focuses on position evaluation.

Each edge (s, a) of the search tree stores an action value Q(s, a), a visit count N(s, a), and a prior probability P(s, a). The tree is traversed by simulation (that is, descending the tree in complete games without backup), starting from the root state. At each time step t of each simulation, an action a_t is selected from state s_t so as to maximize the action value plus a bonus, a_t = argmax_a (Q(s_t, a) + u(s_t, a)).

Figure 3 | Monte Carlo tree search in AlphaGo. a, Selection: each simulation traverses the tree by selecting the edge with maximum action value Q, plus a bonus u(P) that depends on a stored prior probability P for that edge. b, Expansion: the leaf node may be expanded; the new node is processed once by the policy network p_σ and the output probabilities are stored as prior probabilities P for each action. c, Evaluation: at the end of a simulation, the leaf node is evaluated in two ways: using the value network v_θ; and by running a rollout to the end of the game with the fast rollout policy p_π, then computing the winner with function r. d, Backup: action values Q are updated to track the mean value of all evaluations r(·) and v_θ(·) in the subtree below that action.

Source: http://www.nature.com/nature/journal/v529/n7587/abs/nature16961.html?lang=en
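The selection and backup steps described above can be sketched as follows. The bonus u here uses a PUCT-style form, u(s, a) ∝ P(s, a)·√(Σ_b N(s, b)) / (1 + N(s, a)), so that it is proportional to the prior and decays as the edge is visited; the constant `c_puct` and the class layout are illustrative assumptions, not the paper's code.

```python
import math

class Edge:
    """One (s, a) edge of the search tree: action value Q, visit count N, prior P."""
    def __init__(self, prior):
        self.Q = 0.0    # mean value of evaluations in the subtree
        self.N = 0      # visit count
        self.P = prior  # prior probability from the policy network
        self.W = 0.0    # total value, so that Q = W / N

def bonus(edge, parent_visits, c_puct=5.0):
    """Exploration bonus u(P): high for high-prior, rarely visited edges."""
    return c_puct * edge.P * math.sqrt(parent_visits) / (1 + edge.N)

def select(edges):
    """Selection step: pick the action maximizing Q + u(P)."""
    total = sum(e.N for e in edges.values())
    return max(edges, key=lambda a: edges[a].Q + bonus(edges[a], total))

def backup(path, value):
    """Backup step: update each traversed edge toward the mean of all evaluations."""
    for edge in path:
        edge.N += 1
        edge.W += value
        edge.Q = edge.W / edge.N
```

Because u decays with N while Q tracks the running mean, the search initially follows the policy network's priors and asymptotically prefers the empirically best-valued actions.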