Matrix Capsules with EM Routing

DEEP LEARNING JP [DL Papers] http://deeplearning.jp/
Matrix Capsules with EM Routing (ICLR 2018)
Kazuki Fujikawa, DeNA

• Published as a conference paper at ICLR 2018.
Figure 1: A network with one ReLU convolutional layer followed by a primary convolutional capsule layer and two more convolutional capsule layers.


Figure source: https://hackernoon.com/uncovering-the-intuition-behind-capsule-networks-and-inverse-graphics-part-i-7412d121798d

• To set the activation probability for a particular higher-level capsule, $j$, we compare the description lengths of two different ways of coding the poses of the activated lower-level capsules assigned to $j$ by the routing, as described in section 3. "Description length" is just another term for energy. The difference in the two description lengths (in nats) is put through a logistic function to determine the activation probability of capsule $j$. The logistic function computes the distribution $(p, 1 - p)$ that minimizes free energy when the difference in the energies of the two alternatives is the argument of the logistic function. The energies we use for determining the activation probabilities are the same energies as we use for fitting the Gaussians and computing the assignment probabilities. So all three steps minimize the same free energy but with respect to different parameters for each step.

In some of the explanations above we have implicitly assumed that the lower-level capsules have activities of 1 or 0 and the assignment probabilities computed during the dynamic routing are also 1 or 0. In fact, these numbers are both probabilities and we use the product of these two probabilities as a multiplier on both the baseline description length of each lower-level mean and its alternative description length obtained by making use of the Gaussian fitted by a higher-level capsule.

Source: Hinton et al., ICLR 2018
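The claim that the logistic function yields the free-energy-minimizing distribution can be checked in a few lines. The following is a sketch of that standard argument, not a derivation taken from the slides; the symbols $E_0$, $E_1$ (energies of the "inactive" and "active" alternatives) and $F$ (free energy at unit temperature) are notation assumed here for illustration.

```latex
\begin{align*}
F(p) &= p\,E_1 + (1-p)\,E_0 + p \ln p + (1-p)\ln(1-p) \\
\frac{dF}{dp} &= E_1 - E_0 + \ln\frac{p}{1-p} = 0 \\
\ln\frac{p}{1-p} &= E_0 - E_1
\quad\Longrightarrow\quad
p = \frac{1}{1 + e^{-(E_0 - E_1)}} = \mathrm{logistic}(E_0 - E_1)
\end{align*}
```

Since $F$ is convex in $p$ on $(0, 1)$, this stationary point is the minimum: the activation probability is the logistic of the difference between the two description lengths.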


Figure source: https://medium.com/ai%C2%B3-theory-practice-business/understanding-hintons-capsule-networks-part-ii-how-capsules-work-153b6ade9f66

Figure source: https://medium.com/@mike_ross/a-visual-representation-of-capsule-network-computations-83767d79e737

• Prediction made by lower-level capsule $i$ for higher-level capsule $j$: $\hat{u}_{j|i} = W_{ij} u_i$
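As a minimal sketch of this step, the predictions for all capsule pairs can be computed with one batched matrix product. The function name `compute_votes`, the array layout, and the sizes in the example are assumptions made for illustration; they are not taken from the slides.

```python
import numpy as np

def compute_votes(u, W):
    """Compute u_hat[i, j] = W[i, j] @ u[i] for every capsule pair (i, j).

    u: (n_in, d_in)              outputs of the lower-level capsules
    W: (n_in, n_out, d_out, d_in) one transformation matrix per pair (i, j)
    returns: (n_in, n_out, d_out)
    """
    return np.einsum("ijkl,il->ijk", W, u)

# Hypothetical sizes: 6 lower capsules (8D), 10 higher capsules (16D).
rng = np.random.default_rng(0)
u = rng.normal(size=(6, 8))
W = rng.normal(size=(6, 10, 16, 8))
u_hat = compute_votes(u, W)
print(u_hat.shape)  # (6, 10, 16)
```

Each lower-level capsule therefore produces one prediction per higher-level capsule, which is what the routing step later weighs and combines.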

• Output of higher-level capsule $j$: $v_j = \mathrm{squash}\left(\sum_i c_{ij}\, \hat{u}_{j|i}\right)$
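The slide does not spell out the squash non-linearity itself. The sketch below assumes the form used in the dynamic routing paper, $\mathrm{squash}(s) = \frac{\|s\|^2}{1 + \|s\|^2} \frac{s}{\|s\|}$, which keeps the direction of $s$ and maps its length into $[0, 1)$. The helper names and the small `eps` stabilizer are assumptions for illustration.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """Shrink vectors to length < 1 while keeping their direction.

    eps only guards against division by zero for all-zero inputs.
    """
    sq_norm = np.sum(s * s, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def capsule_output(c, u_hat):
    """v_j = squash(sum_i c_ij * u_hat_{j|i}).

    c: (n_in, n_out) coupling coefficients
    u_hat: (n_in, n_out, d_out) predictions
    returns: (n_out, d_out)
    """
    s = np.einsum("ij,ijk->jk", c, u_hat)
    return squash(s)

v = squash(np.array([[3.0, 4.0]]))
print(np.linalg.norm(v))  # close to 25/26, since the input has length 5
```

Long input vectors end up with length near 1 and short ones near 0, so the length can be read as an existence probability.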

• In convolutional capsule layers, each capsule outputs a local grid of vectors to each type of capsule in the layer above using different transformation matrices for each member of the grid as well as for each type of capsule.

Procedure 1 Routing algorithm.
1: procedure ROUTING($\hat{u}_{j|i}$, $r$, $l$)
2:   for all capsule $i$ in layer $l$ and capsule $j$ in layer $(l+1)$: $b_{ij} \leftarrow 0$.
3:   for $r$ iterations do
4:     for all capsule $i$ in layer $l$: $c_i \leftarrow \mathrm{softmax}(b_i)$ (softmax computes Eq. 3)
5:     for all capsule $j$ in layer $(l+1)$: $s_j \leftarrow \sum_i c_{ij}\, \hat{u}_{j|i}$
6:     for all capsule $j$ in layer $(l+1)$: $v_j \leftarrow \mathrm{squash}(s_j)$ (squash computes Eq. 1)
7:     for all capsule $i$ in layer $l$ and capsule $j$ in layer $(l+1)$: $b_{ij} \leftarrow b_{ij} + \hat{u}_{j|i} \cdot v_j$
   return $v_j$

3 Margin loss for digit existence

We are using the length of the instantiation vector to represent the probability that a capsule's entity exists. We would like the top-level capsule for digit class $k$ to have a long instantiation vector if and only if that digit is present in the image. To allow for multiple digits, we use a separate margin loss, $L_k$, for each digit capsule, $k$:

$$L_k = T_k \max(0, m^+ - \|v_k\|)^2 + \lambda\,(1 - T_k) \max(0, \|v_k\| - m^-)^2 \qquad (4)$$

where $T_k = 1$ iff a digit of class $k$ is present (footnote 3) and $m^+ = 0.9$ and $m^- = 0.1$. The $\lambda$ down-weighting of the loss for absent digit classes stops the initial learning from shrinking the lengths of the activity vectors of all the digit capsules. We use $\lambda = 0.5$. The total loss is simply the sum of the losses of all digit capsules.

4 CapsNet architecture

A simple CapsNet architecture is shown in Fig. 1. The architecture is shallow with only two convolutional layers and one fully connected layer. Conv1 has 256, 9 × 9 convolution kernels with a stride of 1 and ReLU activation. This layer converts pixel intensities to the activities of local feature detectors that are then used as inputs to the primary capsules.

The primary capsules are the lowest level of multi-dimensional entities and, from an inverse graphics perspective, activating the primary capsules corresponds to inverting the rendering process. This is a very different type of computation than piecing instantiated parts together to make familiar wholes, which is what capsules are designed to be good at. The second layer (PrimaryCapsules) is a convolutional capsule layer with 32 channels of convolutional 8D capsules (i.e. each primary capsule contains 8 convolutional units with a 9 × 9 kernel and a stride of 2). Each primary capsule output sees the outputs of all 256 × 81 Conv1 units whose receptive [...]

Footnote 2: For MNIST we found that it was sufficient to set all of these priors to be equal.
Footnote 3: We do not allow an image to contain two instances of the same digit class. We address this weakness of capsules in the discussion section.
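The routing procedure quoted above can be sketched in NumPy as follows. This is an illustrative, unbatched version under assumed array shapes, not the authors' TensorFlow implementation; the squash form is the one assumed from the dynamic routing paper, and returning the coupling coefficients alongside $v_j$ is an addition made here purely for inspection.

```python
import numpy as np

def squash(s, eps=1e-9):
    # Assumed form of Eq. 1: keep direction, map length into [0, 1).
    sq_norm = np.sum(s * s, axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def softmax(b, axis):
    # Numerically stable softmax.
    e = np.exp(b - b.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_routing(u_hat, r=3):
    """u_hat: (n_in, n_out, d_out) predictions; r: number of iterations.

    Returns v (n_out, d_out) and the coupling coefficients c (n_in, n_out)
    that were used in the last iteration.
    """
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))                     # step 2: logits start at 0
    for _ in range(r):                              # step 3
        c = softmax(b, axis=1)                      # step 4: over parents j
        s = np.einsum("ij,ijk->jk", c, u_hat)       # step 5: weighted sum
        v = squash(s)                               # step 6
        b = b + np.einsum("ijk,jk->ij", u_hat, v)   # step 7: add agreement
    return v, c

rng = np.random.default_rng(0)
v, c = dynamic_routing(rng.normal(size=(6, 3, 4)), r=3)
print(v.shape, c.shape)  # (3, 4) (6, 3)
```

Because the logits start at zero, the first iteration sends every lower-level output to all parents with equal weight; later iterations shift weight toward parents whose output agrees with the prediction.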


• Figure 2: Decoder structure to reconstruct a digit from the DigitCaps layer representation. The Euclidean distance between the image and the output of the Sigmoid layer is minimized during training. We use the true label as reconstruction target during training.

[...] fields overlap with the location of the center of the capsule. In total PrimaryCapsules has [32 × 6 × 6] capsule outputs (each output is an 8D vector) and each capsule in the [6 × 6] grid is sharing their weights with each other. One can see PrimaryCapsules as a Convolution layer with Eq. 1 as its block non-linearity. The final layer (DigitCaps) has one 16D capsule per digit class and each of these capsules receives input from all the capsules in the layer below.

We have routing only between two consecutive capsule layers (e.g. PrimaryCapsules and DigitCaps). Since Conv1 output is 1D, there is no orientation in its space to agree on. Therefore, no routing is used between Conv1 and PrimaryCapsules. All the routing logits ($b_{ij}$) are initialized to zero. Therefore, initially a capsule output ($u_i$) is sent to all parent capsules ($v_0 \ldots v_9$) with equal probability ($c_{ij}$).

Our implementation is in TensorFlow (Abadi et al. [2016]) and we use the Adam optimizer (Kingma and Ba [2014]) with its TensorFlow default parameters, including the exponentially decaying learning rate, to minimize the sum of the margin losses in Eq. 4.

4.1 Reconstruction as a regularization method

We use an additional reconstruction loss to encourage the digit capsules to encode the instantiation parameters of the input digit. During training, we mask out all but the activity vector of the correct [...]
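The margin loss of Eq. 4 that the optimizer minimizes can be sketched as below. The values $m^+ = 0.9$, $m^- = 0.1$ and a down-weighting of 0.5 come from the quoted text; the function name, argument names, and the example inputs are assumptions for illustration.

```python
import numpy as np

def margin_loss(v_norm, targets, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Sum over digit capsules of the per-class margin loss (Eq. 4).

    v_norm:  (..., n_classes) lengths ||v_k|| of the digit capsule outputs
    targets: (..., n_classes) T_k, 1.0 if class k is present, else 0.0
    """
    present = targets * np.maximum(0.0, m_plus - v_norm) ** 2
    absent = lam * (1.0 - targets) * np.maximum(0.0, v_norm - m_minus) ** 2
    return np.sum(present + absent, axis=-1)

# Hypothetical two-class example: the present class is too short (0.5 < 0.9)
# and the absent class is too long (0.6 > 0.1), so both terms contribute.
loss = margin_loss(np.array([0.5, 0.6]), np.array([1.0, 0.0]))
print(loss)  # 0.4**2 + 0.5 * 0.5**2 = 0.285 up to floating-point rounding
```

A capsule contributes nothing once its length is on the correct side of its margin, so the loss only pushes on classes that are still wrong or uncertain.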


• Figure 1: A network with one ReLU convolutional layer followed by a primary convolutional capsule layer and two more convolutional capsule layers.

[...] that location. The activations of the primary capsules are produced by applying the sigmoid function to the weighted sums of the same set of lower-layer ReLUs. The primary capsules are followed by two 3x3 convolutional capsule layers (K=3), each with 32 capsule types (C=D=32) with strides of 2 and one, respectively. The last layer of convolutional [...]

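• Dynamic routing and EM routing side by side (only the formulas are recoverable from this slide)
– Dynamic routing: output $v_j = \mathrm{squash}\left(\sum_i c_{ij}\, \hat{u}_{j|i}\right)$; the coupling coefficients $c_{ij}$ are updated from the agreement $\hat{u}_{j|i} \cdot v_j$
– EM routing: activation $a_j = \mathrm{logistic}\left(\lambda\left(\beta_a - \beta_u \sum_i R_{ij} - \sum_h \mathrm{cost}_j^h\right)\right)$; assignment probabilities $R_{ij}$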

• where $\beta_a$ is the same for all capsules and $\lambda$ is an inverse temperature parameter. We learn $\beta_a$ and $\beta_u$ discriminatively and set a fixed schedule for $\lambda$ as a hyper-parameter.

For finalizing the pose parameters and activations of the capsules in layer $L+1$ we run the EM algorithm for a few iterations (normally 3) after the pose parameters and activations have already been finalized in layer $L$. The non-linearity implemented by a whole capsule layer is a form of cluster finding using the EM algorithm, so we call it EM Routing.

Procedure 1 Routing algorithm returns activation and pose of the capsules in layer $L+1$ given the activations and votes of capsules in layer $L$. $V_{ij}^h$ is the $h$th dimension of the vote from capsule $i$ with activation $a_i$ in layer $L$ to capsule $j$ in layer $L+1$. $\beta_a$, $\beta_u$ are learned discriminatively and the inverse temperature $\lambda$ increases at each iteration with a fixed schedule.

1: procedure EM ROUTING($a$, $V$)
2:   $\forall i \in \Omega_L, j \in \Omega_{L+1}$: $R_{ij} \leftarrow 1/|\Omega_{L+1}|$
3:   for $t$ iterations do
4:     $\forall j \in \Omega_{L+1}$: M-STEP($a$, $R$, $V$, $j$)
5:     $\forall i \in \Omega_L$: E-STEP($\mu$, $\sigma$, $a$, $V$, $i$)
   return $a$, $M$

1: procedure M-STEP($a$, $R$, $V$, $j$) (for one higher-level capsule, $j$)
2:   $\forall i \in \Omega_L$: $R_{ij} \leftarrow R_{ij} \cdot a_i$
3:   $\forall h$: $\mu_j^h \leftarrow \dfrac{\sum_i R_{ij} V_{ij}^h}{\sum_i R_{ij}}$
4:   $\forall h$: $(\sigma_j^h)^2 \leftarrow \dfrac{\sum_i R_{ij} (V_{ij}^h - \mu_j^h)^2}{\sum_i R_{ij}}$
5:   $\mathrm{cost}^h \leftarrow \left(\beta_u + \log(\sigma_j^h)\right) \sum_i R_{ij}$
6:   $a_j \leftarrow \mathrm{logistic}\left(\lambda\left(\beta_a - \sum_h \mathrm{cost}^h\right)\right)$

1: procedure E-STEP($\mu$, $\sigma$, $a$, $V$, $i$) (for one lower-level capsule, $i$)
2:   $\forall j \in \Omega_{L+1}$: $p_j \leftarrow \dfrac{1}{\sqrt{\prod_h^H 2\pi (\sigma_j^h)^2}} \exp\left(-\sum_h^H \dfrac{(V_{ij}^h - \mu_j^h)^2}{2 (\sigma_j^h)^2}\right)$
3:   $\forall j \in \Omega_{L+1}$: $R_{ij} \leftarrow \dfrac{a_j p_j}{\sum_{k \in \Omega_{L+1}} a_k p_k}$

4 THE CAPSULES ARCHITECTURE

The general architecture of our model is shown in Fig. 1. The model starts with a 5x5 convolutional layer with 32 channels (A=32) and a stride of 2 with a ReLU non-linearity. All the other layers are capsule layers starting with the primary capsule layer. The 4x4 pose of each of the B=32 primary [...]
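The EM routing procedure can be sketched in NumPy as below, for a single fully connected pair of capsule layers. This is an illustrative reading of the pseudocode, not the authors' implementation. Several details are assumptions: the inverse-temperature schedule `lambdas` (the text only says it is fixed and increasing), the `eps` stabilizers, working with log densities in the E-step, and passing $\beta_u$, $\beta_a$ as plain numbers although they are learned in the paper.

```python
import numpy as np

def em_routing(a_in, V, beta_u, beta_a, lambdas=(1.0, 2.0, 3.0), eps=1e-8):
    """a_in: (n_in,) activations in layer L.
    V: (n_in, n_out, H) votes; V[i, j] is capsule i's vote for capsule j.
    Returns a_out (n_out,) and the pose means M (n_out, H) for layer L+1.
    """
    n_in, n_out, H = V.shape
    R = np.full((n_in, n_out), 1.0 / n_out)          # uniform assignments
    for lam in lambdas:                              # one EM iteration each
        # M-step, for every higher-level capsule j at once.
        Ra = R * a_in[:, None]                       # R_ij <- R_ij * a_i
        Ra_sum = Ra.sum(axis=0) + eps                # (n_out,)
        mu = np.einsum("ij,ijh->jh", Ra, V) / Ra_sum[:, None]
        var = np.einsum("ij,ijh->jh", Ra, (V - mu[None]) ** 2) / Ra_sum[:, None] + eps
        cost = (beta_u + 0.5 * np.log(var)) * Ra_sum[:, None]   # log(sigma) = 0.5 * log(var)
        x = lam * (beta_a - cost.sum(axis=1))
        a_out = 0.5 * (1.0 + np.tanh(0.5 * x))       # overflow-safe logistic(x)
        # E-step, for every lower-level capsule i at once (log domain).
        log_p = -0.5 * np.sum(np.log(2.0 * np.pi * var)[None]
                              + (V - mu[None]) ** 2 / var[None], axis=2)
        log_ap = np.log(a_out + eps)[None] + log_p   # log(a_j * p_j)
        log_ap -= log_ap.max(axis=1, keepdims=True)
        R = np.exp(log_ap)
        R /= R.sum(axis=1, keepdims=True)            # normalise over parents
    return a_out, mu

# Hypothetical toy data: votes for parent 0 agree tightly, votes for parent 1 do not.
rng = np.random.default_rng(0)
V = rng.normal(size=(20, 2, 4))
V[:, 0, :] = np.array([1.0, 2.0, 3.0, 4.0]) + 0.01 * rng.normal(size=(20, 4))
a_out, M = em_routing(np.ones(20), V, beta_u=0.0, beta_a=0.0)
print(a_out, M[0])
```

In this toy setting the parent whose votes form a tight cluster receives a much higher activation, and its pose mean settles near the point the votes agree on, which is the cluster-finding behaviour the text describes.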

• The coupling coefficients are then iteratively refined by measuring the agreement between the current output $v_j$ of each capsule, $j$, in the layer above and the prediction $\hat{u}_{j|i}$ made by capsule $i$.

The agreement is simply the scalar product $a_{ij} = v_j \cdot \hat{u}_{j|i}$. This agreement is treated as if it was a log likelihood and is added to the initial logit, $b_{ij}$, before computing the new values for all the coupling coefficients linking capsule $i$ to higher level capsules.


$a_j = \mathrm{logistic}\left(\lambda\left(\beta_a - \beta_u \sum_i R_{ij} - \sum_h \sum_i -R_{ij} \ln(P^h_{i|j})\right)\right)$

[...] fitting a mixture of Gaussians.
3 USING EM FOR ROUTING-BY-AGREEMENT

Let us suppose that we have already decided on the poses and activation probabilities of all the capsules in a layer and we now want to decide which capsules to activate in the layer above and how to assign each active lower-level capsule to one active higher-level capsule. Each capsule in the higher-layer corresponds to a Gaussian and the pose of each active capsule in the lower-layer (converted to a vector) corresponds to a data-point (or a fraction of a data-point if the capsule is partially active).

Using the minimum description length principle we have a choice when deciding whether or not to activate a higher-level capsule.

Choice 0: if we do not activate it, we must pay a fixed cost of $-\beta_u$ per data-point for describing the poses of all the lower-level capsules that are assigned to the higher-level capsule. This cost is the negative log probability density of the data-point under an improper uniform prior. For fractional assignments we pay that fraction of the fixed cost.

Choice 1: if we do activate the higher-level capsule we must pay a fixed cost of $-\beta_a$ for coding its mean and variance and the fact that it is active and then pay additional costs, pro-rated by the assignment probabilities, for describing the discrepancies between the lower-level means and the values predicted for them when the mean of the higher-level capsule is used to predict them via the inverse of the transformation matrix. A much simpler way to compute the cost of describing a datapoint is to use the negative log probability density of that datapoint's vote under the Gaussian distribution fitted by whatever higher-level capsule it gets assigned to. This is incorrect for reasons explained in appendix 1, but we use it because it requires much less computation (also explained in the appendix).

The difference in cost between choice 0 and choice 1 is then put through the logistic function on each iteration to determine the higher-level capsule's activation probability. Appendix 1 explains why the logistic is the correct function to use.

Using our efficient approximation for choice 1 above, the incremental cost of explaining a whole data-point i by using an active capsule j that has an axis-aligned covariance matrix is simply the sum over all dimensions of the cost of explaining each dimension, h, of the vote $V_{ij}$. This is simply $-\ln(P^h_{i|j})$ where $P^h_{i|j}$ is the probability density of the hth component of the vectorized vote $V_{ij}$ under j's Gaussian model for dimension h which has variance $(\sigma^h_j)^2$ and mean $\mu^h_j$ where $\mu_j$ is the vectorized version of j's pose matrix $M_j$.

$P^h_{i|j} = \frac{1}{\sqrt{2\pi(\sigma^h_j)^2}} \exp\left(-\frac{(V^h_{ij} - \mu^h_j)^2}{2(\sigma^h_j)^2}\right), \quad \ln(P^h_{i|j}) = -\frac{(V^h_{ij} - \mu^h_j)^2}{2(\sigma^h_j)^2} - \ln(\sigma^h_j) - \ln(2\pi)/2$
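As a numerical illustration of the per-dimension cost, the sketch below sums $-R_{ij}\ln(P^h_{i|j})$ over the lower-level capsules for one dimension h and compares it with the closed form $(\ln(\sigma^h_j) + 1/2 + \ln(2\pi)/2)\sum_i R_{ij}$, which follows when the mean and variance are the assignment-weighted estimates. The function names and the random test data are assumptions made for the example.

```python
import numpy as np


def gaussian_log_density(v, mu, sigma):
    """ln(P) for a 1-D Gaussian with mean mu and standard deviation sigma."""
    return -((v - mu) ** 2) / (2 * sigma ** 2) - np.log(sigma) - np.log(2 * np.pi) / 2


def incremental_cost(v, r):
    """Cost of explaining one dimension h of all votes sent to one capsule j.

    v: (n,) h-th component of each vote; r: (n,) assignment weights R_ij.
    Returns the explicit weighted sum and the closed form; they should agree.
    """
    mu = np.sum(r * v) / np.sum(r)                          # weighted mean
    sigma = np.sqrt(np.sum(r * (v - mu) ** 2) / np.sum(r))  # weighted std
    summed = -np.sum(r * gaussian_log_density(v, mu, sigma))
    closed_form = (np.log(sigma) + 0.5 + np.log(2 * np.pi) / 2) * np.sum(r)
    return summed, closed_form
```

The two values agree because the weighted squared deviations sum to exactly $(\sigma^h_j)^2 \sum_i R_{ij}$, so the quadratic term contributes $\sum_i R_{ij}/2$.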

In order to make the training less sensitive to the initialization and hyper-parameters of the model, we use "spread loss" to directly maximize the gap between the activation of the target class ($a_t$) and the activation of the other classes. If the activation of a wrong class, $a_i$, is closer than the margin, m, to $a_t$ then it is penalized by the squared distance to the margin:

$L_i = (\max(0, m - (a_t - a_i)))^2, \quad L = \sum_{i \neq t} L_i$    (3)

By starting with a small margin of 0.2 and linearly increasing it during training to 0.9, we avoid dead capsules in the earlier layers. Spread loss is equivalent to squared Hinge loss with m = 1. Guermeur & Monfrini (2011) studies a variant of this loss in the context of multi class SVMs.

5 EXPERIMENTS

The smallNORB dataset (LeCun et al. (2004)) has gray-level stereo images of 5 classes of toys: airplanes, cars, trucks, humans and animals. There are 10 physical instances of each class which are painted matte green. 5 physical instances of a class are selected for the training data and the other 5 for the test data. Every individual toy is pictured at 18 different azimuths (0-340), 9 elevations and 6 lighting conditions, so the training and test sets each contain 24,300 stereo pairs of 96x96 images. We selected smallNORB as a benchmark for developing our capsules system because it is carefully [...]

We are using the length of the instantiation vector to represent the probability that a capsule's entity exists. We would like the top-level capsule for digit class k to have a long instantiation vector if and only if that digit is present in the image. To allow for multiple digits, we use a separate margin loss, $L_k$, for each digit capsule, k:

$L_k = T_k \max(0, m^+ - \|v_k\|)^2 + \lambda (1 - T_k) \max(0, \|v_k\| - m^-)^2$    (4)

where $T_k = 1$ iff a digit of class k is present and $m^+ = 0.9$ and $m^- = 0.1$. The $\lambda$ down-weighting of the loss for absent digit classes stops the initial learning from shrinking the lengths of the activity vectors of all the digit capsules. We use $\lambda = 0.5$. The total loss is simply the sum of the losses of all digit capsules.

4 CapsNet architecture

A simple CapsNet architecture is shown in Fig. 1. The architecture is shallow with only two convolutional layers and one fully connected layer. Conv1 has 256, 9 × 9 convolution kernels with a stride of 1 and ReLU activation. This layer converts pixel intensities to the activities of local feature detectors that are then used as inputs to the primary capsules.

The primary capsules are the lowest level of multi-dimensional entities and, from an inverse graphics perspective, activating the primary capsules corresponds to inverting the rendering process. This is a very different type of computation than piecing instantiated parts together to make familiar wholes, which is what capsules are designed to be good at.

The second layer (PrimaryCapsules) is a convolutional capsule layer with 32 channels of convolutional 8D capsules (i.e. each primary capsule contains 8 convolutional units with a 9 × 9 kernel and a stride [...]
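The two losses quoted above, spread loss (Eq. 3) and the margin loss of Eq. 4, can be compared side by side with a small NumPy sketch. The function names, the argument layout, and the step-based form of the linear margin schedule are assumptions made for illustration; only the formulas and the constants 0.2, 0.9, 0.1 and 0.5 come from the quoted text.

```python
import numpy as np


def spread_loss(a, target, m):
    """Eq. 3: sum over wrong classes i of max(0, m - (a_t - a_i))^2."""
    a = np.asarray(a, dtype=float)
    per_class = np.maximum(0.0, m - (a[target] - a)) ** 2
    per_class[target] = 0.0  # the sum runs over i != t only
    return per_class.sum()


def margin_schedule(step, total_steps, m_start=0.2, m_end=0.9):
    """Linear increase of the margin from 0.2 to 0.9 (step-based form is assumed)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return m_start + (m_end - m_start) * frac


def margin_loss(v_norms, T, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Eq. 4 summed over classes; v_norms are the capsule vector lengths ||v_k||."""
    v = np.asarray(v_norms, dtype=float)
    T = np.asarray(T, dtype=float)
    present = T * np.maximum(0.0, m_pos - v) ** 2
    absent = lam * (1.0 - T) * np.maximum(0.0, v - m_neg) ** 2
    return np.sum(present + absent)
```

Spread loss looks only at the gap between the target activation and each wrong-class activation, whereas the margin loss penalizes each capsule length against fixed thresholds independently.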

[...] the small subset of them that have appropriate transformation matrices for explaining the data at hand. Fitting to a dataset will then involve deciding which of the transforming Gaussians should be "switched on". We therefore give each transforming Gaussian an additional activation parameter which is its probability of being switched on for the current dataset. The activation parameters are not mixing proportions because they do not sum to 1.

To set the activation probability for a particular higher-level capsule, j, we compare the description lengths of two different ways of coding the poses of the activated lower-level capsules assigned to j by the routing, as described in section 3. "Description length" is just another term for energy. The difference in the two description lengths (in nats) is put through a logistic function to determine the activation probability of capsule j. The logistic function computes the distribution (p, 1 − p) that minimizes free energy when the difference in the energies of the two alternatives is the argument of the logistic function. The energies we use for determining the activation probabilities are the same energies as we use for fitting the Gaussians and computing the assignment probabilities. So all three steps minimize the same free energy but with respect to different parameters for each step.

In some of the explanations above we have implicitly assumed that the lower-level capsules have activities of 1 or 0 and the assignment probabilities computed during the dynamic routing are also 1 or 0. In fact, these numbers are both probabilities and we use the product of these two probabilities as a multiplier on both the baseline description length of each lower-level mean and its alternative description length obtained by making use of the Gaussian fitted by a higher-level capsule.
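The statement that the logistic function yields the free-energy-minimizing distribution (p, 1 − p) can be checked with a short derivation. The symbols $E_0$ and $E_1$ (the energies, i.e. description lengths, of not activating and activating the capsule) and $F$ are notation introduced here for the sketch, not taken from the excerpt.

```latex
\begin{align*}
F(p) &= p\,E_1 + (1-p)\,E_0 + p\ln p + (1-p)\ln(1-p) \\
\frac{dF}{dp} &= E_1 - E_0 + \ln\frac{p}{1-p} = 0 \\
\Longrightarrow\quad p &= \frac{1}{1 + e^{-(E_0 - E_1)}} = \mathrm{logistic}(E_0 - E_1)
\end{align*}
```

Because $d^2F/dp^2 = 1/p + 1/(1-p) > 0$ on (0, 1), this stationary point is the unique minimum, so the activation probability is the logistic of the energy difference between the two alternatives.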
B SUPPLEMENTARY FIGURES Figure B.1: Sample smallNORB images at different viewpoints. All images in first row are at azimuth 0 and elevation 0. The second row shows a set of images at a higher-elevation and different azimuth.

Table 1: The effect of varying different components of our capsules architecture on smallNORB.

Routing iterations   Pose structure   Loss       Coordinate Addition   Test error rate
1                    Matrix           Spread     Yes                   9.7%
2                    Matrix           Spread     Yes                   2.2%
3                    Matrix           Spread     Yes                   1.8%
5                    Matrix           Spread     Yes                   3.9%
3                    Vector           Spread     Yes                   2.9%
3                    Matrix           Spread     No                    2.6%
3                    Vector           Spread     No                    3.2%
3                    Matrix           Margin     Yes                   3.2%
3                    Matrix           CrossEnt   Yes                   5.8%

Baseline CNN with 4.2M parameters                                             5.2%
CNN of Cireşan et al. (2011) with extra input images & deformations           2.56%
Our Best model (third row), with multiple crops during testing                1.4%

We downsample smallNORB to 48 × 48 pixels and normalize each image to have zero mean and unit variance. During training, we randomly crop 32 × 32 patches and add random brightness and contrast to the cropped images. During test, we crop a 32 × 32 patch from the center of the image and achieve 1.8% test error on smallNORB. If we average the class activations over multiple crops at test time we achieve 1.4%. The best reported result on smallNORB without using meta data is 2.56% (Cireşan et al. (2011)). To achieve this, they added two additional stereo pairs of input images that are created by using an on-center off-surround filter and an off-center on-surround filter. They also applied affine distortions to the images. Our work also beats the Sabour et al. (2017) capsule work which achieves 2.7% on smallNORB. We also tested our model on NORB which is a jittered version of smallNORB with added background and we achieved a 2.6% error rate which is on par with the state-of-the-art of 2.7% (Cireşan et al. (2012)).

Figure 2: Histogram of distances of votes to the mean of each of the 5 final capsules after each routing iteration. Each distance point is weighted by its assignment probability. All three images are selected from the smallNORB test set. The routing procedure correctly routes the votes in the truck and the human example. The plane example shows a rare failure case of the model where the plane is confused with a car in the third routing iteration. The histograms are zoomed-in to visualize only votes with distances less than 0.05. Fig. B.2 shows the complete histograms for the "human" capsule without clipping the x-axis or fixing the scale of the y-axis.
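The preprocessing described above (downsample to 48 × 48, per-image normalization, 32 × 32 random crops for training, a center crop for testing) can be sketched in NumPy. The helper names are hypothetical, and the use of 2 × 2 average pooling for downsampling is an assumption, since the excerpt does not say how the images were resized; the random brightness and contrast step is left out because its ranges are not given.

```python
import numpy as np


def downsample_2x(img):
    """Halve each side by 2x2 average pooling (assumed resizing method)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


def normalize(img, eps=1e-8):
    """Shift and scale one image to zero mean and unit variance."""
    return (img - img.mean()) / (img.std() + eps)


def random_crop(img, size, rng):
    """Training-time crop at a uniformly random position."""
    top = rng.integers(0, img.shape[0] - size + 1)
    left = rng.integers(0, img.shape[1] - size + 1)
    return img[top:top + size, left:left + size]


def center_crop(img, size):
    """Test-time crop taken from the center of the image."""
    top = (img.shape[0] - size) // 2
    left = (img.shape[1] - size) // 2
    return img[top:top + size, left:left + size]
```

For a 96 × 96 input, `center_crop(normalize(downsample_2x(img)), 32)` selects rows and columns 8 to 39 of the 48 × 48 image.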

• Hinton, G. E., Frosst, N., & Sabour, S. (2018). Matrix capsules with EM routing. ICLR 2018.
• Sabour, S., Frosst, N., & Hinton, G. E. (2017). Dynamic routing between capsules. In NIPS, 3856-3866.
• Cireşan, D. C., et al. (2011). High-performance neural networks for visual object classification. arXiv preprint arXiv:1102.0183.


Transcript