[Figure, panel (b) Gumbel Subset Sampling. Labels: Dot-Product, Softmax with 1/τ; tensor shapes $(N_i, c)$, $(N_{i+1}, N_i)$, $(N_{i+1}, c)$; annealing from τ = 1 to τ → 0+; annotations $X_i \in \mathbb{R}^{N_i \times c}$, $W \in \mathbb{R}^{N_{i+1} \times c}$.]

[…]ntion. The core representation […]

Instead, we use a hard and discrete selection with an end-to-end trainable Gumbel softmax (Eq. 3):

$$y_{\text{gumbel}} = \operatorname{gumbel\_softmax}(w X_i^{T}) \cdot X_i, \quad w \in \mathbb{R}^{c}. \tag{13}$$

In the training phase, it provides smooth gradients via the discrete reparameterization trick. With annealing, it degenerates to a hard selection in the test phase.

Gumbel Subset Sampling (GSS) is simply a multi-point version of Eq. 13, which means a distribution of subsets:

$$\operatorname{GSS}(X_i) = \operatorname{gumbel\_softmax}(W X_i^{T}) \cdot X_i, \quad W \in \mathbb{R}^{N_{i+1} \times c}.$$

The following proposition theoretically guarantees the permutation-invariance of GSS.
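The selection rule GSS(X_i) = gumbel_softmax(W X_i^T) · X_i can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `hard` flag, and the choice to realize the τ → 0+ limit as a one-hot argmax over the perturbed logits are assumptions made for illustration. NumPy provides no automatic differentiation, so the sketch covers only the forward pass.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Row-wise Gumbel softmax: softmax((logits + Gumbel noise) / tau)."""
    rng = np.random.default_rng() if rng is None else rng
    g = rng.gumbel(size=logits.shape)          # Gumbel(0, 1) noise
    z = (logits + g) / tau
    z = z - z.max(axis=-1, keepdims=True)      # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def gumbel_subset_sampling(X, W, tau=1.0, hard=False, rng=None):
    """X: (N_i, c) point features, W: (N_next, c). Returns (N_next, c).

    hard=True is an assumed stand-in for the annealed limit tau -> 0+:
    each output row copies exactly one input point.
    """
    y = gumbel_softmax(W @ X.T, tau=tau, rng=rng)   # (N_next, N_i) weights
    if hard:
        one_hot = np.zeros_like(y)
        one_hot[np.arange(y.shape[0]), y.argmax(axis=-1)] = 1.0
        y = one_hot
    return y @ X

# Hypothetical sizes: N_i = 8 points, c = 3 channels, N_{i+1} = 4 outputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
W = rng.normal(size=(4, 3))
soft = gumbel_subset_sampling(X, W, tau=1.0, rng=rng)   # training-style mixture
hard = gumbel_subset_sampling(X, W, hard=True, rng=rng) # test-style selection
```

With `hard=False` each output row is a convex combination of input points, and with `hard=True` each output row is one of the input points. The rows are sampled independently, so the same point may be selected more than once.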