Slide 6
Slide 6 text
coefficients. All dimensions are mean and variance normalized per conversation side. Additionally, we employ data perturbation/augmentation techniques: speed perturbation in the range 0.9-1.1 [18], sequence noise injection, where with probability 0.4 and weight 0.4 we add one or two random training utterances to the current utterance, and spectral augmentation as described in [1].
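To make the augmentation recipe concrete, here is a minimal PyTorch/torchaudio sketch of the three techniques. The 0.4/0.4 sequence-noise settings and the 0.9-1.1 speed factors follow the excerpt; the feature settings, mask sizes, and function names are assumptions, not the paper's code.

import random
import torch
import torchaudio

# 40-dim log-Mel features (per-side mean/variance normalization omitted here).
melspec = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=40)

# Speed perturbation in the range 0.9-1.1, applied to the raw waveform.
speed = torchaudio.transforms.SpeedPerturbation(orig_freq=16000, factors=[0.9, 1.0, 1.1])

# SpecAugment-style masking on the log-Mel features (mask sizes are assumptions).
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=15)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=30)

def sequence_noise_injection(feats, train_feats, prob=0.4, weight=0.4):
    """Simplified sequence noise injection: with probability `prob`, add a randomly
    chosen training utterance's features scaled by `weight` to the current utterance."""
    if random.random() < prob:
        noise = random.choice(train_feats)
        n = min(feats.size(-1), noise.size(-1))   # align on the shorter utterance
        feats = feats.clone()
        feats[..., :n] = feats[..., :n] + weight * noise[..., :n]
    return feats

def augment(waveform, train_feats):
    wav, _ = speed(waveform)                       # speed perturbation 0.9-1.1
    feats = torch.log(melspec(wav) + 1e-6)         # 40-dim log-Mel features
    feats = sequence_noise_injection(feats, train_feats)
    feats = time_mask(freq_mask(feats))            # SpecAugment-style masking
    return feats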
To achieve the best possible word error rates, we trained an RNN-T with the following architecture: as in [2], the encoder network has 8 bidirectional LSTM layers with pyramidal subsampling of the input [5, 6] by a factor of 4 after the first and second layers (a factor of 2 each). The prediction network has an embedding layer and 2 unidirectional LSTM layers with 1024 cells. The outputs of the encoder and prediction networks are projected to a common dimension of size 512. The output layer corresponds to 182 word piece units (plus BLANK) extracted with byte-pair encoding [20].
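A rough PyTorch sketch of an RNN-T with this general shape is given below. The sizes quoted in the excerpt (8 bidirectional encoder LSTM layers, 4x subsampling, 2x1024-cell prediction LSTMs, a 512-dimensional joint projection, 182 wordpieces plus BLANK) are used directly; the encoder cell count and the embedding size are clipped in the excerpt, so the values here are placeholders.

import torch
import torch.nn as nn

NUM_WORDPIECES = 182      # output units from the excerpt (BLANK added as last index)
BLANK = NUM_WORDPIECES
JOINT_DIM = 512           # common projection size from the excerpt
ENC_CELLS = 640           # assumption: cell count is clipped in the excerpt
EMBED_DIM = 256           # assumption: embedding size is clipped in the excerpt

class Encoder(nn.Module):
    """8 bidirectional LSTM layers with pyramidal subsampling by 2 after layers 1 and 2."""
    def __init__(self, feat_dim=40):
        super().__init__()
        dims = [feat_dim] + [2 * ENC_CELLS] * 7
        self.layers = nn.ModuleList(
            [nn.LSTM(d, ENC_CELLS, bidirectional=True, batch_first=True) for d in dims]
        )
        self.proj = nn.Linear(2 * ENC_CELLS, JOINT_DIM)

    def forward(self, x):                       # x: (B, T, feat_dim)
        for i, lstm in enumerate(self.layers):
            x, _ = lstm(x)
            if i in (0, 1):                     # subsample time by 2 after layers 1 and 2
                x = x[:, ::2, :]
        return self.proj(x)                     # (B, ~T/4, JOINT_DIM)

class Predictor(nn.Module):
    """Embedding layer + 2 unidirectional LSTM layers with 1024 cells."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(NUM_WORDPIECES + 1, EMBED_DIM)   # +1 for BLANK/start
        self.lstm = nn.LSTM(EMBED_DIM, 1024, num_layers=2, batch_first=True)
        self.proj = nn.Linear(1024, JOINT_DIM)

    def forward(self, y, state=None):           # y: (B, U) previous wordpiece ids
        g, state = self.lstm(self.embed(y), state)
        return self.proj(g), state               # (B, U, JOINT_DIM)

class Joint(nn.Module):
    """Combine encoder and predictor outputs and score wordpieces + BLANK."""
    def __init__(self):
        super().__init__()
        self.out = nn.Linear(JOINT_DIM, NUM_WORDPIECES + 1)

    def forward(self, h_enc, g_pred):            # (B, T, D) and (B, U, D)
        z = torch.tanh(h_enc.unsqueeze(2) + g_pred.unsqueeze(1))   # (B, T, U, D)
        return self.out(z)                       # logits over wordpieces + BLANK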
The model was trained in PyTorch to minimize the RNN-T loss for 50 epochs on 8 V100 GPUs using Nesterov-accelerated SGD with a batch size of 256 utterances. In the warmup phase, both the batch size and the learning rate were increased linearly over the first 2 epochs, with the learning rate reaching 0.02.
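A correspondingly hedged training-step sketch, reusing the Encoder/Predictor/Joint classes above together with torchaudio's RNN-T loss; the momentum value, the warmup schedule details, and the omission of the batch-size ramp are simplifications, not the paper's recipe.

import torch
import torchaudio.functional as AF

# Assumes Encoder, Predictor, Joint and BLANK from the architecture sketch above.
encoder, predictor, joint = Encoder(), Predictor(), Joint()
params = list(encoder.parameters()) + list(predictor.parameters()) + list(joint.parameters())

# Nesterov SGD with linear learning-rate warmup over the first 2 epochs
# (momentum value and steps_per_epoch are assumptions).
opt = torch.optim.SGD(params, lr=0.02, momentum=0.9, nesterov=True)
steps_per_epoch = 1000
warmup = torch.optim.lr_scheduler.LinearLR(opt, start_factor=0.01, total_iters=2 * steps_per_epoch)

def train_step(feats, feat_lens, targets, target_lens):
    """One RNN-T training step on a padded batch of utterances."""
    h = encoder(feats)                                         # (B, T', 512)
    # Prepend a BLANK so the predictor output has U+1 steps, as rnnt_loss expects.
    y_in = torch.nn.functional.pad(targets, (1, 0), value=BLANK)
    g, _ = predictor(y_in)                                     # (B, U+1, 512)
    logits = joint(h, g)                                       # (B, T', U+1, vocab+1)

    # Encoder frame counts after the two ::2 subsampling steps.
    enc_lens = ((feat_lens + 1) // 2 + 1) // 2
    loss = AF.rnnt_loss(
        logits,
        targets.int(),
        logit_lengths=enc_lens.int(),
        target_lengths=target_lens.int(),
        blank=BLANK,
    )
    opt.zero_grad()
    loss.backward()
    opt.step()
    warmup.step()
    return loss.item()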
• Batching of prediction network evaluations. First, we find the y's in the beam that are not in the prediction cache. Second, we make a single call to the Prediction function with the entire batch. Lastly, we add the (y, gu) pairs to the cache (see the sketch after this list).
• Batching of joint network evaluations. We group the (ht, gu) pairs for all hypotheses within the beam into a single batch and make a single joint network and softmax function call.
• Word prefix trees. Instead of iterating over all possible output symbols, we restrict the hypothesis expansion only to successor BPE units from a given node in the prefix tree.
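The first two optimizations might look roughly as follows in PyTorch: a cache keyed by the wordpiece prefix stores prediction-network outputs, uncached prefixes are evaluated in one batched call, and the joint network plus softmax is applied to the whole beam at once. The names, the padding strategy, and the reuse of the Predictor/Joint modules and BLANK from the architecture sketch are assumptions.

import torch

pred_cache = {}   # wordpiece prefix (tuple of ints) -> prediction-network output g_u

def batched_prediction(hyps, predictor):
    """Evaluate the prediction network once for all uncached prefixes in the beam.
    Prefixes are assumed to start with BLANK, so they are never empty."""
    missing = [h for h in hyps if tuple(h) not in pred_cache]
    if missing:
        # Pad prefixes to a common length and make a single Prediction call.
        max_len = max(len(h) for h in missing)
        batch = torch.full((len(missing), max_len), BLANK, dtype=torch.long)
        for i, h in enumerate(missing):
            batch[i, :len(h)] = torch.tensor(h)
        g, _ = predictor(batch)                            # (B, max_len, 512)
        for i, h in enumerate(missing):
            pred_cache[tuple(h)] = g[i, len(h) - 1]        # keep the last step's output
    return torch.stack([pred_cache[tuple(h)] for h in hyps])   # (beam, 512)

def batched_joint(h_t, g_beam, joint):
    """Single joint-network + softmax call for every hypothesis in the beam."""
    k = g_beam.size(0)
    logits = joint(h_t.view(1, 1, -1).expand(k, 1, -1), g_beam.unsqueeze(1))  # (k, 1, 1, vocab+1)
    return torch.log_softmax(logits.view(k, -1), dim=-1)                      # (k, vocab+1)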
The effect of these optimization techniques on the overall real-time factor (RTF) is shown in Table 2 for both time and alignment-length synchronous decoding for the same CallHome WER of 10.9%. The timing runs are measured on a 28-core Intel Xeon 2.3GHz processor.
Decoding setup              TSD RTF   ALSD RTF
No optimization               1.21      0.89
+ prediction net caching      0.72      0.61
+ prediction net batching     0.61      0.48
+ joint net batching          0.52      0.41
+ prefix tree                 0.41      0.30
[Figure: RNN-T decoding lattice with the time frame t on the horizontal axis (up to T) and the number of wordpieces u on the vertical axis. Example partial hypotheses with scores: 0.83: This is, 0.42: There, ...; 0.97: This is a pen, 0.72: This is a pencil, ...]
At each lattice point, several hypotheses (wordpiece sequences) and their scores are obtained.
The search keeps the N highest-scoring hypotheses and advances toward the upper right (beam search with beam width N).
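For orientation, a much-simplified beam search over this lattice is sketched below: it allows at most one wordpiece per time frame and merges duplicate prefixes by keeping the best score, whereas the TSD/ALSD decoders discussed above allow several expansions per frame. The scoring callback log_probs_fn is a hypothetical stand-in for the batched prediction/joint evaluation.

import math

def simple_beam_search(T, log_probs_fn, beam_width, blank):
    """Simplified RNN-T beam search over the (t, u) lattice.
    log_probs_fn(prefix, t) returns a list of log-probabilities over
    wordpieces and BLANK for the given prefix at frame t."""
    beam = {(): 0.0}                                   # prefix (tuple of ids) -> log score
    for t in range(T):                                 # encoder frames: cannot be skipped
        expanded = {}
        for prefix, score in beam.items():
            logp = log_probs_fn(prefix, t)
            # Stay at the same u by emitting BLANK ...
            best = expanded.get(prefix, -math.inf)
            expanded[prefix] = max(best, score + logp[blank])
            # ... or extend the prefix by one wordpiece.
            for k, lp in enumerate(logp):
                if k == blank:
                    continue
                new = prefix + (k,)
                expanded[new] = max(expanded.get(new, -math.inf), score + lp)
        # Keep the N best hypotheses (beam width = N).
        beam = dict(sorted(expanded.items(), key=lambda kv: kv[1], reverse=True)[:beam_width])
    return max(beam.items(), key=lambda kv: kv[1])     # best (prefix, score)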
• The encoder computation is always required T times (it cannot be skipped).
• The prediction network depends only on the wordpiece hypothesis, and the same hypothesis can be expanded at different lattice points → caching is effective.
• The prediction network and the joint network can be computed independently for different hypotheses → parallelization (batching) is effective.
• When expanding wordpiece hypotheses, hypotheses for nonexistent words can be pruned → speedup with a prefix tree (see the sketch below).
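Finally, a small sketch of the word prefix tree (a trie over BPE units) used to restrict expansions to successor units that can still complete a valid word; the lexicon format and the handling of word boundaries are assumptions.

class PrefixTreeNode:
    """Node in a prefix tree over BPE units; children are indexed by wordpiece id."""
    def __init__(self):
        self.children = {}        # wordpiece id -> PrefixTreeNode
        self.is_word_end = False

def build_prefix_tree(lexicon):
    """lexicon: iterable of words, each given as a list of wordpiece ids."""
    root = PrefixTreeNode()
    for pieces in lexicon:
        node = root
        for p in pieces:
            node = node.children.setdefault(p, PrefixTreeNode())
        node.is_word_end = True
    return root

def allowed_expansions(node):
    """Only successor BPE units from the current node are considered;
    expansions that cannot complete any known word are pruned."""
    return node.children.keys()

# During decoding, each hypothesis tracks its current node in the tree:
# after emitting wordpiece k it moves to node.children[k]; when a word
# ends (node.is_word_end), the next wordpiece restarts from the root.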