• Batching of prediction network evaluations. First, we find the y's in the beam that are not in the prediction cache. Second, we make a single call to the Prediction function with the entire batch. Lastly, we add the (y, g_u) pairs to the cache.
• Batching of joint network evaluations. We group the (h_t, g_u) pairs for all hypotheses within the beam into a single batch and make a single joint network and softmax function call.
• Word prefix trees. Instead of iterating over all possible output symbols, we restrict the hypothesis expansion to the successor BPE units of a given node in the prefix tree.
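
The first and third techniques in the list above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the names batched_prediction, predict_batch and PrefixTree are hypothetical, label histories are assumed to be tuples of BPE unit ids, and the handling of word boundaries in the tree (returning to the root after a complete word) is left out.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Prefix = Tuple[int, ...]   # label history y of one hypothesis (assumed encoding)
Vector = List[float]       # prediction network output g_u


def batched_prediction(
    beam: Sequence[Prefix],
    cache: Dict[Prefix, Vector],
    predict_batch: Callable[[List[Prefix]], List[Vector]],
) -> List[Vector]:
    """Return g_u for every hypothesis, calling the network at most once."""
    # 1) Collect the distinct histories that are not in the cache yet.
    missing = list(dict.fromkeys(y for y in beam if y not in cache))
    if missing:
        # 2) One batched call for all missing histories.
        outputs = predict_batch(missing)
        # 3) Store the (y, g_u) pairs for later reuse.
        cache.update(zip(missing, outputs))
    return [cache[y] for y in beam]


class PrefixTree:
    """Trie over the BPE-unit sequences of the vocabulary words."""

    def __init__(self) -> None:
        self.children: Dict[int, "PrefixTree"] = {}

    def add(self, units: Sequence[int]) -> None:
        node = self
        for u in units:
            node = node.children.setdefault(u, PrefixTree())

    def successors(self, units: Sequence[int]) -> List[int]:
        """BPE units that may follow `units`; empty if the path is unknown."""
        node = self
        for u in units:
            if u not in node.children:
                return []
            node = node.children[u]
        return sorted(node.children)
```

During hypothesis expansion, only the units returned by successors would be scored instead of the full output vocabulary, which is where the saving comes from.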

The effect of these optimization techniques on the overall real-time factor (RTF) is shown in Table 2 for both time synchronous decoding (TSD) and alignment-length synchronous decoding (ALSD), at the same CallHome WER of 10.9%. Timings were measured on a 28-core Intel Xeon 2.3 GHz processor.

                             TSD RTF   ALSD RTF
No optimization                1.21      0.89
+ prediction net caching       0.72      0.61
+ prediction net batching      0.61      0.48
+ joint net batching           0.52      0.41
+ prefix tree                  0.41      0.30
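
As a quick reading aid, the cumulative speedups implied by these real-time factors can be computed directly. The numbers below are copied from the table; the variable and function names are only illustrative.

```python
# Real-time factors from Table 2: (row label, TSD RTF, ALSD RTF).
rtf = [
    ("No optimization", 1.21, 0.89),
    ("+ prediction net caching", 0.72, 0.61),
    ("+ prediction net batching", 0.61, 0.48),
    ("+ joint net batching", 0.52, 0.41),
    ("+ prefix tree", 0.41, 0.30),
]


def speedup(column: int) -> float:
    """Unoptimized RTF divided by fully optimized RTF for one column."""
    return rtf[0][column] / rtf[-1][column]


tsd_speedup = speedup(1)                # about 2.95
alsd_speedup = speedup(2)               # about 2.97
alsd_vs_tsd = rtf[-1][1] / rtf[-1][2]   # about 1.37 with all optimizations
```

Both decoders thus gain roughly a factor of 3 from the combined optimizations, and ALSD remains about 1.4 times faster than TSD after all of them are applied.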

Additionally, we employ three data perturbation/augmentation techniques: speed perturbation in the range 0.9-1.1 [18]; sequence noise, where with probability 0.4 we add one or two random training utterances to the current utterance with weight 0.4; and spectral augmentation as described in [1].
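
The sequence noise step can be illustrated with a short sketch. Several details are assumptions made for the example and are not stated in the text: features are plain per-frame lists, the noise utterance is added linearly in the feature domain, and only the overlapping frames are modified. The function name sequence_noise and its parameters are hypothetical.

```python
import random
from typing import List, Sequence

Frames = List[List[float]]   # one feature vector per frame


def sequence_noise(
    utterance: Frames,
    pool: Sequence[Frames],
    rng: random.Random,
    prob: float = 0.4,
    weight: float = 0.4,
) -> Frames:
    """With probability `prob`, add one or two scaled random utterances."""
    out = [list(frame) for frame in utterance]   # never modify the input
    if rng.random() >= prob:
        return out
    for _ in range(rng.choice((1, 2))):
        noise = rng.choice(pool)
        # Assumption: add only where both sequences have frames.
        for t in range(min(len(out), len(noise))):
            out[t] = [x + weight * n for x, n in zip(out[t], noise[t])]
    return out
```

Passing an explicit random.Random instance keeps the augmentation reproducible across runs.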

To achieve the best possible word error rates, we trained an RNN-T with the following architecture [2]: the encoder network has 8 bidirectional LSTM layers with cells per layer per direction with pyramidal subsampling by a factor of 4 after the first and second layers. The prediction network has an embedding layer and 2 unidirectional LSTM layers with 1024 cells. The outputs of the encoder and prediction networks are projected to a common dimension of size 512. The output corresponds to 182 word piece units (plus BLANK) extracted using byte-pair encoding [20].
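
To make the effect of pyramidal subsampling on the sequence length concrete, here is a small sketch. It assumes that the overall factor of 4 is obtained as a factor of 2 after each of the first two layers, and that subsampling concatenates adjacent frames; neither detail is spelled out above, and subsample_by_2 is a hypothetical helper.

```python
from typing import List

Frame = List[float]


def subsample_by_2(frames: List[Frame]) -> List[Frame]:
    """Concatenate each pair of adjacent frames, halving the sequence length."""
    if len(frames) % 2:
        # Odd length: pad with a copy of the last frame (an arbitrary choice).
        frames = frames + [frames[-1]]
    return [frames[t] + frames[t + 1] for t in range(0, len(frames), 2)]


# Ten input frames of dimension 2, subsampled after layer 1 and after layer 2.
layer1_out = [[float(t), 0.0] for t in range(10)]
layer2_in = subsample_by_2(layer1_out)    # 5 frames of dimension 4
layer3_in = subsample_by_2(layer2_in)     # 3 frames of dimension 8
```

Under these assumptions the remaining encoder layers, and therefore the joint network, see roughly T/4 time steps.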

The model was trained in PyTorch to minimize the RNN-T loss for 50 epochs on 8 V100 GPUs using Nesterov-accelerated asynchronous SGD with a batch size of 256 utterances. During the warmup phase, both the batch size and the learning rate are ramped up linearly over the first 2 epochs. The training utterances are randomly grouped into buckets such that the input lengths in each bucket differ by at most 10 frames. The buckets are processed in ascending length order for the first 8 epochs and randomly shuffled after that. Both encoder and prediction networks apply dropout and drop-connect to the hidden-to-hidden matrices [21].
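
The bucketing and scheduling rules above can be sketched in a few lines. This is a simplified illustration under stated assumptions: buckets are formed greedily from the length-sorted utterances rather than by random grouping, epochs are counted from 0, and the warmup ramps from zero, which the text does not specify. All function names are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def make_buckets(lengths: Dict[str, int], max_spread: int = 10) -> List[List[str]]:
    """Group utterance ids so lengths within a bucket differ by <= max_spread."""
    buckets: List[List[str]] = []
    start_len = None
    for utt, n in sorted(lengths.items(), key=lambda kv: kv[1]):
        if start_len is None or n - start_len > max_spread:
            buckets.append([])      # open a new bucket
            start_len = n
        buckets[-1].append(utt)
    return buckets                  # already in ascending length order


def epoch_order(buckets: List[List[str]], epoch: int, rng: random.Random,
                sorted_epochs: int = 8) -> List[List[str]]:
    """Ascending length for the first `sorted_epochs` epochs, shuffled after."""
    order = list(buckets)
    if epoch >= sorted_epochs:
        rng.shuffle(order)
    return order


def warmup(epoch_progress: float, target_lr: float, target_batch: int = 256,
           warmup_epochs: float = 2.0) -> Tuple[float, int]:
    """Linearly ramp learning rate and batch size, then hold them constant."""
    scale = min(1.0, epoch_progress / warmup_epochs)
    return target_lr * scale, max(1, round(target_batch * scale))
```

For example, halfway through the warmup (epoch_progress = 1.0) this schedule yields half the target learning rate and a batch size of 128.
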
Fig. 1. Comparison between time synchronous (left) and alignment-length synchronous (right) search spaces.

Comparing the two algorithms, whose search spaces are illustrated in Figure 1, TSD has a complexity of T × max_sym_exp steps whereas ALSD runs in T + U_max steps. Since, for our experimental setting, U_max < T and max_sym_exp ≈ 4, we expect ALSD to perform fewer operations and consequently be faster than TSD for the same accuracy (even though ALSD requires a larger beam).
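
A small numeric example makes the difference in step counts concrete. The values of T and U_max below are hypothetical and chosen only to satisfy U_max < T; max_sym_exp = 4 follows the text.

```python
def tsd_steps(T: int, max_sym_exp: int) -> int:
    """Expansion steps for time synchronous decoding: T x max_sym_exp."""
    return T * max_sym_exp


def alsd_steps(T: int, U_max: int) -> int:
    """Steps for alignment-length synchronous decoding: T + U_max."""
    return T + U_max


T, U_max, max_sym_exp = 500, 100, 4      # hypothetical utterance
tsd = tsd_steps(T, max_sym_exp)          # 2000
alsd = alsd_steps(T, U_max)              # 600
```

The step count alone does not determine run time, since each ALSD step processes a larger beam, but it indicates why ALSD is expected to perform fewer operations overall.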
System                       Model                  Switchboard   CallHome
                             RNN-T                      8.5         16.4
Hadian et al. (2018) [22]    TDNN-LSTM LF-MMI           7.3         14.2
Xiong et al. (2017) [23]     BLSTM (hybrid)             6.3         12.0
Han et al. (2017) [24]       CNN-BLSTM (hybrid)         5.6         10.7
This system                  RNN-T                      6.2         10.9

Table 1. Word error rate (%) comparison with other single-model systems for the Switchboard 2000 hours task. [23] and [24] use RNN LMs trained on external data sources whereas our model does not.
