things about the input they received, which allows them to be very precise in predicting what’s coming next. This is why they’re the preferred algorithm for sequential data like time series, speech, text, fi nancial data, audio, video, weather and much more Recurrent Neural Network
transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models are connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. …”
Abstract “… Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. …”