ML Paper Challenge Day 15 — Speech Recognition with Deep Recurrent Neural Networks

Original article can be found here (source): Deep Learning on Medium


Day 15: 2020.04.26
Paper: Speech Recognition with Deep Recurrent Neural Networks
Category: Model/Deep Learning/Speech Recognition



Network Training:

Connectionist Temporal Classification (CTC): A type of neural network output and associated scoring function which enable training RNNs on sequence labelling problems where the input-output alignment is unknown

Input: A sequence of observations
Output: A sequence of labels, extended with a special null (blank) symbol
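The CTC scoring function can be sketched with the standard forward (alpha) recursion: the label sequence is interleaved with blanks, and the total probability sums over every frame-level path that collapses to the target labelling. A minimal NumPy sketch (function name and interface are illustrative, not from the paper):

```python
import numpy as np

def ctc_forward(log_probs, labels, blank=0):
    """Log-probability of `labels` under CTC, given per-frame
    log-distributions `log_probs` of shape (T, K) over K symbols
    (index `blank` is the null symbol)."""
    # Extended sequence: blanks between labels and at both ends.
    ext = [blank]
    for l in labels:
        ext += [l, blank]
    T, S = log_probs.shape[0], len(ext)
    alpha = np.full((T, S), -np.inf)
    alpha[0, 0] = log_probs[0, ext[0]]          # start with a blank...
    alpha[0, 1] = log_probs[0, ext[1]]          # ...or the first label
    for t in range(1, T):
        for s in range(S):
            cand = [alpha[t - 1, s]]            # stay on the same symbol
            if s > 0:
                cand.append(alpha[t - 1, s - 1])
            # Skip over a blank when adjacent labels differ.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cand.append(alpha[t - 1, s - 2])
            alpha[t, s] = np.logaddexp.reduce(cand) + log_probs[t, ext[s]]
    # A valid path may end on the last label or the trailing blank.
    return np.logaddexp(alpha[T - 1, S - 1], alpha[T - 1, S - 2])
```

For two frames with uniform probabilities over {blank, label}, the paths collapsing to a single label are (label, label), (label, blank) and (blank, label), so the total probability is 3 × 0.25 = 0.75.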

RNN Transducer: combines a CTC-like network with a separate RNN that predicts each phoneme given the previous ones, thereby yielding a jointly trained acoustic and language model.

Whereas CTC determines an output distribution at every input time-step, an RNN transducer determines a separate distribution Pr(k|t, u) for every combination of input time-step t and output time-step u.
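The joint distribution can be sketched as in the paper: the transcription (acoustic) network contributes a logit vector per input step t, the prediction (language-model) network contributes one per output step u, and the two are added and normalised. The function name below is illustrative:

```python
import numpy as np

def transducer_dist(f, g):
    """Pr(k | t, u) for an RNN transducer (sketch).
    f: (T, K) transcription-network logits, one row per input step t.
    g: (U, K) prediction-network logits, one row per output step u.
    Returns an array of shape (T, U, K) where each (t, u) slice is a
    softmax over the summed logits."""
    logits = f[:, None, :] + g[None, :, :]            # broadcast to (T, U, K)
    logits -= logits.max(axis=-1, keepdims=True)      # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)
```

Because the acoustic and prediction networks are combined inside one normalised distribution, the two models are trained jointly rather than composed after the fact.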

RNN transducers can be trained from random initial weights. However, they appear to work better when initialised with the weights of a pre-trained CTC network and a pre-trained next-step prediction network.

Decoding: beam search to yield an n-best list of candidate transcriptions
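The beam mechanics can be sketched as follows. This is a simplified illustration over independent per-step distributions, not the paper's full transducer decoder; the interface is hypothetical:

```python
import heapq

def beam_search(step_probs, beam_width=3, n_best=2):
    """Keep the `beam_width` highest-scoring label prefixes at each
    output step and return an n-best list of (sequence, probability).
    step_probs: list of dicts mapping label -> probability, one per step."""
    beams = [((), 1.0)]                       # (label sequence, probability)
    for dist in step_probs:
        expanded = [(seq + (k,), p * pk)
                    for seq, p in beams
                    for k, pk in dist.items()]
        beams = heapq.nlargest(beam_width, expanded, key=lambda b: b[1])
    return beams[:n_best]
```

With a wide enough beam this recovers the exact n-best list; narrowing the beam trades search accuracy for speed.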


Regularisation:

  1. Early stopping
  2. Weight noise: added once per training sequence, rather than at every time-step
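The weight-noise step above can be sketched as: perturb the weights with Gaussian noise once for the whole sequence, compute gradients at the noisy point, then update the clean weights. `grad_fn` and the update rule are hypothetical stand-ins; the noise standard deviation of 0.075 follows the paper:

```python
import numpy as np

def train_step_with_weight_noise(weights, sequence, grad_fn,
                                 lr=1e-3, sigma=0.075, rng=None):
    """One SGD step with weight-noise regularisation (sketch).
    weights: dict name -> ndarray of clean weights.
    grad_fn(noisy_weights, sequence) -> dict of gradients (hypothetical).
    Noise is sampled once per sequence, not once per time-step."""
    rng = rng or np.random.default_rng()
    noisy = {name: w + rng.normal(0.0, sigma, w.shape)
             for name, w in weights.items()}
    grads = grad_fn(noisy, sequence)          # gradients at the noisy point
    return {name: weights[name] - lr * grads[name] for name in weights}
```

Sampling the noise once per sequence keeps the perturbation consistent across the RNN's recurrent updates within that sequence.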