Original article can be found here (source): Deep Learning on Medium
ML Paper Challenge Day 15 — Speech Recognition with Deep Recurrent Neural Networks
Day 15: 2020.04.26
Paper: Speech Recognition with Deep Recurrent Neural Networks
Category: Model/Deep Learning/Speech Recognition
Connectionist Temporal Classification (CTC): A type of neural network output layer and associated scoring function that enables training RNNs for sequence labelling problems where the input-output alignment is unknown
Input: A sequence of observations
Output: A sequence of labels (which may include a special blank/null symbol)
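A minimal sketch (not the paper's code) of how CTC output maps to labels: the network emits a distribution over the labels plus a special blank at every input time-step, and decoding collapses repeated labels and drops blanks, so the input-output alignment never needs to be known in advance. The blank index and function name here are illustrative.

```python
BLANK = 0  # hypothetical index reserved for the CTC blank symbol

def ctc_greedy_decode(frame_argmax):
    """Collapse a per-frame best-label path into an output label sequence:
    merge consecutive repeats, then remove blanks."""
    out = []
    prev = None
    for label in frame_argmax:
        if label != BLANK and label != prev:
            out.append(label)
        prev = label
    return out

# e.g. the frame-level path [blank, a, a, blank, b, b] collapses to [a, b]
print(ctc_greedy_decode([0, 1, 1, 0, 2, 2]))  # -> [1, 2]
```

Note how the same output sequence can arise from many different alignments (paths); CTC's scoring function sums the probabilities of all of them.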
RNN Transducer: combines a CTC-like network with a separate RNN that predicts each phoneme given the previous ones, thereby yielding a jointly trained acoustic and language model.
Whereas CTC determines an output distribution at every input time-step, an RNN transducer determines a separate distribution Pr(k|t, u) for every combination of input time-step t and output time-step u.
RNN transducers can be trained from random initial weights. However, they appear to work better when initialised with the weights of a pre-trained CTC network and a pre-trained next-step prediction network.
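A minimal sketch of how Pr(k|t, u) is formed: the transcription (acoustic) network's output at input step t and the prediction (language-model) network's output at output step u are combined through a small tanh hidden layer, then a softmax over the K labels. All weight names and sizes below are illustrative, not the paper's.

```python
import math

def matvec(W, x):
    """Plain matrix-vector product over nested lists."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    s = sum(exps)
    return [e / s for e in exps]

def joint_distribution(l_t, p_u, W_l, W_p, b, W_h, b_h):
    """Pr(k | t, u): mix the transcription output l_t (input step t) with the
    prediction output p_u (output step u) via a tanh layer, then softmax."""
    hidden = [math.tanh(a + c + d)
              for a, c, d in zip(matvec(W_l, l_t), matvec(W_p, p_u), b)]
    logits = [s + bo for s, bo in zip(matvec(W_h, hidden), b_h)]
    return softmax(logits)
```

Because every (t, u) pair gets its own distribution, the acoustic and language components are trained jointly rather than combined after the fact.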
Decoding: beam search is used to yield an n-best list of candidate transcriptions
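A highly simplified sketch of the beam idea: real transducer decoding searches over variable-length outputs using Pr(k|t, u), but treating each step's distribution as fixed and independent is enough to show how a beam of the n best partial transcriptions is kept and returned. Everything here is illustrative.

```python
def beam_search_nbest(step_log_probs, beam_width=3):
    """Toy beam search: at each step, extend every beam with every label,
    keep the beam_width highest-scoring sequences, and return the final
    n-best list of (label sequence, total log-probability) pairs."""
    beams = [((), 0.0)]  # start with the empty transcription
    for log_probs in step_log_probs:
        candidates = [
            (seq + (k,), score + lp)
            for seq, score in beams
            for k, lp in enumerate(log_probs)
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

# two steps over two labels: the best path picks label 0 both times
import math
steps = [[math.log(0.7), math.log(0.3)], [math.log(0.6), math.log(0.4)]]
nbest = beam_search_nbest(steps, beam_width=3)
print(nbest[0][0])  # -> (0, 0)
```

The n-best list is what allows rescoring or inspection of near-miss transcriptions rather than committing to a single greedy path.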
Regularisation:
- Early stopping
- Weight noise: added once per training sequence, rather than at every time-step
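A minimal sketch of per-sequence weight noise, assuming Gaussian noise added to a noisy copy of the weights while the clean weights receive the updates; the sigma value, function names, and training-loop helpers below are illustrative.

```python
import random

def add_weight_noise(weights, sigma=0.075, rng=random):
    """Return a noisy copy of the weights: Gaussian noise with standard
    deviation sigma is sampled ONCE per training sequence and added to every
    weight, rather than being re-sampled at every time-step."""
    return [w + rng.gauss(0.0, sigma) for w in weights]

# Per-sequence training loop sketch (train_step and update are hypothetical):
# for sequence in data:
#     noisy_weights = add_weight_noise(clean_weights)   # one draw per sequence
#     grads = train_step(noisy_weights, sequence)       # forward/backward pass
#     clean_weights = update(clean_weights, grads)      # clean weights updated
```

Sampling once per sequence (instead of per time-step) keeps the perturbation consistent across the whole utterance while still discouraging brittle weight configurations.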