Understanding Whisper

Last updated October 19, 2022
Robust Speech Recognition via Large-Scale Weak Supervision
  • There is still only a moderate amount of supervised speech data easily available. SpeechStew (Chan et al., 2021) mixes together 7 pre-existing datasets totaling 5,140 hours of supervision. While not insignificant, this is still tiny compared to the 1,000,000 hours of unlabeled speech data used in Zhang et al. (2021).
Data processing
  • The dataset is built from audio that is paired with transcripts on the Internet.
  • Audio is re-sampled to 16,000 Hz, and an 80-channel log-magnitude Mel spectrogram representation is computed on 25-millisecond windows with a stride of 10 milliseconds. For feature normalization, we globally scale the input to be between -1 and 1 with approximately zero mean across the pre-training dataset.
  • encoder:
    • The encoder processes this input representation with a small stem consisting of two convolution layers with a filter width of 3 and the GELU activation function.
    • Sinusoidal position embeddings are then added to the output of the stem, after which the encoder Transformer blocks are applied.
  • decoder:
    • We predict the language being spoken, which is represented by a unique token for each language in our training set (99 total). These language targets are sourced from the VoxLingua107 model. Where an audio segment contains no speech, the model is trained to predict a <|nospeech|> token to indicate this.
    • Whisper uses a simple format that specifies all tasks and conditioning information as a sequence of input tokens to the decoder.
    • For timestamp prediction, we predict time relative to the current audio segment, quantizing all times to the nearest 20 milliseconds, which matches the native time resolution of Whisper models, and add additional tokens to our vocabulary for each of these.
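The numbers in the list above fit together arithmetically, and a small sketch may make that easier to see. It converts the 25 ms window and 10 ms stride into sample counts at 16,000 Hz, then rounds a time within a segment to the 20 ms grid. The helper names (ms_to_samples, quantize_timestamp) are hypothetical and do not come from the Whisper code base. The stride-2 constant is an outside detail: the paper describes the second stem convolution as having a stride of two, which these notes do not quote. The actual timestamp token names are not modeled here.

```python
from typing import Tuple

SAMPLE_RATE_HZ = 16_000   # resampling rate from the notes
WINDOW_MS = 25            # spectrogram window length from the notes
STRIDE_MS = 10            # spectrogram stride (hop) from the notes
CONV_STRIDE = 2           # assumption taken from the paper: second stem convolution has stride 2
TIMESTAMP_STEP_MS = STRIDE_MS * CONV_STRIDE  # 10 ms * 2 = 20 ms


def ms_to_samples(ms: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Convert a duration in milliseconds to a whole number of samples."""
    return sample_rate_hz * ms // 1000


def quantize_timestamp(seconds: float, step_ms: int = TIMESTAMP_STEP_MS) -> Tuple[int, float]:
    """Round a non-negative time (relative to the segment start) to the nearest step.

    Returns (step index, quantized time in seconds). The index is the
    quantity a per-step timestamp token would encode.
    """
    index = int(seconds * 1000 / step_ms + 0.5)  # round half up; valid for seconds >= 0
    return index, index * step_ms / 1000


if __name__ == "__main__":
    print(ms_to_samples(WINDOW_MS))    # window length in samples
    print(ms_to_samples(STRIDE_MS))    # hop length in samples
    print(quantize_timestamp(1.234))   # (step index, quantized seconds)
```

Under these assumptions the window is 400 samples, the hop is 160 samples, and a time of 1.234 s maps to step 62, that is 1.24 s. Halving the 10 ms frame rate in the stem is one way to read why 20 ms is described as the model's native time resolution.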