0x453 Neural Model

1. Vocoder

A vocoder converts acoustic features (mel spectrogram or discrete ids) into a time-domain audio waveform.

This YouTube video is a very good summary of neural vocoders

1.1. Autoregressive Vocoder

Model (WaveNet)

Uses a stack of dilated causal convolution layers to increase the receptive field
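To see why dilation helps, the receptive field of such a stack can be computed directly. This is a minimal sketch, assuming kernel size 2 and dilations that double at each layer (1, 2, 4, ..., 512); the function name is only illustrative.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in samples) of a stack of dilated causal convolutions."""
    # Each layer adds (kernel_size - 1) * dilation samples of context.
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Assumed configuration: 10 layers with dilation doubling at every layer.
doubling = [2 ** i for i in range(10)]
print(receptive_field(2, doubling))   # 1024 samples
print(receptive_field(2, [1] * 10))   # 11 samples without dilation
```

With the same number of layers, the dilated stack covers 1024 samples instead of 11.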

\(\mu\)-law companding is applied to quantize the 16-bit samples to an 8-bit representation, so that a 256-way softmax can be used

\[f(x) = \text{sign}(x)\frac{\ln (1+\mu|x|)}{\ln (1+\mu)}\]
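A minimal sketch of the quantization step, assuming the standard \(\mu\)-law companding with \(\ln(1+\mu)\) in the denominator, \(\mu = 255\), and input samples already scaled to \([-1, 1]\). The function names are illustrative, not taken from any library.

```python
import math

MU = 255  # assumed: 8-bit target, i.e. 256 quantization levels


def mu_law_encode(x, mu=MU):
    """Compand x in [-1, 1] and quantize it to an integer class in [0, mu]."""
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int((y + 1) / 2 * mu + 0.5)


def mu_law_decode(q, mu=MU):
    """Map an integer class in [0, mu] back to an approximate sample in [-1, 1]."""
    y = 2 * q / mu - 1
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


print(mu_law_encode(-1.0), mu_law_encode(0.0), mu_law_encode(1.0))  # 0 128 255
print(mu_law_decode(mu_law_encode(0.5)))  # close to 0.5
```

The companding spends more of the 256 classes on small amplitudes, where most speech samples lie.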

It uses a gated activation unit as the nonlinearity

\[z = \tanh(W_f * x) \odot \sigma(W_g * x)\]

When it is conditional on some inputs \(h\), it becomes

\[z = \tanh(W_f * x + V_f^T h) \odot \sigma(W_g * x + V_g^T h)\]

The condition on \(h\) can be either global or local.

  • When it is a global condition, \(h\) is broadcast over all time steps
  • When it is a local condition, \(h\) is upsampled to match the resolution
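The gated unit and the two kinds of conditioning can be sketched for a single channel. This is an illustration only: it assumes a causal kernel of size 2, scalar outputs, and upsampling by simple repetition, and all weights and inputs are made up.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gated_unit(window, w_f, w_g, h=None, v_f=None, v_g=None):
    """One output sample: tanh(W_f*x + V_f^T h) * sigmoid(W_g*x + V_g^T h)."""
    f, g = dot(w_f, window), dot(w_g, window)
    if h is not None:  # add the conditioning term to both branches
        f += dot(v_f, h)
        g += dot(v_g, h)
    return math.tanh(f) * sigmoid(g)


x = [0.1, -0.3, 0.5, 0.2]
w_f, w_g = [0.5, -1.0], [1.0, 0.3]
v_f, v_g = [0.2, 0.1], [-0.4, 0.3]
windows = [[x[t - 1], x[t]] for t in range(1, len(x))]  # causal windows

# Global condition: the same h is broadcast to every time step.
h_global = [1.0, 0.0]
z_global = [gated_unit(w, w_f, w_g, h_global, v_f, v_g) for w in windows]

# Local condition: a lower-rate h is upsampled (here by repetition).
h_low = [[1.0, 0.0], [0.0, 1.0]]
h_up = [h_low[t // 2] for t in range(len(x))]
z_local = [gated_unit(w, w_f, w_g, h_up[t + 1], v_f, v_g)
           for t, w in enumerate(windows)]
print(z_global, z_local)
```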

Model (FFTNet)

Achieves the dilation effect by splitting the input into two parts

Model (WaveRNN)

1.2. Flow Vocoder

Model (WaveGlow)

WaveGlow is a flow-based model, not an autoregressive model.


2. Language Model

2.1. Codec Language Model

Model (AudioGen) uses SoundStream tokens and is conditioned on a textual description embedded with T5


Model (AudioLM) uses SoundStream tokens as discrete units


Model (VALL-E) uses EnCodec as the discrete tokenizer


2.2. Alignment

We consider mapping input sequence \(X=[x_1, ..., x_U]\) to output sequence \(Y=[y_1, ..., y_T]\) where \(X,Y\) can vary in length and no alignment is provided. We are interested in the following two problems:

  • loss function: compute the negative log conditional probability \(-\log p(Y|X)\) and its gradient efficiently
  • inference: find the most likely sequence \(\hat{Y} = \text{argmax}_Y p(Y|X)\)

The following criteria can be applied to solve this problem

2.2.1. ASG

ASG (Auto-Segmentation Criterion) aims at minimizing:

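\[ASG(\theta, T) = - LSE_{\pi \in \text{asg}} \sum_{t=1}^{T} (f_{\pi_t}(x) + g_{\pi_{t-1}, \pi_t}(x)) + LSE_{\pi \in \text{Z}} \sum_{t=1}^{T} (f_{\pi_t}(x) + g_{\pi_{t-1}, \pi_t}(x))\]

where \(f,g\) are the emission and transition scores and LSE is log-sum-exp. The first term runs over the alignments of the target transcription, the second over all possible paths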

2.2.2. CTC

CTC can handle some potential problems of ASG

  • repeated tokens create ambiguity (aabbba -> aba or abba?), so one emission sequence can map to multiple outputs
  • not every input frame has a label (e.g. silence)

Training

CTC is a discriminative model. It makes a conditional independence assumption over the frames of a valid alignment \(A=\{ a_1, ..., a_T\}\).

\[P(Y,A|X) = \prod_{t=1}^T p_t(a_t | X)\]

then the objective of CTC is to marginalize all valid alignments:

\[p(Y|X) = \sum_{A \in \mathcal{A}_{X,Y}} p(Y,A|X)\]

where the sum can be computed efficiently with the forward algorithm (dynamic programming). Note that there are two different transition cases, depending on whether the aligned character is blank or not.
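The forward computation can be sketched in plain Python. This is a minimal, unoptimized version working in the probability domain (a practical implementation would use log probabilities); it assumes `probs[t][k]` holds \(p_t(k|X)\) and that class 0 is the blank.

```python
def ctc_forward(probs, labels, blank=0):
    """Return p(Y|X), the sum over all valid CTC alignments of `labels`."""
    ext = [blank]
    for y in labels:
        ext += [y, blank]  # label sequence with blanks inserted around labels
    T, S = len(probs), len(ext)
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = probs[0][ext[0]]
    if S > 1:
        alpha[0][1] = probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            total = alpha[t - 1][s]
            if s >= 1:
                total += alpha[t - 1][s - 1]
            # Case 2: a non-blank that differs from the previous label may
            # also be reached by skipping the blank in between.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[t - 1][s - 2]
            alpha[t][s] = total * probs[t][ext[s]]
    # A valid alignment ends in the last label or in the final blank.
    return alpha[T - 1][S - 1] + (alpha[T - 1][S - 2] if S > 1 else 0.0)


uniform = [[1 / 3] * 3 for _ in range(3)]  # 3 frames, classes {blank, 1, 2}
print(ctc_forward(uniform, [1]))     # 6 of the 27 paths collapse to [1]
print(ctc_forward(uniform, [1, 1]))  # only the path 1, blank, 1 is valid
```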

We want to minimize the negative log-likelihood over the dataset.

\[\sum_{(X,Y) \in \mathcal{D}} -\log p(Y|X)\]

For the gradient derivation, see page 61 of Alex Graves's thesis rather than the original CTC paper. The definition of \(\beta\) in the thesis is consistent with the traditional definition.

Suppose a training example is \((x,z)\) and the network output probabilities are \(y^t_k\). The gradient of the objective with respect to the outputs is

\[- \frac{\partial \log p(z|x)}{\partial y^t_k} = -\frac{1}{p(z|x)}\frac{\partial p(z|x)}{\partial y_k^t}\]

Note that \(\alpha, \beta\) have the property

\[\alpha_t(s)\beta_t(s) = \sum_{\pi \in \mathcal{B}^{-1}(z), \pi_t = z'(s)} \prod_{t'=1}^T y^{t'}_{\pi_{t'}}\]

from which we obtain,

\[\frac{\partial \alpha_t(s)\beta_t(s)}{\partial y_k^t} = \begin{cases} \frac{\alpha_t(s)\beta_t(s)}{y_k^t} & \text{if } z'(s) = k \\ 0 & \text{otherwise} \end{cases}\]

For any \(t\), we have

\[p(z|x) = \sum_{s=1}^{|z'|} \alpha_t(s)\beta_t(s)\]

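Differentiating with respect to \(y_k^t\), only the positions \(lab(z, k) = \{s : z'(s) = k\}\) contribute, so

\[\frac{\partial p(z|x)}{\partial y_k^t} = \frac{1}{y_k^t} \sum_{s \in lab(z, k)} \alpha_t(s)\beta_t(s)\]

Substituting this into the gradient of the objective gives

\[- \frac{\partial \log p(z|x)}{\partial y^t_k} = -\frac{1}{p(z|x)\, y_k^t} \sum_{s \in lab(z, k)} \alpha_t(s)\beta_t(s)\]

Inference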

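One inference heuristic is to take the most likely output at each time step and then collapse the resulting alignment (merge repeats, then remove blanks).

\[A^* = \text{argmax}_A \prod_t p_t(a_t|X), \quad Y^* = \mathcal{B}(A^*)\]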
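This per-frame heuristic (often called greedy or best-path decoding) can be sketched as follows, assuming `probs[t][k]` holds \(p_t(k|X)\) and class 0 is the blank. The example probabilities are made up.

```python
def ctc_greedy_decode(probs, blank=0):
    """Pick the best class per frame, merge repeats, then drop blanks."""
    best_path = [max(range(len(p)), key=p.__getitem__) for p in probs]
    decoded, prev = [], None
    for a in best_path:
        if a != prev and a != blank:
            decoded.append(a)
        prev = a
    return decoded


probs = [
    [0.1, 0.8, 0.1],    # 1
    [0.2, 0.7, 0.1],    # 1 (repeat, merged)
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.8, 0.1],    # 1 (new label, separated by the blank)
    [0.1, 0.1, 0.8],    # 2
    [0.1, 0.2, 0.7],    # 2 (repeat, merged)
]
print(ctc_greedy_decode(probs))  # [1, 1, 2]
```

The result is the most likely single alignment, which is not guaranteed to be the most likely output sequence.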

Beam search is also possible, by carefully merging hypotheses that collapse to the same output.

We can also consider language model

\[Y^* = \text{argmax}_Y P(Y|X) p(Y)^\alpha L(Y)^\beta\]

where the second term is the language model and the last term is a word insertion bonus based on the length \(L(Y)\). \(\alpha, \beta\) are hyperparameters to be tuned
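In practice this product is usually evaluated in the log domain. Below is a minimal rescoring sketch; the hypotheses, their scores and the values of \(\alpha, \beta\) are made up for illustration, and \(L(Y)\) is assumed to be the number of words.

```python
import math


def fused_score(log_p_am, log_p_lm, num_words, alpha=0.5, beta=1.0):
    """log of P(Y|X) * p(Y)^alpha * L(Y)^beta."""
    return log_p_am + alpha * log_p_lm + beta * math.log(num_words)


# Hypothetical n-best list: (text, log P(Y|X), log p(Y)).
hypotheses = [
    ("a b", -4.0, -6.0),
    ("a b c", -4.2, -5.0),
]
best = max(hypotheses,
           key=lambda h: fused_score(h[1], h[2], len(h[0].split())))
print(best[0])  # "a b c"
```

Here the second hypothesis wins even though its acoustic score is slightly lower, because the language model and the insertion bonus favor it.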

Model (CTC-CRF)

Instead of assuming conditional independence, some works use a CRF with the CTC topology, where the CRF computes the posterior. Let \(\pi = (Y,A)\) be a valid alignment

\[p(Y,A | X) = p_\theta(\pi | X) = \frac{\exp{\phi_\theta (\pi, x)}}{\sum_{\pi'} \exp{\phi_\theta (\pi', x)}}\]

2.2.3. Transducer

Model (RNA, recurrent neural aligner)

  • removes the conditional independence assumption of CTC by using the label at \(t-1\) to predict the label at \(t\)

Check this lecture video

Model (RNN-T) also extends CTC by removing the conditional independence assumption, in this case by adding a prediction network

\[p(Y,A|X) = \prod_{t=1}^T p_t(a_t|X, y_1, ..., y_{u-1})\]

where the current label depends on the non-blank label history \(y_1, ..., y_{u-1}\). Unlike CTC or RNA, each frame can emit multiple labels.

The prediction network is commonly believed to act as a classic LM, but this might not be the case: removing the recurrence (so that it depends only on the last label \(y_{u-1}\)) yields similar results, which suggests its role might only be to decide between an actual label and blank.

torchaudio now has an implementation. The logits passed to the loss function have shape (batch, T_in, T_out, class), where (batch, T_in, class) comes from the audio encoder and (batch, T_out, class) comes from the label encoder (prediction network).
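How the two outputs combine into a 4-dimensional tensor can be sketched with nested lists. This is not the torchaudio API: it assumes the simplest possible joint, an element-wise sum broadcast over the two time axes, whereas practical joint networks usually add a nonlinearity and a projection.

```python
def joint_logits(enc, pred):
    """enc: (batch, T_in, C), pred: (batch, T_out, C) -> (batch, T_in, T_out, C)."""
    return [
        [[[e_k + p_k for e_k, p_k in zip(e_t, p_u)] for p_u in pred_b]
         for e_t in enc_b]
        for enc_b, pred_b in zip(enc, pred)
    ]


def shape(x):
    """Shape of a regular nested list."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return tuple(dims)


enc = [[[0.0] * 5 for _ in range(4)] for _ in range(2)]   # (2, 4, 5)
pred = [[[1.0] * 5 for _ in range(3)] for _ in range(2)]  # (2, 3, 5)
logits = joint_logits(enc, pred)
print(shape(logits))  # (2, 4, 3, 5)
```

Every pair of an input frame and a label position gets its own distribution over classes, which is what the transducer loss sums over.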

Model (Transformer Transducer) proposes replacing the RNN with a Transformer with proper masking

2.2.4. GTC

This paper

2.2.5. Attention

Unlike CTC, an attention model does not preserve the order of the inputs, but the ASR task requires a monotonic alignment, which makes the model hard to learn from scratch.

LAS

Hybrid-CTC

3. Task

3.1. Voice Conversion

Conversion is not limited to speaker identity: speaking style can also be converted (e.g. emotion, whisper/normal). This topic is similar to image style transfer in CV

3.1.1. Feature Disentangle

3.1.2. Direct Transformation



Model (Parrotron)

3.2. Speech Separation

3.3. Speech Enhancement

3.3.1. Time Domain Models

Model (Demucs)

  • encoder-decoder with U-net
  • loss is defined over clean signal and enhanced signal


3.3.2. Time-Frequency Domain Models

Model (FullSubNet) uses the complex Ideal Ratio Mask (cIRM) as its learning target

3.4. Speech Synthesis

3.4.1. Autoregressive TTS

Model (Tacotron) attention-based s2s model

  • input: character
  • output: linear spectrogram
  • vocoder: Griffin Lim



encoder: roughly corresponding to grapheme-to-phoneme model

  • prenet: FFN, dropout
  • CBHG: conv1d + max-pool along time + highway network + GRU

attention: roughly corresponding to modeling duration

decoder: audio synthesis,

  • RNN: each step can generate multiple frames (3, 5 frames in v1, only 1 frame in v2)
  • prenet: training uses teacher forcing, but the dropout in the prenet acts like scheduled sampling
  • postprocessing: a non-causal CBHG normalizes the autoregressive outputs. The loss is applied to the outputs both before and after the postprocessing step

Model (Tacotron 2)

  • input: character
  • output: melspectrogram
  • vocoder: a modified WaveNet


Model (Non-Attentive Tacotron)

It replaces the attention mechanism with a duration predictor

Gaussian Upsampling

  • Given \([h_1, ..., h_n]\), duration values \([d_1, ..., d_n]\) and range parameters \([\sigma_1, ..., \sigma_n]\), the upsampled vectors \([u_1, ..., u_t]\) are computed by placing a Gaussian distribution on each segment
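A minimal sketch of Gaussian upsampling, under the assumption that each Gaussian is centered at the middle of its segment, \(c_i = \sum_{j<i} d_j + d_i/2\), that output frames are indexed \(t = 1, ..., T\) with \(T = \sum_i d_i\), and that each frame is the normalized density-weighted sum of the \(h_i\). The function name and example values are illustrative.

```python
import math


def gaussian_upsample(h, d, sigma):
    """h: n input vectors, d: durations in frames, sigma: range parameters."""
    centers, end = [], 0.0
    for d_i in d:
        centers.append(end + d_i / 2)  # center of segment i
        end += d_i
    num_frames = int(round(end))
    dim = len(h[0])
    out = []
    for t in range(1, num_frames + 1):
        # Gaussian density up to the constant 1/sqrt(2*pi), which cancels below.
        w = [math.exp(-0.5 * ((t - c) / s) ** 2) / s
             for c, s in zip(centers, sigma)]
        total = sum(w)
        out.append([sum(w_i / total * h_i[k] for w_i, h_i in zip(w, h))
                    for k in range(dim)])
    return out


u = gaussian_upsample(h=[[0.0], [1.0]], d=[2, 3], sigma=[0.5, 0.5])
print(len(u))                 # 5 frames
print([round(v[0], 3) for v in u])
```

Unlike hard repetition, the transition between neighbouring segments is smooth, and a larger \(\sigma_i\) spreads the influence of \(h_i\) over more frames.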

Unsupervised duration modeling

  • extract alignment between input and output using fine-grained VAE similar to this work

Model (VALL-E)


3.4.2. Non-Autoregressive TTS

Model (Parallel Wavenet) A trained WaveNet model is used as a teacher for a feedforward IAF (inverse autoregressive flow) student

The probability density distillation loss is the KL divergence

\[D_{KL}(P_S || P_T) = H(P_S, P_T) - H(P_S)\]

Note that the entropy term \(H(P_S)\) is necessary. Otherwise it will collapse to the teacher's mode (mostly silence, see Appendix A.1 in the paper)
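The role of the entropy term can be checked numerically on a small discrete example. This is only an illustration with a made-up three-class teacher distribution, not the continuous densities used in the paper.

```python
import math


def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher = [0.7, 0.2, 0.1]
student_mode = [1.0, 0.0, 0.0]  # student collapsed onto the teacher's mode

# The identity D_KL(P_S || P_T) = H(P_S, P_T) - H(P_S) holds for any student.
print(kl(student_mode, teacher),
      cross_entropy(student_mode, teacher) - entropy(student_mode))

# Cross-entropy alone prefers the collapsed student over a perfect copy ...
print(cross_entropy(student_mode, teacher) < cross_entropy(teacher, teacher))
# ... while the full KL is minimized by matching the teacher.
print(kl(teacher, teacher) < kl(student_mode, teacher))
```

Both comparisons print `True`: without the entropy term the collapsed student looks better than the one that matches the teacher exactly.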


Model (FastSpeech) upsample the input sequence by a duration prediction model


Duration Modeling

  • Duration Predictor:

    • 2-layer 1d-conv
    • the ground truth is extracted from the attention alignment of an encoder-decoder Transformer TTS model. The head is chosen as the one with the most diagonal-like attention.
  • Length Regulator:

    • expand the hidden state of phoneme sequences by the predicted duration.
    • For example, if the hidden states of 4 phonemes are \([h_1, h_2, h_3, h_4]\) and their durations are \([2,2,3,1]\), the sequence expands to \([h_1, h_1, h_2, h_2, h_3, h_3, h_3, h_4]\). A hyperparameter \(\alpha\) can be used to control the voice speed by scaling the durations.
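The length regulator is simple enough to sketch directly. Hidden states are represented by strings here, and rounding the scaled durations to the nearest integer is an assumption of this sketch.

```python
def length_regulate(hidden, durations, alpha=1.0):
    """Repeat each hidden state by its (scaled and rounded) duration."""
    out = []
    for h, d in zip(hidden, durations):
        out.extend([h] * int(round(d * alpha)))
    return out


hidden = ["h1", "h2", "h3", "h4"]
print(length_regulate(hidden, [2, 2, 3, 1]))
# ['h1', 'h1', 'h2', 'h2', 'h3', 'h3', 'h3', 'h4']
print(len(length_regulate(hidden, [2, 2, 3, 1], alpha=2.0)))  # 16 frames
```

Scaling the durations by \(\alpha > 1\) yields more frames and therefore slower speech, while \(\alpha < 1\) speeds it up.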

Model (FastSpeech 2)


Duration Predictor

  • uses the Montreal Forced Aligner, which is more accurate than the teacher alignment used in the original model

3.4.3. Multispeaker TTS

Model (Deep Voice 2) using speaker embedding, a single neural TTS system can learn hundreds of unique voices from less than half an hour of data per speaker

3.4.4. Multilingual TTS

Model (remapping input symbol) learn a mapping between source and target linguistic symbols.

Model (byte2speech) maps bytes to spectrograms, and it can adapt to new languages with merely 40s of transcribed recordings

3.5. Speech Recognition

3.5.1. DNN

An alternative way to build a generative model with an HMM is the DNN-HMM hybrid

A DNN is a discriminative model, so it cannot be directly connected to the HMM. In the DNN-HMM hybrid model, the DNN first estimates the posterior \(P(p|X)\), where \(p\) is usually a CD (context-dependent) state, and this output is then converted to a scaled likelihood by dividing by the prior

\[P(X|p) \propto \frac{P(p|X)}{P(p)}\]

Then the \(P(X|p)\) can be plugged into the generative HMM framework.
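The conversion is usually done in the log domain. The sketch below uses made-up posteriors and priors for three states; in practice the priors would typically be estimated from state frequencies in the training alignments, which is an assumption not stated above.

```python
import math


def scaled_log_likelihoods(posteriors, priors):
    """log P(X|p) up to a constant: log P(p|X) - log P(p) for each state p."""
    return [math.log(post) - math.log(prior)
            for post, prior in zip(posteriors, priors)]


posteriors = [0.7, 0.2, 0.1]  # DNN output for one frame
priors = [0.8, 0.1, 0.1]      # state priors
scores = scaled_log_likelihoods(posteriors, priors)
print(scores)
print(max(range(3), key=scores.__getitem__))  # state 1, not state 0
```

State 0 has the highest posterior, but after dividing by its large prior, state 1 has the highest scaled likelihood. This is the quantity the HMM decoder needs.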

3.5.2. RNN

Model (location aware attention) applies a 1d convolution to the previous attention when computing the new attention for the current time step.

It can be implemented with additive attention, for example,

\[a_j(q, k) = w_2^T \tanh(W[q;k] + F * a_{j-1})\]


Convolution helps to move attention forward to enforce monotonic attention.

If the previous attention is \([0, 1, 2, 1, 0]\) and learned conv kernel is \([1, 0, 0]\) with pad 1 stride 1,

then output will shift by one step \([0, 0, 1, 2, 1]\)
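The shift can be verified with a few lines of code. The helper below implements the cross-correlation form of convolution used by most deep learning frameworks (no kernel flipping), with stride 1 and zero padding; it is a plain-Python illustration, not a library call.

```python
def conv1d(x, kernel, pad):
    """1d cross-correlation with zero padding and stride 1."""
    padded = [0] * pad + list(x) + [0] * pad
    k = len(kernel)
    return [sum(kernel[j] * padded[i + j] for j in range(k))
            for i in range(len(padded) - k + 1)]


prev_attention = [0, 1, 2, 1, 0]
print(conv1d(prev_attention, [1, 0, 0], pad=1))  # [0, 0, 1, 2, 1]
print(conv1d(prev_attention, [0, 1, 0], pad=1))  # [0, 1, 2, 1, 0] (identity)
```

A kernel with its weight on the left tap moves the attention mass one step forward, while a centered kernel leaves it in place.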

3.5.3. Transformer Positional Encoding

This paper compares 4 different positional encodings for transformer-based AMs; convolution works best here

  • None
  • Sinusoid
  • Frame stacking
  • Convolution: VGG

Conformer

Conformer combines self-attention and convolution

  • self-attention captures content-based global information
  • convolution captures local features effectively


Reference: from conformer paper

This paper has a good comparison between conformer and transformer

3.5.4. E2E

Instead of the traditional pipelines, we can train a deep network that directly maps speech signal to target word/word sequence

  • simplifies the complicated model building process
  • makes it easy to build ASR systems for new tasks without expert knowledge
  • has the potential to outperform the conventional pipelines by optimizing a single objective

3.6. Speech Translation

There are roughly three types of speech translation systems

  • cascaded system: ASR + MT + TTS (F speech -> F text -> E text -> E speech)
  • speech-text system: F speech -> E text -> E speech
  • speech-speech system: F speech -> E speech

3.6.1. Speech-to-Text Model

3.6.2. Speech-to-Speech Model

Speech-to-speech models do not rely on text generation as an intermediate step, which makes them a natural approach for languages without a writing system.

Speech to Spectrogram Model

Model (Translatotron 1)


Model (Translatotron 2)

Translatotron 2 improves on Translatotron 1 with respect to the following weaknesses:

  • phoneme alignment is not used by the main task
  • long sequence to long spectrogram sequence with attention is difficult to train

The new version has the following component:

  • speech encoder: mel-spectrogram to hidden representation using Conformer
  • linguistic decoder: predicts the phoneme sequence of the translated speech from the encoder output
  • acoustic synthesizer: takes the decoder output (before final) and the context from attention, and generates the spectrogram with a non-autoregressive model. The attention is shared with the linguistic decoder

Speech to Discrete-Unit Model

Model (vq-vae)

  • train a vq-vae model of the target language
  • learn a s2s to map source lang spectrogram to target lang token
  • synthesize target lang token into spectrogram and apply Griffin-Lim

Model (xl-vae)

  • enhance vq-vae model by adding cross-lingual speech recognition task
  • the quantizer aims at reconstructing the target language as well as some other languages' asr task

Model (speech-to-unit translation with ssl)

  • Apply self-supervised encoder to the target speech
  • Train a speech-to-unit translation model

3.7. Spoken Language Understanding

3.7.1. Cascade NLU

3.7.2. End-to-End SLU

Dataset (Fluent Speech Command)

Each audio clip has three slots: action, object and location. The following model is trained on the dataset

  • the lower layers are pretrained using force-aligned phonemes/words
  • lower-layer target is discarded and the top layer is trained as a classification task by pooling sequence outputs.


Other relevant papers

4. Reference

[0] original papers. All images are taken from the original papers or blogs

[1] CMU 11751: Speech Recognition and Understanding

[2] Lecture Note on Hybrid HMM/DNN

[3] Hung-yi Lee's lecture: Deep Learning for Human Language Processing