0x571 Model
1. Vocoder
A vocoder converts acoustic features (a mel spectrogram or discrete ids) into a time-domain audio waveform.
This YouTube video is a very good summary of neural vocoders.
1.1. AR Vocoder
Model (WaveNet)
Uses a stack of dilated causal convolution layers to increase the receptive field.
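A minimal PyTorch sketch (not the WaveNet implementation; the channel width and kernel size are illustrative) of how stacking dilated causal convolutions grows the receptive field exponentially with depth:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalConv1d(nn.Module):
    """Causal conv: left-pad by (kernel_size - 1) * dilation so output[t] only sees inputs <= t."""
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                    # x: (batch, channels, time)
        x = F.pad(x, (self.pad, 0))          # pad only on the left (causal)
        return self.conv(x)

# Stack with dilations 1, 2, 4, ..., 512: receptive field is
# 1 + sum((k - 1) * d) = 1 + (1 + 2 + ... + 512) = 1024 samples.
dilations = [2 ** i for i in range(10)]
stack = nn.Sequential(*[DilatedCausalConv1d(32, 2, d) for d in dilations])
print(1 + sum((2 - 1) * d for d in dilations))  # 1024

x = torch.randn(1, 32, 2000)
print(stack(x).shape)  # torch.Size([1, 32, 2000]) -- length preserved by the causal padding
```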
\(\mu\)-law companding is applied to quantize the 16-bit samples to an 8-bit (256-class) representation, which makes a categorical softmax output feasible.
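A quick NumPy sketch of \(\mu\)-law companding with \(\mu = 255\), mapping waveform samples in \([-1, 1]\) to 256 integer classes (the function names are mine):

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """x in [-1, 1] -> integer class ids in [0, 255] via mu-law companding."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)  # compress to [-1, 1]
    return ((y + 1) / 2 * mu + 0.5).astype(np.int64)          # quantize to 256 bins

def mu_law_decode(ids, mu=255):
    """Inverse: class ids -> approximate waveform in [-1, 1]."""
    y = 2 * ids.astype(np.float64) / mu - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

wave = np.linspace(-1, 1, 5)            # stand-in for 16-bit audio scaled to [-1, 1]
ids = mu_law_encode(wave)
print(ids)                              # integer ids in [0, 255]
print(np.round(mu_law_decode(ids), 3))  # close to the original values
```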
It uses a gated activation unit as the nonlinearity: \(z = \tanh(W_f * x) \odot \sigma(W_g * x)\).
When it is conditioned on some input \(h\), this becomes \(z = \tanh(W_f * x + V_f^{\top} h) \odot \sigma(W_g * x + V_g^{\top} h)\).
The condition on \(h\) can be either global or local.
- When it is a global condition (e.g. a speaker embedding), \(h\) is broadcast across all timesteps
- When it is a local condition, \(h\) is upsampled to match the time resolution (see the sketch below)
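A sketch of the gated activation unit with a global condition, roughly \(z = \tanh(W_f * x + V_f h) \odot \sigma(W_g * x + V_g h)\); the dilated causal convolutions are replaced by 1x1 convolutions here to keep the example short, and the sizes are illustrative:

```python
import torch
import torch.nn as nn

class GatedActivation(nn.Module):
    """Gated unit with a global condition h; a sketch, not the WaveNet implementation."""
    def __init__(self, channels, cond_dim):
        super().__init__()
        self.conv_f = nn.Conv1d(channels, channels, kernel_size=1)  # stand-in for the dilated causal conv
        self.conv_g = nn.Conv1d(channels, channels, kernel_size=1)
        self.cond_f = nn.Linear(cond_dim, channels)                 # V_f: projects the condition
        self.cond_g = nn.Linear(cond_dim, channels)                 # V_g

    def forward(self, x, h):
        # x: (batch, channels, time); h: (batch, cond_dim), e.g. a speaker embedding (global condition)
        f = self.conv_f(x) + self.cond_f(h).unsqueeze(-1)  # broadcast the condition over time
        g = self.conv_g(x) + self.cond_g(h).unsqueeze(-1)
        return torch.tanh(f) * torch.sigmoid(g)

layer = GatedActivation(channels=32, cond_dim=16)
out = layer(torch.randn(4, 32, 100), torch.randn(4, 16))
print(out.shape)  # torch.Size([4, 32, 100])
```

For a local condition, \(h\) would instead be upsampled to the audio frame rate (e.g. with a transposed convolution) and added per timestep rather than broadcast.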
Model (FFTNet)
Achieves the dilation effect by splitting the input into two parts.
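A rough sketch of my reading of one FFTNet-style layer (the layer structure and shift sizes are my assumptions): split the input along time into two parts offset by a shift, apply separate 1x1 convolutions, sum, and apply a ReLU; the time span halves at each layer, mimicking a dilated stack:

```python
import torch
import torch.nn as nn

class FFTNetLayer(nn.Module):
    """Split the input into two time-shifted parts, 1x1-convolve each, sum, ReLU."""
    def __init__(self, channels, shift):
        super().__init__()
        self.shift = shift
        self.w_left = nn.Conv1d(channels, channels, kernel_size=1)
        self.w_right = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):                                  # x: (batch, channels, time)
        left, right = x[:, :, :-self.shift], x[:, :, self.shift:]
        return torch.relu(self.w_left(left) + self.w_right(right))

x = torch.randn(1, 16, 2048)
for shift in [1024, 512, 256]:                             # each layer halves the time span
    x = FFTNetLayer(16, shift)(x)
print(x.shape)                                             # torch.Size([1, 16, 256])
```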
Model (WaveRNN)
1.2. NAR Vocoder
Model (WaveGlow)
WaveGlow is a flow-based model, not an autoregressive model.
2. Acoustic Model
2.1. DNN
An alternative to the generative GMM-HMM model is the hybrid DNN-HMM.
The DNN is a discriminative model, so it cannot be connected to the HMM directly. In the DNN-HMM hybrid model, the DNN first estimates the posterior \(P(p|X)\), where \(p\) is usually a CD state; this output is then converted to a (scaled) likelihood using the state prior: \(P(X|p) \propto P(p|X) / P(p)\).
Then the \(P(X|p)\) can be plugged into the generative HMM framework.
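The conversion is just Bayes' rule with \(p(X)\) dropped as a constant; a tiny sketch in the log domain with made-up numbers:

```python
import numpy as np

# The DNN gives posteriors P(state | frame); dividing by the state prior P(state)
# gives a scaled likelihood p(frame | state) usable in HMM/Viterbi decoding.
log_posteriors = np.log(np.array([[0.7, 0.2, 0.1],     # (frames, CD states)
                                  [0.1, 0.6, 0.3]]))
log_priors = np.log(np.array([0.5, 0.3, 0.2]))         # estimated from state alignments

scaled_log_likelihoods = log_posteriors - log_priors   # log p(x|s) up to the constant p(x)
print(scaled_log_likelihoods.round(2))
```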
2.2. RNN
Model (location-aware attention) applies a 1D convolution to the previous attention weights when computing the attention for the current timestep.
It can be implemented with additive attention, for example \(e_{t,i} = w^\top \tanh(W s_{t-1} + V h_i + U f_{t,i} + b)\), where the location features \(f_t = F * \alpha_{t-1}\) come from convolving the previous attention weights \(\alpha_{t-1}\).
Note
The convolution helps push the attention forward, encouraging monotonic attention.
If the previous attention is \([0, 1, 2, 1, 0]\) and the learned conv kernel is \([1, 0, 0]\) with padding 1 and stride 1,
then the output is shifted forward by one step: \([0, 0, 1, 2, 1]\).
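A compact sketch of location-aware additive attention (the dimension names and sizes are my own); the 1D convolution over the previous attention weights supplies the location features added into the additive score:

```python
import torch
import torch.nn as nn

class LocationAwareAttention(nn.Module):
    """Additive attention with an extra term computed from a conv over the previous attention."""
    def __init__(self, enc_dim, dec_dim, attn_dim, conv_channels=8, kernel_size=3):
        super().__init__()
        self.W = nn.Linear(dec_dim, attn_dim, bias=False)        # decoder state term
        self.V = nn.Linear(enc_dim, attn_dim, bias=False)        # encoder output term
        self.U = nn.Linear(conv_channels, attn_dim, bias=False)  # location-feature term
        self.loc_conv = nn.Conv1d(1, conv_channels, kernel_size, padding=kernel_size // 2)
        self.w = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, enc, dec_state, prev_attn):
        # enc: (B, T, enc_dim), dec_state: (B, dec_dim), prev_attn: (B, T)
        loc = self.loc_conv(prev_attn.unsqueeze(1)).transpose(1, 2)   # (B, T, conv_channels)
        scores = self.w(torch.tanh(self.W(dec_state).unsqueeze(1) + self.V(enc) + self.U(loc)))
        attn = torch.softmax(scores.squeeze(-1), dim=-1)              # (B, T)
        context = torch.bmm(attn.unsqueeze(1), enc).squeeze(1)        # (B, enc_dim)
        return context, attn

attn = LocationAwareAttention(enc_dim=256, dec_dim=128, attn_dim=64)
ctx, a = attn(torch.randn(2, 50, 256), torch.randn(2, 128), torch.softmax(torch.randn(2, 50), -1))
print(ctx.shape, a.shape)  # torch.Size([2, 256]) torch.Size([2, 50])
```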
2.3. Transformer
2.3.1. Positional Encoding
This paper compares 4 different positional encodings for transformer-based AMs; convolution works best here:
- None
- Sinusoid
- Frame stacking
- Convolution: VGG
2.4. Conformer
Conformer combines self-attention and convolution
- self-attention captures content-based global information
- convolution captures local features effectively
The number of parameters is roughly \(\text{hidden}^2 \times \text{layers} \times 23\), where the constant 23 can be broken into:
- 4: MHSA (multi-head self-attention)
- 8: first FFN
- 8: second FFN
- 3: convolution module (the first pointwise conv expands to \(2\times\) the hidden size for the GLU, contributing 2; the second pointwise conv contributes 1; the depthwise conv is negligible)
This is roughly 2 times larger than a transformer with the same hidden size.
Reference: from conformer paper
This paper has a good comparison between conformer and transformer
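The estimate above is easy to sanity-check with back-of-envelope arithmetic (ignoring biases, layer norms, the depthwise-conv kernel, and relative-position terms); the comparison against a plain transformer layer (4 + 8 = 12 per \(d^2\)) gives the "roughly 2x" figure:

```python
# Rough per-layer parameter counts, ignoring biases, layer norms,
# depthwise-conv kernels and relative-position terms.
def conformer_layer_params(d):
    mhsa = 4 * d * d             # Q, K, V, output projections
    ffn = 8 * d * d              # one macaron FFN with 4x expansion (d -> 4d -> d)
    conv = 3 * d * d             # pointwise conv to 2d (GLU) + pointwise conv back to d
    return mhsa + 2 * ffn + conv # = 23 * d^2

def transformer_layer_params(d):
    return 4 * d * d + 8 * d * d # MHSA + single FFN = 12 * d^2

d = 512
print(conformer_layer_params(d) / 1e6)                          # ~6.0M per layer
print(conformer_layer_params(d) / transformer_layer_params(d))  # 23/12 ~= 1.9, i.e. roughly 2x
```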
2.5. E2E
Instead of the traditional pipeline, we can train a deep network that directly maps the speech signal to the target word/word sequence:
- simplifies the complicated model-building process
- makes it easy to build ASR systems for new tasks without expert knowledge
- has the potential to outperform the conventional pipeline by optimizing a single objective
2.6. Codec Language Model
Model (AudioGen) uses SoundStream tokens and is conditioned on a textual description embedded with T5.
Model (AudioLM) uses SoundStream tokens as discrete units.
Model (VALL-E) uses EnCodec as the discrete tokenizer.
3. Alignment
We consider mapping an input sequence \(X=[x_1, ..., x_T]\) to an output sequence \(Y=[y_1, ..., y_U]\), where \(X, Y\) can vary in length and no alignment is provided. We are interested in the following two problems:
- loss function: efficiently compute the negative log-likelihood \(-\log p(Y|X)\) and its gradient
- inference: find the most likely sequence \(\hat{Y} = \text{argmax}_Y p(Y|X)\)
The following criteria can be applied to solve this problem.
3.1. ASG
ASG (Auto Segmentation Criterion) aims at minimizing:
\[ \sum_{(X,Y)} \Big( -\operatorname{logadd}_{\pi \in \mathcal{G}_{\text{asg}}(Y,T)} \sum_{t=1}^{T} \big( f_{\pi_t}(X) + g_{\pi_{t-1},\pi_t} \big) + \operatorname{logadd}_{\pi \in \mathcal{G}_{\text{full}}(T)} \sum_{t=1}^{T} \big( f_{\pi_t}(X) + g_{\pi_{t-1},\pi_t} \big) \Big) \]
where \(f, g\) are emission/transition scores; the first logadd runs over the paths compatible with the target \(Y\) and the second over all paths of length \(T\) (the normalizer).
3.2. CTC
CTC can handle some potential problems of ASG in some cases:
- repeated tokens create ambiguity (does aabbba collapse to aba or abba?); one emission can map to multiple outputs
- not every input frame has a label (e.g. silence)
Note: for reference, see Alex Graves' thesis, Chapter 7, rather than the original CTC paper. The definition of \(\beta\) in the thesis is consistent with the traditional definition.
3.2.1. Training
CTC is a discriminative model; it makes a conditional independence assumption over a valid alignment \(A=(a_1, ..., a_T)\): \(p(A|X) = \prod_{t=1}^{T} p(a_t|X)\).
The objective of CTC is then to marginalize over all valid alignments:
\[ p(Y|X) = \sum_{A \in \mathcal{B}^{-1}(Y)} \prod_{t=1}^{T} p(a_t|X) \]
where the sum can be computed efficiently with the forward algorithm. Note there are two different transition cases, depending on whether the aligned character is a blank/repeated label or a new label.
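A minimal NumPy sketch of that forward (\(\alpha\)) recursion over the blank-augmented label sequence, showing the two transition cases (staying/advancing vs. additionally skipping the blank when the next label differs); not an optimized or numerically hardened implementation:

```python
import numpy as np

def ctc_forward(log_probs, target, blank=0):
    """log_probs: (T, K) per-frame log posteriors; target: label ids without blanks.
    Returns log p(target | input) via the CTC forward recursion."""
    # Interleave blanks: z' = [blank, z1, blank, z2, ..., blank]
    ext = [blank]
    for c in target:
        ext += [c, blank]
    T, S = log_probs.shape[0], len(ext)

    alpha = np.full((T, S), -np.inf)
    alpha[0, 0] = log_probs[0, ext[0]]          # start with the blank
    if S > 1:
        alpha[0, 1] = log_probs[0, ext[1]]      # or with the first label

    for t in range(1, T):
        for s in range(S):
            terms = [alpha[t - 1, s]]                           # stay on the same symbol
            if s > 0:
                terms.append(alpha[t - 1, s - 1])               # advance by one
            # May skip the blank only when moving to a *different* non-blank label
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[t - 1, s - 2])
            alpha[t, s] = np.logaddexp.reduce(terms) + log_probs[t, ext[s]]

    return np.logaddexp(alpha[-1, -1], alpha[-1, -2])  # end in the last blank or last label

# Toy example: 4 frames, 3 symbols (0 = blank), target "1 2"
logp = np.log(np.full((4, 3), 1.0 / 3))
print(ctc_forward(logp, [1, 2]))
```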
We want to minimize the negative log-likelihood over the dataset.
Suppose the training set is \(S\) with pairs \((x, z)\), and the network output probabilities are \(y^t_k\) (the probability of emitting label \(k\) at time \(t\)). The objective is \(O = -\sum_{(x,z) \in S} \ln p(z|x)\).
Noting that \(\alpha, \beta\) have the property
from which we obtain,
For any \(t\), we have
We know
3.2.2. Inference
One inference heuristic is to take the most likely output at each timestep.
Beam search is also available by carefully collapsing hypotheses into equivalent sets.
We can also incorporate a language model by decoding with
\[ \hat{Y} = \operatorname{argmax}_Y \; \log p(Y|X) + \alpha \log p_{\text{LM}}(Y) + \beta\, \text{wordcount}(Y) \]
where the second term is the language model and the last term is a word insertion bonus; \(\alpha, \beta\) are hyperparameters to be tuned.
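A minimal best-path decoding sketch (argmax per frame, collapse repeats, drop blanks); a beam search with shallow fusion would instead re-score hypotheses using the \(\alpha\)/\(\beta\)-weighted terms above:

```python
import numpy as np

def ctc_greedy_decode(log_probs, blank=0):
    """Best-path decoding: argmax per frame, collapse repeats, then remove blanks.
    A heuristic -- it does not sum over all alignments of a hypothesis."""
    best = np.argmax(log_probs, axis=-1)       # (T,)
    out, prev = [], None
    for p in best:
        if p != prev and p != blank:
            out.append(int(p))
        prev = p
    return out

logp = np.log(np.array([[0.1, 0.8, 0.1],    # frame 1 -> label 1
                        [0.1, 0.8, 0.1],    # frame 2 -> label 1 (repeat, collapsed)
                        [0.8, 0.1, 0.1],    # frame 3 -> blank
                        [0.1, 0.1, 0.8]]))  # frame 4 -> label 2
print(ctc_greedy_decode(logp))              # [1, 2]
```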
Model (CTC-CRF)
Instead of assuming conditional independence, some works use a CRF with the CTC topology: the CRF computes the posterior, with \(\pi = (Y, A)\) a valid alignment.
3.3. Transducer
Model (RNA, recurrent neural aligner)
- removes the conditional independence assumption of CTC by using the label at \(t-1\) to predict the label at \(t\)
Check this lecture video
Model (RNN-T) also extends CTC by removing the conditional independence assumption, adding a prediction network.
Unlike CTC or RNA, each frame can emit multiple labels
where the current label depends on the non-blank label history \(y_1, ..., y_{u-1}\). The prediction network is believed to act like a classic LM, but it might not: removing the recurrence (depending only on the last label \(y_{u-1}\)) yields similar results, which suggests it might only be predicting whether the next emission is an actual label or a blank.
Torchaudio now has an implementation; look at it here. The logits passed to the loss function have shape (batch, T_in, T_out, class), where the T_in axis comes from the audio encoder and the T_out axis from the label encoder (prediction network).
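A shape-only sketch of how the two encoder outputs are broadcast-added into those 4-D logits; the tiny joint network here is illustrative, and the exact loss signature can differ across torchaudio versions:

```python
import torch
import torch.nn as nn

B, T_in, T_out, D, num_classes = 2, 50, 10, 256, 29

audio_enc = torch.randn(B, T_in, D)    # from the audio encoder (transcription network)
label_enc = torch.randn(B, T_out, D)   # from the prediction network over the label history

# An illustrative joint network: broadcast-add the two encodings,
# then project to the vocabulary (+ blank).
joint = nn.Sequential(nn.Tanh(), nn.Linear(D, num_classes))
logits = joint(audio_enc.unsqueeze(2) + label_enc.unsqueeze(1))  # (B, T_in, T_out, classes)
print(logits.shape)  # torch.Size([2, 50, 10, 29])

# These 4-D logits are what an RNN-T loss such as torchaudio.functional.rnnt_loss
# consumes (the label axis is typically target length + 1, since the prediction
# network is fed a start-of-sequence symbol).
```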
Model (Transformer Transducer) proposes to replace the RNN with a Transformer with proper masking.
3.4. GTC
3.5. Attention
Unlike CTC, the attention model does not preserve the order of the inputs, but the ASR task requires a monotonic alignment, which makes the model hard to learn from scratch.