0x453 Neural Model
- 1. Vocoder
- 2. Language Model
- 3. Task
- 4. Reference
1. Vocoder
A vocoder converts acoustic features (a mel spectrogram or discrete ids) into a time-domain audio waveform.
This YouTube video is a very good summary of neural vocoders
1.1. Autoregressive Vocoder
Model (WaveNet)
Uses a stack of dilated causal convolution layers to increase the receptive field
\(\mu\)-law companding is applied to quantize the 16-bit signal to an 8-bit representation so that a softmax output is feasible
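As a rough illustration (my own sketch, not the paper's preprocessing code), \(\mu\)-law companding with \(\mu=255\) followed by uniform quantization to 256 ids looks like this:

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Compand a waveform in [-1, 1] and quantize it to mu + 1 integer levels."""
    x = np.clip(x, -1.0, 1.0)
    companded = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)  # still in [-1, 1]
    return ((companded + 1) / 2 * mu + 0.5).astype(np.int64)          # ids in [0, mu]

def mu_law_decode(ids, mu=255):
    """Invert the quantization and the companding."""
    companded = 2 * (np.asarray(ids, dtype=np.float64) / mu) - 1
    return np.sign(companded) * ((1 + mu) ** np.abs(companded) - 1) / mu
```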
It uses a gated activation unit as the nonlinearity
When it is conditioned on some input \(h\), the activation becomes \(z = \tanh(W_f * x + V_f^\top h) \odot \sigma(W_g * x + V_g^\top h)\)
The conditioning on \(h\) can be either global or local.
- When it is a global condition, \(h\) is broadcast across all time steps
- When it is a local condition, \(h\) is upsampled to match the time resolution of the waveform
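A minimal PyTorch sketch of one such conditional gated layer (my own simplification: channel sizes, the padding scheme, and the 1x1-conv conditioning projections are assumptions, and residual/skip connections are omitted):

```python
import torch
import torch.nn as nn

class ConditionalGatedLayer(nn.Module):
    """z = tanh(Wf * x + Vf(h)) * sigmoid(Wg * x + Vg(h)) with a causal dilated conv."""

    def __init__(self, channels, cond_channels, dilation):
        super().__init__()
        self.filter_conv = nn.Conv1d(channels, channels, kernel_size=2,
                                     dilation=dilation, padding=dilation)
        self.gate_conv = nn.Conv1d(channels, channels, kernel_size=2,
                                   dilation=dilation, padding=dilation)
        self.filter_cond = nn.Conv1d(cond_channels, channels, kernel_size=1)
        self.gate_cond = nn.Conv1d(cond_channels, channels, kernel_size=1)

    def forward(self, x, h):
        # x: (batch, channels, T).  h: (batch, cond_channels, 1) for global
        # conditioning (broadcast over time) or (batch, cond_channels, T) for
        # local conditioning (already upsampled to the waveform resolution).
        T = x.size(-1)
        f = self.filter_conv(x)[..., :T] + self.filter_cond(h)  # slicing keeps it causal
        g = self.gate_conv(x)[..., :T] + self.gate_cond(h)
        return torch.tanh(f) * torch.sigmoid(g)
```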
Model (FFTNet)
Achieves the effect of dilation by splitting the input into two parts
Model (WaveRNN)
1.2. Flow Vocoder
Model (WaveGlow)
WaveGlow is a flow-based model, not an autoregressive model.
2. Language Model
2.1. Codec Language Model
Model (AudioGen) uses SoundStream tokens and is conditioned on textual descriptions embedded with T5
Model (AudioLM) uses SoundStream tokens as discrete units
Model (VALL-E) uses EnCodec as the discrete tokenizer
2.2. Alignment
We consider mapping an input sequence \(X=[x_1, ..., x_T]\) to an output sequence \(Y=[y_1, ..., y_U]\), where \(X,Y\) can vary in length and no alignment is provided. We are interested in the following two problems:
- loss function: compute the conditional probability \(-\log p(Y|X)\) and its gradient efficiently
- inference: find the most likely sequence \(\hat{Y} = \text{argmax}_Y p(Y|X)\)
The following criteria can be applied to solve this problem
2.2.1. ASG
ASG (Auto-Segmentation Criterion) aims at minimizing:
where \(f,g\) are emission/transition scores
2.2.2. CTC
CTC can handle some potential problems of ASG:
- repeated tokens create ambiguity (does aabbba collapse to aba or abba?): a single emission sequence can map to multiple outputs
- not every input frame has a label (e.g., silence)
2.2.2.1. Training
CTC is a discriminative model; it makes a conditional independence assumption over a valid alignment \(A=(a_1, ..., a_T)\): \(p(A|X) = \prod_{t=1}^{T} p(a_t|X)\).
The CTC objective then marginalizes over all valid alignments: \(p(Y|X) = \sum_{A \in \mathcal{B}^{-1}(Y)} \prod_{t=1}^{T} p(a_t|X)\), where \(\mathcal{B}\) collapses an alignment into an output sequence.
The sum can be computed efficiently with the forward algorithm. Note that there are two different transition cases, depending on whether the aligned character is a blank or not.
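A plain-Python sketch of that forward computation (a teaching illustration only: unnormalized probabilities, no log-space numerics; `probs[t][k]` is assumed to be \(p(a_t=k\mid X)\) and id 0 is the blank):

```python
def ctc_forward(probs, labels, blank=0):
    """Sum the probabilities of all alignments that collapse to `labels`.

    probs:  per-frame distributions, probs[t][k] = p(a_t = k | X)
    labels: target label ids without blanks (assumed non-empty)
    """
    # Interleave blanks: l' = [blank, y1, blank, y2, ..., yU, blank]
    ext = [blank]
    for y in labels:
        ext += [y, blank]
    S, T = len(ext), len(probs)

    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = probs[0][ext[0]]        # start with a blank ...
    alpha[0][1] = probs[0][ext[1]]        # ... or with the first label
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]                       # stay on the same symbol
            if s >= 1:
                a += alpha[t - 1][s - 1]              # advance by one symbol
            # Skipping over a blank is only allowed when the current symbol is a
            # real label that differs from the label two positions back.
            if ext[s] != blank and s >= 2 and ext[s] != ext[s - 2]:
                a += alpha[t - 1][s - 2]
            alpha[t][s] = a * probs[t][ext[s]]
    # End either on the final blank or on the last label.
    return alpha[T - 1][S - 1] + alpha[T - 1][S - 2]
```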
We want to minimize the negative log-likelihood over the dataset.
For the gradient derivation, see page 61 of Alex Graves's thesis rather than the original CTC paper. The definition of \(\beta\) in the thesis is consistent with the traditional definition.
Suppose a training pair is \((x,z)\) and the network output probabilities are \(y^t_k\). The objective is
Note that \(\alpha, \beta\) have the property
from which we obtain,
For any \(t\), we have
We know
2.2.2.2. Inference
One inference heuristic is to take the most likely output at each time step and then collapse the result (see the sketch below).
Beam search is also possible by carefully collapsing hypotheses into equivalent sets.
We can also incorporate a language model
where the second term is the language model and the last term is a word insertion bonus; \(\alpha, \beta\) are hyperparameters to be tuned
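A minimal sketch of the greedy (best-path) heuristic mentioned above, with blank assumed to be id 0 (my own illustration, not a reference implementation):

```python
import torch

def ctc_greedy_decode(log_probs, blank=0):
    """log_probs: (T, num_classes) per-frame log-probabilities."""
    best = log_probs.argmax(dim=-1).tolist()   # most likely id at every frame
    out, prev = [], None
    for k in best:
        if k != prev and k != blank:           # merge repeats, then drop blanks
            out.append(k)
        prev = k
    return out
```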
Model (CTC-CRF)
Instead of assuming conditional independence, some works use a CRF with the CTC topology: the CRF computes the posterior; let \(\pi = (Y,A)\) be a valid alignment
2.2.3. Transducer
Model (RNA, recurrent neural aligner)
- removes the conditional independence assumption of CTC by using the label at \(t-1\) to predict the label at \(t\)
Check this lecture video
Model (RNN-T) also extends CTC by removing the conditional independence assumption, adding a prediction network
Unlike CTC or RNA, each frame can emit multiple labels
where the current label depends on the non-blank label history \(y_1, ..., y_{u-1}\). The prediction network is often thought of as a classic LM, but it might not act like one: removing the recurrence (depending only on the last label \(y_{u-1}\)) yields similar results, which suggests it might mainly be predicting whether to emit an actual label or a blank.
torchaudio now has an implementation. Look at it here: the logits passed into the loss function have shape (batch, T_in, T_out, class), where (batch, T_in, class) comes from the audio encoder and (batch, T_out, class) comes from the label encoder (prediction network).
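A small shape-check sketch against that interface (random tensors only; the sizes are toy values, and the exact dtype requirements should be checked against the torchaudio documentation for your version):

```python
import torch
import torchaudio

batch, T_in, U, num_classes = 2, 50, 10, 29      # toy sizes; blank id = 0

# Joint-network output: one logit vector per (input frame, output position) pair.
# The output axis has length U + 1 because the label encoder also sees a start position.
logits = torch.randn(batch, T_in, U + 1, num_classes)

targets = torch.randint(1, num_classes, (batch, U), dtype=torch.int32)
logit_lengths = torch.full((batch,), T_in, dtype=torch.int32)
target_lengths = torch.full((batch,), U, dtype=torch.int32)

loss = torchaudio.functional.rnnt_loss(
    logits, targets, logit_lengths, target_lengths, blank=0
)
print(loss)
```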
Model (Transformer Transducer) proposes replacing the RNN with a Transformer with proper masking
2.2.4. GTC
2.2.5. Attention
Unlike CTC, the attention model does not preserve the order of the inputs, but ASR requires a monotonic alignment, which makes the model hard to train from scratch.
2.2.5.1. LAS
2.2.5.2. Hybrid-CTC
3. Task
3.1. Voice Conversion
Conversion does not need to be speaker conversion; speaking style can also be converted (e.g., emotion, whisper/normal). This topic is similar to image style transfer in CV
3.1.1. Feature Disentangle
3.1.2. Direct Transformation
CycleGAN
StarGAN
Model (Parrotron)
3.2. Speech Separation
3.3. Speech Enhancement
3.3.1. Time Domain Models
Model (Demucs)
- encoder-decoder with U-net
- loss is defined over clean signal and enhanced signal
3.3.2. Time-Frequency Domain Models
Model (FullSubNet) is trained to estimate the complex Ideal Ratio Mask (cIRM) and the Ideal Ratio Mask (IRM)
3.4. Speech Synthesis
3.4.1. Autoregressive TTS
Model (Tacotron) attention-based s2s model
- input: character
- output: linear spectrogram
- vocoder: Griffin Lim
Architecture:
encoder: roughly corresponding to grapheme-to-phoneme model
- prenet: FFN, dropout
- CBHG: conv1d + max-pool along time + highway network + GRU
attention: roughly corresponding to modeling duration
decoder: audio synthesis,
- RNN: each step can generate multiple frames (e.g., 3 or 5 frames in v1, only 1 frame in v2)
- prenet: training is done with teacher forcing, but the dropout in the prenet acts like scheduled sampling
- post-processing: a non-causal CBHG post-processes the autoregressive outputs; the loss is applied both before and after the post-processing step
Model (Tacotron 2)
- input: character
- output: mel spectrogram
- vocoder: a modified WaveNet
Model (Non-Attentive Tacotron)
It replaces the attention mechanism with a duration predictor
Gaussian Upsampling
- Given inputs \([h_1, ..., h_n]\), duration values \([d_1, ..., d_n]\), and range parameters \([\sigma_1, ..., \sigma_n]\), the upsampled vectors \([u_1, ..., u_t]\) are computed by placing a Gaussian distribution over each segment
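A rough sketch of this upsampling (my reading of the idea, not the paper's exact code: each segment \(i\) gets a Gaussian centered at \(c_i = \sum_{j \le i} d_j - d_i/2\), and each output frame is a weighted average of the \(h_i\) under those Gaussians):

```python
import torch

def gaussian_upsample(h, d, sigma):
    """h: (n, dim) inputs, d: (n,) durations in frames, sigma: (n,) range parameters."""
    d = d.float()
    total = int(d.sum().item())
    t = torch.arange(total, dtype=torch.float32) + 0.5   # frame positions (offset is a choice)
    c = torch.cumsum(d, dim=0) - d / 2.0                 # center of each segment
    # Gaussian weight of every (frame, segment) pair, normalized over segments.
    w = torch.exp(-0.5 * ((t[:, None] - c[None, :]) / sigma[None, :]) ** 2)
    w = w / w.sum(dim=1, keepdim=True)
    return w @ h                                          # (total, dim)
```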
Unsupervised duration modeling
- extract the alignment between input and output using a fine-grained VAE, similar to this work
Model (VALL-E)
3.4.2. Non-Autoregressive TTS
Model (Parallel Wavenet) A trained WaveNet model is used as a teacher for a feedforward IAF (inverse autoregressive flow) student
The probability density distillation loss is the KL divergence
Note that the entropy term \(H(P_S)\) is necessary. Otherwise it will collapse to the teacher's mode (mostly silence, see Appendix A.1 in the paper)
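For reference, the decomposition being referred to above, with \(P_S\) the student and \(P_T\) the teacher (as I read the paper):
\[ D_{\mathrm{KL}}(P_S \,\|\, P_T) = H(P_S, P_T) - H(P_S) \]
where \(H(P_S, P_T)\) is the cross-entropy between student and teacher and \(H(P_S)\) is the student's entropy.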
Model (FastSpeech) upsamples the input sequence using a duration prediction model
Duration Modeling
Duration Predictor:
- 2-layer 1d-conv
- the ground-truth durations are extracted from the attention alignment of an encoder-decoder Transformer TTS; the head is chosen as the one with the most diagonal-like attention.
Length Regulator:
- expands the hidden states of the phoneme sequence according to the predicted durations.
- For example, if the hidden states of 4 phonemes are \([h_1, h_2, h_3, h_4]\) and their durations are \([2,2,3,1]\), the expansion is \([h_1, h_1, h_2, h_2, h_3, h_3, h_3, h_4]\). A hyperparameter \(\alpha\) can be used to control the voice speed by scaling the durations.
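The expansion itself is essentially `repeat_interleave`; a minimal sketch (the function name and the way \(\alpha\) scales the durations are my own illustration):

```python
import torch

def length_regulate(h, durations, alpha=1.0):
    """h: (n, dim) phoneme hidden states, durations: (n,) predicted frame counts."""
    d = torch.round(durations.float() * alpha).long().clamp(min=0)
    return torch.repeat_interleave(h, d, dim=0)               # (sum(d), dim)

h = torch.eye(4)                                              # stand-ins for [h1, h2, h3, h4]
print(length_regulate(h, torch.tensor([2, 2, 3, 1])).shape)   # torch.Size([8, 4])
```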
Model (FastSpeech 2)
Duration Predictor
- uses the Montreal Forced Aligner, which is more accurate than the teacher alignment in the original model
3.4.3. Multispeaker TTS
Model (Deep Voice 2) using speaker embedding, a single neural TTS system can learn hundreds of unique voices from less than half an hour of data per speaker
3.4.4. Multilingual TTS
Model (remapping input symbols) learns a mapping between source and target linguistic symbols.
Model (byte2speech) maps bytes to spectrograms; it can adapt to new languages with merely 40 seconds of transcribed recordings
3.5. Speech Recognition
3.5.1. DNN
An alternative to the generative GMM-HMM model is the hybrid DNN-HMM
A DNN is a discriminative model, so it cannot be directly plugged into the HMM. In the DNN-HMM hybrid model, the DNN first estimates the posterior \(P(p|X)\), where \(p\) is usually a CD (context-dependent) state, and then converts this output into a scaled likelihood using the prior: \(P(X|p) \propto P(p|X)/P(p)\).
Then \(P(X|p)\) can be plugged into the generative HMM framework.
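In log space this conversion is just a subtraction (a sketch; the priors are typically the state frequencies counted from forced-aligned training data):

```python
import numpy as np

def posterior_to_scaled_loglik(log_posteriors, state_priors, floor=1e-8):
    """Convert frame-wise DNN state posteriors into scaled log-likelihoods.

    log_posteriors: (T, num_states) log P(p|X) per frame
    state_priors:   (num_states,) prior P(p)
    The P(X) term is dropped because it is constant across states.
    """
    return log_posteriors - np.log(np.maximum(state_priors, floor))
```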
3.5.2. RNN
Model (location-aware attention) applies a 1-D convolution to the previous attention weights when computing the attention for the current time step.
It can be implemented with additive attention, for example,
Note
The convolution helps move the attention forward, encouraging a monotonic attention pattern.
If the previous attention is \([0, 1, 2, 1, 0]\) and learned conv kernel is \([1, 0, 0]\) with pad 1 stride 1,
then output will shift by one step \([0, 0, 1, 2, 1]\)
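A compact PyTorch sketch of the idea (the dimensions, kernel size, and module names are my assumptions; the standard formulation is additive attention over convolutional location features):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationAwareAttention(nn.Module):
    def __init__(self, enc_dim, dec_dim, attn_dim, kernel_size=31, filters=32):
        super().__init__()
        self.W = nn.Linear(dec_dim, attn_dim, bias=False)    # query (decoder state)
        self.V = nn.Linear(enc_dim, attn_dim, bias=False)    # keys (encoder states)
        self.U = nn.Linear(filters, attn_dim, bias=False)    # location features
        self.loc_conv = nn.Conv1d(1, filters, kernel_size, padding=kernel_size // 2)
        self.w = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, query, keys, prev_attn):
        # query: (B, dec_dim), keys: (B, T, enc_dim), prev_attn: (B, T)
        loc = self.loc_conv(prev_attn.unsqueeze(1)).transpose(1, 2)      # (B, T, filters)
        scores = self.w(torch.tanh(
            self.W(query).unsqueeze(1) + self.V(keys) + self.U(loc))).squeeze(-1)
        attn = F.softmax(scores, dim=-1)                                 # (B, T)
        context = torch.bmm(attn.unsqueeze(1), keys).squeeze(1)          # (B, enc_dim)
        return context, attn
```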
3.5.3. Transformer
3.5.3.1. Positional Encoding
This paper compares 4 different positional encodings for transformer-based AMs; convolution works best here
- None
- Sinusoid
- Frame stacking
- Convolution: VGG
3.5.3.2. Conformer
Conformer combines self-attention and convolution
- self-attention captures content-based global information
- convolution captures local features effectively
Reference: the Conformer paper
This paper has a good comparison between conformer and transformer
3.5.4. E2E
Instead of the traditional pipeline, we can train a deep network that directly maps the speech signal to the target word/word sequence
- simplifies the complicated model-building process
- makes it easy to build ASR systems for new tasks without expert knowledge
- has the potential to outperform conventional pipelines by optimizing a single objective
3.6. Speech Translation
There are roughly three types of speech translation systems
- cascaded system: ASR + MT + TTS (F speech -> F text -> E text -> E speech)
- speech-text system: F speech -> E text -> E speech
- speech-speech system: F speech -> E speech
3.6.1. Speech-to-Text Model
3.6.2. Speech-to-Speech Model
Speech-to-speech models do not rely on text generation as an intermediate step, which makes them a natural approach for languages without a writing system.
3.6.2.1. Speech to Spectrogram Model
Model (Translatotron 1)
Model (Translatotron 2)
Translatotron 2 improves on version 1 with respect to the following weaknesses:
- phoneme alignment is not used by the main task
- long sequence to long spectrogram sequence with attention is difficult to train
The new version has the following components:
- speech encoder: mel spectrogram to hidden representation using a Conformer
- linguistic decoder: uses the encoder output to predict the phoneme sequence of the translated speech
- acoustic synthesizer: takes the decoder output (before the final projection) and the attention context and generates the spectrogram with a non-autoregressive model; the attention is shared with the linguistic decoder
3.6.2.2. Speech to Discrete-Unit Model
Model (vq-vae)
- train a vq-vae model of the target language
- learn an s2s model to map the source-language spectrogram to target-language tokens
- synthesize the target-language tokens into a spectrogram and apply Griffin-Lim
Model (xl-vae)
- enhances the vq-vae model by adding a cross-lingual speech recognition task
- the quantizer is trained to reconstruct the target language as well as to perform ASR on other languages
Model (speech-to-unit translation with ssl)
- Apply self-supervised encoder to the target speech
- Train a speech-to-unit translation model
3.7. Spoken Language Understanding
3.7.1. Cascade NLU
3.7.2. End-to-End SLU
Dataset (Fluent Speech Command)
Each audio clip has three slots: action, object, and location. A model is trained on the dataset as follows:
- the lower layers are pretrained using force-aligned phonemes/words
- the lower-layer targets are then discarded and the top layer is trained as a classifier by pooling the sequence outputs.
Other relevant papers
4. Reference
[0] original papers. All images are taken from the original papers or blogs
[1] CMU 11751: Speech Recognition and Understanding
[2] Lecture Note on Hybrid HMM/DNN
[3] Hung-yi Lee's lecture: Deep Learning for Human Language Processing