0x572 Task
- 1. Speech Processing
- 2. Speech Synthesis
- 3. Speech Recognition
- 4. Speech Translation
- 5. Spoken Language Understanding
- 6. Reference
1. Speech Processing
1.1. Speech Separation
Deep Clustering (Hershey et al. 2016) achieves speaker separation by building spectrogram-bin embeddings that are discriminative for speaker partitioning with k-means.
The objective is \(\mathcal{L}(V) = \lVert VV^\top - YY^\top \rVert_F^2\), where \(V\) is the embedding matrix and \(Y\) is the label-indicator matrix (which can be decided deterministically). The embedding is built by minimizing the distance between the two affinity matrices, which is equivalent to pushing embeddings with the same label closer together and embeddings with different labels apart.
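A minimal sketch of this objective, assuming \(V\) is a (TF, D) matrix of embeddings for the time-frequency bins and \(Y\) is the (TF, C) one-hot label-indicator matrix (variable names are illustrative, not from the paper's code):

```python
import torch

def deep_clustering_loss(V, Y):
    # ||V V^T - Y Y^T||_F^2, expanded so the (TF x TF) affinity matrices are never built.
    # V: (TF, D) float embeddings, Y: (TF, C) float one-hot labels.
    vtv = V.t() @ V          # (D, D)
    vty = V.t() @ Y          # (D, C)
    yty = Y.t() @ Y          # (C, C)
    return (vtv ** 2).sum() - 2 * (vty ** 2).sum() + (yty ** 2).sum()

# At inference time, k-means on the rows of V partitions the bins into speakers.
```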
1.2. Speech Enhancement
1.2.1. Time Domain Models
Model (Demucs)
- encoder-decoder architecture with U-Net-style skip connections
- the loss is defined between the clean signal and the enhanced signal directly in the time domain
1.2.2. Time-Frequency Domain Models
Model (FullSubNet) predicts the complex Ideal Ratio Mask (cIRM); training minimizes the distance between the predicted complex ratio mask and the ideal one
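A small sketch of the cIRM target, assuming `noisy_stft` and `clean_stft` are complex STFT arrays of the same shape (a compression of the mask is typically applied on top of this in practice, omitted here):

```python
import numpy as np

def cirm(noisy_stft, clean_stft, eps=1e-8):
    # Element-wise complex ratio S / Y, written via the conjugate: S * conj(Y) / |Y|^2.
    return clean_stft * np.conj(noisy_stft) / (np.abs(noisy_stft) ** 2 + eps)
```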
1.3. Voice Conversion
Conversion does not need to be speaker conversion; speaking style can also be converted (e.g. emotion, whisper/normal speech). This topic is similar to image style transfer in CV.
1.3.1. Feature Disentangle
1.3.2. Direct Transformation
CycleGAN
StarGAN
Model (Parrotron)
2. Speech Synthesis
2.1. Autoregressive TTS
Model (Tacotron) attention-based s2s model
- input: character
- output: linear spectrogram
- vocoder: Griffin-Lim
Architecture:
encoder: roughly corresponds to a grapheme-to-phoneme model
- prenet: FFN with dropout
- CBHG: 1-D convolutions + max-pooling along time + highway network + bidirectional GRU
attention: roughly corresponds to duration modeling
decoder: audio synthesis
- RNN: each step can generate multiple frames (3 or 5 frames in v1, only 1 frame in v2)
- prenet: training uses teacher forcing, but the prenet dropout acts like scheduled sampling
- post-processing: a non-causal CBHG refines the autoregressive outputs (it sees the whole decoded sequence); the loss is applied both before and after the post-processing step
Model (Tacotron 2)
- input: character
- output: mel-spectrogram
- vocoder: a modified WaveNet
Model (Non-Attentive Tacotron)
It replaces the attention mechanism with a duration predictor.
Gaussian Upsampling
- Given token states \([h_1, ..., h_n]\), duration values \([d_1, ..., d_n]\), and range parameters \([\sigma_1, ..., \sigma_n]\), the upsampled sequence \([u_1, ..., u_T]\) (with \(T = \sum_i d_i\)) is computed by placing a Gaussian distribution at the center of each token's segment and mixing the \(h_i\) with the normalized Gaussian weights, as sketched below
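A minimal sketch of Gaussian upsampling, assuming integer frame durations and treating the exact frame-position indexing as an implementation detail:

```python
import torch

def gaussian_upsample(h, d, sigma):
    """h: (N, D) token states, d: (N,) durations in frames, sigma: (N,) range parameters."""
    d = d.float()
    T = int(round(d.sum().item()))
    c = torch.cumsum(d, dim=0) - 0.5 * d                      # center of each token's segment
    t = torch.arange(T, dtype=h.dtype).unsqueeze(1)           # (T, 1) output frame positions
    dens = torch.exp(-0.5 * ((t - c) / sigma) ** 2) / sigma   # N(t; c_i, sigma_i^2) up to a constant
    w = dens / dens.sum(dim=1, keepdim=True)                  # normalize the weights over tokens
    return w @ h                                              # (T, D) upsampled frames
```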
Unsupervised duration modeling
- extract the alignment between input and output using a fine-grained VAE, similar to prior work
Model (VALL-E)
VALL-E uses two models to generate all of the EnCodec codes
- the AR model generates the codes of the first quantizer, conditioned on the phoneme sequence and the prompt's codes (also from the first quantizer)
- the NAR model generates the codes of the remaining quantizers (layers 2 to 8) one layer at a time
Note that in VALL-E's setting, EnCodec transforms 24 kHz waveforms into 75 Hz codes with 8 quantizers and a codebook size of 1024. Therefore, each second is represented by 75*8 = 600 discrete tokens.
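Quick arithmetic on this token layout (the numbers come from the setting above; the 10-second length is just an example) shows why the AR/NAR split matters: only 1/8 of the tokens are generated autoregressively.

```python
frames_per_second = 75   # EnCodec frame rate at 24 kHz
n_quantizers = 8         # residual quantizer layers

seconds = 10
total_tokens = seconds * frames_per_second * n_quantizers   # 6000 discrete tokens
ar_tokens = seconds * frames_per_second                     # 750 generated by the AR model (layer 1)
nar_tokens = total_tokens - ar_tokens                       # 5250 filled in by the NAR model (layers 2-8)
print(total_tokens, ar_tokens, nar_tokens)
```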
2.2. Non-Autoregressive TTS
Model (Parallel WaveNet) A trained WaveNet model is used as a teacher for a feed-forward IAF (inverse autoregressive flow) student.
The probability density distillation loss is the KL divergence between the student distribution \(P_S\) and the teacher distribution \(P_T\).
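Written out with the paper's notation, where \(H(P_S, P_T)\) is the cross-entropy and \(H(P_S)\) is the student entropy:
\[
D_{\mathrm{KL}}(P_S \,\|\, P_T) = H(P_S, P_T) - H(P_S)
\]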
Note that the entropy term \(H(P_S)\) is necessary; otherwise the student collapses to the teacher's mode (mostly silence, see Appendix A.1 in the paper).
Model (FastSpeech) upsamples the input sequence using a duration prediction model
Duration Modeling
Duration Predictor:
- 2-layer 1d-conv
- ground-truth durations are extracted from the attention alignment of an encoder-decoder Transformer TTS teacher; the attention head is chosen as the one with the most diagonal-like alignment
Length Regulator:
- expand the hidden states of the phoneme sequence according to the predicted durations
- For example, if the hidden states of 4 phonemes are \([h_1, h_2, h_3, h_4]\) and their durations are \([2,2,3,1]\), the sequence expands to \([h_1, h_1, h_2, h_2, h_3, h_3, h_3, h_4]\). A hyperparameter \(\alpha\) can be used to control the voice speed by rescaling the durations (see the sketch below).
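A minimal sketch of the length regulator, assuming integer durations per phoneme (the function name is illustrative):

```python
import torch

def length_regulate(h, d, alpha=1.0):
    """h: (N, D) phoneme hidden states, d: (N,) predicted durations in frames."""
    d = torch.clamp((d.float() * alpha).round().long(), min=0)
    return torch.repeat_interleave(h, d, dim=0)   # (sum(d), D) expanded sequence

# e.g. 4 phoneme states with d = [2, 2, 3, 1] expand to
# [h1, h1, h2, h2, h3, h3, h3, h4]; alpha < 1 shortens durations (faster speech).
```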
Model (FastSpeech 2)
Duration Predictor
- uses the Montreal Forced Aligner, which is more accurate than the teacher-model attention alignment used in the original FastSpeech
Issue (oversmoothing)
2.3. Multispeaker TTS
Model (Deep Voice 2) using speaker embeddings, a single neural TTS system can learn hundreds of unique voices from less than half an hour of data per speaker (the conditioning idea is sketched below)
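A minimal, hypothetical sketch of the core idea (not Deep Voice 2's actual architecture, which injects site-specific embeddings at several points in the network): a learned per-speaker embedding conditions the shared TTS model.

```python
import torch
import torch.nn as nn

class SpeakerConditioning(nn.Module):
    """Adds a learned speaker embedding to the hidden states of a shared TTS model."""
    def __init__(self, n_speakers, d_speaker=16, d_model=256):
        super().__init__()
        self.speaker_table = nn.Embedding(n_speakers, d_speaker)
        self.project = nn.Linear(d_speaker, d_model)

    def forward(self, hidden, speaker_id):
        # hidden: (B, T, d_model) encoder/decoder states, speaker_id: (B,) integer ids
        s = self.project(self.speaker_table(speaker_id))   # (B, d_model)
        return hidden + s.unsqueeze(1)                      # broadcast over time
```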
2.4. Multilingual TTS
Model (remapping input symbols) learns a mapping between source and target linguistic symbols.
Model (byte2speech) maps bytes to spectrograms; it can adapt to new languages with merely 40s of transcribed recordings
3. Speech Recognition
4. Speech Translation
There are roughly three types of speech translation systems
- cascaded system: ASR + MT + TTS (F speech -> F text -> E text -> E speech)
- speech-text system: F speech -> E text -> E speech
- speech-speech system: F speech -> E speech
4.1. Speech-to-Text Model
4.2. Speech-to-Speech Model
Speech-to-speech models do not rely on text generation as an intermediate step, which makes them a natural approach for languages without a writing system.
4.2.1. Speech to Spectrogram Model
Model (Translatotron 1)
Model (Translatotron 2)
Translatotron 2 addresses the following weaknesses of Translatotron 1:
- the auxiliary phoneme alignment is not used by the main task
- attending from a long input sequence to a long spectrogram sequence is difficult to train
The new version has the following components:
- speech encoder: mel-spectrogram to hidden representation using a Conformer
- linguistic decoder: predicts the phoneme sequence of the translated speech from the encoder output
- acoustic synthesizer: takes the decoder output (before the final projection) and the attention context, and generates the spectrogram with a non-autoregressive model; the attention is shared with the linguistic decoder (see the sketch below)
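A hypothetical sketch of the data flow (module choices here are simplified stand-ins, e.g. GRUs instead of the real Conformer and autoregressive phoneme decoder), mainly to show how a single attention context is shared by the phoneme prediction and the spectrogram synthesis:

```python
import torch
import torch.nn as nn

class Translatotron2Sketch(nn.Module):
    def __init__(self, d_model=256, n_phonemes=100, n_mels=80):
        super().__init__()
        self.encoder = nn.GRU(n_mels, d_model, batch_first=True)              # stand-in for the Conformer encoder
        self.linguistic_decoder = nn.GRU(d_model, d_model, batch_first=True)  # stand-in for the phoneme decoder
        self.attention = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.phoneme_head = nn.Linear(d_model, n_phonemes)
        self.synthesizer = nn.Sequential(                                     # non-autoregressive synthesizer
            nn.Linear(2 * d_model, d_model), nn.ReLU(), nn.Linear(d_model, n_mels))

    def forward(self, src_mel, tgt_len):
        enc, _ = self.encoder(src_mel)                               # (B, S, d) speech encoding
        dec_in = torch.zeros(src_mel.size(0), tgt_len, enc.size(-1)) # placeholder decoder inputs
        dec, _ = self.linguistic_decoder(dec_in)                     # (B, T, d) decoder states
        ctx, _ = self.attention(dec, enc, enc)                       # single attention, shared by both outputs
        phonemes = self.phoneme_head(dec + ctx)                      # phoneme prediction
        mel = self.synthesizer(torch.cat([dec, ctx], dim=-1))        # spectrogram from decoder states + context
        return phonemes, mel
```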
4.2.2. Speech to Discrete-Unit Model
Model (VQ-VAE)
- train a VQ-VAE model on target-language speech
- learn a seq2seq model to map source-language spectrograms to target-language tokens
- synthesize the target-language tokens into a spectrogram and apply Griffin-Lim
Model (xl-vae)
- enhances the VQ-VAE model by adding a cross-lingual speech recognition task
- the quantizer is trained to reconstruct the target language as well as to perform ASR for several other languages
Model (speech-to-unit translation with SSL)
- apply a self-supervised encoder to the target speech and discretize its features into units (see the sketch below)
- train a speech-to-unit translation model on those units
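A minimal sketch of the unit-extraction step, assuming the discrete units come from k-means over self-supervised features (random arrays stand in for a HuBERT/wav2vec-style encoder here):

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_unit_codebook(feature_mats, n_units=100):
    """Fit k-means on a list of (T, D) self-supervised feature matrices from target speech."""
    km = KMeans(n_clusters=n_units, n_init=10, random_state=0)
    km.fit(np.concatenate(feature_mats, axis=0))
    return km

def to_units(km, features):
    """Map a (T, D) feature matrix to a sequence of discrete unit ids."""
    return km.predict(features)

# Stand-in for encoder outputs; a real pipeline would use actual SSL features.
feats = [np.random.randn(200, 768).astype(np.float32) for _ in range(3)]
codebook = learn_unit_codebook(feats)
units = to_units(codebook, feats[0])   # these ids become the targets of the speech-to-unit model
```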
5. Spoken Language Understanding
5.1. Cascade NLU
5.2. End-to-End SLU
Dataset (Fluent Speech Commands)
Each audio clip is labeled with three slots: action, object, and location. The accompanying model is trained on the dataset as follows
- the lower layers are pretrained using force-aligned phoneme/word targets
- the lower-layer targets are then discarded and the top layers are trained as a classification task by pooling the sequence outputs (see the sketch below)
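A minimal sketch of that classification stage, assuming illustrative feature and label sizes (the slot cardinalities and the GRU stand-in are assumptions, not the paper's exact model):

```python
import torch
import torch.nn as nn

class SLUClassifier(nn.Module):
    """Pools the top-layer sequence outputs and predicts the three slots independently."""
    def __init__(self, d_feat=40, d_model=256, n_action=6, n_object=14, n_location=4):
        super().__init__()
        # Stand-in for the pretrained lower layers whose phoneme/word targets were discarded.
        self.encoder = nn.GRU(d_feat, d_model, num_layers=2, batch_first=True)
        self.heads = nn.ModuleDict({
            "action": nn.Linear(d_model, n_action),
            "object": nn.Linear(d_model, n_object),
            "location": nn.Linear(d_model, n_location),
        })

    def forward(self, feats):                      # feats: (B, T, d_feat) acoustic features
        h, _ = self.encoder(feats)                 # (B, T, d_model)
        pooled = h.mean(dim=1)                     # pool over the time dimension
        return {name: head(pooled) for name, head in self.heads.items()}
```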
Other relevant papers
6. Reference
[0] original papers. All images are taken from the original papers or blogs
[1] CMU 11751: Speech Recognition and Understanding
[2] Lecture Note on Hybrid HMM/DNN
[3] Hung-yi Lee's lecture: Deep Learning for Human Language Processing