Skip to content

0x561 Task

1. Speech Processing

1.1. Speech Separation

Deep Cluster (Hershey et al. 2016) achieves speaker separation by building spectrogram embeddings that are discriminative for speaker partitioning with k-means.


where \(V\) is embedding and \(Y\) is label indicator (can be deterministically decided). Embedding is built with minimizing the distance between affinity matrices (equivalent to push embedding from similar labels closer and different labels away)

1.2. Speech Enhancement

1.2.1. Time Domain Models

Model (Demucs)

  • encoder-decoder with U-net
  • loss is defined over clean signal and enhanced signal


1.2.2. Time-Frequency Domain Models

Model (FullSubNet) minimize complex Ratio Mask and Idea Ratio Mask

1.3. Voice Conversion

Conversion does not need to be speaker conversion, speaking style can also be converted (e.g. emotion, whisper/normal). This topic is similar to image style transfer in CV

1.3.1. Feature Disentangle

1.3.2. Direct Transformation



Model (Parrotron)

2. Speech Synthesis

2.1. Autoregressive TTS

Model (Tacotron) attention-based s2s model

  • input: character
  • output: linear spectrogram
  • vocoder: Griffin Lim



encoder: roughly corresponding to grapheme-to-phoneme model

  • prenet: FFN, dropout
  • CBHG: conv1d + max-pool along time + highway network + GRU

attention: roughly corresponding to modeling duration

decoder: audio synthesis,

  • RNN: each step can generate multiple frames (3,5 frames in v1, only 1 frame in v2)
  • prenet: training was done using teacher forcing. but dropout works like the schedule sampling
  • postprocessing: non-causal CBHG normalizing autoregressive outputs. loss is applied to both before and after the postprocessing steps

Model (Tacotron 2)

  • input: character
  • output: melspectrogram
  • vocoder: a modified WaveNet


Model (Non-Attentive Tacotron)

it replaces the attention mechanism with the duration predictor

Gaussian Upsampling

  • Given \([h_1, ..., h_n]\) and duration values \([d_1, ..., d_n]\) and range parameter \([\sigma_1, ..., \sigma_n]\), the upsampled vector \([u_1, ..., u_t]\) is computed by placing Gaussian distribution to each segment

Unsupervised duration modeling

  • extract alignment between input and output using fine-grained VAE similar to this work

Model (VALL-E)


2.2. Non-Autoregressive TTS

Model (Parallel Wavenet) A trained WaveNet model is used as a teacher for a feedforward IAF (inverse autoregressive flow) student

The probability density distillation loss is the KL divergence

\[D_{KL}(P_S || P_T) = H(P_S, P_T) - H(P_S)\]

Note that the entropy term \(H(P_S)\) is necessary. Otherwise it will collapse to the teacher's mode (mostly silence, see Appendix A.1 in the paper)

parallel wavenet

Model (FastSpeech) upsample the input sequence by a duration prediction model


Duration Modeling

  • Duration Predictor:

    • 2-layer 1d-conv
    • ground truth is extracted from a encoder-decoder transformer TTS's attention alignment. head is choosed using the most diagonal-like attention.
  • Length Regulator:

    • expand the hidden state of phoneme sequences by the predicted duration.
    • For example hidden states of 4 phonemes are \([h_1, h_2, h_3, h_4]\), its duration are \([2,2,3,1]\), then it expands to \([h_1, h_1, h_2, h_2, h_3, h_3, h_3, h_1]\). A hyperparameter \(\alpha\) can be used to control the voice speed by modifying duration.

Model (FastSpeech 2)


Duration Predictor

  • use Montreal force aligner, accurate than the teacher alignment in the original model

Issue (oversmoothing)

2.3. Multispeaker TTS

Model (Deep Voice 2) using speaker embedding, a single neural TTS system can learn hundreds of unique voices from less than half an hour of data per speaker

2.4. Multilingual TTS

Model (remapping input symbol) learn a mapping between source and target linguistic symbols.

Model (byte2speech) map byte to spectrogram, it can adapt to new languages with merely 40s transcribed recording

3. Speech Recognition

4. Speech Translation

There are roughly three types for speech translation systems

  • cascaded system: ASR + MT + TTS (F speech -> F text -> E text -> E speech)
  • speech-text system: F speech -> E text -> E speech
  • speech-speech system: F speech -> E speech

4.1. Speech-to-Text Model

4.2. Speech-to-Speech Model

Speech-to-Speech models do not rely on text generation as a intermediate step, it is a natural approach for languages without a writing system.

4.2.1. Speech to Spectrogram Model

Model (Translatotron 1)


Model (Translatotron 2)

Translatotron 2 improves 1 wrt the following three weakness:

  • phoneme alignment is not used by the main task
  • long sequence to long spectrogram sequence with attention is difficult to train

The new version has the following component:

  • speech encoder: mel-spectrogram to hidden representation using Conformer
  • linguistic decoder: encoder output to predict phoneme sequence of translation speech
  • acoustic synthesizer: takes the decoder output (before final) and context from attention, it generates spectrogram by non-autoregressive model. Attention is shared with the linguistic decoder


4.2.2. Speech to Discrete-Unit Model

Model (vq-vae)

  • train a vq-vae model of the target language
  • learn a s2s to map source lang spectrogram to target lang token
  • synthesize target lang token into spectrogram and apply Griffin-Lim

Model (xl-vae)

  • enhance vq-vae model by adding cross-lingual speech recognition task
  • the quantizer aims at reconstructing the target language as well as some other languages' asr task

Model (speech-to-unit translation with ssl)

  • Apply self-supervised encoder to the target speech
  • Train a speech-to-unit translation model

5. Spoken Language Understanding

5.1. Cascade NLU

5.2. End-to-End SLU

Dataset (Fluent Speech Command)

Each audio clip has three slots: action, object and location. The dataset is trained on the following model

  • the lower layers are pretrained using force-aligned phonemes/words
  • lower-layer target is discarded and the top layer is trained as a classification task by pooling sequence outputs.


Other relevant papers

6. Reference

[0] original papers. All images are taken from the original papers or blogs

[1] CMU 11751: Speech Recognition and Understanding

[2] Lecture Note on Hybrid HMM/DNN

[3] Hung-yi Lee's lecture: Deep Learning for Human Language Processing