0x570 Representation

1. Embeddings

Here is a self-supervised learning review for speech

1.1. Nonparametric Bayesian Models

Classical Acoustic Unit Discovery using Dirichlet process mixture model

Model (Gibbs sampling) each mixture component is an HMM that models a subword unit and generates the observed segments of that unit

Gibbs sampling is used to approximate the posterior distribution

Model (Variational Inference) Use VI instead of Gibbs Sampling
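
A minimal sketch of the sampling idea, simplified here to a Dirichlet process mixture of 1-D Gaussians with known variance instead of HMM components; the hyperparameters (`alpha`, `sigma2`, the Normal prior) are illustrative assumptions, not the papers' settings. Each point is reassigned with Chinese-restaurant-process collapsed Gibbs sampling:

```python
import numpy as np

def normal_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def dpmm_gibbs(x, iters=50, alpha=1.0, sigma2=1.0, mu0=0.0, tau2=4.0, seed=0):
    """Collapsed Gibbs sampling for a DP mixture of 1-D Gaussians with known variance."""
    rng = np.random.default_rng(seed)
    z = np.zeros(len(x), dtype=int)               # start with everything in one cluster
    for _ in range(iters):
        for i in range(len(x)):
            z[i] = -1                              # remove point i from its cluster
            labels = [k for k in np.unique(z) if k != -1]
            probs = []
            for k in labels:
                members = x[z == k]
                n = len(members)
                # posterior predictive of x[i] under cluster k (Normal-Normal conjugacy)
                prec = 1.0 / tau2 + n / sigma2
                post_mean = (mu0 / tau2 + members.sum() / sigma2) / prec
                probs.append(n * normal_pdf(x[i], post_mean, 1.0 / prec + sigma2))
            # CRP term: open a new cluster with probability proportional to alpha
            probs.append(alpha * normal_pdf(x[i], mu0, tau2 + sigma2))
            probs = np.array(probs) / np.sum(probs)
            choice = rng.choice(len(probs), p=probs)
            z[i] = labels[choice] if choice < len(labels) else max(labels, default=-1) + 1
    return z

# toy data: two well-separated "units" should end up in (roughly) two clusters
data = np.concatenate([np.random.normal(-3, 1, 50), np.random.normal(3, 1, 50)])
print(np.unique(dpmm_gibbs(data), return_counts=True))
```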

1.2. Autoregressive Models

Model (CPC, Contrastive Predictive Coding) see the representation note

Model (CPC + Data Augmentation) applying augmentation to the past (context) signal is effective (see the sketch after this list):

  • pitch modification
  • additive noise
  • reverberation
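
A minimal sketch of the "augment the past" idea, assuming additive noise only: the context portion of the waveform that feeds the autoregressive encoder is perturbed while the future prediction targets stay clean. The split point and SNR are illustrative.

```python
import numpy as np

def augment_past(wave, split, snr_db=10.0, rng=np.random.default_rng(0)):
    """Add noise only to the past segment wave[:split]; the future (targets) stays clean."""
    past, future = wave[:split].copy(), wave[split:]
    noise = rng.standard_normal(len(past))
    # scale the noise to reach the requested signal-to-noise ratio on the past segment
    scale = np.sqrt(np.mean(past ** 2) / (10 ** (snr_db / 10) * np.mean(noise ** 2) + 1e-8))
    past += scale * noise
    return np.concatenate([past, future])

wave = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of a 440 Hz tone
augmented = augment_past(wave, split=8000)                  # noisy past, clean future
```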

Model (APC, Autoregressive Predictive Coding) use an RNN to predict the frame feature \(n\) steps ahead (see the sketch after this list)

  • \(n=3\) performs best on the phone classification task
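
A minimal PyTorch sketch of the APC objective, with illustrative shapes and an L1 loss between the RNN output at time \(t\) and the input frame at \(t + n\):

```python
import torch
import torch.nn as nn

class APC(nn.Module):
    """Predict the input frame n steps ahead from an autoregressive RNN."""
    def __init__(self, feat_dim=80, hidden=512, n_ahead=3):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, feat_dim)
        self.n = n_ahead

    def forward(self, x):                      # x: (batch, time, feat_dim) log-mel frames
        h, _ = self.rnn(x)
        pred = self.proj(h)
        # align the prediction at t with the target frame at t + n
        return nn.functional.l1_loss(pred[:, :-self.n], x[:, self.n:])

loss = APC()(torch.randn(4, 200, 80))          # toy batch of 200-frame utterances
loss.backward()
```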

1.3. Generative Model

Model (convolutional VAE) a convolutional VAE; it proposes an interesting approach to modifying speech attributes by shifting the VAE's posterior (section 4.2)

Model (hierarchical VAE)

Model (VQ-VAE) compares three different autoencoder bottlenecks: VQ-VAE, VAE, and dimensionality reduction

The conclusion is that, among the three bottlenecks evaluated, VQ-VAE discards the most speaker-related information at the bottleneck while preserving the most phonetic information

1.4. Masked Model

Model (vq-wav2vec)

  • First, train a quantization model using a future-prediction task.
  • Then use the resulting discrete tokens to pretrain a BERT model.

Model (wav2vec2)

Architecture

  • Step 1 (local representation): The model first has a feature encoder \(f: X \to Z\) that maps raw audio to latent speech representations, producing \(z_1, ..., z_T\). The feature encoder is a multi-layer convolutional network; the features \(z_i\) are local features.

  • Step 2 (contextualized representation): A Transformer builds contextualized representations \(c_1, ..., c_T\), which capture broader, global information.

  • Step 3 (quantization): the local representations \(z_1, ..., z_T\) are quantized to \(q_1, ..., q_T\) using product quantization.

Masking

  • sample a proportion (\(p = 0.065\)) of all time steps as starting indices and mask the following 10 consecutive steps (spans may overlap); see the sketch below.
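
A minimal sketch of the span masking, assuming each sampled start simply masks the next 10 steps with overlaps allowed (with these values roughly half of the steps end up masked):

```python
import numpy as np

def sample_mask(num_steps, p_start=0.065, span=10, rng=np.random.default_rng(0)):
    """Boolean mask over time steps: each step starts a masked span with prob p_start."""
    mask = np.zeros(num_steps, dtype=bool)
    starts = np.where(rng.random(num_steps) < p_start)[0]
    for s in starts:
        mask[s:s + span] = True               # spans may overlap or run past the end
    return mask

mask = sample_mask(500)
print(mask.mean())                            # roughly 0.49 of the steps are masked
```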

Objective

Loss (contrastive loss) the masked context representation \(c_t\) should be similar to the true quantized \(q_t\) rather than to the \(K\) distractors.

\[\mathcal{L} = - \log \frac{\exp(sim(c_t, q_t))}{\sum_{q \sim Q} \exp(sim(c_t, q))}\]

where \(Q\) contains the target \(q_t\) and \(K\) distractors.

Loss (diversity loss) maximize the entropy of the average codebook usage so that all codebook entries are used; a sketch of both losses follows.
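
A minimal PyTorch sketch of both objectives; the temperature, the distractor count, and the codebook size are illustrative assumptions, and the diversity term is written as the negated entropy of the average codebook usage so that minimizing it maximizes entropy:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(c_t, q_t, distractors, temp=0.1):
    """c_t, q_t: (dim,); distractors: (K, dim) quantized vectors from other masked steps."""
    candidates = torch.cat([q_t.unsqueeze(0), distractors], dim=0)            # (K+1, dim)
    sims = F.cosine_similarity(c_t.unsqueeze(0).expand_as(candidates), candidates, dim=-1) / temp
    # the true q_t sits at index 0, so a cross entropy against class 0 is the -log softmax above
    return F.cross_entropy(sims.unsqueeze(0), torch.zeros(1, dtype=torch.long))

def diversity_loss(code_probs):
    """code_probs: (batch * time, num_codes) softmax over codebook entries.
    Returns the negated entropy of the average usage, so minimizing it spreads usage out."""
    avg = code_probs.mean(dim=0)
    return (avg * torch.log(avg + 1e-7)).sum()

c_t, q_t = torch.randn(256), torch.randn(256)
loss = contrastive_loss(c_t, q_t, torch.randn(100, 256)) \
       + 0.1 * diversity_loss(torch.softmax(torch.randn(64, 320), dim=-1))
```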

Fine-tuning

  • add a randomly initialized linear layer to project the context features onto the vocabulary

Reference: facebook blog

Model (XLSR) Extending wav2vec2 to multilingual settings.

Model (w2v-BERT)

Similar to wav2vec2, but w2v-BERT has both a contrastive loss and an MLM loss (cross entropy for masked prediction)

The idea of w2v-BERT is to use

  • first the contrastive task defined in wav2vec 2.0 to obtain an inventory of a finite set of discriminative, discretized speech units
  • then use them as targets in a masked prediction task, similar to the masked language modeling (MLM) objective in BERT, to learn contextualized speech representations

Model (HuBERT, Hidden-Unit BERT)

Model (BEST-RQ)

BERT-based Speech pre-Training with Random-projection Quantizer
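
A minimal sketch of the random-projection quantizer: a frozen random matrix projects each frame, and the nearest entry of a frozen random codebook gives the discrete target for masked prediction (dimensions are illustrative; the paper's feature normalization is omitted):

```python
import numpy as np

class RandomProjectionQuantizer:
    """Frozen random projection + frozen random codebook; indices are BERT-style targets."""
    def __init__(self, feat_dim=80, code_dim=16, num_codes=8192, seed=0):
        rng = np.random.default_rng(seed)
        self.proj = rng.standard_normal((feat_dim, code_dim))       # never trained
        self.codebook = rng.standard_normal((num_codes, code_dim))  # never trained

    def __call__(self, frames):                                     # frames: (time, feat_dim)
        projected = frames @ self.proj                               # (time, code_dim)
        # squared Euclidean distance to every codebook entry, then take the nearest
        d2 = (projected ** 2).sum(1, keepdims=True) \
             - 2 * projected @ self.codebook.T + (self.codebook ** 2).sum(1)
        return d2.argmin(axis=1)                                     # (time,) discrete labels

labels = RandomProjectionQuantizer()(np.random.randn(200, 80))      # targets for masked frames
```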

Model (denoising model, WavLM) combines masked speech prediction and denoising in pretraining

  • inputs are simulated noisy/overlapped speech with masks (see the mixing sketch after this list)
  • the target is to predict the pseudo-labels of the original speech on the masked regions, as in HuBERT
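
A minimal sketch of simulating an overlapped input, assuming a secondary utterance is mixed into a random region of the primary one at a random gain; the pseudo-labels are still taken from the clean primary speech:

```python
import numpy as np

def simulate_overlap(primary, secondary, rng=np.random.default_rng(0)):
    """Mix a chunk of `secondary` into a random region of `primary` at a random gain."""
    noisy = primary.copy()
    chunk_len = rng.integers(1, len(primary) // 2)            # overlap length in samples
    start = rng.integers(0, len(primary) - chunk_len)
    gain = 10 ** (rng.uniform(-5, 5) / 20)                     # random mixing gain (dB)
    noisy[start:start + chunk_len] += gain * secondary[:chunk_len]
    return noisy                               # pseudo-labels still come from `primary`

noisy_input = simulate_overlap(np.random.randn(16000), np.random.randn(16000))
```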

1.5. Analysis

Analysis (linguistic information) this work and this work analyze the linguistic information encoded in different layers of wav2vec2

Analysis (discrete vs continuous) a discretized bottleneck seems to be important for learning a good spoken language model

Metric (Minimal-Pair ABX, phonetic level) A (/aba/) and B (/apa/) are token representations from the same speaker, and X (/aba/) is a representation from (possibly) another speaker; A and X should be more similar than B and X.

The similarity or distance can be computed, for example, using the frame-wise angle along the DTW path

This was used in the Zero Resource Speech Challenges (e.g., 2020)
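
A minimal sketch of the ABX decision with the frame-wise angular distance averaged along a plain dynamic-programming DTW alignment (the path-length normalization here is a simplification):

```python
import numpy as np

def frame_dist(a, b):
    """Angular distance between two frame vectors, normalized to [0, 1]."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    return np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi

def dtw_distance(x, y):
    """Average frame-wise distance along the best DTW alignment of (T1, d) and (T2, d)."""
    T1, T2 = len(x), len(y)
    cost = np.full((T1 + 1, T2 + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            d = frame_dist(x[i - 1], y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[T1, T2] / (T1 + T2)            # simple path-length normalization

def abx_correct(a, b, x):
    """True if X (same category as A) is closer to A than to B."""
    return dtw_distance(a, x) < dtw_distance(b, x)

a, b, x = (np.random.randn(20, 16) for _ in range(3))   # toy frame sequences
print(abx_correct(a, b, x))
```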

Metric (spot-the-word, lexical) given a pair of word clips (e.g., brick and blick), the model needs to classify which one is a real word

1.6. Downstream Tasks

Model (speaker verification and language identification) using wav2vec2

Model (resynthesis)

2. Tokenizer

Model (soundstream) encoder-decoder codec model
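
SoundStream quantizes the encoder output with a residual vector quantizer (RVQ), where each stage encodes the residual left by the previous one. A minimal sketch with illustrative sizes and random (untrained) codebooks:

```python
import numpy as np

class ResidualVQ:
    """Stack of quantizers; stage i encodes the residual left over by stages 0..i-1."""
    def __init__(self, num_stages=8, num_codes=1024, dim=128, seed=0):
        rng = np.random.default_rng(seed)
        self.codebooks = [rng.standard_normal((num_codes, dim)) for _ in range(num_stages)]

    def encode(self, z):                                       # z: (time, dim) encoder output
        residual, codes = z.copy(), []
        for cb in self.codebooks:
            d2 = (residual ** 2).sum(1, keepdims=True) - 2 * residual @ cb.T + (cb ** 2).sum(1)
            idx = d2.argmin(axis=1)                            # (time,) token ids for this stage
            codes.append(idx)
            residual = residual - cb[idx]                      # next stage quantizes what is left
        return np.stack(codes, axis=0)                         # (num_stages, time) token streams

tokens = ResidualVQ().encode(np.random.randn(150, 128))
```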

Model (encodec)

3. Multimodal Features

3.1. Speech-Text joint features

Model (SLAM)

Pretraining objectives

  • self-supervised objectives: BERT + w2v-BERT
  • alignment loss:
    • translation language modeling: concatenate speech + transcript and predict masked text or speech
    • speech-text matching: predict whether a text/speech pair is matched (see the sketch after this list)
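
A minimal sketch of a speech-text matching head, assuming pooled features from the shared encoder feed a binary matched/unmatched classifier; the shapes and the pooling are illustrative, not SLAM's exact design:

```python
import torch
import torch.nn as nn

class SpeechTextMatching(nn.Module):
    """Binary classifier: does this (speech, text) pair belong together?"""
    def __init__(self, dim=768):
        super().__init__()
        self.head = nn.Linear(dim, 2)

    def forward(self, speech_feats, text_feats, matched):
        # speech_feats: (B, Ts, dim), text_feats: (B, Tt, dim) from the shared encoder
        joint = torch.cat([speech_feats, text_feats], dim=1).mean(dim=1)  # pooled joint repr
        return nn.functional.cross_entropy(self.head(joint), matched)

loss = SpeechTextMatching()(torch.randn(8, 100, 768), torch.randn(8, 20, 768),
                            torch.randint(0, 2, (8,)))
```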
