0x570 Representation
1. Embeddings
Here is a review of self-supervised learning for speech
1.1. Nonparametric Bayesian Models
Classical acoustic unit discovery using a Dirichlet process mixture model
Model (Gibbs sampling) each mixture component is an HMM that models a subword unit and generates the observed segments of that unit
Gibbs sampling is used to approximate the posterior distribution
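To make the sampler concrete, here is a toy collapsed Gibbs sweep for a Dirichlet process mixture of 1-D Gaussians; in the acoustic unit discovery work each component is an HMM over variable-length segments rather than a single Gaussian, and all hyper-parameters below are illustrative, not from the paper.

```python
import numpy as np

def gibbs_sweep(x, z, alpha=1.0, sigma2=1.0, mu0=0.0, tau02=4.0,
                rng=np.random.default_rng(0)):
    """One collapsed Gibbs sweep over cluster assignments z for data x
    (DP mixture of 1-D Gaussians, known variance, conjugate prior on means)."""
    for i in range(len(x)):
        z[i] = -1                                      # remove x[i] from its cluster
        labels = [k for k in set(z) if k >= 0]
        logp = []
        for k in labels:                               # existing clusters
            members = x[np.array(z) == k]
            tau_n2 = 1.0 / (1.0 / tau02 + len(members) / sigma2)
            mu_n = tau_n2 * (mu0 / tau02 + members.sum() / sigma2)
            var = tau_n2 + sigma2                      # posterior predictive variance
            logp.append(np.log(len(members)) - 0.5 * (x[i] - mu_n) ** 2 / var - 0.5 * np.log(var))
        var0 = tau02 + sigma2                          # prior predictive for a new cluster
        logp.append(np.log(alpha) - 0.5 * (x[i] - mu0) ** 2 / var0 - 0.5 * np.log(var0))
        logp = np.array(logp)
        p = np.exp(logp - logp.max()); p /= p.sum()
        choice = rng.choice(len(p), p=p)
        z[i] = labels[choice] if choice < len(labels) else (max(labels, default=-1) + 1)
    return z

# usage: z = gibbs_sweep(np.random.randn(50), [0] * 50)
```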
Model (Variational Inference) uses variational inference instead of Gibbs sampling
1.2. Autoregressive Models
Model (CPC, Contrastive Predictive Coding) see the representation note
Model (CPC + Data Augmentation) applying augmentation to the past (context) is effective; augmentations include
- pitch modification
- additive noise
- reverberation
Model (APC, Autoregressive Predictive Coding) use an RNN to predict the frame feature \(n\) steps ahead
- \(n=3\) performs best on the phone classification task
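A minimal sketch of the APC idea; the layer sizes and the L1 regression loss below are illustrative choices, not exact paper settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class APC(nn.Module):
    """Predict the input frame `shift` steps ahead with an autoregressive RNN."""
    def __init__(self, feat_dim=80, hidden=512, shift=3):
        super().__init__()
        self.shift = shift                                # n, the prediction offset
        self.rnn = nn.GRU(feat_dim, hidden, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, feat_dim)

    def forward(self, x):                                 # x: (batch, time, feat_dim)
        h, _ = self.rnn(x)                                # h_t summarizes x_1..x_t
        pred = self.proj(h)
        # the prediction at step t is compared against the frame at t + shift
        return F.l1_loss(pred[:, :-self.shift], x[:, self.shift:])

# usage: loss = APC()(torch.randn(4, 200, 80)); loss.backward()
```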
1.3. Generative Model
Model (convolutional VAE) proposes an interesting approach to modifying speech attributes by shifting the VAE's posterior (section 4.2)
Model (hierarchical VAE)
Model (VQ-VAE) compares three different autoencoder bottlenecks: VQ-VAE, VAE, and dimensionality reduction
The conclusion is that, among the three bottlenecks evaluated, VQ-VAE discards the most speaker-related information at the bottleneck while preserving the most phonetic information
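For reference, a minimal vector-quantization bottleneck with a straight-through gradient, the mechanism compared above; the codebook size and loss weight are illustrative.

```python
import torch
import torch.nn as nn

class VQBottleneck(nn.Module):
    """Snap each frame feature to its nearest codebook vector."""
    def __init__(self, num_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                                  # z: (batch, time, dim)
        flat = z.reshape(-1, z.size(-1))                   # (batch*time, dim)
        dists = torch.cdist(flat, self.codebook.weight)    # distances to all codes
        idx = dists.argmin(-1).view(z.shape[:-1])          # nearest code per frame
        q = self.codebook(idx)
        codebook_loss = ((q - z.detach()) ** 2).mean()     # pull codes toward encoder output
        commit_loss = ((q.detach() - z) ** 2).mean()       # keep encoder close to codes
        q = z + (q - z).detach()                           # straight-through estimator
        return q, idx, codebook_loss + 0.25 * commit_loss
```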
1.4. Masked Model
Model (vq-wav2vec)
- First train a quantization model using a future-prediction task.
- Then use the resulting discrete tokens to pretrain a BERT model
Model (wav2vec2)
Architecture
- Step 1 (local representation): a convolutional feature encoder maps raw audio to latent speech representations \(f: X \to Z\), producing \(z_1, \dots, z_T\); these \(z_t\) are local features
- Step 2 (contextualized representation): a Transformer builds contextualized representations \(c_1, \dots, c_T\) that capture broader, global information
- Step 3 (quantization): the local representations \(z_1, \dots, z_T\) are quantized to \(q_1, \dots, q_T\) using product quantization
Masking
- sample a proportion (\(p = 0.065\)) of all time steps as starting indices and mask the following 10 consecutive steps (spans may overlap)
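A small sketch of this span masking; the paper samples starts without replacement, so the i.i.d. sampling here is a simplification.

```python
import numpy as np

def sample_mask(num_steps, p=0.065, span=10, rng=np.random.default_rng()):
    """Pick ~p of all time steps as span starts and mask the next `span` steps."""
    starts = rng.random(num_steps) < p
    mask = np.zeros(num_steps, dtype=bool)
    for t in np.flatnonzero(starts):
        mask[t:t + span] = True                 # spans may overlap
    return mask

# masked positions are replaced by a learned mask embedding before the Transformer
```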
Objective
Loss (contrastive loss) the masked context representation \(c_t\) should be more similar to the true quantized target \(q_t\) than to the \(K\) distractors:
\( \mathcal{L}_m = -\log \frac{\exp(\mathrm{sim}(c_t, q_t)/\kappa)}{\sum_{\tilde{q} \in Q_t} \exp(\mathrm{sim}(c_t, \tilde{q})/\kappa)} \)
where \(Q_t\) contains the target \(q_t\) and the \(K\) distractors, \(\mathrm{sim}\) is cosine similarity, and \(\kappa\) is a temperature
Loss (diversity loss) encourages equal use of all codebook entries by maximizing the entropy of the averaged softmax distribution over the codebook
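A per-time-step sketch of the contrastive term; the temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(c_t, q_t, distractors, temperature=0.1):
    """c_t: (D,) context vector; q_t: (D,) true quantized target; distractors: (K, D)."""
    candidates = torch.cat([q_t[None], distractors], dim=0)       # target at index 0
    sims = F.cosine_similarity(c_t.unsqueeze(0).expand_as(candidates),
                               candidates, dim=-1)                # (K + 1,)
    logits = (sims / temperature)[None]                           # (1, K + 1)
    return F.cross_entropy(logits, torch.zeros(1, dtype=torch.long))
```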
Fine-tuning
- add a randomly initialized linear layer that projects the context representations onto the vocabulary, trained with a CTC loss
Reference: facebook blog
Model (XLSR) Extending wav2vec2 to multilingual settings.
Model (w2v-BERT)
Similar to wav2vec2, but w2v-BERT has both a contrastive loss and an MLM loss (cross-entropy for masked prediction)
The idea of w2v-BERT is to
- first use the contrastive task defined in wav2vec 2.0 to obtain an inventory of a finite set of discriminative, discretized speech units
- then use them as targets in a masked prediction task, similar to the masked language modeling (MLM) proposed in BERT, to learn contextualized speech representations
Model (HuBERT, Hidden-Unit BERT) masked prediction of discrete pseudo-labels obtained by offline k-means clustering of acoustic features; clustering and masked prediction are alternated over iterations
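A sketch of the pseudo-label generation step; the feature choice and cluster count here are illustrative stand-ins.

```python
import numpy as np
from sklearn.cluster import KMeans

# offline k-means on acoustic features; the cluster ids become the
# masked-prediction targets for the first HuBERT-style iteration
feats = np.random.randn(10000, 39)              # stand-in for MFCC frames of a corpus
kmeans = KMeans(n_clusters=100, n_init=10).fit(feats)
pseudo_labels = kmeans.predict(feats)           # one discrete target per frame
```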
Model (BEST-RQ)
BERT-based Speech pre-Training with Random-projection Quantizer: a frozen, randomly initialized projection and codebook map speech frames to discrete labels, which serve as targets for masked prediction
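A sketch of the random-projection quantizer; sizes are illustrative, and the paper's additional feature normalization is omitted here.

```python
import torch

class RandomProjectionQuantizer:
    """Frozen random projection + frozen random codebook -> discrete frame labels."""
    def __init__(self, feat_dim=80, code_dim=16, num_codes=8192, seed=0):
        g = torch.Generator().manual_seed(seed)
        self.proj = torch.randn(feat_dim, code_dim, generator=g)       # never trained
        self.codebook = torch.randn(num_codes, code_dim, generator=g)  # never trained

    def __call__(self, x):                       # x: (time, feat_dim) speech features
        z = x @ self.proj
        # nearest codebook entry per frame becomes the masked-prediction target
        return torch.cdist(z, self.codebook).argmin(-1)
```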
Model (denoising model, WavLM) combines masked speech prediction and denoising in pretraining
- inputs are simulated noisy/overlapped speech with masks
- the target is to predict the pseudo-labels of the original (clean) speech on the masked regions, as in HuBERT
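A rough sketch of the input simulation; the SNR range and placement logic are illustrative, not the paper's exact recipe.

```python
import torch

def simulate_overlap(wav, interferer, snr_db_range=(-5.0, 5.0)):
    """Mix an interfering utterance (or noise) into `wav` at a random SNR;
    the training targets remain the pseudo-labels of the clean `wav`.
    Assumes `interferer` is not longer than `wav`."""
    snr_db = float(torch.empty(1).uniform_(*snr_db_range))
    gain = torch.sqrt(wav.pow(2).mean()
                      / (interferer.pow(2).mean().clamp_min(1e-8) * 10 ** (snr_db / 10)))
    start = int(torch.randint(0, max(1, wav.numel() - interferer.numel() + 1), (1,)))
    noisy = wav.clone()
    noisy[start:start + interferer.numel()] += gain * interferer
    return noisy

# e.g. noisy = simulate_overlap(torch.randn(16000), torch.randn(8000))
```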
1.5. Analysis
Analysis (linguistic information) this work and this work analyze the linguistic information encoded in different layers of wav2vec2
Analysis (discrete vs continuous) a discretized bottleneck seems to be important for learning a good spoken language model
Metric (Minimal-Pair ABX, phonetic level) A (/aba/) and B (/apa/) are token representations from the same speaker; X (/aba/) is a representation of the same token from (possibly) another speaker; A and X should be more similar than B and X
The similarity or distance can be computed, for example, as the frame-wise angle averaged along the DTW alignment path
This was used in the ZeroResource Speech Challenges (e.g. 2020)
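A minimal sketch of one ABX trial using frame-wise angular distance averaged along a DTW path.

```python
import numpy as np

def dtw_angular_distance(a, b):
    """Mean frame-wise angle along the DTW path between sequences a (Ta, D), b (Tb, D)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    d = np.arccos(np.clip(a @ b.T, -1.0, 1.0))             # pairwise frame distances
    Ta, Tb = d.shape
    cost = np.full((Ta, Tb), np.inf)
    steps = np.zeros((Ta, Tb), dtype=int)
    cost[0, 0], steps[0, 0] = d[0, 0], 1
    for i in range(Ta):
        for j in range(Tb):
            if i == 0 and j == 0:
                continue
            best, n = np.inf, 0
            for pi, pj in ((i - 1, j), (i, j - 1), (i - 1, j - 1)):
                if pi >= 0 and pj >= 0 and cost[pi, pj] < best:
                    best, n = cost[pi, pj], steps[pi, pj]
            cost[i, j], steps[i, j] = d[i, j] + best, n + 1
    return cost[-1, -1] / steps[-1, -1]                     # mean distance along the path

def abx_trial_correct(a, b, x):
    """True if X is closer to A (the same phone as X) than to B."""
    return dtw_angular_distance(a, x) < dtw_angular_distance(b, x)
```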
Metric (spot-the-word, lexical) given a pair of word clips (e.g. brick and blick), the model needs to classify which one is a real word
1.6. Downstream Tasks
Model (speaker verification and language identification) using wav2vec2
Model (resynthesis)
2. Tokenizer
Model (soundstream) encoder-decoder codec model
Model (encodec)
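Both codecs quantize the encoder output with residual vector quantization; a minimal sketch (codebook sizes and stage count are illustrative).

```python
import torch

def residual_vq(z, codebooks):
    """Each stage quantizes the residual left by the previous stage.
    z: (time, dim); codebooks: list of (num_codes, dim) tensors."""
    residual, quantized, ids = z, torch.zeros_like(z), []
    for cb in codebooks:
        idx = torch.cdist(residual, cb).argmin(-1)   # nearest code per frame
        q = cb[idx]
        ids.append(idx)                              # one code id per frame per stage
        quantized = quantized + q
        residual = residual - q
    return quantized, ids

# e.g. quantized, ids = residual_vq(torch.randn(150, 128),
#                                   [torch.randn(1024, 128) for _ in range(8)])
```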
3. Multimodal Features
3.1. Speech-Text joint features
Model (SLAM)
Pretraining objectives
- self-supervised objectives: BERT + w2v-BERT
- alignment losses:
  - translation language modeling: concatenate speech and transcript, and predict masked text or speech
  - speech-text matching: predict whether a text/speech pair is matched