0x441 Representations
1. Classical Features
1.1. Acoustic Features
Depending on the model, we might use different features. In traditional models, one example feature set is (see the extraction sketch after the list):
- 40 to 60 parameters per frame to represent the spectral envelope
- value of F0
- 5 parameters to describe the spectral envelope of the aperiodic excitation
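A minimal sketch of extracting such a feature set with a WORLD-style vocoder, assuming the soundfile, pyworld, pysptk, and numpy packages (the file path, mel-cepstrum order, and warping factor are illustrative choices, not fixed by the list above):

```python
# Hedged sketch: extract a WORLD-style vocoder feature set roughly matching the
# list above. "sample.wav" is a placeholder for a mono recording.
import numpy as np
import soundfile as sf
import pyworld as pw
import pysptk

x, fs = sf.read("sample.wav")       # float64 mono waveform, as expected by WORLD
f0, sp, ap = pw.wav2world(x, fs)    # F0 contour, spectral envelope, aperiodicity

# Reduce the high-dimensional spectral envelope to ~40 mel-cepstral coefficients
# per frame (order=39 -> 40 coefficients; alpha is the frequency-warping factor,
# 0.58 being a common choice for 16 kHz audio).
mgc = np.apply_along_axis(pysptk.sp2mc, 1, sp, 39, 0.58)

# Compress aperiodicity into a few band-aperiodicity parameters
# (e.g. around 5 bands for 48 kHz audio, matching the "5 parameters" above).
bap = pw.code_aperiodicity(ap, fs)

print(mgc.shape, f0.shape, bap.shape)   # (n_frames, 40), (n_frames,), (n_frames, n_bands)
```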
Feature (GeMAPS, Geneva Minimalistic Acoustic Parameter Set)
This work suggests a minimal set of acoustic descriptors as follows (extracted from the paper; see the extraction sketch after the list):
Frequency related parameters:
- Pitch, logarithmic F0 on a semitone frequency scale, starting at 27.5 Hz (semitone 0).
- Jitter, deviations in individual consecutive F0 period lengths.
- Formant 1, 2, and 3 frequency, centre frequency of the first, second, and third formant.
- Formant 1 bandwidth, bandwidth of the first formant.
Energy/Amplitude related parameters:
- Shimmer, difference of the peak amplitudes of consecutive F0 periods.
- Loudness, estimate of perceived signal intensity from an auditory spectrum.
- Harmonics-to-Noise Ratio (HNR), relation of energy in harmonic components to energy in noiselike components.
Spectral (balance) parameters:
- Alpha Ratio, ratio of the summed energy from 50–1000 Hz and 1–5 kHz.
- Hammarberg Index, ratio of the strongest energy peak in the 0–2 kHz region to the strongest peak in the 2–5 kHz region.
- Spectral Slope 0–500 Hz and 500–1500 Hz, linear regression slope of the logarithmic power spectrum within the two given bands.
- Formant 1, 2, and 3 relative energy, the ratio of the energy of the spectral harmonic peak at the first, second, third formant’s centre frequency to the energy of the spectral peak at F0.
- Harmonic difference H1–H2, ratio of energy of the first F0 harmonic (H1) to the energy of the second F0 harmonic (H2).
- Harmonic difference H1–A3, ratio of energy of the first F0 harmonic (H1) to the energy of the highest harmonic in the third formant range (A3).
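Assuming the opensmile Python package (by audEERING), the GeMAPS/eGeMAPS descriptors above can be pulled out roughly like this; the file path is a placeholder and the exact feature-set/level names should be checked against the package version:

```python
# Hedged sketch: frame-level GeMAPS/eGeMAPS descriptors via the opensmile package.
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,                 # extended GeMAPS set
    feature_level=opensmile.FeatureLevel.LowLevelDescriptors,    # per-frame descriptors
)
llds = smile.process_file("sample.wav")   # pandas DataFrame, one row per frame
print(llds.columns.tolist())              # pitch, jitter, shimmer, loudness, HNR, alpha ratio, ...
```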
1.2. Linguistic Features
In neural TTS, the linguistic features are just graphemes, subwords, or words, but traditional TTS uses more detailed linguistic features.
The input \(w\) is usually transformed into linguistic features or linguistic specification. This could be as simple as a phoneme sequence, but for better results it will need to include supra-segmental information such as the prosody pattern of the speech to be produced. In other words, the linguistic specification comprises whatever factors might affect the acoustic realisation of the speech sounds making up the utterance.
For example, in HTS, the lab file might contain features such as the following (sketched below).
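Since the actual lab file is not reproduced here, below is a hedged, simplified sketch of what an HTS-style full-context label looks like and how its quinphone part can be read; a real label carries many more fields (syllable, word, phrase, and utterance-level features) and the exact layout depends on the recipe's question set:

```python
# Hedged, simplified illustration of an HTS-style full-context label.
# The string below is schematic, not copied from a real file.
label = "x^x-sil+dh=ax@1_2/A:0_0_0/B:1-1-3@1-2/C:1+1+2"

quinphone = label.split("@")[0]            # "x^x-sil+dh=ax"
ll, rest = quinphone.split("^")            # left-left phone
l, rest = rest.split("-")                  # left phone
c, rest = rest.split("+")                  # current phone
r, rr = rest.split("=")                    # right and right-right phones
print(ll, l, c, r, rr)                     # x x sil dh ax
```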
1.3. Phoneme Features
Phoneme-based TTS models may suffer from the ambiguity of the representation, for example prosody differences between homophones.
Consider the sentence: To cancel the payment, press one; or to continue, two.
The last word two can be confused with too, whose phoneme sequence is the same but whose prosody is different (pronounced with a preceding pause or not).
Listen to the first samples for the difference here.
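As a small illustration (ARPAbet pronunciations as in CMUdict-style transcription; the snippet itself is just illustrative):

```python
# "two" and "too" share the same phoneme sequence, so a purely phoneme-based
# input cannot signal which prosody (e.g. a preceding pause) is intended.
pronunciations = {
    "two": ["T", "UW1"],
    "too": ["T", "UW1"],
}
assert pronunciations["two"] == pronunciations["too"]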
Model (PnG BERT) This work pretrains a BERT-like model on a large text corpus by concatenating the phoneme sequence and the grapheme sequence. The hidden states of the phoneme tokens can be used as input to the TTS model.
This concatenation helps the phoneme tokens carry information from the graphemes and disambiguate prosody (see the sketch below).
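A minimal sketch of how such a concatenated input might be assembled; the special tokens, segment ids, and toy text are illustrative, not the paper's exact implementation:

```python
# Hedged sketch of assembling a PnG-BERT-style input: the phoneme sequence and the
# grapheme (here: character) sequence of the same text are concatenated into one
# token stream, with segment ids marking which side each token comes from.
phonemes  = ["T", "UW1"]                  # phoneme side (e.g. for the word "two")
graphemes = list("two")                   # grapheme side (could be subwords instead)

tokens      = ["[CLS]"] + phonemes + ["[SEP]"] + graphemes + ["[SEP]"]
segment_ids = [0] * (len(phonemes) + 2) + [1] * (len(graphemes) + 1)

# After BERT-style masked-LM pretraining on a large text corpus, the hidden states
# over the phoneme positions are taken as input to the TTS model.
phoneme_positions = list(range(1, 1 + len(phonemes)))
print(tokens, segment_ids, phoneme_positions)
```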
Model (G2P, 1-to-1 approach)
Model (G2P, N-to-N approach)
Applying Many-to-Many Alignments and Hidden Markov Models to Letter-to-Phoneme Conversion (Github)
- run forward-backward to get the alignment
- segment the word into a sequence of letter chunks (by training a separate model), then run the aligned models on it (see the toy example below).
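A toy illustration of what a many-to-many alignment produces, in contrast to a strict 1-to-1 alignment that has to pad with empty symbols; the word and its chunking are illustrative, not real aligner output:

```python
# Toy illustration of a many-to-many letter-to-phoneme alignment: a letter chunk
# may map to several phonemes, and several letters may map to a single phoneme.
alignment = [
    ("m",  ["M"]),
    ("i",  ["IH"]),
    ("x",  ["K", "S"]),   # one letter  -> two phonemes
    ("i",  ["IH"]),
    ("ng", ["NG"]),       # two letters -> one phoneme
]
phonemes = [p for _, chunk in alignment for p in chunk]
print(phonemes)           # ['M', 'IH', 'K', 'S', 'IH', 'NG']
```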
For a general Unicode-based G2P tool, the options are:
- unitran: a mapping table can be found here
1.4. Speaker Features
Common speaker embeddings: i-vector, x-vector, d-vector (see the extraction sketch below).
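As a hedged sketch, such speaker embeddings can be extracted with a pretrained speaker-verification model, for example via SpeechBrain; the model identifier and file path below are assumptions to be checked against the SpeechBrain model hub:

```python
# Hedged sketch: extracting an x-vector-style speaker embedding with a pretrained
# SpeechBrain speaker-verification model. "sample.wav" is a placeholder path.
import torchaudio
from speechbrain.pretrained import EncoderClassifier

classifier = EncoderClassifier.from_hparams(source="speechbrain/spkrec-xvect-voxceleb")
signal, fs = torchaudio.load("sample.wav")
embedding = classifier.encode_batch(signal)   # shape roughly (batch, 1, embedding_dim)
print(embedding.shape)
```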
2. Semi-Supervised Learning
2.1. Data Augmentation
2.2. Pseudo Labeling
Model (Noisy Student)
- use SpecAugment to add noise
- shallow fusion with a language model (see the sketch below)
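A minimal sketch of the shallow-fusion scoring used when decoding with the teacher model; the weight and scores below are made-up illustrative numbers:

```python
# Hedged sketch of shallow fusion: each hypothesis score is the ASR
# log-probability plus a weighted language-model log-probability.
LM_WEIGHT = 0.3

def fused_score(asr_log_prob: float, lm_log_prob: float) -> float:
    """log P_ASR(y | x) + lambda * log P_LM(y)"""
    return asr_log_prob + LM_WEIGHT * lm_log_prob

hypotheses = [            # (text, ASR log-prob, LM log-prob)
    ("the cat sat", -3.2, -1.0),
    ("the cat sad", -3.1, -2.5),
]
best = max(hypotheses, key=lambda h: fused_score(h[1], h[2]))
print(best[0])            # the LM term flips the ranking toward "the cat sat"
```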