0x431 Representations

1. Character
2. Subword
3. Word
4. Corpus
- 4.1. Large Corpora
- Dataset Filtering
5. Reference

1. Character

1.1. ASCII

1.2. Unicode

Unicode is a character set where each character (called code point) is assigned an id from U+000000 ~ U+10ffff.

The set can be divided into 17 planes. Each planes contains 2 bytes (U+0000 ~ U+ffff). There are 17 planes because (U+00xxxx ~ U+10xxxx)

Plane 0 BMP (Basic Multilingual Plane) The first planes contains character U+0000 - U+FFFF

Plane 1~16 (Supplementary Plane) Notice there are unsigned planes (plane 4~13) and several private planes (plane 15, 16)

Unicode only assigns the code point to each character, however, it does not state how to encode those code points. UTF-8, 16, 32 are just different encoding scheme (implementation) of Unicode.

1.2.1. UTF-8

Variable-length encoding, backwards compatible with ASCII, will not be affected by endianness Good for English, 96.0\% of all web pages as of 2021 according to Wikipedia. It is recommended from the WHATTWG for HTML and DOM spec.

Encoding is done as follows:

U+0000 - U+007F: 1 byte (ASCII) 0xxxxxxx
U+0080 - U+07FF: 2 bytes 110xxxxx 10xxxxxx
U+0800 - U+FFFF: 3 bytes 1110xxxx 10xxxxxx 10xxxxxx
U+10000 - U+10FFFF: 4 bytes 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

1.2.2. UTF-16

Variable-length encoding, either 2 byte or 4 byte. Has different endianness Good for Asian texts, Bad for English

U+000000 - U+00ffff: 2 byte (xxxxxxxx xxxxxxxx)
U+010000 - U10fffff: 4 byte (110110yy yyyyyyyy 110111xx xxxxxxxx)

The 4 byte is to use encode code points in supplementary planes.

The first part (110110yy yyyyyyyy) is called the high surrogate, with hex (0xD800 - 0xDBFF)
The second part (110111xx xxxxxxxx) is called the low surrogate, with hex (0xDC00 - 0xDFFF)

If all xy are 0, then it is U+0x010000. if all 1, it is 0x10ffff

1.2.3. UTF-32

Fixed-length encoding, all code points take 4 bytes. Has different endianness Fast to operate (because of fixed length) but rarely used due to the waste of memory.

unicode equivalence

Multiple code points can represent essentially the same character. We can apply the unicode normalization based on this "equivalence" relationship

There are two main normalization

NCF: composition form
NCD: decomposition form (e.g: \(e'\) can be decomposed into \(e\) and \(\cdot'\), check Wiki for example)

1.3. Byte Representation

There are some byte-level models, which interprets the strings as a sequence of UTF-8 bytes. It helps to build token-free models, which only require 256 embeddings vectors rather than the large fixed vocab.

It has been used to build, for example, LM

2. Subword

The huggingface has a good summary of subward tokenization

2.1. Byte Pair Encoding (BPE)

Model (Unicode BPE) originall introduced for translation task. it greedily merges most frequent subword sequence for fixed number of operations

Huggingface doc has example of this construction

Model (Byte-level BPE, GPT2) Section 2.2 in GPT 2 papers

The base characters of unicode BPE is too large compared with the common token vocab, so GPT2 applied apply bpe on bytes with some modifications (i.e. prevent merging across characters)

GPT2 has a vocabulary size of 50,257, which corresponds to the 256 bytes base tokens, a special end-of-text token and the symbols learned with 50,000 merges.

Model (BPE dropout) This works

claims the drawback of BPE in its deterministic nature: it splits words into unique subword sequences, which means that for each word a model observes only one segmentation.
It stochastically corrupts the segmentation procedure of BPE, which leads to producing multiple segmentations within the same fixed BPE framework

Wordpiece

Model (wordpiece) similar to BPE, but the merging algorithms is decided by the likelihood maximization (bigram LM) instead of the most frequent pair

2.2. Unigram-based segmentation

Model (unigram, sentencepiece) create vocabulary of most frequent character n-grams use EM algorithm to optimize probabilities, remove subwords with low prob

3. Word

3.1. Word Embeddings

Model (based on importance sampling) train the model \(p(w | h)\) based on adaptive importance sampling

Model (based on NCE)

Train a classifier to predict whether the data is from the parametrized distribution \(p_\theta(w | c) = u_\theta(w, c)/Z_\theta(c)\) or the noisy distribution \(q\)

Under the assumption \(Z_c \approx 1\)

\[P(D = 0 | c, w) = \frac{k q(w)}{u_\theta(w, c) + k q(w)}\]

\[P(D = 1 | c, w) = \frac{u_\theta(w, c)}{u_\theta(w, c) + kq(w)}\]

The loss function is

\[L = \sum_{(w,c) \sim \mathcal{D}} \log p(D = 1 | c, w) + k E_{\bar{w} \sim q} \log p(D = 0 | c, \bar{w})\]

where the second term can be approximated by sampling \(k\) negative samples

In the asympotic setting \(k \to \infty\), the parametrized distribution match the empirical distribution.

Model (based on negative sampling, word2vec)

cbow: use embedding from the neighborhood to predict the center word. CBOW here means continuous version of BOW (bag of words) instead of the discrete one (one-hot vector)
skip gram: use the center word embedding to predict neighbor words

As the optimization is involved with taking softmax over large vocab dimension (>10k), negative sampling is used to approximate the actual gradient. In this case, negative words are randomly sample from unigram distribution (usually powered with 0.75 to account more for low prob words)

\[L = \log \sigma({v'}_{w_O}^{T} v_{w_I}) + \sum E_{w_i \sim P_n(w)} [ \log \sigma(-{v'}_{w_i}^T v_{w_I})]\]

negative sampling and NCE

Negative sampling is a variation of NCE based on the following setting

\[p(D = 0 | c, w) = \frac{1}{u_\theta(w,c) + 1}\]

\[p(D = 1 | c, w) = \frac{u_\theta(w,c)}{u_\theta(w,c) + 1}\]

It is NCE under the assumption \(k = |V|\) and \(q\) is uniform, so it does not have the asymptotic consistency guarantee and will not optimize the likelihood.

Check this reference

Model (GloVe) Use word embedding to model the word co-occurrences

let \(X_{ij}\) be the word co-occurrences of word \(i\) and word \(j\) across all dataset, \(w_i, \tilde{w}_j\) denotes two sets of word vectors, GloVe attempts to model the co-occurrences with

\[\log X_{ij} = w_i^T \tilde{w}_j + b_i + \tilde{b}_j\]

The loss function is

\[\mathcal{L} = \sum_{i,j}^V f(X_{ij})(w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij})\]

where higher co-occurrences is given higher weights: \(f(x) =(x/x_{max})^{\alpha}\) if \(x < x_{max}\) otherwise 1

In practice, it will scale with the corpus size rather than \(O(V^2)\) (under some word-co-occurrences distribution assumption)

Model (FastText) use subword instead of word

3.1.1. Crosslingual Embeddings

Mikolov et al., 2013 introduces a linear mapping between source and target word embeddings

Xing et al., 2015 enforces orthogonality on the linear mapping \(W\), which reduces it to the Procrustes problem

3.2. Box Embeddings

Box embeddings are region-based representations of words, instead of using pointwise vectors

3.3. Collocations

3.3.1. Hypothesis Testing

If the two constituent words of a frequent bigram like new companies are frequently words, then we expect the two words to co-occur a lot just by chance, event if they do not form a collocation. so we want to now whether two words occur together more often than change. We formulate a null hypothesis \(H_0\) that there is no association between the words beyond chance occurrences.

3.4. Word Sense Disambiguation

3.5. Lexical Acquisition

3.6. Translation

This work applies adversarial learning to learn word translation without any supervised pairs

4. Corpus

4.1. Large Corpora

Dataset (C4) statistics

Dataset	documents	tokens	size
C4.EN.NOCLEAN	1.1 billion	1.4 trillion	2.3 TB
C4.EN.NOBLOCKLIST	395 million 198	billion	380 GB
C4.EN	365 million	156 billion	305 GB

Analysis of this dataset can be found in this work

Dataset Filtering

High-quality dataset is important

Check, for example, appendix A in GPT3 for filtering approach

5. Reference

[1] Kenneth Heafield' PDF thesis
[2] CMU 11-737 Multilingual Natural Language Processing