0x431 Representations
1. Character
1.1. ASCII
1.2. Unicode
Unicode is a character set where each character (called a code point) is assigned an id in the range U+0000 ~ U+10FFFF.
The set is divided into 17 planes. Each plane contains \(2^{16}\) code points (offsets 0x0000 ~ 0xFFFF within the plane). There are 17 planes because the plane prefix ranges from 0x00 to 0x10 (U+00xxxx ~ U+10xxxx).
Plane 0, the BMP (Basic Multilingual Plane): the first plane contains characters U+0000 - U+FFFF
Plane 1~16 (Supplementary Planes): note that some planes are unassigned (planes 4~13) and two are private-use planes (planes 15 and 16)
Unicode only assigns a code point to each character; it does not state how to encode those code points. UTF-8, UTF-16, and UTF-32 are just different encoding schemes (implementations) of Unicode.
1.2.1. UTF-8
Variable-length encoding, backwards compatible with ASCII, and not affected by endianness. Good for English; used by 96.0\% of all web pages as of 2021 according to Wikipedia. It is the encoding recommended by the WHATWG for the HTML and DOM specs.
Encoding is done as follows:
- U+0000 - U+007F: 1 byte (ASCII) 0xxxxxxx
- U+0080 - U+07FF: 2 bytes 110xxxxx 10xxxxxx
- U+0800 - U+FFFF: 3 bytes 1110xxxx 10xxxxxx 10xxxxxx
- U+10000 - U+10FFFF: 4 bytes 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
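A minimal Python sketch of these rules (the helper `utf8_encode` is hypothetical, just to expose the bit layout; the built-in `str.encode("utf-8")` serves as the reference):

```python
def utf8_encode(cp: int) -> bytes:
    """Hypothetical helper: encode a single code point following the table above."""
    if cp <= 0x7F:                        # 1 byte:  0xxxxxxx
        return bytes([cp])
    elif cp <= 0x7FF:                     # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    elif cp <= 0xFFFF:                    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])
    else:                                 # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])

assert utf8_encode(ord("a")) == b"a"                          # 1 byte (ASCII)
assert utf8_encode(ord("€")) == "€".encode("utf-8")           # U+20AC -> 3 bytes
assert utf8_encode(0x10348) == chr(0x10348).encode("utf-8")   # supplementary plane -> 4 bytes
```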
1.2.2. UTF-16
Variable-length encoding, either 2 bytes or 4 bytes. Affected by endianness. Compact for Asian texts, wasteful for English
- U+0000 - U+FFFF: 2 bytes (xxxxxxxx xxxxxxxx)
- U+10000 - U+10FFFF: 4 bytes (110110yy yyyyyyyy 110111xx xxxxxxxx)
The 4-byte form is used to encode code points in the supplementary planes: the code point is first offset by subtracting 0x10000, and the remaining 20 bits are split into the two 10-bit payloads y and x.
- The first part (110110yy yyyyyyyy) is called the high surrogate, with hex (0xD800 - 0xDBFF)
- The second part (110111xx xxxxxxxx) is called the low surrogate, with hex (0xDC00 - 0xDFFF)
If all the x and y bits are 0, the code point is U+10000; if all are 1, it is U+10FFFF.
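A small sketch of the surrogate arithmetic (hypothetical helper `utf16_surrogates`, checked against Python's built-in UTF-16 encoder):

```python
def utf16_surrogates(cp: int) -> tuple[int, int]:
    """Hypothetical helper: split a supplementary-plane code point into surrogates."""
    assert 0x10000 <= cp <= 0x10FFFF
    offset = cp - 0x10000                 # 20 bits of payload
    high = 0xD800 | (offset >> 10)        # 110110yy yyyyyyyy
    low = 0xDC00 | (offset & 0x3FF)       # 110111xx xxxxxxxx
    return high, low

# U+1F600 (a supplementary-plane emoji) -> surrogates 0xD83D, 0xDE00
assert utf16_surrogates(0x1F600) == (0xD83D, 0xDE00)
# matches Python's little-endian UTF-16 bytes: 3D D8 00 DE
assert chr(0x1F600).encode("utf-16-le") == bytes([0x3D, 0xD8, 0x00, 0xDE])
```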
1.2.3. UTF-32
Fixed-length encoding: all code points take 4 bytes. Affected by endianness. Fast to operate on (because of the fixed length) but rarely used due to the waste of memory.
Unicode equivalence
Multiple code point sequences can represent essentially the same character. We can apply Unicode normalization based on this "equivalence" relationship.
There are two main normalization forms:
- NFC: composition form
- NFD: decomposition form (e.g., é can be decomposed into e plus the combining acute accent; check Wikipedia for examples, and see the snippet below)
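For illustration, the standard library's `unicodedata.normalize` switches between the two forms:

```python
import unicodedata

composed = "\u00e9"        # é as a single code point (U+00E9)
decomposed = "e\u0301"     # e followed by U+0301 COMBINING ACUTE ACCENT

assert composed != decomposed                                # different code point sequences
assert unicodedata.normalize("NFC", decomposed) == composed  # compose
assert unicodedata.normalize("NFD", composed) == decomposed  # decompose
```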
1.3. Byte Representation
There are some byte-level models, which interpret strings as sequences of UTF-8 bytes. This helps to build token-free models, which only require 256 embedding vectors rather than a large fixed vocab.
It has been used to build, for example, byte-level language models.
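A quick illustration of the byte-level view (plain Python, no particular model assumed): every id stays in [0, 256), regardless of script.

```python
text = "héllo, 世界"
byte_ids = list(text.encode("utf-8"))   # e.g. [104, 195, 169, 108, 108, 111, ...]
assert all(0 <= b < 256 for b in byte_ids)
print(len(text), len(byte_ids))         # more byte tokens than characters for non-ASCII text
```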
2. Subword
Hugging Face has a good summary of subword tokenization
2.1. Byte Pair Encoding (BPE)
Model (Unicode BPE) Originally introduced for the machine translation task. It greedily merges the most frequent subword pair for a fixed number of merge operations.
The Hugging Face docs have an example of this construction.
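A minimal sketch of the greedy merge loop on toy word frequencies (hypothetical helper `bpe_merges`, not the Hugging Face implementation; end-of-word markers are omitted for brevity):

```python
from collections import Counter

def bpe_merges(word_freqs: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    """Greedily merge the most frequent adjacent symbol pair, num_merges times."""
    words = {tuple(w): f for w, f in word_freqs.items()}   # words as symbol tuples
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)                   # most frequent pair
        merges.append(best)
        merged_words = {}
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):                        # apply the merge inside each word
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_words[tuple(out)] = merged_words.get(tuple(out), 0) + freq
        words = merged_words
    return merges

print(bpe_merges({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=3))
```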
Model (Byte-level BPE, GPT-2) See Section 2.2 in the GPT-2 paper.
The base character set of Unicode BPE is too large compared with the common token vocab, so GPT-2 applies BPE on bytes with some modifications (e.g., preventing merges across character categories).
GPT-2 has a vocabulary size of 50,257, which corresponds to the 256 base byte tokens, a special end-of-text token, and the symbols learned with 50,000 merges.
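Assuming the `tiktoken` package is installed, the GPT-2 vocabulary can be inspected directly:

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")
print(enc.n_vocab)                # 50257 = 256 byte tokens + 1 <|endoftext|> + 50,000 merges
print(enc.encode("hello world"))  # byte-level BPE token ids
```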
Model (BPE dropout) This work:
- claims the drawback of BPE is its deterministic nature: it splits words into unique subword sequences, which means that for each word a model observes only one segmentation.
- stochastically corrupts the segmentation procedure of BPE, which leads to producing multiple segmentations within the same fixed BPE framework.
WordPiece
Model (WordPiece) Similar to BPE, but the merge decision is based on likelihood maximization (a bigram LM) instead of picking the most frequent pair.
2.2. Unigram-based segmentation
Model (unigram, SentencePiece) Create a vocabulary of the most frequent character n-grams, use the EM algorithm to optimize probabilities, and remove subwords with low probability.
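A hedged sketch with the `sentencepiece` library, assuming it is installed and a plain-text file `corpus.txt` (hypothetical path) exists:

```python
import sentencepiece as spm

# Train a unigram model; vocab_size and the file names are arbitrary choices for the sketch.
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="unigram", vocab_size=8000, model_type="unigram"
)
sp = spm.SentencePieceProcessor(model_file="unigram.model")
print(sp.encode("This is a test", out_type=str))   # segmented into subword pieces
```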
3. Word
3.1. Word Embeddings
Model (based on importance sampling) train the model \(p(w | h)\) based on adaptive importance sampling
Model (based on NCE)
Train a classifier to predict whether the data is from the parametrized distribution \(p_\theta(w | c) = u_\theta(w, c)/Z_\theta(c)\) or the noisy distribution \(q\)
Under the assumption \(Z_\theta(c) \approx 1\), the loss function is
\[
J(\theta) = \mathbb{E}_{w \sim \tilde{p}(\cdot|c)}\left[\log \frac{u_\theta(w, c)}{u_\theta(w, c) + k\, q(w)}\right] + k\, \mathbb{E}_{\bar{w} \sim q}\left[\log \frac{k\, q(\bar{w})}{u_\theta(\bar{w}, c) + k\, q(\bar{w})}\right]
\]
where the second term can be approximated by sampling \(k\) negative samples from \(q\).
In the asymptotic setting \(k \to \infty\), the parametrized distribution matches the empirical distribution.
Model (based on negative sampling, word2vec)
- cbow: use the embeddings from the neighborhood to predict the center word. CBOW here means the continuous version of BOW (bag of words), as opposed to the discrete one (one-hot vectors)
- skip-gram: use the center word embedding to predict the neighboring words
As the optimization involves taking a softmax over a large vocab dimension (>10k), negative sampling is used to approximate the actual gradient. In this case, negative words are randomly sampled from the unigram distribution (usually raised to the power 0.75 to give more mass to low-probability words).
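A toy numpy sketch of skip-gram with negative sampling (all names and sizes are illustrative, not word2vec's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 10_000, 100, 5                       # vocab size, embedding dim, negatives
W_in = rng.normal(scale=0.1, size=(V, d))      # center-word embeddings
W_out = rng.normal(scale=0.1, size=(V, d))     # context-word embeddings

counts = rng.integers(1, 1000, size=V)         # fake unigram counts
noise = counts ** 0.75                         # unigram distribution raised to 0.75
noise = noise / noise.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(center: int, context: int) -> float:
    """Negative log-likelihood of one positive pair plus k sampled negatives."""
    negs = rng.choice(V, size=k, p=noise)
    pos = np.log(sigmoid(W_in[center] @ W_out[context]))
    neg = np.log(sigmoid(-W_in[center] @ W_out[negs].T)).sum()
    return -(pos + neg)

print(sgns_loss(center=42, context=7))
```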
negative sampling and NCE
Negative sampling is a variation of NCE: it is NCE under the assumption that \(k = |V|\) and \(q\) is uniform, so that \(k\, q(w) = 1\) and the classification probability reduces to \(\sigma(\log u_\theta(w, c))\). As a result it does not have the asymptotic consistency guarantee and will not optimize the likelihood.
Check this reference
Model (GloVe) Use word embeddings to model word co-occurrences.
Let \(X_{ij}\) be the co-occurrence count of word \(i\) and word \(j\) across the whole dataset, and let \(w_i, \tilde{w}_j\) denote two sets of word vectors. GloVe attempts to model the co-occurrences with
\[ w_i^\top \tilde{w}_j + b_i + \tilde{b}_j \approx \log X_{ij} \]
The loss function is
\[ J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2 \]
where higher co-occurrence counts are given higher weights: \(f(x) = (x/x_{max})^{\alpha}\) if \(x < x_{max}\), otherwise 1.
In practice, the cost scales with the corpus size rather than \(O(|V|^2)\) (under some assumptions on the word co-occurrence distribution).
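A toy sketch of the per-pair objective (illustrative helpers, not the released GloVe code):

```python
import numpy as np

def glove_weight(x: float, x_max: float = 100.0, alpha: float = 0.75) -> float:
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_loss(w_i, w_tilde_j, b_i, b_tilde_j, x_ij):
    """Weighted squared error of the log co-occurrence fit for a single (i, j) pair."""
    return glove_weight(x_ij) * (w_i @ w_tilde_j + b_i + b_tilde_j - np.log(x_ij)) ** 2

rng = np.random.default_rng(0)
w_i, w_tilde_j = rng.normal(size=50), rng.normal(size=50)
print(glove_pair_loss(w_i, w_tilde_j, 0.0, 0.0, x_ij=37.0))
```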
Model (FastText) Use subword units (character n-grams) instead of whole words
3.1.1. Crosslingual Embeddings
Mikolov et al., 2013 introduces a linear mapping between source and target word embeddings
Xing et al., 2015 enforces orthogonality on the linear mapping \(W\), which reduces it to the Procrustes problem
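A numpy sketch of the Procrustes solution \(W = UV^\top\) with \(U \Sigma V^\top = \mathrm{SVD}(Y^\top X)\), assuming the rows of \(X\) (source) and \(Y\) (target) are embeddings aligned by a seed dictionary:

```python
import numpy as np

def procrustes(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Orthogonal W minimizing ||X W^T - Y||_F, i.e. W x_i ≈ y_i for aligned rows."""
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt

# Sanity check on synthetic data: recover a known orthogonal mapping.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 300))                        # source-language embeddings
W_true, _ = np.linalg.qr(rng.normal(size=(300, 300)))   # a random orthogonal map
Y = X @ W_true.T                                        # perfectly aligned targets
W = procrustes(X, Y)
assert np.allclose(X @ W.T, Y, atol=1e-6)
```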
3.2. Box Embeddings
Box embeddings are region-based representations of words, instead of using pointwise vectors
3.3. Collocations
3.3.1. Hypothesis Testing
If the two constituent words of a frequent bigram like "new companies" are themselves frequent words, then we expect the two words to co-occur a lot just by chance, even if they do not form a collocation. So we want to know whether two words occur together more often than chance. We formulate a null hypothesis \(H_0\) that there is no association between the words beyond chance occurrences.
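A sketch of the t-test version of this hypothesis test, with made-up counts purely for illustration:

```python
import math

# Toy counts for a candidate bigram (w1, w2) in a hypothetical corpus.
N = 14_000_000                         # total number of bigram positions
c_w1, c_w2, c_bigram = 15_000, 4_500, 8

x_bar = c_bigram / N                   # observed bigram probability (sample mean)
mu = (c_w1 / N) * (c_w2 / N)           # expected probability under H0 (independence)
s2 = x_bar * (1 - x_bar)               # Bernoulli variance, approximately x_bar here

t = (x_bar - mu) / math.sqrt(s2 / N)
print(t)   # ≈ 1.1 here: below ~2.576 (alpha = 0.005), so we cannot reject H0
```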
3.4. Word Sense Disambiguation
3.5. Lexical Acquisition
3.6. Translation
This work applies adversarial learning to learn word translation without any supervised pairs
4. Corpus
4.1. Large Corpora
Dataset (C4) statistics
Dataset | documents | tokens | size |
---|---|---|---|
C4.EN.NOCLEAN | 1.1 billion | 1.4 trillion | 2.3 TB |
C4.EN.NOBLOCKLIST | 395 million | 198 billion | 380 GB |
C4.EN | 365 million | 156 billion | 305 GB |
Analysis of this dataset can be found in this work
Dataset Filtering
A high-quality dataset is important.
Check, for example, Appendix A of the GPT-3 paper for the filtering approach.