# 0x440 Foundation


## N-gram LM

### Concepts

#### Maximum Likelihood Estimator

$$\hat{P}(w_n \mid w_1, w_2, \ldots, w_{n-1}) = \frac{c(w_1, \ldots, w_n)}{c(w_1, \ldots, w_{n-1})}$$

- MLE does not work in practice because of data sparseness: most valid n-grams never occur in a finite training set, so they would receive zero probability
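As a concrete illustration, a minimal MLE bigram estimator in plain Python (the toy corpus is hypothetical); the last line shows the sparseness problem:

```python
from collections import Counter

# Hypothetical toy corpus; in practice counts come from large text.
tokens = "the cat sat on the mat the cat ran".split()

bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens[:-1])  # history counts

def p_mle(w_prev, w):
    """MLE: c(w_prev, w) / c(w_prev)."""
    return bigrams[(w_prev, w)] / unigrams[w_prev]

print(p_mle("the", "cat"))  # 2/3: seen bigram
print(p_mle("the", "ran"))  # 0.0: unseen bigram -> sparseness problem
```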

#### Backoff

Backoff assigns probability using the (n-1)-gram distribution when the n-gram is not seen in the training set; here $P^*$ is the discounted estimate and $b(w_1^{n-1})$ the backoff weight that keeps the distribution normalized:

$$P(w_n \mid w_1^{n-1}) = \begin{cases} P^*(w_n \mid w_1^{n-1}) & \text{if the } n\text{-gram is seen in the training set} \\ b(w_1^{n-1})\, P(w_n \mid w_2^{n-1}) & \text{otherwise} \end{cases}$$
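A minimal sketch of the recursive backoff lookup over ARPA-style tables (the tables and their log10 values below are hypothetical toy data):

```python
# Hypothetical ARPA-style tables: log10 probabilities for seen n-grams
# and log10 backoff weights b(h) for seen histories.
probs = {
    ("the",): -0.8, ("cat",): -1.2, ("sat",): -1.5,
    ("the", "cat"): -0.3, ("cat", "sat"): -0.4,
}
backoffs = {("the",): -0.1, ("cat",): -0.2}

def logp(ngram):
    """P(w | h) = P*(w | h) if seen, else b(h) * P(w | shorter h)."""
    if ngram in probs:
        return probs[ngram]
    if len(ngram) == 1:
        return float("-inf")  # OOV word; real LMs reserve <unk> mass
    history = ngram[:-1]
    # b(h) is 1 (log10 b = 0) when the history itself was never seen.
    return backoffs.get(history, 0.0) + logp(ngram[1:])

print(logp(("the", "cat")))  # seen bigram: -0.3
print(logp(("the", "sat")))  # backoff: b(the) + logp(sat) = -0.1 + -1.5
```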

### Models

There are two standard ways to combine n-gram orders: backoff, as above, and interpolation, which always mixes the probabilities from all orders rather than falling back only on unseen n-grams.
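For reference, simple linear interpolation of a bigram model with a mixture weight $\lambda$:

$$P(w_n \mid w_{n-1}) = \lambda\, P_{\text{MLE}}(w_n \mid w_{n-1}) + (1 - \lambda)\, P_{\text{MLE}}(w_n)$$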

#### Kneser-Ney Smoothing

The high-order model uses a count-based probability with an absolute discount $d$:

$$P(w) \propto \max(c(w) - d,\, 0)$$

The low-order (backoff) model uses the continuation, or "fertility", of the word as the smoothing distribution: the number of distinct contexts in which the word appears:

$$P(w) \propto \left|\{\, w' : c(w', w) > 0 \,\}\right|$$
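Putting both pieces together, a minimal interpolated Kneser-Ney bigram sketch in plain Python (toy corpus; the fixed discount d = 0.75 is an assumption, real implementations estimate it from counts of counts):

```python
from collections import Counter

tokens = "the cat sat on the mat the cat ran on the mat".split()
bigrams = Counter(zip(tokens, tokens[1:]))
histories = Counter(tokens[:-1])
d = 0.75  # assumed absolute discount

# Continuation ("fertility") counts: number of distinct left contexts of w.
continuation = Counter(w for (_, w) in bigrams)
num_bigram_types = len(bigrams)

def p_kn(h, w):
    # High order: count-based probability with absolute discount.
    p_high = max(bigrams[(h, w)] - d, 0) / histories[h]
    # Mass freed by discounting, redistributed via the low-order model.
    lam = d * sum(1 for (h2, _) in bigrams if h2 == h) / histories[h]
    # Low order: continuation probability from context fertility.
    p_low = continuation[w] / num_bigram_types
    return p_high + lam * p_low

print(p_kn("the", "cat"))  # seen bigram
print(p_kn("the", "ran"))  # unseen bigram still gets continuation mass
```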

### Tools

Reference [1] provides a very good tutorial on efficient implementation.

#### SRILM

Compute the perplexity of a test corpus against a trained ARPA-format model:

```sh
# compute ppl
ngram -lm corpus.arpa -ppl test_corpus.txt
```
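For intuition, perplexity is the inverse geometric mean of the per-word probabilities; a simplified sketch of the computation behind `-ppl` (hypothetical log10 values, ignoring OOV and sentence-boundary bookkeeping):

```python
# Hypothetical per-word log10 probabilities from a language model.
log10_probs = [-1.2, -0.7, -2.1, -0.9]

logprob = sum(log10_probs)
ppl = 10 ** (-logprob / len(log10_probs))
print(ppl)  # ~16.8
```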

#### MSRLM

- the numerator and denominator of normalization are delayed until query time, which costs more CPU per query [1]
- memory-mapped implementation, so the model can be queried without loading it fully into RAM (see the toy sketch below)

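A toy illustration of the memory-mapping idea (not MSRLM's actual file format): pack sorted 64-bit keys into a file, then mmap it and binary-search without loading it into RAM:

```python
import mmap
import struct

# Write a few sorted 64-bit keys (e.g., packed trigram IDs) to disk.
keys = [3, 17, 42, 99]
with open("lm.bin", "wb") as f:
    f.write(struct.pack(f"<{len(keys)}Q", *keys))

with open("lm.bin", "rb") as f:
    buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

def contains(key, n=len(keys)):
    """Binary search directly over the mapped file."""
    lo, hi = 0, n - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        (val,) = struct.unpack_from("<Q", buf, mid * 8)
        if val == key:
            return True
        lo, hi = (mid + 1, hi) if val < key else (lo, mid - 1)
    return False

print(contains(42), contains(5))  # True False
```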

#### Implementations

- bit packing: if the unigram vocabulary is smaller than 1M (2^20) words, a trigram can be packed into a 64-bit integer using a 20-bit index per word (see the sketch below)
- rank: store the rank of each count instead of the count itself, so fewer bits are needed per entry
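A sketch of the 20-bit packing scheme (assuming all word IDs are below 2^20, about 1M):

```python
def pack_trigram(w1, w2, w3):
    """Pack three 20-bit word IDs into one 64-bit integer."""
    assert max(w1, w2, w3) < (1 << 20)
    return (w1 << 40) | (w2 << 20) | w3

def unpack_trigram(key):
    mask = (1 << 20) - 1
    return (key >> 40) & mask, (key >> 20) & mask, key & mask

key = pack_trigram(12, 34567, 999999)
print(hex(key), unpack_trigram(key))  # round-trips to (12, 34567, 999999)
```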

### Latent Semantic Analysis (LSA)

Apply SVD to the term-document matrix, then approximate it with a low-rank matrix; the retained singular vectors define a latent semantic space for terms and documents.
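A minimal sketch with NumPy (the tiny term-document matrix is hypothetical; rank-2 approximation):

```python
import numpy as np

# Hypothetical term-document count matrix (terms x documents).
X = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 3, 1],
    [0, 2, 0, 2],
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # keep the top-k singular values
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # best rank-k approximation
print(np.round(X_k, 2))

# Documents represented in the k-dimensional latent space:
doc_vecs = np.diag(s[:k]) @ Vt[:k, :]
```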