0x440 Foundation

This page covers classical statistical NLP approaches.

N-gram LM

Concepts

Maximum Likelihood Estimator

$$ \hat{P}(w_n \mid w_1, w_2, \ldots, w_{n-1}) = \frac{c(w_1, \ldots, w_n)}{c(w_1, \ldots, w_{n-1})} $$

  • MLE alone does not work in practice because of data sparseness: most valid n-grams never occur in the training set and receive zero probability (see the sketch below).
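
A minimal sketch of the MLE estimate computed directly from corpus counts (Python; the function name and token list are hypothetical):

from collections import Counter

def mle_prob(tokens, n, history, word):
    # MLE: P(word | history) = c(history, word) / c(history)
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    histories = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 2))
    denom = histories[tuple(history)]
    # An unseen n-gram (or history) gets probability 0: the sparseness problem.
    return ngrams[tuple(history) + (word,)] / denom if denom else 0.0

For example, mle_prob("the cat sat on the mat".split(), 3, ("on", "the"), "mat") returns 1.0, while any trigram absent from the corpus returns 0.0.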

Backoff

Backoff falls back to the (n-1)-gram estimate when the n-gram is not seen in the training set:

$$ P(w_n \mid w_1^{n-1}) =
\begin{cases}
\hat{P}(w_n \mid w_1^{n-1}) & \text{if the } n\text{-gram is seen in the training set} \\
b(w_1^{n-1}) \, P(w_n \mid w_2^{n-1}) & \text{otherwise}
\end{cases} $$
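
A minimal sketch of the backoff recursion in Python, under simplifying assumptions: seen_prob, backoff_weight and unigram_prob are hypothetical lookup tables (e.g. as read from an ARPA file), and a tiny floor stands in for unseen unigrams.

def backoff_prob(ngram, seen_prob, backoff_weight, unigram_prob):
    # ngram is a tuple (w_1, ..., w_n); the last word is predicted from the rest.
    if len(ngram) == 1:
        return unigram_prob.get(ngram[0], 1e-10)   # floor for out-of-vocabulary words
    if ngram in seen_prob:                         # n-gram observed in training
        return seen_prob[ngram]
    history = ngram[:-1]
    # Scale by b(history), then back off to the shorter context w_2^{n-1}.
    return backoff_weight.get(history, 1.0) * backoff_prob(ngram[1:], seen_prob, backoff_weight, unigram_prob)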

Models

Witten-Bell

Good-Turing

See "Good-Turing frequency estimation without tears" (Gale & Sampson) for a practical treatment.

Jelinek-Mercer

Interpolation: always mix the higher-order estimate with the lower-order estimate, weighted by λ, rather than backing off only when the n-gram is unseen.
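
The standard recursive form interpolates the MLE estimate of each order with the smoothed estimate of the next lower order (λ is tuned on held-out data; notation follows the equations above):

$$ P_{JM}(w_n \mid w_1^{n-1}) = \lambda \, \hat{P}(w_n \mid w_1^{n-1}) + (1 - \lambda) \, P_{JM}(w_n \mid w_2^{n-1}) $$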

Kneser-Ney Smoothing

The highest-order distribution uses count-based probabilities with an absolute discount:

$$ P(w_n \mid w_1^{n-1}) \propto \max\big(c(w_1^{n}) - d, 0\big) $$

The lower-order (backoff) distribution uses continuation ("fertility") counts: a word is weighted by the number of distinct contexts it has been seen after.

$$ P(w) \propto \left| \{\, w' : c(w', w) > 0 \,\} \right| $$
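
A minimal sketch of interpolated Kneser-Ney for the bigram case in Python (the discount d is assumed fixed; in practice it is estimated from counts of counts, and the query assumes the history word was seen in training):

from collections import Counter, defaultdict

def kneser_ney_bigram(tokens, d=0.75):
    bigrams = Counter(zip(tokens, tokens[1:]))
    history_counts = Counter(tokens[:-1])
    left_contexts = defaultdict(set)            # w -> set of words seen before w
    for w_prev, w in bigrams:
        left_contexts[w].add(w_prev)
    total_bigram_types = len(bigrams)

    def prob(w_prev, w):
        # High order: count-based probability with absolute discount d.
        discounted = max(bigrams[(w_prev, w)] - d, 0) / history_counts[w_prev]
        # Interpolation weight: probability mass released by the discounting.
        seen_types = sum(1 for (a, _) in bigrams if a == w_prev)
        lam = d * seen_types / history_counts[w_prev]
        # Low order: continuation ("fertility") probability |{w' : c(w', w) > 0}| / |bigram types|.
        p_continuation = len(left_contexts[w]) / total_bigram_types
        return discounted + lam * p_continuation

    return prob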

Stupid Backoff

Tools

[1] provides a very good tutorial on efficient implementation.

SRILM

Tutorial

Commands

# compute perplexity of a test set with an ARPA-format LM
ngram -lm corpus.arpa -ppl test_corpus.txt

MSRLM

  • The numerator and denominator of the normalization are computed lazily at query time, so querying costs more CPU time [1].
  • Memory-mapped implementation.

IRSTLM

KenLM

data

Implementations

  • Bit packing: if the unigram vocabulary is smaller than 1M words, a trigram can be packed into a 64-bit integer using a 20-bit index for each word (see the sketch below).
  • Rank: store the rank of each count instead of the count itself.
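
A sketch of the bit-packing trick in Python (assumes every word index fits in 20 bits, i.e. a vocabulary below 2^20 ≈ 1M words):

MASK20 = (1 << 20) - 1

def pack_trigram(w1, w2, w3):
    # Three 20-bit word indices packed into one 64-bit key (the top 4 bits stay unused).
    assert max(w1, w2, w3) <= MASK20
    return (w1 << 40) | (w2 << 20) | w3

def unpack_trigram(key):
    return (key >> 40) & MASK20, (key >> 20) & MASK20, key & MASK20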

Topic Model

Latent Semantic Analysis (LSA)

Apply SVD to the term-document matrix, then approximate it with a low-rank matrix.
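
A minimal sketch with NumPy (X is a hypothetical term-document matrix and k the number of latent dimensions to keep):

import numpy as np

def lsa(X, k):
    # Truncated SVD: X ≈ U_k Σ_k V_k^T, keeping the k largest singular values.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    term_vecs = U[:, :k] * s[:k]        # rows are term vectors in the latent space
    doc_vecs = Vt[:k, :].T * s[:k]      # rows are document vectors in the latent space
    return term_vecs, doc_vecs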

Probabilistic Latent Semantic Analysis (PLSA)

Latent Dirichlet Allocation (LDA)

Reference

[1] Kenneth Heafield, "Efficient Language Modeling Algorithms with Applications to Statistical Machine Translation", PhD thesis, Carnegie Mellon University, 2013.