This page covers classification models based on neural networks.
The easiest way to model the vocabulary is to use word embeddings.
Issues with word embeddings:
- a huge vocabulary requires more memory and computation, especially in morphologically rich languages.
- not all words appear in the training data (UNK words).
A common way to address the first issue is to apply a frequency threshold: words below the threshold are replaced with a special UNK token.
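As a minimal sketch (the function names, corpus, and `min_freq` threshold are illustrative, not from the lecture), thresholding can be implemented by counting word frequencies and mapping rare words to `<unk>`:

```python
from collections import Counter

def build_vocab(tokenized_corpus, min_freq=2):
    # Keep only words seen at least min_freq times;
    # everything rarer maps to the shared <unk> token.
    counts = Counter(tok for sent in tokenized_corpus for tok in sent)
    vocab = {"<unk>": 0}
    for word, freq in counts.items():
        if freq >= min_freq:
            vocab[word] = len(vocab)
    return vocab

def encode(sentence, vocab):
    # Unknown and below-threshold words fall back to <unk> (id 0).
    return [vocab.get(tok, vocab["<unk>"]) for tok in sentence]

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
vocab = build_vocab(corpus, min_freq=2)
ids = encode(["the", "bird", "sat"], vocab)  # "bird" and "dog" map to <unk>
```

The threshold trades coverage for memory: raising `min_freq` shrinks the embedding table but sends more words to `<unk>`.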
Another option is to encode characters for the entire sentence, which avoids UNKs but is very slow to train.
A middle ground is to split rarer words into subwords, then embed those. The con is that subword segmentation cannot handle non-concatenative morphology.
Common ways to find subwords are:
Byte pair encoding
- segment words into characters
- repeatedly merge the most frequent adjacent symbol pair, for a fixed number of merge operations
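The BPE merge loop can be sketched as follows (a toy implementation on a word-frequency dictionary; the sample corpus, `</w>` end-of-word marker, and function names are illustrative, not from the lecture):

```python
from collections import Counter

def get_pair_counts(corpus):
    # corpus: dict mapping word (tuple of symbols) -> frequency
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    # Replace every occurrence of the chosen pair with one merged symbol.
    new_sym = "".join(pair)
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(new_sym)
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def bpe(word_freqs, num_merges):
    # Step 1: segment each word into characters (plus end-of-word marker).
    corpus = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    # Step 2: greedily merge the most frequent pair, num_merges times.
    for _ in range(num_merges):
        pairs = get_pair_counts(corpus)
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        corpus = merge_pair(corpus, best)
    return merges, corpus

merges, corpus = bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 10)
```

On this toy corpus the frequent suffix sequence e-s-t-</w> gets merged early, so "est</w>" becomes a single vocabulary symbol shared by "newest" and "widest".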
Unigram language model
- create a vocabulary of the most frequent character n-grams
- use the EM algorithm to optimize subword probabilities, then remove subwords with low probability
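The vocabulary-building and pruning steps can be sketched as below. This is a deliberately simplified toy: it ranks candidate n-grams by raw frequency and prunes the lowest-probability ones, while the real unigram LM method (as in SentencePiece) re-estimates probabilities with EM over all possible segmentations before each pruning round, which is omitted here.

```python
from collections import Counter

def init_vocab(words, max_ngram=4):
    # Candidate vocabulary: all character n-grams up to max_ngram, with counts.
    counts = Counter()
    for w in words:
        for n in range(1, max_ngram + 1):
            for i in range(len(w) - n + 1):
                counts[w[i:i + n]] += 1
    return counts

def prune(counts, keep_ratio=0.5):
    # Toy pruning step: normalize counts to probabilities and drop the
    # lowest-probability subwords (no EM re-estimation in this sketch).
    total = sum(counts.values())
    probs = {s: c / total for s, c in counts.items()}
    ranked = sorted(probs, key=probs.get, reverse=True)
    keep = set(ranked[: max(1, int(len(ranked) * keep_ratio))])
    # Single characters are always kept so every word stays segmentable.
    chars = {s for s in counts if len(s) == 1}
    return {s: counts[s] for s in keep | chars}

vocab = init_vocab(["lower", "newest", "widest"])
pruned = prune(vocab)
```

Keeping all single characters guarantees that any word can still be segmented after pruning, mirroring how real subword vocabularies retain a base character set.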
 CMU 11-737 Multilingual Natural Language Processing