0x441 Classification

This page covers classification models based on neural networks.

Vocabulary Modeling

Word Models

The easiest way to model vocabulary is to use word embeddings, with one vector per word in the vocabulary.

The issues with word embeddings:

  • a huge vocabulary requires more memory and computation, especially in morphologically rich languages.
  • not all words appear in the training data (UNK words)

One common way to address the first issue is to apply a frequency threshold: words that occur less often than the threshold are replaced with UNK.
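
A minimal sketch of thresholding, assuming the threshold is a minimum word frequency and that rare words are mapped to a single unknown token. The names `UNK`, `build_vocab`, `encode`, and `min_count` are hypothetical and chosen for illustration only.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical name for the unknown-word token


def build_vocab(sentences, min_count=2):
    """Keep only words that occur at least min_count times in the corpus."""
    counts = Counter(word for sent in sentences for word in sent.split())
    vocab = {UNK: 0}
    for word, count in counts.most_common():
        if count >= min_count:
            vocab[word] = len(vocab)
    return vocab


def encode(sentence, vocab):
    """Map each word to its id, falling back to the UNK id for unlisted words."""
    return [vocab.get(word, vocab[UNK]) for word in sentence.split()]


corpus = ["the cat sat", "the cat ran", "the dog sat"]
vocab = build_vocab(corpus, min_count=2)
ids = encode("the dog sat", vocab)  # "dog" is below the threshold, so it maps to UNK
```

Raising `min_count` shrinks the embedding table but sends more words to UNK, so the threshold trades memory for coverage.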

Character-based Models

Encode the entire sentence as a sequence of characters. These models are very slow to train.
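
As a rough illustration, the sketch below encodes a sentence character by character. The helper names `build_char_vocab` and `encode_chars` are hypothetical. It assumes every character of the input was seen when building the vocabulary; an unseen character would raise a `KeyError` here.

```python
def build_char_vocab(sentences):
    """Assign an id to every distinct character, including the space."""
    chars = sorted({ch for sent in sentences for ch in sent})
    return {ch: i for i, ch in enumerate(chars)}


def encode_chars(sentence, char_vocab):
    """Turn a sentence into one id per character."""
    return [char_vocab[ch] for ch in sentence]


corpus = ["the cat sat", "the dog ran"]
char_vocab = build_char_vocab(corpus)
char_ids = encode_chars("the cat sat", char_vocab)

# The character sequence is much longer than the word sequence.
num_chars = len(char_ids)
num_words = len("the cat sat".split())
```

The vocabulary is tiny, but the model has to process far more positions per sentence, which is one reason training is slow.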

Subword Models

Split rarer words into subwords, then embed the subwords. The drawback is that this approach cannot handle non-concatenative morphology.

Common ways to find subwords are:

Byte pair encoding

  • segment each word into characters
  • repeatedly merge the most frequent pair of adjacent subwords, for a fixed number of merge operations
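
The two steps above can be sketched as follows. This is a simplified toy version, not a reference implementation: it omits the end-of-word marker that many BPE implementations use, and it breaks ties by taking the first pair encountered. The function names `learn_bpe`, `merge_pair`, and `segment` are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of pair in symbols with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn a list of merge rules from a toy corpus given as a list of words."""
    # Step 1: segment every word into characters, weighted by word frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count each adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Step 2: merge the most frequent pair (ties: first pair encountered).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def segment(word, merges):
    """Apply the learned merges, in order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(words, num_merges=4)
pieces = segment("lowest", merges)  # "lowest" never appears in the corpus
```

With this toy corpus the learned merges build up "est" and "low", so the unseen word "lowest" is split into two known subwords instead of becoming UNK.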

Unigram-based segmentation

  • create a seed vocabulary of the most frequent character n-grams
  • use the EM algorithm to optimize the subword probabilities, then remove subwords with low probability
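
A rough sketch of these steps under strong simplifying assumptions. It uses hard (Viterbi) EM, which counts subwords only in the single best segmentation of each word, rather than the full EM expectation over all segmentations. Single characters are always kept so that every training word stays segmentable. All names and parameters (`seed_vocab`, `viterbi`, `train`, `max_len`, `size`, `rounds`, `keep`) are hypothetical.

```python
import math
from collections import Counter


def seed_vocab(words, max_len=4, size=30):
    """Seed vocabulary: all single characters plus the most frequent n-grams."""
    counts = Counter()
    for word in words:
        for i in range(len(word)):
            for j in range(i + 1, min(i + max_len, len(word)) + 1):
                counts[word[i:j]] += 1
    chars = {ch for word in words for ch in word}
    multi = [s for s, _ in counts.most_common() if len(s) > 1][:size]
    vocab = chars | set(multi)
    total = sum(counts[s] for s in vocab)
    return {s: counts[s] / total for s in vocab}


def viterbi(word, probs):
    """Most probable segmentation of word under the unigram model.

    Assumes every character of word is in probs.
    """
    # best[i] = (log-probability, start of the last piece) for the prefix word[:i]
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in probs:
                score = best[start][0] + math.log(probs[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    # Follow the back-pointers from the end of the word.
    pieces, end = [], len(word)
    while end > 0:
        start = best[end][1]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]


def train(words, rounds=3, keep=0.8):
    """Alternate hard-EM re-estimation with pruning of low-count subwords."""
    probs = seed_vocab(words)
    chars = {ch for word in words for ch in word}
    for _ in range(rounds):
        # Hard E-step: count subwords in the current best segmentations.
        counts = Counter(p for w in words for p in viterbi(w, probs))
        # Pruning: keep the top fraction of multi-character subwords by count.
        multi = sorted((s for s in probs if len(s) > 1),
                       key=lambda s: (-counts[s], s))
        kept = chars | set(multi[: int(len(multi) * keep)])
        # M-step: re-estimate with add-one smoothing so no kept subword gets 0.
        total = sum(counts[s] + 1 for s in kept)
        probs = {s: (counts[s] + 1) / total for s in kept}
    return probs


words = ["lower", "lowest", "newer", "newest", "wider", "widest"]
probs = train(words)
pieces = viterbi("lowest", probs)
```

Unlike BPE, which grows a vocabulary through merges, this procedure starts from a large candidate set and shrinks it, and segmentation picks the highest-probability split under the learned probabilities.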

Reference

[1] CMU 11-737 Multilingual Natural Language Processing