# 0x534 Autoregressive Model

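Neural autoregressive (AR) models factorize the joint distribution into a product of conditional probabilities, then use a neural network to model each conditional. Their main drawback is slow generation: samples must be produced one element at a time, because each conditional depends on the elements generated before it.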

$p(x) = p(x_1)\prod_{i=2}^D p(x_i|x_{<i})$

Modeling each $$p(x_d | x_{<d})$$ separately requires $$D$$ different models, which is infeasible. Instead, we use a single shared model (i.e., an autoregressive model).
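To make the chain-rule factorization and the cost of sequential generation concrete, here is a minimal sketch. The function `cond_prob` is a hypothetical stand-in for the shared network (a toy counting rule, not a trained model), and the binary vocabulary is an assumption made only for illustration.

```python
import math
import random

VOCAB = [0, 1]  # binary symbols, chosen only for illustration


def cond_prob(prefix):
    """Hypothetical shared conditional p(x_i | x_{<i}) over VOCAB.

    A toy rule stands in for a neural network: the probability of a 1
    grows with the number of 1s seen so far.
    """
    p_one = (1 + sum(prefix)) / (2 + len(prefix))
    return [1.0 - p_one, p_one]


def log_prob(x):
    """log p(x) = sum_i log p(x_i | x_{<i}), following the chain rule."""
    return sum(math.log(cond_prob(x[:i])[x[i]]) for i in range(len(x)))


def sample(D, rng):
    """Ancestral sampling: one model call per position, strictly in order."""
    x = []
    for _ in range(D):
        probs = cond_prob(x)  # depends on everything sampled so far
        x.append(rng.choices(VOCAB, weights=probs)[0])
    return x
```

Note that `log_prob` could evaluate all positions independently once `x` is known, whereas `sample` cannot start position `i` before position `i-1` is drawn. This asymmetry is the source of the slow generation mentioned above.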

To reduce the complexity, one simple idea is to use finite memory, i.e., to condition each element only on a fixed number of preceding elements:

$p(x) = p(x_1)p(x_2|x_1) \prod_{d=3}^D p(x_d | x_{d-1}, x_{d-2})$

where the trigram model $$p(x_d | x_{d-1}, x_{d-2})$$ is parameterized by an MLP whose weights are shared across all positions $$d$$.
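As a rough illustration of the finite-memory idea, the sketch below builds a one-hidden-layer MLP that maps the two previous tokens to a distribution over the next one. The vocabulary size `V`, hidden width `H`, and the random untrained weights are all assumptions for the example, not values from any reference implementation.

```python
import numpy as np

V, H = 4, 8  # vocabulary size and hidden width (illustrative values)
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(2 * V, H))  # untrained, random weights
b1 = np.zeros(H)
W2 = rng.normal(scale=0.1, size=(H, V))
b2 = np.zeros(V)


def one_hot(token):
    v = np.zeros(V)
    v[token] = 1.0
    return v


def trigram_probs(prev1, prev2):
    """p(x_d | x_{d-1}=prev1, x_{d-2}=prev2) from a one-hidden-layer MLP."""
    inp = np.concatenate([one_hot(prev1), one_hot(prev2)])
    h = np.tanh(inp @ W1 + b1)
    logits = h @ W2 + b2
    logits = logits - logits.max()  # stabilise the softmax
    p = np.exp(logits)
    return p / p.sum()


def log_prob_tail(x):
    """Sum of log p(x_d | x_{d-1}, x_{d-2}) over 0-based positions 2 onward.

    The first two factors of the product need their own models and are
    left out of this sketch.
    """
    return sum(np.log(trigram_probs(x[d - 1], x[d - 2])[x[d]])
               for d in range(2, len(x)))
```

The same weights are reused at every position, so the parameter count does not grow with the sequence length. The price is that nothing older than two steps can influence the prediction.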

### 1.1. Long-Range Memory with RNN

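An RNN can be used as an autoregressive model: its hidden state summarizes the entire prefix, so the memory is no longer limited to a fixed window.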

Model (char-rnn) The character-level language model represents the character sequence $$\mathbf{x}$$ with an RNN as follows:

$\log p(\mathbf{x}) = \sum_{i=1}^D \log p(x_i | \mathbf{x}_{1:i-1})$

Karpathy's blog post shows that this model can generate many different kinds of sequences, such as Shakespeare, Wikipedia articles, XML, LaTeX, and source code.

This model can also generate non-text objects such as images by treating each pixel as a character.
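The char-rnn log-likelihood can be sketched with a vanilla tanh RNN as below. This is a minimal NumPy illustration under stated assumptions: the helper names `init_params` and `char_rnn_log_prob`, the all-zero start input, and the random untrained weights are hypothetical choices, not taken from Karpathy's code.

```python
import numpy as np


def init_params(vocab_size, hidden=16, seed=0):
    """Random, untrained weights for a vanilla RNN (illustrative only)."""
    rng = np.random.default_rng(seed)
    s = 0.1
    return (rng.normal(scale=s, size=(hidden, vocab_size)),  # Wxh
            rng.normal(scale=s, size=(hidden, hidden)),      # Whh
            rng.normal(scale=s, size=(vocab_size, hidden)),  # Why
            np.zeros(hidden),                                # bh
            np.zeros(vocab_size))                            # by


def char_rnn_log_prob(text, vocab, params):
    """log p(x) = sum_i log p(x_i | x_{1:i-1}) under a vanilla tanh RNN."""
    Wxh, Whh, Why, bh, by = params
    idx = {c: k for k, c in enumerate(vocab)}
    h = np.zeros(Whh.shape[0])     # hidden state summarises the prefix
    x_prev = np.zeros(len(vocab))  # all-zero input marks the sequence start
    total = 0.0
    for ch in text:
        h = np.tanh(Wxh @ x_prev + Whh @ h + bh)
        logits = Why @ h + by
        shifted = logits - logits.max()
        log_probs = shifted - np.log(np.exp(shifted).sum())  # log-softmax
        total += log_probs[idx[ch]]
        x_prev = np.zeros(len(vocab))
        x_prev[idx[ch]] = 1.0      # feed the observed character back in
    return total
```

For example, `char_rnn_log_prob("abba", "ab", init_params(2))` returns the log-probability of the string under the untrained model. Training would maximise this quantity over a corpus.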

Model (MADE, Masked Autoencoder for Distribution Estimation) An MLP-based autoencoder can be turned into an autoregressive model by removing (masking) some connections, so that each output depends only on the inputs that precede it in the chosen ordering.
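The masking can be sketched for a single hidden layer as follows. Each unit gets an integer "degree", and a connection is kept only if it cannot carry information from an input to an output of the same or an earlier position. The function names and the random degree assignment are assumptions for this illustration.

```python
import numpy as np


def made_masks(D, H, seed=0):
    """Binary masks for a one-hidden-layer MADE over D inputs (D >= 2)."""
    rng = np.random.default_rng(seed)
    m_in = np.arange(1, D + 1)          # input i has degree i
    m_hid = rng.integers(1, D, size=H)  # hidden degrees in {1, ..., D-1}
    # Hidden unit k may see input d only if m_hid[k] >= m_in[d].
    mask_hid = (m_hid[:, None] >= m_in[None, :]).astype(float)  # (H, D)
    # Output d may see hidden unit k only if m_in[d] > m_hid[k].
    mask_out = (m_in[:, None] > m_hid[None, :]).astype(float)   # (D, H)
    return mask_hid, mask_out


def made_forward(x, W1, b1, W2, b2, mask_hid, mask_out):
    """Masked MLP: output d is a function of inputs before position d only."""
    h = np.maximum(0.0, (W1 * mask_hid) @ x + b1)  # ReLU hidden layer
    return (W2 * mask_out) @ h + b2
```

Multiplying the two masks gives the input-to-output connectivity, which is strictly lower triangular: output `d` has no path from input `d` or any later input. The first output therefore depends on its bias alone, matching the unconditional factor $$p(x_1)$$.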

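Model (WaveNet) WaveNet is a 1D convolutional AR model for raw audio. It is built from causal, dilated convolutions, so each output depends only on the current and past time steps while the receptive field grows quickly with depth.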
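A causal dilated 1D convolution can be sketched in a few lines. This is a simplified single-channel illustration: the real WaveNet adds gated activations, residual and skip connections, which are omitted here, and the names `causal_conv1d` and `causal_stack` are hypothetical.

```python
import numpy as np


def causal_conv1d(x, w, dilation=1):
    """y[t] = sum_k w[k] * x[t - k * dilation], with zeros before t = 0."""
    T, K = len(x), len(w)
    pad = (K - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])  # left-pad only: no future input
    y = np.zeros(T)
    for k in range(K):
        start = pad - k * dilation
        y += w[k] * xp[start:start + T]
    return y


def causal_stack(x, weights, dilations):
    """Stack of causal layers; tanh stands in for WaveNet's gated units."""
    for w, d in zip(weights, dilations):
        x = np.tanh(causal_conv1d(x, w, d))
    return x
```

With kernel size 2 and dilations 1, 2, 4, 8, the stack sees 16 time steps, since each layer adds `(K - 1) * dilation` steps to the receptive field. To predict $$x_t$$ from $$x_{<t}$$, the output at step $$t-1$$ is used as the prediction for step $$t$$.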

Model (PixelCNN) PixelCNN is a 2D convolutional AR model. Unlike a normal CNN, which convolves over all neighboring pixels, PixelCNN masks out the pixels it has not yet seen (e.g., those that come later in the raster-scan ordering).
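The raster-scan mask can be sketched as below for a single-channel image. The `include_center` flag is an assumption of this sketch: excluding the centre pixel suits the first layer, which must not see the pixel it predicts, while later layers may include it (the PixelCNN paper calls these mask types A and B). The naive loop-based convolution is for clarity only.

```python
import numpy as np


def raster_mask(k, include_center=False):
    """k x k mask keeping only pixels earlier in raster-scan order."""
    assert k % 2 == 1, "kernel size must be odd"
    c = k // 2
    mask = np.zeros((k, k))
    mask[:c, :] = 1.0  # all rows above the centre
    mask[c, :c] = 1.0  # same row, strictly to the left
    if include_center:
        mask[c, c] = 1.0
    return mask


def masked_conv2d(img, kernel, mask):
    """Zero-padded, same-size 2D cross-correlation with a masked kernel."""
    k = kernel.shape[0]
    c = k // 2
    padded = np.pad(img, c)  # constant zero padding on every side
    wk = kernel * mask       # masked weights never touch unseen pixels
    out = np.zeros(img.shape, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(wk * padded[i:i + k, j:j + k])
    return out
```

For a 3 x 3 kernel, `raster_mask(3)` keeps the full top row and the left neighbour only, so the output at a pixel is unaffected by that pixel or anything after it in the scan.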

Model (PixelCNN++) OpenAI's implementation of PixelCNN, with several improvements:

1. Use a mixture of logistics (e.g., 5 components) to model the discretized pixel distribution instead of a 256-way softmax, because it
• saves memory
• allows dense gradient flow, which speeds up training
2. Pixel conditioning is simplified
3. Short-cut connections are added, as in U-Net

The mixture of logistics is as follows:

$\nu \sim \sum_i \pi_i \text{logistic}(\mu_i, s_i)$

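The PMF over a discrete pixel value $$x$$ is modeled as the probability mass that the continuous mixture assigns to the bin $$[x-0.5, x+0.5]$$: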

$p(x | \pi, \mu, s) = \sum_i \pi_i (\sigma((x+0.5 - \mu_i)/s_i) - \sigma((x-0.5-\mu_i)/s_i))$
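The PMF above can be evaluated for all 256 pixel values at once, as in the following sketch. It assumes pixel values on the 0 to 255 scale. The edge handling (bin 0 absorbs all mass below 0.5, bin 255 all mass above 254.5) follows the PixelCNN++ paper and is not part of the formula as written here. The function name is hypothetical.

```python
import numpy as np


def sigmoid(z):
    # Identity sigmoid(z) = 0.5 * (1 + tanh(z / 2)); avoids overflow in exp.
    return 0.5 * (1.0 + np.tanh(0.5 * z))


def discretized_logistic_mixture_pmf(pi, mu, s, levels=256):
    """PMF over {0, ..., levels-1} for mixture weights pi, means mu, scales s."""
    pi, mu, s = (np.asarray(a, dtype=float) for a in (pi, mu, s))
    x = np.arange(levels)[:, None]       # (levels, 1), broadcast over components
    upper = sigmoid((x + 0.5 - mu) / s)  # logistic CDF at the right bin edge
    lower = sigmoid((x - 0.5 - mu) / s)  # logistic CDF at the left bin edge
    upper[-1, :] = 1.0  # edge case: last bin takes the whole upper tail
    lower[0, :] = 0.0   # edge case: first bin takes the whole lower tail
    return ((upper - lower) * pi).sum(axis=1)  # (levels,)
```

Because adjacent bins share an edge, the differences telescope and the PMF sums to 1 whenever the weights `pi` do. The network only has to output `pi`, `mu`, and `s` per component, which is far fewer numbers than 256 softmax logits.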