# 0x422 Generative Model

The Deep Generative Book has a good comparison across all models

## 1. Autoregressive Model

Neural AR models factorize the joint distribution into a sequence of conditional probabilities, then use a neural network to model them. The main drawback of autoregressive models is slow generation, since elements are sampled one at a time.

$p(x) = p(x_1)\prod_{i=2}^D p(x_i|x_{<i})$

Modeling each $$p(x_d | x_{<d})$$ separately requires $$D$$ different models, which is infeasible. Instead we use a single shared model (i.e. an autoregressive model).

To reduce the complexity, one simple idea is to use a finite memory, for example conditioning only on the previous two elements

$p(x) = p(x_1)p(x_2|x_1) \prod_{d=3}^D p(x_d | x_{d-1}, x_{d-2})$

where the trigram model $$p(x_d | x_{d-1}, x_{d-2})$$ is modeled using an MLP
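
As a rough illustration of the finite-memory factorization, the sketch below swaps the MLP for a smoothed count table (an assumption made only to keep the example dependency-free) and evaluates $$\log p(x)$$ with the chain rule. The names `fit_trigram` and `log_prob` and the `<s>` padding symbol are hypothetical.

```python
import math
from collections import defaultdict

def fit_trigram(sequences, vocab):
    """Count-based stand-in for the shared conditional p(x_d | x_{d-1}, x_{d-2})."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        # Two padding symbols let the same table serve p(x_1) and p(x_2 | x_1).
        padded = ["<s>", "<s>"] + list(seq)
        for d in range(2, len(padded)):
            counts[(padded[d - 2], padded[d - 1])][padded[d]] += 1

    def cond_prob(x, context):
        c = counts[context]
        total = sum(c.values())
        # Add-one smoothing keeps unseen symbols at non-zero probability.
        return (c[x] + 1) / (total + len(vocab))

    return cond_prob

def log_prob(seq, cond_prob):
    """Chain rule: log p(x) = sum_d log p(x_d | x_{d-2}, x_{d-1})."""
    padded = ["<s>", "<s>"] + list(seq)
    return sum(math.log(cond_prob(padded[d], (padded[d - 2], padded[d - 1])))
               for d in range(2, len(padded)))

vocab = ["a", "b"]
cond_prob = fit_trigram(["abab", "abba"], vocab)
print(log_prob("abab", cond_prob))
```

Because one table is shared across all positions, the number of parameters does not grow with the sequence length $$D$$.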

### 1.1. Long-Range Memory with RNN

RNN can be used as an autoregressive model.

Model (char-rnn) The character-level language model models the character sequence $$\mathbf{x}$$ with an RNN as follows

$\log p(\mathbf{x}) = \sum_{i=1}^d \log p(x_i | \mathbf{x}_{1:i-1})$

Karpathy's blog shows that this model can be used to generate many different kinds of sequences such as Shakespeare, Wikipedia, XML, LaTeX and source code.

This model can also generate non-text objects such as images by representing each pixel as a character.

Model (Masking-based autoregressive model, MADE) An MLP based autoencoder can be turned into an autoregressive model by removing (masking) some connections.

Model (WaveNet) WaveNet is a 1D convolutional AR model

Model (PixelCNN) PixelCNN is a 2D convolutional AR model. Unlike a normal CNN, which convolves over all neighboring pixels, PixelCNN masks out the pixels it has not seen yet (e.g. under the raster scan ordering)
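
A minimal sketch of such a mask for a single square kernel, assuming raster-scan ordering. The helper name `raster_mask` is hypothetical; the `include_center` flag loosely mirrors the two mask types used in the PixelCNN paper (center excluded in the first layer, included in later layers).

```python
def raster_mask(kernel_size, include_center=False):
    """Return a kernel_size x kernel_size 0/1 mask for raster-scan ordering."""
    center = kernel_size // 2
    mask = [[0] * kernel_size for _ in range(kernel_size)]
    for r in range(kernel_size):
        for c in range(kernel_size):
            above = r < center                           # rows already generated
            left_in_row = (r == center and c < center)   # earlier pixels in the current row
            at_center = (r == center and c == center)
            if above or left_in_row or (include_center and at_center):
                mask[r][c] = 1
    return mask

for row in raster_mask(3):
    print(row)
# [1, 1, 1]
# [1, 0, 0]
# [0, 0, 0]
```

Multiplying the convolution weights elementwise by this mask removes every connection to pixels that come later in the ordering.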

Model (PixelCNN++) OpenAI's implementation of PixelCNN with several improvements:

1. Use a mixture of logistics (e.g. 5 components) to model the discretized distribution instead of a 256-way softmax because it
• saves memory
• allows dense gradient flow to speed up training
2. pixel conditioning is simplified
3. short-cut connections like the U-Net

The mixture of logistics is as follows:

$\nu \sim \sum_i \pi_i \text{logistic}(\mu_i, s_i)$

The PMF over the discretized value $$x$$ is modeled as

$p(x | \pi, \mu, s) = \sum_i \pi_i (\sigma((x+0.5 - \mu_i)/s_i) - \sigma((x-0.5-\mu_i)/s_i))$
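
The PMF above can be evaluated directly. The sketch below uses made-up mixture parameters and omits the special handling of the edge values 0 and 255 used in PixelCNN++, so the mass is summed over a wide integer range instead.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def mixture_pmf(x, pis, mus, scales):
    """Probability mass of integer x: each component integrates the logistic over [x-0.5, x+0.5]."""
    total = 0.0
    for pi, mu, s in zip(pis, mus, scales):
        total += pi * (sigmoid((x + 0.5 - mu) / s) - sigmoid((x - 0.5 - mu) / s))
    return total

# Hypothetical two-component mixture over pixel intensities.
pis, mus, scales = [0.3, 0.7], [60.0, 180.0], [8.0, 15.0]
mass = sum(mixture_pmf(x, pis, mus, scales) for x in range(-2000, 2001))
print(mixture_pmf(180, pis, mus, scales), mass)
```

The bins telescope, so the total mass is 1 up to the negligible tails outside the summed range.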

## 2. Variational Autoencoder

The idea behind the latent variable model is to assume a lower-dimensional latent space and the following generative process

$Z \sim P(Z)$
$X \sim P(X|Z)$

We want to sample from the simple low-dimensional latent space $$Z$$ easily (e.g: Gaussian), and maximize the evidence function over the dataset $$X \sim \mathcal{D}$$

$P(X) = \int P(X|Z; \theta)P(Z)dZ$

### 2.1. Vanilla VAE

The VAE implements this idea with the following modeling choices

Model ($$P(Z)$$) the prior in VAE is

$P(Z) = N(0, I)$

Note this is a fixed prior in contrast with the VQ-VAE's learnt prior

Model ($$P(X|Z)$$) the likelihood is modeled using a deep neural network $$f(Z; \theta)$$: the VAE approximates $$X \approx f(Z; \theta)$$ and measures $$P(X|Z)$$ by penalizing the deviation with a Gaussian distribution

$P(X|Z; \theta) = N(X | f(Z; \theta), \sigma^2I)$

If $$X$$ is discrete, a discrete distribution such as the Bernoulli can be used instead.

Probabilistic PCA

Recall that pPCA can be seen as a simplified version of the VAE with the same prior

$P(Z) = N(Z | 0, I)$

and a likelihood function $$f$$ that is linear

$P(X|Z) = N(X | WZ + \mu, \sigma^2I)$

The graphical model of VAE is

The integration of evidence is very expensive,

$P(X) = \int P(X|Z)P(Z) dZ$

so we maximize the evidence lower bound (ELBO) instead of the evidence itself

Recall that the standard ELBO form (RHS usually denoted $$\mathcal{L}(Q)$$) is

$\log P(X) \geq E_{Z \sim Q}[\log\frac{P(X,Z)}{Q(Z)}]$

where the right-hand expression is the ELBO. With $$Q(Z) = Q(Z|X)$$, the ELBO can be further decomposed into two terms

$ELBO = E_{Z \sim Q} \log \frac{P(X|Z)P(Z)}{Q(Z|X)} = E_{Z \sim Q}[ \log P(X|Z)] - \mathcal{D}[Q(Z|X)||P(Z)]$

We model $$Q(Z|X)$$ as

$Q(Z|X) \sim N(\mu(X), \Sigma(X))$

where $$\mu(X), \Sigma(X)$$ are implemented using a neural network.

Look at the formula again,

$ELBO = E_{Z \sim Q}[ \log P(X|Z)] - \mathcal{D}[Q(Z|X)||P(Z)]$

The first term on the RHS has a sampling step $$Z \sim Q$$ through which gradients cannot be backpropagated. Training can instead use the reparametrization trick, where we sample $$\epsilon \sim N(0, I)$$ and transform $$\epsilon$$ into $$Z$$ (instead of sampling $$Z \sim Q$$ directly)

$Z = \mu + \Sigma^{1/2} \epsilon$

The second term has a closed form, where $$k$$ is the dimension of $$Z$$

$\mathcal{D}[Q(Z|X)||P(Z)] = KL(N(\mu, \Sigma) || N(0, I)) = \frac{1}{2}(tr(\Sigma) + \mu^T\mu - k -\log\det\Sigma)$
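
A minimal sketch of both pieces, assuming a diagonal covariance $$\Sigma = \text{diag}(\sigma^2)$$ (a common choice, but an assumption here). The function names are hypothetical.

```python
import math
import random

def reparameterize(mu, sigma, rng):
    """z = mu + sigma * eps with eps ~ N(0, I); the randomness sits in eps, not in mu or sigma."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def kl_to_standard_normal(mu, sigma):
    """KL(N(mu, diag(sigma^2)) || N(0, I)) = 0.5 * (tr(Sigma) + mu^T mu - k - log det Sigma)."""
    k = len(mu)
    trace = sum(s * s for s in sigma)
    log_det = sum(math.log(s * s) for s in sigma)
    return 0.5 * (trace + sum(m * m for m in mu) - k - log_det)

rng = random.Random(0)
z = reparameterize([0.5, -1.0], [0.1, 2.0], rng)
print(z, kl_to_standard_normal([0.5, -1.0], [0.1, 2.0]))
```

Since $$z$$ is a deterministic function of $$\mu$$, $$\sigma$$ and the external noise, an autodiff framework can differentiate the reconstruction term with respect to the encoder outputs.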

The likelihood cannot be calculated exactly; the VAE only provides a lower bound on it

### 2.2. Posterior Collapse

The VAE suffers from the posterior collapse problem: when the signal from the posterior $$Q(Z|X)$$ is too weak or too noisy, it collapses towards the prior

$Q(Z|X) \approx P(Z)$

where a subset of $$Z$$ is not meaningfully used and matches the uninformative prior.

The decoder then starts ignoring it and generates samples without any signal from $$X$$, so the reconstructed output becomes independent of $$X$$.

Some works claim this is caused by the KL term in the objective.

The most common approaches to address this are either

• change objective
• weaken decoder

### 2.3. Architecture

Model (VAE-GAN) Attach a discriminator after encoder/decoder.

### 2.4. Loss

Model ($$\beta$$-VAE) attempts to learn a disentangled representation by weighting the KL term with $$\beta > 1$$

$E_{Z \sim Q}[ \log P(X|Z)] - \beta KL[Q(Z|X)||P(Z)]$

Another approach to solve the posterior collapse problem

Model ($$\delta$$-VAE) prevents the KL from falling to zero by constraining the posterior $$Q$$ and prior $$P$$ so that they keep a minimum distance $$KL \geq \delta > 0$$

A trivial choice is to use Gaussians with fixed, different variances. For a non-trivial sequential model, they use an uncorrelated posterior $$q$$ and a correlated AR(1) prior

$P(Z_t | Z_{<t}) = N(Z_t; \alpha Z_{t-1}, \sigma_{\epsilon})$

There is a minimum distance because one is correlated and the other is not

### 2.5. Vanilla VQ-VAE (Discrete Model)

Model (VQ-VAE) VQ-VAE uses discrete latent variables instead of continuous ones. It has a latent embedding table $$e \in R^{K \times D}$$ where $$K$$ is the size of the discrete latent space and $$D$$ is the dimension of each embedding vector.

It models the posterior distribution as a deterministic categorical distribution

$q(Z=k | X) = \begin{cases} 1 \text{ when } k = \text{argmin}_j ||z_e(x) - e_j|| \\ 0 \text{ otherwise} \end{cases}$

The loss function is

$L = -\log p(x|z_q(x)) + \| sg[z_e(x)] - e \|^2 + \beta \|z_e(x) - sg[e] \|^2$

where $$sg$$ is the stop-gradient operator. It consists of

• reconstruction loss: $$-\log p(x|z_q(x))$$
• codebook loss: $$\| sg[z_e(x)] - e \|^2$$, which brings the codebook close to the encoder output and can be replaced with an EMA (exponential moving average) update for stability
• commitment loss: $$\beta \|z_e(x) - sg[e] \|^2$$, which encourages the encoder output to stay close to the codebook
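
The nearest-neighbor lookup and the two codebook-related terms can be sketched as follows. This is a forward-pass-only illustration with hypothetical helper names, and the default `beta=0.25` is an assumed value; the stop-gradient only matters once gradients are taken, which plain Python does not do.

```python
def quantize(z_e, codebook):
    """Return (index k, code e_k) of the codebook entry nearest to the encoder output z_e."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    k = min(range(len(codebook)), key=lambda j: sq_dist(z_e, codebook[j]))
    return k, codebook[k]

def vq_penalties(z_e, e_k, beta=0.25):
    """Forward values of the codebook loss and the (beta-weighted) commitment loss."""
    # sg[.] is the identity in the forward pass, so both terms share the same
    # squared distance; they differ only in which argument would receive gradients.
    dist = sum((a - b) ** 2 for a, b in zip(z_e, e_k))
    return dist, beta * dist

codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]   # K = 3 codes of dimension D = 2
k, e_k = quantize([0.9, 1.2], codebook)
print(k, e_k, vq_penalties([0.9, 1.2], e_k))
```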

VAE vs VQ-VAE

VQ-VAE can be seen as a special case of VAE. The KL term in the original VAE becomes a constant (and can be dropped from the objective) by assuming the prior $$p(z)$$ is uniform, $$p(z=k) = 1/K$$, and the proposal distribution $$q(Z=k|X)$$ is deterministic:

$KL[Q(Z|X) || P(Z)] = \sum_{j=1}^K q(z=j|x) \log \frac{q(z=j|x)}{p(z=j)} = q(z=k |x) \log \frac{q(z=k|x)}{p(z=k)} = \log K$

While training the model, the prior $$p(Z)$$ is kept constant and uniform, $$p(z=k)=1/K$$. After training, an autoregressive model is fit over $$Z$$ so that we can sample using ancestral sampling

In this work, they model the autoregressive latent prior using PixelCNN for image and WaveNet for raw audio

The experiment settings are interesting

Image settings:

• 128x128x3 -> 32x32x1 (K=512)
• 43 times reduction

Audio settings:

• encoder: 6 strided convolutions with stride 2 and window size 4 (K=512)
• 64 times reduction
• decoder: dilated convolutional architecture like the WaveNet decoder
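
The two reduction factors above can be reproduced with a little arithmetic. This assumes 8 bits per color channel for the input image and $$\log_2 K$$ bits per latent code, and reads the audio figure as a reduction in sequence length.

```python
import math

K = 512
bits_per_code = math.log2(K)                 # 9 bits for K = 512

# Image: 128x128x3 values at 8 bits each (assumed) vs. a 32x32 grid of discrete codes.
image_bits = 128 * 128 * 3 * 8
latent_bits = 32 * 32 * bits_per_code
image_ratio = image_bits / latent_bits
print(round(image_ratio, 1))                 # 42.7

# Audio: each of the six stride-2 convolutions halves the sequence length.
audio_ratio = 2 ** 6
print(audio_ratio)                           # 64
```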

Problems:

VQ-VAE also has its own problem, namely low codebook usage due to poor codebook initialization.

### 2.6. Hierarchical VQ-VAE

Model (Hierarchical VQ-VAE)

It has a hierarchical latent code

• top latent code models global information
• bottom latent code, conditioned on the top latent, models local information

256x256 images -> 64x64 (bottom) -> 32x32 (top)

Prior

• top prior: PixelCNN + multihead self attention to capture larger receptive field
• bottom prior: no self-attention

## 3. Normalizing Flow

survey paper

Let $$Z \in R^D$$ be a random variable with a tractable pdf $$p_Z$$, and let $$g$$ be an invertible function (with inverse $$f$$)

$Y = g(Z)$

using change of variable formula, we know

$p_Y(y) = p_Z(f(y)) |\det Df(y) |$

$$g$$ is the generator, which moves the simple distribution to a complicated one; its inverse $$f$$ normalizes the complicated distribution back towards the simple form.

To train the model, we maximize the log-likelihood, which only requires $$f$$

$\log p(\mathcal{D} | \theta) = \sum_i \log p(y_i | \theta) = \sum_i [\log p_Z(f(y_i | \theta)) + \log |\det Df(y_i | \theta) |]$

To sample a new point, we draw $$z \sim p_Z$$ and transform it with $$g(z)$$
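
A one-dimensional sketch of both directions, assuming a standard normal base distribution and a hypothetical affine map $$g(z) = az + b$$. In this case the flow density can be checked against the known answer $$N(b, a^2)$$.

```python
import math
import random

def standard_normal_logpdf(z):
    return -0.5 * (z * z + math.log(2.0 * math.pi))

class AffineFlow1D:
    """g(z) = a * z + b with inverse f(y) = (y - b) / a, for a != 0."""

    def __init__(self, a, b):
        self.a, self.b = a, b

    def g(self, z):
        return self.a * z + self.b

    def f(self, y):
        return (y - self.b) / self.a

    def log_prob(self, y):
        # log p_Y(y) = log p_Z(f(y)) + log |det Df(y)|, and Df(y) = 1 / a here.
        return standard_normal_logpdf(self.f(y)) - math.log(abs(self.a))

    def sample(self, rng):
        # Sampling only needs g: draw z from the base density and push it forward.
        return self.g(rng.gauss(0.0, 1.0))

flow = AffineFlow1D(a=2.0, b=1.0)
print(flow.log_prob(1.5), flow.sample(random.Random(0)))
```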

Normalizing flow vs VAE

Architecture:

• VAE's encoder/decoder is usually not invertible
• NF's encoder/decoder is bijective

Objective:

• VAE maximizes a lower bound of the log-likelihood (ELBO)
• NF is to maximize the exact log-likelihood

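$$f,g$$ control the expressiveness of the model. One way to build complicated bijective functions is to compose simpler ones

$g = g_N \circ g_{N-1} \circ ... \circ g_1$

which has the inverse $$f = f_1 \circ f_2 \circ ... \circ f_N$$ (with $$f_i = g_i^{-1}$$) and Jacobian determinant

$\det Df(y) = \prod_{i=1}^N \det Df_i (x_i)$

where $$x_i$$ is the input to $$f_i$$, i.e. $$x_N = y$$ and $$x_{i-1} = f_i(x_i)$$.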

### 3.1. Linear Flow

Model (linear flow) We first consider the simple linear flow with invertible $$A$$

$g(x) = Ax + b$

Linear flows are limited in their expressiveness: when $$p(z) = N(\mu, \Sigma)$$, then $$p(y) = N(A\mu + b, A\Sigma A^T)$$, which is still Gaussian.

Additionally, computing the determinant of the Jacobian ($$\det A$$) is $$O(D^3)$$, and computing the inverse $$A^{-1}$$ costs the same $$O(D^3)$$.

Constraining the matrix $$A$$ to be triangular, orthogonal, etc. reduces the computational cost.

### 3.2. Planar Flow

Model (planar flow)

$g(x) = x + u h(w^Tx + b)$

where $$u, w \in R^D$$, $$b \in R$$ and $$h$$ is a smooth nonlinearity
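
Because the planar map is a rank-one update of the identity, its Jacobian determinant follows from the matrix determinant lemma as $$1 + h'(a)\, u^T w$$, where $$a$$ is the scalar pre-activation. A small sketch with $$h = \tanh$$ (an assumed choice) and hypothetical function names:

```python
import math

def planar_forward(x, u, w, b):
    """g(x) = x + u * tanh(w^T x + b), with u and w given as lists of length D."""
    a = sum(wi * xi for wi, xi in zip(w, x)) + b
    return [xi + ui * math.tanh(a) for xi, ui in zip(x, u)]

def planar_log_abs_det(x, u, w, b):
    """log |det Dg(x)| = log |1 + h'(a) * u^T w| with h'(a) = 1 - tanh(a)^2."""
    a = sum(wi * xi for wi, xi in zip(w, x)) + b
    h_prime = 1.0 - math.tanh(a) ** 2
    return math.log(abs(1.0 + h_prime * sum(ui * wi for ui, wi in zip(u, w))))

x, u, w, b = [0.3, -0.2], [0.5, 0.1], [1.0, 2.0], 0.1
print(planar_forward(x, u, w, b), planar_log_abs_det(x, u, w, b))
```

This costs $$O(D)$$ per evaluation instead of the $$O(D^3)$$ of a general determinant.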

## 4. Diffusion Model

### 4.1. Score Matching Models

The general score matching description is here

Model (denoising score matching)

Model (sliced score matching)

Model (NCSN, Noise Conditional Score Networks) Contributions are

• perturbing the data using various levels of noise $$\sigma_1, ..., \sigma_L$$
• simultaneously estimating scores corresponding to all noise levels by training a single conditional score network $$s_\theta$$
$s_\theta(x, \sigma) \approx \nabla_x \log q_\sigma(x)$

Sampling is done with annealed Langevin dynamics, which runs Langevin dynamics at each noise scale $$\sigma_i$$ in turn, from the largest to the smallest, initializing each level from the samples of the previous one
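
A toy sketch of annealed Langevin sampling. To keep it self-contained, the learned network $$s_\theta$$ is replaced by the exact score of a one-dimensional Gaussian data distribution perturbed with noise $$\sigma$$ (an assumption made purely for illustration). The step size rule $$\alpha_i = \epsilon\, \sigma_i^2 / \sigma_L^2$$ follows the NCSN recipe as I understand it, and all numeric values are made up.

```python
import math
import random

def annealed_langevin(score, sigmas, steps_per_level, eps, rng, x0):
    """Run Langevin updates at each noise level, from the largest sigma to the smallest."""
    x = x0
    for sigma in sigmas:
        alpha = eps * sigma ** 2 / sigmas[-1] ** 2      # step size for this level
        for _ in range(steps_per_level):
            noise = rng.gauss(0.0, 1.0)
            x = x + 0.5 * alpha * score(x, sigma) + math.sqrt(alpha) * noise
    return x

# Toy data distribution N(m, s^2); its sigma-perturbed version is N(m, s^2 + sigma^2).
m, s = 3.0, 0.5

def toy_score(x, sigma):
    return -(x - m) / (s ** 2 + sigma ** 2)

rng = random.Random(0)
samples = [annealed_langevin(toy_score, [2.0, 1.0, 0.5, 0.1], 100, 0.002, rng,
                             rng.uniform(-8.0, 8.0)) for _ in range(200)]
sample_mean = sum(samples) / len(samples)
print(sample_mean)
```

The large noise levels let chains that start far away move quickly towards the data region, while the small levels refine the samples.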

### 4.2. Denoising Diffusion

Model (DDPM, Denoising Diffusion Probabilistic Models) Diffusion models are latent variable models of the form

$p_\theta(x_0) = \int p_\theta(x_{0:T}) dx_{1:T}$

where $$x_{1:T}$$ are latent variables

Reverse process: the joint distribution $$p_\theta(x_{0:T})$$ is called the reverse process. It is defined as a Markov chain

$p_\theta(x_{0:T}) = p(x_T)\prod_{t=1}^T p _\theta(x_{t-1} | x_t)$

where $$p(x_T) = N(x_T | 0, I)$$ and

$p_\theta(x_{t-1} | x_t) = N(x_{t-1} | \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$

Forward process (diffusion process): the approximate posterior is a fixed Markov chain which gradually adds Gaussian noise to the data according to a variance schedule $$\beta_1, ..., \beta_T$$

$q(x_{1:T} | x_0) = \prod_{t=1}^T q(x_t | x_{t-1})$

where:

$q(x_t | x_{t-1}) = N(x_t | \sqrt{1-\beta_t} x_{t-1}, \beta_t I)$

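The simplified objective is

$L_{simple}(\theta) = E_{t, x_0, \epsilon}[ \| \epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t}x_0 + \sqrt{1- \bar{\alpha}_t}\epsilon, t) \|^2 ]$

where $$\epsilon \sim N(0, I)$$, $$\alpha_t = 1 - \beta_t$$ and $$\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$$.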

This objective is analogous to the loss weighting used by the NCSN denoising score matching model
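
A minimal sketch of one term of the simplified objective, assuming a linear variance schedule with made-up default endpoints and a hypothetical `eps_model` callable standing in for the network. It uses the closed form $$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$$ with $$\bar{\alpha}_t = \prod_{s \leq t}(1-\beta_s)$$, where $$\epsilon$$ is the noise drawn from $$N(0, I)$$.

```python
import math

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced variance schedule beta_1..beta_T (endpoints are assumed values)."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def alpha_bars(betas):
    """Cumulative products abar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def q_sample(x0, t, abar, eps):
    """Jump straight to x_t from x_0 (t is 0-indexed here)."""
    return [math.sqrt(abar[t]) * x + math.sqrt(1.0 - abar[t]) * e for x, e in zip(x0, eps)]

def simple_loss(eps_model, x0, t, abar, eps):
    """Squared error between the true noise and the model's noise prediction at step t."""
    x_t = q_sample(x0, t, abar, eps)
    pred = eps_model(x_t, t)
    return sum((e - p) ** 2 for e, p in zip(eps, pred))

abar = alpha_bars(linear_betas(1000))
x0, eps = [0.5, -1.0], [0.3, -0.7]
# A model that always predicts zero noise pays the squared norm of eps.
print(simple_loss(lambda x_t, t: [0.0] * len(x_t), x0, 500, abar, eps))
```

In training, $$t$$, $$x_0$$ and $$\epsilon$$ would be sampled afresh for every example and the loss averaged over the batch.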

Model (improved diffusion) The improvements are

The noise schedule is cosine instead of linear, which adds noise more slowly

The variance $$\Sigma_\theta(x_t, t)$$ is learned instead of fixed to $$\sigma_t^2I$$, where $$v$$ is a learned output that interpolates between $$\beta_t$$ and $$\tilde{\beta}_t$$ in the log domain

$\Sigma_\theta(x_t, t) = \exp(v \log \beta_t + (1-v) \log \tilde{\beta}_t)$
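
A small numeric sketch of the learned variance, interpolating between $$\beta_t$$ and $$\tilde{\beta}_t$$ in the log domain. It assumes the usual posterior variance $$\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t$$; the function names and the three-step schedule are made up.

```python
import math

def posterior_beta_tilde(betas, t):
    """beta_tilde_t = (1 - abar_{t-1}) / (1 - abar_t) * beta_t, for 0-indexed t >= 1."""
    abar_prev = math.prod(1.0 - b for b in betas[:t])
    abar_t = abar_prev * (1.0 - betas[t])
    return (1.0 - abar_prev) / (1.0 - abar_t) * betas[t]

def learned_variance(v, beta_t, beta_tilde_t):
    """Log-domain interpolation: v = 1 gives beta_t, v = 0 gives beta_tilde_t."""
    return math.exp(v * math.log(beta_t) + (1.0 - v) * math.log(beta_tilde_t))

betas = [0.01, 0.02, 0.03]
beta_tilde = posterior_beta_tilde(betas, 2)
print(beta_tilde, learned_variance(0.5, betas[2], beta_tilde))
```

With $$v \in [0, 1]$$ the result always lies between the two extremes, which are the lower and upper choices for the fixed variance.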

### 4.3. Sampling

Model (DDIM, denoising diffusion implicit model) faster sampling with a non-Markovian diffusion process

### 4.4. Conditional Diffusion

Model (Guided diffusion, classifier-guided)

Model (classifier-free guidance)

Model (GLIDE, text-to-image)

Model (SDEdit)

Model (bit diffusion, discrete diffusion)

Model (DreamFusion, 3d diffusion, text to 3d)

### 4.5. Latent Diffusion

Model (latent diffusion, stable diffusion)

Run diffusion in the latent space; the diffused latent vector is then decoded into an image

## 5. GAN

Goodfellow's tutorial

Let $$X \in \mathcal{X}$$ be the random variable of interest, $$P_X$$ its distribution and $$X_1, ..., X_n$$ a training sample.

We have two main components:

• generator: a map $$g_\theta: \mathcal{Z} \to \mathcal{X}$$. It takes random Gaussian noise $$Z$$ and generates outputs $$g_\theta(Z)$$. Its goal is to choose $$\theta$$ such that the distribution of $$g_\theta(Z)$$ is close to that of $$X$$
• discriminator: a map $$D_w: \mathcal{X} \to [0, 1]$$. Its goal is to assign 1 to samples from the real distribution $$P_X$$ and 0 to samples from the generated distribution $$P_\theta$$

The parameters $$(\theta, w)$$ are obtained by solving the min-max problem

$\min_\theta \max_w E [ \log(D_w(X)) + \log(1-D_w(g_\theta(Z))) ]$

With an optimal discriminator, this is equivalent (up to constants) to minimizing the JS divergence

$\min_\theta JS(P_X \| P_\theta )$

This means we choose the $$P_\theta$$ closest to the target distribution $$P_X$$ in JS divergence
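
The link to the JS divergence can be checked numerically on small discrete distributions. Plugging the optimal discriminator $$D^*(x) = \frac{p_X(x)}{p_X(x) + p_\theta(x)}$$ into the objective gives $$2\,JS(P_X \| P_\theta) - \log 4$$. The distributions below are made up for illustration.

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    mix = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)

def value_at_optimal_discriminator(p_data, p_gen):
    """E_{p_data}[log D*(x)] + E_{p_gen}[log(1 - D*(x))] with D* = p_data / (p_data + p_gen)."""
    total = 0.0
    for pd, pg in zip(p_data, p_gen):
        if pd + pg == 0:
            continue                      # outcome has no mass under either distribution
        d_star = pd / (pd + pg)
        if pd > 0:
            total += pd * math.log(d_star)
        if pg > 0:
            total += pg * math.log(1.0 - d_star)
    return total

p_data, p_gen = [0.5, 0.3, 0.2], [0.2, 0.2, 0.6]
print(value_at_optimal_discriminator(p_data, p_gen), 2 * js(p_data, p_gen) - math.log(4))
```

When the generator matches the data exactly, the JS term vanishes and the value is $$-\log 4$$.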

### 5.2. Architecture

Model (DCGAN, deep convolutional GAN) uses transposed convolutions for upsampling

WaveGAN is an application of DCGAN (2D) to audio generation (1D)

Model (SAGAN, Self-Attention GAN)

Adds self-attention to the GAN so that both generator and discriminator can model long-range relations. $$f,g,h$$ in the figure correspond to $$k,q,v$$ (key, query, value)

Model (BiGAN) uses a discriminator to distinguish whether a pair $$(x,z)$$ comes from the encoder or the decoder

### 5.3. Representation

Model (InfoGAN) Modifies the GAN to encourage it to learn meaningful representations by maximizing the mutual information between a small subset of the noise variables and the observations.

The input noise vector is decomposed into $$z$$, the incompressible noise, and $$c$$, the latent code which encodes salient semantic features. The goal is to maximize $$I(c; x=G(z,c))$$, which cannot be computed directly because $$P(c|x)$$ is unknown.

Instead we lower bound this using an auxiliary distribution $$Q(c|x)$$ to approximate $$P(c|x)$$

$I(c; G(z,c)) \geq E_{x \sim G(z, c)} E_{c' \sim P(c|x)} [\log Q(c'|x)] + H(c)$

By ignoring the constant second term and rewriting the first term, the lower bound becomes

$L_I(G,Q) = E_{c \sim P(c), x \sim G(z,c)} [\log Q(c|x)]$

### 5.4. Loss

Model (spectral normalization) Stabilizes the training of the discriminator by normalizing each weight matrix by its spectral norm $$\sigma(W)$$ (its largest singular value) so that the Lipschitz constant is controlled

$\bar{W} = W/\sigma(W)$
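
The spectral norm can be estimated with power iteration. The sketch below is a plain-Python illustration with hypothetical function names; it runs many iterations for clarity, whereas practical implementations typically reuse the vector across training steps and run only a few iterations per update. It assumes a non-zero matrix.

```python
import math

def spectral_norm(W, iters=100):
    """Estimate the largest singular value of W (list of rows) by power iteration on W^T W."""
    rows, cols = len(W), len(W[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iters):
        u = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]   # u = W v
        v = [sum(W[i][j] * u[i] for i in range(rows)) for j in range(cols)]   # v = W^T u
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    u = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return math.sqrt(sum(x * x for x in u))       # ||W v|| for the unit vector v

def spectrally_normalize(W):
    sigma = spectral_norm(W)
    return [[w / sigma for w in row] for row in W]

W = [[3.0, 0.0], [0.0, 1.0]]
print(spectral_norm(W), spectrally_normalize(W))
```

After normalization the linear map has spectral norm 1, so it is 1-Lipschitz with respect to the Euclidean norm.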

Model (WGAN, Wasserstein GAN)

The main point of WGAN is to replace the JS divergence with the $$L^1$$-Wasserstein distance, because

• the Wasserstein distance respects the geometry of the underlying distributions
• it captures the distance between two distributions even when their supports do not intersect

Non-intersecting supports are common in high-dimensional applications where the target distribution lies on a low-dimensional manifold

Recall the $$L^1$$-Wasserstein distance is

$W_1(P_X, P_Y) = \inf_\pi E_{(X,Y) \sim \pi} [ \| X - Y \| ]$

where the infimum is over all couplings $$\pi$$ of $$(X,Y)$$, i.e. joint distributions with marginals $$P_X$$ and $$P_Y$$.
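
For intuition, here is a minimal sketch for two equal-size empirical samples on the real line, where the optimal coupling simply pairs sorted values. The function name is hypothetical.

```python
def wasserstein_1d(xs, ys):
    """W_1 between two equal-size empirical samples on the real line."""
    assert len(xs) == len(ys)
    # In one dimension the optimal coupling matches the i-th smallest x with the i-th smallest y.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

# Two point masses with disjoint supports: the distance shrinks smoothly as they approach.
for theta in [2.0, 1.0, 0.5]:
    print(theta, wasserstein_1d([0.0, 0.0, 0.0], [theta, theta, theta]))
```

For the same pair of point masses the JS divergence stays at $$\log 2$$ for every $$\theta \neq 0$$, so it provides no useful gradient with respect to $$\theta$$.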

It can be shown that the Wasserstein distance $$W_1(P_X, P_{g_\theta})$$ is continuous with respect to $$\theta$$ if $$g_\theta$$ is continuous with respect to $$\theta$$

To minimize $$W_1(P_X, P_\theta)$$, we use the Kantorovich-Rubinstein duality

$W_1(P_X, P_\theta) = \sup_{\|D \|_L \leq 1} E [D(X) - D(g_\theta(Z))]$

where the sup is over functions whose Lipschitz constant is at most 1. Plugging this into the minimization, we get

$\min_\theta \max_w E [ D_w(X) - D_w(g_\theta(Z))]$

subject to $$\| D_w \|_L \leq K$$ for some constant $$K$$, which only rescales the distance by $$K$$.

In practice, the constraint is enforced by bounding the infinity norm of the weights (known as weight clipping)

Model (WGAN-GP, WGAN + Gradient Penalty)

Model (LS-GAN) uses a least-squares loss instead of the sigmoid cross entropy in the discriminator, which

• generates higher quality samples
• makes the learning process more stable

### 5.5. Application-focused GAN

Model (Cycle GAN)

## 6. Energy-based Model

The partition function chapter of the deep learning book has a good coverage of those methods.

An energy-based model defines an energy function $$E(X)$$ and models the distribution using the Boltzmann distribution

$P(X) = \frac{\exp(-E(X))}{Z}$

where $$Z$$ is the partition function

Model (Boltzmann Machine)

$E(x) = x^TWx$
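
For a handful of binary units the Boltzmann distribution can be computed by brute-force enumeration, which also shows why the partition function is the bottleneck: the sum has $$2^D$$ terms. The sketch assumes binary states $$x \in \{0,1\}^D$$ and a made-up symmetric $$W$$.

```python
import itertools
import math

def energy(x, W):
    """E(x) = x^T W x for a state x and a square weight matrix W (list of rows)."""
    n = len(x)
    return sum(x[i] * W[i][j] * x[j] for i in range(n) for j in range(n))

def boltzmann_distribution(W):
    """Enumerate all binary states and normalize exp(-E(x)) by the partition function."""
    states = list(itertools.product([0, 1], repeat=len(W)))
    weights = [math.exp(-energy(x, W)) for x in states]
    Z = sum(weights)                      # partition function, 2^D terms
    return {x: w / Z for x, w in zip(states, weights)}

W = [[0.0, -1.0], [-1.0, 0.0]]            # negative coupling lowers the energy of (1, 1)
probs = boltzmann_distribution(W)
print(probs)
```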

Model (Restricted Boltzmann Machine)

$E(x, z) = x^TWz$