# 0x422 Generative Model

- 1. Autoregressive Model
- 2. Variational Autoencoder
- 3. Normalizing Flow
- 4. Diffusion Model
- 5. Adversarial Model
- 6. Energy-based Model
- 7. Reference

The Deep Generative Modeling book [4] has a good comparison across all of these models.

## 1. Autoregressive Model

Neural autoregressive (AR) models factorize the joint distribution into a sequence of conditional probabilities, then use a neural network to model each conditional:

\[ p(x) = \prod_{d=1}^{D} p(x_d | x_{<d}) \]

The main drawback of autoregressive models is slow generation, because samples are produced one dimension at a time.

Modeling each \(p(x_d | x_{<d})\) separately requires \(D\) different models, which is infeasible. Instead we use a single shared model (i.e. an autoregressive model).

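To reduce the complexity, one simple idea is to use a finite memory, for example only the last two variables:

\[ p(x) = p(x_1) p(x_2 | x_1) \prod_{d=3}^{D} p(x_d | x_{d-1}, x_{d-2}) \]

where the trigram model \(p(x_d | x_{d-1}, x_{d-2})\) is modeled using an MLP.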
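A minimal sketch of finite-memory generation. To stay dependency-free it swaps the MLP for a count-based trigram table, which is an assumption of this example and not the model described above; the names `fit_trigram` and `sample` are hypothetical. It also shows why generation is sequential: each new symbol needs the two symbols before it.

```python
import random
from collections import Counter, defaultdict

def fit_trigram(text):
    """Count-based estimate of p(x_d | x_{d-2}, x_{d-1}) from one training string."""
    counts = defaultdict(Counter)
    for a, b, c in zip(text, text[1:], text[2:]):
        counts[(a, b)][c] += 1
    return counts

def sample(counts, prefix, max_new, rng):
    """Ancestral sampling: each new symbol depends only on the previous two."""
    out = list(prefix)
    for _ in range(max_new):
        context = (out[-2], out[-1])
        if context not in counts:      # unseen context: stop early
            break
        symbols = list(counts[context])
        weights = [counts[context][s] for s in symbols]
        out.append(rng.choices(symbols, weights=weights)[0])
    return "".join(out)

counts = fit_trigram("abcabcabd")
generated = sample(counts, "ab", max_new=10, rng=random.Random(0))
```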

### 1.1. Long-Range Memory with RNN

An RNN can be used as an autoregressive model.

**Model (char-rnn)** The character-level language model models the character sequence \(\mathbf{x}\) with an RNN as follows

\[ h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t), \quad p(x_{t+1} | x_{\leq t}) = \text{softmax}(W_{hy} h_t) \]

Karpathy's blog [2] shows that this model can be used to generate many different kinds of sequences such as Shakespeare, Wikipedia, XML, LaTeX and source code.

This model can also generate non-text objects such as images by treating each pixel as a character.

### 1.2. Masking-based Models

**Model (Masking-based autoregressive model, MADE)** An MLP-based autoencoder can be turned into an autoregressive model by removing (masking) some connections, so that output \(d\) depends only on the inputs \(x_{<d}\).
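The masking idea can be sketched for a single hidden layer as follows. The degree-based mask rule is the one from the MADE paper as far as I recall it; the helper names and the sizes `D=4`, `H=8` are hypothetical. The connectivity check confirms that output \(d\) never receives signal from inputs \(x_{\geq d}\).

```python
import random

def made_masks(D, H, rng):
    """Masks for a one-hidden-layer MADE over D inputs with H hidden units."""
    m_in = list(range(1, D + 1))                        # input degrees 1..D
    m_hid = [rng.randint(1, D - 1) for _ in range(H)]   # hidden degrees in 1..D-1
    # hidden unit k may see input j only if m_hid[k] >= m_in[j]
    M1 = [[1 if m_hid[k] >= m_in[j] else 0 for j in range(D)] for k in range(H)]
    # output d may see hidden unit k only if m_in[d] > m_hid[k]
    M2 = [[1 if m_in[d] > m_hid[k] else 0 for k in range(H)] for d in range(D)]
    return M1, M2

def connectivity(M1, M2):
    """C[d][j] > 0 iff some path connects input j to output d."""
    D, H = len(M2), len(M1)
    return [[sum(M2[d][k] * M1[k][j] for k in range(H)) for j in range(D)]
            for d in range(D)]

M1, M2 = made_masks(D=4, H=8, rng=random.Random(0))
C = connectivity(M1, M2)
```

The first output has no incoming connections at all, which matches \(p(x_1)\) being unconditional.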

**Model (WaveNet)** WaveNet is a 1D convolutional AR model.

**Model (PixelCNN)** PixelCNN is a 2D convolutional AR model. Unlike a normal CNN, which convolves over all neighboring pixels, PixelCNN masks out the pixels it has not yet seen (e.g. under the raster-scan ordering).
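A small sketch of the raster-scan mask applied to a \(k \times k\) convolution kernel. The "A"/"B" naming (type A also hides the centre pixel and is used in the first layer, type B keeps it) follows the PixelCNN paper as I understand it and is not stated in these notes; `causal_mask` is a hypothetical helper.

```python
def causal_mask(k, mask_type):
    """k x k kernel mask for raster-scan order; type 'A' also hides the centre pixel."""
    c = k // 2
    mask = [[0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            above = i < c                  # rows already generated
            left = i == c and j < c        # earlier pixels in the current row
            centre = i == c and j == c
            if above or left or (centre and mask_type == "B"):
                mask[i][j] = 1
    return mask

mask_a = causal_mask(3, "A")
mask_b = causal_mask(3, "B")
```

In a real layer the mask would be multiplied elementwise with the kernel weights before each convolution.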

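**Model (PixelCNN++)** OpenAI's implementation of PixelCNN with several improvements:

- Use a mixture of logistics (e.g. 5 components) to model the discretized distribution instead of a 256-way softmax, because it
    - saves memory
    - allows dense gradient flow, which speeds up training
- pixel conditioning is simplified
- short-cut connections like the U-Net

The mixture of logistics is as follows:

\[ \nu \sim \sum_{i=1}^{K} \pi_i \, \text{logistic}(\mu_i, s_i) \]

The PMF is modeled as

\[ P(x | \pi, \mu, s) = \sum_{i=1}^{K} \pi_i \left[ \sigma\left(\frac{x + 0.5 - \mu_i}{s_i}\right) - \sigma\left(\frac{x - 0.5 - \mu_i}{s_i}\right) \right] \]

where \(\sigma\) is the logistic sigmoid.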
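A minimal sketch of the discretized logistic mixture PMF for 8-bit pixel values. Each bin's probability is the difference of the logistic CDF at the bin edges \(x \pm 0.5\); following the PixelCNN++ paper, the edge bins 0 and 255 absorb the tails. The two-component parameters below are made up for illustration.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def mixture_pmf(x, pis, mus, scales):
    """P(x) for an integer pixel value x in 0..255 under a discretized logistic mixture."""
    total = 0.0
    for pi, mu, s in zip(pis, mus, scales):
        upper = 1.0 if x == 255 else sigmoid((x + 0.5 - mu) / s)   # right tail goes to bin 255
        lower = 0.0 if x == 0 else sigmoid((x - 0.5 - mu) / s)     # left tail goes to bin 0
        total += pi * (upper - lower)
    return total

# hypothetical mixture parameters
pis, mus, scales = [0.3, 0.7], [40.0, 180.0], [5.0, 12.0]
pmf = [mixture_pmf(x, pis, mus, scales) for x in range(256)]
```

Only \(3K\) numbers per pixel channel parameterize all 256 probabilities, which is where the memory saving over a 256-way softmax comes from.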

## 2. Variational Autoencoder

The idea behind the latent variable model is to assume a lower-dimensional latent space and the following generative process

\[ z \sim P(Z), \quad x \sim P(X | Z = z) \]

We want the low-dimensional latent space \(Z\) to be simple and easy to sample from (e.g. Gaussian), and we maximize the evidence over the dataset \(X \sim \mathcal{D}\)

\[ P(X) = \int P(X | z) P(z) dz \]

### 2.1. Vanilla VAE

A VAE implements this idea with the following modeling choices.

**Model (\(P(Z)\))** the prior in VAE is a standard Gaussian

\[ P(Z) = N(0, I) \]

Note this is a fixed prior, in contrast with VQ-VAE's learnt prior.

**Model (\(P(X|Z)\))** the likelihood is modeled using a deep neural network function \(f(Z; \theta)\): VAE approximates \(X \approx f(Z; \theta)\) and measures \(P(X|Z)\) by penalizing the deviation under a Gaussian distribution

\[ P(X | Z) = N(X | f(Z; \theta), \sigma^2 I) \]

If \(X\) is discrete, a discrete distribution such as the Bernoulli can be used instead.

Probabilistic PCA

Recall that pPCA is a simplified version of VAE: the prior is the same

\[ P(Z) = N(0, I) \]

and the likelihood function \(f\) is linear

\[ P(X | Z) = N(X | WZ + \mu, \sigma^2 I) \]

The graphical model of VAE is \(Z \to X\): each observation \(X\) is generated from its own latent \(Z\).

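The integration in the evidence is very expensive,

\[ P(X) = \int P(X | z) P(z) dz \]

so we maximize the lower bound of the evidence (ELBO) instead of the evidence itself.

Recall that the standard ELBO form (RHS usually denoted \(\mathcal{L}(Q)\)) is

\[ \log P(X) \geq E_{Z \sim Q}[\log P(X, Z)] - E_{Z \sim Q}[\log Q(Z | X)] \]

where the right-hand expression is the ELBO. The ELBO can be further decomposed into two terms

\[ \mathcal{L}(Q) = E_{Z \sim Q}[\log P(X | Z)] - KL(Q(Z | X) \| P(Z)) \]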

We model \(Q(Z|X)\) as

\[ Q(Z | X) = N(Z | \mu(X), \Sigma(X)) \]

where \(\mu(X), \Sigma(X)\) are implemented using neural networks.

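Look at the formula again,

\[ \mathcal{L}(Q) = E_{z \sim Q}[\log P(X | z)] - KL(Q(Z | X) \| P(Z)) \]

The first term on the RHS has a sampling step \(z \sim Q\) through which gradients cannot be backpropagated. Training instead uses the reparametrization trick, where we sample \(\epsilon \sim N(0, I)\) and transform \(\epsilon\) into \(z\) (instead of sampling \(z \sim Q\) directly)

\[ z = \mu(X) + \Sigma^{1/2}(X) \epsilon \]

The second term is simple to compute because both distributions are Gaussian (\(k\) is the dimension of \(Z\))

\[ KL(N(\mu(X), \Sigma(X)) \| N(0, I)) = \frac{1}{2} \left( \text{tr}(\Sigma(X)) + \mu(X)^T \mu(X) - k - \log \det \Sigma(X) \right) \]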

The likelihood cannot be calculated exactly; only the lower bound is available.

### 2.2. Posterior Collapse

VAE suffers from the posterior collapse problem: when the signal from the posterior \(Q(Z|X)\) is too weak or too noisy, the posterior collapses towards the prior

\[ Q(Z_i | X) \approx P(Z_i) \]

where a subset of \(Z\) is not meaningfully used and matches the uninformative prior.

The decoder then starts ignoring it and generates samples without signal from \(X\), so the reconstructed output becomes independent of \(X\).

Some works claim this is because of the KL term in the objective.

The most common approaches to solve this are either

- change objective
- weaken decoder

### 2.3. Architecture

**Model (VAE-GAN)** Attach a discriminator after encoder/decoder.

### 2.4. Loss

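**Model (\(\beta\)-VAE)** attempts to learn a disentangled representation by weighting the KL term with \(\beta > 1\)

\[ \mathcal{L} = E_{z \sim Q}[\log P(X | z)] - \beta KL(Q(Z | X) \| P(Z)) \]

Another approach addresses the posterior collapse problem directly: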

**Model (\(\delta\)-VAE)** prevents the KL from falling to zero by constraining the posterior \(Q\) and prior \(P\) such that they have a minimum distance \(KL(Q \| P) \geq \delta > 0\)

A trivial choice is to use Gaussians with fixed, different variances. For a non-trivial sequential model, they use an uncorrelated posterior \(q\) and a correlated AR(1) prior.

There is a minimum distance because one is correlated and the other is not.

### 2.5. Vanilla VQ-VAE (Discrete Model)

**Model (VQ-VAE)** VQ-VAE uses discrete latent variables instead of continuous ones. It has a latent embedding \(e \in R^{K \times D}\) where \(K\) is the size of the discrete latent space and \(D\) is the hidden dimension.

It models the posterior distribution as a deterministic (one-hot) categorical distribution

\[ q(z = k | x) = \begin{cases} 1 & \text{if } k = \arg\min_j \| z_e(x) - e_j \|_2 \\ 0 & \text{otherwise} \end{cases} \]

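The loss function is

\[ L = \log p(x | z_q(x)) + \| sg[z_e(x)] - e \|_2^2 + \beta \| z_e(x) - sg[e] \|_2^2 \]

where \(sg\) is the stop-gradient operator. It consists of

- reconstruction loss: \(\log p(x|z_q(x))\)
- codebook loss: \(\| sg[z_e(x)] - e \|_2^2\), which brings the codebook close to the encoder output; it can be replaced with an EMA (exponential moving average) update for stability
- commitment loss: \(\beta \| z_e(x) - sg[e] \|_2^2\), which encourages the encoder output to stay close to the codebook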
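The quantization step and the values of the two auxiliary losses can be sketched as below. This is only the forward computation: stop-gradient changes which parameters receive gradients, not the loss values, so it does not appear here. The toy codebook and the weight `beta=0.25` are assumptions for illustration.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(z_e, codebook):
    """Return (index k, code e_k) of the codebook vector nearest to the encoder output."""
    k = min(range(len(codebook)), key=lambda j: sq_dist(z_e, codebook[j]))
    return k, codebook[k]

def auxiliary_losses(z_e, e_k, beta=0.25):
    """Values of the codebook loss and the beta-weighted commitment loss."""
    d = sq_dist(z_e, e_k)
    return d, beta * d

codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]   # K=3 codes of dimension D=2
z_e = [0.9, 1.2]                                    # stand-in encoder output
k, z_q = quantize(z_e, codebook)
codebook_loss, commitment_loss = auxiliary_losses(z_e, z_q)
```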

VAE vs VQ-VAE

VQ-VAE can be seen as a special case of VAE. The KL term in the original VAE becomes a constant by assuming the prior \(p(z)\) is uniform, \(p(z=k) = 1/K\), and the proposal distribution \(q(z=k|x)\) is deterministic:

\[ KL(q(z | x) \| p(z)) = \sum_k q(z = k | x) \log \frac{q(z = k | x)}{p(z = k)} = \log K \]

While training the model, the prior \(p(z)\) is kept constant and uniform, \(p(z=k)=1/K\). After training, an autoregressive model is fit over \(z\), so that we can sample using ancestral sampling.

In this work, they model the autoregressive latent prior using PixelCNN for images and WaveNet for raw audio.

The experiment settings are interesting

Image settings:

- 128x128x3 -> 32x32x1 (K=512)
- 43 times reduction

Audio settings:

- encoder: 6 convolution with stride 2 and window 4 (K=512)
- 64 times reduction
- decoder: dilated convolutional architecture like the WaveNet decoder
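The reduction factors quoted in both settings can be checked with a little arithmetic. The sketch assumes 8 bits per colour channel for the input image, which the notes do not state.

```python
import math

K = 512
bits_per_code = math.log2(K)                  # 9 bits to index one of 512 codes

image_bits = 128 * 128 * 3 * 8                # assumed 8 bits per colour channel
latent_bits = 32 * 32 * 1 * bits_per_code
image_reduction = image_bits / latent_bits    # about 42.7, rounded to 43 above

audio_reduction = 2 ** 6                      # six stride-2 convolutions halve the length six times
```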

Problems:

VQ-VAE also has its own problem, namely low codebook usage due to poor codebook initialization.

### 2.6. Hierarchical VQ-VAE

**Model (Hierarchical VQ-VAE)**

It has a hierarchical latent code:

- *top latent code* models global information
- *bottom latent code*, conditioned on the top latent, models local information

256x256 images -> 64x64 (bottom) -> 32x32 (top)

Prior

- top prior: PixelCNN + multihead self attention to capture larger receptive field
- bottom prior: no self-attention

## 3. Normalizing Flow

Let \(Z \in R^D\) be a tractable random variable with pdf \(p_Z\), and let \(g\) be an invertible function (with inverse \(f\))

\[ X = g(Z), \quad Z = f(X) \]

Using the change of variables formula, we know

\[ p_X(x) = p_Z(f(x)) \left| \det Df(x) \right| \]

where \(Df(x)\) is the Jacobian of \(f\) at \(x\). \(g\) is the generator, which moves the simple distribution to a complicated distribution; its inverse \(f\) normalizes the complicated distribution towards a simpler form.

To train a model, we optimize the log-likelihood, which only uses \(f\)

\[ \log p_X(x) = \log p_Z(f(x)) + \log \left| \det Df(x) \right| \]

To sample a new point, we sample \(z \sim p_Z\) and transform it using \(g(z)\).
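A one-dimensional sketch of the two directions, using the hypothetical choice \(g(z) = e^z\) with a standard normal base (so \(X\) is log-normal). Density evaluation only touches \(f\), sampling only touches \(g\). The density is checked against a finite difference of the known CDF \(\Phi(\log x)\).

```python
import math
import random

def log_standard_normal(z):
    return -0.5 * (z * z + math.log(2.0 * math.pi))

def g(z):
    """Generator direction: base sample -> data."""
    return math.exp(z)

def f(x):
    """Normalizing direction: data -> base, the inverse of g."""
    return math.log(x)

def log_px(x):
    """log p_X(x) = log p_Z(f(x)) + log |f'(x)|, with f'(x) = 1/x."""
    return log_standard_normal(f(x)) - math.log(x)

def cdf(x):
    """P(X <= x) = Phi(log x), used only to check the density."""
    return 0.5 * (1.0 + math.erf(math.log(x) / math.sqrt(2.0)))

x, h = 1.7, 1e-5
density_from_flow = math.exp(log_px(x))
density_from_cdf = (cdf(x + h) - cdf(x - h)) / (2 * h)
new_sample = g(random.Random(0).gauss(0.0, 1.0))
```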

Normalizing flow vs VAE

Architecture:

- VAE's encoder/decoder is usually not invertible
- NF's encoder/decoder is bijective

Objective:

- VAE maximizes a lower bound of the log-likelihood (ELBO)
- NF maximizes the exact log-likelihood

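\(f, g\) control the expressiveness of the model. One way to build complicated bijective functions is to compose simple ones

\[ g = g_N \circ g_{N-1} \circ \cdots \circ g_1 \]

which has the inverse \(f = f_1 \circ f_2 \circ \cdots \circ f_N\) (with \(f_i = g_i^{-1}\)) and determinant

\[ \det Df(x) = \prod_{i=1}^{N} \det Df_i(x_i) \]

where \(x_i\) denotes the input of \(f_i\) along the chain.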

### 3.1. Linear Flow

**Model (linear flow)** We first consider the simple linear flow with invertible \(A\)

\[ g(z) = Az + b \]

Linear flows are limited in expressiveness: when \(p(z) = N(\mu, \Sigma)\), then \(x = g(z)\) is still Gaussian with \(p(x) = N(A\mu + b, A\Sigma A^T)\).

Additionally, computing the determinant of the Jacobian (\(\det A\)) costs \(O(D^3)\), and computing the inverse \(A^{-1}\) also costs \(O(D^3)\).

Constraining the matrix \(A\) to be triangular, orthogonal, etc. reduces the computational cost.
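A 2D sketch with a lower-triangular \(A\) (the numbers are made up): the log-determinant is just the sum of log-diagonals and the inverse is a forward substitution. With a standard normal base, the flow density should agree with the closed-form Gaussian \(N(b, AA^T)\), which the last function evaluates directly.

```python
import math

A = [[2.0, 0.0], [1.0, 3.0]]    # hypothetical lower-triangular, invertible matrix
b = [1.0, -1.0]

def f(x):
    """Inverse of g(z) = A z + b by forward substitution."""
    z0 = (x[0] - b[0]) / A[0][0]
    z1 = (x[1] - b[1] - A[1][0] * z0) / A[1][1]
    return [z0, z1]

# for a triangular matrix, log|det A| is the sum of the log-diagonal entries
log_abs_det_A = math.log(abs(A[0][0])) + math.log(abs(A[1][1]))

def log_px_flow(x):
    z = f(x)
    log_pz = -0.5 * (z[0] ** 2 + z[1] ** 2) - math.log(2.0 * math.pi)
    return log_pz - log_abs_det_A

def log_px_gaussian(x):
    """Direct log density of N(b, A A^T) using 2x2 formulas (valid because A[0][1] == 0)."""
    s00 = A[0][0] ** 2
    s01 = A[0][0] * A[1][0]
    s11 = A[1][0] ** 2 + A[1][1] ** 2
    det = s00 * s11 - s01 ** 2
    d0, d1 = x[0] - b[0], x[1] - b[1]
    quad = (s11 * d0 * d0 - 2.0 * s01 * d0 * d1 + s00 * d1 * d1) / det
    return -0.5 * quad - math.log(2.0 * math.pi) - 0.5 * math.log(det)

x = [0.3, 2.0]
flow_value, gaussian_value = log_px_flow(x), log_px_gaussian(x)
```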

### 3.2. Planar Flow

**Model (planar flow)**

\[ g(z) = z + u h(w^T z + b) \]

where \(u, w \in R^D\) and \(b \in R\) are parameters and \(h\) is a nonlinearity
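A sketch of why planar flows are cheap: by the matrix determinant lemma the Jacobian determinant reduces to \(1 + h'(w^T z + b) \, u^T w\), an \(O(D)\) computation. The example assumes \(h = \tanh\) and made-up parameters, and compares the closed form with a finite-difference Jacobian in 2D.

```python
import math

u, w, b = [0.5, -0.3], [1.0, 2.0], 0.1     # hypothetical parameters with u.w >= -1

def planar(z):
    a = math.tanh(w[0] * z[0] + w[1] * z[1] + b)
    return [z[0] + u[0] * a, z[1] + u[1] * a]

def log_abs_det(z):
    """log |1 + h'(w.z + b) * (u.w)| with h = tanh, so h' = 1 - tanh^2."""
    a = math.tanh(w[0] * z[0] + w[1] * z[1] + b)
    return math.log(abs(1.0 + (1.0 - a * a) * (u[0] * w[0] + u[1] * w[1])))

def numerical_log_abs_det(z, h=1e-6):
    """Central-difference Jacobian; cols[i][r] approximates d g_r / d z_i."""
    cols = []
    for i in range(2):
        zp, zm = list(z), list(z)
        zp[i] += h
        zm[i] -= h
        gp, gm = planar(zp), planar(zm)
        cols.append([(gp[0] - gm[0]) / (2 * h), (gp[1] - gm[1]) / (2 * h)])
    det = cols[0][0] * cols[1][1] - cols[1][0] * cols[0][1]
    return math.log(abs(det))

z = [0.2, -0.4]
closed_form, numeric = log_abs_det(z), numerical_log_abs_det(z)
```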

### 3.3. RealNVP

### 3.4. Inverse Autoregressive Flow

## 4. Diffusion Model

Links

- Lecture Video by Jascha
- nnabla lecture

### 4.1. Score Matching Models

The general score matching description is here

**Model (denoising score matching)**

**Model (sliced score matching)**

**Model (NCSN, Noise Conditional Score Networks)** Contributions are

- perturbing the data using various levels of noise \(\sigma_1, ..., \sigma_L\)
- simultaneously estimating scores corresponding to all noise levels by training a single conditional score network \(s_\theta\)

Sampling is done by **annealed Langevin dynamics**, which runs Langevin dynamics for each noise scale \(\sigma_i\) in turn, from the largest to the smallest, initializing each run with the final sample of the previous scale
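A toy sketch of annealed Langevin dynamics in one dimension. Instead of a trained network \(s_\theta\) it uses the exact score of a Gaussian data distribution perturbed by noise \(\sigma\), which is an assumption made so the example runs without training. The step-size rule \(\alpha_i = \epsilon \sigma_i^2 / \sigma_L^2\) and \(\epsilon = 2 \times 10^{-5}\) follow the NCSN paper as I recall it; the data mean, data std and noise levels are made up.

```python
import math
import random

DATA_MEAN, DATA_STD = 3.0, 0.5      # hypothetical 1D Gaussian data distribution

def score(x, sigma):
    """Exact score of N(DATA_MEAN, DATA_STD^2 + sigma^2), standing in for s_theta(x, sigma)."""
    return -(x - DATA_MEAN) / (DATA_STD ** 2 + sigma ** 2)

def annealed_langevin(sigmas, eps, steps, rng):
    x = rng.uniform(-10.0, 10.0)                     # arbitrary starting point
    for sigma in sigmas:                             # from the largest to the smallest noise
        alpha = eps * sigma ** 2 / sigmas[-1] ** 2   # step size shrinks with the noise level
        for _ in range(steps):
            x += 0.5 * alpha * score(x, sigma) + math.sqrt(alpha) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
sigmas = [10.0 * (0.01 / 10.0) ** (i / 9) for i in range(10)]   # geometric, 10 down to 0.01
samples = [annealed_langevin(sigmas, eps=2e-5, steps=100, rng=rng) for _ in range(500)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

The large noise levels let the chain travel quickly from a poor starting point; the small ones refine the sample.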

### 4.2. Denoising Diffusion

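**Model (DDPM, Denoising Diffusion Probabilistic Models)** Diffusion models are latent variable models of the form

\[ p_\theta(x_0) = \int p_\theta(x_{0:T}) dx_{1:T} \]

where \(x_{1:T}\) are latent variables of the same dimensionality as the data \(x_0\)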

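**reverse process** The joint distribution \(p_\theta(x_{0:T})\) is called the reverse process. It is defined as a Markov chain

\[ p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} | x_t) \]

where \(p(x_T) = N(x_T; 0, I)\) and

\[ p_\theta(x_{t-1} | x_t) = N(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)) \]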

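**forward process, diffusion process** The approximate posterior \(q(x_{1:T} | x_0)\) is a fixed Markov chain which adds Gaussian noise to the data according to a variance schedule \(\beta_1, ..., \beta_T\)

\[ q(x_{1:T} | x_0) = \prod_{t=1}^{T} q(x_t | x_{t-1}) \]

where:

\[ q(x_t | x_{t-1}) = N(x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t I) \]

The simplified objective is

\[ L_{simple}(\theta) = E_{t, x_0, \epsilon} \left[ \| \epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon, t) \|^2 \right] \]

where \(\alpha_t = 1 - \beta_t\), \(\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s\) and \(\epsilon \sim N(0, I)\).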

This objective is analogous to the loss weighting used by the NCSN denoising score matching model
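The training objective relies on sampling \(x_t\) directly from \(x_0\) via \(q(x_t | x_0) = N(\sqrt{\bar{\alpha}_t} x_0, (1 - \bar{\alpha}_t) I)\) with \(\bar{\alpha}_t = \prod_{s \leq t} (1 - \beta_s)\). The sketch below checks this closed form against stepping through the chain one noise step at a time. The linear schedule from \(10^{-4}\) to \(0.02\) with \(T = 1000\) is the DDPM paper's setting as I recall it, not something stated in these notes.

```python
import math

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]   # assumed linear schedule

alpha_bar, prod = [], 1.0
for beta in betas:
    prod *= 1.0 - beta
    alpha_bar.append(prod)             # alpha_bar[t - 1] holds the product up to step t

def q_sample(x0, t, eps):
    """Draw x_t ~ q(x_t | x_0) in one shot; t is 1-based and eps ~ N(0, 1)."""
    a = alpha_bar[t - 1]
    return math.sqrt(a) * x0 + math.sqrt(1.0 - a) * eps

# propagate the coefficient on x_0 and the noise variance through the chain step by step
coef, var = 1.0, 0.0
for beta in betas:
    coef *= math.sqrt(1.0 - beta)
    var = (1.0 - beta) * var + beta
```

At \(t = T\) almost no signal is left, which is why the reverse process can start from \(N(0, I)\).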

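**Model (improved diffusion)** The improvements are

The noise schedule is cosine instead of linear, which adds noise more slowly

\[ \bar{\alpha}_t = \frac{f(t)}{f(0)}, \quad f(t) = \cos^2\left( \frac{t/T + s}{1 + s} \cdot \frac{\pi}{2} \right) \]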
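A sketch comparing the two schedules through the remaining signal level \(\bar{\alpha}_t\). The offset `s=0.008` and the clipping of \(\beta_t\) at `0.999` are the values I recall from the improved-DDPM paper and should be treated as assumptions, as should the linear baseline from \(10^{-4}\) to \(0.02\).

```python
import math

def cosine_alpha_bar(T, s=0.008):
    """alpha_bar_t = f(t) / f(0) with f(t) = cos^2(((t/T + s) / (1 + s)) * pi / 2)."""
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(T + 1)]

def betas_from_alpha_bar(alpha_bar, max_beta=0.999):
    """beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}, clipped to avoid a singular last step."""
    return [min(1.0 - alpha_bar[t] / alpha_bar[t - 1], max_beta)
            for t in range(1, len(alpha_bar))]

T = 1000
cos_ab = cosine_alpha_bar(T)
cos_betas = betas_from_alpha_bar(cos_ab)

# linear baseline for comparison
lin_ab, prod = [1.0], 1.0
for t in range(T):
    prod *= 1.0 - (1e-4 + (0.02 - 1e-4) * t / (T - 1))
    lin_ab.append(prod)
```

Halfway through the chain the cosine schedule keeps roughly half of the signal while the linear one has already destroyed most of it.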

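The variance \(\Sigma_\theta(x_t, t)\) is learned instead of fixed to \(\sigma_t^2 I\)

\[ \Sigma_\theta(x_t, t) = \exp(v \log \beta_t + (1 - v) \log \tilde{\beta}_t) \]

where \(v\) is a learned output of the network and \(\tilde{\beta}_t\) is the variance of the forward-process posterior \(q(x_{t-1} | x_t, x_0)\)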

### 4.3. Sampling

**Model (DDIM, denoising diffusion implicit model)** faster sampling with a non-Markovian diffusion process

**Model (PNDM, pseudo numerical methods for diffusion models)**

### 4.4. Conditional Diffusion

**Model (Guided diffusion, classifier-guided)**

**Model (classifier-free guidance)**

**Model (GLIDE, text-to-image)**

**Model (SDEdit)**

**Model (bit diffusion, discrete diffusion)**

**Model (DreamFusion, 3d diffusion, text to 3d)**

### 4.5. Latent Diffusion

**Model (latent diffusion, stable diffusion)**

Run diffusion in the latent space; the resulting latent vector is then decoded into an image.

## 5. Adversarial Model

Let \(X \in \mathcal{X}\) be the random variable of interest with distribution \(P_X\), and let \(X_1, ..., X_n\) be a training sample.

We have two main components:

- generator: a map \(g_\theta: \mathcal{Z} \to \mathcal{X}\). It takes random Gaussian noise \(Z\) and generates outputs \(g_\theta(Z)\), whose distribution we denote \(P_\theta\). Its goal is to choose \(\theta\) such that the distribution of \(g_\theta(Z)\) is close to that of \(X\)
- discriminator: a map \(D_w: \mathcal{X} \to [0, 1]\). Its goal is to assign 1 to samples from the real distribution \(P_X\) and 0 to samples from the generated distribution \(P_\theta\)

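The parameters \((\theta, w)\) are obtained by solving the min-max problem

\[ \min_\theta \max_w E_{X \sim P_X}[\log D_w(X)] + E_{Z}[\log(1 - D_w(g_\theta(Z)))] \]

With an optimal discriminator, this is equivalent to minimizing the JS divergence

\[ \min_\theta JS(P_X \| P_\theta) \]

It means we choose the \(P_\theta\) closest to the target distribution \(P_X\) in the JS divergence.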
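The link to the JS divergence can be checked numerically on a small discrete example (the two distributions are made up). For a fixed generator the optimal discriminator is \(D^*(x) = P_X(x) / (P_X(x) + P_\theta(x))\), and plugging it into the objective gives \(2 \, JS(P_X \| P_\theta) - \log 4\).

```python
import math

def kl(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def js(p, q):
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def value(p_data, p_gen, d):
    """GAN objective E_data[log D] + E_gen[log(1 - D)] for a discriminator table d."""
    return sum(a * math.log(di) + b * math.log(1.0 - di)
               for a, b, di in zip(p_data, p_gen, d))

p_data = [0.5, 0.3, 0.2]     # hypothetical real distribution over 3 outcomes
p_gen = [0.2, 0.2, 0.6]      # hypothetical generator distribution

d_star = [a / (a + b) for a, b in zip(p_data, p_gen)]
v_star = value(p_data, p_gen, d_star)
v_uninformed = value(p_data, p_gen, [0.5, 0.5, 0.5])
```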

### 5.1. Problems

#### 5.1.1. Vanishing Gradient

#### 5.1.2. Mode Collapse

### 5.2. Architecture

**Model (DCGAN, deep convolutional GAN)** uses transposed convolutions for upsampling

An application of DCGAN (2D) to audio generation (1D) is **WaveGAN**

**Model (SAGAN, Self-Attention GAN)**

Adds self-attention to GAN so that both the generator and the discriminator can model long-range relations. \(f, g, h\) in the paper's figure correspond to \(k, q, v\).

**Model (BiGAN)** uses a discriminator to distinguish whether a pair \((x,z)\) comes from the encoder or the decoder

### 5.3. Representation

**Model (Info GAN)** Modifies GAN to encourage it to learn meaningful representations by maximizing the mutual information between a small subset of the noise variables and the observations.

The input noise vector is decomposed into \(z\), the incompressible noise, and \(c\), the latent code which encodes salient semantic features. The goal is to maximize \(I(c; x=G(z,c))\), which cannot be computed directly because \(P(c|x)\) is unknown.

Instead we lower bound it using an auxiliary distribution \(Q(c|x)\) to approximate \(P(c|x)\)

\[ I(c; G(z, c)) \geq E_{x \sim G(z, c)} \left[ E_{c' \sim P(c | x)} [\log Q(c' | x)] \right] + H(c) \]

Treating the second term \(H(c)\) as a constant and rewriting the first term so that \(c\) is sampled from its prior, the lower bound becomes

\[ L_I(G, Q) = E_{c \sim P(c), x \sim G(z, c)} [\log Q(c | x)] + H(c) \]

### 5.4. Loss

**Model (spectral normalization)** Stabilizes the training of the discriminator by normalizing each weight matrix by its spectral norm, so that the Lipschitz constant of the discriminator is controlled

**Model (WGAN, Wasserstein GAN)**

The main point of WGAN is to replace the JS divergence with the \(L^1\)-Wasserstein distance, because

- Wasserstein distance respects the geometry of the underlying distributions
- it captures the distance between two distributions even when their supports do not intersect

Non-intersecting supports are common in high-dimensional applications where the target distribution lies on a low-dimensional manifold

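Recall the \(L^1\)-Wasserstein distance is

\[ W_1(P_X, P_Y) = \inf_{\pi} E_{(X, Y) \sim \pi} [\| X - Y \|] \]

where \(\pi\) ranges over all couplings of the pair of random variables \((X,Y)\), i.e. joint distributions with marginals \(P_X\) and \(P_Y\).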
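A toy illustration of the disjoint-support argument, using a point mass at 0 and a point mass at \(\theta\). The JS divergence is stuck at \(\log 2\) for every \(\theta \neq 0\) and gives no gradient signal, while \(W_1 = |\theta|\) shrinks smoothly as the supports approach each other. For equal-size 1D samples, \(W_1\) reduces to the mean absolute difference of the sorted values, which is what the hypothetical helper `w1_sorted` computes.

```python
import math

def js_two_points(theta):
    """JS divergence between a point mass at 0 and a point mass at theta."""
    if theta == 0:
        return 0.0
    # disjoint supports: each distribution has mass 1 where the mixture has mass 1/2
    return 0.5 * math.log(1.0 / 0.5) + 0.5 * math.log(1.0 / 0.5)

def w1_sorted(xs, ys):
    """W1 between two equal-size 1D samples via the sorted (monotone) coupling."""
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

thetas = [0.5, 1.0, 2.0]
js_values = [js_two_points(t) for t in thetas]
w1_values = [w1_sorted([0.0] * 4, [t] * 4) for t in thetas]
```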

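It can be shown that the Wasserstein distance \(W_1(P_X, P_{g_\theta})\) is continuous with respect to \(\theta\) if \(g_\theta\) is continuous with respect to \(\theta\).

To minimize \(W_1(P_X, P_\theta)\), we use the Kantorovich-Rubinstein duality

\[ W_1(P_X, P_\theta) = \sup_{\| D \|_L \leq 1} E_{X \sim P_X}[D(X)] - E_{X \sim P_\theta}[D(X)] \]

where the sup is over functions whose Lipschitz constant is at most 1. Expanding the entire form, we get

\[ \min_\theta \max_w E_{X \sim P_X}[D_w(X)] - E_{Z}[D_w(g_\theta(Z))] \]

subject to \(\| D_w \|_L \leq K\) (using a bound \(K\) instead of 1 only rescales the distance by \(K\)).

In practice, the constraint is enforced by bounding the infinity norm of the weights (known as weight clipping).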

tutorials:

- Here is a short introduction to optimal transport
- A good introduction to the WGAN
- a mandarin introduction

**Model (WGAN-GP, WGAN + Gradient Penalty)**

**Model (LS-GAN)** uses a least-squares loss instead of the sigmoid cross-entropy loss in the discriminator. It

- generates higher-quality samples
- has a more stable learning process

### 5.5. Application-focused GAN

**Model (Cycle GAN)**

## 6. Energy-based Model

The partition function chapter of the Deep Learning book has good coverage of those methods.

An energy-based model defines an energy function \(E(X)\) and models generation using the Boltzmann distribution

\[ P(X) = \frac{\exp(-E(X))}{Z} \]

where \(Z = \int \exp(-E(x)) dx\) is the partition function
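A tiny discrete sketch of the Boltzmann distribution over 3 binary units, using a Boltzmann-machine style pairwise energy. The weights and biases are made up, and for discrete states the integral in \(Z\) becomes a sum. The partition function is computed by brute-force enumeration over \(2^n\) states, which is exactly what becomes intractable for realistic \(n\).

```python
import itertools
import math

W = {(0, 1): 1.0, (0, 2): -0.5, (1, 2): 0.25}   # hypothetical pairwise weights
bias = [0.1, -0.2, 0.0]                          # hypothetical unit biases

def energy(x):
    """E(x) = -sum_{i<j} W_ij x_i x_j - sum_i b_i x_i."""
    pair = sum(w * x[i] * x[j] for (i, j), w in W.items())
    return -pair - sum(b * xi for b, xi in zip(bias, x))

states = list(itertools.product([0, 1], repeat=3))
Z = sum(math.exp(-energy(x)) for x in states)          # sums over all 2^3 states
probs = {x: math.exp(-energy(x)) / Z for x in states}
most_likely = max(probs, key=probs.get)
```

Low-energy states get high probability, so the most likely state is the one with minimal energy.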

**Model (Boltzmann Machine)**

**Model (Restricted Boltzmann Machine)**

## 7. Reference

- [1] Berkeley CS249
- [2] http://karpathy.github.io/2015/05/21/rnn-effectiveness/
- [3] Hung-yi Lee Youtube Flow-based Generative Model
- [4] Jakub M. Tomczak Deep Generative Modeling book