0x422 Generative Model
- 1. Autoregressive Model
- 2. Variational Autoencoder
- 3. Normalizing Flow
- 4. Diffusion Model
- 5. Adversarial Model
- 6. Energy-based Model
- 7. Reference
The Deep Generative Modeling book [4] has a good comparison across all of these models.
1. Autoregressive Model
Neural AR models factorize the generation problem into a sequence of conditional probabilities and use a neural network to model them. The main drawback of autoregressive models is slow generation.
Modeling each \(p(x_d | x_{<d})\) separately would require \(D\) different models, which is infeasible. Instead we share a single model across positions (i.e., an autoregressive model).
To reduce the complexity, one simple idea is to use finite memory, e.g.
\(p(\mathbf{x}) = p(x_1)\, p(x_2 \mid x_1) \prod_{d=3}^{D} p(x_d \mid x_{d-1}, x_{d-2})\)
where the trigram model \(p(x_d | x_{d-1}, x_{d-2})\) is modeled using an MLP (a sketch follows below).
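A minimal sketch of such a finite-memory model (the class name, layer sizes, and vocabulary size are illustrative assumptions, not from a specific paper):

```python
import torch
import torch.nn as nn

class TrigramMLP(nn.Module):
    """Finite-memory AR model: a single p(x_d | x_{d-1}, x_{d-2}) shared across positions."""
    def __init__(self, vocab_size=256, embed_dim=32, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, vocab_size),  # logits over the next symbol
        )

    def forward(self, x_prev2, x_prev1):
        # Concatenate the embeddings of the two previous symbols.
        h = torch.cat([self.embed(x_prev2), self.embed(x_prev1)], dim=-1)
        return self.mlp(h)  # unnormalized log-probabilities of x_d
```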
1.1. Long-Range Memory with RNN
An RNN can be used as an autoregressive model with long-range memory.
Model (char-rnn) The character-level language model models the character sequence \(\mathbf{x}\) with an RNN as follows: \(h_t = f(h_{t-1}, x_{t-1})\), \(p(x_t \mid x_{<t}) = \mathrm{softmax}(V h_t)\)
Karpathy's blog [2] shows that this model can generate many different kinds of sequences, such as Shakespeare, Wikipedia articles, XML, LaTeX and source code.
This model can also generate non-text objects such as images by representing each pixel as a character.
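A minimal char-rnn sketch in PyTorch (the GRU/embedding sizes and the `sample` helper are illustrative assumptions):

```python
import torch
import torch.nn as nn

class CharRNN(nn.Module):
    """Character-level AR model: p(x_t | x_{<t}) summarized by the RNN hidden state."""
    def __init__(self, vocab_size=128, embed_dim=64, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, h=None):
        # x: (batch, seq_len) integer character ids
        out, h = self.rnn(self.embed(x), h)
        return self.head(out), h  # per-step logits for the next character

@torch.no_grad()
def sample(model, start_id, length):
    """Ancestral sampling, one character at a time."""
    x = torch.tensor([[start_id]])
    h, out = None, [start_id]
    for _ in range(length):
        logits, h = model(x, h)
        x = torch.multinomial(torch.softmax(logits[:, -1], dim=-1), 1)
        out.append(x.item())
    return out
```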
1.2. Masking-based Models
Model (Masking-based autoregressive model, MADE) An MLP-based autoencoder can be turned into an autoregressive model by removing (masking) some connections so that each output depends only on earlier inputs.
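A sketch of the MADE masking idea for a single hidden layer (the degree assignment follows the paper's recipe, but the helper names and layer class are mine):

```python
import numpy as np
import torch
import torch.nn as nn

def made_masks(d_in, d_hidden, rng=np.random.default_rng(0)):
    """Build MADE-style masks so output d depends only on inputs with index < d."""
    m_in = np.arange(1, d_in + 1)                    # input degrees 1..D
    m_hid = rng.integers(1, d_in, size=d_hidden)     # hidden degrees in [1, D-1]
    mask_hidden = (m_hid[:, None] >= m_in[None, :])  # hidden k sees input d iff m(k) >= d
    mask_out = (m_in[:, None] > m_hid[None, :])      # output d sees hidden k iff d > m(k)
    return (torch.from_numpy(mask_hidden.astype(np.float32)),
            torch.from_numpy(mask_out.astype(np.float32)))

class MaskedLinear(nn.Linear):
    """Linear layer whose weight matrix is multiplied elementwise by a fixed binary mask."""
    def set_mask(self, mask):
        self.register_buffer("mask", mask)

    def forward(self, x):
        return nn.functional.linear(x, self.weight * self.mask, self.bias)

# usage sketch: mask_h, mask_o = made_masks(D, H)
# layer1 = MaskedLinear(D, H); layer1.set_mask(mask_h)
# layer2 = MaskedLinear(H, D); layer2.set_mask(mask_o)
```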
Model (WaveNet) WaveNet is a 1D causal (dilated) convolutional AR model over raw audio.
Model (PixelCNN) PixelCNN is a 2D convolutional AR model. Unlike a normal CNN, which convolves over all neighboring pixels, PixelCNN masks out the pixels it has not yet seen (e.g., under a raster-scan ordering).
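A sketch of the masked convolution that enforces the raster-scan ordering (the 'A'/'B' mask types follow the usual PixelCNN convention; the class itself is illustrative):

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """2D convolution with a raster-scan causality mask.

    Mask type 'A' (first layer) also hides the center pixel;
    mask type 'B' (later layers) keeps it.
    """
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        kH, kW = self.kernel_size
        mask = torch.ones(kH, kW)
        mask[kH // 2, kW // 2 + (mask_type == "B"):] = 0  # right of center (and center for 'A')
        mask[kH // 2 + 1:, :] = 0                          # all rows below the center
        self.register_buffer("mask", mask[None, None])     # broadcast over (out_ch, in_ch)

    def forward(self, x):
        return nn.functional.conv2d(x, self.weight * self.mask, self.bias,
                                    self.stride, self.padding, self.dilation, self.groups)

# usage sketch: first = MaskedConv2d("A", 3, 64, kernel_size=7, padding=3)
```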
Model (PixelCNN++) OpenAI's implementation of PixelCNN with several improvements:
- Use a mixture of logistics (e.g., 5 components) to model the discretized distribution instead of a 256-way softmax, because it
  - saves memory
  - allows dense gradient flow, which speeds up training
- Pixel conditioning is simplified
- Short-cut connections like in U-Net
The mixture of logistics is as follows: the latent color intensity is modeled as \(\nu \sim \sum_i \pi_i \, \mathrm{logistic}(\mu_i, s_i)\).
The PMF of the discretized value is modeled as \(P(x \mid \pi, \mu, s) = \sum_i \pi_i \left[ \sigma\!\left(\frac{x + 0.5 - \mu_i}{s_i}\right) - \sigma\!\left(\frac{x - 0.5 - \mu_i}{s_i}\right) \right]\), with the edge bins at 0 and 255 extended to \(-\infty\) and \(+\infty\).
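A simplified sketch of the discretized-logistic log-PMF, operating directly on integer pixel values in 0..255 (the official PixelCNN++ code rescales inputs to [-1, 1], so its constants differ):

```python
import torch

def discretized_logistic_logpmf(x, logit_pi, mu, log_s, num_bins=256):
    """Log-probability of integer pixel x under a mixture of discretized logistics.

    Shapes: x is (...,); mixture parameters logit_pi, mu, log_s are (..., K).
    """
    x = x[..., None].float()
    inv_s = torch.exp(-log_s)
    # Logistic CDF evaluated at the two bin edges around x.
    plus = torch.sigmoid((x + 0.5 - mu) * inv_s)
    minus = torch.sigmoid((x - 0.5 - mu) * inv_s)
    # Edge cases: the first bin integrates from -inf, the last bin up to +inf.
    cdf_delta = torch.where(x < 0.5, plus,
                            torch.where(x > num_bins - 1.5, 1.0 - minus, plus - minus))
    log_probs = torch.log(cdf_delta.clamp(min=1e-12)) + torch.log_softmax(logit_pi, dim=-1)
    return torch.logsumexp(log_probs, dim=-1)
```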
2. Variational Autoencoder
The idea behind the latent variable model is to assume a lower-dimensional latent space and the following generative process
We want to sample from the simple low-dimensional latent space \(Z\) easily (e.g: Gaussian), and maximize the evidence function over the dataset \(X \sim \mathcal{D}\)
2.1. Vanilla VAE
VAE implements this idea with the following modeling choices:
Model (\(P(Z)\)) The prior in VAE is the standard Gaussian \(P(Z) = N(0, I)\).
Note this is a fixed prior, in contrast with VQ-VAE's learned prior.
Model (\(P(X|Z)\)) The likelihood is modeled using a deep neural network \(f(Z; \theta)\): VAE approximates \(X \approx f(Z; \theta)\) and measures \(P(X|Z)\) with a Gaussian penalty \(N(X \mid f(Z; \theta), \sigma^2 I)\)
(if \(X\) is discrete, another discrete distribution such as a Bernoulli can be used as the penalty).
Probabilistic PCA
Recall that pPCA is a simplified version of VAE in which the prior is a standard Gaussian
and the likelihood function \(f\) is linear, \(X = WZ + \mu + \epsilon\) with \(\epsilon \sim N(0, \sigma^2 I)\).
The graphical model of VAE is
The integration in the evidence is very expensive,
so we maximize a lower bound of the evidence (the ELBO) instead of the evidence itself.
Recall the standard ELBO form (the RHS is usually denoted \(\mathcal{L}(Q)\)):
\(\log P(X) \geq E_{Z \sim Q(Z|X)}\left[\log \frac{P(X, Z)}{Q(Z|X)}\right]\)
where the right-hand expression is the ELBO. The ELBO can be further decomposed into two terms:
\(\mathcal{L}(Q) = E_{Z \sim Q(Z|X)}[\log P(X|Z)] - KL(Q(Z|X) \| P(Z))\)
We model \(Q(Z|X)\) as the Gaussian \(N(Z \mid \mu(X), \Sigma(X))\),
where \(\mu(X), \Sigma(X)\) are implemented using neural networks.
Looking at the formula again:
the first term on the RHS involves a sampling step \(z \sim Q\), through which we cannot backpropagate. Training is made possible with the reparametrization trick: we sample \(\epsilon \sim N(0, I)\) and transform \(\epsilon\) into \(z = \mu(X) + \Sigma(X)^{1/2} \epsilon\) (instead of sampling \(Z \sim Q\) directly).
The second term, the KL divergence between two Gaussians, is simple to compute in closed form.
The likelihood itself cannot be computed exactly; only its lower bound is available. A minimal sketch of both terms is given below.
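A minimal sketch of the two ELBO terms and the reparametrization trick, assuming a diagonal-Gaussian encoder and an MSE-style (fixed-variance Gaussian) reconstruction penalty:

```python
import torch

def reparameterize(mu, log_var):
    """Sample z ~ q(z|x) = N(mu, diag(exp(log_var))) as a differentiable transform of eps ~ N(0, I)."""
    eps = torch.randn_like(mu)
    return mu + eps * torch.exp(0.5 * log_var)

def vae_loss(x, x_recon, mu, log_var):
    """Negative ELBO: reconstruction term plus closed-form KL against the N(0, I) prior."""
    recon = ((x - x_recon) ** 2).sum(dim=-1)                         # Gaussian penalty up to a constant
    kl = 0.5 * (log_var.exp() + mu ** 2 - 1.0 - log_var).sum(dim=-1) # KL(N(mu, sigma^2) || N(0, I))
    return (recon + kl).mean()
```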
2.2. Posterior Collapse
VAE suffers from the posterior collapse problem: when the signal from the posterior \(Q(Z|X)\) is too weak or too noisy, it collapses towards the prior,
meaning a subset of \(Z\) is not meaningfully used and simply matches the uninformative prior.
The decoder then starts ignoring those dimensions and generates samples without any signal from \(X\); the reconstructed output becomes independent of \(X\).
Some works claim this is caused by the KL term in the objective.
The most common approaches to address it are to either
- change the objective (e.g., anneal or reweight the KL term)
- weaken the decoder
2.3. Architecture
Model (VAE-GAN) Attaches a discriminator after the encoder/decoder to judge the decoded samples.
2.4. Loss
Model (\(\beta\)-VAE) Attempts to learn a disentangled representation by weighting the KL term with \(\beta > 1\).
Another approach to solving the posterior collapse problem:
Model (\(\delta\)-VAE) Prevents the KL from falling to zero by constraining the posterior \(Q\) and prior \(P\) so that they keep a minimum distance \(KL \geq \delta > 0\).
A trivial choice is to use Gaussians with fixed, different variances. For a non-trivial sequential model, they use a temporally uncorrelated posterior \(q\) and a correlated AR(1) prior.
There is a minimum distance because one is correlated and the other is not.
2.5. Vanilla VQ-VAE (Discrete Model)
Model (VQ-VAE) VQ-VAE uses discrete latent variables instead of continuous ones. It has a latent embedding (codebook) \(e \in R^{K \times D}\), where \(K\) is the size of the discrete latent space and \(D\) is the hidden dimension.
It models the posterior distribution as a deterministic categorical distribution: \(q(z = k \mid x) = 1\) if \(k = \arg\min_j \| z_e(x) - e_j \|_2\), and \(0\) otherwise.
The loss function is \(L = \log p(x \mid z_q(x)) + \| sg[z_e(x)] - e \|^2 + \beta \| z_e(x) - sg[e] \|^2\)
It consists of three terms (a code sketch follows the list):
- reconstruction loss: \(\log p(x|z_q(x))\)
- codebook loss: \(\| sg[z_e(x)] - e \|^2\), brings the codebook entries close to the encoder output; can be replaced with an EMA (exponential moving average) update for stability
- commitment loss: \(\| z_e(x) - sg[e] \|^2\), encourages the encoder output to stay close to the chosen codebook entry
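A sketch of the quantization step with the straight-through estimator and the two auxiliary losses (shapes are simplified to flat (batch, D) latents; real implementations quantize spatial feature maps):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-neighbour quantization with the straight-through estimator."""
    def __init__(self, K=512, D=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(K, D)
        self.beta = beta

    def forward(self, z_e):
        # z_e: (batch, D) encoder outputs; pick the closest codebook entry for each.
        dist = torch.cdist(z_e, self.codebook.weight)        # (batch, K)
        idx = dist.argmin(dim=-1)
        e = self.codebook(idx)                               # quantized vectors
        codebook_loss = F.mse_loss(e, z_e.detach())          # pull codes toward encoder output
        commitment_loss = F.mse_loss(z_e, e.detach())        # pull encoder output toward codes
        z_q = z_e + (e - z_e).detach()                       # straight-through: gradients copy to z_e
        return z_q, codebook_loss + self.beta * commitment_loss
```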
VAE vs VQ-VAE
VQ-VAE can be seen as a special case of VAE. The KL term of the original VAE disappears by assuming the prior \(p(z)\) is uniform, \(p(z = k) = 1/K\), and the proposal distribution \(q(z = k \mid x)\) is deterministic: the KL then reduces to the constant \(\log K\).
While training the model, the prior \(p(Z)\) is kept constant and uniform, \(p(z = k) = 1/K\). After training, an autoregressive model can be fit over \(Z\), so that we can sample using ancestral sampling.
In this work, they model the autoregressive latent prior using PixelCNN for image and WaveNet for raw audio
The experiment settings are interesting
Image settings:
- 128x128x3 -> 32x32x1 (K=512)
- 43 times reduction
Audio settings:
- encoder: 6 convolution with stride 2 and window 4 (K=512)
- 64 times reduction
- decoder: dilated convolutional architecture like the WaveNet decoder
Problems:
VQ-VAE also has its own problems, most notably low codebook usage caused by poor codebook initialization.
2.6. Hierarchical VQ-VAE
Model (Hierarchical VQ-VAE)
It has a hierarchical latent code
- top latent code models global information
- bottom latent code, conditioned on the top latent, models local information
256x256 images -> 64x64 (bottom) -> 32x32 (top)
Prior
- top prior: PixelCNN + multihead self attention to capture larger receptive field
- bottom prior: no self-attention
3. Normalizing Flow
Let \(Z \in R^D\) be a tractable random variable with pdf \(p(Z)\), and let \(g\) be an invertible function (with inverse \(f\)).
Using the change-of-variables formula, we know
\(p_X(x) = p_Z(f(x)) \left| \det \frac{\partial f(x)}{\partial x} \right|\)
\(g\) is the generator, which maps the simple distribution to a complicated one; its inverse \(f\) normalizes the complicated distribution back towards the simpler form.
To train the model, we maximize the log-likelihood, which only requires \(f\).
To sample a new point, we simply sample \(z\) and transform it with \(g(z)\).
Normalizing flow vs VAE
Architecture:
- VAE's encoder/decoder is usually not invertible
- NF's encoder/decoder is bijective
Objective:
- VAE maximizes a lower bound of the log-likelihood (the ELBO)
- NF maximizes the exact log-likelihood
\(f, g\) control the expressiveness of the model. One way to build complicated bijective functions is to compose simple ones, \(g = g_N \circ g_{N-1} \circ \dots \circ g_1\),
which has the inverse \(f = f_1 \circ f_2 \circ \dots \circ f_N\) and determinant \(\det J_f = \prod_{i=1}^{N} \det J_{f_i}\), so the log-determinants of the components simply add up.
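A sketch of such a composition, assuming each component flow returns its output together with its log-determinant and exposes an `inverse` method (this interface is my assumption, not a standard library API):

```python
import torch
import torch.nn as nn

class ComposedFlow(nn.Module):
    """Compose bijections g = g_N ∘ ... ∘ g_1; log-det Jacobians add up."""
    def __init__(self, flows):
        super().__init__()
        self.flows = nn.ModuleList(flows)  # each flow: forward(z) -> (out, logdet), inverse(x) -> (out, logdet)

    def forward(self, z):
        log_det = torch.zeros(z.shape[0], device=z.device)
        for flow in self.flows:
            z, ld = flow(z)
            log_det += ld
        return z, log_det

    def log_prob(self, x, base_dist):
        """Exact log-likelihood via the normalizing direction f = g^{-1}.

        base_dist is a factorized distribution, e.g. torch.distributions.Normal(0., 1.).
        """
        log_det = torch.zeros(x.shape[0], device=x.device)
        for flow in reversed(self.flows):
            x, ld = flow.inverse(x)
            log_det += ld
        return base_dist.log_prob(x).sum(dim=-1) + log_det
```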
3.1. Linear Flow
Model (linear flow) We first consider the simple linear flow with invertible \(A\)
Linear flows are limited in expressiveness: when \(p(z) = N(\mu, \Sigma)\), then \(p(y) = N(A\mu + b, A \Sigma A^T)\), which is still Gaussian.
Additionally, computing the determinant of the Jacobian (\(\det A\)) is \(O(D^3)\), and computing the inverse \(A^{-1}\) also costs \(O(D^3)\).
Constraining the matrix \(A\) to be triangular, orthogonal, etc. reduces these costs.
3.2. Planar Flow
Model (planar flow) \(g(z) = z + u\, h(w^\top z + b)\)
where \(h\) is a nonlinearity
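A sketch of a planar flow layer with its rank-one log-determinant (tanh nonlinearity assumed; the forward map has no cheap inverse, so planar flows are mostly used for variational posteriors rather than for density estimation of data):

```python
import torch
import torch.nn as nn

class PlanarFlow(nn.Module):
    """g(z) = z + u * tanh(w^T z + b), with det(I + u psi^T) = 1 + u^T psi."""
    def __init__(self, dim):
        super().__init__()
        self.u = nn.Parameter(torch.randn(dim) * 0.01)
        self.w = nn.Parameter(torch.randn(dim) * 0.01)
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, z):
        lin = z @ self.w + self.b                       # (batch,)
        out = z + self.u * torch.tanh(lin)[:, None]
        psi = (1 - torch.tanh(lin) ** 2)[:, None] * self.w   # h'(w^T z + b) * w
        # Note: invertibility requires w^T u >= -1 (the paper reparametrizes u to enforce this).
        log_det = torch.log(torch.abs(1 + psi @ self.u) + 1e-8)
        return out, log_det
```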
3.3. RealNVP
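No notes here yet; below is a hedged sketch of RealNVP's core building block, the affine coupling layer, written to fit the `ComposedFlow` interface sketched above (layer sizes and the partition into halves are illustrative):

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """RealNVP-style coupling: keep one half of the dimensions, affinely transform the other.

    Both directions are cheap, and the Jacobian is triangular,
    so log|det J| is just the sum of the predicted log-scales.
    """
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.d = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.d, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.d)),   # predicts log-scale and shift
        )

    def forward(self, z):
        z1, z2 = z[:, :self.d], z[:, self.d:]
        log_s, t = self.net(z1).chunk(2, dim=-1)
        x2 = z2 * torch.exp(log_s) + t
        return torch.cat([z1, x2], dim=-1), log_s.sum(dim=-1)

    def inverse(self, x):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        log_s, t = self.net(x1).chunk(2, dim=-1)
        z2 = (x2 - t) * torch.exp(-log_s)
        return torch.cat([x1, z2], dim=-1), -log_s.sum(dim=-1)
```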
3.4. Inverse Autoregressive Flow
4. Diffusion Model
Links
- Lecture Video by Jascha
- nnabla lecture
4.1. Score Matching Models
The general score matching description is here
Model (denoising score matching)
Model (sliced score matching)
Model (NCSN, Noise Conditional Score Networks) Contributions are
- perturbing the data using various levels of noise \(\sigma_1, ..., \sigma_L\)
- simultaneously estimating scores corresponding to all noise levels by training a single conditional score network \(s_\theta\)
Sampling is done by annealed Langevin dynamics, which applies Langevin dynamics at each noise scale \(\sigma_i\), moving from the largest to the smallest scale (a sketch follows).
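A sketch of annealed Langevin sampling, assuming a score network `score_net(x, i)` conditioned on the index of the noise level (the step-size scaling \(\alpha_i \propto \sigma_i^2\) follows the NCSN paper; constants are illustrative):

```python
import torch

@torch.no_grad()
def annealed_langevin(score_net, sigmas, shape, n_steps=100, eps=2e-5):
    """Run Langevin dynamics at each noise level, from the largest sigma to the smallest."""
    x = torch.rand(shape)                          # arbitrary (e.g. uniform) initialization
    for i, sigma in enumerate(sigmas):             # sigmas sorted from largest to smallest
        step = eps * (sigma / sigmas[-1]) ** 2     # per-level step size
        for _ in range(n_steps):
            noise = torch.randn_like(x)
            x = x + 0.5 * step * score_net(x, i) + (step ** 0.5) * noise
    # after the smallest sigma, x approximates a sample from the data distribution
    return x
```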
4.2. Denoising Diffusion
Model (DDPM, Denoising Diffusion Probabilistic Models) Diffusion models are latent variable models of the form \(p_\theta(x_0) = \int p_\theta(x_{0:T}) \, dx_{1:T}\)
where \(x_{1:T}\) are latent variables
reverse process The joint distribution \(p_\theta(x_{0:T})\) is called the reverse process; it is defined as a Markov chain
\(p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)\), where \(p(x_T) = N(0, I)\) and \(p_\theta(x_{t-1} \mid x_t) = N(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))\)
forward process, diffusion process The approximate posterior is a fixed Markov chain which adds noise to the data according to a variance schedule \(\beta_1, ..., \beta_T\):
\(q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})\), where \(q(x_t \mid x_{t-1}) = N(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I)\)
The simplified objective is
\(L_{\text{simple}} = E_{t, x_0, \epsilon}\left[ \| \epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, t) \|^2 \right]\), with \(\bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)\).
This objective is analogous to the loss weighting used by the NCSN denoising score matching model.
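A sketch of this simplified training objective, assuming a noise-prediction network `eps_model(x_t, t)` and a precomputed \(\bar{\alpha}\) table:

```python
import torch

def ddpm_loss(eps_model, x0, alphas_cumprod):
    """Simplified DDPM objective: predict the noise added at a random timestep."""
    T = alphas_cumprod.shape[0]
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    a_bar = alphas_cumprod[t].view(-1, *([1] * (x0.dim() - 1)))   # \bar{alpha}_t, broadcastable
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps            # q(x_t | x_0) in closed form
    return ((eps - eps_model(x_t, t)) ** 2).mean()                # || eps - eps_theta(x_t, t) ||^2
```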
Model (improved diffusion) The improvements are:
- The noise schedule is cosine instead of linear, which adds noise more slowly.
- The variance \(\Sigma_\theta(x_t, t)\) is learned instead of fixed to \(\sigma_t^2 I\); it is parameterized as \(\exp(v \log \beta_t + (1 - v) \log \tilde{\beta}_t)\), where \(v\) is a learned output of the network.
4.3. Sampling
Model (DDIM, denoising diffusion implicit model) faster sampling with a non-Markovian diffusion process
Model (PNDM, pseudo numerical methods for diffusion models)
4.4. Conditional Diffusion
Model (Guided diffusion, classifier-guided)
Model (classifier-free guidance)
Model (GLIDE, text-to-image)
Model (SDEdit)
Model (bit diffusion, discrete diffusion)
Model (DreamFusion, 3d diffusion, text to 3d)
4.5. Latent Diffusion
Model (latent diffusion, stable diffusion)
Run the diffusion process in the latent space; the diffused latent vector is then decoded into an image.
5. Adversarial Model
Let \(X \in \mathcal{X}\) be the random variable of interest, \(P(X)\) its distribution, and \(X_1, ..., X_n\) a training sample.
We have two main components:
- generator: a map \(g_\theta: \mathcal{Z} \to \mathcal{X}\). It takes random Gaussian noise \(Z\) and produces outputs \(g_\theta(Z)\). Its goal is to choose \(\theta\) such that \(g_\theta(Z)\) is close in distribution to \(X\).
- discriminator: a map \(D_w: \mathcal{X} \to [0, 1]\). Its goal is to assign 1 to samples from the real distribution \(P_X\) and 0 to samples from the generated distribution \(P_\theta\).
The parameters \((\theta, w)\) are obtained by solving the min-max problem
\(\min_\theta \max_w \; E_{X \sim P_X}[\log D_w(X)] + E_{Z}[\log(1 - D_w(g_\theta(Z)))]\)
With an optimal discriminator, this is equivalent to minimizing the JS divergence:
it means we choose the \(P_\theta\) closest to the target distribution \(P_X\) in JS divergence.
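A sketch of one alternating optimization step (the discriminator is assumed to end in a sigmoid; the generator uses the common non-saturating loss rather than the literal min-max form):

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_g, opt_d, x_real, z_dim=100):
    """One alternating update of the GAN min-max game; D outputs a probability in [0, 1]."""
    z = torch.randn(x_real.shape[0], z_dim, device=x_real.device)

    # Discriminator step: push D(x) -> 1 on real data and D(G(z)) -> 0 on fakes.
    opt_d.zero_grad()
    d_real = D(x_real)
    d_fake = D(G(z).detach())                    # detach so G is not updated here
    d_loss = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    d_loss.backward()
    opt_d.step()

    # Generator step: non-saturating loss, push D(G(z)) -> 1.
    opt_g.zero_grad()
    d_fake = D(G(z))
    g_loss = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```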
5.1. Problems
5.1.1. Vanishing Gradient
5.1.2. Mode Collapse
5.2. Architecture
Model (DCGAN, deep convolutional GAN) Uses transposed convolutions for upsampling in the generator.
An application of the DCGAN idea (2D) to audio generation (1D) is WaveGAN.
Model (SAGAN, Self-Attention GAN)
Adds self-attention to GAN so that both the generator and the discriminator can model long-range dependencies. \(f, g, h\) in the figure correspond to \(k, q, v\).
Model (BiGAN) Uses the discriminator to distinguish whether a pair \((x, z)\) comes from the encoder or from the decoder.
5.3. Representation
Model (InfoGAN) Modifies GAN to encourage it to learn meaningful representations by maximizing the mutual information between a small subset of the noise variables and the observations.
The input noise vector is decomposed into \(z\), incompressible noise, and \(c\), a latent code which encodes salient semantic features. The goal is to maximize \(I(c; x = G(z, c))\), which is not directly computable because \(P(c|x)\) is unknown.
Instead we lower-bound it using an auxiliary distribution \(Q(c|x)\) that approximates \(P(c|x)\).
By ignoring the second term and rewriting the first term, the lower bound becomes
\(L_I(G, Q) = E_{c \sim P(c),\, x \sim G(z, c)}[\log Q(c|x)] + H(c) \leq I(c; G(z, c))\)
5.4. Loss
Model (spectral normalization) Stabilizes the training of the discriminator by normalizing each weight matrix by its spectral norm, so that the Lipschitz constant of the discriminator is controlled.
Model (WGAN, Wasserstein GAN)
The main point of WGAN is to replace the JS divergence with the \(L^1\)-Wasserstein distance, because
- the Wasserstein distance respects the geometry of the underlying space
- it captures the distance between two distributions even when their supports do not intersect
Non-intersecting supports are common in high-dimensional applications where the target distribution lies on a low-dimensional manifold.
Recall the \(L^1\)-Wasserstein distance is
\(W_1(P, Q) = \inf_{\pi \in \Pi(P, Q)} E_{(X, Y) \sim \pi}[\| X - Y \|]\)
where \(\pi\) ranges over couplings of the pair of random variables \((X, Y)\) with marginals \(P\) and \(Q\).
It can be shown that the Wasserstein distance \(W_1(P_X, P_{g_\theta})\) is continuous with respect to \(\theta\) if \(g_\theta\) is continuous with respect to \(\theta\).
To minimize \(W_1(P_X, P_\theta)\), we use the Kantorovich-Rubinstein duality
\(W_1(P_X, P_\theta) = \sup_{\|D\|_L \leq 1} E_{X \sim P_X}[D(X)] - E_{X \sim P_\theta}[D(X)]\)
where the supremum is over functions whose Lipschitz constant is at most 1. Expanding the full objective, we get
\(\min_\theta \max_w \; E_{X \sim P_X}[D_w(X)] - E_{Z}[D_w(g_\theta(Z))]\), subject to \(\| D_w \|_L \leq K\).
In practice, the constraint is enforced by constraining the infinity norm of the weights (known as weight clipping); a sketch is given below.
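A sketch of one critic update with weight clipping (the critic `D` has no output sigmoid; the clip value 0.01 follows the paper's default; WGAN-GP replaces clipping with a gradient penalty):

```python
import torch

def wgan_critic_step(G, D, opt_d, x_real, z_dim=100, clip=0.01):
    """One critic update for WGAN: maximize E[D(x)] - E[D(g(z))], then clip the weights."""
    z = torch.randn(x_real.shape[0], z_dim, device=x_real.device)
    opt_d.zero_grad()
    loss = -(D(x_real).mean() - D(G(z).detach()).mean())   # minimize the negated objective
    loss.backward()
    opt_d.step()
    for p in D.parameters():                               # crude Lipschitz constraint
        p.data.clamp_(-clip, clip)
    return -loss.item()
```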
tutorials:
- Here is a short introduction to optimal transport
- A good introduction to the WGAN
- a mandarin introduction
Model (WGAN-GP, WGAN + Gradient Penalty)
Model (LS-GAN) Uses a least-squares loss instead of sigmoid cross-entropy in the discriminator, which
- generates higher-quality samples
- stabilizes the learning process
5.5. Application-focused GAN
Model (Cycle GAN)
6. Energy-based Model
The partition-function chapter of the Deep Learning book has good coverage of these methods.
An energy-based model defines an energy function \(E(X)\) and models generation using the Boltzmann distribution
\(p(X) = \frac{\exp(-E(X))}{Z}\), where \(Z = \sum_X \exp(-E(X))\) (or the corresponding integral for continuous \(X\)) is the partition function.
Model (Boltzmann Machine)
Model (Restricted Boltzmann Machine)
7. Reference
- [1] Berkeley CS249
- [2] http://karpathy.github.io/2015/05/21/rnn-effectiveness/
- [3] Hung-yi Lee, YouTube lecture: Flow-based Generative Model
- [4] Jakub M. Tomczak, Deep Generative Modeling (book)