0x533 Diffusion Model

Links

Lecture Video by Jascha
nnabla lecture

4.1. Score Matching Models

The general score matching description is here

Model (denoising score matching)

Model (sliced score matching)

Model (NCSN, Noise Conditional Score Networks) The contributions are:

perturbing the data using various levels of noise \(\sigma_1, ..., \sigma_L\)
simultaneously estimating scores corresponding to all noise levels by training a single conditional score network \(s_\theta\)

\[s_\theta(x, \sigma) \approx \nabla_x \log q_\sigma(x)\]

Sampling uses annealed Langevin dynamics, which runs Langevin dynamics at each noise scale \(\sigma_i\) in turn, from the largest to the smallest, starting each run from the final sample of the previous one
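The sampling loop can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not a reference implementation: `score_fn` is a hypothetical stand-in for the trained network \(s_\theta(x, \sigma)\), the step size is scaled as \(\sigma_i^2 / \sigma_L^2\), and `eps` and `n_steps` are illustrative values. The toy score used here is the exact score of \(N(3, \sigma^2)\), so the samples should concentrate near 3.

```python
import numpy as np

def annealed_langevin(score_fn, x, sigmas, n_steps=100, eps=2e-5, rng=None):
    """Run Langevin dynamics at each noise level, largest sigma first.

    score_fn(x, sigma) is a placeholder for s_theta(x, sigma) (assumption).
    """
    rng = np.random.default_rng() if rng is None else rng
    sigma_last = sigmas[-1]
    for sigma in sigmas:
        # Step size shrinks with the noise level (assumed scaling).
        step = eps * (sigma / sigma_last) ** 2
        for _ in range(n_steps):
            z = rng.standard_normal(x.shape)
            x = x + 0.5 * step * score_fn(x, sigma) + np.sqrt(step) * z
    return x

# Toy example: data is a point mass at 3, so q_sigma = N(3, sigma^2)
# and its score is (3 - x) / sigma^2.
def toy_score(x, sigma):
    return (3.0 - x) / sigma ** 2

rng = np.random.default_rng(0)
sigmas = np.geomspace(1.0, 0.01, 10)      # decreasing noise levels
x_init = 5.0 * rng.standard_normal(1000)  # arbitrary starting points
samples = annealed_langevin(toy_score, x_init, sigmas, rng=rng)
```

Each level starts from the output of the previous one, which is what lets the large noise scales move samples across low-density regions before the small scales refine them.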

4.2. Denoising Diffusion

Model (DDPM, Denoising Diffusion Probabilistic Models) Diffusion models are latent variable models of the form

\[p_\theta(x_0) = \int p_\theta(x_{0:T}) dx_{1:T}\]

where \(x_{1:T}\) are latent variables of the same dimensionality as the data \(x_0\)

reverse process The joint distribution \(p_\theta(x_{0:T})\) is called the reverse process. It is a Markov chain with learned Gaussian transitions, defined by

\[p_\theta(x_{0:T}) = p(x_T)\prod_{t=1}^T p_\theta(x_{t-1} | x_t)\]

where \(p(x_T) = N(x_T | 0, I)\) and

\[p_\theta(x_{t-1} | x_t) = N(x_{t-1} | \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))\]
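Sampling from the reverse process amounts to drawing \(x_T\) from the prior and then applying the Gaussian transitions for \(t = T, ..., 1\). The sketch below assumes a diagonal covariance; `mean_fn` and `var_fn` are hypothetical placeholders for \(\mu_\theta\) and the diagonal of \(\Sigma_\theta\), not a real library API.

```python
import numpy as np

def ancestral_sample(mean_fn, var_fn, shape, T, rng, x_T=None):
    """Draw x_0 by running the reverse chain from t = T down to t = 1.

    mean_fn(x, t) and var_fn(x, t) are stand-ins for mu_theta and the
    diagonal of Sigma_theta (assumptions for illustration).
    """
    # Start from the prior p(x_T) = N(0, I) unless a start point is given.
    x = rng.standard_normal(shape) if x_T is None else x_T
    for t in range(T, 0, -1):
        z = rng.standard_normal(shape)
        # x_{t-1} ~ N(mu_theta(x_t, t), Sigma_theta(x_t, t))
        x = mean_fn(x, t) + np.sqrt(var_fn(x, t)) * z
    return x

# Toy check with zero variance: each step halves x, so three steps give x / 8.
rng = np.random.default_rng(0)
out = ancestral_sample(
    mean_fn=lambda x, t: 0.5 * x,
    var_fn=lambda x, t: np.zeros_like(x),
    shape=(4,), T=3, rng=rng, x_T=np.ones(4),
)
```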

forward process, diffusion process The approximate posterior \(q(x_{1:T} | x_0)\) is a fixed Markov chain that gradually adds Gaussian noise to the data according to a variance schedule \(\beta_1, ..., \beta_T\)

\[q(x_{1:T} | x_0) = \prod_{t=1}^T q(x_t | x_{t-1})\]

where:

\[q(x_t | x_{t-1}) = N(x_t | \sqrt{1-\beta_t} x_{t-1}, \beta_t I)\]
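The forward chain can be simulated directly. The sketch below assumes a linear schedule from \(10^{-4}\) to \(0.02\) with \(T = 1000\) (illustrative values) and applies the transition one step at a time. After all \(T\) steps the product \(\prod_t (1 - \beta_t)\) is close to zero, so \(x_T\) should be approximately standard normal regardless of \(x_0\).

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    # Assumed schedule endpoints, used here only for illustration.
    return np.linspace(beta_start, beta_end, T)

def forward_step(x_prev, beta_t, rng):
    """Sample x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

def forward_chain(x0, betas, rng):
    """Apply every forward transition in order and return x_T."""
    x = x0
    for beta_t in betas:
        x = forward_step(x, beta_t, rng)
    return x

rng = np.random.default_rng(0)
betas = linear_beta_schedule()
x0 = np.full(10_000, 2.0)           # toy "data": every entry equals 2
xT = forward_chain(x0, betas, rng)
alpha_bar_T = np.prod(1.0 - betas)  # fraction of signal variance left at T
```

Because each step is Gaussian, the marginal \(q(x_t | x_0)\) is also Gaussian with mean \(\sqrt{\prod_{s \le t}(1-\beta_s)}\, x_0\), which is why training never needs to run the chain step by step.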

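The simplified objective trains a network \(\epsilon_\theta\) to predict the noise \(\epsilon \sim N(0, I)\) that was added to \(x_0\). With \(\alpha_t = 1 - \beta_t\) and \(\bar{\alpha}_t = \prod_{s=1}^t \alpha_s\), it is

\[L_{simple}(\theta) = E_{t, x_0, \epsilon}\left[ \| \epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t}x_0 + \sqrt{1- \bar{\alpha}_t}\epsilon, t) \|^2 \right]\]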
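A Monte Carlo estimate of this objective for one batch might look like the sketch below. In the code, `eps` is the sampled noise and `eps_model` is a hypothetical placeholder for the noise-prediction network; the linear schedule values are assumptions. A predictor that always outputs zero should give a loss close to the data dimension, since \(E\|\epsilon\|^2 = d\).

```python
import numpy as np

def l_simple(eps_model, x0, alpha_bars, rng):
    """One-batch estimate of the simplified noise-prediction loss.

    eps_model(x_t, t) is a stand-in for the network (assumption).
    x0 has shape (batch, d); t is 0-indexed into alpha_bars.
    """
    batch = x0.shape[0]
    t = rng.integers(0, len(alpha_bars), size=batch)  # uniform timestep
    eps = rng.standard_normal(x0.shape)               # target noise
    a = alpha_bars[t][:, None]                        # broadcast over d
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps    # noised input
    sq_err = np.sum((eps - eps_model(x_t, t)) ** 2, axis=1)
    return sq_err.mean()

rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
x0 = rng.standard_normal((4096, 8))

# Baseline: a "network" that always predicts zero noise.
zero_model = lambda x_t, t: np.zeros_like(x_t)
loss_zero = l_simple(zero_model, x0, alpha_bars, rng)
```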

This objective resembles denoising score matching over multiple noise scales indexed by \(t\), and its weighting is analogous to the loss weighting used by NCSN

Model (improved diffusion) The improvements over DDPM are:

The noise schedule is cosine instead of linear, which adds noise more slowly
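The difference between the two schedules can be checked numerically. The sketch below assumes the usual cosine form, with the cumulative signal level defined as \(f(t)/f(0)\) for \(f(t) = \cos^2\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right)\), offset \(s = 0.008\), and betas clipped at 0.999; the linear endpoints are also assumptions.

```python
import numpy as np

def cosine_alpha_bar(T, s=0.008):
    """Cumulative signal level for t = 0..T (index t, value 1 at t = 0)."""
    t = np.arange(T + 1) / T
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]

def cosine_betas(T, s=0.008, max_beta=0.999):
    """Per-step betas implied by the cosine curve, clipped for stability."""
    ab = cosine_alpha_bar(T, s)
    return np.clip(1.0 - ab[1:] / ab[:-1], 0.0, max_beta)

def linear_alpha_bar(T, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal level for t = 1..T under a linear beta schedule."""
    return np.cumprod(1.0 - np.linspace(beta_start, beta_end, T))

T = 1000
cos_ab = cosine_alpha_bar(T)   # cos_ab[t] is the level at step t
lin_ab = linear_alpha_bar(T)   # lin_ab[t - 1] is the level at step t
betas = cosine_betas(T)

# Signal remaining halfway through the chain under each schedule.
mid_cos = cos_ab[T // 2]
mid_lin = lin_ab[T // 2 - 1]
```

Halfway through the chain the cosine schedule retains far more of the signal than the linear one, which is the sense in which it adds noise more slowly.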

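The variance \(\Sigma_\theta(x_t, t)\) is learned instead of being fixed to \(\sigma_t^2 I\). The network outputs a vector \(v\) that interpolates in the log domain between \(\beta_t\) and \(\tilde{\beta}_t\), where \(\tilde{\beta}_t\) is the variance of the forward-process posterior \(q(x_{t-1} | x_t, x_0)\)

\[\Sigma_\theta(x_t, t) = \exp(v \log \beta_t + (1-v) \log \tilde{\beta}_t)\]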
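The interpolation is a geometric blend of the two candidate variances: \(v = 1\) gives \(\beta_t\), \(v = 0\) gives \(\tilde{\beta}_t\), and values in between give something in between. The sketch below assumes \(\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t\) and a linear schedule for illustration. It skips \(t = 1\), where \(\tilde{\beta}_1 = 0\) and the logarithm is undefined.

```python
import numpy as np

def learned_variance(v, beta_t, beta_tilde_t):
    """Interpolate between beta_t and beta_tilde_t in the log domain."""
    return np.exp(v * np.log(beta_t) + (1.0 - v) * np.log(beta_tilde_t))

# Assumed linear schedule, used only to produce concrete numbers.
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bars = np.cumprod(1.0 - betas)

# Posterior variances for t = 2..T (t = 1 would give beta_tilde = 0).
beta_t = betas[1:]
beta_tilde = (1.0 - alpha_bars[:-1]) / (1.0 - alpha_bars[1:]) * beta_t

var_lo = learned_variance(0.0, beta_t, beta_tilde)   # equals beta_tilde
var_hi = learned_variance(1.0, beta_t, beta_tilde)   # equals beta_t
var_mid = learned_variance(0.5, beta_t, beta_tilde)  # geometric mean
```

In practice \(v\) would be a per-dimension network output rather than a scalar; the scalar here just makes the endpoints easy to verify.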