0x400 Distributions

Discrete Distributions

A random variable $X$ is said to have a discrete distribution if the range of $X$ (the set of values it can take) is countable. Otherwise it is a continuous distribution.


Distribution (Bernoulli) A random variable $X$ is said to be a Bernoulli random variable with parameter $p$, shown as $X \sim Bernoulli(p)$, iff

$$ P_X(x) = \begin{cases}
p & \text{for } x = 1 \\
1-p & \text{for } x = 0 \\
0 & \text{otherwise} \end{cases} $$

or simply

$$P_X(x) = p^x (1-p)^{1-x}$$

A Bernoulli random variable is associated with a certain event. If event $A$ occurs, then $X=1$; otherwise $X=0$. For this reason, the Bernoulli random variable is also called the indicator random variable.

Lemma Properties of Bernoulli distribution

$$EX = p$$

$$\mathrm{Var}X = p(1-p)$$

$$M_X(t) = pe^t + (1-p)$$
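These properties can be checked directly from the pmf, since a Bernoulli variable has only two support points. The following is a minimal sketch; the helper names and the value $p = 0.3$ are chosen only for illustration.

```python
import math

def bernoulli_pmf(x, p):
    """P_X(x) = p^x (1-p)^(1-x) for x in {0, 1}, and 0 otherwise."""
    if x not in (0, 1):
        return 0.0
    return p**x * (1 - p)**(1 - x)

def bernoulli_expect(f, p):
    """E[f(X)], computed by summing over the two-point support."""
    return sum(f(x) * bernoulli_pmf(x, p) for x in (0, 1))

p = 0.3  # hypothetical parameter value
mean = bernoulli_expect(lambda x: x, p)                       # E X
var = bernoulli_expect(lambda x: (x - mean) ** 2, p)          # Var X
mgf_at_half = bernoulli_expect(lambda x: math.exp(0.5 * x), p)  # M_X(0.5)
print(mean, var, mgf_at_half)
```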

Distribution (multiclass Bernoulli)

Bernoulli can be generalized to deal with multiple classes. Suppose $\mathbf{\mu}=(\mu_1, \dots, \mu_K)$ where each $\mu_k \geq 0$ and $\sum_k \mu_k = 1$. Then the probability of $\mathbf{x}=(x_1, \dots, x_K)$, where $x_k \in \{0, 1\}$ and $\sum_k x_k=1$, is

$$P_X(\mathbf{x}) = \prod_{k=1}^K \mu_k^{x_k}$$


Distribution (Binomial) A random variable $X$ is said to be a binomial random variable with parameters $n$ and $p$, shown as $X \sim Binomial(n,p)$ iff

$$ P_X(k) = \begin{cases}
\binom{n}{k} p^k(1-p)^{n-k} & \text{for } k = 0,1,2,\dots,n \\
0 & \text{otherwise} \end{cases} $$

Note that $EX = np$ and $\mathrm{Var}X = np(1-p)$.

Lemma If $X_1, X_2, \dots, X_n$ are independent $Bernoulli(p)$ random variables, then the random variable $X$ defined by $X = X_1 + X_2 + \dots + X_n$ has a $Binomial(n,p)$ distribution, with mgf

$$M_X(t) = [pe^t + (1-p)]^n$$

Note that this follows by multiplying the $n$ Bernoulli mgfs, because the mgf of a sum of independent random variables is the product of their mgfs.
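The lemma can also be checked numerically without mgfs: the pmf of a sum of independent variables is the convolution of their pmfs. The sketch below convolves the Bernoulli pmf with itself $n$ times and compares the result with the binomial formula. The function names and the values $n = 6$, $p = 0.25$ are illustrative assumptions.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Binomial(n, p) pmf at k."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1 - p)**(n - k)

def convolve(a, b):
    """pmf of the sum of two independent variables with pmfs a and b on 0, 1, 2, ..."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def sum_of_bernoullis_pmf(n, p):
    """pmf of X_1 + ... + X_n for independent Bernoulli(p) variables."""
    pmf = [1.0]  # the empty sum is a point mass at 0
    for _ in range(n):
        pmf = convolve(pmf, [1 - p, p])
    return pmf

n, p = 6, 0.25  # hypothetical parameter values
by_convolution = sum_of_bernoullis_pmf(n, p)
by_formula = [binomial_pmf(k, n, p) for k in range(n + 1)]
print(max(abs(a - b) for a, b in zip(by_convolution, by_formula)))
```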


Distribution (multinomial) Binomial distribution can be generalized into the Multinomial distribution. A random variable $\mathbf{X}$ is said to have a $Multinomial(N, \mathbf{\mu})$ distribution when its pmf is

$$P_{\mathbf{X}}(\mathbf{x}) = \frac{N!}{x_1 ! x_2 ! \cdots x_K !} \prod_{k=1}^{K} \mu_k^{x_k}$$

for nonnegative integer counts $x_k$ with $\sum_k x_k = N$. The conjugate prior of the multinomial distribution is the Dirichlet distribution.
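As a small illustration, the multinomial pmf can be evaluated with the multinomial coefficient $N!/(x_1! \cdots x_K!)$ computed in exact integer arithmetic. This is only a sketch; `multinomial_pmf` is a hypothetical helper name, and with $K = 2$ it should agree with the binomial pmf.

```python
from math import comb, factorial

def multinomial_pmf(x, mu):
    """Multinomial pmf for counts x = (x_1, ..., x_K) and class probabilities mu."""
    N = sum(x)
    coef = factorial(N)
    for xk in x:
        coef //= factorial(xk)  # stays an exact integer at every step
    prob = float(coef)
    for xk, mk in zip(x, mu):
        prob *= mk**xk
    return prob

# With K = 2 classes the multinomial reduces to Binomial(N, mu_1).
two_class = multinomial_pmf([2, 3], [0.3, 0.7])
binomial = comb(5, 2) * 0.3**2 * 0.7**3
print(two_class, binomial)
```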


Distribution (Geometric) A random variable $X$ is said to be a geometric random variable with parameter $p$, shown as $X \sim Geometric(p)$ iff

$$ P_X(k) = \begin{cases}
p(1-p)^{k-1} & \text{for } k = 1,2,3,\dots \\
0 & \text{otherwise} \end{cases} $$

The geometric distribution is the simplest of the waiting time distributions and is sometimes used to model lifetimes and time until failure.


The Pascal distribution is a generalization of the geometric distribution: it reduces to $Geometric(p)$ when $m=1$.

Distribution (Pascal) A random variable $X$ is said to be a Pascal random variable with parameters $m$ and $p$, shown as $X \sim Pascal(m,p)$ iff

$$ P_X(k) = \begin{cases}
\binom{k-1}{m-1} p^m(1-p)^{k-m} & \text{for } k = m,m+1,m+2,\dots \\
0 & \text{otherwise} \end{cases} $$
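The relation to the geometric distribution can be seen by setting $m = 1$ in the Pascal pmf. A minimal sketch with hypothetical helper names and parameter values:

```python
from math import comb

def geometric_pmf(k, p):
    """Geometric(p) pmf on k = 1, 2, 3, ..."""
    return p * (1 - p) ** (k - 1) if k >= 1 else 0.0

def pascal_pmf(k, m, p):
    """Pascal(m, p) pmf on k = m, m+1, m+2, ..."""
    if k < m:
        return 0.0
    return comb(k - 1, m - 1) * p**m * (1 - p) ** (k - m)

p = 0.4  # hypothetical parameter value
# Largest gap between Pascal(1, p) and Geometric(p) over the first 20 support points.
max_gap = max(abs(pascal_pmf(k, 1, p) - geometric_pmf(k, p)) for k in range(1, 21))
# The Pascal(3, p) probabilities should sum to (numerically) one.
total = sum(pascal_pmf(k, 3, p) for k in range(3, 400))
print(max_gap, total)
```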

The Poisson distribution is usually applied to model the number of occurrences in a time interval or a spatial region, where the probability of an occurrence is proportional to the length of the interval. Examples are the number of earthquakes during 1 year and the number of people in an area.


Distribution (Poisson) A random variable $X$ is said to be a Poisson random variable with parameter $\lambda$, shown as $X \sim Poisson(\lambda)$ iff

$$ P_X(k) = \begin{cases}
\frac{e^{-\lambda}\lambda^k}{k!} & \text{for } k = 0,1,2,3,\dots \\
0 & \text{otherwise} \end{cases} $$

The Poisson distribution is the limit of the binomial distribution when $n$ is very large and $p$ is very small with $\lambda = np$ held fixed. This can be proved by convergence of the mgfs, using the Poisson mgf

$$M_X(t) = e^{\lambda(e^t-1)}$$
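As a sketch of the mgf argument, write $X_n \sim Binomial(n, \lambda/n)$ (the symbol $X_n$ is introduced here only for this derivation), substitute $p = \lambda/n$ into the binomial mgf $[pe^t + (1-p)]^n$, and use the standard limit $(1 + a/n)^n \to e^a$:

$$M_{X_n}(t) = \left[1 + \frac{\lambda(e^t - 1)}{n}\right]^n \longrightarrow e^{\lambda(e^t-1)} \quad \text{as } n \to \infty$$

The limit is the Poisson mgf for every $t$, which gives convergence in distribution of $X_n$ to $Poisson(\lambda)$.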

The Poisson pmf can be computed easily with the following recursive relation, starting from $P(X=0) = e^{-\lambda}$

$$P(X=x) = \frac{\lambda}{x} P(X=x-1), \quad x = 1,2,3,\dots$$
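The recursion avoids computing large factorials and powers separately. A minimal sketch, assuming the starting value $P(X=0) = e^{-\lambda}$ from the pmf and using a hypothetical helper name:

```python
from math import exp, factorial

def poisson_pmf_table(lam, x_max):
    """Return [P(X=0), ..., P(X=x_max)] using P(X=x) = (lam / x) * P(X=x-1)."""
    probs = [exp(-lam)]  # P(X=0)
    for x in range(1, x_max + 1):
        probs.append(probs[-1] * lam / x)
    return probs

lam = 3.5  # hypothetical rate
table = poisson_pmf_table(lam, 20)
# Direct evaluation of the pmf, for comparison.
direct = [exp(-lam) * lam**k / factorial(k) for k in range(21)]
print(max(abs(a - b) for a, b in zip(table, direct)))
```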

Continuous Distributions


Distribution (uniform) The continuous uniform distribution is defined to spread mass uniformly over an interval $[a,b]$, where its pdf is

$$f(x|a,b) = \frac{1}{b-a}, \quad a \leq x \leq b$$

Properties are

$$EX = \frac{a+b}{2}$$

$$\mathrm{Var} X = \frac{(b-a)^2}{12}$$


Distribution (exponential) A random variable $X$ has an exponential distribution $Exponential(\lambda)$ when its pdf is

$$f_X(x) = \frac{1}{\lambda} e^{-\frac{x}{\lambda}}, \quad x \geq 0$$

Some properties are

$$EX = \lambda$$

$$Var(X) = \lambda^2$$

Note that the exponential distribution is a special case of the gamma distribution, obtained by setting $\alpha=1$.

The exponential distribution can be used to model lifetimes, analogous to the use of the geometric distribution in the discrete case.

Distribution (Weibull) The Weibull distribution can be obtained from the exponential distribution: if $X$ has an exponential$(\beta)$ distribution, then $Y=X^{1/\gamma}$ has a Weibull$(\gamma, \beta)$ distribution.
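The resulting pdf can be sketched with a change of variables, assuming $\gamma > 0$ and the exponential pdf $f_X(x) = \frac{1}{\beta} e^{-x/\beta}$ for $x > 0$. Since $X = Y^\gamma$ is increasing in $Y$,

$$f_Y(y) = f_X(y^\gamma) \left| \frac{d}{dy} y^\gamma \right| = \frac{\gamma}{\beta} y^{\gamma-1} e^{-y^\gamma/\beta}, \quad y > 0$$

Setting $\gamma = 1$ recovers the exponential$(\beta)$ pdf.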


Distribution (beta)

Definition (beta function) The beta function $B(\alpha, \beta)$ is defined as

$$B(\alpha, \beta) = \int_0^1 x^{\alpha - 1} (1-x)^{\beta - 1} dx$$

It can be expressed in terms of the gamma function as

$$B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$$

Distribution (beta) The pdf of the beta distribution is, for $0 < x < 1$,

$$f(x|\alpha, \beta) = \frac{1}{B(\alpha, \beta)} x^{\alpha - 1} (1-x)^{\beta - 1}$$

The beta distribution is usually used to model proportions, which naturally lie between 0 and 1. It can take on many shapes; for example, when $\alpha=\beta=1$, it reduces to the uniform distribution on $[0,1]$.

The beta distribution is also the conjugate family of the binomial distribution, because its density has the same functional form in $x$ as the binomial likelihood has in $p$.

Properties Moments of the beta distribution are easily computed, because the integrand has the same form as that of the beta function

$$EX^n = \frac{B(\alpha+n, \beta)}{B(\alpha, \beta)}$$

The mean and variance of the beta distribution are

$$EX = \frac{\alpha}{\alpha+\beta}$$

$$\mathrm{Var} X = \frac{\alpha\beta}{(\alpha+\beta)^2 (\alpha+\beta+1)}$$
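The mean and variance follow from the moment formula with $n = 1$ and $n = 2$. The sketch below evaluates the beta function through `math.gamma` and checks both closed forms; the helper names and the values $\alpha = 2$, $\beta = 5$ are illustrative assumptions.

```python
from math import gamma

def beta_fn(a, b):
    """B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    return gamma(a) * gamma(b) / gamma(a + b)

def beta_moment(n, a, b):
    """E X^n = B(a + n, b) / B(a, b) for X ~ beta(a, b)."""
    return beta_fn(a + n, b) / beta_fn(a, b)

a, b = 2.0, 5.0  # hypothetical shape parameters
mean = beta_moment(1, a, b)
var = beta_moment(2, a, b) - mean**2
print(mean, a / (a + b))
print(var, a * b / ((a + b) ** 2 * (a + b + 1)))
```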


Distribution (dirichlet)

The Dirichlet distribution is a multiclass generalization of the beta distribution. It is used as the conjugate prior for the multinomial distribution.

$$Dir(\mu | \alpha) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_K)} \prod_{k=1}^K \mu_k^{\alpha_k - 1}, \quad \alpha_0 = \sum_{k=1}^K \alpha_k$$


Distribution (gamma)

Definition (gamma function) Gamma function is defined as

$$\Gamma(\alpha) = \int_{0}^{\infty} t^{\alpha-1} e^{-t} dt$$

The gamma function has the properties $\Gamma(\alpha+1) = \alpha \Gamma(\alpha)$ and $\Gamma(\frac{1}{2}) = \sqrt{\pi}$.

Distribution (gamma) The gamma$(\alpha, \beta)$ family has pdf

$$f(x | \alpha, \beta) = \frac{1}{\Gamma(\alpha) \beta^{\alpha}} x^{\alpha-1} e^{-x/\beta}, \quad x > 0$$

The parameter $\alpha$ is known as the shape parameter, which influences the peakedness of the distribution. The parameter $\beta$ is the scale parameter.

This pdf can be obtained by normalizing the integrand of the gamma function, which gives a density for a random variable $T$, and then rescaling the random variable as $X=\beta T$.

Properties Some properties of gamma distribution

$$EX = \alpha\beta$$

$$\mathrm{Var}X = \alpha\beta^2$$

$$M_X(t) = \left(\frac{1}{1-\beta t}\right)^\alpha, \quad t < \frac{1}{\beta}$$

Relation with Other Distributions

  • the exponential distribution is a special case of the gamma distribution when $\alpha=1, \beta=\lambda$
  • the chi-square distribution is a special case of the gamma distribution when $\alpha=p/2, \beta=2$
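Both special cases can be verified pointwise by evaluating the pdfs. This is a minimal sketch using the scale parameterization of these notes; the function names and test points are hypothetical.

```python
from math import exp, gamma

def gamma_pdf(x, alpha, beta):
    """gamma(alpha, beta) pdf with shape alpha and scale beta."""
    if x <= 0:
        return 0.0
    return x ** (alpha - 1) * exp(-x / beta) / (gamma(alpha) * beta**alpha)

def exponential_pdf(x, lam):
    """Exponential pdf with scale (mean) lam."""
    return exp(-x / lam) / lam if x > 0 else 0.0

def chi_square_pdf(x, p):
    """Chi-square pdf with p degrees of freedom."""
    if x <= 0:
        return 0.0
    return x ** (p / 2 - 1) * exp(-x / 2) / (gamma(p / 2) * 2 ** (p / 2))

points = [0.1, 0.5, 1.0, 2.5, 7.0]  # hypothetical evaluation points
# alpha = 1, beta = lam = 2 should match the exponential pdf.
exp_gap = max(abs(gamma_pdf(x, 1.0, 2.0) - exponential_pdf(x, 2.0)) for x in points)
# alpha = p/2, beta = 2 with p = 5 should match the chi-square pdf.
chi_gap = max(abs(gamma_pdf(x, 5 / 2, 2.0) - chi_square_pdf(x, 5)) for x in points)
print(exp_gap, chi_gap)
```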

Distribution (Wishart)


Distribution ($\chi^2$)

Distribution ($\chi^2$) The $\chi^2$ distribution is a special case of the gamma distribution. The chi-square distribution with $p$ degrees of freedom has the pdf of Gamma$(p/2, 2)$

$$f(x|p) = \frac{1}{\Gamma(p/2)2^{p/2}} x^{p/2 - 1} e^{-x/2}$$

Lemma If $Z$ is a $n(0,1)$ random variable, then $Z^2 \sim \chi_1^2$

Lemma If $X_1, \dots, X_n$ are independent and $X_i \sim \chi_{p_i}^2$, then $X_1 + \dots + X_n \sim \chi^2_{p_1+\dots+p_n}$

T distribution

Distribution (t)

In most practical cases, the variance $\sigma^2$ is unknown. Thus, to get any idea of the variability of the sample mean $\bar{X}$, we need to estimate this variance first.

Distribution (Student’s t, Gosset) Let $X_1, \dots, X_n$ be a random sample from a $n(\mu, \sigma^2)$ distribution. The quantity $(\bar{X}-\mu)/(S/\sqrt{n})$, where $S$ is the sample standard deviation, has Student’s t distribution with $n-1$ degrees of freedom. A t distribution with $p$ degrees of freedom has pdf

$$f_T(t) = \frac{\Gamma((p+1)/2)}{\Gamma(p/2)} \frac{1}{(p\pi)^{1/2} (1+t^2/p)^{(p+1)/2}}$$

This is the distribution of $U/\sqrt{V/p}$ where $U \sim n(0,1)$ and $V \sim \chi^2_p$ are independent.

Student’s t has no mgf because it does not have moments of all orders. If there are $p$ degrees of freedom, then there are only $p-1$ moments.

The t distribution has an important robustness property: because of its heavier tails, it is much less sensitive to outliers than the Gaussian distribution.
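The heavier tails can be seen by comparing the two densities far from the center. The sketch below implements the t pdf above and the standard normal pdf; the choice of $p = 3$ degrees of freedom and the evaluation points are illustrative assumptions.

```python
from math import exp, gamma, pi, sqrt

def t_pdf(t, p):
    """Student's t density with p degrees of freedom."""
    norm = gamma((p + 1) / 2) / (gamma(p / 2) * sqrt(p * pi))
    return norm * (1 + t * t / p) ** (-(p + 1) / 2)

def normal_pdf(x, mu=0.0, sigma2=1.0):
    """Normal density with mean mu and variance sigma2."""
    return exp(-((x - mu) ** 2) / (2 * sigma2)) / sqrt(2 * pi * sigma2)

# Far from the center, the t density dominates the Gaussian one,
# so extreme observations are far less surprising under the t model.
for t in (0.0, 2.0, 4.0, 6.0):
    print(t, t_pdf(t, 3), normal_pdf(t))
```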

F distribution

Distribution (F)

Distribution (Snedecor’s F, Fisher) Let $X_1, \dots, X_n$ be a random sample from a $n(\mu_X, \sigma^2_X)$ population and $Y_1, \dots, Y_m$ a random sample from a $n(\mu_Y, \sigma^2_Y)$ population. The random variable $F = (S_X^2/\sigma^2_X)/(S_Y^2/\sigma^2_Y)$ has a Snedecor’s F distribution with $n-1, m-1$ degrees of freedom.


If $X \sim F_{(p,q)}$, then $1/X \sim F_{(q,p)}$

If $X \sim t_q$, then $X^2 \sim F_{(1,q)}$

Gaussian Distribution

The normal distribution plays a central role in a large body of statistics.

Distribution (normal) The normal distribution with mean $\mu$ and variance $\sigma^2$ is given by

$$f_X(x | \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\{{-\frac{(x-\mu)^2}{2\sigma^2}}\}$$

The moment generating function of the normal distribution is

$$M_X(t) = \exp\{ \mu t + \sigma^2 t^2 /2 \}$$

Lemma (Stein) Let $X \sim n(\theta, \sigma^2)$, and let $g$ be a differentiable function satisfying $E|g'(X)| < \infty$. Then

$$E[ g(X) (X-\theta) ] = \sigma^2 Eg'(X)$$

This is a useful equality for deriving, for example, higher-order moments of the normal distribution.
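As a worked example, take $g(x) = x^2$, so that $g'(x) = 2x$. Stein's lemma gives

$$E[X^2(X-\theta)] = \sigma^2 E[2X] = 2\sigma^2\theta$$

and, using $EX^2 = \theta^2 + \sigma^2$, the third moment follows without any integration:

$$EX^3 = \theta EX^2 + 2\sigma^2\theta = \theta^3 + 3\theta\sigma^2$$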

Relation with other distributions

Suppose $X, Y$ are independent $N(0,1)$ random variables; then $X/Y$ is a Cauchy random variable.

Conditional Gaussian Distribution

Given a joint Gaussian distribution $N(x|\mu, \Sigma)$ with precision matrix $\Lambda = \Sigma^{-1}$

and the partition $x = (x_a, x_b)$, $\mu = (\mu_a, \mu_b)$,

$$\Sigma = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}, \quad \Lambda = \begin{pmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{pmatrix}$$

The conditional distribution is given by

$$p(x_a | x_b) = N(x_a|\mu_{a|b}, \Lambda_{aa}^{-1})$$

$$\mu_{a|b} = \mu_a - \Lambda_{aa}^{-1}\Lambda_{ab}(x_b - \mu_b)$$
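For a bivariate Gaussian, where $x_a$ and $x_b$ are scalars, the precision-based formulas can be evaluated by hand. The sketch below also compares them with the equivalent covariance form $\mu_{a|b} = \mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b - \mu_b)$ and variance $\Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}$. That covariance form is a standard identity that is not stated in these notes, and all names and numbers here are illustrative assumptions.

```python
def inverse_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def conditional_from_precision(mu, sigma, x_b):
    """Mean and variance of x_a given x_b, using the precision matrix."""
    lam = inverse_2x2(sigma)
    lam_aa, lam_ab = lam[0][0], lam[0][1]
    mean = mu[0] - (lam_ab / lam_aa) * (x_b - mu[1])
    var = 1.0 / lam_aa
    return mean, var

def conditional_from_covariance(mu, sigma, x_b):
    """Same quantities from the covariance blocks (assumed standard identity)."""
    s_aa, s_ab, s_bb = sigma[0][0], sigma[0][1], sigma[1][1]
    mean = mu[0] + s_ab / s_bb * (x_b - mu[1])
    var = s_aa - s_ab**2 / s_bb
    return mean, var

mu = [1.0, -2.0]                   # hypothetical mean
sigma = [[2.0, 0.6], [0.6, 1.0]]   # hypothetical covariance
m1, v1 = conditional_from_precision(mu, sigma, x_b=0.5)
m2, v2 = conditional_from_covariance(mu, sigma, x_b=0.5)
print(m1, v1, m2, v2)
```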

MLE Point Estimate

The maximum likelihood estimates for a Gaussian on a dataset $(x_1, \dots, x_n)$ are

$$\hat{\mu}_{ML} = \frac{1}{n} \sum_{i=1}^n x_i$$

$$\hat{\Sigma}_{ML} = \frac{1}{n} \sum_{i=1}^n (x_i - \hat{\mu}_{ML})(x_i - \hat{\mu}_{ML})^T$$
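A minimal sketch of these two estimators in plain Python, with data points stored as lists. The function name and the toy dataset are hypothetical. Note the $1/n$ normalization, which differs from the unbiased $1/(n-1)$ sample covariance.

```python
def gaussian_mle(data):
    """Maximum likelihood mean vector and covariance matrix (1/n normalization)."""
    n = len(data)
    d = len(data[0])
    mu = [sum(x[j] for x in data) / n for j in range(d)]
    sigma = [
        [sum((x[j] - mu[j]) * (x[k] - mu[k]) for x in data) / n for k in range(d)]
        for j in range(d)
    ]
    return mu, sigma

data = [[1.0, 2.0], [3.0, 0.0], [5.0, 4.0]]  # hypothetical toy dataset
mu_hat, sigma_hat = gaussian_mle(data)
print(mu_hat)
print(sigma_hat)
```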

Bayes Estimate

When the variance $\sigma^2$ is known, the conjugate prior of $\mu$ is a normal distribution. Suppose the prior distribution of $\mu$ is $\mathcal{N}(\mu_0, \sigma^2_0)$ and $x = (x_1, \dots, x_N)$ are $N$ observations; then the posterior distribution is

$$p(\mu | x) = \mathcal{N}(\mu_N, \sigma^2_N)$$


$$\mu_N = \frac{\sigma^2}{N\sigma^2_0 + \sigma^2} \mu_0 + \frac{N \sigma^2_0}{N\sigma^2_0 + \sigma^2} \mu_{ML}$$

$$\frac{1}{\sigma^2_N} = \frac{1}{\sigma^2_0} + \frac{N}{\sigma^2}$$
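The posterior mean is a weighted average of the prior mean and the ML estimate, and the precisions add. A minimal sketch of the update, assuming a known observation variance; the function name and the numbers are hypothetical.

```python
def posterior_for_mean(xs, sigma2, mu0, sigma2_0):
    """Posterior N(mu_N, sigma2_N) for the mean, given known variance sigma2."""
    n = len(xs)
    mu_ml = sum(xs) / n
    denom = n * sigma2_0 + sigma2
    mu_n = sigma2 / denom * mu0 + n * sigma2_0 / denom * mu_ml
    sigma2_n = 1.0 / (1.0 / sigma2_0 + n / sigma2)
    return mu_n, sigma2_n

xs = [1.0, 2.0, 3.0, 4.0]  # hypothetical observations, mu_ML = 2.5
mu_n, sigma2_n = posterior_for_mean(xs, sigma2=1.0, mu0=0.0, sigma2_0=1.0)
print(mu_n, sigma2_n)
```

With more data the weight on $\mu_{ML}$ approaches one and $\sigma^2_N$ shrinks toward zero.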

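When the mean is known, the conjugate prior of the precision $\lambda = 1/\sigma^2$ is a gamma distribution. Suppose the prior is a gamma distribution with hyperparameters $(a_0, b_0)$ (here $b$ is a rate parameter, i.e. the inverse of the scale $\beta$ used in the gamma pdf above); then the posterior is a gamma distribution with hyperparameters

$$a_N = a_0 + \frac{N}{2}$$

$$b_N = b_0 + \frac{N}{2}\sigma^2_{ML}$$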
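A minimal sketch of this update, assuming the mean $\mu$ is known, that $\sigma^2_{ML}$ is the average squared deviation from that known mean, and that the second gamma hyperparameter acts as a rate (so the posterior mean of the precision is $a_N/b_N$). The function name and the numbers are hypothetical.

```python
def posterior_for_precision(xs, mu, a0, b0):
    """Gamma posterior hyperparameters (a_N, b_N) for the precision, known mean mu."""
    n = len(xs)
    sigma2_ml = sum((x - mu) ** 2 for x in xs) / n
    a_n = a0 + n / 2
    b_n = b0 + n / 2 * sigma2_ml
    return a_n, b_n

xs = [1.0, 3.0]  # hypothetical observations around the known mean 2.0
a_n, b_n = posterior_for_precision(xs, mu=2.0, a0=1.0, b0=1.0)
print(a_n, b_n, a_n / b_n)  # hyperparameters and posterior mean of the precision
```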

In the multivariate case, the conjugate prior of the precision matrix is the Wishart distribution.

The conjugate prior of the variance $\sigma^2$ is the inverse-gamma distribution.

The conjugate prior of both the mean and the precision is called the normal-gamma distribution.

Family distributions

Exponential Family

Many common probability distributions are examples of a broad class called the exponential family, whose form is

$$p(x|\eta) = h(x)g(\eta) \exp(\eta^T u(x))$$

Here $\eta$ is the natural parameter, $u(x)$ is the sufficient statistic, and $g(\eta)$ is the normalizer.

This expression has many nice mathematical properties: for example, the expectation of $u(x)$ can be computed easily with the following formula. The covariance and higher-order moments have similar properties, so they can be computed by differentiation instead of integration.

$$E[u(x)] = - \nabla \log g(\eta)$$

The MLE of the exponential family has a similar form

$$ -\nabla \log g(\eta_{ML}) = \frac{1}{N} \sum_{n=1}^N u(x_n)$$
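As an illustration, the Bernoulli distribution can be written in this form with $h(x) = 1$, $u(x) = x$, $\eta = \log\frac{\mu}{1-\mu}$ and $g(\eta) = 1/(1+e^{\eta})$, where $\mu$ denotes the success probability. The sketch below checks the mean formula with a central finite difference. The step size `h` and the value $\mu = 0.3$ are arbitrary choices.

```python
from math import exp, log

def g(eta):
    """Normalizer of the Bernoulli in natural form: g(eta) = 1 / (1 + e^eta)."""
    return 1.0 / (1.0 + exp(eta))

def mean_from_normalizer(eta, h=1e-6):
    """E[u(x)] = -d/d(eta) log g(eta), approximated by a central difference."""
    return -(log(g(eta + h)) - log(g(eta - h))) / (2 * h)

mu = 0.3                    # hypothetical success probability
eta = log(mu / (1 - mu))    # corresponding natural parameter
print(mean_from_normalizer(eta))  # close to mu, since E[u(x)] = E[x] = mu
```

The same identity explains the MLE equation: setting the derivative-based mean equal to the sample average of $u(x_n)$ gives $\mu_{ML}$ as the fraction of ones.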

Sub-Gaussian Family

