# 0x042 Parametric Distribution

- 1. Discrete Distributions
- 2. Continuous Distributions
- 3. Gaussian Distribution
- 4. Exponential Family
- 5. Sub-Gaussian Family
- 6. Reference

## 1. Discrete Distributions

A random variable \(X\) is said to have a discrete distribution if the range of \(X\), the sample space, is countable. Otherwise it is a continuous distribution.

### 1.1. Bernoulli

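**Distribution (Bernoulli)** A random variable \(X\) is said to be a Bernoulli random variable with parameter \(p\), shown as \(X \sim Bernoulli(p)\), iff its pmf is

\[ P(X=1) = p, \quad P(X=0) = 1-p \]

or simply

\[ P(X=x) = p^x (1-p)^{1-x}, \quad x \in \{0, 1\} \]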

A Bernoulli random variable is associated with a certain event. If event \(A\) occurs, then \(X=1\); otherwise \(X=0\). For this reason, the Bernoulli random variable is also called the indicator random variable.

**Lemma** Properties of Bernoulli distribution

\[ EX = p, \quad Var(X) = p(1-p), \quad M_X(t) = (1-p) + pe^t \]

**Distribution (multiclass Bernoulli)**
The Bernoulli distribution can be generalized to deal with multiple classes. Suppose \(\mathbf{\mu}=(\mu_1, ..., \mu_K)\) where each \(\mu_k \geq 0\) and \(\sum_k \mu_k = 1\). Then the probability of \(\mathbf{x}=(x_1, ..., x_K)\), where \(x_k \in \{0, 1\} \land \sum_k x_k=1\), is

\[ p(\mathbf{x}|\mathbf{\mu}) = \prod_{k=1}^{K} \mu_k^{x_k} \]

### 1.2. Binomial

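**Distribution (Binomial)** A random variable \(X\) is said to be a binomial random variable with parameters \(n\) and \(p\), shown as \(X \sim Binomial(n,p)\) iff

\[ P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, \ldots, n \]

**Lemma (properties)**

\[ EX = np, \quad Var(X) = np(1-p), \quad M_X(t) = (pe^t + 1-p)^n \]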

**Lemma** If \(X_1, X_2, ... X_n\) are independent \(Bernoulli(p)\) random variables, then the random variable \(X\) defined by \(X = X_1 + X_2 + ... + X_n\) has a \(Binomial(n,p)\) distribution

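Note that this follows directly from multiplying the Bernoulli mgfs: the mgf of a sum of independent random variables is the product of their mgfs, so \(M_X(t) = \prod_{i=1}^{n} (1-p+pe^t) = (1-p+pe^t)^n\), which is the mgf of \(Binomial(n,p)\).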
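As a sanity check of the lemma, the following minimal sketch convolves \(n\) Bernoulli pmfs exactly and compares the result with the binomial pmf. The helper `convolve` and the values `n = 5`, `prob = 0.3` are illustrative assumptions, not part of the original notes.

```python
import math

def convolve(p, q):
    """pmf of the sum of two independent variables with pmfs p and q on 0..len-1."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

n, prob = 5, 0.3                 # hypothetical parameters
bernoulli = [1 - prob, prob]     # P(X_i = 0), P(X_i = 1)

total = [1.0]                    # pmf of the empty sum (always 0)
for _ in range(n):
    total = convolve(total, bernoulli)

binomial = [math.comb(n, k) * prob ** k * (1 - prob) ** (n - k) for k in range(n + 1)]

for k, (s, b) in enumerate(zip(total, binomial)):
    print(k, round(s, 6), round(b, 6))
```

The two columns should agree up to floating-point rounding for any choice of `n` and `prob`.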

**Lemma (conjugacy)** The conjugate distribution of binomial is the Beta distribution.

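If the prior is the beta distribution

\[ p(\mu|a,b) \propto \mu^{a-1}(1-\mu)^{b-1} \]

then, after observing \(k\) successes in \(n\) trials, the posterior is

\[ p(\mu|k,l,a,b) \propto \mu^{k+a-1}(1-\mu)^{l+b-1} \]

where \(l=n-k\). This is the kernel of the beta distribution \(Beta(k+a, l+b)\), so we can regard the hyperparameters \(a,b\) as an *effective number of observations* of \(x=1\) and \(x=0\).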

Under this formulation, we can compute the expectation of the posterior using the mean of the beta distribution.

\[ E[\mu|k,l,a,b] = \frac{k+a}{k+a+l+b} \]

Notice that when \(k, l \to \infty\), this reduces to the MLE of the binomial distribution, \(k/(k+l)\).
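The conjugate update can be sketched in a few lines. This assumes the standard convention that a \(Beta(a,b)\) prior has density proportional to \(\mu^{a-1}(1-\mu)^{b-1}\), so that \(k\) successes and \(l\) failures give a \(Beta(a+k, b+l)\) posterior. The function names and the numbers are hypothetical.

```python
def beta_posterior(a, b, k, l):
    """Posterior Beta parameters after k successes and l failures."""
    return a + k, b + l

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Hypothetical data: Beta(2, 2) prior, then 7 successes and 3 failures.
post_a, post_b = beta_posterior(2, 2, k=7, l=3)
posterior_mean = beta_mean(post_a, post_b)   # (7 + 2) / (10 + 4)
mle = 7 / (7 + 3)                            # k / (k + l)
print(post_a, post_b, posterior_mean, mle)

# With much more data at the same success ratio, the prior matters far less.
big_a, big_b = beta_posterior(2, 2, k=7000, l=3000)
print(beta_mean(big_a, big_b))
```

The second printout illustrates the limiting statement above: the posterior mean moves toward \(k/(k+l)\) as the counts grow.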

### 1.3. Multinomial

**Distribution (multinomial)** The binomial distribution can be generalized into the multinomial distribution. A random variable \(\mathbf{X}\) is said to have a \(Multinomial(N, \mathbf{\mu})\) distribution when its pmf is

\[ P(\mathbf{X}=(x_1, ..., x_K)) = \frac{N!}{x_1! \cdots x_K!} \prod_{k=1}^{K} \mu_k^{x_k}, \quad \sum_k x_k = N \]

The conjugate prior of the multinomial distribution is the Dirichlet distribution.

### 1.4. Geometric

**Distribution (Geometric)** A random variable \(X\) is said to be a geometric random variable with parameter p, shown as \(X \sim Geometric(p)\) iff

The geometric distribution is the simplest of the waiting time distributions and is sometimes used to model lifetimes and time until failure. However, it cannot model lifetimes for which the probability of failure is expected to increase with time.

It has a "memoryless" property, that is, it "forgets" what has occurred: the probability of waiting a further \(t\) trials does not depend on how many trials have already passed.

failure times

If the probability that a light bulb fails on any given day is 0.001, then the probability that it will last at least 30 days is \((1-0.001)^{30} \approx 0.970\), since the bulb must survive each of the first 30 days.
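A minimal sketch of the light bulb calculation, assuming \(X\) is the day on which the bulb fails (support \(1, 2, \ldots\)) with pmf \(p(1-p)^{x-1}\), so that lasting 30 days means \(X > 30\). The variable names are illustrative.

```python
p = 0.001      # daily failure probability from the example
days = 30

# P(X > 30): the bulb survives each of the first 30 days.
survival = (1 - p) ** days

# Same quantity via the cdf: sum the pmf p(1-p)^(x-1) over x = 1..30.
cdf = sum(p * (1 - p) ** (x - 1) for x in range(1, days + 1))

print(round(survival, 4), round(1 - cdf, 4))
```

Both routes give roughly 0.97, so a bulb with this failure rate is very likely to outlast a month.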

### 1.5. Negative Binomial

The negative binomial distribution is a generalization of the geometric distribution: it models the number of failures before the \(r\)-th success, whereas the geometric distribution models the number of failures before the first success.

**Distribution (Negative Binomial, Pascal)** A random variable \(X\) is said to be a negative binomial random variable with parameters \(r\) and \(p\), shown as \(X \sim NB(r,p)\) iff

\[ P(X=k) = \binom{k+r-1}{k} p^r (1-p)^k, \quad k = 0, 1, 2, \ldots \]

where \(r\) is the number of successes, \(k\) is the number of failures, and \(p\) is the probability of success on each trial.

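An equivalent formulation of this distribution models the waiting time \(Y\) (the number of trials) until the \(r\)-th success

\[ P(Y=y) = \binom{y-1}{r-1} p^r (1-p)^{y-r}, \quad y = r, r+1, \ldots \]

which can be interpreted as the sum of \(r\) iid geometric random variables

\[ Y = X_1 + X_2 + \cdots + X_r \]

where each \(X_i\) is the geometric waiting time (in trials) for a single success.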

### 1.6. Poisson

The Poisson distribution is usually applied to model the number of occurrences over time intervals or spatial regions, where the probability of an occurrence is proportional to the length of the interval. Examples are the number of earthquakes during one year and the number of people in an area.

**Distribution (Poisson)** A random variable \(X\) is said to be a Poisson random variable with parameter \(\lambda\), shown as \(X \sim Poisson(\lambda)\), iff

\[ P(X=k) = \frac{e^{-\lambda} \lambda^k}{k!}, \quad k = 0, 1, 2, \ldots \]

The Poisson distribution is the limit of the binomial distribution when \(n \to \infty\) and \(p \to 0\) with \(\lambda = np\) held fixed. This can be proved by convergence of the mgfs.
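The limit can be observed numerically. The sketch below fixes \(\lambda = 3\) (an arbitrary illustrative choice), sets \(p = \lambda/n\), and reports the largest pmf difference over \(k = 0, \ldots, 10\) for increasing \(n\). The function names are hypothetical helpers.

```python
import math

def binomial_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 3.0
gaps = {}
for n in (10, 100, 10000):
    p = lam / n                      # keep n * p = lam fixed
    gaps[n] = max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam))
                  for k in range(11))
    print(n, gaps[n])
```

The gap shrinks as \(n\) grows, which is why the Poisson distribution is a convenient approximation for rare events over many trials.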

Poisson probabilities can be computed easily with the following recursive relation, starting from \(P(X=0) = e^{-\lambda}\)

\[ P(X=x) = \frac{\lambda}{x} P(X=x-1), \quad x = 1, 2, \ldots \]
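A minimal sketch of the recursion, assuming it takes the standard form \(P(X=x) = \frac{\lambda}{x} P(X=x-1)\) with \(P(X=0)=e^{-\lambda}\). The function name `poisson_pmf_table` and the value \(\lambda = 2\) are illustrative.

```python
import math

def poisson_pmf_table(lam, x_max):
    """Return [P(X=0), ..., P(X=x_max)] for X ~ Poisson(lam) via the recursion."""
    probs = [math.exp(-lam)]                 # P(X = 0) = e^{-lam}
    for x in range(1, x_max + 1):
        probs.append(probs[-1] * lam / x)    # P(X = x) = (lam / x) * P(X = x - 1)
    return probs

table = poisson_pmf_table(lam=2.0, x_max=10)

# Direct evaluation of e^{-lam} lam^x / x! for comparison.
direct = [math.exp(-2.0) * 2.0 ** x / math.factorial(x) for x in range(11)]

for x, (r, d) in enumerate(zip(table, direct)):
    print(x, round(r, 6), round(d, 6))
```

The recursion avoids computing large factorials and powers separately, which helps when \(x\) is large.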

**Corollary (additivity)** If \(X \sim Poisson(\theta)\) and \(Y \sim Poisson(\lambda)\) are independent, then

\[ X + Y \sim Poisson(\theta + \lambda) \]

This can be proved by either the mgf or the bivariate transformation \((X,Y) \to (X+Y, Y)\).

## 2. Continuous Distributions

### 2.1. Uniform

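**Distribution (uniform)** The continuous uniform distribution is defined to spread mass uniformly over an interval \([a,b]\), where its pdf is

\[ f(x|a,b) = \frac{1}{b-a}, \quad x \in [a,b] \]

Properties are

\[ EX = \frac{a+b}{2}, \quad Var(X) = \frac{(b-a)^2}{12} \]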

### 2.2. Exponential

**Distribution (exponential)** A random variable \(X\) has an exponential distribution \(Exponential(\lambda)\) when its pdf is

Some properties are

Note that exponential distribution is a special case of gamma distribution by setting \(\alpha=1\)

Exponential distribution can be used to model lifetimes, analogous to the use of geometric distribution.

**Distribution (Weibull)** The Weibull distribution can be obtained from the exponential distribution: if \(X\) has an exponential\((\beta)\) distribution, then \(Y=X^{1/\gamma}\) has a Weibull\((\gamma, \beta)\) distribution.

### 2.3. Beta

The beta distribution offers a wide variety of shapes for distributions with support on bounded intervals (e.g., \([0,1]\)).

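**Definition (beta function)** The beta function \(B(\alpha, \beta)\) is defined as

\[ B(\alpha, \beta) = \int_0^1 x^{\alpha-1} (1-x)^{\beta-1} dx \]

It can be represented with the gamma function by

\[ B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)} \]

**Distribution (beta)** The beta distribution's pdf is

\[ f(x|\alpha, \beta) = \frac{1}{B(\alpha, \beta)} x^{\alpha-1} (1-x)^{\beta-1}, \quad 0 < x < 1 \]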

The beta distribution is usually used to model proportions, which naturally lie between 0 and 1. It can take on many shapes; for example, when \(\alpha=\beta=1\), it reduces to the uniform distribution.

Because its kernel \(x^{\alpha-1}(1-x)^{\beta-1}\) has the same functional form as the binomial likelihood in \(p\), the beta distribution is the conjugate family of the binomial distribution.

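**Properties** Moments of the beta distribution are easily computed, because the integrand is again a beta kernel

\[ EX^n = \frac{B(\alpha+n, \beta)}{B(\alpha, \beta)} = \frac{\Gamma(\alpha+n)\Gamma(\alpha+\beta)}{\Gamma(\alpha+\beta+n)\Gamma(\alpha)} \]

The mean and variance of the beta distribution are

\[ EX = \frac{\alpha}{\alpha+\beta}, \quad Var(X) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)} \]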

**Relations with other distributions**

The beta distribution can be obtained from the Gamma distribution with the transformation

\[ X = \frac{X_1}{X_1 + X_2} \]

where \(X_1, X_2\) are independent \(Gamma(\alpha, 1)\) and \(Gamma(\beta, 1)\) random variables.

When \(\alpha=\beta=1\), it becomes the uniform distribution.

### 2.4. Dirichlet

**Distribution (dirichlet)**
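The Dirichlet distribution is a multiclass generalization of the beta distribution. It is used as the conjugate prior for the multinomial distribution.

The Dirichlet distribution can be obtained by the following transformation

\[ Y_i = \frac{X_i}{\sum_{j=1}^{k+1} X_j}, \quad i = 1, \ldots, k \]

where \(X_1, ..., X_{k+1}\) are independent gamma random variables with a common scale parameter.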

### 2.5. Gamma

The main reason for the appeal of the Gamma distribution in applications is the variety of shapes it takes for different values of \(\alpha, \beta\).

Note that the beta distribution also has this variety, but the beta family has support on a bounded interval \((a,b)\), whereas the Gamma distribution has unbounded support \((0, \infty)\).

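**Definition (gamma function)** The gamma function is defined as

\[ \Gamma(\alpha) = \int_0^\infty t^{\alpha-1} e^{-t} dt \]

The gamma function has the properties \(\Gamma(\alpha+1) = \alpha \Gamma(\alpha)\) and \(\Gamma(\frac{1}{2}) = \sqrt{\pi}\). Since \(\Gamma(n) = (n-1)!\) for positive integers \(n\), it is sometimes called the factorial function.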
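These properties are easy to check numerically with the standard library's `math.gamma`. The sketch below is only a spot check at a few points (the value `alpha = 2.5` is arbitrary), not a proof.

```python
import math

# Gamma(n) = (n - 1)! for positive integers n.
for n in range(1, 7):
    print(n, math.gamma(n), math.factorial(n - 1))

# Recurrence Gamma(alpha + 1) = alpha * Gamma(alpha) at a non-integer point.
alpha = 2.5
lhs = math.gamma(alpha + 1)
rhs = alpha * math.gamma(alpha)

# Gamma(1/2) should equal sqrt(pi).
half = math.gamma(0.5)
print(lhs, rhs, half, math.sqrt(math.pi))
```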

Note the integral is taken over the positive real numbers \((0, \infty)\).

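**Distribution (gamma)** The gamma\((\alpha, \beta)\) family is

\[ f(x|\alpha, \beta) = \frac{1}{\Gamma(\alpha)\beta^{\alpha}} x^{\alpha-1} e^{-x/\beta}, \quad 0 < x < \infty, \quad \alpha, \beta > 0 \]

The parameter \(\alpha\) is known as the shape parameter, which influences the peakedness of the distribution, and the parameter \(\beta\) is the scale parameter.

This can be obtained by normalizing the integrand of the gamma function into a density for a random variable \(T\) and rescaling it as \(X=\beta T\).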

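**Properties (Gamma distribution)**

\[ EX = \alpha\beta, \quad Var(X) = \alpha\beta^2, \quad M_X(t) = \left(\frac{1}{1-\beta t}\right)^{\alpha}, \quad t < \frac{1}{\beta} \]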

**Lemma (additivity)** Let \(X_1, ..., X_n\) be independent random variables where \(X_i\) has a \(Gamma(\alpha_i, \beta)\) distribution, and let \(Y=\sum_i X_i\). Then \(Y\) has a \(Gamma(\sum_i \alpha_i, \beta)\) distribution.

Relation with Other Distributions

- the exponential distribution is a special case of the Gamma distribution when \(\alpha=1, \beta=\lambda\)
- the chi-square distribution with \(p\) degrees of freedom is a special case of the Gamma distribution when \(\alpha=p/2, \beta=2\)

Distribution (Wishart)

## 3. Gaussian Distribution

The normal distribution plays a central role in a large body of statistics.

### 3.1. Univariate Gaussian

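**Distribution (normal)** The normal distribution with mean \(\mu\) and variance \(\sigma^2\) is given by

\[ f(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \]

The moment generating function of the normal distribution can be computed as

\[ M_X(t) = \exp\left(\mu t + \frac{\sigma^2 t^2}{2}\right) \]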

68–95–99.7 rule

It is worth remembering a couple of numbers related to standard normal distribution

These numbers can be obtained using `ppf` or `cdf` in `scipy.stats`

```
In [1]: import scipy.stats
In [2]: scipy.stats.norm.ppf(0.95)  # 95th percentile of the standard normal
Out[2]: 1.6448536269514722
In [3]: scipy.stats.norm.cdf(1.64)  # P(Z <= 1.64)
Out[3]: 0.9494974165258963
```

confidence interval

The four commonly used confidence intervals for a normal distribution are:

- 68% of values fall within 1 standard deviation of the mean (\(\mu - \sigma \le X \le \mu + \sigma\))
- 90% of values fall within 1.65 standard deviations of the mean (\(\mu - 1.65\sigma \le X \le \mu + 1.65\sigma\))
- 95% of values fall within 1.96 standard deviations of the mean (\(\mu - 1.96\sigma \le X \le \mu + 1.96\sigma\))
- 99% of values fall within 2.58 standard deviations of the mean (\(\mu - 2.58\sigma \le X \le \mu + 2.58\sigma\))
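The coverage figures in this list can be reproduced without SciPy. The sketch below uses `statistics.NormalDist` from the Python standard library (a swapped-in alternative to `scipy.stats.norm`); the helper names `coverage` and `two_sided_z` are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def coverage(z):
    """P(-z <= Z <= z) for a standard normal Z."""
    return std_normal.cdf(z) - std_normal.cdf(-z)

def two_sided_z(level):
    """The z for which P(-z <= Z <= z) equals the given level."""
    return std_normal.inv_cdf(0.5 + level / 2)

# Coverage of the multipliers quoted in the list.
for z in (1.0, 1.65, 1.96, 2.58):
    print(z, round(coverage(z), 4))

# Going the other way: multipliers for the usual confidence levels.
for level in (0.90, 0.95, 0.99):
    print(level, round(two_sided_z(level), 3))
```

The second loop shows that 1.65 and 2.58 are rounded versions of the more precise multipliers for the 90% and 99% levels.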

**Corollary (additivity)** Let \(X_1, ..., X_n\) be independent random variables with \(N(\mu_i, \sigma^2_i)\) distributions, and let \(Y=\sum_{i=1}^{n} a_i X_i\). Then \(Y\) has the distribution

\[ Y \sim N\left(\sum_{i=1}^{n} a_i \mu_i, \sum_{i=1}^{n} a_i^2 \sigma_i^2\right) \]

**Lemma (Stein)** Let \(X \sim N(\theta, \sigma^2)\), and let \(g\) be a differentiable function satisfying \(E|g'(X)| < \infty\). Then

\[ E[g(X)(X-\theta)] = \sigma^2 E[g'(X)] \]

This is a useful equality for deriving, for example, higher order moments of the normal distribution.
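As a worked example of this use, assuming Stein's lemma in its standard form \(E[g(X)(X-\theta)] = \sigma^2 E[g'(X)]\), take \(g(x) = x^2\) to obtain the third moment:

\[
\begin{aligned}
E[X^3] &= E[X^2 (X - \theta)] + \theta E[X^2] \\
       &= \sigma^2 E[2X] + \theta(\sigma^2 + \theta^2) \\
       &= 2\theta\sigma^2 + \theta\sigma^2 + \theta^3 \\
       &= 3\theta\sigma^2 + \theta^3
\end{aligned}
\]

The same trick with \(g(x) = x^3\) gives the fourth moment, and so on, without evaluating any integrals directly.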

**Lemma (Cauchy and Gaussian)** Suppose \(X, Y\) are independent \(N(0,1)\) random variables, then \(X/Y\) is a Cauchy random variable.

### 3.2. Multivariate Gaussian

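**Distribution (multivariate normal)** The multivariate version of the normal distribution, for a \(D\)-dimensional vector \(x\), is

\[ N(x|\mathbf{\mu}, \mathbf{\Sigma}) = \frac{1}{(2\pi)^{D/2} |\mathbf{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(x-\mathbf{\mu})^T \mathbf{\Sigma}^{-1} (x-\mathbf{\mu})\right) \]

To verify that \(\mathbf{\mu}, \mathbf{\Sigma}\) are actually the mean and covariance, we can decompose \(\Sigma\) with the eigendecomposition \(\Sigma=Q \Lambda Q^T\)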

bivariate gaussian

By the previous representation, we can show the bivariate density is

This indicates that when the correlation is 0, \(x_1, x_2\) are independent.

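**Definition (canonical parameter)** The distribution can also be represented using the canonical parameters \(\Lambda, \xi\) instead of \(\mu, \Sigma\), where

\[ \Lambda = \Sigma^{-1}, \quad \xi = \Sigma^{-1}\mu \]

Using the canonical parameters, we can write the MVN in information form (canonical form) as follows

\[ N_c(x|\xi, \Lambda) = (2\pi)^{-D/2} |\Lambda|^{1/2} \exp\left(-\frac{1}{2}\left(x^T \Lambda x + \xi^T \Lambda^{-1} \xi - 2x^T \xi\right)\right) \]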

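**Definition (mgf)** The moment generating function of the multivariate normal distribution is

\[ M_X(t) = \exp\left(t^T \mu + \frac{1}{2} t^T \Sigma t\right) \]

**Corollary (linear transformation)** Suppose \(X\) has a \(N_n(\mu, \Sigma)\) distribution, and let \(Y=AX+b\) where \(A\) is an \(m \times n\) matrix. Then \(Y\) has the distribution

\[ Y \sim N_m(A\mu + b, A \Sigma A^T) \]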

Although the multivariate normal distribution is widely used as a density model, it has two limitations

- There are \(D(D+3)/2\) independent parameters in total, which grows quadratically with \(D\)
- It is intrinsically unimodal and unable to provide a good approximation to multimodal distributions.

However, these two limitations can be overcome by

- using a diagonal normal distribution to reduce the parameters from \(O(D^2)\) to \(O(D)\)
- using a mixture distribution instead

### 3.3. Conditional and Marginal Distribution

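Given a joint Gaussian distribution \(N(x|\mu, \Sigma)\) with the precision matrix \(\Lambda = \Sigma^{-1}\)

and \(x = (x_a, x_b)\), partition the mean and covariance accordingly

\[ \mu = \begin{pmatrix} \mu_a \\ \mu_b \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix} \]

The conditional distribution is given by

\[ p(x_a|x_b) = N(x_a|\mu_{a|b}, \Sigma_{a|b}) \]

where

\[ \mu_{a|b} = \mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b - \mu_b), \quad \Sigma_{a|b} = \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba} \]

Notice that \(\mu_{a|b}\) is a linear function of \(x_b\), and \(\Sigma_{a|b}\) has the form of the Schur complement \(A-BD^{-1}C\).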

bivariate conditional distribution

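Applying the previous derivation to the bivariate normal distribution, the conditional distribution of \(Y\) given \(X=x\) is

\[ Y | X=x \sim N\left(\mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(x - \mu_X), \sigma_Y^2(1-\rho^2)\right) \]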
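For the bivariate case, the standard result is that \(Y | X = x\) is normal with mean \(\mu_Y + \rho \frac{\sigma_Y}{\sigma_X}(x-\mu_X)\) and variance \(\sigma_Y^2(1-\rho^2)\). The sketch below evaluates it and cross-checks against the block (Schur complement) form with scalar blocks. The function name and all parameter values are hypothetical.

```python
def conditional_y_given_x(x, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Return (mean, variance) of Y | X = x for a bivariate normal (X, Y)."""
    mean = mu_y + rho * (sigma_y / sigma_x) * (x - mu_x)
    variance = sigma_y ** 2 * (1.0 - rho ** 2)
    return mean, variance

# Hypothetical parameters chosen only for illustration.
mu_x, mu_y, sigma_x, sigma_y, rho = 1.0, 0.0, 2.0, 3.0, 0.5
mean, variance = conditional_y_given_x(2.0, mu_x, mu_y, sigma_x, sigma_y, rho)

# Block form with a = Y, b = X:
#   mean = mu_a + S_ab S_bb^{-1} (x_b - mu_b)
#   var  = S_aa - S_ab S_bb^{-1} S_ba   (the Schur complement)
s_yy, s_xx, s_yx = sigma_y ** 2, sigma_x ** 2, rho * sigma_x * sigma_y
block_mean = mu_y + s_yx / s_xx * (2.0 - mu_x)
block_variance = s_yy - s_yx / s_xx * s_yx

print(mean, variance)              # 0.75 6.75
print(block_mean, block_variance)  # 0.75 6.75
```

Note that the conditional variance does not depend on the observed \(x\), only on \(\rho\) and \(\sigma_Y\).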

The marginal distribution is

\[ p(x_a) = N(x_a|\mu_a, \Sigma_{aa}) \]

It is also a Gaussian.

### 3.4. Linear Gaussian System

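Suppose we have two random variables \(x, y\) whose distributions are given as

\[ p(x) = N(x|\mu, \Lambda^{-1}), \quad p(y|x) = N(y|Ax+b, L^{-1}) \]

This is an example of a linear Gaussian system. The marginal distribution of \(y\) is given by

\[ p(y) = N(y|A\mu+b, L^{-1} + A\Lambda^{-1}A^T) \]

and the posterior is given by

\[ p(x|y) = N(x|\Sigma\{A^T L(y-b) + \Lambda\mu\}, \Sigma) \]

where

\[ \Sigma = (\Lambda + A^T L A)^{-1} \]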

## 4. Exponential Family

For a full treatment of exponential family, see Jordan's class note

**Distribution (exponential family)** A family of pdfs or pmfs is called an exponential family if it can be expressed as

\[ f(x|\theta) = h(x) c(\theta) \exp\left(\sum_{i=1}^{k} w_i(\theta) t_i(x)\right) \]

Many distributions are exponential families: binomial, Poisson, normal, etc.

Binomial distribution

The binomial distribution (with \(n\) known) is an exponential family; it can be expressed as

\[ f(x|p) = \binom{n}{x} (1-p)^n \exp\left(x \log\frac{p}{1-p}\right) \]

It is sometimes convenient to reparameterize the exponential family as

\[ f(x|\eta) = h(x) \exp\left(\eta^T T(x) - A(\eta)\right) \]

where \(\eta = (\eta_1, \eta_2, ..., \eta_k)\) is called the **natural parameter** or **canonical parameter**, \(T(x) = (T_1(x), ..., T_k(x))\) are the **sufficient statistics**, and \(A(\eta)\) is known as the **cumulant function**, which normalizes the distribution.

Gaussian distribution

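It can be written as

\[ f(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi}} \exp\left(\frac{\mu}{\sigma^2} x - \frac{1}{2\sigma^2} x^2 - \frac{\mu^2}{2\sigma^2} - \log\sigma\right) \]

where the natural parameter is \((\eta_1, \eta_2)= (\mu/\sigma^2, -1/(2\sigma^2))\) and the sufficient statistics are \((x, x^2)\).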

sample of exponential family

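The exponential family is preserved for an iid sample. Consider a sample \(X = \{ x_1, ..., x_N \}\); the likelihood function is given by

\[ p(X|\eta) = \left(\prod_{n=1}^{N} h(x_n)\right) \exp\left(\eta^T \sum_{n=1}^{N} T(x_n) - N A(\eta)\right) \]

The sufficient statistics here are:

\[ \sum_{n=1}^{N} T(x_n) \]

Taking the log and setting the gradient to 0, we obtain

\[ \nabla A(\eta_{ML}) = \frac{1}{N} \sum_{n=1}^{N} T(x_n) \]

Notice that the MLE depends on the data only through \(\sum_n T(x_n)\) (written \(\sum_n u(x_n)\) in Bishop's notation), which is the **sufficient statistic** here.

The set of \(\eta\) for which the normalizing integral \(\int h(x) \exp(\eta^T T(x)) dx\) is finite is referred to as the **natural parameter space**.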

**Definition (regular exponential family)** A regular exponential family is one whose natural parameter space is a nonempty open set.

**Definition (curved exponential family)** A curved exponential family is a family of densities in which the dimension \(d\) of the parameter vector \(\theta\) is smaller than the number \(k\) of terms in the exponent, \(d < k\). If \(d = k\), it is called a full exponential family.
A curved family arises when the parameter space is a lower-dimensional subset of the natural parameter space.

The exponential family has several appealing statistical and computational properties:

**Lemma (convexity)** The natural parameter space \(\mathcal{N}\) is a convex set, and the cumulant function \(A(\eta)\) is a convex function. If the family is minimal, then \(A(\eta)\) is strictly convex.

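**Lemma (moments computation)** Moments of the sufficient statistics can be computed easily, replacing integration by differentiation of the cumulant function.

\[ \nabla A(\eta) = E[T(X)], \quad \nabla^2 A(\eta) = Cov[T(X)] \]

The latter reveals that \(A\) is a convex function, since a covariance matrix is positive semidefinite.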
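A small numerical check of this idea, using the Poisson family as an assumed example: in natural form its pmf is \(\frac{1}{x!}\exp(\eta x - e^{\eta})\) with \(\eta = \log\lambda\), \(T(x) = x\) and \(A(\eta) = e^{\eta}\). The sketch compares finite-difference derivatives of \(A\) with the mean and variance computed from the pmf; the step size `h` and \(\lambda = 4\) are arbitrary choices.

```python
import math

def cumulant(eta):
    """Cumulant function of the Poisson family in natural form: A(eta) = exp(eta)."""
    return math.exp(eta)

def derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def second_derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

lam = 4.0
eta = math.log(lam)

# Mean and variance of T(X) = X computed directly from the (truncated) pmf.
pmf = [math.exp(-lam) * lam ** x / math.factorial(x) for x in range(100)]
mean = sum(x * q for x, q in enumerate(pmf))
var = sum((x - mean) ** 2 * q for x, q in enumerate(pmf))

print(derivative(cumulant, eta), mean)          # both close to lam
print(second_derivative(cumulant, eta), var)    # both close to lam
```

The positive second derivative is the one-dimensional instance of the convexity of \(A\).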

loglikelihood of exponential family is concave

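Consider the loglikelihood function of an iid sample in the natural parameterization

\[ \ell(\eta) = \sum_{n=1}^{N} \log h(x_n) + \eta^T \sum_{n=1}^{N} T(x_n) - N A(\eta) \]

It is concave in \(\eta\), because it is a linear function of \(\eta\) minus the convex function \(N A(\eta)\).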

## 5. Sub-Gaussian Family

Sub-gaussian

## 6. Reference

- [1] Casella, George, and Roger L. Berger. Statistical inference. Vol. 2. Pacific Grove, CA: Duxbury, 2002.
- [2] Bishop, Christopher M. Pattern recognition and machine learning. Springer, 2006.