0x042 Parametric Distribution
- 1. Discrete Distributions
- 2. Continuous Distributions
- 3. Gaussian Distribution
- 4. Exponential Family
- 5. Sub-Gaussian Family
- 6. Reference
1. Discrete Distributions
A random variable \(X\) is said to have a discrete distribution if the range of \(X\), the sample space, is countable. Otherwise it is a continuous distribution.
1.1. Bernoulli
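Distribution (Bernoulli) A random variable \(X\) is said to be a Bernoulli random variable with parameter \(p\), written \(X \sim Bernoulli(p)\), if its pmf is
\[ P(X=x) = \begin{cases} p & x = 1 \\ 1-p & x = 0 \end{cases} \]
or simply
\[ P(X=x) = p^x (1-p)^{1-x}, \quad x \in \{0, 1\} \]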
A Bernoulli random variable is associated with a certain event. If event \(A\) occurs, then \(X=1\); otherwise \(X=0\). For this reason, the Bernoulli random variable is also called the indicator random variable.
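Lemma (properties) The mean, variance and mgf of the Bernoulli distribution are
\[ EX = p, \quad \mathrm{Var}(X) = p(1-p), \quad M_X(t) = (1-p) + pe^t \]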
Distribution (multiclass Bernoulli) The Bernoulli distribution can be generalized to handle multiple classes. Suppose \(\mathbf{\mu}=(\mu_1, ..., \mu_K)\) where each \(\mu_k \geq 0\) and \(\sum_k \mu_k = 1\). Then the probability of \(\mathbf{x}=(x_1, ..., x_K)\), where \(x_k \in \{0, 1\}\) and \(\sum_k x_k=1\), is
\[ p(\mathbf{x}|\mathbf{\mu}) = \prod_{k=1}^{K} \mu_k^{x_k} \]
1.2. Binomial
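Distribution (Binomial) A random variable \(X\) is said to be a binomial random variable with parameters \(n\) and \(p\), written \(X \sim Binomial(n,p)\), iff
\[ P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, ..., n \]
Lemma (properties) The mean, variance and mgf of the binomial distribution are
\[ EX = np, \quad \mathrm{Var}(X) = np(1-p), \quad M_X(t) = [pe^t + (1-p)]^n \]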
Lemma If \(X_1, X_2, ..., X_n\) are independent \(Bernoulli(p)\) random variables, then the random variable \(X = X_1 + X_2 + ... + X_n\) has a \(Binomial(n,p)\) distribution.
This follows from the mgfs: the mgf of a sum of independent random variables is the product of their mgfs, so \(M_X(t) = [(1-p) + pe^t]^n\), which is the binomial mgf.
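The same fact can be checked numerically without mgfs. The sketch below convolves the Bernoulli pmf with itself \(n\) times and compares the result with `scipy.stats.binom.pmf`. The helper name `bernoulli_sum_pmf` and the values of `n` and `p` are illustrative choices, not part of the original notes.

```python
import numpy as np
from scipy import stats

def bernoulli_sum_pmf(n, p):
    """pmf of X_1 + ... + X_n for independent Bernoulli(p) variables, by repeated convolution."""
    pmf = np.array([1.0])           # pmf of the empty sum: point mass at 0
    step = np.array([1.0 - p, p])   # Bernoulli(p) pmf on {0, 1}
    for _ in range(n):
        pmf = np.convolve(pmf, step)
    return pmf

n, p = 10, 0.3                      # illustrative values
conv = bernoulli_sum_pmf(n, p)
exact = stats.binom.pmf(np.arange(n + 1), n, p)
print(np.allclose(conv, exact))
```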
Lemma (conjugacy) The conjugate prior for the binomial likelihood is the beta distribution.
If the prior on the success probability \(\mu\) is the beta distribution
\[ Beta(\mu|a,b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \mu^{a-1}(1-\mu)^{b-1} \]
then, after observing \(k\) successes in \(n\) trials, the posterior is
\[ p(\mu|k,l,a,b) \propto \mu^{k+a-1}(1-\mu)^{l+b-1} \]
where \(l=n-k\). This is the kernel of the beta distribution \(Beta(k+a, l+b)\), so the hyperparameters \(a, b\) can be regarded as effective numbers of prior observations of \(x=1\) and \(x=0\).
Under this formulation, the posterior mean follows from the mean of the beta distribution:
\[ E[\mu|k,l,a,b] = \frac{k+a}{k+a+l+b} \]
Notice that when \(k, l \to \infty\), this reduces to the MLE of the binomial distribution, \(k/n\).
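A minimal sketch of this update, assuming a \(Beta(a,b)\) prior and \(k\) successes, \(l\) failures, so that the posterior is \(Beta(a+k, b+l)\). The function name `beta_binomial_posterior` and the counts are hypothetical; `scipy.stats.beta` supplies the distribution.

```python
from scipy import stats

def beta_binomial_posterior(k, l, a, b):
    """Posterior Beta(a + k, b + l) after k successes and l failures under a Beta(a, b) prior."""
    return stats.beta(a + k, b + l)

a, b = 2.0, 2.0          # prior pseudo-counts (illustrative)
k, l = 7, 3              # observed successes and failures (illustrative)

post_mean = beta_binomial_posterior(k, l, a, b).mean()   # (k + a) / (k + a + l + b)
mle = k / (k + l)

# With many more observations the prior washes out and the mean approaches the MLE.
big_mean = beta_binomial_posterior(1000 * k, 1000 * l, a, b).mean()
print(post_mean, mle, big_mean)
```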
1.3. Multinomial
Distribution (multinomial) The binomial distribution can be generalized to the multinomial distribution. A random vector \(\mathbf{X}\) is said to have a \(Multinomial(N, \mathbf{\mu})\) distribution when its pmf is
\[ P(x_1, ..., x_K|N, \mathbf{\mu}) = \frac{N!}{x_1! \cdots x_K!} \prod_{k=1}^{K} \mu_k^{x_k}, \quad \sum_k x_k = N \]
The conjugate prior of the multinomial distribution is the Dirichlet distribution.
1.4. Geometric
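Distribution (Geometric) A random variable \(X\) is said to be a geometric random variable with parameter \(p\), written \(X \sim Geometric(p)\), iff
\[ P(X=x) = p(1-p)^{x-1}, \quad x = 1, 2, ... \]
where \(X\) is the trial on which the first success occurs.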
The geometric distribution is the simplest of the waiting-time distributions and is sometimes used to model lifetimes and times until failure. However, it cannot model lifetimes for which the probability of failure is expected to increase with time.
It has a "memoryless" property: it "forgets" what has occurred, so that \(P(X > s + t | X > s) = P(X > t)\).
failure times
If the probability that a light bulb fails on any given day is 0.001, then the probability that it will last at least 30 days is
\[ P(X > 30) = (1 - 0.001)^{30} = 0.999^{30} \approx 0.970 \]
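The light bulb example can be reproduced with `scipy.stats.geom`, which counts the trial of the first success (support \(1, 2, ...\)), so "lasting at least 30 days" is read here as \(P(X > 30)\). That reading, and the values of `s` and `t` used for the memoryless check, are assumptions of this sketch.

```python
from scipy import stats

p = 0.001                              # daily failure probability from the example

closed_form = (1 - p) ** 30            # no failure on any of the first 30 days
via_scipy = stats.geom.sf(30, p)       # survival function P(X > 30)

# Memoryless property: P(X > s + t | X > s) = P(X > t)
s, t = 100, 30                         # illustrative values
conditional = stats.geom.sf(s + t, p) / stats.geom.sf(s, p)
print(closed_form, via_scipy, conditional)
```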
1.5. Negative Binomial
The negative binomial distribution is a generalization of the geometric distribution. It models the number of failures before the \(r\)-th success (the geometric distribution covers the case of the first success, \(r=1\)).
Distribution (Negative Binomial, Pascal) A random variable \(X\) is said to be a negative binomial random variable with parameters \(r\) and \(p\), written \(X \sim NB(r,p)\), iff
\[ P(X=k) = \binom{k+r-1}{k} p^r (1-p)^k, \quad k = 0, 1, 2, ... \]
where \(r\) is the number of successes, \(k\) is the number of failures, and \(p\) is the probability of success on each trial.
An equivalent formulation models the waiting time \(Y = X + r\), the trial on which the \(r\)-th success occurs:
\[ P(Y=y) = \binom{y-1}{r-1} p^r (1-p)^{y-r}, \quad y = r, r+1, ... \]
which can be interpreted as a sum of iid geometric random variables
\[ Y = G_1 + G_2 + ... + G_r \]
where \(G_i\) are iid \(Geometric(p)\), each counting the trials up to and including a success.
1.6. Poisson
The Poisson distribution is usually applied to model the number of occurrences in a time interval or a spatial region, where the probability of an occurrence is proportional to the length of the interval. Examples are the number of earthquakes in one year and the number of people in an area.
Distribution (Poisson) A random variable \(X\) is said to be a Poisson random variable with parameter \(\lambda\), written \(X \sim Poisson(\lambda)\), iff its pmf is
\[ P(X=x) = \frac{e^{-\lambda}\lambda^x}{x!}, \quad x = 0, 1, 2, ... \]
The Poisson distribution is the limit of the binomial distribution when \(n\) is very large, \(p\) is very small, and \(\lambda = np\) is held fixed. This can be proved by convergence of mgfs.
Poisson probabilities can be computed easily with the recursive relation
\[ P(X=x) = \frac{\lambda}{x} P(X=x-1), \quad x = 1, 2, ... \]
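A small sketch of the recursion \(P(X=x) = \frac{\lambda}{x} P(X=x-1)\), started from \(P(X=0) = e^{-\lambda}\) and compared against `scipy.stats.poisson.pmf`. The function name `poisson_pmf_recursive` and the choice \(\lambda = 3\) are illustrative.

```python
import math
from scipy import stats

def poisson_pmf_recursive(lam, x_max):
    """Return [P(X=0), ..., P(X=x_max)] for X ~ Poisson(lam) using the one-step recursion."""
    probs = [math.exp(-lam)]                 # base case P(X=0)
    for x in range(1, x_max + 1):
        probs.append(probs[-1] * lam / x)    # P(X=x) = (lam / x) * P(X=x-1)
    return probs

probs = poisson_pmf_recursive(3.0, 10)
reference = stats.poisson.pmf(list(range(11)), 3.0)
print(max(abs(a - b) for a, b in zip(probs, reference)))
```

The recursion avoids evaluating factorials and large powers directly, which is why it is convenient for hand or table computation.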
Corollary (additivity) If \(X \sim Poisson(\theta)\) and \(Y \sim Poisson(\lambda)\) are independent, then
\[ X + Y \sim Poisson(\theta + \lambda) \]
This can be proved either with mgfs or with the bivariate transformation \((X,Y) \to (X+Y, Y)\).
2. Continuous Distributions
2.1. Uniform
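Distribution (uniform) The continuous uniform distribution spreads mass evenly over an interval \([a,b]\). Its pdf is
\[ f(x|a,b) = \begin{cases} \frac{1}{b-a} & x \in [a,b] \\ 0 & \text{otherwise} \end{cases} \]
Its mean and variance are
\[ EX = \frac{a+b}{2}, \quad \mathrm{Var}(X) = \frac{(b-a)^2}{12} \]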
2.2. Exponential
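Distribution (exponential) A random variable \(X\) has an exponential distribution \(Exponential(\lambda)\) when its pdf is
\[ f(x|\lambda) = \frac{1}{\lambda} e^{-x/\lambda}, \quad 0 \leq x < \infty \]
where \(\lambda\) is a scale parameter, matching the scale convention of the gamma distribution in these notes.
Some properties are
\[ EX = \lambda, \quad \mathrm{Var}(X) = \lambda^2 \]
Note that the exponential distribution is the special case of the gamma distribution with \(\alpha=1\).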
Exponential distribution can be used to model lifetimes, analogous to the use of geometric distribution.
Distribution (Weibull) The Weibull distribution can be obtained from the exponential distribution: if \(X\) has an exponential\((\beta)\) distribution, then \(Y=X^{1/\gamma}\) has a Weibull\((\gamma, \beta)\) distribution.
2.3. Beta
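The beta distribution offers a wide variety of shapes for distributions with support on a bounded interval (e.g. \([0,1]\)).
Definition (beta function) The beta function \(B(\alpha, \beta)\) is defined as
\[ B(\alpha, \beta) = \int_0^1 x^{\alpha-1}(1-x)^{\beta-1} dx \]
It can be expressed with the gamma function as
\[ B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)} \]
Distribution (beta) The pdf of the beta distribution is
\[ f(x|\alpha,\beta) = \frac{1}{B(\alpha,\beta)} x^{\alpha-1}(1-x)^{\beta-1}, \quad 0 < x < 1 \]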
The beta distribution is usually used to model proportions, which naturally lie between 0 and 1. It can take on many shapes; for example, when \(\alpha=\beta=1\), it reduces to the uniform distribution.
Because its kernel \(x^{\alpha-1}(1-x)^{\beta-1}\) has the same form as the binomial likelihood, the beta distribution is the conjugate family of the binomial distribution.
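Properties Moments of the beta distribution are easily computed, because the integrand is again a beta kernel
\[ EX^n = \frac{B(\alpha+n, \beta)}{B(\alpha,\beta)} = \frac{\Gamma(\alpha+n)\Gamma(\alpha+\beta)}{\Gamma(\alpha+\beta+n)\Gamma(\alpha)} \]
The mean and variance of the beta distribution are
\[ EX = \frac{\alpha}{\alpha+\beta}, \quad \mathrm{Var}(X) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)} \]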
Relations with other distributions
The beta distribution can be obtained from the gamma distribution with the transformation
\[ Y = \frac{X_1}{X_1+X_2} \sim Beta(\alpha, \beta) \]
where \(X_1, X_2\) are independent with \(Gamma(\alpha, 1), Gamma(\beta, 1)\) distributions.
When \(\alpha=\beta=1\), it becomes the uniform distribution on \([0,1]\).
2.4. Dirichlet
Distribution (dirichlet) The Dirichlet distribution is a multiclass generalization of the beta distribution. It is used as the conjugate prior for the multinomial distribution.
The Dirichlet distribution can be obtained by the following transformation
\[ Y_i = \frac{X_i}{\sum_{j=1}^{k+1} X_j}, \quad i = 1, ..., k \]
where \(X_1, ..., X_{k+1}\) are independent \(Gamma(\alpha_i, 1)\) random variables.
2.5. Gamma
The main reason for the appeal of the gamma distribution in applications is the variety of shapes it takes for different \(\alpha, \beta\).
Note that the beta distribution also has this variety, but the beta family is supported on a bounded interval, whereas the gamma distribution has unbounded support \((0, \infty)\).
Definition (gamma function) The gamma function is defined as
\[ \Gamma(\alpha) = \int_0^\infty t^{\alpha-1} e^{-t} dt \]
The gamma function satisfies \(\Gamma(\alpha+1) = \alpha \Gamma(\alpha)\) and \(\Gamma(\frac{1}{2}) = \sqrt{\pi}\). Since \(\Gamma(n+1) = n!\) for nonnegative integers \(n\), it is sometimes called the factorial function.
Note that the integrand is supported on the positive real numbers \((0, \infty)\).
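Distribution (gamma) The gamma\((\alpha, \beta)\) family is
\[ f(x|\alpha,\beta) = \frac{1}{\Gamma(\alpha)\beta^\alpha} x^{\alpha-1} e^{-x/\beta}, \quad 0 < x < \infty, \quad \alpha, \beta > 0 \]
The parameter \(\alpha\) is known as the shape parameter, which influences the peakedness of the distribution; the parameter \(\beta\) is the scale parameter.
The density is obtained by normalizing the integrand of the gamma function, \(f(t) = t^{\alpha-1}e^{-t}/\Gamma(\alpha)\), and scaling the random variable as \(X=\beta T\).
Properties (Gamma distribution)
\[ EX = \alpha\beta, \quad \mathrm{Var}(X) = \alpha\beta^2, \quad M_X(t) = (1-\beta t)^{-\alpha}, \quad t < 1/\beta \]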
Lemma (additivity) Let \(X_1, ..., X_n\) be independent random variables where \(X_i\) has a \(Gamma(\alpha_i, \beta)\) distribution, and let \(Y=\sum_i X_i\). Then \(Y\) has a \(Gamma(\sum_i \alpha_i, \beta)\) distribution.
Relation with Other Distributions
- the exponential distribution is the special case of the gamma distribution with \(\alpha=1, \beta=\lambda\)
- the chi-square distribution with \(p\) degrees of freedom is the special case of the gamma distribution with \(\alpha=p/2, \beta=2\)
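Both special cases can be checked numerically. The sketch below assumes the shape/scale parameterization, which is what `scipy.stats.gamma` uses through its shape argument and `scale` keyword; the grid `x` and the values of `lam`, `p`, `alpha` and `beta` are illustrative.

```python
import numpy as np
from scipy import stats

x = np.linspace(0.1, 10.0, 50)   # evaluation grid (illustrative)
lam = 2.0                        # exponential scale parameter
p = 5                            # chi-square degrees of freedom

# Gamma(alpha=1, beta=lam) versus Exponential with scale lam
gamma_as_expon = stats.gamma.pdf(x, 1.0, scale=lam)
expon = stats.expon.pdf(x, scale=lam)

# Gamma(alpha=p/2, beta=2) versus chi-square with p degrees of freedom
gamma_as_chi2 = stats.gamma.pdf(x, p / 2, scale=2.0)
chi2 = stats.chi2.pdf(x, p)

# Mean alpha*beta and variance alpha*beta^2 for a generic member of the family
alpha, beta = 3.0, 2.0
g = stats.gamma(alpha, scale=beta)
print(np.allclose(gamma_as_expon, expon), np.allclose(gamma_as_chi2, chi2), g.mean(), g.var())
```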
Distribution (Wishart)
3. Gaussian Distribution
The normal distribution plays a central role in a large body of statistics.
3.1. Univariate Gaussian
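Distribution (normal) The normal distribution with mean \(\mu\) and variance \(\sigma^2\) is given by
\[ f(x|\mu,\sigma^2) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \]
The moment generating function of the normal distribution is
\[ M_X(t) = \exp\left(\mu t + \frac{\sigma^2 t^2}{2}\right) \]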
68–95–99.7 rule
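It is worth remembering a few numbers related to the standard normal distribution \(Z\):
\[ P(|Z| \leq 1) \approx 0.6827, \quad P(|Z| \leq 2) \approx 0.9545, \quad P(|Z| \leq 3) \approx 0.9973 \]
These numbers can be obtained using ppf or cdf in scipy.stats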
In [1]: scipy.stats.norm.ppf(0.95)
Out[1]: 1.6448536269514722
In [6]: scipy.stats.norm.cdf(1.64)
Out[6]: 0.9494974165258963
confidence interval
The four commonly used confidence intervals for a normal distribution are:
- 68% of values fall within 1 standard deviation of the mean (-1s <= X <= 1s)
- 90% of values fall within 1.65 standard deviations of the mean (-1.65s <= X <= 1.65s)
- 95% of values fall within 1.96 standard deviations of the mean (-1.96s <= X <= 1.96s)
- 99% of values fall within 2.58 standard deviations of the mean (-2.58s <= X <= 2.58s)
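The multipliers in this list are two-sided standard normal quantiles and can be recovered with `scipy.stats.norm.ppf`. The helper name `two_sided_z` is a hypothetical one introduced for this sketch.

```python
from scipy import stats

def two_sided_z(coverage):
    """Return z such that P(-z <= Z <= z) = coverage for a standard normal Z."""
    return stats.norm.ppf(0.5 + coverage / 2.0)

for coverage in (0.68, 0.90, 0.95, 0.99):
    print(coverage, two_sided_z(coverage))

# Conversely, the mass within one standard deviation of the mean
within_one_sd = stats.norm.cdf(1.0) - stats.norm.cdf(-1.0)
print(within_one_sd)
```

The list rounds these quantiles, for instance 1.65 for the 90% multiplier.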
Corollary (additivity) Let \(X_1, ..., X_n\) be independent random variables with \(N(\mu_i, \sigma^2_i)\) distributions, and let \(Y=\sum_{i=1}^{n} a_i X_i\). Then \(Y\) has the distribution
\[ Y \sim N\left(\sum_{i=1}^{n} a_i \mu_i, \sum_{i=1}^{n} a_i^2 \sigma_i^2\right) \]
Lemma (Stein) Let \(X \sim N(\theta, \sigma^2)\), and let \(g\) be a differentiable function satisfying \(E|g'(X)| < \infty\). Then
\[ E[g(X)(X-\theta)] = \sigma^2 E g'(X) \]
This identity is useful for deriving, for example, higher-order moments of the normal distribution.
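As a worked instance, Stein's lemma in its usual form \(E[g(X)(X-\theta)] = \sigma^2 E g'(X)\), applied with \(g(x) = x^2\), gives the third moment of \(X \sim N(\theta, \sigma^2)\). The choice of \(g\) is the only assumption here.

```latex
\begin{align*}
E X^3 &= E[X^2 (X - \theta)] + \theta E X^2 \\
      &= 2\sigma^2 E X + \theta E X^2 \\
      &= 2\sigma^2 \theta + \theta (\sigma^2 + \theta^2) \\
      &= 3\theta\sigma^2 + \theta^3
\end{align*}
```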
Lemma (Cauchy and Gaussian) Suppose \(X, Y\) are independent \(N(0,1)\) random variables. Then \(X/Y\) is a Cauchy random variable.
3.2. Multivariate Gaussian
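Distribution (multivariate normal) The multivariate version of the normal distribution in \(D\) dimensions is
\[ N(\mathbf{x}|\mathbf{\mu},\mathbf{\Sigma}) = \frac{1}{(2\pi)^{D/2}|\mathbf{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\mathbf{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x}-\mathbf{\mu})\right) \]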
To verify \(\mathbf{\mu}, \mathbf{\Sigma}\) are actually mean and variance, we can decompose \(\Sigma\) with eigendecomposition \(\Sigma=Q \Lambda Q^T\)
bivariate gaussian
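By the previous representation, we can show the bivariate density with correlation \(\rho\) is
\[ f(x_1,x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left(-\frac{1}{2(1-\rho^2)}\left[\frac{(x_1-\mu_1)^2}{\sigma_1^2} - \frac{2\rho(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2} + \frac{(x_2-\mu_2)^2}{\sigma_2^2}\right]\right) \]
This shows that when the correlation is 0, the density factorizes and \(x_1, x_2\) are independent.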
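Definition (canonical parameter) The distribution can also be represented using the canonical parameters \(\Lambda, \xi\) instead of \(\mu, \Sigma\), where
\[ \Lambda = \Sigma^{-1}, \quad \xi = \Sigma^{-1}\mu \]
Using the canonical parameters, we can write the MVN in information form (canonical form) as follows
\[ N_c(x|\xi,\Lambda) = (2\pi)^{-D/2}|\Lambda|^{1/2} \exp\left[-\frac{1}{2}\left(x^T\Lambda x + \xi^T\Lambda^{-1}\xi - 2x^T\xi\right)\right] \]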
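Definition (mgf) The moment generating function of the multivariate normal distribution is
\[ M_X(t) = \exp\left(\mu^T t + \frac{1}{2} t^T \Sigma t\right) \]
Corollary (linear transformation) Suppose \(X\) has a \(N_n(\mu, \Sigma)\) distribution, and let \(Y=AX+b\) where \(A\) is an \(m \times n\) matrix. Then \(Y\) has the distribution
\[ Y \sim N_m(A\mu + b, A\Sigma A^T) \]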
Although the multivariate normal distribution is widely used as a density model, it has two limitations
- There are \(D(D+3)/2\) independent parameters in total, which grows quadratically with the dimension \(D\)
- It is intrinsically unimodal and unable to provide a good approximation to multimodal distributions.
However, these two limitations can be overcome by
- using a diagonal covariance matrix to reduce the number of parameters from \(O(D^2)\) to \(O(D)\)
- using a mixture distribution instead
3.3. Conditional and Marginal Distribution
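Given a joint Gaussian distribution \(N(x|\mu, \Sigma)\) with the precision matrix \(\Lambda = \Sigma^{-1}\)
and \(x = (x_a, x_b)\), partition the parameters conformably as
\[ \mu = \begin{pmatrix} \mu_a \\ \mu_b \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}, \quad \Lambda = \begin{pmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{pmatrix} \]
The conditional distribution is given by
\[ p(x_a|x_b) = N(x_a|\mu_{a|b}, \Sigma_{a|b}) \]
where
\[ \mu_{a|b} = \mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b - \mu_b), \quad \Sigma_{a|b} = \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba} = \Lambda_{aa}^{-1} \]
Notice that \(\mu_{a|b}\) is a linear function of \(x_b\), and \(\Sigma_{a|b}\) has the form of the Schur complement \(A-BD^{-1}C\).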
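A minimal NumPy sketch of Gaussian conditioning, assuming the standard formulas \(\mu_{a|b} = \mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b - \mu_b)\) and \(\Sigma_{a|b} = \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}\). The function name `condition_gaussian`, the index-list interface and the numbers are hypothetical choices for illustration.

```python
import numpy as np

def condition_gaussian(mu, Sigma, idx_a, idx_b, x_b):
    """Mean and covariance of x_a given x_b for a joint Gaussian N(mu, Sigma)."""
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    x_b = np.asarray(x_b, dtype=float)
    S_aa = Sigma[np.ix_(idx_a, idx_a)]
    S_ab = Sigma[np.ix_(idx_a, idx_b)]
    S_bb = Sigma[np.ix_(idx_b, idx_b)]
    # gain = S_ab S_bb^{-1}, computed with a linear solve (S_bb is symmetric)
    gain = np.linalg.solve(S_bb, S_ab.T).T
    mu_cond = mu[idx_a] + gain @ (x_b - mu[idx_b])
    Sigma_cond = S_aa - gain @ S_ab.T          # Schur complement of S_bb
    return mu_cond, Sigma_cond

mu = [0.0, 1.0]
Sigma = [[2.0, 0.8], [0.8, 1.0]]
mu_cond, Sigma_cond = condition_gaussian(mu, Sigma, [0], [1], [2.0])

# Cross-check: the conditional covariance equals the inverse of the precision block Lambda_aa
Lambda = np.linalg.inv(np.asarray(Sigma))
print(mu_cond, Sigma_cond, 1.0 / Lambda[0, 0])
```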
bivariate conditional distribution
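Applying the previous derivation to the bivariate normal distribution, the conditional distribution of \(Y\) given \(X=x\) is
\[ Y|X=x \sim N\left(\mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(x-\mu_X), \sigma_Y^2(1-\rho^2)\right) \]
The marginal distribution is
\[ p(x_a) = \int p(x_a, x_b) dx_b = N(x_a|\mu_a, \Sigma_{aa}) \]
It is also a Gaussian.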
3.4. Linear Gaussian System
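Suppose we have two random variables \(x, y\) whose distributions are given as
\[ p(x) = N(x|\mu, \Lambda^{-1}), \quad p(y|x) = N(y|Ax+b, L^{-1}) \]
This is an example of a linear Gaussian system. The marginal distribution of \(y\) is given by
\[ p(y) = N(y|A\mu+b, L^{-1}+A\Lambda^{-1}A^T) \]
and the posterior is given by
\[ p(x|y) = N(x|\Sigma\{A^T L(y-b) + \Lambda\mu\}, \Sigma) \]
where
\[ \Sigma = (\Lambda + A^T L A)^{-1} \]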
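A sketch of these computations in NumPy, assuming the parameterization of Bishop [2]: prior \(p(x) = N(x|\mu, \Lambda^{-1})\), likelihood \(p(y|x) = N(y|Ax+b, L^{-1})\), marginal covariance \(L^{-1} + A\Lambda^{-1}A^T\) and posterior covariance \((\Lambda + A^T L A)^{-1}\). The function name `linear_gaussian_posterior` and the scalar example are illustrative.

```python
import numpy as np

def linear_gaussian_posterior(mu, Lam, A, b, L, y):
    """Return (marginal mean of y, marginal cov of y, posterior mean of x, posterior cov of x)."""
    Lam_inv = np.linalg.inv(Lam)
    L_inv = np.linalg.inv(L)
    y_mean = A @ mu + b
    y_cov = L_inv + A @ Lam_inv @ A.T
    post_cov = np.linalg.inv(Lam + A.T @ L @ A)
    post_mean = post_cov @ (A.T @ L @ (y - b) + Lam @ mu)
    return y_mean, y_cov, post_mean, post_cov

# Scalar example: unit-variance prior, noise variance 0.25 (precision 4), one observation y = 1
mu = np.array([0.0])
Lam = np.array([[1.0]])
A = np.array([[1.0]])
b = np.array([0.0])
L = np.array([[4.0]])
y = np.array([1.0])

y_mean, y_cov, post_mean, post_cov = linear_gaussian_posterior(mu, Lam, A, b, L, y)
print(y_mean, y_cov, post_mean, post_cov)
```

In the scalar case the posterior mean is a precision-weighted average of the prior mean and the observation, which is why it lands between 0 and 1.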
4. Exponential Family
For a full treatment of exponential family, see Jordan's class note
Distribution (exponential family) A family of pdfs or pmfs is called an exponential family if it can be expressed as
\[ f(x|\theta) = h(x) c(\theta) \exp\left(\sum_{i=1}^{k} w_i(\theta) t_i(x)\right) \]
Many distributions are exponential families: binomial, Poisson, normal, etc.
Binomial distribution
The binomial distribution (with \(n\) fixed) is an exponential family; it can be expressed as
\[ f(x|p) = \binom{n}{x} (1-p)^n \exp\left(x \log\frac{p}{1-p}\right) \]
It is sometimes convenient to reparameterize the exponential family as
\[ f(x|\eta) = h(x) \exp\left(\eta^T T(x) - A(\eta)\right) \]
where \(\eta = (\eta_1, \eta_2, ..., \eta_k)\) is called the natural parameter or canonical parameter, \(T(x) = (T_1(x), ..., T_k(x))\) is the vector of sufficient statistics, and \(A(\eta)\) is the cumulant function, which normalizes the distribution.
Gaussian distribution
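It can be written as
\[ f(x|\mu,\sigma^2) = \frac{1}{\sqrt{2\pi}} \exp\left(\frac{\mu}{\sigma^2}x - \frac{1}{2\sigma^2}x^2 - \frac{\mu^2}{2\sigma^2} - \log\sigma\right) \]
where the natural parameter is \((\eta_1, \eta_2)= (\mu/\sigma^2, -1/(2\sigma^2))\) and the sufficient statistics are \((x, x^2)\).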
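The mapping between \((\mu, \sigma^2)\) and the natural parameters can be sketched as below. It assumes \(h(x) = 1/\sqrt{2\pi}\) and the cumulant function \(A(\eta) = -\eta_1^2/(4\eta_2) - \frac{1}{2}\log(-2\eta_2)\), which equals \(\mu^2/(2\sigma^2) + \log\sigma\). All function names are hypothetical.

```python
import math

def to_natural(mu, sigma2):
    """(mu, sigma^2) -> (eta1, eta2) = (mu / sigma^2, -1 / (2 sigma^2))."""
    return mu / sigma2, -1.0 / (2.0 * sigma2)

def to_moment(eta1, eta2):
    """Inverse map: (eta1, eta2) -> (mu, sigma^2)."""
    sigma2 = -1.0 / (2.0 * eta2)
    return eta1 * sigma2, sigma2

def log_partition(eta1, eta2):
    """Cumulant function A(eta) for the Gaussian with h(x) = 1 / sqrt(2 pi)."""
    return -eta1 ** 2 / (4.0 * eta2) - 0.5 * math.log(-2.0 * eta2)

def density_natural(x, eta1, eta2):
    """h(x) exp(eta^T T(x) - A(eta)) with T(x) = (x, x^2)."""
    return math.exp(eta1 * x + eta2 * x * x - log_partition(eta1, eta2)) / math.sqrt(2.0 * math.pi)

mu, sigma2, x = 1.0, 4.0, 0.5            # illustrative values
eta1, eta2 = to_natural(mu, sigma2)
direct = math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)
print(density_natural(x, eta1, eta2), direct)
```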
sample of exponential family
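The exponential family is preserved under iid sampling. Write the family as \(p(x|\eta) = h(x) g(\eta) \exp(\eta^T u(x))\), so that \(u(x) = T(x)\) and \(g(\eta) = \exp(-A(\eta))\), and consider a sample \(X = \{ x_1, ..., x_N \}\). The likelihood function is given by
\[ p(X|\eta) = \left(\prod_{n=1}^{N} h(x_n)\right) g(\eta)^N \exp\left(\eta^T \sum_{n=1}^{N} u(x_n)\right) \]
The sufficient statistics here are:
\[ \sum_{n=1}^{N} u(x_n) \]
Taking the log and setting the gradient with respect to \(\eta\) to 0, we obtain
\[ -\nabla \ln g(\eta_{ML}) = \frac{1}{N} \sum_{n=1}^{N} u(x_n) \]
Notice that the solution depends on the data only through the sufficient statistic \(\sum_n u(x_n)\).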
The set of \(\eta\) for which the integral \(\int h(x) \exp(\eta^T T(x)) dx\) is finite is referred to as the natural parameter space.
Definition (regular exponential family) A regular exponential family is one whose natural parameter space is a nonempty open set.
Definition (curved exponential family) A curved exponential family is a family of densities in which the dimension of the parameter vector \(\theta\) is \(d < k\). If \(d = k\), it is called a full exponential family. The curved case arises when the parameters are restricted to a lower-dimensional subset of the natural parameter space.
The exponential family has several appealing statistical and computational properties:
Lemma (convexity) The natural parameter space \(\mathcal{N}\) is a convex set, and the cumulant function \(A(\eta)\) is a convex function. If the family is minimal, then \(A(\eta)\) is strictly convex.
Lemma (moments computation) Moments of the sufficient statistics can be computed easily by replacing integration with differentiation:
\[ \nabla A(\eta) = E[T(x)], \quad \nabla^2 A(\eta) = \mathrm{Cov}[T(x)] \]
The latter reveals that \(A\) is a convex function.
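A numerical illustration for the Bernoulli family, where \(\eta = \log\frac{p}{1-p}\), \(T(x) = x\) and \(A(\eta) = \log(1 + e^\eta)\): the first and second derivatives of \(A\) should reproduce the mean \(p\) and the variance \(p(1-p)\). The finite-difference helpers and step sizes are illustrative choices, not part of the original notes.

```python
import math

def A(eta):
    """Cumulant function of the Bernoulli family in natural form."""
    return math.log1p(math.exp(eta))

def derivative(f, x, h=1e-5):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def second_derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

p = 0.3                                  # illustrative success probability
eta = math.log(p / (1.0 - p))            # natural parameter

mean = derivative(A, eta)                # approximates E[T(x)] = p
var = second_derivative(A, eta)          # approximates Var[T(x)] = p (1 - p)
print(mean, var)
```

The second derivative being a variance, hence nonnegative, is the numerical counterpart of the convexity of \(A\).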
loglikelihood of exponential family is concave
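Consider the loglikelihood function of an iid sample \(x_1, ..., x_N\)
\[ \ell(\eta) = \sum_{n=1}^{N} \log h(x_n) + \eta^T \sum_{n=1}^{N} T(x_n) - N A(\eta) \]
It is concave in \(\eta\): the first term is constant, the second is linear, and \(-A(\eta)\) is concave because \(A\) is convex.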
5. Sub-Gaussian Family
Sub-gaussian
6. Reference
- [1] Casella, George, and Roger L. Berger. Statistical inference. Vol. 2. Pacific Grove, CA: Duxbury, 2002.
- [2] Bishop, Christopher M. Pattern recognition and machine learning. Springer, 2006.