Probabilistic Machine Learning: CP2 - Probability: Univariate Models
May 10, 2022
This chapter discusses how to describe continuous and discrete random variables; for a random variable, we consider its mean ($\mathbb{E}$) and variance ($\mathbb{V}$).

"Bayes' theorem is to the theory of probability what Pythagoras's theorem is to geometry." -- Sir Harold Jeffreys, 1973

2.1.3.2 Probability of a conjunction of two events

We denote the joint probability of events A and B both happening as follows:

$$Pr(A \wedge B) = Pr(A, B)$$

If A and B are independent events, we have:

$$Pr(A, B) = Pr(A)Pr(B)$$

2.1.3.3 Probability of a union of two events

The probability of event A or B happening is given by:

$$Pr(A \vee B) = Pr(A) + Pr(B) - Pr(A \wedge B)$$

If the events are mutually exclusive (so they cannot happen at the same time), we get:

$$Pr(A \vee B) = Pr(A) + Pr(B)$$

2.1.3.4 Conditional probability of one event given another

We define the conditional probability of event B happening given that A has occurred as follows:

$$Pr(B|A) \triangleq \dfrac{Pr(A, B)}{Pr(A)}$$

2.1.3.5 Independence of events

We say that event A is independent of event B if:

$$Pr(A, B) = Pr(A)Pr(B)$$

2.1.3.6 Conditional independence of events

We say that events A and B are conditionally independent given event C if:

$$Pr(A, B|C) = Pr(A|C)Pr(B|C)$$

This is written as $A \bot B \mid C$. Events are often dependent on each other, but may be rendered independent if we condition on the relevant intermediate variables.

2.2 Random variables

If the value of $X$ is unknown and/or could change, we call it a random variable or rv. The set of possible values, denoted $\mathcal{X}$, is known as the sample space or state space.

2.2.1 Discrete random variables (pmf)

If the sample space $\mathcal{X}$ is finite or countably infinite, then $X$ is called a discrete random variable. In this case, we denote the probability of the event that $X$ has value $x$ by $Pr(X=x)$. We define the probability mass function or pmf as a function which computes the probability of events which correspond to setting the rv to each possible value:

$$p(x) \triangleq Pr(X=x)$$

The pmf satisfies $0 \leq p(x) \leq 1$ and $\sum_{x \in \mathcal{X}} p(x) = 1$.
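
As a minimal sketch of these conditions (assuming Python with NumPy; the uniform pmf over $\{1, 2, 3, 4\}$ is just a hypothetical example), we can check that a pmf is bounded by 0 and 1 and sums to one:

```python
import numpy as np

# Hypothetical example: uniform pmf on the state space {1, 2, 3, 4}.
X = np.array([1, 2, 3, 4])          # sample space
p = np.full(4, 0.25)                # p(x) = Pr(X = x) = 1/4 for each x

assert np.all((p >= 0) & (p <= 1))  # 0 <= p(x) <= 1
assert np.isclose(p.sum(), 1.0)     # sum_x p(x) = 1

print(dict(zip(X.tolist(), p.tolist())))
```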

2.2.2 Continuous random variables

If $X \in \mathbb{R}$ is a real-valued quantity, it is called a continuous random variable.

2.2.2.1 Cumulative distribution function (cdf)

Define the events $A = (X \leq a)$, $B = (X \leq b)$ and $C = (a < X \leq b)$, where $a < b$. We have that $B = A \vee C$, and since $A$ and $C$ are mutually exclusive, the sum rule gives:

$$Pr(B) = Pr(A) + Pr(C)$$

and hence the probability of being in interval CC is given by

$$Pr(C) = Pr(B) - Pr(A)$$

In general, we define the cumulative distribution function or cdf of the rv $X$ as follows:

$$P(x) \triangleq Pr(X \leq x)$$

Using this, we can compute the probability of being in any interval as follows:

$$Pr(a < X \leq b) = P(b) - P(a)$$

Here is an example of the cdf of the standard normal distribution $\mathcal{N}(x|0, 1)$ (picture 1).

2.2.2.2 Probability density function (pdf)

We define the probability density function or pdf as the derivative of the cdf:

$$p(x) \triangleq \dfrac{d}{dx}P(x)$$

Here is an example of the pdf of the standard normal distribution $\mathcal{N}(x|0, 1)$ (picture 2).

Given a pdf, we can compute the probability of a continuous variable being in a finite interval as follows:

$$Pr(a < X \leq b) = \int_{a}^{b} p(x)\,dx = P(b) - P(a)$$

As the size of the interval gets smaller, we can write

$$Pr(x \leq X \leq x + dx) \approx p(x)\,dx$$
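
As a rough sketch (assuming Python with SciPy; the interval endpoints $a = -1$ and $b = 1$ are arbitrary choices), the cdf difference and a numerical integral of the pdf give the same interval probability for $\mathcal{N}(x|0, 1)$:

```python
import numpy as np
from scipy.stats import norm

a, b = -1.0, 1.0                    # arbitrary interval (a, b]

# Pr(a < X <= b) via the cdf: P(b) - P(a)
p_from_cdf = norm.cdf(b) - norm.cdf(a)

# The same probability by numerically integrating the pdf over [a, b]
xs = np.linspace(a, b, 10_001)
p_from_pdf = np.trapz(norm.pdf(xs), xs)

print(p_from_cdf, p_from_pdf)       # both approximately 0.6827
```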

2.2.2.3 Quantiles (percent point function, ppf)

If the cdf $P$ is strictly monotonically increasing, it has an inverse, called the inverse cdf, or percent point function (ppf), or quantile function. For example, let $\Phi$ be the cdf of the Gaussian distribution $\mathcal{N}(x|0, 1)$, and $\Phi^{-1}$ be the inverse cdf. Then the interval $(\Phi^{-1}(\alpha/2), \Phi^{-1}(1 - \alpha/2))$ contains $1 - \alpha$ of the probability mass; for $\alpha = 0.05$, this gives the central 95% interval $(-1.96, 1.96)$ (picture 4).

If the distribution is $\mathcal{N}(\mu, \sigma^2)$, then the 95% interval becomes $(\mu - 1.96\sigma, \mu + 1.96\sigma)$. This is often approximated by writing $\mu \pm 2\sigma$.
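
A short sketch of the quantile function (assuming SciPy's `norm.ppf`; the parameters $\mu = 5$, $\sigma = 2$ are hypothetical), which recovers the 1.96 value used above:

```python
from scipy.stats import norm

alpha = 0.05
print(norm.ppf(alpha / 2), norm.ppf(1 - alpha / 2))   # ~ -1.96 and 1.96

# For N(mu, sigma^2), the central 95% interval is roughly mu +/- 1.96 sigma
mu, sigma = 5.0, 2.0                                  # hypothetical parameters
print(norm.ppf([0.025, 0.975], loc=mu, scale=sigma))  # ~ [1.08, 8.92]
```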

2.2.3 Sets of related random variables

Suppose we have two random variables, $X$ and $Y$. We can define their joint distribution using $p(x, y) = p(X=x, Y=y)$ for all possible values of $X$ and $Y$. If both variables have finite cardinality, we can represent the joint distribution as a 2d table:

$p(X,Y)$   $Y=0$   $Y=1$
$X=0$      0.2     0.3
$X=1$      0.3     0.2

Given a joint distribution, we define the marginal distribution of an rv as follows:

$$p(X=x) = \sum_y p(X=x, Y=y)$$

This is also called the sum rule.

We define the conditional distribution of an rv using

$$p(Y=y|X=x) = \dfrac{p(X=x, Y=y)}{p(X=x)}$$

We can rearrange this equation to get

$$p(x, y) = p(x)p(y|x)$$

This is called the product rule.
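
To make the sum and product rules concrete, here is a small NumPy sketch using the 2d table above (rows index $X$, columns index $Y$):

```python
import numpy as np

# Joint distribution p(X, Y) from the table above.
joint = np.array([[0.2, 0.3],
                  [0.3, 0.2]])

# Sum rule: p(X=x) = sum_y p(X=x, Y=y)
p_x = joint.sum(axis=1)                  # [0.5, 0.5]

# Conditional distribution: p(Y=y | X=x) = p(X=x, Y=y) / p(X=x)
p_y_given_x = joint / p_x[:, None]       # [[0.4, 0.6], [0.6, 0.4]]

# Product rule: p(x, y) = p(x) p(y|x) recovers the joint
assert np.allclose(p_x[:, None] * p_y_given_x, joint)

print(p_x)
print(p_y_given_x)
```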

By extending the product rule to $D$ variables, we get the chain rule of probability:

$$p(x_{1:D}) = p(x_1)p(x_2|x_1)p(x_3|x_1, x_2)p(x_4|x_1, x_2, x_3) \ldots p(x_D|x_{1:D-1})$$

2.2.4 Independence and conditional independence

We say $X$ and $Y$ are unconditionally independent or marginally independent, denoted $X \bot Y$, if we can represent the joint as the product of the two marginals, i.e.,

$$X \bot Y \Longleftrightarrow p(X, Y) = p(X)p(Y)$$

In general, we say a set of variables $X_1, \ldots, X_n$ is independent if the joint can be written as a product of marginals, i.e.,

$$p(X_1, \ldots, X_n) = \prod^n_{i=1} p(X_i)$$

Unconditional independence is rare, because most variables can influence most other variables. We therefore say $X$ and $Y$ are conditionally independent (CI) given $Z$ iff the conditional joint can be written as a product of conditional marginals:

$$X \bot Y \mid Z \Longleftrightarrow p(X, Y|Z) = p(X|Z)p(Y|Z)$$

2.2.5 Moments of a distribution

2.2.5.1 Mean of a distribution

For continuous rv's, the mean (or expected value), often denoted by $\mu$, is defined as follows:

$$\mathbb{E}[X] \triangleq \int_{\mathcal{X}} x\,p(x)\,dx$$

For discrete rv's, the mean is defined as:

$$\mathbb{E}[X] \triangleq \sum_{x \in \mathcal{X}} x\,p(x)$$
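
As a quick sketch of the discrete case (using a hypothetical fair six-sided die as the rv):

```python
import numpy as np

x = np.arange(1, 7)       # possible values of X: 1, ..., 6
p = np.full(6, 1 / 6)     # uniform pmf of a fair die

print(np.sum(x * p))      # E[X] = sum_x x p(x) = 3.5
```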

Since the mean is a linear operator, we have

$$\mathbb{E}[aX + b] = a\mathbb{E}[X] + b$$

This is called linearity of expectation.

For a set of nn rv's, one can show that the expectation of their sum is as follows:

$$\mathbb{E}\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} \mathbb{E}[X_i]$$

If they are independent, the expectation of their product is given by

$$\mathbb{E}\left[\prod_{i=1}^{n} X_i\right] = \prod_{i=1}^{n} \mathbb{E}[X_i]$$
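
A Monte Carlo sanity check of these identities (a sketch only; the Gaussian distributions and the constants $a$, $b$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

x = rng.normal(loc=2.0, scale=1.0, size=n)   # X ~ N(2, 1)
y = rng.normal(loc=-1.0, scale=3.0, size=n)  # Y ~ N(-1, 9), drawn independently of X
a, b = 3.0, 4.0

print(np.mean(a * x + b), a * 2.0 + b)       # E[aX + b] = a E[X] + b   -> ~10
print(np.mean(x + y), 2.0 + (-1.0))          # E[X + Y]  = E[X] + E[Y]  -> ~1
print(np.mean(x * y), 2.0 * (-1.0))          # E[XY] = E[X] E[Y] by independence -> ~-2
```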

2.2.5.2 Variance of a distribution

The variance is a measure of the "spread" of a distribution, often denoted by $\sigma^2$.

$$\begin{split}
\mathbb{V}[X] &\triangleq \mathbb{E}[(X-\mu)^2] = \int (x-\mu)^2 p(x)\,dx \\
&= \int x^2 p(x)\,dx + \mu^2 \int p(x)\,dx - 2\mu \int x\,p(x)\,dx \\
&= \mathbb{E}[X^2] + \mu^2 - 2\mu^2 \\
&= \mathbb{E}[X^2] - \mu^2 = \sigma^2
\end{split}$$

from which we derive the useful result:

$$\mathbb{E}[X^2] = \mu^2 + \sigma^2$$

The standard deviation is defined as:

$$\text{std}[X] \triangleq \sqrt{\mathbb{V}[X]} = \sigma$$

This is useful since it has the same units as XX itself.

The variance of a shifted and scaled version of a random variable is given by:

$$\mathbb{V}[aX + b] = a^2\,\mathbb{V}[X]$$

The variance of the sum of a set of $n$ independent random variables is $\mathbb{V}\left[\sum^n_{i=1} X_i\right] = \sum^n_{i=1} \mathbb{V}[X_i]$.

The variance of the product of a set of $n$ independent random variables is $\mathbb{V}\left[\prod^n_{i=1} X_i\right] = \prod_i(\sigma^2_i + \mu^2_i) - \prod_i \mu^2_i$.
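
A similar Monte Carlo sketch for the variance identities (again with arbitrary Gaussians; independence holds because the samples are drawn separately):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

x = rng.normal(2.0, 1.0, size=n)   # X ~ N(mu=2, sigma^2=1)
y = rng.normal(-1.0, 3.0, size=n)  # Y ~ N(mu=-1, sigma^2=9), independent of X
a, b = 3.0, 4.0

print(np.var(a * x + b), a**2 * 1.0)      # V[aX + b] = a^2 V[X]     -> ~9
print(np.var(x + y), 1.0 + 9.0)           # V[X + Y]  = V[X] + V[Y]  -> ~10

# V[XY] = (sigma_x^2 + mu_x^2)(sigma_y^2 + mu_y^2) - mu_x^2 mu_y^2
print(np.var(x * y), (1.0 + 4.0) * (9.0 + 1.0) - 4.0 * 1.0)   # -> ~46
```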

2.2.5.3 Mode of a distribution

The mode of a distribution is the value with the highest probability mass or probability density:

$$x^* = \argmax_x p(x)$$
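
For a discrete rv this is just the argmax over the pmf; a one-line sketch with hypothetical numbers:

```python
import numpy as np

x = np.array([0, 1, 2, 3])            # states
p = np.array([0.1, 0.5, 0.3, 0.1])    # hypothetical pmf
print(x[np.argmax(p)])                # mode = 1
```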

2.2.5.4 Conditional moments

When we have two or more dependent random variables, we can compute the moments of one given knowledge of the other. For example, the law of iterated expectations, also called the law of total expectation, tells us that:

$$\mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X|Y]]$$

Similarly, for the variance, the law of total variance, also called the conditional variance formula, tells us that:

$$\mathbb{V}[X] = \mathbb{E}[\mathbb{V}[X|Y]] + \mathbb{V}[\mathbb{E}[X|Y]]$$
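
A Monte Carlo sketch of both laws, using a hypothetical two-component setup in which $Y \in \{0, 1\}$ selects which Gaussian $X$ is drawn from:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

# Hypothetical hierarchy: Y ~ Bernoulli(0.3); X|Y=0 ~ N(0, 1), X|Y=1 ~ N(5, 2^2)
y = rng.random(n) < 0.3
x = np.where(y, rng.normal(5.0, 2.0, n), rng.normal(0.0, 1.0, n))

# Law of total expectation: E[X] = E[E[X|Y]] = 0.7*0 + 0.3*5 = 1.5
print(np.mean(x), 0.7 * 0.0 + 0.3 * 5.0)

# Law of total variance: V[X] = E[V[X|Y]] + V[E[X|Y]]
e_of_cond_var = 0.7 * 1.0 + 0.3 * 4.0              # E[V[X|Y]] = 1.9
var_of_cond_mean = 0.3 * 5.0**2 - 1.5**2           # V[E[X|Y]] = 5.25
print(np.var(x), e_of_cond_var + var_of_cond_mean) # both ~ 7.15
```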

2.3 Bayes' rule

p(H=hY=y)=p(H=h)p(Y=yH=h)p(Y=y)p(hy)p(y)=p(h)p(yh)=p(h,y)p(H=h|Y=y)=\dfrac{p(H=h)p(Y=y|H=h)}{p(Y=y)} \\ p(h|y)p(y)=p(h)p(y|h)=p(h,y)